Before we start, we need to launch our network data server.

Go over to a terminal window and attach to the `data_server` container with:
`docker attach data_server`.

You'll end up at a terminal prompt.

To start the server, all you need to do is type: `python3.6 server.py`

The server should report its address and port number; you can now return to this notebook and start working through the cells.
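If a later cell fails to connect, it can help to confirm that the server is reachable before starting the stream. The following is a minimal sketch using only the standard library. `server_is_reachable` is a hypothetical helper, not part of the original notebook, and the host and port you pass should be whatever the server reported.

```python
import socket


def server_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False
```

Call it as `server_is_reachable(host, port)` with the reported address. If the server only accepts a single client, this probe may use up that connection; that depends on how `server.py` is written, which is not shown here.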

First things first: our imports.

We start with our `SparkContext` and `SparkSession`, then import the data types we need for our network flow data.

Finally, we import `StreamingContext`, which gives us the stream that we turn into data frames batch by batch. The server translates the CSV to JSON before sending it, so we also need the `json` module to parse each record.

In [None]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType
from pyspark.streaming import StreamingContext

import json

We create a few utility functions with some minor error checking here. `check_json` confirms that each line parses as JSON and pulls out the fields we ask for; `convert_json2df` turns each batch into a data frame.

The goal is to pass through only the `DstDevice` value where `DstPort == 80`. It's a crude but effective way to get us to a data structure that is easy to test our hypothesis against. Note that `check_json` as written here returns every requested field and does not apply that filter.

Note that in the `convert_json2df` function, we account for an empty stream by building a placeholder data frame from a filler row.

Finally, the `df.show()` call prints each batch, so you can see the data come in as it is processed. Comment it out if the output gets too noisy.

In [None]:
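def check_json(js, col):
    """Parse one line of JSON and return the values of the fields named in col."""
    try:
        data = json.loads(js)
        return [data.get(i) for i in col]
    except Exception as e:
        # Malformed input: report the reason and return an empty record
        print(f"returning an empty json: {e}")
        return []

def convert_json2df(rdd, col, null):
    """Convert one batch (an RDD of rows) to a data frame and display it."""
    ss = SparkSession(rdd.context)
    if rdd.isEmpty():
        # Empty batch: build the data frame from the filler row instead
        df = ss.createDataFrame(null, schema=col)
    else:
        df = ss.createDataFrame(rdd, schema=col)
    df.show()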
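The port-80 filter described above can be sketched in plain Python. This is a minimal illustration, not the notebook's own implementation: `dst_device_if_port_80` is a hypothetical helper, the record layout is assumed from the field names used in this notebook, and the sample record is made up.

```python
import json


def dst_device_if_port_80(js):
    """Return DstDevice when DstPort is 80, otherwise None.

    Hypothetical helper; the record layout is assumed from the field
    names used elsewhere in this notebook.
    """
    try:
        data = json.loads(js)
        # Compare as strings so both 80 and "80" match
        if str(data.get("DstPort")) == "80":
            return data.get("DstDevice")
    except (ValueError, AttributeError, TypeError):
        # Not valid JSON, or not a JSON object
        pass
    return None


# Made-up sample record for illustration
sample = '{"SrcDevice": "Comp1", "DstDevice": "Comp2", "DstPort": "80"}'
print(dst_device_if_port_80(sample))  # prints Comp2
```

Comparing ports as strings is a design choice: it sidesteps the question of whether the server emits the port as a number or as text.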

Now we define our fields. Note that in this streaming context we can't rely on the parser to infer the structure and header; we need to declare them ourselves.

`cols` lists the JSON keys that `check_json` pulls out of each record, in order.
`colStruct` defines the name and type of each column in the data frame.

In [None]:
cols = ['Time', 'Duration', 'SrcDevice', 'DstDevice', 'Protocol', 'SrcPort', 
        'DstPort', 'SrcPackets', 'DstPackets', 'SrcBytes', 'DstBytes']

nullFiller = [(0, 0, '', '', 0, '', '', 0, 0, 0, 0)]

colStruct = StructType([
    StructField('time', LongType(), True),
    StructField('duration', LongType(), True),
    StructField('srcdevice', StringType(), True),
    StructField('dstdevice', StringType(), True),
    StructField('protocol', LongType(), True),
    StructField('srcport', StringType(), True),
    StructField('dstport', StringType(), True),
    StructField('srcpackets', LongType(), True),
    StructField('dstpackets', LongType(), True),
    StructField('srcbytes', LongType(), True),
    StructField('dstbytes', LongType(), True)
])
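Because the schema is declared by hand, a mismatch between a record and the schema only surfaces when Spark tries to build the data frame. Here is a minimal pure-Python sketch of a pre-check, assuming the Python types below mirror `colStruct` (`LongType` as `int`, `StringType` as `str`). `EXPECTED_TYPES` and `row_matches_schema` are hypothetical names, not part of the notebook.

```python
# Python types assumed to correspond to colStruct, in the same field order
EXPECTED_TYPES = (int, int, str, str, int, str, str, int, int, int, int)


def row_matches_schema(row, expected=EXPECTED_TYPES):
    """Return True if row has one value per field, each None or of the expected type."""
    if len(row) != len(expected):
        return False
    # None is allowed because every field in the schema is nullable
    return all(v is None or isinstance(v, t) for v, t in zip(row, expected))


filler = (0, 0, '', '', 0, '', '', 0, 0, 0, 0)  # same values as the row in nullFiller
print(row_matches_schema(filler))  # prints True
print(row_matches_schema([]))      # prints False: wrong number of fields
```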

Now we kick off our streaming. We open a socket text stream that connects to our server. At this point you should begin seeing the server send data. We process our JSON, and then convert it to a data frame.

Finally we run our query on the data, and update it as we go.

In [None]:
# Create a local StreamingContext with two worker threads and a batch interval of 5 seconds
sc = SparkContext("local[2]", "NetworkApp")
ssc = StreamingContext(sc, 5)

# Parse each incoming line into a list of field values
lines = ssc.socketTextStream("172.18.0.2", 9009) \
        .map(lambda x: check_json(x, cols))

# foreachRDD returns None, so call it on its own rather than assigning its result
lines.foreachRDD(lambda x: convert_json2df(x, colStruct, nullFiller))
#webservers = dstDevices.groupby('dstdevice').count().sort(desc('count'))
#webservers.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate

Find web servers: They're destination devices that have port 80. Which ones are the most active?
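One way to approach this, sketched in plain Python rather than Spark: filter each batch to records whose destination port is 80, then keep a running count per destination device. The record layout (dicts keyed by the JSON field names) and the sample values are assumptions for illustration, and `update_web_server_counts` is a hypothetical helper.

```python
from collections import Counter


def update_web_server_counts(counts, batch):
    """Add one batch of records to the running per-device tally.

    Each record is assumed to be a dict with 'DstDevice' and 'DstPort' keys.
    """
    for record in batch:
        # Compare as strings so both 80 and "80" match
        if str(record.get("DstPort")) == "80":
            counts[record.get("DstDevice")] += 1
    return counts


counts = Counter()
# Two made-up batches standing in for successive 5-second intervals
batch1 = [
    {"DstDevice": "Comp2", "DstPort": "80"},
    {"DstDevice": "Comp3", "DstPort": "443"},
    {"DstDevice": "Comp2", "DstPort": "80"},
]
batch2 = [
    {"DstDevice": "Comp4", "DstPort": "80"},
    {"DstDevice": "Comp2", "DstPort": "80"},
]
for batch in (batch1, batch2):
    update_web_server_counts(counts, batch)

# most_common() lists devices from most to least active
print(counts.most_common())  # prints [('Comp2', 3), ('Comp4', 1)]
```

In Spark itself, the analogous step on a per-batch data frame would be along the lines of `df.filter(df.dstport == '80').groupBy('dstdevice').count()`, though that counts within a single batch rather than across the whole stream.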