Getting Started With Structured Streaming
---------------------------------------

Now that we have worked through some examples with batch data, it's time to turn our attention to Spark's Structured Streaming.

Before you start, it's a good idea to read through the following articles:
* [Structured Streaming in Apache Spark](https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html)
* [Spark Streaming Programming Guide](https://spark.apache.org/docs/latest/streaming-programming-guide.html)

Both give you a sense of where streaming offers an advantage, as well as some of the challenges we'll face.

Make no mistake: streaming can be hard to get right, and we'll warn you up front that this walkthrough can be challenging. But it's worth it!

Before we dive into the code, let's step back and take the 10,000-foot view of our process. Keep this notebook open while you run through the streaming walkthrough and exercise - it should help you keep perspective on all the new things you'll learn here.

Part 1: The Process
------------------

Our end goal today is to stream data across a socket, and process that data in real time with Spark. There are a ton of streaming data sources out there (think Twitter streaming, Citibike data, NYC MTA train arrivals, and so on). Those are all excellent data sources (and you may even want to try your hand at processing one of those datasets for your exercise, or even your final project).

However, all of those streaming datasets present one major challenge: they are all live, so the data keeps changing. How can we **really** know that we've processed our data correctly?

To get around this problem, we're going to build our own mini-environment, and stream a known dataset through our own network. If this sounds daunting, don't worry - we've got you. You should use the provided code, data, and instructions to run our walkthrough.
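To make "stream a known dataset through our own network" a little more concrete, here is a minimal sketch in plain Python (not Spark, and not the provided walkthrough code). It assumes a sender and a receiver on the same machine; the names `serve_lines`, `receive_lines`, and `run_demo`, and the sample lines, are all made up for illustration. In the walkthrough itself, the two ends live in separate Docker containers and Spark plays the receiver.

```python
import socket
import threading

# Made-up sample lines standing in for our known dataset.
SAMPLE_LINES = ["line-1", "line-2", "line-3"]


def serve_lines(server_sock, lines):
    """Accept one client, send each line newline-terminated, then close."""
    conn, _ = server_sock.accept()
    with conn:
        for line in lines:
            conn.sendall((line + "\n").encode("utf-8"))


def receive_lines(host, port):
    """Connect to the sender and read lines until it closes the connection."""
    with socket.create_connection((host, port)) as sock:
        with sock.makefile("r", encoding="utf-8") as reader:
            return [line.rstrip("\n") for line in reader]


def run_demo(lines):
    """Send the lines over a localhost socket and return what was received."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_sock:
        # Port 0 asks the operating system for any free port.
        server_sock.bind(("127.0.0.1", 0))
        server_sock.listen(1)
        host, port = server_sock.getsockname()
        sender = threading.Thread(target=serve_lines, args=(server_sock, lines))
        sender.start()
        received = receive_lines(host, port)
        sender.join()
    return received


received = run_demo(SAMPLE_LINES)
# Because we know exactly what was sent, we can check what arrived.
print(received == SAMPLE_LINES)
```

The point of the sketch is the last line: with a known dataset on the sending side, we can always compare what we received against what we expected.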

Our Dataset: Network Data from Los Alamos National Lab
---------------------------------------------------

Los Alamos has posted a huge amount of network data online in CSV files for you to use. For the purposes of this walkthrough, the files are WAY too large. So we've sampled them way, way down to about 100,000 flows covering two days of data, and included that sample here. If you're interested, the full data is located [here](https://csr.lanl.gov/data/2017.html).

Each line in the file represents a "conversation" between two computers. The fields captured are:
* Time: the start time of the conversation (in a proprietary timestamp format)
* Duration: the length of the conversation (in seconds)
* SrcDevice: name of the device that initiated the conversation.
* DstDevice: name of the device that was requested.
* Protocol: network protocol used (TCP, UDP, etc)
* SrcPort: network port (0-65,535) on the originating device
* DstPort: network port (0-65,535) on the destination device
* SrcPackets: network packet count sent from the source to the destination.
* DstPackets: network packet count sent from the destination to the source.
* SrcBytes: byte count sent from the source to the destination.
* DstBytes: byte count sent from the destination to the source.
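As a rough illustration of the field list above, the sketch below parses one line into a typed record. It assumes the line is comma-separated with the fields in the order listed; the sample line is invented, not taken from the dataset, and `Flow` and `parse_flow` are hypothetical names. `Time` is kept as a string because we don't interpret the timestamp format here, and the ports are kept as strings so the sketch doesn't assume every port value is numeric.

```python
import csv
from dataclasses import dataclass

FIELD_NAMES = [
    "Time", "Duration", "SrcDevice", "DstDevice", "Protocol", "SrcPort",
    "DstPort", "SrcPackets", "DstPackets", "SrcBytes", "DstBytes",
]


@dataclass
class Flow:
    time: str          # left as-is; the timestamp format is not decoded here
    duration: int      # seconds
    src_device: str
    dst_device: str
    protocol: str
    src_port: str      # kept as text (assumption: may not always be numeric)
    dst_port: str
    src_packets: int
    dst_packets: int
    src_bytes: int
    dst_bytes: int


def parse_flow(line):
    """Parse one comma-separated line into a Flow (assumed column order)."""
    row = next(csv.reader([line]))
    if len(row) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(row)}")
    record = dict(zip(FIELD_NAMES, row))
    return Flow(
        time=record["Time"],
        duration=int(record["Duration"]),
        src_device=record["SrcDevice"],
        dst_device=record["DstDevice"],
        protocol=record["Protocol"],
        src_port=record["SrcPort"],
        dst_port=record["DstPort"],
        src_packets=int(record["SrcPackets"]),
        dst_packets=int(record["DstPackets"]),
        src_bytes=int(record["SrcBytes"]),
        dst_bytes=int(record["DstBytes"]),
    )


# Invented example line, for illustration only.
flow = parse_flow("761,5,DeviceA,DeviceB,TCP,51234,80,10,8,1200,5400")
print(flow.dst_device, flow.dst_port)
```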

As you can imagine, there's a lot here. However, we're going to try to answer a relatively simple question:
**From this data, can we identify which devices are the web servers?**

We'll use some knowledge about network behavior to tackle this, and we'll simplify things even more. All we're going to do is rely on the fact that web servers typically listen on **port 80**. So, if a computer requests port 80 as the `DstPort` in a flow, it's likely that the destination device (`DstDevice`) is a web server. If we see that device name come up repeatedly as a destination, then there's a good chance that device is a web server.
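Here is a small sketch of that heuristic in plain Python, before we bring Spark into it. The rows are made up, and `count_port_80_destinations` is a hypothetical helper name; each row is assumed to be a dictionary keyed by the field names above, with the port stored as text.

```python
from collections import Counter

# Made-up rows for illustration; only the two fields we need are shown.
ROWS = [
    {"DstDevice": "Comp1", "DstPort": "80"},
    {"DstDevice": "Comp2", "DstPort": "443"},
    {"DstDevice": "Comp1", "DstPort": "80"},
    {"DstDevice": "Comp3", "DstPort": "80"},
    {"DstDevice": "Comp1", "DstPort": "80"},
    {"DstDevice": "Comp3", "DstPort": "22"},
]


def count_port_80_destinations(rows):
    """Count how often each device appears as the destination on port 80."""
    counts = Counter()
    for row in rows:
        if row["DstPort"] == "80":
            counts[row["DstDevice"]] += 1
    return counts


counts = count_port_80_destinations(ROWS)
# Devices with the most port-80 requests are our likeliest web servers.
for device, hits in counts.most_common():
    print(device, hits)
```

With these invented rows, `Comp1` is requested on port 80 three times and `Comp3` once, so `Comp1` is the strongest web server candidate.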

So for our streaming exercise, we need to build a count query that processes flows as they come in, updates the count of port-80 requests for each destination device, and then reports back to us what it sees.

How are we going to do this?

For our work today, we'll try to be very methodical. This is how we'll get there:
1. We'll start off by setting up a small network of two Docker containers. One will serve our datasets, the other will receive and process them. 
2. Before we start streaming data across our network, we'll take advantage of the fact that we have a static dataset. We'll go back to our knowledge of batch processing, and work out the query on our dataset to answer our web server question.
3. Now, finally, we'll start streaming data across the network. We'll process the data in real time, and update our count of web servers. By the time the data finishes processing, our results should be identical to those from our batch processing.
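
The idea behind steps 2 and 3 can be sketched in plain Python (this is a stand-in, not Spark's API): run the count once over the whole dataset, then run it again over small chunks while keeping a running total, and check that the two agree at the end. The flows are invented, and `count_web_requests`, `micro_batches`, and `streaming_counts` are hypothetical names.

```python
from collections import Counter
from itertools import islice

# Made-up (DstDevice, DstPort) pairs; not taken from the real dataset.
FLOWS = [
    ("Comp1", "80"), ("Comp2", "443"), ("Comp1", "80"),
    ("Comp3", "80"), ("Comp1", "80"), ("Comp3", "22"), ("Comp2", "80"),
]


def count_web_requests(flows):
    """Batch query: count port-80 requests per destination device."""
    return Counter(dst for dst, port in flows if port == "80")


def micro_batches(flows, size):
    """Yield the flows in consecutive chunks of at most `size` items."""
    iterator = iter(flows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def streaming_counts(flows, batch_size):
    """Apply the same query chunk by chunk, yielding the running totals."""
    running = Counter()
    for batch in micro_batches(flows, batch_size):
        running.update(count_web_requests(batch))
        yield Counter(running)  # snapshot of the totals so far


batch_result = count_web_requests(FLOWS)
snapshots = list(streaming_counts(FLOWS, batch_size=3))
# Once all the data has arrived, streaming and batch should agree.
print(snapshots[-1] == batch_result)
```

The intermediate snapshots differ from the batch answer, which is expected: only the final running total has seen all of the data.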


Of course, this is a significant generalization and simplification of what we can do with Structured Streaming in Spark. Our goal is to demonstrate the principle and let you see what you can do with it. In a production environment, there would likely be a processing layer between Spark and the incoming data. Typically this is handled by Kafka, which is great at ingesting multiple sources of data and sending them to Spark in a known format. This helps with processing streaming data, which is often very fragile.

OK - enough talk of preparation - on to networking Docker containers!