Testing our Web Server Hypothesis on Batch Data
-----------------------------------------------

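Before we dive into streaming our datasets, we first want to examine the data in batch mode to find which devices are likely web servers. Our working hypothesis is that the devices receiving the most traffic on ports 80 (HTTP) and 443 (HTTPS) are web servers. We'll take a similar approach to the one we used in the streaming data walkthrough earlier.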

In [1]:
# These imports set up our Python connection to the Spark server
# and let us load the data into a DataFrame.
from pyspark import SparkContext
from pyspark.sql import SparkSession

from pyspark.sql.functions import desc, col

In [2]:
CSV_PATH = "./lanl/day02_chunk.csv"
APP_NAME = "Web Server Hypothesis Test"
SPARK_URL = "local[*]"

In [3]:
spark = SparkSession.builder.appName(APP_NAME).master(SPARK_URL).getOrCreate()

df = spark.read.options(header='true', inferSchema='true').csv(CSV_PATH)

In [4]:
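# Keep only rows destined for port 80 or 443, then count rows per
# destination device and list the 10 busiest devices first.
df.where(col('DstPort').isin([80, 443])) \
    .groupby('DstDevice') \
    .count() \
    .sort(desc('count')) \
    .show(10)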

+-------------------+-------+
|          DstDevice|  count|
+-------------------+-------+
|         Comp186884|1107920|
|         Comp576843| 895874|
|EnterpriseAppServer| 433684|
|         Comp611862| 273393|
|         Comp184712| 246123|
|         Comp501516| 216761|
|         Comp393033| 169524|
|         Comp146745| 132400|
|         Comp916004|  84070|
|         Comp574103|  82949|
+-------------------+-------+
only showing top 10 rows
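
To make the logic of the query above concrete, here is a minimal plain-Python sketch of the same filter, group, count, and sort steps. It is an analogue for illustration, not PySpark code. The sample rows, the device names (`CompA`, `CompB`, `CompC`), and the helper `top_web_destinations` are all hypothetical and are not taken from the LANL data.

```python
from collections import Counter

# Hypothetical sample rows of (DstDevice, DstPort); not real LANL records.
rows = [
    ("CompA", 80),
    ("CompB", 443),
    ("CompA", 443),
    ("CompC", 22),
    ("CompA", 80),
    ("CompB", 80),
    ("CompC", 445),
]

# Ports we treat as evidence of a web server (HTTP and HTTPS).
WEB_PORTS = {80, 443}


def top_web_destinations(rows, n=10):
    """Return up to n (device, count) pairs for rows on web ports, busiest first."""
    # Filter to web ports and count rows per destination device in one pass.
    counts = Counter(device for device, port in rows if port in WEB_PORTS)
    # most_common sorts by count in descending order, like sort(desc('count')).
    return counts.most_common(n)


top = top_web_destinations(rows)
for device, count in top:
    print(f"{device:>10} | {count}")
```

Devices that never appear on a web port (here `CompC`) drop out entirely, just as they do in the Spark query, because the filter runs before the grouping.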



So there it is: the 10 devices with the most records destined for port 80 or 443, which makes them our likeliest web servers.
Let's move on to streaming.