### Anomaly Detection in Server Logs
#### CS 5614: Big Data Engineering
#### By: Vanessa Eichensehr and Bradley Freedman


**Project Objective:**  

Build a machine learning-based model that detects anomalies on a high volume high velocity log base. 

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.\
        builder.\
        appName("pyspark-kafka-streaming").\
        master("spark://spark-master:7077").\
        config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0"). \
        config("spark.executor.memory", "512m").\
        getOrCreate()

25/05/04 10:24:43 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
25/05/04 10:24:43 WARN StandaloneSchedulerBackend: Application ID is not initialized yet.
25/05/04 10:24:44 WARN StandaloneAppClient$ClientEndpoint: Drop UnregisterApplication(null) because has not yet connected to master


**Step 1:** 

Create a streaming DataFrame in Spark that reads data from a Kafka topic named "topic_test" and starts
processing from the beginning of the topic's log using the earliest available offset. 

Uses kafka:9093 as the bootstrap server.

In [None]:
df_streamed_raw = (spark.readStream.format("kafka").option("kafka.bootstrap.servers", "kafka:9093").option("subscribe", "topic_test").load())

In [3]:
from pyspark.sql.types import StringType
from pyspark.sql.functions import col

# convert byte stream to string
df_streamed_kv = (df_streamed_raw
    .withColumn("key", df_streamed_raw["key"].cast(StringType()))
    .withColumn("value", df_streamed_raw["value"].cast(StringType())))

test_query = (df_streamed_kv 
              .writeStream \
              .format("memory") # output to memory \
              .outputMode("update") # only write updated rows to the sink \
              .queryName("test_query_table")  # Name of the in memory table \
              .start())

25/04/22 00:17:01 WARN StreamingQueryManager: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-eec0e83f-155c-4662-a10b-b8683c6aba5d. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
                                                                                

If done properly, the following cell should display the values being produced from your producer:

In [4]:
spark.sql("select * from test_query_table").show()

+---+--------------------+----------+---------+-------+--------------------+-------------+
|key|               value|     topic|partition| offset|           timestamp|timestampType|
+---+--------------------+----------+---------+-------+--------------------+-------------+
|509|{"ip_address": "1...|topic_test|        0|2207562|2025-04-22 00:17:...|            0|
|510|{"ip_address": "1...|topic_test|        0|2207563|2025-04-22 00:17:...|            0|
|511|{"ip_address": "2...|topic_test|        0|2207564|2025-04-22 00:17:...|            0|
|512|{"ip_address": "3...|topic_test|        0|2207565|2025-04-22 00:17:...|            0|
|513|{"ip_address": "3...|topic_test|        0|2207566|2025-04-22 00:17:...|            0|
|514|{"ip_address": "3...|topic_test|        0|2207567|2025-04-22 00:17:...|            0|
|515|{"ip_address": "3...|topic_test|        0|2207568|2025-04-22 00:17:...|            0|
|516|{"ip_address": "1...|topic_test|        0|2207569|2025-04-22 00:17:...|            0|

In [5]:
test_query.stop()

#### The following cells contain code that take the streamed dataframe and formats it properly into a table. If any of the given cells fails, there might be a formatting issue with one of your previous solutions. 

In [None]:
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, BooleanType, LongType, IntegerType

event_schema = StructType([
    StructField("ip_address", StringType()),
    StructField("date_time", StringType()),
    StructField("request_type", StringType()),
    StructField("request_arg", StringType()),
    StructField("status_code", StringType()),
    StructField("response_size", StringType()),
    StructField("referrer", StringType()),
    StructField("user_agent", StringType())
])

# Parse the events from JSON format
df_parsed = (df_streamed_kv
           # Sets schema for event data
           .withColumn("value", from_json("value", event_schema))
          )

In [None]:
df_formatted = (df_parsed.select(
    col("key").alias("event_key")
    ,col("topic").alias("event_topic")
    ,col("timestamp").alias("event_timestamp")
    ,col("value.ip_address").alias("ip_address")
    ,col("value.date_time").alias("date_time")
    ,col("value.request_type").alias("request_type")
    ,col("value.request_arg").alias("request_arg")
    ,col("value.status_code").alias("status_code")
    ,col("value.response_size").cast(IntegerType()).alias("response_size")
    ,col("value.referrer").alias("referrer")
    ,col("value.user_agent").alias("user_agent")
))

In [None]:
# Write the parsed data to console
query = (df_formatted.writeStream.format("console").outputMode("append").trigger(processingTime='5 seconds').start())

25/04/22 00:17:45 WARN StreamingQueryManager: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-01f41fcd-a17c-480f-99a5-8c6ccf85f1d2. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.


-------------------------------------------
Batch: 0
-------------------------------------------
+---------+-----------+---------------+----------+---------+------------+-----------+-----------+-------------+--------+----------+
|event_key|event_topic|event_timestamp|ip_address|date_time|request_type|request_arg|status_code|response_size|referrer|user_agent|
+---------+-----------+---------------+----------+---------+------------+-----------+-----------+-------------+--------+----------+
+---------+-----------+---------------+----------+---------+------------+-----------+-----------+-------------+--------+----------+

-------------------------------------------
Batch: 1
-------------------------------------------
+---------+-----------+--------------------+--------------+--------------------+------------+--------------------+-----------+-------------+--------------------+--------------------+
|event_key|event_topic|     event_timestamp|    ip_address|           date_time|request_type| 

In [None]:
# Print the name of active streams (This may be useful during debugging)
for s in spark.streams.active:
    print(f"ID:{s.id} | NAME:{s.name}")

ID:ee86bca4-0408-42df-9f7e-ce9482a0d274 | NAME:None


In [None]:
query.stop()

**Best Guess at Next Steps**

Load labeled training data

In [None]:
df = spark.read.csv("archive/synthetic_with_anomalies.csv", header=True, inferSchema=True)

Pre-process and engineer features (might need to encode or use string indexer, depending on model)

Split data into test and train sets

Train the ML model(s) -- binary and multi-class?

save the pipeline and model 

there is a way to apply the ML model on a stream of data and output to console

might need to broadcast this data elsewhere so can show a UI..