# PySpark Project Step-by-Step: Part 2

This notebook walks you through two more steps in the ML lifecycle: **Feature Engineering** and **Model Fitting & Evaluation**.<br>
* In the feature engineering part, you'll see how to compute common rolling aggregates using window (analytical) functions.
* In the modelling part, you'll see how to prepare your data for modelling in PySpark and how to fit a model using MLlib.
* Finally, we'll see how to evaluate the model we've built.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import Window
import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

In [None]:
spark = (
    SparkSession.builder.appName("iot")
    .getOrCreate()
)
spark.sparkContext.setLogLevel("ERROR")

## Read Data

In [None]:
df = spark.read.parquet("processed.pq").withColumn(
    "is_bad", F.when(F.col("label") != "Benign", 1).otherwise(0)
)
df.show(5)

## Feature Engineering

Since this data has a time component, we can engineer all sorts of rolling features. The ones that I'll cover here, each computed over the last minute and over the last 30 minutes, are:
* Number of times we've seen this source IP
* Number of times we've seen this destination IP
* Number of times we've seen this source port
* Number of times we've seen this destination port
* Average `orig_pkts` and `orig_ip_bytes` per source IP

To calculate these features, we'll need to use window (analytical) functions. Each window is partitioned by the entity of interest, ordered by the timestamp in seconds, and covers the preceding N minutes up to one second before the current row.

In [None]:
def mins_to_secs(mins):
    return mins * 60


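def generate_window(window_in_minutes: int, partition_by: str, timestamp_col: str):
    # Range frame over the timestamp cast to whole seconds: from N minutes back
    # up to 1 second before the current row (the current second is excluded)
    window = (
        Window.partitionBy(F.col(partition_by))
        .orderBy(F.col(timestamp_col).cast("long"))
        .rangeBetween(-mins_to_secs(window_in_minutes), -1)
    )

    return window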


def generate_rolling_aggregate(
    col: str,
    partition_by: str | None = None,
    operation: str = "count",
    timestamp_col: str = "dt",
    window_in_minutes: int = 1,
):
    # By default, partition by the aggregated column itself
    # (e.g. count of source_ip per source_ip)
    if partition_by is None:
        partition_by = col

    # Build the window once, using partition_by (not col), so that
    # e.g. avg(orig_pkts) is computed per source_ip
    window = generate_window(
        window_in_minutes=window_in_minutes,
        partition_by=partition_by,
        timestamp_col=timestamp_col,
    )

    # match/case requires Python 3.10+
    match operation:
        case "count":
            return F.count(col).over(window)
        case "sum":
            return F.sum(col).over(window)
        case "avg":
            return F.avg(col).over(window)
        case _:
            raise ValueError(f"Operation {operation} is not defined")
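
To make the window semantics concrete, here is a minimal plain-Python sketch (not Spark code) of what a rolling count over `rangeBetween(-60, -1)` computes. The function name `rolling_count`, the integer timestamps, and the toy IP addresses are assumptions made up for illustration.

```python
from bisect import bisect_left
from collections import defaultdict


def rolling_count(events, window_secs=60):
    """For each (timestamp_secs, key) event, count the events with the same
    key whose timestamp lies in [t - window_secs, t - 1] (both inclusive)."""
    # Group and sort timestamps per key, mirroring partitionBy + orderBy
    by_key = defaultdict(list)
    for ts, key in events:
        by_key[key].append(ts)
    for ts_list in by_key.values():
        ts_list.sort()

    counts = []
    for ts, key in events:
        ts_list = by_key[key]
        lo = bisect_left(ts_list, ts - window_secs)  # first timestamp >= t - window
        hi = bisect_left(ts_list, ts)  # first timestamp >= t, so rows at t are excluded
        counts.append(hi - lo)
    return counts


# Hypothetical events: (seconds, source IP)
events = [
    (0, "10.0.0.1"),
    (30, "10.0.0.1"),
    (30, "10.0.0.1"),
    (59, "10.0.0.2"),
    (90, "10.0.0.1"),
]
counts = rolling_count(events)
print(counts)  # [0, 1, 1, 0, 2]
```

Note that the two events at second 30 do not count each other, because the frame ends one second before the current row, and that the first event for a key always gets a count of 0.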

### Generate Rolling Count and Average Features

With the helper functions defined above, generating rolling counts and averages is a piece of cake!

In [None]:
df = df.withColumns({
    "source_ip_count_last_min": generate_rolling_aggregate(col="source_ip", operation="count", timestamp_col="dt", window_in_minutes=1),
    "source_ip_count_last_30_mins": generate_rolling_aggregate(col="source_ip", operation="count", timestamp_col="dt", window_in_minutes=30),
    "source_port_count_last_min": generate_rolling_aggregate(col="source_port", operation="count", timestamp_col="dt", window_in_minutes=1),
    "source_port_count_last_30_mins": generate_rolling_aggregate(col="source_port", operation="count", timestamp_col="dt", window_in_minutes=30),
    "dest_ip_count_last_min": generate_rolling_aggregate(col="dest_ip", operation="count", timestamp_col="dt", window_in_minutes=1),
    "dest_ip_count_last_30_mins": generate_rolling_aggregate(col="dest_ip", operation="count", timestamp_col="dt", window_in_minutes=30),
    "dest_port_count_last_min": generate_rolling_aggregate(col="dest_port", operation="count", timestamp_col="dt", window_in_minutes=1),
    "dest_port_count_last_30_mins": generate_rolling_aggregate(col="dest_port", operation="count", timestamp_col="dt", window_in_minutes=30),
    "source_ip_avg_pkts_last_min": generate_rolling_aggregate(col="orig_pkts", partition_by="source_ip", operation="avg", timestamp_col="dt", window_in_minutes=1),
    "source_ip_avg_pkts_last_30_mins": generate_rolling_aggregate(col="orig_pkts", partition_by="source_ip", operation="avg", timestamp_col="dt", window_in_minutes=30),
    "source_ip_avg_bytes_last_min": generate_rolling_aggregate(col="orig_ip_bytes", partition_by="source_ip", operation="avg", timestamp_col="dt", window_in_minutes=1),
    "source_ip_avg_bytes_last_30_mins": generate_rolling_aggregate(col="orig_ip_bytes", partition_by="source_ip", operation="avg", timestamp_col="dt", window_in_minutes=30),
})

In [None]:
df.show(5)

Now, execute the transformations and save the resulting table to a new Parquet file.

In [None]:
df.write.mode("overwrite").parquet("feature_engineered.pq")

In [None]:
df_fe = spark.read.parquet("feature_engineered.pq")

Let's compare how long `df.show()` took above with the same call on the new `df_fe`...

In [None]:
df_fe.show(10)

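The difference is so drastic because Spark evaluates lazily: every call to `df.show()` re-executes all of the very expensive window operations we defined, whereas `df_fe` simply reads the precomputed results from Parquet. That's why it's better to write the features out once and construct a new dataframe from the saved file for the rest of the analysis.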

## Preprocessing

In [None]:
df_fe.columns[:5]

In [None]:
numerical_features = [
    "duration",
    "orig_bytes",
    "resp_bytes",
    "orig_pkts",
    "orig_ip_bytes",
    "resp_pkts",
    "resp_ip_bytes",
    "source_ip_count_last_min",
    "source_ip_count_last_30_mins",
    "source_port_count_last_min",
    "source_port_count_last_30_mins",
    # "dest_ip_count_last_min",
    # "dest_ip_count_last_30_mins",
    # "dest_port_count_last_min",
    # "dest_port_count_last_30_mins",
    "source_ip_avg_pkts_last_min",
    "source_ip_avg_pkts_last_30_mins",
    "source_ip_avg_bytes_last_min",
    "source_ip_avg_bytes_last_30_mins",
]
categorical_features = ["proto", "service", "conn_state", "history"]
categorical_features_indexed = [c + "_index" for c in categorical_features]

input_features = numerical_features + categorical_features_indexed

### Remove rare categories

In [None]:
df_fe.select([F.count_distinct(c) for c in categorical_features]).show()

In [None]:
categorical_valid_values = {}

for c in categorical_features:
    # Collect the values that occur more than 100 times
    categorical_valid_values[c] = (
        df_fe.groupby(c)
        .count()
        .filter(F.col("count") > 100)
        .select(c)
        .toPandas()
        .values.ravel()
    )

    # Keep frequent values; everything else (including nulls) becomes "Other"
    df_fe = df_fe.withColumn(
        c,
        F.when(F.col(c).isin(list(categorical_valid_values[c])), F.col(c)).otherwise(
            F.lit("Other")
        ),
    )

In [None]:
df_fe.select([F.count_distinct(c) for c in categorical_features]).show()
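
The same idea, collapsing infrequent values into a single `"Other"` bucket, can be sketched in plain Python on a toy list. The helper `collapse_rare`, its `min_count` parameter, and the sample protocol values are assumptions made up for illustration; the notebook itself uses a threshold of 100 on the full dataset.

```python
from collections import Counter


def collapse_rare(values, min_count):
    """Keep values that occur more than min_count times; map the rest to "Other"."""
    freq = Counter(values)
    frequent = {v for v, n in freq.items() if n > min_count}
    return [v if v in frequent else "Other" for v in values]


# Hypothetical protocol column: "icmp" appears only once
protos = ["tcp"] * 5 + ["udp"] * 3 + ["icmp"]
collapsed = collapse_rare(protos, min_count=2)
print(Counter(collapsed))
```

Here `tcp` and `udp` survive, while the single `icmp` row ends up in `"Other"`, so the number of distinct categories can only stay the same or shrink.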

## Train/Test Split
The train/test split needs to be done by source IP address; otherwise, rows from the same IP would land in both sets and we'd risk leaking data, since the rolling features of a row summarize that IP's earlier traffic. The best way to do this is to split the IP addresses at random and then filter the data frame by IP address.

In [None]:
df_fe.groupby("source_ip").agg(F.sum(F.col("is_bad")).alias("bad_sum")).orderBy("bad_sum", ascending=False).show(5)

In [None]:
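# Source IPs treated as malicious; they are kept out of the random split
malicious_ips = ["192.168.100.103", "192.168.2.5", "192.168.2.1"]

# Training non-malicious IPs (~80%); the seed is an arbitrary choice so reruns use the same draw
train_ips = (
    df_fe.where(~F.col("source_ip").isin(malicious_ips))
    .select(F.col("source_ip"), F.lit(1).alias("is_train"))
    .dropDuplicates()
    .sample(fraction=0.8, seed=42)
)

# IPs that were not sampled get is_train = NULL after the left join
df_fe = df_fe.join(train_ips, "source_ip", "left")

# Add 1 malicious IP to training and testing data.
# `is_train != 1` is never true for NULL, so held-out IPs are selected with isNull() instead.
held_out = F.col("is_train").isNull() & ~F.col("source_ip").isin(malicious_ips)
df_train = df_fe.where((F.col("is_train") == 1) | (F.col("source_ip") == "192.168.100.103"))
df_test = df_fe.where(held_out | (F.col("source_ip") == "192.168.2.5"))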
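
As a minimal plain-Python sketch of the group-wise split described above (not Spark code): split the unique groups at random, then assign every row to the side its group landed on. The helper `split_by_group`, the seed, and the made-up IP addresses are assumptions for illustration. Unlike Spark's `sample`, which draws each row independently and so only approximates the fraction, this sketch cuts the shuffled list at an exact position.

```python
import random


def split_by_group(groups, train_fraction=0.8, seed=42):
    """Randomly assign each unique group to train or test (never both)."""
    unique = sorted(set(groups))  # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(unique)
    cut = int(len(unique) * train_fraction)
    return set(unique[:cut]), set(unique[cut:])


# Hypothetical data: 100 rows spread over 10 source IPs
ips = [f"10.0.0.{i % 10}" for i in range(100)]
train_groups, test_groups = split_by_group(ips)

# Filter the rows by the group they belong to
train_rows = [ip for ip in ips if ip in train_groups]
test_rows = [ip for ip in ips if ip in test_groups]
print(len(train_rows), len(test_rows))  # 80 20
```

Because membership is decided per IP rather than per row, no IP address can appear on both sides of the split.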

## Pipeline

In [None]:
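# handleInvalid="skip" drops rows with category values unseen during fitting
ind = StringIndexer(
    inputCols=categorical_features,
    outputCols=categorical_features_indexed,
    handleInvalid="skip",
)
# handleInvalid="skip" drops rows that contain null/NaN feature values
va = VectorAssembler(inputCols=input_features, outputCol="features", handleInvalid="skip")
rf = RandomForestClassifier(featuresCol="features", labelCol="is_bad", numTrees=100)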

pipeline = Pipeline(stages=[ind, va, rf])
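
For intuition about the first pipeline stage, here is a rough plain-Python sketch of what a frequency-ordered string indexer does and what skipping invalid values means at transform time. This is an illustration under assumptions, not MLlib's implementation: the helpers `fit_string_index` and `transform_skip` and the toy values are made up, and the ordering shown (most frequent first, ties broken alphabetically) is meant to mirror `StringIndexer`'s default `frequencyDesc` ordering.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index: most frequent label gets 0.0, ties sorted alphabetically."""
    freq = Counter(values)
    ordered = sorted(freq, key=lambda v: (-freq[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform_skip(values, mapping):
    """Index values, dropping any value that was not seen during fitting."""
    return [mapping[v] for v in values if v in mapping]


mapping = fit_string_index(["tcp", "udp", "tcp", "icmp", "tcp", "udp"])
print(mapping)  # {'tcp': 0.0, 'udp': 1.0, 'icmp': 2.0}

# "gre" was never seen during fitting, so that row is dropped
indexed = transform_skip(["udp", "gre", "tcp"], mapping)
print(indexed)  # [1.0, 0.0]
```

One consequence worth keeping in mind: with a skip policy, test rows with unseen categories are silently dropped, so the evaluation metrics are computed on fewer rows than `df_test` contains.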

## Fit and Predict

In [None]:
# Note: `pipeline` is rebound from the unfitted Pipeline to the fitted PipelineModel
pipeline = pipeline.fit(df_train)
test_preds = pipeline.transform(df_test)

## Evaluate

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

roc = BinaryClassificationEvaluator(labelCol="is_bad", metricName="areaUnderROC")
print("ROC AUC", roc.evaluate(test_preds))

pr = BinaryClassificationEvaluator(labelCol="is_bad", metricName="areaUnderPR")
print("PR AUC", pr.evaluate(test_preds))
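
For intuition about what `areaUnderROC` reports: ROC AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The plain-Python sketch below computes it that way on a toy example. The helper `roc_auc` and the sample labels and scores are made up for illustration, and Spark derives the value from the ROC curve rather than from pairwise comparisons, so treat this as an explanation of the metric, not of the evaluator's internals.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = malicious) and model scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking. With heavily imbalanced labels, the PR AUC printed alongside it is often the more informative of the two numbers.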

In [None]:
import pandas as pd

# featureImportances is an MLlib vector; convert it to a NumPy array for pandas
pd.DataFrame(
    {
        "importance": pipeline.stages[-1].featureImportances.toArray(),
        "feature": pipeline.stages[-2].getInputCols(),
    }
).sort_values("importance", ascending=False)

## Export

In [None]:
# Overwrite any previous export so the notebook can be rerun
pipeline.stages[-1].write().overwrite().save("rf_basic")

In [None]:
# Save the whole fitted pipeline (indexer, assembler, and model)
pipeline.write().overwrite().save("pipeline_basic")