### Smart Detection of Bot/Malware Generated Network Traffic - Model on Spark 

Malware traffic is often hard to detect because it uses real users' PCs or browsers to generate fraudulent activity and spam. The purpose of this project is to build a simple supervised model trained to detect malware-based traffic in a network traffic log or capture. When the model flags an IP address as generating malware-based spam or fraudulent activity, that address can be listed for quarantine or further analysis.


The dataset used here is part of a larger dataset (named CTU-13), which records 4 hours of network traffic in the computer network of a university department at CTU University, Czech Republic. The researchers who created the dataset infected one of the computers in the network with malware that generates click-fraud and spam activity. The traffic was recorded by a traffic analytics tool, which captured the malware-based activity generated by the infected PC in addition to normal traffic. Since the infected computer is known, the data can be labeled, and the purpose of the project is to present a supervised classification model.

### Feature Extraction, Creation, and Normalization

In the data exploration stage we identified the relevant features. In this combined notebook we first prepare the features, and then build and evaluate our model.

In [1]:
import pyspark.sql
import numpy as np 
from pyspark.sql.functions import col

Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20190813233752-0000
KERNEL_ID = 5427d62f-bacf-4735-bd71-f5715de2131e


In [2]:
#We start by loading the CSV file that we are going to use

import ibmos2spark
# @hidden_cell
credentials = {
    'endpoint': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'service_id': 'iam-ServiceId-e2505fd8-30ac-494c-84fd-b6d402f6066e',
    'iam_service_endpoint': 'https://iam.ng.bluemix.net/oidc/token',
    'api_key': 'Vflbn-q3SEKstW6WU0EqiO02GjNQo6bIb2UFbwUQ6ov5'
}

configuration_name = 'os_01e0daecbfe749718a79c33cdabd6510_configs'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .load(cos.url('Sample2Capture.csv', 'default-donotdelete-pr-fuwiqkd4yylvy3'))

df.printSchema()

root
 |-- StartTime: string (nullable = true)
 |-- Dur: string (nullable = true)
 |-- Proto: string (nullable = true)
 |-- SrcAddr: string (nullable = true)
 |-- Sport: string (nullable = true)
 |-- Dir: string (nullable = true)
 |-- DstAddr: string (nullable = true)
 |-- Dport: string (nullable = true)
 |-- State: string (nullable = true)
 |-- sTos: string (nullable = true)
 |-- dTos: string (nullable = true)
 |-- TotPkts: string (nullable = true)
 |-- TotBytes: string (nullable = true)
 |-- SrcBytes: string (nullable = true)
 |-- Label: string (nullable = true)



In [3]:
# Select the interesting features discovered in the previous stage and add the label to the dataset.
# The label is derived from the known IP address of the infected PC.

df.createOrReplaceTempView('df') 

infected_addr = "147.32.84.165"

sql = f"""
SELECT Dur, Proto, Sport, Dir, Dport, State, sTos, dTos, TotPkts,TotBytes,SrcBytes, 
CASE WHEN SrcAddr = '{infected_addr}' THEN 1 ELSE 0 END AS Bot
from df

"""
df_current = spark.sql(sql)



In [4]:
# convert relevant datatypes from string to float and fill in nulls in two columns

from pyspark.sql.types import FloatType

df_current = df_current.fillna({'sTos':'-1','dTos':'-1'})
df_current = df_current.withColumn("Dur", df_current["Dur"].cast(FloatType()))
df_current = df_current.withColumn("TotPkts", df_current["TotPkts"].cast(FloatType()))
df_current = df_current.withColumn("TotBytes", df_current["TotBytes"].cast(FloatType()))
df_current = df_current.withColumn("SrcBytes", df_current["SrcBytes"].cast(FloatType()))

df_current.createOrReplaceTempView('df_current')

df_current.printSchema()

root
 |-- Dur: float (nullable = true)
 |-- Proto: string (nullable = true)
 |-- Sport: string (nullable = true)
 |-- Dir: string (nullable = true)
 |-- Dport: string (nullable = true)
 |-- State: string (nullable = true)
 |-- sTos: string (nullable = false)
 |-- dTos: string (nullable = false)
 |-- TotPkts: float (nullable = true)
 |-- TotBytes: float (nullable = true)
 |-- SrcBytes: float (nullable = true)
 |-- Bot: integer (nullable = false)



In [5]:
# index all the relevant categorical features.

from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import Normalizer


state_indexer = StringIndexer(inputCol="State", outputCol="StateIndex")
proto_indexer = StringIndexer(inputCol="Proto", outputCol="ProtoIndex")
dir_indexer   = StringIndexer(inputCol="Dir", outputCol="DirIndex")
sTos_indexer   = StringIndexer(inputCol="sTos", outputCol="sTosIndex")
dTos_indexer   = StringIndexer(inputCol="dTos", outputCol="dTosIndex")
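
To make the indexing step concrete, here is a minimal pure-Python sketch of what a frequency-based string indexer does: the most frequent label gets index 0.0, the next one 1.0, and so on. The helper `frequency_index` and the sample protocol values are hypothetical and only for illustration; the alphabetical tie-break is an assumption about the default ordering in recent Spark versions.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically
    # (an assumption made for this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical protocol values, not taken from the capture.
protos = ["tcp", "udp", "tcp", "icmp", "tcp", "udp"]
proto_index = frequency_index(protos)
print(proto_index)
```

The resulting indices are ordinal codes, not quantities, so a tree-based model can split on them but their numeric distance carries no meaning.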




In [6]:
# Add the data indexers to a pipeline and transform the data accordingly

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[state_indexer, proto_indexer, dir_indexer,sTos_indexer,dTos_indexer])
df_indexed =  pipeline.fit(df_current).transform(df_current)


In [7]:
# create the vector assembler with all the relevant data and the normalizer
vectorAssembler = VectorAssembler(inputCols=["StateIndex","ProtoIndex","DirIndex", "Dur","TotBytes","SrcBytes","sTosIndex","dTosIndex"],outputCol="features")
normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)
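
As a rough illustration of what `p=1.0` means for the normalizer, the sketch below scales a single feature row so that the absolute values sum to 1. The function `l1_normalize` and the sample row are hypothetical; leaving an all-zero row unchanged is an assumption made for this sketch.

```python
def l1_normalize(vector):
    """Scale a row so that the sum of absolute values equals 1 (L1 norm)."""
    total = sum(abs(x) for x in vector)
    if total == 0:
        # Assumption for this sketch: an all-zero row is returned unchanged.
        return list(vector)
    return [x / total for x in vector]


# Hypothetical row in the assembler's column order:
# StateIndex, ProtoIndex, DirIndex, Dur, TotBytes, SrcBytes, sTosIndex, dTosIndex
row = [1.0, 0.0, 2.0, 0.5, 300.0, 100.0, 0.0, 0.0]
row_norm = l1_normalize(row)
print(row_norm)
```

Because the scaling is applied per row, a large byte count shrinks every other entry of the same row, including the category indices.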


#### The features are prepared. Now we are going to create a balanced training data set and a test data set

In [8]:
#first remove unused cols
columns_to_drop = ['Proto', 'Sport','Dir','Dport','State','sTos','dTos']
df_indexed = df_indexed.drop(*columns_to_drop)

# split and balance the data
splits = df_indexed.randomSplit([0.8, 0.2])
df_train = splits[0]
df_test = splits[1]

#now balance the classes in the training data set
bot_data  = df_train.filter(df_train.Bot == 1) #get all the bot data
#get all the normal data and sample 1% to approximate the size of the bot traffic sample
normal_data_sampled  = df_train.filter(df_train.Bot == 0).sample(False,.01) 
#union the downsampled normal data and the bot data
df_train_balanced = bot_data.union(normal_data_sampled) 
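
The 1% fraction above is hard-coded. One possible alternative, sketched here under stated assumptions, is to derive the fraction from the class counts so that the sampled normal traffic roughly matches the bot traffic. The helper `downsample_fraction` and its `ratio` parameter are hypothetical; the counts are the ones this notebook reports for the training split (298473 rows in total, 2175 of them in `bot_data`).

```python
def downsample_fraction(n_minority, n_majority, ratio=1.0):
    """Fraction of the majority class to keep so that it ends up at
    roughly `ratio` times the size of the minority class."""
    if n_minority < 0 or n_majority <= 0:
        raise ValueError("class counts must be positive")
    return min(1.0, ratio * n_minority / n_majority)


n_train = 298473           # rows in the training split
n_bot = 2175               # rows in bot_data
n_normal = n_train - n_bot

fraction = downsample_fraction(n_bot, n_normal)
print(f"sampling fraction: {fraction:.5f}")

# Possible use with the DataFrame from this notebook:
# normal_data_sampled = df_train.filter(df_train.Bot == 0).sample(False, fraction)
```

Since `sample` draws rows at random, the sampled count only approximates the target size.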

In [9]:
print(f"Total Data:{df_indexed.count()}")
print(f"Train Set:{df_train.count()}")
print(f"Test Set:{df_test.count()}")
print(f"Bot Req in Train:{bot_data.count()}")
print(f"Normal Req Sampled:{normal_data_sampled.count()}")
print(f"Total Balanced Train Set:{df_train_balanced.count()}")


Total Data:372715
Train Set:298473
Test Set:74242
Bot Req in Train:2175
Normal Req Sampled:3004
Total Balanced Train Set:5179


#### Build and evaluate a model based on spark's gradient boosted trees algorithm


In [10]:
from pyspark.ml.classification import GBTClassifier, LinearSVC
classifier = GBTClassifier(labelCol="Bot", featuresCol="features_norm", maxIter=10)
#another option that didn't perform well is: LinearSVC(maxIter=10, regParam=0.1, featuresCol='features_norm', labelCol='Bot')

pipeline = Pipeline(stages=[vectorAssembler,normalizer,classifier])
model = pipeline.fit(df_train_balanced)
prediction = model.transform(df_train_balanced)

In [11]:
#prediction.show()
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction").setLabelCol("Bot")
evaluator.evaluate(prediction)



0.9865475151904741

In [12]:
#evaluate the model against the test data
prediction = model.transform(df_test)
evaluator.evaluate(prediction)

0.9841129484891367
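
The two scores above are values of the evaluator's default metric, the area under the ROC curve, not accuracy. As a minimal sketch of what that number measures, the area can be read as the probability that a randomly chosen bot row gets a higher score than a randomly chosen normal row. The function `roc_auc` and the toy labels and scores are hypothetical, and Spark computes the curve differently internally; this is only the equivalent pairwise definition.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores, not taken from the model.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(auc)
```

A high area under the curve says the ranking is good, but it does not fix a decision threshold, which is why the error counts at the chosen threshold are worth checking separately.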

In [13]:
#check the confusion matrix

false_negatives = prediction.filter("Bot == 1 and prediction=0.0").count()
true_positives  = prediction.filter("Bot == 1 and prediction=1.0").count()

tp = true_positives
total = true_positives + false_negatives
pr = true_positives / (true_positives + false_negatives) * 100

print(f"{tp} bot requests out of {total} were correctly identified ({pr}%)")

false_positives = prediction.filter("Bot == 0 and prediction=1").count()
true_negatives  = prediction.filter("Bot == 0 and prediction=0").count()

fp = false_positives
total = false_positives + true_negatives
pr = false_positives / (false_positives + true_negatives) * 100

print(f"{fp} normal requests out of {total} were wrongly identified ({pr}%)")

488 bot requests out of 516 were correctly identified (94.57364341085271%)
5302 normal requests out of 73726 were wrongly identified (7.1914928247836585%)
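
The four counts behind these two lines can also be turned into the usual summary rates. The sketch below uses the counts printed above (488 of 516 bot requests caught, 5302 of 73726 normal requests flagged); the helper `classification_rates` is hypothetical and only restates standard definitions.

```python
def classification_rates(tp, fn, fp, tn):
    """Standard rates derived from a binary confusion matrix."""
    recall = tp / (tp + fn)                  # share of bot requests caught
    false_positive_rate = fp / (fp + tn)     # share of normal requests flagged
    precision = tp / (tp + fp)               # share of flagged requests that are bots
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "false_positive_rate": false_positive_rate,
        "precision": precision,
        "f1": f1,
    }


rates = classification_rates(tp=488, fn=516 - 488, fp=5302, tn=73726 - 5302)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

Because normal traffic vastly outnumbers bot traffic in the test split, precision comes out at roughly 8% even though recall is high, which is worth weighing alongside the false positive rate.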


#### To summarize: our model wrongly identified about 7% of the normal requests as infected by malware, while it correctly identified about 95% of the malware-infected requests. The detection rate is quite satisfying, though it's up to the stakeholders to decide whether a false positive rate of 7% is acceptable.