# MLlib Decision Trees to Predict Network Attacks
### The classifier will predict `normal` or `attack`.  

Source: https://github.com/jadianes/spark-py-notebooks/blob/master/nb9-mllib-trees/nb9-mllib-trees.ipynb  
File Last Updated: Feb 16, 2018

In [4]:
import urllib
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

## Dataset
The dataset comes from the KDD Cup 1999. It is distributed as a Gzip file, which we download locally.

In [2]:
f = urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data.gz", "kddcup.data.gz")
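
The download cell above uses the Python 2 location of `urlretrieve`. As a minimal sketch, assuming you may run the notebook under either Python 2 or Python 3, the helper below imports `urlretrieve` from whichever module exists and skips the download when the file is already on disk. The name `fetch_once` is hypothetical and not part of the original notebook.

```python
import os

try:
    from urllib.request import urlretrieve  # Python 3 location
except ImportError:
    from urllib import urlretrieve  # Python 2 location


def fetch_once(url, filename):
    """Download url to filename unless the file already exists.

    Hypothetical helper for illustration; returns the local path.
    """
    if not os.path.exists(filename):
        urlretrieve(url, filename)
    return filename
```

Calling `fetch_once("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data.gz", "kddcup.data.gz")` would stand in for the cell above and avoid re-downloading the file on every run.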

In [12]:
data_file = "./kddcup.data.gz"
raw_data = sc.textFile(data_file)
raw_data.cache()

./kddcup.data.gz MapPartitionsRDD[16] at textFile at NativeMethodAccessorImpl.java:0

In [13]:
print("Records in dataset: {}".format(raw_data.count()))

Records in dataset: 4898431


In [18]:
raw_data.take(3)

[u'0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.',
 u'0,tcp,http,SF,162,4528,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,1,1,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,normal.',
 u'0,tcp,http,SF,236,1228,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,2,2,1.00,0.00,0.50,0.00,0.00,0.00,0.00,0.00,normal.']

The [KDD Cup 1999](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) also provides test data, which we load into a separate RDD.  

In [7]:
ft = urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/corrected.gz", "corrected.gz")

In [14]:
test_data_file = "./corrected.gz"
test_raw_data = sc.textFile(test_data_file)

In [17]:
print("Test data size is {}".format(test_raw_data.count()))

Test data size is 311029


In [20]:
test_raw_data.take(3)

[u'0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.',
 u'0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.',
 u'0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.']

## Data preprocessing

In [21]:
from pyspark.mllib.regression import LabeledPoint
from numpy import array

csv_data = raw_data.map(lambda x: x.split(","))
test_csv_data = test_raw_data.map(lambda x: x.split(","))

In [30]:
protocols = csv_data.map(lambda x: x[1]).distinct().collect()
services = csv_data.map(lambda x: x[2]).distinct().collect()
flags = csv_data.map(lambda x: x[3]).distinct().collect()

In [34]:
protocols[:5]

[u'udp', u'icmp', u'tcp']

In [55]:
protocols.index(u'udp')

0

In [56]:
services[:5]

[u'urp_i', u'http_443', u'Z39_50', u'smtp', u'domain']

In [57]:
def create_labeled_point(line_split):
    # keep the 41 feature columns; column 41 holds the label
    clean_line_split = line_split[0:41]
    
    # convert protocol to numeric categorical variable
    # (a value not seen in the training data maps to len(protocols))
    try: 
        clean_line_split[1] = protocols.index(clean_line_split[1])
    except ValueError:
        clean_line_split[1] = len(protocols)
        
    # convert service to numeric categorical variable
    try:
        clean_line_split[2] = services.index(clean_line_split[2])
    except ValueError:
        clean_line_split[2] = len(services)
    
    # convert flag to numeric categorical variable
    try:
        clean_line_split[3] = flags.index(clean_line_split[3])
    except ValueError:
        clean_line_split[3] = len(flags)
    
    # convert label to binary label: 0.0 for normal traffic, 1.0 for any attack
    attack = 1.0
    if line_split[41]=='normal.':
        attack = 0.0
        
    return LabeledPoint(attack, array([float(x) for x in clean_line_split]))
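
The categorical encoding used in `create_labeled_point` can be tried in isolation, without Spark. The sketch below is an illustration only: `encode_category` is a hypothetical helper name, the `protocols` list copies the order shown in the notebook output above, and `"sctp"` is a made-up value standing in for a category that never appeared in the training data.

```python
def encode_category(value, known_values):
    """Return the position of value in known_values.

    A value that is not in the list gets the index len(known_values),
    mirroring the fallback used in create_labeled_point.
    """
    try:
        return known_values.index(value)
    except ValueError:
        return len(known_values)


# Order copied from the notebook output; it can differ between runs.
protocols = ["udp", "icmp", "tcp"]

print(encode_category("udp", protocols))   # 0
print(encode_category("tcp", protocols))   # 2
print(encode_category("sctp", protocols))  # 3 (hypothetical unseen value)
```

Note that the fallback index equals the number of known categories, so it lies outside the range `0..k-1` that an entry `k` in `categoricalFeaturesInfo` declares for that feature.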


In [58]:
training_data = csv_data.map(create_labeled_point)

In [59]:
training_data.take(2)

[LabeledPoint(0.0, [0.0,2.0,58.0,10.0,215.0,45076.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]),
 LabeledPoint(0.0, [0.0,2.0,58.0,10.0,162.0,4528.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0])]

In [61]:
test_data = test_csv_data.map(create_labeled_point)

## Training the decision tree classifier
**(Took 193 seconds with the default configuration)**

In [63]:
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from time import time

# Build the model
t0 = time()
tree_model = DecisionTree.trainClassifier(training_data, 
                                          numClasses=2, 
                                          categoricalFeaturesInfo={1: len(protocols), 2: len(services), 3: len(flags)},
                                          impurity='gini', 
                                          maxDepth=4, 
                                          maxBins=100)

print("Classifier training time (secs): {}".format(round(time() - t0, 3)))

Classifier training time (secs): 193.72


## Evaluating the model

To measure classification accuracy on the test data, we pass the feature vectors of the `test_data` RDD to the model's `predict` method and pair each prediction with its true label.  

Let's look at some labels and feature vectors:

In [69]:
test_data.map(lambda p: p.label).take(10)

[0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]

In [66]:
test_data.map(lambda p: p.features).take(2)

[DenseVector([0.0, 0.0, 5.0, 10.0, 105.0, 146.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 255.0, 254.0, 1.0, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
 DenseVector([0.0, 0.0, 5.0, 10.0, 105.0, 146.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 255.0, 254.0, 1.0, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])]

In [71]:
# Make predictions at each test point
predictions = tree_model.predict(test_data.map(lambda p: p.features))

# zip labels and predictions
labels_and_preds = test_data.map(lambda p: p.label).zip(predictions)

In [73]:
labels_and_preds.take(10)

[(0.0, 0.0),
 (0.0, 0.0),
 (0.0, 0.0),
 (1.0, 0.0),
 (1.0, 0.0),
 (1.0, 0.0),
 (0.0, 0.0),
 (0.0, 0.0),
 (1.0, 0.0),
 (0.0, 0.0)]

Compute the classifier accuracy:

In [74]:
# index into the (label, prediction) pair; tuple-unpacking lambdas are Python 2 only
accuracy_test = labels_and_preds.filter(lambda lp: lp[0] == lp[1]).count() / float(test_data.count())

print("Test accuracy = {}".format(round(accuracy_test, 4)))

Test accuracy = 0.9183
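
Accuracy on its own does not show which kind of mistake the model makes. As a small sketch, the code below takes the ten `(label, prediction)` pairs printed by `labels_and_preds.take(10)` above and counts each outcome. The helper name `confusion_counts` is an assumption for illustration, and ten rows say nothing reliable about the full test set.

```python
# The ten (label, prediction) pairs shown in the notebook output above.
sample_pairs = [
    (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (1.0, 0.0),
    (1.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (0.0, 0.0),
]


def confusion_counts(pairs):
    """Count true/false positives and negatives, treating 1.0 as attack."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for label, pred in pairs:
        if label == 1.0 and pred == 1.0:
            counts["tp"] += 1
        elif label == 0.0 and pred == 0.0:
            counts["tn"] += 1
        elif label == 0.0 and pred == 1.0:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


counts = confusion_counts(sample_pairs)
accuracy = (counts["tp"] + counts["tn"]) / float(len(sample_pairs))
attacks = counts["tp"] + counts["fn"]
recall = counts["tp"] / float(attacks) if attacks else 0.0

print(counts)    # tn=6, fn=4, tp=0, fp=0
print(accuracy)  # 0.6
print(recall)    # 0.0
```

In this sample every error is a missed attack, which is the kind of detail a single accuracy figure hides.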


### Interpreting the model

In [80]:
print(tree_model.toDebugString())

DecisionTreeModel classifier of depth 4 with 27 nodes
  If (feature 22 <= 33.0)
   If (feature 25 <= 0.5)
    If (feature 36 <= 0.48)
     If (feature 34 <= 0.91)
      Predict: 0.0
     Else (feature 34 > 0.91)
      Predict: 1.0
    Else (feature 36 > 0.48)
     If (feature 2 in {0.0,42.0,24.0,20.0,46.0,57.0,60.0,44.0,27.0,12.0,7.0,3.0,18.0,67.0,43.0,26.0,55.0,58.0,36.0,4.0,47.0,15.0})
      Predict: 0.0
     Else (feature 2 not in {0.0,42.0,24.0,20.0,46.0,57.0,60.0,44.0,27.0,12.0,7.0,3.0,18.0,67.0,43.0,26.0,55.0,58.0,36.0,4.0,47.0,15.0})
      Predict: 1.0
   Else (feature 25 > 0.5)
    If (feature 3 in {10.0,6.0,2.0,7.0,3.0,4.0})
     If (feature 2 in {42.0,27.0,12.0,7.0,3.0,18.0,50.0,67.0,8.0,58.0,51.0,15.0,68.0})
      Predict: 0.0
     Else (feature 2 not in {42.0,27.0,12.0,7.0,3.0,18.0,50.0,67.0,8.0,58.0,51.0,15.0,68.0})
      Predict: 1.0
    Else (feature 3 not in {10.0,6.0,2.0,7.0,3.0,4.0})
     If (feature 38 <= 0.07)
      Predict: 0.0
     Else (feature 38 > 0.07)
      P

For example, a network interaction with the following features (see the KDD Cup 1999 feature descriptions) will be classified as an attack by our model:

* `count`, the number of connections to the same host as the current connection in the past two seconds, is greater than 32.  
* `dst_bytes`, the number of data bytes from destination to source, is 0.  
* `service` is neither level 0 nor level 52.  
* `logged_in` is false.

From our services list we know that:

In [81]:
print("Service 0 is {}".format(services[0]))
print("Service 52 is {}".format(services[52]))

Service 0 is urp_i
Service 52 is tftp_u


So we can characterise as network attacks those interactions with more than 32 connections to the same host in the last two seconds, zero bytes transferred from destination to source, a service other than *urp_i* or *tftp_u*, and no login. The same reading can be applied to each terminal node of the tree.     
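
The path described above can be written down as a plain predicate, which makes it easy to check individual connections by hand. This is only a sketch of the rule as stated in the prose: the function name `looks_like_attack` and its argument names are hypothetical, and the thresholds are the ones quoted in the text, not values read from a trained model.

```python
def looks_like_attack(count, dst_bytes, service, logged_in):
    """Return True when a connection matches the tree path described in the text.

    Hypothetical helper: count > 32, no bytes from destination to source,
    service other than urp_i or tftp_u, and not logged in.
    """
    return (
        count > 32
        and dst_bytes == 0
        and service not in ("urp_i", "tftp_u")
        and not logged_in
    )


# Made-up connections, for illustration only.
print(looks_like_attack(count=100, dst_bytes=0, service="private", logged_in=False))  # True
print(looks_like_attack(count=1, dst_bytes=45076, service="http", logged_in=True))    # False
```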

We can see that `count` is the first split in the tree. Remember that each partition is chosen greedily: from a set of possible splits, the algorithm selects the one that maximizes the information gain at that tree node (see more [here](https://spark.apache.org/docs/latest/mllib-decision-tree.html#basic-algorithm)). At the second level we find the variables `flag` (normal or error status of the connection) and `dst_bytes` (the number of data bytes from destination to source), and so on.    

This explanatory capability of a classification (or regression) tree is one of its main benefits. Understanding the data is a key factor in building better models.
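
To make the greedy split criterion mentioned above more concrete, the sketch below computes the Gini impurity of a set of binary labels and the gain obtained by one candidate split. The helper names `gini` and `split_gain` and the toy labels are assumptions for illustration. This is not MLlib's implementation, which evaluates binned candidate splits across the cluster.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 1 - p0^2 - p1^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = float(sum(labels)) / n
    p0 = 1.0 - p1
    return 1.0 - p0 ** 2 - p1 ** 2


def split_gain(left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    parent = left + right
    n = float(len(parent))
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


# Toy labels: the left child is mostly normal (0), the right mostly attack (1).
left = [0, 0, 0, 1]
right = [1, 1, 1, 0]

print(gini(left + right))       # 0.5
print(gini(left))               # 0.375
print(split_gain(left, right))  # 0.125
```

A greedy learner would compute this gain for every candidate split at a node and keep the largest one.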

## Building a minimal model using the three main splits

Now that the classification tree splits have shown us the main features predicting a network attack, let's use them to build a minimal classification tree with just the three main variables: `count`, `dst_bytes`, and `flag`.  

We need to define the appropriate function to create labeled points.  

In [82]:
def create_labeled_point_minimal(line_split):
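    # keep only flag (column 3), dst_bytes (column 5) and count (column 22)
    clean_line_split = line_split[3:4] + line_split[5:6] + line_split[22:23]
    
    # convert flag to numeric categorical variable
    # (a value not seen in the training data maps to len(flags))
    try:
        clean_line_split[0] = flags.index(clean_line_split[0])
    except ValueError:
        clean_line_split[0] = len(flags)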
    
    # convert label to binary label
    attack = 1.0
    if line_split[41]=='normal.':
        attack = 0.0
        
    return LabeledPoint(attack, array([float(x) for x in clean_line_split]))
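
To check which columns the slicing in `create_labeled_point_minimal` picks, the sketch below applies the same column positions to the first training record shown earlier in the notebook. The constant names and the helper `select_minimal_fields` are hypothetical and introduced only for this illustration.

```python
# First record printed by raw_data.take(3) earlier in the notebook.
SAMPLE_ROW = (
    "0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,"
    "0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,0.00,0.00,0.00,0.00,"
    "0.00,0.00,0.00,0.00,normal."
)

# Column positions used by the notebook code.
FLAG, DST_BYTES, COUNT, LABEL = 3, 5, 22, 41


def select_minimal_fields(line_split):
    """Return the raw flag, dst_bytes and count fields of one split record."""
    return [line_split[FLAG], line_split[DST_BYTES], line_split[COUNT]]


fields = SAMPLE_ROW.split(",")
print(len(fields))                    # 42
print(select_minimal_fields(fields))  # ['SF', '45076', '1']
print(fields[LABEL])                  # normal.
```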


In [83]:
training_data_minimal = csv_data.map(create_labeled_point_minimal)
test_data_minimal = test_csv_data.map(create_labeled_point_minimal)

We use these RDDs to train the model.  

In [85]:
# Build the model
t0 = time()
tree_model_minimal = DecisionTree.trainClassifier(training_data_minimal, numClasses=2, 
                                          categoricalFeaturesInfo={0: len(flags)},
                                          impurity='gini', maxDepth=3, maxBins=32)
tt = time() - t0

print("Classifier trained in {} seconds".format(round(tt,3)))

Classifier trained in 132.567 seconds


Now we can predict on the testing data and calculate accuracy.  

In [88]:
predictions_minimal = tree_model_minimal.predict(test_data_minimal.map(lambda p: p.features))
labels_and_preds_minimal = test_data_minimal.map(lambda p: p.label).zip(predictions_minimal)

In [89]:
# divide by the size of the minimal test set, the RDD the predictions were made on
accuracy_test_minimal = labels_and_preds_minimal.filter(lambda lp: lp[0] == lp[1]).count() / float(test_data_minimal.count())

print("Test accuracy = {}".format(round(accuracy_test_minimal, 4)))

Test accuracy = 0.9049
