# Machine Learning - Decision Tree Classifier

In this notebook, we explore the use of a decision tree classifier for a network intrusion detection use case. The data set is from KDD Cup 1999.

http://www.kdd.org/kdd-cup/view/kdd-cup-1999

The dataset is a version of the data from the 1998 DARPA Intrusion Detection Evaluation Program by MIT Lincoln Labs. The raw TCP dump data was collected from a LAN in a lab environment over nine weeks. Simulated attacks were performed to generate malicious traffic.

Attacks could be classified into four main categories:

- DOS: denial-of-service
- R2L: Remote to Local - unauthorized access from a remote machine
- U2R: User to Root
- Probing: surveillance and other probing, such as port scanning
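
The raw label column holds individual attack names rather than these four categories. A minimal sketch of a lookup that groups them, assuming the attack names listed in the KDD Cup 1999 task description and assuming that labels may carry a trailing period (for example `smurf.`). `ATTACK_CATEGORIES` and `categorize` are illustrative names, not part of the original notebook:

```python
# Attack names per category, as listed in the KDD Cup 1999 task description
# (assumption: the training file uses exactly these names).
ATTACK_CATEGORIES = {
    "dos": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "r2l": ["ftp_write", "guess_passwd", "imap", "multihop", "phf", "spy",
            "warezclient", "warezmaster"],
    "u2r": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
    "probe": ["ipsweep", "nmap", "portsweep", "satan"],
}

# Invert the table so each attack name maps straight to its category.
LABEL_TO_CATEGORY = {
    name: category
    for category, names in ATTACK_CATEGORIES.items()
    for name in names
}


def categorize(label):
    """Return the category for a raw label such as 'smurf.' or 'normal.'."""
    name = label.strip().rstrip(".")  # drop whitespace and a trailing period
    if name == "normal":
        return "normal"
    return LABEL_TO_CATEGORY.get(name, "unknown")


print(categorize("smurf."))
print(categorize("normal."))
```

A lookup like this could be used to collapse the many attack names into five classes before training, although the notebook itself trains on the raw labels.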

Step 1 - Load the Data

The schema is defined explicitly, and the data is loaded into a Spark DataFrame from a CSV file.

In [1]:
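# Import only the types the schema uses instead of a wildcard import
from pyspark.sql.types import StructType, StructField, DoubleType, StringType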

schema = StructType( [
    
    StructField("duration", DoubleType(), True),
    StructField("protocol_type", StringType(), True),
    StructField("service", StringType(), True),
    StructField("flag", StringType(), True),
    StructField("src_bytes", DoubleType(), True),
    StructField("dst_bytes", DoubleType(), True),
    StructField("land", DoubleType(), True),
    StructField("wrong_fragment", DoubleType(), True),
    StructField("urgent", DoubleType(), True),
    StructField("hot", DoubleType(), True),
    StructField("num_failed_logins", DoubleType(), True),
    StructField("logged_in", DoubleType(), True),
    StructField("num_compromised", DoubleType(), True),
    StructField("root_shell", DoubleType(), True),
    StructField("su_attempted", DoubleType(), True),
    StructField("num_root", DoubleType(), True),
    StructField("num_file_creations", DoubleType(), True),
    StructField("num_shells", DoubleType(), True),
    StructField("num_access_files", DoubleType(), True),
    StructField("num_outbound_cmds", DoubleType(), True),
    StructField("is_host_login", DoubleType(), True),
    StructField("is_guest_login", DoubleType(), True),
    StructField("count", DoubleType(), True),
    StructField("srv_count", DoubleType(), True),
    StructField("serror_rate", DoubleType(), True),
    StructField("srv_serror_rate", DoubleType(), True),
    StructField("rerror_rate", DoubleType(), True),
    StructField("srv_rerror_rate", DoubleType(), True),
    StructField("same_srv_rate", DoubleType(), True),
    StructField("diff_srv_rate", DoubleType(), True),
    StructField("srv_diff_host_rate", DoubleType(), True),
    StructField("dst_host_count", DoubleType(), True),
    StructField("dst_host_srv_count", DoubleType(), True),
    StructField("dst_host_same_srv_rate", DoubleType(), True),
    StructField("dst_host_diff_srv_rate", DoubleType(), True),
    StructField("dst_host_same_src_port_rate", DoubleType(), True),
    StructField("dst_host_srv_diff_host_rate", DoubleType(), True),
    StructField("dst_host_serror_rate", DoubleType(), True),
    StructField("dst_host_srv_serror_rate", DoubleType(), True),
    StructField("dst_host_rerror_rate", DoubleType(), True),
    StructField("dst_host_srv_rerror_rate", DoubleType(), True),
    StructField("label", StringType(), True)
])

kdd10_df = spark.read.csv("/home/training/bdpgstudent/data/kddcup.data_10_percent", schema=schema)

kdd10_df.count()


494021

There are close to 500,000 records in this data set, which, as the file name indicates, is a 10 percent subset of the original data set.

Step 2 - Split the Data

Split the data set into two parts: one for training the model and one for testing it.

Persist the training data to improve performance, since it is read repeatedly while the model is trained.

In [3]:

training_data, test_data = kdd10_df.randomSplit([0.7, 0.3], seed=1)

training_data.persist()

training_data.count()
test_data.count()

148128
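
The test set holds 148128 of the 494021 records, about 29.98 percent rather than exactly 30 percent, because the split assigns rows at random instead of cutting at a fixed position. The sketch below illustrates that idea in plain Python. It is an illustration only, not Spark's implementation, and `bernoulli_split` is a hypothetical helper name:

```python
import random


def bernoulli_split(rows, train_fraction=0.7, seed=1):
    """Assign each row to train or test independently, with a fixed seed."""
    rng = random.Random(seed)  # seeded generator makes the split repeatable
    train, test = [], []
    for row in rows:
        # Each row lands in the training set with probability train_fraction.
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(10000))
train, test = bernoulli_split(rows)
print(len(train), len(test))  # close to 7000 / 3000, but usually not exact
```

Running the function again with the same seed returns the same split, which is the purpose of passing `seed=1` to `randomSplit` in the cell above.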

Step 3 - Create Transformers, Classifier and Pipeline

We will now create a features vector. VectorAssembler does not accept string columns as input, so the string columns protocol_type, service and flag are left out of the feature list. The classifier also needs a numeric label, but the label column holds string values, so we use StringIndexer to create a new column, label_num, with mapped numeric values.

Next we create the classifier and a pipeline with three stages that runs the transformation and training steps.

In [4]:
#Creating Transformers
import pyspark.ml.feature as ft

input_cols = ['duration', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot',
               'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted',
               'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 
               'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate',
               'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate',
               'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 
               'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 
               'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate']

featuresCreator = ft.VectorAssembler(inputCols=input_cols, outputCol="features")

label_indexer = ft.StringIndexer(inputCol="label", outputCol="label_num")

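# Decision tree classifier; it reads the "features" column by default
import pyspark.ml.classification as cl
dt = cl.DecisionTreeClassifier(labelCol="label_num")

# Chain the assembler, the label indexer and the classifier into one pipeline
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[featuresCreator, label_indexer, dt])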
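
To make the label mapping concrete, here is a plain-Python sketch of what a frequency-based string indexer does: the most frequent label gets index 0.0, the next most frequent gets 1.0, and so on. This mirrors StringIndexer's default ordering (descending frequency) as I understand it. The sample labels are made up, the alphabetical tie-break is a choice made for this sketch, and `fit_string_index` is a hypothetical helper name:

```python
from collections import Counter


def fit_string_index(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; break ties alphabetically so the
    # result is deterministic (a choice made for this sketch).
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}


# Made-up sample, not taken from the data set.
sample = ["smurf.", "normal.", "smurf.", "neptune.", "smurf.", "normal."]
mapping = fit_string_index(sample)
print(mapping)
```

The numeric predictions of the classifier refer to these indices, not to the original strings. In a fitted pipeline, the indexer stage should expose the original labels in index order through its `labels` attribute (here, `model.stages[1].labels`), which allows predictions to be translated back.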


Step 4 - Fit the model

In this step, we fit the pipeline on the training data. The pipeline's fit function passes the input data through each stage in order and returns a trained model.



In [5]:
model = pipeline.fit(training_data)



Step 5 - Evaluate the model

In this step, the trained model is applied to the test data set, and its predictions are then evaluated for accuracy.

In [6]:
import pyspark.ml.evaluation as ev

predictions = model.transform(test_data)

evaluator = ev.MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="label_num", 
        metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(accuracy)


0.9890905163102182


The result is the accuracy expressed as a fraction of correctly classified test records, here about 98.9 percent.
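
Accuracy is the number of correct predictions divided by the number of predictions. On a data set where a few classes dominate, a high accuracy can hide classes that are never predicted correctly, so it can help to look at recall per class as well. A minimal plain-Python sketch of both calculations follows. The `(label, prediction)` pairs are made up, and `accuracy` and `per_class_recall` are hypothetical helper names:

```python
from collections import defaultdict


def accuracy(pairs):
    """Fraction of (label, prediction) pairs where the two values agree."""
    pairs = list(pairs)
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / len(pairs)


def per_class_recall(pairs):
    """For each true label, the fraction of its rows predicted correctly."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for label, prediction in pairs:
        total[label] += 1
        if label == prediction:
            hits[label] += 1
    return {label: hits[label] / total[label] for label in total}


# Made-up example: class 0.0 dominates and class 1.0 is always missed.
pairs = [(0.0, 0.0)] * 8 + [(1.0, 0.0)] * 2
print(accuracy(pairs))          # 0.8
print(per_class_recall(pairs))  # {0.0: 1.0, 1.0: 0.0}
```

In this toy example the accuracy is 0.8 even though the rarer class is never detected, which is why a single accuracy figure should be read with some care on intrusion data.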

Now we will print the decision tree rules.
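
The printed rules identify each feature by its position in the features vector (for example, feature 20) rather than by name. The position corresponds to the order of input_cols passed to VectorAssembler in Step 3. A small helper along the following lines can substitute the names. `name_features` is a hypothetical function, not part of the original notebook, and `FEATURE_NAMES` repeats the input_cols list so that the sketch is self-contained:

```python
import re

# Same order as input_cols in Step 3.
FEATURE_NAMES = [
    'duration', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent',
    'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell',
    'su_attempted', 'num_root', 'num_file_creations', 'num_shells',
    'num_access_files', 'num_outbound_cmds', 'is_host_login',
    'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate',
    'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate',
    'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count',
    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
    'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
    'dst_host_serror_rate', 'dst_host_srv_serror_rate',
    'dst_host_rerror_rate', 'dst_host_srv_rerror_rate',
]


def name_features(debug_text, names=FEATURE_NAMES):
    """Replace every 'feature N' in the text with the N-th column name."""
    def replace(match):
        index = int(match.group(1))
        # Leave the text untouched if the index is out of range.
        return names[index] if index < len(names) else match.group(0)

    return re.sub(r"feature (\d+)", replace, debug_text)


print(name_features("If (feature 20 <= 444.5)"))  # If (srv_count <= 444.5)
```

Applied to the whole string, for example `name_features(dtree.toDebugString)`, this makes the first split read as a test on srv_count instead of feature 20.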

In [21]:
# The fitted decision tree is the third pipeline stage (index 2)
dtree = model.stages[2]
print(dtree.toDebugString)

DecisionTreeClassificationModel (uid=DecisionTreeClassifier_4f9582349f5257734116) of depth 5 with 31 nodes
  If (feature 20 <= 444.5)
   If (feature 25 <= 0.32)
    If (feature 31 <= 0.14500000000000002)
     If (feature 1 <= 0.5)
      If (feature 32 <= 0.065)
       Predict: 1.0
      Else (feature 32 > 0.065)
       Predict: 7.0
     Else (feature 1 > 0.5)
      If (feature 1 <= 6.5)
       Predict: 4.0
      Else (feature 1 > 6.5)
       Predict: 2.0
    Else (feature 31 > 0.14500000000000002)
     If (feature 1 <= 6.5)
      If (feature 23 <= 0.995)
       Predict: 4.0
      Else (feature 23 > 0.995)
       Predict: 7.0
     Else (feature 1 > 6.5)
      If (feature 19 <= 75.5)
       Predict: 2.0
      Else (feature 19 > 75.5)
       Predict: 8.0
   Else (feature 25 > 0.32)
    If (feature 19 <= 275.5)
     If (feature 9 <= 0.5)
      If (feature 4 <= 0.5)
       Predict: 2.0
      Else (feature 4 > 0.5)
       Predict: 8.0
     Else (feature 9 > 0.5)
      If (feature 30 <= 0.935