# Decision Tree
Decision Trees (DTs) are a non-parametric supervised learning method used for classification. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.
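
As a small illustration of "simple decision rules" and the piecewise constant view, the sketch below fits a depth-1 tree to a hypothetical one-feature toy dataset (not the project data) and prints the rule it learns.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (hypothetical): one feature, the label flips between 4 and 6.
X = [[1.0], [2.0], [3.0], [4.0], [6.0], [7.0], [8.0], [9.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

toy_tree = DecisionTreeClassifier(max_depth=1, random_state=42).fit(X, y)

# Print the learned rule as text: a single threshold on the feature "x".
print(export_text(toy_tree, feature_names=["x"]))

# Every point on the same side of the threshold gets the same constant prediction.
low_side = toy_tree.predict([[0.0], [2.5], [4.9]])
high_side = toy_tree.predict([[5.1], [7.5], [100.0]])
print(low_side, high_side)
```

A deeper tree simply adds more thresholds, so the prediction stays constant within each region but the regions become smaller.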

## Instructions for Use
1. Load the data from GitHub.
2. Run `main`.

## Input

1. A `.csv` file produced by `pre_processing.ipynb`.
2. The pre-processed input data uses the following techniques:
   * MinMax Scaling
   * PCA
   * Correlation
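
The preprocessing itself lives in `pre_processing.ipynb` and is not shown here. As a rough sketch of what the three techniques listed above do, the example below applies them to a hypothetical random frame; the column names, the 0.95 correlation threshold and the number of principal components are all assumptions made for illustration, not values taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
# Hypothetical stand-in for the cleaned dataset: two features plus a binary label.
demo = pd.DataFrame({
    "a": rng.normal(size=100),
    "b": rng.normal(size=100),
    "label": rng.integers(0, 2, size=100),
})
demo["a_copy"] = demo["a"] * 2.0 + 1.0   # perfectly correlated with "a"

features = demo.drop("label", axis=1)

# 1. MinMax scaling: rescale every feature to the [0, 1] range.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(features), columns=features.columns)

# 2. Correlation filter: drop one feature from each highly correlated pair.
corr = scaled.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = scaled.drop(columns=to_drop)

# 3. PCA: project the scaled features onto two principal components.
components = PCA(n_components=2).fit_transform(scaled)

print("dropped:", to_drop, "| PCA output shape:", components.shape)
```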

## Output/Analysis

1. Visualising the accuracy of the DT model with k-fold validation.
2. Comparing the accuracy of the DT model with and without PCA.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.tree import export_graphviz
import matplotlib.image as mpimg
import subprocess
from sklearn.model_selection import GridSearchCV

In [2]:
acc1 = []
f1 = []
acc = []

In [3]:
def load_data(filename):
    return pd.read_csv(filename)

# Split the input file into train and test datasets

I/P: dataframe

O/P: cross_val_df, test_df (an 80/20 split; the features and the `label` column are separated later, when fitting and predicting)
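
A minimal sketch of the split, using a hypothetical ten-row frame in place of the real data. It shows that `train_test_split` returns two DataFrames when given one, and how `x_cross_val` and `y_cross_val` can be derived from the training part afterwards.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row frame with a "label" column, standing in for the real data.
demo_df = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

# Passing a single DataFrame returns two DataFrames (train part, test part).
cross_val_part, test_part = train_test_split(
    demo_df, train_size=0.8, test_size=0.2, random_state=42
)

# Features and target are separated later, just before fitting or predicting.
x_cross_val = cross_val_part.drop("label", axis=1)
y_cross_val = cross_val_part["label"]

print(cross_val_part.shape, test_part.shape)
```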

In [4]:
def prep_training(network_data):
    return train_test_split(network_data, train_size=0.8, test_size=0.2, random_state=42)

In [5]:
from pprint import pprint

In [6]:
def get_data(filename):
    return pd.read_csv(filename)

### RandomizedSearchCV versus GridSearchCV

Grid search works well when only a small number of hyperparameters is tuned. When the number of hyperparameters is large and their influence on the score is uneven, random search is usually the better choice. [Reference](https://towardsdatascience.com/machine-learning-gridsearchcv-randomizedsearchcv-d36b89231b10)

Only two hyperparameters (`max_depth` and `criterion`) are tuned here, so we have used GridSearchCV.
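
For contrast, here is a sketch of what the random-search alternative could look like. It is not used in this notebook; the synthetic data, the sampled ranges and `n_iter=5` are assumptions chosen only to keep the example small.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical), not the network dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Distributions to sample from, instead of a fixed grid of values.
param_distributions = {
    "max_depth": randint(2, 11),           # integers 2..10
    "min_samples_split": randint(2, 21),   # integers 2..20
    "criterion": ["gini", "entropy"],
}

random_search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # only 5 sampled candidates are evaluated
    cv=3,
    random_state=42,
)
random_search.fit(X_demo, y_demo)
print(random_search.best_params_)
```

The cost is fixed by `n_iter` rather than by the size of the search space, which is why random search scales better when many hyperparameters are involved.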

In [7]:
def creatingParamterGrid():
    # n_estimators = [30, 60, 90, 120]   # Random-forest setting, not applicable to a single tree
    max_depth = [2, 4, 6]    # Maximum number of levels in the tree
    # min_samples_split = [2, 4, 6]    # Minimum number of samples required to split a node (not tuned here)
    myCriterion = ['gini', 'entropy']    # Split-quality measures to compare
    
    # Create the parameter grid
    parameter_grid = {'max_depth': max_depth,
                     'criterion': myCriterion}
    return parameter_grid

## Cross-Validated Grid Search

We are now ready to create our grid-search object from the classifier and the parameter grid defined above.
Instead of passing a `PredefinedSplit` object to the `cv` parameter, we simply pass the number of folds.
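
To make the cost of the search concrete, the sketch below counts the candidates in the grid used in this notebook and checks which splitter scikit-learn resolves an integer `cv` to for a classifier. The label array `y_binary` is a hypothetical placeholder.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, StratifiedKFold, check_cv

# Same grid as in this notebook: 3 depths x 2 criteria.
grid = {"max_depth": [2, 4, 6], "criterion": ["gini", "entropy"]}
n_candidates = len(ParameterGrid(grid))
n_folds = 3
print(n_candidates, "candidates x", n_folds, "folds =", n_candidates * n_folds, "fits")

# For a classifier with binary labels, an integer cv is resolved to StratifiedKFold.
y_binary = np.array([0, 1] * 6)   # hypothetical labels
splitter = check_cv(n_folds, y_binary, classifier=True)
print(type(splitter).__name__, "with", splitter.get_n_splits(), "splits")
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training data after the cross-validation fits.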

In [8]:
def prepToFindOptimalHyperParams(clf,parameter_grid):
    grid_search = GridSearchCV(estimator=clf, cv=3, param_grid = parameter_grid)
    return grid_search

## Training the Model

Now that we have created our `grid_search` object, we are ready to train our model.

In [9]:
def train(cross_val_df,grid_search):
    grid_search.fit(cross_val_df.drop("label",axis=1), cross_val_df["label"])

## Cross-validated Results

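To examine the results of the individual folds, we use `grid_search`'s `cv_results_` attribute, for example `pd.DataFrame(grid_search.cv_results_).head()`.

## Optimal Hyperparameters

The best parameter combination and its mean cross-validated score are available as `grid_search.best_params_` and `grid_search.best_score_`.
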
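
A minimal sketch of how the cross-validation results can be ranked. It runs the same grid on synthetic stand-in data (an assumption, so the scores differ from the ones reported in this notebook) and keeps only the columns needed to compare candidates.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical), not the network dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

demo_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 4, 6], "criterion": ["gini", "entropy"]},
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# One row per candidate; sort so that the best-ranked candidate comes first.
results = pd.DataFrame(demo_search.cv_results_)
summary = (
    results[["param_criterion", "param_max_depth",
             "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(summary)
print(demo_search.best_params_, demo_search.best_score_)
```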
# Test the model and analyse it

In [10]:
def testDecisionTreeModel(test_df,grid_search):
    Ytest = test_df["label"]
    Ypred = grid_search.predict(test_df.drop("label",axis=1))
    acc=accuracy_score(Ytest,Ypred)
    print('Accuracy of Decision Tree: {:.4f}'.format(acc))  
    print(classification_report(Ytest,Ypred))
    print(confusion_matrix(Ytest, Ypred))

# Main Function

In [11]:
def main(network_data): 
    clf = DecisionTreeClassifier(random_state = 42)
    cross_val_df, test_df = prep_training(network_data)
    parameter_grid = creatingParamterGrid()
    pprint(parameter_grid)
    grid_search = prepToFindOptimalHyperParams(clf,parameter_grid)
    train(cross_val_df,grid_search)
    testDecisionTreeModel(test_df,grid_search)
    print(grid_search.best_params_)
    print(grid_search.best_score_)
    return grid_search

## Classification with DT without Preprocessing
We also drop the `attack_cat` column before training: all non-attack records share a single encoded value in it, so the column would effectively reveal the binary `label`.

In [12]:
network_data = load_data('https://raw.githubusercontent.com/divyaKh/CMPE255Project/main/2.Data_Cleaning/cleaned_dataset_label_encoding.csv')

In [13]:
if 'attack_cat' in list(network_data):
    network_data = network_data.drop("attack_cat",axis=1)
grid_search = main(network_data)

{'criterion': ['gini', 'entropy'], 'max_depth': [2, 4, 6]}
Accuracy of Decision Tree: 0.9197
              precision    recall  f1-score   support

           0       0.91      0.86      0.89     18675
           1       0.92      0.95      0.94     32860

    accuracy                           0.92     51535
   macro avg       0.92      0.91      0.91     51535
weighted avg       0.92      0.92      0.92     51535

[[16041  2634]
 [ 1502 31358]]
{'criterion': 'gini', 'max_depth': 6}
0.9194956803083277
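
The figures in the report above can be re-derived from the confusion matrix. The sketch below copies the matrix from the output and recomputes accuracy, precision, recall and F1 per class, assuming scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix reported above (rows = true class, columns = predicted class).
cm = np.array([[16041, 2634],
               [1502, 31358]])

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)   # correct / all predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)      # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.4f}")
print("precision =", precision.round(2))
print("recall    =", recall.round(2))
print("f1        =", f1.round(2))
```

The row sums (18675 and 32860) match the `support` column of the classification report.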


## Classification with DT after MinMax Scaling 

In [14]:
network_data1 = load_data('../input/dataset_minmax.csv')

In [15]:
network_data1

| | # dur | proto | service | state | spkts | dpkts | sbytes | dbytes | rate | sttl | ... | ct_src_dport_ltm | ct_dst_sport_ltm | ct_dst_src_ltm | is_ftp_login | ct_ftp_cmd | ct_flw_http_mthd | ct_src_ltm | ct_srv_dst | is_sm_ips_ports | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.833334e-07 | 0.901515 | 0.000000 | 0.5 | 0.000094 | 0.000000 | 0.000033 | 0.000000 | 0.090909 | 0.996078 | ... | 0.000000 | 0.000000 | 0.015625 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.016393 | 0.0 | 0 |
| 1 | 1.333334e-07 | 0.901515 | 0.000000 | 0.5 | 0.000094 | 0.000000 | 0.000121 | 0.000000 | 0.125000 | 0.996078 | ... | 0.000000 | 0.000000 | 0.015625 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.016393 | 0.0 | 0 |
| 2 | 8.333335e-08 | 0.901515 | 0.000000 | 0.5 | 0.000094 | 0.000000 | 0.000073 | 0.000000 | 0.200000 | 0.996078 | ... | 0.000000 | 0.000000 | 0.031250 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.032787 | 0.0 | 0 |
| 3 | 1.000000e-07 | 0.901515 | 0.000000 | 0.5 | 0.000094 | 0.000000 | 0.000061 | 0.000000 | 0.166667 | 0.996078 | ... | 0.017241 | 0.000000 | 0.031250 | 0.0 | 0.0 | 0.0 | 0.016949 | 0.032787 | 0.0 | 0 |
| 4 | 1.666667e-07 | 0.901515 | 0.000000 | 0.5 | 0.000094 | 0.000000 | 0.000146 | 0.000000 | 0.100000 | 0.996078 | ... | 0.017241 | 0.000000 | 0.031250 | 0.0 | 0.0 | 0.0 | 0.016949 | 0.032787 | 0.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 257668 | 1.500000e-07 | 0.901515 | 0.166667 | 0.5 | 0.000094 | 0.000000 | 0.000006 | 0.000000 | 0.111111 | 0.996078 | ... | 0.396552 | 0.266667 | 0.359375 | 0.0 | 0.0 | 0.0 | 0.389831 | 0.377049 | 0.0 | 1 |
| 257669 | 8.429368e-03 | 0.856061 | 0.000000 | 0.4 | 0.000845 | 0.000726 | 0.000042 | 0.000024 | 0.000034 | 0.996078 | ... | 0.000000 | 0.000000 | 0.015625 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 1 |
| 257670 | 1.500000e-07 | 0.901515 | 0.166667 | 0.5 | 0.000094 | 0.000000 | 0.000006 | 0.000000 | 0.111111 | 0.996078 | ... | 0.034483 | 0.044444 | 0.187500 | 0.0 | 0.0 | 0.0 | 0.033898 | 0.180328 | 0.0 | 1 |
| 257671 | 1.500000e-07 | 0.901515 | 0.166667 | 0.5 | 0.000094 | 0.000000 | 0.000006 | 0.000000 | 0.111111 | 0.996078 | ... | 0.500000 | 0.288889 | 0.453125 | 0.0 | 0.0 | 0.0 | 0.491525 | 0.475410 | 0.0 | 1 |


In [16]:
grid_search1= main(network_data1)

{'criterion': ['gini', 'entropy'], 'max_depth': [2, 4, 6]}
Accuracy of Decision Tree: 0.9197
              precision    recall  f1-score   support

           0       0.91      0.86      0.89     18675
           1       0.92      0.95      0.94     32860

    accuracy                           0.92     51535
   macro avg       0.92      0.91      0.91     51535
weighted avg       0.92      0.92      0.92     51535

[[16041  2634]
 [ 1502 31358]]
{'criterion': 'gini', 'max_depth': 6}
0.9194956802377273
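
The test metrics here are identical to the run on the unscaled data. One plausible reason is that tree splits depend only on the ordering of values within each feature, which MinMax scaling preserves. The sketch below illustrates this on synthetic data (an assumption, not the project dataset) by comparing the predictions of a tree trained on raw features with one trained on scaled features.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical), not the network dataset.
X_raw, y_demo = make_classification(n_samples=300, n_features=5, random_state=42)
X_scaled = MinMaxScaler().fit_transform(X_raw)

# Same model settings; the only difference is the scaling of the inputs.
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_raw, y_demo)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_scaled, y_demo)

# Share of samples on which both trees make the same prediction.
agreement = (tree_raw.predict(X_raw) == tree_scaled.predict(X_scaled)).mean()
print(f"Share of identical predictions: {agreement:.3f}")
```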


In [17]:
pd.DataFrame(grid_search1.cv_results_).head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_criterion | param_max_depth | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.69696 | 0.165927 | 0.020021 | 0.004645 | gini | 2 | {'criterion': 'gini', 'max_depth': 2} | 0.871233 | 0.871349 | 0.87222 | 0.871601 | 0.000441 | 5 |
| 1 | 0.885384 | 0.07258 | 0.025222 | 0.011971 | gini | 4 | {'criterion': 'gini', 'max_depth': 4} | 0.908111 | 0.911181 | 0.909477 | 0.90959 | 0.001256 | 2 |
| 2 | 1.217615 | 0.179636 | 0.017048 | 0.000152 | gini | 6 | {'criterion': 'gini', 'max_depth': 6} | 0.91984 | 0.918618 | 0.920029 | 0.919496 | 0.000625 | 1 |
| 3 | 0.525515 | 0.019298 | 0.016673 | 0.000573 | entropy | 2 | {'criterion': 'entropy', 'max_depth': 2} | 0.87084 | 0.870985 | 0.872148 | 0.871324 | 0.000585 | 6 |
| 4 | 0.857665 | 0.018556 | 0.016825 | 0.000286 | entropy | 4 | {'criterion': 'entropy', 'max_depth': 4} | 0.885247 | 0.887605 | 0.885944 | 0.886266 | 0.000989 | 4 |


## Classification with DT after MinMax Scaling + Correlation analysis

In [18]:
network_data2 = load_data('../input/dataset_minmax_corr.csv')

In [19]:
grid_search2 = main(network_data2)

{'criterion': ['gini', 'entropy'], 'max_depth': [2, 4, 6]}
Accuracy of Decision Tree: 0.9069
              precision    recall  f1-score   support

           0       0.98      0.76      0.86     18675
           1       0.88      0.99      0.93     32860

    accuracy                           0.91     51535
   macro avg       0.93      0.88      0.89     51535
weighted avg       0.91      0.91      0.90     51535

[[14199  4476]
 [  322 32538]]
{'criterion': 'gini', 'max_depth': 6}
0.9064752759828201


In [20]:
pd.DataFrame(grid_search2.cv_results_).head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_criterion | param_max_depth | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.440263 | 0.017609 | 0.015185 | 0.000526 | gini | 2 | {'criterion': 'gini', 'max_depth': 2} | 0.871233 | 0.871349 | 0.87222 | 0.871601 | 0.000441 | 5 |
| 1 | 0.720194 | 0.028975 | 0.015187 | 0.000232 | gini | 4 | {'criterion': 'gini', 'max_depth': 4} | 0.895173 | 0.895769 | 0.895506 | 0.895483 | 0.000244 | 2 |
| 2 | 0.930846 | 0.023371 | 0.01538 | 0.000267 | gini | 6 | {'criterion': 'gini', 'max_depth': 6} | 0.906582 | 0.905928 | 0.906916 | 0.906475 | 0.000411 | 1 |
| 3 | 0.448998 | 0.006971 | 0.014608 | 0.000352 | entropy | 2 | {'criterion': 'entropy', 'max_depth': 2} | 0.87084 | 0.870985 | 0.872148 | 0.871324 | 0.000585 | 6 |
| 4 | 0.724712 | 0.009935 | 0.014928 | 0.000119 | entropy | 4 | {'criterion': 'entropy', 'max_depth': 4} | 0.873401 | 0.873576 | 0.874039 | 0.873672 | 0.000269 | 4 |


## Classification with DT after Dimension Reduction (using PCA)

In [21]:
network_data3 = load_data('../input/dataset_pca.csv')

In [22]:
grid_search3 = main(network_data3)

{'criterion': ['gini', 'entropy'], 'max_depth': [2, 4, 6]}
Accuracy of Decision Tree: 0.8954
              precision    recall  f1-score   support

           0       0.99      0.72      0.83     18675
           1       0.86      1.00      0.92     32860

    accuracy                           0.90     51535
   macro avg       0.93      0.86      0.88     51535
weighted avg       0.91      0.90      0.89     51535

[[13397  5278]
 [  111 32749]]
{'criterion': 'gini', 'max_depth': 6}
0.8938332552800577


In [23]:
pd.DataFrame(grid_search3.cv_results_).head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_criterion | param_max_depth | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.19454 | 0.110614 | 0.021008 | 0.007389 | gini | 2 | {'criterion': 'gini', 'max_depth': 2} | 0.863214 | 0.864247 | 0.865817 | 0.864426 | 0.00107 | 5 |
| 1 | 2.025503 | 0.011894 | 0.016883 | 0.000319 | gini | 4 | {'criterion': 'gini', 'max_depth': 4} | 0.881784 | 0.881245 | 0.881389 | 0.881473 | 0.000228 | 3 |
| 2 | 3.12477 | 0.28146 | 0.021821 | 0.002736 | gini | 6 | {'criterion': 'gini', 'max_depth': 6} | 0.894241 | 0.893848 | 0.89341 | 0.893833 | 0.000339 | 1 |
| 3 | 1.630528 | 0.205633 | 0.020637 | 0.005496 | entropy | 2 | {'criterion': 'entropy', 'max_depth': 2} | 0.843247 | 0.843028 | 0.840203 | 0.842159 | 0.001386 | 6 |
| 4 | 2.565923 | 0.015969 | 0.018406 | 0.003047 | entropy | 4 | {'criterion': 'entropy', 'max_depth': 4} | 0.876181 | 0.876137 | 0.876499 | 0.876272 | 0.000161 | 4 |
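
To compare the four runs side by side, the sketch below collects the test accuracy and best cross-validation score printed in the outputs above into one table, keyed by input file name. The numbers are copied from this notebook; the column names of the summary table are chosen here for illustration.

```python
import pandas as pd

# Test accuracy and best cross-validation score copied from the outputs above.
summary = pd.DataFrame(
    [
        ("cleaned_dataset_label_encoding.csv", 0.9197, 0.9194956803083277),
        ("dataset_minmax.csv", 0.9197, 0.9194956802377273),
        ("dataset_minmax_corr.csv", 0.9069, 0.9064752759828201),
        ("dataset_pca.csv", 0.8954, 0.8938332552800577),
    ],
    columns=["input_file", "test_accuracy", "best_cv_score"],
)

# Accuracy lost relative to the run without preprocessing (first row).
baseline = summary.loc[0, "test_accuracy"]
summary["drop_vs_baseline"] = (baseline - summary["test_accuracy"]).round(4)

print(summary.to_string(index=False))
```

In every run the selected hyperparameters were `{'criterion': 'gini', 'max_depth': 6}`, so the differences in the table come from the input features rather than from the tree configuration.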
