# Lab 4, Exercise 2

## Instructions
The goal of this exercise is to build a simple machine learning pipeline for a multiclass problem (one with more than two classes). Much of the data preprocessing has already been done, so the focus is on loading data, training a model, running inference, and analyzing the results.

In [1]:
import numpy as np
import pandas as pd

## Load the data

Here are the first two rows of the dataset:

| Source IP    |  Source Port |  Destination IP   |  Destination Port |  Protocol |  Flow Duration |  Flow Bytes/s |  Flow Packets/s |  Flow IAT Mean |  Flow IAT Std |  Flow IAT Max |  Flow IAT Min | Fwd IAT Mean |  Fwd IAT Std |  Fwd IAT Max |  Fwd IAT Min | Bwd IAT Mean |  Bwd IAT Std |  Bwd IAT Max |  Bwd IAT Min | Active Mean |  Active Std |  Active Max |  Active Min | Idle Mean |  Idle Std |  Idle Max |  Idle Min | label |
|--------------|--------------|-------------------|-------------------|-----------|----------------|---------------|-----------------|----------------|---------------|---------------|---------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|-------------|-------------|-------------|-------------|-----------|-----------|-----------|-----------|-------|
| 10\.0\.2\.15 | 57188        | 82\.161\.239\.177 | 110               | 6         | 7248168        | 21126\.02798  | 29\.11080428    | 34515\.08571   | 273869\.2625  | 3897923       | 5             | 89483\.55556 | 437167\.5917 | 3898126      | 29           | 56614\.03906 | 349855\.1098 | 3898131      | 7            | 0           | 0           | 0           | 0           | 0         | 0         | 0         | 0         | AUDIO |
| 10\.0\.2\.15 | 57188        | 82\.161\.239\.177 | 110               | 6         | 5157723        | 1052\.790156  | 3\.683796125    | 286540\.1667   | 878838\.5256  | 3743359       | 135           | 644715\.375  | 1272066\.058 | 3743562      | 509          | 568901\.6667 | 1209110\.287 | 3743573      | 451          | 0           | 0           | 0           | 0           | 0         | 0         | 0         | 0         | AUDIO |



In [2]:
# Import CSV data as a Pandas dataframe
# The data is in 'data/exercise2/TOR_TimeBasedFeatures-10s-Layer2.csv'

# CODE HERE

tor_df = pd.read_csv('data/exercise2/TOR_TimeBasedFeatures-10s-Layer2.csv', delimiter=' *, *', engine='python')
# Create data and labels that can be used by sklearn's 'train_test_split'
# Create the labels

# CODE HERE
labels = tor_df['label']

# Create the data
# -Keep just the numeric features (i.e., those features between 'Flow Duration' and 'Idle Min')
# -Make sure not to keep the labels

# CODE HERE
data = tor_df.loc[:, 'Flow Duration':'Idle Min']

# You should now have data and labels that can be used by sklearn's 'train_test_split'
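Before splitting, it can help to confirm that the selected columns are numeric, that the label column was left out, and that no row contains NaN or infinite values. The sketch below uses a small made-up frame (`demo_df` and its values are hypothetical stand-ins for `tor_df`); whether the real file contains any such rows is not stated in this exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for tor_df; the values are invented for illustration.
demo_df = pd.DataFrame({
    'Source IP': ['10.0.2.15', '10.0.2.15', '10.0.2.15'],
    'Flow Duration': [100, 200, 300],
    'Flow Bytes/s': [2.5, np.inf, 7.5],
    'Idle Min': [0, 0, 0],
    'label': ['AUDIO', 'CHAT', 'AUDIO'],
})

demo_labels = demo_df['label']
# Label-based slicing is inclusive on both ends.
demo_data = demo_df.loc[:, 'Flow Duration':'Idle Min']

# Every selected column should be numeric, and 'label' should be excluded.
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in demo_data.dtypes)
has_label = 'label' in demo_data.columns

# Flag rows that contain NaN or +/-inf, then drop them from data and labels together.
bad_rows = ~np.isfinite(demo_data.to_numpy(dtype=float)).all(axis=1)
clean_data = demo_data[~bad_rows]
clean_labels = demo_labels[~bad_rows]

print(all_numeric, has_label, int(bad_rows.sum()), clean_data.shape)
```

Filtering data and labels with the same mask keeps the two aligned, which `train_test_split` requires.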

## Create a single train/test split for experimentation

In [3]:
# Randomly pick 50% of the data for the training set, and keep the remaining 50% for the test set
# Use sklearn's 'train_test_split'
# CODE HERE

from sklearn.model_selection import train_test_split

train_data, test_data, train_labels, test_labels = train_test_split(data, labels, test_size=0.5)
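The split above is purely random, so class proportions can differ between the two halves and the result changes on every run. A possible variant, sketched here on made-up data (`demo_data` and `demo_labels` are hypothetical), passes `stratify` and `random_state` to `train_test_split`. This is an optional refinement, not what the exercise asks for.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 VOIP rows and 20 CHAT rows.
demo_labels = pd.Series(['VOIP'] * 80 + ['CHAT'] * 20)
demo_data = pd.DataFrame({'feature': np.arange(100)})

tr_data, te_data, tr_labels, te_labels = train_test_split(
    demo_data, demo_labels,
    test_size=0.5,
    stratify=demo_labels,  # keep the class proportions the same in both halves
    random_state=0,        # fix the shuffle so the split is reproducible
)

print(tr_labels.value_counts().to_dict())
print(te_labels.value_counts().to_dict())
```

Note that fixing `random_state` on the real data would produce a different split from the one behind the results reported in this notebook.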


## Train a classifier

In [4]:
# Train a random forest classifier using default hyperparameters
# Hint: Not counting any import statements, this can be done in a single line of code
# CODE HERE

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
model = rf.fit(train_data, train_labels)



## Test the classifier on the test set

In [5]:
# Predict the labels on the test set

# CODE HERE

pred_labels = model.predict(test_data)

# Use accuracy and a confusion matrix to measure performance
# Hint: Use sklearn's built-in metrics

# CODE HERE

from sklearn.metrics import confusion_matrix, accuracy_score

def print_results(test_labels, pred_labels):
    
    print('Confusion Matrix:')
    conf_mat = confusion_matrix(test_labels, pred_labels)
    print(conf_mat, '\n')
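    print('Confusion Matrix Percentages with Labels:')
    # apply() works column by column by default, so each predicted-class column sums to 100%
    # (the diagonal is the share of predictions for that class that are correct)
    print(pd.crosstab(test_labels, pred_labels, rownames=['True'], colnames=['Predicted']).apply(lambda r: 100.0 * r/r.sum()), '\n')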

    accuracy = accuracy_score(test_labels, pred_labels)
    print('accuracy =', accuracy)
    
print_results(test_labels, pred_labels)

Confusion Matrix:
[[ 264   69    3    2    5    0   12    1]
 [  45  653   31    6   17    6   41    3]
 [   7   96   39    1    2    1    5    2]
 [   7   14    4  384   12    0   20    3]
 [   5   50    2    5   44    0   24    0]
 [   2    6    1    0    0  520    8    1]
 [  19  102    9   25    9   11  246    3]
 [   7   15    5    2    2    1    3 1140]] 

Confusion Matrix Percentages with Labels:
Predicted          AUDIO   BROWSING       CHAT  FILE-TRANSFER       MAIL  \
True                                                                       
AUDIO          74.157303   6.865672   3.191489       0.470588   5.494505   
BROWSING       12.640449  64.975124  32.978723       1.411765  18.681319   
CHAT            1.966292   9.552239  41.489362       0.235294   2.197802   
FILE-TRANSFER   1.966292   1.393035   4.255319      90.352941  13.186813   
MAIL            1.404494   4.975124   2.127660       1.176471  48.351648   
P2P             0.561798   0.597015   1.063830       0.000000

In [9]:
# Determine important features

# CODE HERE

from pprint import pprint

print('Important Features:')
feature_importances_df = pd.DataFrame(model.feature_importances_, index=test_data.columns, columns=['Importance'])
feature_importances_df.sort_values('Importance', ascending=False)


Important Features:


| Feature        | Importance |
|----------------|------------|
| Flow Packets/s | 0.098725   |
| Flow IAT Max   | 0.090705   |
| Flow Bytes/s   | 0.085834   |
| Bwd IAT Mean   | 0.076371   |
| Bwd IAT Max    | 0.074899   |
| Bwd IAT Min    | 0.072702   |
| Fwd IAT Min    | 0.06738    |
| Flow IAT Mean  | 0.064817   |
| Fwd IAT Max    | 0.063509   |
| Fwd IAT Mean   | 0.059337   |


## Questions

1) What is the overall accuracy using the default parameters?  

Accuracy: 0.8180

2) What is the confusion matrix for the tested approach?  What are the classes where the model performs well?  What are the classes where the model performs poorly?

Confusion Matrix:

```
[[ 264   69    3    2    5    0   12    1]
 [  45  653   31    6   17    6   41    3]
 [   7   96   39    1    2    1    5    2]
 [   7   14    4  384   12    0   20    3]
 [   5   50    2    5   44    0   24    0]
 [   2    6    1    0    0  520    8    1]
 [  19  102    9   25    9   11  246    3]
 [   7   15    5    2    2    1    3 1140]] 
 ```
 
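The model performs well on FILE-TRANSFER, P2P, and VOIP: for each of these classes, at least 86% of the true samples are recovered and at least 90% of the predictions are correct.
The model performs poorly on AUDIO, BROWSING, CHAT, MAIL, and VIDEO. CHAT and MAIL are the weakest: only 39 of 153 CHAT samples and 44 of 130 MAIL samples are classified correctly, and both are labeled BROWSING more often than their own class.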
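The per-class picture can be recomputed directly from the confusion matrix above. The sketch below is one way to do it with NumPy. It assumes the rows and columns follow sklearn's sorted label order (AUDIO through VOIP), and the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true class, columns = predicted class).
conf_mat = np.array([
    [264,  69,  3,   2,  5,   0,  12,    1],
    [ 45, 653, 31,   6, 17,   6,  41,    3],
    [  7,  96, 39,   1,  2,   1,   5,    2],
    [  7,  14,  4, 384, 12,   0,  20,    3],
    [  5,  50,  2,   5, 44,   0,  24,    0],
    [  2,   6,  1,   0,  0, 520,   8,    1],
    [ 19, 102,  9,  25,  9,  11, 246,    3],
    [  7,  15,  5,   2,  2,   1,   3, 1140],
])

# Assumed class order: sklearn sorts the labels, so the order is alphabetical.
classes = ['AUDIO', 'BROWSING', 'CHAT', 'FILE-TRANSFER', 'MAIL', 'P2P', 'VIDEO', 'VOIP']

correct = np.diag(conf_mat)
recall = correct / conf_mat.sum(axis=1)     # share of each true class that was found
precision = correct / conf_mat.sum(axis=0)  # share of each predicted class that was right
accuracy = correct.sum() / conf_mat.sum()

for name, r, p in zip(classes, recall, precision):
    print(f'{name:<14} recall={r:.3f} precision={p:.3f}')
print(f'accuracy={accuracy:.4f}')
```

The percentages printed by `print_results` are normalized per column, so they correspond to the `precision` values here. Recall gives the complementary row-wise view.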
 
3) What are the top 5 most important features?

1. Flow Packets/s
2. Flow IAT Max
3. Flow Bytes/s
4. Bwd IAT Mean
5. Bwd IAT Max
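The top 5 can also be pulled out programmatically rather than read off the sorted table. As a minimal sketch, the ten importances shown above are re-entered by hand into a Series (an assumption made only to keep the example self-contained). On the notebook's own frame, `feature_importances_df.nlargest(5, 'Importance')` would be the equivalent call.

```python
import pandas as pd

# Importances copied from the table above (only the ten rows shown there), in arbitrary order.
importances = pd.Series({
    'Bwd IAT Max': 0.074899,
    'Bwd IAT Mean': 0.076371,
    'Bwd IAT Min': 0.072702,
    'Flow Bytes/s': 0.085834,
    'Flow IAT Max': 0.090705,
    'Flow IAT Mean': 0.064817,
    'Flow Packets/s': 0.098725,
    'Fwd IAT Max': 0.063509,
    'Fwd IAT Mean': 0.059337,
    'Fwd IAT Min': 0.06738,
}, name='Importance')

# nlargest returns the k largest values, sorted in descending order.
top5 = importances.nlargest(5)
print(top5)
```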

4) What hyperparameters could you tune in the random forest to improve performance? What is the best accuracy you can attain?

<!-- https://medium.com/all-things-ai/in-depth-parameter-tuning-for-random-forest-d67bb7e920d -->
<!-- https://medium.com/@taplapinger/tuning-a-random-forest-classifier-1b252d1dde92 -->
<!-- https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/ -->

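Hyperparameters worth tuning include n_estimators, max_features, max_depth, min_samples_split, and min_samples_leaf.

Best setting tried: n_estimators=1000, max_features=10
Accuracy: 0.8381
(Tuned model below)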

5) Bonus: How would you improve the pipeline above to automatically tune the hyperparameters?  How would you improve the pipeline to use multiple train/test splits?
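One possible approach to the bonus question is a grid search combined with k-fold cross-validation, which covers both automatic tuning and multiple train/validation splits. The sketch below uses sklearn's `GridSearchCV` and `StratifiedKFold` on synthetic data from `make_classification`. The data, the grid values, and the fold count are all assumptions chosen to keep the example small, not recommendations for the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the exercise data (3 classes, 8 numeric features).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Small illustrative grid; every combination is tried.
param_grid = {
    'n_estimators': [10, 50],
    'max_features': [2, 4],
}

# Each candidate is scored on 3 stratified train/validation splits.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring='accuracy', cv=cv)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 4))
```

In the real pipeline, the search would be fitted on `train_data` and `train_labels` only, so that the test set still gives an unbiased estimate for the selected model.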

In [13]:
pprint(rf.get_params())

{'bootstrap': True,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 10,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}


In [14]:
tuned_rf = RandomForestClassifier(n_estimators=1000, max_features=10, random_state=0)
tuned_model = tuned_rf.fit(train_data, train_labels)
tuned_pred_labels = tuned_model.predict(test_data)
print_results(test_labels, tuned_pred_labels)

Confusion Matrix:
[[ 263   70    3    3    3    1   12    1]
 [  33  665   28    5   11    3   55    2]
 [   1   97   43    2    1    1    6    2]
 [   4   22    1  388   10    0   17    2]
 [   4   40    3    7   54    1   21    0]
 [   3    6    0    1    0  519    9    0]
 [  13   77    4   17    4   10  297    2]
 [   4   17    5    1    0    2    4 1142]] 

Confusion Matrix Percentages with Labels:
Predicted          AUDIO   BROWSING       CHAT  FILE-TRANSFER       MAIL  \
True                                                                       
AUDIO          80.923077   7.042254   3.448276       0.707547   3.614458   
BROWSING       10.153846  66.901408  32.183908       1.179245  13.253012   
CHAT            0.307692   9.758551  49.425287       0.471698   1.204819   
FILE-TRANSFER   1.230769   2.213280   1.149425      91.509434  12.048193   
MAIL            1.230769   4.024145   3.448276       1.650943  65.060241   
P2P             0.923077   0.603622   0.000000       0.235849