# Lab 4, Exercise 2

## Instructions
The goal of this exercise is to build a straightforward machine learning pipeline for a multiclass classification problem (more than two classes). Much of the data preprocessing has already been done, so the main focus is on becoming familiar with loading data, training a model, running inference, and analyzing the results.

In [7]:
import numpy as np
import pandas as pd

## Load the data

Here are the first two rows of the dataset:

| Source IP    |  Source Port |  Destination IP   |  Destination Port |  Protocol |  Flow Duration |  Flow Bytes/s |  Flow Packets/s |  Flow IAT Mean |  Flow IAT Std |  Flow IAT Max |  Flow IAT Min | Fwd IAT Mean |  Fwd IAT Std |  Fwd IAT Max |  Fwd IAT Min | Bwd IAT Mean |  Bwd IAT Std |  Bwd IAT Max |  Bwd IAT Min | Active Mean |  Active Std |  Active Max |  Active Min | Idle Mean |  Idle Std |  Idle Max |  Idle Min | label |
|--------------|--------------|-------------------|-------------------|-----------|----------------|---------------|-----------------|----------------|---------------|---------------|---------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|-------------|-------------|-------------|-------------|-----------|-----------|-----------|-----------|-------|
| 10\.0\.2\.15 | 57188        | 82\.161\.239\.177 | 110               | 6         | 7248168        | 21126\.02798  | 29\.11080428    | 34515\.08571   | 273869\.2625  | 3897923       | 5             | 89483\.55556 | 437167\.5917 | 3898126      | 29           | 56614\.03906 | 349855\.1098 | 3898131      | 7            | 0           | 0           | 0           | 0           | 0         | 0         | 0         | 0         | AUDIO |
| 10\.0\.2\.15 | 57188        | 82\.161\.239\.177 | 110               | 6         | 5157723        | 1052\.790156  | 3\.683796125    | 286540\.1667   | 878838\.5256  | 3743359       | 135           | 644715\.375  | 1272066\.058 | 3743562      | 509          | 568901\.6667 | 1209110\.287 | 3743573      | 451          | 0           | 0           | 0           | 0           | 0         | 0         | 0         | 0         | AUDIO |



In [57]:
# Import CSV data as a Pandas dataframe
# The data is in 'data/exercise2/TOR_TimeBasedFeatures-10s-Layer2.csv'

# CODE HERE

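# Treat a comma plus any surrounding spaces as the delimiter, so padding around column names and values is dropped
tor_df = pd.read_csv('data/exercise2/TOR_TimeBasedFeatures-10s-Layer2.csv', delimiter=r' *, *', engine='python')
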
# Create data and labels that can be used by sklearn's 'train_test_split'
# Create the labels

# CODE HERE
labels = tor_df['label']

# Create the data
# -Keep just the numeric features (i.e., those features between 'Flow Duration' and 'Idle Min')
# -Make sure not to keep the labels

# CODE HERE
data = tor_df.loc[:, 'Flow Duration':'Idle Min']

# You should now have data and labels that can be used by sklearn's 'train_test_split'

## Create a single train/test split for experimentation

In [58]:
# Randomly pick 50% of the data for the training set, and keep the remaining 50% for the test set
# Use sklearn's 'train_test_split'
# CODE HERE

from sklearn.model_selection import train_test_split

train_data, test_data, train_labels, test_labels = train_test_split(data, labels, test_size=0.5)
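
Because `train_test_split` shuffles randomly, each run of the call above produces a different split, and the metrics later in the notebook change with it. The following is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for this lab. The `random_state` and `stratify` arguments are additions that the exercise does not ask for, and `toy_data`, `toy_labels`, and `split_half` are hypothetical names used only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 8 rows, 2 classes (not the TOR dataset)
toy_data = pd.DataFrame({'Flow Duration': range(8), 'Idle Min': range(8, 16)})
toy_labels = pd.Series(['AUDIO', 'VOIP'] * 4, name='label')

def split_half(data, labels, seed=0):
    # random_state fixes the shuffle; stratify keeps the class proportions
    # the same in the training half and the test half
    return train_test_split(data, labels, test_size=0.5,
                            random_state=seed, stratify=labels)

train_a, test_a, y_train_a, y_test_a = split_half(toy_data, toy_labels)
train_b, test_b, y_train_b, y_test_b = split_half(toy_data, toy_labels)

# The same seed selects the same rows both times
print(train_a.index.equals(train_b.index))
# Class counts in the training half
print(y_train_a.value_counts().to_dict())
```

Stratifying matters most for the rarer classes, which an unlucky random split could leave under-represented in one of the halves.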


## Train a classifier

In [80]:
# Train a random forest classifier using default hyperparameters
# Hint: Not counting any import statements, this can be done in a single line of code
# CODE HERE

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
model = rf.fit(train_data, train_labels)



## Test the classifier on the test set

In [89]:
# Predict the labels on the test set

# CODE HERE

pred_labels = model.predict(test_data)

# Use accuracy and a confusion matrix to measure performance
# Hint: Use sklearn's built-in metrics

# CODE HERE

from sklearn.metrics import confusion_matrix, accuracy_score

def print_results(test_labels, pred_labels):
    """Print the confusion matrix, a percentage cross-tabulation, and the accuracy."""
    print('Confusion Matrix:')
    conf_mat = confusion_matrix(test_labels, pred_labels)
    print(conf_mat, '\n')

    # apply() runs column by column, so each predicted-class column sums to 100%
    print('Confusion Matrix Percentages with Labels:')
    counts = pd.crosstab(test_labels, pred_labels, rownames=['True'], colnames=['Predicted'])
    print(counts.apply(lambda r: 100.0 * r / r.sum()), '\n')

    accuracy = accuracy_score(test_labels, pred_labels)
    print('accuracy =', accuracy)

print_results(test_labels, pred_labels)

Confusion Matrix:
[[ 274   73    4    5    2    0    4    4]
 [  67  611   31   11   14    1   58    2]
 [  11   72   57    1    1    1    8    3]
 [   5   30    2  348    5    1   17    1]
 [   3   45    4    7   57    0   19    1]
 [   6    6    1    0    1  536    3    2]
 [  22   91    8   25   11   12  268    4]
 [   8   21    8    6    2    4    3 1114]] 

Confusion Matrix Percentages with Labels:
Predicted          AUDIO   BROWSING       CHAT  FILE-TRANSFER       MAIL  \
True                                                                       
AUDIO          69.191919   7.692308   3.478261       1.240695   2.150538   
BROWSING       16.919192  64.383562  26.956522       2.729529  15.053763   
CHAT            2.777778   7.586934  49.565217       0.248139   1.075269   
FILE-TRANSFER   1.262626   3.161222   1.739130      86.352357   5.376344   
MAIL            0.757576   4.741834   3.478261       1.736973  61.290323   
P2P             1.515152   0.632244   0.869565       0.000000

In [108]:
# Determine important features

# CODE HERE

from pprint import pprint

feature_importances = model.feature_importances_

print('Important Features:')
# Pair each feature name with its importance and sort from most to least important
important_feats = sorted(zip(test_data.columns, feature_importances), key=lambda feat: feat[1], reverse=True)
pprint(['Feature: {:20} Importance: {}'.format(*pair) for pair in important_feats])

Important Features:
['Feature: Flow Bytes/s         Importance: 0.10310492081883266',
 'Feature: Flow IAT Mean        Importance: 0.0876804243452377',
 'Feature: Fwd IAT Min          Importance: 0.08213329271746557',
 'Feature: Bwd IAT Max          Importance: 0.08101557963743383',
 'Feature: Fwd IAT Std          Importance: 0.07260585692758294',
 'Feature: Bwd IAT Mean         Importance: 0.06855163514908563',
 'Feature: Flow Duration        Importance: 0.06477037590870895',
 'Feature: Fwd IAT Max          Importance: 0.062401548688102944',
 'Feature: Flow Packets/s       Importance: 0.05992160889734805',
 'Feature: Fwd IAT Mean         Importance: 0.05829874301265863',
 'Feature: Flow IAT Std         Importance: 0.0539892778004704',
 'Feature: Bwd IAT Min          Importance: 0.053109622896184365',
 'Feature: Flow IAT Max         Importance: 0.05101987425716844',
 'Feature: Flow IAT Min         Importance: 0.04677616471049703',
 'Feature: Bwd IAT Std          Importance: 0.0340830708

Questions:

1) What is the overall accuracy using the default parameters?  

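Accuracy: 0.7993

Neither the train/test split nor the forest is seeded, so results vary from run to run. The answers in this section come from a different run than the output printed above, whose confusion matrix corresponds to an accuracy of about 0.812.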

2) What is the confusion matrix for the tested approach?  What are the classes where the model performs well?  What are the classes where the model performs poorly?

Confusion Matrix:

```
[[ 259   92    3    2    3    2   13    1]
 [  60  622   29    2   17    2   63    1]
 [   3   99   45    2    1    0    5    2]
 [   5   35    1  379   10    0   17    4]
 [   5   56    6   11   52    0   18    0]
 [   6    8    1    1    0  519    7    1]
 [  30   93    5   19   12   11  259    3]
 [   4   23    5    1    0    3    4 1080]]
 ```
 
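Judging by per-class recall (the diagonal entry divided by its row total), the model performs well on VOIP (96%), P2P (96%), and FILE-TRANSFER (84%).

The model performs poorly on CHAT (29%), MAIL (35%), VIDEO (60%), and AUDIO (69%). BROWSING reaches 78% recall, but it is the most common wrong prediction for every other class, so its precision is low (about 61%).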
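
The per-class judgments above can be checked by normalizing the confusion matrix. The sketch below copies the matrix from this answer and computes recall (row-normalized) and precision (column-normalized). It assumes the rows and columns follow sorted label order, which is how sklearn's `confusion_matrix` arranges them by default; the `class_names` list is written out by hand under that assumption.

```python
import numpy as np

# Confusion matrix copied from the answer above
# (rows = true class, columns = predicted class)
conf_mat = np.array([
    [259,  92,  3,   2,  3,   2,  13,    1],
    [ 60, 622, 29,   2, 17,   2,  63,    1],
    [  3,  99, 45,   2,  1,   0,   5,    2],
    [  5,  35,  1, 379, 10,   0,  17,    4],
    [  5,  56,  6,  11, 52,   0,  18,    0],
    [  6,   8,  1,   1,  0, 519,   7,    1],
    [ 30,  93,  5,  19, 12,  11, 259,    3],
    [  4,  23,  5,   1,  0,   3,   4, 1080],
])

# Assumption: classes appear in sorted order
class_names = ['AUDIO', 'BROWSING', 'CHAT', 'FILE-TRANSFER',
               'MAIL', 'P2P', 'VIDEO', 'VOIP']

# Recall: share of each true class that was predicted correctly
recall = conf_mat.diagonal() / conf_mat.sum(axis=1)
# Precision: share of each predicted class that was actually correct
precision = conf_mat.diagonal() / conf_mat.sum(axis=0)
# Accuracy: correct predictions over all predictions
accuracy = conf_mat.diagonal().sum() / conf_mat.sum()

for name, r, p in zip(class_names, recall, precision):
    print('{:15} recall = {:.2f}  precision = {:.2f}'.format(name, r, p))
print('accuracy = {:.3f}'.format(accuracy))
```

Reporting both quantities separates two failure modes: a class the model rarely finds (low recall, such as CHAT) and a class the model predicts too often (low precision, such as BROWSING).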
 
3) What are the top 5 most important features?

1. Flow Bytes/s
2. Flow IAT Max
3. Flow IAT Mean
4. Bwd IAT Min
5. Bwd IAT Max
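
A ranking like the one above can also be read off directly with pandas. The sketch below is one way to do it, run on synthetic data rather than the TOR dataset: `toy_data`, `toy_model`, and the `feat_*` column names are hypothetical. With an unseeded forest the ranking can shift from run to run, so this sketch fixes `random_state`, which the exercise itself does not do.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in data: 200 samples, 6 features, 3 classes
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)
toy_data = pd.DataFrame(X, columns=['feat_{}'.format(i) for i in range(6)])

toy_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(toy_data, y)

# Index the importances by column name, then keep the five largest
importances = pd.Series(toy_model.feature_importances_, index=toy_data.columns)
top5 = importances.nlargest(5)
print(top5)
```

The importances of a fitted forest sum to 1, so each value can be read as that feature's share of the total impurity reduction.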

4) What hyperparameters could you tune in the random forest to improve performance? What is the best accuracy you can attain?

<!-- https://medium.com/all-things-ai/in-depth-parameter-tuning-for-random-forest-d67bb7e920d -->
<!-- https://medium.com/@taplapinger/tuning-a-random-forest-classifier-1b252d1dde92 -->
<!-- https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/ -->

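Candidate hyperparameters include n_estimators, max_features, max_depth, min_samples_split, and min_samples_leaf.

Best result found: n_estimators=1000, max_features=10
Accuracy: 0.8352
(Tuned model below)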

5) Bonus: How would you improve the pipeline above to automatically tune the hyperparameters?  How would you improve the pipeline to use multiple train/test splits?
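
One possible approach to both parts of this question is a cross-validated grid search: sklearn's `GridSearchCV` tries every combination in a parameter grid and scores each one over several train/validation splits. The sketch below runs on synthetic data so that it is self-contained. The grid values, the fold count, and the seeds are illustrative assumptions, not tuned recommendations for the TOR dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Hypothetical stand-in data: 300 samples, 8 features, 3 classes
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_redundant=0, n_classes=3, random_state=0)

# Hypothetical grid: 2 x 2 = 4 candidate settings
param_grid = {'n_estimators': [10, 50], 'max_features': [2, 4]}

# Five stratified folds give five train/validation splits per candidate
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring='accuracy', cv=cv)
search.fit(X, y)

print('best parameters:', search.best_params_)
print('mean cross-validated accuracy: {:.3f}'.format(search.best_score_))
```

In the lab pipeline, `X` and `y` would be replaced by the training data and labels, with the test set held back until the search is finished so that the final accuracy is measured on data the tuning never saw.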

In [91]:
pprint(rf.get_params())

{'bootstrap': True,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 10,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}


In [151]:
tuned_rf = RandomForestClassifier(n_estimators=1000, max_features=10, random_state=0)
tuned_model = tuned_rf.fit(train_data, train_labels)
tuned_pred_labels = tuned_model.predict(test_data)
print_results(test_labels, tuned_pred_labels)

Confusion Matrix:
[[ 281   61    3    4    2    1   11    3]
 [  60  647   22    2    7    1   54    2]
 [   7   78   53    1    0    1    9    5]
 [   6   26    2  363    2    0    8    2]
 [   5   49    3    8   50    0   20    1]
 [   5    4    2    0    0  536    6    2]
 [  21   71    4   18    5    8  310    4]
 [   4   19    8    4    1    3    8 1119]] 

Confusion Matrix Percentages with Labels:
Predicted          AUDIO   BROWSING       CHAT  FILE-TRANSFER       MAIL  \
True                                                                       
AUDIO          72.236504   6.387435   3.092784           1.00   2.985075   
BROWSING       15.424165  67.748691  22.680412           0.50  10.447761   
CHAT            1.799486   8.167539  54.639175           0.25   0.000000   
FILE-TRANSFER   1.542416   2.722513   2.061856          90.75   2.985075   
MAIL            1.285347   5.130890   3.092784           2.00  74.626866   
P2P             1.285347   0.418848   2.061856           0.00