# Assignment: DarkNet traffic detection

## Task:  Traffic Classification

Kaggle challenge: https://www.kaggle.com/peterfriedrich1/cicdarknet2020-internet-traffic

### CIC-Darknet2020

Darknet is the unused address space of the internet, which is not expected to interact with other computers in the world. Any communication from the dark space is considered suspicious, because a darknet only listens passively: it accepts incoming packets but does not send outgoing ones. Because the darknet contains no legitimate hosts, any traffic is considered unsolicited and is typically treated as a probe, backscatter, or misconfiguration. Darknets are also known as network telescopes, sinkholes, or blackholes.

Darknet traffic classification is important for categorizing real-time applications. Analyzing darknet traffic helps to monitor malware early, before an attack, and to detect malicious activity after an outbreak.


### Data
In the CICDarknet2020 dataset, a two-layered approach is used: the first layer generates benign and darknet traffic, and the second layer generates the darknet traffic categories Audio-Stream, Browsing, Chat, Email, P2P, Transfer, Video-Stream and VOIP. To generate a representative dataset, the dataset authors merged their previously generated datasets, ISCXTor2016 and ISCXVPN2016, and combined the respective VPN and Tor traffic into the corresponding darknet categories.

## Task 1: Problem Statement
Discuss the problem setting and the first implications of the given data set.
* What assumptions can we make about the data?
* What problems are we expecting?

## Task 2: First Data Analysis, Cleaning and Feature Extraction
* Import the data to a Pandas DataFrame
* Run first simple statistics and visualizations
* Is there a need to clean the data? If yes, do so...
* Can you use the raw data directly, or should you extract features? What features are suitable?


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
data = pd.read_csv(r'C:\Users\Varinder\datamining\week8\Darknet.CSV' , encoding = "ISO-8859-1" )

In [None]:
data.head()
data.shape


In [None]:
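# Display all rows and columns without truncation
from IPython.display import display

def display_all(data):
    # Temporarily raise the pandas display limits for this one call
    with pd.option_context("display.max_rows", 800, "display.max_columns", 800):
        display(data)

display_all(data.head().T)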

In [None]:
data.columns
data.isnull().sum()
Label = pd.DataFrame(data['Label'])
Label_1=pd.DataFrame(data['Label.1'])
Label['Label'].nunique()
Label_1['Label.1'].value_counts()
data['Protocol'].nunique()
display_all(data.dtypes.T)
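
The `isnull()` check above counts NaN entries but does not flag infinite values, which rate-style columns may contain. The following is a minimal sketch of a per-column summary that counts both. The helper name `summarize_missing` and the toy frame are hypothetical and only stand in for the real `data` frame.

```python
import numpy as np
import pandas as pd

def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Count NaN and +/-inf entries per column; return only affected columns."""
    numeric = df.select_dtypes(include=[np.number])
    nan_counts = df.isna().sum()
    # np.isinf is only defined for numeric data, so non-numeric columns get 0
    inf_counts = np.isinf(numeric).sum().reindex(df.columns, fill_value=0)
    summary = pd.DataFrame({"n_nan": nan_counts, "n_inf": inf_counts})
    return summary[(summary["n_nan"] > 0) | (summary["n_inf"] > 0)]

# Toy frame (assumption): illustrative values, not taken from the dataset
toy = pd.DataFrame({
    "Flow Duration": [10, 0, 25],
    "Flow Bytes/s": [120.0, np.inf, np.nan],
    "Label": ["Tor", "VPN", "Tor"],
})
missing_summary = summarize_missing(toy)
print(missing_summary)
```

On the real data, `summarize_missing(data)` would list the columns that need cleaning before any model is trained.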

In [None]:
# Show the distribution of the two class variables, Label and Label.1.
# The classes look imbalanced in both variables.
# Label.1 has 11 categories; it is treated here as an ordinary attribute, and Label is the target.
figure, ax = plt.subplots(1,2,figsize =(12,6))

Label['Label'].value_counts()
sns.countplot(x="Label", palette="ch:.35", ax=ax[0], data=data)
sns.countplot(x="Label.1", palette="ch:.65",ax=ax[1], data=data)
plt.show()
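
The data description lists eight darknet categories, while the notebook counts 11 values in `Label.1`. One possible explanation, which is an assumption here and should be checked with `value_counts()`, is inconsistent spelling or casing of the same category. The sketch below shows how such duplicates could be merged and how the class shares could be computed. The toy values are illustrative only.

```python
import pandas as pd

# Toy labels (assumption): casing variants of the same category
toy_labels = pd.Series(
    ["Audio-Streaming", "AUDIO-STREAMING", "Video-Streaming",
     "Video-streaming", "Chat", "Chat"],
    name="Label.1",
)

# Strip whitespace and lower-case so that spelling variants collapse
normalized = toy_labels.str.strip().str.lower()

# Relative class frequencies make an imbalance easy to read off
class_share = normalized.value_counts(normalize=True)

print(toy_labels.nunique(), "->", normalized.nunique())
print(class_share)
```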

In [None]:
# Encode the categorical variables as integers

from sklearn.preprocessing import LabelEncoder
import numpy as np
encoder = LabelEncoder()

# Select the categorical (object-typed) columns

d_1 = data.select_dtypes(include=['object']).copy()

d_2 = d_1.apply(encoder.fit_transform)

d_3= data.drop (['Flow ID', 'Src IP', 'Dst IP', 'Timestamp', 'Label', 'Label.1'], axis=1)

data_encoded=pd.concat([d_3, d_2], axis=1)

data_encoded['Label.1'].nunique()
data_encoded['Label'].unique()
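
The cell above reuses one `LabelEncoder` for every column, so after the call only the mapping of the last column remains available. If the integer codes need to be translated back to class names later (for example in a classification report), one option is to keep a separate encoder per column. This is a minimal sketch on a toy frame. The `encoders` dictionary and the toy values are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (assumption): illustrative label values
toy = pd.DataFrame({
    "Label": ["Tor", "Non-Tor", "VPN", "Tor"],
    "Label.1": ["Chat", "Email", "Chat", "P2P"],
})

encoders = {}                       # column name -> fitted encoder
encoded = pd.DataFrame(index=toy.index)
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# The stored encoder maps integer codes back to the original names
decoded = encoders["Label"].inverse_transform(encoded["Label"])
print(list(encoders["Label"].classes_))
print(list(decoded))
```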

In [None]:
d = dict.fromkeys(data_encoded.select_dtypes(np.int32).columns, np.int64)
data_encoded = data_encoded.astype(d)

In [None]:
display_all(data_encoded.dtypes.T)

In [None]:
#Correlation between variables

figure, ax = plt.subplots(figsize=(25,15))

d_corr=data_encoded.corr()
sns.heatmap(d_corr, cmap='rocket_r', ax=ax)
ax.set_title("Matrix of Variables Correlation", fontsize =12)
plt.show()


In [None]:

# Drop the variables that show up as empty (NaN) rows in the correlation heatmap.
# 'Flow Bytes/s' and 'Flow Packets/s' contain NaN values, so they are dropped as well.

cols_to_drop = [
    'Fwd URG Flags', 'URG Flag Count', 'ECE Flag Count', 'Fwd Packet/Bulk Avg',
    'Bwd Bytes/Bulk Avg', 'Subflow Bwd Packets', 'Active Mean', 'Active Std',
    'Active Min', 'Active Max', 'Bwd PSH Flags', 'Bwd URG Flags', 'CWE Flag Count',
    'Fwd Bytes/Bulk Avg', 'Fwd Bulk Rate Avg', 'Flow Bytes/s', 'Flow Packets/s',
]
db_new = data_encoded.drop(cols_to_drop, axis=1)

In [None]:
db_new

# Heatmap for the reduced set of variables

figure, ax = plt.subplots(figsize=(25,15))

d_corr_1=db_new.corr()
sns.heatmap(d_corr_1, cmap='rocket_r', ax=ax)
ax.set_title("Matrix of Correlation variables", fontsize =16)
plt.show()
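
The list of dropped columns was written by hand from the heatmap. As a cross-check, the candidates could also be derived programmatically: a constant column has zero variance, so its correlation with every other column is undefined. The sketch below is one way to find such columns, plus columns that contain NaN. The helper `find_uninformative_columns` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def find_uninformative_columns(df: pd.DataFrame):
    """Return (constant columns, non-constant columns that contain NaN)."""
    numeric = df.select_dtypes(include=[np.number])
    constant = [c for c in numeric.columns
                if numeric[c].nunique(dropna=True) <= 1]
    with_nan = [c for c in numeric.columns
                if numeric[c].isna().any() and c not in constant]
    return constant, with_nan

# Toy frame (assumption): illustrative values only
toy = pd.DataFrame({
    "URG Flag Count": [0, 0, 0, 0],          # constant -> undefined correlation
    "Flow Bytes/s": [1.5, np.nan, 3.0, 2.0], # contains a NaN
    "Flow Duration": [10, 20, 30, 40],
})
constant_cols, nan_cols = find_uninformative_columns(toy)
reduced = toy.drop(columns=constant_cols + nan_cols)
print(constant_cols, nan_cols, list(reduced.columns))
```

Dropping a whole column because of a few NaN entries is a design choice. Removing or imputing the affected rows would be an alternative.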



In [None]:
# Keep only the features that correlate noticeably (positively or negatively) with the target

d_new_1 = db_new[['Protocol','Fwd Packet Length Min', 'Bwd Packet Length Min', 'Packet Length Min', 'Flow ID','Src Port', 'Dst Port', 'Flow Duration', 'Fwd Packet Length Std', 'FIN Flag Count', 'SYN Flag Count', 'Fwd Seg Size Min', 'Idle Mean', 'Idle Max', 'Timestamp', 'Label.1', 'Label']]

d_new_1.columns

In [None]:
train_x = d_new_1.drop(['Label'], axis=1)
# Keep the target as a 1-D Series; scikit-learn expects a 1-D target
train_y = d_new_1['Label']
train_y

In [None]:
train_x

In [None]:
# Check whether the features selected above match those ranked as important by a random forest

# Use a random forest classifier (RFC) for feature selection

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()

# fit the random forest classifier on the training set
# (np.ravel guarantees a 1-D target)
rfc.fit(train_x, np.ravel(train_y))

# extract the feature importances
score = np.round(rfc.feature_importances_, 20)
features = pd.DataFrame({'feature':train_x.columns,'Significance':score})
features = features.sort_values('Significance',ascending=False).set_index('feature')

# plot features
plt.rcParams['figure.figsize'] = (12, 4)
features.plot.bar();

## Task 3: Train a  Model
* Which ML model would you choose and why?
* Train and evaluate the model using the train data
* Is the data balanced? What are the implications, and how can you deal with them?
* Discuss the results -> possible improvements?
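
One common way to handle imbalanced classes is to stratify the train/test split and to weight classes inversely to their frequency. The sketch below illustrates both on synthetic data from `make_classification`, which is an assumption standing in for `train_x` and `train_y`. It is an option to evaluate, not the method used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): roughly 90% / 10% class split
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=111,
)

# stratify=y keeps the class proportions equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=111
)
train_share = np.bincount(y_train) / len(y_train)
test_share = np.bincount(y_test) / len(y_test)
print(train_share, test_share)

# class_weight="balanced" reweights samples inversely to class frequency
clf = RandomForestClassifier(
    n_estimators=31, class_weight="balanced", random_state=111
)
clf.fit(X_train, y_train)
```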


In [None]:
# Split the data into training and test sets

from sklearn.model_selection import train_test_split
from sklearn import metrics

data_train, data_test, labels_train, labels_test = train_test_split(train_x, train_y, test_size=0.3, random_state=111)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

lrc = LogisticRegression(solver='liblinear', penalty='l1')
svc = SVC(kernel='sigmoid', gamma=1.0)
knc = KNeighborsClassifier(n_neighbors=49)
rfc = RandomForestClassifier(n_estimators=31, random_state=111)

clfs = {'SVC' : svc,'KN' : knc, 'LR': lrc, 'RF': rfc}

def train_classifier(clf, data_train, labels_train):    
    clf.fit(data_train, labels_train)
    
def predict_labels(clf, features):
    return (clf.predict(features))

pred_scores = []
for k,v in clfs.items():
    train_classifier(v, data_train, labels_train)
    pred = predict_labels(v,data_test)
    pred_scores.append((k, [accuracy_score(labels_test,pred)]))

# DataFrame.from_items was removed in pandas 1.0; from_dict is the replacement
table = pd.DataFrame.from_dict(dict(pred_scores), orient='index', columns=['Score'])
table

In [None]:
#Logistic Regression

from sklearn.linear_model import LogisticRegression
LGR = LogisticRegression(n_jobs=-1, random_state=0)
LGR.fit(data_train, labels_train)
LGR_predictions = LGR.predict(data_test)

In [None]:
#RFC

from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(n_estimators=31, random_state=111).fit(data_train, labels_train)
RFC_predictions = RFC.predict(data_test)

In [None]:
#KNN

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=200).fit(data_train, labels_train)
knn_predictions = knn.predict(data_test)
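
KNN, SVC and regularized logistic regression are sensitive to feature scale, and columns such as ports and flow durations span very different ranges. A possible improvement is to standardize the features inside a pipeline, so that the scaler is fitted on the training split only. This is a minimal sketch on synthetic data. The data and `n_neighbors=15` are assumptions for illustration, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption) with one feature on a much larger scale
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=111
)
X[:, 0] *= 1e6

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=111
)

# The pipeline fits the scaler on the training data and reuses it at predict time
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
scaled_knn.fit(X_train, y_train)
scaled_accuracy = scaled_knn.score(X_test, y_test)
print(scaled_accuracy)
```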


## Task 4: Evaluate 
* Report the F1-score on the test data. Who will build the best model?
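
For a multi-class target, a single F1 number requires an averaging strategy. The sketch below computes the macro average (every class counts equally) and the weighted average (classes count in proportion to their support) with `sklearn.metrics.f1_score`. The toy labels are made up for illustration. With the notebook's variables the call would take `labels_test` and a prediction array instead.

```python
from sklearn.metrics import f1_score

# Toy labels (assumption): illustrative only
y_true = ["Tor", "Tor", "VPN", "Non-Tor", "Non-Tor", "Non-Tor"]
y_pred = ["Tor", "VPN", "VPN", "Non-Tor", "Non-Tor", "Tor"]

# Macro: unweighted mean of the per-class F1 scores
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Weighted: per-class F1 scores weighted by the number of true samples
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print(round(macro_f1, 4), round(weighted_f1, 4))
```

On imbalanced data the macro average is usually the stricter of the two, because poor performance on a rare class is not hidden by the frequent ones.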

In [None]:
#Model Evaluation

#### Logistic Regression

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# model accuracy on data_test
accuracy_LGR = LGR.score(data_test, labels_test)
print("LGR model accuracy:" "\n", accuracy_LGR)

# classification report (precision, recall and F1-score per class)
cm_LGR = classification_report(labels_test, LGR_predictions)
print("LGR classification report:" "\n", cm_LGR)


#### RFC

# model accuracy on data_test
accuracy_RFC = RFC.score(data_test, labels_test)
print("RFC model accuracy:" "\n", accuracy_RFC)

# classification report (precision, recall and F1-score per class)
cm_RFC = classification_report(labels_test, RFC_predictions)
print("RFC classification report:" "\n", cm_RFC)


#### KNN

# model accuracy on data_test
accuracy_KNN = knn.score(data_test, labels_test)
print("KNN model accuracy:" "\n", accuracy_KNN)

# classification report (precision, recall and F1-score per class)
cm_KNN = classification_report(labels_test, knn_predictions)
print("KNN classification report:" "\n", cm_KNN)