### Smart Detection of Bot/Malware-Generated Network Traffic (using the CTU-13 dataset): ML Model

Malware traffic is often hard to detect because it uses real users' PCs or browsers to generate fraudulent activity and spam. This notebook shows how to build a simple supervised model that is trained to detect malware-based traffic in a network traffic log or capture. When the model flags an IP address as generating malware-based spam and fraudulent activity, that address can be listed for quarantine or further analysis.

##### This notebook first prepares and processes the features, then builds and evaluates a sequential neural network and a gradient boosted trees model. The third file in this series implements this in Spark.


About the Data Set
The dataset used here is part of a larger dataset (named CTU-13), which records 4 hours of network traffic in the computer network of a university department at CTU University, Czech Republic. The researchers who created the dataset infected one of the computers in the network with malware that generates click-fraud and spam activity. The traffic was recorded by a traffic analytics tool, which captured the malware-based activity generated by the infected PC in addition to normal traffic. Since the infected computer is known, the data is labeled, and the purpose of this project is to present a supervised classification model.

https://github.com/Hurence/logisland-flow-analytics-ml-jobs/blob/master/README.md



In [43]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix


In [44]:
df_raw = pd.read_csv(r'C:\Users\alon\OneDrive\Documents\Coursera-ML\Sample2Capture.csv')
#Remove uninteresting data based on domain knowledge

# We know the infected system's IP address, so use it to label each flow (1 = bot, 0 = normal)
infected_addr = "147.32.84.165"
df_raw["Bot"] = np.where(df_raw['SrcAddr'] == infected_addr, 1, 0)
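
Because the bot is a single host among many, the resulting label is likely to be heavily skewed. A minimal sketch of checking that skew follows, on a tiny made-up frame: the `flows` data and the second address are hypothetical, and only the `SrcAddr` column, the `Bot` label, and the infected address come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy flows; only the column name and infected address mirror the notebook
flows = pd.DataFrame({
    "SrcAddr": ["147.32.84.165", "147.32.84.59", "147.32.84.59",
                "147.32.84.165", "147.32.84.59"],
})
infected_addr = "147.32.84.165"

# Same labeling rule as the cell above: 1 if the flow originates from the infected host
flows["Bot"] = np.where(flows["SrcAddr"] == infected_addr, 1, 0)

# Absolute class counts and the share of bot flows
counts = flows["Bot"].value_counts()
share = flows["Bot"].mean()
print(counts)
print(f"bot share: {share:.1%}")
```

On the real capture the same two lines (`value_counts()` and `mean()`) would show how much downsampling the training set will need later.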

### Prepare Features for Processing

In [45]:
from sklearn import preprocessing

#df_raw = df_raw.sample(frac=0.75)

#Fill null values
df_raw["sTos"] = df_raw["sTos"].fillna(value=-1)
df_raw["dTos"] = df_raw["dTos"].fillna(value=-1)

# define processing functions
def encode_field(df,field):
    encoder = preprocessing.LabelEncoder()
    return  encoder.fit_transform(df[field])

def hot_encode(df, feature):
    # one-hot encode the integer-indexed feature, using the df argument rather than the global df_raw
    return pd.get_dummies(
            encode_field(df, feature), prefix=feature + "_", drop_first=True)

def group_less_frequent_values(df, feature, min_prc):
    categories = df[feature].value_counts()    
    for category in categories.index:                
        # how many times does this category appear in the dataset?
        freq = categories[category]
        # if it appears fewer than min_prc times (min_prc is an absolute row count)
        if freq < min_prc:
            # rare values become "Q" in string columns and 99 in numeric ones
            new_val = "Q" if df[feature].dtypes == object else 99
            df.loc[df[feature] == category, feature] = new_val
 
# process the categorical features
categorical_features = ['State','Proto','Dir', 'dTos','sTos']
# classes that are not frequent in the data (less than 1% of the rows) will be grouped.
one_p = 0.01 * len(df_raw.index)
# loop through the categorical features
for feature in categorical_features:
    # group the less frequent ones
    group_less_frequent_values(df_raw, feature, one_p)
    # index and then one-hot encode
    df_raw = pd.concat([df_raw, hot_encode(df_raw,feature)],axis=1)
    
# standardize the numerical features
df_raw[['Dur', 'TotPkts','SrcBytes']] = StandardScaler().fit_transform(df_raw[['Dur', 'TotPkts','SrcBytes']])

# remove what we don't need
df_raw = df_raw.drop(columns=categorical_features)
df_raw = df_raw.drop(columns=['SrcAddr','DstAddr','Label','TotBytes','Sport','Dport','StartTime'])
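
To make the grouping and encoding steps easier to follow, here is a minimal sketch on a hypothetical toy column. It is a variant rather than the notebook's exact code: it uses a vectorized `where`/`isin` instead of the loop and skips the `LabelEncoder` step, so the resulting column names carry the category text instead of an integer index.

```python
import pandas as pd

# Hypothetical toy column: "TCP" and "UDP" are common, "ICMP" and "ARP" are rare
toy = pd.DataFrame({"Proto": ["TCP"] * 6 + ["UDP"] * 4 + ["ICMP", "ARP"]})

# Group every category seen fewer than min_count times into a single "Q" bucket
min_count = 2
freq = toy["Proto"].value_counts()
rare = freq[freq < min_count].index
toy["Proto"] = toy["Proto"].where(~toy["Proto"].isin(rare), "Q")

# One-hot encode; drop_first removes one redundant column (here the "Q" bucket)
dummies = pd.get_dummies(toy["Proto"], prefix="Proto", drop_first=True)
print(toy["Proto"].value_counts().to_dict())
print(list(dummies.columns))
```

Grouping before encoding keeps the number of dummy columns small, which is why the processed frame ends up with only a handful of `State`, `Proto`, `Dir`, `dTos`, and `sTos` indicators.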






In [46]:
display(df_raw.head(5))

Unnamed: 0,Dur,TotPkts,SrcBytes,Bot,State__1,State__2,State__3,State__4,State__5,State__6,Proto__1,Proto__2,Proto__3,Dir__1,Dir__2,dTos__1,dTos__2,sTos__1
0,-0.475694,-0.00509,-0.005682,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0
1,-0.475708,-0.00555,-0.005064,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0
2,-0.475662,-0.006471,-0.002179,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0
3,-0.476096,-0.009234,-0.010388,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0
4,-0.476095,-0.009234,-0.00795,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0


### Balance the data and prepare test and training sets

In [47]:
from sklearn.model_selection import train_test_split

df = df_raw

set_train, set_test = train_test_split(df, test_size=0.2)

#balance the training set
bot_data = set_train[set_train['Bot'] == 1]
normal_data = set_train[set_train['Bot'] == 0]
normal_data_downsampled = normal_data.sample(n=len(bot_data.index))
set_train = pd.concat([bot_data,normal_data_downsampled])

col_list = list(df.columns)
col_list.remove("Bot")
x_train = set_train.loc[:,col_list]
x_test  = set_test.loc[:, col_list]
y_train = set_train.loc[:,'Bot']
y_test = set_test.loc[:,'Bot']
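
The split and the downsampling above are both random, so each run produces slightly different test counts. A minimal sketch of a repeatable variant follows, on hypothetical toy data. The `stratify` and `random_state` arguments are additions for illustration and are not used in the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 bot rows among 100
toy = pd.DataFrame({"x": range(100), "Bot": [1] * 10 + [0] * 90})

# stratify keeps the bot share equal in both splits; random_state makes the split repeatable
train, test = train_test_split(toy, test_size=0.2, stratify=toy["Bot"], random_state=0)

# Downsample the normal rows in the training set only, as in the cell above
bots = train[train["Bot"] == 1]
normal = train[train["Bot"] == 0].sample(n=len(bots), random_state=0)
balanced = pd.concat([bots, normal])

print(balanced["Bot"].value_counts().to_dict())
print(test["Bot"].value_counts().to_dict())
```

The test set is left unbalanced on purpose, so the evaluation reflects the class mix the model would face on real traffic.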


#### A function that summarizes the confusion matrix as the bot detection rate and the false-positive rate

In [48]:
def print_perc(tn,tp,fn,fp):
    actual_true = tp + fn
    pr_t = round((tp / actual_true * 100), 2)
    actual_false = tn + fp
    pr_f = round((fp / actual_false * 100), 2)
    print(f"{tp}/{actual_true}({pr_t}%) were correctly identified as bots")
    print(f"{fp}/{actual_false} ({pr_f}%) were wrongly identified as bots")

### Model and evaluate a gradient boosting algorithm (XGBoost)

In [49]:
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

model = XGBClassifier()
model.fit(x_train, y_train)
print(model)


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1)


### Evaluate our model

In [50]:
y_pred = model.predict(x_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

cm = confusion_matrix(y_test, y_pred)
print(cm)
print("\n ********************************** \n")

tn = cm[0,0]
fp = cm[0,1]
fn = cm[1,0]
tp = cm[1,1]
print_perc(tn,tp,fn,fp)

Accuracy: 92.26%
[[68259  5725]
 [   46   513]]

 ********************************** 

513/559(91.77%) were correctly identified as bots
5725/73984 (7.74%) were wrongly identified as bots
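
The four counts are read out of the matrix by index in the cell above. For binary labels, scikit-learn's matrix can also be unpacked in one step with `ravel()`. A minimal sketch on hypothetical labels (the `y_true` and `y_hat` lists are made up for illustration):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = bot, 0 = normal
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 0, 0, 1, 0, 1, 0]

# For binary labels, ravel() flattens the 2x2 matrix in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

recall = tp / (tp + fn)  # share of bots that were caught
fpr = fp / (fp + tn)     # share of normal flows that were flagged
print(tn, fp, fn, tp)
print(round(recall, 2), round(fpr, 2))
```

Rows of the matrix are the true classes and columns are the predicted classes, which is why `cm[0,1]` is the false-positive count.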


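##### 513/559 (91.77%) were correctly identified as bots
##### 5725/73984 (7.74%) were wrongly identified as bots
#### The model catches most bot flows with a false-positive rate under 8%, although with so few bots in the test set the 5725 false alarms still outnumber the 513 true detections. Now let's compare it to a deep learning model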

### Compare to a deep learning algorithm (Sequential NN)

In [27]:
# The original cell does not show its imports; standalone Keras is assumed here
from keras.models import Sequential
from keras.layers import Dense

classifier = Sequential()
# First hidden layer (input_dim=17 is the number of feature columns)
classifier.add(Dense(8, activation='relu', kernel_initializer='random_normal', input_dim=17))
# Second hidden layer
classifier.add(Dense(8, activation='relu', kernel_initializer='random_normal'))
# Output layer
classifier.add(Dense(1, activation='sigmoid', kernel_initializer='random_normal'))
# Compile the neural network
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Fit the model to the training dataset
classifier.fit(x_train, y_train, batch_size=64, epochs=100)
# Loss and accuracy on the balanced training set, not on the test set
eval_model = classifier.evaluate(x_train, y_train)
print("\n ********************************** \n")
print(eval_model)
# Predict on the test set and threshold the sigmoid output at 0.5
y_pred = classifier.predict(x_test)
y_pred = (y_pred > 0.5)
cm = confusion_matrix(y_test, y_pred)


Epoch 1/100
...
Epoch 100/100

 ********************************** 

[0.29015027702610613, 0.8821042743071865]
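
The training cell turns the sigmoid outputs into labels with a fixed `y_pred > 0.5` cutoff. Moving that cutoff trades detection rate against false alarms. A minimal sketch with made-up scores follows; the `scores` and `labels` arrays and the `rates` helper are hypothetical and not part of the notebook.

```python
import numpy as np

# Hypothetical sigmoid outputs and true labels (1 = bot)
scores = np.array([0.05, 0.20, 0.35, 0.55, 0.60, 0.45, 0.80, 0.95])
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1])

def rates(threshold):
    """Return (detection rate, false-positive rate) at a given cutoff."""
    flagged = scores > threshold
    tp = np.sum(flagged & (labels == 1))
    fp = np.sum(flagged & (labels == 0))
    return tp / np.sum(labels == 1), fp / np.sum(labels == 0)

# A lower cutoff flags more flows: more bots caught, but more false alarms
for t in (0.3, 0.5, 0.7):
    tpr, fpr = rates(t)
    print(f"threshold={t}: detection={tpr:.2f}, false positives={fpr:.2f}")
```

Applied to the real model, the same sweep over `classifier.predict(x_test)` would show whether a cutoff other than 0.5 gives a better balance for quarantine decisions.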


#### Evaluate the deep learning model

In [28]:
print("\n ********************************** \n")
print(cm)

tn = cm[0,0]
fp = cm[0,1]
fn = cm[1,0]
tp = cm[1,1]

print("\n ********************************** \n")

print_perc(tn,tp,fn,fp)


 ********************************** 

[[63873 10108]
 [   58   504]]

 ********************************** 

504/562(89.68%) were correctly identified as bots
10108/73981 (13.66%) were wrongly identified as bots
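
To put the two models side by side, the confusion matrices printed in this notebook can be summarized with the same rates. The sketch below is illustrative only: the precision column is an addition that `print_perc` does not report, and the two runs used slightly different random test splits (559 vs. 562 bot flows), so the comparison is indicative rather than exact.

```python
# Confusion matrices as printed in this notebook: [[tn, fp], [fn, tp]]
reported = {
    "XGBoost": [[68259, 5725], [46, 513]],
    "Sequential NN": [[63873, 10108], [58, 504]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in reported.items():
    summary[name] = {
        "detection_pct": round(tp / (tp + fn) * 100, 2),    # bots caught
        "false_alarm_pct": round(fp / (fp + tn) * 100, 2),  # normal flows flagged
        "precision_pct": round(tp / (tp + fp) * 100, 2),    # flagged flows that are bots
    }
    print(name, summary[name])
```

On these numbers XGBoost has both the higher detection rate and the lower false-alarm rate.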
