## Data Exploration

In this step, we will explore the datasets and do a little preprocessing.

# Modelling Intrusion Detection: Analysis of a Feature Selection Mechanism

## Method Description

### Step 1: Data preprocessing:
All categorical features are converted to numerical ones using one-hot encoding. The features are then scaled so that features with large values do not weigh too much in the results.

### Step 2: Feature Selection:
Eliminate redundant and irrelevant data by selecting a subset of relevant features that fully represents the given problem.
First, univariate feature selection with the ANOVA F-test analyzes each feature individually to determine the strength of the relationship between the feature and the labels. The SelectPercentile method (sklearn.feature_selection) then keeps the features whose scores fall in the highest percentile.
Once this subset is found, Recursive Feature Elimination (RFE) is applied.
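
The two-stage selection described in Step 2 can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's own pipeline: the `make_classification` data, the 25th percentile cut-off, and the target of 5 features are all assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectPercentile, f_classif
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=40, n_informative=6,
                           n_redundant=4, random_state=0)

# Stage 1: score every feature on its own with the ANOVA F-test and
# keep the top 25% of features by score (percentile value is an assumption).
selector = SelectPercentile(score_func=f_classif, percentile=25)
X_univariate = selector.fit_transform(X, y)

# Stage 2: recursive feature elimination with a decision tree,
# pruning the remaining subset down to 5 features (also an assumption).
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=5)
X_selected = rfe.fit_transform(X_univariate, y)

print(X.shape, X_univariate.shape, X_selected.shape)
```

Running the cheap univariate filter first means the more expensive RFE stage only has to refit the tree on a fraction of the original features.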

### Step 3: Build the model:
A decision tree model is built.

### Step 4: Prediction & Evaluation (validation):
The test data is used to make predictions with the model.
Multiple scores are considered, such as accuracy score, recall, f-measure, and confusion matrix.
A 10-fold cross-validation is performed.
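
As a rough sketch of the evaluation step described above, the snippet below trains a decision tree on synthetic data, computes the listed scores on a held-out split, and runs a 10-fold cross-validation. The data, the 70/30 split, and macro averaging are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the intrusion dataset (assumption).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Scores on the held-out split.
scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred, average="macro"),
    "f1": f1_score(y_test, y_pred, average="macro"),
}
cm = confusion_matrix(y_test, y_pred)

# 10-fold cross-validation: one accuracy value per fold.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                            cv=10, scoring="accuracy")

print(scores)
print(cm)
print("Mean CV accuracy:", cv_scores.mean())
```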

# Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score # for calculating accuracy of model
from sklearn.metrics import classification_report # for generating a classification report of model
import warnings
warnings.filterwarnings("ignore")

# Dataset Information

In [None]:
dataset_train=pd.read_csv('../datasets/KDDTrain+.txt',header=None)

In [None]:
dataset_test=pd.read_csv('../datasets/KDDTest+.txt',header=None)

# Sample view of the training dataset

In [None]:
dataset_train.head()

# Sample view of the test dataset

In [None]:
dataset_test.head()

# Column Names of the Training and Test Datasets

In [None]:
col_names = ["duration","protocol_type","service","flag","src_bytes",
    "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate","label", "difficulty_level"]


# Shape of Training and Test

In [None]:
print("Shape of Training Dataset:", dataset_train.shape)
print("Shape of Testing Dataset:", dataset_test.shape)

# Column Assignment

In [None]:
# Assigning attribute name to dataset
dataset_train.columns = col_names
dataset_test.columns = col_names

# Label of training and test dataset

In [None]:
#label distribution of Training set and testing set
print('Label distribution Training set:')
print(dataset_train['label'].value_counts())
print()
print('Label distribution Test set:')
print(dataset_test['label'].value_counts())

# Data preprocessing

One-hot encoding (one-of-K) is used to transform all categorical features into binary features. Requirement for one-hot encoding: "The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features. The output will be a sparse matrix where each column corresponds to one possible value of one feature. It is assumed that input features take on values in the range [0, n_values)."

With that transformer, the features would first need to be transformed with LabelEncoder, which maps every category to a number. The cells below instead use pandas `get_dummies`, which accepts string categories directly.
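
To make the idea of one-hot encoding concrete, here is a small sketch on a toy column. The `toy` frame and its values are made up for illustration; `dtype=int` is passed only so the output is 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy categorical column (values chosen for illustration only).
toy = pd.DataFrame({"protocol_type": ["tcp", "udp", "icmp", "tcp"]})

# Each distinct category becomes its own binary column.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```

Each row has exactly one 1 among the new columns, marking the category that row originally held.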


## Drop Useless Column

In [None]:
dataset_train.drop(['difficulty_level'],axis=1,inplace=True)
dataset_test.drop(['difficulty_level'],axis=1,inplace=True)

## Changing attack labels to their respective attack class

The data set contains 4 different classes of attacks: Denial of Service (DoS), Probe, User to Root (U2R), and Remote to Local (R2L). Each attack is labeled with a subclass of one of these classes, and we will replace every subclass with its main class.

The subclasses of attacks are:

| DoS | R2L | Probe | U2R |
| --- | --- | --- | --- |
| apache2 | ftp_write | ipsweep | buffer_overflow |
| back | guess_passwd | mscan | loadmodule |
| land | httptunnel | nmap | perl |
| neptune | imap | portsweep | ps |
| mailbomb | multihop | saint | rootkit |
| pod | named | satan | sqlattack |
| processtable | phf | | xterm |
| smurf | sendmail | | |
| teardrop | snmpgetattack | | |
| udpstorm | snmpguess | | |
| worm | spy | | |
| | warezclient | | |
| | warezmaster | | |
| | xlock | | |
| | xsnoop | | |



In [None]:
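def change_label_of(df):
  """Replace each attack subclass in df['label'] with its main attack class."""
  attack_classes = {
    'Dos': ['apache2','back','land','neptune','mailbomb','pod','processtable','smurf','teardrop','udpstorm','worm'],
    'R2L': ['ftp_write','guess_passwd','httptunnel','imap','multihop','named','phf','sendmail',
            'snmpgetattack','snmpguess','spy','warezclient','warezmaster','xlock','xsnoop'],
    'Probe': ['ipsweep','mscan','nmap','portsweep','saint','satan'],
    'U2R': ['buffer_overflow','loadmodule','perl','ps','rootkit','sqlattack','xterm'],
  }
  # Flatten to a subclass -> main class lookup; labels not listed (e.g. 'normal') stay unchanged
  mapping = {sub: main for main, subs in attack_classes.items() for sub in subs}
  # Assign the result back: inplace=True on df.label may not update df under pandas copy-on-write
  df['label'] = df['label'].replace(mapping)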


change_label_of(dataset_train)
change_label_of(dataset_test)

In [None]:
#label distribution of Training set and testing set after relabeling
print('Label distribution Training set:')
print(dataset_train['label'].value_counts())
print()
print('Label distribution Test set:')
print(dataset_test['label'].value_counts())

# Identify categorical features

In [None]:
# explore categorical features
print('Training set:')
for col_name in dataset_train.columns:
    if dataset_train[col_name].dtypes == 'object' :
        unique_cat = len(dataset_train[col_name].unique())
        print("Feature '{col_name}' has {unique_cat} categories".format(col_name=col_name, unique_cat=unique_cat))


In [None]:
# Test set
print('Test set:')
for col_name in dataset_test.columns:
    if dataset_test[col_name].dtypes == 'object' :
        unique_cat = len(dataset_test[col_name].unique())
        print("Feature '{col_name}' has {unique_cat} categories".format(col_name=col_name, unique_cat=unique_cat))


# Encoding categorical features


### Insert categorical features into a 2D numpy array


In [None]:
categorical_columns=['protocol_type', 'service', 'flag']
label = ["label"]

# separate categorical data from non-categorical data and labels
categorical_train_data = dataset_train[categorical_columns]
non_categorical_train_data = dataset_train.drop(categorical_columns+label, axis=1)

categorical_test_data = dataset_test[categorical_columns]
non_categorical_test_data = dataset_test.drop(categorical_columns+label, axis=1)

# Separate label from dataset
label_train_data = dataset_train['label']
label_test_data = dataset_test['label']

# visualize categorical data
categorical_train_data.head()


In [None]:

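# convert categorical data to numeric using one-hot encoding
categorical_train_data_encoded = pd.get_dummies(categorical_train_data)
categorical_test_data_encoded = pd.get_dummies(categorical_test_data)

# Align the test columns with the training columns: categories missing from the
# test set become all-zero columns, and categories unseen in training are dropped
categorical_test_data_encoded = categorical_test_data_encoded.reindex(
    columns=categorical_train_data_encoded.columns, fill_value=0)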

# visualize encoded categorical data
categorical_train_data_encoded.head()
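
Encoding the training and test sets separately can produce different columns whenever a category appears in only one of them, and a model trained on one layout cannot score the other. The sketch below shows one way to align them with `DataFrame.reindex`; the toy `service` values are made up for illustration.

```python
import pandas as pd

# Toy data: "dns" is unseen in training, "ftp" and "smtp" are absent from test.
train_cat = pd.DataFrame({"service": ["http", "ftp", "smtp"]})
test_cat = pd.DataFrame({"service": ["http", "dns"]})

train_enc = pd.get_dummies(train_cat, dtype=int)
test_enc = pd.get_dummies(test_cat, dtype=int)

# Force the test frame onto the training layout:
# missing columns are added as zeros, extra columns are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
print(test_aligned)
```

A row whose category was never seen in training ends up as all zeros, which is a simple but lossy way of handling unknown values.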

# Dataset Normalization
We will use MinMaxScaler to normalize the non-categorical features.

In [None]:
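from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Fit the scaler on the training set only, then reuse the training minima and
# maxima for the test set, so that both sets share one scale and no test
# statistics are used during preprocessing
non_categorical_train_data = pd.DataFrame(
    scaler.fit_transform(non_categorical_train_data),
    columns=non_categorical_train_data.columns,
    index=non_categorical_train_data.index)

# Normalize the test set with the scaler fitted on the training set
non_categorical_test_data = pd.DataFrame(
    scaler.transform(non_categorical_test_data),
    columns=non_categorical_test_data.columns,
    index=non_categorical_test_data.index)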

non_categorical_train_data.head()
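
Min-max scaling maps each feature to x' = (x - min) / (max - min). The NumPy sketch below spells this out, with the minimum and range taken from the training data and then reused for the test data. The helper names `fit_min_max` and `apply_min_max` are hypothetical, and the arrays are toy values.

```python
import numpy as np

def fit_min_max(train):
    """Return the per-column minimum and range of the training data."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    # Constant columns would divide by zero; use 1 so they map to 0 instead.
    col_range[col_range == 0] = 1.0
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    """Scale data with statistics computed from the training set."""
    return (data - col_min) / col_range

train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 40.0]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)

print(train_scaled)
print(test_scaled)
```

Test values outside the range seen in training fall outside [0, 1], as the second test feature does here.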



# Feature Selection with Principal Component Analysis (PCA)

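We will now perform PCA on the non-categorical features to reduce the dimensionality of the dataset. Strictly speaking, PCA is a feature extraction technique: it builds new components as linear combinations of the original features rather than selecting a subset of them.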

In [None]:
# dimensionality reduction using PCA
from sklearn.decomposition import PCA
# We choose the minimum number of principal components 
# such that 95% of the variance is retained.
pca = PCA(.95)

pca.fit(non_categorical_train_data)

non_categorical_train_data_pca = pca.transform(non_categorical_train_data)

non_categorical_test_data_pca = pca.transform(non_categorical_test_data)

non_categorical_train_data_pca = pd.DataFrame(data = non_categorical_train_data_pca,
                                                columns = [f"component_{i}" for i in range(1, pca.n_components_+1)])

non_categorical_test_data_pca = pd.DataFrame(data = non_categorical_test_data_pca,
                                                columns = [f"component_{i}" for i in range(1, pca.n_components_+1)])

non_categorical_test_data_pca.head()

In [None]:
# check number of components retained
print("Number of principal components retained: ", pca.n_components_)
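
Passing a fraction such as 0.95 to `PCA` asks for the smallest number of components whose cumulative explained variance reaches that fraction. The NumPy sketch below illustrates the idea on synthetic data; it is not scikit-learn's actual implementation, and the helper names are hypothetical.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance carried by each principal component."""
    X_centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def n_components_for_variance(ratios, threshold=0.95):
    """Smallest number of leading components whose ratios sum to the threshold."""
    cumulative = np.cumsum(ratios)
    return int(np.searchsorted(cumulative, threshold) + 1)

# Synthetic data: three columns with standard deviations 10, 3 and 0.1.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([10.0, 3.0, 0.1])

ratios = explained_variance_ratio(X)
print(ratios)
print(n_components_for_variance(ratios, 0.95))
```

Here the first component alone carries a little over 90% of the variance, so a second one is needed to pass the 95% mark.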

# Encoding the Label

In [None]:
# encode the string labels as integers
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

# Fit on the training labels only and reuse the same mapping for the test labels
encoded_train_labels = label_encoder.fit_transform(label_train_data)
encoded_test_labels = label_encoder.transform(label_test_data)

label_classes = label_encoder.classes_
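
The small sketch below shows what `LabelEncoder` does with class names like the ones used here, and why the test labels should go through `transform` rather than a second `fit_transform`. The label lists are toy values made up for the example.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels (illustrative only).
train_labels = ["normal", "Dos", "Probe", "normal", "R2L", "U2R"]
test_labels = ["Dos", "normal", "normal"]

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_labels)

# Reuse the mapping learned from the training labels.
test_codes = encoder.transform(test_labels)

print(list(encoder.classes_))
print(test_codes)
print(encoder.inverse_transform(test_codes))
```

Classes are numbered in sorted order. Refitting on a test set that lacks some classes would shift the integer codes, so the same number could mean different classes in the two sets.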


# Join encoded categorical dataframe with the non-categorical dataframe

In [None]:
# concatenate the encoded categorical data with the PCA components, then add the labels
train_data = pd.concat([categorical_train_data_encoded, non_categorical_train_data_pca], axis=1)
test_data = pd.concat([categorical_test_data_encoded, non_categorical_test_data_pca], axis=1)

train_data["target"] = encoded_train_labels
test_data["target"] = encoded_test_labels
test_data.head()

In [None]:
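plt.figure(figsize=(8,8))
# value_counts() orders classes by frequency, so take the labels from its index
# to keep each slice matched to its class name
target_counts = train_data.target.value_counts()
plt.pie(target_counts,labels=[label_classes[i] for i in target_counts.index],autopct='%0.2f%%')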
plt.title('Pie chart distribution of labels')
plt.legend()
plt.show()

# Building the models

Here, we will build 3 models:

1. Linear SVM
2. K-Nearest Neighbors
3. AdaBoost Ensemble Model with Linear SVM and K-Nearest Neighbors

In [None]:
# Setting Training and Testing variables

x_train = train_data.drop(['target'],axis=1).to_numpy()
y_train = train_data.target
x_test = test_data.drop(['target'],axis=1).to_numpy()
y_test = test_data.target


# Linear Support Vector Machine (SVM)

In [None]:
from sklearn.svm import SVC

svm = SVC(kernel='linear')  # gamma is ignored by the linear kernel, so it is not set
svm.fit(x_train,y_train) 

In [None]:
svm_pred = svm.predict(x_test) 
svm_ac=accuracy_score(y_test, svm_pred)*100  
print("The Accuracy of SVM-Classifier is: ", svm_ac)

In [None]:
# SVM classification report (print it so the table renders with line breaks)
print(classification_report(y_test, svm_pred,target_names=label_classes))

# K-nearest-neighbor Classifier (Multi-class Classification)


In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn=KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train,y_train) 

In [None]:
knn_pred=knn.predict(x_test)  
knn_ac=accuracy_score(y_test, knn_pred)*100  

print("The Accuracy of the KNN-Classifier is: ", knn_ac)

In [None]:
# KNN classification report (print it so the table renders with line breaks)
print(classification_report(y_test, knn_pred,target_names=label_classes))

# AdaBoost Ensemble Classifier