# Assignment: Intrusion detection

## Task:  Connection Classification

Kaggle challenge: https://www.kaggle.com/sampadab17/network-intrusion-detection?select=Train_data.csv

### Problem description
The dataset to be audited consists of a wide variety of intrusions simulated in a military network environment. The environment was set up to acquire raw TCP/IP dump data by simulating a typical US Air Force LAN. The LAN was operated like a real environment and subjected to multiple attacks.

## Data
A connection is a sequence of TCP packets that starts and ends at well-defined times, during which data flows between a source IP address and a target IP address under some well-defined protocol. Each connection is labelled either as normal or as an attack with exactly one specific attack type. Each connection record consists of about 100 bytes.
For each TCP/IP connection, 41 quantitative and qualitative features are obtained from normal and attack data (3 qualitative and 38 quantitative features). The class variable has two categories:
• Normal
• Anomalous



## Task 1: Problem Statement
Discuss the problem setting and the first implications of the given data set...
* What assumptions can we make about the data?
* What problems are we expecting?
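
One way to make the expected problems concrete is to look at the class balance first. The sketch below uses made-up labels (not the Kaggle data) and a hypothetical helper, `majority_baseline`, to show the accuracy a classifier would reach by always predicting the most frequent class. Any trained model should beat that number.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most frequent label, accuracy of always predicting it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Illustrative labels only; the real values come from the 'class' column.
toy_labels = ["normal"] * 53 + ["anomaly"] * 47
label, baseline_acc = majority_baseline(toy_labels)
print(label, baseline_acc)
```

With a nearly even split the baseline sits close to 50%, so plain accuracy is a reasonably informative metric here. On a heavily skewed data set it would not be.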

In [None]:
# The training data has an almost even number of normal and anomalous records

## Task 2: First Data Analysis, Cleaning and Feature Extraction
* Import the data to a Pandas DataFrame
* Run first simple statistics and visualizations
* Is there a need to clean the data? If yes, do so...
* Can you use the raw data directly, or should you extract features? Which features are suitable?


In [None]:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import time

In [None]:
data_train = pd.read_csv('Train_data.csv' , encoding = "ISO-8859-1")

In [None]:
data_test = pd.read_csv('Test_data.csv' , encoding = "ISO-8859-1")

In [None]:
data_train.head()

In [None]:
data_test.head()

In [None]:
#checking number of columns and type of each column 
data_train.info()

# Data analysis


In [None]:
data_train['protocol_type'].value_counts()

In [None]:
data_train['flag'].value_counts()

In [None]:
pd.set_option('display.max_rows', None)
data_train['service'].value_counts()

In [None]:
data_train['class'].value_counts()

In [None]:
#Descriptive statistics
data_train.describe()

In [None]:
data_train.isnull().sum()

In [None]:
#observation - the class labels are split roughly 53% - 47%.
# next, we can visualize the data frame
data_train.columns


In [None]:
print("Train dataset shape - ", data_train.shape)
print("Test dataset shape -", data_test.shape)

In [None]:
#Visualization of dataframe

Class = pd.DataFrame(data_train['class'])
Class

##plotting the normal and anomalous records

bins_colors = ["yellow","red"]
sns.countplot(x='class', data=data_train, palette=bins_colors)

plt.title("Class distribution", fontsize = 14)

In [None]:
#box plot of the numeric columns to check for outliers
data_train.boxplot(figsize=(20,10))

In [None]:
from sklearn.preprocessing import StandardScaler

scaler=StandardScaler()

#scale the numerical attributes so that each has zero mean and a variance of 1
#https://towardsdatascience.com/all-about-feature-scaling-bcc0ad75cb35

cols = data_train.select_dtypes(include=['float64','int64']).columns
sc_train = scaler.fit_transform(data_train[cols])
#reuse the statistics learned on the training set; assumes the test set has the same numeric columns
sc_test = scaler.transform(data_test[cols])

In [None]:
sc_train=pd.DataFrame(sc_train, columns = cols)
sc_test=pd.DataFrame(sc_test, columns = cols)
sc_train.head()
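
The scaling step can be illustrated without any library. The sketch below uses made-up numbers and two hypothetical helpers, `fit_standardizer` and `standardize`, to show the usual convention: the mean and standard deviation are learned from the training values only and then reused for the test values. It uses the population standard deviation, which is also what `StandardScaler` computes.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Learn mean and population standard deviation from training values."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Guard against constant columns, which would otherwise divide by zero.
        sigma = 1.0
    return mu, sigma

def standardize(values, mu, sigma):
    """Apply z = (x - mu) / sigma with previously learned statistics."""
    return [(v - mu) / sigma for v in values]

train_col = [2.0, 4.0, 6.0, 8.0]   # made-up training values
test_col = [5.0, 10.0]             # made-up test values

mu, sigma = fit_standardizer(train_col)
train_scaled = standardize(train_col, mu, sigma)
test_scaled = standardize(test_col, mu, sigma)   # uses the training statistics
```

The scaled training column has zero mean and unit variance by construction. The test column generally does not, which is expected, because it is expressed in the training set's units.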

In [None]:
#encoding categorical features

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

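# extract categorical attributes from both training and test sets
train_cat = data_train.select_dtypes(include=['object']).copy()
test_cat = data_test.select_dtypes(include=['object']).copy()

# encode the categorical attributes column by column; each encoder is fitted on
# the train and test values together so that a category gets the same integer
# code in both sets
traincat = train_cat.copy()
testcat = test_cat.copy()
for col in train_cat.columns:
    encoder = LabelEncoder()
    if col in test_cat.columns:
        encoder.fit(pd.concat([train_cat[col], test_cat[col]]))
        testcat[col] = encoder.transform(test_cat[col])
    else:
        # e.g. 'class', which only exists in the training set
        encoder.fit(train_cat[col])
    traincat[col] = encoder.transform(train_cat[col])

#drop the redundant attribute class and append the encoded categorical variables

train_drop = traincat.drop(['class'], axis=1)
a = traincat[['class']].copy()

train_x=pd.concat([sc_train, train_drop], axis=1)
train_y=data_train['class']
test_x=pd.concat([sc_test, testcat],axis=1)
traincat.head()
test_x.shape
train_y.head()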
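
For the integer codes to mean the same thing in the training and test data, the category-to-code mapping has to be learned once and then reused. The sketch below shows this idea in plain Python with made-up values. The helpers `fit_label_map` and `encode` are hypothetical, and mapping unseen categories to `-1` is an assumption made for this example (scikit-learn's `LabelEncoder` raises an error on unseen labels instead).

```python
def fit_label_map(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {cat: code for code, cat in enumerate(sorted(set(values)))}

def encode(values, label_map, unknown=-1):
    """Encode values with a fitted map; unseen categories get `unknown`."""
    return [label_map.get(v, unknown) for v in values]

train_protocol = ["tcp", "udp", "tcp", "icmp"]   # made-up values
test_protocol = ["udp", "tcp", "gre"]            # 'gre' never seen in training

protocol_map = fit_label_map(train_protocol)
train_codes = encode(train_protocol, protocol_map)
test_codes = encode(test_protocol, protocol_map)
print(protocol_map, train_codes, test_codes)
```

Note that integer codes impose an artificial order on the categories. Tree-based models usually tolerate this, while for models that interpret feature values numerically a one-hot encoding may be the better choice.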

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()

# fit random forest classifier on the training set
rfc.fit(train_x, train_y);
# extract important features
score = np.round(rfc.feature_importances_,3)
features = pd.DataFrame({'feature':train_x.columns,'importance':score})
features = features.sort_values('importance',ascending=False).set_index('feature')

# plot features
plt.rcParams['figure.figsize'] = (11, 4)
features.plot.bar();

In [None]:
#creating a RFE model

from sklearn.feature_selection import RFE
import itertools
rfc = RandomForestClassifier()

#select 10 attributes
rfe = RFE(rfc, n_features_to_select=10)
rfe = rfe.fit(train_x, train_y)

# summarize the selection of the attributes: pair each support flag with its column name
feature_map = list(zip(rfe.get_support(), train_x.columns))
selected_features = [v for i, v in feature_map if i]

selected_features

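#split the data into a training part and a held-out validation part
#note: train_size=0.3 keeps 30% of the rows for training and the remaining 70% for validation

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split (train_x, train_y, train_size=0.3, random_state=2)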
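
A random split can shift the class proportions slightly between the two parts. One common remedy is a stratified split, which samples each class separately. The sketch below is a plain-Python illustration on made-up labels, with a hypothetical helper `stratified_split`. In scikit-learn the same effect is available through the `stratify` parameter of `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, seed=0):
    """Return (train indices, test indices) with per-class proportions kept."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Illustrative labels only.
toy_labels = ["normal"] * 60 + ["anomaly"] * 40
train_idx, test_idx = stratified_split(toy_labels, test_fraction=0.3)
print(len(train_idx), len(test_idx))
```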

## Task 3: Train a  Model
* Which ML model would you choose and why?
* Train and evaluate the model using the train data
* Is the data balanced? What are the implications, how can you deal with this?
* Discuss the results -> possible improvements?


In [None]:
from sklearn.naive_bayes import BernoulliNB 

# Train a Bernoulli Naive Bayes model
BNB = BernoulliNB()
BNB.fit(X_train, Y_train)

In [None]:
#Model Evaluation

from sklearn import metrics

models=[]

models.append(('NB Classifier', BNB))

for i,v in models:
    #predict once and reuse the predictions; these scores are on the training split
    pred = v.predict(X_train)
    accuracy = metrics.accuracy_score(Y_train, pred)
    confusion_matrix = metrics.confusion_matrix(Y_train, pred)
    classification = metrics.classification_report(Y_train, pred)
    print('{} Model Evaluation'.format(i))
    print ("Model Accuracy:" "\n", accuracy)
    print()
    print("Confusion matrix:" "\n", confusion_matrix)
    print()
    print("Classification report:" "\n", classification) 
    print() 

## Task 4: Evaluate 
* Report the F1-Score on the test data - Who will build the best model?
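
As a reminder of what the F1-Score measures, the sketch below computes it by hand from true positives, false positives and false negatives. The labels are made up, the helpers `confusion_counts` and `f1_from_counts` are hypothetical, and treating "anomaly" as the positive class is an assumption for this example. In practice `sklearn.metrics.f1_score` or the classification report gives the same quantity.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true positives, false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative labels only.
y_true = ["anomaly", "anomaly", "anomaly", "normal", "normal", "normal"]
y_pred = ["anomaly", "anomaly", "normal", "anomaly", "normal", "normal"]

tp, fp, fn = confusion_counts(y_true, y_pred, positive="anomaly")
f1 = f1_from_counts(tp, fp, fn)
print(tp, fp, fn, f1)
```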

In [None]:
#validating on the held-out split of the training data (Test_data.csv is not used here)

for i, v in models:
    #predict once and reuse the predictions for all metrics
    pred = v.predict(X_test)
    accuracy = metrics.accuracy_score(Y_test, pred)
    confusion_matrix = metrics.confusion_matrix(Y_test, pred)
    classification = metrics.classification_report(Y_test, pred)
    print('{} Model Evaluation'.format(i))
    print ("Model Accuracy:" "\n", accuracy)
    print()
    print("Confusion matrix:" "\n", confusion_matrix)
    print()
    print("Classification report:" "\n", classification) 
    print() 