# Intrusion Detection Systems using SVM Kernels

<p>This Jupyter Notebook is a comprehensive guide to implementing Intrusion Detection Systems (IDS) using Support Vector Machine (SVM) kernels. The main topics covered in this Notebook are:</p>
<ol>
    <li>Importing Modules and Defining Functions</li>
    <li>Importing and Exploring the Dataset</li>
    <li>Feature Selection</li>
    <li>SVM Modeling</li>
    <li>Grid Search</li>
    <li>Conclusion</li>
</ol>


## Importing Modules and Defining Functions

<p>The following imports are common in data science and machine learning work and are used throughout this notebook:</p>

<ol>
    <li>import pandas as pd: This imports the Pandas library, a data manipulation tool for structured data. It is commonly used to read, write, and manipulate data in data frames (tables).</li>
    <li>import numpy as np: This imports the NumPy library, the fundamental package for scientific computing in Python. It is commonly used to work with arrays and matrices of numerical data.</li>
    <li>import scikitplot as skplt: This imports the Scikit-plot library, which provides ready-made plotting helpers (for example confusion matrices and ROC curves) for Scikit-learn models.</li>
    <li>import matplotlib.pyplot as plt: This imports the pyplot interface of the Matplotlib library, which is used to create static, animated, and interactive visualizations in Python.</li>
    <li>import warnings: This imports the warnings module from the Python standard library, which controls how warnings are reported or suppressed.</li>
    <li>from sklearn import svm: This imports the svm module from the Scikit-learn library, which implements Support Vector Machines for classification and regression.</li>
    <li>from sklearn.preprocessing import StandardScaler: This imports the StandardScaler class, which standardizes features by removing the mean and scaling to unit variance.</li>
    <li>from sklearn.pipeline import make_pipeline: This imports the make_pipeline function, which chains preprocessing steps and a final estimator into a single pipeline.</li>
    <li>from sklearn.svm import SVC: This imports the SVC class, the Support Vector Machine implementation for classification.</li>
    <li>from sklearn.model_selection import train_test_split, GridSearchCV: This imports the train_test_split function, which splits data into training and testing sets, and the GridSearchCV class, which searches a grid of hyperparameter values using cross-validation.</li>
    <li>from sklearn.feature_selection import SelectKBest, mutual_info_classif, chi2: This imports the SelectKBest class, which keeps the k highest-scoring features, together with two scoring functions it can use: mutual_info_classif (mutual information) and chi2 (chi-squared statistic).</li>
    <li>from sklearn.metrics import classification_report: This imports the classification_report function, which summarizes precision, recall, F1-score, and support for each class.</li>
    <li>from sklearn.ensemble import ExtraTreesClassifier: This imports the ExtraTreesClassifier class, an ensemble of randomized decision trees used for classification.</li>
</ol>

In [2]:
import pandas as pd
import numpy as np
import scikitplot as skplt
import matplotlib.pyplot as plt
import warnings
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SelectKBest, mutual_info_classif, chi2
from sklearn.metrics import classification_report
from sklearn.ensemble import ExtraTreesClassifier
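As a rough illustration of how several of these imports fit together, the sketch below chains StandardScaler and SVC with make_pipeline and prints a classification_report. The data is synthetic (generated with make_classification) and the RBF kernel is an arbitrary choice, so this is only a minimal sketch under those assumptions, not the model built later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 300 samples, 10 numeric features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler is fitted on the training split only and reused on the test split.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
report = classification_report(y_test, predictions)
print(report)
```

Wrapping the scaler and the classifier in one pipeline keeps the scaling statistics tied to the training data, which matters once cross-validation or grid search is involved.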

<p>The code snippet below configures display and warning behavior for Pandas, Matplotlib, and Python's warnings module. Here's a breakdown of each line:</p>
<ol>
<li>pd.set_option('display.max_rows', None): This line sets the maximum number of rows to display in Pandas data frames to be unlimited. This means that all rows in a data frame will be shown when it is printed to the console.</li>
<li>pd.set_option('display.max_columns', None): Similarly to the previous line, this sets the maximum number of columns to display in Pandas data frames to be unlimited. This means that all columns in a data frame will be shown when it is printed to the console.</li>
<li>plt.rcParams["figure.figsize"] = (15,15): This line sets the default figure size for Matplotlib plots to be 15 inches by 15 inches. This means that any plot created using Matplotlib will have a default size of 15 inches by 15 inches unless otherwise specified.</li>
<li>warnings.filterwarnings('ignore'): This line suppresses warnings that may be generated by Python or its libraries. It is generally not recommended to suppress warnings, as they can provide valuable information about potential issues with code, but in some cases it may be useful to ignore certain warnings for debugging or presentation purposes.</li>
</ol>

In [3]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
plt.rcParams["figure.figsize"] = (15,15)
warnings.filterwarnings('ignore')
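Since suppressing every warning globally can hide real problems, a narrower alternative is to scope these settings to a single block. The sketch below uses pd.option_context and warnings.catch_warnings for that purpose; the data frame and the noisy function are hypothetical stand-ins.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"a": range(100)})  # hypothetical frame with 100 rows

before = pd.get_option("display.max_rows")

# Show every row only inside this block; the previous option is restored on exit.
with pd.option_context("display.max_rows", None):
    full_repr = repr(df)

short_repr = repr(df)  # back to the truncated default display


def noisy():
    # Hypothetical function that emits a warning we have decided to ignore.
    warnings.warn("deprecated call", DeprecationWarning)
    return 42


# Silence one warning category for one call instead of for the whole session.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=DeprecationWarning)
    value = noisy()
```

Outside the two with blocks, both the display option and the warning filters are back to what they were before.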

<p>def getRanges(data): This function takes a Pandas DataFrame and returns a dictionary in which each key is a column name and the corresponding value is the range of that column (i.e. the maximum value minus the minimum value). It only considers columns with a numeric data type (int64 or float64).</p>

<p>def getCategoricalValues(data): This function takes a Pandas DataFrame and returns a dictionary in which each key is a column name and the corresponding value is an array of the unique values in that column. It only considers columns with the object data type (i.e. strings). As a side effect, it prints the sorted unique values and their count for each such column.</p>

<p>def select_features(X_train, y_train, X_test, k_value='all'): This function takes the training and testing data sets along with k_value, the number of features to keep (the default 'all' keeps every feature). It uses the SelectKBest class from the sklearn.feature_selection module to keep the k_value features with the highest mutual information score with respect to the target variable y_train. The selector is fitted on the training data only. The function returns the transformed training and testing data sets, as well as the fitted SelectKBest object fs, whose scores_ attribute holds the score of every feature. Because mutual_info_classif returns scores only, fs.pvalues_ is None.</p>

In [4]:
def getRanges(data):
    # Map each numeric column (int64 or float64) to max - min.
    ranges = {}
    for c in data.columns:
        if data[c].dtype == 'int64' or data[c].dtype == 'float64':
            ranges[c] = data[c].max() - data[c].min()
    return ranges

def getCategoricalValues(data):
    # Map each object (string) column to its unique values and print them.
    categoricalVals = {}
    for c in data.columns:
        if data[c].dtype == 'object':
            values = data[c].unique()
            categoricalVals[c] = values
            print(f'{c} : {sorted(list(values))} \n count = {len(values)} \n')
    return categoricalVals

def select_features(X_train, y_train, X_test, k_value='all'):
    # Fit the selector on the training split only, then apply it to both splits.
    fs = SelectKBest(score_func=mutual_info_classif, k=k_value)
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs
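To make the behavior of select_features concrete, here is a minimal usage sketch on synthetic data. The data set (from make_classification) and the choice k_value=3 are assumptions for illustration; the helper is repeated so that the block runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split


def select_features(X_train, y_train, X_test, k_value="all"):
    # Same logic as the helper above, repeated so this sketch is self-contained.
    fs = SelectKBest(score_func=mutual_info_classif, k=k_value)
    fs.fit(X_train, y_train)
    return fs.transform(X_train), fs.transform(X_test), fs


# Synthetic stand-in data (assumption): 8 features, 3 of them informative.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test, k_value=3)

print(X_train_fs.shape, X_test_fs.shape)  # (150, 3) (50, 3)

# Feature indices ordered from highest to lowest mutual information score.
ranking = np.argsort(fs.scores_)[::-1]
print(ranking[:3])

# mutual_info_classif yields scores only, so no p-values are available.
print(fs.pvalues_)  # None
```

Note that mutual_info_classif involves some randomness, so the exact scores can vary slightly between runs unless a random state is fixed.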

## Importing Dataset

<p>The code below reads in two CSV files, KDDTrain+.csv and KDDTest+.csv, the training and test splits of the NSL-KDD network intrusion detection dataset. Here's a breakdown of each line:</p>
<ol>
<li>dfTrain = pd.read_csv('nsl-kdd/KDDTrain+.csv'): This line reads in the KDDTrain+.csv file using the Pandas read_csv function, which converts the CSV file to a Pandas data frame. The resulting data frame is assigned to the variable dfTrain.</li>
<li>dsTrain = dfTrain.copy(): This line creates a copy of the dfTrain data frame and assigns it to the variable dsTrain. This is often done to create a new data frame that can be modified without affecting the original data frame.</li>
<li>dfTest = pd.read_csv('nsl-kdd/KDDTest+.csv'): This line reads in the KDDTest+.csv file using the Pandas read_csv function, which converts the CSV file to a Pandas data frame. The resulting data frame is assigned to the variable dfTest.</li>
<li>dsTest = dfTest.copy(): This line creates a copy of the dfTest data frame and assigns it to the variable dsTest. This is often done to create a new data frame that can be modified without affecting the original data frame.</li>
</ol>

In [5]:
dfTrain = pd.read_csv('nsl-kdd/KDDTrain+.csv')
dsTrain = dfTrain.copy()
dfTest = pd.read_csv('nsl-kdd/KDDTest+.csv')
dsTest = dfTest.copy()
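After loading two splits like this, it can be worth checking that they share the same columns and data types before going further. The sketch below does that on two tiny in-memory CSV snippets; the snippets and the variable names df_train and df_test are hypothetical stand-ins for the real files.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for the two CSV files (assumption).
train_csv = "'duration','protocol_type','class'\n0,tcp,normal\n0,udp,anomaly\n"
test_csv = "'duration','protocol_type','class'\n2,tcp,normal\n"

df_train = pd.read_csv(io.StringIO(train_csv))
df_test = pd.read_csv(io.StringIO(test_csv))

# Both splits should expose the same columns, in the same order...
same_columns = list(df_train.columns) == list(df_test.columns)

# ...and the same dtype per column, otherwise later transforms may not line up.
same_dtypes = same_columns and list(df_train.dtypes) == list(df_test.dtypes)

print(same_columns, same_dtypes)
```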

## Exploring Dataset

### Train Data

In [6]:
dsTrain.head()

Unnamed: 0,'duration','protocol_type','service','flag','src_bytes','dst_bytes','land','wrong_fragment','urgent','hot','num_failed_logins','logged_in','num_compromised','root_shell','su_attempted','num_root','num_file_creations','num_shells','num_access_files','num_outbound_cmds','is_host_login','is_guest_login','count','srv_count','serror_rate','srv_serror_rate','rerror_rate','srv_rerror_rate','same_srv_rate','diff_srv_rate','srv_diff_host_rate','dst_host_count','dst_host_srv_count','dst_host_same_srv_rate','dst_host_diff_srv_rate','dst_host_same_src_port_rate','dst_host_srv_diff_host_rate','dst_host_serror_rate','dst_host_srv_serror_rate','dst_host_rerror_rate','dst_host_srv_rerror_rate','class'
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,150,25,0.17,0.03,0.17,0.0,0.0,0.0,0.05,0.0,normal
1,0,udp,other,SF,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,13,1,0.0,0.0,0.0,0.0,0.08,0.15,0.0,255,1,0.0,0.6,0.88,0.0,0.0,0.0,0.0,0.0,normal
2,0,tcp,private,S0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,123,6,1.0,1.0,0.0,0.0,0.05,0.07,0.0,255,26,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,anomaly
3,0,tcp,http,SF,232,8153,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,5,5,0.2,0.2,0.0,0.0,1.0,0.0,0.0,30,255,1.0,0.0,0.03,0.04,0.03,0.01,0.0,0.01,normal
4,0,tcp,http,SF,199,420,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,30,32,0.0,0.0,0.0,0.0,1.0,0.0,0.09,255,255,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,normal
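The header row above shows every column name wrapped in literal single quotes, so a column would have to be addressed as "'duration'" rather than "duration". One possible cleanup, sketched here on a tiny in-memory CSV that mimics that header style (an assumption about the file layout), is to strip the quotes from the names right after loading.

```python
import io

import pandas as pd

# Stand-in CSV whose header wraps each name in single quotes (assumption).
raw = "'duration','protocol_type','class'\n0,tcp,normal\n2,udp,anomaly\n"
df = pd.read_csv(io.StringIO(raw))

quoted_names = list(df.columns)
print(quoted_names)  # ["'duration'", "'protocol_type'", "'class'"]

# Remove the leading and trailing single quote from every column name.
df.columns = df.columns.str.strip("'")

clean_names = list(df.columns)
print(clean_names)  # ['duration', 'protocol_type', 'class']
```

If the names are cleaned this way, any later cell that refers to the quoted names would need to be updated to match.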


In [7]:
dsTrain.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125973 entries, 0 to 125972
Data columns (total 42 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   'duration'                     125973 non-null  int64  
 1   'protocol_type'                125973 non-null  object 
 2   'service'                      125973 non-null  object 
 3   'flag'                         125973 non-null  object 
 4   'src_bytes'                    125973 non-null  int64  
 5   'dst_bytes'                    125973 non-null  int64  
 6   'land'                         125973 non-null  int64  
 7   'wrong_fragment'               125973 non-null  int64  
 8   'urgent'                       125973 non-null  int64  
 9   'hot'                          125973 non-null  int64  
 10  'num_failed_logins'            125973 non-null  int64  
 11  'logged_in'                    125973 non-null  int64  
 12  'num_compromised'             

In [8]:
dsTrain.isnull().sum()

'duration'                       0
'protocol_type'                  0
'service'                        0
'flag'                           0
'src_bytes'                      0
'dst_bytes'                      0
'land'                           0
'wrong_fragment'                 0
'urgent'                         0
'hot'                            0
'num_failed_logins'              0
'logged_in'                      0
'num_compromised'                0
'root_shell'                     0
'su_attempted'                   0
'num_root'                       0
'num_file_creations'             0
'num_shells'                     0
'num_access_files'               0
'num_outbound_cmds'              0
'is_host_login'                  0
'is_guest_login'                 0
'count'                          0
'srv_count'                      0
'serror_rate'                    0
'srv_serror_rate'                0
'rerror_rate'                    0
'srv_rerror_rate'                0
'same_srv_rate'     

In [9]:
getRanges(dsTrain)

{"'duration'": 42908,
 "'src_bytes'": 1379963888,
 "'dst_bytes'": 1309937401,
 "'land'": 1,
 "'wrong_fragment'": 3,
 "'urgent'": 3,
 "'hot'": 77,
 "'num_failed_logins'": 5,
 "'logged_in'": 1,
 "'num_compromised'": 7479,
 "'root_shell'": 1,
 "'su_attempted'": 2,
 "'num_root'": 7468,
 "'num_file_creations'": 43,
 "'num_shells'": 2,
 "'num_access_files'": 9,
 "'num_outbound_cmds'": 0,
 "'is_host_login'": 1,
 "'is_guest_login'": 1,
 "'count'": 511,
 "'srv_count'": 511,
 "'serror_rate'": 1.0,
 "'srv_serror_rate'": 1.0,
 "'rerror_rate'": 1.0,
 "'srv_rerror_rate'": 1.0,
 "'same_srv_rate'": 1.0,
 "'diff_srv_rate'": 1.0,
 "'srv_diff_host_rate'": 1.0,
 "'dst_host_count'": 255,
 "'dst_host_srv_count'": 255,
 "'dst_host_same_srv_rate'": 1.0,
 "'dst_host_diff_srv_rate'": 1.0,
 "'dst_host_same_src_port_rate'": 1.0,
 "'dst_host_srv_diff_host_rate'": 1.0,
 "'dst_host_serror_rate'": 1.0,
 "'dst_host_srv_serror_rate'": 1.0,
 "'dst_host_rerror_rate'": 1.0,
 "'dst_host_srv_rerror_rate'": 1.0}

### Test Data

In [10]:
dsTest.head()

Unnamed: 0,'duration','protocol_type','service','flag','src_bytes','dst_bytes','land','wrong_fragment','urgent','hot','num_failed_logins','logged_in','num_compromised','root_shell','su_attempted','num_root','num_file_creations','num_shells','num_access_files','num_outbound_cmds','is_host_login','is_guest_login','count','srv_count','serror_rate','srv_serror_rate','rerror_rate','srv_rerror_rate','same_srv_rate','diff_srv_rate','srv_diff_host_rate','dst_host_count','dst_host_srv_count','dst_host_same_srv_rate','dst_host_diff_srv_rate','dst_host_same_src_port_rate','dst_host_srv_diff_host_rate','dst_host_serror_rate','dst_host_srv_serror_rate','dst_host_rerror_rate','dst_host_srv_rerror_rate','class'
0,0,tcp,private,REJ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,229,10,0.0,0.0,1.0,1.0,0.04,0.06,0.0,255,10,0.04,0.06,0.0,0.0,0.0,0.0,1.0,1.0,anomaly
1,0,tcp,private,REJ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,136,1,0.0,0.0,1.0,1.0,0.01,0.06,0.0,255,1,0.0,0.06,0.0,0.0,0.0,0.0,1.0,1.0,anomaly
2,2,tcp,ftp_data,SF,12983,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,134,86,0.61,0.04,0.61,0.02,0.0,0.0,0.0,0.0,normal
3,0,icmp,eco_i,SF,20,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,65,0.0,0.0,0.0,0.0,1.0,0.0,1.0,3,57,1.0,0.0,1.0,0.28,0.0,0.0,0.0,0.0,anomaly
4,1,tcp,telnet,RSTO,0,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,8,0.0,0.12,1.0,0.5,1.0,0.0,0.75,29,86,0.31,0.17,0.03,0.02,0.0,0.0,0.83,0.71,anomaly


In [11]:
dsTest.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22544 entries, 0 to 22543
Data columns (total 42 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   'duration'                     22544 non-null  int64  
 1   'protocol_type'                22544 non-null  object 
 2   'service'                      22544 non-null  object 
 3   'flag'                         22544 non-null  object 
 4   'src_bytes'                    22544 non-null  int64  
 5   'dst_bytes'                    22544 non-null  int64  
 6   'land'                         22544 non-null  int64  
 7   'wrong_fragment'               22544 non-null  int64  
 8   'urgent'                       22544 non-null  int64  
 9   'hot'                          22544 non-null  int64  
 10  'num_failed_logins'            22544 non-null  int64  
 11  'logged_in'                    22544 non-null  int64  
 12  'num_compromised'              22544 non-null 

In [12]:
dsTest.isnull().sum()

'duration'                       0
'protocol_type'                  0
'service'                        0
'flag'                           0
'src_bytes'                      0
'dst_bytes'                      0
'land'                           0
'wrong_fragment'                 0
'urgent'                         0
'hot'                            0
'num_failed_logins'              0
'logged_in'                      0
'num_compromised'                0
'root_shell'                     0
'su_attempted'                   0
'num_root'                       0
'num_file_creations'             0
'num_shells'                     0
'num_access_files'               0
'num_outbound_cmds'              0
'is_host_login'                  0
'is_guest_login'                 0
'count'                          0
'srv_count'                      0
'serror_rate'                    0
'srv_serror_rate'                0
'rerror_rate'                    0
'srv_rerror_rate'                0
'same_srv_rate'     

In [13]:
getRanges(dsTest)

{"'duration'": 57715,
 "'src_bytes'": 62825648,
 "'dst_bytes'": 1345927,
 "'land'": 1,
 "'wrong_fragment'": 3,
 "'urgent'": 3,
 "'hot'": 101,
 "'num_failed_logins'": 4,
 "'logged_in'": 1,
 "'num_compromised'": 796,
 "'root_shell'": 1,
 "'su_attempted'": 2,
 "'num_root'": 878,
 "'num_file_creations'": 100,
 "'num_shells'": 5,
 "'num_access_files'": 4,
 "'num_outbound_cmds'": 0,
 "'is_host_login'": 1,
 "'is_guest_login'": 1,
 "'count'": 511,
 "'srv_count'": 511,
 "'serror_rate'": 1.0,
 "'srv_serror_rate'": 1.0,
 "'rerror_rate'": 1.0,
 "'srv_rerror_rate'": 1.0,
 "'same_srv_rate'": 1.0,
 "'diff_srv_rate'": 1.0,
 "'srv_diff_host_rate'": 1.0,
 "'dst_host_count'": 255,
 "'dst_host_srv_count'": 255,
 "'dst_host_same_srv_rate'": 1.0,
 "'dst_host_diff_srv_rate'": 1.0,
 "'dst_host_same_src_port_rate'": 1.0,
 "'dst_host_srv_diff_host_rate'": 1.0,
 "'dst_host_serror_rate'": 1.0,
 "'dst_host_srv_serror_rate'": 1.0,
 "'dst_host_rerror_rate'": 1.0,
 "'dst_host_srv_rerror_rate'": 1.0}

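<p> - Three columns in both datasets (by the ranges above: duration, src_bytes, and dst_bytes) have a very large range compared with the rest. This could affect model performance, because SVMs are sensitive to feature scale.</p>
<p> - The next step is to view all categorical data in both datasets (train and test) to prepare them for encoding.</p>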
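The two notes above can be sketched together: numeric columns are standardized so that wide-ranging features do not dominate, and categorical columns are one-hot encoded. This is a minimal sketch on hypothetical toy frames, using ColumnTransformer and OneHotEncoder from Scikit-learn (neither is imported by this notebook, so treat them as assumptions); it is not necessarily the approach taken in the rest of the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frames standing in for the train and test splits.
train = pd.DataFrame({
    "src_bytes": [0, 491, 146, 1_000_000],
    "protocol_type": ["tcp", "udp", "tcp", "tcp"],
})
test = pd.DataFrame({
    "src_bytes": [20, 12983],
    "protocol_type": ["icmp", "tcp"],
})

num_cols = train.select_dtypes(include="number").columns.tolist()
cat_cols = [c for c in train.columns if c not in num_cols]

# Vectorized equivalent of a per-column max - min loop.
ranges = train[num_cols].max() - train[num_cols].min()

# Categories that occur in the test split but never in the training split.
unseen = set(test["protocol_type"]) - set(train["protocol_type"])

prep = ColumnTransformer(
    [
        ("num", StandardScaler(), num_cols),
        # Unseen test categories are encoded as all zeros instead of raising.
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training split only, then reuse the fitted transform on test.
train_prepared = prep.fit_transform(train)
test_prepared = prep.transform(test)

print(ranges)
print(unseen)
print(train_prepared.shape, test_prepared.shape)
```

Fitting the transformer on the training split alone keeps test-set statistics and categories from leaking into preprocessing.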