# Project Introduction
This project develops machine learning pipelines for detecting network anomalies and attacks. The goal is a predictive model that distinguishes “bad” connections (attacks) from “good” connections (normal traffic). The work covers data preprocessing, implementation of
several machine learning algorithms, and comprehensive evaluation of the resulting models.

The cells below print the number of rows and columns of the original kddcup data set, the number of rows and columns
after our data preprocessing, and a sample of the fully encoded data.


# Data Loading

In [8]:
import pandas as pd

# Step 1: Define the column names (KDD dataset has 41 features + 1 label)
column_names = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes", "land",
    "wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in", "num_compromised",
    "root_shell", "su_attempted", "num_root", "num_file_creations", "num_shells",
    "num_access_files", "num_outbound_cmds", "is_host_login", "is_guest_login", "count",
    "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate", "srv_rerror_rate",
    "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", "dst_host_count",
    "dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_srv_rate",
    "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate",
    "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate", "label"
]


# Step 2: Load the gzipped CSV file
df = pd.read_csv("data/kddcup.data_10_percent.gz", header=None, names=column_names)
print("Dataset before")
print("Shape of the dataset:", df.shape)



Dataset before
Shape of the dataset: (494021, 42)


# Data Preprocessing
In this step of the project we:
1. Delete duplicate rows and duplicate columns
2. Drop columns with no variance
3. Label connections as 'normal' (0) or 'attack' (1)
4. Encode categorical data

In [11]:
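# Remove exact duplicate rows
df = df.drop_duplicates()

# Remove columns whose values exactly repeat an earlier column
duplicates = df.T[df.T.duplicated()].index
df = df.drop(columns=duplicates)

# Remove constant columns (a single distinct value carries no information)
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
df = df.drop(columns=constant_cols)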
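
The clean-up steps in the cell above can be checked on a small frame before running them on the full data set. The sketch below is only an illustration: the helper name `drop_redundant` and the toy columns (`a`, `a_copy`, `const`, `b`) are hypothetical and do not come from the KDD data.

```python
import pandas as pd


def drop_redundant(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows, duplicate columns, and constant columns (hypothetical helper)."""
    frame = frame.drop_duplicates()
    # Transpose so that columns become rows; duplicated() then flags repeated columns.
    dup_cols = frame.T[frame.T.duplicated()].index
    frame = frame.drop(columns=dup_cols)
    # A column with one distinct value has no variance.
    constant = [c for c in frame.columns if frame[c].nunique() == 1]
    return frame.drop(columns=constant)


# Toy data: one repeated row, one copied column, one constant column.
toy = pd.DataFrame({
    "a": [1, 1, 2, 3],
    "a_copy": [1, 1, 2, 3],
    "const": [0, 0, 0, 0],
    "b": [5, 5, 6, 7],
})
cleaned = drop_redundant(toy)
print(cleaned.shape)
print(list(cleaned.columns))
```

On the toy frame only `a` and `b` survive, with three rows left after the repeated row is dropped. Transposing the full data set can be slow and memory-hungry, so this form is best suited to quick checks like this one.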

In [12]:
# Label encoding: 0 for normal connections, 1 for attack connections
df['label'] = df['label'].apply(lambda x: 0 if x == 'normal.' else 1)


# One-hot encoding for the protocol_type and flag columns
df = pd.get_dummies(df, columns=['protocol_type', 'flag'])

# Frequency encoding for the service column to avoid adding too many columns
freq = df['service'].value_counts(normalize=True)
df['service'] = df['service'].map(freq)



print("Dataset after:")
print("Shape of the dataset:", df.shape)


pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', 0)           # Automatically adjust to window width

# Display the first 10 rows with every column visible
df.head(10)

Dataset after:
Shape of the dataset: (145586, 52)


Unnamed: 0,duration,service,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label,protocol_type_icmp,protocol_type_tcp,protocol_type_udp,flag_OTH,flag_REJ,flag_RSTO,flag_RSTOS0,flag_RSTR,flag_S0,flag_S1,flag_S2,flag_S3,flag_SF,flag_SH
0,0,0.426236,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,8,8,0.0,0.0,0.0,0.0,1.0,0.0,0.0,9,9,1.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0,0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
1,0,0.426236,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,8,8,0.0,0.0,0.0,0.0,1.0,0.0,0.0,19,19,1.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
2,0,0.426236,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,8,8,0.0,0.0,0.0,0.0,1.0,0.0,0.0,29,29,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
3,0,0.426236,219,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,6,6,0.0,0.0,0.0,0.0,1.0,0.0,0.0,39,39,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
4,0,0.426236,217,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,6,6,0.0,0.0,0.0,0.0,1.0,0.0,0.0,49,49,1.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
5,0,0.426236,217,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,6,6,0.0,0.0,0.0,0.0,1.0,0.0,0.0,59,59,1.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
6,0,0.426236,212,1940,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,2,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1,69,1.0,0.0,1.0,0.04,0.0,0.0,0.0,0.0,0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
7,0,0.426236,159,4087,0,0,0,0,0,1,0,0,0,0,0,0,0,0,5,5,0.0,0.0,0.0,0.0,1.0,0.0,0.0,11,79,1.0,0.0,0.09,0.04,0.0,0.0,0.0,0.0,0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
8,0,0.426236,210,151,0,0,0,0,0,1,0,0,0,0,0,0,0,0,8,8,0.0,0.0,0.0,0.0,1.0,0.0,0.0,8,89,1.0,0.0,0.12,0.04,0.0,0.0,0.0,0.0,0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
9,0,0.426236,212,786,0,0,0,1,0,1,0,0,0,0,0,0,0,0,8,8,0.0,0.0,0.0,0.0,1.0,0.0,0.0,8,99,1.0,0.0,0.12,0.05,0.0,0.0,0.0,0.0,0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
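
The frequency encoding used for the `service` column replaces each category with its share of the rows. A minimal sketch on a made-up series shows the idea; the values `http`, `smtp`, `ftp`, and `telnet` here are illustrative inputs, not counts taken from the data set.

```python
import pandas as pd

# Hypothetical toy column standing in for df['service'].
services = pd.Series(["http", "http", "smtp", "http", "ftp"], name="service")

# Share of rows per category.
freq = services.value_counts(normalize=True)

# Replace each category with its share.
encoded = services.map(freq)
print(encoded.tolist())

# A category that was absent when freq was computed maps to NaN,
# so a fallback value has to be chosen (0.0 here is an assumption).
unseen = pd.Series(["http", "telnet"]).map(freq)
unseen_filled = unseen.fillna(0.0)
print(unseen_filled.tolist())
```

Two properties are worth keeping in mind. Categories with the same share (here `smtp` and `ftp`) receive the same code and become indistinguishable. Also, when the frequencies are computed on the whole data set before a train/test split, the test rows contribute to the encoding; computing `freq` on the training rows only would avoid that, at the cost of handling unseen categories as shown.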


# Model Pipelines
Train/test split
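
One possible way to do the split is sketched below. It assumes scikit-learn is available and uses a stratified 80/20 split; the helper name `split_features_labels`, the split ratio, and the seed are assumptions for illustration, not choices stated in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_features_labels(frame, label_col="label", test_size=0.2, seed=42):
    """Separate features from the label and split them (hypothetical helper)."""
    X = frame.drop(columns=[label_col])
    y = frame[label_col]
    # stratify=y keeps the normal/attack ratio similar in both parts.
    return train_test_split(X, y, test_size=test_size, stratify=y, random_state=seed)


# Toy frame with the same 'label' convention (0 = normal, 1 = attack).
toy = pd.DataFrame({
    "f1": range(20),
    "f2": [v % 3 for v in range(20)],
    "label": [0] * 14 + [1] * 6,
})
X_train, X_test, y_train, y_test = split_features_labels(toy)
print(X_train.shape, X_test.shape)
```

With the preprocessed frame, the call would be `split_features_labels(df)`.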

Define pipelines for SVM, KNN, Random Forest

Fit each model and evaluate them
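
The outline above could be realised roughly as follows. This is a sketch under stated assumptions, not the notebook's final implementation: it uses scikit-learn, runs on synthetic data from `make_classification` rather than the KDD frame, and the hyperparameters, the helper names `build_pipelines` and `fit_and_score`, and the choice of accuracy and F1 as metrics are all illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_pipelines(seed=42):
    """Return one pipeline per model (hypothetical helper, assumed settings)."""
    return {
        # SVM and KNN are distance-based, so features are standardised first.
        "SVM": Pipeline([("scale", StandardScaler()), ("clf", SVC(kernel="rbf"))]),
        "KNN": Pipeline([("scale", StandardScaler()),
                         ("clf", KNeighborsClassifier(n_neighbors=5))]),
        # Tree ensembles do not depend on feature scale, so no scaler is used.
        "Random Forest": Pipeline([
            ("clf", RandomForestClassifier(n_estimators=100, random_state=seed)),
        ]),
    }


def fit_and_score(pipelines, X_train, X_test, y_train, y_test):
    """Fit every pipeline and collect accuracy and F1 on the test split."""
    scores = {}
    for name, pipe in pipelines.items():
        pipe.fit(X_train, y_train)
        pred = pipe.predict(X_test)
        scores[name] = {
            "accuracy": accuracy_score(y_test, pred),
            "f1": f1_score(y_test, pred),
        }
    return scores


# Synthetic binary data as a stand-in for the preprocessed features and labels.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
scores = fit_and_score(build_pipelines(), X_train, X_test, y_train, y_test)
for name, s in scores.items():
    print(f"{name}: accuracy={s['accuracy']:.3f}, f1={s['f1']:.3f}")
```

Putting the scaler inside each pipeline means it is fitted on the training split only, so no statistics from the test rows leak into training.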