# Anomaly Detection in Network Traffic
This project aims to develop machine learning models that detect network anomalies/attacks on the KDD Cup 1999 dataset. Our goal is to build predictive models that accurately distinguish the “bad” connections from “good” connections. This is done through data preprocessing, implementation of several machine learning algorithms, and comprehensive evaluation of said models.


## Data Loading and Analysis
The cells below output a sample of our fully encoded data. Their output includes the original number
of rows and columns of the KDD Cup data set as well as the number of rows and columns after our data preprocessing.

We load the KDD Cup 1999 dataset, which consists of 41 features describing different aspects of each network connection. The dataset includes a target label indicating whether a connection is normal or belongs to one of several types of attacks.

In [8]:
import pandas as pd

# Step 1: Define the column names (KDD dataset has 41 features + 1 label)
column_names = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes", "land",
    "wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in", "num_compromised",
    "root_shell", "su_attempted", "num_root", "num_file_creations", "num_shells",
    "num_access_files", "num_outbound_cmds", "is_host_login", "is_guest_login", "count",
    "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate", "srv_rerror_rate",
    "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", "dst_host_count",
    "dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_srv_rate",
    "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate",
    "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate", "label"
]


# Step 2: Load the gzipped CSV file
df = pd.read_csv("data/kddcup.data_10_percent.gz", header=None, names=column_names)
print("Dataset before")
print("Shape of the dataset:", df.shape)



Dataset before
Shape of the dataset: (494021, 42)


## Data Preprocessing
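To prepare the dataset for modeling:
- We drop duplicate rows, duplicate columns, and columns that hold a single constant value.
- We convert all attack categories into a single "attack" class, simplifying the task to binary classification.
- We one-hot encode the `protocol_type` and `flag` features and frequency-encode `service` to avoid adding too many columns.
- We scale numerical features using `MinMaxScaler` for models sensitive to feature magnitude (e.g., KNN, SVM).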
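
As a minimal sketch of the label and encoding steps listed above, the snippet below applies them to a four-row toy frame. The rows and the names `toy` and `service_freq` are illustrative assumptions, not values taken from the KDD data; `service` is frequency-encoded the same way the notebook's own preprocessing cell does it.

```python
import pandas as pd

# Toy stand-in for a few KDD-style rows (illustrative values, not from the dataset)
toy = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "service": ["http", "domain_u", "http", "ecr_i"],
    "flag": ["SF", "SF", "S0", "SF"],
    "label": ["normal.", "normal.", "neptune.", "smurf."],
})

# Binary target: 0 for normal traffic, 1 for any attack type
toy["label"] = (toy["label"] != "normal.").astype(int)

# One-hot encode the low-cardinality categorical columns
toy = pd.get_dummies(toy, columns=["protocol_type", "flag"])

# Frequency-encode the high-cardinality column: each value becomes its share of rows
service_freq = toy["service"].value_counts(normalize=True)
toy["service"] = toy["service"].map(service_freq)

print(toy)
```

Frequency encoding keeps `service` as a single numeric column, at the cost of mapping services that occur equally often to the same value.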

In [9]:
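# Drop exact duplicate rows
df = df.drop_duplicates()

# Drop columns whose values duplicate an earlier column
duplicates = df.T[df.T.duplicated()].index
df = df.drop(columns=duplicates)

# Drop columns that hold a single constant value
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
df = df.drop(columns=constant_cols)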

In [10]:
# Label encoding: 0 for normal connections, 1 for attack connections
df['label'] = df['label'].apply(lambda x: 0 if x == 'normal.' else 1)


# One-hot encoding for the protocol_type and flag columns
df = pd.get_dummies(df, columns=['protocol_type', 'flag'])

# Frequency encoding for the "service" column to avoid adding too many columns
freq = df['service'].value_counts(normalize=True)
df['service'] = df['service'].map(freq)



print("Dataset after:")
print("Shape of the dataset:", df.shape)


pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', 0)           # Automatically adjust to window width

# Show the first 10 rows with every column visible
df.head(10)

Dataset after:
Shape of the dataset: (145586, 52)


Unnamed: 0,duration,service,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label,protocol_type_icmp,protocol_type_tcp,protocol_type_udp,flag_OTH,flag_REJ,flag_RSTO,flag_RSTOS0,flag_RSTR,flag_S0,flag_S1,flag_S2,flag_S3,flag_SF,flag_SH
0,0,0.426236,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,8,8,0.0,0.0,0.0,0.0,1.0,0.0,0.0,9,9,1.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0,0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
1,0,0.426236,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,8,8,0.0,0.0,0.0,0.0,1.0,0.0,0.0,19,19,1.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
2,0,0.426236,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,8,8,0.0,0.0,0.0,0.0,1.0,0.0,0.0,29,29,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
3,0,0.426236,219,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,6,6,0.0,0.0,0.0,0.0,1.0,0.0,0.0,39,39,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
4,0,0.426236,217,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,6,6,0.0,0.0,0.0,0.0,1.0,0.0,0.0,49,49,1.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
5,0,0.426236,217,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,6,6,0.0,0.0,0.0,0.0,1.0,0.0,0.0,59,59,1.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
6,0,0.426236,212,1940,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,2,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1,69,1.0,0.0,1.0,0.04,0.0,0.0,0.0,0.0,0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
7,0,0.426236,159,4087,0,0,0,0,0,1,0,0,0,0,0,0,0,0,5,5,0.0,0.0,0.0,0.0,1.0,0.0,0.0,11,79,1.0,0.0,0.09,0.04,0.0,0.0,0.0,0.0,0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
8,0,0.426236,210,151,0,0,0,0,0,1,0,0,0,0,0,0,0,0,8,8,0.0,0.0,0.0,0.0,1.0,0.0,0.0,8,89,1.0,0.0,0.12,0.04,0.0,0.0,0.0,0.0,0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
9,0,0.426236,212,786,0,0,0,1,0,1,0,0,0,0,0,0,0,0,8,8,0.0,0.0,0.0,0.0,1.0,0.0,0.0,8,99,1.0,0.0,0.12,0.05,0.0,0.0,0.0,0.0,0,False,True,False,False,False,False,False,False,False,False,False,False,True,False


## Random Forest

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Split features and labels
X = df.drop('label', axis=1)
y = df['label']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Make predictions
y_pred = rf.predict(X_test)



### Evaluation Metrics
To assess the performance of the Random Forest classifier, we calculate:
- **Accuracy**
- **Recall**
- **F1 Score**
- **Confusion Matrix**

These metrics provide insight into how well Random Forest balances detecting attacks while maintaining overall classification performance.

In [16]:
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Random Forest Accuracy: 0.9994962908691272

Confusion Matrix:
 [[26422     9]
 [   13 17232]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     26431
           1       1.00      1.00      1.00     17245

    accuracy                           1.00     43676
   macro avg       1.00      1.00      1.00     43676
weighted avg       1.00      1.00      1.00     43676
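
The classification report rounds every score to 1.00, which hides the small number of errors. As a rough sketch, the unrounded metrics can be recomputed by hand from the confusion matrix printed above, assuming scikit-learn's layout of `[[TN, FP], [FN, TP]]` with label 1 (attack) as the positive class. The variable names are chosen here for illustration.

```python
# Confusion matrix reported above, read as [[TN, FP], [FN, TP]] with attack = positive class
tn, fp = 26422, 9
fn, tp = 13, 17232

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of flagged connections that really are attacks
recall = tp / (tp + fn)      # share of real attacks that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.6f} precision={precision:.6f} recall={recall:.6f} f1={f1:.6f}")
```

Read this way, the Random Forest misses 13 of 17,245 attacks and raises 9 false alarms on 26,431 normal connections.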



In [17]:
# Normalization Using MinMaxScaler 

In [18]:
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training split only, then reuse it for the test split
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
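
To make the scaling step concrete, here is a small NumPy sketch of the arithmetic that min-max scaling performs with the default `(0, 1)` range: subtract the training minimum and divide by the training range, column by column. The arrays and the `_demo` names are hypothetical and only illustrate the idea; it is not the library's implementation.

```python
import numpy as np

# Hypothetical feature values; the second column mimics a byte count with a wide range
X_train_demo = np.array([[0.0, 100.0], [5.0, 300.0], [10.0, 500.0]])
X_test_demo = np.array([[2.5, 700.0]])

# "Fit": learn the per-column minimum and range from the training rows only
col_min = X_train_demo.min(axis=0)
col_range = X_train_demo.max(axis=0) - col_min

# "Transform": reuse the training statistics for both splits
X_train_demo_scaled = (X_train_demo - col_min) / col_range
X_test_demo_scaled = (X_test_demo - col_min) / col_range

print(X_train_demo_scaled)
print(X_test_demo_scaled)
```

Because the statistics come from the training split, a test value larger than anything seen in training (700 here) maps above 1. That is expected and avoids leaking test information into the scaler.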

## Support Vector Machine (SVM)



### Model Training and Prediction

In [19]:
from sklearn.svm import SVC

svm_model = SVC(kernel='rbf', C=1.0, gamma=1)

svm_model.fit(X_train_scaled, y_train)

y_pred_svm = svm_model.predict(X_test_scaled)

### Evaluation Metrics

We evaluate the model using:
- **Accuracy**: Overall correct predictions.
- **Recall**: Ability to detect actual attacks.
- **F1 Score**: Balance between precision and recall.

We also print the **confusion matrix** for better interpretation.

In [20]:
print("SVM (RBF Kernel) Accuracy:", accuracy_score(y_test, y_pred_svm))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_svm))
print("\nClassification Report:\n", classification_report(y_test, y_pred_svm))

SVM (RBF Kernel) Accuracy: 0.9983972891290411

Confusion Matrix:
 [[26404    27]
 [   43 17202]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     26431
           1       1.00      1.00      1.00     17245

    accuracy                           1.00     43676
   macro avg       1.00      1.00      1.00     43676
weighted avg       1.00      1.00      1.00     43676
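
Since both classification reports round to 1.00, the two models are easier to compare through their error rates. The sketch below derives the false-positive and false-negative rates from the two confusion matrices reported in this notebook, again assuming the `[[TN, FP], [FN, TP]]` layout. The dictionary names are illustrative.

```python
# Confusion matrices reported in this notebook, as (TN, FP, FN, TP)
reported = {
    "Random Forest": (26422, 9, 13, 17232),
    "SVM (RBF)": (26404, 27, 43, 17202),
}

rates = {}
for name, (tn, fp, fn, tp) in reported.items():
    false_positive_rate = fp / (fp + tn)   # normal connections wrongly flagged
    false_negative_rate = fn / (fn + tp)   # attacks that slipped through
    rates[name] = (false_positive_rate, false_negative_rate)
    print(f"{name}: FPR={false_positive_rate:.5f}, FNR={false_negative_rate:.5f}")
```

On this split the SVM makes roughly three times as many errors of each kind as the Random Forest (27 vs. 9 false positives, 43 vs. 13 missed attacks), although both rates remain well below 1%.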



## K-Nearest Neighbors (KNN)

We apply `MinMaxScaler` to normalize the dataset since KNN relies on Euclidean distance, and unscaled features could distort proximity calculations.


### Model Training and Prediction

In [64]:
# Code for model
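
The notebook does not include the KNN code, so the following is only a hedged sketch of how such a model might be trained with scikit-learn's `KNeighborsClassifier`. It runs on synthetic data from `make_classification` so that it is self-contained; the data, the variable names, and `n_neighbors=5` (the library default) are all assumptions rather than the authors' settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data so the sketch runs on its own (not the KDD features)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Scale with statistics learned from the training split only
demo_scaler = MinMaxScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

# n_neighbors=5 is scikit-learn's default, used here as an assumed starting point
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_tr_scaled, y_tr)
y_pred_knn = knn_model.predict(X_te_scaled)

print("KNN accuracy on synthetic data:", knn_model.score(X_te_scaled, y_te))
```

On the real data, the same pattern would presumably use the notebook's `X_train_scaled`, `X_test_scaled`, `y_train`, and `y_test` in place of the synthetic arrays.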

### Evaluation Metrics
To evaluate the performance of the KNN model, we use:
- **Accuracy**
- **Recall**
- **F1 Score**
- **Confusion Matrix**

These metrics allow us to compare the KNN model with SVM and Random Forest, especially in terms of correctly flagging attacks without raising excessive false positives.

In [66]:
# Code for eval