## Imports
This section contains all the required imports for this model.

In [1]:
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

## Data Pre-processing
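The data for this model can be found [here](https://archive.ics.uci.edu/dataset/542/internet+firewall+data). After loading the data, we drop any rows with missing values. We then define a map from well-known port numbers to their service names (for example, 53 to DNS and 443 to HTTPS) and apply it to the `Destination Port` column to create a `TrafficType` column; ports that are not in the map are labelled `Unknown`. Next, we encode `TrafficType` as integers with a label encoder. Finally, we drop the port columns and the `Action` column. Because `TrafficType` is derived directly from `Destination Port`, keeping that column as a feature would leak the target.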

In [2]:
# Load Data
df = pd.read_csv('log2.csv')
df.dropna(inplace=True)

# Map Ports to Services
port_to_service = {53: "DNS", 80: "HTTP", 443: "HTTPS", 3389: "RDP", 22: "SSH", 123: "NTP"}
df["TrafficType"] = df["Destination Port"].map(port_to_service).fillna("Unknown")

# Encode TrafficType
le = LabelEncoder()
df["TrafficType"] = le.fit_transform(df["TrafficType"])

# Drop Unnecessary Columns
df = df.drop(["Source Port", "Destination Port", "NAT Source Port", "NAT Destination Port", "Action"], axis=1)
df.head()

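|   | Bytes | Bytes Sent | Bytes Received | Packets | Elapsed Time (sec) | pkts_sent | pkts_received | TrafficType |
|---|-------|------------|----------------|---------|--------------------|-----------|---------------|-------------|
| 0 | 177 | 94 | 83 | 2 | 30 | 1 | 1 | 0 |
| 1 | 4768 | 1600 | 3168 | 19 | 17 | 10 | 9 | 4 |
| 2 | 238 | 118 | 120 | 2 | 1199 | 1 | 1 | 6 |
| 3 | 3327 | 1438 | 1889 | 15 | 17 | 8 | 7 | 4 |
| 4 | 25358 | 6778 | 18580 | 31 | 16 | 13 | 18 | 2 |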
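
The integer codes in the `TrafficType` column can be read back to service names by hand. The following is a minimal plain-Python sketch, assuming the label encoder's documented behaviour of sorting the unique labels and numbering them from 0, and assuming every service in the map occurs in the data. The names `classes`, `code_for`, and `head_codes` are illustrative and not part of the notebook.

```python
# Port-to-service map from the pre-processing cell; unmapped ports become "Unknown".
port_to_service = {53: "DNS", 80: "HTTP", 443: "HTTPS", 3389: "RDP", 22: "SSH", 123: "NTP"}

# LabelEncoder keeps the sorted unique labels and encodes each label as its
# position in that sorted list. Here we reproduce that ordering by hand.
classes = sorted(set(port_to_service.values()) | {"Unknown"})
code_for = {label: code for code, label in enumerate(classes)}

# TrafficType codes of the five rows shown by df.head().
head_codes = [0, 4, 6, 4, 2]
head_labels = [classes[code] for code in head_codes]

print(code_for)
print(head_labels)
```

Under these assumptions the five rows shown decode to DNS, RDP, Unknown, RDP, and HTTPS. In the notebook itself, `le.inverse_transform` performs the same lookup.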


## Model Fitting
We first define our features and our target. We then split the data, holding out 70% for testing and training on the remaining 30% (`train_size=0.3`). Next, we apply the Synthetic Minority Oversampling Technique (SMOTE) to the training data only, oversampling every class except the majority so that the rare traffic types are better represented. We then fit a standard scaler on the resampled training data and apply the same scaling to the test data, so that all features are on the same scale. Finally, we train a k-nearest neighbors classifier with 3 neighbors and distance-based weighting on the scaled training data and use it to predict results on the test set.
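
To give some intuition for the oversampling step, here is a simplified plain-Python sketch of the interpolation idea behind SMOTE: a synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. This is an illustration only, not imbalanced-learn's implementation; the function `smote_like_sample`, the toy points, and the choice of `k=2` are all hypothetical.

```python
import math
import random

def smote_like_sample(minority_points, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour."""
    rng = rng or random.Random(0)
    base = rng.choice(minority_points)
    # Rank the other minority points by Euclidean distance to the chosen sample.
    others = [p for p in minority_points if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(p, base))[:k]
    neighbour = rng.choice(neighbours)
    # Interpolate: gap = 0 gives the base point, gap close to 1 gives the neighbour.
    gap = rng.random()
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))

# Toy two-feature minority class (hypothetical values).
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.5)]
rng = random.Random(42)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]

for point in synthetic:
    print(point)
```

Because each synthetic point is built from existing minority samples, oversampling before the split would let information from test rows influence the training set, which is why the notebook resamples only the training data.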

In [3]:
# Features and Target
X = df.drop(["TrafficType"], axis=1)
y = df["TrafficType"]

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3)

# Apply SMOTE
smote = SMOTE(sampling_strategy="not majority")  # Balance all classes
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Scale Data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_resampled)
X_test_scaled = scaler.transform(X_test)

# Train KNN
knn = KNeighborsClassifier(n_neighbors=3, weights="distance")
knn.fit(X_train_scaled, y_train_resampled)
y_pred = knn.predict(X_test_scaled)
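
To illustrate what `weights="distance"` changes compared with a plain majority vote, here is a minimal plain-Python sketch in which each of the `k` nearest neighbours votes with weight `1 / distance`. The function `knn_predict` and the toy points are hypothetical, and the handling of zero distances is simplified; scikit-learn's implementation is more elaborate.

```python
import math
from collections import defaultdict

def knn_predict(points, labels, query, k=3, weighted=True):
    """Predict a label for `query` from its k nearest training points."""
    nearest = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        distance = math.dist(point, query)
        if weighted and distance == 0:
            # Simplification: an exact match decides the prediction outright.
            return label
        # Closer neighbours count for more when weighted; otherwise one vote each.
        votes[label] += 1.0 / distance if weighted else 1.0
    return max(votes, key=votes.get)

# Toy example: one very close "DNS" point and two more distant "HTTP" points.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
labels = ["DNS", "HTTP", "HTTP"]
query = (0.5, 0.0)

uniform_prediction = knn_predict(points, labels, query, weighted=False)
weighted_prediction = knn_predict(points, labels, query, weighted=True)
print(uniform_prediction, weighted_prediction)
```

With uniform votes the two distant "HTTP" points outvote the single close "DNS" point; with distance weighting the close point dominates. With only 3 neighbours, as in the model above, this weighting can decide otherwise tied or near-tied votes.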

## Prediction
Once the model has been fitted, we decode the predictions and the test labels back to service names and evaluate them using a classification report and accuracy score.

In [4]:
# Decode Predictions
y_pred_labels = le.inverse_transform(y_pred)
y_test_labels = le.inverse_transform(y_test)

# Evaluate
print(classification_report(y_test_labels, y_pred_labels))
print("Accuracy Score:", accuracy_score(y_test_labels, y_pred_labels))

              precision    recall  f1-score   support

         DNS       0.98      0.97      0.98     10723
        HTTP       0.62      0.74      0.68      2840
       HTTPS       0.87      0.84      0.86      8117
         NTP       0.36      0.74      0.48       128
         RDP       0.54      0.81      0.65       126
         SSH       0.26      0.19      0.22        83
     Unknown       0.98      0.97      0.98     23856

    accuracy                           0.93     45873
   macro avg       0.66      0.75      0.69     45873
weighted avg       0.94      0.93      0.93     45873

Accuracy Score: 0.9312885575392933
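
The gap between the macro averages and the overall accuracy in the report above can be checked by hand. The sketch below recomputes the summary rows from the rounded per-class figures in the report; the variable names are illustrative, and because the inputs are already rounded to two decimals, the weighted values may differ from the report in the last digit.

```python
# (precision, recall, f1-score, support) per class, copied from the report above.
report = {
    "DNS":     (0.98, 0.97, 0.98, 10723),
    "HTTP":    (0.62, 0.74, 0.68, 2840),
    "HTTPS":   (0.87, 0.84, 0.86, 8117),
    "NTP":     (0.36, 0.74, 0.48, 128),
    "RDP":     (0.54, 0.81, 0.65, 126),
    "SSH":     (0.26, 0.19, 0.22, 83),
    "Unknown": (0.98, 0.97, 0.98, 23856),
}

rows = list(report.values())
total_support = sum(row[3] for row in rows)

# Macro average: every class counts equally, however rare it is.
macro_precision = sum(row[0] for row in rows) / len(rows)
macro_recall = sum(row[1] for row in rows) / len(rows)
macro_f1 = sum(row[2] for row in rows) / len(rows)

# Weighted average: each class counts in proportion to its support.
weighted_recall = sum(row[1] * row[3] for row in rows) / total_support

print(f"macro avg:       {macro_precision:.2f} {macro_recall:.2f} {macro_f1:.2f}")
print(f"weighted recall: {weighted_recall:.2f} over {total_support} samples")
```

The macro averages treat NTP, RDP, and SSH (337 test samples in total) the same as DNS and Unknown, so the weak scores on those rare classes pull the macro figures well below the accuracy. Support-weighted recall is mathematically equal to accuracy, which is why it matches the accuracy score up to rounding.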
