### Problem Statement


* Modern computer networks are constantly exposed to a wide range of cyber-attacks such as denial-of-service, probing, and privilege escalation attacks. Detecting malicious network traffic in real time is a critical requirement for ensuring the security and reliability of information systems.


* The objective of this project is to build and evaluate multiple machine learning classification models that can accurately distinguish between normal network traffic and malicious traffic using the UNSW-NB15 dataset. By comparing the performance of traditional machine learning models and ensemble techniques using standard evaluation metrics, this project aims to identify the most effective model for network intrusion detection.

In [22]:
# Import libraries and models
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC,SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score,roc_auc_score, matthews_corrcoef

In [2]:
#Loading Dataset

df=pd.read_csv('../data/UNSW_NB15_training-set.csv')
print(df.shape)
df.info(),df.columns

(175341, 36)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 175341 entries, 0 to 175340
Data columns (total 36 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   dur                175341 non-null  float64
 1   proto              175341 non-null  object 
 2   service            175341 non-null  object 
 3   state              175341 non-null  object 
 4   spkts              175341 non-null  int64  
 5   dpkts              175341 non-null  int64  
 6   sbytes             175341 non-null  int64  
 7   dbytes             175341 non-null  int64  
 8   rate               175341 non-null  float64
 9   sload              175341 non-null  float64
 10  dload              175341 non-null  float64
 11  sloss              175341 non-null  int64  
 12  dloss              175341 non-null  int64  
 13  sinpkt             175341 non-null  float64
 14  dinpkt             175341 non-null  float64
 15  sjit               175341 non-null  fl

(None,
 Index(['dur', 'proto', 'service', 'state', 'spkts', 'dpkts', 'sbytes',
        'dbytes', 'rate', 'sload', 'dload', 'sloss', 'dloss', 'sinpkt',
        'dinpkt', 'sjit', 'djit', 'swin', 'stcpb', 'dtcpb', 'dwin', 'tcprtt',
        'synack', 'ackdat', 'smean', 'dmean', 'trans_depth',
        'response_body_len', 'ct_src_dport_ltm', 'ct_dst_sport_ltm',
        'is_ftp_login', 'ct_ftp_cmd', 'ct_flw_http_mthd', 'is_sm_ips_ports',
        'attack_cat', 'label'],
       dtype='object'))

In [3]:
df.iloc[:10,:18]

Unnamed: 0,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,sload,dload,sloss,dloss,sinpkt,dinpkt,sjit,djit,swin
0,0.121478,tcp,-,FIN,6,4,258,172,74.08749,14158.942,8495.365,0,0,24.2956,8.375,30.177547,11.830604,255
1,0.649902,tcp,-,FIN,14,38,734,42014,78.47337,8395.112,503571.3,2,17,49.915,15.432865,61.426933,1387.7783,255
2,1.623129,tcp,-,FIN,8,16,364,13186,14.170161,1572.2719,60929.23,1,6,231.87556,102.737206,17179.586,11420.926,255
3,1.681642,tcp,ftp,FIN,12,12,628,770,13.677108,2740.179,3358.622,1,3,152.87654,90.235725,259.08017,4991.7847,255
4,0.449454,tcp,-,FIN,10,6,534,268,33.373825,8561.499,3987.0598,2,1,47.75033,75.6596,2415.8376,115.807,255
5,0.380537,tcp,-,FIN,10,6,534,268,39.41798,10112.025,4709.135,2,1,39.92878,52.241,2223.7302,82.5505,255
6,0.637109,tcp,-,FIN,10,8,534,354,26.683033,6039.783,3892.5837,2,1,68.26778,81.13771,4286.8286,119.42272,255
7,0.521584,tcp,-,FIN,10,8,534,354,32.593025,7377.5273,4754.747,2,1,55.794,66.05414,3770.5808,118.96263,255
8,0.542905,tcp,-,FIN,10,8,534,354,31.31303,7087.7964,4568.0186,2,1,60.210888,68.109,4060.6255,106.61155,255
9,0.258687,tcp,-,FIN,10,6,534,268,57.985134,14875.12,6927.291,2,1,27.505112,39.1068,1413.6864,57.200394,255


In [4]:
df.iloc[:10,18:37]

Unnamed: 0,stcpb,dtcpb,dwin,tcprtt,synack,ackdat,smean,dmean,trans_depth,response_body_len,ct_src_dport_ltm,ct_dst_sport_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,is_sm_ips_ports,attack_cat,label
0,621772692,2202533631,255,0.0,0.0,0.0,43,43,0,0,1,1,0,0,0,0,Normal,0
1,1417884146,3077387971,255,0.0,0.0,0.0,52,1106,0,0,1,1,0,0,0,0,Normal,0
2,2116150707,2963114973,255,0.111897,0.061458,0.050439,46,824,0,0,1,1,0,0,0,0,Normal,0
3,1107119177,1047442890,255,0.0,0.0,0.0,52,64,0,0,1,1,1,1,0,0,Normal,0
4,2436137549,1977154190,255,0.128381,0.071147,0.057234,53,45,0,0,2,1,0,0,0,0,Normal,0
5,3984155503,1796040391,255,0.172934,0.119331,0.053603,53,45,0,0,2,1,0,0,0,0,Normal,0
6,1787309226,1767180493,255,0.143337,0.069136,0.074201,53,44,0,0,1,1,0,0,0,0,Normal,0
7,205985702,316006300,255,0.116615,0.059195,0.05742,53,44,0,0,3,1,0,0,0,0,Normal,0
8,884094874,3410317203,255,0.118584,0.066133,0.052451,53,44,0,0,3,1,0,0,0,0,Normal,0
9,3368447996,584859215,255,0.087934,0.063116,0.024818,53,45,0,0,3,1,0,0,0,0,Normal,0


The label column was used as the target variable, where 0 represents normal network traffic and 1 represents malicious traffic. The attack_cat column was excluded to maintain a binary classification setup.

In [5]:
# Draw a stratified 20,000-row sample (class ratio preserved) for modeling

sampled_df,_ =train_test_split(df, train_size=20000,stratify=df['label'], random_state=42)


sampled_df["label"].value_counts(),sampled_df.shape

(label
 1    13612
 0     6388
 Name: count, dtype: int64,
 (20000, 36))
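The effect of `stratify` in the sampling step can be illustrated with a small sketch. The data below is synthetic (a hypothetical 68/32 label mix, not UNSW-NB15), and the names `toy_df` and `toy_sample` are only for illustration; the point is that the sampled subset keeps the class ratio of the full frame.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels (hypothetical): 680 attack rows (1) and 320 normal rows (0).
toy_df = pd.DataFrame({"label": [1] * 680 + [0] * 320})

# Same pattern as the notebook cell: keep the "train" part as the sample, discard the rest.
toy_sample, _ = train_test_split(
    toy_df, train_size=200, stratify=toy_df["label"], random_state=42
)

full_ratio = toy_df["label"].mean()        # share of attack rows in the full frame
sample_ratio = toy_sample["label"].mean()  # share of attack rows in the sample

print(len(toy_sample), full_ratio, sample_ratio)
```

Without `stratify`, the sample ratio would vary from draw to draw, which matters more as the sample gets smaller.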

In [6]:
target_column='label'
drop_columns=['id','attack_cat']  # listed for reference only; never used below because features are selected explicitly
num_features=["dur","sbytes","dbytes","spkts","dpkts","rate","sload","dload","sloss","dloss","sinpkt","dinpkt","sjit","djit","swin","dwin","tcprtt","synack","ackdat"]
cat_features=["proto","service","state"]
features=num_features+cat_features

In [7]:
# Split features and target
X = sampled_df[features]
y = sampled_df[target_column]
X.shape,y.shape

((20000, 22), (20000,))

Feature selection was performed by choosing a subset of relevant numerical and categorical attributes commonly used in network traffic analysis.

A total of 22 features (19 numerical and 3 categorical) were selected, while non-informative attributes and the multi-class attack_cat attribute were excluded to maintain a binary classification setup.

In [8]:

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

# Preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_features)
    ]
)

X_train.shape, X_test.shape

((16000, 22), (4000, 22))

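Numerical features were standardized using StandardScaler, while categorical features were encoded using OneHotEncoder.

A ColumnTransformer was used to apply the appropriate preprocessing step to each feature type. Because the preprocessor is placed inside each model's Pipeline, it is fitted on the training split only and then applied unchanged to the test split.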
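As a minimal sketch of what this preprocessing does, the toy example below (hypothetical values, two columns only) fits the same kind of ColumnTransformer on a few training rows and applies it to a test row whose category was never seen during fitting. With `handle_unknown="ignore"`, that category is encoded as all zeros instead of raising an error. The helper `to_dense` is an illustrative addition, not part of the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with hypothetical values: one numerical and one categorical column.
toy_train = pd.DataFrame({"dur": [0.1, 0.5, 0.9], "proto": ["tcp", "udp", "tcp"]})
toy_test = pd.DataFrame({"dur": [0.5], "proto": ["icmp"]})  # "icmp" never appears in toy_train

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["dur"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["proto"]),
    ]
)


def to_dense(matrix):
    """Return a dense array whether the transformer produced sparse or dense output."""
    return matrix.toarray() if hasattr(matrix, "toarray") else matrix


# Fit on the training rows only, then reuse the learned statistics on the test row.
train_dense = to_dense(toy_preprocessor.fit_transform(toy_train))
test_dense = to_dense(toy_preprocessor.transform(toy_test))

print(train_dense.shape)  # (3, 3): scaled dur plus one column each for tcp and udp
print(test_dense)         # scaled dur followed by zeros for the unseen protocol
```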

In [10]:
#Model-1

# Logistic Regression pipeline
log_reg = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression(max_iter=1000, n_jobs=-1))
    ]
)

# Train
log_reg.fit(X_train, y_train)

# Predict
y_pred_lr = log_reg.predict(X_test)

# Metrics
lr_metrics = {
    "Accuracy": accuracy_score(y_test, y_pred_lr),
    "Precision": precision_score(y_test, y_pred_lr),
    "Recall": recall_score(y_test, y_pred_lr),
    "F1-score": f1_score(y_test, y_pred_lr),
    "AUC": roc_auc_score(y_test, log_reg.predict_proba(X_test)[:, 1]),
    "MCC": matthews_corrcoef(y_test, y_pred_lr)
}

lr_metrics

{'Accuracy': 0.9165,
 'Precision': 0.9055706521739131,
 'Recall': 0.9794268919911829,
 'F1-score': 0.9410518884574656,
 'AUC': 0.9561892951307323,
 'MCC': 0.8059466920264895}

Logistic Regression was used as a baseline linear classification model to evaluate the effectiveness of the selected features in detecting network intrusion.
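The same six metrics are computed for every model in this notebook, so the metric dictionary could be factored into a helper. The sketch below is one possible way to do that; the function name `evaluate` and the synthetic data from `make_classification` are assumptions for illustration, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def evaluate(model, X_eval, y_eval):
    """Return the six metrics used in this notebook for a fitted binary classifier."""
    y_pred = model.predict(X_eval)
    y_score = model.predict_proba(X_eval)[:, 1]  # probability of the positive class
    return {
        "Accuracy": accuracy_score(y_eval, y_pred),
        "Precision": precision_score(y_eval, y_pred),
        "Recall": recall_score(y_eval, y_pred),
        "F1-score": f1_score(y_eval, y_pred),
        "AUC": roc_auc_score(y_eval, y_score),
        "MCC": matthews_corrcoef(y_eval, y_pred),
    }


# Synthetic stand-in data (not UNSW-NB15), only to show the helper running.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

demo_model = LogisticRegression(max_iter=1000).fit(Xd_train, yd_train)
demo_metrics = evaluate(demo_model, Xd_test, yd_test)
print(demo_metrics)
```

With such a helper, each model cell would reduce to building the pipeline, fitting it, and calling `evaluate(pipeline, X_test, y_test)`.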

In [18]:
#Model-2

dt = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", DecisionTreeClassifier(
            max_depth=10,
            random_state=42
        ))
    ]
)

dt.fit(X_train, y_train)

y_pred_dt = dt.predict(X_test)

dt_metrics = {
    "Accuracy": accuracy_score(y_test, y_pred_dt),
    "Precision": precision_score(y_test, y_pred_dt),
    "Recall": recall_score(y_test, y_pred_dt),
    "F1-score": f1_score(y_test, y_pred_dt),
    "AUC": roc_auc_score(y_test, dt.predict_proba(X_test)[:, 1]),
    "MCC": matthews_corrcoef(y_test, y_pred_dt)
}

dt_metrics

{'Accuracy': 0.9295,
 'Precision': 0.9224376731301939,
 'Recall': 0.9786921381337252,
 'F1-score': 0.9497326203208556,
 'AUC': 0.9737463477903916,
 'MCC': 0.836180825018848}

A Decision Tree classifier was trained to capture non-linear relationships in network traffic features.

In [20]:
#Model-3

rf = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", RandomForestClassifier(
            n_estimators=200,
            max_depth=15,
            random_state=42,
            n_jobs=-1
        ))
    ]
)

rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)

rf_metrics = {
    "Accuracy": accuracy_score(y_test, y_pred_rf),
    "Precision": precision_score(y_test, y_pred_rf),
    "Recall": recall_score(y_test, y_pred_rf),
    "F1-score": f1_score(y_test, y_pred_rf),
    "AUC": roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]),
    "MCC": matthews_corrcoef(y_test, y_pred_rf)}

rf_metrics

{'Accuracy': 0.93575,
 'Precision': 0.9190751445086706,
 'Recall': 0.9930198383541513,
 'F1-score': 0.9546176938018718,
 'AUC': 0.9868439964630628,
 'MCC': 0.8526587079576785}

Random Forest was employed as an ensemble learning method that averages many decision trees, with the aim of improving classification robustness and reducing the overfitting a single tree is prone to.

In [23]:
#Model-4

svm = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", SVC(kernel="linear", probability=True, random_state=42)
        )
    ]
)

svm.fit(X_train, y_train)

y_pred_svm = svm.predict(X_test)

svm_metrics = {
    "Accuracy": accuracy_score(y_test, y_pred_svm),
    "Precision": precision_score(y_test, y_pred_svm),
    "Recall": recall_score(y_test, y_pred_svm),
    "F1-score": f1_score(y_test, y_pred_svm),
    "AUC": roc_auc_score(y_test, svm.predict_proba(X_test)[:, 1]),
    "MCC": matthews_corrcoef(y_test, y_pred_svm)
}

svm_metrics


{'Accuracy': 0.92475,
 'Precision': 0.9082630691399662,
 'Recall': 0.9893460690668626,
 'F1-score': 0.9470722700896782,
 'AUC': 0.9141766674830598,
 'MCC': 0.8267533898767352}

A linear Support Vector Machine was used to evaluate margin-based classification performance on the intrusion detection task.
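`SVC(probability=True)` obtains its probabilities through an additional internal cross-validated calibration step, which adds training time. One possible alternative, not what this notebook ran, is `LinearSVC` (already imported above) with `decision_function` scores passed to `roc_auc_score`, since ROC AUC only needs a ranking score. The sketch below uses synthetic data and hypothetical variable names.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (not UNSW-NB15); one cluster per class keeps it roughly linearly separable.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    n_clusters_per_class=1, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

linear_svm = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", LinearSVC(max_iter=5000, random_state=42)),
    ]
)
linear_svm.fit(Xd_train, yd_train)

# Signed distances to the separating hyperplane; larger means "more likely positive".
svm_scores = linear_svm.decision_function(Xd_test)
svm_auc = roc_auc_score(yd_test, svm_scores)
print(svm_auc)
```

The trade-off is that `LinearSVC` exposes no `predict_proba`, so any code that needs calibrated probabilities would still require a separate calibration step.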

In [24]:
#Model-5

knn = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", KNeighborsClassifier(
            n_neighbors=5,
            n_jobs=-1
        ))
    ]
)

knn.fit(X_train, y_train)

y_pred_knn = knn.predict(X_test)

knn_metrics = {
    "Accuracy": accuracy_score(y_test, y_pred_knn),
    "Precision": precision_score(y_test, y_pred_knn),
    "Recall": recall_score(y_test, y_pred_knn),
    "F1-score": f1_score(y_test, y_pred_knn),
    "AUC": roc_auc_score(y_test, knn.predict_proba(X_test)[:, 1]),
    "MCC": matthews_corrcoef(y_test, y_pred_knn)
}

knn_metrics

{'Accuracy': 0.92275,
 'Precision': 0.9289015286171347,
 'Recall': 0.9599559147685526,
 'F1-score': 0.9441734417344173,
 'AUC': 0.9625722536706072,
 'MCC': 0.8200952088412745}

K-Nearest Neighbors was applied as a distance-based classifier to analyze local similarity patterns in network traffic.

In [25]:
#Model-6

# Dense encoder for Naive Bayes
nb_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_features),
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat_features)
    ]
)

nb = Pipeline(
    steps=[
        ("preprocessor", nb_preprocessor),
        ("classifier", GaussianNB())
    ]
)

# Train
nb.fit(X_train, y_train)

# Predict
y_pred_nb = nb.predict(X_test)

# Metrics
nb_metrics = {
    "Accuracy": accuracy_score(y_test, y_pred_nb),
    "Precision": precision_score(y_test, y_pred_nb),
    "Recall": recall_score(y_test, y_pred_nb),
    "F1-score": f1_score(y_test, y_pred_nb),
    "AUC": roc_auc_score(y_test, nb.predict_proba(X_test)[:, 1]),
    "MCC": matthews_corrcoef(y_test, y_pred_nb)
}

nb_metrics

{'Accuracy': 0.48425,
 'Precision': 1.0,
 'Recall': 0.24210139603232916,
 'F1-score': 0.38982549541555755,
 'AUC': 0.6312834965544758,
 'MCC': 0.30431673513896346}

Naive Bayes was included as a probabilistic baseline model assuming conditional independence among features.
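The combination of perfect precision and very low recall can be made concrete with a little arithmetic. The confusion-matrix counts below are inferred from the reported metrics, assuming the 4,000-row test split holds 2,722 attack rows and 1,278 normal rows; they are not printed by the notebook. Under that assumption, the model flags only 659 attacks, never flags a normal row, and misses 2,063 attacks.

```python
from math import sqrt

# Counts inferred from the reported Naive Bayes metrics (an assumption, not notebook output).
tp, fn = 659, 2063  # attack rows flagged / attack rows missed
tn, fp = 1278, 0    # normal rows passed / normal rows wrongly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)
mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(accuracy, precision, recall, f1, mcc)
```

These values reproduce the reported accuracy (0.48425), recall (about 0.2421), F1-score (about 0.3898), and MCC (about 0.3043), which shows that the precision of 1.0 reflects an overly conservative classifier, not a strong one.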

Up to this point, six machine learning models have been evaluated on a 20,000-row stratified sample of the UNSW-NB15 training set. Among them, Random Forest achieved the best accuracy (0.936), recall (0.993), F1-score (0.955), AUC (0.987), and MCC (0.853), ahead of the linear and probabilistic models, while Naive Bayes performed worst (accuracy 0.484). These figures come from a single 80/20 split, so they indicate rather than prove better generalization.

In [29]:
#Model-7

xgb = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", XGBClassifier(
            n_estimators=200,
            max_depth=6,
            learning_rate=0.1,
            subsample=0.8,
            colsample_bytree=0.8,
            eval_metric="logloss",
            random_state=42,
            n_jobs=-1
        ))
    ]
)

# Train
xgb.fit(X_train, y_train)

# Predict
y_pred_xgb = xgb.predict(X_test)
y_prob_xgb = xgb.predict_proba(X_test)[:, 1]

xgb_metrics = {
    "Accuracy": accuracy_score(y_test, y_pred_xgb),
    "Precision": precision_score(y_test, y_pred_xgb),
    "Recall": recall_score(y_test, y_pred_xgb),
    "F1-score": f1_score(y_test, y_pred_xgb),
    "AUC": roc_auc_score(y_test, y_prob_xgb),
    "MCC": matthews_corrcoef(y_test, y_pred_xgb)
}

xgb_metrics

{'Accuracy': 0.93975,
 'Precision': 0.9413020277481323,
 'Recall': 0.9720793534166055,
 'F1-score': 0.9564431592264594,
 'AUC': 0.9878670750932241,
 'MCC': 0.8600049742699041}

In [28]:
results_df = pd.DataFrame.from_dict(
    {
        "Logistic Regression": lr_metrics,
        "Decision Tree": dt_metrics,
        "Random Forest": rf_metrics,
        "SVM": svm_metrics,
        "KNN": knn_metrics,
        "Naive Bayes": nb_metrics,
        "XGBoost":xgb_metrics
    },
    orient="index"
)

results_df

| Model | Accuracy | Precision | Recall | F1-score | AUC | MCC |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.9165 | 0.905571 | 0.979427 | 0.941052 | 0.956189 | 0.805947 |
| Decision Tree | 0.9295 | 0.922438 | 0.978692 | 0.949733 | 0.973746 | 0.836181 |
| Random Forest | 0.93575 | 0.919075 | 0.99302 | 0.954618 | 0.986844 | 0.852659 |
| SVM | 0.92475 | 0.908263 | 0.989346 | 0.947072 | 0.914177 | 0.826753 |
| KNN | 0.92275 | 0.928902 | 0.959956 | 0.944173 | 0.962572 | 0.820095 |
| Naive Bayes | 0.48425 | 1.0 | 0.242101 | 0.389825 | 0.631283 | 0.304317 |
| XGBoost | 0.93975 | 0.941302 | 0.972079 | 0.956443 | 0.987867 | 0.860005 |


### Model Comparison

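XGBoost achieves the highest accuracy (0.940), F1-score (0.956), AUC (0.988),
and MCC (0.860) of the seven models, and it beats KNN on every metric. It does
not lead everywhere: Random Forest has the highest recall (0.993), and Naive
Bayes reaches a precision of 1.0, but only because it flags so few attacks
that its recall falls to 0.242. XGBoost's ability to model non-linear
relationships and feature interactions is a plausible reason for its strong
results on the UNSW-NB15 dataset.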
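A per-metric comparison can be sketched directly from the results table. The values below are copied from that table as displayed (rounded to six decimals); the variable names `results`, `best_per_metric`, and `ranked` are illustrative.

```python
import pandas as pd

metric_names = ["Accuracy", "Precision", "Recall", "F1-score", "AUC", "MCC"]

# Values copied from the results table, in the column order given by metric_names.
results = pd.DataFrame.from_dict(
    {
        "Logistic Regression": [0.9165, 0.905571, 0.979427, 0.941052, 0.956189, 0.805947],
        "Decision Tree": [0.9295, 0.922438, 0.978692, 0.949733, 0.973746, 0.836181],
        "Random Forest": [0.93575, 0.919075, 0.99302, 0.954618, 0.986844, 0.852659],
        "SVM": [0.92475, 0.908263, 0.989346, 0.947072, 0.914177, 0.826753],
        "KNN": [0.92275, 0.928902, 0.959956, 0.944173, 0.962572, 0.820095],
        "Naive Bayes": [0.48425, 1.0, 0.242101, 0.389825, 0.631283, 0.304317],
        "XGBoost": [0.93975, 0.941302, 0.972079, 0.956443, 0.987867, 0.860005],
    },
    orient="index",
    columns=metric_names,
)

# Name of the best-scoring model for each metric.
best_per_metric = results.idxmax()
print(best_per_metric)

# Overall ranking by MCC, which accounts for all four confusion-matrix cells.
ranked = results.sort_values("MCC", ascending=False)
print(ranked.index.tolist())
```

Ranking by MCC is a design choice made here for illustration; with roughly two attack rows for every normal row in the sample, MCC is less flattered by the majority class than accuracy is.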

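The results suggest that ensemble methods, gradient boosting in particular,
are well suited to this intrusion detection task. They come from a single
80/20 split of a 20,000-row sample, so the small margins between the top
models (for example, 0.940 versus 0.936 accuracy for XGBoost and Random
Forest) should be read with caution.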