# Model Training
-----

Target metrics for the model:

- High true positive rate (TPR, detection rate)
- Low false negative rate (FNR)
- Low false positive rate (FPR)
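
As a quick reference, the three rates listed above can be computed from the confusion counts of a single class. This is a minimal sketch; the function name `detection_rates` and the counts passed to it are made up for illustration and do not come from the dataset.

```python
def detection_rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return TPR, FNR and FPR for one class from its confusion counts."""
    positives = tp + fn   # samples that truly belong to the class
    negatives = fp + tn   # samples that truly do not
    return {
        "tpr": tp / positives,   # detection rate (recall)
        "fnr": fn / positives,   # missed detections, equal to 1 - TPR
        "fpr": fp / negatives,   # false alarms among true negatives
    }


# Hypothetical counts, for illustration only
rates = detection_rates(tp=80, fp=5, tn=895, fn=20)
print(rates)
```

Since TPR and FNR share the same denominator, they always sum to 1, so raising the detection rate and lowering the FNR are the same goal.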

In [4]:
import pandas as pd
import glob

data_csv = glob.glob('../data/*.csv')
df1 = pd.concat((pd.read_csv(file) for file in data_csv), ignore_index=True)
df1.head()

Unnamed: 0,Destination Port,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,...,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,22,166,1,1,0,0,0,0,0.0,0.0,...,32,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
1,60148,83,1,2,0,0,0,0,0.0,0.0,...,32,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
2,123,99947,1,1,48,48,48,48,48.0,0.0,...,40,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
3,123,37017,1,1,48,48,48,48,48.0,0.0,...,32,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
4,0,111161336,147,0,0,0,0,0,0.0,0.0,...,0,1753752.625,2123197.578,4822992,95,9463032.7,2657727.996,13600000,5700287,BENIGN
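
`glob.glob` returns paths in arbitrary order, and `pd.concat` raises a `ValueError` when the pattern matches nothing. A slightly more defensive loader might look like the sketch below; `load_csvs` is a hypothetical helper, not part of the notebook.

```python
import glob

import pandas as pd


def load_csvs(pattern: str) -> pd.DataFrame:
    """Concatenate every CSV matching `pattern` into one DataFrame."""
    paths = sorted(glob.glob(pattern))  # sorted so row order is stable across runs
    if not paths:
        raise FileNotFoundError(f"no CSV files match {pattern!r}")
    return pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
```

With this helper, the load step would read `df1 = load_csvs('../data/*.csv')`.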


Explore the data and check for missing values

In [5]:
print(df1.isnull().sum())

 Destination Port              0
 Flow Duration                 0
 Total Fwd Packets             0
 Total Backward Packets        0
Total Length of Fwd Packets    0
                              ..
Idle Mean                      0
 Idle Std                      0
 Idle Max                      0
 Idle Min                      0
 Label                         0
Length: 79, dtype: int64
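
The printed listing is truncated, so columns with non-zero counts can be hidden behind the `..` row. Also, `isnull()` does not flag infinite values. The sketch below, on a toy frame with made-up column names, shows one way to list only the affected columns and to count infinities separately.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df1; "rate_a" and "rate_b" are illustrative names
df = pd.DataFrame({
    "Flow Duration": [166.0, 83.0, 99947.0],
    "rate_a": [1.0, np.nan, 3.0],   # one missing value
    "rate_b": [np.inf, 2.0, 3.0],   # one infinite value
})

missing = df.isnull().sum()
print(missing[missing > 0])         # only columns with at least one NaN

# isnull() ignores infinities, so count them on the numeric columns
numeric = df.select_dtypes(include="number")
inf_counts = np.isinf(numeric).sum()
print(inf_counts[inf_counts > 0])   # only columns with at least one inf
```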


Strip spaces from column names, prepare features (X) and Labels (Y)

In [6]:
df1.columns = df1.columns.str.strip()

In [7]:
x = df1.drop(columns=['Label'])
y = df1['Label']

Encode Labels

In [8]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_encoded = le.fit_transform(y)
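
`LabelEncoder` assigns integer codes in sorted order of the label names, so the class indices can be mapped back to names through `le.classes_`. A small sketch with toy labels (only `BENIGN` appears in the output above; the other two names are placeholders):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["BENIGN", "DDoS", "BENIGN", "PortScan"]  # illustrative labels
le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ is sorted; the position of each name is its integer code
mapping = {name: int(code) for code, name in enumerate(le.classes_)}
print(mapping)
print(le.inverse_transform(encoded))  # back to the original strings
```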

Split dataset

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    x, y_encoded, test_size=0.3, random_state=42
)
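
With heavily imbalanced classes, a plain random split can leave very few samples of a rare class in one of the two sets. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both sets. A sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 90 samples of class 0, 10 samples of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Both splits keep the 9:1 class ratio of the full data
print(np.bincount(y_tr), np.bincount(y_te))
```

Stratification requires at least two samples per class, so extremely rare classes may need to be merged or handled separately first.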

We ran into a problem with inf and -inf values in the dataset; the cells below detect them and clean them up

In [10]:
import numpy as np

print("Any infinity in X_train:", np.isinf(X_train).values.any())
print("Any NaN in X_train:", np.isnan(X_train).values.any())


Any infinity in X_train: True
Any NaN in X_train: True


In [11]:
X_train.replace([np.inf, -np.inf], np.nan, inplace=True)
X_test.replace([np.inf, -np.inf], np.nan, inplace=True)


With inf and -inf replaced by NaN, fill the remaining NaN values with the mean of each column

In [12]:
X_train.fillna(X_train.mean(), inplace=True)
X_test.fillna(X_test.mean(), inplace=True)
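
The cell above fills the test set with its own column means. A common alternative, shown here only as a sketch and not as what this notebook does, is to compute the means on the training set and reuse them for the test set, so that no statistic derived from test data enters preprocessing. The `_demo` frames and the `rate` column are toy stand-ins.

```python
import numpy as np
import pandas as pd

X_train_demo = pd.DataFrame({"rate": [1.0, np.inf, 3.0]})
X_test_demo = pd.DataFrame({"rate": [np.nan, 10.0]})

# Turn infinities into NaN so they are treated as missing
X_train_demo = X_train_demo.replace([np.inf, -np.inf], np.nan)
X_test_demo = X_test_demo.replace([np.inf, -np.inf], np.nan)

# Means come from the training data only (NaN is skipped by default)
train_means = X_train_demo.mean()
X_train_demo = X_train_demo.fillna(train_means)
X_test_demo = X_test_demo.fillna(train_means)
print(X_test_demo)
```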


Train and evaluate the model

In [13]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)


In [14]:
from sklearn.metrics import classification_report, accuracy_score

y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00    261441
           1       0.90      0.79      0.84       616
           2       1.00      1.00      1.00     38251
           3       1.00      0.50      0.67        12
           4       0.99      1.00      1.00     47683
           5       0.74      0.81      0.77       453
           6       1.00      0.14      0.25         7
           7       0.44      0.32      0.37       201

    accuracy                           1.00    348664
   macro avg       0.88      0.69      0.74    348664
weighted avg       1.00      1.00      1.00    348664

Accuracy: 0.997384301218365
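
The report above gives recall (the true positive rate) per class, but not the false positive rate, and the overall accuracy hides the low recall of the rare classes. Per-class FNR and FPR can be derived one-vs-rest from a confusion matrix. The sketch below uses a made-up 3x3 matrix in the scikit-learn layout (rows are true classes, columns are predicted classes); in the notebook such a matrix could be obtained with `sklearn.metrics.confusion_matrix(y_test, y_pred)`.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = np.array([
    [50,  2, 0],
    [ 3, 10, 1],
    [ 0,  1, 5],
])

tp = np.diag(cm)
fn = cm.sum(axis=1) - tp        # true members of the class predicted as another class
fp = cm.sum(axis=0) - tp        # other classes predicted as this class
tn = cm.sum() - tp - fn - fp    # everything else

fnr = fn / (fn + tp)            # equals 1 - recall
fpr = fp / (fp + tn)

for cls, (miss, alarm) in enumerate(zip(fnr, fpr)):
    print(f"class {cls}: FNR={miss:.3f}  FPR={alarm:.3f}")
```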


In [None]:
import joblib
joblib.dump(model, '../ai_engine/model2.pkl')

['../ai_engine/model2.pkl']
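
The saved file holds only the classifier, so whatever loads it also needs the label encoding to turn predicted integers back into label names. One possible approach, sketched here with toy data and a hypothetical bundle layout (the notebook itself saves only the model), is to store the model and the encoder together.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the notebook's encoder and model
le = LabelEncoder()
y = le.fit_transform(["BENIGN", "BENIGN", "attack", "attack"])
X = [[0.0], [0.1], [5.0], [5.1]]
model = RandomForestClassifier(n_estimators=5, random_state=42)
model.fit(X, y)

# Hypothetical bundle: model and encoder saved in one file
path = os.path.join(tempfile.mkdtemp(), "bundle.pkl")
joblib.dump({"model": model, "label_encoder": le}, path)

# Later, in the consuming code
bundle = joblib.load(path)
pred = bundle["model"].predict(X)
print(bundle["label_encoder"].inverse_transform(pred))
```

The loaded model expects the same feature columns, in the same order, as the frame it was trained on.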