# Information Security

A company running an online service with heavy inbound traffic has a dedicated security department that filters and analyzes traffic. The department has asked for help automating the detection of anomalous and malicious traffic. Your task is to develop a model that classifies traffic as normal or malicious, covering attack types such as DDoS, SQL injection, brute force, malware, and so on.

___

**Task:**

- Develop a model that classifies traffic as normal or malicious. The model must be as accurate as possible, because the cost of an error can be very high.
- Evaluate the model with several classification metrics: precision, recall, f1_score, accuracy.

___

The target is the `Label` column. The value **BENIGN** denotes normal traffic; the other 14 values are different types of malicious traffic.

## Loading the data

### Importing the libraries

In [2]:
import pandas as pd
import numpy as np
import re

import ydata_profiling

import matplotlib.pyplot as plt
import seaborn as sns

import phik
from sklearn.preprocessing import LabelEncoder

from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

from IPython.display import display, HTML

pd.set_option('display.max_columns', None)

display(HTML("<style>.container { width:80% !important; }</style>"))

import warnings
warnings.filterwarnings('ignore')



### Reading the file

In [3]:
df = pd.read_csv('network_traffic_data.csv')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 539616 entries, 0 to 539615
Data columns (total 79 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0    Destination Port             539616 non-null  int64  
 1    Flow Duration                539616 non-null  int64  
 2    Total Fwd Packets            539616 non-null  int64  
 3    Total Backward Packets       539616 non-null  int64  
 4   Total Length of Fwd Packets   539616 non-null  int64  
 5    Total Length of Bwd Packets  539616 non-null  int64  
 6    Fwd Packet Length Max        539616 non-null  int64  
 7    Fwd Packet Length Min        539616 non-null  int64  
 8    Fwd Packet Length Mean       539616 non-null  float64
 9    Fwd Packet Length Std        539616 non-null  float64
 10  Bwd Packet Length Max         539616 non-null  int64  
 11   Bwd Packet Length Min        539616 non-null  int64  
 12   Bwd Packet Length Mean       539616 non-nul

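Reduce the dataset's memory footprint by casting columns from 64-bit to 16-bit types. This cast is lossy: integers outside the `int16` range (-32768 to 32767) wrap around, and floats above the `float16` maximum (65504) become `inf`.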

In [5]:
float64_features = df.select_dtypes(include='float64').columns.to_list()
int64_features = df.select_dtypes(include='int64').columns.to_list()

df[float64_features] = df[float64_features].astype('float16')
df[int64_features] = df[int64_features].astype('int16')
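A less lossy way to save memory is to let pandas pick the narrowest dtype that still holds every value. The sketch below is an illustration on a toy frame, not part of the original notebook: the helper name `downcast_numeric` and the sample values are assumptions.

```python
import numpy as np
import pandas as pd


def downcast_numeric(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose numeric columns use the smallest dtype that fits the data."""
    out = frame.copy()
    for col in out.select_dtypes(include=np.integer).columns:
        # Picks int8/int16/int32/int64 depending on the actual value range.
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=np.floating).columns:
        # downcast="float" never goes below float32.
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Hypothetical toy data: "port" does not fit into int16, "small" fits into int8.
demo = pd.DataFrame({
    "small": np.array([1, 2, 3], dtype="int64"),
    "port": np.array([80, 443, 60000], dtype="int64"),
    "rate": np.array([0.5, 1.5, 2.5], dtype="float64"),
})
compact = downcast_numeric(demo)
print(compact.dtypes)
```

With this approach a value such as port 60000 keeps its value in a wider integer type instead of wrapping to a negative number.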

Let's look at the data

In [6]:
def print_dataset_info(df):
    print("\nFirst 5 rows of the dataset:")
    display(df.head())
    #print("\nDescriptive statistics:")
    #display(df.describe())
    print("\nNumber of missing values:")
    display(df.isna().sum().sum())

In [7]:
print_dataset_info(df)


First 5 rows of the dataset:


Unnamed: 0,Destination Port,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Min Packet Length,Max Packet Length,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Avg Fwd Segment Size,Avg Bwd Segment Size,Fwd Header Length.1,Fwd Avg Bytes/Bulk,Fwd Avg Packets/Bulk,Fwd Avg Bulk Rate,Bwd Avg Bytes/Bulk,Bwd Avg Packets/Bulk,Bwd Avg Bulk Rate,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,Init_Win_bytes_forward,Init_Win_bytes_backward,act_data_pkt_fwd,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,80,-24950,3,1,12,0,6,0,4.0,3.464844,0,0,0.0,0.0,2.189453,0.72998,inf,inf,3316,101,-24950,inf,inf,3316,-28266,0,0.0,0.0,0,0,0,0,0,0,72,32,0.547363,0.182495,0,6,2.400391,3.287109,10.796875,0,0,0,1,0,0,0,0,0,3.0,4.0,0.0,72,0,0,0,0,0,0,3,12,1,0,8192,-22756,2,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
1,443,-8919,9,10,703,3950,267,0,78.125,103.3125,1448,0,395.0,587.5,6536.0,26.6875,39552.0,50144.0,-10571,1,26477,inf,inf,19006,230,27050,inf,inf,3228,1,0,0,0,0,296,328,12.640625,14.046875,0,1448,232.625,442.75,inf,0,0,0,1,0,0,0,0,1,244.875,78.125,395.0,296,0,0,0,0,0,0,9,703,10,3950,29200,252,4,32,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
2,53,22326,2,2,80,224,40,40,40.0,0.0,112,112,112.0,0.0,1982.0,26.078125,51136.0,inf,22319,3,3,3.0,0.0,3,3,4,4.0,0.0,4,4,0,0,0,0,40,40,13.039062,13.039062,40,112,68.8125,39.4375,1555.0,0,0,0,0,0,0,0,0,1,86.0,40.0,112.0,40,0,0,0,0,0,0,2,80,2,224,-1,-1,1,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
3,53,-7876,1,1,46,128,46,46,46.0,0.0,128,128,128.0,0.0,3018.0,34.6875,57664.0,0.0,-7876,-7876,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,17.34375,17.34375,46,128,73.3125,47.34375,2242.0,0,0,0,0,0,0,0,0,1,110.0,46.0,128.0,20,0,0,0,0,0,0,1,46,1,128,-1,-1,0,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
4,8446,767,3,1,43,6,31,6,14.335938,14.4375,6,6,6.0,0.0,63872.0,5216.0,255.625,394.25,710,3,713,356.5,500.0,710,3,0,0.0,0.0,0,0,0,0,0,0,60,20,3912.0,1304.0,6,31,11.0,11.179688,125.0,0,0,0,0,1,0,0,0,0,13.75,14.335938,6.0,60,0,0,0,0,0,0,3,43,1,6,1017,0,2,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN



Number of missing values:


488

The data contains infinite values, which prevent summary statistics from being computed.

Let's look at the class distribution of the target

In [8]:
df['Label'].value_counts(normalize=True) * 100

BENIGN                        44.476072
DoS Hulk                      21.491950
PortScan                      14.762350
DDoS                          11.928112
DoS GoldenEye                  1.907468
FTP-Patator                    1.471046
SSH-Patator                    1.092814
DoS slowloris                  1.074097
DoS Slowhttptest               1.019058
Bot                            0.364333
Web Attack – Brute Force       0.279273
Web Attack – XSS               0.120827
Infiltration                   0.006671
Web Attack – Sql Injection     0.003892
Heartbleed                     0.002038
Name: Label, dtype: float64

About 44.5% of all traffic is benign; the rest is malicious.
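Since the task statement asks for a normal-versus-malicious decision, the 15 labels can be collapsed into a binary target. The following is a minimal sketch on a hypothetical toy series; the name `is_attack` is an assumption and does not appear in the notebook.

```python
import pandas as pd

# Hypothetical sample of label values; the real column has 15 distinct classes.
labels = pd.Series(["BENIGN", "DDoS", "PortScan", "BENIGN", "Bot"], name="Label")

# 0 = normal traffic, 1 = any kind of attack.
is_attack = (labels != "BENIGN").astype("int8")

# Share of malicious rows in the sample.
attack_share = is_attack.mean()
print(is_attack.tolist(), attack_share)
```

Such a column can serve as an alternative target or simply as a way to report binary metrics next to the per-class ones.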

## Preprocessing

Convert the column names to snake case

In [9]:
df.columns = df.columns.str.strip().str.replace(' ', '_').str.lower()

The summary statistics could not be computed because of the infinite values. Find the features that contain them

In [10]:
columns_with_infinity = df.columns[df.isin([np.inf, -np.inf]).any()]
columns_with_infinity

Index(['flow_bytes/s', 'flow_packets/s', 'flow_iat_mean', 'flow_iat_std',
       'fwd_iat_mean', 'fwd_iat_std', 'bwd_iat_mean', 'bwd_iat_std',
       'fwd_packets/s', 'bwd_packets/s', 'packet_length_variance',
       'active_mean', 'active_std', 'idle_mean', 'idle_std'],
      dtype='object')

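Replace positive infinities with the column maxima (the 488 values that were already missing are filled with the maxima too, because the infinities are first converted to `NaN`)...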

In [11]:
df.replace(np.inf, np.nan, inplace=True)

max_values = df.max()
df.fillna(max_values, inplace=True)

... and negative infinities with the column minima.

In [12]:
df.replace(-np.inf, np.nan, inplace=True)

min_values = df.min()
df.fillna(min_values, inplace=True)
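If the missing values should stay missing, the infinities can be replaced column by column without going through `NaN`. The sketch below shows one possible variant on toy data; the helper `clip_infinities` is a hypothetical name, and the choice to leave `NaN` untouched is an assumption rather than the notebook's behaviour.

```python
import numpy as np
import pandas as pd


def clip_infinities(frame: pd.DataFrame) -> pd.DataFrame:
    """Replace +inf/-inf with the largest/smallest finite value of each numeric column."""
    out = frame.copy()
    for col in out.select_dtypes(include="number").columns:
        values = out[col]
        finite = values[np.isfinite(values)]  # drops inf, -inf and NaN
        if finite.empty:
            continue  # nothing to derive a replacement from
        out[col] = values.replace({np.inf: finite.max(), -np.inf: finite.min()})
    return out


# Hypothetical column with both infinities and one genuinely missing value.
demo = pd.DataFrame({"flow_rate": [1.0, np.inf, -np.inf, np.nan, 5.0]})
cleaned = clip_infinities(demo)
print(cleaned)
```

The genuinely missing entry remains `NaN`, so it can be handled by a separate, explicit imputation step.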

In [13]:
df.isna().sum().sum()

0

In [14]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
destination_port,539616.0,-391.477656,5788.539066,-32768.0,53.00,80.0,259.00,32767.0
flow_duration,539616.0,1075.061201,15011.004092,-32768.0,-4004.25,58.0,7505.25,32767.0
total_fwd_packets,539616.0,5.512726,69.155087,1.0,1.00,2.0,6.00,19449.0
total_backward_packets,539616.0,5.135350,88.388878,-359.0,1.00,2.0,5.00,28467.0
total_length_of_fwd_packets,539616.0,320.064896,1372.159056,-32629.0,2.00,44.0,319.00,32741.0
...,...,...,...,...,...,...,...,...
active_min,539616.0,865.377452,6606.775359,-32768.0,0.00,0.0,0.00,32764.0
idle_mean,539616.0,0.000000,0.000000,0.0,0.00,0.0,0.00,0.0
idle_std,539616.0,,,0.0,0.00,0.0,0.00,65504.0
idle_max,539616.0,-933.267961,10173.583493,-32768.0,0.00,0.0,0.00,32763.0


## Splitting into train and test sets

In [15]:
features = df.drop(['label'], axis=1)
target = df['label']

In [16]:
label_encoder = LabelEncoder()
target = label_encoder.fit_transform(target)

In [17]:
np.unique(target)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
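The integer codes follow the sorted order of the class names, which `LabelEncoder` stores in `classes_`; the order of first appearance (what `unique()` returns) generally differs. A small sketch on hypothetical labels shows how to recover the mapping:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, deliberately not in alphabetical order.
raw_labels = ["PortScan", "BENIGN", "Bot", "BENIGN", "DDoS"]

encoder = LabelEncoder()
codes = encoder.fit_transform(raw_labels)

# classes_[i] is the original name of the class encoded as i.
code_to_name = {code: str(name) for code, name in enumerate(encoder.classes_)}
print(codes.tolist())
print(code_to_name)
```

Any list of display names passed to a report should therefore be taken from `classes_`, so that row i describes the class encoded as i.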

In [18]:
features_train, features_test, target_train, target_test = train_test_split(
    features,
    target,
    test_size=0.25,
    stratify=target,
    random_state=42)

## Model training

In [19]:
catboost_model = CatBoostClassifier(loss_function='MultiClass', verbose=25, random_state=42)

param_grid = {
    'iterations': [300],
    'depth': [6, 8],
    'learning_rate': [0.01, 0.05],
}

grid_search = GridSearchCV(estimator=catboost_model, param_grid=param_grid, cv=2, verbose=25)

grid_search.fit(features_train, target_train)

print("Best parameters:", grid_search.best_params_)

best_model = grid_search.best_estimator_

predictions = best_model.predict(features_test)

Fitting 2 folds for each of 4 candidates, totalling 8 fits
[CV 1/2; 1/4] START depth=6, iterations=300, learning_rate=0.01.................
0:	learn: 2.5912722	total: 835ms	remaining: 4m 9s
25:	learn: 1.3917325	total: 18.2s	remaining: 3m 11s
50:	learn: 0.9661156	total: 35.4s	remaining: 2m 52s
75:	learn: 0.7225102	total: 53.1s	remaining: 2m 36s
100:	learn: 0.5614325	total: 1m 10s	remaining: 2m 19s
125:	learn: 0.4471444	total: 1m 28s	remaining: 2m 2s
150:	learn: 0.3621113	total: 1m 44s	remaining: 1m 42s
175:	learn: 0.2981684	total: 1m 59s	remaining: 1m 24s
200:	learn: 0.2480679	total: 2m 15s	remaining: 1m 6s
225:	learn: 0.2090228	total: 2m 30s	remaining: 49.3s
250:	learn: 0.1786596	total: 2m 48s	remaining: 32.8s
275:	learn: 0.1539494	total: 3m 5s	remaining: 16.1s
299:	learn: 0.1338867	total: 3m 20s	remaining: 0us
[CV 1/2; 1/4] END depth=6, iterations=300, learning_rate=0.01;, score=0.983 total time= 3.4min
[CV 2/2; 1/4] START depth=6, iterations=300, learning_rate=0.01.................
0

In [20]:
precision = precision_score(target_test, predictions, average='weighted')
recall = recall_score(target_test, predictions, average='weighted')
f1 = f1_score(target_test, predictions, average='weighted')
accuracy = accuracy_score(target_test, predictions)

print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')
print(f'Accuracy: {accuracy:.4f}')

Precision: 0.9975
Recall: 0.9974
F1 Score: 0.9970
Accuracy: 0.9974
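Weighted averages are dominated by the large classes, so with a distribution this imbalanced they can hide poor results on rare attacks. The toy example below (hypothetical labels, not the notebook's data) compares weighted and macro recall when a rare class is mostly missed:

```python
from sklearn.metrics import recall_score

# 95 samples of a frequent class (0) and 5 samples of a rare class (1).
y_true = [0] * 95 + [1] * 5
# The hypothetical model finds every frequent sample but only 1 of 5 rare ones.
y_pred = [0] * 95 + [1] + [0] * 4

weighted_recall = recall_score(y_true, y_pred, average="weighted")
macro_recall = recall_score(y_true, y_pred, average="macro")

print(f"Weighted recall: {weighted_recall:.2f}")
print(f"Macro recall:    {macro_recall:.2f}")
```

Here the weighted recall is 0.96 while the macro recall is 0.60, so reporting `average='macro'` alongside the weighted figures gives a more cautious view of the rare classes.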


In [21]:
# LabelEncoder assigns codes in sorted class order, so take the names from classes_
target_names = list(label_encoder.classes_)

In [22]:
print(classification_report(target_test, predictions, target_names=target_names, zero_division=0))

                            precision    recall  f1-score   support

                    BENIGN       1.00      1.00      1.00     60000
                       Bot       0.96      0.96      0.96       492
                      DDoS       1.00      1.00      1.00     16091
             DoS GoldenEye       0.99      1.00      0.99      2573
                  DoS Hulk       1.00      1.00      1.00     28994
          DoS Slowhttptest       1.00      0.99      0.99      1375
             DoS slowloris       1.00      0.99      1.00      1449
               FTP-Patator       1.00      1.00      1.00      1984
                Heartbleed       1.00      1.00      1.00         3
              Infiltration       1.00      0.78      0.88         9
                  PortScan       1.00      1.00      1.00     19915
               SSH-Patator       1.00      0.99      1.00      1474
  Web Attack – Brute Force       0.71      0.98      0.82       377
Web Attack – Sql Injection       0.00      0.00

In [23]:
best_model.save_model('best_catboost_model.bin')