This notebook aims to use pycaret on the CICDDoS2019 dataset split according to the [original release paper](https://ieeexplore.ieee.org/abstract/document/8888419).

we show the proportions of Anomaly/Total data points caluclated from inputting the following into the excel sheet\
`=COUNTIF(CK:CK,"Syn")/(COUNTIF(CK:CK,"Syn")+COUNTIF(CK:CK,"BENIGN"))`
loading the large csv into the data frame took too much time so this faster method was used.

01-12 folder as the training data
```
['DrDoS_SSDP.csv': 99.971%
'DrDoS_NTP.csv': 98.7062%
'TFTP.csv', 
'UDPLag.csv' : 98.9991% of 370166 rows
'DrDoS_UDP.csv', 99.8919% of 1048575 rows
'Syn.csv' : 99.9966%
'DrDoS_MSSQL.csv', 
'DrDoS_SNMP.csv', 
'DrDoS_DNS.csv', 
'DrDoS_LDAP.csv':99.9256%]
```

03-12 as the testing data
```
['LDAP.csv':
 'MSSQL.csv':
 'NetBIOS.csv':
 'Portmap.csv': 97.5304% of 191694
 'Syn.csv':
 'UDP.csv':
 'UDPLag.csv': ]
```

and attempts to use all categories

In [1]:
print("we can assume we have enough BENIGN in each CSV to sample to make the proportion clores to 50%. for example, this is how many BENIGN rows we have in UDPLag.csv: ", 370166* (100-98.9991) / 100)

we can assume we have enough BENIGN in each CSV to sample to make the proportion clores to 50%. for example, this is how many BENIGN rows we have in UDPLag.csv:  3704.9914940000053


luckally, pycaret has tons of features to address these data issues written below

In [2]:
import dask.dataframe as dd
import matplotlib.pyplot as plt 
import seaborn as sns 
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import re
import pycaret
%matplotlib inline 

Function to efficiently read a CSV file into a DataFrame

In [3]:
def read_csv_efficiently(file_path):
   chunksize = 10000  # Adjust chunksize as needed based on file size and memory
   df_chunks = dd.read_csv(file_path, chunksize=chunksize)
   df = dd.concat(df_chunks, ignore_index=True)
   return df

strictly type and use dask to minimize RAM usage

In [4]:
def reduce_mem_usage(df, int_cast=False, obj_to_category=True, subset=None):
    """
    Optimizes memory usage of a Dask DataFrame by adjusting dtypes.
    """
    start_mem = df.memory_usage(deep=True).sum().compute() / 1024 ** 2
    cols = subset if subset is not None else df.columns

    for col in cols:
        col_type = df[col].dtype
        if col_type != 'object' and col_type != 'string' and not isinstance(col_type, (pd.DatetimeTZDtype, pd.CategoricalDtype, np.dtypes.StrDType)):
            try:  # Handle potential typing errors
                c_min = df[col].min().compute()
                c_max = df[col].max().compute()
            except TypeError:
                continue  # Skip columns with non-numeric values

            # Check for integer conversion
            treat_as_int = str(col_type)[:3] == 'int'
            if int_cast and not treat_as_int:
                treat_as_int = pd.api.types.is_integer_dtype(df[col])

            if treat_as_int:
                for np_type in [np.int8, np.int16, np.int32, np.int64, np.uint8, np.uint16, np.uint32, np.uint64]:
                    if c_min > np.iinfo(np_type).min and c_max < np.iinfo(np_type).max:
                        df[col] = df[col].astype(np_type)
                        break
            else:
                for np_type in [np.float16, np.float32, np.float64]:
                    # Extract numeric values before comparison
                    if c_min > np.finfo(np_type).min and c_max < np.finfo(np_type).max:
                        df[col] = df[col].astype(np_type)
                        break
        #seems to be causing problems. Decreased memory usage by 50.8% with this
        #elif not isinstance(col_type, pd.DatetimeTZDtype) and obj_to_category:
        #    df[col] = df[col].astype('category')

    end_mem = df.memory_usage(deep=True).sum().compute() / 1024 ** 2
    print('Memory Usage Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

listing columns names\

Initially I've tried running pycaret models on 1 CSV at a time with the original features, but they over value source_port as seen [here](https://bec-sv.atlassian.net/wiki/spaces/AN/pages/3538911264/Kentaro+Update+for+4+22+24). So we will remove this for now. we will explore removing each of the 5 tuple flows as the external dataset is from some network that will have different strcuture from whoever uses our ddos model.

In [5]:
columns = ['Unnamed: 0', 'Flow ID', ' Source IP', ' Source Port',
       ' Destination IP', ' Destination Port', ' Protocol', ' Timestamp',
       ' Flow Duration', ' Total Fwd Packets', ' Total Backward Packets',
       'Total Length of Fwd Packets', ' Total Length of Bwd Packets',
       ' Fwd Packet Length Max', ' Fwd Packet Length Min',
       ' Fwd Packet Length Mean', ' Fwd Packet Length Std',
       'Bwd Packet Length Max', ' Bwd Packet Length Min',
       ' Bwd Packet Length Mean', ' Bwd Packet Length Std', 'Flow Bytes/s',
       ' Flow Packets/s', ' Flow IAT Mean', ' Flow IAT Std', ' Flow IAT Max',
       ' Flow IAT Min', 'Fwd IAT Total', ' Fwd IAT Mean', ' Fwd IAT Std',
       ' Fwd IAT Max', ' Fwd IAT Min', 'Bwd IAT Total', ' Bwd IAT Mean',
       ' Bwd IAT Std', ' Bwd IAT Max', ' Bwd IAT Min', 'Fwd PSH Flags',
       ' Bwd PSH Flags', ' Fwd URG Flags', ' Bwd URG Flags',
       ' Fwd Header Length', ' Bwd Header Length', 'Fwd Packets/s',
       ' Bwd Packets/s', ' Min Packet Length', ' Max Packet Length',
       ' Packet Length Mean', ' Packet Length Std', ' Packet Length Variance',
       'FIN Flag Count', ' SYN Flag Count', ' RST Flag Count',
       ' PSH Flag Count', ' ACK Flag Count', ' URG Flag Count',
       ' CWE Flag Count', ' ECE Flag Count', ' Down/Up Ratio',
       ' Average Packet Size', ' Avg Fwd Segment Size',
       ' Avg Bwd Segment Size', ' Fwd Header Length.1', 'Fwd Avg Bytes/Bulk',
       ' Fwd Avg Packets/Bulk', ' Fwd Avg Bulk Rate', ' Bwd Avg Bytes/Bulk',
       ' Bwd Avg Packets/Bulk', 'Bwd Avg Bulk Rate', 'Subflow Fwd Packets',
       ' Subflow Fwd Bytes', ' Subflow Bwd Packets', ' Subflow Bwd Bytes',
       'Init_Win_bytes_forward', ' Init_Win_bytes_backward',
       ' act_data_pkt_fwd', ' min_seg_size_forward', 'Active Mean',
       ' Active Std', ' Active Max', ' Active Min', 'Idle Mean', ' Idle Std',
       ' Idle Max', ' Idle Min', 'SimillarHTTP', ' Inbound', ' Label']
ignore_columns = ['Flow Bytes/s', ' Flow Packets/s', ' Source IP', ' Source Port', ' Destination IP', ' Destination Port',]
use_columns = [x for x in columns if x not in ignore_columns]

function to combine the df

In [6]:
def combine_df(csv_dir, categories):
    df = dd.from_pandas(pd.DataFrame(columns=use_columns), npartitions=1)
    for ddos_type in categories:
        file_path = os.path.join(csv_dir, ddos_type)
        
        # pre-defining these removes some bugs
        dtype={'SimillarHTTP': 'object', ' Label': 'object', 'Flow_ID' : 'object', ' Source_IP' : 'object', ' Destination IP': 'object', ' TimeStamp': 'object'}
        df = dd.concat([df, reduce_mem_usage(dd.read_csv(file_path, dtype=dtype))], ignore_index=True)
        
    # Check for potential issues and handle them as needed
    if df.isnull().values.any():
       print("Warning: DataFrame contains missing values. Consider handling them.")
    return df

Read all of the CSVs into one Data Frame. RegEx to remove special characters like '/' that cause errors in pycaret and make column name more readable

In [7]:
# CSV file directory and file names for training
train_dir ='C:\\Users\\ktv07101\\Desktop\\BHNI Anomaly Detection Related\\CondaModelReplication\\CICDDoS2019\\CSV-01-12\\01-12'
train_ddos_categories = ['Syn.csv', 'DrDoS_LDAP.csv']
train_df = combine_df(train_dir, train_ddos_categories).rename(columns=lambda x: x.replace('/', '_').replace(' ', '_')).rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))

Memory Usage Decreased by 55.3%
Memory Usage Decreased by 53.2%


In [None]:
# CSV file directory and file names for testing
test_dir = 'C:\\Users\\ktv07101\\Desktop\\BHNI Anomaly Detection Related\\CondaModelReplication\\CICDDoS2019\\CSV-03-11\\03-11'
test_ddos_categories = ['LDAP.csv', 'Syn.csv']
test_df = combine_df(test_dir, test_ddos_categories).rename(columns=lambda x: x.replace('/', '_').replace(' ', '_')).rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))

Setting up [pycaret](https://pycaret.readthedocs.io/en/latest/api/classification.html#pycaret.classification.setup) to test 5 models.\
we are using the following features for the following reasons:\
`fix_imbalance` balances the distributions\
`transformation` makes the distribution gaussian\
`normalize` normalizes features\
`pca` reduces the features we use\
`pca_method` is `incremental` for we are using a large dataset\
`feature_selection` selects few features to use\
`n_features_to_select` specifies how many or fractions of features to start with\
`numeric_imputation` is done via knn to accurately impute\
`imputation_type` is `iterative` to be accurate\
`polynomial_features` generates features. can adjust n variables allowed with\
`log_experiment` to view the experiments\
`use_gpu` to speed up algorithms\

If we want to use algorithms that can't handle categories, set "max_encoding_ohe=-1" 

In [None]:
from pycaret.classification import *
s = setup(test_data=test_df, log_plots=True, log_profile=True, load_data=True, profile=True, use_gpu=True, data=train_df, target='_Label', log_experiment=True, feature_selection=True, n_features_to_select=0.5, numeric_imputation='knn',  imputation_type='iterative', polynomial_features=True, remove_multicollinearity=True, fix_imbalance=True, normalize=True, transformation=True, pca=True, pca_method='incremental')

In [None]:
best = compare_models(budget_time=420)

In [None]:
preds = predict_model(best)

In [None]:
plot_model(best, plot='confusion_matrix')

In [None]:
plot_model(best, plot='auc')

In [None]:
plot_model(best, plot='class_report')

In [None]:
plot_model(best, plot='feature')

In [None]:
result = predict_model(best, data=combined_df)

In [None]:
result.head()

In [None]:
result.query('Type == prediction_label').shape

In [None]:
result.shape

In [None]:
result = result[['TimeDateStamp', 'AddressOfEntryPoint', 'SizeOfInitializedData', 'SizeOfCode', 'SizeOfImage', 'Type', 'prediction_label', 'prediction_score']]