We are going to download the CSE-CIC-IDS2018 data described at https://www.unb.ca/cic/datasets/ids-2018.html

But first, let's see what's in the bucket:

In [None]:
!aws s3 ls --no-sign-request "s3://cse-cic-ids2018" --recursive --human-readable --summarize

Now download only the CSV files:

In [None]:
!aws s3 cp --no-sign-request "s3://cse-cic-ids2018/Processed Traffic Data for ML Algorithms/" data/ --recursive

In [None]:
!ls -l --block-size=M data

In [None]:
# dates_by_datasets = {
#     "Friday-02-03-2018_TrafficForML_CICFlowMeter.csv": "02-03-2018",
#     "Friday-16-02-2018_TrafficForML_CICFlowMeter.csv": "16-02-2018",
#     "Friday-23-02-2018_TrafficForML_CICFlowMeter.csv": "23-02-2018",
#     "Thuesday-20-02-2018_TrafficForML_CICFlowMeter.csv": "20-02-2018",
#     "Thursday-01-03-2018_TrafficForML_CICFlowMeter.csv": "01-03-2018",
#     "Thursday-15-02-2018_TrafficForML_CICFlowMeter.csv": "15-02-2018",
#     "Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv": "22-02-2018",
#     "Wednesday-14-02-2018_TrafficForML_CICFlowMeter.csv": "14-02-2018",
#     "Wednesday-21-02-2018_TrafficForML_CICFlowMeter.csv": "21-02-2018",
#     "Wednesday-28-02-2018_TrafficForML_CICFlowMeter.csv": "28-02-2018",
# }

In [1]:
import pandas as pd

It seems that the dataset "Thuesday-20-02-2018_TrafficForML_CICFlowMeter.csv" is too heavy to handle in one piece.

Let's split it into smaller files first.

In [None]:
huge_df = pd.read_csv("data/Thuesday-20-02-2018_TrafficForML_CICFlowMeter.csv")

In [None]:
nb_of_chunks = 10
# Ceiling division, so the last chunk also picks up the remainder rows
chunk_size = -(-huge_df.shape[0] // nb_of_chunks)

for chunk in range(nb_of_chunks):
    start = chunk * chunk_size
    # One output file per chunk, numbered 0 to nb_of_chunks - 1
    huge_df.iloc[start: start + chunk_size].to_csv(
        'data/Thuesday-20-02-2018_TrafficForML_CICFlowMeter_{}.csv'.format(chunk),
        index=False
    )
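
The split above first loads the whole file into `huge_df`. If that load is itself the problem, a possible alternative is to let `pd.read_csv` stream the file with its `chunksize` argument. The following is only a minimal sketch under that assumption: `split_csv`, `rows_per_chunk`, `out_pattern` and the toy file are hypothetical names introduced for illustration and are not part of the original notebook.

```python
import os
import tempfile

import pandas as pd


def split_csv(path, rows_per_chunk, out_pattern):
    """Write `path` as several smaller CSV files without loading it whole.

    `out_pattern` is a format string with one `{}` slot for the chunk number.
    Returns the list of files written.
    """
    written = []
    # With `chunksize`, read_csv yields DataFrames of at most that many rows.
    for i, chunk in enumerate(pd.read_csv(path, chunksize=rows_per_chunk)):
        out_path = out_pattern.format(i)
        chunk.to_csv(out_path, index=False)
        written.append(out_path)
    return written


# Toy demonstration on a 25-row file (a stand-in for the real CSV).
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "toy.csv")
pd.DataFrame({"Dst Port": range(25), "Label": ["Benign"] * 25}).to_csv(src, index=False)

parts = split_csv(src, rows_per_chunk=10, out_pattern=os.path.join(tmp_dir, "toy_{}.csv"))
part_sizes = [len(pd.read_csv(p)) for p in parts]
print(part_sizes)
```

Here the number of output files follows from the chosen row count per file rather than from a fixed number of chunks.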

In [None]:
!ls -l --block-size=M data

In [None]:
!rm data/Thuesday-20-02-2018_TrafficForML_CICFlowMeter.csv

In [None]:
csv_files = [
    "Friday-02-03-2018_TrafficForML_CICFlowMeter.csv",
    "Friday-16-02-2018_TrafficForML_CICFlowMeter.csv",
    "Friday-23-02-2018_TrafficForML_CICFlowMeter.csv",
    "Thuesday-20-02-2018_TrafficForML_CICFlowMeter_0.csv",
    "Thursday-01-03-2018_TrafficForML_CICFlowMeter.csv",
    "Thursday-15-02-2018_TrafficForML_CICFlowMeter.csv",
    "Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv",
    "Wednesday-14-02-2018_TrafficForML_CICFlowMeter.csv",
    "Wednesday-21-02-2018_TrafficForML_CICFlowMeter.csv",
    "Wednesday-28-02-2018_TrafficForML_CICFlowMeter.csv",
    "Thuesday-20-02-2018_TrafficForML_CICFlowMeter_1.csv",
    "Thuesday-20-02-2018_TrafficForML_CICFlowMeter_2.csv",
    "Thuesday-20-02-2018_TrafficForML_CICFlowMeter_3.csv",
    "Thuesday-20-02-2018_TrafficForML_CICFlowMeter_4.csv",
    "Thuesday-20-02-2018_TrafficForML_CICFlowMeter_5.csv",
    "Thuesday-20-02-2018_TrafficForML_CICFlowMeter_6.csv",
    "Thuesday-20-02-2018_TrafficForML_CICFlowMeter_7.csv",
    "Thuesday-20-02-2018_TrafficForML_CICFlowMeter_8.csv",
    "Thuesday-20-02-2018_TrafficForML_CICFlowMeter_9.csv",
]

Create a dataframe object

Let's now explore the label distribution and memory usage of each dataset

In [None]:
labels_dist_by_dataset, tot_mem = {}, 0

for csv in csv_files:

    print('Loading', csv)
    tmp_df = pd.read_csv("data/{}".format(csv))
    tmp_dist = (tmp_df['Label'].value_counts(normalize=True) * 100).to_dict()
    tmp_mem = tmp_df.memory_usage(index=True).sum()
    tot_mem += tmp_mem
    tmp_dist['memory_usage'] = tmp_mem
    print(tmp_dist)
    labels_dist_by_dataset[csv] = tmp_dist
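
A side note on the memory figures: `memory_usage(index=True)` counts only the array buffers by default. When string columns such as the label are stored as Python objects, passing `deep=True` also counts the strings themselves, so the reported size can be noticeably larger. The sketch below illustrates the difference on a toy frame; the column values are placeholders, not taken from the dataset.

```python
import pandas as pd

# Toy frame with one numeric column and one string column (placeholder values).
toy = pd.DataFrame({
    "Dst Port": [80, 443, 22, 8080],
    "Label": ["Benign", "Benign", "Attack", "Benign"],
})

# Default: buffers only. deep=True: also inspects object-dtype contents.
shallow = toy.memory_usage(index=True).sum()
deep = toy.memory_usage(index=True, deep=True).sum()

print("shallow bytes:", shallow)
print("deep bytes:   ", deep)
```

Whether the two numbers differ depends on how the installed pandas version stores strings, so the relative ranking of the files is the more reliable signal.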
    

Express each dataset's memory usage as a percentage of the total, which is easier to read

In [None]:
for d in labels_dist_by_dataset:
    labels_dist_by_dataset[d]['memory_usage'] = round((labels_dist_by_dataset[d]['memory_usage'] / tot_mem) * 100, 3)

Here we go! We now have more representative stats on each dataset for splitting into train, test and validation sets

In [None]:
labels_dist_by_dataset

In [None]:
huge_df['Label'].value_counts(normalize=True) * 100

Print the names of the first few columns of each file, because I noticed some differences

In [None]:
for csv in csv_files:

    print('Loading', csv)
    tmp_df = pd.read_csv("data/{}".format(csv))
    print(tmp_df.columns[:5])
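
Reading every file in full just to look at its header is slow. One possible shortcut, sketched here under the assumption that only the column names matter, is to pass `nrows=0` to `pd.read_csv` and compare the headers directly. The helper `column_differences` and the toy files are hypothetical and only serve the illustration.

```python
import os
import tempfile

import pandas as pd


def column_differences(paths):
    """Map each path to the columns it has beyond those shared by every file."""
    # nrows=0 parses only the header line, so no data rows are loaded.
    headers = {p: list(pd.read_csv(p, nrows=0).columns) for p in paths}
    common = set.intersection(*(set(cols) for cols in headers.values()))
    return {p: [c for c in cols if c not in common] for p, cols in headers.items()}


# Toy demonstration: two header-only files, the second with two extra columns.
tmp_dir = tempfile.mkdtemp()
path_a = os.path.join(tmp_dir, "a.csv")
path_b = os.path.join(tmp_dir, "b.csv")
pd.DataFrame(columns=["Dst Port", "Protocol", "Label"]).to_csv(path_a, index=False)
pd.DataFrame(columns=["Flow ID", "Src IP", "Dst Port", "Protocol", "Label"]).to_csv(path_b, index=False)

extras = column_differences([path_a, path_b])
for path, cols in extras.items():
    print(os.path.basename(path), cols)
```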
    

Obviously the datasets are unbalanced. We must therefore find a compromise that puts enough malicious ("M-Profile") data from each dataset in the training set, while avoiding any synthetic data generation such as oversampling.

In [2]:
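def format_columns(df):
    # Normalize column names, e.g. "Src IP" -> "SRC_IP"
    df.columns = [col.upper().replace(' ', '_') for col in df.columns]
    # Drop the flow identifier columns when they are present
    df = df.drop(['FLOW_ID', 'SRC_IP', 'DST_IP', 'SRC_PORT'], axis=1, errors='ignore')
    # ...
    return df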

In [3]:
from sklearn.model_selection import train_test_split

def train_test_validation_split(df, test_size, val_size, label_column='LABEL'):
    '''
        Split df into train, test and validation sets.

        :param test_size: fraction of df held out for the test set
        :param val_size: fraction of the rows left after removing the test set
                         that is held out for the validation set
        :return: train_dataset (pandas.DataFrame), 
                 test_dataset (pandas.DataFrame), 
                 validation_dataset (pandas.DataFrame)
    '''
    # Format columns
    df = format_columns(df)
    
    # Splitting X and y from df
    X, y = df[df.columns.difference([label_column])], df[label_column]
    # Cutting out test set from df
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=1)
    # Cutting out val set from train_df
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=val_size, random_state=1)
    
    return pd.concat([X_train, y_train], axis=1), pd.concat([X_test, y_test], axis=1), pd.concat([X_val, y_val], axis=1)
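
The function above splits at random, so a rare label can end up under-represented in one of the three sets. One option that stays clear of oversampling is the `stratify` argument of `train_test_split`, which keeps the label proportions roughly equal across the splits. The sketch below is an alternative under that assumption, not the method used in this notebook; `stratified_three_way_split`, `seed` and the toy frame are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_three_way_split(df, test_size, val_size, label_column="LABEL", seed=1):
    """Return (train, test, val) with label proportions preserved in each part.

    `val_size` is a fraction of what remains after the test set is removed,
    so the validation share of the whole frame is (1 - test_size) * val_size.
    """
    train_val, test = train_test_split(
        df, test_size=test_size, random_state=seed, stratify=df[label_column]
    )
    train, val = train_test_split(
        train_val, test_size=val_size, random_state=seed, stratify=train_val[label_column]
    )
    return train, test, val


# Toy demonstration: 10% of the rows carry the minority label (placeholder values).
toy = pd.DataFrame({
    "FLOW_DURATION": range(1000),
    "LABEL": ["Benign"] * 900 + ["Attack"] * 100,
})
train, test, val = stratified_three_way_split(toy, test_size=0.3, val_size=0.14)

for name, part in [("train", train), ("test", test), ("val", val)]:
    share = (part["LABEL"] == "Attack").mean()
    print(name, len(part), round(share, 3))
```

Note that stratification needs at least two rows of every label in the frame being split, so files with a nearly absent class may still require special handling.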

So let's keep this file list for creating our datasets:

In [4]:
csv_files = [
    "Friday-02-03-2018_TrafficForML_CICFlowMeter.csv",
    "Friday-16-02-2018_TrafficForML_CICFlowMeter.csv",
    "Friday-23-02-2018_TrafficForML_CICFlowMeter.csv",
    "Thuesday-20-02-2018_TrafficForML_CICFlowMeter_0.csv",
    "Thursday-01-03-2018_TrafficForML_CICFlowMeter.csv",
    "Thursday-15-02-2018_TrafficForML_CICFlowMeter.csv",
    "Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv",
    "Wednesday-14-02-2018_TrafficForML_CICFlowMeter.csv",
    "Wednesday-21-02-2018_TrafficForML_CICFlowMeter.csv",
    "Wednesday-28-02-2018_TrafficForML_CICFlowMeter.csv",
#     "Thuesday-20-02-2018_TrafficForML_CICFlowMeter_1.csv",
#     "Thuesday-20-02-2018_TrafficForML_CICFlowMeter_2.csv",
#     "Thuesday-20-02-2018_TrafficForML_CICFlowMeter_3.csv",
]

In [5]:
train, test, val = pd.DataFrame({}), pd.DataFrame({}), pd.DataFrame({})

for csv in csv_files:

    print('*** Loading', csv)
    
    tmp_train, tmp_test, tmp_val = train_test_validation_split(
        df=pd.read_csv("data/{}".format(csv)),
        test_size=0.3,
        val_size=0.14, # taken from the remaining 70%: (1 - 0.3) * 0.14 = 0.098, i.e. about 10% of all rows
    )
    
    
    train = pd.concat([train, tmp_train])
    print('Train set concatenated: new train dataframe shape: {}'.format(train.shape[0]))
    test = pd.concat([test, tmp_test])
    print('Test set concatenated: new test dataframe shape: {}'.format(test.shape[0]))
    val = pd.concat([val, tmp_val])
    print('Val set concatenated: new val dataframe shape: {}'.format(val.shape[0]))

*** Loading Friday-02-03-2018_TrafficForML_CICFlowMeter.csv
Train set concatenated: new train dataframe shape: 631241
Test set concatenated: new test dataframe shape: 314573
Val set concatenated: new val dataframe shape: 102761
*** Loading Friday-16-02-2018_TrafficForML_CICFlowMeter.csv
Train set concatenated: new train dataframe shape: 1262482
Test set concatenated: new test dataframe shape: 629146
Val set concatenated: new val dataframe shape: 205522
*** Loading Friday-23-02-2018_TrafficForML_CICFlowMeter.csv


FileNotFoundError: [Errno 2] No such file or directory: 'data/Friday-23-02-2018_TrafficForML_CICFlowMeter.csv'
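
The run above stops at the first file that is not on disk. A possible way to make the loading step more forgiving, sketched here as an assumption rather than as what the notebook does, is to check each path first, collect the frames in a list, and concatenate once at the end (which also avoids re-copying the growing frame on every iteration). `load_existing` and the toy files are hypothetical.

```python
import os
import tempfile

import pandas as pd


def load_existing(paths):
    """Read every CSV in `paths` that exists; return (combined frame, missing paths)."""
    frames, missing = [], []
    for path in paths:
        if not os.path.exists(path):
            # Remember the gap instead of raising FileNotFoundError.
            missing.append(path)
            continue
        frames.append(pd.read_csv(path))
    # A single concat at the end instead of one per file.
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return combined, missing


# Toy demonstration: two small files on disk and one path that does not exist.
tmp_dir = tempfile.mkdtemp()
present_a = os.path.join(tmp_dir, "a.csv")
present_b = os.path.join(tmp_dir, "b.csv")
absent = os.path.join(tmp_dir, "not_downloaded.csv")
pd.DataFrame({"Dst Port": [80, 443], "Label": ["Benign", "Benign"]}).to_csv(present_a, index=False)
pd.DataFrame({"Dst Port": [22, 21, 53], "Label": ["Benign"] * 3}).to_csv(present_b, index=False)

combined, missing = load_existing([present_a, absent, present_b])
print(combined.shape)
print("missing:", [os.path.basename(p) for p in missing])
```

Skipping silently would hide a failed download, so the list of missing paths is returned for the caller to inspect.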

In [None]:
print(train.shape)
train.head()

In [None]:
print(test.shape)
test.head()

In [None]:
print(val.shape)
val.head()

In [None]:
list(train.columns) == list(test.columns) == list(val.columns)

Then export them to CSV files for the upcoming exploration, modelling, testing and validation.

In [None]:
train.to_csv('train.csv', index=False)

In [None]:
test.to_csv('test.csv', index=False)

In [None]:
val.to_csv('val.csv', index=False)

In [None]:
!ls