This notebook downloads the CSE-CIC-IDS2018 dataset described at https://www.unb.ca/cic/datasets/ids-2018.html

First, list what is in the S3 bucket:

In [1]:
!aws s3 ls --no-sign-request "s3://cse-cic-ids2018" --recursive --human-readable --summarize

2018-10-10 13:52:09    0 Bytes Original Network Traffic and Log data/
2018-10-10 13:52:23    0 Bytes Original Network Traffic and Log data/Friday-02-03-2018/
2018-10-10 14:00:39  225.8 MiB Original Network Traffic and Log data/Friday-02-03-2018/logs.zip
2018-10-10 14:00:51   41.7 GiB Original Network Traffic and Log data/Friday-02-03-2018/pcap.zip
2018-10-10 13:52:34    0 Bytes Original Network Traffic and Log data/Friday-16-02-2018/
2018-10-10 14:45:49  148.1 MiB Original Network Traffic and Log data/Friday-16-02-2018/logs.zip
2018-10-10 14:46:01   35.9 GiB Original Network Traffic and Log data/Friday-16-02-2018/pcap.zip
2018-10-10 13:52:41    0 Bytes Original Network Traffic and Log data/Friday-23-02-2018/
2018-10-10 14:46:10  199.8 MiB Original Network Traffic and Log data/Friday-23-02-2018/logs.zip
2018-10-10 14:46:31   55.0 GiB Original Network Traffic and Log data/Friday-23-02-2018/pcap.zip
2018-10-10 13:52:47    0 Bytes Original Network Traffic and Log data/Thursday-01

Now download only the processed CSV files:

In [None]:
!aws s3 cp --no-sign-request "s3://cse-cic-ids2018/Processed Traffic Data for ML Algorithms/" data/ --recursive

In [2]:
!ls data

Friday-02-03-2018_TrafficForML_CICFlowMeter.csv
Friday-16-02-2018_TrafficForML_CICFlowMeter.csv
Friday-23-02-2018_TrafficForML_CICFlowMeter.csv
Thuesday-20-02-2018_TrafficForML_CICFlowMeter.csv
Thursday-01-03-2018_TrafficForML_CICFlowMeter.csv
Thursday-15-02-2018_TrafficForML_CICFlowMeter.csv
Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv
Wednesday-14-02-2018_TrafficForML_CICFlowMeter.csv
Wednesday-21-02-2018_TrafficForML_CICFlowMeter.csv
Wednesday-28-02-2018_TrafficForML_CICFlowMeter.csv
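
Instead of typing these names by hand later in the notebook, the list can be built from the directory. A minimal sketch, assuming the CSVs sit in `data/` and share the `_TrafficForML_CICFlowMeter.csv` suffix seen in the listing above; `list_csv_files` is a hypothetical helper name, not part of the original notebook.

```python
from pathlib import Path

# Suffix shared by every processed CSV in the listing above.
SUFFIX = "_TrafficForML_CICFlowMeter.csv"


def list_csv_files(data_dir="data"):
    """Return the sorted names of the processed CSV files in data_dir."""
    return sorted(p.name for p in Path(data_dir).glob("*" + SUFFIX))


if __name__ == "__main__":
    # Assumes the download cell has already populated data/.
    csv_files = list_csv_files()
    print(len(csv_files), "files found")
```

Sorting keeps the order stable between runs, which matters if the list is later used to index into results.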


In [3]:
import pandas as pd

Define the CSV file names, their capture dates, and the train/test/validation split

In [4]:
csv_files = [
    "Friday-02-03-2018_TrafficForML_CICFlowMeter.csv",
    "Friday-16-02-2018_TrafficForML_CICFlowMeter.csv",
    "Friday-23-02-2018_TrafficForML_CICFlowMeter.csv",
    "Thuesday-20-02-2018_TrafficForML_CICFlowMeter.csv",
    "Thursday-01-03-2018_TrafficForML_CICFlowMeter.csv",
    "Thursday-15-02-2018_TrafficForML_CICFlowMeter.csv",
    "Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv",
    "Wednesday-14-02-2018_TrafficForML_CICFlowMeter.csv",
    "Wednesday-21-02-2018_TrafficForML_CICFlowMeter.csv",
    "Wednesday-28-02-2018_TrafficForML_CICFlowMeter.csv"
]

dates_by_datasets = {
    "Friday-02-03-2018_TrafficForML_CICFlowMeter.csv": "02-03-2018",
    "Friday-16-02-2018_TrafficForML_CICFlowMeter.csv": "16-02-2018",
    "Friday-23-02-2018_TrafficForML_CICFlowMeter.csv": "23-02-2018",
    "Thuesday-20-02-2018_TrafficForML_CICFlowMeter.csv": "20-02-2018",
    "Thursday-01-03-2018_TrafficForML_CICFlowMeter.csv": "01-03-2018",
    "Thursday-15-02-2018_TrafficForML_CICFlowMeter.csv": "15-02-2018",
    "Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv": "22-02-2018",
    "Wednesday-14-02-2018_TrafficForML_CICFlowMeter.csv": "14-02-2018",
    "Wednesday-21-02-2018_TrafficForML_CICFlowMeter.csv": "21-02-2018",
    "Wednesday-28-02-2018_TrafficForML_CICFlowMeter.csv": "28-02-2018",
}

train_test_validation_datasets_split = {
    "train": [
        "Thuesday-20-02-2018_TrafficForML_CICFlowMeter.csv",
        "Thursday-01-03-2018_TrafficForML_CICFlowMeter.csv",
    ],
    "test": [
        "Friday-02-03-2018_TrafficForML_CICFlowMeter.csv",
        "Friday-16-02-2018_TrafficForML_CICFlowMeter.csv",
        "Friday-23-02-2018_TrafficForML_CICFlowMeter.csv",
        "Wednesday-28-02-2018_TrafficForML_CICFlowMeter.csv",
        "Wednesday-21-02-2018_TrafficForML_CICFlowMeter.csv"
    ],
    "validate": [
        "Thursday-15-02-2018_TrafficForML_CICFlowMeter.csv",
        "Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv",
        "Wednesday-14-02-2018_TrafficForML_CICFlowMeter.csv",
    ]
}
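
Hand-typed mappings like these are easy to get out of sync with the file names. A minimal sketch of two checks, assuming every file name embeds its date as `DD-MM-YYYY` followed by an underscore, as the names above do; `date_from_filename` and `check_split` are hypothetical helpers, not part of the original notebook.

```python
import re

# Matches the DD-MM-YYYY part of names such as
# "Friday-16-02-2018_TrafficForML_CICFlowMeter.csv".
DATE_PATTERN = re.compile(r"-(\d{2}-\d{2}-\d{4})_")


def date_from_filename(name):
    """Extract the DD-MM-YYYY capture date embedded in a file name."""
    match = DATE_PATTERN.search(name)
    if match is None:
        raise ValueError("no date found in {!r}".format(name))
    return match.group(1)


def check_split(all_files, split):
    """Raise if a file is missing from the split, unknown, or assigned twice."""
    assigned = [name for names in split.values() for name in names]
    duplicates = {name for name in assigned if assigned.count(name) > 1}
    missing = set(all_files) - set(assigned)
    unknown = set(assigned) - set(all_files)
    if duplicates or missing or unknown:
        raise ValueError(
            "duplicates={}, missing={}, unknown={}".format(
                sorted(duplicates), sorted(missing), sorted(unknown)
            )
        )


# Possible usage with the names defined in this cell:
#   dates_by_datasets = {name: date_from_filename(name) for name in csv_files}
#   check_split(csv_files, train_test_validation_datasets_split)
```

Deriving the dates from the names keeps a single source of truth, and the split check fails loudly if a file ends up in two splits or in none.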

Load and concatenate the CSV files of one split

In [5]:
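dataset = "validate"

# Read every CSV of the chosen split, then concatenate once. This avoids
# re-copying a growing frame on each iteration and no longer reuses the
# name `csv`, which is also a standard-library module.
frames = [
    pd.read_csv("data/{}".format(name))
    for name in train_test_validation_datasets_split[dataset]
]
df = pd.concat(frames)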

In [6]:
print(df.shape)
df.head()

(3145725, 80)


Unnamed: 0,Dst Port,Protocol,Timestamp,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,TotLen Fwd Pkts,TotLen Bwd Pkts,Fwd Pkt Len Max,Fwd Pkt Len Min,...,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,0,0,15/02/2018 08:25:18,112641158,3,0,0,0,0,0,...,0,0.0,0.0,0,0,56320579.0,704.2784,56321077,56320081,Benign
1,22,6,15/02/2018 08:29:05,37366762,14,12,2168,2993,712,0,...,32,1024353.0,649038.754495,1601183,321569,11431221.0,3644991.0,15617415,8960247,Benign
2,47514,6,15/02/2018 08:29:42,543,2,0,64,0,64,0,...,32,0.0,0.0,0,0,0.0,0.0,0,0,Benign
3,0,0,15/02/2018 08:28:07,112640703,3,0,0,0,0,0,...,0,0.0,0.0,0,0,56320351.5,366.9884,56320611,56320092,Benign
4,0,0,15/02/2018 08:30:56,112640874,3,0,0,0,0,0,...,0,0.0,0.0,0,0,56320437.0,719.8347,56320946,56319928,Benign
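
The `Timestamp` values in the preview are day-first strings such as `15/02/2018 08:25:18`. A minimal sketch of parsing them with an explicit format, so that pandas does not have to guess the field order; `parse_timestamps` is a hypothetical helper, and `errors="coerce"` is an assumption that unparseable values should become `NaT` rather than raise.

```python
import pandas as pd

# Day-first format seen in the preview, e.g. "15/02/2018 08:25:18".
TIMESTAMP_FORMAT = "%d/%m/%Y %H:%M:%S"


def parse_timestamps(frame, column="Timestamp"):
    """Return a copy of frame with the given column parsed to datetimes."""
    parsed = frame.copy()
    parsed[column] = pd.to_datetime(
        parsed[column], format=TIMESTAMP_FORMAT, errors="coerce"
    )
    return parsed


if __name__ == "__main__":
    sample = pd.DataFrame(
        {"Timestamp": ["15/02/2018 08:25:18", "15/02/2018 08:29:05"]}
    )
    print(parse_timestamps(sample)["Timestamp"].dt.day.tolist())
```

The helper returns a copy, so the original frame keeps its string column until the result is assigned back.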


In [7]:
df.columns

Index(['Dst Port', 'Protocol', 'Timestamp', 'Flow Duration', 'Tot Fwd Pkts',
       'Tot Bwd Pkts', 'TotLen Fwd Pkts', 'TotLen Bwd Pkts', 'Fwd Pkt Len Max',
       'Fwd Pkt Len Min', 'Fwd Pkt Len Mean', 'Fwd Pkt Len Std',
       'Bwd Pkt Len Max', 'Bwd Pkt Len Min', 'Bwd Pkt Len Mean',
       'Bwd Pkt Len Std', 'Flow Byts/s', 'Flow Pkts/s', 'Flow IAT Mean',
       'Flow IAT Std', 'Flow IAT Max', 'Flow IAT Min', 'Fwd IAT Tot',
       'Fwd IAT Mean', 'Fwd IAT Std', 'Fwd IAT Max', 'Fwd IAT Min',
       'Bwd IAT Tot', 'Bwd IAT Mean', 'Bwd IAT Std', 'Bwd IAT Max',
       'Bwd IAT Min', 'Fwd PSH Flags', 'Bwd PSH Flags', 'Fwd URG Flags',
       'Bwd URG Flags', 'Fwd Header Len', 'Bwd Header Len', 'Fwd Pkts/s',
       'Bwd Pkts/s', 'Pkt Len Min', 'Pkt Len Max', 'Pkt Len Mean',
       'Pkt Len Std', 'Pkt Len Var', 'FIN Flag Cnt', 'SYN Flag Cnt',
       'RST Flag Cnt', 'PSH Flag Cnt', 'ACK Flag Cnt', 'URG Flag Cnt',
       'CWE Flag Count', 'ECE Flag Cnt', 'Down/Up Ratio', 'Pkt Size Avg',
      

Export the selected split as a zip-compressed CSV

In [8]:
# Write the frame as a single CSV named out.csv inside <dataset>.zip.
compression_opts = dict(method='zip', archive_name='out.csv')
df.to_csv('{}.zip'.format(dataset), index=False, compression=compression_opts)

  return self._open_to_write(zinfo, force_zip64=force_zip64)


In [9]:
!ls

1-fetch_data.ipynb  data  test.zip  train.zip  validate.zip
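
The listing shows `test.zip`, `train.zip` and `validate.zip`, so the load and export cells were presumably run once per value of `dataset`. A minimal sketch of producing all three in one pass, assuming the `data/` layout and the split dictionary defined earlier; `export_split` is a hypothetical helper, not part of the original notebook.

```python
import os

import pandas as pd


def export_split(split_name, file_names, data_dir="data", out_dir="."):
    """Concatenate the CSVs of one split and write them to <split_name>.zip."""
    frames = [pd.read_csv(os.path.join(data_dir, name)) for name in file_names]
    df = pd.concat(frames)
    out_path = os.path.join(out_dir, "{}.zip".format(split_name))
    # archive_name sets the name of the CSV stored inside the zip archive.
    compression_opts = dict(method="zip", archive_name="out.csv")
    df.to_csv(out_path, index=False, compression=compression_opts)
    return out_path


# Possible usage with the dictionary defined earlier in the notebook:
#   for split_name, file_names in train_test_validation_datasets_split.items():
#       export_split(split_name, file_names)
```

Each call holds only one split in memory at a time, which matters here because a single split already runs to millions of rows.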
