# An Initial Analysis of the Dataset
---
After writing a simple program to process the 13 binetflow files a summary of each file and their labeled flows as well as the total flows was produced. This is the same table as theone provided by the CTU-13 dataset webpage.

However there were some slight discrepencies between my summary and the existing summary.


### Reproduced Summary
|Scen.|Total Flows|Botnet Flows|Normal Flows|C&C Flows|Background Flows|
|---|---|---|---|---|---|
|1|2824636|40961(1.45%)|30387(1.08%)|341(0.01%)|2753288(97.47%)|
|2|1808122|20941(1.16%)|9120(0.50%)|673(0.04%)|1778061(98.34%)|
|3|4710638|26822(0.57%)|116887(2.48%)|63(0.00%)|4566929(96.95%)|
|4|129832|901(0.69%)|4679(3.60%)|24(0.02%)|124252(95.70%)|
|5|1925149|40003(2.08%)|31939(1.66%)|536(0.03%)|1853207(96.26%)|
|6|1121076|2580(0.23%)|25268(2.25%)|52(0.00%)|1093228(97.52%)|
|7|114077|63(0.06%)|1677(1.47%)|26(0.02%)|112337(98.47%)|
|8|2954230|6127(0.21%)|72822(2.47%)|1074(0.04%)|2875281(97.33%)|
|9|558919|4630(0.83%)|7494(1.34%)|199(0.04%)|546795(97.83%)|
|10|2087508|184987(8.86%)|29967(1.44%)|2973(0.14%)|1872554(89.70%)|
|11|107251|8164(7.61%)|2718(2.53%)|2(0.00%)|96369(89.85%)|
|12|1309791|106352(8.12%)|15847(1.21%)|33(0.00%)|1187592(90.67%)|
|13|325471|2168(0.67%)|7628(2.34%)|25(0.01%)|315675(96.99%)|

### Original CTU-13 Summary
![CTU-13 Dataset Summary](http://mcfp.weebly.com/uploads/1/1/2/3/11233160/7883961.jpg?728)

---

It can be seen that there are some slight discrepencies between the flow values and their percentages. They may be negligible in the long run. 

The code I used to produce the summary can be found [here](https://github.com/corysabol/binetflow-botnet-detect/blob/master/src/sample.py).

In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import os
import datetime as dt
from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
import seaborn as sns

### Sample of the data
---

In [4]:
dataset_path = os.path.join('..','CTU-13-Dataset')
directory = os.fsencode(dataset_path)

files = os.listdir(directory)
sample_file = files[0]

# read the file with pandas
sample_df = pd.read_csv(os.path.join(directory, sample_file).decode('utf-8'), low_memory=False)
print(sample_df.size)
#sample_df[sample_df.Label.str.contains("Botnet")].head(10)
sample_df.head(5)

42369540


Unnamed: 0,StartTime,Dur,Proto,SrcAddr,Sport,Dir,DstAddr,Dport,State,sTos,dTos,TotPkts,TotBytes,SrcBytes,Label
0,2011/08/10 09:46:59.607825,1.026539,tcp,94.44.127.113,1577,->,147.32.84.59,6881,S_RA,0.0,0.0,4,276,156,flow=Background-Established-cmpgw-CVUT
1,2011/08/10 09:47:00.634364,1.009595,tcp,94.44.127.113,1577,->,147.32.84.59,6881,S_RA,0.0,0.0,4,276,156,flow=Background-Established-cmpgw-CVUT
2,2011/08/10 09:47:48.185538,3.056586,tcp,147.32.86.89,4768,->,77.75.73.33,80,SR_A,0.0,0.0,3,182,122,flow=Background-TCP-Attempt
3,2011/08/10 09:47:48.230897,3.111769,tcp,147.32.86.89,4788,->,77.75.73.33,80,SR_A,0.0,0.0,3,182,122,flow=Background-TCP-Attempt
4,2011/08/10 09:47:48.963351,3.083411,tcp,147.32.86.89,4850,->,77.75.73.33,80,SR_A,0.0,0.0,3,182,122,flow=Background-TCP-Attempt


### Take a look at the distribution of the data
---
We want to grab 1 second intervals of flows and average the duration of each of these time windows.

In [6]:
from functools import reduce

def label_data(df):
    # label each flow as attack or normal
    # If the existing label contains 'Botnet' then it is attack.
    # everything else is considered as normal traffic.
    lbl = lambda x: 'Attack' if 'Botnet' in x else 'Normal'
    df['Category'] = df['Label'].apply(lbl)

# convert the time to seconds
def time_to_sec(time):
    f = float
    t = time.split(':')
    h,m,s = f(t[0]),f(t[1]),f(t[2])
    return float(h)*3600 + float(m)*60 + float(s)

def agg_timewindow(df):
    '''
    Grab 1 second windows of binet flows, extract the features from that window into 
    a row of a new dataframe.
    '''
    # reindex based on StartTime
    df.set_index('StartTime')
    print(df.index)
    df.index = pd.to_datetime(df.index)
    print(df.index)
    #avg_dur = lambda x: 
    #df.resample('1S')

#agg_timewindow(sample_df)
sample_df.head(10)
# relabel the data
# label_data(sample_df)
# sample_df[sample_df.Category.str.contains('Attack')].head()

Unnamed: 0,StartTime,Dur,Proto,SrcAddr,Sport,Dir,DstAddr,Dport,State,sTos,dTos,TotPkts,TotBytes,SrcBytes,Label
0,2011/08/10 09:46:59.607825,1.026539,tcp,94.44.127.113,1577,->,147.32.84.59,6881,S_RA,0.0,0.0,4,276,156,flow=Background-Established-cmpgw-CVUT
1,2011/08/10 09:47:00.634364,1.009595,tcp,94.44.127.113,1577,->,147.32.84.59,6881,S_RA,0.0,0.0,4,276,156,flow=Background-Established-cmpgw-CVUT
2,2011/08/10 09:47:48.185538,3.056586,tcp,147.32.86.89,4768,->,77.75.73.33,80,SR_A,0.0,0.0,3,182,122,flow=Background-TCP-Attempt
3,2011/08/10 09:47:48.230897,3.111769,tcp,147.32.86.89,4788,->,77.75.73.33,80,SR_A,0.0,0.0,3,182,122,flow=Background-TCP-Attempt
4,2011/08/10 09:47:48.963351,3.083411,tcp,147.32.86.89,4850,->,77.75.73.33,80,SR_A,0.0,0.0,3,182,122,flow=Background-TCP-Attempt
5,2011/08/10 09:47:58.806814,3.097288,tcp,147.32.86.89,4866,->,77.75.73.33,80,SR_A,0.0,0.0,3,182,122,flow=Background-TCP-Attempt
6,2011/08/10 09:51:34.450457,1.048908,tcp,213.200.244.217,47908,->,147.32.84.59,6881,S_RA,0.0,0.0,4,244,124,flow=Background-Established-cmpgw-CVUT
7,2011/08/10 09:54:55.231320,4.373526,tcp,75.105.28.60,1419,->,147.32.84.59,6881,S_RA,0.0,0.0,4,252,132,flow=Background-Established-cmpgw-CVUT
8,2011/08/10 09:57:13.352114,4.827912,tcp,75.105.28.60,1491,->,147.32.84.59,6881,S_RA,0.0,0.0,4,252,132,flow=Background-Established-cmpgw-CVUT
9,2011/08/10 09:58:43.301515,0.049697,tcp,178.111.79.115,41752,->,147.32.84.229,13363,SR_SA,0.0,0.0,5,352,208,flow=Background-TCP-Established


In [21]:
# We want to check 1 second intervals, and plot each feature for each file.
#     0  n_dports>1024    1 background_flow_count
#     2  n_s_a_p_address  3 avg_duration
#     4  n_s_b_p_address  5 n_sports<1024
#     6  n_sports>1024    7 n_conn
#     8  n_s_na_p_address 9 n_udp

#     10 n_icmp           11 n_d_na_p_address
#     12 n_d_a_p_address  13 n_s_c_p_address
#     14 n_d_c_p_address  15 normal_flow_count
#     16 n_dports<1024    17 n_d_b_p_address
#     18 n_tcp

# plot one - n_dports>1024
#sample_df.index[0].total_seconds()
#sample_df.index = (sample_df.index - dt.datetime(1970,1,1)).total_seconds()
pd.to_datetime(sample_df.StartTime)
sample_df['StartTime'] = sample_df['StartTime'].apply(lambda x: x[:19])

In [12]:
sample_df.head()

DatetimeIndex([          '1970-01-01 00:00:00',
               '1970-01-01 00:00:00.000000001',
               '1970-01-01 00:00:00.000000002',
               '1970-01-01 00:00:00.000000003',
               '1970-01-01 00:00:00.000000004',
               '1970-01-01 00:00:00.000000005',
               '1970-01-01 00:00:00.000000006',
               '1970-01-01 00:00:00.000000007',
               '1970-01-01 00:00:00.000000008',
               '1970-01-01 00:00:00.000000009',
               ...
               '1970-01-01 00:00:00.002824626',
               '1970-01-01 00:00:00.002824627',
               '1970-01-01 00:00:00.002824628',
               '1970-01-01 00:00:00.002824629',
               '1970-01-01 00:00:00.002824630',
               '1970-01-01 00:00:00.002824631',
               '1970-01-01 00:00:00.002824632',
               '1970-01-01 00:00:00.002824633',
               '1970-01-01 00:00:00.002824634',
               '1970-01-01 00:00:00.002824635'],
              dtype=

Unnamed: 0,StartTime,Dur,Proto,SrcAddr,Sport,Dir,DstAddr,Dport,State,sTos,dTos,TotPkts,TotBytes,SrcBytes,Label
1970-01-01 00:00:00.000000000,2011/08/10 09:46:59.607825,1.026539,tcp,94.44.127.113,1577,->,147.32.84.59,6881,S_RA,0.0,0.0,4,276,156,flow=Background-Established-cmpgw-CVUT
1970-01-01 00:00:00.000000001,2011/08/10 09:47:00.634364,1.009595,tcp,94.44.127.113,1577,->,147.32.84.59,6881,S_RA,0.0,0.0,4,276,156,flow=Background-Established-cmpgw-CVUT
1970-01-01 00:00:00.000000002,2011/08/10 09:47:48.185538,3.056586,tcp,147.32.86.89,4768,->,77.75.73.33,80,SR_A,0.0,0.0,3,182,122,flow=Background-TCP-Attempt
1970-01-01 00:00:00.000000003,2011/08/10 09:47:48.230897,3.111769,tcp,147.32.86.89,4788,->,77.75.73.33,80,SR_A,0.0,0.0,3,182,122,flow=Background-TCP-Attempt
1970-01-01 00:00:00.000000004,2011/08/10 09:47:48.963351,3.083411,tcp,147.32.86.89,4850,->,77.75.73.33,80,SR_A,0.0,0.0,3,182,122,flow=Background-TCP-Attempt
