# An Initial Analysis of the Dataset
---
After writing a simple program to process the 13 binetflow files, I produced a summary of each file showing its total flows and its labeled flows. This reproduces the table provided on the CTU-13 dataset webpage.

However, there were some slight discrepancies between my summary and the existing summary.


### Reproduced Summary
|Scen.|Total Flows|Botnet Flows|Normal Flows|C&C Flows|Background Flows|
|---|---|---|---|---|---|
|1|2824636|40961(1.45%)|30387(1.08%)|341(0.01%)|2753288(97.47%)|
|2|1808122|20941(1.16%)|9120(0.50%)|673(0.04%)|1778061(98.34%)|
|3|4710638|26822(0.57%)|116887(2.48%)|63(0.00%)|4566929(96.95%)|
|4|129832|901(0.69%)|4679(3.60%)|24(0.02%)|124252(95.70%)|
|5|1925149|40003(2.08%)|31939(1.66%)|536(0.03%)|1853207(96.26%)|
|6|1121076|2580(0.23%)|25268(2.25%)|52(0.00%)|1093228(97.52%)|
|7|114077|63(0.06%)|1677(1.47%)|26(0.02%)|112337(98.47%)|
|8|2954230|6127(0.21%)|72822(2.47%)|1074(0.04%)|2875281(97.33%)|
|9|558919|4630(0.83%)|7494(1.34%)|199(0.04%)|546795(97.83%)|
|10|2087508|184987(8.86%)|29967(1.44%)|2973(0.14%)|1872554(89.70%)|
|11|107251|8164(7.61%)|2718(2.53%)|2(0.00%)|96369(89.85%)|
|12|1309791|106352(8.12%)|15847(1.21%)|33(0.00%)|1187592(90.67%)|
|13|325471|2168(0.67%)|7628(2.34%)|25(0.01%)|315675(96.99%)|

### Original CTU-13 Summary
![CTU-13 Dataset Summary](http://mcfp.weebly.com/uploads/1/1/2/3/11233160/7883961.jpg?728)

---

Comparing the two tables shows some slight discrepancies in the flow counts and their percentages. They are small and may be negligible in the long run.

The code I used to produce the summary can be found [here](https://github.com/corysabol/binetflow-botnet-detect/blob/master/src/sample.py).
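For readers who do not want to open the linked script, the counting behind a table like this can be sketched in a few lines. This is not the linked code, only a minimal illustration. It assumes that the coarse class can be read from substrings of the label (`Botnet`, `Normal`, `Background`) and that C&C flows are botnet flows whose label contains `CC`. The helper names `summarize_labels` and `percent` are hypothetical, and the botnet and normal labels in the sample are made up for illustration.

```python
from collections import Counter


def summarize_labels(labels):
    """Count flows per coarse class from binetflow-style label strings."""
    counts = Counter()
    for label in labels:
        counts["total"] += 1
        if "Botnet" in label:
            counts["botnet"] += 1
            # Assumption: C&C flows are botnet flows whose label contains "CC".
            if "CC" in label:
                counts["cc"] += 1
        elif "Normal" in label:
            counts["normal"] += 1
        elif "Background" in label:
            counts["background"] += 1
    return counts


def percent(part, total):
    """Format a count the way the table does, e.g. 40961(1.45%)."""
    return f"{part}({100 * part / total:.2f}%)"


# The first three labels appear in the data shown later in this notebook;
# the remaining three are illustrative strings, not copied from the dataset.
sample_labels = [
    "flow=Background-Established-cmpgw-CVUT",
    "flow=Background-TCP-Attempt",
    "flow=Background-UDP-Established",
    "flow=From-Botnet-Example-TCP-CC-HTTP",
    "flow=From-Botnet-Example-UDP-DNS",
    "flow=From-Normal-Example-Host",
]

counts = summarize_labels(sample_labels)
for name in ("botnet", "normal", "cc", "background"):
    print(name, percent(counts[name], counts["total"]))
```

Under these assumptions the botnet, normal, and background counts add up to the total, while the C&C count is a subset of the botnet count. That matches the reproduced table, for example 40961 + 30387 + 2753288 = 2824636 for scenario 1.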

---

# Data preparation
---

We want to begin preparing the data for training our models. We are going to create models based on the following techniques:

1. SVM
2. Random Forest
3. Decision Trees
4. Naive Bayes
5. Deep Learning

We want to compare these algorithms for precision, accuracy, and recall. We will split each of the 13 data files into 70% training data and 30% testing data.
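As a rough sketch of what the 70/30 split and the three metrics could look like with scikit-learn, the example below uses synthetic data rather than CTU-13 flows. The feature matrix, the labeling rule, and the choice of a decision tree are placeholders for illustration only. Stratifying the split is an assumption on my part; it keeps the class ratio the same in both parts, which seems useful given how rare botnet flows are.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for numeric flow features; this is NOT CTU-13 data.
X = rng.normal(size=(1000, 4))
# Made-up rule that yields a minority positive class (think "botnet").
y = (X[:, 0] + 0.5 * X[:, 1] > 1.0).astype(int)

# 70% training data, 30% testing data, with the class ratio preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
}
print(len(X_train), len(X_test))
print(metrics)
```

With classes this imbalanced, accuracy alone can look good even when most botnet flows are missed, which is one reason to report precision and recall alongside it.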

In [2]:
import pandas as pd
import numpy as np
import os
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier as forest
from sklearn.tree import DecisionTreeClassifier as d_tree
from sklearn.naive_bayes import GaussianNB as g_nb

# TensorFlow will be set up later,
# so that the neural net can run on the GPU.

---
### Read in the datasets

---

In [5]:
dataset_path = os.path.join('..', 'CTU-13-Dataset')
files = []

# Open question: do we train on one file at a time, or treat it all as one huge dataset?
# It's about 2.5 GB of data, which can fit into memory, thankfully.
# sorted() keeps the file order the same from run to run.
for f_name in sorted(os.listdir(dataset_path)):
    if f_name.endswith('.binetflow'):
        p = os.path.join(dataset_path, f_name)
        files.append(pd.read_csv(p, low_memory=False))
len(files)

13

---
### Prepare the data (70/30)
---

In [6]:
# How should we go about actually splitting the data while it is in
# DataFrames? We can derive some kind of class label first.
# .values shows the underlying array of the first file.
files[0].values


array([['2011/08/10 09:46:59.607825', 1.0265389999999999, 'tcp', ..., 276,
        156, 'flow=Background-Established-cmpgw-CVUT'],
       ['2011/08/10 09:47:00.634364', 1.009595, 'tcp', ..., 276, 156,
        'flow=Background-Established-cmpgw-CVUT'],
       ['2011/08/10 09:47:48.185538', 3.0565860000000002, 'tcp', ..., 182,
        122, 'flow=Background-TCP-Attempt'],
       ..., 
       ['2011/08/10 15:54:07.357302', 0.0, 'tcp', ..., 74, 74,
        'flow=Background-TCP-Attempt'],
       ['2011/08/10 15:54:07.366830', 0.002618, 'udp', ..., 520, 460,
        'flow=Background-UDP-Established'],
       ['2011/08/10 15:54:07.368340', 0.0011220000000000002, 'udp', ...,
        137, 77, 'flow=Background-UDP-Established']], dtype=object)
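One possible answer to the question in the cell above is to derive a coarse class from the label text and then split the DataFrame itself with pandas. The sketch below is only an illustration on a tiny hand-made frame. The column name `Label`, the `Class` column, the helper `coarse_class`, and the botnet and normal label strings are all assumptions, not taken from the dataset.

```python
import pandas as pd

# Tiny hypothetical frame standing in for one binetflow file.
# The column name 'Label' is an assumption about the file header.
df = pd.DataFrame({
    "Dur": [1.03, 1.01, 3.06, 0.0, 0.003, 0.001, 2.5, 0.4, 0.9, 1.7],
    "Label": [
        "flow=Background-Established-cmpgw-CVUT",
        "flow=Background-Established-cmpgw-CVUT",
        "flow=Background-TCP-Attempt",
        "flow=Background-TCP-Attempt",
        "flow=Background-UDP-Established",
        "flow=Background-UDP-Established",
        "flow=From-Botnet-Example-TCP-CC-HTTP",  # illustrative label
        "flow=From-Botnet-Example-UDP-DNS",      # illustrative label
        "flow=From-Normal-Example-Host",         # illustrative label
        "flow=From-Normal-Example-Host",         # illustrative label
    ],
})


def coarse_class(label):
    """Map a raw label string to botnet / normal / background."""
    if "Botnet" in label:
        return "botnet"
    if "Normal" in label:
        return "normal"
    return "background"


df["Class"] = df["Label"].map(coarse_class)

# 70/30 split: sample 70% of the rows for training, keep the rest for testing.
train = df.sample(frac=0.7, random_state=0)
test = df.drop(train.index)

print(len(train), len(test))
print(df["Class"].value_counts())
```

A plain random sample like this does not preserve the class ratio, so on the real files a stratified split may be the safer choice for the rare botnet class.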