# Data streaming

## Task 4. Botnet profiling task

Group 97

Choose a probabilistic sequential model (Markov chain, n-grams, state machines, HMMs, ...). Use a sliding window to obtain sequence data. Learn the model from the data of one infected host and match its profile against all other hosts from the same scenario. Evaluate how many new infections your method finds and how many false positives it raises. Can you determine what behaviour your profile detects?
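As a rough illustration of the sliding-window step, the sketch below turns a sequence of discretized flow symbols into an n-gram frequency profile. The helper name `sliding_windows`, the window size of 3, and the toy symbols are assumptions made for illustration; they are not taken from the notebook.

```python
from collections import Counter


def sliding_windows(symbols, size):
    """Yield every contiguous window of `size` symbols as a tuple."""
    for i in range(len(symbols) - size + 1):
        yield tuple(symbols[i:i + size])


# Hypothetical discretized flow sequence for one host (toy data).
toy_sequence = ['a', 'b', 'a', 'b', 'c']

# Count each 3-gram and normalise the counts into relative frequencies.
trigram_counts = Counter(sliding_windows(toy_sequence, 3))
total = sum(trigram_counts.values())
profile = {gram: count / total for gram, count in trigram_counts.items()}
```

A profile of this kind can be built per host and compared between hosts, for example with a distance between the two frequency dictionaries.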

According to the dataset documentation, the labels of the NetFlows in scenario 10 are distributed as follows:

Total flows | Botnet flows    | Normal flows  | C&C flows   | Background flows
------------|-----------------|---------------|-------------|-------------------
1,309,791   | 106,315 (8.11%) | 15,847 (1.2%) | 37 (0.002%) | 1,187,592 (90.67%)

Reference: "An empirical comparison of botnet detection methods" Sebastian Garcia, Martin Grill, Jan Stiborek and Alejandro Zunino. Computers and Security Journal, Elsevier. 2014. Vol 45, pp 100-123. http://dx.doi.org/10.1016/j.cose.2014.05.011
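The documented distribution can be compared against the loaded data once a label column is available. The sketch below is a minimal example on toy labels; the helper name `label_distribution` is hypothetical, and the label strings in the actual file may not match the category names in the table above.

```python
import pandas as pd


def label_distribution(labels):
    """Return the count and percentage share of each label value."""
    counts = labels.value_counts()
    share = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({'flows': counts, 'percent': share})


# Toy labels for illustration only, not the real scenario 10 data.
toy_labels = pd.Series(['Background'] * 8 + ['Botnet', 'Normal'])
dist = label_distribution(toy_labels)
```

Assuming the cleaned DataFrame has the `Label` column defined later in this notebook, the same helper could be called as `label_distribution(df['Label'])`.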

In [1]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, scale
from sklearn.cluster import MiniBatchKMeans

np.random.seed(42)
random.seed(42)
%matplotlib inline

In [2]:
# define filepath for the scenario 10 dataset
filepath = './data/capture20110818.pcap.netflow.labeled'

# read all lines; the context manager closes the file even if reading fails
with open(filepath, 'r') as f:
    lines = f.readlines()
data = lines[1:] # drop the header

In [3]:
def preprocessing(data):
    '''Parse one raw NetFlow line into cleaned fields.

    Input
    -----
    data: str, one line of the labeled NetFlow file

    Return
    ------
    o: numpy array with the cleaned, formatted fields of the flow

    '''
    s = data.split('\t')
    s = [x for x in s if x] # remove empty elements
    if len(s) < 12: # fallback for the malformed line @2011-08-18 12:18:31.264, which is not tab-separated
        s = s[0].rsplit(' ', 11) 
    o = np.array([pd.to_datetime(s[0], format='%Y-%m-%d %H:%M:%S.%f'), # timestamp
                  float(s[1]), # duration
                  s[2], # protocol
                  s[3].split(':')[0], # ScrAddr (source address, port stripped)
                  s[5].split(':')[0], # DstAddr (port stripped); s[4] is skipped
                  s[6].lstrip('_').rstrip('_').rstrip(), # flags
                  int(s[7]), # Tos
                  int(s[8]), # packets
                  int(s[9]), # bytes
                  int(s[10]), # flows
                  s[11].rstrip('\n').rstrip() # label
                 ])
    return o

In [4]:
df = list(map(preprocessing, data)) # data preprocessing
df = pd.DataFrame(df, columns=['Time', 'Duration', 'Protocol', 'ScrAddr', 'DstAddr', 
                               'Flags', 'Tos', 'Packets', 'Bytes', 'Flows', 'Label'])

In [6]:
# save cleaned dataframe to csv; index=False avoids writing the row index as an extra unnamed column
df.to_csv('./data/scenario10_cleaned.csv', index=False)

In [8]:
# load data, parsing the timestamps so that the index becomes a DatetimeIndex
df = pd.read_csv('./data/scenario10_cleaned.csv', parse_dates=['Time'])
df.set_index('Time', drop=True, inplace=True)

df.head()

Time | Duration | Protocol | ScrAddr | DstAddr | Flags | Tos | Packets | Bytes | Flows | Label
-----|----------|----------|---------|---------|-------|-----|---------|-------|-------|------
2011-08-18 10:19:13.328 | 0.002 | TCP | 147.32.86.166 | 212.24.150.110 | FRPA | 0 | 4 | 321 | 1 | Background
2011-08-18 10:19:13.328 | 4.995 | UDP | 82.39.2.249 | 147.32.84.59 | INT | 0 | 617 | 40095 | 1 | Background
2011-08-18 10:19:13.329 | 4.996 | UDP | 147.32.84.59 | 82.39.2.249 | INT | 0 | 1290 | 1909200 | 1 | Background
2011-08-18 10:19:13.330 | 0.0 | TCP | 147.32.86.166 | 147.32.192.34 | A | 0 | 1 | 66 | 1 | Background
2011-08-18 10:19:13.330 | 0.0 | TCP | 212.24.150.110 | 147.32.86.166 | FPA | 0 | 2 | 169 | 1 | Background


## 1. Train model with one infected host

In [9]:
# assign selected infected host for training
INFECTED_HOST = '147.32.84.207'
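
One possible way to learn a profile from the selected host is a first-order Markov chain over discretized flow symbols. The sketch below is an assumption about how this step could look, not the notebook's actual model: the function name `fit_markov_chain`, the use of the protocol as the symbol, and the smoothing constant `alpha` are all illustrative choices.

```python
import numpy as np


def fit_markov_chain(sequence, states, alpha=1.0):
    """Estimate a first-order transition matrix with additive smoothing."""
    index = {state: i for i, state in enumerate(states)}
    # Start every cell at alpha so unseen transitions keep a non-zero probability.
    counts = np.full((len(states), len(states)), alpha, dtype=float)
    for current, following in zip(sequence[:-1], sequence[1:]):
        counts[index[current], index[following]] += 1
    # Normalise each row so it sums to one.
    return counts / counts.sum(axis=1, keepdims=True)


# Toy symbol sequence standing in for the flows of the infected host.
states = ['TCP', 'UDP', 'ICMP']
toy_host_sequence = ['TCP', 'TCP', 'UDP', 'TCP', 'ICMP', 'TCP']
transition = fit_markov_chain(toy_host_sequence, states)
```

With the DataFrame loaded above, the training sequence could be taken from the rows where `ScrAddr` equals `INFECTED_HOST`, kept in time order. Which flow features to discretize into symbols is a separate design decision.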

## 2. Test model
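
Matching a learned profile against other hosts can be sketched as scoring each host's sequence under the trained chain and flagging hosts whose score exceeds a threshold. Everything below is illustrative: the function names, the toy transition matrix, the host names, and the threshold of -1.0 are assumptions, and the threshold in particular would have to be tuned on real data.

```python
import numpy as np


def mean_log_likelihood(sequence, transition, index):
    """Average log-probability per transition of `sequence` under the chain."""
    steps = list(zip(sequence[:-1], sequence[1:]))
    if not steps:
        return float('-inf')  # too short to score
    return float(np.mean([np.log(transition[index[a], index[b]]) for a, b in steps]))


def evaluate(scores, infected, threshold):
    """Count true and false positives among hosts scoring at or above the threshold."""
    flagged = {host for host, score in scores.items() if score >= threshold}
    return len(flagged & infected), len(flagged - infected)


# Toy two-state chain and host sequences, for illustration only.
index = {'x': 0, 'y': 1}
transition = np.array([[0.9, 0.1],
                       [0.8, 0.2]])
host_sequences = {
    'host_a': ['x', 'x', 'x', 'y', 'x'],
    'host_b': ['y', 'y', 'y', 'y', 'y'],
}
scores = {host: mean_log_likelihood(seq, transition, index)
          for host, seq in host_sequences.items()}
true_positives, false_positives = evaluate(scores, infected={'host_a'}, threshold=-1.0)
```

Averaging the log-likelihood per transition keeps scores comparable between hosts with very different numbers of flows.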