### Botnet Challenge

This puzzle comes from Dr. Chiara Sabatti at Stanford University.

The data is a ~500 KB CSV summarizing some real network traffic from the past. The dataset has ~21K rows and covers 10 local workstation IPs over a three-month period. Half of these local IPs were compromised at some point during this period and became members of various botnets.

Each row consists of four columns:

* date: yyyy-mm-dd (from 2006-07-01 through 2006-09-30)
* l_ipn: local IP (coded as an integer from 0-9)
* r_asn: remote ASN (an integer which identifies the remote ISP)
* f: flows (count of connections for that day)


Reports of "odd" activity or suspicions about a machine's behavior triggered investigations on the following days (although the machine might have been compromised earlier):

* Date : IP
* 08-24 : 1
* 09-04 : 5
* 09-18 : 4
* 09-26 : 3, 6

### Parallel Coordinate Plot

I thought that a good way to visualize network connections was to create a parallel coordinate plot:

In [1]:
import colorlover as cl
from IPython.display import HTML
import numpy as np
import pandas as pd 
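# Note: in plotly >= 4 this module moved to the chart_studio package
# (import chart_studio.plotly as py)
import plotly.plotly as py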
import plotly.graph_objs as go
%matplotlib inline

For this plot, I'm only going to look at the IPNs where suspicious activity was reported: 

In [9]:
# Load data 
cs = pd.read_csv('computer_security.csv')

# Grab the IPNs that had reports of suspicious activity 
reports = [1,3,4,5,6]
# .copy() so the column assignments below modify an independent frame, not a view
cs = cs[cs.l_ipn.isin(reports)].copy()

For the parallel coordinate plot, it would be interesting to link the observations to the report dates to see if there is anything we can observe. Let's create a `reported` column to do so:

In [10]:

# Preprocess: add reported column indicating whether or not suspicious activity was reported on a given day / IPN
def get_reports(x):
    if x.date == '2006-08-24' and x.l_ipn == 1:
        return 1
    elif x.date == '2006-09-04' and x.l_ipn == 5:
        return 1
    elif x.date == '2006-09-18' and x.l_ipn == 4:
        return 1 
    elif x.date == '2006-09-26' and x.l_ipn == 3:
        return 1
    elif x.date == '2006-09-26' and x.l_ipn == 6:
        return 1
    else:
        return 0

# Apply row by row (axis=1) so each call sees one date / l_ipn pair
cs['reported'] = cs.apply(get_reports, axis=1)
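
As an aside, the same flag can be derived from a lookup of (date, IP) pairs instead of a chain of `elif` branches. The following is only a sketch on a made-up toy frame; `REPORTS` and `flag_reports` are hypothetical names, not part of the original notebook, and the pairs are copied from the report list in the challenge description.

```python
import pandas as pd

# (date, local IP) pairs copied from the challenge description.
REPORTS = [
    ("2006-08-24", 1),
    ("2006-09-04", 5),
    ("2006-09-18", 4),
    ("2006-09-26", 3),
    ("2006-09-26", 6),
]

def flag_reports(df, reports=REPORTS):
    """Return a 0/1 Series marking rows whose (date, l_ipn) pair was reported."""
    pairs = set(reports)
    flags = [int(pair in pairs) for pair in zip(df["date"], df["l_ipn"])]
    return pd.Series(flags, index=df.index)

# Toy rows for illustration only (not taken from the real dataset).
toy = pd.DataFrame({
    "date": ["2006-08-24", "2006-08-24", "2006-09-26", "2006-09-27"],
    "l_ipn": [1, 5, 6, 6],
})
toy["reported"] = flag_reports(toy)
print(toy)
```

Keeping the report dates in one list makes it easier to add or correct an entry without touching the function body.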

Let's go ahead and bucket the ASNs so the plot isn't as messy:

In [11]:
# Bucketize the ASNs for nicer plotting on parallel coord plot 
# Pair up the sorted unique ASNs: ranks 0-1 -> bucket 0, ranks 2-3 -> bucket 1, ...
# Integer division keeps the last ASN mapped even when the count is odd.
unique_r_asn = sorted(cs.r_asn.unique())
r_asn_mapping = {asn: i // 2 for i, asn in enumerate(unique_r_asn)}
cs.r_asn = cs.r_asn.map(r_asn_mapping)
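
To make the bucketing concrete, here is a small sketch on made-up ASN values (the numbers and the helper name `bucket_pairs` are illustrative assumptions, not taken from the dataset). Each distinct value is ranked in sorted order and the rank is integer-divided by 2, so neighbouring ASNs share a bucket and every value gets one, including the last value when the count is odd.

```python
def bucket_pairs(values):
    """Map each distinct value to the index of its pair in sorted order."""
    ordered = sorted(set(values))
    return {value: rank // 2 for rank, value in enumerate(ordered)}

# Made-up ASN values; duplicates are collapsed before ranking.
even_mapping = bucket_pairs([701, 3356, 174, 7018, 3356])
odd_mapping = bucket_pairs([701, 3356, 174, 7018, 9121])

print(even_mapping)
print(odd_mapping)
```

Note that bucket numbers reflect rank order only; two ASNs in the same bucket are adjacent in the sorted list, not necessarily related networks.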

Mapping date to integer:

In [12]:
# Map date to integer for plotting 
unique_date = list(cs.date.unique())
unique_date.sort()
date_mapping = {unique_date[i]:i for i in range(len(unique_date))}
cs.date = cs.date.apply(lambda x: date_mapping[x])
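
One caveat with the mapping above: it numbers the dates that actually appear in the data, so the result equals "days from July 1st" only if no calendar day is missing. A possible alternative, sketched here on toy values, computes the offset directly from the calendar. `START` and `days_from_start` are hypothetical names, and the start date is assumed from the date range given in the challenge description.

```python
import pandas as pd

# First day covered by the dataset, per the challenge description.
START = pd.Timestamp("2006-07-01")

def days_from_start(dates, start=START):
    """Convert a Series of yyyy-mm-dd strings to whole days elapsed since `start`."""
    parsed = pd.to_datetime(dates, format="%Y-%m-%d")
    return (parsed - start).dt.days

# Toy dates with a gap (2006-07-02 is deliberately absent).
toy_dates = pd.Series(["2006-07-01", "2006-07-03", "2006-09-30"])
offsets = days_from_start(toy_dates)
print(offsets.tolist())
```

With this approach a gap in the data leaves a gap on the axis rather than shifting every later day.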

Normalizing connections:

In [13]:
# Normalize connections 
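# Scale each IPN's flows by that IPN's own maximum, so every IPN spans (0, 1]
cs['f'] = cs.f.astype(float)  # avoid writing floats into an integer column
for i in reports:
    mask = cs.l_ipn == i
    cs.loc[mask, 'f'] = cs.loc[mask, 'f'] / cs.loc[mask, 'f'].max()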
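
The per-IPN normalization can also be written without an explicit loop. The sketch below uses `groupby(...).transform("max")` on a toy frame with made-up flow counts; `normalize_per_group` is a hypothetical helper, shown only to illustrate the idea.

```python
import pandas as pd

def normalize_per_group(df, group_col="l_ipn", value_col="f"):
    """Divide each value by the maximum value within its own group."""
    group_max = df.groupby(group_col)[value_col].transform("max")
    return df[value_col] / group_max

# Made-up flow counts for two local IPs.
toy = pd.DataFrame({"l_ipn": [1, 1, 3, 3], "f": [5, 10, 2, 8]})
toy["f_norm"] = normalize_per_group(toy)
print(toy)
```

Because each IPN is scaled by its own maximum, the busiest day of every machine maps to 1, which makes quiet and busy machines comparable on the same color scale.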

Parallel coordinate plot:

* Create sliders by dragging a window over each column.

In [14]:
# Create Parallel Coord Plot 
reds = cl.scales['3']['seq']['Reds']
level = np.linspace(0,1,len(reds))
# Pair each evenly spaced level in [0, 1] with one of the red shades
seq_colors = [[lvl, color] for lvl, color in zip(level, reds)]

data = [
    go.Parcoords(
        line = dict(color = cs['f'],
                   colorscale=seq_colors,
                   showscale = True),
        dimensions = list([
            dict(tickvals = [0,1],
                label = 'Incident Reported?',
                ticktext = ['No','Yes'],
                values = cs.reported),
            dict(range = [0,91],
                label = 'Days from July 1st', values = cs.date),
            dict(range = [1,6],
                 tickvals = [1,3,4,5,6],
                 label = 'Local IPN', values = cs.l_ipn),
            dict(range = [0,470],
                 label = 'Remote ASN Bucket', values = cs.r_asn),
            dict(range = [0,1],
                 label = 'Normalized Connections per IPN',
                 values = cs.f)
        ])
    )
]
layout=go.Layout(title="Local IP and Remote ASN Interactions")
figure=go.Figure(data=data,layout=layout)

py.iplot(figure, filename = 'Local IP and Remote ASN Interactions')