# The dataset
---

The dataset for this practical exercise consists of space-delimited log records that look like this (the first row below only numbers the columns for reference):
```
1       2           3 4     5   6    7    8  9  10                       11                      12 13  
1.2.3.4 11.22.33.44 6 53211 443 1910 2452 14 16 2016-08-15-13:30:28.8410 2016-08-15-13:30:29.6240 1 1
1.2.3.4 11.22.33.44 6 53214 443 1698 2452 14 16 2016-08-15-13:35:18.6120 2016-08-15-13:35:19.4037 1 1
1.2.3.4 11.22.33.44 6 53229 443 1698 2452 14 16 2016-08-15-13:39:57.4420 2016-08-15-13:39:58.2344 1 1
1.2.3.4 11.22.33.44 6 53232 443 1698 2452 14 16 2016-08-15-13:44:26.4776 2016-08-15-13:44:27.2729 1 1
1.2.3.4 11.22.33.44 6 53235 443 1698 2452 14 16 2016-08-15-13:49:14.8779 2016-08-15-13:49:14.8779 1 1
1.2.3.4 11.22.33.44 6 53239 443 1698 2452 14 16 2016-08-15-13:53:45.0699 2016-08-15-13:53:45.8680 1 1
1.2.3.4 11.22.33.44 6 53256 443 1698 2452 14 16 2016-08-15-13:58:43.5585 2016-08-15-13:58:44.3501 1 1
```

# Data dictionary / code book
---

The data is composed of the following columns:

*  first column is an IP address, referred to below as IP-Address 1.
*  second column is an IP address, referred to below as IP-Address 2.
*  third column is the protocol.  In this case the protocol is set to 6 (TCP).
*  fourth column is the port associated with IP-Address 1.  In this case, ephemeral ports.
*  fifth column is the port associated with IP-Address 2.  In this case, HTTPS (443).
*  sixth column is the number of bytes received by IP-Address 1.  
*  seventh column is the number of bytes received by IP-Address 2.  
*  eighth column is the number of packets received by IP-Address 1. 
*  ninth column is the number of packets received by IP-Address 2.  
*  tenth column is the time the first packet was received.
*  eleventh column is the time the last packet was received.
*  twelfth column shows which IP Address sent the first packet (initiated the conversation).
*  thirteenth column shows which IP Address sent the last packet (finished the conversation).
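
As a rough illustration of the data dictionary above, one record can be turned into a named structure so that later code refers to fields by name instead of by index. This is only a sketch: the `Record` type, its field names, and the `parse_record` helper are hypothetical names chosen here, not part of the dataset or of the notebook cells that follow.

```python
from collections import namedtuple
from datetime import datetime

# Field names are illustrative assumptions; the dataset itself has no header row.
Record = namedtuple('Record', [
    'ip1', 'ip2', 'protocol', 'port1', 'port2',
    'bytes1', 'bytes2', 'packets1', 'packets2',
    'first_time', 'last_time', 'first_sender', 'last_sender',
])

TIME_FORMAT = '%Y-%m-%d-%H:%M:%S.%f'  # e.g. 2016-08-15-13:30:28.8410


def parse_record(line):
    """Convert one space-delimited log line into a Record."""
    f = line.split()
    return Record(
        f[0], f[1], int(f[2]), int(f[3]), int(f[4]),
        int(f[5]), int(f[6]), int(f[7]), int(f[8]),
        datetime.strptime(f[9], TIME_FORMAT),
        datetime.strptime(f[10], TIME_FORMAT),
        int(f[11]), int(f[12]),
    )


# First data row from the sample shown earlier
sample = ('1.2.3.4 11.22.33.44 6 53211 443 1910 2452 14 16 '
          '2016-08-15-13:30:28.8410 2016-08-15-13:30:29.6240 1 1')
record = parse_record(sample)
print(record.port2, record.last_time - record.first_time)
```

The cells below keep using plain list indexes (`line[0]`, `line[9]`, and so on), which are zero-based, so column N in the list above is `line[N - 1]`.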

# Questions
---

Based on this data, we will explore several analytic questions, such as:

* Find the IP pair with the biggest time difference between first packet and last packet (within a single record).
* Find the IP with the most total bytes and the IP with the least total bytes.
* Find the most common ephemeral port by bytes sent AND by packets sent.

In [4]:
import csv

In [5]:
SMALLFILE = 'output.small.csv'
LARGEFILE = 'output.csv'

ACTIVEFILE = SMALLFILE

# Question
---

Find the IP pair with the biggest time difference between first packet and last packet (within a single record).

In [22]:
from datetime import datetime, timedelta

In [23]:
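# Open the file in a with-block so it is closed when the loop finishes
with open(ACTIVEFILE) as fin:
    lines = csv.reader(fin, delimiter=' ')

    high_diff = timedelta(0)
    high_ip_pair = None

    for line in lines:
        ip1 = line[0]
        ip2 = line[1]

        # Timestamps look like 2016-08-15-13:30:28.8410
        firsttime = datetime.strptime(line[9], '%Y-%m-%d-%H:%M:%S.%f')
        lasttime = datetime.strptime(line[10], '%Y-%m-%d-%H:%M:%S.%f')

        diff = lasttime - firsttime

        # Keep the largest difference seen so far and the pair it belongs to
        if diff > high_diff:
            high_diff = diff
            high_ip_pair = (ip1, ip2)

        # Printed inside the loop, so each row shows the running maximum
        print(high_diff, high_ip_pair)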

0:00:00.629493 ('113.102.66.188', '165.90.83.51')
0:00:01.246643 ('57.245.8.149', '165.90.83.102')
0:00:03.270895 ('48.209.101.198', '85.155.248.149')
0:00:03.270895 ('48.209.101.198', '85.155.248.149')
0:00:03.270895 ('48.209.101.198', '85.155.248.149')
0:00:03.270895 ('48.209.101.198', '85.155.248.149')
0:00:03.270895 ('48.209.101.198', '85.155.248.149')
0:00:03.270895 ('48.209.101.198', '85.155.248.149')
0:00:03.270895 ('48.209.101.198', '85.155.248.149')
0:00:03.270895 ('48.209.101.198', '85.155.248.149')
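
The loop above tracks a running maximum by hand. One possible alternative, sketched here, lets the built-in `max()` do the comparison and returns only the final answer. The helper names `duration` and `longest_conversation` are hypothetical, and the two rows fed in are taken from the sample at the top of this document rather than from the real file.

```python
import csv
import io
from datetime import datetime

TIME_FORMAT = '%Y-%m-%d-%H:%M:%S.%f'


def duration(row):
    """Return last-packet time minus first-packet time for one row."""
    first = datetime.strptime(row[9], TIME_FORMAT)
    last = datetime.strptime(row[10], TIME_FORMAT)
    return last - first


def longest_conversation(rows):
    """Return (duration, (ip1, ip2)) for the row with the largest duration."""
    best = max(rows, key=duration)
    return duration(best), (best[0], best[1])


# Two rows from the sample data; with the real file, replace io.StringIO(...)
# with open(ACTIVEFILE).
sample = io.StringIO(
    '1.2.3.4 11.22.33.44 6 53211 443 1910 2452 14 16 '
    '2016-08-15-13:30:28.8410 2016-08-15-13:30:29.6240 1 1\n'
    '1.2.3.4 11.22.33.44 6 53214 443 1698 2452 14 16 '
    '2016-08-15-13:35:18.6120 2016-08-15-13:35:19.4037 1 1\n'
)
with sample as fin:
    result = longest_conversation(csv.reader(fin, delimiter=' '))
print(result)
```

Like the explicit loop, `max()` keeps the first row when two rows tie. Unlike the loop, it raises `ValueError` on an empty file, so an empty-input check may be worth adding in practice.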


# Question
---

Find the IP with the most total bytes and the IP with the least total bytes.


In [21]:
from collections import defaultdict, Counter

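# Running byte totals, keyed by IP address (one tally per IP column)
ips1 = defaultdict(int)
ips2 = defaultdict(int)

# Open the file in a with-block so it is closed after reading
with open(ACTIVEFILE) as fin:
    lines = csv.reader(fin, delimiter=' ')

    for line in lines:
        ip1 = line[0]
        ip2 = line[1]

        bytes1 = int(line[5])
        bytes2 = int(line[6])

        ips1[ip1] += bytes1
        ips2[ip2] += bytes2

# Convert to Counter to get the .most_common() method
ips1 = Counter(ips1)
ips2 = Counter(ips2)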

print('First IP:')
print(ips1.most_common(1))
print(ips1.most_common()[-1])
print()
print('Second IP:')
print(ips2.most_common(1)[0][0])
print(ips2.most_common()[-1][0])

First IP:
[('57.245.8.149', 935747421)]
('113.102.66.188', 121986028)

Second IP:
112.160.150.170
136.226.96.121


In [26]:
# Something fun
# How do we eliminate some of the clutter and duplication in the code above?
# How do we make the code more readable?
# How do we make the code more maintainable?


def most_common_ip(ips, most=True):
    '''Given a Counter-like object (i.e. an object with a .most_common() method),
    return the most OR least common IP.
    
    Default is to return the most common IP.
    Set most=False to return the least common IP.
    '''
    
    if most:
        return ips.most_common(1)[0][0]
    else:
        return ips.most_common()[-1][0]
    

print(most_common_ip(ips2))
print(most_common_ip(ips2, most=False))


112.160.150.170
136.226.96.121
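
In the same spirit of reducing duplication, the tallying loop itself can be factored out, since the per-IP and per-port totals differ only in which columns they read. The sketch below assumes a hypothetical helper named `tally` that takes column indexes as parameters. It runs on two rows from the sample at the top of this document, not on the real file.

```python
import csv
import io
from collections import Counter


def tally(rows, key_col, value_col):
    """Sum the integer in value_col for each distinct value in key_col."""
    totals = Counter()
    for row in rows:
        totals[row[key_col]] += int(row[value_col])
    return totals


# Two rows from the sample data; with the real file, read from
# open(ACTIVEFILE) instead of io.StringIO.
sample_text = (
    '1.2.3.4 11.22.33.44 6 53211 443 1910 2452 14 16 '
    '2016-08-15-13:30:28.8410 2016-08-15-13:30:29.6240 1 1\n'
    '1.2.3.4 11.22.33.44 6 53214 443 1698 2452 14 16 '
    '2016-08-15-13:35:18.6120 2016-08-15-13:35:19.4037 1 1\n'
)
rows = list(csv.reader(io.StringIO(sample_text), delimiter=' '))

bytes_by_ip1 = tally(rows, key_col=0, value_col=5)    # column 1 keyed, column 6 summed
bytes_by_port1 = tally(rows, key_col=3, value_col=5)  # column 4 keyed, column 6 summed
print(bytes_by_ip1.most_common(1))
print(bytes_by_port1.most_common(1))
```

Because `tally` returns a `Counter`, its result can be passed straight to `most_common_ip` defined above.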


# Question
---

Find the most common ephemeral port by bytes sent.

Also find the most common ephemeral port by packets sent (NOTE: This step is left as an exercise for the student)

In [34]:
from collections import defaultdict, Counter

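# Running byte totals, keyed by port (one tally per port column)
ports1 = defaultdict(int)
ports2 = defaultdict(int)

# Open the file in a with-block so it is closed after reading
with open(ACTIVEFILE) as fin:
    lines = csv.reader(fin, delimiter=' ')

    for line in lines:
        port1 = line[3]
        port2 = line[4]

        bytes1 = int(line[5])
        bytes2 = int(line[6])

        ports1[port1] += bytes1
        ports2[port2] += bytes2

# Convert to Counter to get the .most_common() method
ports1 = Counter(ports1)
ports2 = Counter(ports2)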

# most_common_ip works on any Counter, so it is reused here for ports
print('First Port:')
print(most_common_ip(ports1))
print(most_common_ip(ports1, most=False))
print()
print('Second Port:')
print(most_common_ip(ports2))
print(most_common_ip(ports2, most=False))


First Port:
58947
6508

Second Port:
7312
53
