# The dataset
---

The dataset for this practical exercise consists of log records similar to the following (the first row only numbers the columns):
```
1       2           3 4     5   6    7    8  9  10                       11                      12 13  
1.2.3.4 11.22.33.44 6 53211 443 1910 2452 14 16 2016-08-15-13:30:28.8410 2016-08-15-13:30:29.6240 1 1
1.2.3.4 11.22.33.44 6 53214 443 1698 2452 14 16 2016-08-15-13:35:18.6120 2016-08-15-13:35:19.4037 1 1
1.2.3.4 11.22.33.44 6 53229 443 1698 2452 14 16 2016-08-15-13:39:57.4420 2016-08-15-13:39:58.2344 1 1
1.2.3.4 11.22.33.44 6 53232 443 1698 2452 14 16 2016-08-15-13:44:26.4776 2016-08-15-13:44:27.2729 1 1
1.2.3.4 11.22.33.44 6 53235 443 1698 2452 14 16 2016-08-15-13:49:14.8779 2016-08-15-13:49:14.8779 1 1
1.2.3.4 11.22.33.44 6 53239 443 1698 2452 14 16 2016-08-15-13:53:45.0699 2016-08-15-13:53:45.8680 1 1
1.2.3.4 11.22.33.44 6 53256 443 1698 2452 14 16 2016-08-15-13:58:43.5585 2016-08-15-13:58:44.3501 1 1
```

# Data dictionary / code book
---

Each record is composed of the following space-separated columns:

*  The first column is an IP address, referred to below as IP-Address 1.
*  The second column is an IP address, referred to below as IP-Address 2.
*  The third column is the protocol. In the sample above the protocol is 6 (TCP).
*  The fourth column is the port associated with IP-Address 1. In the sample above these are ephemeral ports.
*  The fifth column is the port associated with IP-Address 2. In the sample above this is HTTPS (443).
*  The sixth column is the number of bytes received by IP-Address 1.
*  The seventh column is the number of bytes received by IP-Address 2.
*  The eighth column is the number of packets received by IP-Address 1.
*  The ninth column is the number of packets received by IP-Address 2.
*  The tenth column is the time the first packet was received.
*  The eleventh column is the time the last packet was received.
*  The twelfth column shows which IP address sent the first packet (initiated the conversation).
*  The thirteenth column shows which IP address sent the last packet (finished the conversation).
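
The column layout above can be turned into typed fields. The following is a minimal sketch and not part of the original exercise: the `Record` tuple, its field names, and the `parse_record` helper are hypothetical names chosen for illustration, and the timestamp format string is an assumption inferred from the sample rows.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical field names, one per column of the data dictionary
Record = namedtuple('Record', [
    'ip1', 'ip2', 'protocol', 'port1', 'port2',
    'bytes1', 'bytes2', 'packets1', 'packets2',
    'first_seen', 'last_seen', 'initiator', 'finisher',
])

# Assumed timestamp layout, inferred from the sample rows
TIME_FORMAT = '%Y-%m-%d-%H:%M:%S.%f'


def parse_record(line):
    """Convert one space-separated log line into a typed Record."""
    fields = line.split()
    return Record(
        fields[0], fields[1],
        # Columns 3 to 9 (protocol, ports, bytes, packets) are integers
        *(int(value) for value in fields[2:9]),
        datetime.strptime(fields[9], TIME_FORMAT),
        datetime.strptime(fields[10], TIME_FORMAT),
        int(fields[11]), int(fields[12]),
    )


sample = ('1.2.3.4 11.22.33.44 6 53211 443 1910 2452 14 16 '
          '2016-08-15-13:30:28.8410 2016-08-15-13:30:29.6240 1 1')
record = parse_record(sample)
print(record.protocol, record.bytes1, record.packets2)
print('Duration:', record.last_seen - record.first_seen)
```

Named fields such as `record.bytes1` are harder to mix up than positional indexes such as `line[5]`, at the cost of parsing every column even when only a few are needed.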

# Questions
---

Based on this data, we will explore several analytic questions, such as these:

* count the number of IP addresses
* count the number of IP address pairs (from and to)
* find the biggest/smallest values for packet counts
* find the most common protocols
* find the most common protocols based on the total number of bytes associated with each protocol


In [None]:
import csv

In [None]:
SMALLFILE = 'output.small.csv'
LARGEFILE = 'output.csv'

ACTIVEFILE = SMALLFILE

# Question
---

How many unique IP addresses appear in the file as IP address 1 and as IP address 2?

In [None]:
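ipaddresses1 = set()
ipaddresses2 = set()

# The with block closes the file once every record has been read
with open(ACTIVEFILE, newline='') as fin:
    logs = csv.reader(fin, delimiter=' ')
    for line in logs:
        # Only the first two columns are needed; the rest are ignored
        ipaddr1, ipaddr2, *_ = line
        ipaddresses1.add(ipaddr1)
        ipaddresses2.add(ipaddr2)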

In [None]:
print('Number of ipaddresses in column 1:', len(ipaddresses1))
print('Number of ipaddresses in column 2:', len(ipaddresses2))

In [None]:
print(sorted(ipaddresses1))
print()
print(sorted(ipaddresses2))
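
The question can also be read as asking which addresses appear in both columns, or in either one. Under that reading, Python's set operators give the answer directly. A minimal sketch, using made-up addresses and the hypothetical names `col1` and `col2` rather than values from the dataset:

```python
# Illustrative addresses only; these are not taken from the dataset
col1 = {'1.2.3.4', '5.6.7.8', '11.22.33.44'}
col2 = {'11.22.33.44', '9.9.9.9'}

in_both = col1 & col2      # intersection: seen in both columns
in_either = col1 | col2    # union: seen in at least one column
only_in_1 = col1 - col2    # difference: seen only in column 1

print('In both columns:', sorted(in_both))
print('In either column:', len(in_either))
print('Only in column 1:', sorted(only_in_1))
```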

# Question
---

How many unique IP address pairs appear in the file?

In [None]:
ippairs = set()

with open(ACTIVEFILE, newline='') as fin:
    logs = csv.reader(fin, delimiter=' ')
    for line in logs:
        ipaddr1, ipaddr2, *_ = line
        # A tuple is hashable, so each (from, to) pair is stored only once
        ippairs.add((ipaddr1, ipaddr2))


In [None]:
print('Number of ip pairs:', len(ippairs))

# Question
---

What are the biggest/smallest values for packet counts?

In [None]:
packets1 = set()
packets2 = set()

with open(ACTIVEFILE, newline='') as fin:
    logs = csv.reader(fin, delimiter=' ')
    for line in logs:
        packets1.add(int(line[7]))  # column 8: packets received by IP-Address 1
        packets2.add(int(line[8]))  # column 9: packets received by IP-Address 2

# Two ways of solving this question...
# Method ONE: sort, then take the first and last elements
packets1 = sorted(packets1)
sm_packets1 = packets1[0]
lg_packets1 = packets1[-1]

# NOTE: sorting is O(n log n), so this tends to be slower than method TWO

# Method TWO: scan for the minimum and the maximum directly
sm_packets2 = min(packets2)
lg_packets2 = max(packets2)

# NOTE: each scan is O(n), so this tends to be faster than method ONE

In [None]:
print('Addr 1: Smallest packet count:', sm_packets1)
print('Addr 1: Largest packet count:', lg_packets1)
print()
print('Addr 2: Smallest packet count:', sm_packets2)
print('Addr 2: Largest packet count:', lg_packets2)
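
Because only the extremes are needed, the set of distinct values is not strictly required. The following is a minimal sketch of a third approach that tracks a running minimum and maximum in a single pass. The helper name `min_max` is hypothetical, and the sample counts are illustrative rather than read from the file.

```python
def min_max(values):
    """Return (smallest, largest) from a single pass over values."""
    iterator = iter(values)
    try:
        smallest = largest = next(iterator)
    except StopIteration:
        raise ValueError('min_max() requires at least one value') from None
    for value in iterator:
        if value < smallest:
            smallest = value
        elif value > largest:
            largest = value
    return smallest, largest


# Illustrative packet counts, not values from the dataset
counts = [14, 16, 3, 250, 14, 7]
sm, lg = min_max(counts)
print('Smallest packet count:', sm)
print('Largest packet count:', lg)
```

This keeps memory use constant, which may matter when switching `ACTIVEFILE` to the larger file.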

# Question
---

What are the three most common protocols, based on the number of times each one appears?

In [None]:
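from collections import Counter

protocols = []

with open(ACTIVEFILE, newline='') as fin:
    logs = csv.reader(fin, delimiter=' ')
    for line in logs:
        protocols.append(line[2])  # column 3: protocol number

# Counter tallies how many times each protocol number occurs
count = Counter(protocols)
print(count.most_common(3))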
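
The protocol column holds IANA protocol numbers as strings, so the output is easier to read once the numbers are mapped to names. A minimal sketch, assuming a small hand-written lookup table; `PROTOCOL_NAMES` and `sample_count` are hypothetical names, and the counts are illustrative rather than results from the dataset.

```python
from collections import Counter

# A few well-known IANA protocol numbers; extend the table as needed
PROTOCOL_NAMES = {'1': 'ICMP', '6': 'TCP', '17': 'UDP'}

# Illustrative counts only, not results from the dataset
sample_count = Counter({'6': 120, '17': 45, '1': 3, '47': 1})

# Fall back to the raw number when a protocol is not in the table
labelled = [
    (PROTOCOL_NAMES.get(number, number), total)
    for number, total in sample_count.most_common(3)
]
print(labelled)
```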

# Question
---

What are the most common protocols, based on the total number of bytes associated with each protocol?

In [None]:
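from collections import defaultdict

protocols = defaultdict(int)

with open(ACTIVEFILE, newline='') as fin:
    logs = csv.reader(fin, delimiter=' ')
    for line in logs:
        bytes1 = int(line[5])  # column 6: bytes received by IP-Address 1
        bytes2 = int(line[6])  # column 7: bytes received by IP-Address 2
        # Accumulate the bytes seen in both directions per protocol
        protocols[line[2]] += bytes1 + bytes2

print(protocols)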

In [None]:
from collections import Counter
count = Counter(protocols)

In [None]:
# To see all the most common items, 
# you can call the .most_common() method

count.most_common()

In [None]:
# To limit the display to the n most common items, 
# you can identify how many elements to display:

count.most_common(3)

In [None]:
# Because the .most_common() method produces an ordered 
# list, you can also slice the items off the "least common" 
# end of the list using standard slice notation:

count.most_common()[-3:]
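
Note that the slice above keeps the descending order, so the rarest item comes last. A minimal sketch with illustrative byte totals (the name `sample` is hypothetical), showing a reversed slice that lists the least common item first:

```python
from collections import Counter

# Illustrative byte totals per protocol, not results from the dataset
sample = Counter({'6': 5000, '17': 900, '1': 40, '47': 2})

# Last three items, still ordered from most to least common
tail = sample.most_common()[-3:]

# Same three items, walked backwards so the rarest comes first
rarest_first = sample.most_common()[:-4:-1]

print(tail)
print(rarest_first)
```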