# The dataset
---

The dataset for this practical exercise is composed of log records that look similar to this:
```
1       2           3 4     5   6    7    8  9  10                       11                      12 13  
1.2.3.4 11.22.33.44 6 53211 443 1910 2452 14 16 2016-08-15-13:30:28.8410 2016-08-15-13:30:29.6240 1 1
1.2.3.4 11.22.33.44 6 53214 443 1698 2452 14 16 2016-08-15-13:35:18.6120 2016-08-15-13:35:19.4037 1 1
1.2.3.4 11.22.33.44 6 53229 443 1698 2452 14 16 2016-08-15-13:39:57.4420 2016-08-15-13:39:58.2344 1 1
1.2.3.4 11.22.33.44 6 53232 443 1698 2452 14 16 2016-08-15-13:44:26.4776 2016-08-15-13:44:27.2729 1 1
1.2.3.4 11.22.33.44 6 53235 443 1698 2452 14 16 2016-08-15-13:49:14.8779 2016-08-15-13:49:14.8779 1 1
1.2.3.4 11.22.33.44 6 53239 443 1698 2452 14 16 2016-08-15-13:53:45.0699 2016-08-15-13:53.45.8680 1 1
1.2.3.4 11.22.33.44 6 53256 443 1698 2452 14 16 2016-08-15-13:58:43.5585 2016-08-15-13:58:44.3501 1 1
```

# Data dictionary / code book
---

The data is composed of the following

*  first column is an IP address. IP-Address (1)
*  second column is an IP address. IP-Address (2)
*  third column is the protocol.  In this case the protocol is set to 6 (TCP)
*  fourth column is the port associated with IP-Address 1.  In this case Ephemeral Ports.
*  fifth column is the port associated with IP-Address 2.  In this case HTTPS (443).
*  sixth column is the number of bytes received by IP-Address 1.  
*  seventh column is the number of bytes received by IP-Address 2.  
*  eighth column is the number of packets received by IP-Address 1. 
*  ninth column is the number of packets received by IP-Address 2.  
*  tenth column is the time the first packet was received.
*  eleventh column is the time the last packet was received.
*  twelfth column shows which IP Address sent the first packet (initiated the conversation).
*  thirteenth column shows which IP Address sent the last packet (finished the conversation).

# Questions
---

Based on this data, we will explore multiple analytic questions, such as these...

* for each IP pair, identify the difference in time (using the time the first packet was received) between the current record and the next subsequent record, capturing all the time differences associated with each IP pair...
    ```
    1.2.3.4 11.22.33.44 ... 2016-08-15-13:30:28.8410 >>> n/a
    1.2.3.4 11.22.33.44 ... 2016-08-15-13:35:18.6120 >>> approx 5 seconds from previous record
    5.5.5.5 99.99.99.99 ... 2016-08-15-13:39:57.4420 >>> n/a (different IP pair)
    1.2.3.4 11.22.33.44 ... 2016-08-15-13:44:26.4776 >>> approx 9 seconds from previous record

    (1.2.3.4, 11.22.33.44) is associated with time differences of 5, 9, etc...
    ```
* for each IP pair, calculate the average difference and the standard deviation of the times 
    * the average will give you a sense of how often signal occurs
    * the standard deviation will give you a sense of how much the average varies (how wide the spread is)


# Question
---

For each IP pair, identify the difference in time (using the time the first packet was received) between the current record and the next subsequent record, capturing all the time differences associated with each IP pair...

In [57]:
DUPEFILE = 'output.small.dupes.csv'
LARGEFILE = 'output.csv'

ACTIVEFILE = LARGEFILE

In [58]:
import csv
from collections import defaultdict

from datetime import datetime

logs = csv.reader(open(ACTIVEFILE), delimiter=' ')

timedeltas = defaultdict(list)

for line in logs:
    ip1, ip2 = line[0:2]
    # print(line[9])
    time = datetime.strptime(line[9], '%Y-%m-%d-%H:%M:%S.%f') # 2016-08-15-13:58:43.5585
    timedeltas[(ip1, ip2)].append(time)


In [59]:
import itertools
import operator
import statistics
from collections import Counter

stdev = []
averages = []

lowest_stdev = 100000000
lowest_ippair = None

In [62]:
deviations = defaultdict(list)

for k, v in timedeltas.items():
    
    a = [(y - x).seconds + (y - x).microseconds/100000  for x, y in zip(v, v[1:])]
    
    avg = sum(a)/len(a)
    
    std = statistics.stdev(a)
    deviations[k].append(std)
    
    if std <= lowest_stdev:
        lowest_stdev = std
        lowest_ippair = k
    
    stdev.append(std)
    averages.append(avg)

In [63]:
c = Counter(deviations)


print(min(stdev), max(stdev), min(averages), max(averages))
print(lowest_ippair)

0.5126484193666336 330007.5457974581 86046.17058482756 682626.1683168317
('105.71.60.19', '93.138.73.231')


In [64]:
# c

In [65]:
cmc = c.most_common()


In [66]:
cmc[-5:]

[(('87.223.140.1', '19.137.118.175'), [186.12128123939786]),
 (('45.121.54.166', '51.193.232.193'), [184.26673339927996]),
 (('99.212.188.183', '68.80.214.81'), [184.07288778149754]),
 (('239.41.249.194', '239.218.61.8'), [169.43494359133112]),
 (('105.71.60.19', '93.138.73.231'), [0.5126484193666336])]

In [55]:
min(stdev)

0.0

In [56]:
a

[317810,
 618587,
 697328,
 867539,
 790587,
 552373,
 990107,
 140970,
 296229,
 508527,
 665544,
 886129,
 1022727,
 538412,
 158327,
 90238,
 586151,
 361695,
 307323,
 281984,
 863673,
 741970,
 152498,
 268071,
 226756,
 350430,
 670630,
 440059,
 622029,
 1055635,
 694993,
 110056,
 750782,
 951819,
 686356,
 1066332,
 489712,
 571023,
 542065,
 430488,
 779666,
 783956,
 1085972,
 488034,
 829180,
 229549,
 528048,
 179897,
 444623,
 959559,
 568791,
 778202,
 595954,
 979071,
 578851,
 800899,
 427864,
 372730,
 459810,
 745776,
 577739,
 670871,
 409181,
 150931,
 1072895,
 698971,
 101447,
 929807,
 896114,
 702495,
 979362,
 1078381,
 434203,
 933042,
 522892,
 727796,
 630200,
 538025,
 345045,
 1030009,
 616344,
 520217,
 333242,
 753413,
 453788,
 675667,
 820637,
 374071,
 451420,
 550458,
 354977,
 583758,
 101086,
 711342,
 658184,
 512903,
 328212,
 205576,
 593562,
 721555,
 1066967,
 401957,
 908821,
 335164,
 457752,
 325227,
 837035,
 651805,
 197701,
 670987,
 93