### Data Munging Challenges


#### Challenge 1

- Open up a new IPython notebook
- Download a few MTA turnstile data files
- Open a file, read it with `csv.reader`, and build a Python dict with
  one key for each (C/A, UNIT, SCP, STATION) tuple. These are the
  first four columns. The value for each key should be a list of
  lists, where each inner list holds the remaining columns of one
  row. For example, one key-value pair should look like

```
{    ('A002','R051','02-00-00','LEXINGTON AVE'):
[
['NQR456', 'BMT', '01/03/2015', '03:00:00', 'REGULAR', '0004945474', '0001675324'],
['NQR456', 'BMT', '01/03/2015', '07:00:00', 'REGULAR', '0004945478', '0001675333'],
['NQR456', 'BMT', '01/03/2015', '11:00:00', 'REGULAR', '0004945515', '0001675364'],
...
]
}
```

In [21]:
from __future__ import print_function, division

from collections import defaultdict, Counter
import csv
from datetime import datetime
from glob import glob
from itertools import islice
from pprint import pprint
import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

In [5]:
!ls 

Untitled.ipynb                    turnstile_160903.txt
chal-01-prob-1-3-pandas.ipynb     turnstile_160910.txt
chal-01-prob-1-3-sampathweb.ipynb turnstile_160917.txt


In [6]:
for filename in glob("turnstile_*.txt"):
    print(filename)

turnstile_160903.txt
turnstile_160910.txt
turnstile_160917.txt


In [22]:
def read_turnstiles(location):
    turnstile_data = defaultdict(list)

    for filename in glob(location):
        print("Extracting File: ", filename)
        with open(filename, 'r') as f:
            reader = csv.reader(f)
            columns = next(reader)  # header row; next() works in Python 2.6+ and 3
            print(columns)

            for line in reader:
                # The last column is padded with trailing spaces; strip them
                line[-1] = line[-1].strip()
                key = tuple(line[:4])
                turnstile_data[key].append(line[4:])
                
    return turnstile_data

In [23]:
turnstile_data = read_turnstiles('turnstile_*.txt')

Extracting File:  turnstile_160903.txt
['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', 'TIME', 'DESC', 'ENTRIES', 'EXITS                                                               ']
Extracting File:  turnstile_160910.txt
['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', 'TIME', 'DESC', 'ENTRIES', 'EXITS                                                               ']
Extracting File:  turnstile_160917.txt
['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', 'TIME', 'DESC', 'ENTRIES', 'EXITS                                                               ']


In [24]:
sample_keys = list(islice(turnstile_data.keys(), 0, 5))
sample_keys

[('N337', 'R255', '00-00-03', 'BRIARWOOD'),
 ('N091', 'R029', '02-05-03', 'CHAMBERS ST'),
 ('R604', 'R108', '03-00-03', 'BOROUGH HALL'),
 ('N098', 'R028', '00-00-02', 'FULTON ST'),
 ('G015', 'R312', '01-05-00', 'W 8 ST-AQUARIUM')]

In [25]:
turnstile_data[('A055', 'R227', '00-00-02', 'RECTOR ST')][:5]

[['R', 'BMT', '08/27/2016', '00:00:00', 'REGULAR', '0117539404', '0016823085'],
 ['R', 'BMT', '08/27/2016', '04:00:00', 'REGULAR', '0117539411', '0016823085'],
 ['R', 'BMT', '08/27/2016', '08:00:00', 'REGULAR', '0117539416', '0016823085'],
 ['R', 'BMT', '08/27/2016', '12:00:00', 'REGULAR', '0117539453', '0016823088'],
 ['R', 'BMT', '08/27/2016', '16:00:00', 'REGULAR', '0117539491', '0016823090']]

In [26]:
for key in sample_keys:
    print(key, "\n", turnstile_data[key][:5])

('N337', 'R255', '00-00-03', 'BRIARWOOD') 
 [['EF', 'IND', '08/27/2016', '00:00:00', 'REGULAR', '0012047600', '0001588152'], ['EF', 'IND', '08/27/2016', '04:00:00', 'REGULAR', '0012047608', '0001588152'], ['EF', 'IND', '08/27/2016', '08:00:00', 'REGULAR', '0012047626', '0001588152'], ['EF', 'IND', '08/27/2016', '12:00:00', 'REGULAR', '0012047680', '0001588158'], ['EF', 'IND', '08/27/2016', '16:00:00', 'REGULAR', '0012047737', '0001588166']]
('N091', 'R029', '02-05-03', 'CHAMBERS ST') 
 [['ACE23', 'IND', '08/27/2016', '00:00:00', 'REGULAR', '0067126165', '0016780871'], ['ACE23', 'IND', '08/27/2016', '04:00:00', 'REGULAR', '0067126191', '0016780874'], ['ACE23', 'IND', '08/27/2016', '08:00:00', 'REGULAR', '0067126207', '0016780896'], ['ACE23', 'IND', '08/27/2016', '12:00:00', 'REGULAR', '0067126294', '0016780944'], ['ACE23', 'IND', '08/27/2016', '16:00:00', 'REGULAR', '0067126497', '0016780999']]
('R604', 'R108', '03-00-03', 'BOROUGH HALL') 
 [['2345R', 'IRT', '08/27/2016', '00:00:00', 'R

#### Challenge 2

- Let's turn this into a time series.

 For each key (the control area, unit, device address, and station of
 a specific turnstile), keep a list again, but let each item consist
 of just the point in time and the count of entries.

This means keeping only the date, time, and entries fields in each
list. Convert the date and time into `datetime` objects; `datetime` is
a Python class that represents a point in time. You can combine the
date and time fields into a string and use the
[dateutil](https://labix.org/python-dateutil) module to convert it
into a datetime object. For an example, check
[this StackOverflow question](http://stackoverflow.com/questions/23385003/attributeerror-when-using-import-dateutil-and-dateutil-parser-parse-but-no).
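
As a minimal sketch of the conversion described above: when the format is known, the standard-library `datetime.strptime` can parse the combined string, whereas `dateutil.parser.parse` would infer the format instead. The `parse_timestamp` helper is a hypothetical name, and the sample row reuses the Challenge 1 example, assuming the date, time, and entries fields sit at indexes 2, 3, and 5 of each value list.

```python
from datetime import datetime


def parse_timestamp(date_str, time_str):
    """Combine MM/DD/YYYY and HH:MM:SS strings into one datetime (hypothetical helper)."""
    return datetime.strptime(date_str + " " + time_str, "%m/%d/%Y %H:%M:%S")


# Illustrative row in the Challenge 1 value layout:
# LINENAME, DIVISION, DATE, TIME, DESC, ENTRIES, EXITS
row = ['NQR456', 'BMT', '01/03/2015', '03:00:00', 'REGULAR', '0004945474', '0001675324']

timestamp = parse_timestamp(row[2], row[3])
entries = int(row[5])  # int() drops the leading zeros
reading = [timestamp, entries]
print(reading)
```

Spelling out the format string makes a malformed row fail loudly with a `ValueError` rather than being guessed at.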

Your new dict should look something like
```
{    ('A002','R051','02-00-00','LEXINGTON AVE'):
[
[datetime.datetime(2013, 3, 2, 3, 0), 3788],
[datetime.datetime(2013, 3, 2, 7, 0), 2585],
[datetime.datetime(2013, 3, 2, 12, 0), 10653],
[datetime.datetime(2013, 3, 2, 17, 0), 11016],
[datetime.datetime(2013, 3, 2, 23, 0), 10666],
[datetime.datetime(2013, 3, 3, 3, 0), 10814],
[datetime.datetime(2013, 3, 3, 7, 0), 10229],
...
],
....
}
```

In [27]:
turnstile_entry_counts = defaultdict(list)
for key, items in turnstile_data.items():
    for item in items:
        date_hour = datetime.strptime(item[2] + " " + item[3][:-3], "%m/%d/%Y %H:%M")
        turnstile_entry_counts[key].append((date_hour, int(item[-2])))

In [28]:
pprint(list(islice(turnstile_entry_counts.items(), 0, 5)))

[(('N337', 'R255', '00-00-03', 'BRIARWOOD'),
  [(datetime.datetime(2016, 8, 27, 0, 0), 12047600),
   (datetime.datetime(2016, 8, 27, 4, 0), 12047608),
   (datetime.datetime(2016, 8, 27, 8, 0), 12047626),
   (datetime.datetime(2016, 8, 27, 12, 0), 12047680),
   (datetime.datetime(2016, 8, 27, 16, 0), 12047737),
   (datetime.datetime(2016, 8, 27, 20, 0), 12047794),
   (datetime.datetime(2016, 8, 28, 0, 0), 12047808),
   (datetime.datetime(2016, 8, 28, 4, 0), 12047808),
   (datetime.datetime(2016, 8, 28, 8, 0), 12047819),
   (datetime.datetime(2016, 8, 28, 12, 0), 12047880),
   (datetime.datetime(2016, 8, 28, 16, 0), 12047921),
   (datetime.datetime(2016, 8, 28, 20, 0), 12047941),
   (datetime.datetime(2016, 8, 29, 0, 0), 12047951),
   (datetime.datetime(2016, 8, 29, 4, 0), 12047954),
   (datetime.datetime(2016, 8, 29, 8, 0), 12048046),
   (datetime.datetime(2016, 8, 29, 12, 0), 12048187),
   (datetime.datetime(2016, 8, 29, 16, 0), 12048256),
   (datetime.datetime(2016, 8, 29, 20, 0), 120

#### Challenge 3

- These counts are recorded every n hours. (What is n?) We want total
  daily entries.

Keep the same keys, but now store a single value per day: the total
number of passengers that entered through this turnstile on that day.
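
One way to answer "what is n?" is to tally the gaps between consecutive readings of a single turnstile. The sketch below uses a few readings copied from the Challenge 2 output. On the full data other gap sizes may also appear, so the most common value is only a guide.

```python
from collections import Counter
from datetime import datetime

# Illustrative readings in the Challenge 2 layout: (timestamp, cumulative entries)
readings = [
    (datetime(2016, 8, 27, 0, 0), 12047600),
    (datetime(2016, 8, 27, 4, 0), 12047608),
    (datetime(2016, 8, 27, 8, 0), 12047626),
    (datetime(2016, 8, 27, 12, 0), 12047680),
]

# Tally the gap, in hours, between each reading and the next one
gaps = Counter(
    (later[0] - earlier[0]).total_seconds() / 3600
    for earlier, later in zip(readings, readings[1:])
)
most_common_gap, count = gaps.most_common(1)[0]
print(most_common_gap, count)
```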

In [30]:
def get_diff_entry_counts(end_val, start_val):
    """Return the difference between two turnstile counter readings.

    TODO: Account for counter resets (end_val < start_val).
    Currently a pair of readings with end_val < start_val contributes 0.
    """
    if end_val >= start_val:
        return end_val - start_val
    else:
        return 0
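
The TODO above could be addressed in several ways. The sketch below is one heuristic, not the notebook's method: it treats a decreasing counter as one that is counting backwards and takes the absolute difference, then discards jumps above a plausibility limit as likely resets. The name `get_diff_with_reset_guard` and the `max_plausible` threshold of 100,000 are assumptions chosen for illustration.

```python
def get_diff_with_reset_guard(end_val, start_val, max_plausible=100000):
    """Hypothetical variant of get_diff_entry_counts.

    Assumptions (not from the original notebook):
    - a counter that decreases is counting backwards, so use abs();
    - a jump above max_plausible is a reset or glitch, so report 0.
    """
    diff = abs(end_val - start_val)
    if diff > max_plausible:
        return 0
    return diff


print(get_diff_with_reset_guard(5800798, 5800121))  # normal increase
print(get_diff_with_reset_guard(5800121, 5800798))  # counter running backwards
print(get_diff_with_reset_guard(15, 118040215))     # implausible jump, treated as a reset
```

The threshold trades off dropping genuine busy periods against admitting reset artifacts, so it would need tuning against the actual data.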

In [31]:
turnstile_entry_counts[('A007', 'R079', '01-06-01', '5 AV/59 ST')]

[(datetime.datetime(2016, 8, 27, 2, 0), 118040215),
 (datetime.datetime(2016, 8, 27, 6, 0), 118040224),
 (datetime.datetime(2016, 8, 27, 10, 0), 118040265),
 (datetime.datetime(2016, 8, 27, 14, 0), 118040349),
 (datetime.datetime(2016, 8, 27, 18, 0), 118040800),
 (datetime.datetime(2016, 8, 27, 22, 0), 118041253),
 (datetime.datetime(2016, 8, 28, 2, 0), 118041401),
 (datetime.datetime(2016, 8, 28, 6, 0), 118041410),
 (datetime.datetime(2016, 8, 28, 10, 0), 118041443),
 (datetime.datetime(2016, 8, 28, 14, 0), 118041554),
 (datetime.datetime(2016, 8, 28, 18, 0), 118041982),
 (datetime.datetime(2016, 8, 28, 22, 0), 118042495),
 (datetime.datetime(2016, 8, 29, 2, 0), 118042607),
 (datetime.datetime(2016, 8, 29, 6, 0), 118042613),
 (datetime.datetime(2016, 8, 29, 10, 0), 118042684),
 (datetime.datetime(2016, 8, 29, 14, 0), 118042868),
 (datetime.datetime(2016, 8, 29, 18, 0), 118043529),
 (datetime.datetime(2016, 8, 29, 22, 0), 118043975),
 (datetime.datetime(2016, 8, 30, 2, 0), 118044111),


In [55]:
turnstile_last_entry_counts = defaultdict(list)
anomoly_trunstile_entries = defaultdict(list)


# For each turnstile, keep the last cumulative reading of each day.
# A day is emitted only once a reading from the following day is seen.
for key, items in turnstile_entry_counts.items():
    prev_date, prev_entry = items[0]
    prev_date = prev_date.date()
    for date, entry in items[1:]:
        date = date.date()
        if date != prev_date:
            turnstile_last_entry_counts[key].append((prev_date, prev_entry))
            prev_date = date
        prev_entry = entry
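
The loop above emits a day's last reading only when a reading from the next day appears, so the final day in the data never gets a row. As a sketch of an alternative, a dict keyed by date keeps the latest reading for every day, including the last one. The name `last_reading_per_day` and the sample input are illustrative; the readings are copied from this notebook's output for that turnstile.

```python
from collections import defaultdict
from datetime import datetime

# Illustrative input in the Challenge 2 layout: key -> [(timestamp, cumulative entries), ...]
entry_counts = {
    ('A002', 'R051', '02-00-00', '59 ST'): [
        (datetime(2016, 8, 27, 16, 0), 5799833),
        (datetime(2016, 8, 27, 20, 0), 5800121),
        (datetime(2016, 8, 28, 16, 0), 5800572),
        (datetime(2016, 8, 28, 20, 0), 5800798),
    ],
}

last_reading_per_day = defaultdict(list)
for key, items in entry_counts.items():
    latest = {}
    # Sort by timestamp so the last assignment for a date is its latest reading
    for timestamp, entry in sorted(items):
        latest[timestamp.date()] = entry
    last_reading_per_day[key] = sorted(latest.items())

print(dict(last_reading_per_day))
```

Sorting first also guards against rows arriving out of order, for example when the input files are globbed in an arbitrary order.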

In [56]:
list(islice(turnstile_last_entry_counts.items(), 0, 2))

[(('N337', 'R255', '00-00-03', 'BRIARWOOD'),
  [(datetime.date(2016, 8, 27), 12047794),
   (datetime.date(2016, 8, 28), 12047941),
   (datetime.date(2016, 8, 29), 12048289),
   (datetime.date(2016, 8, 30), 12048594),
   (datetime.date(2016, 8, 31), 12048911),
   (datetime.date(2016, 9, 1), 12049221),
   (datetime.date(2016, 9, 2), 12049506),
   (datetime.date(2016, 9, 3), 12049676),
   (datetime.date(2016, 9, 4), 12049821),
   (datetime.date(2016, 9, 5), 12049950),
   (datetime.date(2016, 9, 6), 12050308),
   (datetime.date(2016, 9, 7), 12050669),
   (datetime.date(2016, 9, 8), 12051004),
   (datetime.date(2016, 9, 9), 12051372),
   (datetime.date(2016, 9, 10), 12051540),
   (datetime.date(2016, 9, 11), 12051660),
   (datetime.date(2016, 9, 12), 12052010),
   (datetime.date(2016, 9, 13), 12052359),
   (datetime.date(2016, 9, 14), 12052718),
   (datetime.date(2016, 9, 15), 12053114)]),
 (('N091', 'R029', '02-05-03', 'CHAMBERS ST'),
  [(datetime.date(2016, 8, 27), 67126743),
   (datetime

In [58]:
turnstile_entry_daily_counts = defaultdict(list)

# Daily entries = last reading of the day minus last reading of the previous day
for key, items in turnstile_last_entry_counts.items():
    _, prev_entry = items[0]
    for date, entry in items[1:]:
        turnstile_entry_daily_counts[key].append((date, get_diff_entry_counts(entry, prev_entry)))
        prev_entry = entry

In [61]:
list(islice(turnstile_entry_daily_counts.items(), 0, 2))

[(('N337', 'R255', '00-00-03', 'BRIARWOOD'),
  [(datetime.date(2016, 8, 28), 147),
   (datetime.date(2016, 8, 29), 348),
   (datetime.date(2016, 8, 30), 305),
   (datetime.date(2016, 8, 31), 317),
   (datetime.date(2016, 9, 1), 310),
   (datetime.date(2016, 9, 2), 285),
   (datetime.date(2016, 9, 3), 170),
   (datetime.date(2016, 9, 4), 145),
   (datetime.date(2016, 9, 5), 129),
   (datetime.date(2016, 9, 6), 358),
   (datetime.date(2016, 9, 7), 361),
   (datetime.date(2016, 9, 8), 335),
   (datetime.date(2016, 9, 9), 368),
   (datetime.date(2016, 9, 10), 168),
   (datetime.date(2016, 9, 11), 120),
   (datetime.date(2016, 9, 12), 350),
   (datetime.date(2016, 9, 13), 349),
   (datetime.date(2016, 9, 14), 359),
   (datetime.date(2016, 9, 15), 396)]),
 (('N091', 'R029', '02-05-03', 'CHAMBERS ST'),
  [(datetime.date(2016, 8, 28), 624),
   (datetime.date(2016, 8, 29), 1810),
   (datetime.date(2016, 8, 30), 2243),
   (datetime.date(2016, 8, 31), 2082),
   (datetime.date(2016, 9, 1), 2142),


In [39]:
pprint(list(islice(turnstile_entry_daily_counts.items(), 2)))

[(('N337', 'R255', '00-00-03', 'BRIARWOOD'),
  [(datetime.date(2016, 8, 27), 208),
   (datetime.date(2016, 8, 28), 143),
   (datetime.date(2016, 8, 29), 345),
   (datetime.date(2016, 8, 30), 310),
   (datetime.date(2016, 8, 31), 310),
   (datetime.date(2016, 9, 1), 313),
   (datetime.date(2016, 9, 2), 288),
   (datetime.date(2016, 9, 3), 169),
   (datetime.date(2016, 9, 4), 143),
   (datetime.date(2016, 9, 5), 133),
   (datetime.date(2016, 9, 6), 353),
   (datetime.date(2016, 9, 7), 360),
   (datetime.date(2016, 9, 8), 331),
   (datetime.date(2016, 9, 9), 382),
   (datetime.date(2016, 9, 10), 159),
   (datetime.date(2016, 9, 11), 116),
   (datetime.date(2016, 9, 12), 355),
   (datetime.date(2016, 9, 13), 347),
   (datetime.date(2016, 9, 14), 361),
   (datetime.date(2016, 9, 15), 392)]),
 (('N091', 'R029', '02-05-03', 'CHAMBERS ST'),
  [(datetime.date(2016, 8, 27), 736),
   (datetime.date(2016, 8, 28), 599),
   (datetime.date(2016, 8, 29), 2070),
   (datetime.date(2016, 8, 30), 2188),
 

In [64]:
# Manually verify that the calculation is right

key = ('A002', 'R051', '02-00-00', '59 ST')
pprint(turnstile_entry_counts[key][:20])

# The count for a date should equal the last reading of that date
# minus the last reading of the previous date

pprint(turnstile_entry_daily_counts[key][:2])
# Manual check
pprint(5800798 - 5800121)
pprint(5802336 - 5800798)

[(datetime.datetime(2016, 8, 27, 0, 0), 5799442),
 (datetime.datetime(2016, 8, 27, 4, 0), 5799463),
 (datetime.datetime(2016, 8, 27, 8, 0), 5799492),
 (datetime.datetime(2016, 8, 27, 12, 0), 5799610),
 (datetime.datetime(2016, 8, 27, 16, 0), 5799833),
 (datetime.datetime(2016, 8, 27, 20, 0), 5800121),
 (datetime.datetime(2016, 8, 28, 0, 0), 5800252),
 (datetime.datetime(2016, 8, 28, 4, 0), 5800281),
 (datetime.datetime(2016, 8, 28, 8, 0), 5800295),
 (datetime.datetime(2016, 8, 28, 12, 0), 5800377),
 (datetime.datetime(2016, 8, 28, 16, 0), 5800572),
 (datetime.datetime(2016, 8, 28, 20, 0), 5800798),
 (datetime.datetime(2016, 8, 29, 0, 0), 5800934),
 (datetime.datetime(2016, 8, 29, 4, 0), 5800947),
 (datetime.datetime(2016, 8, 29, 8, 0), 5800996),
 (datetime.datetime(2016, 8, 29, 12, 0), 5801176),
 (datetime.datetime(2016, 8, 29, 16, 0), 5801495),
 (datetime.datetime(2016, 8, 29, 20, 0), 5802336),
 (datetime.datetime(2016, 8, 30, 0, 0), 5802514),
 (datetime.datetime(2016, 8, 30, 4, 0), 5