    Topic:        Challenge Set 1  
    Subject:      Explore MTA turnstile data  
    Date:         07/09/2018  
    Name:         Courtney  
    Worked with:  Tim, Brandon

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

**Challenge 1**

* Open up a new Jupyter notebook
* Download a few MTA turnstile data files
* Open up a file, use csv reader to read it, make a python dict where there is a key for each (C/A, UNIT, SCP, STATION).
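
The csv-reader step in the last bullet can be sketched as below. This is a minimal illustration, not the notebook's own method (the notebook loads the files with pandas instead). The in-memory `sample` text stands in for a downloaded file, the EXITS values are made up for illustration, and the name `turnstile_rows` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical three-row sample in the turnstile file layout.
# The EXITS values are illustrative, not real data.
sample = io.StringIO(
    "C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS\n"
    "A002,R051,02-00-00,59 ST,NQR456,BMT,08/27/2016,00:00:00,REGULAR,5799442,100\n"
    "A002,R051,02-00-00,59 ST,NQR456,BMT,08/27/2016,04:00:00,REGULAR,5799463,104\n"
    "A002,R051,02-00-01,59 ST,NQR456,BMT,08/27/2016,00:00:00,REGULAR,5339729,200\n"
)

reader = csv.reader(sample)
header = [name.strip() for name in next(reader)]  # skip and clean the header row

# One key per (C/A, UNIT, SCP, STATION); the value is the list of remaining fields
turnstile_rows = defaultdict(list)
for row in reader:
    key = tuple(row[:4])
    turnstile_rows[key].append([value.strip() for value in row[4:]])
```

With a real file, `sample` would be replaced by an open file handle, for example `open(data_file, newline='')`.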

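First, create a function to retrieve the weekly data files. The target period is April, May and June 2018; for now the notebook loads three weeks from September 2016 as a smaller test set.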

In [2]:
# Data Source: http://web.mta.info/developers/turnstile.html
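def get_data(weeks):
    """
    Download weekly MTA turnstile files and combine them into one DataFrame.
    ---
    Input: list of ints or strings (YYMMDD week labels), or a single int/string
    Output: pandas DataFrame holding the rows of all requested weeks
    """
    url = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"
    # Wrap a single label in a list so the loop below handles both cases
    if isinstance(weeks, (int, str)):
        weeks = [weeks]
    data = []
    for week in weeks:
        file_url = url.format(week)
        data.append(pd.read_csv(file_url))
    # ignore_index gives the combined frame a unique 0..n-1 index
    return pd.concat(data, ignore_index=True)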

# Target period: April, May and June 2018
# week_labels = [180630, 180623, 180616, 180609, 180602, 180526,
#                180519, 180512, 180505, 180428, 180421, 180414, 180407]
# Test set: three weeks from September 2016
week_labels = [160903, 160910, 160917]
turnstiles_df = get_data(week_labels)
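
The week labels above are typed by hand. Since the labels used in this notebook all fall on Saturdays, they could also be generated. A small sketch under that assumption; `saturday_labels` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd

def saturday_labels(first_saturday, n_weeks):
    """Return YYMMDD strings for n_weeks consecutive Saturdays (hypothetical helper)."""
    # freq="W-SAT" steps through weekly dates anchored on Saturday
    dates = pd.date_range(start=first_saturday, periods=n_weeks, freq="W-SAT")
    return [d.strftime("%y%m%d") for d in dates]

labels = saturday_labels("2016-09-03", 3)
```

The resulting strings can be passed to `get_data` in place of the integer labels, since the URL is built with `str.format`.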

Alternatively, if the file is already saved locally in the same
directory as the Jupyter notebook, uncomment and run the cell below after
assigning the local file name to the variable data_file.

In [None]:
# data_file = 'data_file.txt'
# turnstiles_df = pd.read_csv(data_file)

In [None]:
# turnstiles_df.head()

In [5]:
turnstiles_df.shape

(580895, 11)

**Challenge 2**

In [20]:
turnstiles_df["DATE_TIME"] = pd.to_datetime(turnstiles_df.DATE + " " + turnstiles_df.TIME, format="%m/%d/%Y %H:%M:%S")

In [24]:
turnstiles_df["DATE"] = pd.to_datetime(turnstiles_df.DATE, format="%m/%d/%Y")

In [28]:
# Take an explicit copy so later column assignments do not touch turnstiles_df
time_series_df = turnstiles_df[['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'ENTRIES', 'DATE_TIME', 'DATE']].copy()

In [29]:
time_series_df.head()

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,ENTRIES,DATE_TIME,DATE
0,A002,R051,02-00-00,59 ST,NQR456,5799442,2016-08-27 00:00:00,2016-08-27
1,A002,R051,02-00-00,59 ST,NQR456,5799463,2016-08-27 04:00:00,2016-08-27
2,A002,R051,02-00-00,59 ST,NQR456,5799492,2016-08-27 08:00:00,2016-08-27
3,A002,R051,02-00-00,59 ST,NQR456,5799610,2016-08-27 12:00:00,2016-08-27
4,A002,R051,02-00-00,59 ST,NQR456,5799833,2016-08-27 16:00:00,2016-08-27


In [30]:
time_series_df.dtypes

C/A                  object
UNIT                 object
SCP                  object
STATION              object
LINENAME             object
ENTRIES               int64
DATE_TIME    datetime64[ns]
DATE         datetime64[ns]
dtype: object

**Challenge 3**

How many hours apart are the measurements?  
What is the total number of entries per day for each turnstile?  

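Subtract one minute from every timestamp. A reading taken at 12am carries the count accumulated up to the end of the previous day, but its timestamp places it on the next day; shifting by a minute assigns it to the day it belongs to.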

In [43]:
time_series_df.loc[:,'DATE_TIME'] = time_series_df.loc[:,'DATE_TIME'] - pd.Timedelta(minutes=1)

In [32]:
time_series_df.head()

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,ENTRIES,DATE_TIME,DATE
0,A002,R051,02-00-00,59 ST,NQR456,5799442,2016-08-26 23:59:00,2016-08-27
1,A002,R051,02-00-00,59 ST,NQR456,5799463,2016-08-27 03:59:00,2016-08-27
2,A002,R051,02-00-00,59 ST,NQR456,5799492,2016-08-27 07:59:00,2016-08-27
3,A002,R051,02-00-00,59 ST,NQR456,5799610,2016-08-27 11:59:00,2016-08-27
4,A002,R051,02-00-00,59 ST,NQR456,5799833,2016-08-27 15:59:00,2016-08-27


Recompute the DATE column (YYYY-MM-DD) from the shifted timestamps, so that each midnight reading is assigned to the previous day.

In [39]:
time_series_df.loc[:,'DATE'] = time_series_df.loc[:,'DATE_TIME'].dt.date

In [58]:
time_series_df.head()

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,ENTRIES,DATE_TIME,DATE
0,A002,R051,02-00-00,59 ST,NQR456,5799442,2016-08-26 23:59:00,2016-08-26
1,A002,R051,02-00-00,59 ST,NQR456,5799463,2016-08-27 03:59:00,2016-08-27
2,A002,R051,02-00-00,59 ST,NQR456,5799492,2016-08-27 07:59:00,2016-08-27
3,A002,R051,02-00-00,59 ST,NQR456,5799610,2016-08-27 11:59:00,2016-08-27
4,A002,R051,02-00-00,59 ST,NQR456,5799833,2016-08-27 15:59:00,2016-08-27
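
The effect of the one-minute shift can be checked on a two-row toy frame. This is a minimal sketch; the frame `readings` is made up for the illustration and only mirrors the column names used above.

```python
import pandas as pd

# Two readings: one exactly at midnight, one four hours later
readings = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(["2016-08-27 00:00:00", "2016-08-27 04:00:00"]),
})

# Shift by one minute, then derive the calendar date from the shifted value
readings["DATE_TIME"] = readings["DATE_TIME"] - pd.Timedelta(minutes=1)
readings["DATE"] = readings["DATE_TIME"].dt.date
```

Only the midnight reading changes its date (to 2016-08-26); the 4am reading stays on 2016-08-27. Note that `.dt.date` yields Python `date` objects, so the column's dtype becomes `object` rather than `datetime64[ns]`.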


Let's see what the time intervals are between measurements. 

In [64]:
group_vals = ['C/A', 'UNIT', 'SCP','STATION']

In [67]:
time_intervals = time_series_df.groupby(by=group_vals)['DATE_TIME'].diff()

In [68]:
time_intervals.head()

0        NaT
1   04:00:00
2   04:00:00
3   04:00:00
4   04:00:00
Name: DATE_TIME, dtype: timedelta64[ns]

In [70]:
time_intervals.value_counts(normalize=True).head()

04:00:00    0.923117
04:12:00    0.050484
08:00:00    0.000921
04:26:00    0.000670
00:01:20    0.000278
Name: DATE_TIME, dtype: float64

In this test dataset, about 92% of the intervals between consecutive readings are exactly 4 hours, and another 5% are 4 hours 12 minutes.
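
To make the interval calculation concrete, here is the same idea on a made-up frame with two turnstiles. It is a sketch only: `toy`, `gaps` and `shares` are hypothetical names, and a single grouping column stands in for the four-column key.

```python
import pandas as pd

# Two turnstiles, three readings each; the second one skips a reading
toy = pd.DataFrame({
    "SCP": ["00-00-00"] * 3 + ["00-00-01"] * 3,
    "DATE_TIME": pd.to_datetime([
        "2016-08-27 00:00", "2016-08-27 04:00", "2016-08-27 08:00",
        "2016-08-27 00:00", "2016-08-27 04:00", "2016-08-27 12:00",
    ]),
})

# diff() restarts within each group, so the first row of every turnstile is NaT
gaps = toy.groupby("SCP")["DATE_TIME"].diff()

# value_counts ignores the NaT rows and reports the share of each interval
shares = gaps.value_counts(normalize=True)
```

Here three of the four measurable gaps are 4 hours and one is 8 hours, so `shares` holds 0.75 and 0.25.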

---
Now let's work on the total entries per turnstile per day. 

In [71]:
entry_diff_per_turnstile_interval = time_series_df.groupby(by=group_vals)['ENTRIES'].diff()

**NEXT STEP: ADD THE ENTRY DATA AND TIME INTERVALS TO THE TIME SERIES DF**

In [72]:
entry_diff_per_turnstile_interval.head()

0      NaN
1     21.0
2     29.0
3    118.0
4    223.0
Name: ENTRIES, dtype: float64

In [74]:
# entry_diff_per_turnstile_interval is a Series, so filter it directly
entry_diff_per_turnstile_interval[entry_diff_per_turnstile_interval < 0].head()

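Attach the per-interval entry counts to the DataFrame, then remove rows with missing values (the first reading of each turnstile, which has no prior reading to compare against).

In [None]:
time_series_df['entry_per_interval'] = entry_diff_per_turnstile_interval
time_series = time_series_df.dropna(subset=['entry_per_interval'])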

In [None]:
grouping = ['C/A', 'UNIT', 'SCP', 'STATION', 'DATE']

In [None]:
daily_entries_per_turnstile = time_series.groupby(by=grouping)['entry_per_interval'].sum().reset_index()

In [None]:
daily_entries_per_turnstile.rename(columns={'entry_per_interval':'DAILY_ENTRIES'}, inplace=True)

In [None]:
daily_entries_per_turnstile.head()
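
The daily-total steps above can be tried end to end on a small made-up frame. This sketch adds one step the notebook has not settled on: it drops negative differences, on the assumption that they come from counter resets. The names `toy`, `clean` and `daily`, and all values, are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "SCP": ["00-00-00"] * 4 + ["00-00-01"] * 3,
    "DATE": ["2016-08-27", "2016-08-27", "2016-08-28", "2016-08-28",
             "2016-08-27", "2016-08-27", "2016-08-27"],
    # Cumulative counters; the last value mimics a counter reset
    "ENTRIES": [100, 130, 150, 210, 900, 950, 20],
})

# Entries per interval: difference to the previous reading of the same turnstile
toy["entry_per_interval"] = toy.groupby("SCP")["ENTRIES"].diff()

# Drop each turnstile's first reading (NaN) and, as an assumption, negative diffs
clean = toy.dropna(subset=["entry_per_interval"])
clean = clean[clean["entry_per_interval"] >= 0]

# Sum the interval counts per turnstile and day
daily = (clean.groupby(["SCP", "DATE"])["entry_per_interval"]
              .sum()
              .reset_index()
              .rename(columns={"entry_per_interval": "DAILY_ENTRIES"}))
```

Whether negative differences should be dropped, clipped or corrected is a judgment call that depends on what the real data shows.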

**Challenge 4**