    Topic:        Challenge Set 1  
    Subject:      Explore MTA turnstile data  
    Date:         07/09/2018  
    Name:         Courtney  
    Worked with:  Tim, Brandon

In [54]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

**Challenge 1**

* Open up a new Jupyter notebook
* Download a few MTA turnstile data files
* Open up a file, use csv reader to read it, make a python dict where there is a key for each (C/A, UNIT, SCP, STATION).

First, create a function to retrieve the necessary data from April, May and June 2018.

In [55]:
# Data Source: http://web.mta.info/developers/turnstile.html
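def get_data(weeks):
    """
    Download the weekly turnstile files and combine them into one DataFrame.
    ---
    Input: list of ints or strings (YYMMDD week labels), or single int/string
    Output: DataFrame with all requested weeks concatenated
    """
    # Wrap a single label in a list so the loop below handles both cases
    if isinstance(weeks, (int, str)):
        weeks = [weeks]
    url = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"
    data = []
    for week in weeks:
        file_url = url.format(week)
        data.append(pd.read_csv(file_url))
    # ignore_index gives the combined frame a unique 0..n-1 index
    return pd.concat(data, ignore_index=True)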

# Pull data from April, May and June 2018
week_labels = [180630, 180623, 180616, 180609, 180602, 180526, 
               180519, 180512, 180505, 180428, 180421, 180414, 180407]
turnstiles_df = get_data(week_labels)

Alternatively, if the file is already saved locally in the same
directory as the Jupyter notebook, uncomment and run the cell below after
assigning the local file name to the variable `data_file`.

In [60]:
# data_file = 'data_file.txt'
# turnstiles_df = pd.read_csv(data_file)

In [61]:
# turnstiles_df.head()
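
Challenge 1 asks for a plain Python dict keyed by (C/A, UNIT, SCP, STATION), while the cells above load the data with pandas instead. The block below is a minimal sketch of the csv-reader version. The function name `rows_by_turnstile` and the two in-memory sample rows are illustrative assumptions (the EXITS values are made up), and it assumes the four key columns come first in each row.

```python
import csv
import io
from collections import defaultdict

# Illustrative sample in the turnstile file layout; EXITS values are made up.
sample = io.StringIO(
    "C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS\n"
    "A002,R051,02-00-00,59 ST,NQR456W,BMT,06/23/2018,00:00:00,REGULAR,6667150,2259901\n"
    "A002,R051,02-00-00,59 ST,NQR456W,BMT,06/23/2018,04:00:00,REGULAR,6667173,2259909\n"
)


def rows_by_turnstile(handle):
    """Group rows by (C/A, UNIT, SCP, STATION); hypothetical helper name."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    by_turnstile = defaultdict(list)
    for row in reader:
        key = tuple(row[:4])              # first four columns identify a turnstile
        by_turnstile[key].append(row[4:])  # keep the remaining columns as the value
    return dict(by_turnstile)


turnstile_rows = rows_by_turnstile(sample)
```

With a downloaded file, the same function could be called on `open(data_file, newline='')` instead of the in-memory sample.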

**Challenge 2**

In [35]:
test_data = turnstiles_df.iloc[0:10000]
test_data.shape

(10000, 11)

In [36]:
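# .copy() makes time_series independent of test_data, so adding columns below does not trigger SettingWithCopyWarning
time_series = test_data[['C/A', 'UNIT', 'SCP', 'STATION', 'ENTRIES']].copy()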

In [37]:
combined_date = test_data.loc[:,'DATE'] + ' ' + test_data.loc[:, 'TIME']

In [38]:
combined_date = pd.to_datetime(combined_date)

In [39]:
time_series['DATE_TIME'] = combined_date

In [40]:
time_series['DATE'] = pd.to_datetime(test_data.loc[:,'DATE'])

In [41]:
print(time_series.head())

    C/A  UNIT       SCP STATION  ENTRIES           DATE_TIME       DATE
0  A002  R051  02-00-00   59 ST  6667150 2018-06-23 00:00:00 2018-06-23
1  A002  R051  02-00-00   59 ST  6667173 2018-06-23 04:00:00 2018-06-23
2  A002  R051  02-00-00   59 ST  6667189 2018-06-23 08:00:00 2018-06-23
3  A002  R051  02-00-00   59 ST  6667305 2018-06-23 12:00:00 2018-06-23
4  A002  R051  02-00-00   59 ST  6667534 2018-06-23 16:00:00 2018-06-23


In [42]:
time_series.dtypes

C/A                  object
UNIT                 object
SCP                  object
STATION              object
ENTRIES               int64
DATE_TIME    datetime64[ns]
DATE         datetime64[ns]
dtype: object

**Challenge 3**

How many hours apart are the measurements?  
What is the total number of entries per day for each turnstile?  

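A reading stamped 12:00 am covers the last four hours of the previous day, but its date is already the next day. Subtract one minute from every timestamp so that each midnight reading is assigned to the day it actually measures.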
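
As a quick check of the idea, here is a small sketch on two hand-written timestamps (not taken from the dataset): only the midnight reading changes its calendar date after the one-minute shift.

```python
import pandas as pd

# Two hand-written timestamps: one at midnight, one at 4 am
stamps = pd.Series(pd.to_datetime(["2018-06-23 00:00:00", "2018-06-23 04:00:00"]))

# Shift every reading back by one minute
shifted = stamps - pd.Timedelta(minutes=1)

# The midnight reading now falls on the previous day; the 4 am reading keeps its date
shifted_dates = shifted.dt.date
print(shifted_dates)
```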

In [43]:
time_series.loc[:,'DATE_TIME'] = time_series.loc[:,'DATE_TIME'] - pd.Timedelta(minutes=1)

In [44]:
time_series.head()

    C/A  UNIT       SCP STATION  ENTRIES           DATE_TIME       DATE
0  A002  R051  02-00-00   59 ST  6667150 2018-06-22 23:59:00 2018-06-23
1  A002  R051  02-00-00   59 ST  6667173 2018-06-23 03:59:00 2018-06-23
2  A002  R051  02-00-00   59 ST  6667189 2018-06-23 07:59:00 2018-06-23
3  A002  R051  02-00-00   59 ST  6667305 2018-06-23 11:59:00 2018-06-23
4  A002  R051  02-00-00   59 ST  6667534 2018-06-23 15:59:00 2018-06-23


Create a new Date column with the YYYY-MM-DD format based on the dates after subtracting 1 minute. 

In [45]:
time_series.loc[:,'DATE'] = time_series.loc[:,'DATE_TIME'].dt.date

In [46]:
time_series.head()

    C/A  UNIT       SCP STATION  ENTRIES           DATE_TIME        DATE
0  A002  R051  02-00-00   59 ST  6667150 2018-06-22 23:59:00  2018-06-22
1  A002  R051  02-00-00   59 ST  6667173 2018-06-23 03:59:00  2018-06-23
2  A002  R051  02-00-00   59 ST  6667189 2018-06-23 07:59:00  2018-06-23
3  A002  R051  02-00-00   59 ST  6667305 2018-06-23 11:59:00  2018-06-23
4  A002  R051  02-00-00   59 ST  6667534 2018-06-23 15:59:00  2018-06-23


Let's see what the time intervals are between measurements. 

In [47]:
group_vals = ['C/A', 'UNIT', 'SCP','STATION']

In [48]:
turnstile_data = time_series.groupby(by=group_vals)

In [49]:
time_diff = turnstile_data['DATE_TIME'].diff()

In [50]:
time_diff.value_counts(normalize=True)

04:00:00    0.997029
08:00:00    0.001742
00:15:33    0.000615
03:44:27    0.000615
Name: DATE_TIME, dtype: float64

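In the test dataset, about 99.7% of the intervals between consecutive readings are exactly 4 hours. About 0.2% are 8-hour gaps, which would be consistent with a skipped reading. The two irregular values, 00:15:33 and 03:44:27, add up to 4 hours, which would be consistent with an extra off-schedule reading splitting a normal interval.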
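
To make the grouped `diff()` easier to follow, here is a small sketch on hand-written toy data (two turnstiles identified only by `SCP`, which is a simplification of the four-column key used above). Each difference is taken within a turnstile, so the first reading of every group has no predecessor and comes back as `NaT`.

```python
import pandas as pd

# Toy readings for two turnstiles; values are hand-written, not from the dataset
toy = pd.DataFrame({
    "SCP": ["02-00-00", "02-00-00", "02-00-00", "02-00-01", "02-00-01"],
    "DATE_TIME": pd.to_datetime([
        "2018-06-23 00:00", "2018-06-23 04:00", "2018-06-23 12:00",
        "2018-06-23 00:00", "2018-06-23 04:00",
    ]),
})

# Difference between consecutive readings, computed separately per turnstile
gaps = toy.groupby("SCP")["DATE_TIME"].diff()

# Convert the timedeltas to hours; the first reading of each group stays missing
gap_hours = gaps.dt.total_seconds() / 3600
print(gap_hours)
```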

---
Now let's work on the total entries per turnstile per day. 

In [51]:
entry_diff_per_turnstile_interval = time_series.groupby(by=group_vals)['ENTRIES'].diff()

In [52]:
time_series['entry_per_interval'] = entry_diff_per_turnstile_interval

In [30]:
time_series.iloc[0:10]

     C/A  UNIT       SCP  STATION  ENTRIES            DATE_TIME        DATE  entry_per_interval
 1  A002  R051  02-00-00    59 ST  6667173  2018-06-23 03:59:00  2018-06-23                23.0
 2  A002  R051  02-00-00    59 ST  6667189  2018-06-23 07:59:00  2018-06-23                16.0
 3  A002  R051  02-00-00    59 ST  6667305  2018-06-23 11:59:00  2018-06-23               116.0
 4  A002  R051  02-00-00    59 ST  6667534  2018-06-23 15:59:00  2018-06-23               229.0
 5  A002  R051  02-00-00    59 ST  6667819  2018-06-23 19:59:00  2018-06-23               285.0
 6  A002  R051  02-00-00    59 ST  6667980  2018-06-23 23:59:00  2018-06-23               161.0
 7  A002  R051  02-00-00    59 ST  6667999  2018-06-24 03:59:00  2018-06-24                19.0
 8  A002  R051  02-00-00    59 ST  6668012  2018-06-24 07:59:00  2018-06-24                13.0
 9  A002  R051  02-00-00    59 ST  6668092  2018-06-24 11:59:00  2018-06-24                80.0
10  A002  R051  02-00-00    59 ST  6668269  2018-06-24 15:59:00  2018-06-24               177.0


In [53]:
time_series[time_series['entry_per_interval'] < 0].head()

       C/A  UNIT       SCP     STATION    ENTRIES            DATE_TIME        DATE  entry_per_interval
1422  A011  R080  01-00-00  57 ST-7 AV  885910226  2018-06-23 03:59:00  2018-06-23               -10.0
1423  A011  R080  01-00-00  57 ST-7 AV  885910216  2018-06-23 07:59:00  2018-06-23               -10.0
1424  A011  R080  01-00-00  57 ST-7 AV  885910184  2018-06-23 11:59:00  2018-06-23               -32.0
1425  A011  R080  01-00-00  57 ST-7 AV  885910124  2018-06-23 15:59:00  2018-06-23               -60.0
1426  A011  R080  01-00-00  57 ST-7 AV  885910023  2018-06-23 19:59:00  2018-06-23              -101.0
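
The negative values above mean the ENTRIES counter for this turnstile decreases over time, so summing the raw differences would give negative daily totals. One possible treatment, sketched below, is to take the absolute value of each difference and blank out implausibly large jumps. This is an assumption about how to read the data, not something the notebook verifies; the helper name `clean_interval_counts`, the 10,000 cutoff, and the last sample value are all hypothetical.

```python
import pandas as pd

# Assumed upper bound on plausible entries per reading interval (hypothetical)
MAX_PER_INTERVAL = 10_000


def clean_interval_counts(diffs, max_per_interval=MAX_PER_INTERVAL):
    """Treat reversed counters as positive counts and drop implausible jumps."""
    counts = diffs.abs()                            # reversed counter -> positive count
    return counts.where(counts <= max_per_interval)  # larger jumps become NaN


# First three values resemble the rows above; the last is a made-up reset-sized jump
raw = pd.Series([-10.0, -32.0, 229.0, -885910226.0])
cleaned = clean_interval_counts(raw)
print(cleaned)
```

If this treatment were adopted, it would be applied to `entry_per_interval` before the daily sum, so that the resulting `NaN` values are skipped by `sum()`.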


Remove rows with a missing interval count. The first reading for each turnstile has no earlier reading to subtract from, so its difference is NaN.

In [24]:
time_series = time_series.dropna(subset=['entry_per_interval'])

In [25]:
grouping = ['C/A', 'UNIT', 'SCP', 'STATION', 'DATE']

In [26]:
daily_entries_per_turnstile = time_series.groupby(by=grouping)['entry_per_interval'].sum().reset_index()

In [27]:
daily_entries_per_turnstile.rename(columns={'entry_per_interval':'DAILY_ENTRIES'}, inplace=True)

In [29]:
daily_entries_per_turnstile.head()

    C/A  UNIT       SCP  STATION        DATE  DAILY_ENTRIES
0  A002  R051  02-00-00    59 ST  2018-06-23          830.0
1  A002  R051  02-00-00    59 ST  2018-06-24          606.0
2  A002  R051  02-00-00    59 ST  2018-06-25         1378.0
3  A002  R051  02-00-00    59 ST  2018-06-26         1234.0
4  A002  R051  02-00-00    59 ST  2018-06-27         1481.0


**Challenge 4**