# Subway Data

In [1]:
import csv
import datetime
import os
import pandas as pd
import numpy as np

from sets import Set
from dateutil.parser import parse
import matplotlib.pyplot as plt

import requests
import urllib2

from bs4 import BeautifulSoup as bs

%matplotlib inline
pd.options.display.mpl_style = 'default'

## Single File

### Quick Overview

First, I'm going to load in one older file that I've manually downloaded from the MTA website at http://web.mta.info/developers/turnstile.html. Once I have a sense of one subset of the data and how to clean it, I'll write a script to download all of the data files, then clean and combine them.

In [73]:
filename = "data/turnstile/turnstile_110528.txt"
df_turnstile = pd.read_csv(filename)
df_turnstile.head(2)

Unnamed: 0,A002,R051,02-00-00,05-21-11,00:00:00,REGULAR,003169391,001097585,05-21-11.1,04:00:00,...,05-22-11,00:00:00.1,REGULAR.6,003170119,001097792,05-22-11.1,04:00:00.1,REGULAR.7,003170146,001097801
0,A002,R051,02-00-00,05-22-11,08:00:00,REGULAR,3170164,1097820,05-22-11,12:00:00,...,05-23-11,08:00:00,REGULAR,3170746,1098069,05-23-11,12:00:00,REGULAR,3170897,1098378
1,A002,R051,02-00-00,05-23-11,16:00:00,REGULAR,3171194,1098447,05-23-11,20:00:00,...,05-24-11,16:00:00,REGULAR,3172689,1099010,05-24-11,20:00:00,REGULAR,3173590,1099055


In [78]:
df_turnstile.shape

(998, 43)

There are 998 rows and an overwhelming 43 columns in this dataset. The file was written so that multiple observations share the same row, which is one reason the MTA data is notoriously difficult to work with. As adaptable as pandas is, it can't infer this kind of layout and correct it, so I'll have to restructure the data myself.

Looking at the time and date columns, each file appears to span one week at 4-hour intervals. With eight observations per row, the time difference between consecutive rows is 32 hours.

The MTA website labels the data as follows: 'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'. The first 3 columns are identification data. Next comes a block of 5 columns (date, time, type of report, entry count and exit count), which repeats 8 times! This figure lines up with the shape of our dataframe: 3 + (5 x 8) = 43. To restructure the file, the elements after the identifiers should be chopped from each original row 5 at a time and written into a new file as separate rows.
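To make the layout concrete, here is a minimal sketch of that chopping step on a single hand-written row. The helper name `split_wide_row` is hypothetical, and the sample row is shortened to two observations (real rows carry eight).

```python
def split_wide_row(row, id_width=3, block_width=5):
    """Split one wide turnstile row into one row per observation."""
    ids, rest = row[:id_width], row[id_width:]
    # Step through the remaining fields one block at a time,
    # ignoring any incomplete block at the end of the row.
    return [ids + rest[i:i + block_width]
            for i in range(0, len(rest) - block_width + 1, block_width)]


# Shortened sample row: 3 identifiers followed by 2 blocks of 5 fields.
wide = ['A002', 'R051', '02-00-00',
        '05-28-11', '00:00:00', 'REGULAR', '003178521', '001100739',
        '05-28-11', '04:00:00', 'REGULAR', '003178541', '001100746']

long_rows = split_wide_row(wide)
for row in long_rows:
    print(','.join(row))
```

Each printed line repeats the three identifiers and carries exactly one observation, which is the shape the rest of the analysis needs.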



### Restructuring the data

In [20]:
def fix_turnstile_data(filepaths):
    '''
    filepaths is a list of paths to MTA Subway turnstile text files. A link to an example
    MTA Subway turnstile text file can be seen at the URL below:
    http://web.mta.info/developers/data/nyct/turnstile/turnstile_110507.txt

    Each row of the text file contains several observations.

    This function rewrites each file so that there is only one observation per row.
    A few examples below:
    A002,R051,02-00-00,05-28-11,00:00:00,REGULAR,003178521,001100739
    A002,R051,02-00-00,05-28-11,04:00:00,REGULAR,003178541,001100746
    A002,R051,02-00-00,05-28-11,08:00:00,REGULAR,003178559,001100775

    Each rewritten file is saved under its original filename in a new directory.
    '''

    for file in filepaths:

        # Take the filename from the path, whatever the directory depth.
        filename = os.path.basename(file)

        # Read the file into memory.
        r = csv.reader(open(file, 'rb'))

        # Prepare the output directory.
        newpath = "data/turnstile/updated_data"
        #newpath = 'data/turnstile'
        
        if not os.path.exists(newpath): 
            os.makedirs(newpath)

        # Create the output file in the new directory. Overwrite the file if it exists already(wb).
        w = csv.writer(open("{0}/{1}".format(newpath, filename), 'wb'))

        # Write the header row, taken from the mta website.
        w.writerow(['C/A', 'UNIT', 'SCP', 'DATEn', 'TIMEn', 'DESCn', 'ENTRIESn', 'EXITSn'])

        # Loop through the output from the CSV reader a line at a time.
        for line in r:

            # Parse out the elements, and remove them from the row.
            ca = line.pop(0)
            unit = line.pop(0)
            scp = line.pop(0)

            # While there is still new data, parse it.
            while len(line) >= 5:

                # Take the first 5 elements and remove them.
                block, line = line[:5], line[5:]

                # Output the new row.
                w.writerow([ca, unit, scp] + block)

As it turns out, the MTA have been overhauling their data collection. Their website, API and even the data recording method have changed. The new files are all structured nicely, meaning that the previous work was unnecessary. Oh well, it was good data munging practice at least!

In [2]:
# Load in a new, recent record.
newfile = 'data/turnstile/150502.txt'

#df = pd.read_csv(newfile, sep=r"\s+")
#df = pd.read_csv(newfile, skipinitialspace=True)
df = pd.read_csv(newfile)
df.head(2)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,04/25/2015,00:00:00,REGULAR,5106770,1729635
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,04/25/2015,04:00:00,REGULAR,5106810,1729649


In [3]:
df.columns.values

array(['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE',
       'TIME', 'DESC', 'ENTRIES',
       'EXITS                                                               '], dtype=object)

There's trailing whitespace after EXITS, which I need to deal with manually.

In [4]:
df.columns = ['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', 'TIME', 'DESC', 'ENTRIES', 'EXITS']
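An alternative that avoids retyping every column name is to strip the whitespace from the headers programmatically. This is only a sketch on a toy frame (`df_demo` is a made-up name, and the number of trailing spaces is arbitrary), not what the notebook does above.

```python
import pandas as pd

# Toy frame whose last column name carries trailing spaces, as in the MTA file.
df_demo = pd.DataFrame({'ENTRIES': [5106770], 'EXITS          ': [1729635]})

# Strip leading and trailing whitespace from every column name in one step.
df_demo.columns = df_demo.columns.str.strip()
print(list(df_demo.columns))
```

This keeps working if the MTA adds or reorders columns, whereas the hard-coded list would silently mislabel them.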

### Observations not on the Hour

In [5]:
filter_onhour = [((pd.to_datetime(df.TIME[n])).minute != 0) for n in range(len(df))]

In [6]:
df[filter_onhour].head(2)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
1707,A013,R081,01-00-00,49 ST-7 AVE,NQR,BMT,04/29/2015,08:12:40,REGULAR,5270306,30939048
1750,A013,R081,01-03-00,49 ST-7 AVE,NQR,BMT,04/29/2015,08:12:40,REGULAR,2911534,3345354


In [7]:
df.loc[[1706, 1707, 1708]]

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
1706,A013,R081,01-00-00,49 ST-7 AVE,NQR,BMT,04/29/2015,08:00:00,REGULAR,5270301,30938948
1707,A013,R081,01-00-00,49 ST-7 AVE,NQR,BMT,04/29/2015,08:12:40,REGULAR,5270306,30939048
1708,A013,R081,01-00-00,49 ST-7 AVE,NQR,BMT,04/29/2015,12:00:00,REGULAR,5270372,30940882


In [8]:
df2 = df.drop(list(df.loc[filter_onhour].index))

In [9]:
df2 = df2.reset_index()

In [10]:
df = df2
del df['index']

In [11]:
df.loc[[1706]]

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
1706,A013,R081,01-00-00,49 ST-7 AVE,NQR,BMT,04/29/2015,08:00:00,REGULAR,5270301,30938948
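The cells in this section parse one timestamp per row in a Python loop and then drop, reset and delete the index in separate steps. The same clean-up can be sketched in a vectorized form, shown here on a three-row toy frame (`df_demo`, `off_hour` and `df_on_hour` are illustrative names, not part of the notebook).

```python
import pandas as pd

# Three readings for one turnstile; the middle one is off the hour.
df_demo = pd.DataFrame({
    'TIME': ['08:00:00', '08:12:40', '12:00:00'],
    'ENTRIES': [5270301, 5270306, 5270372],
})

# Parse the whole column at once and flag rows whose minute is not 0.
off_hour = pd.to_datetime(df_demo['TIME'], format='%H:%M:%S').dt.minute != 0

# Keep the on-the-hour rows and renumber them from 0 in one step.
df_on_hour = df_demo[~off_hour].reset_index(drop=True)
print(df_on_hour)
```

Passing `drop=True` to `reset_index` avoids creating the extra `index` column that otherwise has to be deleted afterwards.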


### Entries and Exits

The original data set gives running totals for each turnstile instead of the number of entries or exits in each interval, so to get anything useful out of it, I'll need to do some subtraction: subtract the previous reading's tally from the tally at each timestamp.
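As a small worked example of that subtraction, the sketch below uses four consecutive ENTRIES readings for turnstile A002 / R051 / 02-00-00 on 05/01/2015, taken from this data set. The variable names are illustrative.

```python
# Cumulative ENTRIES readings for one turnstile, four hours apart.
readings = [5115461, 5115480, 5115524, 5115678]

# Entries in each interval = current reading minus the previous reading.
per_interval = [curr - prev for prev, curr in zip(readings, readings[1:])]
print(per_interval)  # [19, 44, 154]
```

Four cumulative readings give three interval counts; the first reading has no predecessor, which is why it needs special treatment later on.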

In [113]:
df[df.ENTRIES == 0].head(5)

NameError: name 'df' is not defined

#### Taking One Day

In [198]:
df1 = df[df.DATE == '05/01/2015']

In [114]:
df1.head(2)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,04/25/2015,00:00:00,REGULAR,5106770,1729635
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,04/25/2015,04:00:00,REGULAR,5106810,1729649


From the documentation on [MTA](http://web.mta.info/developers/resources/nyct/turnstile/ts_Field_Description.txt), here's what the variables mean:



````
======================================
Field Description

C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS


C/A      = Control Area (A002) - Booth
UNIT     = Remote Unit for a station (R051)
SCP      = Subunit Channel Position represents an specific address for a device (02-00-00)
STATION  = Represents the station name the device is located at
LINENAME = Represents all train lines that can be boarded at this station
           Normally lines are represented by one character.  LINENAME 456NQR repersents train server for 4, 5, 6, N, Q, and R trains.
DIVISION = Represents the Line originally the station belonged to BMT, IRT, or IND   
DATE     = Represents the date (MM-DD-YY)
TIME     = Represents the time (hh:mm:ss) for a scheduled audit event
DESc     = Represent the "REGULAR" scheduled audit event (Normally occurs every 4 hours)
           1. Audits may occur more that 4 hours due to planning, or troubleshooting activities. 
           2. Additionally, there may be a "RECOVR AUD" entry: This refers to a missed audit that was recovered. 
ENTRIES  = The comulative entry register value for a device
EXIST    = The cumulative exit register value for a device



Example:
The data below shows the entry/exit register values for one turnstile at control area (A002) from 09/27/14 at 00:00 hours to 09/29/14 at 00:00 hours


C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
A002,R051,02-00-00,LEXINGTON AVE,456NQR,BMT,09-27-14,00:00:00,REGULAR,0004800073,0001629137,
A002,R051,02-00-00,LEXINGTON AVE,456NQR,BMT,09-27-14,04:00:00,REGULAR,0004800125,0001629149,
A002,R051,02-00-00,LEXINGTON AVE,456NQR,BMT,09-27-14,08:00:00,REGULAR,0004800146,0001629162,
A002,R051,02-00-00,LEXINGTON AVE,456NQR,BMT,09-27-14,12:00:00,REGULAR,0004800264,0001629264,
A002,R051,02-00-00,LEXINGTON AVE,456NQR,BMT,09-27-14,16:00:00,REGULAR,0004800523,0001629328,
A002,R051,02-00-00,LEXINGTON AVE,456NQR,BMT,09-27-14,20:00:00,REGULAR,0004800924,0001629371,
A002,R051,02-00-00,LEXINGTON AVE,456NQR,BMT,09-28-14,00:00:00,REGULAR,0004801104,0001629395,
A002,R051,02-00-00,LEXINGTON AVE,456NQR,BMT,09-28-14,04:00:00,REGULAR,0004801149,0001629402,
A002,R051,02-00-00,LEXINGTON AVE,456NQR,BMT,09-28-14,08:00:00,REGULAR,0004801168,0001629414,
A002,R051,02-00-00,LEXINGTON AVE,456NQR,BMT,09-28-14,12:00:00,REGULAR,0004801304,0001629463,
A002,R051,02-00-00,LEXINGTON AVE,456NQR,BMT,09-28-14,16:00:00,REGULAR,0004801463,0001629521,
A002,R051,02-00-00,LEXINGTON AVE,456NQR,BMT,09-28-14,20:00:00,REGULAR,0004801737,0001629555,
A002,R051,02-00-00,LEXINGTON AVE,456NQR,BMT,09-29-14,00:00:00,REGULAR,0004801836,0001629574,
======================================
````

In [50]:
df1.groupby(['C/A', 'UNIT', 'SCP']).first()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
C/A,UNIT,SCP,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,5115461,1732389
A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,4738746,1034160
A002,R051,02-03-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,488569,1842807
A002,R051,02-03-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,4684600,7368230
A002,R051,02-03-02,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,4416403,6080309
A002,R051,02-03-03,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,4055563,4949206
A002,R051,02-03-04,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,5145422,2905435
A002,R051,02-03-05,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,8372370,1198590
A002,R051,02-03-06,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,6528438,462214
A002,R051,02-05-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,1156,0


It seems that C/A, UNIT and SCP are the three identifiers that together pick out a single turnstile, so I need to run through every combination of them. For each turnstile I want "hourly" entries and exits that start at 0 with the first reading of the day (which falls anywhere between 00:00:00 and 03:00:00) and are then updated every 4 hours by the difference between consecutive cumulative values.

In [52]:
filter_first = (df1['C/A'] == 'A002') & (df1['UNIT'] == 'R051') & (df1['SCP'] == '02-00-00')

In [53]:
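df10 = df1[filter_first]
# Note: the next line resets df1, not the filtered df10, so the filter is discarded
# and every turnstile for the day is kept (hence the second turnstile in the output below).
df10 = df1.reset_index()
df10 = df10.drop('index', axis=1)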

In [54]:
df10['ENTRIES_hourly'] = df10['ENTRIES'] - df10['ENTRIES'].shift(1)
df10['ENTRIES_hourly'] = df10['ENTRIES_hourly'].fillna(0)

df10['EXITS_hourly'] = df10['EXITS'] - df10['EXITS'].shift(1)
df10['EXITS_hourly'] = df10['EXITS_hourly'].fillna(0)

In [55]:
df10.head(10)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS,ENTRIES_hourly,EXITS_hourly
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,5115461,1732389,0,0
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,04:00:00,REGULAR,5115480,1732394,19,5
2,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,08:00:00,REGULAR,5115524,1732482,44,88
3,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,12:00:00,REGULAR,5115678,1732624,154,142
4,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,16:00:00,REGULAR,5115998,1732647,320,23
5,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,20:00:00,REGULAR,5116883,1732666,885,19
6,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,4738746,1034160,-378137,-698506
7,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,04:00:00,REGULAR,4738761,1034161,15,1
8,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,08:00:00,REGULAR,4738804,1034199,43,38
9,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,12:00:00,REGULAR,4738969,1034274,165,75


OK, this simple solution works for one turnstile on one date. However, as soon as the rows move on to a new turnstile (row 6 above, where the difference jumps to -378137) or a new date, things begin to break down.

I haven't been able to find an elegant vectorized solution for this problem (perhaps it's worth investigating the "apply" function more) so I've made my own function to do the job.
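For comparison, one possible vectorized approach is to group by the three identifiers and take differences within each group. This is only a sketch under stated assumptions, not the method used in the rest of this notebook: it assumes the rows are already in chronological order within each turnstile, and it zeroes the first reading of every turnstile in the frame rather than the first reading of a particular date. `df_demo` is a toy frame built from values in this data set.

```python
import pandas as pd

# Two readings each for two turnstiles at the same booth.
df_demo = pd.DataFrame({
    'C/A': ['A002'] * 4,
    'UNIT': ['R051'] * 4,
    'SCP': ['02-00-00', '02-00-00', '02-00-01', '02-00-01'],
    'ENTRIES': [5115461, 5115480, 4738746, 4738761],
    'EXITS': [1732389, 1732394, 1034160, 1034161],
})

grouped = df_demo.groupby(['C/A', 'UNIT', 'SCP'])

# diff() restarts inside every group, so the first reading of each
# turnstile comes out as NaN, which is then replaced by 0.
df_demo['HOURLY_ENTRIES'] = grouped['ENTRIES'].diff().fillna(0)
df_demo['HOURLY_EXITS'] = grouped['EXITS'].diff().fillna(0)
print(df_demo[['SCP', 'HOURLY_ENTRIES', 'HOURLY_EXITS']])
```

Because the difference never crosses a group boundary, the large negative jump between turnstiles seen above does not appear.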

In [2]:
# Load in a new, recent record.
newfile = 'data/turnstile/150502.txt'

df1 = pd.read_csv(newfile)
df1.columns = ['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', 'TIME', 'DESC', 'ENTRIES', 'EXITS']
df1.head(2)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,04/25/2015,00:00:00,REGULAR,5106770,1729635
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,04/25/2015,04:00:00,REGULAR,5106810,1729649


In [3]:
newfile = 'data/turnstile/150509.txt'
df2 = pd.read_csv(newfile)
df2.columns = ['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', 'TIME', 'DESC', 'ENTRIES', 'EXITS']
df2.head(2)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/02/2015,00:00:00,REGULAR,5117130,1732680
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/02/2015,04:00:00,REGULAR,5117157,1732685



For each turnstile's first entry on May 1st I need to set its entries and exits to 0. Then I calculate the rest of the entries and exits for that day by taking the difference between consecutive readings. Finally, for the remaining days I need to look up the count from the last reading on the previous day.

In [4]:
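def cumulative_to_hourly(df):
    """
    Adds HOURLY_ENTRIES and HOURLY_EXITS columns to df, modifying it in place.

    The first reading of each turnstile on 05/01/2015 is set to 0. Every other
    row gets the difference between its cumulative count and the previous
    reading for the same turnstile (C/A, UNIT, SCP), so the rows must already
    be in chronological order and the index must be unique.
    """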
    
    
    for i, entry in df.iterrows():
        id_ = entry['C/A'], entry['UNIT'], entry['SCP']
        #print entry
        
        
        #If first entry of May 1st.
        
        starting_time = ['00:00:00', '01:00:00', '02:00:00', '03:00:00']
        
        if (entry['DATE'] == '05/01/2015' and (entry['TIME'] in starting_time)):
            
            df.loc[i, 'HOURLY_ENTRIES'] = 0
            df.loc[i, 'HOURLY_EXITS'] = 0
            
            print "\nFIRST ENTRY."
            print "ID = {} \t DATE = {} \t TIME = {}\n".format(id_, entry['DATE'], entry['TIME'])
            
        
        # Otherwise, look up previous entry for that turnstile.
        else:
            print "ID = {} \t DATE = {} \t TIME = {}".format(id_, entry['DATE'], entry['TIME'])
            
            # Get all entries for that turnstile
            previous_entries = df[(df['C/A']==entry['C/A']) & (df['UNIT']==entry['UNIT']) & (df['SCP']==entry['SCP']) & (df.index < i)]
            j = max(previous_entries.index)
            #print previous_entries
            
            df.loc[i, 'HOURLY_ENTRIES'] = df.loc[i, 'ENTRIES'] - df.loc[j, 'ENTRIES']
            df.loc[i, 'HOURLY_EXITS'] = df.loc[i, 'EXITS'] - df.loc[j, 'EXITS']
            
            print "i = {} \t j = {} \t  previous entries = {}".format(i, j, previous_entries.index)
            print "CURRENT ENTRIES - OLD ENTRIES = {} - {}  = ".format(df.loc[i, 'ENTRIES'], df.loc[j, 'ENTRIES'], df.loc[i, 'ENTRIES'] - df.loc[j, 'ENTRIES'])
            print "CURRENT EXITS - OLD EXITS = {} - {}  = \n".format(df.loc[i, 'EXITS'], df.loc[j, 'EXITS'], df.loc[i, 'EXITS'] - df.loc[j, 'EXITS'])
        

Testing on one date, 10 entries.

In [8]:
df_testing= df1[df1['DATE'] == '05/01/2015'][:10]
df_testing = df_testing.reset_index()
df_testing = df_testing.drop('index', axis=1)
df_testing

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,5115461,1732389
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,04:00:00,REGULAR,5115480,1732394
2,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,08:00:00,REGULAR,5115524,1732482
3,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,12:00:00,REGULAR,5115678,1732624
4,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,16:00:00,REGULAR,5115998,1732647
5,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,20:00:00,REGULAR,5116883,1732666
6,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,4738746,1034160
7,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,04:00:00,REGULAR,4738761,1034161
8,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,08:00:00,REGULAR,4738804,1034199
9,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,12:00:00,REGULAR,4738969,1034274


In [9]:
cumulative_to_hourly(df_testing)
df_testing


FIRST ENTRY.
ID = ('A002', 'R051', '02-00-00') 	 DATE = 05/01/2015 	 TIME = 00:00:00

ID = ('A002', 'R051', '02-00-00') 	 DATE = 05/01/2015 	 TIME = 04:00:00
i = 1 	 j = 0 	  previous entries = Int64Index([0], dtype='int64')
CURRENT ENTRIES - OLD ENTRIES = 5115480 - 5115461  = 
CURRENT EXITS - OLD EXITS = 1732394 - 1732389  = 

ID = ('A002', 'R051', '02-00-00') 	 DATE = 05/01/2015 	 TIME = 08:00:00
i = 2 	 j = 1 	  previous entries = Int64Index([0, 1], dtype='int64')
CURRENT ENTRIES - OLD ENTRIES = 5115524 - 5115480  = 
CURRENT EXITS - OLD EXITS = 1732482 - 1732394  = 

ID = ('A002', 'R051', '02-00-00') 	 DATE = 05/01/2015 	 TIME = 12:00:00
i = 3 	 j = 2 	  previous entries = Int64Index([0, 1, 2], dtype='int64')
CURRENT ENTRIES - OLD ENTRIES = 5115678 - 5115524  = 
CURRENT EXITS - OLD EXITS = 1732624 - 1732482  = 

ID = ('A002', 'R051', '02-00-00') 	 DATE = 05/01/2015 	 TIME = 16:00:00
i = 4 	 j = 3 	  previous entries = Int64Index([0, 1, 2, 3], dtype='int64')
CURRENT ENTRIES - OLD EN

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS,HOURLY_ENTRIES,HOURLY_EXITS
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,5115461,1732389,0,0
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,04:00:00,REGULAR,5115480,1732394,19,5
2,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,08:00:00,REGULAR,5115524,1732482,44,88
3,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,12:00:00,REGULAR,5115678,1732624,154,142
4,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,16:00:00,REGULAR,5115998,1732647,320,23
5,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,20:00:00,REGULAR,5116883,1732666,885,19
6,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,4738746,1034160,0,0
7,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,04:00:00,REGULAR,4738761,1034161,15,1
8,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,08:00:00,REGULAR,4738804,1034199,43,38
9,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,12:00:00,REGULAR,4738969,1034274,165,75


Testing on one date, 1100 entries.

In [12]:
df123 = df1[df1.DATE == '05/01/2015']
df123 = df123.reset_index()
df123 = df123.drop('index', axis=1)
df123 = df123[:1100]

In [13]:
df123

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,5115461,1732389
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,04:00:00,REGULAR,5115480,1732394
2,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,08:00:00,REGULAR,5115524,1732482
3,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,12:00:00,REGULAR,5115678,1732624
4,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,16:00:00,REGULAR,5115998,1732647
5,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,20:00:00,REGULAR,5116883,1732666
6,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,4738746,1034160
7,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,04:00:00,REGULAR,4738761,1034161
8,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,08:00:00,REGULAR,4738804,1034199
9,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,12:00:00,REGULAR,4738969,1034274


In [14]:
cumulative_to_hourly(df123)


FIRST ENTRY.
ID = ('A002', 'R051', '02-00-00') 	 DATE = 05/01/2015 	 TIME = 00:00:00

ID = ('A002', 'R051', '02-00-00') 	 DATE = 05/01/2015 	 TIME = 04:00:00
i = 1 	 j = 0 	  previous entries = Int64Index([0], dtype='int64')
CURRENT ENTRIES - OLD ENTRIES = 5115480 - 5115461  = 
CURRENT EXITS - OLD EXITS = 1732394 - 1732389  = 

ID = ('A002', 'R051', '02-00-00') 	 DATE = 05/01/2015 	 TIME = 08:00:00
i = 2 	 j = 1 	  previous entries = Int64Index([0, 1], dtype='int64')
CURRENT ENTRIES - OLD ENTRIES = 5115524 - 5115480  = 
CURRENT EXITS - OLD EXITS = 1732482 - 1732394  = 

ID = ('A002', 'R051', '02-00-00') 	 DATE = 05/01/2015 	 TIME = 12:00:00
i = 3 	 j = 2 	  previous entries = Int64Index([0, 1, 2], dtype='int64')
CURRENT ENTRIES - OLD ENTRIES = 5115678 - 5115524  = 
CURRENT EXITS - OLD EXITS = 1732624 - 1732482  = 

ID = ('A002', 'R051', '02-00-00') 	 DATE = 05/01/2015 	 TIME = 16:00:00
i = 4 	 j = 3 	  previous entries = Int64Index([0, 1, 2, 3], dtype='int64')
CURRENT ENTRIES - OLD EN

In [49]:
df123

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS,HOURLY_ENTRIES,HOURLY_EXITS
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,5115461,1732389,0,0
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,04:00:00,REGULAR,5115480,1732394,19,5
2,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,08:00:00,REGULAR,5115524,1732482,44,88
3,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,12:00:00,REGULAR,5115678,1732624,154,142
4,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,16:00:00,REGULAR,5115998,1732647,320,23
5,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,20:00:00,REGULAR,5116883,1732666,885,19
6,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,4738746,1034160,0,0
7,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,04:00:00,REGULAR,4738761,1034161,15,1
8,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,08:00:00,REGULAR,4738804,1034199,43,38
9,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,12:00:00,REGULAR,4738969,1034274,165,75


In [93]:
df123.tail()

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS,HOURLY_ENTRIES,HOURLY_EXITS
1095,A047,R087,00-00-01,CITY HALL,R,BMT,05/01/2015,12:00:00,REGULAR,2148803,1055267,53,76
1096,A047,R087,00-00-01,CITY HALL,R,BMT,05/01/2015,16:00:00,REGULAR,2148902,1055309,99,42
1097,A047,R087,00-00-01,CITY HALL,R,BMT,05/01/2015,20:00:00,REGULAR,2149145,1055367,243,58
1098,A047,R087,00-00-02,CITY HALL,R,BMT,05/01/2015,00:00:00,REGULAR,6332543,4628820,0,0
1099,A047,R087,00-00-02,CITY HALL,R,BMT,05/01/2015,04:00:00,REGULAR,6332564,4628822,21,2


In [16]:
df123[df123['HOURLY_ENTRIES'].isnull()]

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS,HOURLY_ENTRIES,HOURLY_EXITS


In [17]:
testing = df123[pd.isnull(df123).any(axis=1)]
testing

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS,HOURLY_ENTRIES,HOURLY_EXITS


That worked. Now it's time to try it across multiple dates: two turnstiles on May 1st (from the first file), followed by the same two turnstiles for every day in the second file.

In [5]:
dfa = df1[(df1['C/A']=='A002') & (df1['UNIT']=='R051') & ((df1['SCP']=='02-00-00') | (df1['SCP']=='02-00-01')) & (df1['DATE'] == '05/01/2015')] 
dfb = df2[(df2['C/A']=='A002') & (df2['UNIT']=='R051') & ((df2['SCP']=='02-00-00') | (df2['SCP']=='02-00-01'))]

df12ab = pd.concat([dfa, dfb])
df12ab = df12ab.reset_index()
df12ab = df12ab.drop('index', axis=1)
df12ab.DATE

0     05/01/2015
1     05/01/2015
2     05/01/2015
3     05/01/2015
4     05/01/2015
5     05/01/2015
6     05/01/2015
7     05/01/2015
8     05/01/2015
9     05/01/2015
10    05/01/2015
11    05/01/2015
12    05/02/2015
13    05/02/2015
14    05/02/2015
...
81    05/06/2015
82    05/06/2015
83    05/06/2015
84    05/07/2015
85    05/07/2015
86    05/07/2015
87    05/07/2015
88    05/07/2015
89    05/07/2015
90    05/08/2015
91    05/08/2015
92    05/08/2015
93    05/08/2015
94    05/08/2015
95    05/08/2015
Name: DATE, Length: 96, dtype: object

In [6]:
cumulative_to_hourly(df12ab)


FIRST ENTRY.
ID = ('A002', 'R051', '02-00-00') 	 DATE = 05/01/2015 	 TIME = 00:00:00

ID = ('A002', 'R051', '02-00-00') 	 DATE = 05/01/2015 	 TIME = 04:00:00
i = 1 	 j = 0 	  previous entries = Int64Index([0], dtype='int64')
CURRENT ENTRIES - OLD ENTRIES = 5115480 - 5115461  = 
CURRENT EXITS - OLD EXITS = 1732394 - 1732389  = 

ID = ('A002', 'R051', '02-00-00') 	 DATE = 05/01/2015 	 TIME = 08:00:00
i = 2 	 j = 1 	  previous entries = Int64Index([0, 1], dtype='int64')
CURRENT ENTRIES - OLD ENTRIES = 5115524 - 5115480  = 
CURRENT EXITS - OLD EXITS = 1732482 - 1732394  = 

ID = ('A002', 'R051', '02-00-00') 	 DATE = 05/01/2015 	 TIME = 12:00:00
i = 3 	 j = 2 	  previous entries = Int64Index([0, 1, 2], dtype='int64')
CURRENT ENTRIES - OLD ENTRIES = 5115678 - 5115524  = 
CURRENT EXITS - OLD EXITS = 1732624 - 1732482  = 

ID = ('A002', 'R051', '02-00-00') 	 DATE = 05/01/2015 	 TIME = 16:00:00
i = 4 	 j = 3 	  previous entries = Int64Index([0, 1, 2, 3], dtype='int64')
CURRENT ENTRIES - OLD EN

In [7]:
df12ab

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS,HOURLY_ENTRIES,HOURLY_EXITS
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,5115461,1732389,0,0
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,04:00:00,REGULAR,5115480,1732394,19,5
2,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,08:00:00,REGULAR,5115524,1732482,44,88
3,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,12:00:00,REGULAR,5115678,1732624,154,142
4,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,16:00:00,REGULAR,5115998,1732647,320,23
5,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/01/2015,20:00:00,REGULAR,5116883,1732666,885,19
6,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,00:00:00,REGULAR,4738746,1034160,0,0
7,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,04:00:00,REGULAR,4738761,1034161,15,1
8,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,08:00:00,REGULAR,4738804,1034199,43,38
9,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/01/2015,12:00:00,REGULAR,4738969,1034274,165,75


In [8]:
df12ab.tail(10)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS,HOURLY_ENTRIES,HOURLY_EXITS
86,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/07/2015,08:00:00,REGULAR,4746117,1035600,56,59
87,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/07/2015,12:00:00,REGULAR,4746319,1035749,202,149
88,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/07/2015,16:00:00,REGULAR,4746593,1035806,274,57
89,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/07/2015,20:00:00,REGULAR,4747382,1035858,789,52
90,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/08/2015,00:00:00,REGULAR,4747624,1035872,242,14
91,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/08/2015,04:00:00,REGULAR,4747638,1035876,14,4
92,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/08/2015,08:00:00,REGULAR,4747682,1035915,44,39
93,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/08/2015,12:00:00,REGULAR,4747852,1036048,170,133
94,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/08/2015,16:00:00,REGULAR,4748143,1036101,291,53
95,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,05/08/2015,20:00:00,REGULAR,4748938,1036158,795,57


In [9]:
df12ab[pd.isnull(df12ab).any(axis=1)]

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS,HOURLY_ENTRIES,HOURLY_EXITS


Great, the first week seems to have worked fine. It should work fine for the entire month.

### Different Types

There are different types of "DESC" according to the documentation.

===========================
````
Descn Possible Values (Events):
REGULAR - Regular scheduled audit event
NO-VAL LGN - Not Valid logon
LGF-MAN - Logoff Manual
LGF-DR CLS - Logoff Door Closed
LGF-SHUTDN - Logoff Shutdown
TS BRD CHG - Turnstile Board Change
TS VLT OPN - Turnstile Vault Open
RECOVR AUD - Recovery audit - if REGULAR was not delivered due to communications problems
````
===========================


In [216]:
df.DESC.unique()

array(['REGULAR', 'RECOVR AUD'], dtype=object)

In [224]:
df[df.DESC == 'RECOVR AUD'].head(5)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
1113,A010,R080,00-00-00,57 ST-7 AVE,NQR,BMT,04/28/2015,12:00:00,RECOVR AUD,143112,85613
1155,A010,R080,00-00-01,57 ST-7 AVE,NQR,BMT,04/28/2015,12:00:00,RECOVR AUD,12250490,4537590
1197,A010,R080,00-00-02,57 ST-7 AVE,NQR,BMT,04/28/2015,12:00:00,RECOVR AUD,8936510,3456835
1239,A010,R080,00-00-03,57 ST-7 AVE,NQR,BMT,04/28/2015,12:00:00,RECOVR AUD,1865400,918583
1281,A010,R080,00-00-04,57 ST-7 AVE,NQR,BMT,04/28/2015,12:00:00,RECOVR AUD,2550498,1228635


In [222]:
df.iloc[[1110, 1112, 1113, 1114, 1115]]

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
1110,A010,R080,00-00-00,57 ST-7 AVE,NQR,BMT,04/28/2015,00:00:00,REGULAR,142599,84827
1112,A010,R080,00-00-00,57 ST-7 AVE,NQR,BMT,04/28/2015,08:00:00,REGULAR,142744,85024
1113,A010,R080,00-00-00,57 ST-7 AVE,NQR,BMT,04/28/2015,12:00:00,RECOVR AUD,143112,85613
1114,A010,R080,00-00-00,57 ST-7 AVE,NQR,BMT,04/28/2015,16:00:00,REGULAR,143638,85933
1115,A010,R080,00-00-00,57 ST-7 AVE,NQR,BMT,04/28/2015,20:00:00,REGULAR,144687,86304


In this dataset there are only two types, REGULAR and RECOVR AUD. Fortunately RECOVR AUD needs no special handling: according to the documentation it is the same audit reading as REGULAR, recovered after the scheduled one was not delivered because of communications problems.

## Scraping Files

I'm choosing the last complete month, May 2015, for the analysis. The MTA releases the data in weekly files, so I'm also going to take the last week of April and the first week of June.

The files are available for download here: http://web.mta.info/developers/turnstile.html

I'll be using the Python module Beautiful Soup to parse the HTML and find the relevant hyperlinks quickly.

There was an odd error with Beautiful Soup: some bytes on the page couldn't be converted to ASCII characters. After a quick Google search, I found the following fix on Stack Overflow.

In [132]:
# Work around for byte-ASCII error in bs.
import sys  

reload(sys)  
sys.setdefaultencoding('utf8')

In [133]:
# URL of MTA data.
URL = 'http://web.mta.info/developers/turnstile.html'

# Parse the html using bs to find the hyperlinks.
r = requests.get(URL)
soup = bs(r.text, 'html.parser')
hyperlinks = soup.findAll('a')

# Initialize array to hold all of the URLs on the page.
urls = []

# Loop through the hyperlinks, parsing just the links themselves.
for each in hyperlinks:
    link = each.get('href')
    
    # Only add links which are in the data directory, and are text files.
    if link and link.endswith('.txt') and link.startswith('data/'):
        urls.append(link)

In [134]:
# Initialize array to hold all of the dates from the URLs collected.
dates = []

# Split the URLs up in order to get just the dates.
for each in urls:
    filename = each.split('/')[-1]
    index = filename.find('turnstile_')
    date = filename[index+len('turnstile_'):].split('.')[0]
    dates.append(date)
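The same extraction can also be sketched with a regular expression, which has the side benefit of skipping any link that doesn't match the expected `turnstile_YYMMDD.txt` pattern. The helper name `extract_date` and the example URL are assumptions for illustration, not part of the notebook above.

```python
import re
from datetime import datetime

# Hypothetical helper (not in the original notebook).
FILENAME_PATTERN = re.compile(r'turnstile_(\d{6})\.txt$')

def extract_date(url):
    """Return the YYMMDD string from a turnstile URL, or None if absent."""
    match = FILENAME_PATTERN.search(url)
    return match.group(1) if match else None

# Illustrative link in the same shape as the ones collected above.
example = 'data/nyct/turnstile/turnstile_150502.txt'
date_string = extract_date(example)

# The six digits are year, month, day, so they parse with '%y%m%d'.
parsed = datetime.strptime(date_string, '%y%m%d').date()
print('{0} -> {1}'.format(date_string, parsed))
```

Parsing the string into a real date makes later comparisons (such as "is this in May?") less dependent on string slicing.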

In [135]:
# Initialize set to store the dates to download.
dates_to_download = Set([])

# Loop through the dates, and take the appropriate ones.
for i, date in enumerate(dates):
    # Only 2015.
    if date[:2] == '15':
        # Take all of May, the last week of April, and the first week of June.
        if date[2:4] == '05':
            dates_to_download.update([dates[i-1], dates[i], dates[i+1]])

# Convert the set to an array.            
dates_to_download = list(dates_to_download)

In [136]:
dates_to_download = sorted(dates_to_download)
#dates_to_download = ['150425', '150502', '150509', '150516', '150523', '150530', '150606']
dates_to_download

['150425', '150502', '150509', '150516', '150523', '150530', '150606']
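An alternative to taking the neighbours of each May file is to work out which days each file covers. The sketch below assumes that a file dated D holds the readings for the seven days before D (the first file here, 150425, does begin on 04/18/2015). The function name `covers_month` and the two extra candidate dates at either end are made up for the example.

```python
from datetime import datetime, timedelta

def covers_month(file_date, year, month):
    """True if any of the seven days before file_date falls in the month.

    Assumes each file dated D holds readings for the days D-7 .. D-1.
    """
    end = datetime.strptime(file_date, '%y%m%d').date()
    days = [end - timedelta(days=n) for n in range(1, 8)]
    return any(d.year == year and d.month == month for d in days)

# Illustrative candidates: the seven dates above plus one extra week each side.
candidates = ['150418', '150425', '150502', '150509', '150516',
              '150523', '150530', '150606', '150613']
may_files = [d for d in candidates if covers_month(d, 2015, 5)]
print(may_files)
```

Under that assumption the 150425 file isn't strictly needed for May. Taking it anyway, as the cell above does, is harmless because the April rows get filtered out later.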

In [137]:
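# Loop through the dates, downloading the corresponding file.
for date in dates_to_download:
    download = 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_{0}.txt'.format(date)
    response = requests.get(download)

    # Assumed save path, chosen to match what combine_turnstile_data() opens below.
    with open('data/turnstile/{0}.txt'.format(date), 'wb') as f:
        f.write(response.content)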

## Combining Files

Once all the necessary subway data has been downloaded, it's time to combine it in some fashion. One possibility is to load all the files into individual dataframes using pandas, then merge them and write the result out. Instead, I decided to open the files one by one and write their contents into a new file, as this uses less memory.

In [160]:
def combine_turnstile_data(filenames):
    """
    Takes the turnstile filenames and writes them one by one into a new
    file.
    """
    
    # Open a new master file, and write in the header row.
    with open('updated_data/master_file.txt', 'w') as master_file:
        master_file.write('C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESCn,ENTRIES,EXITS\n')
    
        # Open all the files and write them into the master file.
        for filename in filenames:
            with open('data/turnstile/{0}.txt'.format(filename),'rb') as f:
                for row in f:
                    # Ignore the header row.
                    if row.startswith('C/A'):
                        continue
                    master_file.write(row)

In [161]:
combine_turnstile_data(dates_to_download)
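For comparison, the pandas-based alternative mentioned above might look roughly like the sketch below. It uses two tiny in-memory stand-ins for the weekly files (the column subset and values are made up), so it only illustrates the shape of the approach.

```python
import io
import pandas as pd

# Two tiny in-memory "files" standing in for the weekly downloads.
week_1 = io.StringIO(u"C/A,UNIT,ENTRIES\nA002,R051,100\nA002,R051,120\n")
week_2 = io.StringIO(u"C/A,UNIT,ENTRIES\nA002,R051,150\n")

# Read each file into its own dataframe, then stack them into one.
frames = [pd.read_csv(f) for f in (week_1, week_2)]
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```

This holds every file in memory at once, which is the cost the file-by-file approach avoids.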

In [11]:
df_turnstile_master = pd.read_csv('updated_data/master_file.txt')

In [12]:
df_turnstile_master.head(1)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESCn,ENTRIES,EXITS
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,04/18/2015,00:00:00,REGULAR,5095940,1725998


## Preparing the Data

### Dates

It's very useful to have a datetime object within the DF in order to make calculations, so I'll make a new column named "DATE-TIME". This function automatically assumes the timestamp is

In [13]:
# Convert to datetime.
df_turnstile_master['DATE-TIME'] = pd.to_datetime(df_turnstile_master['DATE'] + " " + df_turnstile_master['TIME'])

#df_turnstile_master = df_turnstile_master.drop(['DATE', 'TIME'], axis=1)

In [14]:
df_turnstile_master = df_turnstile_master.drop(['DATE', 'TIME'], axis=1)
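As a small variation on the conversion above, the format of the timestamp can be spelled out so pandas doesn't have to infer it. This is a sketch on a two-row toy frame; the `format` string is my reading of the DATE and TIME columns shown earlier (month/day/year, then hours:minutes:seconds).

```python
import pandas as pd

# Toy frame with the same DATE and TIME layout as the master file.
sample = pd.DataFrame({
    'DATE': ['04/18/2015', '05/01/2015'],
    'TIME': ['00:00:00', '04:00:00'],
})

# State the month/day/year order explicitly instead of relying on inference.
sample['DATE-TIME'] = pd.to_datetime(sample['DATE'] + ' ' + sample['TIME'],
                                     format='%m/%d/%Y %H:%M:%S')
print(sample['DATE-TIME'].iloc[1])
```

On a frame with over a million rows, giving the format explicitly also tends to be noticeably faster than letting pandas guess.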

In [15]:
df_turnstile_master.head()

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESCn,ENTRIES,EXITS,DATE-TIME
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5095940,1725998,2015-04-18 00:00:00
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5095981,1726007,2015-04-18 04:00:00
2,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5096001,1726039,2015-04-18 08:00:00
3,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5096137,1726130,2015-04-18 12:00:00
4,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5096419,1726208,2015-04-18 16:00:00


In [16]:
df_turnstile_master['DATE-TIME'].iloc[0].date()

datetime.date(2015, 4, 18)

In [17]:
df_turnstile_master['DATE-TIME'].iloc[0].time()

datetime.time(0, 0)

### Removing Non-May Dates

In [18]:
# Checking if the dates are in order.
pd.Series(df_turnstile_master['DATE-TIME'].ravel()).unique()

array(['2015-04-18T01:00:00.000000000+0100',
       '2015-04-18T05:00:00.000000000+0100',
       '2015-04-18T09:00:00.000000000+0100', ...,
       '2015-06-01T08:37:09.000000000+0100',
       '2015-06-05T08:45:28.000000000+0100',
       '2015-06-01T08:37:13.000000000+0100'], dtype='datetime64[ns]')

Now I've got to remove the April and June dates from the data. 

In [19]:
df_turnstile_master['DATE-TIME'].iloc[0]

Timestamp('2015-04-18 00:00:00')

In [20]:
before = df_turnstile_master.shape[0]

In [21]:
# Extract the month from the DATE-TIME column.
month_filter = pd.DatetimeIndex(df_turnstile_master['DATE-TIME']).month

In [22]:
month_filter

array([4, 4, 4, ..., 6, 6, 6])

In [23]:
# Convert to a Series in order to do filter operation on the df.
month_filter = pd.Series(month_filter)

In [24]:
# Subset the df, based on the month being May.
df_turnstile_master = df_turnstile_master[month_filter == 5]
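The month filter can also be sketched in one step with the `.dt` accessor, assuming a pandas version that provides it. The mask is built from the column itself, so it carries the same index as the frame and needs no separate conversion to a Series. The toy frame below is made up for the example.

```python
import pandas as pd

# Four made-up readings straddling the edges of May.
sample = pd.DataFrame({
    'DATE-TIME': pd.to_datetime(['2015-04-30 20:00:00',
                                 '2015-05-01 00:00:00',
                                 '2015-05-31 20:00:00',
                                 '2015-06-01 00:00:00']),
    'ENTRIES': [10, 20, 30, 40],
})

# Boolean mask built straight from the column; it shares the frame's index.
is_may = sample['DATE-TIME'].dt.month == 5

# Keep the May rows and renumber them from zero.
may_only = sample[is_may].reset_index(drop=True)
print(may_only)
```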

In [25]:
after = df_turnstile_master.shape[0]

In [26]:
before

1343584

In [27]:
after

850343

In [29]:
df_turnstile_master.head(1)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESCn,ENTRIES,EXITS,DATE-TIME
191038,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115461,1732389,2015-05-01


In [32]:
starting_time = [datetime.time(0, 0), datetime.time(1, 0), datetime.time(2, 0), datetime.time(3, 0)]

print (df_turnstile_master['DATE-TIME'].iloc[0].date() == datetime.date(2015, 5, 1))
print (df_turnstile_master['DATE-TIME'].iloc[0].time() in starting_time)

True
True


In [35]:
# Renumber the rows from zero, discarding the old index.
df_turnstile_master = df_turnstile_master.reset_index(drop=True)

In [None]:
df_turnstile_master = df_turnstile_master.drop(['level_0'], axis=1)

In [39]:
df_turnstile_master.head(1)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESCn,ENTRIES,EXITS,DATE-TIME
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115461,1732389,2015-05-01


In [2]:
def cumulative_to_hourly(df):
    """
    Converts the cumulative ENTRIES and EXITS counters into differences
    between consecutive readings of the same turnstile, stored in the
    HOURLY_ENTRIES and HOURLY_EXITS columns. Modifies df in place.
    Despite the name, each difference spans the gap between two readings
    (usually 4 hours). The first reading of each turnstile is set to 0.
    """

    for i, entry in df.iterrows():

        # Find the earlier rows in the DF for this turnstile (same C/A, UNIT, SCP).
        previous_entries = df[(df['C/A']==entry['C/A']) & (df['UNIT']==entry['UNIT']) &
                              (df['SCP']==entry['SCP']) & (df.index < i)]

        # If this is the first entry then initialize to zero.
        if len(previous_entries) == 0:
            df.loc[i, 'HOURLY_ENTRIES'] = 0
            df.loc[i, 'HOURLY_EXITS'] = 0

        # Otherwise, subtract the most recent previous reading for that turnstile.
        else:
            j = max(previous_entries.index)

            df.loc[i, 'HOURLY_ENTRIES'] = df.loc[i, 'ENTRIES'] - df.loc[j, 'ENTRIES']
            df.loc[i, 'HOURLY_EXITS'] = df.loc[i, 'EXITS'] - df.loc[j, 'EXITS']

In [109]:
df_t100 = df_turnstile_master[:100]
df_t100.head(5)


Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESCn,ENTRIES,EXITS,DATE-TIME,HOURLY_ENTRIES,HOURLY_EXITS
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115461,1732389,2015-05-01 00:00:00,0,0
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115480,1732394,2015-05-01 04:00:00,19,5
2,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115524,1732482,2015-05-01 08:00:00,44,88
3,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115678,1732624,2015-05-01 12:00:00,154,142
4,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115998,1732647,2015-05-01 16:00:00,320,23


In [110]:
cumulative_to_hourly(df_t100)

In [111]:
df_t100

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESCn,ENTRIES,EXITS,DATE-TIME,HOURLY_ENTRIES,HOURLY_EXITS
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115461,1732389,2015-05-01 00:00:00,0,0
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115480,1732394,2015-05-01 04:00:00,19,5
2,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115524,1732482,2015-05-01 08:00:00,44,88
3,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115678,1732624,2015-05-01 12:00:00,154,142
4,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115998,1732647,2015-05-01 16:00:00,320,23
5,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5116883,1732666,2015-05-01 20:00:00,885,19
6,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738746,1034160,2015-05-01 00:00:00,0,0
7,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738761,1034161,2015-05-01 04:00:00,15,1
8,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738804,1034199,2015-05-01 08:00:00,43,38
9,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738969,1034274,2015-05-01 12:00:00,165,75


In [112]:
df_t1000 = df_turnstile_master[:1000]
cumulative_to_hourly(df_t1000)
# Worked Fine.

In [113]:
df_t1000

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESCn,ENTRIES,EXITS,DATE-TIME,HOURLY_ENTRIES,HOURLY_EXITS
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115461,1732389,2015-05-01 00:00:00,0,0
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115480,1732394,2015-05-01 04:00:00,19,5
2,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115524,1732482,2015-05-01 08:00:00,44,88
3,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115678,1732624,2015-05-01 12:00:00,154,142
4,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115998,1732647,2015-05-01 16:00:00,320,23
5,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5116883,1732666,2015-05-01 20:00:00,885,19
6,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738746,1034160,2015-05-01 00:00:00,0,0
7,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738761,1034161,2015-05-01 04:00:00,15,1
8,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738804,1034199,2015-05-01 08:00:00,43,38
9,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738969,1034274,2015-05-01 12:00:00,165,75


In [116]:
df_t2000 = df_turnstile_master[:2000]
cumulative_to_hourly(df_t2000)

In [117]:
df_t2000

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESCn,ENTRIES,EXITS,DATE-TIME,HOURLY_ENTRIES,HOURLY_EXITS
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115461,1732389,2015-05-01 00:00:00,0,0
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115480,1732394,2015-05-01 04:00:00,19,5
2,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115524,1732482,2015-05-01 08:00:00,44,88
3,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115678,1732624,2015-05-01 12:00:00,154,142
4,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115998,1732647,2015-05-01 16:00:00,320,23
5,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5116883,1732666,2015-05-01 20:00:00,885,19
6,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738746,1034160,2015-05-01 00:00:00,0,0
7,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738761,1034161,2015-05-01 04:00:00,15,1
8,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738804,1034199,2015-05-01 08:00:00,43,38
9,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738969,1034274,2015-05-01 12:00:00,165,75


In [121]:
df_t2000[pd.isnull(df_t2000).any(1)]

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESCn,ENTRIES,EXITS,DATE-TIME,HOURLY_ENTRIES,HOURLY_EXITS


In [119]:
def check_turnstile(df, CA, UNIT, SCP):
    
    return df[(df['C/A']==CA) & (df['UNIT']==UNIT) & (df['SCP']==SCP)]

In [120]:
a = check_turnstile(df_t2000, 'A077', 'R028', '03-00-00')
a

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESCn,ENTRIES,EXITS,DATE-TIME,HOURLY_ENTRIES,HOURLY_EXITS
1632,A077,R028,03-00-00,FULTON ST,ACJZ2345,BMT,REGULAR,1307486,377050,2015-05-01 03:00:00,0,0
1633,A077,R028,03-00-00,FULTON ST,ACJZ2345,BMT,REGULAR,1307500,377064,2015-05-01 07:00:00,14,14
1634,A077,R028,03-00-00,FULTON ST,ACJZ2345,BMT,REGULAR,1307593,377176,2015-05-01 11:00:00,93,112
1635,A077,R028,03-00-00,FULTON ST,ACJZ2345,BMT,REGULAR,1307734,377208,2015-05-01 15:00:00,141,32
1636,A077,R028,03-00-00,FULTON ST,ACJZ2345,BMT,REGULAR,1308360,377260,2015-05-01 19:00:00,626,52
1637,A077,R028,03-00-00,FULTON ST,ACJZ2345,BMT,REGULAR,1308519,377279,2015-05-01 23:00:00,159,19


In [153]:
df_turnstile_master[pd.isnull(df_turnstile_master).any(1)] # 158739 - 165325 - 200000 - 200269 - 210000

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESCn,ENTRIES,EXITS,DATE-TIME,HOURLY_ENTRIES,HOURLY_EXITS


850343

I did this on 22/07 after realizing that only the date had been exported, so I have to redo the cumulative-to-hourly function.

"master_file" still has March dates etc., so I need to redo that too.

In [5]:
df_turnstile_master = pd.read_csv('updated_data/master_file.txt')

In [6]:
df_turnstile_master.head()

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESCn,ENTRIES,EXITS
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,04/18/2015,00:00:00,REGULAR,5095940,1725998
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,04/18/2015,04:00:00,REGULAR,5095981,1726007
2,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,04/18/2015,08:00:00,REGULAR,5096001,1726039
3,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,04/18/2015,12:00:00,REGULAR,5096137,1726130
4,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,04/18/2015,16:00:00,REGULAR,5096419,1726208


In [7]:
l = len(df_turnstile_master)
l

1343584

In [8]:
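# NOTE: base always equals i here, so each pass handles the slice [i - step:i].
# The first slice, [-1000:0], is empty, and the rows after the last multiple
# of step are never reached. Each slice also restarts the turnstile lookup.
base, step = 0, 1000
for i in range(base, l, step):
    print "{}\t {} \t STARTED AT {}".format(base - step, i, datetime.datetime.now().time())
    cumulative_to_hourly(df_turnstile_master[base - step:i])
    print "DONE AT {}".format(datetime.datetime.now().time())
    base += step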
    

-1000	 0 	 STARTED AT 19:35:51.106000
DONE AT 19:35:51.107000
0	 1000 	 STARTED AT 19:35:51.107000


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = np.nan
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


KeyboardInterrupt: 
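The row-by-row function rescans the dataframe for every row, which is why the run above had to be interrupted. A vectorized version can be sketched with `groupby` and `diff`: group the readings by turnstile and subtract each reading from the one before it. The function name `add_hourly_counts` is hypothetical, and the sketch assumes the rows are already in time order within each turnstile, as they are in the master file. The sample values are taken from the first rows of the tables above.

```python
import pandas as pd

def add_hourly_counts(df):
    """Return a copy of df with per-reading differences for each turnstile.

    Hypothetical vectorized stand-in for cumulative_to_hourly; assumes rows
    are already in time order within each turnstile.
    """
    out = df.copy()
    grouped = out.groupby(['C/A', 'UNIT', 'SCP'])

    # diff() leaves NaN on each turnstile's first reading; fill it with 0
    # to match what the loop-based function does.
    out['HOURLY_ENTRIES'] = grouped['ENTRIES'].diff().fillna(0).astype(int)
    out['HOURLY_EXITS'] = grouped['EXITS'].diff().fillna(0).astype(int)
    return out

# Three readings of one turnstile followed by two readings of another.
sample = pd.DataFrame({
    'C/A': ['A002'] * 5,
    'UNIT': ['R051'] * 5,
    'SCP': ['02-00-00'] * 3 + ['02-00-01'] * 2,
    'ENTRIES': [5115461, 5115480, 5115524, 4738746, 4738761],
    'EXITS': [1732389, 1732394, 1732482, 1034160, 1034161],
})
result = add_hourly_counts(sample)
print(result[['SCP', 'HOURLY_ENTRIES', 'HOURLY_EXITS']])
```

Working on a copy and returning it also sidesteps the "copy of a slice" warnings shown above, since nothing is written into a slice of the original frame.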

In [53]:
df_turnstile_master.loc[27465]['DATE-TIME'].date() 

datetime.date(2015, 5, 1)

In [154]:
df_turnstile_master

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESCn,ENTRIES,EXITS,DATE-TIME,HOURLY_ENTRIES,HOURLY_EXITS
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115461,1732389,2015-05-01 00:00:00,0,0
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115480,1732394,2015-05-01 04:00:00,19,5
2,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115524,1732482,2015-05-01 08:00:00,44,88
3,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115678,1732624,2015-05-01 12:00:00,154,142
4,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115998,1732647,2015-05-01 16:00:00,320,23
5,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5116883,1732666,2015-05-01 20:00:00,885,19
6,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738746,1034160,2015-05-01 00:00:00,0,0
7,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738761,1034161,2015-05-01 04:00:00,15,1
8,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738804,1034199,2015-05-01 08:00:00,43,38
9,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738969,1034274,2015-05-01 12:00:00,165,75


### Adding Weekday

In [157]:
df_turnstile_master['WEEKDAY'] = df_turnstile_master['DATE-TIME'].apply(lambda x: x.weekday())
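For reference, `weekday()` numbers the days from Monday (0) to Sunday (6). The sketch below shows the same calculation on three made-up timestamps using the `.dt` accessor, assuming a pandas version that provides it.

```python
import pandas as pd

# A Friday, a Saturday and a Monday in May 2015.
stamps = pd.Series(pd.to_datetime(['2015-05-01 00:00:00',
                                   '2015-05-02 04:00:00',
                                   '2015-05-04 08:00:00']))

# Monday is 0 and Sunday is 6, so Friday 1 May 2015 maps to 4.
weekdays = stamps.dt.weekday
print(list(weekdays))
```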

In [158]:
df_turnstile_master

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESCn,ENTRIES,EXITS,DATE-TIME,HOURLY_ENTRIES,HOURLY_EXITS,WEEKDAY
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115461,1732389,2015-05-01 00:00:00,0,0,4
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115480,1732394,2015-05-01 04:00:00,19,5,4
2,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115524,1732482,2015-05-01 08:00:00,44,88,4
3,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115678,1732624,2015-05-01 12:00:00,154,142,4
4,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115998,1732647,2015-05-01 16:00:00,320,23,4
5,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5116883,1732666,2015-05-01 20:00:00,885,19,4
6,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738746,1034160,2015-05-01 00:00:00,0,0,4
7,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738761,1034161,2015-05-01 04:00:00,15,1,4
8,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738804,1034199,2015-05-01 08:00:00,43,38,4
9,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738969,1034274,2015-05-01 12:00:00,165,75,4


### Removing Non-Hour Entries

In [174]:
df_turnstile_master.shape

(850343, 13)

In [179]:
filter_onhour = [((pd.to_datetime(df_turnstile_master['DATE-TIME'][n])).minute != 0) for n in range(len(df_turnstile_master))]
df_turnstile_master2 = df_turnstile_master.drop(list(df_turnstile_master.loc[filter_onhour].index))
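The list comprehension above visits every row in Python, which is slow on a frame this size. A vectorized sketch of the same test is below, on a made-up four-row frame. Like the cell above, it only looks at the minutes, so a reading at, say, 10:00:09 would still be kept; adding a check on `.dt.second` would tighten it.

```python
import pandas as pd

# Two on-the-hour readings and two off-hour ones.
sample = pd.DataFrame({
    'DATE-TIME': pd.to_datetime(['2015-05-01 08:00:00',
                                 '2015-05-01 10:01:09',
                                 '2015-05-01 07:40:54',
                                 '2015-05-01 12:00:00']),
    'ENTRIES': [1, 2, 3, 4],
})

# Keep rows whose timestamp has zero minutes, as the cell above does.
on_hour = sample['DATE-TIME'].dt.minute == 0
kept = sample[on_hour]
print(kept)
```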

In [180]:
df_turnstile_master2.shape

(780713, 13)

## Exporting Final File

In [181]:
# Keep the time of day in the export; a date-only format ('%Y-%m-%d') drops it.
df_turnstile_master2.to_csv('data/final/turnstile_final.csv', index=False, date_format='%Y-%m-%d %H:%M:%S')

In [182]:
df_turnstile_master2.head(1)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESCn,ENTRIES,EXITS,DATE-TIME,HOURLY_ENTRIES,HOURLY_EXITS,WEEKDAY
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115461,1732389,2015-05-01,0,0,4


Testing whether every reading falls on the hour:

In [164]:
d1000 = df_turnstile_master[:5000]

In [165]:
filter_onhour = [((pd.to_datetime(d1000['DATE-TIME'][n])).minute != 0) for n in range(len(d1000))]
filter_onhour

[False,
 False,
 False,
 ...]


In [166]:
d1000[filter_onhour]

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESCn,ENTRIES,EXITS,DATE-TIME,HOURLY_ENTRIES,HOURLY_EXITS,WEEKDAY
3045,C016,R278,00-00-00,25 ST,R,BMT,REGULAR,1176050,1060328,2015-05-01 10:01:09,168,83,4
3052,C016,R278,00-00-01,25 ST,R,BMT,REGULAR,350318,88577,2015-05-01 10:01:09,230,64,4
3059,C016,R278,00-00-02,25 ST,R,BMT,REGULAR,8790739,2489748,2015-05-01 10:01:09,449,125,4
4299,H008,R248,01-00-00,1 AVE,L,BMT,REGULAR,528598,7653778,2015-05-01 07:40:54,5,280,4
4306,H008,R248,01-00-01,1 AVE,L,BMT,REGULAR,1702825,9762109,2015-05-01 07:40:54,16,322,4
4313,H008,R248,01-00-02,1 AVE,L,BMT,REGULAR,7053093,17351657,2015-05-01 07:40:54,36,279,4
4320,H008,R248,01-00-03,1 AVE,L,BMT,REGULAR,13423202,11072730,2015-05-01 07:40:54,92,239,4
4327,H008,R248,01-00-04,1 AVE,L,BMT,REGULAR,3224700,353131,2015-05-01 07:40:54,200,89,4
4994,JFK03,R536,00-00-01,JFK JAMAICA CT1,E,IND,RECOVR AUD,5930,6496,2015-05-01 06:40:45,9,6,4
4995,JFK03,R536,00-00-01,JFK JAMAICA CT1,E,IND,RECOVR AUD,5942,6506,2015-05-01 07:30:25,12,10,4


In [167]:
d1000 = d1000.drop(list(d1000.loc[filter_onhour].index))

In [168]:
d1000

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESCn,ENTRIES,EXITS,DATE-TIME,HOURLY_ENTRIES,HOURLY_EXITS,WEEKDAY
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115461,1732389,2015-05-01 00:00:00,0,0,4
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115480,1732394,2015-05-01 04:00:00,19,5,4
2,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115524,1732482,2015-05-01 08:00:00,44,88,4
3,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115678,1732624,2015-05-01 12:00:00,154,142,4
4,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5115998,1732647,2015-05-01 16:00:00,320,23,4
5,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,REGULAR,5116883,1732666,2015-05-01 20:00:00,885,19,4
6,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738746,1034160,2015-05-01 00:00:00,0,0,4
7,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738761,1034161,2015-05-01 04:00:00,15,1,4
8,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738804,1034199,2015-05-01 08:00:00,43,38,4
9,A002,R051,02-00-01,LEXINGTON AVE,NQR456,BMT,REGULAR,4738969,1034274,2015-05-01 12:00:00,165,75,4


In [173]:
df_turnstile_master.shape

(850343, 13)

In [188]:
a = df_turnstile_master[df_turnstile_master['HOURLY_ENTRIES'] < 0]
a

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESCn,ENTRIES,EXITS,DATE-TIME,HOURLY_ENTRIES,HOURLY_EXITS,WEEKDAY
205,A011,R080,01-00-00,57 ST-7 AVE,NQR,BMT,REGULAR,888974896,494721163,2015-05-01 04:00:00,-97,-46,4
206,A011,R080,01-00-00,57 ST-7 AVE,NQR,BMT,REGULAR,888974791,494720674,2015-05-01 08:00:00,-105,-489,4
207,A011,R080,01-00-00,57 ST-7 AVE,NQR,BMT,REGULAR,888974326,494719381,2015-05-01 12:00:00,-465,-1293,4
208,A011,R080,01-00-00,57 ST-7 AVE,NQR,BMT,REGULAR,888973687,494718932,2015-05-01 16:00:00,-639,-449,4
209,A011,R080,01-00-00,57 ST-7 AVE,NQR,BMT,REGULAR,888972389,494718437,2015-05-01 20:00:00,-1298,-495,4
229,A011,R080,01-00-04,57 ST-7 AVE,NQR,BMT,REGULAR,1895700979,1778236012,2015-05-01 04:00:00,-74,-11,4
230,A011,R080,01-00-04,57 ST-7 AVE,NQR,BMT,REGULAR,1895700936,1778235799,2015-05-01 08:00:00,-43,-213,4
231,A011,R080,01-00-04,57 ST-7 AVE,NQR,BMT,REGULAR,1895700701,1778234886,2015-05-01 12:00:00,-235,-913,4
232,A011,R080,01-00-04,57 ST-7 AVE,NQR,BMT,REGULAR,1895700337,1778234423,2015-05-01 16:00:00,-364,-463,4
233,A011,R080,01-00-04,57 ST-7 AVE,NQR,BMT,REGULAR,1895699392,1778233881,2015-05-01 20:00:00,-945,-542,4


In [192]:
df2 = df_turnstile_master[df_turnstile_master['HOURLY_ENTRIES'] >= 0]
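The filter above checks HOURLY_ENTRIES only. In the rows shown, the turnstiles that count backwards have negative values in both columns, but a row could in principle be negative in just one. The sketch below checks both columns on a made-up frame; whether dropping such rows is the right treatment is a judgment call this notebook doesn't settle.

```python
import pandas as pd

# Made-up differences: row 1 is negative in both columns, row 3 in exits only.
sample = pd.DataFrame({
    'HOURLY_ENTRIES': [19, -97, 44, 5],
    'HOURLY_EXITS': [5, -46, 88, -3],
})

# Keep rows where both differences are non-negative.
non_negative = (sample['HOURLY_ENTRIES'] >= 0) & (sample['HOURLY_EXITS'] >= 0)
cleaned = sample[non_negative]
print(cleaned)
```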

In [194]:
df2.shape

(845693, 13)

### Using Both Datasets

I thought about combining both dataframes into a master dataset that would contain both subway and weather information. However, after I implemented it, the whole thing proved to be redundant. The weather data contains one observation per day, while the subway data contains a reading every 4 hours. Combining them would mean repeating the same weather values on every reading of each day. This only offers some ease of access by having everything in one file, but it bloats the data. So instead I'm keeping them separate, and using the weather dataframe as a sort of lookup table to find the dates of events (such as fog) and then getting the subway data for those dates.

But first I'll index them both by date to align them and make it easier for lookups.