#Subway Data

## Links

http://experimenting.alastair.is/citibike/ 

https://www.udacity.com/course/viewer#!/c-ud359/l-732399471/e-698029633/m-698029634 

http://chriswhong.com/open-data/visualizing-the-mtas-turnstile-data/ 

https://github.com/chriswhong/nycturnstiles 

http://web.mta.info/developers/index.html 

https://saxenarajat99.wordpress.com/2014/09/21/impact-of-rain-on-nyc-subway-ridership-udacity-course-project/ 

http://spotofdata.com/subway-weather-udacity/ 

http://web.mta.info/developers/turnstile.html

http://www.jasondamiani.com/portfolio/analyzing-mta-subway-data/ Good analysis



In [None]:
import csv
import datetime
import os
import pandas as pd

from sets import Set

import requests
import urllib2

from bs4 import BeautifulSoup as bs

## Single File

### Quick Overview

Firstly, I'm going to load in one file from an earlier that I've manually downloaded from the MTA website at http://web.mta.info/developers/turnstile.html. After I get a sense of one subset of the data and how to clean it, I'll make a script to download all of the data files, clean and combine them.

In [15]:
filename = "data/turnstile/turnstile_110528.txt"
df_turnstile = pd.read_csv(filename)
df_turnstile.head(2)

Unnamed: 0,A002,R051,02-00-00,05-21-11,00:00:00,REGULAR,003169391,001097585,05-21-11.1,04:00:00,...,05-22-11,00:00:00.1,REGULAR.6,003170119,001097792,05-22-11.1,04:00:00.1,REGULAR.7,003170146,001097801
0,A002,R051,02-00-00,05-22-11,08:00:00,REGULAR,3170164,1097820,05-22-11,12:00:00,...,05-23-11,08:00:00,REGULAR,3170746,1098069,05-23-11,12:00:00,REGULAR,3170897,1098378
1,A002,R051,02-00-00,05-23-11,16:00:00,REGULAR,3171194,1098447,05-23-11,20:00:00,...,05-24-11,16:00:00,REGULAR,3172689,1099010,05-24-11,20:00:00,REGULAR,3173590,1099055


In [16]:
df_turnstile.shape

(998, 43)

There are 998 rows and an overwhelming 43 columns in this dataset. This file was written in a way where multiple observations share the same row. As a result the MTA data is notoriously difficult to work with. As adaptable as the pandas module is, it can't infer this kind of error and correct it. So I'll have to manually do it myself.

Looking at the time and date columns, it's clear that each file spans exactly one week at 4 hour intervals. The time diference between each consecutive row is 32 hours.

The MTA website labels the data as follows: 'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'. The first 3 columns are identification data. Then elements should be chopped from the original data, 5 elements at a time and written into the new file. Next is a sequence of columns with a timestamp, type of report, entry count, and exit count, which repeats 8 times! This figure lines up with the shape of our dataframe (3 + (5 x 8)) = 43.



### Restructuring the data

In [20]:
def fix_turnstile_data(filepath):
    '''
    Filepath is a location of a MTA Subway turnstile text file.A link to an example
    MTA Subway turnstile text file can be seen at the URL below:
    http://web.mta.info/developers/data/nyct/turnstile/turnstile_110507.txt
    
    There are numerous data points included in each row of the text file. 

    This function updates each row in the text file so there is only one entry per row.
    A few examples below:
    A002,R051,02-00-00,05-28-11,00:00:00,REGULAR,003178521,001100739
    A002,R051,02-00-00,05-28-11,04:00:00,REGULAR,003178541,001100746
    A002,R051,02-00-00,05-28-11,08:00:00,REGULAR,003178559,001100775
    
    This file is then written into a new related directory.
    '''
    
    for file in filepath:
        
        # Parse the directory and filename from the input.
        splitted = file.split('/')
        directory, filename = splitted[0], splitted[1]

        # Read the file into memory.
        r = csv.reader(open(file, 'rb'))

        # Prepare the output directory.
        newpath = "data/turnstile/updated_{0}".format(directory)
        #newpath = 'data/turnstile'
        
        if not os.path.exists(newpath): 
            os.makedirs(newpath)

        # Create the output file in the new directory. Overwrite the file if it exists already(wb).
        w = csv.writer(open("{0}/{1}.txt".format(newpath, filename), 'wb'))

        # Write the header row, taken from the mta website.
        w.writerow(['C/A', 'UNIT', 'SCP', 'DATEn', 'TIMEn', 'DESCn', 'ENTRIESn', 'EXITSn'])

        # Loop through the output from the CSV reader a line at a time.
        for line in r:

            # Parse out the elements, and remove them from the row.
            ca = line.pop(0)
            unit = line.pop(0)
            scp = line.pop(0)

            # While there is still new data, parse it.
            while len(line) >= 5:

                # Take the first 5 elements and remove them.
                block, line = line[:5], line[5:]

                # Output the new row.
                w.writerow([ca, unit, scp] + block)

In [21]:
filename = ['data/turnstile/turnstile_110528.txt']
fix_turnstile_data(filename)
updated_filename = 'data/turnstile/turnstile_110528.txt'
df_turnstile = pd.read_csv(updated_filename)
df_turnstile.head(2)

Unnamed: 0,A002,R051,02-00-00,05-21-11,00:00:00,REGULAR,003169391,001097585,05-21-11.1,04:00:00,...,05-22-11,00:00:00.1,REGULAR.6,003170119,001097792,05-22-11.1,04:00:00.1,REGULAR.7,003170146,001097801
0,A002,R051,02-00-00,05-22-11,08:00:00,REGULAR,3170164,1097820,05-22-11,12:00:00,...,05-23-11,08:00:00,REGULAR,3170746,1098069,05-23-11,12:00:00,REGULAR,3170897,1098378
1,A002,R051,02-00-00,05-23-11,16:00:00,REGULAR,3171194,1098447,05-23-11,20:00:00,...,05-24-11,16:00:00,REGULAR,3172689,1099010,05-24-11,20:00:00,REGULAR,3173590,1099055


In [96]:
df_turnstile.shape

(191938, 8)

As it turns out, the MTA have been overhauling their data collection. Their website, API and even the data recording method have changed. The new files are all structured nicely, meaning that the previous work was unneccessary. Oh well, it was good data munging practice at least!

## Scraping Files

I'm choosing the last complete month to do analysis on which is May 2015. The MTA collect data weekly, so I'm also going to take the last week of April, and the first month of June.

The files are available for download here: http://web.mta.info/developers/turnstile.html

I'll be using the python module Beautiful Soup to parse the html in order to find the relevant hyperlinks quickly.

There was an odd error with beautiful soup - some byte code on the page couldn't be converted to ASCII characters. After a quick google, I found the following fix on stackoverflow.

In [22]:
# Work around for byte-ASCII error in bs.
import sys  

reload(sys)  
sys.setdefaultencoding('utf8')

In [166]:
# URL of MTA data.
URL = 'http://web.mta.info/developers/turnstile.html'

# Parse the html using bs to find the hyperlinks.
r = requests.get(URL)
soup = bs(r.text)
hyperlinks = soup.findAll('a')

# Initialize array to hold all of the URLs on the page.
urls = []

# Loop through the hyperlinks, parsing just the links themselves.
for each in hyperlinks:
    link = each.get('href')
    
    # Only add links which are in the data directory, and are text files.
    if link and link.endswith('.txt') and link.startswith('data/'):
        urls.append(link)

In [168]:
# Initialize array to hold all of the dates from the URLs collected.
dates = []

# Split the URLs up in order to get just the dates.
for each in urls:
    filename = each.split('/')[-1]
    filename.find('turnstile_')
    index = filename.find('turnstile_')
    date = filename[index+len('turnstile_'):].split('.')[0]
    dates.append(date)

In [None]:
# Initialize set to store the dates to download.
dates_to_download = Set([])

# Loop through the dates, and take the appropriate ones.
for i, date in enumerate(dates):
    # Only 2015.
    if date[:2] == '15':
        # Take all of May, the last week of April, and the first week of June.
        if date[2:4] == '05':
            dates_to_download.update([dates[i-1], dates[i], dates[i+1]])

# Convert the set to an array.            
dates_to_download = list(dates_to_download)

In [161]:
dates_to_download

['150509', '150530', '150502', '150606', '150516', '150523', '150425']

In [None]:
l = len(dates_to_download)

# Loop through the dates, downloading the corresponding file.
for i in range(l):
    date = dates_to_download[i]
    download = 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_{0}.txt'.format(date)

## Combining Files

Once all the neccessary subway data has been downloaded, it's time to combine it in some fashion. One possbility is to load all the files into invididual dataframes using pandas, and then merging them and writing. Instead I decided to open the files one by one and write the contents into a new file, as this requires less memory usage.

In [159]:
def combine_turnstile_data(filenames):
    """
    Takes the turnstile filenames and writes them one by one into a new
    file.
    """
    
    # Open a new master file, and write in the header row.
    with open('updated_data/master_file.txt', 'w') as master_file:
       master_file.write('C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn\n')
    

       for filename in filenames:
            with open('data/{0}.txt'.format(filename),'rb') as f:
                for row in f:
                    # Ignore the the header row.
                    if row.startswith('C/A'):
                        continue
                    master_file.write(row)

In [143]:
filenames = ['150509', '150530', '150502', '150606', '150516', '150523', '150425']

In [170]:
combine_turnstile_data(dates_to_download)

In [8]:
df_turnstile_master = pd.read_csv('updated_data/master_file.txt')

In [9]:
df_turnstile_master.head(3)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/02/2015,00:00:00,REGULAR,5117130,1732680
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/02/2015,04:00:00,REGULAR,5117157,1732685
2,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/02/2015,08:00:00,REGULAR,5117176,1732693


##Preparing the Data

###Tallying the counts

Additionally the original data set gives running totals for each turnstile instead of just a number of entries or exits, so to get anything useful out of it, I'll need to do some subtraction. Simply subtract the entry tally for one timestamp from the previous reading.

In [10]:
df_turnstile_master['ENTRIESn_hourly'] = df_turnstile_master['ENTRIESn'] - df_turnstile_master['ENTRIESn'].shift(1)
df_turnstile_master['ENTRIESn_hourly'] = df_turnstile_master['ENTRIESn_hourly'].fillna(0)

df_turnstile_master['EXITSn_hourly'] = df_turnstile_master['EXITSn'] - df_turnstile_master['EXITSn'].shift(1)
df_turnstile_master['EXITSn_hourly'] = df_turnstile_master['EXITSn_hourly'].fillna(0)
    
df_turnstile_master.head(3)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn,ENTRIESn_hourly,EXITSn_hourly
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/02/2015,00:00:00,REGULAR,5117130,1732680,0,0
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/02/2015,04:00:00,REGULAR,5117157,1732685,27,5
2,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,05/02/2015,08:00:00,REGULAR,5117176,1732693,19,8


### Dates

Subway = 2011-05-01
MTA = 05-21-11

In [11]:
df_turnstile_master['date'] = pd.to_datetime(df_turnstile_master['DATEn'])
df_turnstile_master.tail()

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn,ENTRIESn_hourly,EXITSn_hourly,date
1343579,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,04/24/2015,05:00:00,REGULAR,5554,202,0,0,2015-04-24
1343580,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,04/24/2015,09:00:00,REGULAR,5554,202,0,0,2015-04-24
1343581,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,04/24/2015,13:00:00,REGULAR,5554,202,0,0,2015-04-24
1343582,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,04/24/2015,17:00:00,REGULAR,5554,204,0,2,2015-04-24
1343583,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,04/24/2015,21:00:00,REGULAR,5554,204,0,0,2015-04-24


### Removing redundant columns

In [12]:
df_turnstile_master = df_turnstile_master.drop(["DATEn", "ENTRIESn", "EXITSn"], axis=1)
df_turnstile_master.head(2)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,TIMEn,DESCn,ENTRIESn_hourly,EXITSn_hourly,date
0,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,00:00:00,REGULAR,0,0,2015-05-02
1,A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,04:00:00,REGULAR,27,5,2015-05-02


In [13]:
df_turnstile_master = df_turnstile_master.set_index('date')

In [14]:
df_turnstile_master.shape

(1343584, 10)

Finally the datais ready; all 1.3 million rows are ready to be analyzed

### Using Both Datasets

I thought about combining both dataframes into a master dataset that would contain both subway and weather information. However after I implemented it, the whole thing proved to be redundant. The weather data contains one observation for the entire day, while the subway data contains data for every 4 hours. Combining them would mean putting a lot of repeated data into each hour for each day. This only offers some ease of access by having everything in one file, but it bloats the data. So instead I'm choosing to keep them separate, and keeping the weather dataframe as a sort of lookup table to find dates corresponding to events (such as fog) and then getting the subway data for these dates.

But first I'll index them both by date to align them and make it easier for lookups.

In [184]:
df_weather = df_weather.set_index('date')
df_turnstile_master = df_turnstile_master.set_index('date')

KeyError: 'date'

In [19]:
filter_fog = df_weather['fog'] == 1
a = df_weather[filter_fog]
a.index


Index([u'2011-05-15', u'2011-05-18', u'2011-05-19', u'2011-05-23', u'2011-05-24'], dtype='object')

In [20]:
df_turnstile['2011-05-21']

Unnamed: 0_level_0,C/A,UNIT,SCP,TIMEn,DESCn,ENTRIESn_hourly,EXITSn_hourly
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2011-05-21,A002,R051,02-00-00,00:00:00,REGULAR,0,0
2011-05-21,A002,R051,02-00-00,04:00:00,REGULAR,24,3
2011-05-21,A002,R051,02-00-00,08:00:00,REGULAR,16,19
2011-05-21,A002,R051,02-00-00,12:00:00,REGULAR,75,79
2011-05-21,A002,R051,02-00-00,16:00:00,REGULAR,187,48
2011-05-21,A002,R051,02-00-00,20:00:00,REGULAR,305,35
2011-05-21,A002,R051,02-00-01,00:00:00,REGULAR,-61468,-436253
2011-05-21,A002,R051,02-00-01,04:00:00,REGULAR,36,6
2011-05-21,A002,R051,02-00-01,08:00:00,REGULAR,15,11
2011-05-21,A002,R051,02-00-01,12:00:00,REGULAR,67,50


In [None]:
df_weather.iloc[0]

In [None]:
df_weather.iloc[-1]

In [None]:
df_turnstile.iloc[0]

In [None]:
df_turnstile.iloc[1]

So I have the turnstile data for the 21st May 2011, and I have the weather data for all of May.

Next step = download more data.

In [1]:
# Remove non-May dates