# Subway Data

## Links

http://experimenting.alastair.is/citibike/ 

https://www.udacity.com/course/viewer#!/c-ud359/l-732399471/e-698029633/m-698029634 

http://chriswhong.com/open-data/visualizing-the-mtas-turnstile-data/ 

https://github.com/chriswhong/nycturnstiles 

http://web.mta.info/developers/index.html 

https://saxenarajat99.wordpress.com/2014/09/21/impact-of-rain-on-nyc-subway-ridership-udacity-course-project/ 

http://spotofdata.com/subway-weather-udacity/ 

http://web.mta.info/developers/turnstile.html

http://www.jasondamiani.com/portfolio/analyzing-mta-subway-data/ Good analysis



# Weather Data

In [1]:
import numpy as np
import matplotlib.pyplot as plt

import pandas
import pandasql

In [2]:
df_weather = pandas.read_csv("data/weather_underground.csv")

In [3]:
df_weather.head(2)

Unnamed: 0,date,maxpressurem,maxdewptm,maxpressurei,maxdewpti,since1julheatingdegreedaysnormal,heatingdegreedaysnormal,since1sepcoolingdegreedaysnormal,hail,since1julsnowfallm,...,precipi,snowfalli,since1jancoolingdegreedaysnormal,precipm,snowfallm,thunder,monthtodateheatingdegreedays,meantempi,maxvism,meantempm
0,2011-05-01,1026,6,30.31,42,4646,8,,0,157.23,...,0,0,13,0,0,0,5,60,16,16
1,2011-05-02,1026,10,30.31,50,4653,7,,0,157.23,...,0,0,14,0,0,0,13,57,16,14


### Rainy Days

In [4]:
q = """
    SELECT COUNT(rain) 
    FROM df_weather
    WHERE rain = 1
    ;
    """

# Execute the SQL command against the pandas frame
rainy_days = pandasql.sqldf(q.lower(), locals())

print(rainy_days)

   count(rain)
0           10


### Foggy Days

In [5]:
q = """
    SELECT fog, maxtempi 
    FROM df_weather
    GROUP BY fog
    ;
    """
    
# Execute the SQL command against the pandas frame
foggy_days = pandasql.sqldf(q.lower(), locals())

print(foggy_days)

   fog  maxtempi
0    0        86
1    1        81


### Weekend mean temperature

In [6]:
q = """
    SELECT  AVG(meantempi)
    FROM df_weather
    WHERE cast(strftime('%w', date) as integer) = 6
    OR cast(strftime('%w', date) as integer) = 0
    ;
    """
    

# Execute the SQL command against the pandas frame
mean_temp_weekends = pandasql.sqldf(q.lower(), locals())

print(mean_temp_weekends)

   avg(meantempi)
0       65.111111


### Mean Minimum Temperature on Rainy Days (Minimum Above 55°F)

In [7]:
q = """
    SELECT AVG(cast(mintempi as integer)) 
    FROM df_weather
    WHERE rain = 1
    AND cast(mintempi as integer) > 55
    ;
    """
    
# Execute the SQL command against the pandas frame
avg_min_temp_rainy = pandasql.sqldf(q.lower(), locals())

print(avg_min_temp_rainy)

   avg(cast(mintempi as integer))
0                           61.25


## Turnstiles Data

### Quick Overview

First, I'll load one file that I downloaded manually from the MTA website at http://web.mta.info/developers/turnstile.html. Once I have a sense of this one subset of the data, I'll combine it with the rest and then analyze it in tandem with the weather data.

In [8]:
filename = "data/turnstile_110528.txt"
df_turnstile = pandas.read_csv(filename)
df_turnstile.head(2)

Unnamed: 0,A002,R051,02-00-00,05-21-11,00:00:00,REGULAR,003169391,001097585,05-21-11.1,04:00:00,...,05-22-11,00:00:00.1,REGULAR.6,003170119,001097792,05-22-11.1,04:00:00.1,REGULAR.7,003170146,001097801
0,A002,R051,02-00-00,05-22-11,08:00:00,REGULAR,3170164,1097820,05-22-11,12:00:00,...,05-23-11,08:00:00,REGULAR,3170746,1098069,05-23-11,12:00:00,REGULAR,3170897,1098378
1,A002,R051,02-00-00,05-23-11,16:00:00,REGULAR,3171194,1098447,05-23-11,20:00:00,...,05-24-11,16:00:00,REGULAR,3172689,1099010,05-24-11,20:00:00,REGULAR,3173590,1099055


In [9]:
df_turnstile.shape

(998, 43)

There are 998 rows and an overwhelming 43 columns in this dataset. The file packs multiple observations into each row, which is why the MTA data is notoriously difficult to work with. It also has no header row, so `read_csv` has used the first line of data as the column names. As adaptable as pandas is, it can't infer this layout and correct it, so I'll have to restructure the file myself.

Looking at the date and time columns, the readings are taken at 4-hour intervals and each file spans one week. Each row holds 8 consecutive readings, so the time difference between the start of one row and the start of the next is 32 hours (8 x 4).

The MTA website labels the data as follows: 'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'. The first 3 columns identify the turnstile. Next comes a block of 5 columns (date, time, type of report, entry count, and exit count), which repeats 8 times! This lines up with the shape of our dataframe: 3 + (5 x 8) = 43. To get one observation per row, these blocks should be chopped from the original data, 5 elements at a time, and each written to the new file as its own row alongside the 3 identification columns.
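The layout described above can be checked on a toy row before touching any files. This is a minimal in-memory sketch, not the notebook's own function: the helper name `split_wide_row` and its parameters are hypothetical, and the toy values are illustrative, modelled on the sample output above.

```python
def split_wide_row(row, id_width=3, block_width=5):
    """Split one wide turnstile row into one narrow row per observation.

    Hypothetical helper: id_width and block_width follow the 3 + (5 x 8)
    layout described in the text.
    """
    ids, rest = row[:id_width], row[id_width:]
    narrow = []
    # Step through the remainder one complete block at a time;
    # an incomplete trailing block is ignored.
    for start in range(0, len(rest) - block_width + 1, block_width):
        narrow.append(ids + rest[start:start + block_width])
    return narrow


# Toy row holding two observation blocks (illustrative values).
wide_row = ["A002", "R051", "02-00-00",
            "05-21-11", "00:00:00", "REGULAR", "003169391", "001097585",
            "05-21-11", "04:00:00", "REGULAR", "003169415", "001097588"]

narrow_rows = split_wide_row(wide_row)
for narrow_row in narrow_rows:
    print(narrow_row)
```

Each narrow row has 8 fields, matching the 8 labels listed by the MTA.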



### Restructuring the data

In [10]:
import csv
import datetime
import os

In [11]:
def fix_turnstile_data(filepath):
    '''
    filepath is the location of an MTA Subway turnstile text file. A link to an example
    MTA Subway turnstile text file can be seen at the URL below:
    http://web.mta.info/developers/data/nyct/turnstile/turnstile_110507.txt
    
    There are numerous data points included in each row of the text file. 

    This function updates each row in the text file so there is only one entry per row.
    A few examples below:
    A002,R051,02-00-00,05-28-11,00:00:00,REGULAR,003178521,001100739
    A002,R051,02-00-00,05-28-11,04:00:00,REGULAR,003178541,001100746
    A002,R051,02-00-00,05-28-11,08:00:00,REGULAR,003178559,001100775
    
    This file is then written into a new related directory.
    '''
    
    # Parse the directory and filename from the input.
    directory, filename = os.path.split(filepath)

    # Prepare the output directory.
    newpath = "updated_{0}".format(directory)
    if not os.path.exists(newpath):
        os.makedirs(newpath)

    # Open the input and output files in a with-block so that both are
    # flushed and closed when the function returns. The output file is
    # overwritten if it exists already (wb).
    outpath = "{0}/{1}".format(newpath, filename)
    with open(filepath, 'rb') as infile, open(outpath, 'wb') as outfile:
        r = csv.reader(infile)
        w = csv.writer(outfile)

        # Write the header row, taken from the MTA website.
        w.writerow(['C/A', 'UNIT', 'SCP', 'DATEn', 'TIMEn', 'DESCn', 'ENTRIESn', 'EXITSn'])

        # Loop through the output from the CSV reader a line at a time.
        for line in r:

            # Parse out the identification elements, and remove them from the row.
            ca = line.pop(0)
            unit = line.pop(0)
            scp = line.pop(0)

            # While there is still new data, parse it.
            while len(line) >= 5:

                # Take the first 5 elements and remove them.
                block, line = line[:5], line[5:]

                # Output the new row.
                w.writerow([ca, unit, scp] + block)

In [12]:
fix_turnstile_data(filename)

In [13]:
updated_filename = "updated_data/turnstile_110528.txt"
df_turnstile = pandas.read_csv(updated_filename)
df_turnstile.head(2)

Unnamed: 0,C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn
0,A002,R051,02-00-00,05-21-11,00:00:00,REGULAR,3169391,1097585
1,A002,R051,02-00-00,05-21-11,04:00:00,REGULAR,3169415,1097588


In [14]:
df_turnstile.shape

(7343, 8)

### Combining Turnstile Data

### Tallying the counts

The original data set gives running totals for each turnstile rather than the number of entries or exits in each interval, so to get anything useful out of it I'll need to do some subtraction: subtract the previous reading from the current one. Readings are 4 hours apart, so despite the `_hourly` suffix each difference covers a 4-hour interval.

In [15]:
df_turnstile['ENTRIESn_hourly'] = df_turnstile['ENTRIESn'] - df_turnstile['ENTRIESn'].shift(1)
df_turnstile['ENTRIESn_hourly'] = df_turnstile['ENTRIESn_hourly'].fillna(0)

df_turnstile['EXITSn_hourly'] = df_turnstile['EXITSn'] - df_turnstile['EXITSn'].shift(1)
df_turnstile['EXITSn_hourly'] = df_turnstile['EXITSn_hourly'].fillna(0)
    
df_turnstile.head(3)

Unnamed: 0,C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn,ENTRIESn_hourly,EXITSn_hourly
0,A002,R051,02-00-00,05-21-11,00:00:00,REGULAR,3169391,1097585,0,0
1,A002,R051,02-00-00,05-21-11,04:00:00,REGULAR,3169415,1097588,24,3
2,A002,R051,02-00-00,05-21-11,08:00:00,REGULAR,3169431,1097607,16,19
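One caveat with differencing the whole column at once: the first reading of each turnstile is subtracted from the last reading of the previous turnstile, which can produce large spurious values (often negative) at every turnstile boundary. A possible way around this, sketched here on toy data with illustrative counter values, is to take the difference within each (`C/A`, `UNIT`, `SCP`) group. This is a suggested variant, not what the cell above does.

```python
import pandas

# Toy running totals for two turnstiles (SCP 02-00-00 and 02-00-01).
# The counter values are illustrative.
toy = pandas.DataFrame({
    "C/A": ["A002"] * 4,
    "UNIT": ["R051"] * 4,
    "SCP": ["02-00-00", "02-00-00", "02-00-01", "02-00-01"],
    "ENTRIESn": [3169391, 3169415, 3107923, 3107959],
    "EXITSn": [1097585, 1097588, 661332, 661338],
})

# A turnstile is identified by these three columns together.
turnstile_keys = ["C/A", "UNIT", "SCP"]
grouped = toy.groupby(turnstile_keys)

# diff() inside each group never subtracts one turnstile's counter from
# another's; the first reading of each turnstile becomes NaN, then 0.
toy["ENTRIESn_hourly"] = grouped["ENTRIESn"].diff().fillna(0)
toy["EXITSn_hourly"] = grouped["EXITSn"].diff().fillna(0)

print(toy[["SCP", "ENTRIESn_hourly", "EXITSn_hourly"]])
```

With this grouping the first row of the second turnstile gets 0 instead of the difference between two unrelated counters.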


### Dates

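The two datasets format dates differently: the weather data uses `2011-05-01` (year-month-day), while the MTA turnstile data uses `05-21-11` (month-day-year). Converting the MTA dates to datetimes puts both on a common footing.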

In [16]:
df_turnstile['date'] = pandas.to_datetime(df_turnstile['DATEn'])
df_turnstile.tail()

Unnamed: 0,C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn,ENTRIESn_hourly,EXITSn_hourly,date
7338,A042,R086,01-00-04,05-27-11,04:00:00,REGULAR,915737,4179675,13,28,2011-05-27
7339,A042,R086,01-00-04,05-27-11,08:00:00,REGULAR,915743,4179820,6,145,2011-05-27
7340,A042,R086,01-00-04,05-27-11,12:00:00,REGULAR,915806,4180746,63,926,2011-05-27
7341,A042,R086,01-00-04,05-27-11,16:00:00,REGULAR,916090,4181746,284,1000,2011-05-27
7342,A042,R086,01-00-04,05-27-11,20:00:00,REGULAR,916517,4182631,427,885,2011-05-27
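Letting `to_datetime` guess the format works here, but a string such as `05-06-11` is ambiguous between month-first and day-first. A minimal sketch, assuming the MTA dates are always month-day-year with a two-digit year, passes the format explicitly and renders the result in the weather data's style:

```python
import pandas

# Two MTA-style date strings taken from the output above.
mta_dates = pandas.Series(["05-21-11", "05-27-11"])

# An explicit format removes any month/day guessing.
parsed = pandas.to_datetime(mta_dates, format="%m-%d-%y")

# Render in the same year-month-day style as the weather data.
weather_style = parsed.dt.strftime("%Y-%m-%d")
print(weather_style.tolist())
```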


### Removing redundant columns

In [17]:
df_turnstile = df_turnstile.drop(["DATEn", "ENTRIESn", "EXITSn"], axis=1)
df_turnstile.head(2)

Unnamed: 0,C/A,UNIT,SCP,TIMEn,DESCn,ENTRIESn_hourly,EXITSn_hourly,date
0,A002,R051,02-00-00,00:00:00,REGULAR,0,0,2011-05-21
1,A002,R051,02-00-00,04:00:00,REGULAR,24,3,2011-05-21


### Using Both Datasets

I thought about combining both dataframes into a master dataset that would contain both subway and weather information. However, after I implemented it, the whole thing proved to be redundant. The weather data contains one observation per day, while the subway data contains one observation every 4 hours. Combining them would mean repeating the same weather values on every turnstile row for that day. That offers some ease of access by having everything in one file, but it bloats the data. So instead I'm keeping them separate and using the weather dataframe as a sort of lookup table: find the dates corresponding to an event (such as fog), then pull the subway data for those dates.

But first I'll index them both by date for easier lookups.

In [18]:
df_weather = df_weather.set_index('date')
df_turnstile = df_turnstile.set_index('date')

In [19]:
filter_fog = df_weather['fog'] == 1
a = df_weather[filter_fog]
a.index


Index([u'2011-05-15', u'2011-05-18', u'2011-05-19', u'2011-05-23', u'2011-05-24'], dtype='object')

In [20]:
df_turnstile.loc['2011-05-21']

date,C/A,UNIT,SCP,TIMEn,DESCn,ENTRIESn_hourly,EXITSn_hourly
2011-05-21,A002,R051,02-00-00,00:00:00,REGULAR,0,0
2011-05-21,A002,R051,02-00-00,04:00:00,REGULAR,24,3
2011-05-21,A002,R051,02-00-00,08:00:00,REGULAR,16,19
2011-05-21,A002,R051,02-00-00,12:00:00,REGULAR,75,79
2011-05-21,A002,R051,02-00-00,16:00:00,REGULAR,187,48
2011-05-21,A002,R051,02-00-00,20:00:00,REGULAR,305,35
2011-05-21,A002,R051,02-00-01,00:00:00,REGULAR,-61468,-436253
2011-05-21,A002,R051,02-00-01,04:00:00,REGULAR,36,6
2011-05-21,A002,R051,02-00-01,08:00:00,REGULAR,15,11
2011-05-21,A002,R051,02-00-01,12:00:00,REGULAR,67,50
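The lookup-table idea can be sketched end to end on toy frames. One thing to check first is that both indexes hold the same type: in the cells above the weather index is made of strings while the turnstile index is made of datetimes, so this sketch converts both to datetimes. The frames and counts below are illustrative, not real data.

```python
import pandas

# Toy weather frame: one row per day, indexed by date.
weather = pandas.DataFrame(
    {"fog": [0, 1, 1]},
    index=pandas.to_datetime(["2011-05-21", "2011-05-23", "2011-05-24"]),
)

# Toy turnstile frame: several rows per day, indexed by date.
turnstile = pandas.DataFrame(
    {"ENTRIESn_hourly": [24, 16, 75, 187]},
    index=pandas.to_datetime(
        ["2011-05-21", "2011-05-23", "2011-05-23", "2011-05-24"]
    ),
)

# Step 1: look up the dates on which the event occurred.
foggy_dates = weather[weather["fog"] == 1].index

# Step 2: keep only the turnstile rows that fall on those dates.
foggy_turnstile = turnstile[turnstile.index.isin(foggy_dates)]
print(foggy_turnstile)
```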


In [None]:
df_weather.iloc[0]

In [None]:
df_weather.iloc[-1]

In [None]:
df_turnstile.iloc[0]

In [None]:
df_turnstile.iloc[1]

So I have the turnstile data for the week of 21-27 May 2011, and I have the weather data for all of May.

The next step is to download more turnstile data.

In [22]:
# Check Dates?

### Downloading the data

In [24]:
import requests
from bs4 import BeautifulSoup as bs
import urllib2

In [25]:
# Work around for byte-ASCII error in bs.
import sys  

reload(sys)  
sys.setdefaultencoding('utf8')

In [26]:
# URL of MTA data.
URL = 'http://web.mta.info/developers/turnstile.html'

# Parse the HTML using bs to find the hyperlinks.
# Naming the parser explicitly keeps the result consistent across machines.
r = requests.get(URL)
soup = bs(r.text, 'html.parser')
hyperlinks = soup.find_all('a')

# Initialize array to hold all of the URLs on the page.
urls = []

# Loop through the hyperlinks, parsing just the links themselves.
for each in hyperlinks:
    link = each.get('href')
    
    # Only add links which are in the data directory, and are text files.
    if link and link.endswith('.txt') and link.startswith('data/'):
        urls.append(link)

In [27]:
# Initialize array to hold all of the dates from the URLs collected.
dates = []

# Split the URLs up in order to get just the dates.
for each in urls:
    filename = each.split('/')[-1]
    index = filename.find('turnstile_')
    date = filename[index+len('turnstile_'):].split('.')[0]
    dates.append(date)

In [28]:
# Initialize a set to store the dates to download.
dates_to_download = set()

# Loop through the dates, and take the appropriate ones.
for i, date in enumerate(dates):
    # Only May 2015 (dates are formatted YYMMDD).
    if date[:2] == '15' and date[2:4] == '05':
        # Take the file itself plus its neighbours on either side, which adds
        # the last week of April and the first week of June. Slicing avoids
        # an IndexError at the end of the list and wrap-around at the start.
        dates_to_download.update(dates[max(i - 1, 0):i + 2])

# Convert the set to a list.
dates_to_download = list(dates_to_download)

In [45]:
dates_to_download

['150509', '150530', '150502', '150606', '150516', '150523', '150425']
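Picking neighbours by list position depends on the order in which the links appear on the page. An alternative sketch parses each `YYMMDD` string into a date and keeps those inside a padded range. The function name `select_dates`, the `pad_days` parameter, and the sample list (including `150613` and `150418`) are assumptions for illustration.

```python
from datetime import datetime, timedelta


def select_dates(dates, start, end, pad_days=7):
    """Return the YYMMDD strings between start and end, padded on both sides.

    Hypothetical helper: pad_days=7 adds one week before and after.
    """
    lo = start - timedelta(days=pad_days)
    hi = end + timedelta(days=pad_days)
    selected = []
    for text in dates:
        day = datetime.strptime(text, "%y%m%d")
        if lo <= day <= hi:
            selected.append(text)
    # Within one century, YYMMDD strings sort chronologically.
    return sorted(selected)


# Illustrative list of weekly file dates, newest first.
dates = ["150613", "150606", "150530", "150523", "150516",
         "150509", "150502", "150425", "150418"]

chosen = select_dates(dates, datetime(2015, 5, 1), datetime(2015, 5, 31))
print(chosen)
```

On this sample the selection contains the same seven dates as the list above, in chronological order.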

In [47]:
a = ['150509', '150530', '150502', '150606', '150516', '150523']

In [30]:
for date in dates_to_download:
    download = 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_{0}.txt'.format(date)
    
    #rq = urllib2.Request(download)
    #res = urllib2.urlopen(rq)

#res

<addinfourl at 220956104L whose fp = <socket._fileobject object at 0x000000000D129318>>

In [42]:
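# `date` still holds the last value assigned by the loop above,
# so only that one file is downloaded here.
url_download = 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_{0}.txt'.format(date)
response = urllib2.urlopen(url_download)

with open('data/{0}.txt'.format(date), 'wb') as f:
    f.write(response.read())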

In [43]:
a = pandas.read_csv('data/150425.txt')
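The cells above fetch a single file. A possible extension to every selected date is sketched below, written for Python 3 with `requests` rather than `urllib2`. The helper names `build_download_plan` and `download_all` are hypothetical, and the local naming scheme `turnstile_<date>.txt` is an assumption that differs from the `data/<date>.txt` name used above. Only the plan is built here; `download_all` needs network access and is not called.

```python
import os

BASE_URL = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{0}.txt"


def build_download_plan(dates, directory="data"):
    """Pair each date's source URL with a local path (hypothetical helper)."""
    plan = []
    for date in dates:
        url = BASE_URL.format(date)
        path = os.path.join(directory, "turnstile_{0}.txt".format(date))
        plan.append((url, path))
    return plan


def download_all(plan):
    """Fetch every (url, path) pair. Requires network access and requests."""
    import requests  # imported here so building a plan works without it

    for url, path in plan:
        directory = os.path.dirname(path)
        if directory and not os.path.isdir(directory):
            os.makedirs(directory)
        response = requests.get(url)
        # Stop on HTTP errors instead of saving an error page as data.
        response.raise_for_status()
        with open(path, "wb") as f:
            f.write(response.content)


plan = build_download_plan(["150425", "150502"])
for url, path in plan:
    print(url, "->", path)
```

Calling `download_all(plan)` would then write one file per date.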