<h2>Turnstile Usage for July 2013</h2>

The data set consists of cumulative entry and exit counters recorded every four hours at every turnstile in the stations of the NYC Subway System. <p>

The format of the data for the 2013 measurements is described here: http://web.mta.info/developers/resources/nyct/turnstile/ts_Field_Description_pre-10-18-2014.txt <p>

In this exercise we define a "turnstile busy-ness" score as the sum of entries and exits for one turnstile over a given period of time. The "station busy-ness" score is the sum of the scores of all the turnstile machines in the station. <p>
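
As a minimal sketch of these two definitions, the snippet below computes both scores for a few made-up machines. The `busy_score` and `station_score` helpers and all numbers are illustrative assumptions, not part of the code used later in this document.

```python
def busy_score(entries, exits):
    """Turnstile busy-ness: entries plus exits over one period."""
    return entries + exits


def station_score(turnstiles):
    """Station busy-ness: sum of the scores of its turnstiles.

    `turnstiles` is a list of (entries, exits) pairs (hypothetical data).
    """
    return sum(busy_score(ent, ex) for ent, ex in turnstiles)


# Three made-up turnstiles belonging to one station.
example_station = [(120, 95), (300, 410), (0, 12)]
example_total = station_score(example_station)  # 215 + 710 + 12 = 937
```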

This document computes the busiest and least-busy stations for July 2013. 

The following link provides the list of web documents containing the data: http://web.mta.info/developers/turnstile.html

<h4>Software Description</h4>
The <b>filtered_dataframe()</b> function takes a URL pointing to a raw data file and builds a pandas dataframe with the entry and exit counter values of all turnstile machines for a particular day and time. This function has the following arguments: <p>

Args: <p>
<b>target_url</b>: (str) URL pointing to one of the raw data files<p>
<b>day</b>: (str) The target day for which we investigate turnstile data. Format: MM-DD-YY<p>
<b>time</b>: (str) The target time for which we investigate turnstile data. Format: HH:00:00<p>
<b>num_lines</b>: (int) The number of lines that the function scans in the raw-data file. A value of -1 indicates all lines. <p>
Returns:<p>
A pandas dataframe whose columns are identified by the following strings: <p>
"ca", "unit", "scp", "month", "day", "year", "hour", "ent", "exit"<p>

Columns "ca", "unit" and "scp" hold string values identifying, respectively, the station, the unit within the station, and the turnstile machine. Columns "month", "day", "year" and "hour" hold integers identifying the target day and time of the reading. Finally, columns "ent" and "exit" hold integers giving the entry and exit counter values at the specified day and time. 


In [1]:
import urllib2
import pandas as pd

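# Weekly raw-data files. Note that the "url_0703" key points at the file
# dated 130706, which is the file used for the 07-01-13 readings below.
constants = {"url_0803": "http://web.mta.info/developers/data/nyct/turnstile/turnstile_130803.txt",
             "url_0703": "http://web.mta.info/developers/data/nyct/turnstile/turnstile_130706.txt"}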


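def filtered_dataframe(target_url, day, time, num_lines):
    """Return a dataframe of REGULAR readings taken at `day` and `time`."""
    webdata = urllib2.urlopen(target_url)

    filtered_data = []
    line_counter = 0
    for line in webdata:
        # Stop before processing once num_lines lines have been scanned
        # (-1 means no limit).
        if num_lines != -1 and line_counter >= num_lines:
            break
        line_counter += 1

        # Each raw line is a comma-separated record for one turnstile.
        values = line.strip().split(",")

        # Keep only the readings that match the target day and time.
        filtered_data.extend(convert_to_array2(values, day, time))

    webdata.close()

    print "[Info] Total number of scanned lines: ", line_counter

    return pd.DataFrame(filtered_data, columns=["ca", "unit", "scp", "month", "day", "year", "hour", "ent", "exit"])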
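
The line cap described for <b>num_lines</b> can also be expressed with `itertools.islice`, which stops after exactly that many lines. The sketch below is an illustration only: the `split_lines` name is hypothetical, and it accepts any iterable of lines so that it can be tried without a network connection. It decodes byte strings first, because under Python 3 a response opened with `urllib.request.urlopen` yields bytes rather than text.

```python
from itertools import islice


def split_lines(lines, num_lines=-1):
    """Yield the comma-separated fields of at most num_lines lines.

    A num_lines of -1 means "no limit", matching the convention above.
    """
    limit = None if num_lines == -1 else num_lines
    for line in islice(lines, limit):
        if isinstance(line, bytes):  # e.g. raw bytes from a URL response
            line = line.decode("utf-8")
        yield line.strip().split(",")


# Made-up input: two text lines and one bytes line.
sample = ["A002,R051,02-00-00\n", "A002,R051,02-00-01\n", b"A006,R079,00-00-00\n"]
first_two = list(split_lines(sample, num_lines=2))
everything = list(split_lines(sample))
```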

The <b>convert_to_array2()</b> function is a helper invoked from the <b>filtered_dataframe()</b> function. It searches one record of the raw-data file for turnstile readings taken at the target day and time and returns them as a list of lists. Each inner list contains a single measurement for a turnstile machine. 

In [2]:
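def convert_to_array2(values, day, time):
    """Return the readings in one raw record that match `day` and `time`."""
    reduced = []
    if len(values) < 3:
        # Blank or malformed line: nothing to extract.
        return reduced

    # The first three fields identify the machine.
    sta = values[0]
    unit = values[1]
    scp = values[2]

    # The remaining fields come in groups of five:
    # date, time, description, entries, exits.
    for i in range(3, len(values), 5):
        vec = values[i:i+5]
        if len(vec) < 5:
            # Skip a truncated trailing group.
            continue

        if vec[0] == day and vec[1] == time and vec[2] == 'REGULAR':
            monval, dayval, yearval = [int(val) for val in day.split("-")]
            timeval = int(time.split(":")[0])
            filtered_data = [sta, unit, scp, monval, dayval, yearval, timeval, int(vec[3]), int(vec[4])]
            reduced.append(filtered_data)

    return reduced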
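
To see the five-field stride at work, the sketch below applies the same chunking to one made-up raw line. It assumes the layout implied by the helper above, in which each line holds three identifier fields followed by repeated groups of date, time, description, entries, and exits. The `readings` helper and the sample values are hypothetical.

```python
def readings(values):
    """Split one raw record into (ids, list of complete 5-field groups)."""
    ids = values[:3]
    groups = [values[i:i + 5] for i in range(3, len(values), 5)]
    # Keep only complete groups: date, time, description, entries, exits.
    return ids, [g for g in groups if len(g) == 5]


# One made-up line with two readings for the same machine.
raw = ("A002,R051,02-00-00,"
       "07-01-13,00:00:00,REGULAR,004174592,001433672,"
       "07-01-13,04:00:00,REGULAR,004174610,001433680")
ids, groups = readings(raw.strip().split(","))

# Select the midnight reading, using the same three-way test as the helper.
midnight = [g for g in groups
            if g[0] == "07-01-13" and g[1] == "00:00:00" and g[2] == "REGULAR"]
entries, exits = int(midnight[0][3]), int(midnight[0][4])
```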

The following process invokes the <b>filtered_dataframe()</b> function twice. The first call requests all data for 7/1/13 at time 00:00. The second call requests all data for 8/1/13 at time 00:00. Then we use the pandas <b>merge()</b> function to join the two data sources. Finally, we calculate the difference in entry counters to measure the actual number of entries during July. Similarly, we calculate the difference in exit counters to measure the actual number of exits during July. We also compute the "busy-ness" metric as the sum of entries and exits. 

In [3]:
nlines = -1

res1 = filtered_dataframe(constants["url_0703"], day="07-01-13", time="00:00:00", num_lines=nlines)

print "[Info] Total data records from the initial data source: ", res1.shape[0]

res2 = filtered_dataframe(constants["url_0803"], day="08-01-13", time="00:00:00", num_lines=nlines)

print "[Info] Total data records from the ending data source: ", res2.shape[0]

# print res1

# print res2
print "[Info] Merging the initial and final data sources using 'ca', 'unit', and 'scp' as the key..."
mdf = pd.merge(res1, res2, on=["ca", "unit", "scp"])

# print "-------------------- MDF -------------------------------------------"
# print mdf

mdf["ent_diff"] = mdf["ent_y"] - mdf["ent_x"]
mdf["exit_diff"] = mdf["exit_y"] - mdf["exit_x"]
mdf["busy"] = mdf["ent_diff"] + mdf["exit_diff"]


[Info] Total number of scanned lines:  30964
[Info] Total data records from the initial data source:  2357
[Info] Total number of scanned lines:  29427
[Info] Total data records from the ending data source:  2360
[Info] Merging the initial and final data sources using 'ca', 'unit', and 'scp' as the key...
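
As a small illustration of this merge-and-subtract step, the sketch below joins two made-up counter snapshots. All values are invented. Because both frames share the `ent` and `exit` column names, pandas appends its default `_x` and `_y` suffixes, and the default inner join keeps only machines present in both snapshots.

```python
import pandas as pd

keys = ["ca", "unit", "scp"]

# Made-up counter snapshots at the start and end of a period.
start = pd.DataFrame(
    [["A001", "R001", "00-00-00", 1000, 500],
     ["A001", "R001", "00-00-01", 2000, 900],
     ["B002", "R002", "00-00-00", 50, 40]],
    columns=keys + ["ent", "exit"])
end = pd.DataFrame(
    [["A001", "R001", "00-00-00", 1300, 650],
     ["A001", "R001", "00-00-01", 2100, 1000]],
    columns=keys + ["ent", "exit"])

# Default merge is an inner join, so the B002 machine (missing from
# `end`) is dropped; shared columns get the _x (left) / _y (right) suffix.
merged = pd.merge(start, end, on=keys)
merged["ent_diff"] = merged["ent_y"] - merged["ent_x"]
merged["exit_diff"] = merged["exit_y"] - merged["exit_x"]
merged["busy"] = merged["ent_diff"] + merged["exit_diff"]
```

This also suggests why the two record counts printed above need not match the size of the merged frame: a machine reported in only one of the two files has no partner row to subtract from.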


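The following code segment determines whether there are machines whose entry counter for 8/1 is smaller than for 7/1. These machines have had their counters reset, which often signals a problem with the machine, so their data is removed from the data set. The program displays the list of machines whose data is being removed. Note that the filter keeps only rows with a strictly positive difference, so whenever a reset is found, machines whose entry counter did not change at all are dropped as well, without being listed.<p>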

After removing data based on reset problems with the entry counter, the code does the same thing for the exit counter. If there are machines whose exit counter has been reset during this month, the code removes their data from the data set. The code prints a list of machines removed due to problems with a reset in their exit counters. 

In [5]:
# Process machines whose entry counts during the interval show negative values
reset1 = mdf[mdf["ent_diff"] < 0]
if len(reset1) > 0:
    print "[Note] Machines that had a reset of the entry counter. Data for these machines will be removed:"
    print reset1[["ca", "unit", "scp", "ent_x", "ent_y"]]
    # Note: "> 0" also drops machines whose entry counter did not change.
    mdf2 = mdf[mdf["ent_diff"] > 0]
else:
    mdf2 = mdf


# Process machines whose exit counts during the interval show negative values
reset2 = mdf2[mdf2["exit_diff"] < 0]
if len(reset2) > 0:
    print "[Note] Remaining machines showing a reset of the exit counter. Data for these machines will be removed:"
    print reset2[["ca", "unit", "scp", "exit_x", "exit_y"]]
    # Note: "> 0" also drops machines whose exit counter did not change.
    mdf3 = mdf2[mdf2["exit_diff"] > 0]
else:
    mdf3 = mdf2

[Note] Machines that had a reset of the entry counter. Data for these machines will be removed:
         ca  unit       scp    ent_x    ent_y
438    H016  R250  00-00-01   587823     1916
444    H019  R294  00-00-01   550398    19089
639    N057  R188  00-03-00  1016029     9931
674   N063A  R011  00-00-09  9908932     1297
703    N070  R012  04-00-01   413508     4231
788    N094  R029  01-00-01  2740810    14315
1002   N324  R018  00-02-00     1719        8
1006   N324  R018  00-03-02  8004575    21481
1153  N408A  R256  00-03-00  2125498    25440
1170   N422  R318  00-05-01        8        0
1730   R210  R044  00-06-00  1259990    18959
1809   R244  R050  00-06-00  4244577  2552875
2026   R512  R092  00-00-02  1799443     8767
2190   R608  R056  00-03-02  4697007     6743
[Note] Remaining machines showing a reset of the exit counter. Data for these machines will be removed:
        ca  unit       scp   exit_x  exit_y
2270  R625  R062  01-06-00  2303327     576
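
One detail of the filter above is worth illustrating: keeping rows with a difference `> 0` drops not only reset machines (negative difference) but also machines whose counter did not change at all. The sketch below shows the distinction on made-up data. The column values are invented, and the `>= 0` variant is shown only as a possible alternative, not as the method used in this document.

```python
import pandas as pd

# Made-up per-machine differences: one reset, one idle, two normal.
df = pd.DataFrame({
    "scp": ["00-00-00", "00-00-01", "00-00-02", "00-00-03"],
    "ent_diff": [-585907, 0, 1200, 34],
})

resets = df[df["ent_diff"] < 0]          # counters that went backwards
strict = df[df["ent_diff"] > 0]          # what a "> 0" filter keeps
non_negative = df[df["ent_diff"] >= 0]   # alternative: keep idle machines too
```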


After excluding data from machines whose entry or exit counters have been reset, we are ready to compute which station was the busiest in July, and which station was the least busy during the same period. We group the data set by station ("ca" column) and sum the values in the "busy" column. <p>

The previous operation gives a list of stations and their aggregated busy scores (in the form of pandas Series). We can now select the maximum and minimum from this series and display the results.

In [7]:
print "--------------------------------------------------------------------"
compdf = mdf3[["ca", "unit", "scp", "ent_diff", "exit_diff", "busy"]]

grouped = compdf.groupby("ca")["busy"].sum()
# print grouped

print "\n[Result] The busiest station in July is ", grouped.idxmax()
print "[Result] The busy score for this station is ", grouped[grouped.idxmax()]

print "\n[Result] The least-busy station in July is ", grouped.idxmin()
print "[Result] The busy score for this station is ", grouped[grouped.idxmin()]

--------------------------------------------------------------------

[Result] The busiest station in July is  R238
[Result] The busy score for this station is  3292989

[Result] The least-busy station in July is  N181A
[Result] The busy score for this station is  265
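
The grouping step can be checked on a toy data set. In the sketch below (all values made up), `groupby(...).sum()` returns a Series indexed by station code, `idxmax()` and `idxmin()` return the index labels of the extremes, and `max()` and `min()` return the corresponding scores directly.

```python
import pandas as pd

# Made-up per-machine busy scores for three stations.
toy = pd.DataFrame({
    "ca": ["A001", "A001", "B002", "C003", "C003"],
    "busy": [450, 200, 90, 700, 25],
})

# One total per station: A001 -> 650, B002 -> 90, C003 -> 725.
totals = toy.groupby("ca")["busy"].sum()

busiest, busiest_score = totals.idxmax(), totals.max()
quietest, quietest_score = totals.idxmin(), totals.min()
```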
