<h2>Turnstile Usage for Aug 1, 2013</h2>

The data set consists of entry and exit counters measured every four hours at every turnstile in the stations of the NYC Subway System. 

The format of the data for the 2013 measurements is described here: http://web.mta.info/developers/resources/nyct/turnstile/ts_Field_Description_pre-10-18-2014.txt

In this exercise we compute statistics for entry and exit counts for a single day: August 1st, 2013. The data samples collected for this day can be found here: 
http://web.mta.info/developers/data/nyct/turnstile/turnstile_130803.txt

This exercise computes the following statistics: 

1) Total number of entries registered for 8/1/13
2) Total number of exits registered for 8/1/13
3) The busiest station for 8/1/13

Because of turnstile problems, the statistics are approximate values. Data from turnstile machines that showed problems during the day are excluded from the computations.


<h3> Software Description </h3>
The function <b>filtered_entries_exits_per_day()</b> extracts the relevant records from the raw data with the help of a second function, <b>convert_to_array()</b>. For each turnstile it keeps two readings: the counter values at 00:00 on 8/1 and at 00:00 on 8/2. Records that are not marked 'REGULAR' (i.e. records captured outside the normal cycle) are discarded. <p>
Args: <p>
    <b>target_url</b>: (string) The URL that contains the data for the specified day <p>
    <b>day </b>: (string) The target day for analysis, in the raw-data format MM-DD-YY. Example: 08-01-13 <p>
    <b>num_lines</b>: (int) The number of lines to process. A value of -1 indicates all lines should be processed <p>
Returns: <p>
    An array of lists. Each list represents a measurement for a turnstile for the specified day at either time = 0 (00:00 on 8/1) or at time = 24 (00:00 on 8/2). Each list has the following ordered elements: <p>
       
      ca (string), unit (string), scp (string), day (integer), time (either 0 or 24), entries (int), exits (int)
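
To make the record layout concrete, the sketch below splits one raw line into its turnstile key and its five-field readings, then keeps only the 'REGULAR' readings taken at 00:00:00. It is a simplified illustration, not the notebook's own function: the helper name `midnight_readings`, its `next_day` argument, and the sample line (with invented counter values) are assumptions made for this example, and the day field is omitted for brevity.

```python
def midnight_readings(line, day, next_day):
    """Return [ca, unit, scp, hour, entries, exits] for REGULAR 00:00:00 readings."""
    values = line.strip().split(",")
    ca, unit, scp = values[0], values[1], values[2]
    readings = []
    # After the 3-field key, each reading occupies 5 fields:
    # date, time, description, entries, exits.
    for i in range(3, len(values) - 4, 5):
        date, time, desc, entries, exits = values[i:i + 5]
        if time != "00:00:00" or desc != "REGULAR":
            continue
        if date == day:
            readings.append([ca, unit, scp, 0, int(entries), int(exits)])
        elif date == next_day:
            readings.append([ca, unit, scp, 24, int(entries), int(exits)])
    return readings


# Hypothetical sample line; the counter values are invented for illustration.
sample = ("A002,R051,02-00-00,"
          "08-01-13,00:00:00,REGULAR,1000,500,"
          "08-01-13,04:00:00,REGULAR,1010,505,"
          "08-02-13,00:00:00,REGULAR,1300,720")
result = midnight_readings(sample, "08-01-13", "08-02-13")
```

Only the first and last readings survive the filter, because the 04:00:00 reading falls outside the two midnight snapshots.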

In [73]:
import urllib2
import pandas as pd

constants = {"url_0803": "http://web.mta.info/developers/data/nyct/turnstile/turnstile_130803.txt"}

def filtered_entries_exits_per_day(target_url, day, num_lines):
    # Stream the raw file; each line holds several readings for one turnstile.
    webdata = urllib2.urlopen(target_url)

    filtered_data = []
    line_counter = 0
    for line in webdata:
        # Stop once num_lines lines have been processed (-1 means no limit).
        if num_lines != -1 and line_counter >= num_lines:
            break

        values = line.strip().split(",")

        # Keep only the REGULAR midnight readings for the target day.
        temp_array = convert_to_array(values, day)
        filtered_data.extend(temp_array)
        line_counter += 1

    print "[Info] Total number of lines: ", line_counter
    return filtered_data

In [74]:
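from datetime import datetime, timedelta

def convert_to_array(values, day):
    reduced = []
    sta = values[0]
    unit = values[1]
    scp = values[2]

    # The closing reading is taken at 00:00 on the day after the target day.
    next_day = (datetime.strptime(day, "%m-%d-%y") + timedelta(days=1)).strftime("%m-%d-%y")

    # After the three key fields, each reading spans five fields:
    # date, time, description, entries, exits.
    for i in range(3, len(values), 5):
        vec = values[i:i+5]
        if len(vec) < 5:
            continue  # skip an incomplete trailing group

        # The day field (31) is hard-coded; the next field is the hour (0 or 24).
        if vec[0] == day and vec[1] == "00:00:00" and vec[2] == 'REGULAR':
            filtered_data = [sta, unit, scp, 31, 0, int(vec[3]), int(vec[4])]
            reduced.append(filtered_data)

        if vec[0] == next_day and vec[1] == "00:00:00" and vec[2] == 'REGULAR':
            filtered_data = [sta, unit, scp, 31, 24, int(vec[3]), int(vec[4])]
            reduced.append(filtered_data)
    return reduced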

In [75]:
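def test_consecutive_pair(datalist):
    # Report records whose successor belongs to a different turnstile
    # (ca, unit, scp), i.e. records without a matching partner reading.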
    i = 0
    unpaired = False

    while i < len(datalist) - 1:

        j = i + 1
        if datalist[i][0] != datalist[j][0]:
            # print "line pair has different sta:", i, j
            unpaired = True
        if datalist[i][1] != datalist[j][1]:
            # print "line pair has different unit:", i, j
            unpaired = True
        if datalist[i][2] != datalist[j][2]:
            # print "line pair has different scp:", i, j
            unpaired = True

        if unpaired:
            print "[Note] Unpaired entry. Index: " + str(i) + "  data: " + str(datalist[i])
            unpaired = False
            i += 1

        else:
            i = i + 2

The <b>join_consecutive_pair()</b> function merges each turnstile record at time = 0 with the corresponding record at time = 24, producing a single record from the two. <p>

Some turnstile machines have a record at time = 0 but no record at time = 24. Others have the opposite problem. This function excludes such machines from the data. <b>Note: </b> In the future we may re-insert these machines, especially if they have measurements available slightly later than t = 0 or slightly earlier than t = 24. <p>

Args: <p>
    datalist: (list of lists) The data extracted using the <b>filtered_entries_exits_per_day()</b> function <p>
Returns: <p>
    An array of lists where each list holds the measurements of a single turnstile at time t = 0 and t = 24. The format is: <p>
    
  ca (str), unit (str), scp (str), day (int), entries_0 (int), exits_0 (int), entries_24 (int), exits_24 (int)
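
The pairing rule can also be illustrated without relying on record order. The sketch below is an alternative to the adjacency walk used in this notebook, not a copy of it: it groups records by (ca, unit, scp) and keeps only machines that have both a time = 0 and a time = 24 reading. The helper name `pair_by_key` and the first two sample records are hypothetical; the third record is one of the unpaired items reported in this notebook's output.

```python
def pair_by_key(datalist):
    """Join the t=0 and t=24 records of each turnstile; drop incomplete machines."""
    by_key = {}
    order = []  # remember the first-seen order of the turnstile keys
    for ca, unit, scp, day, hour, entries, exits in datalist:
        key = (ca, unit, scp)
        if key not in by_key:
            by_key[key] = {"day": day}
            order.append(key)
        by_key[key][hour] = (entries, exits)

    joined, excluded = [], []
    for key in order:
        rec = by_key[key]
        if 0 in rec and 24 in rec:
            # ca, unit, scp, day, entries_0, exits_0, entries_24, exits_24
            joined.append(list(key) + [rec["day"]] + list(rec[0]) + list(rec[24]))
        else:
            excluded.append(key)
    return joined, excluded


records = [
    ["A002", "R051", "02-00-00", 31, 0, 1000, 500],        # hypothetical
    ["A002", "R051", "02-00-00", 31, 24, 1300, 720],       # hypothetical
    ["A039", "R085", "01-00-02", 31, 24, 958191, 669993],  # no t=0 record
]
joined, excluded = pair_by_key(records)
```

Grouping by key trades a little memory for independence from the order of the input, whereas the adjacency walk assumes the two readings of a machine are always neighbors.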
    

In [76]:
def join_consecutive_pair(datalist):
    i = 0
    unpaired = False

    joined_list = []

    while i < len(datalist) - 1:

        j = i + 1
        if datalist[i][0] != datalist[j][0]:
            # print "line pair has different sta:", i, j
            unpaired = True
        if datalist[i][1] != datalist[j][1]:
            # print "line pair has different unit:", i, j
            unpaired = True
        if datalist[i][2] != datalist[j][2]:
            # print "line pair has different scp:", i, j
            unpaired = True

        if unpaired:
            print "[Note] Excluding unpaired item... Index: " + str(i) + "  data: " + str(datalist[i])
            unpaired = False
            i += 1

        else:
            vec = [datalist[i][0], datalist[i][1], datalist[i][2], datalist[i][3],
                   datalist[i][5], datalist[i][6], datalist[j][5], datalist[j][6]]
            joined_list.append(vec)
            i = i + 2

    return joined_list

Execute the <b>filtered_entries_exits_per_day()</b> function. Remember that num_lines controls the number of lines to process from the raw-data file (URL). A value of -1 forces processing of all lines. 

In [77]:
parsed_res = filtered_entries_exits_per_day(constants["url_0803"], day="08-01-13", num_lines=-1)

print "[Info] Number of lines with info about the selected day: ", len(parsed_res)

[Info] Total number of lines:  29427
[Info] Number of lines with info about the selected day:  4725


Execute the <b>join_consecutive_pair()</b> function. Remember that this function excludes machines for which we do not have a complete pair of measurements at time = 0 and time = 24. Running this function displays the list of excluded turnstile machines. <p>

After running the function, the data, which existed as an array of lists, is converted into a pandas dataframe for analysis. In addition to the original data, we compute the following information (added as columns to the dataframe): 

<ul>
<li> Column 'ent_diff' contains the difference of entries between time = 24 and time = 0. This number represents the actual entries during a single day.
<li> Column 'exit_diff' contains the difference of exit counts between time = 24 and time = 0. This number represents the actual exits during a single day.
<li> Column 'busy' contains the sum of 'ent_diff' and 'exit_diff'. It measures how busy the turnstile machine was on a given day. 

</ul>
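
As a small worked example of these three columns, the sketch below applies the same arithmetic to a single joined record in plain Python. The counter values and the function name `daily_usage` are made up for illustration.

```python
def daily_usage(ent_init, exit_init, ent_final, exit_final):
    """Compute the per-day differences and the combined 'busy' score."""
    ent_diff = ent_final - ent_init      # entries registered during the day
    exit_diff = exit_final - exit_init   # exits registered during the day
    return {"ent_diff": ent_diff, "exit_diff": exit_diff, "busy": ent_diff + exit_diff}


# Hypothetical counter values.
usage = daily_usage(ent_init=1000, exit_init=500, ent_final=1300, exit_final=720)
# usage == {"ent_diff": 300, "exit_diff": 220, "busy": 520}
```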



In [78]:
dataout = join_consecutive_pair(parsed_res)

pdf = pd.DataFrame(dataout, columns=['ca', 'unit', 'scp', 'day', 'ent_init', 'exit_init', 'ent_final', 'exit_final'])

pdf["ent_diff"] = pdf["ent_final"] - pdf["ent_init"]
pdf["exit_diff"] = pdf["exit_final"] - pdf["exit_init"]
pdf["busy"] = pdf["ent_diff"] + pdf["exit_diff"]

[Note] Excluding unpaired item... Index: 204  data: ['A039', 'R085', '01-00-02', 31, 24, 958191, 669993]
[Note] Excluding unpaired item... Index: 741  data: ['E004', 'R234', '00-00-00', 31, 0, 3639800, 3479320]
[Note] Excluding unpaired item... Index: 742  data: ['E004', 'R234', '00-00-01', 31, 0, 3991477, 4088425]
[Note] Excluding unpaired item... Index: 743  data: ['E004', 'R234', '00-00-02', 31, 0, 2027227, 2095590]
[Note] Excluding unpaired item... Index: 834  data: ['H008', 'R248', '01-00-00', 31, 0, 343928, 4846118]
[Note] Excluding unpaired item... Index: 1551  data: ['N091', 'R029', '02-00-00', 31, 24, 2829184, 837727]
[Note] Excluding unpaired item... Index: 1552  data: ['N091', 'R029', '02-00-01', 31, 24, 4794622, 2900031]
[Note] Excluding unpaired item... Index: 1553  data: ['N091', 'R029', '02-00-02', 31, 24, 5994828, 4800702]
[Note] Excluding unpaired item... Index: 1554  data: ['N091', 'R029', '02-00-03', 31, 24, 4985914, 4981513]
[Note] Excluding unpaired item... Index: 

This block of code checks to see if there are machines whose counters have been reset during the target day. A machine has been reset if the counter at the end of the day is smaller than the counter at the beginning of the day (for entries or for exits). Because we do not know why the machine has been reset (probably due to failure), these machines do not contain clean information about entries and exits. Consequently, the difference data (entries, exits) and the busy-ness index for these machines are set to 0. The machines do not contribute to the aggregated daily statistics. 
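
The reset rule can be written out as a short check on a single record. This is a minimal sketch in plain Python rather than the dataframe code used in the notebook; the function name `zero_if_reset` is hypothetical. The entry counters are those reported for machine N504 / R021 / 02-00-04 in this notebook, while the exit counters are placeholders because the notebook does not list them.

```python
def zero_if_reset(ent_init, exit_init, ent_final, exit_final):
    """Return (ent_diff, exit_diff, busy), all zero if either counter went backwards."""
    ent_diff = ent_final - ent_init
    exit_diff = exit_final - exit_init
    if ent_diff < 0 or exit_diff < 0:
        # A counter that decreased during the day indicates a reset,
        # so the machine is left out of the daily totals.
        return 0, 0, 0
    return ent_diff, exit_diff, ent_diff + exit_diff


# Entry counters from the notebook output; exit counters are placeholders.
reset = zero_if_reset(ent_init=5156751, exit_init=0, ent_final=4075, exit_final=0)
```

Here the entry difference is 4075 - 5156751, which is negative, so all three derived values are set to 0.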

In [79]:
idxlist1 = pdf[pdf["ent_diff"] < 0].index
if len(idxlist1) > 0:
    print "\n[Note:] Number of machines whose entry counter has been reset on 8/1/13:", len(idxlist1)
    print "[Note:] These machines cannot be used for computing daily stats."
    print "List of machines with entry reset values:"
    for ind in idxlist1:
        print pdf.loc[[ind], ["ca", "unit", "scp", "ent_init", "ent_final"]]
        # Zero the derived values so this machine does not affect the daily totals.
        pdf.loc[ind, ["ent_diff", "exit_diff", "busy"]] = 0

idxlist2 = pdf[pdf["exit_diff"] < 0].index
if len(idxlist2) > 0:
    print "\n[Note:] Number of machines whose exit counter has been reset on 8/1/13:", len(idxlist2)
    print "[Note:] These machines cannot be used for computing daily stats. "
    print "List of machines with exit reset values:"
    for ind in idxlist2:
        print pdf.loc[[ind], ["ca", "unit", "scp", "exit_init", "exit_final"]]
        # Apply the same zeroing as for entry resets.
        pdf.loc[ind, ["ent_diff", "exit_diff", "busy"]] = 0


[Note:] Number of machines whose entry counter has been reset on 8/1/13: 1
[Note:] These machines cannot be used for computing daily stats.
List of machines with entry reset values:
        ca  unit       scp  ent_init  ent_final
1225  N504  R021  02-00-04   5156751       4075


<h4>Results: Total number of entries and exits registered on 8/1/13</h4>

In [80]:
print "[Result] Sum of all entries registerd for all machines on 8/1/13: ", sum(pdf.ent_diff)
print "[Result] sum of all exits registered for all machines on 8/1/13: ", sum(pdf.exit_diff)

[Result] Sum of all entries registerd for all machines on 8/1/13:  3185978
[Result] sum of all exits registered for all machines on 8/1/13:  2518992


<h4> Results: Busiest station on 8/1/13</h4>

In [81]:
grouped = pdf.groupby("ca")["busy"].sum()

print "[Result] The busiest station on 8/1/13 is ", grouped.idxmax()
print "[Result] The busy score for this station is ", grouped[grouped.idxmax()]

[Result] The busiest station on 8/1/13 is  R238
[Result] The busy score for this station is  130881
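
The grouping step amounts to summing the 'busy' values per 'ca' and picking the largest total. The sketch below shows the same idea without pandas; the function name `busiest_group` is hypothetical and the busy values are invented, so the totals bear no relation to the results reported in this notebook.

```python
from collections import defaultdict


def busiest_group(rows):
    """Sum the busy score per 'ca' and return the (ca, total) with the largest total."""
    totals = defaultdict(int)
    for ca, busy in rows:
        totals[ca] += busy
    best = max(totals, key=totals.get)
    return best, totals[best]


# Hypothetical (ca, busy) pairs, one per turnstile.
rows = [("A002", 520), ("A002", 480), ("R238", 900), ("R238", 300), ("N063A", 1100)]
best, score = busiest_group(rows)
```

In this made-up data the single busiest turnstile belongs to N063A, yet R238 has the largest total, which shows why the busiest group and the busiest turnstile need not coincide.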


<h4>Results: Busiest turnstile</h4>

In [82]:
max_ind = pdf.busy.idxmax()
resultvals = pdf.loc[max_ind, ["ca", "unit", "scp"]]

print "[Result] Busiest turnstile: "
print "         ca: %s   unit: %s   scp: %s" %(resultvals["ca"], resultvals["unit"], resultvals["scp"])

[Result] Busiest turnstile: 
         ca: N063A   unit: R011   scp: 00-00-00
