# Repeatedly querying Hubway status

In [1]:
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = (8,6)

import urllib
from lxml import etree
import datetime
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import pandas as pd
import schedule
import time

## Periodically print the most recently updated station

The simplest approach is to reload and reparse the XML file periodically:

In [2]:
def printmostrecent():
    data = etree.parse(urllib.urlopen('http://www.thehubway.com/data/stations/bikeStations.xml'))
    stations = data.findall('station')
    everything = [[elt.text for elt in station.getchildren()] for station in stations]
    df = pd.DataFrame(everything, columns = [elt.tag for elt in data.find('station')]).convert_objects(convert_numeric=True)
    df.set_index('name', inplace=True)
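    # Keep only the row with the newest update timestamp.
    mostrecent = df.sort('latestUpdateTime', ascending=False).head(1)
    recentname = mostrecent.index[0]
    # latestUpdateTime is in milliseconds since the epoch; convert to seconds.
    recenttime = datetime.datetime.fromtimestamp(mostrecent['latestUpdateTime'].iloc[0]/1.e3)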
    print "Latest updated station was {} at {}.".format(recentname, recenttime)
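
The row-building and "newest timestamp" steps can be sketched without a network connection. This is a minimal illustration using only the standard library, not the notebook's pandas pipeline: `parse_stations`, `most_recent`, and `SAMPLE_XML` are hypothetical names, the `<stations>` root element is an assumption about the feed's layout, and the sample values are copied from the `df.head(3)` output further down with most fields omitted.

```python
import datetime
import xml.etree.ElementTree as ET

# Two-station sample in the assumed shape of the feed.
SAMPLE_XML = """<stations>
  <station>
    <name>Colleges of the Fenway</name>
    <nbBikes>8</nbBikes>
    <latestUpdateTime>1443733641110</latestUpdateTime>
  </station>
  <station>
    <name>Tremont St. at Berkeley St.</name>
    <nbBikes>12</nbBikes>
    <latestUpdateTime>1443734210382</latestUpdateTime>
  </station>
</stations>"""


def parse_stations(xml_text):
    """Return one dict per <station>, mapping child tag -> text."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in station}
            for station in root.findall('station')]


def most_recent(stations):
    """Return (name, update time in UTC) of the most recently updated station."""
    newest = max(stations, key=lambda s: int(s['latestUpdateTime']))
    # latestUpdateTime is in milliseconds since the epoch.
    seconds = int(newest['latestUpdateTime']) / 1000.0
    when = datetime.datetime(1970, 1, 1) + datetime.timedelta(seconds=seconds)
    return newest['name'], when


stations = parse_stations(SAMPLE_XML)
name, when = most_recent(stations)
print("Latest updated station was {} at {} UTC.".format(name, when))
```

With the real feed, `parse_stations` would receive the downloaded XML text instead of `SAMPLE_XML`.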

Use the Python `schedule` module to run one of these lookups repeatedly.

In [3]:
def repeatmostrecent(seconds):
    schedule.clear()
    schedule.every(seconds).seconds.do(printmostrecent)
    while True:
        schedule.run_pending()
        time.sleep(1)
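
`repeatmostrecent` runs until the kernel is interrupted. When a fixed number of lookups is enough, a bounded loop may be easier to work with. The sketch below is a plain-loop alternative that does not use `schedule`; `repeat` and its `sleep` parameter are hypothetical names introduced here, and the lambda stands in for a real lookup such as `printmostrecent`.

```python
import time


def repeat(job, interval, times, sleep=time.sleep):
    """Call job() `times` times, sleeping `interval` seconds between calls.

    `sleep` is injectable so the loop can be exercised without waiting.
    Returns the list of job() return values.
    """
    results = []
    for i in range(times):
        if i > 0:
            sleep(interval)
        results.append(job())
    return results


# Stand-in job; recording the requested waits instead of actually sleeping.
waits = []
out = repeat(lambda: 'polled', interval=15, times=3, sleep=waits.append)
print(out)
print(waits)
```

Leaving `sleep` at its default gives real delays, so `repeat(printmostrecent, 15, 4)` would print four lookups over about 45 seconds and then return.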

Similarly, we can count how many stations have updated recently.

In [4]:
def enumerateupdated(minutes):
    data = etree.parse(urllib.urlopen('http://www.thehubway.com/data/stations/bikeStations.xml'))
    stations = data.findall('station')
    everything = [[elt.text for elt in station.getchildren()] for station in stations]
    df = pd.DataFrame(everything, columns = [elt.tag for elt in data.find('station')]).convert_objects(convert_numeric=True)
    df.set_index('name', inplace=True)
    timeago = (time.time() - df['latestUpdateTime']/1e3)
    updated = timeago <= minutes * 60
    numupdated = len(df[updated].index)
    print "In the past {} minutes, {} stations have updated.".format(minutes, numupdated)
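
The recency test is just arithmetic on millisecond timestamps, which can be checked by hand on a few values. In this sketch `count_updated` is a hypothetical helper, the timestamps are the `latestUpdateTime` values from the sample rows shown later in this notebook, and the fixed `now` is an assumption chosen to make the result reproducible.

```python
import time


def count_updated(update_times_ms, minutes, now=None):
    """Count timestamps (ms since epoch) that fall within the last `minutes`."""
    if now is None:
        now = time.time()
    cutoff_seconds = minutes * 60
    return sum(1 for t in update_times_ms if now - t / 1000.0 <= cutoff_seconds)


updates = [1443733641110, 1443734210382, 1443733736775]
# Hypothetical "now", one second after the newest update above.
now = 1443734211.382

print(count_updated(updates, 5, now=now))
print(count_updated(updates, 10, now=now))
```

The other two stations updated roughly 475 s and 570 s before `now`, so they fall outside a 5-minute window but inside a 10-minute one.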

In [5]:
# repeatmostrecent(15)

In [6]:
def repeatupdated(secondsrefresh, minutesback):
    schedule.clear()
    schedule.every(secondsrefresh).seconds.do(enumerateupdated, minutesback)
    while True:
        schedule.run_pending()
        time.sleep(1)

In [7]:
# repeatupdated(15, 5)

A slightly more complicated approach is to compare the reloaded data to the previous data. This lets you figure out, e.g., whether bikes have been checked out or returned.

## Compare data from two retrievals

Convenience function to grab the current station data as a DataFrame.

In [9]:
def getupdate():
    data = etree.parse(urllib.urlopen('http://www.thehubway.com/data/stations/bikeStations.xml'))
    stations = data.findall('station')
    everything = [[elt.text for elt in station.getchildren()] for station in stations]
    df = pd.DataFrame(everything, columns = [elt.tag for elt in data.find('station')]).convert_objects(convert_numeric=True)
    df.set_index('name', inplace=True)
    return df

Manual calculation of the difference:

In [10]:
df = getupdate()
df.head(3)

| name | id | terminalName | lastCommWithServer | lat | long | installed | locked | installDate | removalDate | temporary | public | nbBikes | nbEmptyDocks | latestUpdateTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Colleges of the Fenway | 3 | B32006 | 1443734123763 | 42.340021 | -71.100812 | True | False | 0 | | False | True | 8 | 6 | 1443733641110 |
| Tremont St. at Berkeley St. | 4 | C32000 | 1443734211158 | 42.345392 | -71.069616 | True | False | 0 | | False | True | 12 | 3 | 1443734210382 |
| Northeastern U / North Parking Lot | 5 | B32012 | 1443734129638 | 42.341814 | -71.090179 | True | False | 0 | | False | True | 1 | 13 | 1443733736775 |


In [11]:
def comparedata(new, old):
    bikediff = new['nbBikes'] - old['nbBikes']
    return bikediff

In [12]:
# run this cell a few minutes later...
df2 = getupdate()

In [17]:
diff = comparedata(df2, df)
lostbikes = diff[diff < 0]
gainbikes = diff[diff > 0]
print len(lostbikes)
print len(gainbikes)

17
13


In [21]:
def gainloss(delay):
    df = getupdate()
    time.sleep(delay) # delay in s
    df2 = getupdate()
    diff = comparedata(df2, df)
    lostbikes = diff[diff < 0]
    gainbikes = diff[diff > 0]
    return lostbikes, gainbikes

In [22]:
gainloss(60)

(name
 Cambridge St. at Joy St.                         -1
 Seaport Square - Seaport Blvd. at Boston Wharf   -1
 The Esplanade - Beacon St. at Arlington St.      -1
 MIT at Mass Ave / Amherst St                     -1
 One Broadway / Kendall Sq at Main St / 3rd St    -1
 Brookline Village - Station Street @ MBTA        -1
 Cambridge St - at Columbia St / Webster Ave      -2
 Andrew Station - Dorchester Ave at Humboldt Pl   -1
 Kendall Street                                   -1
 Name: nbBikes, dtype: int64, name
 Colleges of the Fenway                                          1
 Northeastern U / North Parking Lot                              1
 Ruggles Station / Columbus Ave.                                 1
 Aquarium Station - 200 Atlantic Ave.                            2
 Prudential Center / Belvidere                                   1
 Washington St. at Waltham St.                                   1
 TD Garden - Causeway at Portal Park #2                          1
 Central Squa

In [24]:
def monitorgainloss(interval):
    df1 = getupdate()
    while True:
        time.sleep(interval) # interval in s
        df2 = getupdate()
        diff = comparedata(df2, df1)
        lostbikes = diff[diff < 0]
        gainbikes = diff[diff > 0]
        # Note: len() counts stations with a net loss or gain, not individual bikes.
        print "In past {} seconds, there have been {} bikes checked out and {} bikes returned.".format(interval, len(lostbikes), len(gainbikes))
        df1 = df2

In [26]:
monitorgainloss(60)

In past 60 seconds, there have been 10 bikes checked out and 9 bikes returned.
In past 60 seconds, there have been 7 bikes checked out and 7 bikes returned.
In past 60 seconds, there have been 8 bikes checked out and 9 bikes returned.
In past 60 seconds, there have been 10 bikes checked out and 11 bikes returned.
In past 60 seconds, there have been 7 bikes checked out and 12 bikes returned.
In past 60 seconds, there have been 11 bikes checked out and 9 bikes returned.
In past 60 seconds, there have been 5 bikes checked out and 13 bikes returned.
In past 60 seconds, there have been 10 bikes checked out and 12 bikes returned.
In past 60 seconds, there have been 5 bikes checked out and 10 bikes returned.
In past 60 seconds, there have been 6 bikes checked out and 11 bikes returned.
In past 60 seconds, there have been 7 bikes checked out and 8 bikes returned.


KeyboardInterrupt: 

With interval = 30 s, the counts alternate between zero and nonzero values, which points to either a bug in my code or infrequent updates to the XML feed. The pattern disappears with interval = 60 s, which suggests that station status in the XML is refreshed less often than every 30 s but at least every 60 s.
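
A toy model shows why a slowly refreshing feed would produce this pattern. The sketch assumes the feed is regenerated on a fixed 60 s period, which Hubway has not confirmed, and that every regeneration contains some change; `feed_version` and `polls_seeing_change` are hypothetical names.

```python
def feed_version(t, refresh_period):
    """Index of the feed snapshot visible at time t (seconds)."""
    return t // refresh_period


def polls_seeing_change(poll_interval, refresh_period, n_polls):
    """For each poll after the first, report whether the feed changed
    since the previous poll."""
    times = [i * poll_interval for i in range(n_polls + 1)]
    versions = [feed_version(t, refresh_period) for t in times]
    return [new != old for old, new in zip(versions, versions[1:])]


# Assumed refresh period of 60 s.
print(polls_seeing_change(30, 60, 6))  # polling every 30 s
print(polls_seeing_change(60, 60, 6))  # polling every 60 s
```

Under these assumptions, polling every 30 s sees a new snapshot on every second poll only, while polling every 60 s sees one every time, matching the observed behavior.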

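Also, at least around 17:45 on a weekday, there is a lot of traffic in every 60-second interval: roughly 10 stations lose bikes and 10 stations gain bikes each minute. Because `len(lostbikes)` and `len(gainbikes)` count stations with a net change rather than individual bikes, these figures likely underestimate the true flux: several bikes moving at one station count only once, and a checkout and a return at the same station within the same interval cancel out.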
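
A small worked example makes the undercount concrete. The station names and bike counts below are hypothetical, and `summarize` is an illustrative helper that works on plain dicts rather than the DataFrames used above.

```python
def summarize(old, new):
    """Compare two {station: nbBikes} snapshots.

    Returns (stations_lost, stations_gained, bikes_lost, bikes_gained),
    all based on the net change per station.
    """
    diffs = [new[name] - old[name] for name in old if name in new]
    losses = [d for d in diffs if d < 0]
    gains = [d for d in diffs if d > 0]
    return len(losses), len(gains), -sum(losses), sum(gains)


before = {'A': 8, 'B': 12, 'C': 1}
# During the interval: A has two checkouts, B has one checkout and one
# return, and C has one return.
after = {'A': 6, 'B': 12, 'C': 2}

print(summarize(before, after))
```

The true activity is three checkouts and two returns, but the net difference shows one station losing two bikes and one station gaining one. Station B's activity is invisible to a snapshot comparison.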

### Future directions

- Can I map where bikes are coming from and going to? Drawing the map from scratch each time is pretty slow, so it would be nice to precalculate the base map somehow.

- Collect gain and loss data for a full diurnal cycle. I expect to see rush hours, morning inflow/evening outflow, and maybe other patterns (lunch, bars?). This requires saving each `comparedata` return value to an array.
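
One possible way to accumulate results over a long run is to append a timestamped row per comparison and write the rows out as CSV. This is only a sketch: `record`, `write_csv`, and the column names are hypothetical, and `diff` is taken to be a plain dict of station name to net change (with the Series returned by `comparedata`, `diff.to_dict()` would supply that mapping).

```python
import csv
import io
import time


def record(log, diff, timestamp=None):
    """Append one (timestamp, bikes_lost, bikes_gained) row to `log`.

    `diff` maps station name -> net change in nbBikes.
    """
    if timestamp is None:
        timestamp = time.time()
    lost = -sum(d for d in diff.values() if d < 0)
    gained = sum(d for d in diff.values() if d > 0)
    log.append((timestamp, lost, gained))


def write_csv(log, handle):
    """Write the accumulated rows to an open text file handle."""
    writer = csv.writer(handle)
    writer.writerow(['timestamp', 'bikes_lost', 'bikes_gained'])
    writer.writerows(log)


# Hypothetical diffs from two consecutive 60 s comparisons.
log = []
record(log, {'A': -2, 'B': 0, 'C': 1}, timestamp=1443734210)
record(log, {'A': 1, 'B': -1, 'C': 0}, timestamp=1443734270)

buf = io.StringIO()  # stands in for a file opened with open(path, 'w')
write_csv(log, buf)
print(buf.getvalue())
```

Calling `record` inside the `monitorgainloss` loop and writing the log at the end (or periodically) would give one row per interval for later plotting.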