# Data Retrieval

This project draws on several datasets. Some are publicly available as flat files; others I had to construct myself by scraping websites. This notebook walks through retrieving all of them.

In [8]:
import numpy as np
import pandas as pd
from pandas.io.common import get_filepath_or_buffer
import geopandas as gpd
import json
import re
from datetime import datetime
from requests_futures.sessions import FuturesSession
import requests
from bs4 import BeautifulSoup
import dill
from tqdm import tqdm_notebook

### Turnstile and Fare Card Data

In [29]:
def get_dated_links(url, select):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "lxml")
    tags = soup.select(select)
            
    links = [(tag.attrs['href'], datetime.strptime(tag.contents[0], '%A, %B %d, %Y').date()) for tag in tags]
    return links

def filter_by_date(links, cutoff=datetime(2016,1,1).date()):
    return list(filter(lambda x: x[1] >= cutoff, links))

def dl_MTA_data(url, links, out='../data/'):
    # base URL of the index page (everything up to the final '<name>.html')
    basePage = re.match(r'(.+?/)\w+\.html', url).group(1)
    pages = [basePage + link[0] for link in links]

    # fetch up to 3 pages concurrently
    session = FuturesSession(max_workers=3)
    futures = [(session.get(page), page) for page in pages]

    # the output file name is the last path component of each page URL
    s = re.compile(r'.+/(\w+\.\w{3})')
    for future in tqdm_notebook(futures):
        with open(out + s.search(future[1]).group(1), 'w') as f:
            f.write(future[0].result().text)

You might have to inspect the HTML of the pages to make sure the correct files are getting retrieved.
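
One quick offline check is to confirm that the link text parses as a date and that the cutoff filter keeps what you expect. The sketch below mirrors `filter_by_date` on hand-written `(href, date)` pairs. The hrefs and link text are made-up placeholders, not real MTA file names, and `parse_link_date` and `keep_recent` are hypothetical helpers.

```python
from datetime import date, datetime

# Same format string that get_dated_links uses for the link text.
DATE_FORMAT = '%A, %B %d, %Y'

def parse_link_date(text):
    """Parse link text such as 'Saturday, January 02, 2016' into a date."""
    return datetime.strptime(text.strip(), DATE_FORMAT).date()

def keep_recent(links, cutoff=date(2016, 1, 1)):
    """Keep (href, date) pairs dated on or after the cutoff."""
    return [link for link in links if link[1] >= cutoff]

# Placeholder hrefs and link text, for illustration only.
sample = [
    ('example_a.txt', parse_link_date('Saturday, January 02, 2016')),
    ('example_b.txt', parse_link_date('Saturday, December 26, 2015')),
]
recent = keep_recent(sample)
print(recent)
```

If the selector matches a tag whose text is not a date in this format, `strptime` raises a `ValueError`, which is usually the first sign that the page layout has changed.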

In [14]:
# turnstile data
url = 'http://web.mta.info/developers/turnstile.html'

links = filter_by_date(get_dated_links(url, 'div.span-84 a'))
dl_MTA_data(url, links, '../data/')





In [30]:
# farecard data
url = 'http://web.mta.info/developers/fare.html'

links = filter_by_date(get_dated_links(url, 'div.span-19 a'))
dl_MTA_data(url, links, '../data/')




### Census Data

Unfortunately, the API for the American Community Survey is not well documented, and I had a hard time getting it to return the data I wanted. I ultimately went to https://factfinder.census.gov/faces/nav/jsf/pages/download_center.xhtml and requested the 'DISABILITY CHARACTERISTICS' table from the 2015 ACS 5-year survey at the block level for New York, Bronx, Kings, and Queens counties. I joined these four tables into one, which I use in the other notebooks and have included in the `data` directory in this repo.
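
Joining the four county tables amounts to stacking them row-wise. Here is a minimal sketch with `pandas.concat`, using tiny in-memory stand-ins for the downloaded CSVs. The column names (`geo_id`, `total_population`) and all values are hypothetical placeholders, not the actual ACS headers or figures.

```python
import io
import pandas as pd

# In-memory stand-ins for the four county downloads (hypothetical columns and values).
county_csvs = {
    'New York': 'geo_id,total_population\nA1,100\nA2,200\n',
    'Bronx': 'geo_id,total_population\nB1,150\n',
    'Kings': 'geo_id,total_population\nC1,300\n',
    'Queens': 'geo_id,total_population\nD1,250\n',
}

frames = []
for county, text in county_csvs.items():
    # For real files, pass the file path instead of a StringIO buffer.
    frame = pd.read_csv(io.StringIO(text))
    frame['county'] = county  # remember which table each row came from
    frames.append(frame)

# Stack the tables and renumber the index from 0.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```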

### Station Geo Data

In [4]:
# this feature will hopefully be merged into geopandas
def read_geojson_url(url):
    buffer, _, _ = get_filepath_or_buffer(url)
    geojson = json.loads(buffer.read())
    return gpd.GeoDataFrame.from_features(geojson['features'])

In [5]:
url = 'https://data.cityofnewyork.us/resource/kk4q-3rt2.geojson'

read_geojson_url(url).to_pickle('../data/station_geodata.pkd')
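
For reference, `GeoDataFrame.from_features` consumes the `features` list of a GeoJSON FeatureCollection, where each feature pairs a `geometry` with a `properties` dict. The sketch below flattens such a structure into plain rows using only the standard library, which can be handy for eyeballing a response before handing it to geopandas. The sample feature (its property names and coordinates) is invented for illustration, and `features_to_rows` is a hypothetical helper.

```python
import json

def features_to_rows(geojson):
    """Flatten Point features into dicts of properties plus lon/lat."""
    rows = []
    for feature in geojson['features']:
        geometry = feature['geometry']
        if geometry['type'] != 'Point':
            continue  # this sketch only handles point geometries
        lon, lat = geometry['coordinates'][:2]  # GeoJSON order is [lon, lat]
        row = dict(feature['properties'])
        row['lon'] = lon
        row['lat'] = lat
        rows.append(row)
    return rows

# Invented sample document; the real property names may differ.
sample_text = '''
{"type": "FeatureCollection",
 "features": [
   {"type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-73.99, 40.73]},
    "properties": {"name": "Example Station", "line": "X"}}
 ]}
'''
rows = features_to_rows(json.loads(sample_text))
print(rows)
```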

### Platform Accessibility Data
There might be better ways to get this data, but as far as I can tell, no one offers a convenient table of which platforms are accessible, so I scraped it from Wikipedia. Keep in mind that some stations contain both accessible and inaccessible platforms.

In [6]:
def get_wiki_data():
    page = requests.get('https://en.wikipedia.org/wiki/List_of_accessible_New_York_City_Subway_stations')
    soup = BeautifulSoup(page.text, "lxml")
    tables = soup.select('table.wikitable')

    allStations = []
    allLines = []
    for table in tables[:-1]:
        tags = table.select('tr')[1:]
        stations = [re.match(r'([^(]+)\s', tag.select('th')[0].select('a')[0].attrs['title']).group(1) for tag in tags]
        lines = [set([re.match(r'\w+', item.attrs['title']).group() for item in tag.select('th')[1].select('a')]) for tag in tags]
        allStations.extend(stations)
        allLines.extend(lines)
    
    # normalize en dashes in station names to ' - ' (page.text is already decoded)
    allStations = [item.replace('\u2013',' - ') for item in allStations]
    # since our DataFrame contains sets, pickle is more convenient than CSV
    return pd.DataFrame(list(zip(allStations,allLines)),columns=['stations','lines'])

In [7]:
get_wiki_data().to_pickle('../data/platform_accesibility.pkd')
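
To illustrate the cleanup inside `get_wiki_data`, the sketch below applies the title regex from that function and an en-dash normalization to a couple of hand-written link titles, then checks whether a given (station, line) pair is listed. The titles and line sets are made-up examples, and `clean_title` and `is_accessible` are hypothetical helpers, not part of the notebook.

```python
import re

# Same pattern as in get_wiki_data: captures the text up to the last
# whitespace that comes before the first '(' in the title.
TITLE_RE = re.compile(r'([^(]+)\s')

def clean_title(title):
    """Drop the trailing parenthetical and normalize en dashes to ' - '."""
    name = TITLE_RE.match(title).group(1)
    return name.replace('\u2013', ' - ')

def is_accessible(table, station, line):
    """True if `line` is in the set of accessible lines listed for `station`."""
    return line in table.get(station, set())

# Made-up titles and line sets, for illustration only.
raw = [
    ('Example Square\u2013First Street (New York City Subway)', {'A', 'C'}),
    ('Sample Avenue (Some Line)', {'1'}),
]
table = {clean_title(title): lines for title, lines in raw}

print(sorted(table))
print(is_accessible(table, 'Sample Avenue', '1'))
print(is_accessible(table, 'Sample Avenue', '2'))
```

Keeping the lines as a set per station is what lets a station be accessible on one line and not on another.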

### Platform Connectivity Data
I want to know how platforms are connected so I can analyze the network graph.

In [12]:
def get_links(url, select):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "lxml")
    tags = soup.select(select)

    links = [url + tag.attrs['href'] for tag in tags]
    return links

def parse_station(tag, selectors, pattern):
    contents = []
    for selector in selectors:
        if tag.select(selector):
            for selected in tag.select(selector):
                contents.extend(selected.contents)
            break
    
    for content in contents:
        if isinstance(content, str):
            match = pattern.match(content)
            if match:
                return match.group(1)
    
    return None

def get_line_data(links):
    session = FuturesSession(max_workers=3)
    futures = [(session.get(link), link) for link in links]
    
    selectors = ['span.emphasized strong', 'span.emphasized', 'strong', 'td']
    stat_re = re.compile(r'[^\w]*(.+\w)')
    
    services = {}
    for future in futures:
        page = future[0].result()
        soup = BeautifulSoup(page.text, "lxml")
        tags = soup.find_all('table', {'summary' : re.compile(r'.*Subway Line Stops')})
        tags = tags[0].select('tr[height="25"]')
        service = []
        for tag in tags:
            stat = parse_station(tag, selectors, stat_re)
            # transfers are the page names linked from this row (the part before '.htm')
            trans = re.findall(r'a href="(\w+?)\.htm"', str(tag.select('a')))
            service.append((stat,trans))

        services[re.match(r'.+/(\w+).htm', future[1]).group(1)] = service
    
    return services
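
The `stat_re` pattern skips any leading non-word characters and then captures up to the last word character, which trims stray whitespace and punctuation around a station name. Here is a small check on made-up strings; `station_name` is a hypothetical wrapper around the same pattern.

```python
import re

# Same pattern as stat_re in get_line_data.
stat_re = re.compile(r'[^\w]*(.+\w)')

def station_name(text):
    """Return the trimmed station name, or None if the pattern does not match."""
    match = stat_re.match(text)
    return match.group(1) if match else None

# Made-up cell contents, for illustration only.
print(station_name('\n\t  Example St '))
print(station_name(' * 1st Av / Sample Rd.'))
print(station_name('   '))
```

Note that the pattern needs at least two characters ending in a word character, so whitespace-only strings fall through to `None`, just as they do in `parse_station`.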

In [13]:
url = 'http://web.mta.info/nyct/service/'
select = 'div.roundCorners p a'

links = get_links(url, select)

with open('../data/subway_line_data.pkd', 'wb') as f:
    dill.dump(get_line_data(links), f)
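
Once the line data is saved, one way to turn it into a network is to link consecutive stops on each line. The sketch below builds a plain adjacency dict from a structure shaped like the output of `get_line_data` (a line name mapped to a list of `(station, transfers)` tuples). The sample lines and stations are invented, `build_adjacency` is a hypothetical helper, and transfers between lines are ignored here.

```python
from collections import defaultdict

def build_adjacency(services):
    """Map each (line, station) platform to its neighboring platforms on that line."""
    graph = defaultdict(set)
    for line, stops in services.items():
        # Skip rows where no station name could be parsed; their neighbors
        # end up linked directly to each other.
        names = [station for station, _transfers in stops if station is not None]
        for here, there in zip(names, names[1:]):
            graph[(line, here)].add((line, there))
            graph[(line, there)].add((line, here))
    return dict(graph)

# Invented sample in the same shape as get_line_data's return value.
services = {
    'xline': [('Alpha St', []), ('Beta Av', ['yline']), ('Gamma Sq', [])],
    'yline': [('Beta Av', ['xline']), (None, []), ('Delta Rd', [])],
}
graph = build_adjacency(services)
for platform in sorted(graph):
    print(platform, sorted(graph[platform]))
```

Keying nodes by `(line, station)` rather than by station alone keeps platforms on different lines separate, which matters when only some platforms at a station are accessible.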

### Travel Time Data
The best way to determine which stations to prioritize for accessibility upgrades would be to measure travel times from point to point and see how they change as stations are made accessible or inaccessible. I haven't yet figured out a way to compute point-to-point travel times while including or omitting individual stations. However, a Google project called [Sidewalk Labs](https://www.sidewalklabs.com/) allows querying travel times from a point to all points in NYC, either with or without accessible stations. A friend of mine, [Micha Gorelick](https://github.com/mynameisfiber/), reverse engineered their API to [predict the impact of the L-train shutdown](https://github.com/mynameisfiber/lpocolypse) using a metric called [Earth mover's distance](https://en.wikipedia.org/wiki/Earth_mover%27s_distance). We can do the same for accessible stations.
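
For intuition about the metric: for two one-dimensional histograms over the same equally spaced bins and with the same total mass, the Earth mover's distance reduces to the sum of absolute differences between their cumulative sums, scaled by the bin width. The sketch below implements only that special case on made-up numbers. It is not necessarily the computation used in the linked project, and `emd_1d` is a hypothetical helper.

```python
from itertools import accumulate

def emd_1d(hist_a, hist_b, bin_width=1.0):
    """Earth mover's distance between two 1-D histograms on the same bins.

    Assumes equal-width bins and equal total mass in both histograms.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError('histograms must have the same number of bins')
    if abs(sum(hist_a) - sum(hist_b)) > 1e-9:
        raise ValueError('histograms must have the same total mass')
    cum_a = accumulate(hist_a)
    cum_b = accumulate(hist_b)
    # The mass that must cross each bin boundary equals the gap between
    # the two cumulative sums at that boundary.
    return bin_width * sum(abs(a - b) for a, b in zip(cum_a, cum_b))

# Made-up histograms: all mass moves two bins to the right.
print(emd_1d([1, 0, 0], [0, 0, 1]))
```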