# Data Retrieval

I am working with several datasets in this project. Some are publicly available as flat files, while others I had to construct myself by scraping websites. This notebook walks through retrieving all of them.

In [None]:
import numpy as np
import pandas as pd
from pandas.io.common import get_filepath_or_buffer
import geopandas as gpd
import json
import re
from datetime import datetime
from requests_futures.sessions import FuturesSession
import requests
from bs4 import BeautifulSoup
import dill
from tqdm import tqdm_notebook

### Turnstile and Fare Card Data

In [None]:
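def get_dated_links(url, select):
    """Return (href, date) pairs for every anchor matched by the CSS selector."""
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "lxml")
    tags = soup.select(select)

    # The link text is a date such as 'Saturday, January 02, 2016'; parse it into a date object.
    links = [(tag.attrs['href'], datetime.strptime(tag.contents[0], '%A, %B %d, %Y').date()) for tag in tags]
    return links

def filter_by_date(links, cutoff=datetime(2016,1,1).date()):
    """Keep only the links dated on or after the cutoff."""
    return [link for link in links if link[1] >= cutoff]

def dl_MTA_data(url, links, out='../data/'):
    """Download every linked file into the `out` directory (path must end with '/')."""
    # The hrefs are relative, so prepend the directory portion of the index URL.
    basePage = re.match(r'(.+?/)\w+\.html', url).group(1)
    pages = [basePage + link[0] for link in links]

    # Request the files concurrently, with at most three workers.
    session = FuturesSession(max_workers=3)
    futures = [(session.get(page), page) for page in pages]

    # Capture the file name (name plus three-letter extension) at the end of each URL.
    s = re.compile(r'.+/(\w+\.\w{3})')
    for future in tqdm_notebook(futures):
        with open(out + s.search(future[1]).group(1), 'w') as f:
            f.write(future[0].result().text)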

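The CSS selectors passed to `get_dated_links` below (`div.span-84 a` and `div.span-19 a`) depend on the current layout of the MTA pages, so you may need to inspect each page's HTML to make sure the correct files are being retrieved.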
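
One lightweight way to check a selector before downloading anything is to summarize the list of `(href, date)` pairs it produces. The sketch below is only an illustration: `summarize_links` is a hypothetical helper that is not part of this notebook, and the sample links are made up rather than taken from the MTA site.

```python
from collections import Counter
from datetime import date

def summarize_links(links):
    """Summarize (href, date) pairs: count, date range, and file extensions."""
    if not links:
        return {'count': 0, 'first': None, 'last': None, 'extensions': {}}
    dates = [d for _, d in links]
    # Treat whatever follows the last dot in the href as the extension.
    extensions = Counter(href.rsplit('.', 1)[-1].lower() for href, _ in links)
    return {
        'count': len(links),
        'first': min(dates),
        'last': max(dates),
        'extensions': dict(extensions),
    }

# Made-up sample in the same (href, date) shape that get_dated_links returns.
sample_links = [
    ('data/example/file_160102.txt', date(2016, 1, 2)),
    ('data/example/file_160109.txt', date(2016, 1, 9)),
    ('example/about.html', date(2016, 1, 16)),
]
summary = summarize_links(sample_links)
print(summary)
```

An unexpected extension in the summary (here, the stray `html` entry) or a date range that does not match the page suggests the selector is picking up the wrong anchors.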

In [None]:
# turnstile data
url = 'http://web.mta.info/developers/turnstile.html'

links = filter_by_date(get_dated_links(url, 'div.span-84 a'))
dl_MTA_data(url, links, '../data/')

In [None]:
# farecard data
url = 'http://web.mta.info/developers/fare.html'

links = filter_by_date(get_dated_links(url, 'div.span-19 a'))
dl_MTA_data(url, links, '../data/')

### Census Data

Unfortunately, the API for the American Community Survey is not well documented, and I had a hard time getting it to return the data I wanted. I ultimately went to https://factfinder.census.gov/faces/nav/jsf/pages/download_center.xhtml and requested the 'DISABILITY CHARACTERISTICS' table from the 2015 ACS 5-year survey at the block level for New York, Bronx, Kings, and Queens counties. I joined these four tables into one, which I use in the other notebooks and have included in the `data` directory of this repo.
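
For reference, stacking per-county tables like these can be done with `pd.concat`. This is a minimal sketch rather than the exact steps I ran: the column names (`geo_id`, `total_population`) and the values are hypothetical placeholders, and the in-memory strings stand in for the downloaded files.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the four county downloads; the real tables have many more columns.
county_csvs = {
    'New York': "geo_id,total_population\n1001,100\n1002,200\n",
    'Bronx': "geo_id,total_population\n2001,150\n",
    'Kings': "geo_id,total_population\n3001,300\n",
    'Queens': "geo_id,total_population\n4001,250\n",
}

frames = []
for county, text in county_csvs.items():
    # Read identifiers as strings so any leading zeros survive.
    df = pd.read_csv(io.StringIO(text), dtype={'geo_id': str})
    df['county'] = county  # record which table each row came from
    frames.append(df)

# Stack the tables vertically and renumber the index from zero.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Keeping a `county` column makes it easy to confirm afterwards that every source table contributed rows.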

### Station Geo Data