# Retrieve boroughs

In this notebook I will try to retrieve the borough of each school in the dataset.

To do so, I'll scrape a table that maps zip codes to boroughs, and then use it to convert the zip codes present in the dataset into borough names.

Good luck!

## Scrape the table

The table is available here:

https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm

The HTML structure looks simple enough to parse.
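
Judging from the selectors used in the scraping code below, the borough cell seems to appear only on the first row of each borough group, so the following rows have to reuse the last borough seen. Here is a minimal sketch of that carry-forward logic on hand-made rows. The names `build_zip_map` and `rows` are hypothetical, and the way the zip codes are grouped into rows is made up for illustration.

```python
import re


def build_zip_map(rows):
    """Map each zip code to its borough, carrying the borough forward."""
    mapping = {}
    current = None
    for borough, zip_cell in rows:
        if borough:  # a new borough group starts on this row
            current = borough
        # the cell holds several zip codes separated by commas
        for zc in re.split(r',\s?', zip_cell.strip()):
            mapping[zc] = current
    return mapping


# Hypothetical rows: (borough or None, raw text of the zip code cell)
rows = [
    ('Bronx', ' 10453, 10457, 10460'),
    (None, ' 10458,10467'),
    ('Brooklyn', ' 11212, 11213'),
]
zip_map = build_zip_map(rows)
print(zip_map['10467'])  # Bronx
```

The second row has no borough of its own, so its zip codes inherit `'Bronx'` from the row above.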

In [1]:
import parsel
import requests

URL = r'https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm'
r = requests.get(URL)
r.raise_for_status()  # fail early if the page could not be downloaded
s = parsel.Selector(r.text)

In [2]:
import re

results = {}  # dictionary that maps zip code => borough

cur_borough = None
_rows = s.css('tr')[1:]  # skip the header row
for _r in _rows:
    # the borough cell is only present on some rows, so remember the last one seen
    borough = _r.css('[headers="header1"]::text').extract_first()
    if borough:
        cur_borough = borough
    zip_codes = _r.css('[headers="header3"]::text').extract_first()
    if zip_codes is None:
        continue  # no zip code cell on this row
    zip_codes = zip_codes.strip()  # remove surrounding whitespace
    zip_codes = re.split(r',\s?', zip_codes)  # split on the comma
    for zc in zip_codes:
        results[zc] = cur_borough

results

{'10453': 'Bronx',
 '10457': 'Bronx',
 '10460': 'Bronx',
 '10458': 'Bronx',
 '10467': 'Bronx',
 '10468': 'Bronx',
 '10451': 'Bronx',
 '10452': 'Bronx',
 '10456': 'Bronx',
 '10454': 'Bronx',
 '10455': 'Bronx',
 '10459': 'Bronx',
 '10474': 'Bronx',
 '10463': 'Bronx',
 '10471': 'Bronx',
 '10466': 'Bronx',
 '10469': 'Bronx',
 '10470': 'Bronx',
 '10475': 'Bronx',
 '10461': 'Bronx',
 '10462': 'Bronx',
 '10464': 'Bronx',
 '10465': 'Bronx',
 '10472': 'Bronx',
 '10473': 'Bronx',
 '11212': 'Brooklyn',
 '11213': 'Brooklyn',
 '11216': 'Brooklyn',
 '11233': 'Brooklyn',
 '11238': 'Brooklyn',
 '11209': 'Brooklyn',
 '11214': 'Brooklyn',
 '11228': 'Brooklyn',
 '11204': 'Brooklyn',
 '11218': 'Brooklyn',
 '11219': 'Brooklyn',
 '11230': 'Brooklyn',
 '11234': 'Brooklyn',
 '11236': 'Brooklyn',
 '11239': 'Brooklyn',
 '11223': 'Brooklyn',
 '11224': 'Brooklyn',
 '11229': 'Brooklyn',
 '11235': 'Brooklyn',
 '11201': 'Brooklyn',
 '11205': 'Brooklyn',
 '11215': 'Brooklyn',
 '11217': 'Brooklyn',
 '11231': 'Brooklyn

## Check that every zip code is covered

In [3]:
import pandas as pd

df = pd.read_csv('../data/raw/2016 School Explorer.csv')
df[~df.Zip.apply(lambda x: str(x) in results)].Zip.unique()

array([10282, 11001, 11109, 10311])

Nope... Four zip codes from the dataset are missing from the scraped table.

Let's fill in the missing values manually, looking them up on a site such as https://statisticalatlas.com/zip/10282/Overview.

In [4]:
missing_zip_codes = {
    '10282': 'Manhattan',
    '11001': 'Queens',
    '11109': 'Queens',
    '10311': 'Staten Island',
}
results.update(missing_zip_codes)

In [5]:
# re-run the check: no unmatched zip codes should remain

df[~df.Zip.apply(lambda x: str(x) in results)].Zip.unique()

array([], dtype=int64)
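
With every zip code covered, the lookup can be applied to the dataset, which is the goal stated at the top. Below is a minimal sketch on a toy frame, assuming the `Zip` column holds integers as the check above suggests. The `Borough` column name and the sample rows are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Small stand-ins for the notebook's `results` and `df`
results = {'10453': 'Bronx', '11212': 'Brooklyn', '10282': 'Manhattan'}
df = pd.DataFrame({'Zip': [10453, 11212, 10282, 10453]})

# Zip is numeric while the dictionary keys are strings, so cast before mapping
df['Borough'] = df['Zip'].astype(str).map(results)

# Any zip code missing from the dictionary would show up as NaN here
assert df['Borough'].notna().all()
print(df['Borough'].tolist())
```

`Series.map` leaves `NaN` for keys it cannot find, so the `notna()` check doubles as a second coverage test.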