# Toronto Neighborhoods

## Scraping Neighborhood Data

This notebook scrapes the neighborhood data from Wikipedia.

The scraping is done using the BeautifulSoup library.

After checking the page, I assume the following:
 - There will be only one table with the class _wikitable_ on the page.
 - The _Not assigned_ value is always written in the same casing.
 - There is only one row for each postal code.
 
 

> **Note** that the wiki page has been changed since the assignment was created. Now there is only one row for each postal code, and the different neighborhoods under the same postal code are separated by /.
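
Because several neighborhoods can now share one cell, it can help to split them into a list. This is a minimal sketch, assuming the separator is always `/` as in the note above; `split_neighborhoods` is a hypothetical helper and not part of the notebook.

```python
def split_neighborhoods(cell):
    """Split a '/'-separated table cell into a list of neighborhood names."""
    # Strip whitespace around each name and drop empty fragments.
    return [name.strip() for name in cell.split('/') if name.strip()]


print(split_neighborhoods('Regent Park / Harbourfront'))
# → ['Regent Park', 'Harbourfront']
```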


In [50]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [48]:
WIKI_URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

rows = []

soup = BeautifulSoup(requests.get(WIKI_URL).text, 'html.parser')
for row in soup.select_one('table.wikitable').find_all('tr'):
    cols = [col.get_text().strip() for col in row.find_all('td')]
    # Skip header rows and rows without an assigned borough.
    if len(cols) < 3 or cols[1] == 'Not assigned':
        continue
    postal_code, borough, neighborhood = cols[:3]
    # Fall back to the borough name when no neighborhood is assigned.
    if neighborhood == 'Not assigned':
        neighborhood = borough
    rows.append({'Postal code': postal_code, 'Borough': borough, 'Neighborhood': neighborhood})

# Build the frame in one step; DataFrame.append was removed in pandas 2.0.
hoods = pd.DataFrame(rows, columns=['Postal code', 'Borough', 'Neighborhood'])
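
The filtering rules in the loop above can be exercised without a network call. This is a minimal sketch on hand-written rows; `clean_rows` and the sample tuples are hypothetical and only illustrate the two _Not assigned_ cases.

```python
def clean_rows(raw_rows):
    """Apply the 'Not assigned' rules to (postal code, borough, neighborhood) tuples."""
    cleaned = []
    for postal_code, borough, neighborhood in raw_rows:
        # Rows without a borough are dropped entirely.
        if borough == 'Not assigned':
            continue
        # A missing neighborhood falls back to the borough name.
        if neighborhood == 'Not assigned':
            neighborhood = borough
        cleaned.append((postal_code, borough, neighborhood))
    return cleaned


# Made-up sample rows, not taken from the live page.
sample = [
    ('X0A', 'Not assigned', 'Not assigned'),
    ('X0B', 'Borough B', 'Hood B'),
    ('X0C', 'Borough C', 'Not assigned'),
]
print(clean_rows(sample))
# → [('X0B', 'Borough B', 'Hood B'), ('X0C', 'Borough C', 'Borough C')]
```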


In [53]:
hoods.head(12)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,Malvern / Rouge
7,M3B,North York,Don Mills
8,M4B,East York,Parkview Hill / Woodbine Gardens
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [54]:
hoods.shape

(103, 3)

In [118]:
import geocoder

In [121]:
MAX_RETRIES = 10

def get_lat_lon(hood):
    """Geocode a neighborhood name with OSM, retrying up to MAX_RETRIES times."""
    location = None
    retries = MAX_RETRIES
    # Keep retrying while no coordinates have been returned.
    while retries > 0 and location is None:
        location = geocoder.osm('{}, Toronto, Ontario'.format(hood)).latlng
        retries -= 1

    if location is None:
        return {'Latitude': None, 'Longitude': None}

    return {'Latitude': location[0], 'Longitude': location[1]}

> **Note:** I ended up using the OSM provider instead of Google, and the neighborhood names instead of the postal codes. With Google and postal codes, almost no results were returned even after retrying every data point 50 times. With OSM and neighborhood names, only a few locations are missing; those rows are removed from the data set.
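
The retry idea used in `get_lat_lon` can be separated from the geocoding call itself. This is a minimal sketch, assuming a lookup function that returns `None` on failure; `retry_lookup`, `flaky_lookup`, and the `delay` parameter are hypothetical names for illustration, and the pause between attempts is an addition here rather than something the notebook does.

```python
import time


def retry_lookup(lookup, query, max_retries=10, delay=1.0):
    """Call lookup(query) until it returns a non-None value or retries run out."""
    for attempt in range(max_retries):
        result = lookup(query)
        if result is not None:
            return result
        # Pause before the next attempt so the provider is not hammered.
        if attempt < max_retries - 1:
            time.sleep(delay)
    return None


# Hypothetical stand-in for a geocoder call: fails twice, then succeeds
# with placeholder coordinates.
calls = {'count': 0}


def flaky_lookup(query):
    calls['count'] += 1
    return [43.0, -79.0] if calls['count'] >= 3 else None


print(retry_lookup(flaky_lookup, 'Example, Toronto, Ontario', delay=0))
# → [43.0, -79.0]
```

With a real provider, `lookup` could wrap the `geocoder.osm(...)` call from the cell above.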

In [130]:
# Geocode the first neighborhood listed for each postal code.
lat_lon_df = pd.DataFrame(
    [get_lat_lon(hood.split('/')[0].strip()) for hood in hoods['Neighborhood']],
    columns=['Latitude', 'Longitude'],
)

hoods_with_loc = pd.concat([hoods, lat_lon_df], axis=1, sort=False)

original_size = hoods_with_loc.shape[0]

hoods_with_loc = hoods_with_loc.dropna()

print('Removed {} rows due to missing location data'.format(original_size - hoods_with_loc.shape[0]))

Removed 8 rows due to missing location data


In [132]:
hoods_with_loc.head(12)

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7588,-79.320197
1,M4A,North York,Victoria Village,43.732658,-79.311189
2,M5A,Downtown Toronto,Regent Park / Harbourfront,43.660706,-79.360457
3,M6A,North York,Lawrence Manor / Lawrence Heights,43.722079,-79.437507
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,43.659659,-79.39034
5,M9A,Etobicoke,Islington Avenue,43.622575,-79.514215
6,M1B,Scarborough,Malvern / Rouge,43.809196,-79.221701
7,M3B,North York,Don Mills,43.775347,-79.345944
8,M4B,East York,Parkview Hill / Woodbine Gardens,43.653482,-79.383935
10,M6B,North York,Glencairn,43.708712,-79.440685
