### DOWNLOAD

Downloaded the Wikipedia page, then created a BeautifulSoup instance using the lxml parser.

In [26]:
import pandas as pd
import lxml
import wget

from bs4 import BeautifulSoup

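# One-time download, left commented out so re-running the notebook does not re-fetch the page.
#wget.download('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M','Data/Postal_Codes_Canada.html')

# Parse the saved copy; the with-block closes the file handle once parsing is done.
with open('Data/Postal_Codes_Canada.html', encoding="utf8") as html_file:
    soup = BeautifulSoup(html_file, 'lxml')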
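
A minimal sketch of a download-once step, so the saved copy is reused on later runs. It swaps `wget` for the standard library's `urllib.request.urlretrieve`, which is an assumption rather than what this notebook used; the helper name `ensure_local_copy` and its `fetch` parameter are hypothetical. Some servers reject requests without a browser-like user agent, so `fetch` can be replaced with any function that takes `(url, path)`.

```python
import os
import urllib.request


def ensure_local_copy(url, path, fetch=urllib.request.urlretrieve):
    """Download url to path unless the file already exists; return path."""
    if not os.path.exists(path):
        folder = os.path.dirname(path)
        if folder:
            # Create the target folder (e.g. 'Data') if it is missing.
            os.makedirs(folder, exist_ok=True)
        fetch(url, path)
    return path


# Possible usage (performs a network request on the first run only):
# ensure_local_copy('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M',
#                   'Data/Postal_Codes_Canada.html')
```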

### CLEAN DATA

Created a dictionary `data` that maps each postal code (*key*) to a (borough, neighborhoods) tuple (*value*). Rows with no assigned borough were discarded. Rows with a borough but no assigned neighborhood use the borough name as the neighborhood. When a postal code appears in several rows, its neighborhoods are joined into one comma-separated string.

In [27]:
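data = {}

# The postal code table is the sortable wikitable on the page.
tr_table = soup.find('table',class_='wikitable sortable')
for tr in tr_table.find_all('tr'):
    # Header rows contain <th> cells only, so they yield an empty tuple and are skipped below.
    tup = tuple(elem.text.strip() for elem in tr.find_all('td'))
    if len(tup) == 3:
        code,bor,nei = tup
        if bor.lower() != 'not assigned':
            # Fall back to the borough name when no neighborhood is assigned.
            if nei.lower() == 'not assigned':
                nei = bor
            val = data.get(code)
            if val:
                # Postal code seen before: prepend this neighborhood to the existing string.
                data[code] = (bor, nei+', '+val[1])
            else:
                data[code] = (bor, nei)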
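
The same cleaning rules can be checked without any HTML. The sketch below is a hypothetical helper (`clean_rows` is not part of the notebook) applied to a small hand-written sample; the row order and the `M1A` row are assumptions made for illustration.

```python
def clean_rows(rows):
    """Apply the cleaning rules to (code, borough, neighborhood) tuples."""
    data = {}
    for code, bor, nei in rows:
        if bor.lower() == 'not assigned':
            continue  # no borough: discard the row
        if nei.lower() == 'not assigned':
            nei = bor  # no neighborhood: reuse the borough name
        if code in data:
            # Repeated postal code: prepend, matching the loop above.
            data[code] = (bor, nei + ', ' + data[code][1])
        else:
            data[code] = (bor, nei)
    return data


# Illustrative sample rows (assumed, not scraped from the page).
sample = [
    ('M1A', 'Not assigned', 'Not assigned'),
    ('M5A', 'Downtown Toronto', 'Harbourfront'),
    ('M5A', 'Downtown Toronto', 'Regent Park'),
    ('M7A', "Queen's Park", 'Not assigned'),
]
cleaned = clean_rows(sample)
print(cleaned['M5A'])  # ('Downtown Toronto', 'Regent Park, Harbourfront')
print(cleaned['M7A'])  # ("Queen's Park", "Queen's Park")
```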

### DATAFRAME

Used the `data` dictionary to populate a DataFrame.

In [30]:
cols = ['Postal Code','Borough','Neighborhood']

# DataFrame.append was removed in pandas 2.0, so collect all rows first
# and construct the frame in a single call.
rows = [(k, v[0], v[1]) for k,v in data.items()]
pcodes = pd.DataFrame(rows, columns=cols)

### M5V

This postal code had multiple neighborhoods.

In [31]:
pcodes.loc[pcodes['Postal Code'] == 'M5V']

Unnamed: 0,Postal Code,Borough,Neighborhood
87,M5V,Downtown Toronto,"South Niagara, Railway Lands, King and Spadina..."


### M7A

This postal code had a borough, but no neighborhood.

In [33]:
pcodes.loc[pcodes['Postal Code'] == 'M7A']

Unnamed: 0,Postal Code,Borough,Neighborhood
4,M7A,Queen's Park,Queen's Park


### DATA SHAPE

There are **103** rows of data.

In [34]:
pcodes.shape

(103, 3)

### GEOCODER

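Had trouble with geocoder. The sample code given never terminated: every request to the Google endpoint timed out, so `g.latlng` stayed `None` and the `while` loop kept retrying until it was interrupted manually.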

In [18]:
import geocoder

postal_code = 'M5V'
lat_lng_coords = None

while(lat_lng_coords is None):
    print('in loop')
    g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
    lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]

imported geocoder
in loop


Status code Unknown from https://maps.googleapis.com/maps/api/geocode/json: ERROR - HTTPSConnectionPool(host='maps.googleapis.com', port=443): Max retries exceeded with url: /maps/api/geocode/json?address=M5V%2C+Toronto%2C+Ontario&bounds=&components=&region=&language= (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x0000020F7F0564A8>, 'Connection to maps.googleapis.com timed out. (connect timeout=5.0)'))


in loop


Status code Unknown from https://maps.googleapis.com/maps/api/geocode/json: ERROR - HTTPSConnectionPool(host='maps.googleapis.com', port=443): Max retries exceeded with url: /maps/api/geocode/json?address=M5V%2C+Toronto%2C+Ontario&bounds=&components=&region=&language= (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x0000020F7F056E10>, 'Connection to maps.googleapis.com timed out. (connect timeout=5.0)'))


in loop


KeyboardInterrupt: 
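
One way to keep a lookup like this from looping forever is to cap the number of attempts. The sketch below is an assumption about how that could look, not the course's sample code; `lookup_with_retries`, `max_attempts`, and `delay` are hypothetical names. The lookup is passed in as a function so the retry logic can be exercised without network access.

```python
import time


def lookup_with_retries(fetch, max_attempts=3, delay=1.0):
    """Call fetch() until it returns a non-None value or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        coords = fetch()
        if coords is not None:
            return coords
        if attempt < max_attempts:
            time.sleep(delay)  # brief pause before the next attempt
    return None  # the caller decides what to do when every attempt fails


# Possible usage with the call from the cell above:
# import geocoder
# coords = lookup_with_retries(
#     lambda: geocoder.google('M5V, Toronto, Ontario').latlng)
# if coords is None:
#     print('No coordinates returned; falling back to the CSV file.')
```

Returning `None` after the last attempt leaves the fallback decision to the caller, which fits the CSV route taken in the next section.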

### CSV

Used the CSV file to load the latitude / longitude coordinates into a DataFrame. Then merged the two DataFrames on 'Postal Code'.

In [48]:
geospatial = pd.read_csv('Data/Geospatial_Coordinates.csv')

result = pd.merge(pcodes, geospatial, on='Postal Code')
result.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
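
`pd.merge` defaults to an inner join, so a postal code that is missing from the CSV would silently drop out of `result`. A minimal sketch of a check for that, using a left join with `indicator=True` on small stand-in frames: `pcodes_demo`, `geo_demo`, and the postal code `M0X` are made up for illustration, while the two coordinate pairs are copied from the output above.

```python
import pandas as pd

# Stand-in for pcodes; 'M0X' is a made-up code with no coordinates.
pcodes_demo = pd.DataFrame({
    'Postal Code': ['M3A', 'M7A', 'M0X'],
    'Borough': ['North York', "Queen's Park", 'Hypothetical'],
})

# Stand-in for the geospatial CSV.
geo_demo = pd.DataFrame({
    'Postal Code': ['M3A', 'M7A'],
    'Latitude': [43.753259, 43.662301],
    'Longitude': [-79.329656, -79.389494],
})

# A left join keeps every postal code; the '_merge' column records
# whether a matching row was found in geo_demo.
checked = pd.merge(pcodes_demo, geo_demo, on='Postal Code',
                   how='left', indicator=True)
missing = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(missing)  # ['M0X']
```

An empty `missing` list on the real data would indicate that the inner join lost no rows.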
