# Week 3 Lab

This assignment explores and clusters the neighborhoods in Toronto.

In [149]:
import numpy as np 
import pandas as pd 

Start by making sure an HTML parser is installed, since `pd.read_html` needs one.

In [150]:
!pip install BeautifulSoup4



### Scrape the raw data from Wikipedia.

Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the table of postal codes and load it into a pandas DataFrame.

In [151]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
dfl = pd.read_html(url, attrs={'class': "wikitable sortable"}, header=0)
dfl[0]  # read_html returns a list of DataFrames, even when only one table matches

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
...,...,...,...
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West
285,M8Z,Etobicoke,South of Bloor
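The layout of a scraped page can change between runs, so it may be worth checking that the table has the expected columns before going further. The sketch below is illustrative only: `pick_postcode_table` and `EXPECTED_COLUMNS` are hypothetical names, and the small list of DataFrames stands in for whatever `pd.read_html` returns.

```python
import pandas as pd

# Column names taken from the output shown above.
EXPECTED_COLUMNS = ["Postcode", "Borough", "Neighbourhood"]


def pick_postcode_table(tables, expected=EXPECTED_COLUMNS):
    """Return the first DataFrame that contains every expected column."""
    for table in tables:
        if set(expected).issubset(table.columns):
            return table
    raise ValueError(f"no table with columns {expected}")


# Stand-in for the list returned by pd.read_html (made-up minimal data).
tables = [
    pd.DataFrame({"Other": [1]}),
    pd.DataFrame({
        "Postcode": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }),
]

picked = pick_postcode_table(tables)
print(picked)
```

Failing early with a clear error is usually easier to debug than a `KeyError` several cells later.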


### Assigning it to a new DataFrame.

This gives the table a shorter name, which makes the following steps easier to follow.

In [152]:
df = dfl[0]

### Only process the cells that have an assigned borough. 

Ignore cells with a borough that is *Not assigned*.

In [153]:
no_bor = df.drop(df[df.Borough == 'Not assigned'].index) 
no_bor.index = range(len(no_bor)) #re-initialize index
no_bor

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
...,...,...,...
205,M8Z,Etobicoke,Kingsway Park South West
206,M8Z,Etobicoke,Mimico NW
207,M8Z,Etobicoke,The Queensway West
208,M8Z,Etobicoke,Royal York South West
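The same filter can also be written as a boolean mask followed by `reset_index(drop=True)`. This is a minimal sketch on a three-row sample (the `sample` and `assigned` names are only for illustration), not a replacement for the cell above.

```python
import pandas as pd

# Three rows copied from the scraped table shown earlier.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough, then renumber the index from 0.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Selecting the rows to keep, rather than dropping the rows to discard, avoids the intermediate index lookup.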


### Combine neighbourhoods with same postcode and borough.


More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice with two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma, as shown in row 11 in the above table.

In [154]:
neigh_joined = no_bor.groupby(['Postcode','Borough'])['Neighbourhood'].apply(' , '.join).reset_index()
neigh_joined

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge , Malvern"
1,M1C,Scarborough,"Highland Creek , Rouge Hill , Port Union"
2,M1E,Scarborough,"Guildwood , Morningside , West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village , Martin Grove Gardens , Ric..."
101,M9V,Etobicoke,"Albion Gardens , Beaumond Heights , Humbergate..."
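To see what the grouping does on a small scale, here is a sketch using the M5A example from the text plus one single-neighbourhood postcode. It joins with `', '` instead of `' , '`, which is a stylistic assumption and not what the cell above does.

```python
import pandas as pd

rows = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One output row per (Postcode, Borough) pair; names keep their original order.
combined = (
    rows.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

Note that `groupby` sorts by the group keys by default, which is why the joined table above starts at M1B rather than at M3A.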


### Add borough name if neighbourhood name is not assigned.

If a cell has a borough but its neighborhood is *Not assigned*, the neighborhood is set to the borough name. So for the 9th cell in the table on the Wikipedia page, both the Borough and the Neighborhood columns will be Queen's Park.

In [155]:
# first check whether any such rows exist
neigh_bor = neigh_joined[neigh_joined.Neighbourhood == 'Not assigned']
neigh_bor

Unnamed: 0,Postcode,Borough,Neighbourhood
85,M7A,Queen's Park,Not assigned


In [156]:
#copying the borough content
neigh_joined.Neighbourhood = np.where(neigh_joined.Neighbourhood == 'Not assigned', neigh_joined.Borough, neigh_joined.Neighbourhood)

In [157]:
# verify that no 'Not assigned' neighbourhoods remain
neigh_bor2 = neigh_joined[neigh_joined.Neighbourhood == 'Not assigned']
neigh_bor2

Unnamed: 0,Postcode,Borough,Neighbourhood


In [158]:
# verify that the borough name was copied into the neighbourhood column
neigh_bor3 = neigh_joined[neigh_joined.Neighbourhood == "Queen's Park"]
neigh_bor3

Unnamed: 0,Postcode,Borough,Neighbourhood
85,M7A,Queen's Park,Queen's Park
93,M9A,Downtown Toronto,Queen's Park


No *Not assigned* values are left, so the cleanup is done.
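For comparison, the same fill can be done with the pandas-native `Series.mask`, which replaces values wherever a condition holds. This is a sketch on two sample rows, offered as an alternative spelling of the `np.where` call above and not as a correction to it.

```python
import pandas as pd

sample = pd.DataFrame({
    "Postcode": ["M7A", "M5A"],
    "Borough": ["Queen's Park", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Harbourfront , Regent Park"],
})

# Where the neighbourhood is missing, take the borough name instead.
missing = sample["Neighbourhood"] == "Not assigned"
sample["Neighbourhood"] = sample["Neighbourhood"].mask(missing, sample["Borough"])
print(sample)
```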

### Print the shape.

In [159]:
neigh_joined.shape

(103, 3)

### Get the latitude and longitude of each postal code.


First, try the suggested API.

In [160]:
#!pip install geocoder

In [161]:
#import geocoder

In [162]:
# First attempt: the geocoder service failed to respond, so this cell is
# disabled and the coordinates are read from the CSV instead.
#
# lat_lng_coords = None
#
# # loop until you get the coordinates
# while lat_lng_coords is None:
#     g = geocoder.google('{}, Toronto, Ontario'.format('M3A'))
#     lat_lng_coords = g.latlng
#
# latitude = lat_lng_coords[0]
# longitude = lat_lng_coords[1]

Pulling the CSV instead.
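As a side note, the geocoder loop above never terminates when the service stays silent. A bounded retry is one way around that. The sketch below is an assumption-laden illustration: `fetch_with_retries` and `make_flaky_lookup` are hypothetical helpers, and the fake lookup stands in for a real geocoding call so the example runs offline.

```python
import time


def fetch_with_retries(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns a value or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = lookup(query)
        if result is not None:
            return result
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # pause before the next try
    return None  # caller decides on the fallback, e.g. the CSV


def make_flaky_lookup(fail_times, answer):
    """Hypothetical stand-in for a geocoder: fails a few times, then answers."""
    calls = {"n": 0}

    def lookup(query):
        calls["n"] += 1
        return None if calls["n"] <= fail_times else answer

    return lookup


# Coordinates for M1B as listed in the CSV output in this notebook.
coords = fetch_with_retries(
    make_flaky_lookup(2, (43.806686, -79.194353)), "M1B, Toronto, Ontario"
)
gave_up = fetch_with_retries(lambda q: None, "M1B, Toronto, Ontario", max_attempts=3)
print(coords, gave_up)
```

Returning `None` after the last attempt makes the fallback to the CSV an explicit decision instead of a manual interrupt.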

In [163]:
url = 'https://cocl.us/Geospatial_data'
geodf = pd.read_csv(url, header=0)
geodf

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [164]:
neigh_joined.join(geodf.set_index('Postal Code'), on='Postcode')

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge , Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek , Rouge Hill , Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood , Morningside , West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village , Martin Grove Gardens , Ric...",43.688905,-79.554724
101,M9V,Etobicoke,"Albion Gardens , Beaumond Heights , Humbergate...",43.739416,-79.588437
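The join above is displayed but not stored in a variable. If the combined table is needed later, one option is `DataFrame.merge` with `validate='one_to_one'`, which raises an error if a postcode appears more than once on either side. This is a sketch on two rows taken from the outputs above, with illustrative names (`areas`, `coords`, `merged`).

```python
import pandas as pd

areas = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge , Malvern", "Highland Creek , Rouge Hill , Port Union"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],  # deliberately in a different order
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

# Left merge keeps every area; validate guards against duplicate postcodes.
merged = areas.merge(
    coords,
    left_on="Postcode",
    right_on="Postal Code",
    how="left",
    validate="one_to_one",
).drop(columns="Postal Code")

# Every postcode should have found its coordinates.
assert merged["Latitude"].notna().all()
print(merged)
```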


### Last steps.

Save as a new notebook and push to GitHub.

_Author: Ele-Kaja Gildemann 2019_