For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

Start by creating a new Notebook for this assignment.

Use the Notebook to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the table of postal codes, and transform the data into a pandas dataframe.

First, we import everything needed:

In [205]:
import requests
from bs4 import BeautifulSoup

We begin by fetching the HTML source of the web page and creating a BeautifulSoup object (soup) from it with the BeautifulSoup function.

In [206]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

r = requests.get(url)
r.raise_for_status()  # stop early if the request failed

html_content = r.text

soup = BeautifulSoup(html_content, 'lxml')

Next, we extract the postal code table from the parsed HTML content.

In [207]:
table = soup.find('table',{'class':'wikitable sortable'})
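One caveat about the lookup above: when the class value passed to find() contains a space, BeautifulSoup compares it against the whole class attribute string, so a table carrying an extra class would not be found. The sketch below illustrates this on a hand-written one-row table. The extra class name jquery-tablesorter and the names doc, exact, and by_class are assumptions made only for this illustration, and 'html.parser' is used so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hand-written sample: a table with one extra class on top of the two we expect.
html = (
    '<table class="wikitable sortable jquery-tablesorter">'
    '<tr><td>M3A</td></tr>'
    '</table>'
)
doc = BeautifulSoup(html, 'html.parser')

# Exact-string match on the class attribute: fails because of the extra class.
exact = doc.find('table', {'class': 'wikitable sortable'})

# Single-class match: succeeds for any table that has the "wikitable" class.
by_class = doc.find('table', class_='wikitable')

print(exact is None, by_class is not None)
```

If the page markup ever changes, checking that table is not None before continuing gives a clearer error than a failure further down.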

Time to find the headings of the imported table.

In [208]:
ths = table.find_all('th')
headings = [th.text.strip() for th in ths]
headings

['Postcode', 'Borough', 'Neighbourhood']

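Now we need to call find_all() twice: first to get each row of the table, and then to get each cell of each row. Rows whose borough is 'Not assigned' are skipped, and a neighbourhood that is 'Not assigned' takes the name of its borough.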

In [218]:
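postcode = []
borough = []
neighbourhood = []
for tr in table.find_all('tr'):
    # Strip whitespace so comparisons are not thrown off by trailing newlines
    cells = [td.text.strip() for td in tr.find_all('td')]
    # Skip the header row (no <td> cells) and rows without a borough
    if not cells or cells[1] == 'Not assigned':
        continue
    postcode.append(cells[0])
    borough.append(cells[1])
    # A neighbourhood that is not assigned takes the borough name
    if cells[2] == 'Not assigned':
        neighbourhood.append(cells[1])
    else:
        neighbourhood.append(cells[2])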
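To make the two rules of the loop easier to check, here is a minimal sketch that applies them to a small hand-written table in the same three-column layout. The sample HTML, the helper name parse_rows, and the use of 'html.parser' are assumptions for illustration only, not part of the scraped page or of the assignment.

```python
from bs4 import BeautifulSoup

# Hand-written sample with the same column layout as the Wikipedia table.
sample_html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
<tr><td>M7A</td><td>Queen's Park</td><td>Not assigned
</td></tr>
</table>
"""


def parse_rows(table):
    """Return (postcode, borough, neighbourhood) tuples for assigned rows."""
    rows = []
    for tr in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        # Skip the header row and rows whose borough is not assigned.
        if len(cells) < 3 or cells[1] == 'Not assigned':
            continue
        postcode, borough, neighbourhood = cells[:3]
        # An unassigned neighbourhood falls back to the borough name.
        if neighbourhood == 'Not assigned':
            neighbourhood = borough
        rows.append((postcode, borough, neighbourhood))
    return rows


sample_table = BeautifulSoup(sample_html, 'html.parser').find('table')
rows = parse_rows(sample_table)
print(rows)
```

Stripping each cell before comparing matters here, because the last cell of every row ends with a newline and would otherwise never equal 'Not assigned'.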


Time to join the lists into a Pandas DataFrame.

In [219]:
import pandas as pd
df = pd.DataFrame({headings[0]:postcode,headings[1]:borough,headings[2]:neighbourhood})

Now we group the rows by Postcode (and Borough) and join the neighbourhoods of each postcode into a single comma-separated string.

In [220]:
# Join the neighbourhood names of each (Postcode, Borough) group with ', '
grouped = df.groupby([headings[0], headings[1]])[headings[2]].agg(', '.join).reset_index()
grouped.head()

   Postcode  Borough      Neighbourhood
0  M1B       Scarborough  Rouge, Malvern
1  M1C       Scarborough  Highland Creek, Rouge Hill, Port Union
2  M1E       Scarborough  Guildwood, Morningside, West Hill
3  M1G       Scarborough  Woburn
4  M1H       Scarborough  Cedarbrae
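The grouping step can be followed on a tiny example. The sketch below builds a three-row dataframe by hand, with values taken from the output above, and collapses it to one row per postcode. The names sample and combined are assumptions for illustration, and joining with ', '.join is one way to do it, not necessarily the only one.

```python
import pandas as pd

# Two rows share the postcode M1B, one row has its own postcode.
sample = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge', 'Malvern', 'Woburn'],
})

# One row per (Postcode, Borough); neighbourhood names joined with ', '.
combined = (
    sample.groupby(['Postcode', 'Borough'])['Neighbourhood']
    .agg(', '.join)
    .reset_index()
)
print(combined)
```

After this step every postcode appears exactly once, which is what makes it usable as a key when merging with the coordinates.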


In [221]:
grouped.shape

(103, 3)

Now that we have built a dataframe with the postal code, borough name, and neighborhood name of each neighborhood, we need the latitude and longitude coordinates of each postal code in order to use the Foursquare location data.

In [222]:
geo_data = pd.read_csv("https://cocl.us/Geospatial_data") 
geo_data.head()

   Postal Code  Latitude   Longitude
0  M1B          43.806686  -79.194353
1  M1C          43.784535  -79.160497
2  M1E          43.763573  -79.188711
3  M1G          43.770992  -79.216917
4  M1H          43.773136  -79.239476


Now we rename the key column so that it has the same name in both dataframes (grouped and geo_data).

In [223]:
geo_data.rename({'Postal Code': headings[0]}, axis=1, inplace=True) 
geo_data.head()

   Postcode  Latitude   Longitude
0  M1B       43.806686  -79.194353
1  M1C       43.784535  -79.160497
2  M1E       43.763573  -79.188711
3  M1G       43.770992  -79.216917
4  M1H       43.773136  -79.239476


Next, we set Postcode as the index of both dataframes (grouped and geo_data) so that they can be merged on it.

In [224]:
grouped.set_index(headings[0], inplace=True)
grouped.head()

          Borough      Neighbourhood
Postcode
M1B       Scarborough  Rouge, Malvern
M1C       Scarborough  Highland Creek, Rouge Hill, Port Union
M1E       Scarborough  Guildwood, Morningside, West Hill
M1G       Scarborough  Woburn
M1H       Scarborough  Cedarbrae


In [225]:
geo_data.set_index(headings[0], inplace=True)
geo_data.head()

          Latitude   Longitude
Postcode
M1B       43.806686  -79.194353
M1C       43.784535  -79.160497
M1E       43.763573  -79.188711
M1G       43.770992  -79.216917
M1H       43.773136  -79.239476


Finally, we merge both dataframes (grouped and geo_data) on their shared Postcode index. The inner join keeps only the postcodes that are present in both.

In [226]:
result = pd.merge(grouped, geo_data, left_index=True, right_index=True, how='inner')
result.head()

          Borough      Neighbourhood                           Latitude   Longitude
Postcode
M1B       Scarborough  Rouge, Malvern                          43.806686  -79.194353
M1C       Scarborough  Highland Creek, Rouge Hill, Port Union  43.784535  -79.160497
M1E       Scarborough  Guildwood, Morningside, West Hill       43.763573  -79.188711
M1G       Scarborough  Woburn                                  43.770992  -79.216917
M1H       Scarborough  Cedarbrae                               43.773136  -79.239476
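Because an inner merge silently drops postcodes that have no coordinates, it can be worth checking the match before relying on the result. The sketch below does this on two small hand-built dataframes. The names left, right, check, and missing are assumptions for illustration, and M1E is deliberately left out of the coordinates so that the check has something to report.

```python
import pandas as pd

# Hand-built samples; M1E has no coordinates on purpose.
left = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M1E'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
})
right = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# A left merge with indicator=True marks rows that found no match;
# validate='one_to_one' raises an error if a postcode is duplicated on either side.
check = left.merge(right, on='Postcode', how='left',
                   indicator=True, validate='one_to_one')
missing = check.loc[check['_merge'] == 'left_only', 'Postcode'].tolist()
print(missing)
```

An empty list means the inner merge loses nothing. Merging with on='Postcode' as done here is an alternative to setting the index on both dataframes first.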
