# Capstone project for IBM Data Science Professional Certificate
# Week 3: Clustering of Toronto's neighborhoods

**Author: Elitza Maneva**

This notebook is for the Week 3 assignment of the Capstone Project course.

In [32]:
import pandas as pd
import numpy as np

## Scraping zip-code data

We scrape the zip-code data for Toronto from the Wikipedia page [List of postal codes of Canada: M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

In [33]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
dfs = pd.read_html(url)

print(len(dfs))

3


At the time of writing, the first of the three tables on the page is the one we are interested in. We print it to confirm.

In [21]:
dfs[0]

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
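
Because the position of the table on the page may change, selecting it by its column names is less fragile than relying on `dfs[0]`. The following is a minimal sketch of that idea; `pick_table` is a hypothetical helper, and the `tables` list is toy data standing in for the result of `pd.read_html(url)`.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required_columns."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"no table with columns {sorted(required)}")

# Toy stand-ins for the list of tables returned by pd.read_html(url)
tables = [
    pd.DataFrame({"Note": ["navigation box"]}),
    pd.DataFrame({
        "Postal Code": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }),
]

selected = pick_table(tables, ["Postal Code", "Borough", "Neighbourhood"])
print(selected.shape)
```

`pd.read_html` also accepts a `match` argument that keeps only tables containing a given text, which may serve the same purpose.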


In [34]:
zip_df = dfs[0]

In [35]:
zip_df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


First, we drop all rows whose Borough is "Not assigned" and reset the index.

In [36]:
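# Mark unassigned entries as missing, then drop rows without a borough
zip_df.replace('Not assigned', np.nan, inplace = True)
zip_df.dropna(subset = ["Borough"], axis = 0, inplace = True)
# Renumber the rows from 0 without keeping the old index as a column
zip_df.reset_index(drop = True, inplace = True)
zip_df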

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
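
The same filtering can also be written without in-place mutation, using a boolean mask. This is only a sketch on toy data; the frame `raw` and its three rows are made up for illustration.

```python
import pandas as pd

# Toy data mimicking the structure of the scraped table
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep rows with an assigned borough and renumber the index from 0
assigned = raw[raw["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

This variant leaves `raw` unchanged, which makes it easier to rerun the cell.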


In [76]:
zip_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [38]:
zip_df.describe(include = "all")

Unnamed: 0,Postal Code,Borough,Neighbourhood
count,103,103,103
unique,103,10,99
top,M1S,North York,Downsview
freq,1,24,4


In [39]:
zip_df.shape

(103, 3)

## Getting coordinates for each postal code 

In [41]:
coord_df = pd.DataFrame(columns = ["Latitude", "Longitude"])

In [43]:
import geocoder

for postal_code in zip_df["Postal Code"]:
    # initialize the result to None
    lat_lng_coords = None

    # loop until we get the coordinates
    # (this never terminates if the service keeps returning None)
    while lat_lng_coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng

    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    # add one row per postal code, indexed by the postal code
    coord_df.loc[postal_code] = [latitude, longitude]

coord_df

KeyboardInterrupt: 

Unfortunately, the service keeps returning None. Trying it for a single zip code also failed.

In [47]:
lat_lng_coords = None
while lat_lng_coords is None:
    g = geocoder.google('M3A, Toronto, Ontario')
    lat_lng_coords = g.latlng

KeyboardInterrupt: 
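
A loop with a bounded number of attempts avoids having to interrupt the kernel by hand when the service never answers. The sketch below assumes the lookup is passed in as a function; `geocode_with_retry`, `always_fails`, and `fake_lookup` are hypothetical names, and the two stand-in lookups do not contact any service.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) up to max_attempts times; return (lat, lng) or None."""
    for attempt in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return tuple(coords)
        # Pause between attempts, but not after the last one
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    return None

# Stand-in that behaves like the failing service: always returns None
def always_fails(query):
    return None

# Stand-in that returns fixed coordinates, for illustration only
def fake_lookup(query):
    return [43.753259, -79.329656]

print(geocode_with_retry(always_fails, "M3A, Toronto, Ontario"))
print(geocode_with_retry(fake_lookup, "M3A, Toronto, Ontario"))
```

In the notebook, the lookup could be something like `lambda q: geocoder.google(q).latlng`.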

We use the workaround from the assignment description and download the .csv file with the coordinates from http://cocl.us/Geospatial_data

In [50]:
coord_df = pd.read_csv("Geospatial_Coordinates.csv")
coord_df

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [56]:
coord_df.set_index("Postal Code", inplace = True)

In [74]:
toronto_df = zip_df.join(coord_df, on = 'Postal Code')
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
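
When combining the two tables, it can be worth checking that every postal code actually received coordinates. The sketch below uses `merge` instead of `join` and toy data in which `M5A` is deliberately left without coordinates; the frames `zips` and `coords` are made up for illustration.

```python
import pandas as pd

# Toy data: M5A has no entry in coords on purpose
zips = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left merge keeps every row of zips; validate raises an error
# if a postal code is repeated in either frame
merged = zips.merge(coords, on="Postal Code", how="left", validate="one_to_one")

# Postal codes without a match end up with NaN in Latitude
missing = merged.loc[merged["Latitude"].isna(), "Postal Code"].tolist()
print(missing)
```

An empty `missing` list would indicate that all postal codes were matched.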


Let's check some coordinates of zip codes we know.

In [73]:
toronto_df.loc[toronto_df["Postal Code"] == 'M5G']

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383


In [75]:
toronto_df.loc[toronto_df["Postal Code"] == 'M2H']

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
27,M2H,North York,Hillcrest Village,43.803762,-79.363452
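
Besides spot-checking known zip codes, a rough automatic check is to verify that all coordinates fall inside a bounding box around Toronto. This is a sketch only: the box limits below are generous approximations chosen for illustration, not official city boundaries, and `sample` contains just the two rows shown above.

```python
import pandas as pd

# Approximate bounding box around Toronto (assumed, generous limits)
LAT_MIN, LAT_MAX = 43.5, 43.9
LNG_MIN, LNG_MAX = -79.7, -79.1

sample = pd.DataFrame({
    "Postal Code": ["M5G", "M2H"],
    "Latitude": [43.657952, 43.803762],
    "Longitude": [-79.387383, -79.363452],
})

# True for rows whose latitude and longitude both lie inside the box
inside = (
    sample["Latitude"].between(LAT_MIN, LAT_MAX)
    & sample["Longitude"].between(LNG_MIN, LNG_MAX)
)

# Postal codes falling outside the box would need a closer look
outliers = sample.loc[~inside, "Postal Code"].tolist()
print(outliers)
```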


## Getting data about businesses in Toronto neighborhoods from Foursquare.com