
Part 1: Creating a postal code/neighbourhood dataframe

Start by installing BeautifulSoup (bs4 package) and importing other required modules: requests for retrieving HTML data, and pandas for creating a data frame.

In [125]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


First, we use requests to retrieve the contents of the specified Wikipedia page. Then we create a BeautifulSoup object ('to_soup') and use it to locate the table of postal codes ('to_postal') and its rows ('to_postal_table').

For each row in to_postal_table, we collect each of the entries on that row (marked with <td> </td> tags in the HTML), strip any surrounding whitespace and newline characters, and append the row as one element of a temporary list.

Next, we convert the temporary list to a DataFrame object, remove an errant empty row, and remove any postal codes for which the Borough or Neighbourhood is 'Not assigned'. Finally, we display the result with a fresh index to adjust for the removed rows (reset_index() returns a copy, so to_df keeps its original row labels).

In [126]:
# Download the Wikipedia page and parse it
to_data = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text
to_soup = BeautifulSoup(to_data, "html5lib")
to_postal = to_soup.find('table', {'class':'wikitable sortable'})
to_postal_table = to_postal.find_all('tr')

# Collect the stripped text of every <td> cell, one list per table row
temp = []
for row in to_postal_table:
    cells = row.find_all('td')
    temp.append([cell.text.strip() for cell in cells])

to_df = pd.DataFrame(temp, columns=["PostalCode", "Borough", "Neighbourhood"])
to_df = to_df.iloc[1:]  # drop the first row, which comes out empty
to_df = to_df[to_df.Borough != 'Not assigned']
to_df = to_df[to_df.Neighbourhood != 'Not assigned']
# reset_index() returns a new DataFrame, displayed below; to_df itself is unchanged
to_df.reset_index()
#print(to_df)

Unnamed: 0,index,PostalCode,Borough,Neighbourhood
0,3,M3A,North York,Parkwoods
1,4,M4A,North York,Victoria Village
2,5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,6,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...,...
98,161,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,166,M4Y,Downtown Toronto,Church and Wellesley
100,169,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,170,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Lastly, the shape of our DataFrame is:

In [127]:
to_df.shape

(103, 3)
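
The filtering step and its effect on the row labels can be illustrated on a few made-up rows. This is only a sketch: the toy values and the names toy, kept and renumbered are assumptions for illustration and are not part of the scraped data.

```python
import pandas as pd

# Toy rows standing in for the scraped table (values are illustrative only).
toy = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
        ["M2A", "Not assigned", "Not assigned"],
        ["M4A", "North York", "Victoria Village"],
    ],
    columns=["PostalCode", "Borough", "Neighbourhood"],
)

# Keep only rows with an assigned borough and neighbourhood.
mask = (toy["Borough"] != "Not assigned") & (toy["Neighbourhood"] != "Not assigned")
kept = toy[mask]

# Filtering keeps the original row labels of the surviving rows.
original_labels = list(kept.index)

# reset_index returns a new frame; drop=True discards the old labels
# instead of storing them in an extra "index" column.
renumbered = kept.reset_index(drop=True)
new_labels = list(renumbered.index)

print(original_labels, new_labels)
```

Assigning the result back (as in renumbered above) is what makes the new numbering persist; calling reset_index() without assignment only displays a renumbered copy.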

Part 2: Finding latitude/longitude for each postal code

Since the geocoder package is being fussy, we'll use the helpfully supplied .csv file specified in the assignment.

We start by adding "latitude" and "longitude" columns to our Toronto DataFrame, and then compare each PostalCode value with the "Postal Code" column of the provided .csv to retrieve the coordinates for each postal code. This isn't an elegant solution, since it involves two nested loops, but the two lists being compared are relatively short.

In [128]:
coords = pd.read_csv('/content/Geospatial_Coordinates.csv')

# Start with missing values so unmatched postal codes can be dropped afterwards
to_df["latitude"] = None
to_df["longitude"] = None

# Compare every postal code in to_df against every row of the coordinates file
for i, row in to_df.iterrows():
  for j, entry in coords.iterrows():
    if entry["Postal Code"] == row["PostalCode"]:
      to_df.loc[i, "latitude"] = entry.iloc[1]   # second column of the .csv
      to_df.loc[i, "longitude"] = entry.iloc[2]  # third column of the .csv
to_df = to_df.dropna()
print(to_df)

    PostalCode           Borough  ... latitude longitude
3          M3A        North York  ...  43.7533  -79.3297
4          M4A        North York  ...  43.7259  -79.3156
5          M5A  Downtown Toronto  ...  43.6543  -79.3606
6          M6A        North York  ...  43.7185  -79.4648
7          M7A  Downtown Toronto  ...  43.6623  -79.3895
..         ...               ...  ...      ...       ...
161        M8X         Etobicoke  ...  43.6537  -79.5069
166        M4Y  Downtown Toronto  ...  43.6659  -79.3832
169        M7Y      East Toronto  ...  43.6627  -79.3216
170        M8Y         Etobicoke  ...  43.6363  -79.4985
179        M8Z         Etobicoke  ...  43.6288   -79.521

[103 rows x 5 columns]
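
For comparison, the same lookup can be expressed as a single pandas merge instead of two loops. The sketch below uses made-up rows, and the column names "Latitude" and "Longitude" are assumptions about the .csv layout (the notebook itself only refers to those columns by position).

```python
import pandas as pd

# Toy stand-ins for the neighbourhood table and the coordinates file
# (values are illustrative only; "Latitude"/"Longitude" are assumed names).
areas = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Nowhere"],
})
coords_toy = pd.DataFrame({
    "Postal Code": ["M4A", "M3A"],
    "Latitude": [43.7259, 43.7533],
    "Longitude": [-79.3156, -79.3297],
})

# An inner merge keeps only postal codes present in both tables,
# so codes without coordinates (M9Z here) are dropped automatically.
merged = (
    areas.merge(coords_toy, left_on="PostalCode", right_on="Postal Code", how="inner")
         .drop(columns="Postal Code")
         .rename(columns={"Latitude": "latitude", "Longitude": "longitude"})
)
print(merged)
```

A merge also leaves the coordinate columns as floats, whereas filling placeholder columns row by row tends to leave them with the generic object dtype.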


Part 3: Neighbourhood clustering

We will cluster the neighbourhoods in the same way as the New York neighbourhood example in this week's lab. We start by importing the required modules. We've already created the Toronto equivalent of 'newyork_data.json' earlier in this week's project, so we can skip ahead to using Folium to create a map of Toronto annotated with its postal codes and neighbourhoods; to do this, we first initialize a Folium map using Toronto's latitude and longitude.


In [129]:
import numpy as np
import json
from geopy.geocoders import Nominatim
from pandas import json_normalize  # the pandas.io.json import path is deprecated since pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

In [130]:
address = 'Toronto, Ontario'
geolocator = Nominatim(user_agent="to_explorer")
to_loc = geolocator.geocode(address)
to_lat = to_loc.latitude
to_long = to_loc.longitude
print("The geographical coordinates of Toronto ON are {} {}.".format(to_lat, to_long))

The geographical coordinates of Toronto ON are 43.6534817 -79.3839347.


In [131]:
to_map = folium.Map(location=[to_lat, to_long], zoom_start=10)
for lat, long, borough, neighborhood in zip(to_df['latitude'], to_df['longitude'], to_df['Borough'], to_df['Neighbourhood']):
  label = '{}, {}'.format(neighborhood, borough)
  label = folium.Popup(label, parse_html=True)
  folium.CircleMarker(
        [lat, long],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc').add_to(to_map)  # parse_html belongs to Popup, not CircleMarker
  
to_map

Let's focus on North York.

In [132]:
noyo_df = to_df[to_df['Borough'] == 'North York'].reset_index(drop=True)
noyo_address = 'North York, Ontario'
geolocator = Nominatim(user_agent="to_explorer")
noyo_loc = geolocator.geocode(noyo_address)
noyo_lat = noyo_loc.latitude
noyo_long = noyo_loc.longitude
print("The geographical coordinates of North York are {}, {}.".format(noyo_lat, noyo_long))

noyo_map = folium.Map(location=[noyo_lat, noyo_long], zoom_start=11)
for lat, long, label in zip(noyo_df['latitude'], noyo_df['longitude'], noyo_df['PostalCode']):
  label = folium.Popup(label, parse_html=True)
  folium.CircleMarker(
      [lat, long],
      radius=5,
      popup=label,
      color='red',
      fill=True,
      fill_color='#f99494',
      fill_opacity=0.7).add_to(noyo_map)

noyo_map

The geographical coordinates of North York are 43.7543263, -79.44911696639593.


Next we collect Foursquare data on the venues within a 500 meter radius of the postal code centers, including exact location and venue category for each. We'll work from the 'getNearbyVenues' function introduced in this week's lab.

In [133]:
CLIENT_ID="YOUR_CLIENT_ID"  # placeholder: substitute your own Foursquare client ID
CLIENT_SECRET="YOUR_CLIENT_SECRET"  # placeholder: keep the real secret out of shared notebooks
VERSION='20180605'
LIMIT=100

def getNearbyVenues(names, lats, longs, radius=500):
  venues_list = []
  for name, lat, lng in zip(names, lats, longs):
    #print(name, lat, lng)
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        lng,
        radius,
        LIMIT)

    results = requests.get(url).json()["response"]['groups'][0]['items']
    # One tuple per venue, tagged with the postal code it was found near
    venues_list.append([(
        name,
        lat,
        lng,
        v['venue']['name'],
        v['venue']['location']['lat'],
        v['venue']['location']['lng'],
        v['venue']['categories'][0]['name']) for v in results])

  # Build the DataFrame once, after all postal codes have been queried
  nearby_venues = pd.DataFrame(
      [item for venue_list in venues_list for item in venue_list],
      columns=['PostalCode', 'PC latitude', 'PC longitude', 'Venue', 'V latitude', 'V longitude', 'Category'])

  return nearby_venues
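
To see how a response is flattened into rows without calling the API, here is a minimal offline sketch. The payload is hand-written to mimic only the fields that getNearbyVenues reads; it is not a real Foursquare response, and flatten_items is a hypothetical helper introduced for illustration.

```python
import pandas as pd

# Hand-written payload mimicking only the fields the function reads;
# it is NOT a real API response.
sample = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Example Cafe",
                               "location": {"lat": 43.75, "lng": -79.33},
                               "categories": [{"name": "Café"}]}},
                    {"venue": {"name": "Example Park",
                               "location": {"lat": 43.76, "lng": -79.32},
                               "categories": [{"name": "Park"}]}},
                ]
            }
        ]
    }
}

def flatten_items(payload, code, lat, lng):
    """Turn one payload into a list of row tuples (hypothetical helper)."""
    items = payload["response"]["groups"][0]["items"]
    return [
        (code, lat, lng,
         v["venue"]["name"],
         v["venue"]["location"]["lat"],
         v["venue"]["location"]["lng"],
         v["venue"]["categories"][0]["name"])
        for v in items
    ]

rows = flatten_items(sample, "M3A", 43.7533, -79.3297)
venues = pd.DataFrame(rows, columns=[
    "PostalCode", "PC latitude", "PC longitude",
    "Venue", "V latitude", "V longitude", "Category"])
print(venues)
```

Each venue becomes one row that repeats the postal code and its coordinates, which is what later allows grouping venues by postal code.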

In [134]:
noyo_venues = getNearbyVenues(names=noyo_df['PostalCode'],lats=noyo_df['latitude'],longs=noyo_df['longitude'])
#print(noyo_venues)
print("There are {} unique venue categories in North York.".format(len(noyo_venues['Category'].unique())))

There are 103 unique venue categories in North York.


Now we can begin analyzing the postal code regions. Again we work from this week's lab, generating a one-hot encoding table for each region and venue category, and then calculating mean frequencies of each type of venue. This will be the quantitative basis for our clustering later on.

In [164]:
noyo_onehot = pd.get_dummies(noyo_venues[['Category']], prefix = "", prefix_sep = "")
noyo_onehot['PostalCode'] = noyo_venues['PostalCode']
noyo_fixed_cols = [noyo_onehot.columns[-1]] + list(noyo_onehot.columns[:-1])
noyo_onehot = noyo_onehot[noyo_fixed_cols]

noyo_grouped = noyo_onehot.groupby('PostalCode').mean().reset_index()

def return_most_common_venues(row, num_top_venues):
  row_categories = row.iloc[1:]
  row_categories_sorted = row_categories.sort_values(ascending=False)
  return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10
indicators = ['st', 'nd', 'rd']
columns = ['PostalCode']
for ind in np.arange(num_top_venues):
  try:
    columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
  except IndexError:
    # Beyond 'st', 'nd', 'rd' every ordinal takes the 'th' suffix
    columns.append('{}th Most Common Venue'.format(ind+1))

noyo_venues_sorted = pd.DataFrame(columns=columns)
noyo_venues_sorted['PostalCode'] = noyo_grouped['PostalCode']
for ind in np.arange(noyo_grouped.shape[0]):
  noyo_venues_sorted.iloc[ind, 1:] = return_most_common_venues(noyo_grouped.iloc[ind, :], num_top_venues)

noyo_venues_sorted

Unnamed: 0,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M2H,Mediterranean Restaurant,Golf Course,Dog Run,Pool,Women's Store,Department Store,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping
1,M2J,Clothing Store,Coffee Shop,Fast Food Restaurant,Restaurant,Japanese Restaurant,Shoe Store,Food Court,Bank,Bakery,Toy / Game Store
2,M2K,Chinese Restaurant,Café,Bank,Japanese Restaurant,Dim Sum Restaurant,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop
3,M2M,Park,Women's Store,Chinese Restaurant,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cupcake Shop
4,M2N,Ramen Restaurant,Pizza Place,Café,Restaurant,Sandwich Place,Coffee Shop,Shopping Mall,Middle Eastern Restaurant,Pet Store,Movie Theater
5,M2P,Park,Construction & Landscaping,Convenience Store,Women's Store,Chinese Restaurant,Clothing Store,Coffee Shop,Comfort Food Restaurant,Cosmetics Shop,Cupcake Shop
6,M2R,Coffee Shop,Pharmacy,Pizza Place,Grocery Store,Discount Store,Women's Store,Department Store,Clothing Store,Comfort Food Restaurant,Construction & Landscaping
7,M3A,Park,Food & Drink Shop,Women's Store,Dessert Shop,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop
8,M3B,Japanese Restaurant,Caribbean Restaurant,Café,Gym,Women's Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop
9,M3C,Coffee Shop,Beer Store,Gym,Restaurant,Dim Sum Restaurant,Discount Store,Clothing Store,Japanese Restaurant,Italian Restaurant,Bike Shop
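
The one-hot and mean-frequency steps are easier to follow on a tiny made-up table. The sketch below is illustrative only: the postal codes "A" and "B", the venue rows and the names toy_venues, freq and top are all assumptions.

```python
import pandas as pd

# Toy venue list: three venues near postal code "A", three near "B".
toy_venues = pd.DataFrame({
    "PostalCode": ["A", "A", "A", "B", "B", "B"],
    "Category": ["Park", "Park", "Café", "Bank", "Bank", "Café"],
})

# One column per category, with 1.0 marking the category of each venue.
onehot = pd.get_dummies(toy_venues["Category"], dtype=float)
onehot.insert(0, "PostalCode", toy_venues["PostalCode"])

# The mean of the indicator columns is the share of venues in each category.
freq = onehot.groupby("PostalCode").mean()

# The most common category per postal code is the column with the largest share.
top = {code: row.sort_values(ascending=False).index[0]
       for code, row in freq.iterrows()}
print(freq)
print(top)
```

Here two of the three venues near "A" are parks, so its Park frequency is 2/3. Categories with a frequency of zero still appear in the sorted ranking, which is why sparse postal codes can list categories they do not actually contain among their lower-ranked venues.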


Now that we have our frequency table of venue categories for each postal code, we can perform k-means clustering. As with this week's lab, we'll use 5 clusters.

In [161]:
noyo_grouped_clustering = noyo_grouped.drop(columns='PostalCode')
kmeans = KMeans(n_clusters=5, random_state=0).fit(noyo_grouped_clustering)

noyo_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
noyo_merged = noyo_df.join(noyo_venues_sorted.set_index('PostalCode'), on='PostalCode')
# Postal codes with no venues listed (index 12 here) couldn't be clustered, so drop them
noyo_merged = noyo_merged.dropna(subset=['Cluster Labels'])
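
As a rough illustration of what k-means does with a frequency table, the sketch below clusters four made-up rows into two groups. The numbers and the name toy_freq are assumptions; n_init is set explicitly only to keep the behaviour the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy frequency rows: columns could be read as (park share, cafe share).
# The first two rows are park-heavy, the last two are cafe-heavy.
toy_freq = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

# Fit two clusters; random_state makes the run repeatable.
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_freq)
toy_labels = toy_kmeans.labels_
print(toy_labels)
```

Rows with similar venue mixes receive the same label. The label numbers themselves are arbitrary, so only the grouping carries meaning, not whether a group is called 0 or 1.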

In [163]:
noyo_clusters = folium.Map(location=[noyo_lat, noyo_long], zoom_start=11)
# One colour per cluster, sampled evenly from the rainbow colormap
kclusters = 5
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]
for lat, lng, poi, cluster in zip(noyo_merged['latitude'], noyo_merged['longitude'], noyo_merged['PostalCode'], noyo_merged['Cluster Labels']):
  label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
  folium.CircleMarker(
      [lat, lng],
      radius=5,
      popup=label,
      color=rainbow[int(cluster)-1],
      fill=True,
      fill_color=rainbow[int(cluster)-1],
      fill_opacity=0.7).add_to(noyo_clusters)
noyo_clusters

The map shows that most neighbourhoods in North York are similar (purple dots), while a small number have a different distribution of venues (red, blue, orange and green dots).