# IBM Applied Data Science Capstone
## WEEK 3 - PEER-GRADED ASSIGNMENT 

__Author__: Bart Onkenhout  
__Created__: 2019-03-25  
__Modified__: 2019-03-25

### Introduction
This Python notebook will primarily be used for the capstone project in Weeks 1 through 5 of the IBM Applied Data Science Capstone, which counts towards the IBM Data Science Professional Certificate.

### Assignment
This week's assignment is to create a notebook in which we:
1. scrape a Wikipedia page on Canadian postal codes,
2. create a _pandas_ DataFrame out of the condensed & scraped data,
3. process and clean the DataFrame according to our specifications,
4. append geolocation data to each postal code,
5. cluster the neighborhoods,
6. visualize and explain our analysis, and
7. push the notebook to GitHub.

### Clear as mud? Great! Let's get started.

---

## Step 0: Setup

#### Installs & imports
Let's start by importing the necessary libraries. As usual, we'll need NumPy and _pandas_. The assignment also recommends the BeautifulSoup library for parsing the XML & HTML returned from our web scrape. According to the documentation, BeautifulSoup can be imported from the bs4 module.

We'll also import the geocoder library to get the latitude and longitude for our postal codes, and to supplement our API calls if necessary. Since I'm developing this notebook in Watson Studio, I'll need to run a shell command (the `!pip` lines below) to have the pip package manager install it before it becomes available for import.

Finally, we also need to import the requests library in order to handle all our HTTP requests.

In [1]:
!pip install folium
!pip install geocoder

Requirement not upgraded as not directly required: folium in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages
Requirement not upgraded as not directly required: numpy in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from folium)
Requirement not upgraded as not directly required: six in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from folium)
Requirement not upgraded as not directly required: requests in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from folium)
Requirement not upgraded as not directly required: jinja2 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from folium)
Requirement not upgraded as not directly required: branca>=0.3.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from folium)
Requirement not upgraded as not directly required: chardet<3.1.0,>=3.0.2 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from requests->folium)
Requirement not upgraded as not directly required: idna<2.7,

In [2]:
from bs4 import BeautifulSoup
import folium
from folium import plugins
import geocoder
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
from sklearn.cluster import KMeans

#### Constants
Years ago, a programmer friend of mine was looking over my shoulder as I was developing some code and let out an ear-piercing wail when he found out I was defining my constants in the code as I went, instead of putting them all at the head of the script so I could easily change them later. To this day I still follow his advice. So let's put our constants at the start, lest I draw his ire. Again.

In [3]:
URI = {
    'WIKI_POSTAL_CODES': r'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M',
    'BACKUP_LAT_LONG': r'https://cocl.us/Geospatial_data',
}

## Step 1: Scrape the Wikipedia page

Let's go ahead and instantiate a BeautifulSoup parser. This will help us scrape the postal codes from Wikipedia. Looking at the documentation, BeautifulSoup takes html_doc as an argument. However, we need to be careful because we can't just plug in our Wikipedia URL and call it a day. First we need to fetch the document, pull out the response body, and _then_ hand it off to the parser.

In [4]:
wiki_doc = requests.get(URI['WIKI_POSTAL_CODES']).content
soup = BeautifulSoup(wiki_doc, 'html.parser')
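The fetch above quietly assumes the request succeeds. Here's a minimal, slightly more defensive sketch; the `fetch_html` helper name and the 10-second timeout are my own assumptions, not something the assignment asks for. It raises on HTTP errors instead of handing an error page to the parser.

```python
import requests


def fetch_html(url, timeout=10, session=None):
    """Return the body of `url` as text, raising on HTTP error statuses.

    `fetch_html` is a hypothetical helper. `session` can be any object with a
    requests-style `get` method, which makes the function easy to try offline.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text         # decoded str; `.content` would be the raw bytes
```

With this in place, the parser line would become something like `BeautifulSoup(fetch_html(URI['WIKI_POSTAL_CODES']), 'html.parser')`.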

Okay, let's take a look at our results.

In [5]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":876823784,"wgRevisionId":876823784,"wgArticleId":539066,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Communications in Ontario","Postal codes in Canada","Toronto","Ontario-related lists"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wg

What a mess. All this for a Wikipedia page. But it looks like we're in luck -- the table we want is stored as a standard HTML ```<table>``` element, which we can extract using the .table attribute. So let's take a look at that.

In [6]:
print(soup.table.prettify())

<table class="wikitable sortable">
 <tbody>
  <tr>
   <th>
    Postcode
   </th>
   <th>
    Borough
   </th>
   <th>
    Neighbourhood
   </th>
  </tr>
  <tr>
   <td>
    M1A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M2A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M3A
   </td>
   <td>
    <a href="/wiki/North_York" title="North York">
     North York
    </a>
   </td>
   <td>
    <a href="/wiki/Parkwoods" title="Parkwoods">
     Parkwoods
    </a>
   </td>
  </tr>
  <tr>
   <td>
    M4A
   </td>
   <td>
    <a href="/wiki/North_York" title="North York">
     North York
    </a>
   </td>
   <td>
    <a href="/wiki/Victoria_Village" title="Victoria Village">
     Victoria Village
    </a>
   </td>
  </tr>
  <tr>
   <td>
    M5A
   </td>
   <td>
    <a href="/wiki/Downtown_Toronto" title="Downtown Toronto">
     Downtown Toronto
    </a>
   </td>
   <td>
    <a href="

## Step 2: Create a _pandas_ DataFrame out of the scraped data

Luckily, _pandas_ can read HTML tables using the read_html() function. First, however, soup.table must be cast to a string. We'll also tell read_html() explicitly that row 0 holds the column headers instead of relying on what it infers. Since read_html() returns a _list_ of DataFrame objects, even if only one table is found, we will also need to provide the list index of the desired DataFrame (in this case, 0).

In [7]:
raw_table = str(soup.table)
df = pd.read_html(raw_table, header=0)[0]
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
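
For comparison, the same rows can be pulled out of the soup by hand, which is handy when read_html() is missing one of its optional parser dependencies. This is only a sketch: `table_to_records` is a hypothetical helper of mine, and `SAMPLE_HTML` is a tiny stand-in I typed up to mimic the structure of the Wikipedia table.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the Wikipedia table (assumed structure: one header row of
# <th> cells followed by data rows of <td> cells).
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td>Parkwoods</td></tr>
</table>
"""


def table_to_records(table):
    """Convert a parsed <table> element into a list of {header: cell_text} dicts."""
    rows = table.find_all('tr')
    headers = [th.get_text(strip=True) for th in rows[0].find_all('th')]
    records = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        if len(cells) == len(headers):  # skip malformed or empty rows
            records.append(dict(zip(headers, cells)))
    return records


sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
records = table_to_records(sample_soup.table)
print(records)
```

Passing the resulting list to `pd.DataFrame(records)` should give a DataFrame with the same three columns.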


## Step 3: Process and clean the DataFrame

That's looking an awful lot like the table we want! Now let's clean it up according to the assignment rules. Let's drop any ```Not assigned``` __boroughs__, set any ```Not assigned``` __neighborhoods__ equal to their borough name, and then group any neighborhoods with the same postcode together. The last part is a bit tricky. However, we can use a group-by together with a lambda function on the 'Neighbourhood' column to concatenate everything together.

Finally, we'll need to reset the index to get everything back into a nice table format.

In [8]:
# Drop all rows whose Borough is 'Not assigned' (copy so later assignments don't write to a view)
df = df[df['Borough'] != 'Not assigned'].copy()

# Replace each 'Not assigned' Neighbourhood value with the Borough value from the same row
not_assigned = df['Neighbourhood'] == 'Not assigned'
df.loc[not_assigned, 'Neighbourhood'] = df.loc[not_assigned, 'Borough']

# Group on Postcode and Borough, and use a lambda function to build a comma-separated list out of all Neighbourhoods in the same postal code
df = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(lambda n: ', '.join(n)).reset_index()
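
To see the three cleaning rules in isolation, here's a toy version on four hand-made rows. The row values are chosen purely for illustration, and I'm using boolean masks and `.agg()` here, which is one of several ways to express the same steps.

```python
import pandas as pd

# Toy rows mimicking the scraped table (values chosen for illustration only)
toy = pd.DataFrame({
    'Postcode':      ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough':       ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. Drop rows without a borough
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# 2. Fall back to the borough name where the neighbourhood is missing
missing = toy['Neighbourhood'] == 'Not assigned'
toy.loc[missing, 'Neighbourhood'] = toy.loc[missing, 'Borough']

# 3. Collapse to one row per postcode, neighbourhoods joined with ', '
toy = (toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
          .agg(', '.join)
          .reset_index())

print(toy)
```

The M1A row disappears, the two M5A rows collapse into one, and M7A picks up its borough name as its neighbourhood.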

Finally, let's print the shape. Note that the assignment is unclear as to whether to print the shape of the complete DataFrame, or just the DataFrame containing the postal codes listed on the assignment page. I'm going to assume we want the full DataFrame.

In [9]:
print(df.shape[0], 'rows,', df.shape[1], 'columns')

103 rows, 3 columns


## Step 4: Append geolocation data to each postal code

Looking at the DataFrame, it seems that the first three characters in the Postcode will probably be the most useful in clustering the neighborhoods. After all, this is how we do most population analyses in the United States -- by ZIP code and ZIP code stem. I've already tried both the Nominatim and Geocoder libraries, and their results are too unreliable, so I will grab the desired coordinates from the provided file.

In [10]:
!wget -O geospatial_data.csv https://cocl.us/Geospatial_data

--2019-03-25 21:27:54--  https://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 169.48.113.201
Connecting to cocl.us (cocl.us)|169.48.113.201|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-03-25 21:27:55--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.27.197
Connecting to ibm.box.com (ibm.box.com)|107.152.27.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-03-25 21:27:55--  https://ibm.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Reusing existing connection to ibm.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-03-25 

In [11]:
df_coords = pd.read_csv('geospatial_data.csv')
df_coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Looks like my hunch was right -- latitude and longitude are indexed on postal code here. So let's go ahead and merge the two DataFrames on that key. Note that the CSV uses a different header name for that column than the one scraped from Wikipedia. So we'll have to make a slight tweak to the column headers in the newly-read DataFrame just to keep things simple when merging with the master DataFrame.

In [12]:
df_coords.columns = ['Postcode', 'Latitude', 'Longitude']
df = df.merge(right=df_coords, how='left', on='Postcode')
df

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


Looks like it worked. Again, I tried to use the Geocoder library, but it kept returning ```None``` results no matter what. Then I tried the Nominatim equivalent from OSM, but this generated too many unreliable results. So using the pre-filled CSV file provided by the instructor was the most efficient move in my case. YMMV.
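
Eyeballing the output is fine for ten rows, but a left merge will silently leave NaN coordinates for any postal code missing from the CSV. Here's a small sketch of how that could be checked, using made-up sample frames (the `M9Z` code and the `Made-up` borough are invented for the example):

```python
import pandas as pd

left = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z'],
                     'Borough': ['Scarborough', 'Scarborough', 'Made-up']})
coords = pd.DataFrame({'Postcode': ['M1B', 'M1C'],
                       'Latitude': [43.806686, 43.784535],
                       'Longitude': [-79.194353, -79.160497]})

# validate='one_to_one' raises if either side has duplicate postcodes;
# indicator=True adds a '_merge' column saying where each row came from.
merged = left.merge(coords, on='Postcode', how='left',
                    validate='one_to_one', indicator=True)

# Postcodes that found no match in the coordinates table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```

An empty list here would mean every postal code picked up a latitude and longitude.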

## Step 5: Cluster the neighborhoods

At this point I've gone back up to the start of the notebook and imported the cluster and pyplot packages from sklearn and matplotlib, respectively. This is purely for aesthetic reasons -- I like to have all my imports at the start of the script in case I need to change things around.

Before we initialize our KMeans model, we need to first determine which parameters we are going to use. KMeans takes _n_clusters_ and _n_init_ as arguments: the number of clusters to form, and the number of times the algorithm is run with different initial centroid seeds (the run with the lowest inertia is kept). Picking these is largely done by feel, so we'll need to experiment around a bit.

My hunch, based on absolutely no evidence whatsoever, is we could probably first try generating as many clusters as there are total Boroughs. Then, let's iterate with, say, 20 different centroid seeds to get started.

In [13]:
n_boroughs = len(df['Borough'].unique())
n_init = 20

k_means_model = KMeans(init='k-means++', n_clusters=n_boroughs, n_init=n_init)
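
Rather than going purely on a hunch, one common sanity check is the elbow method: fit the model for a range of cluster counts and look for the point where the inertia stops dropping sharply. Below is a minimal sketch on synthetic data. The three made-up blobs are an assumption standing in for the real coordinates, so the numbers say nothing about Toronto.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for a (latitude, longitude) feature matrix:
# three well-separated blobs of 30 points each.
rng = np.random.RandomState(0)
centers = np.array([[43.65, -79.38], [43.77, -79.25], [43.70, -79.55]])
points = np.vstack([c + rng.normal(scale=0.01, size=(30, 2)) for c in centers])

# Fit one model per candidate k and record the inertia
# (sum of squared distances from each point to its closest centroid).
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    inertias[k] = model.fit(points).inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 4))
```

On data like this the inertia should fall steeply up to k=3 and flatten afterwards. Real postal-code data will rarely show such a clean elbow, so treat it as one input alongside the hunch.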

Now we need to fit the model to a _feature matrix_. In this case, our feature matrix is actually the latitude and longitude, since those are the two dimensions along which the postal codes are segmented. Right? Right??

In [14]:
feature_matrix = df.iloc[:,[3,4]].values
feature_matrix.shape

(103, 2)

Luckily, we can easily cast from a DataFrame to a NumPy array using the .iloc indexer and the .values attribute. Our feature matrix has the same number of rows as the original DataFrame, so let's throw it into the model and see what comes out.
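
One caveat before fitting: k-means measures straight-line (Euclidean) distance, and away from the equator a degree of longitude covers less ground than a degree of latitude (at Toronto's latitude, roughly 80 km versus 111 km). Clustering on raw degrees therefore overweights east-west separation a bit. Here's a minimal sketch of rescaling coordinates to approximate kilometres first. The `to_local_km` helper, the 111.2 km constant, and the reference latitude of 43.7 are my own illustrative assumptions, and this notebook sticks with raw degrees.

```python
import numpy as np

KM_PER_DEG_LAT = 111.2  # approximate length of one degree of latitude, in km


def to_local_km(lat_lng, ref_lat):
    """Project (lat, lng) degrees to approximate (north, east) kilometres.

    Uses an equirectangular approximation around `ref_lat`, which is reasonable
    at city scale. `to_local_km` is a hypothetical helper, not a library call.
    """
    lat_lng = np.asarray(lat_lng, dtype=float)
    north = lat_lng[:, 0] * KM_PER_DEG_LAT
    east = lat_lng[:, 1] * KM_PER_DEG_LAT * np.cos(np.radians(ref_lat))
    return np.column_stack([north, east])


sample = np.array([[43.806686, -79.194353],
                   [43.806686, -78.194353]])  # second point: one degree further east
km = to_local_km(sample, ref_lat=43.7)
print(km[1, 1] - km[0, 1])  # east-west gap in km for one degree of longitude
```

The rescaled array could be passed to `fit()` in place of the raw feature matrix; the cluster centers would then come back in kilometres rather than degrees.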

In [15]:
k_means_model.fit(feature_matrix)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=11, n_init=20, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

Okay, looks like we have our first fit model. Let's first visualize some stuff before we see what the model created for us.

## Step 6: Visualize

I've gone back to the very top of the notebook again and installed + imported the Folium library. Now let's define a base map centered on Toronto, which, according to Google, sits at __43.761539 N__ and __79.411079 W__. So let's plug those in as initial parameters and set the zoom level to 12. Then we'll add all the neighborhoods in our original dataset as icons.

In [16]:
toronto_map = folium.Map(location=[43.761539, -79.411079], zoom_start=12)

for lat, lng, hood in zip(df['Latitude'], df['Longitude'], df['Neighbourhood']):
    folium.Marker(
        [lat, lng],
        tooltip=hood
    ).add_to(toronto_map)
    
toronto_map

I seriously cannot believe that worked the first time around! Looks pretty good, but it's missing the neighborhood clusters. Using the fitted k-means model we can extract a cluster label for each postal code, create a MarkerCluster for each neighborhood cluster, add every marker to the MarkerCluster it belongs to, and see what happens.

In [17]:
# Create a new DataFrame containing the Neighbourhood clusters generated by our k-means model
df_clust = df.join(pd.DataFrame(k_means_model.labels_, columns=['Neighbourhood_cluster']))

# Instantiate a dict to hold all our clusters -- this will work better than an array, since it allows us to select on any arbitrary key, not just an index
map_hood_clust = {}

# Instantiate a new map
toronto_map = folium.Map(location=[43.761539, -79.411079], zoom_start=10)

# Iterate over all the unique cluster names generated by the k-means model and create a MarkerCluster object for each, keyed on Neighborhood cluster name (i.e., the cluster number)
# Populate the MarkerCluster with the latitude & longitude values and the popup with the concatenated Neighborhood names
for clust_num in df_clust['Neighbourhood_cluster'].unique():
    map_hood_clust[clust_num] = folium.plugins.MarkerCluster().add_to(toronto_map)

# Add each neighborhood to its corresponding MarkerCluster on the map
for postcode, lat, lng, hood, hood_clust in zip(df_clust['Postcode'], df_clust['Latitude'], df_clust['Longitude'], df_clust['Neighbourhood'], df_clust['Neighbourhood_cluster']):
    # Create a nice HTML list of all the Neighbourhoods in each marker -- especially nice for those neighborhoods we've grouped together
    ttip_ul = ''.join('<li>%s</li>' % name for name in hood.split(', '))
    
    # Wrap the list items in an unordered list tag
    ttip_ul = '<ul>%s</ul>' % ttip_ul
    
    # Build the rest of the tooltip
    ttip='<p><b>Postal code:</b> %s</p><p><b>Neighborhood(s):</b>%s</p><p><b>Neighborhood cluster #: </b>%s</p>' % (postcode, ttip_ul, hood_clust)
    
    # And plot on the map
    folium.Marker(location=[lat, lng],
                  tooltip=ttip).add_to(map_hood_clust[hood_clust])

# Show that bad boy!
toronto_map
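
The cell above groups the markers by cluster but leaves every pin the same color. If I wanted colored pins, a small helper like the one below could map each cluster label to a color name. This is only a sketch: `cluster_color` and `ICON_COLORS` are hypothetical names of mine, and the palette is a subset of the color names that, as far as I know, `folium.Icon` accepts.

```python
# Color names accepted by folium.Icon(color=...), as far as I know.
ICON_COLORS = ['red', 'blue', 'green', 'purple', 'orange', 'darkred',
               'darkblue', 'darkgreen', 'cadetblue', 'pink', 'gray', 'black']


def cluster_color(cluster_label, palette=ICON_COLORS):
    """Map an integer cluster label to a color name, cycling if labels outnumber colors."""
    return palette[int(cluster_label) % len(palette)]


# With 11 clusters (labels 0 through 10), each label gets its own color
colors = [cluster_color(label) for label in range(11)]
print(colors)
```

In the marker loop, passing `icon=folium.Icon(color=cluster_color(hood_clust))` to `folium.Marker` should then color each pin by its cluster.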

<img src="https://media1.tenor.com/images/1dd03671ab0311a6ec446dd1ce4d91a9/tenor.gif" />

Looks like the clustering algorithm worked pretty well the first time around. Zooming out to level 10 you can see all polygons around each of the neighborhood clusters when hovering over each MarkerCluster. And playing around with the zoom levels of the map really shows how the algorithm decided to fit each cluster centroid. There's also a pretty dense clustering around downtown Toronto, with a very small distance between each node.

But this makes intuitive sense -- since we are using the postal code as the foundational grouping key for the base dataset, it is to be expected that the boroughs with the highest number of unique postal codes in the same general proximity will have the densest clusters. The downtown zones of most major metropolitan areas tend to have more postal codes, since there are a higher number of physical addresses to serve and therefore the national postal system needs to be able to spread this work over more service zones.

You can see this in action by aggregating the total number of postal codes per borough:

In [18]:
print(df.groupby('Borough')['Postcode'].count().sort_values(ascending=False))

Borough
North York          24
Downtown Toronto    18
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
York                 5
East York            5
East Toronto         5
Queen's Park         1
Mississauga          1
Name: Postcode, dtype: int64


But enough philosophizing. We want to know where all the neighborhoods are centered.

So let's go ahead and take the next step and plot __only the cluster centroids__ our model generated. This will give us a sense of which latitude and longitude we are going to base our Foursquare API calls around in the next module in order to find the best 'hood in Toronto!

In [19]:
k_means_model.cluster_centers_

array([[ 43.76583949, -79.41024904],
       [ 43.72860417, -79.26565143],
       [ 43.65431045, -79.38859451],
       [ 43.6337401 , -79.54939869],
       [ 43.7226207 , -79.55085645],
       [ 43.78255129, -79.3284457 ],
       [ 43.70479501, -79.39765512],
       [ 43.73068796, -79.47922513],
       [ 43.69012922, -79.32687005],
       [ 43.66382021, -79.4744486 ],
       [ 43.78989244, -79.20966014]])

In [20]:
toronto_map = folium.Map(location=[43.761539, -79.411079], zoom_start=12)

for coords in k_means_model.cluster_centers_:
    folium.Marker([coords[0], coords[1]]).add_to(toronto_map)
    
toronto_map
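
Since these centroids are meant to anchor the Foursquare calls later, it might help to know how far apart they actually are on the ground, for instance when picking a search radius. Here's a sketch using the haversine formula; the `haversine_km` helper and the idea of using these distances to choose a radius are my own assumptions, not part of the assignment.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Two of the centroids from the array printed by the model
centroid_a = (43.65431045, -79.38859451)
centroid_b = (43.6337401, -79.54939869)
print(round(haversine_km(*centroid_a, *centroid_b), 1))
```

Running this over every pair of centroids would show the smallest gap between clusters, which gives a rough upper bound for a per-centroid search radius that doesn't overlap its neighbors too much.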

## Et voilà!
I'm pretty happy with this first result, so I think I'll push this version to GitHub. However, I might tinker around with some of the parameters later on, such as increasing the number of clusters the algorithm spits out. This is because I do see a gaping hole without a neighborhood cluster around York. Again, this is probably because of bias in the way the postal system assigns codes in that particular borough -- it might be sparsely populated (such as a suburb), or just simply lack more than a handful of postal zones due to historical reasons.

In many ways, Toronto reminds me of Chicago, my current home city -- on one of the Great Lakes, _very_ dense downtown area that quickly ramps down into sparse suburbs, highly gridded streets, multiple airports, and an almost unaffordable cost of living ;). Let's see if that's an aspect we can explore in the next module!

---

\- B.