<h1>Coursera Capstone: Week 3</h1>

<h1>NOTE: This workbook includes all three parts of the assignment</h1>

<h2>Part 1 of assignment: Get postal code data and remove duplicates etc.</h2>

<h3>First, import libraries and set Pandas options</h3>

In [23]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 500)
pd.set_option('display.expand_frame_repr', False)

print('Libraries imported.')

Libraries imported.


<h3>Now use BeautifulSoup to scrape the Wikipedia page for postcode/borough/neighborhood data</h3>

In [24]:
import urllib.request
from bs4 import BeautifulSoup

# request contents of page from Wikipedia
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
req = urllib.request.urlopen(url)
article = req.read().decode()

# load article, turn into soup and get the <table>
soup = BeautifulSoup(article, 'html.parser')
tables = soup.find_all('table', class_='sortable')

# search through the tables for the one with the headings we want
for table in tables:
    ths = table.find_all('th')
    headings = [th.text.strip() for th in ths]
    if headings[:3] == ['Postcode', 'Borough', 'Neighborhood']:
        break

# create a new dataframe and populate
df = pd.DataFrame(columns=['Postcode', 'Borough', 'Neighborhood'])
for tr in table.find_all('tr'):
    tds = tr.find_all('td')
    if not tds:
        continue
    postcode, borough, neighborhood = [td.text.strip() for td in tds[:3]]
    if 'Not assigned' not in borough:
        if 'Not assigned' in neighborhood:
            neighborhood = borough
        df.loc[len(df)] = [postcode, borough, neighborhood]


In [25]:
# display our table
from tabulate import tabulate
print(tabulate(df, headers='keys', tablefmt='psql'))

+-----+------------+------------------+---------------------------------------------------+
|     | Postcode   | Borough          | Neighborhood                                      |
|-----+------------+------------------+---------------------------------------------------|
|   0 | M3A        | North York       | Parkwoods                                         |
|   1 | M4A        | North York       | Victoria Village                                  |
|   2 | M5A        | Downtown Toronto | Harbourfront                                      |
|   3 | M5A        | Downtown Toronto | Regent Park                                       |
|   4 | M6A        | North York       | Lawrence Heights                                  |
|   5 | M6A        | North York       | Lawrence Manor                                    |
|   6 | M7A        | Queen's Park     | Queen's Park                                      |
|   7 | M9A        | Etobicoke        | Islington Avenue                        

<h3>Next, we create a dataframe to concatenate neighborhoods and eliminate duplicates</h3>

In [26]:
# create final dataframe to be used to concatenate neighborhoods while eliminating duplicates
df_final = pd.DataFrame(columns=['Postcode', 'Borough', 'Neighborhood'])

# now iterate and find any duplicate rows
for index, row in df.iterrows():
    dupe_df = df_final.loc[df_final['Postcode'] == row['Postcode']]
    if dupe_df.empty:
        # no duplicate Postcode found - add as a new row
        # print ("No duplicate - adding as new row: " + row['Neighborhood']) # uncomment for debugging
        df_final.loc[len(df_final)] = [row['Postcode'], row['Borough'], row['Neighborhood']]
    else:
        # duplicate Postcode was found - concatenate the Neighborhood to the existing row
        # print ("Duplicate found for " + row['Postcode'] + " - adding neighborhood: " + row['Neighborhood']) # uncomment for debugging
        for dupe_index, dupe_row in dupe_df.iterrows():
            df_final.loc[dupe_index] = [row['Postcode'], row['Borough'], row['Neighborhood'] + "," + dupe_row['Neighborhood']]
        
print(tabulate(df_final, headers='keys', tablefmt='psql'))

+-----+------------+------------------+---------------------------------------------------------------------------------------------------------------------------------+
|     | Postcode   | Borough          | Neighborhood                                                                                                                    |
|-----+------------+------------------+---------------------------------------------------------------------------------------------------------------------------------|
|   0 | M3A        | North York       | Parkwoods                                                                                                                       |
|   1 | M4A        | North York       | Victoria Village                                                                                                                |
|   2 | M5A        | Downtown Toronto | Regent Park,Harbourfront                                                                                      

<h3>Now display the shape of our generated dataframe</h3>

In [27]:
# display the shape of our final dataframe
df_final.shape

(103, 3)
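The row-by-row merge in Part 1 can also be written as a single pandas `groupby`. The sketch below uses three rows copied from the scraped table shown earlier. It assumes each postcode belongs to exactly one borough, and it joins neighborhoods in their original order, whereas the loop above prepends each new neighborhood.

```python
import pandas as pd

# a few rows copied from the scraped table shown earlier
df = pd.DataFrame(
    [
        ['M3A', 'North York', 'Parkwoods'],
        ['M5A', 'Downtown Toronto', 'Harbourfront'],
        ['M5A', 'Downtown Toronto', 'Regent Park'],
    ],
    columns=['Postcode', 'Borough', 'Neighborhood'],
)

# one row per postcode, neighborhoods joined with commas
df_grouped = (
    df.groupby(['Postcode', 'Borough'], sort=False)['Neighborhood']
      .agg(','.join)
      .reset_index()
)
print(df_grouped)
```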

<h2>Part 2 of assignment: Add coordinates to dataframe</h2>

<h3>Attempt to use 'geopy' libraries to get coordinates first - please be patient - this takes a minute or two!</h3>

<h3>NOTE: The Nominatim interface only returns coordinates for a small number of postcodes, no matter how many times we try - so the next step is to use the provided .csv file to fill in the blanks</h3>

In [28]:
#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't loaded geopy previously

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# add new columns and initialize
df_final['Latitude'] = 0.0
df_final['Longitude'] = 0.0

# create the geocoder once and reuse it for every row
geolocator = Nominatim(user_agent="ny_explorer")

for index, row in df_final.iterrows():
    address = "{} {}, Toronto, Ontario".format(row['Postcode'], row['Borough'])

    # note - retrying in a loop until a location comes back doesn't seem to work:
    # many postcodes won't return a valid lat/long response via 'Nominatim' calls
    location = geolocator.geocode(address)

    if location:
        latitude = location.latitude
        longitude = location.longitude
        print("O", end="") # let the user know that we found coordinates
        df_final.loc[index, 'Latitude'] = latitude
        df_final.loc[index, 'Longitude'] = longitude
    else:
        print(".", end="") # let the user know we didn't find coordinates
        
print("") # add cr/lf and start new line        
print(tabulate(df_final, headers='keys', tablefmt='psql'))

....O.O.....OO...O..O.......O.O..OO........O........O......O...O.....O.......O...O........O...O........
+-----+------------+------------------+---------------------------------------------------------------------------------------------------------------------------------+------------+-------------+
|     | Postcode   | Borough          | Neighborhood                                                                                                                    |   Latitude |   Longitude |
|-----+------------+------------------+---------------------------------------------------------------------------------------------------------------------------------+------------+-------------|
|   0 | M3A        | North York       | Parkwoods                                                                                                                       |     0      |      0      |
|   1 | M4A        | North York       | Victoria Village                                                    
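Nominatim's public usage policy asks for at most one request per second, so pausing between lookups is a reasonable precaution. The sketch below is not the notebook's method and is not guaranteed to raise the hit rate. `geocode_with_delay` is a hypothetical helper that accepts any geocoding callable (for example `Nominatim(...).geocode`), and a stub stands in for the network call so the example runs offline.

```python
import time
from collections import namedtuple


def geocode_with_delay(addresses, geocode, min_delay_seconds=1.0, sleep=time.sleep):
    """Call geocode(address) for each address, pausing between calls.

    Returns a dict mapping address -> (latitude, longitude), or None
    when the geocoder found nothing.
    """
    results = {}
    for i, address in enumerate(addresses):
        if i > 0:
            sleep(min_delay_seconds)  # wait before every call except the first
        location = geocode(address)
        results[address] = (location.latitude, location.longitude) if location else None
    return results


# stub geocoder for illustration only; it knows a single address
Location = namedtuple('Location', ['latitude', 'longitude'])
known = {'M3A North York, Toronto, Ontario': Location(43.7533, -79.3297)}

found = geocode_with_delay(
    ['M3A North York, Toronto, Ontario', 'M4A North York, Toronto, Ontario'],
    geocode=known.get,
    sleep=lambda seconds: None,  # skip the real pause in this demo
)
print(found)
```

geopy also ships `geopy.extra.rate_limiter.RateLimiter`, which wraps a geocode function with a minimum delay and could replace the hand-written pause.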

<h3>Finally, fill in the blanks with the provided csv list of postal codes/coordinates</h3>

In [29]:
import io
import requests

# read the file which contains the Postal Codes / Coords
url = "http://cocl.us/Geospatial_data"
df_codes_coordinates = pd.read_csv(url)

# now iterate and replace coords in our dataframe
for index, row in df_final.iterrows():
    if row['Latitude'] == 0.0:
        coords_df = df_codes_coordinates.loc[df_codes_coordinates['Postal Code'] == row['Postcode']]
        for coords_index, coords_row in coords_df.iterrows():
            df_final.loc[index] = [row['Postcode'], row['Borough'], row['Neighborhood'], coords_row['Latitude'], coords_row['Longitude']]

print(tabulate(df_final, headers='keys', tablefmt='psql'))


+-----+------------+------------------+---------------------------------------------------------------------------------------------------------------------------------+------------+-------------+
|     | Postcode   | Borough          | Neighborhood                                                                                                                    |   Latitude |   Longitude |
|-----+------------+------------------+---------------------------------------------------------------------------------------------------------------------------------+------------+-------------|
|   0 | M3A        | North York       | Parkwoods                                                                                                                       |    43.7533 |    -79.3297 |
|   1 | M4A        | North York       | Victoria Village                                                                                                                |    43.7259 |    -79.3156 |
|   2 | M5A    
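The same fill-in step can be done without iterating over rows, by looking up each missing postcode in the csv data. This is a sketch on two rows taken from the tables above. It keeps the notebook's convention that `0.0` marks a missing coordinate and uses the csv column names from the code above (`Postal Code`, `Latitude`, `Longitude`).

```python
import pandas as pd

# two rows from the notebook's table; 0.0 marks "no coordinates yet"
df_final = pd.DataFrame({
    'Postcode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village'],
    'Latitude': [0.0, 43.7259],
    'Longitude': [0.0, -79.3156],
})

# stand-in for the downloaded csv, using the values shown in the output above
df_codes_coordinates = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.7533, 43.7259],
    'Longitude': [-79.3297, -79.3156],
})

# index the csv by postal code, then fill only the rows still at 0.0
lookup = df_codes_coordinates.set_index('Postal Code')
missing = df_final['Latitude'] == 0.0
for column in ['Latitude', 'Longitude']:
    df_final.loc[missing, column] = df_final.loc[missing, 'Postcode'].map(lookup[column])

print(df_final)
```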

<h2>Part 3 of assignment: Visualize neighborhoods</h2>

<h3>Display map with raw data - no clustering</h3>

In [30]:
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't installed folium previously
import folium # map rendering library

# create map of Toronto, centered on the mean of our neighborhood coordinates
map_toronto = folium.Map(location=[df_final['Latitude'].mean(), df_final['Longitude'].mean()], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df_final['Latitude'], df_final['Longitude'], df_final['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)

map_toronto

<h3>Perform clustering using KMeans</h3>

In [31]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

# keep only the coordinate columns for clustering
toronto_grouped_clustering = df_final.drop(columns=['Postcode', 'Borough', 'Neighborhood'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 2, 0, 1, 0, 4, 2, 3, 2, 0], dtype=int32)

<h3>Finally, display our clustered neighborhoods - you should see 5 different clusters</h3>

In [32]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create exact copy of df_final so we can re-run the code below if needed
df_final_merged = df_final.copy()

# add clustering labels
df_final_merged.insert(0, 'Cluster Labels', kmeans.labels_)

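# create map, centered on the mean of our neighborhood coordinates
map_clusters = folium.Map(location=[df_final_merged['Latitude'].mean(), df_final_merged['Longitude'].mean()], zoom_start=11)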

# set color scheme for the clusters (one rainbow color per cluster)
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df_final_merged['Latitude'], df_final_merged['Longitude'], df_final_merged['Neighborhood'], df_final_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<h3>NOTE: I found that this group of 5 clusters seems to provide a good grouping of the different neighborhoods</h3>
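One common way to sanity-check a choice of cluster count is to compare the KMeans inertia (within-cluster sum of squares) across several values of k and look for the point where it stops dropping sharply. The sketch below is an assumption about how that check could be run, not something the notebook did. It uses synthetic points as a stand-in for the coordinates; in the notebook you would pass `df_final[['Latitude', 'Longitude']]` instead.

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic stand-in for the Latitude/Longitude columns: three tight blobs
rng = np.random.default_rng(0)
centers = np.array([[43.65, -79.38], [43.75, -79.30], [43.70, -79.50]])
points = np.vstack([c + rng.normal(scale=0.005, size=(30, 2)) for c in centers])

# fit KMeans for k = 1..6 and record the inertia for each
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(points)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 4))
```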