# Applied Data Science Capstone Project
This Jupyter Notebook is part of my Capstone Project for the IBM Data Science Professional Certificate.

In this Notebook, we explore the neighborhoods of Toronto, Canada.

The assignment is broken down into three parts.  Section headers mark the beginning of my work for each part.  If you're running the Notebook in IBM Watson or similar, you can click a link below to jump to the top of that section; otherwise, please scroll down to find each section:
* [Section 1 - Data Collection](#section-1)
* [Section 2 - Data Enrichment](#section-2)
* [Section 3 - Exploration and Clustering](#section-3)

## Section 1 - Data Collection<a id='section-1'></a>
In this section we build a Pandas Dataframe of Canadian Postal Code data from the [List of postal codes of Canada: M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M "Wikipedia List of postal codes of Canada: M") Wikipedia page.  Some data cleansing is required: first, remove rows with a Borough of "Not assigned"; second, for any remaining rows with a Neighbourhood of "Not assigned", set the Neighbourhood to match the Borough.

In [1]:
# Import Pandas
import pandas as pd

# Use Pandas to process the web page's HTML
source_data = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

# The data we're interested in is the first table in the collection
df = source_data[0]
df.shape

(180, 3)

In [2]:
# We have to clean the data
# First, remove rows with Borough = "Not assigned"
df = df[df.Borough != 'Not assigned']
df.shape

(103, 3)

In [3]:
# Next we have to update remaining rows where Neighbourhood is "Not assigned" - it turns out there are no such entries
df[df.Neighbourhood == 'Not assigned'].shape

(0, 3)

In [4]:
# If there were Neighbourhood values of "Not assigned", replacing them with the Borough would be done by the following
# Note, rebuilding the column with assign() avoids warnings about updating slices
# df = df.assign(Neighbourhood=df['Neighbourhood'].mask(df['Neighbourhood'] == 'Not assigned', df['Borough']))
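
As a minimal sketch of the two cleaning rules, here they are applied to a small made-up table. The rows below are illustrative assumptions, not values taken from the Wikipedia page, and `Series.mask` is just one of several ways to do the substitution.

```python
import pandas as pd

# Illustrative sample rows (hypothetical, not from the Wikipedia table)
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Rule 1: drop rows whose Borough is "Not assigned"
cleaned = sample[sample["Borough"] != "Not assigned"]

# Rule 2: where Neighbourhood is "Not assigned", use the Borough instead
unassigned = cleaned["Neighbourhood"] == "Not assigned"
cleaned = cleaned.assign(
    Neighbourhood=cleaned["Neighbourhood"].mask(unassigned, cleaned["Borough"])
)

print(cleaned)
```

Only the second and third sample rows survive, and the third row takes its Borough as its Neighbourhood.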

In [5]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


This result satisfies the requirement for the first part of the assignment.

## Section 2 - Data Enrichment<a id='section-2'></a>
In this section we enrich the Postal Code Dataframe, adding latitude and longitude data marking the approximate center of the area covered by each Postal Code.  The core code for looking up the coordinates was provided in the assignment, and is used here with comments.

In [6]:
# Install geocoder
!pip install geocoder



In [7]:
# First, add the new columns to the Dataframe with zeros - this technique avoids warnings about updating slices
df = df.assign(Latitude=0.0, Longitude=0.0)

df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
2,M3A,North York,Parkwoods,0.0,0.0
3,M4A,North York,Victoria Village,0.0,0.0
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",0.0,0.0
5,M6A,North York,"Lawrence Manor, Lawrence Heights",0.0,0.0
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",0.0,0.0


In [8]:
# Import geocoder
import geocoder

# Iterate over the rows of the dataframe
for index, row in df.iterrows():
    # For each row, lookup the latitude and longitude value
    # !!! Begin - This code was provided largely in the assignment
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        # NOTE: Google failed to return results, but ArcGIS was very good at finding the coordinates
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(row['Postal Code']))
        lat_lng_coords = g.latlng
        
    df.loc[index,'Latitude'] = lat_lng_coords[0]
    df.loc[index,'Longitude'] = lat_lng_coords[1]
    # !!! End - This code was provided largely in the assignment
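
The `while` loop above keeps retrying until the service answers, so it would never finish if a lookup kept failing. A minimal sketch of a bounded retry follows. The helper name `lookup_with_retries`, its parameters, and the stand-in `fake_geocode` are all hypothetical; in this Notebook the `geocode` argument could be something like `lambda q: geocoder.arcgis(q).latlng`.

```python
import time

def lookup_with_retries(geocode, query, max_attempts=5, delay_seconds=0.0):
    """Call geocode(query) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = geocode(query)
        if coords is not None:
            return coords
        # Pause briefly before asking the service again
        time.sleep(delay_seconds)
    return None

# Stand-in geocoder (hypothetical) that fails twice before answering
calls = {"count": 0}

def fake_geocode(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else [43.75245, -79.32991]

coords = lookup_with_retries(fake_geocode, "M3A, Toronto, Ontario")
print(coords, "after", calls["count"], "calls")
```

Returning `None` after the last attempt lets the caller decide whether to skip the row or stop, rather than hanging the cell.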

In [9]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
2,M3A,North York,Parkwoods,43.75245,-79.32991
3,M4A,North York,Victoria Village,43.73057,-79.31306
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
5,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188


This result satisfies the requirement for the second part of the assignment.

## Section 3 - Exploration and Clustering<a id='section-3'></a>
Now we explore the Boroughs and Neighborhoods of Toronto using Foursquare's API.

First, let's build a separate subset containing only the Boroughs with Toronto in the name, which we will work with a little later.

In [10]:
df_toronto = df[df.Borough.str.contains('Toronto')]
df_toronto.shape

(39, 5)

In [11]:
df_toronto.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188
13,M5B,Downtown Toronto,"Garden District, Ryerson",43.65739,-79.37804
22,M5C,Downtown Toronto,St. James Town,43.65215,-79.37587
30,M4E,East Toronto,The Beaches,43.67709,-79.29547


### Let's create maps of greater Toronto and Toronto showing the Neighborhoods

We need to get the coordinates for the center of Toronto as a whole using ArcGIS.

In [12]:
# Use ArcGIS to find Toronto's latitude and longitude

lat_lng_coords = None
# loop until you get the coordinates
while(lat_lng_coords is None):
    # NOTE: Google failed to return results, but ArcGIS was very good at finding the coordinates
    g = geocoder.arcgis('Toronto, Ontario')
    lat_lng_coords = g.latlng
      
latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]

Now let's get folium installed so we can create maps.

In [13]:
# Install folium (it is imported in the next cell)
!pip install folium



First, let's use Folium to create a basic map of the greater Toronto metropolitan area, marking the center of each neighbourhood.

In [14]:
# Create a basic map of the Toronto greater metropolitan area
import folium

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11, control_scale = True)

# add markers to map
for index, row in df.iterrows():
    label = '{}, {}'.format(row['Neighbourhood'], row['Borough'])
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [row['Latitude'], row['Longitude']],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Let's now create a map just of those Boroughs containing Toronto in the name.

In [15]:
# Next, create basic map of Toronto with the centers of each neighbourhood
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12, control_scale = True)

# add markers to map
for index, row in df_toronto.iterrows():
    label = '{}, {}'.format(row['Neighbourhood'], row['Borough'])
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [row['Latitude'], row['Longitude']],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Time for a little Exploration

Let's gather information on the venues around Toronto from Foursquare.  Note that, from this point forward, any code cells that may contain credentials have been marked as @hidden_cell.  Markdown cells will be present, and as much code as possible will be presented.

First, let's establish a connection to Foursquare and pull back venue data.

In [16]:
# The code was removed by Watson Studio for sharing.

In [17]:
VERSION = '20180605' # Foursquare API version
LIMIT = 300 # A default Foursquare API limit value
MAX_RADIUS = 1000

We're reusing several functions from one of the practice exercises.  Credit to [Alex Aklson](https://www.linkedin.com/in/aklson/) and [Polong Lin](https://www.linkedin.com/in/polonglin) for their work on the [DS0701EN-3-3-2-Neighborhoods-New-York](https://labs.cognitiveclass.ai/tools/jupyterlab/lab/tree/labs/DS0701EN/DS0701EN-3-3-2-Neighborhoods-New-York-py-v1.0.ipynb) practice lab.

In [18]:
# Import requests to make HTTP requests
import requests

def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# function to collect venues for multiple locations
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

# Get the top venue categories for one row of frequencies
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
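
`getNearbyVenues` assumes every venue in the response has at least one category, and builds the query string by hand. The sketch below shows one way to make both steps a little more defensive, using only the standard library. The helper names `build_explore_url` and `extract_venues`, the placeholder credentials, and the sample `items` are hypothetical; the second sample item is invented to show the empty-category case.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore URL, letting urlencode escape the parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

def extract_venues(name, lat, lng, items):
    """Turn response items into rows, tolerating venues without a category."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Hand-written items shaped like the fields the function above reads
items = [
    {"venue": {"name": "Tandem Coffee",
               "location": {"lat": 43.653559, "lng": -79.361809},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Uncategorised Place",
               "location": {"lat": 43.65, "lng": -79.36},
               "categories": []}},
]

url = build_explore_url("ID", "SECRET", "20180605", 43.65512, -79.36264, 1000, 300)
rows = extract_venues("Regent Park, Harbourfront", 43.65512, -79.36264, items)
print(url)
print(rows)
```

A venue with no category gets `None` instead of raising an `IndexError`, so one odd record does not stop the whole collection run.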

Using the getNearbyVenues function, we collect Venue data from Foursquare for the neighbourhoods of Toronto.

In [19]:
# Use the getNearbyVenues function to collect the nearby venue data within a kilometer
toronto_venues = getNearbyVenues(names=df_toronto['Neighbourhood'],
                                   latitudes=df_toronto['Latitude'],
                                   longitudes=df_toronto['Longitude'],
                                   radius=MAX_RADIUS
                                  )

print(toronto_venues.shape)
toronto_venues.head()

(3331, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65512,-79.36264,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65512,-79.36264,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65512,-79.36264,Rooster Coffee,43.6519,-79.365609,Coffee Shop
3,"Regent Park, Harbourfront",43.65512,-79.36264,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot
4,"Regent Park, Harbourfront",43.65512,-79.36264,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot


Now let's turn this raw data into content we can use to cluster the neighbourhoods, collecting the frequency of each Venue Category in each Neighbourhood.  This is a two-stage process: transposing the Venue Category items into column headers and marking each Neighbourhood-Venue Category pair.

In [20]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighbourhood,Accessories Store,African Restaurant,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Aquarium,Arcade,Art Gallery,...,Train Station,University,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
toronto_onehot.shape

(3331, 278)

Next we compute the frequency values.

In [22]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighbourhood,Accessories Store,African Restaurant,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Aquarium,Arcade,Art Gallery,...,Train Station,University,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.02,...,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.03,...,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.01,...,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.04
4,Central Bay Street,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.01,...,0.0,0.01,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.02


In [23]:
toronto_grouped.shape

(39, 278)
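
The one-hot encoding and averaging steps above can be seen more easily on a toy table. This is a minimal sketch with made-up neighbourhoods and venues, not data returned by Foursquare.

```python
import pandas as pd

# Toy venue list (hypothetical): four venues in A, two in B
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Café", "Café", "Park", "Pub", "Park", "Park"],
})

# Stage 1: one column per category, with a 1 marking each venue's category
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# Stage 2: the mean of the 0/1 marks is the share of venues in each category
freq = onehot.groupby("Neighbourhood").mean()
print(freq)
```

Each row of `freq` sums to 1, because every venue contributes exactly one mark to its neighbourhood.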

Build a Dataframe showing the top 10 venues for each Neighbourhood so we can review.

In [24]:
# Import numpy
import numpy as np

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Café,Japanese Restaurant,Restaurant,Hotel,Park,Gastropub,Liquor Store,Art Gallery,Beer Bar
1,"Brockton, Parkdale Village, Exhibition Place",Coffee Shop,Café,Bar,Restaurant,Furniture / Home Store,Bakery,Gift Shop,Thrift / Vintage Store,Supermarket,Diner
2,"Business reply mail Processing Centre, South C...",Coffee Shop,Café,Hotel,Art Gallery,Gym,Beer Bar,Theater,Plaza,Pizza Place,Restaurant
3,"CN Tower, King and Spadina, Railway Lands, Har...",Coffee Shop,Gym,Yoga Studio,Park,Café,Sandwich Place,Italian Restaurant,Spa,Bar,Brewery
4,Central Bay Street,Coffee Shop,Café,Hotel,Gastropub,Ramen Restaurant,Yoga Studio,Sushi Restaurant,Park,Theater,Pizza Place
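
As an alternative sketch of the same ranking step, `Series.nlargest` can pick the top categories per row, and a small helper can build the ordinal column names. The `ordinal` helper and the frequency values below are hypothetical and only illustrate the idea.

```python
import pandas as pd

def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

# Illustrative category frequencies (hypothetical values)
freq = pd.DataFrame(
    {"Café": [0.5, 0.0], "Park": [0.3, 0.9], "Pub": [0.2, 0.1]},
    index=["A", "B"],
)

num_top = 2
columns = [f"{ordinal(i + 1)} Most Common Venue" for i in range(num_top)]

# nlargest orders each row's frequencies and keeps the leading categories
top = pd.DataFrame(
    [row.nlargest(num_top).index.tolist() for _, row in freq.iterrows()],
    index=freq.index,
    columns=columns,
)
print(top)
```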


### Let's do some Clustering

Now that the data is in a format we can process, we can run k-means clustering on the Neighbourhoods.

In [25]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 2, 0, 2, 0, 2, 0, 0, 2, 2], dtype=int32)
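
To illustrate what the cluster labels mean, here is a minimal sketch that runs k-means on four made-up frequency profiles. The values are assumptions chosen so that the two groups are obvious; `n_init=10` is set explicitly only to keep the behaviour the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical category-frequency rows: two café-heavy, two park-heavy
profiles = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)
labels = model.labels_
print(labels)
```

The label numbers themselves are arbitrary. What matters is which rows share a label: here the first two rows fall in one cluster and the last two in the other.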

Now we put the Cluster information into the Neighbourhood review table.

In [26]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df_toronto

# join the sorted venues table onto the Toronto data so each row has its cluster label and top venues
toronto_merged = toronto_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264,0,Coffee Shop,Café,Restaurant,Italian Restaurant,Park,Theater,Pub,Thai Restaurant,Bakery,Diner
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188,0,Coffee Shop,Park,Café,Hotel,Sushi Restaurant,Pizza Place,Japanese Restaurant,Restaurant,Italian Restaurant,Middle Eastern Restaurant
13,M5B,Downtown Toronto,"Garden District, Ryerson",43.65739,-79.37804,0,Coffee Shop,Gastropub,Café,Japanese Restaurant,Pizza Place,Theater,Hotel,Italian Restaurant,Cosmetics Shop,Middle Eastern Restaurant
22,M5C,Downtown Toronto,St. James Town,43.65215,-79.37587,0,Coffee Shop,Café,Gastropub,Restaurant,Italian Restaurant,Seafood Restaurant,Farmers Market,Hotel,Pizza Place,Bakery
30,M4E,East Toronto,The Beaches,43.67709,-79.29547,2,Pub,Coffee Shop,Breakfast Spot,Pizza Place,Park,Japanese Restaurant,Sandwich Place,Caribbean Restaurant,Bar,BBQ Joint
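
One thing to watch with `join`: a neighbourhood that returned no venues would have no match, which leaves `NaN` in its `Cluster Labels` and turns the column into floats. The sketch below shows this on made-up data and one possible way to handle it; dropping the unmatched rows is an assumption about what is wanted, not something the assignment prescribes.

```python
import pandas as pd

# Hypothetical tables: neighbourhood C has no cluster label
places = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C"],
    "Latitude": [43.65, 43.66, 43.67],
})
clusters = pd.DataFrame({
    "Neighbourhood": ["A", "B"],
    "Cluster Labels": [0, 2],
})

merged = places.join(clusters.set_index("Neighbourhood"), on="Neighbourhood")

# Drop rows without a label and restore integer labels for later indexing
usable = merged.dropna(subset=["Cluster Labels"]).astype({"Cluster Labels": int})
print(usable)
```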


Now let's see what the map looks like adding the clustering data.

In [27]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for index, row in toronto_merged.iterrows():
    cluster = row['Cluster Labels']
    label = folium.Popup(str(row['Borough'] + ', ' + row['Neighbourhood']) + ', Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [row['Latitude'], row['Longitude']],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

This ends the exploration of Toronto by grouping neighborhoods based on their 10 most common types of venues.