# Capstone Project

## Segmenting and Clustering Neighborhoods in Toronto

This notebook is **Alexis Raymond**'s submission for the *Segmenting and Clustering Neighborhoods in Toronto* portion of the data capstone project of the IBM Data Science Professional Certificate.

### Importing Useful Libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Color libraries
import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans # k-means clustering

import folium # map rendering library

### 1. Download and Explore Dataset

The dataset needed for this project is not available as a direct download, so I need to scrape the data on Toronto's neighborhoods from the web. Luckily, there is a Wikipedia page that lists the postal code, borough, and neighborhood of each area, which is all the information needed.

In [2]:
# Get the wikipedia page's source code
toronto_postal_codes = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

In [3]:
# Keep only the table from the source code
toronto_postal_codes = toronto_postal_codes[toronto_postal_codes.find('<tbody>')+7:toronto_postal_codes.find('</tbody>')]

In [4]:
# Split the table's source code in a list of rows
toronto_postal_codes = toronto_postal_codes.split('<tr>')[1:]

In [5]:
# Define the column names of the dataframe that will contain the Toronto neighborhoods
column_names = ['PostalCode','Borough','Neighborhood']

# Create the empty dataframe
neighborhoods = pd.DataFrame(columns=column_names)
neighborhoods

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|


In [6]:
# Loop through all the rows of the table and append the appropriate values to the dataframe
for i in range(1, len(toronto_postal_codes)) : 
     
    # Split the row in a list containing the postal code, the borough and the neighborhood
    neighborhood_data = toronto_postal_codes[i].split('\n')[1:-2]

    # Capture the value of the postal_code
    postal_code = neighborhood_data[0][4:-5]

    # Find start and end point of the name of the borough
    if neighborhood_data[1][4:-5].find('>') == -1 : # If the borough is not a link
        start = 0
        end = len(neighborhood_data[1][4:-5])

    else : # If the borough is a link
        start = neighborhood_data[1][4:-5].find('>') + 1
        end = -4

    # Capture the value of the borough
    borough = neighborhood_data[1][4:-5][start:end]

    # Find start and end point of the name of the neighborhood
    if neighborhood_data[2][4:].find('>') == -1 : # If the neighborhood is not a link
        start = 0
        end = len(neighborhood_data[2][4:])

    else : # If the neighborhood is a link
        start = neighborhood_data[2][4:].find('>') + 1
        end = -4

    # Capture the value of the neighborhood
    neighborhood = neighborhood_data[2][4:][start:end]

    # Append data to dataframe (DataFrame.append was removed in pandas 2.0)
    neighborhoods.loc[len(neighborhoods)] = [postal_code, borough, neighborhood]
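
The loop above relies on fixed string offsets such as `[4:-5]`, which break if the page's markup changes slightly. As an alternative, here is a minimal sketch that reads the same kind of rows with the standard library's `html.parser`. The `TableRowParser` class and the `sample_html` snippet are made up for illustration and are not part of the original notebook.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text nested inside tags such as <a> is still delivered here
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:   # header rows only contain <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None


# Made-up sample shaped like the rows handled above
sample_html = """
<table><tbody>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York">North York</a></td>
<td><a href="/wiki/Parkwoods">Parkwoods</a>
</td></tr>
</tbody></table>
"""

parser = TableRowParser()
parser.feed(sample_html)
parsed_rows = parser.rows
print(parsed_rows)
```

Each entry of `parsed_rows` is a `[postal_code, borough, neighborhood]` list, whether or not the cell text was wrapped in a link.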

In [7]:
# Loop through all rows, from last to first
for i in range(neighborhoods.shape[0]-1, -1, -1) :
    
    # Assign the borough's value if the neighborhood is not assigned
    # (.at writes to the dataframe itself; chained .iloc[i][...] may write to a copy)
    if neighborhoods.at[i, 'Neighborhood'] == 'Not assigned' :
        neighborhoods.at[i, 'Neighborhood'] = neighborhoods.at[i, 'Borough']
        
    # Combine rows with the same postal code (row 0 has no previous row to compare with)
    if i > 0 and neighborhoods.at[i, 'PostalCode'] == neighborhoods.at[i-1, 'PostalCode'] :
        neighborhoods.at[i, 'Borough'] = 'Not assigned'
        neighborhoods.at[i-1, 'Neighborhood'] += ', ' + neighborhoods.at[i, 'Neighborhood']
        
# Drop rows with unidentified boroughs
neighborhoods = neighborhoods[neighborhoods['Borough'] != 'Not assigned']

# Reset the indexes
neighborhoods.reset_index(drop = True, inplace = True)
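
The same cleaning steps can also be written without an explicit loop. The sketch below is one possible vectorized version, not the code that produced the output in this notebook. It runs on a small made-up `sample` dataframe whose values mirror the rows shown below, and it assumes that rows sharing a postal code also share a borough.

```python
import pandas as pd

# Made-up sample in the same shape as the scraped table
sample = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto',
                'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Harbourfront',
                     'Regent Park', 'Not assigned'],
})

# Drop rows with unidentified boroughs
assigned = sample[sample['Borough'] != 'Not assigned'].copy()

# Use the borough's name where the neighborhood is not assigned
missing = assigned['Neighborhood'] == 'Not assigned'
assigned.loc[missing, 'Neighborhood'] = assigned.loc[missing, 'Borough']

# Combine rows with the same postal code into one comma-separated entry
combined = (assigned
            .groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
            .agg(', '.join)
            .reset_index())
print(combined)
```

Passing `sort=False` keeps the postal codes in their order of first appearance, which matches the order the loop produces.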

In [8]:
# Show the first 5 entries in the dataframe
neighborhoods.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |


In [9]:
# Print the number of rows in the dataframe
print('There are ' + str(neighborhoods.shape[0]) + ' rows in the dataframe.')

There are 103 rows in the dataframe.


### 2. Find coordinates for each neighborhood

In [10]:
# Import coordinates CSV file
coordinates = pd.read_csv('Geospatial_Coordinates.csv')

# Rename column names
coordinates.columns = ['PostalCode', 'Latitude', 'Longitude']

# Show the first 5 entries of the dataframe
coordinates.head()

| | PostalCode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [11]:
# Merge neighborhoods dataframe with coordinates dataframe
neighborhoods = pd.merge(neighborhoods, coordinates, on = 'PostalCode')

In [12]:
# Show the first 5 entries of the dataframe
neighborhoods.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Queen's Park | 43.662301 | -79.389494 |


### 3. Cluster Toronto neighborhoods

In [13]:
# Store credentials
CLIENT_ID = 'XINGNND4JIVSG3QCPPOYLVDNAGEUAIJJ0K5DFLJCW5JCSZXF'
CLIENT_SECRET = '3QOKG5X3SPUO4LX1RSFVO1GS5QNPW2LLQTF4RWAJ01LLOUP1'
LIMIT = 100
VERSION = '20190219'
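
Keys written directly in a notebook are visible to anyone who can read it. One common alternative is to read them from environment variables. The sketch below assumes the variables are named `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`; both names, and the `load_credentials` helper, are hypothetical.

```python
import os


def load_credentials(env=os.environ):
    """Return (client_id, client_secret) from a mapping of environment variables.

    The variable names are assumptions made for this sketch.
    """
    client_id = env.get('FOURSQUARE_CLIENT_ID', '')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET', '')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set')
    return client_id, client_secret


# Usage with a stand-in mapping instead of the real environment
example_env = {'FOURSQUARE_CLIENT_ID': 'my-id', 'FOURSQUARE_CLIENT_SECRET': 'my-secret'}
example_credentials = load_credentials(example_env)
```

In the notebook itself, `CLIENT_ID, CLIENT_SECRET = load_credentials()` would then pick the values up from the real environment.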

In [14]:
# Define function to retrieve nearby venues
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    # Create an empty list of venues
    venues_list=[]
    
    # Loop through neighborhoods
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # Create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # Make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # Return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    
    # Create dataframe with close venues to each neighborhood
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    # Return nearby venues dataframe
    return nearby_venues
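
`getNearbyVenues` indexes straight into the response, so a reply without groups, or a venue with an empty `categories` list, raises an exception partway through the loop. Below is a minimal sketch of a more forgiving extraction step. The `extract_venue_rows` helper, the `'Uncategorized'` label, and the `fake_payload` data are assumptions made for illustration; the response shape is inferred from the indexing used in the function above.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Flatten one explore-style response dict into venue tuples (hypothetical helper)."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []

    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories', [])
        # Assumption: label venues without a category instead of raising IndexError
        category = categories[0]['name'] if categories else 'Uncategorized'
        rows.append((name, lat, lng,
                     venue.get('name'),
                     location.get('lat'),
                     location.get('lng'),
                     category))
    return rows


# Made-up payload with one complete venue and one venue without a category
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 43.75, 'lng': -79.33},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Spot B', 'location': {'lat': 43.76, 'lng': -79.32},
               'categories': []}},
]}]}}

venue_rows = extract_venue_rows('Parkwoods', 43.753259, -79.329656, fake_payload)
empty_rows = extract_venue_rows('Parkwoods', 43.753259, -79.329656, {})
```

An empty or malformed reply yields an empty list, so the surrounding loop can continue with the next neighborhood.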

In [15]:
# Create dataframe with nearby venues to each Toronto neighborhood
toronto_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )

In [16]:
# One hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# Add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# Move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

# Show the first 5 entries of the dataframe
toronto_onehot.head()

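| | Yoga Studio | Accessories Store | Adult Boutique | Afghan Restaurant | Airport | Airport Food Court | Airport Gate | Airport Lounge | Airport Service | Airport Terminal | ... | Trail | Train Station | Vegetarian / Vegan Restaurant | Video Game Store | Video Store | Vietnamese Restaurant | Warehouse Store | Wine Bar | Wings Joint | Women's Store |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |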


In [17]:
# Group rows by neighborhood, taking the mean of each one-hot column (the frequency of occurrence of each category)
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
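
Taking the mean of a one-hot column within a neighborhood gives the share of that neighborhood's venues that fall in the category. The sketch below reproduces the arithmetic in plain Python on a handful of made-up `(neighborhood, category)` pairs; the venue data is invented for illustration.

```python
from collections import Counter, defaultdict

# Made-up (neighborhood, venue category) pairs
venues = [
    ('Parkwoods', 'Park'),
    ('Parkwoods', 'Food & Drink Shop'),
    ('Victoria Village', 'Coffee Shop'),
    ('Victoria Village', 'Coffee Shop'),
    ('Victoria Village', 'Pizza Place'),
    ('Victoria Village', 'Hockey Arena'),
]

# Count how often each category occurs in each neighborhood
counts = defaultdict(Counter)
for neighborhood, category in venues:
    counts[neighborhood][category] += 1

# Divide by the neighborhood's venue total: this equals the mean of the one-hot column
frequencies = {
    neighborhood: {category: n / sum(counter.values())
                   for category, n in counter.items()}
    for neighborhood, counter in counts.items()
}
print(frequencies['Victoria Village'])
```

Here two of the four venues in Victoria Village are coffee shops, so its `Coffee Shop` frequency is 0.5. Categories that never occur in a neighborhood are simply absent from its dictionary, whereas the dataframe version stores them as 0.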

In [18]:
# Set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# Run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# Add cluster labels to dataframe
toronto_grouped['Cluster'] = kmeans.labels_ 

# Add coordinates to dataframe
toronto_grouped = pd.merge(toronto_grouped, neighborhoods[['Neighborhood','Latitude','Longitude']], on = 'Neighborhood')
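
To show what the clustering step does with the frequency vectors, here is a minimal pure-Python sketch of the basic k-means iteration (Lloyd's algorithm). It is a simplified illustration and not scikit-learn's implementation, which by default also chooses its starting centroids with k-means++. The `simple_kmeans` function, the points, and the starting centroids are made up.

```python
def simple_kmeans(points, centroids, iterations=10):
    """Basic Lloyd's algorithm on tuples of floats; returns (labels, centroids)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (squared Euclidean distance)
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
            for point in points
        ]

        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its previous centroid

        if new_centroids == centroids:     # nothing moved, so the algorithm has converged
            break
        centroids = new_centroids
    return labels, centroids


# Made-up two-category frequency vectors for four neighborhoods
points = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
labels, final_centroids = simple_kmeans(points, centroids=[(1.0, 0.0), (0.0, 1.0)])
print(labels)
```

The first two points end up in cluster 0 and the last two in cluster 1, since each pair lies close to a different starting centroid.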

In [19]:
# Create map of Toronto
map_clusters = folium.Map(location=[43.6532, -79.3832], zoom_start=11)

# Set color scheme for the clusters (one color per cluster)
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map
for lat, lon, poi, cluster in zip(toronto_grouped['Latitude'], toronto_grouped['Longitude'], toronto_grouped['Neighborhood'], toronto_grouped['Cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # cluster labels run from 0 to kclusters-1
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters