# Capstone Project

## Segmenting and Clustering Neighborhoods in Toronto

This notebook is **Alexis Raymond**'s submission for the *Segmenting and Clustering Neighborhoods in Toronto* portion of the capstone project of the IBM Data Science Professional Certificate.

### Importing Useful Libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON data into a pandas dataframe (top-level import, pandas >= 1.0)

import matplotlib.pyplot as plt # library for basic data visualisations
import seaborn as sns # library for advanced data visualisations

from sklearn.cluster import KMeans # k-means, used in the clustering stage

import folium # map rendering library

### 1. Download and Explore Dataset

Since the dataset needed for this project is not available as a download, I need to scrape the data on Toronto's neighborhoods from the web. Luckily, a Wikipedia page contains all the information needed.

In [2]:
# Get the Wikipedia page's source code
toronto_postal_codes = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

In [3]:
# Keep only the table from the source code
toronto_postal_codes = toronto_postal_codes[toronto_postal_codes.find('<tbody>')+7:toronto_postal_codes.find('</tbody>')]

In [4]:
# Split the table's source code into a list of rows
toronto_postal_codes = toronto_postal_codes.split('<tr>')[1:]

In [5]:
# Define the column names of the dataframe that will contain the Toronto neighborhoods
column_names = ['PostalCode','Borough','Neighborhood']

# Create the empty dataframe
neighborhoods = pd.DataFrame(columns=column_names)
neighborhoods

|   | PostalCode | Borough | Neighborhood |
|---|------------|---------|--------------|


In [6]:
# Loop through all the rows of the table and append the appropriate values to the dataframe
for i in range(1, len(toronto_postal_codes)) : 
     
    # Split the row into a list containing the postal code, the borough and the neighborhood
    neighborhood_data = toronto_postal_codes[i].split('\n')[1:-2]

    # Capture the value of the postal_code
    postal_code = neighborhood_data[0][4:-5]

    # Find start and end point of the name of the borough
    if neighborhood_data[1][4:-5].find('>') == -1 : # If the borough is not a link
        start = 0
        end = len(neighborhood_data[1][4:-5])

    else : # If the borough is a link
        start = neighborhood_data[1][4:-5].find('>') + 1
        end = -4

    # Capture the value of the borough
    borough = neighborhood_data[1][4:-5][start:end]

    # Find start and end point of the name of the neighborhood
    if neighborhood_data[2][4:].find('>') == -1 : # If the neighborhood is not a link
        start = 0
        end = len(neighborhood_data[2][4:])

    else : # If the neighborhood is a link
        start = neighborhood_data[2][4:].find('>') + 1
        end = -4

    # Capture the value of the neighborhood
    neighborhood = neighborhood_data[2][4:][start:end]

    # Append data to dataframe
    # (DataFrame.append was removed in pandas 2.0; adding a row by label works in old and new versions)
    neighborhoods.loc[len(neighborhoods)] = [postal_code, borough, neighborhood]
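
The string slicing above depends on the exact layout of the page's markup, so it can break if Wikipedia changes the table. As a rough alternative sketch (not the method used in this notebook), the same cell values can be pulled out with the standard library's `html.parser`. The `TableRowParser` class and the `sample_html` snippet are hypothetical names and data made up for illustration; the sample only imitates the shape of the rows handled above.

```python
from html.parser import HTMLParser

class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row being read, None outside <tr>
        self._cell = None   # text fragments of the cell being read, None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside links is delivered here too, so linked and plain cells look the same
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:  # header rows only hold <th> cells, so they stay empty
                self.rows.append(self._row)
            self._row = None

# Hypothetical sample markup (an assumption, not the real page source)
sample_html = """
<table><tbody>
<tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood</th>
</tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned</td>
</tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York">North York</a></td>
<td><a href="/wiki/Parkwoods">Parkwoods</a></td>
</tr>
</tbody></table>
"""

parser = TableRowParser()
parser.feed(sample_html)
parser.close()
rows = parser.rows  # one list of three strings per data row
```

Each entry of `rows` could then be passed to `pd.DataFrame(rows, columns=column_names)` in a single call instead of adding rows one at a time.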

In [7]:
# Loop through all rows
# (.at writes to the dataframe itself; chained .iloc[i][...] assignments may only modify a temporary copy)
# The row labels are 0..n-1 here, so they coincide with the row positions
for i in range(neighborhoods.shape[0]-1, -1, -1) :
    
    # Assign the borough's value if the neighborhood is not assigned
    if neighborhoods.at[i, 'Neighborhood'] == 'Not assigned' :
        neighborhoods.at[i, 'Neighborhood'] = neighborhoods.at[i, 'Borough']
        
    # Combine rows with the same postal code (the first row has no previous row to merge into)
    if i > 0 and neighborhoods.at[i, 'PostalCode'] == neighborhoods.at[i-1, 'PostalCode'] :
        neighborhoods.at[i, 'Borough'] = 'Not assigned'
        neighborhoods.at[i-1, 'Neighborhood'] += ', ' + neighborhoods.at[i, 'Neighborhood']
        
# Drop rows with unidentified boroughs
neighborhoods = neighborhoods[neighborhoods['Borough'] != 'Not assigned']

# Reset the indexes
neighborhoods.reset_index(drop = True, inplace = True)
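
The cleanup in the cell above can also be expressed without an explicit loop. The following is a minimal sketch of that idea using `groupby`, not the approach taken in this notebook; the `raw` dataframe holds made-up rows in the same three-column shape, and the variable names `raw`, `cleaned` and `missing` are assumptions introduced for illustration.

```python
import pandas as pd

# Hypothetical rows in the same shape as the scraped table
raw = pd.DataFrame({
    'PostalCode':   ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough':      ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# Drop rows whose borough is unknown
cleaned = raw[raw['Borough'] != 'Not assigned'].copy()

# Fall back to the borough name where the neighborhood is unknown
missing = cleaned['Neighborhood'] == 'Not assigned'
cleaned.loc[missing, 'Neighborhood'] = cleaned.loc[missing, 'Borough']

# Join the neighborhoods that share a postal code into one comma-separated row
cleaned = (
    cleaned.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
    .agg(', '.join)
    .reset_index()
)
```

Unlike the loop, this version merges rows with the same postal code even when they are not adjacent in the table.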

In [8]:
# Show the first 5 entries in the dataframe
neighborhoods.head()

|   | PostalCode | Borough | Neighborhood |
|---|------------|---------|--------------|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |


In [9]:
# Print the number of rows in the dataframe
print('There are ' + str(neighborhoods.shape[0]) + ' rows in the dataframe.')

There are 103 rows in the dataframe.
