# Clustering and Segmenting Neighborhoods in Toronto, Canada

This notebook is a submission for the week 3 peer-reviewed assignment in IBM's Data Science Capstone Project course.

This notebook has six sections, each addressing a particular part of the project evaluation:

1. Scraping the data from Wikipedia.

2. Creating the data frame for use in our notebook.

3. Getting latitude and longitude coordinates, a prerequisite for working with Foursquare data.

4. Mapping Toronto neighborhood data, as seen in a previous example with NYC.

5. Clustering Toronto neighborhood data using Foursquare inputs.

6. Printing **.shape** at the end to evaluate the dimensions of the dataframe we used.


### Setting up

We first load the libraries required for our workflow.

In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment out this line if folium is already installed
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    openssl-1.1.1g             |       h516909a_1         2.1 MB  conda-forge
    ca-certificates-2020.6.20  |       hecda079_0         145 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    branca-0.4.1               |             py_0          26 KB  conda-forge
    certifi-2020.6.20          |   py36h9f0ad1d_0         151 KB  conda-forge
    altair-4.1.0               |             py_1         614 KB  conda-forge
    python_abi-3.6             |          1_cp36m           4 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    ------------------------------------------------------------
                       

### 1. Scraping the Wikipedia table

In [109]:

wiki = pd.read_html('https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&direction=prev&oldid=926287641')
df = wiki[0]
print(df.columns)
print(type(df))

Index(['Postcode', 'Borough', 'Neighbourhood'], dtype='object')
<class 'pandas.core.frame.DataFrame'>


### 2. Creating a clean data frame for use in our notebook

We eliminate the postal codes without an assigned borough.



In [110]:

df = df.loc[df['Borough']!='Not assigned']
df.head()


|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |


In [111]:
df['Borough'].unique()

array(['North York', 'Downtown Toronto', "Queen's Park", 'Etobicoke',
       'Scarborough', 'East York', 'York', 'East Toronto', 'West Toronto',
       'Central Toronto', 'Mississauga'], dtype=object)

We can see that neighbourhoods sharing a postcode each have their own row. We need to combine all neighbourhoods with the same postcode into a single row.

In [112]:

df = df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(','.join).reset_index()
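
To make the effect of this grouping concrete, here is a minimal sketch on a small hand-made frame. The three rows are illustrative (they reuse values from the `head()` output earlier), and the names `toy` and `combined` are hypothetical; only the `groupby`, `apply`, and `reset_index` calls come from the cell above.

```python
import pandas as pd

# Toy data: two neighbourhoods share postcode M5A, one postcode stands alone.
toy = pd.DataFrame({
    'Postcode': ['M4A', 'M5A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighbourhood': ['Victoria Village', 'Harbourfront', 'Regent Park'],
})

# Group by postcode and borough, then join the neighbourhood names
# within each group into one comma-separated string.
combined = (
    toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
       .apply(','.join)
       .reset_index()
)

print(combined)
```

The two M5A rows collapse into a single row whose `Neighbourhood` value lists both names, while M4A is left unchanged.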


We now have one row per postcode and borough, with all of its neighbourhoods listed together. We also need to rename any neighbourhood whose name is 'Not assigned', using the name of its borough instead.

In [100]:
df.loc[df['Neighbourhood'] == 'Not assigned']

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 8 | M7A | Queen's Park | Not assigned |


Fortunately, only one neighbourhood does not have an assigned name.

In [101]:
df.loc[df['Borough']=="Queen's Park"]

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 8 | M7A | Queen's Park | Not assigned |


Even more fortunately, only one row has Queen's Park as its borough, so we can safely change that row's neighbourhood to the name of the borough.

In [102]:
# Assign through a single .loc call (row mask, column label) rather than chained
# indexing, which triggers pandas' SettingWithCopyWarning.
df.loc[df['Borough'] == "Queen's Park", 'Neighbourhood'] = "Queen's Park"
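
As a hedged sketch of the renaming rule described earlier (use the borough name wherever the neighbourhood is 'Not assigned'), the assignment can be written as one `.loc` call with a row mask and a column label. Writing through a single `.loc` call, rather than chained indexing such as `df['Neighbourhood'][mask] = ...`, is the form the pandas documentation recommends to avoid `SettingWithCopyWarning`. The names `toy` and `mask` are hypothetical, and the two rows are illustrative only.

```python
import pandas as pd

# Toy data: one row already has a neighbourhood name, one does not.
toy = pd.DataFrame({
    'Postcode': ['M6A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Lawrence Heights', 'Not assigned'],
})

# Boolean mask selecting the rows whose neighbourhood is unassigned.
mask = toy['Neighbourhood'] == 'Not assigned'

# Copy the borough name into those rows with a single .loc assignment;
# the right-hand side is aligned to the target rows by index label.
toy.loc[mask, 'Neighbourhood'] = toy.loc[mask, 'Borough']

print(toy)
```

Because the rule is expressed through the mask, it would also handle more than one unassigned neighbourhood without naming each borough explicitly.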


We check that the change was made.

In [108]:
df.loc[df['Borough']=="Queen's Park"]
print(df.loc[df['Neighbourhood'] == 'Not assigned'])

Empty DataFrame
Columns: [Postcode, Borough, Neighbourhood]
Index: []
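
The same check can also be expressed as an assertion, which fails loudly instead of relying on someone reading the printed output. This is a sketch on a hypothetical two-row frame named `toy`; in the notebook, the assertion would be applied to `df`.

```python
import pandas as pd

# Toy data standing in for the cleaned frame.
toy = pd.DataFrame({
    'Postcode': ['M6A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Lawrence Heights', "Queen's Park"],
})

# Count the rows that still carry the placeholder name.
remaining = int((toy['Neighbourhood'] == 'Not assigned').sum())
assert remaining == 0, f"{remaining} neighbourhood(s) still unassigned"

# .shape reports the frame's dimensions as (rows, columns).
print(toy.shape)
```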
