# Segmenting and clustering neighborhoods in Toronto

## Data scraping (_from wikipedia.org_)

Let's start by installing wikitables, a package for importing tables from Wikipedia articles.

In [1]:
!pip install wikitables



Let's now import the tables from the Wikipedia article "List of postal codes of Canada: M":

In [2]:
from wikitables import import_tables
tables = import_tables('List_of_postal_codes_of_Canada:_M') #returns a list of WikiTable objects

Let's now export the first table as JSON and load it into a pandas dataframe. Then we will rename the columns according to the assignment and check the head of the dataframe:

In [3]:
import pandas as pd
df = pd.read_json(tables[0].json())
df.columns = ['PostalCode', 'Borough', 'Neighborhood']
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
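
The JSON-to-dataframe step can be tried offline on a small hand-written sample. This is only a sketch: it assumes the JSON is a list of row records (one object per table row), which is an assumption about what `tables[0].json()` returns, and the sample rows and original header names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample mimicking a list-of-records JSON export (assumed layout).
sample_json = """[
  {"Postcode": "M1A", "Borough": "Not assigned", "Neighbourhood": "Not assigned"},
  {"Postcode": "M3A", "Borough": "North York", "Neighbourhood": "Parkwoods"},
  {"Postcode": "M5A", "Borough": "Downtown Toronto", "Neighbourhood": "Harbourfront"}
]"""

# Wrap the string in StringIO so pandas reads it like a file.
sample = pd.read_json(io.StringIO(sample_json), orient="records")

# Rename the columns positionally, as in the cell above.
sample.columns = ["PostalCode", "Borough", "Neighborhood"]
print(sample)
```

Assigning to `sample.columns` renames by position, so it only works as intended if the source table keeps its three columns in the same order.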


Now, we will change the "Not assigned" values according to the assignment.

First, let's drop the rows with an unassigned Borough, so that they are excluded from the next steps:

In [4]:
import numpy as np

# Mark unassigned boroughs as NaN, then drop those rows and renumber the index
df['Borough'] = np.where(df['Borough'] == "Not assigned", np.nan, df['Borough'])
df = df.dropna().reset_index(drop=True)
df.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Not assigned
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
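
As an alternative to the NaN round trip, the same rows can be removed with a boolean mask. The sketch below is not the notebook's method, just an equivalent filter shown on a hypothetical three-row frame named `toy`.

```python
import pandas as pd

# Hypothetical toy data: one unassigned borough, two assigned ones.
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Keep only rows whose Borough is assigned, then renumber the index from 0.
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Note that the M7A row survives: its Borough is assigned even though its Neighborhood is not, which matches row 6 of the output above.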


Now, we will group the neighborhoods that share the same Borough and PostalCode, as requested.

In [5]:
# Join the neighborhood names of each (PostalCode, Borough) group into one comma-separated string
df = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
df.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
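
To see what the grouping does in isolation, here is a minimal sketch on a hypothetical five-row frame named `toy`. It uses `agg` instead of `apply`; for a plain string join the two should give the same result.

```python
import pandas as pd

# Hypothetical toy data with two postal codes that each cover two neighborhoods.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M1B", "M1B", "M4A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto",
                "Scarborough", "Scarborough", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park",
                     "Rouge", "Malvern", "Victoria Village"],
})

# One row per (PostalCode, Borough); neighborhood names are joined with ", ".
grouped = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```

`groupby` sorts the group keys by default, which is why the result starts at M1B rather than at the first row of the input. Within each group the original row order is kept, so "Rouge" still precedes "Malvern".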


Let's look at postal code M5A to verify that both of its neighborhoods appear in the Neighborhood field.

In [6]:
df.loc[df['PostalCode'] == 'M5A']

Unnamed: 0,PostalCode,Borough,Neighborhood
53,M5A,Downtown Toronto,"Harbourfront, Regent Park"


Let's now copy the Borough value into the Neighborhood column for rows whose neighborhood is unassigned, according to the assignment:

In [7]:
df['Neighborhood'] = np.where(df['Neighborhood'] == "Not assigned", df['Borough'], df['Neighborhood'])
df.head(12)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
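
None of the first rows shown above is affected, so the effect of `np.where` is easier to see on a small example. The sketch below applies the same call to a hypothetical two-row frame named `toy`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the first row has no neighborhood assigned.
toy = pd.DataFrame({
    "Borough": ["Queen's Park", "Scarborough"],
    "Neighborhood": ["Not assigned", "Woburn"],
})

# np.where(condition, value_if_true, value_if_false) is evaluated element-wise:
# unassigned neighborhoods take the borough name, all others stay unchanged.
toy["Neighborhood"] = np.where(
    toy["Neighborhood"] == "Not assigned", toy["Borough"], toy["Neighborhood"]
)
print(toy)
```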


Let's look at the 9th entry of the Wikipedia table (PostalCode: M7A) to verify that the replacement worked.

In [8]:
df.loc[df['PostalCode'] == 'M7A']

Unnamed: 0,PostalCode,Borough,Neighborhood
85,M7A,Queen's Park,Queen's Park


Shape of the dataframe:

In [9]:
df.shape

(103, 3)
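
Beyond the shape, the cleaning rules themselves can be checked programmatically. The helper below, `check_clean`, is a hypothetical name introduced only for this sketch, and it is demonstrated on a toy frame rather than on the real data.

```python
import pandas as pd


def check_clean(frame: pd.DataFrame) -> bool:
    """Return True if the frame satisfies the three cleaning rules used above."""
    no_unassigned_borough = not (frame["Borough"] == "Not assigned").any()
    no_unassigned_neighborhood = not (frame["Neighborhood"] == "Not assigned").any()
    # After grouping, each (PostalCode, Borough) pair should appear exactly once.
    no_duplicates = not frame.duplicated(subset=["PostalCode", "Borough"]).any()
    return bool(no_unassigned_borough and no_unassigned_neighborhood and no_duplicates)


# Hypothetical toy frames: one clean, one with a leftover "Not assigned" value.
clean = pd.DataFrame({
    "PostalCode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront, Regent Park", "Queen's Park"],
})
dirty = clean.assign(Neighborhood=["Harbourfront, Regent Park", "Not assigned"])

print(check_clean(clean), check_clean(dirty))
```

Calling `check_clean(df)` on the notebook's dataframe would be the natural next step.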

<b>Eduardo Pérez</b>