# Capstone Project
## Segmenting and Clustering Neighborhoods in Toronto


## 1. Create Notebook

In [1]:
# Install the libraries if you have not already done so
# ! pip install html5lib

# Libraries
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

## 2. Scrape the Postal Codes of Canada data from Wikipedia
Source: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [2]:
# Download the page HTML and parse it with the html5lib parser
src = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(src, 'html5lib')

- **Get the data table on the Wikipedia page with class 'wikitable sortable'**

In [3]:
wikiTable = soup.find('table',{'class':'wikitable sortable'})
wikiTable_rows = wikiTable.find_all('tr')

- **Extract the PostalCode, Borough, and Neighbourhood from each table row**

In [4]:
dataExtracted = []
for row in wikiTable_rows:
    # Header rows contain only <th> cells, so they yield an empty list here
    dataExtracted.append([t.text.strip() for t in row.find_all('td')])
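
The row loop above can be tried offline on a small inline table. The sketch below is only an illustration: the HTML snippet is made up to mimic the page's layout, `html.parser` is used instead of `html5lib` so that no extra install is needed, and the helper name `extract_rows` is hypothetical. It matches on the `wikitable` class alone and skips rows without `<td>` cells (the header row), which is one possible alternative to dropping null rows later.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure of the Wikipedia table (assumption)
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td> Not assigned </td></tr>
  <tr><td>M3A</td><td>North York</td><td> Parkwoods </td></tr>
</table>
"""


def extract_rows(html):
    """Return the stripped <td> texts of every data row in the first wikitable."""
    soup = BeautifulSoup(html, 'html.parser')
    # class_='wikitable' matches any table carrying that class, e.g. "wikitable sortable"
    table = soup.find('table', class_='wikitable')
    rows = []
    for tr in table.find_all('tr'):
        cells = [td.text.strip() for td in tr.find_all('td')]
        if cells:  # header rows contain only <th> cells, so they are skipped
            rows.append(cells)
    return rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

Filtering at this stage means the resulting list contains only three-element rows, so the dataframe built from it would not need a null check.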

## 3. Create Dataframe for neighborhoods in Toronto

In [5]:
dfNeigh = pd.DataFrame(dataExtracted, columns=['PostalCode', 'Borough', 'Neighbourhood'])
# The header row has no <td> cells and becomes an all-null row; drop it
dfNeigh = dfNeigh[~dfNeigh['PostalCode'].isnull()]
dfNeigh.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 288 entries, 1 to 288
Data columns (total 3 columns):
PostalCode       288 non-null object
Borough          288 non-null object
Neighbourhood    288 non-null object
dtypes: object(3)
memory usage: 9.0+ KB


- **The dataframe consists of three columns: PostalCode, Borough, and Neighbourhood**

In [6]:
dfNeigh.head(10)

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 1 | M1A | Not assigned | Not assigned |
| 2 | M2A | Not assigned | Not assigned |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Harbourfront |
| 6 | M5A | Downtown Toronto | Regent Park |
| 7 | M6A | North York | Lawrence Heights |
| 8 | M6A | North York | Lawrence Manor |
| 9 | M7A | Queen's Park | Not assigned |
| 10 | M8A | Not assigned | Not assigned |


- **Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.**

In [7]:
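# Drop the rows whose borough is "Not assigned"
dfNeigh.drop(dfNeigh[dfNeigh['Borough'] == "Not assigned"].index, axis=0, inplace=True)
# reset_index() keeps the old row labels in a new 'index' column
dfAssigned = dfNeigh.reset_index()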
dfAssigned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211 entries, 0 to 210
Data columns (total 4 columns):
index            211 non-null int64
PostalCode       211 non-null object
Borough          211 non-null object
Neighbourhood    211 non-null object
dtypes: int64(1), object(3)
memory usage: 6.7+ KB
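
The same filtering can be written with a boolean mask instead of an in-place `drop`. The sketch below runs on a three-row toy frame (made-up sample rows, not the scraped data) and passes `drop=True` to `reset_index` so that the old row labels are discarded rather than kept as an extra `index` column. It is an alternative formulation, not the cell used in this notebook.

```python
import pandas as pd

# Toy data standing in for the scraped table (assumption)
df = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows with an assigned borough; drop=True discards the old index
assigned = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned)
```

Because the mask returns a new frame, the original `df` is left unchanged, which makes it easier to re-run the cell.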


In [8]:
dfAssigned.head(10)

| | index | PostalCode | Borough | Neighbourhood |
|---|---|---|---|---|
| 0 | 3 | M3A | North York | Parkwoods |
| 1 | 4 | M4A | North York | Victoria Village |
| 2 | 5 | M5A | Downtown Toronto | Harbourfront |
| 3 | 6 | M5A | Downtown Toronto | Regent Park |
| 4 | 7 | M6A | North York | Lawrence Heights |
| 5 | 8 | M6A | North York | Lawrence Manor |
| 6 | 9 | M7A | Queen's Park | Not assigned |
| 7 | 11 | M9A | Etobicoke | Islington Avenue |
| 8 | 12 | M1B | Scarborough | Rouge |
| 9 | 13 | M1B | Scarborough | Malvern |


- **More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated by a comma.**

In [9]:
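# Join the values of each postal code with commas. The text columns are selected
# explicitly so that the integer 'index' column is never passed to str.join.
dfCombined = dfAssigned.groupby('PostalCode')[['Borough', 'Neighbourhood']].agg(lambda x: ','.join(x))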
dfCombined.info()

<class 'pandas.core.frame.DataFrame'>
Index: 103 entries, M1B to M9W
Data columns (total 2 columns):
Borough          103 non-null object
Neighbourhood    103 non-null object
dtypes: object(2)
memory usage: 2.4+ KB
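
A per-column aggregation is another way to combine the rows. The sketch below assumes that every postal code belongs to exactly one borough, so it takes the first borough of each group and joins only the neighbourhoods. The toy rows are illustrative, not the scraped data.

```python
import pandas as pd

# Toy data: M5A appears twice with two different neighbourhoods (assumption)
df = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M4A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Victoria Village'],
})

# 'first' keeps one borough per postal code; ','.join concatenates the neighbourhoods
combined = df.groupby('PostalCode', as_index=False).agg(
    {'Borough': 'first', 'Neighbourhood': ','.join}
)
print(combined)
```

With this form the Borough column never contains repeated values, so no later deduplication step would be required under the stated assumption.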


In [10]:
dfCombined.head(10)

| PostalCode | Borough | Neighbourhood |
|---|---|---|
| M1B | Scarborough,Scarborough | Rouge,Malvern |
| M1C | Scarborough,Scarborough,Scarborough | Highland Creek,Rouge Hill,Port Union |
| M1E | Scarborough,Scarborough,Scarborough | Guildwood,Morningside,West Hill |
| M1G | Scarborough | Woburn |
| M1H | Scarborough | Cedarbrae |
| M1J | Scarborough | Scarborough Village |
| M1K | Scarborough,Scarborough,Scarborough | East Birchmount Park,Ionview,Kennedy Park |
| M1L | Scarborough,Scarborough,Scarborough | Clairlea,Golden Mile,Oakridge |
| M1M | Scarborough,Scarborough,Scarborough | Cliffcrest,Cliffside,Scarborough Village West |
| M1N | Scarborough,Scarborough | Birch Cliff,Cliffside West |


- **If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.**

In [11]:
# Where the neighbourhood is "Not assigned", use the borough name instead
dfCombined.loc[dfCombined['Neighbourhood']=="Not assigned",'Neighbourhood'] = dfCombined.loc[dfCombined['Neighbourhood']=="Not assigned",'Borough']
dfCombined.head(10)

| PostalCode | Borough | Neighbourhood |
|---|---|---|
| M1B | Scarborough,Scarborough | Rouge,Malvern |
| M1C | Scarborough,Scarborough,Scarborough | Highland Creek,Rouge Hill,Port Union |
| M1E | Scarborough,Scarborough,Scarborough | Guildwood,Morningside,West Hill |
| M1G | Scarborough | Woburn |
| M1H | Scarborough | Cedarbrae |
| M1J | Scarborough | Scarborough Village |
| M1K | Scarborough,Scarborough,Scarborough | East Birchmount Park,Ionview,Kennedy Park |
| M1L | Scarborough,Scarborough,Scarborough | Clairlea,Golden Mile,Oakridge |
| M1M | Scarborough,Scarborough,Scarborough | Cliffcrest,Cliffside,Scarborough Village West |
| M1N | Scarborough,Scarborough | Birch Cliff,Cliffside West |
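
Since the first ten rows contain no unassigned neighbourhood, the effect of the replacement is easier to see on a small frame. The sketch below uses `Series.mask`, which swaps in the borough wherever the condition is true. The two rows are illustrative sample data.

```python
import pandas as pd

# Toy data: M7A has a borough but no assigned neighbourhood (assumption)
df = pd.DataFrame({
    'PostalCode': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

# Replace "Not assigned" neighbourhoods with the borough of the same row
unassigned = df['Neighbourhood'] == 'Not assigned'
df['Neighbourhood'] = df['Neighbourhood'].mask(unassigned, df['Borough'])
print(df)
```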


In [12]:
# Reset the index
dfFinal = dfCombined.reset_index()
# Remove the duplicate boroughs: split on commas and keep the first occurrence of each
# name, preserving the spaces in multi-word boroughs such as "North York"
dfFinal['Borough'] = dfFinal['Borough'].str.split(',').apply(lambda parts: ','.join(dict.fromkeys(parts)))
dfFinal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 3 columns):
PostalCode       103 non-null object
Borough          103 non-null object
Neighbourhood    103 non-null object
dtypes: object(3)
memory usage: 2.5+ KB


In [13]:
dfFinal.head(10)

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park,Ionview,Kennedy Park |
| 7 | M1L | Scarborough | Clairlea,Golden Mile,Oakridge |
| 8 | M1M | Scarborough | Cliffcrest,Cliffside,Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff,Cliffside West |


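- **In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.**

In [14]:
# dfFinal.shape is (103, 3), matching the 103 entries and 3 columns reported by dfFinal.info() above
dfFinal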

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park,Ionview,Kennedy Park |
| 7 | M1L | Scarborough | Clairlea,Golden Mile,Oakridge |
| 8 | M1M | Scarborough | Cliffcrest,Cliffside,Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff,Cliffside West |
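
For reference, `shape` is an attribute rather than a method, so it is read without parentheses and returns a `(rows, columns)` tuple. A minimal sketch on a toy frame (made-up sample rows):

```python
import pandas as pd

# Toy frame with three rows and two columns (assumption)
df = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M1E'],
    'Borough': ['Scarborough'] * 3,
})

# shape is a (rows, columns) tuple; index 0 is the row count
n_rows, n_cols = df.shape
print(n_rows)
```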


In [15]:
dfFinal.to_csv('torontoPostalCode.csv', index=False)
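
Because the Neighbourhood values themselves contain commas, it can be worth checking that the file reads back unchanged. The sketch below writes a toy frame to an in-memory buffer (a stand-in for `torontoPostalCode.csv`, so nothing is written to disk) and reads it back. The sample rows are illustrative.

```python
import io

import pandas as pd

# Toy frame; the first neighbourhood value contains a comma (assumption)
df = pd.DataFrame({
    'PostalCode': ['M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge,Malvern', 'Woburn'],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Fields that contain commas are quoted, so they survive the round trip
restored = pd.read_csv(io.StringIO(csv_text))
print(csv_text)
```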