# Segmenting and Clustering Neighborhoods in Toronto

Scrape the Wikipedia page that lists Toronto postal codes and load the neighborhood data into a pandas DataFrame. Then clean the data by concatenating the neighborhoods that share a postal code and borough.


In [2]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis

import requests # library to handle requests
from bs4 import BeautifulSoup

print('Libraries imported.')

Libraries imported.


## Get the wiki page data with BeautifulSoup


In [3]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')
#print(soup.prettify())

## Extract the neighborhood data table

A review of the HTML behind the wiki page shows that the neighborhood data table sits inside the first `<tbody>` element. The following code finds that element, walks its `<tr>` rows, and collects the `<td>` cell values into a list.
    


In [4]:
citydata = []

table_content = soup.find('tbody')
rows = table_content.find_all('tr')

for row in rows:
    cells = row.findChildren('td')
    
    # skip the header row (no <td> cells) and rows whose borough is not assigned
    if (len(cells) > 0) and (cells[1].text != 'Not assigned'):
        city = (cells[0].text, cells[1].text, cells[2].text.replace('\n', ''))
        citydata.append(city)

citydf = pd.DataFrame(citydata, columns=['PostCode', 'Borough', 'Neighborhood'])
citydf.head(30)


Unnamed: 0,PostCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Not assigned
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
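
The row-by-row extraction above can be tried offline on a small inline table. The sketch below is an illustration only: `sample_html` and `extract_rows` are hypothetical names, the sample rows are a hand-written stand-in for the real page, and it uses the built-in `html.parser` backend instead of `lxml` so that no extra parser is needed. It also strips whitespace from every cell with `get_text(strip=True)`, which is one possible way to avoid stray newline characters.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (not the real page content).
sample_html = """
<table>
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
    <tr><td>M7A</td><td>Queen's Park</td><td>Not assigned
</td></tr>
  </tbody>
</table>
"""


def extract_rows(html):
    """Return (postcode, borough, neighborhood) tuples for assigned boroughs."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find("tbody").find_all("tr"):
        # Header rows contain only <th> cells, so this list is empty for them.
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == 3 and cells[1] != "Not assigned":
            records.append(tuple(cells))
    return records


records = extract_rows(sample_html)
print(records)
```

Only the borough column is filtered here, so a row with an assigned borough but an unassigned neighborhood (such as M7A) is kept, as in the notebook.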


## Combine neighborhoods under the same PostCode and Borough

Use the DataFrame `groupby` method to merge rows that share a PostCode and Borough into a single row, joining the neighborhood names with a ',' separator.

In [5]:
citydf_groupby = citydf.groupby(['PostCode', 'Borough'])['Neighborhood'].apply(lambda neighbor: ','.join(neighbor))

citydf_new = citydf_groupby.to_frame().reset_index() 
citydf_new.head(30)

Unnamed: 0,PostCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"
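
To see the grouping step in isolation, here is a minimal sketch on a three-row toy frame. The names `toy` and `combined` are hypothetical, and the rows are copied from the first output table above. It uses `agg(",".join)` followed by `reset_index()`, which should give the same result as the `apply` with a lambda used in the notebook.

```python
import pandas as pd

# Toy data: two neighborhoods share M5A, one postal code stands alone.
toy = pd.DataFrame(
    [
        ("M5A", "Downtown Toronto", "Harbourfront"),
        ("M5A", "Downtown Toronto", "Regent Park"),
        ("M4A", "North York", "Victoria Village"),
    ],
    columns=["PostCode", "Borough", "Neighborhood"],
)

# Join the neighborhood names within each (PostCode, Borough) group,
# then turn the group keys back into ordinary columns.
combined = (
    toy.groupby(["PostCode", "Borough"])["Neighborhood"]
    .agg(",".join)
    .reset_index()
)
print(combined)
```

Because `groupby` sorts by the group keys by default, the combined frame is ordered by postal code rather than by the original row order.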


## Clean up empty Neighborhood data

If a row has a borough but a 'Not assigned' neighborhood, the neighborhood is set to the name of the borough.

In [6]:

for index, row in citydf_new.iterrows():
    if (row['Neighborhood'] == 'Not assigned'):
        citydf_new.at[index, 'Neighborhood'] = row['Borough']

citydf_new.head(30)

Unnamed: 0,PostCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"
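
The same cleanup can also be written without an explicit loop. The following is a sketch on a toy frame (the names `toy` and `mask` are hypothetical, and the two rows are taken from the first output table); it builds a boolean mask and assigns through `.loc`, an alternative to `iterrows` that is usually faster on large frames.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "PostCode": ["M7A", "M9A"],
        "Borough": ["Queen's Park", "Etobicoke"],
        "Neighborhood": ["Not assigned", "Islington Avenue"],
    }
)

# Rows whose neighborhood is missing take the borough name instead.
mask = toy["Neighborhood"] == "Not assigned"
toy.loc[mask, "Neighborhood"] = toy.loc[mask, "Borough"]
print(toy)
```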


In [7]:
# the shape of the ungrouped dataframe (before neighborhoods are combined)
print(citydf.shape)

(212, 3)
