# Segmenting and Clustering Neighborhoods in Toronto

This notebook builds a dataframe of the postal codes, boroughs, and neighborhoods of Toronto by scraping a Wikipedia page with the Beautiful Soup library, which extracts data from HTML and XML files.  
More information about Beautiful Soup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [2]:
import numpy as np  # library to handle data in a vectorized manner

import pandas as pd  # library for data analysis

import json  # library to handle JSON files

import requests  # library to handle HTTP requests
from pandas import json_normalize  # flatten JSON into a pandas dataframe (pandas >= 1.0; the pandas.io.json import path is deprecated)

# Import Beautiful Soup to scrape the Wikipedia HTML page
from bs4 import BeautifulSoup

Use pandas, the BeautifulSoup package, or any other method you are comfortable with to transform the table on the Wikipedia page into a pandas dataframe.
  

In [42]:
# Download the page and parse its HTML
website = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(website.content, 'html.parser')
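
The request above assumes the download succeeds. A minimal sketch of a more defensive fetch might look like the following. The helper name `fetch_page`, the 10-second timeout, and the User-Agent string are illustrative assumptions, not part of the original notebook.

```python
import requests

URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"


def fetch_page(url, timeout=10):
    """Return the raw HTML bytes of `url`, raising on HTTP errors.

    Hypothetical helper: the name, timeout, and header value are assumptions.
    """
    headers = {"User-Agent": "toronto-neighborhoods-notebook/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.content


# Example call (requires network access):
# soup = BeautifulSoup(fetch_page(URL), "html.parser")
```

Failing early on a bad response avoids parsing an error page as if it were the postal code table.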

In [44]:
table = soup.find('tbody')
rows = table.select('tr')
row = [r.get_text() for r in rows]
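
Taking the full text of each row and splitting it on newlines later produces empty filler columns. As an alternative sketch, each row can be read cell by cell so that it yields exactly one value per column. The inline `SAMPLE_HTML` and the helper `extract_rows` are made up for illustration so the example runs offline; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, used so the example runs offline.
SAMPLE_HTML = """
<table>
<tbody>
<tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td></td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</tbody>
</table>
"""


def extract_rows(html):
    """Return one list of stripped cell texts per <tr> in the first <tbody>."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("tbody")
    rows = []
    for tr in body.select("tr"):
        # Header cells are <th>, data cells are <td>; collect both in order.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        rows.append(cells)
    return rows


parsed = extract_rows(SAMPLE_HTML)
header, records = parsed[0], parsed[1:]
print(header)
print(records)
```

With the cells already separated, `pd.DataFrame(records, columns=header)` would give a three-column frame directly.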

In [48]:
# Check the header row (the first <tr> of the table)
print(table.tr.text)


Postal code

Borough

Neighborhood



## Converting to a dataframe

In [47]:
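df = pd.DataFrame(row)  # one raw text string per table row
df1 = df[0].str.split('\n', expand=True)  # split each string into columns at the newlines
df2 = df1.rename(columns=df1.iloc[0])  # use the first row as the column labels
df3 = df2.drop(df2.index[0])  # drop the row that held the labels
df3.head()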

Unnamed: 0,Unnamed: 1,Postal code,Unnamed: 3,Borough,Unnamed: 5,Neighborhood,Unnamed: 7,None,None.1
1,,M1A,,Not assigned,,,,,
2,,M2A,,Not assigned,,,,,
3,,M3A,,North York,,Parkwoods,,,
4,,M4A,,North York,,Victoria Village,,,
5,,M5A,,Downtown Toronto,,Regent Park / Harbourfront,,,


In [72]:
df = pd.DataFrame(row)
df.head()

Unnamed: 0,0
0,\nPostal code\n\nBorough\n\nNeighborhood\n
1,\nM1A\n\nNot assigned\n\n\n
2,\nM2A\n\nNot assigned\n\n\n
3,\nM3A\n\nNorth York\n\nParkwoods\n
4,\nM4A\n\nNorth York\n\nVictoria Village\n


In [73]:
df1 = df[0].str.split('\n', expand=True)  # split the series into a dataframe, with one new column per split piece
df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,,Postal code,,Borough,,Neighborhood,,,
1,,M1A,,Not assigned,,,,,
2,,M2A,,Not assigned,,,,,
3,,M3A,,North York,,Parkwoods,,,
4,,M4A,,North York,,Victoria Village,,,
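
To see where the empty columns come from, here is a small sketch that applies the same split to two raw strings of the shape shown earlier. The frame name `raw` is an illustrative assumption.

```python
import pandas as pd

# Two raw strings in the same shape as the scraped rows.
raw = pd.DataFrame([
    "\nPostal code\n\nBorough\n\nNeighborhood\n",
    "\nM3A\n\nNorth York\n\nParkwoods\n",
])

parts = raw[0].str.split("\n", expand=True)

# The leading "\n" yields an empty first piece and each "\n\n" yields an empty
# piece between values, so the real data sits in columns 1, 3, and 5.
print(parts.shape)
print(parts.loc[1, [1, 3, 5]].tolist())
```

This is why the following cells drop every column except 1, 3, and 5.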


In [75]:
# Drop the trailing columns 6-8, which are empty in the preview above
df1.drop([6, 7, 8], axis=1, inplace=True)
df1.head()

Unnamed: 0,0,1,2,3,4,5
0,,Postal code,,Borough,,Neighborhood
1,,M1A,,Not assigned,,
2,,M2A,,Not assigned,,
3,,M3A,,North York,,Parkwoods
4,,M4A,,North York,,Victoria Village


In [76]:
# Drop columns 0 and 4, the empty pieces left between consecutive newlines
df1.drop([0, 4], axis=1, inplace=True)
df1.head()

Unnamed: 0,1,2,3,5
0,Postal code,,Borough,Neighborhood
1,M1A,,Not assigned,
2,M2A,,Not assigned,
3,M3A,,North York,Parkwoods
4,M4A,,North York,Victoria Village


In [77]:
# Drop column 2, the last empty separator column
df1.drop([2], axis=1, inplace=True)
df1.head()

Unnamed: 0,1,3,5
0,Postal code,Borough,Neighborhood
1,M1A,Not assigned,
2,M2A,Not assigned,
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


In [78]:
df1.rename(columns={1:'Postcode',3:'Borough', 5:'Neighborhood'}, inplace=True)
df1.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,Postal code,Borough,Neighborhood
1,M1A,Not assigned,
2,M2A,Not assigned,
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


In [79]:
# Drop the first row, which still holds the original header labels
df1.drop([0], inplace=True)
df1.head()

Unnamed: 0,Postcode,Borough,Neighborhood
1,M1A,Not assigned,
2,M2A,Not assigned,
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Regent Park / Harbourfront
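
The separate drop and rename cells above can also be collapsed into one short step. The sketch below does this on a hypothetical stand-in for the split frame; the names `split` and `tidy` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical stand-in for the split frame before any columns are dropped.
split = pd.DataFrame([
    ["", "Postal code", "", "Borough", "", "Neighborhood", ""],
    ["", "M1A", "", "Not assigned", "", "", ""],
    ["", "M3A", "", "North York", "", "Parkwoods", ""],
])

tidy = split[[1, 3, 5]].copy()  # keep only the columns that hold data
tidy.columns = ["Postcode", "Borough", "Neighborhood"]
tidy = tidy.iloc[1:].reset_index(drop=True)  # drop the header row, renumber from 0

print(tidy)
```

Selecting the wanted columns instead of dropping the unwanted ones keeps the code valid even if the split produces a different number of trailing empty columns.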


In [80]:
print(df.shape)  # df is the raw single-column frame, header row included

(181, 1)


## Cleaning the data

#### Drop rows with a 'Not assigned' borough

In [84]:
indexes = df1[df1['Borough'] == 'Not assigned'].index
df1.drop(indexes, inplace=True)


In [85]:
print(df1.shape)

(103, 3)


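Filtering removed 77 rows: `df1` held 180 postal codes before this step (the 181 raw rows minus the header row) and holds 103 afterwards.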
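
The same filter can also be written as a boolean mask, which avoids collecting index labels first. A minimal sketch on made-up sample rows follows; the names `sample`, `assigned`, and `removed` are illustrative assumptions.

```python
import pandas as pd

# Made-up rows in the same layout as the cleaned frame.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["", "", "Parkwoods", "Victoria Village"],
})

# Keep only rows whose borough is assigned, then renumber the index from 0.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
removed = len(sample) - len(assigned)

print(assigned)
print(removed)
```

Counting the removed rows from the two lengths of the same frame avoids mixing in the header row of the raw frame.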

#### Combine neighborhoods that share the same postcode

In [87]:
# Join the neighborhoods of each (Postcode, Borough) pair into one comma-separated string
result = df1.groupby(['Postcode', 'Borough'], sort=False).agg(', '.join)

In [88]:
df2= result.reset_index()
print(df2.shape)

(103, 3)
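
The shape is still (103, 3), which suggests that each postcode and borough pair already occupied a single row in the scraped table, so the grouping had nothing to merge here. The sketch below shows the effect on made-up rows where one postcode does repeat; the sample data and the name `combined` are assumptions.

```python
import pandas as pd

# Made-up rows in which postcode M5A appears twice.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

combined = (
    sample.groupby(["Postcode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)  # join the neighborhoods of each group with ", "
    .reset_index()   # turn the group keys back into ordinary columns
)

print(combined)
```

Passing `sort=False` keeps the groups in order of first appearance rather than sorting them by postcode.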


## Saving the dataframe to a CSV file

In [89]:
df2.to_csv('toronto.csv')
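
By default `to_csv` also writes the row index as an unnamed first column, which reappears as `Unnamed: 0` when the file is read back. A small sketch of a round trip without the index follows. It uses an in-memory buffer instead of a file, and the sample rows are made up for illustration.

```python
import io

import pandas as pd

sample = pd.DataFrame({
    "Postcode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})

buffer = io.StringIO()  # in-memory stand-in for 'toronto.csv'
sample.to_csv(buffer, index=False)  # index=False omits the row-number column
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```

Whether to keep the index is a design choice; dropping it is usually convenient when the row numbers carry no meaning.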