## Segmenting and Clustering Neighborhoods in Toronto

### Importing libraries

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

### DataFrame Generation

We must obtain the data from the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. It lists the postal codes of Toronto, together with the borough and neighborhood each one belongs to.


#### 1. Reading Data from HTML Wikipedia page
I used the BeautifulSoup library to parse the page's HTML.

In [2]:
# Download the page and parse its HTML
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url)
text = bs(page.content, 'lxml')

#### 2. Extracting values from HTML Table elements
The column names are taken from the <th> elements, and the rows are populated with the values inside the <td> elements.

In [3]:
# Locate the table, collect its rows, and create empty lists for the column names and row values
table = text.find('table',{'class':'wikitable sortable'})
tr_rows = table.find_all('tr')
df_cols = []
df_rows = []

# Search for header tags. Their text becomes the DataFrame column names
th_rows = table.find_all('th')
for th in th_rows:
    df_cols.append(th.text)

# Search for row tags. Each one becomes a DataFrame row
for tr in tr_rows:
    td = tr.find_all('td')
    # Use a separate name for each cell so the loop variable 'tr' is not shadowed
    row = [cell.text for cell in td]
    df_rows.append(row)
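
For comparison, the same <th>/<td> extraction can be sketched on a small inline table, which makes it easy to check without a network request. The `SAMPLE_HTML` string and the `extract_table` helper are hypothetical names introduced here, and this variant strips whitespace and skips the header row during extraction, which the notebook instead handles in later steps.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, used only for illustration.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""


def extract_table(html):
    """Return (header_names, rows) for the first 'wikitable sortable' table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "wikitable sortable"})
    # <th> cells hold the column names; strip=True removes the trailing newline.
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(cells)
    return headers, rows


headers, rows = extract_table(SAMPLE_HTML)
```

Stripping at extraction time is a design choice, not a requirement: the notebook keeps the raw text and cleans it afterwards, which works equally well.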

#### 3. Creating the DataFrame from extracted elements
The header row contains no <td> cells, so it produces an empty row that must be deleted.

In [4]:
# Create the DataFrame from df_cols and df_rows, then drop rows that contain None values
df = pd.DataFrame(df_rows, columns=df_cols)
df.dropna(inplace=True)
# Rename the columns; the last header still carries the trailing '\n' from the HTML
df.rename(columns={'Neighbourhood\n': 'Neighborhood', 'Postcode': 'PostalCode'}, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
1,M1A,Not assigned,Not assigned\n
2,M2A,Not assigned,Not assigned\n
3,M3A,North York,Parkwoods\n
4,M4A,North York,Victoria Village\n
5,M5A,Downtown Toronto,Harbourfront\n
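
A minimal sketch of why the empty row appears and how `dropna` removes it, assuming a small hand-written sample instead of the scraped rows. The first entry of `rows` mimics the header `<tr>`, which yields an empty list.

```python
import pandas as pd

# Hypothetical sample: the first entry mimics the header row, which has no <td> cells.
cols = ["PostalCode", "Borough", "Neighborhood"]
rows = [
    [],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
]

frame = pd.DataFrame(rows, columns=cols)

# The short row is padded with missing values, so every column in it is null.
header_row_is_empty = bool(frame.iloc[0].isna().all())

# dropna() removes it; the surviving rows keep their original labels (1 and 2).
cleaned = frame.dropna()
```

This also explains why the index in the output above starts at 1 and not 0.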


#### 4. The values in the last column include an unnecessary '\n' character that must be removed

In [5]:
# Remove the newline character from every string value
df = df.replace(r'\n', '', regex=True)
print(df.shape)

(289, 3)
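
As an alternative sketch, the newline can be removed from a single column with the string accessor. The `sample` frame below is a hypothetical two-row excerpt; `str.strip()` removes leading and trailing whitespace, including the trailing newline.

```python
import pandas as pd

# Hypothetical two-row excerpt with the trailing newline still present.
sample = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A"],
        "Borough": ["North York", "North York"],
        "Neighborhood": ["Parkwoods\n", "Victoria Village\n"],
    }
)

# Strip surrounding whitespace (including '\n') from the Neighborhood column only.
sample["Neighborhood"] = sample["Neighborhood"].str.strip()
```

Unlike the frame-wide `replace`, this touches only one column and leaves newlines inside a value untouched.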


#### 5. Rows without an assigned borough must be dropped

In [6]:
indexNames = df[df['Borough'] == 'Not assigned'].index
df.drop(indexNames, inplace = True)
df.shape

(212, 3)
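
The same filtering can be sketched with a boolean mask, which keeps the rows of interest instead of dropping the others by index. The `sample` frame is a hypothetical excerpt of the table.

```python
import pandas as pd

# Hypothetical excerpt: two rows have no borough assigned.
sample = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M2A", "M3A", "M4A"],
        "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
        "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
    }
)

# Keep only the rows whose Borough is assigned; copy() avoids modifying a view later.
assigned = sample[sample["Borough"] != "Not assigned"].copy()
```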

#### 6. If Neighborhood is 'Not assigned', it takes the value of Borough

In [7]:
df.loc[df['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df['Borough']
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Queen's Park
11,M9A,Etobicoke,Islington Avenue
12,M1B,Scarborough,Rouge
13,M1B,Scarborough,Malvern
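
A minimal sketch of the fill-in rule on a hypothetical two-row sample. The right-hand side is restricted to the same mask here, which makes the row-by-row alignment explicit; the notebook's version relies on pandas aligning the full Borough column by index, which gives the same result.

```python
import pandas as pd

# Hypothetical sample: M7A has a borough but no neighborhood.
sample = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M7A"],
        "Borough": ["North York", "Queen's Park"],
        "Neighborhood": ["Parkwoods", "Not assigned"],
    }
)

# Rows whose neighborhood is missing take the borough name instead.
mask = sample["Neighborhood"] == "Not assigned"
sample.loc[mask, "Neighborhood"] = sample.loc[mask, "Borough"]
```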


#### 7. Rows with the same PostalCode must be merged, combining the Neighborhood names
Since every PostalCode belongs to a single Borough, we group 'df' by both columns and join the Neighborhoods into one row.
Finally, we reset the index, and the DataFrame is ready.

In [8]:
dfgrouped = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(lambda x: ', '.join(x)).reset_index()
dfgrouped.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
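
The grouping step can be sketched on a hypothetical five-row sample. This variant passes `", ".join` to `agg` instead of wrapping it in a lambda; the result is the same. Note that `groupby` sorts the group keys by default, so the output is ordered by PostalCode, while the order of neighborhoods within each group is preserved.

```python
import pandas as pd

# Hypothetical sample with two postal codes that span several neighborhoods.
sample = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M5A", "M1B", "M1B", "M3A"],
        "Borough": [
            "Downtown Toronto",
            "Downtown Toronto",
            "Scarborough",
            "Scarborough",
            "North York",
        ],
        "Neighborhood": ["Harbourfront", "Regent Park", "Rouge", "Malvern", "Parkwoods"],
    }
)

# One row per (PostalCode, Borough), with the neighborhood names joined by ', '.
grouped = (
    sample.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
```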


### FINAL: Renaming the DataFrame and printing its shape

In [9]:
TorontoPD = dfgrouped
print(TorontoPD.columns)

# DataFrame shape
TorontoPD.shape

Index(['PostalCode', 'Borough', 'Neighborhood'], dtype='object')


(103, 3)
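
Beyond the shape, a few sanity checks can confirm that the cleaning rules held. The sketch below assumes a hypothetical helper, `check_postal_frame`, and runs it on a small sample copied from the grouped output above; it is not part of the original notebook.

```python
import pandas as pd


def check_postal_frame(frame):
    """Return a list of problems found in a cleaned postal-code frame (hypothetical helper)."""
    expected = ["PostalCode", "Borough", "Neighborhood"]
    if list(frame.columns) != expected:
        return ["unexpected columns"]
    problems = []
    # After grouping, each postal code should appear exactly once.
    if frame["PostalCode"].duplicated().any():
        problems.append("duplicate postal codes")
    # No placeholder values should survive the cleaning steps.
    if (frame["Borough"] == "Not assigned").any():
        problems.append("unassigned borough")
    if (frame["Neighborhood"] == "Not assigned").any():
        problems.append("unassigned neighborhood")
    return problems


# Sample rows copied from the grouped output shown earlier.
sample = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C", "M1E"],
        "Borough": ["Scarborough", "Scarborough", "Scarborough"],
        "Neighborhood": [
            "Rouge, Malvern",
            "Highland Creek, Rouge Hill, Port Union",
            "Guildwood, Morningside, West Hill",
        ],
    }
)

issues = check_postal_frame(sample)
```

An empty list means none of the checked problems were found; the same call could be applied to `TorontoPD`.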