### Project: Segmenting and Clustering Neighborhoods in Toronto (part one)

---

**Description**

This project aims to draw insights from locations in the Toronto area that share common attributes, using unsupervised machine learning methods.

#### Stage one: extract and transform the data for analysis

---

Load necessary libraries

In [1]:
import pandas as pd
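# Show full cell contents; None removes the width limit (a negative value such as -1 is deprecated in recent pandas)
pd.set_option('display.max_colwidth', None)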

The data comes from Wikipedia and contains the following information about the Toronto region:

* Postal code
* Borough
* Neighborhood

The pandas library has an excellent HTML reader. I will use its `read_html` method to extract the table.

In [2]:
df_ca = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
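
`read_html` returns a list with one DataFrame per table it finds on the page, so taking element `[0]` depends on the postal code table being the first one. As a minimal sketch, the table could instead be selected by its column names. The helper `pick_table` and the stand-in list `tables` are hypothetical names introduced only for this illustration, and the column names are assumed to match the output shown in this notebook.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include every required name."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f'no table with columns {sorted(required)}')

# Stand-ins for the list that pd.read_html would return (illustrative only).
tables = [
    pd.DataFrame({'Note': ['some other table on the page']}),
    pd.DataFrame({'Postcode': ['M3A'],
                  'Borough': ['North York'],
                  'Neighbourhood': ['Parkwoods']}),
]

df_demo = pick_table(tables, ['Postcode', 'Borough', 'Neighbourhood'])
print(df_demo)
```

With the real page, the same helper would be called on the list returned by `pd.read_html(...)` in place of `tables`.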

In [3]:
df_ca.head(5)

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


Let's set out the premises:

* Only process the cells that have an assigned **borough**. Ignore cells with a borough that is **"Not assigned"**
* More than one neighborhood can exist in one postal code area. These rows will be combined into one row with the neighborhoods separated with a comma
* If a cell has a borough but a **"Not assigned"** neighborhood, then the neighborhood will be the same as the borough
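
To make the three rules concrete, here is a minimal sketch that applies them to a small hand-made frame. The rows in `toy` are hypothetical values chosen for illustration, not data from the Wikipedia table.

```python
import pandas as pd

# Toy rows (hypothetical values, not taken from the Wikipedia table).
toy = pd.DataFrame({
    'Postcode': ['X1A', 'X2A', 'X2A', 'X3A'],
    'Borough': ['Not assigned', 'East Side', 'East Side', 'West Side'],
    'Neighbourhood': ['Not assigned', 'Alpha', 'Beta', 'Not assigned'],
})

# Rule 1: drop rows whose borough is not assigned.
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# Rule 3: fall back to the borough name when the neighbourhood is missing.
missing = toy['Neighbourhood'] == 'Not assigned'
toy.loc[missing, 'Neighbourhood'] = toy.loc[missing, 'Borough']

# Rule 2: one row per postal code, neighbourhoods joined with a comma.
toy = (toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
          .agg(', '.join)
          .reset_index())

print(toy)
```

The row for `X1A` disappears, the two `X2A` rows collapse into a single row with both neighbourhoods, and `X3A` takes its borough name as its neighbourhood.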

In [4]:
# Keep only the rows with an assigned borough (copy to avoid modifying a view later)

df_ca = df_ca[df_ca['Borough'] != 'Not assigned'].copy()

In [5]:
# Replace missing neighbourhoods with the borough name, then combine rows that share a postcode and borough

df_ca['Neighbourhood'] = df_ca['Neighbourhood'].where(df_ca['Neighbourhood'] != 'Not assigned', df_ca['Borough'])
df_ca = df_ca.groupby(['Postcode', 'Borough'])['Neighbourhood'].agg(', '.join).reset_index()

In [6]:
# Checking the new dataset

df_ca.head(10)

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |


In [9]:
# Print the number of rows

print('In total, there are %s rows in the dataset' % df_ca.shape[0])

In total, there are 103 rows in the dataset
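
Besides counting rows, the cleaned frame can be checked against the premises directly. The sketch below is one possible way to do that. `check_clean` is a hypothetical helper, and `sample` reuses two rows from the output shown earlier so that the block runs on its own.

```python
import pandas as pd

def check_clean(df):
    """Raise AssertionError if the frame breaks one of the cleaning rules."""
    assert (df['Borough'] != 'Not assigned').all(), 'unassigned borough left'
    assert (df['Neighbourhood'] != 'Not assigned').all(), 'unassigned neighbourhood left'
    assert df['Postcode'].is_unique, 'a postal code appears more than once'
    return True

# Two rows in the same shape as the cleaned table.
sample = pd.DataFrame({
    'Postcode': ['M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge, Malvern', 'Woburn'],
})

print(check_clean(sample))
```

In the notebook, the same function would be called as `check_clean(df_ca)` after the cleaning cells have run.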


The first part is done.