# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

#### In this assignment, I'll explore, segment, and cluster the neighborhoods in the city of Toronto.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto.

--------------------------

# First Part: Scraping, Wrangling, and Cleaning the Dataset

In this first part of the notebook, I scrape the postal code table from Wikipedia (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) and load it into a pandas DataFrame.

###### First, let's install and import the libraries:

In [1]:
#Install packages
!pip -q install geopy
!pip -q install folium

In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform a JSON file into a pandas dataframe (import path for pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

from bs4 import BeautifulSoup # library for scraping web pages

print('Libraries imported.')

Libraries imported.


###### Now let's scrape the web page. This is fairly simple because the data on Wikipedia is in a table, so it can be converted directly to a pandas DataFrame:

In [3]:
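from io import StringIO

res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
res.raise_for_status()  # stop early if the page could not be fetched
soup = BeautifulSoup(res.content, 'lxml')
table = soup.find_all('table')[0]  # assumes the postal code table is the first table on the page
# Newer pandas versions expect a file-like object rather than a raw HTML string
df = pd.read_html(StringIO(str(table)))[0]
df.head()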

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
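
To make the table-to-DataFrame conversion less of a black box, here is a minimal sketch of the idea using only the standard library's `html.parser` plus pandas. The inline `SAMPLE_HTML` and the `TableRows` class are hypothetical stand-ins written for illustration; this is not how `pd.read_html` is implemented, and real pages need the more robust handling that pandas and BeautifulSoup provide.

```python
from html.parser import HTMLParser

import pandas as pd

# Hypothetical inline HTML standing in for the Wikipedia table.
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = TableRows()
parser.feed(SAMPLE_HTML)

# The first row holds the column names, the rest hold the data.
header, *body = parser.rows
sample_df = pd.DataFrame(body, columns=header)
print(sample_df)
```

In practice, `pd.read_html` does this work (and handles many edge cases) in a single call, which is why the notebook relies on it.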


__The dataframe contains rows whose Borough and Neighbourhood are "Not assigned". Let's drop the rows where Borough is "Not assigned":__


In [4]:
indicesToDrop = df[df['Borough'] == 'Not assigned'].index
df.drop(indicesToDrop, inplace=True)

In [5]:
df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M6A | North York | Lawrence Heights |
| 6 | M6A | North York | Lawrence Manor |
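
The same filtering can also be written with a boolean mask instead of collecting index labels and calling `drop`. The following is a minimal sketch on a toy frame (the name `toy` and its four rows are assumptions made for illustration, with values taken from the output shown earlier); it is an alternative formulation, not the approach used in the cells above.

```python
import pandas as pd

# Toy rows shaped like the scraped table.
toy = pd.DataFrame({
    "Postcode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows whose Borough is assigned; original index labels are preserved.
assigned = toy[toy["Borough"] != "Not assigned"]
print(assigned)
```

Because the mask returns a new frame rather than modifying `toy` in place, the unfiltered data stays available for comparison.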


More than one neighborhood can exist in one postal code area. 

For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: **Harbourfront** and **Regent Park**. 

These two rows will be combined into one row, with the neighborhoods separated by a comma.

In [6]:
## Add Postcode and Borough to the index (append=True keeps the original row index as the first level)
df.set_index(["Postcode", "Borough"], inplace=True, append=True, drop=True)

In [7]:
df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M6A | North York | Lawrence Heights |
| 6 | M6A | North York | Lawrence Manor |
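
To see what `set_index(..., append=True)` does to the index, the sketch below inspects the index levels on a small toy frame. The frame `toy` and its values are assumptions chosen to mirror the rows shown earlier; the point is only that the original row labels remain as an unnamed first level, followed by `Postcode` and `Borough`.

```python
import pandas as pd

# Toy frame that keeps the original (non-contiguous) row labels.
toy = pd.DataFrame(
    {
        "Postcode": ["M5A", "M6A", "M6A"],
        "Borough": ["Downtown Toronto", "North York", "North York"],
        "Neighbourhood": ["Harbourfront", "Lawrence Heights", "Lawrence Manor"],
    },
    index=[4, 5, 6],
)

# append=True adds the two columns as extra index levels instead of replacing the index.
indexed = toy.set_index(["Postcode", "Borough"], append=True)

print(list(indexed.index.names))  # level names; the original row index has no name
print(list(indexed.columns))      # only the non-index column remains
```

Grouping on the two named levels afterwards ignores the unnamed first level, which is why the original row labels no longer appear after aggregation.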


In [8]:
## Group by Postcode and Borough and join the neighbourhood names of each group with commas
df_aggregated = df.groupby(level = ['Postcode', 'Borough'], sort = False).agg(','.join)

In [9]:
df_aggregated.head()

| Postcode | Borough | Neighbourhood |
|---|---|---|
| M3A | North York | Parkwoods |
| M4A | North York | Victoria Village |
| M5A | Downtown Toronto | Harbourfront |
| M6A | North York | Lawrence Heights,Lawrence Manor |
| M7A | Downtown Toronto | Queen's Park |


In [10]:
## Reset index
df_aggregated.reset_index(inplace=True)
df_aggregated.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights,Lawrence Manor |
| 4 | M7A | Downtown Toronto | Queen's Park |
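
The `set_index`, `groupby`, and `reset_index` steps above can also be expressed as a single chain that groups on the columns directly. This is a minimal sketch on a toy frame (`toy` and `combined` are illustrative names, and the rows are copied from the outputs above); it is an equivalent formulation under those assumptions, not a replacement for the cells as run.

```python
import pandas as pd

# Toy rows: M6A appears twice and should collapse into one row.
toy = pd.DataFrame({
    "Postcode": ["M5A", "M6A", "M6A", "M7A"],
    "Borough": ["Downtown Toronto", "North York", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Harbourfront", "Lawrence Heights", "Lawrence Manor", "Queen's Park"],
})

combined = (
    toy.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
       .agg(",".join)   # join the neighbourhood names within each group
       .reset_index()   # turn the group keys back into ordinary columns
)
print(combined)
```

Passing `sort=False` keeps the groups in order of first appearance, matching the order of the scraped table.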


In [15]:
df_aggregated.shape

(103, 3)
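
Beyond checking the shape, a couple of quick assertions can confirm that the cleaning steps did what was intended. The helper below, `check_aggregated`, is a hypothetical function written for this sketch and is shown on a toy frame; with the real data it would be called as `check_aggregated(df_aggregated)`.

```python
import pandas as pd


def check_aggregated(frame: pd.DataFrame) -> None:
    """Raise AssertionError if the cleaned table breaks an expected property."""
    # Each postal code should appear exactly once after aggregation.
    assert frame["Postcode"].is_unique, "duplicate postal codes remain"
    # No unassigned boroughs should have survived the earlier drop.
    assert not (frame["Borough"] == "Not assigned").any(), "unassigned borough found"


# Toy frame with values taken from the aggregated output shown earlier.
toy = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M6A"],
    "Borough": ["North York", "North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Lawrence Heights,Lawrence Manor"],
})

check_aggregated(toy)  # passes silently when both properties hold
print(toy.shape)
```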