<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto City</font></h1>

## Introduction

In this notebook, we will scrape data from a Wikipedia page,
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M,

which has information about Toronto neighborhoods, and build a dataframe for segmentation later.

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


<a id='item1'></a>

## 1. Scrape Data and Explore Dataset

#### Load and explore the data

Next, let's scrape the data using the BeautifulSoup package.

In [2]:
from bs4 import BeautifulSoup as BS
#!conda install -c conda-forge Requests --yes

In [3]:
import requests
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
results = requests.get(url)
results.status_code

200

In [4]:
# pass the page content to BeautifulSoup for scraping
c = results.content
soup = BS(c, 'html.parser')  # name the parser explicitly to avoid a parser-selection warning

In [5]:
# get the column names from the header cells of the first table
dataFrame = pd  # alias for the pandas module
column_names = []
for n in soup.table.find_all('th'):
    column_names.append(n.text.rstrip())  # drop the trailing newline
column_names

['Postcode', 'Borough', 'Neighbourhood']

In [6]:
# the column names were already stripped of trailing whitespace in the previous cell
neighborhoods = pd.DataFrame(columns=column_names)  # instantiate the empty dataframe
neighborhoods

Unnamed: 0,Postcode,Borough,Neighbourhood


Read through the table rows and assign the data to the dataframe.

In [7]:
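for i, row in enumerate(soup.table.find_all('tr')):
    # collect the text of every data cell in this row
    rowValue = []
    for a in row.find_all('td'):
        rowValue.append(a.text)
    # the header row has no <td> cells, so only complete three-cell rows are stored
    if len(rowValue) == 3:
        neighborhoods.loc[i-1] = rowValue  # i-1 so that the first data row gets index 0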

In [8]:
neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned\n
1,M2A,Not assigned,Not assigned\n
2,M3A,North York,Parkwoods\n
3,M4A,North York,Victoria Village\n
4,M5A,Downtown Toronto,Harbourfront\n
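
The row-by-row loop above can also be written so that all rows are collected first and the dataframe is built once. The following is a minimal sketch under stated assumptions: `sample_html` is a hypothetical three-row stand-in for the Wikipedia table (so the sketch runs offline), and the names `sample_soup`, `records`, and `sample_df` are illustrative, not part of the notebook. It also strips whitespace from each cell, which differs from the notebook, where the trailing newline is replaced by a space in a later cell.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table, so the sketch runs offline.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
table = sample_soup.find("table")

# Header cells give the column names; strip the trailing newline.
columns = [th.get_text().strip() for th in table.find_all("th")]

# Collect every row that has as many data cells as there are columns.
# The header row has no <td> cells, so it is skipped automatically.
records = []
for tr in table.find_all("tr"):
    cells = [td.get_text().strip() for td in tr.find_all("td")]
    if len(cells) == len(columns):
        records.append(cells)

# Build the dataframe in one step instead of appending row by row.
sample_df = pd.DataFrame(records, columns=columns)
print(sample_df)
```

Building the frame from a list of rows avoids growing the dataframe one `.loc` assignment at a time, which tends to be slow on larger tables.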


In [9]:
# replace each '\n' with a space (the values keep a trailing space as a result)
neighborhoods = neighborhoods.replace('\n',' ', regex=True)
neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
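
Replacing `'\n'` with a space removes the newline but leaves a trailing space in each value, which is why a later cell matches on `'Not assigned '` with a space at the end. The sketch below contrasts that with stripping whitespace. It assumes a small hypothetical frame named `sample`; the notebook itself keeps the replace-based approach, so switching to `str.strip()` would also require matching on `'Not assigned'` without the trailing space later on.

```python
import pandas as pd

# Hypothetical two-row frame mimicking the scraped values.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A"],
    "Borough": ["Not assigned", "North York"],
    "Neighbourhood": ["Not assigned\n", "Parkwoods\n"],
})

# Replacing the newline with a space leaves a trailing space behind.
spaced = sample.replace("\n", " ", regex=True)

# Stripping whitespace removes the newline without adding anything.
stripped = sample.copy()
stripped["Neighbourhood"] = stripped["Neighbourhood"].str.strip()

print(repr(spaced.loc[0, "Neighbourhood"]))    # 'Not assigned '
print(repr(stripped.loc[0, "Neighbourhood"]))  # 'Not assigned'
```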


In [10]:
neighborhoods.shape

(289, 3)

### Cleaning up the data 

Remove all rows whose Borough is 'Not assigned'.

In [11]:
neighborhoods = neighborhoods[neighborhoods.Borough != 'Not assigned']
neighborhoods.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


In [12]:
cleanNbh = neighborhoods.reset_index()

In [13]:
del cleanNbh['index'] # drop the old index column that reset_index() added

In [14]:
cleanNbh.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Not assigned
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


Change 'Not assigned' neighborhoods to the Borough name.

In [15]:
# note the trailing space in 'Not assigned ', left over from replacing '\n' with ' '
cleanNbh.Neighbourhood = cleanNbh.Neighbourhood.replace('Not assigned ', cleanNbh.Borough)

In [16]:
cleanNbh.head(8)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
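
The same fill can be expressed with an explicit boolean condition, which makes the row-by-row alignment with `Borough` easier to see. This is a minimal sketch, assuming a hypothetical two-row frame named `sample`; it uses `Series.mask` in place of the `replace` call in the notebook, and the helper name `is_unassigned` is illustrative.

```python
import pandas as pd

# Hypothetical frame; values keep the trailing space seen in the notebook.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Harbourfront ", "Not assigned "],
})

# True where the neighbourhood is the placeholder, ignoring stray whitespace.
is_unassigned = sample["Neighbourhood"].str.strip() == "Not assigned"

# Series.mask replaces values where the condition is True,
# taking the replacement from the Borough value in the same row.
sample["Neighbourhood"] = sample["Neighbourhood"].mask(is_unassigned, sample["Borough"])

print(sample)
```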


Use the groupby() method to combine all neighborhoods that share a postal code into a single row.

In [17]:
cleanNbh = pd.DataFrame(cleanNbh.groupby(['Postcode', 'Borough'], as_index=False, sort=False)['Neighbourhood'].apply(lambda x: "%s" % ', '.join(x)))

In [18]:
cleanNbh.head(15)

Unnamed: 0_level_0,Unnamed: 1_level_0,0
Postcode,Borough,Unnamed: 2_level_1
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Harbourfront , Regent Park"
M6A,North York,"Lawrence Heights , Lawrence Manor"
M7A,Queen's Park,Queen's Park
M9A,Etobicoke,Islington Avenue
M1B,Scarborough,"Rouge , Malvern"
M3B,North York,Don Mills North
M4B,East York,"Woodbine Gardens , Parkview Hill"
M5B,Downtown Toronto,"Ryerson , Garden District"
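
As a point of comparison, the grouping can be sketched with `agg` followed by `reset_index`, which keeps the `Neighbourhood` column name instead of producing a column labelled `0`. The frame `sample` and the result name `combined` are hypothetical, and the values here carry no trailing spaces, so the joined strings differ slightly from the notebook output above.

```python
import pandas as pd

# Hypothetical frame with two neighbourhoods sharing one postal code.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Queen's Park"],
})

# Group by postal code and borough, join the neighbourhood names with ", ",
# then turn the group keys back into regular columns.
combined = (
    sample.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(combined)
```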


Reset the index so that Postcode and Borough become regular columns and the dataframe has three columns.

In [19]:
cleanNbh = cleanNbh.reset_index()
cleanNbh.head()

Unnamed: 0,Postcode,Borough,0
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront , Regent Park"
3,M6A,North York,"Lawrence Heights , Lawrence Manor"
4,M7A,Queen's Park,Queen's Park


In [20]:

cleanNbh.head(15)

Unnamed: 0,Postcode,Borough,0
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront , Regent Park"
3,M6A,North York,"Lawrence Heights , Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge , Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens , Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson , Garden District"


Print the final shape of the data.

In [22]:
cleanNbh.shape

(103, 3)

## Getting the Longitude and Latitude of each Postal Code