# Segmenting and Clustering Neighborhoods in Toronto

A Jupyter Notebook for the Applied Data Science Capstone, part of the "IBM Data Science" course on Coursera.

## 1. Scraping Data of Postal Codes and Neighborhoods

In this section we will scrape the Wikipedia page ([List of postal codes of Canada](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)) to extract the neighborhoods in the city of Toronto. The list covers the postal codes whose first letter is M, which are located within the city of Toronto in the province of Ontario.

Then we will load the data into a _pandas_ dataframe and clean it.

First of all, we will import necessary libraries.

#### Import necessary libraries

In [2]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Libraries for scraping and communicating with websites
import requests
from bs4 import BeautifulSoup


Next we'll use the Wikipedia page to extract the postal codes, boroughs, and neighbourhoods. To do this, we first fetch the page content and parse it into a _soup_ of elements. Then we select the element we need by its type, here _table_, and its _class_, here _wikitable sortable_, and extract the table contents.

In [3]:
# scrape data from the Wikipedia page
url_wiki = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page_wiki = requests.get(url_wiki).text
soup = BeautifulSoup(page_wiki, 'lxml')

# get the right table to scrape
table = soup.find('table',{'class':'wikitable sortable'})
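
Passing `'wikitable sortable'` as the class matches the attribute's exact string value, so it depends on the page writing the classes in exactly that order. The sketch below shows a CSS selector as a possible alternative. The HTML snippet is a made-up stand-in for the downloaded page, and `html.parser` is used only so that the example does not require `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real page is much larger.
sample_html = """
<table class="navbox"><tr><td>not the data</td></tr></table>
<table class="sortable wikitable extra">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Exact-string match: finds nothing here, because the class attribute
# is not literally "wikitable sortable".
exact = soup.find("table", {"class": "wikitable sortable"})

# CSS selector: matches a table carrying both classes, in any order.
by_selector = soup.select_one("table.wikitable.sortable")

# Fail early with a clear message instead of a later AttributeError on None.
if by_selector is None:
    raise ValueError("No table with classes 'wikitable' and 'sortable' found")

headers = [th.get_text(strip=True) for th in by_selector.find_all("th")]
print(exact is None)
print(headers)
```

Checking for `None` right after the lookup keeps a changed page layout from surfacing as a confusing error several cells later.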

The content is still not clean: our data sits between many `<th> ... </th>`, `<tr> ... </tr>`, and `<td> ... </td>` tags. We will use the `.find` and `.find_all` methods to locate these elements and keep only their text.

In [4]:
# extract header of the table
header = [th.text.rstrip() for th in table.find_all('th')]
header

['Postal Code', 'Borough', 'Neighbourhood']

In [5]:
# extract cells of the table
# consider an empty list for each column of the table
postal_code = []
borough = []
neighbourhood = []

for row in table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) == 3:    # only extract the table body, not the heading
        # string=True replaces the deprecated text=True argument
        postal_code.append(cells[0].find(string=True).rstrip())
        borough.append(cells[1].find(string=True).rstrip())
        neighbourhood.append(cells[2].find(string=True).rstrip())
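
Taking only the first text node of a cell can cut the value short when the cell mixes plain text with links. As a minimal sketch of one alternative, the hypothetical helper `extract_rows` below collects all the text in each cell with `get_text`. The sample HTML is invented for illustration (only the cell values come from this notebook), and the link inside the second row is an assumption about how such a cell might look.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table; the markup is illustrative only.
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td> M1A </td><td> Not assigned </td><td> Not assigned </td></tr>
  <tr><td> M3A </td><td><a href="#">North</a> York </td><td> Parkwoods </td></tr>
</table>
"""


def extract_rows(table):
    """Return one tuple of cleaned cell texts per three-cell body row."""
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) != 3:  # header rows hold <th> cells, not <td>
            continue
        # Join every text fragment in the cell with a space, stripping each one.
        rows.append(tuple(td.get_text(" ", strip=True) for td in cells))
    return rows


table = BeautifulSoup(sample_html, "html.parser").find("table")
rows = extract_rows(table)

# Transpose the rows into one list per column.
postal_code, borough, neighbourhood = (list(col) for col in zip(*rows))
print(rows)
```

The explicit `" "` separator matters: with `strip=True` and no separator, the fragments `North` and `York` would be joined without a space.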

In the next step we will build our dataframe from the extracted `postal_code`, `borough`, and `neighbourhood` lists. We will use `header` for the column names.

In [7]:
# make a dataframe of the table (materialize the zip as a list of row tuples)
toronto_data = pd.DataFrame(list(zip(postal_code, borough, neighbourhood)), columns=header)
toronto_data

|     | Postal Code | Borough | Neighbourhood |
| --- | --- | --- | --- |
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| ... | ... | ... | ... |
| 175 | M5Z | Not assigned | Not assigned |
| 176 | M6Z | Not assigned | Not assigned |
| 177 | M7Z | Not assigned | Not assigned |
| 178 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |


In the next step we will remove the rows whose `Borough` value is `Not assigned`, and we'll make sure to reset the indices.

In [11]:
# drop rows whose Borough is not assigned
toronto_data.drop(toronto_data.index[toronto_data['Borough']=='Not assigned'], inplace=True)
# reset the index; reset_index returns a new dataframe, so assign it back
toronto_data = toronto_data.reset_index(drop=True)
toronto_data

|     | Postal Code | Borough | Neighbourhood |
| --- | --- | --- | --- |
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |
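
The same cleaning step can also be written with a boolean mask, which some readers find easier to follow than `drop`. The sketch below runs on a small sample built from rows shown earlier in this notebook; the names `sample` and `assigned` are illustrative. It also shows why the result of `reset_index` has to be assigned back: the method returns a new dataframe and leaves the original untouched.

```python
import pandas as pd

# Small sample built from rows shown in the scraped table.
sample = pd.DataFrame(
    {
        "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
        "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
        "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
    }
)

# Keep only rows with an assigned borough; this returns a new dataframe.
assigned = sample[sample["Borough"] != "Not assigned"]

# The filtered frame still carries the original row labels.
labels_before = list(assigned.index)

# reset_index returns a new object, so the result is assigned back.
assigned = assigned.reset_index(drop=True)
labels_after = list(assigned.index)

print(labels_before, labels_after)
```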


Check whether any `Not assigned` value remains in the `Neighbourhood` column after cleaning the dataframe.

In [18]:
'Not assigned' in toronto_data.Neighbourhood.values

False

We are done with this step, as no `Neighbourhood` in `toronto_data` has the value `Not assigned`.

In [19]:
toronto_data.shape

(103, 3)
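
The membership test above looks at the `Neighbourhood` column only. As a rough sketch, the hypothetical helper `count_placeholder` below counts the placeholder in every column at once. The two small dataframes are samples built from rows shown earlier, not the full `toronto_data`.

```python
import pandas as pd


def count_placeholder(df, placeholder="Not assigned"):
    """Return, per column, how many cells equal the placeholder string."""
    return df.eq(placeholder).sum()


# Sample of cleaned rows.
clean = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A"],
        "Borough": ["North York", "North York"],
        "Neighbourhood": ["Parkwoods", "Victoria Village"],
    }
)

# Sample that still contains an uncleaned row.
dirty = pd.DataFrame(
    {
        "Postal Code": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }
)

clean_counts = count_placeholder(clean)
dirty_counts = count_placeholder(dirty)

print(clean_counts)
print(dirty_counts[dirty_counts > 0])  # only the columns that still need cleaning
```

A per-column count makes it clear which column a leftover placeholder sits in, which a single `True`/`False` cannot show.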