## Segmenting and Clustering Neighborhoods in Toronto

### 1: Scraping the Postal Codes of Canada from Wikipedia

### Introduction

In this notebook we explore and cluster the neighborhoods of Toronto.
We first write the code to scrape the Wikipedia page <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">List of postal codes of Canada: M</a>.
We extract the table of postal codes from that page and save it to a CSV file. We then read the CSV file into a pandas dataframe, where we clean and wrangle the data.

#### Importing the required packages

In [1]:
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd

#### Parsing the HTML table in the Wikipedia page and writing it to a CSV file

We fetch the page <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">List of postal codes of Canada: M</a>, parse its postal code table, and iterate over the table to build a list of rows, which we then write to the **postalcodes_canada.csv** file.

In [2]:
# Scrape and parse the Wikipedia page
page = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(page, 'lxml')
table = soup.find('table', class_=['wikitable'])

# Collect the text of every <td> cell, one list per table row
table_rows = []
for table_row in table.find_all('tr'):
    cells = [column.text.rstrip() for column in table_row.find_all('td')]
    table_rows.append(cells)

# The first <tr> holds only <th> cells, so its entry is empty;
# replace it with the header texts
header_row = [table_head.text.rstrip() for table_head in table.find_all('th')]
table_rows[0] = header_row

# Write the table to a CSV file; newline='' avoids blank lines between rows on Windows
with open('postalcodes_canada.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerows(table_rows)
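To see what the row-extraction loop above produces without a network request, here is a minimal sketch that runs the same logic on a small inline HTML fragment. The fragment and its header names are made up for illustration, the built-in `html.parser` is used instead of `lxml` to avoid an extra dependency, and the CSV is written to an in-memory buffer rather than a file.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the structure of a wikitable
html = """
<table class="wikitable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront
</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', class_='wikitable')

# One list of <td> texts per <tr>; the header row has no <td>, so it yields []
rows = [[td.text.rstrip() for td in tr.find_all('td')]
        for tr in table.find_all('tr')]

# Replace the empty first entry with the <th> texts
rows[0] = [th.text.rstrip() for th in table.find_all('th')]

# Write the rows as CSV to an in-memory buffer
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The `rstrip()` call matters because the last cell of each row carries a trailing newline in the markup.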

#### Loading, cleaning and transforming data
We read the data from **postalcodes_canada.csv** and rename the columns.
We then drop the rows where the value of **Borough** is **Not assigned**.
Where a row has a **Borough** but its **Neighborhood** is **Not assigned**, the **Neighborhood** is set to the same value as the **Borough**.
For postal codes that appear in more than one row, we combine the **Neighborhood** values into a single string, separated by **,** (comma). We do this by grouping the data by **PostalCode** and **Borough**. As the sample data in the assignment is unsorted, <code>sort=False</code> is used so that the groups keep their order of first appearance.

##### Reading data from 'postalcodes_canada.csv'

In [3]:
df_postalcodes_canada = pd.read_csv('postalcodes_canada.csv')

##### Renaming the columns of the dataframe

In [4]:
df_postalcodes_canada.columns = ['PostalCode', 'Borough', 'Neighborhood']

##### Dropping the rows where the value of 'Borough' is 'Not assigned'

In [5]:
df_postalcodes_canada = df_postalcodes_canada.drop(df_postalcodes_canada[df_postalcodes_canada['Borough'] == 'Not assigned'].index)
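As a small illustration of this step, the sketch below applies the same `drop` call to a toy dataframe (the three rows are invented for the example) and compares it with plain boolean filtering, which should select the same rows.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Same pattern as above: drop by the index of the matching rows
dropped = toy.drop(toy[toy['Borough'] == 'Not assigned'].index)

# Alternative: keep only the rows that do not match
filtered = toy[toy['Borough'] != 'Not assigned']

same = dropped.equals(filtered)
print(same)
```

Both forms keep the original index labels, so a later `reset_index` may be wanted if consecutive labels matter.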

##### Assigning 'Neighborhood' the same value as 'Borough' if the value of 'Neighborhood' is 'Not assigned'

In [6]:
not_assigned = df_postalcodes_canada['Neighborhood'] == 'Not assigned'
df_postalcodes_canada.loc[not_assigned, 'Neighborhood'] = df_postalcodes_canada.loc[not_assigned, 'Borough']
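The effect of this assignment can be checked on a toy dataframe. The following sketch uses two invented rows and a boolean mask named `mask` (a name chosen for this example) to copy the borough into the neighborhood only where the neighborhood is missing.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    'PostalCode': ['M3A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Parkwoods', 'Not assigned'],
})

# Rows whose neighborhood is missing
mask = toy['Neighborhood'] == 'Not assigned'

# Copy the borough into the neighborhood for those rows only
toy.loc[mask, 'Neighborhood'] = toy.loc[mask, 'Borough']
print(toy)
```

Rows that already have a neighborhood are left unchanged.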

##### Concatenating 'Neighborhood' values for every duplicate 'PostalCode'

In [7]:
df_postalcodes_canada = df_postalcodes_canada.groupby(['PostalCode', 'Borough'], sort=False).agg({'Neighborhood': ', '.join}).reset_index()
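To make the role of `sort=False` concrete, here is a minimal sketch on a toy dataframe in which one postal code appears twice. The rows are invented for the example; the grouping call is the same as above, run once with `sort=False` and once with the default sorting.

```python
import pandas as pd

# Toy data, made up for illustration; M5A appears twice
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M3A', 'M5A'],
    'Borough': ['Downtown Toronto', 'North York', 'Downtown Toronto'],
    'Neighborhood': ['Harbourfront', 'Parkwoods', 'Regent Park'],
})

# Groups keep their order of first appearance
unsorted_groups = (
    toy.groupby(['PostalCode', 'Borough'], sort=False)
       .agg({'Neighborhood': ', '.join})
       .reset_index()
)

# Default behavior: groups are sorted by the group keys
sorted_groups = (
    toy.groupby(['PostalCode', 'Borough'])
       .agg({'Neighborhood': ', '.join})
       .reset_index()
)

print(unsorted_groups)
print(sorted_groups)
```

In both results the two M5A rows collapse into one row with the neighborhoods joined by a comma and a space; only the row order differs.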

#### Printing the shape of the dataframe

In [8]:
df_postalcodes_canada.shape

(103, 3)

#### Displaying the first 12 rows of the dataframe

In [9]:
df_postalcodes_canada.head(12)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |


#### Saving the dataframe to a CSV file

In [10]:
df_postalcodes_canada.to_csv('postalcodes_canada_part1.csv', index=False)
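Because the joined neighborhood strings contain commas, it is worth confirming that they survive a round trip through CSV. The sketch below does this for a single invented row, writing to an in-memory buffer instead of a file; it also shows that `index=False` keeps the index out of the file, so no `Unnamed: 0` column appears when the data is read back.

```python
import io

import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    'PostalCode': ['M5A'],
    'Borough': ['Downtown Toronto'],
    'Neighborhood': ['Harbourfront, Regent Park'],
})

# Write without the index, as in the cell above
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the CSV text back into a dataframe
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(list(restored.columns))
```

The field containing a comma is written in double quotes, which is how `read_csv` recovers it as a single value.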