# Segmenting and Clustering Neighborhoods in Toronto
Alex P. Blizzard
## Problem 1

For this assignment, you will explore and cluster the neighborhoods in Toronto.

1. Start by creating a new Notebook for this assignment.

2. In the Notebook, write code that scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extracts the table of postal codes, and transforms it into a pandas dataframe like the one shown below:

### Import Libraries

In [1]:
!pip install BeautifulSoup4
!pip install requests

from bs4 import BeautifulSoup
import requests   # library to handle requests
import numpy as np
import pandas as pd



### Scraping Website for Data

In [2]:
#Scrape website
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'html.parser')
#The postal code table is the first table with the "wikitable" class
table = soup.find('table', class_='wikitable')
#print(table.prettify())

#Scrape website into list
data = []
columns = []
for index, tr in enumerate(table.find_all('tr')):
    section = []
    for td in tr.find_all(['th','td']):
        section.append(td.text.rstrip())
    
    #First row of data is the header
    if (index == 0):
        columns = section
    else:
        data.append(section)

#Convert list into pandas DataFrame
canada_df = pd.DataFrame(data=data, columns=columns)
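
The row-parsing loop can be tried offline on a small table before pointing it at the live page. The sketch below assumes a hypothetical HTML snippet (`sample_html`) standing in for the Wikipedia table; the helper name `table_to_dataframe` is illustrative and not part of the notebook. It follows the same idea as the cell above: the first row supplies the column names and every later row is data.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the Wikipedia table (not real scraped data)
sample_html = """
<table class="wikitable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>X1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>X2A</td><td>Example Borough</td><td>Example Hill</td></tr>
</table>
"""

def table_to_dataframe(html):
    """Parse the first 'wikitable' in html; the first row becomes the header."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', class_='wikitable')
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        for tr in table.find_all('tr')
    ]
    return pd.DataFrame(rows[1:], columns=rows[0])

sample_df = table_to_dataframe(sample_html)
print(sample_df)
```

Wrapping the parsing in a function keeps the network request separate from the parsing logic, so the latter can be checked without depending on the current state of the page.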

### Processing Data

In [8]:
#Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned
#.copy() makes an independent frame, so the column assignments below do not trigger SettingWithCopyWarning
canada_df = canada_df[canada_df['Borough'] != 'Not assigned'].copy()

#More than one neighborhood can exist in one postal code area,
#so join all neighborhoods that share a postal code into one comma-separated string
canada_df['Neighborhood'] = (
    canada_df.groupby('Postal Code')['Neighborhood'].transform(lambda neigh: ', '.join(neigh))
)
canada_df = canada_df.drop_duplicates()
#Guard so that the cell can be re-run after the index has already been set
if canada_df.index.name != 'Postal Code':
    canada_df = canada_df.set_index('Postal Code')

#If a cell has a borough but a Not assigned neighborhood,
#then the neighborhood will be the same as the borough
canada_df['Neighborhood'] = canada_df['Neighborhood'].mask(
    canada_df['Neighborhood'] == 'Not assigned', canada_df['Borough']
)
canada_df.head()

| Postal Code | Borough | Neighborhood |
|---|---|---|
| M3A | North York | Parkwoods |
| M4A | North York | Victoria Village |
| M5A | Downtown Toronto | Regent Park, Harbourfront |
| M6A | North York | Lawrence Manor, Lawrence Heights |
| M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
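
The three cleaning rules are easier to follow on a frame small enough to read at a glance. The sketch below uses hypothetical rows (`toy_df`, with made-up postal codes and names) chosen to exercise each rule once. As a variation on the notebook's `transform` plus `drop_duplicates`, it collapses rows with `groupby(...).agg`, which should give the same result under the assumption that each postal code belongs to a single borough.

```python
import pandas as pd

# Hypothetical rows, chosen to exercise each rule (not real postal codes)
toy_df = pd.DataFrame(
    {
        'Postal Code': ['X1A', 'X2A', 'X2A', 'X3A'],
        'Borough': ['Not assigned', 'East', 'East', 'West'],
        'Neighborhood': ['Not assigned', 'Alpha', 'Beta', 'Not assigned'],
    }
)

# Rule 1: drop rows without an assigned borough
toy_df = toy_df[toy_df['Borough'] != 'Not assigned'].copy()

# Rule 2: one row per postal code, neighborhoods joined with ", "
toy_df = (
    toy_df.groupby(['Postal Code', 'Borough'], sort=False)['Neighborhood']
    .agg(', '.join)
    .reset_index()
    .set_index('Postal Code')
)

# Rule 3: an unassigned neighborhood takes the borough's name
toy_df['Neighborhood'] = toy_df['Neighborhood'].mask(
    toy_df['Neighborhood'] == 'Not assigned', toy_df['Borough']
)
print(toy_df)
```

Here `X1A` is dropped, the two `X2A` rows merge into `Alpha, Beta`, and `X3A` receives its borough name `West` as its neighborhood.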


In [9]:
canada_df.shape

(98, 2)
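
The shape alone does not confirm that the cleaning rules held. A minimal sketch of a few extra checks follows; the function name `check_cleaned` and the example frame are hypothetical, and the checks assume the frame is indexed by postal code with `Borough` and `Neighborhood` columns, as above.

```python
import pandas as pd

def check_cleaned(df):
    """Return a list of problems found in a cleaned postal-code frame."""
    problems = []
    if not df.index.is_unique:
        problems.append('duplicate postal codes in the index')
    if (df['Borough'] == 'Not assigned').any():
        problems.append('rows with an unassigned borough')
    if (df['Neighborhood'] == 'Not assigned').any():
        problems.append('rows with an unassigned neighborhood')
    return problems

# Hypothetical frame that deliberately violates the last rule
example_df = pd.DataFrame(
    {'Borough': ['East', 'West'], 'Neighborhood': ['Alpha, Beta', 'Not assigned']},
    index=pd.Index(['X2A', 'X3A'], name='Postal Code'),
)
print(check_cleaned(example_df))
```

If the processing cell worked as intended, calling `check_cleaned(canada_df)` in the notebook would be expected to return an empty list.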