# Segmenting and Clustering Neighborhoods in Toronto

## Introduction

In this assignment, we will explore, segment, and cluster the neighborhoods in the city of Toronto based on postal code and borough information.

The Toronto neighborhood data is available on a Wikipedia page that has all the information we need to explore and cluster the neighborhoods. We will scrape the page, wrangle and clean the data, and then load it into a pandas DataFrame so that it is in a structured format.


### Install BeautifulSoup

In [1]:
!pip install bs4

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1273 sha256=7e6ca8c592da5f9d3fb326f419c94c00066081d454addc124ef23e97a650d5ea
  Stored in directory: /Users/charleseubanks/Library/Caches/pip/wheels/75/78/21/68b124549c9bdc94f822c02fb9aa3578a669843f9767776bca
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1


### Import Necessary Libraries

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

### Scrape data from the Wikipedia page

In [43]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_data = requests.get(url).text
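
The request above assumes the page always loads successfully. A minimal sketch of a more defensive fetch is shown below; the helper name `fetch_html`, its `session` parameter, and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` and its parameters are hypothetical names used only
    for this sketch. `session` can be a requests.Session (or any object
    with a compatible `get` method); by default the requests module
    itself is used.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently parsing an error page.
    response.raise_for_status()
    return response.text
```

With this helper, the cell above could be written as `html_data = fetch_html(url)`.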


### Use BeautifulSoup to parse the data gathered from the Wikipedia page

In [44]:
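data = []
columns = []

# Parse the page; the 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(html_data, 'lxml')

# Locate the first element with the 'wikitable' class
table = soup.find(class_='wikitable')

for index, tr in enumerate(table.find_all('tr')):
    # Collect the stripped text of every header or data cell in this row
    section = [cell.text.strip() for cell in tr.find_all(['th', 'td'])]

    if index == 0:
        columns = section  # the first row holds the column names
    else:
        data.append(section)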
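
To see what the loop produces without hitting the network, the same row-by-row extraction can be tried on a small inline table. This is a sketch under stated assumptions: `sample_html` is made-up data shaped like the output shown in this notebook, `parse_wikitable` is a hypothetical helper name, and the built-in `'html.parser'` is used instead of `'lxml'` so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# Made-up sample with the same three columns as the scraped table
sample_html = """
<table class="wikitable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


def parse_wikitable(html):
    """Return (columns, rows) from the first 'wikitable' in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    if table is None:
        raise ValueError("no table with class 'wikitable' found")
    # One list of stripped cell texts per table row
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    # The first row is the header; the rest are data rows
    return rows[0], rows[1:]


columns, data = parse_wikitable(sample_html)
print(columns)
print(data)
```

Checking for a missing table up front gives a clearer error than the `AttributeError` that `table.find_all` would raise if the page layout changed.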

### Convert data to a pandas DataFrame and display the first 5 rows

In [45]:
df_canada = pd.DataFrame(data=data, columns=columns)
df_canada.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


### Remove rows where Borough is 'Not assigned' and display the first 5 rows

In [33]:
# Remove rows whose Borough is 'Not assigned'
df_canada = df_canada[df_canada['Borough'] != 'Not assigned']
df_canada.head()


|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
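
Note that the filtered frame keeps its original row labels, which is why the output starts at index 2. The sketch below reproduces the filter on a small made-up frame and, as an optional extra step that the original notebook does not perform, renumbers the rows with `reset_index(drop=True)`. The names `toy` and `assigned` are illustrative.

```python
import pandas as pd

# Made-up rows shaped like the scraped table
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Boolean mask: True for rows whose Borough has been assigned
mask = toy["Borough"] != "Not assigned"

# Keep matching rows, then renumber them from 0 (optional)
assigned = toy[mask].reset_index(drop=True)
print(assigned)
```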


### Display the shape of the data using the .shape attribute

In [42]:
df_canada.shape

(103, 3)