# Segmenting and Clustering Neighborhoods in Toronto

This notebook is part of an assignment for the _Applied Data Science Capstone_ course on [Coursera](https://www.coursera.org).

## 1. Scraping the data

### Import libraries

In [1]:
# for HTTP requests
import requests  

# for HTML scraping
from bs4 import BeautifulSoup 

# for table analysis
import pandas as pd

### URL for Wikipedia article

In [2]:
# URL of the Wikipedia page from which to scrape tabular data.
wiki_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

### Request & Response

In [3]:
# If the request was successful, the response status code should be 200.
response = requests.get(wiki_url)
response

<Response [200]>
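Rather than checking the printed status by eye, the request can be made to fail loudly on an error status. The following is a minimal sketch, not part of the original notebook: `fetch_page` is a hypothetical helper name and the 10-second timeout is an assumed value. The offline part builds a `Response` object by hand only to show what `raise_for_status()` does on a 404.

```python
import requests


def fetch_page(url, timeout=10):
    """Hypothetical helper: GET a URL and raise on any 4xx/5xx status."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on an error status
    return response


# Offline illustration with a hand-built Response (no network access needed).
fake = requests.Response()
fake.status_code = 404
try:
    fake.raise_for_status()
    failed = False
except requests.HTTPError:
    failed = True

print(failed)
```

In the notebook itself, `response = fetch_page(wiki_url)` would take the place of the bare `requests.get` call.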

### Wrangling HTML With BeautifulSoup

In [4]:
# Parse the response content as HTML
soup = BeautifulSoup(response.content, 'html.parser')
#soup

### Viewing HTML content

In [5]:
# Title of Wikipedia page
soup.title.string

'List of postal codes of Canada: M - Wikipedia'

In [6]:
# Find all the tables in the HTML
all_tables=soup.find_all('table')

In [7]:
# Find the right table to scrape
right_table=soup.find('table', {"class":'wikitable sortable'})
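One way to decide which table is the right one is to list the `class` attribute of every table on the page. The sketch below runs on a small made-up HTML string (the `navbox` table is an assumption for illustration, not taken from the real page), so it works without a network connection.

```python
from bs4 import BeautifulSoup

# Toy HTML with two tables; only the second one resembles the target table.
html = """
<table class="navbox"><tr><td>navigation</td></tr></table>
<table class="wikitable sortable"><tr><th>Postal Code</th></tr></table>
"""
toy_soup = BeautifulSoup(html, "html.parser")

# BeautifulSoup returns a multi-valued attribute such as class as a list.
table_classes = [t.get("class") for t in toy_soup.find_all("table")]
print(table_classes)

# Select the table whose class list contains 'wikitable'.
target = toy_soup.find("table", class_="wikitable")
print(target.find("th").get_text(strip=True))
```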

In [8]:
# Get the first row of the table, i.e. the header
row0 = right_table.find_all("tr")[0]

# Show the column names
header = [th.text.rstrip() for th in row0.find_all('th')]
header

['Postal Code', 'Borough', 'Neighbourhood']

### Scraping the table contents

In [9]:
# Scrape the data and append each cell to its column list
c0 = []  # postal codes
c1 = []  # boroughs
c2 = []  # neighbourhoods

# Iterate through the rows of the table
for row in right_table.find_all("tr"):
    cells = row.find_all('td')
    # Only extract rows whose borough is assigned
    if len(cells) == 3 and 'Not assigned' not in cells[1].find(string=True):
        c0.append(cells[0].find(string=True).replace('\n', ''))
        c1.append(cells[1].find(string=True).replace('\n', ''))
        c2.append(cells[2].find(string=True).replace('\n', ''))
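The row filter above can be tried on a small inline table to see which rows survive. This is a sketch on toy data, and it collects whole rows with `get_text(strip=True)` instead of filling three separate lists, which is a variation on the cell above rather than the same code.

```python
from bs4 import BeautifulSoup

# Toy table: one header row, one unassigned row, one assigned row.
html = """
<table class="wikitable sortable">
<tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table", class_="wikitable")

rows = []
for tr in table.find_all("tr"):
    # The header row has only <th> cells, so it yields an empty list here.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == 3 and cells[1] != "Not assigned":
        rows.append(cells)

print(rows)
```

The `len(cells) == 3` check skips the header row, and the borough check drops the unassigned postal code.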

In [10]:
# Create a dictionary keyed by column name, with placeholder values
d = dict.fromkeys(header, 0)
d

{'Postal Code': 0, 'Borough': 0, 'Neighbourhood': 0}

In [11]:
# Replace each placeholder with the corresponding column list.
d['Postal Code'] = c0
d['Borough'] = c1
d['Neighbourhood'] = c2
#d
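The two cells above can also be collapsed into a single step by pairing the header names with the column lists. A minimal sketch, using two toy rows in place of the full scraped lists:

```python
# Toy stand-ins for the header and the scraped column lists.
header = ["Postal Code", "Borough", "Neighbourhood"]
c0 = ["M3A", "M4A"]
c1 = ["North York", "North York"]
c2 = ["Parkwoods", "Victoria Village"]

# zip pairs each column name with its list, in header order.
d = dict(zip(header, [c0, c1, c2]))
print(d["Borough"])
```

This relies on the lists being passed in the same order as the names in `header`.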

### Creating dataframe

In [12]:
# Convert the dict to a DataFrame
df_table = pd.DataFrame(d)

# Size of dataframe
print(f'Shape: {df_table.shape}')

# Top 5 records
df_table.head(5)

Shape: (103, 3)


|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
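A few quick checks can confirm that the filtering worked as intended. The sketch below is not part of the original notebook and runs only on the five rows displayed above, not on the full 103-row table. The variable names are illustrative.

```python
import pandas as pd

# The five rows shown in the output above.
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A", "M6A", "M7A"],
    "Borough": ["North York", "North York", "Downtown Toronto",
                "North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Victoria Village",
                      "Regent Park, Harbourfront",
                      "Lawrence Manor, Lawrence Heights",
                      "Queen's Park, Ontario Provincial Government"],
})

# No unassigned boroughs should remain after the filter.
has_unassigned = bool((sample["Borough"] == "Not assigned").any())

# Each postal code should appear exactly once.
codes_unique = sample["Postal Code"].is_unique

# Number of postal codes per borough.
per_borough = sample["Borough"].value_counts()
print(has_unassigned, codes_unique)
print(per_borough)
```

The same three checks apply unchanged to `df_table`.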



## 3. Clustering the data