# Segmenting and Clustering Neighborhoods in Toronto

**Instructions:** In this assignment, you will explore, segment, and cluster the neighborhoods in the city of Toronto. Unlike for New York, however, the neighborhood data is not readily available on the internet. Part of what makes data science interesting is that each project is challenging in its own way, so you need to stay agile and learn new libraries and tools quickly as each project demands.

For the Toronto neighborhood data, a Wikipedia page contains all the information we need to explore and cluster the neighborhoods in Toronto. You will scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas DataFrame so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we performed on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

Your submission will be a link to your Jupyter Notebook in your GitHub repository.

## Retrieve HTML File from Wikipedia Containing Table of Toronto Postal Codes

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Download the page; the variable holds the HTML text, not a URL
wiki_html: str = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
wiki_soup = BeautifulSoup(wiki_html, 'lxml')
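
The download step above does not check whether the request succeeded. A minimal sketch of a more defensive fetch is shown here; the helper name `fetch_html`, the constant `WIKI_URL`, and the 10-second timeout are illustrative assumptions and not part of the original notebook.

```python
import requests

# Same page the notebook scrapes.
WIKI_URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the body of `url` as text, raising on HTTP errors.

    Hypothetical helper: the timeout value is an arbitrary choice.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

Calling `fetch_html(WIKI_URL)` would return the page HTML, which could then be passed to `BeautifulSoup` in the same way as in the cell above.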

## Create the DataFrame
Table headers and table row data are scraped from the HTML and loaded into a pandas DataFrame. Rows with empty cells are then removed with `dropna()`, and rows whose borough is "Not assigned" are filtered out with `df['Borough'] != 'Not assigned'`.

In [2]:
# Locate the postal code table on the page
table = wiki_soup.find('table', {'class': 'wikitable sortable'})
table_headers = table.find_all('th')

# Column names, with surrounding whitespace (including the trailing newline) removed
parsed_headers = []
for h in table_headers:
    parsed_headers.append(h.text.strip())

# One list of cell values per table row
table_rows = table.find_all('tr')
parsed_rows = []
for r in table_rows:
    table_row_data = r.find_all('td')
    row_data = []
    for d in table_row_data:
        row_data.append(d.text.strip())
    parsed_rows.append(row_data) # The header row has no <td> cells, so it yields an empty list

df = pd.DataFrame(data=parsed_rows, columns=parsed_headers)
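
To try the parsing logic without a network connection, the same steps can be run against a small inline table. This is only a sketch: `SAMPLE_HTML` is a hypothetical miniature of the Wikipedia table, `table_to_dataframe` is an illustrative helper name, and Python's built-in `html.parser` is used in place of `lxml` so that no extra parser is needed.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical two-row miniature of the Wikipedia table, for illustration only.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postal Code
</th><th>Borough
</th><th>Neighborhood
</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""


def table_to_dataframe(html: str) -> pd.DataFrame:
    """Parse the first 'wikitable' in `html` into a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so skip it
            rows.append(cells)
    return pd.DataFrame(rows, columns=headers)


sample_df = table_to_dataframe(SAMPLE_HTML)
print(sample_df)
```

Skipping rows without `<td>` cells at parse time is a design choice in this sketch; the notebook instead keeps the empty header row and removes it later with `dropna()`.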

### Preprocess the DataFrame

In [3]:
df = df.dropna() # Drop empty rows
df = df[df['Borough'] != 'Not assigned'] # Drop not assigned
df = df.reset_index(drop=True) # Renumber the index from 0 and discard the old one
df.head(10)

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |
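
The preprocessing steps can also be checked in isolation on a toy frame. The sketch below assumes a hypothetical four-row input (`raw`) containing one empty row and one "Not assigned" row; it chains the same three operations used in the cell above.

```python
import pandas as pd

# Hypothetical input: one fully empty row, one "Not assigned" row, two valid rows.
raw = pd.DataFrame(
    {
        "Postal Code": [None, "M1A", "M3A", "M4A"],
        "Borough": [None, "Not assigned", "North York", "North York"],
        "Neighborhood": [None, "Not assigned", "Parkwoods", "Victoria Village"],
    }
)

clean = (
    raw.dropna()                                     # drop rows with missing cells
    .loc[lambda d: d["Borough"] != "Not assigned"]   # drop unassigned boroughs
    .reset_index(drop=True)                          # renumber the index from 0
)
print(clean)
```

Only the two North York rows should remain, indexed 0 and 1.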


### The Shape of the DataFrame

In [8]:
rows, cols = df.shape # shape is a (rows, columns) tuple

print(f"The DataFrame has a shape of {rows} rows and {cols} columns.")

The DataFrame has a shape of 103 rows and 3 columns.
