# Segmenting and Clustering Neighborhoods in Toronto

### 1. Introduction

In this notebook, I will create a pandas dataframe by scraping the Wikipedia table of Canadian postal codes that start with the letter M. The dataframe will have three columns: PostalCode, Borough, and Neighborhood.

Next, I will clean the data by removing rows where the borough is not assigned.

Assumption: a postal code that covers multiple neighborhoods is listed as a single row, with the neighborhoods separated by commas.
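
To make the assumed layout concrete, here is a minimal sketch of how rows could be collapsed into that comma-separated form if a source listed one row per neighborhood instead. The `raw` dataframe is a hypothetical sample, not the scraped data, and the scraped table may not need this step at all.

```python
import pandas as pd

# Hypothetical sample in "one row per neighborhood" form (not the scraped data)
raw = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# Collapse rows that share a postal code and borough into a single row,
# joining the neighborhood names with ', '
combined = (
    raw.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
       .agg(', '.join)
       .reset_index()
)
print(combined)
```

With `sort=False`, the groups keep the order in which each postal code first appears.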

### 2. Importing the data

First, I will import the library required to scrape the data and create a dataframe.

In [1]:
# Import required library
import pandas as pd

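Scrape the data into a pandas dataframe. `pd.read_html` returns a list with one dataframe per HTML table found on the page, so `[0]` selects the first table. Then check the first 5 rows of the new dataframe to confirm it was scraped properly.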

In [2]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
df.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


Check the size of the dataframe.

In [3]:
print('There are {} rows and {} columns in the dataframe'.format(df.shape[0], df.shape[1]))

There are 180 rows and 3 columns in the dataframe


### 3. Cleaning the data

Now I will clean the data: first rename the Postal Code column, then drop the rows that have no assigned borough.

In [4]:
# Rename Postal Code column to be one word for simplicity
df.rename({'Postal Code': 'PostalCode'}, axis=1, inplace=True)

# Make sure it worked
df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


In [5]:
# Remove rows with a borough that is Not assigned.
df = df[df['Borough'] != 'Not assigned']
df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
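
The filtered dataframe keeps its original row labels (2, 3, 4, ...). As an alternative, the rename and filter steps could be chained and the index renumbered from 0. This is only a sketch on a hypothetical `sample` dataframe standing in for the scraped table; `reset_index(drop=True)` is an optional extra and not a step this notebook performs.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table (a few sample rows only)
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

cleaned = (
    sample.rename(columns={'Postal Code': 'PostalCode'})  # one-word column name
          .query("Borough != 'Not assigned'")             # drop unassigned boroughs
          .reset_index(drop=True)                         # renumber rows from 0
)
print(cleaned)
```

Chaining returns a new dataframe at each step, so the original `sample` is left unchanged.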


In [6]:
# Check the final size of the dataframe
df.shape

(103, 3)
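
Filtering on Borough does not by itself guarantee that every remaining Neighborhood value is assigned. A minimal sketch of how that could be checked follows. The rows, including the postal code `M9Z`, are hypothetical, and the fallback of reusing the borough name is an assumption rather than a step from this notebook.

```python
import pandas as pd

# Hypothetical rows; the last one has a borough but no neighborhood
sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Etobicoke'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Not assigned'],
})

# Boolean mask of rows whose neighborhood is still unassigned
unassigned = sample['Neighborhood'] == 'Not assigned'
print('Unassigned neighborhoods:', unassigned.sum())

# One possible fallback (an assumption): reuse the borough name
sample.loc[unassigned, 'Neighborhood'] = sample.loc[unassigned, 'Borough']
print(sample)
```

If the count printed by the check is 0 on the real data, no fallback is needed.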