# Segmenting and Clustering Neighborhoods in Toronto

Use this notebook to build the code to scrape the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M 

In [105]:
#import BeautifulSoup
from bs4 import BeautifulSoup

#import all needed packages
import pandas as pd
import requests

To create the above dataframe:
- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is not assigned. 
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park. 
- Clean your Notebook and add Markdown cells to explain your work and assumptions you are making. 
- In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
- Submit a link to your notebook on your GitHub repository.

### Extracting the data from the Wikipedia page

In [106]:
#download the page, stop on an HTTP error, then parse the HTML
response = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
response.raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')

In [107]:
#get the data from the table
torontoTable = soup.find('table', {'class':'wikitable sortable'})

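#put data into dataframe df
#wrap the HTML in StringIO: newer pandas versions deprecate passing a literal HTML string
from io import StringIO
df = pd.read_html(StringIO(str(torontoTable)), header=0)[0]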
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
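
The extraction step can be tried offline on a small inline table. The sketch below is only an illustration: `SAMPLE_HTML` is made-up markup that mimics the shape of the Wikipedia table, `table_to_frame` is a hypothetical helper, and it uses Python's built-in `html.parser` instead of `lxml` so it has one fewer dependency.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up sample markup (an assumption) shaped like the Wikipedia table.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
</table>
"""


def table_to_frame(html):
    """Hypothetical helper: turn the first 'wikitable sortable' table into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "wikitable sortable"})
    if table is None:
        # soup.find returns None when nothing matches, so fail with a clear message
        raise ValueError("no table with class 'wikitable sortable' found")
    rows = table.find_all("tr")
    # first row holds the column names, the remaining rows hold the data cells
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = [
        [td.get_text(strip=True) for td in row.find_all("td")]
        for row in rows[1:]
    ]
    return pd.DataFrame(records, columns=header)


sample_df = table_to_frame(SAMPLE_HTML)
print(sample_df)
```

Checking for `None` before reading rows is useful because the class names on the live page can change over time.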


### Ignoring cells whose Borough is Not assigned

In [110]:
#drop rows that have Not assigned value for Borough
df = df[df.Borough != 'Not assigned']
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


### Group together neighborhoods with the same postcode

In [112]:
df = df.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood'].apply(', '.join).reset_index()
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Not assigned
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"
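
To see the grouping in isolation, here is a minimal sketch on a three-row toy dataframe (the `toy` and `grouped` names are illustrative, not part of the notebook). It uses `agg` rather than `apply`, which should behave the same for a plain string join, and keeps `sort=False` so the groups stay in order of first appearance.

```python
import pandas as pd

# Toy rows (illustrative): M5A appears twice, with a different postcode in between.
toy = pd.DataFrame({
    "Postcode": ["M5A", "M3A", "M5A"],
    "Borough": ["Downtown Toronto", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Harbourfront", "Parkwoods", "Regent Park"],
})

# One row per (Postcode, Borough); neighbourhood names are joined with ", ".
grouped = (
    toy.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```

Without `sort=False`, pandas would order the groups by key, so M3A would come before M5A.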


### Replacing a Not assigned neighborhood with the Borough value

In [115]:
#if a cell has a borough but a Not assigned neighborhood
#then the neighborhood will be the same as the borough
df.loc[df['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = df['Borough']
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
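
An alternative way to write the same replacement is `Series.mask`, which makes the condition and the fallback explicit. This is a sketch on a two-row toy dataframe, with `toy` and `missing` as illustrative names.

```python
import pandas as pd

# Toy rows (illustrative): the second row has a borough but no neighbourhood.
toy = pd.DataFrame({
    "Postcode": ["M6A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Lawrence Heights, Lawrence Manor", "Not assigned"],
})

# True where the neighbourhood is missing
missing = toy["Neighbourhood"] == "Not assigned"

# mask() replaces values where the condition is True with the aligned Borough value
toy["Neighbourhood"] = toy["Neighbourhood"].mask(missing, toy["Borough"])
print(toy)
```

Both forms rely on pandas aligning the Borough values by index, so only the flagged rows change.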


### Final Number of rows in the dataframe

In [116]:
df.shape

(103, 3)
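
Beyond the row count, a few quick checks can confirm that the cleaning rules held. The sketch below assumes the column names used in this notebook; `check_clean` is a hypothetical helper and the two toy dataframes are made up for the demonstration.

```python
import pandas as pd


def check_clean(frame):
    """Hypothetical helper: return a list of rule violations (empty when clean)."""
    problems = []
    if (frame["Borough"] == "Not assigned").any():
        problems.append("unassigned borough present")
    if (frame["Neighbourhood"] == "Not assigned").any():
        problems.append("unassigned neighbourhood present")
    if frame["Postcode"].duplicated().any():
        problems.append("duplicate postcode present")
    return problems


# Toy data (illustrative): one clean frame and one that breaks two rules.
clean = pd.DataFrame({
    "Postcode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Queen's Park"],
})
dirty = pd.DataFrame({
    "Postcode": ["M5A", "M5A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto"],
    "Neighbourhood": ["Harbourfront", "Not assigned"],
})

print(check_clean(clean))
print(check_clean(dirty))
```

Calling `check_clean(df)` on the final dataframe should return an empty list if every step above worked as intended.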