# Toronto Neighborhood Segmentation

## Part 1: Toronto's Neighborhoods
In this part we retrieve basic information about the Toronto neighborhoods whose postal codes start with *M* and summarize it in a pandas dataframe.
The data are scraped from the Wikipedia page: [https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

Steps:
- Downloading the Webpage Using Requests Library
- Parsing Webpage HTML Using BeautifulSoup
- Extracting Data and Building DataFrame

In [1]:
#import necessary packages
import pandas as pd 
import requests
from bs4 import BeautifulSoup

In [2]:
#download the Wikipedia page using requests
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
html_data = requests.get(url).text
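
The download above assumes the request succeeds. As a minimal sketch of a more defensive variant, the helper below adds a timeout and raises on HTTP errors. The name `fetch_html` and its `session` parameter are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10.0):
    """Download a page and return its HTML, raising on HTTP errors."""
    # `session` is a hypothetical hook: any object with a requests-style
    # .get() method works, which also allows offline testing.
    http = session or requests
    response = http.get(url, timeout=timeout)
    # Turn 4xx/5xx responses into exceptions instead of parsing an error page.
    response.raise_for_status()
    return response.text


# Example call (needs network access, so it is left commented out):
# html_data = fetch_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
```

Failing early here keeps a broken download from surfacing later as a confusing parsing error.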

In [5]:
#parse the HTML data using BeautifulSoup
soup = BeautifulSoup(html_data,"html5lib")

In [6]:
#get the page title
soup.title

<title>List of postal codes of Canada: M - Wikipedia</title>

Using BeautifulSoup, extract the table with the neighborhood data and store it in a dataframe.

In [46]:
# collect the table rows first, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []

#extract the table, and extract data row by row, column by column
for row in soup.find("tbody").find_all("tr"):
    col = row.find_all("td")
    if col:
        rows.append({
            "PostalCode": col[0].text,
            "Borough": col[1].text,
            "Neighborhood": col[2].text})

# create the dataframe
toronto_neighborhoods = pd.DataFrame(rows, columns=[
    "PostalCode",
    "Borough",
    "Neighborhood"])

toronto_neighborhoods.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A\n | Not assigned\n | Not assigned\n |
| 1 | M2A\n | Not assigned\n | Not assigned\n |
| 2 | M3A\n | North York\n | Parkwoods\n |
| 3 | M4A\n | North York\n | Victoria Village\n |
| 4 | M5A\n | Downtown Toronto\n | Regent Park, Harbourfront\n |
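
The trailing `\n` in every cell comes from reading `.text` on the raw table cells. As a sketch of one way to avoid it at extraction time, BeautifulSoup's `get_text(strip=True)` trims the whitespace while the cell is read. The HTML snippet and the `extract_rows` helper below are hypothetical stand-ins for the Wikipedia table, and the example uses the built-in `html.parser` so that it runs without `html5lib`. Unlike `html5lib`, that parser does not insert a `<tbody>`, so the sketch searches from `<table>` instead.

```python
from bs4 import BeautifulSoup

# Hypothetical excerpt that mimics the layout of the Wikipedia table.
SAMPLE_HTML = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""


def extract_rows(html):
    """Return one (postal code, borough, neighborhood) tuple per data row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find("table").find_all("tr"):
        cells = tr.find_all("td")
        # The header row uses <th>, so it has no <td> cells and is skipped.
        if len(cells) == 3:
            rows.append(tuple(cell.get_text(strip=True) for cell in cells))
    return rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

With the cells stripped this way, the separate step that removes `"\n"` would no longer be needed.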


Clean the dataframe:
- remove "\n"
- remove unassigned postal codes (rows with *Borough* = "Not assigned")
- group neighborhoods that share a postal code into a single row

In [47]:
#remove "\n"
toronto_neighborhoods=toronto_neighborhoods.replace(to_replace=r'\n', value='', regex=True)

In [50]:
#filter out unassigned postal codes
mask = toronto_neighborhoods['Borough']=="Not assigned"
toronto_neighborhoods = toronto_neighborhoods[~mask]
toronto_neighborhoods.reset_index(drop=True,inplace=True)
toronto_neighborhoods.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [52]:
# Check whether there are rows with a duplicated postal code
duplicateDFRow = toronto_neighborhoods[toronto_neighborhoods.duplicated(['PostalCode'])]
print(duplicateDFRow)

Empty DataFrame
Columns: [PostalCode, Borough, Neighborhood]
Index: []
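
The check comes back empty, so the grouping step from the cleaning list is not needed for the page as scraped here. Should the page ever list one neighborhood per row, the rows could be merged along the following lines. This is a sketch on a hypothetical three-row sample, not on the scraped data.

```python
import pandas as pd

# Hypothetical sample in which M5A appears on two rows.
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M5A", "M5A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Regent Park", "Harbourfront"],
})

# Group rows that share a postal code and borough, then join the
# neighborhood names of each group into one comma-separated string.
grouped = (
    sample.groupby(["PostalCode", "Borough"], sort=False, as_index=False)
          .agg({"Neighborhood": ", ".join})
)
print(grouped)
```

Grouping on both `PostalCode` and `Borough` keeps the borough column in the result. It assumes each postal code belongs to exactly one borough.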


In [51]:
#print out the shape of the dataframe
toronto_neighborhoods.shape

(103, 3)
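
The shape confirms the row count but not the content. As an optional sanity check, the sketch below collects a few basic problems that the cleaning steps are meant to rule out. The helper name `check_neighborhood_table` and the postal code pattern (an `M`, a digit, then an uppercase letter) are assumptions made for illustration.

```python
import pandas as pd


def check_neighborhood_table(df):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if df["PostalCode"].duplicated().any():
        problems.append("duplicate postal codes")
    # Assumed format: 'M', one digit, one uppercase letter (e.g. M3A).
    if not df["PostalCode"].str.fullmatch(r"M\d[A-Z]").all():
        problems.append("postal code not in the form M<digit><letter>")
    if (df["Borough"] == "Not assigned").any():
        problems.append("unassigned boroughs remain")
    return problems


# Hypothetical clean sample; on the real data one would pass toronto_neighborhoods.
clean = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})
print(check_neighborhood_table(clean))
```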