# Neighborhoods in Toronto

### Task: Scrape postal code data from a Wikipedia page, wrangle and clean it, and read it into a pandas dataframe

**Packages: BeautifulSoup, pandas, lxml, requests**

In [1]:
# Import the libraries used below; uncomment the last line to install BeautifulSoup
import sys
import requests
import pandas as pd
#!conda install --yes --prefix {sys.prefix} beautifulsoup4

In [2]:
# Install the lxml parser (uncomment the last line if it is not already installed)
import sys
#!conda install --yes --prefix {sys.prefix} lxml

Connect to the Wikipedia page by specifying its URL, then parse the downloaded HTML with the *BeautifulSoup* constructor.

In [3]:
from bs4 import BeautifulSoup

# Keep the address and the downloaded HTML in separate, accurately named variables
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
# Output too long, hence not displayed

Review the HTML code above and look for the data we need. The data is in a table. Since the webpage has only two tables, it is quicker to parse all table objects and select the first one.

In [4]:
table = soup.find_all('table')[0]
# Output too long hence not displayed 
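Selecting a table by position works as long as the page layout does not change. The sketch below shows the same selection on a small, made-up HTML snippet and, as an alternative, a lookup by CSS class. The sample HTML, the `wikitable` class name, and the use of Python's built-in `html.parser` are assumptions for illustration; check the real page's markup before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: two tables, like the real one.
sample_html = """
<html><body>
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th></tr>
  <tr><td>M3A</td><td>North York</td></tr>
</table>
<table class="navbox"><tr><td>Navigation</td></tr></table>
</body></html>
"""

# "html.parser" ships with Python; the notebook itself uses "lxml".
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Positional selection, as in the cell above.
tables = sample_soup.find_all("table")
first_by_position = tables[0]

# Selection by class, which does not depend on table order.
first_by_class = sample_soup.find("table", class_="wikitable")

headers = [th.get_text() for th in first_by_class.find_all("th")]
print(len(tables))
print(headers)
```

Both approaches return the same element here; the class-based lookup simply states the intent more explicitly.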

Convert the HTML table to a dataframe using the pandas library. `pd.read_html` returns a list of dataframes, so we take the first element.

In [5]:
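from io import StringIO

# Wrapping the HTML in StringIO avoids the literal-string deprecation warning in newer pandas versions
df = pd.read_html(StringIO(str(table)))[0]
type(df[0:5])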

pandas.core.frame.DataFrame

In [6]:
df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


Remove the rows whose Borough is 'Not assigned'.

In [7]:
df = df[df.Borough != 'Not assigned']
df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M6A | North York | Lawrence Heights |
| 6 | M6A | North York | Lawrence Manor |
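The filter above is ordinary boolean indexing: the comparison produces a True/False mask, and indexing with that mask keeps only the True rows. A minimal sketch on a made-up three-row frame (the `toy` data is an assumption, not the scraped table):

```python
import pandas as pd

# Hypothetical miniature of the scraped table.
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# One boolean per row: True where the borough is assigned.
mask = toy["Borough"] != "Not assigned"

# Keep only the rows where the mask is True.
assigned = toy[mask]

print(mask.tolist())
print(assigned["Postcode"].tolist())
print(assigned.index.tolist())
```

Note that the surviving rows keep their original index labels, which is why the filtered dataframe above starts at index 2.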


In [8]:
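# reset_index returns a new dataframe, so assign the result back; drop=True discards the old index
df = df.reset_index(drop=True)
df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights |
| 4 | M6A | North York | Lawrence Manor |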
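A point worth keeping in mind for this step: `reset_index` returns a new dataframe and leaves the original untouched unless the result is assigned back. The sketch below demonstrates this on a made-up two-row frame (the `filtered` data is an assumption for illustration).

```python
import pandas as pd

# Hypothetical frame whose index has gaps left by a filter.
filtered = pd.DataFrame(
    {"Postcode": ["M3A", "M4A"], "Borough": ["North York", "North York"]},
    index=[2, 3],
)

# Calling reset_index without assignment discards the result.
filtered.reset_index(drop=True)
print(filtered.index.tolist())

# Assigning the result gives a frame numbered from zero;
# drop=True avoids adding the old index as a column.
renumbered = filtered.reset_index(drop=True)
print(renumbered.index.tolist())
print(renumbered.columns.tolist())
```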


Multiple neighbourhoods can be assigned to one postal code. Neighbourhoods that share a postal code should be combined into a single row, separated by commas. The dataframe is therefore grouped by *Postcode* and *Borough*, and *Neighbourhood* is aggregated by joining the names with a comma.

In [9]:
df_grouped = df.groupby(['Postcode', 'Borough'], as_index=False).agg({'Neighbourhood': ','.join})
df_grouped.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
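To see what the aggregation does row by row, here is a minimal sketch on a made-up three-row frame (the `toy` data is an assumption). `','.join` is called once per group with that group's neighbourhood strings, and `as_index=False` keeps the grouping keys as ordinary columns.

```python
import pandas as pd

# Hypothetical rows: M6A appears twice, M3A once.
toy = pd.DataFrame({
    "Postcode": ["M6A", "M6A", "M3A"],
    "Borough": ["North York", "North York", "North York"],
    "Neighbourhood": ["Lawrence Heights", "Lawrence Manor", "Parkwoods"],
})

# Group on both keys and join each group's neighbourhoods with a comma.
toy_grouped = (
    toy.groupby(["Postcode", "Borough"], as_index=False)
       .agg({"Neighbourhood": ",".join})
)

print(toy_grouped["Postcode"].tolist())
print(toy_grouped["Neighbourhood"].tolist())
```

The result is sorted by the grouping keys, so M3A comes before M6A even though it appeared last in the input.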


In [10]:
# update the column name to match table in assignment page
df_grouped.columns = ['Postalcode', 'Borough','Neighbourhood']
df_grouped.head()

|   | Postalcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
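Assigning a full list to `columns` relies on the column order being what you expect. As an alternative sketch, `DataFrame.rename` maps old names to new ones explicitly and leaves other columns alone; the one-row `toy_grouped` frame below is a made-up stand-in.

```python
import pandas as pd

# Hypothetical one-row stand-in for the grouped dataframe.
toy_grouped = pd.DataFrame({
    "Postcode": ["M1B"],
    "Borough": ["Scarborough"],
    "Neighbourhood": ["Rouge,Malvern"],
})

# rename returns a new dataframe; only the listed column changes.
renamed = toy_grouped.rename(columns={"Postcode": "Postalcode"})

print(renamed.columns.tolist())
print(toy_grouped.columns.tolist())
```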


In [11]:
df_grouped.shape

(103, 3)
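Beyond checking the shape, a few quick sanity checks can confirm that the grouping did what was intended: each postal code should appear once, and no unassigned borough should remain. The sketch below runs these checks on a made-up two-row frame; the data and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical two-row stand-in for the final dataframe.
toy_grouped = pd.DataFrame({
    "Postalcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge,Malvern", "Highland Creek,Rouge Hill,Port Union"],
})

n_rows, n_cols = toy_grouped.shape

# After grouping, every postal code should occur exactly once.
codes_unique = toy_grouped["Postalcode"].is_unique

# No 'Not assigned' boroughs should have survived the filter.
none_unassigned = not (toy_grouped["Borough"] == "Not assigned").any()

print(n_rows, n_cols, codes_unique, none_unassigned)
```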