# Scrape Wikipedia Page
This notebook scrapes the Wikipedia page that lists Toronto's postal codes. The outcome is a dataframe holding the postal code, borough, and neighborhood of each postal code area.

### Install BeautifulSoup and Import Libraries

In [16]:
# Install beautifulsoup4
#!conda install -c anaconda beautifulsoup4 --yes

# Import libraries
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

### Download and Load Data

In [17]:
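!wget -O 'toronto page.html' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
# Name the parser explicitly so the result does not depend on which parsers are installed
with open("toronto page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "lxml")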

--2018-12-14 22:46:16--  https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
Resolving en.wikipedia.org (en.wikipedia.org)... 91.198.174.192, 2620:0:862:ed1a::1
Connecting to en.wikipedia.org (en.wikipedia.org)|91.198.174.192|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 80280 (78K) [text/html]
Saving to: ‘toronto page.html’


2018-12-14 22:46:17 (304 KB/s) - ‘toronto page.html’ saved [80280/80280]






  


### Scrape Postal Code Table

In [18]:
# Locate the postal code table and collect its rows
table = soup.find('table', {'class': 'wikitable sortable'})
table_rows = table.find_all('tr')

res = []
for tr in table_rows:
    cells = tr.find_all('td')  # The header row has only <th> cells, so it yields no data
    row = [cell.text.strip() for cell in cells if cell.text.strip()]  # Drop empty cells
    if row:
        res.append(row)

df = pd.DataFrame(res, columns=["PostalCode", "Borough", "Neighborhood"])
df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
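
The row-extraction loop can be tried offline on a small inline table. This is a minimal sketch rather than the notebook's exact code: the HTML snippet and the `rows_from_table` helper are hypothetical, and it assumes Python's built-in `html.parser` is good enough for such simple markup. Unlike the loop above, it keeps empty cells so that a blank value cannot shift the columns.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the Wikipedia table, small enough to read at a glance
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td> Not assigned </td></tr>
  <tr><td>M3A</td><td>North York</td><td> Parkwoods </td></tr>
</table>
"""


def rows_from_table(markup):
    """Return one list of cell strings per data row of the first matching table."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table", {"class": "wikitable sortable"})
    rows = []
    for tr in table.find_all("tr"):
        # get_text(strip=True) trims the whitespace around each cell's text
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has only <th> cells, so its list is empty
            rows.append(cells)
    return rows


sample_df = pd.DataFrame(
    rows_from_table(SAMPLE_HTML),
    columns=["PostalCode", "Borough", "Neighborhood"],
)
print(sample_df)
```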


### Clean Dataframe

**Clean 1** - Keep only the rows that have an assigned borough. Drop every row whose borough is `Not assigned`.

In [19]:
flt = df['Borough']!='Not assigned' # Make a filter indicating if a borough is assigned
df_clean1 = df[flt] # Apply the filter
df_clean1.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
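
The filtering step is ordinary boolean-mask indexing. The sketch below repeats it on a toy dataframe whose values are illustrative, not scraped. It also shows that the surviving rows keep their original index labels, which is why the output above starts at index 2.

```python
import pandas as pd

# Toy rows in the same shape as the scraped dataframe (values are illustrative)
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# True where a borough is assigned; indexing with the mask keeps only those rows
assigned = toy["Borough"] != "Not assigned"
toy_clean1 = toy[assigned].copy()  # .copy() makes the result independent of toy

print(toy_clean1)
```

Calling `reset_index(drop=True)` on the result would renumber the rows from 0 if the original labels are not needed.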


**Clean 2** - A postal code area can contain more than one neighborhood. Combine those neighborhoods into a single row, separated by commas.

In [20]:
# Combine the neighborhoods that share a postal code and borough into one comma-separated string
series_clean2 = df_clean1.groupby(['PostalCode', 'Borough'])['Neighborhood'].agg(', '.join)
df_clean2 = series_clean2.reset_index()  # Turn the grouped series back into a dataframe
df_clean2.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
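
To see the grouping in isolation, the following sketch joins neighborhoods on a three-row toy dataframe. The data is illustrative, and the sketch uses `agg` with `', '.join`, which is one way to express the combination.

```python
import pandas as pd

# Two rows share postal code M5A, so they should collapse into one
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

combined = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)   # join each group's neighborhoods in their original order
    .reset_index()    # move the group keys back into ordinary columns
)
print(combined)
```

Grouping on both `PostalCode` and `Borough` keeps the borough column in the result without a separate merge.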


**Clean 3** - If a row has a borough but its neighborhood is `Not assigned`, use the borough as the neighborhood.

In [21]:
# For Neighborhood = Not assigned, make Neighborhood = Borough, otherwise keep Neighborhood
df_clean3 = df_clean2.copy()
df_clean3['Neighborhood'] = np.where(df_clean3['Neighborhood'] == 'Not assigned', df_clean3['Borough'], df_clean3['Neighborhood'])

In [22]:
# Compare change before and after
flt2 = df_clean2['Neighborhood'] == 'Not assigned'
print("Before clean3: \n",df_clean2[flt2].to_string(index=False),"\n \nAfter clean3: \n",df_clean3[flt2].to_string(index=False))

Before clean3: 
 PostalCode       Borough  Neighborhood
      M7A  Queen's Park  Not assigned 
 
After clean3: 
 PostalCode       Borough  Neighborhood
      M7A  Queen's Park  Queen's Park
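
The same replacement can be checked on a toy dataframe. This sketch uses illustrative values and confirms that `np.where` fills only the `Not assigned` neighborhoods while leaving the source dataframe untouched.

```python
import numpy as np
import pandas as pd

# Illustrative rows: one neighborhood is missing, one is already set
toy = pd.DataFrame({
    "PostalCode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

filled = toy.copy()  # work on a copy so the before/after comparison stays valid
filled["Neighborhood"] = np.where(
    filled["Neighborhood"] == "Not assigned",  # condition, evaluated per row
    filled["Borough"],                         # value used where the condition holds
    filled["Neighborhood"],                    # value used everywhere else
)
print(filled)
```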


### Save the cleaned data

In [25]:
# Save the cleaned data as csv
df_clean3.to_csv('neigh.csv',index=False)
# Show head of the final neighborhood dataframe
df_clean3.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
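
The `index=False` argument matters when the file is read back. The sketch below round-trips a one-row toy dataframe through in-memory buffers (an assumption made to avoid touching the disk) and shows that, without `index=False`, the saved index comes back as an extra column that pandas labels `Unnamed: 0`.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "PostalCode": ["M1B"],
    "Borough": ["Scarborough"],
    "Neighborhood": ["Rouge, Malvern"],  # contains a comma, so to_csv quotes it
})

# Save without the index, then read the text back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Save with the default index=True for comparison
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
restored_with_index = pd.read_csv(with_index)

print(list(restored.columns))
print(list(restored_with_index.columns))
```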


### Check the Shape of the Neighborhood Dataframe

In [26]:
df_clean3.shape

(103, 3)
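
Beyond the shape, a few cheap checks can confirm that the three cleaning steps held. The helper below is a hypothetical sketch, not part of the original notebook. It assumes the column names used above and is demonstrated on toy data.

```python
import pandas as pd


def check_neighborhoods(df):
    """Return a list of problems found in a cleaned neighborhood dataframe."""
    problems = []
    if df["PostalCode"].duplicated().any():
        problems.append("duplicate postal codes")     # Clean 2 should prevent this
    if (df["Borough"] == "Not assigned").any():
        problems.append("unassigned borough")         # Clean 1 should prevent this
    if (df["Neighborhood"] == "Not assigned").any():
        problems.append("unassigned neighborhood")    # Clean 3 should prevent this
    return problems


# Illustrative data: one clean frame and one that violates every rule
good = pd.DataFrame({
    "PostalCode": ["M1B", "M7A"],
    "Borough": ["Scarborough", "Queen's Park"],
    "Neighborhood": ["Rouge, Malvern", "Queen's Park"],
})
bad = pd.DataFrame({
    "PostalCode": ["M5A", "M5A"],
    "Borough": ["Downtown Toronto", "Not assigned"],
    "Neighborhood": ["Harbourfront", "Not assigned"],
})

print(check_neighborhoods(good))
print(check_neighborhoods(bad))
```

Running `check_neighborhoods(df_clean3)` in the notebook should return an empty list if the cleaning worked as described.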