# Capstone Project - Week 3 - Part 1

First we import the necessary packages: 'BeautifulSoup' to parse the scraped HTML, 'requests' to retrieve the page from Wikipedia, 'pandas' to hold our scraped data in a dataframe, and 'numpy' to supply the NaN value we use to mark missing boroughs.

In [155]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

We retrieve the HTML text from the wiki page and parse it into a "soup" object with BeautifulSoup, using the 'lxml' parser.

In [156]:
soup = BeautifulSoup(requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text, 'lxml')
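The one-liner above assumes the request succeeds. A slightly more defensive variant might look like the sketch below; the helper name `fetch_soup` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, parser="lxml", timeout=10):
    """Download a page and parse it, failing loudly on HTTP errors.

    Hypothetical helper: the name, the default parser, and the timeout
    value are assumptions chosen for illustration.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return BeautifulSoup(response.text, parser)
```

Calling `fetch_soup('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')` should then give a "soup" object equivalent to the one built above, or raise an exception instead of silently parsing an error page.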

Looking through the page source shows us that all the text we need is inside 'td' elements. We assign all the 'td' elements to a "table_data" object. Then we create an empty list called "postcodes", and for each of the first 864 entries in "table_data" (the slice 0:864 stops before index 864, which gives 288 rows of three cells each), we take the text inside the 'td' element and add it to the "postcodes" list. I admittedly found the cutoff of 864 entries through brute force.

In [157]:
table_data = soup.find_all('td')
postcodes=[]
for i in table_data[0:864]:
    postcodes.append(i.text)
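The hard-coded cutoff depends on the page never changing. One possible alternative, sketched here on a small inline HTML sample rather than the live page, is to locate the data table first and collect only its cells. The class name "wikitable" and the sample markup are assumptions for illustration, and the built-in 'html.parser' is used so the sketch does not need lxml.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the page: one data table followed by an unrelated table.
# The class name "wikitable" is an assumption about the page markup.
sample_html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
<table class="navbox">
  <tr><td>Unrelated navigation cell</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Restrict the search to the first table with the assumed class,
# so cells from later tables are never collected.
data_table = sample_soup.find("table", class_="wikitable")
sample_cells = [td.text for td in data_table.find_all("td")]
```

Here `sample_cells` holds only the six cells of the first table, so no entry count has to be found by trial and error.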

In our "postcodes" list, the entries repeat in groups of three: a postal code, then a borough, then a neighborhood. We create a dataframe from three slices of the list, each taking every third entry from a different starting offset, so that each column contains the correct data. Depending on the pandas version, the columns may come out in alphabetical order, so we add an extra line of code to make sure the columns are in the order we want.

In [159]:
df = pd.DataFrame({'PostalCode': postcodes[0::3], 'Borough': postcodes[1::3], 'Neighborhood': postcodes[2::3]})
df = df[['PostalCode','Borough','Neighborhood']]
df.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned\n |
| 1 | M2A | Not assigned | Not assigned\n |
| 2 | M3A | North York | Parkwoods\n |
| 3 | M4A | North York | Victoria Village\n |
| 4 | M5A | Downtown Toronto | Harbourfront\n |
| 5 | M5A | Downtown Toronto | Regent Park\n |
| 6 | M6A | North York | Lawrence Heights\n |
| 7 | M6A | North York | Lawrence Manor\n |
| 8 | M7A | Queen's Park | Not assigned\n |
| 9 | M8A | Not assigned | Not assigned\n |
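To see why the three slices line up into columns, here is a minimal sketch on a short hand-written list in the same repeating order. The variable names are illustrative only.

```python
# Flat list in the same repeating order as the scraped cells:
# postal code, borough, neighborhood, postal code, borough, ...
cells = [
    "M1A", "Not assigned", "Not assigned",
    "M3A", "North York", "Parkwoods",
    "M4A", "North York", "Victoria Village",
]

codes = cells[0::3]          # start at index 0, take every 3rd entry
boroughs = cells[1::3]       # start at index 1, take every 3rd entry
neighborhoods = cells[2::3]  # start at index 2, take every 3rd entry

# Zipping the three slices back together recovers the original rows.
rows = list(zip(codes, boroughs, neighborhoods))
```

Each slice has the same length as long as the list length is a multiple of three, which is why the cutoff used when collecting the cells has to be divisible by 3.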


We don't want the trailing '\n' in our Neighborhood column, so we delete it (by replacing it with an empty string). We also change 'Not assigned' boroughs to NaN so that we can then delete the rows containing them.

In [160]:
df.Neighborhood = df.Neighborhood.str.replace('\n','')
df.Borough = df.Borough.replace('Not assigned', np.nan)
df = df.dropna()
df.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 10 | M9A | Etobicoke | Islington Avenue |
| 11 | M1B | Scarborough | Rouge |
| 12 | M1B | Scarborough | Malvern |
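The same cleaning steps can be tried on a three-row toy dataframe. This sketch is a variant rather than a copy of the cell above: it uses `str.rstrip` instead of `str.replace`, and limits `dropna` to the Borough column with `subset`. The name `demo` and the sample rows are made up for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned\n", "Parkwoods\n", "Not assigned\n"],
})

# Remove only the trailing newline from each neighborhood name.
demo["Neighborhood"] = demo["Neighborhood"].str.rstrip("\n")

# Mark unassigned boroughs as missing, then drop those rows.
demo["Borough"] = demo["Borough"].replace("Not assigned", np.nan)
demo = demo.dropna(subset=["Borough"])
```

Only the M1A row is dropped. The M7A row survives because its borough is assigned, even though its neighborhood is not.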


Now let's combine rows that share a postal code (and borough), joining their neighborhoods into one comma-separated string.

In [161]:
df = df.groupby(['PostalCode','Borough'], as_index=False, sort=False).agg(','.join)
df.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront,Regent Park |
| 3 | M6A | North York | Lawrence Heights,Lawrence Manor |
| 4 | M7A | Queen's Park | Not assigned |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge,Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens,Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson,Garden District |
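A minimal sketch of the grouping step on toy data may make the behavior easier to see. The `demo` and `merged` names and the sample rows are assumptions for illustration; the aggregation is written as a dictionary so that only the Neighborhood column is joined.

```python
import pandas as pd

demo = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Group rows sharing a postal code and borough, then join the
# neighborhood names of each group into one comma-separated string.
# sort=False keeps groups in order of first appearance;
# as_index=False keeps the group keys as ordinary columns.
merged = demo.groupby(
    ["PostalCode", "Borough"], as_index=False, sort=False
).agg({"Neighborhood": ",".join})
```

The two M5A rows collapse into a single row whose neighborhood is "Harbourfront,Regent Park", while the M3A row passes through unchanged.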


Finally, if a neighborhood is 'Not assigned', we replace it with its respective borough name.

In [162]:
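# Keep assigned neighborhoods; otherwise take the borough from the same row
df['Neighborhood'] = df['Neighborhood'].where(df['Neighborhood'] != 'Not assigned', df['Borough'])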
df.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront,Regent Park |
| 3 | M6A | North York | Lawrence Heights,Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge,Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens,Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson,Garden District |
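The row-by-row fill can be illustrated on a two-row toy dataframe. This sketch uses `Series.where`, which keeps a value where the condition is true and otherwise takes the value at the same index from another Series; the `demo` name and the sample rows are made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "PostalCode": ["M7A", "M9A"],
    "Borough": ["Queen's Park", "Etobicoke"],
    "Neighborhood": ["Not assigned", "Islington Avenue"],
})

# Keep neighborhoods that are assigned; for the rest, fall back to the
# borough value from the same row (alignment is by index).
is_assigned = demo["Neighborhood"] != "Not assigned"
demo["Neighborhood"] = demo["Neighborhood"].where(is_assigned, demo["Borough"])
```

Assigning the result back to the column avoids relying on `inplace=True` applied to a column accessed as an attribute.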


How many rows and columns does our dataframe have?

In [163]:
df.shape

(103, 3)