## Download and Clean Up Wiki-page Table

### Requirements

1. Use the Notebook to build the code that scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extracts the data in the postal codes table, and transforms it into a pandas dataframe.

2. To create the dataframe:
    - The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
    - Only process the cells that have an assigned borough. 
    - Ignore cells with a borough that is Not assigned.
    - More than one neighborhood can exist in one postal code area.
        - For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated by a comma, as shown in row 11 in the above table.
    - If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
    - Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
    - In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
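
The requirements above can also be expressed as automated checks. The following is a minimal sketch, not part of the original notebook: the helper name `check_requirements` is hypothetical, and it assumes the column is labeled `Postal Code` (as the Wikipedia table labels it) rather than `PostalCode`.

```python
import pandas as pd

EXPECTED_COLUMNS = ["Postal Code", "Borough", "Neighborhood"]  # assumed labels


def check_requirements(frame: pd.DataFrame) -> list:
    """Return a list of requirement violations (empty if the frame is clean)."""
    problems = []
    if list(frame.columns) != EXPECTED_COLUMNS:
        # The remaining checks depend on these columns, so stop early.
        return ["unexpected columns: {0}".format(list(frame.columns))]
    if (frame["Borough"] == "Not assigned").any():
        problems.append("unassigned boroughs remain")
    if frame["Postal Code"].duplicated().any():
        problems.append("postal codes are not unique")
    if (frame["Neighborhood"] == "Not assigned").any():
        problems.append("unassigned neighborhoods remain")
    return problems


# Hypothetical rows that violate three of the requirements
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M5A", "M5A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park"],
})
print(check_requirements(raw))
```

Running such a check on the final dataframe gives a quick confirmation that each cleanup step had the intended effect.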


### Download the wiki page and load the table into a Pandas DataFrame

In [1]:
# Pull in the required libraries
import pandas as pd
import numpy as np
import requests
!conda install -y -c anaconda beautifulsoup4
from bs4 import BeautifulSoup

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [2]:
# Define and download the wiki page
the_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
res = requests.get(the_url)

In [3]:
# Parse the table with BeautifulSoup
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0]

In [4]:
# Load the Table into Pandas
df = pd.read_html(str(table))[0]
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [5]:
df.tail()

Unnamed: 0,Postal Code,Borough,Neighborhood
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
179,M9Z,Not assigned,Not assigned


### Ignore cells with a borough that is 'Not assigned'.

In [6]:
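# We'll use 'wdf' as our 'working data frame' so that we can
# refer back to the original if needed. The .copy() makes wdf an
# independent frame, so later assignments do not touch df.
wdf = df.loc[df['Borough'] != 'Not assigned'].copy()

# Validate: this should return no rows. The comparison is case-sensitive,
# so it must use the exact label 'Not assigned'.
wdf.loc[wdf['Borough'] == 'Not assigned']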

Unnamed: 0,Postal Code,Borough,Neighborhood


### Group Neighborhoods by Postal Code
For Postal Codes that span more than one Neighborhood, join the Neighborhoods with commas to form a single 'Postal Code' record.

In [7]:
# Check whether any Postal Code appears in more than one row
wdf[wdf.duplicated(['Postal Code'])]

Unnamed: 0,Postal Code,Borough,Neighborhood


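The empty output above indicates that only one entry exists for each postal code. The page already lists multiple neighborhoods in a single cell (for example, M5A shows "Regent Park, Harbourfront"), so this requirement is already met.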
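
If the table did list one neighborhood per row, the rows could be combined with a `groupby`. The sketch below uses hypothetical rows (not data scraped in this notebook) to show one way to do it; `long_rows` and `combined` are illustrative names.

```python
import pandas as pd

# Hypothetical long-format rows: one neighborhood per row
long_rows = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Group by postal code and borough, then join the neighborhoods with ", ".
# sort=False keeps the groups in order of first appearance.
combined = (
    long_rows.groupby(["Postal Code", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

Grouping on both `Postal Code` and `Borough` keeps the borough column in the result; this assumes each postal code belongs to exactly one borough.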

### If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [8]:
wdf[wdf.Neighborhood == 'Not assigned']

Unnamed: 0,Postal Code,Borough,Neighborhood


In this dataset, all Neighborhoods are assigned.

In [9]:
# If we needed to change a few, we could do it with one line of code:

mask = wdf.Neighborhood == 'Not assigned'
# Use a single .loc assignment; chained indexing would modify a temporary copy.
wdf.loc[mask, 'Neighborhood'] = wdf.loc[mask, 'Borough']

print("Shape after 'hood assignment fix: {0}".format(wdf.shape))

Shape after 'hood assignment fix: (103, 3)
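
Because no rows match in this dataset, the cell above leaves the frame unchanged. To see the effect of the rule, here is a small sketch on hypothetical data, using `Series.mask` as an alternative to the indexing approach in the notebook; the `toy` frame and its rows are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with one unassigned neighborhood
toy = pd.DataFrame({
    "Postal Code": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Where the neighborhood is unassigned, take the borough instead
unassigned = toy["Neighborhood"] == "Not assigned"
toy["Neighborhood"] = toy["Neighborhood"].mask(unassigned, toy["Borough"])
print(toy)
```

The shape of the frame is unaffected by this step, since it only rewrites values in place of the placeholder.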
