# Capstone Project Notebook
This notebook contains the capstone project work done for the IBM Professional Certificate on Data Science. It will be updated as more details emerge.

In this notebook, I collect the neighborhood data from Wikipedia. My first step is to import the libraries that I will be using.

In [13]:
import pandas as pd
import numpy as np

Getting the data required some research. I used the [trick](https://www.coursera.org/learn/applied-data-science-capstone/discussions/weeks/3/threads/kBaFtPNGSHGWhbTzRohxtQ) by [Mutlu Okumus](https://www.coursera.org/learn/applied-data-science-capstone/profiles/0b18aae3b5eabac80cc71c57ba7f02b8), which is to collect the data from a previous version of the page. Then, I used plain pandas to scrape the data.

A few things that I also noticed from the process:

1. There were no 'Not assigned' Neighbourhood values that had a Borough different from 'Not assigned'.
2. Each postcode lies within a single borough.
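
Both observations can be checked with a couple of pandas expressions. The sketch below is only an illustration: `toy` is a hypothetical table with made-up rows that reuses the column names of the Wikipedia table, and it is not the check that was run on the real data.

```python
import pandas as pd

# Hypothetical toy rows (assumption): same column names as the Wikipedia table
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M1B', 'M1B', 'M1G'],
    'Borough': ['Not assigned', 'Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Not assigned', 'Rouge', 'Malvern', 'Woburn'],
})

# Observation 1: rows whose neighbourhood is unassigned although the borough is assigned
orphans = toy[(toy['Neighbourhood'] == 'Not assigned') & (toy['Borough'] != 'Not assigned')]

# Observation 2: number of distinct boroughs for each postcode
boroughs_per_postcode = toy.groupby('Postcode')['Borough'].nunique()

print(len(orphans))                 # count of rows contradicting observation 1
print(boroughs_per_postcode.max())  # largest number of boroughs sharing one postcode
```

If the first number is 0 and the second is 1, both observations hold for the table being checked.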

The steps are detailed as comments in the code.

In [81]:
# Importing the data from web using pandas.
url = 'https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=942851379'
data_raw = pd.read_html(url)
# read_html returns a list of DataFrames; the table is the first element
data_tbl = data_raw[0]
# I get rid of any element which has a 'Borough' equal to 'Not assigned'
data_tbl = data_tbl.loc[data_tbl['Borough'] != 'Not assigned']
# I pull out the unique postcode values
postcode = data_tbl['Postcode'].value_counts().index
# Create an empty dataframe to which all the data will be attached
df = pd.DataFrame(columns=['Postcode','Borough','Neighbourhood'])
i = 0

# For every single unique postcode
for pc in postcode:
    # Extract the data for that unique postcode
    idx = data_tbl['Postcode'] == pc
    data_pc = data_tbl.loc[idx][['Borough','Neighbourhood']]
    # We assume that a postcode is always within a unique Borough
    borough = data_pc['Borough'].value_counts().index[0]
    # Concatenate the neighborhoods into one string
    neigh = data_pc['Neighbourhood'].str.cat(sep=", ")
    # Now, we add a new line to our dataframe
    df.loc[i] = [pc, borough, neigh]
    # Make sure that we move to the next line
    i = i + 1
# Our data frame is complete, we sort the result by postcode
df.sort_values(by=['Postcode'], inplace=True)
# Clean up the indexes so they appear in order
df.index = range(0,i)
df

,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
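
The row-by-row loop above can also be written as a single `groupby` aggregation. This is a minimal sketch on a hypothetical toy table (`toy`, with a few made-up rows and the same column names), not the code that produced the output above. It takes the first borough of each postcode rather than the most frequent one, which gives the same result only under the assumption that a postcode has a single borough.

```python
import pandas as pd

# Hypothetical toy rows (assumption): same column names as the cleaned table
toy = pd.DataFrame({
    'Postcode': ['M1C', 'M1B', 'M1B', 'M1C', 'M1G'],
    'Borough': ['Scarborough'] * 5,
    'Neighbourhood': ['Highland Creek', 'Rouge', 'Malvern', 'Rouge Hill', 'Woburn'],
})

# One row per postcode: keep the first borough, join the neighbourhoods with ", ".
# groupby sorts by the key by default, so the result is already ordered by postcode.
grouped = (
    toy.groupby('Postcode', as_index=False)
       .agg({'Borough': 'first', 'Neighbourhood': ', '.join})
)

print(grouped)
```

With `as_index=False` the result keeps `Postcode` as a regular column and gets a fresh 0-based index, so no separate sorting or re-indexing step is needed.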


This is the size of the dataset (rows, columns):

In [83]:
df.shape

(103, 3)