# Capstone Project Notebook
This notebook contains the capstone project work done for the IBM Professional Certificate on Data Science. It will be updated as more details emerge.

In this notebook, I collect the neighborhood data from Wikipedia and then add the geo-coordinates. My first step is to import the libraries that I will be using.

In [19]:
import pandas as pd
import numpy as np

Getting the data required some research. I used the [trick](https://www.coursera.org/learn/applied-data-science-capstone/discussions/weeks/3/threads/kBaFtPNGSHGWhbTzRohxtQ) by [Mutlu Okumus](https://www.coursera.org/learn/applied-data-science-capstone/profiles/0b18aae3b5eabac80cc71c57ba7f02b8), whereby I collected the data from a previous version of the page. Then, I used plain pandas to scrape the data.

A few things that I also noticed from the process:

1. There were no 'Not assigned' Neighbourhood values that had a Borough different from 'Not assigned'.
2. In general, each postcode falls within a single borough.
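
Both observations can be checked directly rather than by eye. The following is a minimal sketch on a tiny hand-made table, not the scraped data itself: the `sample` frame and its `M1A` row are hypothetical stand-ins for what `pd.read_html` returns, using the same column names as the Wikipedia table.

```python
import pandas as pd

# Hypothetical miniature of the scraped table (illustrative rows only).
sample = pd.DataFrame({
    'Postcode': ['M1A', 'M1B', 'M1B', 'M1G'],
    'Borough': ['Not assigned', 'Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Not assigned', 'Rouge', 'Malvern', 'Woburn'],
})

# Observation 1: rows with an assigned Borough but an unassigned Neighbourhood.
orphans = sample[(sample['Borough'] != 'Not assigned')
                 & (sample['Neighbourhood'] == 'Not assigned')]

# Observation 2: count the distinct boroughs per postcode among assigned rows.
assigned = sample[sample['Borough'] != 'Not assigned']
boroughs_per_postcode = assigned.groupby('Postcode')['Borough'].nunique()

print(len(orphans))                        # rows that would contradict observation 1
print((boroughs_per_postcode == 1).all())  # whether every postcode has exactly one borough
```

On the real table, a non-zero count or a `False` would mean the assumptions behind the processing need revisiting.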

The steps are detailed as comments in the code cell that follows.

In [20]:
# Importing the data from web using pandas.
url = 'https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=942851379'
data_raw = pd.read_html(url)
# pd.read_html returns a list of DataFrames; the table we need is its first element
data_tbl = data_raw[0]
# I get rid of any element which has a 'Borough' equal to 'Not assigned'
data_tbl = data_tbl.loc[data_tbl['Borough'] != 'Not assigned']
# I pull out the unique postcode values
postcode = data_tbl['Postcode'].value_counts().index
# Create an empty dataframe to which all the data will be attached
df = pd.DataFrame(columns=['Postcode','Borough','Neighbourhood'])
i = 0

# For every single unique postcode
for pc in postcode:
    # Extract the data for that unique postcode
    idx = data_tbl['Postcode'] == pc
    data_pc = data_tbl.loc[idx][['Borough','Neighbourhood']]
    # We assume that a postcode is always within a unique Borough
    borough = data_pc['Borough'].value_counts().index[0]
    # Concatenate the neighborhoods into one string
    neigh = data_pc['Neighbourhood'].str.cat(sep=", ")
    # Now, we add a new line to our dataframe
    df.loc[i] = [pc, borough, neigh]
    # Make sure that we move to the next line
    i = i + 1
# Our data frame is complete, we sort the result by postcode
df.sort_values(by=['Postcode'], inplace=True)
# Clean up the indexes so they appear in order
df.index = range(0,i)
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
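
The same result can probably be reached without an explicit loop. Here is a minimal sketch using `groupby`, offered as an alternative and not as the method used above. It runs on a small hand-made `sample` frame (a hypothetical stand-in for `data_tbl`) and takes the first borough seen for each postcode, which relies on the one-borough-per-postcode assumption.

```python
import pandas as pd

# Hypothetical stand-in for data_tbl, with rows deliberately out of order.
sample = pd.DataFrame({
    'Postcode': ['M1C', 'M1B', 'M1C', 'M1B', 'M1C', 'M1G'],
    'Borough': ['Scarborough'] * 6,
    'Neighbourhood': ['Highland Creek', 'Rouge', 'Rouge Hill',
                      'Malvern', 'Port Union', 'Woburn'],
})

# Group by postcode (groupby sorts the keys by default), keep the first
# borough, and join the neighbourhood names with ", ".
grouped = (
    sample.groupby('Postcode', as_index=False)
          .agg({'Borough': 'first', 'Neighbourhood': ', '.join})
)
print(grouped)
```

Because `groupby` sorts the group keys and `as_index=False` yields a fresh integer index, the separate sorting and re-indexing steps are not needed in this variant.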


This is the size of the dataset:

In [21]:
df.shape

(103, 3)

Unfortunately, the geocoder package was too time-consuming: a single postcode (out of 103) took a significant amount of time. I guess the service is being heavily used. Therefore, I switched to the 'Geospatial_Coordinates.csv' file provided in the instructions. Assuming that I got the correct neighborhood data, I will sort this data by 'Postal Code' too, just to be sure that its rows are in the same order as mine.

In [33]:
df_coord = pd.read_csv('Geospatial_Coordinates.csv')
df_coord.sort_values(by=['Postal Code'], inplace=True)
df_coord.index = range(len(df_coord))
df_coord.head(10)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


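With that out of the way, I will just join the 'Latitude' and 'Longitude' columns onto my original dataframe. The join matches rows by index, so it relies on both dataframes being sorted by postcode and sharing the same index.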

In [35]:
df = df.join(df_coord[['Latitude','Longitude']])
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
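
Joining by position works here because both tables were sorted the same way. A key-based merge avoids that dependency. The following is a minimal sketch under stated assumptions: `neighbourhoods` and `coords` are small hand-made stand-ins for `df` and `df_coord`, with values copied from the tables above and the coordinate rows deliberately shuffled.

```python
import pandas as pd

# Stand-in for df (three rows from the table above).
neighbourhoods = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M1E'],
    'Borough': ['Scarborough'] * 3,
    'Neighbourhood': ['Rouge, Malvern',
                      'Highland Creek, Rouge Hill, Port Union',
                      'Guildwood, Morningside, West Hill'],
})

# Stand-in for df_coord, intentionally in a different row order.
coords = pd.DataFrame({
    'Postal Code': ['M1E', 'M1B', 'M1C'],
    'Latitude': [43.763573, 43.806686, 43.784535],
    'Longitude': [-79.188711, -79.194353, -79.160497],
})

# Match rows on the postcode itself; validate raises if a key is duplicated.
merged = (
    neighbourhoods.merge(coords, left_on='Postcode', right_on='Postal Code',
                         how='left', validate='one_to_one')
                  .drop(columns='Postal Code')
)
print(merged)
```

With `how='left'`, a postcode missing from the coordinates file would show up as `NaN` in 'Latitude' and 'Longitude' instead of silently shifting the rows.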


That's it for this part of the assignment.