# APPLIED DATA SCIENCE CAPSTONE PROJECT
This notebook will be used for the capstone project.

# __PART 1__

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Install the lxml package, which pandas uses to parse the HTML tables on the Wikipedia page
!conda install -c anaconda lxml -y
print('lxml installed')

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - lxml


The following packages will be SUPERSEDED by a higher-priority channel:

  ca-certificates    conda-forge::ca-certificates-2020.4.5~ --> anaconda::ca-certificates-2020.1.1-0
  certifi            conda-forge::certifi-2020.4.5.1-py36h~ --> anaconda::certifi-2020.4.5.1-py36_0
  openssl            conda-forge::openssl-1.1.1g-h516909a_0 --> anaconda::openssl-1.1.1g-h7b6447c_0


Preparing transaction: done
Verifying transaction: done
Executing transaction: done
lxml installed


The first step is to read the table from the given Wikipedia page. The code below reads every table on the page and stores them in a list named 'tables'.

In [3]:
tables = pd.read_html('http://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
print('List contains {} tables!'.format(len(tables)))

List contains 3 tables!


Now we have 3 tables in our list. Examining the elements of the list shows that the first table is the one we need. So let's assign this table to a dataframe and check the result:

In [4]:
df_postal = pd.DataFrame(data = tables[0])
df_postal.head()

,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
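
Picking the table by position works as long as the page layout stays the same. As a hedged alternative, the table can be picked by its column names instead. The sketch below uses small toy dataframes in place of the real list returned by `pd.read_html`; the helper name `pick_postal_table` and the toy data are assumptions made for illustration only.

```python
import pandas as pd

# Toy stand-ins for the list returned by pd.read_html (hypothetical data).
toy_tables = [
    pd.DataFrame({"Region": ["NL", "NS"], "Prefix": ["A", "B"]}),
    pd.DataFrame({
        "Postal Code": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighborhood": [None, "Parkwoods"],
    }),
]

# Column names we expect in the postal code table.
required = {"Postal Code", "Borough", "Neighborhood"}

def pick_postal_table(tables):
    """Return the first table whose columns include all required names."""
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError("no table with the expected columns was found")

selected = pick_postal_table(toy_tables)
print(selected.columns.tolist())
```

With the real data this would be called as `pick_postal_table(tables)`, and it fails loudly if the page no longer contains a table with these columns.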


The dataframe looks correct. Now let's keep only the rows with an assigned 'Borough':

In [5]:
# Ignore postal codes with 'Not assigned' borough (copy so the in-place calls below act on an independent dataframe)
df_postal = df_postal[df_postal['Borough'] != 'Not assigned'].copy()
# Sort by postal code, then reset the index
df_postal.sort_values(by='Postal Code', inplace=True)
df_postal.reset_index(drop=True, inplace=True)
df_postal.head()

,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [6]:
# Check whether any neighborhood is missing (NaN)
df_postal['Neighborhood'].isna().value_counts()

False    103
Name: Neighborhood, dtype: int64

Since there is no unassigned 'Neighborhood' value, the dataframe is ready for the next part.
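
Note that `isna()` only detects real missing values (NaN/None). If a neighborhood were stored as the placeholder text 'Not assigned', as happens in the 'Borough' column, it would not be counted. Here is a minimal sketch of a check that covers both cases, using toy data. The rows below and the repair step (falling back to the borough name) are assumptions for illustration, not steps from this notebook.

```python
import pandas as pd

# Toy data (hypothetical rows): one real missing value, one placeholder string.
toy = pd.DataFrame({
    "Postal Code": ["M1B", "M5A", "M7A"],
    "Borough": ["Scarborough", "Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["Malvern, Rouge", None, "Not assigned"],
})

# Flag rows where the neighborhood is NaN/None or the placeholder text.
missing = toy["Neighborhood"].isna() | toy["Neighborhood"].eq("Not assigned")
print("Rows without a usable neighborhood:", int(missing.sum()))

# One possible repair (an assumption): reuse the borough name for those rows.
toy.loc[missing, "Neighborhood"] = toy.loc[missing, "Borough"]
print(toy)
```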

In [7]:
print('Shape of the dataframe:', df_postal.shape)

Shape of the dataframe: (103, 3)


# __PART 2__

Read the coordinates from the CSV file into a dataframe:

In [8]:
coordinates = pd.read_csv('https://cocl.us/Geospatial_data')

In [9]:
coordinates.head()

,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Filter the coordinates dataframe using the dataframe from PART 1:

In [10]:
coordinates = coordinates[coordinates['Postal Code'].isin(df_postal['Postal Code'].tolist())]

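Sort the coordinates by postal code so the rows follow the same order as the dataframe from PART 1. The merge itself matches rows on 'Postal Code' and does not depend on row order, so this step only keeps the two dataframes easy to compare: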

In [11]:
# sort_values returns a new dataframe, so assign the result back
coordinates = coordinates.sort_values(by='Postal Code')
coordinates.head()

,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
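
A detail worth keeping in mind: by default `sort_values` returns a new dataframe and leaves the original untouched, so the result has to be assigned (or `inplace=True` has to be passed). A minimal sketch on toy data, where the two rows and the variable names are assumptions for illustration:

```python
import pandas as pd

# Toy coordinates in deliberately unsorted order (hypothetical subset).
toy_coords = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],
    "Latitude": [43.784535, 43.806686],
})

# The result is discarded here, so toy_coords keeps its original order.
toy_coords.sort_values(by="Postal Code")
unsorted_first = toy_coords["Postal Code"].iloc[0]

# Assigning the result (and resetting the index) gives the sorted dataframe.
sorted_coords = toy_coords.sort_values(by="Postal Code").reset_index(drop=True)
sorted_first = sorted_coords["Postal Code"].iloc[0]

print(unsorted_first, sorted_first)
```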


Finally, let's merge the two dataframes to get a final dataframe with coordinates:

In [13]:
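# Join on the shared 'Postal Code' column (stated explicitly instead of relying on the default)
df_final = pd.merge(df_postal, coordinates, on='Postal Code')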
df_final.head(10)

,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.727929,-79.262029
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
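
To double-check a merge like the one above, `pd.merge` offers the `validate` and `indicator` parameters. The sketch below is a hedged example on toy data: the rows are a made-up subset, and 'M9Z' is a hypothetical postal code added only to show what an unmatched row looks like.

```python
import pandas as pd

# Toy postal data; 'M9Z' is a hypothetical code with no coordinates.
toy_postal = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
})
toy_coords = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

# how='left' keeps every postal code, validate raises if a key is duplicated
# on either side, and indicator adds a '_merge' column showing where each row matched.
checked = pd.merge(
    toy_postal,
    toy_coords,
    on="Postal Code",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Postal codes that found no coordinates.
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print("Postal codes without coordinates:", unmatched)
```

An empty `unmatched` list on the real data would confirm that every postal code from PART 1 received coordinates.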
