# The IBM Applied DS Capstone Project

This notebook walks through the [IBM Data Science Professional Certificate Specialization: Applied Data Science Capstone](https://www.coursera.org/learn/applied-data-science-capstone) course.

In [1]:
import pandas as pd
import numpy as np

print('Hello Capstone Project Course!')

Hello Capstone Project Course!


## Scrape the "List of postal codes of Canada: M" Wikipedia page for Toronto postal codes

Required imports

In [2]:
from bs4 import BeautifulSoup
import requests

Fetch wiki page as a BeautifulSoup object

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

Find the HTML table that contains the codes

In [4]:
soup_codes_table = soup.find('table', class_='wikitable sortable')
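A note on the lookup above: when `class_` is given a string with several class names, Beautiful Soup compares it against the exact value of the `class` attribute, so the order of the names matters. The sketch below illustrates this on a small made-up HTML snippet (the snippet and the variable names are assumptions for illustration, not the real page) and shows a CSS selector as an order-independent alternative.

```python
from bs4 import BeautifulSoup

# made-up HTML standing in for the downloaded page
html = """
<table class="infobox"><tr><td>ignored</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')  # stdlib parser, no lxml needed here

# exact match on the class attribute string, as in the cell above
exact = soup.find('table', class_='wikitable sortable')

# the same names in a different order do not match the attribute string
swapped = soup.find('table', class_='sortable wikitable')

# a CSS selector matches regardless of class order
by_selector = soup.select_one('table.wikitable.sortable')

first_header = exact.find('th').get_text(strip=True)
print(first_header, swapped is None, by_selector is not None)
```

If the page ever lists the classes in another order or adds a third class, the selector form is the more forgiving of the two.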

Convert to pandas dataframe

In [5]:
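from io import StringIO

# show full column contents later on; None needs pandas >= 1.0 (older releases used -1, which newer ones reject)
pd.set_option('display.max_colwidth', None)

# wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
postal_codes_df = pd.read_html(StringIO(str(soup_codes_table)), header=0)[0]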
postal_codes_df.columns = ['PostalCode', 'Borough', 'Neighborhood']

postal_codes_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Drop entries with "Not assigned" Borough

In [6]:
postal_codes_df = postal_codes_df[postal_codes_df.Borough != 'Not assigned']

postal_codes_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


Aggregate function to be used with groupby()

In [7]:
from functools import reduce

# this function forms a comma+space separated string of the unique, sorted elements of `series`
def aggreg(series):
    return reduce(lambda x, y: x + ', ' + y, sorted(list(set(series))))
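
To see what `aggreg` returns, the sketch below runs the same `reduce` expression on a plain list and compares it with `', '.join(...)`, which should give the same string with less machinery. The name `aggreg_join` and the sample values are assumptions made for illustration.

```python
from functools import reduce

def aggreg(series):
    # same expression as in the cell above
    return reduce(lambda x, y: x + ', ' + y, sorted(list(set(series))))

def aggreg_join(series):
    # hypothetical alternative: str.join over the unique, sorted values
    return ', '.join(sorted(set(series)))

sample = ['Rouge', 'Malvern', 'Rouge']  # made-up input with a duplicate

print(aggreg(sample))       # Malvern, Rouge
print(aggreg_join(sample))  # Malvern, Rouge
```

One difference to keep in mind: `reduce` without an initial value raises `TypeError` on an empty input, while `join` returns an empty string. The groups that `groupby` produces on this frame are never empty, so both variants should behave the same here.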

Aggregate neighborhoods with the same PostalCode to comma+space separated strings

In [8]:
postal_codes_df = postal_codes_df.groupby('PostalCode').agg(aggreg).reset_index()

postal_codes_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
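
The call above applies `aggreg` to every non-key column, so `Borough` is joined as well; that is harmless as long as each postal code belongs to a single borough. A possible variant, sketched on made-up rows, passes a per-column mapping to `agg` so that `Borough` simply keeps its first value. The frame `toy` and the helper `join_unique` are hypothetical names, and the single-borough assumption is not verified here.

```python
import pandas as pd

# made-up rows that mimic the frame before aggregation
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

def join_unique(series):
    # hypothetical stand-in for aggreg: unique, sorted, comma+space separated
    return ', '.join(sorted(set(series)))

grouped = (
    toy.groupby('PostalCode')
       .agg({'Borough': 'first', 'Neighborhood': join_unique})
       .reset_index()
)

print(grouped)
```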


Do we have 'Not assigned' Neighborhoods?

In [9]:
# boolean mask of rows whose Neighborhood is still 'Not assigned'
not_assigned_neighborhoods = postal_codes_df.Neighborhood == 'Not assigned'
postal_codes_df[not_assigned_neighborhoods]

Unnamed: 0,PostalCode,Borough,Neighborhood
85,M7A,Queen's Park,Not assigned


Replace each 'Not assigned' Neighborhood with the name of its Borough

In [10]:
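# keep Neighborhood where it is assigned; otherwise fall back to the row's Borough
postal_codes_df['Neighborhood'] = postal_codes_df['Neighborhood'].where(
    postal_codes_df['Neighborhood'] != 'Not assigned', postal_codes_df['Borough']
)

# check the rows flagged earlier
postal_codes_df.loc[not_assigned_neighborhoods]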

Unnamed: 0,PostalCode,Borough,Neighborhood
85,M7A,Queen's Park,Queen's Park


`postal_codes_df` size

In [11]:
postal_codes_df.shape

(103, 3)
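
The shape alone does not show whether the cleaning steps held. Below is a minimal sketch of a few follow-up checks, assuming the column names used above; the helper `check_postal_codes` and the toy rows are hypothetical and only illustrate the idea.

```python
import pandas as pd

def check_postal_codes(df):
    """Hypothetical helper: list the problems found in a cleaned frame."""
    problems = []
    if df['PostalCode'].duplicated().any():
        problems.append('duplicate PostalCode values')
    if (df['Borough'] == 'Not assigned').any():
        problems.append("rows with 'Not assigned' Borough")
    if (df['Neighborhood'] == 'Not assigned').any():
        problems.append("rows with 'Not assigned' Neighborhood")
    return problems

# made-up rows: one clean frame and one that still needs work
clean = pd.DataFrame({
    'PostalCode': ['M3A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Parkwoods', "Queen's Park"],
})
dirty = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto'],
    'Neighborhood': ['Harbourfront', 'Not assigned'],
})

print(check_postal_codes(clean))
print(check_postal_codes(dirty))
```

Calling the helper on `postal_codes_df` after the steps above would be expected to return an empty list.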