# **My Toronto Neighborhoods Project** #

### *Kees Korver*, Haarlem, the Netherlands ###

This project is part of week 3 of the Applied Data Science Capstone course on Coursera.

### PART 1 - data wrangling and creating the pandas dataframe ###

In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np

In [2]:
# pd.read_html needs an HTML parser such as lxml, so install it first
!pip install lxml



In [3]:
url_path='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [4]:
# pd.read_html returns every table found on the webpage as a list of DataFrames.
# Its match parameter keeps only the tables whose text matches the given string or regex.
match = 'North York'
# Alternatively, skip match and pick the wanted table from the returned list by position, e.g. with [0]

In [5]:
# Read the matching tables from the webpage and treat 'Not assigned' as NaN (on top of the default NaN markers).
# Note that df is a list of DataFrames here, not a single DataFrame.
df = pd.read_html(url_path, match=match, header=0, na_values='Not assigned', keep_default_na=True)
df

[    Postcode           Borough          Neighbourhood
 0        M1A               NaN                    NaN
 1        M2A               NaN                    NaN
 2        M3A        North York              Parkwoods
 3        M4A        North York       Victoria Village
 4        M5A  Downtown Toronto           Harbourfront
 ..       ...               ...                    ...
 282      M8Z         Etobicoke              Mimico NW
 283      M8Z         Etobicoke     The Queensway West
 284      M8Z         Etobicoke  Royal York South West
 285      M8Z         Etobicoke         South of Bloor
 286      M9Z               NaN                    NaN
 
 [287 rows x 3 columns]]
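To make the behaviour of `match` and `na_values` easier to see without hitting Wikipedia, here is a minimal sketch on a small, made-up HTML snippet. The HTML content and the names `html`, `all_tables` and `matched` are assumptions for illustration only; like the cell above, it needs an HTML parser such as lxml to be installed.

```python
from io import StringIO

import pandas as pd

# Made-up HTML with two tables, standing in for a real web page (assumption).
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th></tr>
  <tr><td>M3A</td><td>North York</td></tr>
  <tr><td>M1A</td><td>Not assigned</td></tr>
</table>
<table>
  <tr><th>Province</th><th>Prefix</th></tr>
  <tr><td>Ontario</td><td>M</td></tr>
</table>
"""

# Without match, every table on the page comes back in the list.
all_tables = pd.read_html(StringIO(html), header=0)

# With match, only tables whose text contains 'North York' are kept,
# and 'Not assigned' is read as NaN.
matched = pd.read_html(
    StringIO(html), match="North York", header=0, na_values="Not assigned"
)

print(len(all_tables), len(matched))
print(matched[0])
```

The result is still a list even when only one table matches, which is why the next step indexes it with `[0]`.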

In [6]:
# Take the single matching table out of the list; this pandas DataFrame is named df_tor
df_tor = df[0]
df_tor.head(10)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | NaN | NaN |
| 1 | M2A | NaN | NaN |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M6A | North York | Lawrence Heights |
| 6 | M6A | North York | Lawrence Manor |
| 7 | M7A | Queen's Park | NaN |
| 8 | M8A | NaN | NaN |
| 9 | M9A | Queen's Park | Queen's Park |


In [7]:
df_tor.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287 entries, 0 to 286
Data columns (total 3 columns):
Postcode         287 non-null object
Borough          210 non-null object
Neighbourhood    209 non-null object
dtypes: object(3)
memory usage: 6.9+ KB


In [8]:
# The output above shows that many values are null (NaN)
# FIRST, drop all rows where 'Borough' is NaN and then reset the index
df_tor.dropna(subset=['Borough'], axis=0, inplace=True)
df_tor.reset_index(drop=True, inplace=True)
df_tor.head(10)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights |
| 4 | M6A | North York | Lawrence Manor |
| 5 | M7A | Queen's Park | NaN |
| 6 | M9A | Queen's Park | Queen's Park |
| 7 | M1B | Scarborough | Rouge |
| 8 | M1B | Scarborough | Malvern |
| 9 | M3B | North York | Don Mills North |
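The drop-and-reindex step can be tried on a tiny frame. This is a minimal sketch with made-up rows; the names `toy` and `cleaned` are hypothetical, and it uses the non-inplace form so the original frame stays untouched for comparison.

```python
import numpy as np
import pandas as pd

# Small made-up frame with the same columns as df_tor (assumption).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A", "M8A"],
    "Borough": [np.nan, "North York", "Queen's Park", np.nan],
    "Neighbourhood": [np.nan, "Parkwoods", np.nan, np.nan],
})

# Drop rows whose Borough is NaN; rows with only a missing Neighbourhood are kept.
# reset_index(drop=True) renumbers the rows from 0 and discards the old index.
cleaned = toy.dropna(subset=["Borough"]).reset_index(drop=True)

print(cleaned)
```

Because `subset` is limited to `Borough`, the row for M7A survives even though its `Neighbourhood` is still NaN.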


In [9]:
# 210 rows are left of the original 287, i.e. 77 rows were dropped because their Borough was NaN
df_tor.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 3 columns):
Postcode         210 non-null object
Borough          210 non-null object
Neighbourhood    209 non-null object
dtypes: object(3)
memory usage: 5.0+ KB


In [10]:
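# SECOND, if Neighbourhood is NaN, replace it with the Borough value of the same row
# The result is assigned back to the column; inplace=True on a selected column triggers a warning in recent pandas versions
df_tor['Neighbourhood'] = df_tor['Neighbourhood'].fillna(df_tor['Borough'])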
df_tor.head(10)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights |
| 4 | M6A | North York | Lawrence Manor |
| 5 | M7A | Queen's Park | Queen's Park |
| 6 | M9A | Queen's Park | Queen's Park |
| 7 | M1B | Scarborough | Rouge |
| 8 | M1B | Scarborough | Malvern |
| 9 | M3B | North York | Don Mills North |
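Filling one column from another relies on pandas aligning the two Series by row index. A minimal sketch on made-up rows follows; the name `toy` is hypothetical. It assigns the result back to the column, a form that avoids calling an in-place method on a selected column, which recent pandas versions may warn about.

```python
import numpy as np
import pandas as pd

# Small made-up frame with one missing Neighbourhood (assumption).
toy = pd.DataFrame({
    "Postcode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", np.nan],
})

# fillna with a Series fills each NaN from the row with the same index label;
# existing values such as 'Parkwoods' are left unchanged.
toy["Neighbourhood"] = toy["Neighbourhood"].fillna(toy["Borough"])

print(toy)
```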


In [11]:
df_tor.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 3 columns):
Postcode         210 non-null object
Borough          210 non-null object
Neighbourhood    210 non-null object
dtypes: object(3)
memory usage: 5.0+ KB


In [12]:
# THIRD, combine the Neighbourhoods that share a Postcode into one comma-separated string
# The result is stored in df_tor2

In [13]:
df_tor2 = df_tor.groupby(['Postcode', 'Borough'])['Neighbourhood'].agg(', '.join).reset_index()

In [14]:
df_tor2.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
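The aggregation can be checked on a handful of made-up rows. In this minimal sketch the names `toy` and `merged` are hypothetical, and `', '.join` is used as the aggregation function; treat it as one possible way to write the step.

```python
import pandas as pd

# Small made-up frame in which two postcodes appear twice (assumption).
toy = pd.DataFrame({
    "Postcode": ["M6A", "M6A", "M1B", "M1B", "M3A"],
    "Borough": ["North York", "North York", "Scarborough", "Scarborough", "North York"],
    "Neighbourhood": ["Lawrence Heights", "Lawrence Manor", "Rouge", "Malvern", "Parkwoods"],
})

# Group by Postcode and Borough, then join the neighbourhood names of each group.
# groupby sorts by the group keys by default, so the result starts at M1B.
merged = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(merged)
```

Within each group the names keep their original row order, which is why M1B reads "Rouge, Malvern" rather than being sorted alphabetically. Passing `sort=False` to `groupby` would keep the postcodes in order of first appearance instead.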


In [15]:
# Checking if there are any NaN values left:
df_tor2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 3 columns):
Postcode         103 non-null object
Borough          103 non-null object
Neighbourhood    103 non-null object
dtypes: object(3)
memory usage: 2.5+ KB
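Besides reading the non-null counts from `info()`, the same check can be expressed directly in code. The sketch below uses a made-up frame and hypothetical names (`toy`, `nan_counts`, `has_nan`, `postcodes_unique`). It also verifies that every postcode occurs only once, an additional check that is not part of the original steps.

```python
import pandas as pd

# Small made-up frame resembling the aggregated result (assumption).
toy = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union", "Woburn"],
})

# Number of NaN values per column, and a single flag for the whole frame.
nan_counts = toy.isna().sum()
has_nan = bool(toy.isna().any().any())

# After aggregation each postcode should appear exactly once.
postcodes_unique = toy["Postcode"].is_unique

print(nan_counts)
print(has_nan, postcodes_unique)
```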


In [16]:
print("The number of rows in the dataframe =",df_tor2.shape[0])

The number of rows in the dataframe = 103
