## Data Collection

In this notebook, I'll scrape the data from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and create a Pandas dataframe from it.

### Scraping data

I'll begin by scraping the table on the webpage.

I'll import pandas to create the dataframe and read through the HTML page.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

In [3]:
dataset = df[0]

In [4]:
dataset.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


### Cleaning The Data

As I have created the Pandas dataframe, I'll clean the dataframe.

First, I'll remove all the unnecessary rows where Borough is Not assigned

In [5]:
dataset = dataset[dataset['Borough'] != 'Not assigned'].reset_index(drop=True).sort_values(by=['Postal Code'])

In [6]:
dataset

Unnamed: 0,Postal Code,Borough,Neighborhood
6,M1B,Scarborough,"Malvern, Rouge"
12,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
18,M1E,Scarborough,"Guildwood, Morningside, West Hill"
22,M1G,Scarborough,Woburn
26,M1H,Scarborough,Cedarbrae
...,...,...,...
64,M9N,York,Weston
70,M9P,Etobicoke,Westmount
77,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
89,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


I tried to use the geocoder library to fetch the coordinates.

In [7]:
from geopy.geocoders import Nominatim

In [8]:
lat_long = []
for address in dataset['Neighborhood']:
    try :
        geolocator = Nominatim(user_agent="foursquare_agent")
        location = geolocator.geocode(address)
        latitude = location.latitude
        longitude = location.longitude
        cordinate = [latitude,longitude]
        lat_long.append(cordinate)
    except :
        lat_long.append([np.nan,np.nan])
        continue

Combine the lat_long list with dataset.

In [9]:
dataset = pd.concat([dataset, pd.DataFrame(lat_long,columns=['Latitude','Longitude'])], axis=1).sort_values(by=['Postal Code']).reset_index(drop=True)

In [10]:
dataset.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",,
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",48.732347,6.236384
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.766857,-79.364258
3,M1G,Scarborough,Woburn,43.77398,-79.413833
4,M1H,Scarborough,Cedarbrae,43.775347,-79.345944


In [11]:
dataset.isna().sum()

Postal Code      0
Borough          0
Neighborhood     0
Latitude        38
Longitude       38
dtype: int64

I tried to use the geocoder library to fetch the coordinates but still was get NaN latitude and longitude. So, I decided to use the Geospatial_Coordinates.csv file from https://github.com/kb22/Coursera_Capstone to get the coordinates.

In [12]:
coor = pd.read_csv('Geospatial_Coordinates.csv')
coor.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [13]:
for row in range(len(dataset)):
    if dataset['Latitude'][row] or dataset['Longitude'][row] == np.nan and dataset['Postal Code'][row] == coor['Postal Code'][row]:
        dataset['Latitude'][row] = coor['Latitude'][row]
        dataset['Longitude'][row] = coor['Longitude'][row]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


In [14]:
dataset.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [15]:
dataset.isna().sum()

Postal Code     0
Borough         0
Neighborhood    0
Latitude        0
Longitude       0
dtype: int64

In [16]:
dataset.to_csv('toronto.csv')