<h1>Final project of IBM Data Science Certification</h1>
<h2>Segmenting and Clustering Neighborhoods in Toronto</h2>
<h3>Problem 2 - Get coordinates for each neighborhood</h3>
<h3>By: Aurelio Álvarez Ibarra</h3>

This notebook contains the code to get the coordinates of neighborhoods in Toronto. For details on the first section of this notebook, please refer to <a href="https://github.com/aurelioai/Coursera_Capstone/blob/master/Final_proyect_AAI_1.ipynb">Problem 1</a>.

<h4>Initializing data from Problem 1</h4>

In [1]:
# Get packages and libraries ready
!pip install beautifulsoup4 lxml
from bs4 import BeautifulSoup
import requests
import pandas as pd



In [2]:
# Save data from webpage
myurl = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(myurl).text
mysoup = BeautifulSoup(source,'lxml')
mytable = mysoup.find('table')
toronto_df = pd.DataFrame(columns = ['PostalCode', 'Borough', 'Neighborhood'])

# Process data from webpage
header = True
for mytr in mytable.find_all('tr'): # Looping for each row in the table
    # Initialize data list (row)
    data = []
    for mycell in mytr.find_all('td'): # Looping for each cell in the row
        data.append(mycell.text.strip()) # Strip removes the \n in the end of the cell data
        # The previous line works for any number of columns (cells) in a row.
    # Write values from the row
    size = len(toronto_df) # Current size of dataframe
    if header: # The header row (which is only one) leaves "data" as a blank list!
        header = False
    else: # Non-header rows can be assigned to dataframe
        toronto_df.loc[size] = data # Appending data after last row of dataframe

# Clean data

# List of boroughs with an assignment
condition1 = toronto_df['Borough']!='Not assigned'
tmp1 = toronto_df[condition1]
tmp1 = tmp1.reset_index(drop=True) # Drops the old index column

# Copying borough name to neighborhood when neighborhood is not assigned
tmp2 = tmp1.copy() # Work on a copy so that tmp1 is left untouched
for i,hood in enumerate(tmp1['Neighborhood']):
    if (hood=='Not assigned' or hood==''):
        bor = tmp2.loc[i,'Borough']
        print('Updating Neighborhood name for ',bor,' in index ',i)
        tmp2.loc[i,'Neighborhood'] = bor # .loc avoids chained assignment
tmp2 = tmp2.reset_index(drop=True) # Drops the old index column

# Merge neighborhoods with the same PostalCode (separated by commas)
tmp3 = tmp2.groupby('PostalCode')['Neighborhood'].apply(','.join).reset_index()
tmp3.rename(columns={'Neighborhood':'Neighborhood_comb'},inplace=True)
merged = pd.merge(tmp2, tmp3, on='PostalCode')
merged.drop(['Neighborhood'],axis=1,inplace=True) # Dropping "old" Neighborhood column
merged.drop_duplicates(inplace=True) # Dropping duplicated rows
merged.rename(columns={'Neighborhood_comb':'Neighborhood'},inplace=True)

# Replacing / by , as the exercise required
dataframe = merged.replace(' / ', ', ',regex=True)
print('After cleaning, the size of the dataframe is: ',dataframe.shape)
dataframe.head(10)

After cleaning, the size of the dataframe is:  (103, 3)


,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
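
The merge-and-drop steps used above to combine neighborhoods that share a postal code could probably be shortened with a single <code>groupby</code>. The following is only a minimal sketch on made-up rows (the <code>M0X</code> code and "Example Borough" are hypothetical, not part of the Wikipedia table), assuming each postal code belongs to exactly one borough:

```python
import pandas as pd

# Hypothetical sample rows in the same shape as the scraped table
rows = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A', 'M0X'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York', 'Example Borough'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Parkwoods', 'Not assigned'],
})

# Copy the borough name where the neighborhood is missing
missing = rows['Neighborhood'].isin(['Not assigned', ''])
rows.loc[missing, 'Neighborhood'] = rows.loc[missing, 'Borough']

# One row per postal code; sort=False keeps the order of first appearance
combined = (rows.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
                .agg(', '.join)
                .reset_index())
print(combined)
```

Grouping on both <code>PostalCode</code> and <code>Borough</code> keeps the borough column in the result, so no second merge is needed.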


<h4>Using geocoder to get the coordinates</h4>
Most of the following code is borrowed from the exercise explanation, which suggests using <code>geocoder</code> to retrieve the coordinates of the neighborhoods in the dataframe. Remember that <code>geocoder</code> allows only 2500 calls per day.

In [3]:
!pip install geocoder # Install geocoder



In [4]:
import geocoder # import geocoder

In [5]:
# initialize your variable to None
lat_lng_coords = None
postal_code = 'M3A'
calls = 0
maxcalls = 50
success = True
# loop until you get the coordinates
while(lat_lng_coords is None):
    g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
    lat_lng_coords = g.latlng # Still None if the request failed
    calls = calls + 1
    if lat_lng_coords is None and calls == maxcalls:
        print('Too many calls! (',calls,')')
        success = False
        break
if success:
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
else:
    print('Failed to retrieve data!')

Too many calls! ( 50 )
Failed to retrieve data!
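
The retry pattern above (call the service, count the attempts, give up after a maximum) can be factored into a small helper. This is only a sketch: <code>lookup_with_retries</code> and <code>flaky_lookup</code> are hypothetical names, and the fake lookup stands in for a real geocoding call so that the example runs without network access.

```python
from time import sleep

def lookup_with_retries(lookup, query, max_calls=5, pause=0.0):
    """Call lookup(query) until it returns something other than None.

    Returns (result, calls); result is None if every attempt failed.
    """
    for calls in range(1, max_calls + 1):
        result = lookup(query)
        if result is not None:
            return result, calls
        sleep(pause)  # wait before the next attempt
    return None, max_calls


# Hypothetical stand-in for a geocoding service: fails twice, then answers
attempts = {'n': 0}
def flaky_lookup(query):
    attempts['n'] += 1
    return None if attempts['n'] < 3 else (43.0, -79.0)

coords, calls = lookup_with_retries(flaky_lookup, 'M3A, Toronto, Ontario')
print(coords, calls)
```

Passing the lookup function as an argument means the same helper could wrap either <code>geocoder</code> or another service, as long as the wrapped function returns <code>None</code> on failure.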


This approach does not seem to work reliably, so I will use the tools from a previous lab instead:

In [6]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json # library to handle JSON files
!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

from time import sleep # To not saturate geopy, make pauses between calls

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


Let's give <code>geopy</code> a try!

In [7]:
postal_code = 'M3A'
address = '{}, Toronto, Ontario, Canada'.format(postal_code)
print('Looking for ',address,'...')

geolocator = Nominatim(user_agent="TO_explorer")
calls = 0
success = True
maxcalls = 30
location = None
while location is None:
    location = geolocator.geocode(address)
    calls = calls + 1
    sleep(1)
    if calls==maxcalls:
        print(calls,' calls performed... too many!')
        success = False
        break
if success:
    latitude = location.latitude
    longitude = location.longitude
    print('The geographical coordinates of {} are {}, {}.'.format(address,latitude, longitude))
else:
    print('No coordinates retrieved. Try again!')

Looking for  M3A, Toronto, Ontario, Canada ...
The geographical coordinates of M3A, Toronto, Ontario, Canada are 43.6534817, -79.3839347.


I have to admit that this worked only some of the time. The loop to retrieve the coordinates of all postal codes would be:

In [8]:
failed = []
coordinates = []
maxcalls = 50
for i,pcode in enumerate(dataframe['PostalCode']):
    address = '{}, Toronto, Ontario'.format(pcode)
    print('Looking for ',address,'...')
    geolocator = Nominatim(user_agent="TO_explorer")
    calls = 0
    success = True
    location = None
    while location is None:
        try:
            location = geolocator.geocode(address)
            calls = calls + 1
            sleep(0.1)
            if calls==maxcalls:
                print(calls,' calls performed... too many!')
                success = False
                failed.append(pcode)
                coordinates.append([pcode,None,None])
                break
        except Exception as e:
            success = False
            failed.append(pcode)
            coordinates.append([pcode,None,None])
            print('An error occurred: ',type(e).__name__)
            break
    if success:
        latitude = location.latitude
        longitude = location.longitude
        print('The geographical coordinates of {} are {}, {}.'.format(address,latitude, longitude),end='\n\n')
        coordinates.append([pcode,latitude,longitude])
    else:
        print('No coordinates retrieved. Try again!',end='\n\n')
print()
print('Finished retrieving!!')
print('The retrieved data is stored in "coordinates"')
print('The list of postal codes with missing data is in "failed"')

Looking for  M3A, Toronto, Ontario ...
The geographical coordinates of M3A, Toronto, Ontario are 43.6534817, -79.3839347.

Looking for  M4A, Toronto, Ontario ...
50  calls performed... too many!
No coordinates retrieved. Try again!

Looking for  M5A, Toronto, Ontario ...
50  calls performed... too many!
No coordinates retrieved. Try again!

Looking for  M6A, Toronto, Ontario ...
50  calls performed... too many!
No coordinates retrieved. Try again!

Looking for  M7A, Toronto, Ontario ...
The geographical coordinates of M7A, Toronto, Ontario are 43.6534817, -79.3839347.

Looking for  M9A, Toronto, Ontario ...
50  calls performed... too many!
No coordinates retrieved. Try again!

Looking for  M1B, Toronto, Ontario ...
The geographical coordinates of M1B, Toronto, Ontario are 43.6534817, -79.3839347.

Looking for  M3B, Toronto, Ontario ...
50  calls performed... too many!
No coordinates retrieved. Try again!

Looking for  M4B, Toronto, Ontario ...
50  calls performed... too many!
No coordi

<b>Note</b>: The exercise asks to retrieve the coordinates of every <i>Neighborhood</i>. However, several neighborhoods can fall under the same <i>Borough</i>, and a borough can have several <i>Postal Codes</i> associated with it. The unique identifier for a given location is the <i>Postal Code</i>, so I used that for the search.

In [9]:
print('Failed: ',failed,end='\n\n')
print('Data: ',coordinates)

Failed:  ['M4A', 'M5A', 'M6A', 'M9A', 'M3B', 'M4B', 'M5B', 'M6B', 'M4C', 'M5C', 'M6C', 'M1E', 'M4E', 'M6E', 'M4G', 'M5G', 'M6G', 'M1H', 'M2H', 'M4H', 'M6H', 'M1J', 'M3J', 'M4J', 'M1K', 'M2K', 'M3K', 'M4K', 'M5K', 'M1L', 'M2L', 'M3L', 'M5L', 'M6L', 'M9L', 'M1M', 'M3M', 'M4M', 'M5M', 'M6M', 'M9M', 'M1N', 'M3N', 'M4N', 'M5N', 'M9N', 'M1P', 'M2P', 'M4P', 'M5P', 'M9P', 'M1R', 'M2R', 'M4R', 'M5R', 'M6R', 'M7R', 'M1S', 'M4S', 'M5S', 'M1T', 'M4T', 'M5T', 'M1V', 'M4V', 'M8V', 'M9V', 'M4W', 'M5W', 'M8W', 'M9W', 'M1X', 'M5X', 'M8X', 'M4Y', 'M7Y', 'M8Y', 'M8Z']

Data:  [['M3A', 43.6534817, -79.3839347], ['M4A', None, None], ['M5A', None, None], ['M6A', None, None], ['M7A', 43.6534817, -79.3839347], ['M9A', None, None], ['M1B', 43.6534817, -79.3839347], ['M3B', None, None], ['M4B', None, None], ['M5B', None, None], ['M6B', None, None], ['M9B', 43.64074125, -79.5419018239487], ['M1C', 43.6534817, -79.3839347], ['M3C', 43.7328216, -79.3469614], ['M4C', None, None], ['M5C', None, None], ['M6C', None, 
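
Before trusting the successful lookups, it may be worth checking whether different postal codes were given the same coordinates. The sketch below uses only a few of the entries printed above; the names <code>counts</code>, <code>suspicious</code> and <code>shared</code> are introduced here for illustration.

```python
from collections import Counter

# A few entries copied from the "coordinates" list printed above
coordinates = [['M3A', 43.6534817, -79.3839347], ['M4A', None, None],
               ['M7A', 43.6534817, -79.3839347], ['M1B', 43.6534817, -79.3839347],
               ['M9B', 43.64074125, -79.5419018239487], ['M3C', 43.7328216, -79.3469614]]

# Count how many postal codes share each (latitude, longitude) pair
counts = Counter((lat, lng) for _, lat, lng in coordinates if lat is not None)
suspicious = {pair for pair, n in counts.items() if n > 1}

# Postal codes whose coordinates are not unique
shared = [code for code, lat, lng in coordinates if (lat, lng) in suspicious]
print(shared)
```

Postal codes that end up in <code>shared</code> were all mapped to one point, so those results should probably be treated as failed lookups too.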

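Well, it seems that <code>geopy</code> also has some restrictions (which is fair for a free service). In addition, several of the successful lookups returned the same coordinates (43.6534817, -79.3839347) for different postal codes, so even those results look unreliable. I will use the provided <a href='https://cocl.us/Geospatial_data'>CSV file</a> from the exercise: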

In [10]:
# Download provided CSV file
!wget -q -O 'toronto_data.csv' https://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


In [11]:
# Convert file to dataframe
toronto_df = pd.read_csv('toronto_data.csv')
toronto_df.head()

,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [12]:
# Changing the name of the first column of downloaded data
toronto_df.rename(columns={'Postal Code':'PostalCode'},inplace=True)

# Merging provided data into the original dataframe
# dataframe is the original data retrieved and cleaned from wikipedia
# toronto_df is the downloaded data
merged = pd.merge(dataframe, toronto_df, on='PostalCode')
merged.drop_duplicates(inplace=True) # Dropping duplicated rows
print('Shape of merged dataframe: ',merged.shape)
merged.head(10)

Shape of merged dataframe:  (103, 5)


,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
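
The default inner merge silently drops any postal code that is missing from the coordinates file. Here both shapes have 103 rows, so nothing was lost, but a more explicit check is possible with the <code>indicator</code> and <code>validate</code> options of <code>pd.merge</code>. The frames below are hypothetical miniatures of the two tables (the <code>M0X</code> row is made up to show an unmatched code):

```python
import pandas as pd

# Hypothetical small frames in the same shape as the two tables above
left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M0X'],
                     'Borough': ['North York', 'North York', 'Example Borough']})
right = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# how='left' keeps every row of the cleaned table; indicator=True adds a
# "_merge" column; validate raises an error if a postal code is duplicated
checked = pd.merge(left, right, on='PostalCode', how='left',
                   indicator=True, validate='one_to_one')

# Postal codes with no coordinates in the second table
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty <code>unmatched</code> list would confirm that every postal code received coordinates.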
