## Segmenting and Clustering Neighborhoods in Toronto (Part 2)

This notebook contains the answers for Parts 1 and 2 of the assignment.

## Part 1

In [1]:
!pip install bs4
print('\nBeautifulSoup package installation success.')


BeautifulSoup package installation success.


Import libraries

In [2]:
from bs4 import BeautifulSoup 
import requests
import pandas as pd

The following URL points to a Wikipedia page whose HTML tables list the Canadian postal codes that begin with the letter M. Postal codes beginning with M are located within the city of Toronto.

In [3]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

Read the HTML tables inside the Wikipedia page.

In [4]:
table_data = pd.read_html(url, flavor='bs4')
print(f'Total tables: {len(table_data)}')

Total tables: 3


In [5]:
table_data[0].head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M1ANot assigned | M2ANot assigned | M3ANorth York(Parkwoods) | M4ANorth York(Victoria Village) | M5ADowntown Toronto(Regent Park / Harbourfront) | M6ANorth York(Lawrence Manor / Lawrence Heights) | M7AQueen's Park(Ontario Provincial Government) | M8ANot assigned | M9AEtobicoke(Islington Avenue) |
| 1 | M1BScarborough(Malvern / Rouge) | M2BNot assigned | M3BNorth York(Don Mills)North | M4BEast York(Parkview Hill / Woodbine Gardens) | M5BDowntown Toronto(Garden District, Ryerson) | M6BNorth York(Glencairn) | M7BNot assigned | M8BNot assigned | M9BEtobicoke(West Deane Park / Princess Garden... |
| 2 | M1CScarborough(Rouge Hill / Port Union / Highl... | M2CNot assigned | M3CNorth York(Don Mills)South(Flemingdon Park) | M4CEast York(Woodbine Heights) | M5CDowntown Toronto(St. James Town) | M6CYork(Humewood-Cedarvale) | M7CNot assigned | M8CNot assigned | M9CEtobicoke(Eringate / Bloordale Gardens / Ol... |
| 3 | M1EScarborough(Guildwood / Morningside / West ... | M2ENot assigned | M3ENot assigned | M4EEast Toronto(The Beaches) | M5EDowntown Toronto(Berczy Park) | M6EYork(Caledonia-Fairbanks) | M7ENot assigned | M8ENot assigned | M9ENot assigned |
| 4 | M1GScarborough(Woburn) | M2GNot assigned | M3GNot assigned | M4GEast York(Leaside) | M5GDowntown Toronto(Central Bay Street) | M6GDowntown Toronto(Christie) | M7GNot assigned | M8GNot assigned | M9GNot assigned |


We want to capture the postal code, borough, and neighborhood data from the first table.

Use the BeautifulSoup package to transform the Wikipedia data into a dataframe.

In [6]:
url_data  = requests.get(url).text
soup = BeautifulSoup(url_data,"html5lib")
tables = soup.find_all('table')
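rows = []  # one dict per assigned postal code

# Each <td> holds the postal code in its <p> tag and "Borough(Neighborhoods)" in its <span>.
for cell in tables[0].tbody.find_all('td'):
    if cell.span.text == 'Not assigned':
        continue
    postalcode = cell.p.text[:3]
    borough = cell.span.text.split('(')[0]
    # Drop the parentheses and turn " /" separators into commas.
    neighborhood = cell.span.text.split('(')[1].strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
    rows.append({"PostalCode": postalcode, "Borough": borough, "Neighborhood": neighborhood})

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows.
table_data = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighborhood'])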

table_data

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East TorontoBusiness reply mail Processing Cen... | Enclave of M4L |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |
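
The string handling in the parsing loop can be tried in isolation on a few cell strings copied from the `table_data[0].head()` output. This is only a minimal sketch, assuming every cell has the shape `<3-character code><Borough>(<Neighborhoods>)`; the helper name `parse_cell` is hypothetical and is not part of the notebook.

```python
def parse_cell(text):
    """Split a cell such as 'M5ADowntown Toronto(Regent Park / Harbourfront)'.

    Returns (postal_code, borough, neighborhood), or None for 'Not assigned' cells.
    """
    postal_code, rest = text[:3], text[3:]
    if rest == 'Not assigned':
        return None
    # Everything before the first '(' is the borough; the rest lists neighborhoods.
    borough, _, inside = rest.partition('(')
    neighborhood = inside.rstrip(')').replace(' / ', ', ')
    return postal_code, borough, neighborhood


# Cell strings taken from the first rows of the Wikipedia table shown earlier.
samples = [
    'M1ANot assigned',
    'M3ANorth York(Parkwoods)',
    'M5ADowntown Toronto(Regent Park / Harbourfront)',
]
parsed = [parse_cell(s) for s in samples]
print(parsed)
```

Cells with extra text outside the parentheses (for example `M3BNorth York(Don Mills)North`) do not fit this simple shape, which is why the notebook cleans up some borough values by hand in the next cell.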


In [7]:
table_df = pd.DataFrame(table_data)
table_df['Borough'] = table_df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
table_df

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto Business | Enclave of M4L |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |
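
The borough cleanup relies on `Series.replace` with a dictionary, which swaps values that match a key exactly and leaves every other value alone. A small sketch of that behavior on a three-value series (the series itself is made up for illustration; the mapping entries are copied from the cell above):

```python
import pandas as pd

# Two fused borough names from the scraped table plus one clean value.
boroughs = pd.Series(['North York', 'EtobicokeNorthwest', 'East YorkEast Toronto'])

fixes = {
    'EtobicokeNorthwest': 'Etobicoke Northwest',
    'East YorkEast Toronto': 'East York/East Toronto',
}

# Only whole-value matches are replaced; 'North York' passes through unchanged.
cleaned = boroughs.replace(fixes)
print(cleaned.tolist())
```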


Make sure that no "Not assigned" neighborhood remains in the dataframe.

In [8]:
table_df.Neighborhood.str.count('Not assigned').sum()

0

In [9]:
len(table_df['PostalCode'].unique())

103

There are 103 unique postal codes in the dataframe.
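
The two checks above can also be written as boolean tests, which is convenient if they should fail loudly. A minimal sketch on a three-row sample taken from the table output; the variable names are illustrative only.

```python
import pandas as pd

sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront'],
})

# True if any neighborhood is still the placeholder text.
has_unassigned = sample['Neighborhood'].eq('Not assigned').any()

# True if no postal code appears more than once.
codes_unique = sample['PostalCode'].is_unique

# Number of distinct postal codes.
n_codes = sample['PostalCode'].nunique()

print(has_unassigned, codes_unique, n_codes)
```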

In [10]:
table_df.shape

(103, 3)

### Answer: There are 103 rows inside the dataframe.

---

## Part 2

In [11]:
!pip install geocoder
print('\nGeocoder package installation success.')


Geocoder package installation success.


Get the latitude and longitude coordinates of each neighborhood in Toronto.

In [12]:
import geocoder

geo_coords = None
postal_code = 'M5G'
count = 0

while (geo_coords is None and count < 2500):
    geo = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
    count = count + 1
    geo_coords = geo.latlng
    
if (geo_coords is not None):
    latitude = geo_coords[0]
    longitude = geo_coords[1]
    print(latitude, longitude)
else:
    print('Tried ' + str(count) + ' times to get the geographical coordinates through geocoder without success.')

Tried 2500 times to get the geographical coordinates through geocoder without success.


The Geocoder package could not return the geographical coordinates of the neighborhoods, so we read the coordinates for each postal code from a CSV file instead.
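
The retry loop in the geocoder cell can be expressed as a small reusable function with a bounded number of attempts. This is a sketch under stated assumptions, not the notebook's method: `lookup_with_retry`, `always_fails`, and `fixed_answer` are hypothetical names, and the stand-in lookups replace the real `geocoder` call so the example runs without network access.

```python
import time


def lookup_with_retry(lookup, query, max_attempts=5, delay=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out.

    Returns (result, attempts_used); result is None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        result = lookup(query)
        if result is not None:
            return result, attempt
        time.sleep(delay)  # pause between attempts; 0.0 keeps the demo fast
    return None, max_attempts


# Stand-ins for a geocoding call (assumptions, not real geocoder behavior).
def always_fails(query):
    return None


def fixed_answer(query):
    return [0.0, 0.0]  # placeholder values, not a real coordinate


failed = lookup_with_retry(always_fails, 'M5G, Toronto, Ontario', max_attempts=3)
succeeded = lookup_with_retry(fixed_answer, 'M5G, Toronto, Ontario')
print(failed, succeeded)
```

A small attempt limit with a pause between calls is usually enough to tell a transient failure from a permanent one.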

In [13]:
coordinate_data = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv')
coordinate_data.columns = ['PostalCode', 'Latitude', 'Longitude']
coordinate_data.sort_values(by=['PostalCode'], ascending=True, inplace=True)
coordinate_data

| | PostalCode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
| ... | ... | ... | ... |
| 98 | M9N | 43.706876 | -79.518188 |
| 99 | M9P | 43.696319 | -79.532242 |
| 100 | M9R | 43.688905 | -79.554724 |
| 101 | M9V | 43.739416 | -79.588437 |
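
The load, rename, and sort steps can be reproduced offline on an in-memory CSV. A minimal sketch: the two data rows are copied from the output above, while the original header text (`Postal Code`) is an assumption about the file and may differ.

```python
import io

import pandas as pd

# In-memory stand-in for the coordinates file; the header row is assumed.
csv_text = """Postal Code,Latitude,Longitude
M1C,43.784535,-79.160497
M1B,43.806686,-79.194353
"""

coords = pd.read_csv(io.StringIO(csv_text))

# Rename so the key column matches the scraped table's 'PostalCode' column.
coords.columns = ['PostalCode', 'Latitude', 'Longitude']

# Sort by postal code and renumber the index from zero.
coords = coords.sort_values(by='PostalCode', ignore_index=True)
print(coords)
```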


Sort table_df by postal code in ascending order.

In [14]:
table_df.sort_values(by=['PostalCode'], ascending=True, inplace=True, ignore_index=True)
table_df

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| ... | ... | ... | ... |
| 98 | M9N | York | Weston |
| 99 | M9P | Etobicoke | Westmount |
| 100 | M9R | Etobicoke | Kingsview Village, St. Phillips, Martin Grove ... |
| 101 | M9V | Etobicoke | South Steeles, Silverstone, Humbergate, Jamest... |


Merge the table_df and coordinate_data dataframes on the postal code.

In [15]:
toronto_df = pd.merge(table_df, coordinate_data, how="left", on=["PostalCode"])
toronto_df

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| ... | ... | ... | ... | ... | ... |
| 98 | M9N | York | Weston | 43.706876 | -79.518188 |
| 99 | M9P | Etobicoke | Westmount | 43.696319 | -79.532242 |
| 100 | M9R | Etobicoke | Kingsview Village, St. Phillips, Martin Grove ... | 43.688905 | -79.554724 |
| 101 | M9V | Etobicoke | South Steeles, Silverstone, Humbergate, Jamest... | 43.739416 | -79.588437 |


In [16]:
toronto_df.shape

(103, 5)

The left merge on the 'PostalCode' key keeps every row of table_df, so the result has 103 records.
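
A left merge keeps rows even when no coordinates match, so the row count alone does not show whether every postal code found a partner. One way to check, sketched here on toy frames, is the `indicator` and `validate` options of `pd.merge`. The postal code `M9Z` and its borough label are invented for the example and do not come from the data.

```python
import pandas as pd

left = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M9Z'],  # 'M9Z' is a made-up code with no coordinates
    'Borough': ['Scarborough', 'Scarborough', 'Hypothetical'],
})
right = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a '_merge' column saying where each row came from.
merged = pd.merge(left, right, how='left', on='PostalCode',
                  validate='one_to_one', indicator=True)

# Rows present only in the left frame have no coordinates.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(merged.shape, unmatched)
```

An empty `unmatched` list on the real data would confirm that all 103 postal codes received a latitude and longitude.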

---