# Segmenting and Clustering Neighborhoods in Toronto, Canada

## Part 3: Creating a DataFrame that contains PostalCode, Borough, and Neighborhoods of Toronto. Add Latitude and Longitude of each PostalCode to the DataFrame. Explore and cluster the neighborhoods in Toronto.

### Autor: Fereshteh Bashiri

Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe 

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

In [2]:
# Web scraping
base_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
res = requests.get(base_url)
soup = BeautifulSoup(res.content, 'lxml')
table = soup.find_all('table')[0]
df_wiki = pd.read_html(str(table))[0]

# Ignore cells with a borough that is "Not assigned" 
df_wiki = df_wiki[df_wiki['Borough']!="Not assigned"]
df_wiki.reset_index(inplace=True, drop=True)

# replace cells with a Neighborhood that is "Not assigned" with it's Borough's name
for i, cell in enumerate(df_wiki['Neighbourhood']):
    if cell == "Not assigned":
        df_wiki['Neighbourhood'][i] = df_wiki['Borough'][i]

# df_wiki

# merge Neighborhoods with a same Borough
df_wiki.rename(columns={'Postcode':'PostalCode'}, inplace=True)
df_toronto = df_wiki.groupby(by=['PostalCode','Borough'])['Neighbourhood'].apply(list).reset_index(name='Neighbourhood')
df_toronto.head(10)


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"[Rouge, Malvern]"
1,M1C,Scarborough,"[Highland Creek, Rouge Hill, Port Union]"
2,M1E,Scarborough,"[Guildwood, Morningside, West Hill]"
3,M1G,Scarborough,[Woburn]
4,M1H,Scarborough,[Cedarbrae]
5,M1J,Scarborough,[Scarborough Village]
6,M1K,Scarborough,"[East Birchmount Park, Ionview, Kennedy Park]"
7,M1L,Scarborough,"[Clairlea, Golden Mile, Oakridge]"
8,M1M,Scarborough,"[Cliffcrest, Cliffside, Scarborough Village West]"
9,M1N,Scarborough,"[Birch Cliff, Cliffside West]"


Use the .shape method to print the number of rows of your dataframe

In [3]:
print('The number of rows of the datafram is: {}.'.format(df_toronto.shape[0]))

The number of rows of the datafram is: 103.


Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

In [4]:
## One way is to download a csv file that has the geographical coordinates of each postal code:
# import sys
# !{sys.executable} -m pip install wget

# import wget
# lat_lng_url = 'https://cocl.us/Geospatial_data'
# wget.download(lat_lng_url, 'Geospatial_data.csv')

## The other way is to use the geocoder python package in a loop, to obtain lat-lng coordinates of each neighborhood
# import geocoder

# for postal_code in toronto_merged['Postcode']:
#     print('\nDownloading coordinates of ' + postal_code)
    
#     # initialize your variable to None
#     lat_lng_coords = None

#     # loop until you get the coordinates
#     while(lat_lng_coords is None):
#         g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#         lat_lng_coords = g.latlng
    
#     df_toronto.loc[postal_code,'Latitude'] = lat_lng_coords[0]
#     df_toronto.loc[postal_code,'Longitude'] = lat_lng_coords[1]

## Another way: download the file that contains geo coordinates on a local drive, and read from it
geo_df = pd.read_csv('./Geospatial_Coordinates.csv')
# geo_df.rename(columns={'Postal Code':'PostalCode'}, inplace=True)
geo_df.sort_values(by='Postal Code', inplace=True)
geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [5]:
# merge Lat and Lng info to toronto dataframe
df_toronto[['Latitude','Longitude']] = geo_df[['Latitude','Longitude']]
df_toronto.tail()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
98,M9N,York,[Weston],43.706876,-79.518188
99,M9P,Etobicoke,[Westmount],43.696319,-79.532242
100,M9R,Etobicoke,"[Kingsview Village, Martin Grove Gardens, Rich...",43.688905,-79.554724
101,M9V,Etobicoke,"[Albion Gardens, Beaumond Heights, Humbergate,...",43.739416,-79.588437
102,M9W,Etobicoke,[Northwest],43.706748,-79.594054


Create a new dataframe from df_toronto by choosing Boroughs that contains the word 'Toronto' in them.

In [6]:
toronto_data = df_toronto[df_toronto['Borough'].str.contains('Toronto')]
toronto_data.reset_index(inplace=True, drop=True)
toronto_data.head()
# toronto_data.shape

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,[The Beaches],43.676357,-79.293031
1,M4K,East Toronto,"[The Danforth West, Riverdale]",43.679557,-79.352188
2,M4L,East Toronto,"[The Beaches West, India Bazaar]",43.668999,-79.315572
3,M4M,East Toronto,[Studio District],43.659526,-79.340923
4,M4N,Central Toronto,[Lawrence Park],43.72802,-79.38879


### Use geopy library to get the geographical coordinates of Toronto, ON.

In [7]:
from geopy.geocoders import Nominatim
address = 'Toronto, ON'

geolocator = Nominatim(user_agent='tn_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, ON are {}, {}'.format(latitude, longitude))


The geographical coordinates of Toronto, ON are 43.653963, -79.387207


### Create a map of Toronto, ON with Boroughs which contains the name "Toronto"

In [8]:
import folium

# create map of Toronto, using latitude and longitude
map_toronto = folium.Map(location=[latitude,longitude], zoom_start=12)

# add markers to the map
for lat, lng, label in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['PostalCode']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = 'red',
        fill_opacity = 0.7,
        parse_html = False).add_to(map_toronto)

map_toronto

### Define Foursquare Credentials and Version

In [9]:
CLIENT_ID = 'YYVEI3BSFIIROPFBI3MKO2D1TPYZYZWFADBW3XPZFKZJW1SF' # 'your-client-ID' # your Foursquare ID
CLIENT_SECRET = 'MZ2UYCNQQUB2IXRY143BXFLRKBL21YVAXOWEBH5RRV1KQB1L' #'your-client-secret' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: YYVEI3BSFIIROPFBI3MKO2D1TPYZYZWFADBW3XPZFKZJW1SF
CLIENT_SECRET:MZ2UYCNQQUB2IXRY143BXFLRKBL21YVAXOWEBH5RRV1KQB1L


### Let's explore some neighborhoods in Toronto

First, define a few things. Some variables and a function that sends API request and extracts useful information from the response

In [10]:
# top 100 venues within a radius of 500 m
LIMIT = 100
RADIUS = 500

# A function that will create a list of veunues and their information within each neighborhood
def getNearbyVenues (names, latitudes, longitudes):
    
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
#         print(name)
        
        # create a url
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat,
            lng,
            RADIUS,
            LIMIT)
        
        # send get request
        results = requests.get(url).json()
        results = results['response']['groups'][0]['items']
        
        # extract useful info and store them in a list
        for v in results:
            venues_list.append([
                name,
                lat,
                lng,
                v['venue']['name'],
                v['venue']['location']['lat'],
                v['venue']['location']['lng'],
                v['venue']['categories'][0]['name']])
        
    # convert list into DataFrame
    nearby_venues = pd.DataFrame(venues_list, columns=['Neighborhood',
                                                      'Neighborhood Latitude',
                                                      'Neighborhood longitude',
                                                      'Venue',
                                                      'Venue Latitude',
                                                      'Venue Longitude',
                                                      'Venue Category'])

    return(nearby_venues)



Now, run the above function on each neighborhood of toronto_data

In [11]:
# get toronto venues
toronto_venues = getNearbyVenues(names=toronto_data['PostalCode'], 
                                 latitudes=toronto_data['Latitude'], 
                                 longitudes=toronto_data['Longitude'])
toronto_venues.head(10)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M4E,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,M4E,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,M4E,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,M4E,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,M4K,43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant
5,M4K,43.679557,-79.352188,MenEssentials,43.67782,-79.351265,Cosmetics Shop
6,M4K,43.679557,-79.352188,Cafe Fiorentina,43.677743,-79.350115,Italian Restaurant
7,M4K,43.679557,-79.352188,Dolce Gelato,43.677773,-79.351187,Ice Cream Shop
8,M4K,43.679557,-79.352188,Mezes,43.677962,-79.350196,Greek Restaurant
9,M4K,43.679557,-79.352188,Messini Authentic Gyros,43.677827,-79.350569,Greek Restaurant


### Analyze Each PostalCode

In [12]:
# the size of toronto_venues
print('The shape of toronto_venues is ({},{})'.format(toronto_venues.shape[0], toronto_venues.shape[1]))
print('There are {} unique venue categories within {} postal codes.'.format(len(toronto_venues['Venue Category'].unique()),
                                                                           toronto_data.shape[0]))

# one hot encoding
venue_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add postal code to the table
venue_onehot.insert(0, 'Postal Code', toronto_venues['Neighborhood'])
# venue_onehot.head()

# group by neighborhood and take the mean of the freq of occurence in each category
toronto_venues_freq = venue_onehot.groupby(by='Postal Code').mean()
toronto_venues_freq.head()

The shape of toronto_venues is (1685,7)
There are 229 unique venue categories within 38 postal codes.


Unnamed: 0_level_0,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Yoga Studio
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
M4E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M4K,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.0,...,0.0,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381
M4L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M4M,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.051282,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641
M4N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Cluster Postal Codes

In [13]:
from sklearn.cluster import KMeans

# number of clusters
kclusters = 5

# instantiate
kmeans = KMeans(n_clusters=kclusters, random_state=1).fit(toronto_venues_freq)

# add clustering labels
toronto_clustered = toronto_data[:]
toronto_clustered.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_clustered.head()

Unnamed: 0,Cluster Labels,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,4,M4E,East Toronto,[The Beaches],43.676357,-79.293031
1,0,M4K,East Toronto,"[The Danforth West, Riverdale]",43.679557,-79.352188
2,0,M4L,East Toronto,"[The Beaches West, India Bazaar]",43.668999,-79.315572
3,0,M4M,East Toronto,[Studio District],43.659526,-79.340923
4,3,M4N,Central Toronto,[Lawrence Park],43.72802,-79.38879


### Visualize clusters on a map

In [14]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create a map
map_cluster = folium.Map(location=[latitude,longitude], zoom_start=11)

# set color scheme
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers
for pscode, lat, lng, cls in zip(toronto_clustered['PostalCode'], toronto_clustered['Latitude'], 
                                 toronto_clustered['Longitude'], toronto_clustered['Cluster Labels']):
    label = folium.Popup(pscode+' Cluster '+str(cls) , parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[cls-1],
        fill=True,
        fill_color=rainbow[cls-1],
        fill_opacity=0.7).add_to(map_cluster)

map_cluster