# Assignment PART I

Import pandas:

In [1]:
import pandas as pd

Scrape all tables on the web page with a single line of pandas code:

In [2]:
data = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

How many tables were scraped?

In [3]:
len(data)


3

Let's be sure that the first table is the one we want:

In [4]:
data[0]

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| ... | ... | ... | ... |
| 175 | M5Z | Not assigned | Not assigned |
| 176 | M6Z | Not assigned | Not assigned |
| 177 | M7Z | Not assigned | Not assigned |
| 178 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |
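
Picking the table by position works here, but the page layout could change. As a minimal sketch, the table could instead be chosen by its column names. `pick_table` is a hypothetical helper, and the toy frames below only stand in for the list that `pd.read_html` returns.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"no table contains columns {sorted(required)}")


# Toy stand-ins for the scraped tables (assumed shapes, not real page data).
tables = [
    pd.DataFrame({"Province": ["Ontario"], "Prefix": ["M"]}),
    pd.DataFrame({
        "Postal Code": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }),
]

postal_table = pick_table(tables, ["Postal Code", "Borough", "Neighbourhood"])
print(postal_table.shape)
```

Raising an error when no table matches makes a layout change visible immediately rather than letting a wrong table flow into later cells.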


Yes, that's the right one! How many rows have 'Not assigned' boroughs?

In [5]:
data[0].Borough[data[0]['Borough'] == 'Not assigned'].count()

77

Well, let's remove these rows, and hopefully we'll end up with 103 rows...

In [6]:
nonnull_data = data[0][data[0]['Borough'] != 'Not assigned']
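
The counting and filtering steps both rest on a boolean mask. A small sketch on a toy frame (the rows are made up for illustration) shows the idea, with `.copy()` added as an optional safeguard so that the filtered frame can be modified later without affecting the original.

```python
import pandas as pd

toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A", "M5A"],
    "Borough": ["Not assigned", "North York", "North York", "Not assigned"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village", "Not assigned"],
})

# Boolean mask: True where the borough is missing.
unassigned = toy["Borough"] == "Not assigned"
n_unassigned = int(unassigned.sum())

# Keep the complement of the mask as an independent frame.
kept = toy[~unassigned].copy()

print(n_unassigned, len(kept))
```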

In [7]:
nonnull_data

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 160 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 165 | M4Y | Downtown Toronto | Church and Wellesley |
| 168 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 169 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |


Nice, we are left with the expected number of rows.

Now let's find out if there are any rows with 'Not assigned' neighbourhoods...

In [8]:
nonnull_data.Borough[nonnull_data['Neighbourhood'] == 'Not assigned'].count()

0

No, there aren't. So our cleanup is nearly done!

Finally, let's reset the index and rename the columns so that they match the names in the assignment text exactly. We then look at the first 5 rows to double-check that everything is in order, and use the .shape attribute to confirm the number of rows in our dataframe:

In [9]:
nonnull_data.reset_index(drop=True,inplace=True)
nonnull_data.columns = ['PostalCode','Borough','Neighborhood']
print(nonnull_data.head())
nonnull_data.shape


  PostalCode           Borough                                 Neighborhood
0        M3A        North York                                    Parkwoods
1        M4A        North York                             Victoria Village
2        M5A  Downtown Toronto                    Regent Park, Harbourfront
3        M6A        North York             Lawrence Manor, Lawrence Heights
4        M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government


(103, 3)
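
As an alternative sketch, the same cleanup can be written without `inplace=True`, renaming columns by name rather than by position. The toy frame below is an assumption that mimics the gaps left in the index after filtering.

```python
import pandas as pd

filtered = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A"],
        "Borough": ["North York", "North York"],
        "Neighbourhood": ["Parkwoods", "Victoria Village"],
    },
    index=[2, 3],  # gaps left behind by the earlier filtering
)

cleaned = (
    filtered
    .reset_index(drop=True)
    .rename(columns={"Postal Code": "PostalCode", "Neighbourhood": "Neighborhood"})
)

print(cleaned.columns.tolist())
print(cleaned.shape)
```

Renaming through a mapping keeps working if the column order ever changes, whereas assigning a list to `.columns` silently relabels whatever happens to be in each position.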

# Assignment PART II

I tried the Geocoder package, but unfortunately it did not work, so I will use the CSV file located at https://cocl.us/Geospatial_data instead:

In [10]:
!wget -O Geospatial_Coordinates.csv https://cocl.us/Geospatial_data

--2020-07-27 14:35:53--  https://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 159.8.69.21, 159.8.72.228, 159.8.69.24
Connecting to cocl.us (cocl.us)|159.8.69.21|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-07-27 14:35:56--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 185.235.236.197
Connecting to ibm.box.com (ibm.box.com)|185.235.236.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-07-27 14:35:56--  https://ibm.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Reusing existing connection to ibm.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [f

Now we read the file into a pandas dataframe:

In [11]:
coord = pd.read_csv("Geospatial_Coordinates.csv")

We rename the 'Postal Code' column to 'PostalCode' so that it matches the column name in nonnull_data and can be used as the merge key:

In [12]:
coord.rename(columns={"Postal Code": "PostalCode"}, inplace=True)

In [13]:
coord.head()

| | PostalCode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Now, let's merge the nonnull_data and coord data frames:

In [14]:
final_df = nonnull_data.merge(coord, on='PostalCode')

Let's see if our final dataframe is as expected...

In [15]:
print(final_df.head())
print(final_df.shape)

  PostalCode           Borough                                 Neighborhood  \
0        M3A        North York                                    Parkwoods   
1        M4A        North York                             Victoria Village   
2        M5A  Downtown Toronto                    Regent Park, Harbourfront   
3        M6A        North York             Lawrence Manor, Lawrence Heights   
4        M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government   

    Latitude  Longitude  
0  43.753259 -79.329656  
1  43.725882 -79.315572  
2  43.654260 -79.360636  
3  43.718518 -79.464763  
4  43.662301 -79.389494  
(103, 5)


Looks good!!
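
An inner merge silently drops postal codes that have no coordinates. A hedged sketch of a stricter check, using a left merge with `indicator=True` and `validate="one_to_one"`, is shown below on toy data. The code `M9Z` is a made-up unmatched entry for illustration.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Etobicoke"],
})
coords = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a _merge column telling where each row came from.
merged = neighborhoods.merge(
    coords, on="PostalCode", how="left", validate="one_to_one", indicator=True
)
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

In the notebook itself the row count stays at 103 after the merge, so every postal code found a match. The check only matters when the two sources can drift apart.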

# Assignment PART III

Use the geopy library to get the latitude and longitude of Toronto:

In [16]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values


In [17]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
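
geopy's `geocode` returns `None` when it finds nothing, in which case reading `.latitude` fails. A minimal sketch of a guard follows. `coordinates_or_default` and the stub `fake_geocode` are hypothetical names, and the stub replaces the real `Nominatim(...).geocode` so that the example runs without network access.

```python
from types import SimpleNamespace


def coordinates_or_default(geocode, address, default):
    """Call geocode(address); fall back to default when nothing is found."""
    location = geocode(address)
    if location is None:
        return default
    return (location.latitude, location.longitude)


# Stub geocoder standing in for Nominatim(...).geocode (no network needed).
def fake_geocode(address):
    known = {
        "Toronto, Canada": SimpleNamespace(latitude=43.6534817, longitude=-79.3839347)
    }
    return known.get(address)


found = coordinates_or_default(fake_geocode, "Toronto, Canada", (0.0, 0.0))
missing = coordinates_or_default(fake_geocode, "Nowhere", (0.0, 0.0))
print(found)
print(missing)
```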


Create a map of Toronto with neighborhoods superimposed on top:

In [18]:
import folium

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(final_df['Latitude'], final_df['Longitude'], final_df['Borough'], final_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

In [19]:
# @hidden_cell
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (never publish the real value)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (never publish the real value)
VERSION = '20180605' # Foursquare API version
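
Credentials are safer kept outside the notebook. A minimal sketch, assuming they are stored in environment variables: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical choices, not part of any Foursquare tooling.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read credentials from environment variables instead of source code."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"environment variable {missing} is not set") from None


# Demonstration with a plain dict standing in for os.environ.
demo_env = {
    "FOURSQUARE_CLIENT_ID": "demo-id",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret",
}
client_id, client_secret = load_foursquare_credentials(demo_env)
print(client_id)
```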

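Let's create a function that fetches nearby venues for every neighborhood in Toronto. Note that it reads the global variable LIMIT, so LIMIT must be defined before the function is called: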

In [20]:
import requests
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
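
The function above assumes that every response contains at least one group and that every venue has at least one category. As a hedged sketch, the parsing step could be separated out and made tolerant of missing pieces. `extract_venues` is a hypothetical helper, and the sample payload is hand-written to mirror only the fields the function above reads.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into rows of venue data."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None instead of failing when a venue has no category.
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hand-written payload (assumed structure, not real API output).
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe A", "location": {"lat": 43.75, "lng": -79.33},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Mystery Spot", "location": {"lat": 43.76, "lng": -79.32},
               "categories": []}},
]}]}}

rows = extract_venues(sample, "Parkwoods", 43.753259, -79.329656)
empty = extract_venues({"response": {}}, "Parkwoods", 43.753259, -79.329656)
print(rows[0][3], rows[0][6])
print(rows[1][6], empty)
```

Keeping the parsing separate from the HTTP call also makes it possible to test the logic without spending API quota.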

Now we write the code to run the above function on each neighborhood and create a new dataframe called *toronto_venues*.

In [21]:
# We get the top 100 venues that are in each neighborhood within a radius of 500 meters

LIMIT = 100 # limit of number of venues returned by Foursquare API
toronto_venues = getNearbyVenues(names=final_df['Neighborhood'],
                                   latitudes=final_df['Latitude'],
                                   longitudes=final_df['Longitude']
                                  )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

Let's encode the venue categories:

In [22]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 
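
To see why one-hot encoding is useful here, a toy sketch (with made-up venues) shows that the mean of the 0/1 indicator columns per neighborhood is simply the share of each venue category in that neighborhood.

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Park", "Café", "Gym", "Café", "Café"],
})

# One indicator column per category; dtype=float keeps the means numeric.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of the indicators = share of each category per neighborhood.
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```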

Let's write a function that returns the most common venue categories of a neighborhood, sorted by frequency in descending order:

In [23]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
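
A quick usage sketch on a hand-made row shows what the function does. The frequencies are invented, and `top_categories` is a compact, hypothetical restatement of the same logic: skip the first entry (the neighborhood name), sort the rest, and keep the leading labels.

```python
import pandas as pd


def top_categories(row, n):
    """Names of the n largest values in row, skipping the first entry."""
    values = row.iloc[1:].astype(float)
    return values.sort_values(ascending=False).index[:n].tolist()


row = pd.Series({
    "Neighborhood": "Parkwoods",
    "Café": 0.10,
    "Gym": 0.20,
    "Park": 0.60,
    "Pub": 0.05,
})

top_two = top_categories(row, 2)
print(top_two)
```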

Now let's create the new dataframe and display the top 10 venues for each neighborhood:

In [24]:
import numpy as np

toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)
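
The try/except above produces the right suffixes for 1 to 10, which is all this notebook needs. For larger ranks (21, 22, ...) a small helper would be more robust. The sketch below is one possible version, and `ordinal` is a hypothetical name.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


columns = ["Neighborhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, 11)
]
print(columns[1], "|", columns[4], "|", columns[10])
```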

Now run *k*-means to cluster the neighborhoods into 5 clusters:

In [25]:
# set number of clusters
kclusters = 5

from sklearn.cluster import KMeans

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')
print(toronto_grouped_clustering.shape)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)
len(kmeans.labels_)

(95, 266)


95
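
To make the clustering step concrete, here is a toy sketch with two obviously separated groups of frequency vectors (the numbers are invented). `n_init` is set explicitly because its default has changed between scikit-learn releases.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy groups of "category frequency" vectors.
X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_

# Rows 0-1 share one label and rows 2-3 share the other.
print(labels)
```

The label numbers themselves are arbitrary. Only the grouping is meaningful, which is why the clusters have to be inspected afterwards to give them names.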

In [26]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_merged = final_df
# join the cluster labels and top venues onto final_df, which holds latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
# drop neighborhoods for which no venues were returned (they have no cluster label)
toronto_merged.dropna(inplace=True)
toronto_merged = toronto_merged.astype({'Cluster Labels':'int32'})

print(toronto_merged.head())

  PostalCode           Borough                                 Neighborhood  \
0        M3A        North York                                    Parkwoods   
1        M4A        North York                             Victoria Village   
2        M5A  Downtown Toronto                    Regent Park, Harbourfront   
3        M6A        North York             Lawrence Manor, Lawrence Heights   
4        M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government   

    Latitude  Longitude  Cluster Labels 1st Most Common Venue  \
0  43.753259 -79.329656               2                  Park   
1  43.725882 -79.315572               0     French Restaurant   
2  43.654260 -79.360636               1           Coffee Shop   
3  43.718518 -79.464763               1        Clothing Store   
4  43.662301 -79.389494               1           Coffee Shop   

  2nd Most Common Venue 3rd Most Common Venue   4th Most Common Venue  \
0     Food & Drink Shop           Yoga Studio                

Finally, let's visualize the resulting clusters:

In [27]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
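
The map only needs one distinct hex color per cluster. As a dependency-free alternative to the matplotlib colormap used above, a sketch with the standard-library `colorsys` module could look like this. `cluster_palette` is a hypothetical helper, and the saturation and brightness values are arbitrary choices.

```python
import colorsys


def cluster_palette(n_clusters):
    """Return n_clusters evenly spaced hues as '#rrggbb' strings."""
    palette = []
    for i in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 0.8, 0.9)
        palette.append(
            "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
        )
    return palette


palette = cluster_palette(5)
print(palette)
```

Indexing the palette directly with the cluster label (`palette[cluster]`) then gives each cluster its own color.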

Last step: examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, assign a name to each cluster.

In [28]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

| | Neighborhood | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Victoria Village | 0 | French Restaurant | Pizza Place | Coffee Shop | Portuguese Restaurant | Hockey Arena | Distribution Center | Deli / Bodega | Department Store | Dessert Shop | Dim Sum Restaurant |
| 6 | Malvern, Rouge | 0 | Print Shop | Fast Food Restaurant | Yoga Studio | Dessert Shop | Dim Sum Restaurant | Diner | Discount Store | Distribution Center | Dog Run | Doner Restaurant |
| 8 | Parkview Hill, Woodbine Gardens | 0 | Pizza Place | Pharmacy | Gastropub | Bank | Intersection | Athletics & Sports | Café | Breakfast Spot | Gym / Fitness Center | Pet Store |
| 10 | Glencairn | 0 | Pizza Place | Bakery | Japanese Restaurant | Pub | Yoga Studio | Distribution Center | Dim Sum Restaurant | Diner | Discount Store | Dog Run |
| 17 | Eringate, Bloordale Gardens, Old Burnhamthorpe... | 0 | Pharmacy | Beer Store | Pet Store | Pizza Place | Coffee Shop | Convenience Store | Café | Shopping Plaza | Liquor Store | Gas Station |
| 29 | Thorncliffe Park | 0 | Sandwich Place | Indian Restaurant | Gym / Fitness Center | Restaurant | Pizza Place | Pharmacy | Park | Liquor Store | Yoga Studio | Supermarket |
| 51 | Cliffside, Cliffcrest, Scarborough Village West | 0 | American Restaurant | Motel | Skating Rink | Dog Run | Dessert Shop | Dim Sum Restaurant | Diner | Discount Store | Distribution Center | Yoga Studio |
| 55 | Bedford Park, Lawrence Manor East | 0 | Restaurant | Sandwich Place | Coffee Shop | Thai Restaurant | Italian Restaurant | Indian Restaurant | Fast Food Restaurant | Japanese Restaurant | Liquor Store | Juice Bar |
| 56 | Del Ray, Mount Dennis, Keelsdale and Silverthorn | 0 | Sandwich Place | Coffee Shop | Discount Store | Bar | Yoga Studio | Dog Run | Dim Sum Restaurant | Diner | Distribution Center | Doner Restaurant |
| 70 | Westmount | 0 | Pizza Place | Chinese Restaurant | Intersection | Coffee Shop | Sandwich Place | Middle Eastern Restaurant | Discount Store | Yoga Studio | Dim Sum Restaurant | Diner |


In [29]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

| | Neighborhood | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Regent Park, Harbourfront | 1 | Coffee Shop | Bakery | Pub | Park | Theater | Breakfast Spot | Café | Yoga Studio | Shoe Store | Restaurant |
| 3 | Lawrence Manor, Lawrence Heights | 1 | Clothing Store | Accessories Store | Arts & Crafts Store | Furniture / Home Store | Event Space | Miscellaneous Shop | Coffee Shop | Boutique | Women's Store | Vietnamese Restaurant |
| 4 | Queen's Park, Ontario Provincial Government | 1 | Coffee Shop | Diner | Yoga Studio | Sandwich Place | Restaurant | Park | Mexican Restaurant | Hobby Shop | Fried Chicken Joint | Distribution Center |
| 7 | Don Mills | 1 | Gym | Japanese Restaurant | Café | Sporting Goods Shop | Coffee Shop | Beer Store | Restaurant | Supermarket | Dim Sum Restaurant | Italian Restaurant |
| 9 | Garden District, Ryerson | 1 | Coffee Shop | Clothing Store | Italian Restaurant | Bubble Tea Shop | Café | Japanese Restaurant | Cosmetics Shop | Tea Room | Bookstore | Hotel |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 96 | St. James Town, Cabbagetown | 1 | Pizza Place | Restaurant | Coffee Shop | Café | Bakery | Park | Market | Convenience Store | Italian Restaurant | Pet Store |
| 97 | First Canadian Place, Underground city | 1 | Coffee Shop | Café | Hotel | Gym | Restaurant | Japanese Restaurant | Steakhouse | Seafood Restaurant | Deli / Bodega | American Restaurant |
| 99 | Church and Wellesley | 1 | Coffee Shop | Sushi Restaurant | Japanese Restaurant | Restaurant | Gay Bar | Café | Pub | Men's Store | Mediterranean Restaurant | Hotel |
| 100 | Business reply mail Processing Centre, South C... | 1 | Light Rail Station | Yoga Studio | Butcher | Skate Park | Auto Workshop | Burrito Place | Fast Food Restaurant | Farmers Market | Garden Center | Garden |


In [30]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

| | Neighborhood | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Parkwoods | 2 | Park | Food & Drink Shop | Yoga Studio | Dog Run | Dessert Shop | Dim Sum Restaurant | Diner | Discount Store | Distribution Center | Doner Restaurant |
| 21 | Caledonia-Fairbanks | 2 | Park | Women's Store | Pool | Dog Run | Dessert Shop | Dim Sum Restaurant | Diner | Discount Store | Distribution Center | Donut Shop |
| 35 | East Toronto, Broadview North (Old East York) | 2 | Park | Convenience Store | Donut Shop | Dim Sum Restaurant | Diner | Discount Store | Distribution Center | Dog Run | Doner Restaurant | Yoga Studio |
| 52 | Willowdale, Newtonbrook | 2 | Park | Yoga Studio | Doner Restaurant | Dessert Shop | Dim Sum Restaurant | Diner | Discount Store | Distribution Center | Dog Run | Donut Shop |
| 61 | Lawrence Park | 2 | Park | Bus Line | Swim School | Yoga Studio | Dog Run | Dim Sum Restaurant | Diner | Discount Store | Distribution Center | Doner Restaurant |
| 64 | Weston | 2 | Convenience Store | Park | Donut Shop | Dim Sum Restaurant | Diner | Discount Store | Distribution Center | Dog Run | Doner Restaurant | Yoga Studio |
| 66 | York Mills West | 2 | Park | Convenience Store | Donut Shop | Dim Sum Restaurant | Diner | Discount Store | Distribution Center | Dog Run | Doner Restaurant | Yoga Studio |
| 77 | Kingsview Village, St. Phillips, Martin Grove ... | 2 | Park | Sandwich Place | Yoga Studio | Dog Run | Dessert Shop | Dim Sum Restaurant | Diner | Discount Store | Distribution Center | Doner Restaurant |
| 85 | Milliken, Agincourt North, Steeles East, L'Amo... | 2 | Park | Playground | Yoga Studio | Dog Run | Dessert Shop | Dim Sum Restaurant | Diner | Discount Store | Distribution Center | Donut Shop |
| 91 | Rosedale | 2 | Park | Trail | Playground | Dog Run | Department Store | Dessert Shop | Dim Sum Restaurant | Diner | Discount Store | Distribution Center |


In [31]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

| | Neighborhood | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | Humber Summit | 3 | Gym | Pizza Place | Afghan Restaurant | Falafel Restaurant | Ethiopian Restaurant | Electronics Store | Eastern European Restaurant | Dumpling Restaurant | Drugstore | Donut Shop |
| 83 | Moore Park, Summerhill East | 3 | Gym | Tennis Court | Dog Run | Dessert Shop | Dim Sum Restaurant | Diner | Discount Store | Distribution Center | Doner Restaurant | Deli / Bodega |


In [32]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

| | Neighborhood | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 62 | Roselawn | 4 | Music Venue | Garden | Yoga Studio | Distribution Center | Dessert Shop | Dim Sum Restaurant | Diner | Discount Store | Dog Run | Deli / Bodega |


Judging by the most common venue categories, I would say the clusters (numbered by cluster label) are rich in:
0. Restaurants
1. Cafés
2. Parks
3. Gyms
4. Music
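
The naming above was done by eye. A rough way to support it with data, sketched here on a made-up frame with the same two column names, is to take the most frequent "1st Most Common Venue" within each cluster.

```python
import pandas as pd

merged = pd.DataFrame({
    "Cluster Labels": [0, 0, 0, 1, 1, 2],
    "1st Most Common Venue": [
        "Pizza Place", "Pizza Place", "Pharmacy",
        "Coffee Shop", "Coffee Shop", "Park",
    ],
})

# For each cluster, pick the value that occurs most often.
dominant = (
    merged.groupby("Cluster Labels")["1st Most Common Venue"]
    .agg(lambda s: s.mode().iat[0])
)
print(dominant.to_dict())
```

This is only a heuristic: clusters with one or two neighborhoods, such as clusters 3 and 4 here, have too little data for the dominant category to mean much.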