# IBM Data Science Professional Certification 
## Course 9: Applied Data Science Capstone
### Peer-graded assignment (Week 3):  Segmenting and Clustering Neighborhoods in Toronto - Part 3

The Python program in this notebook clusters the neighborhoods of Toronto with the k-means method. The notebook was produced with the Jupyter Notebook IDE provided by the Anaconda Python distribution.

**Importing the necessary libraries.**

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

from sklearn import preprocessing

import matplotlib.pyplot as plt
%matplotlib inline

**Reading the data frame of the Toronto neighborhoods with their respective latitudes and longitudes.**

In [2]:
df_neigh=pd.read_csv("Toronto_neighborhoods_with_lat_long.csv")
df_neigh.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


**Use geopy library to get the latitude and longitude values of Toronto.** 

In [3]:
address = "Toronto, Ontario"

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of the city of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of the city of Toronto are 43.653963, -79.387207.
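
`geolocator.geocode` returns `None` when the service finds no match, and `location.latitude` then fails with an `AttributeError`. A minimal sketch of a guard follows. The helper name `resolve_coordinates` is hypothetical and not part of the original notebook; it accepts any callable that behaves like `geolocator.geocode`.

```python
from typing import Callable, Optional, Tuple


def resolve_coordinates(geocode: Callable[[str], Optional[object]],
                        address: str) -> Tuple[float, float]:
    """Return (latitude, longitude) for an address or raise a clear error.

    `geocode` is any callable with the same contract as
    `Nominatim(...).geocode`: it returns an object exposing `.latitude`
    and `.longitude`, or None when nothing matches.
    """
    location = geocode(address)
    if location is None:
        raise ValueError("No geocoding result for {!r}".format(address))
    return float(location.latitude), float(location.longitude)
```

In the cell above it could be called as `latitude, longitude = resolve_coordinates(geolocator.geocode, address)`.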


**Create a map of Toronto with neighborhoods superimposed on top.**

In [4]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_neigh["Latitude"], df_neigh["Longitude"], df_neigh["Borough"], df_neigh["Neighborhood"]):
    label = '{}, {}'.format(neighborhood,borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto


**Define the Foursquare credentials and version.**

In [5]:
CLIENT_ID = 'OBCEPCRHGRGJYDYQJDIO5JS2N4K3VG03I0L3G4QVYFPVXKKD' # your Foursquare ID
CLIENT_SECRET = 'A0HOIXJYKRCJKUQXLPXM3WX0SDIP5U0UXLZ24W1LV3MGGFKY' # your Foursquare Secret
VERSION = '20190430' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: OBCEPCRHGRGJYDYQJDIO5JS2N4K3VG03I0L3G4QVYFPVXKKD
CLIENT_SECRET:A0HOIXJYKRCJKUQXLPXM3WX0SDIP5U0UXLZ24W1LV3MGGFKY
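
Printing the credentials stores them in the notebook output, where anyone with access to the file can read them. A minimal sketch of an alternative, assuming the credentials are exported as environment variables named `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` (hypothetical names, not used in the original notebook):

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare credentials from environment variables.

    The variable names are assumptions made for this sketch.
    """
    try:
        client_id = env["FOURSQUARE_CLIENT_ID"]
        client_secret = env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: {}".format(missing)) from None
    return client_id, client_secret


def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)
```

With these helpers, a confirmation line such as `print('CLIENT_ID: ' + mask(client_id))` shows that a value was loaded without revealing it in full.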


**Defining helper functions that extract the category of a venue and retrieve the venues near each neighborhood from the Foursquare API.**

In [6]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

def getNearbyVenues(names, latitudes, longitudes, radius=500,LIMIT=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
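
`getNearbyVenues` indexes `v['venue']['categories'][0]` directly, which raises an `IndexError` for a venue without categories, and it assumes that every response contains a `groups` entry. A sketch of a more defensive parser follows, assuming the response layout used above (`response`, then `groups`, then `items`, then `venue`). The helper name `parse_explore_items` is hypothetical.

```python
def parse_explore_items(payload, name, lat, lng):
    """Flatten one explore response into rows of venue attributes.

    Returns an empty list when the payload has no groups, and uses None
    as the category of a venue whose `categories` list is empty.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []

    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng,
                     venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows
```

Inside the loop, the list comprehension could then be replaced by `venues_list.append(parse_explore_items(requests.get(url).json(), name, lat, lng))`.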

In [7]:
toronto_venues=getNearbyVenues(names=df_neigh["Neighborhood"],latitudes=df_neigh["Latitude"],longitudes=df_neigh["Longitude"])

**Checking the resulting frame.**

In [8]:
print(toronto_venues.shape)
#toronto_venues.head()

(2244, 7)


**Number of categories of venues.**

In [9]:
print('There are {} venues belonging to {} unique categories.'.format(toronto_venues.shape[0],len(toronto_venues['Venue Category'].unique())))

There are 2244 venues belonging to 274 unique categories.


**Creating a data frame of dummy variables representing each category.**

In [10]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

#toronto_onehot.head()

**Size of the data frame.**

In [11]:
toronto_onehot.shape

(2244, 274)
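
The venues belong to 274 unique categories, yet the one-hot frame still has 274 columns after the `Neighborhood` column is added. One possible explanation (an assumption, not verified against the data) is that a venue category is itself named `Neighborhood`, so the assignment overwrites that dummy column instead of appending a new one. A minimal sketch on invented data that sidesteps such a collision by grouping on the neighborhood Series rather than adding it as a column:

```python
import pandas as pd

# Toy data (invented for illustration); one category is literally "Neighborhood".
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B"],
    "Venue Category": ["Park", "Neighborhood", "Park", "Café"],
})

# One-hot encode the categories; dtype=float makes the later mean explicit.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)

# Group by the neighborhood Series instead of adding it as a column,
# so no dummy column can be overwritten.
grouped = onehot.groupby(venues["Neighborhood"]).mean()

print(grouped.shape)  # (2, 3): two neighborhoods, three categories
```

The neighborhood names stay in the index here. Calling `reset_index()` on this frame would fail in the toy case because a column with the same name already exists, so the index would need to be renamed first.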

**Grouping rows by neighborhood and taking the mean of the frequency of occurrence of each category.**

In [12]:
toronto_grouped=toronto_onehot.groupby("Neighborhood").mean().reset_index()
#toronto_grouped

**Size of the grouped data.**

In [13]:
toronto_grouped.shape

(100, 274)

**Displaying the top 10 venues for each neighborhood.**

Function to sort the venues in descending order.

In [14]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
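
Because `sort_values` ranks every category column, a neighborhood with fewer than `num_top_venues` distinct categories has its remaining slots filled with categories whose frequency is zero. A sketch of a variant that drops zero-frequency categories and pads the result with `None` follows. The function name `top_venues` is hypothetical and the sample row is invented.

```python
import pandas as pd


def top_venues(row, num_top_venues):
    """Return the most frequent categories of one row, padded with None.

    `row` holds the neighborhood name first, then one frequency per category.
    Categories with zero frequency are dropped before ranking.
    """
    frequencies = row.iloc[1:].astype(float)
    nonzero = frequencies[frequencies > 0].sort_values(ascending=False)
    top = list(nonzero.index[:num_top_venues])
    return top + [None] * (num_top_venues - len(top))


row = pd.Series({"Neighborhood": "A", "Bar": 0.0, "Café": 0.25, "Park": 0.75})
print(top_venues(row, 3))  # ['Park', 'Café', None]
```

Empty slots then make it visible when a ranking rests on only one or two venues.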

Create the new dataframe and display the top 10 venues for each neighborhood.

In [15]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

#neighborhoods_venues_sorted.head()
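
The `indicators` list with its `try`/`except` fallback labels the first ten positions correctly, but it would produce labels such as `21th` if `num_top_venues` were raised above 20. A small sketch of a general ordinal helper (the name `ordinal` is hypothetical):

```python
def ordinal(n):
    """Return the English ordinal label for a positive integer (1 -> '1st')."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the other teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


# Same column labels as in the cell above, built without try/except.
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
```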

**Performing the K-means clustering.**

Running k-means with four clusters on the category frequencies.

In [16]:
# set number of clusters
kclusters = 4

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 2, 0, 0, 0, 0, 0, 0, 0])
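
The notebook fixes `kclusters = 4` without showing how that value was chosen. One common heuristic, not used in the original, is to compare the silhouette score across candidate values of k. A minimal sketch on synthetic data from `make_blobs` follows; with the real data, `X` would be `toronto_grouped_clustering`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the feature matrix: three tight, well-separated blobs.
X, _ = make_blobs(n_samples=150,
                  centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5,
                  random_state=0)

# Fit k-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest score is the suggested number of clusters.
best_k = max(scores, key=scores.get)
print(best_k)
```

A higher silhouette score indicates tighter, better-separated clusters, so the scores give a rough basis for preferring one k over another.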

Building a frame with the top 10 venues for each neighborhood and its respective cluster label.

In [17]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged=pd.merge(df_neigh[["Borough","Neighborhood","Latitude","Longitude"]],neighborhoods_venues_sorted,left_on="Neighborhood",right_on="Neighborhood")

#toronto_merged.head()

Visualizing the clusters.

In [18]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

**Examining clusters.**

Cluster 1.

In [19]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Rouge, Malvern",Fast Food Restaurant,Women's Store,Doner Restaurant,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Donut Shop
1,"Highland Creek, Rouge Hill, Port Union",Bar,Women's Store,Donut Shop,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Drugstore
2,"Guildwood, Morningside, West Hill",Electronics Store,Spa,Pizza Place,Rental Car Location,Intersection,Mexican Restaurant,Breakfast Spot,Medical Center,Diner,Dessert Shop
3,Woburn,Coffee Shop,Insurance Office,Korean Restaurant,Doner Restaurant,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run
4,Cedarbrae,Hakka Restaurant,Fried Chicken Joint,Thai Restaurant,Caribbean Restaurant,Bakery,Bank,Athletics & Sports,Dumpling Restaurant,Drugstore,Donut Shop
5,Scarborough Village,Playground,Convenience Store,Women's Store,Dog Run,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Doner Restaurant
6,"East Birchmount Park, Ionview, Kennedy Park",Bus Station,Department Store,Coffee Shop,Discount Store,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Dog Run,Women's Store
7,"Clairlea, Golden Mile, Oakridge",Bakery,Bus Line,Soccer Field,Park,Intersection,Fast Food Restaurant,Metro Station,Dim Sum Restaurant,Diner,Discount Store
9,"Birch Cliff, Cliffside West",College Stadium,Café,Skating Rink,General Entertainment,Women's Store,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store
10,"Dorset Park, Scarborough Town Centre, Wexford ...",Indian Restaurant,Vietnamese Restaurant,Latin American Restaurant,Chinese Restaurant,Pet Store,College Stadium,Deli / Bodega,Electronics Store,Eastern European Restaurant,College Gym


In [20]:
aux=toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
first=aux.groupby(by="1st Most Common Venue").count()["Neighborhood"].idxmax()
second=aux.groupby(by="2nd Most Common Venue").count()["Neighborhood"].idxmax()
third=aux.groupby(by="3rd Most Common Venue").count()["Neighborhood"].idxmax()
print("For the majority of the neighborhoods in cluster 1, the 1st, 2nd and 3rd most common venues are {}, {} and {}, respectively.".format(first,second,third))

For the majority of the neighborhoods in cluster 1, the 1st, 2nd and 3rd most common venues are Coffee Shop, Café and Coffee Shop, respectively.


From the above information, it can be concluded that coffee shops are the main venues for the majority of the neighborhoods in cluster 1.
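
The `groupby(...).count()[...].idxmax()` pattern used for cluster 1 returns the most frequent value of a column, which is not necessarily shared by more than half of the rows. `Series.value_counts` expresses the same idea more directly and also exposes the share of rows involved. A sketch on invented data (the helper name `dominant_value` is hypothetical):

```python
import pandas as pd


def dominant_value(frame, column):
    """Return the most frequent value of `column` and its share of the rows."""
    shares = frame[column].value_counts(normalize=True)
    return shares.idxmax(), float(shares.max())


toy = pd.DataFrame({
    "1st Most Common Venue": ["Coffee Shop", "Coffee Shop", "Park", "Bar"],
})
value, share = dominant_value(toy, "1st Most Common Venue")
print(value, share)  # Coffee Shop 0.5
```

Applied to `aux`, the returned share would show how strongly the top venue actually dominates the cluster.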

Cluster 2.

In [21]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
89,"Humber Bay, King's Mill Park, Kingsway Park So...",Baseball Field,Women's Store,Donut Shop,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Drugstore
94,"Emery, Humberlea",Construction & Landscaping,Baseball Field,Drugstore,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Women's Store


There are two groups of neighborhoods in cluster 2. For one group, the 1st most common venue is a baseball field, while for the other group a baseball field is the 2nd most common venue. It can therefore be concluded that the main venue in this cluster is the baseball field.

Cluster 3.

In [22]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,"Agincourt North, L'Amoreaux East, Milliken, St...",Park,Coffee Shop,Playground,Dog Run,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store
19,"Silver Hills, York Mills",Park,Cafeteria,Women's Store,Dog Run,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Doner Restaurant
21,York Mills West,Bank,Park,Electronics Store,Donut Shop,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
28,"CFB Toronto, Downsview East",Park,Snack Place,Airport,Doner Restaurant,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run
38,East Toronto,Park,Convenience Store,Metro Station,Women's Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
48,Rosedale,Park,Playground,Trail,Eastern European Restaurant,Dumpling Restaurant,Electronics Store,Drugstore,Donut Shop,Doner Restaurant,Curling Ice
72,Caledonia-Fairbanks,Park,Women's Store,Fast Food Restaurant,Market,Pharmacy,Gluten-free Restaurant,Gift Shop,Gourmet Shop,Dumpling Restaurant,Drugstore
88,"The Kingsway, Montgomery Road, Old Mill North",Park,River,Women's Store,Discount Store,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Dog Run
95,Weston,Park,Women's Store,Donut Shop,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Drugstore


For all but one of the neighborhoods in cluster 3, the 1st most common venue is a park; the exception is York Mills West, where a park ranks 2nd. Among the 2nd most common venues are cafeterias, coffee shops, convenience stores, snack places, playgrounds and rivers. The neighborhoods of cluster 3 are therefore probably characterized by green spaces.

Cluster 4.

In [23]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,"Cliffcrest, Cliffside, Scarborough Village West",Motel,American Restaurant,Dance Studio,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant


The 1st most common venue for the neighborhoods in cluster 4 is a motel. This may characterize these neighborhoods as regions with entertainment for an adult public, although the cluster contains a single row, so the evidence is thin.