# Applied Data Science Capstone

## WEEK 3 Peer Graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

#### Table of Contents

[Task 1: Creating Dataframe with Postal Codes, Boroughs and Neighborhoods](#one)
    
[Task 2: Adding Latitude and Longitude Coordinates to Each Borough](#two)
        
[________2.1 Using CSV-file to Get Coordinates](#three)
        
[________2.2 Checking GeoCoder, which didn't work and using Google-API instead to create dataframe with coordinates.](#four)

[Task 3: Clustering and Visualizing the Neighborhoods in Toronto](#five)

# Task 1<a name="one"></a>
### Creating Dataframe with Postal Codes, Boroughs & Neighborhoods

1.1 Scraping the table of Canadian postal codes from Wikipedia into a pandas dataframe

In [3]:
import pandas as pd
import numpy as np

#Scraping the table of Canadian postal codes from Wikipedia into a pandas dataframe
df_can_zip=pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]

1.2 Preparing the pandas dataframe according to the conditions in the assignment:
- Keeping only postal codes with an assigned borough: rows where Borough is 'Not assigned' are removed;
- Combining rows that share a postal code: their 'Neighbourhood' values are joined in one cell, separated by a comma and a space (not strictly needed, because the Wikipedia table is already organized this way);
- If a row's 'Neighbourhood' cell is 'Not assigned', it gets the value of the row's 'Borough' cell (also not strictly needed, because the Wikipedia table no longer has 'Not assigned' neighbourhoods);
- Creating the same dataframe as shown in the assignment example (12 rows, in the same order as in the example);
- Printing the number of rows in the full dataframe, which is used in the further tasks.

In [4]:
#Leaving only Postal Codes with assigned Boroughs and Neighborhoods:
df_can_zip=df_can_zip[df_can_zip['Borough'] !='Not assigned'] 

#Combining rows with similar postal codes, values from 'Neighbourhood' are joined in one cell and separated by comma and space:
df_can_zip = df_can_zip.groupby(['Postal Code','Borough'], as_index = False).agg({'Neighbourhood': ', '.join})

#If Neighbourhood in row is 'Not assigned', then put to Neighbourhood cell the same value as in Borough cell of this row:
df_can_zip.loc[df_can_zip['Neighbourhood']=='Not assigned','Neighbourhood']=df_can_zip.loc[df_can_zip['Neighbourhood']=='Not assigned', 'Borough']

#creating the dataframe shown in the assignment table with postal codes
df_can_zip.rename(columns={'Postal Code': 'PostalCode'}, inplace=True) #changing the name of the "Postal Code" column to "PostalCode"
sorter=['M5G','M2H','M4B','M1J','M4G','M4M','M1R','M9V','M9L','M5V','M1B','M5A'] #list of PostalCodes from the assignment
df_coursera=df_can_zip[df_can_zip['PostalCode'].isin(sorter)] #keeping only rows with PostalCodes that are shown in the assignment
true_sort = [s for s in sorter if s in df_coursera.PostalCode.unique()] #preparing the list to sort rows the same way...
df_coursera = df_coursera.set_index('PostalCode').loc[true_sort].reset_index() #...as it is shown in the assignment
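The custom row ordering used above can be tried on a small frame. This is a minimal sketch: the toy data and the deliberately absent code `'M0X'` are illustrative assumptions, not part of the assignment.

```python
import pandas as pd

# Toy frame standing in for df_can_zip (illustrative values only)
toy = pd.DataFrame({
    'PostalCode': ['M1B', 'M4B', 'M5G', 'M9V'],
    'Borough': ['Scarborough', 'East York', 'Downtown Toronto', 'Etobicoke'],
})

wanted = ['M5G', 'M4B', 'M0X', 'M1B']  # 'M0X' is a made-up code that is not in the frame

# Keep only the codes that exist, preserving the order of `wanted`;
# .loc with a missing label would raise a KeyError
present = [code for code in wanted if code in set(toy['PostalCode'])]
ordered = toy.set_index('PostalCode').loc[present].reset_index()
print(ordered)
```

Filtering the list first is what makes `.loc` safe here: rows come back in the order of `present`, and unknown codes are silently skipped.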

## Task 1 Answer
### Printing the number of rows in the dataframe and showing the same dataframe as on the picture in the assignment.

In [5]:
#printing number of rows in the dataframe with shape method
print('The dataframe has '+str(df_can_zip.shape[0])+' rows')
df_coursera #showing the same dataframe as in assignment

The dataframe has 103 rows


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M5G,Downtown Toronto,Central Bay Street
1,M2H,North York,Hillcrest Village
2,M4B,East York,"Parkview Hill, Woodbine Gardens"
3,M1J,Scarborough,Scarborough Village
4,M4G,East York,Leaside
5,M4M,East Toronto,Studio District
6,M1R,Scarborough,"Wexford, Maryvale"
7,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
8,M9L,North York,Humber Summit
9,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har..."


# Task 2<a name="two"></a>
### Adding Latitude & Longitude Coordinates to each Borough (From CSV and from Google APIs)

#### 2.1 Using CSV-File to create dataframe with coordinates of Boroughs<a name="three"></a>

In [6]:
#reading CSV-File with coordinates given in the assignment to a dataframe:
df_csv_coord=pd.read_csv('https://cocl.us/Geospatial_data')
print(df_csv_coord.shape) #checking the shape

#Setting the postal code as index makes it easy to merge the geo coordinates into the dataframe with Boroughs and Neighbourhoods:
df_csv_coord.set_index('Postal Code', inplace=True)
df_csv_coord.tail(1) #checking the names of columns, format of data in cells of dataframe

(103, 3)


Postal Code,Latitude,Longitude
M9W,43.706748,-79.594054


In [8]:
#Creating dataframe "df_can_coord1", which includes PostalCode, Borough, Neighbourhood, Latitude and Longitude information.
#Joining on the postal code adds Latitude and Longitude to every row (df_csv_coord is indexed by 'Postal Code'):
df_can_coord1 = df_can_zip.join(df_csv_coord, on='PostalCode')

print(df_can_coord1.shape)
print(df_can_zip.tail(1))
df_can_coord1.tail(1)

(103, 5)
    PostalCode    Borough                        Neighbourhood
102        M9W  Etobicoke  Northwest, West Humber - Clairville


Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
102,M9W,Etobicoke,"Northwest, West Humber - Clairville",43.706748,-79.594054


# Task 2 Answer (Using CSV File)
## Showing the same dataframe as on the picture in the assignment.

In [9]:
#creating the same dataframe as shown in the assignment:
sorter=['M5G','M2H','M4B','M1J','M4G','M4M','M1R','M9V','M9L','M5V','M1B','M5A'] #adding list of PostalCodes from assignment
df_coursera1=df_can_coord1[df_can_coord1['PostalCode'].isin(sorter)] #keeping only rows with PostalCodes that are shown in the assignment
true_sort = [s for s in sorter if s in df_coursera1.PostalCode.unique()] #preparing the list to sort rows the same way...
df_coursera1 = df_coursera1.set_index('PostalCode').loc[true_sort].reset_index() #...as it is shown in assignment
df_coursera1 #showing the same dataframe as in assignment

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
1,M2H,North York,Hillcrest Village,43.803762,-79.363452
2,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
3,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
4,M4G,East York,Leaside,43.70906,-79.363452
5,M4M,East Toronto,Studio District,43.659526,-79.340923
6,M1R,Scarborough,"Wexford, Maryvale",43.750072,-79.295849
7,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
8,M9L,North York,Humber Summit,43.756303,-79.565963
9,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.39442


#### 2.2 Checking GeoCoder, which didn't work and using Google-API instead to create dataframe with coordinates.<a name="four"></a>

In [10]:
#Checking, that Geocoder is not working

#Installing and importing geocoder:
!pip install geocoder -q
import geocoder as gcd

lat_lng_coords=None #starting with no coordinates so that the while loop runs at least once
max_iteration=50 #limit on the number of attempts to get coordinates from geocoder
current_iteration=0 #counter of attempts made so far

while(lat_lng_coords is None): #looping until coordinates are returned or the limit is reached
    if current_iteration >= max_iteration:
        print('Tried 50 times not successfully!')
        break
    current_iteration=current_iteration+1
    g = gcd.google('{}, Toronto, Ontario'.format('M5G'))
    lat_lng_coords = g.latlng
    
print(g.ok) #False means that geocoder returned no usable result for the given search terms
print(lat_lng_coords) #show latitude & longitude for the M5G postal code, if found

Tried 50 times not successfully!
False
None
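The bounded retry pattern from the cell above can be factored into a small helper. This is a sketch under assumptions: `retry` and `flaky_lookup` are hypothetical names, and the stand-in lookup simulates a service that answers on the third call instead of contacting any real geocoder.

```python
import time

def retry(fetch, max_attempts=5, delay=0.0):
    """Call fetch() until it returns a non-None value or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if result is not None:
            return result, attempt
        time.sleep(delay)  # optional pause between attempts
    return None, max_attempts

# Hypothetical stand-in for a geocoder call that succeeds on the third try
calls = {'n': 0}

def flaky_lookup():
    calls['n'] += 1
    return [43.65, -79.38] if calls['n'] >= 3 else None

coords, attempts = retry(flaky_lookup, max_attempts=5)
print(coords, attempts)
```

Returning the attempt count next to the result makes it easy to report how many tries were needed, or that the limit was reached.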


#### Forget about geocoder, let's use the Google APIs!

In [11]:
import os
import requests
import time

#Creating a function to get coordinates from the Google Geocoding API
def google_adr_to_cord(address_or_zipcode):
    lat, lng = None, None
    api_key = os.environ['GOOGLE_API_KEY'] #assumed environment variable name; keeps the key out of the notebook
    base_url = "https://maps.googleapis.com/maps/api/geocode/json"
    #passing the query through params lets requests URL-encode the address
    r = requests.get(base_url, params={'address': address_or_zipcode, 'key': api_key})
    results = r.json().get('results', [])
    if results: #an empty list means that the API found nothing for this address
        lat = results[0]['geometry']['location']['lat']
        lng = results[0]['geometry']['location']['lng']
    return lat, lng
#google_adr_to_cord('{}, Toronto, Ontario'.format(df_can_zip.loc[0,'PostalCode']))[0] - latitude
#google_adr_to_cord('{}, Toronto, Ontario'.format(df_can_zip.loc[0,'PostalCode']))[1] - longitude

#Creating new dataframe with Latitude and Longitude:
coords = []
for postal_code in df_can_zip['PostalCode']:
    #one request per postal code returns both latitude and longitude
    coords.append(google_adr_to_cord('{}, Toronto, Ontario'.format(postal_code)))
    time.sleep(0.2) #pausing 200 milliseconds between requests to avoid overloading the API

df_can_coord2 = df_can_zip.copy()
df_can_coord2['Latitude'] = [lat for lat, lng in coords]
df_can_coord2['Longitude'] = [lng for lat, lng in coords]
print(df_can_coord2.shape)
print(df_can_zip.tail(1))
df_can_coord2.tail(1)

(103, 5)
    PostalCode    Borough                        Neighbourhood
102        M9W  Etobicoke  Northwest, West Humber - Clairville


Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
102,M9W,Etobicoke,"Northwest, West Humber - Clairville",43.706748,-79.594054
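The JSON lookup inside `google_adr_to_cord` can be checked offline. The sketch below assumes a hypothetical helper `extract_lat_lng` and hand-written payloads that only mimic the fields the function reads (`status`, `results`, `geometry.location`); they are not recorded API output.

```python
def extract_lat_lng(payload):
    """Return (lat, lng) from a geocoding JSON payload, or (None, None)."""
    results = payload.get('results') or []
    if payload.get('status') != 'OK' or not results:
        return None, None
    location = results[0]['geometry']['location']
    return location['lat'], location['lng']

# Hand-written payloads for illustration only
ok_payload = {
    'status': 'OK',
    'results': [{'geometry': {'location': {'lat': 43.7, 'lng': -79.6}}}],
}
empty_payload = {'status': 'ZERO_RESULTS', 'results': []}

print(extract_lat_lng(ok_payload))
print(extract_lat_lng(empty_payload))
```

Guarding against an empty `results` list avoids an `IndexError` when a postal code cannot be resolved, so one bad lookup does not stop the whole loop.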


# Task 2 Answer (using Google APIs)
## Showing the same dataframe as on the picture in the assignment.

In [12]:
#creating the same dataframe as shown in the assignment:
sorter=['M5G','M2H','M4B','M1J','M4G','M4M','M1R','M9V','M9L','M5V','M1B','M5A'] #adding list of PostalCodes from assignment
df_coursera2=df_can_coord2[df_can_coord2['PostalCode'].isin(sorter)] #keeping only rows with PostalCodes that are shown in the assignment
true_sort = [s for s in sorter if s in df_coursera2.PostalCode.unique()] #preparing the list to sort rows the same way...
df_coursera2 = df_coursera2.set_index('PostalCode').loc[true_sort].reset_index() #...as it is shown in assignment
df_coursera2 #showing the same dataframe as in assignment

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
1,M2H,North York,Hillcrest Village,43.803762,-79.363452
2,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
3,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
4,M4G,East York,Leaside,43.70906,-79.363452
5,M4M,East Toronto,Studio District,43.659526,-79.340923
6,M1R,Scarborough,"Wexford, Maryvale",43.750071,-79.295849
7,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
8,M9L,North York,Humber Summit,43.756303,-79.565963
9,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.39442
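The two coordinate sources can be compared row by row. This is a minimal sketch on two rows copied from the tables above; the frame names and the tolerance of 1e-4 degrees (roughly 11 m of latitude) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Two rows copied from the CSV-based and API-based tables above
csv_df = pd.DataFrame({'PostalCode': ['M5G', 'M1R'],
                       'Latitude': [43.657952, 43.750072],
                       'Longitude': [-79.387383, -79.295849]})
api_df = pd.DataFrame({'PostalCode': ['M5G', 'M1R'],
                       'Latitude': [43.657952, 43.750071],
                       'Longitude': [-79.387383, -79.295849]})

both = csv_df.merge(api_df, on='PostalCode', suffixes=('_csv', '_api'))

tolerance = 1e-4  # assumed threshold in degrees
both['match'] = (
    np.isclose(both['Latitude_csv'], both['Latitude_api'], rtol=0, atol=tolerance)
    & np.isclose(both['Longitude_csv'], both['Longitude_api'], rtol=0, atol=tolerance)
)
print(both[['PostalCode', 'match']])
```

An absolute tolerance is used because the two sources can differ in the last printed digit, as the M1R latitude does.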


# Task 3<a name="five"></a>
### Clustering & Visualizing the Neighborhoods in Toronto

#### In this task I'll analyze all __103 Neighbourhood Groups of Toronto (here and below, all Neighbourhoods that share one postal code are united in one Neighbourhood Group)__. I don't want to limit myself to Neighbourhood Groups that have 'Toronto' in the name, because in my opinion they may all be very similar and end up in just one or two clusters. A larger number of Neighbourhood Groups may help to create clusters that differ more clearly from each other.

After several iterations I decided to divide all Neighbourhood Groups into __four (4) clusters__ according to their venues from Foursquare.

An __additional (fifth)__ cluster consists of 4 Neighbourhood Groups that had no venues in Foursquare at all. I discovered this group when I fed the GPS coordinates of all 103 Neighbourhood Groups to the Foursquare API but got results for only 99 of them. The number of Neighbourhood Groups with venues can vary between runs (in the first run I got 100 Neighbourhood Groups with venues and only 3 without).

In [61]:
#Importing all necessary libraries:
import json # library to handle JSON files
!conda install -c conda-forge geopy -q --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
!conda install -c conda-forge folium=0.5.0 -q --yes
import folium # map rendering library

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



In [62]:
#Getting coordinates of Toronto with Nominatim.
address = 'Toronto, Ontario, Canada'
geolocator = Nominatim(user_agent="Toronto_Assignment")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('Coordinates of Toronto are {}, {}.'.format(latitude, longitude))

Coordinates of Toronto are 43.6534817, -79.3839347.


In [63]:
import os
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID (assumed environment variable name)
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret (assumed environment variable name)

In [64]:
#Using Foursquare to get venues info on each Neighbourhood Group of Toronto
#Defining Foursquare version:

VERSION = '20210108' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

In [65]:
#Adding function to extract categories from Foursquare
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError: #flattened responses use the 'venue.categories' key instead
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [66]:
#Creating a function to return top100 venues in every Neighbourhood Group of Toronto within a radius of 500 meters
#from the centre of Neighbourhood Group.
def getNearbyVenues(names, postals, latitudes, longitudes, radius=500):
    
    venues_list=[]
    k=1
    for name, postal, lat, lng in zip(names, postals, latitudes, longitudes):
        print('[Neighbourhood Group '+str(k)+': '+postal, end='];  ')
        k=k+1    
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            postal+': '+name,
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood Group', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [67]:
#Putting venues of all Neighbourhood Groups of Toronto in one dataframe: 
toronto_ven = getNearbyVenues(names=df_can_coord2['Neighbourhood'],
                              postals=df_can_coord2['PostalCode'],
                              latitudes=df_can_coord2['Latitude'],
                              longitudes=df_can_coord2['Longitude']
                              )

[Neighbourhood Group 1: M1B];  [Neighbourhood Group 2: M1C];  [Neighbourhood Group 3: M1E];  [Neighbourhood Group 4: M1G];  [Neighbourhood Group 5: M1H];  [Neighbourhood Group 6: M1J];  [Neighbourhood Group 7: M1K];  [Neighbourhood Group 8: M1L];  [Neighbourhood Group 9: M1M];  [Neighbourhood Group 10: M1N];  [Neighbourhood Group 11: M1P];  [Neighbourhood Group 12: M1R];  [Neighbourhood Group 13: M1S];  [Neighbourhood Group 14: M1T];  [Neighbourhood Group 15: M1V];  [Neighbourhood Group 16: M1W];  [Neighbourhood Group 17: M1X];  [Neighbourhood Group 18: M2H];  [Neighbourhood Group 19: M2J];  [Neighbourhood Group 20: M2K];  [Neighbourhood Group 21: M2L];  [Neighbourhood Group 22: M2M];  [Neighbourhood Group 23: M2N];  [Neighbourhood Group 24: M2P];  [Neighbourhood Group 25: M2R];  [Neighbourhood Group 26: M3A];  [Neighbourhood Group 27: M3B];  [Neighbourhood Group 28: M3C];  [Neighbourhood Group 29: M3H];  [Neighbourhood Group 30: M3J];  [Neighbourhood Group 31: M3K];  [Neighbourhood Gr



As seen above, I fed __all 103 Neighbourhood Groups__ (Neighbourhoods with the same postal code) to Foursquare. But as you'll see later, __I got venues for only 99 of them__.

In [68]:
#checking the shape and contents of "toronto_ven" dataframe:
print(toronto_ven.shape)
toronto_ven.tail(1)

(2099, 7)


Unnamed: 0,Neighbourhood Group,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
2098,"M9W: Northwest, West Humber - Clairville",43.706748,-79.594054,Vectra Heavy Haulers,43.704891,-79.59941,Truck Stop




In the cell below I group venues by Neighbourhood Group. As you can see, __the shape of this grouping is 99 x 6__. This means that __only 99 Neighbourhood Groups out of 103 have at least 1 venue__. The other 4 Neighbourhood Groups have no venues and __will not participate in cluster modeling__. I will add them back __manually later on as a separate "No venues cluster"__.

In [69]:
toronto_ven.groupby('Neighbourhood Group').count().sort_values(by=['Venue'], ascending=False)

Neighbourhood Group,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"M5X: First Canadian Place, Underground city",100,100,100,100,100,100
"M5L: Commerce Court, Victoria Hotel",100,100,100,100,100,100
"M5K: Toronto Dominion Centre, Design Exchange",100,100,100,100,100,100
"M5J: Harbourfront East, Union Station, Toronto Islands",100,100,100,100,100,100
"M5B: Garden District, Ryerson",100,100,100,100,100,100
...,...,...,...,...,...,...
"M9B: West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale",1,1,1,1,1,1
"M8X: The Kingsway, Montgomery Road, Old Mill North",1,1,1,1,1,1
"M9M: Humberlea, Emery",1,1,1,1,1,1
M9N: Weston,1,1,1,1,1,1
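The Neighbourhood Groups that returned no venues can be listed directly instead of being inferred from the row count. A minimal sketch on toy data: the group names come from this notebook, while the venue names are placeholders.

```python
import pandas as pd

# Every group that was sent to the API (toy subset)
all_groups = pd.Series(['M1B: Malvern, Rouge', 'M1X: Upper Rouge', 'M9N: Weston'])

# Venue table in the same layout as toronto_ven (venue names are placeholders)
venues = pd.DataFrame({
    'Neighbourhood Group': ['M1B: Malvern, Rouge', 'M9N: Weston', 'M9N: Weston'],
    'Venue': ['Venue A', 'Venue B', 'Venue C'],
})

# Groups that never appear in the venue table returned nothing
missing = all_groups[~all_groups.isin(venues['Neighbourhood Group'])]
print(missing.tolist())
```

On the real data the same filter would use the `PostalCode + ': ' + Neighbourhood` strings built from `df_can_coord2` as `all_groups`.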


In [70]:
print('There are {} uniques categories.'.format(len(toronto_ven['Venue Category'].unique())))

There are 268 uniques categories.


In [71]:
#Analyzing Neighbourhood groups:

# one hot encoding
toronto_ven2 = pd.get_dummies(toronto_ven[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_ven2['Neighbourhood Group'] = toronto_ven['Neighbourhood Group']

# move neighborhood column to the first column
fixed_columns = [toronto_ven2.columns[-1]] + list(toronto_ven2.columns[:-1])
toronto_ven2 = toronto_ven2[fixed_columns]

toronto_ven2.head(5)

Unnamed: 0,Neighbourhood Group,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"M1B: Malvern, Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"M1C: Rouge Hill, Port Union, Highland Creek",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"M1C: Rouge Hill, Port Union, Highland Creek",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"M1E: Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"M1E: Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [72]:
# Calculating the share (the weight) of each venue type among all venues of each Neighbourhood Group
toronto_mean = toronto_ven2.groupby('Neighbourhood Group').mean().reset_index()
print(toronto_mean.shape)
toronto_mean.head(5)

(99, 269)


Unnamed: 0,Neighbourhood Group,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"M1B: Malvern, Rouge",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"M1C: Rouge Hill, Port Union, Highland Creek",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"M1E: Guildwood, Morningside, West Hill",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G: Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H: Cedarbrae,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
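Why the mean of one-hot rows gives a frequency can be seen on a tiny example. This sketch uses made-up groups `A` and `B`; only the technique mirrors the cells above.

```python
import pandas as pd

# Toy venue table: group A has three venues, group B has one
toy = pd.DataFrame({
    'Neighbourhood Group': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Café', 'Café', 'Park', 'Bank'],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy['Venue Category'], dtype=float)
onehot.insert(0, 'Neighbourhood Group', toy['Neighbourhood Group'])

# Averaging the indicators per group yields each category's share of venues
freq = onehot.groupby('Neighbourhood Group').mean()
print(freq)
```

Each row of `freq` sums to 1, so a group with few venues can show a very high share for a single category. This is worth keeping in mind when reading the top venues of sparsely covered groups.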


In [73]:
#Getting top 2 venues for each Neighbourhood Group:  

num_top_venues = 2

for hood in toronto_mean['Neighbourhood Group']:
    print("----"+hood+"----")
    temp = toronto_mean[toronto_mean['Neighbourhood Group'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----M1B: Malvern, Rouge----
                  venue  freq
0  Fast Food Restaurant   1.0
1                 Motel   0.0


----M1C: Rouge Hill, Port Union, Highland Creek----
         venue  freq
0          Bar   0.5
1  Golf Course   0.5


----M1E: Guildwood, Morningside, West Hill----
            venue  freq
0  Medical Center  0.12
1            Bank  0.12


----M1G: Woburn----
                   venue  freq
0            Coffee Shop  0.67
1  Korean BBQ Restaurant  0.33


----M1H: Cedarbrae----
                venue  freq
0    Hakka Restaurant  0.12
1  Athletics & Sports  0.12


----M1J: Scarborough Village----
        venue  freq
0         Spa   0.5
1  Playground   0.5


----M1K: Kennedy Park, Ionview, East Birchmount Park----
              venue  freq
0  Department Store  0.17
1        Hobby Shop  0.17


----M1L: Golden Mile, Clairlea, Oakridge----
      venue  freq
0  Bus Line  0.22
1    Bakery  0.22


----M1M: Cliffside, Cliffcrest, Scarborough Village West----
...

                  venue  freq
0  Fast Food Restaurant  0.25
1     Convenience Store  0.25


----M6N: Runnymede, The Junction North----
               venue  freq
0  Convenience Store  0.25
1     Breakfast Spot  0.25


----M6P: High Park, The Junction South----
                venue  freq
0     Thai Restaurant  0.08
1  Mexican Restaurant  0.08


----M6R: Parkdale, Roncesvalles----
            venue  freq
0  Breakfast Spot  0.14
1       Gift Shop  0.14


----M6S: Runnymede, Swansea----
         venue  freq
0  Coffee Shop  0.09
1         Café  0.09


----M7A: Queen's Park, Ontario Provincial Government----
              venue  freq
0       Coffee Shop  0.21
1  Sushi Restaurant  0.06


----M7R: Canada Post Gateway Processing Centre----
            venue  freq
0     Gas Station  0.14
1  Sandwich Place  0.14


----M7Y: Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
           venue  freq
0           Park  0.07
1  Auto Workshop  0.07


----M8V: New To

In [74]:
#defining a function to return the most common venues

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
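A quick usage sketch for the function above, on one hand-made row; the frequencies are illustrative, and the function is repeated so that the snippet runs on its own.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # skip the 'Neighbourhood Group' label
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# One row in the layout of toronto_mean: label first, then category frequencies
row = pd.Series({'Neighbourhood Group': 'X', 'Bank': 0.1, 'Café': 0.6, 'Park': 0.3})

top = return_most_common_venues(row, 2)
print(list(top))
```

When fewer categories have a non-zero frequency than `num_top_venues`, the remaining slots are filled with zero-frequency categories in whatever order the sort leaves them. That appears to be why groups with a single venue still list ten "most common" venues in the tables further down.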

In [75]:
#Creating a table with top10 common venues for every Neighbourhood group:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood Group']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: #no 'st'/'nd'/'rd' suffix left, so 'th' is used
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood Group'] = toronto_mean['Neighbourhood Group']

for ind in np.arange(toronto_mean.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_mean.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head(3)
neighborhoods_venues_sorted.shape

(99, 11)




After preparation and visual analysis __I finally proceed with clustering.__ As mentioned above __only 99 Neighbourhood Groups out of 103__ are taking part in clustering.

In [76]:
# set number of clusters
kclusters = 4

toronto_clustering = toronto_mean.drop('Neighbourhood Group', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:99] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0,
       2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,
       1, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0])
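The number of clusters was chosen above by trying several values. One common way to make such a comparison explicit is the silhouette score. The sketch below runs on synthetic, well-separated points rather than on the Toronto data, so it only illustrates the procedure and is not the method used in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight blobs far apart (not the Toronto venues)
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit k-means for several k and record the mean silhouette score of each fit
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print('best k:', best_k)
```

On sparse frequency vectors like `toronto_clustering` the scores are usually much less clear-cut, so they are best read together with a look at the resulting clusters.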

In [77]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

#creating dataframe "toronto_merged" with details on each Neighbourhood Group as well as its cluster and top10 venues: 
toronto_merged = df_can_coord2[['PostalCode', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude']].copy()
#the group name uses the same "PostalCode: Neighbourhood" format as getNearbyVenues, so the join below can match on it
toronto_merged.insert(0, 'Neighbourhood Group', toronto_merged['PostalCode']+': '+toronto_merged['Neighbourhood'])
#merge latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood Group'), on='Neighbourhood Group')

# check the last columns!
#First I wanted to drop rows with no venues by using "toronto_merged.dropna(inplace=True)".
#But later I decided to show these 4 Neighbourhood Groups as a separate cluster
print(toronto_merged.shape)

# I sort the merged dataframe by cluster labels with parameter na_position='last',
#so that the rows with no venues, which were therefore not clustered, are sorted as the last 4 rows:
toronto_merged.sort_values(by = 'Cluster Labels', axis=0, ascending=True, inplace=False, na_position='last').tail(4)

(103, 17)


Unnamed: 0,Neighbourhood Group,PostalCode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
16,M1X: Upper Rouge,M1X,Scarborough,Upper Rouge,43.836125,-79.205636,,,,,,,,,,,
20,"M2L: York Mills, Silver Hills",M2L,North York,"York Mills, Silver Hills",43.75749,-79.374714,,,,,,,,,,,
21,"M2M: Willowdale, Newtonbrook",M2M,North York,"Willowdale, Newtonbrook",43.789053,-79.408493,,,,,,,,,,,
93,"M9A: Islington Avenue, Humber Valley Village",M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242,,,,,,,,,,,




As seen in the output above, __the Neighbourhood Groups with postal codes M1X, M2L, M2M and M9A__ are the 4 groups that __have no venues in Foursquare.__ To show them correctly on the map, I'll assign them cluster label "4", so that we have 5 clusters labelled from 0 to 4. Afterwards I will generate the map with all 103 Neighbourhood Groups coloured according to their cluster.



In [78]:
#Putting cluster label 4 to the cells instead of NaN
toronto_merged.loc[toronto_merged['Cluster Labels'].isna(), 'Cluster Labels'] = 4

In [79]:
toronto_merged.sort_values(by = 'Cluster Labels', axis=0, ascending=True, inplace=False, na_position='last').tail(4)

Unnamed: 0,Neighbourhood Group,PostalCode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
93,"M9A: Islington Avenue, Humber Valley Village",M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242,4.0,,,,,,,,,,
16,M1X: Upper Rouge,M1X,Scarborough,Upper Rouge,43.836125,-79.205636,4.0,,,,,,,,,,
21,"M2M: Willowdale, Newtonbrook",M2M,North York,"Willowdale, Newtonbrook",43.789053,-79.408493,4.0,,,,,,,,,,
20,"M2L: York Mills, Silver Hills",M2L,North York,"York Mills, Silver Hills",43.75749,-79.374714,4.0,,,,,,,,,,


# Task 3 Answer:

#### 3.1 Plotting a map of Toronto with clustered Neighbourhood Groups on it:

In [80]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
n_colors = kclusters + 1  # +1 for the extra cluster of groups with no venues
colors_array = cm.rainbow(np.linspace(0, 1, n_colors))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood Group'], toronto_merged['Cluster Labels']):
    cluster = int(cluster)  # labels are stored as floats (e.g. 4.0)
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # index directly: labels run from 0 to kclusters
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
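
As a sketch of how one colour per cluster can be built and looked up, the snippet below uses only the standard library (`colorsys`) in place of matplotlib's `cm.rainbow`. It is a swapped-in technique for illustration, not the notebook's code, and `n_clusters` and `cluster_colors` are hypothetical names.

```python
import colorsys

n_clusters = 5  # assumed: 4 k-means clusters plus the extra "no venues" cluster

def cluster_colors(n):
    """Return n evenly spaced hues as hex strings such as '#ff0000'."""
    palette = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / n, 1.0, 1.0)
        palette.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = cluster_colors(n_clusters)

# Index the palette directly with the integer label, so label 0 gets the first colour
for label in [0.0, 3.0, 4.0]:
    print(int(label), palette[int(label)])
```

Indexing with the label itself keeps the colour order aligned with the cluster numbering, which makes the map easier to compare with the tables.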

#### 3.2 Showing clusters in tables and describing clusters

Cluster 1 (label 0): "Food & Services". This is the biggest cluster. All Neighbourhood Groups in it are rich in cafes and restaurants, as well as stores and banks.

In [81]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[0] + list(range(6, toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood Group,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"M1B: Malvern, Rouge",0.0,Fast Food Restaurant,Yoga Studio,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
1,"M1C: Rouge Hill, Port Union, Highland Creek",0.0,Golf Course,Bar,Yoga Studio,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
2,"M1E: Guildwood, Morningside, West Hill",0.0,Electronics Store,Mexican Restaurant,Restaurant,Rental Car Location,Breakfast Spot,Intersection,Bank,Medical Center,Diner,Discount Store
3,M1G: Woburn,0.0,Coffee Shop,Korean BBQ Restaurant,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Yoga Studio
4,M1H: Cedarbrae,0.0,Bank,Hakka Restaurant,Fried Chicken Joint,Thai Restaurant,Caribbean Restaurant,Athletics & Sports,Gas Station,Bakery,Discount Store,Dim Sum Restaurant
...,...,...,...,...,...,...,...,...,...,...,...,...
96,M9L: Humber Summit,0.0,Intersection,Furniture / Home Store,Pizza Place,Dog Run,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Yoga Studio
98,M9N: Weston,0.0,Convenience Store,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Yoga Studio,Department Store
99,M9P: Westmount,0.0,Pizza Place,Chinese Restaurant,Intersection,Coffee Shop,Sandwich Place,Middle Eastern Restaurant,Discount Store,Yoga Studio,Dim Sum Restaurant,Diner
101,"M9V: South Steeles, Silverstone, Humbergate, J...",0.0,Grocery Store,Fried Chicken Joint,Beer Store,Sandwich Place,Discount Store,Fast Food Restaurant,Pizza Place,Pharmacy,Drugstore,Donut Shop


Cluster 2 (label 1): "Baseball Field". This cluster contains only a few Neighbourhood Groups, but it has a unique feature: baseball fields.

In [82]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[0] + list(range(6, toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood Group,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
91,"M8Y: Old Mill South, King's Mill Park, Sunnyle...",1.0,Deli / Bodega,Construction & Landscaping,Baseball Field,Yoga Studio,Donut Shop,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
97,"M9M: Humberlea, Emery",1.0,Baseball Field,Yoga Studio,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore,Farmers Market


Cluster 3 (label 2): "Cluster of Parks". This cluster is not very large either, but like cluster 2 it has a unique feature: parks.

In [83]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[0] + list(range(6, toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood Group,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,"M1V: Milliken, Agincourt North, Steeles East, ...",2.0,Park,Intersection,Playground,Dog Run,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Doner Restaurant
23,M2P: York Mills West,2.0,Convenience Store,Park,Electronics Store,Construction & Landscaping,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run
25,M3A: Parkwoods,2.0,Park,Food & Drink Shop,Construction & Landscaping,Yoga Studio,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run
30,M3K: Downsview,2.0,Park,Airport,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
40,"M4J: East Toronto, Broadview North (Old East Y...",2.0,Park,Intersection,Convenience Store,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Dessert Shop,Donut Shop
50,M4W: Rosedale,2.0,Park,Playground,Trail,Dog Run,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Doner Restaurant
79,"M6L: North Park, Maple Leaf Park, Upwood Park",2.0,Park,Basketball Court,Bakery,Construction & Landscaping,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
100,"M9R: Kingsview Village, St. Phillips, Martin G...",2.0,Park,Sandwich Place,Dog Run,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Doner Restaurant,College Gym


Cluster 4 (label 3): "A place by the river". This cluster consists of only one Neighbourhood Group, but it has a distinctive venue: the Humber River.

In [86]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[0] + list(range(6, toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood Group,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
90,"M8X: The Kingsway, Montgomery Road, Old Mill N...",3.0,River,Yoga Studio,Dog Run,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Doner Restaurant,Deli / Bodega


Cluster 5 (label 4): "No luck for venues". Surprisingly, the 4 Neighbourhood Groups in this cluster have no venues within a 500-meter radius. Must be a very good place for dull and quiet living :)

In [87]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[0] + list(range(6, toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood Group,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
16,M1X: Upper Rouge,4.0,,,,,,,,,,
20,"M2L: York Mills, Silver Hills",4.0,,,,,,,,,,
21,"M2M: Willowdale, Newtonbrook",4.0,,,,,,,,,,
93,"M9A: Islington Avenue, Humber Valley Village",4.0,,,,,,,,,,
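
Statements about cluster size and typical venues can be backed up with a short tabulation. The following is a minimal sketch on hypothetical toy data: the column names mirror the notebook, but the values are made up and do not reproduce the real results.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook, values are made up
demo = pd.DataFrame({
    "Cluster Labels": [0, 0, 0, 1, 2, 2],
    "1st Most Common Venue": ["Coffee Shop", "Bank", "Coffee Shop",
                              "Baseball Field", "Park", "Park"],
})

# Number of Neighbourhood Groups per cluster
sizes = demo["Cluster Labels"].value_counts().sort_index()

# Most frequent first-ranked venue category within each cluster
top_venue = demo.groupby("Cluster Labels")["1st Most Common Venue"].agg(
    lambda s: s.value_counts().idxmax()
)

print(sizes)
print(top_venue)
```

Applied to `toronto_merged`, the same two calls would give the number of Neighbourhood Groups per cluster and the venue category that most often ranks first, which is a quick check on the names chosen for the clusters.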






# Backup





In [16]:
#Creating map of Toronto with all 103 Neighbourhood Groups with Folium. No segmentation
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood, postal in zip(df_can_coord2['Latitude'], df_can_coord2['Longitude'], df_can_coord2['Borough'], df_can_coord2['Neighbourhood'], df_can_coord2['PostalCode']):
    label = '{}-{}:  {}'.format(postal, borough, neighbourhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='#ff0000',
        fill=True,
        fill_color='#ffffff',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto