# Toronto Neighborhood Segmentation and Clustering, Parts 1 - 3

## This notebook satisfies the requirements to get, transform, and groom the data that will be used for the subsequent neighborhood segmentation and clustering exercise.  This is the COMPLETE submission for Week 3 of the IBM Applied Data Science Capstone course offered via Coursera.  The assignment may be viewed at https://www.coursera.org/learn/applied-data-science-capstone/peer/I1bDq/segmenting-and-clustering-neighborhoods-in-toronto/submit.

### Please see the comments in the code for detailed explanations of each processing step.  Thanks!

### Part 1: Get, Transform, and Clean the Data

In [None]:
#install the necessary packages
#commented out here to clean up the notebook
#!pip install bs4;
#!pip install requests;
#!pip install lxml;
#!pip install cchardet;

In [59]:
#import the necessary libraries
import urllib.request; #the request submodule must be imported explicitly
import pandas as pd;
from bs4 import BeautifulSoup;
import numpy as np;

#set the url to scrape
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M';

#get the parsed text from the url using the lxml parser
html = urllib.request.urlopen(url);
bs = BeautifulSoup(html,'lxml');

#use BeautifulSoup4 to find the table we need
table_object = bs.find(lambda tag: tag.name=='table',attrs={"class": "wikitable sortable"}); 

#use BeautifulSoup4 to get all rows in the table
row_objects = table_object.tbody.find_all(lambda tag: tag.name=='tr');

#create and populate a list to stage the data for pandas
toronto_data_row_list = [];

#define a function to get rows by tag (is it header or data?)
def get_rows_by_tag(tr, column_tag='td'): # td (data) or th (header)       
        return [td.get_text(strip=True) for td in tr.find_all(column_tag)];

#append the header row to the list
toronto_headers = get_rows_by_tag(row_objects[0], 'th');
toronto_data_row_list.append(toronto_headers);

#append the data rows to the list, skipping the header row (it has no td cells)
for tr in row_objects[1:]:
    toronto_data_row_list.append(get_rows_by_tag(tr, 'td'));

#use pandas to create a dataframe from the list
toronto_pc_df = pd.DataFrame(toronto_data_row_list[1:], columns=toronto_data_row_list[0]);

#delete rows with null Postcode
toronto_pc_df = toronto_pc_df[toronto_pc_df.Postcode.notnull()] 

#delete rows with Borough = 'Not assigned'
toronto_pc_df = toronto_pc_df[toronto_pc_df.Borough != 'Not assigned'] 

#aggregate rows with the same Postcode: keep the first Borough and join the Neighbourhood values into a comma-delimited string
toronto_pc_df = toronto_pc_df.groupby(['Postcode']).  \
    agg({'Borough' : 'first' , 'Neighbourhood' : ', '.join})  \
   .reset_index()  \
   .reindex(columns = toronto_pc_df.columns)

#rename Postcode column to PostalCode as shown in assignment
toronto_pc_df.rename(columns = {'Postcode' : 'PostalCode'}, inplace = True)

#where 'Neighbourhood' is Not assigned, copy the Borough to Neighbourhood
toronto_pc_df.loc[(toronto_pc_df.Neighbourhood=='Not assigned'), 'Neighbourhood'] = toronto_pc_df.Borough

#print the result as a string for inspection
print(toronto_pc_df.to_string())

#display the dataframe in the format specified in the assignment
toronto_pc_df.head(12)


    PostalCode           Borough                                      Neighbourhood
0          M1B       Scarborough                                     Rouge, Malvern
1          M1C       Scarborough             Highland Creek, Rouge Hill, Port Union
2          M1E       Scarborough                  Guildwood, Morningside, West Hill
3          M1G       Scarborough                                             Woburn
4          M1H       Scarborough                                          Cedarbrae
5          M1J       Scarborough                                Scarborough Village
6          M1K       Scarborough        East Birchmount Park, Ionview, Kennedy Park
7          M1L       Scarborough                    Clairlea, Golden Mile, Oakridge
8          M1M       Scarborough    Cliffcrest, Cliffside, Scarborough Village West
9          M1N       Scarborough                        Birch Cliff, Cliffside West
10         M1P       Scarborough  Dorset Park, Scarborough Town Centre, Wexf

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [2]:
#as required by the assignment return the shape of the transformed and cleaned Toronto dataset
toronto_pc_df.shape

(103, 3)
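
The cleaning steps in Part 1 can be tried on a handful of rows before running them against the full scraped table. The following is a minimal sketch, assuming a toy dataframe with the same three columns; the rows and the names `raw`, `clean`, and `mask` are illustrative only and are not taken from the scraped data.

```python
import pandas as pd

# Toy rows shaped like the scraped table (values are illustrative only).
raw = pd.DataFrame(
    {
        "Postcode": ["M1A", "M3A", "M5A", "M5A", "M7A"],
        "Borough": ["Not assigned", "North York", "Downtown Toronto",
                    "Downtown Toronto", "Queen's Park"],
        "Neighbourhood": ["Not assigned", "Parkwoods", "Harbourfront",
                          "Regent Park", "Not assigned"],
    }
)

# 1. Drop rows whose borough is unassigned.
clean = raw[raw["Borough"] != "Not assigned"]

# 2. Collapse duplicate postal codes into one comma-separated row.
clean = (
    clean.groupby("Postcode", as_index=False)
    .agg({"Borough": "first", "Neighbourhood": ", ".join})
)

# 3. Fall back to the borough name where the neighbourhood is unassigned.
mask = clean["Neighbourhood"] == "Not assigned"
clean.loc[mask, "Neighbourhood"] = clean.loc[mask, "Borough"]

print(clean)
```

Checking the result on a few known rows makes it easier to spot a wrong grouping key or a missed "Not assigned" value before the shape is compared against the assignment.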

### Part 2: Get Geocodes for Each Postal Code and Add Them to the Dataset

In [4]:
#repeated attempts to use the geocoder package failed to return coordinates,
#so the coordinates are read from the provided csv file instead.

#ingest the data from the csv file into a pandas dataframe
#note that the original Postal Code column name was changed to PostalCode using Excel
#the csv file was downloaded to the development Windows workstation.
toronto_gc_df = pd.read_csv('C:/Users/mike/Desktop/Geospatial_Coordinates.csv', sep = ',');

#merge the original dataframe with the geocode dataframe joining on PostalCode
toronto_pc_gc_df = pd.merge(toronto_pc_df, toronto_gc_df, on='PostalCode');

#print the result for inspection
print(toronto_pc_gc_df.to_string());

#display the dataframe in the format specified in the assignment
toronto_pc_gc_df.head(12)


    PostalCode           Borough                                      Neighbourhood   Latitude  Longitude
0          M1B       Scarborough                                     Rouge, Malvern  43.806686 -79.194353
1          M1C       Scarborough             Highland Creek, Rouge Hill, Port Union  43.784535 -79.160497
2          M1E       Scarborough                  Guildwood, Morningside, West Hill  43.763573 -79.188711
3          M1G       Scarborough                                             Woburn  43.770992 -79.216917
4          M1H       Scarborough                                          Cedarbrae  43.773136 -79.239476
5          M1J       Scarborough                                Scarborough Village  43.744734 -79.239476
6          M1K       Scarborough        East Birchmount Park, Ionview, Kennedy Park  43.727929 -79.262029
7          M1L       Scarborough                    Clairlea, Golden Mile, Oakridge  43.711112 -79.284577
8          M1M       Scarborough    Cliffcrest

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


In [5]:
#check the shape of the completed dataset to avoid any surprises
toronto_pc_gc_df.shape

(103, 5)
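
The default inner merge silently drops any postal code that has no match in the coordinates file, so the shape check is the only safeguard. A hedged alternative is sketched below, assuming two small stand-in dataframes (`codes` and `coords`, with a made-up `M9Z` row): a left merge with `indicator=True` lists the unmatched codes explicitly.

```python
import pandas as pd

# Stand-in frames; the 'M9Z' row is hypothetical and exists only to show a miss.
codes = pd.DataFrame({"PostalCode": ["M1B", "M1C", "M9Z"],
                      "Borough": ["Scarborough", "Scarborough", "Hypothetical"]})
coords = pd.DataFrame({"PostalCode": ["M1B", "M1C"],
                       "Latitude": [43.806686, 43.784535],
                       "Longitude": [-79.194353, -79.160497]})

# Keep every postal code and record where each row came from.
checked = codes.merge(coords, on="PostalCode", how="left",
                      indicator=True, validate="one_to_one")

# Postal codes that found no coordinates.
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

`validate="one_to_one"` additionally raises an error if either side contains a duplicated postal code, which would otherwise multiply rows in the merged result.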

### Part 3: Explore and Cluster Toronto Neighborhoods

In [55]:
#get all required libraries

#!pip install geopy --quiet; 
from geopy.geocoders import Nominatim; # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm;
import matplotlib.colors as colors;

# import k-means from clustering stage
from sklearn.cluster import KMeans;

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium; # map rendering library

In [56]:
#get the geographical coordinates of the City of Toronto
address = 'Toronto, ON';
geolocator = Nominatim(user_agent="t_explorer");
location = geolocator.geocode(address);
latitude = location.latitude;
longitude = location.longitude;
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude));

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [57]:
#generate a map of the city of Toronto, with markers for the neighborhoods
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_pc_gc_df['Latitude'], toronto_pc_gc_df['Longitude'], toronto_pc_gc_df['Borough'], toronto_pc_gc_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#### Use the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [9]:

CLIENT_ID = 'XWT3TKOHVZGYK01HQXA55GEVK0ELBASPAATUMIEZZUU0O4TZ' #  Foursquare ID

CLIENT_SECRET = '1BSG4SRZIW4GRSPXJ4Q5DR0BQ0FWMWGP4XJYP02WYPM2ITQ' #  Foursquare Secret

VERSION = '20180605' # Foursquare API version

print('Your credentials:')

print('CLIENT_ID: ' + CLIENT_ID)

print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: XWT3TKOHVZGYK01HQXA55GEVK0ELBASPAATUMIEZZUU0O4TZ
CLIENT_SECRET:1BSG4SRZIW4GRSPXJ4Q5DR0BQ0FWMWGP4XJYP02WYPM2ITQ
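
The cell above hard-codes the credentials and prints them into the notebook output. One common alternative is to read them from environment variables and mask them when printing. This is only a sketch: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_foursquare_credentials` and `mask` are assumptions made for illustration, not part of the Foursquare API.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) from a mapping, or raise if one is missing."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


def mask(secret, visible=4):
    """Show only the last few characters of a secret when printing."""
    return "*" * max(len(secret) - visible, 0) + secret[-visible:]


# Demonstration with a stand-in mapping instead of the real environment.
demo_env = {"FOURSQUARE_CLIENT_ID": "EXAMPLE-ID-1234",
            "FOURSQUARE_CLIENT_SECRET": "EXAMPLE-SECRET-5678"}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_ID:", mask(client_id))
print("CLIENT_SECRET:", mask(client_secret))
```

In the notebook itself the call would be `load_foursquare_credentials()` with no argument, so the values come from the real environment and never appear in the saved file.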


#### Print the very first neighborhood(s) in our Toronto dataframe

In [10]:
toronto_pc_gc_df.loc[0, 'Neighbourhood']

'Rouge, Malvern'

#### Reuse the function from the first Foursquare lab so we can get venue categories.

In [11]:
# function that extracts the category of the venue
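def get_category_type(row):
    # the categories live under 'categories' or, in flattened results, under 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']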

#### Reuse the "nearby venues" function from the New York lab.

In [15]:
TheMagicToken = "TVUYECOF53QZZCSQOQ4QPQ55LZRW2SKDJNIE5VUWJCGFLFIY"

import requests
LIMIT = 50 # limit of number of venues returned by Foursquare API

radius = 500 # NOTE: not used by the call below; getNearbyVenues is called without a radius, so its default of 300 meters applies

def getNearbyVenues(names, latitudes, longitudes, radius=300):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
       # print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&oauth_token={}&v={}&ll={},{}&radius={}&limit={}'.format(
            TheMagicToken,
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
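
`getNearbyVenues` indexes straight into the JSON response, so a rate-limit reply or a venue without a category raises an exception partway through the loop. The parsing step could be separated out and made tolerant, as sketched here. The helper name `extract_venues` and the `sample` payload are hypothetical; the payload only mimics the nesting that the function above already relies on.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into venue rows.

    Returns an empty list if the expected keys are missing, and uses None
    for venues that carry no category.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        location = venue.get("location", {})
        rows.append((
            name, lat, lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0]["name"] if categories else None,
        ))
    return rows


# Made-up payload with one categorized and one uncategorized venue.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Cafe",
               "location": {"lat": 43.80, "lng": -79.19},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Uncategorized Place",
               "location": {"lat": 43.81, "lng": -79.20},
               "categories": []}},
]}]}}

rows = extract_venues(sample, "Rouge, Malvern", 43.806686, -79.194353)
print(rows)

# A reply without a 'response' key yields no rows instead of a KeyError.
print(extract_venues({"meta": {"code": 429}}, "Rouge, Malvern", 43.806686, -79.194353))
```

Keeping the parsing in its own function also makes it testable without calling the API.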

### Get the response

In [32]:
toronto_venues = getNearbyVenues(names=toronto_pc_gc_df['Neighbourhood'],
                                   latitudes=toronto_pc_gc_df['Latitude'],
                                   longitudes=toronto_pc_gc_df['Longitude']
                                  )

In [None]:
### Take a quick look at our data again.

In [34]:
print(toronto_venues.shape)
toronto_venues.head()

(1411, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge, Malvern",43.806686,-79.194353,Rex Pak Food Packaging Ltd,43.805459,-79.194344,Business Service
1,"Rouge, Malvern",43.806686,-79.194353,R & K Woodworking Specialists Inc,43.808233,-79.196857,Construction & Landscaping
2,"Rouge, Malvern",43.806686,-79.194353,NT Home Service Inc.,43.806411,-79.197736,Home Service
3,"Rouge, Malvern",43.806686,-79.194353,shield shutters,43.804665,-79.196306,Business Service
4,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,RIGHT WAY TO GOLF,43.785177,-79.161108,Golf Course


### Now get a count of venues by neighborhood.

In [35]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",50,50,50,50,50,50
Agincourt,4,4,4,4,4,4
"Agincourt North, L'Amoreaux East, Milliken, Steeles East",3,3,3,3,3,3
"Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown",2,2,2,2,2,2
"Alderwood, Long Branch",6,6,6,6,6,6
"Bathurst Manor, Downsview North, Wilson Heights",21,21,21,21,21,21
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",31,31,31,31,31,31
Berczy Park,29,29,29,29,29,29
"Birch Cliff, Cliffside West",2,2,2,2,2,2


### Wide variation in the number of venues per neighborhood.  How many unique categories?

In [37]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 253 unique categories.


### OK, 253 categories in 103 postal codes.  K-means will have plenty to chew on.  Let's run it for 5 clusters.

In [45]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
# (note: a venue category literally named 'Neighborhood' would be overwritten by this assignment)
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
colms = list(toronto_onehot)
colms.insert(0, colms.pop(colms.index('Neighborhood')))
toronto_onehot = toronto_onehot.reindex(columns = colms)

toronto_onehot.head()

Unnamed: 0,Neighborhood,ATM,Accessories Store,Adult Boutique,Airport Food Court,Airport Gate,Airport Lounge,Airport Terminal,American Restaurant,Arepa Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Rouge, Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Rouge, Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Rouge, Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Rouge, Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Highland Creek, Rouge Hill, Port Union",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Inspect the shape of toronto_onehot

In [46]:
toronto_onehot.shape

(1411, 253)

### Continue to follow the New York example: group toronto_onehot by neighborhood into a new dataframe and get the mean of category frequency.

In [47]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,ATM,Accessories Store,Adult Boutique,Airport Food Court,Airport Gate,Airport Lounge,Airport Terminal,American Restaurant,Arepa Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,...,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Bathurst Manor, Downsview North, Wilson Heights",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Birch Cliff, Cliffside West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### We already know the shape but anyway...

In [48]:
toronto_grouped.shape

(98, 253)
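
Taking the mean of one-hot columns per neighborhood is the same as dividing each category count by the number of venues in that neighborhood. The equivalence can be checked by hand with plain Python, as in this sketch; the `venues` pairs are made up for illustration.

```python
from collections import Counter, defaultdict

# (neighborhood, venue category) pairs; values are illustrative only.
venues = [
    ("Agincourt", "Breakfast Spot"),
    ("Agincourt", "Convenience Store"),
    ("Agincourt", "Convenience Store"),
    ("Agincourt", "Lounge"),
    ("Woburn", "Korean Restaurant"),
]

# Count categories per neighborhood.
counts = defaultdict(Counter)
for neighborhood, category in venues:
    counts[neighborhood][category] += 1

# Frequency = category count / total venues in the neighborhood.
frequencies = {
    neighborhood: {cat: n / sum(cat_counts.values()) for cat, n in cat_counts.items()}
    for neighborhood, cat_counts in counts.items()
}
print(frequencies["Agincourt"])
```

A neighborhood with a single venue gets a frequency of 1.0 for that one category, which is why sparsely covered postal codes can dominate the clustering that follows.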

In [None]:
### Print each neighborhood along with the top 5 most common venues.

In [50]:
num_top_venues = 5

for turf in toronto_grouped['Neighborhood']:
    print("----"+turf+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == turf].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
              venue  freq
0        Steakhouse  0.08
1  Asian Restaurant  0.06
2             Hotel  0.04
3               Bar  0.04
4              Café  0.04


----Agincourt----
                        venue  freq
0           Convenience Store  0.25
1   Latin American Restaurant  0.25
2              Breakfast Spot  0.25
3  Construction & Landscaping  0.25
4           Outdoor Sculpture  0.00


----Agincourt North, L'Amoreaux East, Milliken, Steeles East----
                      venue  freq
0                      Park  0.33
1       Arts & Crafts Store  0.33
2               Coffee Shop  0.33
3  Mediterranean Restaurant  0.00
4        Miscellaneous Shop  0.00


----Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown----
                        venue  freq
0  Financial or Legal Service   0.5
1                    Pharmacy   0.5
2                         ATM   0.0
3           Mobile Phone Shop   0.0
4   

## Create a pandas dataframe containing the top n venues.

### First sort venues in descending order

In [51]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

### Now create a new dataframe with the top 10 venues in each neighborhood.

In [61]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # positions past 'rd' use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Steakhouse,Asian Restaurant,Bar,American Restaurant,Seafood Restaurant,Café,Sushi Restaurant,Pizza Place,Japanese Restaurant,Hotel
1,Agincourt,Construction & Landscaping,Convenience Store,Latin American Restaurant,Breakfast Spot,Dive Bar,Farm,Falafel Restaurant,Event Space,Event Service,Ethiopian Restaurant
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Park,Coffee Shop,Arts & Crafts Store,Yoga Studio,Dry Cleaner,Dive Bar,Doctor's Office,Dog Run,Donut Shop,Electronics Store
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Pharmacy,Financial or Legal Service,Yoga Studio,Diner,Falafel Restaurant,Event Space,Event Service,Ethiopian Restaurant,Electronics Store,Dumpling Restaurant
4,"Alderwood, Long Branch",Gym,Coffee Shop,Dance Studio,Pizza Place,Pharmacy,Pub,Costume Shop,Diner,Event Space,Event Service


## At long last, run K-means to calculate 5 clusters of neighborhoods.

In [76]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 2, 1, 1, 1, 1, 1, 1, 1])
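
Nine of the first ten labels fall in cluster 1, so it is worth checking how evenly the rows are spread before interpreting the clusters. A quick tally is sketched below using only the ten labels printed above; in the notebook the same `Counter` call could be applied to `kmeans.labels_`.

```python
from collections import Counter

labels = [1, 1, 2, 1, 1, 1, 1, 1, 1, 1]  # the first ten labels printed above

# Tally how many rows landed in each cluster.
sizes = Counter(labels)
total = len(labels)
for cluster, size in sorted(sizes.items()):
    print(f"cluster {cluster}: {size} rows ({size / total:.0%})")

# Share of rows in the largest cluster; a value near 1 means one cluster dominates.
largest_share = max(sizes.values()) / total
```

If one cluster holds most of the rows, a different number of clusters or a different feature scaling may be worth trying.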

### Now create a dataframe for each cluster and its top 10 categories

In [77]:
# add clustering labels (guarded so that re-running the cell does not fail on a duplicate column)
if 'Cluster Labels' not in neighborhoods_venues_sorted.columns:
    neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_);

toronto_merged = toronto_pc_gc_df;

# join the sorted venues (with cluster labels) onto the postal code dataframe, which holds latitude/longitude
# note the 'u' in the original Toronto dataframe column 'Neighbourhood'!
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood');

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,3.0,Business Service,Construction & Landscaping,Home Service,Discount Store,Farm,Falafel Restaurant,Event Space,Event Service,Ethiopian Restaurant,Electronics Store
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0.0,Construction & Landscaping,Golf Course,Yoga Studio,Dumpling Restaurant,Dive Bar,Doctor's Office,Dog Run,Donut Shop,Dry Cleaner,Ethiopian Restaurant
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,3.0,Electronics Store,Business Service,Yoga Studio,Discount Store,Farm,Falafel Restaurant,Event Space,Event Service,Ethiopian Restaurant,Dumpling Restaurant
3,M1G,Scarborough,Woburn,43.770992,-79.216917,4.0,Korean Restaurant,Yoga Studio,Fast Food Restaurant,Farm,Falafel Restaurant,Event Space,Event Service,Ethiopian Restaurant,Electronics Store,Dumpling Restaurant
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,1.0,Caribbean Restaurant,Lounge,Shipping Store,Thai Restaurant,Athletics & Sports,Spa,Bank,Hakka Restaurant,Gym,Event Service


## And now, draw a nice map so we can visualize the clusters of neighborhoods.

In [100]:
# As so many have discovered, toronto_merged contains NaN rows (row 102 in Etobicoke, for one) where a postal code has no match in the venue data.  Drop them before mapping.

#toronto_merged

toronto_merged = toronto_merged.dropna()
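
The NaN rows come from the join: a neighbourhood with no row in the venue table gets NaN in every joined column, and that also turns the integer cluster labels into floats (hence `3.0`, `0.0`, and so on). The effect can be reproduced on a toy pair of frames, as in this sketch; `left`, `right`, and the "No Venues Here" row are illustrative assumptions.

```python
import pandas as pd

# Toy frames: the third neighbourhood has no match on the right-hand side.
left = pd.DataFrame({"Neighbourhood": ["Woburn", "Cedarbrae", "No Venues Here"]})
right = pd.DataFrame({"Neighborhood": ["Woburn", "Cedarbrae"],
                      "Cluster Labels": [4, 1]})

merged = left.join(right.set_index("Neighborhood"), on="Neighbourhood")
print(merged["Cluster Labels"].dtype)  # float dtype: NaN cannot live in an integer column

# Inspect which rows failed to match before dropping them.
missing = merged[merged["Cluster Labels"].isna()]
print(missing["Neighbourhood"].tolist())

# Drop the unmatched rows and restore integer labels.
cleaned = merged.dropna(subset=["Cluster Labels"]).copy()
cleaned["Cluster Labels"] = cleaned["Cluster Labels"].astype(int)
print(cleaned)
```

Listing the unmatched rows first shows how many postal codes are about to be removed, rather than assuming there is only one.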

In [102]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))

rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    print(cluster)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters



3.0
0.0
3.0
4.0
1.0
1.0
1.0
3.0
1.0
1.0
1.0
1.0
1.0
1.0
2.0
1.0
1.0
1.0
1.0
2.0
2.0
2.0
1.0
2.0
1.0
1.0
1.0
1.0
3.0
1.0
1.0
3.0
1.0
1.0
1.0
2.0
1.0
1.0
2.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
2.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
2.0
2.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
0.0
1.0
1.0
0.0
1.0
1.0
0.0
2.0
1.0
1.0
1.0
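
The marker color is looked up as `rainbow[int(cluster)-1]`, so the palette is shifted by one and cluster 0 wraps around to the last color. The indexing can be illustrated with a plain list, as in this sketch; the color names in `palette` are rough stand-ins for the five hex values that `cm.rainbow` produces, and `marker_color` is a hypothetical helper.

```python
# Approximate names for a five-step rainbow palette (illustrative stand-ins).
palette = ["purple", "blue", "green", "orange", "red"]


def marker_color(cluster, colors=palette):
    """Mirror the notebook's lookup: index -1 wraps to the last palette entry."""
    return colors[int(cluster) - 1]


for cluster in [0.0, 1.0, 2.0, 3.0, 4.0]:
    print(int(cluster), marker_color(cluster))
```

This is why cluster 0 is described as red and cluster 1 as purple in the headings that follow; indexing with `int(cluster)` alone would assign the colors in palette order instead.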


## So there are the clusters mapped onto Toronto in different colors.  Now let's look at each cluster.

### Cluster 0: On The Outskirts, Construction, Services, Restaurant, Recreation, Retail (mapped in red)

In [103]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Scarborough,0.0,Construction & Landscaping,Golf Course,Yoga Studio,Dumpling Restaurant,Dive Bar,Doctor's Office,Dog Run,Donut Shop,Dry Cleaner,Ethiopian Restaurant
91,Etobicoke,0.0,Construction & Landscaping,Breakfast Spot,Locksmith,Cosmetics Shop,Costume Shop,Farm,Falafel Restaurant,Event Space,Convenience Store,Event Service
94,Etobicoke,0.0,Golf Course,Print Shop,Yoga Studio,Dry Cleaner,Discount Store,Dive Bar,Doctor's Office,Dog Run,Donut Shop,Dumpling Restaurant
97,North York,0.0,Construction & Landscaping,Yoga Studio,Diner,Farm,Falafel Restaurant,Event Space,Event Service,Ethiopian Restaurant,Electronics Store,Dumpling Restaurant


### Cluster 1: The Salad Bowl, most populated cluster with greatest variety (mapped in purple)

In [104]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Scarborough,1.0,Caribbean Restaurant,Lounge,Shipping Store,Thai Restaurant,Athletics & Sports,Spa,Bank,Hakka Restaurant,Gym,Event Service
5,Scarborough,1.0,Playground,Yoga Studio,Dim Sum Restaurant,Falafel Restaurant,Event Space,Event Service,Ethiopian Restaurant,Electronics Store,Dumpling Restaurant,Dry Cleaner
6,Scarborough,1.0,Convenience Store,Pharmacy,Yoga Studio,Diner,Falafel Restaurant,Event Space,Event Service,Ethiopian Restaurant,Electronics Store,Dumpling Restaurant
8,Scarborough,1.0,Movie Theater,Skating Rink,Intersection,Motel,American Restaurant,Yoga Studio,Doctor's Office,Dog Run,Donut Shop,Dry Cleaner
9,Scarborough,1.0,Café,Farm,Yoga Studio,Falafel Restaurant,Event Space,Event Service,Ethiopian Restaurant,Electronics Store,Dumpling Restaurant,Dry Cleaner
10,Scarborough,1.0,Construction & Landscaping,Brewery,Thrift / Vintage Store,Indian Restaurant,Light Rail Station,Event Service,Event Space,Ethiopian Restaurant,Falafel Restaurant,Dumpling Restaurant
11,Scarborough,1.0,Miscellaneous Shop,Coffee Shop,Bookstore,Intersection,Indian Restaurant,Event Service,Event Space,Ethiopian Restaurant,Electronics Store,Discount Store
12,Scarborough,1.0,Construction & Landscaping,Convenience Store,Latin American Restaurant,Breakfast Spot,Dive Bar,Farm,Falafel Restaurant,Event Space,Event Service,Ethiopian Restaurant
13,Scarborough,1.0,Shopping Mall,Flower Shop,Bank,Rental Car Location,Yoga Studio,Discount Store,Event Space,Event Service,Ethiopian Restaurant,Electronics Store
15,Scarborough,1.0,Mobile Phone Shop,Fast Food Restaurant,Clothing Store,Electronics Store,Grocery Store,Coffee Shop,Chinese Restaurant,Pharmacy,Sandwich Place,Bubble Tea Shop


### Cluster 2: The Good Life, Recreation, Services, and Retail (mapped in medium blue)

In [105]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,Scarborough,2.0,Park,Coffee Shop,Arts & Crafts Store,Yoga Studio,Dry Cleaner,Dive Bar,Doctor's Office,Dog Run,Donut Shop,Electronics Store
21,North York,2.0,Gym,Home Service,Park,Discount Store,Dive Bar,Doctor's Office,Dog Run,Donut Shop,Dry Cleaner,Dumpling Restaurant
22,North York,2.0,Park,Coffee Shop,Yoga Studio,Dry Cleaner,Discount Store,Dive Bar,Doctor's Office,Dog Run,Donut Shop,Electronics Store
23,North York,2.0,Park,Convenience Store,Bank,Yoga Studio,Dry Cleaner,Dive Bar,Doctor's Office,Dog Run,Donut Shop,Electronics Store
25,North York,2.0,Park,Construction & Landscaping,Convenience Store,Diner,Falafel Restaurant,Event Space,Event Service,Ethiopian Restaurant,Electronics Store,Dumpling Restaurant
37,East Toronto,2.0,Other Great Outdoors,Park,Trail,Dry Cleaner,Discount Store,Dive Bar,Doctor's Office,Dog Run,Donut Shop,Dumpling Restaurant
40,East York,2.0,Park,Film Studio,Farmers Market,Falafel Restaurant,Event Space,Event Service,Ethiopian Restaurant,Electronics Store,Dumpling Restaurant,Dry Cleaner
50,Downtown Toronto,2.0,Park,Dim Sum Restaurant,Falafel Restaurant,Event Space,Event Service,Ethiopian Restaurant,Electronics Store,Dumpling Restaurant,Dry Cleaner,Donut Shop
81,York,2.0,Outdoor Supply Store,Park,Bakery,Diner,Falafel Restaurant,Event Space,Event Service,Ethiopian Restaurant,Electronics Store,Dumpling Restaurant
82,West Toronto,2.0,Park,Event Service,Dim Sum Restaurant,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Dumpling Restaurant,Dry Cleaner,Donut Shop


### Cluster 3: The Rest Of Us, Service, Retail, Farm, and Ethnic Restaurant

In [106]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Scarborough,3.0,Business Service,Construction & Landscaping,Home Service,Discount Store,Farm,Falafel Restaurant,Event Space,Event Service,Ethiopian Restaurant,Electronics Store
2,Scarborough,3.0,Electronics Store,Business Service,Yoga Studio,Discount Store,Farm,Falafel Restaurant,Event Space,Event Service,Ethiopian Restaurant,Dumpling Restaurant
7,Scarborough,3.0,Business Service,Yoga Studio,Discount Store,Farm,Falafel Restaurant,Event Space,Event Service,Ethiopian Restaurant,Electronics Store,Dumpling Restaurant
30,North York,3.0,Construction & Landscaping,Electronics Store,Bus Stop,Discount Store,Farm,Falafel Restaurant,Event Space,Event Service,Ethiopian Restaurant,Dumpling Restaurant
33,North York,3.0,Electronics Store,Liquor Store,Yoga Studio,Diner,Farm,Falafel Restaurant,Event Space,Event Service,Ethiopian Restaurant,Dumpling Restaurant


### Cluster 4: New Arrivals, Restaurant, Services, Farm, and Retail

In [110]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Scarborough,4.0,Korean Restaurant,Yoga Studio,Fast Food Restaurant,Farm,Falafel Restaurant,Event Space,Event Service,Ethiopian Restaurant,Electronics Store,Dumpling Restaurant


#  Thanks y'all.  That was interesting.