## Segmenting and Clustering Neighborhoods in Toronto
(All submissions are in this notebook, separated by Part/Submission headers.)

In [196]:
# Install the required packages if they are not already available
# !pip install BeautifulSoup4
# !pip install geocoder
# !pip install geopy
# !pip install folium
# !pip install -U scikit-learn scipy

In [188]:
# Import all required libraries
import pandas as pd
import requests 
from bs4 import BeautifulSoup
import geocoder  # only used by the commented-out geocoding cell in Part 2
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

### Part 1 (Submission 1)

First, use BeautifulSoup to parse the Toronto neighborhoods table with postal codes, following the requirements of the assignment.

In [189]:
url= 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html=requests.get(url).text

soup = BeautifulSoup(html, 'html.parser')


table_contents=[]
table=soup.find('table')
for row in table.find_all('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

# print(table_contents)
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
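
The string handling inside the loop above can be isolated into a small helper, which makes it easier to check on a single cell. This is a minimal sketch, not the notebook's own code: the function name `split_borough_and_neighborhoods` is hypothetical, and the example string is an assumption about how a cell's `span` text looks, inferred from the parsing logic.

```python
def split_borough_and_neighborhoods(span_text):
    """Split 'Borough(Hood A / Hood B)' into ('Borough', 'Hood A, Hood B')."""
    # Everything before the first '(' is the borough; the rest lists neighborhoods.
    borough, _, rest = span_text.partition('(')
    # Drop closing parentheses, then turn ' / ' separators into ', '.
    neighborhoods = rest.replace(')', ' ').replace(' /', ',').strip()
    return borough.strip(), neighborhoods


# Assumed cell text, for illustration only.
example = 'Downtown Toronto(Regent Park / Harbourfront)'
borough, hoods = split_borough_and_neighborhoods(example)
print(borough, '|', hoods)
```

Unlike the chained `split('(')[1]` call, `partition` does not raise when a cell has no parentheses; it simply returns an empty neighborhood string.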

Now let's make sure our dataframe meets the assignment requirements by checking for NaN values, duplicate postal codes, and "Not assigned" entries. The process is described in the code comments below. The results, together with an inspection of the dataframe, show that the requirements are met.

In [192]:
# Assuming unassigned boroughs would show up as NaN, count the NaN values in each column of df:
print("Nan values- \n", df.isna().sum(), "\n")
# Confirm there are no empty, NaN, or "Not assigned" boroughs:
print("Unique Borough values- \n", df['Borough'].unique(), "\n")
# Check for duplicate postal codes by comparing the shape of the original df with the unique() result
print("Original df shape: ", df.shape, "    Shape of unique df series: ", (df["PostalCode"].unique()).shape, "\n")
# (In addition, as the first rows of the table below show, neighborhoods sharing a postal code are already combined with ",")

# Also make sure there are no 'Not assigned' neighborhoods:
print("Unique Neighborhood values- \n", df["Neighborhood"].unique())

Nan values- 
 PostalCode      0
Borough         0
Neighborhood    0
dtype: int64 

Unique Borough values- 
 ['North York' 'Downtown Toronto' "Queen's Park" 'Etobicoke' 'Scarborough'
 'East York' 'York' 'East Toronto' 'West Toronto' 'East York/East Toronto'
 'Central Toronto' 'Mississauga' 'Downtown Toronto Stn A'
 'Etobicoke Northwest' 'East Toronto Business'] 

Original df shape:  (103, 3)     Shape of unique df series:  (103,) 

Unique Neighborhood values- 
 ['Parkwoods' 'Victoria Village' 'Regent Park, Harbourfront'
 'Lawrence Manor, Lawrence Heights' 'Ontario Provincial Government'
 'Islington Avenue' 'Malvern, Rouge' 'Don Mills North'
 'Parkview Hill, Woodbine Gardens' 'Garden District, Ryerson' 'Glencairn'
 'West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale'
 'Rouge Hill, Port Union, Highland Creek' 'Don Mills South'
 'Woodbine Heights' 'St. James Town' 'Humewood-Cedarvale'
 'Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood'
 'Guildwood, Mornings
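
The checks above rely on reading printed output. As a hedged alternative, the same requirements could be expressed as assertions that fail loudly. The helper name `validate_postal_table` and the three-row sample are assumptions for illustration; the sample rows mirror the first rows of the table.

```python
import pandas as pd


def validate_postal_table(frame):
    """Raise AssertionError if the table violates the assignment rules."""
    assert not frame.isna().any().any(), "table contains NaN values"
    assert frame['PostalCode'].is_unique, "duplicate postal codes found"
    assert not frame['Borough'].eq('Not assigned').any(), "unassigned borough present"
    return True


# Small stand-in for df, for illustration only.
sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront'],
})
print(validate_postal_table(sample))
```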

**Then, we present the first 12 rows of our dataframe**

In [193]:
df.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills North
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


**Finally, show the number of rows in the dataframe**

In [194]:
print("num rows in toronto neighborhoods dataframe: ", df.shape[0])

num rows in toronto neighborhoods dataframe:  103


### Part 2 (Submission 2)

The code below would have obtained the latitude and longitude coordinates of all the postal codes, but it was not used due to the limitations of geocoder.

In [185]:
# # Add empty Latitude and Longitude columns to df:
# df["Latitude"] = ""
# df["Longitude"] = ""

# # Loop through rows in df and assign latitude and longitude (known each row has unique postal code)
# for ind in df.index:
#     # initialize your variable to None
#     lat_lng_coords = None
    
#     postal_code = df["PostalCode"][ind]

#     # loop until you get the coordinates
#     while(lat_lng_coords is None):
#       g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#       lat_lng_coords = g.latlng

#     latitude = lat_lng_coords[0]
#     longitude = lat_lng_coords[1]
#     df.loc[ind, "Latitude"] = latitude
#     df.loc[ind, "Longitude"] = longitude
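
The `while` loop in the commented-out cell never exits if the service keeps returning `None`. A bounded retry is one way around that. This is a sketch under stated assumptions: `geocode_with_retries` and `flaky_lookup` are hypothetical names, and the stand-in lookup replaces the real network call so the example runs offline.

```python
import time


def geocode_with_retries(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # pause before asking again
    return None  # give up instead of looping forever


# Stand-in lookup that fails twice before succeeding (hypothetical, for illustration).
calls = {'n': 0}


def flaky_lookup(query):
    calls['n'] += 1
    return None if calls['n'] < 3 else [43.65426, -79.360636]


coords = geocode_with_retries(flaky_lookup, 'M5A, Toronto, Ontario')
print(coords, 'after', calls['n'], 'calls')
```

With the real service, `lookup` could be something like `lambda q: geocoder.google(q).latlng`, matching the call used in the cell above.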

In [186]:
# Just used to drop unused columns to avoid duplicates 
#df.drop(labels=["Latitude", "Longitude"], axis=1, inplace=True)

Instead, we join the provided .csv of latitude and longitude values to our table.

First, fetch the csv and convert it to a pandas df:

In [197]:
# Get .csv with geospatial data
url2 = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv'
latLonDf = pd.read_csv(url2)
# Standardize Post Code column name
latLonDf.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
latLonDf # Check the dataframe is as intended

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


Now, join tables based on the postal code column

In [198]:
location_df = df.join(latLonDf.set_index('PostalCode'), on='PostalCode') # join on PostalCode only, keeping every row of df
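
A join like this silently leaves NaN coordinates for any postal code missing from the csv. One hedged way to surface that is `DataFrame.merge` with its `validate` and `indicator` options. The frames below are small stand-ins, and `M0X` is a made-up postal code used only to show an unmatched row.

```python
import pandas as pd

# Stand-in frames for illustration; 'M0X' is a made-up code with no coordinates.
left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M0X'],
                     'Borough': ['North York', 'North York', 'Example']})
right = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# validate raises if a postal code repeats on either side;
# indicator adds a '_merge' column recording where each row came from.
merged = left.merge(right, on='PostalCode', how='left',
                    validate='one_to_one', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```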

**Finally, display the new location df, which includes latitude and longitude values**

In [199]:
location_df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto Business,Enclave of M4L,43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


### Part 3 (Submission 3)

First, let's create a df with only the boroughs whose name contains the word "Toronto".

In [200]:
torontoBr_data = location_df.query('Borough.str.contains("Toronto")', engine='python').reset_index(drop=True)
torontoBr_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
3,M4E,East Toronto,The Beaches,43.676357,-79.293031
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


In [201]:
from geopy.geocoders import Nominatim
import folium

address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="cap_explorer")
location = geolocator.geocode(address)
latitudeT = location.latitude
longitudeT = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitudeT, longitudeT))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Now, all the neighborhoods in boroughs containing 'Toronto' are mapped to show their geographical distribution.

In [203]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitudeT, longitudeT], zoom_start=11)

# add markers to map
for lat, lng, label in zip(torontoBr_data['Latitude'], torontoBr_data['Longitude'], torontoBr_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

**After mapping the Toronto neighborhoods, EXPLORE the venues in each neighborhood**

In [204]:
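import os
# Read the Foursquare credentials from the environment instead of hard-coding them.
# The variable names FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are assumptions; use your own.
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '') # your Foursquare ID
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '') # your Foursquare Secret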
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

Then, define the getNearbyVenues function below to explore nearby venues for all neighborhoods

In [220]:
def getNearbyVenues(names, latitudes, longitudes, radius=600):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
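
The list comprehension in `getNearbyVenues` assumes every venue has at least one category and every response has a `groups` entry. A more defensive extraction step might look like the sketch below. `extract_venues` is a hypothetical helper, and the payload is a hand-written stand-in that only mimics the keys the function above reads (`response`, `groups`, `items`, `venue`).

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into rows, tolerating missing fields."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when a venue carries no category instead of raising IndexError.
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows


# Hand-written stand-in payload, for illustration only.
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Venue X', 'location': {'lat': 1.0, 'lng': 2.0},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Venue Y', 'location': {'lat': 3.0, 'lng': 4.0},
               'categories': []}},
]}]}}
rows = extract_venues(fake_payload, 'Example Hood', 0.0, 0.0)
print(rows)
```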

Then, get the nearby venues and save them to the toronto_venues df.

In [221]:
# Collect nearby venues for every neighborhood in the Toronto boroughs
toronto_venues = getNearbyVenues(names=torontoBr_data['Neighborhood'],
                                   latitudes=torontoBr_data['Latitude'],
                                   longitudes=torontoBr_data['Longitude']
                                  )

Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
The Danforth  East
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West
High Park, The Junction South
North Toronto West
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Rosedale
Enclave of M5E
St. James Town, Cabbagetown
First Canadi

Then, print the shape of the venues df and display the first 5 rows to get an idea of the type and number of venues found.

In [222]:
print('Shape:', toronto_venues.shape)
toronto_venues.head()

Shape: (2034, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
1,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
2,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,"Regent Park, Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant


Now count how many venues were returned in each neighborhood. This will give us some idea of the distribution of venues

In [223]:
group_neigh_venues = (toronto_venues[['Neighborhood', 'Venue']]).rename(columns={'Venue': 'Venue Count'})
group_neigh_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Venue Count
Neighborhood,Unnamed: 1_level_1
Berczy Park,94
"Brockton, Parkdale Village, Exhibition Place",41
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",17
Central Bay Street,98
Christie,17
Church and Wellesley,100
"Commerce Court, Victoria Hotel",100
Davisville,47
Davisville North,14
"Dufferin, Dovercourt Village",20


Let's find out how many unique categories can be curated from all the returned venues

In [224]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 253 unique categories.


That is a large number of distinct venue categories for such a small subset of neighborhoods.

Now ANALYZE each neighborhood.

First, apply one-hot encoding to the venue data.

In [225]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Adult Boutique,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Amphitheater,Animal Shelter,...,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Tram Station,University,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
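
Reordering columns by position depends on the new `Neighborhood` column actually landing last. As a hedged alternative, the key column can be placed by name with `DataFrame.insert`, which also raises if a column of that name already exists. The tiny `venues` frame and its category values are made up for illustration.

```python
import pandas as pd

# Made-up venue data, for illustration only.
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B'],
    'Venue Category': ['Café', 'Park', 'Café'],
})

# One indicator column per category.
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
# Place the key column first by name; insert() raises ValueError on a name clash.
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])
print(onehot)
```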


And let's examine the new dataframe size.

In [226]:
toronto_onehot.shape

(2034, 253)

The one-hot table has 2,034 venue rows and 253 columns, so each neighborhood is described by well over 200 category features.

Next, let's group rows by neighborhood and take the mean of each one-hot column, which gives the frequency of occurrence of each category.

In [227]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Adult Boutique,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Amphitheater,...,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Tram Station,University,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar
0,Berczy Park,0.010638,0.0,0.0,0.0,0.0,0.0,0.0,0.010638,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.021277,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.02439,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.058824,0.058824,0.117647,0.117647,0.117647,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Central Bay Street,0.010204,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.010204,0.0,0.0,0.010204
4,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Church and Wellesley,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,...,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.01
7,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.021277,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Dufferin, Dovercourt Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's confirm the new size

In [228]:
toronto_grouped.shape

(39, 253)
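
To see why the mean of a one-hot column is a frequency, consider a toy example (all values made up): each venue row has exactly one 1, so averaging a column within a neighborhood gives the share of that neighborhood's venues in that category, and the shares in each row add up to 1.

```python
import pandas as pd

# Toy one-hot table: neighborhood A has 3 cafés and 1 park, B has 2 parks.
onehot = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Café': [1, 1, 1, 0, 0, 0],
    'Park': [0, 0, 0, 1, 1, 1],
})

grouped = onehot.groupby('Neighborhood').mean().reset_index()
# Each row of category shares sums to 1.
row_sums = grouped[['Café', 'Park']].sum(axis=1)
print(grouped)
print(row_sums.tolist())
```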

Now print each neighborhood along with the top 3 most common venues. We can already see some patterns in the top venue results

In [229]:
num_top_venues = 3

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
                 venue  freq
0          Coffee Shop  0.10
1   Seafood Restaurant  0.04
2  Japanese Restaurant  0.03


----Brockton, Parkdale Village, Exhibition Place----
            venue  freq
0            Café  0.07
1     Coffee Shop  0.07
2  Breakfast Spot  0.05


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
             venue  freq
0      Coffee Shop  0.12
1   Airport Lounge  0.12
2  Airport Service  0.12


----Central Bay Street----
             venue  freq
0      Coffee Shop  0.16
1             Café  0.07
2  Bubble Tea Shop  0.04


----Christie----
           venue  freq
0  Grocery Store  0.24
1           Café  0.18
2           Park  0.12


----Church and Wellesley----
                 venue  freq
0          Coffee Shop  0.11
1  Japanese Restaurant  0.06
2     Sushi Restaurant  0.05


----Commerce Court, Victoria Hotel----
         venue  freq
0  Coffee Shop  0.08
1         Café  0.07
2    

Now put that into a pandas dataframe.

To do this, first write a function to sort the venues in descending order.

In [230]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
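
An equivalent way to pick the top categories, sketched here as an alternative rather than a replacement, is `Series.nlargest`. It selects the label column by name instead of by position. The function name `top_categories` and the sample row are assumptions for illustration.

```python
import pandas as pd


def top_categories(row, n):
    """Return the n category names with the highest frequency in a grouped row."""
    # Drop the label by name, then cast to float because a mixed row is object dtype.
    freqs = row.drop('Neighborhood').astype(float)
    return freqs.nlargest(n).index.tolist()


# Made-up grouped row, for illustration only.
sample_row = pd.Series({'Neighborhood': 'A', 'Café': 0.5, 'Park': 0.3, 'Pub': 0.2})
print(top_categories(sample_row, 2))
```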

Now create the new dataframe and display the top 10 venues for each neighborhood.

In [231]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Seafood Restaurant,Japanese Restaurant,Bakery,Cocktail Bar,Restaurant,Pub,Café,Hotel,Lounge
1,"Brockton, Parkdale Village, Exhibition Place",Café,Coffee Shop,Breakfast Spot,Restaurant,Sandwich Place,Gift Shop,Bakery,Office,Nightclub,Convenience Store
2,"CN Tower, King and Spadina, Railway Lands, Har...",Coffee Shop,Airport Lounge,Airport Service,Airport Terminal,Harbor / Marina,Boat or Ferry,Sculpture Garden,Bar,Rental Car Location,Boutique
3,Central Bay Street,Coffee Shop,Café,Bubble Tea Shop,Italian Restaurant,Japanese Restaurant,Sushi Restaurant,Burger Joint,Sandwich Place,Ramen Restaurant,Hotel
4,Christie,Grocery Store,Café,Park,Nightclub,Italian Restaurant,Playground,Baby Store,Candy Store,Athletics & Sports,Coffee Shop
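
The column labels above come from a three-element suffix list with a fallback to "th", which is fine for 10 columns but would mislabel 21 or 22. If `num_top_venues` were ever raised, a general ordinal helper could be used instead. `ordinal` is a hypothetical helper name in this sketch.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th are irregular
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f'{n}{suffix}'


columns = ['Neighborhood'] + [f'{ordinal(i)} Most Common Venue' for i in range(1, 11)]
print(columns[:4])
```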


Finally, with our data properly formatted, CLUSTER the neighborhoods.

Run k-means to cluster the neighborhood into 5 clusters.

In [232]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 4, 0, 2,
       0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
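
A quick tally of the labels printed above shows how uneven the clusters are. The list below is copied from that output; the tally itself is a small sketch using only the standard library.

```python
from collections import Counter

# Labels copied from the kmeans.labels_ output above.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 4, 0, 2,
          0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

sizes = Counter(labels)
for cluster in sorted(sizes):
    print('cluster', cluster, 'has', sizes[cluster], 'neighborhoods')
```

One cluster holds 34 of the 39 neighborhoods and three clusters hold a single neighborhood each, which may be worth keeping in mind when interpreting the groups or trying other values of `kclusters`.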

Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [233]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = torontoBr_data

# merge torontoBr_data with neighborhoods_venues_sorted to add the cluster label and top venues for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

Verify the clustered data is as expected

In [234]:
toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0,Coffee Shop,Bakery,Park,Pub,Theater,Breakfast Spot,Café,Shoe Store,Gastropub,Spa
1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0,Coffee Shop,Clothing Store,Restaurant,Hotel,Movie Theater,Sandwich Place,Sushi Restaurant,Falafel Restaurant,Japanese Restaurant,Italian Restaurant
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Coffee Shop,Café,Italian Restaurant,Bakery,Japanese Restaurant,Breakfast Spot,Cocktail Bar,Seafood Restaurant,Cosmetics Shop,Gastropub
3,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Pub,Gastropub,Bakery,Mexican Restaurant,Ice Cream Shop,French Restaurant,Pizza Place,Ramen Restaurant,Thai Restaurant,Indian Restaurant
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,0,Coffee Shop,Seafood Restaurant,Japanese Restaurant,Bakery,Cocktail Bar,Restaurant,Pub,Café,Hotel,Lounge


**Finally, the resulting clusters can be visualized using a folium map centered on Toronto**

In [235]:
# create map
map_clusters = folium.Map(location=[latitudeT, longitudeT], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
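
The color scheme only needs one distinct color per cluster label. As a hedged sketch of that idea, using the standard library's `colorsys` rather than the matplotlib colormap in the cell above, a palette can be built and indexed directly by the label. `cluster_palette` is a hypothetical helper name.

```python
import colorsys


def cluster_palette(k):
    """Return k evenly spaced hex colors, one per cluster label 0..k-1."""
    palette = []
    for i in range(k):
        # Spread hues evenly around the color wheel.
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            int(r * 255), int(g * 255), int(b * 255)))
    return palette


palette = cluster_palette(5)
# Index by the label itself so cluster 0 takes the first color.
color_for_cluster_0 = palette[0]
print(palette)
```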

Then, EXAMINE the 5 clusters and identify what distinguishes each one.

In [236]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Regent Park, Harbourfront",0,Coffee Shop,Bakery,Park,Pub,Theater,Breakfast Spot,Café,Shoe Store,Gastropub,Spa
1,"Garden District, Ryerson",0,Coffee Shop,Clothing Store,Restaurant,Hotel,Movie Theater,Sandwich Place,Sushi Restaurant,Falafel Restaurant,Japanese Restaurant,Italian Restaurant
2,St. James Town,0,Coffee Shop,Café,Italian Restaurant,Bakery,Japanese Restaurant,Breakfast Spot,Cocktail Bar,Seafood Restaurant,Cosmetics Shop,Gastropub
3,The Beaches,0,Pub,Gastropub,Bakery,Mexican Restaurant,Ice Cream Shop,French Restaurant,Pizza Place,Ramen Restaurant,Thai Restaurant,Indian Restaurant
4,Berczy Park,0,Coffee Shop,Seafood Restaurant,Japanese Restaurant,Bakery,Cocktail Bar,Restaurant,Pub,Café,Hotel,Lounge
5,Central Bay Street,0,Coffee Shop,Café,Bubble Tea Shop,Italian Restaurant,Japanese Restaurant,Sushi Restaurant,Burger Joint,Sandwich Place,Ramen Restaurant,Hotel
6,Christie,0,Grocery Store,Café,Park,Nightclub,Italian Restaurant,Playground,Baby Store,Candy Store,Athletics & Sports,Coffee Shop
7,"Richmond, Adelaide, King",0,Café,Coffee Shop,Hotel,Gym,Sushi Restaurant,Restaurant,Deli / Bodega,Theater,Concert Hall,Ramen Restaurant
8,"Dufferin, Dovercourt Village",0,Bakery,Pharmacy,Park,Sporting Goods Shop,Grocery Store,Bar,Bank,Brewery,Supermarket,Coffee Shop
9,The Danforth East,0,Coffee Shop,Beer Bar,Convenience Store,Hostel,Pizza Place,Café,Sports Bar,Gastropub,Dim Sum Restaurant,Park


There sure are a lot of coffee shops in this first and largest cluster. That seems to be the biggest commonality.

In [237]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,Roselawn,1,Playground,Garden,Liquor Store,Music Venue,Spa,Home Service,Pharmacy,Museum,Mediterranean Restaurant,Men's Store


This cluster has just a single neighborhood, Roselawn, with playgrounds and a variety of other venues. Interesting.

In [238]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
29,"Moore Park, Summerhill East",2,Park,Playground,Gym,Music Store,Museum,Movie Theater,Monument / Landmark,Molecular Gastronomy Restaurant,Modern European Restaurant,Miscellaneous Shop
33,Rosedale,2,Park,Playground,Trail,Mediterranean Restaurant,Men's Store,Metro Station,Mexican Restaurant,Martial Arts School,Music Venue,Middle Eastern Restaurant


We can consider the 3rd cluster the park and fitness cluster

In [240]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
21,Forest Hill North & West,3,Trail,Jewelry Store,Sushi Restaurant,Park,Miscellaneous Shop,Movie Theater,Monument / Landmark,Molecular Gastronomy Restaurant,Modern European Restaurant,Yoga Studio


This cluster just has a single neighborhood, Forest Hill North & West, with a lot of trails and some jewelry stores

In [241]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,Lawrence Park,4,Park,Bus Line,Swim School,Yoga Studio,Music Venue,Men's Store,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop


The last cluster has just Lawrence Park, with parks as the most common venue followed by a variety of other venues.