# Segmenting and Clustering Neighborhoods in Toronto

## Question 1: Creating the DataFrame for Toronto Postal Codes

### Importing libraries

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup as bs

### DataFrame Generation

We must obtain the data from the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. It lists the postal codes of Toronto, with the borough and neighborhood for each.


#### 1. Reading Data from HTML Wikipedia page
I've used the BeautifulSoup library to download and parse the HTML code.

In [2]:
# Read the URL page and save it as a html page
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url)
text = bs(page.content, 'lxml')

#### 2. Extracting values from HTML Table elements
The columns are selected from <th> items and the rows are populated with values inside <td> tags.

In [3]:
# Locate the table and prepare empty lists for the column names and the rows
table = text.find('table',{'class':'wikitable sortable'})
tr_rows = table.find_all('tr')
df_cols = []
df_rows = []

# Search for header tags. They will be the DataFrame column names
th_rows = table.find_all('th')
for th in th_rows:
    df_cols.append(th.text)

# Search for row tags. They will be the DataFrame rows
for tr in tr_rows:
    td = tr.find_all('td')
    row = [cell.text for cell in td]
    df_rows.append(row)

#### 3. Creating the DataFrame from extracted elements
The header row contains no <td> cells, so it produces an empty row that must be deleted.

In [4]:
# DataFrame creation with df_cols and df_rows. Null or None rows are dropped.
df = pd.DataFrame(df_rows, columns=df_cols)
df.dropna(inplace = True)
df.rename(columns={'Neighbourhood\n':'Neighborhood', 'Postcode':'PostalCode'}, inplace = True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
1,M1A,Not assigned,Not assigned\n
2,M2A,Not assigned,Not assigned\n
3,M3A,North York,Parkwoods\n
4,M4A,North York,Victoria Village\n
5,M5A,Downtown Toronto,Harbourfront\n

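The empty row comes from the header `<tr>`, which holds `<th>` cells but no `<td>` cells, so its list of values is empty. A minimal sketch with made-up rows (the `sample_rows` list is an assumption for illustration, not scraped data) shows how pandas pads that empty list with missing values and how `dropna` removes it:

```python
import pandas as pd

# Hypothetical rows: the first list is empty, as produced by the header <tr>.
sample_rows = [
    [],
    ['M1A', 'Not assigned', 'Not assigned\n'],
    ['M3A', 'North York', 'Parkwoods\n'],
]
sample_cols = ['Postcode', 'Borough', 'Neighbourhood\n']

sample_df = pd.DataFrame(sample_rows, columns=sample_cols)
rows_before = len(sample_df)      # still includes the all-missing header row

sample_df = sample_df.dropna()    # drops every row that contains a missing value
rows_after = len(sample_df)

print(rows_before, rows_after)
```

Note that `dropna` with default arguments removes any row with at least one missing value, so it would also discard a partially filled row if the table had one.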

#### 4. The values in the last column include an unnecessary '\n' special character. We must remove it.

In [5]:
# Remove the newline character from every cell, then check the shape
df = df.replace('\\n','', regex=True)
print(df.shape)

(289, 3)


#### 5. Rows without borough must be dropped

In [6]:
indexNames = df[df['Borough'] == 'Not assigned'].index
df.drop(indexNames, inplace = True)
df.shape

(212, 3)

#### 6. If Neighborhood is Not assigned, this value will be equal to Borough

In [7]:
df.loc[df['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df['Borough']
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Queen's Park
11,M9A,Etobicoke,Islington Avenue
12,M1B,Scarborough,Rouge
13,M1B,Scarborough,Malvern

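The assignment in step 6 relies on index alignment: the right-hand Series is matched to the selected rows by index label, so each row receives the Borough value from the same row. A small sketch on made-up data (the two rows below are assumptions modeled on the table above) illustrates this:

```python
import pandas as pd

# Hypothetical sample: one row with a missing neighborhood, one without.
sample = pd.DataFrame({
    'PostalCode': ['M7A', 'M9A'],
    'Borough': ["Queen's Park", 'Etobicoke'],
    'Neighborhood': ['Not assigned', 'Islington Avenue'],
})

mask = sample['Neighborhood'] == 'Not assigned'

# Only the masked rows are written; values are aligned on the row index.
sample.loc[mask, 'Neighborhood'] = sample['Borough']

print(sample)
```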

#### 7. Rows with the same PostalCode must be merged, combining the Neighborhood names
As every PostalCode has a single Borough, we group 'df' by both columns and join the Neighborhoods of each group into one row.
Finally, we reset the index. The DataFrame is ready.

In [8]:
dfgrouped = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(lambda x: ', '.join(x)).reset_index()
dfgrouped.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae

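The grouping step can be tried on a tiny made-up frame. The sketch below (sample values are assumptions) first checks that each postal code maps to exactly one borough, which is what makes grouping by both columns safe, and then joins the neighborhood names:

```python
import pandas as pd

# Hypothetical sample with a repeated postal code.
sample = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Woburn'],
})

# Each postal code should belong to a single borough.
one_borough_per_code = bool(
    (sample.groupby('PostalCode')['Borough'].nunique() == 1).all()
)

# Join the neighborhood names of each (PostalCode, Borough) group.
merged = (
    sample.groupby(['PostalCode', 'Borough'])['Neighborhood']
    .agg(', '.join)
    .reset_index()
)

print(merged)
```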

### ANSWER 1: Renaming and Printing the shape of the new DataFrame

In [9]:
TorontoPD = dfgrouped
print(TorontoPD.columns)

#DataFrame Shape
TorontoPD.shape

Index(['PostalCode', 'Borough', 'Neighborhood'], dtype='object')


(103, 3)

## Question 2: Adding Latitude and Longitude for each Postal Code
The plan was to use geocoder to look up the (Latitude, Longitude) coordinates of each DataFrame row and add them as new columns.
After several unsuccessful attempts to get the geographical coordinates from geocoder, I'm using the provided CSV file instead.

In [10]:
# A CSV file with all the geographical coordinates will be used:
url_coords = 'http://cocl.us/Geospatial_data'
Toronto_coords = pd.read_csv(url_coords)

# Rename the 'Postal Code' column so it matches the column name in TorontoPD
Toronto_coords.rename(columns={'Postal Code':'PostalCode'}, inplace = True)
Toronto_coords.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### ANSWER 2: This is the new DataFrame with Geographical Coordinates

In [11]:
TorontoDF = pd.merge(TorontoPD, Toronto_coords, on = 'PostalCode')
TorontoDF.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476

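`pd.merge` defaults to an inner join, so a postal code without coordinates would silently disappear from the result. As a hedged sketch on made-up data (the code `M9Z` and its borough are hypothetical), a left join with `indicator=True` and `validate='one_to_one'` makes such gaps and accidental duplicates visible:

```python
import pandas as pd

# Hypothetical inputs: 'M9Z' has no matching coordinates.
codes = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', 'Hypothetical'],
})
coords = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a '_merge' column describing where each row came from.
checked = pd.merge(codes, coords, on='PostalCode', how='left',
                   validate='one_to_one', indicator=True)

missing = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)
```

In the notebook, both frames have 103 rows and the merged result is used with 103 rows later on, which suggests no postal code was lost.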

## Question 3: Generate maps to visualize neighborhoods and how they cluster together
With the complete Toronto DataFrame (TorontoDF), we will plot a map of Toronto showing the neighborhoods and then cluster them by the similarity of their venues. Each cluster will be represented in a different colour.

### Importing all required libraries (plot, clustering,...)

In [12]:
import json # library to handle JSON files

# !conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# import k-means from clustering stage
from sklearn.cluster import KMeans

# !conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


### Preparing Geographic and map information
First, we'll check the number of boroughs and neighborhoods in Toronto:

In [13]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(TorontoDF['Borough'].unique()),
        TorontoDF.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.


#### Use geopy library to get the latitude and longitude values of Toronto.

In [14]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="To_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


Let's create a map with all the neighborhoods.
NOTE: I'm adding +0.05 to the latitude so that all the points are displayed better.

In [15]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude+0.05, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(TorontoDF['Latitude'], TorontoDF['Longitude'], TorontoDF['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#### Define Foursquare Credentials and Version

In [16]:
CLIENT_ID = 'CXOBLEEBQ52ILMYSLGVWSMNKMMWX3JM3P20T2WGWMLVNBX4M' # your Foursquare ID
CLIENT_SECRET = 'V1IN4T5G5PRFXG2WFXFBIF5VIIQENSD0REEPJRCHSTMKJGX5' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: CXOBLEEBQ52ILMYSLGVWSMNKMMWX3JM3P20T2WGWMLVNBX4M
CLIENT_SECRET:V1IN4T5G5PRFXG2WFXFBIF5VIIQENSD0REEPJRCHSTMKJGX5


#### We will explore the first neighborhood:

In [17]:
TorontoDF.loc[0, 'Neighborhood']

'Rouge, Malvern'

In [18]:
neighborhood_latitude = TorontoDF.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = TorontoDF.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = TorontoDF.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Rouge, Malvern are 43.806686299999996, -79.19435340000001.


In [19]:
# build the explore request: up to LIMIT venues within `radius` meters of the neighborhood
radius = 500
LIMIT = 100
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, neighborhood_longitude, VERSION, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?client_id=CXOBLEEBQ52ILMYSLGVWSMNKMMWX3JM3P20T2WGWMLVNBX4M&client_secret=V1IN4T5G5PRFXG2WFXFBIF5VIIQENSD0REEPJRCHSTMKJGX5&ll=43.806686299999996,-79.19435340000001&v=20180605&radius=500&limit=100'

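The same query string can be assembled from a dictionary, which keeps the parameters readable and lets the standard library escape the values. This is a sketch under assumptions: `build_explore_url` is a hypothetical helper, and `'YOUR_ID'` and `'YOUR_SECRET'` are placeholders rather than real credentials.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, lat, lng,
                      version='20180605', radius=500, limit=100):
    """Hypothetical helper: build the explore URL used in this notebook."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),   # "latitude,longitude"
        'radius': radius,
        'limit': limit,
    }
    # urlencode percent-escapes the values (the comma in 'll' becomes %2C)
    return BASE_URL + '?' + urlencode(params)


example_url = build_explore_url('YOUR_ID', 'YOUR_SECRET', 43.8066863, -79.1943534)
print(example_url)
```

The placeholder strings could be read from environment variables (for example with `os.environ`) so that the secret is neither stored nor printed in the notebook.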
In [20]:
results = requests.get(url).json()
venues = results['response']['groups'][0].keys()
venues

dict_keys(['type', 'name', 'items'])

In [21]:
# function that extracts the category of the venue
def get_category_type(row):
    # the column is called 'categories' or 'venue.categories' depending on the response
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [22]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Wendy's,Fast Food Restaurant,43.807448,-79.199056
1,Interprovincial Group,Print Shop,43.80563,-79.200378

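The flattening step can be reproduced offline. The sketch below uses a mock list shaped like `results['response']['groups'][0]['items']` (the layout and the second venue are assumptions) and the top-level `pd.json_normalize`, which is available in pandas 1.0 and later:

```python
import pandas as pd

# Mock items; the second venue has no category on purpose.
mock_items = [
    {'venue': {'name': "Wendy's",
               'categories': [{'name': 'Fast Food Restaurant'}],
               'location': {'lat': 43.807448, 'lng': -79.199056}}},
    {'venue': {'name': 'Hypothetical Venue',
               'categories': [],
               'location': {'lat': 43.80563, 'lng': -79.200378}}},
]

# Nested dictionaries become dotted column names; lists are kept as they are.
flat = pd.json_normalize(mock_items)
flat = flat[['venue.name', 'venue.categories',
             'venue.location.lat', 'venue.location.lng']].copy()

# Keep only the name of the first category, or None when the list is empty.
flat['venue.categories'] = flat['venue.categories'].apply(
    lambda cats: cats[0]['name'] if cats else None)

# Strip the 'venue.' and 'venue.location.' prefixes.
flat.columns = [col.split('.')[-1] for col in flat.columns]
print(flat)
```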

### Explore Neighborhoods in Toronto

In [23]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

2 venues were returned by Foursquare.


#### Let's create a function that gets the top 100 venues within a radius of 500 meters and applies the same process to all the neighborhoods in Toronto

In [24]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    print(venues_list)
    
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

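The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` for a venue that has no category. A more defensive variant of that step is sketched below; `venue_rows` is a hypothetical helper and the mock items are assumptions, so no request is sent:

```python
def venue_rows(name, lat, lng, items):
    """Hypothetical helper: flatten Foursquare-style items into tuples."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None when the venue has no category.
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Mock response items; the second one has an empty category list.
mock_items = [
    {'venue': {'name': "Wendy's",
               'categories': [{'name': 'Fast Food Restaurant'}],
               'location': {'lat': 43.807448, 'lng': -79.199056}}},
    {'venue': {'name': 'Hypothetical Venue',
               'categories': [],
               'location': {'lat': 43.80563, 'lng': -79.200378}}},
]

rows = venue_rows('Rouge, Malvern', 43.806686, -79.194353, mock_items)
print(rows)
```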
#### Run the above function on each neighborhood and create a new dataframe called *toronto_venues*.

In [25]:
# After this execution, every Neighborhood is printed with its venues
toronto_venues = getNearbyVenues(names=TorontoDF['Neighborhood'],
                                   latitudes=TorontoDF['Latitude'],
                                   longitudes=TorontoDF['Longitude']
                                  )

[[('Rouge, Malvern', 43.806686299999996, -79.19435340000001, "Wendy's", 43.80744841934756, -79.19905558052072, 'Fast Food Restaurant'), ('Rouge, Malvern', 43.806686299999996, -79.19435340000001, 'Interprovincial Group', 43.8056297, -79.2003784, 'Print Shop')], [('Highland Creek, Rouge Hill, Port Union', 43.7845351, -79.16049709999999, 'Royal Canadian Legion', 43.78253332838298, -79.16308473261682, 'Bar'), ('Highland Creek, Rouge Hill, Port Union', 43.7845351, -79.16049709999999, 'Affordable Toronto Movers', 43.787918740889744, -79.16297673924419, 'Moving Target'), ('Highland Creek, Rouge Hill, Port Union', 43.7845351, -79.16049709999999, 'Scarborough Historical Society', 43.78875526434558, -79.162437915802, 'History Museum')], [('Guildwood, Morningside, West Hill', 43.7635726, -79.1887115, 'Swiss Chalet Rotisserie & Grill', 43.76769708292701, -79.1899135003439, 'Pizza Place'), ('Guildwood, Morningside, West Hill', 43.7635726, -79.1887115, 'G & G Electronics', 43.765309, -79.191537, 'El

#### Let's check the size of the resulting dataframe

In [26]:
print(toronto_venues.shape)
toronto_venues.head()

(2235, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge, Malvern",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,"Rouge, Malvern",43.806686,-79.194353,Interprovincial Group,43.80563,-79.200378,Print Shop
2,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
3,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Affordable Toronto Movers,43.787919,-79.162977,Moving Target
4,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Scarborough Historical Society,43.788755,-79.162438,History Museum


Let's check how many venues were returned for each neighborhood

In [27]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",100,100,100,100,100,100
Agincourt,4,4,4,4,4,4
"Agincourt North, L'Amoreaux East, Milliken, Steeles East",3,3,3,3,3,3
"Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown",9,9,9,9,9,9
"Alderwood, Long Branch",10,10,10,10,10,10
"Bathurst Manor, Downsview North, Wilson Heights",18,18,18,18,18,18
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",24,24,24,24,24,24
Berczy Park,58,58,58,58,58,58
"Birch Cliff, Cliffside West",4,4,4,4,4,4


In [28]:
TorontoDF.shape

(103, 5)

#### Let's find out how many unique categories can be curated from all the returned venues

In [29]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 275 unique categories.


### Analyze Each Neighborhood

In [30]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
toronto_onehot.shape

(2235, 275)

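The one-hot frame has 275 columns, the same as the number of unique categories, even though a `Neighborhood` column was added, and the preview starts with `Yoga Studio` rather than `Neighborhood`. One possible explanation (an assumption, not checked against the data) is that a venue category is itself named `Neighborhood`, so the assignment overwrites that dummy column instead of appending a new one. The sketch below reproduces the effect on made-up data and shows one way to avoid it:

```python
import pandas as pd

# Hypothetical venues; one category is literally called 'Neighborhood'.
venues = pd.DataFrame({
    'Neighborhood': ['Rouge, Malvern', 'Woburn', 'Woburn'],
    'Venue Category': ['Print Shop', 'Neighborhood', 'Coffee Shop'],
})

onehot = pd.get_dummies(venues[['Venue Category']],
                        prefix='', prefix_sep='', dtype=int)
columns_before = onehot.shape[1]

# This overwrites the existing 'Neighborhood' dummy column.
onehot['Neighborhood'] = venues['Neighborhood']
collided = onehot.shape[1] == columns_before

# Possible fix: rename the clashing category before encoding,
# then insert the neighborhood names as the first column.
safe = venues.copy()
safe['Venue Category'] = safe['Venue Category'].replace(
    'Neighborhood', 'Neighborhood (venue)')
safe_onehot = pd.get_dummies(safe[['Venue Category']],
                             prefix='', prefix_sep='', dtype=int)
safe_onehot.insert(0, 'Neighborhood', safe['Neighborhood'])

print(collided, safe_onehot.columns.tolist())
```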
#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [32]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,"Adelaide, King, Richmond",0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.010000,0.000000,0.000000,0.000000,0.000000,0.010000,0.000000,0.000000,0.010000
1,Agincourt,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,"Alderwood, Long Branch",0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
5,"Bathurst Manor, Downsview North, Wilson Heights",0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.055556,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
6,Bayview Village,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
7,"Bedford Park, Lawrence Manor East",0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
8,Berczy Park,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
9,"Birch Cliff, Cliffside West",0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


#### Let's confirm the new size

In [33]:
toronto_grouped.shape

(99, 275)

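Because every venue contributes exactly one 1 to its row, the mean of a one-hot column within a neighborhood is the share of that neighborhood's venues in the category, and the shares of a neighborhood add up to 1. A small sketch with made-up neighborhoods `A` and `B` (assumed values):

```python
import pandas as pd

# Hypothetical one-hot rows: four venues in 'A', one venue in 'B'.
onehot = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B'],
    'Coffee Shop':  [1, 1, 0, 0, 0],
    'Park':         [0, 0, 1, 0, 1],
    'Bakery':       [0, 0, 0, 1, 0],
})

# Mean per neighborhood = relative frequency of each category.
freq = onehot.groupby('Neighborhood').mean().reset_index()

# Each neighborhood's frequencies should sum to 1.
row_sums = freq.drop(columns='Neighborhood').sum(axis=1)
print(freq)
```

A neighborhood with few venues therefore gets coarse frequencies, such as 0.25 or 0.33, which is worth keeping in mind when comparing it with a neighborhood that has 100 venues.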
#### Let's print each neighborhood along with the top 5 most common venues

In [34]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
             venue  freq
0      Coffee Shop  0.06
1             Café  0.05
2              Bar  0.04
3  Thai Restaurant  0.04
4       Steakhouse  0.04


----Agincourt----
                       venue  freq
0             Sandwich Place  0.25
1                     Lounge  0.25
2             Breakfast Spot  0.25
3               Skating Rink  0.25
4  Middle Eastern Restaurant  0.00


----Agincourt North, L'Amoreaux East, Milliken, Steeles East----
                 venue  freq
0                 Park  0.67
1           Playground  0.33
2        Metro Station  0.00
3                Motel  0.00
4  Monument / Landmark  0.00


----Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown----
                 venue  freq
0        Grocery Store  0.22
1          Pizza Place  0.11
2           Beer Store  0.11
3  Fried Chicken Joint  0.11
4          Coffee Shop  0.11


----Alderwood, Long Branch----
            venue  

                       venue  freq
0              Metro Station  0.25
1                Coffee Shop  0.25
2          Convenience Store  0.25
3                       Park  0.25
4  Middle Eastern Restaurant  0.00


----Emery, Humberlea----
                        venue  freq
0              Baseball Field  0.33
1  Construction & Landscaping  0.33
2      Furniture / Home Store  0.33
3                 Yoga Studio  0.00
4               Movie Theater  0.00


----Fairview, Henry Farm, Oriole----
                  venue  freq
0        Clothing Store  0.17
1  Fast Food Restaurant  0.08
2           Coffee Shop  0.06
3         Women's Store  0.03
4                Bakery  0.03


----First Canadian Place, Underground city----
         venue  freq
0  Coffee Shop  0.08
1         Café  0.07
2        Hotel  0.06
3   Steakhouse  0.04
4   Restaurant  0.04


----Flemingdon Park, Don Mills South----
              venue  freq
0               Gym  0.10
1  Asian Restaurant  0.10
2       Coffee Shop  0.10
3     

                       venue  freq
0                Coffee Shop  0.33
1          Health Food Store  0.17
2       Other Great Outdoors  0.17
3                        Pub  0.17
4  Middle Eastern Restaurant  0.00


----The Beaches West, India Bazaar----
                  venue  freq
0                  Park  0.11
1        Sandwich Place  0.11
2             Pet Store  0.05
3    Italian Restaurant  0.05
4  Fast Food Restaurant  0.05


----The Danforth West, Riverdale----
                venue  freq
0    Greek Restaurant  0.23
1         Coffee Shop  0.09
2      Ice Cream Shop  0.07
3  Italian Restaurant  0.05
4           Bookstore  0.05


----The Junction North, Runnymede----
                  venue  freq
0           Pizza Place  0.25
1  Caribbean Restaurant  0.25
2              Bus Line  0.25
3         Grocery Store  0.25
4           Yoga Studio  0.00


----The Kingsway, Montgomery Road, Old Mill North----
                venue  freq
0                Pool  0.33
1                Park  0.33
2 

#### Let's put that into a *pandas* dataframe
First, let's write a function to sort the venues in descending order.

In [35]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

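To see what the function returns, it can be applied to a single made-up row (the `Example` neighborhood and its frequencies are assumptions). The function is repeated here so that the sketch runs on its own:

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]          # skip the 'Neighborhood' entry
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]


# Hypothetical row of category frequencies.
row = pd.Series({
    'Neighborhood': 'Example',
    'Coffee Shop': 0.5,
    'Park': 0.3,
    'Bakery': 0.2,
    'Bank': 0.0,
})

top3 = list(return_most_common_venues(row, 3))
top4 = list(return_most_common_venues(row, 4))
print(top3, top4)
```

Asking for more categories than the neighborhood actually has pulls in categories with a frequency of 0 (`Bank` in `top4`), so the later "Most Common Venue" columns of a sparsely covered neighborhood do not reflect real venues.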
Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [36]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        # 1st, 2nd and 3rd take their suffix from `indicators`
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # every later position uses 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Thai Restaurant,Steakhouse,Bar,Asian Restaurant,American Restaurant,Bakery,Hotel,Gym
1,Agincourt,Lounge,Sandwich Place,Breakfast Spot,Skating Rink,Drugstore,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Eastern European Restaurant
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Park,Playground,Women's Store,Drugstore,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Pizza Place,Fried Chicken Joint,Coffee Shop,Sandwich Place,Beer Store,Fast Food Restaurant,Pharmacy,Gluten-free Restaurant,Eastern European Restaurant
4,"Alderwood, Long Branch",Pizza Place,Pharmacy,Skating Rink,Sandwich Place,Pub,Dance Studio,Pool,Coffee Shop,Gym,Drugstore


In [37]:
neighborhoods_venues_sorted.shape

(99, 11)

## Cluster Neighborhoods
Run *k*-means to cluster the neighborhoods into 5 clusters.

In [38]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 4, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 3, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 4, 1, 4, 0, 1, 1,
       1, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 4, 1, 1, 1, 4, 1, 1, 4,
       1, 2, 1, 1, 1, 1, 4, 1, 4, 1, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
       4, 1, 1, 1, 4, 1, 1, 1, 1, 1, 4])

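A quick way to judge the result is to count how many neighborhoods fall into each cluster. The sketch below copies the label array printed above and uses `np.bincount`; it only summarizes those labels and does not rerun k-means:

```python
import numpy as np

# Labels copied from the kmeans.labels_ output above (99 neighborhoods).
labels = np.array([
    1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 4, 1, 1, 1, 1, 1,
    1, 1, 1, 0, 3, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 4, 1, 4, 0, 1, 1,
    1, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 4, 1, 1, 1, 4, 1, 1, 4,
    1, 2, 1, 1, 1, 1, 4, 1, 4, 1, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
    4, 1, 1, 1, 4, 1, 1, 1, 1, 1, 4,
])

counts = np.bincount(labels, minlength=5)   # neighborhoods per cluster
shares = counts / counts.sum()              # fraction of all neighborhoods

for cluster, (count, share) in enumerate(zip(counts, shares)):
    print('Cluster {}: {} neighborhoods ({:.0%})'.format(cluster, count, share))
```

The counts are strongly unbalanced: cluster 1 holds most of the neighborhoods, while a few clusters contain only one or two. That can be a hint to try a different number of clusters or a different feature representation.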
Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [41]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = TorontoDF

# join the cluster labels and top venues onto TorontoDF, which holds latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

However, several neighborhoods have no venues returned, so they received no cluster label. We will label these neighborhoods with a 6th cluster number, 5.

In [42]:
# fill the missing labels with 5 and convert the column back to integers
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].fillna(5).astype('int32')

toronto_merged # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,0,Fast Food Restaurant,Print Shop,Donut Shop,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Women's Store
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,1,Moving Target,Bar,History Museum,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,1,Pizza Place,Electronics Store,Rental Car Location,Breakfast Spot,Medical Center,Mexican Restaurant,Doner Restaurant,Diner,Discount Store,Dog Run
3,M1G,Scarborough,Woburn,43.770992,-79.216917,1,Coffee Shop,Insurance Office,Korean Restaurant,Dumpling Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,1,Athletics & Sports,Hakka Restaurant,Fried Chicken Joint,Thai Restaurant,Bakery,Caribbean Restaurant,Bank,Doner Restaurant,Discount Store,Dog Run
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476,2,Convenience Store,Playground,Drugstore,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Women's Store
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029,1,Discount Store,Department Store,Bus Station,Hobby Shop,Coffee Shop,Chinese Restaurant,Donut Shop,Diner,Dog Run,Doner Restaurant
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577,1,Bakery,Bus Line,Park,Intersection,Soccer Field,Metro Station,Bus Station,Fast Food Restaurant,Coworking Space,Construction & Landscaping
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476,0,Motel,American Restaurant,Women's Store,Dessert Shop,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848,1,College Stadium,Café,General Entertainment,Skating Rink,Donut Shop,Diner,Discount Store,Dog Run,Doner Restaurant,Drugstore

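The missing labels come from the join itself: a neighborhood that is absent from the right-hand table gets `NaN`, which also turns the label column into floats. A minimal sketch with made-up neighborhoods `A`, `B` and `C` (assumed values) shows the effect and the fix:

```python
import pandas as pd

# Hypothetical data: 'B' returned no venues, so it has no cluster label.
base = pd.DataFrame({'Neighborhood': ['A', 'B', 'C']})
labels = pd.DataFrame({'Neighborhood': ['A', 'C'],
                       'Cluster Labels': [1, 4]})

joined = base.join(labels.set_index('Neighborhood'), on='Neighborhood')
had_missing = bool(joined['Cluster Labels'].isna().any())

# Assign the extra label 5 and restore an integer dtype.
joined['Cluster Labels'] = joined['Cluster Labels'].fillna(5).astype('int32')
print(joined)
```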

Finally, let's visualize the resulting clusters

In [43]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme: one color per label (the k-means clusters plus the extra label 5)
n_labels = kclusters + 1
colors_array = cm.rainbow(np.linspace(0, 1, n_labels))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Examine Clusters
Now, we can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, we can then assign a name to each cluster:

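Naming a cluster by eye can be supported with a small summary, for example the most frequent "1st Most Common Venue" within each cluster. This is a sketch on made-up rows (the sample values are assumptions, not the notebook's data); in the notebook the same expression could be applied to `toronto_merged`:

```python
import pandas as pd

# Hypothetical excerpt with two clusters.
sample = pd.DataFrame({
    'Cluster Labels': [4, 4, 4, 2, 2, 2],
    '1st Most Common Venue': ['Park', 'Park', 'Bus Stop',
                              'Playground', 'Playground', 'Convenience Store'],
})

# For each cluster, pick the venue category that appears most often.
summary = (
    sample.groupby('Cluster Labels')['1st Most Common Venue']
    .agg(lambda s: s.value_counts().idxmax())
)
print(summary)
```

Such a summary is only a starting point: it ignores the other nine venue columns, and for very small clusters a single neighborhood decides the result.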
In [44]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Scarborough,0,Fast Food Restaurant,Print Shop,Donut Shop,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Women's Store
8,Scarborough,0,Motel,American Restaurant,Women's Store,Dessert Shop,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
32,North York,0,Baseball Field,Home Service,Food Truck,Business Service,Women's Store,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
80,York,0,Fast Food Restaurant,Fried Chicken Joint,Sandwich Place,Turkish Restaurant,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Dessert Shop
91,Etobicoke,0,Baseball Field,Eastern European Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Women's Store,Dim Sum Restaurant
97,North York,0,Baseball Field,Furniture / Home Store,Construction & Landscaping,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Women's Store


### Cluster Label 0 is 'Donuts and Dessert Shops' 

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

### Cluster Label 1 is 'Donuts and Dessert Shops' 

In [45]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,Scarborough,2,Convenience Store,Playground,Drugstore,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Women's Store
48,Central Toronto,2,Playground,Women's Store,Drugstore,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant


### Cluster Label 2 is 'Drugstores'

In [46]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
94,Etobicoke,3,Bank,Women's Store,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant,Dim Sum Restaurant


### Cluster Label 3 is 'Banks'

In [48]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,Scarborough,4,Park,Playground,Women's Store,Drugstore,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop
23,North York,4,Park,Electronics Store,Bank,Women's Store,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop
25,North York,4,Fast Food Restaurant,Park,Food & Drink Shop,Drugstore,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop
30,North York,4,Bus Stop,Park,Airport,Dumpling Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
31,North York,4,Park,Grocery Store,Bank,Shopping Mall,Women's Store,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant
40,East York,4,Park,Coffee Shop,Convenience Store,Metro Station,Women's Store,Drugstore,Discount Store,Dog Run,Doner Restaurant,Donut Shop
44,Central Toronto,4,Park,Bus Line,Swim School,Women's Store,Donut Shop,Diner,Discount Store,Dog Run,Doner Restaurant,Drugstore
50,Downtown Toronto,4,Park,Trail,Playground,Women's Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
64,Central Toronto,4,Jewelry Store,Park,Sushi Restaurant,Trail,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop
72,North York,4,Pizza Place,Pub,Park,Japanese Restaurant,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop


### Cluster Label 4 is 'Parks'

In [49]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 5, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
16,Scarborough,5,,,,,,,,,,
20,North York,5,,,,,,,,,,
21,North York,5,,,,,,,,,,
93,Etobicoke,5,,,,,,,,,,


### Cluster Label 5 is 'Non Categorized'