# Assignment: Segmenting and Clustering Neighborhoods in Toronto

## Introduction

In this notebook we explore and cluster the neighborhoods of Toronto.
The following steps are implemented:
- A dataframe with Postal Code, Borough and Neighborhood
    * The dataframe is built with the following rules:
        1. Cells with a borough that is Not assigned are ignored.
        2. Neighborhoods that share a Postal Code are combined into one comma-separated row.
        3. For a cell with a borough but a Not assigned neighborhood, the neighborhood is set to the borough name.
- Latitude and Longitude coordinates (from the Geocoder Python package or a CSV file) are added for each Neighborhood
- Neighborhoods in boroughs with 'Toronto' in the name are clustered and mapped using Folium
- The data in each cluster is explored at the end
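
As a minimal sketch, the three dataframe rules listed above can be tried on a toy table first. The rows below are made up for illustration and are not taken from the Wikipedia page; only the column names follow this notebook.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M5A', 'M6A', 'M6A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'North York',
                'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Lawrence Heights',
                     'Lawrence Manor', 'Not assigned'],
})

# Rule 1: ignore rows whose borough is not assigned
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# Rule 3: an unassigned neighborhood takes the borough name
mask = toy['Neighborhood'] == 'Not assigned'
toy.loc[mask, 'Neighborhood'] = toy.loc[mask, 'Borough']

# Rule 2: one row per postal code, neighborhoods joined with commas
toy_clean = (toy.groupby('Postcode', as_index=False)
                .agg({'Borough': 'first', 'Neighborhood': ', '.join}))
print(toy_clean)
```

Working on a small table like this makes it easy to check each rule by eye before running the same steps on the scraped data.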

### Importing Libraries

In [192]:
import pandas as pd
import requests
#!conda install -c conda-forge lxml --yes
import lxml.html as lh

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

### Reading Data from below link (web page)

In [6]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [7]:
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')

In [8]:
col=[]
# For each table row, split its text on newlines and store the resulting list of cells
for t in tr_elements:
    name=t.text_content()
    tup=name.split('\n')
    col.append(tup)
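
Splitting `text_content()` on newlines depends on how the page happens to lay out its whitespace. The following is a sketch of a per-cell alternative that uses only the standard library. The class name `TableRows` and the HTML snippet are hypothetical and only stand in for the real page.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Hypothetical snippet shaped like the table this notebook reads
html = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""
parser = TableRows()
parser.feed(html)
rows = parser.rows
print(rows)
```

Each row comes back as a list with one entry per cell, so no placeholder columns need to be dropped afterwards.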

In [25]:
df = pd.DataFrame(col)

df.drop(df.iloc[:, 4:66], inplace = True, axis = 1) 
df.iloc[:,0]='ToBeDeleted'
df.drop(df.iloc[:, 4:66], inplace = True, axis = 1) 
df.rename(columns = {0:'TEST'}, inplace = True) 
df.drop(['TEST'], axis = 1,inplace=True) 
new_header = df.iloc[0] #grab the first row for the header
df = df[1:] #take the data less the header row
df.columns = new_header #set the header row as the df header

df=df.iloc[:287,:] # removing extra rows at the end
df.rename(columns = {'Neighbourhood':'Neighborhood'}, inplace = True) 
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront


In [26]:
df=df[df.Borough != 'Not assigned']
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor


In [27]:
for i in range(len(df)):
    if df.iloc[i,2] == 'Not assigned':
       df.iloc[i,2] = df.iloc[i,1] 
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor


In [28]:
df_Toronto=df
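# select the two columns with a list (double brackets); selecting with a bare tuple fails in recent pandas versions
df_Toronto=df_Toronto.groupby('Postcode')[['Borough','Neighborhood']].agg(lambda x: ', '.join(set(x))).reset_index()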

df_Toronto.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Morningside, Guildwood, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
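
One thing to keep in mind with the aggregation above: `set(x)` removes duplicates but does not guarantee any order, so the joined neighborhood string may differ between runs. As a small sketch, a helper like the one below (the name `join_unique` is hypothetical) drops duplicates while keeping the first-seen order.

```python
def join_unique(values, sep=', '):
    """Join values, dropping duplicates while keeping first-seen order."""
    # dict keys are unique and remember insertion order
    return sep.join(dict.fromkeys(values))

joined = join_unique(['Rouge', 'Malvern', 'Rouge'])
print(joined)
```

It could be passed to `.agg(...)` in place of the lambda if a stable order matters.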


### Expected Dataframe and its size

In [29]:
df_Toronto.shape

(103, 3)

## Fetching geocodes for the Neighborhoods in above dataframe

Inputting credentials to connect to foursquare.com

In [30]:
CLIENT_ID = 'D1IXAYF5EGVSEDQ2PFFO01KVVXCYAICFE0F5YXMDPX1Z4STE' # your Foursquare ID
CLIENT_SECRET = 'WMFCF5KXW4N5PH0Q0J4NEZHE134H441NKFZFEGD4I1YJ2Z2H' # your Foursquare Secret
VERSION = '20200304'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: D1IXAYF5EGVSEDQ2PFFO01KVVXCYAICFE0F5YXMDPX1Z4STE
CLIENT_SECRET:WMFCF5KXW4N5PH0Q0J4NEZHE134H441NKFZFEGD4I1YJ2Z2H
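
Hard-coding and printing the secret, as above, leaves it in the saved notebook. A hedged sketch of an alternative is to read the credentials from environment variables and mask them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions made for this example.

```python
import os

def load_credentials(env=os.environ):
    """Read Foursquare credentials from environment variables (names are assumed)."""
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first')
    return client_id, client_secret

def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    return secret[:visible] + '*' * (len(secret) - visible)

# A plain dict stands in for the real environment here; the values are placeholders
fake_env = {'FOURSQUARE_CLIENT_ID': 'abc', 'FOURSQUARE_CLIENT_SECRET': 'xyz12345'}
demo_id, demo_secret = load_credentials(fake_env)
print('CLIENT_SECRET: ' + mask(demo_secret))
```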


In [16]:
#Not working Solution

# for i in range(len(df_Toronto)):
    
#     geolocator = Nominatim(user_agent="foursquare_agent")
#     location = geolocator.geocode('{}, Toronto, Ontario'.format(df_Toronto.iloc[i,0]))
#     print(location)
#     df_Toronto.iloc[i,3] = location.latitude
#     df_Toronto.iloc[i,4] = location.longitude
# print(i)
# df_Toronto.head()

### Since geocoding fails for some of the postcodes with Nominatim (user agent foursquare_agent), I will use the CSV file to get the latitude and longitude values

In [32]:
df_ll=pd.read_csv('https://cocl.us/Geospatial_data')
df_ll.rename(columns = {'Postal Code':'Postcode'}, inplace = True) 
df_ll.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### Merging original dataframe with the one with geo coordinates

In [34]:
# Drop coordinate columns left over from an earlier run, if any, so the merge does not duplicate them
df_Toronto.drop(['Latitude','Longitude'], axis = 1, inplace=True, errors='ignore')
df_T=pd.merge(df_Toronto, df_ll, how='inner', on='Postcode', sort=True)
df_T.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Morningside, Guildwood, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [35]:
df_T.shape

(103, 5)
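
The shape shows that all 103 postcodes survived the inner merge. A sketch of a more explicit check is a left join with `indicator=True`, which labels any postcode that has no coordinates. The two small frames below are hypothetical; only the column names follow this notebook.

```python
import pandas as pd

# Hypothetical frames, for illustration only
left = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z'],
                     'Borough': ['Scarborough', 'Scarborough', 'Nowhere']})
right = pd.DataFrame({'Postcode': ['M1B', 'M1C'],
                      'Latitude': [43.806686, 43.784535],
                      'Longitude': [-79.194353, -79.160497]})

# indicator=True adds a '_merge' column saying where each row was found;
# validate='one_to_one' raises if a postcode is repeated on either side
check = pd.merge(left, right, how='left', on='Postcode',
                 validate='one_to_one', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```

An empty list would mean every postcode found a match, which is what the unchanged row count above suggests for the real data.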

## Exploring and clustering the neighborhoods in Toronto

#### Create a map of Toronto with neighborhoods superimposed on top.

In [20]:
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

In [36]:
df_N=pd.merge(df, df_ll, how='inner', on='Postcode', sort=True)
df_N.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,Rouge,43.806686,-79.194353
1,M1B,Scarborough,Malvern,43.806686,-79.194353
2,M1C,Scarborough,Highland Creek,43.784535,-79.160497
3,M1C,Scarborough,Rouge Hill,43.784535,-79.160497
4,M1C,Scarborough,Port Union,43.784535,-79.160497


In [37]:
# create map of Toronto using latitude and longitude values
neighborhoods=df_N
latitude= 43.6532 
longitude= -79.3832
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Creating a subset of the dataframe with boroughs whose name contains 'Toronto'

In [103]:
# .copy() gives an independent dataframe, so the in-place reset_index below does not act on a slice of df_N
df_York=df_N[df_N['Borough'].str.contains("Toronto")].copy()
df_York

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
67,M4E,East Toronto,The Beaches,43.676357,-79.293031
71,M4K,East Toronto,The Danforth West,43.679557,-79.352188
72,M4K,East Toronto,Riverdale,43.679557,-79.352188
73,M4L,East Toronto,The Beaches West,43.668999,-79.315572
74,M4L,East Toronto,India Bazaar,43.668999,-79.315572
...,...,...,...,...,...
155,M6R,West Toronto,Roncesvalles,43.648960,-79.456325
156,M6S,West Toronto,Runnymede,43.651571,-79.484450
157,M6S,West Toronto,Swansea,43.651571,-79.484450
158,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494


In [104]:
df_York.reset_index(inplace=True)
df_York.head()

Unnamed: 0,index,Postcode,Borough,Neighborhood,Latitude,Longitude
0,67,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,71,M4K,East Toronto,The Danforth West,43.679557,-79.352188
2,72,M4K,East Toronto,Riverdale,43.679557,-79.352188
3,73,M4L,East Toronto,The Beaches West,43.668999,-79.315572
4,74,M4L,East Toronto,India Bazaar,43.668999,-79.315572


Checking first neighborhood in the data

In [108]:
print(f"First neighborhood in the data is {df_York.loc[0,'Neighborhood']}")

First neighborhood in the data is The Beaches


Fetching coordinates for the first neighborhood

In [109]:
address = df_York.loc[0,'Neighborhood'] # first neighborhood in the data is The Beaches

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of the neighborhood are {}, {}.'.format(latitude, longitude))

The geographical coordinates of the neighborhood are 43.6710244, -79.296712.


Creating url to fetch venues for the above selected neighborhood

In [110]:
LIMIT=100
radius=500
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?client_id=D1IXAYF5EGVSEDQ2PFFO01KVVXCYAICFE0F5YXMDPX1Z4STE&client_secret=WMFCF5KXW4N5PH0Q0J4NEZHE134H441NKFZFEGD4I1YJ2Z2H&ll=43.6710244,-79.296712&v=20200304&radius=500&limit=100'
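
As an alternative sketch, the same URL can be assembled from named parameters with the standard library, which avoids mismatching the order of the `format` arguments. The function name `build_explore_url` is hypothetical, and the credentials in the call are placeholders.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, lat, lng, version, radius, limit):
    """Assemble the explore URL used in this notebook from named parameters."""
    base = 'https://api.foursquare.com/v2/venues/explore'
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),
        'v': version,
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in "lat,lng" instead of encoding it as %2C
    return base + '?' + urlencode(params, safe=',')

# Placeholder credentials, not real ones
example_url = build_explore_url('MY_ID', 'MY_SECRET', 43.6710244, -79.296712,
                                '20200304', 500, 100)
print(example_url)
```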

Requesting data from foursquare.com for the above url

In [111]:
results = requests.get(url).json()

From the Foursquare lab in the previous module, we know that all the information is in the items key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.

In [112]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
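
To see what the function returns, here is a small sketch that calls it on made-up rows. Plain dicts stand in for dataframe rows, and the function body is repeated only so that the example runs on its own.

```python
def get_category_type(row):
    # same logic as the function above, repeated so this sketch is self-contained
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Hypothetical rows, for illustration only
flat_row = {'venue.categories': [{'name': 'Park'}, {'name': 'Garden'}]}
nested_row = {'categories': [{'name': 'Diner'}]}
empty_row = {'venue.categories': []}

results_demo = [get_category_type(r) for r in (flat_row, nested_row, empty_row)]
print(results_demo)
```

Only the first category of each venue is kept, and a venue without categories yields `None`.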

Now we are ready to clean the json and structure it into a pandas dataframe.

In [113]:
venues = results['response']['groups'][0]['items']

nearby_venues = pd.json_normalize(venues) # flatten JSON (pandas >= 1.0; older versions import json_normalize from pandas.io.json)

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Kew Gardens,Park,43.669038,-79.298538
1,Mastermind Toys,Toy / Game Store,43.671453,-79.293971
2,The Ten Spot,Nail Salon,43.67034,-79.299363
3,Sunset Grill,Diner,43.670214,-79.299653
4,Green Eggplant,Mediterranean Restaurant,43.670517,-79.29866


In [114]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

49 venues were returned by Foursquare.


## 2. Explore Neighborhoods in the Toronto Boroughs

### Let's create a function to repeat the same process for all the neighborhoods in these boroughs

In [115]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
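
Note that `v['venue']['categories'][0]['name']` raises an `IndexError` if a venue comes back with an empty category list. The notebook does not show whether that happens for these coordinates, so the following is only a defensive sketch. The helper name `extract_venue_rows` and the sample items are hypothetical.

```python
def extract_venue_rows(name, lat, lng, items):
    """Flatten explore items into tuples, tolerating venues with no category."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows

# Hypothetical response items; the second venue has no category
toy_items = [
    {'venue': {'name': 'Kew Gardens',
               'location': {'lat': 43.669038, 'lng': -79.298538},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unlabelled Spot',
               'location': {'lat': 43.67, 'lng': -79.29},
               'categories': []}},
]
toy_rows = extract_venue_rows('The Beaches', 43.676357, -79.293031, toy_items)
print(toy_rows)
```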

### Now run the above function on each neighborhood and create a new dataframe called york_venues.

In [135]:

york_venues = getNearbyVenues(names=df_York['Neighborhood'],
                                   latitudes=df_York['Latitude'],
                                   longitudes=df_York['Longitude']
                                  )



The Beaches
The Danforth West
Riverdale
The Beaches West
India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park
Summerhill East
Deer Park
Forest Hill SE
Rathnelly
South Hill
Summerhill West
Rosedale
Cabbagetown
St. James Town
Church and Wellesley
Harbourfront
Ryerson
Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide
King
Richmond
Harbourfront East
Toronto Islands
Union Station
Design Exchange
Toronto Dominion Centre
Commerce Court
Victoria Hotel
Roselawn
Forest Hill North
Forest Hill West
The Annex
North Midtown
Yorkville
Harbord
University of Toronto
Chinatown
Grange Park
Kensington Market
CN Tower
Bathurst Quay
Island airport
Harbourfront West
King and Spadina
Railway Lands
South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place
Underground city
Christie
Dovercourt Village
Dufferin
Little Portugal
Trinity
Brockton
Exhibition Place
Parkdale Village
High Park
The Junction South
Parkdale
Roncesvalles
Runnymede

### Let's check the size of the resulting dataframe

In [136]:
print(york_venues.shape)
york_venues.head()

(3259, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,The Beaches,43.676357,-79.293031,Dip 'n Sip,43.678897,-79.297745,Coffee Shop


Let's check how many venues were returned for each neighborhood

In [137]:
york_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Adelaide,100,100,100,100,100,100
Bathurst Quay,17,17,17,17,17,17
Berczy Park,57,57,57,57,57,57
Brockton,23,23,23,23,23,23
Business Reply Mail Processing Centre 969 Eastern,18,18,18,18,18,18
...,...,...,...,...,...,...
Underground city,100,100,100,100,100,100
Union Station,100,100,100,100,100,100
University of Toronto,35,35,35,35,35,35
Victoria Hotel,100,100,100,100,100,100


### Let's find out how many unique categories can be curated from all the returned venues

In [138]:
print('There are {} unique categories.'.format(len(york_venues['Venue Category'].unique())))

There are 238 unique categories.


## 3. Analyze Each Neighborhood

In [144]:
# one hot encoding
#york_onehot=pd.DataFrame()
york_onehot = pd.get_dummies(york_venues[['Venue Category']], prefix="", prefix_sep="")

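# add neighborhood column back to dataframe
# (Foursquare has a venue category that is itself named 'Neighborhood', so this
# assignment overwrites that dummy column in place rather than appending a new one)
york_onehot['Neighborhood'] = york_venues['Neighborhood'] 

# move the last column to the front; because of the name collision above, the last
# column is 'Yoga Studio' rather than 'Neighborhood', as the output below shows
fixed_columns = [york_onehot.columns[-1]] + york_onehot.columns[:-1].tolist()
york_onehot = york_onehot[fixed_columns]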

york_onehot.head()


Unnamed: 0,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


And let's check the Neighborhood column of the new dataframe

In [145]:
york_onehot['Neighborhood'].head()

0    The Beaches
1    The Beaches
2    The Beaches
3    The Beaches
4    The Beaches
Name: Neighborhood, dtype: object

#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [196]:
york_grouped = york_onehot.groupby('Neighborhood').mean().reset_index()
york_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Adelaide,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.01
1,Bathurst Quay,0.0,0.0,0.058824,0.058824,0.117647,0.176471,0.117647,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0
3,Brockton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Business Reply Mail Processing Centre 969 Eastern,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
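
To make the meaning of these averaged values concrete, here is a sketch on a toy table. The neighborhoods and venues are made up. The mean of a one-hot column within a neighborhood is simply the share of that neighborhood's venues that fall in the category.

```python
import pandas as pd

# Hypothetical venues, for illustration only
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Diner', 'Pub', 'Pub', 'Pub'],
})

# one column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# mean of each 0/1 column per neighborhood = relative frequency of the category
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```

Each row of `freq` sums to 1, so neighborhoods with very different venue counts become comparable.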


### Let's write a function to sort the venues in descending order.

In [149]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
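
A quick sketch of the function on a single made-up row, shaped like one row of `york_grouped` with the neighborhood name first and the category frequencies after it. The function is repeated here only so that the example runs on its own.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # same logic as the function above, repeated so this sketch is self-contained
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical row, for illustration only
toy_row = pd.Series(['Toyville', 0.5, 0.1, 0.4],
                    index=['Neighborhood', 'Park', 'Pub', 'Diner'])
top_two = list(return_most_common_venues(toy_row, 2))
print(top_two)
```

The function returns category names, not frequencies, which is why the table of top venues contains only labels.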

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [173]:
import numpy as np
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = york_grouped['Neighborhood']

for ind in np.arange(york_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(york_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head(10)

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Adelaide,Coffee Shop,Thai Restaurant,Café,Bar,Restaurant,Sushi Restaurant,Gastropub,Steakhouse,Lounge,Cosmetics Shop
1,Bathurst Quay,Airport Service,Airport Lounge,Airport Terminal,Boutique,Sculpture Garden,Airport,Airport Food Court,Bar,Harbor / Marina,Rental Car Location
2,Berczy Park,Coffee Shop,Farmers Market,French Restaurant,Bakery,Seafood Restaurant,Restaurant,Cheese Shop,Café,Beer Bar,Cocktail Bar
3,Brockton,Café,Breakfast Spot,Coffee Shop,Bakery,Climbing Gym,Burrito Place,Japanese Restaurant,Italian Restaurant,Stadium,Restaurant
4,Business Reply Mail Processing Centre 969 Eastern,Light Rail Station,Yoga Studio,Auto Workshop,Skate Park,Smoke Shop,Spa,Farmers Market,Fast Food Restaurant,Burrito Place,Restaurant
5,CN Tower,Airport Service,Airport Lounge,Airport Terminal,Boutique,Sculpture Garden,Airport,Airport Food Court,Bar,Harbor / Marina,Rental Car Location
6,Cabbagetown,Coffee Shop,Bakery,Pizza Place,Italian Restaurant,Pharmacy,Restaurant,Café,Pub,Japanese Restaurant,Snack Place
7,Central Bay Street,Coffee Shop,Sandwich Place,Italian Restaurant,Ice Cream Shop,Juice Bar,Café,Burger Joint,Japanese Restaurant,Bar,Salad Place
8,Chinatown,Bar,Vietnamese Restaurant,Café,Chinese Restaurant,Bakery,Coffee Shop,Mexican Restaurant,Dumpling Restaurant,Vegetarian / Vegan Restaurant,Grocery Store
9,Christie,Grocery Store,Café,Park,Baby Store,Restaurant,Diner,Italian Restaurant,Athletics & Sports,Nightclub,Candy Store
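
The column names above get their suffix from a three-item list with a fallback to 'th', which is fine for 10 columns but would produce '21th' for larger values of `num_top_venues`. As a small sketch, a general helper could look like this (the name `ordinal` is hypothetical).

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th and the other teens always take 'th'
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

labels_demo = [ordinal(n) for n in (1, 2, 3, 4, 11, 21)]
print(labels_demo)
```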


In [178]:
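# Remove a 'Cluster Labels' column left by an earlier run, if any, so the insert in section 4 does not fail
neighborhoods_venues_sorted.drop('Cluster Labels',axis=1,inplace=True,errors='ignore')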

## 4. Cluster Neighborhoods

Run k-means to cluster the neighborhoods into 5 clusters.

In [179]:
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 5

york_grouped_clustering = york_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
#kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(york_grouped_clustering)
kmeans = KMeans(init="k-means++",n_clusters=kclusters, random_state=0,n_init=12).fit(york_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:20] 

array([0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int32)
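
The label values themselves (0, 3, ...) are arbitrary identifiers; what matters is which rows share a label. A minimal sketch on made-up frequency rows, using the same `KMeans` settings as above but with two clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical frequency rows: two park-heavy and two cafe-heavy neighborhoods
X = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9],
              [0.2, 0.8]])

toy_kmeans = KMeans(init='k-means++', n_clusters=2, random_state=0, n_init=12).fit(X)
labels = toy_kmeans.labels_
print(labels)
```

Rows with similar category frequencies end up with the same label, which is the basis for grouping the neighborhoods here.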

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [180]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

york_merged = df_York

# merge neighborhoods_venues_sorted into df_York so each neighborhood keeps its latitude/longitude
york_merged=pd.merge(york_merged, neighborhoods_venues_sorted, how='inner', on='Neighborhood')


In [169]:
york_merged.shape

(74, 12)

In [181]:
#york_merged.columns
desired_columns=['Postcode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude',
       'Cluster Labels', '1st Most Common Venue', '2nd Most Common Venue',
       '3rd Most Common Venue', '4th Most Common Venue',
       '5th Most Common Venue']
york_merged=york_merged[desired_columns]
york_merged.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Trail,Coffee Shop,Health Food Store,Pub,Doner Restaurant
1,M4K,East Toronto,The Danforth West,43.679557,-79.352188,0,Greek Restaurant,Italian Restaurant,Coffee Shop,Bookstore,Furniture / Home Store
2,M4K,East Toronto,Riverdale,43.679557,-79.352188,0,Greek Restaurant,Italian Restaurant,Coffee Shop,Bookstore,Furniture / Home Store
3,M4L,East Toronto,The Beaches West,43.668999,-79.315572,0,Park,Sandwich Place,Pizza Place,Gym,Board Shop
4,M4L,East Toronto,India Bazaar,43.668999,-79.315572,0,Park,Sandwich Place,Pizza Place,Gym,Board Shop


### Finally, let's visualize the resulting clusters

In [182]:
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium # map rendering library
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(york_merged['Latitude'], york_merged['Longitude'], york_merged['Neighborhood'], york_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
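
In the cell above, `ys` is only used for its length, and the colors come from matplotlib's rainbow colormap. As a sketch of a different technique (not the one used above), evenly spaced hues from the standard library's `colorsys` module also give one distinct hex color per cluster. The function name `cluster_colors` is hypothetical.

```python
import colorsys

def cluster_colors(n):
    """Return n hex colors spaced evenly around the hue wheel."""
    colors = []
    for i in range(n):
        # hue i/n, with fixed saturation and brightness
        r, g, b = colorsys.hsv_to_rgb(i / n, 0.8, 0.9)
        colors.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colors

palette = cluster_colors(5)
print(palette)
```

With such a list, a marker color could be looked up directly as `palette[cluster]`.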

## 5. Exploring clusters

In [198]:
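# average only the numeric coordinate columns; the text columns cannot be averaged
york_merged.groupby('Cluster Labels')[['Latitude', 'Longitude']].mean()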

Unnamed: 0_level_0,Latitude,Longitude
Cluster Labels,Unnamed: 1_level_1,Unnamed: 2_level_1
0,43.661612,-79.394321
1,43.70037,-79.397233
2,43.711695,-79.416936
3,43.628947,-79.39442
4,43.689574,-79.38316


### 1st Cluster

In [185]:
york_merged.loc[york_merged['Cluster Labels'] == 0, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,East Toronto,0,Trail,Coffee Shop,Health Food Store,Pub,Doner Restaurant
1,East Toronto,0,Greek Restaurant,Italian Restaurant,Coffee Shop,Bookstore,Furniture / Home Store
2,East Toronto,0,Greek Restaurant,Italian Restaurant,Coffee Shop,Bookstore,Furniture / Home Store
3,East Toronto,0,Park,Sandwich Place,Pizza Place,Gym,Board Shop
4,East Toronto,0,Park,Sandwich Place,Pizza Place,Gym,Board Shop
5,East Toronto,0,Café,Coffee Shop,Gastropub,Bakery,Brewery
7,Central Toronto,0,Hotel,Gym,Breakfast Spot,Food & Drink Shop,Sandwich Place
8,Central Toronto,0,Clothing Store,Coffee Shop,Sporting Goods Shop,Diner,Mexican Restaurant
9,Central Toronto,0,Dessert Shop,Sandwich Place,Gym,Sushi Restaurant,Coffee Shop
12,Central Toronto,0,Coffee Shop,Pub,Light Rail Station,American Restaurant,Sushi Restaurant


### 2nd Cluster

In [186]:
york_merged.loc[york_merged['Cluster Labels'] == 1, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
6,Central Toronto,1,Park,Swim School,Bus Line,Women's Store,Dim Sum Restaurant
17,Downtown Toronto,1,Park,Trail,Playground,Doner Restaurant,Dim Sum Restaurant
38,Central Toronto,1,Trail,Park,Bus Line,Sushi Restaurant,Jewelry Store
39,Central Toronto,1,Trail,Park,Bus Line,Sushi Restaurant,Jewelry Store


### 3rd Cluster

In [187]:
york_merged.loc[york_merged['Cluster Labels'] == 2, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
37,Central Toronto,2,Garden,Women's Store,Dessert Shop,Event Space,Ethiopian Restaurant


### 4th Cluster

In [188]:
york_merged.loc[york_merged['Cluster Labels'] == 3, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
48,Downtown Toronto,3,Airport Service,Airport Lounge,Airport Terminal,Boutique,Sculpture Garden
49,Downtown Toronto,3,Airport Service,Airport Lounge,Airport Terminal,Boutique,Sculpture Garden
50,Downtown Toronto,3,Airport Service,Airport Lounge,Airport Terminal,Boutique,Sculpture Garden
51,Downtown Toronto,3,Airport Service,Airport Lounge,Airport Terminal,Boutique,Sculpture Garden
52,Downtown Toronto,3,Airport Service,Airport Lounge,Airport Terminal,Boutique,Sculpture Garden
53,Downtown Toronto,3,Airport Service,Airport Lounge,Airport Terminal,Boutique,Sculpture Garden
54,Downtown Toronto,3,Airport Service,Airport Lounge,Airport Terminal,Boutique,Sculpture Garden


### 5th Cluster

In [189]:
york_merged.loc[york_merged['Cluster Labels'] == 4, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
10,Central Toronto,4,Restaurant,Playground,Tennis Court,Trail,Distribution Center
11,Central Toronto,4,Restaurant,Playground,Tennis Court,Trail,Distribution Center


----------------------------------------

### *******End of Assignment*******