## This is the notebook for the final capstone project

#### A description of the problem and a discussion of the background:

> Problem Statement: Help find neighborhoods in Toronto, ON that are similar to a given neighborhood in New York.
 <u>Who will be interested:</u> Relocation agencies can use this to help people who are looking to move from New York to Toronto.
> <b>Background:</b> Toronto is a fast-growing city in North America, and the many new opportunities opening up across the city are attracting people. This has led to a recent growth in immigration to the city. According to this <a href='https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=2ahUKEwiiiv7ihqbjAhWPtp4KHX-kCA8QFjABegQIDBAF&url=https%3A%2F%2Fwww12.statcan.gc.ca%2Fnhs-enm%2F2011%2Fas-sa%2F99-010-x%2F99-010-x2011001-eng.cfm&usg=AOvVaw2_J8qJuBqdE4dWLNyJOajW'>site</a>, 7 out of 10 immigrants in Ontario lived in Toronto. This has given rise to many immigration services that cater to the city of Toronto, ON, and to a culturally rich and diverse Toronto. Huffington Post also published this widely read <a href='https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwjcsfPMh6bjAhXHjp4KHVPFDAUQFjAAegQIABAB&url=https%3A%2F%2Fwww.huffpost.com%2Fentry%2F9-reasons-to-leave-new-york-for-toronto_b_58094c97e4b099c434319388&usg=AOvVaw0v70aurDmwqx3Yh_WFJ8ra'>article</a> highlighting the migration from New York to Toronto. What stands out in the article is that people do not move only for new opportunities; they also factor in lifestyle. Toronto's rising cultural diversity has been a magnet for people from culturally diverse New York. Relocation agencies capitalize on this and offer many services that help people explore their future neighborhood. We intend to build a solution that supports this effort. Our algorithm will factor in various lifestyle indicators of a neighborhood and use machine learning to find localities that have matching cultural and lifestyle offerings. This will help find neighborhoods in Toronto, ON that are similar to a given neighborhood in New York.

#### A description of the data and how it will be used to solve the problem.

> Data for two cities, Toronto and New York, will be used to compare the neighborhoods; we then rank the neighborhoods from the selected boroughs of each city.
>> We will reuse the previous assignments to retrieve the neighborhoods and geo-coordinates for New York and Toronto, and use unsupervised machine learning to cluster the neighborhoods from the two cities. The data sources are newyork_data.json and the <a href='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'>Toronto Wiki page</a>. Geo-locations for Toronto come from this <a href='https://cocl.us/Geospatial_data'>source file</a>. The implementation logic is as follows:

<ul>
    <li>Create a dataset that holds the geo-coordinates for Toronto's neighborhoods</li>
    <li>Use Foursquare APIs to get venues for each of Toronto's neighborhoods</li>
    <li>Sort through the data to identify the top 10 most common venue categories for each of Toronto's neighborhoods</li>
    <li>Perform the above for New York's neighborhoods</li>
    <li>This step will be the input: select a neighborhood in NY for which we are looking for lookalikes in Toronto</li>
    <li>Use K-means on a dataset that has all of Toronto's neighborhoods plus this neighborhood</li>
    <li>Then find the cluster that contains the NY neighborhood and list all Toronto neighborhoods in it</li>
</ul>
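
The steps above can be sketched end to end on toy data. This is only an illustration under stated assumptions: the venue rows, the neighbourhood names (`T-Cafe`, `NY-Query`, ...) and the helper `category_frequencies` are made up for this example and do not come from Foursquare or from the cells below.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy venue listings (made up for illustration; not Foursquare data).
toronto_venues = pd.DataFrame({
    "Neighbourhood": ["T-Cafe", "T-Cafe", "T-Park", "T-Park", "T-Bar", "T-Bar"],
    "Venue Category": ["Café", "Bakery", "Park", "Trail", "Bar", "Nightclub"],
})
ny_venues = pd.DataFrame({
    "Neighbourhood": ["NY-Query", "NY-Query"],
    "Venue Category": ["Bar", "Nightclub"],
})

def category_frequencies(venues):
    """One-hot encode venue categories and average them per neighbourhood."""
    onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
    onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])
    return onehot.groupby("Neighbourhood").mean().reset_index()

# Stack the Toronto rows and the single NY query row; categories missing
# on one side become NaN and are treated as frequency 0.
combined = pd.concat(
    [category_frequencies(toronto_venues), category_frequencies(ny_venues)],
    ignore_index=True, sort=False,
).fillna(0)

# Cluster on the frequency columns only.
features = combined.drop(columns="Neighbourhood")
combined["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Toronto neighbourhoods that share the query neighbourhood's cluster.
is_query = combined["Neighbourhood"] == "NY-Query"
query_cluster = combined.loc[is_query, "cluster"].iloc[0]
lookalikes = combined.loc[(combined["cluster"] == query_cluster) & ~is_query,
                          "Neighbourhood"].tolist()
print(lookalikes)
```

In this toy set the query row has the same category mix as `T-Bar`, so the two should land in the same cluster.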


#### <u>Create a dataset that holds the geo-coordinates for Toronto's neighborhoods</u>

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

In [2]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (placeholder; keep real credentials out of shared notebooks)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (placeholder)
VERSION = '20180604'
LIMIT = 30

In [3]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')
PO_Table = soup.find('table',class_='wikitable sortable')
df = pd.read_html(str(PO_Table))
df_PO = df[0]
df_PO1 = df_PO[df_PO['Borough']!='Not assigned']
df_PO2 = df_PO1.groupby(['Postcode','Borough'])['Neighbourhood'].apply(lambda x: ','.join(x)).reset_index()
df_PO2['Neighbourhood'] = df_PO2.apply(
    lambda row: row['Borough'] if (row['Neighbourhood']== 'Not assigned') else row['Neighbourhood'],
    axis=1
)
geo_coord_master = pd.read_csv('https://cocl.us/Geospatial_data')
df_PO3 = pd.merge(df_PO2, geo_coord_master, 
                   left_on='Postcode', right_on = 'Postal Code', how='inner')
df_PO4 = df_PO3[['Postcode','Borough','Neighbourhood','Latitude','Longitude']]
neighborhoods = df_PO4[df_PO4['Borough'].str.contains('Toronto', na=False)][['Borough','Neighbourhood','Latitude','Longitude']]
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 4 boroughs and 38 neighborhoods.
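
To make the cleaning steps in the cell above easier to follow, here is a small sketch of the same sequence (filter, group, fall back to the borough name, merge, keep Toronto boroughs) on a hand-written table. The rows are illustrative stand-ins in the shape of the Wikipedia table and the geospatial CSV; the `M7A` coordinates in particular are placeholders, not real values.

```python
import pandas as pd

# Hand-written rows in the shape of the postal-code table (illustrative only).
postal = pd.DataFrame({
    "Postcode": ["M1A", "M4E", "M4K", "M4K", "M7A"],
    "Borough": ["Not assigned", "East Toronto", "East Toronto",
                "East Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "The Beaches", "The Danforth West",
                      "Riverdale", "Not assigned"],
})
# Hand-written rows in the shape of the geospatial CSV (M7A values are placeholders).
coords = pd.DataFrame({
    "Postal Code": ["M4E", "M4K", "M7A"],
    "Latitude": [43.676357, 43.679557, 43.66],
    "Longitude": [-79.293031, -79.352188, -79.39],
})

# 1. Drop postcodes without a borough.
assigned = postal[postal["Borough"] != "Not assigned"]

# 2. One row per postcode, neighbourhood names joined with commas.
grouped = (assigned.groupby(["Postcode", "Borough"])["Neighbourhood"]
           .apply(",".join).reset_index())

# 3. Where no neighbourhood is assigned, reuse the borough name.
unnamed = grouped["Neighbourhood"] == "Not assigned"
grouped.loc[unnamed, "Neighbourhood"] = grouped.loc[unnamed, "Borough"]

# 4. Attach coordinates and keep only boroughs whose name contains "Toronto".
merged = (grouped.merge(coords, left_on="Postcode", right_on="Postal Code", how="inner")
          .drop(columns="Postal Code"))
toronto_only = merged[merged["Borough"].str.contains("Toronto", na=False)]
print(toronto_only)
```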


#### <u>Further explore Toronto's neighborhoods</u>

In [4]:
import folium # map rendering library
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe


In [5]:
address = 'Toronto, ON'
geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
Toronto_data = neighborhoods.reset_index(drop=True)

# create map of Toronto using latitude and longitude values
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(Toronto_data['Latitude'], Toronto_data['Longitude'], Toronto_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)  
    
map_Toronto

#### <u>Use Foursquare APIs to get venues for each of Toronto's neighborhoods</u>

In [6]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
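
`getNearbyVenues` indexes straight into the response (`["response"]['groups'][0]['items']`, `['categories'][0]`), so a single missing key or empty category list would stop the whole loop. The sketch below shows one possible way to flatten a payload more defensively. It assumes only the response shape implied by the indexing above and has not been checked against the Foursquare documentation; the helper name `extract_venues` and the `"Uncategorized"` fallback label are hypothetical.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into rows, tolerating missing pieces."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        # An absent or empty category list falls back to a placeholder label.
        categories = venue.get("categories") or [{}]
        rows.append((
            name, lat, lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0].get("name", "Uncategorized"),
        ))
    return rows

# Hand-written payload in the assumed shape (not a real API response).
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Corner Cafe", "location": {"lat": 1.0, "lng": 2.0},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Mystery Spot", "location": {"lat": 3.0, "lng": 4.0},
               "categories": []}},
]}]}}
sample_rows = extract_venues(sample, "Test", 0.0, 0.0)
print(sample_rows)
```

An empty or error payload then yields an empty list for that neighbourhood instead of raising.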

In [7]:
Toronto_venues = getNearbyVenues(names=Toronto_data['Neighbourhood'],
                                   latitudes=Toronto_data['Latitude'],
                                   longitudes=Toronto_data['Longitude']
                                  )

The Beaches
The Danforth West,Riverdale
The Beaches West,India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park,Summerhill East
Deer Park,Forest Hill SE,Rathnelly,South Hill,Summerhill West
Rosedale
Cabbagetown,St. James Town
Church and Wellesley
Harbourfront,Regent Park
Ryerson,Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide,King,Richmond
Harbourfront East,Toronto Islands,Union Station
Design Exchange,Toronto Dominion Centre
Commerce Court,Victoria Hotel
Roselawn
Forest Hill North,Forest Hill West
The Annex,North Midtown,Yorkville
Harbord,University of Toronto
Chinatown,Grange Park,Kensington Market
CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place,Underground city
Christie
Dovercourt Village,Dufferin
Little Portugal,Trinity
Brockton,Exhibition Place,Parkdale Village
High Park,The Junction South
Parkdale,Roncesvall

In [8]:
neighborhood_latitude = Toronto_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = Toronto_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = Toronto_data.loc[0, 'Neighbourhood'] # neighborhood name

# one hot encoding
Toronto_onehot = pd.get_dummies(Toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Toronto_onehot['Neighbourhood'] = Toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [Toronto_onehot.columns[-1]] + list(Toronto_onehot.columns[:-1])
Toronto_onehot = Toronto_onehot[fixed_columns]

Toronto_onehot.head()



Unnamed: 0,Neighbourhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Theme Restaurant,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [10]:
Toronto_grouped = Toronto_onehot.groupby('Neighbourhood').mean().reset_index()
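
Taking the mean of one-hot columns per neighbourhood gives the share of that neighbourhood's venues that fall into each category, so every row sums to 1. The toy example below (made-up venues) shows this, and checks that it matches a row-normalised `pd.crosstab`, which is an alternative way to get the same table rather than the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up venues: neighbourhood A has 4, neighbourhood B has 2.
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Café", "Café", "Park", "Pub", "Park", "Park"],
})

# One-hot encode, then average per neighbourhood (the approach used above).
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])
grouped = onehot.groupby("Neighbourhood").mean().reset_index()

# Alternative: count category occurrences and normalise each row.
freq = pd.crosstab(venues["Neighbourhood"], venues["Venue Category"], normalize="index")

same = np.allclose(grouped.set_index("Neighbourhood").to_numpy(), freq.to_numpy())
print(grouped)
print("matches crosstab:", same)
```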

#### <u>Sort through the data to identify the top 10 most common venue categories for each of Toronto's neighborhoods</u>

In [11]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # only 1st, 2nd and 3rd have special suffixes; the rest use 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
Toronto_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
Toronto_neighborhoods_venues_sorted['Neighbourhood'] = Toronto_grouped['Neighbourhood']

for ind in np.arange(Toronto_grouped.shape[0]):
    Toronto_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Toronto_grouped.iloc[ind, :], num_top_venues)

Toronto_neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,King,Richmond",Steakhouse,Asian Restaurant,Café,Pizza Place,Hotel,Neighborhood,Lounge,Burger Joint,Seafood Restaurant,Smoke Shop
1,Berczy Park,Seafood Restaurant,Coffee Shop,Cocktail Bar,Beer Bar,Café,Farmers Market,Greek Restaurant,Jazz Club,Basketball Stadium,Fish Market
2,"Brockton,Exhibition Place,Parkdale Village",Coffee Shop,Breakfast Spot,Café,Climbing Gym,Stadium,Burrito Place,Restaurant,Caribbean Restaurant,Pet Store,Bakery
3,Business Reply Mail Processing Centre 969 Eastern,Yoga Studio,Fast Food Restaurant,Park,Comic Shop,Pizza Place,Butcher,Burrito Place,Recording Studio,Restaurant,Brewery
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",Airport Lounge,Airport Service,Airport Terminal,Harbor / Marina,Sculpture Garden,Airport Food Court,Airport Gate,Bar,Boat or Ferry,Boutique
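
Because `return_most_common_venues` sorts every category in the row, a neighbourhood with fewer than ten distinct categories has its remaining slots filled by categories whose frequency is 0. The sketch below shows one possible variant that skips those, plus a suffix helper that also covers 11th to 13th. Both helpers (`ordinal`, `top_categories`) and the sample row are hypothetical and are not part of the notebook's pipeline.

```python
import pandas as pd

def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

def top_categories(frequencies, n):
    """Names of the n most frequent categories, skipping those with frequency 0."""
    nonzero = frequencies[frequencies > 0]
    return nonzero.sort_values(ascending=False).index[:n].tolist()

# Made-up frequency row with only three categories actually present.
row = pd.Series({"Bus Line": 0.2, "Park": 0.5, "Dog Run": 0.0, "Swim School": 0.3})
top = top_categories(row, 10)
labels = [f"{ordinal(i + 1)} Most Common Venue" for i in range(len(top))]
print(dict(zip(labels, top)))
```

Rows with fewer than ten categories would then need padding (for example with `None`) before being written into a fixed-width dataframe.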


#### <u>Perform the above for New York's neighborhoods</u>
<ul>
    <li>Create a dataset that holds the geo-coordinates for NY's neighborhoods</li>
    <li>Use Foursquare APIs to get venues for each of NY's neighborhoods</li>
    <li>Sort through the data to identify the top 10 most common venue categories for each of NY's neighborhoods</li>
</ul>

In [12]:
import json # library to handle JSON files
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# collect one record per neighborhood, then build the dataframe in one step
neighborhoods_data = newyork_data['features']
rows = []

for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # the coordinates are stored as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# DataFrame.append was removed in pandas 2.0; building from a list also avoids repeated copies
neighborhoods = pd.DataFrame(rows, columns=column_names)

print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


In [13]:
address = 'New York, NY'
geolocator = Nominatim(user_agent="NY_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
NY_data = neighborhoods.reset_index(drop=True)
NY_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [14]:
NY_venues = getNearbyVenues(names=NY_data['Neighborhood'],
                                   latitudes=NY_data['Latitude'],
                                   longitudes=NY_data['Longitude']
                                  )

Wakefield
Co-op City
Eastchester
Fieldston
Riverdale
Kingsbridge
Marble Hill
Woodlawn
Norwood
Williamsbridge
Baychester
Pelham Parkway
City Island
Bedford Park
University Heights
Morris Heights
Fordham
East Tremont
West Farms
High  Bridge
Melrose
Mott Haven
Port Morris
Longwood
Hunts Point
Morrisania
Soundview
Clason Point
Throgs Neck
Country Club
Parkchester
Westchester Square
Van Nest
Morris Park
Belmont
Spuyten Duyvil
North Riverdale
Pelham Bay
Schuylerville
Edgewater Park
Castle Hill
Olinville
Pelham Gardens
Concourse
Unionport
Edenwald
Bay Ridge
Bensonhurst
Sunset Park
Greenpoint
Gravesend
Brighton Beach
Sheepshead Bay
Manhattan Terrace
Flatbush
Crown Heights
East Flatbush
Kensington
Windsor Terrace
Prospect Heights
Brownsville
Williamsburg
Bushwick
Bedford Stuyvesant
Brooklyn Heights
Cobble Hill
Carroll Gardens
Red Hook
Gowanus
Fort Greene
Park Slope
Cypress Hills
East New York
Starrett City
Canarsie
Flatlands
Mill Island
Manhattan Beach
Coney Island
Bath Beach
Borough Park
Dyker

In [15]:
# one hot encoding
NY_onehot = pd.get_dummies(NY_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
NY_onehot['Neighbourhood'] = NY_venues['Neighbourhood'] 

# move neighborhood column to the first column
cols = list(NY_onehot)
# move the column to head of list using index, pop and insert
cols.insert(0, cols.pop(cols.index('Neighbourhood')))
# use loc to reorder the columns
NY_onehot = NY_onehot.loc[:, cols]

NY_onehot.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Animal Shelter,Antique Shop,Arcade,Arepa Restaurant,...,Warehouse Store,Waste Facility,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Wakefield,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Wakefield,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Wakefield,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Wakefield,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Wakefield,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
NY_grouped = NY_onehot.groupby('Neighbourhood').mean().reset_index()
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # only 1st, 2nd and 3rd have special suffixes; the rest use 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
NY_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
NY_neighborhoods_venues_sorted['Neighbourhood'] = NY_grouped['Neighbourhood']

for ind in np.arange(NY_grouped.shape[0]):
    NY_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(NY_grouped.iloc[ind, :], num_top_venues)

NY_neighborhoods_venues_sorted.head()



Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Allerton,Pizza Place,Pharmacy,Spa,Deli / Bodega,Supermarket,Department Store,Fried Chicken Joint,Breakfast Spot,Bus Station,Gas Station
1,Annadale,Pizza Place,Park,Sports Bar,Restaurant,Food,Diner,Train Station,Pharmacy,Field,Event Space
2,Arden Heights,Pharmacy,Coffee Shop,Pizza Place,Bus Stop,Yoga Studio,Financial or Legal Service,Factory,Falafel Restaurant,Farm,Farmers Market
3,Arlington,Bus Stop,Deli / Bodega,American Restaurant,Boat or Ferry,Food,Grocery Store,Fish Market,Farm,Farmers Market,Fast Food Restaurant
4,Arrochar,Deli / Bodega,Pizza Place,Italian Restaurant,Bus Stop,Athletics & Sports,Middle Eastern Restaurant,Bagel Shop,Liquor Store,Supermarket,Hotel


## This step will be the input: select a neighborhood in NY for which we are looking for lookalikes in Toronto
> Here, I am selecting <b>Chelsea</b> in New York as the input, so I will be looking for places in Toronto that are similar to Chelsea, NY.

In [17]:
NY_neighborhood_input = 'Chelsea'

In [18]:
NY_neighborhoods_venues_sorted[NY_neighborhoods_venues_sorted['Neighbourhood'] == NY_neighborhood_input]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
49,Chelsea,Hotel,Ice Cream Shop,Nightclub,Theater,Sandwich Place,Beer Bar,Market,Scenic Lookout,Taco Place,Gift Shop


In [19]:
NY_grouped[NY_grouped['Neighbourhood'] == NY_neighborhood_input]

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Animal Shelter,Antique Shop,Arcade,Arepa Restaurant,...,Warehouse Store,Waste Facility,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
49,Chelsea,0.0,0.0,0.0,0.0,0.029412,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### <u>Use K-means on a dataset that has all of Toronto's neighborhoods plus this neighborhood</u>

In [20]:
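# stack the selected NY neighborhood under the Toronto rows; columns are aligned by category name
combined_grouped = pd.concat([Toronto_grouped, NY_grouped[NY_grouped['Neighbourhood'] == NY_neighborhood_input]], axis=0, sort=False, ignore_index=True)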
combined_grouped.fillna(0, inplace=True)
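
The `fillna(0)` matters because `pd.concat` aligns the two frames on the union of their columns: any category that exists in only one city shows up as `NaN` for the other, and K-means cannot be fitted on `NaN` values. A toy illustration with made-up frequencies:

```python
import pandas as pd

# Made-up frequency tables; "Park" exists only in Toronto, "Wine Shop" only in NY.
toronto = pd.DataFrame({"Neighbourhood": ["T1", "T2"],
                        "Café": [0.5, 0.0],
                        "Park": [0.5, 1.0]})
ny = pd.DataFrame({"Neighbourhood": ["Q"],
                   "Café": [0.25],
                   "Wine Shop": [0.75]})

combined = pd.concat([toronto, ny], axis=0, sort=False, ignore_index=True)
n_missing = int(combined.isna().sum().sum())  # cells with no value after alignment

# A category absent from a neighbourhood simply has frequency 0 there.
combined = combined.fillna(0)
print(n_missing)
print(combined)
```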

In [21]:
combined_grouped[combined_grouped['Neighbourhood'] == 'Chelsea'] ## see if Chelsea is added to the DF

Unnamed: 0,Neighbourhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Veterinarian,Video Store,Warehouse Store,Waste Facility,Waterfront,Weight Loss Center,Whisky Bar,Wine Shop,Wings Joint,Women's Store
38,Chelsea,0.0,0.0,0.0,0.0,0.0,0.0,0.029412,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
combined_grouped.head()

Unnamed: 0,Neighbourhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Veterinarian,Video Store,Warehouse Store,Waste Facility,Waterfront,Weight Loss Center,Whisky Bar,Wine Shop,Wings Joint,Women's Store
0,"Adelaide,King,Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Brockton,Exhibition Place,Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",0.0625,0.0625,0.0625,0.125,0.125,0.125,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Run k-means

In [23]:
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 10

combined_grouped_clustering = combined_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(combined_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 7, 1, 3, 7, 7, 1, 7, 3], dtype=int32)

In [24]:
# add clustering labels
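# DataFrame.append was removed in pandas 2.0, so the two frames are stacked with pd.concat
combined_neighborhoods_venues_sorted = pd.concat([
    Toronto_neighborhoods_venues_sorted,
    NY_neighborhoods_venues_sorted[NY_neighborhoods_venues_sorted['Neighbourhood'] == NY_neighborhood_input]
])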
combined_neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Toronto_merged = Toronto_data

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
Toronto_merged = Toronto_merged.join(combined_neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

Toronto_merged.head() # check the last columns!

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,East Toronto,The Beaches,43.676357,-79.293031,9,Neighborhood,Other Great Outdoors,Health Food Store,Trail,Pub,Cuban Restaurant,Ethiopian Restaurant,Eastern European Restaurant,Dumpling Restaurant,Dog Run
1,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188,8,Greek Restaurant,Ice Cream Shop,Italian Restaurant,Yoga Studio,Bookstore,Restaurant,Spa,Diner,Juice Bar,Dessert Shop
2,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572,3,Park,Pet Store,Ice Cream Shop,Liquor Store,Sandwich Place,Burger Joint,Fast Food Restaurant,Fish & Chips Shop,Burrito Place,Steakhouse
3,East Toronto,Studio District,43.659526,-79.340923,7,Café,Coffee Shop,Bakery,Italian Restaurant,American Restaurant,Middle Eastern Restaurant,Stationery Store,Fish Market,Latin American Restaurant,Seafood Restaurant
4,Central Toronto,Lawrence Park,43.72802,-79.38879,4,Bus Line,Park,Swim School,Dance Studio,Falafel Restaurant,Ethiopian Restaurant,Eastern European Restaurant,Dumpling Restaurant,Dog Run,Discount Store


In [33]:
cluster_map = pd.DataFrame()
cluster_map['data_index'] = combined_grouped_clustering.index.values
cluster_map['cluster'] = kmeans.labels_

In [34]:
## get the index of Chelsea
combined_grouped[combined_grouped['Neighbourhood']=='Chelsea']

Unnamed: 0,Neighbourhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Veterinarian,Video Store,Warehouse Store,Waste Facility,Waterfront,Weight Loss Center,Whisky Bar,Wine Shop,Wings Joint,Women's Store
38,Chelsea,0.0,0.0,0.0,0.0,0.0,0.0,0.029412,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [32]:
cluster_map[cluster_map['data_index']==38] ##get the cluster that Chelsea is in

Unnamed: 0,data_index,cluster
38,38,1
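
The lookup above relies on the row index `38`, which would change if the set of Toronto neighbourhoods or the input neighbourhood changed. A possible alternative is to look the cluster up by name. The sketch below uses a short made-up name list and label array as stand-ins for `combined_grouped['Neighbourhood']` and `kmeans.labels_`; the labels shown are not the notebook's actual results.

```python
import numpy as np
import pandas as pd

# Stand-ins for combined_grouped['Neighbourhood'] and kmeans.labels_ (toy values).
names = pd.Series(["Berczy Park", "Christie", "Rosedale", "Chelsea"])
labels = np.array([1, 3, 7, 1])
query = "Chelsea"

# Boolean mask for the query row, then read its label.
is_query = (names == query).to_numpy()
query_cluster = int(labels[is_query][0])

# Every other neighbourhood carrying the same label.
same_cluster = names[(labels == query_cluster) & ~is_query].tolist()
print(query_cluster, same_cluster)
```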


In [27]:
cluster_map.groupby(['cluster']).count() ## double check if records are clustered

Unnamed: 0_level_0,data_index
cluster,Unnamed: 1_level_1
0,1
1,11
2,1
3,11
4,1
5,1
6,1
7,10
8,1
9,1
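
In the counts above, seven of the ten clusters contain a single row, which may suggest that `kclusters = 10` is on the high side for 39 rows. One common sanity check, not performed in this notebook, is to fit K-means for several values of k and compare the inertia and the cluster sizes. The sketch below does this on synthetic points (three made-up blobs), so the numbers say nothing about the Toronto data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs of 20 points each (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_
    sizes = np.bincount(model.labels_, minlength=k)
    # Inertia that stops dropping sharply, or many tiny clusters, hints that k is too large.
    print(k, round(model.inertia_, 2), sizes.tolist())
```

To apply the same loop to this notebook, `points` would be replaced by `combined_grouped_clustering`.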


In [36]:
combined_grouped_with_cluster=combined_grouped.merge(cluster_map.set_index('data_index'), left_index=True, right_on='data_index')
combined_grouped_with_cluster[combined_grouped_with_cluster['cluster']==1]['Neighbourhood']

data_index
0                                Adelaide,King,Richmond
1                                           Berczy Park
3     Business Reply Mail Processing Centre 969 Eastern
7               Chinatown,Grange Park,Kensington Market
11                                           Davisville
12                                     Davisville North
15                          Dovercourt Village,Dufferin
30                              Ryerson,Garden District
32                      Stn A PO Boxes 25 The Esplanade
34                    The Annex,North Midtown,Yorkville
38                                              Chelsea
Name: Neighbourhood, dtype: object
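
Cluster membership says which Toronto neighbourhoods are grouped with the input, but not which of them is closest. One possible way to rank them, which is an assumption of this sketch and not a step the notebook performs, is to sort by Euclidean distance between category-frequency rows. The frequencies and names (`N1`, `N2`, `N3`) below are made up.

```python
import numpy as np
import pandas as pd

# Made-up category frequencies for three candidate neighbourhoods.
freq = pd.DataFrame(
    {"Bar": [0.5, 0.0, 0.25], "Café": [0.0, 0.5, 0.25], "Hotel": [0.5, 0.5, 0.5]},
    index=["N1", "N2", "N3"],
)
# Frequencies of the query neighbourhood over the same categories.
query = pd.Series({"Bar": 0.5, "Café": 0.0, "Hotel": 0.5})

# Row-wise Euclidean distance to the query, smallest (most similar) first.
distances = np.sqrt(((freq - query) ** 2).sum(axis=1)).sort_values()
ranking = distances.index.tolist()
print(distances)
```

With the notebook's data, `freq` would be the cluster's rows of `combined_grouped` (indexed by neighbourhood) and `query` the Chelsea row.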

In [39]:
import matplotlib.cm as cm
import matplotlib.colors as colors
# create map
address = 'Toronto, ON'
geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Toronto_merged['Latitude'], Toronto_merged['Longitude'], Toronto_merged['Neighbourhood'], Toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### End of assignment