# Applied Data Science - Capstone Project

## Introduction

### Business Problem

A company that rents and sells residential real estate wants to reduce the operational cost of recommending neighborhoods to its customers.

### Objective

Automatically recommend neighborhoods with greater similarity to the client's current neighborhood in the desired destination city.

This task can be achieved by using clustering algorithms to measure the similarity between neighbourhoods (current -> target) and to produce good recommendations for customers while reducing operational costs for the company.

For this project we're going to simulate a customer who lives in a random neighbourhood of Toronto, CA and is willing to move to New York City, US.

> **NOTE:** For this project, the similarity of neighborhoods is determined by the most common venues in the neighborhood where the customer currently lives.

## Data

This project requires different kinds of data such as geolocation data and venues information for each selected city.

Geolocation Information:

* Neighbourhoods in Toronto.
* Neighbourhoods in New York City.

Venues Information:

* Venues for each neighbourhood in Toronto.
* Venues for each neighbourhood in New York.

### Data Sources

The following data sources are used in this project.

* [New York City Open Data](https://data.cityofnewyork.us/City-Government/Neighborhood-Names-GIS/99bc-9p23): Geolocation data.
* [Toronto Most Common Venues](https://github.com/cdCarlos/coursera_capston/blob/master/toronto_common_venues.csv): Geolocation and most common venues data.
* [Foursquare API](https://developer.foursquare.com/): Venues data.

## Methodology & Analysis

### Import packages

Let's start by importing the required Python packages.

In [2]:
import pandas as pd
import numpy as np
import requests
import random

import folium
from geopy.geocoders import Nominatim

import json

# Matplotlib and associated plotting modules
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
%matplotlib inline

# import k-means for the clustering stage
from sklearn.cluster import KMeans

### Load Toronto Common Venues

As we are simulating a customer who lives in Toronto, we need to load our dataset for Toronto's neighborhoods.

This dataset has extra information that we do not need. In this particular case, we are interested only in the ten most common venues of the customer's neighbourhood, but we can keep the data as it is since the extra columns do not affect the final results.

In [3]:
# use the raw file URL: the GitHub "blob" page returns HTML, not CSV
toronto_common_venues_dataset_url = 'https://raw.githubusercontent.com/cdCarlos/coursera_capston/master/toronto_common_venues.csv'
#toronto_common_venues_dataset_url = 'toronto_common_venues.csv' # offline work

In [4]:
tdf = pd.read_csv(toronto_common_venues_dataset_url) # Toronto Common Venues DataFrame
tdf.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,6,Fast Food Restaurant,Dumpling Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant,College Rec Center
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,8,Bar,Yoga Studio,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Eastern European Restaurant,Dim Sum Restaurant
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,1,Rental Car Location,Mexican Restaurant,Electronics Store,Pizza Place,Breakfast Spot,Medical Center,Dim Sum Restaurant,Diner,Discount Store,Dog Run
3,M1G,Scarborough,Woburn,43.770992,-79.216917,1,Coffee Shop,Insurance Office,Convenience Store,Korean Restaurant,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,1,Athletics & Sports,Hakka Restaurant,Caribbean Restaurant,Bakery,Thai Restaurant,Bank,Fried Chicken Joint,Donut Shop,Dog Run,Doner Restaurant


### Load New York City Neighbourhoods

Let's load the NY dataset.

In [5]:
ny_url = 'https://data.cityofnewyork.us/api/views/xyye-rtrs/rows.csv'
#ny_url = 'NHoodNameCentroids.csv' # offline work

nydf = pd.read_csv(ny_url)
nydf.head()

Unnamed: 0,the_geom,OBJECTID,Name,Stacked,AnnoLine1,AnnoLine2,AnnoLine3,AnnoAngle,Borough
0,POINT (-73.8472005205491 40.89470517661004),1,Wakefield,1,Wakefield,,,0,Bronx
1,POINT (-73.82993910812405 40.87429419303015),2,Co-op City,2,Co-op,City,,0,Bronx
2,POINT (-73.82780644716419 40.88755567735082),3,Eastchester,1,Eastchester,,,0,Bronx
3,POINT (-73.90564259591689 40.895437426903875),4,Fieldston,1,Fieldston,,,0,Bronx
4,POINT (-73.91258546108577 40.89083449389134),5,Riverdale,1,Riverdale,,,0,Bronx


#### NY Pre-Processing

For NY, we have to perform some data pre-processing first, since we are getting the neighborhood information directly from NY's website.

##### Clean coordinates

In [6]:
lat, long = [], []
def split_coords(v):
    # 'POINT (lon lat)' -> ['lon', 'lat']; the point stores longitude first
    v = v[v.find('(') + 1:v.find(')')].split(" ")
    lat.append(v[1])
    long.append(v[0])
    return v

nydf['the_geom'] = nydf['the_geom'].apply(split_coords)

nydf_coords = pd.DataFrame(data=[lat, long]).T
nydf_coords.columns = ['Latitude', 'Longitude']
print(nydf_coords.head())

             Latitude           Longitude
0   40.89470517661004   -73.8472005205491
1   40.87429419303015  -73.82993910812405
2   40.88755567735082  -73.82780644716419
3  40.895437426903875  -73.90564259591689
4   40.89083449389134  -73.91258546108577


In [7]:
nydf['Latitude'] = nydf_coords['Latitude']
nydf['Longitude'] = nydf_coords['Longitude']
nydf.head()

Unnamed: 0,the_geom,OBJECTID,Name,Stacked,AnnoLine1,AnnoLine2,AnnoLine3,AnnoAngle,Borough,Latitude,Longitude
0,"[-73.8472005205491, 40.89470517661004]",1,Wakefield,1,Wakefield,,,0,Bronx,40.89470517661004,-73.8472005205491
1,"[-73.82993910812405, 40.87429419303015]",2,Co-op City,2,Co-op,City,,0,Bronx,40.87429419303015,-73.82993910812405
2,"[-73.82780644716419, 40.88755567735082]",3,Eastchester,1,Eastchester,,,0,Bronx,40.88755567735082,-73.82780644716419
3,"[-73.90564259591689, 40.895437426903875]",4,Fieldston,1,Fieldston,,,0,Bronx,40.89543742690388,-73.90564259591689
4,"[-73.91258546108577, 40.89083449389134]",5,Riverdale,1,Riverdale,,,0,Bronx,40.89083449389134,-73.91258546108577
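The coordinate clean-up above relies on two module-level lists that are filled as a side effect of `apply`. As a minimal sketch of an alternative, the same parsing can be written as a pure function that returns floats. The helper name `parse_point` and the regular expression are assumptions introduced here for illustration; they are not part of the original notebook.

```python
import re

# Matches strings such as "POINT (-73.84 40.89)"; longitude comes first.
POINT_RE = re.compile(r"POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)")


def parse_point(wkt):
    """Return (latitude, longitude) as floats from a 'POINT (lon lat)' string."""
    match = POINT_RE.search(wkt)
    if match is None:
        raise ValueError("not a POINT string: {!r}".format(wkt))
    lon, lat = (float(group) for group in match.groups())
    return lat, lon


# First row of the NY dataset shown above.
example_lat, example_lon = parse_point("POINT (-73.8472005205491 40.89470517661004)")
print(example_lat, example_lon)
```

A function like this could be applied to the original `the_geom` strings (before they are overwritten) to build the `Latitude` and `Longitude` columns directly as numbers rather than strings.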


##### Remove unnecessary columns

Let's clean up those columns that are not required to continue our analysis and development.

In [8]:
# keep only the columns needed for the analysis and use the same column name as the Toronto dataset
nydf = nydf[['Borough', 'Name', 'Latitude', 'Longitude']].rename(columns={'Name': 'Neighbourhood'})

nydf.to_csv('ny_coords.csv', index=False) # save point
nydf.head()

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude
0,Bronx,Wakefield,40.89470517661004,-73.8472005205491
1,Bronx,Co-op City,40.87429419303015,-73.82993910812405
2,Bronx,Eastchester,40.88755567735082,-73.82780644716419
3,Bronx,Fieldston,40.89543742690388,-73.90564259591689
4,Bronx,Riverdale,40.89083449389134,-73.91258546108577


In [9]:
#nydf = pd.read_csv('ny_coords.csv') # offline work
nydf.head()

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


### Preview NY Neighbourhoods

Fetch New York City coordinates for map reference.

In [10]:
address = 'New York City, US'

geolocator = Nominatim(user_agent="ny_explorer")
ny_coords = geolocator.geocode(address)
ny_lat = ny_coords.latitude
ny_long = ny_coords.longitude
print('New York City coordinates: {}, {}.'.format(ny_lat, ny_long))

New York City coordinates: 40.7308619, -73.9871558.


Generate the NY map with its neighbourhoods.

In [11]:
# create map of New York using latitude and longitude values
nmap = folium.Map(tiles='cartodbpositron', location=[ny_lat, ny_long], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(nydf['Latitude'], nydf['Longitude'], nydf['Borough'], nydf['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        fill_opacity=0.7).add_to(nmap)

nmap.save(outfile='map_ny_newighbourhoods.html')
nmap

![Previsualize NY Neighbourhoods](map_ny_newighbourhoods.png "Previsualize NY Neighbourhoods")

### Select Random Customer Location

As described in the project's introduction, we are going to simulate a customer who lives in Toronto and is willing to move to New York City. For this, we select a random neighbourhood in Toronto as the customer's current location.

> NOTE: We are interested in the ten most common venues only.

In [12]:
nmax = tdf.shape[0]
cidx = random.randrange(nmax) # customer index in toronto df; randrange excludes nmax, so the index is always valid
customer = tdf.iloc[cidx]
print('Customer Neighbourhood: ', customer)

Customer Neighbourhood:  PostalCode                                        M3K
Borough                                    North York
Neighbourhood             CFB Toronto, Downsview East
Latitude                                      43.7375
Longitude                                    -79.4648
Cluster Labels                                      2
1st Most Common Venue                        Bus Stop
2nd Most Common Venue                            Park
3rd Most Common Venue                         Airport
4th Most Common Venue                     Yoga Studio
5th Most Common Venue             Dumpling Restaurant
6th Most Common Venue                  Discount Store
7th Most Common Venue                         Dog Run
8th Most Common Venue                Doner Restaurant
9th Most Common Venue                      Donut Shop
10th Most Common Venue                      Drugstore
Name: 27, dtype: object
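Because the customer is drawn at random, every run of the notebook can produce a different neighbourhood. The following is a minimal sketch of a reproducible selection; the helper `pick_customer_index`, the seed value and the row count are assumptions made for illustration only.

```python
import random


def pick_customer_index(n_rows, seed=None):
    """Return a valid positional index in the range 0 .. n_rows - 1."""
    if n_rows <= 0:
        raise ValueError("the dataset is empty")
    rng = random.Random(seed)  # private generator; the global random state is untouched
    return rng.randrange(n_rows)  # the upper bound is excluded


# Hypothetical row count; in the notebook this would be tdf.shape[0].
example_index = pick_customer_index(100, seed=42)
print(example_index)
```

With a fixed seed the same customer is selected on every run, which makes the recommendation results easier to compare between sessions.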


### NY Venues

In [50]:
CLIENT_ID = '[YOUR_CLIENT_ID]' # your Foursquare ID
CLIENT_SECRET = '[YOUR_CLIENT_SECRET]' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)
print('API Version: ', VERSION)

Your credentials:
CLIENT_ID: [YOUR_CLIENT_ID]
CLIENT_SECRET:[YOUR_CLIENT_SECRET]
API Version:  20180605


This function creates the Foursquare URL for each NY neighborhood.

In [15]:
def get_venues_url(endpoint, lat=0, lon=0, limit=50, radius=500):
    # base URL of the Foursquare venues API
    BASE_HOST = 'https://api.foursquare.com/v2/venues'
    # authentication parameters appended to every request
    AUTH_PARAMS = '&client_id={}&client_secret={}&v={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION)

    return BASE_HOST + endpoint + '?ll={},{}&radius={}&limit={}'.format(lat, lon, radius, limit) + AUTH_PARAMS
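The URL above is assembled by string concatenation, which does not escape special characters in the parameter values. As a hedged alternative, the sketch below builds the same kind of URL with the standard library's `urlencode`. The function name `build_explore_url` and its explicit credential arguments are assumptions for illustration; the endpoint and parameter names are the ones used in this notebook.

```python
from urllib.parse import urlencode


def build_explore_url(lat, lon, client_id, client_secret, version, radius=500, limit=50):
    """Build an 'explore' request URL with URL-encoded query parameters."""
    base = 'https://api.foursquare.com/v2/venues/explore'
    params = {
        'll': '{},{}'.format(lat, lon),
        'radius': radius,
        'limit': limit,
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
    }
    # safe=',' keeps the comma in "lat,lon" readable; other special characters are escaped
    return base + '?' + urlencode(params, safe=',')


example_url = build_explore_url(40.5, -73.5, 'ID', 'SECRET', '20180605')
print(example_url)
```

Passing the credentials as arguments instead of reading module-level constants also makes the function easier to test in isolation.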

This function helps us parse the response and get the venue's category.

In [16]:
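def get_category_type(row):
    # the category list is stored under a different key depending on the response shape
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # venues without any category have no type
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']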

With this function we retrieve the venues for each NY neighborhood from the Foursquare API.

In [17]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Fetch up to 100 venues around each neighbourhood and return them as one DataFrame."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        try:
            # create the API request URL
            url = get_venues_url('/explore', lat=lat, lon=lng, radius=radius, limit=100)

            # make the GET request
            results = requests.get(url).json()['response']['groups'][0]['items']
            print('\tTotal Venues: ', len(results))

            # keep only the relevant information for each nearby venue
            venues_list.append([(
                name,
                lat,
                lng,
                v['venue']['name'],
                v['venue']['location']['lat'],
                v['venue']['location']['lng'],
                v['venue']['categories'][0]['name']) for v in results])
        except Exception as e:
            # a failed request or an unexpected response skips the whole neighbourhood
            print('\tFAILED: ', e)

    # flatten the per-neighbourhood lists into a single table
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = [
        'Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude',
        'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category'
    ]

    return nearby_venues
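In `getNearbyVenues`, the category is read with `v['venue']['categories'][0]['name']`. If a single venue comes back without categories, that lookup raises an `IndexError` and the `except` branch discards every venue of that neighbourhood. The sketch below shows one possible way to extract a row per venue without failing on an empty category list. The helper `extract_venue` and the sample items are hypothetical; the items only mirror the fields accessed in the function above.

```python
def extract_venue(item):
    """Return (name, lat, lng, category) for one item; category is None if missing."""
    venue = item['venue']
    categories = venue.get('categories') or []
    category = categories[0]['name'] if categories else None
    location = venue['location']
    return venue['name'], location['lat'], location['lng'], category


# Hypothetical items shaped like the fields used in getNearbyVenues.
sample_items = [
    {'venue': {'name': 'Corner Deli',
               'location': {'lat': 40.89, 'lng': -73.85},
               'categories': [{'name': 'Deli / Bodega'}]}},
    {'venue': {'name': 'Unlabelled Spot',
               'location': {'lat': 40.88, 'lng': -73.84},
               'categories': []}},
]

rows = [extract_venue(item) for item in sample_items]
for row in rows:
    print(row)
```

Rows whose category is `None` could then be dropped or kept explicitly, instead of losing the whole neighbourhood.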

#### Fetch Venues from Foursquare API

In [18]:
nyv = getNearbyVenues(
    names=nydf['Neighbourhood'],
    latitudes=nydf['Latitude'],
    longitudes=nydf['Longitude']
)

In [19]:
nyv.to_csv('ny_venues.csv', index=False) # save point
nyv.head()

In [20]:
# offline work
#nyv = pd.read_csv('ny_venues.csv')
nyv.shape

(9987, 7)

### NY Clustering

#### OneHot Encoding

Since the algorithm we are going to use for clustering neighborhoods (K-Means) does not work with categorical data, we first have to encode the categorical data as numeric values.

In [21]:
# one hot encoding
nyv_onehot = pd.get_dummies(nyv[['Venue Category']], prefix="", prefix_sep="")

# delete the venue category named 'Neighborhood' (if present) to avoid confusion with the neighbourhood name
nyv_onehot.drop(columns=['Neighborhood'], errors='ignore', inplace=True)

# add neighbourhood column back to dataframe
nyv_onehot['Neighbourhood'] = nyv['Neighborhood']

# move neighbourhood column to the first column
fixed_columns = [nyv_onehot.columns[-1]] + list(nyv_onehot.columns[:-1])
nyv_onehot = nyv_onehot[fixed_columns]

# for offline work
nyv_onehot.to_csv('nyv_onehot.csv', index=False) # save point
print(nyv_onehot.shape)
nyv_onehot.head()

(9987, 421)


Unnamed: 0,Neighbourhood,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport Terminal,Airport Tram,American Restaurant,Animal Shelter,...,Waste Facility,Watch Shop,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Wakefield,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Wakefield,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Wakefield,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Wakefield,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Wakefield,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
nyv_grouped = nyv_onehot.groupby('Neighbourhood').mean().reset_index()
print(nyv_grouped.shape)
nyv_grouped.head()

(293, 421)


Unnamed: 0,Neighbourhood,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport Terminal,Airport Tram,American Restaurant,Animal Shelter,...,Waste Facility,Watch Shop,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Annadale,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Arden Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Arlington,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Arrochar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Arverne,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
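The grouped table above holds, for each neighbourhood, the mean of every one-hot column. Since each venue contributes a 1 to exactly one category column, that mean is the share of the neighbourhood's venues that fall in the category. The plain-Python sketch below illustrates this on a few made-up venues; the data is hypothetical and pandas is deliberately left out to keep the arithmetic visible.

```python
from collections import Counter, defaultdict

# Hypothetical (neighbourhood, venue category) pairs, one per venue.
venues = [
    ('Wakefield', 'Pharmacy'),
    ('Wakefield', 'Pharmacy'),
    ('Wakefield', 'Donut Shop'),
    ('Wakefield', 'Laundromat'),
    ('Riverdale', 'Park'),
    ('Riverdale', 'Gym'),
]

# Count venues per category within each neighbourhood.
counts = defaultdict(Counter)
for neighbourhood, category in venues:
    counts[neighbourhood][category] += 1

# count / total is the same number as the mean of that category's one-hot column.
frequencies = {
    neighbourhood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
    for neighbourhood, counter in counts.items()
}

print(frequencies['Wakefield'])
```

In this toy data, half of Wakefield's venues are pharmacies, so its `Pharmacy` value is 0.5.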


#### NY Ten Most Common Venues

Now that the categorical data is encoded as numeric values, we can extract the ten most common venues for each neighborhood in NY.

Let's see what they look like.

In [24]:
n_top_venues = 10

def return_most_common_venues(row, num_top_venues=10):
    row_categories = row.iloc[1:] # skip the first entry (Neighbourhood name)
    row_categories_sorted = row_categories.sort_values(ascending=False)

    return row_categories_sorted.index.values[0:num_top_venues]


indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(n_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = nyv_grouped['Neighbourhood']

# fill each row with that neighbourhood's most common venue categories
for ind in np.arange(nyv_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(nyv_grouped.iloc[ind, :], n_top_venues)

print(neighborhoods_venues_sorted.shape)
neighborhoods_venues_sorted.head()

(293, 11)


Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Annadale,Sports Bar,Restaurant,Dance Studio,Pizza Place,Cosmetics Shop,Liquor Store,Diner,Train Station,Yoga Studio,Fast Food Restaurant
1,Arden Heights,Pharmacy,Deli / Bodega,Pizza Place,Coffee Shop,Bus Stop,Fish & Chips Shop,Eye Doctor,Factory,Falafel Restaurant,Farm
2,Arlington,Bus Stop,Deli / Bodega,Food Service,Grocery Store,Fish Market,Factory,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant
3,Arrochar,Deli / Bodega,Bus Stop,Italian Restaurant,Mediterranean Restaurant,Middle Eastern Restaurant,Sandwich Place,Liquor Store,Bagel Shop,Hotel,Pizza Place
4,Arverne,Surf Spot,Metro Station,Pizza Place,Donut Shop,Sandwich Place,Beach,Playground,Thai Restaurant,Bed & Breakfast,Board Shop
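Some rows in the table above end in alphabetical runs such as `Factory, Falafel Restaurant, Farm`. This may indicate that the neighbourhood has fewer than ten distinct categories and that the remaining slots were filled with zero-frequency categories. As a sketch of one way to avoid this, the hypothetical helper below returns only categories that actually occur; it is an illustration under that assumption, not the notebook's method.

```python
def top_venues(frequencies, n=10):
    """Return up to n categories with non-zero frequency, most frequent first."""
    present = [(cat, freq) for cat, freq in frequencies.items() if freq > 0]
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    present.sort(key=lambda pair: (-pair[1], pair[0]))
    return [cat for cat, _ in present[:n]]


# Hypothetical frequency row for one neighbourhood.
example_row = {'Yoga Studio': 0.5, 'Park': 0.5, 'Factory': 0.0, 'Farm': 0.0}
print(top_venues(example_row, n=3))  # ['Park', 'Yoga Studio']
```

The result can be shorter than `n`, so any code that compares rank by rank has to handle neighbourhoods with fewer entries.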


#### K-Means Clustering

K-Means is one of the simplest and fastest algorithms for unsupervised clustering, and it scales well to large datasets.

In [25]:
# set number of clusters
kclusters = 10

In [26]:
nyv_grouped_clustering = nyv_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(nyv_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 2, 2, 2, 9, 9, 9, 9, 9, 9])
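To make the clustering step less of a black box, here is a minimal pure-Python sketch of the basic K-Means iteration (Lloyd's algorithm): assign every point to its nearest centroid, then move each centroid to the mean of its points. The function, the two-dimensional toy vectors and the starting centroids are assumptions for illustration. scikit-learn's `KMeans`, used above, additionally applies k-means++ initialization by default.

```python
def kmeans(points, centroids, iterations=10):
    """Basic Lloyd's algorithm on lists of numbers; returns (labels, centroids)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid by squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
            for point in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Hypothetical frequency vectors: [share of Park, share of Pizza Place].
toy_points = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
toy_labels, toy_centroids = kmeans(toy_points, centroids=[[1.0, 0.0], [0.0, 1.0]])
print(toy_labels)  # [0, 0, 1, 1]
```

The two park-heavy vectors end up in one cluster and the two pizza-heavy vectors in the other, which is the same kind of grouping the notebook performs on the full venue-frequency table.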

In [27]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

ny_merged = nydf.copy()

# join the cluster labels and top venues onto nydf, which holds latitude/longitude for each neighbourhood
ny_merged = ny_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

# drop neighbourhoods with NO venues
ny_merged.dropna(inplace=True)

ny_merged['Cluster Labels'] = ny_merged['Cluster Labels'].map(int)

ny_merged.head() # check the last columns!

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bronx,Wakefield,40.894705,-73.847201,9,Sandwich Place,Dessert Shop,Laundromat,Pharmacy,Ice Cream Shop,Food Truck,Donut Shop,Factory,Falafel Restaurant,Farm
1,Bronx,Co-op City,40.874294,-73.829939,9,Park,Mattress Store,Discount Store,Chinese Restaurant,Pharmacy,Grocery Store,Pizza Place,Gift Shop,Basketball Court,Baseball Field
2,Bronx,Eastchester,40.887556,-73.827806,3,Caribbean Restaurant,Bus Station,Diner,Metro Station,Deli / Bodega,Donut Shop,Bowling Alley,Bus Stop,Fast Food Restaurant,Chinese Restaurant
3,Bronx,Fieldston,40.895437,-73.905643,9,Playground,Plaza,Yoga Studio,Film Studio,Exhibit,Eye Doctor,Factory,Falafel Restaurant,Farm,Farmers Market
4,Bronx,Riverdale,40.890834,-73.912585,9,Park,Home Service,Food Truck,Gym,Bus Station,Playground,Plaza,Bank,Yoga Studio,Eye Doctor


In [None]:
ny_merged.to_csv('ny_clusters.csv', index=False) #save point

In [28]:
#ny_merged = pd.read_csv('ny_clusters.csv') # offline work

Let's see what our NY clusters look like.

In [29]:
# create map
nmap_clusters = folium.Map(tiles='cartodbpositron', location=[ny_lat, ny_long], zoom_start=10) # lat, lon correspond to New York City

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(ny_merged['Latitude'], ny_merged['Longitude'], ny_merged['Neighbourhood'], ny_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' (Cluster ' + str(cluster) + ')', parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1], # cluster 0 wraps around to the last color
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(nmap_clusters)

nmap_clusters.save(outfile='map_ny_clusters.html')
nmap_clusters

![NY Neighbourhoods Clusters](map_ny_clusters.png "NY Neighbourhoods Clusters")

#### Examine Clusters

Now we can have a look at the venues for each cluster in NY.

In [30]:
print('TOTAL CLUSTERS: ', kclusters)
print('Classes: ', ny_merged['Cluster Labels'].unique())

TOTAL CLUSTERS:  10
Classes:  [9 3 0 2 5 1 6 4 8 7]


In [31]:
cols = ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]
cols

Index(['Neighbourhood', '1st Most Common Venue', '2nd Most Common Venue',
       '3rd Most Common Venue', '4th Most Common Venue',
       '5th Most Common Venue', '6th Most Common Venue',
       '7th Most Common Venue', '8th Most Common Venue',
       '9th Most Common Venue', '10th Most Common Venue'],
      dtype='object')

##### Cluster 0

In [32]:
ny_merged.loc[ny_merged['Cluster Labels'] == 0, cols]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,Kingsbridge,Pizza Place,Bar,Sandwich Place,Discount Store,Supermarket,Latin American Restaurant,Mexican Restaurant,Bakery,Pharmacy,Spanish Restaurant
8,Norwood,Park,Pizza Place,Deli / Bodega,Chinese Restaurant,Pharmacy,Bank,Fast Food Restaurant,American Restaurant,Spanish Restaurant,Bus Stop
11,Pelham Parkway,Pizza Place,Deli / Bodega,Italian Restaurant,Plaza,Chinese Restaurant,Performing Arts Venue,Metro Station,Mexican Restaurant,Eye Doctor,Smoke Shop
13,Bedford Park,Pizza Place,Deli / Bodega,Diner,Chinese Restaurant,Sandwich Place,Mexican Restaurant,Pharmacy,Supermarket,Bus Station,Spanish Restaurant
14,University Heights,Chinese Restaurant,Pizza Place,Fast Food Restaurant,Shoe Store,Food,Bank,Optical Shop,Seafood Restaurant,Bakery,Sandwich Place
15,Morris Heights,Pharmacy,Grocery Store,Food Truck,Latin American Restaurant,Bank,Burrito Place,Pizza Place,Spanish Restaurant,IT Services,Farmers Market
16,Fordham,Mobile Phone Shop,Shoe Store,Spanish Restaurant,Pizza Place,Donut Shop,Gym / Fitness Center,Fast Food Restaurant,Supplement Shop,Clothing Store,Chinese Restaurant
17,East Tremont,Pizza Place,Shoe Store,Supermarket,Mobile Phone Shop,Donut Shop,Mexican Restaurant,Spanish Restaurant,Lounge,Discount Store,Paella Restaurant
18,West Farms,Bus Station,Supermarket,Latin American Restaurant,Diner,Convenience Store,Donut Shop,Playground,Pizza Place,Sandwich Place,Coffee Shop
19,High Bridge,Pharmacy,Pizza Place,Discount Store,Sandwich Place,Chinese Restaurant,Spanish Restaurant,Gym,Grocery Store,Food,Bus Station


##### Cluster 1

In [33]:
ny_merged.loc[ny_merged['Cluster Labels'] == 1, cols]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
194,Somerville,Park,Yoga Studio,Frozen Yogurt Shop,Exhibit,Eye Doctor,Factory,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant
205,Todt Hill,Park,Yoga Studio,Frozen Yogurt Shop,Exhibit,Eye Doctor,Factory,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant


##### Cluster 2

In [34]:
ny_merged.loc[ny_merged['Cluster Labels'] == 2, cols]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
77,Manhattan Beach,Café,Bus Stop,Beach,Food,Ice Cream Shop,Playground,Sandwich Place,Yoga Studio,Factory,Falafel Restaurant
200,New Brighton,Bus Stop,Park,Deli / Bodega,Chinese Restaurant,Playground,Discount Store,Convenience Store,Bowling Alley,Farmers Market,Fish Market
204,Grymes Hill,Dog Run,Basketball Court,Moving Target,Bus Stop,Flea Market,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field
206,South Beach,Pier,Deli / Bodega,Beach,Athletics & Sports,Bus Stop,Yoga Studio,Fish Market,Falafel Restaurant,Farm,Farmers Market
208,Mariner's Harbor,Italian Restaurant,Deli / Bodega,Bus Stop,French Restaurant,Fish Market,Factory,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant
218,Woodrow,Bus Stop,Gym / Fitness Center,Racetrack,Fish Market,Eye Doctor,Factory,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant
219,Tottenville,Italian Restaurant,Bus Stop,Deli / Bodega,Thrift / Vintage Store,Cosmetics Shop,Mexican Restaurant,Home Service,Fish Market,Falafel Restaurant,Farm
226,Park Hill,Athletics & Sports,Bus Stop,Hotel,Coffee Shop,Gym / Fitness Center,Fish & Chips Shop,Factory,Falafel Restaurant,Farm,Farmers Market
229,Arlington,Bus Stop,Deli / Bodega,Food Service,Grocery Store,Fish Market,Factory,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant
230,Arrochar,Deli / Bodega,Bus Stop,Italian Restaurant,Mediterranean Restaurant,Middle Eastern Restaurant,Sandwich Place,Liquor Store,Bagel Shop,Hotel,Pizza Place


##### Cluster 3

In [35]:
ny_merged.loc[ny_merged['Cluster Labels'] == 3, cols]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Eastchester,Caribbean Restaurant,Bus Station,Diner,Metro Station,Deli / Bodega,Donut Shop,Bowling Alley,Bus Stop,Fast Food Restaurant,Chinese Restaurant
23,Longwood,Sandwich Place,Diner,Metro Station,Grocery Store,Donut Shop,Latin American Restaurant,Bus Station,Chinese Restaurant,Deli / Bodega,Food & Drink Shop
26,Soundview,Chinese Restaurant,Grocery Store,Bus Station,Fried Chicken Joint,Pharmacy,Basketball Court,Video Store,Bus Stop,Breakfast Spot,Latin American Restaurant
29,Country Club,Sandwich Place,Chinese Restaurant,Fried Chicken Joint,Weight Loss Center,Playground,Comic Shop,Film Studio,Exhibit,Eye Doctor,Factory
41,Olinville,Caribbean Restaurant,Furniture / Home Store,Food,Supermarket,Basketball Court,Fast Food Restaurant,Laundromat,Chinese Restaurant,Fried Chicken Joint,Deli / Bodega
45,Edenwald,Chinese Restaurant,Grocery Store,Food,Bus Station,Fish Market,Supermarket,Deli / Bodega,Design Studio,Flea Market,Farm
56,East Flatbush,Chinese Restaurant,Caribbean Restaurant,Check Cashing Service,Park,Moving Target,Print Shop,Fast Food Restaurant,Food,Pharmacy,Hardware Store
71,Cypress Hills,Latin American Restaurant,Deli / Bodega,Chinese Restaurant,Donut Shop,Fried Chicken Joint,Discount Store,Food,South American Restaurant,Caribbean Restaurant,Pizza Place
74,Canarsie,Caribbean Restaurant,Gym,Asian Restaurant,Chinese Restaurant,Yoga Studio,Fish & Chips Shop,Factory,Falafel Restaurant,Farm,Farmers Market
89,Ocean Hill,Deli / Bodega,Convenience Store,Playground,Southern / Soul Food Restaurant,Bus Stop,Fried Chicken Joint,Intersection,Market,Donut Shop,Salad Place


##### Cluster 4

In [36]:
ny_merged.loc[ny_merged['Cluster Labels'] == 4, cols]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
209,Port Ivory,Bar,Yoga Studio,Exhibit,Factory,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant
214,Oakwood,Bar,Yoga Studio,Exhibit,Factory,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant


##### Cluster 5

In [37]:
ny_merged.loc[ny_merged['Cluster Labels'] == 5, cols]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
181,Neponsit,Beach,Bar,Yoga Studio,Fish Market,Factory,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field


##### Cluster 6

In [38]:
ny_merged.loc[ny_merged['Cluster Labels'] == 6, cols]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
195,Brookville,Deli / Bodega,Home Service,Yoga Studio,Fish Market,Eye Doctor,Factory,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant


##### Cluster 7

In [39]:
ny_merged.loc[ny_merged['Cluster Labels'] == 7, cols]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
258,Howland Hook,Pier,Event Service,Event Space,Exhibit,Eye Doctor,Factory,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant


##### Cluster 8

In [40]:
ny_merged.loc[ny_merged['Cluster Labels'] == 8, cols]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
227,Westerleigh,Convenience Store,Arcade,Yoga Studio,Fish Market,Eye Doctor,Factory,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant


##### Cluster 9

In [41]:
ny_merged.loc[ny_merged['Cluster Labels'] == 9, cols]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Wakefield,Sandwich Place,Dessert Shop,Laundromat,Pharmacy,Ice Cream Shop,Food Truck,Donut Shop,Factory,Falafel Restaurant,Farm
1,Co-op City,Park,Mattress Store,Discount Store,Chinese Restaurant,Pharmacy,Grocery Store,Pizza Place,Gift Shop,Basketball Court,Baseball Field
3,Fieldston,Playground,Plaza,Yoga Studio,Film Studio,Exhibit,Eye Doctor,Factory,Falafel Restaurant,Farm,Farmers Market
4,Riverdale,Park,Home Service,Food Truck,Gym,Bus Station,Playground,Plaza,Bank,Yoga Studio,Eye Doctor
6,Marble Hill,Coffee Shop,Discount Store,Yoga Studio,Kids Store,Steakhouse,Supplement Shop,Tennis Stadium,Shopping Mall,Shoe Store,Gym
7,Woodlawn,Playground,Deli / Bodega,Pizza Place,Food & Drink Shop,Supermarket,Train Station,Pub,Cosmetics Shop,Convenience Store,Donut Shop
9,Williamsbridge,Caribbean Restaurant,Nightclub,Bar,Convenience Store,Soup Place,Fish Market,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant
10,Baychester,Electronics Store,Supermarket,Bank,Discount Store,Donut Shop,Pizza Place,Sandwich Place,Pet Store,Fast Food Restaurant,Spanish Restaurant
12,City Island,Harbor / Marina,Park,Thrift / Vintage Store,Seafood Restaurant,Spanish Restaurant,Bank,Smoke Shop,Liquor Store,Pharmacy,Pizza Place
22,Port Morris,Latin American Restaurant,Cupcake Shop,Distillery,Brewery,Spanish Restaurant,Flower Shop,Donut Shop,Peruvian Restaurant,Food Truck,Music Venue


### Neighbourhood Recommendation

Now it's time to recommend neighborhoods to the customer. First, let's recall who our customer is:

In [42]:
customer

PostalCode                                        M3K
Borough                                    North York
Neighbourhood             CFB Toronto, Downsview East
Latitude                                      43.7375
Longitude                                    -79.4648
Cluster Labels                                      2
1st Most Common Venue                        Bus Stop
2nd Most Common Venue                            Park
3rd Most Common Venue                         Airport
4th Most Common Venue                     Yoga Studio
5th Most Common Venue             Dumpling Restaurant
6th Most Common Venue                  Discount Store
7th Most Common Venue                         Dog Run
8th Most Common Venue                Doner Restaurant
9th Most Common Venue                      Donut Shop
10th Most Common Venue                      Drugstore
Name: 27, dtype: object

Here we are going to use the term **Similarity Level (SL)**.

For both Toronto and NY we have the ten most common venues of each neighborhood. These venues are ranked from 1 to 10, where 1 is the most representative venue of the neighborhood, 2 the second most representative, and so on. The SL is the number of consecutive ranks, starting from the 1st, at which two neighborhoods have the same venue.

In our case we recommend a NY neighborhood based on the Similarity Level it has with the customer's neighborhood. For instance, a recommended NY neighborhood with an SL equal to 3 has a perfect match with the customer's 1st, 2nd and 3rd most common venues.
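
To make the definition concrete, here is a minimal sketch of the SL as the length of the matching prefix of two ranked venue lists. The function name `similarity_level` and the two candidate lists are hypothetical and exist only for this illustration; the customer's venues are the first three from the record shown earlier.

```python
def similarity_level(customer_venues, candidate_venues):
    """Count how many leading ranks (1st, 2nd, ...) hold the same venue."""
    level = 0
    for own, other in zip(customer_venues, candidate_venues):
        if own != other:
            break  # the first mismatch ends the run of matching ranks
        level += 1
    return level

# First three venues of the customer's neighbourhood
customer_top = ['Bus Stop', 'Park', 'Airport']
# Hypothetical candidates, invented for this example
candidate_a = ['Bus Stop', 'Park', 'Pizza Place']
candidate_b = ['Park', 'Bus Stop', 'Airport']

print(similarity_level(customer_top, candidate_a))  # 2
print(similarity_level(customer_top, candidate_b))  # 0
```

Note that `candidate_b` shares all three venues with the customer but in a different order, so under this definition its SL is 0.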

In [43]:
# Columns from position 5 onward (the ranked venue columns)
cols = ny_merged.columns[5:]

def recommended_cluster(idf, tdf, cols):
    """Filter tdf rank by rank, keeping rows whose venue matches idf's."""
    most_common_level = None
    clusters = None
    similarity_levels = {}

    for level, c in enumerate(cols, start=1):
        print('-------- Similarity Level:', c, '--------')
        # Keep only the neighbourhoods that still match at this rank
        tdf = tdf[(tdf[c] == idf[c])]

        if tdf.shape[0] == 0:
            print('\n\n<<<<<<<<<< STOP >>>>>>>>>>')
            print('Maximum Level of Similarity: ', most_common_level)
            print('Clusters: ', clusters)
            break

        # Key by rank position; c[0] would map '10th' to '1'
        most_common_level = str(level)
        clusters = tdf['Cluster Labels'].unique()
        print(tdf[['Neighbourhood', 'Cluster Labels', c]])
        print('Clusters: ', clusters)
        print('Indexes: ', tdf['idx'])
        similarity_levels[most_common_level] = {
            'clusters': np.array(clusters),
            'indexes': np.array(tdf['idx'])
        }
        print('\n')
    return similarity_levels

# Positional index, so matches can be looked up by row position later
ny_tmp = ny_merged.copy()
ny_tmp['idx'] = range(ny_tmp.shape[0])
sl = recommended_cluster(customer, ny_tmp, cols)

-------- Similarity Level: 1st Most Common Venue --------
     Neighbourhood  Cluster Labels 1st Most Common Venue
73   Starrett City               0              Bus Stop
200   New Brighton               2              Bus Stop
218        Woodrow               2              Bus Stop
229      Arlington               2              Bus Stop
231       Grasmere               2              Bus Stop
248     Bulls Head               0              Bus Stop
259       Elm Park               2              Bus Stop
285    Willowbrook               2              Bus Stop
Clusters:  [0 2]
Indexes:  73      73
200    200
218    218
229    229
231    231
248    248
259    259
285    285
Name: idx, dtype: int32


-------- Similarity Level: 2nd Most Common Venue --------
    Neighbourhood  Cluster Labels 2nd Most Common Venue
200  New Brighton               2                  Park
Clusters:  [2]
Indexes:  200    200
Name: idx, dtype: int32


-------- Similarity Level: 3rd Most Common Venue -------

Here we can see how many neighborhoods are recommended at each SL, along with the clusters they belong to.

In [44]:
print('Similarity Levels:')
for level in sl:
    print('\tlevel:', level)
    print('\tclusters:', sl[level]['clusters'])
    print('\tindexes:', sl[level]['indexes'])
    print()

Similarity Levels:
	level: 1
	clusters: [0 2]
	indexes: [ 73 200 218 229 231 248 259 285]

	level: 2
	clusters: [2]
	indexes: [200]
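
Since a higher SL means a closer match, the strongest recommendation is the set of indexes stored under the highest level. The sketch below shows one way to extract it; `best_matches` is a hypothetical helper, and `sl_example` mirrors the structure of `sl` printed above with plain lists standing in for the NumPy arrays.

```python
def best_matches(levels):
    """Return (highest level, indexes at that level), or (None, []) if empty."""
    if not levels:
        return None, []
    # Compare keys numerically so that '10' ranks above '2'
    top = max(levels, key=int)
    return top, list(levels[top]['indexes'])

# Same shape as `sl`, using the values shown in the output above
sl_example = {
    '1': {'clusters': [0, 2], 'indexes': [73, 200, 218, 229, 231, 248, 259, 285]},
    '2': {'clusters': [2], 'indexes': [200]},
}

level, indexes = best_matches(sl_example)
print(level, indexes)  # 2 [200]
```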



Let's generate the final map with the recommended neighborhoods for our customer based on the Similarity Level we defined.

In [45]:
# One color per similarity level, spread across the rainbow colormap
colors_list = cm.rainbow(np.linspace(0, 1, len(sl)))
colors_list = [colors.rgb2hex(i) for i in colors_list]

index_color = {}

# Visit levels in ascending order so a higher SL overwrites a lower one
for i, level in enumerate(sorted(sl.keys(), key=int)):
    for idx in sl[level]['indexes']:
        index_color[idx] = {
            'color': colors_list[i],
            'sl': level
        }

In [46]:
# create map
rmap = folium.Map(tiles='cartodbpositron', location=[ny_lat, ny_long], zoom_start=10) # map centered on New York City


# add markers to the map
clusters_colors = {}
for dfidx in range(len(ny_merged.index)):
    row = ny_merged.iloc[dfidx]
    lat, lon, poi, cluster = row['Latitude'], row['Longitude'], row['Neighbourhood'], row['Cluster Labels']
    label = str(poi)

    color = '#777777'
    if dfidx in index_color:
        color = index_color[dfidx]['color']
        label += ' (SL = ' + str(index_color[dfidx]['sl']) + ')'
        
        rgb = np.array(colors.to_rgb(index_color[dfidx]['color']))
        cv = rgb.sum() / 2.5
        for i, c in enumerate(rgb):
            if c + cv <= 1:
                rgb[i] += cv
        clusters_colors[cluster] = colors.to_hex(rgb)
    else:
        if cluster in clusters_colors:
            color = clusters_colors[cluster]
    
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup = folium.Popup(label, parse_html=True),
        color = color,
        stroke = 0,
        fill = True,
        fill_color = color,
        fill_opacity = 1).add_to(rmap)

rmap.save(outfile='map_ny_recommended_neighbourhoods.html')
rmap

![Recommended Neighbourhoods](map_ny_recommended_neighbourhoods.png "Recommended Neighbourhoods")

## Results & Discussion

In the map above we can see the recommended NY neighborhoods, differentiated by color.

The intense red point represents the neighborhood with the highest SL relative to the customer's current neighborhood.
The other three light red points are neighborhoods in the same cluster as the intense red point (which makes them good candidates for recommendation).

The same applies to the purple points. The seven intense purple points are the neighborhoods with an SL equal to 1, and the light purple points are those that belong to the same clusters.

Gray points are neighborhoods with zero SL (not recommended).
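
The three tiers described here (intense, light, gray) can be expressed as a small classification rule. The following is only a sketch under stated assumptions: `marker_tier` is a hypothetical helper, and rows 5 and 300 are invented for illustration, while indexes 73 and 200 and their clusters come from the recommendation output. Collecting the matched clusters in a first pass means the tier of a row does not depend on the order in which rows are visited.

```python
def marker_tier(idx, cluster, matched_indexes, matched_clusters):
    """Classify a neighbourhood as 'intense', 'light' or 'gray'."""
    if idx in matched_indexes:
        return 'intense'  # direct match at some SL
    if cluster in matched_clusters:
        return 'light'    # shares a cluster with a direct match
    return 'gray'         # zero SL and no matched cluster

# (row index, cluster label) pairs; rows 5 and 300 are hypothetical
rows = [(5, 2), (73, 0), (200, 2), (300, 7)]
matched_indexes = {73, 200}

# First pass: clusters that contain at least one direct match
matched_clusters = {cluster for idx, cluster in rows if idx in matched_indexes}

# Second pass: assign a tier to every row
tiers = {idx: marker_tier(idx, cluster, matched_indexes, matched_clusters)
         for idx, cluster in rows}
print(tiers)
```

Row 5 appears before row 200 but is still classified as `light`, because the matched clusters are known before any row is classified.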

Now we can recommend neighborhoods to the customer more easily and quickly, reducing the operational cost for the company.

## Conclusion

By using unsupervised clustering algorithms we were able to recommend not only specific neighborhoods but also whole clusters to our imaginary customer.

At this point we have completed a proof of concept for a business problem: reducing the operational cost and response time of recommending neighborhoods, based on their similarity to the neighborhood where the customer currently lives.

A next step would be to optimize the model by choosing a proper number of clusters based on metrics such as F1-score, V-Measure and so on.
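
As a rough illustration of one of these metrics, the sketch below computes the V-Measure (the harmonic mean of homogeneity and completeness) in plain Python. V-Measure compares a clustering against a reference labelling, which this project does not currently have, so the `reference` and `clusters_*` lists are toy values assumed purely for the example; the function names are also hypothetical.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy of a label sequence (natural log)."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def conditional_entropy(target, given):
    """H(target | given), computed from joint and marginal label counts."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (_, g), c in joint.items())

def v_measure(reference, predicted):
    """Harmonic mean of homogeneity and completeness, in [0, 1]."""
    h_ref, h_pred = entropy(reference), entropy(predicted)
    homogeneity = 1.0 if h_ref == 0 else 1 - conditional_entropy(reference, predicted) / h_ref
    completeness = 1.0 if h_pred == 0 else 1 - conditional_entropy(predicted, reference) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

# Toy labels, assumed only for this example
reference = [0, 0, 1, 1]
clusters_good = ['b', 'b', 'a', 'a']  # same grouping, different names
clusters_bad = [0, 0, 0, 0]           # everything in one cluster

print(v_measure(reference, clusters_good))  # 1.0
print(v_measure(reference, clusters_bad))
```

With a reference labelling in hand, the score could be computed for several candidate cluster counts and the best-scoring one kept.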