# Capstone Project: search for a location

### 1.1 A description of the problem and a discussion of the background

We want to open one coffee shop in each of four cities: Toronto (Canada), Chicago (USA), Houston (USA), and New York City (USA). All of these cities are within the 2 to 3 million population range except NYC, which is above 8 million. We want to find location options for one coffee shop in each city, ideally in an established coffee-lovers' area with strong foot traffic. Our approach isn't the optimal solution; however, it is a useful way to get familiar with potential areas for the business. The resulting data can also serve, alongside other sources, as data-driven support for potential investors. If our data does not line up with an expert's opinion on where to open the new coffee shop, we can use it to question the real estate expert's recommendation. It is also useful to learn what data the expert relies on that makes ours look less relevant.

Overall, the data from our project can be a starting point for finding a local real estate expert in the recommended areas.

Many companies now practice a remote "office", where employees stay and work from home. Because demand for coffee has decreased at regular office locations, we can assume that this demand has been redistributed to residential areas, near the people working from home. Our model is supposed to spot this "new normal". If it does, that would be a good checkpoint for our model.

### 1.2 Description of the data and how it will be used to solve the problem

We would like to use each city's postal codes to scan for the most trending venues. Postal codes split each city into parts, which makes them a good base for mapping and for scanning for a potential target: a good spot for a coffee shop.

The postal codes can be obtained over the Internet. The geolocation of those ZIP codes can be obtained with the Geocoder Python module. The ZIP code coordinates can then be used to obtain trending venues from the Foursquare API. All the trending venues will be clustered with the DBSCAN clustering algorithm. We will look for clusters with trending Coffee Shop venues. Those clusters will be mapped with the Python Folium module.
	
> Python modules:
- numpy
- pandas
- folium
- geocoder
- foursquare
- sklearn
- matplotlib
	

	
> Postal Codes Sources:
- Toronto Postal Codes: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
- Chicago ZIP Codes: https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Chicago-Zip-Code-and-Neighborhood-Map/mapn-ahfc
- Houston ZIP Codes: https://web.har.com/zipcode
- New York City: https://www.nycbynatives.com/nyc_info/new_york_city_zip_codes.php


## 2. Segmenting and Clustering Neighborhoods

> In this project, I will go over a detailed explanation of every step for the city of Toronto. For Chicago, Houston and NYC all the steps explained in the Toronto case are absolutely identical. For more compact code I will remove all the narration lines from all the cities except Toronto.

> All the needed tables with postal codes were preprocessed and saved into "CSV" files. The original sources of all the needed postal codes are listed in section one.

> Most variable names carry a suffix indicating which city they belong to:
- _t -- Toronto
- _h -- Houston
- _c -- Chicago
- _n -- NYC

In [18]:
# my python version is 3.8.6, 64-bit on Windows 10 Home
import numpy as np
import pandas as pd

> Loading postal codes for Toronto, Chicago, Houston, and NYC.

In [19]:
df_raw_t = pd.read_csv('https://www.dropbox.com/s/a1dawm3tgt1sgzv/toronto_zip.csv?dl=1')    # Toronto
df_raw_c = pd.read_csv('https://www.dropbox.com/s/ssq2szny9bldl4k/chicago_zip.csv?dl=1')    # Chicago
df_raw_h = pd.read_csv('https://www.dropbox.com/s/ff3y8odixiypg6v/houston_zip.csv?dl=1')    # Houston
df_raw_n = pd.read_csv('https://www.dropbox.com/s/rb9wt2tfkzsyo4s/nyc_zip.csv?dl=1')        # NYC
df_raw_t.head()

Unnamed: 0,Postal Code
0,M1A
1,M1B
2,M1C
3,M1E
4,M1G


> I picked "ArcGIS" provider for GeoCoder module.
>>  https://geocoder.readthedocs.io/providers/ArcGIS.html

>It takes a postal code and returns its latitude and longitude.

> Here we iterate over all the postal codes and store the returned coordinates in a dictionary. Then we turn that dictionary into a DataFrame, so we have one table with the postal codes and their geo-coordinates.

In [20]:
import geocoder
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated

> First, let's get the geo-location for each city.

In [21]:
# coordinates of >>Toronto, Canada<<
g_t = geocoder.arcgis('Toronto Canada')
t_lat = g_t.json['lat']
t_lng = g_t.json['lng']
print(f'Toronto is located at ({t_lat:.2f}, {t_lng:.2f})')
# coordinates of >>Chicago, USA<<<
g_c = geocoder.arcgis('Chicago USA')
c_lat = g_c.json['lat']
c_lng = g_c.json['lng']
print(f'Chicago is located at ({c_lat:.2f}, {c_lng:.2f})')
# coordinates of >>Houston, USA<<<
g_h = geocoder.arcgis('Houston USA')
h_lat = g_h.json['lat']
h_lng = g_h.json['lng']
print(f'Houston is located at ({h_lat:.2f}, {h_lng:.2f})')
# coordinates of >>NYC, USA<<<
g_n = geocoder.arcgis('New York City USA')
n_lat = g_n.json['lat']
n_lng = g_n.json['lng']
print(f'New York City is located at ({n_lat:.2f}, {n_lng:.2f})')

Toronto is located at (43.65, -79.39)
Chicago is located at (41.88, -87.63)
Houston is located at (29.76, -95.37)
New York City is located at (40.71, -74.01)


> Take each postal code and get its geo-coordinates. We do this for each city. This code takes a long time to run.

In [22]:
# >>Toronto<<
coordinates_dict_t = {}
postal_codes_t = df_raw_t['Postal Code'].to_numpy()
for code in postal_codes_t :
    g_t = geocoder.arcgis(f'{code} Toronto')
    coordinates_dict_t[code] = (g_t.json['lat'], g_t.json['lng'])
# >>Chicago<<
coordinates_dict_c = {}
postal_codes_c = df_raw_c['ZIP'].to_numpy()
for code in postal_codes_c :
    g_c = geocoder.arcgis(f'{code} Chicago')
    coordinates_dict_c[code] = (g_c.json['lat'], g_c.json['lng'])
# >>Houston<<
coordinates_dict_h = {}
postal_codes_h = df_raw_h['ZIP'].to_numpy()
for code in postal_codes_h :
    g_h = geocoder.arcgis(f'{code} Houston')
    coordinates_dict_h[code] = (g_h.json['lat'], g_h.json['lng'])
# >>New York City<<
coordinates_dict_n = {}
postal_codes_n = df_raw_n['ZIP'].to_numpy()
for code in postal_codes_n :
    g_n = geocoder.arcgis(f'{code} New York City')
    coordinates_dict_n[code] = (g_n.json['lat'], g_n.json['lng'])
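
> The four geocoding loops above differ only in the city name, so they could be folded into one helper. The sketch below is an illustration rather than the code used in this notebook: `geocode_codes`, its `lookup` parameter and `fake_lookup` are hypothetical names, and the coordinates in `fake_lookup` are made up so that the example runs without network access. Codes that cannot be resolved get NaN coordinates instead of raising an error.

```python
import pandas as pd


def geocode_codes(codes, city, lookup):
    """Return a DataFrame of postal codes with latitude and longitude.

    `lookup` is any callable that takes a query string and returns a
    (lat, lng) pair, or None when the query cannot be resolved.
    """
    rows = []
    for code in codes:
        result = lookup(f"{code} {city}")
        # Fall back to NaN when the geocoder has no answer for this code.
        lat, lng = result if result else (float("nan"), float("nan"))
        rows.append({"Postal Code": code, "Latitude": lat, "Longitude": lng})
    return pd.DataFrame(rows, columns=["Postal Code", "Latitude", "Longitude"])


def fake_lookup(query):
    # Stand-in for a real geocoder call; the values are made up for illustration.
    table = {"M1B Toronto": (43.81, -79.20), "M1C Toronto": (43.79, -79.16)}
    return table.get(query)


demo = geocode_codes(["M1B", "M1C", "XXX"], "Toronto", fake_lookup)
print(demo)
```

> In the notebook, `lookup` could wrap `geocoder.arcgis(query)` and return `(g.json['lat'], g.json['lng'])`, the same access pattern the loops above use.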

> Turn each dictionary of postal-code coordinates into a DataFrame.

In [23]:
# >>Toronto<<
fd_coordinates_t = pd.DataFrame.from_dict(coordinates_dict_t, orient='index', dtype='float')
fd_coordinates_t.rename(columns={0:'Latitude', 1:'Longitude'}, inplace=True)
fd_coordinates_t.reset_index(inplace = True)
fd_coordinates_t.rename(columns={'index':'Postal Code'}, inplace=True)
# >>Chicago<<
fd_coordinates_c = pd.DataFrame.from_dict(coordinates_dict_c, orient='index', dtype='float')
fd_coordinates_c.rename(columns={0:'Latitude', 1:'Longitude'}, inplace=True)
fd_coordinates_c.reset_index(inplace = True)
fd_coordinates_c.rename(columns={'index':'Postal Code'}, inplace=True)
# >>Houston<<
fd_coordinates_h = pd.DataFrame.from_dict(coordinates_dict_h, orient='index', dtype='float')
fd_coordinates_h.rename(columns={0:'Latitude', 1:'Longitude'}, inplace=True)
fd_coordinates_h.reset_index(inplace = True)
fd_coordinates_h.rename(columns={'index':'Postal Code'}, inplace=True)
# >>New York City<<
fd_coordinates_n = pd.DataFrame.from_dict(coordinates_dict_n, orient='index', dtype='float')
fd_coordinates_n.rename(columns={0:'Latitude', 1:'Longitude'}, inplace=True)
fd_coordinates_n.reset_index(inplace = True)
fd_coordinates_n.rename(columns={'index':'Postal Code'}, inplace=True)

fd_coordinates_t.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1A,43.64869,-79.38544
1,M1B,43.81139,-79.19662
2,M1C,43.78574,-79.15875
3,M1E,43.76575,-79.1747
4,M1G,43.76812,-79.21761


### 2.2 Map Toronto Postal Codes

>Let's map all of Toronto's postal codes with the Folium module. This part is not that important, so I will show how it works with the Toronto postal codes only.

> The Folium module takes all the neighbourhood coordinates and plots them on the Toronto map.

In [24]:
import folium
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated

In [None]:
# create folium handler positioning at Toronto
toronto_map_postal = folium.Map(location=[t_lat, t_lng], zoom_start=10)
# adding all the neighbourhoods on the map
for lat, lng, label in zip(fd_coordinates_t['Latitude'], fd_coordinates_t['Longitude'], fd_coordinates_t['Postal Code']):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        fill=True,
        color='blue',
        fill_color='blue',
        fill_opacity=0.6
        ).add_to(toronto_map_postal)

# GitHub doesn't show the map
# uncomment the line below to see the map if you run this code on a platform that renders it
# toronto_map_postal 

> Apparently GitHub doesn't load the Folium map, so here is a picture of the map

>![Image of Toronto Postal Codes](https://jo5u7g.by.files.1drv.com/y4mWVym4oBRWElWVWGHk3IWYJmAYl7BxdGA61FlydQ9UJkhedMrvjKiEIUxk-BqMwbmdCFstcZkh08PCLOeB-Md5wXcYI9HXLRNW4HXl37ETAFMNpOxmQuCBp68Tc3zRavLrhpdGwHzxQlM9sR8eER1qLyuqGOjkIDcUQvUViMvkgqzG5O48c6jSigXGh2DI7ZtkjJbmjXMEQYL2Elxu4T-mw/map.PNG?psid=1)

### 2.3 Trending Venues for each postal code in each city


> Foursquare has a Python module that connects us to its API. We need an account at Foursquare to obtain API credentials.

> You can register your free account at https://foursquare.com/developers/apps

> Read about the Foursquare module at https://github.com/mLewisLogic/foursquare

In [25]:
import foursquare

CLIENT_ID = '__add yours__' # your Foursquare ID
CLIENT_SECRET = '__add yours__' # your Foursquare Secret

# used FourSquare module to construct a handler for FourSquare API
client = foursquare.Foursquare(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)

>### Explore venues at the specific location
> client.venues.explore(params={'ll': f'{lat},{lng}', 'section': 'trending', 'limit': '5', 'sortByPopularity': '1'})

>- 'll' takes the latitude and longitude of a postal code in a city
- 'section': 'trending' | gives us trending venues near the given postal code location
- 'limit': '5' | we get at most 5 venues
- 'sortByPopularity': '1' | results are sorted by popularity, so we get the top 5 trending venues

> In return for our request we get a JSON response. One of its fields holds the venue's categories, along with some extra information. We need to filter that field to get a "clean" venue category name. For this we will use the small function below.

In [26]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
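
> To make the filtering step concrete, here is a small self-contained sketch of the same idea applied to one row. It is not the notebook's function: `first_category_name` and `sample_row` are hypothetical, and the row content is made up to resemble a flattened venue record.

```python
def first_category_name(row):
    """Return the name of the first category in a venue row, or None."""
    # Same lookup order as the notebook's function: 'categories' first,
    # then 'venue.categories'. `row` may be a dict or a pandas Series.
    categories = row.get("categories")
    if categories is None:
        categories = row.get("venue.categories", [])
    return categories[0]["name"] if categories else None


# Made-up row for illustration only.
sample_row = {
    "venue.name": "Example Cafe",
    "venue.categories": [{"id": "abc123", "name": "Coffee Shop", "primary": True}],
}

print(first_category_name(sample_row))                # Coffee Shop
print(first_category_name({"venue.categories": []}))  # None
```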

> This code makes API calls which are time consuming

In [27]:
# >> Toronto <<
df_list_t = []
for lat, lng, pcode in zip(fd_coordinates_t['Latitude'], fd_coordinates_t['Longitude'], fd_coordinates_t['Postal Code']):
    pcode_json = client.venues.explore(params={'ll': f'{lat},{lng}', 'section': 'trending', 'limit': '5', 'sortByPopularity': '1'})
    items = pcode_json['groups'][0]['items']
    dataframe = json_normalize(items)   # convert json into Data Frame
    dataframe = dataframe[['venue.id', 'venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']]
    dataframe['venue.categories'] = dataframe.apply(get_category_type, axis=1) # clean the categories field with our function
    dataframe['Postal Code'] = pcode   # add a column connecting venues with their postal code
    df_list_t.append(dataframe)   # each DataFrame goes into the list for a later merge
df_venuse_t = pd.concat(df_list_t) # merge the 103 tables in the list into one

# >>Chicago<<
df_list_c = []
for lat, lng, pcode in zip(fd_coordinates_c['Latitude'], fd_coordinates_c['Longitude'], fd_coordinates_c['Postal Code']):
    pcode_json = client.venues.explore(params={'ll': f'{lat},{lng}', 'section': 'trending', 'limit': '5', 'sortByPopularity': '1'})
    items = pcode_json['groups'][0]['items']
    dataframe = json_normalize(items)
    dataframe = dataframe[['venue.id', 'venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']]
    dataframe['venue.categories'] = dataframe.apply(get_category_type, axis=1)
    dataframe['Postal Code'] = pcode
    df_list_c.append(dataframe)
df_venuse_c = pd.concat(df_list_c)

# >> Houston <<
df_list_h = []
for lat, lng, pcode in zip(fd_coordinates_h['Latitude'], fd_coordinates_h['Longitude'], fd_coordinates_h['Postal Code']):
    pcode_json = client.venues.explore(params={'ll': f'{lat},{lng}', 'section': 'trending', 'limit': '5', 'sortByPopularity': '1'})
    items = pcode_json['groups'][0]['items']
    dataframe = json_normalize(items)
    dataframe = dataframe[['venue.id', 'venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']]
    dataframe['venue.categories'] = dataframe.apply(get_category_type, axis=1)
    dataframe['Postal Code'] = pcode
    df_list_h.append(dataframe)
df_venuse_h = pd.concat(df_list_h)

# >> New York City <<
df_list_n = []
for lat, lng, pcode in zip(fd_coordinates_n['Latitude'], fd_coordinates_n['Longitude'], fd_coordinates_n['Postal Code']):
    pcode_json = client.venues.explore(params={'ll': f'{lat},{lng}', 'section': 'trending', 'limit': '5', 'sortByPopularity': '1'})
    items = pcode_json['groups'][0]['items']
    dataframe = json_normalize(items)
    dataframe = dataframe[['venue.id', 'venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']]
    dataframe['venue.categories'] = dataframe.apply(get_category_type, axis=1)
    dataframe['Postal Code'] = pcode
    df_list_n.append(dataframe)
df_venuse_n = pd.concat(df_list_n)
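
> The per-city loops above repeat the same flattening steps, and they assume every response contains venues. A possible refactoring is sketched below under stated assumptions: `venues_frame`, `VENUE_COLUMNS` and `sample_items` are hypothetical names, and the payload is made up to mimic the shape of the `items` list. It uses `pd.json_normalize` (pandas 1.0 or later) and `reindex`, so an empty response yields an empty table rather than a `KeyError`.

```python
import pandas as pd

VENUE_COLUMNS = ["venue.id", "venue.name", "venue.categories",
                 "venue.location.lat", "venue.location.lng"]


def venues_frame(items, postal_code):
    """Flatten a list of explore-style items into one tidy DataFrame."""
    frame = pd.json_normalize(items)
    # reindex keeps the wanted columns and creates them if they are missing.
    frame = frame.reindex(columns=VENUE_COLUMNS).copy()
    # Keep only the name of the first category, or None if there is none.
    frame["venue.categories"] = frame["venue.categories"].apply(
        lambda cats: cats[0]["name"] if isinstance(cats, list) and cats else None
    )
    frame["Postal Code"] = postal_code
    return frame


# Made-up payload for illustration only.
sample_items = [
    {"venue": {"id": "v1", "name": "Example Cafe",
               "categories": [{"name": "Coffee Shop"}],
               "location": {"lat": 43.65, "lng": -79.38}}},
    {"venue": {"id": "v2", "name": "Example Park",
               "categories": [],
               "location": {"lat": 43.66, "lng": -79.39}}},
]

venues = pd.concat([venues_frame(sample_items, "M5H")], ignore_index=True)
print(venues)
```

> Passing `ignore_index=True` to `pd.concat` gives every venue a unique row index, which avoids the repeated 0 to 4 index produced by concatenating the per-postal-code tables as they are.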

> Here is what the __Toronto__ venues table looks like

In [28]:
df_venuse_t.head()

Unnamed: 0,venue.id,venue.name,venue.categories,venue.location.lat,venue.location.lng,Postal Code
0,4adf85e1f964a5206e7b21e3,Hudson's Bay,Department Store,43.65204,-79.380391,M1A
1,4ad4c05ef964a520a6f620e3,Nathan Phillips Square,Plaza,43.65227,-79.383516,M1A
2,4ad4c063f964a5201df820e3,Brookfield Place,Shopping Mall,43.646791,-79.378769,M1A
3,4ad7aa49f964a5207b0d21e3,Scotiabank Theatres,Movie Theater,43.648829,-79.390782,M1A
4,4ae5df5af964a520c4a221e3,Bell Trinity Square,Office,43.653475,-79.38247,M1A


> Top 10 venue categories in __Toronto__

In [29]:
df_venuse_t['venue.categories'].value_counts().head(10)

Shopping Mall       127
Grocery Store       106
Plaza                89
Department Store     88
Movie Theater        84
Office               80
Supermarket          74
Park                 57
Pharmacy             30
Coffee Shop          25
Name: venue.categories, dtype: int64

### 2.4 Now let's map all the trending venues in each city on the Folium map

>__Toronto's__ trending venues 

In [None]:
map_ven_t = folium.Map(location=[t_lat, t_lng], zoom_start=10)
for lat, lng, label in zip(df_venuse_t['venue.location.lat'], df_venuse_t['venue.location.lng'], df_venuse_t['venue.name']):
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        fill=True,
        color='green',
        fill_color='red',
        fill_opacity=0.5
        ).add_to(map_ven_t)

# GitHub doesn't show the map
# uncomment the line below to see the map if you run this code on a platform that renders it
# map_ven_t

> Apparently GitHub doesn't load the Folium map, so here is a picture of the map

<img src="https://ly5n7g.by.files.1drv.com/y4mXMFRM7fkmqL-SD_kqmbDSffzSxlcX3sHE7Czx7y-5ex8h40ohM1lEuFi28EQzmw_Lnlc8ANp7AdRsIhvpiW7y9Apj4vTpbWTNoLNPgv7VO1cNKiqdVostT9djKCXaOZgPoHcWzjUn79AyWJgj4i2dknduk9zfLbx_DG29_ebuEyz5iEWa69Poi8ueaf53WrkVGWsz-X2EMCaHhLDDdGMXA/tor_map_ven.PNG?psid=1" alt="Drawing" style="width: 700px;"/>

>__Chicago's__ trending venues

In [None]:
map_ven_c = folium.Map(location=[c_lat, c_lng], zoom_start=10)
for lat, lng, label in zip(df_venuse_c['venue.location.lat'], df_venuse_c['venue.location.lng'], df_venuse_c['venue.name']):
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        fill=True,
        color='green',
        fill_color='red',
        fill_opacity=0.5
        ).add_to(map_ven_c)

# GitHub doesn't show the map
# uncomment the line below to see the map if you run this code on a platform that renders it
# map_ven_c

> Apparently GitHub doesn't load the Folium map, so here is a picture of the map

><img src="https://li5p7g.by.files.1drv.com/y4mNui2N8HZWIPboVt2rOxKxNHSqefpzvdMuo25-6jcPKaZh0Shr6Ead7SFtQQyNyy_m123IeG6D9C5gziMAmBMbi-m8Y5gLdIOdWuOFx_v2Nf-z6BZH38kE_cancYRFrzrPYO-HUHWZ507BK08pYLWkJOT8sJgm6t3bXRo4iLuJ7jUgaY346ALA0DSidhsIDCrskem9WfbPWcC1IfMaE4asA/cgo_map_ven.PNG?psid=1" alt="Drawing" style="width: 700px;"/>

>__Houston's__ trending venues

In [None]:
map_ven_h = folium.Map(location=[h_lat, h_lng], zoom_start=10)
for lat, lng, label in zip(df_venuse_h['venue.location.lat'], df_venuse_h['venue.location.lng'], df_venuse_h['venue.name']):
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        fill=True,
        color='green',
        fill_color='red',
        fill_opacity=0.5
        ).add_to(map_ven_h)

# GitHub doesn't show the map
# uncomment the line below to see the map if you run this code on a platform that renders it
# map_ven_h

> Apparently GitHub doesn't load the Folium map, so here is a picture of the map

><img src="https://li5i7g.by.files.1drv.com/y4myDCJBSHGcPJEVK4fz7_dB8CkI8uNQM4lld93jKrrgEBuS2JXQdzcik5DF7mOIFiMsFuteG2oEIbGM2DDE2RV4sP4TWYlqKgzJP0NZoRlxp_xxcIEtZZEVy__JGEubQRDOpMBMpJS1LUvjg_r7fDYq8FM_nFSZbGs7yPUwFpPHbmjNFRsK6jHn5pNGH5M1bRaIng683w_BS08iaxnqPdtHg/hstn_map_ven.PNG?psid=1" alt="Drawing" style="width: 700px;"/>

>__NYC's__ trending venues

In [None]:
map_ven_n = folium.Map(location=[n_lat, n_lng], zoom_start=10)
for lat, lng, label in zip(df_venuse_n['venue.location.lat'], df_venuse_n['venue.location.lng'], df_venuse_n['venue.name']):
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        fill=True,
        color='green',
        fill_color='red',
        fill_opacity=0.5
        ).add_to(map_ven_n)

# GitHub doesn't show the map
# uncomment the line below to see the map if you run this code on a platform that renders it
# map_ven_n

> Apparently GitHub doesn't load the Folium map, so here is a picture of the __NYC__ map

><img src="https://ly5m7g.by.files.1drv.com/y4mCPyOxbZHGt_fvDxdVV8WOLXSLZNArvh9gGdBpH8QXUt_0VrTDk4dJCwO07hasXH_WTr4dudR06Dst07MWW705yKoQ0oHX8iYjIUWvWyGWQAO-3N9vlZRk1zEIo0vx7ueT9_S7QV7ak9Rykq7VbfuNPiW50JB2M-geQ9oUmgQW7Tol61seVmUqAxEGU0UwFYU3dSRd_KKgAYj0OnW-Z2a4w/nyc_map_ven.PNG?psid=1" alt="Drawing" style="width: 700px;"/>

## 3. Postal Codes Clustering using DBSCAN & scikit-learn

>DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is one of the most common clustering algorithms and works based on the density of points.
The whole idea is that if a particular point belongs to a cluster, it should be near lots of other points in that cluster.

>It works based on two parameters: Epsilon and Minimum Samples  
>>**Epsilon** (`eps`) is the radius of the neighborhood around a point; if that neighborhood contains enough points, we call it a dense area  
**Minimum Samples** (`min_samples`) is the minimum number of data points we want in a neighborhood to define a cluster.

>DBSCAN is especially good for tasks like class identification in a spatial context. Its most useful property is that it can find clusters of arbitrary shape without being thrown off by noise. Here we use it on venue locations to find groups of venues in close proximity to each other. It not only finds arbitrarily shaped clusters, it also separates the denser parts of the data from less dense areas, which it labels as noise.
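
> A toy run may help to see how the two parameters and the noise label behave. The sketch below uses synthetic points (two tight blobs and one stray point), not the venue data; the values of `eps` and `min_samples` are chosen only for this toy set.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data: two tight blobs of 30 points each, plus one far-away point.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.02, size=(30, 2))
blob_b = rng.normal(loc=(3.0, 3.0), scale=0.02, size=(30, 2))
stray = np.array([[10.0, -10.0]])
points = np.vstack([blob_a, blob_b, stray])

labels = DBSCAN(eps=0.5, min_samples=10).fit(points).labels_

# DBSCAN marks noise with the label -1, so it is excluded from the cluster count.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)  # 2 1
```

> The stray point has no neighbors within `eps`, so it ends up with the label -1 instead of forming a cluster of its own.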

In [30]:
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt 
%matplotlib inline

In [31]:
# TORONTO
sklearn.utils.check_random_state(1000)  # returns a RandomState that is not used below
Clus_dataSet_t = df_venuse_t[['venue.location.lat','venue.location.lng']]
Clus_dataSet_t = np.nan_to_num(Clus_dataSet_t)
Clus_dataSet_t = StandardScaler().fit_transform(Clus_dataSet_t)

# Compute DBSCAN
db_t = DBSCAN(eps=0.2, min_samples=10).fit(Clus_dataSet_t)
# First, create an array of booleans using the labels from db.
core_samples_mask_t = np.zeros_like(db_t.labels_, dtype=bool)
core_samples_mask_t[db_t.core_sample_indices_] = True
labels_t = db_t.labels_

# Number of clusters in labels, ignoring noise if present.
realClusterNum_t = len(set(labels_t)) - (1 if -1 in labels_t else 0)
# Number of distinct labels, including the noise label (-1) if present.
clusterNum_t = len(set(labels_t))

In [32]:
# CHICAGO
Clus_dataSet_c = df_venuse_c[['venue.location.lat','venue.location.lng']]
Clus_dataSet_c = np.nan_to_num(Clus_dataSet_c)
Clus_dataSet_c = StandardScaler().fit_transform(Clus_dataSet_c)
db_c = DBSCAN(eps=0.2, min_samples=10).fit(Clus_dataSet_c)
core_samples_mask_c = np.zeros_like(db_c.labels_, dtype=bool)
core_samples_mask_c[db_c.core_sample_indices_] = True
labels_c = db_c.labels_
realClusterNum_c=len(set(labels_c)) - (1 if -1 in labels_c else 0)
clusterNum_c = len(set(labels_c))

# HOUSTON
Clus_dataSet_h = df_venuse_h[['venue.location.lat','venue.location.lng']]
Clus_dataSet_h = np.nan_to_num(Clus_dataSet_h)
Clus_dataSet_h = StandardScaler().fit_transform(Clus_dataSet_h)
db_h = DBSCAN(eps=0.2, min_samples=10).fit(Clus_dataSet_h)
core_samples_mask_h = np.zeros_like(db_h.labels_, dtype=bool)
core_samples_mask_h[db_h.core_sample_indices_] = True
labels_h = db_h.labels_
realClusterNum_h = len(set(labels_h)) - (1 if -1 in labels_h else 0)
clusterNum_h = len(set(labels_h))

# NYC
Clus_dataSet_n = df_venuse_n[['venue.location.lat','venue.location.lng']]
Clus_dataSet_n = np.nan_to_num(Clus_dataSet_n)
Clus_dataSet_n = StandardScaler().fit_transform(Clus_dataSet_n)
db_n = DBSCAN(eps=0.2, min_samples=10).fit(Clus_dataSet_n)
core_samples_mask_n = np.zeros_like(db_n.labels_, dtype=bool)
core_samples_mask_n[db_n.core_sample_indices_] = True
labels_n = db_n.labels_
realClusterNum_n = len(set(labels_n)) - (1 if -1 in labels_n else 0)
clusterNum_n = len(set(labels_n))

print(f'Toronto has {clusterNum_t} vibrant clusters.')
print(f'Chicago has {clusterNum_c} vibrant clusters.')
print(f'Houston has {clusterNum_h} vibrant clusters.')
print(f'NYC has {clusterNum_n} vibrant clusters.')

Toronto has 14 vibrant clusters.
Chicago has 7 vibrant clusters.
Houston has 2 vibrant clusters.
NYC has 15 vibrant clusters.
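
> Because the coordinates are passed through `StandardScaler` before clustering, `eps=0.2` is measured in standard deviations of latitude and longitude, so the physical radius differs from city to city. One alternative, sketched below and not used in this notebook, is to cluster on the haversine distance so that the radius is given in kilometres. `cluster_by_distance`, `EARTH_RADIUS_KM` and the sample radius are assumptions for illustration; the sample points are rounded from coordinates shown in the Toronto tables.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def cluster_by_distance(lat_lng_degrees, radius_km, min_samples):
    """Run DBSCAN with a neighborhood radius expressed in kilometres."""
    # The haversine metric expects [latitude, longitude] in radians
    # and returns distances as angles, hence the division by the radius.
    coords = np.radians(np.asarray(lat_lng_degrees, dtype=float))
    model = DBSCAN(eps=radius_km / EARTH_RADIUS_KM, min_samples=min_samples,
                   metric="haversine", algorithm="ball_tree")
    return model.fit(coords).labels_


# Three downtown Toronto venues and one point far to the north-west.
sample = [(43.6520, -79.3804), (43.6523, -79.3835),
          (43.6468, -79.3788), (43.7390, -79.5890)]
labels = cluster_by_distance(sample, radius_km=1.0, min_samples=3)
print(labels)  # [ 0  0  0 -1]
```

> With this variant the same `radius_km` means the same walking distance in Toronto, Chicago, Houston and NYC, which makes the cluster counts easier to compare across cities.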


### 3.2 Cluster visualization

> This computation is time consuming

> __Toronto__ DBSCAN plot

In [None]:
colors_t = plt.cm.Spectral(np.linspace(0, 1, len(labels_t)))
# Plot the points with colors
for k, col in zip(labels_t, colors_t):
    if k == -1:
        # Black used for noise.
        col = 'k'

    class_member_mask_t = (labels_t == k)

    # Plot the datapoints that are clustered
    xy = Clus_dataSet_t[class_member_mask_t & core_samples_mask_t]
    # plt.scatter(xy[:, 0], xy[:, 1],s=30, c=[col], marker=u'o', alpha=0.5)

    # Plot the outliers
    xy = Clus_dataSet_t[class_member_mask_t & ~core_samples_mask_t]
    # plt.scatter(xy[:, 0], xy[:, 1],s=30, c=[col], marker=u'o', alpha=0.5)

> Apparently GitHub doesn't load Matplotlib plots, so here is a picture of the Toronto plot

><img src="https://li5v7g.by.files.1drv.com/y4mstDes1Nk9Eec6G-FE0b7VGJszG_AVEB5PBlMdCLIbUuYim1jektQShZML9WU8SrioQke4yJjvlAhagp-z1jkJ2zLjkwszlWdtSmAVgFvMuhrEMwiEckdshReSVz3ttDKsAI56kIoi3pT7ONfIgOqXi7URMEyenJ0OI5w2--49HDurib2rmNKjvaJmVe7ocy904xqoVyCAmbK45z2oCc7Ug/tor_map_dbscan.PNG?psid=1" alt="Drawing" style="width: 370px;"/>
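
> The plotting loops in this section pair every point's label with a colour, so the loop body runs once per venue, which is likely why the computation is slow. A lighter variant, sketched here as an assumption rather than the code that produced the pictures, loops over the unique labels only. `plot_clusters` and the demo arrays are hypothetical, and the sketch draws core and border points in the same style.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_clusters(points, labels, ax=None):
    """Scatter-plot points with one colour per cluster label, noise in black."""
    points = np.asarray(points)
    labels = np.asarray(labels)
    if ax is None:
        _, ax = plt.subplots()
    unique_labels = np.unique(labels)
    colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
    for k, col in zip(unique_labels, colors):
        mask = labels == k
        color = "k" if k == -1 else [col]  # black is used for noise
        ax.scatter(points[mask, 0], points[mask, 1], s=30, c=color, alpha=0.5)
    return ax


# Made-up points and labels for illustration only.
demo_points = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [5.0, 5.0]])
demo_labels = np.array([0, 0, 1, 1, -1])
ax = plot_clusters(demo_points, demo_labels)
```

> For the Toronto data the call would be `plot_clusters(Clus_dataSet_t, labels_t)`, which issues one `scatter` call per cluster instead of one per venue.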

> __Chicago__ DBSCAN plot

In [None]:
colors_c = plt.cm.Spectral(np.linspace(0, 1, len(labels_c)))
for k, col in zip(labels_c, colors_c):
    if k == -1:
        col = 'k'

    class_member_mask_c = (labels_c == k)
    xy = Clus_dataSet_c[class_member_mask_c & core_samples_mask_c]
    # plt.scatter(xy[:, 0], xy[:, 1],s=30, c=[col], marker=u'o', alpha=0.5)
    xy = Clus_dataSet_c[class_member_mask_c & ~core_samples_mask_c]
    # plt.scatter(xy[:, 0], xy[:, 1],s=30, c=[col], marker=u'o', alpha=0.5)

> Apparently GitHub doesn't load Matplotlib plots, so here is a picture of the Chicago plot

><img src="https://li5m7g.by.files.1drv.com/y4mzCnYTweTzf3BC5YlMO_RS4aFKZ__F-gZqHF1mPGFploCh3BlUsDHpwkCopJ1JfOiXPgptveWZToaYlG1Lkm49w0bXF9fofxGYrv-O-n7Kfa9XpxrTHhnHDtNQHkPGkgO-cb1X694Ho85VYayhEQaTuLtQVy6XCuUBPK42W12dAF0HAHzGYMfDd_CLWlZcmodLe5og5hCJA_MPuRShzibCw/cgo_map_dbscan.PNG?psid=1" alt="Drawing" style="width: 370px;"/>

> __Houston__ DBSCAN plot

In [None]:
colors_h = plt.cm.Spectral(np.linspace(0, 1, len(labels_h)))
for k, col in zip(labels_h, colors_h):
    if k == -1:
        col = 'k'

    class_member_mask_h = (labels_h == k)
    xy = Clus_dataSet_h[class_member_mask_h & core_samples_mask_h]
    # plt.scatter(xy[:, 0], xy[:, 1],s=30, c=[col], marker=u'o', alpha=0.5)
    xy = Clus_dataSet_h[class_member_mask_h & ~core_samples_mask_h]
    # plt.scatter(xy[:, 0], xy[:, 1],s=30, c=[col], marker=u'o', alpha=0.5)

> Apparently GitHub doesn't load Matplotlib plots, so here is a picture of the Houston plot

><img src="https://li5o7g.by.files.1drv.com/y4mSBzboX8MhOnm6IT6BLKxr7aa7sN7w2FZqQ4VkZGfiGJNWSxrXttOYcW_1cCZc0Iaucy29Tm2TZ_7OZV0W4APaQ1n4Ul2ZQ4a238WbZKrsy8_wa8ynGlZu3MUiVgBXyuZGjLVNyCySdgKTTsVxVGsbQ7lQcL1jkgb9qRdXYaAoR4VOokKZonba2TSjcaG8WZLnBJRK-lH03x7stI0MngOCw/hstn_map_dbscan.PNG?psid=1" alt="Drawing" style="width: 370px;"/>

> __NYC__ DBSCAN plot

In [None]:
colors_n = plt.cm.Spectral(np.linspace(0, 1, len(labels_n)))
for k, col in zip(labels_n, colors_n):
    if k == -1:
        col = 'k'

    class_member_mask_n = (labels_n == k)
    xy = Clus_dataSet_n[class_member_mask_n & core_samples_mask_n]
    # plt.scatter(xy[:, 0], xy[:, 1],s=30, c=[col], marker=u'o', alpha=0.5)
    xy = Clus_dataSet_n[class_member_mask_n & ~core_samples_mask_n]
    # plt.scatter(xy[:, 0], xy[:, 1],s=30, c=[col], marker=u'o', alpha=0.5)

> Apparently GitHub doesn't load Matplotlib plots, so here is a picture of the __NYC__ plot

><img src="https://li5l7g.by.files.1drv.com/y4mO1mUa3feBM9i6cg_IRK78-4icHBFuotL_hzWSoZi7rr6NtWnXVbAchzGTHy290NmzaycboDzZtdF4b14kk2NI7e_0dnKt1x_nhhcTrUivtOoP-fFKMZHK3B2_f_0HDwW6UGGD3w_dmcH-mK1GxstB6i-nuqT7ArUFI9jEEpkLYuF3IJqji6632bNxHw8rO8k_ieQgOHv0s3-tZIa9yBMmQ/nyc_map_dbscan.PNG?psid=1" alt="Drawing" style="width: 370px;"/>

### 3.3 Mark each venue in a Data Frame with its cluster if the venue has one

In [33]:
# TORONTO
df_venuse_t['ClusterNumber'] = labels_t
data_filter_t = df_venuse_t['ClusterNumber'] != -1 # cluster "-1" marks outlier venues
df_cluster_t = df_venuse_t.where(data_filter_t).dropna()

# CHICAGO
df_venuse_c['ClusterNumber'] = labels_c
data_filter_c = df_venuse_c['ClusterNumber'] != -1
df_cluster_c = df_venuse_c.where(data_filter_c).dropna()

# HOUSTON
df_venuse_h['ClusterNumber'] = labels_h
data_filter_h = df_venuse_h['ClusterNumber'] != -1
df_cluster_h = df_venuse_h.where(data_filter_h).dropna()

# NYC
df_venuse_n['ClusterNumber'] = labels_n
data_filter_n = df_venuse_n['ClusterNumber'] != -1
df_cluster_n = df_venuse_n.where(data_filter_n).dropna()

# what Toronto's table looks like
df_venuse_t.head()

Unnamed: 0,venue.id,venue.name,venue.categories,venue.location.lat,venue.location.lng,Postal Code,ClusterNumber
0,4adf85e1f964a5206e7b21e3,Hudson's Bay,Department Store,43.65204,-79.380391,M1A,0
1,4ad4c05ef964a520a6f620e3,Nathan Phillips Square,Plaza,43.65227,-79.383516,M1A,0
2,4ad4c063f964a5201df820e3,Brookfield Place,Shopping Mall,43.646791,-79.378769,M1A,0
3,4ad7aa49f964a5207b0d21e3,Scotiabank Theatres,Movie Theater,43.648829,-79.390782,M1A,0
4,4ae5df5af964a520c4a221e3,Bell Trinity Square,Office,43.653475,-79.38247,M1A,0
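
> A note on the filtering pattern: `.where(mask).dropna()` replaces the noise rows with NaN and then drops every row containing any missing value. As a side effect it also drops venues that have no category and turns the integer cluster labels into floats, which is why `ClusterNumber` shows up as `0.0` later. The sketch below compares it with plain boolean indexing on a made-up table; it illustrates the difference and is not a change to the notebook's results.

```python
import pandas as pd

# Made-up venues: one clustered, one noise, one clustered but without a category.
venues = pd.DataFrame({
    "venue.name": ["Cafe A", "Mall B", "Park C"],
    "venue.categories": ["Coffee Shop", "Shopping Mall", None],
    "ClusterNumber": [0, -1, 1],
})

# Boolean indexing removes only the noise rows and keeps integer labels.
clustered = venues[venues["ClusterNumber"] != -1]
print(clustered["ClusterNumber"].tolist())  # [0, 1]

# where + dropna removes the noise row AND the row with a missing category.
via_where = venues.where(venues["ClusterNumber"] != -1).dropna()
print(via_where["ClusterNumber"].tolist())  # [0.0]
```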


## 4. Looking for city clusters with trending coffee shops

> As a reminder, here is how many clusters each city has in total:

In [34]:
print(f'Toronto has {clusterNum_t} vibrant clusters.')
print(f'Chicago has {clusterNum_c} vibrant clusters.')
print(f'Houston has {clusterNum_h} vibrant clusters.')
print(f'NYC has {clusterNum_n} vibrant clusters.')

Toronto has 14 vibrant clusters.
Chicago has 7 vibrant clusters.
Houston has 2 vibrant clusters.
NYC has 15 vibrant clusters.


> Not all of these clusters have coffee shops among their venues. Let's find out which clusters are relevant.

In [35]:
# TORONTO
test_t = df_cluster_t[['venue.categories', 'ClusterNumber']]
test_t = test_t[['venue.categories', 'ClusterNumber']].where(df_cluster_t['venue.categories'] == 'Coffee Shop').dropna()
rename_t = {'index': 'Cluster', 'venue.categories': 'Total Coffee shops'}
df_coffee_cluster_t = test_t.ClusterNumber.value_counts().reset_index(name='venue.categories').rename(columns=rename_t)
coffee_clusters_t = df_coffee_cluster_t['Cluster'].values
print(f'TORONTO has {len(coffee_clusters_t)} coffee cluster from total of {clusterNum_t}.')

# CHICAGO
test_c = df_cluster_c[['venue.categories', 'ClusterNumber']]
test_c = test_c[['venue.categories', 'ClusterNumber']].where(df_cluster_c['venue.categories'] == 'Coffee Shop').dropna()
rename_c = {'index': 'Cluster', 'venue.categories': 'Total Coffee shops'}
df_coffee_cluster_c = test_c.ClusterNumber.value_counts().reset_index(name='venue.categories').rename(columns=rename_c)
coffee_clusters_c = df_coffee_cluster_c['Cluster'].values
print(f'CHICAGO has {len(coffee_clusters_c)} coffee cluster from total of {clusterNum_c}.')

# HOUSTON
test_h = df_cluster_h[['venue.categories', 'ClusterNumber']]
test_h = test_h[['venue.categories', 'ClusterNumber']].where(df_cluster_h['venue.categories'] == 'Coffee Shop').dropna()
rename_h = {'index': 'Cluster', 'venue.categories': 'Total Coffee shops'}
df_coffee_cluster_h = test_h.ClusterNumber.value_counts().reset_index(name='venue.categories').rename(columns=rename_h)
coffee_clusters_h = df_coffee_cluster_h['Cluster'].values
print(f'HOUSTON has {len(coffee_clusters_h)} coffee cluster from total of {clusterNum_h}.')

# NYC
test_n = df_cluster_n[['venue.categories', 'ClusterNumber']]
test_n = test_n[['venue.categories', 'ClusterNumber']].where(df_cluster_n['venue.categories'] == 'Coffee Shop').dropna()
rename_n = {'index': 'Cluster', 'venue.categories': 'Total Coffee shops'}
df_coffee_cluster_n = test_n.ClusterNumber.value_counts().reset_index(name='venue.categories').rename(columns=rename_n)
coffee_clusters_n = df_coffee_cluster_n['Cluster'].values
print(f'NYC has {len(coffee_clusters_n)} coffee cluster from total of {clusterNum_n}.')

TORONTO has 4 coffee cluster from total of 14.
CHICAGO has 4 coffee cluster from total of 7.
HOUSTON has 1 coffee cluster from total of 2.
NYC has 4 coffee cluster from total of 15.
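
> The counting step relies on the column names that `value_counts().reset_index()` produces, and those names differ between pandas 1.x and 2.x. A version-independent way to build the same table is sketched below; `coffee_cluster_counts` is a hypothetical helper and the demo table is made up.

```python
import pandas as pd


def coffee_cluster_counts(clustered_venues, category="Coffee Shop"):
    """Count venues of the given category per cluster."""
    is_match = clustered_venues["venue.categories"] == category
    return (
        clustered_venues.loc[is_match, "ClusterNumber"]
        .value_counts()
        .rename_axis("Cluster")                   # name the index explicitly
        .reset_index(name="Total Coffee shops")   # name the count column explicitly
    )


# Made-up data for illustration only.
demo = pd.DataFrame({
    "venue.categories": ["Coffee Shop", "Park", "Coffee Shop", "Coffee Shop", "Office"],
    "ClusterNumber": [0, 0, 0, 3, 5],
})
counts = coffee_cluster_counts(demo)
print(counts)
```

> In the notebook this would be called once per city, for example `coffee_cluster_counts(df_cluster_t)`, and `counts['Cluster'].values` would give the list of coffee clusters.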


>__Observation__: It looks like in southern cities such as Houston people don't drink as much coffee, or at least they don't do it at as many locations as people in northern cities do.

## 5. Mapping Coffee Clusters

>For each city cluster, we need to calculate geo coordinates to map its center on the Folium map. We can do that by calculating the mean latitude and mean longitude of all venues in the coffee clusters.

In [36]:
# TORONTO
# short table of all venues
df_cluster_loc_t = df_cluster_t[['venue.location.lat', 'venue.location.lng', 'ClusterNumber']]
# short table of all clusters with mean coordinates
df_cluster_loc_t = df_cluster_loc_t.groupby(by='ClusterNumber', as_index=False).mean()
# only clusters with trending Coffee shops
df_coffe_cluster_t = df_cluster_loc_t[df_cluster_loc_t['ClusterNumber'].isin(coffee_clusters_t)]

# CHICAGO
df_cluster_loc_c = df_cluster_c[['venue.location.lat', 'venue.location.lng', 'ClusterNumber']]
df_cluster_loc_c = df_cluster_loc_c.groupby(by='ClusterNumber', as_index=False).mean()
df_coffe_cluster_c = df_cluster_loc_c[df_cluster_loc_c['ClusterNumber'].isin(coffee_clusters_c)]

# HOUSTON
df_cluster_loc_h = df_cluster_h[['venue.location.lat', 'venue.location.lng', 'ClusterNumber']]
df_cluster_loc_h = df_cluster_loc_h.groupby(by='ClusterNumber', as_index=False).mean()
df_coffe_cluster_h = df_cluster_loc_h[df_cluster_loc_h['ClusterNumber'].isin(coffee_clusters_h)]

# NYC
df_cluster_loc_n = df_cluster_n[['venue.location.lat', 'venue.location.lng', 'ClusterNumber']]
df_cluster_loc_n = df_cluster_loc_n.groupby(by='ClusterNumber', as_index=False).mean()
df_coffe_cluster_n = df_cluster_loc_n[df_cluster_loc_n['ClusterNumber'].isin(coffee_clusters_n)]

# look at Toronto's table
df_coffe_cluster_t

| | ClusterNumber | venue.location.lat | venue.location.lng |
|---|---|---|---|
| 0 | 0.0 | 43.651413 | -79.382973 |
| 9 | 9.0 | 43.680916 | -79.415979 |
| 11 | 11.0 | 43.642784 | -79.422971 |
| 12 | 12.0 | 43.738982 | -79.588984 |


> Now let's map the coffee clusters on the Folium map to see where those areas are located.

>__TORONTO__ COFFEE AREAS

In [None]:
# map only the Coffee Shop clusters
t_map_coffe_cluster = folium.Map(location=[t_lat, t_lng], zoom_start=11)
for lat, lng, label in zip(df_coffe_cluster_t['venue.location.lat'], df_coffe_cluster_t['venue.location.lng'], df_coffe_cluster_t['ClusterNumber']):
    folium.CircleMarker(
        [lat, lng],
        radius=25,
        popup=label,
        tooltip=label,
        fill=True,
        color='red',
        fill_color='red',
        fill_opacity=0.5
        ).add_to(t_map_coffe_cluster)

# GitHub doesn't render Folium maps.
# Uncomment the line below to display the map on a platform that supports it.
# t_map_coffe_cluster

> Apparently GitHub doesn't render Folium maps, so here is a picture of the __TORONTO__ COFFEE map.

><img src="https://li5u7g.by.files.1drv.com/y4m5H-xUnjCw9awl0f9sBhFSMv-6dV6ndjSkEogsZiYG7TizCvomFLKOIbQBKP44JGHuiWSxQiR1Tj43CU8eskziMzV5e4ZjCvJypkfVgynGg0qUVfMKBcjPrFs2uIGDY3c40EfMR-wBb667eWLSCVXPFKg7PVdg1coiPzDfsX5wv-WQdPg_27zAG8KZHBxQSmPMV6SPLWxqsNKSRDpk3Fwpw/tor_map_rec.PNG?psid=1" alt="Drawing" style="width: 700px;"/>

>__CHICAGO__ COFFEE AREAS

In [None]:
# map only the Coffee Shop clusters
c_map_coffe_cluster = folium.Map(location=[c_lat, c_lng], zoom_start=11)
for lat, lng, label in zip(df_coffe_cluster_c['venue.location.lat'], df_coffe_cluster_c['venue.location.lng'], df_coffe_cluster_c['ClusterNumber']):
    folium.CircleMarker(
        [lat, lng],
        radius=25,
        popup=label,
        tooltip=label,
        fill=True,
        color='red',
        fill_color='red',
        fill_opacity=0.5
        ).add_to(c_map_coffe_cluster)

# GitHub doesn't render Folium maps.
# Uncomment the line below to display the map on a platform that supports it.
# c_map_coffe_cluster

> Apparently GitHub doesn't render Folium maps, so here is a picture of the CHICAGO COFFEE map.

><img src="https://li5n7g.by.files.1drv.com/y4m_PXDgUKPXh9Hcijw1dzRMpS5irFY5WBWClMV5M_MiDbarhlPOrTQ9HIbBtQchULJkjQJoVsAOhuUw6XcMbk_sFMxMyGTllmpLGz4jEDWFLq1bkOvWW2ZVNDQqLBPaccOVf7fFTmMblZ_Rzdyzpnr7Ifx4LeuaL9HkRuJ1Dr5qJUZo8gUsPdGjwl08VXzj_7FgX1l8wVg3Y89pOPedYCWPw/cgo_map_rec.PNG?psid=1" alt="Drawing" style="width: 700px;"/>

>__HOUSTON__ COFFEE AREAS

In [None]:
# map only the Coffee Shop clusters
h_map_coffe_cluster = folium.Map(location=[h_lat, h_lng], zoom_start=11)
for lat, lng, label in zip(df_coffe_cluster_h['venue.location.lat'], df_coffe_cluster_h['venue.location.lng'], df_coffe_cluster_h['ClusterNumber']):
    folium.CircleMarker(
        [lat, lng],
        radius=25,
        popup=label,
        tooltip=label,
        fill=True,
        color='red',
        fill_color='red',
        fill_opacity=0.5
        ).add_to(h_map_coffe_cluster)

# GitHub doesn't render Folium maps.
# Uncomment the line below to display the map on a platform that supports it.
# h_map_coffe_cluster

> Apparently GitHub doesn't render Folium maps, so here is a picture of the HOUSTON COFFEE map.

><img src="https://li5j7g.by.files.1drv.com/y4mafOAhj3a7n7b1RTKtoI3XGwV0_lwVP5rhn0rnXiG-DddMfXSt-SLh3DeBlXWzD4B9Up9VUTOZyTrfCviLPH6cIUO4Z9dceTxJzTq8mmFc99xTS5hSr763AOcGqKUDoGikeVlMt3LYyXs0bsoREZBn6u0scYCak1hZMTuU2lgtJCAENEHehpA4ebZXiDnnmlceF38fyoTA0eFMzNaGYlatQ/hstn_map_rec.PNG?psid=1" alt="Drawing" style="width: 700px;"/>

>__NEW YORK CITY__ COFFEE AREAS

In [None]:
n_map_coffe_cluster = folium.Map(location=[n_lat, n_lng], zoom_start=11)
for lat, lng, label in zip(df_coffe_cluster_n['venue.location.lat'], df_coffe_cluster_n['venue.location.lng'], df_coffe_cluster_n['ClusterNumber']):
    folium.CircleMarker(
        [lat, lng],
        radius=25,
        popup=label,
        tooltip=label,
        fill=True,
        color='red',
        fill_color='red',
        fill_opacity=0.5
        ).add_to(n_map_coffe_cluster)

# GitHub doesn't render Folium maps.
# Uncomment the line below to display the map on a platform that supports it.
# n_map_coffe_cluster

> Apparently GitHub doesn't render Folium maps, so here is a picture of the NYC COFFEE map.

><img src="https://li5k7g.by.files.1drv.com/y4mLRgtb0YKhjg03tOUO5EdheOYY-9FHU67LfqwYvK_FmiGiMB0RKUfvl2ZXuAZBUp8XR-r4Whlusb5FSsviz2pgwAxZhpDm16RTZot6Bmq6y3VUvsRLFo0t8eyW_D-eHHz0JevXsZrfVdFybgnG_3MNv77llXAjalaPpnzmmJ2YQWMcALV4giOAOmigdvs2o28L1J4Ss7yuAQnBz9r6-2Pjg/nyc_map_rec.PNG?psid=1" alt="Drawing" style="width: 700px;"/>

## 6. Discussion

We found that all of the cities reviewed in the project have trending coffee shop areas. Three of the four cities have exactly 4 coffee clusters, and only Houston has 1. Perhaps, at the time of the research, it was hot enough for the people of Houston to stay away from hot drinks like coffee =) In each of the other cities, 1 of the 4 clusters tends to fall in the so-called downtown area. Perhaps those are the areas with an in-office workforce.

The other 3 clusters can be characterized as mainly residential. Perhaps people work from home in those areas and spend at nearby venues, including coffee shops.
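
The downtown versus non-downtown split above was read off the maps by eye. A rough numeric check could look like the sketch below, which measures the great-circle (haversine) distance from each Toronto coffee cluster center to a city-center reference point. The reference point and the 2 km cutoff are assumptions chosen for illustration, not values from the notebook, and distance alone does not show that an area is residential.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Toronto coffee cluster centers from the table in section 5.
toronto_centers = {
    0: (43.651413, -79.382973),
    9: (43.680916, -79.415979),
    11: (43.642784, -79.422971),
    12: (43.738982, -79.588984),
}

# ASSUMPTION: approximate downtown Toronto reference point.
CITY_CENTER = (43.6532, -79.3832)
# ASSUMPTION: hypothetical cutoff for calling a cluster "downtown".
DOWNTOWN_RADIUS_KM = 2.0

labels = {}
for cluster, (lat, lng) in toronto_centers.items():
    distance = haversine_km(lat, lng, *CITY_CENTER)
    labels[cluster] = "downtown" if distance <= DOWNTOWN_RADIUS_KM else "outlying"
    print(f"Cluster {cluster}: {distance:.1f} km from center -> {labels[cluster]}")
```

Under these assumptions only cluster 0 falls inside the downtown radius, which agrees with the 1-of-4 pattern described above for Toronto.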

Depending on how the current health crisis develops, working from home could become a long-term trend. Therefore, residential areas might be a meaningful target for new venues like coffee shops.

## 7. Conclusion

The downtown coffee areas might make sense if they don't already have strong competition.

The residential areas might offer a better competitive outlook. However, there may be a shortage of suitable real estate at those locations. Many people tend to use services within a 10-minute drive of their home or at a nearby local shopping plaza, which might be a good option in our case.