# Coursera Battle of the neighbourhoods 
## Opening a new Jazz Club in Manhattan




Author: Mark Chinnock   
Date: June 2020



### Content

1.	[Introduction](#introduction)
	
    1.1 [Discussion of the "background situation"](#background)
    
    1.2 [Problem](#problem)
    
    1.3 [Audience](#audience)
    
    
2.	[Data](#data)
3. [Methodology](#methodology)
4. [Results](#results)
5. [Discussion](#discussion)
6. [Conclusion](#conclusion)
    
  

# 1. Introduction <a name="introduction"></a>
I am a successful Jazz musician wanting to move to New York City and open a new venue of my own in Manhattan.  I want to locate my venue away from other clubs but in a similar neighborhood.  I have a limited knowledge of the neighborhoods.

## 1.1 Background <a name="background"></a>
Where is the best location to open a new Jazz Club in Manhattan?  The approach is to find the neighborhoods that already host Jazz venues, then look for similar neighborhoods that do not yet have a local Jazz Club.

## 1.2 Problem <a name="problem"></a>
The aim is to identify which neighborhood(s) are the best match for a location, by: 
- Finding where existing Jazz Clubs are currently located 
- Discovering what type of neighborhood they are in
- Finding similar neighborhoods that do not currently have a Jazz venue 

## 1.3 Audience <a name="audience"></a>
The target audience for this would be anyone considering moving to Manhattan without detailed knowledge of the area.  This report is a specific study on Jazz Club categories but this approach and the search criteria could easily be amended to repeat the exercise for a different category of venue.


# 2. Data <a name="data"></a>

To investigate this problem we will need the following datasets of information:
* List of neighborhoods in Manhattan
* Location of existing Jazz Clubs and other venues


Using the neighborhood data from the dataset previously provided on this course (https://cocl.us/new_york_dataset) we can obtain the geo coordinates of the Manhattan neighborhoods and store the following information:  


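|     | Borough   | Neighborhood       | Latitude  | Longitude  |
|-----|-----------|--------------------|-----------|------------|
| 6   | Manhattan | Marble Hill        | 40.876551 | -73.910660 |
| 100 | Manhattan | Chinatown          | 40.715618 | -73.994279 |
| 101 | Manhattan | Washington Heights | 40.851903 | -73.936900 |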

This data provides:
1. Borough - the borough name, e.g. Manhattan
2. Neighborhood - the area within the borough, e.g. Marble Hill
3. Latitude - latitude coordinate for mapping
4. Longitude - longitude coordinate for mapping


Additionally, we can retrieve population information for each neighborhood from the NYC Open Data portal: https://opendata.cityofnewyork.us/

This data can be read in using geopandas, which lets us perform various mapping functions against the dataframe that may help with the investigation.

The neighborhoods will be compared using the venue data retrieved from Foursquare to produce clusters of similar neighborhoods.  The Jazz Club venues will then be placed in those clusters to see whether venues such as Jazz Clubs tend to fall in a particular cluster, which would help form a recommendation.

### Import the necessary libraries

In [2]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

# !conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

import geopandas as gpd

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
import json    

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
    
# transforming json file into a pandas dataframe library
from pandas import json_normalize

# !conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library
from folium import Choropleth, Marker
from folium.plugins import HeatMap, MarkerCluster

import math
from area import area


print('Folium installed')
print('Libraries imported.')

def embed_map(m, file_name):
    from IPython.display import IFrame
    m.save(file_name)
    return IFrame(file_name, width='100%', height='500px')

Folium installed
Libraries imported.


### Foursquare

In [3]:
CLIENT_ID = 'IZ3OANXNNHAGM2NO1FBTZ4WWGRLJPQORTA1EC4LEY3GTWJGO' # your Foursquare ID
CLIENT_SECRET = 'A3DOWBNWOGPZKSVYP04IY2UE1FSFBQNIDSKGDLRYG5XVQAD1' # your Foursquare Secret
VERSION = '20200604'
LIMIT = 100
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: IZ3OANXNNHAGM2NO1FBTZ4WWGRLJPQORTA1EC4LEY3GTWJGO
CLIENT_SECRET:A3DOWBNWOGPZKSVYP04IY2UE1FSFBQNIDSKGDLRYG5XVQAD1


We can explore the Foursquare database by calling its API with the Manhattan coordinates and retrieving the existing venues in the area, including Jazz Clubs.

In [4]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(latitude, longitude)

40.7896239 -73.9598939


Search specifically for the existing Jazz Clubs.

In [5]:
search_query = 'Jazz Club'
# removed radius as there aren't any Jazz clubs in centre of Manhattan!
# radius = 500
print(search_query + ' .... OK!')

Jazz Club .... OK!


Define and submit the URL to the Foursquare API.

In [7]:
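# Note: the URL template has no radius or limit placeholder, so neither is sent
# and the API defaults apply (LIMIT was previously passed but silently ignored)
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    latitude, 
    longitude, 
    VERSION, 
    search_query
    )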

results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5ee0e348885df533641a7719'},
 'response': {'venues': [{'id': '54ed2952498ec36f92b69d0c',
    'name': "Dizzy's Jazz Club",
    'location': {'address': '10 Columbus Cir',
     'lat': 40.768764,
     'lng': -73.982944,
     'labeledLatLngs': [{'label': 'display',
       'lat': 40.768764,
       'lng': -73.982944}],
     'distance': 3027,
     'postalCode': '10019',
     'cc': 'US',
     'neighborhood': "Hell's Kitchen",
     'city': 'New York',
     'state': 'NY',
     'country': 'United States',
     'formattedAddress': ['10 Columbus Cir',
      'New York, NY 10019',
      'United States']},
    'categories': [{'id': '4bf58dd8d48988d1e7931735',
      'name': 'Jazz Club',
      'pluralName': 'Jazz Clubs',
      'shortName': 'Jazz Club',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/musicvenue_jazzclub_',
       'suffix': '.png'},
      'primary': True}],
    'referralId': 'v-1591796913',
    'hasPerk': False},
   {'

From this we can extract the following useful features:
1. venue name
2. venue location
3. venue address
4. distance from search coordinates
...
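
The fields listed above can be pulled out of the raw response with a few dictionary lookups. The sketch below is illustrative rather than the notebook's own method: `extract_venue_fields` is a hypothetical helper, and `sample` is a cut-down copy of the first venue in the response shown above.

```python
def extract_venue_fields(response):
    """Return one flat dict per venue from a Foursquare-style search response."""
    rows = []
    for venue in response.get('response', {}).get('venues', []):
        location = venue.get('location', {})
        categories = venue.get('categories', [])
        rows.append({
            'name': venue.get('name'),
            'lat': location.get('lat'),
            'lng': location.get('lng'),
            'address': location.get('address'),    # missing for some venues
            'distance': location.get('distance'),  # metres from the search coordinates
            'category': categories[0]['name'] if categories else None,
        })
    return rows


# Cut-down copy of the first venue from the response above
sample = {
    'meta': {'code': 200},
    'response': {'venues': [{
        'name': "Dizzy's Jazz Club",
        'location': {'address': '10 Columbus Cir', 'lat': 40.768764,
                     'lng': -73.982944, 'distance': 3027},
        'categories': [{'name': 'Jazz Club', 'primary': True}],
    }]},
}

rows = extract_venue_fields(sample)
print(rows[0]['name'], rows[0]['distance'])
```

Using `.get()` throughout means a venue with no address or no category yields `None` instead of raising an error.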

Unlike the `explore` endpoint used in the Foursquare lab, where the information is in the `items` key, the `search` endpoint returns its results under the `venues` key. Before we proceed, let's borrow the `get_category_type` function from the Foursquare lab.


In [8]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # rows built from the explore endpoint nest the field under 'venue.'
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [9]:
# assign relevant part of JSON to venues
venues = results['response']['venues']

# transform venues into a dataframe
  
nearby_venues = json_normalize(venues) # flatten JSON


# filter the category for each row
nearby_venues['categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean the column names, keeping only the part after the last dot
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head(5)

Unnamed: 0,id,name,categories,referralId,hasPerk,address,lat,lng,labeledLatLngs,distance,postalCode,cc,neighborhood,city,state,country,formattedAddress,crossStreet,id.1
0,54ed2952498ec36f92b69d0c,Dizzy's Jazz Club,Jazz Club,v-1591796913,False,10 Columbus Cir,40.768764,-73.982944,"[{'label': 'display', 'lat': 40.768764, 'lng':...",3027,10019.0,US,Hell's Kitchen,New York,NY,United States,"[10 Columbus Cir, New York, NY 10019, United S...",,
1,4aca81a1f964a52028c220e3,Cecil's Jazz Club & Restaurant,Jazz Club,v-1591796913,False,364 Valley Rd,40.774516,-74.239879,"[{'label': 'display', 'lat': 40.77451572910885...",23659,7052.0,US,,West Orange,NJ,United States,"[364 Valley Rd, West Orange, NJ 07052, United ...",,
2,52195a0211d247f3c3341b9d,Scat Jazz Club,Lounge,v-1591796913,False,,40.812489,-73.941215,"[{'label': 'display', 'lat': 40.81248873725435...",2992,,US,,New York,NY,United States,"[New York, NY, United States]",,
3,55a5818d498ef469b908d8e2,Cassandra's Jazz Club,Lounge,v-1591796913,False,,40.813977,-73.94463,"[{'label': 'display', 'lat': 40.81397710077426...",3000,,US,,New York,NY,United States,"[New York, NY, United States]",,
4,4ecca0d7c2ee2da025cee7ac,The Wood Jazz Club,Jazz Club,v-1591796913,False,Netherwood Cir,40.592676,-74.352658,"[{'label': 'display', 'lat': 40.59267589003585...",39745,8820.0,US,,Edison,NJ,United States,"[Netherwood Cir, Edison, NJ 08820, United States]",,
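
The table shows that a text search for 'Jazz Club' also returns venues in New Jersey and venues whose primary category is 'Lounge'. A minimal sketch of narrowing the results follows. It uses plain Python dictionaries rather than the `nearby_venues` dataframe so that it runs on its own, and `keep_city_and_category` is a hypothetical helper name. The rows are copied from the table above.

```python
def keep_city_and_category(rows, city, category=None):
    """Keep rows in the given city and, optionally, with the given category."""
    kept = []
    for row in rows:
        if row['city'] != city:
            continue
        if category is not None and row['categories'] != category:
            continue
        kept.append(row)
    return kept


rows = [
    {'name': "Dizzy's Jazz Club", 'categories': 'Jazz Club', 'city': 'New York'},
    {'name': "Cecil's Jazz Club & Restaurant", 'categories': 'Jazz Club', 'city': 'West Orange'},
    {'name': 'Scat Jazz Club', 'categories': 'Lounge', 'city': 'New York'},
    {'name': "Cassandra's Jazz Club", 'categories': 'Lounge', 'city': 'New York'},
    {'name': 'The Wood Jazz Club', 'categories': 'Jazz Club', 'city': 'Edison'},
]

in_new_york = keep_city_and_category(rows, 'New York')
strict = keep_city_and_category(rows, 'New York', category='Jazz Club')
print([r['name'] for r in in_new_york])
print([r['name'] for r in strict])
```

Filtering by city alone matches what the mapping cells in this notebook do. Whether to also require the 'Jazz Club' category is a judgment call, since it would drop venues such as Scat Jazz Club.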


OK, so we've got our clubs.  Let's get some population data on Manhattan and a GeoJSON file of the neighborhood shapes so we can map them.  We can easily manipulate geo data from various file types using geopandas.

In [19]:
# read the neighborhood population data into a DataFrame and load the GeoJSON data
df = pd.read_csv('../nycvisualization/New_York_City_Population_By_Neighborhood_Tabulation_Areas.csv')
# nycmap = json.load(open("nyc_neighborhoods.geojson"))
nycmap = gpd.read_file("../nycvisualization/nyc_neighborhoods.geojson")

# align the column name for NTA code
df.rename(columns={'NTA Code':'ntacode'}, inplace=True)


In [20]:
# just want the Manhattan boro data
df = df[df['Borough']=='Manhattan']
nycmap = nycmap[nycmap['boro_name']=='Manhattan'].set_index('ntacode')


plot_dict = pd.DataFrame(df.groupby(['ntacode'])['Population'].max())
nycmap.head()

ntacode,shape_area,county_fips,ntaname,shape_leng,boro_name,boro_code,geometry
MN06,10647077.5264,61,Manhattanville,17040.6854129,Manhattan,1,"MULTIPOLYGON (((-73.94608 40.82126, -73.94640 ..."
MN15,18362149.2813,61,Clinton,34481.6287742,Manhattan,1,"MULTIPOLYGON (((-73.99383 40.77293, -73.99379 ..."
MN27,14501868.1603,61,Chinatown,20786.2561105,Manhattan,1,"MULTIPOLYGON (((-73.98382 40.72147, -73.98386 ..."
MN25,19014298.8996,61,Battery Park City-Lower Manhattan,43738.4962191,Manhattan,1,"MULTIPOLYGON (((-74.00078 40.69429, -74.00096 ..."
MN14,15805586.3147,61,Lincoln Square,19869.9083199,Manhattan,1,"MULTIPOLYGON (((-73.97500 40.77753, -73.97546 ..."
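
The population maps in this notebook are coloured by raw population. If population density is wanted instead, it could be derived from `Population` and `shape_area`. The sketch below assumes `shape_area` is in square feet (an assumption about this dataset that the notebook does not confirm), and the population figure is a made-up placeholder, not a value from the NYC data.

```python
SQ_FT_PER_SQ_MILE = 5280 ** 2  # 27,878,400 square feet in one square mile


def population_density(population, area_sq_ft):
    """People per square mile, given an area in square feet."""
    if area_sq_ft <= 0:
        raise ValueError('area must be positive')
    return population / (area_sq_ft / SQ_FT_PER_SQ_MILE)


# shape_area for Chinatown (MN27) from the table above; the population is a placeholder
chinatown_area = 14501868.1603
placeholder_population = 50000
density = population_density(placeholder_population, chinatown_area)
print(round(density))
```

In the notebook the same division could be applied column-wise after joining the population table to `nycmap` on `ntacode`.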


Let's map the population of each neighborhood onto a choropleth, and add the locations of our Jazz Clubs.

In [21]:
# Create a base map
m_1 = folium.Map(location=[latitude, longitude], tiles='cartodbpositron', zoom_start=11)

# Add a choropleth map to the base map
Choropleth(geo_data=nycmap.__geo_interface__, 
           data=plot_dict['Population'], 
           key_on="feature.id", 
           fill_color='YlGnBu', 
           legend_name='Population of each Neighborhood as of 2010',
           hover_name='ntaname'
          ).add_to(m_1)

# Add a marker for each jazz club
for idx, row in nearby_venues[nearby_venues['city']=='New York'].iterrows():
    Marker([row['lat'], row['lng']], popup=row['name']).add_to(m_1)


# Display the map
m_1

In [23]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


In [24]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [26]:
neighborhoods_data = newyork_data['features']

Transform the data into a pandas dataframe


In [28]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)
neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


In [29]:
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON coordinates are ordered (longitude, latitude)
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
neighborhoods = pd.DataFrame(rows, columns=column_names)

We're only interested in the Manhattan neighborhoods, so create a `manhattan` dataframe containing only the Manhattan borough.

In [30]:
manhattan = neighborhoods.loc[neighborhoods['Borough']=='Manhattan']
manhattan.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
6,Manhattan,Marble Hill,40.876551,-73.91066
100,Manhattan,Chinatown,40.715618,-73.994279
101,Manhattan,Washington Heights,40.851903,-73.9369
102,Manhattan,Inwood,40.867684,-73.92121
103,Manhattan,Hamilton Heights,40.823604,-73.949688


Create a map of New York showing the Manhattan neighborhoods and the positions of the existing Jazz Clubs, superimposed on a choropleth map coloured by the population of each neighborhood.

    Blue circles = geographic centre of Neighborhoods
    Marker = Jazz Club location



In [48]:
# Create a base map
m_1 = folium.Map(location=[latitude, longitude], tiles='cartodbpositron', zoom_start=11)

# Add a choropleth map to the base map
Choropleth(geo_data=nycmap.__geo_interface__, 
           data=plot_dict['Population'], 
           key_on="feature.id", 
           fill_color='YlGnBu', 
           legend_name='Population of each Neighborhood as of 2010',
           hover_name='ntaname'
          ).add_to(m_1)

# add markers to map
for lat, lng, borough, label in zip(manhattan['Latitude'], manhattan['Longitude'], manhattan['Borough'], manhattan['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(m_1)  

# Add a marker for each jazz club
for idx, row in nearby_venues[nearby_venues['city']=='New York'].iterrows():
    Marker([row['lat'], row['lng']], popup=row['name']).add_to(m_1)


# Display the map
m_1

Explore the neighborhoods of Manhattan.


In [32]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [33]:
venues = getNearbyVenues(names=manhattan['Neighborhood'],
                                   latitudes=manhattan['Latitude'],
                                   longitudes=manhattan['Longitude']
                                  )


Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards


Which neighborhoods have the current Jazz Clubs?

In [34]:
# print(venues.shape)
venues[venues['Venue Category']=='Jazz Club']['Neighborhood']

378        Central Harlem
471       Upper East Side
1341    Greenwich Village
1503         East Village
1883         West Village
1897         West Village
1912         West Village
1940         West Village
Name: Neighborhood, dtype: object
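
Counting the rows above gives the number of Jazz Clubs found per neighborhood. A minimal sketch, with the neighborhood names copied from the output; in the notebook the same tally could be obtained by calling `value_counts()` on the filtered column.

```python
from collections import Counter

# Neighborhood of each Jazz Club row, copied from the output above
jazz_neighborhoods = [
    'Central Harlem', 'Upper East Side', 'Greenwich Village', 'East Village',
    'West Village', 'West Village', 'West Village', 'West Village',
]

clubs_per_neighborhood = Counter(jazz_neighborhoods)
for name, count in clubs_per_neighborhood.most_common():
    print(name, count)
```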

We now have the Jazz Clubs that Foursquare returned for Manhattan, the neighborhoods they are in, and the population of each neighborhood.

This concludes the data gathering phase.


# 3. Methodology <a name="methodology"></a>

In this project we direct our efforts at finding existing Jazz Clubs in the Manhattan borough of New York City, identifying the neighborhoods where they reside, and then looking for neighborhoods with a similar profile, based on other venue types and population, but with fewer Jazz Clubs (or none!).

In the first step we collected the required data: **the location and type (category) of the Jazz Clubs returned by a Foursquare search centred on Manhattan.** No radius was applied to that search, so results outside New York are filtered out by city when mapping. We also **retrieved the venues around each Manhattan neighborhood centre** currently held in the Foursquare database.

The second step in our analysis is calculating and exploring **'Jazz Club density'** across the different neighborhoods of Manhattan. We use **geodata** and **choropleth maps** to show how population varies by neighborhood (population density being population count / neighborhood area), and compare which neighborhoods are similar but without Jazz Clubs.

In the third and final step we create clusters of the neighborhoods (using k-means clustering) to identify general zones / neighborhoods / addresses which should be a starting point for a final 'street level' exploration and search for the optimal venue location by stakeholders.


Start preparing the data for k-means analysis

In [35]:
# one hot encoding
manhattan_onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
manhattan_onehot['Neighborhood'] = venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

manhattan_onehot.sample(10)

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,...,Video Store,Vietnamese Restaurant,Volleyball Court,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
1666,Little Italy,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
185,Washington Heights,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1852,Soho,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1876,West Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
914,Lincoln Square,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
945,Lincoln Square,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1180,Murray Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2964,Stuyvesant Town,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1427,Greenwich Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
422,East Harlem,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [36]:
manhattan_onehot.shape

(3121, 332)

In [37]:
manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()
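
Taking the mean of the one-hot columns per neighborhood gives the share of that neighborhood's venues falling in each category. The sketch below reproduces that idea with the standard library only. The venue list is a small invented sample, not data from the notebook, and `category_shares` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# (neighborhood, venue category) pairs; invented for illustration
venues_sample = [
    ('A', 'Bar'), ('A', 'Bar'), ('A', 'Jazz Club'), ('A', 'Café'),
    ('B', 'Park'), ('B', 'Café'),
]


def category_shares(pairs):
    """Map each neighborhood to {category: share of its venues}."""
    counts = defaultdict(Counter)
    for neighborhood, category in pairs:
        counts[neighborhood][category] += 1
    shares = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        shares[neighborhood] = {cat: n / total for cat, n in counter.items()}
    return shares


shares = category_shares(venues_sample)
print(shares['A']['Bar'])  # 2 of A's 4 venues are bars -> 0.5
```

Because each row is a share, a neighborhood with few venues can show a large frequency for a single category, which is worth bearing in mind when comparing neighborhoods.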

Let's focus on neighborhoods with existing Jazz Clubs.

In [38]:
jazz_grouped = manhattan_grouped[manhattan_grouped['Jazz Club']>0]

In [39]:
num_top_venues = 5

for hood in jazz_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = jazz_grouped[jazz_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Central Harlem----
                  venue  freq
0    African Restaurant  0.07
1     French Restaurant  0.04
2   American Restaurant  0.04
3  Gym / Fitness Center  0.04
4                   Bar  0.04


----East Village----
                venue  freq
0                 Bar  0.05
1  Mexican Restaurant  0.05
2        Cocktail Bar  0.04
3         Coffee Shop  0.04
4      Ice Cream Shop  0.03


----Greenwich Village----
                  venue  freq
0    Italian Restaurant  0.10
1      Sushi Restaurant  0.05
2                  Café  0.05
3  Caribbean Restaurant  0.02
4          Dessert Shop  0.02


----Upper East Side----
                  venue  freq
0    Italian Restaurant  0.08
1           Coffee Shop  0.06
2                Bakery  0.04
3  Gym / Fitness Center  0.04
4           Yoga Studio  0.03


----West Village----
                 venue  freq
0   Italian Restaurant  0.07
1             Wine Bar  0.05
2  American Restaurant  0.05
3         Cocktail Bar  0.04
4            Jazz Club  

In [40]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

What are the top 10 venue categories for each neighborhood?

In [41]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # beyond 3rd, every ordinal up to 10th takes the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']

for ind in np.arange(manhattan_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Battery Park City,Park,Hotel,Coffee Shop,Gym,Memorial Site,Playground,Gourmet Shop,Food Court,Mexican Restaurant,Shopping Mall
1,Carnegie Hill,Coffee Shop,Café,Yoga Studio,Bookstore,Gym / Fitness Center,Gym,Italian Restaurant,Pizza Place,Wine Shop,Vietnamese Restaurant
2,Central Harlem,African Restaurant,Chinese Restaurant,Seafood Restaurant,Bar,French Restaurant,Gym / Fitness Center,American Restaurant,Park,Cafeteria,Library
3,Chelsea,Coffee Shop,Art Gallery,Ice Cream Shop,Café,Bakery,American Restaurant,Cocktail Bar,Theater,Italian Restaurant,Bar
4,Chinatown,Chinese Restaurant,Bakery,Cocktail Bar,Bubble Tea Shop,Spa,Bar,Ice Cream Shop,Coffee Shop,American Restaurant,Optical Shop


We have a clear indication which neighborhoods currently have Jazz Clubs and what the most popular venues are within those neighborhoods.

Let us now cluster all the neighborhoods and see which ones k-means clusters together.  This will help decide which other neighborhoods are similar to our identified existing "Jazz" neighborhoods.
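
k-means groups rows by Euclidean distance between their feature vectors, so two neighborhoods end up together when their category shares are close. A small sketch of that distance follows. It uses only the top-five frequencies printed earlier for Greenwich Village and Upper East Side and treats every other category as zero, which is a simplification, so the resulting number is illustrative only.

```python
import math


def euclidean_distance(a, b):
    """Distance between two sparse vectors stored as {feature: value} dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


# Top-five venue frequencies copied from the printed output above
greenwich_village = {'Italian Restaurant': 0.10, 'Sushi Restaurant': 0.05,
                     'Café': 0.05, 'Caribbean Restaurant': 0.02,
                     'Dessert Shop': 0.02}
upper_east_side = {'Italian Restaurant': 0.08, 'Coffee Shop': 0.06,
                   'Bakery': 0.04, 'Gym / Fitness Center': 0.04,
                   'Yoga Studio': 0.03}

print(round(euclidean_distance(greenwich_village, upper_east_side), 3))
```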

In [42]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

manhattan_grouped_clustering = manhattan_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 0, 0, 0, 0, 1, 1, 3, 0, 1], dtype=int32)

In [43]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

manhattan_merged = manhattan

# join the cluster labels and top venues onto manhattan, which holds latitude/longitude for each neighborhood
manhattan_merged = manhattan_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

manhattan_merged.head() # check the last columns!

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Manhattan,Marble Hill,40.876551,-73.91066,4,Sandwich Place,Gym,Coffee Shop,Yoga Studio,Pharmacy,Supplement Shop,Steakhouse,Seafood Restaurant,Pizza Place,Deli / Bodega
100,Manhattan,Chinatown,40.715618,-73.994279,0,Chinese Restaurant,Bakery,Cocktail Bar,Bubble Tea Shop,Spa,Bar,Ice Cream Shop,Coffee Shop,American Restaurant,Optical Shop
101,Manhattan,Washington Heights,40.851903,-73.9369,3,Café,Bakery,Mobile Phone Shop,Mexican Restaurant,Donut Shop,Latin American Restaurant,Supermarket,Tapas Restaurant,Sandwich Place,Bank
102,Manhattan,Inwood,40.867684,-73.92121,3,Lounge,Mexican Restaurant,Restaurant,Bakery,Café,Frozen Yogurt Shop,Spanish Restaurant,Caribbean Restaurant,Chinese Restaurant,Park
103,Manhattan,Hamilton Heights,40.823604,-73.949688,3,Pizza Place,Coffee Shop,Deli / Bodega,Café,Mexican Restaurant,Sandwich Place,Sushi Restaurant,Cocktail Bar,Bakery,Yoga Studio


Which cluster label(s) do the neighborhoods with existing Jazz Clubs have?

In [44]:
manhattan_merged[manhattan_merged['Neighborhood'].isin(['Central Harlem','Upper East Side','Greenwich Village','East Village','West Village'])]

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
105,Manhattan,Central Harlem,40.815976,-73.943211,0,African Restaurant,Chinese Restaurant,Seafood Restaurant,Bar,French Restaurant,Gym / Fitness Center,American Restaurant,Park,Cafeteria,Library
107,Manhattan,Upper East Side,40.775639,-73.960508,0,Italian Restaurant,Coffee Shop,Gym / Fitness Center,Bakery,French Restaurant,Spa,Juice Bar,Yoga Studio,American Restaurant,Wine Shop
117,Manhattan,Greenwich Village,40.726933,-73.999914,0,Italian Restaurant,Café,Sushi Restaurant,Bar,Dessert Shop,Seafood Restaurant,Caribbean Restaurant,Sandwich Place,Spa,Chinese Restaurant
118,Manhattan,East Village,40.727847,-73.982226,0,Bar,Mexican Restaurant,Cocktail Bar,Coffee Shop,Pizza Place,Speakeasy,Wine Bar,Juice Bar,Ice Cream Shop,Seafood Restaurant
123,Manhattan,West Village,40.734434,-74.00618,0,Italian Restaurant,Wine Bar,American Restaurant,Pizza Place,Park,Jazz Club,Cocktail Bar,New American Restaurant,Bakery,Coffee Shop


OK, all five neighborhoods with an existing Jazz Club fall in cluster 0.
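
The candidate neighborhoods are then those in cluster 0 that have no Jazz Club. A sketch of that filter follows, using only the ten rows of `manhattan_merged` displayed above, so the result is partial and not the full list of candidates. The `candidates` helper is a hypothetical name.

```python
# Cluster labels copied from the rows of manhattan_merged shown above
cluster_labels = {
    'Marble Hill': 4, 'Chinatown': 0, 'Washington Heights': 3, 'Inwood': 3,
    'Hamilton Heights': 3, 'Central Harlem': 0, 'Upper East Side': 0,
    'Greenwich Village': 0, 'East Village': 0, 'West Village': 0,
}
jazz_neighborhoods = {'Central Harlem', 'Upper East Side', 'Greenwich Village',
                      'East Village', 'West Village'}


def candidates(labels, with_jazz, target_cluster=0):
    """Neighborhoods in the target cluster that have no existing Jazz Club."""
    return sorted(name for name, label in labels.items()
                  if label == target_cluster and name not in with_jazz)


print(candidates(cluster_labels, jazz_neighborhoods))  # ['Chinatown']
```

In the notebook the equivalent would be to filter `manhattan_merged` on `Cluster Labels` and exclude the neighborhoods found to contain a Jazz Club.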

Finally, let's map the clustered neighborhoods on top of the map showing the population density and existing Jazz Clubs.


In [51]:
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add cluster markers to the existing map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_merged['Latitude'], manhattan_merged['Longitude'], manhattan_merged['Neighborhood'], manhattan_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=10,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.5).add_to(m_1)

# Display the map
m_1

In [45]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_merged['Latitude'], manhattan_merged['Longitude'], manhattan_merged['Neighborhood'], manhattan_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=20,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.5).add_to(map_clusters)
    
for idx, row in nearby_venues[nearby_venues['city']=='New York'].iterrows():
    Marker([row['lat'], row['lng']]).add_to(map_clusters)    
       
# Show the map
embed_map(map_clusters, 'map_clusters.html')

# 4. Results <a name="results"></a>

The analysis shows that there are Jazz Club venues spread all around Manhattan with more centred around Manhattan Valley, near West 106th Street, and 3 near Central Harlem, around West 131-133rd Street.  These are both towards the northern end of Manhattan.

All of the neighborhoods with existing Jazz Clubs were assigned to cluster 0.

There are several cluster 0 neighborhoods with similar population density without an existing Jazz Club and these are 

# 5. Discussion <a name="discussion"></a>

On the surface, cluster 0 neighborhoods appear to be good candidates for locating a Jazz Club.  However, there could be other reasons why they have not been chosen for venues before.  Perhaps it would be better to locate a specialist club such as a Jazz venue near other existing clubs, where there may be an established ambiance.
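
Either way, distance to the nearest existing club is one measurable input to that decision. A minimal sketch using the haversine formula follows. The Earth radius of 6371 km is the usual mean-radius approximation, and the coordinates are the Central Harlem centre and the Scat Jazz Club location from the tables above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


central_harlem = (40.815976, -73.943211)
scat_jazz_club = (40.812489, -73.941215)
print(round(haversine_km(*central_harlem, *scat_jazz_club), 2))
```

Applying the function to every neighborhood centre and every club, then taking the minimum per neighborhood, would give a simple "distance to nearest Jazz Club" column to weigh alongside the cluster labels.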

# 6. Conclusion <a name="conclusion"></a>

The purpose of this project was to identify a potential Jazz Club location in Manhattan in a neighborhood where there wasn't currently a Jazz Club.  Additional information on the characteristics of neighborhoods, population and the clustering of neighborhoods with similar venue types was also used to provide a scientific approach to the research.  

The final decision on where to locate a new Jazz Club would still require additional information and research, taking into account the general atmosphere of a location, closeness to major roads, rental prices, whether tourism numbers have an effect on such a venue, etc.