# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>

In this project we try to find an optimal location for a wellness and fitness center: a venue that offers a gym, a spa and yoga classes. This report is aimed at stakeholders interested in opening such a center in **Manhattan, New York, US**.

Since New York already has many gyms, yoga centers and spas, we will look for **locations that are moderately crowded with these services**. Our assumption is that a high concentration of gyms, spas or yoga centers in an area reflects real demand, so we focus on such areas while skipping the most crowded ones.

We will use data science techniques to shortlist the most promising neighborhoods based on this criterion. The advantages of each area are then described so that stakeholders can choose the best final location.

## Data <a name="data"></a>

Based on the definition of our problem, the factor that will influence our decision is:
* the number of existing gyms, spas and yoga centers in the neighborhood

We will use a dataset that lists the 5 boroughs and 306 neighborhoods of New York City. It is freely available at https://geo.nyu.edu/catalog/nyu_2451_34572.
The following data sources are needed to extract or generate the required information:
* the number, type and location of gyms, spas and yoga centers in every neighborhood, obtained using the **Foursquare API**
* the coordinates of Manhattan, obtained using the **Geopy library**

Before we get the data and start exploring it, let's install and import the dependencies we will need.

In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files


!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

# All requested packages already installed.

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    folium-0.5.0               |             py_0          45 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    branca-0.3.1               |             py_0          25 KB  conda-forge
    altair-3.3.0               |           py36_0         747 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         846 KB

The following NEW packages will be INSTALLED:

    altair:  3.3.0-py36_0 conda-forge
    branca:  0.3.1-py_0   conda-forge
    folium:  0.5.0-py_0   conda-forge
    vincent: 0.4.4-py_1   conda-forg

To download the data, we can simply run a `wget` command.

In [4]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


Let's load the data

In [5]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [6]:
neighborhoods_data = newyork_data['features']

Transforming the data into a pandas dataframe

In [7]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Then let's loop through the data and fill the dataframe one row at a time.

In [9]:
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)
neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585
5,Bronx,Kingsbridge,40.881687,-73.902818
6,Manhattan,Marble Hill,40.876551,-73.91066
7,Bronx,Woodlawn,40.898273,-73.867315
8,Bronx,Norwood,40.877224,-73.879391
9,Bronx,Williamsbridge,40.881039,-73.857446


### Neighborhood Candidates

Since we are only interested in Manhattan, let's slice the original dataframe to create a new dataframe with the Manhattan data, and then get the geographical coordinates of Manhattan.


In [12]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


In [13]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7896239, -73.9598939.


Let's visualize Manhattan and the neighborhoods in it.

In [14]:
# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

### Foursquare
Now that we have our location candidates, let's use the Foursquare API to get information on gyms, spas and yoga centers in each neighborhood.

We will include in our list only venues that have 'gym', 'spa' or 'yoga' in their name.

Foursquare credentials are defined in the cell below.

In [15]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET: YOUR_FOURSQUARE_CLIENT_SECRET
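
Credentials written into a notebook cell end up in every shared copy of the notebook. One possible way to avoid this, sketched below, is to read them from environment variables and to print only a masked form. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, and the helpers `load_credentials` and `mask`, are assumptions made for this example.

```python
import os


def load_credentials(env=os.environ):
    """Read Foursquare credentials from an environment-style mapping.

    FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are hypothetical
    variable names chosen for this sketch.
    """
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Missing environment variable: {}'.format(missing)) from None


def mask(secret, visible=4):
    """Keep only the first few characters so printed output never shows the full value."""
    return secret[:visible] + '*' * max(len(secret) - visible, 0)


# Demonstration with a stand-in mapping instead of the real environment.
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id-123',
            'FOURSQUARE_CLIENT_SECRET': 'demo-secret-456'}
client_id, client_secret = load_credentials(demo_env)
print('CLIENT_ID: ' + mask(client_id))
```

In a real session the call would simply be `load_credentials()`, with the two variables exported in the shell before the notebook is started.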


Let's create a function to get all the 'gyms' in all the neighborhoods in Manhattan

In [16]:
def getgym(names, latitudes, longitudes, radius=500, LIMIT=100):
    query='gym'
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, lat, lng, VERSION, query, radius, LIMIT)
            

        # make the GET request
        results = requests.get(url).json()["response"]['venues']
        
        venues_list.append([( name, 
            lat, 
            lng, v['name'], 
            v['location']['lat'], 
            v['location']['lng'],
                'Gym' if len(v['categories'])==0 else v['categories'][0]['name']) for v in results])
    
    
    nearby_gym = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_gym.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return(nearby_gym)
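
The same search is needed for several queries, so the parts that do not depend on the network can be factored out and checked in isolation. The sketch below separates URL construction from flattening the response. The helper names `build_search_url` and `rows_from_venues` and the sample response are assumptions made for illustration; the endpoint and parameters are the ones used in `getgym`.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = 'https://api.foursquare.com/v2/venues/search'


def build_search_url(client_id, client_secret, version, lat, lng, query,
                     radius=500, limit=100):
    """Build the venue-search URL; urlencode escapes the parameter values."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'query': query,
        'radius': radius,
        'limit': limit,
    }
    return SEARCH_ENDPOINT + '?' + urlencode(params)


def rows_from_venues(name, lat, lng, venues, default_category):
    """Flatten one neighborhood's venue list into tuples, one per venue."""
    return [
        (name, lat, lng, v['name'], v['location']['lat'], v['location']['lng'],
         v['categories'][0]['name'] if v['categories'] else default_category)
        for v in venues
    ]


# Hypothetical response fragment, shaped like results["response"]["venues"].
fake_venues = [
    {'name': 'Planet Fitness', 'location': {'lat': 40.874088, 'lng': -73.909137},
     'categories': [{'name': 'Gym / Fitness Center'}]},
    {'name': 'Winston Churchill Gym', 'location': {'lat': 40.874714, 'lng': -73.911467},
     'categories': []},  # no category: falls back to the default label
]

url = build_search_url('ID', 'SECRET', '20180605', 40.876551, -73.91066, 'gym')
rows = rows_from_venues('Marble Hill', 40.876551, -73.91066, fake_venues, 'Gym')
print(rows[1][-1])
```

With these two pieces, one function taking `query` and `default_category` as arguments could serve all three venue types instead of a separate function per query.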

Now let's run the above function on each neighborhood and create a new dataframe called *manhattan_gym*.

In [49]:
manhattan_gym = getgym(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards

Let's look at all the categories and remove the venues that are not related to gyms.

In [50]:
manhattan_gym['Venue Category'].unique()

array(['Gym', 'Gym / Fitness Center', 'College Gym', 'Office',
       'Recreation Center', 'School', 'College Stadium',
       'Residential Building (Apartment / Condo)', 'Basketball Court',
       'Cycle Studio', 'Kids Store', 'Athletics & Sports', 'Playground',
       'Hotel', 'Gymnastics Gym', 'Auditorium', 'Daycare',
       'General Entertainment', 'Climbing Gym', 'Rock Climbing Spot',
       'Building', 'Track', 'Hotel Pool', 'Health & Beauty Service',
       'Gym Pool', 'Outdoor Gym', 'Gay Bar', 'Club House', 'University',
       'Performing Arts Venue', 'Indie Theater', 'Spa', 'Boxing Gym',
       'Event Space', "Men's Store", 'College Basketball Court',
       'College Auditorium', 'Moving Target', 'College Library',
       'Physical Therapist', 'Dance Studio', 'Clothing Store',
       'Martial Arts Dojo', 'Adult Education Center', 'Sports Club',
       'Non-Profit', 'Pet Store'], dtype=object)

In [51]:
manhattan_gym = manhattan_gym[manhattan_gym['Venue Category'].isin(['Gym', 'Gym / Fitness Center', 'Basketball Court', 'Cycle Studio','Athletics & Sports', 'Gymnastics Gym','Climbing Gym','Rock Climbing Spot', 'Gym Pool',
       'Outdoor Gym','Boxing Gym',
       'Physical Therapist', 'Sports Club'])]
manhattan_gym


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.91066,Winston Churchill Gym,40.874714,-73.911467,Gym
1,Marble Hill,40.876551,-73.91066,Planet Fitness,40.874088,-73.909137,Gym / Fitness Center
2,Marble Hill,40.876551,-73.91066,Astral Fitness & Wellness Center,40.876705,-73.906372,Gym
5,Chinatown,40.715618,-73.994279,Gym,40.71654,-73.996871,Gym
6,Chinatown,40.715618,-73.994279,The Gym at The Crossroads,40.714009,-73.990495,Gym
8,Chinatown,40.715618,-73.994279,The Tombs Gym,40.716429,-73.99991,Gym
9,Chinatown,40.715618,-73.994279,Downstairs Gym,40.717605,-73.999299,Gym
10,Chinatown,40.715618,-73.994279,1789 Star Gym,40.713344,-74.000211,Gym
12,Chinatown,40.715618,-73.994279,NOMO Gym,40.719797,-74.000572,Gym / Fitness Center
13,Chinatown,40.715618,-73.994279,Fitness Center,40.719171,-74.0002,Gym


Let's repeat the process with spas and yoga centers.

In [22]:
def getspa(names, latitudes, longitudes, radius=500, LIMIT=100):
    query='spa'
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, lat, lng, VERSION, query, radius, LIMIT)
            

        # make the GET request
        results = requests.get(url).json()["response"]['venues']
        
        venues_list.append([( name, 
            lat, 
            lng, v['name'], 
            v['location']['lat'], 
            v['location']['lng'],
                'Spa' if len(v['categories'])==0 else v['categories'][0]['name']) for v in results])
    
    
    nearby_spa = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_spa.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return(nearby_spa)

In [23]:
manhattan_spa = getspa(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards

In [24]:
manhattan_spa['Venue Category'].unique()

array(['Nail Salon', 'Spa', 'Salon / Barbershop', 'Massage Studio',
       'Health & Beauty Service', 'Cosmetics Shop', "Dentist's Office",
       'Medical Center', 'Scenic Lookout', 'Theater', 'Pet Service',
       'Church', 'Record Shop', 'Food Truck', 'Fraternity House',
       'Building', 'Storage Facility', 'College Theater', 'Moving Target',
       'College Auditorium', 'Event Space', 'Indie Theater',
       'Optical Shop', 'Art Gallery', 'Monument / Landmark',
       'Laundry Service', "Doctor's Office", 'Daycare', 'Hotel',
       'College Cafeteria', 'American Restaurant', 'Pet Store',
       'Gift Shop', 'Business Service', 'Dive Bar', 'Assisted Living',
       'Coworking Space', 'Meeting Room', 'Design Studio',
       'Gym / Fitness Center', 'Residential Building (Apartment / Condo)',
       'Tanning Salon', 'Deli / Bodega', 'Language School', 'Office',
       'Pool', 'Miscellaneous Shop', 'Plaza', 'Acupuncturist',
       "Women's Store", 'Resort', 'Movie Theater', 'Bus Stati

In [47]:
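# Note: 'Salon / Barber Shop' is spelled differently from the 'Salon / Barbershop' category listed above, so that category is not retained.
manhattan_spa = manhattan_spa[manhattan_spa['Venue Category'].isin(['Nail Salon', 'Spa', 'Salon / Barber Shop','Massage Studio','Health & Beauty Service'])]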
manhattan_spa

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.91066,Spa Nail,40.879232,-73.906792,Nail Salon
1,Marble Hill,40.876551,-73.91066,The Spa at TCR the Club of Riverdale,40.878632,-73.914774,Spa
2,Marble Hill,40.876551,-73.91066,Studio Esthetique Day Spa,40.878686,-73.915209,Spa
3,Marble Hill,40.876551,-73.91066,Blue Skin Laser Spa,40.879808,-73.906175,Spa
4,Marble Hill,40.876551,-73.91066,Narcisses Spa,40.880133,-73.907131,Spa
7,Marble Hill,40.876551,-73.91066,Hello Nails & Spa Inc,40.879631,-73.906282,Spa
8,Chinatown,40.715618,-73.994279,Zu Yuan Spa,40.715469,-73.998627,Spa
9,Chinatown,40.715618,-73.994279,Season Spa,40.717693,-73.996678,Spa
10,Chinatown,40.715618,-73.994279,GoGreen Organic Spa,40.717014,-73.995847,Spa
11,Chinatown,40.715618,-73.994279,Sunny Spa E Broadway,40.713917,-73.99397,Massage Studio
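
Hand-typed category lists are easy to get slightly wrong, and `isin` silently ignores labels that never occur. A small check like the one below can report such labels before filtering. The helper `unmatched_labels` is an assumption made for this example, and the observed categories are only a subset of those visible in the output above.

```python
def unmatched_labels(allow_list, observed):
    """Return allow-list entries that never occur among the observed categories."""
    return sorted(set(allow_list) - set(observed))


# Subset of the spa categories visible in the output above.
observed_spa = ['Nail Salon', 'Spa', 'Salon / Barbershop', 'Massage Studio',
                'Health & Beauty Service', 'Cosmetics Shop']

# The allow-list as typed in the filter cell.
spa_allow_list = ['Nail Salon', 'Spa', 'Salon / Barber Shop', 'Massage Studio',
                  'Health & Beauty Service']

print(unmatched_labels(spa_allow_list, observed_spa))
# → ['Salon / Barber Shop']
```

In the notebook, the observed list would come from `manhattan_spa['Venue Category'].unique()` before the filter is applied.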


In [52]:
def getyoga(names, latitudes, longitudes, radius=500, LIMIT=100):
    query='yoga'
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, lat, lng, VERSION, query, radius, LIMIT)
            

        # make the GET request
        results = requests.get(url).json()["response"]['venues']
        
        venues_list.append([( name, 
            lat, 
            lng, v['name'], 
            v['location']['lat'], 
            v['location']['lng'],
                'Yoga' if len(v['categories'])==0 else v['categories'][0]['name']) for v in results])
    
    
    nearby_yoga = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_yoga.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return(nearby_yoga)

In [53]:
manhattan_yoga = getyoga(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards

In [54]:
manhattan_yoga['Venue Category'].unique()

array(['Yoga Studio', 'Dance Studio', 'Wine Bar', 'Athletics & Sports',
       'Residential Building (Apartment / Condo)', 'Coworking Space',
       'Gym', 'Gym / Fitness Center', 'Park', 'Yoga', 'Event Space',
       'Spiritual Center', 'Temple', 'Travel Agency', 'Building',
       'Nightlife Spot', 'Massage Studio', 'Office', 'College Gym',
       'Physical Therapist', 'Pharmacy', 'Miscellaneous Shop', 'Boutique',
       'Pop-Up Shop', 'Tech Startup', 'Health & Beauty Service',
       'Business Center', 'Spa', 'Pilates Studio', 'Garden'], dtype=object)

In [56]:
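# Note: 'Yoha Studio' does not match the 'Yoga Studio' category listed above, so only 'Yoga' and 'Pilates Studio' venues are retained.
manhattan_yoga = manhattan_yoga[manhattan_yoga['Venue Category'].isin(['Yoha Studio', 'Yoga', 'Pilates Studio'])]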
manhattan_yoga

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
32,Lenox Hill,40.768113,-73.95886,Elahi Yoga,40.767293,-73.961594,Yoga
37,Roosevelt Island,40.76216,-73.949168,Wild Blossom Yoga with Natalie,40.764398,-73.953602,Yoga
91,Greenwich Village,40.726933,-73.999914,Yoga to the People,40.72421,-73.997347,Yoga
132,Little Italy,40.719324,-73.997305,Yoga to the People,40.72421,-73.997347,Yoga
154,Soho,40.722184,-74.000657,Yoga to the People,40.72421,-73.997347,Yoga
195,Financial District,40.707107,-74.010665,NYSE Yoga Class 8th floor,40.707648,-74.011776,Yoga
219,Noho,40.723259,-73.988434,Egil's Yoga,40.725074,-73.991478,Yoga
269,Midtown South,40.74851,-73.988713,NEWLIFE NY Yoga & Raw Food Expo,40.752183,-73.993284,Yoga
274,Midtown South,40.74851,-73.988713,NudeYorkYoga,40.745546,-73.991824,Yoga
281,Sutton Place,40.76028,-73.963556,Pilates/Yoga,40.757895,-73.967837,Pilates Studio


Let's create a single dataframe, df, which includes all the gyms, spas and yoga centers.

In [91]:
frames = [manhattan_yoga, manhattan_gym, manhattan_spa]

df = pd.concat(frames, ignore_index=True)
df.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Lenox Hill,40.768113,-73.95886,Elahi Yoga,40.767293,-73.961594,Yoga
1,Roosevelt Island,40.76216,-73.949168,Wild Blossom Yoga with Natalie,40.764398,-73.953602,Yoga
2,Greenwich Village,40.726933,-73.999914,Yoga to the People,40.72421,-73.997347,Yoga
3,Little Italy,40.719324,-73.997305,Yoga to the People,40.72421,-73.997347,Yoga
4,Soho,40.722184,-74.000657,Yoga to the People,40.72421,-73.997347,Yoga


In [92]:
df.rename(columns={'Venue Latitude':'venlat', 'Venue Longitude':'venlon', 'Venue Category':'vencat'}, inplace=True)
df.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,venlat,venlon,vencat
0,Lenox Hill,40.768113,-73.95886,Elahi Yoga,40.767293,-73.961594,Yoga
1,Roosevelt Island,40.76216,-73.949168,Wild Blossom Yoga with Natalie,40.764398,-73.953602,Yoga
2,Greenwich Village,40.726933,-73.999914,Yoga to the People,40.72421,-73.997347,Yoga
3,Little Italy,40.719324,-73.997305,Yoga to the People,40.72421,-73.997347,Yoga
4,Soho,40.722184,-74.000657,Yoga to the People,40.72421,-73.997347,Yoga
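
Because the 500 m search circles overlap, one venue can legitimately appear under several neighborhoods (for example, 'Yoga to the People' is listed for Greenwich Village, Little Italy and Soho). What should not happen is the same venue appearing more than once within one neighborhood, which can occur if a cell is re-executed or the input table contains repeated rows. The following is a minimal sketch of a check for that case, using a small sample with a deliberately repeated row; the choice of key columns is an assumption.

```python
import pandas as pd

# Small sample using the renamed columns; the repeated last row is deliberate.
sample = pd.DataFrame({
    'Neighborhood': ['Lenox Hill', 'Greenwich Village', 'Soho', 'Lenox Hill'],
    'Venue': ['Elahi Yoga', 'Yoga to the People', 'Yoga to the People', 'Elahi Yoga'],
    'venlat': [40.767293, 40.72421, 40.72421, 40.767293],
    'venlon': [-73.961594, -73.997347, -73.997347, -73.961594],
    'vencat': ['Yoga', 'Yoga', 'Yoga', 'Yoga'],
})

# A venue is identified here by neighborhood, name and position.
key = ['Neighborhood', 'Venue', 'venlat', 'venlon']

n_repeated = int(sample.duplicated(subset=key).sum())
deduped = sample.drop_duplicates(subset=key).reset_index(drop=True)
print(n_repeated, len(deduped))
```

The venue shared by Greenwich Village and Soho is kept in both neighborhoods, while the repeated Lenox Hill row is dropped.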


## Methodology <a name="methodology"></a>

In this project we direct our efforts at detecting neighborhoods of Manhattan that have a high, but not the highest, density of gyms, spas and yoga centers.

In the first step we collected the required **data: the location and type of every gym, spa and yoga center within 500 m of every neighborhood center**.

The second step is to analyze the number of venues per neighborhood and the mean frequency of each venue type.

In the third and final step we focus on the most promising areas and, within those, create **clusters of locations that meet some basic requirements**. We present a map of these clusters (built with **k-means clustering**) to identify general zones, neighborhoods and addresses that should serve as a starting point for the stakeholders' final street-level exploration and search for an optimal venue location.

## Analysis <a name="analysis"></a>

Let's count how many gyms/spas or yoga centers are in each neighborhood.

In [93]:
df_count=df.groupby('Neighborhood').count()
df_count

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,venlat,venlon,vencat
Battery Park City,80,80,80,80,80,80
Carnegie Hill,92,92,92,92,92,92
Central Harlem,12,12,12,12,12,12
Chelsea,96,96,96,96,96,96
Chinatown,110,110,110,110,110,110
Civic Center,136,136,136,136,136,136
Clinton,124,124,124,124,124,124
East Harlem,26,26,26,26,26,26
East Village,100,100,100,100,100,100
Financial District,148,148,148,148,148,148


In [94]:
df_sorted=df_count.sort_values(by='Neighborhood Latitude', ascending = False)
df_sorted

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,venlat,venlon,vencat
Midtown South,180,180,180,180,180,180
Sutton Place,174,174,174,174,174,174
Flatiron,162,162,162,162,162,162
Murray Hill,158,158,158,158,158,158
Midtown,154,154,154,154,154,154
Financial District,148,148,148,148,148,148
Soho,142,142,142,142,142,142
Civic Center,136,136,136,136,136,136
Little Italy,132,132,132,132,132,132
Lenox Hill,130,130,130,130,130,130


Midtown South, Sutton Place, Flatiron, Murray Hill and Midtown are the top 5 neighborhoods by number of gyms, spas and yoga centers, with 180, 174, 162, 158 and 154 venues respectively.

Since we don't want to recommend a very crowded neighborhood for the new business, but don't want an "empty" one either, we will choose the neighborhoods with between 80 and 130 venues (inclusive). Our dataframe is therefore reduced to:

In [104]:
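# Note: 'Uppear East Side' does not match the dataset's 'Upper East Side', so that neighborhood is not selected.
df_reduced = df[df['Neighborhood'].isin(['Lenox Hill', 'Clinton', 'Greenwich Village', 'Turtle Bay', 'Yorkville', 'Lincoln Square',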
                                         'Chinatown', 'Uppear East Side', 'Noho', 'Gramercy', 'East Village', 'Tudor City', 'Chelsea', 'Carnegie Hill','West Village', 'Battery Park City'])]
df_reduced_sorted=df_reduced.groupby('Neighborhood').count()
df_reduced_sorted

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,venlat,venlon,vencat
Battery Park City,80,80,80,80,80,80
Carnegie Hill,92,92,92,92,92,92
Chelsea,96,96,96,96,96,96
Chinatown,110,110,110,110,110,110
Clinton,124,124,124,124,124,124
East Village,100,100,100,100,100,100
Gramercy,100,100,100,100,100,100
Greenwich Village,118,118,118,118,118,118
Lenox Hill,130,130,130,130,130,130
Lincoln Square,110,110,110,110,110,110


Let's analyze each neighborhood.

In [105]:
# one hot encoding
df_onehot = pd.get_dummies(df_reduced[['vencat']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
df_onehot['Neighborhood'] = df_reduced['Neighborhood'] 
# move neighborhood column to the first column
fixed_columns = [df_onehot.columns[-1]] + list(df_onehot.columns[:-1])
df_onehot = df_onehot[fixed_columns]

df_onehot.head()

Unnamed: 0,Neighborhood,Athletics & Sports,Basketball Court,Boxing Gym,Climbing Gym,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym,Health & Beauty Service,Massage Studio,Nail Salon,Outdoor Gym,Rock Climbing Spot,Spa,Yoga
0,Lenox Hill,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,Greenwich Village,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
6,Noho,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
10,Lenox Hill,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
12,Greenwich Village,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


Next, let's group the rows by neighborhood and take the mean of the frequency of occurrence of each category.

In [184]:
df_grouped = df_onehot.groupby('Neighborhood').mean().reset_index()
df_grouped

Unnamed: 0,Neighborhood,Athletics & Sports,Basketball Court,Boxing Gym,Climbing Gym,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym,Health & Beauty Service,Massage Studio,Nail Salon,Outdoor Gym,Rock Climbing Spot,Spa,Yoga
0,Battery Park City,0.025,0.0,0.0,0.0,0.65,0.175,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.125,0.0
1,Carnegie Hill,0.021739,0.021739,0.0,0.0,0.347826,0.065217,0.0,0.021739,0.021739,0.043478,0.195652,0.0,0.0,0.26087,0.0
2,Chelsea,0.0,0.020833,0.0,0.0,0.25,0.125,0.0,0.020833,0.020833,0.041667,0.208333,0.0,0.0,0.3125,0.0
3,Chinatown,0.0,0.0,0.0,0.0,0.109091,0.018182,0.0,0.0,0.054545,0.181818,0.036364,0.0,0.0,0.6,0.0
4,Clinton,0.016129,0.0,0.0,0.0,0.435484,0.16129,0.0,0.0,0.0,0.016129,0.080645,0.0,0.016129,0.274194,0.0
5,East Village,0.02,0.0,0.0,0.0,0.3,0.08,0.0,0.02,0.0,0.08,0.1,0.0,0.0,0.4,0.0
6,Gramercy,0.0,0.02,0.0,0.0,0.2,0.1,0.0,0.0,0.02,0.14,0.32,0.0,0.0,0.2,0.0
7,Greenwich Village,0.0,0.0,0.033898,0.0,0.237288,0.0,0.0,0.0,0.067797,0.016949,0.101695,0.0,0.0,0.525424,0.016949
8,Lenox Hill,0.015385,0.0,0.0,0.0,0.261538,0.123077,0.0,0.0,0.015385,0.092308,0.184615,0.0,0.0,0.292308,0.015385
9,Lincoln Square,0.018182,0.0,0.0,0.018182,0.436364,0.163636,0.0,0.0,0.0,0.018182,0.127273,0.0,0.0,0.218182,0.0


Let's run *k*-means to cluster the neighborhoods into 3 clusters.

In [185]:
# set number of clusters
kclusters = 3

manhattan_grouped_clustering = df_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 1, 1, 2, 0, 1, 1, 2, 1, 0], dtype=int32)

In [186]:
df_grouped.insert(0, 'Cluster Labels', kmeans.labels_)

In [115]:
df_grouped

Unnamed: 0,Cluster Labels,Neighborhood,Athletics & Sports,Basketball Court,Boxing Gym,Climbing Gym,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym,Health & Beauty Service,Massage Studio,Nail Salon,Outdoor Gym,Rock Climbing Spot,Spa,Yoga
0,0,Battery Park City,0.025,0.0,0.0,0.0,0.65,0.175,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.125,0.0
1,1,Carnegie Hill,0.021739,0.021739,0.0,0.0,0.347826,0.065217,0.0,0.021739,0.021739,0.043478,0.195652,0.0,0.0,0.26087,0.0
2,1,Chelsea,0.0,0.020833,0.0,0.0,0.25,0.125,0.0,0.020833,0.020833,0.041667,0.208333,0.0,0.0,0.3125,0.0
3,2,Chinatown,0.0,0.0,0.0,0.0,0.109091,0.018182,0.0,0.0,0.054545,0.181818,0.036364,0.0,0.0,0.6,0.0
4,0,Clinton,0.016129,0.0,0.0,0.0,0.435484,0.16129,0.0,0.0,0.0,0.016129,0.080645,0.0,0.016129,0.274194,0.0
5,1,East Village,0.02,0.0,0.0,0.0,0.3,0.08,0.0,0.02,0.0,0.08,0.1,0.0,0.0,0.4,0.0
6,1,Gramercy,0.0,0.02,0.0,0.0,0.2,0.1,0.0,0.0,0.02,0.14,0.32,0.0,0.0,0.2,0.0
7,2,Greenwich Village,0.0,0.0,0.033898,0.0,0.237288,0.0,0.0,0.0,0.067797,0.016949,0.101695,0.0,0.0,0.525424,0.016949
8,1,Lenox Hill,0.015385,0.0,0.0,0.0,0.261538,0.123077,0.0,0.0,0.015385,0.092308,0.184615,0.0,0.0,0.292308,0.015385
9,0,Lincoln Square,0.018182,0.0,0.0,0.018182,0.436364,0.163636,0.0,0.0,0.0,0.018182,0.127273,0.0,0.0,0.218182,0.0
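
Cluster labels are easier to interpret when summarised by the average venue mix of each cluster. The sketch below does this for the ten rows shown above, using only the 'Gym' and 'Spa' columns to keep the example short; restricting to two columns is a simplification made for illustration.

```python
import pandas as pd

# Cluster labels and two of the category shares from the table above.
clustered = pd.DataFrame({
    'Cluster Labels': [0, 1, 1, 2, 0, 1, 1, 2, 1, 0],
    'Neighborhood': ['Battery Park City', 'Carnegie Hill', 'Chelsea', 'Chinatown',
                     'Clinton', 'East Village', 'Gramercy', 'Greenwich Village',
                     'Lenox Hill', 'Lincoln Square'],
    'Gym': [0.65, 0.347826, 0.25, 0.109091, 0.435484, 0.3, 0.2, 0.237288,
            0.261538, 0.436364],
    'Spa': [0.125, 0.26087, 0.3125, 0.6, 0.274194, 0.4, 0.2, 0.525424,
            0.292308, 0.218182],
})

# Mean share of each category within each cluster.
profile = clustered.groupby('Cluster Labels')[['Gym', 'Spa']].mean().round(3)
print(profile)
```

For these rows, cluster 0 has the highest average 'Gym' share and cluster 2 the highest average 'Spa' share, with cluster 1 in between on both.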


In [187]:
manhattan_merged = manhattan_data

# join df_grouped to manhattan_data to add the cluster label and category frequencies for each neighborhood
manhattan_merged = manhattan_merged.join(df_grouped.set_index('Neighborhood'), on='Neighborhood')

manhattan_merged = manhattan_merged[manhattan_merged['Neighborhood'].isin(['Lenox Hill', 'Clinton', 'Greenwich Village', 'Turtle Bay', 'Yorkville', 'Lincoln Square',
                                         'Chinatown', 'Uppear East Side', 'Noho', 'Gramercy', 'East Village', 'Tudor City', 'Chelsea', 'Carnegie Hill','West Village', 'Battery Park City'])]
manhattan_merged

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,Athletics & Sports,Basketball Court,Boxing Gym,Climbing Gym,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym,Health & Beauty Service,Massage Studio,Nail Salon,Outdoor Gym,Rock Climbing Spot,Spa,Yoga
1,Manhattan,Chinatown,40.715618,-73.994279,2.0,0.0,0.0,0.0,0.0,0.109091,0.018182,0.0,0.0,0.054545,0.181818,0.036364,0.0,0.0,0.6,0.0
9,Manhattan,Yorkville,40.77593,-73.947118,1.0,0.017857,0.0,0.0,0.0,0.321429,0.196429,0.0,0.017857,0.0,0.0,0.160714,0.0,0.0,0.285714,0.0
10,Manhattan,Lenox Hill,40.768113,-73.95886,1.0,0.015385,0.0,0.0,0.0,0.261538,0.123077,0.0,0.0,0.015385,0.092308,0.184615,0.0,0.0,0.292308,0.015385
13,Manhattan,Lincoln Square,40.773529,-73.985338,0.0,0.018182,0.0,0.0,0.018182,0.436364,0.163636,0.0,0.0,0.0,0.018182,0.127273,0.0,0.0,0.218182,0.0
14,Manhattan,Clinton,40.759101,-73.996119,0.0,0.016129,0.0,0.0,0.0,0.435484,0.16129,0.0,0.0,0.0,0.016129,0.080645,0.0,0.016129,0.274194,0.0
17,Manhattan,Chelsea,40.744035,-74.003116,1.0,0.0,0.020833,0.0,0.0,0.25,0.125,0.0,0.020833,0.020833,0.041667,0.208333,0.0,0.0,0.3125,0.0
18,Manhattan,Greenwich Village,40.726933,-73.999914,2.0,0.0,0.0,0.033898,0.0,0.237288,0.0,0.0,0.0,0.067797,0.016949,0.101695,0.0,0.0,0.525424,0.016949
19,Manhattan,East Village,40.727847,-73.982226,1.0,0.02,0.0,0.0,0.0,0.3,0.08,0.0,0.02,0.0,0.08,0.1,0.0,0.0,0.4,0.0
24,Manhattan,West Village,40.734434,-74.00618,1.0,0.0,0.0,0.0,0.0,0.238095,0.02381,0.0,0.0,0.047619,0.02381,0.357143,0.0,0.0,0.309524,0.0
27,Manhattan,Gramercy,40.73721,-73.981376,1.0,0.0,0.02,0.0,0.0,0.2,0.1,0.0,0.0,0.02,0.14,0.32,0.0,0.0,0.2,0.0


In [188]:
# shift the cluster labels from 0-based to 1-based
manhattan_merged['Cluster Labels'] = manhattan_merged['Cluster Labels'] + 1

In [189]:
# the labels are floats at this point (e.g. 2.0); cast them to integers
manhattan_merged['Cluster Labels'] = manhattan_merged['Cluster Labels'].astype(int)
manhattan_merged

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,Athletics & Sports,Basketball Court,Boxing Gym,Climbing Gym,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym,Health & Beauty Service,Massage Studio,Nail Salon,Outdoor Gym,Rock Climbing Spot,Spa,Yoga
1,Manhattan,Chinatown,40.715618,-73.994279,3,0.0,0.0,0.0,0.0,0.109091,0.018182,0.0,0.0,0.054545,0.181818,0.036364,0.0,0.0,0.6,0.0
9,Manhattan,Yorkville,40.77593,-73.947118,2,0.017857,0.0,0.0,0.0,0.321429,0.196429,0.0,0.017857,0.0,0.0,0.160714,0.0,0.0,0.285714,0.0
10,Manhattan,Lenox Hill,40.768113,-73.95886,2,0.015385,0.0,0.0,0.0,0.261538,0.123077,0.0,0.0,0.015385,0.092308,0.184615,0.0,0.0,0.292308,0.015385
13,Manhattan,Lincoln Square,40.773529,-73.985338,1,0.018182,0.0,0.0,0.018182,0.436364,0.163636,0.0,0.0,0.0,0.018182,0.127273,0.0,0.0,0.218182,0.0
14,Manhattan,Clinton,40.759101,-73.996119,1,0.016129,0.0,0.0,0.0,0.435484,0.16129,0.0,0.0,0.0,0.016129,0.080645,0.0,0.016129,0.274194,0.0
17,Manhattan,Chelsea,40.744035,-74.003116,2,0.0,0.020833,0.0,0.0,0.25,0.125,0.0,0.020833,0.020833,0.041667,0.208333,0.0,0.0,0.3125,0.0
18,Manhattan,Greenwich Village,40.726933,-73.999914,3,0.0,0.0,0.033898,0.0,0.237288,0.0,0.0,0.0,0.067797,0.016949,0.101695,0.0,0.0,0.525424,0.016949
19,Manhattan,East Village,40.727847,-73.982226,2,0.02,0.0,0.0,0.0,0.3,0.08,0.0,0.02,0.0,0.08,0.1,0.0,0.0,0.4,0.0
24,Manhattan,West Village,40.734434,-74.00618,2,0.0,0.0,0.0,0.0,0.238095,0.02381,0.0,0.0,0.047619,0.02381,0.357143,0.0,0.0,0.309524,0.0
27,Manhattan,Gramercy,40.73721,-73.981376,2,0.0,0.02,0.0,0.0,0.2,0.1,0.0,0.0,0.02,0.14,0.32,0.0,0.0,0.2,0.0


Finally, let's visualize the resulting clusters on a map.

In [190]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map (cluster labels are 1-based, so the palette is indexed with cluster-1)
for lat, lon, poi, cluster in zip(manhattan_merged['Latitude'], manhattan_merged['Longitude'], manhattan_merged['Neighborhood'], manhattan_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
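The color lookup used for the markers can be illustrated without folium or matplotlib. The following is a minimal sketch, not the notebook's own code: `cluster_palette` is a hypothetical helper built on the standard-library `colorsys` module (a stand-in for the `cm.rainbow` colormap), and `kclusters = 3` is an assumption based on the labels 1 to 3 shown in the table above.

```python
import colorsys

def cluster_palette(n_clusters):
    """Return n_clusters evenly spaced hex colors (hypothetical helper)."""
    palette = []
    for i in range(n_clusters):
        hue = i / max(n_clusters, 1)          # spread hues evenly around the color wheel
        r, g, b = colorsys.hsv_to_rgb(hue, 0.9, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette

kclusters = 3  # assumed: the labels in the table run from 1 to 3
palette = cluster_palette(kclusters)

# Labels were shifted to 1..kclusters, so subtract 1 to index the palette.
labels = {'Lincoln Square': 1, 'Yorkville': 2, 'Chinatown': 3}
marker_colors = {name: palette[label - 1] for name, label in labels.items()}
print(marker_colors)
```

Because the labels were shifted by one before plotting, forgetting the `- 1` would raise an `IndexError` for the last cluster.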

## Results and Discussion <a name="results"></a>

Our analysis shows that Manhattan has 1662 gyms or fitness centers, 1834 spas, massage studios and similar wellness venues, and only 20 yoga centers. They are spread across 39, 40 and 9 neighborhoods, respectively.

The analysis also shows that these venues are not uniformly distributed among the neighborhoods of Manhattan. Midtown South has the largest number of them (180), whereas Morningside Heights, Manhattanville, Stuyvesant Town and Central Harlem have only 12 of them.
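A per-neighborhood count like the one described here can be sketched with a simple tally. This is an illustrative example only: the `venues` list is a hypothetical placeholder, not the Foursquare results, and the variable names are assumptions.

```python
from collections import Counter

# Hypothetical (neighborhood, category) records; placeholders, not the real data.
venues = [
    ('Midtown South', 'Gym'), ('Midtown South', 'Spa'), ('Midtown South', 'Spa'),
    ('Chelsea', 'Gym'), ('Chelsea', 'Yoga'),
    ('Stuyvesant Town', 'Gym'),
]

# Count how many venues fall in each neighborhood.
venues_per_neighborhood = Counter(neighborhood for neighborhood, _ in venues)

# most_common() sorts from the most to the least crowded neighborhood.
ranked = venues_per_neighborhood.most_common()
most_crowded = ranked[0]
least_crowded = ranked[-1]
print(most_crowded, least_crowded)
```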

By clustering the data, we've seen that the neighborhoods fall into the following groups with similar venue profiles:
- Lincoln Square, Clinton, Battery Park City and Tudor City
- Yorkville, Lenox Hill, Chelsea, East Village, West Village, Gramercy, Carnegie Hill, Noho and Turtle Bay
- Chinatown and Greenwich Village
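To see why these groups come out as they do, it helps to compare the venue profiles directly. The sketch below is not the clustering step itself; it only computes Euclidean distances between four profiles, using shares copied from the cluster table in the Analysis section. The choice of five columns and the `nearest` helper are assumptions made for illustration.

```python
import math

# Category shares from the cluster table, in the order:
# Gym, Gym / Fitness Center, Massage Studio, Nail Salon, Spa
profiles = {
    'Lincoln Square':    [0.436364, 0.163636, 0.018182, 0.127273, 0.218182],
    'Clinton':           [0.435484, 0.161290, 0.016129, 0.080645, 0.274194],
    'Chinatown':         [0.109091, 0.018182, 0.181818, 0.036364, 0.600000],
    'Greenwich Village': [0.237288, 0.000000, 0.016949, 0.101695, 0.525424],
}

def nearest(name):
    """Return the other neighborhood whose profile is closest to `name` (hypothetical helper)."""
    others = (n for n in profiles if n != name)
    return min(others, key=lambda n: math.dist(profiles[name], profiles[n]))

for name in profiles:
    print(name, '->', nearest(name))
```

Lincoln Square and Clinton are each other's nearest neighbor (both dominated by gyms), as are Chinatown and Greenwich Village (both dominated by spas), which matches the first and third groups listed above.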

## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify Manhattan neighborhoods with a medium number of gyms, spas and yoga centers in order to help stakeholders narrow down the search for an optimal location for a new Wellness-Fitness Center. By analysing the data, we decided not to consider the neighborhoods with a very low or very high number of fitness/wellness venues. At this point, we have only 15 neighborhoods to choose from.
Clustering of those locations was then performed in order to create major zones of interest.
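The idea of keeping only mediumly crowded neighborhoods can be sketched as a band filter on venue counts. This is a minimal sketch under stated assumptions: only the counts for Midtown South (180) and Morningside Heights (12) come from the results above, the remaining counts and names are hypothetical, and the quartile cut-offs are an assumed choice rather than the thresholds used in the analysis.

```python
import statistics

# Venue counts per neighborhood. Only the first two values come from the
# results above; 'Neighborhood A'..'F' and their counts are hypothetical.
counts = {
    'Midtown South': 180,
    'Morningside Heights': 12,
    'Neighborhood A': 45,
    'Neighborhood B': 60,
    'Neighborhood C': 75,
    'Neighborhood D': 90,
    'Neighborhood E': 120,
    'Neighborhood F': 20,
}

# Assumed cut-offs: keep the neighborhoods between the first and third quartile.
q1, _, q3 = statistics.quantiles(counts.values(), n=4)
medium = sorted(name for name, count in counts.items() if q1 <= count <= q3)
print(medium)
```

Other cut-offs (fixed counts, or different percentiles) would work the same way; the point is only that both tails of the distribution are dropped.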

One important point to consider is that, among the remaining neighborhoods, only Noho, Lenox Hill and Greenwich Village have yoga centers. This matters if stakeholders want to choose an area where yoga is already represented.
Finally, since Noho and Lenox Hill, two of the neighborhoods with yoga centers, are in the same cluster, and this cluster is also the biggest one, choosing one of these two locations to start the new Wellness Center would be the best option. Another neighborhood in the same cluster could also be a good option: because the cluster is large, it will be easier to find variety in additional factors such as the attractiveness of each location (proximity to a park or water), noise levels / proximity to major roads, real estate availability, prices, and the social and economic dynamics of each neighborhood.