# Capstone Project - The Battle of Neighborhoods (Week 2)

## Background

For an entrepreneur, New York offers a stable environment, a large economy, and access to one of the world's busiest regions. Small business owners in the state do not expect that to change and are largely optimistic about the future, which means plenty of business opportunity in New York City neighborhoods. New Yorkers are, on average, wealthier than their national counterparts, so they have more money to spend on the goods and services small businesses offer. However, the high cost of living can be difficult to manage. Still, entrepreneurs say that if they can overcome the steep expenses of payroll and rent, not to mention a tangled web of taxes and fees, operating in New York is an investment that pays off in the end. Starting a new business in New York City is nonetheless hard work.

## Problem

One of the biggest challenges for any New York City business is the wide array of competition. In a city of more than eight million people, there is competition around every corner. The New York City market has become saturated in almost every industry, from medical practices to restaurants. Businesses now have to go beyond serving their customers and focus on marketing and reputation to acquire customers and grow. Queens, the largest of the five boroughs by area, is one of the most diverse places in the nation. Half of the neighborhood's residents speak Spanish. Others speak Chinese, Urdu, Hindi, Russian, Portuguese, Greek or Korean. Altogether, the neighborhood is said to be home to 167 languages. The challenge is deciding what business to start for such a culturally diverse population.

## Solving the Problem Using Data Science

To address this problem, we collect location data from Foursquare and apply data science techniques and tools to recommend options for a new business. We cluster New York neighborhoods by their existing business establishments and venues in order to scope out the competition and identify an opening for a new, trendy business.

## Data section

We extracted the ZIP Code Definitions of New York City Neighborhoods, available from https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm, saved them as a CSV file, and uploaded the file to our own server. To explore and target recommended locations according to the presence of amenities and essential facilities, we access venue data through the Foursquare API and arrange it as a dataframe for visualization. The final dataset merges the New York City ZIP codes by neighborhood with the Foursquare data on venues and essential facilities surrounding each neighborhood.
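
The source table lists several ZIP codes per neighborhood in one comma-separated cell. Before such a table can be merged with other data by ZIP code, it may help to reshape it to one row per ZIP code. The following is a minimal sketch of that step, assuming a recent pandas version with `DataFrame.explode`; the frame and the column names `zip_codes` and `zip_code` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

# Toy frame in the same shape as the scraped table (column names are assumptions).
zips = pd.DataFrame({
    "Borough": ["Bronx", "Queens"],
    "Neighborhood": ["Central Bronx", "Northeast Queens"],
    "zip_codes": ["10453, 10457, 10460", "11361, 11362, 11363, 11364"],
})

# Split the comma-separated string into a list, then emit one row per list item.
long_form = (
    zips.assign(zip_code=zips["zip_codes"].str.split(","))
        .explode("zip_code")
        .drop(columns="zip_codes")
        .reset_index(drop=True)
)
# Remove the spaces that followed each comma.
long_form["zip_code"] = long_form["zip_code"].str.strip()

print(long_form)
```

Keeping the ZIP codes as strings avoids losing leading zeros if the same approach is reused for other regions.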

## Methodology

The analysis follows four steps:
1. Collect Data
2. Explore and Understand Data
3. Data Preparation and Preprocessing 
4. Modeling

#### Collect Data 

In [2]:
# Scrape the website and store in data frame
import pandas as pd
link = "https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm"
df = pd.read_html(link,header=0)[0]

In [3]:
import requests
from bs4 import BeautifulSoup


url = "https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm"
wiki = requests.get(url)
soup = BeautifulSoup(wiki.content, "html.parser")
table = soup.find_all("table")[0]
table_rows = table.find_all("tr")

In [4]:
#reading into list of columns

c1=[]
c2=[]
c3=[]


for tr in table_rows:
    row = tr.find_all("td")
    # keep only rows with exactly three cells;
    # header rows and rows with fewer cells are skipped
    if len(row) == 3:
        c1.append(row[0].find(string=True))
        c2.append(row[1].find(string=True))
        c3.append(row[2].find(string=True))
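
The loop above keeps only rows with exactly three cells. If the page merges the borough cell across several rows (an assumption about its HTML, consistent with the five-row result shown further down), the shorter rows could be kept by carrying the last borough forward. A minimal sketch of that idea on plain lists of cell text; `fill_rowspan` is a hypothetical helper and the ZIP codes in the second row are illustrative.

```python
def fill_rowspan(rows):
    """Return (borough, neighborhood, zip_codes) tuples from rows of cell text.

    Rows with three cells start a new borough; rows with two cells are
    assumed to belong to the most recently seen borough.
    """
    records = []
    borough = None
    for cells in rows:
        if len(cells) == 3:
            borough, neighborhood, zips = cells
        elif len(cells) == 2 and borough is not None:
            neighborhood, zips = cells
        else:
            continue  # header row or unexpected layout
        records.append((borough.strip(), neighborhood.strip(), zips.strip()))
    return records


# Illustrative cell text; in the notebook it could come from
# [td.get_text() for td in tr.find_all("td")] for each table row.
rows = [
    ["Bronx", "Central Bronx", "10453, 10457, 10460"],
    ["Bronx Park and Fordham", "10458, 10467, 10468"],
    ["Brooklyn", "Central Brooklyn", "11212, 11213, 11216, 11233, 11238"],
]
records = fill_rowspan(rows)
for record in records:
    print(record)
```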

In [5]:
#reading into Data Frame

df = pd.DataFrame(c1, columns = ["Borough"])
df["Neighborhood"] = c2
df["Postcode"] = c3

In [6]:
df.head()

Unnamed: 0,Borough,Neighborhood,Postcode
0,Bronx,Central Bronx,"10453, 10457, 10460"
1,Brooklyn,Central Brooklyn,"11212, 11213, 11216, 11233, 11238"
2,Manhattan,Central Harlem,"10026, 10027, 10030, 10037, 10039"
3,Queens,Northeast Queens,"11361, 11362, 11363, 11364"
4,Staten Island,Port Richmond,"10302, 10303, 10310"


In [7]:
df.shape

(5, 3)

In [8]:
df.to_csv("NYZip2.csv")

In [9]:
df = pd.read_csv("NYZip2.csv", index_col=0)

df.head()

Unnamed: 0,Borough,Neighborhood,Postcode
0,Bronx,Central Bronx,"10453, 10457, 10460"
1,Brooklyn,Central Brooklyn,"11212, 11213, 11216, 11233, 11238"
2,Manhattan,Central Harlem,"10026, 10027, 10030, 10037, 10039"
3,Queens,Northeast Queens,"11361, 11362, 11363, 11364"
4,Staten Island,Port Richmond,"10302, 10303, 10310"


In [10]:
#import libraries
import numpy as np
import pandas as pd

final= pd.read_csv("https://alsantiago.com/NYZip2.csv", index_col=0)

In [11]:
# import JSON library
import json

#!conda install -c conda-forge geopy --yes # uncomment this line if geopy is not installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
try:
    from pandas import json_normalize # pandas 1.0 and later
except ImportError:
    from pandas.io.json import json_normalize # older pandas versions
# json_normalize transforms a JSON document into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [12]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

print('Libraries imported.')

Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
folium                    0.5.0                      py_0    conda-forge
Libraries imported.


In [13]:
final = final[['Borough','Neighborhood','Latitude','Longitude']]
final.head()

Index,Borough,Neighborhood,Latitude,Longitude
1,Bronx,Central Bronx,40.852779,-73.912332
2,Bronx,Bronx Park and Fordham,40.862543,-73.888143
3,Bronx,High Bridge and Morrisania,40.820479,-73.925084
4,Bronx,Hunts Point and Mott Haven,40.805489,-73.916585
5,Bronx,Kingsbridge and Riverdale,40.880678,-73.90654


In [14]:
address = 'New York, NY'
geolocator = Nominatim(user_agent="on_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York are 40.7308619, -73.9871558.


In [15]:
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(final['Latitude'], final['Longitude'], final['Borough'], final['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

In [16]:
# Rows whose Borough name contains the word "Queens"
newyork_data = final[final['Borough'].str.contains('Queens', regex=True)]
# Keep the rows whose Borough is exactly "Queens" and renumber them from 0
downtown_newyork = newyork_data[newyork_data['Borough'] == 'Queens'].reset_index(drop=True)

In [17]:
address = 'New York, NY'
geolocator = Nominatim(user_agent="on_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York are 40.7308619, -73.9871558.


In [18]:
map_downtown = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(downtown_newyork['Latitude'], downtown_newyork['Longitude'], downtown_newyork['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_downtown)  
    
map_downtown

In [19]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (redacted; do not publish real credentials)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (redacted)
VERSION = '20180605' # Foursquare API version

LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 1000 # define radius

In [20]:
downtown_newyork.loc[0, 'Neighborhood']

'Northeast Queens'

In [21]:
# neighborhood latitude value
neighborhood_latitude = downtown_newyork.loc[0, 'Latitude']
# neighborhood longitude value
neighborhood_longitude = downtown_newyork.loc[0, 'Longitude']
# neighborhood name
neighborhood_name = downtown_newyork.loc[0, 'Neighborhood']

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Northeast Queens are 40.764191, -73.772775.


In [22]:
# limit of number of venues returned by Foursquare API
LIMIT = 100
# define radius
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
# display URL
url

'https://api.foursquare.com/v2/venues/explore?&client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&v=20180605&ll=40.764191,-73.772775&radius=500&limit=100'

In [23]:
# Send the GET request and examine the results
results = requests.get(url).json()
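
The request above is built by formatting credentials directly into the URL string. One alternative, sketched here under stated assumptions, is to read the credentials from environment variables and let the standard library encode the query string. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `build_explore_url` are hypothetical; the endpoint and parameter names are the ones already used in this notebook.

```python
import os
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, radius=500, limit=100, version="20180605"):
    """Build the explore URL without hard-coding credentials in the source."""
    params = {
        # Hypothetical environment variable names; empty string if unset.
        "client_id": os.environ.get("FOURSQUARE_CLIENT_ID", ""),
        "client_secret": os.environ.get("FOURSQUARE_CLIENT_SECRET", ""),
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-encodes values, so the comma in "ll" becomes %2C.
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


url = build_explore_url(40.764191, -73.772775)
```

The resulting string can be passed to `requests.get` in the same way as the hand-formatted URL, and it keeps the keys out of the notebook and its saved outputs.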

In [24]:
# Now extract the category of the venue
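def get_category_type(row):
    # the column is called 'categories' or 'venue.categories'
    # depending on how the JSON was flattened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # a venue may have no category at all
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']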

In [25]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,The French Workshop,Bakery,40.765404,-73.771861
1,Martha's Country Bakery,Bakery,40.763422,-73.770971
2,Press 195,Bar,40.763905,-73.770946
3,Avli Little Greek Tavern,Greek Restaurant,40.765729,-73.771972
4,Nippon Cha,Noodle House,40.764408,-73.771461
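
As a sanity check on the `radius` parameter, the distance between the neighborhood center and a returned venue can be computed from their coordinates. The sketch below uses the haversine formula with a mean Earth radius of 6,371 km (an assumption of the spherical model); `haversine_m` is a hypothetical helper, and the coordinates are taken from the outputs above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


center = (40.764191, -73.772775)  # Northeast Queens
venue = (40.765404, -73.771861)   # The French Workshop
distance = haversine_m(*center, *venue)

print("distance in metres:", round(distance))
print("within 500 m:", distance <= 500)
```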


In [26]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

81 venues were returned by Foursquare.


In [27]:
# Let's create a function to repeat the same process to all the neighborhoods in downtown_newyork
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
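
The expression `v['venue']['categories'][0]['name']` in the function above raises an `IndexError` if a venue comes back with an empty category list. A defensive variant could look like the following sketch; `first_category_name` is a hypothetical helper, and the second item is made-up test data.

```python
def first_category_name(item, default=None):
    """Return the name of the first category of a venue item, or a default."""
    categories = item.get("venue", {}).get("categories") or []
    if not categories:
        return default
    return categories[0].get("name", default)


# Items shaped like the entries of results['response']['groups'][0]['items'].
items = [
    {"venue": {"name": "Press 195", "categories": [{"name": "Bar"}]}},
    {"venue": {"name": "Venue without category", "categories": []}},
]
names = [first_category_name(item, default="Uncategorized") for item in items]
print(names)  # ['Bar', 'Uncategorized']
```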

In [28]:
downtown_newyork_venues = getNearbyVenues(names=downtown_newyork['Neighborhood'],
                                   latitudes=downtown_newyork['Latitude'],
                                   longitudes=downtown_newyork['Longitude']
                                  )

Northeast Queens
North Queens
Central Queens
Jamaica
Northwest Queens
West Central Queens
Rockaways
Southeast Queens
Southwest Queens
West Queens


In [29]:
print(downtown_newyork_venues.shape)
downtown_newyork_venues.head()

(318, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Northeast Queens,40.764191,-73.772775,The French Workshop,40.765404,-73.771861,Bakery
1,Northeast Queens,40.764191,-73.772775,Martha's Country Bakery,40.763422,-73.770971,Bakery
2,Northeast Queens,40.764191,-73.772775,Press 195,40.763905,-73.770946,Bar
3,Northeast Queens,40.764191,-73.772775,Avli Little Greek Tavern,40.765729,-73.771972,Greek Restaurant
4,Northeast Queens,40.764191,-73.772775,Nippon Cha,40.764408,-73.771461,Noodle House


In [30]:
downtown_newyork_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Central Queens,23,23,23,23,23,23
Jamaica,8,8,8,8,8,8
North Queens,21,21,21,21,21,21
Northeast Queens,81,81,81,81,81,81
Northwest Queens,63,63,63,63,63,63
Rockaways,1,1,1,1,1,1
Southeast Queens,25,25,25,25,25,25
Southwest Queens,22,22,22,22,22,22
West Central Queens,38,38,38,38,38,38
West Queens,36,36,36,36,36,36


In [31]:
print('There are {} unique categories.'.format(len(downtown_newyork_venues['Venue Category'].unique())))

There are 114 unique categories.


In [32]:
# Analyze Each Neighborhood
# one hot encoding
downtown_onehot = pd.get_dummies(downtown_newyork_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
downtown_onehot['Neighborhood'] = downtown_newyork_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [downtown_onehot.columns[-1]] + list(downtown_onehot.columns[:-1])
downtown_onehot = downtown_onehot[fixed_columns]

downtown_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,American Restaurant,Art Museum,Asian Restaurant,Athletics & Sports,Auto Workshop,Automotive Shop,Bagel Shop,Bakery,...,Toy / Game Store,Trail,Train,Train Station,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,Northeast Queens,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,Northeast Queens,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,Northeast Queens,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Northeast Queens,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Northeast Queens,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
downtown_grouped = downtown_onehot.groupby('Neighborhood').mean().reset_index()
downtown_grouped

Unnamed: 0,Neighborhood,Accessories Store,American Restaurant,Art Museum,Asian Restaurant,Athletics & Sports,Auto Workshop,Automotive Shop,Bagel Shop,Bakery,...,Toy / Game Store,Trail,Train,Train Station,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,Central Queens,0.043478,0.0,0.0,0.0,0.043478,0.043478,0.043478,0.043478,0.043478,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Jamaica,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0
2,North Queens,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.095238,...,0.047619,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Northeast Queens,0.0,0.037037,0.0,0.012346,0.0,0.0,0.0,0.012346,0.024691,...,0.0,0.012346,0.0,0.012346,0.0,0.0,0.0,0.012346,0.0,0.012346
4,Northwest Queens,0.0,0.031746,0.015873,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.015873,0.0,0.0,0.015873,0.0,0.015873,0.0
5,Rockaways,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Southeast Queens,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.04,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Southwest Queens,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,...,0.0,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0
8,West Central Queens,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.026316,0.026316,...,0.0,0.0,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0
9,West Queens,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
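
Taking the mean of the one-hot columns per neighborhood turns venue counts into shares: each value is the fraction of that neighborhood's venues that fall into the category. A small worked example on made-up data (neighborhoods "A" and "B" are hypothetical) shows the arithmetic; `dtype=float` is passed so the result does not depend on the default dummy dtype of the installed pandas version.

```python
import pandas as pd

# Toy venue list: four venues in A, two in B.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Bakery", "Bakery", "Bar", "Pizza Place", "Bar", "Bar"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of 0/1 indicators = share of venues in each category.
grouped = onehot.groupby("Neighborhood").mean()
print(grouped)
```

In neighborhood A, two of four venues are bakeries, so the Bakery share is 0.5; in B both venues are bars, so the Bar share is 1.0. Each row sums to 1.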


In [34]:
# print each neighborhood along with the top 5 most common venues
num_top_venues = 5

for hood in downtown_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = downtown_grouped[downtown_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Central Queens----
                 venue  freq
0          Pizza Place  0.09
1   Chinese Restaurant  0.09
2    Accessories Store  0.04
3      Bubble Tea Shop  0.04
4  Japanese Restaurant  0.04


----Jamaica----
                  venue  freq
0        Discount Store  0.12
1  Caribbean Restaurant  0.12
2              Pharmacy  0.12
3      Basketball Court  0.12
4     Fish & Chips Shop  0.12


----North Queens----
               venue  freq
0        Supermarket  0.14
1  Korean Restaurant  0.14
2             Bakery  0.10
3   Toy / Game Store  0.05
4          Pool Hall  0.05


----Northeast Queens----
               venue  freq
0                Bar  0.07
1        Pizza Place  0.06
2   Sushi Restaurant  0.05
3                Spa  0.04
4  Indian Restaurant  0.04


----Northwest Queens----
                venue  freq
0                Café  0.11
1               Hotel  0.08
2         Coffee Shop  0.08
3         Pizza Place  0.06
4  Mexican Restaurant  0.06


----Rockaways----
                

In [35]:
# put that into a pandas dataframe
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [36]:
# create the new dataframe and display the top 5 venues for each neighborhood.
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = downtown_grouped['Neighborhood']

for ind in np.arange(downtown_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(downtown_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Central Queens,Pizza Place,Chinese Restaurant,Accessories Store,Bubble Tea Shop,Ice Cream Shop
1,Jamaica,Chinese Restaurant,Video Store,Caribbean Restaurant,Pharmacy,Seafood Restaurant
2,North Queens,Korean Restaurant,Supermarket,Bakery,Gym / Fitness Center,Sushi Restaurant
3,Northeast Queens,Bar,Pizza Place,Sushi Restaurant,Spa,Indian Restaurant
4,Northwest Queens,Café,Hotel,Coffee Shop,Pizza Place,Mexican Restaurant
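
The column names above come from a fixed list of suffixes with a fallback to "th", which is sufficient for five columns. If a larger number of top venues were ever requested (for example 21 or 22), a small helper could compute the suffix instead. This is only a sketch; `ordinal` is a hypothetical function, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 6)
]
print(columns)
```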


In [37]:
# Run k-means to cluster the neighborhood into 5 clusters.
# set number of clusters
kclusters = 5

downtown_grouped_clustering = downtown_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(downtown_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 2, 3, 1, 1, 0, 1, 1, 1, 4], dtype=int32)
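
Before interpreting the clusters, it can be useful to look at how many neighborhoods each label received. The sketch below counts the labels printed above using only the standard library.

```python
from collections import Counter

# Labels copied from the kmeans.labels_ output above.
labels = [1, 2, 3, 1, 1, 0, 1, 1, 1, 4]

sizes = Counter(labels)
for label in sorted(sizes):
    print("cluster {}: {} neighborhood(s)".format(label, sizes[label]))
```

Label 1 covers six of the ten neighborhoods and every other label covers exactly one, which suggests that with only ten rows a smaller number of clusters might also be worth trying.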

In [38]:
downtown_data = downtown_newyork
downtown_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Queens,Northeast Queens,40.764191,-73.772775
1,Queens,North Queens,40.768208,-73.827403
2,Queens,Central Queens,40.739634,-73.79449
3,Queens,Jamaica,40.698095,-73.758986
4,Queens,Northwest Queens,40.747155,-73.93975


In [39]:
downtown_data.shape

(10, 4)

In [40]:
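# create a new dataframe that includes the cluster as well as the top 5 venues for each neighborhood.
downtown_merged = downtown_data.copy()

# add clustering labels
# caution: kmeans.labels_ follows the row order of downtown_grouped (sorted by neighborhood name),
# while downtown_merged keeps the row order of downtown_data, so this assignment is positional
# and does not match labels to neighborhoods by name
downtown_merged['Cluster Labels'] = kmeans.labels_

# join neighborhoods_venues_sorted onto downtown_merged to add the most common venues for each neighborhood
downtown_merged = downtown_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

downtown_merged.head() # check the last columns!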

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Queens,Northeast Queens,40.764191,-73.772775,1,Bar,Pizza Place,Sushi Restaurant,Spa,Indian Restaurant
1,Queens,North Queens,40.768208,-73.827403,2,Korean Restaurant,Supermarket,Bakery,Gym / Fitness Center,Sushi Restaurant
2,Queens,Central Queens,40.739634,-73.79449,3,Pizza Place,Chinese Restaurant,Accessories Store,Bubble Tea Shop,Ice Cream Shop
3,Queens,Jamaica,40.698095,-73.758986,1,Chinese Restaurant,Video Store,Caribbean Restaurant,Pharmacy,Seafood Restaurant
4,Queens,Northwest Queens,40.747155,-73.93975,1,Café,Hotel,Coffee Shop,Pizza Place,Mexican Restaurant
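
k-means returns its labels in the row order of the frame it was fitted on. When that frame and the coordinates table are ordered differently, attaching the labels by neighborhood name rather than by position keeps each label with the right row. The following is a minimal sketch on made-up data; the frames `grouped` and `coords` and the label values are hypothetical stand-ins for `downtown_grouped`, `downtown_data` and `kmeans.labels_`.

```python
import pandas as pd

# Rows in the order they were clustered (toy values).
grouped = pd.DataFrame({"Neighborhood": ["Central", "Jamaica", "North"]})
labels = [1, 2, 0]  # stand-in for kmeans.labels_

# Coordinates table in a different row order.
coords = pd.DataFrame({
    "Neighborhood": ["North", "Central", "Jamaica"],
    "Latitude": [40.77, 40.74, 40.70],
})

# Attach each label to its neighborhood name first, then merge on the name.
labelled = grouped.assign(**{"Cluster Labels": labels})
merged = coords.merge(labelled, on="Neighborhood", how="left")
print(merged)
```

A left merge keeps the row order of `coords`, so the map and the per-cluster tables can be built from `merged` without relying on matching row positions.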


In [41]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(downtown_merged['Latitude'], downtown_merged['Longitude'], downtown_merged['Neighborhood'], downtown_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [42]:
downtown_merged.loc[downtown_merged['Cluster Labels'] == 0, downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
5,West Central Queens,Chinese Restaurant,Grocery Store,Sushi Restaurant,Japanese Restaurant,Spa


In [43]:
downtown_merged.loc[downtown_merged['Cluster Labels'] == 1, downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Northeast Queens,Bar,Pizza Place,Sushi Restaurant,Spa,Indian Restaurant
3,Jamaica,Chinese Restaurant,Video Store,Caribbean Restaurant,Pharmacy,Seafood Restaurant
4,Northwest Queens,Café,Hotel,Coffee Shop,Pizza Place,Mexican Restaurant
6,Rockaways,Beach,Yoga Studio,French Restaurant,Cosmetics Shop,Deli / Bodega
7,Southeast Queens,Pharmacy,Gift Shop,Indian Restaurant,Pet Store,Cosmetics Shop
8,Southwest Queens,Fast Food Restaurant,Clothing Store,Italian Restaurant,Gym,Pharmacy


In [44]:
downtown_merged.loc[downtown_merged['Cluster Labels'] == 2, downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
1,North Queens,Korean Restaurant,Supermarket,Bakery,Gym / Fitness Center,Sushi Restaurant


In [45]:
downtown_merged.loc[downtown_merged['Cluster Labels'] == 3, downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,Central Queens,Pizza Place,Chinese Restaurant,Accessories Store,Bubble Tea Shop,Ice Cream Shop


In [46]:
downtown_merged.loc[downtown_merged['Cluster Labels'] == 4, downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
9,West Queens,Tennis Stadium,Mexican Restaurant,Latin American Restaurant,Deli / Bodega,Pizza Place


## Results and Discussion
Clusters are numbered 1 to 5 here, corresponding to k-means labels 0 to 4. Across all clusters, 20 of the most common venues are restaurants and food places, each catering to a specific cuisine. <br>
Cluster 1 has 1 neighborhood, with 3 restaurants and 2 stores among its most common venues.<br>
Cluster 2 has 6 neighborhoods, with 13 diverse food places and restaurants alongside different types of stores.<br>
Cluster 3 has 1 neighborhood, with 2 restaurants and food places and a bus station.<br>
Cluster 4 has 1 neighborhood, with 1 restaurant, a supermarket and a bus station.<br>
Cluster 5 has 1 neighborhood, with 2 restaurants as common venues, a hotel and a stadium.<br>
The geographic distribution gives a decent view of where to open a new business: restaurants are already plentiful, and they cater to the diverse culture of the borough. <br>
Cluster 2 is by far the largest and contains the majority of the neighborhoods.


## Conclusion
The k-means clustering technique gives a new entrepreneur a good idea of what type of trendy business to start, based on the types of venues identified using the Foursquare API.<br>
The venues identified within the neighborhoods are predominantly restaurants and food shops, while coffee shops appear among the five most common venues in only one neighborhood (Northwest Queens). <br>
It is not too late to catch up with the trend: offering gourmet and specialty coffee from different parts of the world would fit the diverse culture of Queens, New York.
