### New York Next Health Food Store Location

#### Business Problem:
We want to open a health food store in New York City.
<br>
I'll try to answer which neighborhoods are recommended for the store location,
taking into account:
* How healthy the neighborhood's population is (smoking, obesity)
* The mix of businesses in the neighborhood (junk food restaurants vs. gyms)
* How many health food stores are already in the neighborhood

#### Project Data:
The data I'll be using:
* New York City population and health data by neighborhood (official data from nyc.gov)
* Foursquare venue data
* Latitude and longitude data from the geopy geolocator (Nominatim)

##### Data Examples:
New York City population and health data by neighborhood (official data from nyc.gov)

In [462]:
nyc.head()

Name_fix,ID,Borough,Name,Overall_Pop,Smoking,Obesity
Financial District,101,Manhattan,Financial District,63383.0,16.0,4.0
Greenwich Village,102,Manhattan,Greenwich Village and Soho,91638.0,16.0,4.0
Lower East Side,103,Manhattan,Lower East Side and Chinatown,171103.0,20.0,10.0
Clinton,104,Manhattan,Clinton and Chelsea,122119.0,11.0,10.0
Midtown,105,Manhattan,Midtown,53120.0,11.0,10.0


###### Foursquare Data:

In [471]:
t[['Name','burgers','fried chicken','donuts','gym fitness','health food store','organic grocery']].head()


Unnamed: 0,Name,burgers,fried chicken,donuts,gym fitness,health food store,organic grocery
0,Financial District,43,10,26,115,2,0
1,Greenwich Village and Soho,31,16,14,30,9,3
2,Lower East Side and Chinatown,22,6,14,14,1,3
3,Clinton and Chelsea,1,0,0,2,0,0
4,Midtown,39,5,13,76,6,1


##### Libraries Import

In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line to install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
#!conda install matplotlib
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line to install folium
import folium # map rendering library

#conda install -c anaconda sodapy
#conda install -c anaconda xlrd
#conda install -c anaconda html5lib

# Library needed for getting NY population data
from sodapy import Socrata


##### New York Data

In [461]:
nyc=pd.read_excel('https://www1.nyc.gov/assets/doh/downloads/excel/episrv/2018-chp-pud.xlsx',sheet_name='CHP_all_data',header=1)
nyc.drop([0,1,2,3,4,5],inplace=True)
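nyc=nyc.reset_index(drop=True) # reset_index returns a new frame, so the result has to be assigned
# Name_fix keeps only the first neighborhood of a combined name ('A and B' -> 'A') for geocoding
nyc['Name_fix']=nyc['Name'].str.partition(' and ')[0]
nyc=nyc[['ID','Borough','Name','Overall_Pop','Smoking','Obesity','Name_fix']].set_index(['Name_fix']).dropna()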
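
To make the `Name_fix` step concrete, here is a small standalone sketch. It only illustrates how `str.partition` behaves on three names copied from the table above; the `names` Series is made up for the example and is not part of the notebook's data flow.

```python
import pandas as pd

# Three district names as they appear in the 'Name' column
names = pd.Series([
    'Financial District',
    'Greenwich Village and Soho',
    'Lower East Side and Chinatown',
])

# partition returns three columns (before, separator, after); column 0 holds
# the text before the first ' and ', or the whole name when ' and ' is absent
name_fix = names.str.partition(' and ')[0]
print(name_fix.tolist())
```

Only the first neighborhood of each combined district is kept, so the geocoding step that follows looks up a single place rather than the whole district.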


##### Latitude & Longitude

In [353]:
geolocator = Nominatim(user_agent="ny_explorer")
for n in nyc.index:
    # Note: a short query such as 'Clinton,NY' is ambiguous and can match a place outside the city
    address=n+',NY'
    location = geolocator.geocode(address)
    if location:
        nyc.loc[n,'lat'] = location.latitude
        nyc.loc[n,'lon'] = location.longitude
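
Geocoding by short name can go wrong: in the saved data shown later, Clinton and Chelsea has latitude 43.048403 and longitude -75.378503, which is far outside the city. A minimal sanity check, assuming a rough bounding box for New York City, could look like the sketch below. The box limits and the helper name `outside_nyc` are assumptions made for this example, not part of the original analysis.

```python
import pandas as pd

# Approximate bounding box of New York City (an assumption for this sketch)
NYC_LAT = (40.47, 40.93)
NYC_LON = (-74.27, -73.68)

def outside_nyc(df):
    """Return rows whose lat/lon are missing or fall outside the bounding box."""
    in_box = df['lat'].between(*NYC_LAT) & df['lon'].between(*NYC_LON)
    return df[~in_box]

# Coordinates copied from the saved data for three Manhattan districts
coords = pd.DataFrame(
    {'lat': [40.707612, 43.048403, 40.760109],
     'lon': [-74.009378, -75.378503, -73.978163]},
    index=['Financial District', 'Clinton', 'Midtown'],
)

suspect = outside_nyc(coords)
print(suspect.index.tolist())
```

Rows flagged this way could be geocoded again with a more specific address before any venue counts are requested for them.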

In [424]:
nyc.loc[nyc['Borough']=='Queens']


Name_fix,ID,Borough,Name,Overall_Pop,Smoking,Obesity
Long Island City,401,Queens,Long Island City and Astoria,199969.0,19.0,19.0
Woodside,402,Queens,Woodside and Sunnyside,135972.0,14.0,20.0
Jackson Heights,403,Queens,Jackson Heights,179844.0,13.0,20.0
Elmhurst,404,Queens,Elmhurst and Corona,188107.0,15.0,23.0
Ridgewood,405,Queens,Ridgewood and Maspeth,166924.0,20.0,22.0
Rego Park,406,Queens,Rego Park and Forest Hills,115119.0,10.0,19.0
Flushing,407,Queens,Flushing and Whitestone,263039.0,13.0,13.0
Hillcrest,408,Queens,Hillcrest and Fresh Meadows,156217.0,14.0,20.0
Kew Gardens,409,Queens,Kew Gardens and Woodhaven,148465.0,11.0,23.0
South Ozone Park,410,Queens,South Ozone Park and Howard Beach,125603.0,12.0,27.0


##### Foursquare API

In [309]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID (do not publish the real value)
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret (do not publish the real value)
VERSION = '20180605' # Foursquare API version

###### Foursquare data

In [459]:
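# Foursquare category IDs; each comment names the count column it produces in the saved data
categories=[
    '4bf58dd8d48988d16c941735', # burgers
    '4bf58dd8d48988d148941735', # donuts
    '4d4ae6fc7a7b7dea34424761', # fried chicken
    '4bf58dd8d48988d175941735', # gym fitness
    '50aa9e744b90af0d42d5de0e', # health food store
    '52f2ab2ebcbc57f1066b8b45'  # organic grocery
]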
radius=2000
LIMIT=1

In [356]:
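for n in nyc.index:
    for c in categories:
        # Pass the query as params so requests handles the URL encoding
        params = {
            'categoryId': c,
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(nyc.at[n,'lat'], nyc.at[n,'lon']),
            'radius': radius,
            'limit': LIMIT
        }
        results = requests.get('https://api.foursquare.com/v2/venues/explore', params=params).json()
        response = results.get('response', {})
        # The original run stopped here with KeyError: 'totalResults', presumably once the
        # request limit was reached, so responses without a count are skipped instead of failing
        if 'totalResults' in response and 'query' in response:
            nyc.loc[n,response['query']] = response['totalResults']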

In [463]:
# Export the data once, so the analysis can continue from the file if the Foursquare request limit is reached
#nyc.to_csv (r'C:\Users\Boaz\coursera\nycFoursquare.csv', index = False, header=True)
t=pd.read_csv(r'C:\Users\Boaz\coursera\nycFoursquare.csv')
t.head()

Unnamed: 0,ID,Borough,Name,Overall_Pop,Smoking,Obesity,lat,lon,burgers,donuts,fried chicken,gym fitness,health food store,organic grocery
0,101,Manhattan,Financial District,63383,16,4,40.707612,-74.009378,43,26,10,115,2,0
1,102,Manhattan,Greenwich Village and Soho,91638,16,4,40.733584,-74.002817,31,14,16,30,9,3
2,103,Manhattan,Lower East Side and Chinatown,171103,20,10,40.715936,-73.986806,22,14,6,14,1,3
3,104,Manhattan,Clinton and Chelsea,122119,11,10,43.048403,-75.378503,1,0,0,2,0,0
4,105,Manhattan,Midtown,53120,11,10,40.760109,-73.978163,39,13,5,76,6,1


In [449]:
t.set_index('Name',inplace=True)
# nyc and temp are further names for the same DataFrame as t (no copy is made),
# so the columns added below also appear in t
nyc=t
temp=nyc
# Venue counts per 1,000 residents
temp['JunkFoodPer1000']=1000*(nyc['burgers']+nyc['donuts']+nyc['fried chicken'])/nyc['Overall_Pop']
temp['GymPer1000']=1000*nyc['gym fitness']/nyc['Overall_Pop']
temp['HealthFoodPer1000']=1000*(nyc['health food store']+nyc['organic grocery'])/nyc['Overall_Pop']
# Smoking is left out of the model features
nycForModel=temp[['JunkFoodPer1000','GymPer1000','HealthFoodPer1000','Obesity']]
nycForModel.head()
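
As a worked check of the per-1000 rates, the sketch below recomputes the three values for the Financial District from the counts in the saved data (population 63383; 43 burger, 26 donut and 10 fried chicken venues; 115 gyms; 2 health food stores and 0 organic groceries). The helper `per_1000` is a hypothetical name used only for this example.

```python
import pandas as pd

def per_1000(count, population):
    """Number of venues per 1,000 residents."""
    return 1000 * count / population

# Financial District row, copied from the saved Foursquare data
row = pd.Series({
    'Overall_Pop': 63383,
    'burgers': 43, 'donuts': 26, 'fried chicken': 10,
    'gym fitness': 115,
    'health food store': 2, 'organic grocery': 0,
})

junk = per_1000(row['burgers'] + row['donuts'] + row['fried chicken'], row['Overall_Pop'])
gym = per_1000(row['gym fitness'], row['Overall_Pop'])
health = per_1000(row['health food store'] + row['organic grocery'], row['Overall_Pop'])
print(round(junk, 6), round(gym, 6), round(health, 6))
```

For example, the junk food rate is 1000 * 79 / 63383, which should agree with the Financial District row of the `nycForModel` output.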

Name,JunkFoodPer1000,GymPer1000,HealthFoodPer1000,Obesity
Financial District,1.246391,1.814367,0.031554,4
Greenwich Village and Soho,0.665663,0.327375,0.13095,4
Lower East Side and Chinatown,0.245466,0.081822,0.023378,10
Clinton and Chelsea,0.008189,0.016377,0.0,10
Midtown,1.073042,1.430723,0.131777,10


##### Clustering

In [450]:
# set number of clusters
kclusters = 3

# run k-means clustering (n_init is set explicitly because its default differs between scikit-learn versions)
kmeans = KMeans(n_clusters=kclusters, n_init=10, random_state=0).fit(nycForModel)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 0, 2, 1])
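
One thing to keep in mind with this setup: Obesity takes values such as 4 and 10, while the per-1000 rates are mostly below 2, so the Euclidean distances used by k-means are driven largely by Obesity. The sketch below shows one common remedy, standardizing the features first with scikit-learn's `StandardScaler`. It is not what this notebook ran, it uses only the five rows shown in the `nycForModel` output with two clusters for brevity, and applying it to the full data would change the cluster results reported below.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Five rows copied from the nycForModel output
features = pd.DataFrame(
    {
        'JunkFoodPer1000': [1.246391, 0.665663, 0.245466, 0.008189, 1.073042],
        'GymPer1000': [1.814367, 0.327375, 0.081822, 0.016377, 1.430723],
        'HealthFoodPer1000': [0.031554, 0.13095, 0.023378, 0.0, 0.131777],
        'Obesity': [4, 4, 10, 10, 10],
    },
    index=['Financial District', 'Greenwich Village and Soho',
           'Lower East Side and Chinatown', 'Clinton and Chelsea', 'Midtown'],
)

# Rescale every column to zero mean and unit variance so that no single
# feature dominates the distance computation
scaled = StandardScaler().fit_transform(features)

# Two clusters only because the example has five rows
kmeans_scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
print(kmeans_scaled.labels_)
```

Whether scaling is appropriate depends on how much weight Obesity should carry relative to the business mix; leaving the features unscaled effectively makes Obesity the main clustering criterion.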

In [451]:
#nycForModel.insert(0, 'Cluster Labels', kmeans.labels_)
#nycCluster = nyc.join(nycForModel,lsuffix='_1')
#nycCluster.head()
t.insert(0, 'Cluster Labels', kmeans.labels_)
t.reset_index(inplace=True)
t.head()

Unnamed: 0,Name,Cluster Labels,ID,Borough,Overall_Pop,Smoking,Obesity,lat,lon,burgers,donuts,fried chicken,gym fitness,health food store,organic grocery,JunkFoodPer1000,GymPer1000,HealthFoodPer1000
0,Financial District,0,101,Manhattan,63383,16,4,40.707612,-74.009378,43,26,10,115,2,0,1.246391,1.814367,0.031554
1,Greenwich Village and Soho,0,102,Manhattan,91638,16,4,40.733584,-74.002817,31,14,16,30,9,3,0.665663,0.327375,0.13095
2,Lower East Side and Chinatown,0,103,Manhattan,171103,20,10,40.715936,-73.986806,22,14,6,14,1,3,0.245466,0.081822,0.023378
3,Clinton and Chelsea,0,104,Manhattan,122119,11,10,43.048403,-75.378503,1,0,0,2,0,0,0.008189,0.016377,0.0
4,Midtown,0,105,Manhattan,53120,11,10,40.760109,-73.978163,39,13,5,76,6,1,1.073042,1.430723,0.131777


##### On Map

In [452]:
address = 'New York, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

In [453]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(t['lat'], t['lon'], t['Name'], t['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster], # cluster labels are 0-based, so they index the palette directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [458]:
t.loc[t['Cluster Labels']==0].sort_values(['HealthFoodPer1000'])

Unnamed: 0,Name,Cluster Labels,ID,Borough,Overall_Pop,Smoking,Obesity,lat,lon,burgers,donuts,fried chicken,gym fitness,health food store,organic grocery,JunkFoodPer1000,GymPer1000,HealthFoodPer1000
3,Clinton and Chelsea,0,104,Manhattan,122119,11,10,43.048403,-75.378503,1,0,0,2,0,0,0.008189,0.016377,0.0
6,Upper West Side,0,107,Manhattan,214744,10,10,40.787046,-73.975416,25,7,5,25,0,0,0.172298,0.116418,0.0
7,Upper East Side,0,108,Manhattan,225914,8,11,40.773702,-73.96412,8,2,3,19,0,0,0.057544,0.084103,0.0
48,Flushing and Whitestone,0,407,Queens,263039,13,13,40.76543,-73.817429,0,1,2,3,0,0,0.011405,0.011405,0.0
29,Park Slope and Carroll Gardens,0,306,Brooklyn,109351,15,15,40.670103,-73.985972,9,4,2,24,1,0,0.137173,0.219477,0.009145
35,Borough Park,0,312,Brooklyn,201640,10,15,40.633993,-73.996806,2,2,0,0,2,0,0.019837,0.0,0.009919
5,Stuyvesant Town and Turtle Bay,0,106,Manhattan,144591,12,13,40.731971,-73.978093,8,3,1,13,2,0,0.082993,0.089909,0.013832
2,Lower East Side and Chinatown,0,103,Manhattan,171103,20,10,40.715936,-73.986806,22,14,6,14,1,3,0.245466,0.081822,0.023378
0,Financial District,0,101,Manhattan,63383,16,4,40.707612,-74.009378,43,26,10,115,2,0,1.246391,1.814367,0.031554
1,Greenwich Village and Soho,0,102,Manhattan,91638,16,4,40.733584,-74.002817,31,14,16,30,9,3,0.665663,0.327375,0.13095


In [454]:
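# select the numeric columns before averaging; mean() over text columns such as Borough fails in recent pandas
t.groupby('Cluster Labels')[['JunkFoodPer1000','GymPer1000','HealthFoodPer1000','Obesity']].mean()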

Cluster Labels,JunkFoodPer1000,GymPer1000,HealthFoodPer1000,Obesity
0,0.338182,0.381089,0.031869,10.454545
1,0.042026,0.017034,0.006339,34.388889
2,0.04535,0.033112,0.002079,23.9


In [455]:
t.groupby('Cluster Labels').count()[['Name']]

Cluster Labels,Name
0,11
1,18
2,30
