# Introduction

 _Created by_ Vicente Carvalho for Coursera Capstone purposes.

Almost every business depends on location. At first, the location has an immediate effect on fixed costs and it can also be very important for variable earnings. In this problem, it will be chosen as the best location for a Psychology Clinic. The main criteria adopted are the relation between some variables such as the number of restaurants, clinics and hospitals, and the number of results of 'psy' results in the Foursquare database.

# Customer

Customer C is a psychologist who is also owner of several clinics of psychology in New York. C wants to know what areas of New York are good to install your clinics. Right now C has only one clinic in Manhattan but C wants to open new clinics in other regions too.

# Business Problem

C intends to scale your business and also to dilute some fixed costs by using the same client target for all your clinics in New York.

C intends to use similar as possible furniture, paints and customer psychology challengers. C believes that beyond one-by-one therapy, group therapy is also a great tool to provide your clients with better health and quality, so it is interesting to deal with clients of similar backgrounds and interests.

In C experience as clinical psychology, there is a strong correlation between customer psychology profile and house/work neighborhood area. It's known by the client experience that Manhattan is a great place but frequently over too overpriced. So, the client wants to know other areas that are similar to Manhattan that should be also investigated.

# Problem Solution Framework

C needs customer clusterization. As C is pretty sure about psychology profile and neighborhood, the first approach is definitely to try cluster neighbors in New York. 

It's necessary more information about his actual clinic in Manhattan: 
 - C said in his actual clinic there are a lot of psychologist clinics, hospitals, and restaurants: it should be examined as evidence of good places;
 - it is supposed that the correlation is strong between place and psychological profile;


# Data

It will be used data from Foursquare API. The steps admitted to solving the clusterization are:

 - Read New York Json file;
 - Add Latitude and Longitude information by Borough and Neighborhood;
 - Data Extract by Foursquare's API;

# Problem Solving

At first, k-means will be used to cluster regions of New York. It will be observed a strict radius to minimize the incidence of the same results for different searches.

In the case of no correlation found by the criteria pointed by C, another database will be included to support the required information.

## Data Extract

In [1]:
import requests # library to handle requests
import pandas as pd # library for data analsysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# tranforming json file into a pandas dataframe library
import json
from pandas.io.json import json_normalize

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

Folium installed
Libraries imported.


### Define Foursquare Credentials and Version

In [2]:
CLIENT_ID = 'ZZUKG2R2MDGSCP30TS4XINPAP4Y4PMBLU1DO1TNGFFPCNTZR' # your Foursquare ID
CLIENT_SECRET = 'KTAN34ECK3NKIB2LVQD2REJ5IEF4CG1YZOOEA5A3RIXSMMIT' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 100
radius = 1000
print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: ZZUKG2R2MDGSCP30TS4XINPAP4Y4PMBLU1DO1TNGFFPCNTZR
CLIENT_SECRET:KTAN34ECK3NKIB2LVQD2REJ5IEF4CG1YZOOEA5A3RIXSMMIT


### Neighborhoods in New York

In [3]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [4]:
neighborhoods_data = newyork_data['features']

In [5]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [6]:
for data in neighborhoods_data:
    borough = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)

In [7]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [8]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


## Add Lat/Long Info

In [9]:
!conda install -c conda-forge geopy --yes
import geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

done
-
  - anaconda/osx-64::openssl-1.1.1d-h1de35cc_2
  - defaults/osx-64::openssl-1.1.1d-h1de35cdone

# All requested packages already installed.



In [10]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of New York City are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of New York City are 40.7127281, -74.0060152.


In [28]:
lat = 40.7127281
lng = -74.0060152

In [12]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

### Connecting Foursquare API 

In [13]:
neighborhoods['Psy'] = 0
neighborhoods['Hospital'] = 0
neighborhoods['Bank'] = 0
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Psy,Hospital,Bank
0,Bronx,Wakefield,40.894705,-73.847201,0,0,0
1,Bronx,Co-op City,40.874294,-73.829939,0,0,0
2,Bronx,Eastchester,40.887556,-73.827806,0,0,0
3,Bronx,Fieldston,40.895437,-73.905643,0,0,0
4,Bronx,Riverdale,40.890834,-73.912585,0,0,0


In [46]:
columns = ['Psy','Hospital','Bank']
def getNearbyVenues(start, end, neighborhoods):
    radius = 500
    LIMIT = 100
    for i in range(start, end):
        lat = neighborhoods.loc[i,'Latitude']
        lng = neighborhoods.loc[i,'Longitude']
        name = neighborhoods.loc[i,'Neighborhood']
        # create the API request URL
        url = []
        url.append('https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, lat, lng, VERSION, 'psy', radius, LIMIT))
        url.append('https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, lat, lng, VERSION, 'hospital', radius, LIMIT))
        url.append('https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, lat, lng, VERSION, 'bank', radius, LIMIT))
        
        # make the GET request
        # results = requests.get(url).json()["response"]['groups'][0]['items']
        for k in range(0,3):
            results = requests.get(url[k]).json()
            #print(results)
            if ((results['response']['venues'] != []) and ('warning' not in results['response'].keys())):
                # assign relevant part of JSON to venues
                venues = results['response']['venues']
                # tranform venues into a dataframe
                dataframe = json_normalize(venues)
                neighborhoods.loc[i,columns[k]] = dataframe.shape[0]
            #else:
                # No response from Foursquare API, this code is not needed by definition of neighborhoods
                # neighborhoods.loc[i,k] = 0 
    
    return(neighborhoods)

In [81]:
a = pd.DataFrame()
a['Col1'] = [1, 2, 3, 4, 5]
a['Col2'] = ['a', 'b', 'c', 'd', 'e']
a.drop(index = [2, 4],axis = 0, inplace = True)
a.reset_index(inplace = True)
a.drop(columns = ['index'],axis = 1, inplace = True)
a.head()

Unnamed: 0,Col1,Col2
0,1,a
1,2,b
2,4,d


In [47]:
ny_venues = getNearbyVenues(0,306,neighborhoods)

KeyError: 'venues'

In [48]:
ny_venues.shape

NameError: name 'ny_venues' is not defined

In [37]:
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, 40.894705, -73.847201, VERSION, 'psy', radius, LIMIT)
results = requests.get(url).json()
print(results)

{'meta': {'code': 200, 'requestId': '5e7a883747b43d00235185ea'}, 'response': {'venues': []}}


In [40]:
if results["response"]['venues'] != []:
    print('Entrou')
else:
    print('Falso')

Falso


In [75]:
lines

[3,
 76,
 3,
 23,
 53,
 76,
 102,
 192,
 193,
 198,
 202,
 203,
 207,
 225,
 226,
 255,
 257,
 291,
 294,
 303]

In [82]:
print(ny_venues.shape)
ny_venues.head()

(209, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Park Slope,40.672321,-73.97705,Community Bookstore,40.672778,-73.976634,Bookstore
1,Park Slope,40.672321,-73.97705,Sounds,40.67245,-73.976784,Accessories Store
2,Park Slope,40.672321,-73.97705,Sushi Katsuei,40.670615,-73.978504,Japanese Restaurant
3,Park Slope,40.672321,-73.97705,Norman & Jules,40.6723,-73.977469,Toy / Game Store
4,Park Slope,40.672321,-73.97705,Tarzian West for Housewares,40.671063,-73.978131,Furniture / Home Store


In [83]:
ny_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Bath Beach,45,45,45,45,45,45
Canarsie,5,5,5,5,5,5
Coney Island,18,18,18,18,18,18
Cypress Hills,26,26,26,26,26,26
East New York,12,12,12,12,12,12
Flatlands,18,18,18,18,18,18
Manhattan Beach,10,10,10,10,10,10
Park Slope,67,67,67,67,67,67
Starrett City,8,8,8,8,8,8


In [84]:
print('There are {} uniques categories.'.format(len(ny_venues['Venue Category'].unique())))

There are 108 uniques categories.


## Analyzing Neighborhood

In [85]:
# one hot encoding
ny_onehot = pd.get_dummies(ny_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
ny_onehot['Neighborhood'] = ny_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [ny_onehot.columns[-1]] + list(ny_onehot.columns[:-1])
ny_onehot = ny_onehot[fixed_columns]

ny_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,American Restaurant,Asian Restaurant,Bagel Shop,Bakery,Bank,Bar,Baseball Stadium,Beach,...,Thai Restaurant,Theater,Theme Park Ride / Attraction,Toy / Game Store,Turkish Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Women's Store,Yoga Studio
0,Park Slope,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Park Slope,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Park Slope,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Park Slope,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,Park Slope,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [86]:
ny_onehot.shape

(209, 109)

In [87]:
ny_grouped = ny_onehot.groupby('Neighborhood').mean().reset_index()
ny_grouped

Unnamed: 0,Neighborhood,Accessories Store,American Restaurant,Asian Restaurant,Bagel Shop,Bakery,Bank,Bar,Baseball Stadium,Beach,...,Thai Restaurant,Theater,Theme Park Ride / Attraction,Toy / Game Store,Turkish Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Women's Store,Yoga Studio
0,Bath Beach,0.0,0.0,0.044444,0.0,0.022222,0.022222,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.022222,0.022222,0.0,0.0,0.0,0.0
1,Canarsie,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Coney Island,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.166667,0.055556,...,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Cypress Hills,0.0,0.0,0.0,0.0,0.038462,0.038462,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.038462,0.0,0.038462,0.0
4,East New York,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Flatlands,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0
6,Manhattan Beach,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Park Slope,0.014925,0.059701,0.0,0.029851,0.029851,0.0,0.0,0.0,0.0,...,0.014925,0.014925,0.0,0.014925,0.0,0.0,0.0,0.014925,0.0,0.014925
8,Starrett City,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Let's print each neighborhood along with the top 5 most common venues

In [88]:
num_top_venues = 10

for hood in ny_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = ny_grouped[ny_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bath Beach----
                  venue  freq
0    Chinese Restaurant  0.07
1              Pharmacy  0.07
2  Fast Food Restaurant  0.04
3    Italian Restaurant  0.04
4       Bubble Tea Shop  0.04
5      Sushi Restaurant  0.04
6      Asian Restaurant  0.04
7        Ice Cream Shop  0.02
8    Dim Sum Restaurant  0.02
9                 Diner  0.02


----Canarsie----
                           venue  freq
0               Asian Restaurant   0.2
1                    IT Services   0.2
2                            Gym   0.2
3             Chinese Restaurant   0.2
4           Caribbean Restaurant   0.2
5              Accessories Store   0.0
6  Paper / Office Supplies Store   0.0
7                          Plaza   0.0
8                     Playground   0.0
9                    Pizza Place   0.0


----Coney Island----
                          venue  freq
0              Baseball Stadium  0.17
1          Caribbean Restaurant  0.11
2                 Deli / Bodega  0.11
3                  Skating R

In [89]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [90]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = ny_grouped['Neighborhood']

for ind in np.arange(ny_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(ny_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bath Beach,Pharmacy,Chinese Restaurant,Italian Restaurant,Sushi Restaurant,Bubble Tea Shop,Fast Food Restaurant,Asian Restaurant,Rental Car Location,Restaurant,Kebab Restaurant
1,Canarsie,Asian Restaurant,IT Services,Gym,Chinese Restaurant,Caribbean Restaurant,Yoga Studio,Gas Station,Event Service,Falafel Restaurant,Fast Food Restaurant
2,Coney Island,Baseball Stadium,Caribbean Restaurant,Deli / Bodega,Beach,Pharmacy,Skating Rink,Pizza Place,Bakery,Theme Park Ride / Attraction,Music Venue
3,Cypress Hills,Fried Chicken Joint,Latin American Restaurant,Fast Food Restaurant,Ice Cream Shop,Spanish Restaurant,Chinese Restaurant,Metro Station,Gas Station,Donut Shop,Food
4,East New York,Deli / Bodega,Plaza,Metro Station,Pharmacy,Event Service,Music Venue,Caribbean Restaurant,Fried Chicken Joint,Pizza Place,Bus Station


In [135]:
search_query = 'Restaurant'
radius = 500
print(search_query + ' .... OK!')


Restaurant .... OK!


In [141]:
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}&price={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT,[3,4])
url

'https://api.foursquare.com/v2/venues/search?client_id=ZZUKG2R2MDGSCP30TS4XINPAP4Y4PMBLU1DO1TNGFFPCNTZR&client_secret=KTAN34ECK3NKIB2LVQD2REJ5IEF4CG1YZOOEA5A3RIXSMMIT&ll=40.7127281,-74.0060152&v=20180604&query=Restaurant&radius=500&limit=100&price=[3, 4]'

In [142]:
results = requests.get(url).json()
# assign relevant part of JSON to venues
venues = results['response']['venues']

# tranform venues into a dataframe
dataframe = json_normalize(venues)
dataframe.shape[0]

50

In [143]:
dataframe.head()

Unnamed: 0,id,name,categories,referralId,hasPerk,location.address,location.crossStreet,location.lat,location.lng,location.labeledLatLngs,...,location.state,location.country,location.formattedAddress,delivery.id,delivery.url,delivery.provider.name,delivery.provider.icon.prefix,delivery.provider.icon.sizes,delivery.provider.icon.name,venuePage.id
0,45e5c256f964a52046431fe3,Mudville Restaurant & Tap House,"[{'id': '4bf58dd8d48988d14c941735', 'name': 'W...",v-1583680123,False,126 Chambers St,btwn W Broadway & Church St,40.715336,-74.008881,"[{'label': 'display', 'lat': 40.71533575723845...",...,NY,United States,[126 Chambers St (btwn W Broadway & Church St)...,299726.0,https://www.seamless.com/menu/mudville-restaur...,seamless,https://fastly.4sqi.net/img/general/cap/,"[40, 50]",/delivery_provider_seamless_20180129.png,43156651.0
1,4b4fdfc8f964a520801827e3,TJ Byrnes Bar and Restaurant,"[{'id': '4bf58dd8d48988d1c4941735', 'name': 'R...",v-1583680123,False,77 Fulton St,Gold St,40.709233,-74.003747,"[{'label': 'display', 'lat': 40.70923312629616...",...,NY,United States,"[77 Fulton St (Gold St), New York, NY 10038, U...",64746.0,https://www.seamless.com/menu/tj-byrnes-77-ful...,seamless,https://fastly.4sqi.net/img/general/cap/,"[40, 50]",/delivery_provider_seamless_20180129.png,131643631.0
2,4bc238adf8219c744286b410,Amore's Pizza Restaurant,"[{'id': '4bf58dd8d48988d1ca941735', 'name': 'P...",v-1583680123,False,147 Chambers St,Hudson Street,40.71586,-74.009888,"[{'label': 'display', 'lat': 40.71585960614924...",...,NY,United States,"[147 Chambers St (Hudson Street), New York, NY...",1431324.0,https://www.seamless.com/menu/cafe-amores-pizz...,seamless,https://fastly.4sqi.net/img/general/cap/,"[40, 50]",/delivery_provider_seamless_20180129.png,
3,4c4890da1879c9b6cce6e143,New Shezan Restaurant,"[{'id': '4bf58dd8d48988d115941735', 'name': 'M...",v-1583680123,False,183 Church St,btwn Duane & Reade St.,40.715789,-74.007227,"[{'label': 'display', 'lat': 40.71578949233164...",...,NY,United States,"[183 Church St (btwn Duane & Reade St.), New Y...",,,,,,,
4,48510cf9f964a520a5501fe3,Restaurant Marc Forgione,"[{'id': '4bf58dd8d48988d157941735', 'name': 'N...",v-1583680123,False,134 Reade St,btwn Hudson and Greenwich St,40.71638,-74.009629,"[{'label': 'display', 'lat': 40.71637984317071...",...,NY,United States,"[134 Reade St (btwn Hudson and Greenwich St), ...",,,,,,,
