# Introduction

 _Created by_ Vicente Carvalho for Coursera Capstone purposes.

Almost every business depends on location. At first, the location has an immediate effect on fixed costs and it can also be very important for variable earnings. In this problem, it will be chosen as the best location for a Psychology Clinic. The main criteria adopted are the relation between some variables such as the number of restaurants, clinics and hospitals, and the number of results of 'psy' results in the Foursquare database.

# Customer

Customer C is a psychologist who is also owner of several clinics of psychology in New York. C wants to know what areas of New York are good to install your clinics. Right now C has only one clinic in Manhattan but C wants to open new clinics in other regions too.

# Business Problem

C intends to scale your business and also to dilute some fixed costs by using the same client target for all your clinics in New York.

C intends to use similar as possible furniture, paints and customer psychology challengers. C believes that beyond one-by-one therapy, group therapy is also a great tool to provide your clients with better health and quality, so it is interesting to deal with clients of similar backgrounds and interests.

In C experience as clinical psychology, there is a strong correlation between customer psychology profile and house/work neighborhood area. It's known by the client experience that Manhattan is a great place but frequently over too overpriced. So, the client wants to know other areas that are similar to Manhattan that should be also investigated.

# Problem Solution Framework

C needs customer clusterization. As C is pretty sure about psychology profile and neighborhood, the first approach is definitely to try cluster neighbors in New York. 

It's necessary more information about his actual clinic in Manhattan: 
 - C said in his actual clinic there are a lot of psychologist clinics, hospitals, and restaurants: it should be examined as evidence of good places;
 - it is supposed that the correlation is strong between place and psychological profile;


# Data

It will be used data from Foursquare API. The steps admitted to solving the clusterization are:

 - Read New York Json file;
 - Add Latitude and Longitude information by Borough and Neighborhood;
 - Data Extract by Foursquare's API;

# Problem Solving

At first, k-means will be used to cluster regions of New York. It will be observed a strict radius to minimize the incidence of the same results for different searches.

In the case of no correlation found by the criteria pointed by C, another database will be included to support the required information.

## Data Extract

In [1]:
import requests # library to handle requests
import pandas as pd # library for data analsysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# tranforming json file into a pandas dataframe library
import json
from pandas.io.json import json_normalize

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

Folium installed
Libraries imported.


### Define Foursquare Credentials and Version

In [2]:
CLIENT_ID = 'ZZUKG2R2MDGSCP30TS4XINPAP4Y4PMBLU1DO1TNGFFPCNTZR' # your Foursquare ID
CLIENT_SECRET = 'KTAN34ECK3NKIB2LVQD2REJ5IEF4CG1YZOOEA5A3RIXSMMIT' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 100
radius = 1000
print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: ZZUKG2R2MDGSCP30TS4XINPAP4Y4PMBLU1DO1TNGFFPCNTZR
CLIENT_SECRET:KTAN34ECK3NKIB2LVQD2REJ5IEF4CG1YZOOEA5A3RIXSMMIT


### Neighborhoods in New York

In [94]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [95]:
newyork_data

0,
    'borough': 'Queens',
    'bbox': [-73.79646462081593,
     40.71145964370482,
     -73.79646462081593,
     40.71145964370482]}},
  {'type': 'Feature',
   'id': 'nyu_2451_34572.265',
   'geometry': {'type': 'Point',
    'coordinates': [-73.79671678028349, 40.73350025429757]},
   'geometry_name': 'geom',
   'properties': {'name': 'Utopia',
    'stacked': 1,
    'annoline1': 'Utopia',
    'annoline2': None,
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Queens',
    'bbox': [-73.79671678028349,
     40.73350025429757,
     -73.79671678028349,
     40.73350025429757]}},
  {'type': 'Feature',
   'id': 'nyu_2451_34572.266',
   'geometry': {'type': 'Point',
    'coordinates': [-73.80486120040537, 40.73493618075478]},
   'geometry_name': 'geom',
   'properties': {'name': 'Pomonok',
    'stacked': 1,
    'annoline1': 'Pomonok',
    'annoline2': None,
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Queens',
    'bbox': [-73.80486120040537,
     40.7349361807547

In [4]:
neighborhoods_data = newyork_data['features']

In [5]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [6]:
for data in neighborhoods_data:
    borough = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)

In [7]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [8]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


## Add Lat/Long Info

In [9]:
!conda install -c conda-forge geopy --yes
import geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

done
-
  - anaconda/osx-64::openssl-1.1.1d-h1de35cc_2
  - defaults/osx-64::openssl-1.1.1d-h1de35cdone

# All requested packages already installed.



In [10]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of New York City are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of New York City are 40.7127281, -74.0060152.


In [11]:
lat = 40.7127281
lng = -74.0060152

In [12]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

### Connecting Foursquare API 

In [17]:
neighborhoods['Psy'] = 0
neighborhoods['Hospital'] = 0
neighborhoods['Bank'] = 0
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Psy,Hospital,Bank
0,Bronx,Wakefield,40.894705,-73.847201,0,0,0
1,Bronx,Co-op City,40.874294,-73.829939,0,0,0
2,Bronx,Eastchester,40.887556,-73.827806,0,0,0
3,Bronx,Fieldston,40.895437,-73.905643,0,0,0
4,Bronx,Riverdale,40.890834,-73.912585,0,0,0


In [42]:
columns = ['Psy','Hospital','Bank']
def getNearbyVenues(start, end, neighborhoods):
    radius = 500
    LIMIT = 100
    for i in range(start, end):
        lat = neighborhoods.loc[i,'Latitude']
        lng = neighborhoods.loc[i,'Longitude']
        name = neighborhoods.loc[i,'Neighborhood']
        # create the API request URL
        url = []
        url.append('https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, lat, lng, VERSION, 'psy', radius, LIMIT))
        url.append('https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, lat, lng, VERSION, 'hospital', radius, LIMIT))
        url.append('https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, lat, lng, VERSION, 'bank', radius, LIMIT))
        
        # make the GET request
        # results = requests.get(url).json()["response"]['groups'][0]['items']
        for k in range(0,3):
            results = requests.get(url[k]).json()
            print(results)
            if ((results['response']['venues'] != []) and ('warning' not in results['response'].keys())):
                # assign relevant part of JSON to venues
                venues = results['response']['venues']
                # tranform venues into a dataframe
                dataframe = json_normalize(venues)
                neighborhoods.loc[i,columns[k]] = dataframe.shape[0]
            #else:
                # No response from Foursquare API, this code is not needed by definition of neighborhoods
                # neighborhoods.loc[i,k] = 0 
    
    return(neighborhoods)

In [15]:
a = pd.DataFrame()
a['Col1'] = [1, 2, 3, 4, 5]
a['Col2'] = ['a', 'b', 'c', 'd', 'e']
a.drop(index = [2, 4],axis = 0, inplace = True)
a.reset_index(inplace = True)
a.drop(columns = ['index'],axis = 1, inplace = True)
a.head()

Unnamed: 0,Col1,Col2
0,1,a
1,2,b
2,4,d


In [43]:
ny_venues = getNearbyVenues(171,300, ny_venues)

{'meta': {'code': 429, 'errorType': 'quota_exceeded', 'errorDetail': 'Quota exceeded', 'requestId': '5e7bb1a5a2e538001bd48a2a'}, 'response': {}}


KeyError: 'venues'

In [37]:
df_venues = pd.DataFrame()

In [40]:
df_venues = ny_venues.copy()
df_venues.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Psy,Hospital,Bank
0,Bronx,Wakefield,40.894705,-73.847201,0,0,1
1,Bronx,Co-op City,40.874294,-73.829939,0,0,5
2,Bronx,Eastchester,40.887556,-73.827806,0,0,0
3,Bronx,Fieldston,40.895437,-73.905643,0,0,1
4,Bronx,Riverdale,40.890834,-73.912585,0,1,5


In [44]:
df_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Borough,Latitude,Longitude,Psy,Hospital,Bank
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Allerton,1,1,1,1,1,1
Annadale,1,1,1,1,1,1
Arden Heights,1,1,1,1,1,1
Arlington,1,1,1,1,1,1
Arrochar,1,1,1,1,1,1
...,...,...,...,...,...,...
Woodhaven,1,1,1,1,1,1
Woodlawn,1,1,1,1,1,1
Woodrow,1,1,1,1,1,1
Woodside,1,1,1,1,1,1


In [45]:
df_venues.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Psy,Hospital,Bank
0,Bronx,Wakefield,40.894705,-73.847201,0,0,1
1,Bronx,Co-op City,40.874294,-73.829939,0,0,5
2,Bronx,Eastchester,40.887556,-73.827806,0,0,0
3,Bronx,Fieldston,40.895437,-73.905643,0,0,1
4,Bronx,Riverdale,40.890834,-73.912585,0,1,5


In [49]:
from sklearn.preprocessing import StandardScaler

X = df_venues[['Psy','Hospital','Bank']]
cluster_dataset = StandardScaler().fit_transform(X)


In [51]:
from sklearn.cluster import KMeans

num_clusters = 5

k_means = KMeans(init="k-means++", n_clusters=num_clusters, n_init=12)
k_means.fit(cluster_dataset)
labels = k_means.labels_

print(labels)

[0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0
 0 0 0 0 0 1 1 0 0 0 0 0 3 1 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0 0 0 1 0 3 1 3 3
 1 3 1 2 4 1 1 1 1 1 2 2 2 1 1 3 1 2 1 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0]


In [52]:
df_venues['Cluster'] = labels

In [56]:
df_venues[df_venues['Neighborhood'] == 'Financial District']

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Psy,Hospital,Bank,Cluster
128,Manhattan,Financial District,40.707107,-74.010665,6,7,50,2


In [55]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)
cores = ['DarkRed','FireBrick','LightSalmon','LightGreen','Green']
# add markers to map
for lat, lng, borough, neighborhood, cluster in zip(df_venues['Latitude'], df_venues['Longitude'], df_venues['Borough'], df_venues['Neighborhood'],df_venues['Cluster']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color= cores[cluster],
        fill=True,
        fill_color = cores[cluster],
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

In [57]:
df_venues[df_venues['Neighborhood'] == 'Murray Hill']

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Psy,Hospital,Bank,Cluster
115,Manhattan,Murray Hill,40.748303,-73.978332,38,10,50,4
180,Queens,Murray Hill,40.764126,-73.812763,0,0,0,0


In [59]:
df_venues[df_venues['Neighborhood'] == 'Gramercy']

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Psy,Hospital,Bank,Cluster
126,Manhattan,Gramercy,40.73721,-73.981376,6,28,20,3


Teste do API Open Data New York: 

https://data.cityofnewyork.us/Housing-Development/DOF-Condominium-comparable-rental-income-Manhattan/ikqj-pyhc

In [89]:
teste1 = pd.read_csv('DOF__Condominium_comparable_rental_income___Manhattan_-_FY_2010_2011.csv')
teste1.drop(columns = ['MANHATTAN CONDOMINIUM PROPERTY Building Classification','MANHATTAN CONDOMINIUM PROPERTY Address','Census Tract', 'BIN', 'BBL', 'NTA','MANHATTAN CONDOMINIUM PROPERTY Boro-Block-Lot','MANHATTAN CONDOMINIUM PROPERTY Condo Section','Latitude', 'Longitude', 'Community Board','MANHATTAN CONDOMINIUM PROPERTY Neighborhood','MANHATTAN CONDOMINIUM PROPERTY Year Built','Council District','MANHATTAN CONDOMINIUM PROPERTY Total Units',' COMPARABLE RENTAL 1 Boro-Block-Lot','COMPARABLE RENTAL 1 Address', 'COMPARABLE RENTAL 1  Neighborhood',
       'COMPARABLE RENTAL 1  Building Classification',
       'COMPARABLE RENTAL 1  Total Units', 'COMPARABLE RENTAL 1  Year Built',
       'COMPARABLE RENTAL 1  Gross SqFt',
       'COMPARABLE RENTAL 1  Est. Gross Income',
       'COMPARABLE RENTAL 1  Gross Income per SqFt',
       'COMPARABLE RENTAL 1 Full Market Value',
       'COMPARABLE RENTAL 1  Market Value per SqFt',
       'COMPARABLE RENTAL 1  Dist. from Coop in miles',
       'COMPARABLE RENTAL 2  Boro-Block-Lot', 'COMPARABLE RENTAL 2  Address',
       'COMPARABLE RENTAL 2  Neighborhood',
       'COMPARABLE RENTAL 2  Building Classification',
       'COMPARABLE RENTAL 2  Total Units', 'COMPARABLE RENTAL 2  Year Built',
       'COMPARABLE RENTAL 2  Gross SqFt',
       'COMPARABLE RENTAL 2  Est. Gross Income',
       'COMPARABLE RENTAL 2  Gross Income per SqFt',
       'COMPARABLE RENTAL 2  Full Market Value',
       'COMPARABLE RENTAL 2  Market Value per SqFt',
       'COMPARABLE RENTAL 2  Dist. from Coop in miles','MANHATTAN CONDOMINIUM PROPERTY Gross SqFt',
       'MANHATTAN CONDOMINIUM PROPERTY Est. Gross Income','Borough','MANHATTAN CONDOMINIUM PROPERTY Full Market Value',
       'MANHATTAN CONDOMINIUM PROPERTY Gross Income per SqFt'], axis = 0, inplace = True)
teste1.head()

Unnamed: 0,Postcode,MANHATTAN CONDOMINIUM PROPERTY Market Value per SqFt
0,10004.0,163.0
1,10004.0,221.0
2,10004.0,201.0
3,10280.0,196.0
4,10280.0,196.0


In [90]:
teste1.columns

Index(['Postcode', 'MANHATTAN CONDOMINIUM PROPERTY Market Value per SqFt'], dtype='object')

In [91]:
teste1.describe()

Unnamed: 0,Postcode,MANHATTAN CONDOMINIUM PROPERTY Market Value per SqFt
count,1148.0,1165.0
mean,10027.805749,164.017339
std,36.833371,45.984265
min,10000.0,28.5
25%,10012.0,139.0
50%,10021.0,163.0
75%,10026.0,197.0
max,10280.0,311.0


In [93]:
teste1.rename(columns={"Postcode": "Postcode", "MANHATTAN CONDOMINIUM PROPERTY Market Value per SqFt": "Value/SqFt"})

Unnamed: 0,Postcode,Value/SqFt
0,10004.0,163.0
1,10004.0,221.0
2,10004.0,201.0
3,10280.0,196.0
4,10280.0,196.0
...,...,...
1160,10040.0,120.0
1161,10033.0,64.0
1162,10033.0,104.0
1163,10034.0,67.0


In [96]:
postcode = pd.DataFrame()
postcode = pd.read_html('https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm')
postcode.head()

URLError: <urlopen error [Errno 60] Operation timed out>