<strong> Capstone project - week 5 </strong>

<b> Introduction </b>

Access to health care services is important for maintaining health, preventing and managing disease, and reducing premature death. Primary health services therefore need to be available and used in a timely way. Having the needed health care services close by (geographic availability) is one factor in achieving a successful outcome for ill patients.

<b> Problem: </b> 

A person suddenly falls ill with flu-like symptoms and cannot be sure it is not COVID-19. Given the current health situation, he/she cannot wait for a doctor's appointment, but emergency services or a hospital might not be necessary (yet!). In such an event, would this person have quick access to a primary health service in their borough/neighbourhood to get an initial diagnosis?

<b> Justification: </b> 

Firstly, since I work in healthcare, this issue is important to me.
Secondly, the person may just have a cold or the flu, since it is flu season after all. However, the country/state does not want everybody with flu-like symptoms to overwhelm emergency services and hospitals. It is therefore important to know where the primary care providers are in each borough/neighbourhood, and to ensure that residents have access to those facilities first. They can then be transferred to hospitals or emergency services if necessary.

<b> Aim: </b> 

The aim of this small study is to determine whether residents of a borough have quick access to primary health services in their own neighbourhood.

In [1]:
# Install libraries 
import numpy as np # library to handle data in a vectorized manner

import json # library to handle JSON files

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

!pip -q install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

! pip install geocoder # import geocoder to find latitude and longitude values
import geocoder 

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

!pip -q install folium
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [2]:
# get new york data
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

#load data
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
    
newyork_data

Data downloaded!


{'type': 'FeatureCollection',
 'totalFeatures': 306,
 'features': [{'type': 'Feature',
   'id': 'nyu_2451_34572.1',
   'geometry': {'type': 'Point',
    'coordinates': [-73.84720052054902, 40.89470517661]},
   'geometry_name': 'geom',
   'properties': {'name': 'Wakefield',
    'stacked': 1,
    'annoline1': 'Wakefield',
    'annoline2': None,
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.84720052054902,
     40.89470517661,
     -73.84720052054902,
     40.89470517661]}},
  {'type': 'Feature',
   'id': 'nyu_2451_34572.2',
   'geometry': {'type': 'Point',
    'coordinates': [-73.82993910812398, 40.87429419303012]},
   'geometry_name': 'geom',
   'properties': {'name': 'Co-op City',
    'stacked': 2,
    'annoline1': 'Co-op',
    'annoline2': 'City',
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.82993910812398,
     40.87429419303012,
     -73.82993910812398,
     40.87429419303012]}},
  {'type': 'Feature',
 

In [3]:
# check features of data
neighborhoods_data = newyork_data['features']
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

In [4]:
# transform data to pandas
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# collect one row per neighbourhood, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON coordinates are ordered [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)

print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.
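
The same table can also be built without an explicit loop. The following is a minimal sketch, not the method used in this notebook: it relies on `pandas.json_normalize`, which flattens nested records into dotted column names. `sample_features` is a hand-copied two-record subset of the dataset shown above, and `features_to_frame` is a hypothetical helper name.

```python
import pandas as pd

# Two records copied from the dataset output above (trimmed to the fields used here).
sample_features = [
    {'type': 'Feature',
     'geometry': {'type': 'Point',
                  'coordinates': [-73.84720052054902, 40.89470517661]},
     'properties': {'name': 'Wakefield', 'borough': 'Bronx'}},
    {'type': 'Feature',
     'geometry': {'type': 'Point',
                  'coordinates': [-73.82993910812398, 40.87429419303012]},
     'properties': {'name': 'Co-op City', 'borough': 'Bronx'}},
]


def features_to_frame(features):
    """Flatten GeoJSON point features into Borough/Neighborhood/Latitude/Longitude."""
    flat = pd.json_normalize(features)  # nested keys become 'properties.name', etc.
    coords = flat['geometry.coordinates']  # each entry is [longitude, latitude]
    return pd.DataFrame({
        'Borough': flat['properties.borough'],
        'Neighborhood': flat['properties.name'],
        'Latitude': coords.map(lambda xy: xy[1]),
        'Longitude': coords.map(lambda xy: xy[0]),
    })


sample_frame = features_to_frame(sample_features)
print(sample_frame)
```

Calling `features_to_frame(neighborhoods_data)` on the full list should give the same four columns as the loop.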


In [5]:
neighborhoods['Borough'].unique()

array(['Bronx', 'Manhattan', 'Brooklyn', 'Queens', 'Staten Island'],
      dtype=object)

In [6]:
#check out New York
# use geopy to get latitude/longitude data
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.
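
geopy's `geocode` returns `None` when it finds no match, in which case `location.latitude` raises an `AttributeError`. The sketch below shows one way to guard against that. `extract_coordinates` is a hypothetical helper, and the `SimpleNamespace` object only stands in for a geopy location so the example runs without a network call.

```python
from types import SimpleNamespace


def extract_coordinates(location, fallback=None):
    """Return (latitude, longitude) from a geocoder result, or a fallback pair."""
    if location is None:
        if fallback is None:
            raise ValueError('Address could not be geocoded and no fallback was given.')
        return fallback
    return location.latitude, location.longitude


# Stand-in for a successful geocoder result (values from the output above).
found = SimpleNamespace(latitude=40.7127281, longitude=-74.0060152)
print(extract_coordinates(found))

# Stand-in for a failed lookup: the fallback pair is used instead.
print(extract_coordinates(None, fallback=(40.7127281, -74.0060152)))
```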


In [7]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers of neighbourhood/borough to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

In [8]:
# check borough manhattan
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


In [9]:
# use geopy to get latitude/longitude data
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7896239, -73.9598939.


In [10]:
# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers for neighbourhood to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

<b> Part 1: General idea of where medical health services are </b>

Use Foursquare API to explore the neighborhoods for segmentation

In [11]:
#@hidden cell
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version


In [12]:
# explore first neighbourhood in manhattan
manhattan_data.loc[0, 'Neighborhood']

neighborhood_latitude = manhattan_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = manhattan_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = manhattan_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Marble Hill are 40.87655077879964, -73.91065965862981.


In [13]:
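# find venues/results around the Manhattan centre (latitude/longitude set in In [9])
radius = 5000 # search radius in metres
LIMIT = 10000 # maximum number of venues requested
cat_ID = '4bf58dd8d48988d104941735' # medical center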

#url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}&categoryId={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT, cat_ID)

results = requests.get(url).json()
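
Long `format` strings make it easy to misplace a parameter. As an alternative, the query string can be built from a dictionary. The sketch below uses only the standard library and the parameter names already present in the URL above. `build_explore_url` is a hypothetical helper, and the `'ID'` and `'SECRET'` values are placeholders.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, lat, lng, version,
                      radius, limit, category_id):
    """Assemble the explore URL from named parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude
        'radius': radius,
        'limit': limit,
        'categoryId': category_id,
    }
    # urlencode percent-encodes the comma in 'll' as %2C
    return EXPLORE_ENDPOINT + '?' + urlencode(params)


example_url = build_explore_url('ID', 'SECRET', 40.7896239, -73.9598939,
                                '20180605', 5000, 100,
                                '4bf58dd8d48988d104941735')
print(example_url)
```

The same dictionary could instead be passed to `requests.get(EXPLORE_ENDPOINT, params=params)`, which does the encoding itself.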


In [14]:
# function that extracts the category of the venue
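def get_category_type(row):
    # the key is 'categories' for a raw venue and 'venue.categories' for a flattened row
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # a venue without any category yields None
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']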
    
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

100 venues were returned by Foursquare.


In [15]:
# function to explore neighbourhood
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&categoryId={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT,
            cat_ID)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
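
The list comprehension in `getNearbyVenues` indexes `v['venue']['categories'][0]`, which raises an `IndexError` for any venue returned without a category. The following is a minimal sketch of a more defensive row builder. `venue_rows` is a hypothetical helper, and the second sample item is made up to exercise the empty-category case.

```python
def venue_rows(name, lat, lng, items):
    """Turn API result items into row tuples, tolerating venues without categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


sample_items = [
    # shaped like an API item, using values from the Marble Hill results
    {'venue': {'name': 'USA Vascular Centers',
               'location': {'lat': 40.875152, 'lng': -73.909602},
               'categories': [{'name': 'Medical Center'}]}},
    # made-up item with no category, for illustration only
    {'venue': {'name': 'Uncategorised venue (made up)',
               'location': {'lat': 40.876, 'lng': -73.910},
               'categories': []}},
]

rows = venue_rows('Marble Hill', 40.876551, -73.91066, sample_items)
for row in rows:
    print(row)
```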

In [16]:
# explore manhattan
manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards


In [17]:
print(manhattan_venues.shape)
manhattan_venues.head(20)

(2721, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.91066,NewYork-Presbyterian/The Allen Hospital,40.873327,-73.913051,Hospital
1,Marble Hill,40.876551,-73.91066,USA Vascular Centers,40.875152,-73.909602,Medical Center
2,Marble Hill,40.876551,-73.91066,Riverdale Audiology,40.879582,-73.908073,Medical Center
3,Marble Hill,40.876551,-73.91066,USA Vein Clinics,40.875314,-73.909535,Doctor's Office
4,Marble Hill,40.876551,-73.91066,USA Fibroid Centers,40.87576,-73.908281,Doctor's Office
5,Marble Hill,40.876551,-73.91066,St. Joseph's Imaging Center,40.879963,-73.907951,Medical Center
6,Marble Hill,40.876551,-73.91066,Advanced Endoscopy Center,40.87677,-73.906488,Doctor's Office
7,Marble Hill,40.876551,-73.91066,Riverdale Family Practice,40.879941,-73.907996,Doctor's Office
8,Marble Hill,40.876551,-73.91066,Columbia Doctors,40.879582,-73.908073,Doctor's Office
9,Marble Hill,40.876551,-73.91066,Montefiore Medical Group-Marble Hill,40.877639,-73.906119,Doctor's Office
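
Since the aim is quick access, it can help to check how far a venue actually is from the neighbourhood centre used in the query. The sketch below applies the haversine formula to the coordinates in the first row above. `haversine_km` is a hypothetical helper, and the Earth radius of 6371 km is the usual mean-radius approximation.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


marble_hill = (40.876551, -73.91066)      # neighbourhood centre
allen_hospital = (40.873327, -73.913051)  # NewYork-Presbyterian/The Allen Hospital

distance_km = haversine_km(*marble_hill, *allen_hospital)
print('{:.2f} km'.format(distance_km))
```

Applied to every row, this would give a straight-line distance column. Walking or driving distance would need a routing service, which is outside the scope of this notebook.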


In [18]:
# check size of data/categories
print(manhattan_venues.shape)

print('There are {} unique categories.'.format(len(manhattan_venues['Venue Category'].unique())))

(2721, 7)
There are 36 unique categories.


In [19]:
# keep only venues whose category contains 'Doctor'
manhattan_doctor = manhattan_venues[manhattan_venues['Venue Category'].str.contains('Doctor')]
manhattan_doctor.head(20)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
3,Marble Hill,40.876551,-73.91066,USA Vein Clinics,40.875314,-73.909535,Doctor's Office
4,Marble Hill,40.876551,-73.91066,USA Fibroid Centers,40.87576,-73.908281,Doctor's Office
6,Marble Hill,40.876551,-73.91066,Advanced Endoscopy Center,40.87677,-73.906488,Doctor's Office
7,Marble Hill,40.876551,-73.91066,Riverdale Family Practice,40.879941,-73.907996,Doctor's Office
8,Marble Hill,40.876551,-73.91066,Columbia Doctors,40.879582,-73.908073,Doctor's Office
9,Marble Hill,40.876551,-73.91066,Montefiore Medical Group-Marble Hill,40.877639,-73.906119,Doctor's Office
10,Marble Hill,40.876551,-73.91066,"Sahadeo D. Ramnauth, MD",40.873348,-73.913017,Doctor's Office
11,Marble Hill,40.876551,-73.91066,Riverdale Eye Associates,40.878632,-73.914774,Doctor's Office
12,Marble Hill,40.876551,-73.91066,"Wolfeld Plastic Surgery, LLC",40.878657,-73.914814,Doctor's Office
13,Marble Hill,40.876551,-73.91066,Riverdale Pediatrics,40.879202,-73.914397,Doctor's Office


In [20]:
# keep only venues whose category contains 'Clinic'
manhattan_clinic = manhattan_venues[manhattan_venues['Venue Category'].str.contains('Clinic')]
manhattan_clinic.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
283,Central Harlem,40.815976,-73.943211,Birthing Center,40.814285,-73.94008,Maternity Clinic
1022,Murray Hill,40.748303,-73.978332,Noble Fertility Center,40.748035,-73.978149,Maternity Clinic


In [21]:
# keep only venues whose category contains 'Medical'
manhattan_medical = manhattan_venues[manhattan_venues['Venue Category'].str.contains('Medical')]
manhattan_medical.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
1,Marble Hill,40.876551,-73.91066,USA Vascular Centers,40.875152,-73.909602,Medical Center
2,Marble Hill,40.876551,-73.91066,Riverdale Audiology,40.879582,-73.908073,Medical Center
5,Marble Hill,40.876551,-73.91066,St. Joseph's Imaging Center,40.879963,-73.907951,Medical Center
34,Chinatown,40.715618,-73.994279,CP Advanced Imaging,40.716596,-73.996254,Medical Center
35,Chinatown,40.715618,-73.994279,Health Trail,40.71487,-73.998276,Medical Center
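
Substring filters can pick up categories that are not intended. The 'Clinic' filter above, for example, returned only maternity clinics. The sketch below matches exact category names with `isin` instead. The sample rows are copied from the outputs above, and treating "Doctor's Office" and "Medical Center" as the primary-care categories is an assumption in line with the columns kept in Part 2.

```python
import pandas as pd

# Small sample copied from the venue tables above.
sample_venues = pd.DataFrame({
    'Venue': ['USA Vascular Centers',
              'Riverdale Family Practice',
              'Birthing Center',
              'NewYork-Presbyterian/The Allen Hospital'],
    'Venue Category': ['Medical Center',
                       "Doctor's Office",
                       'Maternity Clinic',
                       'Hospital'],
})

# Assumption: these two categories stand in for primary care.
PRIMARY_CARE_CATEGORIES = ["Doctor's Office", 'Medical Center']

primary_care = sample_venues[
    sample_venues['Venue Category'].isin(PRIMARY_CARE_CATEGORIES)
].reset_index(drop=True)

print(primary_care)
```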


In [22]:
# create map of possible health services in Manhattan using latitude and longitude values
map_manhattan_med = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers of manhattan_doctor to map
for lat, lng, label in zip(manhattan_doctor['Venue Latitude'], manhattan_doctor['Venue Longitude'], manhattan_doctor['Venue']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.3,
        parse_html=False).add_to(map_manhattan_med)  
    
# add markers of manhattan_medical to map
for lat, lng, label in zip(manhattan_medical['Venue Latitude'], manhattan_medical['Venue Longitude'], manhattan_medical['Venue']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=2,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3db7e4',
        fill_opacity=0.1,
        parse_html=False).add_to(map_manhattan_med)  
    
map_manhattan_med

<b> Part 2: </b> Aggregating medical health services within clusters (neighbourhoods) and analysing each neighbourhood

In [23]:
# one hot encoding
manhattan_onehot = pd.get_dummies(manhattan_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
manhattan_onehot['Neighborhood'] = manhattan_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

manhattan_grouped = manhattan_onehot.groupby('Neighborhood').sum().reset_index()
manhattan_grouped = manhattan_grouped[['Neighborhood','Doctor\'s Office','Medical Center']]
manhattan_grouped.head()


Unnamed: 0,Neighborhood,Doctor's Office,Medical Center
0,Battery Park City,8,5
1,Carnegie Hill,85,13
2,Central Harlem,8,2
3,Chelsea,55,19
4,Chinatown,58,13
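
One-hot encoding followed by a grouped sum is a count of venues per neighbourhood and category. As a cross-check, `pandas.crosstab` produces the same counts in one call. The sketch below compares both routes on a small hand-made sample. The category mix in the sample is illustrative, not taken from the real data.

```python
import pandas as pd

# Illustrative sample: neighbourhood and category of six venues.
sample = pd.DataFrame({
    'Neighborhood': ['Marble Hill', 'Marble Hill', 'Marble Hill',
                     'Marble Hill', 'Chinatown', 'Chinatown'],
    'Venue Category': ['Hospital', 'Medical Center', "Doctor's Office",
                       "Doctor's Office", 'Medical Center', 'Medical Center'],
})

# Route 1: one-hot encode, then sum per neighbourhood (as in the cell above).
onehot = pd.get_dummies(sample[['Venue Category']], prefix='', prefix_sep='')
onehot['Neighborhood'] = sample['Neighborhood']
via_onehot = onehot.groupby('Neighborhood').sum()

# Route 2: count category occurrences per neighbourhood directly.
via_crosstab = pd.crosstab(sample['Neighborhood'], sample['Venue Category'])

print(via_crosstab)
```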


In [24]:
#manhattan_med = [manhattan_doctor, manhattan_medical]
#manhattan_med = pd.concat(manhattan_med,ignore_index=True)
#manhattan_med.head()
#manhattan_grouped = manhattan_med.groupby('Neighborhood').count()
#manhattan_grouped.head() 

In [25]:
# rank each neighbourhood according to number
manhattan_grouped_rank = manhattan_grouped.sort_values(by=['Doctor\'s Office'],ascending=False) 
manhattan_grouped_rank.head()

Unnamed: 0,Neighborhood,Doctor's Office,Medical Center
28,Roosevelt Island,87,12
1,Carnegie Hill,85,13
16,Lenox Hill,82,16
35,Upper East Side,77,21
18,Little Italy,73,14


In [26]:
# sort each neighbourhood in alphabetical order
manhattan_grouped_alpha = manhattan_grouped.sort_values(by=['Neighborhood'],ascending=True) 
manhattan_grouped_alpha.head()


Unnamed: 0,Neighborhood,Doctor's Office,Medical Center
0,Battery Park City,8,5
1,Carnegie Hill,85,13
2,Central Harlem,8,2
3,Chelsea,55,19
4,Chinatown,58,13


In [27]:
# Download new york population data
#!wget -q -O 'swpk-hqdp.json' https://data.cityofnewyork.us/resource
#with open('swpk-hqdp.json') as json_file:
#    ny_data = json.load(json_file)
#print('Data downloaded!')

In [28]:
ny_pop_data = pd.read_csv('https://data.cityofnewyork.us/resource/swpk-hqdp.csv')
ny_pop_data.rename(columns={'nta_name':'Neighborhood'},inplace=True)
ny_pop_data

Unnamed: 0,borough,year,fips_county_code,nta_code,Neighborhood,population
0,Bronx,2000,5,BX01,Claremont-Bathgate,28149
1,Bronx,2000,5,BX03,Eastchester-Edenwald-Baychester,35422
2,Bronx,2000,5,BX05,Bedford Park-Fordham North,55329
3,Bronx,2000,5,BX06,Belmont,25967
4,Bronx,2000,5,BX07,Bronxdale,34309
5,Bronx,2000,5,BX08,West Farms-Bronx River,34542
6,Bronx,2000,5,BX09,Soundview-Castle Hill-Clason Point-Harding Park,50753
7,Bronx,2000,5,BX10,Pelham Bay-Country Club-City Island,27140
8,Bronx,2000,5,BX13,Co-Op City,40676
9,Bronx,2000,5,BX14,East Concourse-Concourse Village,58961


In [29]:
# only select year 2010
manhattan_pop_2010 = ny_pop_data[ny_pop_data['year'].astype("str").str.contains("2010")]
#manhattan_pop_2010 = manhattan_pop_2010[['borough','Neighborhood','population']]
manhattan_pop_2010.head()


Unnamed: 0,borough,year,fips_county_code,nta_code,Neighborhood,population
195,Bronx,2010,5,BX01,Claremont-Bathgate,31078
196,Bronx,2010,5,BX03,Eastchester-Edenwald-Baychester,34517
197,Bronx,2010,5,BX05,Bedford Park-Fordham North,54415
198,Bronx,2010,5,BX06,Belmont,27378
199,Bronx,2010,5,BX07,Bronxdale,35538


In [30]:
# only select Manhattan
manhattan_pop = manhattan_pop_2010[manhattan_pop_2010['borough'].str.contains('Manhattan')].reset_index()
manhattan_pop = manhattan_pop[['borough','Neighborhood','population']]
manhattan_pop.head()

Unnamed: 0,borough,Neighborhood,population
0,Manhattan,Marble Hill-Inwood,46746
1,Manhattan,Central Harlem North-Polo Grounds,75282
2,Manhattan,Hamilton Heights,48520
3,Manhattan,Manhattanville,22950
4,Manhattan,Morningside Heights,55929


In [31]:
# sort each neighbourhood in alphabetical order
manhattan_pop_alpha = manhattan_pop.sort_values(by=['Neighborhood'],ascending=True) 
manhattan_pop_alpha.head()

Unnamed: 0,borough,Neighborhood,population
17,Manhattan,Battery Park City-Lower Manhattan,39699
1,Manhattan,Central Harlem North-Polo Grounds,75282
5,Manhattan,Central Harlem South,43383
18,Manhattan,Chinatown,47844
9,Manhattan,Clinton,45884


In [32]:
# Unfortunately, the neighbourhood names in the population data do not match the neighbourhood names used with Foursquare.
# So some neighbourhood populations have to be ignored or combined.
#test1=manhattan_pop_alpha[manhattan_pop_alpha['Neighborhood'].str.contains('Central Harlem', case=False)].sum().to_frame().transpose()
#test1

#sum_ = manhattan_pop_alpha.loc[manhattan_pop_alpha['Neighborhood'].str.contains('Central Harlem', case=False)].sum().to_frame().transpose()                            
#sum_=sum_.replace({'borough': 'ManhattanManhattan'}, 'Manhattan')
##sum_=sum_.replace({'Neighbourhood': r'(^.*Central.*$)'}, {'Neighbourhood': 'Central Harlem'},regex=True)
#sum_.loc[sum_['Neighborhood'].str.contains('Central Harlem', case=False), 'Neighborhood'] = 'Central Harlem'
#sum_

# relabel the split population areas so that they share one neighbourhood name
manhattan_pop_alpha.loc[manhattan_pop_alpha['Neighborhood'].str.contains('Central Harlem', case=False), 'Neighborhood'] = 'Central Harlem'
manhattan_pop_alpha.loc[manhattan_pop_alpha['Neighborhood'].str.contains('East Harlem', case=False), 'Neighborhood'] = 'East Harlem'
manhattan_pop_alpha.loc[manhattan_pop_alpha['Neighborhood'].str.contains('Washington Heights', case=False), 'Neighborhood'] = 'Washington Heights'
# sum the population of areas that now share a name
test1 = manhattan_pop_alpha.groupby('Neighborhood', sort=False)[['population']].sum()
test1


Neighborhood,population
Battery Park City-Lower Manhattan,39699
Central Harlem,118665
Chinatown,47844
Clinton,45884
East Harlem,115921
East Village,44136
Gramercy,27988
Hamilton Heights,48520
Hudson Yards-Chelsea-Flat Iron-Union Square,70150
Lenox Hill-Roosevelt Island,80771
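
Several population areas combine two or more neighbourhoods in one hyphenated name. One possible way to bring them closer to the neighbourhood list is to split such names and share the population between the parts. The sketch below is not what this notebook does. The equal split is a strong assumption, `split_nta_population` is a hypothetical helper, and names that contain a genuine hyphen (such as Co-Op City) or are spelled differently (Flat Iron versus Flatiron) would still need manual handling.

```python
import pandas as pd

# Three rows copied from the population outputs above.
nta_pop = pd.DataFrame({
    'Neighborhood': ['Marble Hill-Inwood',
                     'Hamilton Heights',
                     'Lenox Hill-Roosevelt Island'],
    'population': [46746, 48520, 80771],
})


def split_nta_population(frame):
    """Split hyphenated area names and divide the population equally (assumption)."""
    rows = []
    for nta_name, population in zip(frame['Neighborhood'], frame['population']):
        parts = nta_name.split('-')
        for part in parts:
            rows.append({'Neighborhood': part,
                         'population': population / len(parts)})
    return pd.DataFrame(rows)


split_pop = split_nta_population(nta_pop)
print(split_pop)
```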


In [33]:
# The only foreseeable way forward (for now) is to use the mean population across the neighbourhoods
mean = manhattan_pop_alpha['population'].mean()
mean


54685.275862068964

In [34]:
# average number of doctor's offices / medical centres per 1,000 residents
manhattan1000 = manhattan_grouped.copy() # work on a copy so manhattan_grouped keeps its raw counts
mean_1000 = mean/1000
manhattan1000[['Doctor\'s Office','Medical Center']] = manhattan1000[['Doctor\'s Office','Medical Center']].div(mean_1000)
manhattan1000.head()
manhattan1000_rank = manhattan1000.sort_values(by=['Doctor\'s Office'],ascending=False) 
manhattan1000_rank.head()

Unnamed: 0,Neighborhood,Doctor's Office,Medical Center
28,Roosevelt Island,1.590922,0.219437
1,Carnegie Hill,1.554349,0.237724
16,Lenox Hill,1.49949,0.292583
35,Upper East Side,1.408057,0.384016
18,Little Italy,1.334911,0.25601


In [35]:
# recalculate manhattan_grouped
# one hot encoding
manhattan_onehot = pd.get_dummies(manhattan_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
manhattan_onehot['Neighborhood'] = manhattan_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

manhattan_grouped = manhattan_onehot.groupby('Neighborhood').sum().reset_index()
manhattan_grouped = manhattan_grouped[['Neighborhood','Doctor\'s Office','Medical Center']]
manhattan_grouped.head()


Unnamed: 0,Neighborhood,Doctor's Office,Medical Center
0,Battery Park City,8,5
1,Carnegie Hill,85,13
2,Central Harlem,8,2
3,Chelsea,55,19
4,Chinatown,58,13


In [36]:
manhattan_grouped.head()

Unnamed: 0,Neighborhood,Doctor's Office,Medical Center
0,Battery Park City,8,5
1,Carnegie Hill,85,13
2,Central Harlem,8,2
3,Chelsea,55,19
4,Chinatown,58,13


In [37]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

manhattan_grouped_clustering = manhattan_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 3, 0, 4, 4, 1, 0, 0, 2, 1], dtype=int32)
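
The number of clusters is fixed at 5 here. A common way to sanity-check that choice is to compare the within-cluster sum of squares (`inertia_`) for several values of k and look for the point where the improvement levels off. The sketch below does this on a subset of the (Doctor's Office, Medical Center) counts reported in this notebook. It assumes scikit-learn is installed and is not part of the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

# (Doctor's Office, Medical Center) counts for a subset of neighbourhoods.
counts = np.array([
    [14, 3], [14, 5], [6, 2], [2, 5], [8, 2],
    [60, 24], [60, 21], [62, 29],
    [23, 6], [27, 6], [22, 9],
    [77, 21], [82, 16], [87, 12],
    [58, 13], [59, 12], [55, 19],
])

k_values = range(1, 6)
inertias = []
for k in k_values:
    # n_init is set explicitly so the result does not depend on the library default
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(counts)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```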

In [38]:
# add clustering labels
manhattan_grouped.insert(0, 'Cluster Labels', kmeans.labels_)

manhattan_merged = manhattan_data
manhattan_merged = manhattan_merged.join(manhattan_grouped.set_index('Neighborhood'), on='Neighborhood')

manhattan_merged.head() # check the last columns!

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,Doctor's Office,Medical Center
0,Manhattan,Marble Hill,40.876551,-73.91066,0,14,3
1,Manhattan,Chinatown,40.715618,-73.994279,4,58,13
2,Manhattan,Washington Heights,40.851903,-73.9369,2,23,6
3,Manhattan,Inwood,40.867684,-73.92121,0,14,5
4,Manhattan,Hamilton Heights,40.823604,-73.949688,0,6,2


In [39]:
# rank neighborhood based on doctor's office
manhattan_merged_rank = manhattan_grouped.sort_values(by=['Doctor\'s Office'],ascending=False) 
manhattan_merged_rank['Total'] = manhattan_merged_rank.loc[:,['Doctor\'s Office','Medical Center']].sum(axis=1)
manhattan_merged_rank.head()

Unnamed: 0,Cluster Labels,Neighborhood,Doctor's Office,Medical Center,Total
28,3,Roosevelt Island,87,12,99
1,3,Carnegie Hill,85,13,98
16,3,Lenox Hill,82,16,98
35,3,Upper East Side,77,21,98
18,3,Little Italy,73,14,87


In [40]:
# rank neighborhood based on cluster, check with doctor's office
manhattan_merged_cluster = manhattan_merged.sort_values(by=['Cluster Labels'],ascending=False)
manhattan_merged_cluster.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,Doctor's Office,Medical Center
33,Manhattan,Midtown South,40.74851,-73.988713,4,52,25
1,Manhattan,Chinatown,40.715618,-73.994279,4,58,13
18,Manhattan,Greenwich Village,40.726933,-73.999914,4,49,20
17,Manhattan,Chelsea,40.744035,-74.003116,4,55,19
15,Manhattan,Midtown,40.754691,-73.981669,4,59,12


In [41]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# visualise results
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_merged['Latitude'], manhattan_merged['Longitude'], manhattan_merged['Neighborhood'], manhattan_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Examine clusters

In [42]:
# cluster 1
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 0, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Doctor's Office,Medical Center
0,Marble Hill,14,3
3,Inwood,14,5
4,Hamilton Heights,6,2
5,Manhattanville,2,5
6,Central Harlem,8,2
7,East Harlem,6,4
14,Clinton,8,8
20,Lower East Side,6,2
21,Tribeca,5,4
24,West Village,11,1


In [43]:
# cluster 2
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 1, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Doctor's Office,Medical Center
16,Murray Hill,60,24
26,Morningside Heights,60,21
27,Gramercy,62,29
29,Financial District,61,30
32,Civic Center,65,20
36,Tudor City,60,30
38,Flatiron,63,30


In [44]:
# cluster 3
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 2, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Doctor's Office,Medical Center
2,Washington Heights,23,6
9,Yorkville,27,6
12,Upper West Side,22,9
19,East Village,29,9
35,Turtle Bay,33,13


In [45]:
#cluster 4
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 3, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Doctor's Office,Medical Center
8,Upper East Side,77,21
10,Lenox Hill,82,16
11,Roosevelt Island,87,12
13,Lincoln Square,72,14
22,Little Italy,73,14
30,Carnegie Hill,85,13


In [46]:
# cluster 5
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 4, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Doctor's Office,Medical Center
1,Chinatown,58,13
15,Midtown,59,12
17,Chelsea,55,19
18,Greenwich Village,49,20
23,Soho,44,16
33,Midtown South,52,25
34,Sutton Place,56,17
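
A per-cluster average makes the differences between clusters easier to compare than the raw listings. The sketch below summarises two of the clusters, using the rows printed above for cluster labels 2 and 3. The other clusters are left out only to keep the example short.

```python
import pandas as pd

# Rows copied from the cluster listings above (labels 2 and 3).
clusters = pd.DataFrame({
    'Cluster Labels': [2] * 5 + [3] * 6,
    'Neighborhood': ['Washington Heights', 'Yorkville', 'Upper West Side',
                     'East Village', 'Turtle Bay',
                     'Upper East Side', 'Lenox Hill', 'Roosevelt Island',
                     'Lincoln Square', 'Little Italy', 'Carnegie Hill'],
    "Doctor's Office": [23, 27, 22, 29, 33, 77, 82, 87, 72, 73, 85],
    'Medical Center': [6, 6, 9, 9, 13, 21, 16, 12, 14, 14, 13],
})

# Mean number of facilities per neighbourhood in each cluster.
summary = (clusters
           .groupby('Cluster Labels')[["Doctor's Office", 'Medical Center']]
           .mean()
           .round(1))
print(summary)
```

On the full `manhattan_merged` frame the same `groupby` call would cover all five clusters.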
