## Spatial Analysis of Healthcare Businesses in Los Angeles Neighborhoods

Import libraries for the analysis

In [1]:
import numpy as np # library to handle data in a vectorized manner
 
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes
!conda update -n base -c defaults conda --yes

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe (in pandas >= 1.0, use `from pandas import json_normalize`)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

print('Libraries imported.')


<a id='item1'></a>

## 1. Download and Explore Dataset

Neighborhood geospatial data for Los Angeles is available at: https://usc.data.socrata.com/dataset/Los-Angeles-Neighborhood-Map/r8qd-yxsr

The data can be downloaded as a CSV file, which is then read into a pandas dataframe.

In [2]:
fh = pd.read_csv('la_neighborhoods.csv')

print('Data downloaded!')

Data downloaded!


#### Explore the data

In [3]:
fh.shape

(272, 14)

Let's take a quick look at the data.

In [4]:
fh.head()

Unnamed: 0,set,slug,the_geom,kind,external_i,name,display_na,sqmi,type,name_1,slug_1,latitude,longitude,location
0,L.A. County Neighborhoods (Current),acton,MULTIPOLYGON (((-118.20261747920541 34.5389897...,L.A. County Neighborhood (Current),acton,Acton,Acton L.A. County Neighborhood (Current),39.339109,unincorporated-area,,,-118.16981,34.497355,POINT(34.497355239240846 -118.16981019229348)
1,L.A. County Neighborhoods (Current),adams-normandie,MULTIPOLYGON (((-118.30900800000012 34.0374109...,L.A. County Neighborhood (Current),adams-normandie,Adams-Normandie,Adams-Normandie L.A. County Neighborhood (Curr...,0.80535,segment-of-a-city,,,-118.300208,34.031461,POINT(34.031461499124156 -118.30020800000011)
2,L.A. County Neighborhoods (Current),agoura-hills,MULTIPOLYGON (((-118.76192500000009 34.1682029...,L.A. County Neighborhood (Current),agoura-hills,Agoura Hills,Agoura Hills L.A. County Neighborhood (Current),8.14676,standalone-city,,,-118.759885,34.146736,POINT(34.146736499122795 -118.75988450000015)
3,L.A. County Neighborhoods (Current),agua-dulce,MULTIPOLYGON (((-118.2546773959221 34.55830403...,L.A. County Neighborhood (Current),agua-dulce,Agua Dulce,Agua Dulce L.A. County Neighborhood (Current),31.462632,unincorporated-area,,,-118.317104,34.504927,POINT(34.504926999796837 -118.3171036690717)
4,L.A. County Neighborhoods (Current),alhambra,MULTIPOLYGON (((-118.12174700000014 34.1050399...,L.A. County Neighborhood (Current),alhambra,Alhambra,Alhambra L.A. County Neighborhood (Current),7.623814,standalone-city,,,-118.136512,34.085539,POINT(34.085538999123571 -118.13651200000021)


In [3]:
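# Drop unnecessary columns and rename 'name' to 'neighborhood'.
# Note: in this CSV the 'latitude' column holds longitudes (about -118) and the
# 'longitude' column holds latitudes (about 34), so the renames below swap them back.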

fh.drop(['set', 'slug', 'the_geom', 'kind', 'external_i', 'display_na', 'sqmi', 'type', 'name_1', 'slug_1', 'location'], axis=1, inplace=True)
fh.rename(columns={"name": "neighborhood"}, inplace=True)
fh.rename(columns={'latitude': 'long'}, inplace=True)
fh.rename(columns={'longitude': 'lat'}, inplace=True)

fh.head()

Unnamed: 0,neighborhood,long,lat
0,Acton,-118.16981,34.497355
1,Adams-Normandie,-118.300208,34.031461
2,Agoura Hills,-118.759885,34.146736
3,Agua Dulce,-118.317104,34.504927
4,Alhambra,-118.136512,34.085539
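
Because the source columns are mislabeled, a quick range check can confirm that `lat` and `long` ended up the right way round. The sketch below uses three rows copied from the table above; the bounding box values are a rough assumption for the Los Angeles area, not part of the original analysis.

```python
import pandas as pd

# Sample rows copied from the table above
coords = pd.DataFrame({
    'neighborhood': ['Acton', 'Adams-Normandie', 'Agoura Hills'],
    'long': [-118.16981, -118.300208, -118.759885],
    'lat': [34.497355, 34.031461, 34.146736],
})

# Hypothetical bounding box that roughly encloses Los Angeles County
LAT_RANGE = (32.5, 35.0)
LONG_RANGE = (-119.5, -117.0)

# True only if every value falls inside the expected range
lat_ok = coords['lat'].between(*LAT_RANGE).all()
long_ok = coords['long'].between(*LONG_RANGE).all()
print('latitudes plausible:', lat_ok, '| longitudes plausible:', long_ok)
```

If either check fails on the full dataframe, the two columns are probably still swapped.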


In [6]:
print('The dataframe has {} neighborhoods.'.format(len(fh['neighborhood'].unique())))

The dataframe has 272 neighborhoods.


#### Use the geopy library to get the latitude and longitude values of Los Angeles.

To create an instance of the geocoder, we need to define a user_agent. We will name our agent <em>ca_explorer</em>, as shown below.

In [4]:
address = 'Los Angeles, CA'

geolocator = Nominatim(user_agent="ca_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Los Angeles are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Los Angeles are 34.0536909, -118.2427666.




#### Create a map of Los Angeles with neighborhoods superimposed on top.

In [5]:
# create map of Los Angeles using latitude and longitude values
map_la = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(fh['lat'], fh['long'], fh['neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_la)  # parse_html belongs to folium.Popup, not CircleMarker
    
map_la

Next, we will use the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version
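
The credentials cell itself is not shown, yet later cells rely on `CLIENT_ID`, `CLIENT_SECRET` and `VERSION`. A minimal sketch of one way to define them without hard-coding secrets is given here; the environment variable names are hypothetical, and `20180605` is taken from the `v=` parameter of the request URL displayed later in this notebook.

```python
import os

# Hypothetical environment variable names; export them before starting Jupyter.
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')

# API version date, matching the v= parameter in the request URL shown later
VERSION = '20180605'

if not CLIENT_ID or not CLIENT_SECRET:
    print('Foursquare credentials are not set; API requests will fail.')
```

Reading the values from the environment keeps them out of the notebook file and out of any displayed cell output.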

#### Let's explore the Long Beach neighborhood in Los Angeles.

Find Long Beach's record in the dataframe.

In [7]:
# look up the row index of Long Beach
count = fh.index[fh['neighborhood'] == 'Long Beach'][0]
print(count)

print(fh.loc[count])

143
neighborhood    Long Beach
long              -118.156
lat                33.8066
Name: 143, dtype: object


Get the neighborhood's latitude and longitude values.

In [8]:
neighborhood_latitude = fh.loc[count, 'lat'] # neighborhood latitude value
neighborhood_longitude = fh.loc[count, 'long'] # neighborhood longitude value

neighborhood_name = fh.loc[count, 'neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Long Beach are 33.80658069997873, -118.15606399999999.


#### Now, let's get the top healthcare facilities that are in Long Beach within a radius of 500 meters.

In [9]:
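LIMIT = 100 # maximum number of venues to request
radius = 500 # define radius in meters
CATEGORY_ID = '4bf58dd8d48988d104941735' # Foursquare category ID used to restrict the search to healthcare venues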

# create URL
url = 'https://api.foursquare.com/v2/venues/search?categoryId={}&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CATEGORY_ID,
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL




'https://api.foursquare.com/v2/venues/search?categoryId=4bf58dd8d48988d104941735&client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&v=20180605&ll=33.80658069997873,-118.15606399999999&radius=500&limit=100'

Send the GET request and examine the results

In [10]:
results = requests.get(url).json()
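
The request URL above is assembled with `str.format`. As an alternative sketch, the query string can be built with the standard library's `urlencode`, which escapes each value and keeps the parameters in one dictionary. The helper name `build_search_url` and the placeholder credentials are hypothetical; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_search_url(lat, lng, category_id, client_id, client_secret,
                     version='20180605', radius=500, limit=100):
    """Return a venues/search URL with a properly encoded query string."""
    query = urlencode({
        'categoryId': category_id,
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    })
    return 'https://api.foursquare.com/v2/venues/search?' + query

# Placeholder credentials, for illustration only
example_url = build_search_url(33.8066, -118.1561,
                               '4bf58dd8d48988d104941735', 'my-id', 'my-secret')

# Parse the URL back to confirm the parameters survived encoding
params = parse_qs(urlparse(example_url).query)
print(params['ll'], params['radius'])
```

With `requests`, the same dictionary could instead be passed through the `params=` argument of `requests.get`, which performs the encoding itself.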


In [11]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the JSON and structure it into a *pandas* dataframe.

In [12]:
venues = results['response']['venues']

nearby_venues = json_normalize(venues) # flatten JSON
    
# filter columns
filtered_columns = ['name', 'categories', 'location.lat', 'location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

# drop categories outside the scope of this analysis: veterinarians, pet services, chiropractors,
# eye doctors, medical schools, and non-healthcare places such as shops, offices and military bases
excluded_categories = ['Veterinarian', 'Medical School', 'Chiropractor', 'Military Base',
                       'Cosmetics Shop', 'Shopping Mall', 'Pet Service', 'Optical Shop',
                       'Office', 'Eye Doctor', 'Building']
nearby_venues = nearby_venues[~nearby_venues['categories'].isin(excluded_categories)]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,American Red Cross Long Beach,Medical Lab,33.806815,-118.153517
1,Liberty Pacific Medical Imaging,Medical Center,33.80405,-118.159365
2,Willow Wellness Center - Long Beach Memorial H...,Doctor's Office,33.807352,-118.15956
3,Healthcare Partners Laboratory,Doctor's Office,33.804886,-118.150718
4,"Reischl Physical Therapy, Inc.",Physical Therapist,33.803483,-118.153958


And how many venues remain after filtering?

In [17]:
print('{} venues remain after filtering.'.format(nearby_venues.shape[0]))

21 venues remain after filtering.


<a id='item2'></a>

## 2. Explore Neighborhoods in Los Angeles

#### Create a function to repeat the same process for all the neighborhoods in LA.

In [17]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?categoryId={}&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CATEGORY_ID,
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['venues']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['name'], 
            v['location']['lat'], 
            v['location']['lng'],  
            # guard against venues that carry no category
            v['categories'][0]['name'] if v['categories'] else None) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
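
To make the flattening step concrete, here is a small sketch that mimics what the function does, using made-up API results instead of live requests. The neighborhood names, venues and coordinates are hypothetical. It also shows two edge cases: a venue with an empty `categories` list, and a neighborhood with no venues, which contributes no rows at all. That second case is likely one reason the grouped dataframe later in this notebook has fewer rows than the 272 neighborhoods.

```python
import pandas as pd

# Hypothetical stand-in for the venue lists the API would return per neighborhood
fake_results = {
    'Alpha': [
        {'name': 'Clinic One', 'location': {'lat': 34.01, 'lng': -118.01},
         'categories': [{'name': "Doctor's Office"}]},
        {'name': 'Uncategorized Place', 'location': {'lat': 34.02, 'lng': -118.02},
         'categories': []},
    ],
    'Beta': [],  # no venues within the radius
}

rows = []
for hood, venues in fake_results.items():
    for v in venues:
        categories = v.get('categories', [])
        rows.append({
            'Neighborhood': hood,
            'Venue': v['name'],
            'Venue Latitude': v['location']['lat'],
            'Venue Longitude': v['location']['lng'],
            # guard against venues that carry no category
            'Venue Category': categories[0]['name'] if categories else None,
        })

flat = pd.DataFrame(rows)
print(flat)
print('Neighborhoods with venues:', flat['Neighborhood'].nunique())
```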

In [18]:
# Create new dataframe called losangeles_venues

losangeles_venues = getNearbyVenues(names=fh['neighborhood'],
                                   latitudes=fh['lat'],
                                   longitudes=fh['long']
                                  )



Acton
Adams-Normandie
Agoura Hills
Agua Dulce
Alhambra
Alondra Park
Artesia
Altadena
Angeles Crest
Arcadia
Arleta
Arlington Heights
Athens
Atwater Village
Avalon
Avocado Heights
Azusa
Vermont-Slauson
Baldwin Hills/Crenshaw
Baldwin Park
Bel-Air
Bellflower
Bell Gardens
Green Valley
Bell
Beverly Crest
Beverly Grove
Burbank
Koreatown
Beverly Hills
Beverlywood
Boyle Heights
Bradbury
Brentwood
Broadway-Manchester
Calabasas
Canoga Park
Carson
Carthay
Castaic Canyons
Chatsworth
Castaic
Central-Alameda
Century City
Cerritos
Charter Oak
Chatsworth Reservoir
Chesterfield Square
Cheviot Hills
Chinatown
Citrus
Claremont
Northridge
Commerce
Compton
Cypress Park
La Mirada
Covina
Cudahy
Culver City
Del Aire
Del Rey
Desert View Highlands
Diamond Bar
Downey
Downtown
Duarte
Eagle Rock
East Compton
East Hollywood
East La Mirada
Elizabeth Lake
East Los Angeles
East Pasadena
East San Gabriel
Echo Park
El Monte
El Segundo
El Sereno
Elysian Park
Elysian Valley
Vermont Square
Encino
Exposition Park
Fairfax
Flo

#### Drop other types of services and check the size of the resulting dataframe

In [24]:
# drop categories outside the scope of this analysis: veterinarians, pet services and stores,
# chiropractors, eye doctors, medical schools, spas, and non-healthcare places such as shops and offices
excluded_categories = ['Veterinarian', 'Medical School', 'Chiropractor', 'Military Base',
                       'Cosmetics Shop', 'Shopping Mall', 'Pet Service', 'Optical Shop',
                       'Office', 'Eye Doctor', 'Building', 'Auditorium', 'Pet Store', 'Spa']
losangeles_venues = losangeles_venues[~losangeles_venues['Venue Category'].isin(excluded_categories)]



print(losangeles_venues.shape)
losangeles_venues.head(20)

(1539, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
1,Agoura Hills,34.146736,-118.759885,Garfinkle Family Dental,34.14254,-118.757161,Dentist's Office
2,Agoura Hills,34.146736,-118.759885,Dr. Zak Agoura Hills Dental Care,34.14525,-118.762655,Dentist's Office
4,Agoura Hills,34.146736,-118.759885,Agoura Hills Dentist - John Abajian DDS - Arti...,34.148322,-118.76231,Dentist's Office
6,Agoura Hills,34.146736,-118.759885,revolution in motion,34.147266,-118.754238,Alternative Healer
8,Agoura Hills,34.146736,-118.759885,Agoura Family Dental,34.148322,-118.76231,Dentist's Office
9,Alhambra,34.085539,-118.136512,Kaiser Permanente Cardiology,34.083183,-118.134483,Doctor's Office
10,Alhambra,34.085539,-118.136512,Kings Throne,34.088677,-118.135675,Emergency Room
12,Artesia,33.866896,-118.080101,Gene Humphries DDS,33.867565,-118.081818,Dentist's Office
13,Artesia,33.866896,-118.080101,"Dr. Hamlet Ong, DDS",33.868258,-118.081915,Dentist's Office
14,Artesia,33.866896,-118.080101,Artesia Foot Clinic,33.866817,-118.082149,Doctor's Office


Let's check how many venues were returned for each neighborhood

In [25]:
losangeles_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agoura Hills,5,5,5,5,5,5
Alhambra,2,2,2,2,2,2
Altadena,1,1,1,1,1,1
Arcadia,24,24,24,24,24,24
Arleta,1,1,1,1,1,1
Arlington Heights,4,4,4,4,4,4
Artesia,25,25,25,25,25,25
Atwater Village,14,14,14,14,14,14
Avalon,1,1,1,1,1,1
Avocado Heights,1,1,1,1,1,1


#### Let's find out how many unique categories can be curated from all the returned venues

In [26]:
print('There are {} unique categories.'.format(len(losangeles_venues['Venue Category'].unique())))

There are 20 unique categories.


<a id='item3'></a>

## 3. Analyze Each Neighborhood

In [27]:
# one hot encoding
losangeles_onehot = pd.get_dummies(losangeles_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
losangeles_onehot['Neighborhood'] = losangeles_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [losangeles_onehot.columns[-1]] + list(losangeles_onehot.columns[:-1])
losangeles_onehot = losangeles_onehot[fixed_columns]

losangeles_onehot.head()

Unnamed: 0,Neighborhood,Acupuncturist,Alternative Healer,Assisted Living,Dentist's Office,Doctor's Office,Emergency Room,Gym / Fitness Center,Home Service,Hospital,Hospital Ward,Marijuana Dispensary,Maternity Clinic,Medical Center,Medical Lab,Mental Health Office,Pharmacy,Physical Therapist,Rehab Center,Urgent Care Center,Weight Loss Center
1,Agoura Hills,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Agoura Hills,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Agoura Hills,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,Agoura Hills,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,Agoura Hills,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [28]:
losangeles_onehot.shape

(1539, 21)

#### Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [29]:
losangeles_grouped = losangeles_onehot.groupby('Neighborhood').mean().reset_index()
losangeles_grouped


Unnamed: 0,Neighborhood,Acupuncturist,Alternative Healer,Assisted Living,Dentist's Office,Doctor's Office,Emergency Room,Gym / Fitness Center,Home Service,Hospital,Hospital Ward,Marijuana Dispensary,Maternity Clinic,Medical Center,Medical Lab,Mental Health Office,Pharmacy,Physical Therapist,Rehab Center,Urgent Care Center,Weight Loss Center
0,Agoura Hills,0.0,0.2,0.0,0.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Alhambra,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Altadena,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Arcadia,0.0,0.0,0.0,0.375,0.541667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Arleta,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Arlington Heights,0.0,0.0,0.0,0.25,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Artesia,0.12,0.0,0.0,0.44,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.08,0.0,0.0,0.0,0.0,0.0,0.04,0.0
7,Atwater Village,0.0,0.071429,0.0,0.214286,0.428571,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.071429,0.071429,0.0,0.0,0.0,0.0,0.0,0.0
8,Avalon,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Avocado Heights,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
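
The values in this table are category shares: because each venue is one-hot encoded, the per-neighborhood mean of a column equals the fraction of that neighborhood's venues in that category. The toy example below (made-up neighborhoods `A` and `B`) illustrates this; it is only a sketch of the idea, not a rerun of the analysis.

```python
import pandas as pd

# Made-up venues: neighborhood A has 3 dentists and 1 doctor, B has 1 doctor
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B'],
    'Venue Category': ["Dentist's Office", "Dentist's Office", "Dentist's Office",
                       "Doctor's Office", "Doctor's Office"],
})

# One indicator column per category (cast to float so the mean is numeric)
onehot = pd.get_dummies(toy[['Venue Category']], prefix='', prefix_sep='').astype(float)
onehot.insert(0, 'Neighborhood', toy['Neighborhood'])

# Mean of the indicators = share of each category within the neighborhood
shares = onehot.groupby('Neighborhood').mean()
print(shares)
```

Each row of `shares` sums to 1, so neighborhoods are compared by the mix of their venues rather than by how many venues they have.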


#### Let's confirm the new size

In [30]:
losangeles_grouped.shape

(158, 21)

#### Let's print each neighborhood along with the top 3 most common venues

In [31]:
num_top_venues = 3

for hood in losangeles_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = losangeles_grouped[losangeles_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agoura Hills----
                venue  freq
0    Dentist's Office   0.8
1  Alternative Healer   0.2
2       Acupuncturist   0.0


----Alhambra----
             venue  freq
0  Doctor's Office   0.5
1   Emergency Room   0.5
2    Acupuncturist   0.0


----Altadena----
              venue  freq
0   Doctor's Office   1.0
1     Acupuncturist   0.0
2  Maternity Clinic   0.0


----Arcadia----
              venue  freq
0   Doctor's Office  0.54
1  Dentist's Office  0.38
2    Medical Center  0.08


----Arleta----
                venue  freq
0      Medical Center   1.0
1       Acupuncturist   0.0
2  Alternative Healer   0.0


----Arlington Heights----
              venue  freq
0   Doctor's Office  0.50
1  Dentist's Office  0.25
2    Medical Center  0.25


----Artesia----
              venue  freq
0  Dentist's Office  0.44
1   Doctor's Office  0.32
2     Acupuncturist  0.12


----Atwater Village----
              venue  freq
0   Doctor's Office  0.43
1  Dentist's Office  0.21
2          Hospi

#### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [32]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
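
As a quick illustration of the same logic, the sketch below applies the sort to a single made-up row. The neighborhood name and frequencies are hypothetical. One caveat worth keeping in mind: categories tied at the same frequency (typically 0.0) come out in an order that depends on the sort implementation, so low-ranked entries for neighborhoods with few venues may simply be zero-frequency fillers.

```python
import pandas as pd

# Made-up frequency row in the same layout as losangeles_grouped
toy_row = pd.Series({
    'Neighborhood': 'Toyville',
    "Dentist's Office": 0.6,
    "Doctor's Office": 0.3,
    'Pharmacy': 0.1,
    'Hospital': 0.0,
})

# Skip the neighborhood name, then rank categories by frequency
frequencies = toy_row.iloc[1:].astype(float)
top3 = frequencies.sort_values(ascending=False).index.values[0:3]
print(list(top3))
```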

Now let's create the new dataframe and display the top 5 venues for each neighborhood.

In [102]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = losangeles_grouped['Neighborhood']

for ind in np.arange(losangeles_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(losangeles_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Agoura Hills,Dentist's Office,Alternative Healer,Weight Loss Center,Urgent Care Center,Assisted Living
1,Alhambra,Doctor's Office,Emergency Room,Weight Loss Center,Urgent Care Center,Alternative Healer
2,Altadena,Doctor's Office,Weight Loss Center,Urgent Care Center,Alternative Healer,Assisted Living
3,Arcadia,Doctor's Office,Dentist's Office,Medical Center,Weight Loss Center,Hospital
4,Arleta,Medical Center,Weight Loss Center,Hospital,Alternative Healer,Assisted Living


<a id='item4'></a>

## 4. Cluster Neighborhoods

Run *k*-means to cluster the neighborhoods into 5 clusters.

In [103]:
# set number of clusters
kclusters = 5

losangeles_grouped_clustering = losangeles_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(losangeles_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 3, 3, 4, 2, 4, 4, 4, 2, 0], dtype=int32)

Let's create a new dataframe that includes the cluster as well as the top 5 venues for each neighborhood.

In [104]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

#neighborhoods_venues_sorted.head()

losangeles_merged = fh

# join the cluster labels and top venues onto the neighborhood coordinates in fh
losangeles_merged = losangeles_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='neighborhood')

# keep only neighborhoods that received a cluster label (those with at least one venue)
newdf = losangeles_merged.dropna(subset=['Cluster Labels']).reset_index()

newdf

Unnamed: 0,index,neighborhood,long,lat,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,2,Agoura Hills,-118.759885,34.146736,0.0,Dentist's Office,Alternative Healer,Weight Loss Center,Urgent Care Center,Assisted Living
1,4,Alhambra,-118.136512,34.085539,3.0,Doctor's Office,Emergency Room,Weight Loss Center,Urgent Care Center,Alternative Healer
2,6,Artesia,-118.080101,33.866896,4.0,Dentist's Office,Doctor's Office,Acupuncturist,Medical Center,Urgent Care Center
3,7,Altadena,-118.136239,34.193871,3.0,Doctor's Office,Weight Loss Center,Urgent Care Center,Alternative Healer,Assisted Living
4,9,Arcadia,-118.030419,34.13323,4.0,Doctor's Office,Dentist's Office,Medical Center,Weight Loss Center,Hospital
5,10,Arleta,-118.430757,34.2431,2.0,Medical Center,Weight Loss Center,Hospital,Alternative Healer,Assisted Living
6,11,Arlington Heights,-118.323408,34.04491,4.0,Doctor's Office,Dentist's Office,Medical Center,Weight Loss Center,Hospital
7,13,Atwater Village,-118.262373,34.131066,4.0,Doctor's Office,Dentist's Office,Hospital,Alternative Healer,Medical Lab
8,14,Avalon,-118.327332,33.336954,2.0,Medical Center,Weight Loss Center,Hospital,Alternative Healer,Assisted Living
9,15,Avocado Heights,-118.001261,34.040881,0.0,Dentist's Office,Weight Loss Center,Urgent Care Center,Alternative Healer,Assisted Living


Finally, let's visualize the resulting clusters

In [36]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters (one color per cluster)
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(newdf['lat'], newdf['long'], newdf['neighborhood'], newdf['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    cluster = int(cluster)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<a id='item5'></a>

## 5. Merge with life expectancy data at the census tract level

We will load average life expectancy (in years) at the census tract level and aggregate it by neighborhood. The resulting dataframe will then be merged with our Los Angeles neighborhood dataframe.

In [105]:
life_expdf = pd.read_csv('California life expectancy table.csv')
life_expdf.rename(columns={"TRACT2KX": "Tract Number"}, inplace=True)
life_expdf.drop(['Tract ID', 'STATE2KX', 'CNTY2KX', 'Abridged life table flag'], axis=1, inplace=True)

life_expdf.head()

Unnamed: 0,Tract Number,e(0),se(e(0))
0,400100,87.2,2.5534
1,400200,81.5,1.0952
2,400300,87.2,3.5274
3,400400,82.9,1.8118
4,400500,78.8,2.1647


Merge life_expdf with census tract locations to get neighborhood names.

In [106]:
tractdf = pd.read_csv('Census_Tract_Locations__LA.csv')
tractdf.drop(['GEOID', 'Tract', 'Location', 'Latitude', 'Longitude'], axis=1, inplace=True)
tractdf.rename(columns={"TRACT2KX": "Tract Number", 'Neighborhood': 'neighborhood'}, inplace=True)

tractdf.head()

Unnamed: 0,Tract Number,neighborhood
0,101110,Tujunga
1,101122,Tujunga
2,101210,Tujunga
3,101220,Tujunga
4,101300,Tujunga


Merge both dataframes by census tract number, then calculate the average life expectancy for each neighborhood, creating a new dataframe.

In [107]:
life_expdf = pd.merge(life_expdf, tractdf, on='Tract Number')

avg_LE_grouped = life_expdf.groupby(['neighborhood'])['e(0)'].mean().reset_index()

avg_LE_grouped.head()

Unnamed: 0,neighborhood,e(0)
0,Acton,76.4
1,Adams-Normandie,79.3
2,Agoura Hills,81.766667
3,Agua Dulce,76.7
4,Alhambra,82.82
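
One caveat about this merge: census tract numbers are unique only within a county, and the statewide table had its county column (`CNTY2KX`) dropped before merging, so tracts from other California counties that share a number with a Los Angeles tract may be averaged in. The sketch below shows one possible safeguard on made-up rows: filter to Los Angeles County before merging. It assumes `CNTY2KX` holds the county FIPS code as an integer (37 for Los Angeles County); the values and the neighborhood `Example Heights` are hypothetical.

```python
import pandas as pd

# Made-up rows; column names follow the life expectancy table
life = pd.DataFrame({
    'CNTY2KX': [1, 37, 37],
    'TRACT2KX': [400100, 400100, 101110],
    'e(0)': [87.2, 80.1, 79.4],
})
tracts = pd.DataFrame({
    'Tract Number': [400100, 101110],
    'neighborhood': ['Example Heights', 'Tujunga'],
})

LA_COUNTY_FIPS = 37  # assumption: county code stored as an integer

# Keep only Los Angeles County tracts before matching on tract number
la_only = life[life['CNTY2KX'] == LA_COUNTY_FIPS].rename(columns={'TRACT2KX': 'Tract Number'})

# validate raises an error if a tract number still matches more than one row
merged = la_only.merge(tracts, on='Tract Number', validate='one_to_one')
print(merged[['Tract Number', 'neighborhood', 'e(0)']])
```

Without the filter, tract 400100 in this toy data would match twice and pull the out-of-county value into the neighborhood average.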


## 6. Cluster neighborhoods based on their average life expectancy

In [108]:
# set number of clusters
kclusters = 5

avg_LE_grouped_clustering = avg_LE_grouped.drop('neighborhood', axis=1)

# run k-means clustering
kmeans_le = KMeans(n_clusters=kclusters, random_state=0).fit(avg_LE_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans_le.labels_[0:10] 

array([0, 4, 2, 0, 3, 2, 3, 1, 3, 4], dtype=int32)
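
Note that *k*-means numbers its clusters arbitrarily, so label 0 is not necessarily the lowest group. A possible way to attach names such as "Very low LE" without reading the mapping off the output by hand is to rank the clusters by their centers. The sketch below does this on made-up life expectancy values; the data and variable names are illustrative assumptions, not the notebook's results.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up life expectancy values (years), one per neighborhood
values = np.array([76.0, 76.5, 79.0, 79.5, 81.5,
                   82.0, 83.0, 83.3, 85.0, 85.5]).reshape(-1, 1)

km = KMeans(n_clusters=5, random_state=0, n_init=10).fit(values)

# Cluster labels sorted by cluster center, lowest life expectancy first
order = np.argsort(km.cluster_centers_.ravel())

names = ['Very low LE', 'Low LE', 'Moderate LE', 'High LE', 'Very high LE']
label_to_name = {int(label): names[rank] for rank, label in enumerate(order)}

# Name assigned to each input value
named = [label_to_name[int(label)] for label in km.labels_]
for value, name in zip(values.ravel(), named):
    print(value, name)
```

Because the names follow the cluster centers, the mapping stays correct even if a different random seed or library version renumbers the clusters.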

Finally, merge with the Los Angeles dataframe by neighborhood.

In [109]:
# add clustering labels
avg_LE_grouped.insert(0, 'Cluster Labels LE', kmeans_le.labels_)

print(avg_LE_grouped.head())
#print(avg_LE_grouped.groupby('Cluster Labels LE').mean().reset_index())

losangeles_merged_2 = newdf

# join the life expectancy clusters and averages onto newdf by neighborhood
losangeles_merged_2 = losangeles_merged_2.join(avg_LE_grouped.set_index('neighborhood'), on='neighborhood')

newdf = losangeles_merged_2.dropna(subset=['Cluster Labels LE']).reset_index()

newdf.head(50) 

   Cluster Labels LE     neighborhood       e(0)
0                  0            Acton  76.400000
1                  4  Adams-Normandie  79.300000
2                  2     Agoura Hills  81.766667
3                  0       Agua Dulce  76.700000
4                  3         Alhambra  82.820000


Unnamed: 0,level_0,index,neighborhood,long,lat,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,Cluster Labels LE,e(0)
0,0,2,Agoura Hills,-118.759885,34.146736,0.0,Dentist's Office,Alternative Healer,Weight Loss Center,Urgent Care Center,Assisted Living,2.0,81.766667
1,1,4,Alhambra,-118.136512,34.085539,3.0,Doctor's Office,Emergency Room,Weight Loss Center,Urgent Care Center,Alternative Healer,3.0,82.82
2,2,6,Artesia,-118.080101,33.866896,4.0,Dentist's Office,Doctor's Office,Acupuncturist,Medical Center,Urgent Care Center,4.0,80.066667
3,3,7,Altadena,-118.136239,34.193871,3.0,Doctor's Office,Weight Loss Center,Urgent Care Center,Alternative Healer,Assisted Living,3.0,81.9875
4,4,9,Arcadia,-118.030419,34.13323,4.0,Doctor's Office,Dentist's Office,Medical Center,Weight Loss Center,Hospital,1.0,84.308333
5,5,10,Arleta,-118.430757,34.2431,2.0,Medical Center,Weight Loss Center,Hospital,Alternative Healer,Assisted Living,3.0,83.5125
6,6,11,Arlington Heights,-118.323408,34.04491,4.0,Doctor's Office,Dentist's Office,Medical Center,Weight Loss Center,Hospital,4.0,79.54
7,7,13,Atwater Village,-118.262373,34.131066,4.0,Doctor's Office,Dentist's Office,Hospital,Alternative Healer,Medical Lab,2.0,80.933333
8,8,14,Avalon,-118.327332,33.336954,2.0,Medical Center,Weight Loss Center,Hospital,Alternative Healer,Assisted Living,4.0,79.5
9,9,15,Avocado Heights,-118.001261,34.040881,0.0,Dentist's Office,Weight Loss Center,Urgent Care Center,Alternative Healer,Assisted Living,3.0,83.3


In [110]:
# map each life expectancy cluster label to a descriptive name
# (label numbers are arbitrary; this mapping reflects the labels produced in this run)
le_names = {0: 'Very low LE', 1: 'Very high LE', 2: 'Moderate LE', 3: 'High LE', 4: 'Low LE'}
newdf.insert(0, 'Grouped LE', newdf['Cluster Labels LE'].astype(int).map(le_names))

newdf.head(10)

Unnamed: 0,Grouped LE,level_0,index,neighborhood,long,lat,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,Cluster Labels LE,e(0)
0,Moderate LE,0,2,Agoura Hills,-118.759885,34.146736,0.0,Dentist's Office,Alternative Healer,Weight Loss Center,Urgent Care Center,Assisted Living,2.0,81.766667
1,High LE,1,4,Alhambra,-118.136512,34.085539,3.0,Doctor's Office,Emergency Room,Weight Loss Center,Urgent Care Center,Alternative Healer,3.0,82.82
2,Low LE,2,6,Artesia,-118.080101,33.866896,4.0,Dentist's Office,Doctor's Office,Acupuncturist,Medical Center,Urgent Care Center,4.0,80.066667
3,High LE,3,7,Altadena,-118.136239,34.193871,3.0,Doctor's Office,Weight Loss Center,Urgent Care Center,Alternative Healer,Assisted Living,3.0,81.9875
4,Very high LE,4,9,Arcadia,-118.030419,34.13323,4.0,Doctor's Office,Dentist's Office,Medical Center,Weight Loss Center,Hospital,1.0,84.308333
5,High LE,5,10,Arleta,-118.430757,34.2431,2.0,Medical Center,Weight Loss Center,Hospital,Alternative Healer,Assisted Living,3.0,83.5125
6,Low LE,6,11,Arlington Heights,-118.323408,34.04491,4.0,Doctor's Office,Dentist's Office,Medical Center,Weight Loss Center,Hospital,4.0,79.54
7,Moderate LE,7,13,Atwater Village,-118.262373,34.131066,4.0,Doctor's Office,Dentist's Office,Hospital,Alternative Healer,Medical Lab,2.0,80.933333
8,Low LE,8,14,Avalon,-118.327332,33.336954,2.0,Medical Center,Weight Loss Center,Hospital,Alternative Healer,Assisted Living,4.0,79.5
9,High LE,9,15,Avocado Heights,-118.001261,34.040881,0.0,Dentist's Office,Weight Loss Center,Urgent Care Center,Alternative Healer,Assisted Living,3.0,83.3
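
The cluster-to-label mapping above is hard-coded, so it is worth checking that the labels agree with the life expectancy values they describe. The following is a minimal sketch of such a check, assuming only the ten rows displayed above; `demo`, `expected_order`, and `labels_consistent` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# The ten rows displayed above: (neighborhood, LE cluster id, e(0)).
rows = [
    ('Agoura Hills', 2, 81.766667),
    ('Alhambra', 3, 82.82),
    ('Artesia', 4, 80.066667),
    ('Altadena', 3, 81.9875),
    ('Arcadia', 1, 84.308333),
    ('Arleta', 3, 83.5125),
    ('Arlington Heights', 4, 79.54),
    ('Atwater Village', 2, 80.933333),
    ('Avalon', 4, 79.5),
    ('Avocado Heights', 3, 83.3),
]
demo = pd.DataFrame(rows, columns=['neighborhood', 'Cluster Labels LE', 'e(0)'])

# Same mapping as in the cell above.
le_labels = {0: 'Very low LE', 1: 'Very high LE', 2: 'Moderate LE',
             3: 'High LE', 4: 'Low LE'}
demo['Grouped LE'] = demo['Cluster Labels LE'].map(le_labels)

# Labels listed from lowest to highest life expectancy.
expected_order = ['Very low LE', 'Low LE', 'Moderate LE', 'High LE', 'Very high LE']

# Mean e(0) per label, arranged in the expected order
# (labels absent from this small sample are skipped).
group_means = demo.groupby('Grouped LE')['e(0)'].mean()
present = [g for g in expected_order if g in group_means.index]
ordered_means = group_means.reindex(present)

# If the labels are consistent, the means increase along the ordering.
labels_consistent = bool(ordered_means.is_monotonic_increasing)
print(ordered_means.round(2))
print('labels consistent with means:', labels_consistent)
```

Running the same check on the full `newdf` would cover all five groups, including 'Very low LE', which does not occur in the first ten rows.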


In [79]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, colored by life-expectancy cluster
for lat, lon, poi, cluster, group in zip(newdf['lat'], newdf['long'], newdf['neighborhood'], newdf['Cluster Labels LE'], newdf['Grouped LE']):
    label = folium.Popup(str(poi) + ': ' + str(group), parse_html=True)
    cluster = int(cluster)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,  # fill must be enabled for fill_color and fill_opacity to take effect
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
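
Because the cluster ids are not ordered by life expectancy (cluster 1 is 'Very high LE' while cluster 0 is 'Very low LE'), rainbow colors indexed by cluster id do not read as a low-to-high scale on the map. One possible alternative, sketched below, is to pick the marker color from the 'Grouped LE' label instead. The palette values and the names `le_order`, `le_colors`, and `marker_color` are assumptions made for illustration, not part of the notebook.

```python
# Labels from lowest to highest life expectancy.
le_order = ['Very low LE', 'Low LE', 'Moderate LE', 'High LE', 'Very high LE']

# Illustrative red-to-green hex palette, one color per label (assumed values).
palette = ['#d7191c', '#fdae61', '#ffffbf', '#a6d96a', '#1a9641']
le_colors = dict(zip(le_order, palette))


def marker_color(group, default='#808080'):
    """Return the hex color for an LE group label, or grey if the label is unknown."""
    return le_colors.get(group, default)


print(marker_color('Low LE'))        # color for a known label
print(marker_color('not a label'))   # falls back to the default grey
```

In the marker loop, `marker_color(group)` could then be passed as both `color` and `fill_color` in place of `rainbow[cluster-1]`.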

## Explore Inglewood vs. Beverly Hills

In [111]:
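# set_index returns a new DataFrame, so newdf itself is left unchanged
# (newdf_2 = newdf followed by an in-place set_index would modify both).
newdf_2 = newdf.set_index('neighborhood')
newdf_2.head()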



neighborhood,Grouped LE,level_0,index,long,lat,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,Cluster Labels LE,e(0)
Agoura Hills,Moderate LE,0,2,-118.759885,34.146736,0.0,Dentist's Office,Alternative Healer,Weight Loss Center,Urgent Care Center,Assisted Living,2.0,81.766667
Alhambra,High LE,1,4,-118.136512,34.085539,3.0,Doctor's Office,Emergency Room,Weight Loss Center,Urgent Care Center,Alternative Healer,3.0,82.82
Artesia,Low LE,2,6,-118.080101,33.866896,4.0,Dentist's Office,Doctor's Office,Acupuncturist,Medical Center,Urgent Care Center,4.0,80.066667
Altadena,High LE,3,7,-118.136239,34.193871,3.0,Doctor's Office,Weight Loss Center,Urgent Care Center,Alternative Healer,Assisted Living,3.0,81.9875
Arcadia,Very high LE,4,9,-118.030419,34.13323,4.0,Doctor's Office,Dentist's Office,Medical Center,Weight Loss Center,Hospital,1.0,84.308333


In [115]:
newdf_2.loc['Inglewood']

Grouped LE                         Low LE
level_0                                69
index                                 117
long                             -118.346
lat                               33.9541
Cluster Labels                          1
1st Most Common Venue            Hospital
2nd Most Common Venue      Medical Center
3rd Most Common Venue    Dentist's Office
4th Most Common Venue     Doctor's Office
5th Most Common Venue      Emergency Room
Cluster Labels LE                       4
e(0)                              78.5778
Name: Inglewood, dtype: object

In [116]:
newdf_2.loc['Beverly Hills']

Grouped LE                     Very high LE
level_0                                  20
index                                    29
long                                 -118.4
lat                                 34.0825
Cluster Labels                            3
1st Most Common Venue         Acupuncturist
2nd Most Common Venue       Doctor's Office
3rd Most Common Venue    Urgent Care Center
4th Most Common Venue    Alternative Healer
5th Most Common Venue       Assisted Living
Cluster Labels LE                         1
e(0)                                85.8333
Name: Beverly Hills, dtype: object
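
The two profiles are easier to compare side by side. The sketch below rebuilds them by hand from the values printed above (rounded as displayed) and is only an illustration; `profiles`, `comparison`, and `le_gap` are names introduced here. With the full data, `newdf_2.loc[['Inglewood', 'Beverly Hills']].T` should give the same layout.

```python
import pandas as pd

# Values copied from the two outputs above.
profiles = pd.DataFrame(
    {
        'Grouped LE': ['Low LE', 'Very high LE'],
        '1st Most Common Venue': ['Hospital', 'Acupuncturist'],
        '2nd Most Common Venue': ['Medical Center', "Doctor's Office"],
        '3rd Most Common Venue': ["Dentist's Office", 'Urgent Care Center'],
        '4th Most Common Venue': ["Doctor's Office", 'Alternative Healer'],
        '5th Most Common Venue': ['Emergency Room', 'Assisted Living'],
        'e(0)': [78.5778, 85.8333],
    },
    index=['Inglewood', 'Beverly Hills'],
)

# Transpose so that each neighborhood becomes a column.
comparison = profiles.T
print(comparison)

# Difference in life expectancy at birth between the two neighborhoods.
le_gap = profiles.loc['Beverly Hills', 'e(0)'] - profiles.loc['Inglewood', 'e(0)']
print(f'Life expectancy gap: {le_gap:.2f} years')
```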