# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>

This project will try to find best possible locations for opening a new restaurant **of any type** in **Los Angeles**. There are already many restaurants in Los Angeles, so the analysis will focus less dense areas of restaurants with medium or high population density. The neighborhoods surrounding our candidate locations should help provide insights into which type of restaurants might be successful, both from price point and cuisine perspective.

The analysis will exclude unincorporated areas, as they tend to be of very low population density and do not make a good candidate for this type of analysis. Analysis will include all neighborhoods in L.A. County, as well as cities that are independant of L.A. county of which there are many (Beverley Hills, West Hollywood, etc.).

Data science techniques wil be utilized to drive this analysis and recommendations.

## Data <a name="data"></a>

Based on the problem statement, there will be a need for a list of all Los Angeles neighborhoods, with a way to filter out unincoprated areas, with latitude and longitude for each. This data set is then joined with data from FourSquare for Los Angeles restaurants.

Los Angeles neightborhood data is available here:

https://usc.data.socrata.com/api/views/9utn-waje/rows.csv?accessType=DOWNLOAD

However the data set has bad data which make the data import challenging. The data set has been cleaned manually and placed here:

http://glacier2.verio.com/data/la_neighborhoods.csv

### Neighborhood Data

In [1]:
import pandas as pd

url = "http://glacier2.verio.com/data/la_neighborhoods.csv"

la_neighborhood_df = pd.read_csv(url)
print("Completed CSV Read")

Completed CSV Read


Let's take a look at the data set.

In [2]:
la_neighborhood_df.head()

Unnamed: 0,set,slug,the_geom,kind,external_i,name,display_na,sqmi,type,name_1,slug_1,latitude,longitude,location
0,L.A. County Neighborhoods (Current),acton,MULTIPOLYGON (((-118.20261747920541 34.5389897...,L.A. County Neighborhood (Current),acton,Acton,Acton L.A. County Neighborhood (Current),39.339109,unincorporated-area,,,-118.16981,34.497355,POINT(34.497355239240846 -118.16981019229348)
1,L.A. County Neighborhoods (Current),adams-normandie,MULTIPOLYGON (((-118.30900800000012 34.0374109...,L.A. County Neighborhood (Current),adams-normandie,Adams-Normandie,Adams-Normandie L.A. County Neighborhood (Curr...,0.80535,segment-of-a-city,,,-118.300208,34.031461,POINT(34.031461499124156 -118.30020800000011)
2,L.A. County Neighborhoods (Current),agoura-hills,MULTIPOLYGON (((-118.76192500000009 34.1682029...,L.A. County Neighborhood (Current),agoura-hills,Agoura Hills,Agoura Hills L.A. County Neighborhood (Current),8.14676,standalone-city,,,-118.759884,34.146736,POINT(34.146736499122795 -118.75988450000015)
3,L.A. County Neighborhoods (Current),agua-dulce,MULTIPOLYGON (((-118.2546773959221 34.55830403...,L.A. County Neighborhood (Current),agua-dulce,Agua Dulce,Agua Dulce L.A. County Neighborhood (Current),31.462632,unincorporated-area,,,-118.317104,34.504927,POINT(34.504926999796837 -118.3171036690717)
4,L.A. County Neighborhoods (Current),alhambra,MULTIPOLYGON (((-118.12174700000014 34.1050399...,L.A. County Neighborhood (Current),alhambra,Alhambra,Alhambra L.A. County Neighborhood (Current),7.623814,standalone-city,,,-118.136512,34.085539,POINT(34.085538999123571 -118.13651200000021)


The key fields that will be used are are name, type, and latitude and longitude.

In [3]:
la_neighborhood_df.shape

(269, 14)

269 distinct neightborhoods. Now let's remove the unincoporated areas.

In [4]:
# Ignore unincorporated areas
options = ['segment-of-a-city', 'standalone-city'] 
    
# selecting rows based on condition 
rslt_df = la_neighborhood_df.loc[la_neighborhood_df['type'].isin(options)] 

In [5]:
rslt_df.shape

(199, 14)

70 unicorporated areas have been removed. Let's subset and correct the data set.

In [6]:
la_neigh_df = rslt_df[["name", "type", "latitude", "longitude"]]

# It appears latitude and longitude are reversed, so perform a rename

la_neigh_df.rename(columns={"latitude": "lng", "longitude": "lat"}, inplace=True)

la_neigh_df.columns

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  return super(DataFrame, self).rename(**kwargs)


Index(['name', 'type', 'lng', 'lat'], dtype='object')

In [7]:
# Finally check some of the data
la_neigh_df.head()

Unnamed: 0,name,type,lng,lat
1,Adams-Normandie,segment-of-a-city,-118.300208,34.031461
2,Agoura Hills,standalone-city,-118.759884,34.146736
4,Alhambra,standalone-city,-118.136512,34.085539
6,Artesia,standalone-city,-118.080101,33.866896
9,Arcadia,standalone-city,-118.030419,34.13323


### Restaurant Data

Pull restaurant data down from Foursquare.

In [None]:
#!pip install geopy

In [8]:
import geopy

In [9]:
# Get Los Angeles latitude and longitude
from geopy.geocoders import Nominatim 

# Get the coordinates of Los Angeles
address = 'Los Angeles'

geolocator = Nominatim(user_agent="la_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('Los Angeles latitude and longitude: {}, {}.'.format(latitude, longitude))

Los Angeles latitude and longitude: 34.0536909, -118.242766.


Use folium and map Los Angeles as a starting point, along with our neighborhood data frame.

In [None]:
#!pip install folium

In [10]:
import folium

la_map = folium.Map(location=[latitude, longitude], zoom_start=12)

# Markers
for lat, lng, name, type_loc in zip(la_neigh_df['lat'], la_neigh_df['lng'], la_neigh_df['name'], la_neigh_df['type']): 
    #print(name)
    label = '{},{}'.format(name, type_loc)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=20,
        popup=label,
        color='gray',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(la_map)  

In [11]:
la_map

In [12]:
# FourSquare set up
CLIENT_ID = 'O0HJUMSBLMA4FOPTXPXA0IO0P4DEWM2XMKD0IJ5Q0ZJRPUV3' # your Foursquare 
CLIENT_SECRET = 'O0L2WTWXSGY10DARU2CBPI450RB00LARJRIZ5FZ2KF01GQYW' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 100
print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)
radius = 500


url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
url

Your credentails:
CLIENT_ID: O0HJUMSBLMA4FOPTXPXA0IO0P4DEWM2XMKD0IJ5Q0ZJRPUV3
CLIENT_SECRET:O0L2WTWXSGY10DARU2CBPI450RB00LARJRIZ5FZ2KF01GQYW


'https://api.foursquare.com/v2/venues/search?client_id=O0HJUMSBLMA4FOPTXPXA0IO0P4DEWM2XMKD0IJ5Q0ZJRPUV3&client_secret=O0L2WTWXSGY10DARU2CBPI450RB00LARJRIZ5FZ2KF01GQYW&ll=34.0536909,-118.242766&v=20180604&radius=500&limit=100'

In [13]:
# Function to retrieve venue categories
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [14]:
import requests
from pandas.io.json import json_normalize

# Results from the GET
results = requests.get(url).json()
results

# Pull out the venues
venues = results['response']['venues']

# dataframe
venues_dataframe = json_normalize(venues)
venues_dataframe.head()

Unnamed: 0,categories,delivery.id,delivery.provider.icon.name,delivery.provider.icon.prefix,delivery.provider.icon.sizes,delivery.provider.name,delivery.url,hasPerk,id,location.address,...,location.formattedAddress,location.labeledLatLngs,location.lat,location.lng,location.neighborhood,location.postalCode,location.state,name,referralId,venuePage.id
0,"[{'id': '4bf58dd8d48988d126941735', 'name': 'G...",,,,,,,False,52ebf057498ee83c88d9c1f9,,...,"[Los Angeles, CA, United States]","[{'label': 'display', 'lat': 34.05378746696942...",34.053787,-118.242369,,,CA,Los Angeles Mayor's Office of Economic Develop...,v-1624387346,
1,"[{'id': '4bf58dd8d48988d129941735', 'name': 'C...",,,,,,,False,4b38d6b3f964a520fb5025e3,200 N Main St,...,"[200 N Main St (Temple Street), Los Angeles, C...","[{'label': 'display', 'lat': 34.05301072481605...",34.053011,-118.241863,,90012.0,CA,James K. Hahn City Hall East Building,v-1624387346,
2,"[{'id': '4bf58dd8d48988d126941735', 'name': 'G...",,,,,,,False,4d65fb1dc2ccb60ca24c6bac,200 N Spring St,...,"[200 N Spring St, Los Angeles, CA 90012, Unite...","[{'label': 'display', 'lat': 34.05415339478075...",34.054153,-118.243117,,90012.0,CA,Los Angeles Civic Center,v-1624387346,
3,"[{'id': '4bf58dd8d48988d129941735', 'name': 'C...",,,,,,,False,4b5113edf964a520314127e3,200 N Spring St,...,"[200 N Spring St (at Temple Ave), Los Angeles,...","[{'label': 'display', 'lat': 34.05348417688625...",34.053484,-118.242478,Civic Center,90012.0,CA,Los Angeles City Hall,v-1624387346,75727220.0
4,"[{'id': '4bf58dd8d48988d129941735', 'name': 'C...",,,,,,,False,4d9ccfabc593a1cd8dff5119,Los Angeles City Hall,...,"[Los Angeles City Hall, Los Angeles, CA 90012,...","[{'label': 'display', 'lat': 34.05393149240509...",34.053931,-118.243169,,90012.0,CA,Office of Mayor Eric Garcetti,v-1624387346,


In [15]:
# Filter the categories and clean up the venue names
filtered_columns = ['name', 'categories'] + [col for col in venues_dataframe.columns if col.startswith('location.')] + ['id']
venues_df_filtered = venues_dataframe.loc[:, filtered_columns]
venues_df_filtered['categories'] = venues_df_filtered.apply(get_category_type, axis=1)
venues_df_filtered.columns = [column.split('.')[-1] for column in venues_df_filtered.columns]
venues_df_filtered.head()

Unnamed: 0,name,categories,address,cc,city,country,crossStreet,distance,formattedAddress,labeledLatLngs,lat,lng,neighborhood,postalCode,state,id
0,Los Angeles Mayor's Office of Economic Develop...,Government Building,,US,Los Angeles,United States,,38,"[Los Angeles, CA, United States]","[{'label': 'display', 'lat': 34.05378746696942...",34.053787,-118.242369,,,CA,52ebf057498ee83c88d9c1f9
1,James K. Hahn City Hall East Building,City Hall,200 N Main St,US,Los Angeles,United States,Temple Street,112,"[200 N Main St (Temple Street), Los Angeles, C...","[{'label': 'display', 'lat': 34.05301072481605...",34.053011,-118.241863,,90012.0,CA,4b38d6b3f964a520fb5025e3
2,Los Angeles Civic Center,Government Building,200 N Spring St,US,Los Angeles,United States,,60,"[200 N Spring St, Los Angeles, CA 90012, Unite...","[{'label': 'display', 'lat': 34.05415339478075...",34.054153,-118.243117,,90012.0,CA,4d65fb1dc2ccb60ca24c6bac
3,Los Angeles City Hall,City Hall,200 N Spring St,US,Los Angeles,United States,at Temple Ave,35,"[200 N Spring St (at Temple Ave), Los Angeles,...","[{'label': 'display', 'lat': 34.05348417688625...",34.053484,-118.242478,Civic Center,90012.0,CA,4b5113edf964a520314127e3
4,Office of Mayor Eric Garcetti,City Hall,Los Angeles City Hall,US,Los Angeles,United States,,45,"[Los Angeles City Hall, Los Angeles, CA 90012,...","[{'label': 'display', 'lat': 34.05393149240509...",34.053931,-118.243169,,90012.0,CA,4d9ccfabc593a1cd8dff5119


In [16]:
# Check the size
venues_df_filtered.shape

(100, 16)

In [17]:
# Nearby venues function
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        #print(lat)
        #print(lng)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lng, 
            lat, 
            radius, 
            LIMIT)
            
        #print(url)
        
        # make the GET request
        #results = requests.get(url).json()["response"]['groups'][0]['items']
        results = requests.get(url).json()['response'].get('groups',[{}])[0].get('items', [])
        
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

        
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [18]:
names=la_neighborhood_df['name']

In [19]:
# The neighborhood that have venues in La
la_venues = getNearbyVenues(names=names,
                                   latitudes=rslt_df['latitude'],
                                   longitudes=rslt_df['longitude']
                                  )

Acton
Adams-Normandie
Agoura Hills
Agua Dulce
Alhambra
Alondra Park
Artesia
Altadena
Angeles Crest
Arcadia
Arleta
Arlington Heights
Athens
Atwater Village
Avalon
Avocado Heights
Azusa
Vermont-Slauson
Baldwin Hills/Crenshaw
Baldwin Park
Bel-Air
Bellflower
Bell Gardens
Green Valley
Bell
Beverly Crest
Beverly Grove
Burbank
Koreatown
Beverly Hills
Beverlywood
Boyle Heights
Bradbury
Brentwood
Broadway-Manchester
Calabasas
Canoga Park
Carson
Carthay
Castaic Canyons
Chatsworth
Castaic
Central-Alameda
Century City
Cerritos
Charter Oak
Chatsworth Reservoir
Chesterfield Square
Cheviot Hills
Chinatown
Citrus
Claremont
Northridge
Commerce
Compton
Cypress Park
La Mirada
Covina
Cudahy
Culver City
Del Aire
Del Rey
Desert View Highlands
Diamond Bar
Downey
Downtown
Duarte
Eagle Rock
East Compton
East Hollywood
East La Mirada
Elizabeth Lake
East Los Angeles
East Pasadena
East San Gabriel
Echo Park
El Monte
El Segundo
El Sereno
Elysian Park
Elysian Valley
Vermont Square
Encino
Exposition Park
Fairfax
Flo

In [20]:
la_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Acton,-118.300208,34.031461,7-Eleven,34.033027,-118.299960,Convenience Store
1,Acton,-118.300208,34.031461,Shell,34.033095,-118.300025,Gas Station
2,Acton,-118.300208,34.031461,Little Xian,34.032292,-118.299465,Sushi Restaurant
3,Acton,-118.300208,34.031461,Sushi Delight,34.032501,-118.299454,Sushi Restaurant
4,Acton,-118.300208,34.031461,Tacos La Estrella,34.032230,-118.300757,Taco Place
5,Acton,-118.300208,34.031461,El Rincon Hondureño,34.032527,-118.298860,Latin American Restaurant
6,Acton,-118.300208,34.031461,Orange Door Sushi,34.032485,-118.299368,Sushi Restaurant
7,Acton,-118.300208,34.031461,Loren Miller Recreational Park,34.031335,-118.303717,Playground
8,Acton,-118.300208,34.031461,Xtra Bionicos Allexis,34.032211,-118.296272,Dessert Shop
9,Acton,-118.300208,34.031461,Adlong market,34.032242,-118.296229,Grocery Store


In [21]:
la_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Acton,11,11,11,11,11,11
Adams-Normandie,29,29,29,29,29,29
Agoura Hills,13,13,13,13,13,13
Agua Dulce,28,28,28,28,28,28
Alhambra,24,24,24,24,24,24
Alondra Park,5,5,5,5,5,5
Altadena,20,20,20,20,20,20
Angeles Crest,4,4,4,4,4,4
Arcadia,10,10,10,10,10,10
Arleta,6,6,6,6,6,6


In [22]:
print('Number of unique categories is {}.'.format(len(la_venues['Venue Category'].unique())))

Number of unique categories is 298.


Finally, filter out all venues that are not restaurants.

In [23]:
options = ['Fast Food Restaurant' ,'Breakfast Spot',
 'Café', 'Restaurant', 'Indian Restaurant', 'BBQ Joint' ,'Burger Joint',
 'American Restaurant', 'Pizza Place', 'Brewery' 'Thai Restaurant',
 'Deli / Bodega', 'Mexican Restaurant','Sushi Restaurant' , 'Taco Place',
 'Hawaiian Restaurant',
 'Taiwanese Restaurant', 'Vegetarian / Vegan Restaurant',
 'Vietnamese Restaurant' ,'Japanese Restaurant' ,
 'Korean Restaurant', 
 'Donburi Restaurant', 'Seafood Restaurant',  'Dumpling Restaurant',
 'Mediterranean Restaurant', 'Southern / Soul Food Restaurant' 'Diner',
 'Udon Restaurant', 'Empanada Restaurant', 'Ramen Restaurant' 'Cuban Restaurant',
 'Korean BBQ Restaurant', 'Hotel Bar', 'Brazilian Restaurant', 'Ethiopian Restaurant',
 'French Restaurant', 'Cajun / Creole Restaurant' ,'Filipino Restaurant',
 'Dim Sum Restaurant',  'Greek Restaurant' ,'Noodle House' , 'New American Restaurant',
 'Middle Eastern Restaurant', 'Falafel Restaurant',
 'Persian Restaurant', 'Caribbean Restaurant', 'Andhra Restaurant' ,
 'Russian Restaurant'
  ] 

# selecting rows based on condition 
la_rest_df = la_venues.loc[la_venues['Venue Category'].isin(options)] 
la_rest_df

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
2,Acton,-118.300208,34.031461,Little Xian,34.032292,-118.299465,Sushi Restaurant
3,Acton,-118.300208,34.031461,Sushi Delight,34.032501,-118.299454,Sushi Restaurant
4,Acton,-118.300208,34.031461,Tacos La Estrella,34.032230,-118.300757,Taco Place
6,Acton,-118.300208,34.031461,Orange Door Sushi,34.032485,-118.299368,Sushi Restaurant
11,Adams-Normandie,-118.759884,34.146736,El Pollo Loco,34.144732,-118.761088,Fast Food Restaurant
12,Adams-Normandie,-118.759884,34.146736,Sushi Raku,34.148230,-118.760163,Sushi Restaurant
14,Adams-Normandie,-118.759884,34.146736,Urbane Cafe,34.146573,-118.758956,Café
15,Adams-Normandie,-118.759884,34.146736,Jinky's Kanan Cafe,34.146280,-118.756833,Breakfast Spot
16,Adams-Normandie,-118.759884,34.146736,Boar Dough Tasting Room,34.144237,-118.756564,Restaurant
17,Adams-Normandie,-118.759884,34.146736,Lal Mirch,34.147822,-118.760536,Indian Restaurant


We now have two datasets, one, the neightborhoods, the other, restaurants. We will combine and use these datasets to perform analysis.

## Methodology <a name="methodology"></a>

The approach used is KMeans clustering. Restaurants will be clustered around the neightborhoods and neighborhoods without clusters will be analyzed.

First, start with the onehot encoding.

In [24]:
# Onehot encoding
la_onehot = pd.get_dummies(la_rest_df[['Venue Category']], prefix="", prefix_sep="")
la_onehot['Neighborhood'] = la_rest_df['Neighborhood'] 
fixed_columns = [la_onehot.columns[-1]] + list(la_onehot.columns[:-1])
final_onehot = la_onehot[fixed_columns]

In [25]:
final_onehot.head()

Unnamed: 0,Neighborhood,American Restaurant,Andhra Restaurant,BBQ Joint,Brazilian Restaurant,Breakfast Spot,Burger Joint,Café,Cajun / Creole Restaurant,Caribbean Restaurant,...,Pizza Place,Restaurant,Russian Restaurant,Seafood Restaurant,Sushi Restaurant,Taco Place,Taiwanese Restaurant,Udon Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
2,Acton,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,Acton,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,Acton,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
6,Acton,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
11,Adams-Normandie,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
final_onehot.shape

(717, 43)

Explore the data a bit more.

In [27]:
# Group by neighborhood
la_df_grouped = final_onehot.groupby('Neighborhood').mean().reset_index()
la_df_grouped.head()

Unnamed: 0,Neighborhood,American Restaurant,Andhra Restaurant,BBQ Joint,Brazilian Restaurant,Breakfast Spot,Burger Joint,Café,Cajun / Creole Restaurant,Caribbean Restaurant,...,Pizza Place,Restaurant,Russian Restaurant,Seafood Restaurant,Sushi Restaurant,Taco Place,Taiwanese Restaurant,Udon Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
0,Acton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.75,0.25,0.0,0.0,0.0,0.0
1,Adams-Normandie,0.058824,0.0,0.058824,0.0,0.117647,0.058824,0.058824,0.0,0.0,...,0.058824,0.058824,0.0,0.0,0.117647,0.0,0.0,0.0,0.0,0.0
2,Agoura Hills,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,...,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Agua Dulce,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,...,0.071429,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.071429,0.142857
4,Alhambra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
num_top_venues = 3

for hood in la_df_grouped['Neighborhood']:
    print("------- "+hood+" -------")
    temp = la_df_grouped[la_df_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

------- Acton -------
                 venue  freq
0     Sushi Restaurant  0.75
1           Taco Place  0.25
2  American Restaurant  0.00


------- Adams-Normandie -------
                  venue  freq
0  Fast Food Restaurant  0.24
1        Breakfast Spot  0.12
2      Sushi Restaurant  0.12


------- Agoura Hills -------
                venue  freq
0         Pizza Place  0.25
1      Breakfast Spot  0.25
2  Mexican Restaurant  0.25


------- Agua Dulce -------
                   venue  freq
0      Indian Restaurant  0.50
1  Vietnamese Restaurant  0.14
2   Taiwanese Restaurant  0.07


------- Alhambra -------
                  venue  freq
0           Pizza Place  0.25
1    Mexican Restaurant  0.25
2  Fast Food Restaurant  0.25


------- Altadena -------
                  venue  freq
0            Taco Place   0.4
1           Pizza Place   0.2
2  Fast Food Restaurant   0.2


------- Angeles Crest -------
                 venue  freq
0  American Restaurant   0.5
1   Mexican Restaurant   0.5

In [30]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [32]:
import numpy as np

num_top_venues = 3

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

In [33]:
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = la_df_grouped['Neighborhood']

for ind in np.arange(la_df_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(la_df_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,Acton,Sushi Restaurant,Taco Place,Vietnamese Restaurant
1,Adams-Normandie,Fast Food Restaurant,Sushi Restaurant,Breakfast Spot
2,Agoura Hills,Fast Food Restaurant,Breakfast Spot,Pizza Place
3,Agua Dulce,Indian Restaurant,Vietnamese Restaurant,Vegetarian / Vegan Restaurant
4,Alhambra,Fast Food Restaurant,Japanese Restaurant,Mexican Restaurant


#### Clustering

In [35]:
from sklearn.cluster import KMeans

# Run Kmeans clustering and row labels
kclusters_num = 3
la_df_grouped_clustering = la_df_grouped.drop('Neighborhood', 1)
kmeans = KMeans(n_clusters=kclusters_num, random_state=0).fit(la_df_grouped_clustering)
kmeans.labels_[0:10]

array([2, 2, 2, 2, 2, 2, 0, 2, 1, 2])

In [37]:
# Clustering Labels - only once
#neighborhoods_venues_sorted.insert(0, 'ClusterLabels', kmeans.labels_)
neighborhoods_venues_sorted.head()

Unnamed: 0,ClusterLabels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,2,Acton,Sushi Restaurant,Taco Place,Vietnamese Restaurant
1,2,Adams-Normandie,Fast Food Restaurant,Sushi Restaurant,Breakfast Spot
2,2,Agoura Hills,Fast Food Restaurant,Breakfast Spot,Pizza Place
3,2,Agua Dulce,Indian Restaurant,Vietnamese Restaurant,Vegetarian / Vegan Restaurant
4,2,Alhambra,Fast Food Restaurant,Japanese Restaurant,Mexican Restaurant


In [38]:
# Let's merge the two datasets
la_df_merged = rslt_df
la_df_merged = la_df_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='name') 
la_df_merged.head()

Unnamed: 0,set,slug,the_geom,kind,external_i,name,display_na,sqmi,type,name_1,slug_1,latitude,longitude,location,ClusterLabels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
1,L.A. County Neighborhoods (Current),adams-normandie,MULTIPOLYGON (((-118.30900800000012 34.0374109...,L.A. County Neighborhood (Current),adams-normandie,Adams-Normandie,Adams-Normandie L.A. County Neighborhood (Curr...,0.80535,segment-of-a-city,,,-118.300208,34.031461,POINT(34.031461499124156 -118.30020800000011),2.0,Fast Food Restaurant,Sushi Restaurant,Breakfast Spot
2,L.A. County Neighborhoods (Current),agoura-hills,MULTIPOLYGON (((-118.76192500000009 34.1682029...,L.A. County Neighborhood (Current),agoura-hills,Agoura Hills,Agoura Hills L.A. County Neighborhood (Current),8.14676,standalone-city,,,-118.759884,34.146736,POINT(34.146736499122795 -118.75988450000015),2.0,Fast Food Restaurant,Breakfast Spot,Pizza Place
4,L.A. County Neighborhoods (Current),alhambra,MULTIPOLYGON (((-118.12174700000014 34.1050399...,L.A. County Neighborhood (Current),alhambra,Alhambra,Alhambra L.A. County Neighborhood (Current),7.623814,standalone-city,,,-118.136512,34.085539,POINT(34.085538999123571 -118.13651200000021),2.0,Fast Food Restaurant,Japanese Restaurant,Mexican Restaurant
6,L.A. County Neighborhoods (Current),artesia,MULTIPOLYGON (((-118.0748950000001 33.88038299...,L.A. County Neighborhood (Current),artesia,Artesia,Artesia L.A. County Neighborhood (Current),1.632204,standalone-city,,,-118.080101,33.866896,POINT(33.866895999126271 -118.08010100000017),2.0,Korean Restaurant,Vegetarian / Vegan Restaurant,Taco Place
9,L.A. County Neighborhoods (Current),arcadia,MULTIPOLYGON (((-118.017052 34.177181999122524...,L.A. County Neighborhood (Current),arcadia,Arcadia,Arcadia L.A. County Neighborhood (Current),11.150797,standalone-city,,,-118.030419,34.13323,POINT(34.133229999123017 -118.03041899311202),2.0,Pizza Place,Vietnamese Restaurant,Deli / Bodega


In [39]:
neighborhoods_venues_sorted.loc[neighborhoods_venues_sorted['ClusterLabels'] ==1, neighborhoods_venues_sorted.columns[[1] + list(range(5, neighborhoods_venues_sorted.shape[1]))]]

Unnamed: 0,Neighborhood
8,Arleta
40,Del Rey
47,East Hollywood
52,Elysian Valley
58,Glendale
81,La Mirada
88,Leimert Park
96,Mayflower Village
114,Pomona


## Analysis <a name="analysis"></a>

Build the cluster map uncover areas where there are no clusters.

In [44]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# Mapping with Folium
la_map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
x = np.arange(3)
ys = [i + x + (i*x)**2 for i in range(kclusters_num)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]


markers_colors = []

for lon, lat, poi, cluster in zip(la_df_merged['latitude'], la_df_merged['longitude'], la_df_merged['name'], la_df_merged['ClusterLabels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='blue',
        fill_opacity=0.6).add_to(la_map_clusters)
       
la_map_clusters

A quick review of the cluster map reveals 4 areas without a cluster as potential neighborhoods to have a restaurant.

<dl>
    <dt>Granada Hills</dt>
    <dt>East Los Angeles</dt>
    <dt>Santa Fe Springs</dt>
</dl>    

A quick look at a population density map confirms that there are enough people in the neighborhood to potentiallt support a new restaurant.

https://www.arcgis.com/apps/mapviewer/index.html?webmap=5913b5311e6449909e4139117c96a878

<dl>
    <dt>Granada Hills has a highly diverse demographic makeup and has a median income of 83,911</dt>
    <dt>East Los Angeles has a high Latino population (96%) and has a median income of 37,982</dt>
    <dt>Santa Fe Springs has a high Latino population (70%) and a median income of 54,081</dt>
</dl>
(from Wikipedia)

## Results and Discussion <a name="results"></a>

Both East Los Angeles and Santa Fe Springs would be good candidates for any Mexican or Latino restaurant. Santa Fe Springs has a higher median income, and likely more able to support a mid-priced restaurant.

Grenada Hills is very good candidate for any mid to high-priced restaurant, potentially of any cuisine, though there is a strong Asian population which could favor one of the sub-Asian populations.

## Conclusion <a name="conclusion"></a>

This analysis is really just the beginning. Many other factors could be explored, zoning for instance, which could affect where to further investigate, but with that, this clustering analysis has provided some initial top recommdations based on the two data sets using KMeans.