## Opening Upscale Italian Restaurants in Los Angeles


Company A is planning to open several upscale Italian restaurants in Los Angeles.

They are choosing areas based on three criteria:
- the area should have a high median income
- the area should have few Italian restaurants nearby
- the area should have a fairly large population

The solution uses the KMeans clustering method.


### Importing dataset

The dataset was compiled from the Los Angeles Almanac (http://www.laalmanac.com/employment/em12c.php) and ZIPatlas.com (http://zipatlas.com/us/ca/los-angeles/zip-code-comparison/population-density.htm).



In [154]:
import pandas as pd

la_data = pd.read_csv('la_data.csv')

la_data['Zip - Neighborhood'] = la_data['Zip'].map(str)+' - '+la_data['Neighborhood']

arrange_columns = [la_data.columns[-1]] + list(la_data.columns[:-1])
la_data = la_data[arrange_columns]
la_data.columns = ['Neighborhood','Zip','Name','Latitude','Longitude','Population','Median Income']
la_data

Unnamed: 0,Neighborhood,Zip,Name,Latitude,Longitude,Population,Median Income
0,90057 - Los Angeles(Westlake),90057,Los Angeles(Westlake),34.061918,-118.277939,43986,31337
1,"90020 - Los Angeles(Hancock Park, Western Wilt...",90020,"Los Angeles(Hancock Park, Western Wilton, Wils...",34.066367,-118.309868,42383,42407
2,"90005 - Los Angeles(Hancock Park, Koreatown, W...",90005,"Los Angeles(Hancock Park, Koreatown, Wilshire ...",34.059281,-118.30742,43014,32461
3,"90006 - Los Angeles(Byzantine-Latino Quarter, ...",90006,"Los Angeles(Byzantine-Latino Quarter, Harvard ...",34.048013,-118.293953,62765,33790
4,90029 - Los Angeles(East Hollywood),90029,Los Angeles(East Hollywood),34.089953,-118.294824,41697,37379
5,"90017 - Los Angeles(Downtown Bunker Hill, City...",90017,"Los Angeles(Downtown Bunker Hill, City West, S...",34.052842,-118.264495,20689,28638
6,90011 - Los Angeles(Southeast Los Angeles),90011,Los Angeles(Southeast Los Angeles),34.00709,-118.258681,101214,33824
7,"90004 - Los Angeles(Hancock Park, Rampart Vill...",90004,"Los Angeles(Hancock Park, Rampart Village, Vir...",34.076259,-118.310715,67850,46581
8,90038 - Los Angeles(Hollywood),90038,Los Angeles(Hollywood),34.088017,-118.327168,32557,36996
9,90028 - Los Angeles(Hollywood),90028,Los Angeles(Hollywood),34.099869,-118.326843,30562,40068


### Importing libraries

In [14]:
import json
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas import json_normalize # transform a JSON file into a pandas dataframe (pandas 1.0 or later)
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium # map rendering library


Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



### Creating a map of Los Angeles

In [15]:
latitude = 34.0522
longitude = -118.2437
print('The geographical coordinates of Los Angeles are {}, {}.'.format(latitude, longitude))


The geographical coordinates of Los Angeles are 34.0522, -118.2437.


In [155]:
map_la = folium.Map(location=[latitude, longitude], zoom_start = 10)

#Add one marker per neighborhood, labeled with the neighborhood name
for lat, lng, neighborhood in zip(la_data['Latitude'], la_data['Longitude'], la_data['Neighborhood']):
    label = folium.Popup(str(neighborhood), parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_la)

map_la

### Getting the venues from Foursquare

In [17]:
#Define Foursquare credentials and version

CLIENT_ID = 'F44RE35X5VD0OR4QZ5QFM2VVP5PRMMMB1JXKTFKKAVU5HBN3' 
CLIENT_SECRET = 'FOVYKDZGWC0GURZJTGA3CXCXZLMXZ3IXYWXYU2GD0Z3TMQS1' 
VERSION = '20180605' 

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: F44RE35X5VD0OR4QZ5QFM2VVP5PRMMMB1JXKTFKKAVU5HBN3
CLIENT_SECRET: FOVYKDZGWC0GURZJTGA3CXCXZLMXZ3IXYWXYU2GD0Z3TMQS1


In [156]:
#Get up to 100 venues within a radius of 1000 meters (1 kilometer) of each neighborhood center

LIMIT = 100
radius = 1000

def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names,latitudes,longitudes):
        print(name)
        
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        
        results = requests.get(url).json()["response"]["groups"][0]['items']
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
        
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
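
The list comprehension in `getNearbyVenues` indexes `v['venue']['categories'][0]` directly, so a venue with an empty category list, or a response without any groups, would raise an exception partway through the loop. The sketch below shows one defensive way to flatten a response. It is an illustration only: the helper name `parse_explore_items` and the sample payload are hypothetical, and the payload merely mirrors the fields the notebook reads rather than a real Foursquare response.

```python
def parse_explore_items(name, lat, lng, payload):
    """Flatten one explore-style response into rows, skipping incomplete venues."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        location = venue.get("location", {})
        # Skip venues that lack a category or coordinates instead of raising
        if not categories or "lat" not in location or "lng" not in location:
            continue
        rows.append((name, lat, lng, venue.get("name"),
                     location["lat"], location["lng"], categories[0]["name"]))
    return rows


# Hypothetical payload, shaped like the fields the notebook accesses
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Trattoria X",
               "location": {"lat": 34.06, "lng": -118.28},
               "categories": [{"name": "Italian Restaurant"}]}},
    {"venue": {"name": "Unlabeled Place",
               "location": {"lat": 34.06, "lng": -118.27},
               "categories": []}},
]}]}}

rows = parse_explore_items("90057 - Los Angeles(Westlake)", 34.061918, -118.277939, sample)
empty = parse_explore_items("nowhere", 0.0, 0.0, {"response": {}})
```

Here the venue without a category is dropped and the empty response yields an empty list, so one odd response no longer aborts the whole loop.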


In [157]:
la_venues = getNearbyVenues(names=la_data['Neighborhood'],
                                   latitudes=la_data['Latitude'],
                                   longitudes=la_data['Longitude']
                                  )

90057 - Los Angeles(Westlake)
90020 - Los Angeles(Hancock Park, Western Wilton, Wilshire Center, Windsor Square)
90005 - Los Angeles(Hancock Park, Koreatown, Wilshire Center, Wilshire Park, Windsor Square)
90006 - Los Angeles(Byzantine-Latino Quarter, Harvard Heights, Koreatown, Pico Heights)
90029 - Los Angeles(East Hollywood)
90017 - Los Angeles(Downtown Bunker Hill, City West, South Park-North)
90011 - Los Angeles(Southeast Los Angeles)
90004 - Los Angeles(Hancock Park, Rampart Village, Virgil Village, Wilshire Center, Windsor Square)
90038 - Los Angeles(Hollywood)
90028 - Los Angeles(Hollywood)
90037 - Los Angeles(South Los Angeles)
90034 - Los Angeles(Palms)
90026 - Los Angeles(Echo Park, Silver Lake)
90019 - Los Angeles(Arlington Heights, Country Club Park, Mid-City)
90044 - Athens,Los Angeles(South Los Angeles)
90003 - Los Angeles(South Los Angeles, Southeast Los Angeles)
90007 - Los Angeles(Southeast Los Angeles, Univerity Park)
90018 - Los Angeles(Jefferson Park, Leimert Park)
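
Each request covers only a 1 kilometer circle around a zip code's center point, so the venue counts describe the immediate surroundings of that point rather than the whole zip code. As a rough sanity check, the sketch below computes the great-circle distance between two of the center points listed in the data table (Westlake and East Hollywood) with the haversine formula. The function name `haversine_m` and the mean Earth radius of 6,371 km are choices made for this illustration.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2, earth_radius_m=6371000.0):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


# Center points taken from the la_data table
westlake = (34.061918, -118.277939)
east_hollywood = (34.089953, -118.294824)

distance_m = haversine_m(*westlake, *east_hollywood)
# Two 1000 m search circles overlap only if the centers are under 2000 m apart
circles_overlap = distance_m < 2 * 1000
print(round(distance_m), circles_overlap)
```

These two centers are roughly 3.5 km apart, so their search circles do not overlap and venues between them are not counted for either neighborhood.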

In [158]:
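#One-hot encode the venue categories
la_onehot = pd.get_dummies(la_venues[['Venue Category']], prefix="", prefix_sep="")

#The one-hot columns include a venue category that is itself named 'Neighborhood',
#so add the neighborhood name under a temporary name, drop that category, then rename
la_onehot['_Neighborhood'] = la_venues['Neighborhood']
la_onehot.drop('Neighborhood',axis=1, inplace = True)

#Move the neighborhood column to the front
fixed_columns = [la_onehot.columns[-1]] + list(la_onehot.columns[:-1])
la_onehot = la_onehot[fixed_columns]
la_onehot = la_onehot.rename(columns={'_Neighborhood':'Neighborhood'})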

la_onehot.head()

Unnamed: 0,Neighborhood,ATM,Accessories Store,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Amphitheater,Antique Shop,Arcade,...,Vietnamese Restaurant,Warehouse,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Yoshoku Restaurant
0,90057 - Los Angeles(Westlake),0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,90057 - Los Angeles(Westlake),0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,90057 - Los Angeles(Westlake),0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,90057 - Los Angeles(Westlake),0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,90057 - Los Angeles(Westlake),0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [159]:
la_types = list(la_onehot.columns)
la_types

['Neighborhood',
 'ATM',
 'Accessories Store',
 'Airport Lounge',
 'Airport Service',
 'Airport Terminal',
 'American Restaurant',
 'Amphitheater',
 'Antique Shop',
 'Arcade',
 'Art Gallery',
 'Art Museum',
 'Arts & Crafts Store',
 'Asian Restaurant',
 'Astrologer',
 'Athletics & Sports',
 'Auto Garage',
 'Automotive Shop',
 'BBQ Joint',
 'Bagel Shop',
 'Bakery',
 'Bank',
 'Bar',
 'Baseball Field',
 'Baseball Stadium',
 'Basketball Court',
 'Basketball Stadium',
 'Bed & Breakfast',
 'Beer Bar',
 'Beer Garden',
 'Beer Store',
 'Big Box Store',
 'Bike Shop',
 'Bookstore',
 'Boutique',
 'Bowling Alley',
 'Boxing Gym',
 'Brazilian Restaurant',
 'Breakfast Spot',
 'Brewery',
 'Bridal Shop',
 'Bubble Tea Shop',
 'Buffet',
 'Burger Joint',
 'Burmese Restaurant',
 'Burrito Place',
 'Bus Line',
 'Bus Station',
 'Bus Stop',
 'Business Service',
 'Cafeteria',
 'Café',
 'Cajun / Creole Restaurant',
 'Campground',
 'Candy Store',
 'Cantonese Restaurant',
 'Caribbean Restaurant',
 'Carpet Store',
 ...]

In [160]:
#Grouping the venues by Neighborhood

la_grouped = la_onehot.groupby("Neighborhood").sum().reset_index()

la_grouped

Unnamed: 0,Neighborhood,ATM,Accessories Store,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Amphitheater,Antique Shop,Arcade,...,Vietnamese Restaurant,Warehouse,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Yoshoku Restaurant
0,"90001 - Los Angeles(South Los Angeles), Floren...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"90002 - Los Angeles(Southeast Los Angeles, Watts)",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,"90003 - Los Angeles(South Los Angeles, Southea...",1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"90004 - Los Angeles(Hancock Park, Rampart Vill...",0,0,0,0,0,2,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,"90005 - Los Angeles(Hancock Park, Koreatown, W...",0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
5,"90006 - Los Angeles(Byzantine-Latino Quarter, ...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,"90007 - Los Angeles(Southeast Los Angeles, Uni...",1,0,0,0,0,4,0,0,0,...,0,0,0,0,1,0,0,0,1,0
7,"90008 - Los Angeles(Baldwin Hills, Crenshaw, L...",0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,"90010 - Los Angeles(Hancock Park, Wilshire Cen...",0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
9,90011 - Los Angeles(Southeast Los Angeles),0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
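
The one-hot encoding followed by `groupby(...).sum()` turns the venue list into a count of each venue category per neighborhood. The toy example below reproduces the idea on five made-up venues; the neighborhood names `A` and `B` and the venue rows are invented for illustration. It also shows that `pd.crosstab` gives the same counts in one step, which may be a simpler alternative.

```python
import pandas as pd

# Made-up venues: neighborhood A has two Italian restaurants, B has none
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "Venue Category": ["Italian Restaurant", "Café", "Italian Restaurant", "Bar", "Café"],
})

# One column per category, one row per venue, 1 where the venue has that category
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Summing the indicator columns per neighborhood yields counts per category
counts = onehot.groupby("Neighborhood").sum().reset_index()
italian = counts[["Neighborhood", "Italian Restaurant"]]

# Equivalent count table built directly from the two columns
crosstab = pd.crosstab(venues["Neighborhood"], venues["Venue Category"])
print(italian)
```

Note that `insert` would fail if a category were itself called `Neighborhood`, which is why the notebook uses a temporary `_Neighborhood` column.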


In [228]:
#Keep only the Italian Restaurant count for each neighborhood

la_italian = la_grouped[['Neighborhood','Italian Restaurant']]
la_italian.head()

Unnamed: 0,Neighborhood,Italian Restaurant
0,"90001 - Los Angeles(South Los Angeles), Floren...",0
1,"90002 - Los Angeles(Southeast Los Angeles, Watts)",0
2,"90003 - Los Angeles(South Los Angeles, Southea...",0
3,"90004 - Los Angeles(Hancock Park, Rampart Vill...",1
4,"90005 - Los Angeles(Hancock Park, Koreatown, W...",1


In [184]:
#Merge Italian restaurant column to LA data

la_merged = pd.merge(la_data,la_italian, on='Neighborhood',how='left')
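
A left merge keeps every row of `la_data`, but a neighborhood for which no venues were returned would get `NaN` in the `Italian Restaurant` column, and `KMeans` cannot be fitted on missing values. The sketch below, on made-up data, uses the `indicator` option of `pd.merge` to list unmatched neighborhoods and then fills the gaps with 0. Treating a missing count as zero restaurants is an assumption of this example, not something the data confirms.

```python
import pandas as pd

# Made-up frames: neighborhood C has no row in the restaurant counts
areas = pd.DataFrame({"Neighborhood": ["A", "B", "C"], "Population": [100, 200, 300]})
italian = pd.DataFrame({"Neighborhood": ["A", "B"], "Italian Restaurant": [2, 0]})

# indicator=True adds a _merge column saying where each row came from
merged = pd.merge(areas, italian, on="Neighborhood", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Neighborhood"].tolist()

# Assumption: a neighborhood without any returned venues has zero Italian restaurants
merged["Italian Restaurant"] = merged["Italian Restaurant"].fillna(0).astype(int)
merged = merged.drop(columns="_merge")
print(unmatched)
print(merged)
```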


### Clustering the neighborhoods

In [241]:
#Standardize the variables

from sklearn.preprocessing import StandardScaler

#Work on a copy so the dtype conversion does not trigger a SettingWithCopyWarning
la_cluster = la_merged[['Population','Median Income','Italian Restaurant']].copy()
la_cluster['Population'] = la_cluster['Population'].astype(str).astype(int)

la_cluster = StandardScaler().fit_transform(la_cluster)
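
Standardization matters here because KMeans relies on Euclidean distance: without it, population and income (tens of thousands) would swamp the restaurant count (single digits). The sketch below applies `StandardScaler` to four rows copied from the merged table and checks that the result matches the z-score formula, (x - mean) / standard deviation, computed by hand with the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Population, Median Income, Italian Restaurant for four neighborhoods from the table
X = np.array([
    [43986, 31337, 0],
    [42383, 42407, 1],
    [43014, 32461, 1],
    [62765, 33790, 0],
], dtype=float)

scaled = StandardScaler().fit_transform(X)

# Same transformation by hand; np.std defaults to the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.round(2))
```

After scaling, every column has mean 0 and standard deviation 1, so the three criteria contribute on the same scale to the distances.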


In [243]:
#Cluster the neighborhoods using KMeans

from sklearn.cluster import KMeans

kclusters = 5

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(la_cluster)

#Assign the cluster labels to the neighborhoods in the dataset
la_merged.insert(0, 'Cluster Labels', kmeans.labels_)

#Show the labels of the first 10 neighborhoods
kmeans.labels_[0:10]


array([3, 3, 3, 1, 3, 2, 1, 1, 2, 3], dtype=int32)
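
The notebook fixes the number of clusters at 5 without showing how that value was chosen. One common heuristic is the elbow method: fit KMeans for several values of k and look for the point where the within-cluster sum of squares (`inertia_`) stops dropping sharply. The sketch below runs it on synthetic data with three well-separated groups, since `la_cluster` is not reproduced here; the data, the range of k and `n_init=10` are all assumptions of this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the standardized features: three tight groups in 3-D
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [5, 5, 5], [-5, 5, 0]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 3)) for c in centers])

# Fit one model per candidate k and record its within-cluster sum of squares
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

On this synthetic data the inertia falls steeply up to k=3 and flattens afterwards; applying the same loop to `la_cluster` would show whether 5 is a reasonable choice for the Los Angeles data.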

In [240]:
la_merged

Unnamed: 0,Cluster Labels,Neighborhood,Zip,Name,Latitude,Longitude,Population,Median Income,Italian Restaurant
0,3,90057 - Los Angeles(Westlake),90057,Los Angeles(Westlake),34.061918,-118.277939,43986,31337,0
1,3,"90020 - Los Angeles(Hancock Park, Western Wilt...",90020,"Los Angeles(Hancock Park, Western Wilton, Wils...",34.066367,-118.309868,42383,42407,1
2,3,"90005 - Los Angeles(Hancock Park, Koreatown, W...",90005,"Los Angeles(Hancock Park, Koreatown, Wilshire ...",34.059281,-118.30742,43014,32461,1
3,1,"90006 - Los Angeles(Byzantine-Latino Quarter, ...",90006,"Los Angeles(Byzantine-Latino Quarter, Harvard ...",34.048013,-118.293953,62765,33790,0
4,3,90029 - Los Angeles(East Hollywood),90029,Los Angeles(East Hollywood),34.089953,-118.294824,41697,37379,0
5,2,"90017 - Los Angeles(Downtown Bunker Hill, City...",90017,"Los Angeles(Downtown Bunker Hill, City West, S...",34.052842,-118.264495,20689,28638,4
6,1,90011 - Los Angeles(Southeast Los Angeles),90011,Los Angeles(Southeast Los Angeles),34.00709,-118.258681,101214,33824,0
7,1,"90004 - Los Angeles(Hancock Park, Rampart Vill...",90004,"Los Angeles(Hancock Park, Rampart Village, Vir...",34.076259,-118.310715,67850,46581,1
8,2,90038 - Los Angeles(Hollywood),90038,Los Angeles(Hollywood),34.088017,-118.327168,32557,36996,3
9,3,90028 - Los Angeles(Hollywood),90028,Los Angeles(Hollywood),34.099869,-118.326843,30562,40068,1


In [210]:
#The summary of the clusters

la_sum = la_merged[['Cluster Labels','Population','Median Income','Italian Restaurant']]

#Average each variable within a cluster
la_sum_grouped = la_sum.groupby('Cluster Labels').mean().reset_index()
la_sum_grouped = la_sum_grouped[['Cluster Labels','Population','Median Income','Italian Restaurant']]
#Desirability labels are assigned manually, in cluster-label order (0 to 4)
la_sum_grouped['Desirability'] = ['Low','Mid-Low','Mid','Mid-High','High']
#Count the neighborhoods in each cluster
la_sum_grouped['Number of Neighborhood'] = la_sum.groupby('Cluster Labels').size()
la_sum_grouped

Unnamed: 0,Cluster Labels,Population,Median Income,Italian Restaurant,Desirability,Number of Neighborhood
0,0,7204.833333,29296.0,0.666667,Low,6
1,1,70711.7,48913.9,0.7,Mid-Low,10
2,2,21728.444444,70406.111111,3.0,Mid,9
3,3,43232.166667,43308.9,0.3,Mid-High,30
4,4,22624.2,112789.6,0.0,High,5


The most desirable cluster (i.e. the one that combines a high median income and a large population with a low number of Italian restaurants) is cluster 4. The table shows that this cluster consists of 5 neighborhoods. Its average population is 22,624, which is well below that of clusters 1 and 3. However, its average median income of about $112,790 is far higher than that of any other cluster, the next highest being $70,406 in cluster 2. In addition, none of the neighborhoods in this cluster has an Italian restaurant nearby. This combination of characteristics makes the cluster the most attractive to the company.
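
The desirability labels in the table are assigned by hand. As a cross-check, the sketch below ranks the clusters with a simple score built from the cluster means in the summary table: each column is standardized across the five clusters, then income and population are added and the restaurant count is subtracted. The equal weighting of the three criteria is an assumption of this example and not part of the original analysis.

```python
import pandas as pd

# Cluster means copied from the summary table
summary = pd.DataFrame({
    "Cluster Labels": [0, 1, 2, 3, 4],
    "Population": [7204.833333, 70711.7, 21728.444444, 43232.166667, 22624.2],
    "Median Income": [29296.0, 48913.9, 70406.111111, 43308.9, 112789.6],
    "Italian Restaurant": [0.666667, 0.7, 3.0, 0.3, 0.0],
}).set_index("Cluster Labels")

# Standardize each column across clusters so the criteria are comparable
z = (summary - summary.mean()) / summary.std(ddof=0)

# Assumed equal weights: reward income and population, penalize competitors
score = z["Median Income"] + z["Population"] - z["Italian Restaurant"]
ranking = score.sort_values(ascending=False).index.tolist()
print(score.round(2))
print(ranking)
```

Under this equal-weight assumption cluster 4 still comes out on top, which is consistent with the choice made above; different weights could reorder the other clusters.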


In [238]:
#Create a LA map with the clusters

import numpy as np

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

#One color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for lat, lon, poi, cluster in zip(la_merged['Latitude'], la_merged['Longitude'], la_merged['Neighborhood'], la_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=(int(cluster)+1)*3, #more desirable neighborhoods get larger circles; +1 keeps cluster 0 visible
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

In [226]:
#Cluster 4 summary

cluster_4 = la_merged['Cluster Labels'] == 4
la_merged[cluster_4]

Unnamed: 0,Cluster Labels,Neighborhood,Zip,Name,Latitude,Longitude,Population,Median Income,Italian Restaurant
51,4,90056 - Ladera Heights,90056,Ladera Heights,33.987945,-118.370442,8108,84438,0
53,4,90045 - Los Angeles(Los Angeles International ...,90045,"Los Angeles(Los Angeles International Airport,...",33.954017,-118.402447,39315,90399,0
54,4,90068 - Los Angeles(Hollywood),90068,Los Angeles(Hollywood),34.137411,-118.328915,21713,82718,0
55,4,"90049 - Los Angeles(Bel Air Estates, Brentwood)",90049,"Los Angeles(Bel Air Estates, Brentwood)",34.091829,-118.491244,33520,121671,0
58,4,"90077 - Los Angeles(Bel Air Estates, Beverly G...",90077,"Los Angeles(Bel Air Estates, Beverly Glen)",34.102084,-118.451629,10465,184722,0


The table above shows the neighborhoods in cluster 4, the most desirable cluster. None of the 5 neighborhoods has an Italian restaurant within 1 kilometer of its center. Population ranges from 8,108 in Ladera Heights to 39,315 in the LA International Airport area. Median income also varies, from $82,718 in Hollywood (90068) to $184,722 in Bel Air Estates - Beverly Glen. Overall, the neighborhoods in this cluster are characterized by the absence of nearby Italian restaurants, a modest population and a high median income.


In [230]:
#Cluster 0 summary

cluster_0 = la_merged['Cluster Labels'] == 0
la_merged[cluster_0]

Unnamed: 0,Cluster Labels,Neighborhood,Zip,Name,Latitude,Longitude,Population,Median Income,Italian Restaurant
30,0,"90013 - Los Angeles(Downtown Central, Downtown...",90013,"Los Angeles(Downtown Central, Downtown Fashion...",34.044639,-118.240413,9727,22808,1
41,0,"90015 - Los Angeles(Dowtown Fashion District, ...",90015,"Los Angeles(Dowtown Fashion District, South Pa...",34.039224,-118.266293,15134,32979,1
52,0,"90010 - Los Angeles(Hancock Park, Wilshire Cen...",90010,"Los Angeles(Hancock Park, Wilshire Center, Win...",34.062125,-118.315709,1943,47115,1
56,0,"90040 - Commerce, City of",90040,"Commerce, City of",33.994524,-118.149953,9798,43585,0
57,0,"90021 - Los Angeles(Downtown Fashion District,...",90021,"Los Angeles(Downtown Fashion District, Downtow...",34.029043,-118.239504,3003,12864,1
59,0,"90058 - Los Angeles(Southeast Los Angeles), Ve...",90058,"Los Angeles(Southeast Los Angeles), Vernon",34.001617,-118.222274,3624,16425,0


The table above shows the neighborhoods in cluster 0, the least desirable cluster. In 4 of the 6 neighborhoods, there is already 1 Italian restaurant within a radius of 1 kilometer of the center. The populations are also relatively small: the largest neighborhood has only 15,134 people. The median incomes are much lower than those in cluster 4. The highest median income in the cluster is $47,115, which is only a little more than half of the lowest median income in cluster 4 ($82,718).
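
The income comparison between the two clusters can be checked directly from the values in the two cluster tables. The short calculation below is only a worked check of that arithmetic.

```python
# Median incomes copied from the cluster 4 and cluster 0 tables
cluster_4_incomes = [84438, 90399, 82718, 121671, 184722]
cluster_0_incomes = [22808, 32979, 47115, 43585, 12864, 16425]

lowest_c4 = min(cluster_4_incomes)    # 82718, Hollywood (90068)
highest_c0 = max(cluster_0_incomes)   # 47115, zip code 90010

# Share of the lowest cluster 4 income reached by the highest cluster 0 income
ratio = highest_c0 / lowest_c4
print(lowest_c4, highest_c0, round(ratio, 2))
```

The ratio is about 0.57, so even the best-off neighborhood in cluster 0 reaches little more than half of the lowest median income in cluster 4.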


### Conclusion

Based on the results above, the company should open its restaurants in the neighborhoods that make up cluster 4. These neighborhoods best fulfill the criteria set by the company: a high median income, a reasonably sized population and few or no competitors. These areas are Ladera Heights, the LA Airport area, Hollywood (90068), Bel Air Estates - Brentwood and Bel Air Estates - Beverly Glen. They are located in the north-west or south-west part of Los Angeles and are known for having affluent residents.
