# Battle of the Neighborhoods: Capstone Project


**Coursera IBM Data Science Capstone Project**   
May 2020  
by Claire Li

# Table of Contents
1. [Introduction/Business Problem](#introduction)
2. [Methodology Overview](#overview)
3. [Part 1: Demographic Data](#demodata)  
    1. [Preparing the Data](#dataLA)   
    2. [Clustering](#clusterLA) 
4. [Part 2: Foursquare Data](#foursquare)
    1. [Preparing the Data](#datafoursquare)   
    2. [Clustering](#clusterfoursquare) 
5. [Results](#Results)

## 1. Introduction/Business Problem <a name="introduction"></a>
The client is the owner of a fast-casual Chinese restaurant in Irvine, California, a suburban city in Orange County. The restaurant has been successful over its years in business, and the client now wants to take the next step and expand into the Los Angeles metropolitan area. 

Los Angeles is a bustling urban area with some of the nation’s highest rents, so opening a storefront there is a considerable investment, but it also promises the opportunity for high returns and rapid growth. The client is seeking a location for the new venture that shares characteristics with the existing location in Orange County: specifically, a similar demographic profile and a similar competitor landscape. The aim of this study is therefore to perform a preliminary analysis and suggest candidate neighborhoods for the client’s proposed new venture. The business problem is: which neighborhoods in Los Angeles have a similar demographic profile and competitor landscape to Irvine, California, and are potential locations for the client’s proposed new restaurant? 





## 2. Methodology Overview <a name="overview"></a>
The features we seek in a potential location include:
* A similar demographic profile to the original Irvine location. Factors to be included in the analysis are: age distribution, median income, and race/ethnicity. 
* Similar consumer preferences. This will be measured by the types of businesses in the area.  
* A favorable proportion of existing Chinese restaurants.

This study was conducted in two parts. First, we examine the demographic composition of various LA neighborhoods in order to find a similar demographic area to the original Irvine location. Second, we study competitor landscape and consumer preferences via the Foursquare API. 

### Los Angeles Neighborhood Data
We will use the list of L.A. County neighborhoods as defined by the Los Angeles Times, available here: 
http://s3-us-west-2.amazonaws.com/boundaries.latimes.com/archive/1.0/boundary-set/la-county-neighborhoods-v6.geojson

### Los Angeles Demographic Statistics
We are interested in identifying a neighborhood of Los Angeles with comparable characteristics to the original restaurant location in Orange County. To do this, we will use data about age distribution (available here: https://usc.data.socrata.com/Los-Angeles/Age-Distribution-LA-/rqg9-k6ju/data), income (available here: https://usc.data.socrata.com/Los-Angeles/Income-LA-/kygc-fzgm/data [**Edit**: this is now unavailable; this one was used instead: https://maps.latimes.com/neighborhoods/income/median/neighborhood/list/]), and race/ethnicity (available here: https://usc.data.socrata.com/Los-Angeles/Race-Ethnicity-LA-/jxw5-xxv5/data). 

### Irvine Demographic Statistics
We are specifically interested in the demographic data of Irvine, California, as provided by the US Census Bureau: https://www.census.gov/quickfacts/fact/table/irvinecitycalifornia/PST045219? and
https://data.census.gov/cedsci/table?q=%20irvine&g=1600000US0636770&tid=ACSDP1Y2018.DP05&hidePreview=true

### Foursquare API 
Foursquare is a search-and-discovery platform that aims to help users discover and share information about local businesses and attractions. We used the Foursquare API for location-based insights on local venues in order to (1) determine the top types of venues in each neighborhood, to get a sense of consumer preferences, (2) find a neighborhood with a business landscape similar to Irvine's, and (3) examine the relative frequency of existing Chinese restaurants in each neighborhood.  
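
To make "relative frequency" concrete: for one neighborhood, it is the share of all returned venues that fall into a given category. The sketch below uses a hypothetical list of category names, not real Foursquare output, and the helper name `category_frequencies` is an assumption for illustration.

```python
from collections import Counter

def category_frequencies(categories):
    """Return each venue category's share of all venues in one neighborhood."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {}  # no venues returned for this neighborhood
    return {category: n / total for category, n in counts.items()}

# Hypothetical venue categories for a single neighborhood
sample_venues = [
    "Chinese Restaurant", "Coffee Shop", "Chinese Restaurant", "Park",
    "Coffee Shop", "Gym", "Bakery", "Chinese Restaurant",
]
freqs = category_frequencies(sample_venues)
print(freqs["Chinese Restaurant"])  # 3 of 8 venues
```

Comparing this share across neighborhoods, rather than raw counts, keeps areas with many venues from dominating the comparison.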


## 3. Part 1: Demographic Data  <a name="demodata"></a>

In this section, we analyze the demographic information for LA and Irvine. 

### Preparing the Data <a name="dataLA"></a>   
We will first work with the Los Angeles data. The first step is to get a list of all the Los Angeles neighborhoods, along with their coordinates. We use data made available by the LA Times. 

#### Los Angeles Neighborhood Data 

In [10]:
#Import libraries
import numpy as np
import pandas as pd
! pip install geopandas
import geopandas as gpd 
! pip install --user folium
import folium
import json 
import requests 

# Matplotlib and associated plotting modules
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import matplotlib.cm as cm
import matplotlib.colors as colors

print("Libraries imported")

Libraries imported


In [11]:
#Download the geojson file from the LATimes
!wget -O LA_neighborhoods.geojson http://s3-us-west-2.amazonaws.com/boundaries.latimes.com/archive/1.0/boundary-set/la-county-neighborhoods-current.geojson
print("DONE")

#Use geopandas to read the geojson file 
la_geo = r"LA_neighborhoods.geojson" #geojson file
neighborhoods_data = gpd.read_file(la_geo)
neighborhoods_data.head()

--2020-05-16 02:55:19--  http://s3-us-west-2.amazonaws.com/boundaries.latimes.com/archive/1.0/boundary-set/la-county-neighborhoods-current.geojson
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.201.192
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.201.192|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1345788 (1.3M) [application/geo+json]
Saving to: ‘LA_neighborhoods.geojson’


2020-05-16 02:55:20 (4.17 MB/s) - ‘LA_neighborhoods.geojson’ saved [1345788/1345788]

DONE


Unnamed: 0,kind,external_id,name,slug,set,metadata,resource_uri,geometry
0,L.A. County Neighborhood (Current),acton,Acton,acton-la-county-neighborhood-current,/1.0/boundary-set/la-county-neighborhoods-curr...,"{'sqmi': 39.3391089485, 'type': 'unincorporate...",/1.0/boundary/acton-la-county-neighborhood-cur...,"MULTIPOLYGON (((-118.20262 34.53899, -118.1894..."
1,L.A. County Neighborhood (Current),adams-normandie,Adams-Normandie,adams-normandie-la-county-neighborhood-current,/1.0/boundary-set/la-county-neighborhoods-curr...,"{'sqmi': 0.805350187789, 'type': 'segment-of-a...",/1.0/boundary/adams-normandie-la-county-neighb...,"MULTIPOLYGON (((-118.30901 34.03741, -118.3004..."
2,L.A. County Neighborhood (Current),agoura-hills,Agoura Hills,agoura-hills-la-county-neighborhood-current,/1.0/boundary-set/la-county-neighborhoods-curr...,"{'sqmi': 8.14676029818, 'type': 'standalone-ci...",/1.0/boundary/agoura-hills-la-county-neighborh...,"MULTIPOLYGON (((-118.76193 34.16820, -118.7263..."
3,L.A. County Neighborhood (Current),agua-dulce,Agua Dulce,agua-dulce-la-county-neighborhood-current,/1.0/boundary-set/la-county-neighborhoods-curr...,"{'sqmi': 31.4626319451, 'type': 'unincorporate...",/1.0/boundary/agua-dulce-la-county-neighborhoo...,"MULTIPOLYGON (((-118.25468 34.55830, -118.2555..."
4,L.A. County Neighborhood (Current),alhambra,Alhambra,alhambra-la-county-neighborhood-current,/1.0/boundary-set/la-county-neighborhoods-curr...,"{'sqmi': 7.62381430605, 'type': 'standalone-ci...",/1.0/boundary/alhambra-la-county-neighborhood-...,"MULTIPOLYGON (((-118.12175 34.10504, -118.1168..."


In [12]:
#We will only work with the columns 'name', 'metadata', and 'geometry'
#(.copy() gives an independent GeoDataFrame, which avoids pandas' SettingWithCopyWarning on the assignments below)
neighborhoods = neighborhoods_data[['name','metadata','geometry']].copy()

#Add Longitude/Latitude coordinates using the centroid of the geographical region
neighborhoods['Longitude'] = neighborhoods.centroid.x
neighborhoods['Latitude'] = neighborhoods.centroid.y

#Drop unnecessary columns and rename columns 
neighborhoods.drop(columns = ['geometry', 'metadata'], inplace = True)
neighborhoods.rename(columns={'name':'Neighborhood'}, inplace=True)

print("There are a total of {} neighborhoods".format(neighborhoods.shape[0]))
neighborhoods.head()

There are a total of 272 neighborhoods


| | Neighborhood | Longitude | Latitude |
|---|---|---|---|
| 0 | Acton | -118.185799 | 34.495516 |
| 1 | Adams-Normandie | -118.300288 | 34.031411 |
| 2 | Agoura Hills | -118.760944 | 34.150734 |
| 3 | Agua Dulce | -118.313371 | 34.508909 |
| 4 | Alhambra | -118.135494 | 34.083967 |
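
For intuition about what the centroid of a region is, here is a minimal pure-Python sketch of the area-weighted centroid of a single simple polygon (the shoelace formula). It treats longitude and latitude as planar coordinates, which is an approximation that is usually acceptable at neighborhood scale; geopandas may warn about this when centroids are computed in a geographic coordinate system. The function name and the rectangle are hypothetical.

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    # Close the ring if the first vertex is not repeated at the end
    if ring[0] != ring[-1]:
        ring = ring + [ring[0]]
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # Centroid = sum / (6 * area), and 6 * area == 3 * twice_area
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)

# Hypothetical rectangular "neighborhood" in (longitude, latitude) order
rectangle = [(-118.2, 34.0), (-118.0, 34.0), (-118.0, 34.1), (-118.2, 34.1)]
lng, lat = polygon_centroid(rectangle)
print(round(lng, 4), round(lat, 4))
```

For an irregular shape the centroid can fall outside the polygon, so a marker placed at it is a visual aid rather than a guaranteed point inside the neighborhood.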


We now have a list of the LA neighborhoods along with their centroid coordinates. Let's plot them on a map to get a better visual. 

In [13]:
lat = 34.0522
long = -118.243683

# create map of LA using latitude and longitude values
map_la = folium.Map(location=[lat, long], zoom_start=10)

# add markers to map
for lat, lng, name in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Neighborhood']):
    label = '{}'.format(name)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_la)  
    
map_la

#### Los Angeles Income Data 

Next, we examine some of the demographic information for each of the neighborhoods, starting with income data. This gives us a list of neighborhoods alongside their median income, which we then join with our existing LA neighborhoods dataframe.  

(**UPDATE**: The dataset originally used here is no longer available, so we now use this less thorough one.) 
https://maps.latimes.com/neighborhoods/income/median/neighborhood/list/


In [14]:
#Read the income data
la_income = pd.read_html('https://maps.latimes.com/neighborhoods/income/median/neighborhood/list/')[1]
la_income.drop(columns = ['Rank'], inplace = True)
la_income.head()

| | Neighborhood | Median Income |
|---|---|---|
| 0 | Bel-Air | $207,938 |
| 1 | Hidden Hills | $203,199 |
| 2 | Rolling Hills | $184,777 |
| 3 | Beverly Crest | $169,282 |
| 4 | Pacific Palisades | $168,008 |


In [39]:
#Merge the income data with neighborhoods 
la_data = neighborhoods.merge(la_income, how = 'left', on = 'Neighborhood')

#Drop all of the NaN values and cast income values as integers
la_data.dropna(inplace=True)
#regex=False treats "$" as a literal character rather than the regex end-of-string anchor
la_data['Median Income']= la_data['Median Income'].str.replace("$", "", regex=False)
la_data['Median Income']= la_data['Median Income'].str.replace(",", "", regex=False)
la_data['Median Income'] = pd.to_numeric(la_data['Median Income'], downcast = 'integer') 
la_data.head()

| | Neighborhood | Longitude | Latitude | Median Income |
|---|---|---|---|---|
| 0 | Acton | -118.185799 | 34.495516 | 83983 |
| 1 | Adams-Normandie | -118.300288 | 34.031411 | 29606 |
| 2 | Agoura Hills | -118.760944 | 34.150734 | 117608 |
| 3 | Agua Dulce | -118.313371 | 34.508909 | 106078 |
| 4 | Alhambra | -118.135494 | 34.083967 | 53224 |
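
The currency clean-up can be tried in isolation. The sketch below uses a small hypothetical Series that mimics the scraped `Median Income` column, including one missing value; passing `regex=False` makes the replacement literal, which matters because `$` is otherwise a regex metacharacter.

```python
import pandas as pd

# Hypothetical sample shaped like the scraped income column
income = pd.Series(["$207,938", "$53,224", None], name="Median Income")

cleaned = (
    income.dropna()                       # drop neighborhoods with no income value
    .str.replace("$", "", regex=False)    # strip the literal dollar sign
    .str.replace(",", "", regex=False)    # strip thousands separators
    .astype(int)
)
print(cleaned.tolist())  # [207938, 53224]
```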


#### Los Angeles Age Distribution Data 

Returning to our data, the next piece of information that we'd like to know is population and age distribution. 

In [40]:
#Get the data 
url = 'https://usc.data.socrata.com/resource/rqg9-k6ju.json?year=2018&$limit=20000'
response = requests.get(url)
data = response.json()
la_age_all = pd.DataFrame.from_records(data)

print(la_age_all.shape)
la_age_all.head()

(16408, 15)


Unnamed: 0,count,dataset,date,denominator,denominator_description,geoid,location,neighborhood,percent,policy_area,row_id,tract,tract_number,variable,year
0,,Age Distribution,2018-01-01T00:00:00.000,,Total Population,1400000US06037980001,"{'latitude': '34.19972039', 'longitude': '-118...",Burbank,,Demography,Population_Under_Age_18_2018_1400000US06037980001,Census Tract 9800.01,980001,Population Under Age 18,2018
1,1510.0,Age Distribution,2018-01-01T00:00:00.000,5800.0,Total Population,1400000US06037920313,"{'latitude': '34.37156477', 'longitude': '-118...",Santa Clarita,26.034483,Demography,Population_Under_Age_18_2018_1400000US06037920313,Census Tract 9203.13,920313,Population Under Age 18,2018
2,,Age Distribution,2018-01-01T00:00:00.000,,Total Population,1400000US06037980028,"{'latitude': '33.94102572', 'longitude': '-118...",Westchester,,Demography,Population_Under_Age_18_2018_1400000US06037980028,Census Tract 9800.28,980028,Population Under Age 18,2018
3,,Age Distribution,2018-01-01T00:00:00.000,,Total Population,1400000US06037980014,"{'latitude': '33.78420408', 'longitude': '-118...",Wilmington,,Demography,Population_Under_Age_18_2018_1400000US06037980014,Census Tract 9800.14,980014,Population Under Age 18,2018
4,893.0,Age Distribution,2018-01-01T00:00:00.000,3824.0,Total Population,1400000US06037920331,"{'latitude': '34.39710708', 'longitude': '-118...",Santa Clarita,23.35251,Demography,Population_Under_Age_18_2018_1400000US06037920331,Census Tract 9203.31,920331,Population Under Age 18,2018


Next, we clean the data so that we are only working with the relevant columns about age distribution.

In [41]:
#Drop all NaN values 
la_age_all.dropna(inplace=True)

#Only want the neighborhood, variable (= age category), count (= number in each age category), and denominator (= total population)
#(.copy() avoids pandas' SettingWithCopyWarning on the in-place sort below)
la_age = la_age_all[['neighborhood', 'variable', 'count', 'denominator']].copy()
la_age.sort_values(by = 'neighborhood', inplace = True)

#Cast as int
la_age = la_age.astype({"count": int, "denominator": int})

#Sum up the total for each age group by neighborhood (neighborhoods potentially have multiple tracts; we combine these to get a total for each neighborhood)
la_age = la_age.groupby(['neighborhood', 'variable']).agg({'count': 'sum', 'denominator': 'sum'})

#Calculate the percentage in each age group
la_age['Percentage'] = la_age['count']/la_age['denominator']
la_age.reset_index(drop = False, inplace = True)

#Rename columns and format percentage
la_age.rename(columns={'neighborhood':'Neighborhood', 'variable': 'Age Group', 'count': 'Count', 'denominator': 'Total Population'}, inplace=True)
la_age['Percentage'] = la_age['Percentage'] * 100

la_age.head(15)


| | Neighborhood | Age Group | Count | Total Population | Percentage |
|---|---|---|---|---|---|
| 0 | Acton | Population Ages 18-24 | 635 | 7613 | 8.340996 |
| 1 | Acton | Population Ages 25-34 | 736 | 7613 | 9.667674 |
| 2 | Acton | Population Ages 35-44 | 782 | 7613 | 10.271903 |
| 3 | Acton | Population Ages 45-54 | 1339 | 7613 | 17.588336 |
| 4 | Acton | Population Ages 55-64 | 1568 | 7613 | 20.596348 |
| 5 | Acton | Population Ages 65 & Older | 1218 | 7613 | 15.998949 |
| 6 | Acton | Population Under Age 18 | 1335 | 7613 | 17.535794 |
| 7 | Adams-Normandie | Population Ages 18-24 | 3337 | 18237 | 18.297966 |
| 8 | Adams-Normandie | Population Ages 25-34 | 2923 | 18237 | 16.027855 |
| 9 | Adams-Normandie | Population Ages 35-44 | 2244 | 18237 | 12.304655 |
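
Summing counts and denominators across tracts before dividing gives a population-weighted share, which can differ from the plain average of tract-level percentages when tracts differ in size. A minimal sketch on hypothetical tract data (the neighborhood names and numbers are made up for illustration):

```python
import pandas as pd

# Hypothetical tracts: neighborhood A has a small tract (10%) and a large one (30%)
tracts = pd.DataFrame({
    "neighborhood": ["A", "A", "B"],
    "variable": ["Population Under Age 18"] * 3,
    "count": [100, 600, 50],
    "denominator": [1000, 2000, 200],
})

# Pool counts and populations per neighborhood, then compute the share
by_hood = tracts.groupby(["neighborhood", "variable"], as_index=False)[
    ["count", "denominator"]
].sum()
by_hood["percentage"] = 100 * by_hood["count"] / by_hood["denominator"]

print(by_hood)
```

For neighborhood A the pooled share is 700 / 3000, about 23.3%, whereas averaging the two tract percentages would give 20%.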


We then create a new DataFrame to display the percentage of each age group for each neighborhood. Then, we merge with the rest of our LA data.

In [42]:
#Create a DataFrame that displays the age distribution and total population for each neighborhood 
percentages_list = la_age.groupby('Neighborhood')
result = percentages_list['Percentage'].unique()

percentages = pd.DataFrame(result)
percentages[['Ages 18-24','Ages 25-34', 'Ages 35-44','Ages 45-54','Ages 55-64', 'Ages 65 & Older', 'Under Age 18' ]] = pd.DataFrame(percentages.Percentage.tolist(), index= percentages.index)
percentages.drop(columns = ['Percentage'], inplace = True)
percentages = percentages[['Under Age 18', 'Ages 18-24','Ages 25-34', 'Ages 35-44','Ages 45-54','Ages 55-64', 'Ages 65 & Older']]

total_pop_list = la_age.groupby('Neighborhood')
result = total_pop_list['Total Population'].unique()
total_pop = pd.DataFrame(result)
total_pop = total_pop.astype({"Total Population": int})

la_age_final = percentages.merge(total_pop, how = 'outer', left_on = 'Neighborhood', right_on = 'Neighborhood')
la_age_final

#Merge with la_data 
la_data = la_data.merge(la_age_final, how = 'left', left_on = 'Neighborhood', right_on = 'Neighborhood')
print(la_data.shape)
la_data.head(40)

(264, 12)


Unnamed: 0,Neighborhood,Longitude,Latitude,Median Income,Under Age 18,Ages 18-24,Ages 25-34,Ages 35-44,Ages 45-54,Ages 55-64,Ages 65 & Older,Total Population
0,Acton,-118.185799,34.495516,83983,17.535794,8.340996,9.667674,10.271903,17.588336,20.596348,15.998949,7613.0
1,Adams-Normandie,-118.300288,34.031411,29606,20.946428,18.297966,16.027855,12.304655,11.443768,11.860503,9.118824,18237.0
2,Agoura Hills,-118.760944,34.150734,117608,22.520192,6.799345,9.77142,13.350578,17.763818,15.382991,14.411656,18943.0
3,Agua Dulce,-118.313371,34.508909,106078,17.72214,8.492882,12.714777,7.756505,15.635739,20.250368,17.42759,4074.0
4,Alhambra,-118.135494,34.083967,53224,17.292348,8.549674,16.339115,13.66418,14.331443,12.968673,16.854567,84974.0
5,Alondra Park,-118.334949,33.88852,57177,18.681319,9.442409,11.294261,10.826211,13.492063,15.934066,20.32967,4914.0
6,Altadena,-118.13555,34.193466,82676,19.43049,7.353173,12.282322,12.207979,16.704589,14.701841,17.319606,44389.0
7,Angeles Crest,-117.909387,34.290855,72841,,12.401353,19.954904,13.303269,15.332582,16.121759,2.931229,887.0
8,Arcadia,-118.037229,34.134077,75808,22.244256,6.236469,10.546328,13.178816,15.400149,14.381356,18.012626,54967.0
9,Arleta,-118.432099,34.241436,65649,22.267747,11.44496,17.824517,12.589153,12.992807,11.611885,11.268931,32949.0


#### Los Angeles Race and Ethnicity Data 

Finally, we will include race and ethnicity data for the Los Angeles neighborhoods. 

In [43]:
#Get the data
url = 'https://usc.data.socrata.com/resource/jxw5-xxv5.json?year=2016&$limit=20000'
response = requests.get(url)
data = response.json()
re_data_all = pd.DataFrame.from_records(data)

print(re_data_all.shape)
re_data_all.head()

(18752, 13)


Unnamed: 0,count,dataset,date,geoid,location,neighborhood,percent,policy_area,row_id,tract,tract_number,variable,year
0,66,Race & Ethnicity,2016-01-01T00:00:00.000,1400000US06037271500,"(34.01663, -118.4375635)",Mar Vista,2.0899303,Demography,Black_Population_2016_1400000US06037271500,"Census Tract 2715, Los Angeles County, California",271500,Black Population,2016
1,19,Race & Ethnicity,2016-01-01T00:00:00.000,1400000US06037311200,"(34.1714255, -118.3527755)",Burbank,0.5856967,Demography,Black_Population_2016_1400000US06037311200,"Census Tract 3112, Los Angeles County, California",311200,Black Population,2016
2,136,Race & Ethnicity,2016-01-01T00:00:00.000,1400000US06037311300,"(34.173525, -118.342414)",Burbank,3.4482758,Demography,Black_Population_2016_1400000US06037311300,"Census Tract 3113, Los Angeles County, California",311300,Black Population,2016
3,51,Race & Ethnicity,2016-01-01T00:00:00.000,1400000US06037311400,"(34.162038, -118.34958)",Burbank,2.2222223,Demography,Black_Population_2016_1400000US06037311400,"Census Tract 3114, Los Angeles County, California",311400,Black Population,2016
4,161,Race & Ethnicity,2016-01-01T00:00:00.000,1400000US06037311500,"(34.164754, -118.33837)",Burbank,2.9315367,Demography,Black_Population_2016_1400000US06037311500,"Census Tract 3115, Los Angeles County, California",311500,Black Population,2016


In [44]:
#Drop all the NaN values 
re_data_all.dropna(inplace=True)

#Choose only the columns we want
#(.copy() avoids pandas' SettingWithCopyWarning on the in-place sort below)
re_data = re_data_all[['neighborhood', 'variable', 'count']].copy()
re_data.sort_values(by = 'neighborhood', inplace = True)

#Cast as int
re_data = re_data.astype({"count": int})

#Sum up the total for each race/ethnicity group by neighborhood (neighborhoods potentially have multiple tracts; we combine these to get a total for each neighborhood)
re_data = re_data.groupby(['neighborhood', 'variable']).agg({'count': 'sum'})
re_data.reset_index(drop = False, inplace = True)

#Rename columns
re_data.rename(columns={'neighborhood':'Neighborhood', 'variable': 'Race/Ethnicity', 'count': 'Count'}, inplace=True)
re_data.head(10)


| | Neighborhood | Race/Ethnicity | Count |
|---|---|---|---|
| 0 | Acton | American Indian/Native Population | 23 |
| 1 | Acton | Asian Population | 114 |
| 2 | Acton | Black Population | 180 |
| 3 | Acton | Hispanic Population | 1417 |
| 4 | Acton | Native Hawaiian/Other Pacific Islander Population | 5 |
| 5 | Acton | Other Race Population | 56 |
| 6 | Acton | Population of Two or More Races | 271 |
| 7 | Acton | White Population | 5705 |
| 8 | Adams-Normandie | American Indian/Native Population | 0 |
| 9 | Adams-Normandie | Asian Population | 1363 |


We create a DataFrame with each row being a Neighborhood, and each column being a category. Finally, we merge it with the rest of the LA demographic data in la_data and calculate percentages for each ethnicity group in each neighborhood.

In [45]:
counts_list = re_data.groupby('Neighborhood')
result = counts_list['Count'].unique()
counts = pd.DataFrame(result)
counts[['American Indian/Native Population','Asian Population', 'Black Population','Hispanic Population','Native Hawaiian/Other Pacific Islander Population', 'Other Race Population', 'Population of Two or More Races', 'White Population']] = pd.DataFrame(counts.Count.tolist(), index= counts.index)
counts.drop(columns = ['Count'], inplace = True)
counts

#Calculate the percentage for each ethnicity group
la_data = la_data.merge(counts, how = 'left', left_on = 'Neighborhood', right_on = 'Neighborhood')
la_data.iloc[:, 12:] = la_data.iloc[:, 12:].div(la_data['Total Population'], axis=0) * 100

la_data.head(10)

Unnamed: 0,Neighborhood,Longitude,Latitude,Median Income,Under Age 18,Ages 18-24,Ages 25-34,Ages 35-44,Ages 45-54,Ages 55-64,Ages 65 & Older,Total Population,American Indian/Native Population,Asian Population,Black Population,Hispanic Population,Native Hawaiian/Other Pacific Islander Population,Other Race Population,Population of Two or More Races,White Population
0,Acton,-118.185799,34.495516,83983,17.535794,8.340996,9.667674,10.271903,17.588336,20.596348,15.998949,7613.0,0.302115,1.497439,2.364377,18.612899,0.065677,0.735584,3.559701,74.937607
1,Adams-Normandie,-118.300288,34.031411,29606,20.946428,18.297966,16.027855,12.304655,11.443768,11.860503,9.118824,18237.0,0.0,7.473817,20.73806,60.514339,0.060317,0.515436,0.564786,6.267478
2,Agoura Hills,-118.760944,34.150734,117608,22.520192,6.799345,9.77142,13.350578,17.763818,15.382991,14.411656,18943.0,0.05279,6.181703,1.567861,13.086628,0.0,0.47511,2.549755,75.637439
3,Agua Dulce,-118.313371,34.508909,106078,17.72214,8.492882,12.714777,7.756505,15.635739,20.250368,17.42759,4074.0,0.0,1.669121,0.147275,15.80756,6.136475,67.525773,,
4,Alhambra,-118.135494,34.083967,53224,17.292348,8.549674,16.339115,13.66418,14.331443,12.968673,16.854567,84974.0,0.304799,50.363641,1.758185,36.821851,0.124744,0.055311,1.356886,9.087486
5,Alondra Park,-118.334949,33.88852,57177,18.681319,9.442409,11.294261,10.826211,13.492063,15.934066,20.32967,4914.0,0.0,33.455433,4.192104,30.30118,0.793651,1.668702,30.769231,
6,Altadena,-118.13555,34.193466,82676,19.43049,7.353173,12.282322,12.207979,16.704589,14.701841,17.319606,44389.0,0.083354,7.465814,21.41071,26.517831,0.0,0.653315,4.911127,38.502782
7,Angeles Crest,-117.909387,34.290855,72841,,12.401353,19.954904,13.303269,15.332582,16.121759,2.931229,887.0,0.0,1.014656,9.357384,23.111612,5.18602,62.683202,,
8,Arcadia,-118.037229,34.134077,75808,22.244256,6.236469,10.546328,13.178816,15.400149,14.381356,18.012626,54967.0,0.272891,58.402314,1.635527,12.884094,0.136446,0.24924,2.748922,22.866447
9,Arleta,-118.432099,34.241436,65649,22.267747,11.44496,17.824517,12.589153,12.992807,11.611885,11.268931,32949.0,0.20638,8.828796,1.025828,76.405961,0.0,0.17603,0.667699,6.370451
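
The race/ethnicity counts are requested for 2016, while `Total Population` comes from the 2018 age data, so the resulting shares need not sum to 100 (the Acton row sums to roughly 102). One possible alternative, sketched here on hypothetical numbers and not the approach used in this notebook, is to select the race columns by name and normalize by their own row sum:

```python
import pandas as pd

# Hypothetical counts; only three groups are shown to keep the example short
race_cols = ["Asian Population", "Hispanic Population", "White Population"]
df = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Total Population": [1000, 500],      # from a different survey year
    "Asian Population": [200, 50],
    "Hispanic Population": [300, 150],
    "White Population": [500, 250],       # B's counts sum to 450, not 500
})

# Divide each count by the row's own total so every row sums to 100
shares = df[race_cols].div(df[race_cols].sum(axis=1), axis=0) * 100
print(shares.round(2))
```

Selecting columns by name also avoids depending on the positional slice `iloc[:, 12:]`, which silently changes meaning if a column is added or removed upstream.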


This concludes the preparation of the LA demographic data. 

#### Irvine Data 

Next, we examine the same metrics for the city of Irvine in Orange County, where the client's original restaurant is located.


In [22]:
# The code was removed by Watson Studio for sharing.

In [23]:
#Read the Irvine data (uploaded as a csv to IBM Cloud Object Storage)
irvine_data = pd.read_csv(body, header = 1)
irvine_data.head()

Unnamed: 0,id,Geographic Area Name,Estimate!!SEX AND AGE!!Total population,Margin of Error!!SEX AND AGE!!Total population,Percent Estimate!!SEX AND AGE!!Total population,Percent Margin of Error!!SEX AND AGE!!Total population,Estimate!!SEX AND AGE!!Total population!!Male,Margin of Error!!SEX AND AGE!!Total population!!Male,Percent Estimate!!SEX AND AGE!!Total population!!Male,Percent Margin of Error!!SEX AND AGE!!Total population!!Male,...,"Percent Estimate!!CITIZEN, VOTING AGE POPULATION!!Citizen, 18 and over population","Percent Margin of Error!!CITIZEN, VOTING AGE POPULATION!!Citizen, 18 and over population","Estimate!!CITIZEN, VOTING AGE POPULATION!!Citizen, 18 and over population!!Male","Margin of Error!!CITIZEN, VOTING AGE POPULATION!!Citizen, 18 and over population!!Male","Percent Estimate!!CITIZEN, VOTING AGE POPULATION!!Citizen, 18 and over population!!Male","Percent Margin of Error!!CITIZEN, VOTING AGE POPULATION!!Citizen, 18 and over population!!Male","Estimate!!CITIZEN, VOTING AGE POPULATION!!Citizen, 18 and over population!!Female","Margin of Error!!CITIZEN, VOTING AGE POPULATION!!Citizen, 18 and over population!!Female","Percent Estimate!!CITIZEN, VOTING AGE POPULATION!!Citizen, 18 and over population!!Female","Percent Margin of Error!!CITIZEN, VOTING AGE POPULATION!!Citizen, 18 and over population!!Female"
0,1600000US0636770,"Irvine city, California",282584,66,282584,(X),138076,4009,48.9,1.4,...,168948,(X),82299,4228,48.7,1.6,86649,4804,51.3,1.6


The following code filters out and formats the data for race/ethnicity to be used in the analysis:

In [24]:
irvine_eth = irvine_data.filter(regex='Percent Estimate',axis=1)
irvine_eth = irvine_eth.filter(regex='HISPANIC OR LATINO AND RACE',axis=1)

#Fix the column labels
column_labels = ['Total Population']
for name in irvine_eth.columns[1:]: 
    new_name = name.replace("Percent Estimate!!HISPANIC OR LATINO AND RACE!!Total population!!", "")
    column_labels.append(new_name)    
irvine_eth.columns = column_labels

#Drop unnecessary columns 
irvine_eth.drop(irvine_eth.iloc[:, 2:6], inplace = True, axis = 1)
irvine_eth.drop(irvine_eth.iloc[:, -2:], inplace = True, axis = 1)
irvine_eth.drop(columns = ['Not Hispanic or Latino', 'Total Population'], inplace = True)

#Format to match LA data
column_labels = ['Hispanic Population', 'White Population']
for name in irvine_eth.columns[2:]: 
    new_name = name.replace("Not Hispanic or Latino!!", "")
    column_labels.append(new_name)    
irvine_eth.columns = column_labels

irvine_eth = irvine_eth.sort_index(axis=1)
irvine_eth.columns = ['American Indian/Native Population', 'Asian Population', 'Black Population', 'Hispanic Population', 'Native Hawaiian/Other Pacific Islander Population', 'Other Race Population', 'Population of Two or More Races', 'White Population']

irvine_eth.head()

Unnamed: 0,American Indian/Native Population,Asian Population,Black Population,Hispanic Population,Native Hawaiian/Other Pacific Islander Population,Other Race Population,Population of Two or More Races,White Population
0,0.4,42.5,1.0,12.5,0.1,0.4,4.2,38.9


And the following code filters out and formats the data for age distribution to be used in the analysis:

In [25]:
irvine_age_all = irvine_data.filter(like='SEX AND AGE',axis=1)
irvine_age_all = irvine_age_all.filter(regex='Percent Estimate',axis=1)

irvine_age_all.drop(list(irvine_age_all.filter(regex='Error')), axis = 1, inplace = True)

irvine_age_all = irvine_age_all.iloc[:, 0:18]
irvine_age_all

Unnamed: 0,Percent Estimate!!SEX AND AGE!!Total population,Percent Estimate!!SEX AND AGE!!Total population!!Male,Percent Estimate!!SEX AND AGE!!Total population!!Female,Percent Estimate!!SEX AND AGE!!Total population!!Sex ratio (males per 100 females),Percent Estimate!!SEX AND AGE!!Total population!!Under 5 years,Percent Estimate!!SEX AND AGE!!Total population!!5 to 9 years,Percent Estimate!!SEX AND AGE!!Total population!!10 to 14 years,Percent Estimate!!SEX AND AGE!!Total population!!15 to 19 years,Percent Estimate!!SEX AND AGE!!Total population!!20 to 24 years,Percent Estimate!!SEX AND AGE!!Total population!!25 to 34 years,Percent Estimate!!SEX AND AGE!!Total population!!35 to 44 years,Percent Estimate!!SEX AND AGE!!Total population!!45 to 54 years,Percent Estimate!!SEX AND AGE!!Total population!!55 to 59 years,Percent Estimate!!SEX AND AGE!!Total population!!60 to 64 years,Percent Estimate!!SEX AND AGE!!Total population!!65 to 74 years,Percent Estimate!!SEX AND AGE!!Total population!!75 to 84 years,Percent Estimate!!SEX AND AGE!!Total population!!85 years and over,Percent Estimate!!SEX AND AGE!!Total population!!Median age (years)
0,282584,48.9,51.1,(X),6.4,6.0,6.5,7.7,9.4,16.2,14.7,12.6,5.6,4.4,6.3,3.1,1.2,(X)


In [26]:
#Fix the column labels
column_labels = ['Total Population']
for name in irvine_age_all.columns[1:]: 
    new_name = name.replace("Percent Estimate!!SEX AND AGE!!Total population!!", "")
    column_labels.append(new_name)
irvine_age_all.columns = column_labels

#Drop unnecessary columns 
irvine_age = irvine_age_all.drop(columns = ['Male', 'Female','Sex ratio (males per 100 females)', 'Median age (years)'], axis = 1)

#Create same age categories as the LA neighborhoods
irvine_age['Under Age 18'] = irvine_age.iloc[:,1:5].sum(axis=1)
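#(approximation: the census brackets are 15-19 and 20-24, so 'Under Age 18' above also counts ages 18-19 and 'Ages 18-24' below covers only ages 20-24)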
irvine_age['Ages 18-24'] = irvine_age['20 to 24 years']
irvine_age['Ages 25-34'] = irvine_age['25 to 34 years']
irvine_age['Ages 35-44'] = irvine_age['35 to 44 years']
irvine_age['Ages 45-54'] = irvine_age['45 to 54 years']
irvine_age['Ages 55-64']= irvine_age['55 to 59 years'] + irvine_age['60 to 64 years']
irvine_age['Ages 65 & Older'] = irvine_age['65 to 74 years'] + irvine_age['75 to 84 years'] + irvine_age['85 years and over'] 

#Move Total Population column to the end
total_pop = irvine_age.pop('Total Population')
irvine_age['Total Population'] = total_pop

#Drop all the original categories 
irvine_age.drop(irvine_age.iloc[:, 0:13], inplace = True, axis = 1) 

irvine_age.astype(float, copy = False)
irvine_age.head()

| | Under Age 18 | Ages 18-24 | Ages 25-34 | Ages 35-44 | Ages 45-54 | Ages 55-64 | Ages 65 & Older | Total Population |
|---|---|---|---|---|---|---|---|---|
| 0 | 26.6 | 9.4 | 16.2 | 14.7 | 12.6 | 10.0 | 10.6 | 282584 |
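
Because the census brackets (15 to 19, 20 to 24) do not line up with the LA categories (Under 18, 18-24), the Irvine figures above assign the whole 15 to 19 bracket to `Under Age 18`. A rough refinement, shown here only as a sketch, splits that bracket under the assumption that people are spread evenly across its five single years of age. That assumption is not from the source data and may not hold in a university city.

```python
# Irvine percent estimates taken from the census table above
census = {
    "Under 5": 6.4,
    "5 to 9": 6.0,
    "10 to 14": 6.5,
    "15 to 19": 7.7,
    "20 to 24": 9.4,
}

# Assumption: ages 15-19 are uniformly distributed, so 3/5 are 15-17 and 2/5 are 18-19
share_15_17 = census["15 to 19"] * 3 / 5
share_18_19 = census["15 to 19"] * 2 / 5

under_18 = census["Under 5"] + census["5 to 9"] + census["10 to 14"] + share_15_17
ages_18_24 = share_18_19 + census["20 to 24"]

print(round(under_18, 2), round(ages_18_24, 2))  # 23.52 12.48
```

Under this assumption the two categories shift by about three percentage points each relative to the table, which gives a sense of how much the bracket mismatch could matter.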


We obtain Irvine's median income from the census information, and create a DataFrame to concatenate with the age and race/ethnicity data. 

In [47]:
#Add Irvine to the neighborhoods DataFrame 
irvine_lat = 33.6846
irvine_long = -117.8265

data = {'Neighborhood': ['Irvine'], 'Latitude': [irvine_lat], 'Longitude': [irvine_long],'Median Income': [100969]}
temp_df = pd.DataFrame(data)
irvine = pd.concat([temp_df, irvine_age, irvine_eth], axis = 1)
irvine.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Median Income,Under Age 18,Ages 18-24,Ages 25-34,Ages 35-44,Ages 45-54,Ages 55-64,Ages 65 & Older,Total Population,American Indian/Native Population,Asian Population,Black Population,Hispanic Population,Native Hawaiian/Other Pacific Islander Population,Other Race Population,Population of Two or More Races,White Population
0,Irvine,33.6846,-117.8265,100969,26.6,9.4,16.2,14.7,12.6,10.0,10.6,282584,0.4,42.5,1.0,12.5,0.1,0.4,4.2,38.9


We then append this to the existing LA neighborhoods DataFrame so that we can cluster the data.

In [48]:
all_demo = pd.concat([irvine, la_data], sort = False)
all_demo.reset_index(inplace = True, drop = True)
all_demo.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Median Income,Under Age 18,Ages 18-24,Ages 25-34,Ages 35-44,Ages 45-54,Ages 55-64,Ages 65 & Older,Total Population,American Indian/Native Population,Asian Population,Black Population,Hispanic Population,Native Hawaiian/Other Pacific Islander Population,Other Race Population,Population of Two or More Races,White Population
0,Irvine,33.6846,-117.8265,100969,26.6,9.4,16.2,14.7,12.6,10.0,10.6,282584.0,0.4,42.5,1.0,12.5,0.1,0.4,4.2,38.9
1,Acton,34.495516,-118.185799,83983,17.535794,8.340996,9.667674,10.271903,17.588336,20.596348,15.998949,7613.0,0.302115,1.497439,2.364377,18.612899,0.065677,0.735584,3.559701,74.937607
2,Adams-Normandie,34.031411,-118.300288,29606,20.946428,18.297966,16.027855,12.304655,11.443768,11.860503,9.118824,18237.0,0.0,7.473817,20.73806,60.514339,0.060317,0.515436,0.564786,6.267478
3,Agoura Hills,34.150734,-118.760944,117608,22.520192,6.799345,9.77142,13.350578,17.763818,15.382991,14.411656,18943.0,0.05279,6.181703,1.567861,13.086628,0.0,0.47511,2.549755,75.637439
4,Agua Dulce,34.508909,-118.313371,106078,17.72214,8.492882,12.714777,7.756505,15.635739,20.250368,17.42759,4074.0,0.0,1.669121,0.147275,15.80756,6.136475,67.525773,,


### Clustering <a name="clusterLA"></a>   

We now run K-means clustering on this demographic data. This gives us candidates for neighborhoods that are similar in demographic composition to Irvine. 

In [49]:
# import k-means from clustering stage
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

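#Drop the 'Neighborhood' column because it is a categorical variable
df = all_demo.drop(columns = ['Neighborhood'])

#Standardize the data (zero mean, unit variance) using StandardScaler()
#Note: [:,1:] skips only the first remaining column (Latitude), so Longitude is still included as a feature
X = df.values[:,1:]
X = np.nan_to_num(X) #replace missing values (NaN) with 0
cluster_dataset = StandardScaler().fit_transform(X)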

#Run K-Means Clustering 
num_clusters = 15 #create 15 clusters of neighborhoods 

k_means = KMeans(init="k-means++", n_clusters=num_clusters, n_init= 15)
k_means.fit(cluster_dataset)
labels = k_means.labels_

#Add the cluster labels back to our original dataframe! 
all_demo['Cluster Label'] = labels
all_demo.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Median Income,Under Age 18,Ages 18-24,Ages 25-34,Ages 35-44,Ages 45-54,Ages 55-64,...,Total Population,American Indian/Native Population,Asian Population,Black Population,Hispanic Population,Native Hawaiian/Other Pacific Islander Population,Other Race Population,Population of Two or More Races,White Population,Cluster Label
0,Irvine,33.6846,-117.8265,100969,26.6,9.4,16.2,14.7,12.6,10.0,...,282584.0,0.4,42.5,1.0,12.5,0.1,0.4,4.2,38.9,10
1,Acton,34.495516,-118.185799,83983,17.535794,8.340996,9.667674,10.271903,17.588336,20.596348,...,7613.0,0.302115,1.497439,2.364377,18.612899,0.065677,0.735584,3.559701,74.937607,7
2,Adams-Normandie,34.031411,-118.300288,29606,20.946428,18.297966,16.027855,12.304655,11.443768,11.860503,...,18237.0,0.0,7.473817,20.73806,60.514339,0.060317,0.515436,0.564786,6.267478,3
3,Agoura Hills,34.150734,-118.760944,117608,22.520192,6.799345,9.77142,13.350578,17.763818,15.382991,...,18943.0,0.05279,6.181703,1.567861,13.086628,0.0,0.47511,2.549755,75.637439,7
4,Agua Dulce,34.508909,-118.313371,106078,17.72214,8.492882,12.714777,7.756505,15.635739,20.250368,...,4074.0,0.0,1.669121,0.147275,15.80756,6.136475,67.525773,,,14
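
K-means starts from randomly chosen centroids, so the cluster numbers (and occasionally the memberships) can differ between runs unless a seed is fixed. The sketch below illustrates the same standardize-then-cluster pattern on a small synthetic table, with `random_state` set so that the labels are repeatable. The toy data, the column subset, and the helper `cluster_mates` are assumptions made for illustration; they are not part of the analysis above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the demographic table (values are made up)
toy = pd.DataFrame({
    'Neighborhood': ['Ref', 'A', 'B', 'C', 'D', 'E'],
    'Median Income': [100000, 98000, 30000, 32000, 101000, 29000],
    'Under Age 18': [26.0, 25.0, 20.0, 21.0, 27.0, 19.0],
})

def cluster_mates(df, reference, n_clusters, seed=0):
    """Return the names of rows that share a k-means cluster with `reference`."""
    features = df.drop(columns=['Neighborhood']).to_numpy(dtype=float)
    # Replace NaN with 0, then standardize each column
    scaled = StandardScaler().fit_transform(np.nan_to_num(features))
    # A fixed random_state makes the labels repeatable across runs
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(scaled)
    ref_label = labels[(df['Neighborhood'] == reference).to_numpy()][0]
    same = df.loc[labels == ref_label, 'Neighborhood']
    return sorted(name for name in same if name != reference)

mates = cluster_mates(toy, 'Ref', n_clusters=2)
print(mates)
```

Fixing the seed does not make one clustering more correct than another; it only makes the reported cluster numbers stable between runs.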


We can check to see which cluster Irvine is in, and which neighborhoods are also in the same cluster (and are therefore similar).

In [50]:
irvine_label = all_demo.loc[all_demo['Neighborhood']=='Irvine', 'Cluster Label']
print("Irvine is in Cluster: ", irvine_label.iloc[0])


similar = pd.DataFrame(all_demo.loc[all_demo['Cluster Label'] == irvine_label.iloc[0]])
print("There are {} other neighborhoods in cluster {}".format(similar.shape[0]-1, irvine_label.iloc[0]))
similar

Irvine is in Cluster:  10
There are 8 other neighborhoods in cluster 10


Unnamed: 0,Neighborhood,Latitude,Longitude,Median Income,Under Age 18,Ages 18-24,Ages 25-34,Ages 35-44,Ages 45-54,Ages 55-64,...,Total Population,American Indian/Native Population,Asian Population,Black Population,Hispanic Population,Native Hawaiian/Other Pacific Islander Population,Other Race Population,Population of Two or More Races,White Population,Cluster Label
0,Irvine,33.6846,-117.8265,100969,26.6,9.4,16.2,14.7,12.6,10.0,...,282584.0,0.4,42.5,1.0,12.5,0.1,0.4,4.2,38.9,10
83,Glendale,34.181902,-118.246813,57112,17.568822,7.550456,15.92488,13.057713,14.86036,13.910626,...,200372.0,0.176172,15.523127,1.316551,17.454535,0.049408,0.135748,2.76885,58.382908,10
121,Lancaster,34.69275,-118.174337,56069,28.423251,9.291475,14.940394,12.792005,12.81257,11.583175,...,155605.0,0.155522,3.427268,19.977507,36.653064,0.245493,0.139456,2.623309,33.314482,10
132,Long Beach,33.805217,-118.160709,50985,22.84953,10.639079,17.105034,13.694558,13.542326,11.154377,...,470990.0,0.331854,13.088601,12.627232,42.704091,0.881335,0.190662,3.165885,28.206331,10
162,Palmdale,34.598026,-118.086611,63317,30.270144,10.644994,13.419727,12.102884,13.25158,11.323741,...,162358.0,0.354772,4.327474,12.894961,61.820175,0.043115,0.345533,2.214243,21.529583,10
167,Pasadena,34.16166,-118.135127,62825,18.944214,8.162153,17.6842,14.73532,13.147032,11.772471,...,143173.0,0.062163,14.817738,9.747648,33.521684,0.152962,0.363197,2.542379,36.714325,10
173,Pomona,34.058683,-117.762309,54242,25.681998,14.576994,14.573723,13.261747,12.235724,9.815276,...,152823.0,0.29773,9.068661,6.385819,70.033307,0.34288,0.28726,1.58484,11.691957,10
193,Santa Clarita,34.410701,-118.499399,88987,25.955314,9.13696,12.617181,13.638179,15.241403,11.98556,...,190304.0,0.136624,10.835295,3.926349,34.241004,0.077245,0.163948,3.809694,49.284303,10
220,Torrance,33.834646,-118.341684,76866,20.855648,7.380185,12.826521,13.161921,14.714602,14.333434,...,146392.0,0.23157,34.094759,2.457785,17.736625,0.530767,0.334035,4.670337,40.569157,10


In [53]:
# create map
map_clusters = folium.Map(location=[lat, long], zoom_start=11)

# set color scheme for the clusters
x = np.arange(num_clusters)
ys = [i + x + (i*x)**2 for i in range(num_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
# (the loop variables are named nb_lat/nb_lon so that the map-center variable `lat` is not overwritten)
for nb_lat, nb_lon, poi, cluster in zip(all_demo['Latitude'], all_demo['Longitude'], all_demo['Neighborhood'], all_demo['Cluster Label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [nb_lat, nb_lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.9).add_to(map_clusters)
       
map_clusters

Thus, based on the total population, age distribution, median income, and race/ethnicity data, it seems that the neighborhoods most similar to Irvine are: **Glendale, Lancaster, Long Beach, Palmdale, Pasadena, Pomona, Santa Clarita**, and **Torrance**. These are potential candidates for the client's new restaurant location. To further refine our recommendation, we next turn to using Foursquare data. 

## Part 2: Foursquare Data  <a name="foursquare"></a>

In this next section, we use the Foursquare API to gather information about the business landscape of various neighborhoods. 

### Preparing the Data <a name="datafoursquare"></a>   

First, we will get the most common types of venues for each neighborhood. We return to our 'neighborhoods' DataFrame, which contains a list of the LA neighborhoods as well as their geographical coordinates. 

In [54]:
neighborhoods.head()

Unnamed: 0,Neighborhood,Longitude,Latitude
0,Acton,-118.185799,34.495516
1,Adams-Normandie,-118.300288,34.031411
2,Agoura Hills,-118.760944,34.150734
3,Agua Dulce,-118.313371,34.508909
4,Alhambra,-118.135494,34.083967


In [55]:
#@hidden 

#Define Foursquare credentials and version
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' #Foursquare ID (placeholder; supply your own)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' #Foursquare Secret (placeholder; supply your own)
VERSION = '20180605' # Foursquare API version

We would like to explore venues in both Irvine and LA using the Foursquare API. We first add Irvine to our 'neighborhoods' DataFrame.

In [56]:
data = {'Neighborhood': ['Irvine'], 'Longitude': [irvine_long], 'Latitude': [irvine_lat]}
df = pd.DataFrame(data)
all_neighborhoods = pd.concat([df, neighborhoods], sort = False)
all_neighborhoods.reset_index(inplace = True, drop = True)
all_neighborhoods.sort_values(by = 'Neighborhood', inplace = True)
all_neighborhoods.head()

Unnamed: 0,Neighborhood,Longitude,Latitude
1,Acton,-118.185799,34.495516
2,Adams-Normandie,-118.300288,34.031411
3,Agoura Hills,-118.760944,34.150734
4,Agua Dulce,-118.313371,34.508909
5,Alhambra,-118.135494,34.083967


Then, we define the following function to get nearby venues in each neighborhood, make the API call, clean the json, and structure it into a DataFrame.

In [57]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 3000 # define radius = 3000m 

#function to get nearby venues in each neighborhood
def getNearbyVenues(names, latitudes, longitudes, radius = 3000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print("Processing: {}".format(name))
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
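
The request URL in `getNearbyVenues` is assembled with string formatting, and the credentials are assigned as string literals in the notebook. A minimal alternative sketch is shown here: it reads the credentials from environment variables and lets `urllib.parse.urlencode` build the query string. The environment variable names (`FOURSQUARE_CLIENT_ID`, `FOURSQUARE_CLIENT_SECRET`) and the helper name `build_explore_url` are hypothetical; the endpoint and parameter names are the ones used in the function above.

```python
import os
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(lat, lng, radius=3000, limit=100, version='20180605'):
    """Build the explore request URL without hard-coding credentials."""
    params = {
        # Hypothetical variable names; set them in the shell before running
        'client_id': os.environ.get('FOURSQUARE_CLIENT_ID', ''),
        'client_secret': os.environ.get('FOURSQUARE_CLIENT_SECRET', ''),
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # urlencode percent-escapes the values and joins them with '&'
    return EXPLORE_ENDPOINT + '?' + urlencode(params)

url = build_explore_url(33.6846, -117.8265)
print(url.split('?')[0])
```

Keeping the secret outside the notebook means the notebook can be shared without exposing it.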

In [58]:
#Contains all the venues for each neighborhood
venues = getNearbyVenues(names=all_neighborhoods['Neighborhood'],
                                   latitudes=all_neighborhoods['Latitude'],
                                   longitudes=all_neighborhoods['Longitude']
                                  )

venues.head()

Processing: Acton
Processing: Adams-Normandie
Processing: Agoura Hills
Processing: Agua Dulce
Processing: Alhambra
Processing: Alondra Park
Processing: Altadena
Processing: Angeles Crest
Processing: Arcadia
Processing: Arleta
Processing: Arlington Heights
Processing: Artesia
Processing: Athens
Processing: Atwater Village
Processing: Avalon
Processing: Avocado Heights
Processing: Azusa
Processing: Baldwin Hills/Crenshaw
Processing: Baldwin Park
Processing: Bel-Air
Processing: Bell
Processing: Bell Gardens
Processing: Bellflower
Processing: Beverly Crest
Processing: Beverly Grove
Processing: Beverly Hills
Processing: Beverlywood
Processing: Boyle Heights
Processing: Bradbury
Processing: Brentwood
Processing: Broadway-Manchester
Processing: Burbank
Processing: Calabasas
Processing: Canoga Park
Processing: Carson
Processing: Carthay
Processing: Castaic
Processing: Castaic Canyons
Processing: Central-Alameda
Processing: Century City
Processing: Cerritos
Processing: Charter Oak
Processing: C

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Acton,34.495516,-118.185799,La Cabaña,34.479308,-118.166997,Mexican Restaurant
1,Acton,34.495516,-118.185799,SUBWAY,34.493552,-118.188909,Sandwich Place
2,Acton,34.495516,-118.185799,Crazy Otto's Diner,34.490733,-118.162548,Breakfast Spot
3,Acton,34.495516,-118.185799,Jack in the Box,34.492398,-118.199205,Fast Food Restaurant
4,Acton,34.495516,-118.185799,SUBWAY,34.49242,-118.198858,Sandwich Place


To check the size of the resulting dataframe: 

In [59]:
print(venues.shape)
venues.head()

(22381, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Acton,34.495516,-118.185799,La Cabaña,34.479308,-118.166997,Mexican Restaurant
1,Acton,34.495516,-118.185799,SUBWAY,34.493552,-118.188909,Sandwich Place
2,Acton,34.495516,-118.185799,Crazy Otto's Diner,34.490733,-118.162548,Breakfast Spot
3,Acton,34.495516,-118.185799,Jack in the Box,34.492398,-118.199205,Fast Food Restaurant
4,Acton,34.495516,-118.185799,SUBWAY,34.49242,-118.198858,Sandwich Place


To check how many unique venue categories were returned, and which categories are the most common: 

In [60]:
print('There are {} unique categories.'.format(len(venues['Venue Category'].unique())))

# Count the number of venues per Venue Category
venues.groupby('Venue Category').count()['Neighborhood'].sort_values(ascending=False).head(10)

There are 431 unique categories.


Venue Category
Mexican Restaurant      1213
Coffee Shop              983
Fast Food Restaurant     880
Pizza Place              711
Sandwich Place           656
Grocery Store            648
Burger Joint             591
Park                     551
American Restaurant      513
Convenience Store        465
Name: Neighborhood, dtype: int64

### Clustering the Data <a name="clusterfoursquare"></a>   

We next want to analyze each neighborhood and prepare it for K-means clustering. We first perform one-hot encoding on the venue categories. 

In [66]:
# one hot encoding
onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
# IMPORTANT: there is a venue category called "Neighborhood", so we name our column "Neighborhood Name" to avoid a clash
onehot['Neighborhood Name'] = venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]

print(onehot.shape) 
onehot.head()

(22381, 432)


Unnamed: 0,Neighborhood Name,ATM,Accessories Store,Adult Boutique,African Restaurant,Airport,Airport Lounge,Airport Service,American Restaurant,Amphitheater,...,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Acton,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Acton,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Acton,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Acton,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Acton,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, we group rows by neighborhood and take the mean of the frequency of occurrence of each category.

In [67]:
grouped = onehot.groupby('Neighborhood Name').mean().reset_index()
grouped.head()

Unnamed: 0,Neighborhood Name,ATM,Accessories Store,Adult Boutique,African Restaurant,Airport,Airport Lounge,Airport Service,American Restaurant,Amphitheater,...,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Acton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Adams-Normandie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.01,...,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Agoura Hills,0.012195,0.0,0.0,0.0,0.0,0.0,0.0,0.012195,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Agua Dulce,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0
4,Alhambra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,...,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0
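
To make the two steps above concrete, the sketch below applies one-hot encoding and a group-wise mean to a tiny, made-up venue table. Each resulting value is the share of a neighborhood's venues that fall in a given category, so every row sums to 1. The toy data is an assumption for illustration only.

```python
import pandas as pd

# Made-up venues: neighborhood 'X' has 4 venues, 'Y' has 2
toy_venues = pd.DataFrame({
    'Neighborhood': ['X', 'X', 'X', 'X', 'Y', 'Y'],
    'Venue Category': ['Park', 'Coffee Shop', 'Coffee Shop', 'Pizza Place', 'Park', 'Park'],
})

# One indicator column per category (cast to int so the mean is a plain fraction)
toy_onehot = pd.get_dummies(toy_venues['Venue Category']).astype(int)
toy_onehot.insert(0, 'Neighborhood Name', toy_venues['Neighborhood'])

# The mean of each indicator column per neighborhood is that category's relative frequency
toy_grouped = toy_onehot.groupby('Neighborhood Name').mean()
print(toy_grouped)
```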


Next, we put the information in a DataFrame, using the following function (from the lab) that sorts the venues in descending order. 

In [68]:
#function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories = row_categories[row_categories != 0]   #in case neighborhoods don't have that many top venues 
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [69]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood Name']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: #positions beyond the 3rd take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood Name'] = grouped['Neighborhood Name']

for ind in np.arange(grouped.shape[0]):
    top_venues = return_most_common_venues(grouped.iloc[ind, :], num_top_venues)
    #If a neighborhood has fewer than 10 venue categories, the remaining columns are left as NaN
    neighborhoods_venues_sorted.iloc[ind, 1:len(top_venues)+1] = top_venues

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood Name,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Acton,Construction & Landscaping,Park,Café,Mexican Restaurant,Sandwich Place,Fast Food Restaurant,Sushi Restaurant,Arts & Entertainment,Breakfast Spot,Burger Joint
1,Adams-Normandie,Mexican Restaurant,Korean Restaurant,Food Truck,Pizza Place,Café,Science Museum,Taco Place,Coffee Shop,Burger Joint,Sandwich Place
2,Agoura Hills,Deli / Bodega,Trail,Italian Restaurant,Fast Food Restaurant,Mexican Restaurant,Park,Chinese Restaurant,Hotel,Pharmacy,Pizza Place
3,Agua Dulce,Winery,Trail,Pizza Place,Park,Mexican Restaurant,Home Service,Grocery Store,Gift Shop,Convenience Store,Construction & Landscaping
4,Alhambra,Chinese Restaurant,Park,Mexican Restaurant,Café,Vietnamese Restaurant,Convenience Store,Burger Joint,Grocery Store,Fast Food Restaurant,Szechuan Restaurant
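
The column labels above rely on a three-element suffix list with a fallback to 'th' for every other position. That is fine for ten columns, but it would label position 21 as '21th'. As a small sketch, the helper below (a hypothetical function named `ordinal`, not part of the original notebook) computes the suffix for any positive integer.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    # Numbers ending in 11, 12 or 13 take 'th' even though they end in 1, 2 or 3
    if 10 <= n % 100 <= 13:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Same column labels as in the cell above, built with the helper
columns = ['Neighborhood Name'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)
]
print(columns[1], '...', columns[-1])
```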


In [72]:
neighborhoods_venues_sorted[neighborhoods_venues_sorted['Neighborhood Name'] == 'Irvine']

Unnamed: 0,Neighborhood Name,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
110,Irvine,Sandwich Place,Café,Japanese Restaurant,Fast Food Restaurant,Grocery Store,Coffee Shop,Dessert Shop,Shopping Mall,Ice Cream Shop,Mexican Restaurant


We will again use k-means clustering to cluster the neighborhoods. The goal of this part is to find neighborhoods with a business/competitor landscape similar to that of the original Irvine location. 

In [73]:
# set number of clusters
num_clusters = 15

grouped_clustering = grouped.drop(columns = 'Neighborhood Name')

# run k-means clustering
kmeans = KMeans(init = "k-means++", n_clusters=num_clusters, n_init = 15).fit(grouped_clustering)

labels = kmeans.labels_
neighborhoods_venues_sorted["Cluster Label"] = labels 

venues_merged = all_neighborhoods.merge(neighborhoods_venues_sorted, how ='inner', left_on = 'Neighborhood', right_on = "Neighborhood Name")
venues_merged.drop(columns = 'Neighborhood Name', inplace = True)

venues_merged.head()

Unnamed: 0,Neighborhood,Longitude,Latitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Label
0,Acton,-118.185799,34.495516,Construction & Landscaping,Park,Café,Mexican Restaurant,Sandwich Place,Fast Food Restaurant,Sushi Restaurant,Arts & Entertainment,Breakfast Spot,Burger Joint,0
1,Adams-Normandie,-118.300288,34.031411,Mexican Restaurant,Korean Restaurant,Food Truck,Pizza Place,Café,Science Museum,Taco Place,Coffee Shop,Burger Joint,Sandwich Place,8
2,Agoura Hills,-118.760944,34.150734,Deli / Bodega,Trail,Italian Restaurant,Fast Food Restaurant,Mexican Restaurant,Park,Chinese Restaurant,Hotel,Pharmacy,Pizza Place,3
3,Agua Dulce,-118.313371,34.508909,Winery,Trail,Pizza Place,Park,Mexican Restaurant,Home Service,Grocery Store,Gift Shop,Convenience Store,Construction & Landscaping,1
4,Alhambra,-118.135494,34.083967,Chinese Restaurant,Park,Mexican Restaurant,Café,Vietnamese Restaurant,Convenience Store,Burger Joint,Grocery Store,Fast Food Restaurant,Szechuan Restaurant,3


We can now examine which cluster Irvine is in, and which neighborhoods are also in that cluster. 

In [74]:
irvine_label = venues_merged.loc[venues_merged['Neighborhood']=='Irvine', 'Cluster Label']
print("Irvine is in cluster: ", irvine_label.iloc[0])

similar = pd.DataFrame(venues_merged.loc[venues_merged['Cluster Label'] == irvine_label.iloc[0]])
print("There are {} other neighborhoods in cluster {}".format(similar.shape[0]-1, irvine_label.iloc[0]))
similar

Irvine is in cluster:  8
There are 84 other neighborhoods in cluster 8


Unnamed: 0,Neighborhood,Longitude,Latitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Label
1,Adams-Normandie,-118.300288,34.031411,Mexican Restaurant,Korean Restaurant,Food Truck,Pizza Place,Café,Science Museum,Taco Place,Coffee Shop,Burger Joint,Sandwich Place,8
9,Arlington Heights,-118.321401,34.043954,Korean Restaurant,Mexican Restaurant,Grocery Store,Ice Cream Shop,Café,Japanese Restaurant,Sandwich Place,Music Venue,Art Gallery,BBQ Joint,8
10,Artesia,-118.080648,33.867590,Indian Restaurant,Bakery,Café,Coffee Shop,Grocery Store,Sandwich Place,Park,BBQ Joint,Burger Joint,Dessert Shop,8
12,Atwater Village,-118.265063,34.126138,Bakery,Coffee Shop,Café,Mediterranean Restaurant,Ice Cream Shop,Middle Eastern Restaurant,Mexican Restaurant,Brewery,Burger Joint,Taco Place,8
13,Avalon,-118.331202,33.333827,Hotel,Seafood Restaurant,Pizza Place,Bar,Trail,Beach,Harbor / Marina,History Museum,Ice Cream Shop,Mexican Restaurant,8
23,Beverly Grove,-118.371392,34.075648,Coffee Shop,Café,American Restaurant,Mexican Restaurant,Salad Place,Clothing Store,French Restaurant,Grocery Store,Hotel,Italian Restaurant,8
24,Beverly Hills,-118.402109,34.078539,Hotel,American Restaurant,Boutique,Clothing Store,Italian Restaurant,Coffee Shop,Park,Ice Cream Shop,Café,Men's Store,8
25,Beverlywood,-118.393787,34.044646,Coffee Shop,American Restaurant,Sushi Restaurant,Grocery Store,Park,Liquor Store,Hotel,New American Restaurant,Café,Art Gallery,8
30,Burbank,-118.323533,34.187860,Coffee Shop,Mexican Restaurant,American Restaurant,Burger Joint,Park,Sandwich Place,Donut Shop,Hotel,Grocery Store,Pizza Place,8
34,Carthay,-118.369612,34.058058,Café,Coffee Shop,Italian Restaurant,Burger Joint,French Restaurant,Mexican Restaurant,Grocery Store,Hotel,American Restaurant,Sushi Restaurant,8


Although this gives us valuable information about the most common venue types in each neighborhood, clustering on the Foursquare data alone is not particularly informative: Irvine's cluster contains 84 other neighborhoods, far too many to serve as a shortlist. Therefore, we now combine the Foursquare data with the demographic data from earlier, and perform k-means clustering on the combined dataset.  

In [75]:
#all_demo is the table containing demographic data for both LA and Irvine
all_demo.sort_values(by = ['Neighborhood'], inplace = True)

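#Drop the categorical column and the earlier cluster labels
df = all_demo.drop(columns = ['Neighborhood', 'Cluster Label'])

# Standardize the data (zero mean, unit variance) using StandardScaler()
# Note: [:,1:] skips only the first remaining column (Latitude), so Longitude is still included as a feature
X = df.values[:,1:]
X = np.nan_to_num(X) #replace missing values (NaN) with 0
cluster_dataset = StandardScaler().fit_transform(X)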
dem_cluster = pd.DataFrame(cluster_dataset)
dem_cluster['Neighborhood Name'] = all_demo['Neighborhood'].to_list()

In [76]:
#Create a master table (all_clustering) containing all of the demographic and the Foursquare data 
all_clustering = grouped.merge(dem_cluster, how = 'inner', on = 'Neighborhood Name')
clustered_neighborhoods = all_clustering['Neighborhood Name']
all_clustering.drop('Neighborhood Name', axis = 1, inplace = True)

#Run K-Means Clustering 
num_clusters = 15

k_means = KMeans(init="k-means++", n_clusters=num_clusters, n_init= 15)
#Pass the raw array: the merged table mixes string and integer column names, which newer scikit-learn versions reject
k_means.fit(all_clustering.values)
labels = k_means.labels_

#Create a table of all neighborhoods with cluster labels
final = all_neighborhoods[all_neighborhoods['Neighborhood'].isin(clustered_neighborhoods)].copy() #copy so that the changes below do not act on a slice of all_neighborhoods
final.sort_values(by = ['Neighborhood'], inplace = True)
final['Cluster Label'] = labels
final.head()

Unnamed: 0,Neighborhood,Longitude,Latitude,Cluster Label
1,Acton,-118.185799,34.495516,10
2,Adams-Normandie,-118.300288,34.031411,3
3,Agoura Hills,-118.760944,34.150734,10
4,Agua Dulce,-118.313371,34.508909,9
5,Alhambra,-118.135494,34.083967,1
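
When two feature blocks are merged before clustering, their relative scales determine how much each block influences the Euclidean distances that k-means minimizes. The sketch below uses synthetic data (all sizes and distributions are assumptions) to compare the total variance carried by a standardized block with that of a block of category frequencies whose rows sum to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 200

# Block 1: 5 synthetic 'demographic' features, standardized to zero mean and unit variance
demo = rng.normal(size=(n_rows, 5))
demo_scaled = (demo - demo.mean(axis=0)) / demo.std(axis=0)

# Block 2: synthetic frequencies over 50 categories; each row sums to 1
freq = rng.dirichlet(np.ones(50), size=n_rows)

# Total variance of each block = sum of its per-column variances
demo_var = demo_scaled.var(axis=0).sum()
freq_var = freq.var(axis=0).sum()
share = demo_var / (demo_var + freq_var)

print('Share of total variance in the standardized block: {:.3f}'.format(share))
```

If the two blocks should contribute more evenly, one option is to rescale or weight them before merging; whether that is appropriate depends on the goals of the analysis.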


To better visualize the resulting neighborhood clusters, we create a folium map. Neighborhoods in the same cluster are marked with the same color. 

In [79]:
# create map
map_clusters = folium.Map(location=[lat, long], zoom_start=11)

# set color scheme for the clusters
x = np.arange(num_clusters)
ys = [i + x + (i*x)**2 for i in range(num_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
# (the loop variables are named nb_lat/nb_lon so that the map-center variable `lat` is not overwritten)
for nb_lat, nb_lon, poi, cluster in zip(final['Latitude'], final['Longitude'], final['Neighborhood'], final['Cluster Label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [nb_lat, nb_lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.9).add_to(map_clusters)
       
map_clusters

Once again, we identify the other neighborhoods in the same cluster as Irvine. 

In [78]:
irvine_label = final.loc[final['Neighborhood']=='Irvine', 'Cluster Label']
print("Irvine is in cluster:", irvine_label.iloc[0])

final_similar = pd.DataFrame(final.loc[final['Cluster Label'] == irvine_label.iloc[0]])
print("There are {} other neighborhoods in cluster {}".format(final_similar.shape[0]-1, irvine_label.iloc[0]))
final_similar

Irvine is in cluster: 8
There are 8 other neighborhoods in cluster 8


Unnamed: 0,Neighborhood,Longitude,Latitude,Cluster Label
84,Glendale,-118.246813,34.181902,8
0,Irvine,-117.8265,33.6846,8
125,Lancaster,-118.174337,34.69275,8
136,Long Beach,-118.160709,33.805217,8
166,Palmdale,-118.086611,34.598026,8
171,Pasadena,-118.135127,34.16166,8
177,Pomona,-117.762309,34.058683,8
197,Santa Clarita,-118.499399,34.410701,8
226,Torrance,-118.341684,33.834646,8


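As it happens, the neighborhoods in Irvine's cluster are the same as when we clustered on the demographic data alone. This is likely because the standardized demographic features vary on a much larger scale than the venue frequencies, which lie between 0 and 1 and are mostly close to zero, so the demographic features dominate the distances used by k-means. We now examine the number of Chinese restaurants in these neighborhoods, since the client would prefer to open in a neighborhood with fewer Chinese restaurants. 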

In [94]:
#Count the total number of venues returned for each neighborhood
total_venues = onehot.groupby('Neighborhood Name').size().to_frame('Total Number of Venues')
total_venues.head()

Neighborhood Name,Total Number of Venues
Acton,24
Adams-Normandie,100
Agoura Hills,82
Agua Dulce,14
Alhambra,100


In [99]:
potential_neighborhoods = final_similar.Neighborhood.to_list()
potential_neighborhoods.remove("Irvine")

chinese_restaurants = onehot.groupby(['Neighborhood Name']).sum().reset_index()
chinese_restaurants = chinese_restaurants[['Neighborhood Name','Chinese Restaurant']]
chinese_restaurants.rename(columns={'Chinese Restaurant':'Num_Chinese Restaurants'}, inplace=True)
chinese_restaurants = chinese_restaurants.merge(total_venues, how = "inner", left_on = "Neighborhood Name", right_on = "Neighborhood Name")

#Keep only the candidate neighborhoods; .copy() gives an independent DataFrame to which we can add a column
num_chinese = chinese_restaurants[chinese_restaurants["Neighborhood Name"].isin(potential_neighborhoods)].copy()
num_chinese['Relative Frequency'] = num_chinese['Num_Chinese Restaurants']/num_chinese['Total Number of Venues']
num_chinese.sort_values(by = ['Relative Frequency'])


Unnamed: 0,Neighborhood Name,Num_Chinese Restaurants,Total Number of Venues,Relative Frequency
135,Long Beach,0,100,0.0
170,Pasadena,1,100,0.01
223,Torrance,1,100,0.01
82,Glendale,1,56,0.017857
126,Lancaster,2,100,0.02
195,Santa Clarita,2,70,0.028571
176,Pomona,4,99,0.040404
165,Palmdale,2,35,0.057143
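
The relative-frequency ranking can be reproduced directly from the counts shown in the table above. The sketch below rebuilds that table by hand (the numbers are copied from the output; the variable names are illustrative), works on an explicit copy so that pandas does not emit a `SettingWithCopyWarning`, and breaks ties alphabetically.

```python
import pandas as pd

# Counts copied from the output table above
counts = pd.DataFrame({
    'Neighborhood Name': ['Long Beach', 'Pasadena', 'Torrance', 'Glendale',
                          'Lancaster', 'Santa Clarita', 'Pomona', 'Palmdale'],
    'Num_Chinese Restaurants': [0, 1, 1, 1, 2, 2, 4, 2],
    'Total Number of Venues': [100, 100, 100, 56, 100, 70, 99, 35],
})

# Filter, then copy, so the new column is added to an independent DataFrame
candidates = counts[counts['Total Number of Venues'] > 0].copy()
candidates['Relative Frequency'] = (
    candidates['Num_Chinese Restaurants'] / candidates['Total Number of Venues']
)

# Sort by frequency, using the name as a deterministic tie-breaker
ranked = candidates.sort_values(by=['Relative Frequency', 'Neighborhood Name'])
print(ranked['Neighborhood Name'].tolist())
```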


Based on these results, the recommended neighborhoods, ranked from lowest to highest relative frequency of Chinese restaurants, are: **Long Beach**, **Pasadena** and **Torrance** (tied), **Glendale**, **Lancaster**, **Santa Clarita**, **Pomona**, and **Palmdale**. 

## 5. Results <a name="Results"></a>

The purpose of this study was to identify potential neighborhoods for our client to open the second location of their fast-casual Chinese restaurant. We aimed to find neighborhoods that would be similar to the client's original restaurant location in Irvine, California. 

To accomplish this, we researched demographic data, including median income, age distribution, and race/ethnicity, and the existing business landscape of Los Angeles neighborhoods via the Foursquare API. After performing K-means clustering, we determined that the neighborhoods most similar to Irvine are, in order of increasing relative frequency of Chinese restaurants: **Long Beach**, **Pasadena**, **Torrance**, **Glendale**, **Lancaster**, **Santa Clarita**, **Pomona**, and **Palmdale**. These represent potential locations for the client's newest venture. Further details are provided in the business report. 
