## Opening a French Restaurant in Chicago, IL, USA

This Jupyter Notebook, run in Python 3.7, collects, prepares and analyzes data about all 77 communities in Chicago to help a French restaurant owner decide which communities to prioritize when searching for a location for a first restaurant in the city.

First, let's start by importing all libraries we'll need for this analysis:

In [124]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
import json # library to handle JSON files

!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
!pip install folium
import folium # map rendering library
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

print('Libraries imported.')

Libraries imported.


## 1. Data requirements and collection

In this data science project, we will run a K-means clustering machine learning model that takes into account the following 5 features of Chicago communities:
1. Population
2. Population Density
3. Number of restaurants
4. Ratio people/restaurant
5. Frequency of each restaurant category

Population and Population Density are features that we will collect from a Wikipedia page. Restaurant information will be collected using the Foursquare Places API. Finally, the people/restaurant ratio will be calculated from the data collected from both Wikipedia and Foursquare.

### 1a. Collecting list of communities, their population and population density.

In [125]:
#Gets and reads table from the Wikipedia webpage
url='https://en.wikipedia.org/wiki/Community_areas_in_Chicago'
source = requests.get(url)
source_data = pd.read_html(source.text)[0]
source_data

Unnamed: 0,Number[8],Name[8],2017[9],Area (sq mi.)[10],Area (km2),2017density (/sq mi.),2017density (/km2)
0,01,Rogers Park,55062,1.84,4.77,29925.0,11554.11
1,02,West Ridge,76215,3.53,9.14,21590.65,8336.2
2,03,Uptown,57973,2.32,6.01,24988.36,9648.06
3,04,Lincoln Square,41715,2.56,6.63,16294.92,6291.5
4,05,North Center,35789,2.05,5.31,17458.05,6740.59
5,06,Lake View,100470,3.12,8.08,32201.92,12433.23
6,07,Lincoln Park,67710,3.16,8.18,21427.22,8273.1
7,08,Near North Side,88893,2.74,7.1,32442.7,12526.2
8,09,Edison Park,11605,1.13,2.93,4235.4,1635.3
9,10,Norwood Park,37089,4.37,11.32,8487.19,3276.92


In [126]:
# checking if there are any duplicate communities.
#There are 78 rows in the table, so if the code below yields value 78, there are no duplicates.
source_data['Name[8]'].nunique()

78

Looking at the table scraped above, we want to make a few corrections:
1. Drop all columns except Name[8] (the community name), 2017[9] (the 2017 population) and 2017density (/km2) (the 2017 population density per km2).
2. Correct the name of the Loop to remove the parentheses and the [11] footnote mark.
3. Delete the last row, which holds the totals.
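
As a rough illustration of these three corrections, here is a minimal sketch on a toy table. The numeric values are made up for the example and are not the scraped data; only the column names and the Loop label mirror the table above.

```python
import pandas as pd

# Toy stand-in for the scraped table (numbers are illustrative only)
raw = pd.DataFrame({
    "Name[8]": ["Rogers Park", "(The) Loop[11]", "Total"],
    "2017[9]": [100, 200, 300],
    "Area (km2)": [1.0, 2.0, 3.0],
    "2017density (/km2)": [100.0, 100.0, 100.0],
})

keep = ["Name[8]", "2017[9]", "2017density (/km2)"]
clean = (
    raw[keep]                                          # 1. keep only the columns of interest
    .replace({"Name[8]": {"(The) Loop[11]": "Loop"}})  # 2. fix the Loop's name in that column only
    .iloc[:-1]                                         # 3. drop the last row (totals)
    .reset_index(drop=True)
)
print(clean)
```

Restricting the replacement to the name column avoids touching any other column by accident.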

In [127]:
source_data=source_data.drop(['Number[8]','Area (sq mi.)[10]','Area (km2)','2017density (/sq mi.)'], axis=1)
source_data=source_data.drop([77])
source_data['Name[8]']=source_data['Name[8]'].replace(["(The) Loop[11]"],"Loop")
source_data

Unnamed: 0,Name[8],2017[9],2017density (/km2)
0,Rogers Park,55062,11554.11
1,West Ridge,76215,8336.2
2,Uptown,57973,9648.06
3,Lincoln Square,41715,6291.5
4,North Center,35789,6740.59
5,Lake View,100470,12433.23
6,Lincoln Park,67710,8273.1
7,Near North Side,88893,12526.2
8,Edison Park,11605,1635.3
9,Norwood Park,37089,3276.92


Now let's rename the columns to names that make sense:

In [128]:
source_data=source_data.rename(columns={"Name[8]": "Community", "2017[9]": "Population", "2017density (/km2)": "Density (/km2)"})
source_data.head()

Unnamed: 0,Community,Population,Density (/km2)
0,Rogers Park,55062,11554.11
1,West Ridge,76215,8336.2
2,Uptown,57973,9648.06
3,Lincoln Square,41715,6291.5
4,North Center,35789,6740.59


In [129]:
# checking data types
source_data.dtypes

Community          object
Population          int64
Density (/km2)    float64
dtype: object

### 1b. Collecting restaurant information across communities

We want to use the Foursquare API to get the names and categories of all restaurants in all Chicago communities. To do so, we first need to obtain latitude and longitude coordinates for all 77 communities in Chicago.

In [130]:
# create the geocoder once and reuse it for every community
geolocator = Nominatim(user_agent="coord_explorer")

# collect one row of coordinates per community, then build the dataframe once
rows = []
for comm in source_data['Community']:
    address = comm + ", Chicago, IL"
    location = geolocator.geocode(address)
    rows.append({'Community': comm,
                 'Latitude': location.latitude,
                 'Longitude': location.longitude})

community_data = pd.DataFrame(rows, columns=['Community', 'Latitude', 'Longitude'])

community_data

Unnamed: 0,Community,Latitude,Longitude
0,Rogers Park,42.010531,-87.670748
1,West Ridge,42.003548,-87.696243
2,Uptown,41.96663,-87.655546
3,Lincoln Square,41.97599,-87.689616
4,North Center,41.956107,-87.67916
5,Lake View,41.94705,-87.655423
6,Lincoln Park,41.939945,-87.63612
7,Near North Side,41.900033,-87.634497
8,Edison Park,42.006113,-87.813992
9,Norwood Park,41.98559,-87.800582
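
geopy's `geocode` returns `None` when it finds no match, in which case reading `.latitude` would raise an error. The sketch below shows one way to collect coordinates while recording unmatched names. The helper `geocode_communities` and the stand-in `fake_geocode` are hypothetical names introduced for illustration; the fake geocoder replaces the network call so the example runs offline.

```python
def geocode_communities(names, geocode):
    """Return (rows, missing); `geocode` maps an address to (lat, lng) or None."""
    rows, missing = [], []
    for name in names:
        coords = geocode("{}, Chicago, IL".format(name))
        if coords is None:
            missing.append(name)  # remember names that could not be located
            continue
        rows.append({"Community": name,
                     "Latitude": coords[0],
                     "Longitude": coords[1]})
    return rows, missing


# Hypothetical offline geocoder used only for this example
def fake_geocode(address):
    known = {"Rogers Park, Chicago, IL": (42.010531, -87.670748)}
    return known.get(address)


rows, missing = geocode_communities(["Rogers Park", "Nowhere"], fake_geocode)
print(rows)
print(missing)
```

With the real geocoder, `geocode` could be a small wrapper around `geolocator.geocode` that returns `(location.latitude, location.longitude)` or `None`.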


In [131]:
# merge with the remaining information of source_data:
community_data = community_data.merge(source_data, on='Community', how = 'left')
community_data

Unnamed: 0,Community,Latitude,Longitude,Population,Density (/km2)
0,Rogers Park,42.010531,-87.670748,55062,11554.11
1,West Ridge,42.003548,-87.696243,76215,8336.2
2,Uptown,41.96663,-87.655546,57973,9648.06
3,Lincoln Square,41.97599,-87.689616,41715,6291.5
4,North Center,41.956107,-87.67916,35789,6740.59
5,Lake View,41.94705,-87.655423,100470,12433.23
6,Lincoln Park,41.939945,-87.63612,67710,8273.1
7,Near North Side,41.900033,-87.634497,88893,12526.2
8,Edison Park,42.006113,-87.813992,11605,1635.3
9,Norwood Park,41.98559,-87.800582,37089,3276.92


Let's visualize all communities on a map of Chicago:

In [132]:
# Get coordinates (longitude and latitude) of Chicago

address = 'Chicago, IL'

geolocator = Nominatim(user_agent="geo_explorer")
location = geolocator.geocode(address)
chicago_latitude = location.latitude
chicago_longitude = location.longitude
print('The geographical coordinates of Chicago, IL are lat:{}, lng:{}.'.format(chicago_latitude, chicago_longitude))

The geographical coordinates of Chicago, IL are lat:41.8755616, lng:-87.6244212.


In [133]:
# create map of Chicago
chicago_map = folium.Map(location=[chicago_latitude, chicago_longitude], zoom_start=10)

# add neighborhood markers to map
for lat, lng, label in zip(community_data['Latitude'], community_data['Longitude'], community_data['Community']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(chicago_map)  
    
chicago_map

Now let's get restaurant data for each community from Foursquare. We will start with a test using the first community listed in our community_data dataset:

In [134]:
CLIENT_ID = 'S533B0JYPPDULC0CWZPUGPXOI3PBMB1MQZVRLW4XQNJPXEVM'
CLIENT_SECRET = 'MUGAY5GYI2CD45CNFRYUTSNZHBN2UO2J20DEZGQNACI5JDP3'
ACCESS_TOKEN = 'MMFSJNNH0Z05PDMR5CYAC40OSTDHIQOEKU5U2SN52ZFJBKFR'
VERSION = '20180604'
LIMIT = 100

In [135]:
# Set parameters to only get results for the first community in our community_data DataFrame
latitude = community_data['Latitude'][0]
longitude = community_data['Longitude'][0]
search_query = 'Restaurant'
radius = 500

url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&oauth_token={}&v={}&ll={},{}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, ACCESS_TOKEN, VERSION, latitude, longitude, search_query, radius, LIMIT)
results = requests.get(url).json()['response']['venues']
results

[{'id': '4b75fc9bf964a52070342ee3',
  'name': 'Hong Kong Restaurant',
  'location': {'address': '6958 N Clark St',
   'lat': 42.00814437866211,
   'lng': -87.67413330078125,
   'labeledLatLngs': [{'label': 'entrance',
     'lat': 42.008298,
     'lng': -87.673924},
    {'label': 'display', 'lat': 42.00814437866211, 'lng': -87.67413330078125}],
   'distance': 386,
   'postalCode': '60626',
   'cc': 'US',
   'city': 'Chicago',
   'state': 'IL',
   'country': 'United States',
   'formattedAddress': ['6958 N Clark St', 'Chicago, IL 60626']},
  'categories': [{'id': '4bf58dd8d48988d145941735',
    'name': 'Chinese Restaurant',
    'pluralName': 'Chinese Restaurants',
    'shortName': 'Chinese',
    'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/asian_',
     'suffix': '.png'},
    'primary': True}],
  'referralId': 'v-1609210223',
  'hasPerk': False},
 {'id': '4df0fa23d4c04d0392c87e45',
  'name': 'Taqueria & Restaurant Cd. Hidalgo',
  'location': {'address': '7104 N Clark S

In [136]:
# Get the 3 data fields we want: venue id, venue name, and venue category

print("Venue ID: ", results[0]['id'])
print("Venue Name: ", results[0]['name'])
print("Category: ", results[0]['categories'][0]['name'])

Venue ID:  4b75fc9bf964a52070342ee3
Venue Name:  Hong Kong Restaurant
Category:  Chinese Restaurant
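
The test request above assembles its URL with one long format string, which is easy to get out of order. As a sketch, the same query string can be built from a dictionary with the standard library. The function name `build_search_url` and the placeholder credentials are assumptions made for illustration; the endpoint and parameter names are the ones used in the request above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://api.foursquare.com/v2/venues/search"


def build_search_url(lat, lng, query, radius, limit,
                     client_id, client_secret, oauth_token, version):
    """Build the search URL from named parameters instead of positional format slots."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "oauth_token": oauth_token,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "query": query,
        "radius": radius,
        "limit": limit,
    }
    return BASE_URL + "?" + urlencode(params)


# Placeholder credentials, not real ones
url = build_search_url(42.010531, -87.670748, "Restaurant", 500, 100,
                       "YOUR_ID", "YOUR_SECRET", "YOUR_TOKEN", "20180604")
print(parse_qs(urlsplit(url).query)["ll"])
```

`requests.get` also accepts a `params=` dictionary directly, which achieves the same thing without building the URL by hand.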


Now, based on the successful test above, let's write a loop that gets restaurant information from Foursquare for all 77 communities in Chicago:

In [137]:
search_query = 'Restaurant'
radius = 1000

restaurants_list=[]
for comm, lat, lng in zip(community_data['Community'], community_data['Latitude'], community_data['Longitude']):
                    
    # create the API request URL
    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&oauth_token={}&v={}&ll={},{}&query={}&radius={}&limit={}'.format(
    CLIENT_ID, CLIENT_SECRET, ACCESS_TOKEN, VERSION, lat, lng, search_query, radius, LIMIT)
        
    # try to make the GET request
    try:
        results = requests.get(url).json()['response']['venues']
        # build this community's rows first, so a failure adds nothing for it
        rows = [(comm, v['id'], v['name'], v['categories'][0]['name']) for v in results]
        restaurants_list.extend(rows)
    except Exception:
        print("Failed to retrieve data for: ", comm) # alert in case Foursquare doesn't return usable information for a community

# build the dataframe once, after all communities have been processed
restaurants_data = pd.DataFrame(restaurants_list, columns=['Community','Venue ID','Venue Name','Category'])

restaurants_data

Failed to retrieve data for:  West Ridge
Failed to retrieve data for:  Lincoln Square
Failed to retrieve data for:  Lake View
Failed to retrieve data for:  Near North Side
Failed to retrieve data for:  Albany Park
Failed to retrieve data for:  Logan Square
Failed to retrieve data for:  Archer Heights
Failed to retrieve data for:  New City
Failed to retrieve data for:  Chicago Lawn
Failed to retrieve data for:  Edgewater


Unnamed: 0,Community,Venue ID,Venue Name,Category
0,Rogers Park,4b75fc9bf964a52070342ee3,Hong Kong Restaurant,Chinese Restaurant
1,Rogers Park,4df0fa23d4c04d0392c87e45,Taqueria & Restaurant Cd. Hidalgo,Mexican Restaurant
2,Rogers Park,5a03af34b5461848270c64b6,South Of The Border Restaurant,South American Restaurant
3,Rogers Park,4f44ff7c19836ed00198048d,Campeche Restaurant,Food
4,Rogers Park,4e4e317abd4101d0d7a46b18,Fabi's Restaurant,Mexican Restaurant
5,Rogers Park,4f32b71219836c91c7f29557,Angelos Restaurant,Food
6,Rogers Park,50900127e4b061e9d64fb0f9,Redz Belizean Restaurant,Caribbean Restaurant
7,Rogers Park,4b721efbf964a520b36f2de3,Great Wall Chinese Restaurant,Chinese Restaurant
8,Rogers Park,4f43b68719834bc91f586d42,Guadalupana Restaurant,Food
9,Rogers Park,4e4e5cbdbd4101d0d7a881b4,RoPa Restaurant & Wine Bar,Mediterranean Restaurant
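
The output above does not show why some communities failed. One possibility (an assumption, not something the notebook confirms) is that a venue with an empty `categories` list makes `v['categories'][0]` raise an `IndexError`, which the `except` then reports as a failure for the whole community. The sketch below, using a hypothetical helper `venue_rows`, keeps such venues and records their category as `None`.

```python
def venue_rows(community, venues):
    """Return (community, id, name, category) tuples; category is None if absent."""
    rows = []
    for v in venues:
        categories = v.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((community, v["id"], v["name"], category))
    return rows


sample = [
    {"id": "4b75fc9bf964a52070342ee3", "name": "Hong Kong Restaurant",
     "categories": [{"name": "Chinese Restaurant"}]},
    # Hypothetical venue with no category, for illustration only
    {"id": "hypothetical-id", "name": "Uncategorized Venue", "categories": []},
]

rows = venue_rows("Rogers Park", sample)
for row in rows:
    print(row)
```

Rows with a `None` category could then be dropped or inspected separately before counting restaurants.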


In [138]:
#Checking the number of French Restaurants listed:
restaurants_data[restaurants_data['Category']=="French Restaurant"]

Unnamed: 0,Community,Venue ID,Venue Name,Category
90,Lincoln Park,5fd17b6c8fe7ac1127f63d97,Ann Sather Restaurant & Catering,French Restaurant


Interestingly, we found only one French restaurant listed across all Chicago communities in the Foursquare results.

We also observe that some of the categories retrieved are unrelated to restaurants or eating out activities (for instance "Building" or "Office"). Let's remove from the dataset all datapoints that do not fall under dining experiences.

In [139]:
# retrieve a list of categories that do not contain the word Restaurant
non_rest = restaurants_data[restaurants_data['Category'].str.contains("Restaurant") == False]['Category'].unique()
non_rest

array(['Food', 'Wings Joint', 'Diner', 'Bakery', 'Café', 'Noodle House',
       'Breakfast Spot', 'Pizza Place', 'Pub', 'Bagel Shop', 'Steakhouse',
       'Park', 'Burger Joint', 'Miscellaneous Shop', 'BBQ Joint', 'Bar',
       'Food Court', 'Sandwich Place', 'Taco Place', 'Arcade',
       'Deli / Bodega', 'Lounge', 'Coffee Shop', 'Concert Hall',
       'Building', 'Grocery Store',
       'Residential Building (Apartment / Condo)', 'Office',
       'Hot Dog Joint', 'Conference Room', 'Wine Bar', 'Snack Place',
       'Event Space', 'Antique Shop', 'Market', 'Pop-Up Shop',
       'Gastropub', 'Gay Bar', 'Sports Bar'], dtype=object)

In [140]:
# prepares a list of categories to be removed
# note: 'Arcarde' is misspelled, so isin() does not match 'Arcade' and those venues remain in the data below
exclude_list = (['Park','Miscellaneous Shop','Arcarde','Lounge','Building','Concert Hall','Grocery Store',
                'Residential Building (Apartment / Condo)','Office','Conference Room','Event Space','Antique Shop','Market','Pop-Up Shop'])

# creates a new dataframe only taking data where Category does not fall in the exclude_list above
select_restaurants_data = restaurants_data[~restaurants_data['Category'].isin(exclude_list)]
select_restaurants_data

Unnamed: 0,Community,Venue ID,Venue Name,Category
0,Rogers Park,4b75fc9bf964a52070342ee3,Hong Kong Restaurant,Chinese Restaurant
1,Rogers Park,4df0fa23d4c04d0392c87e45,Taqueria & Restaurant Cd. Hidalgo,Mexican Restaurant
2,Rogers Park,5a03af34b5461848270c64b6,South Of The Border Restaurant,South American Restaurant
3,Rogers Park,4f44ff7c19836ed00198048d,Campeche Restaurant,Food
4,Rogers Park,4e4e317abd4101d0d7a46b18,Fabi's Restaurant,Mexican Restaurant
5,Rogers Park,4f32b71219836c91c7f29557,Angelos Restaurant,Food
6,Rogers Park,50900127e4b061e9d64fb0f9,Redz Belizean Restaurant,Caribbean Restaurant
7,Rogers Park,4b721efbf964a520b36f2de3,Great Wall Chinese Restaurant,Chinese Restaurant
8,Rogers Park,4f43b68719834bc91f586d42,Guadalupana Restaurant,Food
9,Rogers Park,4e4e5cbdbd4101d0d7a881b4,RoPa Restaurant & Wine Bar,Mediterranean Restaurant
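
Because `isin` silently ignores list entries that match nothing, a misspelled category in an exclusion list goes unnoticed. A small sanity check, sketched here with a subset of the categories listed above, is to compare the list against the categories actually observed. The variable names are illustrative.

```python
# Subset of the observed categories, for illustration
observed_categories = {"Park", "Arcade", "Lounge", "Building", "Diner", "Bakery"}

# Exclusion list containing one entry that matches no observed category
exclude_list = ["Park", "Arcarde", "Lounge", "Building"]

# Entries that would never filter anything out
unmatched = sorted(set(exclude_list) - observed_categories)
print(unmatched)  # ['Arcarde']
```

In the notebook, `observed_categories` could be taken from `set(restaurants_data['Category'].unique())`.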


We can now calculate the number of restaurants and similar venues in each community, which will be used in our Clustering model:

In [141]:
count_rest=select_restaurants_data.groupby('Community').count()
count_rest

Community,Venue ID,Venue Name,Category
Armour Square,6,6,6
Ashburn,6,6,6
Auburn Gresham,8,8,8
Austin,6,6,6
Avalon Park,5,5,5
Avondale,31,31,31
Belmont Cragin,20,20,20
Beverly,1,1,1
Bridgeport,10,10,10
Brighton Park,7,7,7


In [142]:
count_rest.shape

(64, 3)
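
`groupby('Community').count()` repeats the same count in every column, which is why one column is later renamed to serve as the restaurant count. As an alternative sketch on toy data, `size()` yields a single count per group directly.

```python
import pandas as pd

# Toy venue list (illustrative IDs)
venues = pd.DataFrame({
    "Community": ["Beverly", "Ashburn", "Ashburn"],
    "Venue ID": ["a", "b", "c"],
})

# One row per community with a single, explicitly named count column
restaurant_counts = (
    venues.groupby("Community").size().reset_index(name="Restaurant Count")
)
print(restaurant_counts)
```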

The Foursquare loop retrieved restaurant information for 64 of the 77 communities, so 13 communities are missing from the results. They include West Ridge, Lincoln Square, Lake View, Near North Side, Albany Park, Logan Square, Archer Heights, New City, Chicago Lawn and Edgewater.
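
The communities without data can also be listed programmatically rather than read off the printed messages. This is a minimal sketch with toy inputs; in the notebook the two inputs would presumably be `community_data['Community']` and `count_rest.index`.

```python
# Toy inputs standing in for the full community list and the communities with data
all_communities = ["Rogers Park", "West Ridge", "Uptown", "Lincoln Square"]
communities_with_data = {"Rogers Park", "Uptown"}

# Keep the original order of the community list
missing = [c for c in all_communities if c not in communities_with_data]
print(len(missing), missing)
```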

## 2. Clustering

### 2a. Preparing data for clustering model

Before we run our Cluster modelling process, we need to normalize our data.

We will start by transforming our Categorical variable - Restaurant Categories - into a numerical frequency of categories in each community. All values will, by definition, fit a scale of float values ranging from 0 to 1.

In [143]:
# using the get_dummies function
categ_data = pd.get_dummies(select_restaurants_data['Category'])
categ_data['Community'] = select_restaurants_data['Community']
fixed_columns = [categ_data.columns[-1]] + list(categ_data.columns[:-1])
categ_data = categ_data[fixed_columns]
categ_data.head()

Unnamed: 0,Community,African Restaurant,American Restaurant,Arcade,Argentinian Restaurant,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bar,Breakfast Spot,Burger Joint,Café,Cajun / Creole Restaurant,Cantonese Restaurant,Caribbean Restaurant,Chinese Restaurant,Coffee Shop,Comfort Food Restaurant,Cuban Restaurant,Deli / Bodega,Dim Sum Restaurant,Diner,Eastern European Restaurant,Ethiopian Restaurant,Fast Food Restaurant,Food,Food Court,French Restaurant,Gastropub,Gay Bar,German Restaurant,Greek Restaurant,Hot Dog Joint,Hunan Restaurant,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Korean Restaurant,Latin American Restaurant,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Moroccan Restaurant,New American Restaurant,Noodle House,Peruvian Restaurant,Pizza Place,Polish Restaurant,Pub,Restaurant,Sandwich Place,Scandinavian Restaurant,Seafood Restaurant,Snack Place,South American Restaurant,Southern / Soul Food Restaurant,Sports Bar,Steakhouse,Sushi Restaurant,Taco Place,Thai Restaurant,Turkish Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wings Joint,Yemeni Restaurant
0,Rogers Park,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Rogers Park,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Rogers Park,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Rogers Park,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Rogers Park,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [144]:
categfreq_data=categ_data.groupby('Community').mean().reset_index()
categfreq_data

Unnamed: 0,Community,African Restaurant,American Restaurant,Arcade,Argentinian Restaurant,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bar,Breakfast Spot,Burger Joint,Café,Cajun / Creole Restaurant,Cantonese Restaurant,Caribbean Restaurant,Chinese Restaurant,Coffee Shop,Comfort Food Restaurant,Cuban Restaurant,Deli / Bodega,Dim Sum Restaurant,Diner,Eastern European Restaurant,Ethiopian Restaurant,Fast Food Restaurant,Food,Food Court,French Restaurant,Gastropub,Gay Bar,German Restaurant,Greek Restaurant,Hot Dog Joint,Hunan Restaurant,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Korean Restaurant,Latin American Restaurant,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Moroccan Restaurant,New American Restaurant,Noodle House,Peruvian Restaurant,Pizza Place,Polish Restaurant,Pub,Restaurant,Sandwich Place,Scandinavian Restaurant,Seafood Restaurant,Snack Place,South American Restaurant,Southern / Soul Food Restaurant,Sports Bar,Steakhouse,Sushi Restaurant,Taco Place,Thai Restaurant,Turkish Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wings Joint,Yemeni Restaurant
0,Armour Square,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Ashburn,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Auburn Gresham,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.375,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Austin,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Avalon Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.2,0.0,0.0,0.2,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Avondale,0.0,0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032258,0.0,0.0,0.0,0.0,0.0,0.16129,0.0,0.0,0.0,0.0,0.0,0.032258,0.032258,0.0,0.0,0.129032,0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032258,0.0,0.0,0.032258,0.0,0.258065,0.0,0.0,0.0,0.0,0.0,0.032258,0.0,0.032258,0.032258,0.032258,0.0,0.0,0.0,0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.064516,0.0,0.0,0.0
6,Belmont Cragin,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.05,0.0,0.0,0.05,0.1,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.05,0.05,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Beverly,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Bridgeport,0.0,0.1,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Brighton Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.428571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.285714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
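
The one-hot-then-mean steps above can also be expressed in a single call. As a sketch on toy data, `pd.crosstab` with `normalize='index'` gives the share of each category within each community, so every row sums to 1.

```python
import pandas as pd

# Toy venue list: community A has three venues, community B has one
venues = pd.DataFrame({
    "Community": ["A", "A", "A", "B"],
    "Category": ["Food", "Food", "Diner", "Diner"],
})

# Row-normalized frequency table: one row per community, one column per category
freq = pd.crosstab(venues["Community"], venues["Category"], normalize="index")
print(freq)
```

Here community A comes out as two thirds Food and one third Diner, while B is entirely Diner.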


In [145]:
categfreq_data.shape

(64, 69)

As we can see above, the table has 69 columns: the Community column plus 68 restaurant categories. That is fine for the clustering model, but difficult for human eyes to interpret. To make it easier, let's look at the top 5 categories per community.

In [146]:
# we first define a function that gets top categories
def return_top_categories(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
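
To see what this function returns, here is a small usage sketch on a made-up row. The function is repeated so the example runs on its own; the row values are illustrative.

```python
import pandas as pd


def return_top_categories(row, num_top_venues):
    row_categories = row.iloc[1:]  # skip the first entry (the community name)
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]


# Made-up frequency row: community name first, then category frequencies
row = pd.Series({"Community": "Toy", "Food": 0.5, "Diner": 0.2, "Pub": 0.3})

top2 = return_top_categories(row, 2)
print(list(top2))  # ['Food', 'Pub']
```

Categories with equal frequency, including a frequency of zero, come out in no meaningful order, so the lower ranks for a community with very few venues should not be over-interpreted.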

In [147]:
# now we run a loop that will retrieve top categories for all 64 communities
num_top_venues = 5

indicators = ['st', 'nd', 'rd']
# prepare columns of the dataframe according to our selection of number of top categories that we want to visualize
columns = ['Community'] # Community is our first column, as that is our main reference
for n in np.arange(num_top_venues):
    try:
        columns.append('{}{} Category'.format(n+1, indicators[n]))
    except IndexError:  # no 'st'/'nd'/'rd' suffix beyond the third position
        columns.append('{}th Category'.format(n+1))

# create a new dataframe
top_categ = pd.DataFrame(columns=columns)
top_categ['Community'] = categfreq_data['Community']

for i in np.arange(categfreq_data.shape[0]):
    top_categ.iloc[i, 1:] = return_top_categories(categfreq_data.iloc[i, :], num_top_venues)

top_categ

Unnamed: 0,Community,1st Category,2nd Category,3rd Category,4th Category,5th Category
0,Armour Square,Chinese Restaurant,Food,Fast Food Restaurant,American Restaurant,Deli / Bodega
1,Ashburn,American Restaurant,Sports Bar,Hot Dog Joint,Italian Restaurant,Yemeni Restaurant
2,Auburn Gresham,Food,Southern / Soul Food Restaurant,Chinese Restaurant,Mexican Restaurant,American Restaurant
3,Austin,Southern / Soul Food Restaurant,Greek Restaurant,Breakfast Spot,Food,Taco Place
4,Avalon Park,Food,Fast Food Restaurant,Diner,Comfort Food Restaurant,Gay Bar
5,Avondale,Mexican Restaurant,Chinese Restaurant,Food,Vietnamese Restaurant,Diner
6,Belmont Cragin,Mexican Restaurant,Restaurant,American Restaurant,Food,Eastern European Restaurant
7,Beverly,Mexican Restaurant,Yemeni Restaurant,Fast Food Restaurant,Cuban Restaurant,Deli / Bodega
8,Bridgeport,Food,American Restaurant,Restaurant,Asian Restaurant,Japanese Restaurant
9,Brighton Park,Mexican Restaurant,Seafood Restaurant,Chinese Restaurant,Breakfast Spot,Deli / Bodega


Now let's normalize our 4 Numerical variables - Population, Density, Restaurant Count and a newly calculated ratio Population/Restaurant - by taking each value and dividing by the maximum value of the respective variable. That way, these four variables will fit a range of float values between 0 and 1.

In [148]:
# merge Population, Density and Restaurant count data for each community
comb_data=count_rest.merge(community_data, on='Community', how = 'left')
comb_data.head()

Unnamed: 0,Community,Venue ID,Venue Name,Category,Latitude,Longitude,Population,Density (/km2)
0,Armour Square,6,6,6,41.840231,-87.632986,13455,5195.0
1,Ashburn,6,6,6,41.747533,-87.711163,43792,3479.05
2,Auburn Gresham,8,8,8,41.743387,-87.656042,46278,4739.53
3,Austin,6,6,6,41.887876,-87.764851,95260,5144.07
4,Avalon Park,5,5,5,41.745035,-87.588658,9985,3084.18


In [149]:
# rename the Venue ID column to be Restaurant count
comb_data=comb_data.rename(columns={'Venue ID': 'Restaurant Count'})

#drop columns that we won't use for clustering
comb_data=comb_data.drop(['Venue Name','Category','Latitude','Longitude'], axis=1)

comb_data.head()

Unnamed: 0,Community,Restaurant Count,Population,Density (/km2)
0,Armour Square,6,13455,5195.0
1,Ashburn,6,43792,3479.05
2,Auburn Gresham,8,46278,4739.53
3,Austin,6,95260,5144.07
4,Avalon Park,5,9985,3084.18


In [150]:
# Now let's add a column that calculates average number of people per restaurant
comb_data['People/Restaurant']=comb_data['Population']/comb_data['Restaurant Count']
comb_data.head()

Unnamed: 0,Community,Restaurant Count,Population,Density (/km2),People/Restaurant
0,Armour Square,6,13455,5195.0,2242.5
1,Ashburn,6,43792,3479.05,7298.666667
2,Auburn Gresham,8,46278,4739.53,5784.75
3,Austin,6,95260,5144.07,15876.666667
4,Avalon Park,5,9985,3084.18,1997.0


Now let's merge this dataset with the category frequencies for each community:

In [151]:
comb_data=comb_data.merge(categfreq_data, on='Community', how = 'right')
comb_data.head()

Unnamed: 0,Community,Restaurant Count,Population,Density (/km2),People/Restaurant,African Restaurant,American Restaurant,Arcade,Argentinian Restaurant,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bar,Breakfast Spot,Burger Joint,Café,Cajun / Creole Restaurant,Cantonese Restaurant,Caribbean Restaurant,Chinese Restaurant,Coffee Shop,Comfort Food Restaurant,Cuban Restaurant,Deli / Bodega,Dim Sum Restaurant,Diner,Eastern European Restaurant,Ethiopian Restaurant,Fast Food Restaurant,Food,Food Court,French Restaurant,Gastropub,Gay Bar,German Restaurant,Greek Restaurant,Hot Dog Joint,Hunan Restaurant,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Korean Restaurant,Latin American Restaurant,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Moroccan Restaurant,New American Restaurant,Noodle House,Peruvian Restaurant,Pizza Place,Polish Restaurant,Pub,Restaurant,Sandwich Place,Scandinavian Restaurant,Seafood Restaurant,Snack Place,South American Restaurant,Southern / Soul Food Restaurant,Sports Bar,Steakhouse,Sushi Restaurant,Taco Place,Thai Restaurant,Turkish Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wings Joint,Yemeni Restaurant
0,Armour Square,6,13455,5195.0,2242.5,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Ashburn,6,43792,3479.05,7298.666667,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Auburn Gresham,8,46278,4739.53,5784.75,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.375,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Austin,6,95260,5144.07,15876.666667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Avalon Park,5,9985,3084.18,1997.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.2,0.0,0.0,0.2,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [152]:
comb_data.shape

(64, 73)

Now let's normalize columns 1-4 (Restaurant Count, Population, Density and People/Restaurant) by dividing each value by the maximum value in its column:
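
As a worked illustration of this max-scaling on toy numbers (not the notebook's data), dividing a set of columns by their column-wise maxima can be done in one vectorized step:

```python
import pandas as pd

# Toy data: two communities, two numeric features
df = pd.DataFrame({
    "Community": ["A", "B"],
    "Population": [50, 100],
    "Restaurant Count": [4, 2],
})

num_cols = ["Population", "Restaurant Count"]
scaled = df.copy()
# Each column is divided by its own maximum, so the largest value becomes 1.0
scaled[num_cols] = df[num_cols] / df[num_cols].max()
print(scaled)
```

For non-negative values this keeps everything between 0 and 1, with the maximum of each column mapped to exactly 1.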

In [159]:
# create new dataframe cluster_data as a copy of comb_data
cluster_data = comb_data.copy()

cluster_data['Restaurant Count']=cluster_data['Restaurant Count']/cluster_data['Restaurant Count'].max()
cluster_data['Population']=cluster_data['Population']/cluster_data['Population'].max()
cluster_data['Density (/km2)']=cluster_data['Density (/km2)']/cluster_data['Density (/km2)'].max()
cluster_data['People/Restaurant']=cluster_data['People/Restaurant']/cluster_data['People/Restaurant'].max()
cluster_data

Unnamed: 0,Community,Restaurant Count,Population,Density (/km2),People/Restaurant,African Restaurant,American Restaurant,Arcade,Argentinian Restaurant,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bar,Breakfast Spot,Burger Joint,Café,Cajun / Creole Restaurant,Cantonese Restaurant,Caribbean Restaurant,Chinese Restaurant,Coffee Shop,Comfort Food Restaurant,Cuban Restaurant,Deli / Bodega,Dim Sum Restaurant,Diner,Eastern European Restaurant,Ethiopian Restaurant,Fast Food Restaurant,Food,Food Court,French Restaurant,Gastropub,Gay Bar,German Restaurant,Greek Restaurant,Hot Dog Joint,Hunan Restaurant,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Korean Restaurant,Latin American Restaurant,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Moroccan Restaurant,New American Restaurant,Noodle House,Peruvian Restaurant,Pizza Place,Polish Restaurant,Pub,Restaurant,Sandwich Place,Scandinavian Restaurant,Seafood Restaurant,Snack Place,South American Restaurant,Southern / Soul Food Restaurant,Sports Bar,Steakhouse,Sushi Restaurant,Taco Place,Thai Restaurant,Turkish Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wings Joint,Yemeni Restaurant
0,Armour Square,0.125,0.141245,0.449624,0.080834,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Ashburn,0.125,0.45971,0.301109,0.263091,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Auburn Gresham,0.166667,0.485807,0.410203,0.20852,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.375,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Austin,0.125,1.0,0.445216,0.572297,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Avalon Park,0.104167,0.104818,0.266934,0.071985,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.2,0.0,0.0,0.2,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Avondale,0.645833,0.392274,0.630667,0.043451,0.0,0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032258,0.0,0.0,0.0,0.0,0.0,0.16129,0.0,0.0,0.0,0.0,0.0,0.032258,0.032258,0.0,0.0,0.129032,0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032258,0.0,0.0,0.032258,0.0,0.258065,0.0,0.0,0.0,0.0,0.0,0.032258,0.0,0.032258,0.032258,0.032258,0.0,0.0,0.0,0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.064516,0.0,0.0,0.0
6,Belmont Cragin,0.416667,0.838862,0.682952,0.144024,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.05,0.0,0.0,0.05,0.1,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.05,0.05,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Beverly,0.020833,0.218581,0.218807,0.750559,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Bridgeport,0.208333,0.353107,0.53782,0.121249,0.0,0.1,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Brighton Park,0.145833,0.470428,0.550556,0.230764,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.428571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.285714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
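
The same max-normalization can be written in a single vectorized step. The sketch below is only an illustration: `demo` is a hypothetical three-row frame built from values of three communities in this notebook, so its column maxima (and therefore the scaled values) differ from those of the full 64-row table.

```python
import pandas as pd

# Hypothetical toy frame: three communities only, NOT the full comb_data
demo = pd.DataFrame({
    'Community': ['Armour Square', 'Ashburn', 'Auburn Gresham'],
    'Restaurant Count': [6, 6, 8],
    'Population': [13455, 43792, 46278],
    'Density (/km2)': [5195.0, 3479.05, 4739.53],
    'People/Restaurant': [2242.5, 7298.666667, 5784.75],
})

numeric_cols = ['Restaurant Count', 'Population', 'Density (/km2)', 'People/Restaurant']

# Divide every numeric column by its own maximum in one operation
scaled = demo.copy()
scaled[numeric_cols] = demo[numeric_cols] / demo[numeric_cols].max()
print(scaled)
```

After this step each numeric column has a maximum of exactly 1, which keeps a single large-valued feature such as Population from dominating the Euclidean distances used by K-means.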


### 2b. Running the Clustering model

#### The number of clusters K is a key parameter of a clustering model. After a few iterations testing values of K from 3 to 8, I found that 5 clusters gave the best results: 3 or 4 clusters didn't yield clearly distinct cluster profiles, while 6 to 8 clusters resulted in two or more clusters with only 1 or 2 communities each. Therefore, we will proceed with K=5.
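The notebook does not show how the values of K were compared, so the following is only one possible way to run such a sweep, not the procedure actually used here. It assumes scikit-learn's `silhouette_score` as an extra quality measure, and `sweep_k` and `X_demo` are hypothetical names; `X_demo` is random data standing in for the real feature matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def sweep_k(X, k_values=range(3, 9), random_state=0):
    """Fit K-means for each K and collect inertia, silhouette and smallest cluster size."""
    results = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        sizes = np.bincount(model.labels_, minlength=k)
        results.append({
            'k': k,
            'inertia': float(model.inertia_),
            'silhouette': float(silhouette_score(X, model.labels_)),
            'smallest_cluster': int(sizes.min()),
        })
    return results


# Synthetic stand-in for the 64-community feature matrix (assumption, not real data)
rng = np.random.default_rng(0)
X_demo = rng.random((64, 6))

sweep = sweep_k(X_demo)
for row in sweep:
    print(row)
```

A very small `smallest_cluster` value for a given K corresponds to the situation described above, where some clusters end up with only 1 or 2 communities.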

In [160]:
# set number of clusters
kclusters = 5

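# drop the non-numeric Community column (the positional axis argument is not accepted by recent pandas versions)
X = cluster_data.drop(columns='Community')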

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(X)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([4, 0, 3, 2, 3, 2, 2, 1, 3, 1, 4, 1, 1, 3, 3, 1, 3, 3, 4, 3, 4, 4,
       1, 1, 4, 3, 1, 2, 1, 2, 2, 4, 4, 2, 2, 1, 3, 4, 0, 3, 2, 3, 2, 3,
       3, 3, 4, 4, 2, 4, 2, 4, 1, 1, 2, 2, 4, 1, 3, 3, 1, 0, 2, 4],
      dtype=int32)

In [161]:
# Confirms that all 64 communities got a cluster label 
kmeans.labels_.shape[0]

64

Now for our final analysis, let's create a data frame that merges the initial community information, the top 5 restaurant categories of each community, and the cluster labels:

In [162]:
# initiate a dataframe using Community name as key field
final_data = cluster_data[['Community']].copy()

# add cluster labels
final_data.insert(1, 'Cluster Labels', kmeans.labels_)

# add the Restaurant Count, Population, Density and People/Restaurant columns from comb_data
subset=comb_data.iloc[:,0:5]
final_data=final_data.merge(subset, on='Community', how = 'right')

# add top 5 restaurant categories for each community
final_data=final_data.merge(top_categ, on='Community', how = 'right')

# add Longitude and Latitude at the end of the table
coordinates_subset = community_data[['Community','Latitude','Longitude']]
final_data=final_data.merge(coordinates_subset, on='Community', how = 'left')

print(final_data.shape)
final_data

(64, 13)


Unnamed: 0,Community,Cluster Labels,Restaurant Count,Population,Density (/km2),People/Restaurant,1st Category,2nd Category,3rd Category,4th Category,5th Category,Latitude,Longitude
0,Armour Square,4,6,13455,5195.0,2242.5,Chinese Restaurant,Food,Fast Food Restaurant,American Restaurant,Deli / Bodega,41.840231,-87.632986
1,Ashburn,0,6,43792,3479.05,7298.666667,American Restaurant,Sports Bar,Hot Dog Joint,Italian Restaurant,Yemeni Restaurant,41.747533,-87.711163
2,Auburn Gresham,3,8,46278,4739.53,5784.75,Food,Southern / Soul Food Restaurant,Chinese Restaurant,Mexican Restaurant,American Restaurant,41.743387,-87.656042
3,Austin,2,6,95260,5144.07,15876.666667,Southern / Soul Food Restaurant,Greek Restaurant,Breakfast Spot,Food,Taco Place,41.887876,-87.764851
4,Avalon Park,3,5,9985,3084.18,1997.0,Food,Fast Food Restaurant,Diner,Comfort Food Restaurant,Gay Bar,41.745035,-87.588658
5,Avondale,2,31,37368,7286.8,1205.419355,Mexican Restaurant,Chinese Restaurant,Food,Vietnamese Restaurant,Diner,41.938921,-87.711168
6,Belmont Cragin,2,20,79910,7890.9,3995.5,Mexican Restaurant,Restaurant,American Restaurant,Food,Eastern European Restaurant,41.931698,-87.76867
7,Beverly,1,1,20822,2528.12,20822.0,Mexican Restaurant,Yemeni Restaurant,Fast Food Restaurant,Cuban Restaurant,Deli / Bodega,41.718153,-87.671767
8,Bridgeport,3,10,33637,6214.03,3363.7,Food,American Restaurant,Restaurant,Asian Restaurant,Japanese Restaurant,41.837938,-87.651028
9,Brighton Park,1,7,44813,6361.18,6401.857143,Mexican Restaurant,Seafood Restaurant,Chinese Restaurant,Breakfast Spot,Deli / Bodega,41.818922,-87.698942
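
Since the final table is assembled through several merges on `Community`, it can be worth checking that no community is duplicated or left unmatched along the way. The sketch below uses the `validate` and `indicator` options of `DataFrame.merge` on two hypothetical toy frames (`labels` and `coords`); it is not part of the original pipeline.

```python
import pandas as pd

# Hypothetical toy inputs with values taken from the table above
labels = pd.DataFrame({
    'Community': ['Armour Square', 'Ashburn'],
    'Cluster Labels': [4, 0],
})
coords = pd.DataFrame({
    'Community': ['Armour Square', 'Ashburn', 'Austin'],
    'Latitude': [41.840231, 41.747533, 41.887876],
    'Longitude': [-87.632986, -87.711163, -87.764851],
})

# validate raises MergeError if Community is duplicated on either side;
# indicator adds a _merge column telling where each row was found
merged = labels.merge(coords, on='Community', how='left',
                      validate='one_to_one', indicator=True)

# rows present in labels but missing from coords would show up here
unmatched = merged[merged['_merge'] != 'both']
print(merged)
print('Unmatched communities:', len(unmatched))
```

With `how='left'` the row count of the left frame is preserved, which is why Austin, present only in `coords`, does not appear in the result.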


### 2c. Visualizing and analyzing community clusters

Let's first visualize our clusters on a Chicago map

In [163]:
# create map
map_clusters = folium.Map(location=[chicago_latitude, chicago_longitude], zoom_start=10)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, com, cluster in zip(final_data['Latitude'], final_data['Longitude'], final_data['Community'], final_data['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster) + ', ' + str(com) , parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Let's now analyze the characteristics of each cluster. We will start by looking at the statistics of the numerical data across all communities, to use as a reference when looking at each cluster separately.

In [164]:
# start by taking a quick look at the statistics of all numerical data: Restaurant Count, Population, Density and People/Restaurant
subset=final_data.iloc[:,2:6]
subset.describe()

Unnamed: 0,Restaurant Count,Population,Density (/km2),People/Restaurant
count,64.0,64.0,64.0,64.0
mean,9.796875,32471.546875,4676.1875,5781.675779
std,10.486847,20489.63495,2191.099279,5379.199691
min,1.0,2254.0,358.23,492.083333
25%,4.0,18974.75,3136.2375,2249.946429
50%,6.0,27284.5,4336.92,4455.0
75%,11.5,42747.0,6309.31,6941.694444
max,48.0,95260.0,11554.11,27742.0


Now let's look at features of each cluster separately.

In [165]:
cluster_number=0
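# count the rows in this cluster (counting unique Population values would undercount if two communities shared a value)
count = (final_data['Cluster Labels'] == cluster_number).sum()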
print("There are {} communities in Cluster number {}.".format(count, cluster_number))
final_data[final_data['Cluster Labels']==cluster_number]

There are 3 communities in Cluster number 0.


Unnamed: 0,Community,Cluster Labels,Restaurant Count,Population,Density (/km2),People/Restaurant,1st Category,2nd Category,3rd Category,4th Category,5th Category,Latitude,Longitude
1,Ashburn,0,6,43792,3479.05,7298.666667,American Restaurant,Sports Bar,Hot Dog Joint,Italian Restaurant,Yemeni Restaurant,41.747533,-87.711163
38,Morgan Park,0,1,22394,2620.11,22394.0,American Restaurant,Yemeni Restaurant,Food,Deli / Bodega,Dim Sum Restaurant,41.690312,-87.666716
61,West Pullman,0,1,27742,3008.78,27742.0,Seafood Restaurant,Yemeni Restaurant,Ethiopian Restaurant,Cuban Restaurant,Deli / Bodega,41.671775,-87.638358


The first cluster (Cluster 0) has 3 communities with a high People/Restaurant ratio compared to the entire set of 64 clustered communities: all three are above the 3rd quartile of 6,941 people per restaurant. Those communities are also located far from downtown Chicago and, based on their top 5 categories, there is no indication that a French restaurant would be particularly successful there.
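The quartile comparison above can be checked with a few lines of pandas. This is a minimal sketch: `cluster0` is a hypothetical frame retyped from the Cluster 0 table, and the threshold is copied from the `describe()` output rather than recomputed.

```python
import pandas as pd

# Cluster 0 rows retyped from the table above (hypothetical helper frame)
cluster0 = pd.DataFrame({
    'Community': ['Ashburn', 'Morgan Park', 'West Pullman'],
    'People/Restaurant': [7298.666667, 22394.0, 27742.0],
})

# 75% value from subset.describe(); with the full table it could be obtained
# as final_data['People/Restaurant'].quantile(0.75)
q3_all = 6941.694444

cluster0['Above Q3'] = cluster0['People/Restaurant'] > q3_all
print(cluster0)
```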

In [166]:
cluster_number=1
# count the rows in this cluster
count = (final_data['Cluster Labels'] == cluster_number).sum()
print("There are {} communities in Cluster number {}.".format(count, cluster_number))
final_data[final_data['Cluster Labels']==cluster_number]

There are 14 communities in Cluster number 1.


Unnamed: 0,Community,Cluster Labels,Restaurant Count,Population,Density (/km2),People/Restaurant,1st Category,2nd Category,3rd Category,4th Category,5th Category,Latitude,Longitude
7,Beverly,1,1,20822,2528.12,20822.0,Mexican Restaurant,Yemeni Restaurant,Fast Food Restaurant,Cuban Restaurant,Deli / Bodega,41.718153,-87.671767
9,Brighton Park,1,7,44813,6361.18,6401.857143,Mexican Restaurant,Seafood Restaurant,Chinese Restaurant,Breakfast Spot,Deli / Bodega,41.818922,-87.698942
11,Calumet Heights,1,3,13188,2909.67,4396.0,Diner,Burger Joint,Mexican Restaurant,Yemeni Restaurant,Fast Food Restaurant,41.730035,-87.579213
12,Chatham,1,3,31120,4073.05,10373.333333,Mexican Restaurant,Caribbean Restaurant,Yemeni Restaurant,Fast Food Restaurant,Deli / Bodega,41.741145,-87.612548
15,Dunning,1,8,43689,4534.52,5461.125,Mexican Restaurant,Eastern European Restaurant,Bar,American Restaurant,Seafood Restaurant,41.952809,-87.796449
22,Gage Park,1,10,40873,7173.25,4087.3,Mexican Restaurant,Asian Restaurant,Coffee Shop,Gay Bar,Gastropub,41.795033,-87.696164
23,Garfield Ridge,1,6,36396,3322.12,6066.0,Mexican Restaurant,Restaurant,Food,Ethiopian Restaurant,Cuban Restaurant,41.803617,-87.745489
26,Hegewisch,1,1,9418,693.95,9418.0,Mexican Restaurant,Yemeni Restaurant,Fast Food Restaurant,Cuban Restaurant,Deli / Bodega,41.653646,-87.546988
28,Humboldt Park,1,6,56427,6051.83,9404.5,Mexican Restaurant,Caribbean Restaurant,Taco Place,Breakfast Spot,Fast Food Restaurant,41.900828,-87.723959
35,Lower West Side,1,21,32888,4333.83,1566.095238,Mexican Restaurant,Food,Restaurant,Taco Place,Food Court,41.8542,-87.665609


Cluster 1 shows a strong presence of Mexican food: Mexican Restaurant is the most frequent restaurant category in most of these communities. Caribbean, American and Fast Food also seem to be popular cuisines in those locations. I also noticed that most communities (10 out of 14) have Population values above the median of 27,284 people per community, seen across all 64 analyzed communities.
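One way to back up a statement about a cluster's dominant cuisine is to count how often each category appears in the `1st Category` column. The sketch below does this for the 10 Cluster 1 rows displayed above only (the notebook output does not list all 14), so `first_category` is a partial, retyped sample.

```python
import pandas as pd

# '1st Category' of the 10 Cluster 1 communities displayed above (partial sample)
first_category = pd.Series([
    'Mexican Restaurant', 'Mexican Restaurant', 'Diner', 'Mexican Restaurant',
    'Mexican Restaurant', 'Mexican Restaurant', 'Mexican Restaurant',
    'Mexican Restaurant', 'Mexican Restaurant', 'Mexican Restaurant',
], name='1st Category')

# value_counts sorts categories from most to least frequent
counts = first_category.value_counts()
print(counts)
```

On the full table, the equivalent call would be `final_data.loc[final_data['Cluster Labels'] == 1, '1st Category'].value_counts()`.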

In [167]:
cluster_number=2
# count the rows in this cluster
count = (final_data['Cluster Labels'] == cluster_number).sum()
print("There are {} communities in Cluster number {}.".format(count, cluster_number))
final_data[final_data['Cluster Labels']==cluster_number]

There are 15 communities in Cluster number 2.


Unnamed: 0,Community,Cluster Labels,Restaurant Count,Population,Density (/km2),People/Restaurant,1st Category,2nd Category,3rd Category,4th Category,5th Category,Latitude,Longitude
3,Austin,2,6,95260,5144.07,15876.666667,Southern / Soul Food Restaurant,Greek Restaurant,Breakfast Spot,Food,Taco Place,41.887876,-87.764851
5,Avondale,2,31,37368,7286.8,1205.419355,Mexican Restaurant,Chinese Restaurant,Food,Vietnamese Restaurant,Diner,41.938921,-87.711168
6,Belmont Cragin,2,20,79910,7890.9,3995.5,Mexican Restaurant,Restaurant,American Restaurant,Food,Eastern European Restaurant,41.931698,-87.76867
27,Hermosa,2,17,24144,7967.57,1420.235294,Mexican Restaurant,Restaurant,Latin American Restaurant,American Restaurant,Pizza Place,41.928643,-87.734502
29,Hyde Park,2,14,26827,6433.52,1916.214286,Food,Japanese Restaurant,Mexican Restaurant,American Restaurant,Gastropub,41.794446,-87.593924
30,Irving Park,2,19,54606,6568.06,2874.0,Food,Pizza Place,Mexican Restaurant,Chinese Restaurant,Mediterranean Restaurant,41.953365,-87.736447
33,Lincoln Park,2,15,67710,8273.1,4514.0,Chinese Restaurant,Bagel Shop,Noodle House,Scandinavian Restaurant,Seafood Restaurant,41.939945,-87.63612
34,Loop,2,46,35880,8395.97,780.0,Food,Chinese Restaurant,American Restaurant,Italian Restaurant,Restaurant,41.881609,-87.629457
40,Near South Side,2,48,23620,5123.44,492.083333,Chinese Restaurant,Food,Asian Restaurant,American Restaurant,Cantonese Restaurant,41.8567,-87.624774
42,North Center,2,19,35789,6740.59,1883.631579,Chinese Restaurant,Thai Restaurant,Latin American Restaurant,American Restaurant,Mexican Restaurant,41.956107,-87.67916


Cluster number 2 gathers 15 communities with high Population and Population Density compared to the averages seen across all 64 communities. Its communities also have very diverse cuisines, which suggests that restaurant guests in the area probably appreciate a variety of food and dining experiences. This is also the cluster where European cuisine appears most often in the top 5 categories: Greek, Eastern European, Scandinavian, Italian, and Ukrainian cuisines are all featured. This could be an indication that French cuisine, also European, could be successful in these communities.
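Comparisons of this kind can be made explicit by dividing per-cluster means by the overall means. The sketch below is only an illustration on a hypothetical six-row sample (`sample`: three communities from Cluster 0 and three from Cluster 2), so its group means are not the real cluster means; the overall means are copied from the `describe()` output.

```python
import pandas as pd

# Six rows retyped from the tables above: a partial, hypothetical sample
sample = pd.DataFrame({
    'Community': ['Ashburn', 'Morgan Park', 'West Pullman',
                  'Austin', 'Avondale', 'Belmont Cragin'],
    'Cluster Labels': [0, 0, 0, 2, 2, 2],
    'Population': [43792, 22394, 27742, 95260, 37368, 79910],
    'Density (/km2)': [3479.05, 2620.11, 3008.78, 5144.07, 7286.8, 7890.9],
})

# Mean of each numeric feature per cluster
profile = sample.groupby('Cluster Labels')[['Population', 'Density (/km2)']].mean()

# Overall means of all 64 communities, from subset.describe()
overall = pd.Series({'Population': 32471.546875, 'Density (/km2)': 4676.1875})

# Values above 1 mean the cluster sample sits above the overall average
relative = profile / overall
print(relative.round(2))
```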

In [168]:
cluster_number=3
# count the rows in this cluster
count = (final_data['Cluster Labels'] == cluster_number).sum()
print("There are {} communities in Cluster number {}.".format(count, cluster_number))
final_data[final_data['Cluster Labels']==cluster_number]

There are 17 communities in Cluster number 3.


Unnamed: 0,Community,Cluster Labels,Restaurant Count,Population,Density (/km2),People/Restaurant,1st Category,2nd Category,3rd Category,4th Category,5th Category,Latitude,Longitude
2,Auburn Gresham,3,8,46278,4739.53,5784.75,Food,Southern / Soul Food Restaurant,Chinese Restaurant,Mexican Restaurant,American Restaurant,41.743387,-87.656042
4,Avalon Park,3,5,9985,3084.18,1997.0,Food,Fast Food Restaurant,Diner,Comfort Food Restaurant,Gay Bar,41.745035,-87.588658
8,Bridgeport,3,10,33637,6214.03,3363.7,Food,American Restaurant,Restaurant,Asian Restaurant,Japanese Restaurant,41.837938,-87.651028
13,Clearing,3,4,25891,3920.22,6472.75,Food,Pizza Place,Asian Restaurant,Latin American Restaurant,Fast Food Restaurant,41.780588,-87.773388
14,Douglas,3,3,20781,4862.78,6927.0,African Restaurant,Café,Food,German Restaurant,Gay Bar,41.834857,-87.617954
16,East Garfield Park,3,4,19996,4000.26,4999.0,Food,Diner,Fast Food Restaurant,Cuban Restaurant,Deli / Bodega,41.880866,-87.702833
17,East Side,3,3,23737,3075.47,7912.333333,Food,Bar,Italian Restaurant,Deli / Bodega,Dim Sum Restaurant,41.713569,-87.532781
19,Englewood,3,3,25075,3153.59,8358.333333,Food,Restaurant,Ethiopian Restaurant,Comfort Food Restaurant,Cuban Restaurant,41.779756,-87.645884
25,Greater Grand Crossing,3,4,31766,3454.91,7941.5,Food,Southern / Soul Food Restaurant,Yemeni Restaurant,Ethiopian Restaurant,Cuban Restaurant,41.766886,-87.620845
36,McKinley Park,3,7,15767,4317.5,2252.428571,Food,Diner,American Restaurant,Chinese Restaurant,Fast Food Restaurant,41.8317,-87.673664


Cluster number 3 features varying values of Population, Density and People/Restaurant ratio across communities, hence we cannot see a pattern in those numerical features. Cuisines in this area seem concentrated around General and American Food, Fast Food and Delis, as well as Cuban and Asian Food. Those cuisines are generally very different from French cuisine, so this is probably not a strong location for a French restaurant.

In [169]:
cluster_number=4
# count the rows in this cluster
count = (final_data['Cluster Labels'] == cluster_number).sum()
print("There are {} communities in Cluster number {}.".format(count, cluster_number))
final_data[final_data['Cluster Labels']==cluster_number]

There are 15 communities in Cluster number 4.


Unnamed: 0,Community,Cluster Labels,Restaurant Count,Population,Density (/km2),People/Restaurant,1st Category,2nd Category,3rd Category,4th Category,5th Category,Latitude,Longitude
0,Armour Square,4,6,13455,5195.0,2242.5,Chinese Restaurant,Food,Fast Food Restaurant,American Restaurant,Deli / Bodega,41.840231,-87.632986
10,Burnside,4,1,2254,1426.68,2254.0,Comfort Food Restaurant,Fast Food Restaurant,Cuban Restaurant,Deli / Bodega,Dim Sum Restaurant,41.730035,-87.596714
18,Edison Park,4,1,11605,1635.3,11605.0,Steakhouse,Yemeni Restaurant,Ethiopian Restaurant,Cuban Restaurant,Deli / Bodega,42.006113,-87.813992
20,Forest Glen,4,4,19019,2294.78,4754.75,Restaurant,Indian Restaurant,Chinese Restaurant,Yemeni Restaurant,Ethiopian Restaurant,41.991752,-87.751674
21,Fuller Park,4,4,2439,1326.34,609.75,Fast Food Restaurant,Italian Restaurant,Diner,Southern / Soul Food Restaurant,Ethiopian Restaurant,41.818089,-87.632551
24,Grand Boulevard,4,8,22313,4951.2,2789.125,Southern / Soul Food Restaurant,Chinese Restaurant,BBQ Joint,Restaurant,Caribbean Restaurant,41.813923,-87.617272
31,Jefferson Park,4,7,26808,4442.33,3829.714286,Chinese Restaurant,Restaurant,Middle Eastern Restaurant,Food,Eastern European Restaurant,41.969738,-87.763118
32,Kenwood,4,3,17189,6381.45,5729.666667,Chinese Restaurant,Caribbean Restaurant,BBQ Joint,Yemeni Restaurant,Fast Food Restaurant,41.809144,-87.597991
37,Montclare,4,5,13830,5393.73,2766.0,Restaurant,Mexican Restaurant,Pizza Place,Eastern European Restaurant,Ethiopian Restaurant,41.925309,-87.800893
46,O'Hare,4,6,12377,358.23,2062.833333,Bar,Sports Bar,Greek Restaurant,Italian Restaurant,Pub,41.973101,-87.906768


The last cluster, number 4, comprises communities where there seems to be a concentration of American, Southern and Fast Food, as well as Chinese, Middle Eastern and African Food. The Population in most of those communities is below the average of the entire dataset. Similarly, the number of people per restaurant is generally below the dataset average. This may be an indication that the area is already well served by restaurants, so opening a new dining venue there could be more challenging.

### 3. Conclusion

#### Based on the analysis above, Cluster number 2 seems to be the most promising set of communities for a new French restaurant: it gathers areas where a broad variety of cuisines make the top 5 categories, and it is the only cluster where European restaurants have a more prominent presence. Moreover, communities in this cluster have large Population and Population Density, which increases the chances of higher daily occupancy rates, a crucial metric for financially sustaining restaurants anywhere.