# Applied Data Science Capstone
This notebook will be used for my capstone project on Coursera.

In [207]:
import pandas as pd 
import numpy as np 
import folium
import requests
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

print("Hello Capstone Project Course!")

Hello Capstone Project Course!


## Introduction
Let's imagine that in the past year, my family and I opened a restaurant in Columbus, OH, specifically in the Northwest Columbus community. We've achieved a fair amount of success and want to expand. Since we know the Northwest community enjoys our food, we should aim for a neighborhood that is similar to the Northwest. In this notebook, we will analyze and cluster the neighborhoods of Columbus to identify potential neighborhoods for our next place of business.

## Data
The list of Columbus neighborhoods was obtained from [this website](http://opendata.columbus.gov/datasets/c4b483507f374e62bd705450e116e017_25/data). The data also includes the area of each neighborhood in square feet, which I used to approximate a radius for each neighborhood by treating it as a circle. This is not wholly accurate, but it is a good enough approximation. To find the coordinates of the center of each neighborhood, I used [this map](https://www.arcgis.com/home/webmap/viewer.html?layers=c4b483507f374e62bd705450e116e017): I approximated the center of each neighborhood, then copied and pasted the coordinates into a spreadsheet, which was exported as the [Columbus_Communities.csv](https://github.com/alexanderWhile/Coursera_Capstone/blob/master/Columbus_Communities.csv) found in this repository.
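
The circular approximation can be sketched as follows. The notebook does not state how the radius was derived, so two things here are assumptions: that the radius is expressed in meters (the unit folium's `Circle` uses) and that it was rounded to the nearest 100. The helper name `radius_from_area` is hypothetical.

```python
import math

FEET_TO_METERS = 0.3048  # one international foot in meters


def radius_from_area(area_sq_ft, round_to=100):
    """Radius in meters of a circle whose area is `area_sq_ft` square feet.

    Assumption: the result is rounded to the nearest `round_to` meters.
    """
    radius_ft = math.sqrt(area_sq_ft / math.pi)  # A = pi * r^2  =>  r = sqrt(A / pi)
    radius_m = radius_ft * FEET_TO_METERS
    return int(round(radius_m / round_to) * round_to)


# SHAPEAREA of the Airport community, taken from the CSV
print(radius_from_area(113071700.0))  # 1800
```

Under these assumptions the result agrees with the `Radius` value stored for Airport in the CSV, which suggests (but does not prove) that the column was built this way.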

Let's import our data into a dataframe and preview the information found in it. 

In [86]:
COLUMBUS_COMMUNITIES = pd.read_csv("Columbus_Communities.csv")
COLUMBUS_COMMUNITIES

Unnamed: 0,Community,Latitude,Longitude,SHAPEAREA,Radius
0,Airport,39.996795,-82.889889,113071700.0,1800
1,Brewery District,39.947067,-83.003872,17264720.0,700
2,Clintonville,40.047406,-83.013828,171040600.0,2200
3,Downtown,39.963515,-82.999752,68036960.0,1400
4,Dublin Road Corridor,39.97233,-83.036144,15067140.0,700
5,East Columbus,39.990087,-82.924736,40191390.0,1100
6,Far East,39.957463,-82.840279,417956000.0,3500
7,Far North,40.132768,-82.992714,227008600.0,2600
8,Far Northwest,40.115179,-83.063095,195710000.0,2400
9,Far South,39.862131,-83.002327,705915900.0,4600


We will start by making a map of the centers of all the communities in Columbus. 

In [87]:
COLUMBUS_LATITUDE = 39.9612
COLUMBUS_LONGITUDE = -82.9988

COLUMBUS_MAP = folium.Map(
    location = [COLUMBUS_LATITUDE, COLUMBUS_LONGITUDE],
    zoom_start = 10,
)

for lat, lng, label, radius in zip(COLUMBUS_COMMUNITIES.Latitude, COLUMBUS_COMMUNITIES.Longitude, COLUMBUS_COMMUNITIES.Community, COLUMBUS_COMMUNITIES.Radius):
    folium.vector_layers.Circle(
        [lat,lng],
        radius=radius,
        color='blue',
        popup=label,
        fill=True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(COLUMBUS_MAP)

COLUMBUS_MAP

*Note: GitHub will not render any folium maps. To see them, follow the link [here](https://nbviewer.jupyter.org/github/alexanderWhile/Coursera_Capstone/blob/master/notebook.ipynb)*

Now let's import our Foursquare credentials to begin utilizing the API and looking up venues. 

In [88]:
CLIENT_ID = 'IA4SDU5HX0UHCL4VSZJDAHBXWJHJY4HPTFNBLWHG4YHYSLWH'
CLIENT_SECRET = '21PID34DCUTLYIWC2RRRRWMBKIE1ZUUXQKE2ZEAASQ4VIWX5'
VERSION = '20200416'

LIMIT = 100

print("Client ID:",CLIENT_ID)
print("Client Secret:", CLIENT_SECRET)
print("Version:", VERSION)
print("Limit:", LIMIT)

Client ID: IA4SDU5HX0UHCL4VSZJDAHBXWJHJY4HPTFNBLWHG4YHYSLWH
Client Secret: 21PID34DCUTLYIWC2RRRRWMBKIE1ZUUXQKE2ZEAASQ4VIWX5
Version: 20200416
Limit: 100
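
Hard-coding and printing API credentials in a shared notebook exposes them to anyone who can read the repository. A minimal sketch of one alternative is shown here: reading them from environment variables and masking them when printed. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are hypothetical, not part of the original notebook.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of env variables.

    The variable names used here are hypothetical placeholders.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


def mask(secret, visible=4):
    """Keep the first `visible` characters and replace the rest with '*'."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)


# Demonstration with a made-up mapping instead of the real environment
demo_env = {
    "FOURSQUARE_CLIENT_ID": "EXAMPLEID123",
    "FOURSQUARE_CLIENT_SECRET": "EXAMPLESECRET456",
}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("Client ID:", mask(client_id))
print("Client Secret:", mask(client_secret))
```

In the notebook itself, calling `load_foursquare_credentials()` with no argument would read from the real environment.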


Now we will preview our API calls by making a map of the venues in the community familiar to us, Northwest Columbus. First we will make a folium map centered on the community.

In [89]:
NORTHWEST = COLUMBUS_COMMUNITIES[COLUMBUS_COMMUNITIES.Community == 'Northwest'].reset_index()



NORTHWEST_MAP = folium.Map(
    location = [NORTHWEST.loc[0,'Latitude'], NORTHWEST.loc[0,'Longitude']],
    zoom_start=13
)

folium.vector_layers.Circle(
    [NORTHWEST.loc[0,'Latitude'], NORTHWEST.loc[0,'Longitude']],
    radius = int(NORTHWEST.loc[0,'Radius']),
    color = 'red',
    popup = NORTHWEST.loc[0,'Community'],
).add_to(NORTHWEST_MAP)

NORTHWEST_MAP

Next we will make our API call.

In [90]:
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    NORTHWEST.loc[0,'Latitude'],
    NORTHWEST.loc[0,'Longitude'],
    NORTHWEST.loc[0,'Radius'],
    LIMIT)

results = requests.get(url).json()
print("Success")

Success
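
The same request URL is built by hand in several cells of this notebook. As a sketch, the query string could instead be assembled from a dictionary with the standard library, which handles separators and escaping. The helper name `build_explore_url` and the placeholder credentials are hypothetical; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore URL with the same query parameters as the cell above."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",  # latitude and longitude share one parameter
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-encodes values, so the comma in `ll` becomes %2C
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


# Placeholder credentials and made-up coordinates, for illustration only
print(build_explore_url("YOUR_ID", "YOUR_SECRET", "20200416", 40.0, -83.0, 1000, 100))
```

The resulting string can be passed to `requests.get` exactly as before.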


Here we define a function to extract the category of each venue from the JSON response.

In [91]:
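def get_category_type(row):
    # The categories may be stored under either key, depending on how the row was built.
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # Return the name of the first listed category, or None if there is none.
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']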

print("Function Defined!")

Function Defined!
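
To make the behavior concrete, here is a standalone copy of the function applied to a few hand-made rows. The rows are invented for illustration and only mimic the shape of the Foursquare response; plain dictionaries stand in for the data frame rows.

```python
def get_category_type(row):
    """Return the name of the first category of a venue row, or None."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']


# Made-up rows: two key layouts and one venue with no category at all
rows = [
    {'venue.categories': [{'name': 'Taco Place'}, {'name': 'Mexican Restaurant'}]},
    {'categories': [{'name': 'Bar'}]},
    {'venue.categories': []},
]
print([get_category_type(r) for r in rows])  # ['Taco Place', 'Bar', None]
```

Only the first listed category is kept, so a venue with several categories is counted under one of them.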


We convert the results into a well-formatted data frame.

In [92]:
NORTHWEST_VENUES = results['response']['groups'][0]['items']
NORTHWEST_VENUES = json_normalize(NORTHWEST_VENUES)

FILTERED_COLUMNS = ['venue.name','venue.categories','venue.location.lat','venue.location.lng']
NORTHWEST_VENUES = NORTHWEST_VENUES.loc[:,FILTERED_COLUMNS]

NORTHWEST_VENUES['venue.categories'] = NORTHWEST_VENUES.apply(get_category_type,axis = 1)

NORTHWEST_VENUES.columns = [col.split(".")[-1] for col in NORTHWEST_VENUES.columns]

NORTHWEST_VENUES

Unnamed: 0,name,categories,lat,lng
0,Los Guachos Taqueria,Taco Place,40.064524,-83.057044
1,Graeter's Ice Cream,Ice Cream Shop,40.064990,-83.075559
2,The Grumpy Troll Tavern,Bar,40.064340,-83.060700
3,Somewhere In Particular Brewing,Brewery,40.061978,-83.075634
4,City Egg,Breakfast Spot,40.064127,-83.058756
...,...,...,...,...
95,CVS pharmacy,Pharmacy,40.064550,-83.093849
96,Hallmark,Gift Shop,40.053139,-83.067978
97,Starbucks,Coffee Shop,40.064276,-83.095291
98,Giant Eagle Supermarket,Supermarket,40.064116,-83.095382


Check the number of venues.

In [93]:
print("There are", NORTHWEST_VENUES.shape[0], "venues nearby.")

There are 100 venues nearby.


And finally add the venues to our folium map.

In [94]:
for lat, lng, label,cat in zip(NORTHWEST_VENUES.lat, NORTHWEST_VENUES.lng, NORTHWEST_VENUES.name, NORTHWEST_VENUES.categories):
    folium.vector_layers.CircleMarker(
        [lat, lng],
        radius = 5,
        color = 'blue',
        popup = label+",\n"+cat,
        fill = True,
        fill_color='blue',
        fill_opacity = 0.9
    ).add_to(NORTHWEST_MAP)

NORTHWEST_MAP

As a sanity check, we will repeat the process to map all the venues in Downtown Columbus.

In [105]:
DOWNTOWN = COLUMBUS_COMMUNITIES[COLUMBUS_COMMUNITIES.Community == 'Downtown'].reset_index()

DOWNTOWN_MAP = folium.Map(
    location = [DOWNTOWN.loc[0,'Latitude'], DOWNTOWN.loc[0,'Longitude']],
    zoom_start=14
)

folium.vector_layers.Circle(
    [DOWNTOWN.loc[0,'Latitude'], DOWNTOWN.loc[0,'Longitude']],
    radius=int(DOWNTOWN.loc[0,'Radius']),
    color = 'red',
    popup = DOWNTOWN.loc[0,'Community'],
).add_to(DOWNTOWN_MAP)

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    DOWNTOWN.loc[0,'Latitude'],
    DOWNTOWN.loc[0,'Longitude'],
    DOWNTOWN.loc[0,'Radius'],
    LIMIT)

results = requests.get(url).json()

DOWNTOWN_VENUES = results['response']['groups'][0]['items']
DOWNTOWN_VENUES = json_normalize(DOWNTOWN_VENUES)

DOWNTOWN_VENUES = DOWNTOWN_VENUES.loc[:,FILTERED_COLUMNS]

DOWNTOWN_VENUES['venue.categories'] = DOWNTOWN_VENUES.apply(get_category_type,axis = 1)

DOWNTOWN_VENUES.columns = [col.split(".")[-1] for col in DOWNTOWN_VENUES.columns]

for lat, lng, label,cat in zip(DOWNTOWN_VENUES.lat, DOWNTOWN_VENUES.lng, DOWNTOWN_VENUES.name, DOWNTOWN_VENUES.categories):
    folium.vector_layers.CircleMarker(
        [lat, lng],
        radius = 5,
        color = 'blue',
        popup = label+",\n"+cat,
        fill = True,
        fill_color = 'blue',
        fill_opacity = 0.9
    ).add_to(DOWNTOWN_MAP)

DOWNTOWN_MAP

Now, let's define a function to get the nearby venues for any community.

In [106]:
def get_nearby_venues(names, latitudes, longitudes, radii):
    venues_list = []
    for name, lat, lng, rad in zip(names, latitudes, longitudes, radii):
        print(name)

        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            rad,
            LIMIT)

        results = requests.get(url).json()['response']['groups'][0]['items']

        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])

    nearby_venues.columns = [
        'Community',
        'Community Latitude',
        'Community Longitude',
        'Venue',
        'Venue Latitude',
        'Venue Longitude',
        'Venue Category']
    
    return nearby_venues

print("Success!")

Success!


Now we'll run the function. 

In [143]:
COLUMBUS_VENUES = get_nearby_venues( 
    names = COLUMBUS_COMMUNITIES.Community,
    latitudes = COLUMBUS_COMMUNITIES.Latitude,
    longitudes = COLUMBUS_COMMUNITIES.Longitude,
    radii = COLUMBUS_COMMUNITIES.Radius
)

Airport
Brewery District
Clintonville
Downtown
Dublin Road Corridor
East Columbus
Far East
Far North
Far Northwest
Far South
Far West
Fifth by Northwest
Fort Hayes
Franklinton
German Village
Greater Hilltop
Harmon Road Corridor
Harrison West
Hayden Run
Italian Village
Livingston Avenue Area
Mid East
Milo-Grogan
Near East
North Central
North Linden
Northeast
Northland
Northwest
Olentangy West
Rocky Fork-Blacklick
South East
South Linden
South Side
Southwest
State of Ohio
University District
Victorian Village
West Scioto
Westland
Wolfe Park


Let's check the shape and preview the results.

In [151]:
print(COLUMBUS_VENUES.shape)
COLUMBUS_VENUES

(2301, 7)


Unnamed: 0,Community,Community Latitude,Community Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Airport,39.996795,-82.889889,Fairfield Inn & Suites Columbus Airport,39.999001,-82.890372,Hotel
1,Airport,39.996795,-82.889889,CMH Passenger Drop-off / Pick-up,39.998053,-82.884504,Airport Service
2,Airport,39.996795,-82.889889,Southwest Airlines Ticket Counter,39.997923,-82.884274,Airport Service
3,Airport,39.996795,-82.889889,Starbucks,39.997821,-82.881962,Coffee Shop
4,Airport,39.996795,-82.889889,Enterprise Rent-A-Car,39.998289,-82.887247,Rental Car Location
...,...,...,...,...,...,...,...
2296,Westland,39.922398,-83.145575,Wendy’s,39.951000,-83.147086,Fast Food Restaurant
2297,Westland,39.922398,-83.145575,Barking Sheep Towing And Recovery,39.920861,-83.181729,Business Service
2298,Westland,39.922398,-83.145575,Prairie Township Road Department,39.905203,-83.174205,Construction & Landscaping
2299,Westland,39.922398,-83.145575,Holt Park,39.902807,-83.117600,Park


Let's see how many venues each community returned. 

In [145]:
COLUMBUS_VENUES_GROUPED = COLUMBUS_VENUES.groupby('Community').count()
COLUMBUS_VENUES_GROUPED.reset_index(inplace=True)
COLUMBUS_VENUES_GROUPED[['Community','Venue']]

Unnamed: 0,Community,Venue
0,Airport,50
1,Brewery District,38
2,Clintonville,100
3,Downtown,96
4,Dublin Road Corridor,9
5,East Columbus,14
6,Far East,100
7,Far North,100
8,Far Northwest,38
9,Far South,100
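
Several communities return exactly 100 venues, which is the `LIMIT` passed to the API, so their results may have been truncated. A small sketch for flagging them, using the counts from the preview above (the dictionary is a hand-copied subset, not the full data frame):

```python
LIMIT = 100  # same value as the LIMIT used for the API calls

# Venue counts copied from the first ten rows of the table above
venue_counts = {
    "Airport": 50,
    "Brewery District": 38,
    "Clintonville": 100,
    "Downtown": 96,
    "Dublin Road Corridor": 9,
    "East Columbus": 14,
    "Far East": 100,
    "Far North": 100,
    "Far Northwest": 38,
    "Far South": 100,
}

# A count equal to LIMIT suggests the API may have had more venues to return
capped = [name for name, count in venue_counts.items() if count >= LIMIT]
print(capped)  # ['Clintonville', 'Far East', 'Far North', 'Far South']
```

For these communities, the category frequencies computed later describe a sample of at most 100 venues rather than every venue in the area.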


We notice that a few communities returned very few venues. These communities are likely either very small or not generally favorable areas for venues. Let's filter out any communities that have 15 or fewer venues.

In [169]:
EXCLUDED = COLUMBUS_VENUES_GROUPED.loc[COLUMBUS_VENUES_GROUPED['Venue'] <= 15]
EXCLUDED.Community

4       Dublin Road Corridor
5              East Columbus
12                Fort Hayes
16      Harmon Road Corridor
20    Livingston Avenue Area
22               Milo-Grogan
24             North Central
32              South Linden
35             State of Ohio
40                Wolfe Park
Name: Community, dtype: object

In [174]:
COLUMBUS_VENUES_FILTERED = COLUMBUS_VENUES[~COLUMBUS_VENUES.Community.isin(EXCLUDED.Community)]
COLUMBUS_VENUES_FILTERED

Unnamed: 0,Community,Community Latitude,Community Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Airport,39.996795,-82.889889,Fairfield Inn & Suites Columbus Airport,39.999001,-82.890372,Hotel
1,Airport,39.996795,-82.889889,CMH Passenger Drop-off / Pick-up,39.998053,-82.884504,Airport Service
2,Airport,39.996795,-82.889889,Southwest Airlines Ticket Counter,39.997923,-82.884274,Airport Service
3,Airport,39.996795,-82.889889,Starbucks,39.997821,-82.881962,Coffee Shop
4,Airport,39.996795,-82.889889,Enterprise Rent-A-Car,39.998289,-82.887247,Rental Car Location
...,...,...,...,...,...,...,...
2295,Westland,39.922398,-83.145575,Arby's,39.951749,-83.126998,Fast Food Restaurant
2296,Westland,39.922398,-83.145575,Wendy’s,39.951000,-83.147086,Fast Food Restaurant
2297,Westland,39.922398,-83.145575,Barking Sheep Towing And Recovery,39.920861,-83.181729,Business Service
2298,Westland,39.922398,-83.145575,Prairie Township Road Department,39.905203,-83.174205,Construction & Landscaping


Now let's see how many unique categories of venues there are throughout Columbus.

In [175]:
print('There are {} unique categories'.format(len(COLUMBUS_VENUES_FILTERED['Venue Category'].unique())))

There are 269 unique categories


Now we will use one-hot encoding to turn each venue's category into indicator columns, so that we can count how many venues of each type there are in each community.

In [179]:
COLUMBUS_ONEHOT = pd.get_dummies(COLUMBUS_VENUES_FILTERED[['Venue Category']],prefix="",prefix_sep="")

COLUMBUS_ONEHOT['Community'] = COLUMBUS_VENUES_FILTERED['Community']

cols = list(COLUMBUS_ONEHOT)
cols.insert(0,cols.pop(cols.index('Community')))

COLUMBUS_ONEHOT = COLUMBUS_ONEHOT.loc[:,cols]
COLUMBUS_ONEHOT

Unnamed: 0,Community,ATM,Accessories Store,Acupuncturist,African Restaurant,Airport,Airport Service,Airport Terminal,American Restaurant,Animal Shelter,...,Vietnamese Restaurant,Warehouse Store,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio
0,Airport,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Airport,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Airport,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Airport,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Airport,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2295,Westland,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2296,Westland,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2297,Westland,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2298,Westland,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now we will calculate the relative frequency of each category for each community. 

In [177]:
COLUMBUS_GROUPED = COLUMBUS_ONEHOT.groupby('Community').mean().reset_index()

COLUMBUS_GROUPED

Unnamed: 0,Community,ATM,Accessories Store,Acupuncturist,African Restaurant,Airport,Airport Service,Airport Terminal,American Restaurant,Animal Shelter,...,Vietnamese Restaurant,Warehouse Store,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio
0,Airport,0.0,0.0,0.0,0.0,0.02,0.14,0.02,0.06,0.0,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0
1,Brewery District,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Clintonville,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.0,...,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01
3,Downtown,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052083,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Far East,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,...,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.03,0.0,0.0
5,Far North,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.08,0.0,...,0.0,0.01,0.0,0.0,0.01,0.01,0.0,0.01,0.0,0.0
6,Far Northwest,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Far South,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,...,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.03,0.0,0.0
8,Far West,0.022472,0.0,0.0,0.0,0.0,0.0,0.0,0.05618,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033708,0.0,0.0
9,Fifth by Northwest,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,...,0.01,0.0,0.0,0.0,0.01,0.02,0.0,0.01,0.0,0.01
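
Taking the mean of one-hot columns per community is the same as dividing the number of venues in a category by the total number of venues in that community. The sketch below reproduces that calculation in plain Python on made-up data (communities "A" and "B" are invented), without pandas:

```python
from collections import Counter, defaultdict

# Made-up (community, venue category) pairs
venues = [
    ("A", "Bar"), ("A", "Bar"), ("A", "Bakery"), ("A", "Hotel"),
    ("B", "Hotel"), ("B", "Park"),
]

# Count venues per category within each community
counts = defaultdict(Counter)
for community, category in venues:
    counts[community][category] += 1

# Relative frequency = category count / total venues in the community
frequencies = {
    community: {cat: n / sum(cats.values()) for cat, n in cats.items()}
    for community, cats in counts.items()
}
print(frequencies["A"])  # {'Bar': 0.5, 'Bakery': 0.25, 'Hotel': 0.25}
```

Unlike the data frame above, this sketch omits categories that a community does not contain; in the one-hot table those appear as 0.0.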


Let's see what the top five categories are for each of our communities.

In [178]:
num_top_venues = 5

for comm in COLUMBUS_GROUPED.Community:
    print("---"+comm+"---")
    temp = COLUMBUS_GROUPED[COLUMBUS_GROUPED.Community == comm].T.reset_index()
    temp.columns = ['venue','percent']
    temp = temp.iloc[1:]
    temp['percent'] = temp['percent'].astype(float)*100
    temp = temp.round({'percent':1})
    print(temp.sort_values('percent',ascending = False).reset_index(drop=True).head(num_top_venues))
    print('\n')

---Airport---
                  venue  percent
0       Airport Service     14.0
1   Rental Car Location     12.0
2   American Restaurant      6.0
3  Fast Food Restaurant      6.0
4                 Hotel      6.0


---Brewery District---
                venue  percent
0                 Bar     10.5
1          Sports Bar      5.3
2              Bakery      5.3
3  Athletics & Sports      5.3
4             Brewery      5.3


---Clintonville---
            venue  percent
0     Coffee Shop      5.0
1   Deli / Bodega      4.0
2          Bakery      4.0
3  Sandwich Place      4.0
4            Bank      4.0


---Downtown---
                 venue  percent
0  American Restaurant      5.2
1              Theater      4.2
2              Brewery      4.2
3                 Café      4.2
4                Hotel      4.2


---Far East---
                  venue  percent
0           Pizza Place     10.0
1        Sandwich Place      6.0
2   American Restaurant      6.0
3  Fast Food Restaurant      5.0
4  

If we look at our current community, Northwest Columbus, we can see that there is a wide variety of venues, as no category has a frequency greater than 6%.

Now let's define a function to return the most common categories for each community and create a new data frame containing the top 10 categories found in each community.

In [180]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

print("Function Defined!")

Function Defined!


In [181]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']
columns = ['Community']
for ind in range(0,num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1,indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

COLUMBUS_VENUES_SORTED = pd.DataFrame(columns=columns)
COLUMBUS_VENUES_SORTED.Community = COLUMBUS_GROUPED.Community

for ind in range(0,COLUMBUS_GROUPED.shape[0]):
    COLUMBUS_VENUES_SORTED.iloc[ind,1:] = return_most_common_venues(COLUMBUS_GROUPED.loc[ind], num_top_venues)

COLUMBUS_VENUES_SORTED

Unnamed: 0,Community,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Airport,Airport Service,Rental Car Location,Coffee Shop,Fast Food Restaurant,Furniture / Home Store,American Restaurant,Hotel,Food,Brewery,Clothing Store
1,Brewery District,Bar,Bakery,Sports Bar,Bank,Athletics & Sports,Brewery,Thai Restaurant,Italian Restaurant,Taco Place,Beer Bar
2,Clintonville,Coffee Shop,Bank,Pizza Place,Sandwich Place,Deli / Bodega,Bakery,Diner,Grocery Store,Fast Food Restaurant,Salon / Barbershop
3,Downtown,American Restaurant,Theater,Hotel,Brewery,Park,Café,Bar,Coffee Shop,Sandwich Place,Art Gallery
4,Far East,Pizza Place,Sandwich Place,American Restaurant,Fast Food Restaurant,Gym / Fitness Center,Mexican Restaurant,Coffee Shop,Discount Store,Bar,Gas Station
5,Far North,American Restaurant,Pizza Place,Bar,Salon / Barbershop,Italian Restaurant,Fast Food Restaurant,Pharmacy,Sandwich Place,Steakhouse,Cosmetics Shop
6,Far Northwest,Sandwich Place,Fast Food Restaurant,Pizza Place,Pharmacy,Ice Cream Shop,Park,Golf Course,Sports Bar,Shipping Store,Food
7,Far South,Fast Food Restaurant,American Restaurant,Sandwich Place,Pizza Place,Hotel,Wings Joint,Bank,Discount Store,Breakfast Spot,Chinese Restaurant
8,Far West,Hotel,Fast Food Restaurant,American Restaurant,Department Store,Coffee Shop,Wings Joint,Italian Restaurant,Mexican Restaurant,Supplement Shop,Supermarket
9,Fifth by Northwest,Bar,Pizza Place,Coffee Shop,Bank,Italian Restaurant,Mexican Restaurant,Sandwich Place,American Restaurant,Comfort Food Restaurant,Asian Restaurant


## Clustering
Let's run the k-means clustering algorithm to group our communities. We have 31 communities left after filtering, so let's run the algorithm with a few different numbers of clusters:
- 3 clusters
- 5 clusters 
- 10 clusters

We will create each set of clusters, then map them out to look at their distribution throughout the neighborhoods. 

In [248]:
kclusters = [3, 5, 10]

COLUMBUS_GROUPED_CLUSTERING = COLUMBUS_GROUPED.drop(columns='Community')

CLUSTERS = [COLUMBUS_VENUES_SORTED.copy(), COLUMBUS_VENUES_SORTED.copy(), COLUMBUS_VENUES_SORTED.copy()]

MERGED = []
CLUSTER_MAPS = []


for n in range(0,3):
    kmeans = KMeans(n_clusters=kclusters[n], random_state=0).fit(COLUMBUS_GROUPED_CLUSTERING)
    
    CLUSTERS[n].insert(0,'Cluster', kmeans.labels_+1)

    MERGED.append(COLUMBUS_COMMUNITIES[~COLUMBUS_COMMUNITIES.Community.isin(EXCLUDED.Community)])

    CLUSTERS[n].set_index('Community', inplace=True)

    MERGED[n] = MERGED[n].join(CLUSTERS[n], on='Community')

    CLUSTER_MAPS.append(folium.Map(
        [COLUMBUS_LATITUDE,COLUMBUS_LONGITUDE],
        zoom_start = 10
    ))

    # One evenly spaced rainbow color per cluster.
    colors_array = cm.rainbow(np.linspace(0, 1, kclusters[n]))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

    for lat, lon, poi, cluster, rad in zip(MERGED[n].Latitude, MERGED[n].Longitude, MERGED[n].Community, MERGED[n].Cluster, MERGED[n].Radius):
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        folium.Circle(
            [lat,lon],
            radius = rad,
            popup = label,
            color = rainbow[cluster-1],
            fill=True,
            fill_color=rainbow[cluster-1],
            fill_opacity=0.7
        ).add_to(CLUSTER_MAPS[n])

print("Clusters and Maps created!")


Clusters and Maps created!
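
For intuition about what k-means does with the frequency vectors, here is a minimal pure-Python sketch of the basic iteration (Lloyd's algorithm) on made-up data. It is not scikit-learn's implementation: `KMeans` adds smarter initialization and optimizations on top of this idea. The function, the toy points, and the starting centroids are all illustrative.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Basic Lloyd's algorithm: assign points, recompute centroids, repeat."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we are done
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Made-up frequency vectors: (share of bars, share of restaurants, share of hotels)
points = [(0.6, 0.3, 0.1), (0.5, 0.4, 0.1), (0.1, 0.2, 0.7), (0.0, 0.3, 0.7)]
labels, centers = kmeans(points, centroids=[points[0], points[2]])
print(labels)  # [0, 0, 1, 1]
```

The two bar-heavy vectors end up in one cluster and the two hotel-heavy vectors in the other. The notebook adds 1 to the labels so that cluster numbering starts at 1.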


### 3 Clusters

In [252]:
CLUSTER_MAPS[0]



### 5 Clusters

In [253]:

CLUSTER_MAPS[1]

### 10 Clusters

In [251]:
CLUSTER_MAPS[2]

After looking at the maps, I believe that the 10-cluster grouping is the best choice: it sorts the neighborhoods into small enough groups while isolating anomalies such as Airport. Let's save those clusters and that map as their own variables.

In [256]:
CLUSTERS_FINAL = CLUSTERS[2]
MAP_FINAL = CLUSTER_MAPS[2]
print(CLUSTERS_FINAL.head())
MAP_FINAL

Cluster 1st Most Common Venue 2nd Most Common Venue  \
Community                                                               
Airport                 6       Airport Service   Rental Car Location   
Brewery District        7                   Bar                Bakery   
Clintonville            2           Coffee Shop                  Bank   
Downtown                2   American Restaurant               Theater   
Far East                1           Pizza Place        Sandwich Place   

                 3rd Most Common Venue 4th Most Common Venue  \
Community                                                      
Airport                    Coffee Shop  Fast Food Restaurant   
Brewery District            Sports Bar                  Bank   
Clintonville               Pizza Place        Sandwich Place   
Downtown                         Hotel               Brewery   
Far East           American Restaurant  Fast Food Restaurant   

                   5th Most Common Venue 6th Most Common 

We see that our current community, Northwest, is in cluster 2. So, let's see what other communities are in the same cluster. 

In [264]:
TARGET_COMMUNITIES = CLUSTERS_FINAL[CLUSTERS_FINAL.Cluster == 2]
TARGET_COMMUNITIES.reset_index(inplace=True)
print(TARGET_COMMUNITIES.Community)

0            Clintonville
1                Downtown
2               Far North
3      Fifth by Northwest
4               Northeast
5               Northland
6               Northwest
7          Olentangy West
8    Rocky Fork-Blacklick
9     University District
Name: Community, dtype: object


## Conclusion
After running our clustering algorithms, we can see that the most similar communities in Columbus to our current Northwest community are as follows:
- Clintonville
- Downtown
- Far North
- Fifth by Northwest 
- Northeast 
- Northland 
- Olentangy West 
- Rocky Fork-Blacklick 
- University District

These should therefore be our priority when looking for a location for our next restaurant. The next logical steps in our analysis would be looking at available properties, rent prices, and crime rates in these areas. 