# Introduction/Business Problem

It's a sad fact of life: sometimes a Tallahassee resident has to move. It's always an upheaval, and you can never be sure whether you'll like the new place better or worse than the old one.

Using data analysis and machine learning tools, we can classify neighborhoods in Tallahassee. This way, if you really like where you live, you can focus your search on neighborhoods most similar to your own. On the other hand, if you're not so crazy about your spot, you can look in neighborhoods determined to be fairly different. 

It won't find you the perfect place to live, but by narrowing your search to areas to look at and areas to avoid, this analysis will help you find what you're looking for faster and more easily, through *data science*.

# Part Zero: Data Finding

There wasn't an easily available source of neighborhood latitude and longitude for Tallahassee. I started by scraping zip code data, but several of the zip codes in Tallahassee aren't really neighborhoods, so the analysis ended up being pretty boring, with fewer than 10 zip codes analyzed.

[Here is the zip code analysis of Tallahassee (if you're interested).](https://github.com/dannybrown37/Coursera_Capstone/blob/main/Tallahassee%20Zip%20Code%20Analysis.ipynb)

I resolved to find neighborhood data. I did find a list of neighborhoods in Tallahassee, but it did not include latitudes and longitudes. Because Google's API charges per call and I am cheap, I automated as much as I could, while ultimately just searching Google through the browser and saving the data I found that way. It was annoyingly manual, but it was a one-time thing that gave me some data to work with. I did learn of a new module that sped things up considerably and will be super useful in the future: `pyperclip`.

[Here is the partially automated but still painfully manual method I used to collect latitudes and longitudes for 166 Tallahassee neighborhoods (again, only if you're interested; this isn't key to the analysis).](https://github.com/dannybrown37/Coursera_Capstone/blob/main/Tallahassee%20Neighborhood%20Collection.ipynb)

# Part One: Data Parsing

First, we import the libraries that we need for this exercise:

In [4]:
# Imports and installs
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans

Next, we take the list of neighborhoods with latitudes and longitudes (collecting this is described and linked in Part Zero above), parse it, and store it in a dictionary.

In [5]:
painfully_gotten_gains = ("""
All Saints 30.4341,-84.2878
Amelia Circle 30.4475,-84.3236
Apalachee Ridge Estates 30.4103,-84.2680
Arvah Branch 30.4940,-84.1659
Astoria Court/Trimble Road 30.4721,-84.3351
Avondale 30.4580,-84.1785
Beacon Hill 30.4037,-84.2661
Benjamin's Run 30.4567,-84.1811
Betton Hills 30.4737,-84.2569
Betton Woods 30.4803,-84.2450
Blairstone Forest 30.4148,-84.2555
Bloxham Terrace 30.4476,-84.3338
Bobbin Brook 30.5157,-84.2661
Bonaventure 30.4320,-84.2302
Bond Westside 30.4222,-84.2950
Breckenridge on Park 30.4460,-84.2312
Brewster Estates 30.4674,-84.2210
Bucklake Woods 30.4652,-84.2013
Buckwood 30.4650,-84.2075
Callen 30.4171,-84.3082
Camellia Gardens 30.4960,-84.3328
Campbell Park 30.4004,-84.2707
Campus Circle 30.4508,-84.3075
Capital Hills 30.4529,-84.2575
Centerville Rural Community 30.5695,-84.1541
Centre Point Village 30.4750,-84.2378
Chaires 30.4363,-84.1174
Chapel Ridge 30.4459,-84.3108
Charter Oaks/Dellview 30.4371,-84.3175
Chase Ridge 30.4320,-84.1992
China Grove 30.4222,-84.2412
College Terrace 30.4164,-84.2924
Copper Creek 30.4317,-84.2091
Countryside of Tallahassee 30.4508,-84.1785
Downtown Tallahassee 30.4389,-84.2821
Duck Lake Point 30.5109,-84.3529
Durward 30.4736,-84.2707
Eastgate 30.4973,-84.2397
Easton 30.4576,-84.1844
Elberta Empire 30.4281,-84.3065
Elysian Forest 30.4062,-84.1429
Evergreen Terrace 30.4439,-84.3286
FAMU 30.4269,-84.2851
Ferndale Place 30.4435,-84.2555
Forest Heights—Holly Hills 30.4647,-84.3131
Foxcroft 30.5341,-84.2292
Franklin/Call Street 30.4415,-84.2766
Frenchtown 30.4508,-84.2924
Gardenbrook Lane 30.4381,-84.2425
Glendale 30.4700,-84.2713
Glenview/Pinegrove 30.4651,-84.2733
Golden Eagle 30.5990,-84.2358
Golden Park 30.4791,-84.3079
Grassroots Community Membership Association 30.4153,-84.1898
Greater Brandt Hills 30.4575,-84.2476
Griffin Heights 30.4536,-84.3075
Grove at Summerbrook, The 30.5673,-84.2496
Hartsfield Plantation 30.4846,-84.3273
Hartsfield Village 30.4705,-84.3312
Hartsfield Woods 30.4765,-84.3391
Hawks Nest 30.5091,-84.2641
Hi-Lo/Ty Ty 30.4516,-84.2450
Hidden Lakes 30.4686,-84.3296
Hidden Valley 30.4771,-84.2206
High Halden 30.4623,-84.1138
Highlands 30.5417,-84.2206
Hillcrest Court 30.4491,-84.2680
Huntington Estates 30.4990,-84.3450
Indianhead/Lehigh Acres 30.4217,-84.2563
Inglewood 30.4465,-84.2559
Jake Gaither 30.4245,-84.2890
Kenilwood 30.5273,-84.2159
Kenmare Commons 30.5143,-84.2256
Killearn Acres 30.5430,-84.2015
Killearn Estates 30.5205,-84.2121
Killearn Lakes Plantation 30.6071,-84.2437
Kinsail 30.5344,-84.2180
Kuhlacre-Teal 30.4625,-84.2470
La Verde 30.4135,-84.2439
Lafayette Meadows 30.4657,-84.1738
Lafayette Oaks 30.4821,-84.1883
Lafayette Park 30.4529,-84.2740
Lake Bradford/Cascade Lake 30.4087,-84.3463
Lake Breeze 30.5051,-84.3200
Lake Carolyn Estates 30.5574,-84.2042
Lakeshore 30.4985,-84.2990
Lakeview 30.4584,-84.2838
Lakewood 30.3956,-84.2766
Leewood Hills 30.4894,-84.2571
Leon Arms 30.4156,-84.2993
Levy Park 30.4593,-84.2898
Liberty Park 30.4117,-84.3085
Linene Woods 30.4938,-84.2898
Live Oak Plantation (Millstream) 30.4981,-84.2581
Lonnie Gray Road 30.3799,-84.3451
Los Robles 30.4623,-84.2759
Mabry Manor 30.4268,-84.3345
Maclay Hammock 30.5102,-84.2549
Magnolia Heights 30.4404,-84.2680
Meadow Hills 30.4618,-84.1982
Meadowbrook 30.4505,-84.2364
Melody Hills 30.4715,-84.2397
Miccosukee Land Co-op 30.5187,-84.1092
Midtown 30.4590,-84.2736
Midtown West 30.4603,-84.2944
Midyette Plantation 30.4845,-84.1930
Millers Landing 30.5412,-84.3240
Mission Hills/Buena Vista 30.4533,-84.3294
Myers Park 30.4273,-84.2792
Normal School 30.4214,-84.2917
Northampton 30.5429,-84.2167
Oak Ridge Place 30.3847,-84.2809
Old St Augustine 30.4234,-84.2382
Old Town 30.4470,-84.2713
Ox Bottom Manor 30.5572,-84.2648
Paremore 30.4748,-84.2384
Park Brook Circle 30.4430,-84.2374
Parkside-Park Terrace 30.4682,-84.2950
Pebble Creek 30.5253,-84.2157
Penny Lane 30.4849,-84.2592
Piedmont/Live Oak 30.4914,-84.2674
Pine Meadows 30.4118,-84.1402
Piney-Z Plantation 30.4400,-84.1986
Plantation Heights 30.4766,-84.2746
Providence 30.4227,-84.3055
Quail Rise 30.5538,-84.2124
Rockbrook Village 30.4320,-84.2112
Rose Hollow 30.4918,-84.2519
Rosehill 30.5430,-84.2674
Royal Oaks 30.5234,-84.2285
San Luis 30.4581,-84.3266
Sawgrass Plantation 30.5064,-84.2338
Saxon Street 30.4145,-84.2952
Seminole Manor 30.4300,-84.3424
Settler's Creek 30.4846,-84.3424
Settlers Springs 30.3955,-84.2765
South Bronough Street 30.4297,-84.2838
South City 30.4167,-84.2674
Southwood 30.4173,-84.2095
Starmount 30.4787,-84.2815
Stoney Creek Crossing 30.4437,-84.1798
Summerbrook 30.5728,-84.2608
Suokoko Villa/Holton Street 30.4152,-84.2972
Sweetwater Oaks 30.4691,-84.2262
Terrace Park 30.4586,-84.2692
Three Lanterns 30.4147,-84.2504
Timber Lake 30.4274,-84.1952
Town and Country 30.4743,-84.3029
Towne East 30.4377,-84.2391
Township One North 30.4652,-84.2506
Tredington Park 30.5388,-84.2206
Tuskegee 30.4114,-84.2930
Twin Lakes 30.4248,-84.2061
Villa Mitchell 30.4309,-84.2950
Villas of Westridge 30.4642,-84.3282
Vineyards, The 30.4757,-84.1754
Waterford 30.3892,-84.1316
Waverly Hills 30.4814,-84.2687
Weems Plantation 30.4557,-84.2189
Wellswood/Suburban Hills 30.4818,-84.2834
Westover 30.4800,-84.3522
Whitfield Plantation 30.5025,-84.2114
Wilkinson Woods 30.3285,-84.2200
Wilson Green 30.3953,-84.2852
Windwood Hills 30.4283,-84.1712
Woodgate 30.4834,-84.2503""")
painfully_gotten_gains = [p.strip() for p in painfully_gotten_gains.split("\n") if p != ""]

In [6]:
coords = {}
for line in painfully_gotten_gains:
    hood, crds = line.rsplit(" ", 1)
    coords[hood] = crds.split(",")
coords

{'All Saints': ['30.4341', '-84.2878'],
 'Amelia Circle': ['30.4475', '-84.3236'],
 'Apalachee Ridge Estates': ['30.4103', '-84.2680'],
 'Arvah Branch': ['30.4940', '-84.1659'],
 'Astoria Court/Trimble Road': ['30.4721', '-84.3351'],
 'Avondale': ['30.4580', '-84.1785'],
 'Beacon Hill': ['30.4037', '-84.2661'],
 "Benjamin's Run": ['30.4567', '-84.1811'],
 'Betton Hills': ['30.4737', '-84.2569'],
 'Betton Woods': ['30.4803', '-84.2450'],
 'Blairstone Forest': ['30.4148', '-84.2555'],
 'Bloxham Terrace': ['30.4476', '-84.3338'],
 'Bobbin Brook': ['30.5157', '-84.2661'],
 'Bonaventure': ['30.4320', '-84.2302'],
 'Bond Westside': ['30.4222', '-84.2950'],
 'Breckenridge on Park': ['30.4460', '-84.2312'],
 'Brewster Estates': ['30.4674', '-84.2210'],
 'Bucklake Woods': ['30.4652', '-84.2013'],
 'Buckwood': ['30.4650', '-84.2075'],
 'Callen': ['30.4171', '-84.3082'],
 'Camellia Gardens': ['30.4960', '-84.3328'],
 'Campbell Park': ['30.4004', '-84.2707'],
 'Campus Circle': ['30.4508', '-84.307
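
The parsing step relies on `rsplit(" ", 1)` splitting on the last space only, which is why names that themselves contain spaces or commas come through intact. Here is a minimal sketch using two lines copied from the list above. Converting the values to floats is an assumption on my part; the notebook itself keeps them as strings.

```python
# Two lines copied from the neighborhood list above.
sample = """
Grove at Summerbrook, The 30.5673,-84.2496
Hi-Lo/Ty Ty 30.4516,-84.2450
"""

parsed = {}
for line in sample.strip().splitlines():
    # Split on the LAST space only, so spaces and commas in the name are kept.
    name, pair = line.strip().rsplit(" ", 1)
    # Assumption: store floats rather than the strings used in the notebook.
    lat, lng = (float(x) for x in pair.split(","))
    parsed[name] = (lat, lng)

print(parsed)
```

Splitting from the left with `split(" ", 1)` would instead cut "Grove at Summerbrook, The" after its first word.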

# Part Two: FourSquare API and Data Analysis

I'm aware that the `@hidden_cell` attribute is only for IBM Watson notebooks and my secret key is horribly on naked display. By the time you see this, however, I will have generated a new one. And besides – I have the same free account you can get, and haven't given FourSquare any of my payment info.

In any case, here are my exposed API credentials.

In [7]:
# @hidden_cell
CLIENT_ID = 'VWZMMMOLA50YBUWRTGPQQCHWL2VSYTWMW3JQ1WUPR42EBDPT' # your Foursquare ID
CLIENT_SECRET = 'YFP4MEWMOX2WHCISRKA0SRE1NY0PF4MAS0U0NPZJ5C0AHTB2' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: VWZMMMOLA50YBUWRTGPQQCHWL2VSYTWMW3JQ1WUPR42EBDPT
CLIENT_SECRET:YFP4MEWMOX2WHCISRKA0SRE1NY0PF4MAS0U0NPZJ5C0AHTB2
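
If you'd rather not paste keys into a notebook at all, one common alternative is to read them from environment variables. This is only a sketch of that idea, not what this notebook does; the function name and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    The variable names here are hypothetical; use whatever suits your setup.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        # Fail with a readable message instead of a bare KeyError.
        raise RuntimeError(
            "Set the {} environment variable first".format(missing.args[0])
        ) from None
```

Passing a plain dictionary as `env` makes the function easy to try out without touching your real environment.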


Here I have adapted a function from the IBM Data Science course materials to find up to 100 nearby venues within 1000 meters of the coordinates passed. For our purposes, that means we're finding the venues for each neighborhood in Tallahassee.

In [8]:
def get_nearby_venues(coords):
    RADIUS = 1000
    LIMIT = 100
    URL = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'
    venues_list=[]
    for hood, coordinates in coords.items():
        print(hood, end=", ")
        lat, lng = coordinates[0], coordinates[1]
            
        # create the API request URL
        url = URL.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            RADIUS, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            hood, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = [
                  'Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category'
    ]
    
    return nearby_venues
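
The function above indexes straight into the nested JSON, so a response with no groups, or a venue with an empty `categories` list, would raise an error. Below is a hedged sketch of a more forgiving way to flatten one response. The helper name, the `"Uncategorized"` fallback label, and the hand-made payload are all my own illustrations; only the key names are taken from the function above.

```python
def rows_from_explore_response(payload, hood, lat, lng):
    """Flatten one explore-style JSON payload into venue rows.

    Uses the same nested keys as get_nearby_venues, but returns an empty
    list instead of raising when a level is missing.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or [{}]
        rows.append((
            hood, lat, lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0].get("name", "Uncategorized"),  # hypothetical fallback label
        ))
    return rows

# Hand-made payload for illustration only; this is not real API output.
fake_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Cafe",
               "location": {"lat": 30.44, "lng": -84.28},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Unlabeled Spot",
               "location": {"lat": 30.45, "lng": -84.29},
               "categories": []}},
]}]}}

rows = rows_from_explore_response(fake_payload, "Downtown Tallahassee", "30.4389", "-84.2821")
print(rows)
```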

Now we run the function on the list of coordinates. This takes a while, but each neighborhood will be printed below the cell as it is completed.

In [9]:
tally_venues = get_nearby_venues(coords)
tally_venues.shape

All Saints, Amelia Circle, Apalachee Ridge Estates, Arvah Branch, Astoria Court/Trimble Road, Avondale, Beacon Hill, Benjamin's Run, Betton Hills, Betton Woods, Blairstone Forest, Bloxham Terrace, Bobbin Brook, Bonaventure, Bond Westside, Breckenridge on Park, Brewster Estates, Bucklake Woods, Buckwood, Callen, Camellia Gardens, Campbell Park, Campus Circle, Capital Hills, Centerville Rural Community, Centre Point Village, Chaires, Chapel Ridge, Charter Oaks/Dellview, Chase Ridge, China Grove, College Terrace, Copper Creek, Countryside of Tallahassee, Downtown Tallahassee, Duck Lake Point, Durward, Eastgate, Easton, Elberta Empire, Elysian Forest, Evergreen Terrace, FAMU, Ferndale Place, Forest Heights—Holly Hills, Foxcroft, Franklin/Call Street, Frenchtown, Gardenbrook Lane, Glendale, Glenview/Pinegrove, Golden Eagle, Golden Park, Grassroots Community Membership Association, Greater Brandt Hills, Griffin Heights, Grove at Summerbrook, The, Hartsfield Plantation, Hartsfield Village, Ha

(3198, 7)

Let's show how many values we have for each neighborhood. This looks kind of weird because it's the same value in every row, but it's just giving us the count of each value, so don't worry, it's as expected. 

In [10]:
tally_venues.groupby("Neighborhood").count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
All Saints,87,87,87,87,87,87
Amelia Circle,47,47,47,47,47,47
Apalachee Ridge Estates,4,4,4,4,4,4
Arvah Branch,5,5,5,5,5,5
Astoria Court/Trimble Road,7,7,7,7,7,7
...,...,...,...,...,...,...
Westover,6,6,6,6,6,6
Whitfield Plantation,5,5,5,5,5,5
Wilson Green,9,9,9,9,9,9
Windwood Hills,5,5,5,5,5,5


In [11]:
print('There are {} unique venue categories.'.format(len(tally_venues['Venue Category'].unique())))

There are 252 unique venue categories.


Now it's time for one-hot encoding. This means that each of our 252 unique venue categories gets its own column. Each row still represents a single venue, with a 1 in the column for its category and a 0 everywhere else. See the example below the next cell.

In [12]:
# one hot encoding
tally_onehot = pd.get_dummies(tally_venues[['Venue Category']], prefix="", prefix_sep="")

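# one of the venue categories is itself named "Neighborhood"; drop that column
# so it doesn't clash with the neighborhood-name column inserted next
tally_onehot.drop(["Neighborhood"], axis=1, inplace=True)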

tally_onehot.insert(0, "Neighborhood", tally_venues['Neighborhood'])

tally_onehot.head(1000)

,Neighborhood,Accessories Store,Airport,American Restaurant,Animal Shelter,Antique Shop,Arcade,Art Gallery,Arts & Crafts Store,Arts & Entertainment,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,All Saints,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,All Saints,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,All Saints,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,All Saints,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,All Saints,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,Ferndale Place,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
996,Ferndale Place,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
997,Ferndale Place,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
998,Ferndale Place,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
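
With 252 columns it's hard to see what one-hot encoding actually did, so here is the same call on a tiny made-up table. The neighborhoods "A" and "B" and their venues are invented for illustration. The `.astype(int)` is there only so the output shows 0 and 1 regardless of pandas version.

```python
import pandas as pd

# Made-up venues, one row per venue, just to show the shape of the result.
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B"],
    "Venue Category": ["Bar", "Park", "Bar"],
})

# Same get_dummies call as in the notebook, on the toy data.
toy_onehot = pd.get_dummies(
    toy_venues[["Venue Category"]], prefix="", prefix_sep=""
).astype(int)
toy_onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

print(toy_onehot)
```

Each row has exactly one 1, in the column matching that venue's category.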


In [13]:
tally_onehot.shape

(3198, 252)

Now we're going to normalize the data by grouping the rows by neighborhood and taking the mean of each column. This gives each venue type a value between 0 and 1: the share of that neighborhood's venues that belong to that type.

In [14]:
tally_grouped = tally_onehot.groupby('Neighborhood').mean().reset_index()
tally_grouped

,Neighborhood,Accessories Store,Airport,American Restaurant,Animal Shelter,Antique Shop,Arcade,Art Gallery,Arts & Crafts Store,Arts & Entertainment,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,All Saints,0.0,0.0,0.034483,0.0,0.000000,0.0,0.011494,0.0,0.0,...,0.034483,0.000000,0.000000,0.000000,0.0,0.011494,0.0,0.000000,0.0,0.0
1,Amelia Circle,0.0,0.0,0.021277,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.000000,0.021277,0.021277,0.021277,0.0,0.000000,0.0,0.021277,0.0,0.0
2,Apalachee Ridge Estates,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
3,Arvah Branch,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
4,Astoria Court/Trimble Road,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
158,Westover,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
159,Whitfield Plantation,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
160,Wilson Green,0.0,0.0,0.000000,0.0,0.111111,0.0,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
161,Windwood Hills,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
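
To see why the mean of one-hot rows works as a frequency, here is a small sketch on invented data: neighborhood "A" has four venues (three bars and one park) and neighborhood "B" has a single park.

```python
import pandas as pd

# Invented one-hot rows: one row per venue.
toy_onehot = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B"],
    "Bar":          [1, 1, 1, 0, 0],
    "Park":         [0, 0, 0, 1, 1],
})

# Mean of 0/1 values per neighborhood = share of venues in each category.
toy_grouped = toy_onehot.groupby("Neighborhood").mean().reset_index()
print(toy_grouped)
```

Neighborhood "A" comes out as 0.75 Bar and 0.25 Park, and every row sums to 1, so a neighborhood with 4 venues and one with 87 end up on the same scale.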


Now let's check the shape of our grouped data. As you can see below, there are 163 neighborhoods (those for which at least one venue was returned) and 252 columns: the `Neighborhood` name plus 251 venue-type columns.

In [15]:
tally_grouped.shape

(163, 252)

Let's find the ten most common venue types for each neighborhood in town. This is done over the next two cells, with the output below that.

In [16]:
# Function borrowed to sort venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

In [17]:
# More code borrowed to show the 10 most common venue types per neighborhood in a dataframe
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = tally_grouped['Neighborhood']

for ind in np.arange(tally_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(tally_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head(10)

,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,All Saints,Bar,Coffee Shop,Café,Pizza Place,Sandwich Place,Restaurant,American Restaurant,Vegetarian / Vegan Restaurant,Food Truck,Mexican Restaurant
1,Amelia Circle,Smoke Shop,Restaurant,Sandwich Place,Convenience Store,Discount Store,Grocery Store,Thrift / Vintage Store,Coffee Shop,Sports Bar,South American Restaurant
2,Apalachee Ridge Estates,Pool,Park,Football Stadium,Convenience Store,Cupcake Shop,Dance Studio,Cosmetics Shop,Food,Fondue Restaurant,Flower Shop
3,Arvah Branch,Playground,Museum,Trail,Restaurant,Farm,Fish Market,Fast Food Restaurant,Farmers Market,Donut Shop,Flea Market
4,Astoria Court/Trimble Road,School,Intersection,Soccer Stadium,Gas Station,Restaurant,Pizza Place,Home Service,Dance Studio,Deli / Bodega,Flower Shop
5,Avondale,Plaza,Grocery Store,Dry Cleaner,Food & Drink Shop,Food,Fondue Restaurant,Flower Shop,Flea Market,Fish Market,Fast Food Restaurant
6,Beacon Hill,Pool,Football Stadium,Baseball Field,Park,Yoga Studio,Farmers Market,Eye Doctor,Fabric Shop,Farm,Fish Market
7,Benjamin's Run,Plaza,Garden,Grocery Store,Dry Cleaner,Food,Fondue Restaurant,Flower Shop,Flea Market,Fish Market,Fast Food Restaurant
8,Betton Hills,Lake,Martial Arts School,Pet Store,Park,Furniture / Home Store,Fast Food Restaurant,Eye Doctor,Fabric Shop,Farm,Farmers Market
9,Betton Woods,Fast Food Restaurant,Thrift / Vintage Store,Skate Park,Gas Station,Motorsports Shop,Eye Doctor,Storage Facility,Grocery Store,Optical Shop,Sandwich Place
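
Several of the lists above end in runs like "Fondue Restaurant, Flower Shop, Flea Market". For neighborhoods with only a handful of venues, those trailing entries appear to be categories with a frequency of zero that are simply padding the top ten. A possible variation, sketched here with a hypothetical helper name and made-up frequencies, keeps only categories that actually occur:

```python
import pandas as pd

def top_nonzero_venues(freqs, num_top_venues):
    """Return up to num_top_venues category names with frequency > 0, most common first."""
    nonzero = freqs[freqs > 0].sort_values(ascending=False, kind="mergesort")
    return nonzero.index[:num_top_venues].tolist()

# Made-up frequencies for a neighborhood with only two venue types nearby.
toy_row = pd.Series({"Bar": 0.0, "Park": 0.25, "Plaza": 0.75, "Zoo": 0.0})
print(top_nonzero_venues(toy_row, 3))
```

The result can be shorter than `num_top_venues`, so it would need padding (with `None`, say) before going into a fixed-width dataframe like the one above.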


# Part Three: K-Means Clustering

Finally, we're going to use K-Means clustering to classify the neighborhoods of town based on the venues in each one. Through some tinkering and experimentation, I found that 4 clusters seemed appropriate, as using fewer felt anticlimactic and using more left some labels being completely unused.

In [25]:
# set number of clusters
kclusters = 4

tally_grouped_clustering = tally_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(tally_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_ 

array([3, 3, 0, 3, 3, 2, 0, 2, 0, 3, 3, 3, 3, 3, 0, 3, 3, 3, 3, 3, 3, 0,
       3, 3, 3, 0, 3, 3, 3, 3, 0, 3, 0, 3, 3, 3, 3, 2, 3, 1, 3, 3, 3, 0,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 0, 3, 3, 0, 3, 3, 3, 3, 3, 2, 3, 3, 0, 3, 3, 3, 3, 0, 0,
       0, 3, 0, 3, 3, 3, 3, 0, 3, 3, 2, 3, 3, 3, 3, 0, 3, 1, 3, 3, 0, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 0, 3, 3, 3, 3, 0, 3, 3, 3, 3, 0, 3, 0, 3,
       3, 0, 3, 0, 4, 3, 0, 0, 0, 3, 3, 3, 3, 3, 3, 3, 3, 0, 3, 3, 3, 3,
       3, 0, 3, 3, 3, 0, 0, 3, 0])
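
I picked the number of clusters by tinkering. A common complement to that is the elbow heuristic: fit K-Means for several values of k and watch where the inertia (within-cluster sum of squared distances) stops dropping sharply. The sketch below runs on synthetic blobs that I made up, not on the Tallahassee table, so it only illustrates the mechanics. Note that with `n_clusters=k`, the values in `labels_` run from 0 to k - 1.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the frequency table: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 2))
```

On data like this the drop flattens out after k = 3. Real venue data is rarely that clean, so treat the elbow as one input alongside the kind of experimentation described above.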

Now we make a dataframe of coordinates for merging into the classification dataframe, and finally, plotting on a map.

Because the next cell reassigns `coords` to its own transpose, running it a second time throws an error (the number of columns no longer matches the two column names). Just run it a third time and it will right itself back out.

In [30]:
coords = pd.DataFrame(coords).T
coords.columns = ["Latitude", "Longitude"]
coords.index.name = "Neighborhood"
coords

Neighborhood,Latitude,Longitude
All Saints,30.4341,-84.2878
Amelia Circle,30.4475,-84.3236
Apalachee Ridge Estates,30.4103,-84.2680
Arvah Branch,30.4940,-84.1659
Astoria Court/Trimble Road,30.4721,-84.3351
...,...,...
Whitfield Plantation,30.5025,-84.2114
Wilkinson Woods,30.3285,-84.2200
Wilson Green,30.3953,-84.2852
Windwood Hills,30.4283,-84.1712
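
One way to avoid the run-it-again wrinkle is to build the dataframe from the dictionary under a separate name, so re-running the cell always starts from the same input. This is a sketch of that alternative rather than what the notebook does; the names `coords_dict`, `coords_to_frame` and `coords_df` are hypothetical, and the conversion to floats is my assumption.

```python
import pandas as pd

# Two entries copied from the coords dictionary above.
coords_dict = {
    "All Saints": ["30.4341", "-84.2878"],
    "Amelia Circle": ["30.4475", "-84.3236"],
}

def coords_to_frame(d):
    """Build a Neighborhood-indexed dataframe with Latitude/Longitude columns."""
    frame = pd.DataFrame.from_dict(d, orient="index", columns=["Latitude", "Longitude"])
    frame.index.name = "Neighborhood"
    return frame.astype(float)  # assumption: numeric coordinates are wanted

coords_df = coords_to_frame(coords_dict)
coords_df_again = coords_to_frame(coords_dict)  # re-running gives the same result
print(coords_df)
```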


In [34]:
# add clustering labels
try:   # this causes an error if run more than once
    neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
except ValueError:
    pass
    
tally_merged = coords
tally_merged = tally_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

tally_merged.head(10) # check the last columns!

Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
All Saints,30.4341,-84.2878,3.0,Bar,Coffee Shop,Café,Pizza Place,Sandwich Place,Restaurant,American Restaurant,Vegetarian / Vegan Restaurant,Food Truck,Mexican Restaurant
Amelia Circle,30.4475,-84.3236,3.0,Smoke Shop,Restaurant,Sandwich Place,Convenience Store,Discount Store,Grocery Store,Thrift / Vintage Store,Coffee Shop,Sports Bar,South American Restaurant
Apalachee Ridge Estates,30.4103,-84.268,0.0,Pool,Park,Football Stadium,Convenience Store,Cupcake Shop,Dance Studio,Cosmetics Shop,Food,Fondue Restaurant,Flower Shop
Arvah Branch,30.494,-84.1659,3.0,Playground,Museum,Trail,Restaurant,Farm,Fish Market,Fast Food Restaurant,Farmers Market,Donut Shop,Flea Market
Astoria Court/Trimble Road,30.4721,-84.3351,3.0,School,Intersection,Soccer Stadium,Gas Station,Restaurant,Pizza Place,Home Service,Dance Studio,Deli / Bodega,Flower Shop
Avondale,30.458,-84.1785,2.0,Plaza,Grocery Store,Dry Cleaner,Food & Drink Shop,Food,Fondue Restaurant,Flower Shop,Flea Market,Fish Market,Fast Food Restaurant
Beacon Hill,30.4037,-84.2661,0.0,Pool,Football Stadium,Baseball Field,Park,Yoga Studio,Farmers Market,Eye Doctor,Fabric Shop,Farm,Fish Market
Benjamin's Run,30.4567,-84.1811,2.0,Plaza,Garden,Grocery Store,Dry Cleaner,Food,Fondue Restaurant,Flower Shop,Flea Market,Fish Market,Fast Food Restaurant
Betton Hills,30.4737,-84.2569,0.0,Lake,Martial Arts School,Pet Store,Park,Furniture / Home Store,Fast Food Restaurant,Eye Doctor,Fabric Shop,Farm,Farmers Market
Betton Woods,30.4803,-84.245,3.0,Fast Food Restaurant,Thrift / Vintage Store,Skate Park,Gas Station,Motorsports Shop,Eye Doctor,Storage Facility,Grocery Store,Optical Shop,Sandwich Place


In [36]:
# Define latitude and longitude of Tallahassee
latitude = 30.4383
longitude = -84.2807

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# drop nan values
tally_merged.dropna(inplace=True)


# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(tally_merged['Latitude'], tally_merged['Longitude'], tally_merged.index, tally_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
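
The color setup in the cell above boils down to one hex color per cluster. Here is a minimal sketch of that piece on its own, assuming the labels are 0 through `kclusters - 1` and arrive as floats after the merge (for example 2.0). The names `palette` and `color_for` are hypothetical.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 4

# One evenly spaced rainbow color per cluster, as hex strings.
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

def color_for(cluster_label):
    """Map a label such as 2 or 2.0 to its hex color."""
    return palette[int(cluster_label)]

for label in range(kclusters):
    print(label, color_for(float(label)))
```

Keeping the lookup in one small function means the marker loop only has to call `color_for(cluster)` for both `color` and `fill_color`.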