# Improving Voter Equity: a campaign plan for St. Louis City.

_by Brian Goldstein. July 2019_
_for IBM Data Science Professional Certificate Capstone_

## Table of Contents
1. Introduction
2. Data
3. Methodology
4. Results
5. Discussion
6. Conclusion

### Introduction
Since the 1960s, the law of the land in the United States has been "one man, one vote." For myriad reasons, this ideal has been deferred, watered down, subverted, or otherwise compromised. However, one thing we can do to improve equity in voting rights is to expand participation among those who have a legal right to vote.

In the city of St. Louis, my hometown, voter equity is no laughing matter. The City of St. Louis' data department used data from the 2016 election to determine that voters in Majority White (MW) wards were 30% more likely to vote than voters in Majority Black (MB) wards. What this 1.3x disparity means for representation and policy outcomes is obvious. If voting equity were a reality, over 15,000 more votes from MB wards would have been cast in that election.

Additionally, compared with its neighboring counties, the City of St. Louis has the second-lowest voter registration rate, and the shortfall is again concentrated in MB wards.

Therefore, to increase voter equity in the city of St. Louis, this project proposes using Foursquare location data to identify where 501(c)(3) and municipal efforts to register voters and turn them out for elections should be directed.

### Data 
In addition to leveraging data available from Foursquare's Places API, this report will use open source data from the city of St. Louis, the city of St. Louis Election Board, the Missouri Secretary of State, and the US Census. 


Step one is to identify the neighborhoods we need to focus on. St. Louis' unique geography helps here: the city is conveniently divided into two sides (North Side and South Side), separated by a central corridor. We will not use ward boundaries, for two reasons: first, wards are redrawn every 10 years following the census, and second, the Board of Aldermen is going from 28 to 14 members.

As a shortcut to understanding St. Louis (and to underscore the lasting legacy of segregation): in 2010 the north side was 94.0% Black, 3.7% White, 0.2% Native American/Alaska Native, 0.2% Asian, 1.5% Two or More Races, and 0.5% Some Other Race. 1.1% of the population was of Hispanic or Latino origin.

In 2010 the central corridor was 35.0% Black, 55.4% White, 0.2% Native American/Alaska Native, 6.4% Asian, 2.2% Two or More Races, and 0.7% Some Other Race. 2.8% of the population was of Hispanic or Latino origin.

In 2010 the south side was 26.0% Black, 65.4% White, 0.3% Native American/Alaska Native, 3.3% Asian, 3.0% Two or More Races, and 2.0% Some Other Race. 5.3% of the population was of Hispanic or Latino origin.

Exceptions exist, however, and ward geography does not map 1:1 onto neighborhood delineation.

The table found here, https://en.wikipedia.org/wiki/List_of_neighborhoods_of_St._Louis, conveniently provides the demographic information we need.

An additional table, here https://en.wikipedia.org/wiki/Board_of_Aldermen_of_the_City_of_St._Louis, helps to complete the picture. For each of the 28 wards, it lists the constituent neighborhoods or parts of neighborhoods.

We will take the demographic data and sort by % of Black residents. Our further analysis will focus on exploring these neighborhoods. 

Our next step is to conduct an exploratory analysis of these neighborhoods, similar to this course's New York neighborhoods project.

In this step, we'll create a Pandas dataframe with neighborhood, latitude and longitude. We'll create a map of St. Louis using Folium, with our focus neighborhoods superimposed on top. Then we'll write a function that gets up to 100 venues around each neighborhood's coordinates (a 400m radius in the single-neighborhood test; the function's default of 500m applies to the full run), and get the category data for later segmentation.

We'll create a dataframe that lists the top 5 venues for each neighborhood in MB wards. In other words, these are the places that people already go to. These locations are the logical starting point for organizations and civic workers looking to build relationships that encourage civic participation. However, we also need to understand trends here, in case new establishments spring up.

To better understand the trends of venues in each neighborhood, we'll use Venue Category to help. 

With Venue Category available, we'll one-hot encode the categories, then group rows by neighborhood and take the mean of each category's frequency of occurrence. We'll print the 5 most frequently occurring venue categories for each neighborhood and then collect them in a dataframe. This data gives campaign planners something to look for in developing new relationships and programs.
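
To make the encoding step concrete, here is a minimal sketch on made-up data. The neighborhoods `A` and `B` and their venues are hypothetical, not taken from the Foursquare results; only the technique (one-hot encoding followed by a per-neighborhood mean) mirrors the description above.

```python
import pandas as pd

# Hypothetical toy data: two neighborhoods with a handful of venues each.
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Bar', 'Bar', 'Park', 'Park', 'Bakery'],
})

# One-hot encode the category column: one 0/1 column per category.
toy_onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=float)
toy_onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# The mean of each 0/1 column per neighborhood is that category's
# share of the neighborhood's venues.
toy_grouped = toy_onehot.groupby('Neighborhood').mean()

# Top categories per neighborhood, most frequent first.
top_categories = {
    hood: row.sort_values(ascending=False).head(2).index.tolist()
    for hood, row in toy_grouped.iterrows()
}
print(toy_grouped)
```

In neighborhood `A`, two of three venues are bars, so the `Bar` column averages to about 0.67. This is the sense in which the grouped values are frequencies.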

Finally, we'll use KMeans clustering to group the most similar neighborhoods (by Venue Category) together. 

### Methodology

In [1]:
# STEP ZERO: 
# Install Dependencies
import pandas as pd
import numpy as np
import json
import requests
from bs4 import BeautifulSoup
from pandas import json_normalize  # top-level since pandas 1.0; older versions: from pandas.io.json import json_normalize
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

print('Dependencies are installed.')

Dependencies are installed.


In [2]:
# STEP ONE:
# Scrape URL for Neighborhood Information
neighborhoodUrl = 'https://en.wikipedia.org/wiki/List_of_neighborhoods_of_St._Louis'

source = requests.get(neighborhoodUrl).text
hood_Data = BeautifulSoup(source, 'lxml')


In [3]:
# Make a neighborhood df from neighborhood list
# NOTE: `col` holds ONE comma-joined string, so `demo` starts out with a single,
# oddly named (and always empty) column; the cleanup cell further down drops it.
col = ['Neighborhood, Population, White, Black, Hispanic/Latino, AIAN, Asian, Mixed Race, Corridor']
demo = pd.DataFrame(columns = col)

content = hood_Data.find('table', class_='wikitable sortable')

# Table columns, in the order the <td> cells appear in each row
fields = ['Neighborhood', 'Population', 'White', 'Black', 'Hispanic/Latino',
          'AIAN', 'Asian', 'Mixed Race', 'Corridor']

rows = []
for tr in content.find_all('tr'):
    # Start every field at 0; the header row has no <td> cells and stays all zeros
    row = dict.fromkeys(fields, 0)
    for field, td in zip(fields, tr.find_all('td')):
        if field == 'Corridor':
            row[field] = td.text.strip('\n').replace(']','')
        else:
            row[field] = td.text
    rows.append(row)

# DataFrame.append was removed in pandas 2.0, so the rows are added with a single concat.
# Columns are kept in alphabetical order because later cells select columns by position.
demo = pd.concat([demo, pd.DataFrame(rows, columns=sorted(fields))], ignore_index=True)

In [4]:
demo

Unnamed: 0,"Neighborhood, Population, White, Black, Hispanic/Latino, AIAN, Asian, Mixed Race, Corridor",AIAN,Asian,Black,Corridor,Hispanic/Latino,Mixed Race,Neighborhood,Population,White
0,,0,0,0,0,0,0,0,0,0
1,,1.52,4.3,54.7,North,20.5,3.5,Academy,3006,16.9
2,,0.1,0,91.8,North,0.5,1.3,Baden,7268,6.3
3,,0.3,1.2,25.1,South,3.2,3.8,Benton Park,3532,68.2
4,,0,1.9,59.6,South,10.5,5.1,Benton Park West,4404,28.0
5,,0.4,4.6,13.8,South,7.5,3.9,Bevo Mill,12654,74.2
6,,0.2,1.7,74.4,Central,2.1,2.6,Botanical Heights,1037,20.3
7,,0.3,3.6,3.6,South,3.5,2.0,Boulevard Heights,8708,89.5
8,,0.6,1.3,33.8,South,7.1,3.7,Carondelet,8661,57.3
9,,0.3,0,98.0,North,0.5,0.9,Carr Square,2774,0.5


In [5]:
demo['Black'] = demo['Black'].astype(float)
demo.dtypes

Neighborhood, Population, White, Black, Hispanic/Latino, AIAN, Asian, Mixed Race, Corridor     object
AIAN                                                                                           object
Asian                                                                                          object
Black                                                                                         float64
Corridor                                                                                       object
Hispanic/Latino                                                                                object
Mixed Race                                                                                     object
Neighborhood                                                                                   object
Population                                                                                     object
White                                                                             

In [6]:
# Clean demographic dataframe up.  
demo = demo[demo.Black>=50]
demo = demo.drop(['AIAN', 'Asian', 'White', 'Hispanic/Latino', 'Mixed Race', 'Population', 'Neighborhood, Population, White, Black, Hispanic/Latino, AIAN, Asian, Mixed Race, Corridor' ], axis=1)
demo.reset_index(drop = True, inplace = True)

In [7]:
demo.head(40)

Unnamed: 0,Black,Corridor,Neighborhood
0,54.7,North,Academy
1,91.8,North,Baden
2,59.6,South,Benton Park West
3,74.4,Central,Botanical Heights
4,98.0,North,Carr Square
5,92.7,North,College Hill
6,92.9,North,Columbus Square
7,50.8,South,Dutchtown
8,97.1,North,Fairground
9,64.3,Central,Forest Park Southeast


In [8]:
# STEP TWO 
# Assign Lat and Long info to each neighborhood
# Superimpose Neighborhoods on to map 
# Function for getting venue data for each neighborhood
# Dataframe of top 5 venues for each neighborhood

In [9]:
pip install geocoder

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6
Note: you may need to restart the kernel to use updated packages.


In [10]:
import geocoder # import geocoder

# initialize your variable to None
lat_lng_coords = None

In [11]:
def get_latlng(neighborhood, max_tries=5):
    # retry the lookup a few times; fail loudly instead of looping forever
    for _ in range(max_tries):
        g = geocoder.arcgis('{}, St. Louis, Missouri'.format(neighborhood))
        lat_lng_coords = g.latlng
        if lat_lng_coords is not None:
            return lat_lng_coords
    raise RuntimeError('No coordinates found for {}'.format(neighborhood))
    
get_latlng('Benton Park West')

[38.597660000000076, -90.23096999999996]

In [12]:
neighborhoods = demo['Neighborhood']    
coords = [ get_latlng(neighborhood) for neighborhood in neighborhoods.tolist() ]
coords

[[38.6772799233944, -90.50661997388784],
 [38.70553000000007, -90.23000999999994],
 [38.597660000000076, -90.23096999999996],
 [38.620960000000025, -90.25062999999994],
 [38.63909000000007, -90.19946999999996],
 [38.674000000000035, -90.20877999999999],
 [38.63691000000006, -90.18941999999998],
 [38.58063000000004, -90.24566999999996],
 [38.66745000000003, -90.21775999999994],
 [38.62695000000008, -90.25708999999995],
 [38.65787000000006, -90.25941999999998],
 [38.60890000000006, -90.22569999999996],
 [38.62018000000006, -90.22923999999995],
 [38.64395923120887, -90.2776771879114],
 [38.59063000000003, -90.23359999999997],
 [38.666210000000035, -90.23468999999994],
 [38.66876000000008, -90.27992999999998],
 [38.66176000000007, -90.20348999999999],
 [38.65215000000006, -90.21940999999998],
 [38.668330000000026, -90.25406999999996],
 [38.67318000000006, -90.25921999999997],
 [38.61534000000006, -90.20233999999994],
 [38.65417000000008, -90.25063999999998],
 [38.58598000000006, -90.219799

In [13]:
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])
demo['Latitude'] = df_coords['Latitude']
demo['Longitude'] = df_coords['Longitude']

In [14]:
demo.head()

Unnamed: 0,Black,Corridor,Neighborhood,Latitude,Longitude
0,54.7,North,Academy,38.67728,-90.50662
1,91.8,North,Baden,38.70553,-90.23001
2,59.6,South,Benton Park West,38.59766,-90.23097
3,74.4,Central,Botanical Heights,38.62096,-90.25063
4,98.0,North,Carr Square,38.63909,-90.19947
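
Geocoded coordinates are worth a quick sanity check before they are used to query venues. The sketch below tests three of the coordinate pairs shown above against a rough bounding box for the city. The box limits are an assumption chosen for illustration, not an official boundary, and the helper name `in_city_bounds` is hypothetical.

```python
# Rough, assumed bounding box for the City of St. Louis (not an official boundary).
STL_BOUNDS = {'lat_min': 38.53, 'lat_max': 38.78, 'lng_min': -90.33, 'lng_max': -90.16}

def in_city_bounds(lat, lng, bounds=STL_BOUNDS):
    """Return True if the point falls inside the assumed bounding box."""
    return (bounds['lat_min'] <= lat <= bounds['lat_max']
            and bounds['lng_min'] <= lng <= bounds['lng_max'])

# Coordinates copied from the geocoder output above.
geocoded = {
    'Academy': (38.67728, -90.50662),
    'Baden': (38.70553, -90.23001),
    'Benton Park West': (38.59766, -90.23097),
}

# Neighborhoods whose coordinates fall outside the box deserve a manual look.
suspect = [name for name, (lat, lng) in geocoded.items()
           if not in_city_bounds(lat, lng)]
print(suspect)
```

Under this assumed box only Academy is flagged. Its longitude (about -90.51) lies well to the west of every other neighborhood in the list, which suggests the geocoder may have matched a different place. If so, the venues returned for Academy describe a different area.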


In [15]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

In [16]:
address = 'Saint Louis'

# Recent geopy releases require an application-specific user_agent;
# 'stl_explorer' is an arbitrary label
geolocator = Nominatim(user_agent='stl_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of St. Louis are {}, {}.'.format(latitude, longitude))

# create map of St. Louis using latitude and longitude values
map_stl = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, black, neighborhood in zip(demo['Latitude'], demo['Longitude'], demo['Black'], demo['Neighborhood']):
    label = '{}, {}'.format(neighborhood, black)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#2E8B57',
        fill_opacity=0.7).add_to(map_stl)

map_stl

The geographical coordinates of St. Louis are 38.6268039, -90.1994097.


In [17]:
# STEP THREE 
# One Hot Encoding 
# Transform in to 5 most frequent venues in each neighborhood. 
# KMeans clustering to group neighborhoods into clusters. 
# Map clusters on to Map. 

In [18]:
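# Hide this cell
# Credentials are read from environment variables instead of being hard-coded in the notebook.
# The variable names below are arbitrary; set them before starting the kernel.
import os
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version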

In [19]:
neighborhood_latitude = demo.loc[1, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = demo.loc[1, 'Longitude'] # neighborhood longitude value

neighborhood_name = demo.loc[1, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Baden are 38.70553000000007, -90.23000999999994.


In [20]:
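# Explore the venues around a single neighborhood (Baden) as a test
limit = 100 # maximum number of venues to return
radius = 400 # search radius in meters

# Create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude,
    radius,
    limit
)

# request the venues (the URL itself is not displayed because it embeds the API credentials)
results = requests.get(url).json()
results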

{'meta': {'code': 200, 'requestId': '5d31da549b514f00304b514a'},
  'headerLocation': 'Baden',
  'headerFullLocation': 'Baden, St Louis',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 3,
  'suggestedBounds': {'ne': {'lat': 38.70913000360007,
    'lng': -90.22540541279474},
   'sw': {'lat': 38.70192999640006, 'lng': -90.23461458720513}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4f30bca6e4b015f62a9d6f94',
       'name': 'Da New Bricks',
       'location': {'lat': 38.70574852125042,
        'lng': -90.23011611506234,
        'labeledLatLngs': [{'label': 'display',
          'lat': 38.70574852125042,
          'lng': -90.23011611506234}],
        'distance': 26,
        'postalCode': '63147',
        'cc': 'US',
        'city': 'St Louis',
        's
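
The request URL above is assembled with `str.format`. As an alternative, here is a minimal sketch that builds the same query string from a dictionary with the standard library's `urlencode`, which takes care of escaping. The parameter names are the ones already used in the URL above; the function name `build_explore_url` and the placeholder credentials are assumptions for illustration.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore request URL from a dict of query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Placeholder credentials, for illustration only.
example_url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180605',
                                38.70553, -90.23001, 400, 100)
print(example_url)
```

The same dictionary can also be handed to `requests.get` through its `params` argument, which performs the encoding itself.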

In [21]:
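def get_category_type(row):
    # the category list sits under 'categories' or 'venue.categories', depending on the frame
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

LIMIT = 100 # maximum number of venues per request
radius = 400 # search radius in meters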

In [22]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Da New Bricks,Bar,38.705749,-90.230116
1,Candy Bar,Bar,38.702442,-90.22933
2,The Boss Night Club,Nightclub,38.702389,-90.229406


In [23]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

3 venues were returned by Foursquare.
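
The `distance` field in the response can be cross-checked against the coordinates. The sketch below uses the haversine formula with a mean Earth radius of 6,371 km (an assumed constant) on the Baden coordinates and the venue coordinates of Da New Bricks, both copied from the output above. The function name `haversine_m` is illustrative.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lng1, lat2, lng2, earth_radius_m=6371000):
    """Great-circle distance in meters between two (lat, lng) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))

baden = (38.70553000000007, -90.23000999999994)          # neighborhood coordinates
da_new_bricks = (38.70574852125042, -90.23011611506234)  # venue coordinates

d = haversine_m(*baden, *da_new_bricks)
print(round(d))
```

The result comes out at roughly 26 meters, in line with the `'distance': 26` reported for that venue. The same helper can confirm that every returned venue really lies inside the requested radius.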


In [24]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

stl_venues = getNearbyVenues(names=demo['Neighborhood'],
                                   latitudes=demo['Latitude'],
                                   longitudes=demo['Longitude']
                                  )

stl_venues.head(3)

Academy
Baden
Benton Park West
Botanical Heights
Carr Square
College Hill
Columbus Square
Dutchtown
Fairground
Forest Park Southeast
Fountain Park
Fox Park
The Gate District
Grand Center
Gravois Park
Greater Ville
Hamilton Heights
Hyde Park
JeffVanderLou
Kingsway East
Kingsway West
LaSalle Park
Lewis Place
Marine Villa
Mark Twain
Mark Twain/I-70 Industrial
Near North Riverfront
North Point
North Riverfront
O’Fallon
Old North St. Louis
Peabody Darst Webbe
Penrose
Riverview
St. Louis Place
Tiffany
Vandeventer
The Ville
Visitation Park
Walnut Park East
Walnut Park West
Wells/Goodfellow
West End


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Academy,38.67728,-90.50662,McArthurs Bakery,38.679504,-90.506024,Bakery
1,Academy,38.67728,-90.50662,Saint Louis Bread Co.,38.679104,-90.502683,Bakery
2,Academy,38.67728,-90.50662,Hunan Empress,38.679268,-90.504412,Chinese Restaurant
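
One fragile spot in `getNearbyVenues` is the expression `v['venue']['categories'][0]['name']`, which raises an `IndexError` for any venue whose category list is empty. A minimal sketch of a more forgiving lookup follows. The helper name `first_category_name`, the `'Uncategorized'` default, and the second sample item are all hypothetical.

```python
def first_category_name(item, default='Uncategorized'):
    """Return the first category name of an explore item, or `default` if it has none."""
    categories = item.get('venue', {}).get('categories') or []
    return categories[0]['name'] if categories else default

# Hypothetical items shaped like the entries of response['groups'][0]['items'].
items = [
    {'venue': {'name': 'Da New Bricks', 'categories': [{'name': 'Bar'}]}},
    {'venue': {'name': 'Unlabeled spot', 'categories': []}},
]

names = [first_category_name(item) for item in items]
print(names)
```

Calling a helper like this inside the list comprehension keeps one uncategorized venue from aborting the whole run.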


In [25]:
stl_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Academy,10,10,10,10,10,10
Baden,6,6,6,6,6,6
Benton Park West,18,18,18,18,18,18
Botanical Heights,13,13,13,13,13,13
Carr Square,6,6,6,6,6,6
College Hill,5,5,5,5,5,5
Columbus Square,2,2,2,2,2,2
Dutchtown,8,8,8,8,8,8
Fairground,4,4,4,4,4,4
Forest Park Southeast,33,33,33,33,33,33


In [26]:
print('There are {} unique categories.'.format(len(stl_venues['Venue Category'].unique())))

There are 108 unique categories.


In [27]:
# one hot encoding
stl_onehot = pd.get_dummies(stl_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
stl_onehot['Neighborhood'] = stl_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [stl_onehot.columns[-1]] + list(stl_onehot.columns[:-1])
stl_onehot = stl_onehot[fixed_columns]

stl_onehot.head()

Unnamed: 0,Neighborhood,American Restaurant,Arcade,Art Gallery,Arts & Crafts Store,BBQ Joint,Bakery,Bar,Beer Bar,Bike Rental / Bike Share,...,Theater,Thrift / Vintage Store,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Waste Facility,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Academy,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Academy,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Academy,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Academy,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Academy,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [28]:
# Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category
stl_grouped = stl_onehot.groupby('Neighborhood').mean().reset_index()
stl_grouped

Unnamed: 0,Neighborhood,American Restaurant,Arcade,Art Gallery,Arts & Crafts Store,BBQ Joint,Bakery,Bar,Beer Bar,Bike Rental / Bike Share,...,Theater,Thrift / Vintage Store,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Waste Facility,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Academy,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Baden,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Benton Park West,0.0,0.0,0.055556,0.0,0.0,0.055556,0.055556,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Botanical Heights,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0
4,Carr Square,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0
5,College Hill,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Columbus Square,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Dutchtown,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0
8,Fairground,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Forest Park Southeast,0.0,0.030303,0.030303,0.0,0.060606,0.0,0.060606,0.030303,0.030303,...,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030303


In [29]:
# We're going to turn this into a text output. 
num_top_venues = 5

for hood in stl_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = stl_grouped[stl_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Academy----
                  venue  freq
0           Pizza Place   0.3
1                Bakery   0.2
2    Chinese Restaurant   0.1
3  Fast Food Restaurant   0.1
4                  Café   0.1


----Baden----
                venue  freq
0                 Bar  0.33
1         Pizza Place  0.17
2       Grocery Store  0.17
3           Nightclub  0.17
4  Chinese Restaurant  0.17


----Benton Park West----
                venue  freq
0  Mexican Restaurant  0.33
1         Pizza Place  0.17
2        Intersection  0.11
3         Art Gallery  0.06
4          Restaurant  0.06


----Botanical Heights----
          venue  freq
0  Intersection  0.08
1    Food Truck  0.08
2          Park  0.08
3   Music Venue  0.08
4     Gift Shop  0.08


----Carr Square----
           venue  freq
0    Pizza Place  0.17
1      BBQ Joint  0.17
2  Grocery Store  0.17
3            Gym  0.17
4    Video Store  0.17


----College Hill----
                    venue  freq
0  Furniture / Home Store   0.2
1           Grocer

In [30]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = stl_grouped['Neighborhood']

for ind in np.arange(stl_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(stl_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Academy,Pizza Place,Bakery,Shipping Store,Fast Food Restaurant,Bowling Alley
1,Baden,Bar,Pizza Place,Grocery Store,Nightclub,Chinese Restaurant
2,Benton Park West,Mexican Restaurant,Pizza Place,Intersection,Taco Place,Bakery
3,Botanical Heights,Mediterranean Restaurant,Bakery,Mexican Restaurant,Food Truck,Park
4,Carr Square,Pizza Place,Gym,BBQ Joint,Video Store,Grocery Store
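
The printed top-5 list for Academy and the row for Academy in this table disagree after the second entry. That pattern is what one would expect when several categories tie at the same frequency and the sort breaks ties arbitrarily. Below is a minimal sketch of a deterministic ranking that sorts by frequency and then by name. The frequencies are reconstructed from the two Academy listings and should be read as illustrative.

```python
import pandas as pd

# Academy-like frequencies: two clear leaders and a five-way tie at 0.1 (illustrative).
freqs = pd.DataFrame({
    'venue': ['Shipping Store', 'Pizza Place', 'Chinese Restaurant', 'Bakery',
              'Bowling Alley', 'Fast Food Restaurant', 'Café'],
    'freq': [0.1, 0.3, 0.1, 0.2, 0.1, 0.1, 0.1],
})

# Sort by frequency (descending), then by name (ascending),
# so ties resolve the same way on every run.
ranked = (freqs.sort_values(['freq', 'venue'], ascending=[False, True])
               .reset_index(drop=True))
print(ranked.head(5))
```

Alphabetical tie-breaking is itself arbitrary. The point is only that planners reading the table should not attach meaning to the order of tied categories.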


In [31]:
# set number of clusters
kclusters = 12

# drop the label column; the positional axis argument was removed in pandas 2.0
stl_grouped_clustering = stl_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(stl_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([11, 11,  1,  1, 11,  1,  4,  1,  6,  1], dtype=int32)
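
The choice of `kclusters = 12` is not justified in the notebook, and the cluster tables in the Results section contain several single-neighborhood clusters. One common heuristic for picking k (not the method used here) is to compare mean silhouette scores across candidate values. The sketch below runs on synthetic data standing in for the neighborhood-by-category matrix; all values are made up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the frequency matrix:
# three well-separated groups of 10 points each in 4 dimensions.
rng = np.random.default_rng(0)
centers = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.05, size=(10, 4)) for c in centers])

# Fit k-means for a range of k and record the mean silhouette score for each.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score is the candidate to inspect first.
best_k = max(scores, key=scores.get)
print(best_k)
```

On the real data, `X` would be `stl_grouped_clustering`. With only a few dozen neighborhoods and more than a hundred sparse category columns, the scores may be noisy, so they are a guide rather than a verdict.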

In [42]:
stl_data = demo.drop(16)  # drop row 16 (Hamilton Heights)

# add clustering labels; the guard keeps a re-run of this cell from raising
# "ValueError: cannot insert Cluster Labels, already exists"
if 'Cluster Labels' not in neighborhoods_venues_sorted.columns:
    neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# merge neighborhoods_venues_sorted with stl_data to add latitude/longitude for each neighborhood
stl_merged = stl_data.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

stl_merged.head() # check the last columns!

In [49]:
stl_merged['Cluster Labels'] = stl_merged['Cluster Labels'].astype('int64')
stl_merged.dtypes

Black                    float64
Corridor                  object
Neighborhood              object
Latitude                 float64
Longitude                float64
Cluster Labels             int64
1st Most Common Venue     object
2nd Most Common Venue     object
3rd Most Common Venue     object
4th Most Common Venue     object
5th Most Common Venue     object
dtype: object

### Results

In [50]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(stl_merged['Latitude'], stl_merged['Longitude'], stl_merged['Neighborhood'], stl_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
                (lat, lon),
                radius=5,
                popup=label,
                color=rainbow[cluster-1],
                fill=True,
                fill_color=rainbow[cluster-1],
                fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Examining clusters

In [53]:
stl_merged.loc[stl_merged['Cluster Labels'] == 0, stl_merged.columns[[1] + list(range(5, stl_merged.shape[1]))]]

Unnamed: 0,Corridor,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
18,North,0,Home Service,Liquor Store,Storage Facility,Fish & Chips Shop,Snack Place


In [64]:
stl_merged.loc[stl_merged['Cluster Labels'] == 1, stl_merged.columns[[1] + list(range(5, stl_merged.shape[1]))]]

Unnamed: 0,Corridor,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,South,1,Mexican Restaurant,Pizza Place,Intersection,Taco Place,Bakery
3,Central,1,Mediterranean Restaurant,Bakery,Mexican Restaurant,Food Truck,Park
5,North,1,Food,Grocery Store,Gas Station,Furniture / Home Store,Nightclub
7,South,1,Ice Cream Shop,Grocery Store,Cajun / Creole Restaurant,Pharmacy,Sandwich Place
9,Central,1,Pizza Place,BBQ Joint,Bar,New American Restaurant,Brewery
10,North,1,Discount Store,Electronics Store,Park,Outdoors & Recreation,Liquor Store
11,South,1,Art Gallery,Intersection,Piano Bar,Laundromat,New American Restaurant
14,South,1,Mexican Restaurant,Bar,Café,Concert Hall,Restaurant
20,North,1,American Restaurant,Bar,Coffee Shop,Seafood Restaurant,Pizza Place
21,Central,1,Food Truck,Lounge,Pool,Event Space,Farmers Market


In [63]:
stl_merged.loc[stl_merged['Cluster Labels'] == 2, stl_merged.columns[[1] + list(range(5, stl_merged.shape[1]))]]

Unnamed: 0,Corridor,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
28,North,2,Waste Facility,Yoga Studio,Food Truck,Dive Bar,Dog Run


In [62]:
stl_merged.loc[stl_merged['Cluster Labels'] == 3, stl_merged.columns[[1] + list(range(5, stl_merged.shape[1]))]]

Unnamed: 0,Corridor,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
22,North,3,American Restaurant,Cosmetics Shop,Business Service,Food Truck,Dog Run


In [61]:
stl_merged.loc[stl_merged['Cluster Labels'] == 4, stl_merged.columns[[1] + list(range(5, stl_merged.shape[1]))]]

Unnamed: 0,Corridor,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
6,North,4,Dive Bar,Mexican Restaurant,Yoga Studio,Food Truck,Dog Run


In [60]:
stl_merged.loc[stl_merged['Cluster Labels'] == 5, stl_merged.columns[[1] + list(range(5, stl_merged.shape[1]))]]

Unnamed: 0,Corridor,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
26,North,5,River,Yoga Studio,Food Court,Dive Bar,Dog Run


In [59]:
stl_merged.loc[stl_merged['Cluster Labels'] == 6, stl_merged.columns[[1] + list(range(5, stl_merged.shape[1]))]]

Unnamed: 0,Corridor,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
8,North,6,Convenience Store,Bar,Chinese Restaurant,Yoga Studio,Food Truck
19,North,6,Convenience Store,Bar,Cocktail Bar,Yoga Studio,Food Truck
27,North,6,Bar,Electronics Store,Convenience Store,Yoga Studio,Food Truck
29,North,6,Wine Bar,Convenience Store,Gas Station,Yoga Studio,Food Court


In [58]:
stl_merged.loc[stl_merged['Cluster Labels'] == 7, stl_merged.columns[[1] + list(range(5, stl_merged.shape[1]))]]

Unnamed: 0,Corridor,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
40,North,7,Wings Joint,Lounge,Yoga Studio,Food Truck,Dive Bar


In [57]:
stl_merged.loc[stl_merged['Cluster Labels'] == 8, stl_merged.columns[[1] + list(range(5, stl_merged.shape[1]))]]

Unnamed: 0,Corridor,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
37,North,8,Discount Store,Grocery Store,Food Truck,Dive Bar,Dog Run


In [56]:
stl_merged.loc[stl_merged['Cluster Labels'] == 9, stl_merged.columns[[1] + list(range(5, stl_merged.shape[1]))]]

Unnamed: 0,Corridor,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
12,Central,9,Park,Southern / Soul Food Restaurant,Gym,Garden,Food Court
13,Central,9,Park,Theater,American Restaurant,Vegetarian / Vegan Restaurant,Dog Run
17,North,9,Park,American Restaurant,Deli / Bodega,New American Restaurant,Discount Store
34,North,9,History Museum,Bar,Liquor Store,Park,Yoga Studio
39,North,9,Fried Chicken Joint,Park,Child Care Service,Yoga Studio,Food Court


In [55]:
stl_merged.loc[stl_merged['Cluster Labels'] == 10, stl_merged.columns[[1] + list(range(5, stl_merged.shape[1]))]]

Unnamed: 0,Corridor,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
32,North,10,Gas Station,Yoga Studio,Food Truck,Dive Bar,Dog Run


In [54]:
stl_merged.loc[stl_merged['Cluster Labels'] == 11, stl_merged.columns[[1] + list(range(5, stl_merged.shape[1]))]]

Unnamed: 0,Corridor,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,North,11,Pizza Place,Bakery,Shipping Store,Fast Food Restaurant,Bowling Alley
1,North,11,Bar,Pizza Place,Grocery Store,Nightclub,Chinese Restaurant
4,North,11,Pizza Place,Gym,BBQ Joint,Video Store,Grocery Store
15,North,11,Moving Target,Liquor Store,Gas Station,Fast Food Restaurant,Chinese Restaurant
24,North,11,Discount Store,Candy Store,BBQ Joint,Chinese Restaurant,Dog Run


### Conclusion

Civic and political campaigns have long relied upon voter files and predictive modeling to identify individual voters (or potential voters) for calls to action and support. However, the individual theory of civic engagement is fundamentally incomplete: while we each enter the voting booth alone, the neighborhood the polling place sits in and the community of voters surrounding it shape who gets to the ballot in the first place. 

By bringing in venue data from Foursquare, we've been able to cluster these neighborhoods on the basis of the places where people go. In other words, we've identified the centers of gravity for community organizing. Employing a venue-based strategy also offers a hidden benefit. 

Voter files are only as good as their latest data. Because the precise targets we're trying to reach are infrequent voters or non-voters, voter file managers have an understandably difficult time getting data about them into the file and keeping it up to date, especially compared to an every-election voter. 

By removing the need to track each voter individually and focusing instead on programming with stable community partners, a campaign introduces less error and wastes less effort. 

An additional thought: experienced organizers and longtime residents may not actually find anything new in this report. This makes sense; the report is less about uncovering some hidden understanding than about making explicit the implicit wisdom gained through many contact hours in the field. 

Organizing is a profession with high burnout rates, chronic under-resourcing, and short tenures in most locations. It is rare to meet community activists who've been able to sustain the kind of years- or decades-long effort needed to erode or eradicate stubborn problems like gaps in voter equity. The ultimate worth of reports and analyses like this one is to shorten the time to effectiveness for organizers and campaigns who are not privileged enough to have deep human expertise to rely upon. If this model for neighborhood-based organizing using venue data can help a campaign distribute its limited resources more effectively, that's an outstanding result in the author's opinion. 

### Discussion

In examining the clusters further, we can identify several trends. First, manually choosing the number of clusters involved plenty of trial and error. This is a methodological drawback of KMeans clustering; however, the process still helped clarify the defining characteristics of each cluster. 
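
The trial and error described above can be made a little more systematic with an elbow check: fit KMeans for a range of cluster counts and look for the point where the within-cluster sum of squares stops dropping sharply. The sketch below is illustrative only and is not the code used in this report. It is a small pure-Python version of Lloyd's algorithm, the toy 2-D points stand in for the neighborhood venue-frequency vectors (which are not reproduced here), and the helper names and the deterministic farthest-first initialization are assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centers(points, k):
    """Deterministic seeding (an assumption for this sketch): start from the
    first point, then repeatedly add the point farthest from all chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    return centers


def kmeans_inertia(points, k, max_iter=100):
    """Run Lloyd's k-means and return the within-cluster sum of squares."""
    centers = farthest_first_centers(points, k)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Hypothetical toy data: three well-separated groups of three points each.
toy_points = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
    (5.0, 5.0), (5.1, 5.2), (5.2, 5.1),
    (10.0, 0.0), (10.1, 0.2), (10.2, 0.1),
]

inertias = {k: kmeans_inertia(toy_points, k) for k in range(1, 7)}
for k, inertia in inertias.items():
    print(k, round(inertia, 2))
```

On data like this the drop in inertia flattens once `k` reaches the number of natural groups. With scikit-learn, the same quantity is available as the `inertia_` attribute of a fitted `KMeans` model, so the loop over `k` would be the only part that needs to carry over.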

Cluster 9 is obviously centered around St. Louis' parks. An effective strategy would leverage the ongoing programming in those parks to find partners and opportunities for crowd appeals. 
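
The claim about Cluster 9 can be checked directly against the table above by counting how many of its neighborhoods list each venue category among their top five. This is a minimal sketch: the rows are copied from the Cluster 9 output, and `venue_share` is a hypothetical helper written for this illustration.

```python
from collections import Counter

# Top-five venue categories for the five Cluster 9 neighborhoods,
# copied from the Cluster 9 table earlier in this report.
cluster_9_rows = [
    ["Park", "Southern / Soul Food Restaurant", "Gym", "Garden", "Food Court"],
    ["Park", "Theater", "American Restaurant", "Vegetarian / Vegan Restaurant", "Dog Run"],
    ["Park", "American Restaurant", "Deli / Bodega", "New American Restaurant", "Discount Store"],
    ["History Museum", "Bar", "Liquor Store", "Park", "Yoga Studio"],
    ["Fried Chicken Joint", "Park", "Child Care Service", "Yoga Studio", "Food Court"],
]


def venue_share(rows):
    """Return {venue: fraction of neighborhoods listing it in their top five},
    ordered from most to least widespread."""
    counts = Counter(venue for row in rows for venue in set(row))
    return {venue: n / len(rows) for venue, n in counts.most_common()}


shares = venue_share(cluster_9_rows)
top_venue, top_share = next(iter(shares.items()))
print(f"{top_venue}: {top_share:.0%} of neighborhoods")  # Park: 100% of neighborhoods
```

Running the same count on any other cluster's rows gives a quick, repeatable way to name what that cluster has in common.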

Conversely, cluster 1 is dominated by restaurants and other places where people actually sit and congregate in small groups. Here, one-on-one interaction would seem intrusive, and short of renting out some of these places, rallies might seem odd. Several different rapport-building techniques will be needed to work with these proprietors and build an effective coalition.

The other clusters each present unique challenges for organizing, and may require a task-force-like effort to effect meaningful change. 