# Final Capstone Project - Week 5 - The Battle of the Neighborhoods - Applied Data Science Capstone - IBM/Coursera

> ## Targeted Marketing Campaign Prediction & User Behavior Analysis using K-Means Clustering

> ## Brandy Guillory

> ## December 26, 2019

## Table of contents
1. [Introduction – Business Understanding](#0)
2. [Data Science Methodology](#datasciencemethodology)
3. [Results](#results)
4. [Discussion](#discussion)
5. [Conclusion](#conclusion)

## 1. Introduction – Business Understanding<a id="0"></a>

### 1.1 Background

Marketing has evolved considerably since the 1900s. Early radio and television advertising focused on "selling". The Golden Age of Advertising introduced ads such as "Uncle Sam Wants You for the Army" and "Eat Your Wheaties". Marketing then became more personalized, with a focus on brand awareness and problem solving. The digital ad revolution followed, beginning with online advertising in the 1990s and mobile ads in 2000. Today a plethora of user data is collected daily, and harnessing it makes it possible to produce more targeted, personalized ad campaigns that improve customer experience and generate revenue.

### 1.2 Problem

An employee at a fictitious big data marketing company, Insights, LLC, has been tasked with helping the company's customer determine an ideal marketing campaign in San Francisco.

### 1.3 Interest

Insights, LLC has a customer who would like to create more personalized ad campaigns for its target customer segments to increase revenue and customer satisfaction. Given the volume of data collected on its customers and the speed at which it is collected, the customer wants to harness that data to a) increase revenue via new products and services and b) identify user behavior that signals positive or negative trends in customer satisfaction. I am using data science methodology to solve this business problem.

## 2. Data Science Methodology<a name="datasciencemethodology"></a>

### 2.1 Data Requirements – Data Tooling, Sources & Collection

For tooling, I will use Python for data cleansing, manipulation, modeling, analytics, and visualization, and a Jupyter notebook for sharing code and analysis, pushed to GitHub for source control.


The customer has asked me to gather insights for the city of San Francisco.

 

I am using the following data:

- Wikipedia web scrape: neighborhood and population data for the various cities
- Geocoder (Nominatim): retrieval of the latitude and longitude of the neighborhoods
- Foursquare Places API: venue, rating, and likes data for these neighborhoods
- Foursquare Trending: restaurants trending within a specific radius
- Kaggle datasets: SF crime data

The data that I will be using will be both structured & unstructured.

Factors that will influence the marketing campaign include user behavior and neighborhood segmentation via clustering, along with trending venues, crime, the most-frequented venues per neighborhood, and population.
 

In [1]:
#### Import statements

In [2]:
%%capture
!pip install geocoder
!pip install folium

%autosave 3

#import statements, installation of libraries for data visualization/analysis & geographical data retrieval
import pandas as pd
import geocoder
from geopy.geocoders import Nominatim
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate
import folium 
import urllib.request
from urllib.request import urlopen
from pandas import json_normalize  # pandas >= 1.0; older versions expose it as pandas.io.json.json_normalize
from folium.plugins import MarkerCluster
from sklearn.cluster import KMeans

#### Kaggle Data Collection - San Francisco Crime by Neighborhood 

In [3]:
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_8c374c33edfe4768aefdf80edd7dcdf1 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='<YOUR_IBM_API_KEY>',  # credential removed before sharing the notebook
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_8c374c33edfe4768aefdf80edd7dcdf1.get_object(Bucket='datascienceprofessionalcertificat-donotdelete-pr-6tfwiuoky1hud9',Key='sfcrimes.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )
# loads the csv into a dataframe and displays first 3 rows
sfcrimebyneighborhood_df = pd.read_csv(body)
sfcrimebyneighborhood_df.head(3)

| | Dates | Category | Descript | DayOfWeek | PdDistrict | Resolution | Address | X | Y |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5/13/15 23:53 | WARRANTS | WARRANT ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.425892 | 37.774599 |
| 1 | 5/13/15 23:53 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.425892 | 37.774599 |
| 2 | 5/13/15 23:33 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | VANNESS AV / GREENWICH ST | -122.424363 | 37.800414 |


#### San Francisco neighborhood data web scrape

In [4]:
# web scrape of a page: store its table data in a dataframe and display the first 3 rows
res = requests.get("http://www.healthysf.org/bdi/outcomes/zipmap.htm")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table') 
sanfrancisconeighborhood_df = pd.read_html(str(table))
sanfrancisconeighborhood_df = pd.concat(sanfrancisconeighborhood_df)
sanfrancisconeighborhood_df.head(3)

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | | San Francisco Burden of Disease & Injury Study... | | | |
| 1 | | About Site Determinants Health Outcomes Web Li... | | | |
| 0 | About Site | Determinants | Health Outcomes | Web Links | Site Map |


### Foursquare Data Collection to retrieve lat and long of SF neighborhoods, venue data of venues within a proximity of the neighborhoods & like/check-in data on the venues

#### Hide code below for API creds

In [5]:
# @hidden_cell
# credentials removed before sharing the notebook; supply your own Foursquare developer keys
CLIENT_ID = '<YOUR_FOURSQUARE_CLIENT_ID>' # your Foursquare ID
CLIENT_SECRET = '<YOUR_FOURSQUARE_CLIENT_SECRET>' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30
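
One way to keep these credentials out of a shared notebook is to read them from environment variables. The sketch below is only an illustration: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical and not part of the original notebook.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are hypothetical; use whatever names your setup defines.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: {}".format(missing)) from None


# Usage with a stand-in mapping so the example runs without real credentials.
demo_env = {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print(demo_id)
```

In the notebook itself, the call would take no argument so that the values come from the real environment.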

#### To create a geocoder instance we need to define a user_agent. We name our agent <em>foursquare_agent</em>, as shown below, and use a function to get the coordinates of a San Francisco neighborhood (two alternative neighborhoods are left commented out)
#### In addition, I pull nearby venues within a 1,000-meter radius of the coordinates of a specific San Francisco neighborhood, "The Castro"

In [6]:
#function to call geocoder to populate lat, long of the castro neighborhood in san francisco
def getcoords(clientid,secret,vrsn,lmt,address):
    geolocator = Nominatim(user_agent="foursquare_agent")
    location = geolocator.geocode(address,timeout=10)
    latitude = location.latitude
    longitude = location.longitude
    return latitude,longitude

address = 'the castro san francisco'
#address = 'Potrero Hill san francisco'
#address = 'Chinatown san francisco'
coords = getcoords(CLIENT_ID,CLIENT_SECRET,VERSION,LIMIT,address)
print('Coordinates of {}: {}'.format(address, coords))

Coordinates of the castro san francisco: (37.7608561, -122.434957)


#### Obtain nearby venues set of coordinates store in dataframe from JSON for mapping later

In [7]:
## Retrieving category type from rows and returning the names
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
    
## using Foursquare API to explore venues then convert result into a pandas dataframe from JSON
radius=1000 # search radius in meters; the request below uses LIMIT (30) as its result cap
geolocator = Nominatim(user_agent="foursquare_agent",timeout=10)
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude,
    longitude,
    radius, 
    LIMIT)
results = requests.get(url).json()
venues = results['response']['groups'][0]['items']
nearby_venues_df = json_normalize(venues)

## Keep only the venue ID, name, category, latitude and longitude columns
filtered_columns = ['venue.id','venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues_df = nearby_venues_df.loc[:, filtered_columns].copy()
# reduce each row's category list to the name of its first category
nearby_venues_df['venue.categories'] = nearby_venues_df.apply(get_category_type, axis=1)
# clean column names: keep only the part after the last dot
nearby_venues_df.columns = [col.split(".")[-1] for col in nearby_venues_df.columns]
nearby_venues_df.rename(columns={'lat':'Latitude', 'lng':'Longitude'}, inplace=True)
nearby_venues_df

| | id | name | categories | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | 4089ae00f964a520bff21ee3 | Castro Theatre | Indie Movie Theater | 37.762044 | -122.435022 |
| 1 | 49fe3629f964a5207f6f1fe3 | Yoga Tree Castro | Yoga Studio | 37.761051 | -122.436003 |
| 2 | 432a0b00f964a520de271fe3 | Anchor Oyster Bar | Seafood Restaurant | 37.759708 | -122.43491 |
| 3 | 5613f02d498e46b274c37a65 | Philz Coffee | Coffee Shop | 37.760104 | -122.434829 |
| 4 | 58f18932d7627e564887c486 | The Castro Fountain | Ice Cream Shop | 37.760052 | -122.435024 |
| 5 | 4b1f1fb8f964a5202e2424e3 | Eye Gotcha Optometric | Optical Shop | 37.759651 | -122.434967 |
| 6 | 556b072d498e7bfc836cf039 | SoulCycle Castro | Cycle Studio | 37.762309 | -122.435321 |
| 7 | 5148960ee4b0905172eb88de | Castro Dog Park | Dog Run | 37.759811 | -122.436394 |
| 8 | 574f605b498ee997a05fa2d7 | Dog Eared Books | Bookstore | 37.761206 | -122.434959 |
| 9 | 52efb33f498e5300bf66e245 | Réveille Coffee Co. | Coffee Shop | 37.761104 | -122.43443 |
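
To sanity-check that a returned venue really lies within the search radius, the great-circle distance between the neighborhood center and the venue can be computed directly. This is a minimal sketch using the haversine formula; the helper name `haversine_m` is hypothetical, and it assumes the `radius` value of 1000 is expressed in meters.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    earth_radius_m = 6371000.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


castro = (37.7608561, -122.434957)         # geocoded neighborhood center from the earlier output
castro_theatre = (37.762044, -122.435022)  # first row of the venue table

distance = haversine_m(*castro, *castro_theatre)
print(round(distance), distance <= 1000)
```

The same function could be applied to every row of the venue dataframe to flag results that fall outside the requested radius.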


#### Rate limiting: several loops in this notebook append Foursquare data to my dataframe, which triggers rate-limiting errors. To avoid re-running the loop below, which fetches each venue's rating, I took the results of an earlier API run and added them to the dataframe manually in the cell after the commented-out code.

In [8]:
#venues = nearby_venues_df.copy()
#venues['rating'] = ""


#def getratings(clientid,secret,vrsn,pd):
#    for x in range(len(venues)):
#        venue_id = venues['id'].iloc[x]
#        url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)
#        result = requests.get(url).json()
#        venues['rating'].iloc[x] = result['response']['venue']['rating']  
#    print(venues[['name','categories','rating']])
      
#getratings(CLIENT_ID,CLIENT_SECRET,VERSION,venues)
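
An alternative to pasting the ratings in by hand is to cache each API response on disk, so that re-running the notebook does not repeat requests that already succeeded. The following is a rough sketch under that assumption; `cached_fetch` and the stand-in `fake_rating_request` are hypothetical helpers, and a real run would pass a function that performs the Foursquare request.

```python
import json
import os
import tempfile

def cached_fetch(key, fetch_fn, cache_path):
    """Return the cached value for `key`, calling `fetch_fn(key)` only on a cache miss."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            cache = json.load(fh)
    if key not in cache:
        cache[key] = fetch_fn(key)  # the only place a real API call would happen
        with open(cache_path, "w") as fh:
            json.dump(cache, fh)
    return cache[key]


# Stand-in for the ratings request; it records how often it is called.
calls = []

def fake_rating_request(venue_id):
    calls.append(venue_id)
    return 9.4


cache_file = os.path.join(tempfile.mkdtemp(), "ratings_cache.json")
first = cached_fetch("4089ae00f964a520bff21ee3", fake_rating_request, cache_file)
second = cached_fetch("4089ae00f964a520bff21ee3", fake_rating_request, cache_file)
print(first, second, len(calls))
```

The second lookup is served from the cache file, so the request function runs only once per venue ID.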

In [9]:
venues = nearby_venues_df.copy()
venues['rating'] = ""
venues['rating'] = [9.4, 9.1, 9.1, 9.1, 9, 9, 8.9, 8.9, 8.9, 8.9, 8.7, 8.7, 8.6, 8.5, 8.5, 8.5, 8.5, 8.5, 8.5, 8.4, 8.4, 9.1, 8.3, 8.3, 8.3, 8.2, 8.2, 8.2, 8.2, 8.1 ]
venues.sort_values('rating', ascending=False,inplace=True)
venues

| | id | name | categories | Latitude | Longitude | rating |
|---|---|---|---|---|---|---|
| 0 | 4089ae00f964a520bff21ee3 | Castro Theatre | Indie Movie Theater | 37.762044 | -122.435022 | 9.4 |
| 2 | 432a0b00f964a520de271fe3 | Anchor Oyster Bar | Seafood Restaurant | 37.759708 | -122.43491 | 9.1 |
| 3 | 5613f02d498e46b274c37a65 | Philz Coffee | Coffee Shop | 37.760104 | -122.434829 | 9.1 |
| 21 | 4b00cb8ef964a520274122e3 | Frances | New American Restaurant | 37.762765 | -122.432198 | 9.1 |
| 1 | 49fe3629f964a5207f6f1fe3 | Yoga Tree Castro | Yoga Studio | 37.761051 | -122.436003 | 9.1 |
| 4 | 58f18932d7627e564887c486 | The Castro Fountain | Ice Cream Shop | 37.760052 | -122.435024 | 9.0 |
| 5 | 4b1f1fb8f964a5202e2424e3 | Eye Gotcha Optometric | Optical Shop | 37.759651 | -122.434967 | 9.0 |
| 6 | 556b072d498e7bfc836cf039 | SoulCycle Castro | Cycle Studio | 37.762309 | -122.435321 | 8.9 |
| 7 | 5148960ee4b0905172eb88de | Castro Dog Park | Dog Run | 37.759811 | -122.436394 | 8.9 |
| 8 | 574f605b498ee997a05fa2d7 | Dog Eared Books | Bookstore | 37.761206 | -122.434959 | 8.9 |


#### Since the dataframe is sorted by rating in descending order, the last venue in the dataframe has the lowest rating (let's also assume that the ratings differ more widely and that this value is really low)

In [None]:
venue_id = '4a5cf00cf964a520e6bc1fe3' # ID of the lowest-rated venue (Louie's Barber Shop, per the output below)
url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)

try:
    result = requests.get(url, timeout=10).json()
    print('The rating of the venue', result['response']['venue']['name'], 'is', result['response']['venue']['rating'])
except requests.exceptions.Timeout:
    print('Service timed out, please execute again or increase the timeout')
except KeyError:
    # assumption: a rejected request (for example, quota exceeded) returns JSON without a 'venue' entry
    print('No rating returned (possibly rate limited), please use different creds or wait until the reset')

The rating of the venue Louie's Barber Shop is 8.1
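
Rather than relying on the sort order, the lowest-rated venue can also be selected directly. The sketch below uses a handful of (name, rating) pairs taken from the outputs above; `lowest_rated` is a hypothetical helper, not part of the original notebook.

```python
# (name, rating) pairs taken from the notebook outputs
rated_venues = [
    ("Castro Theatre", 9.4),
    ("Frances", 9.1),
    ("Louie's Barber Shop", 8.1),
    ("Philz Coffee", 9.1),
]

def lowest_rated(venues):
    """Return the (name, rating) pair with the smallest rating, regardless of order."""
    return min(venues, key=lambda pair: pair[1])


print(lowest_rated(rated_venues))
```

With a pandas dataframe, a similar result could be obtained from the row at `venues['rating'].idxmin()`.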


### 2.3 Data Cleansing & Preprocessing

Now that we have sourced our data through web scraping and downloads, we must cleanse and preprocess it. We will identify missing data, duplicates, anomalies, and corruption in our dataframes and fix or remove them. We will also rename columns and merge some dataframes.

#### San Francisco neighborhood data cleanse/preprocessing

##### We add two empty columns (lat, long) to the San Francisco neighborhood table to store the coordinates of each neighborhood

In [None]:
# Rename columns to more meaningful names
sanfrancisconeighborhood_df.columns = ['Zip Code', 'Neighborhood' , 'Population', 'Drop1', 'Drop2']
# Drop NaN columns 
sanfrancisconeighborhood_df = sanfrancisconeighborhood_df.drop(columns=['Drop1', 'Drop2'])
# Drop duplicate values in column Zip Code
sanfrancisconeighborhood_df = sanfrancisconeighborhood_df.drop_duplicates(subset='Zip Code', keep='first')
# Sort values by zip code so irrelevant data goes to bottom
sanfrancisconeighborhood_df = sanfrancisconeighborhood_df.sort_values('Zip Code')
# Count of remaining records
print(sanfrancisconeighborhood_df.shape[0])
# Drop the last six rows, which hold irrelevant data
sanfrancisconeighborhood_df = sanfrancisconeighborhood_df.iloc[0:-6,]
# Fix Formatting by renaming values in Neighborhood column to be able to use Foursquare API
sanfrancisconeighborhood_df['Neighborhood'] = sanfrancisconeighborhood_df['Neighborhood'].replace({'Hayes Valley/Tenderloin/North of Market': 'Tenderloin', 'Polk/Russian Hill (Nob Hill)': 'Nob Hill', 'St. Francis Wood/Miraloma/West Portal': 'Miraloma', 'Visitacion Valley/Sunnydale': 'Visitacion Valley', 'Twin Peaks-Glen Park':'Twin Peaks'})
sanfrancisconeighborhood_df['Neighborhood'] = sanfrancisconeighborhood_df['Neighborhood'].replace({'Inner Mission/Bernal Heights': 'Mission District', 'Ingelside-Excelsior/Crocker-Amazon': 'Excelsior', 'Castro/Noe Valley': 'Castro', 'Western Addition/Japantown': 'Japantown', 'Parkside/Forest Hill':'Forest Hill', 'North Beach/Chinatown':'North Beach'})
# Add empty columns to hold each neighborhood's coordinates
sanfrancisconeighborhood_df['long'] = ""
sanfrancisconeighborhood_df['lat'] = ""
# Print result
sanfrancisconeighborhood_df

27


| | Zip Code | Neighborhood | Population | long | lat |
|---|---|---|---|---|---|
| 1 | 94102 | Tenderloin | 28991 | | |
| 2 | 94103 | South of Market | 23016 | | |
| 3 | 94107 | Potrero Hill | 17368 | | |
| 4 | 94108 | Chinatown | 13716 | | |
| 5 | 94109 | Nob Hill | 56322 | | |
| 6 | 94110 | Mission District | 74633 | | |
| 7 | 94112 | Excelsior | 73104 | | |
| 8 | 94114 | Castro | 30574 | | |
| 9 | 94115 | Japantown | 33115 | | |
| 10 | 94116 | Forest Hill | 42958 | | |


##### We create a function that calls Nominatim (which converts addresses into locations), then loop through the neighborhoods, writing each one's latitude and longitude back into the dataframe

In [None]:
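def getcoords1(clientid,secret,vrsn,lmt,df):
    # df is the neighborhood dataframe to fill in (the parameter was named pd before, shadowing pandas)
    geolocator = Nominatim(user_agent="foursquare_agent")
    lat_col = df.columns.get_loc('lat')
    long_col = df.columns.get_loc('long')
    for x in range(len(df)):
        location = geolocator.geocode(df['Neighborhood'].iloc[x],timeout=10)
        if location is None:
            continue # no match for this neighborhood, leave its coordinates empty
        # positional assignment avoids the chained-indexing pitfall of df['lat'].iloc[x] = ...
        df.iloc[x, lat_col] = location.latitude
        df.iloc[x, long_col] = location.longitude
    print(df)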
    
        
getcoords1(CLIENT_ID,CLIENT_SECRET,VERSION,LIMIT,sanfrancisconeighborhood_df)
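
Geocoding a bare neighborhood name such as "Chinatown" can match a place in another city, and a query with no match returns `None`. The sketch below shows one possible way to guard against both, assuming a city suffix is appended to every query; `geocode_neighborhoods`, `FakeLocation`, and `fake_geocode` are hypothetical names, and the fake geocoder only stands in for `geolocator.geocode` so the example runs offline.

```python
def geocode_neighborhoods(names, geocode_fn, city="San Francisco, CA"):
    """Map each neighborhood name to (lat, lon), or None when no match is found.

    `geocode_fn` is any callable that takes a query string and returns an object
    with `.latitude` and `.longitude`, or None when nothing matches.
    """
    coords = {}
    for name in names:
        location = geocode_fn("{}, {}".format(name, city))
        coords[name] = None if location is None else (location.latitude, location.longitude)
    return coords


# Stand-in geocoder so the example runs offline; a real run would pass geolocator.geocode.
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude, self.longitude = latitude, longitude


def fake_geocode(query):
    known = {"Castro, San Francisco, CA": FakeLocation(37.7608561, -122.434957)}
    return known.get(query)


neighborhood_coords = geocode_neighborhoods(["Castro", "Nowhereville"], fake_geocode)
print(neighborhood_coords)
```

Neighborhoods that come back as `None` can then be reviewed by hand instead of silently receiving wrong coordinates.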

#### Preprocessing of SF Crime Data

##### We rename the columns & drop blank values in the crime dataset

In [None]:
#Rename columns to names more meaningful
sfcrimebyneighborhood_df.rename(columns={'PdDistrict':'Neighborhood'}, inplace=True)
sfcrimebyneighborhood_df.rename(columns={'X':'Longitude'}, inplace=True)
sfcrimebyneighborhood_df.rename(columns={'Y':'Latitude'}, inplace=True)
sfcrimebyneighborhood_df.rename(columns={'Descript':'Description'}, inplace=True)
#Drop NaN values
sfcrimebyneighborhood_df=sfcrimebyneighborhood_df.dropna(subset=['Longitude'])
sfcrimebyneighborhood_df=sfcrimebyneighborhood_df.dropna(subset=['Latitude'])
sfcrimebyneighborhood_df.head(3)
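
As a quick check after preprocessing, crimes can be tallied per category and per district. The sketch below does this with the standard library on the three sample rows shown in the crime data preview; on the full dataframe, `value_counts()` on the corresponding column would be the natural equivalent.

```python
from collections import Counter

# The three sample rows from the crime data preview: (Category, PdDistrict)
sample_crimes = [
    ("WARRANTS", "NORTHERN"),
    ("OTHER OFFENSES", "NORTHERN"),
    ("OTHER OFFENSES", "NORTHERN"),
]

by_category = Counter(category for category, _ in sample_crimes)
by_district = Counter(district for _, district in sample_crimes)

print(by_category.most_common(1))
print(by_district)
```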

### 2.4	Exploratory Data Analysis 

#### The dataframe below holds check-ins for the most popular venues (> 10,000 check-ins) over one day of data. For the sake of the assignment, let's assume all data is current and disregard the date.

In [None]:
# Check-in data for the most checked-into venues
body = client_8c374c33edfe4768aefdf80edd7dcdf1.get_object(Bucket='datascienceprofessionalcertificat-donotdelete-pr-6tfwiuoky1hud9',Key='FoursquareCheckIns.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

checkindatadf = pd.read_csv(body)
checkindatadf.head()


# The data in the below dataframe is the POIs that are checked in to 
body = client_8c374c33edfe4768aefdf80edd7dcdf1.get_object(Bucket='datascienceprofessionalcertificat-donotdelete-pr-6tfwiuoky1hud9',Key='FoursquarePOIs.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

pointsofinterestdf = pd.read_csv(body)
pointsofinterestdf.sort_values('venue', inplace=True)
pointsofinterestdf

pointsofinterestdf['venue'].value_counts()

# The data in the below dataframe is city data where the venues are located
body = client_8c374c33edfe4768aefdf80edd7dcdf1.get_object(Bucket='datascienceprofessionalcertificat-donotdelete-pr-6tfwiuoky1hud9',Key='FoursquareCities.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

citydatadf = pd.read_csv(body)
citydatadf


In [None]:
##### We will explore the details of a venue at a specific set of coordinates; what we know is that it is a frequently checked-into venue

In [None]:
venue_id = '4fa862b3e4b0ebff2f749f06' # ID of Harry's Italian Pizza Bar
url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)

result = requests.get(url).json()
result['response']['venue']

#### Cluster Neighborhoods with population >  30000

In [None]:
#convert Population to numeric and keep the rows with population > 30000
sanfrancisconeighborhood_df[["Population"]] = sanfrancisconeighborhood_df[["Population"]].apply(pd.to_numeric)
sanfrancisconeighborhood_highpop_df = sanfrancisconeighborhood_df[(sanfrancisconeighborhood_df["Population"] > 30000)]
sanfrancisconeighborhood_highpop_df

In [None]:
#produce a map that shows neighborhood clusters
clusterneighmap=folium.Map(location=[latitude,longitude],zoom_start=11)
for lat,lng,neighborhood,population in zip(sanfrancisconeighborhood_highpop_df['lat'],sanfrancisconeighborhood_highpop_df['long'],sanfrancisconeighborhood_highpop_df['Neighborhood'],sanfrancisconeighborhood_highpop_df['Population']):
    label='{}, {}'.format(neighborhood,population)
    label=folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(clusterneighmap)

clusterneighmap

#### Merge both the nearby venues of the San Francisco neighborhood "The Castro" & San Francisco crimes

In [None]:
#merge venues and crimes dataframe
mergeddf = pd.concat([nearby_venues_df, sfcrimebyneighborhood_df], axis=0, join='outer', ignore_index=False)
mergeddf = mergeddf.head(1000)
mergeddf.sort_values(by=['Latitude'], inplace=True)
mergeddf

#### Render nearby venues & add markers for San Francisco crimes within the vicinity of the San Francisco neighborhood "The Castro"

In [None]:
# generate map centered around the Castro district
#venues_map=folium.Map(location=[latitude, longitude], zoom_start=18)
venues_map=folium.Map(location=[latitude, longitude], zoom_start=30)

# add a red circle marker to represent the castro neighborhood
folium.CircleMarker(
    [latitude, longitude],
    radius=10,
    color='red',
    popup='the castro san francisco',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
).add_to(venues_map)

# add the venues within the proximity as blue circle markers
for lat, lng, nm, label in zip(nearby_venues_df.Latitude, nearby_venues_df.Longitude, nearby_venues_df.categories, nearby_venues_df.name):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(venues_map)
    
marker_cluster = MarkerCluster().add_to(venues_map)
# add one clustered marker per crime, labelled with its category
# (same rows as before: the first len(mergeddf) records of the crime dataframe)
crimes = sfcrimebyneighborhood_df.head(len(mergeddf))
for lat, lng, category in zip(crimes['Latitude'], crimes['Longitude'], crimes['Category']):
    folium.Marker([lat, lng], popup=category).add_to(marker_cluster)
   
# display map
venues_map

#### One Hot Encoding

##### Most machine learning algorithms, k-means included, cannot work with categorical data directly. Categorical data must first be converted to numbers, which we do by one-hot encoding. We will cluster the venues near our Castro location

In [None]:
#drop the id column, it is unnecessary for k-means
nearby_venues_df.drop(['id'],axis=1, inplace=True)
#this will perform the one hot encoding
nearby_venues_df_keep = nearby_venues_df[['name','categories']] # stores the two columns we will add back in later
nearby_venues_df_encode = pd.get_dummies(nearby_venues_df, prefix=['name','categories'], drop_first=False)
nearby_venues_df_encode.head(5)
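
To make the encoding step concrete, here is a small standard-library sketch of what one-hot encoding produces for a categorical column: one 0/1 column per distinct category, with exactly one 1 per row. The helper `one_hot` is hypothetical and only illustrates the idea behind `pd.get_dummies`.

```python
def one_hot(values):
    """Return (column_names, rows) where each row has a single 1 marking its category."""
    columns = sorted(set(values))
    index = {category: i for i, category in enumerate(columns)}
    rows = []
    for value in values:
        row = [0] * len(columns)
        row[index[value]] = 1
        rows.append(row)
    return columns, rows


categories = ["Coffee Shop", "Yoga Studio", "Coffee Shop", "Bookstore"]
columns, rows = one_hot(categories)
print(columns)
for row in rows:
    print(row)
```

Note that encoding a column whose values are all distinct, such as a venue name, yields one column per row and adds little information for clustering.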

###### We run the k-means algorithm with 3 clusters and add the cluster labels to the dataframe

In [None]:
num_clusters = 3

k_means = KMeans(init="k-means++", n_clusters=num_clusters, n_init=12)
k_means.fit(nearby_venues_df_encode) #add dataframe
#labels = k_means.labels_
nearby_venues_df_encode["Cluster Labels"] = k_means.labels_
nearby_venues_df_encode_labels=nearby_venues_df_encode
nearby_venues_df_encode_labels.head(5)

In [None]:
#prints cluster labels to see them clustered by 0,1,2... three clusters
labels = k_means.labels_
print(labels)
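
For intuition about what the fit step does, the sketch below implements the basic k-means loop (Lloyd's algorithm) on 2-D points using only the standard library. It is an illustration under simplifying assumptions, not scikit-learn's implementation: the function `kmeans` is hypothetical, the sample points are made up, and centroids start at the first k points rather than using k-means++.

```python
def kmeans(points, k, iterations=20):
    """Cluster 2-D points with Lloyd's algorithm; returns (labels, centroids).

    Centroids start at the first k points, which keeps the sketch deterministic.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Two made-up groups of points, well separated from each other.
points = [(0.0, 0.0), (10.0, 10.0), (0.5, 0.2), (9.5, 10.3), (0.1, 0.6), (10.2, 9.8)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Because the distance is computed across all features, columns on very different scales (for example coordinates next to 0/1 indicator columns) can dominate or be dominated in the result.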

In [None]:
# add original name & categories column back to dataframe
mergeencode = pd.concat([nearby_venues_df_keep, nearby_venues_df_encode_labels], axis=1, join='outer', ignore_index=False)
mergeencode.head(5)

In [None]:
#group by venue type
mergeencodecatgrouped = mergeencode.groupby(['categories']).mean(numeric_only=True).reset_index() # numeric_only skips the text 'name' column
mergeencodecatgrouped.head(5)

In [None]:
#sort by cluster labels
mergeencodecatgrouped.sort_values(["Cluster Labels"], inplace=True)
mergeencodecatgrouped.head(5)
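
Averaging cluster labels within a category can produce values that are not real clusters, since the labels are identifiers rather than quantities. One alternative, sketched below with made-up labels, is to take the most frequent label per category; `majority_cluster_by_category` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter, defaultdict

def majority_cluster_by_category(rows):
    """Map each category to its most frequent cluster label.

    `rows` is an iterable of (category, cluster_label) pairs.
    """
    labels_by_category = defaultdict(list)
    for category, label in rows:
        labels_by_category[category].append(label)
    return {
        category: Counter(labels).most_common(1)[0][0]
        for category, labels in labels_by_category.items()
    }


# Made-up labels: averaging the Coffee Shop labels (0, 2, 2) would give about 1.33,
# which does not correspond to any cluster.
sample_rows = [("Coffee Shop", 0), ("Coffee Shop", 2), ("Coffee Shop", 2), ("Bookstore", 1)]
majority = majority_cluster_by_category(sample_rows)
print(majority)
```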

#### Map generation of the venue clusters

In [None]:
#produce a map that shows the venue clusters
mergeencodecatgrouped.rename(columns={'Cluster Labels':'clusterlabels'}, inplace=True)

venuemapclusters=folium.Map(location=[latitude,longitude],zoom_start=15)
for lat,lng,ven,clus in zip(mergeencodecatgrouped['Latitude'],mergeencodecatgrouped['Longitude'],mergeencodecatgrouped['categories'],mergeencodecatgrouped['clusterlabels']):
    label='{}, {}'.format(ven,clus)
    label=folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(venuemapclusters)

venuemapclusters

> #### 2.5.1	K-Means Clustering

## 3. Results<a name="results"></a>

These results come from plotting the nearby venues within a 1,000-meter radius of the Castro.

## 4. Discussion<a name="discussion"></a>

## 5. Conclusion<a name="conclusion"></a>