# IBM Applied Data Science Capstone Project

## Final Notebook

#### *Opening a New Gym in the city of Hayward, California*

    - Build a dataframe of neighborhoods in Hayward, California
    - Use web scraping to extract data from a specific Wikipedia page
    - Get the geographical coordinates of the neighborhoods
    - Use the Foursquare API to get venue data for the neighborhoods
    - Survey and cluster the neighborhoods
    - Select the best cluster in which to open a brand new gym


### 1. **Import Libraries**

In [1]:
# library to handle data in a vectorized manner
import numpy as np 

# library for data analysis
import pandas as pd 
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

# library to handle JSON files
import json

# convert an address into latitude and longitude values
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim 

!pip install geopy

# library to handle requests
import requests

# library to parse HTML and XML documents
from bs4 import BeautifulSoup

# transform JSON file into a pandas dataframe
from pandas.io.json import json_normalize

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# map rendering library
!conda install -c conda-forge folium=0.5.0 --yes 
import folium 

print("Libraries Imported")

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

Libraries Imported


### 2. **Use Web Scraping to scrape data from a specified Wikipedia page into a DataFrame**

In [2]:
# send the GET request
data = requests.get("https://en.wikipedia.org/wiki/Category:Neighborhoods_in_Hayward,_California").text

In [3]:
# interpret/parse data from the wikipedia html page into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')

In [4]:
# create a list to store neighborhood data
nhoodlist = []

In [5]:
# append the data into the list
for row in soup.findAll("li"):
    nhoodlist.append(row.text)

In [22]:
# create a new DataFrame from the list
hwd_df = pd.DataFrame({"Neighborhood": nhoodlist})

# drop rows 5 through 52 so that only the first five entries (the neighborhood names) remain
hwd_df = hwd_df.drop(index=list(range(5, 53)))

hwd_df

Unnamed: 0,Neighborhood
0,Downtown Hayward
1,"Eden Landing, California"
2,"Hayward Heath, California"
3,"Mount Eden, California"
4,"Schafer Park, California"


In [23]:
hwd_df.shape

(5, 1)
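
Dropping rows by position depends on the page layout at the time of scraping. One hedged alternative is to collect only the list items inside the category listing itself. The sketch below assumes the listing sits in an element with `id="mw-pages"` (an assumption about the category page markup, not something this notebook confirms) and runs on a small hypothetical HTML sample instead of a live request. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages; `ScopedListParser` is an illustrative name.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the category page markup.
SAMPLE_HTML = """
<div id="mw-navigation"><ul><li>Main page</li><li>Contents</li></ul></div>
<div id="mw-pages">
  <ul>
    <li><a href="#">Downtown Hayward</a></li>
    <li><a href="#">Eden Landing, California</a></li>
    <li><a href="#">Mount Eden, California</a></li>
  </ul>
</div>
<div id="footer"><ul><li>Privacy policy</li></ul></div>
"""


class ScopedListParser(HTMLParser):
    """Collect the text of <li> items nested inside one container <div>."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.div_depth = 0      # > 0 while inside the target container
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1
            elif dict(attrs).get("id") == self.container_id:
                self.div_depth = 1
        elif tag == "li" and self.div_depth > 0:
            self.in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        # Accumulate text only while inside a list item of the container.
        if self.in_item:
            self.items[-1] += data


parser = ScopedListParser("mw-pages")
parser.feed(SAMPLE_HTML)
nhood_sample = [item.strip() for item in parser.items]
print(nhood_sample)
```

With a scoped selection like this, navigation and footer entries never enter the list, so no index-based cleanup is needed.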

### 3. **Get the geographical coordinates of the neighborhoods**

In [24]:
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_a643b0c77d134b02867209727df4d917 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='<REDACTED>',  # credential removed; supply your own API key
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_a643b0c77d134b02867209727df4d917.get_object(Bucket='recommendationtoopenagyminthecity-donotdelete-pr-qm4ehpty7hqqo8',Key='HaywardNew.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

# If you are reading an Excel file into a pandas DataFrame, replace `read_csv` by `read_excel` in the next statement.

df_data_0 = pd.read_csv(body)
df_data_0.head()

Unnamed: 0,Zip,City,State,latitude,longitude
0,94544,Hayward,CA,37.633732,-122.06101
1,94557,Hayward,CA,37.680181,-121.921498
2,94543,Hayward,CA,37.680181,-121.921498
3,94542,Hayward,CA,37.657381,-122.05076
4,94545,Hayward,CA,37.635582,-122.10418


In [25]:
# create temporary dataframe to populate the coordinates into latitude and longitude
hwd_coords = pd.DataFrame(df_data_0, columns=['latitude', 'longitude'])
hwd_coords

Unnamed: 0,latitude,longitude
0,37.633732,-122.06101
1,37.680181,-121.921498
2,37.680181,-121.921498
3,37.657381,-122.05076
4,37.635582,-122.10418
5,37.680181,-121.921498
6,37.674431,-122.08883


In [26]:
# number of rows and columns using shape
hwd_coords.shape

(7, 2)

In [27]:
# number of dimensions
hwd_coords.ndim

2

In [28]:
# copy the coordinates into the original dataframe; the assignment aligns on the row index,
# so only the coordinate rows with index 0-4 are used
hwd_df['latitude'] = hwd_coords['latitude']
hwd_df['longitude'] = hwd_coords['longitude']

# check the neighborhood and coordinates
print(hwd_df.shape)
hwd_df

(5, 3)


Unnamed: 0,Neighborhood,latitude,longitude
0,Downtown Hayward,37.633732,-122.06101
1,"Eden Landing, California",37.680181,-121.921498
2,"Hayward Heath, California",37.680181,-121.921498
3,"Mount Eden, California",37.657381,-122.05076
4,"Schafer Park, California",37.635582,-122.10418
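
The column assignment above pairs rows by index label rather than by neighborhood name, which is why the seven-row coordinate table yields a five-row result. A minimal sketch of that behavior, using a few values copied from the tables above (the frame names `names` and `coords` are illustrative):

```python
import pandas as pd

# Three neighborhoods and four coordinate rows, values taken from the tables above.
names = pd.DataFrame({"Neighborhood": ["Downtown Hayward",
                                       "Eden Landing, California",
                                       "Hayward Heath, California"]})
coords = pd.DataFrame({"latitude": [37.633732, 37.680181, 37.680181, 37.657381],
                       "longitude": [-122.06101, -121.921498, -121.921498, -122.05076]})

# Column assignment matches rows by index label, not by neighborhood name.
names["latitude"] = coords["latitude"]
names["longitude"] = coords["longitude"]

# The fourth coordinate row has no matching index in `names` and is ignored.
print(names.shape)
print(names)
```

Because the source CSV is keyed by ZIP code rather than by neighborhood, the pairing relies entirely on row order; an explicit key column shared by both tables would make it checkable.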


In [30]:
# DataFrame saved as a CSV File
hwd_df.to_csv("hwd_df.csv", index=False)

### 4. **Create the map of the neighborhoods in the city of Hayward**

In [31]:
# Get the geographical coordinates of the city of Hayward

address = 'Hayward, California' 
geolocator = Nominatim(user_agent="ca_explorer") 
location = geolocator.geocode(address) 
latitude = location.latitude 
longitude = location.longitude 
print('The geographical coordinates of Hayward, California are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Hayward, California are 37.6688205, -122.0807964.


In [32]:
# create map of hayward using latitude and longitude values
map_hwd = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(hwd_df['latitude'], hwd_df['longitude'], hwd_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_hwd)  
    
map_hwd

In [33]:
# save the map as HTML file
map_hwd.save('map_hwd.html')

### 5. **Utilize Foursquare API to analyze the neighborhoods**

In [34]:
# define Foursquare credentials and version
CLIENT_ID = '<REDACTED>' # your Foursquare ID
CLIENT_SECRET = '<REDACTED>' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: <REDACTED>
CLIENT_SECRET:<REDACTED>


In [35]:
# Get the top 100 venues within a radius of 9000 meters (about 5.6 miles)

radius = 9000
LIMIT = 100

venues = []

for lat, long, neighborhood in zip(hwd_df['latitude'], hwd_df['longitude'], hwd_df['Neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
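
The loop above indexes `categories[0]` directly, which would raise an `IndexError` if a venue ever came back with an empty category list. The following is a minimal sketch of a more defensive extraction step. `extract_venues` is a hypothetical helper, and `sample_payload` is hand-written in the same nested shape the loop reads (`response`, `groups`, `items`, `venue`); it is not real API output, and no network request is made.

```python
def extract_venues(payload, neighborhood, lat, lng):
    """Flatten one explore-style response into tuples like those built above."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None instead of raising when no category is present.
        category = categories[0]["name"] if categories else None
        rows.append((neighborhood, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# Hand-written sample in the same nested shape; not real API output.
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Venue A", "location": {"lat": 37.63, "lng": -122.05},
               "categories": [{"name": "Gym"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": 37.64, "lng": -122.06},
               "categories": []}},
]}]}}

rows = extract_venues(sample_payload, "Downtown Hayward", 37.633732, -122.06101)
for row in rows:
    print(row)
```

Keeping the parsing in a function also makes it possible to test the extraction on saved responses without spending API quota.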

In [36]:
# convert the venues list into a new DataFrame
hwdvenues_df = pd.DataFrame(venues)

# define the column names
hwdvenues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(hwdvenues_df.shape)
hwdvenues_df.head()

(500, 7)


Unnamed: 0,Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,Downtown Hayward,37.633732,-122.06101,Bob's Hoagy Steaks,37.629506,-122.048341,Sandwich Place
1,Downtown Hayward,37.633732,-122.06101,Tacos Uruapan,37.622047,-122.056986,Mexican Restaurant
2,Downtown Hayward,37.633732,-122.06101,Pupuseria Las Cabanas,37.627032,-122.043346,Latin American Restaurant
3,Downtown Hayward,37.633732,-122.06101,Bronco Billy's Pizza Palace,37.655525,-122.048817,Pizza Place
4,Downtown Hayward,37.633732,-122.06101,Garin/Dry Creek Pioneer Regional Parks,37.628355,-122.029091,Park
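
As a sanity check on the search radius, the distance between a neighborhood point and a returned venue can be computed with the haversine formula. This is a sketch under the usual spherical-Earth approximation; `haversine_m` is an illustrative helper, and the coordinates are the first row of the table above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters (approximation)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Neighborhood point and first venue from the table above.
distance = haversine_m(37.633732, -122.06101, 37.629506, -122.048341)
print(round(distance), "meters; inside the 9000 m radius:", distance <= 9000)

# Unit conversion for the radius used in the request.
print(round(9000 / 1609.344, 2), "miles in 9000 meters")
```

Note that a 9000 meter radius around each neighborhood covers most of the city, so the venue sets of nearby neighborhoods overlap heavily.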


In [37]:
# Count the venues returned for each neighborhood

hwdvenues_df.groupby(["Neighborhood"]).count()

Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
Downtown Hayward,100,100,100,100,100,100
"Eden Landing, California",100,100,100,100,100,100
"Hayward Heath, California",100,100,100,100,100,100
"Mount Eden, California",100,100,100,100,100,100
"Schafer Park, California",100,100,100,100,100,100


In [40]:
# check how many unique categories there are
print('There are {} unique categories.'.format(len(hwdvenues_df['VenueCategory'].unique())))

There are 102 unique categories.


In [42]:
# let's print the list of categories
hwdvenues_df['VenueCategory'].unique()

array(['Sandwich Place', 'Mexican Restaurant',
       'Latin American Restaurant', 'Pizza Place', 'Park', 'Café',
       'Grocery Store', 'Warehouse Store', 'Vietnamese Restaurant',
       'Burger Joint', 'Steakhouse', 'Fast Food Restaurant', 'Food Court',
       'General Entertainment', 'Coffee Shop', 'Pool', 'Smoke Shop',
       'Chinese Restaurant', 'Supermarket', 'Pharmacy', 'Lingerie Store',
       'Breakfast Spot', 'Bar', 'Farmers Market', 'Asian Restaurant',
       'Yoga Studio', "Doctor's Office", 'Bakery', 'Japanese Restaurant',
       'Hunan Restaurant', 'Diner', 'Motorsports Shop', 'BBQ Joint',
       'Hot Dog Joint', 'Movie Theater', 'Golf Course', 'Gym',
       'Frozen Yogurt Shop', 'Thai Restaurant', 'Gym / Fitness Center',
       'Juice Bar', 'American Restaurant', 'Dessert Shop',
       'Deli / Bodega', 'Ramen Restaurant', 'Health & Beauty Service',
       'Ice Cream Shop', 'Sushi Restaurant', 'Indian Restaurant',
       'Dumpling Restaurant', 'Donut Shop', 'Italian Res

In [45]:
# check whether the results contain the "Gym" category
"Gym" in hwdvenues_df['VenueCategory'].unique()

True
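
The membership test above matches only the exact label "Gym". The venue categories in this notebook also include "Gym / Fitness Center" and "Gymnastics Gym", so an analyst may want to see how many rows a broader match would capture. The sketch below uses a small hypothetical sample built from those category names; whether related categories should count as competition is a judgment call that the notebook does not make.

```python
import pandas as pd

# Small hypothetical sample using category names that appear in this notebook.
sample = pd.DataFrame({"VenueCategory": [
    "Gym", "Gym / Fitness Center", "Gymnastics Gym",
    "Yoga Studio", "Coffee Shop", "Gym",
]})

# Exact label match, as used in the cell above.
exact = int((sample["VenueCategory"] == "Gym").sum())

# Plain substring match (regex=False) that also catches related labels.
related = int(sample["VenueCategory"].str.contains("Gym", regex=False).sum())

print("exact 'Gym' matches:", exact)
print("categories containing 'Gym':", related)
```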

### 6. **Explore each neighborhood**

In [46]:
# one hot encoding
hwd_onehot = pd.get_dummies(hwdvenues_df[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
hwd_onehot['Neighborhoods'] = hwdvenues_df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [hwd_onehot.columns[-1]] + list(hwd_onehot.columns[:-1])
hwd_onehot = hwd_onehot[fixed_columns]

print(hwd_onehot.shape)
hwd_onehot.head()

(500, 103)


Unnamed: 0,Neighborhoods,Afghan Restaurant,Airport,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,BBQ Joint,Bakery,Bank,Bar,Beer Garden,Beer Store,Breakfast Spot,Bubble Tea Shop,Burger Joint,Burmese Restaurant,Café,Chinese Restaurant,City Hall,Coffee Shop,Comedy Club,Comic Shop,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Department Store,Dessert Shop,Diner,Discount Store,Doctor's Office,Donut Shop,Dumpling Restaurant,Fabric Shop,Fair,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Food Court,Food Truck,Frozen Yogurt Shop,Furniture / Home Store,Garden,Gastropub,General Entertainment,Golf Course,Grocery Store,Gun Shop,Gym,Gym / Fitness Center,Gymnastics Gym,Hawaiian Restaurant,Health & Beauty Service,Hot Dog Joint,Hotel,Hunan Restaurant,Ice Cream Shop,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Juice Bar,Latin American Restaurant,Lingerie Store,Liquor Store,Market,Martial Arts Dojo,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Motorsports Shop,Movie Theater,New American Restaurant,Paper / Office Supplies Store,Park,Pet Store,Pharmacy,Pizza Place,Playground,Pool,Pub,Ramen Restaurant,Resort,Sandwich Place,Shopping Mall,Smoke Shop,Smoothie Shop,Spa,Sporting Goods Shop,Steakhouse,Supermarket,Sushi Restaurant,Thai Restaurant,Theater,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Warehouse Store,Wine Shop,Yoga Studio
0,Downtown Hayward,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Downtown Hayward,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Downtown Hayward,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Downtown Hayward,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Downtown Hayward,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [47]:
# Group rows by neighborhood by taking the mean of each category

hwd_grouped = hwd_onehot.groupby(["Neighborhoods"]).mean().reset_index()

print(hwd_grouped.shape)
hwd_grouped

(5, 103)


Unnamed: 0,Neighborhoods,Afghan Restaurant,Airport,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,BBQ Joint,Bakery,Bank,Bar,Beer Garden,Beer Store,Breakfast Spot,Bubble Tea Shop,Burger Joint,Burmese Restaurant,Café,Chinese Restaurant,City Hall,Coffee Shop,Comedy Club,Comic Shop,Convenience Store,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Department Store,Dessert Shop,Diner,Discount Store,Doctor's Office,Donut Shop,Dumpling Restaurant,Fabric Shop,Fair,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Food Court,Food Truck,Frozen Yogurt Shop,Furniture / Home Store,Garden,Gastropub,General Entertainment,Golf Course,Grocery Store,Gun Shop,Gym,Gym / Fitness Center,Gymnastics Gym,Hawaiian Restaurant,Health & Beauty Service,Hot Dog Joint,Hotel,Hunan Restaurant,Ice Cream Shop,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Juice Bar,Latin American Restaurant,Lingerie Store,Liquor Store,Market,Martial Arts Dojo,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Motorsports Shop,Movie Theater,New American Restaurant,Paper / Office Supplies Store,Park,Pet Store,Pharmacy,Pizza Place,Playground,Pool,Pub,Ramen Restaurant,Resort,Sandwich Place,Shopping Mall,Smoke Shop,Smoothie Shop,Spa,Sporting Goods Shop,Steakhouse,Supermarket,Sushi Restaurant,Thai Restaurant,Theater,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Warehouse Store,Wine Shop,Yoga Studio
0,Downtown Hayward,0.01,0.0,0.02,0.01,0.01,0.0,0.0,0.02,0.03,0.0,0.01,0.0,0.0,0.03,0.0,0.05,0.0,0.03,0.04,0.0,0.08,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.01,0.01,0.01,0.0,0.0,0.0,0.01,0.03,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.01,0.04,0.0,0.02,0.01,0.0,0.0,0.01,0.01,0.0,0.01,0.02,0.01,0.01,0.01,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.06,0.0,0.01,0.01,0.0,0.0,0.05,0.01,0.01,0.02,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.01,0.01,0.03,0.0,0.0,0.0,0.0,0.04,0.01,0.0,0.02
1,"Eden Landing, California",0.01,0.0,0.02,0.0,0.0,0.01,0.01,0.0,0.03,0.0,0.0,0.01,0.01,0.01,0.0,0.04,0.01,0.01,0.0,0.0,0.05,0.01,0.01,0.01,0.03,0.01,0.0,0.02,0.0,0.0,0.01,0.0,0.01,0.0,0.01,0.01,0.01,0.01,0.02,0.0,0.0,0.01,0.01,0.01,0.0,0.0,0.0,0.01,0.04,0.01,0.01,0.0,0.01,0.0,0.0,0.01,0.01,0.0,0.01,0.03,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.02,0.01,0.02,0.01,0.0,0.01,0.01,0.01,0.04,0.01,0.0,0.02,0.0,0.01,0.01,0.0,0.01,0.05,0.02,0.0,0.01,0.01,0.02,0.0,0.0,0.04,0.0,0.0,0.01,0.02,0.01,0.01,0.0,0.01,0.01
2,"Hayward Heath, California",0.01,0.0,0.02,0.0,0.0,0.01,0.01,0.0,0.03,0.0,0.0,0.01,0.01,0.01,0.0,0.04,0.01,0.01,0.0,0.0,0.05,0.01,0.01,0.01,0.03,0.01,0.0,0.02,0.0,0.0,0.01,0.0,0.01,0.0,0.01,0.01,0.01,0.01,0.02,0.0,0.0,0.01,0.01,0.01,0.0,0.0,0.0,0.01,0.04,0.01,0.01,0.0,0.01,0.0,0.0,0.01,0.01,0.0,0.01,0.03,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.02,0.01,0.02,0.01,0.0,0.01,0.01,0.01,0.04,0.01,0.0,0.02,0.0,0.01,0.01,0.0,0.01,0.05,0.02,0.0,0.01,0.01,0.02,0.0,0.0,0.04,0.0,0.0,0.01,0.02,0.01,0.01,0.0,0.01,0.01
3,"Mount Eden, California",0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.02,0.0,0.03,0.0,0.0,0.03,0.01,0.05,0.0,0.04,0.03,0.01,0.08,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.03,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.03,0.02,0.01,0.0,0.0,0.0,0.01,0.01,0.01,0.01,0.05,0.0,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.02,0.02,0.01,0.01,0.0,0.01,0.01,0.0,0.01,0.0,0.0,0.06,0.0,0.01,0.01,0.0,0.0,0.06,0.01,0.01,0.01,0.01,0.02,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.01,0.0,0.01,0.01,0.01,0.02,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.01
4,"Schafer Park, California",0.01,0.01,0.01,0.0,0.02,0.0,0.0,0.03,0.01,0.01,0.03,0.0,0.0,0.02,0.0,0.07,0.0,0.03,0.02,0.01,0.06,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.04,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.05,0.0,0.01,0.0,0.0,0.01,0.01,0.01,0.0,0.01,0.02,0.01,0.02,0.01,0.02,0.01,0.01,0.0,0.01,0.0,0.0,0.05,0.0,0.01,0.01,0.0,0.0,0.04,0.0,0.01,0.01,0.01,0.01,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.01,0.01,0.02,0.0,0.0,0.02,0.0,0.04,0.01,0.0,0.02
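
Taking the mean of one-hot columns per neighborhood gives the share of that neighborhood's venues in each category. In the table above, for example, a "Gym" value of 0.02 for Downtown Hayward means 2 of its 100 returned venues carried the "Gym" label. A minimal sketch on hypothetical data (neighborhoods "A" and "B" are placeholders):

```python
import pandas as pd

# Hypothetical venue rows: 4 venues in neighborhood A, 2 in neighborhood B.
venues_sample = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "VenueCategory": ["Gym", "Park", "Park", "Bar", "Gym", "Gym"],
})

# One 0/1 column per category, named after the category itself.
onehot = pd.get_dummies(venues_sample[["VenueCategory"]],
                        prefix="", prefix_sep="", dtype=int)
onehot["Neighborhood"] = venues_sample["Neighborhood"]

# The mean of a 0/1 column is the fraction of rows equal to 1.
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```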


In [48]:
len(hwd_grouped[hwd_grouped["Gym"] > 0])

5

In [50]:
# Create a new DataFrame for Gyms data only

hwd_gym = hwd_grouped[["Neighborhoods","Gym"]]

In [51]:
hwd_gym

Unnamed: 0,Neighborhoods,Gym
0,Downtown Hayward,0.02
1,"Eden Landing, California",0.01
2,"Hayward Heath, California",0.01
3,"Mount Eden, California",0.01
4,"Schafer Park, California",0.01


### 7. **Use k-means to Cluster the Neighborhoods**

In [54]:
# Run k-means with 2 clusters for the neighborhoods in Hayward, California, since the number of neighborhoods is small.

# set number of clusters
kclusters = 2

# keep only the numeric "Gym" column as the clustering feature
hwdkl_clustering = hwd_gym.drop(columns=["Neighborhoods"])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(hwdkl_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 0, 0, 0, 0], dtype=int32)
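
Because the clustering uses a single feature (the "Gym" frequency) and only five points, k-means here simply separates the one neighborhood at 0.02 from the four at 0.01. The sketch below is a hand-rolled one-dimensional version of Lloyd's algorithm, written only to make the assignment and update steps visible; it is not scikit-learn's implementation, `kmeans_1d` is an illustrative name, and the numbering of the resulting labels is arbitrary.

```python
def kmeans_1d(values, k=2, max_iter=100):
    """Hand-rolled Lloyd's algorithm for one-dimensional data (assumes k >= 2)."""
    lo, hi = min(values), max(values)
    # Deterministic start: spread the initial centroids evenly over the range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every value.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


gym_freq = [0.02, 0.01, 0.01, 0.01, 0.01]  # "Gym" column from the table above
labels, centroids = kmeans_1d(gym_freq, k=2)
print(labels, centroids)
```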

In [55]:
# make a new dataframe that holds the gym frequency and the cluster label for each neighborhood
hwd_merged = hwd_gym.copy()

# add clustering labels
hwd_merged["Cluster Labels"] = kmeans.labels_

In [56]:
hwd_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
hwd_merged.head()

Unnamed: 0,Neighborhood,Gym,Cluster Labels
0,Downtown Hayward,0.02,1
1,"Eden Landing, California",0.01,0
2,"Hayward Heath, California",0.01,0
3,"Mount Eden, California",0.01,0
4,"Schafer Park, California",0.01,0


In [57]:
# join hwd_merged with hwd_df to add latitude/longitude for each neighborhood
hwd_merged = hwd_merged.join(hwd_df.set_index("Neighborhood"), on="Neighborhood")

print(hwd_merged.shape)
hwd_merged.head() # check the last columns!

(5, 5)


Unnamed: 0,Neighborhood,Gym,Cluster Labels,latitude,longitude
0,Downtown Hayward,0.02,1,37.633732,-122.06101
1,"Eden Landing, California",0.01,0,37.680181,-121.921498
2,"Hayward Heath, California",0.01,0,37.680181,-121.921498
3,"Mount Eden, California",0.01,0,37.657381,-122.05076
4,"Schafer Park, California",0.01,0,37.635582,-122.10418


In [58]:
# sort the results by Cluster Labels
print(hwd_merged.shape)
hwd_merged.sort_values(["Cluster Labels"], inplace=True)
hwd_merged

(5, 5)


Unnamed: 0,Neighborhood,Gym,Cluster Labels,latitude,longitude
1,"Eden Landing, California",0.01,0,37.680181,-121.921498
2,"Hayward Heath, California",0.01,0,37.680181,-121.921498
3,"Mount Eden, California",0.01,0,37.657381,-122.05076
4,"Schafer Park, California",0.01,0,37.635582,-122.10418
0,Downtown Hayward,0.02,1,37.633732,-122.06101


### **Let's Visualize the resulting clusters**

In [61]:
# create map
hwdmap_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(hwd_merged['latitude'], hwd_merged['longitude'], hwd_merged['Neighborhood'], hwd_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(hwdmap_clusters)
       
hwdmap_clusters
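
The map needs one distinct color per cluster label. The cell above builds its palette with matplotlib's `cm.rainbow`; the sketch below swaps in the standard library's `colorsys` module to show the same idea of a per-label color lookup without extra packages. `cluster_colors` is an illustrative helper, and the saturation and brightness values are arbitrary choices.

```python
import colorsys


def cluster_colors(n_clusters):
    """Return one hex color per cluster label, evenly spaced around the hue wheel."""
    palette = []
    for i in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 0.8, 0.9)
        palette.append("#{:02x}{:02x}{:02x}".format(int(r * 255), int(g * 255), int(b * 255)))
    return palette


palette = cluster_colors(2)
for label in (0, 1):
    # Index the palette with the label itself so cluster 0 gets palette[0].
    print(label, palette[label])
```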

In [63]:
# save the map as HTML file
hwdmap_clusters.save('hwdmap_clusters.html')

### 8. **Let's Examine these Clusters**

#### Cluster 0

In [64]:
hwd_merged.loc[hwd_merged['Cluster Labels'] == 0]

Unnamed: 0,Neighborhood,Gym,Cluster Labels,latitude,longitude
1,"Eden Landing, California",0.01,0,37.680181,-121.921498
2,"Hayward Heath, California",0.01,0,37.680181,-121.921498
3,"Mount Eden, California",0.01,0,37.657381,-122.05076
4,"Schafer Park, California",0.01,0,37.635582,-122.10418


#### Cluster 1

In [65]:
hwd_merged.loc[hwd_merged['Cluster Labels'] == 1]

Unnamed: 0,Neighborhood,Gym,Cluster Labels,latitude,longitude
0,Downtown Hayward,0.02,1,37.633732,-122.06101
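
To compare the two clusters side by side, the gym frequencies can be summarized per cluster label. The sketch below rebuilds the five rows from the tables above and reports the number of neighborhoods, the mean frequency, and the summed frequency for each cluster; the frame name `clusters` is illustrative.

```python
import pandas as pd

# Values copied from the cluster tables above.
clusters = pd.DataFrame({
    "Neighborhood": ["Downtown Hayward", "Eden Landing, California",
                     "Hayward Heath, California", "Mount Eden, California",
                     "Schafer Park, California"],
    "Gym": [0.02, 0.01, 0.01, 0.01, 0.01],
    "Cluster Labels": [1, 0, 0, 0, 0],
})

# count = neighborhoods per cluster, mean = average gym share, sum = total gym share
summary = clusters.groupby("Cluster Labels")["Gym"].agg(["count", "mean", "sum"])
print(summary)
```

The mean and the sum rank the clusters differently here, so it is worth stating which measure a conclusion rests on.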


### **Final Observation**

#### *Answer to the business question: Where would you recommend an investor/developer in the city of Hayward to open a brand new Gym?*

*This analysis examined gym venues across five neighborhoods in the city of Hayward, California. When the gym frequencies are summed across each cluster's neighborhoods, the highest number of gyms is in cluster 0 (0.04 in total across four neighborhoods) and a smaller number is in cluster 1 (0.02, Downtown Hayward alone). Cluster 1 checks all the boxes: a high-traffic area (more traffic to the business), a downtown location (more appeal for the business), and less competition from other gyms. Property investors and developers with unique selling propositions that stand out from the competition can also open a brand-new gym in cluster 1, where competition is lower. Lastly, property developers are advised to avoid the neighborhoods in cluster 0, which already have a high concentration of gyms and face intense competition. To answer the business question, I would recommend that an investor or developer open a brand-new gym in cluster 1 (Downtown Hayward).*