# Capstone Project: Boba Bao
## Applied Data Science Capstone by IBM/Coursera
**Author:** Austin Tao  
**Date:** 01/04/2021

## Table of Contents
* [Introduction/Business Problem](#introduction)
* [Data](#data)
* [Analysis Part 1:](#analysis)
    * [San Francisco County](#sfcounty)
    * [San Mateo County](#smcounty)
    * [Santa Clara County](#sccounty)
    * [Alameda County](#acounty)
    * [Contra Costa County](#cccounty)
* [Analysis Part 2:](#analysis2)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)




## Introduction/Business Problem: <a name="introduction"></a>
In the Bay Area, the pearl milk tea market is extremely competitive: stores compete both with one another and with sellers of other beverages. Consider the scenario where you want to open a new pearl milk tea store in the Bay Area.  

**Business Problem: If you are a businessperson looking to start a new bubble tea store in the Silicon Valley area, where would you plan to locate your store?** 

## Data: <a name = "data"></a>
This project combines Foursquare location data with `bayarea_boba_spots.csv`, available [here](https://www.kaggle.com/vnxiclaire/bobabayarea). The CSV file is used to compare boba shops with one another, while the Foursquare data is used to assess venues in general (i.e. including businesses outside of the bubble tea sector). **In short, Foursquare provides general location data, while the CSV file is used solely for bubble tea data.**

To determine the optimal location for a new store, the following features will be analyzed:  
- Number of stores in the area that sell comparable goods
- Number of other milk tea stores in the area directly competing
- Popularity of said other milk tea stores



# Analysis Part 1: <a name = "analysis"></a>

## Gathering Neighborhood Data:

In [1]:
import pandas as pd
import numpy as np
import geopy
import requests
import folium

In [107]:
df = pd.read_csv("bayarea_boba_spots.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,id,name,rating,address,city,lat,long
0,0,99-tea-house-fremont-2,99% Tea House,4.5,3623 Thornton Ave,Fremont,37.56295,-122.01004
1,1,one-tea-fremont-2,One Tea,4.5,46809 Warm Springs Blvd,Fremont,37.489067,-121.929414
2,2,royaltea-usa-fremont,Royaltea USA,4.0,38509 Fremont Blvd,Fremont,37.551315,-121.99385
3,3,teco-tea-and-coffee-bar-fremont,TECO Tea & Coffee Bar,4.5,39030 Paseo Padre Pkwy,Fremont,37.553694,-121.981043
4,4,t-lab-fremont-3,T-LAB,4.0,34133 Fremont Blvd,Fremont,37.576149,-122.043705


## Removing Unnecessary Columns:

In [108]:
df = df.drop(['Unnamed: 0', 'id'], axis = 1)
df.head()

Unnamed: 0,name,rating,address,city,lat,long
0,99% Tea House,4.5,3623 Thornton Ave,Fremont,37.56295,-122.01004
1,One Tea,4.5,46809 Warm Springs Blvd,Fremont,37.489067,-121.929414
2,Royaltea USA,4.0,38509 Fremont Blvd,Fremont,37.551315,-121.99385
3,TECO Tea & Coffee Bar,4.5,39030 Paseo Padre Pkwy,Fremont,37.553694,-121.981043
4,T-LAB,4.0,34133 Fremont Blvd,Fremont,37.576149,-122.043705


## Using geopy and folium to plot map of Bay Area:

In [4]:
from geopy.geocoders import Nominatim

address = 'San Francisco, California'

geolocator = Nominatim(user_agent = "ca_explorer")
location = geolocator.geocode(address)
sf_latitude = location.latitude
sf_longitude = location.longitude
print('The geographical coordinates of San Francisco, California are {}, {}.'.format(sf_latitude, sf_longitude))

The geographical coordinates of San Francisco, California are 37.7790262, -122.4199061.


In [5]:
map_bayArea = folium.Map(location = [sf_latitude, sf_longitude], zoom_start = 10)

# add markers to map
for lat, lng, city in zip(df['lat'], df['long'], df['city']):
    label = '{}'.format(city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7,
        parse_html = False).add_to(map_bayArea)  
    
map_bayArea

# Splitting data for analysis
Given the size of the Bay Area, it makes sense to split the data into groups by county and then perform analysis. **The data will be divided into five counties: San Francisco, San Mateo, Santa Clara, Alameda, and Contra Costa.**
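
One way to make the split explicit is to attach a county label to each row once and then filter on that label. The sketch below is only an illustration of that idea, not the code used in this notebook: `CITY_TO_COUNTY` and `add_county` are hypothetical names, and the lookup table lists just a handful of cities.

```python
import pandas as pd

# Hypothetical, partial lookup table; extend it with every city in the CSV.
CITY_TO_COUNTY = {
    "san francisco": "San Francisco",
    "daly city": "San Mateo",
    "san jose": "Santa Clara",
    "fremont": "Alameda",
    "walnut creek": "Contra Costa",
}

def add_county(df):
    """Return a copy of df with a 'county' column derived from 'city'."""
    out = df.copy()
    # Normalize case and whitespace so 'san francisco' and 'San Francisco' match.
    key = out["city"].str.strip().str.lower()
    out["county"] = key.map(CITY_TO_COUNTY)  # cities missing from the table become NaN
    return out

sample = pd.DataFrame({"city": ["San Francisco", "san francisco", "Fremont", "Tracy"]})
labeled = add_county(sample)
print(labeled)
```

Rows whose county is `NaN` point to cities that are missing from the lookup table, which makes gaps in the city lists easy to spot.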

# 1. San Francisco County: <a name = "sfcounty"></a>

In [6]:
df_sf = df[df['city'].isin(['San Francisco', 'san francisco'])].reset_index(drop = True)
df_sf.head()

Unnamed: 0,name,rating,city,lat,long
0,Boba Guys,4.0,San Francisco,37.789943,-122.407306
1,Boba Guys,4.0,San Francisco,37.75994,-122.42112
2,Wonderful Dessert & Cafe,4.0,San Francisco,37.763325,-122.479878
3,Little Sweet,4.0,San Francisco,37.781425,-122.460802
4,Teaspoon,4.0,San Francisco,37.796316,-122.421976


## Define method mapCounty that maps our counties
- @param df_county is the data frame of the county
- @param venue_color is the desired border color for markers
- @param venue_fill_color is the desired fill color for markers

In [7]:
def mapCounty(df_county, venue_color, venue_fill_color):
    # every county map is centered on the San Francisco coordinates computed above
    county_map = folium.Map(location = [sf_latitude, sf_longitude], zoom_start = 10)

    # add one marker per milk tea store, labeled with its city
    for lat, lng, city in zip(df_county['lat'], df_county['long'], df_county['city']):
        label = '{}'.format(city)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius = 5,
            popup = label,
            color = venue_color,
            fill = True,
            fill_color = venue_fill_color,
            fill_opacity = 0.7,
            parse_html = False).add_to(county_map)
    return county_map
    
mapCounty(df_sf, 'red', '#cc2121')

## Using Foursquare data to explore the county:

In [11]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # replace with your own Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # replace with your own Foursquare secret; never publish it
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET: YOUR_FOURSQUARE_CLIENT_SECRET


## Define a function to get all venues in the county:

In [29]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [13]:
sf_venues = getNearbyVenues(names = df_sf['city'],
                                   latitudes = df_sf['lat'],
                                   longitudes = df_sf['long']
                                  )
sf_venues.head(10)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,San Francisco,37.789943,-122.407306,The Archive,37.789494,-122.405766,Men's Store
1,San Francisco,37.789943,-122.407306,Williams-Sonoma,37.788377,-122.407446,Kitchen Supply Store
2,San Francisco,37.789943,-122.407306,Tiffany & Co.,37.788598,-122.407708,Jewelry Store
3,San Francisco,37.789943,-122.407306,Maison Margiela,37.788261,-122.405765,Boutique
4,San Francisco,37.789943,-122.407306,The Whisky Shop,37.789494,-122.406554,Liquor Store
5,San Francisco,37.789943,-122.407306,Pure Organic Spa,37.789286,-122.409191,Spa
6,San Francisco,37.789943,-122.407306,Apple Union Square,37.78869,-122.407174,Electronics Store
7,San Francisco,37.789943,-122.407306,Benefit Cosmetics,37.789898,-122.404934,Cosmetics Shop
8,San Francisco,37.789943,-122.407306,Union Square,37.787933,-122.407501,Pedestrian Plaza
9,San Francisco,37.789943,-122.407306,Grand Hyatt San Francisco,37.78907,-122.407168,Hotel


In [14]:
sf_grouped = sf_venues.groupby(['Venue Category']).count()
sf_grouped = sf_grouped.drop(['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude', 'Venue Latitude', 'Venue Longitude'], axis = 1)
sf_grouped = sf_grouped.sort_values(by = 'Venue', ascending = False)
sf_grouped.head(5)

Unnamed: 0_level_0,Venue
Venue Category,Unnamed: 1_level_1
Chinese Restaurant,155
Coffee Shop,137
Bakery,120
Bubble Tea Shop,90
Café,89


From the `sf_grouped` data frame above, venues that sell goods comparable to pearl milk tea are among the most common venue categories around San Francisco County's existing milk tea stores ("Coffee Shop", "Bakery", "Café", and "Bubble Tea Shop" all sell comparable goods). The prevalence of these venues means that a new pearl milk tea store would face fierce competition, which makes San Francisco County a less desirable place to open one.
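
To put a rough number on this, the sketch below reuses the five counts printed for `sf_grouped` and computes how many of them belong to comparable categories. The `comparable` set is an assumption about which categories compete with pearl milk tea, and the share covers only the top five categories, not every venue returned by Foursquare.

```python
# Counts copied from the sf_grouped output above (top five categories only).
top_five = {
    "Chinese Restaurant": 155,
    "Coffee Shop": 137,
    "Bakery": 120,
    "Bubble Tea Shop": 90,
    "Café": 89,
}

# Assumption: these categories are treated as selling comparable goods.
comparable = {"Coffee Shop", "Bakery", "Bubble Tea Shop", "Café"}

comparable_total = sum(n for cat, n in top_five.items() if cat in comparable)
top_five_total = sum(top_five.values())
share = comparable_total / top_five_total

print(f"{comparable_total} of {top_five_total} top-five venues ({share:.0%}) are comparable")
```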

# 2. San Mateo County: <a name = "smcounty"></a>

In [15]:
df_sanMateo = df[df['city'].isin(['Atherton', 'Belmont', 'Brisbane', 'Burlingame', 'Colma', 'Daly City', 'East Palo Alto', 'Foster City', 'Half Moon Bay', 'Hillsborough', 'Menlo Park', 'Millbrae', 'Pacifica', 'Portola Valley', 'Redwood City', 'San Bruno', 'San Carlos', 'San Mateo', 'South San Francisco', 'Woodside'])].reset_index(drop = True)
df_sanMateo.head()

Unnamed: 0,name,rating,city,lat,long
0,FrosTea,3.5,Daly City,37.70689,-122.4588
1,The Burrow,4.5,Brisbane,37.683464,-122.402889
2,Boba Guys,4.0,San Carlos,37.50274,-122.25698
3,Teaquation,4.5,Redwood City,37.483951,-122.232654
4,Chatime,4.0,Redwood City,37.487118,-122.229624


## Mapping San Mateo Milk Tea Stores:

In [109]:
mapCounty(df_sanMateo, 'blue', '#1798e8')

## Finding Venues in San Mateo:

In [17]:
sanMateo_venues = getNearbyVenues(names = df_sanMateo['city'],
                                   latitudes = df_sanMateo['lat'],
                                   longitudes = df_sanMateo['long']
                                  )
sanMateo_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Daly City,37.70689,-122.4588,Goldilocks,37.706713,-122.459872,Bakery
1,Daly City,37.70689,-122.4588,Noodle Station,37.707504,-122.455777,Noodle House
2,Daly City,37.70689,-122.4588,Tselogs,37.706864,-122.458855,Lounge
3,Daly City,37.70689,-122.4588,Dollar Tree,37.706397,-122.460073,Discount Store
4,Daly City,37.70689,-122.4588,Bart Grocery,37.705672,-122.464151,Korean Restaurant


## Doing one-hot-encoding now that there are multiple cities:

In [18]:
def oneHotEncoding(county_venues):
    # one hot encoding
    county_onehot = pd.get_dummies(county_venues[['Venue Category']], prefix="", prefix_sep="")

    # add neighborhood column back to dataframe
    county_onehot['Neighborhood'] = county_venues['Neighborhood'] 

    # move neighborhood column to the first column
    fixed_columns = [county_onehot.columns[-1]] + list(county_onehot.columns[:-1])
    county_onehot = county_onehot[fixed_columns]

    return county_onehot.groupby('Neighborhood').mean().reset_index()

sanMateo_grouped = oneHotEncoding(sanMateo_venues)
sanMateo_grouped.head()

Unnamed: 0,Neighborhood,ATM,Afghan Restaurant,American Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Auditorium,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Weight Loss Center,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Belmont,0.019417,0.0,0.0,0.0,0.0,0.0,0.0,0.009709,0.0,...,0.0,0.0,0.009709,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Brisbane,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Burlingame,0.0,0.0,0.018576,0.0,0.0,0.0,0.01548,0.03096,0.0,...,0.0,0.003096,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003096
3,Daly City,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,...,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Foster City,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.03268,0.006536,0.006536,0.0,0.0,0.0,0.0,0.0,0.0
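
Each value in the table above is the share of a neighborhood's venues that fall into a given category. The toy example below (made-up data, not Foursquare results) shows the same calculation on six venues and, as an assumed alternative, the equivalent `pd.crosstab` call with row normalization.

```python
import pandas as pd

# Toy data, not taken from the Foursquare results.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Café", "Café", "Bakery", "Park", "Park", "Bakery"],
})

# Same idea as oneHotEncoding: one indicator column per category,
# then the mean per neighborhood gives each category's share of venues.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
freq = onehot.groupby("Neighborhood").mean()

# pd.crosstab with normalize="index" computes the same row-wise shares directly.
freq_crosstab = pd.crosstab(
    venues["Neighborhood"], venues["Venue Category"], normalize="index"
)

print(freq)
```

In neighborhood A, two of the four venues are cafés, so the `Café` entry is 0.5; neighborhood B has no café, so its entry is 0.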


## Finding the top venues in each neighborhood:

In [19]:
num_top_venues = 5

def findTopVenues(county_grouped, num_top_venues):
    for hood in county_grouped['Neighborhood']:
        print("----"+hood+"----")
        temp = county_grouped[county_grouped['Neighborhood'] == hood].T.reset_index()
        temp.columns = ['venue','freq']
        temp = temp.iloc[1:]
        temp['freq'] = temp['freq'].astype(float)
        temp = temp.round({'freq': 2})
        print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
        print('\n')
        
findTopVenues(sanMateo_grouped, num_top_venues)

----Belmont----
              venue  freq
0     Grocery Store  0.05
1    Sandwich Place  0.04
2  Sushi Restaurant  0.04
3        Nail Salon  0.04
4       Coffee Shop  0.04


----Brisbane----
                   venue  freq
0                   Park  0.12
1  Vietnamese Restaurant  0.08
2     Mexican Restaurant  0.08
3         Sandwich Place  0.04
4       Sushi Restaurant  0.04


----Burlingame----
             venue  freq
0      Pizza Place  0.06
1  Bubble Tea Shop  0.05
2   Sandwich Place  0.04
3      Coffee Shop  0.04
4           Bakery  0.04


----Daly City----
                venue  freq
0                Café  0.08
1      Discount Store  0.04
2  Burmese Restaurant  0.04
3     Bubble Tea Shop  0.04
4   Convenience Store  0.04


----Foster City----
                venue  freq
0                Park  0.05
1         Pizza Place  0.05
2  Chinese Restaurant  0.04
3     Bubble Tea Shop  0.04
4      Shipping Store  0.04


----Half Moon Bay----
                venue  freq
0       Grocery Store 

Of the 14 cities, 9 (approximately 64%) had a venue category comparable to pearl milk tea among their top five. This is better than San Francisco County, but it is not enough on its own to conclude that San Mateo County is an appropriate choice. 
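
The count above was read off the printed lists by hand. A minimal sketch of how it could be automated follows; `cities_with_competition` and `COMPARABLE` are hypothetical names, and the set of comparable categories is an assumption. Unlike `findTopVenues`, the sketch does not round frequencies before sorting, so ties may be broken differently.

```python
import pandas as pd

# Assumption: categories counted as competing with pearl milk tea.
COMPARABLE = {"Coffee Shop", "Café", "Bakery", "Bubble Tea Shop"}

def cities_with_competition(county_grouped, num_top_venues=5, comparable=COMPARABLE):
    """Return the neighborhoods whose top-N categories include a comparable one."""
    freqs = county_grouped.set_index("Neighborhood")
    hits = []
    for city, row in freqs.iterrows():
        # Category names of the N most frequent venue types in this city.
        top = row.sort_values(ascending=False).head(num_top_venues).index
        if comparable.intersection(top):
            hits.append(city)
    return hits

# Toy frequency table in the same shape as the output of oneHotEncoding.
toy = pd.DataFrame({
    "Neighborhood": ["X", "Y"],
    "Coffee Shop": [0.30, 0.00],
    "Park": [0.20, 0.60],
    "Pizza Place": [0.50, 0.40],
})
hits = cities_with_competition(toy, num_top_venues=2)
print(hits, f"{len(hits) / len(toy):.0%}")
```

With the notebook's data, the call would look like `cities_with_competition(sanMateo_grouped, num_top_venues)`.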

# 3. Santa Clara County: <a name = "sccounty"></a>

In [113]:
df_scc = df[df['city'].isin(['Campbell', 'Cupertino', 'Gilroy', 'Los Altos', 'Los Altos Hills', 'Los Gatos', 'Milpitas', 'Monte Sereno', 'Morgan Hill', 'Mountain View', 'Palo Alto', 'San Jose', 'Santa Clara', 'Saratoga', 'Sunnyvale'])].reset_index(drop = True)
df_scc.head()

Unnamed: 0,name,rating,address,city,lat,long
0,Tea Era,4.0,271 Castro St,Mountain View,37.392956,-122.079281
1,Teaspoon,4.0,4546 El Camino Real,Los Altos,37.401254,-122.114469
2,Tastea,4.0,1160 N Capitol Ave,San Jose,37.3878,-121.86019
3,Tea Villa,4.0,1679 N Milpitas Blvd,Milpitas,37.454735,-121.911572
4,Pop Tea Bar,4.0,456 Cambridge Ave,Palo Alto,37.42655,-122.14632


## Mapping Santa Clara County Boba Stores:

In [114]:
mapCounty(df_scc, 'yellow', '#e6ed13')

## Venues in Santa Clara County:

In [115]:
scc_venues = getNearbyVenues(names = df_scc['city'],
                                   latitudes = df_scc['lat'],
                                   longitudes = df_scc['long']
                                  )
scc_venues.head()

KeyError: 'groups'
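
The `KeyError: 'groups'` above means that at least one response did not contain a `groups` entry. The notebook does not show why; an exhausted quota or rejected credentials are possible causes, but that is a guess. The sketch below shows one defensive way to read the response so that such a request yields no venues instead of stopping the loop. `extract_items` is a hypothetical helper, and the `failed` payload is invented for illustration.

```python
def extract_items(payload):
    """Return the list of venue items from an explore response, or [] if absent.

    Assumes the response shape used in getNearbyVenues:
    payload["response"]["groups"][0]["items"].
    """
    groups = payload.get("response", {}).get("groups") or []
    if not groups:
        return []
    return groups[0].get("items", [])

ok = {"response": {"groups": [{"items": [{"venue": {"name": "Example Cafe"}}]}]}}
# Hypothetical error-style payload: 'response' carries no 'groups' key.
failed = {"meta": {"code": 429}, "response": {}}

print(len(extract_items(ok)), len(extract_items(failed)))
```

Inside `getNearbyVenues`, the line that indexes the JSON directly could call such a helper on `requests.get(url).json()` instead. Silently skipping failed requests can bias the counts, so it would be worth logging which locations returned nothing.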

## One-hot-encoding:

In [116]:
scc_grouped = oneHotEncoding(scc_venues)
scc_grouped.head()

NameError: name 'scc_venues' is not defined

## Finding top venues in each city:

In [None]:
findTopVenues(scc_grouped, num_top_venues)

Every city in Santa Clara County had a coffee shop, café, or milk tea shop among its top venues, meaning that a new pearl milk tea store in this county would face extremely strong competition. 

# 4. Alameda County: <a name = "acounty"></a>

In [22]:
df_alameda = df[df['city'].isin(['Alameda', 'Albany', 'Berkeley', 'Dublin', 'Emeryville', 'Fremont', 'Hayward', 'Livermore', 'Newark', 'Oakland', 'Piedmont', 'Pleasanton', 'San Leandro', 'Union City'])].reset_index(drop = True)
df_alameda.head()

Unnamed: 0,name,rating,city,lat,long
0,99% Tea House,4.5,Fremont,37.56295,-122.01004
1,One Tea,4.5,Fremont,37.489067,-121.929414
2,Royaltea USA,4.0,Fremont,37.551315,-121.99385
3,TECO Tea & Coffee Bar,4.5,Fremont,37.553694,-121.981043
4,T-LAB,4.0,Fremont,37.576149,-122.043705


## Mapping Boba Stores in Alameda County:

In [111]:
mapCounty(df_alameda, 'grey', '#C0C0C0')

## Finding venues in Alameda County: 

In [24]:
alameda_venues = getNearbyVenues(names = df_alameda['city'],
                                   latitudes = df_alameda['lat'],
                                   longitudes = df_alameda['long']
                                  )
alameda_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Fremont,37.56295,-122.01004,Satomi Sushi,37.562808,-122.010448,Sushi Restaurant
1,Fremont,37.56295,-122.01004,Suju's Coffee & Tea,37.562694,-122.009259,Coffee Shop
2,Fremont,37.56295,-122.01004,Bun Appétit,37.559883,-122.009978,Donut Shop
3,Fremont,37.56295,-122.01004,Dino's Family Restaurant,37.561186,-122.012218,Diner
4,Fremont,37.56295,-122.01004,Kyain Kyain (Main Main Kyay Ohe),37.5629,-122.010341,Soup Place


## One-hot-encoding for Alameda County:

In [25]:
alameda_grouped = oneHotEncoding(alameda_venues)
alameda_grouped.head()

Unnamed: 0,Neighborhood,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Terminal,American Restaurant,Andhra Restaurant,...,Vietnamese Restaurant,Warehouse Store,Weight Loss Center,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio
0,Alameda,0.007949,0.007949,0.0,0.011129,0.00159,0.0,0.0,0.023847,0.0,...,0.012719,0.0,0.0,0.019078,0.014308,0.0,0.004769,0.0,0.0,0.0
1,Albany,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Berkeley,0.0,0.015385,0.0,0.0,0.0,0.0,0.0,0.015385,0.0,...,0.007692,0.0,0.0,0.0,0.0,0.0,0.003077,0.0,0.0,0.013846
3,Dublin,0.004963,0.0,0.0,0.003722,0.0,0.0,0.0,0.006203,0.0,...,0.007444,0.0,0.0,0.0,0.0,0.0,0.002481,0.0,0.0,0.0
4,Fremont,0.003425,0.0,0.0,0.0,0.0,0.0,0.0,0.008562,0.0,...,0.017979,0.003425,0.0,0.0,0.003425,0.0,0.005993,0.0,0.0,0.0


## Finding top venues in Alameda:

In [27]:
findTopVenues(alameda_grouped, num_top_venues)

----Alameda----
                venue  freq
0      Sandwich Place  0.03
1     Bubble Tea Shop  0.03
2                 Bar  0.03
3  Chinese Restaurant  0.03
4         Pizza Place  0.02


----Albany----
               venue  freq
0    Thai Restaurant  0.07
1  Indian Restaurant  0.07
2                Bar  0.04
3        Pizza Place  0.04
4          Pet Store  0.04


----Berkeley----
             venue  freq
0      Coffee Shop  0.05
1             Café  0.04
2  Thai Restaurant  0.03
3   Sandwich Place  0.02
4    Hot Dog Joint  0.02


----Dublin----
                 venue  freq
0          Coffee Shop  0.05
1       Sandwich Place  0.04
2  Japanese Restaurant  0.04
3               Bakery  0.03
4        Grocery Store  0.03


----Fremont----
                  venue  freq
0    Chinese Restaurant  0.07
1       Bubble Tea Shop  0.05
2           Pizza Place  0.04
3           Coffee Shop  0.04
4  Fast Food Restaurant  0.03


----Hayward----
                venue  freq
0  Mexican Restaurant  0.07
1    

Only 7 of the 12 cities in Alameda County (approximately 58%) have a venue category that competes with pearl milk tea among their top 5 venues. This is a good sign: Alameda County is potentially a good contender for opening a new store.

# 5. Contra Costa County: <a name = "cccounty"></a>

In [28]:
df_ccc = df[df['city'].isin(['Antioch', 'Brentwood', 'Clayton', 'Concord', 'Danville', 'El Cerrito', 'Hercules', 'Lafayette', 'Martinez', 'Moraga', 'Oakley', 'Orinda', 'Pinole', 'Pittsburg', 'Pleasant Hill', 'Richmond', 'San Pablo', 'San Ramon', 'Walnut Creek'])].reset_index(drop = True)
df_ccc.head()

Unnamed: 0,name,rating,city,lat,long
0,Milk Tea Lab,4.0,Pleasant Hill,37.948571,-122.058179
1,i-Tea,4.0,Walnut Creek,37.898132,-122.059291
2,T4,3.5,San Ramon,37.73021,-121.93009
3,i-Tea,4.0,San Ramon,37.7241,-121.94435
4,Pho Saigon Noodle House,3.5,San Ramon,37.729389,-121.931442


## Mapping Boba Stores in Contra Costa County:

In [112]:
mapCounty(df_ccc, 'purple', '#800080')

## Venues in Contra Costa County:

In [31]:
ccc_venues = getNearbyVenues(names = df_ccc['city'],
                                   latitudes = df_ccc['lat'],
                                   longitudes = df_ccc['long']
                                  )
ccc_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Pleasant Hill,37.948571,-122.058179,Plaza Cafe,37.948306,-122.059647,Café
1,Pleasant Hill,37.948571,-122.058179,Milk Tea Lab,37.948532,-122.058172,Bubble Tea Shop
2,Pleasant Hill,37.948571,-122.058179,Nation's Giant Hamburgers,37.948692,-122.059679,Burger Joint
3,Pleasant Hill,37.948571,-122.058179,Three Brothers From China,37.949411,-122.060963,Chinese Restaurant
4,Pleasant Hill,37.948571,-122.058179,Starbucks,37.948967,-122.057949,Coffee Shop


## One-hot-encoding for Contra Costa County:

In [32]:
ccc_grouped = oneHotEncoding(ccc_venues)
ccc_grouped.head()

Unnamed: 0,Neighborhood,ATM,Accessories Store,Afghan Restaurant,American Restaurant,Arcade,Art Gallery,Arts & Crafts Store,Asian Restaurant,Auto Dealership,...,Veterinarian,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Weight Loss Center,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Antioch,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.010526,0.0,0.0,0.031579,0.0,0.0,0.0,0.010526,0.0,0.0
1,Brentwood,0.0,0.008621,0.0,0.017241,0.0,0.0,0.0,0.008621,0.0,...,0.0,0.017241,0.0,0.0,0.0,0.0,0.0,0.008621,0.008621,0.0
2,Concord,0.010158,0.004515,0.0,0.002257,0.006772,0.0,0.007901,0.014673,0.004515,...,0.0,0.001129,0.001129,0.022573,0.002257,0.0,0.0,0.0,0.001129,0.0
3,Danville,0.010638,0.0,0.010638,0.042553,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.010638,0.0,0.0,0.010638
4,El Cerrito,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020408,0.0,...,0.0,0.0,0.010204,0.010204,0.0,0.010204,0.0,0.010204,0.0,0.020408


## Finding top venues in Contra Costa County cities:

In [33]:
findTopVenues(ccc_grouped, num_top_venues)

----Antioch----
                  venue  freq
0                  Bank  0.05
1  Fast Food Restaurant  0.05
2           Pizza Place  0.05
3        Sandwich Place  0.04
4         Grocery Store  0.04


----Brentwood----
            venue  freq
0  Clothing Store  0.07
1  Ice Cream Shop  0.05
2     Pizza Place  0.05
3  Sandwich Place  0.04
4   Grocery Store  0.03


----Concord----
                venue  freq
0  Mexican Restaurant  0.09
1      Sandwich Place  0.03
2         Coffee Shop  0.03
3          Laundromat  0.02
4   Indian Restaurant  0.02


----Danville----
                venue  freq
0         Pizza Place  0.05
1      Sandwich Place  0.05
2  Italian Restaurant  0.04
3     Thai Restaurant  0.04
4       Grocery Store  0.04


----El Cerrito----
                venue  freq
0         Coffee Shop  0.04
1   Mobile Phone Shop  0.04
2         Pizza Place  0.04
3    Sushi Restaurant  0.03
4  Chinese Restaurant  0.03


----Hercules----
                venue  freq
0         Pizza Place  0.09
1  

10 of the 16 cities in Contra Costa County (10/16, approximately 63%) had a beverage venue comparable to pearl milk tea among their top venues. This is certainly better than, for instance, Santa Clara County, but it does not prove anything definitive yet.

# Analysis Part 2: <a name = "analysis2"></a>

From the first part of our analysis, we estimated how much competition a new milk tea store would face in each Bay Area county. Here is a summary table of our results:

| County       | Percentage of cities with high competition |
| :----------- | :----------------------------------------: |
| S.F.         | 100% (only one city in county)             |
| San Mateo    | 64%                                        |
| Santa Clara  | 100% (with 15 cities in county)            |
| Alameda      | 58%                                        |
| Contra Costa | 63%                                        |

Clearly, S.F. County and Santa Clara County are poor options for a new pearl milk tea store because competition is strong in every city analyzed. Unless one were opening a new location of a well-established chain, it would be very difficult to succeed in these counties. This reduces our list of potential locations to San Mateo County, Alameda County, and Contra Costa County.  
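
Recomputing the shares for the three remaining counties from the counts quoted in Part 1 is a quick sanity check on the table. The counts in the sketch are taken from the text of Part 1; nothing else is assumed.

```python
# (cities with a comparable top venue, cities analyzed), as stated in Part 1.
counts = {
    "San Mateo": (9, 14),
    "Alameda": (7, 12),
    "Contra Costa": (10, 16),
}

shares = {county: hits / total for county, (hits, total) in counts.items()}

# List the counties from least to most competitive.
for county, share in sorted(shares.items(), key=lambda kv: kv[1]):
    print(f"{county}: {share:.1%}")
```

The three shares lie within about six percentage points of one another, so this measure alone does not separate the counties strongly.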

**Now, to narrow it down further to a single county, let us compare the average ratings of milk tea stores in each county:**

## Finding average ratings:

In [37]:
def getMeanRating(df_county):
    mean = df_county['rating'].mean()
    
    return mean

In [44]:
sanMateoAVG = getMeanRating(df_sanMateo)
alamedaAVG = getMeanRating(df_alameda)
cccAVG = getMeanRating(df_ccc)

print("San Mateo Average: " + str(sanMateoAVG))
print("Alameda Average: " + str(alamedaAVG))
print("Contra Costa Average: " + str(cccAVG))

San Mateo Average: 3.863013698630137
Alameda Average: 3.721590909090909
Contra Costa Average: 3.7686567164179103


We note that Alameda County has the lowest average rating for its milk tea stores. **We will assume that this means competition in Alameda County is the weakest. This is not a guarantee!** The analysis below can be repeated for either of the other two counties. For simplicity, we use Alameda County here.
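
Because the three averages are close, it may help to look at the number of shops and the spread of ratings alongside the mean. The sketch below is one possible way to do that; `rating_summary` is a hypothetical helper and the ratings are toy values, not the contents of `bayarea_boba_spots.csv`.

```python
import pandas as pd

def rating_summary(frames):
    """Summarize ratings per county: shop count, mean, and standard deviation."""
    rows = {
        county: df_county["rating"].agg(["count", "mean", "std"])
        for county, df_county in frames.items()
    }
    # One row per county, one column per statistic.
    return pd.DataFrame(rows).T

# Toy ratings for illustration only.
toy_frames = {
    "County A": pd.DataFrame({"rating": [4.0, 4.0, 3.5, 4.5]}),
    "County B": pd.DataFrame({"rating": [3.0, 4.5, 4.0]}),
}
summary = rating_summary(toy_frames)
print(summary)
```

With the notebook's data, the call would be `rating_summary({'San Mateo': df_sanMateo, 'Alameda': df_alameda, 'Contra Costa': df_ccc})`. A difference in means that is small relative to the standard deviation is weak evidence on its own.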

Now that we have selected Alameda County, we need to determine which city to target. The map of boba shops in the county is shown again below:

In [45]:
mapCounty(df_alameda, 'grey', '#C0C0C0')

## Converting top venues in Alameda into a dataframe:

In [48]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [57]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = alameda_grouped['Neighborhood']

for ind in np.arange(alameda_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(alameda_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Alameda,Bubble Tea Shop,Bar,Sandwich Place,Chinese Restaurant,Thai Restaurant,American Restaurant,Coffee Shop,Mexican Restaurant,Italian Restaurant,Sushi Restaurant
1,Albany,Thai Restaurant,Indian Restaurant,Liquor Store,Pet Store,Pizza Place,Breakfast Spot,Café,Mexican Restaurant,Bar,Coffee Shop
2,Berkeley,Coffee Shop,Café,Thai Restaurant,Japanese Restaurant,Pizza Place,Pool,Bookstore,Bubble Tea Shop,Chinese Restaurant,Ice Cream Shop
3,Dublin,Coffee Shop,Sandwich Place,Japanese Restaurant,Bakery,Pizza Place,Grocery Store,Fast Food Restaurant,Asian Restaurant,Chinese Restaurant,Indian Restaurant
4,Fremont,Chinese Restaurant,Bubble Tea Shop,Coffee Shop,Pizza Place,Fast Food Restaurant,Sandwich Place,Grocery Store,Indian Restaurant,Bakery,Dessert Shop


## K-Means-Clustering:

In [94]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

alameda_grouped_clustering = alameda_grouped.drop('Neighborhood', axis = 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(alameda_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 1, 4, 0, 0, 3, 3, 0, 4, 3], dtype=int32)
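
The notebook fixes `kclusters = 5` without comparing alternatives. One common way to check such a choice is to compare silhouette scores across several values of k. The sketch below assumes scikit-learn's `silhouette_score` and uses toy data in place of `alameda_grouped_clustering`; `silhouette_by_k` is a hypothetical helper, not part of the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_by_k(features, k_values, random_state=0):
    """Fit k-means for each k and return {k: silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return scores

# Toy data: three well-separated blobs, standing in for alameda_grouped_clustering.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
toy = np.vstack([c + 0.1 * rng.standard_normal((10, 2)) for c in centers])

scores = silhouette_by_k(toy, k_values=range(2, 6))
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The silhouette score is only defined when k is at least 2 and smaller than the number of rows, so with roughly a dozen cities in the county only a narrow range of k can be compared.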

In [95]:
# add clustering labels
#neighborhoods_venues_sorted.insert(0, 'Label', kmeans.labels_)

alameda_merged = df_alameda

# join the per-city top venues onto the store-level data, which carries latitude/longitude for each store
alameda_merged = alameda_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on = 'city')

alameda_merged.head(100) # check the last columns!

Unnamed: 0,name,rating,city,lat,long,Label,Labels,Cluster Label,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,99% Tea House,4.5,Fremont,37.562950,-122.010040,0,0,0,0,Chinese Restaurant,Bubble Tea Shop,Coffee Shop,Pizza Place,Fast Food Restaurant,Sandwich Place,Grocery Store,Indian Restaurant,Bakery,Dessert Shop
1,One Tea,4.5,Fremont,37.489067,-121.929414,0,0,0,0,Chinese Restaurant,Bubble Tea Shop,Coffee Shop,Pizza Place,Fast Food Restaurant,Sandwich Place,Grocery Store,Indian Restaurant,Bakery,Dessert Shop
2,Royaltea USA,4.0,Fremont,37.551315,-121.993850,0,0,0,0,Chinese Restaurant,Bubble Tea Shop,Coffee Shop,Pizza Place,Fast Food Restaurant,Sandwich Place,Grocery Store,Indian Restaurant,Bakery,Dessert Shop
3,TECO Tea & Coffee Bar,4.5,Fremont,37.553694,-121.981043,0,0,0,0,Chinese Restaurant,Bubble Tea Shop,Coffee Shop,Pizza Place,Fast Food Restaurant,Sandwich Place,Grocery Store,Indian Restaurant,Bakery,Dessert Shop
4,T-LAB,4.0,Fremont,37.576149,-122.043705,0,0,0,0,Chinese Restaurant,Bubble Tea Shop,Coffee Shop,Pizza Place,Fast Food Restaurant,Sandwich Place,Grocery Store,Indian Restaurant,Bakery,Dessert Shop
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,99 Ranch Market,3.5,Pleasanton,37.699560,-121.873260,3,3,3,3,Coffee Shop,Pizza Place,Clothing Store,Grocery Store,Chinese Restaurant,Sandwich Place,Ice Cream Shop,Vietnamese Restaurant,Sushi Restaurant,Mexican Restaurant
96,Menchie's Frozen Yogurt,4.0,Dublin,37.703721,-121.851480,0,0,0,0,Coffee Shop,Sandwich Place,Japanese Restaurant,Bakery,Pizza Place,Grocery Store,Fast Food Restaurant,Asian Restaurant,Chinese Restaurant,Indian Restaurant
97,Blossom Bee,4.0,Dublin,37.710858,-121.926420,0,0,0,0,Coffee Shop,Sandwich Place,Japanese Restaurant,Bakery,Pizza Place,Grocery Store,Fast Food Restaurant,Asian Restaurant,Chinese Restaurant,Indian Restaurant
98,Berry Delight,3.0,Pleasanton,37.694629,-121.931563,3,3,3,3,Coffee Shop,Pizza Place,Clothing Store,Grocery Store,Chinese Restaurant,Sandwich Place,Ice Cream Shop,Vietnamese Restaurant,Sushi Restaurant,Mexican Restaurant


In [96]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[sf_latitude, sf_longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(alameda_merged['lat'], alameda_merged['long'], alameda_merged['city'], alameda_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [97]:
alameda_merged.loc[alameda_merged['Cluster Labels'] == 0, alameda_merged.columns[[1] + list(range(5, alameda_merged.shape[1]))]]

Unnamed: 0,rating,Label,Labels,Cluster Label,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,4.5,0,0,0,0,Chinese Restaurant,Bubble Tea Shop,Coffee Shop,Pizza Place,Fast Food Restaurant,Sandwich Place,Grocery Store,Indian Restaurant,Bakery,Dessert Shop
1,4.5,0,0,0,0,Chinese Restaurant,Bubble Tea Shop,Coffee Shop,Pizza Place,Fast Food Restaurant,Sandwich Place,Grocery Store,Indian Restaurant,Bakery,Dessert Shop
2,4.0,0,0,0,0,Chinese Restaurant,Bubble Tea Shop,Coffee Shop,Pizza Place,Fast Food Restaurant,Sandwich Place,Grocery Store,Indian Restaurant,Bakery,Dessert Shop
3,4.5,0,0,0,0,Chinese Restaurant,Bubble Tea Shop,Coffee Shop,Pizza Place,Fast Food Restaurant,Sandwich Place,Grocery Store,Indian Restaurant,Bakery,Dessert Shop
4,4.0,0,0,0,0,Chinese Restaurant,Bubble Tea Shop,Coffee Shop,Pizza Place,Fast Food Restaurant,Sandwich Place,Grocery Store,Indian Restaurant,Bakery,Dessert Shop
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
127,4.0,0,0,0,0,Chinese Restaurant,Bubble Tea Shop,Coffee Shop,Pizza Place,Fast Food Restaurant,Sandwich Place,Grocery Store,Indian Restaurant,Bakery,Dessert Shop
128,3.5,0,0,0,0,Chinese Restaurant,Bubble Tea Shop,Coffee Shop,Pizza Place,Fast Food Restaurant,Sandwich Place,Grocery Store,Indian Restaurant,Bakery,Dessert Shop
139,4.0,0,0,0,0,Chinese Restaurant,Bubble Tea Shop,Coffee Shop,Pizza Place,Fast Food Restaurant,Sandwich Place,Grocery Store,Indian Restaurant,Bakery,Dessert Shop
140,4.0,0,0,0,0,Chinese Restaurant,Bubble Tea Shop,Coffee Shop,Pizza Place,Fast Food Restaurant,Sandwich Place,Grocery Store,Indian Restaurant,Bakery,Dessert Shop


In [98]:
alameda_merged.loc[alameda_merged['Cluster Labels'] == 1, alameda_merged.columns[[1] + list(range(5, alameda_merged.shape[1]))]]

Unnamed: 0,rating,Label,Labels,Cluster Label,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
143,3.5,1,1,1,1,Thai Restaurant,Indian Restaurant,Liquor Store,Pet Store,Pizza Place,Breakfast Spot,Café,Mexican Restaurant,Bar,Coffee Shop


In [99]:
alameda_merged.loc[alameda_merged['Cluster Labels'] == 2, alameda_merged.columns[[1] + list(range(5, alameda_merged.shape[1]))]]

Unnamed: 0,rating,Label,Labels,Cluster Label,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
25,3.5,2,2,2,2,Vietnamese Restaurant,Grocery Store,Asian Restaurant,Thai Restaurant,Chinese Restaurant,Bakery,Café,Ramen Restaurant,Bank,Fast Food Restaurant
30,4.0,2,2,2,2,Vietnamese Restaurant,Grocery Store,Asian Restaurant,Thai Restaurant,Chinese Restaurant,Bakery,Café,Ramen Restaurant,Bank,Fast Food Restaurant
31,3.5,2,2,2,2,Vietnamese Restaurant,Grocery Store,Asian Restaurant,Thai Restaurant,Chinese Restaurant,Bakery,Café,Ramen Restaurant,Bank,Fast Food Restaurant
33,3.5,2,2,2,2,Vietnamese Restaurant,Grocery Store,Asian Restaurant,Thai Restaurant,Chinese Restaurant,Bakery,Café,Ramen Restaurant,Bank,Fast Food Restaurant
34,3.5,2,2,2,2,Vietnamese Restaurant,Grocery Store,Asian Restaurant,Thai Restaurant,Chinese Restaurant,Bakery,Café,Ramen Restaurant,Bank,Fast Food Restaurant
37,3.5,2,2,2,2,Vietnamese Restaurant,Grocery Store,Asian Restaurant,Thai Restaurant,Chinese Restaurant,Bakery,Café,Ramen Restaurant,Bank,Fast Food Restaurant
115,3.5,2,2,2,2,Vietnamese Restaurant,Grocery Store,Asian Restaurant,Thai Restaurant,Chinese Restaurant,Bakery,Café,Ramen Restaurant,Bank,Fast Food Restaurant
118,4.5,2,2,2,2,Vietnamese Restaurant,Grocery Store,Asian Restaurant,Thai Restaurant,Chinese Restaurant,Bakery,Café,Ramen Restaurant,Bank,Fast Food Restaurant
123,3.5,2,2,2,2,Vietnamese Restaurant,Grocery Store,Asian Restaurant,Thai Restaurant,Chinese Restaurant,Bakery,Café,Ramen Restaurant,Bank,Fast Food Restaurant
126,4.0,2,2,2,2,Vietnamese Restaurant,Grocery Store,Asian Restaurant,Thai Restaurant,Chinese Restaurant,Bakery,Café,Ramen Restaurant,Bank,Fast Food Restaurant


In [100]:
alameda_merged.loc[alameda_merged['Cluster Labels'] == 3, alameda_merged.columns[[1] + list(range(5, alameda_merged.shape[1]))]]

Unnamed: 0,rating,Label,Labels,Cluster Label,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
38,4.0,3,3,3,3,Pizza Place,Sushi Restaurant,Chinese Restaurant,Bank,Mexican Restaurant,Burger Joint,Nail Salon,Pharmacy,Coffee Shop,Salon / Barbershop
58,4.0,3,3,3,3,Pizza Place,Sushi Restaurant,Chinese Restaurant,Bank,Mexican Restaurant,Burger Joint,Nail Salon,Pharmacy,Coffee Shop,Salon / Barbershop
69,3.5,3,3,3,3,Pizza Place,Sushi Restaurant,Chinese Restaurant,Bank,Mexican Restaurant,Burger Joint,Nail Salon,Pharmacy,Coffee Shop,Salon / Barbershop
73,4.0,3,3,3,3,Coffee Shop,Pizza Place,Clothing Store,Grocery Store,Chinese Restaurant,Sandwich Place,Ice Cream Shop,Vietnamese Restaurant,Sushi Restaurant,Mexican Restaurant
74,4.5,3,3,3,3,Coffee Shop,Pizza Place,Clothing Store,Grocery Store,Chinese Restaurant,Sandwich Place,Ice Cream Shop,Vietnamese Restaurant,Sushi Restaurant,Mexican Restaurant
76,3.5,3,3,3,3,Coffee Shop,Pizza Place,Clothing Store,Grocery Store,Chinese Restaurant,Sandwich Place,Ice Cream Shop,Vietnamese Restaurant,Sushi Restaurant,Mexican Restaurant
77,4.0,3,3,3,3,Coffee Shop,Pizza Place,Clothing Store,Grocery Store,Chinese Restaurant,Sandwich Place,Ice Cream Shop,Vietnamese Restaurant,Sushi Restaurant,Mexican Restaurant
80,4.0,3,3,3,3,Coffee Shop,Pizza Place,Clothing Store,Grocery Store,Chinese Restaurant,Sandwich Place,Ice Cream Shop,Vietnamese Restaurant,Sushi Restaurant,Mexican Restaurant
83,3.5,3,3,3,3,Coffee Shop,Pizza Place,Clothing Store,Grocery Store,Chinese Restaurant,Sandwich Place,Ice Cream Shop,Vietnamese Restaurant,Sushi Restaurant,Mexican Restaurant
84,2.5,3,3,3,3,Coffee Shop,Pizza Place,Clothing Store,Grocery Store,Chinese Restaurant,Sandwich Place,Ice Cream Shop,Vietnamese Restaurant,Sushi Restaurant,Mexican Restaurant


In [101]:
alameda_merged.loc[alameda_merged['Cluster Labels'] == 4, alameda_merged.columns[[1] + list(range(5, alameda_merged.shape[1]))]]

Unnamed: 0,rating,Label,Labels,Cluster Label,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
39,4.0,4,4,4,4,Chinese Restaurant,Café,Coffee Shop,Vietnamese Restaurant,Sandwich Place,Bakery,Mexican Restaurant,Bar,Vegetarian / Vegan Restaurant,Bubble Tea Shop
40,3.5,4,4,4,4,Chinese Restaurant,Café,Coffee Shop,Vietnamese Restaurant,Sandwich Place,Bakery,Mexican Restaurant,Bar,Vegetarian / Vegan Restaurant,Bubble Tea Shop
41,4.0,4,4,4,4,Chinese Restaurant,Café,Coffee Shop,Vietnamese Restaurant,Sandwich Place,Bakery,Mexican Restaurant,Bar,Vegetarian / Vegan Restaurant,Bubble Tea Shop
42,4.0,4,4,4,4,Coffee Shop,Café,Thai Restaurant,Japanese Restaurant,Pizza Place,Pool,Bookstore,Bubble Tea Shop,Chinese Restaurant,Ice Cream Shop
43,4.0,4,4,4,4,Chinese Restaurant,Café,Coffee Shop,Vietnamese Restaurant,Sandwich Place,Bakery,Mexican Restaurant,Bar,Vegetarian / Vegan Restaurant,Bubble Tea Shop
44,4.0,4,4,4,4,Coffee Shop,Café,Thai Restaurant,Japanese Restaurant,Pizza Place,Pool,Bookstore,Bubble Tea Shop,Chinese Restaurant,Ice Cream Shop
45,4.0,4,4,4,4,Bubble Tea Shop,Bar,Sandwich Place,Chinese Restaurant,Thai Restaurant,American Restaurant,Coffee Shop,Mexican Restaurant,Italian Restaurant,Sushi Restaurant
46,3.0,4,4,4,4,Chinese Restaurant,Café,Coffee Shop,Vietnamese Restaurant,Sandwich Place,Bakery,Mexican Restaurant,Bar,Vegetarian / Vegan Restaurant,Bubble Tea Shop
47,4.0,4,4,4,4,Chinese Restaurant,Café,Coffee Shop,Vietnamese Restaurant,Sandwich Place,Bakery,Mexican Restaurant,Bar,Vegetarian / Vegan Restaurant,Bubble Tea Shop
48,4.0,4,4,4,4,Coffee Shop,Café,Thai Restaurant,Japanese Restaurant,Pizza Place,Pool,Bookstore,Bubble Tea Shop,Chinese Restaurant,Ice Cream Shop


# Results and Discussion <a name = "results"></a>

We began by dividing our data set into 5 Bay Area counties to make the analysis more specialized. From there, the top venues in each county were identified to determine which areas have particularly strong beverage industry competition. This ruled out San Francisco County and Santa Clara County, both of which contained bubble tea stores or another comparable venue in the top 5 venues for every single city in the county.


This narrowed our search to San Mateo, Alameda, and Contra Costa counties, where we used `bayarea_boba_spots.csv` to compare the popularity (ratings) of the milk tea stores in each county. From this, we determined that Alameda County has the lowest ratings, so we assume that the competition is weakest there (this assumption had to be made for this project's sake; the project focuses on data science, not necessarily on business strategy). This analysis led us to select Alameda County as our desired county.
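The county comparison described above can be sketched as follows. This is a minimal illustration on toy data, not the notebook's actual computation: the `county` column and the rating values are assumptions made for the example (the CSV itself is split into counties earlier in the analysis).

```python
import pandas as pd

# Toy stand-in for the boba ratings. The values and the `county` column are
# illustrative assumptions, not rows from bayarea_boba_spots.csv.
shops = pd.DataFrame({
    "county": ["San Mateo", "San Mateo", "Alameda", "Alameda",
               "Contra Costa", "Contra Costa"],
    "rating": [4.5, 4.0, 3.5, 3.5, 4.0, 4.5],
})

# Average rating per county, sorted so the lowest-rated county comes first.
mean_rating = shops.groupby("county")["rating"].mean().sort_values()

# Under the project's assumption, the lowest average rating marks the weakest competition.
weakest_county = mean_rating.index[0]
print(mean_rating)
print("Lowest average rating:", weakest_county)
```

A mean is only one possible summary; a median or the share of shops rated 4.0 or higher might give a different ordering when counties have very different numbers of shops.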


Within Alameda County, we performed K-means clustering to decide which particular city to open our new store in. K-means clustering groups together areas that share similar popular venues. This allows us to see which areas in Alameda County have the fiercest competition, which we want to avoid. For example, cluster 4 represents areas with Chinese restaurants and cafés as their top venues, while cluster 2 groups areas with Vietnamese restaurants as their top venue.
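To make the grouping idea concrete, here is a small self-contained sketch of K-means on venue-frequency features. The neighborhood names and frequencies are made up for illustration, and `n_init=10` is set explicitly as an assumption so the result does not depend on the scikit-learn version's default.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical venue frequencies per neighborhood (share of venues in each category).
freq = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D"],
    "Bubble Tea Shop": [0.30, 0.28, 0.02, 0.00],
    "Coffee Shop": [0.20, 0.22, 0.05, 0.03],
    "Vietnamese Restaurant": [0.05, 0.04, 0.40, 0.42],
})

# K-means needs numeric features only, so the label column is dropped first.
features = freq.drop(columns="Neighborhood")
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# Attach the cluster label to each neighborhood for inspection.
freq["Cluster Labels"] = model.labels_
print(freq[["Neighborhood", "Cluster Labels"]])
```

Neighborhoods A and B (heavy on bubble tea and coffee) should land in one cluster, and C and D (heavy on Vietnamese restaurants) in the other. The label numbers themselves are arbitrary.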


Out of these 5 clusters, **cluster 2 appears to be the optimal place to open a pearl milk tea store, given that there seems to be the least competition there** (the first comparable venue, a café, appears as the 7th most common venue, whereas a comparable venue appears earlier in other clusters, e.g. cluster 0, where Bubble Tea Shop ranks 2nd). ***Reexamining the dataframe, this corresponds to Union City.***
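The comparison of "first comparable venue" ranks can be sketched as below. The two venue lists are copied from the cluster 0 and cluster 2 tables above; the set of categories counted as comparable (`COMPARABLE`) and the helper name `first_comparable_rank` are assumptions made for this example, not definitions from the notebook.

```python
# Venue categories treated as direct or comparable competition (an assumption).
COMPARABLE = {"Bubble Tea Shop", "Coffee Shop", "Café"}

def first_comparable_rank(venues):
    """Return the 1-based rank of the first comparable venue, or None if absent."""
    for rank, venue in enumerate(venues, start=1):
        if venue in COMPARABLE:
            return rank
    return None

# Top-10 venue lists taken from the cluster tables in the analysis.
cluster_0_row = ["Chinese Restaurant", "Bubble Tea Shop", "Coffee Shop", "Pizza Place",
                 "Fast Food Restaurant", "Sandwich Place", "Grocery Store",
                 "Indian Restaurant", "Bakery", "Dessert Shop"]
cluster_2_row = ["Vietnamese Restaurant", "Grocery Store", "Asian Restaurant",
                 "Thai Restaurant", "Chinese Restaurant", "Bakery", "Café",
                 "Ramen Restaurant", "Bank", "Fast Food Restaurant"]

print(first_comparable_rank(cluster_0_row))  # 2
print(first_comparable_rank(cluster_2_row))  # 7
```

Widening `COMPARABLE` (for example, to include dessert shops or bakeries) would change the ranks, so the choice of categories should be stated alongside any conclusion drawn from them.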

# Conclusion <a name = "conclusion"></a>

The purpose of this project was to take a large data set of milk tea stores in the Bay Area and determine the optimal location for a new milk tea store from the perspective of a new business owner. Using the Foursquare API, we divided the Bay Area into 5 major counties, then narrowed the search using `bayarea_boba_spots.csv` to compare rating data. K-means clustering was then used to group areas by their most common venues to see which areas would be most competitive, allowing us to find the area with the least competition.

Our analysis showed that, when considering the popularity of pearl milk tea stores and of other beverage businesses that sell similar goods, **the optimal location for opening a new, non-franchised store in the immediate Bay Area is Union City, Alameda County.**