# Boba Shop Data Analysis: Part 1
### Python, Pandas, and Statistical Modeling






<a id='dataset'></a>
# Dataset

The dataset comes from kaggle.com. The raw data has not been cleaned, so we begin the project by cleaning and preparing it for analysis.

<a id='import'></a>
## Importing

The data is provided as a .csv file, and we want to load it into the notebook as a pandas DataFrame.

In [6]:
import pandas as pd
#Load the CSV file into a DataFrame
df = pd.read_csv('bayarea_boba_spots.csv')
df

Unnamed: 0.1,Unnamed: 0,id,name,rating,address,city,lat,long
0,0,99-tea-house-fremont-2,99% Tea House,4.5,3623 Thornton Ave,Fremont,37.562950,-122.010040
1,1,one-tea-fremont-2,One Tea,4.5,46809 Warm Springs Blvd,Fremont,37.489067,-121.929414
2,2,royaltea-usa-fremont,Royaltea USA,4.0,38509 Fremont Blvd,Fremont,37.551315,-121.993850
3,3,teco-tea-and-coffee-bar-fremont,TECO Tea & Coffee Bar,4.5,39030 Paseo Padre Pkwy,Fremont,37.553694,-121.981043
4,4,t-lab-fremont-3,T-LAB,4.0,34133 Fremont Blvd,Fremont,37.576149,-122.043705
...,...,...,...,...,...,...,...,...
598,598,munch-hayward,Munch,4.0,27560 Tampa Ave,Hayward,37.631869,-122.075384
599,599,foodnet-supermarket-san-leandro-2,Foodnet Supermarket,3.5,1960 Lewelling Blvd,San Leandro,37.679500,-122.154790
600,600,yo-bowl-hayward,Yo Bowl,4.0,8 Southland Mall,Hayward,37.651128,-122.101296
601,601,yogurt-hill-hayward-4,Yogurt Hill,4.0,1081 B St,Hayward,37.673550,-122.081140
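
The `Unnamed: 0` column in the output above is typically a row index that was written into the CSV when the file was saved. As a minimal sketch, assuming a small in-memory CSV with a similar layout (the two rows are copied from the output above, and `io.StringIO` stands in for the real file), `index_col=0` tells `read_csv` to use that column as the index instead of loading it as data:

```python
import io
import pandas as pd

# Stand-in for the real file: two rows copied from the output above.
csv_text = """,id,name,rating,city
0,99-tea-house-fremont-2,99% Tea House,4.5,Fremont
1,one-tea-fremont-2,One Tea,4.5,Fremont
"""

# Default behaviour: the unnamed first column is loaded as 'Unnamed: 0'.
default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the same column becomes the DataFrame index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default.columns.tolist())
print(indexed.columns.tolist())
```

This notebook keeps the column and renames it in the Mutation section, so this is only an alternative way to load the file.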


<a id='clean'></a>
## Cleaning

Looking at the DataFrame we just created, there are rows and columns that we don't need for certain parts of our analysis. There might also be rows with missing information or duplicate rows, so we want to clean the data before using it further.

<a id='in_place'></a>
### In-place?

As for whether to use the inplace=True parameter, I usually modify the DataFrame in place when clearing out null or duplicate values, and make a copy when deleting columns for different analytical cases.

Generally, here are the methods that are useful for data cleaning:
 - Characters surrounding a value, i.e., %#value
     - Look up the .strip() method
 - Duplicates
     - Look up the df.drop_duplicates() method
 - Empty Cells
     - Look up the df.fillna() and df.dropna() methods, but be careful about dropping rows/columns
 - Numbers stored as strings
     - Look up the int() and float() functions
 - Uppercase or lowercase strings
     - Look up the .lower() and .upper() methods
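
The methods listed above can be combined roughly as follows. This is a minimal sketch on made-up data (the `messy` frame is invented for illustration and is not part of the boba dataset). It uses the vectorized `.str` accessor and `pd.to_numeric` in place of calling `strip()`, `lower()`, or `float()` on each value:

```python
import pandas as pd

# Hypothetical messy data, invented for illustration only.
messy = pd.DataFrame({
    "name": ["  Tea Spot ", "BOBA HUT", "BOBA HUT", "Milk Bar"],
    "rating": ["4.5", "3.0", "3.0", None],
})

cleaned = messy.copy()
# Strip surrounding whitespace and normalise case.
cleaned["name"] = cleaned["name"].str.strip().str.lower()
# Convert numeric strings to floats; missing values become NaN.
cleaned["rating"] = pd.to_numeric(cleaned["rating"])
# Drop exact duplicate rows, then rows with missing values.
cleaned = cleaned.drop_duplicates().dropna()

print(cleaned)
```

Working on a copy keeps `messy` available for comparison, which is handy while deciding how aggressive the cleaning should be.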

<a id='pandas'></a>
# Pandas

Now that we have our data cleaned and ready to use, we are going to start looking at it a little more in depth!

<a id='inspecting'></a>
## Inspecting

Let's use some pandas methods to start inspecting our dataset.

In [7]:
#columns, rows, info
#First, let's take a look at the first 3 and last 3 rows of the dataset.
first3 = df.head(3)
last3 = df.tail(3)

first3, last3

(   Unnamed: 0                      id           name  rating  \
 0           0  99-tea-house-fremont-2  99% Tea House     4.5   
 1           1       one-tea-fremont-2        One Tea     4.5   
 2           2    royaltea-usa-fremont   Royaltea USA     4.0   
 
                    address     city        lat        long  
 0        3623 Thornton Ave  Fremont  37.562950 -122.010040  
 1  46809 Warm Springs Blvd  Fremont  37.489067 -121.929414  
 2       38509 Fremont Blvd  Fremont  37.551315 -121.993850  ,
      Unnamed: 0                                  id                    name  \
 600         600                     yo-bowl-hayward                 Yo Bowl   
 601         601               yogurt-hill-hayward-4             Yogurt Hill   
 602         602  alohana-hawaiian-grill-san-leandro  Alohana Hawaiian Grill   
 
      rating           address         city        lat        long  
 600     4.0  8 Southland Mall      Hayward  37.651128 -122.101296  
 601     4.0         1081 B S

In [8]:
#Let's see how large our dataset is. 
rows_columns = df.shape

rows_columns

(603, 8)
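
There are several ways to measure a DataFrame, and they are easy to mix up. A small sketch on a made-up three-row frame (invented for illustration) shows how `shape`, `len()`, and `size` differ:

```python
import pandas as pd

# Hypothetical three-row frame, invented for illustration.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "rating": [4.0, 3.5, 5.0],
    "city": ["Berkeley", "Oakland", "Berkeley"],
})

n_rows, n_cols = toy.shape   # (rows, columns)
n_cells = toy.size           # rows * columns, not the row count

print(n_rows, n_cols)
print(len(toy))              # number of rows
print(n_cells)
```

To count records such as shops, `len(df)` or `df.shape[0]` is the safer choice, because `size` grows with the number of columns.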

In [8]:
#Now that we know how many rows and columns we have, let's take a look at what columns we have.
columns = df.columns

columns

Index(['Unnamed: 0', 'id', 'name', 'rating', 'address', 'city', 'lat', 'long'], dtype='object')
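
Column names alone do not say which columns are numeric. As a hedged sketch, using a tiny frame with two rows copied from the dataset output above, `dtypes` and `select_dtypes` show the data type of each column and pick out the numeric ones:

```python
import pandas as pd

# Two rows copied from the output above, with only three of the columns.
toy = pd.DataFrame({
    "name": ["One Tea", "T-LAB"],
    "rating": [4.5, 4.0],
    "lat": [37.489067, 37.576149],
})

# One dtype per column.
print(toy.dtypes)

# Names of the columns that hold numbers.
numeric = toy.select_dtypes(include="number").columns.tolist()
print(numeric)
```

Knowing which columns are numeric matters later, since aggregations such as the mean only make sense on those columns.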

<a id='mutation'></a>
## Mutation

At this point, we might find it easier to change the index, some column names, or even remove some irrelevant columns/rows if we feel the need.

In [9]:
# It's usually good to have an index column, and we already do: 'Unnamed: 0' mirrors the row index.
# Renaming it to something more descriptive

df.rename(columns={'Unnamed: 0': 'index'}, inplace=True)
df.head()

Unnamed: 0,index,id,name,rating,address,city,lat,long
0,0,99-tea-house-fremont-2,99% Tea House,4.5,3623 Thornton Ave,Fremont,37.56295,-122.01004
1,1,one-tea-fremont-2,One Tea,4.5,46809 Warm Springs Blvd,Fremont,37.489067,-121.929414
2,2,royaltea-usa-fremont,Royaltea USA,4.0,38509 Fremont Blvd,Fremont,37.551315,-121.99385
3,3,teco-tea-and-coffee-bar-fremont,TECO Tea & Coffee Bar,4.5,39030 Paseo Padre Pkwy,Fremont,37.553694,-121.981043
4,4,t-lab-fremont-3,T-LAB,4.0,34133 Fremont Blvd,Fremont,37.576149,-122.043705


In [10]:
#With large datasets, it is very common to have missing (NaN) values
#We can either remove rows with missing data or impute a value; here we remove them

df.dropna(inplace=True)
df.drop_duplicates(inplace=True)
copy = df.copy()  #keep an independent copy of the cleaned data for later cells
df.shape

(597, 8)
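
When keeping a backup of a DataFrame, note that plain assignment (`b = a`) only binds a second name to the same object, whereas `.copy()` creates an independent one. A minimal sketch with made-up values (the names `original`, `alias`, and `snapshot` are hypothetical):

```python
import pandas as pd

# Made-up ratings with one missing value.
original = pd.DataFrame({"rating": [4.5, None, 4.0]})

alias = original             # same object, no data copied
snapshot = original.copy()   # independent copy

# An in-place operation mutates the object that `alias` also names.
original.dropna(inplace=True)

print(len(alias))      # follows the in-place change
print(len(snapshot))   # unaffected
```

The distinction only matters for in-place operations; reassigning a name to a new DataFrame leaves other names pointing at the old one.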

In [11]:
#Here, we are removing a few columns to get a clear and concise dataset of just the boba shops' basic info

shops = df.drop(columns=['lat', 'long', 'address'])
shops.shape

(597, 5)

<a id='sorting'></a>
## Sorting

Now that we know how our data is arranged and which columns it contains, let's start sorting and looking for trends!

In [12]:
# We can sort a text column in descending order
shop_names = shops.sort_values(by='city', ascending=False)
shop_names.head()

Unnamed: 0,index,id,name,rating,city
477,477,ice-monster-walnut-creek,Ice Monster,4.0,Walnut Creek
478,478,t4-walnut-creek,T4,3.5,Walnut Creek
479,479,chalogy-tea-bar-walnut-creek-4,CHALOGY Tea Bar,3.5,Walnut Creek
480,480,mr-green-bubble-walnut-creek-2,Mr. Green Bubble,3.5,Walnut Creek
484,484,t4-and-poke-walnut-creek,T4 And Poke,3.5,Walnut Creek


In [13]:
# We can also sort a numerical column in descending order, and then again in ascending order.
best_shops = shops.sort_values(by='rating', ascending=False)
best_shops.head(10)  #top 10 boba places

Unnamed: 0,index,id,name,rating,city
505,505,golden-bakery-pittsburg,Golden Bakery,5.0,Pittsburg
147,147,bobateani-san-jose,Bobateani,5.0,San Jose
89,89,puppy-bobar-san-francisco,Puppy Bobar,5.0,San Francisco
533,533,honey-bear-smoothie-tea-and-dessert-hayward,Honey Bear Smoothie Tea & Dessert,5.0,Hayward
426,426,waterfront-cafe-burlingame,Waterfront Cafe,5.0,Burlingame
397,397,i-tea-burlingame-2,i-Tea,5.0,Burlingame
368,368,mr-green-bubble-sunnyvale,Mr. Green Bubble,5.0,Sunnyvale
128,128,qteabar-oakland,QTeaBar,5.0,Oakland
365,365,taza-deli-and-cafe-redwood-city,Taza Deli & Cafe,5.0,Redwood City
82,82,keep-it-san-francisco-6,Keep it,4.5,San Francisco


In [14]:
#worst 5 boba places!
worst_shops = shops.sort_values(by='rating', ascending=True)
worst_shops.head(5)  

Unnamed: 0,index,id,name,rating,city
578,578,loving-tea-san-leandro,Loving Tea,2.0,San Leandro
530,530,quickly-kobe-bento-richmond-2,Quickly - Kobe Bento,2.0,Richmond
371,371,panda-express-mountain-view-2,Panda Express,2.0,Mountain View
587,587,china-kitchen-express-san-leandro,China Kitchen Express,2.0,San Leandro
585,585,leisure-cafe-san-leandro-2,Leisure Cafe,2.5,San Leandro
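
Many shops share the same rating, so the order among ties in the tables above is essentially arbitrary. One option, sketched here on four name and rating pairs taken from the outputs above, is to sort by a second column to break ties:

```python
import pandas as pd

# Four (name, rating) pairs taken from the outputs above.
toy = pd.DataFrame({
    "name": ["Puppy Bobar", "Keep it", "Bobateani", "Loving Tea"],
    "rating": [5.0, 4.5, 5.0, 2.0],
})

# Highest rating first; ties broken alphabetically by name.
ranked = toy.sort_values(by=["rating", "name"], ascending=[False, True])
print(ranked["name"].tolist())
```

The `ascending` list lines up with the `by` list, so each sort key can have its own direction.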


We can pick significant attributes from our dataframe and do a groupby on them to summarize each category. Here, we group by boba shop name and by city.

To apply an aggregation function, we need a column for it to operate on. Here we count rows using the index column and average the rating column.

In [17]:
import numpy as np
#What brands have the most boba shops?
brands = shops.drop(columns=['id', 'rating', 'city'])
brands.groupby(['name']).agg('count').sort_values(by='index', ascending=False).head(10)


Unnamed: 0_level_0,index
name,Unnamed: 1_level_1
Quickly,25
T4,18
i-Tea,16
Sharetea,15
Teaspoon,9
BAMBU,9
Gong Cha,8
Happy Lemon,7
Tapioca Express,6
Boba Guys,6


In [18]:
# What cities have an average boba shop rating of at least 4.5?
df = shops.drop(columns=['id', 'index', 'name'])
df = df.groupby(['city']).agg('mean').sort_values(by='rating', ascending=False)
best_city = df[df['rating'] >= 4.5]

best_city

Unnamed: 0_level_0,rating
city,Unnamed: 1_level_1
Brisbane,4.5
Corte Madera,4.5


(This result is probably skewed by the small number of boba shops in those cities. To find the cities with the best-rated boba, we should also take the number of shops in each city into account.)
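
One way to take sample size into account is to compute the shop count alongside the mean and keep only cities with enough shops. This is a minimal sketch, assuming made-up cities and ratings and an arbitrary `min_shops` threshold, none of which come from the dataset:

```python
import pandas as pd

# Hypothetical cities and ratings, invented for illustration.
toy = pd.DataFrame({
    "city": ["Tinytown", "Bigcity", "Bigcity", "Bigcity", "Midville", "Midville"],
    "rating": [5.0, 4.5, 4.0, 4.5, 3.5, 4.0],
})

# Mean rating and number of shops per city, in one pass.
summary = toy.groupby("city")["rating"].agg(mean_rating="mean", n_shops="count")

min_shops = 2  # arbitrary threshold chosen for this example
reliable = summary[summary["n_shops"] >= min_shops].sort_values(
    by="mean_rating", ascending=False
)
print(reliable)
```

Here the single-shop city drops out even though its average is the highest, which is the effect the caveat above describes.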

In [20]:
#Filter the dataframe to a single city, then take the first 10 rows of the result.

#First 10 shops in Berkeley, in dataset order (not sorted by rating)
Berkeley = shops[shops['city'] == 'Berkeley']
Berkeley = Berkeley.drop(columns=['id', 'index'])
Berkeley.head(10)

Unnamed: 0,name,rating,city
103,U Cha,4.0,Berkeley
105,Asha Tea House,4.0,Berkeley
109,Sharetea,4.0,Berkeley
110,Happy Lemon Berkeley,4.0,Berkeley
112,Boba Ninja,4.0,Berkeley
114,Purple Kow,3.5,Berkeley
123,Tea Press,4.0,Berkeley
135,Bubble Tea Share Time,4.0,Berkeley
528,Sweetheart Cafe,3.5,Berkeley
529,TeaOne Berkeley,3.5,Berkeley


<a id='Berkeley'></a>
## Around Berkeley

What are the best boba shops in Berkeley or SF? Does that sound right?

In [21]:
# .size counts cells (rows x columns), so we use len() to count shops.
# There are 58 boba shops in either SF or Berkeley!
near_me = shops[shops['city'].isin(['San Francisco', 'Berkeley'])]
near_me = near_me.drop(columns=['id', 'index'])
len(near_me)

58

Out of 58 boba shops around me, here are the ones with the highest rating!  
However, we do need to consider the <b>bias</b> that comes from the dataset. The dataset doesn't tell us how many reviews were averaged to produce each shop's rating. A shop may have a high rating because only a few die-hard fans reviewed it, while most people haven't even heard of the place, let alone rated it. This bias is embedded in the dataset, and we can't do much about it here.

In [22]:
best_near_me = near_me.sort_values(by='rating', ascending=False)
best_near_me[best_near_me['rating'] >= 4.5]

Unnamed: 0,name,rating,city
89,Puppy Bobar,5.0,San Francisco
71,The Boba Shop,4.5,San Francisco
61,OMG Tea,4.5,San Francisco
91,Little Heaven Deli,4.5,San Francisco
82,Keep it,4.5,San Francisco
77,Tancca,4.5,San Francisco
75,Tea Hut,4.5,San Francisco
93,Good Earth Cafe,4.5,San Francisco
70,Wondertea,4.5,San Francisco
92,5 Sweets,4.5,San Francisco


In [24]:
#Which city has the best boba?
shops = shops.groupby(['city']).mean(numeric_only=True).sort_values(by='rating', ascending=False)
shops.head()

Unnamed: 0_level_0,index,rating
city,Unnamed: 1_level_1,Unnamed: 2_level_1
Brisbane,129.0,4.5
Corte Madera,525.0,4.5
Burlingame,405.0,4.357143
San Pablo,489.0,4.25
San Carlos,334.0,4.25


Often, a pivot table is helpful for aggregating a numerical value across two attributes at once. This dataset doesn't have many useful numerical values for a pivot table, and we'll see why.

In [25]:
pivot = copy.pivot_table(values=['index', 'lat', 'long'], index='rating', columns='name', aggfunc='sum')
pivot

#Since each shop name appears with only one or a few ratings, most of the values are NaN
#Each cell is the sum of a numeric column (e.g. long) over shops sharing a name and rating, which is not very meaningful

Unnamed: 0_level_0,index,index,index,index,index,index,index,index,index,index,...,long,long,long,long,long,long,long,long,long,long
name,360 Crepes,5 Sweets,50 Tea,8-Twelve Oriental Market,85°C Bakery Cafe,99 Ranch Market,99% Tea House,Alice Street Bakery Café,Aloha Pure Water Shaved Ice,Alohana Hawaiian Grill,...,Yogurt Hill,Yogurt Shop,Yogurtland,Youji Fresh Rolls Wine & Tea,Yummi Tea Cafe,Yumygurt,i-Tea,i-Tea - Dublin,pokéLOVE,uRbain tea
rating,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2.0,,,,,,,,,,,...,,,,,,,,,,
2.5,,,,,,,,,,,...,,,,-121.944165,,,,,,
3.0,,,,,,,,,,602.0,...,,,,,,,,,,
3.5,,,122.0,,862.0,713.0,,,,,...,,,,,,,-244.452364,,-122.160532,-121.937048
4.0,208.0,,,537.0,,,,,,,...,-122.08114,-121.918741,-488.103331,,-121.875899,-122.285332,-1342.238727,-121.865862,,
4.5,,92.0,,,,,0.0,132.0,524.0,,...,,,,,,,-244.530775,,,
5.0,,,,,,,,,,,...,,,,,,,-122.346889,,,
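
A pivot table is more informative when the cell value is something worth aggregating. As a sketch on made-up rows (not the real dataset), counting shops per city and rating gives a small, fully populated table:

```python
import pandas as pd

# Hypothetical shops, invented for illustration.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "city": ["Berkeley", "Berkeley", "Oakland", "Oakland", "Oakland"],
    "rating": [4.0, 4.0, 3.5, 4.0, 3.5],
})

# Rows are cities, columns are ratings, cells count the shops.
counts = toy.pivot_table(
    values="name", index="city", columns="rating",
    aggfunc="count", fill_value=0,
)
print(counts)
```

Passing `fill_value=0` replaces the NaN cells for city and rating combinations that never occur, which suits a count.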


<a id='retrieving'></a>
## Retrieving Data

Now that we have the exact dataset we want, cleaned and filtered as we chose, let's start to extract some data. Calculating the mean and the median is a useful exercise.

In [28]:
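#Extract the mean and median values of a numerical column in your dataframe below.
#Note: df currently holds the per-city average ratings from cell 18, so these are the mean and median of the city averages.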

mean = df.mean()
median = df.median()

mean, median

(rating    3.774755
 dtype: float64,
 rating    3.764706
 dtype: float64)

In [29]:
copy.head()

Unnamed: 0,index,id,name,rating,address,city,lat,long
0,0,99-tea-house-fremont-2,99% Tea House,4.5,3623 Thornton Ave,Fremont,37.56295,-122.01004
1,1,one-tea-fremont-2,One Tea,4.5,46809 Warm Springs Blvd,Fremont,37.489067,-121.929414
2,2,royaltea-usa-fremont,Royaltea USA,4.0,38509 Fremont Blvd,Fremont,37.551315,-121.99385
3,3,teco-tea-and-coffee-bar-fremont,TECO Tea & Coffee Bar,4.5,39030 Paseo Padre Pkwy,Fremont,37.553694,-121.981043
4,4,t-lab-fremont-3,T-LAB,4.0,34133 Fremont Blvd,Fremont,37.576149,-122.043705


In [27]:
#Extract the value in the fifth row from the top and the fourth column from the left.


four_five = copy.iloc[4, 3]
four_five
#this cell should represent the rating of T-LAB

4.0

<a id='cutting'></a>
## Cutting Down the Data

At this point, we can begin to filter data by a specific column value, or even multiple! We can also start to slice the dataframe so that we only see a subset of the rows that we actually want to see.

In [30]:
#Using the loc function, we will cut down our data frame into a smaller data frame which includes the values 
#for the first 3 rows of the 'id' column.

df = copy.loc[[0,1,2], ['id']]
df

Unnamed: 0,id
0,99-tea-house-fremont-2
1,one-tea-fremont-2
2,royaltea-usa-fremont


In [31]:
#Next, cut down the data frame into a SERIES which includes the value for the first row in the same column.
s = df.loc[0]
s

id    99-tea-house-fremont-2
Name: 0, dtype: object

In [32]:
#Using the iloc function let’s cut down our data frame into a smaller data frame which includes the values 
#for the first 3 rows in any two columns. 
#Use slicing for your arguments to the iloc function, not lists.
copy.iloc[0:3, 2:4]

Unnamed: 0,name,rating
0,99% Tea House,4.5
1,One Tea,4.5
2,Royaltea USA,4.0
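
Since rows were dropped during cleaning, row labels and row positions in this notebook no longer always match, and `loc` and `iloc` also treat slice endpoints differently. A minimal sketch on a made-up four-row frame (invented for illustration) shows both effects:

```python
import pandas as pd

# Made-up four-row frame, invented for illustration.
full = pd.DataFrame(
    {"name": ["A", "B", "C", "D"], "rating": [4.5, 4.0, 3.5, 5.0]}
)

# Label slices include the end label; positional slices exclude the end.
by_label = full.loc[0:2, "name"].tolist()      # labels 0, 1, 2
by_position = full.iloc[0:2, 0].tolist()       # positions 0, 1

# After a row is removed, labels and positions no longer line up.
trimmed = full.drop(index=1)                   # remaining labels: 0, 2, 3
label_two = trimmed.loc[2, "name"]             # row labelled 2
position_two = trimmed.iloc[2, 0]              # third remaining row

print(by_label, by_position)
print(label_two, position_two)
```

As a rule of thumb, `loc` is the safer choice when a specific record is meant, and `iloc` when the position in the current ordering is what matters.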


Now let's return to some more mutation. We are going to add a column to the dataframe which is the sum of any two numerical columns and name this new column: “Sum of (column 1), (column 2)”. Here, we add the longitude to the latitude. This doesn't provide the most useful information, but it is interesting to see.

In [34]:
location = copy.drop(columns=['id', 'city', 'rating','address'])
location['sum of lat and long'] = location['lat'] + location['long']
location.head()

Unnamed: 0,index,name,lat,long,sum of lat and long
0,0,99% Tea House,37.56295,-122.01004,-84.44709
1,1,One Tea,37.489067,-121.929414,-84.440347
2,2,Royaltea USA,37.551315,-121.99385,-84.442535
3,3,TECO Tea & Coffee Bar,37.553694,-121.981043,-84.427348
4,4,T-LAB,37.576149,-122.043705,-84.467556


Finally, let's do some filtering. We are going to find the boba shops with a latitude of at least 38.5.

In [36]:
#Using a boolean condition, filter the dataframe by a numerical column value being above a certain threshold. 
#len() counts rows; .size would count cells (rows x columns) instead.

larger_lat = location[location['lat'] >= 38.5]
len(larger_lat)

10

There are 10 boba shops at or above that latitude! Now, we are going to do something that's actually useful. What are some of the highly rated Quickly shops? We use a similar filtering technique.

In [37]:
#Here, we combine multiple boolean conditions to filter the dataframe in two ways: by a numerical column and by a string column.

Good_Quickly = copy[(copy['name'] == 'Quickly') & (copy['rating'] > 3.0)]
Good_Quickly

Unnamed: 0,index,id,name,rating,address,city,lat,long
120,120,quickly-oakland-4,Quickly,4.5,1243 33rd Ave,Oakland,37.77651,-122.225059
134,134,quickly-oakland-5,Quickly,3.5,3306 Lakeshore Ave,Oakland,37.810622,-122.243926
205,205,quickly-pleasanton,Quickly,3.5,1 Stoneridge Mall Rd,Pleasanton,37.695654,-121.929318
353,353,quickly-sunnyvale-4,Quickly,3.5,415 N Mary Ave,Sunnyvale,37.390076,-122.042185
409,409,quickly-san-mateo,Quickly,3.5,142 E 3rd Ave,San Mateo,37.564362,-122.323524
418,418,quickly-burlingame-2,Quickly,3.5,1407 Burlingame Ave,Burlingame,37.577188,-122.348793
432,432,quickly-millbrae-4,Quickly,4.5,325 El Camino Real,Millbrae,37.60147,-122.39148
441,441,quickly-vallejo-12,Quickly,3.5,145 Plaza Dr,Vallejo,38.134083,-122.219154
475,475,quickly-antioch-2,Quickly,3.5,212 E 18th St,Antioch,38.004497,-121.799923
583,583,quickly-san-lorenzo,Quickly,3.5,17940 Hesperian Blvd,San Lorenzo,37.672989,-122.12214
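
Boolean masks can be named and combined with `&` (and), `|` (or), and `~` (not), which keeps longer filters readable. This is a hedged sketch on a small frame whose rows are partly made up for illustration, with `isin` used to match several shop names at once:

```python
import pandas as pd

# Small frame for illustration; rows are partly made up.
toy = pd.DataFrame({
    "name": ["Quickly", "Quickly", "T4", "Sharetea", "Loving Tea"],
    "rating": [4.5, 3.0, 3.5, 4.0, 2.0],
})

# Name each condition so the combined filters read clearly.
is_chain = toy["name"].isin(["Quickly", "T4"])
well_rated = toy["rating"] > 3.0

both = toy[is_chain & well_rated]          # chain shops rated above 3.0
either = toy[is_chain | well_rated]        # chain shops or well-rated shops
neither = toy[~(is_chain | well_rated)]    # everything else

print(len(both), len(either), len(neither))
```

Each condition needs its own parentheses when written inline, as in the cell above, because `&` and `|` bind more tightly than comparison operators in Python.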
