# Final Project: Yelp and Food Safety
#### Exploring the San Francisco Restaurant World

In this project, we will investigate a subset of restaurants in San Francisco, California, along with related information adapted from Yelp data. You will first explore the data about the restaurants themselves, calculating summary statistics and looking for patterns. Next, we will merge that data with a list of health inspection scores and violations that have been [made available by the San Francisco Department of Public Health](https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores-LIVES-Standard/pyih-qa8i). Lastly, we will run a linear regression analysis to see whether there is any meaningful relationship between health inspection scores and other aspects of each restaurant.

If you have any questions or get stuck, feel free to come to office hours.

**Helpful Resource:**

* [Python Reference](https://docs.google.com/document/d/1zpTTl47NoGf2A3_oE1YusLyb-cF2sZMALdCMM5dpYIA/edit): Cheat sheet for Python and other functions used in this course


To get started on the final project, first run the following cell to import some necessary packages, and have fun! 


In [5]:
# importing some helpful libraries
import pandas as pd
import numpy as np
from functions import * 

# downloading necessary data 

# !wget 

from project_helper import * 
def check(*args):
    return None

# **1. San Francisco Restaurant Data**

In this section, you'll learn a few more useful features of _dataframes_, which we previously used in Lab 3 as a way of managing data for analysis.

As you might have noticed, the package we are using is called _Pandas_, one of the most commonly used Python packages for cleaning and analyzing data. You will learn some of the most important features for manipulating data with Pandas, and get a feel for exploring data using Python.

## Part 1: Loading the Data

As mentioned in lecture, we can use Pandas to read many different data formats into a table. The most common are `.csv` files; CSV stands for comma-separated values.
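As a minimal sketch of what reading a CSV looks like, the cell below parses a tiny made-up table with `pd.read_csv`. The text in `csv_text` and the name `toy` are hypothetical stand-ins, not part of the project data; with a real file you would pass its path instead of a file-like object.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "name,rating\nCafe A,4.5\nCafe B,3.9\n"

# pd.read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column names are taken from the first line
```

Each line after the header becomes one row, and each comma-separated field becomes one column.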

Run the following cell to download two `.csv` files that contain the data you will be working with in this project.

In [1]:
# !wget ... 


As a side note, when you reopen this project in Google Colab, your code will remain, but any files you downloaded in the previous session will be deleted. Remember to rerun that cell to download the files each time you restart Google Colab.



### Question 1:

Now load the files named `businesses.csv` and `inspections.csv` into Pandas dataframes named `bus` and `ins`, respectively.

Run the cell afterwards to check if you did this correctly.

In [None]:
## Your Code Here...

bus = ...
ins = ...

In [6]:
check('q1a', [bus, ins])

In [9]:
# delete cell
bus = pd.read_csv('data/businesses.csv')
ins = pd.read_csv('data/inspections.csv')

Now that you've read in the files, let's try some `pd.DataFrame` methods ([docs](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.html)).
We can use the `DataFrame.head` method to show the top few lines of the `bus` and `ins` dataframes. To show several outputs from a single cell, you can use `display()`.

Run the following cell to display both dataframes.

In [10]:
display(bus.head(), ins.head())

Unnamed: 0,name,display_address,type,rating,review_count,price,latitude,longitude,bid
0,House of Prime Rib,"1906 Van Ness Ave, San Francisco, CA 94109",American (Traditional),4.52,7912,$$$,37.793338,-122.422827,3350
1,Burma Superstar,"309 Clement St, San Francisco, CA 94118",Burmese,4.69,7145,$$,37.783004,-122.462539,1977
2,B Patisserie,"2821 California St, San Francisco, CA 94115",Bakeries,4.77,3006,$$,37.788014,-122.440756,71696
3,Kokkari Estiatorio,"200 Jackson St, San Francisco, CA 94111",Greek,4.81,4843,$$$,37.796918,-122.399864,2858
4,San Tung,"1031 Irving St, San Francisco, CA 94122",Chinese,4.58,7497,$$,37.763891,-122.468805,67330


Unnamed: 0,iid,date,score,type,bid,timestamp,year
0,100504_20190411,04/11/2019 12:00:00 AM,88,Routine - Unscheduled,100504,2019-04-11,2019
1,100504_20190619,06/19/2019 12:00:00 AM,-1,New Ownership,100504,2019-06-19,2019
2,100504_20190927,09/27/2019 12:00:00 AM,-1,Reinspection/Followup,100504,2019-09-27,2019
3,100992_20190517,05/17/2019 12:00:00 AM,-1,Non-inspection site visit,100992,2019-05-17,2019
4,100992_20190621,06/21/2019 12:00:00 AM,-1,New Ownership,100992,2019-06-21,2019


You can also use the `DataFrame.describe` method to learn about the numeric columns of each dataframe. It is handy for computing summary statistics such as the count, mean, and quartiles of each column.
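Here is a small sketch of how `describe` behaves, using a made-up dataframe (the name `toy` and its values are hypothetical and unrelated to the project files). It illustrates that, by default, only numeric columns are summarized.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({
    "name": ["Cafe A", "Cafe B", "Cafe C", "Cafe D"],
    "rating": [4.0, 3.0, 5.0, 4.0],
    "review_count": [10, 20, 30, 40],
})

summary = toy.describe()

# Only the numeric columns appear; "name" is left out by default.
print(list(summary.columns))

# Each row of the summary is one statistic (count, mean, std, min, ...).
print(summary.loc["mean", "rating"])
```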

Try it out with our two dataframes.

In [11]:
# Try displaying the DataFrame.describe outputs for bus and ins

## Your code here...

In [12]:
# delete this cell 
display(bus.describe(), ins.describe())

Unnamed: 0,rating,review_count,latitude,longitude,bid
count,658.0,658.0,658.0,658.0,658.0
mean,4.086611,1239.18693,-4645.03543,-4730.493422,55116.506079
std,0.408482,977.693558,5010.965175,4930.982011,34643.693046
min,2.88,157.0,-9999.0,-9999.0,31.0
25%,3.79,616.25,-9999.0,-9999.0,15755.0
50%,4.09,966.0,37.749223,-122.477503,68375.0
75%,4.38,1564.5,37.782897,-122.420564,82940.5
max,4.98,7912.0,37.807854,-122.388189,102398.0


Unnamed: 0,score,bid,year
count,3056.0,3056.0,3056.0
mean,47.749673,54402.915249,2017.900851
std,45.423231,34729.741573,0.922701
min,-1.0,31.0,2016.0
25%,-1.0,7786.0,2017.0
50%,78.0,68394.0,2018.0
75%,90.0,82909.0,2019.0
max,100.0,102398.0,2019.0


From its name alone, we expect the `businesses.csv` file to contain information about the restaurants. Let's do some Exploratory Data Analysis (EDA) to get a better understanding of the data.



## Part 2: Exploring the Data

In Lab 3, we referred to the data in a column as an array. Another name for it is a `Series`, which is just a fancier version of an array.

The nice thing about Series is that they have lots of [built-in functions](https://pandas.pydata.org/docs/reference/api/pandas.Series.html), called methods.


- The [`Series.unique`](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html) method returns an array of all the unique entries inside of a Series. 

- The [`Series.value_counts`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) method returns a new Series that lists the number of occurrences of each unique element in a Series.

Read the documentation if you want a deeper look at these methods; it also includes examples of how they are used.
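The two methods above can be tried on a small example first. The Series `cuisines` below is hypothetical and only meant to show the shape of what each method returns.

```python
import pandas as pd

# Hypothetical Series of cuisine labels.
cuisines = pd.Series(["Thai", "Greek", "Thai", "Poke", "Thai", "Greek"])

# unique() returns an array of the distinct values, in order of first appearance.
distinct = cuisines.unique()
print(distinct, len(distinct))

# value_counts() returns a Series indexed by value,
# sorted from the most frequent value to the least frequent.
counts = cuisines.value_counts()
print(counts)
```

Note that `unique` returns an array while `value_counts` returns a Series, so the result of `value_counts` has its own index (the distinct values) that you can look entries up by.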

### Question 2a:

Notice that there are two different identifiers for businesses in our dataset: `bid`, which is an identification number, and `name`.

For each of these variables, figure out the number of unique entries. Assign the number of unique names to `n_bus` and the number of unique IDs to `n_bid`.

In [19]:
## Your Code Here...
n_bus = ...
n_bid = ...

print(' Number of Unique Businesses: ', n_bus, "\n Number of Unique Business ID: ", n_bid)

 Number of Unique Businesses:  Ellipsis 
 Number of Unique Business ID:  Ellipsis


In [20]:
#delete
n_bid = len(bus['bid'].unique())
n_bus = len(bus['name'].unique())

print(' Number of Unique Businesses: ', n_bus, "\n Number of Unique Business ID: ", n_bid)

 Number of Unique Businesses:  607 
 Number of Unique Business ID:  658


In [None]:
check('q2a', n_bus)

Interesting. There are more unique `bid` values than `name` values. As you might have guessed, this is because a restaurant can have more than one location. Both `bid` and `name` can be used to identify restaurants, but since `bid` also distinguishes between the locations of a restaurant, we say `bid` is more _granular_ than `name`.

### Question 2b:

Find the name of the restaurant with the most occurrences in our dataset, and assign its name as a string to `most_locations`.

In [24]:
#delete me
bus['name'].value_counts() 

One Market Restaurant          9
Cake Coquette                  5
Philz Coffee                   5
Tonga Room & Hurricane Club    4
Bake Cheese Tart               4
                              ..
Yummy Yummy                    1
Purple Kow                     1
Waraku                         1
15 Romolo                      1
Amber India                    1
Name: name, Length: 607, dtype: int64

In [None]:
## Your Answer Here...
most_locations = ...

In [None]:
check('q2b', most_locations)

### Question 2c:

The cool thing about Series is that when you apply a comparison operator to one, the comparison is applied to each entry in the Series. Figure out how many restaurant chains have more than one location, and assign that number to `num_mult_locations`.

_Hint: Remember that True and 1 are the same. First try getting a Series of booleans, and then use that to count the chains._
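To illustrate the idea in the hint on unrelated data, here is a short sketch. The Series `counts` and its labels are hypothetical; the point is only that a comparison produces booleans and that summing booleans counts the `True` entries.

```python
import pandas as pd

# Hypothetical counts, e.g. how many times each label appears somewhere.
counts = pd.Series([3, 1, 2, 1], index=["a", "b", "c", "d"])

# The comparison is applied to every entry, giving a boolean Series.
is_repeated = counts > 1
print(is_repeated)

# True counts as 1 and False as 0, so summing counts the True entries.
n_repeated = is_repeated.sum()
print(n_repeated)
```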

In [25]:
## Your Code Here... 
num_mult_locations = ...

In [None]:
check('q2c', num_mult_locations)

You can also use other comparisons to return a Series of booleans; refer to Lecture 2 for a list of comparison operators. This is very useful for filtering data from dataframes, which we will do in the next part.

## Part 3: Exploring the Data (cont.)

So far, you've had a chance to select and analyze data from a single column of a dataframe. This is useful when we want to analyze information across all of our observations (for example, across all restaurants in our dataset). Often, we also want to consider only a certain subset of our observations (for example, only the Italian restaurants).

There are [many ways to select subsets of data](https://pandas.pydata.org/docs/user_guide/indexing.html), but we will focus on boolean indexing.


In [30]:
bus.head()

Unnamed: 0,name,display_address,type,rating,review_count,price,latitude,longitude,bid
0,House of Prime Rib,"1906 Van Ness Ave, San Francisco, CA 94109",American (Traditional),4.52,7912,$$$,37.793338,-122.422827,3350
1,Burma Superstar,"309 Clement St, San Francisco, CA 94118",Burmese,4.69,7145,$$,37.783004,-122.462539,1977
2,B Patisserie,"2821 California St, San Francisco, CA 94115",Bakeries,4.77,3006,$$,37.788014,-122.440756,71696
3,Kokkari Estiatorio,"200 Jackson St, San Francisco, CA 94111",Greek,4.81,4843,$$$,37.796918,-122.399864,2858
4,San Tung,"1031 Irving St, San Francisco, CA 94122",Chinese,4.58,7497,$$,37.763891,-122.468805,67330


It is easier to show than to explain. Let's say I really liked _Burma Superstar_ and want all other restaurants with `type == 'Burmese'`. 

First, I can extract the `type` Series, just as we did in Lab 3.

In [32]:
# Just run this cell
types = bus['type']
types


0      American (Traditional)
1                     Burmese
2                    Bakeries
3                       Greek
4                     Chinese
                ...          
653              Coffee & Tea
654                      Poke
655                      Bars
656                      Thai
657        Breakfast & Brunch
Name: type, Length: 658, dtype: object

You'll notice that on the left of the Series output, there are numbers that each label one row. This is called the index, and it matches the index (also on the left) of the `bus` dataframe.

Next, like in Question 2c, I can use a comparison operator to find all entries that are equal to `'Burmese'`.
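A quick way to convince yourself that a column shares its index with the dataframe it came from is to check on a small example. The dataframe `toy` below is hypothetical, and `.loc` (label-based lookup) is used here only for illustration.

```python
import pandas as pd

# Hypothetical toy data; names and types are made up.
toy = pd.DataFrame(
    {"name": ["Cafe A", "Cafe B", "Cafe C"], "type": ["Thai", "Greek", "Thai"]}
)
toy_types = toy["type"]

# The extracted Series carries the same row labels as the dataframe.
print(toy_types.index.equals(toy.index))

# .loc looks up by index label, so label 1 refers to the same row in both.
print(toy.loc[1, "name"], toy_types.loc[1])
```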

In [33]:
burmese = types == 'Burmese'
burmese

0      False
1       True
2      False
3      False
4      False
       ...  
653    False
654    False
655    False
656    False
657    False
Name: type, Length: 658, dtype: bool

_Burma Superstar_, at index position 1, returns `True`, as expected. We can now use this Series to index into the `bus` dataframe. Rows where the "indexer" is `True` will be kept, and rows where it is `False` will be dropped. This does not change the original `bus` dataframe, so we have to assign the result to a new variable if we want to keep using it.
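The same behavior can be checked on a tiny made-up dataframe before trying it on `bus`. In this sketch, `toy`, `mask`, and `thai_only` are hypothetical names; it shows that the filtered result keeps the original index labels and that the original dataframe is left untouched.

```python
import pandas as pd

# Hypothetical toy data; names and types are made up.
toy = pd.DataFrame(
    {"name": ["Cafe A", "Cafe B", "Cafe C"], "type": ["Thai", "Greek", "Thai"]}
)

# Boolean Series: True where the type matches.
mask = toy["type"] == "Thai"

# Indexing with the boolean Series keeps only the True rows.
thai_only = toy[mask]

# The result keeps the original index labels of the rows that matched.
print(list(thai_only.index))

# The original dataframe still has all of its rows.
print(len(toy), len(thai_only))
```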


In [34]:
burmese_restaurants = bus[burmese]
burmese_restaurants

Unnamed: 0,name,display_address,type,rating,review_count,price,latitude,longitude,bid
1,Burma Superstar,"309 Clement St, San Francisco, CA 94118",Burmese,4.69,7145,$$,37.783004,-122.462539,1977
90,B Star,"127 Clement St, San Francisco, CA 94118",Burmese,3.92,1995,$$,37.783093,-122.460649,33911
179,Burma Love,"211 Valencia St, San Francisco, CA 94103",Burmese,4.17,1452,$$,-9999.0,-9999.0,83643
198,Yamo,"3406 18th St, San Francisco, CA 94110",Burmese,4.04,2027,$,37.761882,-122.419599,1231


We walked through it step by step, but this can be done in one line as follows:


In [36]:
burmese_restaurants = bus[bus['type'] == 'Burmese']
burmese_restaurants

Unnamed: 0,name,display_address,type,rating,review_count,price,latitude,longitude,bid
1,Burma Superstar,"309 Clement St, San Francisco, CA 94118",Burmese,4.69,7145,$$,37.783004,-122.462539,1977
90,B Star,"127 Clement St, San Francisco, CA 94118",Burmese,3.92,1995,$$,37.783093,-122.460649,33911
179,Burma Love,"211 Valencia St, San Francisco, CA 94103",Burmese,4.17,1452,$$,-9999.0,-9999.0,83643
198,Yamo,"3406 18th St, San Francisco, CA 94110",Burmese,4.04,2027,$,37.761882,-122.419599,1231


### Question 3a:

Create a new dataframe that only contains rows about the restaurant you found in Question 2b (the string you assigned to `most_locations`), and assign it to `most_locations_df`.

_Hint: There should be 

In [38]:
## Your Code Here...
most_locations_df = ...

In [39]:
check('q3a', most_locations_df)