# Final Project: Yelp and Food Safety
#### Exploring the San Francisco Restaurant World

In this project, we will investigate a set of restaurants in San Francisco, California, using data adapted from Yelp. You will first explore the data about the restaurants themselves, calculating summary statistics and looking for patterns. Next, we will merge that data with a list of health inspection scores and violations [made available by the San Francisco Department of Public Health](https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores-LIVES-Standard/pyih-qa8i). Lastly, we will run a linear regression analysis to see whether there is any meaningful relationship between health inspection scores and other aspects of each restaurant.

If you have any questions or get stuck, feel free to come to office hours.

**Helpful Resource:**

* [Python Reference](https://docs.google.com/document/d/1zpTTl47NoGf2A3_oE1YusLyb-cF2sZMALdCMM5dpYIA/edit): Cheat sheet for Python and other functions used in this course


To get started on the final project, first run the following cell to import some necessary packages, and have fun! 


In [None]:
# importing some helpful libraries
import pandas as pd
import numpy as np
from functions import * 

# downloading necessary data 

# !wget 

from project_helper import * 
def check(*args):
    return None

# **1. San Francisco Restaurant Data**

In this section, you'll learn a few more useful features of _dataframes_, which we previously used in Lab 3 as a way to manage data for analysis.

As you might have noticed, the package we are using is called _Pandas_, one of the most commonly used Python packages for cleaning and analyzing data. You will learn some of the most important features of manipulating data using Pandas, and get a feel for exploring data using Python.

## Part 1: Loading the Data

As mentioned in lecture, we can use Pandas to read many different data formats into a table. The most common are `.csv` files; CSV stands for comma-separated values.
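To make the idea concrete, here is a minimal sketch of reading CSV data into a table. The CSV text and its column names are made up for illustration and are not the project files; `io.StringIO` simply lets us treat a string as if it were a file.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "bid,name,rating\n1,Cafe A,4.5\n2,Cafe B,3.8\n"

# pd.read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)   # (number of rows, number of columns)
print(toy.dtypes)  # column types inferred by Pandas
```

The first line of the text becomes the column names, and each following line becomes one row.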

Run the following cell to download two `.csv` files that contain the data you will be working with in this project.

In [None]:
# !wget ... 


As a side note, when you reopen this project in Google Colab, your code will remain, but any files you downloaded in the previous session will be deleted. Remember to rerun that cell to download the files each time you restart Google Colab.



### Question 1:

Now, load the files named `businesses.csv` and `inspections.csv` into Pandas dataframes named `bus` and `ins`, respectively.

Run the cell afterwards to check if you did this correctly.

In [None]:
## Your Code Here...

bus = ...
ins = ...

In [None]:
check('q1a', [bus, ins])

In [None]:
# delete cell
bus = pd.read_csv('data/businesses.csv')
ins = pd.read_csv('data/inspections.csv')

Now that you've read in the files, let's try some `pd.DataFrame` methods ([docs](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.html)).
We can use the `DataFrame.head` method to show the top few lines of the `bus` and `ins` dataframes. To show multiple return outputs in one single cell, you can use `display()`.

Run the following cell to display both dataframes.

In [None]:
display(bus.head(), ins.head())

You can also use the `DataFrame.describe` method to learn about the numeric columns of each dataframe. It can be handy for computing summaries of various statistics of our dataframes. 

Try it out with our two dataframes.

In [None]:
# Try displaying the DataFrame.describe outputs for bus and ins

## Your code here...

In [None]:
# delete this cell 
display(bus.describe(), ins.describe())

From its name alone, we expect the `businesses.csv` file to contain information about the restaurants. Let's do some Exploratory Data Analysis (EDA) and see if we can get a better understanding of the data.



## Part 2: Exploring the Data

In Lab 3, we referred to the data in a column as an array. In Pandas, a column is called a `Series`, which is an array with some extra features.

The nice thing about Series is that they come with lots of [built-in functions](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) called methods.


- The [`Series.unique`](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html) method returns an array of all the unique entries inside of a Series. 

- The [`Series.value_counts`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) method returns a new Series that lists the number of occurrences of each unique element in a Series.

Read the documentation if you want a deeper look at these methods; it also includes examples of how they are used.
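As a quick illustration of the two methods above, here is a small sketch on a made-up Series (the `colors` data is hypothetical and unrelated to the project files):

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# Array of the distinct values, in order of first appearance.
unique_colors = colors.unique()

# Series mapping each value to its count, largest count first.
color_counts = colors.value_counts()

print(unique_colors)
print(color_counts)
```

Note that `unique` gives back a plain array, while `value_counts` gives back a Series whose index holds the unique values.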

### Question 2a:

Notice that there are two different identifiers for businesses in our dataset, `bid` which is an identification number, as well as `name`. 

For both of these variables, figure out the number of unique entries. Assign the number of unique names to `n_bus` and the number of unique IDs to `n_bid`.

In [None]:
## Your Code Here...
n_bus = ...
n_bid = ...

print(' Number of Unique Businesses: ', n_bus, "\n Number of Unique Business ID: ", n_bid)

In [None]:
#delete
n_bid = len(bus['bid'].unique())
n_bus = len(bus['name'].unique())

print(' Number of Unique Businesses: ', n_bus, "\n Number of Unique Business ID: ", n_bid)

In [None]:
check('q2a', n_bus)

Interesting. There are more unique `bid`s than there are `name`s. As you might have guessed, this is because a restaurant can have more than one location: locations of the same restaurant share a `name`, but each one has its own `bid`. Since `bid` also distinguishes between the locations of a restaurant, we say `bid` is more _granular_ than `name`.

### Question 2b:

Find the name of the restaurant that occurs most often in our dataset, and assign its name as a string to `most_locations`.

In [None]:
#delete me
bus['name'].value_counts() 

In [None]:
## Your Answer Here...
most_locations = ...

In [None]:
check('q2b', most_locations)

### Question 2c:

The cool thing about Series is that when you apply a comparison operator to one, the comparison is applied to each entry in the Series. Figure out how many restaurant chains have more than one location, and assign that number to `num_mult_locations`.

_Hint: Remember that `True` and 1 are the same. First try getting a Series of booleans, and then use that to find the count._

In [None]:
## Your Code Here... 
num_mult_locations = ...

In [None]:
check('q2c', num_mult_locations)

You can also use other comparisons to return a Series of booleans; refer to Lecture 2 for a list of comparison operators. This is very useful for filtering data from dataframes, which we will do in the next part.
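Here is a minimal sketch of an element-wise comparison, using a made-up Series of numbers (the `visits` data is hypothetical):

```python
import pandas as pd

visits = pd.Series([1, 4, 2, 1, 7])  # hypothetical counts

# The comparison is applied to every entry, giving a Series of booleans.
is_repeat = visits > 1

# True counts as 1 and False as 0, so summing counts the True entries.
n_repeat = is_repeat.sum()

print(is_repeat)
print(n_repeat)
```

The same pattern works with `<`, `>=`, `<=`, `==`, and `!=`.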

## Part 3: Exploring the Data (cont.)

So far, you've had a chance to select and analyze data from a single column of a dataframe. This is useful when we want to analyze information across all the observations we have (for example, across all restaurants in our dataset). Often, we also want to consider only a certain subset of our observations (for example, only the Italian restaurants).

There are [many ways to select subsets of data](https://pandas.pydata.org/docs/user_guide/indexing.html), but we will focus on boolean-indexing. 


Let's walk through a short little example:

In [None]:
# Output data frame for convenience
bus.head()

Say I really liked _Burma Superstar_ and want all other restaurants with `type == 'Burmese'`. 

First, I can extract the `type` column as a Series, similarly to how we have done in Lab 3.

In [None]:
# Just run this cell
types = bus['type']
types


You'll notice that on the left of the Series output, there is a number next to each restaurant type. This is called the index, and it corresponds to the index (also on the left) in the `bus` dataframe.

_Indices don't have to be in ascending order, and they also do not have to be numbers either, but more on this later._
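As a small preview of that note, here is a sketch of a Series whose index is made of strings rather than numbers. The dish names and prices are made up for illustration.

```python
import pandas as pd

# Hypothetical prices, indexed by dish name instead of 0, 1, 2, ...
prices = pd.Series([12.5, 8.0, 15.0], index=["ramen", "taco", "pizza"])

taco_price = prices.loc["taco"]   # look up by index label
first_price = prices.iloc[0]      # look up by position

print(prices)
print(taco_price, first_price)
```

`.loc` uses the labels in the index, while `.iloc` always uses the position, no matter what the labels are.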

Next, like in Question 2c, I can use a comparison operator to find all entries that are equal to `'Burmese'`.

In [None]:
burmese = types == 'Burmese'
burmese

_Burma Superstar_, at index position 1, returns `True`, as expected. We can now use this Series to index into the `bus` dataframe. Rows that correspond to `True` values in the "indexer" will be kept, and all rows that correspond to `False` will be dropped. This does not change the original `bus` dataframe, so we have to assign the result to a new variable if we want to keep using it.


In [None]:
burmese_restaurants = bus[burmese]
burmese_restaurants

We walked through it step by step, but this can be done in one line as follows:


In [None]:
burmese_restaurants = bus[bus['type'] == 'Burmese']
burmese_restaurants

### Question 3a:

Using boolean-indexing, create a new dataframe that only contains the rows in `bus` for the restaurant you found in part 2b (the string you assigned to `most_locations`) and assign it to `most_locations_df`. 


In [None]:
## Your Code Here...
most_locations_df = ...

In [None]:
check('q3a', most_locations_df)

We can also do more complicated selects over multiple different columns. As we've mentioned, the syntax in Python is very similar to English. 

Say I wanted to find restaurants that were both `type == 'Chinese'` AND had `price == '$$'`. The syntax would be exactly that! 

One finicky note, however: you cannot use `and` or `or`. Instead, use the ampersand `&` and the pipe symbol `|`, respectively, and wrap each comparison in parentheses.

In [None]:
# run this cell and see what it does
chinese_2 = bus[(bus['type'] == 'Chinese') & (bus['price'] == '$$')]
chinese_2
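The cell above combines two conditions with `&`. For comparison, here is a sketch of the "or" operator `|` and the "not" operator `~` on a tiny made-up dataframe (the `toy` data is hypothetical, not the project data):

```python
import pandas as pd

toy = pd.DataFrame({
    "type":  ["Chinese", "Thai", "Chinese", "Pizza"],
    "price": ["$$", "$", "$$$", "$$"],
})

# OR: keep rows that are Thai, or cost '$$', or both.
thai_or_mid = toy[(toy["type"] == "Thai") | (toy["price"] == "$$")]

# NOT: keep rows that are not Chinese.
not_chinese = toy[~(toy["type"] == "Chinese")]

print(thai_or_mid)
print(not_chinese)
```

The parentheses around each comparison matter: `&` and `|` are evaluated before `==` in Python, so leaving them out typically raises an error.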

### Question 3b: 

Create a dataframe that contains all restaurants that have a rating below 4.0 and more than 1,000 reviews. Then, using this subset, figure out how many restaurants are in each of the four price categories ('$', '$$', '$$$', and '$$$$') and assign the result as a Series to the variable `q3b`. 

Your answer should have the price categories as its index and the count for each category as the values.


In [None]:
## Your Code Here...

q3b = ...

In [None]:
check('q3b', q3b)

### Question 3c:

You might have noticed that some of the longitude and latitude values are -9999. With numerical data, a placeholder value like this is a common way to indicate that the data is missing, instead of leaving the space blank.
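To see why placeholder values need care, here is a sketch on made-up coordinates. It swaps the placeholder for `NaN`, which is how Pandas normally marks missing numbers. This is an alternative way of handling placeholders, shown for illustration only; it is not the approach the question below asks for.

```python
import numpy as np
import pandas as pd

# Hypothetical data with -9999 marking a missing coordinate.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "latitude": [37.77, -9999.0, 37.80],
    "longitude": [-122.42, -9999.0, -122.41],
})

# Replace the placeholder with NaN in the coordinate columns only.
cleaned = toy.copy()
coord_cols = ["latitude", "longitude"]
cleaned[coord_cols] = cleaned[coord_cols].replace(-9999.0, np.nan)

n_missing = cleaned["latitude"].isna().sum()

# The mean skips NaN, but it would be badly distorted by -9999.
mean_lat = cleaned["latitude"].mean()
print(n_missing, mean_lat, toy["latitude"].mean())
```

Comparing the last two numbers shows how much a single placeholder can distort a summary statistic.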

#### Part 1:

Filter out the rows of `bus` that have missing coordinate data, and assign the remaining rows to the dataframe `bus_coords`.

In [None]:
## Your Code Here...

bus_coords = ...
bus_coords

In [None]:
bus_coords = bus[(bus['latitude'] != -9999) & (bus['longitude'] != -9999)]

In [None]:
check('q3c1', bus_coords)

#### Part 2

Next, we'll use a new package called Seaborn to plot the coordinates on a graph. The cool thing about Seaborn is that it makes it easy to encode extra information in aspects of the plot, like color! 

We've imported Seaborn for you and made a basic plot of all the restaurants using the `scatterplot` function, with each restaurant's `review_count` encoded as the color of its point. Take a look at the [documentation](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) and have some fun plotting data from the `bus_coords` dataframe.

In [None]:
# A little example
import seaborn as sns

sns.scatterplot(data = bus_coords, 
                x = 'longitude',   # east-west position on the horizontal axis
                y = 'latitude',    # north-south position on the vertical axis
                hue = 'review_count')

Some potential ideas:
- Look at the distributions of some select cuisines, are they clustered around each other? (Probably subset the data before plotting)
- Plot the locations of highly rated restaurants, and encode the corresponding price category as the size of each data point 
- Encode the rating to the color of the data, and see if things are clustered together! 


Feel free to implement one of the ideas above, or try something new. 

Create your graph in the following code cell, and write down your findings as a comment in the same cell! 

In [61]:
## Your Code Here



## Write down your discoveries as a comment! 

# 2. Health Inspection Data

In this next section, we're going to merge the health inspection data with our business data. We will be doing some more statistics in this part in addition to exploring the data. 

In [64]:
ins.head()

|   | iid | date | score | type | bid | timestamp | year |
|---|---|---|---|---|---|---|---|
| 0 | 100504_20190411 | 04/11/2019 12:00:00 AM | 88 | Routine - Unscheduled | 100504 | 2019-04-11 | 2019 |
| 1 | 100504_20190619 | 06/19/2019 12:00:00 AM | -1 | New Ownership | 100504 | 2019-06-19 | 2019 |
| 2 | 100504_20190927 | 09/27/2019 12:00:00 AM | -1 | Reinspection/Followup | 100504 | 2019-09-27 | 2019 |
| 3 | 100992_20190517 | 05/17/2019 12:00:00 AM | -1 | Non-inspection site visit | 100992 | 2019-05-17 | 2019 |
| 4 | 100992_20190621 | 06/21/2019 12:00:00 AM | -1 | New Ownership | 100992 | 2019-06-21 | 2019 |


Let's examine the inspection scores `ins['score']`

In [63]:
ins['score'].value_counts().head()

-1     1410
 96     193
 90     177
 94     160
 92     148
Name: score, dtype: int64

It looks like there are a lot of inspections with the `'score'` of `-1`. In fact, only health inspections of the 'Routine - Unscheduled' type are scored. 

In the following cell, we used the [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) function from Pandas to join the inspection data with the business data. You don't need to know how it works, but if you're curious, read more in the documentation. The merged dataframe is called `ins_named`.

In [66]:
ins_named = ins.merge(right = bus, how = 'inner', on = 'bid')
ins_named.head()

|   | iid | date | score | type_x | bid | timestamp | year | name | display_address | type_y | rating | review_count | price | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100504_20190411 | 04/11/2019 12:00:00 AM | 88 | Routine - Unscheduled | 100504 | 2019-04-11 | 2019 | Holy Gelato! | 1392 9th Ave, San Francisco, CA 94122 | Gelato | 4.65 | 1100 | $$ | -9999.0 | -9999.0 |
| 1 | 100504_20190619 | 06/19/2019 12:00:00 AM | -1 | New Ownership | 100504 | 2019-06-19 | 2019 | Holy Gelato! | 1392 9th Ave, San Francisco, CA 94122 | Gelato | 4.65 | 1100 | $$ | -9999.0 | -9999.0 |
| 2 | 100504_20190927 | 09/27/2019 12:00:00 AM | -1 | Reinspection/Followup | 100504 | 2019-09-27 | 2019 | Holy Gelato! | 1392 9th Ave, San Francisco, CA 94122 | Gelato | 4.65 | 1100 | $$ | -9999.0 | -9999.0 |
| 3 | 100992_20190517 | 05/17/2019 12:00:00 AM | -1 | Non-inspection site visit | 100992 | 2019-05-17 | 2019 | District Tea | 2154 Mission St, San Francisco, CA 94110 | Bubble Tea | 4.62 | 324 | $$ | -9999.0 | -9999.0 |
| 4 | 100992_20190621 | 06/21/2019 12:00:00 AM | -1 | New Ownership | 100992 | 2019-06-21 | 2019 | District Tea | 2154 Mission St, San Francisco, CA 94110 | Bubble Tea | 4.62 | 324 | $$ | -9999.0 | -9999.0 |
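If you are curious where the `type_x` and `type_y` columns in the output above come from, here is a small sketch of an inner merge on two made-up dataframes (`left` and `right` are hypothetical). Both have a column named `type`, so Pandas adds its default suffixes to tell them apart.

```python
import pandas as pd

left = pd.DataFrame({"bid": [1, 1, 2, 3], "type": ["a", "b", "a", "c"]})
right = pd.DataFrame({"bid": [1, 2], "type": ["Cafe", "Bar"], "name": ["X", "Y"]})

# Inner merge: keep only rows whose 'bid' appears in both dataframes.
merged = left.merge(right=right, how="inner", on="bid")

print(merged)
```

The row with `bid` 3 disappears because it has no match in `right`, and the `right` row for `bid` 1 is repeated once for each matching row in `left`.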


### Question 4a:
Filter out the inspections that are not of the 'Routine - Unscheduled' type, and assign the result to the variable `scores`.

In [78]:
## Your Code Here...

scores = ...
scores




In [79]:
check('q4a', scores)

### Question 4b: 

#### Part 1:
Next, plot a bar chart of the distribution of scores. There should be a bar for each of the discrete scores (a histogram would mask the details of the distribution).
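One possible way to prepare data for such a chart is to count each distinct value and put the counts in order. The sketch below uses made-up numbers (`toy_scores` is hypothetical) and stops short of drawing the plot.

```python
import pandas as pd

toy_scores = pd.Series([90, 96, 90, 100, 96, 90])  # hypothetical scores

# One entry per distinct score, ordered by score rather than by count.
counts = toy_scores.value_counts().sort_index()

print(counts)
# counts.plot(kind="bar") would then draw one bar per distinct score
# (this needs matplotlib to be installed).
```

Sorting by the index keeps the bars in score order, which makes the shape of the distribution easier to read.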


In [None]:
## Your Code Here....

#### Part 2:

Describe the qualities of the distribution of the inspection scores based on your bar chart. Consider the skewness, the mean, the median, and any anomalous values. Are there any unusual features of this distribution?

_Write your answer in this cell:_

### Question 4c: 

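Let's figure out which restaurant had the worst score in our sample of data. Use `ins_named` to find the lowest score. Remember that a score of `-1` marks an inspection that was not scored, so those rows should not count.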

A method that might be useful is [`DataFrame.sort_values`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html). 

Then assign the name of the worst restaurant to `worst_restaurant`
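Here is a minimal sketch of how `sort_values` behaves, on a made-up dataframe (the names and scores are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"name": ["A", "B", "C"], "score": [92, 71, 85]})

# Sort rows by the 'score' column; ascending order is the default.
by_score = toy.sort_values("score")
lowest_name = by_score.iloc[0]["name"]  # first row after sorting

# Pass ascending=False to put the largest value first instead.
highest_first = toy.sort_values("score", ascending=False)

print(by_score)
print(lowest_name)
```

`sort_values` returns a new dataframe and leaves `toy` unchanged, so the result has to be assigned to a variable to be reused.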

In [68]:
## Your Code Here...

worst_restaurant = ...
worst_restaurant


In [69]:
check('q4c', worst_restaurant)

### Question 4d: 

Let's see which restaurant has had the most extreme change in its health inspection scores. Let the "swing" of a restaurant be defined as the difference between its highest-ever and lowest-ever score. **Only consider restaurants with at least 3 scores, that is, restaurants that have been scored at least 3 times!** 

*Note*: The "swing" is of a specific business. There might be some restaurants with multiple locations; each location has its own "swing".
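To make the definition concrete, here is a worked sketch on made-up data. It uses `groupby`, a Pandas technique that the parts below do not ask for, so treat it only as an illustration of what "swing" means; the `toy` dataframe is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "bid":   [1, 1, 1, 2, 2, 3, 3, 3],
    "score": [80, 95, 90, 100, 70, 88, 88, 92],
})

grouped = toy.groupby("bid")["score"]
n_scores = grouped.count()             # how many scores each bid has
swing = grouped.max() - grouped.min()  # highest score minus lowest score

# Keep only the businesses with at least 3 scores.
swing_3plus = swing[n_scores >= 3]

print(swing_3plus)
```

Business 2 has the largest swing (100 - 70 = 30), but it is excluded because it has only two scores.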

#### Part 1:

First, assign the Series of unique restaurant `bid`s with at least three health scores to the variable `unique_bids`.

In [77]:
## Your Code Here...
unique_bids = ...
unique_bids


#### Part 2:

Next, make a for loop that loops through all of the `unique_bids`. In each iteration of the loop, you should create a subset of `scores` for that `bid`, then calculate the swing for that `bid` and append it to the list `swings`, which we have created for you. 

After running your code, `swings` should be a list of numbers that represents the swing of each `bid`, in the order of `unique_bids`.

_Hint: you can use `list.append(x)` to append a number `x` to a list._
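The general pattern of "subset, compute, append" can be sketched on a made-up dataframe. The example below computes a total per team rather than a swing; the `toy` data and its column names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "team":   ["a", "a", "b", "b", "b"],
    "points": [3, 5, 2, 8, 4],
})

totals = []
for team in toy["team"].unique():
    subset = toy[toy["team"] == team]      # rows for this team only
    totals.append(subset["points"].sum())  # one number per team

print(totals)
```

Because the loop visits the teams in the order given by `unique`, the numbers in `totals` line up with that order.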


In [85]:
swings = []

## Your Code Here...



In [84]:
check('q4d2', swings)

In the following cell, we've made a new dataframe for you that combines `unique_bids` and `swings`, called `swings_df`.

In [86]:
swings_df = pd.DataFrame({'bid':unique_bids, "swing":swings})
swings_df

|   | bid | swing |
|---|---|---|
