# Homework 4: Functions, Tables, and Groups

Please complete this notebook by filling in the cells provided. 

**Helpful Resource:**
- [Python Reference](http://data8.org/sp22/python-reference.html): Cheat sheet of helpful array & table methods used in Data 8!

**Recommended Readings**: 

* [Visualizing Numerical Distributions](https://www.inferentialthinking.com/chapters/07/2/Visualizing_Numerical_Distributions.html)
* [Functions and Tables](https://www.inferentialthinking.com/chapters/08/Functions_and_Tables.html)

## Instructions

  - For every problem that asks for a written explanation, you **must** provide your answer in the designated space. 

  - Directly sharing answers is not okay, but discussing problems with your instructor or with other students is encouraged. 

  - You should start early so that you have time to get help if you're stuck.

## 1. Burrito-ful San Diego

In [None]:
# Run this cell to set up the notebook, but please don't change it.

# These lines import the NumPy and datascience modules.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

import warnings
warnings.simplefilter('ignore', FutureWarning)
warnings.filterwarnings("ignore")

Mira, Sofia, and Sara are trying to use data science to find the best burritos in San Diego! Their friends Jessica and Sonya provided them with two comprehensive datasets on many burrito establishments in the San Diego area, cleaned from this source: https://www.kaggle.com/srcole/burritos-in-san-diego/data

The following cell loads a table called `ratings` which contains names of burrito restaurants, their Yelp rating, Google rating, and their overall rating. The `Overall` rating is not an average of the `Yelp` and `Google` ratings, but rather it is the overall rating of the customers that were surveyed in the study above.

It also loads a table called `burritos_types` which contains names of burrito restaurants, their menu items, and the cost of the respective menu item at the restaurant.

In [None]:
# Just run this cell
ratings = Table.read_table("ratings.csv")
print("Table #1: ratings")
ratings.show(5)
burritos_types = Table.read_table("burritos_types.csv").drop(0)
print("Table #2: burritos_types")
burritos_types.show(5)

**Question 1.** It would be easier if we could combine the information in both tables. Assign `burritos` to the result of joining the two tables together, so that we have a table with the ratings for every corresponding menu item from every restaurant. Each menu item has the same rating as the restaurant it comes from. **(4 Points)**

*Note:* It doesn't matter which of the two tables you call `join` on and which one you pass in as the argument.

*Hint:* If you need help on using the `join` method, refer to the [Python Reference Sheet](http://data8.org/sp22/python-reference.html) or [Section 8.4](https://www.inferentialthinking.com/chapters/08/4/Joining_Tables_by_Columns.html) in the textbook.
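To build intuition for what a join does, here is a minimal plain-Python sketch. It is not the `datascience` implementation and not the answer to this question; the shop names, items, and numbers are hypothetical toy data. The idea is that rows are matched on the values in a shared column, so each menu item picks up the rating of its shop.

```python
# Hypothetical toy data; these are not rows from ratings.csv or burritos_types.csv.
shop_ratings = {"Shop A": 4.0, "Shop B": 3.5}   # one entry per shop
menu_items = [                                   # one row per (shop, item, cost)
    ("Shop A", "Veggie", 6.50),
    ("Shop A", "Carnitas", 7.25),
    ("Shop B", "Veggie", 5.75),
]

# A join matches rows whose values in the shared column are equal,
# so every menu item is paired with the rating of its own shop.
joined = [
    (shop, item, cost, shop_ratings[shop])
    for shop, item, cost in menu_items
    if shop in shop_ratings
]

for row in joined:
    print(row)
```

Notice that the joined result has one row per menu item, and shops with several items have their rating repeated on each of those rows.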


In [None]:
burritos = ...
print("Table #3: burritos")
burritos.show(5)

<!-- BEGIN QUESTION -->

**Question 2.** Let's look at how the Yelp scores compare to the Google scores in the `burritos` table. 

  - First, assign `yelp_and_google` to a table only containing the columns `Yelp` and `Google`. 
  - Then, make a scatter plot with Yelp scores on the x-axis and the Google scores on the y-axis. 
  
**(8 Points)**


In [None]:
yelp_and_google = ...
...

# Don't change/edit/remove the following line.
# To help you make conclusions, we have added the y=x line on the scatterplot
plt.plot(np.arange(2.5,5,.5), np.arange(2.5,5,.5));

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.** Looking at the scatter plot you just made in Question 1.2, do you notice any pattern(s) (e.g., is one of the two types of scores consistently higher than the other one)? If so, describe those patterns **briefly** in the cell below. **(4 Points)**


_Type your answer here, replacing this text._

<!-- END QUESTION -->

To prepare for the next question, review how `.group` works. 

1. Here's a link to the relevant [textbook](https://www.inferentialthinking.com/chapters/08/2/Classifying_by_One_Variable.html) section. 
2. You can also refer to the [Python Reference](https://www.data8.org/sp22/python-reference.html). Use Ctrl-F to find occurrences of `tbl.group` on that webpage.

You might also want to keep in mind Section 6.2.5 of the textbook, which lists the various conditions available for use with calls to the `where` method: `original_table_name.where(column_label_string, are.condition)`
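As a rough illustration of what a `where` condition does, the sketch below filters toy data with NumPy boolean masks. The names and scores are hypothetical, and the comments only suggest an analogy with the `are.containing` and `are.above` predicates; this is not how the `datascience` library is implemented.

```python
import numpy as np

# Hypothetical toy data, not taken from the homework tables.
names = np.array(["Veggie", "Veggie Deluxe", "Carnitas", "Fish"])
scores = np.array([4.0, 3.0, 4.5, 3.5])

# Boolean mask: True where the name contains the substring "Veggie".
# Assumed analogy: tbl.where("Name", are.containing("Veggie")).
has_veggie = np.char.find(names, "Veggie") >= 0

# Boolean mask: True where the score is above 3.2.
# Assumed analogy: tbl.where("Score", are.above(3.2)).
high_score = scores > 3.2

# Indexing with a mask keeps only the entries marked True.
print(names[has_veggie])
print(names[high_score])
```

Each condition keeps a row or drops it; the rows that survive keep their original order.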

**Question 4.** There are so many types of California burritos in the `burritos` table! Sara wants to know which type is the highest rated across all restaurants. Since we do not have the ratings of individual menu items at these restaurants, we will substitute the `Overall` restaurant rating. For example, the California burrito at Albertacos will be considered to have a rating of 3.45 (refer to Table \#3: burritos, printed previously).

Create a table with two columns: the first column should include the distinct names of the "California" burritos and the second column should contain the average overall ratings of those burritos (averaged across the various restaurants). **In your calculations, you should only compare burritos that contain the word "California".** For example, there are "California" burritos, "California Breakfast" burritos, "California Surf And Turf" burritos, etc. **(8 Points)**

*Hint:* If multiple restaurants serve the "California - Chicken" burrito, what table method can we use to aggregate those together and find the average overall rating?

*Note:* Feel free to break up the solution into multiple lines and steps, if you find that helpful; just make sure you assign the final output table to `california_burritos`! 


In [None]:
california_burritos = ...
print("Final output table contains", california_burritos.num_rows, "rows.")
california_burritos.show()

If all went well in your work for Question 4, you should have found that the final output table has 18 rows and the "California - Chicken" burrito has an Overall mean of 3.48831.

**Question 5.** Given this new table `california_burritos`, Sara can figure out the name of the California burrito with the highest overall average rating! Assign `best_california_burrito` to a line of code that outputs the string that represents the name of the California burrito with the highest overall average rating. (In case of a tie for highest-rated, you can output any one of them.) **(4 Points)**


In [None]:
best_california_burrito = ...
best_california_burrito

<!-- BEGIN QUESTION -->

**Question 6.** Mira thinks that burritos in San Diego are cheaper (and taste better) than the burritos where she lives. Plot a histogram that visualizes the distribution of the costs of the burritos from San Diego using the `burritos` table. Also use the provided `my_bins` variable (in the cell below) when making your histogram, so that the histogram is more visually informative. **(4 Points)**


In [None]:
my_bins = np.arange(0, 15, 1)

# Draw a histogram of burrito costs; use my_bins for the bins
...

<!-- END QUESTION -->

**Question 7.** What percentage of burritos in San Diego (according to the `burritos` table) cost less than $6? Assign `burritos_less_than_6` to your answer, which should be an integer between 0 and 100. **You should only use the histogram above to answer the question.** Do not use code on the table to find the answer, just eyeball the bar heights! **(4 Points)**

*Note*: Your answer does not have to be exact, but it should be within a couple percentage points of the exact answer. This is the sort of question you might see on the midterm exam.
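The reasoning behind reading a percentage off a histogram can be sketched as follows, assuming the histogram is drawn on the density scale (percent per unit). The area of each bar is then the percent of values in that bin, so with bins of width 1 a bar's height equals its percent. The costs and the threshold of 4 below are hypothetical and are not taken from the `burritos` table.

```python
import numpy as np

# Hypothetical costs, not taken from the burritos table.
costs = np.array([1.5, 2.5, 2.75, 3.5, 3.25, 3.75, 4.5, 4.25, 5.5, 6.5])
bins = np.arange(0, 8, 1)  # bin edges 0, 1, ..., 7; every bin has width 1

# density=True gives heights whose bar areas sum to 1; multiplying by 100
# converts them to "percent per unit", the vertical scale assumed here.
density, edges = np.histogram(costs, bins=bins, density=True)
heights = 100 * density
widths = np.diff(edges)

# Percent of values below 4 = total area of the bars to the left of 4.
left_of_4 = edges[:-1] < 4
percent_below_4 = np.sum(heights[left_of_4] * widths[left_of_4])
print(percent_below_4)
```

Adding up bar areas matches counting directly: 6 of the 10 toy values are below 4, which is 60 percent.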


In [None]:
burritos_less_than_6 = ...

## 2. San Francisco City Employee Salaries

This exercise is designed to give you practice with using the Table methods `.pivot` and `.group`. Here is a link to the [Python Reference Sheet](http://data8.org/sp22/python-reference.html) in case you need a quick refresher. 

The data source we will use within this portion of the homework is [publicly provided](https://data.sfgov.org/City-Management-and-Ethics/Employee-Compensation/88g8-5mnd/data) by the City of San Francisco. We have filtered it to retain just the relevant columns and restricted the data to the calendar year 2019. Run the following cell to load our data into a table called `full_sf`.

In [None]:
full_sf = Table.read_table("sf2019.csv")
full_sf.show(10)

The table has one row for each of the 44,525 San Francisco government employees in 2019.

The first four columns describe the employee's job. For example, the employee in the third row of the table had a job called "IS Business Analyst-Senior". We will call this the employee's *position* or *job title*. The job was in a Job Family called Information Systems (hence the IS in the job title), and was in the Adult Probation Department that is part of the Public Protection Organization Group of the government. You will mostly be working with the `Job` column.

The next three columns contain the dollar amounts paid to the employee in the calendar year 2019 for salary, overtime, and benefits. Note that an employee’s salary does not include their overtime earnings.

The last column contains the total compensation paid to the employee. It is the sum of the previous three columns:

$$\text{Total Compensation} = \text{Salary} + \text{Overtime} + \text{Benefits}$$

For this homework, we will be using the following columns:
1. `Organization Group`: A group of departments. For example, the **Public Protection** Org. Group includes departments such as the Police, Fire, Adult Probation, District Attorney, etc.
2. `Department`: The primary organizational unit used by the City and County of San Francisco.
3. `Job`: The specific position that a given worker fills.
4. `Total Compensation`: The sum of a worker's salary, overtime, and benefits in 2019.


Run the following cell to select the relevant columns and create a new table named `sf`.

In [None]:
sf = full_sf.select("Job", "Department", "Organization Group",  "Total Compensation")
sf.show(10)

We want to use this table to generate arrays with the job titles of the members of each **Organization Group**.

**Question 1.** Set `job_titles` to a table with two columns. The first column should be called `Organization Group` and have the name of every "Organization Group" once, and the second column should be called `Jobs` with each row in that second column containing an *array* of the names of all the job titles within that "Organization Group". Don't worry if the same job title appears more than once. **(4 Points)**

*Hint:* Think about how `group` works: it collects values into an array and then applies a function to that array. We have defined two functions below for you, and you will need to use one of them in your call to `group`. 
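The collect-then-apply behavior described in the hint can be sketched with NumPy on toy data. The team labels, the points, and the `largest` function are all hypothetical, and this is only a conceptual model of `group`, not the library's code or the answer to this question.

```python
import numpy as np

# Hypothetical toy data, not taken from the sf table.
teams = np.array(["Red", "Blue", "Red", "Blue", "Red"])
points = np.array([3, 5, 8, 1, 4])

def largest(array):
    """Hypothetical collecting function: returns the biggest value."""
    return array.max()

# Step 1: for each distinct label, collect its values into one array.
# Step 2: apply the function to that array to get a single cell value.
grouped = {}
for team in np.unique(teams):
    collected = points[teams == team]
    grouped[str(team)] = largest(collected)

print(grouped)
```

Whatever the function returns becomes the value for that group, so a function that returns a number produces a column of numbers, and a function that returns something else produces a column of that something else.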


In [None]:
# Use ONE of the two functions defined below in your call to group.
def first_item(array):
    '''Returns the first item'''
    return array.item(0)

def full_array(array):
    '''Returns the array that is passed through'''
    return array 

# Make a call to group using ONE of the functions above when you define job_titles
job_titles = ...
job_titles

**Understanding the code you just wrote in 2.1 is important for moving forward with the class! If you made a lucky guess, take some time to look at the code, step by step. Office hours are always a great resource!**

**Question 2.** Set `department_ranges` to a table containing departments as the rows, and the organization groups as the columns. The values in each row should correspond to a total compensation range, where range is defined as the **difference between the highest total compensation and the lowest total compensation** in the department for that organization group. **(8 Points)**

*Hint 1:* First you'll need to define a new function `compensation_range` which takes in an array of compensations and returns the range of compensations in that array.

*Hint 2:* Which table function allows you to specify the rows and columns of a new table? 
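As a conceptual aid, the plain-Python sketch below shows the general idea of a two-way table: one variable supplies the row labels, another supplies the column labels, and a collecting function turns the values in each cell into a single number. The data are hypothetical, and `sum` is used only for illustration; it is not the function this question asks for.

```python
# Hypothetical toy data: (row label, column label, value).
records = [
    ("Oak", "North", 2),
    ("Oak", "North", 7),
    ("Oak", "South", 4),
    ("Elm", "North", 5),
    ("Elm", "South", 1),
    ("Elm", "South", 9),
]

# Collect the values that share a (row label, column label) pair.
cells = {}
for tree, region, value in records:
    cells.setdefault((tree, region), []).append(value)

# Apply one collecting function to each cell; sum is only an illustration.
pivoted = {pair: sum(values) for pair, values in cells.items()}

# Print one line per row label, with one entry per column label.
for tree in ["Elm", "Oak"]:
    print(tree, [pivoted[(tree, region)] for region in ["North", "South"]])
```

Swapping `sum` for a different function changes what each cell reports while leaving the row and column layout the same.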


In [None]:
# Define compensation_range first
def compensation_range(array):
    """Returns the range (difference between max and min) of the given array"""
    ...

# A Small Test
rng = compensation_range(make_array(7, 3, 1, 9, 15, 2))
print("rng should be 14:", rng)

department_ranges = ...
department_ranges.show()

<!-- BEGIN QUESTION -->

**Question 3.** Give an explanation as to why some of the row values are `0` in the `department_ranges` table from the previous question. **(4 Points)**


_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 4.** Write code to find the number of departments appearing in the `sf` table that have an average total compensation of greater than 125,000 dollars; assign this value to the variable `num_over_125k`. **(8 Points)**

*Hint:* The variable names provided are meant to help guide the intermediate steps and general thought process. Feel free to delete them if you'd prefer to start from scratch, but make sure your final answer is assigned to `num_over_125k`.


In [None]:
depts_and_comp_tbl = ...
depts_avg_comp_tbl = ...
num_over_125k = ...
num_over_125k

## 3. Finish Line

Congratulations, you're done with Homework 4!  

1. Make sure you have run all the cells in your notebook in order, so that all images/graphs appear in the output. 
2. Save and Checkpoint your file (Ctrl-S).
3. Download a copy as HTML.
4. Upload your HTML file to the assignment activity on Moodle.