In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab.ipynb")

# DSC 80 - Lab 03

### Due Date: Monday October 18, 11:59 PM

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding will be done in an accompanying `lab.py` file that is imported into the current notebook.

Labs and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying Python file will be tested (a la DSC 20),
2. The notebook may be graded (if it contains free response questions or asks you to draw plots).

**Note**: Labs will have public tests and private tests. The public "smoke tests" that you will run below and which appear on Gradescope are generally worth no points. After the due date, we will replace these tests with private tests that will determine your grade. This is different from DSC 10, where labs only had public tests!

**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name.
- If you changed something you weren't supposed to, just use git to revert! Ask us if you need help with this, or google around for `git revert`.

**Tips for working in the Notebook**:
- The notebooks serve to present the questions and give you a place to present your results for later review.
- The notebooks for *lab assignments* are not graded (only the `.py` file is).
- Notebooks for *projects* will serve as a final report for the assignment, and contain conclusions and answers to open ended questions that are graded.
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file. You can write code here, but make sure that all of your real work is in the `.py` file.

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional helper functions to solve the lab! 
    - Developing in Python usually involves larger files made up of many short functions.
    - You may write your other functions in an additional `.py` file that you import in `lab.py` (much like we do in the notebook).
- Always document your code!

### Importing code from `lab.py`

* We import our `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab.py` in the notebook.
    - `autoreload` is necessary because Python caches each imported module in memory (in `sys.modules`). Subsequent imports of `lab` in the same kernel reuse the already-loaded module instead of re-reading `lab.py`, so edits to the file would otherwise go unnoticed.
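
The sketch below illustrates why a plain re-import does not pick up edits. It is illustrative only: `toy_mod` is a hypothetical throwaway module written to a temporary directory, not part of the lab. Repeating the `import` statement returns the module that is already loaded, while `importlib.reload` re-executes the edited source, which is roughly what `autoreload` automates before each cell runs.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical throwaway module, used only for this illustration.
tmp_dir = tempfile.mkdtemp()
mod_path = os.path.join(tmp_dir, "toy_mod.py")

with open(mod_path, "w") as f:
    f.write("VALUE = 1\n")

sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()
import toy_mod

first = toy_mod.VALUE

# Edit the source file on disk. The new text has a different length, so
# any cached bytecode is recognised as stale on the next load.
with open(mod_path, "w") as f:
    f.write("VALUE = 1000\n")

import toy_mod  # no effect: the module is already loaded in this session
after_reimport = toy_mod.VALUE

importlib.reload(toy_mod)  # re-executes the edited source
after_reload = toy_mod.VALUE

print(first, after_reimport, after_reload)
```

With `%autoreload 2` enabled, the manual `importlib.reload` call is not needed in the notebook.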

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from lab import *

In [3]:
import os
import io
import pandas as pd
import numpy as np

# Hypothetically speaking...

In this section we'll develop an intuition for the terms and structure of hypothesis testing -- it's nothing to be afraid of!

The first step is always to define what you're looking at, create your hypotheses, and set a level of significance.  Once you've done that, you can find a p-value which is related to your test statistic.

If all of these words are scary: look at the lecture notebook and the textbook references, and don't forget to think about the real-world meaning of these terms! The following example describes a real-world scenario, so you can think about it through an everyday lens.

**Question 1: Faulty tires**

A tire manufacturer claims that their tires are so good, they will bring a Honda CRV from 60 mph to a complete stop in under 106 feet, 97% of the time.

Now, you own a Honda CRV and this exact set of tires, and you decide to test this claim. You take your car to an empty Walmart parking lot, speed up to exactly 60 mph, hit the brakes, and measure your stopping distance. You repeat this 50 times (to the dismay of the other shoppers and the Walmart manager) and find that you stopped in under 106 feet only 47 times.

Livid, you call the tire manufacturer and say that their claim was false. They say, no, that you were just unlucky: your experiment is consistent with their claim. But they didn't realize that they are dealing with a *data scientist*.

To settle the matter, you decide to unleash the power of the hypothesis test.

You will set up a hypothesis test in order to test your suspicion that the tires are actually worse than claimed. Which of the following are valid null and alternative hypotheses for this hypothesis test?

1. The tires will stop your car in under 106 feet exactly 97% of the time.
2. The tires will stop your car in under 106 feet less than 97% of the time.
3. The tires will stop your car in under 106 feet more than 97% of the time.
4. The tires will stop your car in more than 106 feet exactly 3% of the time.
5. The tires will stop your car in more than 106 feet less than 3% of the time.
6. The tires will stop your car in more than 106 feet more than 3% of the time.

Write a function `car_null_hypoth` which takes zero arguments and returns a list of the valid null hypotheses.
Write a function `car_alt_hypoth` which takes zero arguments and returns a list of the valid alternative hypotheses.

Which of the following are valid test statistics for our question?

1. The number of times the car stopped in under 106 feet in 50 attempts.
2. The average number of feet the car took to come to a complete stop in 50 attempts.
3. The number of attempts it took before the car stopped in under 95 feet.
4. The proportion of attempts the car successfully stopped in under 106 feet.

Write a function `car_test_stat` which takes zero arguments and returns a list of valid test statistics.

The p-value is the probability, computed under the assumption that the null hypothesis is true, of observing a test statistic as extreme as or more extreme than the one we actually observed. "Extremeness" is defined by the direction of the alternative hypothesis.
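
As a concrete illustration of "as extreme or more extreme", the sketch below computes an exact p-value for a deliberately unrelated, hypothetical example: a coin that is fair under the null lands heads 6 times in 20 flips, and the alternative is that the coin is biased toward tails. The function names `binom_pmf` and `lower_tail_p_value` and all of the numbers are assumptions made for this example; they are not the answer to the tire question.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def lower_tail_p_value(observed, n, p_null):
    """P(X <= observed) when X ~ Binomial(n, p_null).

    'More extreme' means 'fewer heads' here, because the alternative
    in this toy example says the coin favours tails.
    """
    return sum(binom_pmf(k, n, p_null) for k in range(observed + 1))

p_value = lower_tail_p_value(observed=6, n=20, p_null=0.5)
print(round(p_value, 4))
```

The p-value is then compared with the significance level chosen before looking at the data.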

Why don't we just look at the probability of finding our observed test statistic?

1. Because our observed test statistic isn't extreme.
2. Because our null hypothesis isn't suggesting equality.
3. Because our alternative hypothesis isn't suggesting equality.
4. Because the probability of finding our observed test statistic equals the probability of finding something more extreme.
5. Because if we run more and more trials (where a trial is speeding up the car then stopping), the probability of finding *any* observed test statistic gets closer and closer to zero, so if we did this we would always reject the null with more trials even if the null is true.


Write a function `car_p_value` which takes zero arguments and returns the correct reason.

In [None]:
grader.check("q1")

# Grouping: Google Play Store

The questions below analyze a dataset of Google Play Store apps. The dataset has been preprocessed slightly for your convenience.

Columns:
* `App`: App Name
* `Category`: App Category
* `Rating`: Average App Rating
* `Reviews`: Number of Reviews
* `Size`: Size of App
* `Installs`: Binned Number of Installs
* `Type`: Paid or Free
* `Price`: Price of App
* `Content Rating`: Age group the app is targeted at
* `Last Updated`: Last Updated Date


Link: https://www.kaggle.com/lava18/google-play-store-apps

**Question 2**

First, we'd like to do some basic cleaning to this dataset to better analyze it.
In the function `clean_apps`, which takes the Play Store dataset as input, clean as follows and return the cleaned df:
* Keep `Reviews` as type int.
* Strip all letters from the ends of `Size`, convert all values to kilobytes, and convert the column to type float (Hint: all Sizes end in either M (megabyte) or k (kilobyte); a helper function may be useful here).
* Strip the '+' from the ends of `Installs`, remove the commas, and convert it to type int.
* Since `Type` is binary, change all the 'Free's to 1 and the 'Paid's to 0.
* Strip the dollar sign from `Price` and convert it to the correct numeric data type.
* Strip all but the year (e.g. 2018) from `Last Updated` and convert it to type int.

Please return a *copy* of the original dataframe; don't alter the original.
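
The cleaning steps above follow a common pandas pattern: copy the frame, then chain `.str` methods and a final `astype`. The sketch below shows that pattern on a made-up two-column frame. The column names `downloads` and `cost`, the values, and the function name `clean_toy` are all hypothetical; this is not the lab's data or a reference solution.

```python
import pandas as pd

def clean_toy(df):
    """Return a cleaned copy of df; the caller's frame is left untouched."""
    out = df.copy()
    # Remove a trailing '+' and thousands separators, then cast to int.
    out["downloads"] = (
        out["downloads"]
        .str.rstrip("+")
        .str.replace(",", "", regex=False)
        .astype(int)
    )
    # Remove a leading currency symbol, then cast to float.
    out["cost"] = out["cost"].str.lstrip("$").astype(float)
    return out

toy = pd.DataFrame({"downloads": ["1,000+", "50+"], "cost": ["$4.99", "0"]})
cleaned_toy = clean_toy(toy)
print(cleaned_toy.dtypes)
```

Because `clean_toy` works on `df.copy()`, the original `toy` frame still holds the raw strings afterwards.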

**Question 2 (Continued)**

Now, we can do some basic exploration.

In the function `store_info`, find the following using the **cleaned** dataframe:
* Find the year with the highest median `Installs`, among all years with at least 100 apps.
* Find the `Content Rating` with the highest minimum `Rating`.
* Find the `Category` that has the highest average `Price`.
* Find the `Category` with the lowest average `Rating`, among apps that have at least 1000 reviews.

and return these values in a list.

*Remark:* Note that the last question is asking you to compute the *average of averages* (the 'Rating' column contains the average rating of an app) -- such analyses are prone to occurrences of Simpson's Paradox. Considering apps with at least 1000 reviews helps limit the effect of such [ecological fallacies](https://afraenkel.github.io/practical-data-science/05/understanding-aggregations.html#reversing-aggregations-ecological-fallacies).
* You can assume there are no ties.


In [25]:
# don't change this cell -- it is needed for the tests to work
fp = os.path.join('data', 'googleplaystore.csv')
df = pd.read_csv(fp)
cleaned = clean_apps(df)

info = store_info(cleaned)

In [None]:
grader.check("q2")

### Transforming Apps review count by App category

A reasonable question to ask after cleaning the apps dataset is how popular each app is. One way to measure an app's popularity is to compare its review count with those of the other apps in the same category.

**Question 3**
* Create a function `std_reviews_by_app_cat` that takes in a **cleaned** dataframe and outputs a dataframe with 
    - the same rows as the input,
    - two columns given by `['Category', 'Reviews']`,
    - where the `Reviews` column is *standardized by app category* -- that is, the number of reviews for every app is put into the standard units for the category it belongs to. For a review of standard units, see the [DSC 10 Textbook](https://www.inferentialthinking.com/chapters/15/1/Correlation).
    - *Hint*: use the methods `groupby` and `transform`.
* Lastly, create a function `su_and_spread` that returns a list of two items (hard-code your answers):
    - Consider the following scenario: half of the apps in the category 'FAMILY' receive ratings of 0 stars while the other
    half have ratings of 5 stars. Similarly, the 'MEDICAL' category has half 1-star and half 4-star apps.
    Which app would have a higher rating after standardization: the five-star app in the family category or the four-star app in the
    medical one? Answer with the name of the corresponding category ('FAMILY'/'MEDICAL') or use 'equal' if you think both
    ratings would be the same after standardization. (Don't worry about the uppercase but do be careful with the spelling.)
    - Which category type has the biggest "spread" of review count?
    - Note: When calculating the standard deviation by hand, use the formula with `n` in the denominator. NumPy's `.std()` by default uses that formula, while `pd.Series().std()` by default uses the formula with `n - 1` in the denominator.
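
For reference, the `groupby`/`transform` pattern and the two standard-deviation conventions mentioned above can be sketched on a made-up frame. The columns `team` and `score` and their values are hypothetical, and the sketch is not the lab's solution.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "team": ["a", "a", "a", "b", "b", "b"],
    "score": [1.0, 2.0, 3.0, 10.0, 20.0, 60.0],
})

# Population SD (n in the denominator) versus sample SD (n - 1).
pop_sd = np.std(toy["score"].to_numpy())   # NumPy default: ddof=0
sample_sd = toy["score"].std()             # pandas default: ddof=1

# transform returns one value per input row, aligned to the original index,
# so each score is standardized against its own team's mean and SD.
toy["score_su"] = (
    toy.groupby("team")["score"]
    .transform(lambda s: (s - s.mean()) / s.std(ddof=0))
)
print(toy)
```

After the transform, every group has mean 0 and population SD 1 in standard units, while the frame keeps its original number of rows.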
    

In [46]:
# do not edit this cell -- it is needed for the tests
fp = os.path.join('data', 'googleplaystore.csv')
play = pd.read_csv(fp)
cleaned = clean_apps(play)
reviews_out = std_reviews_by_app_cat(cleaned)

su_and_spread_out = su_and_spread()

In [None]:
grader.check("q3")

### Facebook Friends

**Question 4**

A group of students decided to send out a survey to their Facebook friends. Each student asks 1000 of their friends for their first and last name, the company they currently work at, their job title, their email, and the university they attended. Combine all the data contained in the files `survey*.csv` (within the `responses` folder within the data folder) into a single dataframe. The number of files and the number of rows in each file may vary, so don't hardcode your answers!

Create a function `read_survey` which takes in a directory path (containing files `survey*.csv`), and outputs a dataframe with six columns titled: `first name`, `last name`, `current company`, `job title`, `email`, `university` (in that order). 

*Hint*: You can list the files in a directory using `os.listdir`.

*Remark: You may have to do some cleaning to make this possible!*
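
The general pattern of listing a directory and stacking its CSV files can be sketched as follows. The temporary directory, the file names `part1.csv` and `part2.csv`, and the columns are hypothetical stand-ins; the real survey files may need extra cleaning that this sketch does not attempt.

```python
import os
import tempfile

import pandas as pd

# Build a throwaway directory with two small CSV files (hypothetical data).
toy_dir = tempfile.mkdtemp()
pd.DataFrame({"name": ["Ann"], "city": ["Reno"]}).to_csv(
    os.path.join(toy_dir, "part1.csv"), index=False)
pd.DataFrame({"name": ["Bo", "Cy"], "city": ["Lima", "Oslo"]}).to_csv(
    os.path.join(toy_dir, "part2.csv"), index=False)

# os.listdir returns bare file names in arbitrary order, so sort them and
# join each one back onto the directory path before reading.
paths = [
    os.path.join(toy_dir, name)
    for name in sorted(os.listdir(toy_dir))
    if name.startswith("part") and name.endswith(".csv")
]

# ignore_index=True gives the stacked frame a fresh 0..n-1 index.
combined = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
print(combined)
```

Filtering on the file name keeps unrelated files in the same folder from being read by accident.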

Create a function `com_stats` that takes in a dataframe and returns a (hardcoded) list containing: 
- The number of employees at the company that hired the most employees
- The number of emails that end in ".edu"
- The job title that has the longest name (there are no ties)
- The number of managers (hint: you may want to look through all the job titles to make sure you get all of them!)

In [61]:
# do not edit this cell -- it is needed for the tests
dirname = os.path.join('data', 'responses')
out = read_survey(dirname)
stats_out = com_stats(out)

In [None]:
grader.check("q4")

### Combining Data

**Question 5**

Every week, a professor sends out an extra credit survey asking for students' favorite things (animals, movies, etc). 
- Each student who has completed at least 75% of the surveys receives 5 points of extra credit.
- If at least 90% of the class answers at least one of the questions (ex. favorite animal), *everyone* in the class receives 1 point of extra credit. This overall class extra credit only applies once (ex. If 95% of students answer favorite color and 91% answer favorite animal, the entire class still only receives 1 extra point as a class).

Create a function `combine_surveys` which takes in a directory path (containing files `favorite*.csv`) and combines all of the survey data into one DataFrame, indexed by student ID (a value 1 - 1000).

Create a function `check_credit` which takes in a DataFrame with the combined survey data and outputs a DataFrame of the names of students and how many extra credit points they would receive, indexed by their ID (a value 1 - 1000).

In [78]:
# do not edit this cell -- it is needed for the tests
dirname = os.path.join('data', 'extra-credit-surveys')
out = combine_surveys(dirname)
check_credit_out = check_credit(out)

In [None]:
grader.check("q5")

### Joining pets and owners

**Question 6**

You are analyzing data from a veterinary clinic. The datasets contain several types of information from the clinic, including its customers (pet owners), pets, and available procedures and history. The column names are self-explanatory. These dataframes are provided to you:
-  `owners` stores the customer information, where every `OwnerID` is unique (verify yourself).
-  `pets` stores the pet information. Each pet belongs to a customer in `owners`.
-  `procedure_detail` contains a catalog of procedures that are offered by the clinic.
-  `procedure_history` has procedure records. Each procedure is given to a pet in `pets`.

You want to answer the following questions:

1. What is the most popular Procedure Type for all of the pets we have in our `pets` dataset? Note that some pets are registered but haven't had any procedure performed. Also, some pets that have had procedures done are not registered in `pets`. Create a function `most_popular_procedure` that takes in `pets` and `procedure_history` and returns the name of the most popular Procedure Type as a string.
 
2. What is the name of each customer's pet(s)? Create a function `pet_name_by_owner` that takes in `owners`, `pets` and returns a Series that holds the pet name (as a string) indexed by owner's (first) name. If an owner has multiple pets, the corresponding value should be a list of names as strings.

3. For each city that had owners who had their pets in our procedure history, how much does the city spend in total on procedures? Create a function `total_cost_per_city` that takes in `owners`, `pets`, `procedure_history`, and `procedure_detail` and returns a Series containing the total amount of money that each city has spent on pets' procedures, indexed by `City`. Hint: think of what makes a procedure unique in the context of this dataset.
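
The questions above hinge on how a join treats keys that appear in only one table. The sketch below shows this on two made-up frames. The column names `pet_id`, `name`, and `kind` and all values are hypothetical and are not the clinic's schema.

```python
import pandas as pd

animals = pd.DataFrame({"pet_id": [1, 2, 3], "name": ["Rex", "Mo", "Ivy"]})
visits = pd.DataFrame({
    "pet_id": [1, 1, 4],
    "kind": ["checkup", "vaccine", "checkup"],
})

# An inner join keeps only keys present in both frames: pets 2 and 3
# (no visits) and pet 4 (not registered) all drop out.
inner = animals.merge(visits, on="pet_id", how="inner")

# A left join keeps every registered animal; animals without a visit get
# NaN in the columns that come from `visits`.
left = animals.merge(visits, on="pet_id", how="left")

print(inner)
print(left)
```

Which join type is appropriate depends on whether unmatched rows should count toward the answer.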

In [93]:
# do not edit this cell -- it is needed for the tests
pets_fp = os.path.join('data', 'pets', 'Pets.csv')
procedure_history_fp = os.path.join('data', 'pets', 'ProceduresHistory.csv')
owners_fp = os.path.join('data', 'pets', 'Owners.csv')
procedure_detail_fp = os.path.join('data', 'pets', 'ProceduresDetails.csv')
pets = pd.read_csv(pets_fp)
procedure_history = pd.read_csv(procedure_history_fp)
owners = pd.read_csv(owners_fp)
procedure_detail = pd.read_csv(procedure_detail_fp)

out_01 = most_popular_procedure(pets, procedure_history)
out_02 = pet_name_by_owner(owners, pets)
out_03 = total_cost_per_city(owners, pets, procedure_history, procedure_detail)

In [None]:
grader.check("q6")

### Finish Line

Before submitting your lab, make sure to run the doctests in the terminal with `python -m doctest lab.py`. If all of the tests in the notebook pass, but some fail when uploading to Gradescope, make sure that you've run the doctests in the terminal and they all pass.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()