# Homework 05

In this exercise, you will practice inferential statistics with confidence intervals, bootstrapping, and hypothesis testing. Problems may involve a combination of math and code. 

Recall that you can use LaTeX to nicely format your math inside Markdown cells by enclosing equations in single dollar signs (e.g., $x^2+4=8$) for inline math or double dollar signs for centered equations like $$P(X > 5) = \frac{1}{6}.$$ For a reference if you are new to LaTeX, see the [Overleaf documentation for mathematical expressions](https://www.overleaf.com/learn/latex/mathematical_expressions). **Please note that Gradescope currently does not render ```\begin{align} ... \end{align}``` and ```\begin{equation} ... \end{equation}```. Consider using ```\begin{aligned} ... \end{aligned}``` as an alternative to save the graders some hassle.**
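
As a small illustration of the `aligned` suggestion above, a multi-line derivation could be written roughly like this. Treating $X$ as the roll of a fair six-sided die is an assumption made only for this example.

```latex
$$
\begin{aligned}
P(X > 5) &= 1 - P(X \le 5) \\
         &= 1 - \frac{5}{6} \\
         &= \frac{1}{6}
\end{aligned}
$$
```

The `&` marks the alignment point on each line and `\\` ends a line.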

Show your work and/or briefly explain your answers. In general, you will not receive full credit for numeric answers with no accompanying work or justification (math, code, explanation). For numeric answers, we will accept answers that are very slightly off due to rounding, z score of 2 vs. 1.96, etc. 

When you finish, please go to Kernel --> Restart and Run All, and then double check that your notebook looks correct before submitting your .ipynb file (the notebook file) on Gradescope.

In [1]:
# Run this code cell to import relevant libraries
import numpy as np
import pandas as pd
from scipy import stats

### Question 1

1. A website is trying to increase registration for first-time visitors, exposing a random subset of these visitors to a new site design. Of $752$ randomly sampled visitors over a month who saw the new design, $64$ registered. Construct a $95\%$ confidence interval for the percentage of visitors who would register for the website under the new design using the normal distribution. Save your answer in a tuple `q1_1` with two `numpy.float64` items such that `q1_1[0]` is the left bound and `q1_1[1]` is the right bound. Use percentages for both bounds; for example, save $50.0$ in your answer for $50\%$ of visitors.
2. A study examined the average pay for a random sample of men and women entering the workforce as doctors for $21$ different positions. If each gender was equally paid, then we would expect about half of those positions to have men paid more than women and women would be paid more than men in the other half of positions. In the study, men were, on average, paid more in $17$ of the $21$ positions. Complete a hypothesis test using the normal distribution (two-sided or one-sided, just be clear which you are reporting) to examine whether there is significant evidence (at the $0.05$ level) of gender discrimination in pay in these positions. Report your p-value and interpret the result. Save your p-value in `q1_2` as a `numpy.float64` and interpret it in the **Answer 1.2** cell.

In [2]:
# Code for question 1
# Leave these lines here for grading and ease of debugging
n = 752
reg = 64 / 752
new_register = np.sqrt(reg*(1-reg)) / np.sqrt(n)
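# Confidence level is passed positionally: newer SciPy releases renamed the `alpha` keyword to `confidence`
new_register_conf = stats.norm.interval(0.95, loc = reg, scale = new_register)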
q1_1 = (100 * new_register_conf[0], 100 * new_register_conf[1])
print(q1_1)

# Leave these lines here for grading and ease of debugging
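# z statistic for the observed proportion 17/21 under the null p = 0.5
z_score = np.sqrt(21)*((17/21) - 0.5)/(np.sqrt(0.5*(1-0.5)))
# One-sided (upper-tail) p-value from the standard normal distribution
q1_2 = (1-stats.norm.cdf(z_score))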
print(q1_2)

(6.516269200219607, 10.505007395525073)
0.0022781749015925445


### Answer 1.2
Write your interpretation for 1.2 here. There is no need to write anything for 1.1.

The one-sided p-value is approximately 0.00228, which is less than 0.05, so the result is statistically significant. Thus, we reject the null hypothesis that men are paid more in 50% of the positions and women are paid more in the other 50%. There is significant evidence of gender discrimination in pay in these positions.
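
For reference, the calculation behind this p-value can be written out as follows, assuming the one-sided (upper-tail) test that the code cell computes, with $\hat{p} = 17/21$, $p_0 = 0.5$, and $n = 21$:

```latex
$$
z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}
  = \frac{17/21 - 0.5}{\sqrt{0.5 \cdot 0.5 / 21}}
  \approx 2.84,
\qquad
p = 1 - \Phi(z) \approx 0.0023
$$
```

A two-sided version of the same test would double this value to roughly $0.0046$, which is still below $0.05$.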

<!-- END QUESTION -->

## Movie Ratings Data
In the remainder of this assignment you will work with the movielens dataset of movie ratings that we have seen before. Below we import and preview the data. It consists of 2 tables: `users` has a row for every individual who has rated any movies, and `movie_ratings` has a row for every rating of a particular movie by a particular user. This means users with multiple ratings appear in `movie_ratings` multiple times. The data is a random sample of all of the movie ratings made on the movielens service.

In [3]:
users = pd.read_csv("users.csv")
users.head()

Unnamed: 0,user_id,age,sex,occupation
0,1,24,M,technician
1,2,53,F,other
2,3,23,M,writer
3,4,24,M,technician
4,5,33,F,other


In [4]:
movie_ratings = pd.read_csv("movies-all.csv")
movie_ratings.head()

Unnamed: 0,user_id,age,sex,occupation,movie_id,rating,movie_title
0,1,24,M,technician,61,4,Three Colors: White (1994)
1,13,47,M,educator,61,4,Three Colors: White (1994)
2,18,35,F,other,61,4,Three Colors: White (1994)
3,58,27,M,programmer,61,5,Three Colors: White (1994)
4,59,49,M,educator,61,4,Three Colors: White (1994)


### Question 2
1. Compute a $95\%$ confidence interval for the mean `age` of users using the normal distribution. Save your answer in a tuple `q2_1` with two `numpy.float64` items such that `q2_1[0]` is the left bound and `q2_1[1]` is the right bound.
2. Compute a $95\%$ confidence interval for the mean `age` of users who have rated the movie `Casablanca (1942)` using the normal distribution. Save your answer in `q2_2`, similar requirements as above.
3. *Casablanca* is an old movie, so one might suspect that its raters are older on average than the users in the entire dataset. Just looking at the confidence intervals you computed in steps 1 and 2, can you conclude that there is significant evidence for this belief? Why or why not? Put your answer in the **Answer 2.3** cell.

In [5]:
# Code for question 2
# Leave these lines here for grading and ease of debugging
data_age = np.asarray(users["age"])
scale_age = np.std(data_age)/np.sqrt(len(data_age))
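# Confidence level is passed positionally: newer SciPy releases renamed the `alpha` keyword to `confidence`
conf_age = stats.norm.interval(0.95, loc = np.mean(data_age), scale = scale_age)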
q2_1 = (conf_age[0], conf_age[1])
print(q2_1)

movie_ratings["casablanca_rated"] = movie_ratings["movie_title"] == "Casablanca (1942)"
movie_ratings["casablanca_age"] = movie_ratings[movie_ratings["casablanca_rated"]]["age"]
data_casablanca = movie_ratings["casablanca_age"]
data_casablanca = np.asarray(data_casablanca[~np.isnan(data_casablanca)])
scale_casablanca = np.std(data_casablanca)/np.sqrt(len(data_casablanca))
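# Confidence level is passed positionally: newer SciPy releases renamed the `alpha` keyword to `confidence`
conf_cas = stats.norm.interval(0.95, loc = np.mean(data_casablanca), scale = scale_casablanca)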
q2_2 = (conf_cas[0], conf_cas[1])

# Leave these lines here for grading and ease of debugging
print(q2_2)

(33.27417039488504, 34.829753253047095)
(34.46345637868268, 37.330782304444895)


### Answer 2.3
Write your interpretation for 2.3 here. There is no need to write anything for 2.1 & 2.2.

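No, the confidence intervals alone do not give significant evidence for this belief. We can be 95% confident that the mean age of all the movie raters lies between 33.27 and 34.83, and 95% confident that the mean age of all the Casablanca raters lies between 34.46 and 37.33. Both bounds are higher for Casablanca raters, but the two intervals overlap (from 34.46 to 34.83), so there are values of the mean age that are plausible for both groups. Just from these intervals we cannot conclude that Casablanca raters are older on average; a direct test of the difference in means would be needed.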
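
A quick way to compare the two intervals is to check whether they share any values. The sketch below copies the bounds from the printed output of the Question 2 code cell; `intervals_overlap` is a hypothetical helper written for this illustration, not part of the assignment.

```python
# Intervals copied from the printed output of the Question 2 code cell
q2_1 = (33.27417039488504, 34.829753253047095)   # all users
q2_2 = (34.46345637868268, 37.330782304444895)   # Casablanca raters


def intervals_overlap(a, b):
    """Return True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


overlap = intervals_overlap(q2_1, q2_2)
# Range of mean ages that falls inside both intervals
shared = (max(q2_1[0], q2_2[0]), min(q2_1[1], q2_2[1]))
print(overlap, shared)
```

Overlap does not settle significance one way or the other. It only means the intervals by themselves are inconclusive.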

<!-- END QUESTION -->

### Question 3
Only $18$ users have rated the movie `Lost in Space (1998)`.
1. Use bootstrapping with $10000$ bootstrap resamples to compute a $95\%$ confidence interval for the average `age` of users who have rated `Lost in Space (1998)`. Save your answer in a tuple `q3_1` with two `numpy.float64` items such that `q3_1[0]` is the left bound and `q3_1[1]` is the right bound.
2. One of the advantages of bootstrapping is that we can easily compute confidence intervals for arbitrary measurements of distributions. Use bootstrapping with $10000$ bootstrap resamples to compute a $95\%$ confidence interval for the **median** `rating` of `Lost in Space (1998)`. Note that numpy provides a vectorized function for [calculating the median](https://numpy.org/doc/stable/reference/generated/numpy.median.html) as well as the mean. Save your answer in a tuple `q3_2` with two `numpy.float64` items. Similar requirements as above.
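
The percentile bootstrap described above can be sketched on a small made-up sample. The ages below are hypothetical (not taken from the movielens data), and the seeded `np.random.default_rng(0)` generator is just one reasonable choice to make the sketch reproducible.

```python
import numpy as np

# Hypothetical small sample of ages (not taken from the movielens data)
sample = np.array([19, 22, 24, 25, 27, 28, 30, 31, 33, 35, 38, 41, 44, 50])

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible
num_bootstraps = 10000

# Each row is one resample of the same size as the original, drawn with replacement
resamples = rng.choice(sample, size=(num_bootstraps, len(sample)), replace=True)

# Any statistic can be applied row-wise; here the mean and the median
boot_means = np.mean(resamples, axis=1)
boot_medians = np.median(resamples, axis=1)

# Percentile method: take the middle 95% of the bootstrap statistics
ci_mean = (np.percentile(boot_means, 2.5), np.percentile(boot_means, 97.5))
ci_median = (np.percentile(boot_medians, 2.5), np.percentile(boot_medians, 97.5))
print(ci_mean, ci_median)
```

Only the row-wise statistic changes between the two intervals, which is what makes the bootstrap convenient for quantities like the median that have no simple normal-theory formula.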

In [6]:
# Code for question 3
# Leave these lines here for grading and ease of debugging
num_bootstraps = 10000
movie_ratings["lost_rated"] = movie_ratings["movie_title"] == "Lost in Space (1998)"
movie_ratings["lost_age"] = movie_ratings[movie_ratings["lost_rated"]]["age"]
data_lost = movie_ratings["lost_age"]
data_lost = np.asarray(data_lost[~np.isnan(data_lost)])
bootstrap_samples = np.random.choice(data_lost, size = (num_bootstraps, len(data_lost)), replace = True)
bootstrap_means = np.mean(bootstrap_samples, axis = 1)
q3_1 = (np.percentile(bootstrap_means, 2.5), np.percentile(bootstrap_means, 97.5))
print(q3_1)

# Leave these lines here for grading and ease of debugging
movie_ratings["lost_rating"] = movie_ratings[movie_ratings["lost_rated"]]["rating"]
data_lost_rating = movie_ratings["lost_rating"]
data_lost_rating = np.asarray(data_lost_rating[~np.isnan(data_lost_rating)])
bootstrap_samples_rating = np.random.choice(data_lost_rating, size = (num_bootstraps, len(data_lost_rating)), replace = True)
bootstrap_medians = np.median(bootstrap_samples_rating, axis = 1)
q3_2 = (np.percentile(bootstrap_medians, 2.5), np.percentile(bootstrap_medians, 97.5))
print(q3_2)

(26.11111111111111, 36.77777777777778)
(2.5, 4.0)


### Question 4
The `Star Wars (1977)` film is quite popular, with a median rating of `5` out of $5$. However, among users who left a rating, male users gave it a slightly higher average rating of about $4.4$ whereas female users gave the same movie an average rating of about $4.2$.

1. Consider the null hypothesis that the average rating of `Star Wars (1977)` is the same for `sex='F'` and `sex='M'` users. The alternative hypothesis is that the average ratings are not equal. Conduct a two-sided t test using [`stats.ttest_ind`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) to evaluate this using the sample ratings data. Report your p-value and interpret it at a significance level of $0.05$. Save your p-value in `q4_1` as a `numpy.float64` and interpret it in the **Answer 4** cell.

2. Consider the null hypothesis that $51\%$ of men would rate `Star Wars (1977)` a `5`. Conduct a two-sided hypothesis test using [`stats.t.cdf`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html) to evaluate this in light of the sample ratings data of male users who rated `Star Wars (1977)`. Report your p-value and interpret it at a significance level of $0.05$. Save your p-value in `q4_2` as a `numpy.float64` and interpret it in the **Answer 4** cell.

3. Consider the null hypothesis that women and men were equally likely to rate `Star Wars (1977)` a `5`. Conduct a two-sided t test using [`stats.ttest_ind_from_stats`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind_from_stats.html) to evaluate this in light of the sample data of female and male users who rated `Star Wars (1977)`. Report your p-value and interpret it at a significance level of $0.05$. Save your p-value in `q4_3` as a `numpy.float64` and interpret it in the **Answer 4** cell. You should observe a different p-value than in step 2 despite the hypotheses under consideration being ostensibly similar. Briefly explain why you observe this difference.
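
As a rough illustration of how `stats.ttest_ind` and `stats.ttest_ind_from_stats` relate, the sketch below runs both on hypothetical 0/1 indicators (1 meaning "rated the movie a 5"). The arrays are made up for the example. Both calls use the default pooled-variance test, and the summary version is given the sample standard deviation (`ddof=1`).

```python
import numpy as np
from scipy import stats

# Hypothetical 0/1 indicators (1 = rated the movie a 5); not the real data
group_a = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
group_b = np.array([0, 0, 1, 0, 1, 0, 0, 1])

# Test on the raw indicators (pooled variance, the default)
raw = stats.ttest_ind(group_a, group_b)

# Same test from summary statistics; std must be the sample std (ddof=1)
summary = stats.ttest_ind_from_stats(
    mean1=np.mean(group_a), std1=np.std(group_a, ddof=1), nobs1=len(group_a),
    mean2=np.mean(group_b), std2=np.std(group_b, ddof=1), nobs2=len(group_b),
)

print(raw.pvalue, summary.pvalue)
```

If a population-style standard deviation such as `np.sqrt(p*(1-p))` is passed instead, the summary-statistics result differs slightly from the raw-data test, and the gap shrinks as the groups get larger.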

In [7]:
# Code for question 4
# Leave these lines here for grading and ease of debugging
# Filter to Star Wars once, then split by sex (avoids chained boolean indexing on the full frame)
star_wars = movie_ratings[movie_ratings["movie_title"] == "Star Wars (1977)"]
rating_female = star_wars[star_wars["sex"] == "F"]["rating"]
rating_male = star_wars[star_wars["sex"] == "M"]["rating"]

q4_1_ttest = stats.ttest_ind(rating_female, rating_male)
q4_1 = q4_1_ttest[1]
print(q4_1)

# Leave these lines here for grading and ease of debugging
# Male ratings of Star Wars that are exactly 5
rating_male_5 = rating_male[rating_male == 5]
rating_male_5_count = len(rating_male_5)
rating_male_5_percent = rating_male_5_count / len(rating_male)
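# Difference between the observed proportion and the hypothesized 51%
diff = rating_male_5_percent - 0.51
# Standard deviation of a single 0/1 outcome under the null
denom = np.sqrt(0.51 * (1 - 0.51))
test_stat = (diff/denom) * (np.sqrt(len(rating_male)))
# Two-sided p-value; abs() keeps it valid if the statistic is negative
q4_2 = 2*(1-stats.t.cdf(abs(test_stat), len(rating_male) - 1))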
print(q4_2)

# Leave these lines here for grading and ease of debugging
# Female ratings of Star Wars that are exactly 5
rating_female_5 = rating_female[rating_female == 5]
rating_female_5_count = len(rating_female_5)
rating_female_5_percent = rating_female_5_count / len(rating_female)
q4_3_test = stats.ttest_ind_from_stats(mean1 = rating_female_5_percent, std1 = np.sqrt(rating_female_5_percent*(1-rating_female_5_percent)), nobs1=len(rating_female),
                                 mean2 = rating_male_5_percent, std2 = np.sqrt(rating_male_5_percent*(1-rating_male_5_percent)), nobs2 = len(rating_male))
q4_3 = q4_3_test[1]
print(q4_3)

0.06606506021398857
0.008010296971218134
0.1717837459146108




### Answer 4

4.1: The p-value is 0.0661, which is greater than 0.05. It is not statistically significant, so we fail to reject the null hypothesis. Thus we cannot claim that the average rating for Star Wars is significantly different for males and females.

4.2: The p-value is 0.008, which is less than 0.05. It is statistically significant, so we reject the null hypothesis. Thus, there is significant evidence that the proportion of men who would rate Star Wars a 5 is not 51%.

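4.3: The p-value is 0.172, which is greater than 0.05. It is not statistically significant, so we fail to reject the null hypothesis. Thus we cannot claim that women and men were not equally likely to rate Star Wars a 5. The p-value is different from the previous question because the two tests compare different things. Step 2 is a one-sample test that compares the male sample proportion with a fixed value (51%) that has no sampling error. This step is a two-sample test in which both the female and male proportions are estimated from samples, so the difference carries the uncertainty of both samples and its standard error is larger.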
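
One way to see why a one-sample and a two-sample test on similar proportions can give different p-values is to compare their standard errors. The proportions and group sizes below are hypothetical values chosen for illustration, not the actual counts from the data, and the two-sample standard error uses the simple unpooled form rather than the pooled variance that `stats.ttest_ind_from_stats` uses by default.

```python
import numpy as np

# Hypothetical sample proportions and group sizes (assumptions, not the real data)
p_m, n_m = 0.56, 400   # men
p_f, n_f = 0.50, 150   # women
p_0 = 0.51             # fixed value from the one-sample null hypothesis

# One-sample test: only the male sample contributes sampling error
se_one = np.sqrt(p_0 * (1 - p_0) / n_m)

# Two-sample test: both samples contribute (unpooled form, for illustration)
se_two = np.sqrt(p_m * (1 - p_m) / n_m + p_f * (1 - p_f) / n_f)

# Standardized differences for the two comparisons
z_one = (p_m - p_0) / se_one
z_two = (p_m - p_f) / se_two
print(se_one, se_two, z_one, z_two)
```

With these made-up numbers the two-sample standard error is larger, so a similar-sized difference yields a smaller test statistic and therefore a larger p-value.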

<!-- END QUESTION -->

## Submitting

You should make sure any code that you write to answer the questions is included in this notebook. We recommend you go to the Kernel option and choose "Restart & Run All." Double check that your entire notebook runs correctly and generates the expected output. Finally, make sure to save your work (timestamp at the top tells you the last checkpoint and whether there are unsaved changes). When you finish, submit your assignment at [Gradescope](http://gradescope.com/).