Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

# Problem 2: Statistics

In this problem we'll use Python to do some basic statistics.

## Part A: MLE for a Gaussian (2 points)

Recall that with data $X_1,\ldots,X_n$ drawn i.i.d. from a Gaussian distribution with mean $\mu$ and variance $\sigma^2$, the maximum likelihood estimates for $\mu$ and $\sigma^2$ are:

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^n X_i$$
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \hat{\mu})^2$$

Write a function which takes in a list of data and computes the maximum likelihood estimates for the mean and variance.

Your function should return a tuple of (mean, variance).
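
As a quick illustration of the two formulas (not a substitute for writing your own function), the sketch below uses Python's standard `statistics` module on made-up toy data. The point to notice is that the maximum likelihood estimate of the variance divides by $n$: `statistics.pvariance` does the same, while `statistics.variance` divides by $n-1$ and so gives a different number.

```python
import statistics

# Hypothetical toy data, chosen so the arithmetic is easy to check by hand
toy = [1.0, 2.0, 3.0, 4.0]

mu_hat = statistics.fmean(toy)              # (1 + 2 + 3 + 4) / 4
sigma2_mle = statistics.pvariance(toy)      # divides by n, like the MLE
sigma2_unbiased = statistics.variance(toy)  # divides by n - 1, not the MLE

print(mu_hat, sigma2_mle, sigma2_unbiased)
```

Here the squared deviations from the mean sum to 5, so the MLE is 5/4 while the $n-1$ version is 5/3.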

In [None]:
def compute_mean_and_var(data):
    """Compute the mean and variance"""
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
"""Check that means and variances are computed correctly"""

# Don't worry about the use of numpy, it's just for testing
import numpy as np
# set the seed so it's the same across runs
np.random.seed(1234)

# test data
data = list(np.random.randn(1000))
# compute mean and variance and make sure it's right
mean, var = compute_mean_and_var(data)
np.testing.assert_approx_equal(mean, 0.01574, 3)
np.testing.assert_approx_equal(var, 0.9468, 3)

## Part B: MLE for Log-Normal (2 points)
We say that a random variable $X$ is **log-normally distributed** if $\log X$ is normally distributed. If we have data $X_1,\ldots,X_n$ that are log-normally distributed, then we can use $\log X_1, \ldots, \log X_n$ to estimate the parameters of the log-normal distribution with:
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^n \log X_i$$
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (\log X_i - \hat{\mu})^2$$
Write a function which computes the maximum likelihood estimates.

If any of the data are non-positive, then they cannot be log-normally distributed. For this problem we'll treat non-positive numbers as missing, so make sure your function removes all non-positive values (zeros as well as negative numbers). If all values are non-positive, raise a `ValueError`.
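
The filtering step matters because `math.log` itself raises a `ValueError` on zero or negative input. A minimal sketch of the filtering idea, using a hypothetical helper name `log_of_positive` and made-up values (this is an illustration, not the required function):

```python
import math

def log_of_positive(values):
    """Hypothetical helper: logs of the strictly positive entries only."""
    logs = [math.log(x) for x in values if x > 0]
    if not logs:
        # Nothing usable is left, so signal the problem to the caller
        raise ValueError("no positive values in data")
    return logs

# The zero and the negative entry are dropped before taking logs
print(log_of_positive([1.0, math.e, 0.0, -3.0]))
```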

In [None]:
import math

def compute_log_mean_and_var(data):
    """Compute the MLE estimates of the parameters for a log normal"""
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
"""Check that mle parameters are computed correctly"""
# test data
np.random.seed(1234)
data = list(np.exp(np.random.randn(1000))) + [-1]
# compute mean and variance and make sure it's right
mean, var = compute_log_mean_and_var(data)
np.testing.assert_approx_equal(mean, 0.01574, 3)
np.testing.assert_approx_equal(var, 0.9468, 3)

In [None]:
"""Check for correct exception throwing"""
try:
    compute_log_mean_and_var([-1,-4])
except ValueError:
    pass
else:
    raise AssertionError("ValueError not thrown")

## Part C: Confidence intervals and Hypothesis tests (3 points)
For data $X_1,\ldots,X_n$ distributed i.i.d. Normal($\mu$, 1), a 95% confidence interval for the mean is 

$$ \overline{X} \pm 1.96 \cdot \frac{1}{\sqrt{n}}$$

Write a function named `compute_mean_ci` which takes in data as a list of numbers and returns a 95% confidence interval for $\mu$. The function should return the left and right endpoints of the confidence interval.
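
If you are wondering where the constant 1.96 comes from, it is the upper 2.5% point of the standard normal distribution. A small sketch using the standard library's `statistics.NormalDist` (the sample size `n = 100` is just an illustrative assumption):

```python
import math
from statistics import NormalDist

# Upper 2.5% point of the standard normal: 95% of the mass lies within +/- z
z = NormalDist().inv_cdf(0.975)
print(round(z, 2))  # 1.96

# Half-width of the interval for a hypothetical sample of size n = 100
n = 100
half_width = z / math.sqrt(n)
print(round(half_width, 3))  # 0.196
```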

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
"""Check if CIs are correct"""
# test data
np.random.seed(1234)
data = list(np.random.randn(100))
# compute confidence interval
lower, upper = compute_mean_ci(data)
np.testing.assert_approx_equal(lower, -0.1608, 3)
np.testing.assert_approx_equal(upper, 0.2311, 3)

Using your function to compute confidence intervals, write a function which takes in a list of data and performs a hypothesis test at the 5% significance level of the null hypothesis $H_0:\mu=0$.

Your function should return `True` if we reject the null hypothesis and `False` otherwise.
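
The link between the two tasks is that a 5% level test of $H_0:\mu=0$ rejects exactly when 0 falls outside the 95% confidence interval. A minimal sketch of that check, using a hypothetical helper name `interval_excludes` and made-up interval endpoints:

```python
def interval_excludes(lower, upper, value=0.0):
    """Hypothetical helper: True if `value` lies outside [lower, upper]."""
    return not (lower <= value <= upper)

# Illustrative endpoints only, not computed from any data set
print(interval_excludes(-0.16, 0.23))  # False: 0 is inside, so do not reject
print(interval_excludes(0.80, 1.20))   # True: 0 is outside, so reject
```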

In [None]:
def hypothesis_test(data):
    """Test H_0:mu = 0"""
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
"""Check if tests are correct"""
# test data
np.random.seed(1234)
data1 = list(np.random.randn(100))
# run test
assert hypothesis_test(data1) == False
data2 = list(np.random.randn(100) + 1)
assert hypothesis_test(data2) == True


## Part D: Simulated Data (3 points)
For this question you will read a text file with simulated data from a randomized experiment. In this experiment we randomly select a subset of individuals and apply an intervention to them, leaving the remaining population as a control group. We then measure the response of each individual.

Each row of the text file contains two numbers separated by a comma. The first is an indicator for treatment or control: (0 means control, 1 means treated). The second is the value of the response.
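
One possible way to split rows of this form into two groups is sketched below. It reads from an in-memory string with made-up values rather than the real file, and it assumes there is no header row and no blank lines, which you should confirm by looking at the data.

```python
import csv
import io

# Hypothetical file contents in the format described above
sample = io.StringIO("0,1.5\n1,2.25\n0,-0.5\n1,0.75\n")

control, treated = [], []
for indicator, response in csv.reader(sample):
    # First field: 0 = control, 1 = treated; second field: the response
    group = treated if int(indicator) == 1 else control
    group.append(float(response))

print(control, treated)
```

For a real file, the same loop works on the object returned by `open(file_path, newline="")`.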

We model the response for treatment and control as coming from two independent normal distributions, with (possibly) different means and variance equal to 1.

Write a function that takes in the path to this text file and returns, in this order:
- A 95% confidence interval for the difference in mean responses between the treated and control groups (as a tuple of (left endpoint, right endpoint))
- The result of a hypothesis test at the 5% significance level of the null hypothesis that the two means are the same (`True` if we reject the null hypothesis and `False` otherwise)

Statistical note: Remember that the variance of the difference in sample means is not the same as the variance of either sample mean on its own.
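
To spell the note out, write $n_T$ and $n_C$ for the numbers of treated and control individuals and $\overline{X}_T$, $\overline{X}_C$ for the two sample means (this notation is introduced here for illustration). Because the groups are independent and each response has variance 1, the variances of the two sample means add:

$$\mathrm{Var}(\overline{X}_T - \overline{X}_C) = \mathrm{Var}(\overline{X}_T) + \mathrm{Var}(\overline{X}_C) = \frac{1}{n_T} + \frac{1}{n_C}$$

Under these assumptions, the interval from Part C becomes

$$(\overline{X}_T - \overline{X}_C) \pm 1.96 \cdot \sqrt{\frac{1}{n_T} + \frac{1}{n_C}}$$

The order of subtraction (treated minus control) is an assumption here; check it against the expected values in the test cell.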

In [None]:
def read_and_test(file_path):
    """Compute confidence interval of difference in means and run a test"""
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
"""Check if results of experiment are correct"""
# compute ci and make sure it's right
(lower1, upper1), reject1 = read_and_test("/accounts/class/s159/exchange/experiment1.txt")
(lower2, upper2), reject2 = read_and_test("/accounts/class/s159/exchange/experiment2.txt")

np.testing.assert_approx_equal(lower1, -0.5642, 3)
np.testing.assert_approx_equal(upper1, -0.3323, 3)

np.testing.assert_approx_equal(lower2, -0.1894, 3)
np.testing.assert_approx_equal(upper2, 0.0424, 3)

# check if hypothesis test is right
assert reject1 == True
assert reject2 == False