# Final: Probability and Statistics for Data Science

*Instructions:*
Please answer the following questions and submit your work
by editing this Jupyter notebook and submitting it on Canvas.
Questions may involve math, programming, or neither,
but you should make sure to *explain your work*:
i.e., you should usually have a cell with at least a few sentences
explaining what you are doing.

Also, please be sure to always specify units of any quantities that have units,
and label axes of plots (again, with units when appropriate).

In [2]:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(123)

# 1. The Challenger


On 28 January 1986 the space shuttle Challenger was destroyed in an
explosion shortly after launch from Cape Kennedy.  The cause of the
explosion was eventually identified as catastrophic failure of the
O-rings on the solid rocket booster.  The failure almost certainly occurred
because the O-ring material was subjected to a lower temperature at
launch (31 degrees F) than was appropriate.  The material and the solid
rocket joints had never been tested at temperatures this low.  Some
O-ring failures had occurred during other shuttle launches (or engine
static tests).  The failure data observed prior to the Challenger
launch are given at http://pages.uoregon.edu/dlevin/DATA/shuttle.csv

 
(a) Fit an appropriate model to the data which allows you to predict the probability of failure at different temperatures.  

(b) Construct a graph of the data and display the fitted model.  Discuss how well the
model fits the data.

(c) What is the estimated failure probability at 50 degrees F? At 75 degrees F? At 31 degrees F?

 
(d) Notice that there is extrapolation involved in obtaining this
estimate.  What influence would that have on your recommendation about
launching the shuttle?
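
One common choice for modeling a failure probability as a function of temperature is logistic regression. The sketch below is only an illustration under stated assumptions: the layout of `shuttle.csv` is not shown here, so the code runs on *synthetic* stand-in data with a 0/1 failure indicator, and the names `fit_logistic` and `predict` are hypothetical helpers, not part of any library. The fit uses Newton-Raphson on the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(123)

# Synthetic stand-in data (NOT the shuttle data): temperatures in degrees F
# and a 0/1 failure indicator drawn from an assumed logistic curve.
temp = rng.uniform(50, 85, size=200)
true_p = 1 / (1 + np.exp(-(10.0 - 0.17 * temp)))
failed = rng.binomial(1, true_p)

def fit_logistic(x, y, n_iter=25):
    """Fit P(y=1 | x) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton-Raphson."""
    x_mean = x.mean()
    # Center x so the two columns are less correlated (numerical stability).
    X = np.column_stack([np.ones(len(x)), x - x_mean])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p)                        # gradient of the log-likelihood
        info = X.T @ (X * (p * (1 - p))[:, None])   # observed information matrix
        beta = beta + np.linalg.solve(info, grad)
    b1 = beta[1]
    b0 = beta[0] - b1 * x_mean                      # undo the centering
    return np.array([b0, b1])

def predict(beta, x):
    """Predicted failure probability at temperature(s) x, in degrees F."""
    x = np.asarray(x, dtype=float)
    return 1 / (1 + np.exp(-(beta[0] + beta[1] * x)))

beta_hat = fit_logistic(temp, failed)
print("intercept, slope:", beta_hat)
print("P(failure) at 55 F and 80 F:", predict(beta_hat, [55.0, 80.0]))
```

Evaluating `predict` at a temperature far below the smallest value in `temp` is extrapolation: the fitted curve still returns a number, but nothing in the data checks that the logistic shape holds there.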

 

 



# 2. The three surveyors

Moe, Larry, and Curly are measuring the heights of the trees in a field.
To do this, Moe climbs the tree and holds a tape measure at the top of the tree,
Larry looks at the position on the tape measure and yells it to Curly,
and Curly writes it down.
However, each of them makes mistakes:
if the true height of the tree is $h$ (in feet),
then the number written down is
$$
    X = h + M + L + C ,
$$
where $M$, $L$, and $C$ are the errors introduced by each;
these are independent and
$$\begin{aligned}
  M &\sim \text{Exponential}(\text{mean}=1.0) \\
  L &\sim \text{Normal}(\text{mean}=0, \text{sd}=2.5) \\
  C &\sim \text{Normal}(\text{mean}=3.0, \text{sd}=2.0) .
\end{aligned}$$

In the following, suppose they have measured a tree with true height $h=20$.
The expected value of $X$ is, by linearity of expectation,
$$ \mathbb{E}[X] = \mathbb{E}[h + M + L + C] = h + \mathbb{E}[M] + \mathbb{E}[L] + \mathbb{E}[C] = 20 + 1 + 0 + 3 = 24 \text{ feet}. $$

*(a)* Write a function to simulate from $X$,
    and use it to make a histogram of the distribution of $X$.
    (*Note: as a check, the mean should be close to 24.*)
    
*(b)* Use simulation to estimate $\text{sd}[X]$.

*(c)* Find $\text{sd}[X]$ using math (i.e., using the probability rules from class
 and properties of the probability distributions above, which you may look up on Wikipedia).
 Make sure to state any facts that you use.
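
A minimal sketch of the simulation approach, assuming the three error distributions exactly as written above. The function name `sim_X` and the sample size are illustrative choices, not requirements. Note that NumPy's `exponential` is parameterized by its mean (`scale`), and `normal` by mean (`loc`) and standard deviation (`scale`).

```python
import numpy as np

rng = np.random.default_rng(123)

def sim_X(n, h=20.0, rng=rng):
    """Simulate n recorded heights X = h + M + L + C, in feet."""
    M = rng.exponential(scale=1.0, size=n)       # Moe: Exponential, mean 1.0
    L = rng.normal(loc=0.0, scale=2.5, size=n)   # Larry: Normal(0, sd 2.5)
    C = rng.normal(loc=3.0, scale=2.0, size=n)   # Curly: Normal(3.0, sd 2.0)
    return h + M + L + C

x = sim_X(100_000)
counts, bin_edges = np.histogram(x, bins=50)     # same bins plt.hist would draw
print(f"simulated mean: {x.mean():.2f} feet")
print(f"simulated sd:   {x.std(ddof=1):.2f} feet")
```

Passing `x` to `plt.hist(x, bins=50)` draws the histogram; label the horizontal axis in feet.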

# 3. Jitter bugs

We are studying how activity levels of springtails (a type of very small arthropod)
depend on ambient temperature.
To do this, we have measured how far twenty *[Sminthurus viridis](https://www.chaosofdelight.org/collembola-springtails)*
traveled along a small tube over the course of 10 minutes, each measured at a different temperature.
Here are the data (temperatures in degrees C, distances in mm):

In [39]:
temperature = np.array([1.4, 2.8, 1.2, 10., 0.4, 6.3, 1.2, 0.6, 0.7, 2.6, 2.4,
                   0.4 , 1. , 2.3, 2.8, 7.4, 2.6, 1.6, 9.4, 11.4])
distance = np.array([12.1, 10.6,  9.5,  7.4,  9.8, 13.3, 11.1, 11.3, 10.3,  9.2,  9.7,
                   9.5,  9.5, 10.8,  7.7,  7.5,  8.6, 11.5,  8.9, 14.1])

Our observation is that although the *average* distance is fairly consistent across
temperatures, the *variability* between springtails differs substantially at different temperatures.
So, we'd like to fit the following model: if $D$ is the distance and $T$ is the temperature,
$$ D \sim \text{Normal}(\text{mean}=a, \text{sd}=\exp(b T) ). $$
(In words: although mean distance does not change with temperature,
the standard deviation does.)

*(a)* Write the log-likelihood function for this model.
    The function should have three arguments: the tuple of parameters, $(a, b)$;
    the array of temperatures; and the array of distances.

*(b)* Estimate the values of $a$ and $b$ from the data
    by minimizing the negative log-likelihood.
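
As a sketch of how this could look, the block below writes the Normal log-density with standard deviation $\exp(bT)$, so that $\log(\text{sd}) = bT$, and passes its negative to `scipy.optimize.minimize`. The function name `loglik`, the starting point, and the choice of the Nelder-Mead method are assumptions made for illustration; other optimizers should work as well.

```python
import numpy as np
from scipy.optimize import minimize

temperature = np.array([1.4, 2.8, 1.2, 10., 0.4, 6.3, 1.2, 0.6, 0.7, 2.6, 2.4,
                        0.4, 1., 2.3, 2.8, 7.4, 2.6, 1.6, 9.4, 11.4])
distance = np.array([12.1, 10.6, 9.5, 7.4, 9.8, 13.3, 11.1, 11.3, 10.3, 9.2, 9.7,
                     9.5, 9.5, 10.8, 7.7, 7.5, 8.6, 11.5, 8.9, 14.1])

def loglik(params, T, D):
    """Log-likelihood of D ~ Normal(mean=a, sd=exp(b * T))."""
    a, b = params
    log_sd = b * T
    z = (D - a) / np.exp(log_sd)                 # standardized residuals
    return np.sum(-0.5 * np.log(2 * np.pi) - log_sd - 0.5 * z**2)

# Start at the sample mean and b = 0 (constant sd of 1 mm).
result = minimize(lambda p: -loglik(p, temperature, distance),
                  x0=(distance.mean(), 0.0), method="Nelder-Mead")
a_hat, b_hat = result.x
print(f"a_hat = {a_hat:.2f} mm, b_hat = {b_hat:.3f} per degree C")
```

It is worth checking `result.success` and trying a second starting point to confirm the optimizer has not stopped early.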
    
*(c)* Use simulation
    to obtain 95% confidence intervals for your two estimates.
    Do this by: 
- creating several new, simulated data sets $D$ from this model using the estimated values of $a$ and $b$, but the same temperatures $T$ found in the data set,
- re-applying the estimation procedure on each simulated data set to get new $a$ and $b$ values,
- and reporting a 95% interval of the resulting estimates.
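
The three steps above amount to a parametric bootstrap, which could be sketched as follows. The helper names `neg_loglik` and `fit`, the number of replicates (500), and the use of percentile intervals are assumptions for illustration; the block repeats the data so that it runs on its own.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(123)

temperature = np.array([1.4, 2.8, 1.2, 10., 0.4, 6.3, 1.2, 0.6, 0.7, 2.6, 2.4,
                        0.4, 1., 2.3, 2.8, 7.4, 2.6, 1.6, 9.4, 11.4])
distance = np.array([12.1, 10.6, 9.5, 7.4, 9.8, 13.3, 11.1, 11.3, 10.3, 9.2, 9.7,
                     9.5, 9.5, 10.8, 7.7, 7.5, 8.6, 11.5, 8.9, 14.1])

def neg_loglik(params, T, D):
    """Negative log-likelihood of D ~ Normal(mean=a, sd=exp(b * T))."""
    a, b = params
    z = (D - a) / np.exp(b * T)
    return np.sum(0.5 * np.log(2 * np.pi) + b * T + 0.5 * z**2)

def fit(T, D):
    """Return the maximum-likelihood estimates (a, b)."""
    res = minimize(neg_loglik, x0=(D.mean(), 0.0), args=(T, D),
                   method="Nelder-Mead")
    return res.x

a_hat, b_hat = fit(temperature, distance)

n_rep = 500                                      # number of simulated data sets
boot = np.empty((n_rep, 2))
for i in range(n_rep):
    # Simulate from the fitted model at the SAME temperatures, then refit.
    D_sim = rng.normal(loc=a_hat, scale=np.exp(b_hat * temperature))
    boot[i] = fit(temperature, D_sim)

a_lo, a_hi = np.percentile(boot[:, 0], [2.5, 97.5])
b_lo, b_hi = np.percentile(boot[:, 1], [2.5, 97.5])
print(f"a: {a_hat:.2f} mm, 95% interval ({a_lo:.2f}, {a_hi:.2f})")
print(f"b: {b_hat:.3f} per degree C, 95% interval ({b_lo:.3f}, {b_hi:.3f})")
```

More replicates give more stable interval endpoints at the cost of run time.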

# 4. Polynomial fitting

Here is a small [data set](http://pages.uoregon.edu/dlevin/DATA/xy2023.txt) with several points (x, y). Model y as a polynomial function of x.

 (a) Plot the data, along with fitted polynomials of degrees 1 through 7.
 
 (b) Using cross-validation, estimate the prediction MSE (mean squared error) for each polynomial degree.
 
 (c) Plot the mean squared error as a function of degree.
 
 (d) Interpret your results in parts (b) and (c) in plain language. What degree polynomial should be the best at predicting $y$ based on the value of $x$ for new data?
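
A minimal sketch of k-fold cross-validation over polynomial degree. The contents of `xy2023.txt` are not shown here, so the code uses *synthetic* data from an assumed quadratic; the helper `cv_mse`, the choice of 5 folds, and the sample size are likewise illustrative. The same folds are reused for every degree so that the comparison is fair.

```python
import numpy as np

rng = np.random.default_rng(123)

# Synthetic stand-in data (NOT the xy2023.txt data): a noisy quadratic.
n = 40
x = rng.uniform(-2, 2, size=n)
y = 1 + 2 * x - 1.5 * x**2 + rng.normal(scale=0.5, size=n)

# Split the shuffled indices into k folds once, shared by all degrees.
k = 5
folds = np.array_split(rng.permutation(n), k)

def cv_mse(x, y, degree, folds):
    """Mean squared prediction error on held-out folds for one degree."""
    sq_err = []
    for test_idx in folds:
        train = np.ones(len(x), dtype=bool)
        train[test_idx] = False                  # hold this fold out
        coef = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coef, x[test_idx])
        sq_err.append((y[test_idx] - pred) ** 2)
    return np.mean(np.concatenate(sq_err))

mse = {d: cv_mse(x, y, d, folds) for d in range(1, 8)}
best = min(mse, key=mse.get)
for d, m in mse.items():
    print(f"degree {d}: CV MSE = {m:.3f}")
print("degree with lowest CV MSE:", best)
```

Training error always decreases as the degree grows, while the cross-validated MSE typically falls and then rises again once the polynomial starts fitting noise.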
