In [None]:
# In colab run this cell first to setup the file structure!
%cd /content
!rm -rf MOL518-Intro-to-Data-Analysis

!git clone https://github.com/benjaminfigueiredo/MOL518-Intro-to-Data-Analysis.git
%cd MOL518-Intro-to-Data-Analysis/Precept_5

# Review session

### Please use the following Google Form to submit answers: https://forms.gle/haCRkk8g75vEhoBRA

### Problem 1: Catching a data thief

You have been tasked by Princeton's academic integrity review board with investigating an accusation of unauthorized use of data. Dr. W and Dr. L are two long-time collaborators in the field of zebra bone studies who recently had a falling out. Dr. L claims that their entire dataset of zebra femur lengths was included in an analysis in Dr. W's recent paper without adequate permission or disclosure in the publication. In fact, the claim is that the fakery was so shoddy that Dr. W simply inserted Dr. L's dataset, in its original order, into their own dataset.

You have been given access to both datasets in the data folder. Each dataset is a CSV table with a header row, reporting femur lengths in mm.
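
As a minimal sketch of how such a file can be read, the snippet below parses a small in-memory CSV with `np.loadtxt`, skipping the header row. The column name `femur_length_mm` and the three values are made up for illustration; the real files may use a different header.

```python
import io
import numpy as np

# Hypothetical CSV content: one header row, then one length (in mm) per line.
csv_text = "femur_length_mm\n412.5\n398.0\n405.25\n"

# skiprows=1 drops the header so that only numeric rows are parsed.
lengths = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)
print(lengths.shape, lengths.dtype)
```

A single-column file loads as a 1-D array, which is the shape the functions in this problem expect.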

Since the datasets at hand are large, you decide to check this claim algorithmically: you will write a function in Python that accepts two numpy arrays as input and checks whether the second is a contiguous subarray of the first.

![](figs/p1.png)
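
To make "contiguous subarray" concrete, here is a small hint (not a full solution) that lists every contiguous window of a given length in a toy array. It uses `sliding_window_view` from `numpy.lib.stride_tricks`, which is only one possible building block; a plain loop over start indices works just as well.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

toy = np.array([1, 2, 3, 4, 5])

# Each row is one contiguous window of length 3, in order of its start index.
windows = sliding_window_view(toy, window_shape=3)
print(windows)
# [[1 2 3]
#  [2 3 4]
#  [3 4 5]]
```

An array of length n has n - k + 1 windows of length k, so a candidate subarray only needs to be compared against that many positions.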

In [None]:
import numpy as np

In [None]:
def has_subarray(arr : np.ndarray, subarr : np.ndarray) -> bool:
    """Check if subarr is a contiguous subarray of arr."""
    # TODO Your code here. Submit this snippet to the form!
    return False

Test your function against a couple of examples! For this, we're going to use the **assert** statement, which throws an error if the boolean expression that follows it does not evaluate to True.
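
The behavior of `assert` can be seen in isolation with a short sketch. The function `check_positive` is a hypothetical helper written only for this illustration.

```python
def check_positive(value):
    """Return value unchanged, but fail loudly if it is not positive."""
    assert value > 0, "value must be positive"
    return value

check_positive(3)  # passes silently

try:
    check_positive(-1)  # the condition is False, so AssertionError is raised
except AssertionError as err:
    message = str(err)
    print(f"AssertionError: {message}")
```

The string after the comma becomes the error message, which is why each test case below carries its own label.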

In [None]:
example1 = np.array([1, 2, 3, 4, 5])
subarray1 = np.array([2, 3, 4])
assert has_subarray(example1, subarray1) == True, "Test case 1 failed"

example2 = np.array([1, 2, 3, 4, 5])
subarray2 = np.array([3, 5])
assert has_subarray(example2, subarray2) == False, "Test case 2 failed"

example3 = np.array([1, 2, 3, 4, 5])
subarray3 = np.array([1, 2, 3, 4, 5])
assert has_subarray(example3, subarray3) == True, "Test case 3 failed"

With your subarray function in hand, you can check whether Dr. L's accusation holds:

In [None]:
dr_W_dataset = np.loadtxt('data/femur_length_dr_W.csv', delimiter=',', skiprows=1)
dr_L_dataset = np.loadtxt('data/femur_length_dr_L.csv', delimiter=',', skiprows=1)
print(f"has_subarray(dr_W_dataset, dr_L_dataset) = {has_subarray(dr_W_dataset, dr_L_dataset)}")

### Problem 2: Probing the statistical effect of data manipulation

Beyond the ethical dimension of your finding above, you have to assess: does the inclusion of Dr. L's data in Dr. W's work alter the conclusions of that work? Consulting with Dr. L, you are told a concerning fact: Dr. L's data was taken entirely from measurements of the corpses of zebras raised in captivity, whereas Dr. W claims the dataset was collected from fieldwork. Since these are drastically different conditions, you are concerned the results claimed in the paper, reproduced below, may not hold.

![](figs/p2.png)

As a first sanity check, you decide to compute whether Dr. L's zoo data is plausibly distinct, in distribution, from Dr. W's actual data.

First, you need to extract Dr. W's actual dataset. For this, you write a function **remove_subarray** that takes two input arrays and removes the first occurrence of the second array from the first if it occurs, otherwise returning the original array.

![](figs/p2_2.png)
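
As a hint (not a full solution), once the start position of a match is known, the matching stretch can be dropped by joining the pieces before and after it. The values of `start` and `length` below are chosen by hand for illustration; finding them is the part left to you.

```python
import numpy as np

toy = np.array([10, 20, 30, 40, 50])
start, length = 1, 3  # hand-picked for this illustration

# Keep everything before the stretch and everything after it.
trimmed = np.concatenate([toy[:start], toy[start + length:]])
print(trimmed)  # [10 50]
```

Slicing never raises on out-of-range bounds, so the same expression also handles a match at the very start or very end of the array.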

In [None]:
def remove_subarray(arr : np.ndarray, subarr : np.ndarray) -> np.ndarray:
    """Remove the first occurrence of subarr from arr."""
    # TODO Your code here. Submit this snippet to the form!
    return arr

Again, we should test whether this function behaves as expected.

In [None]:
assert np.array_equal(remove_subarray(example1, subarray1), np.array([1, 5])), "Test case 1 failed"
assert np.array_equal(remove_subarray(example2, subarray2), example2), "Test case 2 failed"
assert np.array_equal(remove_subarray(example3, subarray3), np.array([])), "Test case 3 failed"

# Does your function remove every occurrence of the subarray, or just the first one?
example4 = np.array([1, 2, 3, 4, 5, 2, 3, 4])
subarray4 = np.array([2, 3, 4])
assert np.array_equal(remove_subarray(example4, subarray4), np.array([1, 5, 2, 3, 4])), "Test case 4 failed"

With this, you can reconstruct the original data Dr. W was presumably working with:

In [None]:
import matplotlib.pyplot as plt

In [None]:
dr_W_original_dataset = remove_subarray(dr_W_dataset, dr_L_dataset)

# Look at histograms of Dr. W's reported dataset, Dr. L's dataset, and the reconstructed dataset (with Dr. L's data removed).
# We create 3 subplots in a single row to display the histograms side by side.
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
plt.hist(dr_W_dataset, bins=20, label="Dr. W's dataset")
plt.title("Dr. W's Reported Dataset")
plt.xlabel("Femur Length (mm)")
plt.ylabel("Frequency")
plt.subplot(1, 3, 2)
plt.hist(dr_L_dataset, bins=20, label="Dr. L's dataset", color='orange')
plt.xlabel("Femur Length (mm)")
plt.title("Dr. L's Dataset")
plt.subplot(1, 3, 3)
plt.hist(dr_W_original_dataset, bins=20, label="Dr. W's Reconstructed Dataset", color='green')
plt.xlabel("Femur Length (mm)")
plt.title("Dr. W's Reconstructed Dataset")
plt.tight_layout()
plt.show()



You recall from class that one basic way to test whether two distributions differ *at the level of their mean* is the **two-sample t-test**.

Suppose we have two samples:

- Sample A: $x_1, \dots, x_{n_A}$
- Sample B: $y_1, \dots, y_{n_B}$

Let

$
\bar{x} = \text{mean of sample A}, \quad
\bar{y} = \text{mean of sample B}
$

and let

$
s_x^2, \; s_y^2
$

be the sample variances.

The test statistic (Welch’s t-test, which does not assume equal variances) is

$
t = \frac{\bar{x} - \bar{y}} {\sqrt{\frac{s_x^2}{n_A} + \frac{s_y^2}{n_B}}}.
$

Intuitively:

- The numerator measures how far apart the sample means are.
- The denominator estimates the **standard error** of that difference.
- So $t$ measures the mean difference in units of its expected variability.

Because we estimate the variances from data, the statistic follows a **t-distribution** with approximately

$
\nu = \frac{\left(\frac{s_x^2}{n_A} + \frac{s_y^2}{n_B}\right)^2}{\frac{(s_x^2/n_A)^2}{n_A - 1}+\frac{(s_y^2/n_B)^2}{n_B - 1}}
$

degrees of freedom (Welch–Satterthwaite approximation).

Finally, the **p-value** is computed as

$ p = P\big(|T_\nu| > |t|\big),$

that is, the probability (under the null hypothesis that the two population means are equal) of observing a t-statistic at least as extreme as the one we obtained.

If this p-value is small, it suggests that the observed difference in means is unlikely to be due to random sampling variation alone. 

Since this seems like a good enough basic metric, you decide to test the null hypothesis that the two samples come from populations with equal means:

$H_0 : \mu_{x^*} = \mu_y$,

where $\mu_{x^*}$ and $\mu_y$ are the population means underlying $x^*$, the reconstructed sample from Dr. W's data, and $y$, Dr. L's sample.
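
One way to sanity-check your own implementation is to compare it against SciPy's built-in Welch test, `scipy.stats.ttest_ind` with `equal_var=False`. The sketch below runs it on synthetic samples; the means, spreads, and sample sizes are arbitrary choices for illustration and have nothing to do with the zebra data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic samples with different means and different variances (illustrative only).
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=100.0, scale=5.0, size=50)
sample_b = rng.normal(loc=103.0, scale=8.0, size=60)

# equal_var=False selects Welch's t-test; the returned p-value is two-sided by default.
result = ttest_ind(sample_a, sample_b, equal_var=False)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3e}")
```

Once your own functions are written, calling them on the same two samples should reproduce these numbers up to floating-point error.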

In [None]:
# We import the t-distribution object from scipy.stats; its methods (such as cdf) let us compute the p-value in our t-test function.
# The actual analytical form of the t-distribution is a bit complicated, so we rely on scipy to compute it for us.
# (If you're curious, you can look up the formula for the t-distribution online!)
# https://en.wikipedia.org/wiki/Student%27s_t-distribution
from scipy.stats import t

In [None]:
def t_statistic(sample1 : np.ndarray, sample2 : np.ndarray) -> tuple[float, float]:
    """Calculate the t-statistic and Welch-Satterthwaite d.o.f. for two samples. Return them as (t, dof)."""
    sample1_mean = np.mean(sample1)
    sample2_mean = np.mean(sample2)
    # We also need to compute the sample variances for the two samples. 
    # Remember to use ddof=1 to get the sample variance, with the denominator n-1 instead of n!
    sample1_var = np.var(sample1, ddof=1)
    sample2_var = np.var(sample2, ddof=1)
    # Now we have to define the t-statistic formula.
    # TODO Your code here. Submit this snippet to the form!
    return 0.0, 0.0

In [None]:
def t_test(arr1 : np.ndarray, arr2 : np.ndarray) -> float:
    """Perform a two-tailed Welch's t-test comparing the means of arr1 and arr2. Return the p-value."""
    # First we need to calculate the t-statistic and degrees of freedom using the function we just defined.
    t_stat, dof = t_statistic(arr1, arr2)
    # Now we compute the p-value using the survival function (1 - CDF) of the t-distribution.
    # t.sf(x) equals 1 - t.cdf(x), but stays accurate when the tail probability is very small.
    # Note that this is a two-tailed test, so we need to multiply the one-tailed p-value by 2.
    p = 2 * t.sf(abs(t_stat), df=dof)
    return p

Having defined the t-test, you go ahead, choose a significance level $\alpha$, and test whether the two datasets have compatible means.

In [None]:
# Define your alpha threshold for statistical significance
alpha = # TODO Your choice here!
# Now we can perform the t-test on our two datasets and check if the p-value is less than our alpha threshold.
p_value = t_test(dr_W_original_dataset, dr_L_dataset)
print(f"P-value: {p_value:.3e}")
if p_value < alpha:
    print("The difference in femur lengths between Dr. W's and Dr. L's datasets is statistically significant.")
else:
    print("The difference in femur lengths between Dr. W's and Dr. L's datasets is NOT statistically significant.")

You also have access to the previously reported values that Dr. W mentions in the paper, so you can test whether the insertion of Dr. L's data changed the result of the statistical analysis.

In [None]:
previous_dataset = np.loadtxt('data/femur_length_previous.csv', delimiter=',', skiprows=1)
# We check if Dr. W's reported dataset is significantly different from the previous study's dataset.
p_value_reported = t_test(dr_W_dataset, previous_dataset)
print(f"P-value comparing Dr. W's reported dataset to the previous study: {p_value_reported:.3e}")
# Now we check if the original dataset (with the subarray removed) is significantly different from the previous study's dataset.
p_value_original = t_test(dr_W_original_dataset, previous_dataset)
print(f"P-value comparing Dr. W's original dataset to the previous study: {p_value_original:.3e}")

# You decide to overlay the histograms of the previous study's dataset and Dr. W's original dataset
plt.figure(figsize=(10, 5))
plt.hist(previous_dataset, bins=20, alpha=0.5, label='Previous Study', color='blue', density=True)
plt.hist(dr_W_original_dataset, bins=20, alpha=0.5, label="Dr. W's Original Dataset", color='green', density=True)
plt.xlabel("Femur Length (mm)")
plt.ylabel("Probability Density")
plt.title("Comparison of Previous Study and Dr. W's Original Dataset")
plt.legend()
plt.show()

Following your report to the Office of the Dean of Faculty, Dr. W was taken into Princeton custody, never to be seen again.

### Problem 3: Linear regression and homoscedasticity

You are given access to a dataset of students' test scores as a function of the number of hours studied before the exam.

In [None]:
dataset = np.loadtxt('data/test_scores.csv', delimiter=',', skiprows=1)
hours = dataset[:, 0]
scores = dataset[:, 1]
plt.scatter(hours, scores)
plt.xlabel("Hours Studied")
plt.ylabel("Test Scores")
plt.title("Test Scores vs. Hours Studied")
plt.show()

What does it look like? Evaluate the features of this dataset and propose a functional form for the relationship:

$S = f(H) + \epsilon,$

where $S$ is the score, $H$ is the number of hours studied, $\epsilon$ is noise and $f$ is a function.

In [None]:
def my_model(x, parameter_1, parameter_2):
    """A simple model to predict test scores based on hours studied."""
    # TODO Your code here. Submit this snippet to the form!
    return

As you learned in class, you can fit data to your model using **scipy.optimize.curve_fit**:

In [None]:
from scipy.optimize import curve_fit

# Fit the model to the data using curve_fit
parameter_guess = [1,1] # TODO give a reasonable guess for the parameters
popt, pcov = curve_fit(my_model, hours, scores, p0=parameter_guess)

# Unpack the optimal parameters from popt and plot the fitted curve on top of the original data
a_opt, b_opt = popt
x_fit = np.linspace(0, 10, 100)
y_fit = my_model(x_fit, a_opt, b_opt)
plt.scatter(hours, scores, label='Data')
plt.plot(x_fit, y_fit, color='red', label='Fitted Curve')
plt.xlabel("Hours Studied")
plt.ylabel("Test Scores")
plt.title("Test Scores vs. Hours Studied with Fitted Curve")
plt.legend()
plt.show()


You are initially satisfied with the result, but then you remember one of the assumptions needed in linear regression: homoscedasticity, or constant variance of the noise. Does this hold in your dataset? One way to check is to plot the residuals from the regression above!

In [None]:
# Plotting the residuals of the fitted model to check for patterns
y_pred = my_model(hours, a_opt, b_opt)
residuals = scores - y_pred
plt.scatter(hours, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("Hours Studied")
plt.ylabel("Residuals")
plt.title("Residuals of the Fitted Model")
plt.show()

What features of the residuals do you notice in this plot? Is our data homoscedastic?

Heteroscedasticity is a common problem in practice. It does not bias the estimates of our parameters, but it undermines our ability to write down confidence intervals around them. The solution is quite intuitive: points where we expect the residuals to be larger should be less informative for the fit, so we replace the least squares algorithm with one that weights points inversely to their variance.

Ordinary least squares (OLS) fits a model by minimizing

$$
\sum_i (y_i - f(x_i))^2.
$$

Weighted least squares (WLS) instead minimizes

$$
\sum_i w_i (y_i - f(x_i))^2,
$$

where the weights are typically chosen as

$$
w_i = \frac{1}{\sigma_i^2}.
$$

Intuitively:

- Points with large variance (less reliable) get **smaller weights**.
- Points with small variance (more reliable) get **larger weights**.

WLS adjusts the regression so that noisier observations influence the fit less, leading to more accurate uncertainty estimates.
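
The difference between the two objectives can be sketched on synthetic data. Everything below is an assumption made for illustration: the straight-line model `line`, the true parameters, and noise whose standard deviation grows linearly with x. None of it is taken from the test-score dataset. Note that `curve_fit` takes standard deviations through `sigma` and minimizes the sum of squared residuals divided by `sigma**2`, which matches the weights defined above.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, slope, intercept):
    """Hypothetical straight-line model used only for this demonstration."""
    return slope * x + intercept

rng = np.random.default_rng(0)
x_demo = np.linspace(1, 10, 200)
sigma_demo = 0.5 * x_demo  # assumed noise level: grows with x
y_demo = line(x_demo, 2.0, 1.0) + rng.normal(0.0, sigma_demo)

# OLS: every point counts equally.
popt_ols, pcov_ols = curve_fit(line, x_demo, y_demo)

# WLS: points with larger sigma are down-weighted by 1 / sigma**2.
popt_wls, pcov_wls = curve_fit(line, x_demo, y_demo,
                               sigma=sigma_demo, absolute_sigma=True)

# The square roots of the covariance diagonal are the parameter standard errors.
print("OLS:", popt_ols, np.sqrt(np.diag(pcov_ols)))
print("WLS:", popt_wls, np.sqrt(np.diag(pcov_wls)))
```

Here the noise level is known by construction; with real data it has to be estimated.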

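In Python, this is done through the **sigma** parameter of curve_fit, which takes the standard deviations $\sigma_i$ rather than the weights themselves. Estimating each $\sigma_i$ directly from the data, however, is not robust to noise and would require binning the data, which means choosing a discretization! One way to get around this is to build a model for the variance of the residuals.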

In [None]:
# We look at the squared residuals, which serve as noisy pointwise estimates of the error variance.
plt.scatter(hours, residuals**2)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("Hours Studied")
plt.ylabel("Residual squared")
plt.title("Squared Residuals of the Fitted Model")
plt.show()

Looking at the residuals, you propose a model for the variance:

In [None]:
def my_var_model(x, parameter_1, parameter_2):
    """A model to predict the variance of the errors based on hours studied."""
    # TODO Your code here. Submit this snippet to the form!
    return

Now we can fit the model to our variance!

In [None]:
var_p0 = [1, 1] # TODO give a reasonable guess for the two parameters of the variance model
var_popt, var_pcov = curve_fit(my_var_model, hours, residuals**2, p0=var_p0)

# Plot the fitted variance model on top of the squared residuals to see how well it captures the pattern of heteroscedasticity.
c_opt, d_opt = var_popt
y_var_fit = my_var_model(x_fit, c_opt, d_opt)
plt.scatter(hours, residuals**2, label='Squared Residuals')
plt.plot(x_fit, y_var_fit, color='red', label='Fitted Variance Model')
plt.xlabel("Hours Studied")
plt.ylabel("Residuals Squared")
plt.title("Squared Residuals and Fitted Variance Model")
plt.legend()
plt.show()

With this result in hand, you can now estimate the variance of the fitted parameters, taking the heteroscedasticity into account.

In [None]:
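# curve_fit expects standard deviations, so we take the square root of the modeled variance.
sigma_model = np.sqrt(my_var_model(hours, c_opt, d_opt))
# absolute_sigma=True treats sigma_model as actual noise levels when computing pcov_weighted.
popt_weighted, pcov_weighted = curve_fit(my_model, hours, scores, sigma=sigma_model, absolute_sigma=True, p0=parameter_guess)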