<a href="https://colab.research.google.com/github/gitmystuff/DTSC5502/blob/main/Module_06-Hypothesis_Testing/Module_6_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 6 Assignment

Your Name

## Getting Started

* Colab - get notebook from gitmystuff
* Save a Copy in Drive
* Remove "Copy of" from the notebook name
* Edit name
* Clean up Colab Notebooks folder
* Submit shared link (view only)

**Objective:** This assignment challenges your understanding of fundamental statistical concepts (expected value, probability, the binomial distribution, least squares, and maximum likelihood estimation) by applying them to problems inspired by their historical development. The exercises use **unique, randomized data sets** and require you to demonstrate the underlying Python logic and derived formulas.

**Instructions:** Complete the following sections in this notebook. For the code problems, provide the Python code that solves the problem and clearly print the final result with appropriate labels. For every section, add a brief written explanation of your method and your interpretation of the result in a Markdown cell directly after the code solution. Any approach is acceptable as long as the solution is correct. Hints are provided, but you are not required to follow them. The data is randomized so that each student's results are unique. Using assistants and study groups is highly recommended; however, this is an individual assignment, not a team assignment.

**CRITICAL:** **Replace the placeholder string `YOUR_NAME` in the seed initialization lines of each code cell with your actual name.** Do not change this seed value after your initial run, as your results must be reproducible.
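As a small illustration of why the seed must stay fixed, the sketch below reseeds NumPy with the same value twice and gets identical draws. The seed values `12345` and `54321` and the helper name `draw_parameters` are arbitrary choices made for this illustration; they are not part of the assignment.

```python
import numpy as np

def draw_parameters(seed):
    """Reseed NumPy's global generator and draw three illustrative integers."""
    np.random.seed(seed)
    return np.random.randint(1, 100, size=3)

first_run = draw_parameters(12345)
second_run = draw_parameters(12345)   # same seed, so the same draws
other_seed = draw_parameters(54321)   # a different seed will generally give different draws

print("Seed 12345, run 1:", first_run)
print("Seed 12345, run 2:", second_run)
print("Seed 54321:       ", other_seed)
print("Runs with the same seed are identical:", np.array_equal(first_run, second_run))
```

Changing the seed string after your first run has the same effect as switching seeds here: the generated parameters, and therefore your answers, would change.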

## 1\. Probability and Expected Value: The Birth of Chance





### Historical Context

The formal study of **probability** has its roots in the correspondence between **Pierre de Fermat** and **Blaise Pascal** in 1654, which was sparked by a problem posed by the gambler **Chevalier de Méré**. The problem concerned the fair division of stakes in a game of chance that is interrupted before completion (known as the **Problem of Points**). This correspondence led to the fundamental concept of **expected value** ($E[X]$): the probability-weighted average of all possible outcomes, which is also the long-run average over a large number of independent trials of an experiment. For a discrete random variable it is calculated as:

$$E[X] = \sum_{i} x_i P(x_i)$$
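As a quick illustration of the formula above, the following sketch computes the expected value of a single roll of a fair six-sided die. The die is an assumed example chosen for simplicity and is unrelated to the assignment data.

```python
from fractions import Fraction

# Outcomes of a fair six-sided die and their probabilities (illustrative example)
outcomes = [1, 2, 3, 4, 5, 6]
probabilities = [Fraction(1, 6)] * 6

# E[X] = sum of x_i * P(x_i)
expected_value = sum(x * p for x, p in zip(outcomes, probabilities))

print(f"E[X] = {expected_value} = {float(expected_value)}")
```

Using `Fraction` keeps the arithmetic exact, so the result is 7/2 (3.5) with no floating-point rounding.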


### Code Problem: The Interrupted Game
Imagine a two-player game where the first player to win $N$ rounds wins the entire stake of **\$100**. The game is interrupted when Player A still needs **$A_{needed}$ wins** and Player B still needs **$B_{needed}$ wins** to win the game.

**Task:**

1.  Calculate the **probability** of each player winning the overall game from this point, assuming each round is independent and has an equal probability (0.5) of being won by either player.
2.  Determine the **expected value** (fair share) of the \$100 stake for each player based on their probability of winning.

In [None]:
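import zlib
import numpy as np
import math

# --- Set unique seed based on your Name ---
# REPLACE THE STRING YOUR_NAME BELOW WITH YOUR NAME
# zlib.crc32 returns the same value in every session; the built-in hash() of a
# string changes whenever the Python runtime restarts, which breaks reproducibility
seed_value = zlib.crc32("YOUR_NAME".encode("utf-8"))
np.random.seed(seed_value)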

# Generate unique requirements for A and B
# Player A needs 1 or 2 more wins (A_needed)
A_needed = np.random.randint(1, 3)
# Player B needs 2 or 3 more wins (B_needed)
B_needed = np.random.randint(2, 4)

total_stake = 100
max_rounds_remaining = A_needed + B_needed - 1

print(f"Game State: Player A needs {A_needed} wins. Player B needs {B_needed} wins.")
print(f"Maximum remaining rounds: {max_rounds_remaining}")



In [None]:
# Hints

# Initialize probability of A winning

# Iterate through all possible rounds 'r' on which A could win (A_needed <= r <= max_rounds_remaining)

# P(B wins) is the complement of P(A wins)

# Expected Value E[X] = stake * P(winning)

# # the following print statements are required
# print(f"Probability Player A wins: {prob_A_win:.4f}")
# print(f"Probability Player B wins: {prob_B_win:.4f}")
# print(f"Expected Value (Fair Share) for Player A: ${expected_value_A:.2f}")
# print(f"Expected Value (Fair Share) for Player B: ${expected_value_B:.2f}")

### Explanation

Explain the combinatorial logic used to calculate the probability of A winning based on the generated $A_{needed}$ and $B_{needed}$ values, and how this leads to the fair division of stakes via the Expected Value.

Explain here:

## 2\. The Binomial Distribution and Probability Mass Function (PMF): Bernoulli Trials and Measuring Chance

### Historical Context

The **Binomial Distribution** is central to discrete probability and was rigorously described by **Jacob Bernoulli** in his posthumously published work *Ars Conjectandi* (The Art of Conjecturing) in 1713. It models the number of successes, $k$, in a fixed number of independent trials, $n$, each with a constant probability of success, $p$. The **Probability Mass Function (PMF)** of the binomial distribution, $P(X=k)$, is given by:

$$P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$$

This framework allowed for the first serious analysis of statistical certainty in contexts far beyond simple games of chance, establishing the concept of **repeated independent trials** (Bernoulli trials), which is foundational to modern sampling theory.
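To make the PMF concrete, here is a minimal sketch that evaluates the formula directly with `math.comb`. The values (10 flips of a fair coin, 3 Heads) and the helper name `binomial_pmf` are assumptions chosen for illustration, not the assignment's parameters.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed directly from the formula."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Illustrative values: 10 flips of a fair coin
n, p = 10, 0.5
prob_three_heads = binomial_pmf(3, n, p)

# Sanity check: the PMF over k = 0..n should sum to 1
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print(f"P(X = 3) = {prob_three_heads:.4f}")
print(f"Sum over all k = {total:.4f}")
```

Here $\binom{10}{3} = 120$ and $0.5^{10} = 1/1024$, so $P(X=3) = 120/1024 \approx 0.1172$.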

### Code Problem: The Flawed Coin Collection

You test a collection of **$N_{sets}$** ancient coins by flipping each coin **$n_1$ times**, because you suspect the coins are weighted.

**Task:**

1.  Assume a coin is weighted with a true probability of Heads of **$p_{weighted}$**.
2.  Calculate the probability of observing exactly **$k_1$ Heads** in a set of **$n_1$ flips** for a single weighted coin.
3.  Calculate the **probability** that at least **$k_2$ of the $N_{sets}$ coins** each show exactly **$k_1$ Heads** in their $n_1$ flips.


In [None]:
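import zlib
from scipy.stats import binom
import numpy as np

# --- Set unique seed based on your Name ---
# REPLACE THE STRING YOUR_NAME BELOW WITH YOUR NAME
# zlib.crc32 returns the same value in every session; the built-in hash() of a
# string changes whenever the Python runtime restarts, which breaks reproducibility
seed_value = zlib.crc32("YOUR_NAME".encode("utf-8"))
np.random.seed(seed_value)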

# Generate unique parameters
N_sets = np.random.randint(8, 13)     # total number of coins/sets (trials for the second step)
n1 = np.random.randint(4, 7)          # trials per coin
k1 = np.random.randint(2, n1)         # target successes (Heads) in n1 flips
p_weighted = np.round(np.random.uniform(0.55, 0.7), 2) # probability of success (Heads)
k2 = np.random.randint(2, 5)          # target minimum successful coins in N_sets

print(f"Parameters: N_sets={N_sets}, n1={n1}, k1={k1}, p_weighted={p_weighted}, k2={k2}")



In [None]:
# Hints

# --- Step 1 & 2: Calculate the probability of k1 Heads in n1 flips for ONE weighted coin ---

# print the probability of k1 Heads

# --- Step 3: Calculate the probability of AT LEAST k2 successful coins (sets of n1 flips) ---

# # the following print statements are required
# print(f"2. The probability of success for the N trials: {p2:.4f}")
# print(f"3. Probability that at least {k2} out of the {N_sets} coins yield {k1} Heads: {prob_at_least_k2:.4f}")

### Explanation
Explain the two-stage Binomial process, detailing how the result from the first PMF calculation ($P(X_1=k_1)$) becomes the probability of success ($p_2$) for the second Binomial distribution, which is used to find the cumulative probability $P(X_2 \ge k_2)$.

Explain here:

## 3\. Least Squares (OLS): The Astronomical Data Revolution and Minimizing Chance




### Historical Context

The **Method of Least Squares** (in its basic form, Ordinary Least Squares, or **OLS**) was independently developed by **Adrien-Marie Legendre** (published 1805) and **Carl Friedrich Gauss** (who claimed to have used it as early as 1795). It was originally applied to problems in **astronomy**, such as estimating the orbit of the dwarf planet Ceres from noisy observations. The core idea is to find the line ($\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$) that minimizes the **Sum of Squared Errors (SSE)** between the observed values $y_i$ and the fitted values $\hat{y}_i$:

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

This provided the first systematic, principled method for fitting a model to data contaminated by measurement error, transitioning statistics from descriptive averages to predictive models.
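The sketch below illustrates what "minimizing the SSE" means by scoring two candidate lines against the same made-up observations. The data, the two candidate lines, and the helper name `sse` are assumptions for illustration only; neither line is claimed to be the least-squares fit.

```python
import numpy as np

def sse(beta_0, beta_1, x, y):
    """Sum of squared errors of the line y_hat = beta_0 + beta_1 * x."""
    y_hat = beta_0 + beta_1 * x
    return float(np.sum((y - y_hat) ** 2))

# Made-up observations, for illustration only
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

sse_a = sse(1.0, 2.0, x, y)   # candidate line A: y_hat = 1 + 2x
sse_b = sse(0.0, 2.5, x, y)   # candidate line B: y_hat = 2.5x

print(f"SSE for line A: {sse_a:.2f}")
print(f"SSE for line B: {sse_b:.2f}")
```

Line A has the smaller SSE (about 0.10 versus 1.90), so it fits these points better than line B. Least squares picks, among all possible lines, the one with the smallest such value.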


### Code Problem: Ceres Orbit Estimation

You are provided with a dataset representing five noisy observations of a planet's angle over time.

**Task:**

1.  Use the **analytical formulas** for the OLS coefficients ($\hat{\beta}_1$ and $\hat{\beta}_0$) to find the best-fit linear model ($\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$):

    $$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$$

    $$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
2.  Calculate the **Sum of Squared Errors (SSE)** for your resulting line.

**Constraint:** You must implement the OLS formulas manually using basic NumPy operations (mean, sum) and **not** use high-level regression functions.

In [None]:
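import zlib
import numpy as np

# --- Set unique seed based on your Name ---
# REPLACE THE STRING BELOW WITH YOUR NAME
# zlib.crc32 returns the same value in every session; the built-in hash() of a
# string changes whenever the Python runtime restarts, which breaks reproducibility
seed_value = zlib.crc32("YOUR_NAME".encode("utf-8"))
np.random.seed(seed_value)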

# Fixed Time (x) values
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Randomize the underlying parameters and noise level
true_slope = np.round(np.random.uniform(3.5, 4.5), 1)
true_intercept = np.round(np.random.uniform(0.0, 0.5), 1)
noise = np.random.normal(loc=0, scale=np.random.uniform(0.2, 0.4), size=len(x)) # Random noise level

# The observed noisy angle data
y = true_intercept + true_slope * x + noise
y = np.round(y, 2) # Round to two decimal places for easier inspection

print("Observed Data:")
print(f"Time (x): {x}")
print(f"Angle (y): {y}")



In [None]:
# Hints

# 1. Calculate OLS Coefficients

# Calculate means

# Calculate the numerator for beta_1 (Covariance term)

# Calculate the denominator for beta_1 (Variance term)

# Calculate Beta_1 (Slope)

# Calculate Beta_0 (Intercept)

# 2. Calculate Sum of Squared Errors (SSE)

# # the following print statements are required
# print(f"OLS Slope (Beta_1): {beta_1:.4f}")
# print(f"OLS Intercept (Beta_0): {beta_0:.4f}")
# print(f"Best-fit Line: y = {beta_0:.4f} + {beta_1:.4f}x")
# print(f"Sum of Squared Errors (SSE): {sse:.4f}")

### Explanation

Explain the OLS methodology and how you derived the coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ from the analytical formulas to minimize the SSE for your specific noisy dataset.

Explain here:


## 4\. Maximum Likelihood Estimation (MLE): Fisher's Revolution and Controlling Chance


### Historical Context

The concept of **Maximum Likelihood Estimation (MLE)** was formally introduced and popularized by **Sir Ronald A. Fisher** in the early 20th century, becoming the dominant approach in statistical inference. MLE seeks the parameter values ($\theta$) of a given probability distribution that **maximize the likelihood function**, $L(\theta | x)$, which is the probability of observing the data ($x$) given the parameters ($\theta$), viewed as a function of $\theta$.

For $N$ independent Binomial observations, where observation $i$ records $k_i$ successes in $n_i$ trials, the **Log-Likelihood** function for the shared parameter $p$ (the probability of success) is:

$$\ln L(p | \text{data}) = \sum_{i=1}^{N} \left[ k_i \ln(p) + (n_i - k_i) \ln(1-p) + \ln\binom{n_i}{k_i} \right]$$

Maximizing this function provides the most plausible value for the underlying parameter based on the observed data.
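As a numerical illustration of the log-likelihood above, the sketch below evaluates it on a grid of candidate values of $p$ and reports the value that scores highest. The counts, the grid resolution, and the helper name `log_likelihood` are assumptions for illustration; a grid search is only a rough check and does not replace the analytical derivation.

```python
import math
import numpy as np

def log_likelihood(p, k, n):
    """Binomial log-likelihood summed over observations (k_i successes in n_i trials)."""
    total = 0.0
    for k_i, n_i in zip(k, n):
        total += (
            k_i * math.log(p)
            + (n_i - k_i) * math.log(1 - p)
            + math.log(math.comb(n_i, k_i))
        )
    return total

# Made-up counts, for illustration only
k = [3, 5]      # successes in each observation
n = [20, 30]    # trials in each observation

# Candidate values of p from 0.001 to 0.999 in steps of 0.001
p_grid = np.linspace(0.001, 0.999, 999)
ll_values = [log_likelihood(p, k, n) for p in p_grid]
p_best = float(p_grid[int(np.argmax(ll_values))])

print(f"Grid value of p with the highest log-likelihood: {p_best:.3f}")
```

The $\ln\binom{n_i}{k_i}$ term does not depend on $p$, so it shifts the curve up or down without moving the location of the maximum.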



### Code Problem: Estimating a Disease Rate

You have observed the number of disease cases, $k_i$, in several populations of size $n_i$ (each individual counts as one trial). The number of cases in each population is assumed to follow a **Binomial Distribution** with a common disease rate $p$.

**Task:**

1.  Derive the **Maximum Likelihood Estimator** ($\hat{p}_{\text{MLE}}$) for $p$ by setting the derivative of the log-likelihood with respect to $p$ to zero and solving for $p$. *State the derived formula in your Markdown explanation.*
2.  Use the derived formula to calculate the maximum likelihood estimate of the disease rate, $p$, based on the provided data.

In [None]:
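import zlib
import numpy as np

# --- Set unique seed based on your Name ---
# REPLACE THE STRING YOUR_NAME BELOW WITH YOUR NAME
# zlib.crc32 returns the same value in every session; the built-in hash() of a
# string changes whenever the Python runtime restarts, which breaks reproducibility
seed_value = zlib.crc32("YOUR_NAME".encode("utf-8"))
np.random.seed(seed_value)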

# Generate unique trial and success counts
num_populations = 4
n_i = np.random.randint(100, 300, size=num_populations) # Number of trials (population sizes)
base_rate = np.random.uniform(0.01, 0.03) # Base rate (e.g., between 1% and 3%)
k_i = np.random.binomial(n=n_i, p=base_rate) # Number of successes (disease cases)

# Ensure the problem is solvable (at least one case observed)
if np.sum(k_i) == 0:
    k_i[np.argmax(n_i)] = 1

print("Observed Data:")
print(f"Trials (n_i): {n_i}")
print(f"Cases (k_i): {k_i}")



In [None]:
# Hints

# The MLE for the Binomial parameter p is:
# p_hat_MLE = (Sum of all successes) / (Sum of all trials)

# # the following print statements are required
# print(f"Total number of successes (cases): {total_successes}")
# print(f"Total number of trials (populations): {total_trials}")
# print(f"Maximum Likelihood Estimate (p_hat_MLE) of the disease rate: {p_mle:.5f}")

### Explanation

State the derived MLE formula for $p$: $\hat{p}_{\text{MLE}} = \frac{\sum k_i}{\sum n_i}$. Then use your generated data to calculate and interpret the estimate, explaining that this value maximizes the likelihood of observing the specific case counts in the given populations.

Explain here:


## 5\. Final Reflection and Submission

### Reflection

Address the following question in 3-5 sentences:

**Question:** How does the historical development from the rules of probability (Pascal/Fermat) to the parameter estimation techniques (Gauss/Fisher) illustrate the transition from describing *randomness* to performing *data-driven inference*?

Answer here:

In [None]:
# REPLACE THE STRING BELOW WITH YOUR NAME TO LOCK THE RESULTS
my_id = "YOUR_NAME"
print(f"Assignment complete for: {my_id}.")
print("Please ensure all Markdown cells contain your explanations and all Code cells have produced the final required output values for your unique dataset.")