# STA130 HW05
**Alexey Albert**\
CHAT SUMMARY (images used)
In this conversation, you asked me to solve question 5 from an image about simulating a p-value using a 50/50 coin-flipping model. The problem involved testing the null hypothesis that humans have no preference for tilting their heads left or right when kissing, based on observed data where 80 out of 124 couples (64.5%) tilted their heads to the right.

I used Python with pandas and numpy to simulate the p-value by running 10,000 trials of head tilts under the null hypothesis. The resulting p-value was approximately 0.0013, which, according to the table provided, indicates strong evidence against the null hypothesis. You mentioned getting a slightly different p-value (0.0021), and I explained that such variation is normal due to the randomness of the simulations.

https://chatgpt.com/share/67187b90-e9c0-800d-be75-9ea0f1af89d0

https://chatgpt.com/share/67187b6f-bdf0-800d-9ab9-067ad7f40770


## PRE-LECTURE:

**(1).**

To be examined statistically, an idea/hypothesis must have some kind of data that can be used to test it. Without data to evaluate it, a hypothesis is just an idea. 

A good null hypothesis must be falsifiable (the data must be able to provide evidence against it), must be specific enough to be tested using the data, and must be the complement of the alternative hypothesis.

In the context of hypothesis testing, a null hypothesis serves as the status quo that you try to find evidence against using your data, while the alternative hypothesis is the complement of the null hypothesis, representing the cases the null doesn't cover. E.g., if the null hypothesis is "less than 50% of the people in STA130 enjoy Prof. Scott's memes", the alternative hypothesis is "50% or more of the people in STA130 enjoy Prof. Scott's memes".

**(2).**

This means that when we conduct a statistical test, the conclusion we draw refers to $\mu$, the true population parameter, and not $\bar{x}$, the sample statistic. I.e., even though we're using the sample statistic $\bar{x}$ to run the test, the goal is to make inferences about the entire population, which is what $\mu$ describes. This is why the test result refers to the population and not just the sample we analyzed.
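
To make the difference between $\bar{x}$ and $\mu$ concrete, here is a minimal sketch using only the standard library. The population (normal with mean 50 and standard deviation 10), the sample size of 30, and the seed are all made-up assumptions for illustration and do not come from the homework.

```python
import random
import statistics

random.seed(130)  # arbitrary seed so the sketch is reproducible

# Hypothetical population with a known mean, so mu can be checked directly
population = [random.gauss(50, 10) for _ in range(100_000)]
mu = statistics.mean(population)

# Two independent samples give two different sample means
sample_a = random.sample(population, 30)
sample_b = random.sample(population, 30)
x_bar_a = statistics.mean(sample_a)
x_bar_b = statistics.mean(sample_b)

print(f"mu = {mu:.2f}, x_bar_a = {x_bar_a:.2f}, x_bar_b = {x_bar_b:.2f}")
```

The two sample means change from sample to sample while $\mu$ stays fixed, which is why a hypothesis is a statement about $\mu$ rather than about any one $\bar{x}$.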

**(3).**

When calculating a p-value, we assume that the null hypothesis is true because that allows us to determine how likely it is to observe the data (or something more extreme) purely by chance if the null hypothesis were true. By comparing the actual data to this hypothetical world, we can see if the observed results are unusual enough to reject the null hypothesis or if they are consistent with what we'd expect under it.

**(4).**

A small p-value means that data at least as extreme as what we observed would be unlikely if the null hypothesis were correct. So, the smaller the p-value, the harder it is to believe that the null hypothesis accurately reflects reality, and the stronger the evidence for rejecting it. Essentially, a low p-value shows that the data doesn't fit well with what we'd expect under the null hypothesis.
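
As a rough illustration of "more extreme data gives a smaller p-value", the sketch below computes exact two-sided p-values under a 50/50 null for three possible results out of 124 couples. The helper name `two_sided_p_value` and the comparison values 62 and 70 are my own assumptions; only 80 out of 124 comes from the homework.

```python
from math import comb

def two_sided_p_value(observed, n):
    """Exact P(|X - n/2| >= |observed - n/2|) for X ~ Binomial(n, 0.5)."""
    # Work with doubled integer distances to avoid float comparisons
    threshold = abs(2 * observed - n)
    extreme = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= threshold)
    return extreme / 2 ** n

n = 124
p_values = {obs: two_sided_p_value(obs, n) for obs in (62, 70, 80)}
for obs, p in p_values.items():
    print(f"{obs} of {n} to the right: p-value = {p:.4f}")
```

The further the observed count is from the 62 expected under the null, the smaller the p-value gets.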

**(5).**

In [1]:
import numpy as np
import pandas as pd

# Set simulation parameters
n_simulations = 10000  # Number of simulations
n_total = 124  # Total number of couples
p_null = 0.5  # Null hypothesis proportion (50% chance)
n_right_observed = 80  # Observed number of couples who tilt right

# Simulate head tilts under the null hypothesis (50/50 chance)
simulations = np.random.binomial(n_total, p_null, size=n_simulations)

# Calculate the proportion of simulations with results as extreme or more extreme than observed
extreme_count = np.sum(np.abs(simulations - (n_total * p_null)) >= np.abs(n_right_observed - (n_total * p_null)))

# Calculate p-value as the proportion of extreme results
p_value_simulation = extreme_count / n_simulations

p_value_simulation

0.0016

This (p-value of 0.0016) shows strong evidence against the null hypothesis based on the ranges in the chart.
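
The simulated p-value is itself a proportion estimated from 10,000 random draws, so it changes a bit from run to run. A minimal sketch of how much, assuming the usual standard error formula for a proportion (the variable names are mine):

```python
from math import sqrt

n_simulations = 10_000
p_hat = 0.0016  # simulated p-value reported above

# Standard error of a proportion estimated from n_simulations independent trials
mc_standard_error = sqrt(p_hat * (1 - p_hat) / n_simulations)
print(f"Monte Carlo standard error: {mc_standard_error:.5f}")
```

With a standard error of roughly 0.0004, the values 0.0013 and 0.0021 mentioned in the chat summary are both within about two standard errors of 0.0016, so the differences are consistent with simulation noise. Increasing `n_simulations` would shrink this noise.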

**(6).**

A small p-value can't prove the null hypothesis is false—it just means the data doesn’t fit well with it. For Fido, a p-value can't 100% prove innocence or guilt b/c it only shows how weird the data is if we assume he's innocent. Even if the p-value is super low, it doesn't prove he's guilty—it just gives strong evidence. Same with a high p-value: it doesn’t prove innocence, it just means the evidence isn't strong enough to reject it. The p-value can help make a judgement but won't provide an absolute truth/proof.
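
To show why a low p-value is evidence and not proof, here is a sketch that works out how often a true null hypothesis would still produce a p-value below a cutoff. There is no Fido data in this homework, so the 124-flip fair-coin setup from question 5 stands in for "H0 is true", and the 0.05 cutoff and the helper name are assumptions for illustration.

```python
from math import comb

n = 124       # sample size borrowed from question 5
alpha = 0.05  # illustrative cutoff, an assumption for this sketch

def two_sided_p_value(observed, n):
    """Exact P(|X - n/2| >= |observed - n/2|) for X ~ Binomial(n, 0.5)."""
    threshold = abs(2 * observed - n)
    extreme = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= threshold)
    return extreme / 2 ** n

# Total probability, under a true H0, of landing on an outcome with p-value < alpha
false_alarm_rate = sum(
    comb(n, k) / 2 ** n
    for k in range(n + 1)
    if two_sided_p_value(k, n) < alpha
)
print(f"Chance of p < {alpha} even though H0 is true: {false_alarm_rate:.4f}")
```

The rate is small but not zero, so even when the null is true (Fido is innocent), data this unusual still shows up some of the time.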

**(7).**

In [2]:
import pandas as pd

patient_data = pd.DataFrame({
    "PatientID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Age": [45, 34, 29, 52, 37, 41, 33, 48, 26, 39],
    "Gender": ["M", "F", "M", "F", "M", "F", "M", "F", "M", "F"],
    "InitialHealthScore": [84, 78, 83, 81, 81, 80, 79, 85, 76, 83],
    "FinalHealthScore": [86, 86, 80, 86, 84, 86, 86, 82, 83, 84]
})
patient_data

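|   | PatientID | Age | Gender | InitialHealthScore | FinalHealthScore |
|---|---|---|---|---|---|
| 0 | 1 | 45 | M | 84 | 86 |
| 1 | 2 | 34 | F | 78 | 86 |
| 2 | 3 | 29 | M | 83 | 80 |
| 3 | 4 | 52 | F | 81 | 86 |
| 4 | 5 | 37 | M | 81 | 84 |
| 5 | 6 | 41 | F | 80 | 86 |
| 6 | 7 | 33 | M | 79 | 86 |
| 7 | 8 | 48 | F | 85 | 82 |
| 8 | 9 | 26 | M | 76 | 83 |
| 9 | 10 | 39 | F | 83 | 84 |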


In [3]:
# First let's format this data in the manner of last week's HW "Prelecture" video
# from IPython.display import YouTubeVideo
# YouTubeVideo('Xz0x-8-cgaQ', width=800, height=500)  # https://www.youtube.com/watch?v=Xz0x-8-cgaQ

patient_data['HealthScoreChange'] = patient_data.FinalHealthScore-patient_data.InitialHealthScore
# why do we do the subtraction in this order?
patient_data

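|   | PatientID | Age | Gender | InitialHealthScore | FinalHealthScore | HealthScoreChange |
|---|---|---|---|---|---|---|
| 0 | 1 | 45 | M | 84 | 86 | 2 |
| 1 | 2 | 34 | F | 78 | 86 | 8 |
| 2 | 3 | 29 | M | 83 | 80 | -3 |
| 3 | 4 | 52 | F | 81 | 86 | 5 |
| 4 | 5 | 37 | M | 81 | 84 | 3 |
| 5 | 6 | 41 | F | 80 | 86 | 6 |
| 6 | 7 | 33 | M | 79 | 86 | 7 |
| 7 | 8 | 48 | F | 85 | 82 | -3 |
| 8 | 9 | 26 | M | 76 | 83 | 7 |
| 9 | 10 | 39 | F | 83 | 84 | 1 |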


In [4]:
# Do you get the idea here?
# Can you see what's changing in the output below??

print(pd.DataFrame({'HealthScoreChange': patient_data['HealthScoreChange'],
                    '> 0 ?': patient_data['HealthScoreChange']>0}))

random_difference_sign = np.random.choice([-1, 1], size=len(patient_data))
# Apply the random sign to the absolute change once, so both columns agree
random_change = random_difference_sign*patient_data['HealthScoreChange'].abs()
pd.DataFrame({'HealthScoreChange': random_change,
              '> 0 ?': random_change>0})

   HealthScoreChange  > 0 ?
0                  2   True
1                  8   True
2                 -3  False
3                  5   True
4                  3   True
5                  6   True
6                  7   True
7                 -3  False
8                  7   True
9                  1   True


|   | HealthScoreChange | > 0 ? |
|---|---|---|
| 0 | 2 | True |
| 1 | 8 | True |
| 2 | -3 | False |
| 3 | -5 | False |
| 4 | 3 | True |
| 5 | 6 | True |
| 6 | 7 | True |
| 7 | 3 | True |
| 8 | -7 | False |
| 9 | -1 | False |


In [5]:
# And then can you see what's happening here???

np.random.seed(1)  # make simulation reproducible
number_of_simulations = 10000  # experiment with this... what does this do?
n_size = len(patient_data)  # 10
IncreaseProportionSimulations_underH0random = np.zeros(number_of_simulations)

# generate "random improvement" proportions assuming H0 (vaccine has no average effect) is true 
# meaning that the "before and after" differences are positive or negative at "random"
for i in range(number_of_simulations):
    
    # why is this equivalent to the suggested idea above?
    random_improvement = np.random.choice([0,1], size=len(patient_data), replace=True)  # <<< `replace=True` ^^^

    # why is .mean() a proportion? 
    IncreaseProportionSimulations_underH0random[i] = random_improvement.mean()
    # why is this the statistic we're interested in? Hint: next section...

In [6]:
# "as or more extreme" relative to the hypothesized parameter of the statistic!
population_parameter_value_under_H0 = 0.5

observed_statistic = (patient_data.HealthScoreChange>0).mean()
simulated_statistics = IncreaseProportionSimulations_underH0random

SimStats_as_or_more_extreme_than_ObsStat = \
    abs(simulated_statistics - population_parameter_value_under_H0) >= \
    abs(observed_statistic - population_parameter_value_under_H0) 
    
print('''Which simulated statistics are "as or more extreme"
than the observed statistic? (of ''', observed_statistic, ')', sep="")

pd.DataFrame({'(Simulated) Statistic': simulated_statistics,
              '>= '+str(observed_statistic)+" ?": ['>= '+str(observed_statistic)+" ?"]*number_of_simulations, 
              '"as or more extreme"?': SimStats_as_or_more_extreme_than_ObsStat})

Which simulated statistics are "as or more extreme"
than the observed statistic? (of 0.8)


|   | (Simulated) Statistic | >= 0.8 ? | "as or more extreme"? |
|---|---|---|---|
| 0 | 0.7 | >= 0.8 ? | False |
| 1 | 0.4 | >= 0.8 ? | False |
| 2 | 0.3 | >= 0.8 ? | False |
| 3 | 0.5 | >= 0.8 ? | False |
| 4 | 0.8 | >= 0.8 ? | True |
| ... | ... | ... | ... |
| 9995 | 0.4 | >= 0.8 ? | False |
| 9996 | 0.1 | >= 0.8 ? | True |
| 9997 | 0.4 | >= 0.8 ? | False |
| 9998 | 0.6 | >= 0.8 ? | False |


In [7]:
# Calculate the p-value
# How many simulated statistics generated under H0 
# are "as or more extreme" than the observed statistic 
# (relative to the hypothesized population parameter)? 

observed_statistic = (patient_data.HealthScoreChange>0).mean()
simulated_statistics = IncreaseProportionSimulations_underH0random

# Be careful with "as or more extreme" as it's symmetric!
SimStats_as_or_more_extreme_than_ObsStat = \
    abs(simulated_statistics - population_parameter_value_under_H0) >= \
    abs(observed_statistic - population_parameter_value_under_H0)
    
p_value = (SimStats_as_or_more_extreme_than_ObsStat).sum() / number_of_simulations

# Calculate the p-value for a one-sided test (greater than)
p_value_one_sided_gt = (simulated_statistics >= observed_statistic).sum() / number_of_simulations

# Calculate the p-value for a one-sided test (less than)
p_value_one_sided_lt = (simulated_statistics <= observed_statistic).sum() / number_of_simulations

print('p-value using old method: ', p_value)
print('one-sided p value using greater than: ', p_value_one_sided_gt)
print('one-sided p value using less than: ', p_value_one_sided_lt)

p-value using old method:  0.068
one-sided p value using greater than:  0.0565
one-sided p value using less than:  0.9893


The one-sided test only counts extremes in one direction, while the original two-sided test counts deviations from the null hypothesis in both directions. When the observed statistic falls in the direction the alternative hypothesis specifies, the one-sided p-value is smaller because the other tail isn't counted (here 0.0565 for "greater than" vs. 0.068 two-sided). Testing in the opposite direction gives a large p-value instead (0.9893 for "less than"), so the one-sided p-value is only smaller when the data agree with the chosen direction.
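
As a cross-check on the simulation, here is a sketch of the exact values for 8 improvements out of 10 under a 50/50 null, using integer counts instead of proportions. The helper `prob` and the variable names are mine, not part of the provided code.

```python
from math import comb

n, observed_count = 10, 8  # 8 of 10 patients improved

def prob(counts):
    """Probability that Binomial(n, 0.5) lands in the given counts."""
    return sum(comb(n, k) for k in counts) / 2 ** n

# Doubled integer distance from n/2 avoids comparing floats such as 0.3
distance = abs(2 * observed_count - n)
p_two_sided = prob(k for k in range(n + 1) if abs(2 * k - n) >= distance)
p_greater = prob(range(observed_count, n + 1))
p_less = prob(range(0, observed_count + 1))

print(p_two_sided, p_greater, p_less)

# Floating-point caveat: 0.2 is as far from 0.5 as 0.8 is, yet this prints False
print(abs(0.2 - 0.5) >= abs(0.8 - 0.5))
```

The exact two-sided value is double the exact "greater than" value because the null distribution is symmetric. The last line suggests one possible reason the simulated two-sided value (0.068) is not close to double the simulated one-sided value (0.0565): in floating point, `0.8 - 0.5` comes out slightly larger than `abs(0.2 - 0.5)`, so simulated proportions of 0.2 may not have been counted as "as or more extreme". Comparing counts (or using a tolerance) would avoid that.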

**(8).**

In [8]:
# Import necessary libraries
import numpy as np
import pandas as pd
from scipy.stats import binom

# Setting the seed for reproducibility
np.random.seed(42)

**Problem Introduction:**
We have a sample of 80 STA130 students. The goal is to test if their ability to guess the pouring order (tea or milk first) is better than random guessing (p = 0.5). In the experiment, 49 students correctly guessed the pouring order.

**Population:** STA130 students or any general population capable of taking part in a similar experiment.

**Sample:** 80 students from STA130.

**Parameter of interest:** The probability that a student can correctly guess the pouring order.

**Observed test statistic:** 49 correct guesses out of 80.

In [9]:
# Given data
n = 80  # Total number of students (sample size)
observed_correct = 49  # Number of students who guessed correctly
p = 0.5  # Probability under the null hypothesis (random guessing)

**Hypotheses:**

*Formal Null Hypothesis (H0):* p = 0.5 (students are just guessing, with a 50% chance of getting the correct answer).

*Informal H0:* The students have no real ability to tell the difference and are guessing the order randomly.

*Alternative Hypothesis (HA):* p > 0.5 (students are not guessing and have a better-than-random ability to tell the difference).

**Step 1: Calculate the p-value**
We use a binomial test to compute the probability of getting 49 or more correct guesses under H0 (random guessing).


In [10]:
# Calculate the p-value using the binomial distribution
p_value = 1 - binom.cdf(observed_correct - 1, n, p)

# Output the results for interpretation
print(f"P-value: {p_value:.5f}")

P-value: 0.02833


**Explanation of the p-value:** The p-value represents the probability of observing 49 or more correct guesses purely by chance, assuming the students are guessing randomly (H0 is true). If the p-value is less than 0.05, we will reject the null hypothesis and conclude that the students can tell the difference better than random guessing.
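
The same tail probability can be written out directly as $P(X \ge 49) = \sum_{k=49}^{80} \binom{80}{k} 0.5^{80}$. Below is a minimal sketch of that sum using only the standard library, as a sanity check on the `binom.cdf` result (the name `p_value_exact` is mine).

```python
from math import comb

n = 80
observed_correct = 49

# P(X >= 49) for X ~ Binomial(80, 0.5): count favourable outcomes over 2**80
p_value_exact = sum(comb(n, k) for k in range(observed_correct, n + 1)) / 2 ** n
print(f"Exact one-sided p-value: {p_value_exact:.5f}")
```

Since every one of the $2^{80}$ guess patterns is equally likely under H0, the p-value is just the fraction of patterns with 49 or more correct guesses.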

**Step 2:** Provide a formal conclusion based on the p-value

In [11]:
alpha = 0.05  # Significance level

if p_value < alpha:
    print("We reject the null hypothesis.")
    print("Conclusion: There is enough evidence to suggest that the students can distinguish between the tea or milk being poured first better than random guessing.")
else:
    print("We fail to reject the null hypothesis.")
    print("Conclusion: The students' correct responses could be due to random chance.")

We reject the null hypothesis.
Conclusion: There is enough evidence to suggest that the students can distinguish between the tea or milk being poured first better than random guessing.


Note: A confidence interval approach could also strengthen the results, but here we're focusing on hypothesis testing.
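
For reference, one way the confidence interval idea might look is a percentile bootstrap on the 49/80 result. This is only a sketch under my own assumptions (5,000 resamples, a 95% level, and an arbitrary seed), not something the assignment asked for.

```python
import random

random.seed(42)  # arbitrary seed for reproducibility

n = 80
observed_correct = 49
# Rebuild the sample as 1 (correct) / 0 (incorrect) responses
sample = [1] * observed_correct + [0] * (n - observed_correct)

n_bootstrap = 5_000
# Resample with replacement and record the proportion correct each time
boot_props = sorted(
    sum(random.choices(sample, k=n)) / n for _ in range(n_bootstrap)
)

# Percentile interval: cut 2.5% off each end of the sorted bootstrap proportions
lower = boot_props[n_bootstrap * 25 // 1000]
upper = boot_props[n_bootstrap * 975 // 1000 - 1]
print(f"95% bootstrap CI for p: ({lower:.3f}, {upper:.3f})")
```

The thing to look at is where the lower end sits relative to 0.5, since an interval that includes 0.5 would mean random guessing is still a plausible value for p.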

**(9).**

obligatory 'yes'