<h2> Exploring confidence intervals </h2>

This week, we're going to experiment a bit with confidence intervals and generating them from data. One of the most subtle things about confidence intervals is that they do *not* represent the probability that a parameter $\theta$ is in a particular interval $(\ell, u)$: once the interval is computed, the parameter either is in it or it isn't. What *is* true is that if we generate a large number of confidence intervals at level $\gamma$, then about a fraction $\gamma$ of them should contain the parameter.

Let's demonstrate this with our standard normal. We'll do the following as a single trial:
* Generate $30$ normally distributed numbers from an $N(0, 1)$ distribution using `random.gauss(0, 1)`.
* Compute the mean of these $30$ data points and then the corresponding $95\%$ confidence interval $(\overline{x}_{30} \pm 1.96 / \sqrt{30})$.
* Count this as a success if $0$ is in the confidence interval, because $0$ is the true mean.

We'll then carry out $100,000$ trials of this and see how close we came:

In [2]:
import random
from math import sqrt

def trial():
    # Generate 30 random data points
    data = []
    for _ in range(30):
        data.append(random.gauss(0, 1))

    # Compute the sample mean
    mean = sum(data) / 30

    # Check if 0 is in the confidence interval
    # Return 1 if true, 0 if false
    w = 1.960 / sqrt(30)
    if mean - w < 0 < mean + w:
        return 1
    else:
        return 0

# Run this 100K times and count the successes
count = 0
for _ in range(100000):
    count += trial()

print(count)

94878


On my first run of this, I got $94,878$ successful confidence intervals out of $100,000$ trials. This is extremely close to the expected $95\%$!
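
To put "extremely close" on a scale, here is a minimal sketch that treats the $100,000$ trials as independent coin flips with success probability $0.95$ and measures the gap in binomial standard errors. The variable names are illustrative, not part of the original lab.

```python
from math import sqrt

n_trials = 100_000
target = 0.95                     # nominal confidence level
observed = 94_878 / n_trials      # success fraction reported above

# Standard error of a binomial proportion with success probability `target`
se = sqrt(target * (1 - target) / n_trials)

# Gap between observed and nominal coverage, in standard errors
z = (observed - target) / se

print(f"standard error: {se:.5f}")
print(f"gap in standard errors: {z:.2f}")
```

A gap of a couple of standard errors or less is what ordinary sampling noise would produce, so a rerun will usually give a slightly different count.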

<h3> Questions </h3>

* Let's take the mean of $10$ data points. Construct the $90\%$ confidence interval and experimentally verify that it's correct.
* Returning to a mean of $10$ data points: replace the $95\%$ confidence interval $(\overline{x}_{10} \pm 1.96 \cdot 1 / \sqrt{10})$ with $(\overline{x}_{10} \pm 1.96 \cdot S_{10} / \sqrt{10})$ where $S_{10}$ is the sample standard deviation. Estimate the corresponding confidence level; is it higher or lower than $95\%$? Does this match your expectation?
* Adapting your code from the previous part, estimate a value of $t$ so that $(\overline{x}_{10} \pm t \cdot S_{10} / \sqrt{10})$ is a $98\%$ confidence interval for the mean. (What you've estimated is the critical value $t_{9, 0.01}$ from Table B.2!)


To get you started, some code to generate the sample standard deviation is below:

In [19]:
def sample_sd(data):
    # Return the sample standard deviation of the data
    # passed as a list / array
    N = len(data)
    mean = sum(data) / N

    return sqrt(sum([(d - mean)**2 for d in data]) / (N - 1))
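
As a quick sanity check, the helper can be compared against `statistics.stdev` from the standard library, which also divides by $N - 1$. This is only a sketch: the helper is repeated so the block runs on its own, and the `example` list is made-up illustration data.

```python
import statistics
from math import sqrt

def sample_sd(data):
    # Same formula as the helper above, repeated so this block is self-contained
    N = len(data)
    mean = sum(data) / N
    return sqrt(sum((d - mean) ** 2 for d in data) / (N - 1))

# Made-up data, used only to compare the two implementations
example = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(sample_sd(example), statistics.stdev(example))
```

The two printed values should agree up to floating-point rounding.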

In [None]:
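# Answers to the questions:
# 1: See the code below; the simulated coverage is about 90%, as expected.
# 2: It is lower than 95% (about 92%). Yes, this matches my expectation when the sample standard deviation replaces the known one.
# 3: See the code below.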

In [1]:
# 90% confidence level, version 1
import random
from math import sqrt

def trial():
    data = []
    for _ in range(10):
        data.append(random.gauss(0, 1))

    mean = sum(data) / 10

    sample_std = sample_sd(data)

    t = 1.833  # Critical t-value for a 90% confidence interval with 9 degrees of freedom (N-1)

    # Construct the confidence interval
    lower_bound = mean - t * sample_std / sqrt(10)
    upper_bound = mean + t * sample_std / sqrt(10)

    # Check if 0 is in the confidence interval
    return 1 if lower_bound < 0 < upper_bound else 0

def sample_sd(data):
    N = len(data)
    mean = sum(data) / N
    return sqrt(sum([(d - mean)**2 for d in data]) / (N - 1))

count = 0
for _ in range(100000):
    count += trial()

print(count)


90026
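
The cell above builds the $90\%$ interval from the sample standard deviation and a $t$ critical value. Since the data come from $N(0, 1)$ with known $\sigma = 1$, the first question can also be read as asking for the known-variance interval $(\overline{x}_{10} \pm 1.645 / \sqrt{10})$. A minimal sketch of that variant follows; the names `Z_90` and `trial_known_sigma` and the fixed seed are assumptions made for illustration.

```python
import random
from math import sqrt

Z_90 = 1.645  # standard normal critical value for a two-sided 90% interval

def trial_known_sigma(n=10):
    # Draw n points from N(0, 1) and build the known-sigma interval
    data = [random.gauss(0, 1) for _ in range(n)]
    mean = sum(data) / n
    w = Z_90 * 1 / sqrt(n)  # sigma = 1 is known here
    # Success if the true mean 0 lies inside the interval
    return 1 if mean - w < 0 < mean + w else 0

random.seed(0)  # hypothetical seed, only to make the run repeatable
n_trials = 100_000
count = sum(trial_known_sigma() for _ in range(n_trials))
print(count / n_trials)
```

The printed fraction should sit close to $0.90$, in the same way the $t$-based version does.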


In [2]:
# 90% confidence level, version 2
import random
from math import sqrt

def trial():
    data = []
    for _ in range(10):
        data.append(random.gauss(0, 1))

    mean = sum(data) / 10

    sample_std = sample_sd(data)

    t = 1.833  # Critical t-value for a 90% confidence interval with 9 degrees of freedom (N-1)

    # Construct the confidence interval
    lower_bound = mean - t * sample_std / sqrt(10)
    upper_bound = mean + t * sample_std / sqrt(10)

    # Check if 0 is in the confidence interval
    return 1 if lower_bound < 0 < upper_bound else 0

def sample_sd(data):
    N = len(data)
    mean = sum(data) / N
    return sqrt(sum([(d - mean)**2 for d in data]) / (N - 1))

count = 0
for _ in range(100000):
    count += trial()

print(count)


90035


In [2]:
# 95% confidence level, version 1
import random
from math import sqrt

def trial():
    data = []
    for _ in range(10):
        data.append(random.gauss(0, 1))

    mean = sum(data) / 10

    sample_std = sample_sd(data)

    t = 1.960  # Normal critical value for 95% (the t critical value with 9 degrees of freedom would be 2.262)

    # Construct the confidence interval
    lower_bound = mean - t * sample_std / sqrt(10)
    upper_bound = mean + t * sample_std / sqrt(10)

    # Check if 0 is in the confidence interval
    return 1 if lower_bound < 0 < upper_bound else 0

def sample_sd(data):
    N = len(data)
    mean = sum(data) / N
    return sqrt(sum([(d - mean)**2 for d in data]) / (N - 1))

count = 0
for _ in range(100000):
    count += trial()

print(count)


91945


In [3]:
# 95% confidence level, version 2
import random
from math import sqrt

def trial():
    data = []
    for _ in range(10):
        data.append(random.gauss(0, 1))

    mean = sum(data) / 10

    sample_std = sample_sd(data)

    t = 1.960  # Normal critical value for 95% (the t critical value with 9 degrees of freedom would be 2.262)

    # Construct the confidence interval
    lower_bound = mean - t * sample_std / sqrt(10)
    upper_bound = mean + t * sample_std / sqrt(10)

    # Check if 0 is in the confidence interval
    return 1 if lower_bound < 0 < upper_bound else 0

def sample_sd(data):
    N = len(data)
    mean = sum(data) / N
    return sqrt(sum([(d - mean)**2 for d in data]) / (N - 1))

count = 0
for _ in range(100000):
    count += trial()

print(count)


91719


In [7]:
import random
from scipy import stats
from math import sqrt

def trial():
    data = []
    for _ in range(10):
        data.append(random.gauss(0, 1))

    mean = sum(data) / 10

    sample_std = sample_sd(data)

    # Calculate the confidence interval based on the sample standard deviation
    lower_bound = mean - 1.96 * sample_std / sqrt(10)
    upper_bound = mean + 1.96 * sample_std / sqrt(10)

    # Check if 0 is in the confidence interval
    return 1 if lower_bound < 0 < upper_bound else 0

def sample_sd(data):
    N = len(data)
    mean = sum(data) / N
    return sqrt(sum([(d - mean)**2 for d in data]) / (N - 1))

count = 0
for _ in range(100000):
    count += trial()

# Calculate the estimated confidence level
estimated_confidence_level = count / 100000
print(f"Estimated Confidence Level: {estimated_confidence_level * 100}%")

# Check if the estimated confidence level is higher or lower than 95%
if estimated_confidence_level > 0.95:
    print("The estimated confidence level is higher than 95%.")
elif estimated_confidence_level < 0.95:
    print("The estimated confidence level is lower than 95%.")
else:
    print("The estimated confidence level is approximately equal to 95%.")


Estimated Confidence Level: 91.79%
The estimated confidence level is lower than 95%.
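
The drop below $95\%$ can also be checked without simulation. For normal data, $(\overline{x}_{10} - \mu) / (S_{10} / \sqrt{10})$ follows a $t$ distribution with $9$ degrees of freedom, so the coverage of $\pm 1.96$ is the area under that density between $-1.96$ and $1.96$. The sketch below integrates the density numerically with Simpson's rule; the function names and the step count are choices made here, not part of the original lab.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    # Density of Student's t distribution with `df` degrees of freedom
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def coverage(t_crit, df, steps=2000):
    # Composite Simpson's rule on [-t_crit, t_crit]; `steps` must be even
    h = 2 * t_crit / steps
    total = t_pdf(-t_crit, df) + t_pdf(t_crit, df)
    for i in range(1, steps):
        x = -t_crit + i * h
        total += (4 if i % 2 else 2) * t_pdf(x, df)
    return total * h / 3

print(f"coverage with 1.96:  {coverage(1.96, 9):.4f}")
print(f"coverage with 2.262: {coverage(2.262, 9):.4f}")
```

The first value should land near the simulated results above, and the second near $95\%$, which is why the wider $t$-based interval is needed when $S_{10}$ replaces the known standard deviation.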


In [6]:
import random
from math import sqrt
from scipy import stats

def trial():
    data = []
    for _ in range(10):
        data.append(random.gauss(0, 1))

    mean = sum(data) / 10

    sample_std = sample_sd(data)

    t = stats.t.ppf(0.95, df=9)  # Critical t-value for a 90% confidence interval with 9 degrees of freedom

    # Construct the confidence interval
    lower_bound = mean - t * sample_std / sqrt(10)
    upper_bound = mean + t * sample_std / sqrt(10)

    # Check if 0 is in the confidence interval
    return 1 if lower_bound < 0 < upper_bound else 0

def sample_sd(data):
    N = len(data)
    mean = sum(data) / N
    return sqrt(sum([(d - mean)**2 for d in data]) / (N - 1))

count = 0
for _ in range(100000):
    count += trial()

print(count)


89919


In [5]:
import random
from math import sqrt
from scipy import stats

def trial():
    data = []
    for _ in range(10):
        data.append(random.gauss(0, 1))

    mean = sum(data) / 10

    sample_std = sample_sd(data)

    t = stats.t.ppf(0.975, df=9)  # Critical t-value for a 95% confidence interval with 9 degrees of freedom

    # Construct the confidence interval
    lower_bound = mean - t * sample_std / sqrt(10)
    upper_bound = mean + t * sample_std / sqrt(10)

    # Check if 0 is in the confidence interval
    return 1 if lower_bound < 0 < upper_bound else 0

def sample_sd(data):
    N = len(data)
    mean = sum(data) / N
    return sqrt(sum([(d - mean)**2 for d in data]) / (N - 1))

count = 0
for _ in range(100000):
    count += trial()

print(count)


94972


In [3]:
# 98% confidence level
import random
from math import sqrt

def trial():
    data = []
    for _ in range(10):
        data.append(random.gauss(0, 1))

    mean = sum(data) / 10

    sample_std = sample_sd(data)

    t = 2.821  # Estimated critical t-value for a 98% confidence interval with 9 degrees of freedom (N-1)

    # Construct the confidence interval
    lower_bound = mean - t * sample_std / sqrt(10)
    upper_bound = mean + t * sample_std / sqrt(10)

    # Check if 0 is in the confidence interval
    return 1 if lower_bound < 0 < upper_bound else 0

def sample_sd(data):
    N = len(data)
    mean = sum(data) / N
    return sqrt(sum([(d - mean)**2 for d in data]) / (N - 1))

count = 0
for _ in range(100000):
    count += trial()

print(count)


97945


In [6]:
# 98% confidence level
from math import sqrt

def sample_sd(data):
    N = len(data)
    mean = sum(data) / N
    return sqrt(sum([(d - mean)**2 for d in data]) / (N - 1))

# Define your sample data here (replace with your actual data)
data = [1.2, 2.3, 1.8, 1.5, 2.7, 3.0, 1.6, 2.2, 2.8, 1.9]

def find_critical_t(data, confidence_level):
    N = len(data)
    sample_std = sample_sd(data)
    
    for t in range(1, 1000):  # You can adjust the range as needed
        t_value = t / 1000.0  # Convert to decimal
        mean = sum(data) / N
        lower_bound = mean - t_value * sample_std / sqrt(N)
        upper_bound = mean + t_value * sample_std / sqrt(N)
        
        # NOTE: this compares 1 - (interval width) with the confidence level,
        # which is not a coverage probability, so the first t tried (0.001) is returned
        if (1 - (upper_bound - lower_bound)) >= confidence_level:
            return t_value

confidence_level = 0.98
critical_t = find_critical_t(data, confidence_level)
print(f"Critical t-value for {confidence_level*100}% confidence: {critical_t}")


Critical t-value for 98.0% confidence: 0.001
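
The search above works from a single fixed data set, so it cannot measure how often the interval covers the true mean. One way to estimate $t$ experimentally, in the spirit of the earlier cells, is to note that the interval $(\overline{x}_{10} \pm t \cdot S_{10} / \sqrt{10})$ contains $0$ exactly when $|\overline{x}_{10}| / (S_{10} / \sqrt{10}) < t$, and then take the $98$th percentile of that ratio over many simulated samples. This is a sketch under those assumptions; the function name and the seed are hypothetical.

```python
import random
from math import sqrt

def t_statistic(n=10):
    # |sample mean| divided by its estimated standard error, for N(0, 1) data
    data = [random.gauss(0, 1) for _ in range(n)]
    mean = sum(data) / n
    sd = sqrt(sum((d - mean) ** 2 for d in data) / (n - 1))
    return abs(mean) / (sd / sqrt(n))

random.seed(0)  # hypothetical seed, only to make the run repeatable
n_trials = 100_000
ratios = sorted(t_statistic() for _ in range(n_trials))

# Empirical 98th percentile: about 98% of the simulated ratios fall below it
t_est = ratios[int(0.98 * n_trials)]
print(t_est)
```

The estimate should come out close to the tabulated value used in the earlier $98\%$ cell ($2.821$), with some run-to-run noise in the second decimal place.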


In [4]:
from scipy import stats

confidence_level = 0.98
t = stats.t.ppf((1 + confidence_level) / 2, df=9)  # Critical t-value for a 98% confidence interval with 9 degrees of freedom
print(f"Critical t-value for {confidence_level*100}% confidence: {t}")


Critical t-value for 98.0% confidence: 2.8214379233005493
