1 What is hypothesis testing in statistics?

Answer: Hypothesis testing is a statistical method used to make inferences or decisions about a population parameter based on sample data. It involves formulating a null hypothesis (H₀) and an alternative hypothesis (Hₐ), and then using data to decide whether to reject H₀ in favor of Hₐ.

2 What is the null hypothesis, and how does it differ from the alternative hypothesis?

Answer: The null hypothesis (H₀) is a statement of no effect or no difference, acting as the baseline assumption. The alternative hypothesis (Hₐ) is what you want to test for—it represents a new effect, difference, or relationship. Rejecting H₀ provides evidence in favor of Hₐ.

3 What is the significance level in hypothesis testing, and why is it important?

Answer: The significance level (often denoted as α, e.g., 0.05) is the threshold probability for rejecting the null hypothesis. It quantifies the risk of committing a Type I error (rejecting a true null hypothesis) and is crucial for controlling false positives.

4 What does a P-value represent in hypothesis testing?

Answer: The P-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. It reflects the strength of the evidence against H₀.

5 How do you interpret the P-value in hypothesis testing?

Answer: A small P-value (typically < α) indicates strong evidence against the null hypothesis, leading to its rejection. Conversely, a large P-value suggests insufficient evidence to reject H₀.
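
The decision rule above can be sketched in a few lines. This is a minimal illustration, assuming a two-sided Z-test; the function name `two_sided_p_value` and the values of `z_observed` and `alpha` are hypothetical and chosen only for demonstration.

```python
from statistics import NormalDist


def two_sided_p_value(z: float) -> float:
    """Return P(|Z| >= |z|) under the standard normal distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


alpha = 0.05       # illustrative significance level
z_observed = 2.5   # hypothetical test statistic

p_value = two_sided_p_value(z_observed)
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"p-value = {p_value:.4f} -> {decision}")
```

Only the standard library is used here so the snippet runs without extra packages; `scipy.stats.norm` would serve equally well.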

6 What are Type I and Type II errors in hypothesis testing?

Answer:

Type I error: Incorrectly rejecting a true null hypothesis (false positive).
Type II error: Failing to reject a false null hypothesis (false negative).
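
To make the two error types concrete, the sketch below computes the Type II error rate (β) for a one-sided Z-test with known σ. The helper `type_ii_error` and all numbers (μ₀ = 100, true mean 105, σ = 15, n = 36) are assumptions for illustration, not values from this notebook.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()


def type_ii_error(mu0, mu1, sigma, n, alpha=0.05):
    """Beta for H0: mu = mu0 vs Ha: mu > mu0 when the true mean is mu1."""
    # Reject H0 when Z exceeds this critical value (Type I error rate = alpha)
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Under the true mean mu1, the test statistic is shifted by this amount
    shift = (mu1 - mu0) * sqrt(n) / sigma
    # Beta is the probability of NOT exceeding the critical value
    return std_normal.cdf(z_crit - shift)


beta = type_ii_error(mu0=100, mu1=105, sigma=15, n=36)
print(f"Type I error rate (alpha): 0.05")
print(f"Type II error rate (beta): {beta:.4f}")
print(f"Power (1 - beta):          {1 - beta:.4f}")
```

Lowering α makes the critical value larger, which raises β for a fixed sample size; that trade-off is why the two errors are usually discussed together.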

7 What is the difference between a one-tailed and a two-tailed test in hypothesis testing?

Answer:

One-tailed test: Tests for an effect in a single direction (e.g., Hₐ: parameter > value or parameter < value).
Two-tailed test: Tests for an effect in both directions (Hₐ: parameter ≠ value).
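
The difference shows up directly in how the P-value is computed from the same test statistic. The sketch below is illustrative only: `z_test_p_value` is a hypothetical helper, and its `alternative` labels mirror the ones accepted by the `ztest` function used in the practical part.

```python
from statistics import NormalDist


def z_test_p_value(z: float, alternative: str = "two-sided") -> float:
    """P-value of a Z statistic for the given alternative hypothesis."""
    cdf = NormalDist().cdf
    if alternative == "larger":      # Ha: parameter > value (right tail)
        return 1.0 - cdf(z)
    if alternative == "smaller":     # Ha: parameter < value (left tail)
        return cdf(z)
    if alternative == "two-sided":   # Ha: parameter != value (both tails)
        return 2.0 * (1.0 - cdf(abs(z)))
    raise ValueError(f"unknown alternative: {alternative!r}")


z = 1.8  # hypothetical observed statistic
for alt in ("larger", "smaller", "two-sided"):
    print(f"{alt:>9}: p = {z_test_p_value(z, alt):.4f}")
```

With this statistic, the right-tailed test is significant at α = 0.05 while the two-tailed test is not, which is why the direction of the alternative should be fixed before looking at the data.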

8 What is the Z-test, and when is it used in hypothesis testing?

Answer: A Z-test is used when the sample size is large (or the population standard deviation is known) to test hypotheses about the population mean or proportion based on the standard normal distribution.

9 How do you calculate the Z-score, and what does it represent in hypothesis testing?

Answer: The Z-score is calculated using

$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}},$$

where $\bar{x}$ is the sample mean, $\mu$ is the population mean under the null hypothesis, $\sigma$ is the population standard deviation, and $n$ is the sample size. It represents how many standard errors ($\sigma/\sqrt{n}$) the sample mean lies from the hypothesized population mean.
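
As a quick worked example of this calculation, the snippet below evaluates the statistic for made-up inputs. The function name `z_statistic` and the numbers are assumptions chosen so the arithmetic is easy to follow.

```python
from math import sqrt


def z_statistic(sample_mean: float, mu: float, sigma: float, n: int) -> float:
    """Z = (sample_mean - mu) / (sigma / sqrt(n))."""
    standard_error = sigma / sqrt(n)
    return (sample_mean - mu) / standard_error


# Hypothetical inputs: the standard error is 15 / sqrt(25) = 3
z = z_statistic(sample_mean=103.0, mu=100.0, sigma=15.0, n=25)
print(f"Z = {z:.2f}")
```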

10 What is the T-distribution, and when should it be used instead of the normal distribution?

Answer: The T-distribution is similar to the normal distribution but with heavier tails and is used when the sample size is small and/or the population standard deviation is unknown. As sample size increases, it converges to the normal distribution.

11 What is the difference between a Z-test and a T-test?

Answer: A Z-test uses the standard normal distribution and is applied when the population variance is known or the sample size is large. A T-test uses the T-distribution and is preferred when the population variance is unknown and the sample size is small.

12 What is the T-test, and how is it used in hypothesis testing?

Answer: A T-test is a procedure for testing hypotheses about the mean when the population variance is unknown. There are several types, including one-sample, independent, and paired T-tests, to compare means under different scenarios.
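
A one-sample T-test can be sketched by hand as follows. This is an illustration rather than a replacement for a library routine: `one_sample_t` is a hypothetical helper, the data are the ten values used in the Z-test exercise of the practical part, and the critical value is taken from a standard t-table instead of being computed.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(data, mu0):
    """Return the t statistic and degrees of freedom for H0: mean = mu0."""
    n = len(data)
    standard_error = stdev(data) / sqrt(n)  # stdev uses the n - 1 denominator
    return (mean(data) - mu0) / standard_error, n - 1


sample = [52, 55, 58, 60, 63, 65, 67, 70, 72, 75]
t_stat, df = one_sample_t(sample, mu0=60)

# Two-sided 5% critical value for df = 9, read from a t-table (assumption:
# hard-coded here because the standard library has no t-distribution)
T_CRITICAL = 2.262
decision = "reject H0" if abs(t_stat) > T_CRITICAL else "fail to reject H0"
print(f"t = {t_stat:.3f}, df = {df} -> {decision}")
```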

13 What is the relationship between Z-test and T-test in hypothesis testing?

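Answer: Both tests compare sample statistics to population parameters but differ mainly in assumptions about variance and sample size. When the population variance is known, the Z-test applies directly. When it is unknown, the T-test is used, and as the sample size grows the T-distribution approaches the standard normal, so the T-test gives nearly the same result as the Z-test.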

14 What is a confidence interval, and how is it used to interpret statistical results?

Answer: A confidence interval provides a range of plausible values for a population parameter. It is interpreted as, “we are X% confident that the true parameter lies within this interval”—offering both an estimate and its uncertainty.

15 What is the margin of error, and how does it affect the confidence interval?

Answer: The margin of error is half the width of a confidence interval, equal to the critical value multiplied by the standard error. It reflects the largest difference expected between the sample estimate and the true population parameter at the chosen confidence level. A smaller margin yields a narrower, more precise interval.
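
The relationship between the margin of error and the interval can be shown with a short calculation. The sketch assumes a known population standard deviation (so a Z critical value applies); `margin_of_error` and the inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(sigma: float, n: int, confidence: float = 0.95) -> float:
    """Critical value times standard error, assuming sigma is known."""
    z_critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z_critical * sigma / sqrt(n)


sample_mean = 50.0  # hypothetical sample mean
moe = margin_of_error(sigma=10.0, n=100)
lower, upper = sample_mean - moe, sample_mean + moe
print(f"Margin of error: {moe:.2f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

Quadrupling the sample size halves the margin of error, because the standard error shrinks with the square root of n.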

16 How is Bayes' Theorem used in statistics, and what is its significance?

Answer: Bayes’ Theorem updates prior beliefs with new evidence. It calculates the posterior probability of a hypothesis given observed data, and is significant in Bayesian inference where probabilities are revised as more information becomes available.
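
A small numeric example helps show the update from prior to posterior. The scenario (a diagnostic test with 1% prevalence, 95% sensitivity, and a 5% false positive rate) and the function name `posterior` are assumptions made up for illustration.

```python
def posterior(prior: float, likelihood: float, false_positive_rate: float) -> float:
    """P(H | E) = P(E | H) P(H) / [P(E | H) P(H) + P(E | not H) P(not H)]."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: P(disease) = 0.01, P(+ | disease) = 0.95,
# P(+ | no disease) = 0.05
p_disease_given_positive = posterior(
    prior=0.01, likelihood=0.95, false_positive_rate=0.05
)
print(f"P(disease | positive test) = {p_disease_given_positive:.4f}")
```

Even with an accurate test, the posterior stays well below 50% here because the prior is so small; the evidence revises the belief but does not override it.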

17 What is the Chi-square distribution, and when is it used?

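Answer: The Chi-square distribution is the distribution of a sum of squared independent standard normal variables, with the number of terms giving its degrees of freedom. It is commonly used in tests for independence in contingency tables, in goodness of fit tests that check whether an observed frequency distribution differs from a theoretical one, and in inference about a population variance.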

18 What is the Chi-square goodness of fit test, and how is it applied?

Answer: The Chi-square goodness of fit test compares observed frequencies to expected frequencies under a specified distribution. A significant test result suggests that the observed data do not follow the expected distribution.
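
The test statistic itself is simple to compute by hand. The sketch below uses made-up counts from 60 rolls of a die and compares the statistic with the tabulated 5% critical value for 5 degrees of freedom; `chi_square_statistic` is a hypothetical helper, and a library routine such as `scipy.stats.chisquare` would normally also return the P-value.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 rolls of a die, H0: every face is equally likely
observed = [8, 12, 9, 11, 10, 10]
expected = [sum(observed) / len(observed)] * len(observed)

chi2 = chi_square_statistic(observed, expected)
df = len(observed) - 1

# 5% critical value for df = 5, read from a chi-square table (assumption:
# hard-coded because the standard library has no chi-square distribution)
CHI2_CRITICAL = 11.070
decision = "reject H0" if chi2 > CHI2_CRITICAL else "fail to reject H0"
print(f"chi-square = {chi2:.2f}, df = {df} -> {decision}")
```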

19 What is the F-distribution, and when is it used in hypothesis testing?

Answer: The F-distribution is used in the analysis of variance (ANOVA) and in tests comparing two variances. It arises as the ratio of two scaled Chi-square distributed variables.

20 What is an ANOVA test, and what are its assumptions?

Answer: ANOVA (Analysis of Variance) tests the hypothesis that the means of multiple groups are equal. Its main assumptions include independence of observations, normality within groups, and homogeneity of variances across groups.

21 What are the different types of ANOVA tests?

Answer: Common forms of ANOVA include:

One-way ANOVA: Compares means across a single factor with multiple groups.
Two-way ANOVA: Evaluates the effect of two independent factors and their interaction.
Repeated Measures ANOVA: Used when the same subjects are measured under different conditions.

22 What is the F-test, and how does it relate to hypothesis testing?

Answer: An F-test is used to compare two variances by calculating the ratio of the variances and comparing it to the F-distribution. It is integral to ANOVA, where the F-test determines whether group means are significantly different.
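
To show where the F statistic in ANOVA comes from, the sketch below computes it by hand for a one-way layout as the ratio of between-group to within-group mean squares. The helper `one_way_anova_f` and the three small groups are assumptions for illustration; in practice a routine such as `scipy.stats.f_oneway` would be used.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)

    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical data: three groups of three observations each
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f"F = {f_stat:.2f} with ({df_b}, {df_w}) degrees of freedom")
```

A large F means the group means vary more than the within-group noise would explain; the tabulated 5% critical value for (2, 6) degrees of freedom is about 5.14, so this made-up example would be judged significant.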

#Practical Part - 1

In [None]:
#1 Write a Python program to generate a random variable and display its value

import random

random_variable = random.randint(1, 100) # Generates a random integer between 1 and 100
print("Random variable value:", random_variable)

In [None]:
#2 Generate a discrete uniform distribution using Python and plot the probability mass function (PMF)

import numpy as np
import matplotlib.pyplot as plt

# Define the range of possible outcomes
low = 1
high = 10
outcomes = np.arange(low, high + 1)

# Create a discrete uniform distribution
# For a discrete uniform distribution, the probability of each outcome is equal
n_outcomes = len(outcomes)
probabilities = np.full(n_outcomes, 1 / n_outcomes)

# Plot the PMF
plt.figure(figsize=(8, 6))
plt.bar(outcomes, probabilities, color='skyblue')
plt.xlabel('Outcome')
plt.ylabel('Probability')
plt.title('Probability Mass Function of a Discrete Uniform Distribution')
plt.xticks(outcomes)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

In [None]:
#3 Write a Python function to calculate the probability distribution function (PDF) of a Bernoulli distribution

def bernoulli_pdf(k, p):
  """
  Calculates the probability of outcome k under a Bernoulli distribution.
  Because the distribution is discrete, this is strictly a probability
  mass function (PMF); the name follows the exercise wording.

  Args:
    k: The outcome (0 for failure, 1 for success).
    p: The probability of success (between 0 and 1).

  Returns:
    The probability of the given outcome k (0 for any other value of k).
  """
  if k == 1:
    return p
  elif k == 0:
    return 1 - p
  else:
    return 0

In [None]:
#4 Write a Python script to simulate a binomial distribution with n=10 and p=0.5, then plot its histogram

import matplotlib.pyplot as plt
import numpy as np
# Simulate the binomial distribution
n = 10
p = 0.5
size = 1000  # Number of simulations
binomial_samples = np.random.binomial(n, p, size)

# Plot the histogram
plt.figure(figsize=(10, 6))
plt.hist(binomial_samples, bins=np.arange(-0.5, n + 1.5, 1), rwidth=0.8, density=True, color='salmon', edgecolor='black')
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.title(f'Histogram of Binomial Distribution (n={n}, p={p})')
plt.xticks(np.arange(0, n + 1, 1))
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

In [None]:
#5 Create a Poisson distribution and visualize it using Python

import matplotlib.pyplot as plt
import numpy as np

# Define the average rate (lambda)
lambda_val = 3  # Example average rate of events

# Generate random samples from the Poisson distribution
size = 1000  # Number of samples
poisson_samples = np.random.poisson(lambda_val, size)

# Plot the histogram of the samples
plt.figure(figsize=(10, 6))
plt.hist(poisson_samples, bins=np.arange(-0.5, max(poisson_samples) + 1.5, 1), rwidth=0.8, density=True, color='lightgreen', edgecolor='black')
plt.xlabel('Number of Events')
plt.ylabel('Probability')
plt.title(f'Histogram of Poisson Distribution (λ={lambda_val})')
plt.xticks(np.arange(0, max(poisson_samples) + 1, 1))
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

In [None]:
#6 Write a Python program to calculate and plot the cumulative distribution function (CDF) of a discrete uniform distribution

import matplotlib.pyplot as plt
import numpy as np
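# Define the discrete uniform distribution (same setup as the PMF cell above)
# so that this cell runs on its own
outcomes = np.arange(1, 11)
probabilities = np.full(len(outcomes), 1 / len(outcomes))

# Calculate the cumulative distribution function (CDF)
# The CDF at a point x is the probability that the random variable is less than or equal to x
cdf = np.cumsum(probabilities)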

# Plot the CDF
plt.figure(figsize=(8, 6))
plt.step(outcomes, cdf, where='post', color='purple')
plt.xlabel('Outcome')
plt.ylabel('Cumulative Probability')
plt.title('Cumulative Distribution Function of a Discrete Uniform Distribution')
plt.xticks(outcomes)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

In [None]:
#7 Generate a continuous uniform distribution using NumPy and visualize it

import matplotlib.pyplot as plt
import numpy as np
# Define the parameters of the continuous uniform distribution
low = 0  # Lower bound
high = 10 # Upper bound

# Generate random samples from the continuous uniform distribution
size = 1000 # Number of samples
uniform_samples = np.random.uniform(low, high, size)

# Plot the histogram of the samples
plt.figure(figsize=(10, 6))
plt.hist(uniform_samples, bins=50, density=True, color='orange', edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Density')
plt.title(f'Histogram of Continuous Uniform Distribution (low={low}, high={high})')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

In [None]:
#8 Simulate data from a normal distribution and plot its histogram

import matplotlib.pyplot as plt
import numpy as np
# Define the parameters for the normal distribution
mu = 0  # Mean
sigma = 1 # Standard deviation

# Simulate data from the normal distribution
size = 1000  # Number of samples
normal_samples = np.random.normal(mu, sigma, size)

# Plot the histogram of the samples
plt.figure(figsize=(10, 6))
plt.hist(normal_samples, bins=50, density=True, color='lightblue', edgecolor='black')

# Plot the PDF of the normal distribution for comparison
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = (1/(sigma*np.sqrt(2*np.pi)))*np.exp(-((x-mu)**2)/(2*sigma**2))
plt.plot(x, p, 'k', linewidth=2)

plt.xlabel('Value')
plt.ylabel('Density')
plt.title(f'Histogram of Normal Distribution (μ={mu}, σ={sigma})')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

In [None]:
#9  Write a Python function to calculate Z-scores from a dataset and plot them

import matplotlib.pyplot as plt
import numpy as np
def calculate_and_plot_zscores(data):
  """
  Calculates Z-scores for a dataset and plots their histogram.

  Args:
    data: A list or NumPy array of numerical data.
  """
  # Calculate the mean and standard deviation
  mean = np.mean(data)
  std_dev = np.std(data)

  # Calculate Z-scores
  z_scores = (np.asarray(data) - mean) / std_dev

  # Plot the histogram of Z-scores
  plt.figure(figsize=(10, 6))
  plt.hist(z_scores, bins=50, density=True, color='gold', edgecolor='black')

  # Plot the PDF of the standard normal distribution (mean=0, std=1) for comparison
  xmin, xmax = plt.xlim()
  x = np.linspace(xmin, xmax, 100)
  p = (1 / np.sqrt(2 * np.pi)) * np.exp(-(x ** 2) / 2)
  plt.plot(x, p, 'k', linewidth=2)

  plt.xlabel('Z-score')
  plt.ylabel('Density')
  plt.title('Histogram of Z-scores and Standard Normal Distribution PDF')
  plt.grid(axis='y', linestyle='--', alpha=0.7)
  plt.show()

# Example usage:
# Assuming 'normal_samples' was generated in the previous cell
calculate_and_plot_zscores(normal_samples)

In [None]:
#10 Implement the Central Limit Theorem (CLT) using Python for a non-normal distribution

import matplotlib.pyplot as plt
import numpy as np
# Central Limit Theorem (CLT) Implementation with a Non-Normal Distribution (e.g., Exponential)

# Define the parameters of the exponential distribution
lambda_exp = 0.5 # Rate parameter

# Define the parameters for the CLT simulation
sample_size = 30 # Size of each sample
num_samples = 1000 # Number of samples to draw

# Generate samples from the exponential distribution and calculate their means
sample_means = []
for _ in range(num_samples):
  exponential_samples = np.random.exponential(scale=1/lambda_exp, size=sample_size)
  sample_means.append(np.mean(exponential_samples))

# Plot the histogram of the sample means
plt.figure(figsize=(10, 6))
plt.hist(sample_means, bins=50, density=True, color='violet', edgecolor='black')

# Plot the theoretical normal distribution based on the expected mean and standard deviation of the sample means
# For exponential distribution, expected value (mu) = 1/lambda, variance = 1/lambda^2
# According to CLT, mean of sample means approaches population mean (1/lambda)
# According to CLT, standard deviation of sample means approaches population standard deviation / sqrt(sample_size)
# Population standard deviation for exponential is 1/lambda
expected_mean_of_means = 1/lambda_exp
expected_std_dev_of_means = (1/lambda_exp) / np.sqrt(sample_size)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = (1 / (expected_std_dev_of_means * np.sqrt(2 * np.pi))) * np.exp(-((x - expected_mean_of_means) ** 2) / (2 * expected_std_dev_of_means ** 2))
plt.plot(x, p, 'k', linewidth=2, label='Theoretical Normal Distribution')

plt.xlabel('Sample Mean')
plt.ylabel('Density')
plt.title(f'Distribution of Sample Means (CLT Demonstration) - Exponential Distribution\nSample Size={sample_size}, Number of Samples={num_samples}')
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

In [None]:
#11 Simulate multiple samples from a normal distribution and verify the Central Limit Theorem

import matplotlib.pyplot as plt
import numpy as np
# Define the parameters for the normal distribution
mu = 10  # Mean
sigma = 2 # Standard deviation

# Define the parameters for the simulation
sample_size = 50 # Size of each sample
num_samples = 1000 # Number of samples to draw

# Simulate multiple samples from the normal distribution and calculate their means
sample_means_normal = []
for _ in range(num_samples):
  normal_sample = np.random.normal(mu, sigma, sample_size)
  sample_means_normal.append(np.mean(normal_sample))

# Plot the histogram of the sample means
plt.figure(figsize=(10, 6))
plt.hist(sample_means_normal, bins=50, density=True, color='teal', edgecolor='black')

# Plot the theoretical normal distribution for the sample means based on CLT
# The mean of the distribution of sample means is the population mean (mu)
# The standard deviation of the distribution of sample means is the population standard deviation divided by the square root of the sample size (sigma / sqrt(sample_size))
expected_mean_of_means_normal = mu
expected_std_dev_of_means_normal = sigma / np.sqrt(sample_size)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = (1 / (expected_std_dev_of_means_normal * np.sqrt(2 * np.pi))) * np.exp(-((x - expected_mean_of_means_normal) ** 2) / (2 * expected_std_dev_of_means_normal ** 2))
plt.plot(x, p, 'k', linewidth=2, label='Theoretical Normal Distribution of Sample Means')

plt.xlabel('Sample Mean')
plt.ylabel('Density')
plt.title(f'Distribution of Sample Means (Normal Distribution Source)\nSample Size={sample_size}, Number of Samples={num_samples}')
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

In [None]:
#12 Write a Python function to calculate and plot the standard normal distribution (mean = 0, std = 1)

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

def plot_standard_normal_distribution():
  """
  Calculates and plots the standard normal distribution (mean = 0, std = 1).
  """
  # Define the range of x values for the plot
  x = np.linspace(-4, 4, 100) # From -4 to 4 standard deviations

  # Calculate the probability density function (PDF) for the standard normal distribution
  # Use scipy.stats.norm.pdf for the standard normal distribution (mean=0, std=1)
  pdf_values = norm.pdf(x, loc=0, scale=1)

  # Plot the standard normal distribution
  plt.figure(figsize=(8, 6))
  plt.plot(x, pdf_values, label='Standard Normal Distribution (μ=0, σ=1)', color='blue')
  plt.xlabel('Value')
  plt.ylabel('Density')
  plt.title('Standard Normal Distribution')
  plt.grid(True)
  plt.legend()
  plt.show()

# Call the function to plot the standard normal distribution
plot_standard_normal_distribution()

In [None]:
#13 Generate random variables and calculate their corresponding probabilities using the binomial distribution

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import binom

def calculate_binomial_probability(k, n, p):
  """
  Calculates the probability of getting exactly k successes in n trials
  with a success probability of p using the binomial distribution.

  Args:
    k: The number of successes.
    n: The number of trials.
    p: The probability of success in a single trial.

  Returns:
    The probability P(X=k).
  """
  if not 0 <= p <= 1:
    raise ValueError("Probability p must be between 0 and 1.")
  if not 0 <= k <= n:
    raise ValueError(f"Number of successes k must be between 0 and number of trials n ({n}).")

  return binom.pmf(k, n, p)

# Example usage: Generate random variables (number of successes) and calculate probabilities

# Define parameters for the binomial distribution
n_trials = 20  # Number of trials
p_success = 0.4 # Probability of success

# Generate a few random variables (number of successes)
num_random_vars = 5
random_successes = np.random.binomial(n_trials, p_success, num_random_vars)

print(f"Generating {num_random_vars} random variables from Binomial(n={n_trials}, p={p_success}):")
for rv in random_successes:
  probability = calculate_binomial_probability(rv, n_trials, p_success)
  print(f"Random variable: {rv}, Probability P(X={rv}): {probability:.4f}")

# You can also calculate probabilities for a range of possible outcomes
possible_outcomes = np.arange(0, n_trials + 1)
probabilities_for_outcomes = [calculate_binomial_probability(k, n_trials, p_success) for k in possible_outcomes]

# Plot the probability mass function (PMF)
plt.figure(figsize=(10, 6))
plt.bar(possible_outcomes, probabilities_for_outcomes, color='teal')
plt.xlabel('Number of Successes (k)')
plt.ylabel('Probability P(X=k)')
plt.title(f'Binomial Distribution PMF (n={n_trials}, p={p_success})')
plt.xticks(possible_outcomes)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


In [None]:
#14 Write a Python program to calculate the Z-score for a given data point and compare it to a standard normal distribution

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm  # needed for norm.cdf and norm.pdf below
def calculate_z_score(data_point, mean, std_dev):
  """
  Calculates the Z-score for a given data point.

  Args:
    data_point: The value for which to calculate the Z-score.
    mean: The mean of the dataset.
    std_dev: The standard deviation of the dataset.

  Returns:
    The Z-score.
  """
  if std_dev == 0:
    return float('inf') if data_point > mean else float('-inf') if data_point < mean else 0
  return (data_point - mean) / std_dev

# Example usage:
data = [65, 70, 75, 80, 85, 90, 95]
data_point_to_check = 90

# Calculate mean and standard deviation of the data
mean_data = np.mean(data)
std_dev_data = np.std(data)

# Calculate the Z-score for the data point
z_score = calculate_z_score(data_point_to_check, mean_data, std_dev_data)

print(f"Data point: {data_point_to_check}")
print(f"Mean of data: {mean_data:.2f}")
print(f"Standard deviation of data: {std_dev_data:.2f}")
print(f"Z-score for {data_point_to_check}: {z_score:.2f}")

# Compare the Z-score to the standard normal distribution
# We can find the probability of observing a Z-score this extreme or more extreme
# using the cumulative distribution function (CDF) of the standard normal distribution.

# Probability of observing a value less than or equal to the data point
probability_less_than_or_equal = norm.cdf(z_score)
print(f"Probability of observing a value less than or equal to {data_point_to_check} (corresponding to Z-score {z_score:.2f}) in a standard normal distribution: {probability_less_than_or_equal:.4f}")

# Probability of observing a value greater than or equal to the data point
probability_greater_than_or_equal = 1 - probability_less_than_or_equal
print(f"Probability of observing a value greater than or equal to {data_point_to_check} in a standard normal distribution: {probability_greater_than_or_equal:.4f}")

# Visualize the Z-score on the standard normal distribution
plt.figure(figsize=(8, 6))
x = np.linspace(-3, 3, 100)
plt.plot(x, norm.pdf(x, 0, 1), label='Standard Normal Distribution (μ=0, σ=1)', color='blue')

# Mark the calculated Z-score on the plot
plt.axvline(z_score, color='red', linestyle='dashed', linewidth=2, label=f'Z-score = {z_score:.2f}')

# Highlight the area corresponding to the probability
# For example, highlight the area to the right of the Z-score (for probability_greater_than_or_equal)
x_fill = np.linspace(z_score, 3, 100)
plt.fill_between(x_fill, norm.pdf(x_fill, 0, 1), color='red', alpha=0.3, label=f'P(Z >= {z_score:.2f}) = {probability_greater_than_or_equal:.4f}')


plt.xlabel('Z-score')
plt.ylabel('Density')
plt.title('Z-score Comparison to Standard Normal Distribution')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
#15 Implement hypothesis testing using Z-statistics for a sample dataset

import numpy as np
from statsmodels.stats.weightstats import ztest

# Sample data (replace with your actual dataset)
# Let's assume this data represents measurements from a sample.
sample_data = np.array([52, 55, 58, 60, 63, 65, 67, 70, 72, 75])

# Define the null hypothesis (H0): The population mean is equal to a specific value (e.g., 60)
# Define the alternative hypothesis (Ha): The population mean is not equal to the specific value (two-tailed test)
null_hypothesis_mean = 60

# Perform the Z-test
# The ztest function returns the Z-statistic and the p-value
z_statistic, p_value = ztest(sample_data, value=null_hypothesis_mean)

# Define the significance level (alpha)
alpha = 0.05

print(f"Sample Data: {sample_data}")
print(f"Null Hypothesis (H0): Population Mean = {null_hypothesis_mean}")
print(f"Z-statistic: {z_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Significance Level (alpha): {alpha}")

# Make a decision based on the p-value and significance level
if p_value < alpha:
  print("\nDecision: Reject the null hypothesis.")
  print(f"Conclusion: There is sufficient evidence to suggest that the population mean is significantly different from {null_hypothesis_mean} at the {alpha} significance level.")
else:
  print("\nDecision: Fail to reject the null hypothesis.")
  print(f"Conclusion: There is not enough evidence to suggest that the population mean is significantly different from {null_hypothesis_mean} at the {alpha} significance level.")

# You can also perform one-tailed tests by specifying the 'alternative' parameter:
# 'two-sided': alternative is mu != value (default)
# 'larger': alternative is mu > value
# 'smaller': alternative is mu < value

# Example of a one-tailed test (alternative: population mean is greater than 60)
z_statistic_larger, p_value_larger = ztest(sample_data, value=null_hypothesis_mean, alternative='larger')
print(f"\nOne-tailed test (Ha: mu > {null_hypothesis_mean}):")
print(f"Z-statistic: {z_statistic_larger:.4f}")
print(f"P-value: {p_value_larger:.4f}")

if p_value_larger < alpha:
  print("Decision: Reject the null hypothesis.")
  print(f"Conclusion: There is sufficient evidence to suggest that the population mean is significantly greater than {null_hypothesis_mean}.")
else:
  print("Decision: Fail to reject the null hypothesis.")
  print(f"Conclusion: There is not enough evidence to suggest that the population mean is significantly greater than {null_hypothesis_mean}.")

In [None]:
#16 Create a confidence interval for a dataset using Python and interpret the result

import numpy as np
from scipy import stats

# Assume 'sample_data' is the dataset you want to create a confidence interval for.
# This data is already defined in the preceding code.

# Define the confidence level
confidence_level = 0.95
alpha = 1 - confidence_level

# Calculate the sample mean and sample standard deviation
sample_mean = np.mean(sample_data)
sample_std = np.std(sample_data, ddof=1) # Use ddof=1 for sample standard deviation (unbiased estimator)
sample_size = len(sample_data)

# Determine the appropriate distribution for the confidence interval
# Since the sample size is small (< 30) and the population standard deviation is unknown,
# we should use the t-distribution. If sample size was large (> 30) or population
# standard deviation was known, we would use the normal (z) distribution.

# Calculate the t-score (critical value) for the desired confidence level and degrees of freedom
# Degrees of freedom for a one-sample t-interval is n - 1
degrees_of_freedom = sample_size - 1
t_score = stats.t.ppf(1 - alpha/2, degrees_of_freedom) # For a two-tailed interval

# Calculate the standard error of the mean (SEM)
standard_error = sample_std / np.sqrt(sample_size)

# Calculate the margin of error
margin_of_error = t_score * standard_error

# Calculate the confidence interval
confidence_interval_lower = sample_mean - margin_of_error
confidence_interval_upper = sample_mean + margin_of_error

print(f"Dataset: {sample_data}")
print(f"Sample Mean: {sample_mean:.2f}")
print(f"Sample Standard Deviation: {sample_std:.2f}")
print(f"Sample Size: {sample_size}")
print(f"Confidence Level: {confidence_level}")
print(f"Degrees of Freedom: {degrees_of_freedom}")
print(f"T-score (critical value): {t_score:.4f}")
print(f"Standard Error of the Mean (SEM): {standard_error:.4f}")
print(f"Margin of Error: {margin_of_error:.4f}")
print(f"\n{confidence_level*100:.0f}% Confidence Interval: ({confidence_interval_lower:.4f}, {confidence_interval_upper:.4f})")

# Interpretation of the result
print("\nInterpretation:")
print(f"We are {confidence_level*100:.0f}% confident that the true population mean lies within the interval [{confidence_interval_lower:.4f}, {confidence_interval_upper:.4f}].")
print("This means that if we were to take many random samples from the same population and construct a confidence interval for each sample,")
print(f"approximately {confidence_level*100:.0f}% of these intervals would contain the true population mean.")
print("It does NOT mean that there is a 95% probability that the true population mean falls within this specific interval.")

In [None]:
#17 Generate data from a normal distribution, then calculate and interpret the confidence interval for its mean

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm  # needed for norm.ppf and norm.pdf below
# Generate data from a normal distribution
mu = 50  # Mean of the normal distribution
sigma = 10 # Standard deviation of the normal distribution
sample_size = 100  # Number of data points

normal_data = np.random.normal(mu, sigma, sample_size)

# Calculate the confidence interval for the mean
confidence_level = 0.95
alpha = 1 - confidence_level

# Calculate sample statistics
sample_mean = np.mean(normal_data)
sample_std = np.std(normal_data, ddof=1) # Use ddof=1 for sample standard deviation (unbiased estimator)
n = len(normal_data)

# Determine the appropriate distribution for the confidence interval
# Since we generated the data, we know the population standard deviation (sigma).
# However, in a real scenario, you'd likely use the sample standard deviation (sample_std)
# and the t-distribution unless the sample size is very large.
# For demonstration with known population standard deviation, we can use the Z-distribution.
# If we were to rely solely on the sample, we'd use the t-distribution as shown in the previous cell.

# Using Z-distribution (assuming population sigma is known or sample size is large)
# Find the critical Z-value
z_critical = norm.ppf(1 - alpha/2) # For a two-tailed interval

# Calculate the standard error of the mean (SEM)
# If using population standard deviation: sem = sigma / np.sqrt(n)
# If using sample standard deviation (more common in practice): sem = sample_std / np.sqrt(n)
standard_error = sample_std / np.sqrt(n) # Using sample standard deviation

# Calculate the margin of error
margin_of_error = z_critical * standard_error

# Calculate the confidence interval
confidence_interval_lower = sample_mean - margin_of_error
confidence_interval_upper = sample_mean + margin_of_error

print(f"Generated Data from Normal Distribution (μ={mu}, σ={sigma}): First 10 values: {normal_data[:10]}...")
print(f"Sample Mean: {sample_mean:.4f}")
print(f"Sample Standard Deviation: {sample_std:.4f}")
print(f"Sample Size: {n}")
print(f"Confidence Level: {confidence_level}")
print(f"Z-score (critical value): {z_critical:.4f}")
print(f"Standard Error of the Mean (SEM): {standard_error:.4f}")
print(f"Margin of Error: {margin_of_error:.4f}")
print(f"\n{confidence_level*100:.0f}% Confidence Interval for the Mean: ({confidence_interval_lower:.4f}, {confidence_interval_upper:.4f})")

# Interpretation of the result
print("\nInterpretation:")
print(f"Based on our sample of {n} data points, we are {confidence_level*100:.0f}% confident that the true mean of the population from which this data was drawn")
print(f"lies within the range [{confidence_interval_lower:.4f}, {confidence_interval_upper:.4f}].")
print(f"Since we generated the data from a normal distribution with a true mean (μ) of {mu},")
print(f"we can check if the true mean falls within our calculated confidence interval.")
print(f"Is the true mean ({mu}) within the interval [{confidence_interval_lower:.4f}, {confidence_interval_upper:.4f}]? {confidence_interval_lower <= mu <= confidence_interval_upper}")
print("In general, if we were to repeat this process of taking samples and calculating confidence intervals many times,")
print(f"approximately {confidence_level*100:.0f}% of those intervals would contain the true population mean.")

# Optional: Visualize the data and the confidence interval on a histogram
plt.figure(figsize=(10, 6))
plt.hist(normal_data, bins=30, density=True, alpha=0.6, color='skyblue', label='Data Histogram')

# Plot the theoretical normal distribution from which data was generated
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, sigma)
plt.plot(x, p, 'k', linewidth=2, label=f'True Normal PDF (μ={mu}, σ={sigma})')

# Mark the sample mean
plt.axvline(sample_mean, color='red', linestyle='dashed', linewidth=2, label=f'Sample Mean: {sample_mean:.2f}')

# Mark the confidence interval
plt.axvline(confidence_interval_lower, color='green', linestyle='dotted', linewidth=2, label=f'{confidence_level*100:.0f}% CI')
plt.axvline(confidence_interval_upper, color='green', linestyle='dotted', linewidth=2)
plt.fill_betweenx([0, plt.ylim()[1]], confidence_interval_lower, confidence_interval_upper, color='green', alpha=0.1)


plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Generated Normal Data and Confidence Interval for the Mean')
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
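
The interpretation above ends with the claim that about 95% of intervals built this way contain the true mean. That claim can be checked by simulation. The sketch below is a minimal illustration using only the standard library; the population parameters, sample size, trial count, and seed are assumed values chosen for the example, and σ is treated as known so that the z-interval applies exactly.

```python
import random
from statistics import NormalDist, fmean

# Assumed illustration values (not taken from the cell above)
TRUE_MEAN = 50.0
TRUE_SIGMA = 10.0
SAMPLE_SIZE = 30
CONFIDENCE = 0.95
TRIALS = 2000

def ci_coverage(trials=TRIALS, seed=0):
    """Return the fraction of z-intervals (known sigma) that contain the true mean."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - (1 - CONFIDENCE) / 2)
    margin = z_crit * TRUE_SIGMA / SAMPLE_SIZE ** 0.5
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(TRUE_MEAN, TRUE_SIGMA) for _ in range(SAMPLE_SIZE)]
        mean = fmean(sample)
        if mean - margin <= TRUE_MEAN <= mean + margin:
            hits += 1
    return hits / trials

coverage = ci_coverage()
print(f"Empirical coverage over {TRIALS} intervals: {coverage:.3f}")
```

The printed fraction should land near the nominal confidence level, with some run-to-run variation if the seed is changed.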

In [None]:
#18 Write a Python script to calculate and visualize the probability density function (PDF) of a normal distribution

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm
def plot_normal_distribution_pdf(mean, std_dev):
  """
  Calculates and plots the probability density function (PDF)
  of a normal distribution.

  Args:
    mean: The mean (μ) of the normal distribution.
    std_dev: The standard deviation (σ) of the normal distribution.
  """
  # Define the range of x values for the plot
  # It's common to plot a few standard deviations around the mean
  x = np.linspace(mean - 4*std_dev, mean + 4*std_dev, 100)

  # Calculate the probability density function (PDF) for the normal distribution
  # Using the formula: PDF(x) = (1 / (σ * sqrt(2*π))) * exp(-((x - μ)^2) / (2 * σ^2))
  # Or using scipy.stats.norm.pdf
  pdf_values = norm.pdf(x, loc=mean, scale=std_dev)

  # Plot the PDF
  plt.figure(figsize=(8, 6))
  plt.plot(x, pdf_values, label=f'Normal Distribution (μ={mean}, σ={std_dev})', color='blue')
  plt.xlabel('Value')
  plt.ylabel('Density')
  plt.title('Probability Density Function of a Normal Distribution')
  plt.grid(True)
  plt.legend()
  plt.show()

# Example usage:
# Plot a normal distribution with mean 0 and standard deviation 1 (standard normal distribution)
plot_normal_distribution_pdf(0, 1)

# Plot a normal distribution with mean 10 and standard deviation 2
plot_normal_distribution_pdf(10, 2)
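
The function above mentions the closed-form density in a comment but delegates the calculation to `norm.pdf`. As a cross-check, the formula can be evaluated directly. This is a small sketch, not part of the original cell: `normal_pdf` is a hypothetical helper name, and `statistics.NormalDist` from the standard library is used as the reference instead of SciPy.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mean, std_dev):
    """Normal density from the formula (1 / (σ√(2π))) * exp(-(x - μ)² / (2σ²))."""
    coefficient = 1.0 / (std_dev * math.sqrt(2.0 * math.pi))
    exponent = -((x - mean) ** 2) / (2.0 * std_dev ** 2)
    return coefficient * math.exp(exponent)

# Compare the manual formula with the library implementation at a few points
for x in (-1.0, 0.0, 2.5):
    manual = normal_pdf(x, 0.0, 1.0)
    reference = NormalDist(0.0, 1.0).pdf(x)
    print(f"x={x:5.2f}  manual={manual:.6f}  NormalDist={reference:.6f}")
```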

In [None]:
#19 Use Python to calculate and interpret the cumulative distribution function (CDF) of a Poisson distribution

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import poisson

# Define the average rate (lambda)
lambda_val = 3  # Example average rate of events

# Define the range of possible outcomes for the CDF
# The CDF is typically calculated for integer values k
# We can go up to a certain number of events where the probability becomes very low
max_k = int(poisson.ppf(0.999, lambda_val)) # Find k value up to which CDF is 0.999
possible_k_values = np.arange(0, max_k + 1)

# Calculate the cumulative distribution function (CDF) for each possible k
# The CDF at k is the probability that the number of events is less than or equal to k, P(X <= k)
cdf_values = poisson.cdf(possible_k_values, lambda_val)

print(f"Calculating CDF for Poisson Distribution with λ={lambda_val}:")
for k, cdf in zip(possible_k_values, cdf_values):
  print(f"CDF at k={k} (P(X <= {k})): {cdf:.4f}")

# Plot the CDF
plt.figure(figsize=(10, 6))
plt.step(possible_k_values, cdf_values, where='post', color='darkorange', label=f'Poisson CDF (λ={lambda_val})')
plt.xlabel('Number of Events (k)')
plt.ylabel('Cumulative Probability P(X <= k)')
plt.title(f'Cumulative Distribution Function of a Poisson Distribution (λ={lambda_val})')
plt.xticks(possible_k_values)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.legend()
plt.show()

# Interpretation of the CDF
print("\nInterpretation of the Poisson CDF:")
print("The CDF at a specific value 'k' represents the probability of observing 'k' or fewer events in a fixed interval or space,")
print(f"given that the average rate of events is {lambda_val}.")
print(f"For example, the CDF at k={max_k // 2} is {poisson.cdf(max_k // 2, lambda_val):.4f}. This means there is a {poisson.cdf(max_k // 2, lambda_val)*100:.2f}% chance of observing {max_k // 2} or fewer events.")
print("As 'k' increases, the CDF value approaches 1, indicating that it becomes almost certain to observe a number of events less than or equal to a large value 'k'.")
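
The CDF values printed above are running sums of the Poisson probability mass function, P(X = i) = e^(-λ) λ^i / i!. The sketch below rebuilds the CDF from that definition with the standard library only; `poisson_pmf` and `poisson_cdf` are hypothetical helper names introduced for illustration.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_cdf(k, lam):
    """P(X <= k), accumulated term by term from the PMF."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

lam = 3  # same rate as the cell above
for k in range(0, 7):
    print(f"P(X <= {k}) = {poisson_cdf(k, lam):.4f}")
```

These values should agree with `poisson.cdf` for the same λ, which makes the step shape of the plot easier to read: each step height is one PMF term.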

In [None]:
#20 Simulate a random variable using a continuous uniform distribution and calculate its expected value

import numpy as np
# Simulate a random variable from a continuous uniform distribution
# Define the parameters of the continuous uniform distribution
low_val = 5  # Lower bound
high_val = 15 # Upper bound

# Generate one random variable from the continuous uniform distribution
random_variable_uniform = np.random.uniform(low_val, high_val)

print(f"Simulated random variable from Continuous Uniform distribution [{low_val}, {high_val}]: {random_variable_uniform:.4f}")

# Calculate the expected value of the continuous uniform distribution
# The expected value (mean) of a continuous uniform distribution U(a, b) is (a + b) / 2
expected_value_uniform = (low_val + high_val) / 2

print(f"Calculated Expected Value of Continuous Uniform distribution [{low_val}, {high_val}]: {expected_value_uniform:.4f}")

# You can also verify this by taking the mean of many samples (demonstrated in a previous cell)
# mean of uniform_samples generated in cell 7 should be close to the expected value
# print(f"Mean of 1000 uniform samples: {np.mean(uniform_samples):.4f}")
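
A single draw says little about the expected value, so the formula (a + b) / 2 is easier to trust when it is compared with the mean of many draws. The following sketch does that with the standard library; the sample count and seed are assumptions made for the example.

```python
import random
from statistics import fmean

low_val, high_val = 5, 15   # same bounds as the cell above
n_samples = 100_000         # assumed sample count
rng = random.Random(1)      # assumed seed, for reproducibility

# Draw many values from U(low_val, high_val) and average them
samples = [rng.uniform(low_val, high_val) for _ in range(n_samples)]
empirical_mean = fmean(samples)
theoretical_mean = (low_val + high_val) / 2

print(f"Theoretical E[X]: {theoretical_mean:.4f}")
print(f"Empirical mean of {n_samples} draws: {empirical_mean:.4f}")
```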

In [None]:
#21 Write a Python program to compare the standard deviations of two datasets and visualize the difference

import matplotlib.pyplot as plt
import numpy as np
def compare_std_devs(data1, data2):
  """
    Compares the standard deviations of two datasets and visualizes the difference.

    Args:
      data1: The first dataset (list or NumPy array).
      data2: The second dataset (list or NumPy array).
    """
  std_dev1 = np.std(data1)
  std_dev2 = np.std(data2)

  print(f"Standard Deviation of Dataset 1: {std_dev1:.4f}")
  print(f"Standard Deviation of Dataset 2: {std_dev2:.4f}")

  # Visualize the difference
  plt.figure(figsize=(8, 6))
  std_devs = [std_dev1, std_dev2]
  labels = ['Dataset 1', 'Dataset 2']
  colors = ['blue', 'orange']

  plt.bar(labels, std_devs, color=colors)
  plt.ylabel('Standard Deviation')
  plt.title('Comparison of Standard Deviations')
  plt.grid(axis='y', linestyle='--', alpha=0.7)
  plt.show()

  # Optional: Visualize the distributions themselves to see the spread
  plt.figure(figsize=(10, 6))
  plt.hist(data1, bins=30, density=True, alpha=0.6, color='blue', label='Dataset 1')
  plt.hist(data2, bins=30, density=True, alpha=0.6, color='orange', label='Dataset 2')
  plt.xlabel('Value')
  plt.ylabel('Density')
  plt.title('Distribution of Datasets')
  plt.legend()
  plt.grid(axis='y', linestyle='--', alpha=0.7)
  plt.show()


# Example Usage:
# Create two example datasets with different standard deviations
dataset1 = np.random.normal(loc=10, scale=2, size=100) # Mean 10, Std Dev 2
dataset2 = np.random.normal(loc=10, scale=5, size=100) # Mean 10, Std Dev 5

compare_std_devs(dataset1, dataset2)

# Example with the sample_data from previous cells and another generated dataset
dataset3 = np.random.normal(loc=70, scale=8, size=len(sample_data))

compare_std_devs(sample_data, dataset3)
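
One detail worth keeping in mind when comparing spreads: `np.std` divides by n by default (`ddof=0`, the population formula), whereas the sample standard deviation divides by n - 1 (`ddof=1`). The difference matters most for small datasets. The sketch below shows both on a small made-up dataset using the standard library equivalents.

```python
from statistics import pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # small made-up dataset, mean = 5

population_sd = pstdev(data)  # divides by n, like np.std(data)
sample_sd = stdev(data)       # divides by n - 1, like np.std(data, ddof=1)

print(f"Population standard deviation (ddof=0): {population_sd:.4f}")
print(f"Sample standard deviation (ddof=1):     {sample_sd:.4f}")
```

When both datasets are samples from larger populations, using the same convention for both is what keeps the comparison fair.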

In [None]:
#22 Calculate the range and interquartile range (IQR) of a dataset generated from a normal distribution

import matplotlib.pyplot as plt
import numpy as np
# Assuming 'normal_samples' was generated in cell 8
# Define the dataset from the normal distribution simulation in cell 8
dataset_for_range = normal_samples

# Calculate the range
data_range = np.max(dataset_for_range) - np.min(dataset_for_range)

# Calculate the Interquartile Range (IQR)
# IQR = Q3 - Q1
Q1 = np.percentile(dataset_for_range, 25)
Q3 = np.percentile(dataset_for_range, 75)
iqr = Q3 - Q1

print(f"Dataset from Normal Distribution (first 10 values): {dataset_for_range[:10]}...")
print(f"Number of data points: {len(dataset_for_range)}")
print(f"Range: {data_range:.4f}")
print(f"First Quartile (Q1): {Q1:.4f}")
print(f"Third Quartile (Q3): {Q3:.4f}")
print(f"Interquartile Range (IQR): {iqr:.4f}")

# Optional: Visualize the quartiles and IQR on a box plot
plt.figure(figsize=(8, 6))
plt.boxplot(dataset_for_range, vert=False, patch_artist=True,
            boxprops=dict(facecolor='lightblue'),
            medianprops=dict(color='red', linewidth=2))
plt.title('Box Plot of the Normal Distribution Dataset')
plt.xlabel('Value')
plt.yticks([]) # Hide y-axis ticks for a single box plot
plt.text(Q1, 1.05, f'Q1\n{Q1:.2f}', ha='center', color='blue')
plt.text(Q3, 1.05, f'Q3\n{Q3:.2f}', ha='center', color='blue')
plt.text(np.median(dataset_for_range), 1.05, f'Median\n{np.median(dataset_for_range):.2f}', ha='center', color='red')
plt.text(Q1 + iqr/2, 0.8, f'IQR = {iqr:.2f}', ha='center', color='black')
plt.show()
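
For data drawn from a normal distribution, the IQR has a known theoretical value of roughly 1.349σ, which gives a quick sanity check for the number printed above. The sketch below computes that value from the quantile function; `normal_iqr` is a hypothetical helper, and the σ values passed to it are assumptions for illustration.

```python
from statistics import NormalDist

def normal_iqr(sigma, mean=0.0):
    """Theoretical IQR (Q3 - Q1) of a normal distribution with the given sigma."""
    dist = NormalDist(mean, sigma)
    return dist.inv_cdf(0.75) - dist.inv_cdf(0.25)

for sigma in (1.0, 2.0, 10.0):
    print(f"σ = {sigma:5.1f}  ->  theoretical IQR = {normal_iqr(sigma):.4f}")
```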

In [None]:
#23 Implement Z-score normalization on a dataset and visualize its transformation

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

def z_score_normalize(data):
  """
  Applies Z-score normalization to a dataset.

  Args:
    data: A list or NumPy array of numerical data.

  Returns:
    A NumPy array of the Z-score normalized data.
  """
  mean = np.mean(data)
  std_dev = np.std(data)

  if std_dev == 0:
    # Handle case where standard deviation is zero (all data points are the same)
    return np.zeros_like(data)
  else:
    return (data - mean) / std_dev

# Example usage:
# Create a sample dataset that is not normalized
unnormalized_data = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50])

# Apply Z-score normalization
normalized_data = z_score_normalize(unnormalized_data)

print(f"Original Data: {unnormalized_data}")
print(f"Z-score Normalized Data: {normalized_data}")

# Verify the mean and standard deviation of the normalized data
mean_normalized = np.mean(normalized_data)
std_dev_normalized = np.std(normalized_data)
print(f"Mean of Normalized Data: {mean_normalized:.4f}")
print(f"Standard Deviation of Normalized Data: {std_dev_normalized:.4f}")
# The mean should be very close to 0 and standard deviation very close to 1.

# Visualize the transformation
plt.figure(figsize=(12, 6))

# Plot original data histogram
plt.subplot(1, 2, 1)
plt.hist(unnormalized_data, bins=5, density=True, color='skyblue', edgecolor='black')
plt.xlabel('Original Value')
plt.ylabel('Density')
plt.title('Distribution of Original Data')
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Plot normalized data histogram
plt.subplot(1, 2, 2)
plt.hist(normalized_data, bins=5, density=True, color='lightgreen', edgecolor='black')

# Overlay the standard normal distribution PDF for comparison
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, 0, 1) # Standard normal distribution PDF
plt.plot(x, p, 'k', linewidth=2, label='Standard Normal PDF')

plt.xlabel('Normalized Value (Z-score)')
plt.ylabel('Density')
plt.title('Distribution of Z-score Normalized Data')
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

In [None]:
#24 Write a Python function to calculate the skewness and kurtosis of a dataset generated from a normal distribution

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from scipy.stats import norm
def calculate_skewness_kurtosis(data):
  """
  Calculates the skewness and kurtosis of a dataset.

  Args:
    data: A list or NumPy array of numerical data.

  Returns:
    A tuple containing the skewness and kurtosis of the dataset.
  """
  skewness = stats.skew(data)
  kurtosis = stats.kurtosis(data) # By default, returns excess kurtosis (kurtosis - 3)
  return skewness, kurtosis

# Example usage:
# Assuming 'normal_samples' was generated in cell 8
# This data is generated from a normal distribution.
dataset = normal_samples

skewness, kurtosis = calculate_skewness_kurtosis(dataset)

print(f"Dataset from Normal Distribution (first 10 values): {dataset[:10]}...")
print(f"Number of data points: {len(dataset)}")
print(f"Skewness: {skewness:.4f}")
print(f"Kurtosis (Excess): {kurtosis:.4f}")

# Interpretation for a normal distribution:
print("\nInterpretation:")
print("For a perfectly normal distribution:")
print("- Skewness should be close to 0 (indicating symmetry).")
print("- Kurtosis (Excess) should be close to 0 (indicating mesokurtic distribution).")
print("\nOur calculated values for the sample from a normal distribution:")
print(f"- Skewness: {skewness:.4f} (a value near 0 is consistent with a symmetric distribution).")
print(f"- Excess kurtosis: {kurtosis:.4f} (a value near 0 is consistent with normal-like tails).")
print("Small deviations from 0 are expected due to sampling variability.")

# You can also visualize the distribution to intuitively see the skewness and kurtosis
plt.figure(figsize=(10, 6))
plt.hist(dataset, bins=50, density=True, color='lightblue', edgecolor='black', alpha=0.7, label='Sample Histogram')

# Plot the theoretical normal distribution PDF for comparison
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
mean_data = np.mean(dataset)
std_dev_data = np.std(dataset)
p = norm.pdf(x, mean_data, std_dev_data)
plt.plot(x, p, 'k', linewidth=2, label='Fitted Normal Distribution PDF')

plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Histogram of Sample and Fitted Normal Distribution')
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
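
To make the two statistics less of a black box, they can be written out from central moments: skewness is m₃ / m₂^(3/2) and excess kurtosis is m₄ / m₂² - 3, where m_k is the k-th central moment divided by n. This is the biased, moment-based definition; the helper names and the two small datasets below are made up for illustration.

```python
from statistics import fmean

def central_moment(data, order):
    """Central moment of the given order, dividing by n (biased estimator)."""
    mean = fmean(data)
    return fmean([(x - mean) ** order for x in data])

def skewness(data):
    """Moment-based skewness: m3 / m2^(3/2)."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Moment-based excess kurtosis: m4 / m2^2 - 3."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]        # made-up symmetric data
right_skewed = [1, 1, 1, 2, 10]    # made-up data with a long right tail

for name, values in (("symmetric", symmetric), ("right_skewed", right_skewed)):
    print(f"{name}: skewness={skewness(values):.4f}, excess kurtosis={excess_kurtosis(values):.4f}")
```

A symmetric dataset gives zero skewness, while the long right tail in the second dataset pushes it positive.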


#Practical Part - 2

In [None]:
#1 Write a Python program to perform a Z-test for comparing a sample mean to a known population mean and interpret the results

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

# Define the known population parameters
population_mean = 70      # μ₀ (Hypothesized population mean under the null hypothesis)
population_std_dev = 10   # σ (Known population standard deviation) - This is crucial for a Z-test

# Define the sample data
sample_data = np.array([68, 72, 75, 65, 71, 73, 69, 70, 74, 67, 76, 66, 70, 72, 71]) # Example sample data

# Calculate sample statistics
sample_mean = np.mean(sample_data)
sample_size = len(sample_data)

print(f"Population Mean (H0): {population_mean}")
print(f"Population Standard Deviation: {population_std_dev}")
print(f"Sample Data: {sample_data}")
print(f"Sample Mean: {sample_mean:.4f}")
print(f"Sample Size: {sample_size}")

# Formulate the hypotheses
# H₀: The population mean equals the hypothesized value (μ = μ₀)
# Hₐ: The population mean differs from the hypothesized value (μ ≠ μ₀) (two-tailed test)

# Calculate the Z-statistic manually
# Z = (sample_mean - population_mean) / (population_std_dev / sqrt(sample_size))
standard_error = population_std_dev / np.sqrt(sample_size)
z_statistic_manual = (sample_mean - population_mean) / standard_error

print(f"\nCalculated Z-statistic (Manual): {z_statistic_manual:.4f}")

# Note: `ztest` in `statsmodels.stats.weightstats` estimates the standard deviation from the sample,
# so it suits large samples with an unknown population standard deviation.
# Here σ is known, so the manual calculation above matches the Z-test definition directly.

# Find the p-value for the Z-statistic
# For a two-tailed test, the p-value is 2 * P(Z > |z_statistic|)
p_value = 2 * (1 - norm.cdf(abs(z_statistic_manual)))

print(f"P-value (from Z-statistic): {p_value:.4f}")

# Define the significance level (alpha)
alpha = 0.05

print(f"Significance Level (alpha): {alpha}")

# Make a decision based on the p-value
if p_value < alpha:
  print("\nDecision: Reject the null hypothesis (H₀).")
  print(f"Conclusion: There is sufficient statistical evidence at the {alpha} significance level to conclude that the true population mean is significantly different from the hypothesized mean ({population_mean}).")
else:
  print("\nDecision: Fail to reject the null hypothesis (H₀).")
  print(f"Conclusion: There is not enough statistical evidence at the {alpha} significance level to conclude that the true population mean is significantly different from the hypothesized mean ({population_mean}).")

# Interpretation in context
print("\nInterpretation:")
if p_value < alpha:
  print(f"Our observed sample mean ({sample_mean:.4f}) is statistically different from the hypothesized population mean of {population_mean}.")
  print(f"The probability of observing a sample mean as extreme as or more extreme than {sample_mean:.4f}, assuming the true population mean is {population_mean}, is very small ({p_value:.4f}).")
else:
  print(f"Our observed sample mean ({sample_mean:.4f}) is not statistically different from the hypothesized population mean of {population_mean}.")
  print(f"The probability of observing a sample mean as extreme as or more extreme than {sample_mean:.4f}, assuming the true population mean is {population_mean}, is {p_value:.4f}.")
  print("This value is greater than our significance level, so we do not have enough evidence to reject the initial assumption.")

# Optional: Visualize the Z-test
# Plot the standard normal distribution
plt.figure(figsize=(10, 6))
x = np.linspace(-4, 4, 100)
plt.plot(x, norm.pdf(x, 0, 1), label='Standard Normal Distribution (μ=0, σ=1)', color='blue')

# Mark the calculated Z-statistic
plt.axvline(z_statistic_manual, color='red', linestyle='dashed', linewidth=2, label=f'Z-statistic = {z_statistic_manual:.4f}')

# Mark the critical Z-values for a two-tailed test at alpha
z_critical_lower = norm.ppf(alpha/2)
z_critical_upper = norm.ppf(1 - alpha/2)
plt.axvline(z_critical_lower, color='green', linestyle='dotted', linewidth=2, label=f'Critical Z-values (α={alpha})')
plt.axvline(z_critical_upper, color='green', linestyle='dotted', linewidth=2)

# Shade the rejection regions
x_reject_lower = np.linspace(-4, z_critical_lower, 50)
plt.fill_between(x_reject_lower, norm.pdf(x_reject_lower, 0, 1), color='green', alpha=0.2, label='Rejection Region')
x_reject_upper = np.linspace(z_critical_upper, 4, 50)
plt.fill_between(x_reject_upper, norm.pdf(x_reject_upper, 0, 1), color='green', alpha=0.2)

plt.xlabel('Z-score')
plt.ylabel('Density')
plt.title('Z-Test for One Sample Mean')
plt.legend()
plt.grid(True)
plt.show()
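
The calculation in this cell can be condensed into a small reusable function. The sketch below is one possible packaging, using only the standard library; `one_sample_z_test` is a hypothetical name, and the data, μ₀, and σ are the same values used in the cell above.

```python
from math import sqrt
from statistics import NormalDist, fmean

def one_sample_z_test(sample, mu0, sigma):
    """Two-tailed one-sample Z-test with known population standard deviation.

    Returns (z_statistic, p_value).
    """
    n = len(sample)
    z = (fmean(sample) - mu0) / (sigma / sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

scores = [68, 72, 75, 65, 71, 73, 69, 70, 74, 67, 76, 66, 70, 72, 71]
z_stat, p_val = one_sample_z_test(scores, mu0=70, sigma=10)
print(f"Z-statistic: {z_stat:.4f}, p-value: {p_val:.4f}")
```

With these inputs the sample mean is 70.6, less than a quarter of a standard error from 70, so the test does not reject H₀ at α = 0.05.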


In [None]:
#2 Simulate random data to perform hypothesis testing and calculate the corresponding P-value using Python

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm
from statsmodels.stats.weightstats import ztest
# Simulate a sample from a normal population. The true mean (62) is set slightly above
# the hypothesized mean (60) so the test has an effect to detect; the seed makes the run reproducible.
rng = np.random.default_rng(42)
sample_data = rng.normal(loc=62, scale=8, size=50)

# Define the null hypothesis (H0): The population mean is equal to a specific value (e.g., 60)
# Define the alternative hypothesis (Ha): The population mean is not equal to the specific value (two-tailed test)
null_hypothesis_mean = 60

# Perform the Z-test
# ztest (from statsmodels.stats.weightstats) returns the Z-statistic and the p-value.
# It estimates the standard deviation from the sample, so it is a large-sample approximation;
# when σ is known, the manual calculation is more direct, and for small samples with unknown σ a t-test is preferable.
z_statistic, p_value = ztest(sample_data, value=null_hypothesis_mean)

# Define the significance level (alpha)
alpha = 0.05

print(f"Sample Data: {sample_data}")
print(f"Null Hypothesis (H0): Population Mean = {null_hypothesis_mean}")
print(f"Z-statistic: {z_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Significance Level (alpha): {alpha}")

# Make a decision based on the p-value and significance level
if p_value < alpha:
  print("\nDecision: Reject the null hypothesis.")
  print(f"Conclusion: There is sufficient evidence to suggest that the population mean is significantly different from {null_hypothesis_mean} at the {alpha} significance level.")
else:
  print("\nDecision: Fail to reject the null hypothesis.")
  print(f"Conclusion: There is not enough evidence to suggest that the population mean is significantly different from {null_hypothesis_mean} at the {alpha} significance level.")

# The p-value measures the strength of evidence against the null hypothesis:
# the smaller it is, the stronger the evidence against H0.

# Optional: Visualize the Z-test.
# Note: `ztest` uses the sample standard deviation, so the Z-statistic can differ slightly
# from a Z-test in which the population standard deviation is known.
# The interpretation relative to the standard normal distribution is the same.

# Plot the standard normal distribution
plt.figure(figsize=(10, 6))
x = np.linspace(-4, 4, 100)
plt.plot(x, norm.pdf(x, 0, 1), label='Standard Normal Distribution (μ=0, σ=1)', color='blue')

# Mark the calculated Z-statistic from ztest
plt.axvline(z_statistic, color='red', linestyle='dashed', linewidth=2, label=f'Z-statistic = {z_statistic:.4f}')

# Mark the critical Z-values for a two-tailed test at alpha
z_critical_lower = norm.ppf(alpha/2)
z_critical_upper = norm.ppf(1 - alpha/2)
plt.axvline(z_critical_lower, color='green', linestyle='dotted', linewidth=2, label=f'Critical Z-values (α={alpha})')
plt.axvline(z_critical_upper, color='green', linestyle='dotted', linewidth=2)

# Shade the rejection regions
x_reject_lower = np.linspace(-4, z_critical_lower, 50)
plt.fill_between(x_reject_lower, norm.pdf(x_reject_lower, 0, 1), color='green', alpha=0.2, label='Rejection Region')
x_reject_upper = np.linspace(z_critical_upper, 4, 50)
plt.fill_between(x_reject_upper, norm.pdf(x_reject_upper, 0, 1), color='green', alpha=0.2)

plt.xlabel('Z-score')
plt.ylabel('Density')
plt.title('Z-Test for One Sample Mean')
plt.legend()
plt.grid(True)
plt.show()



In [None]:
#3 Implement a one-sample Z-test using Python to compare the sample mean with the population mean

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm
# Define the sample data (replace with your actual sample data)
# Example: A sample of 10 observations
sample_data = np.array([62, 65, 68, 70, 73, 75, 78, 80, 83, 85])

# Define the known or hypothesized population mean under the null hypothesis
population_mean_h0 = 70      # μ₀

# Define the known population standard deviation
# This is required for a Z-test. If unknown and sample size is small, use a T-test.
population_std_dev = 8      # σ

# Calculate sample statistics
sample_mean = np.mean(sample_data)
sample_size = len(sample_data)

print(f"Sample Data: {sample_data}")
print(f"Hypothesized Population Mean (H0): {population_mean_h0}")
print(f"Known Population Standard Deviation: {population_std_dev}")
print(f"Sample Mean: {sample_mean:.4f}")
print(f"Sample Size: {sample_size}")

# Calculate the Z-statistic
# Z = (sample_mean - population_mean_h0) / (population_std_dev / sqrt(sample_size))
standard_error = population_std_dev / np.sqrt(sample_size)
z_statistic = (sample_mean - population_mean_h0) / standard_error

print(f"\nCalculated Z-statistic: {z_statistic:.4f}")

# Calculate the P-value
# For a two-tailed test (Ha: μ ≠ μ₀), the p-value is the probability of observing a Z-statistic
# as extreme as or more extreme than the calculated Z-statistic in either tail of the standard normal distribution.
# p-value = 2 * P(Z > |z_statistic|)
p_value = 2 * (1 - norm.cdf(abs(z_statistic)))

print(f"P-value (two-tailed): {p_value:.4f}")

# Define the significance level (alpha)
alpha = 0.05

print(f"Significance Level (alpha): {alpha}")

# Make a decision based on the P-value
if p_value < alpha:
  print("\nDecision: Reject the null hypothesis (H₀).")
  print(f"Conclusion: There is sufficient statistical evidence at the {alpha} significance level to conclude that the true population mean is significantly different from the hypothesized mean ({population_mean_h0}).")
else:
  print("\nDecision: Fail to reject the null hypothesis (H₀).")
  print(f"Conclusion: There is not enough statistical evidence at the {alpha} significance level to conclude that the true population mean is significantly different from the hypothesized mean ({population_mean_h0}).")

# Interpretation
print("\nInterpretation:")
print(f"The Z-statistic of {z_statistic:.4f} indicates that our sample mean of {sample_mean:.4f} is {abs(z_statistic):.2f} standard errors away from the hypothesized population mean of {population_mean_h0}.")
if p_value < alpha:
  print(f"The P-value of {p_value:.4f} is less than the significance level ({alpha}). This means there is a low probability ({p_value*100:.2f}%) of observing a sample mean as extreme as {sample_mean:.4f} if the true population mean were actually {population_mean_h0}.")
  print("Therefore, we reject the null hypothesis and conclude there is a significant difference.")
else:
  print(f"The P-value of {p_value:.4f} is not less than the significance level ({alpha}). A sample mean at least as extreme as {sample_mean:.4f} would occur with probability {p_value*100:.2f}% if the true population mean were actually {population_mean_h0}, which is not rare enough to reject H₀.")
  print("Therefore, we fail to reject the null hypothesis. We do not have enough evidence to claim a significant difference.")

# Optional: Visualize the Z-test
# Plot the standard normal distribution
plt.figure(figsize=(10, 6))
x = np.linspace(-4, 4, 100)
plt.plot(x, norm.pdf(x, 0, 1), label='Standard Normal Distribution (μ=0, σ=1)', color='blue')

# Mark the calculated Z-statistic
plt.axvline(z_statistic, color='red', linestyle='dashed', linewidth=2, label=f'Z-statistic = {z_statistic:.4f}')

# Mark the critical Z-values for a two-tailed test at alpha
z_critical_lower = norm.ppf(alpha/2)
z_critical_upper = norm.ppf(1 - alpha/2)
plt.axvline(z_critical_lower, color='green', linestyle='dotted', linewidth=2, label=f'Critical Z-values (α={alpha})')
plt.axvline(z_critical_upper, color='green', linestyle='dotted', linewidth=2)

# Shade the rejection regions
x_reject_lower = np.linspace(-4, z_critical_lower, 50)
plt.fill_between(x_reject_lower, norm.pdf(x_reject_lower, 0, 1), color='green', alpha=0.2, label='Rejection Region')
x_reject_upper = np.linspace(z_critical_upper, 4, 50)
plt.fill_between(x_reject_upper, norm.pdf(x_reject_upper, 0, 1), color='green', alpha=0.2)

plt.xlabel('Z-score')
plt.ylabel('Density')
plt.title('One-Sample Z-Test Visualization')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
#4 Perform a two-tailed Z-test using Python and visualize the decision region on a plot

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm
# Define parameters for the Z-test
mu = 75  # Hypothesized population mean (H0)
sigma = 12 # Known population standard deviation (required for Z-test)
sample_size = 100 # Size of the sample
# Simulate a sample mean - let's make it slightly different from the hypothesized mean
sample_mean = np.random.normal(loc=77, scale=sigma/np.sqrt(sample_size), size=1)[0]

# Calculate the Z-statistic
z_statistic = (sample_mean - mu) / (sigma / np.sqrt(sample_size))

# Define the significance level (alpha) for a two-tailed test
alpha = 0.05

# Find the critical Z-values for the decision region
# For a two-tailed test, the critical values are Z(α/2) and Z(1 - α/2)
z_critical_lower = norm.ppf(alpha / 2)
z_critical_upper = norm.ppf(1 - alpha / 2)

print(f"Hypothesized Population Mean (μ₀): {mu}")
print(f"Known Population Standard Deviation (σ): {sigma}")
print(f"Sample Size (n): {sample_size}")
print(f"Simulated Sample Mean (x̄): {sample_mean:.4f}")
print(f"Calculated Z-statistic: {z_statistic:.4f}")
print(f"Significance Level (α): {alpha}")
print(f"Critical Z-values: ({z_critical_lower:.4f}, {z_critical_upper:.4f})")

# Calculate the P-value (for interpretation)
p_value = 2 * (1 - norm.cdf(abs(z_statistic)))
print(f"P-value (two-tailed): {p_value:.4f}")

# Make a decision based on the Z-statistic and critical values
print("\nDecision:")
if z_statistic < z_critical_lower or z_statistic > z_critical_upper:
  print(f"The calculated Z-statistic ({z_statistic:.4f}) falls within the rejection region.")
  print("Reject the null hypothesis (H₀).")
  print(f"Conclusion: There is sufficient evidence at the {alpha} significance level to conclude that the true population mean is significantly different from {mu}.")
else:
  print(f"The calculated Z-statistic ({z_statistic:.4f}) falls outside the rejection region.")
  print("Fail to reject the null hypothesis (H₀).")
  print(f"Conclusion: There is not enough evidence at the {alpha} significance level to conclude that the true population mean is significantly different from {mu}.")

# Visualize the decision region on a plot (Standard Normal Distribution)
plt.figure(figsize=(10, 6))

# Plot the standard normal distribution PDF
x = np.linspace(-4, 4, 200)
pdf_values = norm.pdf(x, 0, 1)
plt.plot(x, pdf_values, label='Standard Normal Distribution (μ=0, σ=1)', color='blue')

# Mark the critical Z-values and shade the rejection region
x_reject_lower = np.linspace(-4, z_critical_lower, 50)
plt.fill_between(x_reject_lower, norm.pdf(x_reject_lower, 0, 1), color='red', alpha=0.3, label=f'Rejection Region (α/2)')

x_reject_upper = np.linspace(z_critical_upper, 4, 50)
plt.fill_between(x_reject_upper, norm.pdf(x_reject_upper, 0, 1), color='red', alpha=0.3)

plt.axvline(z_critical_lower, color='red', linestyle='--', linewidth=1.5)
plt.axvline(z_critical_upper, color='red', linestyle='--', linewidth=1.5)

# Mark the calculated Z-statistic
plt.axvline(z_statistic, color='green', linestyle='-', linewidth=2, label=f'Calculated Z-statistic = {z_statistic:.4f}')

# Mark the non-rejection region
x_non_reject = np.linspace(z_critical_lower, z_critical_upper, 100)
plt.fill_between(x_non_reject, norm.pdf(x_non_reject, 0, 1), color='green', alpha=0.1, label='Non-Rejection Region')

plt.xlabel('Z-score')
plt.ylabel('Density')
plt.title(f'Two-tailed Z-Test: Decision Region (α={alpha})')
plt.legend()
plt.grid(True)
plt.show()
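
The cell above decides with critical values but also prints a p-value. For a two-tailed Z-test the two rules are equivalent: |z| exceeds the critical value exactly when the p-value falls below α. The sketch below checks this on a handful of z-values; the function names and the z-values are assumptions chosen for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sigma 1

def reject_by_critical_value(z, alpha):
    """Reject H0 when |z| lies beyond the two-tailed critical value."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return abs(z) > z_crit

def reject_by_p_value(z, alpha):
    """Reject H0 when the two-tailed p-value is below alpha."""
    p = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    return p < alpha

alpha = 0.05
for z in (-2.5, -1.0, 0.3, 1.9, 2.2):
    by_crit = reject_by_critical_value(z, alpha)
    by_p = reject_by_p_value(z, alpha)
    print(f"z={z:5.2f}  critical-value rule: {by_crit}  p-value rule: {by_p}")
```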


In [None]:
#5 Create a Python function that calculates and visualizes Type 1 and Type 2 errors during hypothesis testing

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm
def visualize_type1_type2_errors(mu_null, sigma, sample_size, alpha, mu_alt=None):
    """
    Calculates and visualizes Type 1 and Type 2 errors for a one-sample Z-test.

    Args:
        mu_null: The hypothesized population mean under the null hypothesis (H0).
        sigma: The known population standard deviation.
        sample_size: The size of the sample.
        alpha: The significance level (probability of Type 1 error).
        mu_alt: The true population mean under the alternative hypothesis (Ha).
                If None, only Type 1 error visualization is shown.
    """

    # Calculate the standard error of the mean (SEM)
    sem = sigma / np.sqrt(sample_size)

    # Calculate the critical Z-values for the decision rule
    # For a two-tailed test, the critical values are Z(alpha/2) and Z(1 - alpha/2)
    z_critical_lower = norm.ppf(alpha / 2)
    z_critical_upper = norm.ppf(1 - alpha / 2)

    # Convert Z-critical values back to sample means
    # Critical sample mean lower = mu_null + Z(alpha/2) * sem
    # Critical sample mean upper = mu_null + Z(1 - alpha/2) * sem
    critical_mean_lower = mu_null + z_critical_lower * sem
    critical_mean_upper = mu_null + z_critical_upper * sem

    print(f"Hypothesized Population Mean (μ₀): {mu_null}")
    print(f"Known Population Standard Deviation (σ): {sigma}")
    print(f"Sample Size (n): {sample_size}")
    print(f"Significance Level (α): {alpha}")
    print(f"Standard Error of the Mean (SEM): {sem:.4f}")
    print(f"Critical Z-values: ({z_critical_lower:.4f}, {z_critical_upper:.4f})")
    print(f"Critical Sample Means: ({critical_mean_lower:.4f}, {critical_mean_upper:.4f})")

    # --- Visualization of Type 1 Error ---
    plt.figure(figsize=(12, 7))

    # Plot the distribution under the null hypothesis (H0)
    # This distribution is for the sample mean, centered at mu_null with standard deviation SEM
    x = np.linspace(mu_null - 4 * sem, mu_null + 4 * sem, 200)
    pdf_h0 = norm.pdf(x, mu_null, sem)
    plt.plot(x, pdf_h0, label=f'Distribution under H₀ (μ={mu_null}, σ={sem:.4f})', color='blue')

    # Shade the Type 1 Error region (alpha)
    # This is the rejection region under the assumption that H0 is true.
    x_type1_lower = np.linspace(mu_null - 4 * sem, critical_mean_lower, 50)
    plt.fill_between(x_type1_lower, norm.pdf(x_type1_lower, mu_null, sem), color='red', alpha=0.3, label=f'Type 1 Error (α/2 = {alpha/2:.4f})')

    x_type1_upper = np.linspace(critical_mean_upper, mu_null + 4 * sem, 50)
    plt.fill_between(x_type1_upper, norm.pdf(x_type1_upper, mu_null, sem), color='red', alpha=0.3)

    # Mark critical values
    plt.axvline(critical_mean_lower, color='red', linestyle='--', linewidth=1.5, label=f'Critical Values')
    plt.axvline(critical_mean_upper, color='red', linestyle='--', linewidth=1.5)

    plt.xlabel('Sample Mean')
    plt.ylabel('Density')
    plt.title(f'Type 1 Error (α) Visualization (Distribution under H₀)')
    plt.legend()
    plt.grid(True)
    plt.show()

    # --- Visualization of Type 2 Error (if mu_alt is provided) ---
    if mu_alt is not None:
        print(f"\nTrue Population Mean (under Hₐ): {mu_alt}")

        # Plot the distribution under the alternative hypothesis (Ha)
        # This distribution is for the sample mean, centered at mu_alt with standard deviation SEM
        plt.figure(figsize=(12, 7))

        x_alt = np.linspace(min(mu_null, mu_alt) - 4 * sem, max(mu_null, mu_alt) + 4 * sem, 200)
        pdf_h0_for_overlap = norm.pdf(x_alt, mu_null, sem)
        plt.plot(x_alt, pdf_h0_for_overlap, label=f'Distribution under H₀ (μ={mu_null})', color='blue', linestyle='--')

        pdf_ha = norm.pdf(x_alt, mu_alt, sem)
        plt.plot(x_alt, pdf_ha, label=f'Distribution under Hₐ (μ={mu_alt}, σ={sem:.4f})', color='green')

        # Mark critical values again on this plot
        plt.axvline(critical_mean_lower, color='red', linestyle='--', linewidth=1.5, label=f'Critical Values ({critical_mean_lower:.4f}, {critical_mean_upper:.4f})')
        plt.axvline(critical_mean_upper, color='red', linestyle='--', linewidth=1.5)


        # Shade the Type 2 Error region (Beta)
        # This is the non-rejection region under the assumption that Ha is true.
        # It's the area of the Ha distribution that falls within the non-rejection region of H0.
        x_type2 = np.linspace(critical_mean_lower, critical_mean_upper, 100)
        plt.fill_between(x_type2, norm.pdf(x_type2, mu_alt, sem), color='orange', alpha=0.5, label='Type 2 Error (β)')

        # Calculate the probability of Type 2 Error (Beta)
        # Beta = P(Fail to reject H0 | H0 is False)
        # Beta = P(Critical Mean Lower <= Sample Mean <= Critical Mean Upper | True Mean is mu_alt)
        # Using CDF of the distribution under Ha: P(X <= critical_upper) - P(X <= critical_lower)
        beta = norm.cdf(critical_mean_upper, loc=mu_alt, scale=sem) - norm.cdf(critical_mean_lower, loc=mu_alt, scale=sem)

        print(f"Calculated Type 2 Error (β): {beta:.4f}")
        print(f"Statistical Power (1 - β): {1 - beta:.4f}")
        plt.text(np.mean([critical_mean_lower, critical_mean_upper]), pdf_ha.max() * 0.8, f'β = {beta:.4f}', ha='center', color='black', fontsize=10)

        plt.xlabel('Sample Mean')
        plt.ylabel('Density')
        plt.title(f'Type 2 Error (β) Visualization (Distribution under Hₐ = {mu_alt})')
        plt.legend()
        plt.grid(True)
        plt.show()
    else:
        print("\nmu_alt not provided, skipping Type 2 error visualization.")


# Example Usage:
# Scenario 1: Visualize Type 1 error only
visualize_type1_type2_errors(mu_null=70, sigma=10, sample_size=30, alpha=0.05)

# Scenario 2: Visualize Type 1 and Type 2 errors with a specific alternative mean
visualize_type1_type2_errors(mu_null=70, sigma=10, sample_size=30, alpha=0.05, mu_alt=75)

# Scenario 3: Another example with different parameters
visualize_type1_type2_errors(mu_null=50, sigma=5, sample_size=50, alpha=0.01, mu_alt=51)
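
The β calculation in the function above can also be run on its own, without any plotting, to see how the Type 2 error shrinks as the sample size grows. The following is a minimal sketch under the same assumptions (two-tailed one-sample Z-test, known σ); the helper name `type2_error_rate` and the list of sample sizes are hypothetical and not part of the original notebook.

```python
import numpy as np
from scipy.stats import norm


def type2_error_rate(mu_null, mu_alt, sigma, sample_size, alpha=0.05):
    """Return beta for a two-tailed one-sample Z-test with known sigma."""
    sem = sigma / np.sqrt(sample_size)
    z_crit = norm.ppf(1 - alpha / 2)
    # Non-rejection region for the sample mean, built under H0
    lower = mu_null - z_crit * sem
    upper = mu_null + z_crit * sem
    # Probability that the sample mean lands in that region when the true mean is mu_alt
    return norm.cdf(upper, loc=mu_alt, scale=sem) - norm.cdf(lower, loc=mu_alt, scale=sem)


# Illustrative values only: same mu_null, mu_alt and sigma as Scenario 2 above
sample_sizes = [10, 30, 50, 100]
betas = [type2_error_rate(70, 75, 10, n) for n in sample_sizes]
for n, beta in zip(sample_sizes, betas):
    print(f"n={n:>3}: beta={beta:.4f}, power={1 - beta:.4f}")
```

When `mu_alt` equals `mu_null`, the same formula returns 1 - α, which is a quick sanity check on the implementation.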


In [None]:
#6 Write a Python program to perform an independent T-test and interpret the results

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats  # needed for stats.ttest_ind below
# Task: Write a Python program to perform an independent T-test and interpret the results

# Assume we have two independent samples representing two groups.
# For example, test scores of students who used Method A vs. Method B.

# Sample data for Group A
group_a_scores = np.array([85, 88, 90, 82, 87, 89, 86, 84, 91, 83])

# Sample data for Group B
group_b_scores = np.array([78, 80, 83, 76, 81, 84, 79, 82, 85, 77])

print(f"Scores for Group A: {group_a_scores}")
print(f"Scores for Group B: {group_b_scores}")

# Perform the independent T-test
# Null Hypothesis (H₀): The true means of the two groups are equal (μ₁ = μ₂).
# Alternative Hypothesis (Hₐ): The true means of the two groups are not equal (μ₁ ≠ μ₂). (Two-tailed test)

# We use scipy.stats.ttest_ind for independent samples T-test.
# The function returns the T-statistic and the p-value.
# By default, it assumes equal variances (pooled standard deviation).
# If you suspect unequal variances, you can set `equal_var=False` (Welch's T-test).
t_statistic, p_value = stats.ttest_ind(group_a_scores, group_b_scores)

# Define the significance level (alpha)
alpha = 0.05

print(f"\nIndependent Samples T-Test Results:")
print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Significance Level (alpha): {alpha}")

# Make a decision based on the p-value
if p_value < alpha:
  print("\nDecision: Reject the null hypothesis (H₀).")
  print(f"Conclusion: There is sufficient statistical evidence at the {alpha} significance level to conclude that the true mean scores of Group A and Group B are significantly different.")
else:
  print("\nDecision: Fail to reject the null hypothesis (H₀).")
  print(f"Conclusion: There is not enough statistical evidence at the {alpha} significance level to conclude that the true mean scores of Group A and Group B are significantly different.")

# Interpretation in context
print("\nInterpretation:")
mean_a = np.mean(group_a_scores)
mean_b = np.mean(group_b_scores)
print(f"Mean score for Group A: {mean_a:.2f}")
print(f"Mean score for Group B: {mean_b:.2f}")
print(f"Difference in Sample Means: {mean_a - mean_b:.2f}")

print(f"\nThe T-statistic ({t_statistic:.4f}) measures the difference between the two group means relative to the variability within the groups.")
print(f"The P-value ({p_value:.4f}) tells us the probability of observing a T-statistic as extreme as or more extreme than {t_statistic:.4f}, assuming the null hypothesis (that there is no true difference in means) is true.")

if p_value < alpha:
  print(f"Since the P-value ({p_value:.4f}) is less than the significance level ({alpha}), we have strong evidence against the null hypothesis.")
  print("We conclude that the observed difference in sample means is statistically significant, suggesting a real difference between the population means of the two groups.")
else:
  print(f"Since the P-value ({p_value:.4f}) is greater than the significance level ({alpha}), we do not have enough evidence to reject the null hypothesis.")
  print("The observed difference in sample means could reasonably occur by chance if the true population means were equal.")

# Optional: Visualize the distributions
plt.figure(figsize=(10, 6))
plt.hist(group_a_scores, bins=5, density=True, alpha=0.6, color='skyblue', label='Group A')
plt.hist(group_b_scores, bins=5, density=True, alpha=0.6, color='lightgreen', label='Group B')
plt.axvline(mean_a, color='blue', linestyle='dashed', linewidth=2, label=f'Mean A: {mean_a:.2f}')
plt.axvline(mean_b, color='green', linestyle='dashed', linewidth=2, label=f'Mean B: {mean_b:.2f}')
plt.xlabel('Scores')
plt.ylabel('Density')
plt.title('Distribution of Scores for Group A and Group B')
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
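
The comments above mention that `equal_var=False` switches to Welch's T-test when the variances may differ. One way to decide is to check the variances first; the sketch below does this with Levene's test from `scipy.stats` and then runs both versions of the T-test for comparison. Using Levene's test as the check is an assumption made here for illustration, not something the original program does, and the variable names are hypothetical.

```python
import numpy as np
from scipy import stats

group_a = np.array([85, 88, 90, 82, 87, 89, 86, 84, 91, 83])
group_b = np.array([78, 80, 83, 76, 81, 84, 79, 82, 85, 77])

# Levene's test: H0 is that both groups have equal variances
levene_stat, levene_p = stats.levene(group_a, group_b)
print(f"Levene statistic: {levene_stat:.4f}, p-value: {levene_p:.4f}")

# Pooled-variance (Student) T-test and Welch's T-test on the same data
pooled_t, pooled_p = stats.ttest_ind(group_a, group_b, equal_var=True)
welch_t, welch_p = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"Pooled T-test: t={pooled_t:.4f}, p={pooled_p:.6f}")
print(f"Welch T-test:  t={welch_t:.4f}, p={welch_p:.6f}")
```

With equal group sizes and similar spreads, as in this data, the two tests should agree closely; they tend to diverge when the groups differ in both size and variance.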


In [None]:
#7  Perform a paired sample T-test using Python and visualize the comparison results

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats  # needed for stats.ttest_rel below
# Task: Perform a paired sample T-test using Python and visualize the comparison results

# A paired sample T-test is used when the observations in the two samples are related
# or dependent. This typically occurs when the same subjects are measured twice
# (e.g., before and after an intervention) or when comparing paired individuals
# (e.g., matched pairs in a study).

# Assume we have data from 15 individuals on their performance score before and after
# participating in a training program.

# Performance scores Before the training
scores_before = np.array([75, 80, 88, 72, 79, 85, 76, 81, 90, 78, 82, 87, 74, 83, 89])

# Performance scores After the training for the same individuals
scores_after = np.array([78, 83, 92, 75, 82, 88, 79, 84, 94, 81, 85, 90, 77, 86, 91])

print(f"Scores Before Training: {scores_before}")
print(f"Scores After Training:  {scores_after}")

# Perform the paired sample T-test
# Null Hypothesis (H₀): The true mean difference between the paired observations is zero (μ_diff = 0).
# Alternative Hypothesis (Hₐ): The true mean difference is not zero (μ_diff ≠ 0). (Two-tailed test)

# We use scipy.stats.ttest_rel for paired (related) samples T-test.
# The function calculates the differences between the pairs and performs a one-sample T-test on the differences.
# It returns the T-statistic and the p-value.
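# Pass the "after" scores first so the sign of the T-statistic matches the (After - Before) difference reported below.
t_statistic_paired, p_value_paired = stats.ttest_rel(scores_after, scores_before)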

# Define the significance level (alpha)
alpha = 0.05

print(f"\nPaired Samples T-Test Results:")
print(f"T-statistic: {t_statistic_paired:.4f}")
print(f"P-value: {p_value_paired:.4f}")
print(f"Significance Level (alpha): {alpha}")

# Make a decision based on the p-value
if p_value_paired < alpha:
  print("\nDecision: Reject the null hypothesis (H₀).")
  print(f"Conclusion: There is sufficient statistical evidence at the {alpha} significance level to conclude that there is a significant difference between the 'Before' and 'After' scores (the training program had a significant effect).")
else:
  print("\nDecision: Fail to reject the null hypothesis (H₀).")
  print(f"Conclusion: There is not enough statistical evidence at the {alpha} significance level to conclude that there is a significant difference between the 'Before' and 'After' scores (no significant effect of the training program was detected).")

# Interpretation in context
print("\nInterpretation:")
mean_before = np.mean(scores_before)
mean_after = np.mean(scores_after)
mean_difference = np.mean(scores_after - scores_before) # Mean of the differences

print(f"Mean score Before: {mean_before:.2f}")
print(f"Mean score After:  {mean_after:.2f}")
print(f"Mean Difference (After - Before): {mean_difference:.2f}")

print(f"\nThe T-statistic ({t_statistic_paired:.4f}) measures how many standard errors the mean difference ({mean_difference:.2f}) is away from zero (the value hypothesized under H₀).")
print(f"The P-value ({p_value_paired:.4f}) is the probability of observing a mean difference as extreme as or more extreme than {mean_difference:.2f} if the true mean difference in the population were actually zero.")

if p_value_paired < alpha:
  print(f"Since the P-value ({p_value_paired:.4f}) is less than the significance level ({alpha}), we have strong evidence against the null hypothesis.")
  print("We conclude that the observed mean difference is statistically significant, suggesting that the training program led to a real change in performance.")
else:
  print(f"Since the P-value ({p_value_paired:.4f}) is greater than the significance level ({alpha}), we do not have enough evidence to reject the null hypothesis.")
  print("The observed mean difference could reasonably occur by chance if the training program had no real effect on performance.")

# Visualize the comparison results

# Option 1: Box plots to show the distribution of scores before and after
plt.figure(figsize=(8, 6))
plt.boxplot([scores_before, scores_after], labels=['Before Training', 'After Training'], patch_artist=True,
            boxprops=dict(facecolor='skyblue'),
            medianprops=dict(color='red', linewidth=2))
plt.ylabel('Performance Score')
plt.title('Comparison of Performance Scores Before and After Training (Box Plot)')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Option 2: Histogram of the differences between scores
differences = scores_after - scores_before
plt.figure(figsize=(8, 6))
plt.hist(differences, bins=5, edgecolor='black', alpha=0.7, color='lightcoral')
plt.axvline(np.mean(differences), color='red', linestyle='dashed', linewidth=2, label=f'Mean Difference: {np.mean(differences):.2f}')
plt.xlabel('Difference in Score (After - Before)')
plt.ylabel('Frequency')
plt.title('Distribution of Differences in Performance Scores')
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Option 3: Paired plot (connecting individual scores) - Useful for showing individual changes
plt.figure(figsize=(10, 6))
for i in range(len(scores_before)):
    plt.plot([1, 2], [scores_before[i], scores_after[i]], marker='o', color='gray', linestyle='-', alpha=0.5)
plt.plot([1, 2], [np.mean(scores_before), np.mean(scores_after)], marker='o', color='blue', linestyle='-', linewidth=3, label='Mean Change') # Plot mean change
plt.xticks([1, 2], ['Before Training', 'After Training'])
plt.ylabel('Performance Score')
plt.title('Individual and Mean Performance Score Changes Before vs. After Training')
plt.xlim(0.8, 2.2) # Adjust limits to space out points
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
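
The comments above state that the paired T-test is a one-sample T-test on the pairwise differences. The short sketch below checks that equivalence on the same scores, and also computes the statistic by hand from the formula t = mean(d) / (s_d / √n). The variable names `paired`, `one_sample`, and `t_manual` are introduced here for illustration only.

```python
import numpy as np
from scipy import stats

scores_before = np.array([75, 80, 88, 72, 79, 85, 76, 81, 90, 78, 82, 87, 74, 83, 89])
scores_after = np.array([78, 83, 92, 75, 82, 88, 79, 84, 94, 81, 85, 90, 77, 86, 91])

differences = scores_after - scores_before

# Paired test on the two samples vs. one-sample test on the differences
paired = stats.ttest_rel(scores_after, scores_before)
one_sample = stats.ttest_1samp(differences, popmean=0)

# Manual T-statistic: mean difference divided by its standard error
n = len(differences)
t_manual = differences.mean() / (differences.std(ddof=1) / np.sqrt(n))

print(f"ttest_rel:   t={paired.statistic:.4f}, p={paired.pvalue:.3g}")
print(f"ttest_1samp: t={one_sample.statistic:.4f}, p={one_sample.pvalue:.3g}")
print(f"manual:      t={t_manual:.4f}")
```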

In [None]:
#8 Simulate data and perform both Z-test and T-test, then compare the results using Python

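import matplotlib.pyplot as plt
import numpy as np
from scipy import stats  # needed for stats.ttest_1samp and stats.t below
from scipy.stats import norm  # needed for norm.cdf, norm.pdf and norm.ppf below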
# Task: Simulate data and perform both Z-test and T-test, then compare the results using Python

# This task involves demonstrating both Z-tests and T-tests and highlighting when each is appropriate
# and how their results might compare.

# --- Scenario Setup ---
# We will simulate data under two main scenarios to illustrate Z-test and T-test conditions:

# Scenario A: Conditions favoring a Z-test (Large sample size and/or known population standard deviation)
# Scenario B: Conditions favoring a T-test (Small sample size and unknown population standard deviation)

# Let's hypothesize a population mean (μ₀) that we will test against.
hypothesized_population_mean = 100 # H₀: μ = 100

# Let's assume a true (but often unknown in real scenarios) population standard deviation (σ).
# For Scenario A (Z-test), we'll pretend we know this value.
# For Scenario B (T-test), we'll treat it as unknown and use the sample standard deviation.
true_population_std_dev = 15

# Let's assume a true population mean (μ_true) which might be different from μ₀.
# This allows us to see how the tests perform when H₀ is false.
true_population_mean = 105 # Let's assume the true mean is slightly higher than H₀

# Define significance level
alpha = 0.05

print(f"Hypothesized Population Mean (H₀): μ₀ = {hypothesized_population_mean}")
print(f"True Population Mean (for simulation): μ_true = {true_population_mean}")
print(f"True Population Standard Deviation: σ = {true_population_std_dev}")
print(f"Significance Level: α = {alpha}")
print("-" * 30)

# --- Scenario A: Large Sample (Z-test is appropriate) ---
print("--- Scenario A: Large Sample Z-test ---")
large_sample_size = 100

# Simulate a large sample from the population with the true mean and std dev
np.random.seed(42) # for reproducibility
large_sample_data = np.random.normal(loc=true_population_mean, scale=true_population_std_dev, size=large_sample_size)

# Calculate sample statistics for the large sample
large_sample_mean = np.mean(large_sample_data)
large_sample_std = np.std(large_sample_data, ddof=1) # Use ddof=1 for sample std dev

print(f"Large Sample Size: {large_sample_size}")
print(f"Large Sample Mean: {large_sample_mean:.4f}")
print(f"Large Sample Std Dev: {large_sample_std:.4f}")

# Perform Z-test using the large sample mean, hypothesized population mean,
# and the KNOWN true population standard deviation.
# Z = (sample_mean - μ₀) / (σ / √n)
sem_large = true_population_std_dev / np.sqrt(large_sample_size)
z_statistic_large = (large_sample_mean - hypothesized_population_mean) / sem_large

# Calculate p-value for the two-tailed Z-test
p_value_z_large = 2 * (1 - norm.cdf(abs(z_statistic_large)))

print(f"\nZ-Test Results (Large Sample, Known σ):")
print(f"Calculated Z-statistic: {z_statistic_large:.4f}")
print(f"P-value: {p_value_z_large:.4f}")

# Decision for Z-test (Large Sample)
print("Decision:")
if p_value_z_large < alpha:
    print("Reject H₀: The population mean is significantly different from", hypothesized_population_mean)
else:
    print("Fail to Reject H₀: There is not enough evidence to say the population mean is different from", hypothesized_population_mean)
print("-" * 30)


# --- Scenario B: Small Sample (T-test is appropriate) ---
print("--- Scenario B: Small Sample T-test ---")
small_sample_size = 20

# Simulate a small sample from the population with the true mean and std dev
np.random.seed(43) # use a different seed
small_sample_data = np.random.normal(loc=true_population_mean, scale=true_population_std_dev, size=small_sample_size)

# Calculate sample statistics for the small sample
small_sample_mean = np.mean(small_sample_data)
small_sample_std = np.std(small_sample_data, ddof=1) # Use ddof=1 for sample std dev

print(f"Small Sample Size: {small_sample_size}")
print(f"Small Sample Mean: {small_sample_mean:.4f}")
print(f"Small Sample Std Dev: {small_sample_std:.4f}")
# Note: In a real T-test scenario, we don't know the population std dev (σ).
# We use the sample std dev (small_sample_std) as an estimate.

# Perform one-sample T-test using the small sample data and hypothesized population mean.
# This test uses the sample standard deviation and the t-distribution.
# T = (sample_mean - μ₀) / (sample_std / √n)
t_statistic_small, p_value_t_small = stats.ttest_1samp(small_sample_data, hypothesized_population_mean)

print(f"\nT-Test Results (Small Sample, Unknown σ - uses sample std dev):")
print(f"Calculated T-statistic: {t_statistic_small:.4f}")
print(f"P-value: {p_value_t_small:.4f}")
print(f"Degrees of Freedom: {small_sample_size - 1}")

# Decision for T-test (Small Sample)
print("Decision:")
if p_value_t_small < alpha:
    print("Reject H₀: The population mean is significantly different from", hypothesized_population_mean)
else:
    print("Fail to Reject H₀: There is not enough evidence to say the population mean is different from", hypothesized_population_mean)
print("-" * 30)


# --- Comparison of Results ---
print("\n--- Comparison ---")
print(f"Hypothesized Population Mean (H₀): {hypothesized_population_mean}")
print(f"True Population Mean (for simulation): {true_population_mean}")
print(f"Significance Level (α): {alpha}")
print("\nZ-Test (Large Sample, Known σ):")
print(f"  Sample Mean: {large_sample_mean:.4f}")
print(f"  Z-statistic: {z_statistic_large:.4f}")
print(f"  P-value: {p_value_z_large:.4f}")
print(f"  Decision: {'Reject H₀' if p_value_z_large < alpha else 'Fail to Reject H₀'}")

print("\nT-Test (Small Sample, Unknown σ - uses sample std dev):")
print(f"  Sample Mean: {small_sample_mean:.4f}")
print(f"  T-statistic: {t_statistic_small:.4f}")
print(f"  P-value: {p_value_t_small:.4f}")
print(f"  Decision: {'Reject H₀' if p_value_t_small < alpha else 'Fail to Reject H₀'}")


# --- Explanation of Comparison ---
print("\n--- Key Differences and Why ---")
print("1.  Assumptions:")
print("    - Z-test assumes population standard deviation (σ) is KNOWN.")
print("    - T-test assumes population standard deviation (σ) is UNKNOWN and uses the SAMPLE standard deviation as an estimate.")
print("2.  Distribution:")
print("    - Z-test uses the Standard Normal Distribution.")
print("    - T-test uses the T-distribution, which has heavier tails, especially for smaller sample sizes.")
print("3.  Sample Size:")
print("    - Z-test is appropriate for large sample sizes (n > 30 is a common rule of thumb, though depends on how well the sample std dev approximates population std dev).")
print("    - T-test is necessary for small sample sizes when σ is unknown.")
print("    - As sample size increases, the T-distribution approaches the Normal Distribution, and the T-test results converge towards the Z-test results.")

print("\nIn this simulation:")
print(f"- Our true population mean ({true_population_mean}) is indeed different from the hypothesized mean ({hypothesized_population_mean}).")
# Check if H0 was correctly rejected
if p_value_z_large < alpha:
    print(f"- The Z-test with a large sample ({large_sample_size}) correctly rejected H₀ (P-value={p_value_z_large:.4f}). With a large sample, the sample mean is a good estimate of the true mean, and the test had high power.")
else:
     print(f"- The Z-test with a large sample ({large_sample_size}) failed to reject H₀ (P-value={p_value_z_large:.4f}). This might happen due to sampling variability, even if H₀ is false, especially if the true mean is only slightly different.")

if p_value_t_small < alpha:
    print(f"- The T-test with a small sample ({small_sample_size}) also rejected H₀ (P-value={p_value_t_small:.4f}). Even with a smaller sample, the observed difference was significant enough.")
else:
    print(f"- The T-test with a small sample ({small_sample_size}) failed to reject H₀ (P-value={p_value_t_small:.4f}). This is more likely with a small sample because the test has less power to detect a difference compared to a large sample Z-test, due to higher variability in the sample mean and std dev.")

print("\nNotice that the T-test P-value might be higher (less significant) than the Z-test P-value even with similar effect sizes, because the T-distribution accounts for the additional uncertainty from estimating the standard deviation from a small sample.")

# Optional: Visualize the distributions of the test statistics
# This requires plotting the standard normal and the appropriate t-distribution
# along with the calculated statistics and critical values.

plt.figure(figsize=(12, 6))

# Plot Standard Normal Distribution (for Z-test)
x_norm = np.linspace(-4, 4, 200)
plt.plot(x_norm, norm.pdf(x_norm, 0, 1), label='Standard Normal Distribution (Z-test)', color='blue')
z_critical_lower = norm.ppf(alpha/2)
z_critical_upper = norm.ppf(1-alpha/2)
plt.axvline(z_critical_lower, color='blue', linestyle='dotted', linewidth=1.5)
plt.axvline(z_critical_upper, color='blue', linestyle='dotted', linewidth=1.5, label=f'Z Critical (α={alpha})')
plt.axvline(z_statistic_large, color='darkblue', linestyle='--', linewidth=2, label=f'Z-statistic ({z_statistic_large:.2f})')


# Plot T-Distribution (for T-test)
df = small_sample_size - 1
x_t = np.linspace(-4, 4, 200)
plt.plot(x_t, stats.t.pdf(x_t, df), label=f'T-distribution (df={df})', color='red')
t_critical_lower = stats.t.ppf(alpha/2, df)
t_critical_upper = stats.t.ppf(1-alpha/2, df)
plt.axvline(t_critical_lower, color='red', linestyle='dotted', linewidth=1.5)
plt.axvline(t_critical_upper, color='red', linestyle='dotted', linewidth=1.5, label=f'T Critical (α={alpha}, df={df})')
plt.axvline(t_statistic_small, color='darkred', linestyle='--', linewidth=2, label=f'T-statistic ({t_statistic_small:.2f})')


plt.xlabel('Test Statistic Value')
plt.ylabel('Density')
plt.title('Comparison of Z and T Distributions and Test Statistics')
plt.legend()
plt.grid(True)
plt.ylim(0, plt.ylim()[1] * 1.1) # Add some padding at the top
plt.show()

print("\nObservation from the plot:")
print("The T-distribution is flatter and wider than the Standard Normal Distribution, especially for low degrees of freedom (small sample size).")
print("This means that a T-statistic needs to be further from zero than a Z-statistic to reach the same level of statistical significance (same p-value) when the population standard deviation is unknown and estimated from the sample.")
print(f"The critical values for the T-test ({t_critical_lower:.4f}, {t_critical_upper:.4f}) are further from zero than the critical values for the Z-test ({z_critical_lower:.4f}, {z_critical_upper:.4f}).")
print("This reflects the increased uncertainty in the T-test due to estimating the population standard deviation.")
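
The claim that the T-distribution approaches the Normal Distribution as the sample size increases can be checked numerically by comparing two-tailed critical values. This is a small sketch; the list of degrees of freedom is an arbitrary choice made for illustration.

```python
from scipy import stats
from scipy.stats import norm

alpha = 0.05
z_critical = norm.ppf(1 - alpha / 2)

# Arbitrary degrees of freedom, from very small to very large samples
dfs = [5, 10, 19, 30, 100, 1000]
t_criticals = [stats.t.ppf(1 - alpha / 2, df) for df in dfs]

print(f"Z critical value: {z_critical:.4f}")
for df, t_crit in zip(dfs, t_criticals):
    print(f"df={df:>4}: T critical value = {t_crit:.4f} (gap to Z = {t_crit - z_critical:.4f})")
```

The gap between the T and Z critical values shrinks steadily as the degrees of freedom grow, which is why the two tests give nearly the same answer for large samples.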

In [None]:
#9 Write a Python function to calculate the confidence interval for a sample mean and explain its significance

import numpy as np
from scipy import stats  # needed for stats.t.ppf below
def calculate_confidence_interval(data, confidence_level=0.95):
  """
  Calculates the confidence interval for the mean of a sample.

  Uses the t-distribution for smaller sample sizes or unknown population standard deviation,
  and the z-distribution for large sample sizes (though t-distribution is generally safe).

  Args:
    data: A list or NumPy array of numerical data (the sample).
    confidence_level: The desired confidence level (e.g., 0.95 for 95%).

  Returns:
    A tuple (lower_bound, upper_bound, sample_mean, margin_of_error, confidence_level),
    or None if the data has fewer than two points.
  """
  data = np.asarray(data, dtype=float)  # accept lists and arrays; `not data` is ambiguous for NumPy arrays
  if data.size < 2:
    print("Error: Data must contain at least two points to calculate standard deviation.")
    return None

  sample_mean = np.mean(data)
  sample_std = np.std(data, ddof=1)  # Use ddof=1 for sample standard deviation
  sample_size = len(data)
  alpha = 1 - confidence_level

  # Determine the appropriate distribution (T-distribution is generally safer
  # when population std dev is unknown, regardless of sample size,
  # but Z-distribution can be used for very large samples if preferred/justified).
  # We'll use the t-distribution here as it's more universally applicable when sigma is unknown.
  degrees_of_freedom = sample_size - 1

  # Calculate the critical value from the t-distribution (for a two-tailed interval)
  # ppf is the inverse of the CDF (percent-point function or quantile function)
  t_critical = stats.t.ppf(1 - alpha / 2, degrees_of_freedom)

  # Calculate the standard error of the mean (SEM)
  standard_error = sample_std / np.sqrt(sample_size)

  # Calculate the margin of error
  margin_of_error = t_critical * standard_error

  # Calculate the confidence interval bounds
  confidence_interval_lower = sample_mean - margin_of_error
  confidence_interval_upper = sample_mean + margin_of_error

  return confidence_interval_lower, confidence_interval_upper, sample_mean, margin_of_error, confidence_level

# Significance of Confidence Interval:
# A confidence interval provides a range of plausible values for the true population parameter
# (in this case, the population mean) based on a sample.
#
# Interpretation:
# We are X% confident that the true population mean lies within the calculated interval.
#
# What it DOESN'T mean:
# - It does NOT mean that there is an X% probability that the true population mean
#   falls within this specific interval calculated from this one sample. The true mean is
#   either in the interval or it isn't.
# - It does NOT mean X% of the sample data falls within this interval.
#
# Significance and Use Cases:
# 1. Estimation: Provides a range for the unknown population mean, not just a single point estimate (sample mean).
# 2. Uncertainty Quantification: The width of the interval indicates the precision of the estimate. A wider interval suggests more uncertainty.
# 3. Hypothesis Testing (implicit): If a hypothesized population mean falls outside the confidence interval, you would reject a two-tailed null hypothesis at the corresponding significance level (α = 1 - confidence_level). If it falls inside, you fail to reject.
# 4. Comparison: Allows for comparing means of different groups or samples by looking at overlapping intervals.
# 5. Decision Making: Helps in making informed decisions by providing a range of plausible values for a key parameter.

# Example Usage:
# Assume 'sample_data' is the dataset you want to create a confidence interval for.
# This data is defined in the preceding code (from Task #16).
# sample_data = np.array([52, 55, 58, 60, 63, 65, 67, 70, 72, 75])

# If you need a new sample:
sample_data_new = np.random.normal(loc=70, scale=10, size=25) # Simulate a new sample


confidence_level_example = 0.95 # 95% confidence interval

ci_result = calculate_confidence_interval(sample_data_new, confidence_level=confidence_level_example)

if ci_result:
  lower_bound, upper_bound, sample_mean, margin_of_error, conf_level = ci_result
  print(f"Sample Data (first 10): {sample_data_new[:10]}...")
  print(f"Sample Size: {len(sample_data_new)}")
  print(f"Sample Mean: {sample_mean:.4f}")
  print(f"Confidence Level: {conf_level}")
  print(f"Margin of Error: {margin_of_error:.4f}")
  print(f"\n{conf_level*100:.0f}% Confidence Interval for the Mean: ({lower_bound:.4f}, {upper_bound:.4f})")

  print("\nSignificance and Interpretation:")
  print(f"This {conf_level*100:.0f}% confidence interval [{lower_bound:.4f}, {upper_bound:.4f}] is an estimate of the range where the true population mean is likely to be.")
  print(f"If we repeated the sampling many times, about {conf_level*100:.0f}% of the intervals built this way would contain the true population mean; that is the sense in which we are {conf_level*100:.0f}% confident in this interval.")
  print(f"The margin of error ({margin_of_error:.4f}) indicates the precision of our estimate; it's the maximum expected difference between our sample mean ({sample_mean:.4f}) and the true population mean with {conf_level*100:.0f}% confidence.")
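
The interpretation above (the confidence level describes the procedure, not one specific interval) can be illustrated with a small simulation: draw many samples from a population with a known mean, build a t-based interval from each, and count how often the interval contains the true mean. This is a sketch under assumed values; the population parameters, the number of trials, and the seed are arbitrary choices and not part of the original notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Assumed population and simulation settings (illustrative only)
true_mean, true_std = 70, 10
n, n_trials, confidence_level = 25, 2000, 0.95

# One row per simulated sample
samples = rng.normal(true_mean, true_std, size=(n_trials, n))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)

# Same critical value for every sample because n is fixed
t_critical = stats.t.ppf(1 - (1 - confidence_level) / 2, df=n - 1)

# An interval covers the true mean when |sample mean - true mean| <= margin of error
covered = np.abs(means - true_mean) <= t_critical * sems
coverage = covered.mean()
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The observed share should land close to the chosen confidence level, with some run-to-run variation that depends on the number of trials.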


In [None]:
#10 Write a Python program to calculate the margin of error for a given confidence level using sample data

import random
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.stats import binom
from statsmodels.stats.weightstats import ztest
from scipy import stats
from scipy.stats import poisson

# Task: Write a Python program to calculate the margin of error for a given confidence level using sample data

def calculate_margin_of_error(data, confidence_level=0.95):
  """
  Calculates the margin of error for a sample mean given a confidence level.

  Uses the t-distribution, which is appropriate when the population
  standard deviation is unknown (the usual case with sample data).

  Args:
    data: A list or NumPy array of numerical data (the sample).
    confidence_level: The desired confidence level (e.g., 0.95 for 95%).

  Returns:
    A tuple (margin_of_error, sample_mean, standard_error, confidence_level),
    or None if the data has fewer than two points.
  """
  data = np.asarray(data, dtype=float)  # accept lists and arrays; `not data` is ambiguous for NumPy arrays
  if data.size < 2:
    print("Error: Data must contain at least two points to calculate standard deviation.")
    return None

  sample_mean = np.mean(data)
  sample_std = np.std(data, ddof=1)  # Use ddof=1 for sample standard deviation (unbiased estimator)
  sample_size = len(data)
  alpha = 1 - confidence_level

  # Determine the degrees of freedom for the t-distribution
  degrees_of_freedom = sample_size - 1

  # Find the critical t-value for a two-tailed interval
  # stats.t.ppf(q, df) gives the quantile function (inverse of CDF)
  t_critical = stats.t.ppf(1 - alpha / 2, degrees_of_freedom)

  # Calculate the standard error of the mean (SEM)
  standard_error = sample_std / np.sqrt(sample_size)

  # Calculate the margin of error
  margin_of_error = t_critical * standard_error

  return margin_of_error, sample_mean, standard_error, confidence_level

# Example Usage:
# Assume 'sample_data' is a dataset you have.
# A small example sample of 10 values, as used in earlier tasks:
sample_data_example = np.array([52, 55, 58, 60, 63, 65, 67, 70, 72, 75])

# Or simulate some new sample data
# sample_data_example = np.random.normal(loc=70, scale=10, size=30) # Simulate a sample

desired_confidence_level = 0.95 # 95% confidence

moe_result = calculate_margin_of_error(sample_data_example, confidence_level=desired_confidence_level)

if moe_result is not None:
  margin_of_error, sample_mean, standard_error, conf_level = moe_result
  print(f"Sample Data (first 10): {sample_data_example[:10]}...")
  print(f"Sample Size: {len(sample_data_example)}")
  print(f"Sample Mean: {sample_mean:.4f}")
  print(f"Confidence Level: {conf_level}")
  print(f"Standard Error of the Mean (SEM): {standard_error:.4f}")
  print(f"Calculated Margin of Error: {margin_of_error:.4f}")

  # Calculate the corresponding confidence interval (optional, but often useful with MOE)
  confidence_interval_lower = sample_mean - margin_of_error
  confidence_interval_upper = sample_mean + margin_of_error
  print(f"\n{conf_level*100:.0f}% Confidence Interval: ({confidence_interval_lower:.4f}, {confidence_interval_upper:.4f})")

  print("\nInterpretation of Margin of Error:")
  print(f"The margin of error ({margin_of_error:.4f}) tells us the maximum likely difference between our sample mean ({sample_mean:.4f}) and the true population mean for the given confidence level ({conf_level*100:.0f}%).")
  print(f"With {conf_level*100:.0f}% confidence, we estimate that the true population mean is within {margin_of_error:.4f} units of our sample mean.")
  print("A smaller margin of error indicates a more precise estimate of the population mean.")
  print("Factors that influence the margin of error:")
  print("- Sample Size: Larger sample size decreases the standard error, thus decreasing the margin of error.")
  print("- Standard Deviation: Higher variability (larger standard deviation) in the data increases the standard error, thus increasing the margin of error.")
  print("- Confidence Level: A higher confidence level requires a larger critical value, thus increasing the margin of error.")
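
The sample-size factor listed above can be turned around for planning: given a target margin of error, roughly how many observations are needed? The following is a minimal sketch, assuming a normal (z) approximation and a guessed population standard deviation. The names `required_sample_size`, `assumed_std`, and `target_moe` are hypothetical and the values are illustrative only; none of them come from the code above.

```python
import math
from statistics import NormalDist


def required_sample_size(assumed_std, target_moe, confidence_level=0.95):
    """Smallest n with z * assumed_std / sqrt(n) <= target_moe (z approximation)."""
    alpha = 1 - confidence_level
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical z-value
    return math.ceil((z_critical * assumed_std / target_moe) ** 2)


# Illustrative values only (assumptions, not taken from the sample above)
assumed_std = 7.5
target_moe = 2.0

n_needed = required_sample_size(assumed_std, target_moe)
print(f"Sample size needed for a margin of error of {target_moe}: {n_needed}")

# Halving the target margin of error roughly quadruples the required sample size
n_half = required_sample_size(assumed_std, target_moe / 2)
print(f"Sample size needed for a margin of error of {target_moe / 2}: {n_half}")
```

Because the t critical value is somewhat larger than z for small samples, this approximation tends to slightly understate the sample size that a t-based interval would need.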

In [None]:
#11 Implement a Bayesian inference method using Bayes' Theorem in Python and explain the process

import numpy as np  # Only np.isclose is needed in this example

def bayes_theorem_example(prior_A, prior_B, likelihood_obs_given_A, likelihood_obs_given_B):
    """
    Implements Bayes' Theorem to calculate the posterior probability of events.

    Bayes' Theorem: P(A|Obs) = [P(Obs|A) * P(A)] / P(Obs)
    Where P(Obs) = P(Obs|A) * P(A) + P(Obs|B) * P(B) (Law of Total Probability)

    This function demonstrates a simple scenario with two mutually exclusive
    and exhaustive events (A and B) and a single observation.

    Args:
        prior_A: The prior probability of event A occurring, P(A).
        prior_B: The prior probability of event B occurring, P(B).
                 Assumes A and B are mutually exclusive and exhaustive, so prior_A + prior_B should be 1.
        likelihood_obs_given_A: The likelihood of the observation given event A, P(Obs|A).
        likelihood_obs_given_B: The likelihood of the observation given event B, P(Obs|B).

    Returns:
        The posterior probability of event A given the observation, P(A|Obs).
        Returns None if the prior probabilities do not sum to 1, or if the
        observation has zero probability under both events.
    """
    if not np.isclose(prior_A + prior_B, 1.0):
        print("Error: Prior probabilities must sum to 1 for mutually exclusive and exhaustive events.")
        return None

    # Calculate the probability of the observation (Evidence) using the Law of Total Probability
    # P(Obs) = P(Obs|A) * P(A) + P(Obs|B) * P(B)
    probability_observation = (likelihood_obs_given_A * prior_A) + (likelihood_obs_given_B * prior_B)

    # Handle the case where the observation is impossible under both events (P(Obs) is 0):
    # the posterior is undefined, so signal failure with None rather than a misleading 0
    if probability_observation == 0:
        print("Error: Probability of the observation is zero. Cannot calculate posterior.")
        return None

    # Calculate the posterior probability of A given the observation using Bayes' Theorem
    # P(A|Obs) = [P(Obs|A) * P(A)] / P(Obs)
    posterior_A_given_obs = (likelihood_obs_given_A * prior_A) / probability_observation

    return posterior_A_given_obs

# --- Example Usage ---

# Scenario: Medical Testing
# Assume a disease (Event D) affects 1% of the population.
# We have a test (Observation +) with 95% sensitivity (P(+|D) = 0.95)
# and a 10% false positive rate (P(+|Not D) = 0.10).
# We want to find the probability of having the disease given a positive test result (P(D|+)).

# Let:
# Event A = Having the Disease (D)
# Event B = Not Having the Disease (Not D)
# Observation = Positive Test Result (+)

prior_D = 0.01      # P(D) - Prior probability of having the disease
prior_Not_D = 1 - prior_D # P(Not D) - Prior probability of not having the disease

likelihood_pos_given_D = 0.95 # P(+|D) - Likelihood of a positive test given the disease (Sensitivity)
likelihood_pos_given_Not_D = 0.10 # P(+|Not D) - Likelihood of a positive test given no disease (False Positive Rate)

# Now, use the function to calculate P(D|+)
# In our function mapping: A -> D, B -> Not D, Obs -> +
posterior_D_given_pos = bayes_theorem_example(
    prior_A=prior_D,
    prior_B=prior_Not_D,
    likelihood_obs_given_A=likelihood_pos_given_D,
    likelihood_obs_given_B=likelihood_pos_given_Not_D
)

print("--- Bayesian Inference Example (Medical Test) ---")
print(f"Prior Probability of Disease (P(D)): {prior_D:.4f}")
print(f"Prior Probability of No Disease (P(Not D)): {prior_Not_D:.4f}")
print(f"Likelihood of Positive Test given Disease (P(+|D)): {likelihood_pos_given_D:.4f}")
print(f"Likelihood of Positive Test given No Disease (P(+|Not D)): {likelihood_pos_given_Not_D:.4f}")

if posterior_D_given_pos is not None:
    print(f"\nPosterior Probability of Disease given Positive Test (P(D|+)): {posterior_D_given_pos:.4f}")

    # Explanation of the process:
    print("\nExplanation of the Bayesian Inference Process:")
    print("1.  Start with Prior Probabilities: We have an initial belief about the probability of the events (having or not having the disease) before any new evidence is considered. P(D) and P(Not D) are our priors.")
    print(f"    - Our initial belief is that the probability of having the disease is {prior_D*100:.2f}%.")

    print("2.  Consider the Likelihood of the Evidence: We observe new evidence (a positive test result). We need to know how likely this evidence is under each of the possible events (having or not having the disease). These are the likelihoods, P(+|D) and P(+|Not D).")
    print(f"    - The test is positive. The likelihood of this positive test is {likelihood_pos_given_D*100:.2f}% if the person has the disease, and {likelihood_pos_given_Not_D*100:.2f}% if they don't.")

    print("3.  Calculate the Probability of the Evidence (Marginal Likelihood): We need to know the overall probability of observing the evidence (a positive test), considering all possible events. This is P(+), calculated using the Law of Total Probability:")
    print("    P(+) = P(+|D) * P(D) + P(+|Not D) * P(Not D)")
    # Calculate P(Obs) separately to show this step
    probability_observation_exp = (likelihood_pos_given_D * prior_D) + (likelihood_pos_given_Not_D * prior_Not_D)
    print(f"    P(+) = ({likelihood_pos_given_D} * {prior_D}) + ({likelihood_pos_given_Not_D} * {prior_Not_D}) = {probability_observation_exp:.4f}")
    print(f"    The overall probability of getting a positive test result in this population is {probability_observation_exp:.4f}.")

    print("4.  Apply Bayes' Theorem: Now we update our initial belief (prior) based on the new evidence (positive test) and the likelihoods, to get the Posterior Probability, P(D|+).")
    print("    P(D|+) = [P(+|D) * P(D)] / P(+)")
    print(f"    P(D|+) = ({likelihood_pos_given_D} * {prior_D}) / {probability_observation_exp:.4f}")
    print(f"    P(D|+) = {posterior_D_given_pos:.4f}")

    print(f"\nResult Interpretation:")
    print(f"After observing a positive test result, the updated probability of having the disease (Posterior probability) is {posterior_D_given_pos:.4f}.")
    print(f"This is much higher than the initial prior probability of {prior_D:.4f}, but it is still far from 100%, even with a seemingly accurate test, because the disease is rare (low prior) and false positives from the much larger healthy group dominate the positive results.")

# --- Another Simple Example ---
# Scenario: Two Bags of Marbles
# Bag 1 (Event A): 70% Red, 30% Blue
# Bag 2 (Event B): 20% Red, 80% Blue
# We randomly picked a bag (assume 50/50 chance for each bag - equal priors)
# and drew a Red marble (Observation).
# What is the probability that we picked Bag 1 given we drew a Red marble? P(Bag1 | Red)?

prior_Bag1 = 0.5  # P(Bag1)
prior_Bag2 = 0.5  # P(Bag2)

likelihood_Red_given_Bag1 = 0.7 # P(Red | Bag1)
likelihood_Red_given_Bag2 = 0.2 # P(Red | Bag2)

# Calculate P(Bag1 | Red)
# In our function mapping: A -> Bag1, B -> Bag2, Obs -> Red
posterior_Bag1_given_Red = bayes_theorem_example(
    prior_A=prior_Bag1,
    prior_B=prior_Bag2,
    likelihood_obs_given_A=likelihood_Red_given_Bag1,
    likelihood_obs_given_B=likelihood_Red_given_Bag2
)

print("\n" + "=" * 40)
print("--- Bayesian Inference Example (Marble Bags) ---")
print(f"Prior Probability of Bag 1 (P(Bag1)): {prior_Bag1:.4f}")
print(f"Prior Probability of Bag 2 (P(Bag2)): {prior_Bag2:.4f}")
print(f"Likelihood of Red given Bag 1 (P(Red|Bag1)): {likelihood_Red_given_Bag1:.4f}")
print(f"Likelihood of Red given Bag 2 (P(Red|Bag2)): {likelihood_Red_given_Bag2:.4f}")

if posterior_Bag1_given_Red is not None:
    print(f"\nPosterior Probability of Bag 1 given Red Marble (P(Bag1|Red)): {posterior_Bag1_given_Red:.4f}")
    print(f"Posterior Probability of Bag 2 given Red Marble (P(Bag2|Red)): {1 - posterior_Bag1_given_Red:.4f}") # Since Bag1 and Bag2 are exhaustive

    # Explanation
    print("\nResult Interpretation:")
    print("Starting with a 50/50 chance of picking either bag, observing a Red marble (which is more likely to come from Bag 1) updates our belief.")
    print(f"The updated probability that we picked Bag 1, given we saw a Red marble, is {posterior_Bag1_given_Red:.4f}.")
    print(f"This is higher than our initial prior probability of {prior_Bag1:.4f}, demonstrating how the evidence (the Red marble) shifted our belief towards Bag 1.")
```
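
Bayesian updating can be applied repeatedly: the posterior from one observation becomes the prior for the next. The sketch below reuses the numbers from the medical-test scenario above and asks what happens after a second positive test. It assumes the two test results are conditionally independent given the true disease status, which may not hold for real tests. The helper `bayes_update` is a hypothetical name introduced only for this illustration.

```python
def bayes_update(prior, sensitivity, false_positive_rate):
    """Posterior P(D|+) after one positive test, via Bayes' Theorem."""
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)  # P(+)
    return sensitivity * prior / evidence


# Same numbers as the medical-test scenario above
prior = 0.01
sensitivity = 0.95
false_positive_rate = 0.10

# Assumption: repeated tests are conditionally independent given disease status
posterior_after_first = bayes_update(prior, sensitivity, false_positive_rate)
posterior_after_second = bayes_update(posterior_after_first, sensitivity, false_positive_rate)

print(f"P(D | one positive test):  {posterior_after_first:.4f}")
print(f"P(D | two positive tests): {posterior_after_second:.4f}")
```

Under these assumptions, the second positive result raises the probability from roughly 9% to a little under 50%, which shows how evidence accumulates even when each individual test is inconclusive.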

In [None]:
# 12 Perform a Chi-square test for independence between two categorical variables in Python

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Task: Perform a Chi-square test for independence between two categorical variables in Python

# Assume we have data on two categorical variables, for example:
# Variable 1: 'Treatment Group' (Categorical: 'Control', 'Treatment')
# Variable 2: 'Outcome' (Categorical: 'Improvement', 'No Improvement')

# The Chi-square test for independence tests the null hypothesis (H₀) that
# there is no association between the two categorical variables in the population
# versus the alternative hypothesis (Hₐ) that there is an association.

# Let's represent the data in a contingency table (or frequency table).
# Each cell in the table shows the count of observations that fall into
# a specific combination of categories for the two variables.

# Example Contingency Table (Counts):
#                | Improvement | No Improvement | Row Total
# ---------------|-------------|----------------|-----------
# Control Group  |     40      |       60       |   100
# Treatment Group|     70      |       30       |   100
# ---------------|-------------|----------------|-----------
# Column Total   |    110      |       90       |   200 (Grand Total)

# We can create this table using a NumPy array or a Pandas DataFrame.
# Using a NumPy array is straightforward for the input to chi2_contingency.
contingency_table = np.array([
    [40, 60],  # Row for Control Group
    [70, 30]   # Row for Treatment Group
])

print("Contingency Table:")
# Optional: Display with labels for clarity
contingency_df = pd.DataFrame(contingency_table,
                              index=['Control Group', 'Treatment Group'],
                              columns=['Improvement', 'No Improvement'])
print(contingency_df)
print("\n" + "="*30 + "\n")

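# Perform the Chi-square test for independence
# scipy.stats.chi2_contingency takes the contingency table as input.
# Note: for 2x2 tables (1 degree of freedom), SciPy applies Yates' continuity
# correction by default (correction=True), so the reported statistic is slightly
# smaller than the uncorrected sum of (O - E)^2 / E.
# It returns:
# 1. chi2: The Chi-square test statistic.
# 2. p: The p-value of the test.
# 3. dof: The degrees of freedom of the test.
# 4. expected: The expected frequencies, based on the assumption of independence.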
chi2_statistic, p_value, degrees_of_freedom, expected_frequencies = chi2_contingency(contingency_table)

# Define the significance level (alpha)
alpha = 0.05

print("Chi-square Test for Independence Results:")
print(f"Chi-square Statistic: {chi2_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of Freedom: {degrees_of_freedom}")
print("\nExpected Frequencies (assuming independence):")
print(pd.DataFrame(expected_frequencies, index=contingency_df.index, columns=contingency_df.columns))

# Make a decision based on the p-value
print("\n" + "="*30 + "\n")
print("Hypothesis Test Decision:")
print(f"Significance Level (alpha): {alpha}")

if p_value < alpha:
  print("\nDecision: Reject the null hypothesis (H₀).")
  print(f"Conclusion: There is sufficient statistical evidence at the {alpha} significance level to conclude that there is a significant association (dependence) between Treatment Group and Outcome.")
else:
  print("\nDecision: Fail to reject the null hypothesis (H₀).")
  print(f"Conclusion: There is not enough statistical evidence at the {alpha} significance level to conclude that there is an association (dependence) between Treatment Group and Outcome. This does not prove the variables are independent; the data are simply consistent with independence.")

# Interpretation in context
print("\nInterpretation:")
print("The Chi-square statistic measures the difference between the observed frequencies in the contingency table and the expected frequencies (what we would expect if the variables were independent).")
print("A larger Chi-square statistic indicates a larger difference between observed and expected counts.")
print(f"The P-value ({p_value:.4f}) is the probability of observing a Chi-square statistic as extreme as, or more extreme than, {chi2_statistic:.4f}, assuming the null hypothesis of independence is true.")

if p_value < alpha:
  print(f"Since the P-value ({p_value:.4f}) is less than the significance level ({alpha}), we have strong evidence against the null hypothesis.")
  print("This suggests that the distribution of 'Outcome' is significantly different across the 'Treatment Groups', indicating that the treatment likely had an effect on the outcome.")
else:
  print(f"Since the P-value ({p_value:.4f}) is greater than the significance level ({alpha}), we do not have enough evidence to reject the null hypothesis.")
  print("This suggests that any observed differences in 'Outcome' between the 'Treatment Groups' could reasonably be due to random chance, assuming the treatment had no real effect on the outcome.")

# Optional: Visualize the observed vs. expected frequencies or bar plots of proportions
# This helps to see where the differences lie.

# Calculate observed and expected proportions
observed_proportions = contingency_df.apply(lambda x: x / x.sum(), axis=1) # Row proportions
expected_table_df = pd.DataFrame(expected_frequencies, index=contingency_df.index, columns=contingency_df.columns)
expected_proportions = expected_table_df.apply(lambda x: x / x.sum(), axis=1) # Row proportions


print("\nObserved Row Proportions:")
print(observed_proportions)
print("\nExpected Row Proportions (if independent):")
print(expected_proportions)


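# Stacked bar plot of row proportions: one bar per treatment group, stacked by outcome
# (no transpose, so the x-axis and legend match the labels set below)
observed_proportions.plot(kind='bar', stacked=True, figsize=(10, 7), color=['skyblue', 'salmon'])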
plt.title('Observed Proportions of Outcome by Treatment Group')
plt.xlabel('Treatment Group')
plt.ylabel('Proportion')
plt.xticks(rotation=0)
plt.legend(title='Outcome', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

# You could also compare the observed and expected counts directly using bar plots or heatmaps of the difference.
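
To make the statistic described above concrete, the following sketch recomputes it by hand for the same 2x2 table, once without and once with Yates' continuity correction. It uses only the standard library. The function name `chi_square_statistic` is hypothetical, and the p-value shortcut `erfc(sqrt(x / 2))` is valid only for 1 degree of freedom, as in this table.

```python
import math

observed = [[40, 60], [70, 30]]  # same contingency table as above


def chi_square_statistic(table, yates=False):
    """Sum of (|O - E| - c)^2 / E over all cells, with c = 0.5 if yates else 0."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            deviation = abs(obs - expected)
            if yates:
                # Shrink each deviation by 0.5, but never below zero
                deviation = max(deviation - 0.5, 0.0)
            statistic += deviation ** 2 / expected
    return statistic


chi2_uncorrected = chi_square_statistic(observed)
chi2_yates = chi_square_statistic(observed, yates=True)

# For 1 degree of freedom only: P(X > x) = erfc(sqrt(x / 2))
p_uncorrected = math.erfc(math.sqrt(chi2_uncorrected / 2))
p_yates = math.erfc(math.sqrt(chi2_yates / 2))

print(f"Uncorrected:       chi2 = {chi2_uncorrected:.4f}, p = {p_uncorrected:.6f}")
print(f"Yates' correction: chi2 = {chi2_yates:.4f}, p = {p_yates:.6f}")
```

The corrected value is the one to compare against the default output of `chi2_contingency` for a 2x2 table; passing `correction=False` to that function should give the uncorrected value instead.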

In [None]:
# 13 Write a Python program to calculate the expected frequencies for a Chi-square test based on observed data

import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency  # Needed for the verification step below
# Task: Write a Python program to calculate the expected frequencies for a Chi-square test based on observed data

# The expected frequency for a cell in a contingency table, assuming independence
# between the row and column variables, is calculated as:
# Expected Frequency = (Row Total * Column Total) / Grand Total

# This calculation is implicitly done by the `scipy.stats.chi2_contingency` function
# and returned as the fourth element in its output tuple.

# We will reuse the contingency table defined in the preceding code snippet (Task #12).
# This table represents observed frequencies.

# Example Contingency Table (Observed Frequencies):
#                | Improvement | No Improvement | Row Total
# ---------------|-------------|----------------|-----------
# Control Group  |     40      |       60       |   100
# Treatment Group|     70      |       30       |   100
# ---------------|-------------|----------------|-----------
# Column Total   |    110      |       90       |   200 (Grand Total)

contingency_table_observed = np.array([
    [40, 60],  # Row for Control Group
    [70, 30]   # Row for Treatment Group
])

print("Observed Frequencies (Contingency Table):")
contingency_df_observed = pd.DataFrame(contingency_table_observed,
                                        index=['Control Group', 'Treatment Group'],
                                        columns=['Improvement', 'No Improvement'])
print(contingency_df_observed)
print("\n" + "="*40 + "\n")

# Calculate Expected Frequencies Manually (to demonstrate the formula)

# Get Row Totals, Column Totals, and Grand Total
row_totals = np.sum(contingency_table_observed, axis=1)
column_totals = np.sum(contingency_table_observed, axis=0)
grand_total = np.sum(contingency_table_observed)

# Create an empty array for expected frequencies
expected_frequencies_manual = np.zeros_like(contingency_table_observed, dtype=float)

# Calculate expected frequency for each cell
num_rows, num_cols = contingency_table_observed.shape

for i in range(num_rows):
  for j in range(num_cols):
    expected_frequencies_manual[i, j] = (row_totals[i] * column_totals[j]) / grand_total

print("Calculated Expected Frequencies (Manual):")
expected_df_manual = pd.DataFrame(expected_frequencies_manual,
                                   index=contingency_df_observed.index,
                                   columns=contingency_df_observed.columns)
print(expected_df_manual.round(2)) # Round for display

print("\n" + "="*40 + "\n")

# Verify with scipy.stats.chi2_contingency

# Perform the Chi-square test again just to extract the expected frequencies calculated by scipy
# We don't need the other outputs for this specific task, but they are returned by the function.
chi2_statistic, p_value, degrees_of_freedom, expected_frequencies_scipy = chi2_contingency(contingency_table_observed)

print("Expected Frequencies (Calculated by scipy.stats.chi2_contingency):")
expected_df_scipy = pd.DataFrame(expected_frequencies_scipy,
                                   index=contingency_df_observed.index,
                                   columns=contingency_df_observed.columns)
print(expected_df_scipy.round(2)) # Round for display

print("\nComparison:")
# Check if manual calculation matches scipy's output (allowing for floating point differences)
if np.allclose(expected_frequencies_manual, expected_frequencies_scipy):
    print("Manual calculation of expected frequencies matches scipy's output.")
else:
    print("Manual calculation of expected frequencies differs from scipy's output (check logic).")

# The expected frequencies tell us what the cell counts would be if the two
# categorical variables were perfectly independent in the population, given the
# observed row and column totals. The Chi-square test compares the observed
# frequencies to these expected frequencies to determine if the deviations are
# larger than what would be expected by random chance alone.
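
The nested loop above can also be written as a single vectorized expression, since the table of (row total × column total) products is the outer product of the two total vectors. The sketch below shows this with `np.outer` and adds a check of a common rule of thumb, namely that the chi-square approximation is usually considered reliable when every expected count is at least 5. The variable names are illustrative, and the threshold is a convention rather than a strict requirement.

```python
import numpy as np

observed = np.array([
    [40, 60],  # Control Group
    [70, 30],  # Treatment Group
])

# Outer product of the row totals and column totals, divided by the grand total
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
print("Expected frequencies:")
print(expected)

# Rule-of-thumb check (a convention, not a strict requirement):
# all expected counts should be at least 5 for the chi-square approximation
all_expected_large_enough = bool((expected >= 5).all())
print(f"All expected counts >= 5: {all_expected_large_enough}")
```

This form works unchanged for any r x c table, which makes it a convenient replacement for the explicit loop once the formula itself is understood.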



In [None]:
# 14 Perform a goodness-of-fit test using Python to compare the observed data to an expected distribution

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats  # Provides stats.chisquare used below
# Perform a goodness-of-fit test using Python to compare the observed data to an expected distribution

# A goodness-of-fit test (like the Chi-square goodness-of-fit test) is used to determine
# if a sample distribution matches a hypothesized theoretical distribution.

# Task: Perform a goodness-of-fit test (Chi-square) using Python to compare observed
# data (counts in categories) to expected frequencies based on a hypothesized distribution.

# Scenario: Suppose we roll a standard six-sided die 600 times.
# Hypothesized Distribution (Null Hypothesis H0): The die is fair.
#   This means each face (1, 2, 3, 4, 5, 6) is equally likely to appear.
#   The expected proportion for each face is 1/6.
# Alternative Hypothesis (Ha): The die is not fair.

# Observed Data: Let's say we recorded the following counts for each face:
observed_counts = np.array([95, 105, 100, 98, 102, 100]) # Counts for faces 1 to 6
categories = np.arange(1, 7) # The faces of the die

total_observations = np.sum(observed_counts)
print(f"Observed Counts for each face (1-6): {observed_counts}")
print(f"Total Observations: {total_observations}")

# Hypothesized Expected Distribution (under H0: Fair die)
# The expected frequency for each category is (Total Observations * Hypothesized Proportion for Category)
# For a fair die, the proportion for each face is 1/6.
hypothesized_proportions = np.array([1/6, 1/6, 1/6, 1/6, 1/6, 1/6])

# Calculate the expected frequencies
expected_frequencies = total_observations * hypothesized_proportions
print(f"\nHypothesized Proportions (Fair Die): {hypothesized_proportions}")
print(f"Expected Frequencies (under H0): {expected_frequencies}")

# Perform the Chi-square goodness-of-fit test
# scipy.stats.chisquare performs the test.
# It takes the observed counts and the expected counts as input.
# It returns the Chi-square test statistic and the p-value.
chi2_statistic_gof, p_value_gof = stats.chisquare(f_obs=observed_counts, f_exp=expected_frequencies)

# Define the significance level (alpha)
alpha = 0.05

print(f"\nChi-square Goodness-of-Fit Test Results:")
print(f"Chi-square Statistic: {chi2_statistic_gof:.4f}")
print(f"P-value: {p_value_gof:.4f}")
# For the Chi-square goodness-of-fit test, the degrees of freedom are k - 1,
# where k is the number of categories. In this case, k=6, so df = 6 - 1 = 5.
degrees_of_freedom_gof = len(categories) - 1
print(f"Degrees of Freedom: {degrees_of_freedom_gof}")

# Make a decision based on the p-value
print("\n" + "="*30 + "\n")
print("Hypothesis Test Decision:")
print(f"Significance Level (alpha): {alpha}")

if p_value_gof < alpha:
  print("\nDecision: Reject the null hypothesis (H₀).")
  print(f"Conclusion: There is sufficient statistical evidence at the {alpha} significance level to conclude that the observed distribution of die rolls is significantly different from the expected distribution of a fair die.")
  print("This suggests the die might not be fair.")
else:
  print("\nDecision: Fail to reject the null hypothesis (H₀).")
  print(f"Conclusion: There is not enough statistical evidence at the {alpha} significance level to conclude that the observed distribution of die rolls is significantly different from the expected distribution of a fair die.")
  print("The observed differences from the expected counts could reasonably be due to random chance, assuming the die is fair.")

# Interpretation in context
print("\nInterpretation:")
print("The Chi-square goodness-of-fit test compares the observed frequencies in each category to the frequencies expected under the null hypothesis.")
print("The Chi-square statistic quantifies the overall discrepancy between the observed and expected counts.")
print(f"The P-value ({p_value_gof:.4f}) is the probability of observing data that deviates from the expected distribution as much as, or more than, our observed data, assuming the hypothesized distribution (fair die) is true.")

if p_value_gof < alpha:
  print(f"Since the P-value ({p_value_gof:.4f}) is less than {alpha}, the observed counts are statistically unlikely if the die were truly fair. This is evidence that the die is biased.")
else:
  print(f"Since the P-value ({p_value_gof:.4f}) is greater than {alpha}, the observed counts are reasonably likely to occur even if the die were fair. We do not have enough evidence to conclude the die is biased.")


# Optional: Visualize the observed vs. expected frequencies
plt.figure(figsize=(10, 6))

bar_width = 0.35
index = np.arange(len(categories))

plt.bar(index, observed_counts, bar_width, label='Observed Counts', color='skyblue')
plt.bar(index + bar_width, expected_frequencies, bar_width, label='Expected Frequencies (H0)', color='lightcoral', alpha=0.7)

plt.xlabel('Die Face')
plt.ylabel('Frequency (Counts)')
plt.title('Observed vs. Expected Frequencies of Die Rolls (Goodness-of-Fit)')
plt.xticks(index + bar_width / 2, categories)
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
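
The goodness-of-fit statistic is simple enough to verify by hand. The sketch below recomputes it for the same die-roll counts using only the standard library and compares it with the critical value for 5 degrees of freedom at alpha = 0.05. The critical value 11.0705 is taken from a standard chi-square table rather than computed here, and the variable names are illustrative.

```python
observed_counts = [95, 105, 100, 98, 102, 100]  # same die-roll counts as above
total = sum(observed_counts)
expected_count = total / len(observed_counts)  # fair die: equal count per face

# Chi-square statistic: sum over categories of (O - E)^2 / E
contributions = [(obs - expected_count) ** 2 / expected_count for obs in observed_counts]
chi2_manual = sum(contributions)

# Critical value for df = 5 at alpha = 0.05, from a standard chi-square table
critical_value = 11.0705

for face, contribution in enumerate(contributions, start=1):
    print(f"Face {face}: contribution = {contribution:.4f}")
print(f"Chi-square statistic (manual): {chi2_manual:.4f}")
print(f"Exceeds critical value {critical_value}: {chi2_manual > critical_value}")
```

Listing the per-face contributions shows which categories drive the statistic; here every contribution is small, which is consistent with failing to reject the hypothesis of a fair die.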