# Statistics Part 2

# Theoretical Questions

1. What is hypothesis testing in statistics?
  - Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions about a population parameter or a population probability distribution.
2. What is the null hypothesis, and how does it differ from the alternative hypothesis?
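  - The null hypothesis (H0) states that there is no effect, relationship, or difference in the population, and it is assumed true until the data provide enough evidence against it. The alternative hypothesis (Ha) states that an effect, relationship, or difference does exist; it is the claim the test looks for evidence to support.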
3. What is the significance level in hypothesis testing, and why is it important?
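  - In hypothesis testing, the significance level, denoted by α (alpha), is the probability of rejecting the null hypothesis when it is actually true, i.e. the Type 1 error rate the analyst is willing to accept. It is important because it is fixed before the test and serves as the cutoff for the p-value: if p ≤ α, the null hypothesis is rejected. Common choices are 0.05 and 0.01.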
4. What does a P-value represent in hypothesis testing?
  - In hypothesis testing, a p-value represents the probability of obtaining results as extreme as, or more extreme than, the observed results, assuming the null hypothesis is true.
5. How do you interpret the P-value in hypothesis testing?
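  - The p-value is compared with the significance level α. If p ≤ α, the result is statistically significant and the null hypothesis is rejected; if p > α, we fail to reject it. A lower p-value means stronger evidence against the null hypothesis, but it is not the probability that the null hypothesis is true.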
6. What are Type 1 and Type 2 errors in hypothesis testing?
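  - A Type 1 error (false positive) occurs when a true null hypothesis is rejected; its probability is the significance level α. A Type 2 error (false negative) occurs when a false null hypothesis is not rejected; its probability is denoted β, and 1 − β is the power of the test.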
7. What is the difference between a one-tailed and a two-tailed test in hypothesis testing?
  - A one-tailed test checks for a difference in one specific direction (e.g., greater than, or less than, a reference value), while a two-tailed test checks for a difference in either direction (e.g., not equal to the reference value).
8. What is the Z-test, and when is it used in hypothesis testing?
  - A z-test is a statistical method used to test hypotheses about population parameters, particularly the population mean, when the population standard deviation is known or when the sample size is large enough (typically greater than 30) to apply the Central Limit Theorem.
9. How do you calculate the Z-score, and what does it represent in hypothesis testing?
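  - The Z-score is calculated by subtracting the mean from the data point and then dividing by the standard deviation: z = (x − μ) / σ. It represents how many standard deviations the value lies above or below the mean. For a sample mean, the test statistic is z = (x̄ − μ0) / (σ / √n), which is compared with the standard normal distribution to obtain a p-value.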
10. What is the T-distribution, and when should it be used instead of the normal distribution?
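  - The T-distribution is a symmetric, bell-shaped distribution with heavier tails than the normal distribution; its shape depends on the degrees of freedom and approaches the normal distribution as they increase. It is used instead of the normal distribution when the sample size is small (typically less than 30) or when the population standard deviation is unknown and must be estimated from the sample.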
11. What is the difference between a Z-test and a T-test?
   - The main difference between a Z-test and a T-test lies in the assumptions about the population variance and the sample size. Z-tests are used when the population standard deviation is known and the sample size is large (typically greater than 30), while T-tests are used when the population standard deviation is unknown and/or the sample size is small.
12. What is the T-test, and how is it used in hypothesis testing?
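   - A t-test is an inferential test used to determine whether a sample mean differs significantly from a hypothesized value (one-sample t-test) or whether the means of two groups differ significantly (independent or paired t-test). It computes a t-statistic from the sample mean(s) and the sample standard deviation, then compares it with the T-distribution to obtain a p-value.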
13. What is the relationship between Z-test and T-test in hypothesis testing?
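   - In hypothesis testing, both z-tests and t-tests are used to determine if there's a significant difference between groups or if the sample mean significantly differs from a known population mean. They share the same form of test statistic, (estimate − hypothesized value) / standard error; the t-test replaces the known population standard deviation with the sample estimate and uses the T-distribution. As the sample size grows, the T-distribution approaches the normal distribution, so the two tests give nearly identical results.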
14. What is a confidence interval, and how is it used to interpret statistical results?
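   - A confidence interval is a range of values, computed from a sample, that is likely to contain the true population parameter. A 95% confidence level means that if the sampling were repeated many times, about 95% of the intervals built this way would contain the true value. A narrow interval indicates a precise estimate, and for a two-sided test an interval that excludes the hypothesized value corresponds to rejecting that value at the matching significance level.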
15. What is the margin of error, and how does it affect the confidence interval?
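   - The margin of error is a statistical measure that reflects the precision of an estimate, essentially how much the sample results might differ from the real population results. The confidence interval is the point estimate ± the margin of error, so a larger margin of error (from a higher confidence level, more variability, or a smaller sample) gives a wider interval.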
16. How is Bayes' Theorem used in statistics, and what is its significance?
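   - Bayes' Theorem gives the probability of an event A given that another event B has occurred: P(A|B) = P(B|A) · P(A) / P(B). In statistics it is used to update a prior belief P(A) into a posterior P(A|B) as new data arrive, which is the basis of Bayesian inference.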
17. What is the Chi-square distribution, and when is it used?
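   - The chi-square distribution is a continuous probability distribution used in hypothesis testing, particularly for analyzing categorical data. It is the distribution of a sum of squared independent standard normal variables, with the number of terms as its degrees of freedom. It is used in goodness-of-fit tests, tests of independence in contingency tables, and inference about a population variance.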
18. What is the Chi-square goodness of fit test, and how is it applied?
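   - The Chi-square goodness of fit test checks whether observed category counts are consistent with a specific theoretical distribution. It is applied by computing the expected count E for each category, calculating χ² = Σ (O − E)² / E from the observed counts O, and comparing the statistic with a chi-square distribution with (number of categories − 1) degrees of freedom when no parameters are estimated from the data.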
19. What is the F-distribution, and when is it used in hypothesis testing?
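   - The F-distribution is the continuous probability distribution of the ratio of two independent chi-square variables, each divided by its degrees of freedom. It is used in hypothesis testing when comparing variances, for example in ANOVA and in tests of equality of two variances.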
20. What is an ANOVA test, and what are its assumptions?
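   - An ANOVA (Analysis of Variance) test is a statistical method used to compare the means of two or more groups (most often three or more) by comparing the variation between groups with the variation within groups. Its assumptions are that the observations are independent, that the data within each group are approximately normally distributed, and that the groups have roughly equal variances.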
21. What are the different types of ANOVA tests?
   - There are two main types of ANOVA tests: one-way ANOVA, which examines the effect of a single factor, and two-way ANOVA, which examines two factors and their interaction.
22. What is the F-test, and how does it relate to hypothesis testing?
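   - The F-test is a statistical test whose statistic follows an F-distribution under the null hypothesis; it compares two variances by taking their ratio. In hypothesis testing it is used to test whether two populations have equal variances, and in ANOVA to test whether group means are equal by comparing between-group with within-group variance.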
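
Several of the answers above (Z-score, p-value, significance level, confidence interval) can be tied together in one small worked example. The following is a minimal sketch of a two-tailed, one-sample Z-test using only the Python standard library. The function name `one_sample_z_test_sketch`, the sample values, the hypothesized mean of 100, and the "known" population standard deviation of 2 are all illustrative assumptions, not data taken from this notebook.

```python
import math
from statistics import NormalDist, fmean


def one_sample_z_test_sketch(sample, mu0, sigma, alpha=0.05):
    """Two-tailed one-sample Z-test, assuming sigma is the known population std dev."""
    n = len(sample)
    x_bar = fmean(sample)                  # sample mean
    se = sigma / math.sqrt(n)              # standard error of the mean
    z = (x_bar - mu0) / se                 # Z test statistic

    std_normal = NormalDist()              # standard normal: mean 0, std dev 1
    p_value = 2 * (1 - std_normal.cdf(abs(z)))   # area in both tails

    # Confidence interval for the mean at level 1 - alpha
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    ci = (x_bar - z_crit * se, x_bar + z_crit * se)

    return {"z": z, "p_value": p_value, "reject_h0": p_value < alpha, "ci": ci}


# Hypothetical data, for illustration only
sample = [101, 103, 99, 104, 102, 100, 105, 102]
result = one_sample_z_test_sketch(sample, mu0=100, sigma=2)

print(f"Z statistic: {result['z']:.3f}")
print(f"P-value: {result['p_value']:.4f}")
print(f"Reject H0 at alpha=0.05: {result['reject_h0']}")
print(f"95% CI for the mean: ({result['ci'][0]:.3f}, {result['ci'][1]:.3f})")
```

With these made-up numbers the sample mean is 102 and the Z statistic is about 2.83, so the p-value falls below 0.05 and the interval excludes 100; both views lead to the same decision, as expected for a two-sided test.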

# Practical Part - 1

In [None]:
# 1. Write a Python program to generate a random variable and display its value.
import random

# Generate a random number between 0 and 1
random_value = random.random()

# Display the value
print("Random value:", random_value)

In [None]:
# 2. Generate a discrete uniform distribution using Python and plot the probability mass function (PMF).
import matplotlib.pyplot as plt

# Values and equal probabilities (e.g., 1 to 6 like a die)
x = [1, 2, 3, 4, 5, 6]
pmf = [1/6] * 6  # Equal probability for each

# Plotting
plt.stem(x, pmf)  # use_line_collection was removed in newer Matplotlib releases
plt.xlabel('Value')
plt.ylabel('Probability')
plt.title('Discrete Uniform Distribution (PMF)')
plt.show()

In [None]:
# 3. Write a Python function to calculate the probability distribution function (PDF) of a Bernoulli distribution.
def bernoulli_pdf(x, p):
    """
    Calculate the PDF of a Bernoulli distribution.

    Parameters:
    x (int): The outcome (0 or 1)
    p (float): The probability of success (0 ≤ p ≤ 1)

    Returns:
    float: Probability of the outcome
    """
    if x == 1:
        return p
    elif x == 0:
        return 1 - p
    else:
        return 0.0

print(bernoulli_pdf(1, 0.7))  # Output: 0.7

In [None]:
# 4. Write a Python script to simulate a binomial distribution with n=10 and p=0.5, then plot its histogram.
import numpy as np
import matplotlib.pyplot as plt

# Parameters
n = 10       # number of trials
p = 0.5      # probability of success
size = 1000  # number of simulations

# Simulate binomial distribution
data = np.random.binomial(n, p, size)

# Plot histogram
plt.hist(data, bins=range(n+2), edgecolor='black', align='left')
plt.xlabel('Number of Successes')
plt.ylabel('Frequency')
plt.title('Binomial Distribution (n=10, p=0.5)')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

In [None]:
# 5. Create a Poisson distribution and visualize it using Python.
import numpy as np
import matplotlib.pyplot as plt

# Parameters
lambda_val = 4   # average rate (λ)
size = 1000      # number of samples

# Generate Poisson-distributed data
data = np.random.poisson(lambda_val, size)

# Plot histogram
plt.hist(data, bins=range(0, max(data)+2), edgecolor='black', align='left')
plt.xlabel('Number of Events')
plt.ylabel('Frequency')
plt.title(f'Poisson Distribution (λ={lambda_val})')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

In [None]:
# 6. Write a Python program to calculate and plot the cumulative distribution function (CDF) of a discrete uniform distribution.
import numpy as np
import matplotlib.pyplot as plt

# Discrete uniform values (e.g., 1 to 6 like a die)
x = np.arange(1, 7)
cdf = np.cumsum([1/6] * 6)  # Equal probability for each, then cumulative sum

# Plot CDF
plt.step(x, cdf, where='post')
plt.xlabel('Value')
plt.ylabel('CDF')
plt.title('CDF of Discrete Uniform Distribution')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

In [None]:
# 7. Generate a continuous uniform distribution using NumPy and visualize it.
import numpy as np
import matplotlib.pyplot as plt

# Generate data from continuous uniform distribution between 0 and 1
data = np.random.uniform(0, 1, 1000)

# Plot histogram
plt.hist(data, bins=30, edgecolor='black', density=True)
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Continuous Uniform Distribution [0, 1]')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

In [None]:
# 8. Simulate data from a normal distribution and plot its histogram.
import numpy as np
import matplotlib.pyplot as plt

# Generate data from normal distribution (mean=0, std=1)
data = np.random.normal(0, 1, 1000)

# Plot histogram
plt.hist(data, bins=30, edgecolor='black', density=True)
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Normal Distribution (mean=0, std=1)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

In [None]:
# 9. Write a Python function to calculate Z-scores from a dataset and plot them.
import numpy as np
import matplotlib.pyplot as plt

def plot_z_scores(data):
    mean = np.mean(data)
    std = np.std(data)
    z_scores = (data - mean) / std

    plt.plot(z_scores, 'o')
    plt.axhline(0, color='red', linestyle='--')
    plt.xlabel('Index')
    plt.ylabel('Z-score')
    plt.title('Z-scores of Data')
    plt.grid(True)
    plt.show()

    return z_scores

# Example usage:
data = np.array([10, 12, 9, 15, 10, 8, 13])
plot_z_scores(data)

In [None]:
# 10. Implement the Central Limit Theorem (CLT) using Python for a non-normal distribution.
import numpy as np
import matplotlib.pyplot as plt

# Parameters
sample_size = 30
num_samples = 1000

# Generate sample means from a uniform distribution (non-normal)
means = [np.mean(np.random.uniform(0, 1, sample_size)) for _ in range(num_samples)]

# Plot histogram of sample means
plt.hist(means, bins=30, edgecolor='black', density=True)
plt.title('CLT: Distribution of Sample Means (Uniform Data)')
plt.xlabel('Sample Mean')
plt.ylabel('Density')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

In [None]:
# 11. Simulate multiple samples from a normal distribution and verify the Central Limit Theorem.
import numpy as np
import matplotlib.pyplot as plt

# Parameters
sample_size = 30
num_samples = 1000
mu, sigma = 5, 2  # mean and std dev of original normal distribution

# Generate sample means
means = [np.mean(np.random.normal(mu, sigma, sample_size)) for _ in range(num_samples)]

# Plot histogram of sample means
plt.hist(means, bins=30, edgecolor='black', density=True)
plt.title('CLT: Distribution of Sample Means (Normal Data)')
plt.xlabel('Sample Mean')
plt.ylabel('Density')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

In [None]:
# 12. Write a Python function to calculate and plot the standard normal distribution (mean = 0, std = 1).
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def plot_standard_normal():
    x = np.linspace(-4, 4, 1000)  # range of x values
    y = norm.pdf(x, 0, 1)         # standard normal PDF

    plt.plot(x, y, label='Standard Normal PDF')
    plt.xlabel('x')
    plt.ylabel('Probability Density')
    plt.title('Standard Normal Distribution (mean=0, std=1)')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.legend()
    plt.show()

# Call the function
plot_standard_normal()

In [None]:
# 13. Generate random variables and calculate their corresponding probabilities using the binomial distribution.
import numpy as np
from scipy.stats import binom

# Parameters
n = 10      # number of trials
p = 0.5     # probability of success
size = 20   # number of random variables to generate

# Generate random variables from binomial distribution
random_vars = np.random.binomial(n, p, size)

# Calculate probabilities (PMF) for these values
probabilities = binom.pmf(random_vars, n, p)

# Print results
for rv, prob in zip(random_vars, probabilities):
    print(f'Value: {rv}, Probability: {prob:.4f}')

In [None]:
# 14. Write a Python program to calculate the Z-score for a given data point and compare it to a standard normal distribution.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def calculate_and_plot_zscore(data_point, data):
    mean = np.mean(data)
    std = np.std(data)
    z = (data_point - mean) / std
    print(f"Data point: {data_point}")
    print(f"Mean: {mean:.2f}, Std Dev: {std:.2f}")
    print(f"Z-score: {z:.2f}")

    # Plot standard normal distribution
    x = np.linspace(-4, 4, 1000)
    y = norm.pdf(x, 0, 1)
    plt.plot(x, y, label='Standard Normal PDF')

    # Mark the Z-score on the plot
    plt.axvline(z, color='red', linestyle='--', label=f'Z-score = {z:.2f}')

    plt.xlabel('Z-score')
    plt.ylabel('Probability Density')
    plt.title('Z-score compared to Standard Normal Distribution')
    plt.legend()
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.show()

# Example usage
data = [10, 12, 9, 15, 10, 8, 13]
data_point = 14
calculate_and_plot_zscore(data_point, data)

In [None]:
# 15. Implement hypothesis testing using Z-statistics for a sample dataset.
import numpy as np
from scipy.stats import norm

def z_test(sample, pop_mean, pop_std, alpha=0.05):
    """
    Perform a one-sample Z-test.

    Parameters:
    - sample: array-like, sample data
    - pop_mean: float, population mean under the null hypothesis
    - pop_std: float, population standard deviation (known)
    - alpha: significance level (default 0.05)

    Returns:
    - z_stat: calculated Z statistic
    - p_value: two-tailed p-value
    - conclusion: whether to reject null hypothesis
    """
    n = len(sample)
    sample_mean = np.mean(sample)
    z_stat = (sample_mean - pop_mean) / (pop_std / np.sqrt(n))
    p_value = 2 * (1 - norm.cdf(abs(z_stat)))  # two-tailed test

    print(f"Sample mean: {sample_mean:.4f}")
    print(f"Z statistic: {z_stat:.4f}")
    print(f"P-value: {p_value:.4f}")

    if p_value < alpha:
        conclusion = "Reject the null hypothesis."
    else:
        conclusion = "Fail to reject the null hypothesis."

    print(conclusion)
    return z_stat, p_value, conclusion

# Example usage
sample_data = [102, 99, 101, 98, 100, 97, 103, 100, 99, 101]
population_mean = 100
population_std = 2  # known population std dev
z_test(sample_data, population_mean, population_std)

In [None]:
# 16. Create a confidence interval for a dataset using Python and interpret the result.
import numpy as np
from scipy import stats

def confidence_interval(data, confidence=0.95):
    n = len(data)
    mean = np.mean(data)
    sem = stats.sem(data)  # Standard error of the mean
    h = sem * stats.t.ppf((1 + confidence) / 2, n - 1)  # Margin of error

    lower = mean - h
    upper = mean + h

    print(f"Sample mean = {mean:.3f}")
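    print(f"{confidence*100:.1f}% confidence interval: ({lower:.3f}, {upper:.3f})")
    # Interpretation: the interval comes from a procedure that captures the true mean this often
    print(f"If the sampling were repeated many times, about {confidence*100:.1f}% of such intervals would contain the true mean.")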

    return lower, upper

# Example usage:
data = [12, 15, 14, 10, 13, 15, 16, 14, 12, 11]
confidence_interval(data)

In [None]:
# 17. Generate data from a normal distribution, then calculate and interpret the confidence interval for its mean.
import numpy as np
from scipy import stats

# Generate data from normal distribution
np.random.seed(42)
data = np.random.normal(loc=50, scale=5, size=100)  # mean=50, std=5, sample size=100

# Calculate confidence interval function
def confidence_interval(data, confidence=0.95):
    n = len(data)
    mean = np.mean(data)
    sem = stats.sem(data)  # Standard error of the mean
    h = sem * stats.t.ppf((1 + confidence) / 2, n - 1)  # Margin of error

    lower = mean - h
    upper = mean + h

    print(f"Sample mean = {mean:.3f}")
    print(f"{confidence*100:.1f}% confidence interval: ({lower:.3f}, {upper:.3f})")
    print(f"We are {confidence*100:.1f}% confident that the true population mean lies within this interval.")

    return lower, upper

# Calculate and interpret confidence interval
confidence_interval(data)

In [None]:
# 18. Write a Python script to calculate and visualize the probability density function (PDF) of a normal distribution.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Parameters
mu = 0      # mean
sigma = 1   # standard deviation

# Generate x values
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 1000)

# Calculate PDF values
pdf = norm.pdf(x, mu, sigma)

# Plot the PDF
# Backslashes are doubled so \mu and \sigma reach mathtext instead of being invalid escapes
plt.plot(x, pdf, label=f'Normal PDF\n$\\mu={mu}$, $\\sigma={sigma}$')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.title('Normal Distribution PDF')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

In [None]:
# 19. Use Python to calculate and interpret the cumulative distribution function (CDF) of a Poisson distribution.
from scipy.stats import poisson

# Parameters
lambda_val = 3  # average rate (λ)
k = 5           # value at which to evaluate the CDF

# Calculate CDF
cdf_value = poisson.cdf(k, lambda_val)

print(f"CDF at k={k} for Poisson(λ={lambda_val}) is {cdf_value:.4f}")

# Interpretation
print(f"This means the probability of observing up to {k} events "
      f"(including {k}) is approximately {cdf_value:.4f}.")

In [None]:
# 20. Simulate a random variable using a continuous uniform distribution and calculate its expected value.
import numpy as np

# Parameters for the continuous uniform distribution
a = 2  # lower bound
b = 8  # upper bound

# Simulate one random variable
random_var = np.random.uniform(a, b)
print(f"Random variable sampled: {random_var:.4f}")

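# Calculate the expected value (mean) analytically
expected_value = (a + b) / 2
print(f"Expected value of Uniform({a}, {b}) = {expected_value:.4f}")

# Estimate it empirically from many draws for comparison
samples = np.random.uniform(a, b, 10_000)
print(f"Sample mean of {samples.size} draws = {samples.mean():.4f}")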

In [None]:
# 21. Write a Python program to compare the standard deviations of two datasets and visualize the difference.
import numpy as np
import matplotlib.pyplot as plt

# Example datasets
data1 = np.array([12, 15, 14, 10, 13, 15, 16, 14, 12, 11])
data2 = np.array([20, 22, 19, 21, 20, 18, 23, 22, 21, 19])

# Calculate standard deviations
std1 = np.std(data1, ddof=1)  # sample std dev
std2 = np.std(data2, ddof=1)

print(f"Standard deviation of dataset 1: {std1:.3f}")
print(f"Standard deviation of dataset 2: {std2:.3f}")

# Visualize the standard deviations
labels = ['Dataset 1', 'Dataset 2']
std_values = [std1, std2]

plt.bar(labels, std_values, color=['skyblue', 'lightgreen'])
plt.ylabel('Standard Deviation')
plt.title('Comparison of Standard Deviations')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

In [None]:
# 22. Calculate the range and interquartile range (IQR) of a dataset generated from a normal distribution.
import numpy as np
from scipy.stats import iqr

# Generate data from normal distribution
np.random.seed(0)
data = np.random.normal(loc=50, scale=10, size=100)

# Calculate range
data_range = np.max(data) - np.min(data)

# Calculate interquartile range (IQR)
data_iqr = iqr(data)

print(f"Range of data: {data_range:.2f}")
print(f"Interquartile Range (IQR) of data: {data_iqr:.2f}")

In [None]:
# 23. Implement Z-score normalization on a dataset and visualize its transformation.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import zscore

# Sample dataset
data = np.array([10, 12, 9, 15, 10, 8, 13, 14, 7, 11])

# Perform Z-score normalization
data_normalized = zscore(data)

# Plot original and normalized data distributions
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(data, bins=5, color='skyblue', edgecolor='black')
axes[0].set_title('Original Data')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Frequency')

axes[1].hist(data_normalized, bins=5, color='lightgreen', edgecolor='black')
axes[1].set_title('Z-score Normalized Data')
axes[1].set_xlabel('Z-score')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
# 24. Write a Python function to calculate the skewness and kurtosis of a dataset generated from a normal distribution.
import numpy as np
from scipy.stats import skew, kurtosis

def calc_skewness_kurtosis(data):
    skewness = skew(data)
    kurt = kurtosis(data)  # excess kurtosis by default
    print(f"Skewness: {skewness:.4f}")
    print(f"Kurtosis (excess): {kurt:.4f}")
    return skewness, kurt

# Generate data from normal distribution
np.random.seed(0)
data = np.random.normal(loc=0, scale=1, size=1000)

# Calculate and print skewness and kurtosis
calc_skewness_kurtosis(data)

# Practical Part - 2

In [None]:
# 1. Write a Python program to perform a Z-test for comparing a sample mean to a known population mean and interpret the results.
import numpy as np
from scipy.stats import norm

def z_test(sample, pop_mean, pop_std, alpha=0.05):
    n = len(sample)
    sample_mean = np.mean(sample)
    z_stat = (sample_mean - pop_mean) / (pop_std / np.sqrt(n))
    p_value = 2 * (1 - norm.cdf(abs(z_stat)))  # two-tailed test

    print(f"Sample mean = {sample_mean:.4f}")
    print(f"Population mean = {pop_mean:.4f}")
    print(f"Z statistic = {z_stat:.4f}")
    print(f"P-value = {p_value:.4f}")

    if p_value < alpha:
        print(f"Reject the null hypothesis at alpha = {alpha}.")
        print("There is a significant difference between sample mean and population mean.")
    else:
        print(f"Fail to reject the null hypothesis at alpha = {alpha}.")
        print("There is no significant difference between sample mean and population mean.")

# Example usage:
sample_data = [102, 99, 101, 98, 100, 97, 103, 100, 99, 101]
population_mean = 100
population_std = 2  # known population std deviation

z_test(sample_data, population_mean, population_std)

In [None]:
# 2. Simulate random data to perform hypothesis testing and calculate the corresponding P-value using Python.
import numpy as np
from scipy.stats import norm

# Simulate random sample data from a normal distribution
np.random.seed(42)
sample_size = 30
pop_mean = 50
pop_std = 5  # known population std dev
sample_data = np.random.normal(loc=52, scale=pop_std, size=sample_size)  # data drawn with true mean 52, not 50

# Perform one-sample Z-test
sample_mean = np.mean(sample_data)
z_stat = (sample_mean - pop_mean) / (pop_std / np.sqrt(sample_size))
p_value = 2 * (1 - norm.cdf(abs(z_stat)))  # two-tailed test

print(f"Sample mean: {sample_mean:.3f}")
print(f"Z statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.4f}")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: significant difference from population mean.")
else:
    print("Fail to reject the null hypothesis: no significant difference from population mean.")

In [None]:
# 3. Implement a one-sample Z-test using Python to compare the sample mean with the population mean.
import numpy as np
from scipy.stats import norm

def one_sample_z_test(sample, pop_mean, pop_std, alpha=0.05):
    n = len(sample)
    sample_mean = np.mean(sample)
    z_stat = (sample_mean - pop_mean) / (pop_std / np.sqrt(n))
    p_value = 2 * (1 - norm.cdf(abs(z_stat)))  # two-tailed test

    print(f"Sample mean = {sample_mean:.4f}")
    print(f"Population mean = {pop_mean:.4f}")
    print(f"Z statistic = {z_stat:.4f}")
    print(f"P-value = {p_value:.4f}")

    if p_value < alpha:
        print("Reject the null hypothesis: significant difference.")
    else:
        print("Fail to reject the null hypothesis: no significant difference.")

# Example usage
sample_data = [104, 98, 101, 102, 99, 97, 105, 100, 103, 99]
population_mean = 100
population_std = 3  # known population std deviation

one_sample_z_test(sample_data, population_mean, population_std)

In [None]:
# 4. Perform a two-tailed Z-test using Python and visualize the decision region on a plot.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def two_tailed_z_test(sample, pop_mean, pop_std, alpha=0.05):
    n = len(sample)
    sample_mean = np.mean(sample)
    z_stat = (sample_mean - pop_mean) / (pop_std / np.sqrt(n))
    p_value = 2 * (1 - norm.cdf(abs(z_stat)))

    print(f"Sample mean = {sample_mean:.4f}")
    print(f"Z statistic = {z_stat:.4f}")
    print(f"P-value = {p_value:.4f}")

    # Critical z-values for two-tailed test
    z_critical = norm.ppf(1 - alpha/2)

    # Plotting
    x = np.linspace(-4, 4, 1000)
    y = norm.pdf(x)

    plt.plot(x, y, label='Standard Normal Distribution')
    plt.fill_between(x, 0, y, where=(x <= -z_critical), color='red', alpha=0.3, label='Rejection Region')
    plt.fill_between(x, 0, y, where=(x >= z_critical), color='red', alpha=0.3)
    plt.axvline(z_stat, color='blue', linestyle='--', label=f'Z Statistic = {z_stat:.2f}')
    plt.axvline(-z_critical, color='red', linestyle='--', label=f'Critical Values ±{z_critical:.2f}')
    plt.axvline(z_critical, color='red', linestyle='--')

    plt.title('Two-tailed Z-test')
    plt.xlabel('Z value')
    plt.ylabel('Probability Density')
    plt.legend()
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.show()

    # Interpretation
    if abs(z_stat) > z_critical:
        print("Reject the null hypothesis: significant difference.")
    else:
        print("Fail to reject the null hypothesis: no significant difference.")

# Example data
sample_data = [104, 98, 101, 102, 99, 97, 105, 100, 103, 99]
population_mean = 100
population_std = 3
alpha = 0.05

two_tailed_z_test(sample_data, population_mean, population_std, alpha)

In [None]:
# 5. Create a Python function that calculates and visualizes Type 1 and Type 2 errors during hypothesis testing.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def plot_type_errors(mu0, mu1, sigma, n, alpha=0.05):
    """
    mu0: mean under null hypothesis
    mu1: mean under alternative hypothesis
    sigma: population standard deviation
    n: sample size
    alpha: significance level (Type I error rate)
    """
    # Standard error
    se = sigma / np.sqrt(n)

    # Critical value for rejecting H0 (two-tailed test)
    z_crit = norm.ppf(1 - alpha/2)

    # Critical values in terms of sample mean
    crit_low = mu0 - z_crit * se
    crit_high = mu0 + z_crit * se

    # X values for plotting distributions
    x_min = min(mu0, mu1) - 4*se  # also covers the case mu1 < mu0
    x_max = max(mu0, mu1) + 4*se
    x = np.linspace(x_min, x_max, 1000)

    # Null hypothesis distribution (centered at mu0)
    h0_pdf = norm.pdf(x, mu0, se)
    # Alternative hypothesis distribution (centered at mu1)
    h1_pdf = norm.pdf(x, mu1, se)

    plt.figure(figsize=(10,6))

    # Plot H0 and H1
    plt.plot(x, h0_pdf, label='Null Hypothesis $H_0$', color='blue')
    plt.plot(x, h1_pdf, label='Alternative Hypothesis $H_1$', color='green')

    # Shade rejection regions under H0 (Type I error areas)
    x_reject_low = np.linspace(x_min, crit_low, 200)
    x_reject_high = np.linspace(crit_high, x_max, 200)
    plt.fill_between(x_reject_low, 0, norm.pdf(x_reject_low, mu0, se), color='red', alpha=0.3, label='Type I Error (α)')
    plt.fill_between(x_reject_high, 0, norm.pdf(x_reject_high, mu0, se), color='red', alpha=0.3)

    # Shade Type II error region under H1 (acceptance region for H0)
    x_accept = np.linspace(crit_low, crit_high, 300)
    plt.fill_between(x_accept, 0, norm.pdf(x_accept, mu1, se), color='orange', alpha=0.3, label='Type II Error (β)')

    # Add vertical lines for critical values
    plt.axvline(crit_low, color='black', linestyle='--', label='Critical Values')
    plt.axvline(crit_high, color='black', linestyle='--')

    plt.title('Type I and Type II Errors in Hypothesis Testing')
    plt.xlabel('Sample Mean')
    plt.ylabel('Probability Density')
    plt.legend()
    plt.grid(alpha=0.3)
    plt.show()

    # Calculate Type II error (beta)
    beta = norm.cdf(crit_high, mu1, se) - norm.cdf(crit_low, mu1, se)

    print(f"Significance level (Type I error, α): {alpha:.3f}")
    print(f"Type II error (β): {beta:.3f}")
    print(f"Power of the test (1 - β): {1 - beta:.3f}")

# Example usage:
plot_type_errors(mu0=100, mu1=105, sigma=15, n=30, alpha=0.05)

In [None]:
# 6. Write a Python program to perform an independent T-test and interpret the results.
import numpy as np
from scipy.stats import ttest_ind

# Example data: two independent samples
group1 = np.array([23, 21, 19, 24, 30, 22, 20, 18])
group2 = np.array([31, 29, 35, 30, 28, 34, 33, 27])

# Perform independent T-test (Welch's version: equal_var=False does not assume equal variances; SciPy's default is equal_var=True)
t_stat, p_value = ttest_ind(group1, group2, equal_var=False)

print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: the means are significantly different.")
else:
    print("Fail to reject the null hypothesis: no significant difference between means.")

In [None]:
# 7. Perform a paired sample T-test using Python and visualize the comparison results.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ttest_rel

# Example data: before and after scores for the same group
before = np.array([88, 75, 90, 85, 92, 78, 84, 91])
after  = np.array([90, 78, 94, 88, 95, 81, 86, 94])

# Paired T-test
t_stat, p_value = ttest_rel(before, after)

print(f"Paired T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: significant difference between paired samples.")
else:
    print("Fail to reject the null hypothesis: no significant difference between paired samples.")

# Visualization
plt.figure(figsize=(8, 5))
for i in range(len(before)):
    plt.plot([0, 1], [before[i], after[i]], marker='o', color='gray')
plt.xticks([0, 1], ['Before', 'After'])
plt.title('Paired Sample Comparison')
plt.ylabel('Scores')
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

In [None]:
# 8. Simulate data and perform both Z-test and T-test, then compare the results using Python.
import numpy as np
from scipy.stats import ttest_1samp, norm

# Simulate sample data
np.random.seed(42)
sample_size = 30
true_mean = 52       # mean actually used to generate the data
population_std = 10  # known for Z-test
sample_data = np.random.normal(loc=true_mean, scale=population_std, size=sample_size)

# Known population mean for testing
hypothesized_mean = 50

# Z-test (when population std dev is known)
sample_mean = np.mean(sample_data)
z_stat = (sample_mean - hypothesized_mean) / (population_std / np.sqrt(sample_size))
z_p_value = 2 * (1 - norm.cdf(abs(z_stat)))  # two-tailed

# T-test (when population std dev is unknown)
t_stat, t_p_value = ttest_1samp(sample_data, hypothesized_mean)

# Print results
print("Z-test Results:")
print(f"  Z-statistic: {z_stat:.4f}")
print(f"  P-value:     {z_p_value:.4f}")

print("\nT-test Results:")
print(f"  T-statistic: {t_stat:.4f}")
print(f"  P-value:     {t_p_value:.4f}")

# Interpretation
alpha = 0.05
print("\nInterpretation (alpha = 0.05):")
if z_p_value < alpha:
    print("  Z-test: Reject the null hypothesis.")
else:
    print("  Z-test: Fail to reject the null hypothesis.")

if t_p_value < alpha:
    print("  T-test: Reject the null hypothesis.")
else:
    print("  T-test: Fail to reject the null hypothesis.")

In [None]:
# 9. Write a Python function to calculate the confidence interval for a sample mean and explain its significance.
import numpy as np
from scipy.stats import t

def confidence_interval(sample, confidence=0.95):
    n = len(sample)
    mean = np.mean(sample)
    std_err = np.std(sample, ddof=1) / np.sqrt(n)  # standard error
    t_crit = t.ppf((1 + confidence) / 2, df=n-1)   # critical t-value
    margin = t_crit * std_err

    lower = mean - margin
    upper = mean + margin

    print(f"Sample mean = {mean:.2f}")
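    print(f"{confidence:.0%} Confidence Interval: ({lower:.2f}, {upper:.2f})")
    # Significance: the interval shows how precisely the sample pins down the population mean
    print(f"About {confidence:.0%} of intervals built this way from repeated samples would contain the true mean.")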
    return (lower, upper)

# Example usage
sample_data = [102, 98, 100, 101, 99, 97, 103, 104, 100, 98]
confidence_interval(sample_data)

In [None]:
# 10. Write a Python program to calculate the margin of error for a given confidence level using sample data.
import numpy as np
from scipy.stats import t

def calculate_margin_of_error(sample, confidence=0.95):
    n = len(sample)
    sample_std = np.std(sample, ddof=1)
    std_error = sample_std / np.sqrt(n)
    t_critical = t.ppf((1 + confidence) / 2, df=n-1)

    margin_of_error = t_critical * std_error

    print(f"Sample Size = {n}")
    print(f"Sample Mean = {np.mean(sample):.2f}")
    print(f"Sample Std Dev = {sample_std:.2f}")
    # ":g" avoids truncation errors from int(), e.g. int(0.57 * 100) == 56
    print(f"{confidence * 100:g}% Margin of Error = ±{margin_of_error:.2f}")

    return margin_of_error

# Example usage
sample_data = [102, 98, 100, 101, 99, 97, 103, 104, 100, 98]
calculate_margin_of_error(sample_data, confidence=0.95)
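
The margin of error depends on the chosen confidence level: a higher level needs a larger critical t-value and therefore a wider margin. A small sketch on the same sample data, comparing three commonly used levels (the choice of levels is an assumption for illustration):

```python
import numpy as np
from scipy.stats import t

sample = [102, 98, 100, 101, 99, 97, 103, 104, 100, 98]
n = len(sample)
std_error = np.std(sample, ddof=1) / np.sqrt(n)

margins = {}
for level in (0.90, 0.95, 0.99):
    t_critical = t.ppf((1 + level) / 2, df=n - 1)  # two-sided critical value
    margins[level] = t_critical * std_error
    print(f"{level:.0%} confidence: margin of error = ±{margins[level]:.2f}")
```

The standard error is the same in all three cases. Only the critical value changes, which is why more confidence costs precision unless the sample size grows.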

In [None]:
# 11. Implement a Bayesian inference method using Bayes' Theorem in Python and explain the process.
def bayes_theorem(prior, sensitivity, specificity):
    """
    Apply Bayes' Theorem to compute posterior probability of having disease given a positive test result.
    prior: P(Disease)
    sensitivity: P(Pos | Disease)
    specificity: P(Neg | No Disease)
    """
    # P(Pos | No Disease) = 1 - specificity
    false_positive_rate = 1 - specificity

    # P(Pos) = P(Pos | Disease) * P(Disease) + P(Pos | No Disease) * P(No Disease)
    p_pos = sensitivity * prior + false_positive_rate * (1 - prior)

    # P(Disease | Pos) = [P(Pos | Disease) * P(Disease)] / P(Pos)
    posterior = (sensitivity * prior) / p_pos

    print(f"Prior (P(Disease))          = {prior:.4f}")
    print(f"Likelihood (P(Pos|Disease)) = {sensitivity:.4f}")
    print(f"P(Pos)                      = {p_pos:.4f}")
    print(f"Posterior (P(Disease|Pos))  = {posterior:.4f}")

    return posterior

# Example usage
prior = 0.01         # 1% of people have the disease
sensitivity = 0.99   # True positive rate
specificity = 0.95   # True negative rate

bayes_theorem(prior, sensitivity, specificity)
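
Bayesian inference is iterative: the posterior from one piece of evidence becomes the prior for the next. The sketch below applies the same formula twice for two positive results. It assumes the two tests are conditionally independent given disease status, which may not hold for real diagnostic tests, and `posterior_given_positive` is a hypothetical helper name.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """P(Disease | Pos) via Bayes' Theorem (hypothetical helper for this sketch)."""
    p_pos = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_pos

# Same numbers as the example above
after_first = posterior_given_positive(0.01, 0.99, 0.95)

# Second positive test: the first posterior is now the prior
# (assumes the two test results are conditionally independent)
after_second = posterior_given_positive(after_first, 0.99, 0.95)

print(f"After one positive test:  {after_first:.4f}")
print(f"After two positive tests: {after_second:.4f}")
```

With a 1% prior, a single positive result leaves the posterior well below 50% because false positives among the healthy majority outnumber true positives. A second positive result raises it substantially.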

In [None]:
# 12. Perform a Chi-square test for independence between two categorical variables in Python.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Example: Contingency table (2 categorical variables)
# Rows: Gender (Male, Female)
# Columns: Preference (Product A, Product B)

data = [[30, 10],   # Male
        [20, 40]]   # Female

# Create a DataFrame for clarity (optional)
table = pd.DataFrame(data, columns=["Product A", "Product B"], index=["Male", "Female"])

# Perform Chi-square test
chi2, p, dof, expected = chi2_contingency(table)

print("Observed Table:")
print(table)
print("\nExpected Frequencies:")
print(pd.DataFrame(expected, columns=table.columns, index=table.index))
print(f"\nChi-square Statistic = {chi2:.4f}")
print(f"Degrees of Freedom   = {dof}")
print(f"P-value              = {p:.4f}")

# Interpretation
alpha = 0.05
if p < alpha:
    print("Result: Reject the null hypothesis; the variables appear to be associated.")
else:
    print("Result: Fail to reject the null hypothesis; no significant evidence of association.")
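
For 2x2 tables, `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the statistic above is slightly smaller than the textbook Pearson value. A short sketch comparing both settings on the same table:

```python
from scipy.stats import chi2_contingency

observed = [[30, 10],   # Male
            [20, 40]]   # Female

# Default: Yates' continuity correction is applied because dof == 1
chi2_yates, p_yates, _, _ = chi2_contingency(observed)

# Plain Pearson chi-square without the correction
chi2_plain, p_plain, _, _ = chi2_contingency(observed, correction=False)

print(f"With Yates' correction: chi2 = {chi2_yates:.4f}, p = {p_yates:.6f}")
print(f"Without correction:     chi2 = {chi2_plain:.4f}, p = {p_plain:.6f}")
```

The correction makes the test more conservative for small counts. Which variant to report is a judgment call, so it helps to state which one was used.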

In [None]:
# 13. Write a Python program to calculate the expected frequencies for a Chi-square test based on observed data.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Observed data: contingency table
# Rows = Gender (Male, Female), Columns = Preference (A, B, C)
observed = np.array([[20, 30, 50],   # Male
                     [30, 50, 20]])  # Female

# Optional: wrap in a DataFrame for readability
table = pd.DataFrame(observed, columns=["A", "B", "C"], index=["Male", "Female"])

# Calculate expected frequencies using scipy
chi2, p, dof, expected = chi2_contingency(table)

# Output results
print("Observed Frequencies:")
print(table)

print("\nExpected Frequencies:")
expected_df = pd.DataFrame(expected, columns=table.columns, index=table.index)
print(expected_df.round(2))
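
The expected frequencies can also be computed by hand: under independence, each cell is (row total × column total) / grand total. A minimal sketch that reproduces SciPy's result for the same table:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[20, 30, 50],   # Male
                     [30, 50, 20]])  # Female

row_totals = observed.sum(axis=1)    # one total per row
col_totals = observed.sum(axis=0)    # one total per column
grand_total = observed.sum()

# Outer product gives every row-total x column-total combination
expected_manual = np.outer(row_totals, col_totals) / grand_total

_, _, _, expected_scipy = chi2_contingency(observed)

print("Manual expected frequencies:")
print(expected_manual)
print("Matches SciPy:", np.allclose(expected_manual, expected_scipy))
```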

In [None]:
# 14. Perform a goodness-of-fit test using Python to compare the observed data to an expected distribution.
import numpy as np
from scipy.stats import chisquare

# Observed data (e.g., dice roll outcomes)
observed = np.array([18, 22, 20, 17, 21, 22])  # 120 total rolls

# Expected frequencies under a fair die: total rolls split evenly across 6 faces
expected = np.full(6, observed.sum() / 6)

# Perform Chi-square goodness-of-fit test
chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)

# Output results
print("Observed Frequencies:", observed)
print("Expected Frequencies:", expected.astype(int))
print(f"\nChi-square Statistic = {chi2_stat:.4f}")
print(f"P-value = {p_value:.4f}")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Result: Reject the null hypothesis; the observed data deviates significantly from the expected distribution.")
else:
    print("Result: Fail to reject the null hypothesis; no significant deviation from the expected distribution.")
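
The goodness-of-fit statistic is the sum of (observed − expected)² / expected over all categories, compared against a chi-square distribution with (number of categories − 1) degrees of freedom. A sketch that computes it by hand for the dice data and checks it against `chisquare`:

```python
import numpy as np
from scipy.stats import chi2, chisquare

observed = np.array([18, 22, 20, 17, 21, 22])
expected = np.full(6, observed.sum() / 6)  # fair die: equal share per face

# Pearson chi-square statistic computed term by term
stat_manual = np.sum((observed - expected) ** 2 / expected)
dof = len(observed) - 1
p_manual = chi2.sf(stat_manual, dof)       # upper-tail probability

stat_scipy, p_scipy = chisquare(f_obs=observed, f_exp=expected)

print(f"Manual: chi2 = {stat_manual:.4f}, p = {p_manual:.4f}")
print(f"SciPy:  chi2 = {stat_scipy:.4f}, p = {p_scipy:.4f}")
```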