# Q1

In [None]:
# ANOVA (Analysis of Variance) is a statistical technique used to compare the means of three or more groups. Its results are valid only
# when several assumptions hold. The assumptions, with examples of violations that can undermine the validity of the ANOVA results,
# are discussed below:

In [None]:
# 1. Independence: The observations within each group must be independent of each other. Violations occur when there is dependence or correlation
# between the observations. For example, measurements taken from members of the same family tend to be correlated, which violates this
# assumption.

In [None]:
# 2. Normality: The data within each group should follow a normal distribution. This assumption is most critical when the sample sizes are small. 
# Violations can occur when the data significantly deviates from normality. For instance, if the data is heavily skewed or has extreme outliers, 
# it may violate the normality assumption.

In [None]:
# 3. Homogeneity of variance: The variances of the different groups being compared should be approximately equal. Violations arise when the 
# variability across groups is significantly different. For example, if one group has a much larger variance compared to the others, the assumption 
# of homogeneity of variance is violated.

In [None]:
# 4. Independence of errors: The errors or residuals (the differences between observed values and the values predicted by the group means)
# should be independent of each other. Violations occur when there is a pattern or correlation among the residuals. For instance, if the data
# are collected over time and consecutive residuals tend to share the same sign (autocorrelation), the assumption of independence of errors
# is violated.

In [None]:
# Examples of violations and their impact on ANOVA results:

In [None]:
# 1. Non-independence: If participants in a study are related (e.g., siblings), and their responses are not independent, the assumption of 
# independence is violated. The violation can lead to inflated or deflated F-statistics, which can impact the validity of the results.

In [None]:
# 2. Non-normality: If the data within each group significantly deviates from a normal distribution, the assumption of normality is violated. 
# This violation can affect the accuracy of p-values and confidence intervals. In such cases, non-parametric alternatives or transformations of 
# the data may be more appropriate.

In [None]:
# 3. Heterogeneity of variance: When the variability across groups is significantly different, violating the assumption of homogeneity of variance,
# the F-test may be too liberal or too conservative. The problem is most serious with unequal group sizes: if the group with the larger variance
# has the smaller sample size, the Type I error rate is inflated and ANOVA may erroneously identify a significant difference between groups.

In [None]:
# 4. Dependence of errors: If there is a pattern or correlation among the residuals, violating the assumption of independence of errors, the 
# standard errors of the estimates may be incorrect. This can lead to inaccurate hypothesis tests and confidence intervals.

In [None]:
# It's important to note that while violations of these assumptions can impact the validity of ANOVA results, the impact and severity of 
# violations depend on the specific context and the extent of the violation.
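
In [None]:
# A minimal sketch of how the normality and equal-variance assumptions could be checked before running ANOVA. The three groups below are
# hypothetical values invented for illustration; scipy.stats.shapiro and scipy.stats.levene are used as one possible choice of tests,
# and independence is not testable this way because it depends on the study design.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three groups (illustrative values only)
group_a = np.array([23.1, 24.8, 22.5, 25.0, 23.9, 24.2, 22.8, 24.5])
group_b = np.array([26.3, 27.1, 25.8, 26.9, 27.4, 26.0, 25.5, 27.8])
group_c = np.array([21.0, 29.5, 18.2, 31.1, 24.7, 16.9, 33.0, 22.4])  # much wider spread

groups = {"A": group_a, "B": group_b, "C": group_c}

# Normality: Shapiro-Wilk test on each group's residuals (value minus group mean)
normality_p = {}
for name, values in groups.items():
    residuals = values - values.mean()
    _, p = stats.shapiro(residuals)
    normality_p[name] = p
    print(f"Shapiro-Wilk, group {name}: p = {p:.3f}")

# Homogeneity of variance: Levene's test across all groups
levene_stat, levene_p = stats.levene(group_a, group_b, group_c)
print(f"Levene: W = {levene_stat:.3f}, p = {levene_p:.4f}")
```

# A small p-value from either test suggests the corresponding assumption is doubtful for these data.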

# Q2

In [None]:
# The three types of ANOVA are:

In [None]:
# 1. One-way ANOVA: One-way ANOVA is used when you have one categorical independent variable (also known as a factor) with three or more levels
# and a continuous dependent variable. It is used to determine if there are any significant differences between the means of the groups defined 
# by the levels of the independent variable. For example, you might use one-way ANOVA to compare the average scores of students in three different 
# teaching methods (e.g., lecture, discussion, and online) to determine if there are any significant differences in learning outcomes.

In [None]:
# 2. Two-way ANOVA: Two-way ANOVA is used when you have two independent variables (factors) and a continuous dependent variable. Each independent
# variable has two or more levels, and the interaction between the two independent variables is of interest. Two-way ANOVA allows you to analyze 
# the main effects of each independent variable as well as their interaction effect on the dependent variable. For example, you might use two-way 
# ANOVA to examine the effects of both gender (male vs. female) and treatment type (drug A vs. drug B) on blood pressure.

In [None]:
# 3. Mixed ANOVA: Mixed ANOVA (also called mixed-design or split-plot ANOVA) is used when you have one or more within-subject factors
# (repeated measures) and one or more between-subject factors. It extends repeated-measures ANOVA, which has only within-subject factors,
# to designs that mix within-subject and between-subject variables. Mixed ANOVA allows you to examine the effects of each factor
# individually as well as their interaction effect. This type of ANOVA is commonly used in longitudinal or experimental designs where
# measurements are taken at multiple time points or under different conditions. For example, you might use mixed ANOVA to analyze the
# effects of time (within-subject factor) and treatment group (between-subject factor) on the pain levels of patients over a period of
# several weeks.

# Q3

In [None]:
# The partitioning of variance in ANOVA refers to the decomposition of the total variability observed in a dataset into different sources or 
# components. Understanding this concept is essential because it helps us identify the contributions of various factors and their interactions
# to the overall variation in the data. It allows us to quantify the proportion of variability explained by different sources and determine if 
# these sources are statistically significant.

In [None]:
# In ANOVA, the total variability observed in the data is divided into two main components:

In [None]:
# Between-group variability: This component represents the variability among the group means or levels of the independent variable. 
# It reflects the differences between the groups being compared. If the between-group variability is large relative to the within-group 
# variability, it suggests that there are significant differences between the groups.

In [None]:
# Within-group variability: This component represents the variability within each group or level of the independent variable. It reflects the 
# individual differences or random variation within each group. The within-group variability serves as the baseline or reference level of 
# variability against which the between-group variability is compared.

In [None]:
# By partitioning the total variability into these two components, ANOVA helps us evaluate the significance of group differences and determine 
# whether the observed differences are likely due to the effects of the independent variable or simply due to random variation.

In [None]:
# Understanding the partitioning of variance is important for several reasons:

In [None]:
# Hypothesis testing: ANOVA uses the partitioning of variance to assess the statistical significance of group differences. By comparing the
# ratio of between-group variability to within-group variability (F-ratio), ANOVA helps determine if the observed differences are significant 
# or likely to have occurred by chance.

In [None]:
# Effect size estimation: The partitioning of variance allows us to quantify the proportion of variability explained by the independent 
# variable(s) and their interactions. This information is crucial for assessing the practical significance or importance of the observed 
# effects.

In [None]:
# Model evaluation: The partitioning of variance helps evaluate the goodness-of-fit of the ANOVA model. It allows us to determine how well 
# the model accounts for the observed variability in the data.
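
In [None]:
# The partition described above can be written compactly for the one-way case. The notation is an assumption introduced here for
# illustration: k groups, n_j observations in group j, N observations in total, y_ij the i-th observation in group j, \bar{y}_j the
# mean of group j, and \bar{y} the overall mean.

```latex
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(y_{ij}-\bar{y}\right)^2}_{\text{total}}
= \underbrace{\sum_{j=1}^{k} n_j\left(\bar{y}_j-\bar{y}\right)^2}_{\text{between-group}}
+ \underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(y_{ij}-\bar{y}_j\right)^2}_{\text{within-group}}

F = \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)},
\qquad
\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}}
```

# The F-ratio is the quantity used for hypothesis testing, and eta squared is one common effect-size measure based on the same partition.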

# Q4

In [1]:
import scipy.stats as stats

# Sample data for each group
group1 = [1, 2, 3, 4, 5]
group2 = [2, 4, 6, 8, 10]
group3 = [3, 6, 9, 12, 15]

# Combine the groups into a single list
data = group1 + group2 + group3

# Compute the overall mean
overall_mean = sum(data) / len(data)

# Calculate the total sum of squares (SST)
sst = sum((x - overall_mean) ** 2 for x in data)

# Calculate the group means
group_means = [sum(group) / len(group) for group in [group1, group2, group3]]

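# Calculate the explained (between-group) sum of squares, labelled SSE here
sse = sum(len(group) * (mean - overall_mean) ** 2 for group, mean in zip([group1, group2, group3], group_means))

# Calculate the residual (within-group) sum of squares, labelled SSR here.
# Note: many textbooks swap these abbreviations (SSR = regression, SSE = error).
ssr = sst - sse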

print("Total sum of squares (SST):", sst)
print("Explained sum of squares (SSE):", sse)
print("Residual sum of squares (SSR):", ssr)

Total sum of squares (SST): 230.0
Explained sum of squares (SSE): 90.0
Residual sum of squares (SSR): 140.0
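
In [None]:
# As a possible next step, the sums of squares can be turned into an F-statistic and a p-value. The sketch below recomputes the
# between-group and within-group sums of squares for the same three groups, forms the F-ratio by hand, and checks it against
# scipy.stats.f_oneway. The variable names are illustrative and not part of the original answer.

```python
from scipy import stats

group1 = [1, 2, 3, 4, 5]
group2 = [2, 4, 6, 8, 10]
group3 = [3, 6, 9, 12, 15]
groups = [group1, group2, group3]

all_values = [x for g in groups for x in g]
grand_mean = sum(all_values) / len(all_values)

# Between-group (explained) and within-group (residual) sums of squares
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

# Degrees of freedom: k - 1 between groups, N - k within groups
k = len(groups)
n = len(all_values)
df_between = k - 1
df_within = n - k

# F-ratio of the two mean squares and its upper-tail p-value
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)

# Proportion of total variability explained by group membership
eta_squared = ss_between / (ss_between + ss_within)

print("F (manual):", f_manual, "F (scipy):", f_scipy)
print("p (manual):", p_manual, "p (scipy):", p_scipy)
print("eta squared:", eta_squared)
```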


# Q5

In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a DataFrame with the data
data = pd.DataFrame({
    'A': [1, 1, 2, 2, 3, 3, 4, 4],
    'B': [1, 2, 1, 2, 1, 2, 1, 2],
    'Y': [3, 5, 4, 7, 6, 9, 8, 10]
})

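# Fit the model with both main effects and their interaction.
# Note: A and B are numeric columns, so this formula treats them as continuous
# predictors; wrap them as C(A) and C(B) to treat them as categorical factors.
model = ols('Y ~ A + B + A:B', data=data).fit()

# Perform the ANOVA (type I sums of squares by default); the F-statistics and
# p-values are in the 'F' and 'PR(>F)' columns of the table
anova_table = sm.stats.anova_lm(model)

# Extract the sums of squares for the main effects and the interaction
main_effect_A = anova_table['sum_sq']['A']
main_effect_B = anova_table['sum_sq']['B']
interaction_effect = anova_table['sum_sq']['A:B']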

print("Main effect A:", main_effect_A)
print("Main effect B:", main_effect_B)
print("Interaction effect:", interaction_effect)

Main effect A: 28.900000000000002
Main effect B: 12.50000000000001
Interaction effect: 1.9721522630525295e-31


# Q6

In [None]:
# In the given scenario, you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. Based on these results, we 
# can draw the following conclusions:

In [None]:
# 1. Differences between groups: The F-statistic measures the ratio of the between-group variability to the within-group variability.
# An F-statistic of 5.23 means the between-group mean square is about 5.2 times the within-group mean square. A larger F-value is stronger
# evidence against equal group means, but whether it is significant depends on the degrees of freedom and is judged through the p-value.

In [None]:
# 2. Statistical significance: The p-value of 0.02 indicates that the observed differences between the groups are statistically significant.
# The p-value represents the probability of obtaining the observed data or more extreme data if the null hypothesis (no group differences) 
# were true. A p-value less than the chosen significance level (e.g., 0.05) suggests that the differences between the groups are unlikely to 
# have occurred by chance alone.

In [None]:
# Interpreting these results, we can conclude that there are significant differences between the groups based on the one-way ANOVA. 
# This implies that at least one group mean differs significantly from the means of the other groups. However, it does not provide specific 
# information about which group means are different from each other. To identify the specific group differences, further post hoc tests or 
# pairwise comparisons may be conducted.
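
In [None]:
# The link between an F-statistic and its p-value depends on the degrees of freedom, which the scenario does not state. The sketch below
# assumes 3 groups and 30 observations in total (hypothetical values), so the resulting p-value need not equal the 0.02 reported in the
# scenario; it only illustrates how the decision is reached.

```python
from scipy import stats

f_statistic = 5.23
df_between = 2    # assumption: 3 groups, so k - 1 = 2
df_within = 27    # assumption: 30 observations, so N - k = 27
alpha = 0.05

# Upper-tail probability of the F distribution at the observed statistic
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value that the F-statistic must exceed at the chosen alpha
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

reject_null = f_statistic > f_critical
print("p-value under the assumed degrees of freedom:", p_value)
print("critical F at alpha = 0.05:", f_critical)
print("reject H0:", reject_null)
```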

# Q7

In [None]:
# Handling missing data in a repeated measures ANOVA requires careful consideration as it can impact the validity and reliability of the analysis. 
# Here are some common approaches to handle missing data in a repeated measures ANOVA and their potential consequences:

In [None]:
# 1. Complete Case Analysis (Listwise deletion): This approach involves excluding any cases with missing data from the analysis. 
# The consequence of this method is a reduction in sample size, potentially leading to loss of statistical power and potentially biased 
# results if the missingness is related to the variables of interest.

In [None]:
# 2. Pairwise Deletion: This approach uses all available data for each pairwise comparison, even if some cases have missing data
# for other variables. The consequence of this method is that different comparisons may be based on different sample sizes, which affects
# the precision and power of each comparison and makes them harder to combine. However, this method uses more data than complete case analysis.

In [None]:
# 3. Mean Substitution: This approach involves replacing missing values with the mean value of the corresponding variable. The consequence of
# this method is that it underestimates the variance of the data, which makes standard errors too small and can distort the test results.

In [None]:
# 4. Last Observation Carried Forward (LOCF): This approach involves replacing missing values with the last observed value from the same
# individual. The consequence of this method is that it assumes the individual's value would have stayed unchanged after the last observation,
# which is rarely plausible when the outcome changes over time and can bias the estimated effect in either direction.

In [None]:
# 5. Multiple Imputation: This approach involves creating multiple plausible imputed values for each missing data point based on the observed 
# data and a statistical model. The consequence of this method is that it accounts for the uncertainty associated with the missing data, provides 
# more valid and efficient estimates, and can lead to more accurate statistical inference.
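
In [None]:
# A small sketch of the first, third and fourth approaches using pandas. The wide-format table (one row per subject, one column per time
# point) and its values are hypothetical. Multiple imputation is not shown because it needs a dedicated imputation model.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures: rows are subjects, columns are time points
wide = pd.DataFrame(
    {
        "t1": [5.0, 6.0, 7.0, 8.0],
        "t2": [5.5, np.nan, 7.5, 8.5],
        "t3": [6.0, 6.5, np.nan, 9.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# 1. Complete case analysis: drop every subject with any missing value
listwise = wide.dropna()

# 3. Mean substitution: fill each time point with that column's observed mean
mean_imputed = wide.fillna(wide.mean())

# 4. LOCF: carry each subject's previous time point forward along the row
locf = wide.ffill(axis=1)

print("Subjects kept after listwise deletion:", list(listwise.index))
print("Variance of t2, observed values only:", wide["t2"].var())
print("Variance of t2 after mean substitution:", mean_imputed["t2"].var())
print(locf)
```

# Comparing the two variances of t2 shows how mean substitution shrinks the spread of the data.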

# Q8

In [None]:
# 1. Tukey's Honestly Significant Difference (HSD) Test: Tukey's HSD test compares all possible pairs of group means and provides a simultaneous 
# confidence interval for each pairwise difference. It controls for the family-wise error rate, making it suitable when conducting multiple 
# comparisons. Tukey's HSD is generally recommended when the sample sizes are equal or similar across groups.

In [None]:
# 2. Bonferroni Correction: The Bonferroni correction is a conservative approach that adjusts the significance level for each pairwise comparison.
# The overall significance level is divided by the number of comparisons being made (e.g., 0.05 / 6 for six comparisons). It controls the
# family-wise error rate and works best for a small number of planned comparisons; with many comparisons it loses considerable power.

In [None]:
# 3. Scheffé's Test: Scheffé's test is a conservative post-hoc test that controls the family-wise error rate over all possible contrasts,
# not only pairwise differences. This makes it less powerful than Tukey's HSD for purely pairwise comparisons. Scheffé's test is useful
# when complex contrasts (e.g., the average of two groups against a third) are of interest or when the sample sizes are unequal.

In [None]:
# 4. Dunnett's Test: Dunnett's test is used when comparing multiple treatment groups to a control group. It controls the overall type I error 
# rate when conducting multiple comparisons against a single control group.

In [None]:
# 5. Games-Howell Test: The Games-Howell test is a robust post-hoc test that does not assume equal variances or equal sample sizes across groups. 
# It is useful when the assumption of homogeneity of variances is violated.

In [None]:
# Example scenario: Let's say you conducted a study comparing the effectiveness of four different teaching methods (A, B, C, and D) on student 
# performance. You performed a one-way ANOVA and found a significant overall effect (p < 0.05). Now, you want to determine which specific pairs 
# of teaching methods differ significantly from each other. In this situation, you would apply a post-hoc test, such as Tukey's HSD or 
# Scheffé's test, to compare all possible pairs of teaching methods and identify significant differences between them. These post-hoc tests 
# would provide specific information about which teaching methods yield significantly different outcomes, allowing for a more nuanced 
# interpretation of the study results.
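
In [None]:
# The teaching-method scenario could be sketched as follows with Tukey's HSD from statsmodels. The scores are hypothetical values made up
# for illustration, and the Bonferroni-adjusted level at the end is shown only for comparison.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical test scores for four teaching methods (five students each)
scores = {
    "A": [70, 72, 68, 71, 69],
    "B": [75, 78, 74, 77, 76],
    "C": [80, 82, 79, 81, 83],
    "D": [71, 70, 73, 69, 72],
}

# Flatten into one array of values and one matching array of group labels
values = np.concatenate([np.array(v, dtype=float) for v in scores.values()])
labels = np.concatenate([[name] * len(v) for name, v in scores.items()])

# Tukey's HSD compares every pair of groups at a family-wise alpha of 0.05
result = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(result.summary())

# Bonferroni alternative: per-comparison level for the same number of pairs
n_comparisons = len(result.reject)
bonferroni_alpha = 0.05 / n_comparisons
print("Number of pairwise comparisons:", n_comparisons)
print("Bonferroni-adjusted significance level:", bonferroni_alpha)
```

# The "reject" column of the summary marks the pairs whose means differ at the family-wise 0.05 level.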

# Q9

In [3]:
import numpy as np
from scipy.stats import f_oneway

# Weight loss data for each diet
diet_A = np.array([2.1, 1.8, 2.5, 1.9, 2.3, 2.0, 1.7, 1.5, 1.8, 1.9,
                   2.1, 2.4, 1.8, 2.2, 2.5, 2.3, 2.0, 1.9, 2.1, 1.8,
                   2.5, 1.9, 2.3, 2.0, 1.7, 1.5, 1.8, 1.9, 2.1, 2.4,
                   1.8, 2.2, 2.5, 2.3, 2.0, 1.9, 2.1, 1.8, 2.5, 1.9,
                   2.3, 2.0, 1.7, 1.5, 1.8, 1.9, 2.1, 2.4, 1.8, 2.2])

diet_B = np.array([2.4, 2.1, 2.7, 2.6, 2.9, 2.5, 2.3, 2.2, 2.1, 2.4,
                   2.0, 2.6, 2.3, 2.1, 2.7, 2.6, 2.9, 2.5, 2.3, 2.2,
                   2.1, 2.4, 2.0, 2.6, 2.3, 2.1, 2.7, 2.6, 2.9, 2.5,
                   2.3, 2.2, 2.1, 2.4, 2.0, 2.6, 2.3, 2.1, 2.7, 2.6,
                   2.9, 2.5, 2.3, 2.2, 2.1, 2.4, 2.0, 2.6, 2.3, 2.1])

diet_C = np.array([3.0, 2.7, 3.2, 2.8, 3.1, 2.9, 3.2, 3.0, 2.8, 3.1,
                   2.7, 3.2, 2.9, 3.0, 2.8, 3.1, 2.9, 3.2, 3.0, 2.8,
                   3.1, 2.7, 3.2, 2.9, 3.0, 2.8, 3.1, 2.9, 3.2, 3.0,
                   2.8, 3.1, 2.7, 3.2, 2.9, 3.0, 2.8, 3.1, 2.9, 3.2,
                   3.0, 2.8, 3.1, 2.7, 3.2, 2.9, 3.0, 2.8, 3.1, 2.9])

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)

# Print the results
print("F-Statistic:", f_statistic)
print("p-value:", p_value)

F-Statistic: 195.83023803912337
p-value: 3.5169279849919386e-42


In [4]:
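# H0: The mean weight loss is the same for all three diets.
# H1: At least one diet has a different mean weight loss.
# Since the p-value (about 3.5e-42) is far below 0.05, we reject H0: the diets differ in mean weight loss.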

# Q11

In [7]:
import numpy as np
import scipy.stats as stats

# Generate random test scores for the control and experimental groups
np.random.seed(42)
control_group = np.random.normal(loc=70, scale=10, size=100)
experimental_group = np.random.normal(loc=75, scale=10, size=100)

# Perform the two-sample t-test
t_stat, p_value = stats.ttest_ind(control_group, experimental_group)

# Print the t-statistic and p-value
print("Two-sample t-test results:")
print("t-statistic:", t_stat)
print("p-value:", p_value)

# Prepare the data for a post-hoc test (e.g., Tukey's HSD) if the result is significant.
# Note: with only two groups the t-test is already the only pairwise comparison,
# so a post-hoc test adds no new information here.
if p_value < 0.05:
    # Combine the data from both groups
    combined_data = np.concatenate([control_group, experimental_group])

    # Create a group label for each observation
    group_labels = np.array(['control'] * len(control_group) + ['experimental'] * len(experimental_group))

Two-sample t-test results:
t-statistic: -4.754695943505281
p-value: 3.819135262679478e-06


# Q12

In [10]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

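# Create a DataFrame with the data.
# Note: the same 30 sales values are repeated for all three stores, so the stores are
# identical by construction and no store effect can be detected.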
data = {
    'Day': list(range(1, 31)) * 3,
    'Store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30,
    'Sales': [50, 55, 60, 58, 57, 54, 52, 55, 59, 60, 55, 53, 54, 58, 56, 55, 53, 52,
              45, 48, 46, 44, 47, 50, 45, 43, 42, 44, 49, 50] * 3
}

df = pd.DataFrame(data)

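# Approximate a one-way repeated measures ANOVA with an additive OLS model:
# Store is the within-subject factor and C(Day) acts as the blocking (subject) factor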
model = ols('Sales ~ Store + C(Day)', data=df).fit()
anova_table = sm.stats.anova_lm(model)

# Print the ANOVA table
print(anova_table)

# Perform the post-hoc test (note: Tukey's HSD here ignores the Day blocking)
posthoc = pairwise_tukeyhsd(df['Sales'], df['Store'])
print(posthoc)

            df        sum_sq       mean_sq             F    PR(>F)
Store      2.0  6.276177e-27  3.138089e-27  5.150825e-01  0.600158
C(Day)    29.0  2.510900e+03  8.658276e+01  1.421160e+28  0.000000
Residual  58.0  3.533592e-25  6.092400e-27           NaN       NaN
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj  lower  upper  reject
--------------------------------------------------
     A      B      0.0   1.0 -3.3075 3.3075  False
     A      C      0.0   1.0 -3.3075 3.3075  False
     B      C      0.0   1.0 -3.3075 3.3075  False
--------------------------------------------------
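
In [None]:
# Because the sales values above are identical for every store, the Store sum of squares and the residual are zero up to floating-point
# error, so the table cannot show a store effect. The sketch below uses hypothetical sales that do differ between stores (six days
# instead of thirty) and fits the model with statsmodels' AnovaRM, treating each day as the repeated "subject". This is one possible
# way to run the analysis, not the method used above.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical daily sales for three stores over six days (illustrative values)
sales = {
    "A": [50, 55, 60, 58, 57, 54],
    "B": [52, 58, 61, 60, 60, 55],
    "C": [45, 52, 55, 51, 54, 50],
}

# Long format: one row per (Day, Store) combination
rows = [
    {"Day": day, "Store": store, "Sales": value}
    for store, values in sales.items()
    for day, value in enumerate(values, start=1)
]
df = pd.DataFrame(rows)

# Repeated measures ANOVA: Day is the subject, Store is the within-subject factor
result = AnovaRM(data=df, depvar="Sales", subject="Day", within=["Store"]).fit()
rm_table = result.anova_table
print(rm_table)
```

# The row for Store reports the F-statistic, its degrees of freedom and the p-value for differences between stores.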
