**Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.**

**Assumptions for ANOVA:**
   - **Independence:** The observations within each group are independent of each other.
   - **Normality:** The residuals (the differences between the observed values and the predicted values) are normally distributed.
   - **Homogeneity of variances:** The variance of the residuals is constant across all levels of the independent variable(s).

**Examples of violations:**
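   - **Non-independence:** Repeated measurements on the same subjects, or students clustered within classrooms, produce correlated observations; a standard ANOVA then typically understates the error variance and the F-test becomes too liberal.
   - **Non-normality:** If the residuals are strongly skewed or heavy-tailed, the p-values and confidence intervals become unreliable, particularly with small samples.
   - **Heterogeneity of variances:** Unequal variances across groups distort the F-test, especially when group sizes are also unequal. The Type I error rate can be inflated (differences are detected that are not truly present), or the test can lose power (real differences are missed).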
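
The normality and equal-variance assumptions can be checked before running the ANOVA. The sketch below uses simulated data and variable names chosen purely for illustration; it applies the Shapiro-Wilk test to the residuals and Levene's test to the groups, both from `scipy.stats`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical measurements for three independent groups
group_a = rng.normal(10, 2, 30)
group_b = rng.normal(12, 2, 30)
group_c = rng.normal(11, 2, 30)
groups = [group_a, group_b, group_c]

# In a one-way ANOVA the residuals are deviations from each group's own mean
residuals = np.concatenate([g - g.mean() for g in groups])

# Shapiro-Wilk: a small p-value suggests the residuals are not normal
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Levene: a small p-value suggests the group variances are not equal
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

Independence cannot be tested this way; it has to be judged from how the data were collected.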

---
**Q2. What are the three types of ANOVA, and in what situations would each be used?**

**Types of ANOVA:**
   - **One-way ANOVA:** Used when comparing the means of three or more independent groups to determine if they are significantly different from each other.
   - **Two-way ANOVA:** Used when there are two independent variables (factors) and both their main effects and their interaction effect on a dependent variable need to be analyzed.
   - **N-way ANOVA:** Generalization of ANOVA for more than two independent variables.

---
**Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?**

**Partitioning of Variance:**
   - In ANOVA, the total variance observed in the data is partitioned into different sources: the variance explained by the factors (explained variance) and the variance not explained by the factors (residual variance).
   - Understanding this concept is important because it helps in assessing the relative importance of the factors and their interaction in explaining the variability in the dependent variable.
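
For a one-way design the partition can be written out explicitly. As a sketch, writing SST for the total sum of squares, SSE for the explained part, and SSR for the residual part, with $ k $ groups and $ n $ observations in total:

$$
\text{SST} = \text{SSE} + \text{SSR}, \qquad F = \frac{\text{SSE} / (k - 1)}{\text{SSR} / (n - k)}
$$

The F-statistic is therefore the ratio of the explained variance per degree of freedom to the residual variance per degree of freedom.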

---
**Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?**

1. Total Sum of Squares
   $ (SST) = \sum_{i=1}^{n} (y_i - \bar{y})^2 $
   
   Where:
   - $ y_i $ represents each observed value.
   - $ \bar{y} $ represents the overall mean of all observed values.
   - $ n $ is the total number of observations.<br>

2. Explained Sum of Squares
   $ (SSE) = \sum_{j=1}^{k} n_j (\bar{y}_j - \bar{y})^2 $
   
   Where:
   - $ n_j $ is the number of observations in the $ j^{th} $ group.
   - $ \bar{y}_j $ is the mean of the $ j^{th} $ group.
   - $ k $ is the total number of groups.<br>

3. Residual Sum of Squares
   $ (SSR) = \sum_{i=1}^{n} (y_i - \bar{y}_j)^2 $
   
   Where:
   - $ \bar{y}_j $ is the mean of the group to which the $ i^{th} $ observation belongs.

These formulas give SST, SSE, and SSR for a one-way ANOVA, and they satisfy SST = SSE + SSR. The same quantities can be computed in Python:

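In [1]:
import numpy as np

# Example: three groups with three observations each (illustrative values)
groups = [
    np.array([2, 4, 6]),
    np.array([8, 10, 12]),
    np.array([14, 16, 18]),
]
data = np.concatenate(groups)

# Calculate the overall mean
overall_mean = np.mean(data)

# Calculate SST: squared deviations of every observation from the overall mean
SST = np.sum((data - overall_mean) ** 2)

# Calculate SSE: squared deviations of each group mean from the overall mean,
# weighted by the group size
SSE = sum(len(g) * (np.mean(g) - overall_mean) ** 2 for g in groups)

# Calculate SSR: squared deviations of every observation from its own group mean
SSR = sum(np.sum((g - np.mean(g)) ** 2) for g in groups)

print("Total Sum of Squares (SST):", SST)
print("Explained Sum of Squares (SSE):", SSE)
print("Residual Sum of Squares (SSR):", SSR)

Total Sum of Squares (SST): 240.0
Explained Sum of Squares (SSE): 216.0
Residual Sum of Squares (SSR): 24.0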
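
The sums of squares lead directly to the F-statistic. The following sketch, on hypothetical three-group data, builds F from SSE and SSR by hand and compares it with `scipy.stats.f_oneway`; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of three observations
groups = [np.array([2, 4, 6]), np.array([8, 10, 12]), np.array([14, 16, 18])]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

k = len(groups)      # number of groups
n = all_values.size  # total number of observations

# Explained and residual sums of squares
sse = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssr = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the two mean squares
f_manual = (sse / (k - 1)) / (ssr / (n - k))
p_manual = stats.f.sf(f_manual, k - 1, n - k)

# Reference result from SciPy
f_scipy, p_scipy = stats.f_oneway(*groups)

print("Manual F:", f_manual, "SciPy F:", f_scipy)
print("Manual p:", p_manual, "SciPy p:", p_scipy)
```

Both routes should agree up to floating-point rounding, which is a quick way to confirm that the sums of squares were computed correctly.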


---
**Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?**

**Calculating Main and Interaction Effects in Two-Way ANOVA with Python**

**1. Import libraries**

**2. Prepare your data**

* Organize your data in a pandas DataFrame with columns representing factors and the dependent variable.

**3. Calculate means:**

* Calculate the mean value of the dependent variable for each level of each factor and the overall mean.
* You can use functions like `groupby` in pandas to efficiently achieve this.

**4. Define functions for calculations:**

* Define separate functions to calculate the main effect of each factor and the interaction effect.
* These functions should consider the overall mean, individual factor means, and potentially group-wise means depending on the interaction effect calculation method.

**5. Calculate effects:**

* Apply the defined functions to your data and factor means to obtain the numerical values for each effect.

**6. Interpret the results:**

* Consider the magnitude and direction of the calculated effects along with statistical tests (e.g., F-test) to draw conclusions about the significance of each factor and their interaction.

In [10]:
import numpy as np
import pandas as pd

# Sample data
data = {
    "Factor1": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "Factor2": ["Low", "High", "Low", "Low", "High", "Low", "Low", "High", "Low"],
    "Value": [20, 25, 18, 22, 28, 23, 15, 20, 17]
}

# Create DataFrame
df = pd.DataFrame(data)

# Calculate means
factor1_means = df.groupby("Factor1")["Value"].mean()
factor2_means = df.groupby("Factor2")["Value"].mean()
overall_mean = df["Value"].mean()

# Define functions
# Each main effect is the sum over levels of n_level * (level mean - overall mean)^2
def main_effect_1(data, factor1_means, overall_mean):
    return np.sum([
        (factor1_means[level] - overall_mean)**2 * data[data["Factor1"] == level].shape[0]
        for level in factor1_means.index
    ])

def main_effect_2(data, factor2_means, overall_mean):
    return np.sum([
        (factor2_means[level] - overall_mean)**2 * data[data["Factor2"] == level].shape[0]
        for level in factor2_means.index
    ])

# The interaction uses the mean of each Factor1 x Factor2 cell:
# n_cell * (cell mean - Factor1 mean - Factor2 mean + overall mean)^2
def interaction_effect(data, factor1_means, factor2_means, overall_mean):
    cells = data.groupby(["Factor1", "Factor2"])["Value"].agg(["mean", "count"])
    return np.sum([
        cell["count"] * (cell["mean"] - factor1_means[level1] - factor2_means[level2] + overall_mean)**2
        for (level1, level2), cell in cells.iterrows()
    ])

# Calculate effects
main_effect_1_value = main_effect_1(df, factor1_means, overall_mean)
main_effect_2_value = main_effect_2(df, factor2_means, overall_mean)
interaction_effect_value = interaction_effect(df, factor1_means, factor2_means, overall_mean)

# Print results (sums of squares, rounded to four decimals)
print(f"Main effect of Factor 1: {main_effect_1_value:.4f}")
print(f"Main effect of Factor 2: {main_effect_2_value:.4f}")
print(f"Interaction effect: {interaction_effect_value:.4f}")

Main effect of Factor 1: 73.5556
Main effect of Factor 2: 53.3889
Interaction effect: 1.4444
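
The values above are sums of squares, not significance tests. As a sketch of how the corresponding F-tests could be obtained, the same sample data can be passed to `statsmodels`; the formula and the choice of Type II sums of squares are assumptions made for this illustration.

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Same sample data as in the manual calculation
df = pd.DataFrame({
    "Factor1": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "Factor2": ["Low", "High", "Low", "Low", "High", "Low", "Low", "High", "Low"],
    "Value": [20, 25, 18, 22, 28, 23, 15, 20, 17],
})

# 'C(Factor1) * C(Factor2)' expands to both main effects plus their interaction
model = ols("Value ~ C(Factor1) * C(Factor2)", data=df).fit()

# ANOVA table with sums of squares, degrees of freedom, F and p-values
table = anova_lm(model, typ=2)
print(table)
```

With only nine observations the residual has three degrees of freedom, so the F-tests on this toy data have very little power.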


---
**Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.**

**What can you conclude about the differences between the groups, and how would you interpret these results?**

**Overall differences:**

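* **F-statistic of 5.23:** The variance between the group means is about 5.2 times the variance within the groups. The F-statistic is a test statistic, not an effect size, so on its own it does not show how large the differences are.
* **p-value of 0.02:** This is statistically significant at the 0.05 level. If all group means were truly equal, an F-statistic at least this large would occur in only about 2% of samples, so the null hypothesis of equal means is rejected.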

**However, you cannot definitively conclude:**

* **Which specific groups differ:** One-way ANOVA only tells you if there's an overall difference, not which specific groups differ from each other.

**Further analysis is needed:**

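* Post-hoc tests (e.g., Tukey's HSD) to identify which pairs of groups differ
* Effect size (e.g., eta squared) to judge how large the differences are
* Visualization (e.g., box plots of each group) to inspect the pattern of differences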

**Interpretation:**

The one-way ANOVA results suggest a statistically significant difference between the groups (F = 5.23, p = 0.02). However, post-hoc tests are needed to identify specific groups that differ and assess the practical significance of these differences.
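
An effect size can be recovered from the F-statistic once the degrees of freedom are known. The question does not report them, so the sketch below uses hypothetical values (2 and 27, i.e. three groups and 30 observations); `summarize_f_test` is an illustrative helper, not a library function.

```python
from scipy import stats

def summarize_f_test(f_stat, df_between, df_within):
    """Return the p-value and eta squared implied by a one-way ANOVA F-statistic."""
    # Upper-tail probability of the F distribution
    p_value = stats.f.sf(f_stat, df_between, df_within)
    # Eta squared = SS_between / SS_total, rewritten in terms of F and the df
    eta_squared = (f_stat * df_between) / (f_stat * df_between + df_within)
    return p_value, eta_squared

# Hypothetical degrees of freedom: 3 groups, 30 observations in total
p_value, eta_squared = summarize_f_test(5.23, df_between=2, df_within=27)
print(f"p-value: {p_value:.4f}")
print(f"eta squared: {eta_squared:.4f}")
```

Eta squared is the share of the total variance explained by group membership, which complements the p-value when judging practical significance.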

---
**Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?**

In a repeated measures ANOVA, missing data can be handled using various methods, each with its own implications:

1. **Complete Case Analysis (CCA)**:
   - This approach involves excluding cases with missing data on any of the variables involved in the analysis.
   - **Pros:** Simple and straightforward.
   - **Cons:** Can lead to loss of statistical power, biased results if missing data is not completely random, and reduced generalizability if the missing data are related to the outcome variable.
   
2. **Imputation**:
   - Each missing value is replaced with a single estimated value, using methods such as mean imputation, median imputation, or regression imputation.
   - **Pros:** Retains the full sample size and is simple to implement; regression imputation can reduce bias when the data are missing at random (MAR).
   - **Cons:** Treats imputed values as if they were observed, so it underestimates variability and standard errors, which can inflate Type I error rates; it can also introduce bias if the missing data mechanism is non-ignorable.

3. **Maximum Likelihood Estimation (MLE)**:
   - MLE involves modeling the covariance structure of the data and estimating parameters using the observed data likelihood function.
   - **Pros:** Utilizes all available data, provides unbiased estimates under the assumption of MAR, and allows for valid statistical inference.
   - **Cons:** Can be computationally intensive, requires specifying a correct covariance structure, and may be sensitive to model misspecification.

4. **Multiple Imputation (MI)**:
   - MI involves creating multiple imputed datasets, analyzing each dataset separately, and combining the results.
   - **Pros:** Accounts for uncertainty due to missing data, provides more reliable estimates compared to single imputation methods, and allows for valid statistical inference.
   - **Cons:** Requires assumptions about the missing data mechanism, can be computationally intensive, and may be challenging to implement correctly.

The choice of method depends on factors such as the nature of the missing data, the underlying assumptions, and the goals of the analysis. However, it's important to acknowledge that different methods can lead to different results, and sensitivity analyses should be conducted to assess the robustness of the findings.
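
The difference between discarding incomplete subjects and using all observed rows can be made concrete. The sketch below simulates long-format data (all names and values are hypothetical), removes a few observations, and contrasts complete case analysis with a random-intercept mixed model from `statsmodels`, which is one likelihood-based option rather than the only one.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Hypothetical long-format data: 20 subjects measured at 3 time points
subjects = np.repeat(np.arange(20), 3)
time = np.tile(["t1", "t2", "t3"], 20)
subject_effect = np.repeat(rng.normal(0, 2, size=20), 3)
time_effect = np.tile([0.0, 1.0, 2.0], 20)
score = 50 + subject_effect + time_effect + rng.normal(0, 1, size=60)
long_df = pd.DataFrame({"subject": subjects, "time": time, "score": score})

# Remove 8 observations at random to mimic missing measurements
missing_rows = rng.choice(long_df.index, size=8, replace=False)
observed = long_df.drop(index=missing_rows)

# Complete case analysis keeps only subjects with all three measurements
counts = observed.groupby("subject")["score"].transform("count")
complete_cases = observed[counts == 3]

# A random-intercept mixed model uses every observed row
mixed = smf.mixedlm("score ~ C(time)", data=observed, groups=observed["subject"]).fit()

print("Subjects kept by complete case analysis:", complete_cases["subject"].nunique())
print("Rows used by the mixed model:", len(observed))
print(mixed.params)
```

Complete case analysis discards every measurement of a subject with any gap, whereas the mixed model keeps the partial records and remains valid under MAR.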

---
**Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.**

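1. **Tukey's HSD Test**: Compares all pairs of group means while controlling the family-wise error rate. Use it when every pairwise comparison is of interest; it assumes homogeneous variances, and the Tukey-Kramer version handles unequal sample sizes.

2. **Bonferroni Correction**: Divides the significance level by the number of comparisons to control the family-wise error rate. Use it for a small set of planned comparisons; it becomes very conservative as the number of comparisons grows.

3. **Sidak Correction**: Similar to Bonferroni but slightly less conservative; it assumes the comparisons are independent.

4. **Dunnett's Test**: Compares each treatment group to a single control group while controlling the family-wise error rate. Use it when only comparisons against the control matter.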

In a research study investigating the effect of different exercise regimens on weight loss, an ANOVA may initially be conducted to analyze whether there are significant differences in weight loss among three or more exercise groups (e.g., aerobic, strength training, and combined exercise). If the ANOVA yields a significant result, indicating that at least one group's mean weight loss differs from the others, a post-hoc test such as Tukey's HSD or Bonferroni correction would be necessary to determine which specific exercise groups differ significantly from each other. This helps to identify the most effective exercise regimen(s) for weight loss while controlling for Type I error.
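
As a sketch of this workflow, the code below simulates weight loss for the three exercise groups (the numbers are invented for illustration), runs the one-way ANOVA, and then applies Tukey's HSD with `pairwise_tukeyhsd` from `statsmodels`.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(42)

# Hypothetical weight loss (kg) for 15 participants per regimen
aerobic = rng.normal(4.0, 1.0, 15)
strength = rng.normal(3.0, 1.0, 15)
combined = rng.normal(6.0, 1.0, 15)

# Omnibus test: is there any difference among the three means?
f_stat, p_value = stats.f_oneway(aerobic, strength, combined)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")

# Tukey's HSD compares every pair of groups at a family-wise alpha of 0.05
values = np.concatenate([aerobic, strength, combined])
labels = np.array(["aerobic"] * 15 + ["strength"] * 15 + ["combined"] * 15)
tukey = pairwise_tukeyhsd(values, labels, alpha=0.05)
print(tukey)
```

The `reject` column of the printed table marks the pairs whose means differ significantly after the adjustment.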

---
**Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.**

In [15]:
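import pandas as pd
import scipy.stats as stats

# Each column holds the weight loss of the participants on one diet
df = pd.read_csv("weight_loss_data.csv")

# Drop empty cells so that groups of different sizes do not introduce NaN values
diet_A = df["Diet_A"].dropna()
diet_B = df["Diet_B"].dropna()
diet_C = df["Diet_C"].dropna()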

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)
print("F-statistic:", f_statistic)
print("p-value:", p_value)

if p_value < 0.05:
    print("The one-way ANOVA test indicates that there is a significant difference in mean weight loss between the three diets.")
else:
    print("The one-way ANOVA test does not find significant evidence to reject the null hypothesis, suggesting no significant difference in mean weight loss between the three diets.")

F-statistic: 849.9285618143902
p-value: 1.640501512893696e-81
The one-way ANOVA test indicates that there is a significant difference in mean weight loss between the three diets.


---
**Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.**

In [11]:
import pandas as pd
import numpy as np
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("ques10.csv")
df.head()

Unnamed: 0,Software,Experience,Time
0,A,Novice,25
1,A,Experienced,22
2,A,Novice,28
3,B,Novice,21
4,B,Experienced,19


In [10]:
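# Fit a linear model with both main effects and their interaction
model = ols('Time ~ Software + Experience + Software:Experience', data=df).fit()

# anova_lm uses Type I (sequential) sums of squares by default
table = anova_lm(model)
print(table)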

                       df      sum_sq     mean_sq          F    PR(>F)
Software              2.0  141.838889   70.919444  13.134920  0.000140
Experience            1.0  156.816667  156.816667  29.043859  0.000016
Software:Experience   2.0   42.961111   21.480556   3.978392  0.032194
Residual             24.0  129.583333    5.399306        NaN       NaN


**Interpretations:**
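- **Software:** F(2, 24) = 13.13, p = 0.00014. Task completion time differs significantly among the software programs.
- **Experience:** F(1, 24) = 29.04, p = 0.000016. Task completion time differs significantly between novice and experienced employees.
- **Software:Experience:** F(2, 24) = 3.98, p = 0.032. The interaction is significant at the 0.05 level, meaning the effect of the software program on task completion time depends on the employee's experience level, and vice versa. Because of this interaction, the main effects should be interpreted with caution.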
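
A significant interaction is easiest to read from the cell means. The sketch below uses hypothetical task times (not the contents of `ques10.csv`) and a pandas pivot table to show how the novice-experienced gap can differ between programs.

```python
import pandas as pd

# Hypothetical task times in minutes; invented for illustration only
df = pd.DataFrame({
    "Software": ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"],
    "Experience": ["Novice", "Novice", "Experienced", "Experienced"] * 3,
    "Time": [25, 28, 22, 21, 21, 23, 19, 18, 30, 32, 20, 22],
})

# Mean time for every Software x Experience cell
cell_means = df.pivot_table(values="Time", index="Software",
                            columns="Experience", aggfunc="mean")

# Novice minus experienced gap within each program
cell_means["Gap"] = cell_means["Novice"] - cell_means["Experienced"]
print(cell_means)
```

If the gap were the same for every program, the lines in an interaction plot would be parallel and no interaction would be present; unequal gaps are what the interaction term captures.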

---
**Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.**

To conduct a two-sample t-test in Python and follow up with a post-hoc test if the results are significant, you can follow these steps:

1. **Perform the Two-Sample t-test**:
   - Use `scipy.stats.ttest_ind` function to perform the two-sample t-test.
   - Calculate the t-statistic and p-value.

2. **Check Significance**:
   - If the p-value is below your chosen significance level (e.g., α = 0.05), then there is a significant difference in test scores between the two groups.

3. **Post-hoc Test (if significant)**:
   - With only two groups, a significant t-test already shows which groups differ, so a post-hoc test adds no new information.
   - Tukey's HSD is included here because the question asks for a follow-up; with two groups it yields the same p-value as the pooled-variance t-test.

Here's how we can implement it in Python:

In [7]:
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Example data
control_group = [80, 85, 90, 75, 78, 82, 79, 81, 88, 86]
experimental_group = [85, 88, 92, 80, 83, 87, 84, 89, 90, 91]

# Perform two-sample t-test
t_stat, p_value = ttest_ind(control_group, experimental_group)

if p_value < 0.05:
    print("There is a significant difference in test scores between the two groups.")
    # Perform post-hoc test
    data = pd.DataFrame({'score': control_group + experimental_group,
                         'group': ['Control'] * len(control_group) + ['Experimental'] * len(experimental_group)})
    posthoc = pairwise_tukeyhsd(data['score'], data['group'])
    print(posthoc)
else:
    print("There is no significant difference in test scores between the two groups.")

There is a significant difference in test scores between the two groups.
   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj  lower  upper  reject
---------------------------------------------------------
Control Experimental      4.5 0.0316 0.4449 8.5551   True
---------------------------------------------------------


---
**Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a posthoc
test to determine which store(s) differ significantly from each other.**

To conduct a repeated measures ANOVA in Python and follow up with a post-hoc test if the results are significant, you can follow these steps:

1. **Perform the Repeated Measures ANOVA**:
   - Use the `AnovaRM` class from the `statsmodels` library to perform the repeated measures ANOVA.
   - Prepare your data in a long format where each row represents a single observation (i.e., sales for each store on each day).
   - `AnovaRM` takes the dependent variable, a subject identifier, and the within-subject factor(s); it requires a balanced design with one observation per subject and condition.

2. **Check Significance**:
   - If the p-value from the ANOVA is below your chosen significance level (e.g., α = 0.05), then there are significant differences in sales between the three stores.

3. **Post-hoc Test (if significant)**:
   - If the ANOVA is significant, conduct a post-hoc test to determine which store(s) differ significantly from each other.
   - For repeated measures, paired comparisons with a Bonferroni (or similar) correction are the usual choice; Tukey's HSD assumes independent groups.

In [1]:
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from pingouin import pairwise_ttests
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Long format: one row per (Store, Day) combination with a 'Sales' column
df = pd.read_csv("sales_data.csv")

# Perform repeated measures ANOVA
# Note: with within=['Day'] and subject='Store', the F-test reported below is
# for the Day factor. Testing for differences between stores, as the question
# asks, requires within=['Store'] and subject='Day'.
aov = AnovaRM(data=df, depvar='Sales', within=['Day'], subject='Store')
res_anova = aov.fit()

print(res_anova)

# Check if the ANOVA results are significant
# (.iloc[0] selects the first row of the table by position)
if res_anova.anova_table['Pr > F'].iloc[0] < 0.05:
    # Perform post-hoc tests
    posthoc_res = pairwise_ttests(data=df, dv='Sales', within='Day', subject='Store', parametric=True, padjust='bonf')
    print(posthoc_res)

    # Perform Tukey's HSD test for multiple comparisons
    tukey_res = pairwise_tukeyhsd(df['Sales'], df['Store'])
    print(tukey_res)
else:
    print("ANOVA results are not significant. Post-hoc tests are not performed.")

              Anova
    F Value  Num DF  Den DF Pr > F
----------------------------------
Day  0.8057 29.0000 58.0000 0.7334

ANOVA results are not significant. Post-hoc tests are not performed.
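
The table above has 29 and 58 degrees of freedom, which corresponds to a test of the Day factor. A comparison of the three stores treats Store as the within-subject factor and each Day as the repeated unit. The sketch below shows that setup on simulated sales (all values are hypothetical and unrelated to `sales_data.csv`), followed by Bonferroni-adjusted paired t-tests.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(7)

# Hypothetical sales: 30 days, each day observed at all three stores
day_effect = rng.normal(0, 50, size=30)  # fluctuation shared by all stores
store_means = {"A": 1000, "B": 1020, "C": 1100}
rows = [
    {"Day": day + 1, "Store": store, "Sales": mean + day_effect[day] + rng.normal(0, 20)}
    for day in range(30)
    for store, mean in store_means.items()
]
sales = pd.DataFrame(rows)

# Store is the within factor; each Day is measured under all three stores
res = AnovaRM(data=sales, depvar="Sales", subject="Day", within=["Store"]).fit()
print(res.anova_table)

# Paired t-tests between stores, with Bonferroni-adjusted p-values
wide = sales.pivot(index="Day", columns="Store", values="Sales")
pairs = list(combinations(wide.columns, 2))
raw_p = [ttest_rel(wide[a], wide[b]).pvalue for a, b in pairs]
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: adjusted p = {p:.4f}, reject = {r}")
```

In this layout the ANOVA table has a single `Store` row with 2 and 58 degrees of freedom, which is the test the question asks for.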
