## Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

ANOVA assumes that the observations are independent, that the values within each group (the residuals) are approximately normally distributed, and that the groups have equal variances (homogeneity of variance). Violating these assumptions can distort the F-test and its p-value. Independence is violated when observations are related to one another, for example when the same subjects are measured repeatedly but the data are analyzed as if they came from separate groups. Normality is violated when the data are heavily skewed or contain extreme outliers. Equal variance is violated when the group variances differ substantially, for example when one group's standard deviation is several times larger than another's.
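
As a rough sketch of how two of these assumptions might be checked, the snippet below runs a Shapiro-Wilk test for normality on each group and Levene's test for equal variances. The three samples are made-up, seeded random data used only for illustration. Independence cannot be tested this way; it has to come from the study design.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups drawn from normal distributions (illustration only)
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=10, scale=2, size=30),
    "B": rng.normal(loc=12, scale=2, size=30),
    "C": rng.normal(loc=11, scale=2, size=30),
}

# Normality: Shapiro-Wilk test per group (a small p-value suggests non-normality)
normality_p = {name: stats.shapiro(values)[1] for name, values in groups.items()}

# Equal variances: Levene's test across all groups (a small p-value suggests unequal variances)
levene_stat, levene_p = stats.levene(*groups.values())

print("Shapiro-Wilk p-values:", normality_p)
print("Levene p-value:", levene_p)
```

A small p-value in either test is a warning sign rather than proof of a problem, so these checks are usually read alongside plots of the data.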

## Q2. What are the three types of ANOVA, and in what situations would each be used?

The three types of ANOVA are one-way ANOVA, factorial ANOVA, and repeated measures ANOVA. One-way ANOVA is used when there is a single independent variable (factor) with two or more levels. Factorial ANOVA is used when there are two or more independent variables and both their main effects and their interactions on the dependent variable are of interest. Repeated measures ANOVA is used when the same subjects are measured under each of the conditions being compared.

## Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA refers to splitting the total variability in the data (the total sum of squares) into separate sources of variation, such as the variation between groups and the variation within groups. In a one-way ANOVA these two parts add up exactly to the total. Understanding this concept is important because it lets us quantify how much of the variability in the data can be attributed to each source, and the F-test is built directly on the ratio of these parts.
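
The partition can be verified by hand on a tiny example. The sketch below uses three made-up groups of three values each (purely illustrative) and computes the total, between-group, and within-group sums of squares with NumPy.

```python
import numpy as np

# Hypothetical scores for three groups (illustration only)
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 8.0, 9.0]),
    np.array([1.0, 2.0, 3.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation: squared deviations of every observation from the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()

# Between-group variation: squared deviations of group means from the grand mean,
# weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group variation: squared deviations of observations from their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

print(ss_total, ss_between, ss_within)  # 60.0 54.0 6.0
```

Here the between-group part (54) and the within-group part (6) add up to the total (60), which is the partition that the F-test relies on.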



## Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Read in the data (assumes a numeric column "y" and a grouping column "group")
data = pd.read_csv("data.csv")

# Fit the one-way ANOVA model; C() forces "group" to be treated as categorical
model = ols("y ~ C(group)", data=data).fit()

# Calculate the sums of squares from the ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)
SSE = anova_table.loc["C(group)", "sum_sq"]  # explained (between-group) sum of squares
SSR = anova_table.loc["Residual", "sum_sq"]  # residual (within-group) sum of squares
SST = SSE + SSR                              # total sum of squares


## Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

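import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Read in the data (assumes a numeric column "y" and categorical factors "A" and "B")
data = pd.read_csv("data.csv")

# Fit the two-way ANOVA model with both main effects and their interaction
model = ols("y ~ C(A) + C(B) + C(A):C(B)", data=data).fit()

# Build the ANOVA table: one row (sum_sq, df, F, p-value) per effect
anova_table = sm.stats.anova_lm(model, typ=2)

# Main effects of A and B
main_effects = anova_table.loc[["C(A)", "C(B)"]]

# Interaction effect
interaction_effect = anova_table.loc["C(A):C(B)"]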


## Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

Because the p-value (0.02) is below the usual alpha level of 0.05, we reject the null hypothesis that all group means are equal and conclude that at least one group mean differs from the others. The F-statistic is the ratio of the between-group mean square to the within-group mean square, so F = 5.23 means the between-group variation is about 5.2 times the within-group variation, whereas a ratio near 1 would be expected if the group means were equal. The test does not show which groups differ or how large the differences are; a post-hoc test and an effect size are needed for that.
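
To illustrate how an F-statistic maps to a p-value, the sketch below evaluates the upper tail of the F distribution with SciPy. The question does not give the degrees of freedom, so the values used here (3 groups and 30 observations) are assumptions for illustration, and the resulting p-value will not match the reported 0.02 exactly.

```python
from scipy import stats

f_stat = 5.23  # F-statistic from the question
alpha = 0.05

# Hypothetical degrees of freedom: 3 groups (dfn = 2) and 30 observations (dfd = 27)
dfn, dfd = 2, 27

# Upper-tail probability of the F distribution at the observed statistic
p_value = stats.f.sf(f_stat, dfn, dfd)

# Critical value that F must exceed to reject the null hypothesis at alpha
f_crit = stats.f.ppf(1 - alpha, dfn, dfd)

print(f"p = {p_value:.4f}, critical F = {f_crit:.2f}, reject H0: {f_stat > f_crit}")
```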

## Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

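Two common approaches to missing data in repeated measures ANOVA are deletion and imputation. Listwise deletion removes every subject that has a missing value in any condition, whereas pairwise deletion drops a subject only from the comparisons that involve the missing value. Imputation replaces missing values with estimates based on the observed data, such as the condition mean or model-based predictions. The choice matters: listwise deletion reduces the sample size and statistical power and can bias the results unless the data are missing completely at random, while simple imputation such as mean substitution understates the variability in the data and can make p-values too small. Mixed-effects models are a further alternative, because they can use all available observations from each subject.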
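
The difference between the two approaches can be seen on a small made-up data set. The sketch below assumes a wide-format table (one row per subject, hypothetical condition columns `t1`, `t2`, `t3`) and compares listwise deletion with mean imputation using pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per condition
wide = pd.DataFrame({
    "subject": [1, 2, 3, 4, 5],
    "t1": [5.0, 6.0, np.nan, 7.0, 8.0],
    "t2": [6.0, np.nan, 7.0, 8.0, 9.0],
    "t3": [7.0, 8.0, 9.0, 9.0, 10.0],
})
conditions = ["t1", "t2", "t3"]

# Listwise deletion: drop every subject with at least one missing measurement
listwise = wide.dropna(subset=conditions)

# Mean imputation: fill each gap with the mean of that condition
imputed = wide.copy()
imputed[conditions] = imputed[conditions].fillna(imputed[conditions].mean())

print("Subjects kept after listwise deletion:", len(listwise))
print("Std of t1 before imputation:", wide["t1"].std())
print("Std of t1 after imputation:", imputed["t1"].std())
```

Listwise deletion keeps only three of the five subjects, while mean imputation keeps all five but shrinks the standard deviation of `t1`, which is the variance understatement described above.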

## Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

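Common post-hoc tests used after ANOVA include Tukey's Honestly Significant Difference (HSD) test, the Bonferroni correction, Scheffé's test, and Fisher's Least Significant Difference (LSD) test. They are used to determine which groups differ from each other after the overall F-test is significant. Tukey's HSD is the usual choice for comparing all pairs of group means. The Bonferroni correction suits a small number of planned comparisons, since it becomes conservative as the number of comparisons grows. Scheffé's test is the most conservative and covers complex contrasts as well as pairwise comparisons. Fisher's LSD is the most liberal, because it does not adjust for multiple comparisons beyond requiring a significant F-test.

For example, if we conduct an ANOVA to compare the mean weight loss of three diets and find a significant overall effect, the ANOVA alone does not say which diets differ. A post-hoc test can then compare the diets pair by pair (A vs. B, A vs. C, B vs. C) to identify which specific diets differ in weight loss and which may be most effective.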
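
As one possible illustration of the diet example, the sketch below runs a t-test for each pair of diets and applies a Bonferroni correction with `multipletests` from statsmodels. The weight-loss values are seeded random numbers invented for this example, not real data.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical weight-loss data (kg) for three diets, seeded for reproducibility
rng = np.random.default_rng(42)
diets = {
    "A": rng.normal(loc=3.0, scale=1.0, size=20),
    "B": rng.normal(loc=3.5, scale=1.0, size=20),
    "C": rng.normal(loc=5.0, scale=1.0, size=20),
}

# One two-sample t-test per pair of diets
pairs = list(combinations(diets, 2))
raw_p = [stats.ttest_ind(diets[a], diets[b]).pvalue for a, b in pairs]

# Bonferroni: multiply each p-value by the number of comparisons (capped at 1)
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: adjusted p = {p:.4f}, reject H0: {r}")
```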

## Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

import pandas as pd
import scipy.stats as stats

# Read in the data (assumes one column of weight loss per diet: diet_A, diet_B, diet_C)
data = pd.read_csv('data.csv')

# Conduct one-way ANOVA; dropna() removes padding NaNs if the groups have unequal sizes
f_stat, p_val = stats.f_oneway(
    data['diet_A'].dropna(), data['diet_B'].dropna(), data['diet_C'].dropna()
)

# Report results
print('F-statistic:', f_stat)
print('p-value:', p_val)

# Interpret results at the 0.05 significance level
if p_val < 0.05:
    print('At least one diet differs significantly in mean weight loss.')
else:
    print('No significant difference in mean weight loss between the three diets was detected.')


## Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.


import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a dataframe with illustrative data (30 observations per program, 15 per experience level)
df = pd.DataFrame({
    'software_program': ['A']*30 + ['B']*30 + ['C']*30,
    'experience_level': ['novice']*15 + ['experienced']*15 + ['novice']*15 + ['experienced']*15 + ['novice']*15 + ['experienced']*15,
    'completion_time': [10, 11, 9, 12, 11, 13, 14, 15, 12, 13, 10, 11, 9, 12, 11, 13, 14, 15, 12, 13, 10, 11, 9, 12, 11, 13, 14, 15, 12, 13,
                        20, 21, 19, 22, 21, 23, 24, 25, 22, 23, 20, 21, 19, 22, 21, 23, 24, 25, 22, 23, 20, 21, 19, 22, 21, 23, 24, 25, 22, 23,
                        30, 31, 29, 32, 31, 33, 34, 35, 32, 33, 30, 31, 29, 32, 31, 33, 34, 35, 32, 33, 30, 31, 29, 32, 31, 33, 34, 35, 32, 33]
})

# Fit a two-way ANOVA model with interaction
model = ols('completion_time ~ C(software_program) + C(experience_level) + C(software_program):C(experience_level)', data=df).fit()

# Print the ANOVA table
print(sm.stats.anova_lm(model, typ=2))


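                                               sum_sq    df             F        PR(>F)
C(software_program)                      6.000000e+03   2.0  1.006390e+03  2.063068e-59
C(experience_level)                      1.960000e+01   1.0  6.575080e+00  1.212087e-02
C(software_program):C(experience_level)  4.349898e-28   2.0  7.296155e-29  1.000000e+00
Residual                                 2.504000e+02  84.0           NaN           NaN

The main effect of software program is significant (F ≈ 1006.39, p ≈ 2.1e-59), as is the main effect of experience level (F ≈ 6.58, p ≈ 0.012). The interaction is not significant (F ≈ 0, p = 1.0), so in this illustrative data the effect of experience level is the same for every program.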
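
A table of cell means can help make sense of an interaction term. The sketch below rebuilds the same completion times in a more compact form (the 10-value pattern and the shifts of 10 and 20 are read off the lists in the example) and computes the mean for each program and experience level.

```python
import pandas as pd

# Rebuild the completion times from the example: a 10-value pattern repeated
# three times per program, shifted by 10 for program B and by 20 for program C
pattern = [10, 11, 9, 12, 11, 13, 14, 15, 12, 13]
df = pd.DataFrame({
    "software_program": ["A"] * 30 + ["B"] * 30 + ["C"] * 30,
    "experience_level": (["novice"] * 15 + ["experienced"] * 15) * 3,
    "completion_time": [t + shift for shift in (0, 10, 20) for t in pattern * 3],
})

# Mean completion time for each program/experience combination
cell_means = (
    df.groupby(["software_program", "experience_level"])["completion_time"]
    .mean()
    .unstack()
)
print(cell_means)

# Experienced-minus-novice gap within each program; equal gaps mean no interaction
gap = cell_means["experienced"] - cell_means["novice"]
print(gap)
```

The gap between experience levels is the same in every program, which is why the interaction sum of squares in the ANOVA table is effectively zero.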


## Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.


import numpy as np
from scipy.stats import ttest_ind

# Create arrays with illustrative data (30 scores per group)
control_scores = np.array([80, 75, 85, 90, 70, 80, 75, 85, 90, 70, 80, 75, 85, 90, 70, 80, 75, 85, 90, 70, 80, 75, 85, 90, 70, 80, 75, 85, 90, 70])
experimental_scores = np.array([85, 90, 80, 75, 95, 85, 90, 80, 75, 95, 85, 90, 80, 75, 95, 85, 90, 80, 75, 95, 85, 90, 80, 75, 95, 85, 90, 80, 75, 95])

# Conduct a two-sample t-test
t, p = ttest_ind(control_scores, experimental_scores)

# Print the t-statistic and p-value
print(f"t = {t:.2f}, p = {p:.4f}")


t = -2.69, p = 0.0093


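The result is significant (p = 0.0093 < 0.05), so we reject the null hypothesis that the two groups have equal mean test scores. The negative t-statistic shows that the control group's mean (80) is lower than the experimental group's mean (85). Because only two groups are being compared, the t-test already identifies which groups differ, and no post-hoc test is needed. A post-hoc test such as Tukey's HSD only becomes relevant when an ANOVA compares three or more groups.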
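
A p-value says nothing about the size of the difference, so an effect size is often reported alongside it. The sketch below computes Cohen's d with a pooled standard deviation for the same two score arrays; the choice of Cohen's d is an addition here, not something the question asks for.

```python
import numpy as np

# Same scores as in the example above, written as a repeated 5-value pattern
control_scores = np.array([80, 75, 85, 90, 70] * 6)
experimental_scores = np.array([85, 90, 80, 75, 95] * 6)

n1, n2 = len(control_scores), len(experimental_scores)

# Pooled variance from the two sample variances (ddof=1 for sample variance)
pooled_var = (
    (n1 - 1) * control_scores.var(ddof=1) + (n2 - 1) * experimental_scores.var(ddof=1)
) / (n1 + n2 - 2)

# Cohen's d: mean difference expressed in pooled standard deviations
cohens_d = (experimental_scores.mean() - control_scores.mean()) / np.sqrt(pooled_var)

print(f"Cohen's d = {cohens_d:.2f}")  # Cohen's d = 0.70
```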

## Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd


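# Simulate long-format data: 3 stores x 30 days
# (sales are random and unseeded, so the output below changes between runs)
data = pd.DataFrame({'store': ['A']*30 + ['B']*30 + ['C']*30,
                     'day': np.tile(np.arange(1, 31), 3),
                     'sales': np.random.randint(100, 1000, size=90)})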


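# Treat each day as a block: with balanced data, the F-test for C(store)
# matches a one-way repeated measures ANOVA (assuming sphericity)
model = ols('sales ~ C(store) + C(day)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)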


print(anova_table)


                sum_sq    df         F    PR(>F)
C(store)  3.054942e+04   2.0  0.215529  0.806758
C(day)    2.051401e+06  29.0  0.998126  0.488006
Residual  4.110505e+06  58.0       NaN       NaN


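# Tukey HSD post-hoc test
# (note: this treats the stores as independent groups and ignores the pairing by day)
posthoc = pairwise_tukeyhsd(data['sales'], data['store'], alpha=0.05)
print(posthoc)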


  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
group1 group2 meandiff p-adj    lower    upper   reject
-------------------------------------------------------
     A      B  16.3333 0.9693 -147.5165 180.1831  False
     A      C -28.2667  0.911 -192.1165 135.5831  False
     B      C    -44.6 0.7934 -208.4498 119.2498  False
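-------------------------------------------------------

In this run the effect of store is not significant (F ≈ 0.22, p ≈ 0.81), so there is no evidence that mean daily sales differ between the three stores. The Tukey HSD output agrees: no pair of stores is flagged as different (reject is False for all three comparisons). Because the sales figures are unseeded random draws, the exact numbers will change each time the code is run.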
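
As an alternative to the blocked OLS model, statsmodels also provides a dedicated `AnovaRM` class for balanced repeated measures designs. The sketch below applies it to simulated data of the same shape, treating each day as the repeated "subject" and store as the within-subject factor. The seed is arbitrary and the sales values are random, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Simulated long-format data shaped like the example: 3 stores x 30 days
rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
data = pd.DataFrame({
    "store": ["A"] * 30 + ["B"] * 30 + ["C"] * 30,
    "day": np.tile(np.arange(1, 31), 3),
    "sales": rng.integers(100, 1000, size=90).astype(float),
})

# Each day is the repeated "subject"; store is the within-subject factor
result = AnovaRM(data, depvar="sales", subject="day", within=["store"]).fit()

# The table holds the F value, numerator/denominator df, and p-value for store
print(result.anova_table)
```

`AnovaRM` requires exactly one observation per subject and condition, which holds here because every store has one sales figure for each of the 30 days.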
