#### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

Ans - ANOVA (Analysis of Variance) is a statistical method used to analyze the differences between means of two or more groups. However, ANOVA has several assumptions that must be met for the results to be valid. The following are the assumptions required for ANOVA:

Independence: Observations within and between groups should be independent. Each subject should only be in one group, and each group should not influence the other groups. Violation of this assumption may lead to a biased estimate of the variance, leading to inaccurate results.

Normality: The residuals within each group should be approximately normally distributed with a mean of zero. ANOVA is fairly robust to mild departures from normality when samples are large and group sizes are similar, but strong skew or heavy tails in small samples can distort the p-value of the F-test.

Homogeneity of variance: Homogeneity of variance refers to the assumption that the variance of the residuals is equal across all groups. Unequal variances distort the Type I error rate of the F-test, particularly when the group sizes also differ.

Examples of violations that could impact the validity of ANOVA results include:

Outliers: Outliers can cause a violation of the normality assumption by skewing the distribution of the data. Outliers may also cause unequal variance, which violates the homogeneity of variance assumption.

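Non-normality: Strongly skewed outcomes, such as reaction times or incomes, often produce residuals that are not normally distributed. The resulting p-values can be unreliable, especially with small sample sizes.

Unequal variance: If one group is far more variable than the others (for example, a treatment that helps some subjects a great deal and others not at all), the F-test can reject too often or too rarely, especially when group sizes are unequal.

Non-independence: Measuring the same subject more than once, or sampling students from the same classroom, makes the observations correlated. This leads to biased estimates of variance and inaccurate results.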

Non-random sampling: If the sample is not representative of the population, it can lead to inaccurate results.
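
The normality and equal-variance assumptions above can be checked numerically before running an ANOVA. The following is a minimal sketch, assuming three hypothetical groups of simulated data; the group names, means, and sample sizes are made up for illustration. It uses the Shapiro-Wilk test on the residuals and Levene's test across the groups.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: three groups of 20 observations each
groups = {
    "A": rng.normal(loc=10, scale=2, size=20),
    "B": rng.normal(loc=12, scale=2, size=20),
    "C": rng.normal(loc=11, scale=2, size=20),
}

# Residuals in a one-way design are deviations from each group's own mean
residuals = np.concatenate([g - g.mean() for g in groups.values()])

# Shapiro-Wilk: H0 = the residuals come from a normal distribution
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Levene: H0 = all groups share the same variance
levene_stat, levene_p = stats.levene(*groups.values())

print(f"Shapiro-Wilk p = {shapiro_p:.3f}")
print(f"Levene p = {levene_p:.3f}")
```

A small p-value in either test signals that the corresponding assumption is doubtful. Neither test can detect non-independence, which has to be judged from how the data were collected.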

#### Q2. What are the three types of ANOVA, and in what situations would each be used?

Ans - The three types of ANOVA are:

One-Way ANOVA: This is used when there is only one factor being tested, and it has two or more levels. For example, a researcher might use one-way ANOVA to test whether there is a significant difference in the mean weight of plants grown under different fertilizer treatments.

Two-Way ANOVA: This is used when there are two factors being tested, and each factor has two or more levels. For example, a researcher might use two-way ANOVA to test whether there is a significant difference in the mean weight of plants grown under different fertilizer treatments and different light levels.

N-Way ANOVA: This is used when there are more than two factors being tested, and each factor has two or more levels. For example, a researcher might use n-way ANOVA to test whether there is a significant difference in the mean weight of plants grown under different fertilizer treatments, light levels, and watering schedules.

#### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Ans - The partitioning of variance in ANOVA refers to the division of the total variance observed in a dataset into different components that can be attributed to different sources. There are two main sources of variance in ANOVA: the variability within groups and the variability between groups.

By partitioning the total variance into these components, ANOVA can determine whether the differences observed between groups are statistically significant, or whether they can be attributed to chance. This is important because it allows researchers to determine whether a treatment or intervention has a significant effect, or whether any observed differences could be due to random chance.

Understanding the partitioning of variance is also important for interpreting the results of ANOVA. By identifying the proportion of variance that can be attributed to different sources, researchers can gain insight into the relative importance of different factors in the outcome being studied. Additionally, understanding the partitioning of variance can help researchers identify potential sources of error or bias in their study design or analysis.
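
The partition can be verified by hand on a small example. The sketch below uses made-up scores for three hypothetical groups and checks that the total sum of squares equals the between-group plus the within-group sum of squares, then forms the F-statistic from the two mean squares.

```python
import numpy as np

# Hypothetical scores for three groups (made-up numbers for illustration)
groups = [
    np.array([10.0, 15.0]),
    np.array([20.0, 12.0]),
    np.array([18.0, 25.0]),
]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

# Total variability: every observation against the grand mean
ss_total = ((all_scores - grand_mean) ** 2).sum()

# Between-group variability: group means against the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group variability: observations against their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print("total:", ss_total, "between + within:", ss_between + ss_within)
print("F:", f_stat)
```

The ratio `ss_between / ss_total` is the proportion of variance attributable to group membership, which is the quantity the paragraph above refers to when it speaks of the relative importance of a factor.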





#### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import pandas as pd
from statsmodels.formula.api import ols

# Sample dataset: three groups with two observations each.
# (With only one observation per group the model would fit every point
# exactly, and the residual sum of squares would be zero.)
df = pd.DataFrame({'group': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'score': [10, 15, 20, 12, 18, 25]})

# Fit a one-way ANOVA model; the fitted values are the group means
model = ols('score ~ group', data=df).fit()
fitted = model.predict(df)
grand_mean = df['score'].mean()

# Total sum of squares (SST): observations vs. the grand mean
sst = ((df['score'] - grand_mean) ** 2).sum()

# Explained sum of squares (SSE): fitted group means vs. the grand mean
sse = ((fitted - grand_mean) ** 2).sum()

# Residual sum of squares (SSR): observations vs. their fitted group mean
ssr = ((df['score'] - fitted) ** 2).sum()

print('SST:', round(sst, 3))
print('SSE:', round(sse, 3))
print('SSR:', round(ssr, 3))


SST: 151.333
SSE: 82.333
SSR: 69.0


#### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [3]:
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load the tips dataset from seaborn
tips = sns.load_dataset("tips")

# Create a formula for the ANOVA model
formula = 'total_bill ~ sex + time + sex:time'

# Fit the ANOVA model
model = ols(formula, data=tips).fit()

# Perform the ANOVA and print the results table
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)


                sum_sq     df         F    PR(>F)
sex         231.460310    1.0  3.022685  0.083390
time        473.011803    1.0  6.177153  0.013623
sex:time      3.371906    1.0  0.044034  0.833968
Residual  18377.855945  240.0       NaN       NaN

In the table, the sex and time rows are the main effects and the sex:time row is the interaction effect. At the 0.05 level, only time has a significant main effect on total_bill (F = 6.18, p = 0.014); the main effect of sex (F = 3.02, p = 0.083) and the sex-by-time interaction (F = 0.04, p = 0.834) are not significant.


#### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

Ans - With an F-statistic of 5.23 and a p-value of 0.02, the result is statistically significant at the conventional 0.05 level. In other words, there is evidence that the means of the groups are not all equal.

The F-statistic is the ratio of the variance between groups to the variance within groups. A larger F-statistic indicates larger differences between group means relative to the variation within groups. Here, the between-group mean square is about 5.2 times the within-group mean square.

The p-value of 0.02 means that, if the group means were all equal, an F-statistic at least this large would occur only about 2% of the time. Therefore, we reject the null hypothesis of equal group means and conclude that at least one of the group means is different from the others. The ANOVA itself does not identify which groups differ; a post-hoc test is needed for that.
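
The link between an F-statistic and its p-value depends on the degrees of freedom, which the question does not give. The sketch below therefore assumes a hypothetical design of 3 groups and 30 observations, so its p-value will not be exactly 0.02; it only illustrates how the p-value, the critical value, and an effect size are obtained from F.

```python
from scipy import stats

f_stat = 5.23

# Hypothetical design (not given in the question): 3 groups, 30 observations
k, n = 3, 30
df_between = k - 1
df_within = n - k

# p-value = area of the F distribution to the right of the observed statistic
p_value = stats.f.sf(f_stat, df_between, df_within)

# Critical value that the statistic must exceed at alpha = 0.05
f_crit = stats.f.ppf(0.95, df_between, df_within)

# Eta squared (share of variance explained), recovered from F and the degrees of freedom
eta_sq = (f_stat * df_between) / (f_stat * df_between + df_within)

print(f"p = {p_value:.4f}, critical F = {f_crit:.2f}, eta^2 = {eta_sq:.3f}")
```

Reporting an effect size such as eta squared alongside the p-value helps show whether a significant difference is also large enough to matter in practice.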

#### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Ans - Handling missing data in a repeated measures ANOVA can be challenging, as the data is often dependent and the same individuals are measured across multiple time points or conditions. There are different methods to handle missing data, including:

Listwise deletion: This method excludes every individual who has a missing value at any time point or condition. It is straightforward, but it reduces the sample size and statistical power, and it can bias the results unless the data are missing completely at random.

Mean imputation: This method replaces each missing value with the mean of the observed values for that variable. It is easy to implement, but it shrinks the variance, weakens the correlations between the repeated measures, and makes standard errors too small.

Maximum likelihood estimation: Rather than filling in the missing values, this approach (for example, a linear mixed-effects model) estimates the model parameters from all of the available observations. It can provide unbiased estimates of the means and variances when the data are missing at random, but it requires a more sophisticated statistical model and may not work well for small sample sizes.

The consequences of each method depend on the amount and pattern of missing data. None of these methods fully corrects for data that are missing not at random, so it is important to examine why values are missing and to choose a method that is appropriate for the specific research question and data at hand.
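
The three approaches can be compared side by side on simulated data. This is a minimal sketch under stated assumptions: the dataset (20 subjects, 3 time points, 8 values removed at random) is made up, and a random-intercept mixed model from statsmodels stands in for the likelihood-based approach.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Hypothetical long-format data: 20 subjects, each measured at 3 time points
n_subjects = 20
subject_effect = rng.normal(0, 5, n_subjects)  # per-subject baseline shift
long = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subjects), 3),
    "time": np.tile(["t1", "t2", "t3"], n_subjects),
})
long["score"] = (
    50
    + long["time"].map({"t1": 0.0, "t2": 2.0, "t3": 4.0})
    + subject_effect[long["subject"].to_numpy()]
    + rng.normal(0, 3, len(long))
)

# Remove 8 scores at random to mimic missed sessions
missing_idx = rng.choice(len(long), size=8, replace=False)
long.loc[missing_idx, "score"] = np.nan

# 1) Listwise deletion: keep only subjects observed at every time point
wide = long.pivot(index="subject", columns="time", values="score")
complete = wide.dropna()

# 2) Mean imputation: fill each time point's gaps with that time point's mean
imputed = wide.fillna(wide.mean())

# 3) Likelihood-based: a mixed model with a random intercept per subject,
#    fitted on every observed row (no subject is dropped, nothing is filled in)
observed = long.dropna(subset=["score"])
mixed = smf.mixedlm("score ~ C(time)", observed, groups=observed["subject"]).fit()

print("subjects kept by listwise deletion:", len(complete), "of", n_subjects)
print(pd.DataFrame({"observed_var": wide.var(), "imputed_var": imputed.var()}))
print(mixed.params)
```

The printed variances show the shrinkage caused by mean imputation, and the subject count shows how much data listwise deletion discards.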

#### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Ans - Post-hoc tests are used to determine which groups differ significantly from each other after obtaining a significant result from an ANOVA. Some common post-hoc tests include Tukey's HSD, Bonferroni correction, Scheffé's method, and Dunnett's test.

Tukey's HSD: This test compares all possible pairs of groups to determine which pairs have a significant difference. It controls the family-wise error rate (FWER), which is the probability of making at least one type I error among all the comparisons. Tukey's HSD assumes equal variances across groups and is the usual choice when every pairwise comparison is of interest; the Tukey-Kramer variant handles unequal sample sizes.

Bonferroni correction: This method controls the FWER by dividing the significance level by the number of comparisons. For example, if there are four groups, and the significance level is set to 0.05, then the adjusted significance level would be 0.05/6 = 0.0083, since there are six pairwise comparisons. Bonferroni correction is a conservative method, and it is most useful when only a small number of pre-planned comparisons are made.

Scheffé's method: This test also controls the FWER, but for all possible contrasts among the group means, not just pairwise differences. For pairwise comparisons alone it is more conservative than Tukey's HSD, so it is mainly used when complex contrasts (for example, one group against the average of two others) are of interest.

Dunnett's test: This test is used to compare each group to a control group. It controls the family-wise error rate for these comparisons.

An example of a situation where a post-hoc test might be necessary is in a study comparing the effectiveness of three different types of pain medication. An ANOVA might reveal a significant difference between the groups, but a post-hoc test would be necessary to determine which specific pairs of groups differ significantly from each other.
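
The Bonferroni arithmetic and the pain-medication example can be sketched as follows. The pain-relief scores are simulated and the drug names are hypothetical; the sketch uses `pairwise_tukeyhsd` from statsmodels for the pairwise comparisons.

```python
from math import comb

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Bonferroni arithmetic: 4 groups give 6 pairwise comparisons
n_groups = 4
n_pairs = comb(n_groups, 2)
alpha_adjusted = 0.05 / n_pairs
print(n_pairs, round(alpha_adjusted, 4))

# Hypothetical pain-relief scores for three medications, 15 patients each
rng = np.random.default_rng(1)
scores = np.concatenate([
    rng.normal(5.0, 1.0, 15),   # drug_a
    rng.normal(6.5, 1.0, 15),   # drug_b
    rng.normal(5.2, 1.0, 15),   # drug_c
])
labels = np.repeat(["drug_a", "drug_b", "drug_c"], 15)

# Tukey's HSD compares every pair while holding the family-wise error rate at alpha
tukey = pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05)
print(tukey.summary())
```

Each row of the summary is one pair of drugs, with an adjusted p-value and a `reject` flag indicating whether that pair differs at the chosen family-wise level.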


#### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

Ans - No weight-loss data is supplied with the question, so the cell below uses a stand-in dataset: it loads seaborn's tips data, assigns each row a random diet label (A, B, or C), and treats total_bill as a placeholder for weight loss. Note that this gives 244 rows rather than 50 participants. The one-way ANOVA is fitted with the statsmodels ols function and summarized with anova_lm. Here's the Python code:

In [5]:
import random

import seaborn as sns
import statsmodels.api as sm

# Stand-in data: the tips dataset, with total_bill playing the role of weight loss
df = sns.load_dataset('tips')

random.seed(123) # set seed for reproducibility

# Assign each row to a random diet group
diet = []
for i in range(len(df)):
    diet.append(random.choice(['A', 'B', 'C']))

df['diet'] = diet

# Fit the one-way ANOVA model and build the ANOVA table
model = sm.formula.ols('total_bill ~ diet', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)



                sum_sq     df         F    PR(>F)
diet        198.281750    2.0  1.253553  0.287343
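Residual  19060.182333  241.0       NaN       NaN

The F-statistic is 1.25 with a p-value of 0.287. Since the p-value is greater than 0.05, we fail to reject the null hypothesis: there is no evidence that the mean outcome differs between the three diet groups. This is expected here, because the diet labels were assigned at random and have no real relationship to the outcome.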
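
For a design closer to the question, a one-way ANOVA can also be run with `scipy.stats.f_oneway`. The following is a minimal sketch, assuming simulated weight loss for 50 participants; the group sizes, means, and standard deviation are made up for illustration, so the printed values say nothing about real diets.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical weight loss (in pounds) for 50 participants split across three diets
df = pd.DataFrame({
    "diet": ["A"] * 17 + ["B"] * 17 + ["C"] * 16,
    "weight_loss": np.concatenate([
        rng.normal(5.0, 2.0, 17),   # assumed mean loss on diet A
        rng.normal(6.5, 2.0, 17),   # assumed mean loss on diet B
        rng.normal(4.0, 2.0, 16),   # assumed mean loss on diet C
    ]),
})

# One array of weight-loss values per diet, then the one-way ANOVA F-test
samples = [grp["weight_loss"].to_numpy() for _, grp in df.groupby("diet")]
f_stat, p_value = stats.f_oneway(*samples)

print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: at least one diet's mean weight loss differs.")
else:
    print("Fail to reject H0: no evidence of a difference between diets.")
```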


#### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import numpy as np

# Setting random seed for reproducibility
np.random.seed(123)

# Generating 2 random time samples for novice and expert
time_novice = np.random.normal(loc=15, scale=2, size=30)
time_expert = np.random.normal(loc=10, scale=2, size=30)

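# Generate simulated data (60 rows)
# NOTE: with these labels Software A is used only by novices and Software C only
# by experienced employees, so the two factors are partly confounded and two of
# the six Software x Experience cells are empty.
data = pd.DataFrame({
    'Software': ['A']*20 + ['B']*20 + ['C']*20,
    'Experience': ['Novice']*30 + ['Experienced']*30,
    'Time': list(time_novice)+list(time_expert)
})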

# Print the simulated data head 
print('Simulated Data example :')
print(data.head())

print('\n======================================================================================\n')

# Fit the two-way ANOVA model
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=data).fit()
table = sm.stats.anova_lm(model, typ=1)

# Set significance level
alpha = 0.05

# Main effects and interaction effect
print(table)
print('\n')

# Look up each p-value by its row label rather than by position
if table.loc['C(Software)', 'PR(>F)'] < alpha:
    print("Conclusion: There is a significant main effect of software.")
else:
    print("Conclusion: There is no significant main effect of software.")

if table.loc['C(Experience)', 'PR(>F)'] < alpha:
    print("Conclusion: There is a significant main effect of experience.")
else:
    print("Conclusion: There is no significant main effect of experience.")

if table.loc['C(Software):C(Experience)', 'PR(>F)'] < alpha:
    print("Conclusion: There is a significant interaction effect between software and experience.")
else:
    print("Conclusion: There is no significant interaction effect between software and experience.")

Simulated Data example :
  Software Experience       Time
0        A     Novice  12.828739
1        A     Novice  16.994691
2        A     Novice  15.565957
3        A     Novice  11.987411
4        A     Novice  13.842799


                             df      sum_sq     mean_sq          F  \
C(Software)                 2.0  204.881181  102.440590  18.135666   
C(Experience)               1.0  165.079097  165.079097  29.224933   
C(Software):C(Experience)   2.0   17.481552    8.740776   1.547431   
Residual                   56.0  316.319953    5.648571        NaN   

                                 PR(>F)  
C(Software)                8.460472e-07  
C(Experience)              1.375177e-06  
C(Software):C(Experience)  2.217544e-01  
Residual                            NaN  


Conclusion: There is a significant main effect of software.
Conclusion: There is a significant main effect of experience.
Conclusion: There is no significant interaction effect between software and experience.


Here are the interpretations of the three conclusions:

"There is a significant main effect of software": Averaged over experience levels, the mean completion time differs between the software programs (F = 18.14, p < 0.001). Taken at face value, this suggests that the choice of software program is an important factor that should be considered carefully when completing this task.

"There is a significant main effect of experience": Averaged over software programs, the mean completion time differs between experience levels (F = 29.22, p < 0.001). The data were simulated with a mean time of 15 for novices and 10 for experienced employees, so experienced employees complete the task faster. This finding can be helpful for the company to identify the best employees for a given task and to provide appropriate training for new employees.

"There is NO significant interaction effect between software and experience": There is no evidence that the effect of software on completion time depends on the experience level of the employees, or vice versa (F = 1.55, p = 0.22). This suggests that the software programs perform similarly for both novices and experienced employees.

These conclusions should be read with caution. In the simulated data, Software A was used only by novices and Software C only by experienced employees, so the software effect cannot be separated cleanly from the experience effect. In fact, the completion times were generated from experience level alone, so the apparent software effect reflects this confounding. The sequential (Type I) sums of squares used here also depend on the order of the terms in the formula when the design is unbalanced. A design in which every software program is used by both experience levels avoids these problems.

#### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [3]:
import numpy as np
from scipy.stats import ttest_ind

# generate sample data
control_scores = np.random.normal(70, 10, 100)
experimental_scores = np.random.normal(75, 10, 100)

# conduct two-sample t-test
t_statistic, p_value = ttest_ind(control_scores, experimental_scores)

# print results
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# conduct post-hoc test (Tukey's HSD)
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey_results = pairwise_tukeyhsd(np.concatenate((control_scores, experimental_scores)),
                                  np.concatenate((np.repeat('control', 100), np.repeat('experimental', 100))))

print(tukey_results)


t-statistic: -3.0651400307444217
p-value: 0.0024791276971826066
   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj  lower  upper  reject
---------------------------------------------------------
control experimental   4.2465 0.0025 1.5144 6.9786   True
---------------------------------------------------------


In this example, we generate two samples of test scores (one for the control group and one for the experimental group) using the numpy.random.normal function. Note that the simulation draws 100 scores per group rather than splitting 100 students between the two groups. We then conduct a two-sample t-test using scipy.stats.ttest_ind, which returns the t-statistic and p-value. If the p-value is less than our chosen significance level (e.g., 0.05), we conclude that there is a significant difference in test scores between the two groups.

In this run, the t-statistic is -3.07 and the p-value is 0.0025, so there is a significant difference in test scores between the two groups (the p-value is less than 0.05). The negative sign reflects that the control mean is lower than the experimental mean. With only two groups there is just one pairwise comparison, so a post-hoc test adds no new information: Tukey's HSD (using the statsmodels.stats.multicomp.pairwise_tukeyhsd function) reports the same p-value of 0.0025, together with a mean difference of 4.25 points in favor of the experimental group (95% confidence interval 1.51 to 6.98).
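
With two groups, the pooled-variance t-test and a one-way ANOVA are the same test, which is why a post-hoc comparison is redundant here. The following is a minimal sketch, assuming simulated scores for 50 students per group (the means and standard deviation are made up); it also shows Welch's t-test, which drops the equal-variance assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical scores: 100 students split evenly between the two groups
control = rng.normal(70, 10, 50)
experimental = rng.normal(75, 10, 50)

# Pooled-variance two-sample t-test (scipy's default, equal_var=True)
t_stat, t_p = stats.ttest_ind(control, experimental)

# One-way ANOVA on the same two groups
f_stat, f_p = stats.f_oneway(control, experimental)

# With two groups the tests coincide: F equals t squared and the p-values match
print(f"t^2 = {t_stat**2:.4f}, F = {f_stat:.4f}")
print(f"t-test p = {t_p:.6f}, ANOVA p = {f_p:.6f}")

# Welch's t-test does not assume equal variances in the two groups
welch_t, welch_p = stats.ttest_ind(control, experimental, equal_var=False)
print(f"Welch t = {welch_t:.4f}, p = {welch_p:.6f}")
```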

#### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [5]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import pingouin as pg

# create a sample dataset
np.random.seed(123)
data = pd.DataFrame({
    'day': np.repeat(range(1, 31), 3),
    'store': np.tile(['A', 'B', 'C'], 30),
    'sales': np.random.normal(loc=1000, scale=100, size=90)
})

# conduct repeated measures ANOVA
rm_anova = pg.rm_anova(dv='sales', within='store', subject='day', data=data)
print(rm_anova)


  Source  ddof1  ddof2         F     p-unc      ng2       eps
0  store      2     58  1.669709  0.197225  0.03671  0.959348

The effect of store is not significant (F(2, 58) = 1.67, p = 0.197), so there is no evidence that mean daily sales differ between the three stores. This is expected, because the sales for all three stores were simulated from the same distribution. A post-hoc test is therefore not strictly required; the pairwise comparisons in the next cell are shown for completeness, and none of them is significant after Bonferroni correction (smallest corrected p-value = 0.277).


In [6]:
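# conduct pairwise t-tests with Bonferroni correction
# (recent pingouin releases name this function pairwise_tests; older releases call it pairwise_ttests)
posthoc = pg.pairwise_tests(dv='sales', within='store', subject='day', data=data, padjust='bonf')
print(posthoc)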


  Contrast  A  B  Paired  Parametric         T   dof alternative     p-unc  \
0    store  A  B    True        True -1.740227  29.0   two-sided  0.092423   
1    store  A  C    True        True -0.892032  29.0   two-sided  0.379718   
2    store  B  C    True        True  0.998930  29.0   two-sided  0.326091   

     p-corr p-adjust   BF10    hedges  
0  0.277268     bonf  0.742 -0.453587  
1  1.000000     bonf   0.28 -0.256064  
2  0.978273     bonf  0.307  0.216494  
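
If pingouin is not available, a repeated measures ANOVA can also be run with the `AnovaRM` class from statsmodels. The following is a minimal sketch that rebuilds the same simulated sales data; it should reproduce the same uncorrected F-test for store, although statsmodels does not report the sphericity correction or effect size that pingouin adds.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Same simulated layout: 30 days, each with one sales figure per store
np.random.seed(123)
data = pd.DataFrame({
    "day": np.repeat(range(1, 31), 3),
    "store": np.tile(["A", "B", "C"], 30),
    "sales": np.random.normal(loc=1000, scale=100, size=90),
})

# 'day' plays the role of the subject; 'store' is the within-subject factor
result = AnovaRM(data, depvar="sales", subject="day", within=["store"]).fit()
print(result.anova_table)
```

`AnovaRM` requires balanced data, with exactly one observation per day and store, which this layout satisfies.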


