Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results. <br>

Analysis of Variance (ANOVA) is a statistical technique used to determine whether the means of two or more groups differ significantly. Several assumptions must be met for ANOVA to be valid. These include:

Independence: The observations in each group must be independent of each other.

Normality: The residuals (equivalently, the observations within each group) should be approximately normally distributed.

Homogeneity of variance: The variance within each group should be equal.

Random sampling: The data must be randomly selected from the population.

If any of these assumptions are violated, the results of ANOVA may not be valid. For example:

Independence: Observations that are related to one another violate this assumption. For example, if a study examines the effect of a drug and the same patients serve in both the treatment and control groups, the two sets of measurements are paired rather than independent.

Normality: Strongly skewed or heavy-tailed data violate this assumption. For example, if blood pressure readings in a drug study are heavily right-skewed, the F-test can be unreliable, especially when the samples are small.

Homogeneity of variance: Groups with very different spreads violate this assumption. For example, if a study examines the effect of a drug on different age groups and the variance in one age group is several times larger than in another, the pooled within-group variance represents neither group well, and the error rate of the F-test can be distorted, particularly when group sizes are unequal.

Random sampling: A sample that was not drawn at random from the population of interest violates this assumption. For example, if a study recruits only people who are already known to have a certain condition, the results may not generalize to the wider population.
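The normality and equal-variance assumptions can be checked before running ANOVA. The following is a minimal sketch, assuming three made-up groups of blood pressure readings (the data, group names, and seed are illustrative, not taken from any study). It uses the Shapiro-Wilk test for normality and Levene's test for homogeneity of variance, both from `scipy.stats`.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups drawn with a fixed seed for reproducibility
rng = np.random.default_rng(0)
group_a = rng.normal(loc=120, scale=10, size=30)
group_b = rng.normal(loc=125, scale=10, size=30)
group_c = rng.normal(loc=130, scale=10, size=30)
groups = [group_a, group_b, group_c]

# Normality: Shapiro-Wilk test per group (a small p-value suggests non-normality)
shapiro_pvalues = [stats.shapiro(g).pvalue for g in groups]

# Homogeneity of variance: Levene's test across all groups
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-values:", np.round(shapiro_pvalues, 3))
print("Levene p-value:", round(levene_p, 3))
```

Independence and random sampling cannot be tested this way; they depend on how the study was designed and how the data were collected.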

Q2. What are the three types of ANOVA, and in what situations would each be used? <br>

There are three main types of ANOVA, each used in different situations:

One-way ANOVA: This type of ANOVA is used to compare the means of three or more groups that are independent of each other. It is called "one-way" because it involves only one independent variable. For example, a one-way ANOVA could be used to compare the mean test scores of students from three different schools.

Two-way ANOVA: This type of ANOVA is used to analyze the effect of two independent variables on a single dependent variable. It is called "two-way" because it involves two independent variables. For example, a two-way ANOVA could be used to analyze the effect of both age and gender on a certain health condition.

Repeated measures ANOVA: This type of ANOVA is used when the same group of participants is measured more than once under different conditions. It is called "repeated measures" because the same participants are measured multiple times. For example, a repeated measures ANOVA could be used to analyze the effect of different types of exercise on the heart rate of the same group of individuals.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept? <br>

In ANOVA (Analysis of Variance), the partitioning of variance refers to the division of the total variance in the data into different sources of variation. Understanding this concept is important because it helps to determine the contribution of different factors to the variation in the data.

The partitioning involves three quantities: the total variation and the two components it is split into.

Total Sum of Squares (SST): This is the total amount of variation in the data.

Sum of Squares Within Groups (SSW): This is the amount of variation that is due to differences within groups or samples.

Sum of Squares Between Groups (SSB): This is the amount of variation that is due to differences between groups or samples.

The formula for the partitioning of variance is:

SST = SSB + SSW

The F statistic is the ratio of the corresponding mean squares, F = (SSB / (k - 1)) / (SSW / (N - k)), where k is the number of groups and N is the total number of observations. It is used to test the null hypothesis that all group means are equal.

Understanding the partitioning of variance is important because it allows researchers to determine which factor(s) are contributing to the variation in the data. This information can be used to identify the most important factors and to design further experiments or interventions that target these factors.

Additionally, ANOVA is a powerful statistical technique that is commonly used in many fields, including biology, psychology, economics, and engineering, so understanding the partitioning of variance is crucial for interpreting and communicating the results of ANOVA analyses.
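The partition can be verified numerically. The sketch below uses three small made-up groups (illustrative values only), computes SST, SSB, and SSW directly, checks that SST = SSB + SSW, and then forms the F statistic from the mean squares, that is, each sum of squares divided by its degrees of freedom. The result is compared with `scipy.stats.f_oneway` as a cross-check.

```python
import numpy as np
from scipy.stats import f_oneway

# Illustrative data: three groups of three observations each
groups = [np.array([4, 6, 5]), np.array([8, 10, 9]), np.array([12, 14, 13])]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Total, between-group, and within-group sums of squares
sst = np.sum((all_values - grand_mean) ** 2)
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssw = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# F is the ratio of mean squares (sum of squares / degrees of freedom)
f_manual = (ssb / (k - 1)) / (ssw / (n_total - k))
f_scipy, p_scipy = f_oneway(*groups)

print("SST:", sst, "SSB:", ssb, "SSW:", ssw)
print("F (manual):", f_manual, "F (scipy):", f_scipy)
```

For these values SSB is 96 and SSW is 6, so F = (96 / 2) / (6 / 6) = 48.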

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python? <br>



In [85]:
import numpy as np

#create some sample data
group1 = [ 4, 6, 5]
group2 = [8, 10, 9]
group3 = [12, 14, 13]

data = np.concatenate([group1,group2,group3])

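# total sum of squares (SST): squared deviations of every observation
# from the grand mean
mean = np.mean(data)
sst = np.sum((data - mean)**2)

# explained (between-group) sum of squares (SSE): squared deviation of each
# group mean from the grand mean, weighted by the group size
sse = ((np.mean(group1)-mean)**2) * len(group1)
sse += ((np.mean(group2) - mean) ** 2) * len(group2)
sse += ((np.mean(group3) - mean) ** 2) * len(group3)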

# calculate the residual sum of squares (SSR)
ssr = sst - sse

print('SST:', sst)
print('SSE:', sse)
print('SSR:', ssr)

#ALTERNATIVELY,
# import pandas as pd
# import statsmodels.api as sm
# from statsmodels.formula.api import ols

# # create a DataFrame with your data
# df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
#                    'value': [4, 6, 5, 8, 10, 9, 12, 14, 13]})

# # fit the one-way ANOVA model
# model = ols('value ~ group', data=df).fit()

# # build the ANOVA table and read the sums of squares from it
# result = sm.stats.anova_lm(model, typ=1)
# sse = result.loc['group', 'sum_sq']
# ssr = result.loc['Residual', 'sum_sq']
# print(sse)
# print(ssr)
# sst = ssr+sse
# print(sst)
# print(result)

SST: 102.0
SSE: 96.0
SSR: 6.0


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [90]:
# Importing libraries
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import numpy as np
  
# Create a dataframe
dataframe = pd.DataFrame({'Fertilizer': np.repeat(['daily', 'weekly'], 15),
                          'Watering': np.repeat(['daily', 'weekly'], 15),
                          'height': [14, 16, 15, 15, 16, 13, 12, 11,
                                     14, 15, 16, 16, 17, 18, 14, 13, 
                                     14, 14, 14, 15, 16, 16, 17, 18,
                                     14, 13, 14, 14, 14, 15]})
  
  
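# Performing two-way ANOVA
# Note: in this sample data Fertilizer and Watering take the same value on
# every row, so the two factors are perfectly confounded and their effects
# cannot be separated; a real design would cross the two factors.
model = ols('height ~ C(Fertilizer) + C(Watering) + \
C(Fertilizer):C(Watering)',data=dataframe).fit()

# anova_lm's keyword is `typ` (an unrecognized `type=` is ignored and Type I
# is used); typ=1 gives the sequential (Type I) table shown below
result = sm.stats.anova_lm(model, typ=1)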
main_effects = result.loc[['C(Watering)','C(Fertilizer)']]
print(main_effects)
interaction_effects = result.loc[['C(Fertilizer):C(Watering)']]
print("-"*40)
print(interaction_effects)

                df    sum_sq   mean_sq         F    PR(>F)
C(Watering)    1.0  0.000369  0.000369  0.000133  0.990865
C(Fertilizer)  1.0  0.033333  0.033333  0.012069  0.913305
----------------------------------------
                            df    sum_sq   mean_sq         F    PR(>F)
C(Fertilizer):C(Watering)  1.0  0.040866  0.040866  0.014796  0.904053


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results? <br>

A one-way ANOVA tests the null hypothesis that the means of all groups are equal. In this case, the F-statistic of 5.23 with a p-value of 0.02 indicates a statistically significant difference among the group means at the 5% level.

Specifically, the F-statistic of 5.23 means that the variation between the group means is about five times the variation within the groups, which is more than we would expect from chance alone. The p-value of 0.02 means that, if the null hypothesis were true, the probability of obtaining an F-statistic at least this large would be only 2%, which is below the commonly used threshold of 5% (0.05) for statistical significance.

Therefore, we can reject the null hypothesis and conclude that at least one group mean differs from the others. However, the ANOVA result alone does not tell us which specific group(s) differ. Post-hoc tests, such as Tukey's HSD or pairwise comparisons with a Bonferroni correction, can be conducted to identify the pairwise differences between the groups.
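The link between an F-statistic and its p-value can be sketched with `scipy.stats.f`. The question does not state the degrees of freedom, so the values below (3 groups and 30 observations) are hypothetical and the resulting p-value will not reproduce 0.02 exactly; the point is only to show how the decision is made.

```python
from scipy.stats import f

f_statistic = 5.23
# Hypothetical degrees of freedom: 3 groups (k - 1 = 2), 30 observations (N - k = 27)
df_between, df_within = 2, 27

# The p-value is the upper-tail probability of the F distribution
p_value = f.sf(f_statistic, df_between, df_within)

# Critical value for a 5% significance level
f_critical = f.ppf(0.95, df_between, df_within)

print("p-value:", p_value)
print("critical F at alpha = 0.05:", f_critical)
print("reject H0:", f_statistic > f_critical)
```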

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data? <br>

Handling missing data in a repeated measures ANOVA can be challenging as it requires dealing with missing values across multiple variables. There are several methods to handle missing data, including listwise deletion, pairwise deletion, mean imputation, regression imputation, and multiple imputation.

Listwise deletion involves removing any cases with missing data, which can result in a loss of power and biased estimates if the missing data is not missing completely at random (MCAR). Pairwise deletion involves using all available data for each variable, which can lead to biased results if the missing data is not MCAR.

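Mean imputation involves replacing missing values with the mean value of the available data for that variable. While this method is simple, it understates the variability of the data (shrinking standard deviations and standard errors) and distorts correlations between measurements; the means themselves are also biased unless the data are MCAR.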

Regression imputation involves using regression models to estimate the missing values based on the available data. This method can produce unbiased estimates if the imputation model is correctly specified.

Multiple imputation involves generating several plausible imputed datasets and combining the results, so that the estimates and standard errors account for the uncertainty in the imputation process. It is generally the recommended method for handling missing data in repeated measures ANOVA.

The consequences depend on how much data is missing, why it is missing, and which method is used. Listwise and pairwise deletion lose power and can bias estimates when the data are not MCAR. Mean imputation understates variability. Regression imputation can be unbiased when the imputation model is correctly specified, but filling each gap with a single predicted value ignores the uncertainty in that prediction, so standard errors come out too small. Multiple imputation addresses that uncertainty, which is why it is generally preferred.
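The effect of the two simplest methods can be seen on a small example. The sketch below uses a made-up wide-format table (one row per participant, one column per time point; the column names `t1`, `t2`, `t3` and all values are hypothetical) and compares listwise deletion with mean imputation using pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
scores = pd.DataFrame({
    't1': [10.0, 12.0, 11.0, 14.0, 13.0, 9.0],
    't2': [11.0, np.nan, 12.0, 15.0, np.nan, 10.0],
    't3': [12.0, 14.0, np.nan, 17.0, 15.0, 11.0],
})

# Listwise deletion: keep only participants with complete data
listwise = scores.dropna()

# Mean imputation: fill each gap with that column's observed mean
mean_imputed = scores.fillna(scores.mean())

print("Participants kept after listwise deletion:", len(listwise), "of", len(scores))
print("SD of t2, observed values only:", round(scores['t2'].std(), 3))
print("SD of t2, after mean imputation:", round(mean_imputed['t2'].std(), 3))
```

Listwise deletion keeps only 3 of the 6 participants here, and mean imputation lowers the standard deviation of `t2` because the filled-in values add no spread.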

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary. <br>

Post-hoc tests are used after performing ANOVA (Analysis of Variance) to determine the significant differences between pairs of means. Post-hoc tests help to identify which groups differ from one another after finding significant differences between groups in ANOVA. Some commonly used post-hoc tests include:

Tukey's Honestly Significant Difference (HSD) test: This test is used to compare all pairs of group means when the assumption of homogeneity of variances is met. It was designed for equal sample sizes; the Tukey-Kramer variant handles unequal ones. It is one of the most widely used post-hoc tests because it controls the family-wise Type I error rate across all pairwise comparisons.

Bonferroni correction: This method adjusts the significance threshold (or the p-values) for the number of comparisons made. It controls the family-wise error rate and is simple to apply, but it becomes very conservative as the number of comparisons grows, so it suits a small number of planned comparisons best.

Scheffé's test: This test is used when the comparisons of interest go beyond simple pairs, for example comparing the average of two groups with a third. It is more conservative and therefore less powerful than the Tukey HSD test for pairwise comparisons, and it also assumes equal variances.

Dunnett's test: This test is used when comparing several groups to a single control group. For that purpose it is more powerful than the Bonferroni correction and Scheffé's test.

Example of a situation where a post-hoc test might be necessary:
Suppose a researcher wants to investigate whether there are differences in the average exam scores of three different groups of students (Group A, Group B, and Group C). The researcher performs ANOVA and finds that there is a statistically significant difference between the groups. However, the ANOVA does not tell which groups differ from each other. In this case, the researcher needs to perform a post-hoc test to determine which groups are significantly different from each other.
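The scenario above can be sketched in Python. The exam scores below are simulated, and the group means, spread, sample size, and seed are assumptions made for illustration. The sketch runs a one-way ANOVA with `scipy.stats.f_oneway` and then Tukey's HSD with `pairwise_tukeyhsd` from statsmodels.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical exam scores for the three groups
rng = np.random.default_rng(42)
scores_a = rng.normal(loc=70, scale=8, size=25)
scores_b = rng.normal(loc=72, scale=8, size=25)
scores_c = rng.normal(loc=80, scale=8, size=25)

# Step 1: omnibus test for any difference among the three means
f_stat, p_value = f_oneway(scores_a, scores_b, scores_c)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# Step 2: pairwise comparisons with Tukey's HSD
scores = np.concatenate([scores_a, scores_b, scores_c])
labels = np.repeat(['A', 'B', 'C'], 25)
tukey = pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05)
print(tukey.summary())
```

The summary lists one row per pair (A-B, A-C, B-C) with the mean difference, adjusted p-value, and whether the null hypothesis for that pair is rejected.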

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [93]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
import pandas as pd
import numpy as np

a = np.random.normal(5,1,20)
b = np.random.normal(5,1.2,15)
c = np.random.normal(4.5,0.3,15)

weight_loss  = np.concatenate([a,b,c])
# Label each observation with its diet: 20 x 'A', 15 x 'B', 15 x 'C'
diet = np.repeat(['A', 'B', 'C'], [20, 15, 15])
df = pd.DataFrame({'Diet': diet, 'Weight_lost': weight_loss})

# Fit the one-way ANOVA model and build the Type I ANOVA table
# (the keyword is `typ`; `type` is not a recognized argument)
model = ols('Weight_lost ~ C(Diet)',data=df).fit()
result = sm.stats.anova_lm(model,typ=1)
result

            df     sum_sq   mean_sq         F    PR(>F)
C(Diet)    2.0   3.389464  1.694732  2.673772  0.079476
Residual  47.0  29.790280  0.633836       NaN       NaN


In [2]:
import numpy as np
from scipy.stats import f_oneway

a = np.random.normal(5,1,20)
b = np.random.normal(5,1.2,15)
c = np.random.normal(4.5,0.3,15)

weight_loss  = np.concatenate([a,b,c])
# Conduct the one-way ANOVA
f_statistic, p_value = f_oneway(a, b, c)

# Print the results
print("F-statistic:", f_statistic)
print("p-value:", p_value)

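F-statistic: 2.6863686425239486
p-value: 0.07858248236984498

Because no random seed is set, the two cells draw different samples, which is why their F-statistics differ slightly. In both runs the p-value (about 0.08) is above 0.05, so we fail to reject the null hypothesis: these simulated data do not provide sufficient evidence that the mean weight loss differs between the three diets.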


Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [1]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a sample dataset with 30 employees
df = pd.DataFrame({
    'program': ['A', 'B', 'C'] * 10,  # Cycle through the three programs
    'experience': ['novice'] * 15 + ['experienced'] * 15,  # First 15 novice, last 15 experienced (not randomized in this toy data)
    'time': [10, 12, 15, 14, 16, 18, 20, 22, 25, 24, 11, 13, 16, 18, 20,
             22, 26, 28, 30, 32, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
})

# Fit a two-way ANOVA model
model = ols('time ~ C(program) + C(experience) + C(program):C(experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)

                              sum_sq    df         F    PR(>F)
C(program)                  3.266667   2.0  0.049025  0.952253
C(experience)             294.533333   1.0  8.840420  0.006612
C(program):C(experience)   30.466667   2.0  0.457229  0.638437
Residual                  799.600000  24.0       NaN       NaN


The ANOVA table shows a significant main effect of experience on task completion time (F = 8.84, p = 0.0066), but no significant main effect of program (F = 0.05, p = 0.95) and no significant interaction between program and experience level (F = 0.46, p = 0.64).
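The ANOVA table gives the size of each effect but not its direction. A quick summary of the group means, sketched below on the same sample dataset, fills that gap; the names `experience_means` and `cell_counts` are introduced here for illustration.

```python
import pandas as pd

# Same sample dataset as in the cell above
df = pd.DataFrame({
    'program': ['A', 'B', 'C'] * 10,
    'experience': ['novice'] * 15 + ['experienced'] * 15,
    'time': [10, 12, 15, 14, 16, 18, 20, 22, 25, 24, 11, 13, 16, 18, 20,
             22, 26, 28, 30, 32, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
})

# Mean completion time per experience level
experience_means = df.groupby('experience')['time'].mean()
print(experience_means)

# Number of employees in each program/experience cell (checks that the design is balanced)
cell_counts = pd.crosstab(df['program'], df['experience'])
print(cell_counts)
```

In this made-up dataset the novice group averages about 16.9 and the experienced group 23.2, so the significant experience effect reflects longer times in the experienced group. Each of the six cells holds five employees.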

Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [3]:
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate sample data
np.random.seed(123)
control_scores = np.random.normal(loc=70, scale=10, size=50)
experimental_scores = np.random.normal(loc=75, scale=10, size=50)

# Conduct two-sample t-test
t_stat, p_val = ttest_ind(control_scores, experimental_scores)
print(f"t-statistic: {t_stat:.2f}, p-value: {p_val:.4f}")

# Conduct post-hoc test (with only two groups this is redundant: Tukey's
# adjusted p-value equals the t-test p-value)
tukey_results = pairwise_tukeyhsd(np.concatenate((control_scores, experimental_scores)),
                                  np.concatenate(([0]*50, [1]*50)), alpha=0.05)
print(tukey_results)


t-statistic: -2.32, p-value: 0.0227
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj  lower  upper  reject
--------------------------------------------------
     0      1   5.2768 0.0227 0.7537 9.7998   True
--------------------------------------------------
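A significant p-value says the groups differ but not by how much in standardized terms. As a supplement, the sketch below computes Cohen's d, a common effect-size measure, for the same simulated scores. Cohen's d is not part of the original answer; it is added here as an assumption about what a reader might want next.

```python
import numpy as np

# Same simulated data as in the cell above
np.random.seed(123)
control_scores = np.random.normal(loc=70, scale=10, size=50)
experimental_scores = np.random.normal(loc=75, scale=10, size=50)

n1, n2 = len(control_scores), len(experimental_scores)

# Pooled variance of the two groups (sample variances, ddof=1)
pooled_var = ((n1 - 1) * control_scores.var(ddof=1) +
              (n2 - 1) * experimental_scores.var(ddof=1)) / (n1 + n2 - 2)

# Cohen's d: difference in means expressed in pooled standard deviations
cohens_d = (experimental_scores.mean() - control_scores.mean()) / np.sqrt(pooled_var)
print(f"Cohen's d: {cohens_d:.2f}")
```

For two groups of equal size n, d relates to the pooled t-statistic by d = |t| * sqrt(2 / n), so the two numbers carry the same information on different scales.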


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [3]:
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate sample data
np.random.seed(123)
store_a_sales = np.random.normal(loc=100, scale=10, size=30)
store_b_sales = np.random.normal(loc=110, scale=10, size=30)
store_c_sales = np.random.normal(loc=120, scale=10, size=30)

# Build a long-format data frame: one row per (day, store) pair
sales_long = pd.DataFrame({
    'Day': np.tile(np.arange(1, 31), 3),
    'Stores': np.repeat(['A', 'B', 'C'], 30),
    'Sales': np.concatenate([store_a_sales, store_b_sales, store_c_sales]),
})

# Conduct repeated measures ANOVA: each day is the repeated "subject"
# and the store is the within-subject factor
rm = AnovaRM(data=sales_long, depvar='Sales', subject='Day', within=['Stores'])
res = rm.fit()
print(res.summary())

# Conduct post-hoc test
# Note: Tukey's HSD treats the stores as independent groups and ignores the
# pairing by day
tukey_results = pairwise_tukeyhsd(sales_long['Sales'], sales_long['Stores'], alpha=0.05)
print(tukey_results)


In [7]:
store_a_sales = np.random.normal(loc=100, scale=10, size=30)
store_b_sales = np.random.normal(loc=110, scale=10, size=30)
store_c_sales = np.random.normal(loc=120, scale=10, size=30)


# Create a long-format data frame with the sales data: 30 rows per store
sales_df = pd.DataFrame({'Stores': np.repeat(['A', 'B', 'C'], 30),
                        'Sales': np.concatenate([store_a_sales, store_b_sales, store_c_sales])})
sales_df

   Stores       Sales
0       A   87.866152
1       A   86.735122
2       A  114.083691
3       A   93.912892
4       A   86.793974
..    ...         ...
85      C  116.066891
86      C  110.424182
87      C  140.564673
88      C  101.115076
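Because the three stores are measured on the same days, a post-hoc comparison that respects the pairing may be more appropriate than one that treats the stores as independent groups. The following is a minimal sketch, assuming paired t-tests with a Bonferroni correction as the follow-up; this is a swapped-in technique for illustration, not the method used in the cells above, and the dictionary `sales` and its seed are hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_rel

# Simulated daily sales for 30 days, one array per store
np.random.seed(123)
sales = {
    'A': np.random.normal(loc=100, scale=10, size=30),
    'B': np.random.normal(loc=110, scale=10, size=30),
    'C': np.random.normal(loc=120, scale=10, size=30),
}

pairs = list(combinations(sales, 2))
adjusted = {}
for first, second in pairs:
    # Paired t-test: compares the two stores day by day
    t_stat, p_raw = ttest_rel(sales[first], sales[second])
    # Bonferroni: multiply by the number of comparisons, capped at 1
    adjusted[(first, second)] = min(p_raw * len(pairs), 1.0)
    print(f"{first} vs {second}: t = {t_stat:.2f}, "
          f"adjusted p = {adjusted[(first, second)]:.4f}")
```

A pair of stores is declared different when its adjusted p-value falls below 0.05.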
