Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

There are three primary assumptions in ANOVA:

    * The responses for each factor level have a normal population distribution.
    * These distributions have the same variance.
    * The data are independent.
    
Examples of violations that could impact the validity of ANOVA results are:
    
    1. Violation of normality assumption: If the data in one or more groups are not normally distributed (for example, strongly skewed or with heavy outliers), the p-value of the F-test may be inaccurate, especially when the groups are small. In such cases, a transformation of the data or the use of non-parametric tests may be required.
    2. Violation of homogeneity of variances: If the variances are not equal across the groups, the F-test used in ANOVA may not be accurate, leading to incorrect conclusions, particularly when the group sizes are also unequal. In such cases, Welch's ANOVA or a non-parametric test may be used.
    3. Violation of independence assumption: If the observations in one group are related to the observations in another group, the independence assumption is violated. For example, if the same individuals are included in more than one group, the observations may not be independent.
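
The first two assumptions can be screened numerically before running the ANOVA. The following is a minimal sketch, assuming SciPy is available; the three groups are made-up illustrative values, and the tests shown (Shapiro-Wilk and Levene) are one common choice rather than the only one.

```python
from scipy import stats

# Hypothetical measurements for three groups (illustrative values only)
group_a = [5.2, 4.1, 6.5, 4.8, 5.9, 6.2, 5.1, 4.7]
group_b = [4.3, 3.5, 4.7, 3.9, 3.8, 5.1, 3.6, 4.2]
group_c = [3.1, 2.8, 3.7, 3.3, 3.9, 3.5, 3.2, 3.6]
groups = {"A": group_a, "B": group_b, "C": group_c}

# Normality: Shapiro-Wilk test within each group
# (a small p-value suggests a departure from normality)
shapiro_p = {}
for name, values in groups.items():
    statistic, p = stats.shapiro(values)
    shapiro_p[name] = p
    print(f"Shapiro-Wilk, group {name}: p = {p:.3f}")

# Equal variances: Levene's test across all groups
# (a small p-value suggests the variances differ)
levene_stat, levene_p = stats.levene(group_a, group_b, group_c)
print(f"Levene: W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

Independence cannot be checked this way; it has to be argued from how the data were collected.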

Q2. What are the three types of ANOVA, and in what situations would each be used?

The three types of ANOVA are: 
    
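    * One-way ANOVA: Used to compare the means of two or more (typically three or more) independent groups defined by a single factor. For example, comparing mean weight loss across three diets.
    * Two-way ANOVA without replication: Used when there are two factors but only one observation for each combination of factor levels, so the interaction between the factors cannot be tested. For example, you're testing one set of individuals before and after they take a medication, with individual and time as the two factors.
    * Two-way ANOVA with replication: Used when there are two factors and several observations for each combination of levels, which allows the main effect of each factor and their interaction to be tested. For example, patients from two different hospitals trying two different therapies, with several patients in each hospital-therapy combination.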

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Partitioning of variance in ANOVA refers to the process of decomposing the total variance observed in the data into different components, each of which is associated with a particular source of variation. In other words, it is a technique that allows us to understand how much of the total variability in the data can be attributed to different factors.

Understanding the partitioning of variance in ANOVA is important because it allows us to:

   * Identify the sources of variation that are contributing to the differences between the groups.
   * Test whether the observed differences between the groups are statistically significant.
   * Interpret the results of ANOVA by analyzing the within-group and between-group variances.
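
For a one-way layout with k groups, n_i observations in group i, and N observations in total, the partition can be written as follows (the symbols are standard notation introduced here, not taken from the text above):

```latex
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}\right)^2}_{SS_{\text{total}}}
= \underbrace{\sum_{i=1}^{k} n_i\left(\bar{y}_i-\bar{y}\right)^2}_{SS_{\text{between}}}
+ \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_i\right)^2}_{SS_{\text{within}}},
\qquad
F = \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}
```

Here \bar{y}_i is the mean of group i and \bar{y} is the grand mean. The F-statistic is the ratio of the between-group mean square to the within-group mean square.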

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [13]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import numpy as np

#create pandas DataFrame
df = pd.DataFrame({'hours': [1, 1, 1, 2, 2, 2, 2, 2, 3, 3,
                             3, 4, 4, 4, 5, 5, 6, 7, 7, 8],
                   'score': [68, 76, 74, 80, 76, 78, 81, 84, 86, 83,
                             88, 85, 89, 94, 93, 94, 96, 89, 92, 97]})

#define response variable
y = df['score']

#define predictor variable
x = df[['hours']]

#add constant to predictor variables
x = sm.add_constant(x)

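#fit linear regression model (hours is treated as a numeric predictor,
#so this is the regression form of the decomposition)
model = sm.OLS(y, x).fit()

#residual (error) sum of squares; the code calls this SSE,
#whereas the question uses SSE for the explained sum of squares
sse = np.sum((model.fittedvalues - df.score)**2)
print("SSE: ",sse)

#explained (regression) sum of squares; the code calls this SSR
ssr = np.sum((model.fittedvalues - df.score.mean())**2)
print("SSR: ",ssr)

#total sum of squares = explained + residual
sst = ssr + sse
print("SST: ",sst)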

SSE:  331.07488479262696
SSR:  917.4751152073725
SST:  1248.5499999999995
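
The regression above treats hours as a numeric predictor. For a one-way ANOVA in the strict sense, each distinct value of hours would be treated as a separate group. The sketch below assumes that grouping and uses only the standard library to split the same data into between-group (explained) and within-group (residual) parts; the variable names are illustrative.

```python
from collections import defaultdict
from statistics import fmean

hours = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3,
         3, 4, 4, 4, 5, 5, 6, 7, 7, 8]
score = [68, 76, 74, 80, 76, 78, 81, 84, 86, 83,
         88, 85, 89, 94, 93, 94, 96, 89, 92, 97]

# Assumption: every distinct value of hours defines one group
groups = defaultdict(list)
for h, s in zip(hours, score):
    groups[h].append(s)

grand_mean = fmean(score)

# Between-group (explained) sum of squares
ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values())

# Within-group (residual) sum of squares
ss_within = sum((s - fmean(g)) ** 2 for g in groups.values() for s in g)

# Total sum of squares, computed directly from the data
ss_total = sum((s - grand_mean) ** 2 for s in score)

print("Between:", ss_between)
print("Within: ", ss_within)
print("Total:  ", ss_total)
```

The total agrees with the SST printed above. The explained part is at least as large as in the regression, because the group means are not constrained to lie on a straight line.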


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [19]:
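# Importing libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a dataframe
# Note: both factor columns are built the same way, so Fertilizer and
# Watering coincide row by row and their effects cannot be separated
dataframe = pd.DataFrame({'Fertilizer': np.repeat(['daily', 'weekly'], 15),
                          'Watering': np.repeat(['daily', 'weekly'], 15),
                          'height': [14, 16, 15, 15, 16, 13, 12, 11,
                                     14, 15, 16, 16, 17, 18, 14, 13,
                                     14, 14, 14, 15, 16, 16, 17, 18,
                                     14, 13, 14, 14, 14, 15]})
#print(dataframe)


# Performing two-way ANOVA: two main effects plus their interaction
model = ols('height ~ C(Fertilizer) + C(Watering) + C(Fertilizer):C(Watering)',
            data=dataframe).fit()
# anova_lm takes the keyword `typ` (not `type`); typ=1 gives the sequential
# (Type I) table shown below, typ=2 would give Type II sums of squares
result = sm.stats.anova_lm(model, typ=1)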

# Print the result
print(result)


                             df     sum_sq   mean_sq         F    PR(>F)
C(Fertilizer)               1.0   0.033333  0.033333  0.012069  0.913305
C(Watering)                 1.0   0.000369  0.000369  0.000133  0.990865
C(Fertilizer):C(Watering)   1.0   0.040866  0.040866  0.014796  0.904053
Residual                   28.0  77.333333  2.761905       NaN       NaN
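
To make the rows of such a table concrete, the following sketch computes the two main effects and the interaction by hand for a small balanced design in which the factors are fully crossed. The data are hypothetical, and the formulas assume an equal number of replicates in every cell.

```python
import numpy as np

# Hypothetical balanced design: data[i, j, k] is replicate k for
# fertilizer level i and watering level j (2 x 2 levels, 2 replicates)
data = np.array([[[10.0, 12.0], [14.0, 16.0]],
                 [[11.0, 13.0], [21.0, 23.0]]])
a, b, r = data.shape

grand = data.mean()
mean_a = data.mean(axis=(1, 2))  # one mean per fertilizer level
mean_b = data.mean(axis=(0, 2))  # one mean per watering level
cell = data.mean(axis=2)         # one mean per fertilizer x watering cell

# Main effects: spread of the level means around the grand mean
ss_a = b * r * np.sum((mean_a - grand) ** 2)
ss_b = a * r * np.sum((mean_b - grand) ** 2)

# Interaction: what the cell means show beyond the two additive main effects
ss_ab = r * np.sum((cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2)

# Residual: spread of the replicates around their own cell mean
ss_e = np.sum((data - cell[:, :, None]) ** 2)
ms_e = ss_e / (a * b * (r - 1))

f_a = (ss_a / (a - 1)) / ms_e
f_b = (ss_b / (b - 1)) / ms_e
f_ab = (ss_ab / ((a - 1) * (b - 1))) / ms_e

print("SS:", ss_a, ss_b, ss_ab, ss_e)
print("F: ", f_a, f_b, f_ab)
```

Each F-statistic is the effect's mean square divided by the residual mean square, which is the same quantity an ANOVA table reports in its F column.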


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

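An F-statistic of 5.23 means that the variability between the group means is about 5.23 times the variability within the groups. The p-value of 0.02 means that, if all group means were truly equal, an F-statistic at least this large would occur only about 2% of the time. Because 0.02 is below the usual 0.05 significance level, we reject the null hypothesis and conclude that at least one group mean differs from the others. The F-test does not say which groups differ or how large the differences are; a post-hoc test and an effect size measure are needed for that.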
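
The p-value attached to an F-statistic depends on the degrees of freedom, which the question does not give. As an illustration only, the sketch below assumes 3 groups and 30 observations in total (2 and 27 degrees of freedom), so its p-value need not equal the 0.02 quoted in the question. It also computes eta squared, an effect size that can be recovered from F and the degrees of freedom.

```python
from scipy import stats

f_statistic = 5.23
df_between = 2   # assumed: 3 groups - 1
df_within = 27   # assumed: 30 observations - 3 groups

# Upper-tail probability of the F distribution
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Eta squared: share of the total variability attributed to the group factor
eta_squared = (f_statistic * df_between) / (f_statistic * df_between + df_within)

print(f"p-value: {p_value:.4f}")
print(f"eta squared: {eta_squared:.3f}")
```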

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

There are several methods to handle missing data in a repeated measures ANOVA, including:

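Complete case analysis: Only using the data for participants who have complete data for all time points. This method reduces the sample size and statistical power, and it can lead to biased results if the data are not missing completely at random (MCAR).

Pairwise deletion: Analyzing the data for each time point separately, using only the participants who have complete data for that time point. This keeps more data than complete case analysis, but different comparisons rest on different subsets of participants, and the results can be biased if the data are missing at random (MAR) or non-ignorable.

Mean imputation: Replacing missing values with the mean of the available data for that variable. This method understates the variance and weakens the correlations between time points even when the data are MCAR, so standard errors tend to be too small, and it can also bias the estimates if the data are MAR or non-ignorable.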

Multiple imputation: Creating several imputed datasets by estimating the missing values based on the available data and incorporating the variability of the estimates. This method can produce unbiased estimates if the imputation model is appropriate and the data are MAR.
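
The effect of the two simplest strategies can be seen on a small example. The sketch below uses hypothetical scores for five participants at three time points (rows are participants, NaN marks a missing value) and compares complete case analysis with mean imputation.

```python
import numpy as np

# Hypothetical data: 5 participants x 3 time points, NaN = missing
scores = np.array([[5.0, 6.0, 7.0],
                   [4.0, np.nan, 6.5],
                   [6.0, 7.0, np.nan],
                   [5.5, 6.5, 8.0],
                   [4.5, 5.0, 6.0]])

# Complete case analysis: keep only participants with no missing values
complete_cases = scores[~np.isnan(scores).any(axis=1)]

# Mean imputation: fill each gap with the mean of its time point
column_means = np.nanmean(scores, axis=0)
imputed = np.where(np.isnan(scores), column_means, scores)

# Sample variance at the second time point, before and after imputation
var_observed = np.nanvar(scores[:, 1], ddof=1)
var_imputed = np.var(imputed[:, 1], ddof=1)

print("Participants kept by complete case analysis:", complete_cases.shape[0])
print("Variance of observed values:   ", var_observed)
print("Variance after mean imputation:", var_imputed)
```

Complete case analysis discards two of the five participants, while mean imputation keeps everyone but shrinks the variance, because each imputed value sits exactly at the mean.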

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

Post-hoc tests are used after conducting an ANOVA to determine which specific groups are significantly different from each other. Some common post-hoc tests used after ANOVA include:

    1. Tukey's HSD (Honestly Significant Difference): This test compares all possible pairs of group means while controlling the family-wise error rate. It is the usual choice when there are more than two groups, the goal is to compare all pairs of groups, and the group variances are similar.

    2. Bonferroni correction: This is not a test in itself but an adjustment for multiple comparisons: the significance level is divided by the number of comparisons (equivalently, each p-value is multiplied by that number). It is a conservative method and is generally used when there are a small number of planned comparisons.

    3. Scheffe's method: This is a conservative method that protects against all possible contrasts among the group means, not only pairwise differences. It is generally used when complex comparisons are of interest, such as one group against the average of two others.

    4. Dunn's test: This is a non-parametric pairwise procedure based on ranks. It is typically used after a Kruskal-Wallis test when the assumption of normality is violated.

For example, if a one-way ANOVA comparing three diets is significant, the F-test only shows that at least one mean differs. A post-hoc test such as Tukey's HSD is then needed to find out whether A differs from B, A from C, or B from C.
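
As an illustration of the simplest of these adjustments, the sketch below runs all pairwise two-sample t-tests on three hypothetical groups and applies a Bonferroni correction by hand. The data and variable names are assumptions for the example.

```python
from itertools import combinations
from scipy import stats

# Hypothetical weight-loss values for three diets (illustrative only)
groups = {
    "A": [5.2, 4.1, 6.5, 4.8, 5.9, 6.2],
    "B": [4.3, 3.5, 4.7, 3.9, 3.8, 5.1],
    "C": [3.1, 2.8, 3.7, 3.3, 3.9, 3.5],
}

pairs = list(combinations(groups, 2))
n_comparisons = len(pairs)

adjusted = {}
for first, second in pairs:
    t_stat, p_raw = stats.ttest_ind(groups[first], groups[second])
    # Bonferroni: multiply each p-value by the number of comparisons, cap at 1
    adjusted[(first, second)] = min(1.0, p_raw * n_comparisons)

for (first, second), p_adj in adjusted.items():
    print(f"{first} vs {second}: adjusted p = {p_adj:.4f}")
```

A pair is declared different when its adjusted p-value falls below the chosen significance level.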

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [1]:
import numpy as np
from scipy import stats

# Define the data
diet_a = [5.2, 4.1, 6.5, 4.8, 5.9, 6.2, 5.1, 4.7, 5.5, 6.3,
          5.9, 5.3, 6.8, 5.7, 4.9, 5.2, 6.1, 5.4, 4.6, 5.8,
          6.5, 4.7, 5.6, 6.1, 4.9]
diet_b = [4.3, 3.5, 4.7, 3.9, 3.8, 5.1, 3.6, 4.2, 3.7, 4.8,
          4.5, 3.9, 5.2, 4.1, 3.8, 3.6, 4.9, 4.4, 4.2, 4.7,
          4.6, 4.4, 4.1, 4.9, 4.2]
diet_c = [3.1, 2.8, 3.7, 3.3, 3.9, 3.5, 3.2, 3.6, 2.9, 3.4,
          3.7, 3.1, 2.8, 3.5, 3.3, 3.2, 3.9, 3.4, 3.8, 3.1,
          3.6, 3.2, 3.3, 3.9, 3.4]

# Conduct the one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_a, diet_b, diet_c)

# Print the results
print("F-statistic:", f_statistic)
print("p-value:", p_value)


F-statistic: 102.62573450053964
p-value: 8.327431303523005e-22


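The F-statistic is approximately 102.63 (with 2 and 72 degrees of freedom), and the p-value is about 8.3e-22, far below 0.05. This indicates that there is a significant difference between the mean weight loss of the three diets, so we can reject the null hypothesis that the mean weight loss is the same for all diets. The ANOVA alone does not show which diets differ from each other; a post-hoc test would be needed for that.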

Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [14]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a dataframe
data = pd.DataFrame({'experience_level': np.repeat(['novice', 'experienced', 'experienced', 'novice', 'experienced'], 6),
                     'software_program': np.repeat(['Program A', 'Program B', 'Program C'], 10),
                     'completion_time': [14, 16, 15, 15, 16, 13, 12, 11,
                                         14, 15, 16, 16, 17, 18, 14, 13,
                                         14, 14, 14, 15, 16, 16, 17, 18,
                                         14, 13, 14, 14, 14, 15]})
print(data)
# Alternatively, load the data from a file
# data = pd.read_csv('task_completion_time.csv')

# Conduct the two-way ANOVA (both main effects and their interaction)
model = ols('completion_time ~ software_program + experience_level + software_program:experience_level', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)

   experience_level software_program  completion_time
0            novice        Program A               14
1            novice        Program A               16
2            novice        Program A               15
3            novice        Program A               15
4            novice        Program A               16
5            novice        Program A               13
6       experienced        Program A               12
7       experienced        Program A               11
8       experienced        Program A               14
9       experienced        Program A               15
10      experienced        Program B               16
11      experienced        Program B               16
12      experienced        Program B               17
13      experienced        Program B               18
14      experienced        Program B               14
15      experienced        Program B               13
16      experienced        Program B               14
17      experienced        P

From the ANOVA table, we can see that both the software program and the experience level have a significant main effect on task completion time. The interaction between them has a p-value of 0.052743, which is just above the 0.05 significance level, so it is not statistically significant.

Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [4]:
!pip install scipy



In [15]:
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate random data
#np.random.seed(42)
control_scores = np.random.normal(loc=70, scale=10, size=50)
experimental_scores = np.random.normal(loc=75, scale=10, size=50)

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)
print("Two-sample t-test results:")
print(f"t-statistic = {t_statistic:.3f}")
print(f"p-value = {p_value:.3f}")

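# Perform post-hoc test
# (with only two groups there is a single pairwise comparison,
# so Tukey's HSD repeats the conclusion of the t-test)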
if p_value < 0.05:
    print("\nPost-hoc test results:")
    all_scores = np.concatenate([control_scores, experimental_scores])
    group_labels = ["Control"] * 50 + ["Experimental"] * 50
    pairwise_results = pairwise_tukeyhsd(all_scores, group_labels)
    print(pairwise_results)


Two-sample t-test results:
t-statistic = -3.509
p-value = 0.001

Post-hoc test results:
   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj  lower   upper  reject
----------------------------------------------------------
Control Experimental   6.6718 0.0007 2.8991 10.4446   True
----------------------------------------------------------
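
The t-test call above uses the pooled-variance version, which assumes equal variances in the two groups. If that assumption is doubtful, Welch's version can be requested with `equal_var=False`. The sketch below uses simulated scores with a fixed seed; the group means and standard deviations are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=70, scale=10, size=50)
experimental = rng.normal(loc=75, scale=20, size=50)  # assumed larger spread

# Pooled-variance (Student) t-test
t_pooled, p_pooled = stats.ttest_ind(control, experimental)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(control, experimental, equal_var=False)

print(f"Student: t = {t_pooled:.3f}, p = {p_pooled:.3f}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.3f}")
```

With equal group sizes the two t-statistics coincide; only the degrees of freedom, and therefore the p-value, differ.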


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant
differences in sales between the three stores. If the results are significant, follow up with a post-hoc
test to determine which store(s) differ significantly from each other.

In [18]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate random data
np.random.seed(42)
store_a_sales = np.random.normal(loc=100, scale=10, size=30)
store_b_sales = np.random.normal(loc=110, scale=10, size=30)
store_c_sales = np.random.normal(loc=120, scale=10, size=30)
sales_data = pd.DataFrame({
    "Store A": store_a_sales,
    "Store B": store_b_sales,
    "Store C": store_c_sales
})
#print(store_b_sales)

# Reshape data for repeated measures ANOVA
sales_data_melt = pd.melt(sales_data.reset_index(), id_vars=["index"], value_vars=["Store A", "Store B", "Store C"])
sales_data_melt.columns = ["Day", "Store", "Sales"]

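# Fit repeated measures ANOVA model: Day enters as a blocking factor,
# so day-to-day variation is removed from the error term
rm_anova = ols("Sales ~ C(Store) + C(Day)", data=sales_data_melt).fit()
anova_table = sm.stats.anova_lm(rm_anova, typ=2)
print("Repeated measures ANOVA results:")
print(anova_table)

# Perform post-hoc test if the Store effect is significant
# (select the p-value by row label rather than by position)
# Note: Tukey's HSD treats the stores as independent groups and ignores the pairing by day
if anova_table.loc["C(Store)", "PR(>F)"] < 0.05:
    print("\nPost-hoc test results:")
    posthoc_results = pairwise_tukeyhsd(sales_data_melt["Sales"], sales_data_melt["Store"])
    print(posthoc_results)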

Repeated measures ANOVA results:
               sum_sq    df          F        PR(>F)
C(Store)  7269.059318   2.0  42.616644  4.112793e-12
C(Day)    2770.392337  29.0   1.120145  3.487772e-01
Residual  4946.488019  58.0        NaN           NaN

Post-hoc test results:
 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
 group1  group2 meandiff p-adj   lower   upper  reject
------------------------------------------------------
Store A Store B  10.6698 0.0001  4.8714 16.4683   True
Store A Store C  22.0103    0.0 16.2119 27.8087   True
Store B Store C  11.3405    0.0  5.5421 17.1389   True
------------------------------------------------------
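
statsmodels also provides a dedicated class, `AnovaRM`, for balanced repeated measures designs. The sketch below is one possible way to use it, with Day as the repeated unit and Store as the within-subject factor. The sales are simulated with the same means and spread as above but with a different random generator, so the numbers will not match the tables shown.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Simulated sales in long format: one row per (day, store) combination
rng = np.random.default_rng(42)
long_data = pd.DataFrame({
    "Day": np.tile(np.arange(30), 3),
    "Store": np.repeat(["Store A", "Store B", "Store C"], 30),
    "Sales": np.concatenate([
        rng.normal(loc=100, scale=10, size=30),
        rng.normal(loc=110, scale=10, size=30),
        rng.normal(loc=120, scale=10, size=30),
    ]),
})

# Day is the repeated "subject"; Store is the within-subject factor
rm_result = AnovaRM(data=long_data, depvar="Sales", subject="Day", within=["Store"]).fit()
rm_table = rm_result.anova_table
print(rm_table)
```

`AnovaRM` requires exactly one observation per subject and condition, which holds here because every day has one sales figure per store.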
