## Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

ANOVA (Analysis of Variance) is a statistical method used to compare the means of two or more groups of data. ANOVA assumes the following:

1. Independence: The data within each group should be independent of each other.

2. Normality: The data should be normally distributed within each group.

3. Homogeneity of variance: The variances of the groups should be equal.

4. Random sampling: The observations should be drawn at random from the populations being compared.

If any of these assumptions are violated, the results of ANOVA may not be valid.

Examples of violations that could impact the validity of the results are:

Violation of independence: If the observations are not independent, the F-test is no longer valid. For example, if the same participants are measured under several treatments, their scores are correlated, and a repeated measures design should be used instead of an ordinary one-way ANOVA.

Violation of normality: If the data within the groups are not normally distributed, the results of ANOVA may not be valid. ANOVA tolerates moderate departures from normality when samples are large and balanced, but strongly skewed data or extreme outliers in small samples can distort the F-test.

Violation of homogeneity of variance: If the variances of the groups are not equal, the results of ANOVA may not be valid. For example, if one group has a much larger variance than another, especially when the group sizes also differ, the Type I error rate can be noticeably higher or lower than the nominal level.

In summary, ANOVA is a powerful statistical tool for comparing means between groups. However, it is important to ensure that the assumptions of independence, normality, and homogeneity of variance are met to ensure that the results are valid. Violations of these assumptions could lead to erroneous conclusions.
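The normality and equal-variance assumptions can be screened with standard tests before running ANOVA. The sketch below is one possible check, assuming made-up data for three groups (`group_a`, `group_b`, and `group_c` are hypothetical); it uses `scipy.stats.shapiro` for normality and `scipy.stats.levene` for homogeneity of variance. Independence cannot be tested this way, since it depends on how the data were collected.

```python
from scipy import stats

# Hypothetical example data: three independent groups
group_a = [10, 12, 14, 16, 18]
group_b = [9, 11, 13, 15, 17]
group_c = [8, 10, 12, 14, 16]
groups = {"A": group_a, "B": group_b, "C": group_c}

# Shapiro-Wilk test per group: a small p-value suggests a departure from normality
normality_p = {name: stats.shapiro(values).pvalue for name, values in groups.items()}

# Levene's test across groups: a small p-value suggests unequal variances
levene_stat, levene_p = stats.levene(group_a, group_b, group_c)

for name, p in normality_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

With samples this small, both tests have little power, so a non-significant result is weak evidence that the assumptions hold; plots of the residuals are a useful complement.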

## Q2. What are the three types of ANOVA, and in what situations would each be used?

There are three types of ANOVA:

One-way ANOVA: This is used to test for differences in the mean of a single continuous dependent variable across the independent groups defined by one categorical independent variable (also known as a factor). For example, a one-way ANOVA could be used to compare the effectiveness of three different drugs on blood pressure.

Two-way ANOVA: This is used to test for differences in the mean of a single continuous dependent variable across the groups defined by two categorical independent variables (factors). It estimates the main effect of each factor and is useful for examining the interaction between them. For example, a two-way ANOVA could be used to examine the effects of different training programs (factor 1) and gender (factor 2) on job performance.

Repeated measures ANOVA: This is used to test for differences in the mean of a single continuous dependent variable across two or more dependent groups (i.e., repeated measures on the same subjects) defined by one categorical factor. For example, a repeated measures ANOVA could be used to compare the scores of a group of individuals on a memory test at baseline, after a placebo, and after a medication.

## Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA refers to the division of the total variance in the dependent variable into different components that are associated with different sources of variation. These components include the between-group variance and the within-group variance.

The between-group variance reflects the differences in means between the groups being compared and is often referred to as the "treatment effect." The within-group variance reflects the differences within each group and is often referred to as the "error" or "residual" variance.

Understanding the partitioning of variance is important because it allows us to determine the relative contributions of the treatment effect and the error to the total variance in the dependent variable. This, in turn, allows us to determine whether the observed differences between the groups are statistically significant or simply due to chance.

Additionally, understanding the partitioning of variance can help us identify potential sources of variability that may be impacting our results. For example, if the within-group variance is large, it may suggest that there is a lot of variability within each group that is not being accounted for in the analysis.

In summary, the partitioning of variance in ANOVA is a critical concept that allows us to determine the relative contributions of different sources of variability to the total variance in the dependent variable. This understanding is essential for determining the statistical significance of the observed differences between groups and for identifying potential sources of variability that may be impacting our results.
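The partition described above can be written out for a one-way design. The notation here is an assumption introduced for illustration: $k$ groups, $n_j$ observations in group $j$, $N$ observations in total, $x_{ij}$ for observation $i$ in group $j$, $\bar{x}_j$ for the mean of group $j$, and $\bar{x}$ for the grand mean.

$$
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}\right)^2}_{\text{total}}
=\underbrace{\sum_{j=1}^{k} n_j\left(\bar{x}_j-\bar{x}\right)^2}_{\text{between groups}}
+\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}_j\right)^2}_{\text{within groups}}
$$

$$
F=\frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}
$$

Dividing each sum of squares by its degrees of freedom gives a mean square, and the F-statistic is the ratio of the between-group mean square to the within-group mean square.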

## Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

```python
import scipy.stats as stats

# create sample data
group1 = [10, 12, 14, 16, 18]
group2 = [9, 11, 13, 15, 17]
group3 = [8, 10, 12, 14, 16]

# concatenate the groups into a single dataset
data = group1 + group2 + group3

# calculate the one-way ANOVA
F, p = stats.f_oneway(group1, group2, group3)

# note: the names follow the question (SSE = explained, SSR = residual);
# many textbooks instead use SSE for the error sum of squares and SSR for regression

# calculate the total sum of squares (SST)
mean = sum(data) / len(data)
SST = sum((x - mean)**2 for x in data)

# calculate the explained (between-group) sum of squares (SSE)
SSE = sum(len(g) * (sum(g) / len(g) - mean)**2 for g in (group1, group2, group3))

# calculate the residual (within-group) sum of squares (SSR)
SSR = SST - SSE

# print the results
print("Total sum of squares (SST):", SST)
print("Explained sum of squares (SSE):", SSE)
print("Residual sum of squares (SSR):", SSR)
```

Output:

```
Total sum of squares (SST): 130.0
Explained sum of squares (SSE): 10.0
Residual sum of squares (SSR): 120.0
```
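As a cross-check, the residual sum of squares can also be computed directly from the deviations within each group instead of by subtraction, and the F-statistic can be rebuilt from the mean squares. The sketch below reuses the same three groups; the helper `mean` and the variable names are illustrative choices, not part of the original answer.

```python
from scipy import stats

group1 = [10, 12, 14, 16, 18]
group2 = [9, 11, 13, 15, 17]
group3 = [8, 10, 12, 14, 16]
groups = [group1, group2, group3]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


all_values = [x for g in groups for x in g]
grand_mean = mean(all_values)

# Sums of squares computed independently of one another
ss_total = sum((x - grand_mean) ** 2 for x in all_values)
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

# F is the ratio of the between-group to the within-group mean square
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

# Compare with SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)

print("Between + within:", ss_between + ss_within, "| total:", ss_total)
print("F (manual):", f_manual, "| F (SciPy):", f_scipy)
```

For these data the partition is 130 = 10 + 120 and F = (10 / 2) / (120 / 12) = 0.5, so the group means do not differ by more than the within-group spread would suggest.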


## Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# load the data into a pandas DataFrame
# (data.csv is a placeholder; it is assumed to contain the columns DepVar, Group1, and Group2)
df = pd.read_csv('data.csv')

# fit the two-way ANOVA model with both main effects and their interaction
model = ols('DepVar ~ C(Group1) + C(Group2) + C(Group1):C(Group2)', data=df).fit()

# build the ANOVA table (Type II sums of squares)
anova_table = sm.stats.anova_lm(model, typ=2)

# each main effect and the interaction is tested by its F-statistic and p-value;
# sum_sq / df is the mean square for that effect, not the effect itself
for effect in ['C(Group1)', 'C(Group2)', 'C(Group1):C(Group2)']:
    mean_square = anova_table.loc[effect, 'sum_sq'] / anova_table.loc[effect, 'df']
    f_value = anova_table.loc[effect, 'F']
    p_value = anova_table.loc[effect, 'PR(>F)']
    print(f"{effect}: mean square = {mean_square:.3f}, F = {f_value:.3f}, p = {p_value:.4f}")
```


## Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

If a one-way ANOVA yields an F-statistic of 5.23 and a p-value of 0.02, we can conclude that there is at least one group mean that is statistically significantly different from the other group means. The null hypothesis in this case is that all group means are equal, and the alternative hypothesis is that at least one group mean is different from the others.

Since the p-value is less than the significance level (usually set at 0.05), we reject the null hypothesis and conclude that there is sufficient evidence to suggest that the group means are not all equal. However, it is important to note that we cannot determine which specific group(s) differ from the others without further post-hoc testing.

Furthermore, the F-statistic of 5.23 indicates that the between-group mean square is 5.23 times the within-group mean square. Under the null hypothesis this ratio is expected to be close to 1, so a value this large suggests that the differences between the group means are larger than what we would expect from random error alone.

In summary, we can interpret these results as evidence that there are statistically significant differences between the groups, but we need to conduct further tests to determine which specific group(s) differ from the others.
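The link between an F-statistic and its p-value can be sketched with the F distribution in SciPy. The question does not state the degrees of freedom, so the values below (3 groups and 30 observations) are assumptions chosen only for illustration; with them the p-value will not be exactly the 0.02 quoted in the question.

```python
from scipy import stats

f_statistic = 5.23
alpha = 0.05

# Hypothetical design (not given in the question): 3 groups, 30 observations in total
n_groups = 3
n_total = 30
df_between = n_groups - 1
df_within = n_total - n_groups

# p-value: probability of an F at least this large if all group means are equal
p_value = stats.f.sf(f_statistic, df_between, df_within)

# critical value: the F that would be exceeded with probability alpha under the null
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

reject_null = p_value < alpha
print(f"p-value: {p_value:.4f}")
print(f"critical F at alpha = {alpha}: {f_critical:.3f}")
print("Reject the null hypothesis:", reject_null)
```

Comparing the p-value with alpha and comparing the F-statistic with the critical value are two views of the same decision rule.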

## Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Handling missing data in a repeated measures ANOVA depends on why the data are missing. A standard repeated measures ANOVA needs a complete set of measurements for every subject, so by default any subject with a missing value is dropped (listwise deletion). If the data are missing completely at random (MCAR), meaning the missingness is unrelated to both the observed and the unobserved values, listwise deletion gives unbiased estimates but loses power. If the data are missing at random (MAR), meaning the missingness depends only on observed values, multiple imputation or likelihood-based methods such as linear mixed-effects models give valid results. If the data are missing not at random (MNAR), meaning the missingness depends on the unobserved values themselves, none of these methods is unbiased on its own; the missingness mechanism has to be modeled explicitly (for example with selection or pattern-mixture models) and the conclusions checked with sensitivity analyses.

The consequences of using different methods to handle missing data can be significant. Listwise deletion reduces the sample size and statistical power, and it biases the estimates unless the data are MCAR. Mean imputation shrinks the variance of the imputed variable and weakens its correlations with other measurements, so standard errors come out too small and the Type I error rate is inflated. Multiple imputation preserves the uncertainty about the missing values, but requires more computation and a well-specified imputation model. Maximum likelihood estimation (for example, a linear mixed-effects model) uses every available observation and is valid under MAR, but it is more difficult to implement and relies on a correctly specified model. Therefore, it is important to carefully consider the reason for missing data and the appropriate method for handling it in a repeated measures ANOVA.
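The effect of two of the simpler strategies can be seen on a small example. The sketch below uses made-up scores for six subjects at three time points (the `scores` table is hypothetical) and compares listwise deletion with mean imputation using pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
scores = pd.DataFrame({
    "t1": [10.0, 12.0, 14.0, 16.0, 18.0, 20.0],
    "t2": [11.0, np.nan, 15.0, 17.0, np.nan, 21.0],
    "t3": [12.0, 14.0, 16.0, np.nan, 20.0, 22.0],
})

# Listwise deletion: keep only subjects with a complete set of measurements
complete_cases = scores.dropna()

# Mean imputation: replace each missing value with the mean of its column
mean_imputed = scores.fillna(scores.mean())

sd_observed = scores["t2"].std()       # pandas skips NaN by default
sd_imputed = mean_imputed["t2"].std()

print("Subjects kept after listwise deletion:", len(complete_cases), "of", len(scores))
print(f"SD of t2 using observed values only: {sd_observed:.2f}")
print(f"SD of t2 after mean imputation: {sd_imputed:.2f}")
```

Listwise deletion discards half of the subjects here, while mean imputation keeps all six but lowers the standard deviation of `t2`, which is the variance shrinkage that makes standard errors too small.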

## Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Post-hoc tests are used after ANOVA to determine which specific groups are significantly different from each other when a significant difference is found in the overall ANOVA. Here are some common post-hoc tests and when they might be used:

Tukey's Honestly Significant Difference (HSD): This test is used to compare all possible pairs of means after ANOVA. It controls the family-wise error rate, which is the probability of making at least one type I error (rejecting a true null hypothesis) across all pairwise comparisons. It is used when every pairwise comparison is of interest and the group variances are similar.

Bonferroni correction: This method adjusts the significance level of the post-hoc tests by dividing it by the number of comparisons made, which controls the family-wise error rate. It works with any set of tests but becomes very conservative as the number of comparisons grows, so it is best used when only a small number of planned comparisons are made.

Scheffé's method: This test controls the family-wise error rate for all possible contrasts among the means, not only pairwise comparisons, and it can be used with unequal sample sizes. It is more conservative than the other post-hoc tests, so it is best suited to complex comparisons, such as the average of two groups against a third.

Dunnett's test: This test compares each treatment group to a single control group. It controls the family-wise error rate, and because it makes only k - 1 comparisons for k groups rather than all pairs, it has more power than Tukey's HSD for this purpose.

Games-Howell test: This test is used when the assumption of equal variances is violated. It does not assume equal variances across groups, and therefore can be more robust.

An example of a situation where a post-hoc test might be necessary is if a study involves comparing the effectiveness of four different treatments for a medical condition. After conducting an ANOVA, it is found that there is a significant difference between the means. A post-hoc test would be used to determine which specific pairs of treatments are significantly different from each other. This information would be useful for clinicians to make informed decisions about which treatment to use for their patients.
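The four-treatment scenario can be sketched with Tukey's HSD from statsmodels. The outcome scores below are invented for illustration (the treatment labels `T1` to `T4` and their values are hypothetical); `pairwise_tukeyhsd` takes the pooled values and a matching array of group labels.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical outcome scores for four treatments, five patients each
scores = {
    "T1": [20, 21, 19, 22, 20],
    "T2": [21, 20, 22, 21, 19],
    "T3": [30, 31, 29, 32, 30],
    "T4": [25, 26, 24, 27, 25],
}

# Flatten into one array of values and one array of group labels
values = np.concatenate([np.asarray(v, dtype=float) for v in scores.values()])
labels = np.concatenate([[name] * len(v) for name, v in scores.items()])

# Tukey's HSD compares every pair of treatments at a family-wise alpha of 0.05
tukey = pairwise_tukeyhsd(values, labels, alpha=0.05)
print(tukey.summary())

# Boolean array, one entry per pair: True where the pair differs significantly
print("Significant pairs:", int(tukey.reject.sum()), "of", len(tukey.reject))
```

In this made-up data set T1 and T2 have nearly identical means, so that pair is the only one not flagged as different.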

## Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

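```python
import numpy as np
from scipy.stats import f_oneway

# weight loss data for each diet
# (illustrative values: 25 per diet, 75 in total, rather than the 50 participants in the question)
diet_A = np.array([3.2, 4.1, 1.8, 2.9, 4.5, 3.7, 2.6, 1.9, 2.2, 3.1,
                   2.4, 3.6, 4.3, 2.7, 1.8, 3.9, 4.2, 3.4, 2.8, 3.5,
                   2.1, 2.3, 1.7, 3.3, 2.5])
diet_B = np.array([2.5, 1.9, 3.1, 1.8, 2.7, 3.2, 2.3, 3.0, 3.5, 2.2,
                   1.7, 2.8, 3.3, 2.1, 3.6, 2.9, 2.0, 2.4, 1.5, 2.6,
                   3.8, 3.7, 1.6, 2.4, 3.4])
diet_C = np.array([1.2, 1.8, 0.9, 2.4, 1.4, 2.2, 1.0, 1.6, 1.5, 1.9,
                   1.1, 1.8, 2.1, 1.5, 2.6, 1.7, 1.2, 2.0, 1.4, 1.9,
                   1.3, 2.0, 1.5, 1.7, 1.1])

# perform one-way ANOVA
f_stat, p_val = f_oneway(diet_A, diet_B, diet_C)

# print results
print("F-statistic: ", f_stat)
print("p-value: ", p_val)
```

Output:

```
F-statistic:  26.270116171334905
p-value:  2.7090791490464184e-09
```

Because the p-value (about 2.7e-09) is far below 0.05, we reject the null hypothesis that the three diets produce the same mean weight loss. At least one diet differs from the others; a post-hoc test such as Tukey's HSD would be needed to identify which.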


## Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a sample dataset: 3 programs x 2 experience levels, 15 observations per cell.
# The original columns had 90, 60, and 135 entries, which raised
# "ValueError: All arrays must be of the same length". Only the first 90 times are kept,
# assuming each row of 15 times alternates novice / experienced within a program.
data = pd.DataFrame({
    'program': ['A']*30 + ['B']*30 + ['C']*30,
    'experience': (['novice']*15 + ['experienced']*15) * 3,
    'time': [15, 18, 20, 19, 17, 16, 21, 23, 22, 20, 18, 19, 22, 20, 21,
             13, 15, 12, 16, 14, 17, 15, 13, 14, 16, 18, 12, 15, 13, 14,
             24, 26, 27, 23, 25, 24, 28, 26, 27, 25, 26, 24, 28, 27, 25,
             18, 20, 17, 19, 16, 18, 20, 19, 21, 17, 18, 16, 20, 19, 17,
             22, 23, 21, 24, 20, 22, 23, 21, 20, 22, 23, 24, 21, 22, 23,
             19, 21, 20, 18, 22, 19, 20, 21, 18, 20, 19, 21, 20, 22, 21]
})

# Fit a two-way ANOVA model with both main effects and their interaction
model = ols('time ~ C(program) + C(experience) + C(program):C(experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)
```

In the resulting table, the rows `C(program)` and `C(experience)` give the F-statistic and p-value (`PR(>F)`) for each main effect, and the `C(program):C(experience)` row gives them for the interaction. An effect is significant at the 5% level when its p-value is below 0.05.

## Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

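```python
import numpy as np
from scipy.stats import ttest_ind

# create mock data: 100 students split evenly between the two groups
rng = np.random.default_rng(42)  # fixed seed so the mock data are reproducible
control_scores = rng.normal(70, 10, size=50)
experimental_scores = rng.normal(75, 12, size=50)

# conduct two-sample t-test
# (pass equal_var=False for Welch's t-test if the group variances differ)
t_statistic, p_value = ttest_ind(control_scores, experimental_scores)

# print results
print("t-statistic:", t_statistic)
print("p-value:", p_value)
```

With only two groups, a significant t-test already identifies which groups differ, so a post-hoc test adds no new information. Tukey's HSD on two groups repeats the same comparison and is shown only for completeness:

```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# conduct Tukey's HSD test on the pooled scores and their group labels
scores = np.concatenate([control_scores, experimental_scores])
labels = np.concatenate([np.repeat("control", len(control_scores)),
                         np.repeat("experimental", len(experimental_scores))])
tukey_results = pairwise_tukeyhsd(scores, labels)

# print results
print(tukey_results)
```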


## Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# create a DataFrame with mock sales data
# (assumption: the hard-coded lists in the original held only 59 of the 90 values needed,
# so seeded random data with arbitrary means stand in for real sales records)
rng = np.random.default_rng(0)
sales = pd.DataFrame({
    'day': [f'day{i}' for i in range(1, 31)] * 3,
    'store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30,
    'sales': np.concatenate([rng.normal(26, 4, 30),
                             rng.normal(30, 4, 30),
                             rng.normal(28, 4, 30)]),
})

# fit the repeated measures ANOVA: each day is the "subject" measured under all three
# stores, and store is the within-subject factor
rm_results = AnovaRM(data=sales, depvar='sales', subject='day', within=['store']).fit()

# print the ANOVA table
print(rm_results)

# perform Tukey's HSD test
# (note: Tukey's HSD treats the stores as independent groups; paired t-tests with a
# Bonferroni correction respect the repeated measures structure more closely)
tukey = pairwise_tukeyhsd(sales['sales'], sales['store'], alpha=0.05)

# print the results of the test
print(tukey.summary())
```
