In [None]:
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.


In [None]:
ANOVA (Analysis of Variance) rests on three assumptions: the residuals (the observations within each group, measured from their group mean) are approximately normally distributed, the group variances are equal (homogeneity of variance), and the observations are independent. Violating these assumptions can lead to incorrect conclusions. For example, outliers or strongly skewed distributions within a group violate normality; groups with clearly different spreads violate equal variances; and repeated measurements on the same subjects or time series data violate independence, because the data points are related to one another.
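
The normality and equal-variance assumptions can be checked before running the test. The following is a minimal sketch on synthetic data, assuming SciPy is available; the group names, sizes, and distributions are made up for illustration, and Shapiro-Wilk and Levene are only two of several possible checks.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Three hypothetical groups; group_c is given a larger spread on purpose
group_a = rng.normal(loc=10.0, scale=1.0, size=30)
group_b = rng.normal(loc=10.5, scale=1.0, size=30)
group_c = rng.normal(loc=11.0, scale=3.0, size=30)
groups = [group_a, group_b, group_c]

# Normality: Shapiro-Wilk test on the residuals (value minus its group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Equal variances: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk p-value: {shapiro_p:.4f}")
print(f"Levene p-value: {levene_p:.4f}")
```

A small p-value from either test suggests the corresponding assumption may not hold. Independence cannot be tested this way; it has to follow from the study design.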

In [None]:
Q2. What are the three types of ANOVA, and in what situations would each be used?


In [None]:
The three types of ANOVA are:

One-way ANOVA: used to compare the means of two or more groups (typically three or more) defined by a single independent variable.
Two-way ANOVA: used to test the effects of two independent variables on one outcome, covering the main effect of each variable and their interaction effect.
Repeated measures ANOVA: used to compare means when the same subjects are measured at different times or under different conditions, so the observations are not independent.

In [None]:
Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?


In [None]:
The partitioning of variance in ANOVA refers to decomposing the total variation in the data into two components: the variation between groups (explained by group membership) and the variation within groups (the unexplained, or random error, variation). In terms of sums of squares, the total sum of squares equals the between-group sum of squares plus the within-group sum of squares. This is important to understand because it lets us quantify how much variation is attributable to each source, and the F-statistic that tests significance is the ratio of the between-group mean square to the within-group mean square.
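
The decomposition can be verified by hand on a small example. This is a sketch with made-up numbers for three groups of three observations each; the variable names are chosen here for illustration.

```python
import numpy as np

# Hypothetical measurements for three groups
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 8.0, 9.0]),
    np.array([10.0, 11.0, 12.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation: every observation against the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()
# Between-group variation: group means against the grand mean, weighted by size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: observations against their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the two mean squares
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(ss_total, ss_between, ss_within)  # 60.0 54.0 6.0
print(f_stat)                           # 27.0
```

Here the between-group and within-group sums of squares (54 and 6) add up to the total (60), which is the partition the answer above describes.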

In [None]:
Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?


In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load data into a pandas DataFrame
data = pd.read_csv('data.csv')

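# Fit a one-way ANOVA model; C() treats 'group' as categorical
model = ols('outcome ~ C(group)', data).fit()

# Calculate the sums of squares, using the question's naming:
# SST = total, SSE = explained (between groups), SSR = residual (within groups)
SST = ((data['outcome'] - data['outcome'].mean()) ** 2).sum()
SSR = (model.resid ** 2).sum()  # residual sum of squares
SSE = SST - SSR                 # explained sum of squares

print("SST:", SST)
print("SSE (explained):", SSE)
print("SSR (residual):", SSR)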


In [None]:
Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?


In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load data into a pandas DataFrame
data = pd.read_csv('data.csv')

# Fit a two-way ANOVA model; C() marks each factor as categorical, and
# '*' expands to both main effects plus their interaction
model = ols('outcome ~ C(var1) * C(var2)', data).fit()

# The ANOVA table has one row per main effect and one for the interaction,
# each with its sum of squares, F-statistic, and p-value
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)


In [None]:
Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

In [None]:
If a one-way ANOVA yields an F-statistic of 5.23 and a p-value of 0.02, then at the conventional significance level of 0.05 we can conclude that there is a statistically significant difference between the groups. Specifically, we reject the null hypothesis that all group means are equal. The F-statistic is the ratio of the variation between groups to the variation within groups, so a value of 5.23 means the between-group variation is about five times the within-group variation. The p-value of 0.02 means that, if the null hypothesis were true, there would be only a 2% chance of observing an F-statistic at least this large. The result tells us that at least one group mean differs from the others, but not which ones; a post-hoc test is needed to identify the specific groups that differ.
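
The link between an F-statistic and its p-value can be sketched with SciPy's F distribution. The question does not state the degrees of freedom, so the values below (3 groups, 30 observations) are assumptions made for illustration, and the resulting p-value is not expected to reproduce 0.02 exactly.

```python
from scipy import stats

f_stat = 5.23
df_between = 2   # hypothetical: 3 groups - 1
df_within = 27   # hypothetical: 30 observations - 3 groups

# Survival function: probability of an F at least this large under the null
p_value = stats.f.sf(f_stat, df_between, df_within)

alpha = 0.05
reject_null = p_value < alpha

print(f"p-value: {p_value:.4f}")
print("Reject the null hypothesis:", reject_null)
```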

In [None]:
Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?


In [None]:
In a repeated measures ANOVA, handling missing data can be challenging because the standard analysis needs a complete set of measurements for every subject. One approach is to remove any participant with missing data from the analysis (listwise deletion). Another is to use imputation to fill in the missing values. The choice of method affects the validity of the results. Removing participants reduces the sample size and therefore statistical power, and it introduces bias if the data are not missing completely at random. Imputation keeps all participants, but it can also introduce bias if the imputed values do not represent the missing data well, and simple methods such as mean imputation understate the variability in the data, which can make effects look more significant than they are.
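
The two approaches can be compared on a toy data set. This is a minimal sketch with made-up values; the column names (`subject`, `t1`, `t2`, `t3`) are hypothetical, and mean imputation stands in for imputation methods in general.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame({
    "subject": [1, 2, 3, 4],
    "t1": [5.0, 6.0, 7.0, 8.0],
    "t2": [5.5, np.nan, 7.5, 8.5],
    "t3": [6.0, 6.5, np.nan, 9.0],
})
time_cols = ["t1", "t2", "t3"]

# Option 1: listwise deletion drops every subject with any missing value
complete_cases = wide.dropna()

# Option 2: mean imputation fills each gap with that time point's mean
imputed = wide.copy()
imputed[time_cols] = imputed[time_cols].fillna(imputed[time_cols].mean())

print("Subjects kept after deletion:", len(complete_cases))  # 2
print("Subjects kept after imputation:", len(imputed))       # 4
```

Deletion halves the sample here, while imputation keeps all four subjects at the cost of replacing real variability with a constant.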

In [None]:
Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.


In [None]:
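Some common post-hoc tests used after ANOVA include Tukey's HSD test, the Bonferroni correction, and Scheffe's method. Tukey's HSD is used to compare all pairs of group means when the variances are approximately equal; it was designed for equal sample sizes, and the Tukey-Kramer variant extends it to unequal ones. The Bonferroni correction is simple and general, but it becomes very conservative as the number of comparisons grows, so it suits a small number of planned comparisons. Scheffe's method is the most conservative of the three and is used when testing arbitrary contrasts (for example, the average of two groups against a third), not just pairwise differences. When the variances are clearly unequal, a test such as Games-Howell is more appropriate. A post-hoc test is necessary when the ANOVA indicates a significant difference between the groups but it is not clear which groups differ from each other, for example after finding a significant effect among three diets.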
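
As an illustration, Tukey's HSD can be run with statsmodels' `pairwise_tukeyhsd`. The data below are synthetic (groups A and B share a mean, group C is shifted), so the labels and values are assumptions, not results from a real study.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(42)

# Synthetic outcome values: A and B have the same mean, C is clearly higher
values = np.concatenate([
    rng.normal(10.0, 1.0, 20),
    rng.normal(10.0, 1.0, 20),
    rng.normal(15.0, 1.0, 20),
])
labels = np.repeat(["A", "B", "C"], 20)

# Compare every pair of groups while controlling the family-wise error rate
tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey.summary())
```

The summary lists one row per pair (A-B, A-C, B-C) with the mean difference, adjusted p-value, and whether the null hypothesis of equal means is rejected.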

In [None]:
Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.


In [None]:
import pandas as pd
from scipy.stats import f_oneway

data = pd.read_csv("weight_loss_data.csv")
groups = data.groupby("diet")["weight_loss"].apply(list)

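# f_oneway takes each group's observations as a separate argument
F, p = f_oneway(*groups)

print("F-statistic:", F)
print("p-value:", p)

# Interpret at the conventional 0.05 significance level
if p < 0.05:
    print("At least one diet's mean weight loss differs from the others.")
else:
    print("No significant difference in mean weight loss was detected.")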


In [None]:
Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [None]:
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

data = pd.read_csv("software_program_data.csv")

model = ols("completion_time ~ software_program * experience_level", data).fit()
table = anova_lm(model, typ=2)

print(table)


In [None]:
Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.


In [None]:
import pandas as pd
from scipy.stats import ttest_ind

# Load data into a dataframe
df = pd.read_csv('test_scores.csv')

# Split into control and experimental groups
control_group = df[df['teaching_method'] == 'control']['test_score']
experimental_group = df[df['teaching_method'] == 'experimental']['test_score']

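# Perform two-sample t-test
t_stat, p_val = ttest_ind(control_group, experimental_group)

print('T-statistic:', t_stat)
print('P-value:', p_val)

# With only two groups, a significant result already identifies which groups
# differ, so no post-hoc test is needed; the group means show the direction
print('Control mean:', control_group.mean())
print('Experimental mean:', experimental_group.mean())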


In [None]:
Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant
differences in sales between the three stores. If the results are significant, follow up with a post-hoc
test to determine which store(s) differ significantly from each other.

In [None]:
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Load data into a dataframe
df = pd.read_csv('sales_data.csv')

# Convert data to long format: one row per (day, store) pair
df_long = pd.melt(
    df,
    id_vars='day',
    value_vars=['store_A_sales', 'store_B_sales', 'store_C_sales'],
    var_name='store',
    value_name='sales',
)

# Conduct repeated measures ANOVA; each day acts as the "subject" that is
# measured once under every store condition
aovrm = AnovaRM(df_long, 'sales', 'day', within=['store'], aggregate_func='mean')
res = aovrm.fit()

print(res.anova_table)
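
If the repeated measures ANOVA is significant, one possible follow-up is a set of paired t-tests with a Bonferroni correction. The sketch below uses synthetic sales figures in place of `sales_data.csv`, keeping the same column names; the means and spreads are made up, and paired t-tests with Bonferroni are one choice of post-hoc procedure among several.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_rel

rng = np.random.default_rng(1)
n_days = 30

# Synthetic daily sales for the three stores (hypothetical values)
df = pd.DataFrame({
    "store_A_sales": rng.normal(100.0, 10.0, n_days),
    "store_B_sales": rng.normal(105.0, 10.0, n_days),
    "store_C_sales": rng.normal(120.0, 10.0, n_days),
})

# Paired t-test for every pair of stores, since the same days are measured
pairs = list(combinations(df.columns, 2))
results = {}
for first, second in pairs:
    t_stat, p_raw = ttest_rel(df[first], df[second])
    # Bonferroni: multiply by the number of comparisons, capped at 1
    p_adj = min(p_raw * len(pairs), 1.0)
    results[(first, second)] = (t_stat, p_adj)
    print(f"{first} vs {second}: t = {t_stat:.3f}, adjusted p = {p_adj:.4f}")
```

Pairs whose adjusted p-value falls below 0.05 are the stores whose average daily sales differ significantly.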
