## Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.


The assumptions required to use ANOVA are:

- Independence: The observations are independent of each other, both within and between groups. This means that the outcome of one observation does not affect the outcome of any other observation.

- Normality: The data within each group (equivalently, the model residuals) follow a normal distribution. Normality can be checked using a normal probability (Q-Q) plot, a histogram, or a formal test such as Shapiro-Wilk.

- Homogeneity of variance: The variance of the data is the same across all groups. Homogeneity of variance can be checked using Levene's test or by comparing the sample variances of the groups.

If these assumptions are not met, the results of the ANOVA may not be valid. Examples of violations that could impact the validity of the results include:

- Non-normality: If the data within each group do not follow a normal distribution, the p-value of the F-test may be inaccurate, particularly when samples are small. This can be caused by outliers, skewed distributions, or heavy tails.

- Non-independence: If the observations within a group are not independent, the standard errors are typically underestimated and the false-positive rate is inflated. This can occur when the same individual is measured multiple times or when there is clustering within groups.

- Heterogeneity of variance: If the variances of the groups are not equal, the F-test can reject the null hypothesis too often or too rarely. This occurs when the data in one group have a larger spread than the data in another group, and the distortion is most severe when the group sample sizes are also unequal.
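The normality and equal-variance checks described above can be sketched as follows. This is a minimal illustration on simulated data (the group means, spread, and sample sizes are assumptions, not values from any study), using `scipy.stats.shapiro` and `scipy.stats.levene`.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of 20 observations drawn from normal distributions
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, scale=2.0, size=20) for m in (10.0, 11.0, 12.0)]

# Normality: Shapiro-Wilk test per group (a small p-value suggests non-normality)
shapiro_p = [stats.shapiro(g)[1] for g in groups]

# Homogeneity of variance: Levene's test across all groups
# (a small p-value suggests unequal variances)
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-values:", np.round(shapiro_p, 3))
print("Levene p-value:", round(float(levene_p), 3))
```

Independence cannot be tested this way; it has to be judged from how the data were collected.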

## Q2. What are the three types of ANOVA, and in what situations would each be used?

The three types of ANOVA are:

- One-way ANOVA: used to compare the means of three or more independent groups defined by a single independent variable or factor.

- Two-way ANOVA: used to analyze the effects of two independent variables or factors on a dependent variable, including their interaction.

- Repeated measures ANOVA: used to compare means when the same subjects are measured on the dependent variable at three or more time points or conditions.

## Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA refers to the decomposition of the total variation in the data into different sources of variation. This is important to understand because it helps to identify the sources of variation that contribute to differences in means between groups, and to determine if the observed differences are statistically significant.
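For a one-way design with $k$ groups, $n_i$ observations in group $i$, group means $\bar{y}_i$, and grand mean $\bar{y}$, the decomposition can be written as follows (the notation is introduced here for illustration):

$$
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}\right)^2}_{SS_{\text{total}}}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{y}_i-\bar{y}\right)^2}_{SS_{\text{between}}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_i\right)^2}_{SS_{\text{within}}}
$$

With $N = \sum_i n_i$ total observations, the F-statistic is the ratio of the two mean squares:

$$
F = \frac{SS_{\text{between}} / (k-1)}{SS_{\text{within}} / (N-k)}
$$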

## Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

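To calculate the total sum of squares (SST) in a one-way ANOVA, we sum the squared differences between each observation and the grand mean. To calculate the explained (between-group) sum of squares, labeled SSE here, we sum the squared differences between each group mean and the grand mean, weighted by the number of observations in each group. To calculate the residual (within-group) sum of squares, labeled SSR here, we sum the squared differences between each observation and its corresponding group mean. The three quantities satisfy SST = SSE + SSR. Many textbooks swap these abbreviations, using SSE for the error (residual) sum of squares and SSR for the regression (explained) sum of squares, so check which convention a source follows. In Python, the f_oneway() function from the scipy.stats module returns only the F-statistic and p-value, so the sums of squares have to be computed directly (for example with NumPy) or read from the sum_sq column of a statsmodels anova_lm table.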
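A minimal sketch of these calculations with NumPy, assuming small hypothetical data for three groups. The variable names `ss_explained` and `ss_residual` follow the labels used in the question, and `f_oneway()` is called only to cross-check the F-statistic.

```python
import numpy as np
from scipy import stats

# Hypothetical data for three groups (illustrative values only)
groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),
    np.array([6.0, 7.0, 8.0, 7.0]),
    np.array([9.0, 8.0, 10.0, 9.0]),
]
all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()

# Total: every observation against the grand mean
sst = ((all_obs - grand_mean) ** 2).sum()
# Explained (between groups): group means against the grand mean, weighted by group size
ss_explained = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Residual (within groups): every observation against its own group mean
ss_residual = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F-statistic from the mean squares, cross-checked against scipy
k, n = len(groups), len(all_obs)
f_manual = (ss_explained / (k - 1)) / (ss_residual / (n - k))
f_scipy, p_value = stats.f_oneway(*groups)

print(sst, ss_explained, ss_residual)
print(f_manual, f_scipy, p_value)
```

For this data the explained and residual parts add up to the total, and the manually computed F matches the one reported by `f_oneway()`.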

## Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In a two-way ANOVA, the main effect of each independent variable is assessed by comparing the means of its levels, averaged over the levels of the other variable. The interaction effect is assessed by checking whether the difference in means between the levels of one variable changes across the levels of the other variable. In Python, we can fit the model with ols() from the statsmodels library and pass it to anova_lm() to obtain the sums of squares, F-statistics, and p-values for the main effects and the interaction.
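The means behind these effects can be inspected directly with pandas. The sketch below uses a hypothetical 2x2 design (factors `A` and `B`, response `y`, all values invented for illustration) and only computes descriptive means; the F-tests themselves come from the fitted model.

```python
import pandas as pd

# Hypothetical 2x2 design with two observations per cell
df = pd.DataFrame({
    "A": ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2"],
    "B": ["b1", "b1", "b2", "b2", "b1", "b1", "b2", "b2"],
    "y": [1.0, 3.0, 3.0, 5.0, 5.0, 7.0, 11.0, 13.0],
})

# Mean of y in each A x B cell
cell_means = df.pivot_table(index="A", columns="B", values="y", aggfunc="mean")

# Main effects: means of each level, averaged over the other factor
marginal_a = cell_means.mean(axis=1)
marginal_b = cell_means.mean(axis=0)

# Interaction: does the b2 - b1 difference change between a1 and a2?
simple_effect_b = cell_means["b2"] - cell_means["b1"]
interaction_contrast = simple_effect_b["a2"] - simple_effect_b["a1"]

print(cell_means)
print(marginal_a, marginal_b, interaction_contrast, sep="\n")
```

A nonzero interaction contrast in the sample means is only descriptive; whether it is statistically significant is what the interaction row of the ANOVA table tests.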

## Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

An F-statistic of 5.23 with a p-value of 0.02 is statistically significant at the conventional 0.05 level, so we reject the null hypothesis that all group means are equal. If the null hypothesis were true, an F-statistic this large or larger would be expected only about 2% of the time. We would interpret this as evidence that at least one group mean differs from the others on the dependent variable. The F-test alone does not show which groups differ; a post-hoc test is needed for that.

## Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

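In a repeated measures ANOVA, missing data can be handled using methods such as listwise deletion, pairwise deletion, or imputation. Each method can introduce bias and affect the validity of the results. Listwise deletion drops every subject with any missing measurement, and pairwise deletion drops a subject only from the comparisons that involve the missing value; both reduce the effective sample size and power, and both can bias the estimates unless the data are missing completely at random. Imputation keeps all subjects, but simple approaches such as mean imputation understate the variability in the data and can also introduce bias when the data are not missing completely at random. It is important to carefully consider the nature and amount of missing data, and to choose a method that is appropriate for the study design and research question.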
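The difference between two of these approaches can be seen on a small wide-format table. The sketch below uses hypothetical data (four subjects `s1` to `s4`, three time points `t1` to `t3`) and shows listwise deletion and simple mean imputation with pandas; it is an illustration of the mechanics, not a recommendation of either method.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame(
    {
        "t1": [5.0, 6.0, 7.0, 8.0],
        "t2": [6.0, np.nan, 8.0, 9.0],
        "t3": [7.0, 8.0, np.nan, 10.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Listwise deletion: drop every subject with any missing time point
listwise = wide.dropna()

# Simple mean imputation: fill each gap with that time point's observed mean
imputed = wide.fillna(wide.mean())

print(f"Subjects kept after listwise deletion: {len(listwise)} of {len(wide)}")
print(imputed)
```

Here listwise deletion halves the sample, while mean imputation keeps every subject but pulls the filled values toward the column mean.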

## Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Some common post-hoc tests used after ANOVA include Tukey's HSD, Bonferroni, and Scheffé's tests. Tukey's HSD is used to compare all pairs of groups and determine which pairs differ significantly from each other. The Bonferroni correction is suited to a small number of planned comparisons, and Scheffé's test is suited to more complex contrasts; both are more conservative than Tukey's HSD. Post-hoc tests are necessary when the ANOVA indicates that there are significant differences between the groups but does not identify which specific groups differ. For example, if a one-way ANOVA comparing three diets is significant, Tukey's HSD can show whether the difference lies between diets A and B, A and C, or B and C.
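A minimal sketch of Tukey's HSD using `pairwise_tukeyhsd` from statsmodels. The data are simulated for illustration (groups A and B share a mean, group C is shifted), so the group labels, means, and sample sizes are assumptions.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical data: A and B have the same mean, C is clearly higher
rng = np.random.default_rng(42)
scores = np.concatenate([
    rng.normal(10.0, 1.0, 20),  # group A
    rng.normal(10.0, 1.0, 20),  # group B
    rng.normal(15.0, 1.0, 20),  # group C
])
labels = np.repeat(["A", "B", "C"], 20)

# Compare every pair of groups while controlling the family-wise error rate
tukey = pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05)
print(tukey.summary())
```

The summary lists one row per pair (A-B, A-C, B-C) with the mean difference, an adjusted p-value, and a reject flag.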

## Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

To conduct a one-way ANOVA in Python, we can use the f_oneway() function from the scipy.stats module. Running this function with the weight loss data would give us the F-statistic and p-value to test for significant differences between the mean weight loss of the three diets.
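No data are given for this question, so the sketch below simulates weight loss for 50 participants split across the three diets. The group sizes (17, 17, 16), means, and spread are assumptions made only so the call to `f_oneway()` can be shown end to end.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical weight-loss data (kg) for 50 participants across three diets
rng = np.random.default_rng(1)
diet_a = rng.normal(3.0, 1.5, 17)
diet_b = rng.normal(4.0, 1.5, 17)
diet_c = rng.normal(5.0, 1.5, 16)

# One-way ANOVA: tests whether all three diet means are equal
f_stat, p_value = f_oneway(diet_a, diet_b, diet_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# Interpret at the conventional 0.05 significance level
if p_value < 0.05:
    print("At least one diet's mean weight loss differs from the others.")
else:
    print("No significant difference in mean weight loss was detected.")
```

With real data, the three arrays would hold the observed weight loss for each diet, and a significant result would be followed by a post-hoc test to find which diets differ.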

## Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a sample dataset: 10 employees per program,
# 6 novice and 4 experienced within each program
experience_pattern = ['Novice'] * 3 + ['Experienced'] * 3 + ['Novice'] * 3 + ['Experienced']
data = pd.DataFrame({
    'Time': [6.1, 4.8, 5.6, 5.9, 6.2, 5.5, 6.3, 5.8, 6.0, 5.3,
             5.7, 6.0, 6.1, 5.4, 6.2, 5.9, 6.4, 5.6, 6.1, 5.5,
             6.0, 5.3, 6.3, 5.8, 6.1, 5.4, 6.2, 5.9, 6.4, 5.6],
    'Program': ['A'] * 10 + ['B'] * 10 + ['C'] * 10,
    'Experience': experience_pattern * 3,
})

# Fit the two-way ANOVA model; `Program * Experience` expands to both
# main effects plus the Program:Experience interaction
model = ols('Time ~ Program * Experience', data=data).fit()
aov_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(aov_table)
```


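```
                      sum_sq    df         F    PR(>F)
Program             0.140667   2.0  0.444796  0.646133
Experience          0.256889   1.0  1.624594  0.214655
Program:Experience  0.082111   2.0  0.259640  0.773468
Residual            3.795000  24.0       NaN       NaN
```

None of the effects is significant at the 0.05 level: the main effect of Program (F = 0.44, p = 0.646), the main effect of Experience (F = 1.62, p = 0.215), and the Program:Experience interaction (F = 0.26, p = 0.773) all have p-values well above 0.05. For this sample data there is no evidence that completion time depends on the software program, on experience level, or on their combination.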


## Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

```python
import numpy as np
from scipy.stats import ttest_ind

# Create two sample datasets
control_scores = np.array([78, 82, 80, 85, 76, 79, 83, 77, 81, 79])
experimental_scores = np.array([85, 89, 82, 91, 88, 84, 87, 86, 83, 90])

# Conduct the t-test
t, p = ttest_ind(control_scores, experimental_scores)

# Print the results
print('t-statistic =', t)
print('p-value =', p)
```


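```
t-statistic = -4.993438317382943
p-value = 9.416948323170621e-05
```

The p-value is far below 0.05, so the difference between the two groups is statistically significant. The negative t-statistic reflects that the control mean (80.0) is lower than the experimental mean (86.5), so in this sample data the new teaching method is associated with higher scores. Because only two groups are compared, the t-test already identifies which groups differ and no post-hoc test is needed.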


## Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

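```python
import pandas as pd
import numpy as np
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

n_days = 30
stores = ['A', 'B', 'C']

# Sample sales values. The raw list has 93 entries; passing it whole raises
# "ValueError: All arrays must be of the same length", so only the first
# 90 (30 days x 3 stores) are used below.
sales = [10, 12, 9, 11, 14, 13, 8, 10, 12, 13, 15, 16, 7, 9, 11, 8, 11, 12, 10, 12, 13, 9, 10, 11, 12, 14, 15, 10, 11, 12,
         12, 14, 16, 9, 11, 10, 10, 13, 12, 8, 9, 10, 11, 14, 13, 11, 12, 13, 14, 17, 15, 10, 11, 9, 12, 13, 14, 9, 8, 11, 10,
         12, 11, 8, 9, 10, 11, 13, 13, 7, 8, 9, 11, 12, 11, 10, 11, 11, 12, 15, 14, 8, 10, 9, 11, 12, 13, 8, 9, 11, 11, 13, 12]

# create a sample data frame in long format: one row per (day, store) pair
data = pd.DataFrame({
    'Day': np.repeat(range(1, n_days + 1), len(stores)),
    'Store': np.tile(stores, n_days),
    'Sales': sales[:n_days * len(stores)],
})

# perform repeated measures ANOVA: each day is the repeated "subject"
# and Store is the within-subject factor
rm = AnovaRM(data, depvar='Sales', subject='Day', within=['Store'])
res = rm.fit()

# print ANOVA table
print(res.anova_table)

# perform post-hoc test (Tukey's HSD treats the stores as independent groups,
# so it ignores the pairing of observations by day)
posthoc = pairwise_tukeyhsd(data['Sales'], data['Store'])
print(posthoc.summary())
```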