Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact 
the validity of the results.
Ans.  **Assumptions of ANOVA:**

1. **Independence:** Observations within each group must be independent of one another. This means that the value of one observation should not influence the value of another observation in the same group.

2. **Homogeneity of Variances:** The variances of the groups being compared must be equal. This assumption is also known as homoscedasticity.

3. **Normal Distribution:** The residuals (equivalently, the data within each group) should be approximately normally distributed. ANOVA is fairly robust to moderate departures from normality when the sample sizes are large and roughly equal.

4. **Additivity:** The effects of the different factors being studied should be additive. This means that the total effect of the factors is equal to the sum of the individual effects of each factor.

**Examples of Violations that could Impact the Validity of the Results:**

1. **Lack of Independence:** If the observations within each group are not independent, the within-group variability is usually underestimated, which inflates the F-statistic and the Type I error rate. For example, if participants in a study are asked to rate the attractiveness of a series of faces, and the faces are presented in a random order, then the observations can reasonably be treated as independent. However, if the faces are presented in blocks, with all the attractive faces together and all the unattractive faces together, then the observations are not independent, because the rating given to one face may influence the rating given to the next face. Repeated measurements on the same participant are another common source of dependence.

2. **Heterogeneity of Variances:** If the variances of the groups being compared are not equal, the F-test no longer holds its nominal Type I error rate, especially when the group sizes are also unequal. For example, if a small group has a much larger variance than the larger groups, the test becomes too liberal (too many false positives); if the largest group has the largest variance, the test becomes too conservative. Welch's ANOVA is a common alternative in this situation.

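3. **Non-Normal Distribution:** If the data in each group are not normally distributed, the p-value of the ANOVA may be inaccurate, because the F-distribution used to test significance is derived under the assumption of normally distributed errors. For example, reaction times are often strongly right-skewed, and with small samples a few extreme values can distort the result. With large samples the effect is usually small.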

4. **Non-Additivity:** If the effects of the different factors being studied are not additive, then a model that contains only main effects is misspecified. For example, if there is an interaction between two factors, then the total effect of the two factors is not equal to the sum of the individual effects of each factor, and the unmodelled interaction ends up in the error term, which can distort the tests of the main effects.
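Two of the assumptions above (normality and equal variances) can be checked numerically before running the ANOVA. The following is a minimal sketch, assuming simulated data and a hypothetical helper named `check_anova_assumptions`; it uses the Shapiro-Wilk test per group and Levene's test across groups, which is one common choice rather than the only one.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of 30 simulated observations each
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, scale=2.0, size=30) for m in (10.0, 11.0, 12.0)]

def check_anova_assumptions(groups, alpha=0.05):
    """Run Shapiro-Wilk on each group and Levene's test across groups."""
    normality_p = [stats.shapiro(g).pvalue for g in groups]
    levene_p = stats.levene(*groups).pvalue
    return {
        "normality_p": normality_p,            # one p-value per group
        "levene_p": levene_p,                  # equal-variance test
        "normality_ok": all(p > alpha for p in normality_p),
        "equal_variance_ok": levene_p > alpha,
    }

report = check_anova_assumptions(groups)
print(report)
```

A p-value above `alpha` only means that no violation was detected; independence cannot be tested this way and has to be justified by the study design.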

Q2. What are the three types of ANOVA, and in what situations would each be used?
Ans. **Three types of ANOVA:**

1. **One-way ANOVA:** This type of ANOVA tests the effect of one categorical factor (independent variable) on a single continuous dependent variable by comparing the means of its groups. For example, a researcher might use a one-way ANOVA to compare the mean math test scores of students taught with three different teaching methods.

2. **Two-way ANOVA:** This type of ANOVA tests the effects of two factors on a single dependent variable, including whether the two factors interact. For example, a researcher might use a two-way ANOVA to examine how teaching method and gender affect math test scores.

3. **Three-way ANOVA:** This type of ANOVA tests the effects of three factors on a single dependent variable, together with their two-way and three-way interactions. It is a special case of N-way (factorial) ANOVA. For example, a researcher might use a three-way ANOVA to examine how teaching method, gender, and school type affect math test scores.

**Situations in which each type of ANOVA would be used:**

1. **One-way ANOVA:** Use a one-way ANOVA when there is one grouping factor and the question is whether the group means differ. For example, comparing the mean math scores of three groups of students shows whether at least one group differs from the others.

2. **Two-way ANOVA:** Use a two-way ANOVA when two factors are studied at once and it matters whether the effect of one factor depends on the level of the other. For example, a teaching method might improve math scores for one gender more than for the other; this shows up as an interaction effect.

3. **Three-way ANOVA:** Use a three-way ANOVA when three factors are studied at once. For example, a researcher might test whether the effect of teaching method on math scores depends on both gender and school type. Because the number of interaction terms grows quickly, larger samples are needed and the results are harder to interpret.


Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?
Ans. Partitioning of variance is a statistical technique that is used to determine the amount of variance in a data set that is due to different sources of variation. In ANOVA, the total variance in the data is partitioned into two components:

1. **Within-Groups Variance:** This component represents the variance that is due to individual differences within each group. For example, if we are comparing the mean scores of three different groups of students on a math test, the within-groups variance would represent the variance in scores within each group.

2. **Between-Groups Variance:** This component represents the variance that is due to differences between the groups. For example, if we are comparing the mean scores of three different groups of students on a math test, the between-groups variance would represent the variance in mean scores between the three groups.

**Importance of Understanding the Partitioning of Variance:**

Understanding the partitioning of variance is important for two reasons:

1. **Determining the Significance of the ANOVA Results:** The F-statistic, which is used to test the significance of the ANOVA results, is calculated by dividing the between-groups mean square by the within-groups mean square. Each mean square is the corresponding sum of squares divided by its degrees of freedom. A large F-statistic with a small p-value indicates that at least one group mean differs from the others.

2. **Estimating the Effect Size:** The effect size is a measure of the magnitude of the difference between the groups. A common measure, eta squared, is calculated by dividing the between-groups sum of squares by the total sum of squares, so it is the proportion of the total variation explained by group membership. The larger the effect size, the greater the difference between the groups.
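The partition described above can be written out explicitly. The notation here is introduced for illustration and does not appear in the original answer: $k$ groups, $n_j$ observations in group $j$, $N$ observations in total, $x_{ij}$ for observation $i$ in group $j$, $\bar{x}_j$ for the group means, and $\bar{x}$ for the grand mean.

$$
\begin{aligned}
SS_{\text{total}} &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x})^2 \\
SS_{\text{between}} &= \sum_{j=1}^{k} n_j \, (\bar{x}_j - \bar{x})^2 \\
SS_{\text{within}} &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2 \\
SS_{\text{total}} &= SS_{\text{between}} + SS_{\text{within}} \\
F &= \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}, \qquad
\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}}
\end{aligned}
$$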


Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual 
sum of squares (SSR) in a one-way ANOVA using Python?
Ans. 

In [1]:
import numpy as np
from scipy import stats

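# One-way ANOVA example data: three groups (rows) of five observations each
data = np.array([
    [10, 12, 14, 16, 18],
    [11, 13, 15, 17, 19],
    [12, 14, 16, 18, 20],
])

grand_mean = data.mean()
group_means = data.mean(axis=1)
n_per_group = data.shape[1]

# Total sum of squares (SST): every observation against the grand mean
sst = np.sum((data - grand_mean) ** 2)

# Explained (between-groups) sum of squares (SSE): each group mean against
# the grand mean, weighted by the number of observations in the group
sse = n_per_group * np.sum((group_means - grand_mean) ** 2)

# Residual (within-groups) sum of squares (SSR): every observation against
# its own group mean; this equals sst - sse
ssr = np.sum((data - group_means[:, None]) ** 2)

# Print the results
print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)

Total Sum of Squares (SST): 130.0
Explained Sum of Squares (SSE): 10.0
Residual Sum of Squares (SSR): 120.0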
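As a cross-check, the F-statistic can be rebuilt from the sums of squares and compared with `scipy.stats.f_oneway`. This is a small sketch on the same three-group example data; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Same example data: three groups (rows) of five observations each
data = np.array([
    [10, 12, 14, 16, 18],
    [11, 13, 15, 17, 19],
    [12, 14, 16, 18, 20],
])
k, n = data.shape          # number of groups, observations per group
N = data.size              # total number of observations

grand_mean = data.mean()
group_means = data.mean(axis=1)

ss_between = n * np.sum((group_means - grand_mean) ** 2)
ss_within = np.sum((data - group_means[:, None]) ** 2)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (N - k))

# Reference value from SciPy (each row is passed as one group)
f_scipy, p_scipy = stats.f_oneway(*data)

print("F (manual):", f_manual)
print("F (scipy): ", f_scipy)
print("p-value:   ", p_scipy)
```

The two F values should agree up to floating-point error, which confirms that the between-groups sum of squares must be weighted by the group size.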


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?
Ans. 

In [None]:
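import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Two-way ANOVA example data: axis 0 = Factor A (3 levels),
# axis 1 = Factor B (2 levels), axis 2 = replicates (3 per cell)
data = np.array([
    [[10, 12, 14],
     [11, 13, 15]],
    [[12, 14, 16],
     [13, 15, 17]],
    [[14, 16, 18],
     [15, 17, 19]]
])

# Reshape to long format: one row per observation with its factor levels
n_a, n_b, n_rep = data.shape
df = pd.DataFrame({
    "A": np.repeat(np.arange(n_a), n_b * n_rep),
    "B": np.tile(np.repeat(np.arange(n_b), n_rep), n_a),
    "y": data.ravel(),
})

# Fit a linear model with both main effects and their interaction,
# then build the ANOVA table (Type II sums of squares)
model = ols("y ~ C(A) * C(B)", data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the results: each row holds the sum of squares, F and p-value of one effect
print(anova_table)
print("Main effect of Factor A: F =", anova_table.loc["C(A)", "F"])
print("Main effect of Factor B: F =", anova_table.loc["C(B)", "F"])
print("Interaction effect: F =", anova_table.loc["C(A):C(B)", "F"])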

Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. 
What can you conclude about the differences between the groups, and how would you interpret these 
results?
Ans.  **Conclusion about the Differences Between the Groups:**

- An F-statistic of 5.23 means that the variation between the group means is about 5.2 times as large as the variation expected from within-group noise alone. On its own the F-statistic does not establish significance; that depends on the degrees of freedom and is summarized by the p-value.

- The p-value of 0.02 is less than the commonly used significance level of 0.05, so the null hypothesis that all group means are equal is rejected. The p-value means that, if all group means were truly equal, an F-statistic of 5.23 or larger would be expected only about 2% of the time. It is not the probability that the null hypothesis is true.

**Interpretation of the Results:**

- The results provide evidence that at least one of the group means differs from the others. However, the analysis does not indicate which specific groups are different from each other.

- To determine which specific groups differ, pairwise comparisons (such as post-hoc tests) would need to be conducted.

- The effect size (e.g., eta squared) could also be calculated to provide an estimate of the magnitude of the differences between the group means, since statistical significance alone says nothing about practical importance.
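The link between the F-statistic, the p-value, and eta squared can be sketched with the F-distribution in SciPy. The question does not state the degrees of freedom, so the values used below (2 and 12, i.e. three groups and fifteen observations) are purely an assumption for illustration, and `f_test_summary` is a hypothetical helper name.

```python
from scipy import stats

def f_test_summary(f_stat, df_between, df_within):
    """Return the upper-tail p-value and eta squared implied by a one-way F-test."""
    # Probability of an F at least this large if all group means are equal
    p_value = stats.f.sf(f_stat, df_between, df_within)
    # Eta squared recovered from F and the degrees of freedom
    eta_squared = (f_stat * df_between) / (f_stat * df_between + df_within)
    return p_value, eta_squared

# Assumed degrees of freedom: 3 groups, 15 observations in total
p_value, eta_squared = f_test_summary(5.23, df_between=2, df_within=12)
print("p-value:", p_value)
print("eta squared:", eta_squared)
```

With different degrees of freedom the same F-statistic gives a different p-value, which is why both are normally reported together.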

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential 
consequences of using different methods to handle missing data?
Ans. The following methods can be used to handle missing data in a repeated measures ANOVA:
1. **Listwise Deletion**: 
- This method involves excluding any case that has missing data on any of the variables included in the analysis.
- **Consequences**: 
    - Reduced sample size, which can lead to reduced statistical power.
    - Potential bias unless the data are missing completely at random (MCAR). For example, if participants who drop out of a study are more likely to have lower scores on the outcome variable, then listwise deletion would result in a biased estimate of the population mean.

2. **Pairwise Deletion**: 
- This method uses, for each pair of variables (for example, each pair of time points), every case that has data on both, instead of discarding the whole case.
- **Consequences**:
    - Can lead to biased estimates unless the data are missing completely at random (MCAR).
    - Can result in different sample sizes for different pairs of variables, which makes the results hard to compare and can produce inconsistent covariance estimates.

3. **Multiple Imputation**:
- This method involves imputing missing values multiple times (e.g., 5 or 10 times) using a statistical method such as regression imputation or predictive mean matching.
- **Consequences**: 
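    - Can produce more accurate estimates than listwise or pairwise deletion when the data are missing at random (MAR). If the data are missing not at random (MNAR), the estimates can still be biased unless the missingness mechanism is modelled.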
    - Can also provide information about the uncertainty of the imputed values.

4. **Maximum Likelihood Estimation (MLE)**: 
- This method estimates the model parameters directly from the observed data by maximizing the likelihood, instead of filling in the missing values.
- **Consequences**: 
    - Can produce unbiased estimates when the data are missing at random (MAR), as long as the model is correctly specified.
    - However, MLE usually assumes normally distributed data, so it can be sensitive to outliers and heavily skewed outcomes.

5. **Full Information Maximum Likelihood (FIML)**: 
- This method uses all of the available observations from every case, including cases with some missing values, to estimate the model parameters.
- **Consequences**: 
    - Can produce more accurate estimates than the deletion methods when the data are missing at random (MAR).
    - However, FIML can be computationally intensive for large or complex models.

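In general, multiple imputation and FIML are preferred methods for handling missing data because they use all of the available information and produce unbiased estimates when the data are missing at random (MAR). No standard method corrects for data that are missing not at random (MNAR) without explicit assumptions about why the values are missing, so a sensitivity analysis is advisable in that case. These methods are also more computationally intensive than deletion. In practice, repeated measures designs with missing observations are often analyzed with a linear mixed-effects model, which is fitted by maximum likelihood on the incomplete data.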
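The difference between listwise deletion and a likelihood-based analysis can be illustrated on simulated data. This is a rough sketch, assuming made-up scores for 20 subjects at three time points with a few values removed at arbitrary positions; the mixed model from `statsmodels` stands in for the maximum likelihood approach and is not the only possible choice.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_subjects, n_times = 20, 3

# Simulated wide-format scores: subject baseline + time trend + noise
baseline = rng.normal(50, 5, size=(n_subjects, 1))
scores = baseline + 2.0 * np.arange(n_times) + rng.normal(0, 2, size=(n_subjects, n_times))
wide = pd.DataFrame(scores, columns=["t1", "t2", "t3"])
wide.insert(0, "subject", np.arange(n_subjects))

# Remove a few measurements to mimic dropout (positions chosen arbitrarily)
wide.loc[[2, 5, 11], "t3"] = np.nan
wide.loc[[7], "t2"] = np.nan

# Listwise deletion: any subject with a missing value is discarded entirely
complete = wide.dropna()

# Long format keeps every observed measurement, including partial subjects
long = (
    wide.melt(id_vars="subject", var_name="time", value_name="score")
    .dropna()
    .reset_index(drop=True)
)

# Random-intercept mixed model fitted on all observed rows
model = smf.mixedlm("score ~ C(time)", data=long, groups="subject").fit()

print("Subjects kept by listwise deletion:", len(complete), "of", n_subjects)
print("Observations used by the mixed model:", len(long))
print(model.params)
```

Listwise deletion throws away the observed scores of every subject with a gap, while the mixed model keeps them.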

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide 
an example of a situation where a post-hoc test might be necessary.
Ans. 1. **Tukey's HSD (Honestly Significant Difference) Test**:

   - **When to Use**: When you want to compare all possible pairs of means while controlling the familywise error rate. It is designed for equal sample sizes; the Tukey-Kramer variant handles unequal sample sizes.
   - **Example**: A researcher wants to compare the average test scores of four different teaching methods. They conduct an ANOVA and find a significant difference between the means. To determine which teaching methods are significantly different from each other, they use Tukey's HSD test.

2. **Scheffé's Test**:

   - **When to Use**: When you want to test any contrasts among the means, including complex ones such as one group against the average of two others. It works with equal or unequal sample sizes but is more conservative than Tukey's HSD for simple pairwise comparisons.
   - **Example**: A researcher wants to compare the average weight loss of three different diet plans. They conduct an ANOVA and find a significant difference between the means. To test whether one diet plan differs from the average of the other two, they use Scheffé's test.

3. **Bonferroni Correction**:

   - **When to Use**: When you have multiple comparisons and you want to control the overall Type I error rate.
   - **Example**: A researcher wants to compare the effectiveness of four different drugs in reducing blood pressure. They conduct an ANOVA and find a significant difference between the means. To determine which drugs are significantly more effective than the others, they use the Bonferroni correction.

4. **Dunnett's Test**:

   - **When to Use**: When you have a control group and you want to compare each treatment group to the control group.
   - **Example**: A researcher wants to compare the effectiveness of three different treatments for depression against a control group. They conduct an ANOVA and find a significant difference between the means. To determine which treatments are significantly more effective than the control group, they use Dunnett's test.

5. **Newman-Keuls Test**:

   - **When to Use**: When you want to compare all possible pairs of means with more statistical power than Tukey's HSD. It is a stepwise procedure that tests the largest differences first. The price is weaker control of the familywise Type I error rate, so it is less often recommended today.
   - **Example**: A researcher wants to compare the average test scores of four different teaching methods. They conduct an ANOVA and find a significant difference between the means. Because this is an exploratory study in which missing a real difference is the main concern, they use the Newman-Keuls test to determine which teaching methods differ from each other.
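As an illustration of the first test in the list, Tukey's HSD can be run with `pairwise_tukeyhsd` from `statsmodels`. The sketch below assumes simulated test scores for four teaching methods (labels, means, and group sizes are invented for the example).

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)

# Simulated scores: 25 students per teaching method (hypothetical means)
methods = ["A", "B", "C", "D"]
scores = np.concatenate(
    [rng.normal(loc=m, scale=5.0, size=25) for m in (70, 72, 75, 80)]
)
labels = np.repeat(methods, 25)

# All pairwise comparisons with the familywise error rate held at 5%
result = pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05)

# One row per pair: mean difference, adjusted p-value, confidence interval, reject flag
print(result.summary())
```

With four groups the table contains six pairwise comparisons; the `reject` column marks the pairs whose means differ at the chosen level.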

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python 
to determine if there are any significant differences between the mean weight loss of the three diets. 
Report the F-statistic and p-value, and interpret the results.

In [4]:
import pandas as pd
import numpy as np
from scipy import stats

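# Load the data (wide format assumed: one column of weight-loss values per diet).
# NOTE: the run recorded for this cell raised KeyError: 'Weight Loss Diet A', so the
# column names below do not match the header of diets.csv; adjust them to the file.
data = pd.read_csv('diets.csv')
diet_columns = ['Weight Loss Diet A', 'Weight Loss Diet B', 'Weight Loss Diet C']

# Run a one-way ANOVA, dropping the empty cells that arise when group sizes differ
result = stats.f_oneway(*(data[col].dropna() for col in diet_columns))

# Print the F-statistic and p-value
print('F-statistic:', result.statistic)
print('p-value:', result.pvalue)

# Interpret the results
if result.pvalue < 0.05:
    print('At least one diet has a mean weight loss that differs from the others.')
else:
    print('There is no significant evidence of a difference in mean weight loss between the three diets.')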
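Because the contents of `diets.csv` are not shown, the same analysis can be sketched on simulated data. Everything below is an assumption made for illustration: the long-format column names `diet` and `weight_loss`, the group means, and the random assignment of 50 participants. The numbers it prints are not results for the researcher's data.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(7)

# Simulate 50 participants randomly assigned to one of three diets
diets = rng.choice(["A", "B", "C"], size=50)
assumed_means = {"A": 2.0, "B": 3.0, "C": 4.5}   # hypothetical mean weight loss (kg)
weight_loss = np.array([rng.normal(assumed_means[d], 1.5) for d in diets])
df = pd.DataFrame({"diet": diets, "weight_loss": weight_loss})

# One array of weight-loss values per diet; group sizes may differ
samples = [group["weight_loss"].to_numpy() for _, group in df.groupby("diet")]

# One-way ANOVA across the three diets
f_stat, p_value = stats.f_oneway(*samples)

print("Group sizes:", [len(s) for s in samples])
print("F-statistic:", f_stat)
print("p-value:", p_value)
```

The long format (one row per participant) avoids the empty cells that a wide layout needs when the groups have unequal sizes.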

Q10. A company wants to know if there are any significant differences in the average time it takes to 
complete a task using three different software programs: Program A, Program B, and Program C. They 
randomly assign 30 employees to one of the programs and record the time it takes each employee to 
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or 
interaction effects between the software programs and employee experience level (novice vs. 
experienced). Report the F-statistics and p-values, and interpret the results.

In [None]:
# Import the necessary libraries
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load the data (the column names `time`, `software_program` and
# `experience_level` are assumed; adjust them to match the CSV header)
data = pd.read_csv('task_completion_times.csv')

# Fit a two-way ANOVA model with both main effects and their interaction
model = ols('time ~ C(software_program) * C(experience_level)', data=data).fit()

# Type II ANOVA table: one row per effect, with its F-statistic and p-value
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

# Interpret each effect from its own row of the ANOVA table
effects = {
    'C(software_program)': 'main effect of software program',
    'C(experience_level)': 'main effect of experience level',
    'C(software_program):C(experience_level)': 'interaction between software program and experience level',
}
for term, label in effects.items():
    f_stat = anova_table.loc[term, 'F']
    p_value = anova_table.loc[term, 'PR(>F)']
    print(f"\n{label.capitalize()}:")
    print("F-statistic:", f_stat)
    print("p-value:", p_value)
    if p_value < 0.05:
        print(f"The {label} on task completion time is statistically significant.")
    else:
        print(f"There is no significant {label} on task completion time.")

Q11. An educational researcher is interested in whether a new teaching method improves student test 
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the 
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a 
two-sample t-test using Python to determine if there are any significant differences in test scores 
between the two groups. If the results are significant, follow up with a post-hoc test to determine which 
group(s) differ significantly from each other.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from scipy import stats

# Load the data
data = pd.read_csv('test_scores.csv')

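# Check for normality and equal variances
print(stats.normaltest(data.control))
print(stats.normaltest(data.experimental))
levene = stats.levene(data.control, data.experimental)
print(levene)

# Conduct a two-sample t-test; switch to Welch's t-test if Levene's test
# suggests that the variances are unequal
t_test = stats.ttest_ind(data.control, data.experimental, equal_var=levene.pvalue >= 0.05)

# Print the results of the t-test
print("t-value:", t_test.statistic)
print("p-value:", t_test.pvalue)

# With only two groups, a significant t-test already identifies the pair that
# differs, so a post-hoc test adds nothing. Report the direction and size of
# the difference instead (pooled SD below assumes equal group sizes).
if t_test.pvalue < 0.05:
    mean_diff = data.experimental.mean() - data.control.mean()
    pooled_sd = np.sqrt((data.control.var() + data.experimental.var()) / 2)
    print("Mean difference (experimental - control):", mean_diff)
    print("Cohen's d:", mean_diff / pooled_sd)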

Q12. A researcher wants to know if there are any significant differences in the average daily sales of three 
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any 
significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [None]:
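import pandas as pd
from itertools import combinations
from scipy import stats
from statsmodels.stats.anova import AnovaRM

# Load the data in long format: one row per day and store
# (the column names `day`, `store` and `sales` are assumed; adjust to the CSV)
data = pd.read_csv('sales_data.csv')

# Repeated measures ANOVA: each day is the "subject" measured at all three stores
result = AnovaRM(data, depvar='sales', subject='day', within=['store']).fit()

# Print the ANOVA table
print(result.anova_table)

# Check if the results are significant
p_value = result.anova_table.loc['store', 'Pr > F']
if p_value < 0.05:
    # Post-hoc: paired t-tests between stores on the same days, with a
    # Bonferroni correction for the number of comparisons
    wide = data.pivot(index='day', columns='store', values='sales')
    pairs = list(combinations(wide.columns, 2))
    for a, b in pairs:
        t_stat, p = stats.ttest_rel(wide[a], wide[b])
        p_adj = min(p * len(pairs), 1.0)
        print(f"{a} vs {b}: t = {t_stat:.3f}, Bonferroni-adjusted p = {p_adj:.4f}")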