# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

ANOVA relies on three main assumptions:

Homogeneity of Variance: The variance of the dependent variable is roughly equal across groups.

Violation Example: One group has a much larger variance in test scores than the others.

Independence of Observations: Each observation is unrelated to every other observation.

Violation Example: Repeated measurements taken on the same individuals are correlated, so they are not independent.

Normality of Residuals: Residuals (differences between observed and predicted values, where the predicted value in a one-way design is the group mean) follow an approximately normal distribution.

Violation Example: Residuals are strongly skewed or heavy-tailed.

Violations of these assumptions can distort the F-statistic and its p-value, either inflating the Type I error rate or reducing the power of the test.
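
Two of these assumptions can be checked numerically. The sketch below is one possible approach, not the only one: it uses simulated (hypothetical) test scores, Levene's test for equal variances, and the Shapiro-Wilk test on the pooled residuals. The group means, spread, and sample sizes are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical test scores for three groups (simulated, not real data)
rng = np.random.default_rng(0)
groups = [
    rng.normal(loc=70, scale=5, size=30),
    rng.normal(loc=75, scale=5, size=30),
    rng.normal(loc=80, scale=5, size=30),
]

# Homogeneity of variance: Levene's test (null hypothesis: equal variances)
levene_stat, levene_p = stats.levene(*groups)

# Normality of residuals: subtract each group's own mean, pool, then test
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

print(f"Levene p-value: {levene_p:.3f}")
print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption may not hold. Independence is a property of the study design (for example, random assignment and no repeated measurements) and cannot be established by a test like these.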

# Q2. What are the three types of ANOVA, and in what situations would each be used?

Three types of ANOVA:

One-Way ANOVA:

Used when you have one categorical independent variable (factor), typically with three or more levels (groups).
Tests whether the group means differ significantly.

Two-Way ANOVA:

Used when you have two categorical independent variables (factors) and want to assess their individual and interactive effects on a dependent variable.
Tests for a significant main effect of each factor and for an interaction between them.

Three-Way ANOVA:

Used when you have three categorical independent variables (factors) and want to assess their individual and interactive effects on a dependent variable.
Extends two-way ANOVA with a third factor, so the model contains three main effects, three two-way interactions, and one three-way interaction.

# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA involves breaking down the total variance in a dataset into different components to understand the sources of variability in your data. It's important to understand this concept because it helps you:

Identify Sources of Variation: It allows you to see how much of the variability in your data is due to differences between groups (treatments) and how much is due to random variability (within groups).

Assess Group Differences: By quantifying the variation between groups, you can determine if the differences you observe are statistically significant or if they could have occurred by chance.

Interpret Results: Understanding the partitioning of variance helps you interpret ANOVA results and draw meaningful conclusions about the effects of different factors or treatments on the dependent variable.

Guide Further Analysis: It informs you whether additional analyses, like post-hoc tests, are needed to compare specific group means when ANOVA indicates significant differences.
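
For a one-way design, the partition can be written out explicitly. The notation here is an assumption introduced for illustration: $k$ groups, $n_j$ observations in group $j$, $N$ observations in total, $y_{ij}$ for observation $i$ in group $j$, $\bar{y}_j$ for the mean of group $j$, and $\bar{y}$ for the overall mean.

```latex
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(y_{ij}-\bar{y}\right)^2}_{\text{total}}
=
\underbrace{\sum_{j=1}^{k} n_j\left(\bar{y}_j-\bar{y}\right)^2}_{\text{between groups}}
+
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(y_{ij}-\bar{y}_j\right)^2}_{\text{within groups}}
\qquad
F=\frac{\text{between}/(k-1)}{\text{within}/(N-k)}
```

The F-statistic compares the two components after each is divided by its degrees of freedom, which is why a large between-group share relative to the within-group share produces a large F.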

# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

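In a one-way ANOVA, the Total Sum of Squares (SST) is the sum of squared deviations of every observation from the overall mean. The Explained Sum of Squares (SSE) is the sum of squared deviations of each group mean from the overall mean, weighted by group size. The Residual Sum of Squares (SSR) is what remains: SSR = SST - SSE. Some textbooks swap these abbreviations, using SSE for the error (residual) sum of squares and SSR for the regression (explained) sum of squares; this answer follows the naming used in the question.

The example assumes one independent variable (factor) and one dependent variable, with the observations stored in a dictionary called data whose keys are the group labels group1, group2, and group3. The calculation needs only NumPy: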

In [3]:
import numpy as np

# Example data: three groups of five observations each
data = {
    'group1': [12, 13, 14, 15, 16],
    'group2': [22, 23, 24, 25, 26],
    'group3': [32, 33, 34, 35, 36]
}

# Pool all observations and calculate the overall (grand) mean
all_values = np.concatenate(list(data.values()))
overall_mean = np.mean(all_values)

# SST: squared deviations of every observation from the overall mean
sst = np.sum((all_values - overall_mean)**2)

# Calculate group means
group_means = {group: np.mean(values) for group, values in data.items()}

# SSE (explained): squared deviations of group means from the overall mean, weighted by group size
sse = np.sum([len(values) * (group_means[group] - overall_mean)**2 for group, values in data.items()])

# SSR (residual): the part of SST not explained by group membership
ssr = sst - sse

print(f'SST: {sst}')
print(f'SSE: {sse}')
print(f'SSR: {ssr}')

SST: 1030.0
SSE: 1000.0
SSR: 30.0
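
As a cross-check, the F-statistic can be rebuilt from these sums of squares and compared with `scipy.stats.f_oneway`. The sketch below recomputes the between-group and within-group sums of squares directly (the variable names are illustrative) rather than by subtraction. With these numbers, F = (1000 / 2) / (30 / 12) = 200.

```python
import numpy as np
from scipy import stats

# Same example data as in the answer above
data = {
    'group1': [12, 13, 14, 15, 16],
    'group2': [22, 23, 24, 25, 26],
    'group3': [32, 33, 34, 35, 36],
}
groups = [np.array(values, dtype=float) for values in data.values()]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Explained (between-group) and residual (within-group) sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom: k - 1 between groups, N - k within groups
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of the two mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)
f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"Manual F: {f_manual:.4f}, SciPy F: {f_scipy:.4f}, p-value: {p_scipy:.3g}")
```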


# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [10]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

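# Example dataset: factor A (2 levels) and factor B (3 levels), 5 random observations per cell.
# No random seed is set, so the values (and the table) change on every run.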
data = pd.DataFrame({
    'A': np.repeat(['A1', 'A2'], 15),
    'B': np.tile(np.repeat(['B1', 'B2', 'B3'], 5), 2),
    'value': np.random.random(30)
})

# Fit the model
model = ols('value ~ C(A) + C(B) + C(A):C(B)', data).fit()

# Perform ANOVA and print the table
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

             sum_sq    df         F    PR(>F)
C(A)       0.002424   1.0  0.046896  0.830385
C(B)       0.222947   2.0  2.157028  0.137566
C(A):C(B)  0.429144   2.0  4.151995  0.028279
Residual   1.240301  24.0       NaN       NaN
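
To see what the main-effect and interaction rows measure, it can help to look at the cell means directly. The sketch below uses a small made-up balanced 2x2 dataset (the numbers are hypothetical, chosen so the arithmetic is easy to follow) and computes marginal means for each factor plus a simple interaction contrast.

```python
import pandas as pd

# Hypothetical balanced 2x2 data (made-up numbers for illustration)
df = pd.DataFrame({
    'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A2'],
    'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1', 'B2', 'B2'],
    'value': [10, 12, 14, 16, 20, 22, 34, 36],
})

# Mean of each A x B cell
cell_means = df.pivot_table(values='value', index='A', columns='B', aggfunc='mean')

# Marginal means: averaging across the other factor (valid as-is for a balanced design)
marginal_A = cell_means.mean(axis=1)
marginal_B = cell_means.mean(axis=0)

# Interaction contrast: does the effect of B differ between the levels of A?
effect_B_within_A = cell_means['B2'] - cell_means['B1']
interaction_contrast = effect_B_within_A['A2'] - effect_B_within_A['A1']

print(cell_means)
print("Marginal means of A:", marginal_A.to_dict())
print("Marginal means of B:", marginal_B.to_dict())
print("Interaction contrast:", interaction_contrast)
```

Differences among the marginal means correspond to main effects, and a non-zero interaction contrast means the effect of one factor changes across the levels of the other. The ANOVA table then tests whether those differences are larger than the residual variation would explain.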


# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

In a one-way ANOVA, the F-statistic is used to test whether there are significant differences in means between the groups. In your scenario:

F-statistic: 5.23
p-value: 0.02
Here's how to interpret these results:

F-Statistic: The F-statistic measures the ratio of variation between the group means to the variation within the groups. A larger F-statistic indicates larger differences between group means compared to within-group variation.

p-Value: The p-value is the probability of observing an F-statistic at least as large as the one calculated from your data, assuming that all group means are equal (i.e., the null hypothesis is true). A smaller p-value indicates stronger evidence against the null hypothesis.

Interpretation:

Since the p-value (0.02) is less than the conventional significance level of 0.05, you would reject the null hypothesis. This means there is evidence that at least one group mean differs from the others. The ANOVA itself does not identify which groups differ; a post-hoc test is needed for that.

In practical terms, the differences between the groups are statistically significant. However, you should also consider the effect size and the context of your study to determine whether these differences are practically meaningful.
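
The relationship between F, the p-value, and effect size can be sketched in code. The question does not report degrees of freedom, so the values below (3 groups, 15 observations in total) are hypothetical and chosen only for illustration; the p-value they produce will not necessarily match the 0.02 in the question.

```python
from scipy import stats

f_stat = 5.23
df_between = 2    # hypothetical: 3 groups
df_within = 12    # hypothetical: 15 observations in total

# p-value: upper-tail probability of the F distribution at the observed F
p_value = stats.f.sf(f_stat, df_between, df_within)

# Eta squared (share of total variance explained), recovered from F and the df:
# eta^2 = SS_between / (SS_between + SS_within)
eta_squared = (f_stat * df_between) / (f_stat * df_between + df_within)

print(f"p-value: {p_value:.4f}")
print(f"eta squared: {eta_squared:.3f}")
```

Eta squared depends on the degrees of freedom as well as on F, which is why the same F-statistic can correspond to a large effect in a small study and a small effect in a large one.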

# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

In a repeated measures ANOVA, handling missing data is an important consideration, as missing data can potentially bias your results and reduce the statistical power of your analysis. There are several methods to handle missing data, each with its own potential consequences:

Listwise Deletion (Complete Case Analysis):

Incomplete cases (participants with missing data at any measurement occasion) are removed from the analysis. A classical repeated measures ANOVA needs a complete set of measurements for every participant, so this is often what happens by default.
Pros:
Simple and easy to implement.
Cons:
Reduces the sample size and statistical power.
Introduces bias unless the data are missing completely at random (MCAR), i.e., missingness is unrelated to the outcome or other variables.

Pairwise Deletion (Available Case Analysis):

Analysis is conducted using all available data for each pair of variables.
Pros:
Uses all available data and discards fewer observations than listwise deletion.
Cons:
Different subsets of data are used for different comparisons, potentially leading to biased or inconsistent results.
Standard errors may be incorrect, making hypothesis tests and confidence intervals unreliable.

Imputation:

Missing values are replaced with estimated values using imputation methods (e.g., mean imputation, regression imputation, multiple imputation).
Pros:
Retains all cases and preserves sample size.
Can reduce bias when the imputation model is appropriate.
Cons:
The choice of imputation method can impact results. Single imputation, and mean imputation in particular, understates variability and can make results look more precise than they are.
Multiple imputation assumes that the data are missing at random (MAR), which may not always hold true.

Mixed Models (Longitudinal Analysis):

Utilizes all available data points while accounting for within-subject correlations.
Pros:
Retains all data points and properly accounts for the repeated measures structure.
Cons:
Requires more advanced statistical techniques and software.
Still relies on the missing at random (MAR) assumption.

The consequences of using different methods to handle missing data include potential bias, changes in the significance of results, and differences in effect size estimates. It's important to carefully consider the nature of your data and the reasons for missingness when choosing a method. Multiple imputation and mixed models are often preferred when data are missing at random but not completely at random, and the choice between them should be based on the specific characteristics of your dataset.
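
The practical difference between these approaches shows up in how much data each one keeps. The sketch below uses a small hypothetical wide-format table (one row per subject, one column per time point; all names and values are made up) and compares listwise deletion, mean imputation, and the long-format set of observed values that a mixed model could use.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures data with two missing values
wide = pd.DataFrame({
    'subject': [1, 2, 3, 4, 5],
    't1': [5.0, 6.0, 7.0, 8.0, 9.0],
    't2': [6.0, np.nan, 8.0, 9.0, 10.0],
    't3': [7.0, 8.0, np.nan, 10.0, 11.0],
})
measure_cols = ['t1', 't2', 't3']

# Listwise deletion: drop every subject with any missing measurement
complete_cases = wide.dropna()

# Mean imputation: fill each gap with that time point's observed mean
mean_imputed = wide.copy()
mean_imputed[measure_cols] = wide[measure_cols].fillna(wide[measure_cols].mean())

# Long format: keeps every observed value, including those of incomplete subjects
long_available = wide.melt(id_vars='subject', var_name='time', value_name='score').dropna()

print("Subjects kept by listwise deletion:", len(complete_cases))
print("Observed values available in long format:", len(long_available))
print("SD of t2 before imputation:", wide['t2'].std())
print("SD of t2 after mean imputation:", mean_imputed['t2'].std())
```

Listwise deletion keeps 3 of the 5 subjects here, while the long format keeps 13 of the 15 measurements. The drop in standard deviation after mean imputation illustrates why that method tends to understate variability.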

# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Post-hoc tests are used after conducting an analysis of variance (ANOVA) to make pairwise comparisons between groups when the ANOVA indicates that there are significant differences among group means. Common post-hoc tests include:

Tukey's Honestly Significant Difference (Tukey HSD):

When to use: Tukey's HSD controls the familywise error rate across all pairwise comparisons. It is suitable when you have conducted an ANOVA with three or more groups and want to compare every pair of groups to determine which ones are significantly different from each other.
Example: In a study comparing the effectiveness of four different treatments for a medical condition, you want to determine which pairs of treatments result in significantly different outcomes.

Bonferroni Correction:

When to use: The Bonferroni correction controls the familywise error rate by dividing the significance level by the number of comparisons. It is simple and works for any set of comparisons, but it becomes very conservative (low power) as the number of comparisons grows, so it suits a small number of planned comparisons best.
Example: You are making all 45 pairwise comparisons among the means of 10 different groups. Bonferroni tests each comparison at 0.05 / 45 ≈ 0.0011, which controls the overall Type I error rate at the cost of power.

Duncan's Multiple Range Test:

When to use: Duncan's test is used after an ANOVA to perform pairwise comparisons between groups. It is less conservative than Tukey's HSD, which gives it more power but weaker control of the familywise error rate.
Example: In an agricultural study, you are comparing the yields of different varieties of a crop, and you want to identify which varieties have significantly different yields.

Sidak Correction:

When to use: The Sidak correction is another method for adjusting the significance level to control the familywise error rate. It is slightly less conservative than Bonferroni and assumes the comparisons are independent.
Example: You are comparing the performance of different advertising strategies across various markets, and you need to make pairwise comparisons while controlling the overall Type I error rate.

Fisher's Least Significant Difference (LSD):

When to use: Fisher's LSD runs unadjusted pairwise t-tests, but only after a significant overall F-test. It is the least conservative option listed here, so the Type I error rate inflates as the number of groups grows.
Example: In a psychology experiment, you have measured the reaction times of participants in different conditions, and you want to determine which pairs of conditions resulted in significantly different reaction times.

Example Situation:

Let's say you conducted an ANOVA to compare the performance of three different teaching methods on student test scores. The ANOVA results indicate a statistically significant difference among the means of the three teaching methods, but they do not say which methods differ. In this case, you would use a post-hoc test, such as Tukey's HSD or the Bonferroni correction, to perform pairwise comparisons between the teaching methods and identify which specific methods lead to significantly different test scores.
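
The teaching-method situation can be sketched with `pairwise_tukeyhsd` from statsmodels. The scores below are simulated (hypothetical means of 70, 80, and 90 with a standard deviation of 3 and 20 students per method), and the method labels are placeholders.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulated test scores for three hypothetical teaching methods
rng = np.random.default_rng(1)
scores = np.concatenate([
    rng.normal(70, 3, 20),
    rng.normal(80, 3, 20),
    rng.normal(90, 3, 20),
])
methods = np.repeat(['method_1', 'method_2', 'method_3'], 20)

# Tukey HSD: all pairwise comparisons with familywise error control
tukey = pairwise_tukeyhsd(endog=scores, groups=methods, alpha=0.05)
print(tukey.summary())
```

Each row of the summary reports the mean difference for one pair, its adjusted p-value and confidence interval, and whether the null hypothesis of equal means is rejected for that pair.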

# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [11]:
import numpy as np
import scipy.stats as stats

# Illustrative sample data for weight loss in each diet group
# (50 values per diet here, although the question describes 50 participants in total)
diet_A = np.array([2.1, 1.8, 2.5, 2.3, 1.9, 2.0, 1.6, 1.7, 2.2, 1.8,
                   2.0, 2.1, 1.9, 1.7, 2.3, 2.4, 2.0, 1.8, 2.2, 2.1,
                   1.7, 2.5, 2.4, 2.2, 2.0, 2.1, 1.9, 1.8, 2.3, 2.4,
                   2.1, 2.2, 2.0, 1.9, 1.8, 2.4, 2.3, 2.1, 2.2, 1.7,
                   1.8, 2.0, 2.1, 1.9, 2.4, 2.2, 1.8, 2.3, 2.5, 2.1])

diet_B = np.array([1.5, 1.2, 1.9, 1.6, 1.3, 1.4, 1.7, 1.8, 1.5, 1.6,
                   1.4, 1.2, 1.8, 1.6, 1.3, 1.9, 1.5, 1.7, 1.6, 1.4,
                   1.8, 1.9, 1.5, 1.4, 1.2, 1.7, 1.6, 1.8, 1.9, 1.3,
                   1.5, 1.7, 1.8, 1.2, 1.4, 1.6, 1.3, 1.9, 1.7, 1.5,
                   1.8, 1.6, 1.4, 1.2, 1.9, 1.7, 1.5, 1.3, 1.8, 1.6])

diet_C = np.array([2.9, 2.7, 2.5, 2.8, 3.0, 2.6, 2.7, 2.9, 3.1, 2.5,
                   2.6, 2.8, 3.0, 2.7, 2.6, 2.8, 2.9, 2.7, 3.1, 2.5,
                   2.8, 3.0, 2.7, 2.6, 2.9, 2.5, 3.2, 2.7, 2.8, 2.6,
                   2.9, 3.0, 2.8, 2.7, 2.6, 3.1, 2.5, 2.9, 2.8, 2.7,
                   2.6, 2.7, 3.0, 2.8, 2.9, 3.1, 2.5, 2.6, 2.7, 2.8])

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Output the results
print("F-Statistic:", f_statistic)
print("p-value:", p_value)

# Interpret the results
if p_value < 0.05:
    print("There are significant differences in mean weight loss between the diets.")
else:
    print("There are no significant differences in mean weight loss between the diets.")

F-Statistic: 385.8565543496798
p-value: 3.1922763908583244e-59
There are significant differences in mean weight loss between the diets.


# Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [12]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a DataFrame with software programs, experience level, and task completion times
data = pd.DataFrame({
    'Software': ['A', 'B', 'C'] * 10,
    'Experience': ['Novice'] * 15 + ['Experienced'] * 15,
    'Time': [12.3, 13.1, 11.9, 13.5, 12.8, 13.7, 11.6, 13.2, 12.5, 13.0,
             14.2, 14.0, 13.4, 14.1, 14.3, 9.8, 10.2, 9.5, 10.0, 10.4,
             11.8, 11.5, 12.0, 11.9, 11.6, 12.2, 11.4, 12.1, 11.7, 12.5]
})

# Fit a two-way ANOVA model
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Output the results
print(anova_table)

                              sum_sq    df          F    PR(>F)
C(Software)                 1.608667   2.0   0.903069  0.418657
C(Experience)              28.033333   1.0  31.474551  0.000009
C(Software):C(Experience)   0.240667   2.0   0.135105  0.874284
Residual                   21.376000  24.0        NaN       NaN

Interpretation (at the 0.05 level): The main effect of experience is significant (F = 31.47, p < 0.001). In this sample data, experienced employees complete the task faster on average (about 11.2 versus 13.2 for novices). The main effect of software is not significant (F = 0.90, p = 0.419), and neither is the Software x Experience interaction (F = 0.14, p = 0.874). There is therefore no evidence that completion time depends on the program used, or that the effect of experience differs across programs.


# Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [13]:
import numpy as np
import scipy.stats as stats

# Generate example test scores for the control and experimental groups
np.random.seed(42)  # for reproducibility
control_group = np.random.normal(75, 5, 50)  # Control group with a mean of 75 and standard deviation of 5
experimental_group = np.random.normal(80, 5, 50)  # Experimental group with a mean of 80 and standard deviation of 5

# Perform a two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

# Output the results
print("Two-Sample T-Test Results:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Interpret the results
if p_value < 0.05:
    print("There is a significant difference in test scores between the control and experimental groups.")
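    # With only two groups, the t-test already shows which groups differ, so no post-hoc test is needed.
    # The sign of the t-statistic gives the direction (negative here: control scored lower).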
else:
    print("There is no significant difference in test scores between the groups.")

Two-Sample T-Test Results:
t-statistic: -6.872731683285833
p-value: 5.877565294167974e-10
There is a significant difference in test scores between the control and experimental groups.
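
Because a post-hoc test adds nothing with two groups, a more useful follow-up is an effect size, optionally alongside a test that does not assume equal variances. The sketch below reuses the same simulated scores and adds Welch's t-test and Cohen's d; the pooled standard deviation formula used here assumes equal group sizes.

```python
import numpy as np
from scipy import stats

# Same simulated scores as in the answer above
np.random.seed(42)
control = np.random.normal(75, 5, 50)
experimental = np.random.normal(80, 5, 50)

# Welch's t-test: does not assume equal variances in the two groups
welch_t, welch_p = stats.ttest_ind(experimental, control, equal_var=False)

# Cohen's d: mean difference in units of the pooled standard deviation
pooled_sd = np.sqrt((control.var(ddof=1) + experimental.var(ddof=1)) / 2)
cohens_d = (experimental.mean() - control.mean()) / pooled_sd

print(f"Welch t: {welch_t:.3f}, p-value: {welch_p:.3g}")
print(f"Cohen's d: {cohens_d:.2f}")
```

Here the experimental group is passed first, so a positive t and a positive d both indicate higher scores under the new teaching method.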


# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [17]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

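# Placeholder data: two factors A and B with random values.
# Note: this cell fits the same two-way ANOVA as Q5; it is not a repeated measures design on store sales.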
data = pd.DataFrame({
    'A': np.repeat(['A1', 'A2'], 15),
    'B': np.tile(np.repeat(['B1', 'B2', 'B3'], 5), 2),
    'value': np.random.random(30)
})

# Fit the model
model = ols('value ~ C(A) + C(B) + C(A):C(B)', data).fit()

# Perform ANOVA and print the table
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

             sum_sq    df         F    PR(>F)
C(A)       0.003603   1.0  0.039687  0.843771
C(B)       0.235059   2.0  1.294489  0.292493
C(A):C(B)  0.040964   2.0  0.225589  0.799719
Residual   2.179015  24.0       NaN       NaN
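
A repeated measures analysis of the store question treats each day as the "subject" measured under three conditions (the stores). The sketch below is one way to set that up with `AnovaRM` from statsmodels, followed by Bonferroni-adjusted paired t-tests as the post-hoc step. The sales figures are simulated: the store means (500, 520, 560), the day-to-day variation, and the column names are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

# Simulated daily sales: each of 30 days is measured at all three stores
rng = np.random.default_rng(0)
n_days = 30
day_effect = rng.normal(0, 20, n_days)          # variation shared by all stores on a day
store_means = {'A': 500, 'B': 520, 'C': 560}    # hypothetical average sales

rows = []
for store, mean in store_means.items():
    sales = mean + day_effect + rng.normal(0, 15, n_days)
    for day, value in enumerate(sales):
        rows.append({'day': day, 'store': store, 'sales': value})
sales_long = pd.DataFrame(rows)

# Repeated measures ANOVA: day is the subject, store is the within-subject factor
rm_result = AnovaRM(sales_long, depvar='sales', subject='day', within=['store']).fit()
print(rm_result.anova_table)

# Post-hoc: paired t-tests for each pair of stores with a Bonferroni adjustment
wide = sales_long.pivot(index='day', columns='store', values='sales')
pairs = [('A', 'B'), ('A', 'C'), ('B', 'C')]
for s1, s2 in pairs:
    t_stat, p_raw = stats.ttest_rel(wide[s1], wide[s2])
    p_adj = min(p_raw * len(pairs), 1.0)
    print(f"Store {s1} vs Store {s2}: t = {t_stat:.2f}, adjusted p = {p_adj:.4g}")
```

Pairing by day removes the shared day-to-day variation from the error term, which is the main reason to prefer a repeated measures design over a one-way ANOVA for data collected this way.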
