In [None]:
Q1. Assumptions required to use ANOVA:
Independence: The observations within each group are assumed to be independent of each other.
Normality: The dependent variable follows a normal distribution within each group.
Homogeneity of variances: The variability of the dependent variable is equal across all groups.
Violations that could impact the validity of the results:
Violation of independence: If observations within groups are correlated, such as in a repeated measures design where the same subjects are measured multiple times, the independence assumption fails. Correlated observations typically understate the error variance and inflate the Type I error rate, so a standard between-subjects ANOVA is not appropriate.
Violation of normality: If the dependent variable is strongly non-normal within groups (for example, heavily skewed or containing outliers), the p-values may be inaccurate, especially with small samples. With large and roughly equal group sizes, the F-test is fairly robust to moderate departures from normality.
Violation of homogeneity of variances: If the variability of the dependent variable differs markedly across groups, the F-test can become too liberal or too conservative, particularly when group sizes are unequal, and may lead to incorrect conclusions. Welch's ANOVA is a common alternative in that case.
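The normality and equal-variance assumptions above can be screened before running the ANOVA. The following is a minimal sketch, assuming SciPy's `shapiro` and `levene` tests as the diagnostics; the simulated groups are hypothetical and only serve to make the example runnable.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups simulated from normal distributions
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=2.0, size=30) for mu in (10.0, 11.0, 12.0)]

# Shapiro-Wilk test per group: H0 = the sample comes from a normal distribution
shapiro_p = []
for g in groups:
    _, p = stats.shapiro(g)
    shapiro_p.append(p)

# Levene's test: H0 = all groups have equal variances
levene_stat, levene_p = stats.levene(*groups, center="median")

print("Shapiro-Wilk p-values:", shapiro_p)
print("Levene p-value:", levene_p)
```

Small p-values would suggest a violation of the corresponding assumption. Independence cannot be tested this way; it has to be judged from the study design.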

Q2. The three types of ANOVA and their situations of use:
One-Way ANOVA: It is used when comparing the means of three or more independent groups or levels of a single factor. For example, comparing the effectiveness of three different treatments on a disease outcome.
Two-Way ANOVA: It is used when there are two independent variables or factors. It examines the main effects of each factor and the interaction effect between them. For example, studying the effects of both drug dosage and gender on a health outcome.
Repeated Measures ANOVA: It is used when the same participants are measured under different conditions or at multiple time points. It analyzes within-subject differences across conditions. For example, investigating the effects of different teaching methods on student performance by measuring the same students' scores before and after each method.

Q3. Partitioning of variance in ANOVA and its importance:
The partitioning of variance in ANOVA refers to the division of the total variation of the dependent variable into components attributable to different sources. It is important because it shows where the variation comes from and how much each source contributes.
In a one-way ANOVA, the total sum of squares (SST) is partitioned into two components, SST = SSE + SSR: the explained sum of squares (SSE, between groups) and the residual sum of squares (SSR, within groups). Note that some textbooks use these abbreviations the other way around (SSE for error, SSR for regression). The explained component represents the variation accounted for by the independent variable(s) or factors, while the residual component represents the unexplained or random variation.
By understanding the partitioning of variance, researchers can assess the proportion of variation explained by the factors of interest (SSE / SST) and, by comparing the explained and residual mean squares in the F-ratio, determine whether the observed differences are statistically significant or plausibly due to random chance.
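The proportion of variance explained by a factor can be computed directly from the sums of squares. The sketch below uses a hypothetical helper, `eta_squared`, and made-up scores; it illustrates the ratio of between-group to total sum of squares and is not tied to any particular library routine.

```python
import numpy as np

def eta_squared(*groups):
    """Return SS_between / SS_total for a one-way layout (hypothetical helper)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    data = np.concatenate(groups)
    grand_mean = data.mean()
    # Total variation around the grand mean
    ss_total = np.sum((data - grand_mean) ** 2)
    # Variation of the group means around the grand mean, weighted by group size
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    return ss_between / ss_total

# Hypothetical scores for three groups
effect = eta_squared([4, 5, 6], [6, 7, 8], [8, 9, 10])
print("Proportion of variance explained:", effect)
```

For these scores the between-group sum of squares is 24 out of a total of 30, so the factor explains about 80% of the variation.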



In [None]:
# Q4 ---------->
import numpy as np
from scipy import stats

# Sample data for three groups
group1 = np.array([2, 4, 6, 8, 10])
group2 = np.array([1, 3, 5, 7, 9])
group3 = np.array([0, 2, 4, 6, 8])

# Concatenate data from all groups
data = np.concatenate([group1, group2, group3])

# Group labels
labels = np.array(['Group1'] * len(group1) + ['Group2'] * len(group2) + ['Group3'] * len(group3))

# One-way ANOVA
f_statistic, p_value = stats.f_oneway(group1, group2, group3)

# Degrees of freedom
df_total = len(data) - 1
df_groups = len(np.unique(labels)) - 1
df_residual = df_total - df_groups

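# Sum of squares
grand_mean = np.mean(data)
sst = np.sum((data - grand_mean)**2)  # total
# Explained (between-group): group size times squared deviation of each group mean from the grand mean
sse = sum(len(g) * (np.mean(g) - grand_mean)**2 for g in (group1, group2, group3))
ssr = sst - sse  # residual (within-group)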

print("SST:", sst)
print("SSE:", sse)
print("SSR:", ssr)
print("F-statistic:", f_statistic)
print("p-value:", p_value)


In [None]:
# Q5 -------------->
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a dataframe with the data
data = pd.DataFrame({'Software': ['A', 'A', 'B', 'B', 'C', 'C'],
                     'Experience': ['Novice', 'Experienced'] * 3,
                     'Time': [10, 12, 15, 13, 9, 11]})

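# Fit the ANOVA model
# Note: this toy data has a single observation per Software x Experience cell, so the
# model with the interaction term is saturated (0 residual degrees of freedom) and the
# F-statistics and p-values are undefined. Replicates per cell are needed to test effects.
model = ols('Time ~ Software + Experience + Software:Experience', data=data).fit()
anova_table = sm.stats.anova_lm(model)

# Extract the mean squares for the main effects and the interaction effect
main_effects = anova_table.loc[['Software', 'Experience'], 'mean_sq']
interaction_effect = anova_table.loc['Software:Experience', 'mean_sq']

print("Main Effects (mean squares):")
print(main_effects)
print("Interaction Effect (mean square):")
print(interaction_effect)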


In [None]:
Q6. Interpretation of one-way ANOVA results:
Given an F-statistic of 5.23 and a p-value of 0.02 from a one-way ANOVA, we reject the null hypothesis that all group means are equal at the conventional 0.05 significance level, because 0.02 < 0.05. The differences between the groups are statistically significant.
This means that at least one group mean differs from at least one other in terms of the variable being measured. However, the ANOVA does not tell us which specific groups are different from each other. To determine the specific group differences, post-hoc tests or further analyses are needed.

Q7. Handling missing data in repeated measures ANOVA and potential consequences:
In repeated measures ANOVA, missing data can be handled using different methods depending on the nature and pattern of the missingness:
Complete case analysis: Only subjects with complete data across all conditions are included in the analysis. The missing data are simply ignored. This method may lead to biased results if the missing data are not missing completely at random (MCAR).
Pairwise deletion: Each analysis is conducted using available data for each pairwise comparison. It maximizes the available information but may introduce bias if the missing data are not MCAR.
Imputation: Missing values are replaced with estimated values based on the observed data. Imputation methods such as mean imputation, regression imputation, or multiple imputation can be used. However, single imputation methods such as mean imputation understate the variability in the data and can make p-values too small, whereas multiple imputation accounts for the uncertainty in the imputed values.
The consequences of using different methods to handle missing data include potential bias in the estimates, loss of statistical power, and incorrect p-values. It is crucial to carefully consider the missing data mechanism and choose an appropriate method that aligns with the assumptions of the analysis.
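The difference between complete case analysis and mean imputation can be seen on a small table. This is a minimal sketch using pandas; the wide-format data (one row per subject, one column per condition) and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per condition
wide = pd.DataFrame(
    {"cond1": [5.0, 6.0, np.nan, 7.0],
     "cond2": [6.0, np.nan, 8.0, 9.0],
     "cond3": [7.0, 8.0, 9.0, 10.0]},
    index=["s1", "s2", "s3", "s4"],
)

# Complete case analysis: keep only subjects observed in every condition
complete_cases = wide.dropna()

# Mean imputation: replace each missing value with its condition (column) mean
mean_imputed = wide.fillna(wide.mean())

print(complete_cases)
print(mean_imputed)
```

Here complete case analysis discards half of the subjects, which illustrates the loss of power, while mean imputation keeps all subjects but shrinks the spread within each condition.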

Q8. Common post-hoc tests used after ANOVA and their use cases:
Tukey's Honestly Significant Difference (HSD): It is used to identify pairwise differences between all possible combinations of groups. It controls the family-wise error rate and is suitable when comparing all pairs of groups in a one-way ANOVA.
Bonferroni correction: It is a conservative method that adjusts the significance level for multiple comparisons. It is suitable when conducting multiple pairwise comparisons and controlling the overall Type I error rate.
Sidak correction: Similar to Bonferroni correction, it adjusts the significance level for multiple comparisons but is slightly less conservative. It is suitable when conducting multiple pairwise comparisons while controlling the overall Type I error rate.
Post-hoc tests are necessary when the null hypothesis is rejected in ANOVA, indicating that there are significant differences between groups. These tests help identify which specific group(s) differ significantly from each other.
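The post-hoc procedures listed above might be run as follows. This is a sketch, assuming statsmodels' `pairwise_tukeyhsd` for Tukey's HSD and `multipletests` for the Bonferroni and Sidak adjustments of pairwise t-test p-values; the data are illustrative (the same values as the diet example in Q9).

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

# Illustrative measurements for three groups
samples = {
    "A": [4, 6, 5, 3, 5],
    "B": [2, 3, 4, 2, 1],
    "C": [5, 7, 6, 8, 7],
}

# Flatten into one value array and one matching label array
values = np.concatenate([samples[name] for name in samples])
labels = np.repeat(list(samples), [len(v) for v in samples.values()])

# Tukey's HSD: all pairwise comparisons with family-wise error control
tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey.summary())

# Pairwise t-tests, then Bonferroni and Sidak adjustments of the raw p-values
pairs = list(combinations(samples, 2))
raw_p = [stats.ttest_ind(samples[a], samples[b]).pvalue for a, b in pairs]
adjusted = {}
for method in ("bonferroni", "sidak"):
    reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method=method)
    adjusted[method] = adj_p
    print(method, dict(zip(pairs, adj_p.round(4))))
```

The Sidak-adjusted p-values are never larger than the Bonferroni-adjusted ones, which reflects the slightly less conservative behaviour described above.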

In [None]:
# Q9 ------------>
import scipy.stats as stats

# Weight loss data for three diets: A, B, and C
diet_a = [4, 6, 5, 3, 5]
diet_b = [2, 3, 4, 2, 1]
diet_c = [5, 7, 6, 8, 7]

# One-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_a, diet_b, diet_c)

print("F-statistic:", f_statistic)
print("p-value:", p_value)


In [None]:
# Q10 ------->
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

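# Create a dataframe with the data
# Experience must vary within each software program; otherwise the two factors are
# confounded and their effects cannot be separated. The labels below are illustrative.
data = pd.DataFrame({'Software': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                     'Experience': ['Novice', 'Experienced', 'Novice',
                                    'Experienced', 'Novice', 'Experienced',
                                    'Novice', 'Experienced', 'Novice'],
                     'Time': [15, 20, 18, 22, 19, 20, 17, 16, 14]})

# Fit the ANOVA model; Type II sums of squares are used because the design is unbalanced
model = ols('Time ~ Software + Experience + Software:Experience', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)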

print(anova_table)


In [None]:
# Q11 ------->
import scipy.stats as stats

# Test scores for control group and experimental group
control_group = [80, 85, 90, 75, 82]
experimental_group = [90, 95, 92, 88, 85]

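# Two-sample (independent) t-test; assumes equal variances by default
# (pass equal_var=False for Welch's t-test if the variances may differ)
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)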

print("t-statistic:", t_statistic)
print("p-value:", p_value)


In [None]:
# Q 12--------->
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Create a long-format dataframe: one row per (Day, Store) combination.
# Each day acts as the "subject" that is measured once for every store,
# so all three columns must have the same length (3 days x 3 stores = 9 rows).
data = pd.DataFrame({'Day': ['Day1', 'Day2', 'Day3'] * 3,
                     'Store': ['StoreA'] * 3 + ['StoreB'] * 3 + ['StoreC'] * 3,
                     'Sales': [100, 95, 105, 80, 85, 90, 120, 110, 115]})

# Fit the repeated measures ANOVA model (Store is the within-subject factor)
model = AnovaRM(data, depvar='Sales', subject='Day', within=['Store']).fit()

print(model.summary())
