Ques 1:

Ans: ANOVA (Analysis of Variance) is a statistical method used to determine whether there are significant differences between the means of two or more groups. The validity of ANOVA results depends on certain assumptions that must be met:
Normality assumption: The data should follow a normal distribution within each group.
Homogeneity of variance assumption: The variance of the data should be equal across all groups.
Independence assumption: The observations in each group should be independent of each other.
If any of these assumptions are violated, it can impact the validity of ANOVA results.
Examples of violations that could impact the validity of ANOVA results are:
Non-normality: If the data do not follow a normal distribution within each group, the p-value from the F-test may be inaccurate, particularly when the groups are small. For example, heavily skewed data or data with outliers may violate the normality assumption. In such cases, a non-parametric alternative to ANOVA, such as the Kruskal-Wallis test, may be more appropriate.
Heterogeneity of variance: If the variance of the data is not equal across all groups, the ANOVA results may be unreliable. For example, if the variance in one group is much larger than the variance in another group, it may violate the homogeneity of variance assumption. In such cases, a modified version of ANOVA, such as Welch's ANOVA, may be more appropriate.
Lack of independence: If the observations in each group are not independent, the ANOVA results may be invalid. For example, if the same individual is included in multiple groups, or if there is a clustering effect in the data, it may violate the independence assumption. In such cases, a mixed-effects model or a repeated measures ANOVA may be more appropriate.
It is important to check for violations of these assumptions before interpreting ANOVA results, as failure to meet these assumptions may lead to erroneous conclusions.
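
The checks described above can be sketched in code. This is a minimal illustration rather than a complete diagnostic workflow: the three groups are simulated (hypothetical data), the 0.05 cutoff is an assumed convention, and independence cannot be tested from the values alone because it depends on how the data were collected.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three groups (simulated, for illustration only)
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=10.0, scale=2.0, size=30),
    "B": rng.normal(loc=11.0, scale=2.0, size=30),
    "C": rng.normal(loc=12.0, scale=2.0, size=30),
}
alpha = 0.05  # assumed significance level

# Normality within each group: Shapiro-Wilk (H0: the sample is normally distributed)
normal_ok = all(stats.shapiro(values).pvalue > alpha for values in groups.values())

# Homogeneity of variance: Levene's test (H0: all groups have equal variance)
levene_p = stats.levene(*groups.values()).pvalue
equal_var_ok = levene_p > alpha

# Simplified decision: fall back to a rank-based test if normality looks doubtful
if normal_ok:
    result = stats.f_oneway(*groups.values())
    test_name = "one-way ANOVA"
else:
    result = stats.kruskal(*groups.values())
    test_name = "Kruskal-Wallis"

print("normality plausible:", normal_ok)
print("equal variances plausible:", equal_var_ok)
print(test_name, "statistic:", result.statistic, "p-value:", result.pvalue)
```

If only the equal-variance check fails, Welch's ANOVA would be the more natural fallback than a rank-based test; that branch is left out here to keep the sketch short.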

Ques 2:

Ans: The three types of ANOVA are:

One-way ANOVA: This type of ANOVA is used when there is only one independent variable and one dependent variable. It is used to test whether there are significant differences between the means of three or more groups. For example, a one-way ANOVA could be used to test whether there is a significant difference in the average income of people living in different regions of a country.

Two-way ANOVA: This type of ANOVA is used when there are two independent variables and one dependent variable. It is used to test whether there are significant differences between the means of groups formed by the combination of two independent variables. For example, a two-way ANOVA could be used to test whether there is a significant difference in the average weight of plants grown in different soil types and with different levels of fertilizer.

Mixed ANOVA: This type of ANOVA is used when there are two or more independent variables, with at least one between-subjects variable and at least one within-subjects variable (repeated measures). A between-subjects variable assigns each subject to a single condition, whereas a within-subjects variable is one where the same subject is measured multiple times under different conditions. A mixed ANOVA can help determine whether there are significant differences between the means of groups formed by the combination of the independent variables, while accounting for the correlation between repeated measurements on the same subject. For example, a mixed ANOVA could be used to test whether average reaction time differs between participants who are exposed to different types of stimuli (between-subjects variable, each participant sees one type) and across different levels of fatigue (within-subjects variable, each participant is tested at every level).

Ques 3:

Ans: Partitioning of variance is the process of dividing the total variance in a dataset into its different components in ANOVA. ANOVA uses partitioning of variance to determine whether the differences between group means are statistically significant or not.

The total variance in the dataset is split into two components: the variance between groups (also known as the "explained variance") and the variance within groups (also known as the "unexplained variance"). The variance between groups measures the differences between group means, while the variance within groups measures the variation within each group.

The ratio of the between-groups variance to the within-groups variance is used to calculate the F-statistic, which is used to determine whether there is a significant difference between group means.
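
Written out, the partition and the F-ratio look as follows. The notation is introduced here for illustration and is an assumption, not taken from the question: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, group means $\bar{y}_i$ and grand mean $\bar{y}$.

$$
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}\right)^2}_{SS_{\text{total}}}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{y}_i-\bar{y}\right)^2}_{SS_{\text{between}}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_i\right)^2}_{SS_{\text{within}}}
$$

$$
F = \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}
$$

Each sum of squares is divided by its degrees of freedom before the ratio is taken, so the F-statistic compares mean squares rather than raw sums of squares.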

It is important to understand the concept of partitioning of variance in ANOVA because it provides a framework for analyzing the variation in a dataset and determining the sources of variation that contribute to the differences between groups. This information can be used to identify factors that are driving the differences between groups and to develop interventions or treatments that target these factors. In addition, understanding the partitioning of variance helps in interpreting the results of ANOVA and determining whether the observed differences between group means are statistically significant or due to chance.
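
To make the partition concrete, the sketch below computes the sums of squares by hand for three small groups with illustrative values and compares the resulting F-statistic with `scipy.stats.f_oneway`. The variable names are chosen for this example only.

```python
import numpy as np
from scipy import stats

# Small illustrative dataset: three groups with two observations each
groups = {
    "A": np.array([3.0, 4.0]),
    "B": np.array([6.0, 5.0]),
    "C": np.array([7.0, 8.0]),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()

# Total variation: squared deviations of every observation from the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()
# Between-group variation: squared deviations of group means from the grand mean
ss_between = sum(len(v) * (v.mean() - grand_mean) ** 2 for v in groups.values())
# Within-group variation: squared deviations of observations from their group mean
ss_within = sum(((v - v.mean()) ** 2).sum() for v in groups.values())

# Mean squares use the degrees of freedom of each component
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups.values())

print("SS total / between / within:", ss_total, ss_between, ss_within)
print("F (manual):", f_manual, "F (scipy):", f_scipy, "p-value:", p_scipy)
```

The between and within components add up to the total, and the manual F-ratio agrees with the library result, which is the whole point of the partition.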

Ques 4:

Ans: 

In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# create a sample dataset
data = {'Group': ['A', 'A', 'B', 'B', 'C', 'C'],
        'Value': [3, 4, 6, 5, 7, 8]}
df = pd.DataFrame(data)

# fit the one-way ANOVA model
model = ols('Value ~ Group', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# extract the sums of squares
# (naming used here: SSE = explained / between-group, SSR = residual / within-group)
SST = anova_table['sum_sq']['Group'] + anova_table['sum_sq']['Residual']
SSE = anova_table['sum_sq']['Group']
SSR = anova_table['sum_sq']['Residual']

print('Total sum of squares (SST):', SST)
print('Explained sum of squares (SSE):', SSE)
print('Residual sum of squares (SSR):', SSR)

Total sum of squares (SST): 17.499999999999993
Explained sum of squares (SSE): 15.999999999999995
Residual sum of squares (SSR): 1.5


Ques 5:

Ans: 

In [3]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# create a sample dataset
data = {'Group1': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'A', 'B', 'B', 'C', 'C'],
        'Group2': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],
        'Value': [3, 4, 6, 5, 7, 8, 4, 5, 6, 7, 5, 6]}
df = pd.DataFrame(data)

# fit the two-way ANOVA model
model = ols('Value ~ Group1 + Group2 + Group1:Group2', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# compute the mean square (sum of squares / degrees of freedom) for each term;
# the tests of the main effects and the interaction are the F and PR(>F) columns of anova_table
mean_sq_group1 = anova_table['sum_sq']['Group1'] / anova_table['df']['Group1']
mean_sq_group2 = anova_table['sum_sq']['Group2'] / anova_table['df']['Group2']
mean_sq_interaction = anova_table['sum_sq']['Group1:Group2'] / anova_table['df']['Group1:Group2']

print('Mean square for Group1:', mean_sq_group1)
print('Mean square for Group2:', mean_sq_group2)
print('Mean square for Group1:Group2 interaction:', mean_sq_interaction)

Mean square for Group1: 6.999999999999996
Mean square for Group2: 1.3333333333333293
Mean square for Group1:Group2 interaction: 0.3333333333333315


Ques 6:

Ans: In a one-way ANOVA, the F-statistic tests the null hypothesis that all the group means are equal against the alternative hypothesis that at least one group mean is different from the others. The p-value associated with the F-statistic is the probability of obtaining an F-statistic at least as large as the one observed if the null hypothesis were true.

In this case, an F-statistic of 5.23 with a p-value of 0.02 indicates a statistically significant difference between the group means at the conventional 0.05 significance level. We can reject the null hypothesis and conclude that at least one of the group means is different from the others.

The interpretation of the results may depend on the specific context and research question. However, a significant difference between the group means may suggest that the grouping variable has an effect on the outcome variable. Further analysis such as post-hoc tests or pairwise comparisons may be necessary to determine which groups differ significantly from each other.

It's also important to note that the effect size should be considered in addition to the statistical significance of the results. A small effect size may indicate that the difference between the group means, while statistically significant, is not practically significant or meaningful.
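
As a rough illustration of the effect-size point, eta-squared (the share of total variation explained by the grouping variable) can be recovered from the F-statistic and its degrees of freedom. The degrees of freedom below are hypothetical, since the question only gives F and p; the helper name `eta_squared_from_f` is also made up for this sketch.

```python
def eta_squared_from_f(f_stat: float, df_between: int, df_within: int) -> float:
    """Eta-squared for a one-way ANOVA, derived from F and its degrees of freedom.

    Uses eta^2 = SS_between / (SS_between + SS_within) together with
    F = (SS_between / df_between) / (SS_within / df_within).
    """
    return (f_stat * df_between) / (f_stat * df_between + df_within)


f_stat = 5.23                   # F-statistic from the question
df_between, df_within = 2, 27   # hypothetical: 3 groups, 30 observations in total

eta_sq = eta_squared_from_f(f_stat, df_between, df_within)
print(f"eta-squared = {eta_sq:.3f}")
```

With different degrees of freedom the same F-statistic corresponds to a different effect size, which is why significance alone does not say how large the difference is.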

Ques 7:

Ans: In a repeated measures ANOVA, missing data can arise due to various reasons, such as participant dropouts or technical issues during data collection. Handling missing data appropriately is essential because it can affect the validity and reliability of the statistical analysis results.

There are several methods to handle missing data in a repeated measures ANOVA, including:

Listwise deletion: This method involves removing any participant with missing data on any of the measures, resulting in a reduced sample size. While this method is simple to apply, it can result in biased estimates and reduced statistical power, especially when the missing data are not missing completely at random (MCAR).

Pairwise deletion: This method involves using only the available data for each comparison. While it maximizes the sample size and statistical power, it can result in biased estimates and inconsistent results due to the different sample sizes across comparisons.

Imputation: This method involves estimating the missing values using statistical techniques such as mean imputation, regression imputation, or multiple imputation. Single-value approaches such as mean imputation retain the full sample but understate the variability in the data, which can make standard errors too small. Multiple imputation can provide unbiased estimates and preserve statistical power, but it assumes that the data are missing at random (MAR), and the results can be sensitive to the imputation model assumptions.

The potential consequences of using different methods to handle missing data in a repeated measures ANOVA can include biased estimates, increased type I and type II errors, reduced statistical power, and inconsistency across analyses. Therefore, it is essential to assess the missing data pattern and choose the appropriate method based on the underlying assumptions and the research question. Additionally, sensitivity analyses should be conducted to evaluate the robustness of the results to different missing data handling methods.
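
The trade-off between listwise deletion and simple imputation can be seen on a toy example. The sketch below uses hypothetical wide-format scores (one row per participant, one column per time point) and shows only the two simplest methods; it is not a recommendation for mean imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 14.0, 13.0],
        "t2": [11.0, np.nan, 12.0, 15.0, 14.0],
        "t3": [13.0, 14.0, np.nan, 16.0, 15.0],
    },
    index=["s1", "s2", "s3", "s4", "s5"],
)

# Listwise deletion: keep only participants with complete data at every time point
listwise = scores.dropna()

# Mean imputation: fill each gap with the observed mean of that time point
mean_imputed = scores.fillna(scores.mean())

print("participants kept (listwise):", len(listwise), "of", len(scores))
print("t2 std, observed values only:", scores["t2"].std())
print("t2 std, after mean imputation:", mean_imputed["t2"].std())
```

Listwise deletion shrinks the sample, while mean imputation keeps every participant but reduces the spread of the imputed column, which is one way a sensitivity analysis can reveal how much the conclusions depend on the method.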

Ques 8:

Ans: Post-hoc tests are used after ANOVA to determine which specific groups differ significantly from each other. Some common post-hoc tests used after ANOVA include:

Tukey's Honestly Significant Difference (HSD) test: This test is used to determine the pairwise differences between all groups. It is appropriate when there are no prior hypotheses about which specific groups differ.

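Bonferroni correction: This procedure controls the family-wise error rate by multiplying each p-value by the number of comparisons (equivalently, dividing the significance level by that number). It is simple and general, but it becomes conservative as the number of comparisons grows, so it is best suited to a small set of planned comparisons.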

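Scheffe's test: This test covers all possible contrasts among the group means, including complex ones such as the average of two groups versus a third, and not only pairwise differences. It is appropriate for exploring contrasts that were not planned in advance, at the cost of being more conservative than Tukey's HSD for purely pairwise comparisons.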

Games-Howell test: This test is used when the assumption of equal variances is violated. It is appropriate when there are many pairwise comparisons, and the groups have unequal variances.

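Duncan's multiple range test: This is a stepwise procedure that compares all pairs of ranked group means. It has more power than Tukey's HSD but offers weaker control of the family-wise error rate. When there is a specific control group and the interest is in comparing every other group to it, Dunnett's test is the usual choice.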

Student-Newman-Keuls test (SNK): This stepwise procedure compares ranked group means and assumes equal variances. It is more powerful than Tukey's HSD, but it does not fully control the family-wise error rate when there are more than three groups.

An example of a situation where a post-hoc test might be necessary is when a researcher wants to compare the mean scores of three different treatments (e.g., drug A, drug B, and a placebo) on a dependent variable (e.g., blood pressure). After conducting an ANOVA and finding a significant main effect for the treatments, a post-hoc test can be conducted to determine which specific treatments differ significantly from each other. For instance, Tukey's HSD test can be used to compare all pairwise differences between the treatments. This can help identify which treatment is more effective than the others in reducing blood pressure.
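
A post-hoc step for the three-treatment scenario could look like the sketch below. It uses Bonferroni-adjusted pairwise t-tests rather than Tukey's HSD, because they need only `scipy.stats.ttest_ind`; the blood pressure reductions are hypothetical values invented for illustration, and in practice this step would follow a significant omnibus ANOVA.

```python
from itertools import combinations
from scipy import stats

# Hypothetical reductions in blood pressure (mmHg); values are illustrative only
treatments = {
    "drug_A": [12, 15, 11, 14, 13, 16, 12, 15],
    "drug_B": [9, 10, 8, 11, 9, 12, 10, 9],
    "placebo": [3, 5, 4, 2, 6, 4, 3, 5],
}

pairs = list(combinations(treatments, 2))  # every pair of treatments
alpha = 0.05  # assumed family-wise significance level

adjusted = {}
for first, second in pairs:
    p_raw = stats.ttest_ind(treatments[first], treatments[second]).pvalue
    # Bonferroni: multiply each raw p-value by the number of comparisons, capped at 1
    adjusted[(first, second)] = min(p_raw * len(pairs), 1.0)

for (first, second), p_adj in adjusted.items():
    verdict = "differ" if p_adj < alpha else "no evidence of a difference"
    print(f"{first} vs {second}: adjusted p = {p_adj:.4f} ({verdict})")
```

With only three comparisons the Bonferroni adjustment costs little power; with many groups, Tukey's HSD would usually be the less conservative choice for all pairwise comparisons.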

Ques 9:

Ans: 

In [4]:
import scipy.stats as stats

# Define the weight loss data for each diet
diet_a = [5, 6, 7, 8, 3, 2, 4, 9, 8, 6, 7, 5, 3, 2, 1, 2, 3, 4, 5, 6,
          7, 8, 9, 10, 11, 5, 7, 6, 4, 8, 9, 6, 4, 2, 3, 4, 5, 6, 7, 8,
          9, 10, 3, 4, 5, 6, 7, 8, 9, 10]

diet_b = [7, 6, 5, 8, 9, 6, 4, 2, 3, 5, 7, 8, 9, 6, 5, 4, 7, 6, 5, 8,
          9, 7, 6, 5, 4, 3, 2, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10,
          3, 4, 5, 6, 7, 8, 9, 10, 11]

diet_c = [4, 5, 6, 7, 8, 9, 10, 11, 4, 5, 6, 7, 8, 9, 10, 11, 4, 5, 6,
          7, 8, 9, 10, 11, 4, 5, 6, 7, 8, 9, 10, 11, 4, 5, 6, 7, 8, 9, 10,
          11, 4, 5, 6, 7, 8, 9, 10, 11, 4, 5]

# Conduct a one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_a, diet_b, diet_c)

# Report the results
print("F-statistic: ", f_statistic)
print("p-value: ", p_value)
if p_value < 0.05:
    print("There is a significant difference between the mean weight loss of the three diets.")
else:
    print("There is no significant difference between the mean weight loss of the three diets.")

F-statistic:  5.37206924256811
p-value:  0.005600949528974236
There is a significant difference between the mean weight loss of the three diets.


In [5]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Define the data
software_programs = ['A', 'B', 'C'] * 20
employee_experience = ['Novice'] * 30 + ['Experienced'] * 30
time_to_complete_task = [10, 12, 11, 14, 15, 13, 16, 18, 17, 12, 10, 11, 13, 14, 12, 10, 11, 12, 14, 15,
                         16, 14, 12, 13, 15, 16, 18, 12, 10, 11, 15, 16, 17, 11, 10, 12, 14, 15, 18, 19,
                         17, 16, 15, 14, 12, 13, 11, 12, 10, 13, 14, 17, 18, 19, 16, 13, 14, 12, 11, 10]

# Create a pandas DataFrame
data = pd.DataFrame({'software_programs': software_programs,
                     'employee_experience': employee_experience,
                     'time_to_complete_task': time_to_complete_task})

# Conduct a two-way ANOVA
model = ols('time_to_complete_task ~ C(software_programs) + C(employee_experience) + C(software_programs):C(employee_experience)', data=data).fit()
table = sm.stats.anova_lm(model, typ=2)

# Report the results
print(table)

                                                 sum_sq    df         F  \
C(software_programs)                           1.433333   2.0  0.101815   
C(employee_experience)                        12.150000   1.0  1.726125   
C(software_programs):C(employee_experience)    1.300000   2.0  0.092344   
Residual                                     380.100000  54.0       NaN   

                                               PR(>F)  
C(software_programs)                         0.903369  
C(employee_experience)                       0.194461  
C(software_programs):C(employee_experience)  0.911935  
Residual                                          NaN  


In [8]:
from scipy.stats import ttest_ind

# Define the data
control_group = [80, 70, 85, 90, 75, 95, 85, 80, 70, 75, 85, 90, 70, 75, 85, 80, 70, 90, 75, 85,
                 80, 70, 75, 85, 90, 75, 85, 80, 70, 75, 85, 80, 70, 90, 75, 85, 80, 70, 75, 85,
                 80, 70, 75, 85, 90, 75, 85, 80, 70, 75, 85, 80, 70]
experimental_group = [90, 85, 95, 80, 75, 90, 85, 80, 75, 90, 85, 80, 75, 90, 85, 80, 75, 90, 85, 80,
                      75, 90, 85, 80, 75, 90, 85, 80, 75, 90, 85, 80, 75, 90, 85, 80, 75, 90, 85, 80,
                      75, 90, 85, 80, 75, 90, 85, 80, 75, 90, 85, 80]

# Conduct a two-sample t-test
t_stat, p_value = ttest_ind(control_group, experimental_group)

# Report the results
print('t-statistic:', t_stat)
print('p-value:', p_value)

t-statistic: -2.752033547369225
p-value: 0.006999891505388277


In [13]:
import pandas as pd
import scipy.stats as stats

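# create a DataFrame with the sales data
# NOTE: the `...` entries stand in for sales records that are not reproduced here.
# As written, Python stores them as literal Ellipsis objects and store 'B' has no rows,
# which would explain the nan results below; supply the full data before running.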
sales_data = pd.DataFrame({
    'Store': ['A', 'A', 'A', ..., 'C', 'C', 'C'],
    'Sales': [100, 120, 80, ..., 90, 110, 95]
})

# conduct a one-way ANOVA
f_statistic, p_value = stats.f_oneway(
    sales_data.loc[sales_data['Store'] == 'A', 'Sales'],
    sales_data.loc[sales_data['Store'] == 'B', 'Sales'],
    sales_data.loc[sales_data['Store'] == 'C', 'Sales']
)

print('F-statistic:', f_statistic)
print('p-value:', p_value)

F-statistic: nan
p-value: nan
