#1
Assumptions of ANOVA:

To use ANOVA (Analysis of Variance), certain assumptions must be met to ensure the validity of the results. These assumptions are:

Normality: The residuals (errors) should be normally distributed for each group.
Independence: The observations should be independent of each other.
Homogeneity of Variance: The variance of the residuals should be equal across all groups.
Random Sampling: The samples should be randomly selected from the population.

Violations of Assumptions:

If these assumptions are not met, the results of the ANOVA may be invalid or misleading. Here are some examples of violations:

1. Non-Normality:

Example: A dataset with outliers or skewed distributions.
Impact: Non-normality can distort p-values, especially with small samples; with large, balanced samples the F-test is fairly robust to it.
2. Non-Independence:

Example: Measurements taken from the same individual over time (e.g., repeated measures design).
Impact: Non-independence can lead to inflated Type I error rates and incorrect conclusions.
3. Heterogeneity of Variance:

Example: A dataset with unequal variances across groups (e.g., one group has much larger variance than others).
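Impact: Heterogeneity of variance can push the Type I error rate above or below the nominal level, particularly when group sizes are unequal; Welch's ANOVA is a common alternative in that case.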
4. Non-Random Sampling:

Example: A convenience sample or a sample selected based on a specific criterion (e.g., only including people who responded to a survey).
Impact: Non-random sampling can lead to biased results and incorrect conclusions.
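
The normality and homogeneity-of-variance assumptions listed above can be checked before running ANOVA. The following is a minimal sketch, assuming simulated data (the three groups and their parameters are hypothetical, not taken from any dataset in this notebook). It uses the Shapiro-Wilk test on the residuals and Levene's test across groups, both from scipy.stats.

```python
import numpy as np
from scipy import stats

# ASSUMPTION: hypothetical data, three groups simulated with equal spread
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=2.0, size=30) for mu in (5.0, 6.0, 7.0)]

# Normality: Shapiro-Wilk test on the residuals (each value minus its group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
# A p-value below 0.05 would suggest the corresponding assumption is violated
```

Independence and random sampling cannot be tested this way; they depend on how the study was designed and how the data were collected.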

#2
Three Types of ANOVA:

One-Way ANOVA:
Also known as Single-Factor ANOVA
Compares the means of three or more groups based on a single independent variable (factor)
Example: Comparing the average scores of students from three different schools
Two-Way ANOVA:
Also known as Two-Factor ANOVA
Compares the means of groups based on two independent variables (factors)
Example: Comparing the average scores of students from three different schools and two different teaching methods
Repeated-Measures ANOVA:
Also known as Within-Subjects ANOVA
Compares means across conditions or time points measured on the same subjects
Example: Comparing the average scores of students before and after a training program

When to Use Each:

One-Way ANOVA:
Use when you have a single independent variable with three or more levels (groups)
Use when you want to compare the means of different groups based on a single factor
Example: Comparing the average salaries of employees from different departments
Two-Way ANOVA:
Use when you have two independent variables with two or more levels each
Use when you want to examine the main effect of each factor and the interaction between the two factors on the dependent variable
Example: Comparing the average sales of different products in different regions
Repeated-Measures ANOVA:
Use when you have multiple measurements from the same subjects (e.g., before and after a treatment)
Use when you want to examine the effect of a treatment or intervention over time
Example: Comparing the average blood pressure of patients before and after a medication

#3


In ANOVA, the partitioning of variance refers to dividing the total variance in the dependent variable into components, each attributed to a specific source of variation. This is a fundamental concept in ANOVA, as it allows us to quantify the amount of variation explained by each factor and to determine the significance of the effects.

Understanding the partitioning of variance is crucial in ANOVA because it allows us to:

Identify the Sources of Variation: By partitioning the variance, we can determine which factors contribute to the variation in the dependent variable and to what extent.
Quantify the Effect Size: The partitioning of variance enables us to calculate effect sizes such as eta-squared, the proportion of the total sum of squares attributable to the independent variable (SS between / SS total).
Determine the Significance of the Effects: By comparing the between-groups variance to the within-groups variance, we can determine whether the differences between the groups are statistically significant.
Interpret the Results: Understanding the partitioning of variance is essential for interpreting the results of ANOVA, including the F-statistic, p-value, and eta-squared.
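
As a minimal sketch of how the partition feeds the F-statistic and eta-squared, the snippet below splits the total sum of squares into between-group and within-group parts and compares the hand-computed F with scipy's f_oneway. The data and variable names are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np
from scipy.stats import f_oneway

# ASSUMPTION: hypothetical data, three groups of three observations each
groups = [np.array([4.0, 5.0, 6.0]),
          np.array([7.0, 8.0, 9.0]),
          np.array([10.0, 11.0, 12.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Partition the total sum of squares into between-group and within-group parts
ss_total = np.sum((all_values - grand_mean) ** 2)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Degrees of freedom and the F-statistic (ratio of mean squares)
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

# Eta-squared: share of the total variation explained by group membership
eta_squared = ss_between / ss_total

# Cross-check against scipy's one-way ANOVA
f_scipy, p_value = f_oneway(*groups)

print(f"SS total = {ss_total:.1f}, SS between = {ss_between:.1f}, SS within = {ss_within:.1f}")
print(f"F (manual) = {f_manual:.2f}, F (scipy) = {f_scipy:.2f}, eta-squared = {eta_squared:.2f}")
```

Here SS total (60) equals SS between (54) plus SS within (6), so the F-statistic is (54 / 2) / (6 / 6) = 27 and eta-squared is 0.9.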

In [2]:
#4
import numpy as np
from scipy.stats import f_oneway

# Sample data: three groups (A, B, C) with three observations each
groups = ['A', 'B', 'C']
values = np.array([23, 21, 19, 24, 20, 18, 25, 22, 17])
n_per_group = len(values) // len(groups)

# Calculate means for each group
means = [np.mean(values[i:i+n_per_group]) for i in range(0, len(values), n_per_group)]

# Calculate SST (Total Sum of Squares)
sst = np.sum((values - np.mean(values)) ** 2)

# Calculate SSE (Explained Sum of Squares), weighting each group by its size
sse = np.sum([n_per_group * (means[i] - np.mean(values)) ** 2 for i in range(len(groups))])

# Calculate SSR (Residual Sum of Squares)
ssr = sst - sse

print(f"SST: {sst:.4f}")
print(f"SSE: {sse:.4f}")
print(f"SSR: {ssr:.4f}")

SST: 60.0000
SSE: 0.6667
SSR: 59.3333


In [4]:
#5
import pandas as pd
from scipy.stats import f_oneway

# Sample data
data = pd.DataFrame({
    'A': ['Low', 'Low', 'Low', 'High', 'High', 'High'],
    'B': ['Small', 'Medium', 'Large', 'Small', 'Medium', 'Large'],
    'Value': [10, 12, 15, 18, 20, 22]
})

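# Calculate F-statistics for each factor with separate one-way tests.
# Note: this only approximates the main effects, and with a single observation
# per A x B cell the interaction F-statistic is undefined (nan); a full two-way
# model with replicated cells is needed to test the interaction properly.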
main_effect_A = f_oneway(*[data[data['A'] == level]['Value'] for level in data['A'].unique()])[0]
main_effect_B = f_oneway(*[data[data['B'] == level]['Value'] for level in data['B'].unique()])[0]
interaction_AB = f_oneway(*[data[(data['A'] == level_A) & (data['B'] == level_B)]['Value'] for level_A in data['A'].unique() for level_B in data['B'].unique()])[0]

print("Main Effect A:", main_effect_A)
print("Main Effect B:", main_effect_B)
print("Interaction A:B:", interaction_AB)

Main Effect A: 17.064516129032256
Main Effect B: 0.34463276836158196
Interaction A:B: nan




#6
With an F-statistic of 5.23 and a p-value of 0.02, we can conclude that at least one group mean differs significantly from the others at the 0.05 level.


Here's a step-by-step interpretation:

Reject the null hypothesis: Since the p-value (0.02) is less than the typical significance level (0.05), we reject the null hypothesis that all group means are equal.

Alternative hypothesis: The data support the alternative hypothesis that at least one group mean differs from the others.

Significant differences: The F-statistic (5.23) means the between-group mean square is about 5.23 times the within-group mean square, a ratio that would be unlikely if all group means were equal.

Post-hoc analysis: The F-test does not show which groups differ. To identify the specific pairs, we would need to perform a post-hoc analysis, such as Tukey's HSD or Scheffé's method.
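
The decision rule above can be sketched in code. The text gives F = 5.23 and p = 0.02 but not the degrees of freedom, so the values below (3 groups, 18 observations, hence df = 2 and 15) are an assumption made purely for illustration. The p-value is the upper-tail probability of the F distribution, obtained here with scipy.stats.f.sf.

```python
from scipy import stats

f_statistic = 5.23  # F-statistic quoted in the text
alpha = 0.05        # conventional significance level

# ASSUMPTION: degrees of freedom are not given in the text; 3 groups and
# 18 observations (df = 2 and 15) are hypothetical values for illustration
df_between, df_within = 2, 15

# p-value: probability of an F at least this large if all group means are equal
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value: the F-statistic must exceed this to reject at level alpha
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

reject_null = p_value < alpha
print(f"p-value: {p_value:.4f}")
print(f"Critical F at alpha={alpha}: {f_critical:.2f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

Comparing the p-value with alpha and comparing the F-statistic with the critical value are equivalent ways of making the same decision.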

#7
Listwise deletion: Remove entire rows with missing values.

Pairwise deletion: Exclude a case only from the analyses that involve the variable it is missing, so each analysis uses all the data available for it.

Mean/median imputation: Replace missing values with the mean or median of the respective column.

Interpolation: Fill missing values using interpolation methods (e.g., linear, polynomial); this suits ordered data such as time series.

Multiple imputation: Create several completed datasets by imputing plausible values (e.g., with regression or Bayesian models), analyze each one, and pool the results.
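
A minimal sketch of three of these strategies in pandas follows, assuming a small hypothetical DataFrame (the column names group and score are illustrative only). Multiple imputation is omitted because it needs a dedicated modeling step.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: hypothetical data with two missing scores
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'C', 'C'],
    'score': [10.0, np.nan, 14.0, 16.0, np.nan, 20.0],
})

# Listwise deletion: drop every row that contains a missing value
listwise = df.dropna()

# Mean imputation: replace missing scores with the column mean
mean_imputed = df.assign(score=df['score'].fillna(df['score'].mean()))

# Linear interpolation: fill each gap from its neighbours
# (only meaningful when the row order carries information, e.g. time)
interpolated = df.assign(score=df['score'].interpolate(method='linear'))

print(listwise)
print(mean_imputed)
print(interpolated)
```

Listwise deletion shrinks the sample from six rows to four, while mean imputation keeps every row but understates the variance, which can affect the F-test.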

#8

In Python, you can use the statsmodels library to perform post-hoc tests. Here are some common post-hoc tests and when to use each one:

Tukey's HSD (Honestly Significant Difference):
Use when: You want to compare all possible pairs of means to determine which groups are significantly different from each other.
Example: Comparing the means of three different fertilizers on plant growth.

Scheffé's method:
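Use when: You want to test complex contrasts among group means (not only pairwise comparisons); for purely pairwise comparisons it is more conservative than Tukey's HSD.
Example: Comparing the average of two traditional teaching methods against the average of three new methods on student performance.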

Dunnett's test:
Use when: You want to compare each group to a control group or a reference group.
Example: Comparing the means of three different medications to a placebo group.

Bonferroni correction:
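Use when: You want to adjust for multiple comparisons to control the familywise Type I error rate; it works with any set of tests but becomes conservative as the number of comparisons grows.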
Example: Comparing the means of four different groups, and you want to adjust for the multiple comparisons.
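
As a minimal sketch of the Bonferroni correction, the snippet below runs all pairwise t-tests among four groups and multiplies each p-value by the number of comparisons, capping it at 1. The data are simulated and the group names are hypothetical. The adjustment is written out by hand with scipy so that the arithmetic is visible; statsmodels offers the same adjustment through statsmodels.stats.multitest.multipletests with method='bonferroni'.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# ASSUMPTION: hypothetical data, four groups simulated with different means
rng = np.random.default_rng(1)
samples = {
    'G1': rng.normal(loc=10.0, scale=2.0, size=20),
    'G2': rng.normal(loc=11.0, scale=2.0, size=20),
    'G3': rng.normal(loc=13.0, scale=2.0, size=20),
    'G4': rng.normal(loc=10.5, scale=2.0, size=20),
}

# All pairwise comparisons: 4 groups give 6 pairs
pairs = list(combinations(samples, 2))
raw_p = [stats.ttest_ind(samples[a], samples[b]).pvalue for a, b in pairs]

# Bonferroni: multiply each p-value by the number of tests, capped at 1
n_tests = len(raw_p)
adjusted_p = [min(p * n_tests, 1.0) for p in raw_p]

for (a, b), p, p_adj in zip(pairs, raw_p, adjusted_p):
    print(f"{a} vs {b}: raw p = {p:.4f}, Bonferroni p = {p_adj:.4f}, reject = {p_adj < 0.05}")
```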

In [1]:
#9

import numpy as np
import pandas as pd
from scipy import stats

# Seed for reproducibility
np.random.seed(0)

# Simulate weight loss data for 50 participants in three diet groups
group_A = np.random.normal(loc=5, scale=2, size=50//3)
group_B = np.random.normal(loc=6, scale=2, size=50//3)
group_C = np.random.normal(loc=7, scale=2, size=50//3 + 50%3)  # Handle the remainder participants

# Combine data into a DataFrame for clarity
data = pd.DataFrame({
    'Weight_Loss': np.concatenate([group_A, group_B, group_C]),
    'Diet': ['A'] * len(group_A) + ['B'] * len(group_B) + ['C'] * len(group_C)
})

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(group_A, group_B, group_C)

print(f"F-statistic: {f_statistic}")
print(f"p-value: {p_value}")


F-statistic: 0.021473972469966216
p-value: 0.9787645487792678


In [10]:
#10
import pandas as pd
import numpy as np

# Generate sample data
np.random.seed(42)
n = 30  # Number of employees

# Simulate employee data
employees = pd.DataFrame({
    'time': np.random.normal(loc=50, scale=10, size=n),  # Normally distributed task completion times
    'program': np.random.choice(['A', 'B', 'C'], size=n),  # Randomly assign software programs
    'experience': np.random.choice(['novice', 'experienced'], size=n)  # Randomly assign experience level
})

# Adjust times for different programs and experience levels
for idx, row in employees.iterrows():
    if row['program'] == 'A':
        employees.at[idx, 'time'] += 5  # Program A is slower on average
    elif row['program'] == 'B':
        employees.at[idx, 'time'] -= 5  # Program B is faster on average
    if row['experience'] == 'experienced':
        employees.at[idx, 'time'] -= 5  # Experienced employees are faster on average

# Fit a two-way ANOVA model with an interaction term
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Define the model
model = ols('time ~ C(program) + C(experience) + C(program):C(experience)', data=employees).fit()

# Perform ANOVA
anova_results = anova_lm(model, typ=2)

# Display the results
print(anova_results)


                               sum_sq    df         F    PR(>F)
C(program)                 653.740216   2.0  5.021757  0.015070
C(experience)              133.255682   1.0  2.047228  0.165376
C(program):C(experience)   161.994513   2.0  1.244374  0.306054
Residual                  1562.178803  24.0       NaN       NaN


In [2]:
#11
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Sample data (replace with your actual data)
control_group = np.array([85, 78, 92, 88, 76, 95, 89, 91, 84, 90, 87, 93, 96, 82, 80, 94, 86, 98, 81, 83])
experimental_group = np.array([90, 92, 95, 88, 91, 96, 89, 94, 93, 97, 92, 90, 98, 91, 95, 99, 94, 96, 93, 92])

# Two-sample t-test
t_stat, p_val = stats.ttest_ind(control_group, experimental_group)
print("Two-sample t-test results:")
print(f"t-statistic: {t_stat:.2f}")
print(f"p-value: {p_val:.4f}")

# If p-value is significant (e.g., < 0.05), proceed with post-hoc test.
# Note: with only two groups, Tukey's HSD repeats the t-test comparison.
if p_val < 0.05:
    print("\nPost-hoc test results:")
    # Create a dataframe with the data
    df = pd.DataFrame({'Group': ['Control'] * len(control_group) + ['Experimental'] * len(experimental_group),
                       'Score': np.concatenate((control_group, experimental_group))})

    # Perform Tukey's HSD test
    tukey_results = pairwise_tukeyhsd(endog=df['Score'], groups=df['Group'], alpha=0.05)
    print(tukey_results)

Two-sample t-test results:
t-statistic: -3.76
p-value: 0.0006

Post-hoc test results:
   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj  lower  upper  reject
---------------------------------------------------------
Control Experimental     5.85 0.0006 2.7026 8.9974   True
---------------------------------------------------------


In [4]:
#12
import pandas as pd
import numpy as np
from statsmodels.stats.anova import AnovaRM
import statsmodels.stats.multicomp as multi

# Generate synthetic sales data
np.random.seed(0)

days = np.arange(1, 31)
sales_A = np.random.normal(loc=200, scale=20, size=30)
sales_B = np.random.normal(loc=220, scale=20, size=30)
sales_C = np.random.normal(loc=210, scale=20, size=30)

data = {
    'Day': np.tile(days, 3),
    'Store': np.repeat(['Store_A', 'Store_B', 'Store_C'], 30),
    'Sales': np.concatenate([sales_A, sales_B, sales_C])
}

df = pd.DataFrame(data)

# Perform repeated measures ANOVA
aovrm = AnovaRM(df, 'Sales', 'Day', within=['Store'])
res = aovrm.fit()
anova_results = res.summary()

# Perform post-hoc test if ANOVA is significant
post_hoc_results = None
if res.anova_table['Pr > F'].iloc[0] < 0.05:
    mc = multi.MultiComparison(df['Sales'], df['Store'])
    post_hoc_results = mc.tukeyhsd()

print(anova_results)
if post_hoc_results is not None:
    print(post_hoc_results)


               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store  0.9395 2.0000 58.0000 0.3967

