# Assignment (13th March) : Statistics Assignments - 6

### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

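**ANS:** ANOVA relies on the following assumptions:
- Independence of observations: each observation is unrelated to every other observation.
- Homogeneity of variances: the groups have approximately equal variances.
- Normality: the values within each group (equivalently, the residuals) are approximately normally distributed.

Violations that can undermine the validity of the results: unequal variances (heteroscedasticity), non-normal distributions (for example, strongly skewed data), and correlated samples (for example, repeated measurements on the same subjects treated as independent).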
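
As a rough illustration of how two of these assumptions might be checked, the sketch below runs a Shapiro-Wilk test per group and Levene's test across groups with `scipy.stats`. The data and the `groups` dictionary are hypothetical, made up for this example. Independence is a property of the study design and cannot be confirmed from the numbers alone.

```python
from scipy.stats import levene, shapiro

# Hypothetical data for three independent groups (illustration only)
groups = {
    "g1": [20, 23, 21, 24, 22, 25, 19, 23],
    "g2": [30, 29, 31, 28, 32, 27, 33, 30],
    "g3": [25, 27, 26, 24, 28, 26, 23, 29],
}

# Normality: Shapiro-Wilk test within each group (H0: the sample is normal)
normality_p = {name: shapiro(values).pvalue for name, values in groups.items()}

# Homogeneity of variances: Levene's test across groups (H0: equal variances)
levene_stat, levene_p = levene(*groups.values())

for name, p in normality_p.items():
    print(f"{name}: Shapiro-Wilk p = {p:.3f}")
print(f"Levene p = {levene_p:.3f}")
```

A small p-value in either test is evidence against the corresponding assumption. With small samples these tests have little power, so plots of the residuals are usually worth checking as well.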

### Q2. What are the three types of ANOVA, and in what situations would each be used?

**ANS:** Types of ANOVA are as follows:

1. `One-way ANOVA:` Used for comparing means of three or more groups based on one factor.
2. `Two-way ANOVA:` Used for comparing means based on two factors, allowing interaction effect analysis.
3. `Repeated Measures ANOVA:` Used for comparing means when the same subjects are used for each treatment.

### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

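**ANS:** Partitioning of variance in ANOVA:

- The total variability (total sum of squares) is split into variability between groups and variability within groups.
- This matters because the F-statistic compares the two parts: it shows how much of the variability is explained by the factor being tested relative to the unexplained variability within groups.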
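
Written out for $k$ groups with $n_j$ observations in group $j$, the partition reads as follows. The symbols $x_{ij}$, $\bar{x}_j$ and $\bar{x}$ (observation, group mean, grand mean) are notation chosen for this note, not taken from the assignment.

$$
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}\right)^2}_{SS_{\text{total}}}
=
\underbrace{\sum_{j=1}^{k} n_j\left(\bar{x}_j-\bar{x}\right)^2}_{SS_{\text{between}}}
+
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}_j\right)^2}_{SS_{\text{within}}}
$$

In the naming used in Q4, the left-hand side is SST, the first term on the right is SSE (explained) and the second is SSR (residual).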

### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

```python
import numpy as np

# Sample data for three groups
group1 = [20, 23, 21, 24, 22]
group2 = [30, 29, 31, 28, 32]
group3 = [25, 27, 26, 24, 28]

# Combine all data into a single NumPy array so the arithmetic is element-wise
data = np.array(group1 + group2 + group3)

# Grand mean
grand_mean = np.mean(data)

# Total Sum of Squares (SST)
SST = np.sum((data - grand_mean) ** 2)

# Sum of Squares Between groups (SSE, "explained")
group_means = [np.mean(group1), np.mean(group2), np.mean(group3)]
n1, n2, n3 = len(group1), len(group2), len(group3)
SSE = (n1 * (group_means[0] - grand_mean) ** 2
       + n2 * (group_means[1] - grand_mean) ** 2
       + n3 * (group_means[2] - grand_mean) ** 2)

# Sum of Squares Within groups (SSR, "residual")
SSR = SST - SSE

print(f'SST: {SST}, SSE: {SSE}, SSR: {SSR}')
```

```
SST: 190.0, SSE: 160.0, SSR: 30.0
```
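
The sums of squares lead directly to the F-statistic. The sketch below is a cross-check rather than part of the original answer. It recomputes the between and within sums of squares independently for the same three groups, divides each by its degrees of freedom, and compares the result with `scipy.stats.f_oneway`. The variable names are chosen for this example.

```python
import numpy as np
from scipy.stats import f, f_oneway

# Same three groups as in Q4
group1 = [20, 23, 21, 24, 22]
group2 = [30, 29, 31, 28, 32]
group3 = [25, 27, 26, 24, 28]
groups = [np.array(g, dtype=float) for g in (group1, group2, group3)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group and within-group sums of squares, each computed directly
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom: k - 1 between groups, N - k within groups
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of the two mean squares
F_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = f.sf(F_manual, df_between, df_within)

# Reference values from SciPy
F_scipy, p_scipy = f_oneway(*groups)

print(f'manual: F = {F_manual:.4f}, p = {p_manual:.6f}')
print(f'scipy:  F = {F_scipy:.4f}, p = {p_scipy:.6f}')
```

Here the mean squares are 160 / 2 = 80 and 30 / 12 = 2.5, so F = 32.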


### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data
data = pd.DataFrame({
    'response': [20, 21, 19, 22, 30, 29, 31, 28, 25, 27, 24, 26],
    'factor1': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'factor2': ['X', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y']
})

# Fit the model: two main effects plus their interaction
model = ols('response ~ C(factor1) + C(factor2) + C(factor1):C(factor2)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)
```

```
                           sum_sq   df          F    PR(>F)
C(factor1)             162.666667  2.0  34.857143  0.000498
C(factor2)               0.333333  1.0   0.142857  0.718467
C(factor1):C(factor2)    0.666667  2.0   0.142857  0.869741
Residual                14.000000  6.0        NaN       NaN
```
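
To see where the main effects and the interaction come from, it can help to look at the cell means and marginal means directly. The sketch below rebuilds the same sample data and tabulates them with pandas. The names `cell_means`, `factor1_means`, `factor2_means` and `y_minus_x` are chosen for this example.

```python
import pandas as pd

# Same sample data as in Q5
data = pd.DataFrame({
    'response': [20, 21, 19, 22, 30, 29, 31, 28, 25, 27, 24, 26],
    'factor1': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'factor2': ['X', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y']
})

# Mean response in each factor1 x factor2 cell (rows: factor1, columns: factor2)
cell_means = data.groupby(['factor1', 'factor2'])['response'].mean().unstack()

# Marginal means: differences among these drive the main effects
factor1_means = data.groupby('factor1')['response'].mean()
factor2_means = data.groupby('factor2')['response'].mean()

# Effect of factor2 (Y minus X) at each level of factor1:
# if these differences are nearly equal, there is little interaction
y_minus_x = cell_means['Y'] - cell_means['X']

print(cell_means)
print(factor1_means)
print(factor2_means)
print(y_minus_x)
```

The factor1 means differ by several units while the factor2 means are almost identical, and the Y minus X difference barely changes across levels of factor1. That pattern matches a large factor1 effect with negligible factor2 and interaction effects in the ANOVA table.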


### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

**ANS:** Interpreting One-way ANOVA Results:

- F-statistic = 5.23, p-value = 0.02
- Conclusion: `Significant differences exist between groups`.
- Interpretation: At 5% significance level, we reject the null hypothesis, indicating at least one group mean is different.
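
The decision rule can be sketched numerically. The question does not state the degrees of freedom, so the values below (2 and 14, i.e. three groups and 17 observations) are purely an assumption chosen for illustration. They happen to reproduce a p-value of about 0.02 for F = 5.23.

```python
from scipy.stats import f

F_observed = 5.23
alpha = 0.05

# Hypothetical degrees of freedom (not given in the question)
df_between = 2
df_within = 14

# p-value: probability of an F at least this large if all group means are equal
p_value = f.sf(F_observed, df_between, df_within)

# Critical value: the F above which H0 is rejected at the chosen alpha
f_critical = f.ppf(1 - alpha, df_between, df_within)

reject_null = p_value < alpha
print(f'p = {p_value:.4f}, critical F = {f_critical:.3f}, reject H0: {reject_null}')
```

Comparing the p-value with alpha and comparing F with the critical value are two views of the same decision, so they always agree.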

### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

**ANS:** Handling Missing Data in Repeated Measures ANOVA:

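- `Listwise deletion`: drops every subject with any missing measurement. It is simple, but it reduces the sample size and statistical power, and it biases the results unless the data are missing completely at random.
- `Pairwise deletion`: uses all available observations for each comparison. It keeps more data, but different comparisons rest on different subsets of subjects, which can make the results inconsistent.
- `Mean substitution`: replaces a missing value with the mean of that condition. It keeps every subject, but it artificially shrinks the variance and can inflate the Type I error rate.
- `Imputation` (for example, multiple imputation): estimates missing values from the observed data. It preserves the sample size, but the results depend on how well the imputation model fits.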
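
A small sketch of two of these methods with pandas, using a hypothetical wide-format table (one row per subject, one column per measurement occasion). All data and names here are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures: rows are subjects, columns are occasions
wide = pd.DataFrame(
    {
        't1': [10.0, 12.0, np.nan, 11.0, 13.0, 9.0],
        't2': [11.0, np.nan, 14.0, 12.0, 15.0, 10.0],
        't3': [13.0, 15.0, 16.0, np.nan, 17.0, 12.0],
    },
    index=pd.Index(range(1, 7), name='subject'),
)

# Listwise deletion: keep only subjects with complete data
listwise = wide.dropna()

# Mean substitution: fill each gap with that occasion's observed mean
mean_filled = wide.fillna(wide.mean())

print(f'subjects before: {len(wide)}, after listwise deletion: {len(listwise)}')
print('variance of observed values:')
print(wide.var())
print('variance after mean substitution:')
print(mean_filled.var())
```

Listwise deletion halves the sample here, while mean substitution keeps all six subjects but lowers the variance of every column, which illustrates the power and Type I error concerns mentioned above.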

### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

**ANS:** Some of the common Post-hoc Tests used after ANOVA are as follows:

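- Tukey's HSD: controls the family-wise Type I error rate; used when comparing all possible pairs of group means.
- Bonferroni: more conservative; divides the significance level by the number of comparisons, and suits a small number of planned comparisons.
- Scheffé: flexible but conservative; used for complex comparisons (contrasts involving more than two means).
- `Example`: after a significant ANOVA on three or more groups, use Tukey's HSD to find which specific pairs of groups differ.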
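
As a sketch of the Bonferroni approach, the code below runs all pairwise independent-samples t-tests on the three groups from Q4 and adjusts the p-values with `multipletests` from statsmodels. Choosing plain t-tests as the pairwise test is an assumption made for this example.

```python
from itertools import combinations

from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# The three groups used in Q4
groups = {
    'group1': [20, 23, 21, 24, 22],
    'group2': [30, 29, 31, 28, 32],
    'group3': [25, 27, 26, 24, 28],
}

# Every unordered pair of groups
pairs = list(combinations(groups, 2))

# Unadjusted p-value for each pairwise comparison
raw_p = [ttest_ind(groups[a], groups[b]).pvalue for a, b in pairs]

# Bonferroni adjustment: each p-value is multiplied by the number of tests (capped at 1)
reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method='bonferroni')

for (a, b), p_raw, p_adj, rej in zip(pairs, raw_p, adjusted_p, reject):
    print(f'{a} vs {b}: raw p = {p_raw:.5f}, adjusted p = {p_adj:.5f}, reject = {rej}')
```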

### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

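```python
from scipy.stats import f_oneway

# Illustrative sample data for three diets
# (5 values per diet; the 50-participant data set from the question is not provided)
diet_A = [2.5, 3.0, 3.2, 2.8, 3.1]
diet_B = [1.8, 2.0, 2.3, 1.9, 2.1]
diet_C = [3.5, 3.7, 3.8, 3.6, 3.9]

# Perform one-way ANOVA
F_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)

print(f'F-statistic: {F_statistic}, p-value: {p_value}')
```

```
F-statistic: 76.2733812949639, p-value: 1.504358775256285e-07
```

**Interpretation:** F ≈ 76.27 with p ≈ 1.5e-07, far below 0.05, so we reject the null hypothesis of equal mean weight loss: at least one diet differs from the others. ANOVA alone does not show which diets differ; that requires a post-hoc test.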
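
Since the ANOVA is significant, a post-hoc comparison is a natural follow-up. The sketch below applies Tukey's HSD to the same illustrative diet data using `pairwise_tukeyhsd` from statsmodels. It is one possible follow-up, not the only valid one.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Same illustrative diet data as in the ANOVA for Q9
diet_A = [2.5, 3.0, 3.2, 2.8, 3.1]
diet_B = [1.8, 2.0, 2.3, 1.9, 2.1]
diet_C = [3.5, 3.7, 3.8, 3.6, 3.9]

# Long format: one value per observation plus its group label
values = np.array(diet_A + diet_B + diet_C)
labels = np.array(['A'] * len(diet_A) + ['B'] * len(diet_B) + ['C'] * len(diet_C))

# All pairwise comparisons with family-wise error rate held at 5%
tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey)
```

In the printed table, a `True` in the `reject` column marks a pair of diets whose mean weight loss differs significantly.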


### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

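```python
# Illustrative sample data (24 rows; the 30-employee data set from the question is not provided)
# Uses pd, sm and ols imported in Q5
data = pd.DataFrame({
    'time': [45, 46, 44, 47, 33, 35, 34, 36, 28, 30, 29, 31, 60, 62, 61, 63, 55, 57, 56, 58, 50, 52, 51, 53],
    'software': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'A', 'A', 
                 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'experience': ['novice', 'novice', 'experienced', 'experienced', 'novice', 'novice', 'experienced', 
                   'experienced', 'novice', 'novice', 'experienced', 'experienced',
                   'novice', 'novice', 'experienced', 'experienced', 'novice', 'novice', 'experienced', 
                   'experienced', 'novice', 'novice', 'experienced', 'experienced']
})

# Fit the model: both main effects and their interaction
model = ols('time ~ C(software) * C(experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)
```

```
                                sum_sq    df         F    PR(>F)
C(software)                 688.000000   2.0  2.503335  0.109842
C(experience)                 4.166667   1.0  0.030321  0.863706
C(software):C(experience)     0.333333   2.0  0.001213  0.998788
Residual                   2473.500000  18.0       NaN       NaN
```

**Interpretation:** At the 5% significance level none of the effects is significant: software (F ≈ 2.50, p ≈ 0.110), experience (F ≈ 0.03, p ≈ 0.864), and the software × experience interaction (F ≈ 0.001, p ≈ 0.999). The residual sum of squares is large because each software/experience cell mixes fast and slow times (for example 45, 46, 60, 62 for Program A novices), and neither factor explains that variation.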


### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

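```python
from scipy.stats import ttest_ind

# Illustrative sample data (10 per group; the 100-student data set from the question is not provided)
control = [78, 85, 88, 90, 84, 82, 78, 79, 83, 85]
experimental = [85, 87, 89, 90, 92, 88, 85, 87, 91, 90]

# Conducting the independent two-sample t-test (equal variances assumed by default)
t_statistic, p_value = ttest_ind(control, experimental)
print(f't-statistic: {t_statistic}, p-value: {p_value}')
```

```
t-statistic: -3.4709544440218836, p-value: 0.002726990368733668
```

**Interpretation:** p ≈ 0.0027 is below 0.05, so we reject the null hypothesis of equal mean scores. The negative t-statistic reflects a lower mean in the control group (83.2) than in the experimental group (88.4). With only two groups, no post-hoc test is needed: the t-test already identifies the one pair that differs.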
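
With exactly two groups, the equal-variance t-test and a one-way ANOVA are the same test: the F-statistic equals the square of the t-statistic and the p-values coincide. The sketch below checks this on the same sample data, which is also why a post-hoc step adds nothing in the two-group case.

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

# Same sample data as in the t-test for Q11
control = [78, 85, 88, 90, 84, 82, 78, 79, 83, 85]
experimental = [85, 87, 89, 90, 92, 88, 85, 87, 91, 90]

# Two-sample t-test with pooled variance (SciPy's default)
t_stat, t_p = ttest_ind(control, experimental)

# One-way ANOVA on the same two groups
F_stat, F_p = f_oneway(control, experimental)

print(f't^2 = {t_stat ** 2:.6f}, F = {F_stat:.6f}')
print(f't-test p = {t_p:.6f}, ANOVA p = {F_p:.6f}')
print('equivalent:', np.isclose(t_stat ** 2, F_stat) and np.isclose(t_p, F_p))
```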


### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Illustrative sample data (10 days; the 30-day data set from the question is not provided)
# Each "subject" is a day, observed once per store
data = pd.DataFrame({
    'subject': list(range(1, 11)) * 3,
    'store': ['A']*10 + ['B']*10 + ['C']*10,
    'sales': [100, 110, 115, 105, 120, 125, 130, 135, 140, 145,
              150, 140, 155, 145, 160, 165, 170, 175, 180, 185,
              200, 180, 210, 195, 220, 225, 230, 235, 240, 245]
})

# Conducting the repeated measures ANOVA
aovrm = AnovaRM(data, 'sales', 'subject', within=['store']).fit()
print(aovrm)

# If results are significant, follow up with a post-hoc test (Tukey HSD example)
# Note: Tukey's HSD treats the stores as independent groups and ignores the pairing by subject
tukey_results = pairwise_tukeyhsd(endog=data['sales'], groups=data['store'], alpha=0.05)
print(tukey_results)
```

```
               Anova
      F Value  Num DF  Den DF Pr > F
------------------------------------
store 859.5467 2.0000 18.0000 0.0000

Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj  lower   upper  reject
---------------------------------------------------
     A      B     40.0 0.0001 20.694  59.306   True
     A      C     95.5    0.0 76.194 114.806   True
     B      C     55.5    0.0 36.194  74.806   True
---------------------------------------------------
```

**Interpretation:** The repeated measures ANOVA is significant (F(2, 18) ≈ 859.55, p < 0.001), so mean sales differ between stores. Tukey's HSD rejects equality for every pair, with mean sales ordered C > B > A.
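
Tukey's HSD treats the three stores as independent groups, whereas the design here is repeated measures (the same days observed at each store). One alternative that respects the pairing is a set of paired t-tests with a Bonferroni correction, sketched below on the same sample data. This is a swapped-in technique offered for comparison, not the method used in the answer above.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import ttest_rel

# Same sample data as in the repeated measures ANOVA for Q12
data = pd.DataFrame({
    'subject': list(range(1, 11)) * 3,
    'store': ['A'] * 10 + ['B'] * 10 + ['C'] * 10,
    'sales': [100, 110, 115, 105, 120, 125, 130, 135, 140, 145,
              150, 140, 155, 145, 160, 165, 170, 175, 180, 185,
              200, 180, 210, 195, 220, 225, 230, 235, 240, 245]
})

# Wide format: one row per subject, one column per store, so rows stay paired
wide = data.pivot(index='subject', columns='store', values='sales')

pairs = list(combinations(wide.columns, 2))
n_comparisons = len(pairs)

# Paired t-test per pair of stores, Bonferroni-adjusted (p times number of tests, capped at 1)
adjusted_p = {}
for a, b in pairs:
    t_stat, p_raw = ttest_rel(wide[a], wide[b])
    adjusted_p[(a, b)] = min(p_raw * n_comparisons, 1.0)
    print(f'{a} vs {b}: t = {t_stat:.3f}, Bonferroni-adjusted p = {adjusted_p[(a, b)]:.2e}')
```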
