Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

In [None]:
Ans 1:-
ANOVA (Analysis of Variance) is a statistical test used to compare the means of three or more groups to determine if there are significant differences between them.
ANOVA relies on the following assumptions, and violating them can undermine the validity of the results.

In [None]:
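Independence:
    The observations must be independent of each other, both within and between groups.
    This means that no data point should be related to or dependent on any other data point.
    Example of a violation: measuring the same participants several times and treating the measurements as separate groups.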

In [None]:
Normality:
    The data within each group (equivalently, the residuals) should be approximately normally distributed.
    Example of a violation: strongly skewed data such as incomes or reaction times, especially when the groups are small.

In [None]:
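Homogeneity of Variance:
    The variances of the groups should be approximately equal.
    This means that the spread of data points in each group should be similar.
    Example of a violation: one group has a standard deviation several times larger than the others, which distorts the Type I error rate, particularly with unequal group sizes.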

In [None]:
Random Sampling:
    The data should be collected using random sampling techniques to ensure that the sample is representative of the population.
    Non-random sampling methods may introduce biases and affect the generalizability of the results.
    Example of a violation: recruiting only volunteers, who may differ systematically from the rest of the population.
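
The normality and homogeneity assumptions above can be checked before running ANOVA. The following is a minimal sketch, assuming three hypothetical groups of simulated data; the group names, sizes, and distribution parameters are illustrative and not part of the original answer. It uses the Shapiro-Wilk test for normality and Levene's test for equal variances.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: three groups of 30 simulated observations each
groups = {
    "A": rng.normal(loc=10.0, scale=2.0, size=30),
    "B": rng.normal(loc=11.0, scale=2.0, size=30),
    "C": rng.normal(loc=12.0, scale=2.0, size=30),
}

# Normality: Shapiro-Wilk test per group (null hypothesis: the data are normal)
shapiro_p = {name: stats.shapiro(values).pvalue for name, values in groups.items()}

# Homogeneity of variance: Levene's test (null hypothesis: equal variances)
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test suggests that the corresponding assumption may not hold. Independence and random sampling cannot be tested this way, because they depend on how the data were collected.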

Q2. What are the three types of ANOVA, and in what situations would each be used?

In [None]:
Ans 2:-
One-Way ANOVA:
    One-Way ANOVA is used when there is one categorical independent variable (also known as a factor) with three or more groups, and a continuous dependent variable.
    It tests whether the mean of the dependent variable differs across the groups of the independent variable.
    Example: comparing the mean exam scores of students taught by three different teaching methods.

In [None]:
Two-Way ANOVA:
    Two-Way ANOVA is used when there are two categorical independent variables (factors) and one continuous dependent variable.
    It allows us to test the main effects of each independent variable as well as the interaction effect between the two independent variables on the dependent
    variable.

In [None]:
Repeated Measures ANOVA:
    Repeated Measures ANOVA is used when the same participants are measured multiple times under different conditions.
    It is used to compare the means of a continuous dependent variable across multiple time points or conditions for the same group of participants.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

In [None]:
Ans 3:-
The partitioning of variance in ANOVA refers to the process of decomposing the total variance of the dependent variable into different components, each associated
with a specific source of variation.
In ANOVA, the total variance in the dependent variable is divided into two main components: the variance between groups (also known as the "explained variance")
and the variance within groups (also known as the "unexplained variance").

In [None]:
Assessing Group Differences:
    By partitioning the variance, ANOVA helps us determine if there are statistically significant differences between the means of the groups.
    If the variance between groups is large relative to the variance within groups (a large F-ratio), it suggests that there are real differences between the groups and not
    just random fluctuations.

In [None]:
Understanding the Effect Size:
    The partitioning of variance also provides information about the effect size, which indicates the magnitude of the differences between the groups.
    Effect size measures, such as eta-squared or partial eta-squared, quantify the proportion of variance in the dependent variable that can be attributed to the
    independent variable.

In [None]:
Validating the Model:
    By understanding the partitioning of variance, researchers can assess how well the model fits the data and if the independent variable(s) account for a
    substantial portion of the variability in the dependent variable.
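
The decomposition and the eta-squared effect size described above can be sketched as follows. The three small groups of scores are hypothetical and serve only to illustrate that the total sum of squares equals the between-group plus the within-group sum of squares.

```python
import numpy as np

# Hypothetical scores for three groups (illustrative values)
groups = [
    np.array([10.0, 12.0, 15.0]),
    np.array([8.0, 9.0, 11.0]),
    np.array([20.0, 22.0, 18.0]),
]
all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

# Total variation around the grand mean
ss_total = np.sum((all_scores - grand_mean) ** 2)
# Between-group ("explained") variation, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group ("unexplained") variation around each group's own mean
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Eta-squared: proportion of total variance attributable to group membership
eta_squared = ss_between / ss_total

print(f"SS_total = {ss_total:.3f}")
print(f"SS_between + SS_within = {ss_between + ss_within:.3f}")
print(f"eta-squared = {eta_squared:.3f}")
```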

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [None]:
Ans 4:-
Step 1: Import the required libraries.
import numpy as np
import pandas as pd

In [None]:
Step 2: Prepare the data.
data = {
    'group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'score': [10, 12, 15, 8, 9, 11, 20, 22, 18]
}
df = pd.DataFrame(data)

In [None]:
Step 3: Calculate the group means.
group_means = df.groupby('group')['score'].mean()

In [None]:
Step 4: Calculate the Total Sum of Squares (SST).
overall_mean = df['score'].mean()
sst = np.sum((df['score'] - overall_mean) ** 2)

In [None]:
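Step 5: Calculate the Explained Sum of Squares (SSE).
# Each squared deviation of a group mean is weighted by the number of observations in that group.
group_sizes = df.groupby('group')['score'].count()
sse = np.sum(group_sizes * (group_means - overall_mean) ** 2)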

In [None]:
Step 6: Calculate the Residual Sum of Squares (SSR).
ssr = np.sum((df['score'] - df.groupby('group')['score'].transform('mean')) ** 2)

In [None]:
Step 7: Verify the relationship among SST, SSE, and SSR.
assert np.isclose(sst, sse + ssr)
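
Once the sums of squares are known, the F-statistic follows from dividing each by its degrees of freedom. The sketch below repeats the same data so that it runs on its own, and cross-checks the manual result against `scipy.stats.f_oneway`; the variable names `f_manual` and `p_manual` are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "group": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "score": [10, 12, 15, 8, 9, 11, 20, 22, 18],
})
overall_mean = df["score"].mean()
group_mean_per_row = df.groupby("group")["score"].transform("mean")

sse = np.sum((group_mean_per_row - overall_mean) ** 2)  # explained (between groups)
ssr = np.sum((df["score"] - group_mean_per_row) ** 2)   # residual (within groups)

k = df["group"].nunique()  # number of groups
n = len(df)                # total number of observations
df_between, df_within = k - 1, n - k

# F = mean square between / mean square within
f_manual = (sse / df_between) / (ssr / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

# Cross-check with SciPy's one-way ANOVA
samples = [g["score"].to_numpy() for _, g in df.groupby("group")]
f_scipy, p_scipy = stats.f_oneway(*samples)

print(f"Manual: F = {f_manual:.4f}, p = {p_manual:.6f}")
print(f"SciPy:  F = {f_scipy:.4f}, p = {p_scipy:.6f}")
```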

Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:
Ans 5:-
Step 1: Import the required libraries.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [None]:
Step 2: Prepare the data.
data = {
    'group1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'group2': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y', 'X'],
    'score': [10, 12, 15, 8, 9, 11, 20, 22, 18]
}

df = pd.DataFrame(data)

In [None]:
Step 3: Fit the two-way ANOVA model.
model = ols('score ~ group1 + group2 + group1:group2', data=df).fit()

In [None]:
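Step 4: Calculate the main effects.
# The ANOVA table holds one F-test per factor (main effects) and one for the interaction.
# Individual model coefficients are contrasts against the baseline level, not main effects.
anova_table = sm.stats.anova_lm(model, typ=2)
main_effects = anova_table.loc[['group1', 'group2']]

In [None]:
Step 5: Calculate the interaction effect.
interaction_effect = anova_table.loc['group1:group2']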

Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

In [None]:
Ans 6:-
In a one-way ANOVA, the F-statistic is used to test whether there are significant differences between the means of two or more groups.
The p-value associated with the F-statistic indicates the probability of obtaining the observed F-statistic (or more extreme) if the null hypothesis is true.

In [None]:
In this scenario, the obtained F-statistic is 5.23, and the p-value is 0.02.
With a significance level of 0.05 (commonly used in hypothesis testing), the p-value is less than the significance level.
Therefore, we can conclude that there are significant differences between the means of the groups.

In [None]:
Interpretation:
    Since the p-value (0.02) is less than the significance level (0.05), we reject the null hypothesis that all group means are equal.
    This means that there is sufficient evidence to suggest that at least one of the group means is significantly different from the others.
    The F-test does not show which groups differ; a post-hoc test is needed to identify the specific pairs.
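
The decision rule can be sketched in Python. The question does not state the degrees of freedom, so the values below (3 groups, 30 observations) are assumptions chosen for illustration; the p-value computed this way will therefore not match the reported 0.02 exactly.

```python
from scipy import stats

f_statistic = 5.23
alpha = 0.05

# Assumed degrees of freedom (not given in the question): 3 groups, 30 observations
df_between, df_within = 2, 27

# p-value: probability of an F at least this large under the null hypothesis
p_value = stats.f.sf(f_statistic, df_between, df_within)
# Critical value: the F above which the null hypothesis is rejected at level alpha
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)
reject_null = f_statistic > f_critical

print(f"p-value under the assumed degrees of freedom: {p_value:.4f}")
print(f"Critical F at alpha = {alpha}: {f_critical:.3f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```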

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

In [None]:
Ans 7:-
Handling missing data in a repeated measures ANOVA is essential to ensure accurate and reliable results.
There are several methods to deal with missing data, and the choice of method can impact the validity and precision of the analysis.
Here are some common methods for handling missing data in a repeated measures ANOVA:

In [None]:
Complete Case Analysis (Listwise Deletion):
    This method involves excluding any participant with missing data from the analysis.
    It is a straightforward approach, but it discards valuable information and reduces statistical power, and it biases the results unless the data are missing completely at random.

In [None]:
Mean Imputation:
    Mean imputation involves replacing missing data with the mean of the observed values for that variable.
    While simple to implement, this method can lead to biased estimates and an underestimation of the standard errors, as it does not account for the variability
    introduced by missing values.

In [None]:
Last Observation Carried Forward (LOCF):
    In LOCF, missing data are replaced with the last observed value for that participant.
    This method is commonly used in longitudinal studies and clinical trials but can also lead to biased results if the missing data are not missing completely at random.
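
The three methods above can be sketched with pandas. The wide-format table below (one row per participant, one column per time point) is hypothetical, as are the names `wide`, `t1`, `t2`, and `t3`.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
wide = pd.DataFrame(
    {
        "t1": [5.0, 6.0, 7.0, 8.0],
        "t2": [6.0, np.nan, 8.0, 9.0],
        "t3": [7.0, 8.0, np.nan, 10.0],
    },
    index=["p1", "p2", "p3", "p4"],
)

# Complete case analysis: drop every participant with at least one missing value
complete_cases = wide.dropna()

# Mean imputation: fill each gap with the mean of that time point
mean_imputed = wide.fillna(wide.mean())

# LOCF: carry each participant's last observed value forward across time points
locf = wide.ffill(axis=1)

print(complete_cases)
print(mean_imputed)
print(locf)
```

Here listwise deletion keeps only two of the four participants, which shows how quickly the sample can shrink.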

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

In [None]:
Ans 8:-
Post-hoc tests are used in ANOVA to perform pairwise comparisons between groups when the overall ANOVA test indicates a significant difference among groups.
These tests help to identify which specific group means are significantly different from each other.
Some common post-hoc tests used after ANOVA include:

In [None]:
Tukey's Honestly Significant Difference (HSD) Test:
    Tukey's HSD test compares all pairs of group means and is appropriate when the sample sizes are equal and the variances are homogeneous.
    It controls the family-wise error rate, making it suitable for situations where multiple pairwise comparisons are being made.

In [None]:
Bonferroni Correction:
    The Bonferroni correction is a simple and commonly used method to adjust the significance level for multiple comparisons.
    It divides the desired significance level (typically 0.05) by the number of pairwise comparisons to control the overall Type I error rate.

In [None]:
Scheffé's Test:
    Scheffé's test is the most conservative of the three and is suited to complex contrasts (comparisons involving more than two group means), including designs with unequal sample sizes.
    It controls the family-wise error rate across all possible contrasts, at the cost of lower power for simple pairwise comparisons.

In [None]:
Example:
    Suppose a study examines the effect of three different treatments (A, B, and C) on the performance of students.
    The overall ANOVA test indicates a significant difference among the treatments.
    To determine which specific treatments differ significantly from each other, a post-hoc test is necessary.

After conducting a Tukey's HSD test, the results show that treatment A has a significantly higher mean score than treatment B (p < 0.05), but there is no
significant difference between treatment A and treatment C (p > 0.05) or between treatment B and treatment C (p > 0.05).
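
The Bonferroni correction described above can be sketched as pairwise t-tests compared against an adjusted significance level. The scores below are simulated and hypothetical; they do not reproduce the example results.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical simulated scores for treatments A, B, and C (20 students each)
scores = {
    "A": rng.normal(75, 8, 20),
    "B": rng.normal(65, 8, 20),
    "C": rng.normal(70, 8, 20),
}

pairs = list(combinations(scores, 2))  # (A, B), (A, C), (B, C)
alpha = 0.05
adjusted_alpha = alpha / len(pairs)    # Bonferroni: divide alpha by the number of comparisons

results = {}
for g1, g2 in pairs:
    t_stat, p = stats.ttest_ind(scores[g1], scores[g2])
    results[(g1, g2)] = (p, p < adjusted_alpha)
    print(f"{g1} vs {g2}: p = {p:.4f}, significant at adjusted alpha: {p < adjusted_alpha}")
```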

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [1]:
# Ans 9:-
import numpy as np
from scipy.stats import f_oneway

# Simulated data for illustration: 50 observations per diet
np.random.seed(42)
weight_loss_a = np.random.normal(loc=3.0, scale=0.5, size=50)
weight_loss_b = np.random.normal(loc=2.5, scale=0.7, size=50)
weight_loss_c = np.random.normal(loc=2.8, scale=0.6, size=50)

f_statistic, p_value = f_oneway(weight_loss_a, weight_loss_b, weight_loss_c)

print("F-statistic:", f_statistic)
print("p-value:", p_value)

F-statistic: 5.770957301178106
p-value: 0.0038657419424457332


In [None]:
F-statistic:
    The F-statistic is the ratio of the between-group variability to the within-group variability.
    A larger F-statistic indicates that the group means differ more relative to the variation within the groups.
    
p-value:
    The p-value is the probability of obtaining an F-statistic at least as large as the observed one if all group means were equal.
    Here F = 5.77 and p = 0.0039, which is below 0.05, so we reject the null hypothesis and conclude that the mean weight loss of at least one diet differs from the others.
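
The question describes 50 participants in total, whereas the cell above simulates 50 observations per diet. A minimal sketch of the same test with 50 participants split 17/17/16 across the diets is shown here; the split and the simulated values are assumptions, so the resulting statistics will differ from those printed above.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(42)

# Assumed allocation of the 50 participants to the three diets
sizes = {"A": 17, "B": 17, "C": 16}
# Same simulation parameters as the cell above (mean and standard deviation per diet)
means = {"A": 3.0, "B": 2.5, "C": 2.8}
sds = {"A": 0.5, "B": 0.7, "C": 0.6}

weight_loss = {diet: rng.normal(means[diet], sds[diet], sizes[diet]) for diet in sizes}

f_statistic, p_value = f_oneway(*weight_loss.values())
alpha = 0.05

print(f"F-statistic: {f_statistic:.4f}")
print(f"p-value: {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```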

Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [2]:
# Ans 10:-
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

np.random.seed(42)

experience = np.random.choice([0, 1], size=90)

program = np.random.choice([0, 1, 2], size=90)

time_taken = np.random.normal(loc=30, scale=5, size=90)

data = pd.DataFrame({'Experience': experience, 'Program': program, 'TimeTaken': time_taken})

data['Experience'] = data['Experience'].map({0: 'Novice', 1: 'Experienced'})
data['Program'] = data['Program'].map({0: 'A', 1: 'B', 2: 'C'})

formula = 'TimeTaken ~ C(Experience) + C(Program) + C(Experience):C(Program)'
model = ols(formula, data).fit()
anova_table = sm.stats.anova_lm(model)

print(anova_table)


                            df       sum_sq    mean_sq         F    PR(>F)
C(Experience)              1.0    18.246591  18.246591  0.722394  0.397776
C(Program)                 2.0     0.545822   0.272911  0.010805  0.989255
C(Experience):C(Program)   2.0    41.108842  20.554421  0.813763  0.446650
Residual                  84.0  2121.713786  25.258497       NaN       NaN

In [None]:
Interpretation:
    None of the p-values is below 0.05: experience level (F = 0.72, p = 0.398), program (F = 0.01, p = 0.989), and their interaction (F = 0.81, p = 0.447).
    We fail to reject the null hypotheses, so the data show no evidence of main effects or of an interaction effect on task completion time.
    This is expected here, because the simulated times were generated independently of both factors.


Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [3]:
# Ans 11:-
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

np.random.seed(42)

control_scores = np.random.normal(loc=70, scale=10, size=100)

experimental_scores = np.random.normal(loc=75, scale=10, size=100)

data = pd.DataFrame({'Group': ['Control'] * 100 + ['Experimental'] * 100,
                     'TestScores': np.concatenate([control_scores, experimental_scores])})

control_scores = data[data['Group'] == 'Control']['TestScores']
experimental_scores = data[data['Group'] == 'Experimental']['TestScores']

t_stat, p_value = stats.ttest_ind(control_scores, experimental_scores)

print("Two-sample t-test results:")
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

posthoc = pairwise_tukeyhsd(data['TestScores'], data['Group'])

print("\nPost-hoc Tukey's HSD test results:")
print(posthoc)


Two-sample t-test results:
T-statistic: -4.754695943505281
P-value: 3.819135262679478e-06

Post-hoc Tukey's HSD test results:
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower  upper  reject
--------------------------------------------------------
Control Experimental   6.2615   0.0 3.6645 8.8585   True
--------------------------------------------------------

In [None]:
Interpretation:
    The t-test gives t = -4.75 and p = 3.8e-06, far below 0.05, so the mean test scores of the two groups differ significantly.
    The experimental group scores about 6.26 points higher on average (interval from 3.66 to 8.86).
    With only two groups, Tukey's HSD compares the same single pair as the t-test, so it confirms the result rather than adding new information.


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [2]:
# Ans 12:-
import numpy as np
import pandas as pd
import pingouin as pg
from statsmodels.stats.multicomp import pairwise_tukeyhsd

np.random.seed(42)

days = np.arange(1, 31)
store_a_sales = np.random.normal(loc=1000, scale=100, size=30)
store_b_sales = np.random.normal(loc=1100, scale=120, size=30)
store_c_sales = np.random.normal(loc=950, scale=90, size=30)

data = pd.DataFrame({'Days': np.tile(days, 3),
                     'Store': np.repeat(['A', 'B', 'C'], 30),
                     'Sales': np.concatenate([store_a_sales, store_b_sales, store_c_sales])})

# Store is the within-subject factor; each day is the repeated "subject" observed at all three stores.
rm_anova = pg.rm_anova(data=data, dv='Sales', within='Store', subject='Days')

print("Repeated Measures ANOVA results:")
print(rm_anova)

posthoc = pairwise_tukeyhsd(data['Sales'], data['Store'])

print("\nPost-hoc Tukey's HSD test results:")
print(posthoc)


Repeated Measures ANOVA results:
(With Store as the within-subject factor and 30 days as subjects, the F-test has 2 and 58 degrees of freedom.)

Post-hoc Tukey's HSD test results:
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
group1 group2  meandiff p-adj    lower    upper   reject
--------------------------------------------------------
     A      B  104.2752 0.0002   44.2094  164.341   True
     A      C  -30.0257 0.4611  -90.0915  30.0401  False
     B      C -134.3009    0.0 -194.3667 -74.2351   True
--------------------------------------------------------
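
Tukey's HSD treats the three stores as independent samples, although each store is observed on the same 30 days. One alternative is a set of paired t-tests with a Bonferroni adjustment. This is a sketch that regenerates the simulated sales with the same seed and parameters as the cell above; `paired_results` is an illustrative name.

```python
from itertools import combinations

import numpy as np
from scipy import stats

np.random.seed(42)

# Simulated daily sales for the same 30 days at each store (same parameters as above)
sales = {
    "A": np.random.normal(loc=1000, scale=100, size=30),
    "B": np.random.normal(loc=1100, scale=120, size=30),
    "C": np.random.normal(loc=950, scale=90, size=30),
}

pairs = list(combinations(sales, 2))
paired_results = {}
for s1, s2 in pairs:
    # Paired test: each day contributes one sales figure per store
    t_stat, p = stats.ttest_rel(sales[s1], sales[s2])
    # Bonferroni adjustment: multiply by the number of comparisons, capped at 1
    p_adj = min(p * len(pairs), 1.0)
    paired_results[(s1, s2)] = p_adj
    print(f"{s1} vs {s2}: t = {t_stat:.3f}, Bonferroni-adjusted p = {p_adj:.4f}")
```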


In [1]:
pip install pingouin
