## Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

#### >Normality of the sampling distribution of means: The sample means (equivalently, the residuals within each group) are approximately normally distributed.
#### >Absence of outliers: Outlying scores need to be identified and removed from the dataset or otherwise handled.
#### >Homogeneity of variance: Each of the populations has the same variance.
#### >Samples are independent and random: The value of an observation in one group does not affect or influence the value of an observation in another group.

### Examples of violations for each assumption and how they could impact the validity of ANOVA results:

#### 1. Violation of Independence Assumption:
Assumption: Observations are independent of each other.
##### Example: Conducting a study on the impact of a new teaching method on student performance, but using data from multiple students within the same classroom. The scores of students within the same class may be correlated due to shared factors like the teacher's teaching style.

Impact: Violation of independence can lead to incorrect standard errors, inflated Type I error rates, and biased parameter estimates. The results may appear more significant than they actually are.

#### 2. Violation of Normality Assumption:
Assumption: The residuals or errors are normally distributed.
##### Example: Analyzing exam scores of students across groups, where the residuals of the ANOVA model (each score minus its group mean) are skewed or exhibit heavy tails.

Impact: Non-normality can affect the accuracy of p-values, confidence intervals, and hypothesis tests. Statistical inferences based on these assumptions may be unreliable.

#### 3. Violation of Homoscedasticity Assumption:
Assumption: The variance of the residuals is constant across all levels of the independent variable(s).
##### Example: Comparing the heights of plants under different light conditions, and the spread of the residuals increases as the light intensity increases.

Impact: Violation of homoscedasticity can lead to incorrect standard errors, biased parameter estimates, and inaccurate p-values. Confidence intervals and hypothesis tests may be compromised.

#### 4. Violation of Absence of Outliers Assumption:
Assumption: There are no extreme outliers in the data.
##### Example: Analyzing survey responses on income levels, and one respondent reports an unusually high income far beyond the typical range.

Impact: Outliers can skew the means, distort variances, and impact the overall distribution. This can lead to incorrect significance tests and biased conclusions.
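
The checks described above can be sketched in Python. This is a minimal illustration, assuming simulated data and a hypothetical helper `iqr_outliers`; it uses the Shapiro-Wilk test for normality of residuals, Levene's test for equal variances, and the common 1.5 × IQR rule for outliers. Independence cannot be tested this way, since it follows from the study design.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups drawn from normal populations with equal spread
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=5.0, size=30) for mu in (50, 55, 60)]

# Normality: Shapiro-Wilk test on the residuals (each value minus its group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

# Outliers: flag values outside the 1.5 * IQR fences within a group
def iqr_outliers(values):
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

outlier_counts = [len(iqr_outliers(g)) for g in groups]

print("Shapiro-Wilk p-value:", shapiro_p)
print("Levene p-value:", levene_p)
print("Outliers per group:", outlier_counts)
```

A small p-value from either test suggests the corresponding assumption may not hold; a large one does not prove that it does.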

## Q2. What are the three types of ANOVA, and in what situations would each be used?

#### 1. One-Way ANOVA:
One-Way ANOVA is used when you have one categorical independent variable (also known as a factor) with three or more levels or groups, and you want to determine if there are any statistically significant differences in means among these groups.

#### 2. Repeated Measures ANOVA:
Repeated Measures ANOVA is used when you have a design in which the same subjects are measured multiple times under different conditions or at different time points. 

#### 3. Factorial ANOVA:
Factorial ANOVA is used when you have two or more independent variables (factors), and you want to examine their individual and interactive effects on a dependent variable. With exactly two factors it is called a two-way ANOVA. It allows you to analyze the effect of each factor on its own (main effects) as well as their combined effects (interactions).

## Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

#### The partitioning of variance refers to the division of the total variance observed in a dataset into different components that can be attributed to specific sources of variation. ANOVA is a statistical technique used to compare means among two or more groups and determine if there are significant differences between them.

### It is important to understand this concept because: 
#### >Hypothesis Testing: ANOVA helps researchers determine if there are statistically significant differences among the means of different groups. By partitioning the variance into between-groups and within-groups components, ANOVA provides a way to assess whether the observed differences between groups are likely due to actual treatment effects or are simply a result of random variability.

#### >Interpretation: The partitioning of variance allows researchers to interpret the proportion of total variability that can be explained by the differences between groups (between-groups variance) as opposed to the variability within each group (within-groups variance).

#### >Effect Size: ANOVA's partitioning of variance can be used to calculate effect size measures such as eta-squared or partial eta-squared. These measures quantify the proportion of variability in the dependent variable that can be attributed to the independent variable, providing insights into the practical significance of the observed differences.

#### >Experimental Design: Understanding the partitioning of variance can help researchers design experiments more effectively. By controlling and minimizing within-groups variability and maximizing between-groups variability, researchers can increase the sensitivity of their study to detect meaningful treatment effects.

#### >Model Assessment: ANOVA's partitioning of variance is also crucial for assessing the adequacy of the statistical model. It helps researchers evaluate whether the model adequately explains the observed variation and whether any additional factors should be considered.
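
For reference, the partition for a one-way design can be written out as follows. The notation is an assumption introduced here: $k$ groups, $n_j$ observations in group $j$, $N$ observations in total, group means $\bar{x}_j$ and grand mean $\bar{x}$.

```latex
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j} (x_{ij} - \bar{x})^2}_{SS_{\text{total}}}
  = \underbrace{\sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2}_{SS_{\text{between}}}
  + \underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2}_{SS_{\text{within}}}

F = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)},
\qquad
\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}}
```

The F-statistic compares the two components after scaling each by its degrees of freedom, and eta-squared is the share of total variability attributed to group membership.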

## Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import numpy as np
from scipy import stats

#Example data for three groups
group1 = np.array([25, 30, 35, 40, 45])
group2 = np.array([15, 20, 25, 30, 35])
group3 = np.array([40, 45, 50, 55, 60])

#Combine the data into a single array
data = np.concatenate([group1,group2,group3])

#Calculate overall mean
mean = np.mean(data)

#Calculate group means
group1_mean = np.mean(group1)
group2_mean = np.mean(group2)
group3_mean = np.mean(group3)

#Calculate the total sum of squares(SST)
sst = np.sum((data - mean)**2)

#Calculate the explained sum of squares(SSE): group size times the squared deviation of each group mean from the overall mean
sse = len(group1) * (group1_mean - mean)**2 + \
      len(group2) * (group2_mean - mean)**2 + \
      len(group3) * (group3_mean - mean)**2

#Calculate the residual sum of squares(SSR)
ssr = sst - sse

#Degrees of freedom
df_total = len(data) - 1
df_groups = 3 - 1
df_residual = df_total - df_groups

#Mean squares
ms_groups = sse / df_groups
ms_residual = ssr / df_residual


print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)

Total Sum of Squares (SST): 2333.333333333333
Explained Sum of Squares (SSE): 1583.3333333333333
Residual Sum of Squares (SSR): 749.9999999999998
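
As a cross-check, the sums of squares can be turned into an F-statistic and compared with `scipy.stats.f_oneway`. The sketch below reuses the three example groups from the cell above; the names `ss_between`, `ss_within`, `f_manual` and `p_manual` are introduced here for illustration only.

```python
import numpy as np
from scipy import stats

# Same example groups as in the cell above
group1 = np.array([25, 30, 35, 40, 45])
group2 = np.array([15, 20, 25, 30, 35])
group3 = np.array([40, 45, 50, 55, 60])
groups = [group1, group2, group3]

data = np.concatenate(groups)
grand_mean = data.mean()

# Between-groups (explained) and within-groups (residual) sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(data) - len(groups)

# F is the ratio of the two mean squares; the p-value is the upper tail of F
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

f_scipy, p_scipy = stats.f_oneway(*groups)
print("Manual F:", f_manual, "p:", p_manual)
print("SciPy  F:", f_scipy, "p:", p_scipy)
```

Computing the within-groups sum of squares directly, rather than as SST minus SSE, also confirms that the two components add up to the total.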


## Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

In [7]:
data = {
    'factor1': [1, 1, 2, 2, 3, 3, 4, 4],
    'factor2': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'dependent_variable': [10, 12, 15, 14, 18, 20, 9, 11]
}

In [8]:
df = pd.DataFrame(data)

In [5]:
df

   factor1 factor2  dependent_variable
0        1       A                  10
1        1       B                  12
2        2       A                  15
3        2       B                  14
4        3       A                  18
5        3       B                  20
6        4       A                   9
7        4       B                  11


In [11]:
#Perform two-way ANOVA using the ordinary least squares (OLS) method
formula = 'dependent_variable ~ C(factor1) + C(factor2) + C(factor1):C(factor2)'
model = ols(formula, data = df).fit()
anova_results = anova_lm(model)

#Extract main effects and interaction effects from the ANOVA results
main_effect_factor1 = anova_results.loc['C(factor1)', 'F']
main_effect_factor2 = anova_results.loc['C(factor2)', 'F']
interaction_effect = anova_results.loc['C(factor1):C(factor2)', 'F']

print("Main Effect Factor 1:", main_effect_factor1)
print("Main Effect Factor 2:", main_effect_factor2)
print("Interaction Effect:", interaction_effect)

Main Effect Factor 1: 0.0
Main Effect Factor 2: 0.0
Interaction Effect: 0.0


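Note: this example has exactly one observation per factor1 × factor2 cell, so the model with the interaction term is saturated and leaves zero residual degrees of freedom. The F values of 0.0 printed above result from dividing by `model.ssr / model.df_resid` with `df_resid = 0` (statsmodels typically emits a divide-by-zero runtime warning here); they are not valid test statistics. Testing the main effects together with the interaction requires at least two observations per cell.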
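
For a design where the interaction can actually be tested, here is a minimal sketch that computes the main effects and the interaction by hand for a balanced layout with replication. The data in `y` are hypothetical (2 × 2 cells, 3 replicates each), and the formulas below hold only for balanced designs.

```python
import numpy as np
from scipy import stats

# Hypothetical balanced design: y[i, j, r] is replicate r for level i of
# factor 1 and level j of factor 2 (2 x 2 cells, 3 replicates per cell)
y = np.array([
    [[10.0, 12.0, 11.0], [14.0, 15.0, 13.0]],
    [[16.0, 18.0, 17.0], [15.0, 14.0, 16.0]],
])
a, b, r = y.shape

grand = y.mean()
mean_a = y.mean(axis=(1, 2))   # factor 1 level means
mean_b = y.mean(axis=(0, 2))   # factor 2 level means
mean_cell = y.mean(axis=2)     # cell means

ss_a = b * r * ((mean_a - grand) ** 2).sum()
ss_b = a * r * ((mean_b - grand) ** 2).sum()
ss_cells = r * ((mean_cell - grand) ** 2).sum()
ss_ab = ss_cells - ss_a - ss_b                       # interaction
ss_error = ((y - mean_cell[:, :, None]) ** 2).sum()  # within-cell variation

df_a, df_b = a - 1, b - 1
df_ab = df_a * df_b
df_error = a * b * (r - 1)
ms_error = ss_error / df_error

effects = [("factor1", ss_a, df_a), ("factor2", ss_b, df_b), ("interaction", ss_ab, df_ab)]
for name, ss, df in effects:
    f_value = (ss / df) / ms_error
    p = stats.f.sf(f_value, df, df_error)
    print(f"{name}: F = {f_value:.3f}, p = {p:.4f}")
```

Because each cell holds three replicates, the residual degrees of freedom are positive and every F ratio is well defined.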


## Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

### Interpretation of the F-statistic:
The F-statistic is the ratio of the between-groups mean square to the within-groups mean square. A higher F-statistic suggests that the means of at least some of the groups are different from each other. In our case, the F-statistic is 5.23, meaning the variability between the group means is about five times what would be expected from the variability within groups alone.

### Interpretation of the p-value:
The p-value represents the probability of observing the obtained F-statistic (or a more extreme one) under the assumption that there are no true differences between the group means (null hypothesis). A p-value of 0.02 indicates that if the null hypothesis were true (that is, if there were no actual differences between group means), we would expect to observe an F-statistic of 5.23 or larger in only 2% of cases. Since this p-value is below the commonly chosen significance level of 0.05, the differences between the groups are statistically significant at that level.

#### Based on the F-statistic and the p-value, we can conclude that there is evidence to reject the null hypothesis. In other words, there are statistically significant differences between at least some of the group means. However, the ANOVA test itself does not tell us which specific groups are different from each other; it only indicates that there are differences somewhere among the groups.
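
The relationship between the F-statistic, its degrees of freedom and the p-value can be illustrated with `scipy.stats.f`. The question does not state the degrees of freedom, so the values below (3 groups and 15 observations, giving 2 and 12) are an assumption chosen only for illustration.

```python
from scipy import stats

f_statistic = 5.23
alpha = 0.05

# Assumed degrees of freedom (not given in the question):
# 3 groups -> df_between = 2; 15 observations -> df_within = 12
df_between, df_within = 2, 12

# Upper-tail probability of the F distribution at the observed statistic
p_value = stats.f.sf(f_statistic, df_between, df_within)
# Smallest F that would be declared significant at the chosen alpha
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

print("p-value:", p_value)
print("Critical F at alpha = 0.05:", f_critical)
print("Reject H0:", f_statistic > f_critical)
```

Under these assumed degrees of freedom the p-value comes out close to the 0.02 quoted in the question; other degrees of freedom would give a different p-value for the same F.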


## Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

### Handling Missing Data:
#### a. Mean Imputation: Replace missing values with the mean of the available data for that variable. This method can distort the variability and relationships in the data, leading to biased estimates of group means and potentially underestimating standard errors.

#### b. Linear Interpolation: Estimate missing values based on a linear interpolation between adjacent observed data points. This method assumes a linear relationship and may not be appropriate for all types of data.

#### c. Multiple Imputation: Generate multiple plausible imputed datasets, analyze each separately, and then combine results to account for uncertainty due to missing data. Multiple imputation can provide more accurate estimates and standard errors, but it can be computationally intensive.

### Potential Consequences of Different Methods:
#### >Bias: Inaccurate handling of missing data can lead to biased estimates of group means, standard errors, and p-values. Biased estimates can result in incorrect conclusions about treatment effects.
#### >Type I and Type II Errors: Inadequate handling of missing data can inflate Type I error rates (false positives) or Type II error rates (false negatives).
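
The first two methods above can be sketched with NumPy on a small, hypothetical repeated-measures table (rows are subjects, columns are time points). The sketch also shows how mean imputation shrinks the sample variance of the affected time points.

```python
import numpy as np

# Hypothetical repeated-measures scores: rows = subjects, columns = time points
scores = np.array([
    [10.0, 12.0, 14.0, 16.0],
    [8.0, np.nan, 12.0, 15.0],
    [11.0, 13.0, np.nan, 18.0],
    [9.0, 10.0, 13.0, 14.0],
])

# a. Mean imputation: fill each gap with the mean of that time point
col_means = np.nanmean(scores, axis=0)
mean_imputed = np.where(np.isnan(scores), col_means, scores)

# b. Linear interpolation within each subject across time points
time = np.arange(scores.shape[1])
interpolated = scores.copy()
for row in interpolated:  # each row is a view, so assignment edits the copy in place
    missing = np.isnan(row)
    row[missing] = np.interp(time[missing], time[~missing], row[~missing])

# Compare the spread before and after mean imputation
var_observed = np.nanvar(scores, axis=0, ddof=1)
var_imputed = mean_imputed.var(axis=0, ddof=1)
print("Variance (observed only):", var_observed)
print("Variance (mean-imputed): ", var_imputed)
print("Interpolated data:\n", interpolated)
```

Multiple imputation is not shown because it needs a model for the missing values; it is usually done with a dedicated library rather than by hand.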

## Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

#### Tukey's Honestly Significant Difference (HSD) Test:

>Use when you have conducted an ANOVA and want to perform all possible pairwise comparisons.

>Appropriate for situations where you have a relatively balanced design and you want to control the familywise error rate.

>Example: In a study comparing the effects of three different treatments on blood pressure, you want to determine which treatment groups have significantly different means.

#### Bonferroni Correction:

>Use when you are conducting multiple pairwise comparisons and need to control the overall familywise error rate.

>More conservative than some other methods, which can reduce the risk of Type I errors.

>Example: A psychology study measures the effects of four different interventions on test anxiety, and you want to compare each intervention to every other intervention.

#### Scheffe's Test:

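>Use when you want to test complex contrasts (not only pairwise comparisons), including designs with unequal sample sizes.

>Provides a more conservative correction than Tukey's HSD, so it has less power for simple pairwise comparisons. It still assumes equal variances; when variances are clearly unequal, a procedure such as Games-Howell is more appropriate.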

>Example: An educational study examines the effects of teaching methods on student performance, but the class sizes are different for each teaching method.
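
A Bonferroni-corrected set of pairwise comparisons can be sketched with SciPy alone. The blood-pressure values below are hypothetical and the dictionary `treatments` is introduced only for this illustration.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical reductions in blood pressure (mmHg) under three treatments
treatments = {
    "Treatment 1": np.array([12.0, 14.0, 11.0, 13.0, 15.0, 12.0]),
    "Treatment 2": np.array([9.0, 10.0, 8.0, 11.0, 9.0, 10.0]),
    "Treatment 3": np.array([12.0, 13.0, 12.0, 14.0, 13.0, 11.0]),
}

pairs = list(combinations(treatments, 2))
alpha = 0.05

adjusted = {}
for name_a, name_b in pairs:
    _, p_raw = stats.ttest_ind(treatments[name_a], treatments[name_b])
    # Bonferroni: multiply each raw p-value by the number of comparisons, cap at 1
    adjusted[(name_a, name_b)] = min(1.0, p_raw * len(pairs))

for (name_a, name_b), p_adj in adjusted.items():
    verdict = "differ" if p_adj < alpha else "do not differ"
    print(f"{name_a} vs {name_b}: adjusted p = {p_adj:.4f} ({verdict})")
```

Comparing the adjusted p-values with the original alpha is equivalent to comparing the raw p-values with alpha divided by the number of comparisons.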

## Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [14]:
import numpy as np
from scipy import stats

## Simulated weight loss data for three diets
diet_A = np.array([2.5, 3.1, 1.8, 2.3, 3.0, 2.1, 2.8, 2.5, 3.2, 2.9,
                   1.9, 2.6, 2.7, 2.2, 2.0, 3.3, 2.4, 2.6, 2.8, 2.3,
                   2.7, 2.1, 2.4, 2.0, 2.9, 2.5, 2.8, 2.6, 3.1, 2.3,
                   2.2, 2.4, 2.7, 2.9, 3.0, 2.6, 2.8, 2.5, 2.1, 2.3,
                   2.4, 2.0, 2.7, 2.8, 3.2, 2.6, 2.9, 3.1, 2.3, 2.2])

diet_B = np.array([3.5, 3.2, 3.9, 3.0, 3.4, 3.1, 3.7, 3.2, 3.8, 3.5,
                   3.6, 3.3, 3.2, 3.1, 3.4, 3.5, 3.2, 3.0, 3.3, 3.6,
                   3.4, 3.5, 3.2, 3.8, 3.9, 3.3, 3.1, 3.7, 3.4, 3.2,
                   3.6, 3.3, 3.5, 3.4, 3.7, 3.2, 3.8, 3.6, 3.9, 3.1,
                   3.2, 3.5, 3.6, 3.4, 3.2, 3.7, 3.3, 3.1, 3.9, 3.8])

diet_C = np.array([4.0, 4.5, 3.7, 4.1, 3.9, 4.2, 4.4, 3.8, 4.3, 4.0,
                   4.1, 4.2, 3.9, 4.4, 3.8, 4.3, 4.2, 4.0, 4.5, 3.7,
                   4.1, 3.9, 4.2, 4.4, 3.8, 4.3, 4.0, 4.1, 4.2, 3.9,
                   4.4, 3.8, 4.3, 4.2, 4.0, 4.5, 3.7, 4.1, 3.9, 4.2,
                   4.4, 3.8, 4.3, 4.0, 4.1, 4.2, 3.9, 4.4, 3.8, 4.3])

##Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

##Interpret the results
alpha = 0.05

print("F-statistic:", f_statistic)
print("p-value:", p_value)

if p_value < alpha:
    print("Reject the null hypothesis: There are significant differences between diet means.")
else:
    print("Fail to reject the null hypothesis: No significant differences between diet means.")


F-statistic: 339.8596835805201
p-value: 7.44736929462682e-56
Reject the null hypothesis: There are significant differences between diet means.


## Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [4]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

## Simulated data for the example
data = {
    'Software': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Experience': ['Novice', 'Novice', 'Novice', 'Experienced', 'Experienced', 'Experienced',
                   'Novice', 'Novice', 'Novice', 'Experienced', 'Experienced', 'Experienced'],
    'Time': [10.5, 12.2, 11.8, 9.9, 12.5, 10.8, 11.0, 11.8, 10.2, 10.8, 11.5, 11.2]
}

df = pd.DataFrame(data)

#Perform two-way ANOVA
formula = 'Time ~ C(Software) + C(Experience) + C(Software):C(Experience)'
model = ols(formula, data=df).fit()
anova_results = anova_lm(model)

#Interpret the results
alpha = 0.05

print("Two-Way ANOVA Results:")
print(anova_results)

if any(anova_results['PR(>F)'] < alpha):
    print("\nAt least one factor or interaction has a significant effect.")
else:
    print("\nNo significant effects or interactions were found.")


Two-Way ANOVA Results:
                            df    sum_sq   mean_sq         F    PR(>F)
C(Software)                2.0  4.406667  2.203333  5.352227  0.046340
C(Experience)              1.0  0.053333  0.053333  0.129555  0.731225
C(Software):C(Experience)  2.0  0.106667  0.053333  0.129555  0.880879
Residual                   6.0  2.470000  0.411667       NaN       NaN

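At least one factor or interaction has a significant effect.

Reading the table at alpha = 0.05: the software program has a significant main effect on completion time (F = 5.35, p = 0.046), while experience level (F = 0.13, p = 0.731) and the software-by-experience interaction (F = 0.13, p = 0.881) are not significant.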


## Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [6]:
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

#Simulated test scores for the control and experimental groups
control_group = np.array([82, 78, 85, 76, 89, 72, 90, 88, 81, 85, 79, 84, 88, 86, 75, 80, 77, 83, 80, 87,
                          85, 79, 88, 81, 84, 78, 85, 82, 81, 87, 83, 89, 77, 80, 86, 84, 82, 78, 87, 75,
                          79, 83, 81, 80, 86, 83, 82, 88, 84, 79])

experimental_group = np.array([91, 88, 92, 89, 93, 86, 95, 94, 90, 92, 87, 91, 93, 92, 85, 89, 88, 91, 88, 92,
                               90, 87, 94, 89, 91, 86, 92, 88, 87, 93, 90, 95, 85, 88, 92, 91, 88, 86, 93, 85,
                               87, 89, 88, 86, 92, 89, 88, 93, 91, 86])

#Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

alpha = 0.05

print("Two-Sample T-Test Results:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

if p_value < alpha:
    print("\nReject the null hypothesis: There is a significant difference in test scores.")
    print("Performing post-hoc tests...")
    
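    #With only two groups there is a single pairwise comparison, so the t-test above
    #already identifies which groups differ. The Bonferroni step below corrects for
    #one comparison and therefore leaves the p-value unchanged.
    results = stats.ttest_ind(control_group, experimental_group)
    corrected_p_values = multipletests([results.pvalue], method="bonferroni")[1]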
    
    for i in range(len(corrected_p_values)):
        if corrected_p_values[i] < alpha:
            print(f"Group {i+1} significantly differs from the other group(s).")
else:
    print("\nFail to reject the null hypothesis: No significant difference in test scores.")


Two-Sample T-Test Results:
t-statistic: -10.197385641803868
p-value: 4.514955794732805e-17

Reject the null hypothesis: There is a significant difference in test scores.
Performing post-hoc tests...
Group 1 significantly differs from the other group(s).
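
Since a post-hoc test adds nothing when there are only two groups, a more informative follow-up is an effect size, optionally with Welch's t-test as a check that does not assume equal variances. The sketch below uses small hypothetical samples and a helper `cohens_d` written for this example.

```python
import numpy as np
from scipy import stats

def cohens_d(sample_a, sample_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = ((n_a - 1) * np.var(sample_a, ddof=1) +
                  (n_b - 1) * np.var(sample_b, ddof=1)) / (n_a + n_b - 2)
    return (np.mean(sample_a) - np.mean(sample_b)) / np.sqrt(pooled_var)

# Hypothetical small samples, used only to illustrate the calls
control = np.array([78.0, 82.0, 85.0, 80.0, 75.0])
treated = np.array([88.0, 90.0, 86.0, 92.0, 89.0])

# Welch's t-test does not assume equal variances in the two groups
t_welch, p_welch = stats.ttest_ind(control, treated, equal_var=False)
d = cohens_d(control, treated)

print("Welch t-statistic:", t_welch)
print("Welch p-value:", p_welch)
print("Cohen's d:", d)
```

The sign of the t-statistic and of Cohen's d shows the direction of the difference: negative values here mean the control group scored lower.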


## Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [7]:
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

#Simulated daily sales data for the three stores
store_A = np.array([120, 125, 130, 140, 135, 122, 128, 138, 133, 128,
                    130, 136, 129, 125, 132, 131, 135, 130, 128, 127,
                    123, 130, 128, 140, 133, 128, 136, 129, 125, 130])

store_B = np.array([110, 115, 120, 125, 130, 125, 120, 118, 128, 132,
                    121, 129, 130, 135, 140, 138, 133, 129, 126, 120,
                    123, 130, 125, 130, 132, 128, 127, 122, 125, 130])

store_C = np.array([100, 105, 98, 115, 110, 112, 105, 108, 105, 110,
                    118, 120, 115, 122, 125, 120, 130, 128, 118, 115,
                    120, 123, 122, 130, 125, 128, 130, 133, 135, 130])

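#Perform one-way ANOVA (note: this treats the three stores as independent samples
#and ignores that sales were recorded on the same days, so it is not a repeated measures ANOVA)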
f_statistic, p_value = stats.f_oneway(store_A, store_B, store_C)

alpha = 0.05

print("One-Way ANOVA Results:")
print("F-statistic:", f_statistic)
print("p-value:", p_value)

if p_value < alpha:
    print("\nReject the null hypothesis: There is a significant difference in daily sales.")
    print("Performing post-hoc Tukey's HSD test...")

    #Combine data for the post-hoc test
    all_sales = np.concatenate((store_A, store_B, store_C))
    labels = ['Store A'] * len(store_A) + ['Store B'] * len(store_B) + ['Store C'] * len(store_C)

    #Perform post-hoc Tukey's HSD test
    tukey_results = pairwise_tukeyhsd(all_sales, labels, alpha=alpha)
    print(tukey_results)
else:
    print("\nFail to reject the null hypothesis: No significant difference in daily sales.")


One-Way ANOVA Results:
F-statistic: 18.731778543632355
p-value: 1.7169253438768638e-07

Reject the null hypothesis: There is a significant difference in daily sales.
Performing post-hoc Tukey's HSD test...
  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
 group1  group2 meandiff p-adj   lower    upper  reject
-------------------------------------------------------
Store A Store B     -3.6 0.1597  -8.2404  1.0404  False
Store A Store C -11.6333    0.0 -16.2738 -6.9929   True
Store B Store C  -8.0333 0.0002 -12.6738 -3.3929   True
-------------------------------------------------------
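
The question asks for a repeated measures ANOVA, whereas `stats.f_oneway` treats the stores as independent samples. A minimal sketch of the repeated measures computation follows, assuming the same days are observed at every store so that each day acts as a "subject". The function `repeated_measures_anova` and the small `sales` table are hypothetical and written for this illustration; statsmodels also provides an `AnovaRM` class for the same purpose.

```python
import numpy as np
from scipy import stats

def repeated_measures_anova(data):
    """One-way repeated measures ANOVA.

    data: 2-D array, rows = subjects (here: days), columns = conditions (here: stores).
    Returns the F-statistic and p-value for the condition effect.
    """
    n_subjects, n_conditions = data.shape
    grand_mean = data.mean()

    ss_conditions = n_subjects * ((data.mean(axis=0) - grand_mean) ** 2).sum()
    ss_subjects = n_conditions * ((data.mean(axis=1) - grand_mean) ** 2).sum()
    ss_total = ((data - grand_mean) ** 2).sum()
    # Variation left after removing both condition and subject effects
    ss_error = ss_total - ss_conditions - ss_subjects

    df_conditions = n_conditions - 1
    df_error = (n_subjects - 1) * (n_conditions - 1)

    f_statistic = (ss_conditions / df_conditions) / (ss_error / df_error)
    p_value = stats.f.sf(f_statistic, df_conditions, df_error)
    return f_statistic, p_value

# Hypothetical sales for 5 days (rows) at 3 stores (columns)
sales = np.array([
    [120.0, 110.0, 100.0],
    [125.0, 118.0, 104.0],
    [130.0, 121.0, 112.0],
    [140.0, 128.0, 118.0],
    [135.0, 127.0, 115.0],
])

f_rm, p_rm = repeated_measures_anova(sales)
print("Repeated measures F:", f_rm)
print("p-value:", p_rm)
```

Removing the day-to-day variation from the error term is what distinguishes this test from the one-way ANOVA; when days differ a lot from each other, the repeated measures F is typically much larger. This sketch does not check the sphericity assumption.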
