# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

The assumptions required to use ANOVA are given below:

1-Normality: the observations within each group are approximately normally distributed, so that the sampling distribution of the group means is approximately normal.

2-Absence of outliers: extreme scores can distort the group means and inflate the variance, so they should be identified and investigated before the analysis.

3-Homogeneity of variance: the population variances at the different levels of each independent variable are equal.

4-Independence: the samples are random and the observations are independent of one another.



If any of these assumptions are violated, the results of the ANOVA test may not be valid or reliable. Here are some examples of violations that could impact the validity of the results:

1-Independence violation: If the observations in one group are not independent of those in another group, this can lead to biased results. For example, if a study measures the weight of siblings and includes both siblings in the analysis, the observations are not independent, and the ANOVA test results may be invalid.

2-Normality violation: If the data in each group are not normally distributed, the ANOVA test may not be appropriate. For example, if a study measures the salaries of employees in a company and the data are skewed, the ANOVA test may not provide accurate results.

3-Homogeneity of variance violation: If the variances in the different groups are not equal, the ANOVA test may not be appropriate. For example, if a study compares the performance of two different training programs and the variance of scores in one program is much higher than the other, the ANOVA test may not provide accurate results.

It is important to check for these assumptions before conducting an ANOVA test and to address any violations that are found. There are also alternative statistical tests available that may be more appropriate for data that violates these assumptions.
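The normality and equal-variance checks mentioned above can be sketched with SciPy. This is a minimal illustration on hypothetical simulated data (the group means, spread, and seed are assumptions, not values from a real study); it uses the Shapiro-Wilk test for normality within each group and Levene's test for homogeneity of variance.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups drawn from normal populations with equal spread.
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=2.0, size=30) for mu in (10, 12, 15)]

# Shapiro-Wilk tests normality within each group (H0: the data are normal).
shapiro_p = [stats.shapiro(g).pvalue for g in groups]

# Levene's test checks homogeneity of variance (H0: all group variances are equal).
levene_stat, levene_p = stats.levene(*groups)

for i, p in enumerate(shapiro_p, start=1):
    print(f"Group {i}: Shapiro-Wilk p = {p:.3f}")
print(f"Levene p = {levene_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption may not hold. Independence cannot be tested this way; it has to be judged from the study design.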

# Q2. What are the three types of ANOVA, and in what situations would each be used?

There are three types of ANOVA:

1-One-way ANOVA: This is used when there is one categorical independent variable (also known as a factor) with three or more levels, and one continuous dependent variable. One-way ANOVA is used to test whether there are any significant differences between the means of the groups defined by the levels of the categorical variable. For example, a one-way ANOVA could be used to test whether there is a significant difference in the mean weight of three different breeds of dogs.

2-Two-way ANOVA: This is used when there are two independent variables (factors), both categorical, with two or more levels each, and one continuous dependent variable. Two-way ANOVA is used to test whether there are any significant main effects of each independent variable, as well as any significant interaction effect between the two independent variables on the dependent variable. For example, a two-way ANOVA could be used to test whether there is a significant effect of gender and age group on the mean score of a cognitive ability test.

3-Repeated-measures ANOVA: This is used when there is one independent variable (factor) that is measured at multiple time points or conditions, and one continuous dependent variable. Repeated-measures ANOVA is used to test whether there are any significant differences between the means of the repeated measures or conditions, and to control for the effects of individual differences by treating each participant as their own control. For example, a repeated-measures ANOVA could be used to test whether there is a significant effect of a training program on the mean reaction time of participants, where reaction time is measured before and after the training program.

It is important to choose the appropriate type of ANOVA based on the research question and the design of the study. One-way ANOVA is used when there is one independent variable, two-way ANOVA is used when there are two independent variables, and repeated-measures ANOVA is used when there are repeated measurements or conditions of one independent variable.

# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA refers to the process of decomposing the total variance of the dependent variable into its component parts, which are attributed to the different sources of variation in the study. These sources of variation can be divided into two categories: within-group variance (also called error variance) and between-group variance (also called treatment or factor variance). The process of partitioning variance is essential for understanding how much of the total variation in the dependent variable is explained by the independent variables in the study.

In a one-way ANOVA, the total variation of the dependent variable (measured as the total sum of squares) is partitioned into two components:

1-The between-group variance, which represents the variation in the dependent variable that is attributed to the independent variable(s) or factor(s) in the study. This variance reflects the differences between the means of the groups defined by the levels of the independent variable(s).

2-The within-group variance, also called the residual or error variance, which represents the variation in the dependent variable that is not explained by the independent variable(s) or factor(s) in the study. This variance is typically assumed to be due to random error or chance factors.

These two components add up exactly: total sum of squares = between-group sum of squares + within-group sum of squares. In a factorial design, the between-group part is split further into main effects and interaction effects.

Understanding the partitioning of variance is important for several reasons. First, it allows researchers to determine whether there are any significant differences between the means of the groups defined by the levels of the independent variable(s) in the study. Second, it helps researchers to identify the sources of variation that are most important for explaining the variability in the dependent variable. Finally, it enables researchers to estimate the effect size of the independent variable(s) or factor(s) on the dependent variable, which is a crucial piece of information for interpreting the practical significance of the findings.

In summary, the partitioning of variance is a fundamental concept in ANOVA that allows researchers to determine whether there are significant differences between groups, identify the sources of variation, and estimate the effect size of the independent variable(s) or factor(s) on the dependent variable.
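The decomposition can be checked by hand on a tiny data set. The sketch below uses hypothetical toy values (three groups of three observations) and only the standard library; the variable names are illustrative.

```python
from statistics import mean

# Hypothetical toy data: three groups of three observations each.
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

all_values = [x for g in groups for x in g]
grand_mean = mean(all_values)

# Total variation: every observation against the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_values)
# Between-group variation: each group mean against the grand mean, weighted by group size.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
# Within-group variation: each observation against its own group mean.
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

# The F-statistic is the ratio of the two mean squares.
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_statistic = (ss_between / df_between) / (ss_within / df_within)

print("SS total:", ss_total)
print("SS between:", ss_between)
print("SS within:", ss_within)
print("F:", f_statistic)
```

Here the total sum of squares is 60, of which 54 lies between groups and 6 within groups, giving F = (54 / 2) / (6 / 6) = 27.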





# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [4]:
import numpy as np
from scipy.stats import f_oneway

# Generate some example data
group1 = np.random.normal(10, 2, 30)
group2 = np.random.normal(12, 2, 30)
group3 = np.random.normal(15, 2, 30)

# Combine the data into a single array
data = np.concatenate([group1, group2, group3])

# Create an array of labels for the groups
labels = np.concatenate([np.repeat("Group 1", 30), np.repeat("Group 2", 30), np.repeat("Group 3", 30)])

# Conduct the one-way ANOVA
f_statistic, p_value = f_oneway(group1, group2, group3)

# Calculate the degrees of freedom
df_total = len(data) - 1
df_groups = len(np.unique(labels)) - 1
df_error = df_total - df_groups

# Calculate the sums of squares
# Following the question's naming, SSE is the explained (between-group)
# sum of squares and SSR is the residual (within-group) sum of squares.
grand_mean = np.mean(data)
sst = np.sum((data - grand_mean) ** 2)
sse = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in (group1, group2, group3))
ssr = sst - sse

print("SST:", sst)
print("SSE:", sse)
print("SSR:", ssr)


SST: 719.7241202987873
SSE: 345.76318887909326
SSR: 373.960931419694
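As a cross-check, the residual sum of squares can also be computed directly from the within-group deviations instead of by subtraction, and the F-statistic rebuilt from the sums of squares should agree with `f_oneway`. The sketch below assumes a fixed seed (not used in the cell above), so its numbers will differ from the printed output.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(42)  # assumed seed, only to make the sketch reproducible
groups = [rng.normal(mu, 2, 30) for mu in (10, 12, 15)]
data = np.concatenate(groups)
grand_mean = data.mean()

sst = np.sum((data - grand_mean) ** 2)
# Explained (between-group) sum of squares, called SSE in this question.
sse = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Residual (within-group) sum of squares, computed directly rather than as SST - SSE.
ssr = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Rebuild the F-statistic from the mean squares.
df_groups = len(groups) - 1
df_error = len(data) - len(groups)
f_manual = (sse / df_groups) / (ssr / df_error)
f_scipy, p_value = f_oneway(*groups)

print("SST == SSE + SSR:", np.isclose(sst, sse + ssr))
print("F matches f_oneway:", np.isclose(f_manual, f_scipy))
```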


# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [13]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generate some example data
np.random.seed(123)
group1 = np.random.normal(10, 2, 30)
group2 = np.random.normal(12, 2, 30)
group3 = np.random.normal(15, 2, 30)
factor1 = np.repeat(["A", "B", "C"], 30)
factor2 = np.tile(["X", "Y", "Z"], 30)

# Combine the data into a DataFrame
df = pd.DataFrame({"group": np.concatenate([group1, group2, group3]),
                   "factor1": factor1,
                   "factor2": factor2})

# Fit the two-way ANOVA model
model = ols("group ~ factor1 + factor2 + factor1:factor2", data=df).fit()

# Extract the ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)

# Express each effect as its share of the total sum of squares (eta squared);
# the F-tests for the main and interaction effects are in anova_table
main_effect_1 = anova_table.loc["factor1", "sum_sq"] / anova_table["sum_sq"].sum()
main_effect_2 = anova_table.loc["factor2", "sum_sq"] / anova_table["sum_sq"].sum()
interaction_effect = anova_table.loc["factor1:factor2", "sum_sq"] / anova_table["sum_sq"].sum()

print("Main effect of factor 1:", main_effect_1)
print("Main effect of factor 2:", main_effect_2)
print("Interaction effect:", interaction_effect)


Main effect of factor 1: 0.41905394441143456
Main effect of factor 2: 0.02145915933432884
Interaction effect: 0.05251778276398887
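The values printed above are proportions of the total sum of squares. To see where the underlying sums of squares come from, the sketch below computes them by hand for a balanced design using NumPy, then derives the F-statistics and p-values. The data, the seed, and the added factor 1 effect are hypothetical; the formulas assume equal cell sizes.

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(123)
a, b, n = 3, 3, 10  # levels of factor 1, levels of factor 2, replicates per cell
# y[i, j, k]: k-th observation in the cell (factor 1 level i, factor 2 level j).
y = rng.normal(loc=12, scale=2, size=(a, b, n))
y += np.array([-2.0, 0.0, 3.0])[:, None, None]  # hypothetical main effect of factor 1

grand = y.mean()
mean_a = y.mean(axis=(1, 2))  # marginal means of factor 1
mean_b = y.mean(axis=(0, 2))  # marginal means of factor 2
cell = y.mean(axis=2)         # cell means

# Sums of squares for a balanced two-way design.
ss_a = b * n * np.sum((mean_a - grand) ** 2)
ss_b = a * n * np.sum((mean_b - grand) ** 2)
ss_ab = n * np.sum((cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2)
ss_error = np.sum((y - cell[:, :, None]) ** 2)
ss_total = np.sum((y - grand) ** 2)

# F-statistics: each effect's mean square over the error mean square.
df_error = a * b * (n - 1)
ms_error = ss_error / df_error
f_a = (ss_a / (a - 1)) / ms_error
f_b = (ss_b / (b - 1)) / ms_error
f_ab = (ss_ab / ((a - 1) * (b - 1))) / ms_error

p_a = f_dist.sf(f_a, a - 1, df_error)
p_b = f_dist.sf(f_b, b - 1, df_error)
p_ab = f_dist.sf(f_ab, (a - 1) * (b - 1), df_error)

print(f"Factor 1:    F = {f_a:.2f}, p = {p_a:.4f}")
print(f"Factor 2:    F = {f_b:.2f}, p = {p_b:.4f}")
print(f"Interaction: F = {f_ab:.2f}, p = {p_ab:.4f}")
```

In a balanced design the four components add up to the total sum of squares, which is a useful sanity check on any hand calculation.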


# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

If a one-way ANOVA yields an F-statistic of 5.23 and a p-value of 0.02, then at the conventional 0.05 significance level we reject the null hypothesis that all group means are equal. The F-statistic is the ratio of the between-group mean square to the within-group mean square, so here the between-group mean square is about five times the within-group mean square. If the group means were truly equal, an F-statistic at least this large would occur only about 2% of the time. We can therefore conclude that at least one group has a mean that is significantly different from the others.

However, we cannot determine which specific groups are different from each other based solely on the ANOVA results. Further analysis, such as post-hoc tests or confidence intervals, may be needed to determine the nature of the differences between the groups.
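The p-value attached to an F-statistic depends on the degrees of freedom, which the question does not give. The sketch below assumes a hypothetical design of 3 groups with 10 observations each, so the p-value it produces will not match the 0.02 in the question exactly; it only illustrates how the F-distribution turns an F-statistic into a p-value and a critical value.

```python
from scipy.stats import f

f_statistic = 5.23

# Hypothetical design (not stated in the question): 3 groups, 30 observations in total.
df_between = 3 - 1
df_within = 30 - 3

# Upper-tail probability of the F-distribution at the observed statistic.
p_value = f.sf(f_statistic, df_between, df_within)
# Critical value for a test at the 0.05 level.
critical_value = f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"critical value at alpha = 0.05: {critical_value:.3f}")
print("reject H0:", f_statistic > critical_value)
```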

# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Handling missing data in a repeated measures ANOVA can be challenging, as the repeated nature of the design means that missing data can have a more profound impact on the analysis than in other designs. There are several methods for handling missing data in a repeated measures ANOVA, including:

1-Pairwise deletion: This method uses, for each comparison, all cases that have data on the variables involved, so different comparisons may rest on different subsets of participants. It keeps more data than listwise deletion, but the varying sample sizes complicate the analysis, and the estimates can be biased if the data are not missing completely at random.

2-Listwise deletion: This method omits any participant who has a missing value at any time point or condition. It keeps the design balanced, but it can lead to a substantial loss of statistical power, and like pairwise deletion it produces biased estimates unless the data are missing completely at random.

3-Imputation: This method involves replacing missing data with estimated values based on the available data. There are several imputation methods, including mean imputation, regression imputation, and multiple imputation. Imputation retains every participant, but single imputation methods such as mean imputation understate the variability in the data, and any imputation can introduce bias if the imputation model is misspecified.
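The trade-off between deletion and imputation can be illustrated with pandas. The sketch below uses hypothetical wide-format data (one row per subject, one column per time point; all values are made up) and compares listwise deletion with mean imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point.
wide = pd.DataFrame({
    "subject": [1, 2, 3, 4, 5],
    "t1": [10.0, 12.0, 11.0, 13.0, 9.0],
    "t2": [11.0, np.nan, 12.0, 14.0, 10.0],
    "t3": [12.0, 13.0, np.nan, 15.0, 11.0],
})
time_cols = ["t1", "t2", "t3"]

# Listwise deletion: keep only subjects observed at every time point.
listwise = wide.dropna(subset=time_cols)

# Mean imputation: fill each gap with that time point's observed mean.
imputed = wide.copy()
imputed[time_cols] = imputed[time_cols].fillna(imputed[time_cols].mean())

var_observed = wide["t2"].var()    # NaN values are skipped
var_imputed = imputed["t2"].var()  # imputed value sits exactly at the mean

print("Subjects before / after listwise deletion:", len(wide), "/", len(listwise))
print("Variance of t2, observed vs mean-imputed:", var_observed, "vs", var_imputed)
```

Listwise deletion drops two of the five subjects, while mean imputation keeps all five but shrinks the variance at the affected time point, which can make the F-test look more significant than it should.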

# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Some common post-hoc tests include:

1-Tukey's HSD (honestly significant difference): This test compares all possible pairs of means and controls the family-wise error rate, making it a good choice for situations where multiple comparisons are being made.

2-Bonferroni correction: This divides the alpha level by the number of comparisons (equivalently, it multiplies each p-value by that number). It is generally more conservative than Tukey's HSD and is a good choice when only a small number of planned comparisons are made.

3-Scheffe's test: This is a conservative test that controls the family-wise error rate for all possible contrasts, not just pairwise comparisons, and it can be used when the sample sizes are unequal.

4-Dunnett's test: This test is used when there is a control group and the other groups are being compared to the control.

5-Games-Howell test: This test is used when the assumption of equal variances across groups is not met.

A situation where a post-hoc test might be necessary is if an ANOVA indicates a significant difference between at least two groups, but it is unclear which specific groups are different from each other. For example, a researcher might conduct an ANOVA to compare the mean scores of three different treatment groups on a measure of anxiety. If the ANOVA shows a significant difference between the groups, a post-hoc test such as Tukey's HSD or Bonferroni correction can be used to determine which specific groups differ from each other in terms of their anxiety scores.
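A Bonferroni-corrected set of pairwise comparisons can be sketched with SciPy alone. The anxiety scores below are hypothetical, and the correction is applied by hand (each raw p-value multiplied by the number of comparisons and capped at 1) rather than through a dedicated post-hoc function.

```python
from itertools import combinations
from scipy import stats

# Hypothetical anxiety scores for three treatment groups.
groups = {
    "A": [22, 25, 27, 24, 26, 23, 25, 28],
    "B": [20, 21, 19, 22, 23, 20, 21, 18],
    "C": [24, 26, 25, 27, 23, 26, 28, 25],
}

pairs = list(combinations(groups, 2))
results = {}
for g1, g2 in pairs:
    t_stat, p_raw = stats.ttest_ind(groups[g1], groups[g2])
    # Bonferroni: multiply each raw p-value by the number of comparisons, capped at 1.
    results[(g1, g2)] = min(p_raw * len(pairs), 1.0)

for (g1, g2), p_adj in results.items():
    verdict = "differ" if p_adj < 0.05 else "no significant difference"
    print(f"{g1} vs {g2}: adjusted p = {p_adj:.4f} ({verdict})")
```

With these made-up scores, group B stands apart from A, while A and C are not distinguishable, which is exactly the kind of detail the omnibus F-test cannot provide.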

# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [16]:
import pandas as pd
import scipy.stats as stats

# create a dataframe with the weight loss data for each diet
data = {'Diet A': [5.2, 4.5, 3.7, 4.1, 5.0, 4.9, 4.6, 4.4, 4.8, 4.2, 4.7, 4.9, 4.3, 4.0, 4.6, 3.8, 4.5, 4.1, 4.7, 4.4, 4.2, 4.3, 4.0, 3.6, 4.8],
        'Diet B': [4.7, 4.4, 4.2, 4.0, 4.5, 4.1, 4.3, 4.8, 4.6, 4.5, 4.0, 3.9, 3.8, 3.7, 4.1, 4.2, 4.5, 3.9, 3.5, 3.8, 3.6, 3.7, 3.9, 3.5, 4.1],
        'Diet C': [3.5, 3.7, 3.8, 3.1, 3.6, 3.9, 3.4, 3.2, 3.5, 3.4, 3.3, 3.7, 3.1, 3.6, 3.8, 3.5, 3.4, 3.6, 3.9, 3.1, 3.3, 3.0, 3.2, 3.1, 3.3]}
df = pd.DataFrame(data)

# conduct one-way ANOVA
f_statistic, p_value = stats.f_oneway(df['Diet A'], df['Diet B'], df['Diet C'])

# report results
print("F-statistic:", f_statistic)
print("p-value:", p_value)

if p_value < 0.05:
    print("There is evidence of significant differences between the mean weight loss of the three diets.")
else:
    print("There is not enough evidence to conclude that there are significant differences between the mean weight loss of the three diets.")


F-statistic: 47.67360331577585
p-value: 6.511859804819824e-14
There is evidence of significant differences between the mean weight loss of the three diets.


# Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [17]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# create a sample data frame
data = {'program': ['A', 'B', 'C'] * 20,
        'experience': ['novice'] * 30 + ['experienced'] * 30,
        'time': [12, 15, 13, 18, 19, 16, 14, 17, 12, 16, 13, 14, 16, 18, 20, 17, 16, 14, 15, 12] * 3}
df = pd.DataFrame(data)

model = ols('time ~ C(program) + C(experience) + C(program):C(experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# print ANOVA table
print(anova_table)

                                sum_sq    df             F    PR(>F)
C(program)                4.760003e-28   2.0  4.168670e-29  1.000000
C(experience)             1.500000e-01   1.0  2.627311e-02  0.871840
C(program):C(experience)  1.120000e+01   2.0  9.808628e-01  0.381568
Residual                  3.083000e+02  54.0           NaN       NaN


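From the ANOVA table, we can see that neither the main effect of program (F ≈ 0, p = 1.00) nor the main effect of experience (F = 0.026, p = 0.872) is significant, as both p-values are greater than 0.05. The interaction effect between program and experience is also not significant (F = 0.981, p = 0.382). The sum of squares for program is effectively zero (about 4.8e-28, which is floating-point noise) because the sample data are constructed so that each program receives exactly the same set of completion times. Therefore, we cannot conclude that the average time to complete the task differs between the three software programs, between novice and experienced employees, or between any combination of the two.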
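The near-zero sum of squares for `C(program)` in the table above can be traced to how the sample data are built. The sketch below rebuilds the same data frame and prints the marginal and cell means; the `pivot_table` layout is just one convenient way to view them.

```python
import pandas as pd

# Same construction as the sample data frame above.
times = [12, 15, 13, 18, 19, 16, 14, 17, 12, 16, 13, 14, 16, 18, 20, 17, 16, 14, 15, 12]
df = pd.DataFrame({
    "program": ["A", "B", "C"] * 20,
    "experience": ["novice"] * 30 + ["experienced"] * 30,
    "time": times * 3,
})

# Marginal means per program: identical values explain the near-zero sum of squares.
program_means = df.groupby("program")["time"].mean()

# Cell means: rows are programs, columns are experience levels.
cell_means = df.pivot_table(index="program", columns="experience",
                            values="time", aggfunc="mean")

print(program_means)
print(cell_means)
```

Because the 20 time values cycle against the 3 program labels, each program ends up with every time value exactly once, so all three program means equal 15.35. Any differences show up only in the cell means, which is what the interaction term tests.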

# Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [18]:
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

# generate data for control and experimental groups
control_scores = np.random.normal(loc=75, scale=10, size=100)
experimental_scores = np.random.normal(loc=80, scale=10, size=100)

t_stat, p_val = stats.ttest_ind(control_scores, experimental_scores)
print("Two-sample t-test results:")
print("t-statistic =", t_stat)
print("p-value =", p_val)

if p_val < 0.05:
    data = {"score": np.concatenate([control_scores, experimental_scores]),
            "group": np.concatenate([np.repeat("control", 100), np.repeat("experimental", 100)])}
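    # With only two groups, Tukey's HSD reproduces the t-test result; it is run
    # here because the question asks for a post-hoc follow-up.
    tukey_result = sm.stats.multicomp.pairwise_tukeyhsd(data["score"], data["group"])

    print("Post-hoc test results:")
    print(tukey_result.summary())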

Two-sample t-test results:
t-statistic = -2.4757439553787837
p-value = 0.014135299962093376
Post-hoc test results:
   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj  lower  upper  reject
---------------------------------------------------------
control experimental   3.3729 0.0141 0.6863 6.0596   True
---------------------------------------------------------
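In the output above, the Tukey adjusted p-value (0.0141) equals the t-test p-value, because with only two groups there is a single comparison and nothing to adjust. The negative t-statistic and the positive mean difference both say that the experimental group scored higher. The sketch below, on hypothetical simulated scores with an assumed seed, shows the related fact that a two-group one-way ANOVA is equivalent to the pooled-variance t-test, and adds Welch's t-test as an option when equal variances are in doubt.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # assumed seed, only to make the sketch reproducible
control = rng.normal(loc=75, scale=10, size=100)
experimental = rng.normal(loc=80, scale=10, size=100)

# Pooled-variance two-sample t-test and one-way ANOVA on the same two groups.
t_stat, p_t = stats.ttest_ind(control, experimental)
f_stat, p_f = stats.f_oneway(control, experimental)

print("t^2 equals F:", np.isclose(t_stat ** 2, f_stat))
print("p-values match:", np.isclose(p_t, p_f))

# Welch's t-test drops the equal-variance assumption.
t_welch, p_welch = stats.ttest_ind(control, experimental, equal_var=False)
print(f"Welch t = {t_welch:.3f}, p = {p_welch:.4f}")
```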


# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any

In [19]:
import pandas as pd
import numpy as np
from statsmodels.stats.anova import AnovaRM

# Generate sample data
store_a_sales = np.random.normal(1000, 100, 30)
store_b_sales = np.random.normal(1200, 150, 30)
store_c_sales = np.random.normal(900, 120, 30)

sales_df = pd.DataFrame({
    'Store A': store_a_sales,
    'Store B': store_b_sales,
    'Store C': store_c_sales
})

sales_df_melted = pd.melt(sales_df.reset_index(), id_vars=['index'], value_vars=['Store A', 'Store B', 'Store C'])
sales_df_melted.columns = ['day', 'store', 'sales']

# Perform repeated measures ANOVA
rm_anova = AnovaRM(sales_df_melted, 'sales', 'day', within=['store'])
res = rm_anova.fit()
print(res)

               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
store 46.4540 2.0000 58.0000 0.0000

The repeated measures ANOVA gives F(2, 58) = 46.45 with a p-value below 0.001 (printed as 0.0000), so mean daily sales differ significantly between at least two of the three stores. Identifying which stores differ would require a post-hoc comparison.

