Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

Analysis of Variance (ANOVA) is a statistical method used to compare the means of three or more groups to determine if there are statistically significant differences between them. To use ANOVA effectively, certain assumptions must be met. Violations of these assumptions can impact the validity of the results. The key assumptions for ANOVA are:

1. **Independence of Observations:**
   - **Assumption:** ANOVA assumes that every observation is independent of the others, both within a group and across groups. The value recorded for one subject should not be influenced by or correlated with the value recorded for any other subject.
   - **Violation Example:** In a medical study, if data from multiple patients in the same family are included in different treatment groups, the observations may not be independent due to shared genetic factors or environmental influences.

2. **Normality:**
   - **Assumption:** ANOVA assumes that the residuals (the differences between the observed values and the group means) are normally distributed. This is important for the accuracy of the p-values and confidence intervals.
   - **Violation Example:** Strongly skewed or heavy-tailed residuals indicate a violation of the normality assumption. For instance, income data are typically right-skewed, so an ANOVA comparing incomes across groups can give unreliable p-values, especially when group sizes are small.

3. **Homogeneity of Variance (Homoscedasticity):**
   - **Assumption:** ANOVA assumes that the variances within each group are roughly equal. In other words, the spread or dispersion of data points should be consistent across all groups.
   - **Violation Example:** If the variances in different groups are significantly different, it can impact the validity of ANOVA results. For instance, in a study comparing the test scores of students from different schools, if one school has much larger score variations, it might violate this assumption.

4. **Independence of Errors:**
   - **Assumption:** This restates the first assumption in terms of the model: the errors (residuals) should be independent of each other, meaning that the error in one observation should not be correlated with the error in another observation.
   - **Violation Example:** In a time-series analysis, if the residuals show autocorrelation (i.e., the error in one observation is related to the error in a previous observation), it would violate this assumption.

Violations of these assumptions can lead to incorrect inferences and misinterpretation of ANOVA results. In cases of severe violations, it's advisable to consider alternative statistical methods or transformations of the data to address the issues. Additionally, robust ANOVA methods, like Welch's ANOVA, can be used when the assumption of homogeneity of variances is violated. In some cases, data transformation techniques, such as logarithmic or Box-Cox transformations, can help make the data more normally distributed and reduce the impact of violations. However, it's essential to be cautious and consider the context of the data when addressing these assumptions to ensure the validity of ANOVA results.
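
The normality and equal-variance assumptions above can be checked numerically before running the ANOVA. The following is a minimal sketch, assuming three small illustrative groups (the values are made up for demonstration) and using SciPy's Shapiro-Wilk test on the residuals, Levene's test across groups, and the Kruskal-Wallis test as a rank-based fallback. Independence cannot be tested this way; it has to be justified by the study design.

```python
import numpy as np
from scipy import stats

# Illustrative data only: three small groups (assumed values)
groups = [
    np.array([10, 15, 12, 17, 14], dtype=float),
    np.array([8, 11, 9, 14, 10], dtype=float),
    np.array([12, 14, 16, 19, 15], dtype=float),
]

# Residuals are the deviations of each observation from its own group mean
residuals = np.concatenate([g - g.mean() for g in groups])

# Normality of residuals: a small p-value suggests non-normal residuals
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: a small p-value suggests unequal group variances
levene_stat, levene_p = stats.levene(*groups)

# Rank-based alternative that does not rely on normality
kruskal_stat, kruskal_p = stats.kruskal(*groups)

print(f"Shapiro-Wilk p-value:   {shapiro_p:.4f}")
print(f"Levene p-value:         {levene_p:.4f}")
print(f"Kruskal-Wallis p-value: {kruskal_p:.4f}")
```

With samples this small, these tests have little power, so a non-significant result is weak evidence that the assumption holds; residual plots are a useful complement.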

Q2. What are the three types of ANOVA, and in what situations would each be used?

Analysis of Variance (ANOVA) is a statistical technique that assesses the variation in a dataset to analyze the differences between group means. There are three primary types of ANOVA, each used in different situations:

1. **One-Way ANOVA:**
   - **Use Case:** One-Way ANOVA is used when you have one categorical independent variable (factor) with more than two levels or groups, and you want to determine if there are statistically significant differences in the means of a continuous dependent variable among these groups. It helps answer questions like "Do different treatments lead to significantly different outcomes?"
   - **Example:** Comparing the test scores of students who have received three different types of training programs (Group A, Group B, and Group C) to determine if there are significant differences in their scores.

2. **Two-Way ANOVA:**
   - **Use Case:** Two-Way ANOVA is used when you have two independent categorical variables (factors) and one continuous dependent variable. It assesses the influence of both factors, their interaction, and any main effects on the dependent variable. It helps answer questions like "Does a change in both factor A and factor B lead to significantly different outcomes?"
   - **Example:** Analyzing the effects of both gender (Male/Female) and diet type (Diet A/Diet B) on weight loss in a study.

3. **Repeated Measures ANOVA:**
   - **Use Case:** Repeated Measures ANOVA is used when you have a single group of subjects and you measure their responses under different conditions or at different time points. It helps assess changes over time or in response to various conditions.
   - **Example:** Evaluating the impact of a drug treatment on patients' blood pressure at multiple time points (baseline, 1 week, 2 weeks, etc.).

Each type of ANOVA is applied in specific situations depending on the design and characteristics of your data. It's important to choose the appropriate type of ANOVA to ensure the accuracy and relevance of your statistical analysis. Additionally, when conducting ANOVA, it's crucial to check and meet the underlying assumptions, such as normality and homogeneity of variances, to obtain valid results. Violations of these assumptions may require the use of alternative methods or transformations to address the issues.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in Analysis of Variance (ANOVA) refers to the process of decomposing the total variance in a dataset into different components or sources of variance. Understanding this concept is crucial because it helps in quantifying and explaining the variation in the dependent variable, which is essential for drawing meaningful conclusions from ANOVA analyses.

In a one-way ANOVA, the total variation is measured with sums of squares and split into two components that add up to the total:

1. **Between-Group Sum of Squares (SSB):** This component measures the variation in the dependent variable that is attributable to differences between the group means. It quantifies the effect of the independent variable (factor) on the dependent variable. A large SSB relative to SSW suggests that the group means differ by more than chance alone would produce.

2. **Within-Group Sum of Squares (SSW):** This component accounts for the variation of observations around their own group mean. It represents the variation that the independent variable does not explain, such as individual differences, measurement error, and other uncontrolled factors.

3. **Total Sum of Squares (SST):** This is the overall variation in the dependent variable across all observations, measured around the grand mean. It equals the sum of the two components (SST = SSB + SSW) and serves as the baseline for the overall variability in the data.
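
Written out for a one-way design, the decomposition and the resulting F ratio look as follows. The notation is introduced here for illustration: k groups, n_i observations in group i, N observations in total, y_ij the j-th observation in group i, ȳ_i the mean of group i, and ȳ the grand mean.

```latex
\begin{aligned}
SST &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}\right)^2 \\
SSB &= \sum_{i=1}^{k} n_i \left(\bar{y}_i - \bar{y}\right)^2 \\
SSW &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_i\right)^2 \\
SST &= SSB + SSW \\
F &= \frac{SSB/(k-1)}{SSW/(N-k)}, \qquad N = \sum_{i=1}^{k} n_i
\end{aligned}
```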

The partitioning of variance is important for several reasons:

1. **Hypothesis Testing:** ANOVA uses the partitioned variances to assess whether the differences between group means are statistically significant. This forms the basis of hypothesis testing to determine if the independent variable has a significant effect on the dependent variable.

2. **Effect Size:** By quantifying the proportion of variance explained by the independent variable (SSB / SST), ANOVA provides a measure of effect size. Effect size helps in understanding the practical significance of the differences observed and is valuable for comparing the impact of different factors.

3. **Model Assessment:** Partitioning variance helps in evaluating the goodness of fit of the model. A large proportion of between-group variance compared to within-group variance suggests that the model is a good fit for the data, indicating that the independent variable explains a significant portion of the variability.

4. **Post Hoc Testing:** After ANOVA identifies significant differences between groups, post hoc tests (e.g., Tukey's HSD, Bonferroni, or LSD) can be performed to identify which specific group pairs have significant differences.

5. **Understanding Sources of Variation:** By identifying how variance is divided between the factors and within groups, ANOVA allows researchers to gain insights into the sources of variation and make informed decisions.

In summary, the partitioning of variance in ANOVA is a fundamental concept that helps researchers analyze and interpret the sources of variability in their data. It plays a critical role in hypothesis testing, effect size estimation, model assessment, and making informed conclusions about the impact of independent variables on the dependent variable.

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

To calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python, you can use NumPy for the arithmetic and SciPy for the F-test. The labels here follow the question; many textbooks instead use SSE for the error (residual) sum of squares and SSR for the regression (explained) sum of squares, so check which convention a source uses.

The code below assumes the data are organized in groups (e.g., treatment groups). It first calculates SST by summing the squared differences between each data point and the overall mean. It then calculates SSE by summing the squared differences between each group mean and the overall mean, weighted by the number of observations in each group. Finally, it calculates SSR as the difference between SST and SSE.

The F-statistic and p-value from the one-way ANOVA test are also calculated, which can help determine if there are statistically significant differences between the group means.

In [1]:
import numpy as np
from scipy.stats import f_oneway

group1 = [10, 15, 12, 17, 14]
group2 = [8, 11, 9, 14, 10]
group3 = [12, 14, 16, 19, 15]

data = np.concatenate([group1, group2, group3])

group_means = [np.mean(group1), np.mean(group2), np.mean(group3)]

overall_mean = np.mean(data)

SST = np.sum((data - overall_mean) ** 2)

SSE = np.sum([len(group) * (mean - overall_mean) ** 2 for group, mean in zip([group1, group2, group3], group_means)])

SSR = SST - SSE

f_statistic, p_value = f_oneway(group1, group2, group3)

print("Total Sum of Squares (SST):", SST)
print("Explained Sum of Squares (SSE):", SSE)
print("Residual Sum of Squares (SSR):", SSR)
print("F-statistic:", f_statistic)
print("p-value:", p_value)


Total Sum of Squares (SST): 136.9333333333333
Explained Sum of Squares (SSE): 59.73333333333331
Residual Sum of Squares (SSR): 77.19999999999999
F-statistic: 4.642487046632126
p-value: 0.03211062219338496
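
To make the link between the sums of squares and the F-test explicit, the sketch below recomputes the F-statistic and p-value by hand from the same three groups and compares them with `f_oneway`. It also reports eta-squared (explained sum of squares divided by total) as an effect size. The variable names are illustrative, not part of any library API.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([10, 15, 12, 17, 14], dtype=float),
    np.array([8, 11, 9, 14, 10], dtype=float),
    np.array([12, 14, 16, 19, 15], dtype=float),
]
data = np.concatenate(groups)
k, n_total = len(groups), data.size
grand_mean = data.mean()

# Explained (between-group) and residual (within-group) sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares divide each sum of squares by its degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

# F is the ratio of mean squares; the p-value is the upper tail of F(k-1, N-k)
f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

# Proportion of total variation explained by group membership
eta_squared = ss_between / (ss_between + ss_within)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual F = {f_manual:.4f}, scipy F = {f_scipy:.4f}")
print(f"manual p = {p_manual:.4f}, scipy p = {p_scipy:.4f}")
print(f"eta-squared = {eta_squared:.4f}")
```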


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [2]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({'A': ['A1', 'A2', 'A1', 'A2', 'A1', 'A2'],
                     'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1'],
                     'Y': [10, 15, 12, 17, 14, 20]})

model = ols('Y ~ A * B', data=data).fit()

effects = sm.stats.anova_lm(model, typ=2)

print("Main Effects and Interaction Effects:")
print(effects)


Main Effects and Interaction Effects:
             sum_sq   df         F    PR(>F)
A         42.666667  1.0  4.162602  0.178135
B          0.083333  1.0  0.008130  0.936372
A:B        0.083333  1.0  0.008130  0.936372
Residual  20.500000  2.0       NaN       NaN


In this example:

We define the data using a Pandas DataFrame where 'A' and 'B' represent the two categorical independent variables (factors), and 'Y' represents the continuous dependent variable.

We fit a two-way ANOVA model using the ols function from StatsModels, specifying the formula 'Y ~ A * B' to include both main effects and their interaction.

We calculate the main effects and interaction effect using sm.stats.anova_lm and specifying typ=2 to use the Type II sums of squares.

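The resulting table reports the sum of squares, degrees of freedom, F-statistic, and p-value for the main effects of factors 'A' and 'B' and for their interaction 'A:B'. In the output above, none of the three effects is significant at the 0.05 level (p ≈ 0.18, 0.94, and 0.94), which is unsurprising with only six observations and two residual degrees of freedom.

This example demonstrates how to perform a two-way ANOVA using Python and extract the main effects and interaction effect from the model. With a realistically sized dataset, you would interpret these effects to understand the impact of each factor and their interaction on the dependent variable.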







Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

When conducting a one-way ANOVA and obtaining an F-statistic and a p-value, you can make conclusions about the differences between the groups as follows:

1. **F-Statistic:**
   - The F-statistic is a measure of the variability between the group means relative to the variability within the groups. In other words, it quantifies the ratio of the explained variance (due to group differences) to the unexplained variance (within-group variation).

2. **P-Value:**
   - The p-value associated with the F-statistic indicates the probability of obtaining the observed F-statistic or a more extreme one if there were no true differences between the groups (i.e., under the null hypothesis). A small p-value suggests that the observed differences are unlikely to be due to random chance.

In your case, you obtained an F-statistic of 5.23 and a p-value of 0.02. Here's how you can interpret these results:

- The F-statistic of 5.23 means that the between-group mean square is about 5.23 times the within-group mean square. On its own it does not establish significance; it has to be compared against the F-distribution with the appropriate degrees of freedom, which is what the p-value does.

- The p-value of 0.02 is below the conventional significance level of α = 0.05. If the null hypothesis were true (all group means equal), an F-statistic at least as large as 5.23 would occur only about 2% of the time. You therefore reject the null hypothesis and conclude that at least one group mean differs from the others.

In summary:

- You can conclude that there are statistically significant differences between the groups based on the low p-value.
- The ANOVA does not identify which group(s) differ from the others; post hoc tests or pairwise comparisons (e.g., Tukey's HSD) are needed for that.
- The ANOVA also does not report the direction or magnitude of the differences. Inspect the group means and an effect size measure such as eta-squared to judge practical significance.
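
The relationship between the F-statistic and its p-value can be sketched with SciPy's F distribution. The question does not state the degrees of freedom, so the values below (3 groups and 17 observations, giving 2 and 14 degrees of freedom) are purely hypothetical, chosen because they roughly reproduce the reported p-value.

```python
from scipy import stats

f_observed = 5.23
alpha = 0.05

# Hypothetical degrees of freedom: not given in the question
df_between = 2   # assumed: 3 groups - 1
df_within = 14   # assumed: 17 observations - 3 groups

# p-value: probability of an F at least this large if all group means are equal
p_value = stats.f.sf(f_observed, df_between, df_within)

# Critical value: the smallest F that would be significant at alpha
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"critical F at alpha={alpha}: {f_critical:.3f}")
print("reject H0" if f_observed > f_critical else "fail to reject H0")
```

Comparing the observed F against the critical value and comparing the p-value against α are two views of the same decision.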

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

Handling missing data in a repeated measures ANOVA is an important aspect of data analysis. Repeated measures ANOVA involves collecting data from the same subjects or objects at multiple time points or under multiple conditions. Missing data can occur for various reasons, such as subjects not completing the study at all time points, technical issues, or subjects skipping particular measurements. How you handle missing data can have significant consequences for the validity and interpretability of your analysis. Here are some common methods for handling missing data and their potential consequences:

1. **Complete Case Analysis (Listwise Deletion):**
   - **Method:** Remove cases (subjects) with any missing data from the analysis.
   - **Consequences:**
     - Pros: Simple and straightforward.
     - Cons: Reduces the sample size, potentially leading to a loss of statistical power and generalizability. If the data are not missing completely at random (MCAR), this method can introduce bias and affect the validity of the results.

2. **Pairwise Deletion (Available Case Analysis):**
   - **Method:** Include cases with available data for each specific comparison or time point in the analysis.
   - **Consequences:**
     - Pros: Retains all available data, which maximizes sample size and statistical power for each specific comparison.
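     - Cons: May introduce bias if the data are not missing completely at random, because the data included in each comparison are not necessarily representative of the entire sample. Different comparisons also rest on different subsets of subjects, which complicates interpretation.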

3. **Imputation Methods:**
   - **Method:** Replace missing data with estimated values using imputation techniques, such as mean imputation, regression imputation, or multiple imputation.
   - **Consequences:**
     - Pros: Retains the entire sample, maintains statistical power, and can mitigate the bias introduced by missing data.
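     - Cons: Imputed values are estimates, not observations. Single-imputation approaches such as mean imputation understate the true variability, which makes standard errors too small and hypothesis tests overconfident. Multiple imputation addresses this by propagating the imputation uncertainty, at the cost of extra complexity.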

4. **Mixed Effects Models (Linear Mixed Models):**
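   - **Method:** Fit a linear mixed model (LMM), which uses every available observation from each subject and so accommodates missing data within the modeling framework. Generalized estimating equations (GEE) are a related alternative for correlated data, although they are not mixed effects models.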
   - **Consequences:**
     - Pros: Retains all available data, properly models within-subject correlations, and provides valid estimates and standard errors.
     - Cons: More complex to implement and may require knowledge of advanced statistical techniques. The validity of results depends on the correctness of the model specification and the assumption that data is missing at random (MAR).

The choice of how to handle missing data in a repeated measures ANOVA should depend on the specific circumstances, the nature of the missing data, and the potential consequences of the chosen method. It is essential to consider the potential impact on the validity of the results, the sample size, and the statistical power. If data are not missing completely at random (MCAR), imputation methods and mixed effects models may provide more accurate and unbiased results compared to complete case analysis or pairwise deletion. However, the assumptions underlying imputation and mixed effects models should be carefully evaluated.
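
The difference between listwise deletion and a mixed model can be seen in how many observations each approach keeps. The sketch below uses simulated data (20 hypothetical subjects, 3 time points, 8 values removed at random, so the data are MCAR by construction) and fits a random-intercept model with statsmodels' `mixedlm`. All column names and simulation settings are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_subjects, times = 20, ["t0", "t1", "t2"]

# Simulate long-format data: subject-level offset plus a time trend plus noise
subject_effect = rng.normal(0, 2, n_subjects)
rows = [
    {"subject": s, "time": t,
     "y": 50 + 1.5 * k + subject_effect[s] + rng.normal(0, 1)}
    for s in range(n_subjects)
    for k, t in enumerate(times)
]
long_df = pd.DataFrame(rows)

# Remove 8 measurements completely at random
missing_idx = rng.choice(long_df.index, size=8, replace=False)
long_df.loc[missing_idx, "y"] = np.nan

# Available-data view: every non-missing row is kept
available = long_df.dropna(subset=["y"])

# Listwise deletion: keep only subjects observed at all time points
counts = available.groupby("subject")["y"].count()
complete_ids = counts[counts == len(times)].index
complete_cases = available[available["subject"].isin(complete_ids)]

print("rows kept by listwise deletion:", len(complete_cases))
print("rows used by the mixed model:  ", len(available))

# Random-intercept model using all available rows
mixed = smf.mixedlm("y ~ C(time)", data=available,
                    groups=available["subject"]).fit()
print(mixed.params)
```

The mixed model keeps partially observed subjects, whereas listwise deletion discards every row belonging to a subject with even one missing value.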

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

Post-hoc tests are used in the context of analysis of variance (ANOVA) to perform pairwise comparisons among multiple groups when a significant difference is found in the ANOVA. These tests help identify which specific group(s) differ from each other. There are several common post-hoc tests, and the choice of which one to use depends on the design of the experiment and the assumptions underlying the data. Here are some common post-hoc tests and when you might use each one:

1. **Tukey's Honestly Significant Difference (Tukey's HSD):**
   - **Use Case:** Tukey's HSD is widely used when you have performed a one-way ANOVA and you want to compare all pairs of group means. It controls the familywise error rate and is suitable when you have a balanced design (equal sample sizes) and homogeneity of variances.
   - **Example:** You conducted an experiment to compare the performance of four different teaching methods, and the one-way ANOVA showed a significant difference in the means. Tukey's HSD can help determine which teaching methods are significantly different from each other.

2. **Bonferroni Correction:**
   - **Use Case:** The Bonferroni correction is a conservative method that can be used in various ANOVA designs to control the familywise error rate. It is applicable when you want to make multiple pairwise comparisons while maintaining a low overall Type I error rate. It is suitable for both balanced and unbalanced designs but can be overly conservative for large numbers of comparisons.
   - **Example:** You are comparing the performance of different drug treatments, and you want to assess the differences between all possible pairs of treatments after a one-way ANOVA.

3. **Sidak Correction:**
   - **Use Case:** The Sidak correction is similar to Bonferroni but slightly less conservative. It can be used when you have multiple pairwise comparisons to make after ANOVA and want to control the familywise error rate. For m comparisons it tests each at 1 - (1 - α)^(1/m) rather than Bonferroni's α/m, and it is exact when the comparisons are independent.
   - **Example:** You are conducting a post-hoc analysis of multiple comparisons among means in a two-way ANOVA with interaction effects.

4. **Dunnett's Test:**
   - **Use Case:** Dunnett's test is used when you have one control group and you want to compare all other groups to the control group after a one-way ANOVA. It is suitable for situations where you have a control group and are interested in whether other groups differ significantly from the control.
   - **Example:** You are testing the effectiveness of different medications compared to a placebo (control group) for pain relief.

5. **Games-Howell Test:**
   - **Use Case:** The Games-Howell test is a parametric alternative to Tukey's HSD that does not assume equal variances or equal sample sizes. It still assumes approximately normal data within each group, and it is suitable for multiple pairwise comparisons when homogeneity of variance is violated.
   - **Example:** You are analyzing the performance of different brands of smartphones in a survey, and the spread of the ratings differs substantially between brands, with unequal numbers of respondents per brand.

6. **Holm-Bonferroni Procedure:**
   - **Use Case:** Holm's procedure is a stepwise method that can be applied to control the familywise error rate. It ranks the p-values from the pairwise comparisons and adjusts the significance levels accordingly. It is suitable when you want to make multiple comparisons but with a controlled error rate.
   - **Example:** You are comparing the effects of several different advertising strategies on sales, and you want to determine which strategies are significantly different from each other.

The choice of a specific post-hoc test depends on the nature of your experimental design, sample sizes, assumptions about the data, and the desired level of control over the familywise error rate. Careful consideration of these factors is essential when selecting an appropriate post-hoc test.
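
As one concrete case, Tukey's HSD can be run with statsmodels' `pairwise_tukeyhsd`. The sketch below uses three small illustrative groups (assumed values, five observations each) and reports which pairs of group means differ at a familywise α of 0.05.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Illustrative data only (assumed values)
scores = {
    "group1": [10, 15, 12, 17, 14],
    "group2": [8, 11, 9, 14, 10],
    "group3": [12, 14, 16, 19, 15],
}

# Flatten into one array of values and a matching array of group labels
values = np.concatenate([scores[name] for name in scores]).astype(float)
labels = np.repeat(list(scores), [len(v) for v in scores.values()])

# All pairwise comparisons with familywise error rate controlled at 0.05
tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey.summary())
```

In the summary table, the `reject` column marks the pairs whose means differ significantly after the adjustment.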

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [4]:
import numpy as np
from scipy.stats import f_oneway

diet_A = [3.5, 4.1, 2.9, 4.2, 3.8, 4.0, 4.2, 3.6, 3.7, 4.1,
          3.9, 4.0, 3.8, 3.6, 4.1, 3.7, 4.0, 4.2, 3.9, 3.5,
          4.0, 4.2, 3.9, 4.1, 3.8, 3.6, 4.2, 3.7, 4.0, 3.5,
          3.9, 4.0, 4.1, 3.6, 4.2, 3.8, 4.0, 3.7, 3.5, 3.9,
          4.1, 4.2, 3.6, 4.0, 4.2, 3.8, 3.7, 4.1]
diet_B = [2.6, 2.7, 2.9, 2.8, 2.7, 2.6, 2.8, 2.9, 2.7, 2.6,
          2.7, 2.9, 2.8, 2.7, 2.6, 2.8, 2.9, 2.7, 2.6, 2.7,
          2.9, 2.8, 2.7, 2.6, 2.8, 2.9, 2.7, 2.6, 2.7, 2.9,
          2.8, 2.7, 2.6, 2.8, 2.9, 2.7, 2.6, 2.7, 2.9, 2.8,
          2.7, 2.6, 2.8, 2.9, 2.7, 2.6, 2.7, 2.9]
diet_C = [1.9, 1.8, 2.0, 2.1, 1.9, 2.0, 1.8, 1.9, 2.1, 2.0,
          1.8, 1.9, 2.0, 2.1, 1.9, 2.0, 1.8, 1.9, 2.1, 2.0,
          1.8, 1.9, 2.0, 2.1, 1.9, 2.0, 1.8, 1.9, 2.1, 2.0,
          1.8, 1.9, 2.0, 2.1, 1.9, 2.0, 1.8, 1.9, 2.1, 2.0,
          1.8, 1.9, 2.0, 2.1, 1.9, 2.0, 1.8, 1.9]

f_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)

print("One-way ANOVA results:")
print("F-statistic:", f_statistic)
print("p-value:", p_value)


if p_value < 0.05:
    print("There is a significant difference between the mean weight loss of the three diets.")
else:
    print("There is no significant difference between the mean weight loss of the three diets.")


One-way ANOVA results:
F-statistic: 1433.5516338647105
p-value: 1.9966832318068114e-94
There is a significant difference between the mean weight loss of the three diets.


In this example:

We have three sets of illustrative data for diets A, B, and C, with 48 weight-loss values per diet. (The prompt describes 50 participants in total; the sample lists here do not match that count.)

We perform a one-way ANOVA using the f_oneway function from SciPy, which calculates the F-statistic and the associated p-value.

The F-statistic and p-value are printed, and the p-value is compared against the chosen significance level (here 0.05).

In this case, F ≈ 1433.55 and p ≈ 2.0e-94, far below 0.05, so we reject the null hypothesis of equal means and conclude that mean weight loss differs between the diets. Had the p-value been 0.05 or greater, we would have failed to reject the null hypothesis, which is not the same as showing that the diets are equivalent. A post-hoc test is still needed to identify which diets differ from each other.

Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [6]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    'Software_Program': ['A', 'B', 'C'] * 10,
    'Experience_Level': ['Novice'] * 15 + ['Experienced'] * 15,
    'Time_to_Complete_Task': [12, 13, 14, 15, 16, 11, 10, 14, 13, 12,
                              18, 19, 17, 16, 20, 9, 10, 12, 11, 14,
                              15, 15, 16, 17, 20, 18, 19, 21, 22, 20]
})

model = ols('Time_to_Complete_Task ~ Software_Program * Experience_Level', data=data).fit()

anova_table = sm.stats.anova_lm(model, typ=2)

print("Two-way ANOVA results:")
print(anova_table)


Two-way ANOVA results:
                                       sum_sq    df         F    PR(>F)
Software_Program                    18.600000   2.0  0.654162  0.528901
Experience_Level                    12.033333   1.0  0.846424  0.366721
Software_Program:Experience_Level    2.466667   2.0  0.086753  0.917190
Residual                           341.200000  24.0       NaN       NaN


In this example:

We have a DataFrame with three columns: "Software_Program," "Experience_Level," and "Time_to_Complete_Task."

We use the ols function from StatsModels to specify the model with both main effects and the interaction effect between "Software_Program" and "Experience_Level."

We perform the analysis of variance using sm.stats.anova_lm and specify typ=2 to use Type II sums of squares.

The ANOVA table will provide information about the main effects of "Software Program" and "Experience Level" as well as the interaction effect between them.

To interpret the results:

- Examine the F-statistics and associated p-values in the ANOVA table.
- A main effect with a p-value below the significance level (typically 0.05) suggests that the factor affects the time to complete the task. Here neither main effect is significant: Software Program has F ≈ 0.65, p ≈ 0.53, and Experience Level has F ≈ 0.85, p ≈ 0.37.
- A significant interaction would indicate that the effect of the software program depends on experience level. Here the interaction is not significant (F ≈ 0.09, p ≈ 0.92).
- For this sample data, there is therefore no evidence that software program, experience level, or their combination affects task completion time.
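
Before reading the ANOVA table, it helps to look at the cell counts and cell means that the model is comparing. The sketch below rebuilds the same DataFrame and tabulates both with pandas; `cell_counts` and `cell_means` are illustrative names.

```python
import pandas as pd

data = pd.DataFrame({
    'Software_Program': ['A', 'B', 'C'] * 10,
    'Experience_Level': ['Novice'] * 15 + ['Experienced'] * 15,
    'Time_to_Complete_Task': [12, 13, 14, 15, 16, 11, 10, 14, 13, 12,
                              18, 19, 17, 16, 20, 9, 10, 12, 11, 14,
                              15, 15, 16, 17, 20, 18, 19, 21, 22, 20]
})

# Number of employees in each experience-by-program cell (checks balance)
cell_counts = pd.crosstab(data['Experience_Level'], data['Software_Program'])

# Mean completion time in each cell; roughly parallel rows suggest little interaction
cell_means = data.pivot_table(index='Experience_Level',
                              columns='Software_Program',
                              values='Time_to_Complete_Task',
                              aggfunc='mean')

print(cell_counts)
print(cell_means)
```

A balanced design (equal counts in every cell) means the Type I, II, and III sums of squares agree, so the choice of `typ` does not change the table.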

Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [7]:
import numpy as np
import scipy.stats as stats
import pandas as pd

control_group_scores = [75, 80, 85, 78, 92, 88, 79, 70, 86, 81, 73, 87, 89, 82, 77, 75, 90, 84, 76, 72, 91, 83, 79, 88, 74, 80, 85, 78, 92, 76, 80, 75, 84, 78, 88, 79, 86, 81, 73, 87, 89, 82, 77, 75, 90, 84, 76, 72, 91, 83]
experimental_group_scores = [85, 88, 93, 79, 92, 88, 89, 82, 90, 86, 82, 89, 90, 84, 77, 85, 91, 87, 80, 75, 93, 86, 83, 91, 78, 85, 89, 87, 94, 78, 86, 80, 88, 79, 90, 83, 91, 85, 84, 77, 85, 86, 82, 78, 85, 91, 87, 80, 75]

t_statistic, p_value = stats.ttest_ind(control_group_scores, experimental_group_scores)

print("Two-Sample T-Test Results:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

if p_value < 0.05:
    print("There is a significant difference in test scores between the control and experimental groups.")
else:
    print("There is no significant difference in test scores between the groups.")


Two-Sample T-Test Results:
t-statistic: -3.165961658175514
p-value: 0.00206603098208819
There is a significant difference in test scores between the control and experimental groups.
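
With only two groups, a significant t-test already identifies which groups differ, so no post-hoc test is needed. The negative t-statistic shows that the control group's mean is lower than the experimental group's. A more informative follow-up is to check robustness and effect size. The sketch below reuses the same scores, runs Welch's t-test (which does not assume equal variances), and computes Cohen's d with a pooled standard deviation; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

control = np.array([75, 80, 85, 78, 92, 88, 79, 70, 86, 81, 73, 87, 89, 82, 77, 75, 90, 84, 76, 72, 91, 83, 79, 88, 74, 80, 85, 78, 92, 76, 80, 75, 84, 78, 88, 79, 86, 81, 73, 87, 89, 82, 77, 75, 90, 84, 76, 72, 91, 83], dtype=float)
experimental = np.array([85, 88, 93, 79, 92, 88, 89, 82, 90, 86, 82, 89, 90, 84, 77, 85, 91, 87, 80, 75, 93, 86, 83, 91, 78, 85, 89, 87, 94, 78, 86, 80, 88, 79, 90, 83, 91, 85, 84, 77, 85, 86, 82, 78, 85, 91, 87, 80, 75], dtype=float)

# Welch's t-test: does not assume the two groups share a variance
welch_t, welch_p = stats.ttest_ind(experimental, control, equal_var=False)

# Cohen's d: mean difference in units of the pooled standard deviation
n_c, n_e = control.size, experimental.size
pooled_sd = np.sqrt(((n_c - 1) * control.var(ddof=1) +
                     (n_e - 1) * experimental.var(ddof=1)) / (n_c + n_e - 2))
cohens_d = (experimental.mean() - control.mean()) / pooled_sd

print(f"mean difference (experimental - control): "
      f"{experimental.mean() - control.mean():.2f}")
print(f"Welch t = {welch_t:.3f}, p = {welch_p:.4f}")
print(f"Cohen's d = {cohens_d:.2f}")
```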


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

A repeated measures ANOVA is used when the same subjects or units are measured under several conditions or at several time points. In this scenario, sales for all three stores are recorded on the same 30 days, so each day acts as the repeated unit and store is the within-day factor. Sales at the three stores on a given day are likely to be correlated (for example, through weekday or holiday effects), which is exactly the dependence a repeated measures ANOVA is designed to handle.

A one-way ANOVA treats the three sets of sales figures as independent samples and ignores the pairing by day. It is simpler to run, but it usually has less power here because day-to-day variation is left in the error term.

As a simpler first look, here's how you can perform a one-way ANOVA in Python to compare the average daily sales of the three stores:

In [8]:
import numpy as np
import scipy.stats as stats

sales_store_A = [500, 550, 600, 530, 520, 580, 540, 570, 610, 520, 590, 550, 600, 570, 540, 530, 620, 560, 540, 580, 560, 600, 590, 570, 530, 550, 600, 530, 520, 580]
sales_store_B = [480, 520, 590, 510, 530, 570, 530, 560, 600, 510, 580, 540, 590, 560, 520, 520, 610, 540, 530, 570, 550, 580, 590, 550, 520, 540, 590, 530, 520, 570]
sales_store_C = [450, 480, 550, 490, 510, 540, 520, 550, 580, 490, 560, 530, 570, 540, 500, 500, 590, 520, 510, 550, 530, 560, 570, 540, 510, 520, 560, 520, 500, 550]

f_statistic, p_value = stats.f_oneway(sales_store_A, sales_store_B, sales_store_C)

print("One-way ANOVA results:")
print("F-statistic:", f_statistic)
print("p-value:", p_value)

if p_value < 0.05:
    print("There is a significant difference in daily sales between the three stores.")
else:
    print("There is no significant difference in daily sales between the stores.")


One-way ANOVA results:
F-statistic: 7.347456368362423
p-value: 0.0011260847350939831
There is a significant difference in daily sales between the three stores.


In this example:

We have daily sales data for three stores, Store A, Store B, and Store C, across 30 days.

We perform a one-way ANOVA using the f_oneway function from SciPy to compare the means of the three stores.

The F-statistic and p-value are printed and compared against the chosen significance level (e.g., 0.05). Here F ≈ 7.35 and p ≈ 0.0011, so we conclude that there is a significant difference in mean daily sales between the three stores.

Because the result is significant, a post-hoc test such as Tukey's Honestly Significant Difference (HSD) test can be used to identify which specific stores differ.
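
Since the prompt asks for a repeated measures ANOVA and the three stores are observed on the same 30 days, one way to honor that pairing is statsmodels' `AnovaRM`, treating each day as the subject and store as the within-subject factor. The sketch below reuses the same sales figures and follows up with Bonferroni-adjusted paired t-tests as a simple post-hoc procedure. The column names (`day`, `store`, `sales`) are assumptions made for illustration.

```python
import itertools
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

sales = {
    "A": [500, 550, 600, 530, 520, 580, 540, 570, 610, 520, 590, 550, 600, 570, 540, 530, 620, 560, 540, 580, 560, 600, 590, 570, 530, 550, 600, 530, 520, 580],
    "B": [480, 520, 590, 510, 530, 570, 530, 560, 600, 510, 580, 540, 590, 560, 520, 520, 610, 540, 530, 570, 550, 580, 590, 550, 520, 540, 590, 530, 520, 570],
    "C": [450, 480, 550, 490, 510, 540, 520, 550, 580, 490, 560, 530, 570, 540, 500, 500, 590, 520, 510, 550, 530, 560, 570, 540, 510, 520, 560, 520, 500, 550],
}

# Long format: one row per (day, store) pair
long_df = pd.DataFrame(
    [{"day": day, "store": store, "sales": value}
     for store, values in sales.items()
     for day, value in enumerate(values, start=1)]
)

# Repeated measures ANOVA: day is the repeated unit, store the within factor
rm_result = AnovaRM(long_df, depvar="sales", subject="day",
                    within=["store"]).fit()
rm_table = rm_result.anova_table
print(rm_table)

# Post-hoc: paired t-tests on same-day sales, Bonferroni-adjusted
pairs = list(itertools.combinations(sales, 2))
posthoc = {}
for a, b in pairs:
    t_stat, p_raw = stats.ttest_rel(sales[a], sales[b])
    posthoc[(a, b)] = min(p_raw * len(pairs), 1.0)
    print(f"Store {a} vs Store {b}: t = {t_stat:.2f}, "
          f"adjusted p = {posthoc[(a, b)]:.4g}")
```

Because same-day sales move together across stores, removing day-to-day variation from the error term typically gives a much larger F-statistic than the one-way ANOVA on the same numbers.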