## Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

In [None]:
Analysis of Variance (ANOVA) is a statistical method used to compare the means of three or more groups to determine if
there are statistically significant differences among them. To use ANOVA and trust the validity of its results, certain 
assumptions must be met. Violations of these assumptions can impact the validity of ANOVA results. Here are the key
assumptions for ANOVA:

1.Independence of Observations:

    ~Assumption: Observations within and between groups should be independent. This means that the data points in one
     group should not be influenced by or related to the data points in other groups.
    ~Violation: If observations are not independent, it can lead to pseudoreplication and inflated Type I error rates.
     For example, if you are comparing test scores of students in different schools and some students attend both
     schools, the independence assumption is violated.
    
2.Normality:

    ~Assumption: The residuals (the differences between observed values and group means) should follow a normal
     distribution. In other words, the errors or residuals should be normally distributed within each group.
    ~Violation: If the residuals are not normally distributed, it can lead to inaccurate p-values and confidence
     intervals. Violations can be detected through normality tests (e.g., the Shapiro-Wilk test) or graphical methods
     like Q-Q plots.
        
3.Homogeneity of Variance (Homoscedasticity):

    ~Assumption: The variances of the residuals should be roughly equal across all groups. In other words, the spread or
     dispersion of data points within each group should be approximately the same.
    ~Violation: Heteroscedasticity (unequal variances) distorts the F-statistic and p-values, making them less reliable,
     particularly when group sizes are also unequal. You can check for homogeneity of variance using tests like
     Levene's test or by inspecting residual plots.
    
4.Random Sampling:

    ~Assumption: Samples should be drawn randomly from the population of interest. This assumption ensures that the
     sample is representative of the population.
    ~Violation: Non-random sampling can introduce bias into the results and make it difficult to generalize findings to 
     the larger population.
        
5.Interval or Ratio Data:

    ~Assumption: The dependent variable should be measured on an interval or ratio scale. ANOVA is not appropriate for
     nominal or ordinal data.
    ~Violation: Using ANOVA with categorical or ordinal data can lead to incorrect results.
    
6.Equal Group Sizes (for one-way ANOVA):

    ~Assumption: Equal group sizes are not a strict requirement of one-way ANOVA, but a balanced design is often
     recommended.
    ~Violation: Unequal group sizes can reduce the power of the ANOVA test and make it more sensitive to unequal
     variances. However, ANOVA is still relatively robust to unequal group sizes unless the imbalance is extreme.
    
7.No Significant Outliers:

    ~Assumption: There should be no extreme outliers in the data that could unduly influence the results.
    ~Violation: Outliers can skew group means and affect the validity of ANOVA results. It's important to identify and
     handle outliers appropriately.
        
It's important to note that ANOVA is relatively robust to violations of some assumptions, especially when sample sizes
are large. However, when assumptions are seriously violated, alternative methods or transformations of data may be
necessary to obtain valid results. Additionally, non-parametric tests like the Kruskal-Wallis test can be used as a 
robust alternative to ANOVA when assumptions cannot be met.
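
The checks mentioned above can be sketched with scipy.stats. This is a minimal illustration, not a definitive workflow: the three groups are made-up values, and the 0.05 cutoff is an assumed convention rather than something fixed by ANOVA itself.

```python
import numpy as np
from scipy import stats

# Hypothetical example data: three independent groups (not from the original notebook)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=52, scale=5, size=30)
group_c = rng.normal(loc=55, scale=5, size=30)
groups = [group_a, group_b, group_c]

# Normality: Shapiro-Wilk test on the residuals (observations minus their group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

alpha = 0.05  # assumed significance level
if shapiro_p > alpha and levene_p > alpha:
    # No evidence against normality or equal variances: use the standard F-test
    test_name = "one-way ANOVA"
    stat, p_value = stats.f_oneway(*groups)
else:
    # Otherwise fall back to the rank-based Kruskal-Wallis test
    test_name = "Kruskal-Wallis"
    stat, p_value = stats.kruskal(*groups)

print(f"Shapiro-Wilk p = {shapiro_p:.4f}, Levene p = {levene_p:.4f}")
print(f"{test_name}: statistic = {stat:.4f}, p = {p_value:.4f}")
```

Graphical checks such as Q-Q plots of the residuals are a useful complement to these tests, especially with large samples where small deviations can produce small p-values.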

## Q2. What are the three types of ANOVA, and in what situations would each be used?

In [None]:
Analysis of Variance (ANOVA) is a statistical technique used to analyze the variation between groups and within groups
to determine if there are statistically significant differences among the group means. There are three main types of 
ANOVA, each designed for different situations:

1.One-Way ANOVA:

    ~Use: One-Way ANOVA is used when you have one independent variable (categorical) with three or more levels or groups,
     and you want to determine if there are statistically significant differences in the means of a continuous dependent
     variable across these groups.
    ~Example: You want to compare the mean test scores of students from three different schools (School A, School B, and
     School C) to see if there are significant differences in performance.
    
2.Two-Way ANOVA:

    ~Use: Two-Way ANOVA is used when you have two independent variables (factors), and you want to assess their
     individual and interactive effects on a continuous dependent variable. It examines not only the main effects of
    each factor but also their interaction effect.
    ~Example: You want to analyze the effects of two factors, such as gender (Male/Female) and treatment (Treatment A/
     Treatment B), on patient recovery time.
    
3.Repeated Measures ANOVA:

    ~Use: Repeated Measures ANOVA is used when you have a single group of subjects that are measured or tested under
     multiple conditions or time points. It is designed to assess changes within subjects over time or conditions.
    ~Example: You want to assess whether there is a significant change in participants' blood pressure levels over three 
     different time points (baseline, 1 month, and 3 months) after they start taking a new medication.
        
Each type of ANOVA is suited to specific research designs and questions. Here's a brief summary of when to use each
type:

    ~One-Way ANOVA: Use when you have one independent variable (factor) with three or more levels (groups) and want to
                    compare the means of a continuous dependent variable across these groups.

    ~Two-Way ANOVA: Use when you have two independent variables (factors) and want to assess the main effects of each
                    factor and their interaction effect on a continuous dependent variable.

    ~Repeated Measures ANOVA: Use when you have a single group of subjects and want to assess changes within subjects
                              over multiple time points or conditions.

It's important to choose the appropriate type of ANOVA based on your research design and the specific hypotheses you 
want to test. Using the wrong type of ANOVA can lead to incorrect or misleading results. Additionally, it's crucial to 
ensure that the assumptions of ANOVA are met or appropriately addressed for valid and reliable results.

## Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

In [None]:
The partitioning of variance in Analysis of Variance (ANOVA) is a fundamental concept that helps researchers understand
how the total variance in a dataset is broken down into different components or sources of variation. ANOVA decomposes
the total variance into several parts, which include:

1.Total Variance (Total Sum of Squares, SST):

    ~The total variance represents the overall variability in the data, capturing the differences between individual 
     data points and the grand mean of all data points. It is calculated as the sum of the squared deviations of each
    data point from the grand mean.
    
            $$SST = \sum_{i} \sum_{j} (X_{ij} - \bar{X})^2$$

2.Between-Group Variance (Between-Group Sum of Squares, SSB):

    ~This component of variance measures the variability between different groups or categories (levels) of the
     independent variable. It quantifies how much the group means differ from the grand mean.

            $$SSB = \sum_{i} n_i (\bar{X}_i - \bar{X})^2$$
Where:

    ~$n_i$ is the number of observations in group i.
    ~$\bar{X}_i$ is the mean of group i.
    ~$\bar{X}$ is the grand mean.
    
3.Within-Group Variance (Within-Group Sum of Squares, SSW):

    ~This component of variance accounts for the variability within each group or category. It measures how much
     individual data points deviate from their respective group means.

            $$SSW = \sum_{i} \sum_{j} (X_{ij} - \bar{X}_i)^2$$
Where:

    ~$X_{ij}$ is an individual data point (observation j) in group i.
    ~$\bar{X}_i$ is the mean of group i.
    
Understanding the partitioning of variance in ANOVA is crucial for several reasons:

1.Hypothesis Testing: ANOVA helps test hypotheses about whether there are significant differences in means between 
  groups. By partitioning the variance into between-group and within-group components, ANOVA determines if the between-
group variability is larger than what would be expected due to random chance.

2.Interpretation: It allows researchers to interpret the relative contributions of group differences (between-group
  variance) and random variability (within-group variance) to the total variability in the data.

3.Effect Size: Researchers can calculate effect sizes, such as eta-squared (η²), which quantify the proportion of
  variance explained by the independent variable. Effect size measures help assess the practical significance of
  observed differences.

4.Model Comparison: ANOVA can be used to compare different models or treatments to see which one provides a better
  explanation for the observed data. Understanding how much variance is explained by each model or treatment is 
essential for model selection.

5.Assumptions and Model Validity: Understanding the partitioning of variance can help identify potential violations of
  ANOVA assumptions, such as homogeneity of variances and normality of residuals. These assumptions impact the validity
of ANOVA results.

In summary, the partitioning of variance in ANOVA provides insights into the sources of variability in the data, aids in
hypothesis testing and interpretation, and guides researchers in making informed decisions about the significance and
practical relevance of group differences. It also helps ensure the validity and reliability of ANOVA results.
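
The decomposition described above can be checked numerically. The sketch below uses small made-up groups (an assumption for illustration only) to compute the three sums of squares, confirm that the between-group and within-group parts add up to the total, and derive eta-squared from them.

```python
import numpy as np

# Hypothetical example data: three groups of unequal size
groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),
    np.array([7.0, 8.0, 9.0]),
    np.array([1.0, 2.0, 3.0, 2.0, 2.0]),
]
all_data = np.concatenate(groups)
grand_mean = all_data.mean()

# Total sum of squares: every observation against the grand mean
sst = np.sum((all_data - grand_mean) ** 2)

# Between-group sum of squares: group means against the grand mean, weighted by group size
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: every observation against its own group mean
ssw = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Eta-squared: proportion of the total variability attributed to group membership
eta_squared = ssb / sst

print(f"SST = {sst:.4f}, SSB = {ssb:.4f}, SSW = {ssw:.4f}")
print(f"SSB + SSW = {ssb + ssw:.4f}")
print(f"eta-squared = {eta_squared:.4f}")
```

The printed SSB + SSW should match SST up to floating-point rounding, which is the identity the F-test relies on.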

## Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [None]:
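import numpy as np
import scipy.stats as stats

# Hypothetical example data, one array per group (replace with your actual data)
groups = [np.array([23.0, 25.0, 21.0, 24.0]),
          np.array([30.0, 28.0, 31.0, 29.0]),
          np.array([26.0, 27.0, 25.0, 28.0])]
all_data = np.concatenate(groups)
grand_mean = all_data.mean()
group_means = [g.mean() for g in groups]

# Calculate the sums of squares: total (SST), explained (SSE), and residual (SSR)
sst = np.sum((all_data - grand_mean) ** 2)
squared_deviations_explained = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
squared_deviations_residual = sum(np.sum((g - m) ** 2) for g, m in zip(groups, group_means))
print(f"SST: {sst:.4f}, SSE: {squared_deviations_explained:.4f}, SSR: {squared_deviations_residual:.4f}")

# Calculate degrees of freedom
df_total = len(all_data) - 1
df_explained = len(group_means) - 1
df_residual = df_total - df_explained

# Calculate Mean Squares
ms_explained = squared_deviations_explained / df_explained
ms_residual = squared_deviations_residual / df_residual

# Calculate the F-statistic
F = ms_explained / ms_residual

# Calculate the p-value (upper tail of the F distribution)
p_value = stats.f.sf(F, df_explained, df_residual)

# Print F-statistic and p-value
print(f"F-statistic: {F:.4f}")
print(f"P-value: {p_value:.4f}")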

## Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:
In a two-way ANOVA, you can calculate the main effects and interaction effect using Python by first performing the ANOVA
analysis and then examining the results to extract these effects. You can use the statsmodels library for the ANOVA
analysis (scipy.stats provides one-way ANOVA through f_oneway but no two-way ANOVA function). Here's how you can
calculate the main effects and interaction effect step by step:

Assuming you have a dataset with two categorical independent variables (factors) and one continuous dependent variable:
    
1.Import the required libraries:
    
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

2.Organize your data into a DataFrame. Suppose you have a DataFrame named df with columns "Factor1," "Factor2," and
  "DependentVariable."

3.Perform the two-way ANOVA analysis using the ols function from statsmodels:
    
    formula = 'DependentVariable ~ Factor1 * Factor2'
    model = ols(formula, data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    
In the above code:

    ~formula specifies the model formula with interactions between Factor1 and Factor2.
    ~model fits the ANOVA model to your data.
    ~anova_table contains the ANOVA results, including the main effects and interaction effect.
    
4.Extract the mean squares for the main effects and the interaction effect from the anova_table:

    main_effect_factor1 = anova_table.loc['Factor1', 'sum_sq'] / anova_table.loc['Factor1', 'df']
    main_effect_factor2 = anova_table.loc['Factor2', 'sum_sq'] / anova_table.loc['Factor2', 'df']
    interaction_effect = anova_table.loc['Factor1:Factor2', 'sum_sq'] / anova_table.loc['Factor1:Factor2', 'df']

Here's what each line does:

    ~anova_table.loc['Factor1', 'sum_sq'] retrieves the sum of squares for Factor1.
    ~anova_table.loc['Factor1', 'df'] retrieves the degrees of freedom for Factor1.
    ~Dividing the sum of squares by the degrees of freedom gives you the mean square for Factor1, which measures the
     variability attributed to the main effect of Factor1. The mean squares for Factor2 and the interaction are
     calculated the same way.
        
Now, main_effect_factor1, main_effect_factor2, and interaction_effect contain the mean squares for the two main effects
and the interaction effect, respectively.

To assess the significance of these effects, read the F-statistics and p-values directly from the 'F' and 'PR(>F)'
columns of anova_table.

## Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

In [None]:
In a one-way ANOVA, the F-statistic and p-value are used to assess whether there are statistically significant
differences among the means of three or more groups. Let's interpret the results you provided:

    1.F-Statistic: The F-statistic measures the ratio of the variance between groups to the variance within groups. It
      quantifies whether the observed differences in group means are larger than what you would expect by random chance.

    2.P-Value: The p-value associated with the F-statistic tells you the probability of obtaining the observed F-
     statistic (or an even more extreme value) if there were no real differences between the groups.

In your case:

    ~F-Statistic: 5.23
    ~P-Value: 0.02
    
Now, let's interpret these results:

    The F-statistic of 5.23 indicates that the variability between the group means is about 5.23 times the variability
    within the groups. Under the null hypothesis, this ratio is expected to be close to 1.

    The p-value of 0.02 is the probability of observing an F-statistic as extreme as 5.23 (or more extreme) under the 
    assumption that there are no real differences between the groups. A small p-value (typically less than your chosen
    significance level, e.g., 0.05) indicates evidence against the null hypothesis.

Based on these results:

    1.Null Hypothesis (H0): The null hypothesis in ANOVA is that all population group means are equal.

    2.Alternative Hypothesis (Ha): The alternative hypothesis is that at least one group mean differs from the
                                   others.

Since the p-value (0.02) is less than the significance level (e.g., 0.05), you would reject the null hypothesis. This
means you have evidence to conclude that there are statistically significant differences between at least two group 
means.

In practical terms, you would need to perform post hoc tests or pairwise comparisons (e.g., Tukey's HSD test, Bonferroni
correction) to determine which specific groups are different from each other. These tests can identify where the
differences lie among the groups.

In summary, with an F-statistic of 5.23 and a p-value of 0.02, you can conclude that there are statistically significant
differences between at least two groups. However, further analysis is needed to determine which specific groups differ
from each other.

## Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

In [None]:
Handling missing data in a repeated measures ANOVA is important to ensure the validity and reliability of your analysis.
Missing data can arise for various reasons, such as participant dropouts, equipment malfunctions, or incomplete
responses. There are several methods to handle missing data, and the choice of method can impact the results and
conclusions of your analysis. Here are common approaches and their potential consequences:

1.Listwise Deletion (Complete Case Analysis):

    ~Method: Exclude cases with any missing data on any variable from the analysis.
    ~Consequences:
        ~Pros: Simple and easy to implement.
        ~Cons: Reduces sample size, potentially leading to reduced statistical power and biased results if the data are
         not missing completely at random (MCAR). It may also introduce bias if the missing data are related to the
        variables of interest.
        
2.Mean Imputation:

    ~Method: Replace missing values with the mean of the available values for the same variable.
    ~Consequences:
        ~Pros: Preserves sample size, maintains simplicity.
        ~Cons: May underestimate variability, distort relationships between variables, and lead to biased parameter 
         estimates, especially if missingness is not MCAR. Can artificially reduce the variance and make it appear as
        if there are no treatment effects.
        
3.Last Observation Carried Forward (LOCF):

    ~Method: Carry forward the last observed value for each participant to replace missing data points.
    ~Consequences:
        ~Pros: Preserves sample size, simple.
        ~Cons: Can introduce bias if data are not missing at random, particularly in cases of temporary changes or
        nonlinear trends. May not accurately reflect the participant's true state.
        
4.Linear Interpolation:

    ~Method: Interpolate missing values using linear interpolation techniques based on adjacent time points.
    ~Consequences:
        ~Pros: Preserves sample size, potentially more accurate than LOCF if data follow a linear trend.
        ~Cons: Still subject to bias if the data do not follow a linear pattern or if missingness is related to the
         variables.
            
5.Multiple Imputation:

    ~Method: Generate multiple imputed datasets, each with different plausible values for the missing data, and analyze
     each dataset separately. Combine results using established methods (e.g., Rubin's rules).
    ~Consequences:
        ~Pros: Provides unbiased parameter estimates, preserves statistical power, and accounts for the uncertainty
         associated with missing data. Appropriate for data missing at random (MAR).
        ~Cons: More complex and computationally intensive than other methods.
                                                                         
6.Maximum Likelihood Estimation (MLE):

    ~Method: Use statistical software that can handle missing data using MLE. MLE estimates the model parameters while
     accounting for missing values.
    ~Consequences:
        ~Pros: Provides unbiased parameter estimates, preserves statistical power, and accommodates different patterns
         of missingness. Suitable for data missing at random (MAR).
        ~Cons: Requires specialized software and statistical knowledge.
                                                                         
The choice of method should depend on the nature of the missing data, the assumptions underlying each method, and the
goals of your analysis. It is essential to consider the potential biases and limitations associated with each approach
and to report the method used and any assumptions made in your research findings. Additionally, sensitivity analyses 
can help assess the robustness of your results to different missing data handling methods.
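
The simpler approaches above (listwise deletion, mean imputation, LOCF, and linear interpolation) can be sketched with pandas. The blood-pressure-style values and column names below are hypothetical, chosen only to show how each method changes the data; multiple imputation and maximum likelihood need dedicated tooling and are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
scores = pd.DataFrame(
    {
        "baseline": [120.0, 130.0, 125.0, 140.0],
        "month_1": [118.0, np.nan, 121.0, 135.0],
        "month_3": [115.0, 124.0, np.nan, 131.0],
    },
    index=["p1", "p2", "p3", "p4"],
)

# 1. Listwise deletion: keep only participants with complete data
complete_cases = scores.dropna()

# 2. Mean imputation: fill each time point with that column's mean
mean_imputed = scores.fillna(scores.mean())

# 3. LOCF: carry each participant's last observed value forward across time points
locf = scores.ffill(axis=1)

# 4. Linear interpolation between adjacent time points within each participant
interpolated = scores.interpolate(method="linear", axis=1)

print("Participants kept by listwise deletion:", list(complete_cases.index))
print(mean_imputed)
print(locf)
print(interpolated)
```

Running the same analysis on each of these versions is one simple form of the sensitivity analysis mentioned above.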

## Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

In [None]:
Post-hoc tests are used after conducting an Analysis of Variance (ANOVA) to make pairwise comparisons between groups
when the ANOVA reveals a significant overall difference among three or more groups. These tests help identify which 
specific groups differ from each other. Several common post-hoc tests are available, and the choice of which one to use
depends on factors such as your research design, assumptions, and objectives. Here are some common post-hoc tests and
situations where you might use each one:

1.Tukey's Honestly Significant Difference (Tukey's HSD):

    ~When to Use: Tukey's HSD is appropriate when you want to compare all possible pairs of groups while controlling
     the familywise Type I error rate. It assumes equal variances across groups; with unequal group sizes, the
     Tukey-Kramer adjustment is used.
    ~Example: You conducted a one-way ANOVA to compare the test scores of students from five different schools. Tukey's
     HSD can help you determine which schools have significantly different mean scores.
        
2.Bonferroni Correction:

    ~When to Use: Bonferroni correction is a conservative method used when you want to control the familywise error rate
     (i.e., the probability of making at least one Type I error across all comparisons). It is suitable for situations
    with multiple comparisons, but it tends to be more conservative and may increase the risk of Type II errors.
    ~Example: You are comparing the effects of three different treatments on a health outcome, and you want to ensure 
     that the overall error rate for any significant differences is controlled at a specific level (e.g., 0.05).
        
3.Dunnett's Test:

    ~When to Use: Dunnett's test is appropriate when you have a control group and want to compare other groups to the
     control group while controlling for the overall Type I error rate.
    ~Example: In a drug trial, you have a control group receiving a placebo and several treatment groups receiving
     different doses of a new medication. Dunnett's test can help you compare each treatment group to the control group.
        
4.Scheffé's Test:

    ~When to Use: Scheffé's test is a conservative post-hoc test that controls the Type I error rate across all possible
     contrasts, not just pairwise comparisons, and it can be used with unequal group sizes. It is robust but less
     powerful than some other tests when only pairwise comparisons are of interest.
    ~Example: You are comparing the performance of students from different grade levels (unequal group sizes), and you
     want to control the overall Type I error rate when comparing pairs of grade levels.
        
5.Fisher's Least Significant Difference (LSD):

    ~When to Use: Fisher's LSD is a less conservative post-hoc test that runs pairwise t-tests only after a significant
     overall F-test. It is more powerful than Tukey's HSD but does not control the familywise error rate as rigorously,
     so it is best reserved for a small number of groups.
    ~Example: You conducted a one-way ANOVA to compare the yield of three different fertilizer treatments in a garden
     with equal plot sizes. Fisher's LSD can help identify which pairs of treatments are significantly different.
        
6.Games-Howell Test:

    ~When to Use: Games-Howell is suitable when group sizes are unequal, and you want to perform post-hoc tests without
     assuming equal variances across groups. It does not require the homogeneity of variances assumption.
    ~Example: You are comparing the performance of athletes from different sports teams, and the team sizes are uneven.
     Games-Howell can be used to determine which teams have significantly different scores.
        
In summary, the choice of post-hoc test depends on the characteristics of your data, the assumptions you can make, and 
your specific research questions. It's essential to consider factors like group sizes, homogeneity of variances, and
control of Type I error rates when selecting the most appropriate post-hoc test for your ANOVA results.
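
As one concrete illustration of the options above, the Bonferroni correction can be sketched as pairwise t-tests with adjusted p-values. The three groups are simulated, hypothetical data, and the 0.05 level is an assumed familywise significance level.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical example data: test scores for three groups
rng = np.random.default_rng(0)
samples = {
    "A": rng.normal(70, 8, 25),
    "B": rng.normal(75, 8, 25),
    "C": rng.normal(85, 8, 25),
}

pairs = list(combinations(samples, 2))
n_comparisons = len(pairs)
alpha = 0.05  # assumed familywise significance level

results = {}
for name_1, name_2 in pairs:
    # Pairwise two-sample t-test (assumes equal variances, like classical ANOVA)
    t_stat, p_raw = stats.ttest_ind(samples[name_1], samples[name_2])
    # Bonferroni adjustment: multiply by the number of comparisons, capped at 1
    p_adj = min(p_raw * n_comparisons, 1.0)
    results[(name_1, name_2)] = p_adj
    print(f"{name_1} vs {name_2}: raw p = {p_raw:.4f}, adjusted p = {p_adj:.4f}, "
          f"reject H0 = {p_adj < alpha}")
```

With many groups the number of pairs grows quickly, which is why Bonferroni becomes conservative and Tukey's HSD is usually preferred for all-pairs comparisons.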

## Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [None]:
To conduct a one-way ANOVA in Python to compare the mean weight loss of three diets (A, B, and C), you can use the
scipy.stats library. The arrays below hold illustrative values, since the researcher's actual data for the 50
participants are not given. Here's a step-by-step guide:
    
1.Import the required libraries:
    
    import numpy as np
    import scipy.stats as stats
    
2.Organize your data. Create three arrays, one for each diet:

    diet_A = np.array([1.5, 2.0, 1.8, 2.3, 1.7, 2.5, 2.1, 1.9, 1.6, 2.2,
                       1.8, 2.3, 2.0, 1.6, 1.9, 2.2, 2.1, 1.7, 1.5, 2.4,
                       2.0, 1.8, 1.9, 2.1, 2.2, 1.7, 1.6, 2.5, 2.3, 2.4,
                       1.6, 1.7, 2.3, 2.2, 1.9, 2.0, 1.8, 2.1, 2.4, 1.5,
                       2.3, 1.7, 1.8, 1.6, 2.5, 2.2, 2.1, 2.0, 2.4])

    diet_B = np.array([1.2, 1.1, 1.3, 1.0, 1.2, 1.5, 1.4, 1.3, 1.2, 1.1,
                       1.4, 1.3, 1.2, 1.0, 1.1, 1.5, 1.3, 1.2, 1.4, 1.0,
                       1.1, 1.2, 1.5, 1.3, 1.4, 1.0, 1.1, 1.2, 1.3, 1.4,
                       1.0, 1.5, 1.3, 1.2, 1.1, 1.4, 1.3, 1.0, 1.5, 1.2,
                       1.4, 1.1, 1.3, 1.0, 1.2, 1.4, 1.5, 1.3, 1.1])

    diet_C = np.array([0.8, 0.9, 0.7, 0.6, 0.9, 0.7, 0.8, 0.7, 0.6, 0.9,
                       0.7, 0.8, 0.6, 0.9, 0.7, 0.8, 0.9, 0.6, 0.7, 0.8,
                       0.6, 0.9, 0.7, 0.8, 0.7, 0.6, 0.9, 0.7, 0.8, 0.6,
                       0.9, 0.7, 0.8, 0.6, 0.7, 0.9, 0.8, 0.6, 0.7, 0.8,
                       0.9, 0.6, 0.7, 0.8, 0.6, 0.9, 0.7, 0.8, 0.7, 0.6])

3.Perform the one-way ANOVA:
    
    F_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

4.Interpret the results:

    ~F-statistic: This statistic quantifies the ratio of the between-group variability to the within-group variability.
     It measures whether the mean weight loss differs significantly between the three diets.
    ~p-value: This value represents the probability of observing the obtained F-statistic or a more extreme value under
     the null hypothesis (i.e., no differences between the diets). A small p-value (typically less than 0.05) indicates
     evidence against the null hypothesis.
        
Now, you can print the F-statistic and p-value and interpret the results:

    print(f"F-statistic: {F_statistic:.2f}")
    print(f"P-value: {p_value:.4f}")

    if p_value < 0.05:
        print("There is significant evidence to conclude that at least one diet leads to different mean weight loss.")
    else:
        print("There is no significant evidence to conclude that the mean weight loss differs between the diets.")

## Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [4]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generate some example data (replace this with your actual data)
np.random.seed(0)
n = 30
software = np.random.choice(['A', 'B', 'C'], n)
experience = np.random.choice(['novice', 'experienced'], n)
completion_time = np.random.normal(20, 5, n)  # Replace with actual completion times

# Create a DataFrame
data = pd.DataFrame({'Software': software, 'Experience': experience, 'CompletionTime': completion_time})

# Fit a two-way ANOVA model
formula = 'CompletionTime ~ Software + Experience + Software:Experience'
model = ols(formula, data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)

# Interpret the results
alpha = 0.05  # Set your desired significance level

# Check main effects
if anova_table['PR(>F)']['Software'] < alpha:
    print("There is a significant main effect of Software.")
else:
    print("There is no significant main effect of Software.")

if anova_table['PR(>F)']['Experience'] < alpha:
    print("There is a significant main effect of Experience.")
else:
    print("There is no significant main effect of Experience.")

# Check interaction effect
if anova_table['PR(>F)']['Software:Experience'] < alpha:
    print("There is a significant interaction effect between Software and Experience.")
else:
    print("There is no significant interaction effect between Software and Experience.")

                         sum_sq    df         F    PR(>F)
Software              69.634658   2.0  2.113814  0.142706
Experience            13.138395   1.0  0.797652  0.380665
Software:Experience   37.582879   2.0  1.140857  0.336272
Residual             395.312007  24.0       NaN       NaN
There is no significant main effect of Software.
There is no significant main effect of Experience.
There is no significant interaction effect between Software and Experience.


## Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [6]:
import numpy as np
import scipy.stats as stats
import statsmodels.stats.multicomp as multi

# Generate example data (replace this with your actual data)
np.random.seed(0)
control_group_scores = np.random.normal(75, 10, 50)  # Replace with actual scores for the control group
experimental_group_scores = np.random.normal(80, 10, 50)  # Replace with actual scores for the experimental group

# Perform a two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group_scores, experimental_group_scores)

# Check if the results are significant
alpha = 0.05  # Set your desired significance level
if p_value < alpha:
    print("There is a significant difference between the groups (p-value < 0.05).")
else:
    print("There is no significant difference between the groups (p-value >= 0.05).")

# If the results are significant, perform a post-hoc test (Tukey's HSD)
# Note: with only two groups, this repeats the same comparison as the t-test
if p_value < alpha:
    # Combine the data for post-hoc testing
    all_scores = np.concatenate([control_group_scores, experimental_group_scores])
    group_labels = ['Control'] * len(control_group_scores) + ['Experimental'] * len(experimental_group_scores)

    # Perform Tukey's HSD test for multiple comparisons
    tukey_result = multi.pairwise_tukeyhsd(all_scores, group_labels)

    # Print the post-hoc test results
    print(tukey_result)

There is no significant difference between the groups (p-value >= 0.05).


## Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [7]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate example data (replace this with your actual data)
np.random.seed(0)
store_a_sales = np.random.normal(500, 50, 30)  # Replace with actual sales data for Store A
store_b_sales = np.random.normal(550, 60, 30)  # Replace with actual sales data for Store B
store_c_sales = np.random.normal(480, 55, 30)  # Replace with actual sales data for Store C

# Create a DataFrame
data = pd.DataFrame({'StoreA': store_a_sales, 'StoreB': store_b_sales, 'StoreC': store_c_sales})

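# Perform the Friedman test, a non-parametric alternative to repeated measures ANOVA
# (it returns a chi-square statistic rather than an F-statistic)
chi2_statistic, p_value = stats.friedmanchisquare(data['StoreA'], data['StoreB'], data['StoreC'])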

# Check if the results are significant
alpha = 0.05  # Set your desired significance level
if p_value < alpha:
    print("There is a significant difference between the stores (p-value < 0.05).")
else:
    print("There is no significant difference between the stores (p-value >= 0.05).")

# If the results are significant, perform a post-hoc test (e.g., Tukey's HSD)
# Note: Tukey's HSD treats the stores as independent groups and ignores the pairing by day
if p_value < alpha:
    # Stack the data for post-hoc testing
    stacked_data = pd.melt(data.reset_index(), id_vars=['index'], value_vars=['StoreA', 'StoreB', 'StoreC'])
    stacked_data.columns = ['Day', 'Store', 'Sales']

    # Perform Tukey's HSD test for multiple comparisons
    tukey_result = pairwise_tukeyhsd(stacked_data['Sales'], stacked_data['Store'], alpha=alpha)

    # Print the post-hoc test results
    print(tukey_result)

There is a significant difference between the stores (p-value < 0.05).
 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
group1 group2 meandiff p-adj   lower    upper   reject
------------------------------------------------------
StoreA StoreB  10.4859 0.7359 -22.9593   43.931  False
StoreA StoreC -49.4975 0.0019 -82.9426 -16.0523   True
StoreB StoreC -59.9833 0.0001 -93.4285 -26.5382   True
------------------------------------------------------
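
For comparison, the parametric one-way repeated measures ANOVA F-test can be computed by hand with NumPy and SciPy. This is a sketch under stated assumptions rather than the analysis run above: the sales figures are freshly simulated (hypothetical) values, each day is treated as the repeated "subject", and sphericity is assumed without being tested.

```python
import numpy as np
from scipy import stats

# Hypothetical data: rows are days (the repeated "subjects"), columns are stores
rng = np.random.default_rng(0)
sales = np.column_stack([
    rng.normal(500, 50, 30),
    rng.normal(550, 60, 30),
    rng.normal(480, 55, 30),
])
n_days, n_stores = sales.shape
grand_mean = sales.mean()

# Partition the total sum of squares into store, day, and error components
ss_total = np.sum((sales - grand_mean) ** 2)
ss_stores = n_days * np.sum((sales.mean(axis=0) - grand_mean) ** 2)
ss_days = n_stores * np.sum((sales.mean(axis=1) - grand_mean) ** 2)
ss_error = ss_total - ss_stores - ss_days

# Degrees of freedom for the store effect and the error term
df_stores = n_stores - 1
df_error = (n_stores - 1) * (n_days - 1)

# F-statistic and p-value for the store effect
f_statistic = (ss_stores / df_stores) / (ss_error / df_error)
p_value = stats.f.sf(f_statistic, df_stores, df_error)

print(f"F({df_stores}, {df_error}) = {f_statistic:.4f}, p = {p_value:.4g}")
```

Removing the day-to-day variability (ss_days) from the error term is what distinguishes this from a one-way ANOVA on independent groups. statsmodels also offers an AnovaRM class in statsmodels.stats.anova for long-format data.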
