# Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.


In [1]:
# Analysis of Variance (ANOVA) is a statistical technique used to compare means among different groups or treatments. To use
# ANOVA effectively and interpret its results accurately, certain assumptions need to be met. Violations of these assumptions
# can affect the validity and reliability of the ANOVA results. The main assumptions for using ANOVA are:

#1. Independence: Observations should be independent of one another, both within each group and across groups. The value of
# one observation should not influence, or depend on, the value of any other observation.

#2. Normality: The residuals (the differences between the observed values and the group means) should be normally distributed for
# each group. This assumption is important because ANOVA relies on the assumption of normality to conduct accurate hypothesis
# tests.

#3. Homogeneity of Variance (Homoscedasticity): The variability of scores within each group should be roughly equal across all
# groups. In other words, the variances of the groups should be homogeneous. If the variances are significantly different,
# it can affect the validity of the F-test used in ANOVA.

# 4.Equal Sample Sizes (for one-way ANOVA): Equal group sizes are not a strict requirement, but a balanced design is
# recommended. With roughly equal sample sizes the F-test is much more robust to unequal variances across groups.

# 5.Random Sampling: The data should be collected using a random sampling process to ensure that the results can be generalized to
# the larger population from which the samples were drawn.

# If these assumptions are not met, the results of ANOVA might be compromised. However, ANOVA is known to be fairly robust to
# violations of the normality assumption, especially when sample sizes are reasonably large. In cases where the assumptions 
# are severely violated, there are alternative non-parametric tests available that might be more appropriate, such as the
# Kruskal-Wallis test, which compares the groups using the ranks of the observations rather than their means.
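
# A minimal sketch of the Kruskal-Wallis H statistic, written with the standard library only so that each step is visible.
# The pain scores are hypothetical and the helper name kruskal_wallis_h is an assumption made for illustration; in practice
# scipy.stats.kruskal performs the same test.

```python
import math

def kruskal_wallis_h(*groups):
    """Return the Kruskal-Wallis H statistic (with tie correction)."""
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)
    rank_of = {}   # value -> average rank (tied values share the mean rank)
    tie_term = 0   # sum of t^3 - t over groups of tied values
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        i = j
    rank_part = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_part - 3 * (n + 1)
    return h / (1 - tie_term / (n**3 - n))

# Hypothetical, right-skewed pain scores for three treatment groups.
drug_a = [1, 2, 2, 3, 9]
drug_b = [4, 5, 5, 6, 12]
drug_c = [7, 8, 10, 11, 20]

h = kruskal_wallis_h(drug_a, drug_b, drug_c)
# With three groups, H is compared with a chi-square distribution on 2 degrees
# of freedom, whose upper-tail probability is exactly exp(-h / 2).
p_value = math.exp(-h / 2)
print(f"H = {h:.3f}, approximate p-value = {p_value:.4f}")
```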

# Before applying ANOVA, it's a good practice to check these assumptions using diagnostic plots (like normal probability plots
# or residual plots) and statistical tests. If the assumptions are not met, appropriate data transformations or alternative
# statistical methods might be needed.
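
# A rough sketch of a quick homogeneity-of-variance check, using hypothetical crop yields. The cutoff of 4 for the ratio of
# the largest to the smallest variance is only a rule of thumb assumed here for illustration, not a formal test; for formal
# tests, scipy.stats.levene (equal variances) and scipy.stats.shapiro (normality of residuals) are the usual tools.

```python
from statistics import mean, variance

# Hypothetical crop yields for three fertilizers.
groups = {
    "fertilizer_1": [20.1, 21.4, 19.8, 20.9, 21.0],
    "fertilizer_2": [22.3, 23.1, 21.9, 22.8, 22.5],
    "fertilizer_3": [18.0, 25.5, 14.2, 27.9, 20.4],
}

# Residuals: each observation minus its own group mean (these are what a
# normal probability plot or a normality test would be applied to).
residuals = {
    name: [x - mean(vals) for x in vals] for name, vals in groups.items()
}

# Sample variance of each group and the largest-to-smallest ratio.
variances = {name: variance(vals) for name, vals in groups.items()}
ratio = max(variances.values()) / min(variances.values())

for name, var in variances.items():
    print(f"{name}: sample variance = {var:.2f}")
print(f"Largest / smallest variance ratio = {ratio:.1f}")

if ratio > 4:  # heuristic cutoff, an assumption rather than a formal test
    print("Variances look unequal; consider a transformation or Welch's ANOVA.")
```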

# Examples of Violation and Impact on Validity:

# Independence Violation: Let's say you're comparing the productivity of workers before and after a training program. If the
# same workers appear in both the "before" and "after" groups, the two sets of scores are correlated: a worker who was highly
# productive before training will tend to be highly productive afterwards. Treating the groups as independent ignores this
# pairing, so a paired or repeated measures analysis is needed instead.

# Normality Violation: Consider a study investigating the effect of a new drug on pain relief. If the pain scores within one
# or more treatment groups are strongly skewed or contain extreme outliers, the p-value from the F-test may be inaccurate,
# particularly with small samples, potentially leading to incorrect conclusions about the drug's effectiveness.

# Homoscedasticity Violation: Imagine a study analyzing the impact of different fertilizers on crop yield. If the variability
# of yields among farms using one fertilizer is much larger than among farms using another, the assumption of homoscedasticity
# is violated, and the F-test may reject the null hypothesis more or less often than the nominal significance level implies.

# Unequal Sample Sizes Violation: In a study comparing the response times of participants exposed to different stimuli, if one
# stimulus group is much larger than the others and the group variances also differ, the F-test becomes unreliable: it is too
# liberal when the smaller groups have the larger variances and too conservative when the larger groups do.
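
# To illustrate why ignoring the pairing matters in the before/after example, the sketch below computes a t statistic twice
# on hypothetical productivity scores: once treating the two sets of scores as independent samples, and once on the
# within-worker differences. The numbers are invented for illustration only.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical productivity scores for the same five workers.
before = [50, 60, 70, 80, 90]
after = [53, 62, 74, 83, 93]
n = len(before)

# Independent-samples (pooled-variance) t statistic: treats the ten scores
# as if they came from ten unrelated workers.
pooled_var = ((n - 1) * stdev(before) ** 2 + (n - 1) * stdev(after) ** 2) / (2 * n - 2)
t_independent = (mean(after) - mean(before)) / sqrt(pooled_var * (2 / n))

# Paired t statistic: works on each worker's own change, which removes the
# large worker-to-worker differences from the error term.
diffs = [a - b for a, b in zip(after, before)]
t_paired = mean(diffs) / (stdev(diffs) / sqrt(n))

print(f"t (independent samples) = {t_independent:.3f}")
print(f"t (paired)              = {t_paired:.3f}")
```

# The mean improvement is the same in both calculations; only the error term changes, which is why the choice of analysis
# has to match the design.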


# What are the three types of ANOVA, and in what situations would each be used?

In [2]:
# There are three main types of Analysis of Variance (ANOVA), each designed to handle different types of experimental designs 
# and research questions. The three types of ANOVA are:

# 1.One-Way ANOVA: One-way ANOVA is used when you have one independent variable (factor) with more than two levels or groups. It's
# used to determine whether there are any statistically significant differences among the means of these groups. For example,
# if you're comparing the average test scores of students from different schools (School A, School B, School C), you would use
# a one-way ANOVA.

# 2.Two-Way ANOVA: Two-way ANOVA is used when you have two independent variables, both with multiple levels or groups, and you
# want to examine their individual and interactive effects on the dependent variable. This is often used in experimental
# designs with factorial structures. For instance, if you're investigating the effects of both gender and different teaching
# methods on student performance, you would use a two-way ANOVA.

# 3.Repeated Measures ANOVA (or within-subjects ANOVA): Repeated Measures ANOVA is used when you're measuring the same subjects
# or entities under multiple conditions or at different time points. This type of ANOVA is used to analyze within-subjects
# designs, where each subject's responses are measured repeatedly. For example, if you're testing the effectiveness of a drug
# treatment by measuring patients' pain levels before treatment, after one week, and after two weeks, you would use a repeated
# measures ANOVA.

# In Summary:

# Use One-Way ANOVA when you have one independent variable and you want to compare means across more than two groups.

# Use Two-Way ANOVA when you have two independent variables and you want to analyze their main effects and interactions on the
# dependent variable.

# Use Repeated Measures ANOVA when you're measuring the same subjects under different conditions or time points and want to
# determine if there are significant differences across these conditions or time points.

# Choosing the appropriate type of ANOVA depends on the nature of your data, research design, and the specific hypotheses 
# you're testing. It's important to select the right type of ANOVA to ensure the validity and accuracy of your statistical
# analysis.

# What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

In [4]:
# The partitioning of variance in ANOVA refers to the process of dividing the total variability in a dataset into different
# components, each attributed to specific sources of variation. This concept is important because it allows us to understand 
# the relative contributions of different factors to the variability observed in the data, helping us draw meaningful 
# conclusions about the relationships between variables and the effects of different treatments or conditions.

# In ANOVA, the total variance is decomposed into several components:
# 1.Between-Group Variance: This component of variance captures the differences among the means of different groups or 
# treatments. It represents the variability that can be explained by the effect of the independent variable(s) being studied.
# If the between-group variance is large compared to the within-group variance, it suggests that the independent variable(s) 
# have a significant impact on the dependent variable.

# 2.Within-Group Variance: This component of variance accounts for the variation within each group or treatment. It represents
# the variability that cannot be attributed to the independent variable(s) and is often associated with random variation,
# measurement error, or other uncontrolled factors.

# 3.Total Variance: This is the overall variability observed in the data. In terms of sums of squares, the total equals the
# between-group sum of squares plus the within-group sum of squares.

# The importance of understanding the partitioning of variance in ANOVA includes:

# 1.Interpretation of Results: By quantifying the contributions of different sources of variation, researchers can determine
# the significance of the factors being studied. This helps in interpreting the meaningfulness of observed differences between
# groups or treatments.

# 2.Hypothesis Testing: ANOVA uses the partitioned variance to calculate F-statistics and p-values, which are crucial for
# hypothesis testing. Understanding how these statistics are calculated from the partitioned variance is essential for 
# evaluating the significance of the results.

# 3.Experimental Design: Understanding the partitioning of variance can guide researchers in designing experiments to maximize
# the effect of the independent variable(s) and minimize the influence of random variability.

# 4.Model Building: In more complex ANOVA scenarios, such as factorial ANOVA with multiple factors, understanding the
# partitioning of variance helps researchers construct models that accurately represent the relationships between variables.

# 5.Decision Making: By comprehending how much variability is accounted for by different factors, researchers can make informed
# decisions about the practical significance of their findings and the applicability of their results.

# In summary, the partitioning of variance in ANOVA provides insights into the distribution of variability within a dataset,
# enabling researchers to assess the impact of different factors and draw meaningful conclusions from their analyses. It's a 
# fundamental concept that underpins the interpretation of ANOVA results and the understanding of how various sources of 
# variation contribute to the observed patterns in the data.
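
# Written out for a one-way design, the partition takes the form below. The notation is an assumption introduced here:
# k groups, n_i observations in group i, N observations in total, y_ij the j-th observation in group i, \bar{y}_i the mean
# of group i, and \bar{y} the overall mean.

```latex
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}\right)^2}_{SS_{\text{total}}}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{y}_i-\bar{y}\right)^2}_{SS_{\text{between}}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_i\right)^2}_{SS_{\text{within}}},
\qquad
F = \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}
```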

# How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?


In [5]:
import numpy as np
from scipy import stats

# Simulated data for different groups
group1 = np.array([10, 12, 15, 18, 20])
group2 = np.array([25, 28, 30, 32, 35])
group3 = np.array([40, 42, 45, 48, 50])

# Combine the data from all groups
all_data = np.concatenate([group1, group2, group3])

# Calculate the overall mean
overall_mean = np.mean(all_data)

# Calculate the Total Sum of Squares (SST)
sst = np.sum((all_data - overall_mean)**2)

# Calculate the group means
group1_mean = np.mean(group1)
group2_mean = np.mean(group2)
group3_mean = np.mean(group3)

# Calculate the Explained (between-group) Sum of Squares (SSE)
# Note: many textbooks use SSE for the *error* sum of squares; here it means "explained".
sse = len(group1) * (group1_mean - overall_mean)**2 + \
      len(group2) * (group2_mean - overall_mean)**2 + \
      len(group3) * (group3_mean - overall_mean)**2

# Calculate the Residual Sum of Squares (SSR)
ssr = sst - sse

# Degrees of freedom
df_total = len(all_data) - 1
df_group = 3 - 1  # Number of groups minus 1
df_residual = df_total - df_group

# Calculate mean square values
mse = ssr / df_residual
msb = sse / df_group

# Calculate F-statistic
f_statistic = msb / mse

# Calculate p-value using F-distribution
p_value = 1 - stats.f.cdf(f_statistic, df_group, df_residual)

print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)
print("Degrees of Freedom (Total, Group, Residual):", df_total, df_group, df_residual)
print("Mean Square Error (MSE):", mse)
print("Mean Square Between (MSB):", msb)
print("F-Statistic:", f_statistic)
print("p-value:", p_value)


Total Sum of Squares (SST): 2444.0
Explained Sum of Squares (SSE): 2250.0
Residual Sum of Squares (SSR): 194.0
Degrees of Freedom (Total, Group, Residual): 14 2 12
Mean Square Error (MSE): 16.166666666666668
Mean Square Between (MSB): 1125.0
F-Statistic: 69.58762886597938
p-value: 2.5015153382046407e-07
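
The same partition can be wrapped in a small reusable function. The sketch below uses only the standard library and computes the within-group sum of squares directly rather than by subtraction, which gives a useful consistency check. The function name one_way_sums_of_squares is hypothetical.

```python
from statistics import mean

def one_way_sums_of_squares(*groups):
    """Return (total, between-group, within-group) sums of squares."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return ss_total, ss_between, ss_within

# Same three groups as in the cell above.
ss_total, ss_between, ss_within = one_way_sums_of_squares(
    [10, 12, 15, 18, 20], [25, 28, 30, 32, 35], [40, 42, 45, 48, 50]
)

k, n_total = 3, 15
f_statistic = (ss_between / (k - 1)) / (ss_within / (n_total - k))

print("SS total:", ss_total)
print("SS between (explained):", ss_between)
print("SS within (residual):", ss_within)
print("F-statistic:", f_statistic)
```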


# In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [7]:
# In a two-way ANOVA, we are interested in both the main effects of two independent variables (factors) and the interaction
# effect between them. A main effect is the difference between a factor-level mean and the overall mean; the interaction is
# what remains in each cell after removing both main effects. The example below has only one observation per cell, so the
# interaction cannot be separated from random error and serves as the residual term for the F-tests.

import numpy as np
from scipy import stats

# Simulated data for a two-way ANOVA (one observation per cell)
factor1 = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
factor2 = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3])
response = np.array([10, 12, 15, 20, 22, 25, 30, 32, 35])

levels1 = np.unique(factor1)
levels2 = np.unique(factor2)

# Calculate the overall mean
overall_mean = np.mean(response)

# Main effects: mean of each factor level minus the overall mean
means_factor1 = np.array([np.mean(response[factor1 == level]) for level in levels1])
means_factor2 = np.array([np.mean(response[factor2 == level]) for level in levels2])
main_effect_factor1 = means_factor1 - overall_mean
main_effect_factor2 = means_factor2 - overall_mean

# Arrange the responses in a (factor 1 level) x (factor 2 level) table
cell_matrix = np.zeros((len(levels1), len(levels2)))
for i in range(len(response)):
    cell_matrix[factor1[i] - 1, factor2[i] - 1] = response[i]

# Interaction effect per cell: observed value minus the additive prediction
interaction_effect = cell_matrix - means_factor1[:, np.newaxis] - means_factor2[np.newaxis, :] + overall_mean

# Sums of squares (each level mean is based on as many cells as the other factor has levels)
ss_factor1 = len(levels2) * np.sum(main_effect_factor1 ** 2)
ss_factor2 = len(levels1) * np.sum(main_effect_factor2 ** 2)
ss_interaction = np.sum(interaction_effect ** 2)

# Calculate degrees of freedom
df_factor1 = len(levels1) - 1
df_factor2 = len(levels2) - 1
df_interaction = df_factor1 * df_factor2

def fmt(values):
    return ", ".join(f"{v:.2f}" for v in values)

print("Main effects of Factor 1 (by level):", fmt(main_effect_factor1))
print("Main effects of Factor 2 (by level):", fmt(main_effect_factor2))
print(f"SS Factor 1: {ss_factor1:.2f}")
print(f"SS Factor 2: {ss_factor2:.2f}")
print(f"SS Interaction: {ss_interaction:.2f}")

# F-tests use the interaction mean square as the error term
if np.isclose(ss_interaction, 0.0):
    print("Interaction sum of squares is zero (the data are exactly additive), so the F-statistics are undefined.")
else:
    ms_interaction = ss_interaction / df_interaction
    for name, ss, df in [("Factor 1", ss_factor1, df_factor1), ("Factor 2", ss_factor2, df_factor2)]:
        f_value = (ss / df) / ms_interaction
        print(f"{name}: F = {f_value:.3f}, p = {stats.f.sf(f_value, df, df_interaction):.4f}")


Main effects of Factor 1 (by level): -10.00, 0.00, 10.00
Main effects of Factor 2 (by level): -2.33, -0.33, 2.67
SS Factor 1: 600.00
SS Factor 2: 38.00
SS Interaction: 0.00
Interaction sum of squares is zero (the data are exactly additive), so the F-statistics are undefined.


# Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?



In [8]:
In a one-way ANOVA, the F-statistic is used to test whether there are statistically significant differences among the means
of two or more groups. The p-value associated with the F-statistic is the probability of observing an F-statistic at least
as large as the one obtained, assuming that the null hypothesis is true (i.e., all group population means are equal).

Given your results:

- F-statistic: 5.23
- p-value: 0.02

Here's how you can interpret these results:

1. F-Statistic: The F-statistic is a measure of the ratio of the variance between the group means to the variance within the
groups. A higher F-statistic suggests that the variability between the group means is relatively larger compared to the
variability within the groups.

2. p-Value: The p-value is the probability of obtaining an F-statistic as extreme as the one observed, assuming the null
hypothesis is true. In other words, it measures the evidence against the null hypothesis. A smaller p-value suggests stronger
evidence against the null hypothesis.

Interpretation:

- Since the p-value (0.02) is less than the common significance level of 0.05 (or 5%), we can conclude that there is 
  statistically significant evidence to reject the null hypothesis.
- This implies that there are significant differences among at least some of the groups' means. In other words, the observed
  differences in group means are unlikely to have occurred due to random chance alone.
- However, the p-value does not tell us which specific groups are different from each other or the direction of those 
  differences. It only indicates that at least one group differs significantly from the others.

In summary, based on the F-statistic and the associated p-value, you can conclude that there are significant differences
among the groups' means. Further analysis, such as post hoc tests or pairwise comparisons, may be needed to determine which
specific groups differ from each other and to provide a more detailed understanding of the nature of these differences.


# In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?


In [9]:
Handling missing data in a repeated measures ANOVA is an important consideration to ensure the validity and reliability of
your analysis. Different methods can be used to address missing data, each with its own implications and potential 
consequences. Here are some common methods and their potential consequences:

1. Listwise Deletion (Complete Case Analysis):
   - This approach involves excluding participants with missing data on any variable involved in the analysis.
   - Potential Consequences: While it's conceptually simple, it can lead to reduced sample sizes and potential bias if the 
     missing data are not missing completely at random (MCAR). It can also affect the representativeness of the sample.

2. Pairwise Deletion (Available Case Analysis):
   - This method uses all available data for each variable, so participants with some missing data may be included in some
    analyses but not others.
   - Potential Consequences: It retains more data compared to listwise deletion, but can lead to inconsistencies in sample
     sizes across analyses, which may complicate interpretation and comparisons.

3. Imputation Methods:
   - Imputation involves replacing missing values with estimated values. Common single-imputation methods include mean
     imputation, median imputation, and regression imputation.
   - Potential Consequences: Single imputation treats the filled-in values as if they had been observed, which artificially
     reduces variance, understates standard errors, and may distort relationships in the data. The choice of imputation
     method can affect the results; multiple imputation is often preferred because it reflects the uncertainty in the
     imputed values.

4. Last Observation Carried Forward (LOCF) or Next Observation Carried Backward (NOCB):
   - These methods involve using the last observed value for a participant's missing data (LOCF) or using the next observed
     value (NOCB).
   - Potential Consequences: These methods assume that missing data points are similar to adjacent observed data points, 
     which might not be the case. They can lead to biased estimates and underestimate the variability between time points.

5. Mixed-Effects Models:
   - Mixed-effects models can handle missing data by utilizing all available information, even if a participant has some
     missing data points.
   - Potential Consequences: While mixed-effects models are robust and can provide unbiased estimates under the missing at
     random (MAR) assumption, their complexity might require more advanced statistical knowledge to implement and interpret.

The consequences of using different methods to handle missing data can include biased estimates, distorted variability, 
reduced statistical power, and inaccurate hypothesis testing. The choice of method depends on the nature of the missing data,
the underlying assumptions of the analysis, and the research context. It's important to carefully consider the implications
of your chosen method and document your decisions transparently in your analysis report. If missing data are extensive or 
patterns are complex, consulting with a statistician is recommended to ensure appropriate handling.
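
A small sketch of how three of these strategies change the data, using hypothetical pain scores at three time points with None marking a missing value. The patient labels, scores, and the helper name locf are invented for illustration.

```python
from statistics import mean

# Hypothetical pain scores at three time points; None marks a missing value.
scores = {
    "patient_1": [7, 5, 3],
    "patient_2": [8, None, 4],
    "patient_3": [6, 5, None],
    "patient_4": [9, 7, 5],
}
n_times = 3

# 1. Listwise deletion: keep only participants with complete data.
complete_cases = {p: v for p, v in scores.items() if None not in v}

# 2. Mean imputation: replace a missing value with the observed mean
#    of that time point.
time_means = [
    mean(v[t] for v in scores.values() if v[t] is not None)
    for t in range(n_times)
]
mean_imputed = {
    p: [time_means[t] if x is None else x for t, x in enumerate(v)]
    for p, v in scores.items()
}

# 3. Last observation carried forward (LOCF).
def locf(values):
    filled, last = [], None
    for x in values:
        last = x if x is not None else last
        filled.append(last)
    return filled

locf_imputed = {p: locf(v) for p, v in scores.items()}

print("Complete cases:", sorted(complete_cases))
print("Mean-imputed patient_2:", mean_imputed["patient_2"])
print("LOCF patient_3:", locf_imputed["patient_3"])
```

Listwise deletion halves the sample here, while the two imputation approaches keep all four patients but fill the gaps with values that were never observed.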


# What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.


In [11]:
After conducting an ANOVA and finding a significant difference among group means, post-hoc tests are often performed to
determine which specific groups differ from each other. Because many pairwise comparisons are made, post-hoc procedures
adjust for multiple testing to keep the familywise Type I error (false positive) rate under control. Some common post-hoc
tests include:

1. Tukey's Honestly Significant Difference (HSD):
   - Tukey's HSD compares all possible pairs of group means and controls the overall familywise error rate. It's appropriate 
    when you have a moderate to large number of groups and want to test all pairwise comparisons.
   - Example: A researcher conducts a one-way ANOVA to compare the effectiveness of four different exercise programs on 
    weight loss. The ANOVA indicates a significant difference among the means, so Tukey's HSD is used to determine which
    programs lead to significantly different weight loss.

2. Bonferroni Correction:
   - The Bonferroni correction divides the significance level by the number of comparisons to maintain an overall familywise
     error rate. It is simple and works with any test, but it becomes increasingly conservative as the number of
     comparisons grows, so it is best suited to a modest number of comparisons.
   - Example: In a study comparing the test scores of students from five different schools, the ANOVA shows a significant 
     difference. To avoid inflating the Type I error rate, Bonferroni-corrected pairwise t-tests are used to compare each 
    school's scores with the others.

3. Dunn's Test (Non-parametric):
   - Dunn's test is a non-parametric post-hoc test based on ranks. It is the usual follow-up to a significant Kruskal-Wallis
     test, which is used in place of ANOVA when its assumptions are violated (e.g., clear non-normality). It performs
     pairwise comparisons, typically with a multiplicity adjustment such as Bonferroni.
   - Example: Completion times for three running shoe brands are strongly skewed, so a Kruskal-Wallis test is used instead
     of ANOVA and reveals a significant difference. Dunn's test is then chosen for the pairwise comparisons.

4. Scheffe's Method:
   - Scheffe's method is a conservative post-hoc test that controls the familywise error rate across all possible contrasts,
     not just pairwise comparisons. It can be used with unequal sample sizes, but it still assumes equal variances; when
     variances differ, a procedure such as Games-Howell is more appropriate.
   - Example: A research study involving different types of therapies yields a significant result in the ANOVA. Because the
     group sizes vary and the researcher also wants to compare combinations of therapies (complex contrasts), Scheffe's
     method is used.

5. Fisher's Least Significant Difference (LSD):
   - Fisher's LSD runs ordinary pairwise t-tests only after a significant overall F-test and applies no further adjustment.
     It is the least conservative option and protects the familywise error rate only when there are three groups.
   - Example: An ANOVA is used to analyze the effects of different fertilizer treatments on crop yield. The ANOVA result is
     significant, prompting the use of Fisher's LSD to determine which treatments lead to significantly different yields.

The choice of post-hoc test depends on the specific research context, the nature of the data, the number of comparisons, and
assumptions such as normality and homogeneity of variances. It's important to select a post-hoc test that aligns with the
analysis and to report the chosen method transparently in your research findings.
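
As a rough illustration of the Bonferroni idea, the sketch below counts the pairwise comparisons among four groups, shows how the familywise error rate would grow without any correction (assuming independent tests), and applies the adjusted threshold. The program names and the raw p-values are hypothetical.

```python
from itertools import combinations

programs = ["Program 1", "Program 2", "Program 3", "Program 4"]
alpha = 0.05

# All pairwise comparisons: k * (k - 1) / 2 of them.
pairs = list(combinations(programs, 2))
m = len(pairs)

# Chance of at least one false positive if every pair were tested at alpha
# with no correction (assuming independent tests).
uncorrected_fwer = 1 - (1 - alpha) ** m

# Bonferroni: test each pair at alpha / m instead.
alpha_per_test = alpha / m

# Hypothetical unadjusted p-values, one per pair, purely for illustration.
raw_p_values = [0.001, 0.020, 0.004, 0.300, 0.045, 0.008]

print(f"{m} comparisons, uncorrected familywise error rate = {uncorrected_fwer:.3f}")
print(f"Bonferroni per-comparison alpha = {alpha_per_test:.4f}")
for (a, b), p in zip(pairs, raw_p_values):
    adjusted_p = min(1.0, p * m)  # equivalent view: inflate p, compare to alpha
    verdict = "significant" if p < alpha_per_test else "not significant"
    print(f"{a} vs {b}: raw p = {p:.3f}, adjusted p = {adjusted_p:.3f} -> {verdict}")
```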


# A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.




In [12]:
import numpy as np
from scipy import stats

# Simulated weight loss data for diets A, B, and C (25 simulated participants per diet, 75 in total)
diet_A = np.array([2.5, 3.2, 4.0, 2.8, 3.5, 2.0, 3.1, 2.9, 3.7, 2.2,
                   2.8, 2.5, 2.9, 2.3, 3.1, 3.8, 3.6, 3.0, 3.5, 2.7,
                   3.4, 2.1, 2.6, 3.2, 3.5])
diet_B = np.array([1.8, 2.0, 1.5, 1.9, 2.3, 1.6, 1.7, 2.1, 2.5, 1.8,
                   1.9, 1.7, 2.0, 1.6, 2.2, 2.3, 2.1, 2.4, 1.5, 1.9,
                   2.0, 1.7, 1.8, 2.3, 2.4])
diet_C = np.array([4.5, 4.2, 4.8, 4.0, 4.7, 4.3, 4.1, 4.6, 4.4, 4.9,
                   4.2, 4.0, 4.5, 4.4, 4.7, 4.6, 4.3, 4.2, 4.8, 4.1,
                   4.6, 4.3, 4.5, 4.4, 4.2])

# Combine the data from all diets
all_data = np.concatenate([diet_A, diet_B, diet_C])

# Create group labels
groups = np.repeat(['A', 'B', 'C'], 25)

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Print results
print("F-Statistic:", f_statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("There is significant evidence to reject the null hypothesis.")
    print("There are significant differences in mean weight loss among the diets.")
else:
    print("There is insufficient evidence to reject the null hypothesis.")
    print("There is no significant difference in mean weight loss among the diets.")


F-Statistic: 248.46909620991258
p-value: 4.804261970146147e-33
There is significant evidence to reject the null hypothesis.
There are significant differences in mean weight loss among the diets.


# A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.






In [13]:
import numpy as np
from scipy import stats

# Simulated data: Time taken by employees for each program and experience level
program_A_novice = np.array([25.1, 27.3, 26.5, 28.2, 25.8, 26.9, 27.6, 28.5, 27.7, 26.3,
                             27.8, 28.0, 25.6, 26.8, 27.4, 25.9, 28.7, 26.4, 27.2, 28.4,
                             26.7, 27.9, 28.3, 26.1, 25.5, 27.1, 26.6, 28.8, 25.4, 26.2])
program_B_novice = np.array([29.2, 29.6, 28.9, 29.8, 30.0, 29.4, 28.7, 29.1, 30.2, 28.5,
                             29.9, 28.8, 29.3, 29.7, 29.5, 30.1, 28.6, 29.0, 28.4, 29.8,
                             29.3, 28.7, 28.5, 29.6, 29.9, 28.3, 29.7, 29.4, 30.2, 28.2])
program_C_novice = np.array([31.5, 31.3, 31.2, 30.9, 31.6, 30.8, 31.0, 31.4, 30.7, 31.1,
                             30.5, 31.2, 31.7, 30.6, 31.3, 31.1, 30.8, 30.9, 31.5, 30.7,
                             31.4, 31.2, 31.6, 30.6, 30.8, 31.0, 31.3, 31.2, 30.9, 31.1])

program_A_exp = np.array([23.3, 22.8, 23.9, 22.5, 23.1, 22.6, 23.7, 22.4, 23.0, 23.8,
                          22.9, 23.2, 23.6, 22.7, 22.8, 23.4, 22.3, 23.5, 23.2, 22.8,
                          23.6, 22.5, 23.1, 22.7, 23.4, 23.3, 22.9, 22.8, 23.5, 22.6])
program_B_exp = np.array([24.7, 24.2, 24.8, 25.0, 24.5, 24.3, 24.9, 24.4, 24.6, 24.1,
                          24.2, 24.5, 24.8, 25.0, 24.4, 24.7, 24.9, 24.6, 24.3, 24.8,
                          24.7, 24.2, 24.5, 24.6, 24.9, 24.3, 24.8, 24.5, 24.7, 24.6])
program_C_exp = np.array([26.4, 26.8, 26.7, 26.5, 26.9, 26.3, 26.6, 26.8, 26.4, 26.7,
                          26.5, 26.6, 26.9, 26.7, 26.3, 26.4, 26.8, 26.6, 26.5, 26.7,
                          26.4, 26.9, 26.6, 26.5, 26.3, 26.8, 26.7, 26.6, 26.9, 26.5])

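# Combine data and create labels for programs and experience levels
# (labels follow the concatenation order: A, B, C for novices, then A, B, C for experienced)
all_times = np.concatenate([program_A_novice, program_B_novice, program_C_novice,
                            program_A_exp, program_B_exp, program_C_exp])
program_labels = np.tile(np.repeat(['A', 'B', 'C'], 30), 2)
experience_labels = np.repeat(['Novice', 'Experienced'], 90)

# Omnibus test: one-way ANOVA across the six program-by-experience cells.
# This tests whether any cell mean differs; it does not separate the main effects
# of program and experience from their interaction, as a full two-way ANOVA would.
f_statistic, p_value = stats.f_oneway(program_A_novice, program_B_novice, program_C_novice,
                                      program_A_exp, program_B_exp, program_C_exp)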

# Print results
print("F-Statistic:", f_statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("There is significant evidence to reject the null hypothesis.")
    print("There are significant differences in mean completion times among the groups.")
else:
    print("There is insufficient evidence to reject the null hypothesis.")
    print("There is no significant difference in mean completion times among the groups.")


F-Statistic: 820.9838291914634
p-value: 5.887361195881858e-119
There is significant evidence to reject the null hypothesis.
There are significant differences in mean completion times among the groups.
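
A single F-test across the six cells does not report separate effects for program, experience, and their interaction. The sketch below shows how those three F-statistics can be obtained for a balanced design with replication, using only the standard library. The completion times are a small hypothetical data set (three employees per cell), not the simulated arrays above; p-values would follow from the F distribution, for example with scipy.stats.f.sf(f, df_effect, df_error).

```python
from statistics import mean

# Hypothetical completion times: data[program][experience] -> replicates.
data = {
    "A": {"novice": [26, 28, 27], "experienced": [23, 22, 24]},
    "B": {"novice": [29, 30, 28], "experienced": [25, 24, 26]},
    "C": {"novice": [33, 34, 32], "experienced": [27, 26, 28]},
}
programs = list(data)
levels = list(data["A"])
n = 3  # replicates per cell (balanced design assumed)

values = [x for p in programs for e in levels for x in data[p][e]]
grand = mean(values)
prog_mean = {p: mean(x for e in levels for x in data[p][e]) for p in programs}
exp_mean = {e: mean(x for p in programs for x in data[p][e]) for e in levels}
cell_mean = {(p, e): mean(data[p][e]) for p in programs for e in levels}

# Sums of squares for each source of variation.
ss_total = sum((x - grand) ** 2 for x in values)
ss_program = n * len(levels) * sum((prog_mean[p] - grand) ** 2 for p in programs)
ss_experience = n * len(programs) * sum((exp_mean[e] - grand) ** 2 for e in levels)
ss_interaction = n * sum(
    (cell_mean[p, e] - prog_mean[p] - exp_mean[e] + grand) ** 2
    for p in programs for e in levels
)
ss_error = sum(
    (x - cell_mean[p, e]) ** 2
    for p in programs for e in levels for x in data[p][e]
)

# Degrees of freedom.
df_program = len(programs) - 1
df_experience = len(levels) - 1
df_interaction = df_program * df_experience
df_error = len(programs) * len(levels) * (n - 1)

ms_error = ss_error / df_error
f_program = (ss_program / df_program) / ms_error
f_experience = (ss_experience / df_experience) / ms_error
f_interaction = (ss_interaction / df_interaction) / ms_error

print(f"Program:     F({df_program}, {df_error}) = {f_program:.2f}")
print(f"Experience:  F({df_experience}, {df_error}) = {f_experience:.2f}")
print(f"Interaction: F({df_interaction}, {df_error}) = {f_interaction:.2f}")
```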


# An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.






In [14]:
import numpy as np
from scipy import stats

# Simulated test scores for control and experimental groups
control_scores = np.array([75, 78, 80, 72, 68, 70, 82, 85, 76, 79,
                           88, 74, 77, 81, 84, 79, 72, 86, 73, 75,
                           89, 70, 81, 83, 78, 75, 76, 80, 71, 74,
                           77, 80, 85, 72, 76, 78, 81, 83, 79, 82,
                           88, 74, 77, 80, 72, 75, 76, 79, 70, 84])

experimental_scores = np.array([85, 88, 89, 82, 79, 81, 91, 92, 86, 87,
                                93, 80, 83, 88, 90, 85, 78, 91, 79, 82,
                                94, 81, 87, 89, 84, 83, 85, 87, 80, 82,
                                84, 87, 92, 78, 82, 85, 88, 90, 86, 89,
                                94, 81, 84, 86, 79, 82, 84, 87, 81, 91])

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)

# Print t-test results
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Interpret the t-test results
alpha = 0.05
if p_value < alpha:
    print("There is significant evidence to reject the null hypothesis.")
    print("There is a significant difference in test scores between the groups.")
else:
    print("There is insufficient evidence to reject the null hypothesis.")
    print("There is no significant difference in test scores between the groups.")

# Post-hoc test: not needed here. With only two groups, the t-test above is already the single
# pairwise comparison, so there are no further pairs to test. The sign of the t-statistic gives
# the direction: a negative value means the control mean is lower than the experimental mean.


t-statistic: -7.771991726997306
p-value: 7.750673585758236e-12
There is significant evidence to reject the null hypothesis.
There is a significant difference in test scores between the groups.

# A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [15]:
# A repeated measures ANOVA is used when the same subjects or entities are measured under multiple conditions or time
# points. In this scenario the 30 days play the role of the subjects: each day is observed once for every store, so the
# three sales figures from the same day are likely to be correlated (weekday, weather, holidays).

# A repeated measures (randomized block) ANOVA with day as the blocking factor accounts for that pairing. The code below
# instead runs a one-way ANOVA, which treats the three samples as independent. It is a simpler analysis that ignores the
# day-to-day pairing, so it is usually less powerful when sales vary a lot from day to day.

import numpy as np
from scipy import stats

# Simulated daily sales data for three stores
store_A_sales = np.array([1200, 1300, 1100, 1250, 1400, 1350, 1280, 1250, 1300, 1220,
                          1180, 1400, 1320, 1275, 1290, 1310, 1260, 1295, 1240, 1325,
                          1330, 1205, 1265, 1300, 1285, 1255, 1275, 1230, 1225, 1300])
store_B_sales = np.array([1100, 1150, 1000, 1120, 1220, 1185, 1150, 1080, 1125, 1180,
                          1050, 1200, 1165, 1135, 1110, 1205, 1195, 1145, 1170, 1215,
                          1190, 1085, 1160, 1210, 1125, 1175, 1165, 1105, 1090, 1140])
store_C_sales = np.array([1500, 1450, 1550, 1475, 1490, 1425, 1460, 1505, 1480, 1440,
                          1405, 1430, 1475, 1415, 1435, 1420, 1445, 1495, 1485, 1465,
                          1470, 1445, 1480, 1465, 1455, 1410, 1425, 1400, 1435, 1450])

# Combine the data from all stores
all_sales = np.concatenate([store_A_sales, store_B_sales, store_C_sales])

# Create group labels
store_labels = np.repeat(['Store A', 'Store B', 'Store C'], 30)

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(store_A_sales, store_B_sales, store_C_sales)

# Print results
print("F-Statistic:", f_statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("There is significant evidence to reject the null hypothesis.")
    print("There are significant differences in average daily sales among the stores.")
else:
    print("There is insufficient evidence to reject the null hypothesis.")
    print("There is no significant difference in average daily sales among the stores.")


F-Statistic: 287.0602420933422
p-value: 4.8627371922113013e-39
There is significant evidence to reject the null hypothesis.
There are significant differences in average daily sales among the stores.
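
If each position in the three arrays is taken to refer to the same day (an assumption, since the simulated data do not say so), the days can be treated as blocks and the day-to-day variation removed from the error term. The sketch below does this by hand, with the standard library only, for the first five values of each store from the data above.

```python
from statistics import mean

# First five days of the simulated data; columns = Store A, Store B, Store C.
# Pairing the arrays by position (same day per row) is an assumption.
sales = [
    [1200, 1100, 1500],
    [1300, 1150, 1450],
    [1100, 1000, 1550],
    [1250, 1120, 1475],
    [1400, 1220, 1490],
]
n_days, n_stores = len(sales), len(sales[0])

grand = mean(x for row in sales for x in row)
store_means = [mean(row[j] for row in sales) for j in range(n_stores)]
day_means = [mean(row) for row in sales]

# Partition: total = stores (condition) + days (subjects/blocks) + error.
ss_total = sum((x - grand) ** 2 for row in sales for x in row)
ss_stores = n_days * sum((m - grand) ** 2 for m in store_means)
ss_days = n_stores * sum((m - grand) ** 2 for m in day_means)
ss_error = ss_total - ss_stores - ss_days

df_stores = n_stores - 1
df_error = (n_stores - 1) * (n_days - 1)
f_stores = (ss_stores / df_stores) / (ss_error / df_error)

print(f"SS stores = {ss_stores:.1f}, SS days = {ss_days:.1f}, SS error = {ss_error:.1f}")
print(f"F({df_stores}, {df_error}) for stores = {f_stores:.2f}")
```

A significant result would then be followed by pairwise paired comparisons between stores, with a multiplicity adjustment such as Bonferroni.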
