### **Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.**

### ***ANSWER :***

Analysis of Variance (ANOVA) is a statistical method used to compare the means of two or more groups and determine if there are any significant differences among them. However, ANOVA relies on certain assumptions for its validity. Violating these assumptions can lead to inaccurate or unreliable results. The key assumptions for using ANOVA are:

1. **Independence of observations:** Observations should be independent of each other, both within each group and across groups. The value recorded for one observation should not be influenced by or related to the value of any other observation.

2. **Normality:** The dependent variable (more precisely, the residuals) should be approximately normally distributed within each group. This assumption matters most when sample sizes are small; as a rule of thumb, about 30 observations per group is often considered large enough for ANOVA to be robust to moderate departures from normality.

3. **Homogeneity of variance (homoscedasticity):** The variances of the dependent variable should be approximately equal across all groups. In other words, the spread or dispersion of data points in each group should be similar.

4. **Balanced group sizes (helpful, not strictly required):** Equal group sizes are not a formal assumption of one-way ANOVA, and the test can be run on unequal groups. Balanced designs are nevertheless preferable because the F-test is more robust to unequal variances when group sizes are equal; with very different group sizes, a violation of homogeneity of variance distorts the results more strongly.

***Examples of violations that could impact the validity of ANOVA results:***

1. **Non-independence of observations:** If data points in one group are influenced by or dependent on data points in another group, the independence assumption is violated. For example, if multiple measurements are taken from the same individual over time, these measurements may not be independent and can bias the ANOVA results.

2. **Non-normality:** If the data within each group are strongly skewed or contain extreme outliers, the F-test may not be accurate. The problem is most serious with small samples, where the sampling distribution of the group means cannot be relied on to be approximately normal.

3. **Heteroscedasticity:** If the variance of the dependent variable differs substantially across groups, the assumption of homogeneity of variance is violated. The pooled within-groups variance then misrepresents some of the groups, so the F-test can become too liberal or too conservative, especially when group sizes are unequal.

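4. **Highly unequal group sizes:** Unequal group sizes do not invalidate ANOVA by themselves, but they reduce statistical power and make the F-test more sensitive to unequal variances. For example, if the smallest group also has the largest variance, the Type I error rate can rise above the nominal level.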

When these assumptions are violated, alternative statistical tests or data transformations may be needed to draw valid conclusions. For example, the non-parametric Kruskal-Wallis test can be used when the normality assumption is not met, Welch's ANOVA can be used when variances are unequal, and a logarithmic or square-root transformation may reduce heteroscedasticity or non-normality. Inspecting the data visually and running diagnostic tests, such as the Shapiro-Wilk test for normality and Levene's test for equal variances, helps identify violations and select an appropriate analysis.
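The diagnostic checks can be sketched with `scipy.stats`. The three samples below are made-up illustration data (one group is deliberately given a larger spread), and the 0.05 cut-off is a conventional choice rather than a rule, so treat this as a minimal sketch and not a complete diagnostic workflow.

```python
import numpy as np
from scipy import stats

# Hypothetical samples for three groups (illustration only)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=20, scale=2, size=25)
group_b = rng.normal(loc=22, scale=2, size=25)
group_c = rng.normal(loc=21, scale=6, size=25)  # deliberately larger spread
groups = {"A": group_a, "B": group_b, "C": group_c}

alpha = 0.05  # conventional significance level (an assumption of this sketch)

# Normality within each group: Shapiro-Wilk (H0: the sample comes from a normal distribution)
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p

# Homogeneity of variance: Levene's test (H0: all groups have equal variance)
levene_stat, levene_p = stats.levene(group_a, group_b, group_c)

# Non-parametric alternative to one-way ANOVA: Kruskal-Wallis
kruskal_stat, kruskal_p = stats.kruskal(group_a, group_b, group_c)

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk, group {name}: p = {p:.4f}")
print(f"Levene: p = {levene_p:.4f}")
print(f"Kruskal-Wallis: p = {kruskal_p:.4f}")
```

A p-value below `alpha` in the Shapiro-Wilk or Levene test is evidence against the corresponding assumption; it is usually worth confirming the impression with a histogram or Q-Q plot before switching methods.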

### **Q2. What are the three types of ANOVA, and in what situations would each be used?**

### ***ANSWER :***

The three main types of ANOVA are:

1. **One-Way ANOVA (Analysis of Variance):**
   One-Way ANOVA is used when you have one categorical independent variable (also known as a factor) and one continuous dependent variable. The categorical variable divides the data into two or more groups, and the continuous variable is measured for each group. The purpose of One-Way ANOVA is to determine if there are any significant differences in the means of the dependent variable among the different groups.

   Example situations for using One-Way ANOVA:
   - Comparing the average test scores of students from different schools (with schools as the groups).
   - Analyzing the effect of different treatments on a specific medical condition.

2. **Two-Way ANOVA:**
   Two-Way ANOVA is used when you have two categorical independent variables and one continuous dependent variable. It tests the main effect of each independent variable as well as their interaction, that is, whether the effect of one variable on the dependent variable depends on the level of the other.

   Example situations for using Two-Way ANOVA:
   - Studying the impact of two different factors (e.g., gender and age group) on salary levels.
   - Investigating the influence of both type of diet and exercise regimen on weight loss.

3. **Repeated Measures ANOVA (or One-Way Within-Subjects ANOVA):**
   Repeated Measures ANOVA is used when you have a single group of participants who are measured on the same dependent variable under different conditions or at multiple time points. This type of ANOVA allows you to examine within-subject effects and assess how the different conditions or time points influence the dependent variable.

   Example situations for using Repeated Measures ANOVA:
   - Evaluating the effectiveness of a memory training program by measuring participants' memory performance before training, immediately after, and one month after the training.
   - Analyzing the effects of different levels of workload on participants' stress levels by measuring their stress levels at different time points during a simulation.

 ***The choice of ANOVA type depends on the number of independent variables and the design of the study. One-Way ANOVA is suitable for comparing means across multiple groups, Two-Way ANOVA is used when there are two categorical independent variables, and Repeated Measures ANOVA is employed when dealing with within-subject designs with multiple measurements over time or conditions.***

### **Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?**

### ***ANSWER :***

Partitioning of variance in ANOVA refers to the division of the total variance in the data into different components or sources of variation. These components represent the variability attributed to various factors or sources in the experimental design. Understanding this concept is crucial because it allows researchers to identify the contributions of different factors to the overall variability in the data and to assess the significance of these factors in explaining the observed differences among groups.

In a One-Way ANOVA, the total variance in the data is partitioned into two main components:

1. **Between-Groups Variance:** This component represents the variation between the group means. It measures how much the means of the different groups differ from each other. A larger between-groups variance indicates greater differences between the group means. Whether those differences are statistically significant depends on how large this variance is relative to the within-groups variance.

2. **Within-Groups Variance:** This component represents the variation within each group. It measures how much the individual data points within each group differ from their respective group mean. A larger within-groups variance indicates more variability within groups, which might be due to random fluctuations or measurement errors.

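The F-ratio (F-statistic) in ANOVA is calculated by dividing the between-groups mean square (the between-groups sum of squares divided by its degrees of freedom, k - 1) by the within-groups mean square (the within-groups sum of squares divided by N - k), where k is the number of groups and N is the total number of observations. A high F-ratio means that the differences between group means are large compared to the within-group variability, indicating that the factor being tested is likely influencing the dependent variable.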
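Written out for a one-way design, the partition and the F-ratio look as follows. The notation is an assumption of this sketch: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, $y_{ij}$ the $j$-th observation in group $i$, $\bar{y}_i$ the mean of group $i$, and $\bar{y}$ the grand mean.

```latex
\begin{aligned}
SS_{\text{total}}   &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( y_{ij} - \bar{y} \right)^2 \\
SS_{\text{between}} &= \sum_{i=1}^{k} n_i \left( \bar{y}_i - \bar{y} \right)^2 \\
SS_{\text{within}}  &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( y_{ij} - \bar{y}_i \right)^2 \\
SS_{\text{total}}   &= SS_{\text{between}} + SS_{\text{within}} \\
F &= \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}
\end{aligned}
```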

The concept of partitioning of variance is important because it helps researchers to:

1. **Assess the significance of the factor(s):** By comparing the between-groups and within-groups variance, researchers can determine if the observed differences between groups are statistically significant or simply due to random fluctuations.

2. **Understand the effect size:** The ratio of the between-groups sum of squares to the total sum of squares, known as eta-squared (η²), provides an effect size measure. A larger value suggests a stronger effect of the factor(s) on the dependent variable.

3. **Identify potential sources of variation:** By partitioning the variance, researchers can gain insights into which factors or variables contribute more to the observed variability in the data.

4. **Make informed decisions:** Understanding the partitioning of variance helps researchers interpret ANOVA results correctly and make informed decisions based on the significance and effect size of the factors being studied.

***Partitioning of variance is a fundamental concept in ANOVA that aids in understanding the contributions of different factors to the variability in the data and determining the statistical significance of these factors. It plays a central role in the interpretation of ANOVA results and allows researchers to draw meaningful conclusions from their analyses.***

### **Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?**

### ***ANSWER :***

In [3]:
import numpy as np
from scipy.stats import f_oneway

# Assume you have data for three groups: group1, group2, and group3
# Replace these with your actual data arrays for each group

group1 = [15, 18, 20, 22, 17]
group2 = [25, 28, 30, 35, 32]
group3 = [10, 14, 12, 9, 11]

# Combine all the data into a single array for easier calculations
data = np.concatenate([group1, group2, group3])

# Calculate the grand mean (overall mean)
grand_mean = np.mean(data)

# Calculate the total sum of squares (SST)
SST = np.sum((data - grand_mean) ** 2)

# Explained (between-groups) sum of squares (SSE): squared distance of each
# group mean from the grand mean, weighted by the group size
groups = [np.array(group1), np.array(group2), np.array(group3)]
SSE = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Residual (within-groups) sum of squares (SSR): squared distance of each
# observation from its own group mean. Note that SST = SSE + SSR.
SSR = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# The F-test itself: F = (SSE / (k - 1)) / (SSR / (N - k))
f_statistic, p_value = f_oneway(group1, group2, group3)

print(f"Total Sum of Squares (SST): {SST:.4f}")
print(f"Explained Sum of Squares (SSE): {SSE:.4f}")
print(f"Residual Sum of Squares (SSR): {SSR:.4f}")
print("F-statistic:", f_statistic)
print("p-value:", p_value)


Total Sum of Squares (SST): 1001.7333
Explained Sum of Squares (SSE): 899.7333
Residual Sum of Squares (SSR): 102.0000
F-statistic: 52.925490196078435
p-value: 1.1145210562966572e-06


### **Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?**

### ***ANSWER :***

Assume you have data for two independent variables, `factor_A` and `factor_B`, and a dependent variable `y`. The main effects and the interaction effect can be tested with statsmodels by fitting a linear model with an interaction term and building an ANOVA table from it:

In [4]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Assume you have data for factor_A, factor_B, and the dependent variable y
# Replace these with your actual data arrays for each variable

factor_A = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
factor_B = [10, 10, 10, 10, 10, 20, 20, 20, 20, 20]
y = [15, 18, 22, 25, 28, 10, 14, 18, 22, 26]

# Create a DataFrame to use with statsmodels
df = pd.DataFrame({'factor_A': factor_A, 'factor_B': factor_B, 'y': y})

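# Fit the two-way model: both main effects plus their interaction.
# factor_A and factor_B are numeric here, so each is treated as a continuous
# predictor with 1 degree of freedom; use C(factor_A) and C(factor_B) to treat
# them as categorical factors (this needs more than one observation per cell).
model = ols('y ~ factor_A + factor_B + factor_A:factor_B', data=df).fit()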

# Print the ANOVA table and summary of the model
print("ANOVA Table:")
print(sm.stats.anova_lm(model))
print("\nModel Summary:")
print(model.summary())

ANOVA Table:
                    df  sum_sq  mean_sq       F        PR(>F)
factor_A           1.0  266.45   266.45  5329.0  4.447170e-10
factor_B           1.0   32.40    32.40   648.0  2.421421e-07
factor_A:factor_B  1.0    2.45     2.45    49.0  4.234833e-04
Residual           6.0    0.30     0.05     NaN           NaN

Model Summary:
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.999
Method:                 Least Squares   F-statistic:                     2009.
Date:                Tue, 25 Jul 2023   Prob (F-statistic):           2.15e-09
Time:                        11:44:15   Log-Likelihood:                 3.3434
No. Observations:                  10   AIC:                             1.313
Df Residuals:                       6   BIC:                             2.524
Df Model:                    



# **OR**

In [5]:
import numpy as np
from scipy.stats import f

# f_oneway handles only a single factor, so the sums of squares are computed
# by hand here, treating both factors as categorical.
factor_A = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
factor_B = np.array([10, 10, 10, 10, 10, 20, 20, 20, 20, 20])
y = np.array([15, 18, 22, 25, 28, 10, 14, 18, 22, 26], dtype=float)

grand_mean = y.mean()
levels_A, levels_B = np.unique(factor_A), np.unique(factor_B)

# Main-effect sums of squares: group size times the squared deviation of
# each level mean from the grand mean
ss_A = sum(np.sum(factor_A == a) * (y[factor_A == a].mean() - grand_mean) ** 2 for a in levels_A)
ss_B = sum(np.sum(factor_B == b) * (y[factor_B == b].mean() - grand_mean) ** 2 for b in levels_B)
ss_total = np.sum((y - grand_mean) ** 2)

# This data set has one observation per A x B cell, so the interaction cannot
# be separated from random error; the remainder serves as the residual term
ss_resid = ss_total - ss_A - ss_B

df_A, df_B = len(levels_A) - 1, len(levels_B) - 1
df_resid = len(y) - 1 - df_A - df_B
ms_resid = ss_resid / df_resid

f_statistic_A = (ss_A / df_A) / ms_resid
f_statistic_B = (ss_B / df_B) / ms_resid
p_value_A = f.sf(f_statistic_A, df_A, df_resid)
p_value_B = f.sf(f_statistic_B, df_B, df_resid)

print("Main Effect of Factor A:")
print("F-statistic:", f_statistic_A)
print("p-value:", p_value_A)

print("\nMain Effect of Factor B:")
print("F-statistic:", f_statistic_B)
print("p-value:", p_value_B)


### **Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?**

### ***ANSWER :***

In a one-way ANOVA, the F-statistic is used to test the null hypothesis that the means of all the groups are equal. If the obtained F-statistic is sufficiently large and the corresponding p-value is smaller than the chosen significance level (often 0.05), we can reject the null hypothesis. The rejection of the null hypothesis implies that there are significant differences between at least two of the groups' means.

Here, the F-statistic is 5.23 and the p-value is 0.02. Since the p-value is less than the common significance level of 0.05, we reject the null hypothesis in favor of the alternative that at least one group mean differs from the others. Put differently, if all group means were equal, the probability of obtaining an F-statistic of 5.23 or larger would be only 2%, which is below the 5% threshold.

**Interpreting the results:**
The results of the one-way ANOVA indicate that the factor (or treatment) being studied has a statistically significant effect on the dependent variable. However, the ANOVA does not tell us which specific groups differ from each other; it only indicates that there is a difference somewhere among the groups.

To identify which groups are different, you can perform post-hoc tests (e.g., Tukey's HSD test or Bonferroni correction) or pairwise comparisons. These tests will help you pinpoint which specific groups have significantly different means from one another.

Keep in mind that the effect size is also important for interpretation. In addition to conducting post-hoc tests, you may want to calculate effect size measures (e.g., eta-squared, Cohen's d) to quantify the practical significance of the differences between the groups.
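As a rough illustration of the effect size step, eta-squared for a one-way ANOVA can be recovered from the F-statistic and its degrees of freedom. The question does not state the degrees of freedom, so the values below (3 groups, 30 observations) are hypothetical, and the function name `eta_squared_from_f` is made up for this sketch.

```python
from scipy.stats import f


def eta_squared_from_f(f_statistic, df_between, df_within):
    """Eta-squared for a one-way ANOVA, recovered from F and its degrees of freedom."""
    return (f_statistic * df_between) / (f_statistic * df_between + df_within)


# F comes from the question; the degrees of freedom are HYPOTHETICAL
# (3 groups, 30 observations) because the question does not state them
f_statistic = 5.23
df_between, df_within = 2, 27

eta_sq = eta_squared_from_f(f_statistic, df_between, df_within)

# Upper-tail probability of the F distribution for these degrees of freedom;
# it will not match the question's 0.02 unless the true degrees of freedom are used
p_from_f = f.sf(f_statistic, df_between, df_within)

print(f"eta-squared: {eta_sq:.3f}")
print(f"p-value implied by the assumed degrees of freedom: {p_from_f:.4f}")
```

The identity used here follows from F = (SS_between / df_between) / (SS_within / df_within) and eta-squared = SS_between / (SS_between + SS_within).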



### **Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?**

### ***ANSWER :***

Handling missing data in a repeated measures ANOVA is an important consideration to ensure valid and reliable results. There are several methods to handle missing data, each with its potential consequences. Here are some common approaches:

1. **Complete Case Analysis (Listwise Deletion):**
   This method involves excluding any participant or case with missing data on any variable involved in the analysis. It is the simplest approach but can lead to reduced sample size and potential loss of statistical power. It assumes that the data is missing completely at random (MCAR), which may not always be a valid assumption.

2. **Mean Imputation:**
   Missing values in each condition or time point are replaced with the mean of the available data for that condition or time point. This method may introduce bias and artificially reduce variability, leading to inaccurate estimates of the treatment effects.

3. **Last Observation Carried Forward (LOCF):**
   Missing values are replaced with the last observed value for that participant. This approach assumes that a participant's score would have stayed unchanged after the last observation, which is rarely realistic. It can bias the estimated treatment effect in either direction and understates variability.

4. **Multiple Imputation:**
   Multiple imputation generates several plausible imputations for the missing data based on the observed data and incorporates the uncertainty of imputation into the analysis. This method can provide more robust results when data is missing at random (MAR), but it requires careful implementation and may be computationally intensive.

5. **Model-Based Imputation:**
   This approach uses a statistical model to predict missing values based on other variables in the data. Imputed values are drawn from the model's predicted distribution. Model-based imputation may be effective when there are systematic patterns of missing data, but it relies on the validity of the underlying model.

***Potential consequences of using different methods:***

- Complete Case Analysis (Listwise Deletion): Reduced sample size, loss of statistical power, and potential bias if data is not missing completely at random.

- Mean Imputation: Underestimation of standard errors, inflated significance levels, and biased parameter estimates.

- LOCF: Potential distortion of treatment effects, especially if missing data is related to participants dropping out due to negative treatment effects.

- Multiple Imputation: More accurate parameter estimates and standard errors when data is missing at random, but it may be computationally demanding.

- Model-Based Imputation: Effectiveness depends on the validity of the model used for imputation. If the model is misspecified, it can lead to biased results.

Selecting an appropriate method for handling missing data requires understanding the underlying mechanisms of missingness and consideration of the assumptions and potential biases introduced by each method. It is crucial to be cautious in interpreting the results when handling missing data, as the chosen approach can influence the validity and generalizability of the findings.
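The three simplest approaches can be compared on a small example with pandas. The scores below are hypothetical wide-format data (one row per participant, one column per time point); multiple imputation and model-based imputation need dedicated tooling and are not shown in this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 9.0, 14.0, 11.0],
        "t2": [12.0, np.nan, 10.0, 15.0, 13.0],
        "t3": [13.0, 15.0, np.nan, np.nan, 14.0],
    },
    index=["p1", "p2", "p3", "p4", "p5"],
)

# 1. Listwise deletion: keep only participants with complete data
complete_cases = scores.dropna()

# 2. Mean imputation: fill each time point with that column's observed mean
mean_imputed = scores.fillna(scores.mean())

# 3. LOCF: carry each participant's last observed value forward across time points
locf = scores.ffill(axis=1)

print("Participants kept by listwise deletion:", len(complete_cases), "of", len(scores))
print("Std of t3, observed values only:", round(scores["t3"].std(), 3))
print("Std of t3 after mean imputation:", round(mean_imputed["t3"].std(), 3))
print(locf)
```

In this toy data set, listwise deletion discards three of the five participants, and mean imputation shrinks the standard deviation of `t3`, which is the artificial reduction in variability described above.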

### **Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.**

### ***ANSWER :***

After conducting an analysis of variance (ANOVA) and finding a significant overall effect, post-hoc tests are used to make pairwise comparisons between groups to determine which specific group differences are statistically significant. Several common post-hoc tests exist, each with different strengths and assumptions. Some common post-hoc tests include:

1. **Tukey's Honestly Significant Difference (HSD) test:**
   Tukey's HSD is widely used when comparing all possible pairs of groups. It controls the family-wise error rate, making it suitable for situations where multiple pairwise comparisons are made. It assumes approximately equal variances; the original procedure assumes equal sample sizes, and the Tukey-Kramer variant extends it to unequal sample sizes.

2. **Bonferroni correction:**
   The Bonferroni correction adjusts the significance level for each pairwise comparison to control the family-wise error rate. It is straightforward and conservative but may become overly stringent when there are many comparisons, leading to reduced power.

3. **Dunnett's test:**
   Dunnett's test is useful when comparing several treatment groups to a single control group. It controls the family-wise error rate for only those treatment-versus-control comparisons, which makes it more powerful than the Bonferroni correction in this specific situation.

4. **Scheffe's test:**
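   Scheffe's test is a conservative post-hoc test that can handle unequal sample sizes and complex designs. It controls the family-wise error rate for all possible contrasts, not only pairwise ones, so it is appropriate when complex comparisons (e.g., the average of two groups against a third) are of interest. For simple pairwise comparisons it has less power than other post-hoc tests.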

5. **Fisher's Least Significant Difference (LSD) test:**
   Fisher's LSD is the simplest post-hoc test, and it is used when sample sizes are equal and variances are approximately equal. However, it does not control the family-wise error rate, making it more likely to produce false positives when conducting multiple comparisons.

***Example of a situation where a post-hoc test might be necessary:***

Suppose a researcher conducts a study to compare the effectiveness of four different treatments (A, B, C, and D) for reducing anxiety levels in patients. After performing a one-way ANOVA on the data, the researcher finds a significant overall effect, indicating that the treatments have different effects on anxiety levels.

To determine which specific treatments significantly differ from each other, the researcher would conduct post-hoc tests. Tukey's HSD or Scheffe's test could be appropriate choices to make multiple pairwise comparisons among the treatment groups. These post-hoc tests would help identify which treatments have significantly different effects on anxiety levels and provide more detailed insights into the treatment efficacy. The choice of the specific post-hoc test would depend on factors such as the sample sizes, variance assumptions, and the number of comparisons being made.
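One simple way to carry out such pairwise comparisons is a set of two-sample t-tests with a Bonferroni correction, sketched below. The anxiety scores are hypothetical illustration data, and Bonferroni is used here only because it is easy to compute by hand; Tukey's HSD would typically be somewhat more powerful for all-pairs comparisons.

```python
from itertools import combinations
from scipy.stats import ttest_ind

# Hypothetical anxiety scores for the four treatments (illustration only)
treatments = {
    "A": [22, 25, 24, 27, 23, 26],
    "B": [18, 20, 19, 21, 17, 20],
    "C": [23, 24, 26, 25, 22, 27],
    "D": [15, 14, 17, 16, 13, 15],
}

alpha = 0.05
pairs = list(combinations(treatments, 2))  # 6 pairwise comparisons for 4 groups

results = []
for first, second in pairs:
    t_stat, p_raw = ttest_ind(treatments[first], treatments[second])
    # Bonferroni: multiply each raw p-value by the number of comparisons, capped at 1
    p_adj = min(p_raw * len(pairs), 1.0)
    results.append((first, second, t_stat, p_adj))
    verdict = "differ" if p_adj < alpha else "no significant difference"
    print(f"{first} vs {second}: t = {t_stat:.2f}, adjusted p = {p_adj:.4f} ({verdict})")
```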

### **Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.**

### ***ANSWER :***

In [6]:
import numpy as np
from scipy.stats import f_oneway

# Assuming you have weight loss data for each diet: A, B, and C
# Replace these with your actual data arrays for each diet

diet_A = [4.5, 5.2, 6.1, 4.9, 3.8, 5.5, 5.7, 4.3, 4.8, 5.9,
          5.2, 5.4, 4.7, 5.1, 5.3, 5.6, 4.9, 4.6, 5.0, 6.0,
          4.7, 5.8, 6.3, 5.4, 5.0, 5.2, 4.9, 5.6, 5.3, 5.0,
          6.2, 5.9, 5.7, 6.1, 5.3, 4.5, 5.0, 5.8, 5.6, 4.8,
          5.5, 5.3, 5.1, 5.4, 4.7, 4.6, 5.7, 6.0, 5.9, 5.2]

diet_B = [2.9, 3.5, 3.1, 2.8, 3.0, 3.2, 3.4, 3.6, 2.7, 2.5,
          3.1, 3.0, 2.8, 3.5, 3.2, 3.6, 3.4, 2.9, 3.3, 2.6,
          3.3, 2.7, 2.9, 3.1, 3.2, 3.4, 2.8, 3.0, 3.3, 3.5,
          2.6, 3.2, 3.4, 2.7, 2.9, 3.0, 3.1, 3.3, 3.6, 3.4,
          2.8, 3.0, 3.5, 3.2, 2.9, 2.7, 3.1, 2.6, 3.3, 2.8]

diet_C = [1.8, 2.1, 2.4, 1.9, 1.7, 2.2, 1.6, 2.0, 1.9, 1.5,
          2.3, 1.8, 2.5, 2.0, 2.1, 2.2, 2.4, 1.7, 1.6, 2.0,
          2.3, 2.2, 2.1, 1.9, 2.5, 2.0, 1.8, 2.3, 1.6, 1.7,
          1.9, 1.8, 2.1, 2.2, 1.6, 2.3, 2.5, 1.7, 2.0, 2.1,
          1.8, 2.2, 1.9, 2.0, 1.7, 1.6, 2.3, 2.5, 1.9, 2.1]

# Combine the data into a single array (not needed by f_oneway itself,
# but convenient for follow-up analyses such as post-hoc tests)
data = np.concatenate([diet_A, diet_B, diet_C])

# Create the matching group label ('A', 'B', or 'C') for every observation
group_labels = ['A'] * len(diet_A) + ['B'] * len(diet_B) + ['C'] * len(diet_C)

# Perform the one-way ANOVA
f_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)

print("F-statistic:", f_statistic)
print("p-value:", p_value)


F-statistic: 894.5786885963022
p-value: 5.101359628292405e-83


**Interpretation of results:**
***The F-statistic is about 894.58 and the p-value is about 5.1e-83, far below the chosen significance level (e.g., 0.05). We therefore reject the null hypothesis and conclude that mean weight loss differs between at least two of the diets (A, B, or C). Had the p-value been greater than 0.05, we would have failed to reject the null hypothesis, meaning there would be insufficient evidence of a difference in mean weight loss. The ANOVA alone does not show which diets differ; a post-hoc test is needed for that.***

### **Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs.experienced). Report the F-statistics and p-values, and interpret the results.**

### ***ANSWER :***

In [7]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Assume you have data for software programs (A, B, C), employee experience (novice vs. experienced),
# and the time it takes each employee to complete the task
# Replace these with your actual data arrays for each variable

software = ['A', 'B', 'C'] * 10
experience = ['Novice'] * 15 + ['Experienced'] * 15
time_taken = [12, 10, 14, 11, 9, 13, 15, 16, 11, 12,
              8, 10, 9, 11, 12, 14, 15, 16, 12, 13,
              14, 12, 15, 13, 11, 9, 10, 12, 13, 11]

df = pd.DataFrame({'software': software, 'experience': experience, 'time_taken': time_taken})

# Fit a two-way model with both main effects and their interaction;
# C(...) marks a column as categorical
model = ols('time_taken ~ C(software) * C(experience)', data=df).fit()

# ANOVA table with one row per effect. The design is balanced (5 employees
# per cell), so Type I and Type II sums of squares agree.
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)


`time_taken` is the time each employee needs to complete the task, `software` is the program (A, B, or C) each employee was assigned to, and `experience` is the employee's experience level (novice or experienced).

The two-way ANOVA fits a single linear model containing both factors and their interaction. In the resulting table, the rows `C(software)` and `C(experience)` hold the F-statistic and p-value for each main effect, and the row `C(software):C(experience)` holds the test of the interaction. Running separate one-way ANOVAs with `f_oneway` on slices of the data does not give these tests, because each one-way test ignores the other factor and cannot isolate the interaction.

**Interpretation of results:**

- If the p-value for the main effect of Software Programs is less than the chosen significance level (e.g., 0.05), we can reject the null hypothesis and conclude that the average time taken to complete the task differs between at least two of the software programs.
- If the p-value for the main effect of Employee Experience is less than the chosen significance level, we can reject the null hypothesis and conclude that the average time taken differs between novice and experienced employees.
- If the p-value for the interaction effect between Software Programs and Employee Experience is less than the chosen significance level, we can reject the null hypothesis and conclude that there is a significant interaction: the effect of the software program depends on the experience level, so the combined effect differs from the sum of the individual effects.
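To make the idea of an interaction concrete, the sketch below computes the cell means for the `time_taken` data from the code above and compares them with what a purely additive model would predict. It relies on the layout of that data (software cycles A, B, C; the first 15 values are novices) and only describes the pattern; it does not test significance.

```python
import numpy as np

# Same time_taken values as in the code above: software cycles A, B, C and the
# first 15 observations are novices, the last 15 experienced
time_taken = np.array([12, 10, 14, 11, 9, 13, 15, 16, 11, 12,
                       8, 10, 9, 11, 12, 14, 15, 16, 12, 13,
                       14, 12, 15, 13, 11, 9, 10, 12, 13, 11], dtype=float)

# Position = experience * 15 + replicate * 3 + software, so reshape accordingly
cells = time_taken.reshape(2, 5, 3)   # axes: (experience, replicate, software)
cell_means = cells.mean(axis=1)       # 2 x 3 table: rows novice/experienced, columns A/B/C

grand_mean = time_taken.mean()
experience_effect = cell_means.mean(axis=1, keepdims=True) - grand_mean
software_effect = cell_means.mean(axis=0, keepdims=True) - grand_mean

# Cell means expected if the two factors simply added up (no interaction)
additive_prediction = grand_mean + experience_effect + software_effect

# What is left over is the interaction pattern
interaction_deviation = cell_means - additive_prediction

print("Cell means (rows: novice, experienced; columns: A, B, C):")
print(cell_means)
print("Deviation from the additive prediction:")
print(np.round(interaction_deviation, 3))
```

Deviations close to zero in every cell mean the two factors act additively; the two-way ANOVA's interaction F-test judges whether the observed deviations are larger than random variation would produce.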

### **Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.**

### ***ANSWER :***

In [8]:
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Assume you have data for test scores in the control and experimental groups
# Replace these with your actual data arrays for each group

control_group = [85, 78, 90, 72, 88, 80, 76, 82, 75, 81,
                 77, 79, 86, 83, 89, 87, 80, 84, 78, 81,
                 84, 79, 75, 82, 86, 80, 77, 78, 79, 83,
                 81, 85, 88, 80, 82, 75, 78, 76, 81, 77,
                 89, 83, 80, 86, 78, 84, 87, 85, 80, 83]

experimental_group = [88, 95, 92, 98, 90, 93, 89, 94, 96, 91,
                      90, 89, 92, 95, 91, 93, 96, 89, 90, 92,
                      93, 97, 94, 88, 91, 89, 96, 92, 90, 93,
                      95, 90, 94, 93, 91, 97, 88, 92, 94, 89,
                      96, 89, 93, 90, 92, 95, 97, 90, 91, 92]

# Perform the two-sample t-test
t_statistic, p_value = ttest_ind(control_group, experimental_group)

print("Two-Sample t-test:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Perform Tukey's HSD test for post-hoc multiple comparisons
data = np.array(control_group + experimental_group)
groups = np.array(['Control'] * len(control_group) + ['Experimental'] * len(experimental_group))
tukey_results = pairwise_tukeyhsd(data, groups)
print("\nTukey's HSD Test:")
print(tukey_results)


Two-Sample t-test:
t-statistic: -15.161451985791816
p-value: 1.9108030540809588e-27

Tukey's HSD Test:
   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower   upper  reject
---------------------------------------------------------
Control Experimental    10.84   0.0 9.4212 12.2588   True
---------------------------------------------------------


***Interpretation of results:***

- The two-sample t-test gives t ≈ -15.16 and p ≈ 1.9e-27, far below the chosen significance level (e.g., 0.05), so we reject the null hypothesis of equal mean test scores. The negative sign reflects that the control group, which is passed first to `ttest_ind`, has the lower mean.
- With only two groups there is just one pairwise comparison, so Tukey's HSD adds no new information to the t-test; it is shown because the question asks for a post-hoc step. It reports a mean difference of 10.84 points (experimental minus control) with a 95% confidence interval of 9.42 to 12.26, and `reject = True` for that comparison.
- These results can be reported as follows:

"The two-sample t-test indicates a significant difference in test scores between the control group (traditional teaching method) and the experimental group (new teaching method). Tukey's HSD test confirms that the experimental group scores significantly higher than the control group, by about 10.8 points on average."

### **Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.**

### ***ANSWER :***

In [10]:
import numpy as np
from scipy.stats import f_oneway
import pandas as pd

# Assume you have daily sales data for Store A, Store B, and Store C
# Replace these with your actual data arrays for each store

store_A_sales = [100, 110, 95, 105, 120, 130, 115, 105, 125, 105,
                 100, 105, 115, 120, 130, 110, 105, 120, 125, 105,
                 100, 115, 100, 110, 95, 105, 120, 130, 115, 105]

store_B_sales = [95, 100, 85, 90, 100, 110, 95, 105, 100, 110,
                 95, 100, 105, 110, 115, 90, 85, 100, 95, 100,
                 95, 100, 85, 90, 100, 110, 95, 105, 100, 110]

store_C_sales = [80, 85, 70, 75, 85, 90, 80, 85, 75, 90,
                 80, 85, 80, 85, 90, 70, 75, 80, 85, 75,
                 80, 85, 70, 75, 85, 90, 80, 85, 75, 90]

# Combine the data into a single array so it can be stored in a DataFrame
data = np.concatenate([store_A_sales, store_B_sales, store_C_sales])

# Create the matching store label for every observation
group_labels = ['Store A'] * len(store_A_sales) + ['Store B'] * len(store_B_sales) + ['Store C'] * len(store_C_sales)

# Create a DataFrame for easier analysis
df = pd.DataFrame({'Store': group_labels, 'Sales': data})

# Perform the one-way ANOVA
f_statistic, p_value = f_oneway(df[df['Store'] == 'Store A']['Sales'],
                                df[df['Store'] == 'Store B']['Sales'],
                                df[df['Store'] == 'Store C']['Sales'])

print("One-Way ANOVA:")
print("F-statistic:", f_statistic)
print("p-value:", p_value)


One-Way ANOVA:
F-statistic: 97.20196712476427
p-value: 6.65618393827822e-23


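***The one-way ANOVA gives F ≈ 97.20 with p ≈ 6.66e-23, far below the chosen significance level (e.g., 0.05), so we reject the null hypothesis and conclude that average daily sales differ between at least two of the retail stores. Note that `f_oneway` treats the three stores as independent groups. Because all three stores are measured on the same 30 days, the design is a repeated measures one: a repeated measures ANOVA with day as the subject removes day-to-day variation from the error term, and a post-hoc test is still needed to identify which stores differ.***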
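Because the same 30 days are recorded for every store, the repeated measures analysis the question asks for can be sketched by hand with NumPy and SciPy, treating each day as a "subject". This is a minimal sketch under stated assumptions: it does not check sphericity, and the post-hoc step uses paired t-tests with a Bonferroni correction as one possible choice. statsmodels also provides an `AnovaRM` class for the same kind of design.

```python
import numpy as np
from itertools import combinations
from scipy.stats import f, ttest_rel

store_A_sales = [100, 110, 95, 105, 120, 130, 115, 105, 125, 105,
                 100, 105, 115, 120, 130, 110, 105, 120, 125, 105,
                 100, 115, 100, 110, 95, 105, 120, 130, 115, 105]
store_B_sales = [95, 100, 85, 90, 100, 110, 95, 105, 100, 110,
                 95, 100, 105, 110, 115, 90, 85, 100, 95, 100,
                 95, 100, 85, 90, 100, 110, 95, 105, 100, 110]
store_C_sales = [80, 85, 70, 75, 85, 90, 80, 85, 75, 90,
                 80, 85, 80, 85, 90, 70, 75, 80, 85, 75,
                 80, 85, 70, 75, 85, 90, 80, 85, 75, 90]

# Rows are days (the repeated "subjects"), columns are stores A, B, C
sales = np.column_stack([store_A_sales, store_B_sales, store_C_sales]).astype(float)
n_days, n_stores = sales.shape
grand_mean = sales.mean()

# Partition the total variation into stores, days, and residual error
ss_total = np.sum((sales - grand_mean) ** 2)
ss_stores = n_days * np.sum((sales.mean(axis=0) - grand_mean) ** 2)
ss_days = n_stores * np.sum((sales.mean(axis=1) - grand_mean) ** 2)
ss_error = ss_total - ss_stores - ss_days

df_stores = n_stores - 1
df_error = (n_stores - 1) * (n_days - 1)
f_statistic = (ss_stores / df_stores) / (ss_error / df_error)
p_value = f.sf(f_statistic, df_stores, df_error)

print("Repeated Measures ANOVA:")
print("F-statistic:", f_statistic)
print("p-value:", p_value)

# Post-hoc: paired t-tests between stores, Bonferroni-adjusted
names = ["Store A", "Store B", "Store C"]
pairs = list(combinations(range(n_stores), 2))
for i, j in pairs:
    t_stat, p_raw = ttest_rel(sales[:, i], sales[:, j])
    p_adj = min(p_raw * len(pairs), 1.0)
    print(f"{names[i]} vs {names[j]}: t = {t_stat:.2f}, adjusted p = {p_adj:.3g}")
```

Compared with the one-way ANOVA, the only change in the F-ratio is the error term: variation between days (`ss_days`) is removed from it, which usually makes the test more sensitive when sales on the same day move together across stores.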