#Q1

Analysis of Variance (ANOVA) is a statistical technique used to compare means across two or more groups to determine if there are statistically significant differences between them. To use ANOVA effectively and ensure the validity of the results, certain assumptions must be met. These assumptions include:

1. Independence of Observations: The observations within each group or treatment level should be independent of each other. In other words, the data points should not be related or influenced by each other. Violation of this assumption can occur in various ways, such as when data points are collected from repeated measurements on the same subject without accounting for the correlation, which can lead to inflated significance levels.

2. Normally Distributed Data: The data within each group should follow a roughly normal distribution. This means that the values should be symmetrically distributed around the mean, with most data points falling close to the mean and fewer data points in the tails. Violations of this assumption can result in inaccurate p-values and confidence intervals. For example, if the data are heavily skewed or have outliers, ANOVA results may not be reliable.

3. Homogeneity of Variance (Homoscedasticity): The variance within each group should be roughly equal. This assumption is also known as homoscedasticity. Violations occur when the variability in one group is significantly different from the variability in another group. Unequal variances can lead to Type I errors (false positives) or Type II errors (false negatives). Visual inspection of a scatterplot of residuals or formal statistical tests like Levene's test can be used to check for homoscedasticity.

Examples of Violations and Their Impact on ANOVA Results:

1. Non-Normal Data: If the data within groups are not normally distributed, ANOVA results may be unreliable. For instance, if the data are heavily skewed, ANOVA can produce incorrect p-values, potentially leading to incorrect conclusions about group differences. Transformations like logarithmic or rank-based transformations may be applied to make the data more normal before conducting ANOVA.

2. Heteroscedasticity: When the variances within groups are unequal, ANOVA results may not accurately reflect group differences. In this case, the F-statistic used in ANOVA may not follow the expected distribution, leading to incorrect p-values. One solution is to use a modified ANOVA approach like Welch's ANOVA or to transform the data to stabilize the variances.

3. Lack of Independence: Violations of the independence assumption can occur in longitudinal or repeated measures designs, where measurements on the same subjects are correlated over time. Ignoring this correlation can lead to underestimated standard errors and inflated significance levels. To address this, techniques like repeated measures ANOVA or mixed-effects models should be used, accounting for the within-subject correlation.

In practice, it's essential to check these assumptions before relying on ANOVA results. If the assumptions are seriously violated, alternative non-parametric tests or data transformations may be considered, or researchers may need to reevaluate their study design and data collection methods to better meet these assumptions.
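
The normality and equal-variance checks described above can be sketched as follows. This is a minimal illustration, not a complete diagnostic workflow: the three groups are simulated placeholder data, and the seed, group means, and sample sizes are assumptions made only for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical example data: three groups simulated from normal distributions
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mean, scale=1.0, size=30) for mean in (5.0, 6.0, 4.5)]

# Shapiro-Wilk test per group: a small p-value suggests a departure from normality
shapiro_pvalues = [stats.shapiro(g).pvalue for g in groups]

# Levene's test across groups: a small p-value suggests unequal variances
levene_stat, levene_p = stats.levene(*groups)

for i, p in enumerate(shapiro_pvalues, start=1):
    print(f"Group {i} Shapiro-Wilk p-value: {p:.3f}")
print(f"Levene's test p-value: {levene_p:.3f}")
```

Formal tests like these are usually read alongside plots (for example a Q-Q plot of residuals), since with large samples they can flag deviations that are too small to matter.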

#Q2

Analysis of Variance (ANOVA) is a statistical technique used to compare means across two or more groups or treatments to determine if there are statistically significant differences among them. There are three main types of ANOVA, each suited for different situations:

1. One-Way ANOVA:
   - When to use: One-way ANOVA is used when you have one categorical independent variable (factor) with three or more levels (groups or treatments) and you want to determine if there are statistically significant differences in the means of a continuous dependent variable among these groups.
   - Example: You want to compare the mean test scores of students from three different schools to see if there is a statistically significant difference in performance.

2. Two-Way ANOVA:
   - When to use: Two-way ANOVA is used when you have two categorical independent variables (factors) and you want to investigate the effects of both factors simultaneously on a continuous dependent variable. It allows you to assess not only the main effects of each factor but also their interaction.
   - Example: You want to study the effects of both gender and educational level on students' test scores. Gender has two levels (male and female), and educational level has three levels (high school, undergraduate, and graduate).

3. Three-Way (and Higher) ANOVA:
   - When to use: Three-way ANOVA and higher-order ANOVAs are extensions of the two-way ANOVA and are used when you have more than two categorical independent variables (factors). These designs allow you to investigate the effects of multiple factors on a continuous dependent variable, including interactions among all the factors.
   - Example: You are conducting a study on the effects of temperature, humidity, and time of day on plant growth. Temperature has three levels (low, medium, high), humidity has two levels (low and high), and time of day has two levels (morning and afternoon).

In addition to these main types of ANOVA, there are variations and extensions of ANOVA techniques, such as repeated measures ANOVA (used for within-subject designs with repeated measurements), mixed-design ANOVA (combining between-subject and within-subject factors), and non-parametric versions of ANOVA for data that do not meet the assumptions of normality or homoscedasticity.

The choice of which type of ANOVA to use depends on your research design, the number of categorical factors you are examining, and the specific hypotheses you want to test. It's crucial to select the appropriate ANOVA design that matches your study's structure to ensure that you can draw meaningful conclusions from your data.
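
As one illustration of the non-parametric alternatives mentioned above, the Kruskal-Wallis test compares groups using ranks instead of raw values. The sketch below uses `scipy.stats.kruskal`; the school names and scores are hypothetical values chosen only to show the call, with one deliberately extreme observation.

```python
from scipy import stats

# Hypothetical scores for three schools (illustrative values only);
# the last value in school_c is an outlier that would distort a mean-based test
school_a = [1, 2, 3, 4, 5]
school_b = [6, 7, 8, 9, 10]
school_c = [11, 12, 13, 14, 50]

# Kruskal-Wallis H-test: works on ranks, so the outlier has limited influence
h_statistic, p_value = stats.kruskal(school_a, school_b, school_c)

print(f"H-statistic: {h_statistic}")
print(f"P-value: {p_value}")
```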

#Q3

The partitioning of variance in Analysis of Variance (ANOVA) is a fundamental concept that helps us understand how the total variation in a dataset is divided into different components or sources. This concept is important because it allows us to assess whether the differences between groups or treatments in an experiment are statistically significant or if they could have occurred by chance.

In simple terms, imagine you have a bunch of data points, and you want to know if there are real differences between groups. In a one-way ANOVA, the total variation in your data can be broken down into two main parts:

1. Between-Group Variation: This part of the variation measures how different the group means are from each other. If this variation is large relative to the within-group variation, it suggests that there are real differences between the groups, and the treatment or group membership has an effect.

2. Within-Group (Error) Variation: This part of the variation measures how much individual data points within each group vary from their group's mean. It is the leftover variation that group membership cannot explain, which is why it is also called error or residual variation. If it is relatively small, data points within the same group are similar to each other.

Together these satisfy: total variation = between-group variation + within-group variation.

Now, why is understanding this partitioning important?

- **Hypothesis Testing:** ANOVA helps us conduct hypothesis tests to determine if the differences between groups are statistically significant. If the between-group variation is much larger than what you'd expect by random chance (as indicated by the error variation), then you have evidence to support that the groups are indeed different due to the treatment or factors you're studying.

- **Effect Size:** By examining the proportion of the total variation that is explained by the between-group variation, you can get a sense of the size of the effect. Larger proportions indicate stronger effects.

- **Interpretation:** It helps you understand where the variability in your data is coming from. Are the differences mainly due to the treatment, or is it just random variability within and between groups?

In summary, partitioning of variance in ANOVA is important because it helps us quantify and understand the sources of variation in our data. It's a tool for assessing whether the differences we observe between groups are likely due to the factors we're studying or if they could have occurred by random chance.
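
The decomposition and the effect-size idea can be checked numerically. The sketch below is an illustration under assumed data: the three small groups are made-up values, and eta-squared (between-group sum of squares divided by total sum of squares) is used as one common way to express the proportion of variation explained.

```python
import numpy as np

# Illustrative data: three groups of five observations each
groups = [
    np.array([10, 12, 14, 16, 18]),
    np.array([20, 22, 24, 26, 28]),
    np.array([30, 32, 34, 36, 38]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group variation: group means around the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group variation: observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Total variation: observations around the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()

# Eta-squared: share of total variation attributable to group membership
eta_squared = ss_between / ss_total

print(f"Between + within = {ss_between + ss_within}, total = {ss_total}")
print(f"Eta-squared: {eta_squared:.3f}")
```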

In [1]:
#Q4

import numpy as np
from scipy import stats

# Sample data for each group
group1 = np.array([10, 12, 14, 16, 18])
group2 = np.array([20, 22, 24, 26, 28])
group3 = np.array([30, 32, 34, 36, 38])

# Combine all data into a single array
data = np.concatenate([group1, group2, group3])

# Calculate the grand mean
grand_mean = np.mean(data)

# Calculate the Total Sum of Squares (SST)
sst = np.sum((data - grand_mean) ** 2)

# Calculate the Explained Sum of Squares (SSE)
group_means = [np.mean(group1), np.mean(group2), np.mean(group3)]
sse = np.sum([(np.mean(group) - grand_mean) ** 2 * len(group) for group in [group1, group2, group3]])

# Calculate the Residual Sum of Squares (SSR)
ssr = sst - sse

# Calculate degrees of freedom
df_total = len(data) - 1
df_group = len(group_means) - 1
df_error = df_total - df_group

# Calculate mean squares
mse = ssr / df_error

# Calculate F-statistic
f_statistic = (sse / df_group) / mse

# Calculate p-value
p_value = 1 - stats.f.cdf(f_statistic, df_group, df_error)

print(f"Total Sum of Squares (SST): {sst}")
print(f"Explained Sum of Squares (SSE): {sse}")
print(f"Residual Sum of Squares (SSR): {ssr}")
print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")


Total Sum of Squares (SST): 1120.0
Explained Sum of Squares (SSE): 1000.0
Residual Sum of Squares (SSR): 120.0
F-statistic: 50.0
P-value: 1.5127924217761546e-06
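
As a cross-check on the manual calculation, `scipy.stats.f_oneway` should give the same F-statistic for the same three groups. This is only a sanity check; the variable names `f_check` and `p_check` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Same sample data as in the manual calculation
group1 = np.array([10, 12, 14, 16, 18])
group2 = np.array([20, 22, 24, 26, 28])
group3 = np.array([30, 32, 34, 36, 38])

# Built-in one-way ANOVA
f_check, p_check = stats.f_oneway(group1, group2, group3)

print(f"F-statistic (f_oneway): {f_check}")
print(f"P-value (f_oneway): {p_check}")
```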


#Q5

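import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Placeholder data: replace with your own factor levels and measurements
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'FactorA': np.repeat(['a1', 'a2'], 12),
    'FactorB': np.tile(['b1', 'b2', 'b3'], 8),
    'Y': rng.normal(10, 2, 24),
})

# Fit a model with both main effects and their interaction
# (the formula separator is "~", not "-")
model = ols("Y ~ FactorA * FactorB", data=df).fit()

# Type II ANOVA table (the keyword argument is `typ`, not `type`)
anova_table = sm.stats.anova_lm(model, typ=2)

# p-values for the main effects and the interaction
main_effect_A = anova_table.loc['FactorA', 'PR(>F)']
main_effect_B = anova_table.loc['FactorB', 'PR(>F)']
interaction_effect = anova_table.loc['FactorA:FactorB', 'PR(>F)']

print(f"Main effect A p-value: {main_effect_A}")
print(f"Main effect B p-value: {main_effect_B}")
print(f"Interaction effect p-value: {interaction_effect}")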

#Q6

In a one-way ANOVA, the F-statistic is used to test whether there are statistically significant differences between the means of three or more groups. The p-value associated with the F-statistic indicates the probability of obtaining such an F-statistic if there were no real differences between the groups (i.e., if the null hypothesis were true).

In your scenario:

- F-statistic: 5.23
- p-value: 0.02

Here's how to interpret these results:

1. **Null Hypothesis (H0):** The null hypothesis in a one-way ANOVA typically states that there are no significant differences between the group means, i.e., all group means are equal.

2. **Alternative Hypothesis (Ha):** The alternative hypothesis suggests that there is at least one group mean that is significantly different from the others.

Based on the p-value:

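- If p-value ≤ α (the chosen significance level, often 0.05), you reject the null hypothesis (H0). This indicates that there is sufficient evidence to conclude that at least one group mean is different from the others.

- If p-value > α, you fail to reject the null hypothesis. The data do not provide sufficient evidence of a difference between the group means, which is not the same as showing that the means are equal.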

In your case, with a p-value of 0.02 (which is less than the common significance level of 0.05), you would reject the null hypothesis. Therefore, you can conclude:

**Conclusion:** There are statistically significant differences between at least two of the groups. In other words, the means of at least two groups are not equal.

However, the ANOVA test itself doesn't tell you which specific group(s) differ from the others. To identify which group(s) are different, you may need to perform post-hoc tests (e.g., Tukey's HSD or Bonferroni correction) or conduct pairwise comparisons between groups.

In summary, an F-statistic of 5.23 with a p-value of 0.02 indicates that there are significant differences among the groups, but additional tests are needed to determine which specific group(s) differ from one another.
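
The link between an F-statistic and its p-value depends on the degrees of freedom, which the scenario does not state. The sketch below therefore assumes hypothetical values (3 groups and 15 observations, giving 2 and 12 degrees of freedom) purely to show how the p-value and the critical value are obtained from the F distribution.

```python
from scipy import stats

f_statistic = 5.23
alpha = 0.05

# Hypothetical degrees of freedom (not given in the scenario):
# 3 groups -> 2 between-group df; 15 observations -> 12 within-group df
df_between = 2
df_within = 12

# Upper-tail probability of the F distribution at the observed statistic
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value: the F-statistic must exceed this to reject H0 at level alpha
critical_value = stats.f.ppf(1 - alpha, df_between, df_within)

print(f"P-value: {p_value:.4f}")
print(f"Critical F at alpha={alpha}: {critical_value:.3f}")
print(f"Reject H0: {f_statistic > critical_value}")
```

With other degrees of freedom the same F-statistic would map to a different p-value, so both should be reported together.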

#Q7

Handling missing data in a repeated measures ANOVA is essential to ensure the validity of your analysis. Missing data can arise in various ways, such as participants dropping out, incomplete responses, or technical issues during data collection. How you handle missing data can have significant consequences on your results and conclusions. Here are common methods for handling missing data and their potential consequences:

1. **Listwise Deletion (Complete Case Analysis):**
   - This method involves removing cases with any missing data from the analysis.
   - Pros:
     - Simple and straightforward.
     - Gives unbiased estimates when the data are missing completely at random (MCAR).
   - Cons:
     - Reduces statistical power since it discards data.
     - May introduce bias if the data is not missing completely at random (MCAR). If there is a systematic pattern to the missing data, the analysis may produce biased results.

2. **Mean Imputation:**
   - Missing values are replaced with the mean of the available values for that variable.
   - Pros:
     - Preserves the sample size.
     - Simple and easy to implement.
   - Cons:
     - Can underestimate the variability and distort relationships between variables.
     - Does not reflect the true distribution of the missing data.
     - Assumes that missing values are missing completely at random (MCAR).

3. **Interpolation or LOCF (Last Observation Carried Forward):**
   - Missing values are replaced with the value of the previous observation for that variable.
   - Pros:
     - Preserves the sample size.
     - Suitable for data with a temporal or sequential structure.
   - Cons:
     - Assumes that the missing values follow a linear or constant trend.
     - Can introduce bias if the missing data are not monotonic (e.g., fluctuations over time).

4. **Multiple Imputation:**
   - Multiple imputation involves creating several complete datasets with different imputed values for missing data. The analysis is performed separately on each dataset, and results are combined.
   - Pros:
     - Preserves the sample size and accounts for uncertainty due to missing data.
     - Valid when data are missing at random (MAR), a weaker assumption than MCAR.
   - Cons:
     - Requires more complex statistical software and procedures.
     - May introduce variability in results due to imputation.

5. **Model-Based Imputation:**
   - Impute missing values based on a statistical model, such as regression or mixed-effects models, that incorporates relationships between variables.
   - Pros:
     - Can provide more accurate imputations than mean or LOCF methods.
   - Cons:
     - Requires advanced statistical expertise and software.
     - May be sensitive to model assumptions.

The choice of method should be based on the nature of the missing data and the assumptions you are willing to make about the missing data mechanism (MCAR, MAR, or MNAR). It is essential to report the method used and consider sensitivity analyses to assess the robustness of your results under different missing data assumptions. Additionally, consulting with a statistician or data analyst with expertise in missing data handling is advisable for more complex cases.
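
The three simplest strategies above (listwise deletion, mean imputation, and LOCF) can be sketched with pandas. The data frame is a hypothetical wide-format layout, one row per subject and one column per time point, with made-up scores; the names `scores`, `listwise`, `mean_imputed`, and `locf` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated-measures data: rows are subjects, columns are time points
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 14.0, 16.0],
        "t2": [11.0, np.nan, 15.0, 17.0],
        "t3": [12.0, 14.0, np.nan, 18.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Listwise deletion: drop every subject with at least one missing value
listwise = scores.dropna()

# Mean imputation: fill each gap with the mean of that time point
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each subject's previous observation forward along the time axis
locf = scores.ffill(axis=1)

print(f"Subjects kept after listwise deletion: {list(listwise.index)}")
print(mean_imputed)
print(locf)
```

Here listwise deletion halves the sample, which illustrates the loss of power noted above, while the two imputation methods keep all four subjects at the cost of understating variability.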

#Q8

Common post-hoc tests are used after conducting an Analysis of Variance (ANOVA) to make pairwise comparisons between groups and determine which specific groups differ significantly from each other when the overall ANOVA test indicates a significant difference. Here are some common post-hoc tests and when you might use each one:

1. **Tukey's Honestly Significant Difference (Tukey HSD):**
   - Use Tukey's HSD when you have conducted a one-way ANOVA with three or more groups.
   - Tukey's HSD controls the family-wise error rate, which means it helps to maintain the overall Type I error rate at a desired level (usually 0.05).
   - Example: You have conducted a one-way ANOVA to compare the performance of four different teaching methods on student test scores. Tukey's HSD can be used to identify which pairs of teaching methods have significantly different mean test scores.

2. **Bonferroni Correction:**
   - Use the Bonferroni correction when you have conducted multiple pairwise comparisons, and you want to control the family-wise error rate at a specific level (e.g., 0.05).
   - It is more conservative than Tukey's HSD and can be applied to a wide range of post-hoc tests.
   - Example: After a two-way ANOVA with multiple groups in each factor, you want to compare the means of different combinations of levels of Factor A and Factor B. You apply Bonferroni correction to control for multiple comparisons.

3. **Sidak Correction:**
   - Similar to Bonferroni, the Sidak correction is used to control the family-wise error rate but is slightly less conservative.
   - It can be a good compromise between the strictness of Bonferroni and the lack of control in unadjusted tests.
   - Example: You have conducted a repeated measures ANOVA with multiple dependent variables, and you want to compare the means of different conditions across these variables. You apply Sidak correction for multiple comparisons.

4. **Duncan's Multiple Range Test (MRT):**
   - Duncan's MRT is used when you have conducted a one-way ANOVA, and you want to compare all possible pairs of groups.
   - It does not control the family-wise error rate, so it can be less conservative than Tukey's HSD.
   - Example: You have conducted a one-way ANOVA to compare the yield of several different fertilizer treatments. Duncan's MRT can help you identify which specific fertilizer treatments lead to significantly different yields.

5. **Scheffé's Test:**
   - Scheffé's test is a very conservative post-hoc test that can be used in situations where other tests are not appropriate.
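   - It can be applied when sample sizes are unequal and covers all possible contrasts, not only pairwise comparisons. Like the ANOVA itself, it assumes roughly equal variances; when variances are unequal, the Games-Howell test is a common choice.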
   - Example: You have conducted a one-way ANOVA with groups of unequal sizes and want to compare the means. Scheffé's test can provide a more robust approach in such cases.

When to use a particular post-hoc test depends on your specific research design, the assumptions you can make, and your goals regarding controlling Type I errors. Always choose a post-hoc test that aligns with the characteristics of your data and research question to make valid pairwise comparisons.
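
To make the difference between the Bonferroni and Sidak corrections concrete, the per-comparison significance levels can be computed directly. The sketch assumes a hypothetical design with four groups and all pairwise comparisons; the variable names are illustrative.

```python
from math import comb

alpha = 0.05   # desired family-wise error rate
n_groups = 4   # hypothetical number of groups

# Number of pairwise comparisons among the groups
n_comparisons = comb(n_groups, 2)

# Bonferroni: divide alpha evenly across comparisons
bonferroni_alpha = alpha / n_comparisons

# Sidak: exact per-comparison level if the comparisons were independent
sidak_alpha = 1 - (1 - alpha) ** (1 / n_comparisons)

print(f"Comparisons: {n_comparisons}")
print(f"Bonferroni per-comparison alpha: {bonferroni_alpha:.6f}")
print(f"Sidak per-comparison alpha: {sidak_alpha:.6f}")
```

The Sidak threshold is slightly larger than the Bonferroni one, which is what "slightly less conservative" means in practice.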

In [4]:
#Q9

import numpy as np
import scipy.stats as stats

diet_A = np.random.normal(5,1,50)
diet_B = np.random.normal(6,1,50)
diet_C = np.random.normal(4.5,1,50)

data = np.concatenate([diet_A, diet_B, diet_C])
groups = ["A"]*50 + ["B"]*50 + ["C"]*50

f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

print(f"f_statistic: {f_statistic}")
print(f"p_value: {p_value}")

alpha = 0.05

if p_value < alpha:
    print("There is a statistically significant difference between the mean weight loss of the three diets.")
else:
    print("There is no statistically significant difference between the mean weight loss of the three diets.")

f_statistic: 18.05660489664189
p_value: 9.725858758985733e-08
There is a statistically significant difference between the mean weight loss of the three diets.


In [7]:
#Q10

import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = {
    "Software": np.repeat(['A', 'B', 'C'], 10),
    "Experience": np.tile(["Novice", "Experienced"], 15),
    "Time": np.random.normal(30, 5, 30)
}
df = pd.DataFrame(data)

model = ols("Time ~ Software * Experience", data = df).fit()

anova_table = sm.stats.anova_lm(model, typ = 2)
print(anova_table)

                         sum_sq    df         F    PR(>F)
Software              31.212494   2.0  0.838069  0.444814
Experience            28.533618   1.0  1.532281  0.227744
Software:Experience    9.440477   2.0  0.253481  0.778146
Residual             446.919985  24.0       NaN       NaN


In [8]:
#Q11

import numpy as np
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import pandas as pd
# Simulated test scores (replace with your actual data)
control_group = np.random.normal(75, 10, 50)  # Mean score: 75, standard deviation: 10
experimental_group = np.random.normal(80, 10, 50)  # Mean score: 80, standard deviation: 10
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)
alpha = 0.05  # Set the significance level

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

if p_value < alpha:
    print("There is a statistically significant difference in test scores between the two groups.")
else:
    print("There is no statistically significant difference in test scores between the two groups.")


T-statistic: -1.6926218630938965
P-value: 0.09370444626477131
There is no statistically significant difference in test scores between the two groups.


In [12]:
#Q12

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Example data (replace with your actual sales data)
data = {
    'Store': ['A', 'B', 'C'] * 10,  # Three stores, 10 measurements each (30 in total)
    'Day': np.repeat(range(1, 11), 3),  # 10 days with repeated measurements
    'Sales': np.random.randint(50, 100, 30)  # Simulated sales data
}

# Create a DataFrame
sales_data = pd.DataFrame(data)

# Perform a repeated measures ANOVA
rm = AnovaRM(data=sales_data, depvar='Sales', subject='Day', within=['Store'])
results = rm.fit()

# Print the ANOVA table
print(results.summary())

# Perform Tukey's HSD post-hoc test
# (note: this treats the observations as independent and ignores the pairing by day)
posthoc = pairwise_tukeyhsd(endog=sales_data['Sales'], groups=sales_data['Store'], alpha=0.05)

# Print the post-hoc test results
print(posthoc)


               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store  0.0139 2.0000 18.0000 0.9862

 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     A      B      0.7 0.9935 -15.2068 16.6068  False
     A      C     -0.4 0.9979 -16.3068 15.5068  False
     B      C     -1.1 0.9839 -17.0068 14.8068  False
-----------------------------------------------------
