In [1]:
#1...
"""ANOVA (Analysis of Variance) relies on several key assumptions to ensure the validity of the results. If these assumptions are violated, the test may lead to inaccurate conclusions. Below are the assumptions and examples of potential violations:

### 1. **Independence of Observations**
   - **Assumption**: Each observation must be independent, meaning the value of one observation does not influence or relate to another.
   - **Example of Violation**: If the data points are from repeated measures on the same subject or from groups where individuals influence each other (e.g., family members or classmates), the independence assumption is violated.

### 2. **Homogeneity of Variances (Equal Variances)**
   - **Assumption**: The variances of the groups being compared should be approximately equal.
   - **Example of Violation**: If one group has much more variability than another, the assumption of homogeneity of variances is violated. This can lead to misleading F-statistics and inflated Type I error rates. For example, comparing the exam scores of students from different schools where one school has a much wider range of student abilities could violate this assumption.

### 3. **Normality of Residuals**
   - **Assumption**: The residuals (differences between observed and predicted values) should be normally distributed.
   - **Example of Violation**: If the residuals show skewness or heavy tails (i.e., are not normally distributed), this assumption is violated. This can happen, for example, if the data contain extreme outliers or are heavily skewed. Non-normality can impact the validity of the F-test, especially with small sample sizes.

### 4. **Random Sampling**
   - **Assumption**: The data should be collected through random sampling to ensure generalizability.
   - **Example of Violation**: If a biased sampling method is used (e.g., selecting only students from a single, high-performing school when studying general education levels), the results may not be valid for the entire population.

### 5. **Additivity and Linearity**
   - **Assumption**: The relationship between the dependent variable and independent factors should be linear and additive.
   - **Example of Violation**: If the factors interact but the model includes only main effects, or if the response depends non-linearly (e.g., quadratically or exponentially) on a continuous predictor, the assumption is violated. This could lead to an inaccurate interpretation of group differences.

### Impact of Violations:
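- **Independence**: Positively correlated observations (e.g., clustered or repeated measurements) typically make the within-group variability look smaller than it is, so standard errors are underestimated and the Type I error rate rises.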
- **Homogeneity of Variance**: If the variances are not equal, it could lead to inflated Type I errors (false positives), especially with unbalanced groups (i.e., groups of unequal sizes).
- **Normality of Residuals**: Violating normality can make the F-test less reliable, particularly for small samples, as ANOVA relies on normal distribution theory.
- **Random Sampling**: Lack of random sampling affects the generalizability of the results.
- **Additivity and Linearity**: Misinterpreting the relationship between variables can lead to incorrect conclusions about the significance of group differences.

### Remedies for Violations:
- **Transformations**: Applying data transformations (e.g., logarithmic or square root) can help address violations of normality and homogeneity of variance.
- **Non-parametric alternatives**: If assumptions are severely violated, consider a non-parametric test such as the Kruskal-Wallis test, which does not assume normality (although interpreting it as a comparison of medians still presumes similarly shaped distributions across groups).
- **Use of Robust Methods**: Robust variants such as Welch's ANOVA do not assume equal variances, so they remain usable under heteroscedasticity (unequal variances)."""

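The homogeneity and normality assumptions listed above can be checked numerically before running ANOVA. The following is a minimal sketch, not a definitive workflow: the three groups are hypothetical simulated data (the third is given a deliberately larger spread), and the choice of Levene's test, the Shapiro-Wilk test, and Kruskal-Wallis as the fallback is an assumption made for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups, the third with a deliberately larger spread
rng = np.random.default_rng(0)
group_1 = rng.normal(loc=50, scale=5, size=40)
group_2 = rng.normal(loc=52, scale=5, size=40)
group_3 = rng.normal(loc=55, scale=15, size=40)
groups = [group_1, group_2, group_3]

# Homogeneity of variances: Levene's test (null hypothesis: equal variances)
levene_stat, levene_p = stats.levene(*groups)

# Normality of residuals: subtract each group's mean, then run Shapiro-Wilk
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Rank-based alternative that does not assume normality
kruskal_stat, kruskal_p = stats.kruskal(*groups)

print(f"Levene p-value:         {levene_p:.4f}")
print(f"Shapiro-Wilk p-value:   {shapiro_p:.4f}")
print(f"Kruskal-Wallis p-value: {kruskal_p:.4f}")
```

A small Levene or Shapiro-Wilk p-value suggests the corresponding assumption is doubtful; with large samples these tests can flag trivial departures, so plots of the residuals are a useful complement.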

In [2]:
#2...
"""There are three main types of ANOVA (Analysis of Variance), and each is used in different situations depending on the number of factors and the relationships between them. Here's an overview of the three types and their typical use cases:
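    1. One-way ANOVA: compares the means of three or more independent groups defined by a single factor.
    2. Two-way ANOVA: examines two factors at once, including their interaction.
    3. Repeated measures ANOVA: used when the same subjects are measured under every condition or at several time points."""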

In [1]:
#9...
import pandas as pd
import numpy as np
from scipy import stats

# Sample weight loss data (replace this with actual data)
# Data for each diet group
np.random.seed(0)  # for reproducibility
diet_A = np.random.normal(loc=5.0, scale=1.5, size=50)  # diet A
diet_B = np.random.normal(loc=6.0, scale=1.5, size=50)  # diet B
diet_C = np.random.normal(loc=5.5, scale=1.5, size=50)  # diet C

# Create a DataFrame
data = pd.DataFrame({
    'Diet_A': diet_A,
    'Diet_B': diet_B,
    'Diet_C': diet_C
})

# Perform the one-way ANOVA test
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Output the F-statistic and p-value
f_statistic, p_value


(3.655905931471352, 0.02821532555294891)

In [2]:
#10...
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data generation
np.random.seed(0)
# 30 employees with equal representation in each group
n_employees = 30

# Factors
software = ['A']*10 + ['B']*10 + ['C']*10
experience = ['novice']*5 + ['experienced']*5  # 5 novice, 5 experienced per program

# Simulated time to complete the task (replace with actual data)
time_A_novice = np.random.normal(loc=20, scale=5, size=5)
time_A_experienced = np.random.normal(loc=18, scale=5, size=5)
time_B_novice = np.random.normal(loc=22, scale=5, size=5)
time_B_experienced = np.random.normal(loc=19, scale=5, size=5)
time_C_novice = np.random.normal(loc=21, scale=5, size=5)
time_C_experienced = np.random.normal(loc=20, scale=5, size=5)

# Combine the data
times = np.concatenate([time_A_novice, time_A_experienced, time_B_novice, 
                        time_B_experienced, time_C_novice, time_C_experienced])

# Create a DataFrame
data = pd.DataFrame({
    'Software': software,
    'Experience': experience*3,  # Repeating for each software
    'Time': times
})

# Two-way ANOVA
model = ols('Time ~ C(Software) * C(Experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Output the ANOVA table
anova_table


                               sum_sq    df         F    PR(>F)
C(Software)                  8.947992   2.0  0.152926  0.859023
C(Experience)              164.553588   1.0  5.624627  0.026076
C(Software):C(Experience)  102.162565   2.0  1.746016  0.195908
Residual                   702.141835  24.0       NaN       NaN


In [3]:
#11..
import numpy as np
import pandas as pd

# Example data: groups and values
data = {'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
        'Value': [5, 7, 6, 8, 10, 9, 10, 12, 11]}

df = pd.DataFrame(data)

# Overall mean
overall_mean = df['Value'].mean()

# Group means
group_means = df.groupby('Group')['Value'].mean()

# SST (Total Sum of Squares)
sst = np.sum((df['Value'] - overall_mean) ** 2)

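# SSE (Explained Sum of Squares): between-group variation,
# i.e. group size times the squared distance of each group mean from the overall mean
group_sizes = df.groupby('Group')['Value'].size()
sse = np.sum(group_sizes * (group_means - overall_mean) ** 2)

# SSR (Residual Sum of Squares): within-group variation,
# i.e. squared distance of each value from its own group mean
ssr = np.sum((df['Value'] - df.groupby('Group')['Value'].transform('mean')) ** 2)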

# Print results
print(f"SST (Total Sum of Squares): {sst}")
print(f"SSE (Explained Sum of Squares): {sse}")
print(f"SSR (Residual Sum of Squares): {ssr}")


SST (Total Sum of Squares): 44.0
SSE (Explained Sum of Squares): 38.0
SSR (Residual Sum of Squares): 6.0
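The sums of squares above lead directly to the F-statistic: each is divided by its degrees of freedom to give a mean square, and F is the ratio of the two mean squares. The sketch below reuses the same nine values; the variable names (`ss_between`, `ss_within`, `f_manual`) are chosen here for illustration, and the result is compared against `scipy.stats.f_oneway` as a sanity check.

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Value': [5, 7, 6, 8, 10, 9, 10, 12, 11],
})

overall_mean = df['Value'].mean()
# Group mean repeated on every row of that group
row_group_means = df.groupby('Group')['Value'].transform('mean')

ss_between = ((row_group_means - overall_mean) ** 2).sum()  # explained (SSE above)
ss_within = ((df['Value'] - row_group_means) ** 2).sum()    # residual (SSR above)

k = df['Group'].nunique()   # number of groups
n = len(df)                 # number of observations
df_between = k - 1
df_within = n - k

# F = mean square between / mean square within
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

# Cross-check with SciPy's one-way ANOVA
samples = [g['Value'].to_numpy() for _, g in df.groupby('Group')]
f_scipy, p_scipy = stats.f_oneway(*samples)

print(f"Manual: F = {f_manual:.4f}, p = {p_manual:.6f}")
print(f"SciPy:  F = {f_scipy:.4f}, p = {p_scipy:.6f}")
```

With SSE = 38 on 2 degrees of freedom and SSR = 6 on 6 degrees of freedom, the mean squares are 19 and 1, so F = 19.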


In [4]:
#12...
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulating data: test scores for control and experimental groups
np.random.seed(0)  # For reproducibility
control_group = np.random.normal(loc=70, scale=10, size=50)  # Traditional method
experimental_group = np.random.normal(loc=75, scale=10, size=50)  # New method

# Two-sample t-test
t_stat, p_value = stats.ttest_ind(control_group, experimental_group)

# Print t-test results
print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")

# Interpret the t-test result
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference between the two groups.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference between the two groups.")

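# If significant, follow up with post-hoc test (Tukey's HSD).
# With only two groups this repeats the comparison the t-test already made;
# post-hoc tests become informative when there are three or more groups.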
if p_value < alpha:
    # Combine the data into a single array and create a group label array
    data = np.concatenate([control_group, experimental_group])
    groups = ['Control'] * len(control_group) + ['Experimental'] * len(experimental_group)

    # Create a dataframe for the post-hoc test
    df = pd.DataFrame({'Score': data, 'Group': groups})

    # Tukey's HSD post-hoc test
    tukey = pairwise_tukeyhsd(endog=df['Score'], groups=df['Group'], alpha=alpha)

    # Print the results of the post-hoc test
    print(tukey)


t-statistic: -1.6677351961320235
p-value: 0.09856078338184605
Fail to reject the null hypothesis. There is no significant difference between the two groups.
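With exactly two groups, the pooled-variance t-test and a one-way ANOVA are the same test: the F-statistic equals the square of the t-statistic, and the p-values agree. The sketch below illustrates this with the same simulated scores as the cell above; it assumes the default equal-variance form of `stats.ttest_ind`.

```python
import numpy as np
from scipy import stats

# Same simulated scores as in the t-test cell
np.random.seed(0)
control_group = np.random.normal(loc=70, scale=10, size=50)
experimental_group = np.random.normal(loc=75, scale=10, size=50)

# Pooled-variance (equal_var=True by default) two-sample t-test
t_stat, t_p = stats.ttest_ind(control_group, experimental_group)

# One-way ANOVA on the same two groups
f_stat, f_p = stats.f_oneway(control_group, experimental_group)

print(f"t^2 = {t_stat ** 2:.6f}, F = {f_stat:.6f}")
print(f"t-test p = {t_p:.6f}, ANOVA p = {f_p:.6f}")
```

This is why a post-hoc test adds nothing in the two-group case: there is only one pairwise comparison, and the t-test has already made it.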


In [5]:
#13..
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulate daily sales data for three stores across 30 days
np.random.seed(0)
store_a_sales = np.random.normal(loc=200, scale=10, size=30)
store_b_sales = np.random.normal(loc=210, scale=10, size=30)
store_c_sales = np.random.normal(loc=205, scale=10, size=30)

# Create a dataframe for the repeated measures ANOVA
data = pd.DataFrame({
    'Day': np.tile(np.arange(1, 31), 3),
    'Store': np.repeat(['A', 'B', 'C'], 30),
    'Sales': np.concatenate([store_a_sales, store_b_sales, store_c_sales])
})

# Perform repeated measures ANOVA
# Each day is a "subject", and each store is a repeated measure
rm_anova = AnovaRM(data, 'Sales', 'Day', within=['Store'])
anova_results = rm_anova.fit()

# Print the ANOVA results
print(anova_results)

# If the ANOVA is significant, follow up with a post-hoc test
alpha = 0.05
if anova_results.anova_table['Pr > F'].iloc[0] < alpha:
    print("The repeated measures ANOVA is significant. Performing post-hoc test...")
    
    # Post-hoc test using Tukey's HSD
    # (note: this treats the stores as independent groups and ignores the pairing by day)
    tukey = pairwise_tukeyhsd(endog=data['Sales'], groups=data['Store'], alpha=alpha)
    print(tukey)
else:
    print("No significant difference in sales between the stores.")


               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store  0.9395 2.0000 58.0000 0.3967

No significant difference in sales between the stores.
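Because the repeated measures design pairs the stores by day, a post-hoc comparison that keeps that pairing may be preferable to Tukey's HSD, which treats the groups as independent. One possible approach, sketched here as an assumption rather than as the method required by the exercise, is pairwise paired t-tests with a Bonferroni correction. The dictionary `sales` and the loop variable names are introduced for illustration only.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Same simulated daily sales as in the cell above
np.random.seed(0)
sales = {
    'A': np.random.normal(loc=200, scale=10, size=30),
    'B': np.random.normal(loc=210, scale=10, size=30),
    'C': np.random.normal(loc=205, scale=10, size=30),
}

pairs = list(combinations(sales, 2))  # ('A', 'B'), ('A', 'C'), ('B', 'C')
alpha = 0.05
results = {}

for first, second in pairs:
    # Paired t-test: day i in one store is matched with day i in the other
    t_stat, p_raw = stats.ttest_rel(sales[first], sales[second])
    # Bonferroni correction: multiply by the number of comparisons, cap at 1
    p_adj = min(p_raw * len(pairs), 1.0)
    results[(first, second)] = (t_stat, p_adj)
    verdict = "significant" if p_adj < alpha else "not significant"
    print(f"{first} vs {second}: t = {t_stat:.3f}, adjusted p = {p_adj:.4f} ({verdict})")
```

Bonferroni is conservative; other corrections such as Holm's method are also commonly used.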


In [7]:
#5...
"""
In a **two-way ANOVA**, we investigate how two independent variables (factors) affect a dependent variable and whether there is an interaction effect between the two factors. The main effects test the individual influence of each factor, while the interaction effect examines if the factors jointly affect the dependent variable.

### Steps:
1. **Main Effects**: These test the independent effects of each factor on the dependent variable.
2. **Interaction Effects**: This tests whether the combined levels of the two factors produce an effect that is different from the sum of their individual effects.

You can perform a two-way ANOVA using Python's `statsmodels` library, which allows for modeling both main effects and interaction effects.

### Example Code:

Let's assume we are testing how two factors—**Store** and **Promotion**—affect **daily sales**. The two factors are:
- **Factor 1 (Store)**: Store A, Store B, Store C.
- **Factor 2 (Promotion)**: Promotion 1, Promotion 2.

### Step-by-Step Implementation

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Simulated data
np.random.seed(0)
data = pd.DataFrame({
    'Store': np.repeat(['A', 'B', 'C'], 30),
    'Promotion': np.tile(np.repeat(['P1', 'P2'], 15), 3),
    'Sales': np.random.normal(loc=200, scale=10, size=90) + np.repeat([10, 20, 15], 30)  # Adding variation by store
})

# Create the model
# Formula: Sales ~ Store + Promotion + Store:Promotion
model = ols('Sales ~ Store + Promotion + Store:Promotion', data=data).fit()

# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)
```

### Explanation:
1. **Data**:
   - `Store`: Three stores (A, B, C).
   - `Promotion`: Two promotion types (P1, P2).
   - `Sales`: Randomly generated sales data, with different means per store to simulate variation.

2. **Model Setup**:
   - We use the formula `Sales ~ Store + Promotion + Store:Promotion`:
     - `Store`: Main effect for Store.
     - `Promotion`: Main effect for Promotion.
     - `Store:Promotion`: Interaction effect between Store and Promotion.

3. **ANOVA Table**:
   - The function `sm.stats.anova_lm(model, typ=2)` produces the ANOVA table.
   - `typ=2` requests Type II sums of squares, which test each main effect after adjusting for the other main effect but not for the interaction. In a balanced design such as this one, Type I and Type II sums of squares coincide.

### Interpreting the Output:
The ANOVA table will provide:
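- **Sum of Squares (SS)**: The amount of variation in the dependent variable attributed to each term.
- **Degrees of Freedom (df)**: The number of independent pieces of information behind each term (number of levels minus one for a main effect).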
- **F-statistic**: Tests the significance of the effect.
- **p-value**: Tells us whether the effect is significant (usually if `p < 0.05`).

### Example Output (Hypothetical):
```text
                      sum_sq    df         F    PR(>F)
Store                1425.25   2.0   15.203     0.0001
Promotion            1265.75   1.0   27.456     0.0000
Store:Promotion       330.10   2.0    3.582     0.0315
Residual            3689.88   84.0       NaN        NaN
```

- **Store**: Significant main effect (`p = 0.0001`), indicating that the sales differ between the stores.
- **Promotion**: Significant main effect (`p < 0.0001`, printed as `0.0000` after rounding), suggesting that different promotions have a significant effect on sales.
- **Store:Promotion**: Significant interaction effect (`p = 0.0315`), meaning that the effect of promotion depends on the store.

### Post-hoc Analysis (If Interaction is Significant):
If the interaction effect is significant, you may want to investigate which specific pairs of factor levels differ significantly. You can use **Tukey’s HSD** for pairwise comparisons.

```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Perform Tukey's HSD for post-hoc analysis
tukey = pairwise_tukeyhsd(endog=data['Sales'], groups=data['Store'] + " & " + data['Promotion'], alpha=0.05)
print(tukey)
```

### Summary:
- The **main effects** (Store and Promotion) test whether each factor individually affects the dependent variable.
- The **interaction effect** tests whether the combined levels of Store and Promotion jointly affect sales.
- If significant, you can follow up with post-hoc tests (e.g., Tukey’s HSD) to determine where the differences lie.

"""

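An interaction is easier to read from the cell means than from the ANOVA table alone: if the gap between promotions is roughly the same in every store, there is little evidence of interaction. The sketch below rebuilds the simulated Store and Promotion data from the example and tabulates the mean sales per cell. The names `cell_means` and `promo_gap` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same simulated data as in the two-way ANOVA example
np.random.seed(0)
data = pd.DataFrame({
    'Store': np.repeat(['A', 'B', 'C'], 30),
    'Promotion': np.tile(np.repeat(['P1', 'P2'], 15), 3),
    'Sales': np.random.normal(loc=200, scale=10, size=90) + np.repeat([10, 20, 15], 30),
})

# Mean sales for every Store x Promotion combination
cell_means = data.pivot_table(values='Sales', index='Store',
                              columns='Promotion', aggfunc='mean')
print(cell_means.round(2))

# Promotion effect within each store: similar gaps suggest no interaction
promo_gap = cell_means['P2'] - cell_means['P1']
print(promo_gap.round(2))
```

Note that the simulation adds an offset per store only, so any promotion or interaction effect seen in this particular data set arises from random noise.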