# Statistics Advance-6

Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.
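
Two of the usual ANOVA assumptions, normality within each group and equal variances across groups, can be checked numerically. The following is a minimal sketch, assuming small illustrative groups and using `scipy.stats.shapiro` and `scipy.stats.levene`; the third assumption, independence of observations, comes from the study design and cannot be tested this way.

```python
from scipy import stats

# Illustrative groups (assumed data, five observations each)
groups = {
    "Group_A": [10, 12, 15, 11, 13],
    "Group_B": [8, 9, 11, 10, 12],
    "Group_C": [13, 14, 16, 15, 17],
}

# Normality within each group (Shapiro-Wilk; H0: the sample is normal)
shapiro_p = {name: stats.shapiro(values).pvalue for name, values in groups.items()}

# Homogeneity of variance across groups (Levene; H0: equal variances)
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f"{name}: Shapiro-Wilk p = {p:.3f}")
print(f"Levene p = {levene_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption may be violated, for example by strongly skewed data or by one group that is far more spread out than the others. With samples this small, both tests have little power, so plots are worth checking as well.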

Q2. What are the three types of ANOVA, and in what situations would each be used?

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [3]:
# Creating a dataset

import pandas as pd
import numpy as np

data = {
    'Group_A': [10, 12, 15, 11, 13],
    'Group_B': [8, 9, 11, 10, 12],
    'Group_C': [13, 14, 16, 15, 17]
}

marks = pd.DataFrame(data)

# Grand mean over every observation
grand_mean = marks.values.mean()

# Total sum of squares: every observation against the grand mean
sst = np.sum((marks.values - grand_mean) ** 2)

# Explained (between-group) sum of squares: each group mean against the
# grand mean, weighted by group size (every group has len(marks) rows)
group_means = marks.mean()
sse = len(marks) * np.sum((group_means - grand_mean) ** 2)

# Residual (within-group) sum of squares: what the group means leave unexplained
ssr = sst - sse

print(f"Total Sum of Squares: {sst:.1f}")
print(f"Explained Sum of Squares: {sse:.1f}")
print(f"Residual Sum of Squares: {ssr:.1f}")

Total Sum of Squares: 97.6
Explained Sum of Squares: 62.8
Residual Sum of Squares: 34.8
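
As a cross-check, the decomposition can be computed directly and compared with `scipy.stats.f_oneway`. This is a sketch that assumes the usual one-way convention, where the explained part is the between-group sum of squares and the residual part is the within-group sum of squares, so that the two add up to the total.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([10, 12, 15, 11, 13]),  # Group_A
    np.array([8, 9, 11, 10, 12]),    # Group_B
    np.array([13, 14, 16, 15, 17]),  # Group_C
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total, between-group (explained) and within-group (residual) sums of squares
ss_total = np.sum((all_values - grand_mean) ** 2)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# F is the ratio of the two mean squares
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

# Reference value from SciPy
f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"total = {ss_total:.1f}, between = {ss_between:.1f}, within = {ss_within:.1f}")
print(f"F (manual) = {f_manual:.3f}, F (scipy) = {f_scipy:.3f}, p = {p_scipy:.4f}")
```

If the between and within parts do not sum to the total, one of them has been computed incorrectly.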


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [17]:
import pandas as pd
import numpy as np

# Create the dataset
data = {
    'A': [1, 1, 2, 2, 3, 3],
    'B': [1, 2, 1, 2, 1, 2],
    'Y': [10, 12, 15, 13, 14, 16]
}

df = pd.DataFrame(data)

# Check for NaNs
print("NaN values in the dataset:")
print(df.isna().sum())

# Check for infinite values
print("\nInfinite values in the dataset:")
print(np.isinf(df).sum())

NaN values in the dataset:
A    0
B    0
Y    0
dtype: int64

Infinite values in the dataset:
A    0
B    0
Y    0
dtype: int64


In [21]:
import warnings
warnings.filterwarnings('ignore')

In [22]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create the dataset
data = {
    'A': [1, 1, 2, 2, 3, 3],
    'B': [1, 2, 1, 2, 1, 2],
    'Y': [10, 12, 15, 13, 14, 16]
}

df = pd.DataFrame(data)

# Check for NaNs and Infs
print("Checking for NaN values in the dataset:")
print(df.isna().sum())

print("\nChecking for infinite values in the dataset:")
print(np.isinf(df).sum())

# Display the dataframe
print("\nDataFrame:")
print(df)

# Check data types
print("\nData types:")
print(df.dtypes)

# Check unique values
print("\nUnique values in each column:")
for column in df.columns:
    print(f"{column}: {df[column].unique()}")

# Convert A and B to categorical variables
df['A'] = df['A'].astype('category')
df['B'] = df['B'].astype('category')

# Verify data types after conversion
print("\nData types after conversion to 'category':")
print(df.dtypes)

# Fit the ANOVA model
try:
    model = ols('Y ~ C(A) + C(B) + C(A):C(B)', data=df).fit()
    # Get ANOVA table
    anova_table = sm.stats.anova_lm(model, typ=2)
    print("\nANOVA Table:")
    print(anova_table)
except Exception as e:
    print(f"An error occurred: {e}")


Checking for NaN values in the dataset:
A    0
B    0
Y    0
dtype: int64

Checking for infinite values in the dataset:
A    0
B    0
Y    0
dtype: int64

DataFrame:
   A  B   Y
0  1  1  10
1  1  2  12
2  2  1  15
3  2  2  13
4  3  1  14
5  3  2  16

Data types:
A    int64
B    int64
Y    int64
dtype: object

Unique values in each column:
A: [1 2 3]
B: [1 2]
Y: [10 12 15 13 14 16]

Data types after conversion to 'category':
A    category
B    category
Y       int64
dtype: object
An error occurred: array must not contain infs or NaNs


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [28]:
# Null hypo: there are no significant differences between the mean weight loss of the three diets

# Alt hypo: there are significant differences between the mean weight loss of the three diets

In [25]:
# Generating Data

import pandas as pd
import numpy as np

np.random.seed(0)

n = 50
diets = ['Diet1', 'Diet2', 'Diet3']
data = {
    'diets': np.random.choice(diets, n),
    'weight_loss': np.random.normal(loc=0, scale=1, size=n) + np.random.choice([1, 2, 3], n)
}

df = pd.DataFrame(data)

print(df.head())

# Perform One Way ANOVA

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Fit the ANOVA model
model = ols('weight_loss ~ C(diets)', data=df).fit()

# Get the ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)

   diets  weight_loss
0  Diet1     1.566440
1  Diet2     3.149265
2  Diet1    -0.078278
3  Diet2     4.395472
4  Diet2     4.787484
             sum_sq    df         F   PR(>F)
C(diets)   3.445806   2.0  1.016375  0.36972
Residual  79.671787  47.0       NaN      NaN


In [26]:
# From the above code, we conclude that:
    
F = 1.016
p_value = 0.369

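In [27]:
alpha = 0.05  # significance level

# Compare the p-value with the significance level, not with the F-statistic
if p_value < alpha:
    print('reject null hypo')
else:
    print('failed to reject null hypo')

failed to reject null hypo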
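
The F-statistic and p-value can also be read from the test result instead of being typed in by hand. The sketch below assumes the same data-generating code and seed as the cell above and swaps in `scipy.stats.f_oneway` for the statsmodels table; the 0.05 significance level is an assumption.

```python
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(0)

n = 50
diets = ['Diet1', 'Diet2', 'Diet3']
df = pd.DataFrame({
    'diets': np.random.choice(diets, n),
    'weight_loss': np.random.normal(loc=0, scale=1, size=n) + np.random.choice([1, 2, 3], n),
})

# One array of weight-loss values per diet
samples = [grp['weight_loss'].to_numpy() for _, grp in df.groupby('diets')]
f_stat, p_value = stats.f_oneway(*samples)

alpha = 0.05  # assumed significance level
decision = 'reject null hypo' if p_value < alpha else 'failed to reject null hypo'
print(f"F = {f_stat:.3f}, p = {p_value:.3f}: {decision}")
```

Note that the simulated offset `np.random.choice([1, 2, 3], n)` is drawn independently of the diet label, so the generated data contain no real diet effect by construction.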


Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [30]:
import pandas as pd
import numpy as np

np.random.seed(42)

n = 30
programs = ['A', 'B', 'C']
experience_levels = ['novice', 'experienced']

data = {
    'program': np.random.choice(programs, n),
    'experience': np.random.choice(experience_levels, n),
    'completion_time': np.random.normal(loc=10, scale=2, size=n) + np.random.choice([1, 2, 3], n)
}

df = pd.DataFrame(data)

# Perform two-way ANOVA

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Fit the two-way ANOVA model
model = ols('completion_time ~ C(program) + C(experience) + C(program):C(experience)', data=df).fit()

# Get the ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print("\nANOVA Table:")
print(anova_table)


ANOVA Table:
                             sum_sq    df         F    PR(>F)
C(program)                 8.498844   2.0  1.262868  0.300972
C(experience)              0.480110   1.0  0.142682  0.708949
C(program):C(experience)   3.045226   2.0  0.452499  0.641352
Residual                  80.757538  24.0       NaN       NaN
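
Because the program and experience labels are drawn independently with `np.random.choice`, the six cells are unlikely to hold the same number of employees. In an unbalanced design the different types of sums of squares can disagree, so it is worth checking the cell counts. The sketch below assumes the same seed and label-generating calls as the cell above:

```python
import numpy as np
import pandas as pd

np.random.seed(42)

n = 30
df = pd.DataFrame({
    'program': np.random.choice(['A', 'B', 'C'], n),
    'experience': np.random.choice(['novice', 'experienced'], n),
})

# Number of employees in each program x experience cell
cell_counts = pd.crosstab(df['program'], df['experience'])
print(cell_counts)

# A design is balanced when every cell has the same count
counts = cell_counts.to_numpy()
is_balanced = counts.min() == counts.max()
print("Balanced design:", is_balanced)
```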


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [32]:
import pandas as pd
import numpy as np

np.random.seed(42)
n_days = 30
stores = ['A', 'B', 'C']
sales_data = {
    'day': np.tile(np.arange(1, n_days + 1), len(stores)),
    'store': np.repeat(stores, n_days),
    'sales': np.random.normal(loc=100, scale=10, size=n_days * len(stores)) + np.random.choice([0, 10, 20], n_days * len(stores))
}

df = pd.DataFrame(sales_data)

print("Sample Data:")
print(df.head())

import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import AnovaRM

rm_anova = AnovaRM(df, 'sales', 'day', within=['store'])
rm_anova_results = rm_anova.fit()

print("\nRepeated Measures ANOVA Results:")
print(rm_anova_results)

from statsmodels.stats.multicomp import pairwise_tukeyhsd

posthoc = pairwise_tukeyhsd(df['sales'], df['store'], alpha=0.05)

print("\nPost-Hoc Test Results (Tukey's HSD):")
print(posthoc)

Sample Data:
   day store       sales
0    1     A  114.967142
1    2     A  108.617357
2    3     A  126.476885
3    4     A  125.230299
4    5     A  117.658466

Repeated Measures ANOVA Results:
               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
store  1.0637 2.0000 58.0000 0.3518


Post-Hoc Test Results (Tukey's HSD):
Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
     A      B   0.0032    1.0 -7.7758  7.7822  False
     A      C   4.0103 0.4393 -3.7687 11.7893  False
     B      C   4.0071 0.4398 -3.7719 11.7861  False
----------------------------------------------------
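
Tukey's HSD as called above treats the three stores as independent groups, which ignores the fact that each day contributes one measurement per store. A post-hoc test is also only called for when the omnibus test is significant, which is not the case here (p = 0.3518). For completeness, the following is a minimal sketch of a follow-up that respects the pairing, assuming the same simulated data and swapping in paired t-tests with a Bonferroni correction:

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(42)
n_days = 30
stores = ['A', 'B', 'C']
df = pd.DataFrame({
    'day': np.tile(np.arange(1, n_days + 1), len(stores)),
    'store': np.repeat(stores, n_days),
    'sales': np.random.normal(loc=100, scale=10, size=n_days * len(stores))
             + np.random.choice([0, 10, 20], n_days * len(stores)),
})

# One row per day and one column per store, so each row is a matched set
wide = df.pivot(index='day', columns='store', values='sales')

pairs = list(combinations(wide.columns, 2))
results = {}
for a, b in pairs:
    t_stat, p_raw = stats.ttest_rel(wide[a], wide[b])
    # Bonferroni: multiply by the number of comparisons, capped at 1
    results[(a, b)] = min(p_raw * len(pairs), 1.0)

for (a, b), p_adj in results.items():
    print(f"{a} vs {b}: Bonferroni-adjusted p = {p_adj:.4f}")
```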
