In [None]:
#13 March Assignment Solution

In [None]:
#Ans 1:
'''
Assumptions of ANOVA:

Normality: The dependent variable (more precisely, the residuals) should be approximately normally distributed within each group.
Homogeneity of variances (homoscedasticity): The variance of the dependent variable should be approximately equal across all groups.
Independence: Observations should be independent of each other, both within and between groups.

Violations of Assumptions:
Normality:
Example of Violation: The dependent variable does not follow a normal distribution within one or more groups.
For example, in a study examining test scores, one group's scores may be heavily skewed to the left, which violates the normality assumption.

Homogeneity of Variances:
Example of Violation: The variance of the dependent variable differs substantially between groups.
For instance, in an experiment comparing the effectiveness of different treatments on patients,
responses to one treatment may be tightly clustered while responses to another are widely spread.

Independence:
Example of Violation: In repeated measures designs or nested designs, where observations within groups are not independent
(e.g., measurements taken from the same subject over time), the independence assumption is violated.
'''
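
In [None]:
# A minimal sketch of how the normality and equal-variance assumptions above could be checked.
# The scores are simulated (hypothetical data, not from the assignment); the Shapiro-Wilk and
# Levene tests are one common choice, not the only one.
```python
import numpy as np
from scipy.stats import shapiro, levene

# Hypothetical test scores for three groups, simulated for illustration only
rng = np.random.default_rng(0)
group_a = rng.normal(loc=70, scale=5, size=30)
group_b = rng.normal(loc=75, scale=5, size=30)
group_c = rng.normal(loc=72, scale=5, size=30)
groups = {'A': group_a, 'B': group_b, 'C': group_c}

# Normality: Shapiro-Wilk test within each group (H0: the sample comes from a normal distribution)
normality_p = {name: shapiro(values)[1] for name, values in groups.items()}
for name, p in normality_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")

# Homogeneity of variances: Levene's test across groups (H0: all group variances are equal)
levene_stat, levene_p = levene(group_a, group_b, group_c)
print(f"Levene statistic: {levene_stat:.3f}, p-value: {levene_p:.3f}")
```
# A small p-value in either test is evidence against the corresponding assumption.
# Independence cannot be tested this way; it follows from the study design.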

In [None]:
#Ans 2:
'''
Three types of ANOVA:

One-way ANOVA: Used to compare means of three or more independent groups on a single dependent variable. 
It determines whether there are any statistically significant differences between the means of the groups.

Two-way ANOVA: Extends one-way ANOVA to assess the influence of two categorical independent variables (factors) on a single dependent variable.
It examines the main effects of each factor and their interaction effect.

N-way ANOVA: Generalization of ANOVA to more than two factors. It can handle complex experimental
designs with multiple categorical independent variables.

Situations for Each Type:

One-way ANOVA: Used when comparing means across different treatment groups, experimental conditions, or levels of a single 
categorical independent variable.

Two-way ANOVA: Applied when studying the effects of two independent variables on a dependent variable and examining
if there is an interaction between these variables.

N-way ANOVA: Employed in studies with more than two categorical independent variables, where the researcher wants to 
analyze the combined effects of multiple factors on the dependent variable.

'''

In [None]:
#Ans 3:
'''
Partitioning of Variance:
In ANOVA, the total variance in the dependent variable is decomposed into different components:

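Total Sum of Squares (SST): Represents the total variability in the dependent variable.
Explained Sum of Squares (SSE): Indicates the variability explained by the independent variable(s) or factors (between-group variability).
Residual Sum of Squares (SSR): Reflects the unexplained or error variability remaining after accounting for the effects of the independent variable(s) (within-group variability).

These components add up: SST = SSE + SSR. Some textbooks swap the abbreviations, using SSE for the error
sum of squares and SSR for the regression sum of squares, so check which convention a source follows.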

Importance:
Understanding the partitioning of variance is crucial as it helps to:

Assess the proportion of variance in the dependent variable explained by the independent variable(s).
Evaluate the goodness-of-fit of the model and the significance of the independent variable(s) in explaining the variability in the dependent variable.
Interpret the relative importance of different factors or variables in influencing the dependent variable.

'''

In [None]:
#Ans 4:
import numpy as np

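def calculate_sst_sse_ssr(data, groups):
    """Return total, explained (between-group) and residual (within-group) sums of squares."""
    overall_mean = np.mean(data)

    # Total variability around the grand mean
    sst = np.sum((data - overall_mean) ** 2)

    # Residual variability: squared deviations from each group's own mean
    ssr = 0
    for group in np.unique(groups):
        group_data = data[groups == group]
        group_mean = np.mean(group_data)
        ssr += np.sum((group_data - group_mean) ** 2)

    # Explained variability is what remains of the total
    sse = sst - ssr

    return sst, sse, ssr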

# Example usage:
data = np.array([10, 15, 20, 12, 18, 22])
groups = np.array(['A', 'B', 'A', 'B', 'A', 'B'])

sst, sse, ssr = calculate_sst_sse_ssr(data, groups)
print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)
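
In [None]:
# A short sketch showing how the partition above leads to the F-statistic, using the same example
# data. The names ss_between and ss_within are chosen here to avoid the ambiguous abbreviations;
# scipy's f_oneway serves only as a cross-check.
```python
import numpy as np
from scipy.stats import f_oneway

data = np.array([10, 15, 20, 12, 18, 22])
groups = np.array(['A', 'B', 'A', 'B', 'A', 'B'])

labels = np.unique(groups)
overall_mean = data.mean()

# Partition the total sum of squares into within-group and between-group parts
ss_total = np.sum((data - overall_mean) ** 2)
ss_within = sum(np.sum((data[groups == g] - data[groups == g].mean()) ** 2) for g in labels)
ss_between = ss_total - ss_within

# Degrees of freedom: k - 1 between groups, N - k within groups
df_between = len(labels) - 1
df_within = len(data) - len(labels)

# F is the ratio of the between-group mean square to the within-group mean square
f_manual = (ss_between / df_between) / (ss_within / df_within)
f_scipy, p_scipy = f_oneway(*(data[groups == g] for g in labels))

# Eta squared: proportion of total variance explained by group membership
eta_squared = ss_between / ss_total

print("F (from sums of squares):", f_manual)
print("F (scipy f_oneway):", f_scipy)
print("Eta squared:", eta_squared)
```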


In [None]:
#Ans 5:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data: two observations per cell, since fitting the interaction with
# only one observation per cell leaves no residual degrees of freedom for F-tests
data = pd.DataFrame({'factor1': [1, 1, 1, 2, 2, 2] * 2,
                     'factor2': [1, 2, 3, 1, 2, 3] * 2,
                     'response': [5, 7, 6, 9, 11, 10, 6, 8, 5, 10, 13, 9]})

# Fit two-way ANOVA model with both main effects and their interaction
model = ols('response ~ C(factor1) + C(factor2) + C(factor1):C(factor2)', data=data).fit()

# Print ANOVA table: F-statistics and p-values for the main effects and the interaction
print(sm.stats.anova_lm(model))

# Treatment-contrast coefficients: each compares level 2 with the reference level 1
coef_factor1 = model.params['C(factor1)[T.2]']
coef_factor2 = model.params['C(factor2)[T.2]']
coef_interaction = model.params['C(factor1)[T.2]:C(factor2)[T.2]']

print("Factor 1 coefficient (level 2 vs 1):", coef_factor1)
print("Factor 2 coefficient (level 2 vs 1):", coef_factor2)
print("Interaction coefficient:", coef_interaction)


In [None]:
#Ans 6:
'''
In this scenario:

An F-statistic of 5.23 means the between-group mean square is about 5.23 times the within-group mean square.
The p-value of 0.02 is less than the significance level (usually 0.05), suggesting that the observed differences between
the groups are statistically significant.

Interpretation:

We reject the null hypothesis, which assumes that all group means are equal.
Therefore, we can conclude that at least one group mean differs significantly from the others.
However, we cannot determine which specific groups differ from each other based solely on the ANOVA results. 
Post-hoc tests or pairwise comparisons would be necessary to identify which groups differ significantly.
'''

In [None]:
#Ans 7:
'''
Handling Missing Data in Repeated Measures ANOVA:

Exclude cases with missing data: You can choose to exclude cases with missing data from the analysis, but this may lead to biased results
if missingness is related to the outcome or other variables.
Imputation: Impute missing values using methods such as mean imputation, regression imputation, or multiple imputation.
This approach allows you to retain all cases in the analysis but may introduce bias if the imputation model is misspecified.

Potential Consequences of Using Different Methods:

Excluding cases with missing data can lead to biased results if the missingness is not completely random. It may also reduce statistical power.
Imputation methods may introduce bias if the imputation model does not accurately represent the missing data mechanism. 
Additionally, imputed values may not accurately reflect the true values, leading to inaccurate estimates of effects.

'''
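
In [None]:
# A minimal sketch contrasting listwise deletion with mean imputation, as described above.
# The wide-format table (one row per subject, one column per time point) is hypothetical data
# made up for illustration.
```python
import numpy as np
import pandas as pd

# Hypothetical repeated measurements with two missing values
scores = pd.DataFrame({
    'subject': [1, 2, 3, 4, 5],
    't1': [10.0, 12.0, 11.0, 13.0, 9.0],
    't2': [11.0, np.nan, 12.0, 14.0, 10.0],
    't3': [12.0, 14.0, np.nan, 15.0, 11.0],
}).set_index('subject')

# Option 1: listwise deletion drops every subject with any missing value
complete_cases = scores.dropna()

# Option 2: mean imputation fills each gap with that time point's observed mean
mean_imputed = scores.fillna(scores.mean())

print("Subjects kept after listwise deletion:", len(complete_cases))
print("Subjects kept after mean imputation:", len(mean_imputed))
print("Variance per time point (complete cases):")
print(complete_cases.var())
print("Variance per time point (mean imputed):")
print(mean_imputed.var())
```
# Deletion discards whole subjects and so costs power. Mean imputation keeps them but
# understates the variance relative to the observed values alone.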

In [None]:
#Ans 8:
'''
Common Post-hoc Tests:

1. Tukey's Honestly Significant Difference (HSD) Test: Used for all pairwise comparisons between group means.
It controls the familywise error rate and assumes equal variances across groups; with unequal sample sizes, the Tukey-Kramer variant is used.

2. Bonferroni Correction: Divides the significance level by the number of comparisons to maintain a desired familywise error rate.
It is conservative, especially when the number of comparisons is large.

3. Scheffe's Test: Suitable for unequal sample sizes and for testing any contrast among the means, not only pairwise differences.
It controls the familywise error rate but is more conservative (less powerful) than Tukey's HSD for pairwise comparisons.

4. Dunnett's Test: Compares each treatment group mean to a single control group mean.

Example Situation:
Suppose you conducted a one-way ANOVA to compare the effectiveness of four different treatments on reducing blood pressure.
The ANOVA results suggest that there are significant differences between the treatment groups.
To identify which specific treatments differ significantly from each other, you would conduct post-hoc tests,
such as Tukey's HSD or Bonferroni-corrected pairwise comparisons.
'''
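
In [None]:
# A sketch of the blood-pressure example above using Tukey's HSD from statsmodels.
# The reductions are hypothetical numbers invented for illustration; only the call to
# pairwise_tukeyhsd is the point here.
```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical blood-pressure reductions (mmHg), five patients per treatment
reductions = np.array([
    8, 9, 7, 10, 9,      # treatment A
    12, 13, 11, 14, 12,  # treatment B
    8, 7, 9, 8, 10,      # treatment C
    15, 16, 14, 17, 15,  # treatment D
])
treatments = np.repeat(['A', 'B', 'C', 'D'], 5)

# All pairwise comparisons with the familywise error rate held at alpha
tukey = pairwise_tukeyhsd(endog=reductions, groups=treatments, alpha=0.05)
print(tukey.summary())
```
# Each row of the summary is one pair of treatments; the reject column marks the pairs
# whose means differ at the chosen familywise alpha.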

In [None]:
#Ans 9:
import numpy as np
from scipy.stats import f_oneway

# Example data
weight_loss_a = np.array([2, 3, 4, 5, 3])
weight_loss_b = np.array([3, 4, 5, 6, 4])
weight_loss_c = np.array([4, 5, 6, 7, 5])

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(weight_loss_a, weight_loss_b, weight_loss_c)

# Print results
print("F-statistic:", f_statistic)
print("p-value:", p_value)

# Interpretation
if p_value < 0.05:
    print("Reject null hypothesis: There are significant differences in mean weight loss between at least one pair of diets.")
else:
    print("Fail to reject null hypothesis: There are no significant differences in mean weight loss between the diets.")


In [None]:
#Ans 10:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data (all three columns must have the same length, 90 rows here)
data = {'Software': ['A', 'B', 'C'] * 30,
        'Experience': ['Novice'] * 30 + ['Experienced'] * 30 + ['Novice'] * 30,
        'Time': [10, 12, 11, 13, 15, 14, 9, 11, 10] * 10}

# Convert data to DataFrame
df = pd.DataFrame(data)

# Fit two-way ANOVA model
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=df).fit()

# Print ANOVA table
print(sm.stats.anova_lm(model, typ=2))

# Interpretation
# Look for significant main effects and interaction effects in the ANOVA table
# Interpret the results based on the F-statistics and p-values


In [3]:
#Ans 11:
import numpy as np
from scipy.stats import ttest_ind

# Example data
control_group_scores = np.array([70, 75, 80, 85, 78, 72, 77, 82, 79, 75])
experimental_group_scores = np.array([75, 82, 88, 90, 85, 79, 81, 86, 83, 80])

# Perform two-sample t-test
t_statistic, p_value = ttest_ind(control_group_scores, experimental_group_scores)

# Print results
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Interpretation
if p_value < 0.05:
    print("Reject null hypothesis: There is a significant difference in test scores between the control and experimental groups.")
    # With only two groups there is a single comparison, so no post-hoc test is needed
else:
    print("Fail to reject null hypothesis: There is no significant difference in test scores between the groups.")


In [None]:
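#Ans 12:
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Example data: Day is an assumed subject identifier, so that each of 30 days
# contributes one sales figure per store (the repeated measurements)
data = {'Day': [day for day in range(1, 31) for _ in range(3)],
        'Store': ['A', 'B', 'C'] * 30,
        'Sales': [100, 110, 95, 105, 115, 90, 95, 100, 105] * 10}

# Convert data to DataFrame
df = pd.DataFrame(data)

# Fit repeated measures ANOVA model with Store as the within-subject factor
results = AnovaRM(df, depvar='Sales', subject='Day', within=['Store']).fit()

# Print ANOVA table
print(results)

# Perform post-hoc test if necessary
# Because the same days are measured for every store, use paired comparisons
# (e.g. paired t-tests with a Bonferroni correction) rather than Tukey's HSD,
# which assumes independent groups.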
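
In [None]:
# A sketch of one possible post-hoc follow-up for the store comparison: paired t-tests with a
# Bonferroni-adjusted threshold. It reuses the illustrative sales figures from the cell above;
# the Day column is a hypothetical identifier added here so that measurements can be paired.
```python
from itertools import combinations

import pandas as pd
from scipy.stats import ttest_rel

df = pd.DataFrame({
    'Day': [day for day in range(1, 31) for _ in range(3)],
    'Store': ['A', 'B', 'C'] * 30,
    'Sales': [100, 110, 95, 105, 115, 90, 95, 100, 105] * 10,
})

# One row per day, one column per store, so the rows pair up across stores
wide = df.pivot(index='Day', columns='Store', values='Sales')

pairs = list(combinations(wide.columns, 2))
alpha = 0.05
bonferroni_alpha = alpha / len(pairs)  # divide alpha by the number of comparisons

results = {}
for first, second in pairs:
    t_stat, p_val = ttest_rel(wide[first], wide[second])
    results[(first, second)] = p_val
    verdict = "significant" if p_val < bonferroni_alpha else "not significant"
    print(f"{first} vs {second}: t = {t_stat:.3f}, p = {p_val:.4f} ({verdict})")
```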
