In [1]:
# Question 1

# Answer 1 -

# Analysis of Variance (ANOVA) is a statistical technique used to compare means of three or more groups to determine if there are significant 
# differences between them. To use ANOVA and obtain reliable results, certain assumptions need to be met. 
# These assumptions ensure that the test is appropriate and the conclusions drawn are valid. 
# The assumptions for ANOVA include:

# 1. Normality: The data within each group (equivalently, the model residuals) should follow a normal distribution. This assumption is important 
# because the F-test's p-values are derived under normally distributed errors.

# 2. Homogeneity of Variance: The variances of the groups being compared should be approximately equal. Unequal variances can affect the 
# overall F-test's sensitivity and result in inaccurate p-values.

# 3. Independence: Observations should be independent of each other, both within and across groups. This assumption ensures that the 
# observations are not correlated or otherwise related to one another.

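# 4. Absence of Outliers: Outlying scores can distort group means and inflate variances, so they should be identified and handled 
# (investigated, corrected, or removed with justification) before running the test.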

# Examples of Violations and Their Impact:

# 1. Normality Violation:
#   - Impact: If the assumption of normality is violated, the validity of p-values and confidence intervals can be compromised. 
#     The F-test's distribution relies on the normality assumption, so violating it may lead to incorrect conclusions about group differences.
#   - Example: In an ANOVA comparing test scores between groups, if one group's scores are heavily skewed, it might violate the normality assumption.

# 2. Homogeneity of Variance Violation:
#   - Impact: Violating the homogeneity of variance assumption can lead to inflated or deflated p-values. If variances are not equal, 
#     the F-test's assumptions are not met, potentially leading to a higher Type I error rate.
#   - Example: In a study comparing the effects of different fertilizer brands on plant growth, if the variances in plant heights are vastly 
#    different between the fertilizer groups, the homogeneity of variance assumption might be violated.

# 3. Independence Violation:
#   - Impact: If observations are not independent within groups, the assumption of independence is violated. This can lead to incorrect conclusions 
#     about group differences and affect the validity of the F-test.
#   - Example: In a study where students' test scores are measured before and after a tutoring program, the scores within each student are not 
#    independent, violating the assumption.

# When these assumptions are significantly violated, the results of ANOVA may be unreliable.
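
# A minimal sketch of how the normality and equal-variance assumptions might be checked with SciPy. The data below are hypothetical 
# (randomly generated test scores, not from the question). Independence cannot be tested this way; it has to follow from the study design.

```python
import numpy as np
from scipy import stats

# Hypothetical test scores for three groups (illustrative data only)
rng = np.random.default_rng(42)
groups = [rng.normal(loc=mu, scale=5, size=30) for mu in (70, 75, 80)]

# Normality: Shapiro-Wilk test on each group (H0: the sample comes from a normal distribution)
shapiro_p = [stats.shapiro(g)[1] for g in groups]

# Homogeneity of variance: Levene's test (H0: all group variances are equal)
levene_stat, levene_p = stats.levene(*groups)

for i, p in enumerate(shapiro_p, start=1):
    print(f"Group {i} Shapiro-Wilk p-value: {p:.3f}")
print(f"Levene's test p-value: {levene_p:.3f}")
```

# A small p-value in either test suggests the corresponding assumption may not hold for that sample.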

In [2]:
# Question 2

# Answer 2 -

# There are three main types of Analysis of Variance (ANOVA) techniques: One-Way ANOVA, Two-Way ANOVA, and Multivariate ANOVA (MANOVA). 
# Each type is used in different situations to analyze and compare means across different groups or factors.

# 1. One-Way ANOVA:
#   - Situation: One-Way ANOVA is used when you have one independent variable with more than two levels (groups) and you want to determine 
#    if there are any significant differences in means between these groups.
#   - Example: Suppose you want to compare the mean test scores of students from three different schools to determine if there's a significant 
#    difference in the quality of education they receive.

# 2. Two-Way ANOVA:
#    - Situation: Two-Way ANOVA is used when you have two independent variables (factors) and you want to analyze their main effects and 
#     potential interaction effects on a dependent variable.
#   - Example: Consider a study where you're investigating the effects of both gender and diet on weight loss. You have two independent variables: 
#     gender (male/female) and diet type (low-carb/high-carb). Two-Way ANOVA helps you assess if there are main effects for each factor and 
#     whether the interaction of gender and diet is significant.

# 3. Multivariate ANOVA (MANOVA):
#   - Situation: MANOVA is used when you have two or more dependent variables and you want to determine if there are significant differences 
#    among groups on these multiple dependent variables simultaneously.
#  - Example: Imagine a study examining the effects of different exercise programs on both weight loss and cardiovascular fitness. 
#    You're interested in understanding if the exercise programs have a joint effect on both variables. MANOVA helps you analyze whether there are 
#    any significant differences in the combination of weight loss and fitness levels.

# These ANOVA techniques are used to analyze different levels of complexity in experimental designs. 
# One-Way ANOVA is suitable for comparing means across multiple groups of one factor. 
# Two-Way ANOVA expands this to two factors, and it allows you to examine potential interactions between these factors. 
# MANOVA goes further by considering multiple dependent variables, giving insight into patterns across multiple outcome measures.

In [3]:
# Question 3

# Answer 3 -

# The partitioning of variance in ANOVA refers to the process of decomposing the total variability observed in a dataset into different 
# components that can be attributed to different sources or factors. This partitioning helps to understand how much of the variability in the data 
# is due to the factors being studied and how much is due to random variability or measurement error.

# In ANOVA, the total variability is broken down into two main components:

# 1. Between-Group Variability (Treatment Variability):
#   This component represents the variability between the group means. It indicates how much the means of different groups differ from each other. 
#   The larger this component is relative to the total variability, the more likely it is that the group means are significantly different.

# 2. Within-Group Variability (Error Variability):
#   This component represents the variability within each group. It indicates how much individual data points within each group deviate from their 
#   respective group mean. This variability is often attributed to random chance or measurement error.

# The partitioning of variance is represented mathematically using the sum of squares (SS) terms. There are three primary sources of sums of squares 
# in a one-way ANOVA:

# 1. Total Sum of Squares (SSTotal):
#   This represents the total variability in the dataset, calculated as the sum of squared differences between each data point and the overall mean 
#   of all data points.

# 2. Between-Group Sum of Squares (SSBetween):
#   This represents the variability between the group means, calculated as the sum, over groups, of each group's size multiplied by the squared 
#   difference between its group mean and the overall mean.

# 3. Within-Group Sum of Squares (SSWithin):
#   This represents the variability within each group, calculated as the sum of squared differences between each data point and its group mean.

# These terms satisfy SSTotal = SSBetween + SSWithin.

# The importance of understanding the partitioning of variance in ANOVA includes:

# 1. Interpretation of Results:
#   By understanding how much of the variability is due to between-group differences and how much is due to within-group variability, 
#  researchers can better interpret the significance of the differences between groups. It provides insight into the factors that contribute to the 
#  observed variation.

# 2. Validity of Results:
#   A large between-group variability relative to within-group variability indicates that the groups might be significantly different. 
#  However, if the within-group variability is too large, it might mask genuine between-group differences.

# 3. Model Assessment:
#   Partitioning the variance allows researchers to assess how well the model (ANOVA) explains the observed variability. A good model should 
#  capture most of the variability between groups while minimizing the variability within groups.

# 4. Comparing Effects:
#   Understanding the proportion of variance explained by different factors helps researchers compare the relative importance of these factors 
#  in explaining the observed variability.
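
# A small sketch of the partition, assuming three hypothetical groups of three measurements each (illustrative data only). 
# It computes each sum of squares directly and reports the share of variance explained by group membership (eta squared).

```python
import numpy as np

# Hypothetical measurements for three groups (illustrative data only)
groups = [np.array([4.0, 5.0, 6.0]),
          np.array([7.0, 8.0, 9.0]),
          np.array([1.0, 2.0, 3.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total: every observation against the grand mean
ss_total = np.sum((all_values - grand_mean) ** 2)

# Between: each group mean against the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within: every observation against its own group mean
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Proportion of total variability attributable to group membership
eta_squared = ss_between / ss_total

print("SSTotal:", ss_total)
print("SSBetween:", ss_between)
print("SSWithin:", ss_within)
print("Eta squared:", eta_squared)
```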

In [13]:
# Question 4 -

# Answer 4 -

import numpy as np

# Sample data for each group
group_1 = np.array([22, 18, 20, 25, 23])
group_2 = np.array([30, 28, 25, 33, 31])
group_3 = np.array([17, 15, 14, 19, 18])

# Overall data
all_data = np.concatenate([group_1, group_2, group_3])

# Calculate overall mean
overall_mean = np.mean(all_data)

# Calculate Total Sum of Squares (SST)
sst = np.sum((all_data - overall_mean)**2)

# Calculate group means
group_means = [np.mean(group) for group in [group_1, group_2, group_3]]

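# Calculate Explained Sum of Squares (SSE): the between-group sum of squares,
# with each group's squared deviation weighted by its size.
# Note: some texts use "SSE" for the error (residual) term instead.
sse = np.sum([len(group) * (mean - overall_mean)**2 for group, mean in zip([group_1, group_2, group_3], group_means)])

# Calculate Residual Sum of Squares (SSR): the within-group sum of squares,
# obtained here by subtraction because SST = SSE + SSR
ssr = sst - sse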

print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)


Total Sum of Squares (SST): 499.73333333333335
Explained Sum of Squares (SSE): 416.13333333333316
Residual Sum of Squares (SSR): 83.6000000000002
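
# A hedged follow-up sketch using the same three groups: it computes the within-group sum of squares directly (rather than by 
# subtraction), turns the sums of squares into an F-statistic, and cross-checks the result against scipy.stats.f_oneway. 
# The variable names below are illustrative.

```python
import numpy as np
from scipy.stats import f, f_oneway

groups = [np.array([22, 18, 20, 25, 23]),
          np.array([30, 28, 25, 33, 31]),
          np.array([17, 15, 14, 19, 18])]

all_data = np.concatenate(groups)
overall_mean = all_data.mean()

# Between-group and within-group sums of squares, both computed directly
ss_between = sum(len(g) * (g.mean() - overall_mean) ** 2 for g in groups)
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Degrees of freedom: (number of groups - 1) and (observations - number of groups)
df_between = len(groups) - 1
df_within = len(all_data) - len(groups)

# F is the ratio of the two mean squares; the p-value is the upper tail of the F distribution
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = f.sf(f_manual, df_between, df_within)

f_scipy, p_scipy = f_oneway(*groups)

print("Manual F:", f_manual, "p:", p_manual)
print("SciPy  F:", f_scipy, "p:", p_scipy)
```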


In [14]:
# Question 5

# Answer 5 -

import numpy as np
from scipy.stats import f

# Sample data: rows are levels of Factor A, columns are levels of Factor B,
# with one observation per cell
data = np.array([[10, 12, 15],
                 [18, 20, 23],
                 [25, 28, 30]])
n_a, n_b = data.shape

# Calculate means for Factor A and Factor B
mean_factor_a = np.mean(data, axis=1)
mean_factor_b = np.mean(data, axis=0)

# Calculate Grand Mean (overall mean)
grand_mean = np.mean(data)

# Calculate Main Effect for Factor A
main_effect_a = mean_factor_a - grand_mean

# Calculate Main Effect for Factor B
main_effect_b = mean_factor_b - grand_mean

# Calculate Interaction Effect: what remains of each cell after removing the
# grand mean and both main effects (cell - row mean - column mean + grand mean)
interaction_effect = data - mean_factor_a[:, np.newaxis] - mean_factor_b + grand_mean

# Calculate the sums of squares; each main-effect deviation is weighted by the
# number of observations behind the corresponding mean
ss_main_effect_a = n_b * np.sum(main_effect_a**2)
ss_main_effect_b = n_a * np.sum(main_effect_b**2)
ss_interaction_effect = np.sum(interaction_effect**2)

# Calculate the degrees of freedom for each effect
df_main_effect_a = n_a - 1
df_main_effect_b = n_b - 1
df_interaction_effect = (n_a - 1) * (n_b - 1)

# Calculate mean squares for each effect
ms_main_effect_a = ss_main_effect_a / df_main_effect_a
ms_main_effect_b = ss_main_effect_b / df_main_effect_b
ms_interaction_effect = ss_interaction_effect / df_interaction_effect

# Calculate F-ratio for each main effect. With a single observation per cell there is
# no separate within-cell error term, so the interaction mean square serves as the error term
f_ratio_main_effect_a = ms_main_effect_a / ms_interaction_effect
f_ratio_main_effect_b = ms_main_effect_b / ms_interaction_effect

# Calculate p-values for each effect (sf is the upper tail, i.e. 1 - cdf)
p_value_main_effect_a = f.sf(f_ratio_main_effect_a, df_main_effect_a, df_interaction_effect)
p_value_main_effect_b = f.sf(f_ratio_main_effect_b, df_main_effect_b, df_interaction_effect)

print("Main Effect A:", main_effect_a)
print("Main Effect B:", main_effect_b)
print("Interaction Effect:", np.round(interaction_effect, 4))
print(f"F-ratio for Main Effect A: {f_ratio_main_effect_a:.2f}")
print(f"F-ratio for Main Effect B: {f_ratio_main_effect_b:.2f}")
print(f"P-value for Main Effect A: {p_value_main_effect_a:.3g}")
print(f"P-value for Main Effect B: {p_value_main_effect_b:.3g}")


Main Effect A: [-7.77777778  0.22222222  7.55555556]
Main Effect B: [-2.44444444 -0.11111111  2.55555556]
Interaction Effect: [[ 0.1111 -0.2222  0.1111]
 [ 0.1111 -0.2222  0.1111]
 [-0.2222  0.4444 -0.2222]]
F-ratio for Main Effect A: 1588.00
F-ratio for Main Effect B: 169.00
P-value for Main Effect A: 1.58e-06
P-value for Main Effect B: 0.000137

In [1]:
# Question 6

# Answer 6 -

# The F-statistic is the ratio of the variability between group means (between-group mean square) to the variability within groups 
# (within-group mean square). A larger F-statistic indicates that the differences between group means are large relative to the 
# within-group variability.
# Here, an F-statistic of 5.23 means the between-group mean square is about 5.23 times the within-group mean square.

# The p-value is a measure of the evidence against the null hypothesis. It indicates the probability of observing such an 
# extreme F-statistic (or more extreme) if the null hypothesis were true. A smaller p-value suggests stronger evidence against the null hypothesis.
# Since the p-value of 0.02 is less than the common significance level of 0.05, we have evidence to reject the null hypothesis. 

# In the context of a one-way ANOVA:
# Null Hypothesis (H0): The group means are all equal (there are no significant differences between groups).
# Alternate Hypothesis (H1): At least one group mean is different from the others.

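# Final conclusion: Based on the F-statistic of 5.23 and the p-value of 0.02, we reject the null hypothesis at the 0.05 significance level and 
# conclude that at least one group mean differs from the others. ANOVA alone does not show which groups differ; a post-hoc test is needed for that.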
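
# A sketch of how a p-value follows from an F-statistic. The question does not give the degrees of freedom, so the design below 
# (3 groups, 20 observations in total) is purely an assumption for illustration; a different design gives a different p-value.

```python
from scipy.stats import f

f_statistic = 5.23
alpha = 0.05

# Hypothetical design (NOT given in the question): 3 groups, 20 observations in total
df_between = 3 - 1
df_within = 20 - 3

# p-value: probability of an F at least this large if the null hypothesis is true
p_value = f.sf(f_statistic, df_between, df_within)

# Critical value: the F that cuts off the upper alpha tail of the null distribution
f_critical = f.ppf(1 - alpha, df_between, df_within)

reject_null = f_statistic > f_critical

print(f"p-value: {p_value:.4f}")
print(f"Critical F at alpha = {alpha}: {f_critical:.3f}")
print("Reject H0:", reject_null)
```

# Comparing the p-value with alpha and comparing F with the critical value are two views of the same decision rule.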

In [2]:
# Question 7

# Answer 7 -

# In a repeated measures ANOVA, missing data can be a common challenge. Handling missing data appropriately is crucial to ensure the validity 
# and reliability of the analysis. There are several methods to handle missing data, each with its potential consequences:

# 1. Listwise Deletion (Complete Case Analysis):
#   - This method involves excluding any participant with missing data from the analysis.
#   - Consequences: This can lead to reduced sample size, loss of statistical power, and potential bias if the missing data are not completely 
#    random (i.e., if they are related to the variables being studied).

# 2. Pairwise Deletion (Available Case Analysis):
#   - This method includes all available data points for each specific analysis.
#   - Consequences: It retains more data than listwise deletion, but different analyses end up with different effective sample sizes, 
#    which can produce inconsistent results. It can still introduce bias if the missing data are not completely random.

# 3. Mean Substitution (Imputation):
#   - Replace missing values with the mean value of the variable.
#   - Consequences: This method can artificially reduce variability and potentially bias results if missingness is not random. It can also 
#    underestimate standard errors, leading to incorrect significance tests.

# 4. Last Observation Carried Forward (LOCF):
#   - Use the last observed value for a participant with missing data.
#   - Consequences: This method might not accurately represent the actual trajectory of the data. If the pattern of missingness is related to the 
#    variable's change over time, it can introduce bias.

# 5. Linear Interpolation:
#   - Estimate missing values by linearly interpolating between adjacent observed values.
#   - Consequences: This method assumes a linear relationship, which might not be appropriate for all variables. It can also underestimate 
#    variability and introduce bias if the underlying relationship is not linear.

# 6. Multiple Imputation:
#   - Generate multiple plausible values for each missing data point, considering the variability in the data.
#   - Consequences: This method can provide more accurate estimates and standard errors. However, it requires assumptions about the data's 
#    distribution and relationships. Multiple imputation is computationally intensive.

# 7. Model-Based Imputation:
#   - Use regression or other modeling techniques to predict missing values based on other variables.
#   - Consequences: This method can yield accurate estimates if the model is well-specified. However, it relies on the validity of the model 
#    assumptions.

# Choosing the appropriate method for handling missing data depends on the nature of the data, the extent of missingness, and the underlying 
# reasons for missingness. It's important to carefully consider the potential consequences of each method and to document your chosen approach 
# transparently in your analysis to ensure the robustness of your results.
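
# A minimal pandas sketch of four of the simpler strategies above (listwise deletion, mean substitution, LOCF, linear interpolation). 
# The wide-format table (one row per participant, one column per time point) and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated-measures scores: rows are participants, columns are time points
scores = pd.DataFrame({
    'T1': [10.0, 12.0, 11.0, 13.0],
    'T2': [11.0, np.nan, 12.0, 14.0],
    'T3': [13.0, 15.0, np.nan, 16.0],
    'T4': [14.0, 16.0, 15.0, 17.0],
}, index=['P1', 'P2', 'P3', 'P4'])

# Listwise deletion: keep only participants with complete data
listwise = scores.dropna()

# Mean substitution: fill each time point's gaps with that time point's mean
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each participant's last observed value forward across time points
locf = scores.ffill(axis=1)

# Linear interpolation between a participant's adjacent observed time points
interpolated = scores.interpolate(method='linear', axis=1)

print("Participants kept by listwise deletion:", list(listwise.index))
print("P2 at T2 -> mean:", mean_imputed.loc['P2', 'T2'],
      "| LOCF:", locf.loc['P2', 'T2'],
      "| interpolated:", interpolated.loc['P2', 'T2'])
```

# The three filled values for the same missing cell differ, which illustrates why the chosen method should be reported.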

In [3]:
# Question 8

# Answer 8 -

# After conducting an ANOVA and finding a significant overall effect, post-hoc tests are used to determine which specific groups differ 
# significantly from each other. Common post-hoc tests include:

# 1. Tukey's Honestly Significant Difference (HSD):
#    - Use when: You have three or more groups and want to compare all possible pairs to identify significant differences.
#   - Example: In a study comparing the effectiveness of three different teaching methods on test scores, you find a significant overall effect. 
#    You would use Tukey's HSD to determine which pairs of teaching methods have significantly different scores.

# 2. Bonferroni Correction:
#   - Use when: You want to control the familywise error rate (overall Type I error rate) when conducting multiple pairwise comparisons.
#   - Example: You're comparing the effectiveness of four different marketing strategies on sales. Using Bonferroni correction can help you 
#   adjust the significance level for each individual comparison to maintain an overall desired level of significance.

# 3. Sidak Correction:
#   - Use when: You would otherwise use Bonferroni and the comparisons can be treated as independent; it is slightly less conservative 
#    than Bonferroni.
#   - Example: You're comparing the performance of different software algorithms across multiple scenarios. The Sidak correction can help you 
#   adjust p-values for multiple comparisons.

# 4. Dunnett's Test:
#   - Use when: You have a control group and want to compare other groups to the control while controlling the overall Type I error rate.
#   - Example: You're testing the effects of different drug treatments compared to a placebo control. Dunnett's test allows you to focus 
#    on comparisons with the control while maintaining the overall significance level.

# 5. Holm-Bonferroni Method:
#   - Use when: Similar to Bonferroni, but it's a step-down procedure that can be more powerful.
#   - Example: You're analyzing the effects of different exercise regimes on fitness levels across multiple age groups. The Holm-Bonferroni 
#    method can provide adjusted p-values for each comparison.

# 6. Fisher's Least Significant Difference (LSD):
#   - Use when: You have a small number of planned comparisons and are not concerned about controlling the overall Type I error rate.
#   - Example: You're comparing the effectiveness of three treatments on recovery time. After a significant overall F-test, Fisher's LSD 
#    can help you determine which pairs of treatments have significantly different effects.

# The choice of post-hoc test depends on your research question, the number of groups, the nature of your data, and whether you need to control
# for multiple comparisons. Post-hoc tests help you avoid making overly broad conclusions when you find a significant overall effect in your ANOVA 
# and allow you to pinpoint which specific group differences are significant.
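
# A sketch of how the Bonferroni, Sidak, and Holm-Bonferroni adjustments act on a set of raw pairwise p-values, written with plain 
# NumPy so the arithmetic is visible. The p-values are hypothetical and the variable names are illustrative.

```python
import numpy as np

# Hypothetical raw p-values from m pairwise comparisons (illustrative only)
p_raw = np.array([0.004, 0.020, 0.030, 0.300])
m = len(p_raw)

# Bonferroni: multiply every p-value by the number of comparisons (capped at 1)
bonferroni = np.minimum(p_raw * m, 1.0)

# Sidak: 1 - (1 - p)^m, exact when the comparisons are independent
sidak = 1.0 - (1.0 - p_raw) ** m

# Holm: sort ascending, multiply the k-th smallest (k = 0, 1, ...) by (m - k),
# then enforce monotonicity with a running maximum
order = np.argsort(p_raw)
holm_sorted = np.maximum.accumulate((m - np.arange(m)) * p_raw[order])
holm = np.empty(m)
holm[order] = np.minimum(holm_sorted, 1.0)

print("Raw:       ", p_raw)
print("Bonferroni:", bonferroni)
print("Sidak:     ", sidak)
print("Holm:      ", holm)
```

# Each adjusted p-value is then compared with the usual significance level (for example 0.05).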

In [4]:
# Question 9

# Answer 9 -

import numpy as np
from scipy.stats import f_oneway

# Weight loss data for each diet
diet_a = np.array([2.5, 3.0, 1.8, 2.2, 2.9, 2.7, 2.1, 2.5, 3.2, 3.5,
                   2.8, 2.6, 2.3, 2.0, 2.4, 2.6, 3.1, 2.9, 2.7, 3.2,
                   2.2, 2.6, 2.8, 2.4, 2.1, 2.9, 2.3, 2.7, 2.5, 2.6,
                   2.8, 2.2, 2.6, 2.3, 2.0, 2.7, 3.0, 2.5, 2.8, 2.1,
                   2.4, 2.9, 2.6, 2.2, 2.7, 2.3, 2.0, 2.5, 2.8, 2.6])

diet_b = np.array([3.8, 3.5, 3.2, 3.7, 3.1, 3.4, 3.9, 3.6, 3.3, 3.1,
                   3.7, 3.5, 3.2, 3.6, 3.8, 3.4, 3.9, 3.2, 3.5, 3.7,
                   3.1, 3.6, 3.4, 3.2, 3.9, 3.3, 3.5, 3.7, 3.8, 3.4,
                   3.6, 3.9, 3.5, 3.2, 3.7, 3.4, 3.1, 3.8, 3.6, 3.2,
                   3.5, 3.9, 3.7, 3.4, 3.6, 3.2, 3.8, 3.5, 3.7, 3.9])

diet_c = np.array([1.2, 1.5, 1.0, 1.4, 1.3, 1.1, 1.6, 1.7, 1.4, 1.8,
                   1.2, 1.3, 1.5, 1.1, 1.6, 1.4, 1.3, 1.7, 1.2, 1.5,
                   1.6, 1.0, 1.3, 1.7, 1.4, 1.1, 1.6, 1.2, 1.8, 1.5,
                   1.4, 1.3, 1.7, 1.2, 1.1, 1.5, 1.4, 1.8, 1.6, 1.3,
                   1.2, 1.7, 1.4, 1.5, 1.1, 1.6, 1.3, 1.8, 1.4, 1.2])

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(diet_a, diet_b, diet_c)

print("F-Statistic:", f_statistic)
print("P-Value:", p_value)

# Interpret the results
if p_value < 0.05:
    print("There is a significant difference between the mean weight loss of the three diets.")
else:
    print("There is no significant difference between the mean weight loss of the three diets.")

F-Statistic: 691.1105757931851
P-Value: 1.7348315835560967e-75
There is a significant difference between the mean weight loss of the three diets.


In [63]:
# Question 10

# Answer 10 -

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

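# Generate sample data (times are drawn from a single distribution, independently of both
# factors, so any "significant" effect in this example is a chance finding)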
np.random.seed(0)
n = 30
software = np.repeat(['Program A', 'Program B', 'Program C'], n)
experience = np.tile(['Novice', 'Experienced'], 45)
time = np.random.normal(loc=10, scale=2, size=n * 3)


# Create a DataFrame
data = pd.DataFrame({'Software': software, 'Experience': experience, 'Time': time})

# Fit the two-way ANOVA model
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Report the F-statistics and p-values
print(anova_table)

# Interpret the results
alpha = 0.05
significant_effects = anova_table[anova_table['PR(>F)'] < alpha]

if significant_effects.empty:
    print("No significant effects or interactions were found.")
else:
    print("At least one factor or interaction effect is significant.")
    for term, row in significant_effects.iterrows():
        # Terms containing ':' are interactions; all others are main effects
        effect_type = "Interaction effect" if ':' in term else "Main effect"
        print(f"{effect_type} of {term} is significant (p = {row['PR(>F)']:.4f}).")


                               sum_sq    df         F    PR(>F)
C(Software)                 35.723737   2.0  4.579013  0.012956
C(Experience)                0.010866   1.0  0.002786  0.958034
C(Software):C(Experience)   17.753188   2.0  2.275576  0.109037
Residual                   327.668211  84.0       NaN       NaN
At least one factor or interaction effect is significant.
Main effect of C(Software) is significant (p = 0.0130).


In [3]:
# Question 11

# Answer 11 -

import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate sample data
np.random.seed(0)
control_group = np.random.normal(loc=70, scale=10, size=50)  # Control group test scores
experimental_group = np.random.normal(loc=75, scale=10, size=50)  # Experimental group test scores

# Perform a two-sample t-test
t_statistic, p_value = ttest_ind(control_group, experimental_group)

print("Two-sample t-test results:")
print("T-Statistic:", t_statistic)
print("P-Value:", p_value)

# Interpret the t-test results
alpha = 0.05
if p_value < alpha:
    print("There is a significant difference in test scores between the two groups.")
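    # Perform post-hoc test (Tukey's HSD). With only two groups there is a single
    # pairwise comparison, so this repeats the comparison the t-test already made.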
    all_scores = np.concatenate([control_group, experimental_group])
    group_labels = np.array(['Control'] * len(control_group) + ['Experimental'] * len(experimental_group))
    tukey_result = pairwise_tukeyhsd(all_scores, group_labels, alpha=alpha)
    print(tukey_result)
else:
    print("There is no significant difference in test scores between the two groups.")


Two-sample t-test results:
T-Statistic: -1.6677351961320235
P-Value: 0.09856078338184605
There is no significant difference in test scores between the two groups.
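
# A short sketch relating this t-test to ANOVA, using the same seeded data: with exactly two groups, a one-way ANOVA yields 
# F equal to the square of the pooled-variance t-statistic and the same p-value, which is why a post-hoc test adds little here.

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

# Same seeded sample data as in the cell above
np.random.seed(0)
control_group = np.random.normal(loc=70, scale=10, size=50)
experimental_group = np.random.normal(loc=75, scale=10, size=50)

# Pooled-variance two-sample t-test (the default, equal_var=True)
t_statistic, t_p_value = ttest_ind(control_group, experimental_group)

# One-way ANOVA on the same two groups
f_statistic, f_p_value = f_oneway(control_group, experimental_group)

print("t squared:", t_statistic ** 2)
print("F:        ", f_statistic)
print("t-test p: ", t_p_value)
print("ANOVA p:  ", f_p_value)
```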


In [4]:
# Question 12

# Answer 12 -

import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate sample data
np.random.seed(0)
n_days = 30
store_a_sales = np.random.normal(loc=1000, scale=200, size=n_days)  # Store A daily sales
store_b_sales = np.random.normal(loc=1100, scale=180, size=n_days)  # Store B daily sales
store_c_sales = np.random.normal(loc=1050, scale=190, size=n_days)  # Store C daily sales

# Combine the sales data
all_sales = np.concatenate([store_a_sales, store_b_sales, store_c_sales])
store_labels = np.array(['Store A'] * n_days + ['Store B'] * n_days + ['Store C'] * n_days)

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(store_a_sales, store_b_sales, store_c_sales)

print("One-way ANOVA results:")
print("F-Statistic:", f_statistic)
print("P-Value:", p_value)

# Interpret the ANOVA results
alpha = 0.05
if p_value < alpha:
    print("There is a significant difference in average daily sales between the three stores.")
    # Perform post-hoc test (Tukey's HSD)
    tukey_result = pairwise_tukeyhsd(all_sales, store_labels, alpha=alpha)
    print(tukey_result)
else:
    print("There is no significant difference in average daily sales between the three stores.")

One-way ANOVA results:
F-Statistic: 0.8647116816086053
P-Value: 0.4247606893565754
There is no significant difference in average daily sales between the three stores.
