In [1]:
#Ans 01:

In [2]:
# Analysis of Variance (ANOVA) is a statistical method used to compare means among multiple groups. However, for
# ANOVA results to be valid, several assumptions must be met. Violations of these assumptions can lead to inaccurate
# conclusions. The key assumptions of ANOVA include:

# Normality: The dependent variable should be approximately normally distributed within each group. This assumption is more
# critical when sample sizes are small. Violations may lead to inflated Type I error rates.
# Example of Violation: If the data in one or more groups significantly deviate from normality, it may affect the ANOVA results.
# For instance, if a group has a highly skewed or heavy-tailed distribution, it could impact the overall analysis.

# Homogeneity of Variances (Homoscedasticity): The variances of the dependent variable should be roughly equal across all groups.
# This assumption is important because ANOVA is sensitive to unequal variances, and violating this assumption may lead to
# inaccurate p-values and confidence intervals.
# Example of Violation: If the variance in one group is much larger or smaller than in the other groups, it can affect the
# overall ANOVA results. This is often observed in situations where there are outliers or when the groups have different levels of
# variability.

# Independence of Observations: Observations within and between groups should be independent. Independence is crucial to ensure
# that the variability observed in the dependent variable is due to differences between groups and not influenced by dependencies
# among observations.
# Example of Violation: If there is dependence among observations (e.g., repeated measures or nested designs), it can lead to
# pseudoreplication and impact the validity of ANOVA results.

# Random Sampling: Data should ideally be collected through random sampling. This assumption is necessary for making inferences
# about a population based on the sample data.
# Example of Violation: If the sampling process is not random, it may introduce bias into the sample, making it less representative
# of the population. This can affect the generalizability of the ANOVA results.

# Additivity: The effects of different factors should be additive, meaning that the total effect of two or more factors on the
# dependent variable is the sum of their individual effects.
# Example of Violation: If there are interactions between factors (i.e., the effects of factors are not purely additive), it can
# complicate the interpretation of main effects and may require more sophisticated statistical techniques.

# When these assumptions are violated, alternative methods or transformations may be considered, or non-parametric alternatives
# like the Kruskal-Wallis test could be used. Additionally, diagnostic tests and graphical methods can help assess the validity of
# assumptions before relying on ANOVA results.
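
# The diagnostic checks mentioned above can be sketched as follows. This is a minimal illustration, not a complete
# diagnostic workflow: the three groups are hypothetical simulated data, and the choice of Shapiro-Wilk for normality and
# Levene's test for equal variances is an assumption (other tests and graphical checks such as Q-Q plots are equally valid).

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups simulated from normal distributions (illustrative only)
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=1.0, size=30) for mu in (5.0, 5.5, 6.0)]

# Normality within each group: Shapiro-Wilk (H0: the sample comes from a normal distribution)
shapiro_p = []
for g in groups:
    _, p = stats.shapiro(g)
    shapiro_p.append(p)

# Homogeneity of variances: Levene's test (H0: all groups have equal variances)
levene_stat, levene_p = stats.levene(*groups)

# Non-parametric alternative if the assumptions look doubtful: Kruskal-Wallis test
kw_stat, kw_p = stats.kruskal(*groups)

print("Shapiro-Wilk p-values per group:", shapiro_p)
print("Levene p-value:", levene_p)
print("Kruskal-Wallis p-value:", kw_p)
```

# Small p-values from Shapiro-Wilk or Levene's test indicate that the corresponding assumption may be violated.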

In [3]:
#########################################################################
#Ans 02:

In [4]:
# Analysis of Variance (ANOVA) can be classified into three main types based on the experimental design and the
# number of factors involved. These three types are:


# One-Way ANOVA (One-Factor ANOVA):

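# Situation: Used when there is only one independent variable (factor) with two or more levels or groups. With exactly two
# levels it is equivalent to an independent two-sample t-test, so it is typically used for three or more groups.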
# Example: Suppose you want to compare the mean scores of three different teaching methods (A, B, C) to determine if there
# is a significant difference in student performance.


# Two-Way ANOVA:

# Situation: Used when there are two independent variables (factors) that are crossed (meaning each level of one factor is
# combined with each level of the other).
# Example: You might use a two-way ANOVA to study the effects of both the type of diet (Factor 1: Low-fat, High-fat) and the
# type of exercise (Factor 2: Cardio, Weight-lifting) on weight loss.

# Repeated Measures ANOVA:

# Situation: Used when measurements are taken on the same subjects or at multiple time points, resulting in repeated observations.
# Example: Suppose you measure the blood pressure of the same group of individuals before and after three different treatments.
# Repeated Measures ANOVA would be appropriate to assess whether there are significant differences among the treatments while
# accounting for the repeated nature of the measurements within individuals.


# Each type of ANOVA is suited to different experimental designs and research questions. The choice between them depends on the
# structure of the data and the nature of the independent variables. It's important to correctly select the appropriate ANOVA
# type to ensure the validity of the statistical analysis and the interpretation of results. Additionally, post-hoc tests or
# pairwise comparisons may be applied after ANOVA to identify specific group differences
# if the overall test indicates significance.
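
# To illustrate how a repeated measures design differs from a one-way design, here is a minimal hand-computed sketch.
# The scores are hypothetical (4 subjects measured under 3 conditions). It assumes a balanced design with no missing
# values and does not check sphericity. The key step is that between-subject variability is removed from the error term.

```python
import numpy as np
from scipy.stats import f

# Hypothetical scores: rows are subjects, columns are three repeated conditions
scores = np.array([
    [10, 12, 14],
    [8, 11, 13],
    [9, 10, 15],
    [11, 13, 14],
], dtype=float)
n_subjects, n_conditions = scores.shape
grand_mean = scores.mean()

# Partition the total variability
ss_total = np.sum((scores - grand_mean) ** 2)
ss_conditions = n_subjects * np.sum((scores.mean(axis=0) - grand_mean) ** 2)
ss_subjects = n_conditions * np.sum((scores.mean(axis=1) - grand_mean) ** 2)
ss_error = ss_total - ss_conditions - ss_subjects  # what remains after removing subject effects

# Degrees of freedom for the condition effect and the error term
df_conditions = n_conditions - 1
df_error = (n_subjects - 1) * (n_conditions - 1)

# F-statistic and upper-tail p-value
f_stat = (ss_conditions / df_conditions) / (ss_error / df_error)
p_value = f.sf(f_stat, df_conditions, df_error)

print("F-statistic:", f_stat)
print("p-value:", p_value)
```

# A one-way ANOVA on the same numbers would leave the subject variability inside the error term and give a smaller F.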

In [5]:
#########################################################################
#Ans 03:

In [6]:
# The partitioning of variance in Analysis of Variance (ANOVA) refers to the division of the total variance observed
# in the dependent variable into different components that can be attributed to specific sources or factors. Understanding
# this concept is crucial because it allows researchers to quantify the relative contributions of various factors to the
# overall variability in the data. The total variance is decomposed into several components, which are:

# Between-Group Sum of Squares (SSB): This component represents the variability in the dependent variable that can be
# attributed to differences between the group means. It is a measure of the overall group effect.

# Within-Group Sum of Squares (SSW): This component represents the variability within each group or condition. It reflects the
# random variability or error in the data that cannot be explained by the group means.

# Total Sum of Squares (SST): This is the overall variability in the dependent variable, and it is the sum of the between-group
# and within-group sums of squares. Mathematically, SST = SSB + SSW. Dividing a sum of squares by its degrees of freedom gives
# the corresponding mean square (a variance estimate); the sums of squares add up, but the mean squares do not.


# The partitioning of variance is typically presented in the form of an ANOVA table, which summarizes the sources of variability,
# degrees of freedom, sum of squares, mean squares, and F-ratios.


# Understanding the partitioning of variance is important for several reasons:

# Assessing Group Differences: By examining the between-group variance, researchers can determine whether there are significant
# differences among the group means. This is the primary goal of ANOVA.

# Quantifying the Impact of Factors: The partitioning allows researchers to quantify the proportion of total variance that can
# be explained by the independent variable(s). This helps in understanding the relative importance of different factors in
# influencing the dependent variable.

# Interpretation of F-ratio: The F-ratio is the between-group mean square, SSB / (k - 1), divided by the within-group mean
# square, SSW / (N - k), for k groups and N observations in total. It measures how much the group means differ relative to
# the random variability within groups.

# Basis for Post-hoc Tests: If the overall ANOVA is significant, post-hoc tests or pairwise comparisons are often conducted to
# identify which specific groups differ from each other. The partitioning of variance helps guide these additional analyses.

# In summary, the partitioning of variance in ANOVA is essential for a comprehensive understanding of the sources of variability
# in the data, the effects of independent variables, and the overall significance of the observed group differences.
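
# The partition can be checked numerically. The following is a minimal sketch on hypothetical groups of unequal size. It
# computes the sums of squares, the mean squares, the F-ratio, and eta squared (SSB / SST), which is one common way to
# express the proportion of total variability explained by group membership.

```python
import numpy as np

# Hypothetical groups of unequal size (illustrative values only)
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 8.0, 9.0, 10.0]),
    np.array([1.0, 2.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Sums of squares
ssb = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)  # between groups
ssw = sum(np.sum((g - g.mean()) ** 2) for g in groups)            # within groups
sst = np.sum((all_values - grand_mean) ** 2)                      # total

# Mean squares, F-ratio, and proportion of variance explained
msb = ssb / (k - 1)
msw = ssw / (n_total - k)
f_ratio = msb / msw
eta_squared = ssb / sst

print("SSB + SSW:", ssb + ssw, "SST:", sst)
print("F-ratio:", f_ratio)
print("Eta squared:", eta_squared)
```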

In [7]:
#########################################################################
#Ans 04:

In [8]:
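# To calculate the Total Sum of Squares (SST), Explained Sum of Squares (SSE), and Residual Sum of Squares (SSR) in
# a one-way ANOVA using Python, you can use libraries such as NumPy or SciPy. Here SSE is the between-group sum of squares
# and SSR is the within-group sum of squares (some textbooks use these abbreviations the other way round).
# Here's an example using NumPy: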

In [9]:
import numpy as np

# Sample data for each group
group1 = np.array([4, 6, 8, 10, 12])
group2 = np.array([3, 5, 7, 9, 11])
group3 = np.array([2, 4, 6, 8, 10])

# Combine the data into a single array
data = np.concatenate([group1, group2, group3])

# Calculate overall mean
overall_mean = np.mean(data)

# Calculate Total Sum of Squares (SST)
sst = np.sum((data - overall_mean)**2)

# Calculate group means
group1_mean = np.mean(group1)
group2_mean = np.mean(group2)
group3_mean = np.mean(group3)

# Calculate Explained Sum of Squares (SSE)
sse = len(group1) * (group1_mean - overall_mean)**2 + \
      len(group2) * (group2_mean - overall_mean)**2 + \
      len(group3) * (group3_mean - overall_mean)**2

# Calculate Residual Sum of Squares (SSR)
ssr = np.sum((group1 - group1_mean)**2) + \
      np.sum((group2 - group2_mean)**2) + \
      np.sum((group3 - group3_mean)**2)

# Print the results
print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)

Total Sum of Squares (SST): 130.0
Explained Sum of Squares (SSE): 10.0
Residual Sum of Squares (SSR): 120.0


In [10]:
# In this example, group1, group2, and group3 represent the data for three groups. The SST is calculated by summing
# the squared differences between each data point and the overall mean. The SSE is calculated by summing the squared
# differences between each group mean and the overall mean, weighted by the number of observations in each group.
# The SSR is calculated by summing the squared differences between each data point and its own group mean. As a check,
# SST = SSE + SSR (130 = 10 + 120).

# Note: This is a simplified example. In practice, you might use SciPy's stats.f_oneway function to perform a one-way
# ANOVA. It returns only the F-statistic and the p-value, not the individual sums of squares.
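
# As a cross-check, the F-statistic can be rebuilt from the SSE and SSR values printed above and compared with
# scipy.stats.f_oneway. This is a minimal sketch that reuses the same three groups and assumes the sums of squares are
# exactly the printed values (SSE = 10.0, SSR = 120.0).

```python
import numpy as np
from scipy.stats import f, f_oneway

group1 = np.array([4, 6, 8, 10, 12])
group2 = np.array([3, 5, 7, 9, 11])
group3 = np.array([2, 4, 6, 8, 10])

# Sums of squares taken from the output above
sse = 10.0   # explained (between-group)
ssr = 120.0  # residual (within-group)

# Degrees of freedom: k - 1 between groups, N - k within groups
k = 3
n_total = group1.size + group2.size + group3.size
df_between = k - 1
df_within = n_total - k

# F-statistic and upper-tail p-value computed by hand
f_manual = (sse / df_between) / (ssr / df_within)
p_manual = f.sf(f_manual, df_between, df_within)

# The same test using SciPy
f_scipy, p_scipy = f_oneway(group1, group2, group3)

print("Manual F:", f_manual, "SciPy F:", f_scipy)
print("Manual p:", p_manual, "SciPy p:", p_scipy)
```

# The F-statistic here is well below 1, so these three sample groups show no evidence of different means.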

In [11]:
#########################################################################
#Ans 05:

In [12]:
import numpy as np
import pandas as pd
from scipy.stats import f

# Sample data
data = pd.DataFrame({
    'Factor1': np.repeat(['A', 'B', 'C'], 4),
    'Factor2': np.tile(['X', 'Y'], 6),
    'Value': np.random.randn(12)
})

# Calculate means
overall_mean = data['Value'].mean()
mean_factor1 = data.groupby('Factor1')['Value'].mean()
mean_factor2 = data.groupby('Factor2')['Value'].mean()
mean_interaction = data.groupby(['Factor1', 'Factor2'])['Value'].mean()

# Degrees of freedom
df_total = len(data) - 1
df_factor1 = len(mean_factor1) - 1
df_factor2 = len(mean_factor2) - 1
df_interaction = (len(mean_factor1) - 1) * (len(mean_factor2) - 1)
df_residual = df_total - (df_factor1 + df_factor2 + df_interaction)

# Sum of squares (these formulas assume a balanced design with equal cell sizes)
sst = np.sum((data['Value'] - overall_mean)**2)
ss_factor1 = np.sum((mean_factor1 - overall_mean)**2 * data.groupby('Factor1').size())
ss_factor2 = np.sum((mean_factor2 - overall_mean)**2 * data.groupby('Factor2').size())
# Variation among the Factor1 x Factor2 cell means (contains both main effects plus the interaction)
ss_cells = np.sum((mean_interaction - overall_mean)**2 * data.groupby(['Factor1', 'Factor2']).size())
# Interaction: what is left of the cell variation after removing the two main effects
ss_interaction = ss_cells - ss_factor1 - ss_factor2
# Residual: variation within cells
ss_residual = sst - ss_cells

# Mean squares
ms_factor1 = ss_factor1 / df_factor1
ms_factor2 = ss_factor2 / df_factor2
ms_interaction = ss_interaction / df_interaction
ms_residual = ss_residual / df_residual

# F-statistics
f_factor1 = ms_factor1 / ms_residual
f_factor2 = ms_factor2 / ms_residual
f_interaction = ms_interaction / ms_residual

# p-values (upper-tail probability of the F distribution)
p_factor1 = f.sf(f_factor1, df_factor1, df_residual)
p_factor2 = f.sf(f_factor2, df_factor2, df_residual)
p_interaction = f.sf(f_interaction, df_interaction, df_residual)

# Print results (no random seed is set above, so the printed values differ from run to run)
print("Main Effect of Factor 1: F =", f_factor1, ", p =", p_factor1)
print("Main Effect of Factor 2: F =", f_factor2, ", p =", p_factor2)
print("Interaction Effect: F =", f_interaction, ", p =", p_interaction)


In [13]:
#########################################################################
#Ans 06:

In [14]:
# In a one-way ANOVA, the F-statistic is used to test the null hypothesis that the means of the groups are equal.
# A small p-value suggests that you can reject the null hypothesis. In your case, with an F-statistic of 5.23 and
# a p-value of 0.02, the p-value is less than the commonly used significance level of 0.05.

# Here's how you can interpret these results:

# Null Hypothesis (H0): The means of the groups are equal.
# Alternative Hypothesis (H1): At least one group mean is different from the others.

# Since the p-value (0.02) is less than the significance level (commonly chosen as 0.05), you would reject the null hypothesis.
# This suggests that there is enough evidence to conclude that there are significant differences between the groups.

# In practical terms, you can say that there are statistically significant variations in the means of the groups you are
# comparing. However, the ANOVA itself does not tell you which specific groups are different from each other. If you reject the
# null hypothesis, it is common practice to conduct post-hoc tests or pairwise comparisons to identify which groups differ.

# In summary, based on the results of your one-way ANOVA:

# Conclusion: There are statistically significant differences between the groups.
# Next Steps: Conduct post-hoc tests to determine which specific groups are different from each other.

# Keep in mind that statistical significance doesn't necessarily imply practical significance, and it's crucial to consider the
# context of the study and the magnitude of the differences in addition to the statistical results.
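
# The decision rule can also be expressed in code. This is only a sketch: the degrees of freedom are not given above, so
# the values used here (3 groups, 30 observations) are hypothetical, and the resulting p-value need not equal the 0.02
# quoted in the question.

```python
from scipy.stats import f

f_statistic = 5.23  # F-statistic quoted above
alpha = 0.05        # significance level

# Hypothetical degrees of freedom: 3 groups and 30 observations in total
df_between = 3 - 1
df_within = 30 - 3

# Upper-tail p-value and the critical value of the F distribution at alpha
p_value = f.sf(f_statistic, df_between, df_within)
critical_value = f.ppf(1 - alpha, df_between, df_within)

# Rejecting when F exceeds the critical value is equivalent to rejecting when p < alpha
reject_null = f_statistic > critical_value

print("p-value:", p_value)
print("Critical value:", critical_value)
print("Reject H0:", reject_null)
```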

In [15]:
#########################################################################
#Ans 07:

In [16]:
# Dealing with missing data in repeated measures ANOVA is important for obtaining valid and reliable results. There
# are various methods to handle missing data, each with its own assumptions and potential consequences. Here are some
# common approaches:

    
# Complete Case Analysis (Listwise Deletion):
# Method: Exclude participants with missing data on any variable involved in the analysis.
# Consequences: This method can lead to a loss of statistical power and to biased results unless the data are missing
# completely at random (MCAR). It may also introduce selection bias if the missingness is related to the variables
# under study.

# Pairwise Deletion:
# Method: Analyze all available data for each pair of variables, excluding only cases with missing data on the specific
# variables being analyzed.
# Consequences: While it utilizes more data than listwise deletion, it can lead to biased estimates if the missing data is
# related to the variables being analyzed. The results may be inconsistent across different pairwise comparisons.

# Imputation Techniques:
# Method: Impute missing values based on observed values and/or other variables.
# Consequences: Imputation methods, such as mean imputation, regression imputation, or multiple imputation, can introduce bias if
# the assumptions underlying the imputation model are violated. However, they can provide more reliable estimates than complete
# case analysis when the missing data is related to other observed variables.

# Last Observation Carried Forward (LOCF):
# Method: Impute missing values using the last observed value for each participant.
# Consequences: LOCF assumes that the missing values remain constant over time, which may not be the case. This method may lead
# to biased results, especially if there is a trend or pattern in the missing data.

# Interpolation or Extrapolation:
# Method: Estimate missing values based on patterns or trends observed in the available data.
# Consequences: This method assumes that the missing values follow a specific pattern, which may not be accurate. Extrapolation,
# in particular, can be risky if it extends beyond the observed range of data.

# Maximum Likelihood Estimation (MLE):
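# Method: Estimates the model parameters by maximizing the likelihood of the observed data, using every available
# observation from each participant without deleting cases or filling in values (for example, a linear mixed-effects model).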
# Consequences: MLE provides unbiased estimates under the assumption that the missing data is missing at random (MAR). However,
# the MAR assumption is critical for the validity of this method.


# When choosing a method, researchers should carefully consider the nature of the missing data and the assumptions of the imputation
# technique. Multiple imputation is often recommended when dealing with missing data as it accounts for uncertainty related to
# imputation. However, the choice of method depends on the specific characteristics of the dataset and the research context.
# It's also essential to report any methods used to handle missing data and conduct sensitivity analyses to assess the robustness
# of the results to different missing data handling strategies.
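
# Three of the simpler strategies above can be sketched with pandas. The data frame is hypothetical (one row per
# participant, one column per time point), and the sketch only shows the mechanics of each approach. It does not show
# which approach is appropriate for a given missingness mechanism.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: rows are participants, columns are time points
wide = pd.DataFrame(
    {
        "t1": [5.0, 6.0, 7.0, 8.0],
        "t2": [5.5, np.nan, 7.5, 8.5],
        "t3": [6.0, 6.5, np.nan, 9.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Complete case analysis (listwise deletion): drop participants with any missing value
complete_cases = wide.dropna()

# Mean imputation: replace each missing value with the mean of its time point
mean_imputed = wide.fillna(wide.mean())

# LOCF: carry each participant's last observed value forward across time points
locf = wide.ffill(axis=1)

print("Participants kept after listwise deletion:", list(complete_cases.index))
print(mean_imputed)
print(locf)
```

# Listwise deletion keeps only s1 and s4 here, which shows how quickly the sample can shrink in a repeated measures design.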

In [17]:
#########################################################################
#Ans 08:

In [18]:
# Post-hoc tests are used after ANOVA to make detailed comparisons between group means when the overall ANOVA
# result indicates that there are significant differences among groups. Some common post-hoc tests include:

# Tukey's Honestly Significant Difference (HSD):
# When to Use: Use Tukey's HSD when you have conducted a one-way ANOVA and want to compare all possible pairs of group means.
# It controls the familywise error rate.
# Example: You conducted a one-way ANOVA to compare the performance of three different teaching methods, and the ANOVA result
# is significant. Tukey's HSD can be used to identify which specific teaching methods differ significantly in terms of student
# performance.

# Bonferroni Correction:
# When to Use: Bonferroni is a conservative correction method. Use it when making multiple pairwise comparisons in a one-way
# ANOVA or in situations where you want to control the overall type I error rate.
# Example: You have conducted a one-way ANOVA to compare the mean scores of four different groups. The overall ANOVA result is
# significant, and you want to perform pairwise comparisons while controlling the familywise error rate. Bonferroni correction
# can be applied in this case.

# Sidak Correction:
# When to Use: Similar to Bonferroni, Sidak correction is used for multiple comparisons after ANOVA. It is less conservative than
# Bonferroni but still controls the familywise error rate.
# Example: After conducting a two-way ANOVA to analyze the effects of two factors on a dependent variable, you want to perform
# multiple pairwise comparisons. Sidak correction can be applied to control the overall type I error rate.

# Dunnett's Test:
# When to Use: Use Dunnett's test when you have a control group, and you want to compare other groups against the control group.
# Example: In a medical study, you have a control group receiving a placebo and several treatment groups. After conducting an ANOVA,
# you can use Dunnett's test to compare each treatment group with the control group.

# Holm's Method:
# When to Use: Holm's method is a step-down procedure that controls the familywise error rate. It is less conservative than
# Bonferroni but still provides strong control.
# Example: In a factorial ANOVA with multiple factors, you may want to perform post-hoc tests to compare specific groups while
# controlling the overall type I error rate. Holm's method can be applied for this purpose.

# Games-Howell Test:
# When to Use: Games-Howell is used when the assumption of equal variances is violated. It is a more robust alternative to Tukey's
# HSD in such cases.
# Example: In a one-way ANOVA where the assumption of homogeneity of variances is not met, you can use Games-Howell to perform
# post-hoc tests and compare group means.


# Example Situation:
# Suppose you conducted a one-way ANOVA to compare the effectiveness of four different training programs on employee productivity.
# The ANOVA indicates that there are significant differences among the groups. In this scenario, you might employ Tukey's HSD,
# Bonferroni, or Sidak correction for post-hoc tests to identify which specific training programs result in significantly different
# levels of productivity. This allows you to make detailed pairwise comparisons and draw more specific conclusions about the
# effectiveness of the training programs.
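
# To make the difference between the Bonferroni, Sidak, and Holm corrections concrete, here is a minimal sketch that
# adjusts a set of unadjusted p-values by hand. The three p-values are hypothetical and stand in for the results of three
# pairwise comparisons.

```python
import numpy as np

# Hypothetical unadjusted p-values from three pairwise comparisons
raw_p = np.array([0.01, 0.04, 0.03])
m = raw_p.size  # number of comparisons

# Bonferroni: multiply each p-value by the number of comparisons (capped at 1)
bonferroni = np.minimum(raw_p * m, 1.0)

# Sidak: 1 - (1 - p)^m, slightly less conservative than Bonferroni
sidak = 1.0 - (1.0 - raw_p) ** m

# Holm (step-down): sort the p-values, multiply the i-th smallest by (m - i),
# enforce monotonicity with a running maximum, then restore the original order
order = np.argsort(raw_p)
scaled = raw_p[order] * (m - np.arange(m))
holm_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
holm = np.empty(m)
holm[order] = holm_sorted

print("Bonferroni:", bonferroni)
print("Sidak:     ", sidak)
print("Holm:      ", holm)
```

# In this example Holm's adjusted p-values are never larger than Bonferroni's, which is why it is described as less
# conservative while still controlling the familywise error rate.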

In [20]:
#########################################################################
#Ans 09:

In [21]:
import numpy as np
from scipy.stats import f_oneway

# Generate sample data
np.random.seed(42)  # for reproducibility
diet_A = np.random.normal(5, 1, 50)  # mean weight loss of 5 with standard deviation 1
diet_B = np.random.normal(6, 1, 50)  # mean weight loss of 6 with standard deviation 1
diet_C = np.random.normal(7, 1, 50)  # mean weight loss of 7 with standard deviation 1


# Perform one-way ANOVA
f_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)

# Print results
print("F-statistic:", f_statistic)
print("p-value:", p_value)

# Interpret results
if p_value < 0.05:
    print("The one-way ANOVA result is statistically significant.")
    print("There is evidence to suggest that there are significant differences between the mean weight loss of the three diets.")
else:
    print("The one-way ANOVA result is not statistically significant.")
    print("There is no strong evidence to suggest differences between the mean weight loss of the three diets.")

F-statistic: 67.61854911979148
p-value: 1.5055246613126342e-21
The one-way ANOVA result is statistically significant.
There is evidence to suggest that there are significant differences between the mean weight loss of the three diets.


In [22]:
#########################################################################
#Ans 10:

In [23]:
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Generate sample data
np.random.seed(42)  # for reproducibility

# Create a DataFrame with columns 'Time', 'Program', and 'Experience'
data = pd.DataFrame({
    'Time': np.random.normal(loc=10, scale=2, size=90),
    'Program': np.repeat(['A', 'B', 'C'], 30),
    'Experience': np.tile(['Novice', 'Experienced'], 45)
})

# Perform two-way ANOVA
formula = 'Time ~ C(Program) + C(Experience) + C(Program):C(Experience)'
model = ols(formula, data).fit()
anova_results = anova_lm(model)

# Print results
print(anova_results)

# Interpret results
p_program = anova_results['PR(>F)']['C(Program)']
p_experience = anova_results['PR(>F)']['C(Experience)']
p_interaction = anova_results['PR(>F)']['C(Program):C(Experience)']

if p_program < 0.05:
    print("There is a significant main effect of software programs on task completion time.")
else:
    print("There is no significant main effect of software programs.")

if p_experience < 0.05:
    print("There is a significant main effect of experience level on task completion time.")
else:
    print("There is no significant main effect of experience level.")

if p_interaction < 0.05:
    print("There is a significant interaction effect between software programs and experience level.")
else:
    print("There is no significant interaction effect between software programs and experience level.")

                            df      sum_sq   mean_sq         F    PR(>F)
C(Program)                 2.0    2.514772  1.257386  0.344485  0.709581
C(Experience)              1.0    0.479063  0.479063  0.131248  0.718051
C(Program):C(Experience)   2.0    1.592393  0.796197  0.218133  0.804472
Residual                  84.0  306.603758  3.650045       NaN       NaN
There is no significant main effect of software programs.
There is no significant main effect of experience level.
There is no significant interaction effect between software programs and experience level.


In [24]:
#########################################################################
#Ans 11:

In [25]:
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate sample data
np.random.seed(42)  # for reproducibility
control_group = np.random.normal(loc=70, scale=10, size=50)
experimental_group = np.random.normal(loc=75, scale=10, size=50)

# Perform two-sample t-test
t_statistic, p_value = ttest_ind(control_group, experimental_group)

# Print results
print("Two-sample t-test results:")
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

# Interpret results
if p_value < 0.05:
    print("There is a significant difference in test scores between the control and experimental groups.")
    print("Proceed with post-hoc tests.")
else:
    print("There is no significant difference in test scores between the control and experimental groups.")

# Post-hoc test (Tukey's HSD) if the results are significant
if p_value < 0.05:
    # Combine data for post-hoc test
    all_data = np.concatenate([control_group, experimental_group])
    group_labels = np.concatenate([['Control'] * 50, ['Experimental'] * 50])

    # Perform Tukey's HSD post-hoc test
    tukey_results = pairwise_tukeyhsd(all_data, group_labels)

    # Print post-hoc results
    print("\nPost-hoc (Tukey's HSD) results:")
    print(tukey_results)

Two-sample t-test results:
T-statistic: -4.108723928204809
P-value: 8.261945608702611e-05
There is a significant difference in test scores between the control and experimental groups.
Proceed with post-hoc tests.

Post-hoc (Tukey's HSD) results:
   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj  lower   upper  reject
----------------------------------------------------------
Control Experimental   7.4325 0.0001 3.8427 11.0224   True
----------------------------------------------------------


In [26]:
# In this example, I generated random test scores for the control and experimental groups using normal
# distributions. Replace this data with your actual test score data.

# The two-sample t-test is performed using the ttest_ind function from scipy.stats. If the results are significant
# (p-value < 0.05), the code proceeds with a post-hoc test, in this case, Tukey's HSD, using the pairwise_tukeyhsd
# function from statsmodels.

# Remember to interpret the results cautiously, considering both statistical significance and practical significance.
# With only two groups, the post-hoc test adds no new information: Tukey's HSD makes the same single comparison as the
# t-test. Post-hoc tests become informative when three or more groups are compared.
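
# Practical significance can be gauged with an effect size. This is a minimal sketch that computes Cohen's d with a pooled
# standard deviation and also shows that, for two groups, a one-way ANOVA gives F = t^2 and the same p-value as the
# pooled-variance t-test. The data are simulated with a different random generator than above, so the numbers will not
# match the earlier output.

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

# Hypothetical simulated test scores (illustrative only)
rng = np.random.default_rng(42)
control = rng.normal(loc=70, scale=10, size=50)
experimental = rng.normal(loc=75, scale=10, size=50)

# Pooled-variance two-sample t-test and one-way ANOVA on the same two groups
t_stat, t_p = ttest_ind(control, experimental)
f_stat, f_p = f_oneway(control, experimental)

# Cohen's d: mean difference divided by the pooled standard deviation
n1, n2 = control.size, experimental.size
pooled_var = ((n1 - 1) * control.var(ddof=1) + (n2 - 1) * experimental.var(ddof=1)) / (n1 + n2 - 2)
cohens_d = (experimental.mean() - control.mean()) / np.sqrt(pooled_var)

print("t^2:", t_stat ** 2, "F:", f_stat)
print("t-test p:", t_p, "ANOVA p:", f_p)
print("Cohen's d:", cohens_d)
```

# By a common rule of thumb, d around 0.2 is considered small, 0.5 medium, and 0.8 large.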

In [27]:
#########################################################################
#Ans 12:

In [29]:
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate sample data
np.random.seed(42)  # for reproducibility
sales_store_A = np.random.normal(loc=100, scale=20, size=30)
sales_store_B = np.random.normal(loc=120, scale=20, size=30)
sales_store_C = np.random.normal(loc=110, scale=20, size=30)

# Combine data for one-way ANOVA
all_sales_data = np.concatenate([sales_store_A, sales_store_B, sales_store_C])
group_labels = np.concatenate([['Store A'] * 30, ['Store B'] * 30, ['Store C'] * 30])

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(sales_store_A, sales_store_B, sales_store_C)

# Print results
print("One-way ANOVA results:")
print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")

# Interpret results
if p_value < 0.05:
    print("There is a significant difference in sales between the three stores.")
    print("Proceed with post-hoc tests.")
else:
    print("There is no significant difference in sales between the three stores.")

# Post-hoc test (Tukey's HSD) if the results are significant
if p_value < 0.05:
    # Perform Tukey's HSD post-hoc test
    tukey_results = pairwise_tukeyhsd(all_sales_data, group_labels)

    # Print post-hoc results
    print("\nPost-hoc (Tukey's HSD) results:")
    print(tukey_results)

One-way ANOVA results:
F-statistic: 9.942655439164778
P-value: 0.00012916754728826364
There is a significant difference in sales between the three stores.
Proceed with post-hoc tests.

Post-hoc (Tukey's HSD) results:
  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
 group1  group2 meandiff p-adj   lower    upper  reject
-------------------------------------------------------
Store A Store B  21.3397 0.0001   9.7429 32.9365   True
Store A Store C  14.0206 0.0136   2.4238 25.6175   True
Store B Store C  -7.3191 0.2936 -18.9159  4.2778  False
-------------------------------------------------------


In [30]:
#########################################################################