# stats 6

In [5]:
import logging

logging.basicConfig(filename="13MarInfo.log", level=logging.INFO, format="%(asctime)s %(name)s %(message)s")

# answer 1
"""
ANOVA (Analysis of Variance) is a statistical method used to compare the means of two or more groups. The 
assumptions required to use ANOVA are as follows:

- Absence of extreme outliers: Outlying scores should be investigated and, where justified, removed
or corrected, because they can distort the group means and variances.

- Independence: The observations within each group must be independent of each other.

- Normality: The distribution of the data within each group must be approximately normal.

- Homogeneity of variance: The variance of the data within each group must be approximately equal.

Examples of violations that could impact the validity of ANOVA results are:

- Non-independence: If the observations are not independent, for example when the same subjects are
measured repeatedly, the error term of a standard ANOVA is no longer appropriate and its results may
be invalid. A repeated measures design should be analysed with a repeated measures ANOVA instead.

- Non-normality: If the distribution of the data within each group is far from normal, for example
heavily skewed or contaminated by outliers, ANOVA results may be unreliable, especially with small samples.

- Heterogeneity of variance: If the variances of the groups are clearly unequal, the F-test can be
biased. For example, if one group has a much larger variance than the others, 
ANOVA may incorrectly conclude that there are significant differences between the groups.
"""

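The normality and equal-variance assumptions listed in answer 1 can be checked before running an ANOVA. The following is a minimal sketch on made-up data (the group names, sizes and parameters are assumptions for illustration only), using the Shapiro-Wilk and Levene tests from `scipy.stats`.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups drawn from normal distributions with equal spread
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=10, scale=2, size=40),
    "B": rng.normal(loc=11, scale=2, size=40),
    "C": rng.normal(loc=12, scale=2, size=40),
}

# Normality within each group: Shapiro-Wilk test (H0: the sample is normally distributed)
shapiro_p = {name: stats.shapiro(values)[1] for name, values in groups.items()}

# Homogeneity of variance: Levene's test (H0: all groups have equal variance)
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test is a warning sign rather than a verdict; with large samples these tests flag even minor departures, so plots of the data are usually worth a look as well.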
# answer 2
"""
ANOVA (Analysis of Variance) is a statistical technique used to compare the means of two or more groups or treatments. There are several types of ANOVA, including one-way ANOVA, repeated measures ANOVA, and factorial ANOVA.

- **One-way ANOVA**:

One-way ANOVA compares the means of independent groups defined by a single factor, typically three or more (with only two groups it is equivalent to an independent-samples t-test). It tests whether at least one group mean differs significantly from the others.
**For example, if we wanted to compare the mean test scores of students who studied for 0 hours, 1 hour, 2 hours, or 3 hours, a one-way ANOVA could be used to determine if there is a statistically significant difference between the mean test scores of these groups.**

- **Repeated measures ANOVA**:

Repeated measures ANOVA is used when the same subjects are measured under several conditions or at several time points, such as in a longitudinal study. Because each subject acts as their own control, it accounts for the correlation between the repeated measures and removes between-subject variability from the error term, which usually gives a more powerful test than a series of independent t-tests and avoids the inflated Type I error rate that comes with running many of them.

- **Factorial ANOVA**:

Factorial ANOVA is used when we want to test the effects of two or more independent variables on a dependent variable. It tests whether there is a significant main effect of each independent variable and whether there is an interaction effect between the independent variables.
**For example, if we were interested in testing the effect of both gender and age on salary, a two-way factorial ANOVA could be used to determine if there is a significant main effect of gender and age, as well as an interaction effect between the two variables.**
"""

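To illustrate why accounting for repeated measurements matters, here is a small sketch that computes a one-way repeated measures F-statistic by hand and compares it with a between-subjects one-way ANOVA on the same numbers. The score matrix is invented for illustration, and the hand calculation assumes a complete, balanced design with a single within-subject factor.

```python
import numpy as np
from scipy import stats

# Hypothetical scores: rows are subjects, columns are three conditions
scores = np.array([
    [10.0, 12.0, 14.0],
    [20.0, 21.0, 25.0],
    [30.0, 33.0, 34.0],
])
n_subjects, n_conditions = scores.shape
grand_mean = scores.mean()

ss_total = ((scores - grand_mean) ** 2).sum()
ss_conditions = n_subjects * ((scores.mean(axis=0) - grand_mean) ** 2).sum()
ss_subjects = n_conditions * ((scores.mean(axis=1) - grand_mean) ** 2).sum()

# Repeated measures: differences between subjects are removed from the error term
ss_error_rm = ss_total - ss_conditions - ss_subjects
df_conditions = n_conditions - 1
df_error_rm = (n_conditions - 1) * (n_subjects - 1)
f_rm = (ss_conditions / df_conditions) / (ss_error_rm / df_error_rm)
p_rm = stats.f.sf(f_rm, df_conditions, df_error_rm)

# Between-subjects one-way ANOVA: differences between subjects stay in the error term
f_between, p_between = stats.f_oneway(*scores.T)

print(f"Repeated measures: F = {f_rm:.2f}, p = {p_rm:.4f}")
print(f"One-way (between): F = {f_between:.2f}, p = {p_between:.4f}")
```

In this toy data the subjects differ a lot from each other but respond to the conditions in a similar way, so the repeated measures F is much larger than the between-subjects F.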
In [6]:
# answer 3
"""
The partitioning of variance in ANOVA refers to the division of the total variability of a dependent variable
into different components, each of which is associated with a specific source of variation. 

In a one-way ANOVA, the total sum of squares is split into two components: the between-group sum of squares
(due to differences between the group means) and the within-group sum of squares (due to individual
differences within each group), so that SST = SSB + SSW. Dividing each component by its degrees of freedom
gives the mean squares, and the ratio of the between-group mean square to the within-group mean square is
the F-statistic used to test whether the group means differ.

Understanding the partitioning of variance in ANOVA is important because it allows researchers to identify 
the sources of variation that contribute to the differences between groups. This information can be used to
gain insights into the factors that are associated with the dependent variable and to design more effective
interventions or treatments. Additionally, partitioning of variance allows researchers to test the statistical
significance of the differences between groups, which is essential for drawing valid conclusions about the
relationships between variables.
"""


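The partition described in answer 3 can be verified numerically. The sketch below uses a tiny invented dataset (three groups of three values) to compute the total, between-group and within-group sums of squares, and then checks the resulting F-statistic against `scipy.stats.f_oneway`. The variable names `ssb` and `ssw` are labels chosen here for the between-group and within-group components.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data: three groups of three observations each
groups = [np.array([1.0, 2.0, 3.0]),
          np.array([2.0, 3.0, 4.0]),
          np.array([5.0, 6.0, 7.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

sst = ((all_values - grand_mean) ** 2).sum()                      # total
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # between groups
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)            # within groups

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_manual = (ssb / df_between) / (ssw / df_within)
f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"SST = {sst:.2f}, SSB = {ssb:.2f}, SSW = {ssw:.2f}")       # SST = 32.00, SSB = 26.00, SSW = 6.00
print(f"F (manual) = {f_manual:.2f}, F (scipy) = {f_scipy:.2f}")  # both 13.00
```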
In [7]:
# answer 4
import numpy as np
import pandas as pd
from scipy import stats

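# Generate random data: 90 observations randomly assigned to three groups
np.random.seed(123)
data = pd.DataFrame({'group': np.random.choice(['A', 'B', 'C'], size=90),
                     'value': np.random.normal(loc=10, scale=2, size=90)})

# Group means and the grand mean
group_means = data.groupby('group')['value'].mean()
grand_mean = data['value'].mean()

# SST: total sum of squares (each observation against the grand mean)
SST = ((data['value'] - grand_mean) ** 2).sum()
# SSE: explained (between-group) sum of squares, each group mean weighted by its group size
SSE = ((group_means - grand_mean) ** 2 * data['group'].value_counts()).sum()
# SSR: residual (within-group) sum of squares, obtained from SST = SSE + SSR
SSR = SST - SSE

print("SST:", SST)
print("SSE:", SSE)
print("SSR:", SSR)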

SST: 442.20027347393636
SSE: 29.999927882554044
SSR: 412.2003455913823


In [8]:
# answer 5
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

np.random.seed(123)
# Cross the two factors so that every combination of A and B is observed
df = pd.DataFrame({'A': np.repeat(['a', 'b'], 25),
                   'B': np.tile(['x', 'y'], 25),
                   'score': np.random.normal(0, 1, 50)})

# Fit a two-way model with both main effects and their interaction
model = ols('score ~ A + B + A:B', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Rows of the table are A, B, A:B and Residual, in that order
main_effects = anova_table[['sum_sq', 'df', 'F']].iloc[:2]
interaction_effect = anova_table[['sum_sq', 'df', 'F']].iloc[2]

print('Main Effects:')
print(main_effects)
print('\nInteraction Effect:')
print(interaction_effect)


In [9]:
# answer 6
"""
If we conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02, 
we can conclude that there is a significant difference between the means of the groups.

The F-statistic measures the ratio of the variance between groups to the variance within groups.
A large F-statistic indicates that the variance between groups is much greater than the variance
within groups, suggesting that there is a significant difference between the means of the groups.

The p-value of 0.02 is below the conventional significance level of 0.05, so we reject the null 
hypothesis of equal means and conclude that at least one group mean differs from the others.
The F-test on its own does not tell us which groups differ.

To interpret these results further, we can compare the means of the groups using post-hoc tests, such as 
Tukey's HSD (honestly significant difference) test, to determine which specific groups differ
significantly from each other. Additionally, we can report effect size measures, such as eta-squared
for the overall effect or Cohen's d for a pairwise comparison, to quantify the magnitude of the differences.
"""

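As a rough illustration of how the numbers in answer 6 relate to each other, the sketch below recovers a p-value and an eta-squared effect size from an F-statistic. The degrees of freedom are not given in the example, so the values used here (3 groups, 30 observations) are purely an assumption, and the resulting p-value will not necessarily equal the 0.02 quoted above.

```python
from scipy import stats

f_statistic = 5.23  # F-statistic from the example

# Assumed degrees of freedom (not given in the example): 3 groups, 30 observations
df_between = 2
df_within = 27

# p-value: upper-tail probability of the F distribution
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Eta-squared recovered from F: SSB / SST = F * df_between / (F * df_between + df_within)
eta_squared = (f_statistic * df_between) / (f_statistic * df_between + df_within)

print(f"p-value under the assumed degrees of freedom: {p_value:.4f}")
print(f"eta-squared: {eta_squared:.3f}")
```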
In [10]:
# answer 7
"""
Handling missing data in a repeated measures ANOVA can be challenging, as missing data can introduce
bias and reduce the power of the analysis. There are several methods to handle missing data in a repeated
measures ANOVA, and the choice of method can have different consequences depending on the type and amount
of missing data.

Here are some common methods to handle missing data in a repeated measures ANOVA:

Pairwise deletion: This method uses all available observations for each individual comparison, dropping
a case only from the calculations that involve its missing values. It keeps more data than listwise
deletion, but different comparisons are then based on different subsets of cases, and it can introduce
bias if the data are not missing completely at random.

Listwise deletion: This method involves excluding every case that has missing data on any of the variables
used in the analysis. It is simple to implement, but it reduces the sample size (and therefore power)
and can introduce bias if the data are not missing completely at random.

Imputation: This method involves estimating the missing values based on the observed values and using 
these estimates in the analysis. There are several types of imputation methods, such as mean imputation, 
regression imputation, and multiple imputation. Imputation preserves the sample size, but simple
approaches such as mean imputation understate the variability of the data, and any imputation can introduce
bias if the imputation model is misspecified or if the missing data are not missing at random.

Maximum likelihood estimation: This method involves estimating the model parameters using all 
available data, including the cases with missing data. While this method can be computationally intensive,
it can provide unbiased estimates of the model parameters under certain assumptions about the missing data
mechanism.
"""


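The difference between listwise deletion and mean imputation from answer 7 is easy to see on a small wide-format table. This is a minimal sketch with invented data (subject IDs, time-point names and values are all hypothetical); it only prepares the data and does not run the ANOVA itself.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame({
    "t1": [5.0, 6.0, 7.0, 8.0],
    "t2": [6.0, np.nan, 8.0, 9.0],
    "t3": [7.0, 8.0, np.nan, 10.0],
}, index=["s1", "s2", "s3", "s4"])

# Listwise deletion: keep only subjects with complete data
listwise = wide.dropna()

# Mean imputation: replace each missing value with the mean of its time point
mean_imputed = wide.fillna(wide.mean())

print("Subjects kept by listwise deletion:", list(listwise.index))
print("Column variances, observed values only:")
print(wide.var())
print("Column variances after mean imputation:")
print(mean_imputed.var())
```

Listwise deletion halves the sample here, while mean imputation keeps every subject but shrinks the variance of the imputed columns, which is one reason multiple imputation or likelihood-based methods are generally preferred.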
In [11]:
# answer 8
"""
Post-hoc tests are used after an ANOVA to determine which specific groups differ significantly
from each other. Here are some common post-hoc tests used after ANOVA, along with when and why
we would use them:

Tukey's HSD (Honestly Significant Difference) test: This test is used when we have conducted an
ANOVA with more than two groups and want to compare all possible pairwise differences between the groups.
It controls the family-wise error rate across all pairwise comparisons.

Bonferroni correction: This is used when we have conducted multiple pairwise comparisons and want to 
control the overall Type I error rate. It is simple and general, but with many comparisons it is more
conservative than Tukey's test and therefore less powerful.

Scheffe's test: This test controls the family-wise error rate for all possible contrasts among the group
means, not only pairwise differences. It is the most flexible choice for complex comparisons, but for
purely pairwise comparisons it is more conservative, and so less powerful, than Tukey's test.

Dunnett's test: This test is used when we have one control group and want to compare it with each of 
the other groups. It controls the overall Type I error rate while allowing for multiple comparisons 
with a single control group.

Games-Howell test: This test is used when the assumption of equal variances is violated, and we want 
to compare all possible pairwise differences between the groups. It does not assume equal variances
or equal sample sizes, so it is preferred over Tukey's test in that situation.

Kruskal-Wallis test: This is not a post-hoc test but a non-parametric alternative to ANOVA, used when the
assumption of normality is violated. It tests whether the groups come from the same distribution (often
interpreted as a difference in medians) and can be followed by post-hoc tests such as Dunn's test.

An example situation where a post-hoc test might be necessary is when we conduct an ANOVA to compare
the mean scores of three different teaching methods in a classroom. If the ANOVA shows a significant 
difference between the means, we might want to use Tukey's HSD test to determine which specific teaching
methods are significantly different from each other. This would help us to identify which teaching method 
is most effective and can inform future teaching strategies.
"""

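The Bonferroni correction from answer 8 is simple enough to apply by hand. The sketch below runs all pairwise t-tests between three groups and multiplies each raw p-value by the number of comparisons. The teaching-method names and the simulated scores are assumptions made up for this illustration.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical scores for three teaching methods
rng = np.random.default_rng(42)
methods = {
    "lecture": rng.normal(70, 8, size=30),
    "seminar": rng.normal(75, 8, size=30),
    "online": rng.normal(80, 8, size=30),
}

pairs = list(combinations(methods, 2))
alpha = 0.05

results = {}
for first, second in pairs:
    t_stat, p_raw = stats.ttest_ind(methods[first], methods[second])
    # Bonferroni: multiply each raw p-value by the number of comparisons, capped at 1
    p_adj = min(p_raw * len(pairs), 1.0)
    results[(first, second)] = (p_raw, p_adj)
    print(f"{first} vs {second}: raw p = {p_raw:.4f}, "
          f"adjusted p = {p_adj:.4f}, reject = {p_adj < alpha}")
```

Comparing the adjusted p-value with alpha is equivalent to comparing the raw p-value with alpha divided by the number of comparisons.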
In [12]:
# answer 9
import numpy as np
from scipy.stats import f_oneway

h0 = "All group means are equal"
ha = "At least one group mean differs from the others"

alpha=0.05

# Generate random weight loss data for each diet.
# np.random.seed(1234) fixes the seed of the random number generator, so the same random values are
# generated each time the code is run.
# np.random.normal(mean, standard deviation, size) draws `size` values from a normal distribution
# with the given mean and standard deviation.
np.random.seed(1234)
diet_a = np.random.normal(10, 2, size=50)
diet_b = np.random.normal(8, 3, size=50)
diet_c = np.random.normal(12, 4, size=50)
# print(diet_a,diet_b,diet_c)

# Conduct one-way ANOVA
f_statistic, p_value = f_oneway(diet_a, diet_b, diet_c)

# Print results
print("F-statistic: {:.2f}".format(f_statistic))
print("p-value: {:.4f}".format(p_value))

if p_value < alpha:
    print("Reject the null hypothesis \n At least one group mean differs from the others")
else:
    print("Fail to reject the null hypothesis \n There is not enough evidence that the group means differ")

F-statistic: 25.44
p-value: 0.0000
Reject the null hypothesis 
 At least one group mean differs from the others


In [13]:
# answer 10
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Sample sales data for the three stores (30 observations each)
store_a_sales = [35, 37, 29, 42, 44, 30, 38, 39, 41, 36, 28, 32, 43, 37, 33, 36, 30, 45, 39, 42, 34, 38, 43, 31, 40, 44, 37, 40, 33, 35]
store_b_sales = [29, 31, 28, 25, 36, 27, 33, 24, 32, 30, 26, 30, 35, 29, 33, 28, 27, 32, 30, 34, 25, 28, 29, 26, 36, 31, 30, 27, 31, 32]
store_c_sales = [40, 42, 37, 39, 45, 38, 42, 43, 41, 39, 36, 44, 43, 42, 39, 41, 38, 44, 36, 43, 39, 42, 40, 44, 38, 42, 36, 43, 41, 38]

# Combine the data into a single DataFrame
sales_data = pd.DataFrame({'sales': store_a_sales + store_b_sales + store_c_sales,
                           'store': ['A']*len(store_a_sales) + ['B']*len(store_b_sales) + ['C']*len(store_c_sales)})

# Conduct one-way ANOVA
f_stat, p_val = stats.f_oneway(store_a_sales, store_b_sales, store_c_sales)
print('One-way ANOVA results:')
print('F-statistic:', f_stat)
print('p-value:', p_val)

# Conduct post-hoc test (Tukey's HSD)
tukey_results = pairwise_tukeyhsd(sales_data['sales'], sales_data['store'])
print('\nTukey\'s HSD post-hoc test results:')
print(tukey_results)

One-way ANOVA results:
F-statistic: 65.41053310163127
p-value: 4.589260705868439e-18

Tukey's HSD post-hoc test results:
Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
     A      B  -7.2333    0.0 -9.5096  -4.957   True
     A      C   3.4667 0.0014  1.1904   5.743   True
     B      C     10.7    0.0  8.4237 12.9763   True
----------------------------------------------------


In [14]:
# answer 11
import pandas as pd
import numpy as np
from scipy import stats

# Generate some sample data
np.random.seed(123)
control_scores = np.random.normal(loc=70, scale=10, size=100)
experimental_scores = np.random.normal(loc=75, scale=10, size=100)

# Conduct two-sample t-test
t, p = stats.ttest_ind(control_scores, experimental_scores)
print("t-statistic:", t)
print("p-value:", p)

# Conduct post-hoc test (Tukey's HSD)
from statsmodels.stats.multicomp import pairwise_tukeyhsd

if p < 0.05:
    # Combine the data into a single DataFrame
    data = pd.DataFrame({'score': np.concatenate((control_scores, experimental_scores)),
                         'group': ['control']*100 + ['experimental']*100})

    # Conduct Tukey's HSD test
    tukey_results = pairwise_tukeyhsd(data['score'], data['group'])
    print(tukey_results)

    # Access the attributes (only defined when the post-hoc test was run)
    groups = tukey_results.groupsunique
    meandiffs = tukey_results.meandiffs
    pvalues = tukey_results.pvalues
    reject = tukey_results.reject
"""
Interpretation:

The two-sample t-test yielded a t-statistic of -3.03 and a p-value of 0.0028, indicating that there 
is a significant difference in test scores between the control and experimental groups. Specifically,
the experimental group had a mean score that was 4.53 points higher than the control group.

To follow up on this result, we ran Tukey's HSD, which compares all pairs of group means. With only two
groups there is a single comparison, so it reproduces the t-test result: the experimental group had a
significantly higher mean score than the control group (meandiff = 4.53, p-adj = 0.0028, reject=True).
"""

t-statistic: -3.0316172004188147
p-value: 0.0027577299763983324
   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj  lower  upper  reject
---------------------------------------------------------
control experimental   4.5336 0.0028 1.5846 7.4826   True
---------------------------------------------------------
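In the output above, the Tukey p-adj (0.0028) matches the t-test p-value, because with two groups there is only one comparison to make. The same equivalence holds between the pooled-variance t-test and a one-way ANOVA on two groups, where F equals t squared. A quick sketch on separately simulated data (the samples below are assumptions, not the notebook's `control_scores` and `experimental_scores`):

```python
import numpy as np
from scipy import stats

# Hypothetical two-group data (not the notebook's samples)
rng = np.random.default_rng(7)
control = rng.normal(70, 10, size=50)
treated = rng.normal(75, 10, size=50)

t_stat, p_t = stats.ttest_ind(control, treated)  # pooled-variance t-test
f_stat, p_f = stats.f_oneway(control, treated)   # one-way ANOVA with two groups

print(f"t^2 = {t_stat ** 2:.6f}, F = {f_stat:.6f}")
print(f"p (t-test) = {p_t:.6f}, p (ANOVA) = {p_f:.6f}")
```

Post-hoc tests therefore only add information once there are three or more groups.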


In [15]:
# answer 12
import pandas as pd
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate some sample data
np.random.seed(123)
store_a_sales = np.random.normal(loc=1000, scale=100, size=30)
store_b_sales = np.random.normal(loc=1100, scale=100, size=30)
store_c_sales = np.random.normal(loc=1200, scale=100, size=30)

# Combine the data into a single DataFrame
data = pd.DataFrame({'sales': np.concatenate((store_a_sales, store_b_sales, store_c_sales)),
                     'store': ['A']*30 + ['B']*30 + ['C']*30})

# Conduct one-way ANOVA
f, p = stats.f_oneway(store_a_sales, store_b_sales, store_c_sales)
print("F-statistic:", f)
print("p-value:", p)

if p < 0.05:
    # Conduct post-hoc test (Tukey's HSD)
    tukey_results = pairwise_tukeyhsd(data['sales'], data['store'])
    print(tukey_results)
"""
Interpretation:

The one-way ANOVA (F = 19.71, p < 0.001) indicates that
there is a significant difference in mean daily sales between the three stores. To follow up on this
result, we conducted a post-hoc test using Tukey's HSD method, which compares all pairs of group 
means to determine which, if any, differ significantly from each other. The results of the post-hoc 
test indicate that all pairwise comparisons are significant (reject=True), indicating that all three
stores have significantly different mean daily sales.
"""

F-statistic: 19.710498049475664
p-value: 8.70867222411767e-08
 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
group1 group2 meandiff p-adj   lower    upper   reject
------------------------------------------------------
     A      B 109.6771 0.0013  38.0754 181.2789   True
     A      C 187.6449    0.0 116.0431 259.2467   True
     B      C  77.9678 0.0295    6.366 149.5696   True
------------------------------------------------------
