# Biostatistics 2019 - Exam 10

> Marcos Duarte  
http://pesquisa.ufabc.edu.br/bmclab/ensino/intro_stat/

## Problem  
1. Consider the following LDL data (in mg/dL) from blood samples of patients on a normal diet: 155, 157, 153, 149, 154, 162; and of other patients on a vegan diet: 155, 151, 151, 140, 155.  
 a. Formulate statistical hypotheses about the difference between the two groups;  
 b. Calculate Cohen's d statistic for effect size.  
 c. Calculate the 95% confidence interval for the difference between the two groups.  
 d. Perform a null hypothesis significance test; adopt a significance level of 0.05.  
 e. Write a paragraph describing these statistical results (you can use the APA style) and the conclusion on this statistical analysis.

## Solution

a.  
Possible hypotheses:
 - $H_0$: $\mu_{normal} = \mu_{vegan}$.  
 - $H_A$: $\mu_{normal} \neq \mu_{vegan}$.  

Comments:  
Another possible alternative hypothesis is $H_A$: $\mu_{normal} > \mu_{vegan}$. We could adopt this directional hypothesis if, for example, there were sufficient data in the literature showing such a directional effect (and it seems there are; see [Yokoyama et al., 2017](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5914369/)).  
For a discussion of when to adopt a directional alternative hypothesis, see, for example, [Ruxton & Neuhäuser, 2010](https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/j.2041-210X.2010.00014.x).

b.  
Cohen's d statistic for effect size is defined as the ratio between the difference between the means of the groups and the pooled standard deviation:

$$ d = \frac{\bar x_n - \bar x_v}{s_P}  $$

where $s_P$ is the pooled standard deviation:

$$ s_P = \sqrt {\frac {(n_{1}-1)s_{1}^{2}+(n_{2}-1)s_{2}^{2}}{n_{1}+n_{2}-2}} $$

And the variance $s^2$ (the standard deviation squared) for one of the groups is:

$$ s_{1}^{2}={\frac {1}{n_{1}-1}}\sum _{i=1}^{n_{1}}(x_{1,i}-{\bar {x}}_{1})^{2} $$

Of course, let's use Python for the basic arithmetic operations.  
First, let's define some functions (most of them are redundant because equivalents already exist in the Python ecosystem):

In [1]:
import numpy as np

def mean(x):
    """Compute mean.
    """
    return sum(x)/x.size

def var(x):
    """Compute variance.
    """
    return sum((x - mean(x))**2)/(x.size - 1)

def poolstd(x1, x2, show=False):
    """Compute pooled standard deviation.
    """
    sp = np.sqrt( ((x1.size-1)*var(x1) + (x2.size-1)*var(x2)) / (x1.size+x2.size-2) )
    if show:
        print(f'The pooled standard deviation between the two samples is {sp:.2f}')
    return sp

def poolsem(x1, x2, show=False):
    """Compute pooled standard error of the mean.
    """
    sep = np.sqrt( poolstd(x1, x2)**2/x1.size + poolstd(x1, x2)**2/x2.size )
    if show:
        print(f'The pooled standard error of the mean between the two samples is {sep:.2f}')
    return sep

def cohensd(x1, x2, show=False):
    """Compute Cohen's d statistic for effect size.
    """
    d = np.abs(mean(x1) - mean(x2)) / poolstd(x1, x2)
    if show:
        print(f'Cohen\'s d statistic for the samples is {d:.2f}')    
    return d

def summary(x, name='The sample '):
    """Display summary statistics.
    """
    print(f'{name} has {x.size} data, mean {mean(x):.2f}, variance {var(x):.2f}.')

Now, let's create variables for the values:

In [2]:
xn = np.array([155, 157, 153, 149, 154, 162])
xv = np.array([155, 151, 151, 140, 155])

In [3]:
summary(xn, 'xn')
summary(xv, 'xv')
poolstd(xn, xv, show=True);

xn has 6 data, mean 155.00, variance 18.80.
xv has 5 data, mean 150.40, variance 37.80.
The pooled standard deviation between the two samples is 5.22


In [4]:
d = cohensd(xn, xv, show=True)

Cohen's d statistic for the samples is 0.88
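
As a cross-check on the hand-written functions, the same quantities can be obtained from NumPy's built-ins. This is only a minimal sketch: the names `var_n`, `var_v`, `sp_check` and `d_check` are introduced here for illustration, and `ddof=1` is needed so that `np.var` divides by $n-1$, as in the variance formula above.

```python
import numpy as np

xn = np.array([155, 157, 153, 149, 154, 162])  # normal diet (mg/dL)
xv = np.array([155, 151, 151, 140, 155])       # vegan diet (mg/dL)

# ddof=1 gives the sample variance (division by n - 1).
var_n = np.var(xn, ddof=1)
var_v = np.var(xv, ddof=1)

# Pooled standard deviation: each variance weighted by its degrees of freedom.
dof = xn.size + xv.size - 2
sp_check = np.sqrt(((xn.size - 1) * var_n + (xv.size - 1) * var_v) / dof)

# Cohen's d: absolute difference between the means in units of the pooled SD.
d_check = abs(np.mean(xn) - np.mean(xv)) / sp_check

print(f"pooled SD = {sp_check:.2f}, Cohen's d = {d_check:.2f}")
```

The printed values should agree with the outputs of `poolstd` and `cohensd` above.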


c.  
In general, a 95% confidence interval is defined as:

$$ [\bar x - t_{.95}\mathit{SE}, \bar x + t_{.95}\mathit{SE}]  $$

where $\bar x$ is the mean of the sample, $t_{.95}$ is the critical $t$ value that bounds the central 95% of the $t$ distribution, and $\mathit{SE}$ is the standard error of the mean (the standard deviation divided by the square root of the sample size).

For the difference between the two groups, the mean is the difference between the groups' means and SE is the pooled SE.

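The standard error of the difference between the means can be calculated in two different ways:

$$ \mathit{SE} = \sqrt{\frac{s^2_{1}}{n_1}+\frac{s^2_{2}}{n_2}} $$

and

$$ \mathit{SE}_P = \sqrt{\frac{s^2_{P}}{n_1}+\frac{s^2_{P}}{n_2}} $$

Note that the second formula employs the pooled variance $s^2_P$ (the square of the pooled standard deviation defined in item b) for both groups.  
This latter method is the usual choice for samples with different sizes when the two populations can be assumed to have equal variances; if the variances are clearly different, the first (unpooled) formula is preferred.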

In [5]:
sep = poolsem(xn, xv, show=True)

The pooled standard error of the mean between the two samples is 3.16
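
To see how much the two formulas differ for these data, both can be evaluated side by side. This is a sketch for illustration only; the names `se_unpooled`, `se_pooled` and `vp` are not part of the original solution.

```python
import numpy as np

xn = np.array([155, 157, 153, 149, 154, 162])  # normal diet (mg/dL)
xv = np.array([155, 151, 151, 140, 155])       # vegan diet (mg/dL)

n1, n2 = xn.size, xv.size
v1, v2 = np.var(xn, ddof=1), np.var(xv, ddof=1)  # sample variances

# First formula: each sample keeps its own variance (unpooled).
se_unpooled = np.sqrt(v1 / n1 + v2 / n2)

# Second formula: a single pooled variance is used for both samples.
vp = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
se_pooled = np.sqrt(vp / n1 + vp / n2)

print(f'unpooled SE = {se_unpooled:.2f}, pooled SE = {se_pooled:.2f}')
```

The pooled value is the one returned by `poolsem` and used in the rest of this solution.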


For a 95% confidence interval, we want the critical value of the t distribution for a cumulative probability of 0.975 (remember that 2.5% of the area lies outside the interval on each side) and degrees of freedom equal to $n_1+n_2-2$.  
Using the stats module from the SciPy package:

In [6]:
from scipy import stats
# degrees of freedom: n1 + n2 - 2 = 9
t_crit = stats.t.ppf(q=0.975, df=xn.size + xv.size - 2)
print(f'Critical t value for p=.95 and df=9: {t_crit:.6f}')

Critical t value for p=.95 and df=9: 2.262157


In [7]:
print(f'CI95: [{(mean(xn)-mean(xv))-t_crit*sep:.3f}, {(mean(xn)-mean(xv))+t_crit*sep:.3f}] mg/dL')

CI95: [-2.550, 11.750] mg/dL
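
The same interval can also be obtained in one call with the `interval` method of `scipy.stats.t`, passing the difference between the means as `loc` and the pooled standard error as `scale`. This is a sketch under those assumptions; the variable names are chosen here for illustration, and the confidence level is passed positionally because its keyword name differs between SciPy versions.

```python
import numpy as np
from scipy import stats

xn = np.array([155, 157, 153, 149, 154, 162])  # normal diet (mg/dL)
xv = np.array([155, 151, 151, 140, 155])       # vegan diet (mg/dL)

n1, n2 = xn.size, xv.size
diff = xn.mean() - xv.mean()  # difference between the group means

# Pooled variance and pooled standard error of the difference.
vp = ((n1 - 1) * xn.var(ddof=1) + (n2 - 1) * xv.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(vp * (1 / n1 + 1 / n2))

# Central 95% interval of a t distribution with n1 + n2 - 2 degrees of
# freedom, centered at diff and scaled by se.
ci_low, ci_high = stats.t.interval(0.95, n1 + n2 - 2, loc=diff, scale=se)
print(f'CI95: [{ci_low:.3f}, {ci_high:.3f}] mg/dL')
```

Because the interval contains zero, a difference of zero between the groups is compatible with the data at this confidence level.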


d.  
For a null hypothesis significance test with $\alpha=0.05$, we will employ an independent (unpaired) two-tailed t-test:

$$ t = \frac{\bar x_n - \bar x_v}{\mathit{SE}_P}  $$

In [8]:
t = (mean(xn) - mean(xv))/sep
print(f't value: {t:.3f}')

t value: 1.455


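And the probability of finding this t value or greater (that is, the area in one tail of the t distribution) is:

In [9]:
# survival probability for df = n1 + n2 - 2 = 9
p = 1 - stats.t.cdf(x=t, df=xn.size + xv.size - 2)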
print(f'Probability of finding this t value or greater: {p:.3f}')

Probability of finding this t value or greater: 0.090


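Since the critical t value for a two-tailed test with $\alpha=0.05$ and df=9 is 2.262, the calculated t (1.455) does not fall in the rejection region. The probability computed above covers only one tail; for the two-tailed test the p value is twice as large (about 0.18), which is also greater than the significance level. We therefore fail to reject the null hypothesis (note that this is not the same as accepting it).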
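
The test can be cross-checked with SciPy's `ttest_ind`, which, with `equal_var=True`, uses the pooled variance as in the calculation above. Note that the p value it returns is two-tailed by default, i.e., twice the one-sided tail probability computed above. This is a minimal sketch; `t_check` and `p_two_tailed` are names chosen here for illustration.

```python
import numpy as np
from scipy import stats

xn = np.array([155, 157, 153, 149, 154, 162])  # normal diet (mg/dL)
xv = np.array([155, 151, 151, 140, 155])       # vegan diet (mg/dL)

# Independent-samples t-test with pooled variance (Student's t-test).
res = stats.ttest_ind(xn, xv, equal_var=True)
t_check, p_two_tailed = res.statistic, res.pvalue

print(f't value: {t_check:.3f}')
print(f'two-tailed p value: {p_two_tailed:.3f}')
print(f'one-tailed p value: {p_two_tailed / 2:.3f}')
```

Either way, the p value is greater than the significance level of 0.05, so the conclusion of the test does not change.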

e.  

Possible ways to report the result (any one of them would do; the p value is two-tailed):

There was no significant difference between the LDL values of persons on a normal diet and persons on a vegan diet (t(9)=1.455, p=0.18, CI95=[-2.550, 11.750] mg/dL, Cohen's d=0.88). 

The adoption of a vegan diet did not significantly affect LDL values in relation to a normal diet (t(9)=1.455, p=0.18, CI95=[-2.550, 11.750] mg/dL, Cohen's d=0.88). 

No difference was observed between the LDL values of persons on a normal diet and persons on a vegan diet (t(9)=1.455, p=0.18, CI95=[-2.550, 11.750] mg/dL, Cohen's d=0.88). 

The difference between the LDL values of persons on a normal diet and persons on a vegan diet was not statistically significant (t(9)=1.455, p=0.18, CI95=[-2.550, 11.750] mg/dL, Cohen's d=0.88).  

With regard to LDL levels, persons on a normal diet and persons on a vegan diet are not significantly different (t(9)=1.455, p=0.18, CI95=[-2.550, 11.750] mg/dL, Cohen's d=0.88).  