### Non-parametric test

### Mann-Whitney U test
The Mann-Whitney U test is the non-parametric equivalent of the unpaired two-sample t-test.

Assumptions of this test:
1. dependent variable --- ordinal or continuous scale
2. independent variable --- two independent categorical groups
3. independent observations --- the data do not need to follow a normal distribution, but comparing medians is most meaningful when the two groups' distributions have a similar shape

### Statistical procedure
1. formulate the research question
2. make the null hypothesis
3. load data and analyse it
4. apply the test statistic appropriate for the data
5. interpret the results

References
https://towardsdatascience.com/prepare-dinner-save-the-day-by-calculating-confidence-interval-of-non-parametric-statistical-29d031d079d0

#### Research question

Let's take a question from a research article itself: Statistics in Medicine: Calculating confidence intervals for some non-parametric analyses ([link](https://doi.org/10.1136/bmj.296.6634.1454)).

Consider the data on the globulin fraction of plasma (g/l) in two groups of 10 patients given by Swinscow.
A bit about the data:
1. there are two independent categorical groups, i.e. group 1 and group 2
2. the measurement for each group, i.e. the globulin fraction, is on a continuous scale

These two groups may represent the same population, i.e. both samples may have been drawn from one population. Suppose one group received a treatment and the other did not; we would then expect the globulin fraction of one group to be greater, on average, than that of the other.
Other examples where we can use this test:
1. pay discrimination between genders, where pay is the continuous dependent variable and the independent categorical groups are male and female.

"On average" refers to a summary property of each group that we use to compare the groups:
1. it may be the mean
2. or the median: here we compare the estimated medians of the two populations

We know what the median is: the middle value when we arrange the data in increasing or decreasing order.

Here we use the median to represent each group. So our research question is whether the medians of group 1 and group 2 differ.

#### Null hypothesis
H0: the difference between the medians of the two groups is zero, i.e. the two populations are equal.

#### Load the data

In [6]:
import numpy as np
group_1 = np.array([38, 26, 29, 41, 36, 31, 32, 30, 35, 33])
group_2 = np.array([45, 28, 27, 38, 40, 42, 39, 39, 34, 45])

#### Analyse the data
1. the dependent variable is ordinal or continuous
2. the independent variable consists of two groups which are categorical and independent
3. each group has different participants, meaning all twenty participants are unique
4. with only 10 observations per group the sample is too small to establish normality, so we do not assume the groups are normally distributed
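As a quick look at the points above, the sketch below prints each group's sample median and runs a Shapiro-Wilk normality test on each group. The Shapiro-Wilk check is an addition that is not part of the original analysis, and with only 10 values per group it has little power, so a large p-value should not be read as proof of normality.

```python
import numpy as np
import scipy.stats as st

group_1 = np.array([38, 26, 29, 41, 36, 31, 32, 30, 35, 33])
group_2 = np.array([45, 28, 27, 38, 40, 42, 39, 39, 34, 45])

# sample medians, the summary we use to represent each group
median_1 = np.median(group_1)
median_2 = np.median(group_2)
print("sample medians:", median_1, median_2)

# Shapiro-Wilk normality test per group (an added check, not from the article)
for name, values in (("group 1", group_1), ("group 2", group_2)):
    w_stat, p_value = st.shapiro(values)
    print(name, "W =", round(float(w_stat), 3), "p =", round(float(p_value), 3))
```

Whatever the test reports, the small sample is the main reason to prefer a non-parametric method here.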

### Applying the test statistic
There are two ways of doing it:
1. calculate the confidence interval for the estimated median of each population
2. calculate the confidence interval for the difference between the estimated medians

### Estimate the difference between the medians of the two populations
An estimate of a population median from a sample:
1. always comes with a stated confidence level, generally a 95% confidence interval
2. has an upper limit and a lower limit.


The difference between the medians of the two populations is estimated by
1. the median of all the possible n*m differences, which we can find by arranging them in increasing order.

See the data in diff: it holds the difference between every value of group 1 and every value of group 2, i.e. 100 values.

In [48]:
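# sort both groups so the differences come out in a predictable order
group_1 = np.sort(group_1)
group_2 = np.sort(group_2)
# collect all n*m pairwise differences (group 1 value minus group 2 value)
diff = np.array([])
for i in group_1:
    for j in group_2:
        diff = np.append(diff, i-j)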
print(diff)
print("the median of diff is", np.median(diff))

[ -1.  -2.  -8. -12. -13. -13. -14. -16. -19. -19.   2.   1.  -5.  -9.
 -10. -10. -11. -13. -16. -16.   3.   2.  -4.  -8.  -9.  -9. -10. -12.
 -15. -15.   4.   3.  -3.  -7.  -8.  -8.  -9. -11. -14. -14.   5.   4.
  -2.  -6.  -7.  -7.  -8. -10. -13. -13.   6.   5.  -1.  -5.  -6.  -6.
  -7.  -9. -12. -12.   8.   7.   1.  -3.  -4.  -4.  -5.  -7. -10. -10.
   9.   8.   2.  -2.  -3.  -3.  -4.  -6.  -9.  -9.  11.  10.   4.   0.
  -1.  -1.  -2.  -4.  -7.  -7.  14.  13.   7.   3.   2.   2.   1.  -1.
  -4.  -4.]
the median of diff is -5.5
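The same 100 differences can also be built without an explicit loop. The sketch below is an alternative, not the notebook's own method: it uses NumPy's `np.subtract.outer`, and the variable names `pairwise` and `diff_vectorized` are illustrative.

```python
import numpy as np

group_1 = np.array([38, 26, 29, 41, 36, 31, 32, 30, 35, 33])
group_2 = np.array([45, 28, 27, 38, 40, 42, 39, 39, 34, 45])

# entry [i, j] of the outer difference is group_1[i] - group_2[j]
pairwise = np.subtract.outer(group_1, group_2)
diff_vectorized = pairwise.ravel()

print(diff_vectorized.size)        # number of pairwise differences
print(np.median(diff_vectorized))  # point estimate of the median difference
```

The median does not depend on the order of the differences, so sorting the groups first is not required for this step.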


We now know the median of the differences.
Next we need a confidence interval for the difference between the population medians:
1. since it is estimated from a sample, we need to use confidence intervals
2. the confidence interval gives a smallest value and a largest value between which the estimated difference in population medians will lie
3. these limits sit the same number of positions in from either end of the sorted differences, and that number is called K
4. so the lower limit of the confidence interval is the Kth smallest difference and the upper limit is the Kth largest difference
5. the Kth smallest to the Kth largest of the differences gives the 100(1-alpha)% confidence interval
6. K can be found from

\begin{equation*}
K = W_{\alpha/2} - \frac{n(n + 1)}{2}
\end{equation*}

7. where W_{alpha/2} is the percentile, at level alpha/2, of the distribution of the Mann-Whitney test statistic and n is the size of group 1
8. otherwise you can calculate K directly with the normal approximation

\begin{equation*}
K = \frac{nm}{2} - N_{1-\alpha/2} \sqrt{\frac{nm(n+m+1)}{12}}
\end{equation*}

9. where n and m are the two group sizes and N_{1-alpha/2} is the corresponding percentile of the standard normal distribution; for alpha = 0.05 it is 1.96
10. the interval for the estimated difference in population medians can therefore be read off the sorted differences
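As a worked check of the approximation for this example, take n = m = 10 and alpha = 0.05, so the normal percentile is about 1.96. Rounding to the nearest integer at the end follows what this notebook's code does; it is a convention assumed here, not part of the formula itself.

\begin{equation*}
K = \frac{10 \times 10}{2} - 1.96 \times \sqrt{\frac{10 \times 10 \times (10 + 10 + 1)}{12}} = 50 - 1.96 \times \sqrt{175} \approx 50 - 25.93 = 24.07 \approx 24
\end{equation*}

So the 95% interval runs from the 24th smallest to the 24th largest of the 100 sorted differences.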

In [60]:
import scipy.stats as st

diff = np.sort(diff)
print(diff)
# Kth smallest and Kth largest values of the sorted differences
a = 0.05  # significance level (alpha)
n = len(group_1)
m = len(group_2)
k = round(n*m/2 - st.norm.ppf(1-a/2)*(n*m*(n+m+1)/12)**0.5)

print("the 95% confidence interval for the difference in medians lies between", diff[k-1], "and", diff[-k], "with the estimated median difference at", np.median(diff))

[-19. -19. -16. -16. -16. -15. -15. -14. -14. -14. -13. -13. -13. -13.
 -13. -12. -12. -12. -12. -11. -11. -10. -10. -10. -10. -10. -10.  -9.
  -9.  -9.  -9.  -9.  -9.  -9.  -8.  -8.  -8.  -8.  -8.  -7.  -7.  -7.
  -7.  -7.  -7.  -7.  -6.  -6.  -6.  -6.  -5.  -5.  -5.  -4.  -4.  -4.
  -4.  -4.  -4.  -4.  -3.  -3.  -3.  -3.  -2.  -2.  -2.  -2.  -1.  -1.
  -1.  -1.  -1.   0.   1.   1.   1.   2.   2.   2.   2.   2.   3.   3.
   3.   4.   4.   4.   5.   5.   6.   7.   7.   8.   8.   9.  10.  11.
  13.  14.]
the 95% confidence interval for the difference in medians lies between -10.0 and 1.0 with the estimated median difference at -5.5


### Interpreting the results
1. with an alpha of 0.05, i.e. a confidence level of 95%, we got
    1. an estimated median difference of -5.5
    2. a confidence interval from -10 to 1.0

Since the confidence interval for the difference in medians contains zero, we cannot reject the null hypothesis; that is, we do not have sufficient evidence to conclude that the medians differ.

Now we verify this with SciPy's built-in test. It returns only the U statistic and a p-value; it does not give a confidence interval or the estimated median difference.

In [61]:
st.mannwhitneyu(group_1, group_2)

MannwhitneyuResult(statistic=26.5, pvalue=0.0817535829999502)
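The U statistic reported by SciPy can be tied back to the pairwise differences used earlier. The sketch below is an illustration rather than part of the original analysis: it counts the pairs in which the group 1 value is larger, counts ties as one half, and compares that with `st.mannwhitneyu` called with `alternative="two-sided"` stated explicitly. The name `u_from_diffs` is made up for this example.

```python
import numpy as np
import scipy.stats as st

group_1 = np.array([38, 26, 29, 41, 36, 31, 32, 30, 35, 33])
group_2 = np.array([45, 28, 27, 38, 40, 42, 39, 39, 34, 45])

# all pairwise differences group_1[i] - group_2[j]
diff = np.subtract.outer(group_1, group_2).ravel()

# U for group 1: pairs where the group 1 value is larger, ties counted as one half
u_from_diffs = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)

result = st.mannwhitneyu(group_1, group_2, alternative="two-sided")
print(u_from_diffs, result.statistic, result.pvalue)
```

A two-sided p-value above 0.05 points the same way as a 95% confidence interval that contains zero.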

### Creating our own function from the steps above

In [75]:
import numpy as np
import scipy.stats as st

def Mann_Whitney(a, b, alpha=0.05):
    """Print the 100(1-alpha)% confidence interval for the difference in medians (a - b)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # all n*m pairwise differences a_i - b_j, sorted in increasing order
    diff = np.sort(np.subtract.outer(a, b).ravel())
    n = len(a)
    m = len(b)
    # K from the normal approximation, rounded to the nearest integer
    k = int(round(n*m/2 - st.norm.ppf(1-alpha/2)*(n*m*(n+m+1)/12)**0.5))
    # the interval runs from the Kth smallest to the Kth largest difference
    print(f"The {100*(1-alpha):g}% confidence interval for the difference in medians lies between {diff[k-1]} and {diff[-k]}, with the estimated median difference at {np.median(diff)}")

Now we test our function. You can pass the level of significance, alpha; if you leave it out, it defaults to 0.05.

In [74]:
Mann_Whitney(group_1, group_2)

The 95% confidence interval for the difference in medians lies between -10.0 and 1.0, with the estimated median difference at -5.5
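If the limits are needed for further calculation rather than just for display, a variant that returns them may be more convenient. The sketch below is a hypothetical helper, `median_diff_ci`, which is not part of the original notebook; it uses the same normal-approximation K, and the guard against K falling below 1 is an added assumption about how very small samples should be handled.

```python
import numpy as np
import scipy.stats as st

def median_diff_ci(a, b, alpha=0.05):
    """Return (lower, upper, estimate) for the difference in medians a - b.

    Hypothetical helper: same normal-approximation K as the function above,
    but the results are returned instead of printed.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # all n*m pairwise differences, sorted in increasing order
    diff = np.sort(np.subtract.outer(a, b).ravel())
    k = int(round(n * m / 2 - st.norm.ppf(1 - alpha / 2) * (n * m * (n + m + 1) / 12) ** 0.5))
    if k < 1:
        # added guard: the approximation gives no usable position here
        raise ValueError("samples too small for this approximation at the requested alpha")
    return diff[k - 1], diff[-k], np.median(diff)

group_1 = [38, 26, 29, 41, 36, 31, 32, 30, 35, 33]
group_2 = [45, 28, 27, 38, 40, 42, 39, 39, 34, 45]

lower_95, upper_95, estimate = median_diff_ci(group_1, group_2)
lower_90, upper_90, _ = median_diff_ci(group_1, group_2, alpha=0.10)
print(lower_95, upper_95, estimate)
print(lower_90, upper_90)
```

With alpha = 0.10 the interval narrows to -9 to -1 and no longer contains zero, which shows how much the conclusion depends on the chosen significance level.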
