# Question 1

What tests have we learned, and when should each be applied (to what kind of data)? Can you also specify when a test needs special handling for ties?

## Paired data $\{(x_i, y_i) \; | \; i = 1, \ldots, n\}$
  - Same location parameter? Is the distribution of $D = Y-X$ centered on zero? I.e., test on $d_i = y_i - x_i$.
    - Sign test
    - (Wilcoxon) signed-rank test
  - Correlation: Spearman's $\rho$, Kendall's $\tau$
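
As an illustration of the sign test on the differences $d_i$, here is a minimal sketch. The helper name `sign_test` and the toy data are hypothetical, and it assumes that tied pairs ($d_i = 0$) are simply discarded, which is the usual convention for this test. Only the binomial null distribution comes from `scipy`; the statistic is computed by hand.

```python
import numpy as np
from scipy import stats

def sign_test(x_i, y_i):
    """Two-sided sign test on paired data (hypothetical helper, not from the lessons)."""
    d_i = np.asarray(y_i) - np.asarray(x_i)
    d_i = d_i[d_i != 0]              # assumption: tied pairs (d_i = 0) are discarded
    n = len(d_i)                     # number of untied pairs
    n_plus = int(np.sum(d_i > 0))    # test statistic: number of positive differences
    null = stats.binom(n, 0.5)       # null distribution of n_plus
    # Two-sided p-value: double the smaller tail, capped at 1
    p = min(1.0, 2 * min(null.cdf(n_plus), null.sf(n_plus - 1)))
    return n_plus, n, p

# Toy data (made up for illustration); the last pair is tied and is dropped
x_i = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_i = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 6.0])
n_plus, n, p = sign_test(x_i, y_i)
print(n_plus, n, p)   # 5 positive differences out of 5 untied pairs, p = 2/32
```

The signed-rank test would additionally rank $|d_i|$, so ties among the $|d_i|$ need midranks there as well.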

## Independent samples $\{x_i\; |\; i=1, \ldots , n\}, \{y_j\; |\; j=1, \ldots , m\}$
  - Equal variances? Conover squared ranks test.
  - Same location parameter? Count the number of times $x_i > y_j$ over all $i$ and $j$. Mann-Whitney test $\equiv$ (Wilcoxon) rank sum test.
  - Compare EDFs: Kolmogorov-Smirnov, Cramer-von Mises, or Anderson-Darling test
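
The equivalence between the Mann-Whitney count and the rank sum can be checked numerically. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the common convention that a tied pair $x_i = y_j$ contributes $1/2$ to the count. With midranks, the count $U$ and the rank sum $W$ of the $x$ sample satisfy $U = W - n(n+1)/2$.

```python
import numpy as np
from scipy import stats

def mann_whitney_u(x_i, y_j):
    """Count pairs with x_i > y_j; tied pairs count 1/2 (assumed convention)."""
    x_i = np.asarray(x_i)
    y_j = np.asarray(y_j)
    greater = np.sum(x_i[:, None] > y_j[None, :])
    equal = np.sum(x_i[:, None] == y_j[None, :])
    return greater + 0.5 * equal

def rank_sum(x_i, y_j):
    """Sum of the ranks of the x sample within the pooled data (midranks for ties)."""
    n = len(x_i)
    R_r = stats.rankdata(np.concatenate([x_i, y_j]))
    return np.sum(R_r[:n])

x_i = np.array([8.56, 5.03, 48.1, 1.31, 4.82])
y_j = np.array([15.0, 12.3, 28.0, 13.9])
n = len(x_i)
U = mann_whitney_u(x_i, y_j)
W = rank_sum(x_i, y_j)
print(U, W - n * (n + 1) / 2)   # both are 4.0 for these data
```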

## $k$ samples: $\{x_{i,j}\;|\; i=1, \ldots,k; j=1, \ldots, n_i\}$ 
Generalization: $x_j \rightarrow x_{1,j}$, $y_j\rightarrow x_{2, j}$, $n\rightarrow n_1$, $m\rightarrow n_2$ etc.

- Rank sum test $\rightarrow$ Kruskal-Wallis test
  - Conover-Iman test to check which pairs of samples are responsible.
- 2-sample Conover squared ranks $\rightarrow$ $k$-sample squared ranks

## Complete block design: $\{x_{i,j}\;|\;i=1, \ldots, b; j=1, \ldots, k\}$
Compare "treatments", labelled by $j$ within each block labelled by $i$. Generalize paired data to $k$ "samples".

$x_i \rightarrow x_{i,1}$; $y_i \rightarrow x_{i,2}$; $n\rightarrow b$

- Sign test $\rightarrow$ Friedman (rank treatments within blocks)
- Signed-rank test $\rightarrow$ Quade test (weight ranks by spread within block)
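
The weighting in the Quade test can be sketched as follows. This is an assumption-laden illustration rather than code from the lessons: the name `get_quade` and the toy array are hypothetical, "spread" is taken to be the range within each block, and the statistic is compared with an $F$ distribution with $k-1$ and $(b-1)(k-1)$ degrees of freedom.

```python
import numpy as np
from scipy import stats

def get_quade(X_ij):
    """Quade test for a complete block design (rows = blocks, columns = treatments)."""
    X_ij = np.asarray(X_ij, dtype=float)
    b, k = X_ij.shape
    R_ij = stats.rankdata(X_ij, axis=-1)                        # ranks within each block
    Q_i = stats.rankdata(X_ij.max(axis=1) - X_ij.min(axis=1))   # blocks ranked by their range
    S_ij = Q_i[:, None] * (R_ij - 0.5 * (k + 1))                # centered ranks, weighted by Q_i
    S_j = S_ij.sum(axis=0)                                      # per-treatment totals
    A = np.sum(S_ij**2)
    B = np.sum(S_j**2) / b
    T = (b - 1) * B / (A - B)            # undefined if A == B (perfect agreement across blocks)
    p = stats.f(k - 1, (b - 1) * (k - 1)).sf(T)
    return T, p

# Toy data (made up for illustration): 3 blocks, 3 treatments
X_ij = np.array([[1.0, 2.0, 3.0],
                 [1.0, 3.0, 6.0],
                 [4.0, 1.0, 5.0]])
print(get_quade(X_ij))
```

Blocks with a larger range get a larger weight $Q_i$, so they influence the treatment totals more than they would in the Friedman test.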

## ROC curves/ARE/power curves; one sample K-S etc. tests

# Question 2

For $p$-values: use Monte Carlo or built-in functions?

Answer: It is generally fine to use functions that provide the null distribution (e.g. `stats.ksone()` for the distribution of the one-sided Kolmogorov-Smirnov statistic), but don't use a function that does the whole test for you.
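
A minimal sketch of that division of labour, assuming a one-sided one-sample K-S test against a continuous hypothesized CDF: the statistic $D^+$ is computed by hand and only the null distribution is taken from `scipy`. The helper name `ks_one_sided` and the three data points are hypothetical.

```python
import numpy as np
from scipy import stats

def ks_one_sided(x_i, cdf):
    """One-sided K-S statistic D+ = max_i (i/n - F(x_(i))) and its p-value."""
    x_sorted = np.sort(np.asarray(x_i))
    n = len(x_sorted)
    F = cdf(x_sorted)                               # hypothesized CDF at the order statistics
    D_plus = np.max(np.arange(1, n + 1) / n - F)    # statistic computed by hand
    p = stats.ksone(n).sf(D_plus)                   # only the null distribution is built in
    return D_plus, p

# Toy data (made up for illustration), tested against Uniform(0, 1)
D_plus, p = ks_one_sided([0.1, 0.2, 0.3], stats.uniform.cdf)
print(D_plus, p)
```

For this toy sample $D^+ = 0.7$, which requires all three points to fall below $0.3$, so the $p$-value is $0.3^3 = 0.027$.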

# Question 3

One-sample K-S test for discrete data: what replaces the Kolmogorov distribution?

Method in Conover section 6.1 (**A Method of Obtain the Exact $p$-Value When $F^*(x)$ is Discrete**) e.g., Example 2, is not recommended.

It is better to estimate the $p$-value by Monte Carlo, as in lesson 7.2.
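
One way such a Monte Carlo could look is sketched below. It is not a reproduction of lesson 7.2: the function names, the fair-die example and the simulation settings are all assumptions made for illustration. Since both the hypothesized CDF and the EDF only jump at support points, the statistic is evaluated there.

```python
import numpy as np

def ks_stat_discrete(sample, support, cdf_vals):
    """sup |S(x) - F*(x)|, evaluated at the support points of the discrete null."""
    edf = np.mean(np.asarray(sample)[:, None] <= support[None, :], axis=0)
    return np.max(np.abs(edf - cdf_vals))

def ks_discrete_monte_carlo(sample, support, probs, n_mc=10000, seed=0):
    """Monte Carlo p-value for the one-sample K-S statistic with a discrete null."""
    rng = np.random.default_rng(seed)
    support = np.asarray(support)
    probs = np.asarray(probs)
    cdf_vals = np.cumsum(probs)
    n = len(sample)
    D_obs = ks_stat_discrete(sample, support, cdf_vals)
    # Draw n_mc samples of size n from the null and compute the statistic for each
    sims = rng.choice(support, size=(n_mc, n), p=probs)
    edf = np.mean(sims[:, :, None] <= support[None, None, :], axis=1)
    D_sim = np.max(np.abs(edf - cdf_vals), axis=1)
    # Small tolerance so that round-off does not break exact ties with D_obs
    p = float(np.mean(D_sim >= D_obs - 1e-12))
    return D_obs, p

# Toy example (made up): are these 12 rolls consistent with a fair die?
rolls = np.array([1, 1, 2, 1, 3, 1, 2, 1, 6, 1, 2, 1])
print(ks_discrete_monte_carlo(rolls, np.arange(1, 7), np.full(6, 1 / 6)))
```

Using `>=` matters here: with discrete data the simulated statistic often equals the observed one exactly, and those cases belong in the $p$-value.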

# Review

...

In [1]:
%matplotlib inline

In [26]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import itertools
plt.rcParams['figure.figsize'] = (8.0,5.0)
plt.rcParams['font.size'] = 14

## Kruskal-Wallis

Under the null hypothesis, the statistic $T$ is approximately chi-squared distributed with $k-1$ degrees of freedom.
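
As a sketch of what the function below computes (notation inferred from the code, not quoted from the lessons): with $R_i$ the rank sum of sample $i$, $R_r$ the rank of the $r$-th pooled observation, and $\bar{R} = (N+1)/2$,

$$T = (N-1)\,\frac{\sum_{i=1}^{k} (R_i - n_i\bar{R})^2/n_i}{\sum_{r=1}^{N} (R_r-\bar{R})^2}$$

Without ties the denominator equals $N(N^2-1)/12$, and this reduces to $T = \frac{12}{N(N+1)}\sum_{i=1}^{k} (R_i - n_i\bar{R})^2/n_i$. With ties, the midranks make the denominator smaller, which is how the tie correction enters without any special-case code.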

In [27]:
def kruskal_wallis(x_i_j):
    n_i = np.array([len(xi_j) for xi_j in x_i_j])
    k = len(n_i)
    N = np.sum(n_i)
    
    x_r = np.concatenate(x_i_j)
    R_r = stats.rankdata(x_r)
    i_r = np.concatenate([(i,)*n_i[i] for i in range(k)])
    R_i_j = [R_r[i_r==i] for i in range(k)]
    R_i = np.array([np.sum(Ri_j) for Ri_j in R_i_j])
    
    Rbar = 0.5*(N+1)
    T = (N-1) * np.sum((R_i-n_i*Rbar)**2/n_i) / np.sum((R_r-Rbar)**2)
    
    p = stats.chi2(df=k-1).sf(T)
    
    return T, p

In [28]:
x_i_j = [
    np.array([ 14.97,   5.80,  25.03,   5.50 ]),
    np.array([  5.83,  13.96,  21.96]),
    np.array([ 17.89,  23.03,  61.09,   18.62,  55.51])
]
kruskal_wallis(x_i_j)

(4.153846153846154, 0.12531520484413722)

## Conover-Iman

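Under the null hypothesis, each pairwise statistic is compared with Student's $t$ distribution with $N-k$ degrees of freedom.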

In [29]:
def conover_iman(newx_i_j):
    n_i = np.array([len(xi_j) for xi_j in newx_i_j])
    k = len(n_i)
    N = np.sum(n_i)
    
    # Rank the pooled data (midranks for ties) and split the ranks back by sample
    newx_r = np.concatenate(newx_i_j)
    newR_r = stats.rankdata(newx_r)
    i_r = np.concatenate([(i,)*n_i[i] for i in range(k)])
    newR_i_j = [newR_r[i_r==i] for i in range(k)]
    newR_i = np.array([np.sum(Ri_j) for Ri_j in newR_i_j])
    
    Rbar = 0.5*(N+1)
    newSsq = np.sum((newR_r-Rbar)**2)/(N-1)
    newRbar_i = newR_i/n_i
    
    # Kruskal-Wallis statistic and its chi-squared p-value
    newT = (N-1) * np.sum((newR_i-n_i*Rbar)**2/n_i) / np.sum((newR_r-Rbar)**2)
    p = stats.chi2(df=k-1).sf(newT)
    
    # Pairwise statistics for every pair of samples, with two-sided t p-values
    newT_ii = (newRbar_i[:,None]-newRbar_i[None,:])/np.sqrt(newSsq*(N-1-newT)/(N-k)*(1/n_i[:,None]+1/n_i[None,:]))
    ps = 2*stats.t(df=N-k).sf(np.abs(newT_ii))
    
    return (newT, p), (newT_ii, ps)

In [30]:
newx_i_j = [
    np.array([ 14.97,   5.80,  15.03,   5.50 ]),
    np.array([  5.83,  13.96,  21.96 ]),
    np.array([ 27.89,  23.03,  61.09,   18.62,  55.51 ])
]
conover_iman(newx_i_j)

((7.476923076923077, 0.023790676031139445),
 (array([[ 0.        , -0.87060544, -4.16315718],
         [ 0.87060544,  0.        , -2.91360309],
         [ 4.16315718,  2.91360309,  0.        ]]),
  array([[1.        , 0.40659156, 0.00243626],
         [0.40659156, 1.        , 0.01721003],
         [0.00243626, 0.01721003, 1.        ]])))

## Conover Squared-Ranks

...

In [31]:
def conover_squared_ranks_k_sample(x_i_j):
    n_i = np.array([len(xi_j) for xi_j in x_i_j])
    k = len(n_i)
    N = np.sum(n_i)

    # Absolute deviations of each observation from its own sample mean
    xbar_i = np.array([np.mean(xi_j) for xi_j in x_i_j])
    U_i_j = [np.abs(xi_j-xbar) for xi_j, xbar in zip(x_i_j, xbar_i)]
    
    U_r = np.concatenate(U_i_j) 
    RU_r = stats.rankdata(U_r)
    i_r = np.concatenate([(i,)*n_i[i] for i in range(k)])
    RU_i_j = [RU_r[i_r==i] for i in range(k)]
    
    S_i = np.array([np.sum(RUi_j**2) for RUi_j in RU_i_j])
    Sbar = np.mean(RU_r**2)
    
    Dsq = N/(N-1)*np.mean((RU_r**2-Sbar)**2)
    T = np.sum((S_i-n_i*Sbar)**2/n_i)/Dsq
    
    p = stats.chi2(df=k-1).sf(T)
    
    return T, p

In [32]:
x_i_j = [
    np.array([ 14.97,   5.80,  25.03,   5.50 ]),
    np.array([  5.83,  13.96,  21.96]),
    np.array([ 17.89,  23.03,  61.09,   18.62,  55.51])
]
conover_squared_ranks_k_sample(x_i_j)

(7.436484543493889, 0.024276602034339175)

In [33]:
x_i = np.array([8.56, 5.03, 48.1, 1.31, 4.82]); y_j = np.array([15.0, 12.3, 28.0, 13.9])
x_i_j = [x_i, y_j]
conover_squared_ranks_k_sample(x_i_j)

(2.3133164235890935, 0.1282701146119386)

## Pearson’s $r$

...

In [34]:
def get_pearsons_r(x_i, y_i):
    xbar = np.mean(x_i)
    ybar = np.mean(y_i)
    
    return np.sum((x_i-xbar)*(y_i-ybar))/np.sqrt(np.sum((x_i-xbar)**2)*np.sum((y_i-ybar)**2))

In [35]:
x_i = np.array([9.64, 5.91, 3.22, 2.04, 5.49, 9.24, 6.38, 7.79, 0.48, 8.86])
y_i = np.array([5.53, 3.48, 3.16, 2.98, 7.11, 7.75, 3.37, 8.24, 3.00, 3.75])

get_pearsons_r(x_i, y_i)

0.5901002196595794

## Spearman’s $\rho$

In [36]:
def get_spearmans_rho(x_i, y_i):
    Rx_i = stats.rankdata(x_i)
    Ry_i = stats.rankdata(y_i)
    
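    # Both rank vectors have mean (n+1)/2, so a single Rbar serves for x and y
    Rbar = np.mean(Rx_i)
    
    # Pearson's r applied to the ranks (the denominator is a scalar)
    rho = np.sum((Rx_i-Rbar)*(Ry_i-Rbar) / np.sqrt(np.sum((Rx_i-Rbar)**2)*np.sum((Ry_i-Rbar)**2)))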
    
    return rho

In [37]:
get_spearmans_rho(x_i, y_i)

0.7333333333333334
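
For a $p$-value to go with $\rho$, one option in the spirit of Question 2 is a permutation Monte Carlo: under independence, every pairing of the $y$ ranks with the $x$ ranks is equally likely. The sketch below is an illustration only; the helper name, the number of permutations and the seed are assumptions, and `np.corrcoef` on the ranks stands in for the hand-written formula above.

```python
import numpy as np
from scipy import stats

def spearman_permutation_p(x_i, y_i, n_mc=5000, seed=0):
    """Two-sided permutation p-value for Spearman's rho (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    Rx_i = stats.rankdata(x_i)
    Ry_i = stats.rankdata(y_i)
    rho_obs = np.corrcoef(Rx_i, Ry_i)[0, 1]        # Pearson's r on the ranks
    # Shuffle the y ranks to break any association with x
    rho_sim = np.array([np.corrcoef(Rx_i, rng.permutation(Ry_i))[0, 1]
                        for _ in range(n_mc)])
    # Tolerance guards against round-off when a permutation reproduces rho_obs
    p = float(np.mean(np.abs(rho_sim) >= np.abs(rho_obs) - 1e-12))
    return rho_obs, p

x_i = np.array([9.64, 5.91, 3.22, 2.04, 5.49, 9.24, 6.38, 7.79, 0.48, 8.86])
y_i = np.array([5.53, 3.48, 3.16, 2.98, 7.11, 7.75, 3.37, 8.24, 3.00, 3.75])
print(spearman_permutation_p(x_i, y_i))
```

Because the ranks are permuted as they are, ties in the data are carried through the simulation automatically.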

## Kendall’s $\tau$

In [38]:
def get_kendall_tau(x_i, y_i):
    def is_concordant(pt1,pt2):
        return ((pt1[0]>pt2[0])&(pt1[1]>pt2[1])|(pt1[0]<pt2[0])&(pt1[1]<pt2[1]))
    def is_discordant(pt1,pt2):
        return ((pt1[0]>pt2[0])&(pt1[1]<pt2[1])|(pt1[0]<pt2[0])&(pt1[1]>pt2[1]))
    
    assert len(x_i) == len(y_i)
    n = len(x_i)
    
    Nc = np.sum([is_concordant((x_i[i],y_i[i]),(x_i[j],y_i[j])) for (i,j) in itertools.combinations(range(n),2)])
    Nd = np.sum([is_discordant((x_i[i],y_i[i]),(x_i[j],y_i[j])) for (i,j) in itertools.combinations(range(n),2)])
    
    tau = (Nc-Nd)/(Nc+Nd)
    
    return tau

In [39]:
get_kendall_tau(x_i, y_i)

0.5555555555555556

## Friedman 

...

In [41]:
def get_friedman(X_ij):
    b, k = np.shape(X_ij)
    
    R_ij = stats.rankdata(X_ij,axis=-1)
    R_j = np.sum(R_ij,axis=0)
    
    T1 = (12/(b*k*(k+1)))*np.sum((R_j-0.5*b*(k+1))**2)
    p = stats.chi2(df=k-1).sf(T1)
    
    return T1, p

In [49]:
X_ij = np.array([[  2.  ,  50.0,   9.17],
                 [  100005,   3.1 ,   3.34],
                 [  0.14,  25.4 ,  26.59],
                 [ 14.6 ,   -700,  10.95]])
get_friedman(X_ij)

(0.5, 0.7788007830714049)