# Hypothesis Tests for Correlations and Distributions

This tutorial assumes you have had a basic [introduction](https://github.com/astrosheila/hypothesistests/blob/master/BasicStatsI.pdf) to statistics.

Author: Sheila Kannappan
Last Modified: November 2019

## Setup

If you're looking at this notebook, you've presumably already followed these instructions. Please take a moment to complete any that you have not yet finished.

 * go to https://jupyter.org/try
 * click "Try JupyterLab"
 * close open tabs in the Lab (not necessary, just less confusing)
 * open a terminal in the Lab (File>New>Terminal)
 * paste the following into the terminal to get the Jupyter notebook:<br/>
  `wget https://github.com/astrosheila/hypothesistests/raw/master/hypothesistests.ipynb -P /home/jovyan/demo`
 * paste the following into the terminal to get the input file:<br/>
  `wget https://github.com/astrosheila/hypothesistests/raw/master/anscombe.txt -P /home/jovyan/demo`
 * if necessary, click the refresh button (curled arrow) at the top of the webpage
 * open the two downloaded files by double-clicking on them in the file browser
 * you can run or re-run individual cells in the notebook by clicking on them and typing Ctrl-Enter

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# ipython "magic" to enable static plot output directly to notebook
%matplotlib inline

## Correlation Tests

Correlation tests are a special case of hypothesis tests, which:

* Need not involve a model; may be “non-parametric”
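* Return a p-value: the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true (loosely, the “probability of the null hypothesis”)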

For correlation tests, the null hypothesis is that the two data sets have no association.
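
One way to see what the returned probability means is to build the null distribution by hand. The sketch below is only an illustration, not part of the original exercise: it uses made-up toy data (the slope, noise level, seed, and `nshuffle` are all arbitrary assumptions), shuffles `y` to destroy any association with `x`, and counts how often the shuffled data give a Spearman coefficient at least as extreme as the observed one. That fraction should land close to the 2-sided p-value reported by `stats.spearmanr`.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# hypothetical toy data: y depends weakly on x, plus Gaussian noise
x = rng.uniform(0.0, 10.0, size=30)
y = 0.15 * x + rng.normal(0.0, 1.0, size=30)

# observed coefficient and the analytic 2-sided p-value from scipy
cc_obs, pnull = stats.spearmanr(x, y)

# null distribution: shuffling y removes any real x-y association
nshuffle = 2000
cc_null = np.empty(nshuffle)
for i in range(nshuffle):
    cc_null[i], _ = stats.spearmanr(x, rng.permutation(y))

# fraction of shuffles at least as extreme as the observed coefficient
p_perm = np.mean(np.abs(cc_null) >= np.abs(cc_obs))

print("observed Spearman coefficient %0.3f" % cc_obs)
print("scipy p-value %0.4f, permutation p-value %0.4f" % (pnull, p_perm))
```

The shuffling approach makes no assumption about the shape of the underlying distributions, which is the sense in which rank-based tests can be "non-parametric".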

In [None]:
# pearson vs. spearman rank [and kendall tau] correlation tests
data=np.loadtxt("anscombe.txt")
# four data sets all of which have linear fits y = 3.00 + 0.500x
# and nearly identical statistics (mean, sigma, linear correlation coeff.)
# used to illustrate the importance of LOOKING AT YOUR DATA
x1=data[:,0]
y1=data[:,1]
x2=data[:,2]
y2=data[:,3]
x3=data[:,4]
y3=data[:,5]
x4=data[:,6]
y4=data[:,7]
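
The claim in the comments above (nearly identical statistics for all four data sets) can be checked numerically. The sketch below is self-contained, so it types in the standard published values of Anscombe's quartet rather than reading the file; it is an assumption that `anscombe.txt` holds these same values in the same order. The dictionary keys simply reuse the plot titles from this notebook.

```python
import numpy as np

# standard published Anscombe values (assumed to match anscombe.txt)
x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
quartet = {
    "standard": (x123, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                                 7.24, 4.26, 10.84, 4.82, 5.68])),
    "curved": (x123, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                               6.13, 3.10, 9.13, 7.26, 4.74])),
    "outlier": (x123, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                                6.08, 5.39, 8.15, 6.42, 5.73])),
    "garbage": (x4, np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                              5.25, 12.50, 5.56, 7.91, 6.89])),
}

summary = {}
for name, (x, y) in quartet.items():
    slope, intercept = np.polyfit(x, y, 1)  # least-squares straight line
    r = np.corrcoef(x, y)[0, 1]             # linear (Pearson) correlation coeff.
    summary[name] = (slope, intercept, r)
    print("%-8s mean y %0.2f  sigma y %0.2f  r %0.3f  fit y = %0.2f + %0.3fx"
          % (name, np.mean(y), np.std(y, ddof=1), r, intercept, slope))
```

The printed rows should agree to within rounding, even though the plots look nothing alike.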

In [None]:
# standard data set
plt.plot(x1, y1,'g.',markersize=10)
testxvals=np.array([3.,7.,11.,15.,19.]) # use to make line
plt.plot(testxvals,3.+0.5*testxvals,'r',linestyle=':',linewidth=2.)
rms=np.sqrt(np.mean((y1-(3.+0.5*x1))**2))
plt.text(3,12,'rms %0.2f' % rms,size=11,color='b')
# notice how the "plt.text" command works
plt.xlim(2,20)
plt.ylim(2,14)
plt.title('standard')

In [None]:
# curved data set
plt.plot(x2, y2,'g.',markersize=10)
plt.plot(testxvals,3.+0.5*testxvals,'r',linestyle=':',linewidth=2.)
rms=np.sqrt(np.mean((y2-(3.+0.5*x2))**2))
plt.text(3,12,'rms %0.2f' % rms,size=11,color='b')
plt.xlim(2,20)
plt.ylim(2,14)
plt.title('curved')

In [None]:
# bad outlier data set
plt.plot(x3, y3,'g.',markersize=10)
plt.plot(testxvals,3.+0.5*testxvals,'r',linestyle=':',linewidth=2.)
rms=np.sqrt(np.mean((y3-(3.+0.5*x3))**2))
plt.text(3,12,'rms %0.2f' % rms,size=11,color='b')
plt.xlim(2,20)
plt.ylim(2,14)
plt.title('outlier')
plt.xlabel('x')
plt.ylabel('y')

In [None]:
# garbage data set
plt.plot(x4, y4,'g.',markersize=10)
plt.plot(testxvals,3.+0.5*testxvals,'r',linestyle=':',linewidth=2.)
rms=np.sqrt(np.mean((y4-(3.+0.5*x4))**2))
plt.text(3,12,'rms %0.2f' % rms,size=11,color='b')
plt.xlim(2,20)
plt.ylim(2,14)
plt.title('garbage')

In [None]:
# define sigma symbol as a string for use on plots
sigmasym=r'$\sigma$'

In [None]:
plt.plot(x1, y1,'g.',markersize=10)
testxvals=np.array([3.,7.,11.,15.,19.]) # use to make line
plt.plot(testxvals,3.+0.5*testxvals,'r',linestyle=':',linewidth=2.)
rms=np.sqrt(np.mean((y1-(3.+0.5*x1))**2))
plt.text(3,12,'rms %0.2f' % rms,size=11,color='b')
# notice how the "plt.text" command works
plt.xlim(2,20)
plt.ylim(2,14)
plt.title('standard')
plt.ylabel('y')

print(" ")
print("Standard:")

# Spearman rank correlation test
cc,pnull=stats.spearmanr(x1,y1)
# pnull is returned as a 2-sided p-value by spearmanr, pearsonr, kendalltau
# print info to screen
print("Spearman rank correlation coefficient %f" % cc)
print("Spearman rank probability of no correlation %f" % pnull)
# convert pnull to equivalent confidence expressed as # sigma for Gaussian
confidence=stats.norm.interval(1-pnull) # fill in with enclosed prob. (0 to 1)
# note that by default "interval" assumes a Gaussian of mean 0 and sigma 1
# returns 2-sided upper & lower c.i. bounds
# add expression of confidence as # sigma to plot at position (x,y)=(8.5,3)
leveltext='Spearman rank %0.1f' % confidence[1]
plt.text(8.5,3,leveltext+sigmasym, size=11, color='b')

# Pearson correlation test
cc,pnull=stats.pearsonr(x1,y1)
print("Pearson correlation coefficient %f" % cc)
print("Pearson probability of no correlation %f" % pnull)
confidence=stats.norm.interval(1-pnull)
leveltext='Pearson %0.1f' % confidence[1]
plt.text(8.5,4,leveltext+sigmasym, size=11, color='b')

# now try adding stats.kendalltau by analogy
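
The conversion from `pnull` to a number of sigma used above can be checked against familiar values. In the sketch below, `p_to_sigma` is a hypothetical helper name introduced only for illustration. It computes the same quantity as `stats.norm.interval(1-pnull)[1]` using the inverse survival function `stats.norm.isf`, with half of the 2-sided p-value in each tail.

```python
import scipy.stats as stats

def p_to_sigma(pnull):
    """Convert a 2-sided p-value to the equivalent number of Gaussian sigma."""
    # each tail of the unit Gaussian holds half of the 2-sided probability
    return stats.norm.isf(pnull / 2.0)

# 2-sided p-values that correspond roughly to 1, 2, and 3 sigma
for p in (0.3173, 0.0455, 0.0027):
    lower, upper = stats.norm.interval(1 - p)
    print("p = %0.4f -> interval upper bound %0.2f, isf %0.2f"
          % (p, upper, p_to_sigma(p)))
```

Both routes should print the same number of sigma. One possible reason to prefer the `isf` form is that `1-pnull` rounds to exactly 1 in floating point when `pnull` is extremely small, in which case `interval` returns infinity.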

In [None]:
# multi-panel plotting (see http://matplotlib.org/users/pyplot_tutorial.html)
plt.figure(1,figsize=(12, 8))
plt.clf()

# standard data set
ax1=plt.subplot(221) # 2 rows x 2 columns of plots, 1st plot (top left)
plt.plot(x1, y1,'g.',markersize=10)
testxvals=np.array([3.,7.,11.,15.,19.]) # use to make line
plt.plot(testxvals,3.+0.5*testxvals,'r',linestyle=':',linewidth=2.)
rms=np.sqrt(np.mean((y1-(3.+0.5*x1))**2))
plt.text(3,12,'rms %0.2f' % rms,size=11,color='b')
# notice how the "plt.text" command works
plt.xlim(2,20)
plt.ylim(2,14)
plt.title('standard')
plt.ylabel('y')
plt.setp(ax1.get_xticklabels(), visible=False) # hide its xlabels

print(" ")
print("Standard:")
# Spearman rank correlation test
cc,pnull=stats.spearmanr(x1,y1)
print("Spearman rank correlation coefficient %f" % cc)
print("Spearman rank probability of no correlation %f" % pnull)
# convert pnull to equivalent confidence expressed as # sigma for Gaussian
confidence=stats.norm.interval(1-pnull) # fill in with enclosed prob. (0 to 1)
# note that by default "interval" assumes a Gaussian of mean 0 and sigma 1
# returns 2-sided upper & lower c.i. bounds
# add expression of confidence as # sigma to plot at position (x,y)=(8.5,3)
leveltext='Spearman rank %0.1f' % confidence[1]
plt.text(8.5,3,leveltext+sigmasym, size=11, color='b')
# Pearson correlation test
cc,pnull=stats.pearsonr(x1,y1)
print("Pearson correlation coefficient %f" % cc)
print("Pearson probability of no correlation %f" % pnull)
confidence=stats.norm.interval(1-pnull)
leveltext='Pearson %0.1f' % confidence[1]
plt.text(8.5,4,leveltext+sigmasym, size=11, color='b')
# Kendall tau correlation test
cc,pnull=stats.kendalltau(x1,y1)
print("Kendall tau correlation coefficient %f" % cc)
print("Kendall tau probability of no correlation %f" % pnull)
confidence=stats.norm.interval(1-pnull)
leveltext='Kendall tau %0.1f' % confidence[1]
plt.text(8.5,5,leveltext+sigmasym, size=11, color='b')

# curved data set
ax2=plt.subplot(222) # 2 rows x 2 columns of plots, 2nd plot (top right)
plt.plot(x2, y2,'g.',markersize=10)
plt.plot(testxvals,3.+0.5*testxvals,'r',linestyle=':',linewidth=2.)
rms=np.sqrt(np.mean((y2-(3.+0.5*x2))**2))
plt.text(3,12,'rms %0.2f' % rms,size=11,color='b')
plt.xlim(2,20)
plt.ylim(2,14)
plt.title('curved')
plt.setp(ax2.get_xticklabels(), visible=False) # hide its xlabels
plt.setp(ax2.get_yticklabels(), visible=False) # hide its ylabels

print(" ")
print("Curved:")
# Spearman rank correlation test
cc,pnull=stats.spearmanr(x2,y2)
print("Spearman rank correlation coefficient %f" % cc)
print("Spearman rank probability of no correlation %f" % pnull)
# convert pnull to equivalent confidence expressed as # sigma for Gaussian
confidence=stats.norm.interval(1-pnull) # fill in with enclosed prob. (0 to 1)
# note that by default "interval" assumes a Gaussian of mean 0 and sigma 1
# returns 2-sided upper & lower c.i. bounds
# add expression of confidence as # sigma to plot at position (x,y)=(8.5,3)
leveltext='Spearman rank %0.1f' % confidence[1]
plt.text(8.5,3,leveltext+sigmasym, size=11, color='b')
# Pearson correlation test
cc,pnull=stats.pearsonr(x2,y2)
print("Pearson correlation coefficient %f" % cc)
print("Pearson probability of no correlation %f" % pnull)
confidence=stats.norm.interval(1-pnull)
leveltext='Pearson %0.1f' % confidence[1]
plt.text(8.5,4,leveltext+sigmasym, size=11, color='b')
# Kendall tau correlation test
cc,pnull=stats.kendalltau(x2,y2)
print("Kendall tau correlation coefficient %f" % cc)
print("Kendall tau probability of no correlation %f" % pnull)
confidence=stats.norm.interval(1-pnull)
leveltext='Kendall tau %0.1f' % confidence[1]
plt.text(8.5,5,leveltext+sigmasym, size=11, color='b')

# bad outlier data set
ax3=plt.subplot(223) # 2 rows x 2 columns of plots, 3rd plot (bottom left)
plt.plot(x3, y3,'g.',markersize=10)
plt.plot(testxvals,3.+0.5*testxvals,'r',linestyle=':',linewidth=2.)
rms=np.sqrt(np.mean((y3-(3.+0.5*x3))**2))
plt.text(3,12,'rms %0.2f' % rms,size=11,color='b')
plt.xlim(2,20)
plt.ylim(2,14)
plt.title('outlier')
plt.xlabel('x')
plt.ylabel('y')

print(" ")
print("Outlier:")
# Spearman rank correlation test
cc,pnull=stats.spearmanr(x3,y3)
print("Spearman rank correlation coefficient %f" % cc)
print("Spearman rank probability of no correlation %f" % pnull)
# convert pnull to equivalent confidence expressed as # sigma for Gaussian
confidence=stats.norm.interval(1-pnull) # fill in with enclosed prob. (0 to 1)
# note that by default "interval" assumes a Gaussian of mean 0 and sigma 1
# returns 2-sided upper & lower c.i. bounds
# add expression of confidence as # sigma to plot at position (x,y)=(8.5,3)
leveltext='Spearman rank %0.1f' % confidence[1]
plt.text(8.5,3,leveltext+sigmasym, size=11, color='b')
# Pearson correlation test
cc,pnull=stats.pearsonr(x3,y3)
print("Pearson correlation coefficient %f" % cc)
print("Pearson probability of no correlation %f" % pnull)
confidence=stats.norm.interval(1-pnull)
leveltext='Pearson %0.1f' % confidence[1]
plt.text(8.5,4,leveltext+sigmasym, size=11, color='b')
# Kendall tau correlation test
cc,pnull=stats.kendalltau(x3,y3)
print("Kendall tau correlation coefficient %f" % cc)
print("Kendall tau probability of no correlation %f" % pnull)
confidence=stats.norm.interval(1-pnull)
leveltext='Kendall tau %0.1f' % confidence[1]
plt.text(8.5,5,leveltext+sigmasym, size=11, color='b')

# garbage data set
ax4=plt.subplot(224)
plt.plot(x4, y4,'g.',markersize=10)
plt.plot(testxvals,3.+0.5*testxvals,'r',linestyle=':',linewidth=2.)
rms=np.sqrt(np.mean((y4-(3.+0.5*x4))**2))
plt.text(3,12,'rms %0.2f' % rms,size=11,color='b')
plt.xlim(2,20)
plt.ylim(2,14)
plt.title('garbage')
plt.xlabel('x')
plt.setp(ax4.get_yticklabels(), visible=False) # hide its ylabels

print(" ")
print("Garbage:")
# Spearman rank correlation test
cc,pnull=stats.spearmanr(x4,y4)
print("Spearman rank correlation coefficient %f" % cc)
print("Spearman rank probability of no correlation %f" % pnull)
# convert pnull to equivalent confidence expressed as # sigma for Gaussian
confidence=stats.norm.interval(1-pnull) # fill in with enclosed prob. (0 to 1)
# note that by default "interval" assumes a Gaussian of mean 0 and sigma 1
# returns 2-sided upper & lower c.i. bounds
# add expression of confidence as # sigma to plot at position (x,y)=(8.5,3)
leveltext='Spearman rank %0.1f' % confidence[1]
plt.text(8.5,3,leveltext+sigmasym, size=11, color='b')
# Pearson correlation test
cc,pnull=stats.pearsonr(x4,y4)
print("Pearson correlation coefficient %f" % cc)
print("Pearson probability of no correlation %f" % pnull)
confidence=stats.norm.interval(1-pnull)
leveltext='Pearson %0.1f' % confidence[1]
plt.text(8.5,4,leveltext+sigmasym, size=11, color='b')
# Kendall tau correlation test
cc,pnull=stats.kendalltau(x4,y4)
print("Kendall tau correlation coefficient %f" % cc)
print("Kendall tau probability of no correlation %f" % pnull)
confidence=stats.norm.interval(1-pnull)
leveltext='Kendall tau %0.1f' % confidence[1]
plt.text(8.5,5,leveltext+sigmasym, size=11, color='b')
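
The cell above repeats the same three tests for each panel. As a sketch of one way to reduce the repetition, the tests can be collected in a dictionary and run in a loop. The function name `correlation_report` and the toy data are assumptions made for illustration, not part of the original notebook; the toy `y` values are squares of `x` with the last two swapped, so the relation is nonlinear and almost (but not perfectly) monotonic.

```python
import numpy as np
import scipy.stats as stats

def correlation_report(x, y):
    """Run three correlation tests; return {name: (coefficient, pnull, nsigma)}."""
    tests = {"Spearman rank": stats.spearmanr,
             "Pearson": stats.pearsonr,
             "Kendall tau": stats.kendalltau}
    results = {}
    for name, test in tests.items():
        cc, pnull = test(x, y)  # each returns (coefficient, 2-sided p-value)
        # equivalent confidence expressed as # sigma for a unit Gaussian
        nsigma = stats.norm.interval(1 - pnull)[1]
        results[name] = (cc, pnull, nsigma)
    return results

# hypothetical toy data: squares of x, with the last two values swapped
x = np.arange(1.0, 12.0)
y = np.array([1, 4, 9, 16, 25, 36, 49, 64, 81, 121, 100], dtype=float)

results = correlation_report(x, y)
for name, (cc, pnull, nsigma) in results.items():
    print("%s correlation coefficient %f" % (name, cc))
    print("%s probability of no correlation %g (%0.1f sigma)" % (name, pnull, nsigma))
```

Inside a plotting loop, the same dictionary could feed the `plt.text` labels, with the vertical position stepped for each test.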

Now that you've seen the plots, take a look at the companion slides on correlations [here](https://github.com/astrosheila/hypothesistests/blob/master/BasicStatsII.pdf).