# Correlation Metrics

## Authors
B.W. Holwerda

## Learning Goals
* Learn about different correlation metrics
* Pearson, Spearman and Kendall tau rankings
* Correlation is not causation
* But strong correlation does imply a common origin.

## Keywords
Pearson ranking, Spearman ranking, Kendall tau ranking, correlation

## Companion Content


## Summary
One of the first steps when working with large data samples in physics is to determine whether the measured quantities are correlated. The Pearson, Spearman and Kendall rankings are all correlation metrics, and each comes with a significance estimate.

<hr>


## Student Name and ID:



## Date:

<hr>

### Galaxy properties.

In Assignment 5, we encountered a catalog of galaxies. We will now see which of the galaxy properties are related, how strongly and how significantly.


0 - GAMA CATAID

1 - Stellar mass (log10 solar masses)

2 - u-r colour

3 - Sérsic index (log10)

4 - Half-light radius (log10 kpc)

5 - Specific star formation rate (log10 Gyr^-1)



In [3]:
import matplotlib.pyplot as plt
from astropy.io import ascii
from scipy.stats import spearmanr
from scipy.stats import pearsonr
from scipy.stats import kendalltau
import numpy as np


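# Alternative reader using astropy (left commented out):
# data = ascii.read("GAMA.csv", format='csv', names=['cataid','Mstar','u-r','n','r50','sSFR'],fast_reader=False)

# Read the catalog into a NumPy structured array.
# Each column can then be accessed by name, e.g. data['Mstar'].
data = np.genfromtxt("GAMA.csv",  names=['cataid','Mstar','ur_color','n','r50','sSFR'], delimiter=',')

print(data)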

[(   6802.,  0.03344004, -0.28054242, -0.71393211, -0.99912927, 0.31430825)
 (   6821., -1.69010237, -2.0907617 ,  2.30089063, -1.6604111 , 1.08471539)
 (   6989., -1.43010011, -0.89139542, -0.8691514 , -0.87506814, 0.61295776)
 ...
 (3913968., -0.52378387, -0.61064301, -0.53414938,  0.26213536, 0.24422547)
 (3913987.,  0.78663551, -0.3127837 , -0.14638224,  0.51549851, 0.36364951)
 (3913997., -0.6167152 , -0.87555832,  0.36171549, -0.32440635, 0.77301044)]
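
The columns of a structured array like the one printed above are accessed by field name. The following is a minimal sketch on a made-up three-row table, not the GAMA catalog; the values and the variable names `toy_csv` and `toy` are assumptions for illustration only.

```python
import io
import numpy as np

# Hypothetical three-row, three-column stand-in for a catalog file
toy_csv = io.StringIO("1,0.5,-0.2\n2,1.5,0.4\n3,-0.7,0.9\n")

# names= turns each column into a named field of a structured array
toy = np.genfromtxt(toy_csv, names=['cataid', 'Mstar', 'ur_color'], delimiter=',')

print(toy.dtype.names)  # the field (column) names
print(toy['Mstar'])     # one column as a 1-D array of floats
```

The same indexing, for example `data['Mstar']`, should work on the catalog read in above, since it was loaded with the same function.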


## Spearman, Pearson and Kendall Rankings

**Spearman's ranking** is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described using a *monotonic function*.

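**Pearson ranking** (Pearson's correlation coefficient $r$) is a measure of the *linear* correlation between two variables X and Y. Unlike the other two metrics, it is computed from the values themselves rather than from their ranks.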

**Kendall $\tau$** is another measure of rank correlation: like the Spearman ranking, it assesses how *monotonic* the relationship between two variables is. Both are special cases of a general correlation coefficient.
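
To make the difference between a linear and a monotonic measure concrete, here is a small sketch on synthetic data (not from the GAMA catalog). The exponential curve is an arbitrary choice for illustration: `y` always increases with `x`, but not along a straight line, so the rank-based metrics and the Pearson metric should disagree.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

# Synthetic data: y rises monotonically with x, but along a curve
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)

# Each function returns the correlation coefficient and its p-value
r_pearson, p_pearson = pearsonr(x, y)
rho_spearman, p_spearman = spearmanr(x, y)
tau_kendall, p_kendall = kendalltau(x, y)

print(f"Pearson  r   = {r_pearson:.3f} (p = {p_pearson:.2e})")
print(f"Spearman rho = {rho_spearman:.3f} (p = {p_spearman:.2e})")
print(f"Kendall  tau = {tau_kendall:.3f} (p = {p_kendall:.2e})")
```

Because the ordering of the points is perfectly preserved, the Spearman and Kendall coefficients reach their maximum, while the Pearson coefficient stays below it.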


### Exercise 1 -- Which two parameters are the most correlated according to their Spearman ranking?

hint: run through the list of column names and use *for* loops.

In [2]:
# student work here




### Exercise 2 -- Is that correlation statistically significant? 

What is the p-value?
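
As a reminder of how strength and significance differ, the sketch below applies `spearmanr` to synthetic data (not the GAMA catalog) with the same weak underlying trend but two different sample sizes. The helper `weak_relation`, the slope of 0.2 and the sample sizes are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def weak_relation(n):
    """Draw n synthetic points that share the same weak linear trend."""
    x = rng.normal(size=n)
    y = 0.2 * x + rng.normal(size=n)
    return x, y

# Same underlying relation, very different sample sizes
rho_small, p_small = spearmanr(*weak_relation(20))
rho_large, p_large = spearmanr(*weak_relation(2000))

print(f"n =   20: rho = {rho_small:.2f}, p = {p_small:.2e}")
print(f"n = 2000: rho = {rho_large:.2f}, p = {p_large:.2e}")
```

The coefficient measures how strong the relation is; the p-value estimates how likely a coefficient at least that large would be if the two variables were unrelated, so it also depends on the number of points.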

*your answer here*

### Exercise 3 -- Did you use an 'if' statement to find the highest ranking? 

If not, what would one look like?

*your answer here*

### Exercise 4 -- Which two parameters are the most correlated according to their Pearson ranking?

hint: define the list of column names and use *for* loops.

In [3]:
# student work here


### Exercise 5 -- Now do this for the Kendall $\tau$

In [4]:
# student work here




### Exercise 6 -- Plot the most and second most correlated pairs of parameters

According to the correlation metrics above, two pairs of parameters stand out as the most strongly correlated. Plot each pair in its own figure.

In [5]:
## Plot the most correlated pair of values here:
# student work here


In [6]:
## plot the second most correlated values here:
# student work here


### Exercise 7 -- Are both correlations similar?

Are the two strongest correlations similar, or is there evidence for sub-populations? (Hint: set alpha=0.1 in the scatter plot to see the density of points.)
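
The effect of the transparency setting can be sketched on synthetic data (not the GAMA catalog). The two Gaussian clumps, the random seed and the marker size below are arbitrary assumptions, chosen only to mimic a sample with two sub-populations.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: two overlapping clumps of 5000 points each
rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(-1.0, 0.5, 5000), rng.normal(1.0, 0.5, 5000)])
y = np.concatenate([rng.normal(-1.0, 0.5, 5000), rng.normal(1.0, 0.5, 5000)])

fig, ax = plt.subplots()
# alpha=0.1 makes each point nearly transparent, so dense regions look darker
points = ax.scatter(x, y, s=5, alpha=0.1)
ax.set_xlabel("x (synthetic)")
ax.set_ylabel("y (synthetic)")
```

With fully opaque markers the two clumps would merge into one solid blob; with low alpha the denser cores remain visible.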


*your answer here*

### Exercise 8 -- Use the correlation

Do you think the u-r colour would make a good estimator for the stellar mass? Explain why or why not.

*your answer and motivation here*