# <font color='maroon'>Correlations between ranked data</font>

Sometimes, when exploring relationships between variables, the variables are [ranked](http://www.biostathandbook.com/variabletypes.html#ranked). In the LMS, we looked at an example in which oil rig wear and tear was ranked by inspection score and then assessed against failure rates.

A more universally applicable example is in the film industry. The scores given to movies by critics and viewers place movies in some order, with the most appealing or artistic movies ranked highest. [Spearman's rank correlation](http://www.biostathandbook.com/spearman.html) is used to measure the correlation between ranked variables. It can also be used when one variable is ranked and the other is continuous: we simply convert the continuous variable to ranks. For example, you might want to know whether moviegoers' scores go down when movie critics' scores go down. The Spearman rank correlation coefficient measures the strength of the relationship between these two variables. Unlike the Pearson correlation coefficient, which measures only linear relationships, the Spearman coefficient detects any monotonic relationship, linear or not.

## Spearman's rank correlation coefficient

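The Spearman correlation measure is able to detect monotonic relationships between ranked or continuous variables. A relationship is monotonic when one variable consistently increases (or consistently decreases) as the other increases, though not necessarily at a constant rate. The diagram below indicates the kinds of monotonic relationships the measure is able to detect.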

<img src="images/spearman-1-small.png" width="850" height="850"/>

Image courtesy of [Spearman's Rank-Order Correlation](https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php)

Just like the Pearson correlation coefficient, the Spearman correlation coefficient lies between -1 and +1, where 

    a correlation coefficient close to +1 indicates a strong positive relationship;

    a correlation coefficient close to -1 indicates a strong negative relationship; and

    a correlation coefficient equal to zero indicates that no monotonic relationship exists between the variables.
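
To make the difference between the two measures concrete, here is a minimal sketch using made-up data (not the movie scores used later): `y` increases with `x` along an exponential curve, so the relationship is perfectly monotonic but far from linear. The variable names are chosen for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical data: y always increases with x, but not along a straight line.
x = np.arange(1, 11)
y = np.exp(x)

pearson_r, _ = stats.pearsonr(x, y)
spearman_r, _ = stats.spearmanr(x, y)

# Reversing the direction of y gives a perfectly decreasing relationship.
spearman_neg, _ = stats.spearmanr(x, -y)

print(f"Pearson:              {pearson_r:.3f}")
print(f"Spearman:             {spearman_r:.3f}")   # 1.000
print(f"Spearman (reversed):  {spearman_neg:.3f}")  # -1.000
```

The Spearman coefficient reaches its extreme values because only the ordering of the values matters, while the Pearson coefficient stays below 1 because the points do not lie on a line.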


### The data

As a data scientist, you may be asked if film critics always get it right when they critique films. The scores they give to a movie may not reflect moviegoers' scores. To gain insight into the correlation between critic scores and user scores, we can apply Spearman's rank correlation test to the movie rankings of critics and moviegoers. We use data from <a href='https://www.theguardian.com/news/datablog/2013/jul/12/movies-audience-loved-critics-hated'>The Guardian's Datablog</a> about scores given to movies by viewers and critics. Twenty films were scored by users and critics.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats

In [None]:
%matplotlib inline

In [None]:
data = pd.read_csv('scores.csv', sep=',', index_col='Title')
data

The data is not ranked, so let's go ahead and rank each column using the `rank()` function in Pandas. The `rank` function is applied to each column in the dataframe, returning the rank of each entry within its column. Entries with the same value (ties) need special handling: by default, each tied entry receives the average of the ranks the group would otherwise occupy. You can read more on this in the Pandas documentation.

In [None]:
rdata = data.rank() 
rdata
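
As a small illustration of how `rank()` treats ties, the sketch below uses four made-up scores (the film labels `A` to `D` and the values are hypothetical, not taken from `scores.csv`). Two films share the score 70, so by default they share the average of ranks 2 and 3.

```python
import pandas as pd

# Hypothetical scores; films B and C are tied at 70.
scores = pd.Series([55, 70, 70, 90], index=["A", "B", "C", "D"])

avg_ranks = scores.rank()              # default: method='average'
min_ranks = scores.rank(method="min")  # alternative: ties take the lowest rank

print(avg_ranks.tolist())  # [1.0, 2.5, 2.5, 4.0]
print(min_ranks.tolist())  # [1.0, 2.0, 2.0, 4.0]
```

The default `average` method is the one that matches the usual definition of Spearman's coefficient when ties are present.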

In [None]:
rdata.shape

Our data is now ranked and we can apply the Spearman test to the ranks.

## Quick Visualizations

We begin with a quick visualization of the ranks of the scores from critics and users.

In [None]:
rdata.plot(kind='scatter',x='User',y='Critic')

We observe a positive monotonic relationship between the ranked data. But how strong is this relationship?

## Spearman's rank correlation test

We quantify this relationship by using the `spearmanr` function in SciPy.

In [None]:
stats.spearmanr(rdata)

The Spearman function returns a correlation coefficient and a p-value. We will learn to interpret p-values later in the module.
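
The two returned values can be unpacked into separate names. The sketch below also shows that `spearmanr` ranks its inputs internally, so raw scores and pre-ranked scores give the same coefficient, and that this coefficient equals the Pearson correlation of the ranks. The `toy` dataframe holds made-up scores for illustration, not the Guardian data.

```python
import pandas as pd
from scipy import stats

# Hypothetical scores for five films (no ties), for illustration only.
toy = pd.DataFrame({
    "User":   [62, 85, 47, 91, 70],
    "Critic": [55, 60, 30, 88, 75],
})

# Unpack the coefficient and the p-value from the raw scores.
rho_raw, p_raw = stats.spearmanr(toy["User"], toy["Critic"])

# Rank first, then apply the same test: the coefficient is unchanged.
ranked = toy.rank()
rho_ranked, p_ranked = stats.spearmanr(ranked["User"], ranked["Critic"])

# Spearman's coefficient is the Pearson correlation computed on the ranks.
pearson_on_ranks, _ = stats.pearsonr(ranked["User"], ranked["Critic"])

print(f"{rho_raw:.3f} {rho_ranked:.3f} {pearson_on_ranks:.3f}")  # 0.900 0.900 0.900
```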

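Here's how the Spearman correlation coefficient is calculated. For each film $i$, we calculate the difference between its two ranks, $d_i = \text{rank}(\text{User}_i) - \text{rank}(\text{Critic}_i)$. When there are no tied ranks, the Spearman correlation is then calculated using the formula,

$$r_s = 1 - \dfrac{6 \sum d_i^2}{n(n^2-1)},$$ where $n$ is the number of observations. When ties are present, the coefficient is instead computed as the Pearson correlation of the ranks.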
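
The formula can be checked by hand on a small example. The sketch below uses five made-up pairs of ranks with no ties (the arrays `user_rank` and `critic_rank` are hypothetical) and compares the hand calculation with SciPy's result.

```python
import numpy as np
from scipy import stats

# Hypothetical ranks for five films, with no ties.
user_rank = np.array([1, 2, 3, 4, 5])
critic_rank = np.array([2, 1, 4, 3, 5])

d = user_rank - critic_rank          # rank differences: [-1, 1, -1, 1, 0]
n = len(d)                           # number of observations
sum_d2 = int(np.sum(d ** 2))         # 4

# r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Compare with SciPy's implementation.
rho, _ = stats.spearmanr(user_rank, critic_rank)

print(f"by hand: {r_s:.3f}, scipy: {rho:.3f}")  # by hand: 0.800, scipy: 0.800
```

Here $\sum d_i^2 = 4$ and $n = 5$, so $r_s = 1 - 24/120 = 0.8$.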


#### Interpreting the results

We performed the Spearman correlation test to determine the relationship between `User` scores and `Critic` scores. The Spearman correlation coefficient returned is $r_s$ = 0.710. This is a strong positive correlation between the two rankings: in this sample, movies ranked higher by moviegoers tend also to be ranked higher by critics. Note that $r_s$ is not a percentage of agreement, so it does not mean that critics and moviegoers agree about a movie 71% of the time.

### Exercise

What are some situations in which you could apply Spearman's rank correlation test in your work? Identify a dataset you have worked with in the past, or are currently working with, and apply the test to it.

#### Enter your response here

In [None]:
#your code here

### References

Care must be taken when interpreting the correlation coefficient. See this link on <a href='https://www.stat.berkeley.edu/~rabbee/correlation.pdf'>Thirteen Ways to Look at the Correlation Coefficient.</a> 

See more on [Spearman Rank Correlation](https://rstudio-pubs-static.s3.amazonaws.com/191093_4169c5282eb145a491a5b1924941a6ba.html).

