# Assignment 1: Strength of relationships in an empirical context

---
## Background

### Problem Analysis
The assignment asks us to determine the strength of the relationship between different variables.
For this purpose, we'll use two different correlation coefficients:
- the Pearson product-moment correlation coefficient
- a rank correlation coefficient (here, Spearman's rank correlation coefficient as an example)

### What is the methods' use case?
Both methods aim to make a statement about the relationship between two (potentially dependent) variables.
##### Pearson's product-moment correlation coefficient
- In the context of this assignment, Pearson's coefficient is used to detect linear relationships between two variables
- It measures the strength of the relationship as well as its direction (e.g. both variables increase together -> positive correlation)

##### Spearman's rank correlation coefficient
- In the context of this assignment, Spearman's coefficient is used to detect monotonic (not necessarily linear) relationships between two variables
- It measures the strength and the direction of the relationship between two variables

### The Methods
A common perspective in the literature is to assume two n-dimensional random variables as input.
Under that view, both coefficients build on the covariance, which is explained next.

##### Covariance (Murphy)  
The covariance measures the degree to which two variables, each observed as an n-dimensional vector, vary together linearly. It is unbounded and can range
- from $-\infty$ (negative linear relation)
- over 0 (no linear relation)
- to $\infty$ (positive linear relation)

The covariance is the expected value of the product of the two variables' deviations from their respective expected values. As a mathematical formula:  
$$cov[X,Y] = \mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$ 
If X and Y are 1-dimensional, $cov[X,Y]$ is a scalar. For higher-dimensional variables, the covariance is described as a matrix.

**Example A:**  
X = [1,2,3,4,5,6,7,8], Y = [2,4,6,8,10,12,14,16]

$$cov[X,Y] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] = \mathbb{E}[2, 8, 18, 32, 50, 72, 98, 128] - \mathbb{E}[1,2,3,4,5,6,7,8]\,\mathbb{E}[2,4,6,8,10,12,14,16] = 51 - 4.5 \cdot 9 = 10.5$$

The numbers in this example show a positive linear relationship, and the positive covariance confirms its direction. However, the covariance does not tell us how strong the relationship is (is 10.5 a strong correlation?).
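
The calculation for Example A can be reproduced with a short NumPy sketch. It assumes population moments (division by n), which is what the hand calculation above uses; the variable names are illustrative only.

```python
import numpy as np

# Data from Example A
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2, 4, 6, 8, 10, 12, 14, 16], dtype=float)

# Definition: E[(X - E[X]) (Y - E[Y])]
cov_definition = np.mean((x - x.mean()) * (y - y.mean()))

# Shortcut: E[XY] - E[X] E[Y]
cov_shortcut = np.mean(x * y) - x.mean() * y.mean()

# np.cov divides by n - 1 by default; bias=True divides by n instead
cov_numpy = np.cov(x, y, bias=True)[0, 1]

print(cov_definition, cov_shortcut, cov_numpy)
```

All three values come out as 10.5 (up to floating-point rounding). Without `bias=True`, `np.cov` would return the sample covariance, which is slightly larger for the same data.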


The major disadvantages of the covariance (and the reason we use correlation coefficients instead) are its unboundedness, which makes the result hard to interpret because it depends on the scale of the variables, and its limitation to linear relationships.
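
To illustrate why an unbounded value is hard to interpret, the following sketch rescales Y from Example A by a factor of 100 (a hypothetical change of unit, not part of the assignment data). The helper `cov` is an illustrative name.

```python
import numpy as np

x = np.arange(1, 9, dtype=float)   # X from Example A
y = 2 * x                          # Y from Example A


def cov(a, b):
    """Population covariance E[ab] - E[a]E[b]."""
    return np.mean(a * b) - a.mean() * b.mean()


# Hypothetical unit change: express Y in a unit that is 100 times smaller
y_rescaled = 100 * y

print(cov(x, y))           # covariance in the original units
print(cov(x, y_rescaled))  # 100 times larger, although the relation is unchanged
```

The relation between the two variables is identical in both cases, yet the covariance changes by the same factor as the unit.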

##### Pearson coefficient
To counter the unboundedness of the covariance, the Pearson coefficient normalizes the covariance by the square root of the product of the variances of the individual variables, i.e. by the product of their standard deviations. As a mathematical formula:
$$corr[X,Y] = \frac{cov[X,Y]}{\sqrt{var[X]var[Y]}}$$

Through the normalization, the correlation coefficient ranges
- from -1 (perfect negative correlation)
- over 0 (no linear correlation)
- to 1 (perfect positive correlation)

**Example A:**  
X = [1,2,3,4,5,6,7,8], Y = [2,4,6,8,10,12,14,16]
$$corr[X,Y] = \frac{cov[X,Y]}{\sqrt{var[X]var[Y]}} = \frac{10.5}{\sqrt{5.25 \cdot 21}} = 1.0$$

As we can see, X and Y are linearly correlated and, in contrast to the covariance, we can now make a statement about the strength of the relationship: the variables are perfectly linearly correlated ($corr[X,Y]=1$).
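
As a cross-check, here is a minimal sketch of the formula in NumPy, again assuming population moments. The function name `pearson` is illustrative; `np.corrcoef` is NumPy's built-in equivalent.

```python
import numpy as np


def pearson(x, y):
    """Pearson coefficient: cov[X,Y] / sqrt(var[X] var[Y]), population moments."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / np.sqrt(x.var() * y.var())  # ndarray.var divides by n by default


# Data from Example A
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 6, 8, 10, 12, 14, 16]

r_manual = pearson(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # entry [0, 1] of the 2x2 correlation matrix
print(r_manual, r_numpy)
```

Both routes give a coefficient of 1 for Example A. The choice between dividing by n or n - 1 does not matter here, because the factor cancels in the ratio.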

**Example B:**  
X = [1,2,3,4,5,6,7,8], Y = [1,4,6,7,8,9,11,14]
$$corr[X,Y] = \frac{cov[X,Y]}{\sqrt{var[X]var[Y]}} = \frac{8.5}{\sqrt{5.25 \cdot 14.25}} \approx 0.98$$

The second example shows the main limitation of the Pearson coefficient. Y increases with every step of X, so the relationship is perfectly monotonic, yet the coefficient stays below 1 because the relationship is not exactly linear. The further a monotonic relationship deviates from a straight line, the lower the Pearson coefficient becomes, even though the direction of the relationship never changes.

**Assumptions**
- Variables should be continuous (at least interval-scaled), so that the expected value and the variance are meaningful
- Variables should be approximately normally distributed (especially relevant for hypothesis testing)
- The variables should have a linear relationship; Pearson cannot measure the strength of non-linear relationships
- The variables should contain no outliers, since a few extreme values can dominate the variances and distort the coefficient

##### Spearman coefficient
To also cover non-linear (but monotonic) relationships, the Spearman coefficient replaces each value by its rank: the values of each variable are sorted, and every entry is assigned its position in that order.
Afterward, the Pearson coefficient formula is applied to the ranks.

$$corr[R[X],R[Y]] = \frac{cov[R[X],R[Y]]}{\sqrt{var[R[X]]var[R[Y]]}}$$

**Example B:**
| X | Y | $R[X]$ | $R[Y]$ |
|:-:|:-:|:-----:|:-----:|
| 1 | 1 | 1     | 1     |
| 2 | 4 | 2     | 2     |
| 3 | 6 | 3     | 3     |
| 4 | 7 | 4     | 4     |
| 5 | 8 | 5     | 5     |
| 6 | 9 | 6     | 6     |
| 7 | 11| 7     | 7     |
| 8 | 14| 8     | 8     |

$$corr[R[X],R[Y]] = \frac{5.25}{\sqrt{5.25 \cdot 5.25}} = 1$$

In this example, the relationship is non-linear but monotonically increasing, hence the Spearman coefficient is 1. However, the Spearman coefficient also runs into trouble once the data is no longer monotonic.
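
The rank transformation for Example B can be sketched as follows. The helper `ranks` is a hypothetical name and assumes there are no ties, which holds for this data; with ties, tied values would usually share their average rank.

```python
import numpy as np


def ranks(values):
    """Rank 1 for the smallest value; assumes there are no ties."""
    order = np.argsort(values)                      # indices that would sort the data
    result = np.empty(len(values))
    result[order] = np.arange(1, len(values) + 1)   # position in sorted order = rank
    return result


# Data from Example B
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([1, 4, 6, 7, 8, 9, 11, 14])

rx, ry = ranks(x), ranks(y)
rho = np.corrcoef(rx, ry)[0, 1]  # Pearson formula applied to the ranks
print(rx, ry, rho)
```

Both rank vectors run from 1 to 8 in the same order, so the coefficient on the ranks is 1, matching the table.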

**Example C:**
| X | Y | $R[X]$ | $R[Y]$ |
|:-:|:-:|:-----:|:-----:|
| 1 | 1 | 1     | 1     |
| 2 | 4 | 2     | 3     |
| 3 | 6 | 3     | 5     |
| 4 | 7 | 4     | 6     |
| 5 | 5 | 5     | 4     |
| 6 | 3 | 6     | 2     |
| 7 | 11| 7     | 7     |
| 8 | 14| 8     | 8     |

$$corr[R[X],R[Y]] = \frac{3.625}{\sqrt{5.25 \cdot 5.25}} = 0.69$$

Here we can see that, even though the data is positively correlated overall, the two data points (5,5) and (6,3) break the monotonicity, which lowers the value of the correlation coefficient.
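
The value for Example C can be checked in two ways. The sketch below applies the Pearson formula to the ranks from the table and, as an addition not used in the text above, the common shortcut $1 - \frac{6 \sum d_i^2}{n(n^2-1)}$, where $d_i$ is the rank difference of observation $i$. The shortcut is only valid when there are no ties.

```python
import numpy as np

# Ranks from the Example C table
rx = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
ry = np.array([1, 3, 5, 6, 4, 2, 7, 8], dtype=float)
n = len(rx)

# Route 1: Pearson formula on the ranks
rho_pearson_on_ranks = np.corrcoef(rx, ry)[0, 1]

# Route 2: shortcut based on the squared rank differences (no ties allowed)
d = rx - ry
rho_shortcut = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

print(round(rho_pearson_on_ranks, 2), round(rho_shortcut, 2))
```

Both routes agree at roughly 0.69. The rank differences are non-zero for exactly the rows where Y leaves its increasing order.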

**Assumptions**
- Variables are in a monotonic relationship, since non-monotonic relationships cannot be measured accurately
- Variables are at least ordinal; otherwise they cannot be ranked, because no comparison between values would be possible

#### Key differences between Pearson and Spearman
- Pearson focuses on linear relationships, while Spearman measures monotonic (potentially non-linear) relationships
- Pearson assumes normally distributed variables (mainly for hypothesis testing), while Spearman makes no assumption about the distribution
- Pearson expects continuous data, while Spearman only requires ordinal data, since it converts the data into numeric ranks before applying the formula
- Pearson is sensitive to outliers, while Spearman (through the rank transformation) is far less sensitive to them
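
The difference in outlier sensitivity can be illustrated with a small sketch. The data is hypothetical (a perfectly linear series in which the last value is replaced by 100), and `ranks` is an illustrative helper that assumes no ties.

```python
import numpy as np


def ranks(values):
    """Rank 1 for the smallest value; assumes there are no ties."""
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(1, len(values) + 1)
    return result


x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y_clean = x.copy()        # perfectly linear
y_outlier = x.copy()
y_outlier[-1] = 100.0     # hypothetical outlier in the last observation

results = {}
for label, y in [("clean", y_clean), ("with outlier", y_outlier)]:
    pearson = np.corrcoef(x, y)[0, 1]
    spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]
    results[label] = (pearson, spearman)
    print(f"{label}: pearson={pearson:.2f}, spearman={spearman:.2f}")
```

With the outlier, Pearson drops from 1 to roughly 0.62, while Spearman stays at 1 because the outlier is still the largest value and therefore keeps its rank.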

---
## Solution

In [49]:
import pandas
import numpy as np

df = pandas.read_csv('data/table_1.csv', delimiter=";")
df.head()


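| Unnamed: 0 | Failure | SDL pages | Tasks | Outputs | Inputs | If | States | McCabe (design) | Ext. input | Ext. Output | Internal |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 0 | 0 | 9 | 68 | 16 | 14 | 21 | 1 | 83 | 10 | 11 | 0 |
| 1 | 0 | 14 | 76 | 33 | 34 | 30 | 2 | 131 | 21 | 21 | 0 |
| 2 | 0 | 15 | 85 | 18 | 19 | 39 | 11 | 80 | 17 | 16 | 1 |
| 3 | 0 | 18 | 68 | 24 | 19 | 59 | 13 | 99 | 18 | 18 | 2 |
| 4 | 1 | 18 | 42 | 33 | 36 | 27 | 39 | 105 | 39 | 36 | 3 |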


In [93]:
pearson_corr = df.corr(method='pearson')
# Keep only strong correlations, i.e. |r| > 0.7
relevant_p_corr = pearson_corr[pearson_corr.abs() > 0.7]
# Strict lower triangle: k=-1 drops the diagonal and the mirrored upper half
lower_triangle = np.tril(np.ones(relevant_p_corr.shape, dtype=bool), k=-1)
relevant_p_corr = relevant_p_corr.where(lower_triangle).fillna('')

print('Pearson Correlation')
relevant_p_corr

Pearson Correlation


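| Unnamed: 0 | Failure | SDL pages | Tasks | Outputs | Inputs | If | States | McCabe (design) | Ext. input | Ext. Output | Internal |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| Failure | | | | | | | | | | | |
| SDL pages | | | | | | | | | | | |
| Tasks | | 0.91189 | | | | | | | | | |
| Outputs | | | | | | | | | | | |
| Inputs | | | | 0.921258 | | | | | | | |
| If | | 0.864604 | 0.899348 | | | | | | | | |
| States | | 0.840457 | | | | | | | | | |
| McCabe (design) | | 0.820166 | 0.853165 | 0.810416 | 0.870197 | 0.740248 | | | | | |
| Ext. input | | | | 0.879832 | 0.783334 | | | | | | |
| Ext. Output | | | | 0.866593 | 0.784997 | | | 0.725436 | 0.981725 | | |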


In [94]:
spearman_corr = df.corr(method='spearman')
# Keep only strong correlations, i.e. |rho| > 0.7
relevant_s_corr = spearman_corr[spearman_corr.abs() > 0.7]
# Strict lower triangle: k=-1 drops the diagonal and the mirrored upper half
lower_triangle = np.tril(np.ones(relevant_s_corr.shape, dtype=bool), k=-1)
relevant_s_corr = relevant_s_corr.where(lower_triangle).fillna('')

print('Spearman Correlation')
relevant_s_corr

Spearman Correlation


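| Unnamed: 0 | Failure | SDL pages | Tasks | Outputs | Inputs | If | States | McCabe (design) | Ext. input | Ext. Output | Internal |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| Failure | | | | | | | | | | | |
| SDL pages | | | | | | | | | | | |
| Tasks | | 0.879025 | | | | | | | | | |
| Outputs | | 0.791068 | | | | | | | | | |
| Inputs | | 0.804303 | | 0.914841 | | | | | | | |
| If | | 0.905865 | 0.929774 | | | | | | | | |
| States | | | | | | | | | | | |
| McCabe (design) | | 0.935343 | 0.87642 | 0.803613 | 0.853388 | 0.895155 | | | | | |
| Ext. input | | 0.766786 | | 0.896372 | 0.799808 | | | 0.735012 | | | |
| Ext. Output | 0.70934 | 0.792049 | | 0.896878 | 0.82 | | | 0.782692 | 0.972607 | | |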


In [95]:
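print('Difference between Pearson and Spearman values')
# Blank cells become NaN here, so a difference only appears for pairs that are strong under both coefficients
(relevant_p_corr.apply(pandas.to_numeric, errors='coerce')-relevant_s_corr.apply(pandas.to_numeric, errors='coerce')).fillna('')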


Difference between Pearson and Spearman values


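| Unnamed: 0 | Failure | SDL pages | Tasks | Outputs | Inputs | If | States | McCabe (design) | Ext. input | Ext. Output | Internal |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| Failure | | | | | | | | | | | |
| SDL pages | | | | | | | | | | | |
| Tasks | | 0.032865 | | | | | | | | | |
| Outputs | | | | | | | | | | | |
| Inputs | | | | 0.006417 | | | | | | | |
| If | | -0.041261 | -0.030426 | | | | | | | | |
| States | | | | | | | | | | | |
| McCabe (design) | | -0.115177 | -0.023255 | 0.006803 | 0.016809 | -0.154907 | | | | | |
| Ext. input | | | | -0.01654 | -0.016474 | | | | | | |
| Ext. Output | | | | -0.030284 | -0.035003 | | | -0.057256 | 0.009119 | | |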
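
The difference table only covers pairs that are strong under both coefficients. To also list pairs that exceed the threshold under only one of them, a helper along the following lines could be used. This is a sketch: `strong_pairs` and `compare_methods` are hypothetical names, and the toy data below stands in for `df` and is not taken from the assignment's table.

```python
import numpy as np
import pandas as pd


def strong_pairs(corr, threshold=0.7):
    """Return {(row, column): value} for the strict lower triangle with |value| > threshold."""
    lower = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    stacked = corr.where(lower).stack()  # long format; NaN cells fail the test below
    return stacked[stacked.abs() > threshold].to_dict()


def compare_methods(frame, threshold=0.7):
    """Split strong pairs into those found by both methods or by only one of them."""
    p = set(strong_pairs(frame.corr(method="pearson"), threshold))
    s = set(strong_pairs(frame.corr(method="spearman"), threshold))
    return {"both": p & s, "pearson_only": p - s, "spearman_only": s - p}


# Hypothetical toy data: b is linear in a, c is only weakly related to both
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                    "b": [2, 4, 6, 8, 10, 12],
                    "c": [3, 1, 4, 1, 5, 2]})
result = compare_methods(toy)
print(result)
```

Calling `compare_methods(df)` on the assignment data would then separate the pairs that appear in both correlation tables from those that appear in only one.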


---
### Summary

There are strong positive correlations (above 0.7) between most variables. Three variables show few or no strong correlations in the tables above:
- Failure
- States
- Internal

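Most strong relations seem to be close to linear: for every pair that exceeds the threshold under both coefficients, Pearson and Spearman differ only slightly (the largest gaps are -0.15 for McCabe (design) and If, and -0.12 for McCabe (design) and SDL pages). Some pairs exceed the threshold under only one coefficient, for example Outputs and SDL pages (Spearman only) or States and SDL pages (Pearson only), which hints at relations that are monotonic rather than strictly linear, or at the influence of individual extreme values.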