# Pearson Correlation

- Pearson
- Spearman
- Kendall

<br/>

**Parameters**
$$\rho
= E\Big[\Big(\frac{Y - \mu_y}{\sigma_y}\Big) \times \Big(\frac{X - \mu_x}{\sigma_x}\Big)\Big]
= \frac{E\big[(Y - \mu_y)(X - \mu_x)\big]}{\sigma_x \sigma_y}
= \frac{\text{Cov}(X, Y)}{\sigma_x \sigma_y}$$

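**Statistics**
$$\hat{\rho} = r 
= \frac{\sum (Y_i - \bar{Y})(X_i - \bar{X}) }{s_x s_y}
= \frac{\sum X_i Y_i - n \bar{Y} \bar{X}}{s_x s_y}$$

Throughout these notes $s_x^2 = \sum (X_i - \bar{X})^2$ and $s_y^2 = \sum (Y_i - \bar{Y})^2$, i.e. sums of squared deviations without the $1/(n-1)$ factor.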
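As a quick numerical check of the two sample formulas, the sketch below evaluates both and compares them with R's built-in `cor()`. The vectors are made-up illustration values, and `s_x`, `s_y` are taken as square roots of the sums of squared deviations, which is what the formulas require.

```r
# Hypothetical illustration data (not from these notes)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
n <- length(x)

# s_x and s_y as square roots of the sums of squared deviations
s_x <- sqrt(sum((x - mean(x))^2))
s_y <- sqrt(sum((y - mean(y))^2))

# Definition form and computational (shortcut) form of r
r_def   <- sum((y - mean(y)) * (x - mean(x))) / (s_x * s_y)
r_short <- (sum(x * y) - n * mean(y) * mean(x)) / (s_x * s_y)

# Both should agree with the built-in estimate
print(c(r_def = r_def, r_short = r_short, r_builtin = cor(x, y)))
```

Using `sd()` for both `s_x` and `s_y` would require an extra factor of $n-1$ in the denominator, since the two $\sqrt{n-1}$ terms multiply.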

# Import data

In [2]:
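# input data
# NOTE: matrix() fills column by column unless byrow = TRUE is given, so the
# value pairs typed on each line below end up split across rows (see the printout)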
mat <- matrix(
    c(5.12,  2.30,
      6.18,  2.54,
      6.77,  2.95,
      6.65,  3.77,
      6.36,  4.18,
      5.90,  5.31,
      5.48,  5.53,
      6.02,  8.83,
     10.34,  9.48,
      8.51, 14.20), ncol=2)

# arrange data into dataframe
dat_chol <- data.frame(mat)
colnames(dat_chol) <- c("Cholesterol", "Tryglyceride")

print(dat_chol)

   Cholesterol Tryglyceride
1         5.12         5.90
2         2.30         5.31
3         6.18         5.48
4         2.54         5.53
5         6.77         6.02
6         2.95         8.83
7         6.65        10.34
8         3.77         9.48
9         6.36         8.51
10        4.18        14.20


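Compute the least-squares slope and intercept, the fitted values, and the estimated error variance, all of which are used in the sections below.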

In [3]:
# calculation
x <- dat_chol$Cholesterol
y <- dat_chol$Tryglyceride
n <- nrow(dat_chol)

xy    <- crossprod(x, y) ; xy <- drop(xy) # sum(x * y)
xx    <- crossprod(x, x) ; xx <- drop(xx) # sum(x * x)
sum_x <- sum(x)
sum_y <- sum(y)
bar_x  <- mean(x)
bar_y  <- mean(y)

# Parameters
beta_1 <- (xy - 1/n * sum_x * sum_y) / (xx - 1/n * sum_x * sum_x)
beta_0 <- bar_y - beta_1 * bar_x

# prediction
yhat <- beta_0 + beta_1 * x

# MSE
sig2 <- 1/(n-2) * sum((y - yhat)^2)

cat("",
    "Beta 0 hat (intercept)  = ", beta_0, "\n",
    "Beta 1 hat (slope)      = ", beta_1, "\n",
    "Estimate Error Var = ", sig2, "\n",
    "Estimate Error std = ", sig2**0.5)

 Beta 0 hat (intercept)  =  7.551103 
 Beta 1 hat (slope)      =  0.08733394 
 Estimate Error Var =  9.355567 
 Estimate Error std =  3.058687

# Pearson Correlation and $\hat{\beta_1}$

$$\hat{\beta_1} 
= \frac{
    \sum(Y_i - \bar{Y})(X_i - \bar{X})
}{
    \sum(X_i - \bar{X})(X_i - \bar{X})}$$

$$\hat{\rho} = r = \frac{
    \sum(Y_i - \bar{Y})(X_i - \bar{X})
}{
    s_x s_y}$$


<br/>

$$r 
= \frac{
    \sum(Y_i - \bar{Y})(X_i - \bar{X})
}{
    s_x s_y} 
= \frac{
    \hat{\beta_1} \sum(X_i - \bar{X})(X_i - \bar{X})
}{
    s_x s_y}
= \frac{
    \hat{\beta_1} s_x^2
}{
    s_x s_y}
= \hat{\beta_1} \frac{ s_x}{s_y}$$

In [13]:
cat("Correlation = ", cor(x, y))

Correlation =  0.05317951

In [14]:
beta_1 * sd(x) / sd(y)

# t-statistics for correlation

**SST = SSR + SSE**

SST = $\sum (Y_i-\bar{Y})^2 = s_y^2$  
SSR =
$\sum (\hat{Y_i}-\bar{Y})^2 
= \hat{\beta_1}^2 \sum (X_i - \bar{X})^2 
= (\frac{
    \sum(Y_i - \bar{Y})(X_i - \bar{X})
}{
    \sum(X_i - \bar{X})(X_i - \bar{X})})^2 \sum (X_i - \bar{X})^2
= \frac{
    \big[\sum(Y_i - \bar{Y})(X_i - \bar{X})\big]^2
}{
    \sum(X_i - \bar{X})(X_i - \bar{X})}$  

SSE = $\sum (Y_i-\hat{Y_i})^2 = (n-2)s_{y.x}^2$  

**Further derivation**

$SSR = \frac{
    \big[\sum(Y_i - \bar{Y})(X_i - \bar{X})\big]^2
}{
    \sum(X_i - \bar{X})(X_i - \bar{X})}
= \frac{
    \big[ r s_x s_y \big]^2
}{
    s_x^2
}
= \big[ r s_y \big]^2$

**Therefore, we can rewrite the equation SSE = SST - SSR as**

$(n-2)s_{y.x}^2 = s_y^2 - (rs_y)^2 = s_y^2 (1 - r^2)$

$s_{y.x}^2 = \frac{s_y^2 - (rs_y)^2}{n-2} = s_y^2 \frac{(1 - r^2)}{n-2}$

$s_{y.x} = s_y \sqrt{\frac{1 - r^2}{n-2}}$
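The identity can be checked numerically. The sketch below refits the regression with `lm()` on the values as they appear in the printed data frame and compares the residual standard error computed directly with the one obtained from $r$; the variable names are chosen here for illustration.

```r
# Data as printed in the data frame above
x <- c(5.12, 2.30, 6.18, 2.54, 6.77, 2.95, 6.65, 3.77, 6.36, 4.18)
y <- c(5.90, 5.31, 5.48, 5.53, 6.02, 8.83, 10.34, 9.48, 8.51, 14.20)
n <- length(x)

fit <- lm(y ~ x)
r   <- cor(x, y)

SST <- sum((y - mean(y))^2)            # s_y^2 in the notation of these notes
SSE <- sum(residuals(fit)^2)           # (n - 2) * s_{y.x}^2
SSR <- sum((fitted(fit) - mean(y))^2)  # (r * s_y)^2

# Residual standard error: directly, and via s_y * sqrt((1 - r^2) / (n - 2))
s_yx_direct <- sqrt(SSE / (n - 2))
s_yx_from_r <- sqrt(SST) * sqrt((1 - r^2) / (n - 2))
print(c(direct = s_yx_direct, from_r = s_yx_from_r))
```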

**Recall that the variance of $\hat{\beta_1}$ is**

var($\hat{\beta_1}$) = $\frac{s_{y.x}^2}{\sum(X_i - \bar{X})^2}$

SE($\hat{\beta_1}$) = $\frac{s_{y.x}}{\sqrt{\sum(X_i - \bar{X})^2}} = \frac{s_{y.x}}{s_x}$

$t = \frac{\hat{\beta_1}}{SE(\hat{\beta_1})} =  \frac{\hat{\beta_1} s_x}{s_{y.x}}$

**Combining the expressions above, we get**

$t = \hat{\beta_1}\frac{s_x}{s_{y.x}} = \hat{\beta_1} \frac{s_x}{s_y} \sqrt{\frac{n-2}{1 - r^2}} = r \sqrt{\frac{n-2}{1 - r^2}} = \frac{r}{\sqrt{(1 - r^2)/(n-2)}}$

In [13]:
fit <- lm(Tryglyceride ~ Cholesterol, data = dat_chol)
summary(fit)


Call:
lm(formula = Tryglyceride ~ Cholesterol, data = dat_chol)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6108 -2.2128 -0.8474  1.4551  6.2838 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  7.55110    2.88180   2.620   0.0306 *
Cholesterol  0.08733    0.57980   0.151   0.8840  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.059 on 8 degrees of freedom
Multiple R-squared:  0.002828,	Adjusted R-squared:  -0.1218 
F-statistic: 0.02269 on 1 and 8 DF,  p-value: 0.884


In [23]:
fit_param <- get_param(fit)
print(fit_param$beta)

(Intercept) Cholesterol 
 7.55110250  0.08733394 


In [15]:
x <- dat_chol$Cholesterol
y <- dat_chol$Tryglyceride
n <- nrow(dat_chol)

In [27]:
beta_1 <- fit_param$beta[2]
var_beta_1 <- fit_param$MSE / sum(( x - mean(x) )^2)
t_val <- beta_1 / (var_beta_1)^0.5
print(t_val)

Cholesterol 
  0.1506275 


In [28]:
r <- cor(x, y)
r / ((1-r^2) / (n-2))^0.5
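As a cross-check, base R's `cor.test()` reports this same t statistic for the Pearson correlation. The sketch below re-enters the values from the printed data frame so it runs on its own; `t_manual` is a name chosen here.

```r
# Data as printed in the data frame above
x <- c(5.12, 2.30, 6.18, 2.54, 6.77, 2.95, 6.65, 3.77, 6.36, 4.18)
y <- c(5.90, 5.31, 5.48, 5.53, 6.02, 8.83, 10.34, 9.48, 8.51, 14.20)
n <- length(x)

# t statistic from the formula derived above
r        <- cor(x, y)
t_manual <- r * sqrt((n - 2) / (1 - r^2))

# Built-in test of H0: rho = 0 (t distribution with n - 2 df)
ct <- cor.test(x, y, method = "pearson")

print(c(manual = t_manual, cor_test = unname(ct$statistic)))
print(ct$p.value)
```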

# Hypothesis test on correlation in one sample

**Testing whether $\rho$ is zero**

$H_0 : \rho = 0$

$H_1 : \rho \neq 0$

$t = r \sqrt{\frac{n-2}{1 - r^2}} \sim t_{n-2}$
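When only the summary values $r$ and $n$ are available, the test can be carried out with a small helper. This is a sketch: `cor_t_test` is a hypothetical function written for these notes, not part of base R, and the inputs in the usage line are illustrative.

```r
# Hypothetical helper: two-sided t test of H0: rho = 0 from r and n
cor_t_test <- function(r, n, alpha = 0.05) {
    stopifnot(abs(r) < 1, n > 2)
    t_stat <- r * sqrt((n - 2) / (1 - r^2))   # test statistic
    crit   <- qt(1 - alpha / 2, df = n - 2)   # two-sided critical value
    p_val  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
    list(t = t_stat, df = n - 2, critical = crit,
         p.value = p_val, reject = abs(t_stat) > crit)
}

# Illustrative inputs
res <- cor_t_test(r = 0.25, n = 100)
print(unlist(res))
```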

**Testing whether $\rho$ equals $\rho_0$, where $\rho_0 \neq 0$**

If $\rho$ is not zero, the sampling distribution of $r$ is skewed and not approximately normal. In this case one can use the Fisher transformation, which is approximately normally distributed:

$Z = \frac{1}{2} \ln \frac{1+r}{1-r} \sim N\Big(\frac{1}{2} \ln \frac{1+\rho}{1-\rho}, \frac{1}{n-3}\Big)$ 

Therefore, 

$\frac{ 
    \frac{1}{2} \ln \frac{1+r}{1-r} - \frac{1}{2} \ln \frac{1+\rho_0}{1-\rho_0} 
}{  
    \sqrt{\frac{1}{n-3}}
} \sim N(0, 1)$
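A minimal sketch of this test in R follows. `fisher_z_test` is a hypothetical helper, not a base R function; it standardises the difference of the Fisher transforms by the standard error $\sqrt{1/(n-3)}$ and uses `atanh()`, which equals $\frac{1}{2}\ln\frac{1+r}{1-r}$. The inputs in the usage line are illustrative.

```r
# Hypothetical helper: two-sided Fisher z test of H0: rho = rho0
fisher_z_test <- function(r, n, rho0 = 0) {
    stopifnot(abs(r) < 1, abs(rho0) < 1, n > 3)
    z_r    <- atanh(r)            # 0.5 * log((1 + r) / (1 - r))
    z_0    <- atanh(rho0)         # transform of the hypothesised value
    se     <- sqrt(1 / (n - 3))   # standard error of the transform
    z_stat <- (z_r - z_0) / se
    p_val  <- 2 * pnorm(abs(z_stat), lower.tail = FALSE)
    list(z = z_stat, p.value = p_val)
}

# Illustrative inputs: is r = 0.25 from n = 100 compatible with rho0 = 0.1?
res <- fisher_z_test(r = 0.25, n = 100, rho0 = 0.1)
print(unlist(res))
```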

<br/>

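**Confidence Interval**

With $z = z_{1-\alpha/2}$ the standard normal quantile, back-transforming the interval $\frac{1}{2} \ln \frac{1+r}{1-r} \pm \frac{z}{\sqrt{n-3}}$ to the correlation scale gives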

$$
(
    \frac{
        \frac{1+r}{1-r} e^{\frac{-2z}{\sqrt{n-3}}}-1
    }{
        \frac{1+r}{1-r} e^{\frac{-2z}{\sqrt{n-3}}}+1
    }
,
    \frac{
        \frac{1+r}{1-r} e^{\frac{2z}{\sqrt{n-3}}}-1
    }{
        \frac{1+r}{1-r} e^{\frac{2z}{\sqrt{n-3}}}+1
    }
)$$

In [5]:
r    <- cor(x, y)
tmp1 <- (1 + r) / (1 - r)
tmp2 <- exp(2 * qnorm(0.975) / (n - 3)^0.5)

cat("(", 
    (tmp1 * 1/tmp2 - 1) / (tmp1 * 1/tmp2 + 1)
    ,",", 
    (tmp1 * tmp2 - 1) / (tmp1 * tmp2 + 1)
    ,")")

( -0.5964167 , 0.660684 )
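The same interval can be written more compactly with `atanh()` and `tanh()`, since the Fisher transformation is the inverse hyperbolic tangent. The sketch below re-enters the values from the printed data frame so it runs on its own; it is an equivalent formulation, not a different method.

```r
# Data as printed in the data frame above
x <- c(5.12, 2.30, 6.18, 2.54, 6.77, 2.95, 6.65, 3.77, 6.36, 4.18)
y <- c(5.90, 5.31, 5.48, 5.53, 6.02, 8.83, 10.34, 9.48, 8.51, 14.20)
n <- length(x)
r <- cor(x, y)

# Interval on the Fisher z scale, then back-transform with tanh()
half_width <- qnorm(0.975) / sqrt(n - 3)
ci <- tanh(atanh(r) + c(-1, 1) * half_width)
print(ci)
```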

**Example**

n = 100, r = 0.25  
$H_0 : \rho = 0$  
$H_1 : \rho \neq 0$

In [11]:
n <- 100
r <- 0.25
z <- 1/2 * log((1 + r) / (1 - r))
z <- z * (n-3)**0.5 # var = 1 / (n-3)
print(z)

[1] 2.515524


In [19]:
cat("critical value C_0.05", qt(1 - 0.05 / 2, df = n-2))

critical value C_0.05 1.984467

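Since z = 2.52 > 1.98, we reject $H_0$ at the 5% level. The Fisher-transformed statistic is approximately standard normal, so the matching critical value is `qnorm(0.975)` ≈ 1.96 rather than the $t_{n-2}$ quantile used above; the conclusion is the same either way.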

In [16]:
cat("p-value", 2 * (1 - pt(z, df = n-2)))

p-value 0.01351119

# Hypothesis test on correlation in two independent samples

$H_0 : \rho_1 = \rho_2$  
$H_1 : \rho_1 \neq \rho_2$

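Let $z_i = \frac{1}{2} \ln \frac{1+r_i}{1-r_i}$ be the Fisher transform of the correlation in sample $i$. Because the samples are independent,  
var($z_1 - z_2$) = $\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}$  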
under $H_0$  
$$Z = \frac{z_1 - z_2}{\sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}}} \sim N(0, 1)$$
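A minimal sketch of the two-sample comparison follows. `compare_two_cor` is a hypothetical helper written for illustration, and the correlations and sample sizes in the usage line are made-up values.

```r
# Hypothetical helper: two-sided test of H0: rho_1 = rho_2 (independent samples)
compare_two_cor <- function(r1, n1, r2, n2) {
    stopifnot(abs(r1) < 1, abs(r2) < 1, n1 > 3, n2 > 3)
    z1 <- atanh(r1)                             # Fisher transform, sample 1
    z2 <- atanh(r2)                             # Fisher transform, sample 2
    se <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))     # sd of z1 - z2 under H0
    z_stat <- (z1 - z2) / se
    p_val  <- 2 * pnorm(abs(z_stat), lower.tail = FALSE)
    list(z = z_stat, p.value = p_val)
}

# Made-up example values
res <- compare_two_cor(r1 = 0.50, n1 = 103, r2 = 0.30, n2 = 103)
print(unlist(res))
```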

In [17]:
get_param <- function(fit){
    # collect ANOVA-style summaries from a fitted lm object
    res <- list()
    res$beta  <- fit$coefficients
    
    res$y     <- fit$fitted.values + fit$residuals
    res$yhat  <- fit$fitted.values
    res$ybar  <- mean(res$y)
    res$err   <- fit$residuals
    
    res$SST   <- crossprod(res$y - res$ybar)[1]
    res$SSE   <- crossprod(res$err)[1]
    res$SSR   <- res$SST - res$SSE
    
    res$n    <- length(res$y)
    res$p    <- res$n - fit$df.residual
    res$df_t <- res$n - 1
    res$df_e <- fit$df.residual
    res$df_r <- res$df_t - res$df_e
    
    res$MST   <- res$SST / res$df_t
    res$MSR   <- res$SSR / res$df_r
    res$MSE   <- res$SSE / res$df_e
    
    # F test: F = MSR / MSE with (df_r, df_e) degrees of freedom
    res$Fval <- res$MSR / res$MSE
    res$pval <- 1 - pf(res$Fval, df1 = res$df_r, df2 = res$df_e)
    
    res$R2   <- 1 - res$SSE / res$SST
    res$R2a  <- 1 - res$MSE / res$MST
    res$r    <- (res$R2)^0.5 # |r| only: r^2 = R2, the sign of the slope is lost
    return(res)
}

# Testing homogeneity of k correlations

In [22]:
cholesterol <- data.frame(
    n = c(277,     479,     508,     373,     216),
    r = c(-0.08,   -0.25,   -0.19,   -0.18,   -0.15),
    z = c(-0.0802, -0.2554, -0.1923, -0.1820, -0.1511)
)

In [1]:
# critical value
qchisq(1-0.05, 4)

$$C_{0.05} = 9.488$$

In [24]:
n  <- cholesterol$n
z  <- cholesterol$z
T1 <- sum((n - 3) * z)
T2 <- sum((n - 3) * (z^2))
cat("T1", T1, "\n")
cat("T2", T2)

T1 -340.181 
T2 68.60493

$$H = T_2 - \frac{T_1^2}{\sum(n_i - 3)}$$

In [33]:
cat("T2 - T1^2 / sum(n_i - 3) =", T2 - T1^2 / sum(n - 3))

T2 - T1^2 / sum(n_i - 3) = 5.643496
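The steps above can be wrapped into one function that also returns a p-value. This is a sketch: `cor_homogeneity` is a hypothetical helper, it computes the Fisher transforms with `atanh()` instead of using the tabulated `z` column (so the result differs slightly from the rounded values), and it assumes $H$ is referred to a $\chi^2$ distribution with $k - 1$ degrees of freedom, matching the critical value `qchisq(1-0.05, 4)` used for the five groups.

```r
# Hypothetical helper: test of H0: rho_1 = ... = rho_k
cor_homogeneity <- function(r, n, alpha = 0.05) {
    stopifnot(length(r) == length(n), all(abs(r) < 1), all(n > 3))
    z  <- atanh(r)             # Fisher transforms
    w  <- n - 3                # weights: inverse variances of the z_i
    T1 <- sum(w * z)
    T2 <- sum(w * z^2)
    H  <- T2 - T1^2 / sum(w)   # weighted sum of squares around the pooled z
    df <- length(r) - 1
    list(H = H, df = df,
         critical = qchisq(1 - alpha, df),
         p.value  = pchisq(H, df, lower.tail = FALSE))
}

# Same n and r as in the cholesterol data frame above
res <- cor_homogeneity(r = c(-0.08, -0.25, -0.19, -0.18, -0.15),
                       n = c(277, 479, 508, 373, 216))
print(unlist(res))
```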

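In the slide:

$$H = 68.605 - \frac{(-340.181)^2}{1853} = 6.15$$

The slide divides by $\sum n_i = 1853$ instead of $\sum (n_i - 3) = 1838$, which accounts for the difference from the value 5.64 computed above. Both values are below the critical value $C_{0.05} = 9.488$, so the hypothesis of homogeneous correlations is not rejected in either case.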

In [30]:
sum(n - 3)

In [32]:
sum(n)