# Correlations

We're now going to look at the relationships between two __numerical__ variables.  This will allow us to make this final leap from ANOVA to linear regression.  

A correlation is a numerical representation of the strength of the relationship between two numerical variables, and in a way it reflects how much the two variables move together. **HOWEVER**, correlation is not directional, and correlation does not imply causation.

<img src="images/horse.png" width="300" height="400">



The bivariate (relationship between __two__ variables) correlation tells us:
- If the association exists
- The strength of the association
- The direction of the association

Correlations specifically tell us about *__linear__* relationships between two variables.  They are represented by the lower-case letter $r$ and range from -1 to 1, where 0 means no linear association between the two variables, -1 is the strongest possible _negative_ correlation, and 1 is the strongest possible _positive_ correlation.
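
To make the range of $r$ concrete, here is a minimal sketch using simulated data (the variables, slopes, and seed are all made up for illustration; this is not part of the mpg examples). It uses the base R function `cor()` to compute $r$ for a positive, a negative, and an unrelated pair of variables.

```r
set.seed(42)                              # arbitrary seed so the sketch is reproducible
x <- rnorm(200)

y_pos  <- 2 * x + rnorm(200, sd = 0.5)    # tends to rise as x rises
y_neg  <- -2 * x + rnorm(200, sd = 0.5)   # tends to fall as x rises
y_none <- rnorm(200)                      # generated independently of x

r_pos  <- cor(x, y_pos)                   # close to +1
r_neg  <- cor(x, y_neg)                   # close to -1
r_none <- cor(x, y_none)                  # close to 0
r_line <- cor(x, 3 * x + 1)               # points exactly on a rising line give r = 1

c(positive = r_pos, negative = r_neg, none = r_none, perfect_line = r_line)
```

Notice that the sign of $r$ follows the direction of the trend, and the size of $r$ shrinks as the noise around the line grows.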

As you might have guessed, these $r$ values are related to our r-squared ($r^2$) values we've looked at previously.  $r$ is the "coefficient of correlation," and when we square that value we get $r^2$ the "coefficient of determination."  Keep in mind that while these values are related, the interpretations are different.

So, what are linear relationships?  What is a positive vs. a negative correlation?  Let's look at some scatterplots!

In [None]:
## loading some libraries!
library(tidyverse) ## all of our normal functions for working with data
library(ggcorrplot) ## make pretty corrplot
library(GGally) ## scatterplot matrix function
library(gtrendsR) ## Google Trends API

options(repr.plot.width=4, repr.plot.height=3) ## set options for plot size within the notebook -
# this is only for jupyter notebooks, you can disregard this.

In [None]:
## loading up our old friend the mpg dataset.  
## remember this is a dataset that is "built-in" to R
## this is not how you load data from elsewhere
data(mpg) #load built-in dataset
head(mpg) #peek at the first 6 observations
summary(mpg) #look at summary of variables

In [None]:
?mpg ## obtain documentation for r functions or built-in datasets using ?

First we're going to plot hwy mpg vs. cty mpg.  What type of relationship would we expect between these variables?

In [None]:
mpg  %>% ggplot (aes(x = hwy, y = cty)) +  ## the variables we want to plot
    geom_point() ## the type of plot we want

This graph shows an example of positive correlation - see how all of the dots "line up" and show a general trend that as the x variable increases, the y variable increases.  Because they are both increasing together, the correlation is positive.

Remember the correlation value is an estimation of the _linear_ relationship between the two variables - and this is a clump of dots.  We can fit a "best fit" line to this graph to show that estimated linear relationship.

In [None]:
mpg %>% ggplot(aes(x=hwy, y=cty)) + ## the variables we want to plot
  geom_point()+  ## the type of plot we want
  geom_smooth(method=lm, se=FALSE) ## adding that "best fit" line using a linear model

This is our "best fit" line.  These linear relationships are going to form the basis of Linear Regression that we will get into next week, but right now we're going to focus simply on the correlation between these two variables.  Note that the distance of the dots from the line affects the strength of the relationship - the closer to the line that the dots are clumped, the stronger the relationship.  

Let's look at the relationship between hwy and another numerical variable - displ - instead.  displ is the engine displacement, in litres.

In [None]:
mpg %>% ggplot(aes(x=hwy, y=displ)) + 
  geom_point()+
  geom_smooth(method=lm, se=FALSE)

Now what we see is an example of a negative correlation, because the line starts at the top left corner and goes down toward the bottom right.  So as hwy increases, displ decreases.  These dots seem to be more spread out from the line, so I would guess that the strength of the relationship between hwy and displ (the correlation between the variables) is lower than that between hwy and cty.
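
We can check that guess numerically. The sketch below assumes the `ggplot2` package is installed (it provides the `mpg` dataset) and uses the base R function `cor()` to compare the two relationships.

```r
library(ggplot2)  # provides the built-in mpg dataset

r_cty   <- cor(mpg$hwy, mpg$cty)     # hwy vs. cty: expected to be positive
r_displ <- cor(mpg$hwy, mpg$displ)   # hwy vs. displ: expected to be negative

c(hwy_cty = r_cty, hwy_displ = r_displ)

## compare the strength of the two relationships, ignoring the sign
abs(r_cty) > abs(r_displ)
```

When comparing strength we look at the absolute value of $r$, because the sign only tells us the direction.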

In [None]:
mpg %>% ggplot(aes(x=hwy, y=cyl)) + 
  geom_point()+
  geom_smooth(method=lm, se=FALSE)

You need to make sure your variables are truly numeric, and not simply ordinal.  Because cyl can only be 4, 5, 6, or 8, with no values in between, we see that all of the points line up at those values of y.  The line indicates that there may be a general trend of negative association between these variables, but since cyl is ordinal it is not well suited to a correlation analysis (however, we can do an ANOVA using cyl as a categorical grouping variable and hwy as the numerical DV).  There are non-parametric versions of the correlation coefficient that could also be used, but we will not cover them in this course.
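
For the curious only, here is a minimal sketch of what a rank-based alternative looks like. It assumes `ggplot2` is installed for the `mpg` data and uses the `method` argument of base R's `cor()`; choosing Spearman or Kendall here is just for illustration, not a recommendation for this dataset.

```r
library(ggplot2)  # provides the mpg dataset

## rank-based correlations work with the ordering of values, not the raw values
r_spearman <- cor(mpg$hwy, mpg$cyl, method = "spearman")
r_kendall  <- cor(mpg$hwy, mpg$cyl, method = "kendall")

## Spearman's coefficient is the Pearson correlation computed on the ranks
r_on_ranks <- cor(rank(mpg$hwy), rank(mpg$cyl))

c(spearman = r_spearman, kendall = r_kendall, pearson_on_ranks = r_on_ranks)
```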

There are various statistical tests that are most appropriate for certain types of data.  Once you have "stocked" your statistical "toolbelt" you can do an inference test on any combination of data and variable types, by using the one that is most appropriate for your data types and your RQ.

We can also quickly inspect the correlations among all of the numerical variables in a dataset.  This will be more important when we get into modeling and visualize our data as a preparation for creating our linear models.

In [None]:
## select_if selects the columns that match our logical statement, 
## so here we're selecting the columns of mpg that return TRUE from is.numeric()

corr <- cor(select_if(mpg, is.numeric))  ## obtain all of the correlations within pairs of all the num vars
ggcorrplot(corr) ## plot those correlations

This (above) is a heatmap of the strength and direction of the correlations between these variables where very blue is -1 and very red is +1.
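
The heatmap is just a picture of a correlation matrix, so it can help to look at the numbers behind it. This is a minimal sketch in base R, assuming `ggplot2` is installed for the `mpg` data; the object names `num_vars` and `corr_mat` are made up for this example.

```r
library(ggplot2)  # provides the mpg dataset

## keep only the numeric columns, then compute every pairwise correlation
num_vars <- mpg[sapply(mpg, is.numeric)]
corr_mat <- cor(num_vars)

round(corr_mat, 2)        # one row and one column per numeric variable

## a single cell is the correlation for that pair of variables
corr_mat["hwy", "cty"]
```

The diagonal is all 1s (every variable correlates perfectly with itself), and the matrix is symmetric because correlation has no direction.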

We can also create a scatterplot matrix where a number of variables are compared in scatterplots (like our examples above) in a grid.

In [None]:
options(repr.plot.width=8, repr.plot.height=8) ## plot size options for Jupyter notebook ONLY
pairs(select_if(mpg, is.numeric)) ## scatterplot matrix for every pair of numeric columns in the mpg dataset


We can also use a categorical variable to color the dots in the grid by groups, to see how the relationship between the two numerical variables might be associated with an additional categorical variable (more about this to come soon).

In [None]:
options(repr.plot.width=8, repr.plot.height=8)

# cyl is more categorical, so use it to color the dots
mpg %>% select(displ, cty, hwy) %>% pairs(col=mpg$cyl) ## col= adds color based on the grouping variable specified

We also have situations when there is no correlation between the two variables: <BR>
<img src="images/r2model.png" width="500" >
    
To summarize:
<br>
<img src="images/strength.PNG" width="1000" >

There are also cases where variables have obvious associations, but they are not linear.  They can have a __*correlation*__ at or near 0 and still be __*associated*__.  We have to be careful how we use the word correlation in writing our statistical results.
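
A quick way to convince yourself of this is with a made-up curved relationship. In the sketch below, `y` is completely determined by `x`, yet the linear correlation is essentially zero (the data are simulated purely for illustration).

```r
x <- seq(-3, 3, by = 0.1)   # values spread evenly around 0
y <- x^2                    # y depends entirely on x, but along a U-shaped curve

r_curve <- cor(x, y)        # essentially 0: no linear trend
r_abs   <- cor(abs(x), y)   # distance from 0 does track y closely

c(x_vs_y = r_curve, abs_x_vs_y = r_abs)
```

The left half of the curve slopes down and the right half slopes up, so the two halves cancel out when we ask for a single straight-line trend.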

<img src="images/curvy.PNG" width="600" >

## Assumptions
The assumptions are pretty basic: you need two variables, both numeric.  

_IF_ you want to do significance testing, they would need to be normally distributed.

## Significance Testing?
Correlation coefficients by themselves are interpretable as the size of the relationship between two variables. However, there is also a significance test we can conduct on a correlation which will generate a t-score we can compare to the t-distribution to obtain a p-value.

The hypotheses for this type of test are pretty basic: is the correlation coefficient significantly different from 0?  This tells us nothing about the relative strength, only whether a significant effect exists (or not).

#### Non-directional (two-tailed):
$H_0: r = 0$ <BR>
$H_A: r \neq 0$ <BR>

#### Directional (one-tailed):
$H_0: r = 0$ <BR>
$H_A: r > 0$ <BR>
    __OR__ <BR>
 $H_A: r < 0$ <BR>   
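
To see where the t-score comes from, here is a minimal sketch on simulated data (the sample size, slope, and seed are arbitrary). It uses the standard formula $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$ with $n-2$ degrees of freedom and compares the result with base R's `cor.test()`.

```r
set.seed(1)                  # arbitrary seed for reproducibility
n <- 30
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

r      <- cor(x, y)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)                       # t-score for H0: r = 0
p_two  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)   # two-tailed p-value
p_one  <- pt(t_stat, df = n - 2, lower.tail = FALSE)            # one-tailed, H_A: r > 0

test_two     <- cor.test(x, y)                                  # non-directional
test_greater <- cor.test(x, y, alternative = "greater")         # directional

c(manual_t = t_stat, cor_test_t = unname(test_two$statistic))
c(manual_p = p_two,  cor_test_p = test_two$p.value)
c(manual_p_one = p_one, cor_test_p_one = test_greater$p.value)
```

The `alternative` argument of `cor.test()` is how you choose between the non-directional and directional hypotheses.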
    
## Reminder: Correlation does not equal Causation
<img src="images/venti.jpg" width="300" >

Correlation _could_ be evidence of potential causality, but:
- there could be a third variable that is actually causing the effect (for example, ice cream sales and a rise in crime can both go up with hot weather)

- we don't know which direction the effect occurs - does X predict Y or does Y predict X?
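
The third-variable problem can be sketched with simulated data. Everything below is hypothetical: we invent a `temperature` variable that drives both `ice_cream` and `crime`, and the two outcomes never influence each other directly. Removing the part of each variable explained by temperature (using residuals from `lm()`) is one simple way to see what is left over.

```r
set.seed(7)                                        # arbitrary seed
n <- 500
temperature <- rnorm(n, mean = 20, sd = 8)         # hypothetical third variable
ice_cream   <- 3 * temperature + rnorm(n, sd = 10) # depends only on temperature
crime       <- 2 * temperature + rnorm(n, sd = 10) # depends only on temperature

r_raw <- cor(ice_cream, crime)                     # clearly positive

## strip out what temperature explains, then correlate what remains
ice_resid   <- resid(lm(ice_cream ~ temperature))
crime_resid <- resid(lm(crime ~ temperature))
r_partial   <- cor(ice_resid, crime_resid)         # close to 0

c(raw = r_raw, after_removing_temperature = r_partial)
```

The raw correlation is real, but it disappears once the shared cause is accounted for.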

## Calculating Correlation:

## $r = \frac{cov_{xy}}{s_xs_y} = \frac{\sum{(x-\bar{x})(y-\bar{y})}}{\sqrt{\sum{(x-\bar{x})^2}\sum{(y-\bar{y})^2}}}$
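
Both forms of the formula give the same number. Here is a minimal sketch with a tiny made-up dataset, using base R's `cov()` and `sd()` for the first form and raw sums for the second.

```r
## small made-up dataset, just to check the arithmetic
x <- c(2, 4, 5, 7, 9)
y <- c(1, 3, 4, 8, 9)

## first form: covariance divided by the product of the standard deviations
r_cov <- cov(x, y) / (sd(x) * sd(y))

## second form: sum of cross-products over the root of the summed squared deviations
r_sums <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

c(cov_form = r_cov, sums_form = r_sums, cor_function = cor(x, y))
```

The $n-1$ terms in the covariance and the standard deviations cancel, which is why the second form has no $n$ in it.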


Let's look at some examples now. For fun, we're going to connect to the Google Trends API using the R package `gtrendsR` and get data about certain keyword searches over a period of time.

In [None]:
options(repr.plot.width=10, repr.plot.height=6) ## set options for plot size within the notebook -
# this is only for jupyter notebooks, you can disregard this.

## call the google trends api and return hits info for keyword searches
## hits are scaled to values between 0 - 100
trends <- gtrends(c("hand washing", "face masks"), geo = "US-MD", time = "2019-01-01 2020-04-14", low_search_volume = TRUE)
plot(trends)

Keep in mind that to be correlated the variables __*do not*__ have to have the same magnitude - they just have to trend together.
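
Here is a minimal sketch of that idea with simulated data (all numbers are made up). Rescaling one variable to a completely different magnitude leaves the correlation unchanged.

```r
set.seed(3)                               # arbitrary seed
x <- rnorm(100, mean = 50, sd = 10)
y <- 0.8 * x + rnorm(100, sd = 5)

y_rescaled <- y / 100 + 2                 # same pattern, much smaller magnitude

r_original <- cor(x, y)
r_rescaled <- cor(x, y_rescaled)

c(original = r_original, rescaled = r_rescaled)
```

Correlation is computed from standardized deviations, so shifting a variable or multiplying it by a positive constant does not change $r$.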

In [None]:
## extract the interest_over_time df from the gtrends object

trend_time <- as_tibble(trends$interest_over_time)
glimpse(trend_time)

__NOTE: the values in the output are subject to change as Google Trends samples from the overall data and only returns a small sample of their massive dataset. This is a real-life example of all of that sampling variance we've been talking about.__

In [None]:
## look at basic summary statistics

trend_time %>%
  group_by(keyword) %>%
  summarise(mean(hits), median(hits), var(hits))

To use this data we need to pivot it from "long" format to "wide" format, so that hits for "face masks" and hits for "hand washing" are each in their own column.

In [None]:
trend_wide <- 
  trend_time %>%
  spread(key = keyword, value = hits)

colnames(trend_wide)[6:7] <- c("face", "hands")
glimpse(trend_wide)
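
As a side note, `spread()` is marked as superseded in current versions of `tidyr`, and `pivot_wider()` is the newer function for the same job. Here is a minimal sketch on a tiny made-up data frame (the dates and hit counts are invented, and `toy_long` / `toy_wide` are hypothetical names), so that it runs without calling the Google Trends API.

```r
library(tidyr)

## toy "long" data: one row per date per keyword (values are made up)
toy_long <- data.frame(
  date    = rep(as.Date(c("2020-01-05", "2020-01-12", "2020-01-19")), times = 2),
  keyword = rep(c("face masks", "hand washing"), each = 3),
  hits    = c(1, 2, 5, 10, 12, 20)
)

## one row per date, one column of hits per keyword
toy_wide <- pivot_wider(toy_long, names_from = keyword, values_from = hits)
names(toy_wide) <- c("date", "face", "hands")   # shorter column names
toy_wide
```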

Let's look at a scatterplot.

In [None]:
trend_wide  %>% 
    ggplot (aes(x = face, y = hands))  + 
    geom_point() +
    geom_smooth(method=lm, se=FALSE)

And finally, let's calculate the correlation between these two variables.  For this we will use the base R function `cor()`.  The arguments for `cor()` are simple: just specify x and y (your two variables to compare).

## $r = \frac{cov_{xy}}{s_xs_y} = \frac{\sum{(x-\bar{x})(y-\bar{y})}}{\sqrt{\sum{(x-\bar{x})^2}\sum{(y-\bar{y})^2}}}$

We'll call face masks X and hand washing Y (but for correlation there's no directionality, so it doesn't matter as long as you're consistent).

In [None]:
# calculate deviation from value to mean of that variable
trend_wide$xdiff <- trend_wide$face - mean(trend_wide$face)
trend_wide$ydiff <- trend_wide$hands - mean(trend_wide$hands)

# calculate sq deviations
trend_wide$sqdevx <- trend_wide$xdiff^2
trend_wide$sqdevy <- trend_wide$ydiff^2

# r given the above values

r <- sum(trend_wide$xdiff * trend_wide$ydiff) / sqrt(sum(trend_wide$sqdevx) * sum(trend_wide$sqdevy))
r

In [None]:
# confirm with cor function
# correlation between hits for face masks and hits for hand washing

cor(trend_wide$face, trend_wide$hands) 

This correlation is not as low as we may have expected.  It is a substantial correlation, almost 0.5.  But is it significant?  For that we need to use cor.test() which performs the hypothesis test as well.

In [None]:
cor.test(trend_wide$face, trend_wide$hands) ## correlation with CI and t-test

Our p-value is below 0.05, so the correlation is significantly different from zero, and therefore there is at least some correlation.  However, the lower bound of the confidence interval is close to zero, and the interval is wide.
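
If you are wondering where that confidence interval comes from, `cor.test()` builds it with the Fisher z-transformation. The sketch below reproduces it by hand on simulated data (the sample size, slope, and seed are arbitrary, so the numbers will not match the Google Trends output).

```r
set.seed(11)                 # arbitrary seed
n <- 60
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

r    <- cor(x, y)
z    <- atanh(r)             # Fisher z-transformation of r
se_z <- 1 / sqrt(n - 3)      # standard error on the z scale

## 95% interval on the z scale, transformed back to the r scale
ci_manual <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)
ci_test   <- as.numeric(cor.test(x, y)$conf.int)

rbind(manual = ci_manual, cor_test = ci_test)
```

Because the standard error depends on $n$, small samples give wide intervals even when $r$ itself looks substantial.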

__NOTE: the values in the output are subject to change as Google Trends samples from the overall data and only returns a small sample of their massive dataset. This is a real-life example of all of that sampling variance we've been talking about.__


### Reporting a correlation
How would we report this formally?

Keyword searches in Google for "face masks" have a moderate correlation with keyword searches for "hand washing" ($r \approx 0.48$)* over the time period from January 2019 to mid-April 2020 within the state of Maryland.  The correlation is both statistically significant (p < 0.001) and substantive.

*number subject to change each time the code is run

## R-squared
Remember, our value of the proportion of variance explained is literally this correlation coefficient, $r$, squared - $r^2$.  Let's take a look:

In [None]:
rsq <- cor(trend_wide$face, trend_wide$hands)^2 ## calculate the correlation, and square it - ^2
rsq

The interpretation is that this proportion of the variance in the hits for face masks is explained by the hits for hand washing, or vice versa.
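
As a small preview of linear regression, here is a sketch on simulated data (the slope and seed are arbitrary) showing that the squared correlation matches the R-squared reported by `lm()` for a model with one predictor, whichever variable we treat as the outcome.

```r
set.seed(5)                  # arbitrary seed
x <- rnorm(80)
y <- 1.5 * x + rnorm(80)

rsq_cor    <- cor(x, y)^2                        # correlation, squared
rsq_lm     <- summary(lm(y ~ x))$r.squared       # R-squared predicting y from x
rsq_lm_rev <- summary(lm(x ~ y))$r.squared       # R-squared predicting x from y

c(from_cor = rsq_cor, y_on_x = rsq_lm, x_on_y = rsq_lm_rev)
```

This is the "or vice versa" part: with only two variables, the proportion of variance explained is the same in both directions.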

## Fun stuff related to correlations:
- <a href="http://guessthecorrelation.com/">Guess the Correlation game</a>
- <a href="https://www.tylervigen.com/spurious-correlations"> Spurious Correlations </a>