# Correlations

We're now going to look at the relationships between two __numerical__ variables.  This will allow us to make this final leap from ANOVA to linear regression.  

A correlation is a numerical summary of the strength of the relationship between two numerical variables: how closely the two variables move together.  **HOWEVER**, correlation is not directional, and correlation does not imply causation.

The bivariate (relationship between __two__ variables) correlation tells us:
- Whether an association exists
- The strength of the association
- The direction of the association
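
To make strength and direction concrete, here is a minimal sketch on simulated data. Everything in it (`x`, `noise`, and the three `y_*` variables) is hypothetical and made up for illustration; none of it comes from the poll. It builds three relationships with different amounts of noise and compares their correlation coefficients.

```r
# Simulated data for illustration only (NOT from the Cards Against Humanity poll)
set.seed(42)
n <- 200
x <- rnorm(n)
noise <- rnorm(n)

y_strong_pos <- x + 0.3 * noise    # tight upward trend
y_weak_pos   <- x + 3 * noise      # loose upward trend (lots of noise)
y_strong_neg <- -x + 0.3 * noise   # tight downward trend

# the sign gives the direction, the absolute value gives the strength
cors <- c(strong_pos = cor(x, y_strong_pos),
          weak_pos   = cor(x, y_weak_pos),
          strong_neg = cor(x, y_strong_neg))
round(cors, 2)
```

The noisier relationship should come out with a coefficient closer to zero, and the downward trend should come out negative.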

In this lab you'll see one full example of an analysis of the relationship between two numerical variables from the Cards Against Humanity poll that we've used in previous labs - income and self-rated attractiveness.  Do people with more money think they're more attractive?



In [None]:
## loading some libraries!
library(tidyverse) ## all of our normal functions for working with data

options(repr.plot.width=7, repr.plot.height=6) ## set options for plot size within the notebook -
# this is only for jupyter notebooks, you can disregard this.

In [None]:
## LOAD the DATA
cah <- read_csv("201806-CAH_PulseOfTheNation_Raw.csv")
## variable names currently full questions - need to rename
new_names <- c("gender", "age", "agerange", "race", "income", "educ", "partyid", "polaffil", 
               "trump", "hollymoney", "fed_min_is", "fed_min_should", "fed_tax_is", "fed_tax_should", 
               "redist", "redist_you", "redist_people", "baseincome", "faircomp", "ceofair", "attractive")
colnames(cah) <- new_names
glimpse(cah)

## Example: Correlation between Income and Self-rated Attractiveness
We're going to see if there's a correlation between a person's income and how they rate their own attractiveness.  

We'll first do some data cleaning, then proceed.

In [None]:
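# data cleaning
# note: the %<>% operator lives in magrittr, which library(tidyverse) does not attach,
# so we reassign with <- instead
cah <- cah %>% 
                drop_na(income) %>% # remove rows with missing income
                filter(income < 200000 & attractive != "DK/REF") %>% # drop very high incomes and don't know/refused
                mutate(attractive = replace(attractive, attractive == "Not attractive at all", "1")) %>% 
                mutate(attractive = replace(attractive, attractive == "Very attractive", "10")) %>% 
                mutate(attractive = as.numeric(attractive)) # convert the ratings from text to numbers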
summary(cah$income)
summary(cah$attractive)

The first thing we'll do is create a scatterplot of income and self-rated attractiveness.  Remember, since a correlation doesn't have direction, it doesn't matter which variable you make the x variable and which variable you make the y variable.  We will also include a best fit line.

In [None]:
cah %>% ggplot(aes(x = income/1000, y = attractive)) + #divide income by 1000 to make axis labels easier to read
            geom_point() + 
            geom_smooth(method = "lm") +
            labs(x = "Income in $1000s",
                 y = "Self-rated attractiveness",
                 title = "Correlation between income and self-rated attractiveness")

What does this plot include?
1. Dots that represent each observation.
2. A best fit line that indicates the general trend of the relationship between income and self-rated attractiveness.
3. A confidence interval around the best fit line.  This represents the uncertainty (standard error) in using our sample to infer the trend in the population.

What do we see specifically about our variables?
1. Attractiveness is more ordinal than continuous, so we have all of our observations on the integer levels of the y-axis.  There is a lot of spread in the observations between attractiveness and income.
2. The spread in the values shows there's not a strong linear relationship; however, the observations at the highest end of income do not have any very low ratings of attractiveness.  The best fit line is not completely horizontal, indicating that there may be some small correlation between the variables.
3. The CI for the best fit line indicates that the correlation could range from completely non-existent (horizontal line) to larger than the current best fit line (steeper slope).

## Assumptions
The assumptions are pretty basic: you need two variables, and both must be numeric.  

**These variables are numeric, although self-rated attractiveness is more ordinal than continuous.**

_IF_ you want to do significance testing, both variables would also need to be normally distributed.

We can look at QQ plots to see if these variables are normally distributed.

In [None]:
# QQ plot for income
cah %>% ggplot(aes(sample = income)) +
  geom_qq_line(color = "red", size = 1) +
  geom_qq(color = "black") +
  labs(title = "QQ Plot of Income")

The QQ plot for income shows a moderate deviation from normality, which is fairly large in the tails.  Because both tails sit above the reference line, the data are positively (right) skewed, which is what we would typically expect from a distribution of income.

In [None]:
# QQ plot for attractive
cah %>% ggplot(aes(sample = attractive)) +
  geom_qq_line(color = "red", size = 1) +
  geom_qq(color = "black") +
  labs(title = "QQ Plot of Self-rated Attractiveness")

This QQ plot is harder to interpret, because attractiveness is relatively ordinal and can only take integer values.  However, there is indication that the distribution follows a relatively normal pattern, despite the fact that all of the observations sit on the line for the integer values on the y-axis.
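
Because one of the variables is ordinal, a rank-based coefficient is sometimes reported alongside Pearson's. The sketch below is an aside rather than part of this lab's analysis: it uses base R's `cor()` with `method = "spearman"` on simulated data. The names `income_sim`, `latent`, and `rating_sim` are hypothetical stand-ins, not the poll variables.

```r
# Simulated stand-ins for a skewed income and a 1-10 integer rating (NOT the poll data)
set.seed(5)
n <- 300
income_sim <- rexp(n, rate = 1 / 50000)                  # right-skewed, like income
latent     <- 5.5 + 0.00001 * income_sim + rnorm(n, sd = 2)
rating_sim <- pmin(pmax(round(latent), 1), 10)           # clamp to integers from 1 to 10

r_pearson  <- cor(income_sim, rating_sim, method = "pearson")
r_spearman <- cor(income_sim, rating_sim, method = "spearman")

# Spearman's coefficient is Pearson's coefficient computed on the ranks
r_on_ranks <- cor(rank(income_sim), rank(rating_sim))

c(pearson = r_pearson, spearman = r_spearman, pearson_on_ranks = r_on_ranks)
```

The last two values should agree, which shows that the rank-based version only uses the ordering of the observations, not the distances between them.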

## Correlation Coefficient
The next step is calculating the correlation coefficient, which is pretty simple and involves one line of code.

In [None]:
# cor(firstvar, secondvar)
# the order of the vars does not matter
cor(cah$income, cah$attractive)

Our correlation coefficient tells us 3 things:

1. Is there a correlation?

Well, this number is not zero, but it's not large either, so significance testing will be required.

2. What is the magnitude of the correlation?

A correlation coefficient of 0.09 indicates a very, very small, potentially nonexistent relationship between the variables.

3. What is the direction of the correlation?

Because the sign of the coefficient is positive, the correlation is positive.  This corresponds to the positive slope we saw on the best fit line in the scatterplot above.
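
For intuition about what `cor()` computes, here is a minimal sketch of Pearson's coefficient written out from its definition: the sum of the products of the deviations from each mean, divided by the square root of the product of the two sums of squared deviations. The helper `pearson_r()` and the simulated `x` and `y` are hypothetical names for illustration, not part of the lab data.

```r
# Simulated data for illustration only
set.seed(1)
x <- rnorm(50, mean = 50, sd = 10)
y <- 0.2 * x + rnorm(50, sd = 10)

# Pearson's r from its definition (hypothetical helper, not a library function)
pearson_r <- function(x, y) {
  dx <- x - mean(x)   # deviations of x from its mean
  dy <- y - mean(y)   # deviations of y from its mean
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson_r(x, y)
r_builtin <- cor(x, y)

all.equal(r_manual, r_builtin)     # the hand-written version matches cor()
all.equal(cor(x, y), cor(y, x))    # swapping the variables changes nothing
```

The second check illustrates why the order of the variables does not matter: the formula treats `x` and `y` symmetrically.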

## Significance Testing?
Correlation coefficients by themselves are interpretable as the size of the relationship between two variables. However, there is also a significance test we can conduct on a correlation which will generate a t-score we can compare to the t-distribution to obtain a p-value.

The hypotheses for this type of test are pretty basic: is the correlation coefficient significantly different from 0?  This tells us nothing about the relative strength, only whether a significant effect exists (or not).

In the case of our very, very small correlation coefficient (r = 0.09), a significance test is required to determine if the correlation is even statistically different from zero.

#### Non-directional (two-tailed):
$H_0: r = 0$ <BR>
$H_A: r \neq 0$ <BR>
    
For this, we use a related function, `cor.test()`

In [None]:
## cor.test(firstvar, secondvar)
cor.test(cah$income, cah$attractive)

Let's review this output.

The first line tells us about the t-test of whether our correlation is significantly different from zero.  With a p-value of 0.08, we fail to reject the null hypothesis: the correlation is not significantly different from zero (at alpha = 0.05). 

This conclusion is supported by the 95% confidence interval, which contains 0.  Because the interval includes zero, no correlation is a plausible value for the relationship between these two variables in our population.

The final line displays the correlation that we've already seen above in the output of `cor()`.
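
To see where the numbers in that output come from, here is a minimal sketch that rebuilds them by hand on simulated data (the `x` and `y` below are hypothetical, not the poll variables). It assumes the default two-sided Pearson test: the t-score is $t = r\sqrt{n-2}/\sqrt{1-r^2}$ with $n-2$ degrees of freedom, and the confidence interval comes from the Fisher z transformation (`atanh()` and `tanh()` in R).

```r
# Simulated data for illustration only
set.seed(2)
n <- 100
x <- rnorm(n)
y <- 0.2 * x + rnorm(n)

r  <- cor(x, y)
df <- n - 2

# t-score and two-sided p-value for H0: correlation = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
p_val  <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# 95% confidence interval via the Fisher z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

# compare with cor.test()
ct <- cor.test(x, y)
all.equal(as.numeric(ct$statistic), t_stat)
all.equal(as.numeric(ct$p.value), p_val)
all.equal(as.numeric(ct$conf.int), ci)
```

All three comparisons should return `TRUE`, so the p-value and the confidence interval are two views of the same estimate and sample size.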

## R-squared
Remember, the proportion of variance explained is literally this correlation coefficient, $r$, squared: $r^2$.  

Squaring the correlation coefficient gives the coefficient of determination, our r-squared value.

In [None]:
rsq <- cor(cah$income, cah$attractive)^2 ## calculate the correlation, and square it - ^2
rsq

As we might expect from such a small correlation, r-squared is very close to zero.  With r = 0.09, r-squared is roughly 0.008, so income explains less than 1% of the variance in attractiveness (or, equivalently, attractiveness explains less than 1% of the variance in income).
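
The link between the squared correlation and regression can be checked directly. The sketch below uses simulated data (hypothetical `x` and `y`, not the poll variables) and compares `cor(x, y)^2` with the R-squared reported by a simple linear regression fitted with `lm()`.

```r
# Simulated data for illustration only
set.seed(3)
x <- rnorm(80)
y <- 0.5 * x + rnorm(80)

rsq_from_cor <- cor(x, y)^2                    # squared correlation
rsq_from_lm  <- summary(lm(y ~ x))$r.squared   # R-squared from regressing y on x
rsq_swapped  <- summary(lm(x ~ y))$r.squared   # R-squared from regressing x on y

all.equal(rsq_from_cor, rsq_from_lm)
all.equal(rsq_from_cor, rsq_swapped)
```

Both comparisons should return `TRUE`: with a single predictor, R-squared is the squared correlation, and it is the same whichever variable is treated as the outcome.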

## Correlation Matrix

Another useful thing we can do is create a correlation matrix.  This gives us correlation coefficients for every pairing of numerical variables in the dataset.

In [None]:
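# keep only complete rows: drop_na() with no arguments drops a row if ANY column is missing
cah2 <- cah %>% drop_na()

# correlation matrix across every numeric column
cor(select(cah2, where(is.numeric)))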

The values along the diagonal represent the correlation of each variable with itself, and are therefore all one.

It looks like the strongest correlations are between `fed_tax_should` and `fed_tax_is` (0.999), and between `fed_min_should` and either `fed_tax_is` or `fed_tax_should` (0.768).

Remember - the magnitude of the correlation is what is important.  A correlation of -0.17 is stronger than a correlation of 0.14; it just happens to be a negative correlation.
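
With many variables it can be tedious to scan a matrix by eye. Here is a minimal sketch, on a small simulated data frame, of listing each pair once and sorting by the absolute value of the correlation. The data frame `toy`, its columns, and the name `pair_tbl` are hypothetical; the same steps could be applied to any numeric data frame.

```r
# Simulated data frame for illustration only
set.seed(4)
n <- 150
a <- rnorm(n)
toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.5),    # strongly related to a
  c = -0.3 * a + rnorm(n),       # weakly, negatively related to a
  d = rnorm(n)                   # unrelated noise
)

cm <- cor(toy)

# keep each pair once by using only the upper triangle (this also skips the diagonal)
idx <- which(upper.tri(cm), arr.ind = TRUE)
pair_tbl <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[upper.tri(cm)]
)

# sort by magnitude, ignoring the sign
pair_tbl <- pair_tbl[order(abs(pair_tbl$r), decreasing = TRUE), ]
pair_tbl
```

Sorting on `abs(r)` rather than `r` keeps strong negative correlations near the top instead of burying them at the bottom.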

## Exercise 
Pick two variables with a weak correlation (a magnitude of around 0.1 to 0.2).

1. Plot a scatterplot of the relationship between these two variables, including a best fit line.
2. Test the correlation to determine if it's statistically different from 0.