# Scatterplots, Correlation, and Regression

In this lab, we're going to cover some basic modeling, data visualization, and reporting functions. 

To begin, install the **{modelsummary}** package (if needed) and load it along with **{tidyverse}**.



In [None]:
install.packages("modelsummary") # this will help us generate attractive regression output later

In [40]:
library(tidyverse)
library(modelsummary)

## Simulated data

Let's start by simulating some fake data. Simulated data are well suited to learning regression and other methods: we control how closely the data align with the assumptions of the estimator, and we know the population parameters ($b$ and $se$) because we created the data.

Start with a random variable, *x*. 


In [None]:
# run the set.seed() function if you would like to be able to reproduce 
# the results I have below.

set.seed(5)

# this function draws 100 random values from a uniform distribution between 0 and 50

xvar <- runif(n = 100, min = 0, max = 50)

# now create the outcome variable yvar as a linear function of xvar.
# where does the rest of this equation come from?

yvar <- 32 - 1.2*xvar + rnorm(n = 100, mean = 0, sd = 30)

# combine the new vectors into a data frame using data.frame() or tibble()
# (cbind(), "column bind", would also combine them, but it returns a matrix)

fake <- tibble(xvar, yvar)

# look at your data: 
head(fake)
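Written out, the simulation code above corresponds to the following data-generating process. The symbols $x_i$, $y_i$, and $\varepsilon_i$ are notation added here for illustration; the code only defines *xvar* and *yvar*, and $\varepsilon_i$ stands for the random error drawn by *rnorm()*.

$$
\begin{aligned}
x_i &\sim \mathrm{Uniform}(0,\ 50), \qquad i = 1, \dots, 100 \\
\varepsilon_i &\sim \mathcal{N}(0,\ 30^2) \\
y_i &= 32 - 1.2\,x_i + \varepsilon_i
\end{aligned}
$$

So the true intercept is 32, the true slope is -1.2, and the errors have a standard deviation of 30.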

## A basic scatterplot

We can use the *plot()* function in base R to take a quick look at the relationship between *xvar* and *yvar*.

In [None]:
plot(fake$xvar, fake$yvar) # note that x variable goes first, y variable second

This plot matches what we told R we wanted - a predictor variable between 0 and 50, an outcome variable that is a linear function of x, and a negative relationship between the two variables with substantial random error. 

## Covariance and Correlation

The *cov()* and *cor()* functions calculate covariance and correlation, respectively. *cor.test()* provides some additional output for hypothesis testing. 

In [None]:
# get the covariance between xvar and yvar

cov(fake$xvar, fake$yvar) 

# correlation

cor(fake$xvar, fake$yvar) # default method is pearson's r

# correlation + significance test 

cor.test(fake$xvar, fake$yvar)
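To see how the two statistics relate, here is a minimal sketch that rebuilds Pearson's r from the covariance. It regenerates the simulated vectors so it runs on its own; the names *r_by_hand* and *r_builtin* are made up for this example.

```r
# Rebuild the simulated data so this chunk runs on its own
set.seed(5)
xvar <- runif(n = 100, min = 0, max = 50)
yvar <- 32 - 1.2 * xvar + rnorm(n = 100, mean = 0, sd = 30)

# Pearson's r is the covariance rescaled by both standard deviations
r_by_hand <- cov(xvar, yvar) / (sd(xvar) * sd(yvar))
r_builtin <- cor(xvar, yvar)

# all.equal() compares the two values within a small numeric tolerance
all.equal(r_by_hand, r_builtin)
```

Covariance is in the units of *xvar* times the units of *yvar*, while the correlation is unit-free and bounded between -1 and 1, which is why it is usually easier to interpret.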

## Regression 

The *lm* function (for **l**inear **m**odels) in base R can be used to run classic OLS regression models. Note that by default, running *lm* does not generate very useful output in the console. We are better off storing regression model results and then calling them up in other functions, like *summary()*. We can also use the **{modelsummary}** package for nicely formatted default output.

In [None]:
lm(fake$yvar ~ fake$xvar)

This output is limited. Now try:

In [None]:
m1 <- lm(fake$yvar ~ fake$xvar)

# or you could do:
m1 <- lm(yvar ~ xvar, data = fake)

summary(m1)

That is better. We actually get uncertainty estimates and significance tests along with our coefficient estimates.
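If you are curious where the coefficient estimates come from, the sketch below reproduces them by hand for the one-predictor case and then asks for confidence intervals. It uses *data.frame()* instead of *tibble()* only so that it runs without loading any packages, and the *_by_hand* names are made up for this example.

```r
# Rebuild the simulated data so this chunk runs on its own
set.seed(5)
xvar <- runif(n = 100, min = 0, max = 50)
yvar <- 32 - 1.2 * xvar + rnorm(n = 100, mean = 0, sd = 30)
fake <- data.frame(xvar, yvar)

m1 <- lm(yvar ~ xvar, data = fake)

# With one predictor: slope = cov(x, y) / var(x),
# and intercept = mean(y) - slope * mean(x)
slope_by_hand     <- cov(fake$xvar, fake$yvar) / var(fake$xvar)
intercept_by_hand <- mean(fake$yvar) - slope_by_hand * mean(fake$xvar)

# Compare with lm(); coef() returns the intercept first, then the slope
rbind(
  by_hand = c(intercept_by_hand, slope_by_hand),
  lm      = coef(m1)
)

# 95% confidence intervals: do they cover the true values (32 and -1.2)?
confint(m1)
```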

Now try out the *modelsummary()* function. (Check the help file, *?modelsummary*, for more on the arguments you can use.)

In [46]:
# output = "jupyter" renders the table in this notebook.
# you can skip the argument altogether or choose "html".
modelsummary(m1, output = "jupyter") 

| | Model 1 |
|---|---|
| (Intercept) | 29.091 |
| | (5.794) |
| xvar | −1.108 |
| | (0.193) |
| Num.Obs. | 100 |
| R2 | 0.251 |
| R2 Adj. | 0.244 |
| AIC | 962.0 |
| BIC | 969.8 |
| RMSE | 28.82 |
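Two of the fit statistics in the table are easy to recompute from the model object. The sketch below is self-contained and makes one assumption worth flagging: it treats RMSE as the square root of the mean squared residual, which I believe is what the table reports but have not confirmed against the package source.

```r
# Rebuild the simulated data and model so this chunk runs on its own
set.seed(5)
xvar <- runif(n = 100, min = 0, max = 50)
yvar <- 32 - 1.2 * xvar + rnorm(n = 100, mean = 0, sd = 30)
m1 <- lm(yvar ~ xvar)

# In a regression with a single predictor, R-squared equals the squared correlation
r_squared <- summary(m1)$r.squared
all.equal(r_squared, cor(xvar, yvar)^2)

# Assumed RMSE definition: root of the mean squared residual (no df correction)
rmse <- sqrt(mean(residuals(m1)^2))
rmse

# For comparison, the residual standard error from summary() divides by n - 2
summary(m1)$sigma
```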


## Linear fitted regression lines with scatterplots

Let's revisit our original scatterplot:

In [None]:
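# base R's plot() draws the figure as a side effect and returns NULL,
# so there is nothing useful to store in an object
plot(fake$xvar, fake$yvar)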

## **ggplot2**

**{ggplot2}** is the data visualization package included in **{tidyverse}**, and *ggplot()* is its main function. It is very powerful but, in true R fashion, clunky and frankly weird. Let's learn a few basics by replicating our scatterplot.

We'll do a bunch in ggplot, so it is better to start learning the code structure now. Check out these resources:

1. R for Data Science, [data viz chapter](https://r4ds.hadley.nz/data-visualize.html). 
2. [ggplot reference guide and cheat sheets](https://ggplot2.tidyverse.org/index.html).
3. [R Graphics Cookbook](https://r-graphics.org/) (with some simple examples).

First, create an empty canvas for your graph using the *ggplot()* function. We'll specify the data argument to select the data.frame. 

In [None]:
ggplot(data = fake)

Good. The function worked. But there isn't anything there! That's because we need to "map" elements of the graph to this canvas. 

In [None]:
# I think the "mapping" and "aes" names are strange. "aes" stands 
# for "aesthetics", which might make a little more sense.

ggplot(data = fake, mapping = aes(x = xvar, y = yvar))


Also good. Now we have a graph with x and y dimensions, appropriately scaled to our two variables. But still no data! We can add them with various *geom_* calls to this canvas. You might find it useful to store this original canvas and then add stuff to it in additional commands, or you could do the whole thing in one long run-on command. 

In [None]:
p1 <- ggplot(data = fake, mapping = aes(x = xvar, y = yvar))

p1 + geom_point()

In [None]:
# alpha adjusts the transparency, size adjusts the size of the dot,
# and color does, well, you see it. you can also type colors() in
# the console to see the color names R knows about.

# modern R GUIs will give you a little thumbnail of the color. Nice!
p1 + geom_point(alpha = .4, size = 3, color = "steelblue")

Now, let's add in our simple regression line using the *geom_smooth()* function and save it as p2. 

In [None]:
p2 <- p1 + geom_point(alpha = .4, size = 3, color = "steelblue") +
       geom_smooth(method = "lm", fill="grey40", color = "black", alpha = .3) 

# to view, just type: 
p2        
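The shaded ribbon around the line is, by default, a 95% confidence band for the fitted values. As a rough sketch of what is being drawn, the same kind of interval can be requested from *predict()* on an *lm* object. The *grid* and *band* names are made up for this example, and *data.frame()* is used so the chunk runs without loading packages.

```r
# Rebuild the simulated data and model so this chunk runs on its own
set.seed(5)
xvar <- runif(n = 100, min = 0, max = 50)
yvar <- 32 - 1.2 * xvar + rnorm(n = 100, mean = 0, sd = 30)
fake <- data.frame(xvar, yvar)
m1 <- lm(yvar ~ xvar, data = fake)

# A small grid of x values spanning the range used in the simulation
grid <- data.frame(xvar = seq(0, 50, by = 10))

# interval = "confidence" returns columns fit, lwr, and upr (95% by default)
band <- predict(m1, newdata = grid, interval = "confidence")
cbind(grid, band)
```

The band is narrowest near the mean of *xvar* and widens toward the edges, which matches the shape of the ribbon in the plot.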

How does our linear fitted regression line match up with a smoother that doesn't force a straight line? They should be fairly similar in this case because we made y a linear function of x. But anyway, let's superimpose a loess smoother on top of our existing *p2* plot and add some labels and a title.

In [None]:
p2 + geom_smooth(method = "loess", alpha = .2, fill = "orange", color = "orange") +
    labs(
        title = "Scatterplot and Regression Example",
        x = "Predictor (fake)",
        y = "Outcome (fake)"
    ) +
    theme_light()
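To put a number on how similar the two lines are, one option is to fit both models directly and compare their fitted values. This is only a sketch: it calls *loess()* with its default settings, which I assume are close to what the plot uses, and the object names are made up for this example.

```r
# Rebuild the simulated data so this chunk runs on its own
set.seed(5)
xvar <- runif(n = 100, min = 0, max = 50)
yvar <- 32 - 1.2 * xvar + rnorm(n = 100, mean = 0, sd = 30)
fake <- data.frame(xvar, yvar)

linear_fit <- lm(yvar ~ xvar, data = fake)
loess_fit  <- loess(yvar ~ xvar, data = fake)  # default span and degree

# Absolute gap between the two sets of fitted values, in units of yvar
gap <- abs(predict(loess_fit) - predict(linear_fit))
summary(gap)
```

Gaps that are small relative to the error standard deviation of 30 suggest the straight line is an adequate summary of these data.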