# Simple Linear Regression
In this lab we'll look at fitting a model to predict a numerical outcome using one predictor.  We'll look at an example using a numerical predictor, and another example using a categorical predictor.

<a id="top"></a>

### Table of Contents
- [Example 1: Age and Income](#ex1)
    - [Preliminary Inspection](#prelim1)
    - [Model Fitting](#modfit1)
    - [Checking Assumptions](#assump1)
    - [Effect Size](#eff1)
- [Example 2: Gender and Income](#ex2)
    - [Preliminary Inspection](#prelim2)
    - [Model Fitting](#modfit2)
    - [Checking Assumptions](#assump2)
    - [Effect Size](#eff2)
- [OPTIONAL: Special Model Fit Dashboard - `performance` package](#perf)

We'll use the same data from the Cards Against Humanity poll that we've used in previous labs.

In [None]:
## loading some libraries!
library(tidyverse) ## all of our normal functions for working with data
library(olsrr) ## ols plots
library(performance) ## OPTIONAL
library(magrittr)

options(repr.plot.width=10, repr.plot.height=8) ## set options for plot size within the notebook -
# this is only for jupyter notebooks, you can disregard this.

In [None]:
## LOAD the DATA
cah <- read_csv("201806-CAH_PulseOfTheNation_Raw.csv")
## variable names currently full questions - need to rename
new_names <- c("gender", "age", "agerange", "race", "income", "educ", "partyid", "polaffil", 
               "trump", "hollymoney", "fed_min_is", "fed_min_should", "fed_tax_is", "fed_tax_should", 
               "redist", "redist_you", "redist_people", "baseincome", "faircomp", "ceofair", "attractive")
colnames(cah) <- new_names
cah %<>% drop_na(income, age)
glimpse(cah)

<a id = "ex1"></a>
## Example 1: Predicting Income using Age
In this first example we'll see if age is a good predictor of a person's income.

<a id = "prelim1"></a>
### Preliminary Inspection
Let's plot a scatterplot of age and income and see what linear relationship may exist.

Unlike when we looked at correlations in the last lab, it now matters which is the x and which is the y variable.  The x variable should be your predictor (in this example, age) and the y variable should be your outcome (income).

In [None]:
cah %>% ggplot(aes(x = age, y = income)) +
            geom_point() +
            geom_smooth(method = "lm")

This preliminary plot makes it look like there's not a lot going on.  Let's do a few things to improve this plot.

1. Remove outliers - cap income at 200k as in previous labs.
2. Use income/1000 on the y-axis to make labels more interpretable.
3. Improve the plot to be fully PQ.

In [None]:
cah %<>% filter(income < 200000)

cah %>% ggplot(aes(x = age, y = income/1000)) +
            geom_point() +
            geom_smooth(method = "lm") +
            labs(x = "Age",
                 y = "Income in $1000s",
                 title = "Relationship between Age and Income")

It's pretty clear there's likely nothing going on here, but let's fit the model to check.

NOTE - this graph starts at age == 18, because it's a survey of adults, so the y-intercept is not actually shown on the scatterplot.  If you wanted to extend the graph to include x == 0 you could add `xlim(0, NA)` to the ggplot (and set `fullrange = TRUE` in `geom_smooth()` so the fitted line is drawn all the way back to zero).

[Return to Top](#top)
<a id = "modfit1"></a>

### Model Fitting

In [None]:
# model is fitted using lm function
# lm(outcome ~ predictor, data = yourdf)
mod1 <- lm(income ~ age, data = cah)

# use summary on saved model output to see the results
summary(mod1)

### Interpretation:

#### The intercept:
This is a survey of adults 18 and older, so age == 0 is not possible.  Therefore the intercept is not interpretable.

#### The coefficient estimate for age:
Looking first at the p-value (last column) we can see that p > alpha (0.05), therefore we fail to reject the null hypothesis.  This means that the coefficient for age is **not** significantly different from zero.  This matches the nearly horizontal line (slope close to 0) that we saw on the scatterplot.

The coefficient for age reflects the change in the y-variable (income) with an increase of one year in age.  So the coefficient is in units of y (dollars).  It's not surprising that $14 is not significantly different from zero.

Even though this is not significant, the proper interpretation of the coefficient estimate here is:

**A one year increase in age is associated with a $14 increase in income.**

Note that the increase is always one unit of x (here age is in years) and the yield of y is reflected by the coefficient estimate in units of y (dollars).
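To make the "one unit of x" reading concrete, here is a minimal sketch on simulated data (the `toy` data frame and its numbers are made up for illustration and are not the CAH data).  It checks that predictions for two ages exactly one year apart differ by the age coefficient.

```r
# Simulated stand-in data (assumption: NOT the CAH survey)
set.seed(42)
toy <- data.frame(age = sample(18:80, 200, replace = TRUE))
toy$income <- 40000 + 150 * toy$age + rnorm(200, sd = 20000)

toy_mod <- lm(income ~ age, data = toy)
slope <- unname(coef(toy_mod)["age"])

# predicted income at two ages exactly one year apart
preds <- predict(toy_mod, newdata = data.frame(age = c(40, 41)))
one_year_change <- unname(diff(preds))

# the one-year difference in predictions equals the age coefficient
all.equal(one_year_change, slope)
```

The same check works for any pair of ages one year apart, because the fitted line has a single constant slope.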

#### R-squared
The adjusted R-squared is negative.  R-squared itself can't be negative - in simple regression it's the square of the correlation and is therefore constrained to between 0 and 1.  However, the R-squared here is so low that the adjustment (which keeps us from overstating the r-squared in the population) pushes the adjusted value below zero.
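As a rough illustration of how the adjustment can dip below zero, the sketch below codes the usual adjusted r-squared formula by hand.  The helper `adj_r2()` and the input values are assumptions chosen for illustration; they are not taken from the CAH model output.

```r
# Adjusted R-squared from its usual formula:
#   adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
# n = number of observations, p = number of predictors
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Illustrative values (assumptions, not the CAH output):
# a tiny R-squared with one predictor and 500 observations
adj_small <- adj_r2(r2 = 0.0005, n = 500, p = 1)
adj_small < 0   # TRUE: the penalty pushes it below zero

# Cross-check the helper against lm() on simulated, unrelated x and y
set.seed(1)
sim <- data.frame(x = rnorm(100), y = rnorm(100))
s <- summary(lm(y ~ x, data = sim))
all.equal(adj_r2(s$r.squared, n = 100, p = 1), s$adj.r.squared)
```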

#### F-statistic
The F-statistic is our overall (omnibus) test of model fit.  It tells us:
1. If our r-squared is significantly different from 0. AND
2. If our model is significantly better than a "null" / "empty" / intercept only model.

For this model our p-value is greater than alpha, so we cannot conclude that this model predicts income any better than an intercept-only model, and, as the negative adjusted r-squared suggested, r-squared is not significantly different from zero.

*Because we have only one predictor variable (parameter), the p-value for the t-test of the coefficient and for the F-test of the model are identical.*
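The link between the two tests can be checked directly: with a single predictor, the F-statistic is the square of the slope's t-statistic.  The sketch below uses simulated data (an assumption, not the CAH data) to show this.

```r
# Simulated data with one numerical predictor (assumption: not the CAH data)
set.seed(7)
sim <- data.frame(x = rnorm(150))
sim$y <- 2 + 0.1 * sim$x + rnorm(150)

s <- summary(lm(y ~ x, data = sim))

# t-test of the slope
t_slope <- s$coefficients["x", "t value"]
p_slope <- s$coefficients["x", "Pr(>|t|)"]

# omnibus F-test of the model
f_stat  <- unname(s$fstatistic["value"])
p_model <- unname(pf(s$fstatistic["value"], s$fstatistic["numdf"],
                     s$fstatistic["dendf"], lower.tail = FALSE))

all.equal(t_slope^2, f_stat)   # F is t squared
all.equal(p_slope, p_model)    # so the p-values agree
```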

[Return to Top](#top)
<a id = "assump1"></a>

### Checking Assumptions
Even though this model is not predictive of income, we still need to review the assumptions.  We might find issues with assumption violations that affect our model fit.

#### Linear in the Parameters / Errors are Independent
The assumption of linear in the parameters says that the variables are entered into the model correctly (that the relationships are truly linear).  This is not something we can test, but something we need to confirm as modelers.  However, if the errors are not independent, and there is a curvilinear relationship, that may be an indication that we need to add a squared version of one of our predictors.

In [None]:
# non-PQ / quick and dirty plot
# note - we're plotting with our saved model output - mod1
plot(mod1, which = 1) ## residuals vs. fitted is plot #1

In [None]:
# pq version of resid vs. fit - ggplot

# First, save residuals and fitted values from model output to dataframe
resfit <- data.frame(resid = mod1$residuals, 
                     fitted = mod1$fitted.values)

#plot with ggplot
resfit %>% ggplot(aes(x = fitted, y = resid)) +
            geom_point() +
            geom_smooth(color = "red", se = FALSE) + 
            ## do not use method = "lm" - we want to see possible curvilinear relationships
            ## se = FALSE because we don't need CI around line.
            labs(x = "Fitted Values",
                 y = "Residuals",
                 title = "Model One - Residuals vs. Fitted")

Here we would look for any apparent linear or curvilinear relationship between the fitted values (y_hat) and the residuals.  If there is no relationship, the red "guide line" would be horizontal.  In this case there is a very slight curve, perhaps indicating that age has a curvilinear relationship with income (which may make sense - young people have low income, and older people might not earn as much as they retire).  So we could refit our model to see if including age^2 improves the fit of the model.  But first let's evaluate the rest of the assumptions.

#### Right variables 
Like linear in the parameters, it relies on the analyst to make sure they're using the right variables in the model.

#### Normally Distributed Errors
This we're already familiar with from ANOVA - we look at the normality of the distribution of the residuals using a QQ plot.

In [None]:
# quick and dirty QQ
plot(mod1, which = 2)

In [None]:
# PQ QQplot

# we can use the previously saved df with residuals included
resfit %>% ggplot(aes(sample = resid)) +
  geom_qq_line(color = "red", linewidth = 1) + # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_qq(color = "black") +
  labs(title = "QQ Plot of Residuals")

We can see some moderate deviations from normality in the upper and lower tails.  This shape (tails above line) indicates that the distribution is right skewed.  Because the distribution of income is typically right skewed (long tail in the upper income levels) this is not surprising, but it should be noted as a violation of the assumption of normally distributed errors.

#### No influential outliers
We need to make sure that we don't have any observations that 1. Have inordinate influence on the fit of the line and 2. are outliers.  For this week look at a plot of Residuals vs. Leverage and/or Cook's Distance (Cook's D).

In [None]:
# quick and dirty plot
plot(mod1, which = 5) ## plot 5 is Residuals vs. Leverage

In [None]:
# check for any obs that exceed Cook's D threshold 0.5
cd <- cooks.distance(mod1)
lev <- cd > 0.5
cd[lev]

In [None]:
## PQ plot, OLSRR version
ols_plot_resid_lev(mod1)

The quick and dirty plot uses a threshold of 0.5, therefore does not indicate that there are any influential outliers.  The same thing with running the code looking for any observation with Cook's D higher than 0.5.  OLSRR uses a different threshold, and therefore identifies two possible influential outliers - observations 235 and 183.  Note that the numbers included on all of the plots are the observation number (the row number) in the dataframe of the potential influential outlier.  We can address this when we refit the model.
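Because different tools apply different cutoffs, it can help to compute the diagnostics yourself and compare thresholds.  The sketch below uses simulated data with one planted extreme point (an assumption, not the CAH data) and compares the fixed 0.5 cutoff with the common 4/n rule of thumb.  The 4/n rule is shown only as one widely used alternative; it is not claimed to be the rule that OLSRR applies.

```r
# Simulated data with one planted extreme point (assumption: not the CAH data)
set.seed(3)
sim <- data.frame(x = c(rnorm(99), 8))
sim$y <- c(1 + 0.5 * sim$x[1:99] + rnorm(99), -20)

fit <- lm(y ~ x, data = sim)
n <- nrow(sim)

# collect the usual influence diagnostics in one data frame
diag_df <- data.frame(
  obs      = seq_len(n),
  cooks_d  = cooks.distance(fit),
  leverage = hatvalues(fit),
  rstud    = rstudent(fit)
)

# Two cutoffs for Cook's D: the fixed 0.5 used above and the
# sample-size-based 4/n rule of thumb (stricter here, since 4/n < 0.5)
flag_fixed <- diag_df$obs[diag_df$cooks_d > 0.5]
flag_4n    <- diag_df$obs[diag_df$cooks_d > 4 / n]

flag_fixed
flag_4n
```

Anything flagged by the fixed cutoff is also flagged by 4/n in this example, which is why a stricter rule can surface observations that the quick and dirty plot does not.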

#### Homoscedasticity / Homoskedasticity
Finally, we turn to our last assumption, Homoscedasticity.  This is the assumption that the variance of the errors (residuals) is constant across the range of the fitted values.  If we violate this assumption, that condition is called Heteroscedasticity.  We can examine this using a scale-location plot, a plot of residuals vs. fitted values (same as above with Independent Errors) and by running the Breusch-Pagan Test of Heteroskedasticity (a statistical test).

In [None]:
# quick and dirty plots
plot(mod1, which = c(1,3)) ## residuals vs. fitted is plot #1, scale-location is plot 3

In both of these, we're looking for a situation where the dots form a funnel or cone shape.  If we have constant variance (homoscedasticity) the resid/fitted points will form roughly a rectangle shape around the horizontal midline.  In this case, any potential shape is not readily apparent.  We can confirm with the Breusch-Pagan Test.

In [None]:
ols_test_breusch_pagan(mod1)

Our p-value is greater than an alpha of 0.05, so we fail to reject the null hypothesis of constant variance.  So we're good here - we have no evidence against homoscedasticity.

PQ graph - you can use the same ggplot code as we did in the first block with Independent Errors - they both use residuals vs. fitted.

NOTE: Even if you reject null on Breusch Pagan test I would still look on the plot to see how much it looks cone/funnel shaped.  The B-P test is VERY sensitive.
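For intuition about what the test is doing, here is a minimal hand-rolled sketch of one common form of the Breusch-Pagan test (the studentized, or Koenker, version): regress the squared residuals on the predictor and compare n times the r-squared of that regression to a chi-squared distribution.  The data are simulated with variance that grows with x (an assumption, not the CAH data), and the statistic may not match what `ols_test_breusch_pagan()` reports if that function uses a different variant.

```r
# Simulated data whose error spread grows with x (assumption: not the CAH data)
set.seed(11)
sim <- data.frame(x = runif(300, 1, 10))
sim$y <- 5 + 2 * sim$x + rnorm(300, sd = sim$x)

fit <- lm(y ~ x, data = sim)

# auxiliary regression: squared residuals on the predictor
sim$e2 <- resid(fit)^2
aux <- lm(e2 ~ x, data = sim)

# test statistic: n * R-squared, chi-squared with df = number of predictors
lm_stat <- nrow(sim) * summary(aux)$r.squared
bp_p    <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

c(statistic = lm_stat, p.value = bp_p)
```

A small p-value here means the squared residuals are related to x, which is exactly the funnel pattern we look for on the plots.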

### Refitting the Model
We made two observations in reviewing our assumptions/model fit that we could potentially "fix" to see if it improves the model.

1. Add age-squared (age^2) alongside age to address a possible curvilinear relationship.
2. Remove influential outliers.

Let's prepare a copy of the data and see if the model fit is improved:

In [None]:
# add age^2 as a variable
cah2 <- cah %>% mutate(agesq = age^2)

# influential outliers (by reviewing OLSRR plot) are 183 and 235
# remove these from df

cah2 <- cah2[-c(183,235), ] # remove rows by row number

In [None]:
# refit model
# include both age and agesq as predictors
mod2 <- lm(income ~ age + agesq, data = cah2)
summary(mod2)

WOW!  That really improved our model fit.

1. Age and Age-squared are both significant predictors of income.  The coefficient of age describes the linear relationship between age and income, and the coefficient for age-squared defines the curvilinear relationship.  They are interpreted separately (now we actually have two predictors/parameters in our model).
    - With each year increase in age, income increases by about 2000 dollars.
    - With each year increase in *squared age*, income decreases by $19.  This is small, but significant.  Squared age is not interpretable as a unit, however adding this accounts for the decrease in income among the oldest respondents (and separates that out from the linear effect of age).
    
2. The p-value for the F-test of the overall model fit is less than alpha.  So our model is now significantly better than a null or empty model.  This indicates that we have an acceptable model with some predictive power.

3. The r-squared is 0.03.  This means that 3% of the variance in income is explained by the model (combination of age and squared age).  This is not a huge amount, but based on the F-test, we know it's significantly different from zero.

4. The intercept is still irrelevant.
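Because age and squared age move together, it can also help to look at their combined effect.  The sketch below plugs in the rounded coefficients quoted above (about 2000 for age and -19 for squared age, so the results are approximate) to compute the slope of the fitted curve at a few ages and the age at which predicted income peaks.  The helper `slope_at()` is a name introduced here for illustration.

```r
# Rounded coefficients quoted in the interpretation above (approximate)
b_age   <- 2000   # dollars per year of age
b_agesq <- -19    # dollars per squared year of age

# slope of the fitted curve at a given age: b_age + 2 * b_agesq * age
slope_at <- function(age) b_age + 2 * b_agesq * age

slope_at(c(25, 50, 75))   # 1050  100 -850

# age at which the fitted curve peaks (slope equals zero)
peak_age <- -b_age / (2 * b_agesq)
peak_age                  # about 52.6
```

With these rounded values, predicted income rises with age until the early fifties and falls afterwards, which fits the story about lower incomes among the oldest respondents.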

Reviewing our plots for our assumptions:

This time I'm just going to run plot on the model and let the 4 default plots print.  These will cover the ones we need to review. (reminder - these are not pq)

In [None]:
plot(mod2)

The errors appear to be independent, the residuals are somewhat normal (no change from previous), there are no influential outliers, and no clear violation of constant variance is evident.  So we have a good model, with significant but not strong predictive power.

Which leads us to...

[Return to Top](#top)
<a id = "eff1"></a>

### Effect Size

#### Unstandardized Effect Size
The unstandardized effect size is the value of the coefficient(s) in units of Y (in our example here, income).  The increase of 1 year of age yields an increase of $2000 in income.  We need to decide if we think this is large, or not.

#### Standardized Effect Size
We can't directly compare coefficients with different units of x.  A difference of 1 year of age yielding an increase of $2000 in income cannot be directly compared to an increase of 1 month of work experience yielding a smaller increase in income - the one unit increase of the x value is not directly comparable.  

But, if we standardize the variables so that their means are 0 and their standard deviations are 1, the units of the coefficients are then standard deviations (z-scores).  This makes it harder to understand the interpretation of coefficients (especially for a non-statistician), but allows us to directly compare the magnitude of the different coefficients.

In [None]:
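# standardize variables using scale()
# scale() returns a one-column matrix, so wrap in as.numeric() to keep plain numeric columns

cah2 %<>% mutate(age_std = as.numeric(scale(age)),
                 income_std = as.numeric(scale(income)),
                 agesq_std = as.numeric(scale(agesq)))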

# refit the model
mod3 <- lm(income_std ~ age_std + agesq_std, data = cah2)
summary(mod3)

Notice this doesn't change the p-values for our predictors, it only changes our coefficient estimates (the intercept is now essentially zero, because every variable is centered).  
1. A one standard deviation increase in age yields a 1 standard deviation (1.04 SD) increase in income.
2. A one standard deviation increase in squared age yields a 1 standard deviation **decrease** (-1.02) in income.
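Standardized coefficients are just the raw coefficients rescaled by sd(x) / sd(y).  The sketch below checks this on simulated data (an assumption, not the CAH data; the column names simply mirror the ones used above) and confirms that the t-statistics and p-values for the slopes do not change.

```r
# Simulated data shaped like the age / agesq model (assumption: not the CAH data)
set.seed(5)
sim <- data.frame(age = sample(18:80, 250, replace = TRUE))
sim$agesq  <- sim$age^2
sim$income <- 10000 + 2000 * sim$age - 19 * sim$agesq + rnorm(250, sd = 15000)

raw_fit <- lm(income ~ age + agesq, data = sim)

# standardize every column, then refit the same model
sim_std <- as.data.frame(scale(sim))
std_fit <- lm(income ~ age + agesq, data = sim_std)

# rescale the raw slopes by sd(x) / sd(y)
rescaled <- coef(raw_fit)[-1] *
  c(sd(sim$age), sd(sim$agesq)) / sd(sim$income)
all.equal(unname(rescaled), unname(coef(std_fit)[-1]))

# slope t-statistics (and therefore p-values) are unchanged
raw_t <- unname(summary(raw_fit)$coefficients[-1, "t value"])
std_t <- unname(summary(std_fit)$coefficients[-1, "t value"])
all.equal(raw_t, std_t, tolerance = 1e-6)
```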

#### R-squared
R-squared is still an effect size we discuss.  Again, it's the proportion of variance in our outcome that is explained by the model.  We interpreted that in the model interpretation above - this model only explains 3% of the variance in income, so although the model is significant, the predictive power of this model is poor.

[Return to Top](#top)
<a id = "ex2"></a>

## Example 2: Predicting Income using Gender
Now we'll see if gender is a good predictor of a person's income.

<a id = "prelim2"></a>
### Preliminary Inspection
Because we have a categorical predictor, we cannot use a scatterplot to preview the relationship between the variables.  But let's peek at a boxplot.

In [None]:
#data cleaning
cah %<>% filter(gender %in% c("Male", "Female")) %>% mutate(gender = as.factor(gender))

# boxplot
cah  %>% ggplot(aes(x = gender, y = income/1000, fill = gender)) +
            geom_boxplot() +
            labs(x = "Gender",
                 fill = "Gender",
                 y = "Income in $1000s",
                 title = "Income by Gender")

There may be a difference here, but let's fit the model and see.

[Return to Top](#top)
<a id = "modfit2"></a>

### Model Fitting
Because gender is categorical, "dummy" coding is used to enter it into the model.  These are 0/1 variables that indicate whether the respondent belongs to each level of the categorical variable.  We always have one fewer "dummy" variables than levels of our categorical predictor.  Because gender has only two levels, we have only one parameter for gender.  R automatically sets the first factor level as the reference category to which the other parameters are compared.

In [None]:
mod4 <- lm(income ~ gender, data = cah)
summary(mod4)

Let's take a look at what this means.

### Interpretation:

#### The intercept:
In this model with a single categorical predictor, the intercept is the mean of income for the reference category (in this case Females).  So the average income for females in the sample is about $47k.

#### The coefficient estimate for genderMale:
Looking first at the p-value (last column) we can see that p < alpha (0.05), therefore we reject null.  This means that the coefficient for gender is significantly different from zero.  Gender is a significant predictor of income.

Because the variable is categorical, genderMale is like an on/off switch.  So we cannot interpret this coefficient the same way we would for a continuous/numerical predictor. Here, we would say as compared to the reference group (Female), Males earn, on average, $11,570 more than females.
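To see the dummy coding at work, the sketch below uses simulated data (an assumption, not the CAH data) to show the 0/1 column R builds and to check that the intercept is the reference-group mean and the genderMale coefficient is the difference between the two group means.

```r
# Simulated two-group data (assumption: not the CAH data)
set.seed(9)
sim <- data.frame(
  gender = factor(rep(c("Female", "Male"), each = 50))
)
sim$income <- ifelse(sim$gender == "Male", 58000, 47000) +
  rnorm(100, sd = 20000)

# the 0/1 dummy column R builds for the non-reference level
head(model.matrix(~ gender, data = sim), 3)

fit <- lm(income ~ gender, data = sim)
group_means <- tapply(sim$income, sim$gender, mean)

# intercept = mean of the reference level (Female, first alphabetically)
all.equal(unname(coef(fit)["(Intercept)"]), unname(group_means["Female"]))

# genderMale = difference between the two group means
all.equal(unname(coef(fit)["genderMale"]),
          unname(group_means["Male"] - group_means["Female"]))
```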

#### F-statistic
The F-statistic is our overall (omnibus) test of model fit.  It tells us:
1. If our r-squared is significantly different from 0. AND
2. If our model is significantly better than a "null" / "empty" / intercept only model.

For this model our p-value is less than alpha, indicating that this model fits better than a null or empty model.

*Because we have only one predictor variable (parameter), the p-value for the t-test of the coefficient and for the F-test of the model are identical.*

#### R-squared
R-squared is 0.02, which is very low, indicating that gender, while significant, does not explain much of the variance in income.  Because the F-test was significant, however, we do know that this r-squared is significantly different from zero.

[Return to Top](#top)
<a id = "assump2"></a>

### Checking Assumptions

#### Linear in the Parameters / Errors are Independent
The assumption of linear in the parameters says that the variables are entered into the model correctly.  We've entered gender as a single dummy variable (genderMale = 1 is Male, genderMale = 0 is Female).  

We need to check if we have independent and identically distributed (i.i.d) errors.

In [None]:
# non-PQ / quick and dirty plot
# note - we're plotting with our saved model output - mod4
plot(mod4, which = 1) ## residuals vs. fitted is plot #1

Model fit/ diagnostic plots for models with just categorical predictors are a bit weird compared to those for numerical predictors.  Our x variable can only take two values, 0 or 1, so we just get two columns of observations.

Because the red line is horizontal, we can conclude that our errors are independent.

#### Right variables 
Like linear in the parameters, it relies on the analyst to make sure they're using the right variables in the model.

#### Normally Distributed Errors
This we're already familiar with from ANOVA - we look at the normality of the distribution of the residuals using a QQ plot.

In [None]:
# quick and dirty QQ
plot(mod4, which = 2)

It looks like there is some substantive deviation from normality in the tails, again because the distribution of income is right-skewed.

#### No influential outliers
We need to make sure that we don't have any observations that 1. Have inordinate influence on the fit of the line and 2. are outliers.  For this week look at a plot of Residuals vs. Leverage.

In [None]:
# quick and dirty plot
plot(mod4, which = 5)

The dashed red Cook's distance line doesn't even show on the plot, therefore we don't have any influential outliers.

#### Homoscedasticity / Homoskedasticity
Finally, we turn to our last assumption, Homoscedasticity.

In [None]:
# quick and dirty resid/pred and scale location 
plot(mod4, which = c(1,3))

The red reference line in both plots appears to be relatively horizontal, therefore I don't anticipate that we have heteroscedasticity.  We are not violating the assumption of constant variance.

[Return to Top](#top)
<a id = "eff2"></a>

### Effect Size

#### Unstandardized
The difference in mean income is about $11k, which I think is substantive.

#### Standardized
We can't standardize gender (it's 0/1) but we can standardize income.

In [None]:
cah3 <- cah %>% mutate(income_std = as.numeric(scale(income))) # as.numeric() drops the matrix wrapper that scale() returns

mod5 <- lm(income_std ~ gender, data = cah3)
summary(mod5)

The mean income for males is about a third of a standard deviation higher than the mean income for females (the reference group).  The mean income for females (the intercept) is 0.16 of a SD below the grand (overall) mean, which is zero after standardizing.

### Isn't this just a two-sample t-test?
You may not have been thinking this, but this linear model, with one two-level categorical variable, is **EXACTLY THE SAME** as conducting a two-sample t-test of income by gender that assumes equal variances in the two groups.  

Let's see:

In [None]:
summary(mod4)
t.test(income ~ gender, data = cah)

And look at that - 

1. The intercept of our linear model is the same thing as the mean in group Female in our t-test output.
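2. The t-score and p-values are nearly identical.  They are not exactly identical only because `t.test()` applies Welch's adjustment for unequal variances by default.  The sign of the t-score is also flipped, because the t-test takes Female minus Male while the model coefficient is Male minus Female.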
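To see the exact match, switch off Welch's adjustment with `var.equal = TRUE`.  The sketch below uses simulated data (an assumption, not the CAH data) to compare the linear model with both versions of the t-test.

```r
# Simulated two-group data with unequal group sizes (assumption: not the CAH data)
set.seed(21)
sim <- data.frame(
  gender = factor(rep(c("Female", "Male"), times = c(60, 40)))
)
sim$income <- ifelse(sim$gender == "Male", 58000, 47000) +
  rnorm(100, sd = 20000)

lm_row <- summary(lm(income ~ gender, data = sim))$coefficients["genderMale", ]
pooled <- t.test(income ~ gender, data = sim, var.equal = TRUE)
welch  <- t.test(income ~ gender, data = sim)   # default: Welch

# The pooled-variance t-test reproduces the lm() p-value exactly.
# The t statistic has the opposite sign because t.test() computes
# Female minus Male, while the lm coefficient is Male minus Female.
all.equal(unname(lm_row["Pr(>|t|)"]), pooled$p.value)
all.equal(unname(lm_row["t value"]), -unname(pooled$statistic))

# Welch uses different degrees of freedom, so it only comes close
c(lm = unname(lm_row["Pr(>|t|)"]), welch = welch$p.value)
```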

Almost everything in statistics is a linear model "underneath the hood." 
https://twitter.com/ChelseaParlett/status/1249112302348980226?s=20

[Return to Top](#top)
<a id = "perf"></a>

## OPTIONAL - Model Fit Dashboard using `performance`

The `performance` package aims to make model diagnostics easy to generate.  
https://easystats.github.io/performance/

Let's look at some of the functions they offer.  We'll look at our models using age and agesq.

In [None]:
# model performance, single model
model_performance(mod2)

We get the model performance measures, including r-squared and adjusted r-squared.

We can compare two models:

In [None]:
# compare performance
compare_performance(mod1, mod2)

And finally, we can get our diagnostic plots as one large "dashboard."

In [None]:
check_model(mod1)

We can ignore the warnings - we can't check for multicollinearity because there's only one predictor.  But overall this gives us a nice concise overview of all of the diagnostic plots we need to look at.

In [None]:
# check heteroscedasticity (B-P test)
check_heteroscedasticity(mod1)

In [None]:
# check normality of residuals using sig test (not typically preferred over QQ plot due to sensitivity)
# Statistical test is the Shapiro-Wilk Normality Test 
check_normality(mod1)

"Like most statistical significance tests, if the sample size is sufficiently large this test may detect even trivial departures from the null hypothesis (i.e., although there may be some statistically significant effect, it may be too small to be of any practical significance); thus, additional investigation of the effect size is typically advisable, e.g., a Q–Q plot in this case." https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test