# Multiple Linear Regression
In this lab we'll look at one example of an iterative model fitting process, similar to how you should approach Project 4.  The notebook will also include code examples for generating PQ tables and graphs to include in your paper.

<a id="top"></a>

### Table of Contents
- [Preliminary Inspection](#prelim1)
- First Model (one variable)
    - [Model Fitting](#modfit1)
    - [Checking Assumptions](#assump1)
- [Model Fitting - Part 2](#modfit2)
    - [Evaluating Model 2](#eval2)
- [Model Fitting - Part 3](#modfit3)
    - [Evaluating Model 3](#eval3)
- [PQ Format](#pqform)
     - [2x2 Grid of Graphs](#pqgraph)
     - [PQ Table - Model Output](#modeloutput)
     - [PQ Plots - Model Assumptions](#pqassump)
- [OPTIONAL: Special Model Fit Dashboard - `performance` package](#perf)

We'll use the same data from the Cards Against Humanity poll that we've used in previous labs.

In [None]:
## loading some libraries!
library(tidyverse) ## all of our normal functions for working with data
library(magrittr)
library(olsrr) ## ols plots
library(car) ## for the vif() function
# library(performance) ## OPTIONAL

options(repr.plot.width=10, repr.plot.height=8) ## set options for plot size within the notebook -
# this is only for jupyter notebooks, you can disregard this.

In [None]:
## LOAD the DATA
cah <- read_csv("201806-CAH_PulseOfTheNation_Raw.csv")
## variable names currently full questions - need to rename
new_names <- c("gender", "age", "agerange", "race", "income", "educ", "partyid", "polaffil", 
               "trump", "hollymoney", "fed_min_is", "fed_min_should", "fed_tax_is", "fed_tax_should", 
               "redist", "redist_you", "redist_people", "baseincome", "faircomp", "ceofair", "attractive")
colnames(cah) <- new_names
cah %<>% drop_na(income, age, fed_min_should) %>% 
            mutate_if(is.character, as.factor) %>% 
            filter(educ != "DK/REF" & gender != "DK/REF" & gender != "Other" & 
                   race != "DK/REF" & educ != "Other" & partyid != "DK/REF" &
                   polaffil != "DK/REF")  %>% 
            droplevels()

glimpse(cah)

In [None]:
summary(cah)

<a id = "ex1"></a>
## Example: Predicting what people think the Federal Minimum Wage should be
We want to see which variables are possible predictors/explainers of how much people think the federal minimum wage should be - `fed_min_should`.

### Preliminary Inspection
Let's plot a scatterplot matrix of the variables in the dataframe.  Because there are so many variables, I will print in "batches" by specifying column numbers, always starting with fed_min_should so that it's the first row/column in the matrix.

In [None]:
pairs(cah[c(12,1:5)])

In [None]:
pairs(cah[c(12,6:11)])

In [None]:
pairs(cah[c(12,13:16)])

In [None]:
pairs(cah[c(12,17:21)])

It's hard to see any strong predictors here, but let's start with income, which we would expect to be related to how much a person thinks the minimum wage should be.

Because we've decided on using this variable, let's print a scatterplot to look at the relationship.

In [None]:
cah %>% ggplot(aes(y = fed_min_should, x = income/1000)) +
            geom_point() +
            geom_smooth(method = "lm") +
            labs(y = "What the federal minimum wage should be",
                 x = "Income in $1000s",
                 title = "Relationship between Income and opinion about Minimum Wage")

Well, there may not be a lot going on here, but let's start building our model and see.

[Return to Top](#top)
<a id = "modfit1"></a>

### Model Fitting

In [None]:
# model is fitted using lm function
# lm(outcome ~ predictor, data = yourdf)
mod1 <- lm(fed_min_should ~ income, data = cah)

# use summary on saved model output to see the results
summary(mod1)

### Interpretation:

#### The intercept:
When a person's income == 0, their opinion, on average, is that the federal minimum wage should be around $13.

#### The coefficient estimate for income:
Looking first at the p-value (last column) we can see that p > alpha (0.05), therefore we fail to reject null.  This means that the coefficient for income is **not** significantly different from zero.  This is that horizontal line (slope == 0) that we saw on the scatterplot.

The coefficient for income reflects the change in the y-variable (fed_min_should) for a one-dollar increase in income, so it is expressed in units of y (dollars) per dollar of income.  Because a one-dollar change in income is tiny, the coefficient is hard to read.  We should divide income by 1000 in future models to aid in interpretation (so that 1 unit is $1,000, not $1).
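To see that rescaling a predictor only changes the units of its coefficient and not the fit, here is a minimal sketch using R's built-in `mtcars` data rather than the poll data. The variables `wt` and `wt_lbs` are stand-ins for illustration, not variables from this lab. `wt` is recorded in 1000s of pounds; converting it to single pounds shrinks the slope by a factor of 1000 while leaving the fitted values unchanged.

```r
# Illustration with built-in data (not the CAH poll): weight in 1000 lbs vs. lbs
cars <- transform(mtcars, wt_lbs = wt * 1000)

fit_k  <- lm(mpg ~ wt, data = cars)      # predictor in 1000s of pounds
fit_lb <- lm(mpg ~ wt_lbs, data = cars)  # same predictor in single pounds

slope_k  <- unname(coef(fit_k)["wt"])
slope_lb <- unname(coef(fit_lb)["wt_lbs"])

# the slope per 1000 lbs is 1000 times the slope per pound
all.equal(slope_k, slope_lb * 1000)

# the fit itself does not change
all.equal(summary(fit_k)$r.squared, summary(fit_lb)$r.squared)
```

The same logic applies to income: dividing by 1000 multiplies the coefficient by 1000 without changing the p-value or r-squared.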

#### R-squared
The adjusted R-squared is negative.  R-squared itself can't be negative in a model with an intercept - in simple regression it's the square of the correlation and is therefore constrained to between 0 and 1.  However, the R-squared here is so low that the adjustment, which penalizes R-squared so that we don't overstate it for the population, pushes the adjusted value below zero.
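As a rough illustration of how the adjustment works, the sketch below computes adjusted r-squared by hand on simulated data and checks it against `summary()`. The data frame `sim` is hypothetical (x and y are generated independently), so this is not the poll data. The formula used is adj = 1 - (1 - R²)(n - 1)/(n - p - 1), where p is the number of predictors.

```r
# Simulated example (hypothetical data): x and y are generated independently
set.seed(42)
n <- 50
sim <- data.frame(x = rnorm(n), y = rnorm(n))
fit <- lm(y ~ x, data = sim)

r2 <- summary(fit)$r.squared
p  <- 1  # number of predictors

# adjusted r-squared penalizes r-squared for the number of predictors
adj_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r.squared = r2,
  adj.manual = adj_manual,
  adj.from.summary = summary(fit)$adj.r.squared)
```

From the formula, the adjusted value drops below zero whenever R² is smaller than p / (n - 1), which is why a very weak model can report a negative adjusted r-squared.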

#### F-statistic
The F-statistic is our overall (omnibus) test of model fit.  It tells us:
1. If our r-squared is significantly different from 0. AND
2. If our model is significantly better than a "null" / "empty" / intercept only model.

For this model our p-value is greater than alpha, therefore the model is not an acceptable model for predicting fed_min_should, and as we could tell from the negative adjusted r-squared, r-squared is not significantly different from zero.

*Because we have only one predictor variable (parameter), the p-value for the t-test of the coefficient and for the F-test of the model are identical.*
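The note above can be checked directly. This is a small sketch on the built-in `mtcars` data (an assumption for illustration, not the poll data): with a single predictor, the model F-statistic equals the square of the coefficient's t-statistic, so the two p-values match.

```r
# Illustration with built-in data: one predictor, so F = t^2
fit <- lm(mpg ~ wt, data = mtcars)
s   <- summary(fit)

t_val <- coef(s)["wt", "t value"]
p_t   <- coef(s)["wt", "Pr(>|t|)"]

f   <- s$fstatistic  # named vector: value, numdf, dendf
p_f <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

all.equal(unname(f["value"]), t_val^2)  # F-statistic is the squared t-statistic
all.equal(unname(p_f), p_t)             # and the p-values agree
```

Once a model has more than one predictor, the F-test covers all predictors jointly and the two tests no longer coincide.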

[Return to Top](#top)
<a id = "assump1"></a>

### Checking Assumptions
Even though this model is not predictive of fed_min_should, we still need to review the assumptions.  We might find issues with assumption violations that affect our model fit.

#### Linear in the Parameters / Errors are Independent
The assumption of linear in the parameters says that the variables are entered into the model correctly (that the relationships are truly linear).  This is not something we can test, but something we need to confirm as modelers.  However, if the errors are not independent, and there is a curvilinear relationship, that may be an indication that we possibly need to square one of our predictors.

In [None]:
# non-PQ / quick and dirty plot
# note - we're plotting with our saved model output - mod1
plot(mod1, which = 1) ## residuals vs. fitted is plot #1

In [None]:
# pq version of resid vs. fit - ggplot

# First, save residuals and fitted values from model output to dataframe
resfit <- data.frame(resid = mod1$residuals, 
                     fitted = mod1$fitted.values)

#plot with ggplot
resfit %>% ggplot(aes(x = fitted, y = resid)) +
            geom_point() +
            geom_smooth(color = "red", se = FALSE) + 
            ## do not use method = "lm" - we want to see possible curvilinear relationships
            ## se = FALSE because we don't need CI around line.
            labs(x = "Fitted Values",
                 y = "Residuals",
                 title = "Model One - Residuals vs. Fitted")

Here we would look for any apparent linear or curvilinear relationship between the fitted values (y_hat) and the residuals.  If there is no relationship, the red "guide line" would be horizontal.  In this case there is a slight curve (very slight curve) perhaps indicating a potential curvilinear relationship, however it seems to connect to an extreme outlier, so this is not a concern right now.

#### Right variables 
Like linear in the parameters, this assumption relies on the analyst to make sure they're using the right variables in the model.

#### Normally Distributed Errors
This we're already familiar with from ANOVA - we look at the normality of the distribution of the residuals using a QQ plot.

In [None]:
# quick and dirty QQ
plot(mod1, which = 2)

In [None]:
# PQ QQplot

# we can use the previously saved df with residuals included
resfit %>% ggplot(aes(sample = resid)) +
  geom_qq_line(color = "red", size = 1) +
  geom_qq(color = "black") +
  labs(title = "QQ Plot of Residuals")

There is an extreme deviation from normality in the upper tail that is concerning.  This also is potentially due to outliers, so let's evaluate to see if there are influential outliers.

#### No influential outliers
We need to make sure that we don't have any observations that 

1. Have inordinate influence on the fit of the line and 

2. are outliers.  

For this we look at a plot of Residuals vs. Leverage 

In [None]:
# quick and dirty plot
plot(mod1, which = 5) ## plot 5 is Residuals vs. Leverage

In [None]:
# check for any obs that exceed Cook's D threshold 0.5
cd <- cooks.distance(mod1)
lev <- cd > 0.5
cd[lev]

In [None]:
## PQ plot, OLSRR version
ols_plot_resid_lev(mod1)

It looks like observation 182 is an influential outlier.  We will refit the model without this observation before adding variables.
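To make the Cook's Distance check concrete, here is a minimal sketch on simulated data. Everything in it is hypothetical: `sim` is a made-up data frame with one point deliberately placed far from the rest in both x and y. It flags observations above the 0.5 threshold used in this lab and refits without them.

```r
# Simulated example (hypothetical data): 20 well-behaved points plus one extreme point
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)
sim <- data.frame(x = c(x, 60), y = c(y, 0))  # row 21 is the planted outlier

fit <- lm(y ~ x, data = sim)

# flag observations whose Cook's Distance exceeds 0.5
cd <- cooks.distance(fit)
influential <- which(cd > 0.5)
influential

# refit without the flagged rows and compare the coefficients
fit_clean <- lm(y ~ x, data = sim[-influential, ])
rbind(with_outlier = coef(fit), without_outlier = coef(fit_clean))
```

Comparing the two sets of coefficients shows how much a single high-leverage outlier can pull the fitted line.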

#### Homoscedasticity / Homoskedasticity
Finally, we turn to our last assumption, Homoscedasticity.  This is the assumption that the variance of the errors (residuals) is constant across the range of fitted values.  If we violate this assumption, that condition is called Heteroscedasticity.  We can examine this using a scale-location plot, a plot of residuals vs. fitted values (the same plot used above for Independent Errors), and by running the Breusch-Pagan Test of Heteroskedasticity (a statistical test).

In [None]:
# quick and dirty plots
plot(mod1, which = c(1,3)) ## residuals vs. fitted is plot #1, scale-location is plot 3

In both of these, we're looking for a situation where the dots form a funnel or cone shape.  If we have constant variance (homoscedasticity) the resid/fitted points will form roughly a rectangle shape around the horizontal midline.  In this case, any potential shape is not readily apparent - a few outliers should not be considered when looking for the cone shape.  We can confirm with the Breusch-Pagan Test.

In [None]:
ols_test_breusch_pagan(mod1)

Our p-value is greater than an alpha of 0.05, so we fail to reject the null hypothesis of constant variance.  We have no evidence of heteroscedasticity, so we're good here.

NOTE: Even if you reject null on Breusch Pagan test I would still look on the plot to see how much it looks cone/funnel shaped.  The B-P test is VERY sensitive.
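For intuition about what a Breusch-Pagan style test does, the sketch below builds one by hand in base R. It uses simulated data (hypothetical, with error spread that grows with x) and the studentized (Koenker) variant: regress the squared residuals on the predictors and compare n × R² to a chi-squared distribution. This is an assumption-laden illustration, and its numbers may not match `ols_test_breusch_pagan()`, which may use a different variant of the test.

```r
# Simulated example (hypothetical data): error spread grows with x
set.seed(123)
n <- 200
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# auxiliary regression: squared residuals on the predictor
aux <- lm(resid(fit)^2 ~ x)

# studentized (Koenker) statistic: n * R-squared of the auxiliary regression
lm_stat <- n * summary(aux)$r.squared
df      <- length(coef(aux)) - 1  # number of predictors in the auxiliary regression
p_value <- pchisq(lm_stat, df = df, lower.tail = FALSE)

c(statistic = lm_stat, df = df, p.value = p_value)
```

A small p-value says the squared residuals are related to the predictors, which is the funnel shape we look for in the plots.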

### Refitting the Model
In our review of the assumptions we determined that observation 182 is an influential outlier.  Let's remove that observation and refit model 1.

In [None]:
# influential outlier is 182
cah <- cah[-c(182), ] # remove rows by row number
# convert income to income/1000
cah %<>% mutate(income_k = income / 1000)

In [None]:
# refit model
mod1b <- lm(fed_min_should ~ income_k, data = cah)
summary(mod1b)

This didn't improve our model fit - actually the p-value is now higher.

Let's double-check our assumptions quickly before proceeding.

In [None]:
plot(mod1b)

These look relatively unchanged.  Let's go ahead and add another variable.

[Return to Top](#top)
<a id = "modfit2"></a>

## Model Fitting - Model 2 - adding other variables
Let's add more predictors to the model - gender, age, race, educ, and partyid.  We will have a total of 6 predictor variables (including income_k).

Gender, race, educ, and partyid are categorical variables (age is numeric).  The reference group defaults to the first factor level.
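If the default reference group is not the comparison you want, `relevel()` changes it. The sketch below uses a small made-up factor (hypothetical values, not the actual poll responses) to show which level R treats as the reference and how to change it.

```r
# Hypothetical factor (not the actual poll responses)
party <- factor(c("Democrat", "Republican", "Independent", "Democrat", "Republican"))

levels(party)      # levels are sorted alphabetically by default
levels(party)[1]   # the first level is the reference group in lm()

# make "Independent" the reference group instead
party2 <- relevel(party, ref = "Independent")
levels(party2)[1]
```

Changing the reference level does not change the overall model fit, only which group the other coefficients are compared against.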

In [None]:
mod2 <- lm(fed_min_should ~ income_k + gender + age + race + educ + partyid, data = cah)
summary(mod2)

We have a lot of parameters here, but most are not significant.  Let's review:

1. **intercept** - This is the mean of fed_min_should when all of the numerical variables are 0 and all the categorical variables are at their reference level.  This is not meaningful therefore we will not interpret.
2. **partyid** - The only significant predictor in this model is partyid.  As compared to the reference group of Democrats, Republicans, on average, think the federal minimum wage should be about $1.60 lower, holding all else constant.
3. The other parameters are not significant.
4. Model fit - The overall model F-test shows the model is not significantly better than a null model.

Before checking assumptions, I'm going to reduce the number of predictors - I'll keep age, race, educ, and partyid.

In [None]:
mod2b <- lm(fed_min_should ~ age + race + educ + partyid, data = cah)
summary(mod2b)

The model is still not significant, but it has improved over the previous model with more predictors.  How could a model with fewer predictors be better?  The predictors we dropped explained very little variance, and each one costs a degree of freedom in the overall F-test, so removing them can improve the test.

Let's see if this model is significantly better than mod2.

In [None]:
anova(mod2, mod2b)

So mod2 is the full model, and mod2b is the nested model.

The p-value is greater than alpha = 0.05, therefore we fail to reject null.

This means that the smaller model is preferred, in interest of parsimony.

We cannot compare mod2b to mod1b with this test because the models are not nested: mod2b does not include income_k.
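The nested-model comparison that `anova()` reports is a partial F-test, which can be reproduced by hand. This sketch uses the built-in `mtcars` data and made-up model names (`full`, `reduced`) purely for illustration. It compares the residual sums of squares of the two models, scaled by their degrees of freedom.

```r
# Illustration with built-in data: reduced model is nested inside the full model
full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

rss_full    <- sum(resid(full)^2)
rss_reduced <- sum(resid(reduced)^2)
df_full     <- df.residual(full)
df_reduced  <- df.residual(reduced)

# F = (drop in RSS per extra predictor) / (residual variance of the full model)
f_manual <- ((rss_reduced - rss_full) / (df_reduced - df_full)) / (rss_full / df_full)
p_manual <- pf(f_manual, df_reduced - df_full, df_full, lower.tail = FALSE)

tab <- anova(reduced, full)
all.equal(f_manual, tab$F[2])
all.equal(p_manual, tab$`Pr(>F)`[2])
```

The calculation only makes sense when every predictor in the smaller model also appears in the larger one, which is why non-nested models need a different comparison.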

Let's check the assumptions, now including multicollinearity.

[Return to Top](#top)
<a id = "eval2"></a>

### Evaluating Model 2 
We're going to use mod2b as model 2.  Let's check the assumptions, starting with our new assumption, multicollinearity.

#### Multicollinearity
You may recall that there was an additional issue/assumption about OLS we discussed, which was not relevant with simple linear regression where we only had one IV.  This issue is multicollinearity - when two of our IVs are *too* correlated with each other.  They "overlap" in their prediction of the DV.

##### What's the big deal?
- redundancy in predicted variance - both IVs are trying to explain the same variation in the DV.
- Low correlation in IVs = little to no effect to the model
- Medium correlation in IVs = affects regression coefficients
- High correlation = near perfect redundancy - the model "blows up"

It will be very important to make sure there's no multicollinearity in our model, especially since our model fit is not good and we're looking for an explanation.  

We check for multicollinearity using VIF - Variance Inflation Factor

#### VIF (Variance Inflation Factor)

Estimate of the amount of inflated variance in coefficient due to multicollinearity in the model. The higher the VIF, the less reliable the regression model.

In [None]:
# vif function from the car package
vif(mod2b)

In [None]:
# or you can use this function from olsrr
ols_vif_tol(mod2b)

#### Interpreting VIF
- Scores start at 1 – all clear
- Scores >1 to 4: Still safe
- Score of 5: Threshold for VIF issues (prefer 4 & under)
- Score of 10 & up: Need to do something about it, massive multicollinearity

When we reach a VIF of 4 that means the standard errors, which build the confidence intervals, would be twice as big
as they would be otherwise.
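The "twice as big" statement follows from the definition of VIF. As a hedged sketch on the built-in `mtcars` data (numeric predictors only, chosen for illustration), the VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The square root of the VIF is the factor by which the standard error is inflated.

```r
# Illustration with built-in data: VIF by hand for three numeric predictors
predictors <- c("wt", "hp", "disp")

vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  # regress this predictor on the remaining predictors
  aux <- lm(reformulate(others, response = v), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

vif_manual        # variance inflation factor for each predictor
sqrt(vif_manual)  # factor by which each standard error is inflated
```

A VIF of 4 gives a square root of 2, hence standard errors twice as large. For categorical predictors with more than two levels, `car::vif()` reports a generalized VIF instead, so this hand calculation applies to numeric predictors only.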

For our model, all of the values are close to 1, therefore none are concerning.  We do not have multicollinearity in this model.

We can proceed to look at our other assumptions

### Errors are Independent 
We need to check if we have independent and identically distributed (i.i.d) errors.

In [None]:
# non-PQ / quick and dirty plot
# note - we're plotting with our saved model output - mod2b
plot(mod2b, which = 1) ## residuals vs. fitted is plot #1

Because the red line is horizontal, we can conclude that our errors are independent.  There does not appear to be any linear or curvilinear relationship between the residuals and predicted values.

### Normally Distributed Errors

In [None]:
# quick and dirty QQ
plot(mod2b, which = 2)

We still have substantial deviation in the upper tail that would lead us to a conclusion that the residuals are NOT normally distributed.

### No influential outliers

In [None]:
# quick and dirty plot
plot(mod2b, which = 5)

We do not have influential outliers - none of the observations surpass the red dashed line that indicates a Cook's Distance of 0.5.

### Homoscedasticity / Homoskedasticity
Finally, we turn to our last assumption, Homoscedasticity.

In [None]:
# quick and dirty resid/pred and scale location 
plot(mod2b, which = c(1,3))

There does not seem to be any noticeable cone/funnel shape, therefore I don't anticipate that we have heteroscedasticity.  We are not violating the assumption of constant variance.

[Return to Top](#top)
<a id = "modfit3"></a>

## Model 3 - stepwise regression?
Now we've gone from one predictor variable, to many predictor variables, and have a model with four predictors - age, race, educ, and partyid.  We still haven't found a model that fits our data well to predict fed_min_should.  Let's attempt stepwise regression to see if we can "throw all of the predictors" at the model and have R make the best model.  

**NOTE: Stepwise regression is not typically recommended, it's better to specify the model based on your theory/hypotheses, and then see if those hypotheses are supported.**

There are two kinds of stepwise regression - forward where you start with an empty model and add significant predictors one by one, and backwards where you start with a full model and remove predictors one by one.  I'm going to use a process that works both ways, starts with an empty model, adds predictors, then removes them if necessary at subsequent steps.
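For comparison, base R ships its own stepwise routine, `step()`. Note that this is a different technique from the p-value based olsrr function used in this lab: `step()` adds and drops predictors based on AIC. The sketch below runs it in both directions on the built-in `mtcars` data; the candidate predictors are arbitrary choices for illustration.

```r
# Illustration with built-in data: AIC-based stepwise selection in base R
null_mod <- lm(mpg ~ 1, data = mtcars)  # start from the empty (intercept-only) model

step_mod <- step(null_mod,
                 scope = ~ wt + hp + qsec + drat,  # candidate predictors
                 direction = "both",               # add and remove at each step
                 trace = 0)                        # suppress step-by-step printing

formula(step_mod)  # the model that was selected
```

Because the selection criteria differ, an AIC-based search and a p-value based search can settle on different models from the same candidate set.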

In [None]:
# first, specify a model object with all possible predictors included
# outcome ~ . means that all other variables in the df are treated as predictors.
model <- lm(fed_min_should ~ . , data = cah)

# step through the model, adding and removing predictors based on their p-values
ols_step_both_p(model)

The model selected by the stepwise procedure to predict fed_min_should uses the predictors polaffil (political affiliation, which is not exactly the same as partyid), baseincome (support for universal basic income), and agerange (categorical age variable).

We can fit this model and check assumptions.

In [None]:
mod3 <- lm(fed_min_should ~ polaffil + baseincome + agerange, data = cah)
summary(mod3)

[Return to Top](#top)
<a id = "eval3"></a>
### Evaluating Model 3

Let's start with interpretations:
1. **intercept** - this is the average value of what people think the minimum wage should be at the base level of all of the categorical variables.  So for a conservative, baseincome == DK/REF, age range 18-24, the fitted/predicted value of fed_min_should is $10.84.

2. **polaffil** - We get two coefficients - one for Liberal and one for Moderate.  These need to be evaluated in comparison to the reference group.  The coefficient for liberal is significant, but the coefficient for moderate is not:
    - Holding all else equal, liberals, on average, think the minimum wage should be 2 dollars higher than conservatives.
    - Holding all else equal, moderates, on average, think the minimum wage should be 77 cents higher than conservatives.

3. **baseincome** - We didn't remove "DK/REF" from the data, so that was used as the reference level.  Neither coefficient is significant, however the interpretations would be:
    - Holding all else equal, people who do not support UBI think the minimum wage should be 49 cents _lower_ than those with no opinion on UBI.
    - Holding all else equal, people who do support UBI think the minimum wage should be $1.06 _higher_ than those with no opinion on UBI.

4. **agerange** - The reference group here is the lowest age group - 18 to 24 year olds.  So all of these are compared to that reference group.  The only significant difference is between 18-24 y/o and 25-34 y/o groups.
    - Holding all else constant, people between 25 and 34 think the minimum wage should be $3.19 higher than those between 18-24 y/o.
    - Holding all else constant, people between .... think the minimum wage should be .... higher than those between 18-24 y/o.

5. **r-squared** - The adjusted r-squared is 0.07, indicating that this model explains 7% of the variance in fed_min_should, which is not substantial.

6. **model F-test** - The p-value is less than alpha, therefore this model is significantly better than a null or empty model.  
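The intercept and reference-group interpretations above can be verified with `predict()`. This is a minimal sketch on the built-in `mtcars` data with two categorical predictors (an illustration only; the names `cars`, `ref_profile`, and `manual_profile` are made up). The prediction for a row at the reference level of every factor equals the intercept, and moving one factor off its reference level shifts the prediction by that level's coefficient.

```r
# Illustration with built-in data (not the CAH poll): two categorical predictors
cars <- transform(mtcars,
                  cyl = factor(cyl),
                  am  = factor(am, labels = c("automatic", "manual")))

fit <- lm(mpg ~ cyl + am, data = cars)

# a row at the reference level of every factor (4 cylinders, automatic)
ref_profile <- data.frame(cyl = factor("4", levels = levels(cars$cyl)),
                          am  = factor("automatic", levels = levels(cars$am)))

pred_ref  <- unname(predict(fit, newdata = ref_profile))
intercept <- unname(coef(fit)["(Intercept)"])
all.equal(pred_ref, intercept)  # prediction at reference levels is the intercept

# switch one factor to a non-reference level, holding the other constant
manual_profile <- transform(ref_profile,
                            am = factor("manual", levels = levels(cars$am)))
shift <- unname(predict(fit, newdata = manual_profile)) - pred_ref
all.equal(shift, unname(coef(fit)["ammanual"]))  # shift equals that coefficient
```

This is the "holding all else equal" comparison in code: only one factor changes between the two rows, so the difference in predictions is exactly one coefficient.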

#### Assumptions:

**Multicollinearity / VIF**

In [None]:
vif(mod3)

All VIFs are around 1, so not concerning.

**Errors are independent**

In [None]:
plot(mod3, which = 1)

This doesn't look concerning.  The red line isn't perfectly straight, but barely deviates from horizontal.  I would conclude that the errors are independent.

**Normally Distributed Errors**

In [None]:
plot(mod3, which = 2)

The substantial deviation in the upper tail indicates that the residuals are not normally distributed.

**Constant Variance / Homoscedasticity**

In [None]:
plot(mod3, which = c(1,3)) ## residuals vs. fitted and scale-location for model 3

There is not any substantial cone/funnel shape, therefore I will conclude that the assumption of constant variance (homoscedasticity) is not violated.  Let's see how to prepare these plots, and our other output, in PQ format for your assignments.

[Return to Top](#top)
<a id = "pqform"></a>

## PQ Format
For both HW10 and Project 4 you will be asked to include PQ format plots, graphs, and tables in your assignment.  While the base R `plot()` images are good enough for quickly working through the model fitting process, they are not appropriate for a report or publication.

<a id = "pqgraph"></a>

### 2x2 Grid of Graphs of Predictors vs. Outcome
You may have noticed in the Project 4 assignment that I asked you to produce a plot that shows the relationship between each predictor and your outcome.  Because you have been asked to include a minimum of 4 predictors, these can easily be laid out in a 2x2 grid.  I'm going to graph the predictors from the second model (mod2b) - age, race, educ, and partyid.  

Categorical predictors (race, educ, partyid) should be plotted using boxplot.
Continuous predictors (age) should be plotted using scatterplot.

In [None]:
## create each ggplot as a separate object
p1 <- cah %>% ggplot(aes(x=age, y=fed_min_should)) + ## indicate df, x and y variables.
    geom_point() +
    geom_smooth(method=lm, se=TRUE) + ## method is lm, show CI
    labs(x = "Age", y = "Minimum Wage Should be in $")

p2 <- cah %>% ggplot(aes(x = race, y = fed_min_should, fill = race)) +
    geom_boxplot() +
    labs(x = "Race", y = "Minimum Wage Should be in $") +
    theme(legend.title = element_blank(), ## remove legend title
          legend.position="bottom", ## move legend 
          axis.text.x = element_blank(), ## remove x-axis tick text
          axis.ticks = element_blank())  ## remove x-axis ticks

p3 <- cah %>% ggplot(aes(x = educ, y = fed_min_should, fill = educ)) +
    geom_boxplot() +
    labs(x = "Education", y = "Minimum Wage Should be in $") +
    theme(legend.title = element_blank(), ## remove legend title
          legend.position="bottom", ## move legend
          legend.text = element_text(size = 8), ## make legend font smaller so it all fits in grid
          axis.text.x = element_blank(), ## remove x-axis tick text
          axis.ticks = element_blank())  ## remove x-axis ticks

p4 <- cah %>% ggplot(aes(x = partyid, y = fed_min_should, fill = partyid)) +
    geom_boxplot() +
    labs(x = "Party ID", y = "Minimum Wage Should be in $") +
    theme(legend.title = element_blank(), ## remove legend title
          legend.position="bottom", ## move legend 
          axis.text.x = element_blank(), ## remove x-axis tick text
          axis.ticks = element_blank())  ## remove x-axis ticks

In [None]:
library(grid)
library(gridExtra)

# combine plots into a grid
grid.arrange(p1, p2, p3, p4, ncol = 2, 
             top = textGrob("Possible predictors of Opinion about appropriate Minimum Wage",gp=gpar(fontsize=20))) 


#save as file - commented out for purposes of the notebook
# g <- arrangeGrob(p1, p2, p3, p4, ncol = 2, 
#                 top = textGrob("YOUR TITLE HERE",gp=gpar(fontsize=20))) 
# ggsave(file="scatterplots.png", g, scale = 2) #saves g

The relationships between each of our four predictors and our y-variable (outcome) of interest are laid out in a tidy grid.  The code to save this four-panel graph object is included in the code block above, but commented out for the purpose of this notebook.

[Return to Top](#top)
<a id = "modeloutput"></a>

### PQ Table - Model Output / Summary
In your project 4 you are asked to include a PQ table of the summary of each model.  Here is an example of creating that summary using tidy and kable. For this example I'll use mod3.

In [None]:
library(broom) ## for tidy
library(kableExtra)
library(scales) ## for formatting functions like percent

# format the model output as a dataframe using tidy
tidy_mod3 <- tidy(mod3)

# update the "term" to PQ 
tidy_mod3$term <- c("Intercept", "Political Affiliation - Liberal", "Political Affiliation - Moderate",
                    "Does not support UBI", "Does support UBI", "25-34 years old", "35-44 years old", "45-54 years old",
                    "55-64 years old", "65 and older")
# round estimate, std.error, and statistic to 2 or 3 decimal places
tidy_mod3$estimate <- round(tidy_mod3$estimate, 3)
tidy_mod3$std.error <- round(tidy_mod3$std.error, 3)
tidy_mod3$statistic <- round(tidy_mod3$statistic, 2)

# convert p-values to either < 0.001 or actual value if higher than 0.001.
tidy_mod3 %<>% mutate(p.value = ifelse(p.value < 0.001, #logical
                                       "< 0.001",  #value if true
                                       sprintf("%.3f", p.value))) #value if false, shown to 3 decimal places

# rename columns
colnames(tidy_mod3) <- c("Predictor", "Estimate", "Std. Error", "t-statistic", "p-value")

tname <- "Model 3: Characteristics associated with Opinion Regarding Minimum Wage"
titlehead <- c(tname = 5)
names(titlehead) <- tname

#create footnote with n, adjusted r-squared, and F-test
mod_foot <- paste0("n = ",
                    nobs(mod3), # number of observations actually used to fit mod3
                    ". Adjusted r-squared = ",
                    round(summary(mod3)$adj.r.squared, 2),
                    ", F(",
                    summary(mod3)$fstatistic[2], ",", summary(mod3)$fstatistic[3],
                    ") = ", round(summary(mod3)$fstatistic[1], 2),
                    ".")
ref_foot <- "Reference levels are Political Affiliation - Conservative, No Opinion on UBI, and 18-24 years old."

tidy_mod3 %>% kable(booktabs = T, align = "rcccc") %>% 
                kable_styling(full_width = FALSE) %>% 
                add_header_above(header = titlehead, align = "l",
                     extra_css = "border-top: solid; border-bottom: double;") %>%
                row_spec(0, extra_css = "border-bottom: solid;") %>% 
                row_spec(nrow(tidy_mod3), extra_css = "border-bottom: solid;")  %>% 
                footnote(general = c(ref_foot, mod_foot)) %>% 
                save_kable("mod3.png")


Keep in mind that you may need to do more/different clean up to the numbers, especially the estimates, depending on the magnitude.  If your coefficients are in the 1000s, you should use comma() from scales to add commas (for example, convert 2000 to 2,000).

This is what my kable created in the code block above looks like:
![](mod3.png)

[Return to Top](#top)
<a id = "pqassump"></a>

### PQ Plots of Model Assumptions
Finally, we need PQ versions of our model assumption plots so we can include them in our projects when they show important violations that justify our modeling decisions.

#### Residuals vs. Fitted

In [None]:
# First, save residuals and fitted values from model output to dataframe
resfit <- data.frame(resid = mod3$residuals, 
                     fitted = mod3$fitted.values)

#plot with ggplot
resfit %>% ggplot(aes(x = fitted, y = resid)) +
            geom_point() +
            geom_smooth(color = "red", se = FALSE) + 
            ## do not use method = "lm" - we want to see possible curvilinear relationships
            ## se = FALSE because we don't need CI around line.
            labs(x = "Fitted Values",
                 y = "Residuals",
                 title = "Model Three - Residuals vs. Fitted")

#### QQ Plot

In [None]:
resfit %>% ggplot(aes(sample = resid)) +
  geom_qq_line(color = "red", size = 1) +
  geom_qq(color = "black") +
  labs(title = "QQ Plot of Residuals")

#### Residuals vs. Leverage

We can take advantage of the fact that olsrr plots are plotted using ggplot to customize titles, etc. for these plots.

In [None]:
# use olsrr plot 
reslevplot <- ols_plot_resid_lev(mod3) + labs(title = "Model 3: Residuals vs. Leverage")
reslevplot

The default plot is printed when we call the function, but the adjusted plot with the additional ggplot code (changing the overall plot title) is the second one we would use.  You can use other ggplot code to tweak this if you wish, changing colors, size of text, etc.

You can ignore the warning.


[Return to Top](#top)
<a id = "perf"></a>

## OPTIONAL - Model Fit Dashboard using `performance`

The `performance` package aims to make model diagnostics easy to generate.  
https://easystats.github.io/performance/

Let's look at some of the functions they offer.  We'll use the models we fit above.

In [None]:
library(performance)
# model performance, single model
model_performance(mod3)

We get the model performance measures, including r-squared and adjusted r-squared.

We can compare several models:

In [None]:
# compare performance
compare_performance(mod1b, mod2b, mod3)

And finally, we can get our diagnostic plots as one large "dashboard."

In [None]:
check_model(mod3)

In [None]:
# check heteroscedasticity (B-P test)
check_heteroscedasticity(mod1)

In [None]:
# check normality of residuals using sig test (not typically preferred over QQ plot due to sensitivity)
# Statistical test is the Shapiro-Wilk Normality Test 
check_normality(mod1)

"Like most statistical significance tests, if the sample size is sufficiently large this test may detect even trivial departures from the null hypothesis (i.e., although there may be some statistically significant effect, it may be too small to be of any practical significance); thus, additional investigation of the effect size is typically advisable, e.g., a Q–Q plot in this case." https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test