# After-Course Activity: Regression Analysis

## Objectives

In this exercise, you will perform regression analysis on structured data using R, predicting a target variable from several predictor variables. The goal is to show how a regression model trained on an existing data set can be used to predict unknown values.

## Overview

You will work with a data set called `Prestige`, which becomes available when you load the `car` package (recent versions of `car` ship it in the companion `carData` package). You will:

- Review the distribution of the target variable
- Examine the data set for correlated variables
- Define a linear model that describes the data and from which we can make predictions

## Data loading and analysis

In RStudio, create a new script (e.g. `regression_analysis.R`). Add commands to the file according to the instructions that follow in this exercise, and execute each command as you move through the steps.

Load the `car` library.

#### <font color="green">Solution...</font>

In [None]:
library(car)

Examine the structure of the `Prestige` data set.

#### <font color="green">Solution...</font>

In [None]:
str(Prestige)

Examine the distribution of the target variable `prestige`.

#### <font color="green">Solution...</font>

In [None]:
summary(Prestige$prestige)

Is there a difference between the mean and the median?

#### <font color="green">Solution...</font>

Yes: the mean is 46.83 and the median is 43.60.

What might this indicate?

#### <font color="green">Solution...</font>

A right-skewed distribution: a tail of high values pulls the mean above the median.

Generate a histogram to confirm it visually.

#### <font color="green">Solution...</font>

In [None]:
hist(Prestige$prestige)
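One way to make the skew easier to see is to mark the mean and the median on the histogram. This is a minimal sketch that assumes the `car` package is installed; the variable names, colors, and plot labels are illustrative choices, not part of the exercise.

```r
library(car)  # attaches carData, which provides the Prestige data set

prestige_mean   <- mean(Prestige$prestige)
prestige_median <- median(Prestige$prestige)

# A positive gap (mean above median) is consistent with a right skew
mean_median_gap <- prestige_mean - prestige_median
print(mean_median_gap)

hist(Prestige$prestige,
     main = "Distribution of prestige scores",
     xlab = "Prestige score")
abline(v = prestige_mean,   col = "red",  lty = 2)  # dashed line at the mean
abline(v = prestige_median, col = "blue", lty = 1)  # solid line at the median
legend("topright", legend = c("mean", "median"),
       col = c("red", "blue"), lty = c(2, 1))
```

If the dashed mean line sits to the right of the median line, the plot agrees with the numeric summary.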

Take a look at the distribution of the levels of the `type` attribute.

#### <font color="green">Solution...</font>

In [None]:
table(Prestige$type)
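By default, `table()` leaves missing values out of the counts, so it can be worth checking whether `type` has any. This matters later because `lm()` drops rows with missing predictor values under R's default `na.action`. The sketch below assumes the `car` package is installed; `type_counts` and `n_missing_type` are illustrative names.

```r
library(car)  # attaches carData, which provides the Prestige data set

# useNA = "ifany" adds an <NA> entry when missing values are present
type_counts <- table(Prestige$type, useNA = "ifany")
print(type_counts)

# Number of occupations with no recorded type
n_missing_type <- sum(is.na(Prestige$type))
print(n_missing_type)
```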

Compute the correlation coefficient between the `income` and `education` variables.

#### <font color="green">Solution...</font>

In [None]:
cor(Prestige$income, Prestige$education)

Create a correlation matrix to examine the relationship between the `income`, `education`, and `women` variables.

#### <font color="green">Solution...</font>

In [None]:
cor(Prestige[c("education","income","women")])
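The two-argument form of `cor()` returns a single coefficient, while the data-frame form returns a matrix that contains every pairwise coefficient. As a quick sanity check, the sketch below (assuming the `car` package is installed; `vars`, `cor_matrix`, and `pairwise` are illustrative names) confirms that the matrix entry for `income` and `education` matches the pairwise value.

```r
library(car)  # attaches carData, which provides the Prestige data set

vars <- c("education", "income", "women")
cor_matrix <- cor(Prestige[vars])
print(round(cor_matrix, 2))  # rounded only for display

# The off-diagonal matrix entry should equal the pairwise coefficient
pairwise <- cor(Prestige$income, Prestige$education)
print(all.equal(cor_matrix["income", "education"], pairwise))
```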

Visualize the relationship among these three variables.

#### <font color="green">Solution...</font>

In [None]:
pairs(Prestige[c("education","income","women")])

Are there any patterns in the plots?

#### <font color="green">Solution...</font>

There appears to be a relationship between `education` and `income`.

Load the `stats` library. (R attaches `stats` by default, so this call is optional.)

#### <font color="green">Solution...</font>

In [None]:
library(stats)

Using the `lm()` function from the `stats` package, fit a linear regression model that relates the independent variables to the target variable `prestige`.

#### <font color="green">Solution...</font>

In [None]:
prestige_model <- lm(prestige ~ ., data=Prestige) 

# Same as: prestige_model <- lm(prestige ~ education + women + income + type + census, data=Prestige)
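To check that the `.` shorthand and the explicit formula describe the same model, the coefficients of the two fits can be compared. The sketch below also compares the number of rows in the data set with the number actually used in the fit, since `lm()` omits rows with missing values by default. It assumes the `car` package is installed; `model_dot` and `model_explicit` are illustrative names.

```r
library(car)  # attaches carData, which provides the Prestige data set

model_dot      <- lm(prestige ~ ., data = Prestige)
model_explicit <- lm(prestige ~ education + income + women + census + type,
                     data = Prestige)

# Both formulas expand to the same terms, so the coefficients should agree
print(all.equal(coef(model_dot), coef(model_explicit)))

# Rows with a missing value (here, in `type`) are dropped before fitting
print(nrow(Prestige))   # rows in the data set
print(nobs(model_dot))  # rows actually used in the fit
```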

View the model to see the estimated coefficients.

#### <font color="green">Solution...</font>

In [None]:
prestige_model
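The estimated coefficients are what the model uses to turn predictor values into a predicted prestige score. As a minimal sketch of that step, the example below calls `predict()` on a small data frame. There is no separate set of unseen occupations in this exercise, so `new_occupations` is a stand-in built from three existing rows with the target column removed; it assumes the `car` package is installed.

```r
library(car)  # attaches carData, which provides the Prestige data set

prestige_model <- lm(prestige ~ ., data = Prestige)

# Stand-in for "new" data: reuse three existing rows and drop the target column
new_occupations <- Prestige[1:3, setdiff(names(Prestige), "prestige")]

predicted <- predict(prestige_model, newdata = new_occupations)

# Compare the known prestige scores with the model's predictions
print(cbind(actual = Prestige$prestige[1:3], predicted = predicted))
```

In practice, `newdata` would hold occupations whose prestige score is unknown, with the same predictor columns as the training data.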

Evaluate the model to see how well the model fits the data.

#### <font color="green">Solution...</font>

In [None]:
summary(prestige_model)

What was the maximum error in our predictions (the maximum residual)?

#### <font color="green">Solution...</font>

19.2402

What range of errors does the interquartile range (IQR) of the residuals cover?

#### <font color="green">Solution...</font>

50% of predictions were between 4.98 points over and 4.87 points under the true value.
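The residual figures reported by `summary()` can also be computed directly from the fitted model, which helps when reading them: a residual is the observed value minus the fitted value, so a negative residual means the model predicted too high. The sketch below assumes the `car` package is installed; `model_residuals` and `residual_quartiles` are illustrative names.

```r
library(car)  # attaches carData, which provides the Prestige data set

prestige_model  <- lm(prestige ~ ., data = Prestige)
model_residuals <- residuals(prestige_model)  # observed minus fitted values

# Largest positive residual (the model's biggest under-prediction)
print(max(model_residuals))

# First and third quartiles bound the middle 50% of the residuals
residual_quartiles <- quantile(model_residuals, c(0.25, 0.75))
print(residual_quartiles)

# Width of that middle 50%
print(IQR(model_residuals))
```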

Does the model have statistically significant variables?

#### <font color="green">Solution...</font>

Yes
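To see which terms are significant without reading the printed summary by eye, the coefficient table can be filtered by p-value. This is a sketch under two assumptions: the `car` package is installed, and the conventional 0.05 threshold is the cutoff of interest (the exercise does not specify one). `coef_table` and `significant` are illustrative names.

```r
library(car)  # attaches carData, which provides the Prestige data set

prestige_model <- lm(prestige ~ ., data = Prestige)

# Matrix with one row per term: estimate, standard error, t value, p-value
coef_table <- coef(summary(prestige_model))

# Keep terms whose p-value is below the assumed 0.05 threshold
significant <- coef_table[coef_table[, "Pr(>|t|)"] < 0.05, , drop = FALSE]
print(significant)
```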

How much of the variation in the dependent variable is explained by the model (Multiple $R^{2}$)?

#### <font color="green">Solution...</font>

83%
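The Multiple $R^{2}$ value can be read from the summary object, and it equals one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it both ways as a check. It assumes the `car` package is installed; `model_summary`, `observed`, `rss`, `tss`, and `manual_r_squared` are illustrative names.

```r
library(car)  # attaches carData, which provides the Prestige data set

prestige_model <- lm(prestige ~ ., data = Prestige)
model_summary  <- summary(prestige_model)

print(model_summary$r.squared)      # Multiple R-squared
print(model_summary$adj.r.squared)  # Adjusted R-squared

# Manual computation on the rows used in the fit: 1 - RSS / TSS
observed <- model.response(model.frame(prestige_model))
rss <- sum(residuals(prestige_model)^2)
tss <- sum((observed - mean(observed))^2)
manual_r_squared <- 1 - rss / tss
print(manual_r_squared)
```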

## Congratulations!

You have successfully performed regression analysis on structured data using R.