# MATH 3375 Examples Notebook #4
# Data Preparation

We will look at three aspects of data preparation using the car data set that you are acquainted with from Project 1.


In [None]:

car_data <- read.csv("cars2004.csv", stringsAsFactors=TRUE)
head(car_data)
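# Convert Length and Width to integers. Going through as.character() matters
# when a column was read in as a factor; non-numeric entries become NA.
car_data$Length <- as.integer(as.character(car_data$Length))
car_data$Width <- as.integer(as.character(car_data$Width))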
tail(car_data)


## 1. Transformations

Transformations can help when

* The relationship between predictor and response is not linear
* The scale of one variable (relative to the other) makes the plot difficult to interpret

### Example: City MPG and Engine Size

Examine the plot of City MPG by Engine Size.

In [None]:
plot(City.MPG ~ EngineSize, data=car_data, main="City MPG by Engine Size")

##### What do we notice?

There is clearly a relationship, but it is not linear.  

#### Log Transformation
A very common remedy for this is a logarithmic transformation. We take the log of the response variable.

In [None]:
plot(log(City.MPG) ~ EngineSize, data=car_data, main="Log(City MPG) by Engine Size")

#### Square Root Transformation

Another possible transformation is to take the square root of the response variable.

In [None]:
plot(sqrt(City.MPG) ~ EngineSize, data=car_data, main="Square Root of City MPG by Engine Size")

##### Effect of Transformations

We can see that both transformations improve the linearity of the relationship. We can explore this improvement by comparing the regression models for each variant of the response variable (raw data, log transformation, square root transformation).

In [None]:
mpg_model01 <- lm(City.MPG ~ EngineSize, data=car_data)
summary(mpg_model01)

In [None]:
mpg_model02 <- lm(log(City.MPG) ~ EngineSize, data=car_data)
summary(mpg_model02)

In [None]:
mpg_model03 <- lm(sqrt(City.MPG) ~ EngineSize, data=car_data)
summary(mpg_model03)

##### Observations

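Both transformations yield a higher $R^2$ than the untransformed model, and the log transformation appears to do the best of the three. Because each model measures the response on a different scale, treat this comparison as a rough guide rather than a strict ranking.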
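
To put the three $R^2$ values side by side without reading through each summary, something like the following can be used. This is a minimal sketch that uses R's built-in `mtcars` data as a stand-in (an assumption, so that the cell runs without `cars2004.csv`); `mpg` plays the role of City MPG and `disp` the role of Engine Size.

```r
# Stand-in data (assumption): built-in mtcars, with mpg as the response
# and disp (engine displacement) as the predictor.
fits <- list(
  raw  = lm(mpg ~ disp, data = mtcars),
  log  = lm(log(mpg) ~ disp, data = mtcars),
  sqrt = lm(sqrt(mpg) ~ disp, data = mtcars)
)

# Pull R-squared out of each model summary into one named vector
r2 <- sapply(fits, function(m) summary(m)$r.squared)
round(r2, 3)
```

The same pattern should work with `mpg_model01`, `mpg_model02`, and `mpg_model03` in place of the stand-in models.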

##### Visualize Linear Model with Data
It is also useful to view the plot of each model with the corresponding scatter plot.

In [None]:
plot(City.MPG ~ EngineSize, data=car_data, main="City MPG by Engine Size")
abline(mpg_model01,lwd=2,col="blue")

In [None]:
plot(log(City.MPG) ~ EngineSize, data=car_data, main="Log(City MPG) by Engine Size")
abline(mpg_model02,lwd=2,col="blue")

In [None]:
plot(sqrt(City.MPG) ~ EngineSize, data=car_data, main="Square Root of City MPG by Engine Size")
abline(mpg_model03,lwd=2,col="blue")
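
A fitted line on the transformed scale can also be drawn back on the original scale, where it appears as a curve. The sketch below again uses the built-in `mtcars` data as a stand-in (an assumption); the names `log_fit` and `grid` are illustrative only. Note that exponentiating a prediction of log(MPG) gives an approximate typical MPG value, not an exact mean.

```r
# Stand-in data (assumption): mtcars, with mpg ~ disp
log_fit <- lm(log(mpg) ~ disp, data = mtcars)

# Evenly spaced predictor values across the observed range
grid <- data.frame(disp = seq(min(mtcars$disp), max(mtcars$disp), length.out = 100))

# Predict on the log scale, then undo the log with exp()
grid$mpg_hat <- exp(predict(log_fit, newdata = grid))

# Raw-scale scatter plot with the back-transformed fit drawn as a curve
plot(mpg ~ disp, data = mtcars, main = "MPG by Displacement with Back-Transformed Log Fit")
lines(grid$disp, grid$mpg_hat, lwd = 2, col = "blue")
```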

##### Diagnostic Plots

Finally, diagnostic plots can show how transformations improve the model fit.

In [None]:
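# Three models with two plots each: use a 3 x 2 grid so each row shows one model
par(mfrow=c(3,2))
plot(mpg_model01, which=1:2)  # raw response
plot(mpg_model02, which=1:2)  # log response
plot(mpg_model03, which=1:2)  # square root response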


## 2. Scaling and Standardizing Variables

Scaling variables is advisable to ensure that the scale and range of different variables are comparable.  Consider the following example, where the Weight and Engine Size of a car are measured on very different scales.  In particular, pay attention to the model coefficients for each regression model.

In [None]:
hist(car_data$Weight)
hist(car_data$EngineSize)
plot(EngineSize ~ Weight, data = car_data)

In [None]:
eng_model01 <- lm(EngineSize ~ Weight, data=car_data)
summary(eng_model01)

### Common Scaling

With common scaling (also known as min-max scaling), all the values in a given column are rescaled so that they fall between 0 and 1: the minimum maps to 0 and the maximum maps to 1.

In the example below, both variables are scaled with common scaling, and another regression model is created. What has changed? What has remained the same?

In [None]:
max_Weight <- max(car_data$Weight)
min_Weight <- min(car_data$Weight)
car_data$Weight_scaled <- (car_data$Weight - min_Weight)/(max_Weight - min_Weight)

max_EngineSize <- max(car_data$EngineSize)
min_EngineSize <- min(car_data$EngineSize)
car_data$EngineSize_scaled <- (car_data$EngineSize - min_EngineSize)/(max_EngineSize - min_EngineSize)

hist(car_data$Weight_scaled)
hist(car_data$EngineSize_scaled)
plot(EngineSize_scaled ~ Weight_scaled, data = car_data)

In [None]:
eng_model02 <- lm(EngineSize_scaled ~ Weight_scaled, data=car_data)
summary(eng_model02)
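
The scaling steps above repeat the same arithmetic for each variable, so they can be wrapped in a small helper. This is a sketch only: `min_max_scale` is a hypothetical name, not a base R function, and the built-in `mtcars` data is used as a stand-in (an assumption).

```r
# Hypothetical helper: rescale a numeric vector to the interval [0, 1].
# na.rm = TRUE lets the minimum and maximum ignore missing values;
# missing entries stay NA in the result.
min_max_scale <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Stand-in data (assumption): car weights from mtcars
wt_scaled <- min_max_scale(mtcars$wt)
range(wt_scaled)

# A vector with a missing value keeps the NA in place
min_max_scale(c(1, NA, 3))
```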

### Standardizing

With standardizing, the values are transformed so that they have a distribution with mean 0 and standard deviation 1. If the original distribution is normal, this transformation results in the **_standard normal distribution_**.

This is effectively computing the **_z-score_** of each value.

In the example below, both variables are standardized, and a third regression model is created. What has changed? What has remained the same?

In [None]:
mean_Weight <- mean(car_data$Weight)
sd_Weight <- sd(car_data$Weight)
car_data$Weight_std <- (car_data$Weight - mean_Weight)/sd_Weight

mean_EngineSize <- mean(car_data$EngineSize)
sd_EngineSize <- sd(car_data$EngineSize)
car_data$EngineSize_std <- (car_data$EngineSize - mean_EngineSize)/sd_EngineSize

hist(car_data$Weight_std)
hist(car_data$EngineSize_std)
plot(EngineSize_std ~ Weight_std, data = car_data)

In [None]:
eng_model03 <- lm(EngineSize_std ~ Weight_std, data=car_data)
summary(eng_model03)
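
One way to check your answers to the questions above is to compare the models numerically. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption) and the base R function `scale()`, which by default subtracts the mean and divides by the sample standard deviation.

```r
# Stand-in data (assumption): mtcars weight as predictor, displacement as response
x <- mtcars$wt
y <- mtcars$disp

# scale() returns a one-column matrix; as.numeric() turns it back into a vector
x_std <- as.numeric(scale(x))
y_std <- as.numeric(scale(y))

# The built-in result matches the manual z-score computation
all.equal(x_std, (x - mean(x)) / sd(x))

fit_raw <- lm(y ~ x)
fit_std <- lm(y_std ~ x_std)

# R-squared is unchanged by standardizing ...
summary(fit_raw)$r.squared
summary(fit_std)$r.squared

# ... while the slope changes; with both variables standardized,
# the slope equals the correlation between the original variables
coef(fit_std)[["x_std"]]
cor(x, y)
```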

## 3. Missing Data 

Recall that two of the variables in this data set have missing data.  One way to handle this is to omit these data points from the data set altogether.

By default, when plotting data or creating a regression model, R skips any row where a value is missing in the response variable or a predictor, so the model is fit using only the complete rows.

In [None]:
weight_model01 <- lm(Weight ~ Length, data=car_data)
summary(weight_model01)

plot(Weight ~ Length, data=car_data)
abline(weight_model01, lwd=2, col="blue")
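
To see how many rows are affected, count the missing values per column and compare with the number of observations the model actually used. This is a minimal sketch on a small made-up data frame (the values in `toy` are invented for illustration, not taken from the car data).

```r
# Made-up illustration data (assumption): two missing Length values
toy <- data.frame(
  Length = c(180, NA, 175, 190, NA, 200, 185, 170),
  Weight = c(3200, 3500, 3000, 3600, 2900, 4100, 3400, 2800)
)

# Number of missing values in each column
colSums(is.na(toy))

# Number of rows with no missing values at all
sum(complete.cases(toy))

# lm() drops the incomplete rows; nobs() reports how many rows were used
fit <- lm(Weight ~ Length, data = toy)
nobs(fit)
```

Applied to the car data, `nobs(weight_model01)` compared with `nrow(car_data)` would show how many rows were skipped.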

### Imputation

Another way to handle missing values is to **_impute_** the missing values based on the _other_ values of the variable in the data set. The simplest imputation is to use the **median** of the other data values (or, if the variable is categorical, the **mode**).

Notice the distributions of the Length variable before and after imputation:

In [None]:
hist(car_data$Length, main="Car Length Distribution - Before Imputation")

car_data$Length[is.na(car_data$Length)] <- median(car_data$Length,na.rm=TRUE)
hist(car_data$Length, main="Car Length Distribution - After Imputation")
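
For a categorical variable, the mode can be imputed in a similar way. Base R has no built-in function for the statistical mode, so the sketch below defines a hypothetical helper, `stat_mode`; the `drive` vector is made-up illustration data, not a column from the car data set.

```r
# Hypothetical helper: most frequent non-missing value.
# table() ignores NA by default; ties go to the first level in table order.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Made-up categorical data (assumption) with two missing entries
drive <- factor(c("FWD", "RWD", NA, "FWD", "AWD", NA, "FWD"))

# Replace the missing entries with the most frequent category
drive[is.na(drive)] <- stat_mode(drive)
table(drive)
```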

#### Effect of Imputation on the Model

Now that there are no missing values for the Length variable, create a new regression model.  How is it different from the first model?

In [None]:
weight_model02 <- lm(Weight ~ Length, data=car_data)
summary(weight_model02)

plot(Weight ~ Length, data=car_data)
abline(weight_model02, lwd=2, col="blue")

#### Final Notes

There are many more sophisticated methods of imputation, including:

* Stratified Imputation - Divide the data set into groups based on values of the _response_ variable (e.g., quartiles, from the lowest 25% of values to the highest 25%). Within each group, assign the group's median of the variable being imputed to the missing values in that group.
* Model-Based Imputation - Create a linear model using other features in the data set to predict the variable with missing values.
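
Both methods in the list above can be sketched in a few lines. The example below runs on simulated data; every name in it (`toy`, `length_obs`, `weight_group`, and so on) is hypothetical, and the simulated relationships are assumptions chosen only so that the code runs on its own.

```r
set.seed(42)

# Simulated illustration data (assumption): width, length, and weight of 40 cars
n <- 40
toy <- data.frame(width = round(rnorm(n, mean = 71, sd = 3)))
toy$length <- round(2.5 * toy$width + rnorm(n, mean = 8, sd = 5))
toy$weight <- round(30 * toy$length - 2000 + rnorm(n, sd = 200))

# Hide three length values to play the role of missing data
toy$length_obs <- toy$length
toy$length_obs[c(3, 17, 29)] <- NA
is_missing <- is.na(toy$length_obs)

# Stratified imputation: four equal-sized groups by rank of the response
# (weight), then the median observed length within each group
toy$weight_group <- ceiling(4 * rank(toy$weight, ties.method = "first") / n)
group_median <- ave(toy$length_obs, toy$weight_group,
                    FUN = function(x) median(x, na.rm = TRUE))
toy$length_strat <- ifelse(is_missing, group_median, toy$length_obs)

# Model-based imputation: predict length from another feature (width),
# fitting on the rows where length is observed
fit <- lm(length_obs ~ width, data = toy)
toy$length_model <- toy$length_obs
toy$length_model[is_missing] <- predict(fit, newdata = toy[is_missing, ])

# Compare the hidden true values with the two imputed versions
toy[is_missing, c("length", "length_strat", "length_model")]
```

Because the true values are known in this simulation, the last line gives a rough sense of how close each method comes; with real missing data no such check is available.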


### Suggestion 

Use one or more code cells below to practice the steps above by exploring other possible regression models.