# MATH 3375 Examples Notebook #11

# Principal Components Analysis (PCA)

We continue using the 2004 cars data set to examine Principal Components Analysis. 


In [None]:
#Look at data set
car_data <- read.csv("cars2004.csv", stringsAsFactors=TRUE)
head(car_data,3)

## Prepare Data

Because PCA can only be used on quantitative data, we remove the categorical features from the data set. For this example, we also remove variables that have missing data. To make our results easier to visualize, we first use the categorical features to give each row a unique name.

In [None]:
#Use the three categorical columns to give each row a unique name
rownames(car_data) <- paste(car_data[,1],car_data[,2],car_data[,3])
#Drop non-numeric columns (1-3) and columns with missing data (13, 14)
car_data <- car_data[,-c(1:3,13,14)]
head(car_data,3)
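
Dropping columns by position works for this file, but the same selection can be made by testing each column. The sketch below uses a small made-up data frame (the names and values in `toy` are hypothetical, not taken from cars2004.csv) to show one way to keep only the numeric columns that have no missing values.

```r
# Hypothetical toy data, used only to illustrate the column selection
toy <- data.frame(
  Make  = c("A", "B", "C", "D"),
  Model = c("w", "x", "y", "z"),
  MSRP  = c(20000, 25000, 31000, 18000),
  HP    = c(150, 200, 240, 130),
  MPG   = c(30, NA, 22, 33),
  stringsAsFactors = TRUE
)

# Name each row from the categorical columns
rownames(toy) <- paste(toy$Make, toy$Model)

# Flag columns that are numeric and columns that have no missing values
is_numeric  <- sapply(toy, is.numeric)
is_complete <- colSums(is.na(toy)) == 0

# Keep only the columns that satisfy both conditions
toy_numeric <- toy[, is_numeric & is_complete, drop = FALSE]
toy_numeric
```

Here `MPG` is dropped because of its missing value, and `Make` and `Model` are dropped because they are factors.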

## Implementation of PCA with Base R Library

Below we show a basic implementation of Principal Components Analysis using functions in the Base R library.

Note the following:

* **_Because we plan to predict MSRP using the principal components,_** MSRP has been omitted from the feature set for the analysis.
* The data should be scaled for PCA, and the **prcomp** function will do this if we set the **scale** parameter to TRUE.
* Other R libraries are available with more features and different variants of implementation.

In [None]:
pca_results <- prcomp(car_data[,-1],scale=TRUE)
head(pca_results$x)
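
To see what scaling does, the sketch below compares three routes to the same result. It uses the built-in `mtcars` data as a stand-in (an assumption made so the example runs without cars2004.csv), and the four columns chosen are arbitrary. Note that the formal argument name in **prcomp** is `scale.`; writing `scale=TRUE` works because R matches the partial name.

```r
# Stand-in data: built-in mtcars, NOT the cars2004 data
X <- mtcars[, c("disp", "hp", "wt", "qsec")]

# PCA with standardization handled by prcomp
pca_scaled <- prcomp(X, scale. = TRUE)

# Same analysis after standardizing the columns by hand
pca_manual <- prcomp(scale(X), scale. = FALSE)

# The component standard deviations agree
all.equal(pca_scaled$sdev, pca_manual$sdev)

# With scaling, the component variances are the eigenvalues
# of the correlation matrix of the original features
eig <- eigen(cor(X))$values
all.equal(pca_scaled$sdev^2, eig)
```

Without scaling, features measured in large units (such as `disp`) would dominate the first component simply because their variances are numerically larger.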

### Principal Components - New "Features"

The output above shows how PCA has created new columns of 'features', PC1 through PC8. By construction, PC1 captures the largest share of the variance in the original features, PC2 the second largest, and so on. The components are also mutually uncorrelated, so there is no collinearity among them. We can visualize that with a scatterplot matrix of the components.

In [None]:
plot(data.frame(pca_results$x))
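
The lack of correlation can also be checked numerically. The sketch below again uses the built-in `mtcars` data as a stand-in (an assumption, not the cars2004 data) and compares the correlation matrix of the component scores with that of the original features.

```r
# Stand-in data: built-in mtcars, NOT the cars2004 data
X <- mtcars[, c("disp", "hp", "wt", "qsec")]
pca <- prcomp(X, scale. = TRUE)

# Correlations among the component scores: identity matrix up to rounding
score_cor <- cor(pca$x)
round(score_cor, 10)

# Largest off-diagonal correlation in absolute value (numerically zero)
max_off_diag <- max(abs(score_cor[upper.tri(score_cor)]))
max_off_diag

# For contrast, the original features are clearly correlated
round(cor(X), 2)
```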

### Variable Loadings in the Principal Components

The _rotation_ matrix allows us to see the **_loading_** of each original variable in each of the principal components. This helps us to see the composition of each component numerically.

In [None]:
pca_results$rotation
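
The loadings are the weights that turn the standardized features into component scores. As a minimal sketch, using the built-in `mtcars` data as a stand-in (an assumption, not the cars2004 data), we can rebuild the scores by hand from the rotation matrix.

```r
# Stand-in data: built-in mtcars, NOT the cars2004 data
X <- mtcars[, c("disp", "hp", "wt", "qsec")]
pca <- prcomp(X, scale. = TRUE)

# Each column of the rotation matrix is a unit-length vector of loadings
loading_lengths <- colSums(pca$rotation^2)
loading_lengths

# Scores = standardized data multiplied by the rotation matrix
scores_by_hand <- scale(X) %*% pca$rotation

# The hand-computed scores match the scores returned by prcomp
all.equal(unname(scores_by_hand), unname(pca$x))
```

Each score is therefore a weighted sum of the standardized features, with the weights read down one column of the rotation matrix. The sign of a whole column is arbitrary, so only relative signs within a component are meaningful.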

### Variance Explained by Each Component

Of all the variance present in all features combined, we can see the proportion explained by each individual component. Note that this does NOT include the proportion of variance in MSRP, which is the response variable, so it is not a direct link to $R^2$ in the regression models we will create later.

In [None]:
#Proportion of total feature variance explained by each component
pca_var <- data.frame(prop_var = pca_results$sdev^2 / sum(pca_results$sdev^2))
rownames(pca_var) <- colnames(pca_results$x)
pca_var
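
It is often useful to look at the cumulative proportion as well, since that shows how many components are needed to reach a given share of the total feature variance. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption, not the cars2004 data).

```r
# Stand-in data: built-in mtcars, NOT the cars2004 data
X <- mtcars[, c("disp", "hp", "wt", "qsec")]
pca <- prcomp(X, scale. = TRUE)

# Proportion of total feature variance per component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# Individual and cumulative proportions in one table
var_table <- data.frame(
  proportion = prop_var,
  cumulative = cumsum(prop_var),
  row.names  = colnames(pca$x)
)
var_table

# summary() reports the same quantities, rounded
summary(pca)$importance
```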



### Regression Models with Principal Components as Predictors

Recall that PC1 captures the largest share of the variability across all the original features combined; PC2 captures the second largest, and so on.

Below, we look at several models, starting with one that only uses PC1 as a predictor, then adding components, _**in order**_. Notice how each component affects the model when it is added.


In [None]:
MSRP_pca_model1 <- lm(car_data$MSRP ~ pca_results$x[,'PC1'])
summary(MSRP_pca_model1)

In [None]:
MSRP_pca_model2 <- lm(car_data$MSRP ~ pca_results$x[,1:2])
summary(MSRP_pca_model2)

In [None]:
MSRP_pca_model3 <- lm(car_data$MSRP ~ pca_results$x[,1:3])
summary(MSRP_pca_model3)

In [None]:
MSRP_pca_model4 <- lm(car_data$MSRP ~ pca_results$x[,1:4])
summary(MSRP_pca_model4)

In [None]:
MSRP_pca_model_full <- lm(car_data$MSRP ~ pca_results$x)
summary(MSRP_pca_model_full)
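
The sequence of models above can also be fit in a loop. The sketch below uses the built-in `mtcars` data as a stand-in, with `mpg` playing the role of the response (both are assumptions made so the example runs without cars2004.csv). It collects $R^2$ for each model and shows that the coefficient on PC1 does not change as components are added, which follows from the components being uncorrelated.

```r
# Stand-in data: built-in mtcars, NOT the cars2004 data
X <- mtcars[, c("disp", "hp", "wt", "qsec")]
y <- mtcars$mpg
pca <- prcomp(X, scale. = TRUE)
scores <- data.frame(y = y, pca$x)

# Fit y ~ PC1, then y ~ PC1 + PC2, and so on, adding components in order
fits <- lapply(seq_len(ncol(pca$x)), function(k) {
  lm(reformulate(paste0("PC", 1:k), response = "y"), data = scores)
})

# R-squared can only stay the same or increase as components are added
r_squared <- sapply(fits, function(fit) summary(fit)$r.squared)
names(r_squared) <- paste(seq_along(fits), "component(s)")
r_squared

# The PC1 coefficient is the same in every model
pc1_coef <- sapply(fits, function(fit) unname(coef(fit)["PC1"]))
pc1_coef

# Using all components gives the same fit as using the original features
full_features <- lm(y ~ ., data = data.frame(y = y, X))
all.equal(unname(r_squared[length(r_squared)]),
          summary(full_features)$r.squared)
```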

### Comparing the Regression Models

The first 2 principal components explain 90% of the variance in MSRP, and the first 4 components explain almost 97%. Even though they are statistically significant, the last 4 components could be dropped to achieve simplicity without sacrificing much explanatory power in the model.
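
One way to weigh simplicity against explanatory power is to compare a reduced model with the full model directly. The sketch below does this with a nested F-test and adjusted $R^2$, using the built-in `mtcars` data as a stand-in with `mpg` as the response (both assumptions, not the cars2004 data), so the numbers it prints say nothing about MSRP.

```r
# Stand-in data: built-in mtcars, NOT the cars2004 data
X <- mtcars[, c("disp", "hp", "wt", "qsec")]
scores <- data.frame(y = mtcars$mpg, prcomp(X, scale. = TRUE)$x)

# Reduced model (first two components) versus the full model
reduced <- lm(y ~ PC1 + PC2, data = scores)
full    <- lm(y ~ PC1 + PC2 + PC3 + PC4, data = scores)

# Nested F-test: do the dropped components add significant explanatory power?
model_comparison <- anova(reduced, full)
model_comparison

# Adjusted R-squared penalizes the extra predictors
adj_r2 <- c(reduced = summary(reduced)$adj.r.squared,
            full    = summary(full)$adj.r.squared)
adj_r2
```

A significant F-test together with only a small gain in adjusted $R^2$ is the situation described above: the extra components are detectable but contribute little.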