## Redundancy and Correlation

In preparation for a principal component analysis, we look at redundancy and correlation in our dataset. In this analysis, we will focus on the numeric features.

In [None]:
source('src/load_data-02.r')
source('src/multiplot.r')

In [None]:
dim(housing_df)

In [None]:
head(housing_df)

In [None]:
count_empty_total()

In [None]:
numeric_features = colnames(Filter(is.numeric, housing_df))

In [None]:
numeric_df = Filter(is.numeric, housing_df)
numeric_df$SalePrice <- NULL
numeric_features = colnames(numeric_df)

In [None]:
attach(numeric_df)

In [None]:
install.packages('rpart')

In [None]:
library(caret)
library(rpart)

### Redundancy

Here, we use machine learning to assess redundancy in our dataset. We iterate through each numeric feature. For each feature, we drop it from our input feature set $X$ and use it as our target $y$ for training a supervised regression model. We use the formula shortcut for training on all remaining columns, `~.`, as our regression formula:

    this_formula = paste(feature,"~.")
    fit <- rpart(data=train, formula=as.formula(this_formula))
    
In other words, we train a regression model that uses the remaining features to predict each individual feature, which gives us an $R^2$ score for each numeric feature. Note that the `rpart` function comes from the `rpart` package, which we load alongside `caret`; it is an implementation of a decision tree (recursive partitioning).

Note that we also follow machine learning best practice and perform a train–test split on our data. Each model is trained on the training data and assessed on the testing data. In this way, each model tells us whether, upon removing a feature, the remaining features are able to predict the removed feature. If they can, we may take the removed feature to be somewhat redundant. It is worth clarifying that this is an exploratory data analysis technique and is not intended, at this time, as a technique for removing features. We simply wish to understand the relationships within our data.
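For reference, the score computed by `calculate_r_2` in the next cell is the usual coefficient of determination. Writing $y_i$ for the held-out values of the removed feature, $\hat{y}_i$ for the tree's predictions, and $\bar{y}$ for the mean of the held-out values (these symbols are introduced here for illustration), it is

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

A score near 1 suggests the remaining features predict the removed feature well. Because the score is computed on test data, it can also be negative, which means the model predicts worse than simply guessing $\bar{y}$.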

In [None]:
calculate_r_2 <- function(actual, prediction) {
    return (1 - (sum((actual-prediction)^2)/sum((actual-mean(actual))^2)))
}

calculate_r_2_for_feature <- function(data, feature) {
    n <- nrow(data)

    # Random 80/20 train-test split
    train_index <- sample(seq_len(n), size = floor(0.8 * n))

    train <- data[train_index,]
    test <- data[-train_index,]

    # Regress the chosen feature on all remaining columns
    this_formula = paste(feature,"~.")
    fit <- rpart(data=train, formula=as.formula(this_formula))

    # Hold out the target, then score predictions on the test set
    y_test <- as.vector(test[[feature]])
    test[feature] <- NULL
    predictions <- predict(fit, test)
    return (calculate_r_2(y_test, predictions))
}

mean_r2_for_feature <- function (data, feature) {
    scores = c()
    for (i in 1:10) {
        scores = c(scores, calculate_r_2_for_feature(data, feature))
    }
    
    return (mean(scores))
}

In [None]:
calculate_r_2_for_feature(numeric_df,'LotFrontage')

In [None]:
for (feature in numeric_features){
    print(paste(feature, mean_r2_for_feature(numeric_df, feature)))
}

### Correlation

Next, we assess correlation between our features. Correlation is a normalized form of covariance, which is itself a measure of linear relationships within data. In the previous section, we used a decision tree to assess redundancy. A decision tree is an information-based (non-linear) analysis. By performing this analysis using two different techniques, one linear and one non-linear, we get a more robust assessment of the underlying relationships in our data. Again, this technique is exploratory data analysis and is not intended, at this time, to remove features from our dataset.
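To make the linear versus non-linear distinction concrete, here is a minimal sketch on synthetic data. The data frame `toy`, its columns, and the seed are made up for illustration and are not part of the housing dataset. It checks that Pearson correlation is covariance scaled by both standard deviations, and then shows a curved relationship that correlation largely misses but a decision tree can still fit. Note that the tree's $R^2$ here is computed in-sample, purely to keep the example short.

```r
library(rpart)

set.seed(1)  # arbitrary seed, only for reproducibility
n <- 500
x <- runif(n, min = -1, max = 1)

# Hypothetical toy data: one linear and one curved function of x, plus noise
toy <- data.frame(
    x = x,
    y_linear = 2 * x + rnorm(n, sd = 0.1),
    y_curved = x^2 + rnorm(n, sd = 0.05)
)

# Pearson correlation is covariance scaled by both standard deviations
r_manual <- cov(toy$x, toy$y_linear) / (sd(toy$x) * sd(toy$y_linear))
print(all.equal(r_manual, cor(toy$x, toy$y_linear)))

# Correlation captures the linear relationship but largely misses the curved one
print(cor(toy$x, toy$y_linear))
print(cor(toy$x, toy$y_curved))

# A decision tree can still predict the curved target from x (in-sample R^2)
fit <- rpart(y_curved ~ x, data = toy)
tree_r2 <- 1 - sum((toy$y_curved - predict(fit))^2) /
    sum((toy$y_curved - mean(toy$y_curved))^2)
print(tree_r2)
```

This is why the two sections complement each other: a feature pair can show little correlation and still be strongly related in a way the tree-based redundancy check picks up.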

In [None]:
options(digits=3)
cor(numeric_df)

In [None]:
library(reshape2)
cormat = cor(numeric_df)

cormat[lower.tri(cormat)] <- NA
diag(cormat) <- NA

melted_cormat <- melt(cormat, na.rm = TRUE)

library(ggplot2)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value)) +
    geom_tile(color = "white") +
    scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                         midpoint = 0, limits = c(-1, 1), space = "Lab",
                         name = "Pearson\nCorrelation") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1,
                                     size = 12, hjust = 1)) +
    coord_fixed()