## Basic EDA

In this notebook, we perform basic exploratory data analysis (EDA) on our dataset. This mostly consists of plotting the distributions of the numerical features. We also begin to explore plotting the distribution of a numerical feature separately for each level of a categorical feature.

In [None]:
source('src/load_data-02.r')
source('src/multiplot.r')

In [None]:
dim(housing_df)

In [None]:
head(housing_df)

In [None]:
count_empty_total()

In [None]:
str(Filter(is.numeric, housing_df))

In [None]:
colnames(Filter(is.numeric, housing_df))

In [None]:
attach(housing_df)

In [None]:
library(ggplot2)

### Histogram of Target Feature

Here, we display a histogram of the target feature `SalePrice`, overlaid with a kernel density estimate (KDE) and with the mean and median plotted as vertical lines. The mean is greater than the median, which indicates a right (positive) skew; this is common for strictly non-negative data.

In [None]:
# Histogram of a numeric vector with a KDE overlay and dashed lines at the mean and median.
hist_with_kde <- function (feature) {
    plot <- qplot(feature, geom="histogram", bins=200, alpha=I(.4), y = ..density..)+
        # na.rm=TRUE so that features with missing values still get their reference lines
        geom_vline(aes(xintercept=mean(feature, na.rm=TRUE), color="mean"), linetype="dashed", size=1, show.legend=TRUE)+
        geom_vline(aes(xintercept=median(feature, na.rm=TRUE), color="median"), linetype="dashed", size=1, show.legend=TRUE)+
        geom_density()+
        scale_color_manual("Line.Color", values=c(median="red",mean="blue"))
    return(plot)
}

In [None]:
hist_with_kde(SalePrice)
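
The link between skew and the ordering of the mean and median can also be checked numerically. The following is a minimal sketch on simulated log-normal data (an assumption made purely for illustration; it is not the housing data, and the `toy_*` names are hypothetical). It shows a long right tail pulling the mean above the median.

```r
# Simulated right-skewed sample (illustrative only, not the housing data).
set.seed(42)
toy_price <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

toy_mean   <- mean(toy_price, na.rm = TRUE)
toy_median <- median(toy_price, na.rm = TRUE)

# Moment-based sample skewness: positive values indicate a right skew.
toy_skew <- mean((toy_price - toy_mean)^3) / sd(toy_price)^3

c(mean = toy_mean, median = toy_median, skewness = toy_skew)
```

In the notebook itself, the same three quantities could be computed by substituting `SalePrice` for `toy_price`.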

#### Plot some Histograms with KDE Plots for other Numerical Features

Next we plot histograms with KDE plots for some of the other numerical features in our dataset.

In [None]:
colnames(Filter(is.numeric, housing_df))

We use the helper function `multiplot`, defined in `src/multiplot.r`, to arrange several plots side by side.

In [None]:
library(repr)
options(repr.plot.width=20, repr.plot.height=4)

In [None]:
multiplot(hist_with_kde(LotFrontage),
          hist_with_kde(LotArea), 
          hist_with_kde(FirstFlrSF),
          hist_with_kde(SecondFlrSF),
          cols = 4)
multiplot(hist_with_kde(PoolArea),
          hist_with_kde(YrSold), 
          hist_with_kde(GarageArea),
          hist_with_kde(LowQualFinSF),
          cols = 4)


## Correlation

Assessing correlation in a dataset with mixed numerical and categorical features can be challenging. One approach is to prepare a series of distribution plots for a single numerical feature, where each plot shows the values of that feature for one level of a categorical feature.

Here is a list of our categorical features:

| | | | |
|:-:|:-:|:-:|:-:|
| `Alley`         | `ExterCond`     | `GarageType`    | `MSSubClass`    |        
| `BedroomAbvGr`  | `Exterior1st`   | `HalfBath`      | `MSZoning`      |                         
| `BldgType`      | `Exterior2nd`   | `Heating`       | `Neighborhood`  |                          
| `BsmtCond`      | `ExterQual`     | `HeatingQC`     | `OverallCond`   |                         
| `BsmtExposure`  | `Fence`         | `HouseStyle`    | `OverallQual`   |                             
| `BsmtFinType1`  | `FireplaceQu`   | `KitchenAbvGr`  | `PavedDrive`    |                           
| `BsmtFinType2`  | `Fireplaces`    | `KitchenQual`   | `PoolQC`        |                      
| `BsmtFullBath`  | `Foundation`    | `LandContour`   | `RoofMatl`      |                         
| `BsmtHalfBath`  | `FullBath`      | `LandSlope`     | `RoofStyle`     |                          
| `BsmtQual`      | `Functional`    | `LotConfig`     | `SaleCondition` |                          
| `CentralAir`    | `GarageCars`    | `LotShape`      | `SaleType`      |                      
| `Condition1`    | `GarageCond`    | `MasVnrType`    | `Street`        |                      
| `Condition2`    | `GarageFinish`  | `MiscFeature`   | `TotRmsAbvGrd`  |                           
| `Electrical`    | `GarageQual`    | `MoSold`        | `Utilities`     |    

We can begin by looking at the distribution of `SalePrice` disaggregated by any one of these categorical features.

In [None]:
hist_with_kde_numerical_by_category(SalePrice, BldgType)
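
The helper `hist_with_kde_numerical_by_category` is not defined in this section; it is presumably provided by one of the sourced files. As a rough idea of what such a helper might look like, here is a minimal sketch under that assumption. The name `density_by_category_sketch` and the toy data are hypothetical, and the real helper may differ in its details.

```r
library(ggplot2)

# Hypothetical stand-in for the notebook's helper; the real definition may differ.
# Draws one semi-transparent histogram and one KDE curve per level of `category`.
density_by_category_sketch <- function(feature, category) {
    df <- data.frame(feature = feature, category = as.factor(category))
    ggplot(df, aes(x = feature, fill = category, colour = category)) +
        geom_histogram(aes(y = after_stat(density)), bins = 50,
                       alpha = 0.3, position = "identity") +
        geom_density(alpha = 0.2)
}

# Toy data, made up for illustration: two groups with different centres.
set.seed(1)
toy_df <- data.frame(
    price = c(rnorm(100, 150000, 20000), rnorm(100, 220000, 30000)),
    type  = rep(c("A", "B"), each = 100)
)
p <- density_by_category_sketch(toy_df$price, toy_df$type)
```

Using `position = "identity"` overlays the per-level histograms instead of stacking them, so each level's shape can be compared directly against the others.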

It may even make sense to treat one of the numerical features as a categorical feature, for example, `YrSold`.

In [None]:
hist_with_kde_numerical_by_category(SalePrice, as.factor(YrSold))

This plot suggests that, for this dataset, the year of the sale has almost no effect on `SalePrice`: its distribution is nearly identical across all five years in the dataset.
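
A visual impression like this can be backed up with a per-group numerical summary. The sketch below uses made-up toy data (the `toy_sales` frame and its year labels are placeholders, not the housing data) and computes the quartiles of the price within each year. Rows that look alike suggest that the distributions are similar.

```r
# Toy data (made up): the same price distribution is drawn for every year.
set.seed(7)
toy_sales <- data.frame(
    SalePrice = rlnorm(500, meanlog = 12, sdlog = 0.4),
    YrSold    = sample(2006:2010, 500, replace = TRUE)
)

# Quartiles of SalePrice within each year: one row per year, one column per quartile.
quartiles_by_year <- t(sapply(
    split(toy_sales$SalePrice, toy_sales$YrSold),
    quantile, probs = c(0.25, 0.5, 0.75)
))
quartiles_by_year
```

In the notebook, the same summary could be obtained by passing `housing_df$SalePrice` and `housing_df$YrSold` to `split`.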

In [None]:
multiplot(hist_with_kde_numerical_by_category(SalePrice,HouseStyle),
          hist_with_kde_numerical_by_category(SalePrice,ExterQual), 
          hist_with_kde_numerical_by_category(SalePrice,Street),
          hist_with_kde_numerical_by_category(SalePrice,MoSold),
          cols = 4)

Here, we see that the distribution of `SalePrice` differs across the levels of `HouseStyle`, `ExterQual`, and `Street`, so these features appear to have some impact on `SalePrice`, while `MoSold` does not.

Another way to analyze the influence of a categorical feature is to create a scatter plot of two numerical features, colored by the categorical feature.

In [None]:
ggplot(housing_df)+
   geom_point(aes(x=GrLivArea,y=SalePrice,colour=MSSubClass))