## Introduction  {-}

In this project, the goal is to develop two models that predict the price of a home using the "Ames" dataset, which consists of 2930 rows (i.e., houses) and 83 columns, where:

- The first column is `PID`, the Parcel identification number;
- The last column is the response variable, `Sale_Price`;
- The remaining 81 columns are explanatory variables describing (almost) every aspect of residential homes.

The performance criterion we are interested in is the RMSE between the natural log of the predicted price and the natural log of the actual price (from the dataset). We split the dataset into ten train/test splits using the provided split IDs; the performance target is `0.125` for the first five splits and `0.135` for the last five.
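
The criterion can be written down in a few lines. The following is a minimal sketch, assuming prices are passed as plain arrays of positive numbers; the function name `log_rmse` is ours for illustration and is not taken from `mymain.py`.

```python
import numpy as np


def log_rmse(predicted_price, actual_price):
    """RMSE between the natural logs of predicted and actual prices."""
    log_pred = np.log(np.asarray(predicted_price, dtype=float))
    log_true = np.log(np.asarray(actual_price, dtype=float))
    # Root of the mean squared difference on the log scale.
    return float(np.sqrt(np.mean((log_pred - log_true) ** 2)))
```

Because the error is measured on the log scale, it penalizes relative rather than absolute price differences, so expensive and inexpensive homes contribute comparably.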


## Pre-processing {-}

For pre-processing, we started with an exploration of the data in conjunction with the [variable description](https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt). Next, we went through the following pre-processing steps.

*Please note that the following pre-processing steps were applied to the training and test data separately, except for the steps whose **test** version depends on quantities computed from the **training** dataset, i.e., Winsorization and dummy variables.*

### Feature Selection {-}

We decided to perform a similar feature selection as Prof. Liang (#419) and dropped variables that we determined to not be interpretable predictors. These variables are listed below:
```python
["Street", "Utilities", "Condition_2", "Roof_Matl", "Heating", "Pool_QC",
 "Misc_Feature", "Low_Qual_Fin_SF", "Pool_Area", "Longitude", "Latitude"]
```

### Missing Values {-}

Looking at the data, as discussed on CampusWire, the only missing values occur in the `Garage_Yr_Blt` feature. Per investigation by Prof. Liang, setting the missing values in that column to 0 does not have any negative effects on the performance of prediction models. As such, we decided to replace these missing values with 0 as suggested.
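
As a minimal sketch of this step, assuming the data is held in a pandas `DataFrame` (the helper name `fill_garage_year` is hypothetical):

```python
import pandas as pd


def fill_garage_year(df):
    """Return a copy of df with missing Garage_Yr_Blt values set to 0."""
    out = df.copy()  # leave the caller's frame untouched
    out["Garage_Yr_Blt"] = out["Garage_Yr_Blt"].fillna(0)
    return out
```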

### Categorical Features {-}

We used the [variable description](https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt) to determine which categorical features were ordered, that is, whether we could convert the category values to numbers while preserving the relationships between them. For example, for `Overall_Qual`, we mapped the category values (e.g., `Very_Poor`, `Poor`, `Fair`) to numerical values (e.g., `0`, `1`, `2`) so that a higher value means a better `Overall_Qual`. We hand-picked these "ordered" categorical features and automated the mapping process with a separate script that used the ordering in the variable description text file to generate the mapping.
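
The mapping idea can be illustrated as follows. This is only a sketch: the helper names are hypothetical, and the level list in the usage note contains just the three `Overall_Qual` levels mentioned above rather than the full ordering used in `mymain.py`.

```python
import pandas as pd


def build_ordinal_mapping(ordered_levels):
    """Map each level to its rank, given levels ordered from worst to best."""
    return {level: rank for rank, level in enumerate(ordered_levels)}


def encode_ordinal(series, ordered_levels):
    """Replace category labels with their ranks.

    Labels that are missing from ordered_levels become NaN, which makes
    an incomplete level list easy to spot.
    """
    return series.map(build_ordinal_mapping(ordered_levels))
```

For example, `encode_ordinal(df["Overall_Qual"], ["Very_Poor", "Poor", "Fair"])` would turn `Fair` into `2` and `Very_Poor` into `0`.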

For the "unordered" categorical features, we used binary dummy variables. Specifically, for the linear regression model, we generated `K-1` binary dummy variables for each categorical variable with `K` levels, and for the tree model, we generated `K` binary dummy variables when `K>2`. In the Ames dataset, the `K>2` condition never came into play, since all categorical variables had at least three levels (i.e., $K\geq3$).
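
One possible way to produce both encodings with pandas is sketched below. It assumes that the set of dummy columns is fixed by the training data and that the test data is aligned to it; the function name and the toy column used in the usage note are hypothetical.

```python
import pandas as pd


def encode_dummies(train, test, columns, drop_first):
    """One-hot encode `columns`, using the training data to fix the columns.

    drop_first=True yields K-1 dummies per feature (linear model);
    drop_first=False yields K dummies per feature (tree model).
    """
    train_enc = pd.get_dummies(train, columns=columns,
                               drop_first=drop_first, dtype=float)
    # Encode every level seen in the test data, then keep exactly the
    # training columns: levels unseen in training are dropped, and
    # training levels absent from the test data are filled with zeros.
    test_enc = pd.get_dummies(test, columns=columns, dtype=float)
    test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0.0)
    return train_enc, test_enc
```

With a toy feature taking the levels `A`, `B`, `C` in training, `drop_first=True` keeps the `B` and `C` indicators, and a test row with an unseen level `D` is encoded as all zeros.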

A complete list of "ordered" and "unordered" features and their corresponding mapping is available in the code (`mymain.py`).

### Numerical Features {-}

#### Scaling {-}

For numerical features, we experimented with a number of different scaling methods. We started with the `StandardScaler` but quickly found that it is susceptible to outliers. We therefore turned to nonlinear scaling, which can be more robust to such outliers, and found that the **Quantile Transformer** performed the best of all the scaling methods we tried. A description of the Quantile Transformer is provided below:
> "The quantile function ranks or smooths out the relationship between observations and can be mapped onto other distributions, such as the uniform or normal distribution."

We made the assumption that the numerical features remaining after feature selection, such as `Lot_Area`, are normally distributed and applied `QuantileTransformer` with a normal output distribution. Doing so resulted in significant improvements in model performance.
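
A minimal sketch of this scaling step with scikit-learn's `QuantileTransformer` follows. The helper name, the `random_state` value, and the choice to cap `n_quantiles` at the number of rows are assumptions of this sketch; it also fits the transformer once and reuses it, which may differ from how `mymain.py` handles the training and test data.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer


def fit_quantile_scaler(values, random_state=0):
    """Fit a quantile transformer that maps each column to a normal shape.

    `values` is a 2-D array (rows = houses, columns = numerical features).
    """
    # n_quantiles may not exceed the number of samples.
    n_quantiles = min(1000, values.shape[0])
    scaler = QuantileTransformer(
        n_quantiles=n_quantiles,
        output_distribution="normal",
        random_state=random_state,
    )
    scaler.fit(values)
    return scaler
```

Calling `scaler.transform(values)` then returns the rank-based, normally shaped version of each column. Because the transform depends only on ranks, a single extreme value cannot stretch the scale the way it does with `StandardScaler`.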

#### Winsorization {-}

We also applied *Winsorization* to a select number of numerical features: values above the 95% quantile, computed on the training data, were replaced with the value at that quantile. The main reason, as Prof. Liang described, is that the contribution of some features, specifically the area-related ones, should be capped. In other words, beyond a certain value, the relationship between an area-related feature and `Sale_Price` becomes flat. As with scaling, we noticed significant improvements in model performance on the test dataset after applying Winsorization.
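
The upper-tail capping described above can be sketched as two small helpers: one learns the caps from the training data and the other applies them to any dataset. The function names are hypothetical, and which columns to cap is left to the caller.

```python
import pandas as pd


def fit_winsor_caps(train, columns, quantile=0.95):
    """Compute the capping value of each column from the training data."""
    return {col: float(train[col].quantile(quantile)) for col in columns}


def apply_winsor_caps(df, caps):
    """Replace values above each column's cap with the cap itself."""
    out = df.copy()
    for col, cap in caps.items():
        out[col] = out[col].clip(upper=cap)
    return out
```

The same `caps` dictionary is applied to both the training and the test data, so the test set is capped using training quantiles only.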


## Implementation {-}


### Linear Regression model {-}

For the linear model, we used linear regression with a ridge penalty, tuned by cross-validation, with the following parameters:

- `alphas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]`: This parameter specifies the array of alpha values to try for regularization during cross-validation. Here, we used ten evenly spaced values from 0.1 to 1.0.
- `scoring = None`: This parameter specifies the objective/loss function. With the chosen (default) value, the model uses the negative mean squared error, since it relies on leave-one-out cross-validation.
- `cv = None`: This parameter determines the cross-validation splitting strategy. With the chosen value, the model uses leave-one-out cross-validation, which can be computed efficiently for ridge regression.
- `gcv_mode = 'auto'`: This parameter is only applicable when using leave-one-out cross-validation, and it indicates which strategy to use while performing it. Depending on whether `n_samples > n_features`, it uses either a singular value decomposition or an eigendecomposition. Choosing `'auto'` allows the model to pick the cheaper option based on the shape of the training data.
- `alpha_per_target = False`: This parameter indicates whether to optimize the alpha value from the `alphas` list for each target separately. Choosing `False` means that a single alpha is used for all targets.
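
The parameter names above match scikit-learn's `RidgeCV`, so the configuration can be sketched as follows, assuming that estimator. The helper name `fit_linear_model` and its argument names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Ten evenly spaced regularization strengths from 0.1 to 1.0.
ALPHAS = np.linspace(0.1, 1.0, 10)


def fit_linear_model(X_train, log_price_train):
    """Fit ridge regression, choosing alpha by leave-one-out CV."""
    model = RidgeCV(
        alphas=ALPHAS,
        scoring=None,            # negative MSE under leave-one-out CV
        cv=None,                 # efficient leave-one-out CV
        gcv_mode="auto",         # SVD or eigendecomposition, by data shape
        alpha_per_target=False,  # one alpha shared by all targets
    )
    model.fit(X_train, log_price_train)
    return model
```

After fitting, `model.alpha_` holds the regularization strength selected from `ALPHAS`.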



### Tree-based model {-}

For the tree-based model, we used `xgboost.XGBRegressor` with the following custom parameters:

- `n_estimators = 300`: This parameter specifies the number of gradient boosted trees. XGBoost regression is an ensemble method and, as such, it uses a number of trees to make the final prediction. We avoided a very large number of estimators to limit overfitting to the training dataset and the resulting poor performance on the test dataset.
- `learning_rate = 0.085`: This parameter specifies the boosting learning rate (i.e., shrinkage factor or $\eta$). After each boosting step, we can directly get the weights of new features, and this parameter shrinks the feature weights to make the boosting process more conservative. This helps prevent overfitting and reduces model variance. We chose a small learning rate (more conservative) to make sure that our model is able to generalize well to the test dataset.
- `objective = "reg:squarederror"`: This parameter specifies the learning task and corresponding learning objective. Here, we chose squared loss since we are performing regression against the log home price.


The rest of the parameters were left at their default values, because tuning the above parameters was enough to reach the desired performance. Some of these parameters are explained below:

- `max_depth = 6`: This parameter specifies the maximum depth of a tree. A large `max_depth` would make the model more complex and increase the chances of overfitting to the training dataset.
- `booster = 'gbtree'`: This parameter specifies which booster to use. The chosen method uses tree-based models which we found to perform the best with our dataset.
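
Collected in one place, the settings above could be passed to `xgboost.XGBRegressor` roughly as shown below. This is a sketch: the names `TREE_PARAMS` and `build_tree_model` are hypothetical, and the two default-valued entries are spelled out only for readability.

```python
# Custom values from the list above, plus two defaults made explicit.
TREE_PARAMS = {
    "n_estimators": 300,
    "learning_rate": 0.085,
    "objective": "reg:squarederror",
    "max_depth": 6,       # library default
    "booster": "gbtree",  # library default
}


def build_tree_model(**overrides):
    """Create an XGBRegressor with the settings above.

    Keyword arguments replace individual entries, which is convenient
    when experimenting with, for example, the learning rate.
    """
    # Imported here so the parameter table can be inspected even where
    # xgboost is not installed.
    from xgboost import XGBRegressor

    return XGBRegressor(**{**TREE_PARAMS, **overrides})
```

For example, `build_tree_model(learning_rate=0.05)` would create the same model with a smaller shrinkage factor.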


## Results {-}


Using the metric provided in the project description, we evaluated our prediction on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted price and the logarithm of the observed sales price. Below, you can see the evaluation result for each model type.

We also recorded the runtime of training and evaluating each model in seconds on each split. Please note that the runtimes reported here do not include data loading and preprocessing to better capture the runtime differences between the two selected models.
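
A runtime of this kind can be measured with a small wrapper such as the one sketched below. It assumes a model object exposing `fit` and `predict`; the helper name is hypothetical, and the actual timing code in `mymain.py` may differ.

```python
import time


def timed_fit_predict(model, X_train, y_train, X_test):
    """Fit the model, predict on the test data, and time both together.

    Data loading and pre-processing are expected to happen before this
    call, so they are excluded from the measured runtime.
    """
    start = time.perf_counter()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    elapsed_seconds = time.perf_counter() - start
    return predictions, elapsed_seconds
```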


RMSE on test data            |  Runtime
:-------------------------:|:-------------------------:
![RMSE](rmse.png "RMSE")  |  ![Runtime](time.png "Runtime")


*RMSE of the natural log of the predicted price and the natural log of the observed `Sale_Price` using either a `Linear` model or a `Tree`-based model.*

*Runtime of training and evaluating each model in seconds on a local **device**; platform details are given below.*

**Platform**

- MacBook Pro (15-inch, 2017)
  - *CPU:* 2.8 GHz Quad-Core Intel Core i7
  - *Memory:* 16 GB 2133 MHz LPDDR3
  - *GPU:* Radeon Pro 555 2 GB
  

## Discussion {-}

Overall, as shown in **Results**, both the linear and tree-based models we chose were able to outperform the given performance targets. In the following subsections, we highlight some of the lessons that we have learned along with some interesting findings.

### Lessons Learned {-}

- Throughout this project, we learned that data pre-processing has the largest effect on model performance, much more so than model selection. As a result, we believe that future improvements and extensions of this project could explore different pre-processing techniques as well as the order in which they are applied. For example, we could experiment with scaling all (numerical and encoded categorical) features after all or a subset (e.g., *Winsorization*) of the other pre-processing steps.
- We were able to drastically improve the performance of the tree-based model by adjusting only a few of its parameters. As such, we suspect that additional hyperparameter tuning for the `XGBRegressor` would result in even better performance.

### Interesting Findings {-}

- On average, the tree-based model we used, `XGBRegressor`, took almost 40x longer than the linear model for training and evaluation.
- Without *Winsorization*, the RMSE of our models was almost 1.5 times the performance targets. After applying *Winsorization*, we were able to meet and, in most cases, significantly outperform the targets.
- The learning rate parameter for `XGBRegressor` has a significant effect on the final test performance for this dataset. Specifically, we found that `learning_rate` values below `0.1` result in the best performance in terms of RMSE.