# House Price Prediction


## Boosting

This notebook covers the algorithms behind boosting methods and applies them to house price prediction, with the aim of better predictive performance than a single model.

## Overview
- AdaBoost
- Gradient Boosting
- XGBoost

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
### Read in the data
df = pd.read_csv('../project-house-price-prediction/data/train.csv')
df_test = pd.read_csv('../project-house-price-prediction/data/test.csv')

## Data
The data preprocessing follows the data exploration done in the notebook data-exploration-and-preprocessing.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

In [None]:
class CleanHouseAttributes(BaseEstimator, TransformerMixin):
    """Apply the rules found during data exploration to clean the house price dataset."""

    def fit(self, x, y=None):
        return self

    def transform(self, df_house_price, cols_to_drop, target_col):
        # Work on a copy so the caller's DataFrame is not modified
        df_house_price = df_house_price.copy()

        # Age of building/remodel from YearBuilt and YearRemodAdd
        df_house_price['AgeBuilding'] = 2012 - df_house_price['YearBuilt']
        df_house_price['AgeRemodel'] = 2012 - df_house_price['YearRemodAdd']

        # Create boolean variables for Alley, PoolQC, and Fence
        # (1 if the feature is present, 0 if the value is missing)
        df_house_price['HasAlley'] = df_house_price['Alley'].notna().astype('int64')
        df_house_price['HasPool'] = df_house_price['PoolQC'].notna().astype('int64')
        df_house_price['HasFence'] = df_house_price['Fence'].notna().astype('int64')

        # Remove rows with categories that do not exist in test.csv
        df_filtered = df_house_price[(df_house_price['HouseStyle'] != '2.5Fin') &
                                     (df_house_price['Exterior1st'] != 'Stone') &
                                     (df_house_price['Exterior1st'] != 'ImStucc') &
                                     (df_house_price['Exterior2nd'] != 'Other')]

        # Drop columns (cols_to_drop and target_col are both lists)
        df_dropped = df_filtered.drop(cols_to_drop + target_col, axis=1)

        # Fill NA in numeric columns with the column mean
        df_numeric = df_dropped.select_dtypes(include=['int64', 'float64'])
        df_numeric = df_numeric.fillna(df_numeric.mean())

        # Create dummies for non-numeric columns
        df_nonNumeric = pd.get_dummies(df_dropped.select_dtypes(include=['object']).fillna('NA'))

        X = pd.concat([df_numeric, df_nonNumeric], axis=1)
        y = df_filtered[target_col]

        return X, y

In [None]:
drop_cols = ['Id', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'TotalBsmtSF',
             'TotRmsAbvGrd', 'MoSold', 'YrSold', 'Street', 'Alley', 'Utilities',
             'LandSlope', 'Condition2', 'Heating', 'Functional', 'FireplaceQu',
             'PoolQC', 'Fence', 'MiscFeature']
target_col = ['SalePrice']

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

In [None]:
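# Build the feature matrix and target with the cleaning transformer defined above
X, y = CleanHouseAttributes().transform(df, drop_cols, target_col)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)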

### AdaBoost

In bagging, each training example is equally likely to be picked. In boosting, the probability that a particular example is in the training set of a particular machine depends on how well the prior machines performed on that example. In each round, each training sample is assigned a new weight based on the prediction performance in the previous round. Samples with worse predictions receive more weight, so that each subsequent machine focuses on the "difficult" samples.
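
The difference between the two sampling schemes can be illustrated with a small NumPy sketch. The weights and the sample count are made up for illustration and are not taken from the house price data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Bagging: every sample has the same probability of being drawn.
p_bagging = np.full(n, 1.0 / n)

# Boosting: hypothetical weights after one round. Samples 1 and 3 were
# predicted poorly, so they carry more weight than the others.
w = np.array([0.2, 1.0, 0.2, 1.0, 0.2])
p_boosting = w / w.sum()

# Draw n indices with replacement under each scheme.
idx_bagging = rng.choice(n, size=n, replace=True, p=p_bagging)
idx_boosting = rng.choice(n, size=n, replace=True, p=p_boosting)

print("bagging probabilities: ", p_bagging)
print("boosting probabilities:", p_boosting)
```

Under the boosting probabilities, samples 1 and 3 are each five times as likely to be drawn as any of the other samples.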

#### Algorithm
Given $(x_1, y_1),...,(x_n, y_n)$, assign a weight $w_i=1$ for $i = 1,...,n$.

For $t = 1,..., T$
1. The probability that training sample $i$ is in the training set is $p_i = \frac{w_i}{\sum w_i}$ where the summation is over all members of the training set. Pick $n$ samples with replacement to form the training set.
2. Construct a regression machine $t$.
3. Make prediction $y_i^{(p)}(x_i)$ for $i=1,...,n$ with machine $t$. **Note: $y_i^{(p)}(x_i)$ is not the final prediction.**
4. Calculate a loss $L_i$ for each training sample from $y_i^{(p)}(x_i)$ and $y_i$. The loss function may take any functional form as long as $L_i \in [0,1]$. If we let $$ D = \sup_i|y_i^{(p)}(x_i)-y_i|, \quad i = 1,...,n,$$ so that $D$ is the largest error, then we have three candidate loss functions:
$$L_i=\frac{|y_i^{(p)}(x_i)-y_i|}{D} \textit{(linear)}$$
$$L_i=\frac{|y_i^{(p)}(x_i)-y_i|^2}{D^2} \textit{(square law)}$$
$$L_i=1-\exp\left[\frac{-|y_i^{(p)}(x_i)-y_i|}{D}\right] \textit{(exponential)}$$
5. Calculate the average loss $\overline{L}=\sum_{i=1}^{n}L_ip_i$.
6. Form $\beta=\frac{\overline{L}}{1-\overline{L}}$. $\beta$ is a measure of confidence in the predictor. Low $\beta$ means high confidence in the prediction.
7. Update the weights: $w_i \rightarrow w_i\beta^{[1-L_i]}$. The smaller the loss, the more the weight is reduced, making the sample less likely to be picked in the next round.
8. After all $T$ rounds, for a particular input $x_i$, each of the $T$ machines makes a prediction $h_t, t=1,...,T$. The cumulative prediction $h_f$ combines the $T$ predictors:
$$h_f = \inf\bigg\{y \in Y: \sum_{t:h_t\leq y}\log\left(\frac{1}{\beta_t}\right)\geq\frac{1}{2}\sum_{t}\log\left(\frac{1}{\beta_t}\right)\bigg\}$$
In effect, this relabels the predictions for $x_i$ in increasing order, $y_i^{(1)}<y_i^{(2)}<\dots<y_i^{(T)}$, keeping each $\beta_t$ attached to its prediction, and then sums the $\log(1/\beta_t)$ in that order until the running sum first equals or exceeds $\frac{1}{2}\sum_{t}\log(\frac{1}{\beta_t})$. The prediction at which this happens is the weighted median. If the $\beta_t$ were all equal, this would be the ordinary median.
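
Steps 4 to 8 can be sketched in NumPy as follows. This is a minimal illustration rather than the scikit-learn implementation: the function names `sample_losses`, `update_weights`, and `weighted_median` are hypothetical, the toy numbers are made up, and the sketch assumes every $\beta_t < 1$ (that is, $\overline{L} < 0.5$) so that all $\log(1/\beta_t)$ are positive.

```python
import numpy as np

def sample_losses(y_true, y_pred, kind="linear"):
    """Step 4: per-sample loss in [0, 1], scaled by the largest error D."""
    err = np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float))
    D = err.max()
    if D == 0:  # perfect fit: all losses are zero
        return np.zeros_like(err)
    if kind == "linear":
        return err / D
    if kind == "square":
        return (err / D) ** 2
    if kind == "exponential":
        return 1.0 - np.exp(-err / D)
    raise ValueError(f"unknown loss: {kind}")

def update_weights(w, losses):
    """Steps 5-7: average loss, beta, and the updated weights."""
    p = w / w.sum()
    L_bar = np.sum(losses * p)
    beta = L_bar / (1.0 - L_bar)
    return w * beta ** (1.0 - losses), beta

def weighted_median(preds, betas):
    """Step 8: weighted median of the T predictions for a single input."""
    preds = np.asarray(preds, dtype=float)
    order = np.argsort(preds)                 # relabel predictions in increasing order
    log_inv_beta = np.log(1.0 / np.asarray(betas, dtype=float))[order]
    cum = np.cumsum(log_inv_beta)
    k = np.searchsorted(cum, 0.5 * cum[-1])   # first index where cum >= half the total
    return preds[order][k]

# Toy round: four samples, all starting with weight 1.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 20.0, 33.0, 30.0])
losses = sample_losses(y_true, y_pred, kind="linear")
new_w, beta = update_weights(np.ones(4), losses)
print(losses, beta, new_w)

# Combining three machines: equal betas give the plain median.
print(weighted_median([100.0, 120.0, 300.0], [0.5, 0.5, 0.5]))
# A very confident third machine (low beta) pulls the result to its prediction.
print(weighted_median([100.0, 120.0, 300.0], [0.9, 0.9, 0.1]))
```

In the toy round the average loss is 0.375, so $\beta = 0.6$. The sample with the largest error keeps its weight of 1, while the perfectly predicted sample drops to 0.6.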


In [322]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

In [322]:
abr = AdaBoostRegressor(DecisionTreeRegressor(max_depth=20), 
                        n_estimators=100,
                        loss='exponential')
abr.fit(X_train, np.ravel(y_train))
y_pred = abr.predict(X_test)
print(abr.score(X_test, y_test), np.sqrt(mean_squared_error(y_test, y_pred)))

In [322]:
param_grid = {'n_estimators': [100, 250, 500]}
abr = AdaBoostRegressor(DecisionTreeRegressor(max_depth=10))
gs_cv = GridSearchCV(abr, param_grid, cv = 5)
gs_cv.fit(X_train, np.ravel(y_train))
print(gs_cv.best_params_, gs_cv.best_score_)

### Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
gbr = GradientBoostingRegressor(learning_rate = 0.01,
                                n_estimators = 1000,
                                max_depth = 10)
gbr.fit(X_train, np.ravel(y_train))
y_pred = gbr.predict(X_test)
print(gbr.score(X_test, y_test), np.sqrt(mean_squared_error(y_test, y_pred)))

In [None]:
param_grid = {'n_estimators': [100, 150],
              'learning_rate': np.linspace(0.1, 0.2, 10)}
gbr = GradientBoostingRegressor(max_depth=8, max_features='sqrt')
gs_cv = GridSearchCV(estimator=gbr, param_grid=param_grid, cv=5)
gs_cv.fit(X_train, np.ravel(y_train))
print(gs_cv.best_params_, gs_cv.best_score_)

## Performance on Kaggle

In [None]:
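# NOTE: house_price_data_cleaning is not defined in this notebook; it is assumed
# to be the cleaning function from the data-exploration-and-preprocessing notebook.
from sklearn.linear_model import LinearRegression

X_test_kaggle, y_test_kaggle = house_price_data_cleaning(df_test, drop_cols)
lr_kaggle = LinearRegression()
lr_kaggle.fit(X, y)
y_pred_kaggle = lr_kaggle.predict(X_test_kaggle)
y_pred_kaggle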

## References
- H. Drucker, “Improving Regressors using Boosting Techniques”, 1997.
- T. Hastie, R. Tibshirani and J. Friedman, “The Elements of Statistical Learning”, 2nd ed., Springer, 2009.