# Ensemble

In today's lecture, we discussed that ensemble models are more robust than single models. Let's build a decision tree and a random forest using scikit-learn and see if that's really the case.

## Prepare the dataset

Please download "train.csv" from the [Housing Prices dataset](https://www.kaggle.com/competitions/home-data-for-ml-course/data?select=test.csv). With 79 features describing residential homes in Iowa, this dataset challenges you to predict the sale price of each home.



In [73]:
import pandas as pd

In [74]:
data = pd.read_csv('train.csv')
data.head()

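|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|----|------------|----------|-------------|---------|--------|-------|----------|-------------|-----------|-----|----------|--------|-------|-------------|---------|--------|--------|----------|---------------|-----------|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |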


Several columns, such as "Alley" and "Fence," contain NaN values. Let's drop every column that has a NaN, keep only the integer-valued numerical columns, and drop "Id," which doesn't hold any useful information.

In [75]:
data = data.dropna(axis=1)

numerics = ['int16', 'int32', 'int64']
data = data.select_dtypes(include=numerics)

data = data.drop('Id', axis=1)
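To see what each of these cleaning steps does, here is a minimal sketch on a small, made-up DataFrame (the `toy` frame and its values are hypothetical, chosen only to mimic the structure of the housing data). Note one assumption: it uses `select_dtypes(include="number")`, which would also keep float columns, whereas the cell above keeps integer columns only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking the structure of the housing data
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],        # integer column, no missing values
    "LotFrontage": [65.0, np.nan, 68.0],   # float column with a missing value
    "Alley": [np.nan, "Pave", np.nan],     # text column with missing values
    "Street": ["Pave", "Pave", "Pave"],    # text column, no missing values
    "SalePrice": [208500, 181500, 223500],
})

# Count missing values per column before dropping anything
print(toy.isna().sum())

cleaned = (
    toy.dropna(axis=1)                     # drops LotFrontage and Alley (they contain NaN)
       .select_dtypes(include="number")    # drops Street (non-numeric)
       .drop(columns="Id")                 # drops the identifier column
)
print(cleaned.columns.tolist())
```

Running `data.isna().sum()` on the real dataset before `dropna` is a quick way to check how many columns (and how much information) you are about to discard.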

In [77]:
x = data.drop('SalePrice', axis=1) # data columns
y = data.SalePrice # target column

In [78]:
# split data into train and test dataset
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1095, 34), (365, 34), (1095,), (365,))
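The shapes above reflect the default `test_size` of 0.25 (1095 training rows and 365 test rows out of 1460). Because the split is shuffled randomly, scores will vary from run to run. The sketch below, on made-up stand-in arrays (`X_demo` and `y_demo` are hypothetical), shows how passing `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with 1460 rows, like the housing set
X_demo = np.arange(1460 * 2).reshape(1460, 2)
y_demo = np.arange(1460)

# test_size=0.25 is the default; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
print(X_tr.shape, X_te.shape)

# Splitting again with the same random_state gives the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
print(np.array_equal(y_te, y_te2))
```

Fixing the seed is useful when comparing models, since every model then sees exactly the same training and test rows.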

## Decision Tree

scikit-learn makes it easy to build and train a decision tree model. Since we're solving a regression problem, let's load `DecisionTreeRegressor`.

In [80]:
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor()

# train 
dt.fit(X_train, y_train)

# test the model
dt.score(X_test, y_test)

0.7569195579279113
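For regressors, `score` returns the coefficient of determination, R², where 1.0 is a perfect fit and 0.0 is no better than always predicting the mean. As a rough check, the sketch below recomputes R² by hand on synthetic data (generated with `make_regression`, not the housing set) and compares it with what `score` reports.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the housing dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
pred = tree.predict(X_te)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_te - pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, tree.score(X_te, y_te))
```

The two printed numbers should agree up to floating-point error.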

## Random Forest

In the previous section, we used a single decision tree regressor to predict the house price. We can do better than that by using a random forest regressor, which falls under the category of "Bagging."

One important technique is "bootstrapping," which samples data from the dataset with replacement to create multiple datasets, each with a different distribution of data. A random forest without bootstrapping is like building 100 trees from the same dataset. As the whole point of an ensemble is to build diverse estimators, we want to create a variety of subsets of observations from the dataset and feed a different subset to each decision tree model. After training multiple decision trees, we aggregate their predictions to get a robust result. Thus, bagging is also called "Bootstrap Aggregating."
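The idea can be sketched by hand in a few lines. The following is a simplified illustration on synthetic data, not how `RandomForestRegressor` is implemented internally: it draws bootstrap samples with NumPy, trains one tree per sample, and averages the predictions. Random forests can additionally subsample features at each split, which this sketch leaves out.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the housing dataset)
X_demo, y_demo = make_regression(
    n_samples=400, n_features=8, noise=20.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

rng = np.random.default_rng(0)
n_trees = 50
trees = []
for _ in range(n_trees):
    # Bootstrap: sample row indices with replacement, same size as the training set
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X_tr[idx], y_tr[idx]))

# Aggregate: average the predictions of all trees
bagged_pred = np.mean([t.predict(X_te) for t in trees], axis=0)

# Compare with a single tree trained on the full training set
single_pred = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr).predict(X_te)
single_mse = np.mean((y_te - single_pred) ** 2)
bagged_mse = np.mean((y_te - bagged_pred) ** 2)
print(single_mse, bagged_mse)
```

Each bootstrap sample leaves out some rows and repeats others, so every tree sees a slightly different dataset; averaging their predictions smooths out the errors of the individual trees.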

In [93]:
from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor()
reg.fit(X_train, y_train)
reg.score(X_test, y_test)

0.8659078331992321

## XGBoost

XGBoost (Extreme Gradient Boosting) is a powerful machine learning algorithm that's often used in Kaggle challenges. "Boosting" is an ensemble technique that builds estimators sequentially, such that each subsequent estimator tries to reduce the error of the previous ones. As the name hints, XGBoost does so with gradient descent. The step-by-step algorithm is quite complicated, but as always, scikit-learn lets you build and test a gradient boosting model in just four lines of code! Note that `GradientBoostingRegressor`, used here, is scikit-learn's own gradient boosting implementation rather than the separate XGBoost library; both follow the same boosting idea.
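To make the "each estimator corrects the previous ones" idea concrete, here is a minimal hand-rolled sketch of gradient boosting for squared error on synthetic data. It is an illustration under simplifying assumptions (fixed learning rate, shallow trees, training error only), not the actual algorithm used by XGBoost or scikit-learn. For squared error, the negative gradient is simply the residual, so each new tree is fit to the current residuals.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the housing dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)

learning_rate = 0.1   # hypothetical choice for this sketch
n_rounds = 100

# Start from a constant prediction: the mean of the targets
prediction = np.full(len(y_demo), y_demo.mean())
initial_mse = np.mean((y_demo - prediction) ** 2)

stages = []
for _ in range(n_rounds):
    # For squared error, the negative gradient equals the residual
    residuals = y_demo - prediction
    small_tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    small_tree.fit(X_demo, residuals)
    # Take a small step in the direction that reduces the error
    prediction += learning_rate * small_tree.predict(X_demo)
    stages.append(small_tree)

final_mse = np.mean((y_demo - prediction) ** 2)
print(initial_mse, final_mse)
```

The training error shrinks with every round, which is why boosting needs a held-out test set (or cross-validation) to detect overfitting.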

In [92]:
from sklearn.ensemble import GradientBoostingRegressor

# scikit-learn's gradient boosting regressor (not the separate xgboost package)
gbr = GradientBoostingRegressor(n_estimators=100)
gbr.fit(X_train, y_train)
gbr.score(X_test, y_test)

0.9019168595556754
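A single train/test split gives only one score per model, and that score changes with the random split. One way to compare the three models more fairly is k-fold cross-validation. The sketch below runs on synthetic data (an assumption, so the numbers will not match the housing results above) and reports the mean and standard deviation of R² over five folds.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the housing dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=15.0, random_state=0
)

models = {
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}

cv_results = {}
for name, model in models.items():
    # cross_val_score uses the estimator's default scorer (R^2 for regressors)
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R^2 = {scores.mean():.3f}, std = {scores.std():.3f}")
```

To apply this to the housing data, replace `X_demo` and `y_demo` with `x` and `y` from the cells above.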