### Trees: Ensemble Methods - Bagging

Bagging, in short: train many individual models in parallel, each on a random subset of the data, and combine their predictions.

BAGGing stands for Bootstrapping (sampling with replacement) and AGGregating (averaging predictions, or taking a majority vote for classification).
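To make the two steps concrete, here is a minimal hand-rolled sketch of bagging on the iris data. The ensemble size, the seed, and the use of `DecisionTreeClassifier` as the base model are assumptions made for illustration; in practice scikit-learn's own ensemble classes do this for you.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)  # seed chosen only for reproducibility

n_models = 25  # illustrative ensemble size
n_samples = X.shape[0]
all_votes = []

for _ in range(n_models):
    # Bootstrapping: draw n_samples row indices with replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    all_votes.append(tree.predict(X))

# Aggregating: majority vote across the models for each sample
votes = np.stack(all_votes)  # shape (n_models, n_samples)
bagged_pred = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])
print("training accuracy of the bagged ensemble:", (bagged_pred == y).mean())
```

For a regression target the aggregation step would be a plain mean of the individual predictions instead of a vote.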

### <strong> Random Forest </strong>

In addition to training each tree on a random subset of the data, Random Forest also considers only a random selection of features when growing the trees, rather than using all features. An ensemble of many such randomized trees is called a Random Forest.

With Random Forest, our goal is to reduce the variance of a single decision tree. We end up with an ensemble of different models, and the average of the predictions from the individual trees is used, which is more robust than the prediction of a single decision tree.

- Forests use high-variance, low-bias base learners (deep decision trees)
- Bagging decreases the model's variance
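One way to see the effect of these points is to compare a single decision tree with a forest under cross-validation. The sketch below is only illustrative: the dataset, `n_estimators=100`, `cv=5`, and the seeds are assumptions, and on a dataset as small and easy as iris the two models may score almost the same.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # random subset of features considered at each split
    bootstrap=True,       # each tree sees a bootstrap sample of the rows
    random_state=0,
)

tree_scores = cross_val_score(single_tree, X, y, cv=5)
forest_scores = cross_val_score(forest, X, y, cv=5)

print("single tree: mean=%.3f std=%.3f" % (tree_scores.mean(), tree_scores.std()))
print("forest:      mean=%.3f std=%.3f" % (forest_scores.mean(), forest_scores.std()))
```

The spread of the fold scores is only a rough proxy for model variance, so treat the comparison as a sanity check rather than a measurement.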

<img src="./images/boostrap_aggregating.png" width="500" height="500" />

### <strong> Extremely Randomized Trees </strong>

Extremely Randomized Trees, abbreviated as ExtraTrees in scikit-learn, add one more step of randomization to the random forest algorithm.

A random forest will

1. compute the optimal split for each feature within the randomly selected subset and then choose the best feature to split on.
2. build multiple trees with `bootstrap=True` (the default), which means each tree is trained on rows sampled with replacement.

ExtraTrees, on the other hand, chooses a random split for each feature within that random subset and then picks the best feature to split on by comparing those randomly chosen splits. (Nodes are split on random splits, not best splits.) In scikit-learn, ExtraTrees also defaults to `bootstrap=False`, so each tree is trained on the whole training set.

Like random forest, the Extra Trees algorithm will randomly sample the features at each split point of a decision tree. Unlike random forest, which uses a greedy algorithm to select an optimal split point, the Extra Trees algorithm selects a split point at random.
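The difference between the two split rules can be sketched for a single feature. The helper names below (`gini`, `split_impurity`, `best_threshold`, `random_threshold`) are hypothetical and written for this illustration; they are not scikit-learn functions, and the toy data is made up.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of integer class labels."""
    if labels.size == 0:
        return 0.0
    p = np.bincount(labels) / labels.size
    return 1.0 - np.sum(p ** 2)

def split_impurity(x, y, threshold):
    """Weighted Gini impurity after splitting on x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    return (left.size * gini(left) + right.size * gini(right)) / y.size

def best_threshold(x, y):
    """Random-forest style: try every midpoint between sorted unique values."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2
    scores = [split_impurity(x, y, t) for t in candidates]
    return candidates[int(np.argmin(scores))]

def random_threshold(x, rng):
    """Extra-Trees style: one threshold drawn uniformly between min and max."""
    return rng.uniform(x.min(), x.max())

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
y = np.array([0, 0, 0, 1, 1, 1])

t_best = best_threshold(x, y)
t_rand = random_threshold(x, rng)
print("best threshold:", t_best, "impurity:", split_impurity(x, y, t_best))
print("random threshold:", t_rand, "impurity:", split_impurity(x, y, t_rand))
```

The exhaustive search scores every candidate threshold, while the random variant scores only one per feature, which is where the time saving comes from.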

In terms of computational cost, and therefore execution time, the Extra Trees algorithm is faster. The overall procedure is the same, but it saves time by choosing the split point at random instead of calculating the optimal one.

Extremely randomized trees are usually cheaper to train than random forests, and their predictive performance is often comparable. In some cases, they may even perform better.
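A quick way to check the speed and accuracy claims on your own machine is to fit both estimators on the same data and time them. This is a rough sketch on synthetic data: the dataset sizes, `n_estimators=100`, and the seeds are arbitrary assumptions, and the timings will vary from run to run.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data; sizes are arbitrary and only for illustration
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

results = {}
for name, Model in [("RandomForest", RandomForestClassifier), ("ExtraTrees", ExtraTreesClassifier)]:
    model = Model(n_estimators=100, random_state=0)
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    results[name] = (elapsed, model.score(X_test, y_test))
    print(f"{name}: fit time {elapsed:.2f}s, test accuracy {results[name][1]:.3f}")
```

A single timing like this is noisy, so repeating the fit a few times gives a fairer comparison.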

![Bagging](./images/rf_extra.png)

Link to Paper: https://link.springer.com/content/pdf/10.1007/s10994-006-6226-1.pdf

In [1]:
#import libraries
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score

In [3]:
#load dataset

X,y = load_iris(return_X_y=True)

#train,test split

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42)

#random forest with gini
rf = RandomForestClassifier(criterion='gini',n_estimators=150,max_depth=4,n_jobs=-1)

rf.fit(X_train,y_train)  #fit on the data

rf_predict = rf.predict(X_test)

f1_score(y_test, rf_predict, average=None)

array([1., 1., 1.])

In [4]:
#random forest with entropy
rf = RandomForestClassifier(criterion='entropy',n_estimators=200,max_depth=4,n_jobs=-1)

rf.fit(X_train,y_train)

rf_predict = rf.predict(X_test)

f1_score(y_test, rf_predict, average=None)

array([1., 1., 1.])

Exercise: Using the data in scout_data, build a model to predict the product tier (classification) and a model to predict the number of detail views (regression).

In [6]:
data = pd.read_csv("data/scout_data/Case_Study_Data.csv", sep=";")

In [7]:
data.head()

| | article_id | product_tier | make_name | price | first_zip_digit | first_registration_year | created_date | deleted_date | search_views | detail_views | stock_days | ctr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 350625839 | Basic | Mitsubishi | 16750 | 5 | 2013 | 24.07.18 | 24.08.18 | 3091.0 | 123.0 | 30 | 0.037803299902944 |
| 1 | 354412280 | Basic | Mercedes-Benz | 35950 | 4 | 2015 | 16.08.18 | 07.10.18 | 3283.0 | 223.0 | 52 | 0.06792567773378 |
| 2 | 349572992 | Basic | Mercedes-Benz | 11950 | 3 | 1998 | 16.07.18 | 05.09.18 | 3247.0 | 265.0 | 51 | 0.0816137973514013 |
| 3 | 350266763 | Basic | Ford | 1750 | 6 | 2003 | 20.07.18 | 29.10.18 | 1856.0 | 26.0 | 101 | 0.0140086206896551 |
| 4 | 355688985 | Basic | Mercedes-Benz | 26500 | 3 | 2014 | 28.08.18 | 08.09.18 | 490.0 | 20.0 | 12 | 0.0408163265306122 |


In [59]:
data_desc = pd.read_csv("data/scout_data/Data_Description.csv", sep=";")
data_desc.head()

| | column name | description |
|---|---|---|
| 0 | article_id | unique article identifier |
| 1 | product_tier | premium status of the article |
| 2 | make_name | name of the car manufacturer |
| 3 | price | price of the article |
| 4 | first_zip_digit | first digit of the zip code of the region the ... |


In [60]:
data.shape

(78321, 12)

In [61]:
data.dtypes

article_id                   int64
product_tier                 int64
make_name                    int32
price                        int64
first_zip_digit              int64
first_registration_year      int64
created_date                object
deleted_date                object
search_views               float64
detail_views               float64
stock_days                   int64
ctr                         object
dtype: object

In [62]:
for col in data.columns:
    print(col, sum(data.loc[:, col].isnull()))

article_id 0
product_tier 0
make_name 0
price 0
first_zip_digit 0
first_registration_year 0
created_date 0
deleted_date 0
search_views 10
detail_views 10
stock_days 0
ctr 24


In [64]:
#drop all cars with a first_registration_year later than 2022
data = data[data['first_registration_year'] <= 2022]

#drop all rows with negative live days (a listing cannot be deleted before it is created)
#data = data[data['live_days'] >= 0]

#drop all rows with negative stock days
data = data[data['stock_days'] >= 0]

#drop all cars priced at 100 euros or less
data = data[data.price > 100]

In [65]:
#Label Encoding the product_tier column
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()  #instantiate the Label Encoder
data['product_tier'] = le.fit_transform(data['product_tier'])

In [66]:
#Label Encoding the make_name column
#from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()  #instantiate the Label Encoder
data['make_name'] = le.fit_transform(data['make_name'])
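The `norm_price` column that shows up in the outputs below is not created in any cell shown here. One plausible way to build such a column, assuming min-max scaling (an assumption, since the original cell is not shown), is sketched here on the five prices visible in `data.head()`.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Prices taken from the first five rows shown by data.head(); the frame itself is illustrative
demo = pd.DataFrame({"price": [16750, 35950, 11950, 1750, 26500]})

scaler = MinMaxScaler()  # assumption: min-max scaling to [0, 1]
demo["norm_price"] = scaler.fit_transform(demo[["price"]]).ravel()
print(demo)
```

Tree-based models split on thresholds, so a monotonic rescaling like this should not change which splits a forest can find; `price` and `norm_price` are expected to behave similarly as features.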

In [84]:
data.dtypes

article_id                   int64
product_tier                 int64
make_name                    int64
price                        int64
first_zip_digit              int64
first_registration_year      int64
created_date                object
deleted_date                object
search_views               float64
detail_views               float64
stock_days                   int64
ctr                         object
norm_price                 float64
dtype: object

In [85]:
data.head()

| | article_id | product_tier | make_name | price | first_zip_digit | first_registration_year | created_date | deleted_date | search_views | detail_views | stock_days | ctr | norm_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 350625839 | 0 | 62 | 16750 | 5 | 2013 | 24.07.18 | 24.08.18 | 3091.0 | 123.0 | 30 | 0.037803299902944 | 0.002691 |
| 1 | 354412280 | 0 | 60 | 35950 | 4 | 2015 | 16.08.18 | 07.10.18 | 3283.0 | 223.0 | 52 | 0.06792567773378 | 0.005775 |
| 2 | 349572992 | 0 | 60 | 11950 | 3 | 1998 | 16.07.18 | 05.09.18 | 3247.0 | 265.0 | 51 | 0.0816137973514013 | 0.00192 |
| 3 | 350266763 | 0 | 33 | 1750 | 6 | 2003 | 20.07.18 | 29.10.18 | 1856.0 | 26.0 | 101 | 0.0140086206896551 | 0.000281 |
| 4 | 355688985 | 0 | 60 | 26500 | 3 | 2014 | 28.08.18 | 08.09.18 | 490.0 | 20.0 | 12 | 0.0408163265306122 | 0.004257 |


### <strong> Classification for Product Tier </strong>

In [88]:
y_ex = data["product_tier"]
mask=['make_name','norm_price', 'first_registration_year']



X_ex = data[mask]

#train,test split

X_train,X_test,y_train,y_test = train_test_split(X_ex,y_ex,random_state=42)

#random forest with gini
rf = RandomForestClassifier(criterion='gini',n_estimators=150,max_depth=4,n_jobs=-1)

rf.fit(X_train,y_train)  #fit on the data

rf_predict = rf.predict(X_test)

f1_score(y_test, rf_predict, average='macro')

0.3269987747973202

In [71]:
rf = RandomForestClassifier(criterion='entropy',n_estimators=200,max_depth=4,n_jobs=-1)

rf.fit(X_train,y_train)

rf_predict = rf.predict(X_test)

f1_score(y_test, rf_predict, average='macro')

0.3269987747973202
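A macro F1 that is this low and identical for both criteria may mean the forest predicts almost only one class, which can happen when the classes are imbalanced. That is a guess; the outputs above do not confirm it. A minimal sketch of how to check, using deliberately imbalanced synthetic data (not the scout data) and assumed class weights:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced 3-class data (not the scout data)
X, y = make_classification(
    n_samples=3000, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, weights=[0.9, 0.07, 0.03], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=150, max_depth=4, random_state=0, n_jobs=-1)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)

# Per-class scores show which classes drag the macro average down
per_class_f1 = f1_score(y_test, pred, average=None, zero_division=0)
macro_f1 = f1_score(y_test, pred, average="macro", zero_division=0)
cm = confusion_matrix(y_test, pred)

print("per-class F1:", per_class_f1)
print("macro F1:", macro_f1)
print("confusion matrix (rows = true class, columns = predicted):")
print(cm)
```

If the confusion matrix has almost all of its mass in one column, options worth trying include `class_weight="balanced"`, a stratified split, or more informative features.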

### <strong> Regression for detail_views </strong>

In [42]:
from math import sqrt
from sklearn.metrics import mean_squared_error
def compute_rmse(actual, pred):
    rmse = sqrt(mean_squared_error(actual, pred))
    return rmse

In [35]:
y_reg = data["detail_views"]
mask=['make_name','price','first_registration_year']



X_reg = data[mask]

#train,test split

X_train,X_test,y_train,y_test = train_test_split(X_reg,y_reg,random_state=42)

In [36]:
# baseline: predict the training-set mean (AVG) for every row
base_line = pd.DataFrame(y_train.copy())

base_line.loc[:,"AVG"] = y_train.mean()
base_line.head().round()

| | detail_views | AVG |
|---|---|---|
| 46363 | 114.0 | 93.0 |
| 67153 | 60.0 | 93.0 |
| 45945 | 5798.0 | 93.0 |
| 33612 | 65.0 | 93.0 |
| 2089 | 7.0 | 93.0 |


In [89]:
compute_rmse(y_train, base_line.loc[:,'AVG'])

ValueError: Found input variables with inconsistent numbers of samples: [58633, 58740]
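The two sample counts in the error (58633 and 58740) suggest that `y_train` and `base_line` came from different splits, most likely because `y_train` was reassigned by the classification cell after `base_line` had been built. A minimal sketch of a baseline that avoids this is shown below. It assumes dedicated variable names for the regression split (`y_train_reg` and friends are hypothetical names) and drops rows with a missing target first, since `mean_squared_error` does not accept NaN values. The data is a small synthetic stand-in, not the scout data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Small synthetic stand-in for the scout data (values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "price": rng.integers(500, 50000, size=200),
    "first_registration_year": rng.integers(1995, 2019, size=200),
    "detail_views": rng.poisson(90, size=200).astype(float),
})
demo.loc[:4, "detail_views"] = np.nan  # mimic the missing targets

# Drop rows without a target before splitting
demo = demo.dropna(subset=["detail_views"])

X_reg = demo[["price", "first_registration_year"]]
y_reg = demo["detail_views"]

# Dedicated names keep this split from being overwritten by another model's split
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, random_state=42)

# Baseline: predict the training mean for every row
avg_pred = np.full(len(y_train_reg), y_train_reg.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_train_reg, avg_pred))
print("baseline RMSE on the training set:", baseline_rmse)
```

Running the notebook cells top to bottom after a kernel restart is another way to make sure the baseline and the targets come from the same split.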