### <strong> Trees: Ensemble Methods - Bagging </strong>

Bagging (summary): train many individual models in parallel, each on a random subset of the data.

BAGGing stands for Bootstrapping (sampling with replacement) and AGGregating (averaging the predictions, or taking a majority vote for classification).
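The two steps can be sketched by hand. This is a minimal illustration rather than how scikit-learn implements bagging; the number of models (25), the seeds, and the variable names are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(0)
n_models = 25  # illustrative choice
trees = []
for _ in range(n_models):
    # Bootstrapping: draw as many row indices as there are rows, with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# Aggregating: majority vote over the trees for each test row
all_preds = np.stack([tree.predict(X_test) for tree in trees])  # (n_models, n_test)
bagged_pred = np.array([np.bincount(col, minlength=3).argmax() for col in all_preds.T])
bagged_accuracy = (bagged_pred == y_test).mean()
print(f"bagged accuracy: {bagged_accuracy:.3f}")
```

For a regression target, the vote would be replaced by the mean of the individual predictions.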

### <strong> Random Forest </strong>

In addition to training each tree on a random subset of the data, Random Forest also considers only a random selection of features when growing the trees, rather than all features. An ensemble of many such randomized trees is called a Random Forest.

With Random Forest, the goal is to reduce the variance of a single decision tree. We end up with an ensemble of different models, and the average of the predictions from the individual trees is more robust than the prediction of any single decision tree.

- Deep decision trees = high-variance, low-bias base learners
- Bagging decreases the model’s variance
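
One way to see why averaging helps, and why Random Forest adds feature randomization on top of bagging, is the standard variance formula for an average of correlated estimators. As a sketch, assume $B$ trees $T_1, \dots, T_B$ whose predictions each have variance $\sigma^2$ and pairwise correlation $\rho$ (these symbols are introduced here for illustration only):

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

The second term shrinks as more trees are added, while the first shrinks only if the trees are less correlated with each other, which is what the random selection of features aims for.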

<img src="./images/boostrap_aggregating.png" width="500" height="500" />

### <strong> Extremely Randomized Trees </strong>

Extremely Randomized Trees, abbreviated as ExtraTrees in Sklearn, adds one more step of randomization to the random forest algorithm. 

A random forest will:

1. compute the optimal split for each feature within the randomly selected subset, and then choose the best feature to split on;
2. build multiple trees with `bootstrap=True` (the default), which means each tree is trained on a sample drawn with replacement.

ExtraTrees, in contrast, chooses a random split for each feature within that random subset, and then picks the best feature to split on by comparing those randomly chosen splits. (Nodes are split on random splits, not best splits.) In scikit-learn, ExtraTrees also defaults to `bootstrap=False`, so each tree is trained on the whole training set.

Like random forest, the Extra Trees algorithm will randomly sample the features at each split point of a decision tree. Unlike random forest, which uses a greedy algorithm to select an optimal split point, the Extra Trees algorithm selects a split point at random.
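
The difference between the two split rules can be sketched for a single feature. This is a simplified illustration on made-up data, not the scikit-learn implementation; the helper names `best_threshold` and `random_threshold` are hypothetical.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of integer class labels."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

def weighted_gini(x, y, threshold):
    """Impurity after splitting on x <= threshold, weighted by child size."""
    left, right = y[x <= threshold], y[x > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(y)

def best_threshold(x, y):
    """Random-forest style: score every midpoint between sorted unique values."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2
    scores = [weighted_gini(x, y, t) for t in candidates]
    return candidates[int(np.argmin(scores))]

def random_threshold(x, rng):
    """ExtraTrees style: draw one threshold uniformly between min and max."""
    return rng.uniform(x.min(), x.max())

# Made-up feature values and labels
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
rng = np.random.default_rng(0)

t_best = best_threshold(x, y)      # searched over all candidate midpoints
t_rand = random_threshold(x, rng)  # a single random draw, no search
print(t_best, weighted_gini(x, y, t_best))
print(t_rand, weighted_gini(x, y, t_rand))
```

The random rule skips the scan over candidate thresholds, which is where the time saving comes from; the price is that an individual split may be worse than the searched one.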

In terms of computational cost, and therefore execution time, the Extra Trees algorithm is faster. The rest of the procedure is the same, but time is saved because the split point is drawn at random instead of being searched for.

Extremely randomized trees are typically cheaper to train than random forests, and their predictive performance is often comparable. In some cases, they may even perform better!
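
A quick way to check this on a given problem is to fit both estimators on the same split and compare fit time and accuracy. The sketch below uses a synthetic dataset from `make_classification`; the dataset sizes and seeds are arbitrary assumptions, and timings will vary by machine.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only so the example is self-contained
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

results = {}
for name, Model in [("random_forest", RandomForestClassifier),
                    ("extra_trees", ExtraTreesClassifier)]:
    model = Model(n_estimators=100, random_state=0)
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    results[name] = {"fit_seconds": fit_seconds,
                     "accuracy": model.score(X_test, y_test)}

for name, r in results.items():
    print(f"{name}: fit {r['fit_seconds']:.3f}s, accuracy {r['accuracy']:.3f}")
```

Both classifiers are left at their defaults here, so the random forest bootstraps its rows while ExtraTrees does not.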

![Bagging](./images/rf_extra.png)

Link to Paper: https://link.springer.com/content/pdf/10.1007/s10994-006-6226-1.pdf

In [39]:
#import libraries
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score

In [2]:
#load dataset

X,y = load_iris(return_X_y=True)

#train,test split

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42)

#random forest with gini
rf = RandomForestClassifier(criterion='gini',n_estimators=150,max_depth=4,n_jobs=-1)

rf.fit(X_train,y_train)  #fit on the data

rf_predict = rf.predict(X_test)

f1_score(y_test, rf_predict, average=None)

array([1., 1., 1.])

In [3]:
#random forest with entropy
rf = RandomForestClassifier(criterion='entropy',n_estimators=200,max_depth=4,n_jobs=-1)

rf.fit(X_train,y_train)

rf_predict = rf.predict(X_test)

f1_score(y_test, rf_predict, average=None)

array([1., 1., 1.])
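
Both runs score perfectly on this split, but the iris test set is small, so a single split says little about which criterion is better. A hedged alternative is to compare the two with cross-validation and a fixed `random_state`; the fold count and seed below are arbitrary choices for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

cv_scores = {}
for criterion in ("gini", "entropy"):
    rf = RandomForestClassifier(criterion=criterion, n_estimators=150,
                                max_depth=4, random_state=42)
    # 5-fold cross-validated macro-averaged F1
    cv_scores[criterion] = cross_val_score(rf, X, y, cv=5, scoring="f1_macro")

for criterion, scores in cv_scores.items():
    print(f"{criterion}: mean macro-F1 = {scores.mean():.3f}")
```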

Exercise: Using the data in scout_data, build one model to predict the product tier (classification) and another to predict the number of detail views (regression).

In [21]:
#load the dataset
df = pd.read_csv('./data/scout_data/Case_Study_Data.csv',sep=';')
df.head()

Unnamed: 0,article_id,product_tier,make_name,price,first_zip_digit,first_registration_year,created_date,deleted_date,search_views,detail_views,stock_days,ctr
0,350625839,Basic,Mitsubishi,16750,5,2013,24.07.18,24.08.18,3091.0,123.0,30,0.037803299902944
1,354412280,Basic,Mercedes-Benz,35950,4,2015,16.08.18,07.10.18,3283.0,223.0,52,0.06792567773378
2,349572992,Basic,Mercedes-Benz,11950,3,1998,16.07.18,05.09.18,3247.0,265.0,51,0.0816137973514013
3,350266763,Basic,Ford,1750,6,2003,20.07.18,29.10.18,1856.0,26.0,101,0.0140086206896551
4,355688985,Basic,Mercedes-Benz,26500,3,2014,28.08.18,08.09.18,490.0,20.0,12,0.0408163265306122


In [22]:
print(df.shape)

(78321, 12)


In [23]:
#check for nas
df.isna().sum()


article_id                  0
product_tier                0
make_name                   0
price                       0
first_zip_digit             0
first_registration_year     0
created_date                0
deleted_date                0
search_views               10
detail_views               10
stock_days                  0
ctr                        24
dtype: int64

In [24]:
#drop nas
df.dropna(inplace=True)

#check for nas
df.isna().sum()

article_id                 0
product_tier               0
make_name                  0
price                      0
first_zip_digit            0
first_registration_year    0
created_date               0
deleted_date               0
search_views               0
detail_views               0
stock_days                 0
ctr                        0
dtype: int64

In [25]:
#explore data - check distribution of product_tier
df['product_tier'].value_counts()

product_tier
Basic      75397
Premium     2324
Plus         576
Name: count, dtype: int64

In [26]:
#explore data - get summary statistics for each variable
df.describe()

Unnamed: 0,article_id,price,first_zip_digit,first_registration_year,search_views,detail_views,stock_days
count,78297.0,78297.0,78297.0,78297.0,78297.0,78297.0,78297.0
mean,357486400.0,15069.670358,4.631876,2011.090336,2297.91333,93.486583,35.99507
std,5076809.0,16375.598837,2.354368,6.538638,6339.52668,228.042547,32.213083
min,347232400.0,50.0,1.0,1924.0,1.0,0.0,-3.0
25%,353638700.0,5750.0,3.0,2008.0,368.0,13.0,10.0
50%,358547900.0,10909.0,5.0,2013.0,920.0,36.0,25.0
75%,361481700.0,18890.0,7.0,2015.0,2234.0,94.0,55.0
max,364704000.0,249888.0,9.0,2106.0,608754.0,13926.0,127.0


In [27]:
#check for any value above 2024 under first_registration_year
df['first_registration_year'].value_counts()

#remove the value with 2106 under that variable
df = df[df['first_registration_year']!=2106]

df.describe()



Unnamed: 0,article_id,price,first_zip_digit,first_registration_year,search_views,detail_views,stock_days
count,78296.0,78296.0,78296.0,78296.0,78296.0,78296.0,78296.0
mean,357486400.0,15069.744687,4.631846,2011.089123,2297.941236,93.487713,35.995504
std,5076839.0,16375.690205,2.354368,6.529876,6339.562356,228.043784,32.213059
min,347232400.0,50.0,1.0,1924.0,1.0,0.0,-3.0
25%,353638700.0,5750.0,3.0,2008.0,368.0,13.0,10.0
50%,358547900.0,10910.0,5.0,2013.0,920.0,36.0,25.0
75%,361481700.0,18890.0,7.0,2015.0,2234.0,94.0,55.0
max,364704000.0,249888.0,9.0,2020.0,608754.0,13926.0,127.0


In [30]:
#check for stock days that are below zero
df['stock_days'].value_counts()

#remove values that are less than 0
df = df[df['stock_days']>=0]

df.describe()

Unnamed: 0,article_id,price,first_zip_digit,first_registration_year,search_views,detail_views,stock_days
count,78205.0,78205.0,78205.0,78205.0,78205.0,78205.0,78205.0
mean,357486100.0,15072.502206,4.631187,2011.090608,2300.511412,93.591868,36.038693
std,5076904.0,16379.680595,2.354126,6.529607,6342.800041,228.155895,32.206886
min,347232400.0,50.0,1.0,1924.0,1.0,0.0,0.0
25%,353638700.0,5750.0,3.0,2008.0,369.0,13.0,10.0
50%,358547900.0,10920.0,5.0,2013.0,922.0,36.0,25.0
75%,361481700.0,18893.0,7.0,2015.0,2239.0,94.0,55.0
max,364704000.0,249888.0,9.0,2020.0,608754.0,13926.0,127.0


In [32]:
#view the data
df.head()

Unnamed: 0,article_id,product_tier,make_name,price,first_zip_digit,first_registration_year,created_date,deleted_date,search_views,detail_views,stock_days,ctr
0,350625839,Basic,Mitsubishi,16750,5,2013,24.07.18,24.08.18,3091.0,123.0,30,0.037803299902944
1,354412280,Basic,Mercedes-Benz,35950,4,2015,16.08.18,07.10.18,3283.0,223.0,52,0.06792567773378
2,349572992,Basic,Mercedes-Benz,11950,3,1998,16.07.18,05.09.18,3247.0,265.0,51,0.0816137973514013
3,350266763,Basic,Ford,1750,6,2003,20.07.18,29.10.18,1856.0,26.0,101,0.0140086206896551
4,355688985,Basic,Mercedes-Benz,26500,3,2014,28.08.18,08.09.18,490.0,20.0,12,0.0408163265306122


In [35]:
#check how many categories are under make_name
df['make_name'].value_counts()

df['product_tier'].value_counts()

product_tier
Basic      75305
Premium     2324
Plus         576
Name: count, dtype: int64

In [41]:
df.head()

Unnamed: 0,article_id,product_tier,make_name,price,first_zip_digit,first_registration_year,created_date,deleted_date,search_views,detail_views,stock_days,ctr
0,350625839,Basic,Mitsubishi,16750,5,2013,24.07.18,24.08.18,3091.0,123.0,30,0.037803299902944
1,354412280,Basic,Mercedes-Benz,35950,4,2015,16.08.18,07.10.18,3283.0,223.0,52,0.06792567773378
2,349572992,Basic,Mercedes-Benz,11950,3,1998,16.07.18,05.09.18,3247.0,265.0,51,0.0816137973514013
3,350266763,Basic,Ford,1750,6,2003,20.07.18,29.10.18,1856.0,26.0,101,0.0140086206896551
4,355688985,Basic,Mercedes-Benz,26500,3,2014,28.08.18,08.09.18,490.0,20.0,12,0.0408163265306122


In [43]:
#split the data in training and test
X = df.drop(['product_tier','article_id'],axis=1)

y = df['product_tier']

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42)

In [49]:
#apply one hot encoding for make_name using category_encoders
import category_encoders as ce

encoder = ce.OneHotEncoder(cols=['make_name'])

X_train = encoder.fit_transform(X_train)

X_train.head()

AttributeError: module 'pandas.api.types' has no attribute 'is_categorical'

In [None]:
encoder = ce.OneHotEncoder(cols=['make_name'],use_cat_names=True)

df = encoder.fit_transform(df)

df.head()

In [15]:
# split the data
X = df.drop(['product_tier','article_id'],axis=1)
y = df['product_tier']

In [None]:
#view the data
X.head()


In [None]:
#split between train and test
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42)

In [9]:
#random forest with gini
rf = RandomForestClassifier(criterion='gini',n_estimators=150,max_depth=4,n_jobs=-1)

#fit on the data
rf.fit(X_train,y_train)  

ValueError: could not convert string to float: 'Renault'
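
This `ValueError` appears because `X_train` still contains string columns: the one-hot encoding cells did not complete, so `make_name` holds values such as 'Renault', and `created_date` and `deleted_date` are strings as well. A minimal sketch of one way around it is to put the encoding inside a pipeline. The rows below are synthetic and only reuse the column names of the case-study data; dropping the two date columns is an assumption made for this example, not a recommendation from the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic rows with the case-study column names (values are made up)
toy = pd.DataFrame({
    "product_tier": ["Basic", "Basic", "Premium", "Basic", "Plus", "Premium"],
    "make_name": ["Ford", "Renault", "Audi", "Ford", "Renault", "Audi"],
    "price": [1750, 8900, 35950, 4200, 12500, 41000],
    "created_date": ["24.07.18", "16.08.18", "16.07.18", "20.07.18", "28.08.18", "01.08.18"],
    "deleted_date": ["24.08.18", "07.10.18", "05.09.18", "29.10.18", "08.09.18", "15.09.18"],
    "stock_days": [30, 52, 51, 101, 12, 45],
})

# Assumption: drop the raw date strings; stock_days already carries duration information
X = toy.drop(columns=["product_tier", "created_date", "deleted_date"])
y = toy["product_tier"]

# One-hot encode make_name and pass the numeric columns through unchanged
preprocess = ColumnTransformer(
    transformers=[("make", OneHotEncoder(handle_unknown="ignore"), ["make_name"])],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocess", preprocess),
    ("rf", RandomForestClassifier(n_estimators=50, max_depth=4, random_state=42)),
])
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

On the real data the pipeline would be fitted on `X_train` and evaluated on `X_test`, so the encoder only learns the makes present in the training split.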