# Tree-Based Models
In this Jupyter notebook, I explore the functionality scikit-learn offers for tree-based models: decision trees, bagging, random forests, boosting, and hyperparameter tuning.

In [None]:
## links to the datasets
auto_mpg = "https://assets.datacamp.com/production/repositories/1796/datasets/3781d588cf7b04b1e376c7e9dda489b3e6c7465b/auto.csv"
bike_sharing_demand = "https://assets.datacamp.com/production/repositories/1796/datasets/594538f54a854b322d6e4c8031f3f31bc522d3e5/bikes.csv"
breast_concer = 'https://assets.datacamp.com/production/repositories/1796/datasets/0eb6987cb9633e4d6aa6cfd11e00993d2387caa4/wbc.csv'


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Load the breast cancer dataset; iloc[:, :-1] drops the last column of the CSV
df = pd.read_csv(breast_concer).iloc[:, :-1]
print(df)

In [None]:
from sklearn.model_selection import train_test_split

# Features: every column except the target
X = df.drop(['diagnosis'], axis=1)
# Encode the target: malignant (M) -> 1, benign (B) -> 0
diagnosis_mapper = {"M": 1, "B": 0}
y = df['diagnosis'].map(diagnosis_mapper)

In [None]:
X = X.values
y = y.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15)
print(X_train.shape, y_train.shape)



In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

max_depth = 6 
random_state = 1

tree = DecisionTreeClassifier(max_depth=max_depth, random_state=random_state)
tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

In [None]:
# criterion selects the impurity measure used to evaluate candidate splits (default: 'gini')
tree_2 = DecisionTreeClassifier(max_depth=8, random_state=random_state, criterion='entropy')
tree_2.fit(X_train, y_train)
# note: the two trees also differ in max_depth (8 vs 6), so this is not a pure criterion comparison
print(tree_2.score(X_test, y_test)) # using entropy
print(tree.score(X_test, y_test)) # using gini index
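
To make the two criteria concrete, here is a minimal sketch that computes the Gini impurity and the entropy of a node directly from its class labels. The helper names `class_proportions`, `gini` and `entropy` are illustrative and not part of scikit-learn.

```python
import numpy as np

def class_proportions(labels):
    """Return the proportion of each class present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    p = class_proportions(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    p = class_proportions(labels)
    return -np.sum(p * np.log2(p))

pure_node = np.array([1, 1, 1, 1])    # a single class
mixed_node = np.array([0, 0, 1, 1])   # two perfectly balanced classes

print("pure node: ", gini(pure_node), entropy(pure_node))
print("mixed node:", gini(mixed_node), entropy(mixed_node))
```

Both measures are zero for a pure node and maximal for a balanced one (0.5 for Gini and 1 bit for entropy in the binary case), which is why the two criteria often lead to similar trees.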

In [None]:
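# less_complex_tree_best is defined in the grid-search cell of the "Random Forests" section below; run that cell first
print(less_complex_tree_best.score(X_test, y_test))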

## Bagging: Bootstrap Aggregation
Bagging is an ensemble learning method. We train several instances of the same machine learning algorithm and, instead of feeding the whole dataset to a single instance, we give each instance a random sample drawn with replacement. Because a single example can be picked multiple times, a number of examples are generally left out of each sample. These left-out examples are known as the "out-of-bag" (OOB) samples. Since a given instance never saw them during training, they act as a held-out test set for that instance. Aggregating these out-of-bag evaluations across all instances yields the OOB score, which can be used to estimate the ensemble's performance without a separate validation set.
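
As a rough illustration of the sampling step, the sketch below draws one bootstrap sample with NumPy and counts how many examples end up out of bag. The sample size and the seed are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000
indices = np.arange(n_samples)

# Bootstrap sample: n_samples indices drawn with replacement
bootstrap_idx = rng.choice(indices, size=n_samples, replace=True)

# Out-of-bag indices: the ones that were never drawn
oob_idx = np.setdiff1d(indices, bootstrap_idx)

oob_fraction = len(oob_idx) / n_samples
# Probability that a given example is never drawn in n_samples draws
expected_fraction = (1 - 1 / n_samples) ** n_samples

print("unique in-bag examples:", len(np.unique(bootstrap_idx)))
print("out-of-bag examples:", len(oob_idx))
print(f"OOB fraction: {oob_fraction:.3f}, expected: {expected_fraction:.3f}")
```

For large datasets the expected out-of-bag fraction approaches $1/e \approx 0.368$, so each learner leaves roughly a third of the data available for evaluation.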


In [None]:
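# less_complex_tree_best has the lowest generalization error so far, so let's experiment with it
# (it, seed and num_folds are defined in the grid-search cell of the "Random Forests" section below; run that cell first)
from sklearn.ensemble import BaggingClassifier
# the base estimator is passed positionally: its keyword is `estimator` in scikit-learn >= 1.2 and `base_estimator` before that
# the remaining hyperparameters could be tuned with GridSearchCV (see the commented lines below)
bag_c = BaggingClassifier(less_complex_tree_best, oob_score=True, random_state=seed, n_estimators=100)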

# bag_c_params = {"n_estimators": [100]}
# bag_c_best = GridSearchCV(bag_c, param_grid=bag_c_params, cv=num_folds, n_jobs=-1)
bag_c.fit(X_train, y_train)
# print(bag_c.best_params_)

y_pred = bag_c.predict(X_test)
print(accuracy_score(y_test, y_pred))

## Random Forests
Bagging is not model-specific: it can be used with any base estimator. A random forest can be seen as a slightly more elaborate form of bagging with decision trees. It adds randomness by fixing a number ***d*** of features: at each node, only a random subset of ***d*** features is considered when choosing the split.
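
The effect of ***d*** can be sketched with the `max_features` parameter of `RandomForestClassifier`. With `max_features=None` every split considers all features, which amounts to plain bagging of trees; with `max_features="sqrt"` each split considers a random subset. The synthetic dataset and the `_demo` variable names are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with 16 features, so that sqrt(16) = 4 features per split
X_demo, y_demo = make_classification(n_samples=500, n_features=16,
                                     n_informative=6, random_state=0)
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Every split considers all 16 features: equivalent to bagging decision trees
bagged_trees = RandomForestClassifier(n_estimators=100, max_features=None,
                                      random_state=0)
# Every split considers a random subset of d = 4 features
random_forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                       random_state=0)

for name, model in [("bagged trees", bagged_trees), ("random forest", random_forest)]:
    model.fit(X_demo_train, y_demo_train)
    d = model.estimators_[0].max_features_  # features considered at each split
    print(f"{name}: d = {d}, test accuracy = {model.score(X_demo_test, y_demo_test):.3f}")
```

Which variant scores better depends on the data; the point of the sketch is only that the two models differ in how many features each split may look at.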

In [None]:
# Compare the training score with the cross-validated score: a large gap between the two indicates overfitting
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

params = {"max_depth": [2,3,4,5,6,8], "min_samples_leaf": [0.02, 0.04, 0.05, 0.1, 0.12, 0.15], "max_features":["log2", "sqrt"]}

num_folds = 6
tree = DecisionTreeClassifier(random_state=3)
best_tree = GridSearchCV(tree, param_grid=params, cv=num_folds)

best_tree.fit(X_train, y_train)

print(best_tree.best_params_)

y_train_pred = best_tree.predict(X_train)
print("First model performance: ")
print(best_tree.score(X_train, y_train))
print(cross_val_score(best_tree, X_train, y_train, cv=num_folds).mean())

# a significant gap between the training score and the cross-validation score is an indicator of overfitting
# let's try a slightly less complex model

seed = 3

less_complex_tree = DecisionTreeClassifier(random_state=seed)
less_complex_params = {"min_samples_leaf": [0.02, 0.03,0.04, 0.05, 0.08, 0.09, 0.1, 0.12]}
less_complex_tree_best = GridSearchCV(less_complex_tree, param_grid=less_complex_params)
less_complex_tree_best.fit(X_train, y_train)
print("Second model performance: ")

print(less_complex_tree_best.score(X_train, y_train))
print(cross_val_score(less_complex_tree_best, X_train, y_train, cv=num_folds).mean())

# if both scores are higher than the first model's, the hyperparameter tuning improved the model on training and cross-validation alike

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf1 = RandomForestClassifier(random_state=seed, n_estimators=250, oob_score=True)
rf1.fit(X_train, y_train)
print(rf1.oob_score_)
y_pred = rf1.predict(X_test)
print(accuracy_score(y_test, y_pred))

# RandomForestClassifier exposes the importance of each feature through feature_importances_

importances = pd.Series(data=rf1.feature_importances_,
                        index= df.drop(['diagnosis'], axis=1).columns).sort_values()
print(importances)

## Boosting
Unlike bagging, where the learners are trained independently, boosting is an ensemble learning method in which weak learners are trained sequentially, each one trying to correct the errors of its predecessor. The two main types of boosting covered here are AdaBoost and gradient boosting.
### AdaBoost
The name is an abbreviation of ***Adaptive Boosting***. The first learner is trained on the initial data. The next learner sees a reweighted version of the data in which the instances misclassified by its predecessor have larger weights. The same process continues for all individual learners. Additionally, each learner is assigned a coefficient $\alpha_i$ based on its training error, which determines its contribution to the final result. Two hyperparameters matter most:
* $\gamma$, the learning rate, with $0 < \gamma \leq 1$: each coefficient is shrunk as $\alpha_i \leftarrow \gamma \cdot \alpha_i$.
* $n$: the number of learners.

There should be a balance between the two hyperparameters: as one increases, the other should generally be decreased.
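
One round of the reweighting can be worked through by hand. The sketch below follows the classic discrete AdaBoost update for labels in $\{-1, +1\}$; scikit-learn's implementation differs in its details, and the toy labels and predictions are made up for this example.

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])   # labels in {-1, +1}
y_hat = np.array([1, -1, -1, -1, 1])   # weak learner's predictions (one mistake, at index 1)
gamma = 0.5                            # learning rate, 0 < gamma <= 1

# Every instance starts with the same weight
weights = np.full(len(y_true), 1 / len(y_true))

# Weighted training error of the current learner
misclassified = y_hat != y_true
error = weights[misclassified].sum()

# Learner coefficient, shrunk by the learning rate
alpha = gamma * 0.5 * np.log((1 - error) / error)

# Increase the weights of misclassified instances, decrease the others, renormalize
weights = weights * np.exp(-alpha * y_true * y_hat)
weights = weights / weights.sum()

print("error:", error)
print("alpha:", alpha)
print("updated weights:", weights)
```

After the update, the single misclassified instance carries the largest weight, so the next learner is pushed to get it right. A smaller $\gamma$ makes each update gentler, which is why it is usually paired with a larger number of learners.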

In [None]:
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Import AdaBoostClassifier
from sklearn.ensemble import AdaBoostClassifier

# Instantiate dt
dt = DecisionTreeClassifier(max_depth=2, random_state=1)

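# Instantiate ada; the base estimator is passed positionally because its keyword
# was renamed from `base_estimator` to `estimator` in scikit-learn 1.2
ada = AdaBoostClassifier(dt, n_estimators=180, random_state=1)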

ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))
ada_y_pred = ada.predict(X_test)
print(confusion_matrix(y_test, ada_y_pred))

In [None]:
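# class probabilities for each test instance
ada_y_pred_prob = ada.predict_proba(X_test)
ada_y_pred = ada.predict(X_test).reshape(-1, 1)
# probability of the predicted class, i.e. the model's confidence in its prediction
pred_perc = (np.amax(ada_y_pred_prob, axis=1)).reshape(-1, 1)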

arrays = [pred_perc, ada_y_pred.astype(int), y_test.reshape(-1, 1).astype(int)]

# for a in arrays:
#     print(a.shape)

final_array = np.concatenate(arrays, axis=1)

# print(final_array)
pred_res_df = pd.DataFrame(final_array, columns=['percentage', 'prediction', 'label'])

print(pred_res_df.head(10))
# let's consider the probabilities associated with misclassified instances
print(pred_res_df[pred_res_df['prediction'] != pred_res_df['label']])


### Gradient Boosting
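Gradient boosting is known to be a powerful ensemble learning technique. Unlike adaptive boosting, it does not modify the weights associated with the training samples. Instead, each estimator is trained on the residual errors of its predecessors. Assuming $n$ estimators, the first estimator is trained on the training data and produces predictions $\hat{y}_1$, from which the residual errors are calculated as $r_1 = y - \hat{y}_1$. The second estimator is not fed $y$ but $r_1$. In general, estimator $i + 1$ is trained on $r_i$, its predictions are denoted $\hat{y}_{i+1}$, and the residuals are updated as $r_{i+1} = r_i - \gamma \cdot \hat{y}_{i+1}$, where $0 < \gamma \leq 1$ is the shrinkage (learning rate). The final prediction is generally calculated as $\hat{y}_1 + \gamma \cdot \sum_{i=2}^{n} \hat{y}_i$. \
In scikit-learn, gradient boosting uses only CART (regression trees) as its base estimator.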
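
The residual-fitting loop can be sketched by hand with `DecisionTreeRegressor`. This is a simplified illustration for squared error on synthetic data, not scikit-learn's implementation; it applies the shrinkage $\gamma$ to every residual update so that the residuals stay consistent with the final prediction formula. All names ending in `_demo` are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

gamma = 0.1          # shrinkage (learning rate)
n_estimators = 50

# The first estimator is trained on the targets themselves
first_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)
prediction = first_tree.predict(X_demo)
residual = y_demo - prediction                  # r_1 = y - y_hat_1
mse_first = np.mean(residual ** 2)

trees = [first_tree]
for _ in range(n_estimators - 1):
    # Each subsequent estimator is trained on the current residuals
    stage_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residual)
    step = gamma * stage_tree.predict(X_demo)
    prediction = prediction + step              # y_hat_1 + gamma * sum of later predictions
    residual = residual - step                  # shrunk residual update
    trees.append(stage_tree)

mse_final = np.mean((y_demo - prediction) ** 2)
print(f"training MSE after 1 tree: {mse_first:.4f}")
print(f"training MSE after {n_estimators} trees: {mse_final:.4f}")
```

The training error can only go down as trees are added, so in practice the number of trees and $\gamma$ are chosen on held-out data rather than on the training set.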

In [None]:
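# Import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Instantiate gb
gb = GradientBoostingRegressor(max_depth=4, # the maximum depth of a single tree inside the model
            n_estimators=200, # the number of trees
            random_state=2) # ensures reproducibility
# y_train holds binary labels here, so this regressor only illustrates the API;
# GradientBoostingClassifier would be the natural choice for this dataset
gb.fit(X_train, y_train)
gb_y_pred = gb.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, gb_y_pred))) # test RMSE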

### Stochastic GB
Gradient boosting is used mainly with trees. However, each tree uses an exhaustive search over all features and split points to determine the best split, so successive trees may end up using the same features and split points, potentially hurting performance. Stochastic Gradient Boosting (SGB) addresses this issue by injecting additional randomness into gradient boosting in two ways:
1. training each model on a subset of the training samples
2. limiting the number of features a tree can use to split its nodes.


In [None]:
# Import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Instantiate sgbr
sgbr = GradientBoostingRegressor(max_depth=4, # a single tree's maximum depth
            subsample=0.9, # the fraction of training samples each tree is fit on
            max_features=0.75, # the fraction of features considered at each split
            n_estimators=200, # the number of trees
            random_state=2) # ensures reproducibility
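
The regressor above is only instantiated. As a hedged sketch of how it might be fit and evaluated, the example below trains a similarly configured model on synthetic regression data; the dataset and the `sgb_demo` and `_reg` names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data with 8 features
X_reg, y_reg = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
    X_reg, y_reg, test_size=0.25, random_state=0)

sgb_demo = GradientBoostingRegressor(max_depth=4,
                                     subsample=0.9,      # each tree sees 90% of the rows
                                     max_features=0.75,  # each split sees 75% of the features
                                     n_estimators=100,
                                     random_state=2)
sgb_demo.fit(X_reg_train, y_reg_train)

rmse = np.sqrt(mean_squared_error(y_reg_test, sgb_demo.predict(X_reg_test)))
print("test RMSE:", rmse)
print("features considered per split:", sgb_demo.max_features_)
# With subsample < 1, the rows left out of each tree give an out-of-bag improvement per stage
print("OOB improvements recorded:", sgb_demo.oob_improvement_.shape)
```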

## Hyperparameter Tuning
There is not much theory behind hyperparameter tuning. It helps to review the hyperparameters each model exposes (for example with `get_params()`) before tuning. There are multiple search strategies to consider; scikit-learn covers two of the most common ones: exhaustive grid search (`GridSearchCV`) and randomized search (`RandomizedSearchCV`).
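
The cells in this section rely on grid search. As a complement, here is a minimal sketch of the randomized alternative, `RandomizedSearchCV`, which samples a fixed number of candidates from distributions instead of trying every combination. The synthetic data, the chosen ranges and the `_demo` names are assumptions made for this example.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=12, random_state=0)

param_distributions = {
    "max_depth": randint(2, 9),               # integers from 2 to 8
    "min_samples_leaf": uniform(0.02, 0.13),  # floats in [0.02, 0.15]
    "max_features": ["log2", "sqrt"],         # sampled uniformly from the list
}

random_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=20,            # number of sampled candidates
    scoring="roc_auc",
    cv=5,
    random_state=0,
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
print(random_search.best_score_)
```

The cost is controlled by `n_iter` rather than by the size of the grid, which makes randomized search attractive when many hyperparameters are tuned at once.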

In [None]:
# let's consider a tree classifier

params = {"max_depth": [2,3,4,5,6,8], "min_samples_leaf": [0.02, 0.04, 0.05, 0.1, 0.12, 0.15], "max_features":["log2", "sqrt"]}

tree_basic = DecisionTreeClassifier()
# no need to fit tree_basic beforehand: GridSearchCV clones and fits the estimator itself
tree_searcher = GridSearchCV(estimator=tree_basic, # the model to tune, compulsory parameter
                                param_grid=params, # the hyperparameter values to try
                                scoring='roc_auc', # the metric used to select the best model
                                cv = num_folds, # number of folds in the cross-validation estimation
                                n_jobs=-1, # use every processor available
                                refit=True) # refit the best model on the whole training set
tree_searcher.fit(X_train, y_train)
tree_best = tree_searcher.best_estimator_
print(tree_best.score(X_test, y_test)) # accuracy on the test set (score() does not use roc_auc)

In [None]:
# let's try to tune a RandomForest model
rf = RandomForestClassifier()
print(rf.get_params())