# Decision Trees - Exercises
1. In the decision tree click-through prediction project, can you also tweak other hyperparameters, such as min_samples_split and class_weight? What is the highest AUC you are able to achieve?


## Importing the data and exploring the columns


In [33]:
import pandas as pd
n_rows = 300_000
df = pd.read_csv("./dataset/train.csv", nrows = n_rows)

In [34]:
# Checking the first 5 samples
print(df.head(5))

             id  click      hour    C1  banner_pos   site_id site_domain  \
0  1.000009e+18      0  14102100  1005           0  1fbe01fe    f3845767   
1  1.000017e+19      0  14102100  1005           0  1fbe01fe    f3845767   
2  1.000037e+19      0  14102100  1005           0  1fbe01fe    f3845767   
3  1.000064e+19      0  14102100  1005           0  1fbe01fe    f3845767   
4  1.000068e+19      0  14102100  1005           1  fe8cc448    9166c161   

  site_category    app_id app_domain  ... device_type device_conn_type    C14  \
0      28905ebd  ecad2386   7801e8d9  ...           1                2  15706   
1      28905ebd  ecad2386   7801e8d9  ...           1                0  15704   
2      28905ebd  ecad2386   7801e8d9  ...           1                0  15704   
3      28905ebd  ecad2386   7801e8d9  ...           1                0  15706   
4      0569f928  ecad2386   7801e8d9  ...           1                0  18993   

   C15  C16   C17  C18  C19     C20  C21  
0  320   50  

In [35]:
# Since the target value is the click column, we save it in the Y variable
Y = df['click'].values

In [36]:
# For the remaining columns, there are several columns that should be removed
# from the features (id, hour, device_id, and device_ip) as they do not contain much useful information:
X = df.drop(['click', 'id', 'hour', 'device_id', 'device_ip' ], axis = 1).values

In [37]:
# Each sample has 19 predictive attributes
print(X.shape)

(300000, 19)


In [38]:
# Since the data is ordered chronologically, we cannot randomly select the
# samples (future clicks cannot predict past clicks).
n_train = int(n_rows * 0.9) # 90 % to training set
X_train = X[:n_train]
X_test = X[n_train:]
y_train = Y[:n_train]
y_test = Y[n_train:]
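The chronological split above can be wrapped in a small helper. This is only a minimal sketch: `chronological_split` and `train_fraction` are hypothetical names introduced here, and the rows are assumed to be already sorted by time. It uses plain slicing, so it works on Python lists as well as NumPy arrays.

```python
def chronological_split(X, y, train_fraction=0.9):
    """Split time-ordered data without shuffling: earliest rows train, latest rows test."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    n_train = int(len(X) * train_fraction)
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Toy data: 10 rows, assumed to be sorted from oldest to newest
X_demo = [[i, i + 1] for i in range(10)]
y_demo = [0, 0, 1, 0, 0, 0, 1, 0, 0, 1]

X_tr, X_te, y_tr, y_te = chronological_split(X_demo, y_demo)
print(len(X_tr), len(X_te))
```

Because nothing is shuffled, every test row is newer than every training row, which mirrors how the model would be used on future clicks.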


## Performing one-hot encoding on the categorical values

In [39]:
from sklearn.preprocessing import OneHotEncoder

# This parameter is used to specify how the encoder should handle categories that were not seen during the fit process (i.e., categories that are present in the test data but not in the training data).
enc = OneHotEncoder(handle_unknown = 'ignore')
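To see what `handle_unknown = 'ignore'` does in practice, here is a small sketch on toy data (the `train` and `test` lists are made up for illustration). A category that was never seen during fitting is encoded as all zeros in that feature's columns instead of raising an error.

```python
from sklearn.preprocessing import OneHotEncoder

# Toy data: two categorical features
train = [["a", "x"], ["b", "y"]]
test = [["c", "x"]]  # "c" never appears in the first column of train

demo_enc = OneHotEncoder(handle_unknown="ignore")
demo_enc.fit(train)

# Columns are [a, b, x, y]; the unknown "c" leaves the a/b columns at zero
encoded = demo_enc.transform(test).toarray()
print(encoded)
```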

In [40]:
# Fitting the encoder on the training set
X_train_enc = enc.fit_transform(X_train)
X_train_enc[0]

<1x8204 sparse matrix of type '<class 'numpy.float64'>'
	with 19 stored elements in Compressed Sparse Row format>

* `<1x8204 sparse matrix of type '<class 'numpy.float64'>'` indicates that the
output is a 1x8204 sparse matrix with elements of type numpy.float64. Sparse matrices are used to store large matrices in which most elements are zero, as they save memory by storing only the non-zero elements.

* `with 19 stored elements in Compressed Sparse Row format` means that there are
 19 non-zero elements in the sparse matrix (one per original feature, since each feature activates exactly one one-hot column), and the matrix is stored using the Compressed Sparse Row (CSR) format. CSR is an efficient format for storing and performing operations on sparse matrices, as it stores only the non-zero values, their column indices, and one pointer per row marking where that row's entries begin.
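The CSR layout can be illustrated without any library. The sketch below is a pure-Python approximation, not SciPy's implementation; `to_csr` is a hypothetical helper that builds the three arrays CSR relies on: the non-zero values (`data`), their column positions (`indices`), and the row pointers (`indptr`).

```python
def to_csr(dense):
    """Build the three CSR arrays (data, indices, indptr) from a dense list of rows."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # the column it lives in
        indptr.append(len(data))      # where the next row starts in data
    return data, indices, indptr

dense = [
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
data, indices, indptr = to_csr(dense)
print(data)     # [1, 1, 1]
print(indices)  # [2, 0, 3]
print(indptr)   # [0, 1, 1, 3]
```

Row `i` occupies the slice `data[indptr[i]:indptr[i + 1]]`, so the empty second row costs only one repeated pointer.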

In [41]:
# Each converted sample is a sparse vector.
print(X_train_enc[0])

  (0, 2)	1.0
  (0, 6)	1.0
  (0, 188)	1.0
  (0, 2608)	1.0
  (0, 2679)	1.0
  (0, 3771)	1.0
  (0, 3885)	1.0
  (0, 3929)	1.0
  (0, 4879)	1.0
  (0, 7315)	1.0
  (0, 7319)	1.0
  (0, 7475)	1.0
  (0, 7824)	1.0
  (0, 7828)	1.0
  (0, 7869)	1.0
  (0, 7977)	1.0
  (0, 7982)	1.0
  (0, 8021)	1.0
  (0, 8189)	1.0


In [42]:
# Transforming the testing set using the one-hot encoder
X_test_enc = enc.transform(X_test)

In [43]:
# Checking the class distribution
print(sum(Y == 1), sum(Y == 0))
print(f"Clicked: {sum(Y == 1)/n_rows*100:.2f} %, "
      f"Did not click: {sum(Y == 0)/n_rows*100:.2f}%")

51211 248789
Clicked: 17.07 %, Did not click: 82.93%


## Training the model using GridSearch to find the best hyperparameters
* We'll tweak max_depth, min_samples_split, and class_weight
* We'll use the area under the ROC curve (AUC) as the classification metric since the data
is imbalanced

### Parameters to tweak:
1. max_depth: The maximum depth of the tree. The tree tends to overfit if it's too
deep, or underfit if it's too shallow.
2. min_samples_split: The minimum number of samples a node must hold to be split
further. Too small causes overfitting; too large causes underfitting. Usually set to
10, 30, or 50.
3. class_weight: The weights associated with the classes. 'balanced' weights each class inversely to its frequency, while None gives every class the same weight. The exercise suggests tweaking this together with
min_samples_split.
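For intuition on `class_weight = 'balanced'`, the scikit-learn documentation describes the weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python on a toy label list; `balanced_class_weights` is a hypothetical helper name, not a scikit-learn function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy labels: 8 non-clicks and 2 clicks
labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(labels)
print(weights)  # {0: 0.625, 1: 2.5}
```

The rare class gets the larger weight, so misclassifying a click costs more than misclassifying a non-click.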

In [44]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
# The parameters we'll tweak
parameters = {'max_depth': [12],
              'min_samples_split': [45],
              'class_weight': ['balanced', None]}

# Initializing a decision tree model with Gini impurity as the split criterion;
# the remaining hyperparameters are set through the grid search
decision_tree = DecisionTreeClassifier(criterion = 'gini')

# We'll use three-fold cross-validation (as there are enough training
# samples)
grid_search = GridSearchCV(decision_tree, parameters, n_jobs = -1, cv = 3,
                           scoring = 'roc_auc')
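As a reminder of what `criterion = 'gini'` measures, here is a minimal sketch of Gini impurity and of the size-weighted impurity of a candidate split. The helper names `gini_impurity` and `weighted_gini` are hypothetical; this illustrates the formula only, not scikit-learn's internal implementation.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k ** 2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

print(gini_impurity([0, 0, 1, 1]))    # 0.5 (evenly mixed node)
print(gini_impurity([1, 1, 1, 1]))    # 0.0 (pure node)
print(weighted_gini([0, 0], [1, 1]))  # 0.0 (perfect split)
```

A tree picks, at each node, the split with the lowest weighted impurity among the candidates it examines.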

In [45]:
# Training the model
grid_search.fit(X_train_enc, y_train)

In [46]:
# Printing the best hyperparameters
print(grid_search.best_params_)

{'class_weight': 'balanced', 'max_depth': 12, 'min_samples_split': 45}


### Best hyperparameters:
{'class_weight': 'balanced', 'max_depth': 12, 'min_samples_split': 45}

## Using the best model to predict future test cases


In [47]:
from sklearn.metrics import roc_auc_score
decision_tree_best = grid_search.best_estimator_
pos_prob = decision_tree_best.predict_proba(X_test_enc)[:, 1]

print(f"The ROC AUC on testing set is: {roc_auc_score(y_test, pos_prob):.2f}")


The ROC AUC on testing set is: 0.73


The AUC we can achieve with the optimal decision tree model is 0.73. This does not seem to be very high, but click-through involves many intricate human factors, which is why predicting it is not an easy task.

In [48]:
import numpy as np

# Baseline: a random selector that marks samples as clicks at the
# click rate observed in the training set
pos_prob = np.zeros(len(y_test))

click_index = np.random.choice(len(y_test),
                               int(len(y_test) * sum(y_train == 1) / len(y_train)),
                               replace = False)
pos_prob[click_index] = 1
print(f'The ROC AUC on testing set is: {roc_auc_score(y_test,pos_prob):.2f}')

The ROC AUC on testing set is: 0.50
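Why a selector that ignores the features lands at 0.50 becomes clearer from one common reading of ROC AUC: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The following is a minimal pure-Python sketch of that definition (the `roc_auc` helper is hypothetical and quadratic in the number of samples, so it is only meant for toy inputs).

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
print(roc_auc(y_true, [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(roc_auc(y_true, [1, 1, 1, 1]))           # 0.5 (every pair is a tie)
```

Scores that carry no information about the label rank about half of the pairs correctly, which is why 0.50 serves as the baseline to beat.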


## Using Random Forest Trees
* We'll build a Random Forest classifier to improve on the ROC AUC score that we
 got from a single tree.
* We'll tweak the individual tree parameters as well as the Random Forest parameters:
1. max_features: The number of features to consider when searching for the best split.
Typically, we use sqrt(m), where m is the number of dimensions. Other options
 include log2(m), or 20% or 50% of the original features.
2. n_estimators: The number of trees considered for majority voting. More trees
 generally give better performance, but at the cost of longer computation time.
 Usually set to 100, 200, 500...
3. In the random forest-based click-through prediction project, can you also tweak other hyperparameters, such as min_samples_split, max_features, and n_estimators, in scikit-learn? What is the highest AUC you are able to achieve?
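To get a feel for what the `max_features` options mean on this dataset, the sketch below evaluates them for the 8204 one-hot columns seen earlier. The truncation with `int` and the floor of 1 are assumptions meant to approximate how such settings are usually resolved; the exact rounding inside scikit-learn may differ between versions.

```python
import math

n_features = 8204  # width of the one-hot encoded matrix shown earlier

# Number of features each tree would examine per split under each option
candidates = {
    "sqrt": max(1, int(math.sqrt(n_features))),
    "log2": max(1, int(math.log2(n_features))),
    "20%": max(1, int(0.2 * n_features)),
    "50%": max(1, int(0.5 * n_features)),
}

for name, count in candidates.items():
    print(f"{name:>5}: {count} features per split")
```

With `sqrt`, each split looks at only about 90 of the 8204 columns, which keeps individual trees cheap and decorrelated.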

In [49]:
from sklearn.ensemble import RandomForestClassifier
# Preparing the model
random_forest = RandomForestClassifier(criterion='gini')

# Setting up the parameters to try
parameters_forest = {'max_depth': [15],
                     'min_samples_split': [45],
                     'n_estimators' : [1000],
                     'max_features' : ["sqrt"]}

# We'll use three-fold (as there are enough training samples) for
# cross-validation
grid_search_forest = GridSearchCV(random_forest, parameters_forest, n_jobs=-1, cv= 3, scoring = 'roc_auc')


In [50]:
# Training the random forest
grid_search_forest.fit(X_train_enc, y_train)
grid_search_forest.best_params_

{'max_depth': 15,
 'max_features': 'sqrt',
 'min_samples_split': 45,
 'n_estimators': 1000}

The best results we got used the following hyperparameters:
{'max_depth': 10,
 'max_features': 'sqrt',
 'min_samples_split': 45,
 'n_estimators': 500}

{'max_depth': 15,
 'max_features': 'sqrt',
 'min_samples_split': 45,
 'n_estimators': 1000}

In [51]:
from sklearn.metrics import roc_auc_score
random_forest_best = grid_search_forest.best_estimator_
pos_prob = random_forest_best.predict_proba(X_test_enc)[:, 1]

print(f"The ROC AUC on testing set is: {roc_auc_score(y_test, pos_prob):.2f}")


The ROC AUC on testing set is: 0.74


## Training Gradient Boosted Trees
* The trees are trained in succession, each one correcting the errors of the trees before it.
* Here, we need to encode the labels
* We need to specify a learning rate (a small learning rate is preferred)
* More information on the XGBClassifier: https://xgboost.readthedocs.io/en/latest/
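The idea of training trees in succession with a learning rate can be sketched in a few lines. This is a deliberately simplified illustration, not XGBoost's algorithm: it assumes a squared-error loss, a single numeric feature, and one-split trees (stumps). The names `fit_stump`, `boost`, and `mse` are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the threshold on x that best splits residuals into two constant predictions."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def boost(x, y, n_estimators=50, learning_rate=0.1):
    """Fit stumps sequentially, each on the residuals left by the ensemble so far."""
    base = sum(y) / len(y)  # start from the mean prediction
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Take only a small step toward the residuals
        predictions = [pi + learning_rate * stump(xi) for xi, pi in zip(x, predictions)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)

def mse(model, x, y):
    """Mean squared error of a fitted model on (x, y)."""
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model_few = boost(x, y, n_estimators=5)
model_many = boost(x, y, n_estimators=100)
print(mse(model_few, x, y) > mse(model_many, x, y))  # True: more rounds fit the training data better
```

A smaller learning rate makes each tree contribute less, so more trees are needed to reach the same training error; that is the trade-off behind pairing a small `learning_rate` with a large `n_estimators`.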

In [52]:
# First, we encode the label variable as integer class indices (0 and 1)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)


In [53]:
# Importing XGBoost and initializing a GBT model
import xgboost as xgb
model = xgb.XGBClassifier(learning_rate = 0.001, max_depth = 10,
                          n_estimators = 10000)

model.fit(X_train_enc, y_train_enc)

Best:
{'booster': 'gblinear', 'learning_rate': 0.001, 'n_estimators': 500}

In [54]:
# Predicting

pos_prob = model.predict_proba(X_test_enc)[:, 1]
print(f'The ROC AUC on testing set is: {roc_auc_score(y_test, pos_prob):.3f}')

The ROC AUC on testing set is: 0.767
