<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/Cost_Sensitive_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Cost Sensitive Decision Trees**

In this notebook, we will look at how to train decision trees with cost in mind.

In [None]:
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
from numpy import where
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from numpy import mean
from sklearn.model_selection import GridSearchCV

**Create an imbalanced dataset**

In [None]:
# generate a two-feature binary dataset with a 99:1 class ratio
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=3)
# summarize class distribution
counter = Counter(y)
print(counter)

In [None]:
# scatter plot of examples by class label 
for label, _ in counter.items():
  row_ix = where(y == label)[0]
  pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label)) 
pyplot.legend()
pyplot.show()

**Create a decision tree classifier and train it**

In [None]:
model = DecisionTreeClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

In [None]:
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))

# **Weighted Decision Trees**

With decision trees, we can weight the splitting criterion so that it accounts for the class imbalance in the dataset.

Small weight = less importance, lower impact on the node purity<br>
Large weight = more importance, greater impact on the node purity
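
To make the effect of a weight on node purity concrete, here is a minimal sketch of a class-weighted Gini impurity. The helper `weighted_gini` and the example node are hypothetical and written only for illustration; this is not scikit-learn's internal implementation, just the same idea of counting each sample with its class weight.

```python
import numpy as np

def weighted_gini(y, class_weight):
    """Gini impurity of a node where each sample counts with its class weight."""
    y = np.asarray(y)
    classes = np.unique(y)
    # weighted count of each class present in the node
    weighted_counts = np.array(
        [class_weight[c] * np.sum(y == c) for c in classes], dtype=float)
    # class proportions after weighting
    proportions = weighted_counts / weighted_counts.sum()
    return 1.0 - np.sum(proportions ** 2)

# a hypothetical node holding 99 majority samples and 1 minority sample
node = np.array([0] * 99 + [1])
unweighted = weighted_gini(node, {0: 1, 1: 1})
reweighted = weighted_gini(node, {0: 1, 1: 99})
print('Unweighted Gini: %.4f' % unweighted)
print('Weighted Gini:   %.4f' % reweighted)
```

Without weights the node looks almost pure (impurity of about 0.02), so the tree has little reason to split it further. With a weight of 99 on the minority class, the same node has an impurity of 0.5, so isolating the single minority sample becomes worthwhile.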

**Define a Decision Tree Classifier model that incorporates the imbalance of the dataset**

In [None]:
model = DecisionTreeClassifier(class_weight='balanced')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
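
The `'balanced'` setting derives the weights from the data as `n_samples / (n_classes * np.bincount(y))`. As a quick check, the sketch below compares scikit-learn's `compute_class_weight` helper with that formula written out by hand. The label vector `y_demo` is a small hypothetical example that mirrors the 99:1 ratio used in this notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# a small hypothetical label vector with a 99:1 class ratio
y_demo = np.array([0] * 990 + [1] * 10)
classes = np.unique(y_demo)
# weights that class_weight='balanced' would assign to each class
balanced = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)
# the same formula written out by hand
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))
print(dict(zip(classes.tolist(), balanced.tolist())))
print(dict(zip(classes.tolist(), manual.tolist())))
```

Both lines print the same weights: roughly 0.505 for the majority class and 50 for the minority class.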

**Determine model performance**

As the score below shows, this method performs slightly better than the unweighted tree.

In [None]:
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))

# **Grid Search Weighted Decision Tree Classifiers**

Use cross-validation with grid search to find the best class weights.

In [None]:
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid search
balance = [{0:100,1:1}, {0:10,1:1}, {0:1,1:1}, {0:1,1:10}, {0:1,1:100}] 
param_grid = dict(class_weight=balance)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=cv,
scoring='roc_auc')
# execute the grid search
grid_result = grid.fit(X, y)

In [None]:
# report the best configuration
print('Best: %f using %s' % (grid_result.best_score_, grid_result.best_params_))
# report all configurations
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
# use mean_score as the loop variable so the imported numpy mean() is not overwritten
for mean_score, stdev, param in zip(means, stds, params):
  print('%f (%f) with: %r' % (mean_score, stdev, param))

**Assignment**<br>
Change the dataset and rerun the notebook.<br>
What effect do the number of features, the size of the dataset, and the class weights have on the model?
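
One possible way to organize these experiments is sketched below. The helper `evaluate_weighting` is hypothetical and is not part of the original notebook. It also uses fewer folds, a single repeat, and a fixed `random_state` on the tree so that it runs quickly and repeatably, which means its scores will not match the cells above exactly.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def evaluate_weighting(n_samples, n_features, class_weight):
    """Return the mean ROC AUC of a decision tree for one dataset configuration."""
    # n_features must be at least 2, the default number of informative features
    X, y = make_classification(n_samples=n_samples, n_features=n_features,
                               n_redundant=0, n_clusters_per_class=1,
                               weights=[0.99], flip_y=0, random_state=3)
    model = DecisionTreeClassifier(class_weight=class_weight, random_state=1)
    # lighter evaluation than the notebook's 10 splits x 3 repeats (assumption for speed)
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=1)
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv)
    return mean(scores)

# compare a few dataset sizes, feature counts, and weightings
for n_samples in (2000, 10000):
    for n_features in (2, 5):
        for class_weight in (None, 'balanced', {0: 1, 1: 10}):
            auc = evaluate_weighting(n_samples, n_features, class_weight)
            print('n_samples=%d n_features=%d class_weight=%r: %.3f'
                  % (n_samples, n_features, class_weight, auc))
```

Changing one factor at a time while holding the others fixed makes it easier to attribute a change in ROC AUC to that factor.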