# <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> CS1090A Introduction to Data Science 

# Lab 9: Decision Trees

**Harvard University**<br/>
**Fall 2024**<br/>
**Instructors**: Pavlos Protopapas and Natesh Pillai<br/>
<hr style='height:2px'>

In [None]:
#RUN THIS CELL 
import requests
from IPython.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

Table of Contents:
- A quick review of decision trees
- `DecisionTreeClassifier`
- Tuning a single decision tree (`max_depth` & `criterion`)
- Visualizing a decision tree (with `plot_tree`)
- Pruning
- Feature Importance

---------

#### The Idea: Decision Trees are just flowcharts and are interpretable!

<img src="fig/flowchart.png" alt="how to fix anything" width="50%"/>


It turns out that simple flow charts can be formulated as mathematical models for classification and these models have the properties we desire:
 - interpretable by humans 
 - have sufficiently complex decision boundaries 
 - the decision boundaries are locally linear, so each component of the decision boundary is simple to describe mathematically. 

----------

#### Let's review some theory.

How do we build decision trees? We use a greedy approach:
 1. Start with an empty decision tree (undivided feature space) 
 2. Choose the ‘optimal’ predictor on which to split and choose the ‘optimal’ threshold value for splitting by applying a **splitting criterion (1)**
 3. Recurse on each new node until a **stopping condition (2)** is met
 
For classification, we label each region in the model with the label of the class to which the majority of the points within the region belong. 
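
The greedy step above can be sketched in a few lines. This is only an illustration, not how scikit-learn implements it: the helper names `gini` and `best_split` and the toy arrays are made up for this example, and Gini impurity is assumed as the splitting criterion.

```python
import numpy as np

def gini(y):
    """Gini impurity of a 1-D array of class labels."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature index, threshold, weighted impurity) of the best single split."""
    n, d = X.shape
    best = (None, None, gini(y))  # baseline: impurity of the undivided region
    for j in range(d):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values
        for t in (values[:-1] + values[1:]) / 2:
            left = X[:, j] <= t
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / n
            if score < best[2]:
                best = (j, t, score)
    return best

# Hypothetical toy data: feature 0 separates the classes, feature 1 does not
X_toy = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])
split_feature, split_threshold, split_impurity = best_split(X_toy, y_toy)
print(split_feature, split_threshold, split_impurity)
```

A full tree builder would call `best_split` again on the left and right subsets until a stopping condition is met.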

#### So we need a (1) splitting criterion and a (2) stopping condition:

  #### (1) Splitting criterion 
<img src="fig/split1.png" alt="split1" width="70%"/>

---

<img src="fig/classification error.png" alt="classification error"/>

---
<img src="fig/split2.png" alt="split2" width="70%"/>

<img src="fig/tree_loss.png" alt="tree_adj"/>
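
For reference, here is a small sketch of three common impurity measures, assuming the figures above use the standard definitions of classification error, Gini index, and entropy. The function names are chosen for this example only.

```python
import numpy as np

def class_proportions(y):
    """Fraction of samples in each class that is present in y."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def classification_error(y):
    """1 minus the proportion of the majority class."""
    return 1.0 - class_proportions(y).max()

def gini_index(y):
    """1 minus the sum of squared class proportions."""
    p = class_proportions(y)
    return 1.0 - np.sum(p ** 2)

def entropy_bits(y):
    """Entropy in bits: sum of p * log2(1 / p) over the classes present."""
    p = class_proportions(y)
    return float(np.sum(p * np.log2(1.0 / p)))

pure_region = np.array([1, 1, 1, 1])    # all one class
mixed_region = np.array([0, 0, 1, 1])   # evenly split between two classes

for name, f in [("error", classification_error),
                ("gini", gini_index),
                ("entropy", entropy_bits)]:
    print(f"{name:8s} pure={f(pure_region):.2f} mixed={f(mixed_region):.2f}")
```

All three measures are zero for a pure region and largest for an evenly mixed one, which is why a lower value after a split indicates a better split.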

#### (2) Stopping condition

If we don’t terminate the decision tree learning algorithm manually, the tree can keep growing until each region defined by the model contains exactly one training point (at which point the model attains 100% training accuracy). **Never stopping while building a deeper and deeper tree = 100% training accuracy. What will your test accuracy be? What can we do to fix this?**

To prevent **overfitting**, we could:
- Stop the algorithm at a particular depth. (=**not too deep**)
- Don't split a region if all instances in the region belong to the same class. (=**stop when subtree is pure**)
- Don't split a region if the number of instances in a sub-region would fall below a pre-defined threshold (`min_samples_leaf`). (=**not too specific/small subtree**)
- Don't use too many splits in the tree. (=**not too many splits / not too complex global tree**)
- Be content with less than 100% accuracy on the training set.
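
As a rough sketch, several of these stopping conditions map onto `DecisionTreeClassifier` hyperparameters. The synthetic data from `make_classification` and the specific limits below are arbitrary choices for illustration, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data used only for this illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# No limits: the tree grows until every leaf is pure
unrestricted = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Limits on depth, leaf size, and total number of leaves
restricted = DecisionTreeClassifier(
    max_depth=4,          # not too deep
    min_samples_leaf=10,  # no tiny leaves
    max_leaf_nodes=12,    # not too many splits overall
    random_state=0,
).fit(X_demo, y_demo)

for name, model in [("unrestricted", unrestricted), ("restricted", restricted)]:
    print(f"{name:12s} depth={model.get_depth():2d} "
          f"leaves={model.get_n_leaves():3d} "
          f"train acc={model.score(X_demo, y_demo):.3f}")
```

The restricted tree gives up some training accuracy in exchange for a much smaller model.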

-------------

#### Done with theory, let's get started

In [None]:
import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
import sklearn.metrics as metrics
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn import tree
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.metrics import confusion_matrix
from sklearn import datasets

#new model objects
from sklearn.tree import DecisionTreeClassifier


pd.set_option('display.width', 1500)
pd.set_option('display.max_columns', 100)

np.random.seed(42)

-------------

# Decision Tree Spam Classifier

We will be working with a spam email dataset. The dataset has 57 predictors and a response variable called `spam` that indicates whether an email is spam or not spam. The goal is to create a classifier that acts as a spam filter.

In [None]:
spam_df = pd.read_csv('data/spam.csv')
display(spam_df.head())

The predictors are all quantitative. They represent certain features of an email, like the frequency of the word 'discount'. We will use the binary `spam` variable in the final column as our response for classification.

Link to description : https://archive.ics.uci.edu/ml/datasets/spambase

### Split data into train and test

In [None]:
# Split spam_df into train and test data with a random seed of 0, stratified on the response
data_train, data_test = train_test_split(spam_df, random_state=0, test_size=.2, stratify=spam_df.spam)

# Split predictor and response columns
X_train, y_train = data_train.drop(['spam'], axis=1), data_train['spam']
X_test , y_test  = data_test.drop(['spam'] , axis=1), data_test['spam']

print("Shape of Training Set :", data_train.shape)
print("Shape of Testing Set :" , data_test.shape)

In [None]:
X_train.head()

We can check that the proportion of spam cases is roughly the same in the training and test sets.


In [None]:
#Check Percentage of Spam in Train and Test Set
pct_spam_tr = 100*y_train.mean()
pct_spam_te = 100*y_test.mean()
                                                  
print(f"Percentage of Spam in Training Set \t : {pct_spam_tr:0.2f}%")
print(f"Percentage of Spam in Testing Set \t : {pct_spam_te:0.2f}%")

-----------

# Fitting an Optimal Single Decision Tree (by Depth) :

Here, for each candidate `max_depth` and `criterion` combination, we fit a single tree to our spam training data using 5-fold cross validation.

We store the CV accuracy scores in a DataFrame along with the hyperparameter settings that generated them.

In [None]:
#Find optimal depth of trees

df = pd.DataFrame(columns=['criterion', 'depth', 'all_cv', 'mean_cv'])

criterion = ['gini', 'entropy']

first_depth = 2
final_depth = 30
step = 2

results = []
for cur_criterion in criterion:      
    for max_depth in range(first_depth, final_depth+1, step):
        dt = DecisionTreeClassifier(criterion=cur_criterion , max_depth=max_depth)
        scores = cross_val_score(estimator=dt, X=X_train, y=y_train, cv=5, n_jobs=-1)
        
        cur_results = {'criterion': cur_criterion,
                      'depth': max_depth,
                      'all_cv': scores,
                      'mean_cv': scores.mean()}
        results.append(cur_results)
df = pd.DataFrame(results)

In [None]:
display(df)

We can then visualize the mean validation accuracy for each hyperparameter setting.

In [None]:
plt.figure(figsize=(7, 4))

plt.plot(df[df.criterion == 'gini'].depth,
         df[df.criterion == 'gini'].mean_cv, 'b-', marker='o', alpha = 0.6, label='Gini')
plt.plot(df[df.criterion == 'entropy'].depth,
         df[df.criterion == 'entropy'].mean_cv, 'r-', marker='o', alpha = 0.6, label='Entropy')
plt.ylabel("Cross Validation Accuracy")
plt.xlabel("Maximum Depth")
plt.title('Variation of Accuracy with Depth - Simple Decision Tree')
plt.legend()
plt.grid(alpha = 0.3)

plt.tight_layout()
plt.show()

### Let's visualize a plot with the Confidence Bands!

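Also, if we wanted to show **confidence bands around these results**, how would we? It's as simple as combining the standard deviation of the fold scores, from ```scores.std()```, with ```plt.fill_between()```. The bands below span one standard deviation on either side of the mean CV accuracy.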

In [None]:
df_gini = df[df['criterion'] == 'gini']
df_entropy = df[df['criterion'] == 'entropy']

x_gini = df_gini['depth'].values.astype(float)
y_gini = df_gini['mean_cv'].values.astype(float)

x_entropy = df_entropy['depth'].values.astype(float)
y_entropy = df_entropy['mean_cv'].values.astype(float)

stds_gini = np.array([ np.std(scores) for scores in df_gini['all_cv']], dtype = float) 
stds_entropy = np.array([ np.std(scores) for scores in df_entropy['all_cv']], dtype = float)

In [None]:
fig, axes = plt.subplots(2, 1, figsize=(8, 5))

#Plot
axes[0].fill_between(df.loc[df.criterion == 'gini'].depth, y_gini + stds_gini, 
                     y_gini - stds_gini, alpha=0.2)
axes[0].plot(x_gini, y_gini, 'b-', marker='o')
axes[0].set_ylabel("Cross Validation Accuracy")
axes[0].set_title('Variation of Accuracy with Depth - Single Decision Tree')
axes[0].legend(['std','Gini'])
axes[0].grid(alpha = 0.3)

axes[1].fill_between(x_entropy, y_entropy + stds_entropy, 
                     y_entropy - stds_entropy, 
                     color = 'r', alpha=0.2)
axes[1].plot(x_entropy, y_entropy, 'r-', marker='o')
axes[1].set_ylabel("Cross Validation Accuracy")
axes[1].set_xlabel("Maximum Depth")
axes[1].legend(['std','Entropy'])
axes[1].grid(alpha = 0.3)

plt.tight_layout()
plt.show()

### Let's visualize a boxplot! (**Gini impurity** only)

If we want to display the scores as a boxplot, we pass the per-depth CV scores stored in the DataFrame to ```plt.boxplot(...)```.

In [None]:
display(df_gini.head())

In [None]:
ds = range(first_depth, final_depth + 1, step)

plt.figure(figsize=(7,4))
plt.boxplot([df_gini.loc[df_gini.depth==d, 'all_cv'].values[0] for d in ds])
plt.scatter(range(1,len(ds)+1), df_gini.mean_cv, color='red', alpha=0.3, label='Mean CV Acc')
plt.xticks(range(1,len(ds)+1), labels=ds)
plt.ylabel("cross-validation accuracy")
plt.xlabel("max depth")
plt.title("Spam Classifier Trees (Gini)")
plt.grid(alpha = 0.3)
plt.legend()
plt.show()

**Question:** Which depth are you going to pick?

### Let's extract the best_depth value from these two dataframes, *df_gini* and *df_entropy*.

We need to create a new variable, *best_depth*, for each DataFrame.

How do we get the index of the maximum value in an array?

```hint: np.argmax(target_array)```
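
One detail worth keeping in mind: `np.argmax` on an array returns a *position*, not a DataFrame row label. The sketch below uses a made-up results table (`toy`) whose row labels do not start at 0, to show why the position is paired with `.iloc`.

```python
import numpy as np
import pandas as pd

# Hypothetical results table; the row labels deliberately do not start at 0
toy = pd.DataFrame(
    {"depth": [2, 4, 6], "mean_cv": [0.80, 0.90, 0.85]},
    index=[15, 16, 17],
)

best_pos = int(np.argmax(toy["mean_cv"].to_numpy()))  # position of the largest score
toy_best_depth = toy["depth"].iloc[best_pos]          # look up by position, not by label
print(best_pos, toy_best_depth)
```

Here the best score sits at position 1, so `.iloc[1]` picks depth 4, whereas `.loc[1]` would fail because no row is labelled 1.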

In [None]:
# What does this do?

mean_CV_acc_gini = df_gini['mean_cv']
mean_CV_acc_entropy = df_entropy['mean_cv']

best_idx_gini = np.argmax(mean_CV_acc_gini)
best_idx_entropy = np.argmax(mean_CV_acc_entropy)

best_depth_gini = df_gini['depth'].iloc[best_idx_gini]
best_depth_entropy = df_entropy['depth'].iloc[best_idx_entropy]

print('The best depth based on Gini impurity was found to be: ', best_depth_gini)
print('The best depth based on Entropy was found to be: ', best_depth_entropy)

In [None]:
#Evaluate the performance at the best depth
model_tree_gini = DecisionTreeClassifier(max_depth=best_depth_gini, criterion = 'gini')
model_tree_entropy = DecisionTreeClassifier(max_depth=best_depth_entropy, criterion ='entropy')

model_tree_gini.fit(X_train, y_train)
model_tree_entropy.fit(X_train, y_train)

#Check Accuracy of Spam Detection in Train and Test Set (Gini Impurity)
acc_trees_train_gini = accuracy_score(y_train, model_tree_gini.predict(X_train))
acc_trees_test_gini  = accuracy_score(y_test,  model_tree_gini.predict(X_test))

print("================ [Gini Impurity] ================")
print("Simple Decision Trees: Accuracy, Training Set \t : {:.2%}".format(acc_trees_train_gini))
print("Simple Decision Trees: Accuracy, Testing Set \t : {:.2%}".format(acc_trees_test_gini))

#Check Accuracy of Spam Detection in Train and Test Set (Entropy)
acc_trees_train_entropy = accuracy_score(y_train, model_tree_entropy.predict(X_train))
acc_trees_test_entropy = accuracy_score(y_test,  model_tree_entropy.predict(X_test))

print("\n================ [Entropy] ================")
print("Simple Decision Trees: Accuracy, Training Set \t : {:.2%}".format(acc_trees_train_entropy))
print("Simple Decision Trees: Accuracy, Testing Set \t : {:.2%}".format(acc_trees_test_entropy))

### Let's visualize a confusion matrix with ```ConfusionMatrixDisplay```

#### How to visualize the classification result using a Confusion matrix? ####

<img src="fig/confusion_matrix.png" alt="classification error" width="300"/>

<img src="fig/confusion_matrix2.png" alt="classification error" width="400"/>

*source: wikipedia*

We can use scikit-learn's **ConfusionMatrixDisplay.from_estimator**, which replaces the older `plot_confusion_matrix` function (removed in scikit-learn 1.2).

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,8))
ConfusionMatrixDisplay.from_estimator(model_tree_gini, X_test, y_test, cmap=plt.cm.Blues, ax = axes[0]);
ConfusionMatrixDisplay.from_estimator(model_tree_entropy, X_test, y_test, cmap=plt.cm.Blues, ax = axes[1])
axes[0].set_title('Simple Decision Tree - Gini')
axes[1].set_title('Simple Decision Tree - Entropy')
# plt.rc('font', size=18)
plt.tight_layout()

plt.show()

### How to visualize a Decision Tree with ```sklearn.tree.plot_tree```

*Question:* Do you think this tree is interpretable? What do you think about the maximal depth of the tree?

<!-- - Let's look at the resulting text ```decision_tree.dot``` -->

<!-- - Let's convert our (hard to read) written decision tree (```decision_tree.dot```) into an intuitive image file format: ```image_tree.png```
- <span style="color:red">**NOTE:**</span> You might need to install the ```pydot``` package by typing the following command in your terminal: ```pip install pydot``` or you can install from within the jupyter notebook by running the following cell: ```! pip install pydot``` -->

In [None]:
plt.figure(figsize=(15, 8))
gini_tree = tree.plot_tree(model_tree_gini, max_depth = 1);

In [None]:
plt.figure(figsize=(15, 8))
entropy_tree = tree.plot_tree(model_tree_entropy, max_depth = 5);

### Pruning

Limiting how far a tree can grow using hyperparameters like `max_depth` or `max_leaf_nodes` can help prevent overfitting, but they can lead to trees with high bias that can underfit the training data.

Another way to address overfitting is to train a deep tree and then prune it back. This is done using the `ccp_alpha` hyperparameter, the cost complexity parameter. Cost-complexity pruning scores a tree by its total leaf impurity plus `ccp_alpha` times its number of leaves, so `ccp_alpha` sets the price paid for each additional leaf. This is analogous to the regularization (hyper)parameter we saw with Ridge and Lasso: a higher value means more regularization and a less complex model that is less likely to overfit.

We saw that the optimal entropy tree above was rather deep. But we can prune it back.

In [None]:
ccp_alpha = 0.05
entropy_pruned = DecisionTreeClassifier(max_depth=best_depth_entropy, criterion ='entropy', ccp_alpha=ccp_alpha)
entropy_pruned.fit(X_train, y_train)
print(f"Test accuracy of pruned tree: {entropy_pruned.score(X_test, y_test):.4f}")
tree.plot_tree(entropy_pruned);

Minimal cost complexity pruning recursively finds the node with the “weakest link”. The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides [DecisionTreeClassifier.cost_complexity_pruning_path](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.cost_complexity_pruning_path) that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.

In [None]:
path = model_tree_entropy.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

In the following plot, the maximum effective alpha value is removed, because it is the trivial tree with only one node.

In [None]:
fig, ax = plt.subplots()
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set");

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

In [None]:
clfs = []
for ccp_alpha in ccp_alphas:
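    # Match the settings of model_tree_entropy, whose pruning path produced these alphas
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=best_depth_entropy,
                                 random_state=0, ccp_alpha=ccp_alpha)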
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)

For the remainder of this example, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and the tree depth decrease as alpha increases.

In [None]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1)
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()

When `ccp_alpha` is zero, the tree is not pruned at all and tends to overfit: training accuracy is at or near its maximum while test accuracy lags behind. As alpha increases, more of the tree is pruned, which can produce a tree that generalizes better, up to the point where pruning removes useful splits and both accuracies fall.

In [None]:
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]

fig, ax = plt.subplots(figsize=(20,8))
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()

In [None]:
print(f"ccp_alpha that maximizes test score: {ccp_alphas[np.argmax(test_scores)]:.5f}")

🤔 **Should we choose which model to deploy based on the test performance?**

## Preprocessing for Decision Trees

Unlike many other models, decision trees have some nice properties when it comes to preprocessing:

1. **Scaling**: Trees don't require feature scaling because they use thresholds rather than distances
   - No need for StandardScaler or MinMaxScaler
   - Trees make splits based on relative ordering, not absolute values

2. **Categorical Variables**: 
   - For binary categories, any encoding works equally well
   - For multi-class categories:
     - One-hot encoding is preferred
     - No need to drop_first (unlike linear models) since trees can handle the redundancy
```python
# Example of proper categorical encoding for trees
X_encoded = pd.get_dummies(X, columns=['categorical_column'])
# No need for: drop_first=True
```

3. **Missing Values**: 
   - Trees can in principle handle missing values naturally (scikit-learn's `DecisionTreeClassifier` only accepts `NaN` inputs from version 1.3 onward)
   - Consider using SimpleImputer with strategy='most_frequent' for categorical
   - Use strategy='median' for numerical features
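
To illustrate the scaling point from item 1, the sketch below fits the same tree on a feature matrix and on a rescaled copy (every value multiplied by 1000, as if the units had changed) and checks that the predictions agree. The synthetic integer-valued data and all variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical data: three integer-valued features, label depends on the first two
X_raw = rng.integers(0, 100, size=(300, 3)).astype(float)
y_lab = (X_raw[:, 0] + X_raw[:, 1] > 100).astype(int)
X_new = rng.integers(0, 100, size=(200, 3)).astype(float)

scale = 1000.0  # e.g. a change of units
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_raw, y_lab)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_raw * scale, y_lab)

# Compare predictions on the training points and on unseen points
same_train = np.array_equal(tree_raw.predict(X_raw), tree_scaled.predict(X_raw * scale))
same_new = np.array_equal(tree_raw.predict(X_new), tree_scaled.predict(X_new * scale))
print("Same predictions on training data:", same_train)
print("Same predictions on new data:", same_new)
```

Rescaling moves each threshold by the same factor as the feature, so the ordering of the points, and therefore the partition of the feature space, is unchanged.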

## Handling Class Imbalances

Class imbalance occurs when some classes have many more samples than others. This is common in real-world scenarios like spam detection or fraud detection. With imbalanced datasets, accuracy can be misleading: a model could achieve high accuracy by simply predicting the majority class!
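
A tiny made-up example makes the problem concrete. The labels below are invented for illustration: 95 negatives and 5 positives, with a "model" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_imb = np.array([0] * 95 + [1] * 5)

# A "classifier" that ignores its input and always predicts the majority class
y_majority = np.zeros_like(y_imb)

acc_majority = accuracy_score(y_imb, y_majority)
recall_majority = recall_score(y_imb, y_majority)
print(f"Accuracy: {acc_majority:.2f}")
print(f"Recall on the minority class: {recall_majority:.2f}")
```

The accuracy looks strong even though not a single positive case is found, which is why accuracy alone is a poor guide on imbalanced data.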

There are several approaches to handle class imbalances:

1. **Class Weights**: Tell the model to pay more attention to minority classes
   
```python
# Add weights inversely proportional to class frequencies
from sklearn.utils.class_weight import compute_class_weight

class_weights = compute_class_weight('balanced', 
                                     classes=np.unique(y_train),
                                     y=y_train)
                                   
dt = DecisionTreeClassifier(class_weight='balanced')  # Or pass dict of weights
```

2. **Upsampling**: Replicate minority class samples
```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)
print('Original dataset shape:', Counter(y_train))
print('Resampled dataset shape:', Counter(y_train_ros))
```

3. **SMOTE**: Create synthetic minority samples (like `RandomOverSampler`, this comes from the `imbalanced-learn` package, which requires a pip install)
```python
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)
print('Original dataset shape:', Counter(y_train))
print('Resampled dataset shape:', Counter(y_train_sm))
```

## Alternative Metrics to Accuracy (revisited)

When dealing with imbalanced classes, accuracy can be misleading. Consider these alternatives:

1. **Precision**: Of the positive predictions, how many were correct?
   - Important when false positives are costly
   - Example: Spam detection (don't want to block legitimate emails)

2. **Recall**: Of the actual positive cases, how many did we catch?
   - Important when false negatives are costly
   - Example: Disease detection (don't want to miss sick patients)

3. **F1 Score**: Harmonic mean of precision and recall
   - Balances precision and recall

```python
from sklearn.metrics import classification_report

# Get comprehensive metrics (using the Gini tree fit above as an example)
y_pred = model_tree_gini.predict(X_test)
print(classification_report(y_test, y_pred))

# For binary classification, you can of course also plot the ROC curve
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(model_tree_gini, X_test, y_test)
plt.show()
```
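
To connect these metrics back to the confusion matrix, here is a small worked example on invented labels. It computes precision, recall, and F1 by hand from the four cell counts and compares them with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical true labels and predictions (1 = positive class)
y_true_toy = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"By hand: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"sklearn: precision={precision_score(y_true_toy, y_pred_toy):.3f} "
      f"recall={recall_score(y_true_toy, y_pred_toy):.3f} "
      f"f1={f1_score(y_true_toy, y_pred_toy):.3f}")
```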


--------
<div class="alert alert-success">
    <strong>🏋🏻‍♂️ TEAM ACTIVITY:</strong> Tuning a Spam Detection Decision Tree </div>  

**Tune some of the available decision tree hyperparameters and select the best model. Finally, evaluate your selected model on the test data.**

- You must be able to justify your choice for the best model **without** reference to test performance (i.e., no tuning to the test data!)
- Consult the [DecisionTreeClassifier documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) to see the full list of available hyperparameters. You certainly do not need to tune all of them here!

In [None]:
# your code here


**Pros of Decision Trees:**
- Very straightforward models, easy to explain to people, even easier than linear regression.
- Transparent models, easy to interpret
- Not as sensitive to multicollinearity as some other models
- Do not require scaling of data, not sensitive to variables having high difference in range
- Can handle both numerical and categorical predictors (in some languages like R you don't even have to encode categorical data as zeros and ones; R handles it out of the box)

**Cons of Decision Trees:**
- Not very competitive in terms of predictive accuracy; other classification and regression approaches often outperform single trees.
- Overfit very quickly.
- Very non-robust: a small change in the data can cause a large change in the final estimated tree. In other words, they suffer from high variance.

What can we do to make it better?

Let's say we have a set of $n$ independent observations $Z_1, Z_2, Z_3, \ldots, Z_n$, each with variance $\sigma^2$. What would be the variance of the mean of the observations, $\bar{Z}$?

It would be $\frac{\sigma^2}{n}$, which is lower than the variance of any single observation whenever $n > 1$.
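
A quick simulation can check this result numerically. The values of $\sigma$, $n$, and the number of repetitions below are arbitrary choices for illustration, and normal draws are assumed only for convenience.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n_obs, n_trials = 2.0, 25, 20000

# Each row is one experiment: n_obs independent draws with variance sigma**2
draws = rng.normal(loc=0.0, scale=sigma, size=(n_trials, n_obs))

var_single = draws[:, 0].var()       # variance of a single observation
var_mean = draws.mean(axis=1).var()  # variance of the mean of n_obs observations

print(f"Variance of one observation: {var_single:.3f} (theory: {sigma**2:.3f})")
print(f"Variance of the mean:        {var_mean:.3f} (theory: {sigma**2 / n_obs:.3f})")
```

Averaging many independent, noisy quantities gives a much more stable result than any one of them, which is the intuition to carry over to high-variance trees.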

# Feature Importance

Decision Tree objects have a `feature_importances_` attribute. This is a record of how much each feature's splits contributed to the reduction in the model's splitting criterion.

The idea is that features whose splits reduced the criterion the most are the most important. But there are reasons to be skeptical about this approach to feature importance...
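
One reason for skepticism can be sketched with synthetic data, constructed here purely for illustration. A fully grown tree is fit on one binary feature that drives the (noisy) label and one ID-like feature that is unrelated to the label but has a unique value per row.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_rows = 400

signal = rng.integers(0, 2, size=n_rows)           # binary feature that drives the label
flipped = rng.random(n_rows) < 0.2                 # roughly 20% label noise
y_noisy = np.where(flipped, 1 - signal, signal)
noise_id = rng.permutation(n_rows).astype(float)   # unique per row, unrelated to the label

X_fi = np.column_stack([signal, noise_id])
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_fi, y_noisy)

imp_signal, imp_noise = deep_tree.feature_importances_
print(f"Training accuracy: {deep_tree.score(X_fi, y_noisy):.3f}")
print(f"Importance of the signal feature: {imp_signal:.3f}")
print(f"Importance of the noise feature:  {imp_noise:.3f}")
```

Because the unpruned tree can only memorize the flipped labels by splitting on the ID-like feature, that feature receives a nonzero importance even though it carries no real information. Impurity-based importance is computed on the training data, so it rewards such splits.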

## Feature Importance Analysis: 3 Approaches

Decision trees offer multiple ways to assess feature importance:

1. **Built-in Feature Importance**

In [None]:
# Built-in feature importance
importances = model_tree_entropy.feature_importances_
feature_imp = pd.DataFrame({
    'feature': X_train.columns,
    'importance': importances
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(feature_imp.head(10)['feature'], 
         feature_imp.head(10)['importance'])
plt.title('Top 10 Feature Importances (Built-in)')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

2. **Permutation Importance**: More robust than built-in importance

In [None]:
from sklearn.inspection import permutation_importance

# Permutation Importance (more robust)
result = permutation_importance(
    model_tree_entropy, X_test, y_test,
    n_repeats=10,
    random_state=42
)

# Create dataframe of permutation importances
perm_imp = pd.DataFrame({
    'feature': X_test.columns,
    'importance_mean': result.importances_mean,
    'importance_std': result.importances_std
}).sort_values('importance_mean', ascending=False)

# Plot top 10 features with error bars
plt.figure(figsize=(10, 6))
top_10 = perm_imp.head(10)
plt.barh(range(len(top_10)), top_10['importance_mean'],
         xerr=top_10['importance_std'], capsize=5)
plt.yticks(range(len(top_10)), top_10['feature'])
plt.title('Top 10 Feature Importances (Permutation)')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

3. **Bootstrap Analysis**: Assess stability of importance measures

In [None]:
# Bootstrap Analysis of Feature Importance
n_iterations = 100
n_features = X_train.shape[1]
bootstrap_importances = np.zeros((n_iterations, n_features))

for i in range(n_iterations):
    # Bootstrap sample
    indices = np.random.randint(0, len(X_train), len(X_train))
    X_boot = X_train.iloc[indices]
    y_boot = y_train.iloc[indices]
    
    # Fit model and get importance
    dt = DecisionTreeClassifier(random_state=42)
    dt.fit(X_boot, y_boot)
    bootstrap_importances[i,:] = dt.feature_importances_

# Summarize the bootstrap distribution (mean and standard deviation per feature)
importance_stats = pd.DataFrame({
    'feature': X_train.columns,
    'mean_importance': bootstrap_importances.mean(axis=0),
    'std_importance': bootstrap_importances.std(axis=0)
}).sort_values('mean_importance', ascending=False)

# Top 10 features with error bars of one standard deviation
plt.figure(figsize=(10, 6))
top_10 = importance_stats.head(10)
plt.barh(range(len(top_10)), top_10['mean_importance'],
         xerr=top_10['std_importance'], capsize=5)
plt.yticks(range(len(top_10)), top_10['feature'])
plt.title('Top 10 Feature Importances (Bootstrap Analysis)')
plt.xlabel('Mean Importance')
plt.tight_layout()
plt.show()

Finally, a comparison across all methods.

In [None]:
# Top 5 most important features across all methods
print("\nTop 5 Most Important Features Summary:")
print("\nBuilt-in Importance:")
print(feature_imp[['feature', 'importance']].head())
print("\nPermutation Importance:")
print(perm_imp[['feature', 'importance_mean', 'importance_std']].head())
print("\nBootstrap Importance:")
print(importance_stats[['feature', 'mean_importance', 'std_importance']].head())

🌈 **The End**