## Feature Importance from Tree-Based Models

Algorithms like Random Forests and Gradient Boosting Machines can provide feature importance scores based on how often and how much a feature is used for splitting.

Tree-based models like Random Forests and Gradient Boosting Machines (GBM) can provide feature importance scores that help you understand the significance of each feature in making predictions. These feature importance scores are useful for feature selection, understanding your model, and gaining insights into the relationships between the features and the target variable. 

The Working is based on the splitting of the node in the tree using Gini Impurity and Entropy

In [139]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load the dataset (In this example, we'll use the Iris dataset)
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.DataFrame(data.target,columns = ['target'])

In [140]:
X.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [141]:
y.head()

Unnamed: 0,target
0,0
1,0
2,0
3,0
4,0


In [142]:
# Split the data into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [143]:
# First check Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

  return fit_method(estimator, *args, **kwargs)


In [144]:
# Get feature importances for the Random Forest model
rf_feature_importances = rf_model.feature_importances_
rf_feature_importances

array([0.10809762, 0.03038681, 0.43999397, 0.42152159])

In [145]:
# Gradient Boosting Machine (GBM)
gbm_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gbm_model.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


In [146]:
# Get feature importances for the GBM model
gbm_feature_importances = gbm_model.feature_importances_
gbm_feature_importances

array([0.00136135, 0.01466716, 0.66567162, 0.31829987])

In [147]:
# Display feature importances for both
print("Random Forest Feature Importances:")
for feature, importance in zip(X.columns, rf_feature_importances):
    print(f"{feature}: {importance}")

print("\nGradient Boosting Machine Feature Importances:")
for feature, importance in zip(X.columns, gbm_feature_importances):
    print(f"{feature}: {importance}")

Random Forest Feature Importances:
sepal length (cm): 0.10809762464246378
sepal width (cm): 0.030386812473242528
petal length (cm): 0.43999397414456937
petal width (cm): 0.4215215887397244

Gradient Boosting Machine Feature Importances:
sepal length (cm): 0.0013613495710818788
sepal width (cm): 0.014667157306450787
petal length (cm): 0.6656716227428934
petal width (cm): 0.3182998703795739


### SelectFromModel is used to select the top k features

#### For Random Forest

In [153]:
from sklearn.feature_selection import SelectFromModel

# Select top 'k' features based on importance scores
k = 2  # You can adjust this value
sfm = SelectFromModel(rf_model, threshold=-np.inf, max_features=k)
sfm.fit(X_train, y_train)
selected_features = X_train.columns[sfm.get_support()]  # to get the column name


  return fit_method(estimator, *args, **kwargs)


In [154]:
selected_features

Index(['petal length (cm)', 'petal width (cm)'], dtype='object')

#### For Gradient Boosting Machine

In [157]:
from sklearn.feature_selection import SelectFromModel

# Select top 'k' features based on importance scores
k = 2  # You can adjust this value
sfm = SelectFromModel(gbm_model, threshold=-np.inf, max_features=k)
sfm.fit(X_train, y_train)
selected_features = X_train.columns[sfm.get_support()]  # to get the column name

  y = column_or_1d(y, warn=True)


In [158]:
selected_features

Index(['petal length (cm)', 'petal width (cm)'], dtype='object')

The threshold=-np.inf value is used in the SelectFromModel method from scikit-learn as a way to include all features with non-negative importance scores. When you set the threshold to -np.inf (negative infinity), it effectively means that no features will be excluded based on their importance scores. This is a way to include all features in the selection, regardless of their importance.

Here's a bit more detail on how it works:

   - If you set threshold=-np.inf, all features will be selected because no feature will have an importance score lower than negative infinity.

   - If you set a specific threshold, only features with importance scores greater than or equal to that threshold will be selected.

   - If you set threshold=0, only features with positive importance scores will be selected, effectively excluding features with non-positive importance.

Setting threshold=-np.inf can be useful when you want to retain all features and not perform any feature selection based on importance scores. However, if you want to select a specific number of top features or a subset based on importance, you can set a finite threshold that suits your needs.

### threshold parameter is optional and based on the dataset