## Feature Selection Using Feature Importance

Feature selection is the process of identifying the most relevant features in a dataset, with the goal of improving model performance and interpretability. Feature importance scores give a systematic basis for that choice: they estimate how much each feature contributes to a model's predictions.

### Key Steps in Feature Selection Using Feature Importance

1. **Train a Model:**
   - Use a model that provides feature importance scores, such as tree-based models like Random Forest or Gradient Boosting.

2. **Compute Feature Importances:**
   - Extract importance scores for each feature.

3. **Rank Features:**
   - Rank the features based on their importance scores.

4. **Select Top Features:**
   - Keep a subset of the ranked features, for example those at or above an importance threshold, or the top *k*.

5. **Train a Model with Selected Features:**
   - Retrain the model using only the selected features and evaluate it on held-out data, ideally against the model trained on all features.
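
Steps 3 and 4 can be sketched without training anything. The snippet below is a minimal illustration that reuses the importance scores reported for the Iris run in this section and applies two common selection rules. The threshold of `0.1` mirrors that run, while `top_k = 2` is a hypothetical choice made only for illustration.

```python
import pandas as pd

# Importance scores copied from the Iris run reported in this section.
importance_df = pd.DataFrame({
    "Feature": [
        "sepal length (cm)", "sepal width (cm)",
        "petal length (cm)", "petal width (cm)",
    ],
    "Importance": [0.104105, 0.044605, 0.417308, 0.433982],
})

# Step 3: rank features from most to least important.
ranked = importance_df.sort_values(by="Importance", ascending=False)

# Step 4, option A: keep every feature at or above a threshold.
threshold = 0.1  # illustrative cut-off, not a recommended default
above_threshold = ranked.loc[ranked["Importance"] >= threshold, "Feature"].tolist()

# Step 4, option B: keep a fixed number of top-ranked features.
top_k = 2  # hypothetical choice for illustration
top_features = ranked.head(top_k)["Feature"].tolist()

print(above_threshold)
print(top_features)
```

The two rules can disagree: here the threshold keeps three features while the top-2 rule keeps only the petal measurements, so the choice of rule is itself a modelling decision.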


```python
# Import required libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load sample dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train Random Forest model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Extract feature importances
importances = rf_model.feature_importances_

# Create a DataFrame for better visualization
feature_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

print("Feature Importances:")
print(feature_importance_df)

# Select top features based on importance threshold
importance_threshold = 0.1  # Define threshold
selected_features = feature_importance_df[feature_importance_df['Importance'] >= importance_threshold]['Feature']

print("\nSelected Features:")
print(selected_features)

# Train a new model using only selected features
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

rf_model_selected = RandomForestClassifier(random_state=42)
rf_model_selected.fit(X_train_selected, y_train)

# Evaluate the model with selected features
y_pred = rf_model_selected.predict(X_test_selected)
accuracy = accuracy_score(y_test, y_pred)

print(f"\nAccuracy with Selected Features: {accuracy}")
```


```text
Feature Importances:
             Feature  Importance
3   petal width (cm)    0.433982
2  petal length (cm)    0.417308
0  sepal length (cm)    0.104105
1   sepal width (cm)    0.044605

Selected Features:
3     petal width (cm)
2    petal length (cm)
0    sepal length (cm)
Name: Feature, dtype: object

Accuracy with Selected Features: 1.0
```


### Feature Importances
| Feature             | Importance |
|---------------------|------------|
| petal width (cm)    | 0.433982   |
| petal length (cm)   | 0.417308   |
| sepal length (cm)   | 0.104105   |
| sepal width (cm)    | 0.044605   |

---

### Selected Features
| Selected Features     |
|------------------------|
| petal width (cm)       |
| petal length (cm)      |
| sepal length (cm)      |

---

### Accuracy
| Metric                          | Value |
|---------------------------------|-------|
| Accuracy with Selected Features | 1.0   |


### Explanation of the Code

**Load Dataset:**
- The Iris dataset is used. It has 150 samples, three classes, and four features (sepal length, sepal width, petal length, petal width). 30% of the samples are held out as a test set.

**Train Random Forest Model:**
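- A Random Forest model is trained on the training split. Once fitted, it exposes `feature_importances_`: impurity-based scores (mean decrease in impurity across the trees) that are normalized to sum to 1.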

**Compute Feature Importances:**
- The feature importance scores are extracted and sorted to rank the features.

**Feature Selection:**
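- Features with an importance score at or above a specified threshold (here `0.1`) are selected. In this run, sepal length (0.104) only narrowly clears the threshold, so the selected set is sensitive to the chosen cut-off.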

**Retrain Model:**
- A new Random Forest model is trained using only the selected features.

**Evaluate Model:**
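- The accuracy of the model trained on the selected features is calculated on the test set. To confirm that performance is retained or improved, compare it against the accuracy of the model trained on all features (`rf_model`) on the same test set; the code above reports only the reduced model's score.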
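
The evaluation step is easier to interpret next to a baseline. The following is a minimal sketch, not the only way to do this, that trains one model on all features and one on the thresholded subset and scores both on the same test split. It uses scikit-learn's `SelectFromModel` as a shorthand for the manual thresholding shown earlier; the variable names (`baseline`, `selector`, `reduced`) are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same data and split settings as the example in this section.
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Baseline: a model trained on all four features.
baseline = RandomForestClassifier(random_state=42).fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# SelectFromModel fits its own forest and keeps features whose
# importance is greater than or equal to the threshold.
selector = SelectFromModel(
    RandomForestClassifier(random_state=42), threshold=0.1
).fit(X_train, y_train)
selected = X_train.columns[selector.get_support()].tolist()

# Retrain on the reduced feature set and score on the same test split.
reduced = RandomForestClassifier(random_state=42).fit(X_train[selected], y_train)
reduced_acc = accuracy_score(y_test, reduced.predict(X_test[selected]))

print(f"Selected features: {selected}")
print(f"Baseline accuracy (all features): {baseline_acc:.3f}")
print(f"Accuracy with selected features:  {reduced_acc:.3f}")
```

A single split of a small dataset gives a noisy estimate, so a difference of one or two test samples should not be read as a real improvement or regression.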


### Benefits of Feature Selection Using Feature Importance

1. **Improved Model Performance**  
   Reducing irrelevant or redundant features can enhance the model's accuracy and generalization.

2. **Reduced Complexity**  
   Simplifying the feature set reduces training time and computational cost.

3. **Better Interpretability**  
   Identifying important features helps in understanding the factors driving predictions.

4. **Noise Reduction**  
   Eliminating less important features can reduce noise in the data and improve robustness.

---

The same workflow applies to other models that expose importance scores, such as gradient boosting implementations (e.g., XGBoost, LightGBM), and to model-agnostic techniques such as permutation importance or SHAP values.
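
As one model-agnostic option, the sketch below uses scikit-learn's `permutation_importance`, which shuffles one feature at a time on held-out data and records how much the score drops. The dataset, split, and `n_repeats=10` are assumptions chosen to match the Iris example in this section, not recommended settings.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Shuffle each feature on the test set and measure the drop in the
# model's default score (accuracy for a classifier).
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)

# Rank features by their mean score drop across the repeats.
perm_df = pd.DataFrame({
    "Feature": X.columns,
    "MeanImportance": result.importances_mean,
    "Std": result.importances_std,
}).sort_values(by="MeanImportance", ascending=False)

print(perm_df)
```

Unlike `feature_importances_`, these scores are not normalized and can be zero or negative for features the model does not rely on, so a threshold chosen for impurity-based scores does not carry over directly.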
