# Feature Importance

## Using Regression Methods

Many regression methods expose a measure of feature importance, either as fitted coefficients or as importance scores derived from the model. The sections below cover nine such techniques, plus one model-agnostic method:

1. **Linear Regression**
2. **Ridge Regression**
3. **Lasso Regression**
4. **Elastic Net Regression**
5. **Decision Tree Regression**
6. **Random Forest Regression**
7. **Gradient Boosting Regression (e.g., XGBoost, LightGBM)**
8. **Support Vector Regression (SVR)**
9. **Partial Least Squares (PLS) Regression**
10. **Permutation Importance (Model Agnostic)**


## Setup: Generating the Dataset
We'll create a simple dataset with `scikit-learn`'s `make_regression` function.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression

# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

# Print shape for verification
print("Features shape:", X.shape)
print("Target shape:", y.shape)
```

```
Features shape: (100, 5)
Target shape: (100,)
```


## 1. **Linear Regression (Ordinary Least Squares)**

   - **How it works:** Minimizes the sum of squared errors to fit a linear model.
   - **Feature Importance:** Coefficients of the linear model indicate the influence of each feature.
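   - **Interpretation:** Larger absolute coefficients indicate more influence, provided the features are on comparable scales (for example, after standardization).
   
   > **Limitation:** Sensitive to multicollinearity: when features are correlated, individual coefficients become unstable and harder to interpret.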

```python
from sklearn.linear_model import LinearRegression

# Initialize and fit the model
linear_model = LinearRegression()
linear_model.fit(X, y)

# Display feature importance (coefficients)
print("Linear Regression Coefficients:", linear_model.coef_)
```
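Coefficient magnitudes depend on the units of each feature, so comparing them is only meaningful on a common scale. The following is a minimal sketch, assuming standardization with `StandardScaler` suits the data. The 1000x rescaling of the first feature is a hypothetical stand-in for a change of units, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

# Hypothetical unit change: multiply the first feature by 1000
X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000.0

# Raw coefficients: the first one shrinks by about 1000x, although the
# feature's real influence on y has not changed
raw_coefs = LinearRegression().fit(X_rescaled, y).coef_

# Standardized coefficients are unaffected by the unit change
std_coefs = LinearRegression().fit(StandardScaler().fit_transform(X_rescaled), y).coef_
std_coefs_original = LinearRegression().fit(StandardScaler().fit_transform(X), y).coef_

print("Raw coefficients (rescaled data):  ", raw_coefs)
print("Standardized (rescaled data):      ", std_coefs)
print("Standardized (original data):      ", std_coefs_original)
```

After standardization, each coefficient reads as the change in the target per one standard deviation of the feature, which makes the magnitudes comparable.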

## 2. **Ridge Regression (L2 Regularization)**
   - **How it works:** Adds an L2 penalty to the loss function to shrink coefficients.
   - **Feature Importance:** Coefficients are read as in linear regression, but the penalty shrinks them toward zero (rarely exactly to zero).
   - **Use case:** When features are correlated.
   
   > **Advantage:** Helps with multicollinearity and overfitting.


```python
from sklearn.linear_model import Ridge

# Initialize Ridge Regression
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X, y)

# Display feature importance (coefficients)
print("Ridge Regression Coefficients:", ridge_model.coef_)
```
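To illustrate the shrinkage effect, the sketch below refits Ridge for a few values of `alpha` and records the L2 norm of the coefficient vector. The grid of `alpha` values is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

# Arbitrary illustrative grid of penalty strengths
alphas = [0.1, 1.0, 10.0, 100.0]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    coef_norms.append(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:>6}: ||coef||_2 = {coef_norms[-1]:.3f}")
```

A stronger penalty pulls the whole coefficient vector toward zero, so rankings derived from Ridge coefficients should be read relative to each other at a fixed `alpha`.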

## 3. **Lasso Regression (L1 Regularization)**
   - **How it works:** Adds an L1 penalty that encourages sparsity in the coefficients.
   - **Feature Importance:** Some coefficients can be driven exactly to zero, so Lasso doubles as a feature selector.
   - **Use case:** When you want to identify a small subset of important features.
   
   > **Advantage:** Automatic feature selection by shrinking unimportant coefficients to zero.

```python
from sklearn.linear_model import Lasso

# Initialize Lasso Regression
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X, y)

# Display feature importance (coefficients)
print("Lasso Regression Coefficients:", lasso_model.coef_)
```
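The selection behavior is easier to see on data where only some features matter. The sketch below assumes a hypothetical dataset with 10 features, of which 3 are informative. The dataset parameters and `alpha=1.0` are illustrative choices, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Hypothetical dataset: 10 features, only 3 of which drive the target.
# coef=True also returns the ground-truth coefficients used to build y.
X_sparse, y_sparse, true_coef = make_regression(
    n_samples=200, n_features=10, n_informative=3,
    noise=1.0, coef=True, random_state=0,
)

sparse_lasso = Lasso(alpha=1.0).fit(X_sparse, y_sparse)

# Indices of features with non-zero coefficients
selected = np.flatnonzero(sparse_lasso.coef_)
truly_informative = np.flatnonzero(true_coef)

print("Features kept by Lasso:      ", selected)
print("Truly informative features:  ", truly_informative)
```

In practice `alpha` is usually chosen by cross-validation (for example with `LassoCV`), since the set of surviving features depends on it.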

## 4. **Elastic Net Regression (L1 + L2 Regularization)**
   - **How it works:** Combines both L1 and L2 regularization.
   - **Feature Importance:** Balances between feature selection (L1) and coefficient shrinkage (L2).
   - **Use case:** When features are highly correlated, and some sparsity is desired.
   
   > **Advantage:** More flexible than Ridge or Lasso alone.

```python
from sklearn.linear_model import ElasticNet

# Initialize Elastic Net Regression
elastic_net_model = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net_model.fit(X, y)

# Display feature importance (coefficients)
print("Elastic Net Coefficients:", elastic_net_model.coef_)
```
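The `l1_ratio` parameter controls the mix of the two penalties. As a quick sanity check, the sketch below compares Elastic Net at `l1_ratio=1.0` (pure L1) with Lasso at the same `alpha`, and then with an even mix. The values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

lasso = Lasso(alpha=0.1).fit(X, y)
enet_pure_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)  # L1 penalty only
enet_mixed = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)    # half L1, half L2

print("Lasso:                 ", lasso.coef_)
print("ElasticNet l1_ratio=1: ", enet_pure_l1.coef_)
print("ElasticNet l1_ratio=.5:", enet_mixed.coef_)
```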

## 5. **Decision Tree Regression**
   - **How it works:** Splits data at nodes based on feature values to minimize variance.
   - **Feature Importance:** Based on reduction in variance or impurity at each split.
   - **Interpretation:** Sum of reductions at each node where the feature was used.
   
   > **Limitation:** Prone to overfitting on small datasets.

```python
from sklearn.tree import DecisionTreeRegressor

# Initialize Decision Tree Regressor (fixed seed for reproducible results)
tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X, y)

# Display feature importance
print("Decision Tree Feature Importances:", tree_model.feature_importances_)
```
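In scikit-learn, tree importances are normalized so that they sum to 1, which makes them read as shares of the total impurity reduction. The short sketch below checks this and ranks the features. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

tree = DecisionTreeRegressor(random_state=42).fit(X, y)
importances = tree.feature_importances_

# Rank features from most to least important
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")

print("Sum of importances:", importances.sum())
```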

## 6. **Random Forest Regression**
   - **How it works:** Builds an ensemble of decision trees.
   - **Feature Importance:** Average reduction in impurity across all trees where the feature was used.
   - **Use case:** When non-linear relationships are important.
   
   > **Advantage:** More robust than a single decision tree.

```python
from sklearn.ensemble import RandomForestRegressor

# Initialize Random Forest Regressor (fixed seed for reproducible results)
forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
forest_model.fit(X, y)

# Display feature importance
print("Random Forest Feature Importances:", forest_model.feature_importances_)
```
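Because the forest's score is an average over trees, the spread across trees gives a rough sense of how stable each importance is. The sketch below reads per-tree importances from `estimators_`. Treating the standard deviation as an uncertainty indicator is an informal heuristic, not a formal confidence interval.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# One row of importances per tree in the ensemble
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
mean_importance = per_tree.mean(axis=0)
std_importance = per_tree.std(axis=0)

for i, (m, s) in enumerate(zip(mean_importance, std_importance)):
    print(f"feature {i}: {m:.3f} +/- {s:.3f}")
```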

## 7. **Gradient Boosting Regression (e.g., XGBoost, LightGBM)**
   - **How it works:** Sequentially builds trees, each correcting the errors of the previous.
   - **Feature Importance:** Based on gain (reduction in loss) or frequency of usage in trees.
   - **Use case:** Excellent for complex, non-linear relationships.
   
   > **Advantage:** Often more accurate than a single tree, and libraries such as XGBoost can report several importance types (for example, gain or split count).

```python
from xgboost import XGBRegressor

# Initialize XGBoost Regressor (fixed seed for reproducible results)
xgb_model = XGBRegressor(objective='reg:squarederror', n_estimators=100, random_state=42)
xgb_model.fit(X, y)

# Display feature importance (the meaning depends on the model's importance_type)
print("XGBoost Feature Importances:", xgb_model.feature_importances_)
```
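The sequential error correction can be observed directly. The sketch below swaps in scikit-learn's `GradientBoostingRegressor` rather than XGBoost, purely to avoid an extra dependency, and uses `staged_predict` to track the training error as trees are added.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

gbr = GradientBoostingRegressor(n_estimators=100, random_state=42).fit(X, y)

# Training MSE after each boosting stage (one entry per tree added)
staged_mse = [mean_squared_error(y, pred) for pred in gbr.staged_predict(X)]
print(f"MSE after 1 tree:    {staged_mse[0]:.2f}")
print(f"MSE after 100 trees: {staged_mse[-1]:.2f}")

# Impurity-based importances, normalized to sum to 1
gbr_importances = gbr.feature_importances_
print("Gradient Boosting Feature Importances:", gbr_importances)
```

Training error alone says nothing about generalization; a held-out set would be needed to choose the number of trees.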

## 8. **Support Vector Regression (SVR)**
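   - **How it works:** Fits a function that keeps most training points within an epsilon-wide tube and penalizes only larger errors; kernels allow non-linear fits in a high-dimensional space.
   - **Feature Importance:** With a linear kernel, the per-feature weights (`coef_`) are computed from the dual coefficients and the support vectors.
   
   > **Limitation:** Per-feature weights are not directly available unless a linear kernel is used.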

```python
from sklearn.svm import SVR

# Initialize Support Vector Regressor (coef_ is only available for the linear kernel)
svr_model = SVR(kernel='linear')
svr_model.fit(X, y)

# Display feature importance (coef_ has shape (1, n_features), so flatten it)
print("SVR Coefficients:", svr_model.coef_.ravel())
```
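The link between the dual representation and the linear weights can be verified numerically. The sketch below rebuilds `coef_` from `dual_coef_` and `support_vectors_` for a linear kernel, and shows that an RBF-kernel model does not expose `coef_` at all. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.svm import SVR

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

linear_svr = SVR(kernel="linear").fit(X, y)
rbf_svr = SVR(kernel="rbf").fit(X, y)

# For a linear kernel, the weights are a weighted sum of the support vectors
reconstructed = linear_svr.dual_coef_ @ linear_svr.support_vectors_
print("coef_:               ", linear_svr.coef_.ravel())
print("dual_coef_ @ SVs:    ", reconstructed.ravel())

# A non-linear kernel has no per-feature weights in the input space
has_rbf_coef = hasattr(rbf_svr, "coef_")
print("RBF model has coef_: ", has_rbf_coef)
```

For non-linear kernels, a model-agnostic method such as permutation importance is one way to obtain per-feature scores.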

## 9. **Partial Least Squares (PLS) Regression**
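   - **How it works:** Projects the predictors onto a small number of latent components chosen to maximize covariance with the target.
   - **Feature Importance:** Feature weights on the latent components (`x_weights_`), or the regression coefficients mapped back to the original features (`coef_`).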
   
   > **Use case:** Suitable for multicollinear and high-dimensional data.

```python
from sklearn.cross_decomposition import PLSRegression

# Initialize PLS Regression with 2 components
pls_model = PLSRegression(n_components=2)
pls_model.fit(X, y)

# Display the coefficients for each feature (flattened to one value per feature)
print("PLS Regression Coefficients:", pls_model.coef_.ravel())
```


## 10. **Permutation Importance (Model Agnostic)**
   - **How it works:** Measures the decrease in model performance when a feature's values are randomly shuffled.
   - **Feature Importance:** The drop in score (e.g., lower R² or higher RMSE) after shuffling; a larger drop means the model relies more on that feature.
   
   > **Advantage:** Applicable to any model, providing a global view of feature importance.

```python
from sklearn.inspection import permutation_importance

# Use the previously trained RandomForest model.
# Note: scoring on the training data (as here) can overstate importance;
# a held-out set is generally preferable.
perm_importance = permutation_importance(forest_model, X, y, n_repeats=30, random_state=42)

# Display permutation importance (mean drop in R² over the repeats)
print("Permutation Importances:", perm_importance.importances_mean)
```
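The shuffling procedure is simple enough to write by hand, which helps clarify what the scores mean. The following is a minimal sketch, not scikit-learn's implementation. It assumes R² as the score, a 70/30 train/test split, and 10 repeats, all of which are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
baseline = r2_score(y_test, model.predict(X_test))

rng = np.random.default_rng(42)
n_repeats = 10  # illustrative choice
manual_importances = np.zeros(X_test.shape[1])

for j in range(X_test.shape[1]):
    drops = []
    for _ in range(n_repeats):
        X_shuffled = X_test.copy()
        # Shuffle one column to break its relationship with the target
        X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
        drops.append(baseline - r2_score(y_test, model.predict(X_shuffled)))
    manual_importances[j] = np.mean(drops)

print("Baseline R² on held-out data:", round(baseline, 3))
print("Manual permutation importances:", np.round(manual_importances, 3))
```

Scoring on held-out data measures how much the model's generalization depends on each feature, rather than how much it memorized during training.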

## Choosing a Method
- **Linear Data:** Linear, Ridge, Lasso.
- **Non-linear Data:** Random Forest, Gradient Boosting.
- **Feature Selection Focus:** Lasso, Elastic Net.
- **Interpretability:** Linear models, Decision Trees.