<a href="https://colab.research.google.com/github/WasudeoGurjalwar/AL_ML_Training/blob/main/Feature_Selection_Techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Feature selection is the process of selecting a subset of relevant features (variables or columns) from your dataset to improve the performance of your machine learning model and reduce dimensionality. Here are three common ways to do feature selection in Python with small code samples using the scikit-learn library:

1. **Variance Threshold**:

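   This method removes features whose variance does not exceed a chosen threshold, since near-constant features carry little information. It is unsupervised (it ignores the target) and suits numerical features. Because variance depends on a feature's scale, a single threshold is only meaningful when the features are measured on comparable scales.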

   ```python
   from sklearn.feature_selection import VarianceThreshold

   # Create a VarianceThreshold instance with a threshold value
   selector = VarianceThreshold(threshold=0.1)

   # Fit and transform the selector on your dataset
   X_new = selector.fit_transform(X)

   # X_new will contain only features with variance greater than 0.1
   ```

2. **SelectKBest with Mutual Information**:

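   This method selects the top k features based on their mutual information with the target variable. It works with continuous and discrete features, but categorical features must be encoded as numbers first, and discrete columns should be flagged through the scoring function's `discrete_features` parameter.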

   ```python
   from sklearn.feature_selection import SelectKBest, mutual_info_classif

   # Create a SelectKBest instance with mutual information as the scoring function
   selector = SelectKBest(score_func=mutual_info_classif, k=5)

   # Fit and transform the selector on your dataset
   X_new = selector.fit_transform(X, y)

   # X_new will contain the top 5 features with the highest mutual information
   ```

3. **Recursive Feature Elimination (RFE)**:

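   This method repeatedly fits an estimator and removes the least important features, as ranked by the estimator's `coef_` or `feature_importances_`, until the requested number of features remains. It works with any supervised estimator that exposes one of these attributes.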

   ```python
   from sklearn.feature_selection import RFE
   from sklearn.linear_model import LogisticRegression

   # Create an estimator (e.g., LogisticRegression) and a RFE instance
   estimator = LogisticRegression()
   selector = RFE(estimator, n_features_to_select=3)

   # Fit the selector on your dataset
   selector = selector.fit(X, y)

   # Keep only the selected columns
   X_new = selector.transform(X)

   # X_new will contain the 3 features selected by RFE
   ```

These are just a few examples of feature selection methods in scikit-learn. Depending on your specific dataset and problem, you can choose the most appropriate method.

## Example 1 - Variance Threshold feature selection

This example applies the Variance Threshold feature selection method to the Iris dataset that ships with the scikit-learn library:

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

# Load the Iris dataset as an example
iris = load_iris()
X = iris.data
y = iris.target

# Create a DataFrame from the dataset
df = pd.DataFrame(data=X, columns=iris.feature_names)
df['target'] = y

# Display the first few rows of the dataset
print("Original Dataset:")
print(df.head())

# Instantiate the VarianceThreshold selector with a threshold value
selector = VarianceThreshold(threshold=0.2)

# Fit the selector on the dataset and transform it
X_new = selector.fit_transform(X)

# Get the selected feature indices
selected_indices = selector.get_support(indices=True)
print("------------------")
print(selected_indices)
print("------------------")

# Create a DataFrame with the selected features
df_new = df.iloc[:, selected_indices]

# Display the selected features
print("\nSelected Features:")
print(df_new.head())

# Display the feature variances
print("\nFeature Variances:")
print(df.iloc[:, selected_indices].var())


# Note: the variance of a feature can be greater than 1.
# Variance measures how far the values of a feature spread around their mean,
# so the more widely the values are spread, the higher the variance.

Original Dataset:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  
------------------
[0 2 3]
------------------

Selected Features:
   sepal length (cm)  petal length (cm)  petal width (cm)
0                5.1                1.4               0.2
1                4.9                1.4               0.2
2                4.7                1.3               0.2
3                4.6                1.5               0.2
4                5.0                1.4               0.2

Feature Variances:
s
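
One detail worth checking: `VarianceThreshold` compares the threshold against the population variance (NumPy's default, `ddof=0`), whereas pandas' `var()` defaults to the sample variance (`ddof=1`), so the variances printed with pandas can differ slightly from the ones the selector used. A minimal sketch, assuming the same Iris data and threshold as in the cell above, reads the values from the fitted `variances_` attribute instead:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

iris = load_iris()
X = iris.data

selector = VarianceThreshold(threshold=0.2)
selector.fit(X)

# Variances the selector actually compared against the threshold
for name, var, kept in zip(iris.feature_names, selector.variances_, selector.get_support()):
    print(f"{name:20s} variance={var:.3f} kept={kept}")

# The fitted variances match NumPy's population variance (ddof=0),
# which is slightly smaller than the sample variance (ddof=1)
population_var = X.var(axis=0)
sample_var = X.var(axis=0, ddof=1)
print(np.allclose(selector.variances_, population_var))
```

On a dataset of this size the two definitions differ only marginally, but the distinction matters when a feature's variance sits close to the threshold.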

## Example 2 - SelectKBest

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset as an example
data = load_breast_cancer()
X = data.data
y = data.target

# Create a DataFrame from the dataset
df = pd.DataFrame(data=X, columns=data.feature_names)
df['target'] = y

# Display the first few rows of the dataset
print("Original Dataset:")
print(df.head())

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Instantiate the SelectKBest selector with mutual information as the scoring function
selector = SelectKBest(score_func=mutual_info_classif, k=5)

# Fit and transform the selector on the training data
X_train_new = selector.fit_transform(X_train, y_train)

# Get the indices of the selected features
selected_indices = selector.get_support(indices=True)
print(selected_indices)

# Create a DataFrame with the selected features
df_selected = df.iloc[:, selected_indices]

# Display the selected features
print("\nSelected Features:")
print(df_selected.head())

# Train a RandomForestClassifier on the selected features
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_new, y_train)
#clf.fit(X_train, y_train)

# Transform the test data using the same selector
X_test_new = selector.transform(X_test)

# Make predictions on the test data
y_pred = clf.predict(X_test_new)
#y_pred = clf.predict(X_test)

# Calculate and display the accuracy of the model on the test data
accuracy = accuracy_score(y_test, y_pred)
print("\nAccuracy on Test Data:", accuracy)


Original Dataset:
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst perimeter  worst area
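
`mutual_info_classif` adds a small amount of random noise to continuous features while estimating mutual information, so the selected columns can change between runs unless a seed is fixed. A minimal sketch, assuming the same Breast Cancer split as in the cell above: `functools.partial` binds `random_state` to the scoring function (the seed value 0 is an arbitrary choice).

```python
from functools import partial

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

# Bind a fixed seed so repeated fits produce identical scores
seeded_mi = partial(mutual_info_classif, random_state=0)

first = SelectKBest(score_func=seeded_mi, k=5).fit(X_train, y_train)
second = SelectKBest(score_func=seeded_mi, k=5).fit(X_train, y_train)

# Names of the selected features and a check that both fits agree
print(data.feature_names[first.get_support(indices=True)])
print(np.array_equal(first.scores_, second.scores_))
```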

`SelectKBest` in scikit-learn provides several scoring functions that you can use for feature selection. These functions measure the relationship between features and the target variable differently. Some commonly used scoring functions include:

1. **f_classif**: This function computes the ANOVA F-statistic for classification tasks. It assesses whether there are significant differences in feature distributions between classes.

2. **chi2**: The chi-squared (chi2) statistic is used for feature selection in classification tasks. It measures the dependence between each feature and the target, and it requires non-negative feature values such as counts or frequencies.

3. **f_regression**: This function calculates the F-statistic for regression tasks. It measures the linear relationship between each feature and the target variable.

4. **mutual_info_classif**: As previously mentioned, this function calculates the mutual information between features and the target variable for classification tasks. It assesses the dependency and information gain.

5. **mutual_info_regression**: Similar to `mutual_info_classif`, this function calculates mutual information but is designed for regression tasks.

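6. **SelectPercentile**: Unlike the entries above, this is a selector class rather than a scoring function. It is an alternative to `SelectKBest` that keeps a given percentage of the highest-scoring features, ranked by whichever scoring function you pass to it.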

Here's an example of how to use `SelectKBest` with the `f_classif` scoring function:

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Instantiate the SelectKBest selector with f_classif as the scoring function and k=5
selector = SelectKBest(score_func=f_classif, k=5)

# Fit and transform the selector on your data
X_new = selector.fit_transform(X, y)
```

You can choose the scoring function that best fits your specific classification or regression problem to perform feature selection effectively.
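
`SelectPercentile` accepts the same scoring functions as `SelectKBest` but keeps a percentage of the features instead of a fixed count. A minimal sketch, assuming the Breast Cancer dataset from Example 2 (30 features) and an arbitrary `percentile=20`:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Keep the top 20% of features ranked by the ANOVA F-statistic
selector = SelectPercentile(score_func=f_classif, percentile=20)
X_new = selector.fit_transform(X, y)

# Compare the number of columns before and after selection
print(X.shape, "->", X_new.shape)
```

A percentage is convenient when the same selection step is reused on datasets with different numbers of columns.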

## Example 3 - Recursive Feature Elimination (RFE)

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset as an example
data = load_breast_cancer()
X = data.data
y = data.target

# Create a DataFrame from the dataset
df = pd.DataFrame(data=X, columns=data.feature_names)
df['target'] = y

# Display the first few rows of the dataset
print("Original Dataset:")
print(df.head())

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Instantiate a RandomForestClassifier as the estimator
estimator = RandomForestClassifier(random_state=42)

# Instantiate the RFE selector with the estimator and the desired number of features to select (e.g., 5)
selector = RFE(estimator, n_features_to_select=5)

# Fit the selector on the training data
selector = selector.fit(X_train, y_train)

# Get the boolean mask of selected features (True = selected, False = eliminated).
# The ranks themselves are stored in selector.ranking_, where 1 marks a selected feature.
feature_ranking = selector.support_
print("------------------")
print(feature_ranking)
print("------------------")

# Get the indices of the selected features
selected_indices = np.where(feature_ranking)[0]

# Create a DataFrame with the selected features
df_selected = df.iloc[:, selected_indices]


# Display the selected features
print("\nSelected Features:")
print(df_selected.head())

# Train a RandomForestClassifier on the selected features
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train[:, feature_ranking], y_train)

# Make predictions on the test data using the selected features
y_pred = clf.predict(X_test[:, feature_ranking])

# Calculate and display the accuracy of the model on the test data
accuracy = accuracy_score(y_test, y_pred)
print("\nAccuracy on Test Data:", accuracy)


Original Dataset:
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst perimeter  worst area
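
RFE exposes two fitted attributes that are easy to confuse: `support_` is a boolean mask of the kept features, while `ranking_` assigns rank 1 to every kept feature and larger ranks to features that were eliminated earlier. A minimal sketch, assuming a `LogisticRegression` on standardized features as the estimator (swapped in for the random forest above only to keep the run fast):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)  # scale so coefficients are comparable
y = data.target

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X, y)

print("Mask:   ", selector.support_)   # True for the 5 kept features
print("Ranking:", selector.ranking_)   # 1 for kept features, >1 for eliminated ones
print("Kept:   ", data.feature_names[selector.support_])

# transform() applies the mask, equivalent to X[:, selector.support_]
X_selected = selector.transform(X)
print(X_selected.shape)
```

Using `selector.transform` on both the training and test data avoids indexing the arrays by hand with the mask.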

#### Key Logic in the Code Explained

`np.where(selector.support_)[0]` is a NumPy operation used to obtain the indices of selected features based on the boolean mask created by `selector.support_`. Let's break down how it works step by step:

1. `selector.support_` is a boolean mask created by the Recursive Feature Elimination (RFE) selector. It has the same length as the number of features in your dataset. Each element of this mask is `True` if the corresponding feature is selected by RFE or `False` if it's not selected.

2. `np.where(selector.support_)` returns a tuple containing the indices (positions) where the boolean mask is `True`. The result is a tuple because `np.where` returns one index array per dimension of its input; the mask is one-dimensional, so the tuple holds a single array.

3. `np.where(selector.support_)[0]` extracts the first element of the tuple, which contains the selected feature indices.

Here's an example to illustrate this:

Suppose you have a boolean mask `selector.support_` like this:

```python
selector.support_ = [True, False, True, False, True]
```

In this case, `True` corresponds to the selected features, and `False` corresponds to the non-selected features.

- `np.where(selector.support_)` would return a tuple with the positions of `True` values:

  ```
  (array([0, 2, 4]),)
  ```

- Finally, `np.where(selector.support_)[0]` extracts the first (and only) element of the tuple, which gives you the selected feature indices:

  ```
  array([0, 2, 4])
  ```

So, `np.where(selector.support_)[0]` returns a NumPy array holding the indices of the features selected by RFE, which you can use to extract the corresponding columns (features) from your DataFrame for further analysis or modeling.
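
The steps above can be run as a short standalone check. This is a minimal sketch in which `mask` is a hypothetical stand-in for `selector.support_`; it also shows `np.flatnonzero`, which returns the same indices without the tuple wrapper.

```python
import numpy as np

# Hypothetical mask standing in for selector.support_
mask = np.array([True, False, True, False, True])

result = np.where(mask)
print(type(result), len(result))  # a tuple holding one array, because mask is 1-D

indices = result[0]
print(indices)

# np.flatnonzero gives the same indices directly as an array
print(np.flatnonzero(mask))

# For a 2-D mask, np.where returns one index array per dimension
rows, cols = np.where(np.array([[True, False], [False, True]]))
print(rows, cols)
```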

# **FAQ**

What are the limitations of the feature selection techniques above that lead many developers to use feature extraction techniques like PCA?

Feature selection techniques and feature extraction techniques like Principal Component Analysis (PCA) serve different purposes and have their own advantages and limitations. The choice between them depends on the specific characteristics of your data and the goals of your machine learning project. Here are some considerations for why developers might choose feature extraction techniques like PCA over feature selection techniques:

1. **Dimensionality Reduction**: Feature extraction techniques like PCA are primarily used for dimensionality reduction. If you have a high-dimensional dataset with many features, reducing the dimensionality can help overcome issues related to the curse of dimensionality and computational complexity. Feature selection may not effectively reduce dimensionality to the same extent.

2. **Linear Combinations**: PCA finds linear combinations of the original features that capture the most variance in the data. This can be valuable when there are multicollinearity issues (high correlation between features), as PCA can decorrelate the features and provide orthogonal components.

3. **Unsupervised Approach**: PCA is an unsupervised technique and does not rely on target labels. It explores the inherent structure of the data based on covariance and eigenvalues. Many feature selection methods, such as `SelectKBest` and RFE, require target labels, so PCA remains usable when no labels are available.

4. **Noise Reduction**: PCA tends to reduce the influence of noise by keeping the directions (principal components) with the most variance and discarding the rest. This helps when noise contributes little variance relative to the signal. Feature selection techniques may or may not effectively handle noisy features.
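
The decorrelation and variance-ordering properties described in this list can be checked directly. A minimal sketch, assuming the Breast Cancer dataset from the earlier examples, standardized features, and an arbitrary choice of 5 components (`max_offdiag_corr` is a helper defined here for illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)  # labels are not used by PCA
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_scaled)

# Share of total variance captured by each component, in decreasing order
print(pca.explained_variance_ratio_)


def max_offdiag_corr(A):
    """Largest absolute correlation between two different columns of A."""
    corr = np.corrcoef(A, rowvar=False)
    return np.abs(corr - np.eye(corr.shape[0])).max()


print(max_offdiag_corr(X_scaled))  # original features are strongly correlated
print(max_offdiag_corr(X_pca))     # principal components are uncorrelated
```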

However, it's essential to consider the potential downsides of using feature extraction techniques like PCA:

1. **Loss of Interpretability**: PCA transforms the original features into principal components, which may not have clear interpretations. Feature selection retains the original features, making it easier to interpret the importance of each feature.

2. **Information Loss**: PCA may discard some information, as it focuses on capturing the most significant variance. Depending on the application, this information loss can be detrimental.

3. **Non-linearity**: PCA assumes that the relationships between features are linear. If your data has non-linear relationships, PCA may not be the most effective technique.

4. **Scalability**: PCA can be computationally expensive for very high-dimensional datasets, although there are techniques like Randomized PCA that can help mitigate this issue.
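
The interpretability and information-loss points can be illustrated on the same data. A minimal sketch, assuming standardized Breast Cancer features and an arbitrary choice of 2 components: each component mixes all original features, and reconstructing the data from the retained components leaves a residual error.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)

pca = PCA(n_components=2).fit(X_scaled)

# Each component is a weighted mix of all 30 original features;
# show the three features with the largest absolute weights
first_component = pca.components_[0]
top = np.argsort(np.abs(first_component))[::-1][:3]
for i in top:
    print(f"{data.feature_names[i]}: weight {first_component[i]:+.3f}")

# Fraction of the total variance kept by the 2 components
retained = pca.explained_variance_ratio_.sum()
print(f"Variance retained: {retained:.3f}")

# Mapping back from 2 components does not recover the data exactly
X_back = pca.inverse_transform(pca.transform(X_scaled))
mse = np.mean((X_scaled - X_back) ** 2)
print(f"Mean squared reconstruction error: {mse:.3f}")
```

Because the features are standardized, the reconstruction error equals the fraction of variance that was discarded, which makes the trade-off between compression and information loss easy to read off.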

In summary, both feature selection and feature extraction techniques have their places in machine learning, and the choice between them should be based on the specific characteristics of your data, the goals of your project, and the trade-offs you are willing to make in terms of interpretability and information preservation.

---------

Connect with the author of this Notebook - Rocky Jagtiani - [Here](https://www.linkedin.com/in/rocky-jagtiani-3b390649/)

---------


### Bonus NB - on Feature Extraction
https://colab.research.google.com/drive/1gDWDKY2fMIZcassnz8ViWuOy00JIAAtd?usp=sharing