# Feature Selection

Feature selection is a crucial step in the machine learning pipeline. It involves selecting the most important features from your dataset to improve model performance and reduce computational cost. In this article, we will explore various techniques for feature selection in Python using the Scikit-Learn library.

#### What is Feature Selection?
Feature selection is the process of identifying and selecting a subset of relevant features for use in model construction. The goal is to enhance the model's performance by reducing overfitting, improving accuracy, and reducing training time.

#### Why is Feature Selection Important?
Feature selection offers several benefits:

- Improved Model Performance: By removing irrelevant or redundant features, we can improve the accuracy of the model.
- Reduced Overfitting: With fewer features, the model is less likely to learn noise from the training data.
- Faster Computation: Reducing the number of features decreases the computational cost and training time.

#### Types of Feature Selection Methods

Feature selection methods can be broadly classified into three categories:

- Filter Methods: Filter methods use statistical techniques to evaluate the relevance of features independently of the model. Common techniques include correlation coefficients, chi-square tests, and mutual information.
- Wrapper Methods: Wrapper methods use a predictive model to evaluate feature subsets and select the best-performing combination. Techniques include recursive feature elimination (RFE) and forward/backward feature selection.
- Embedded Methods: Embedded methods perform feature selection during the model training process. Examples include Lasso (L1 regularization) and feature importance from tree-based models.
#### Feature Selection Techniques with Scikit-Learn


__Scikit-Learn provides several tools for feature selection, including:__

- Univariate Selection: Univariate selection evaluates each feature individually to determine its importance. Techniques like SelectKBest and SelectPercentile can be used to select the top features based on statistical tests.
- Recursive Feature Elimination (RFE): RFE is a wrapper method that recursively removes the least important features, ranked by the fitted model's coefficients or feature importances. It repeatedly refits the model and eliminates the weakest features until the desired number of features is reached.
- Feature Importance from Tree-based Models: Tree-based models like decision trees and random forests can provide feature importance scores, indicating the importance of each feature in making predictions.
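
The list above does not include an L1-based embedded method, although Lasso-style regularization was mentioned earlier. As a minimal sketch, assuming synthetic data from `make_classification` in place of a real dataset, `SelectFromModel` can wrap an L1-penalized logistic regression and keep only the features whose coefficients are not driven to (near) zero. The value `C=0.05` is an arbitrary choice for illustration and would normally be tuned.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data with 10 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)

# L1-penalized models are sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X_demo)

# A small C means a strong penalty, which pushes more coefficients to zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
selector = SelectFromModel(l1_model).fit(X_scaled, y_demo)

mask = selector.get_support()              # boolean mask of kept features
X_selected = selector.transform(X_scaled)  # data restricted to kept features
print("Kept feature indices:", mask.nonzero()[0])
print("Shape after selection:", X_selected.shape)
```

Unlike `SelectKBest` or RFE, this approach does not take a target number of features; the count is controlled indirectly through the penalty strength.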

## Practical Implementation of Feature Selection with Scikit-Learn
This section implements the feature selection techniques described above using Scikit-Learn.

## Data Preparation
First, load a dataset and split it into features and target variables.

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [15]:
mdf = pd.read_csv("../data/processed/pop_train_encoded_.csv")
print('Dataset shape when loaded:', mdf.shape)

# Drop the target and the columns that contain negative values
X = mdf.drop(['status_group', 'gps_height_bin', 'lat_lon_bin'], axis=1)
y = mdf['status_group']

print('Dataset shape after dropping target and negative value columns:', X.shape)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Dataset shape when loaded: (59400, 24)
Dataset shape after dropping target and negative value columns: (59400, 21)
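
The comment in the cell above indicates that the two binned columns were dropped because they contain negative values, which matters because Scikit-Learn's `chi2` scorer raises an error on negative inputs. As a minimal sketch, such columns could be detected programmatically rather than listed by hand. The toy DataFrame and its column names below are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Assumption: a toy frame standing in for the real feature matrix.
toy = pd.DataFrame({
    "height_bin": [-1, 0, 2, 3],
    "population": [10, 0, 250, 40],
    "quality_score": [1, 2, 2, 3],
})

# Collect every column that holds at least one negative value.
negative_cols = [col for col in toy.columns if (toy[col] < 0).any()]
print("Columns with negative values:", negative_cols)

# Drop them so the remaining columns are safe to pass to chi2.
toy_dropped = toy.drop(columns=negative_cols)
print("Remaining columns:", list(toy_dropped.columns))
```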


### Method 1: Univariate Selection in Python with Scikit-Learn

First, we'll use SelectKBest with the chi-square test to select the top 2 features. Note that `chi2` only accepts non-negative feature values.

In [16]:
from sklearn.feature_selection import SelectKBest, chi2

# Apply SelectKBest with chi2
select_k_best = SelectKBest(score_func=chi2, k=2)
X_train_k_best = select_k_best.fit_transform(X_train, y_train)

print("Selected features:", X_train.columns[select_k_best.get_support()])

Selected features: Index(['amount_tsh', 'gps_height'], dtype='object')
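
One caveat when reading chi-square scores on raw numeric columns is that the statistic, as Scikit-Learn computes it, grows in proportion to a column's magnitude. The sketch below illustrates this on synthetic data (an assumption, since the real dataset is not available here): multiplying one column by 1000 multiplies its score by the same factor while leaving the other scores untouched. Whether this affects the selection above depends on the column ranges in the actual dataset, which are not shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

# Assumption: synthetic stand-in data with 4 features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0,
    random_state=0,
)

# chi2 needs non-negative inputs, so map every column into [0, 1].
X_unit = MinMaxScaler().fit_transform(X_demo)
scores_unit, _ = chi2(X_unit, y_demo)

# Rescale the first column, as if it were recorded in different units.
X_rescaled = X_unit.copy()
X_rescaled[:, 0] *= 1000
scores_rescaled, _ = chi2(X_rescaled, y_demo)

print("Scores on [0, 1] data:          ", np.round(scores_unit, 3))
print("Scores after rescaling column 0:", np.round(scores_rescaled, 3))
```

If columns live on very different scales, bringing them to a common range first (or using a scale-insensitive scorer such as `mutual_info_classif`) may give a fairer comparison.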


### Method 2: Recursive Feature Elimination

Next, we'll use RFE with a logistic regression model to select the top 2 features.

In [17]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Apply RFE with logistic regression
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=2)
X_train_rfe = rfe.fit_transform(X_train, y_train)

print("Selected features:", X_train.columns[rfe.get_support()])

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Selected features: Index(['source_class', 'water_quantity_score'], dtype='object')
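
The convergence warning in the output above means the solver stopped before converging, so the coefficients that RFE ranked may not be reliable. The warning itself suggests two remedies: scaling the data and raising `max_iter`. A minimal sketch of both, assuming synthetic data with one deliberately inflated column in place of the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data with 8 features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0,
    random_state=0,
)
# Inflate one column to mimic unscaled raw data.
X_demo[:, 0] *= 10_000

# Standardize so all coefficients are on a comparable scale.
X_scaled = StandardScaler().fit_transform(X_demo)

# Give the solver more iterations than the default.
rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe_demo.fit(X_scaled, y_demo)

print("Selected feature indices:", rfe_demo.get_support(indices=True))
print("Elimination ranking (1 = kept):", rfe_demo.ranking_)
```

On a real train/test split, the scaler would be fitted on the training data only and then applied to the test data. Scaling also matters for RFE itself, because it compares coefficient magnitudes across features.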


### Method 3: Tree-Based Feature Importance

Finally, we'll use a random forest classifier to determine feature importance.

In [18]:
from sklearn.ensemble import RandomForestClassifier

# Train random forest and get feature importances
model = RandomForestClassifier()
model.fit(X_train, y_train)
importances = model.feature_importances_

# Display feature importances
feature_importances = pd.Series(importances, index=X_train.columns)
print(feature_importances.sort_values(ascending=False))

lat_long_interaction                  0.132797
water_quantity_score                  0.119452
subvillage                            0.084168
gps_height                            0.075946
pumpage_safety_inter                  0.065998
waterpoint_type                       0.064355
ward                                  0.062493
population                            0.052608
funder_installer_pair_grouped_freq    0.049213
subvillage_funder_installer_freq      0.043641
quantity_extraction_inter             0.037664
lga                                   0.036835
amount_tsh                            0.030196
payment_type                          0.025392
district_code                         0.024446
region_code                           0.024268
basin                                 0.018337
geo_cluster                           0.017851
water_quality_score                   0.013113
management_group                      0.010949
source_class                          0.010279
dtype: float64
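
The scores above are impurity-based importances, which are computed on the training data and are known to favor features with many distinct values. Permutation importance on held-out data is a common cross-check. The following is a minimal sketch on synthetic data (an assumption, not the article's dataset) using `sklearn.inspection.permutation_importance`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in data with 6 features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)

# Shuffle each feature in turn on the test set and measure the score drop.
result = permutation_importance(
    forest, X_te, y_te, n_repeats=5, random_state=0
)

# Print features from most to least important.
for idx in result.importances_mean.argsort()[::-1]:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature {idx}: {mean:.4f} +/- {std:.4f}")
```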

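Features that can likely be ignored, since each scores below 0.025 in the ranking above:
'source_class', 'geo_cluster', 'district_code'

Note that RFE selected 'source_class' as one of its top 2 features, so the methods disagree on this column. It is worth checking the effect on validation accuracy before dropping it.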
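
Before removing candidate features for good, one way to check the decision is to compare cross-validated accuracy with and without them. This is a minimal sketch on synthetic data; the column names `f0` to `f7` and the choice of columns to drop are hypothetical and only illustrate the comparison.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Assumption: synthetic stand-in data. With shuffle=False the informative
# features come first (f0..f2) and the remaining columns are pure noise.
X_arr, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Hypothetical low-importance columns we are considering dropping.
candidates_to_drop = ["f5", "f6", "f7"]


def mean_cv_accuracy(features):
    """Return mean 5-fold cross-validated accuracy of a random forest."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(model, features, y_demo, cv=5).mean()


acc_all = mean_cv_accuracy(X_demo)
acc_reduced = mean_cv_accuracy(X_demo.drop(columns=candidates_to_drop))

print(f"All features:     {acc_all:.3f}")
print(f"Reduced features: {acc_reduced:.3f}")
```

If the reduced set scores about the same as the full set, dropping the columns is probably safe; a clear drop suggests at least one of them carries signal.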

Feature selection is an essential part of the machine learning workflow. By selecting the most relevant features, we can build more efficient and often more accurate models. Scikit-Learn provides a variety of tools for this, including univariate selection, recursive feature elimination, and feature importance from tree-based models. As the results above show, different methods can rank the same features very differently, so it is worth comparing several of them and validating any reduced feature set on held-out data.

By following the steps outlined in this article, you can perform feature selection in Python using Scikit-Learn and apply it to your own machine learning projects.