# Feature Selection

Feature selection is the process of selecting a subset of relevant and informative features from a larger set of available features for use in machine learning algorithms. The aim is to reduce the dimensionality of the data and improve the accuracy and efficiency of the model.

There are several feature selection techniques. Let's look at two of the most popular.

## Forward Feature Selection

Forward feature selection starts with an empty set of features and adds one feature at a time. At each step, every remaining feature is tried together with the features already selected, and the one that gives the best model performance is added. This process continues until a stopping criterion is met, such as reaching a predefined number of features or a specific level of accuracy.

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Create custom dataset
X, y = make_classification(n_samples=800, n_features=10, n_informative=5, n_redundant=0, random_state=42)

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Implement forward feature selection
# (candidates are scored on the test split to keep the example short)
selected_features = []
for _ in range(X_train.shape[1]):
    best_accuracy = 0
    best_feature = None
    for j in range(X_train.shape[1]):
        if j not in selected_features:
            # Try feature j together with the features chosen so far
            features = selected_features + [j]
            model = LogisticRegression()
            model.fit(X_train[:, features], y_train)
            accuracy = model.score(X_test[:, features], y_test)
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_feature = j
    selected_features.append(best_feature)
    # Report the score of the subset that was actually selected
    print("Selected Features (Forward):", selected_features, "Score:", best_accuracy)

In the recorded run, the features were added in the order 9, 0, 5, 6, 2, 1, 7, 8, 4, 3. Feature 9 alone reached a test accuracy of 0.68125, and the full set of ten features reached 0.84375.


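This code creates a synthetic dataset with `make_classification`, splits it into training and testing sets, and then runs forward feature selection until all ten features have been added. It assumes a binary classification problem with logistic regression as the classifier, but you can swap in other classifiers and adapt it to other types of problems. Because candidates are scored on the test set, the printed scores are optimistic; use a separate validation split or cross-validation when the estimate matters.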
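The loop above runs until every feature has been added, so it never exercises a stopping criterion. The following is a minimal sketch of one way to add one, assuming cross-validated accuracy on the training data as the selection score. The helper name `forward_select` and its parameters `max_features` and `min_gain` are hypothetical, chosen only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split


def forward_select(X, y, max_features=None, min_gain=0.0, cv=5):
    """Greedy forward selection scored by cross-validated accuracy.

    Stops when max_features is reached or when the best candidate
    improves the score by no more than min_gain. (Hypothetical helper.)
    """
    n_features = X.shape[1]
    limit = n_features if max_features is None else min(max_features, n_features)
    selected, best_score = [], 0.0
    while len(selected) < limit:
        # Score every remaining candidate together with the current subset
        scores = {}
        for j in range(n_features):
            if j not in selected:
                cols = selected + [j]
                model = LogisticRegression(max_iter=1000)
                scores[j] = cross_val_score(model, X[:, cols], y, cv=cv).mean()
        candidate = max(scores, key=scores.get)
        if scores[candidate] - best_score <= min_gain:
            break  # no candidate helps enough, so stop early
        selected.append(candidate)
        best_score = scores[candidate]
    return selected, best_score


X, y = make_classification(n_samples=800, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Selection sees only the training data; the test split is used once at the end
selected, cv_score = forward_select(X_train, y_train, max_features=5)
final_model = LogisticRegression(max_iter=1000).fit(X_train[:, selected], y_train)
test_score = final_model.score(X_test[:, selected], y_test)
print("Selected:", selected, "CV accuracy:", round(cv_score, 3),
      "Test accuracy:", round(test_score, 3))
```

Keeping the test split out of the selection loop means the final test accuracy is not biased by the search itself. Raising `min_gain` makes the search stop sooner and return a smaller subset.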

## Backward Feature Selection

Backward feature selection (also called backward elimination), on the other hand, starts with all available features and removes one feature at a time. At each step, every remaining feature is left out in turn, and the feature whose removal hurts model performance the least is dropped. This process continues until a stopping criterion is met, such as reaching a predefined number of features or a specific level of accuracy.

In [2]:
# Implement backward feature elimination
# (candidates are scored on the test split to keep the example short)
selected_features = list(range(X_train.shape[1]))
for _ in range(X_train.shape[1] - 1):
    best_accuracy = -1
    feature_to_remove = None
    for j in selected_features:
        # Evaluate the current subset with feature j left out
        features = [f for f in selected_features if f != j]
        model = LogisticRegression()
        model.fit(X_train[:, features], y_train)
        accuracy = model.score(X_test[:, features], y_test)
        # Keep the removal that leaves the most accurate model
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            feature_to_remove = j
    selected_features.remove(feature_to_remove)
    print("Selected Features (Backward):", selected_features, "Score:", best_accuracy)

Each pass prints the features that remain after the removal and the test accuracy of the model fitted on them.


This code reuses the dataset and the train/test split from the forward example and implements backward feature elimination, stopping when a single feature remains. The same assumptions apply: a binary classification problem with logistic regression as the classifier, both of which you can change to suit other classifiers and other types of problems.
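For comparison with the hand-written loop, scikit-learn (version 0.24 and later) provides `SequentialFeatureSelector`, which runs the same kind of greedy search with cross-validation. The sketch below assumes the same synthetic dataset as above. The choice of five features to keep is arbitrary and only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,  # stopping criterion: keep five features (arbitrary)
    direction="backward",    # "forward" runs the additive variant instead
    scoring="accuracy",
    cv=5,                    # candidates are scored by 5-fold cross-validation
)
selector.fit(X_train, y_train)

# Indices of the features that survived elimination
kept = selector.get_support(indices=True)

# Refit on the reduced training data and evaluate once on the held-out test set
model = LogisticRegression(max_iter=1000).fit(selector.transform(X_train), y_train)
print("Kept features:", kept.tolist())
print("Test accuracy:", model.score(selector.transform(X_test), y_test))
```

Because the selector is fitted on the training split only, the test accuracy printed at the end is not influenced by the search.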

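Both forward and backward feature selection are greedy searches, and each has its own advantages and limitations. Forward feature selection tends to be more computationally efficient when only a small subset is wanted, because it stops after a few rounds and the early models are small. It can also be used when there are more features than samples, a setting where the full model that backward selection starts from may not be feasible to fit. However, a feature is never removed once it has been added, so the final subset may include features that later additions have made redundant, and features that are only useful in combination with others may be picked late or not at all.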

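In contrast, backward feature selection starts from the full model, so each feature is judged in the presence of all the others, which can help retain features that are only useful jointly. However, it is more expensive when there are many features and only a few are to be kept, and a feature that has been removed is never reconsidered, so it may discard a feature that would have had a significant impact on the accuracy of a smaller model.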
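The cost side of this comparison can be made concrete by counting model fits. The sketch below assumes one fit per candidate per round, as in the loops above (with k-fold cross-validation, multiply by the number of folds). The function names are hypothetical.

```python
def forward_fits(n_features, k):
    """Model fits needed to grow from 0 to k features."""
    # Round `step` has n_features - step candidates left to try
    return sum(n_features - step for step in range(k))


def backward_fits(n_features, k):
    """Model fits needed to shrink from n_features down to k features."""
    # Round `step` tries leaving out each of the n_features - step remaining features
    return sum(n_features - step for step in range(n_features - k))


for k in (2, 5, 8):
    print(k, forward_fits(10, k), backward_fits(10, k))
# prints:
# 2 19 52
# 5 40 40
# 8 52 19
```

With ten features, forward selection needs fewer fits when the target subset is small, and backward elimination needs fewer when most features are kept. The count ignores that forward selection fits smaller models in its early rounds, which reduces its cost further.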

Ultimately, the choice between forward and backward feature selection depends on the specific needs and characteristics of the dataset and the goals of the analysis.