
# Section 2: Python example - feature selection techniques

Feature selection is a critical step in preparing data for machine learning models. It identifies the features that contribute most to a model's predictive power and sets aside those that add noise or duplicate information. This section walks through practical Python examples of several feature selection techniques using scikit-learn, which provides a range of methods for automated feature selection.

1. Setting Up the Environment:

Before implementing feature selection techniques, ensure your Python environment includes scikit-learn. If not already installed, you can add it via pip:

In [None]:
%pip install scikit-learn

2. Importing Required Libraries:

Along with scikit-learn, we will use pandas for handling data and NumPy for numerical operations:

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

3. Preparing the Data:

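For this example, let's use a synthetic dataset that stands in for customer data from a bank marketing campaign. It has 1,000 samples and 20 features, of which 2 are informative, 10 are redundant (linear combinations of the informative ones), and the remaining 8 are random noise. Because `shuffle=False`, the columns keep that order: `Feature_0` and `Feature_1` are informative, `Feature_2` to `Feature_11` are redundant, and `Feature_12` to `Feature_19` are noise.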

In [None]:
# Create a synthetic dataset
from sklearn.datasets import make_classification
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=2, n_redundant=10,
    n_repeated=0, n_classes=2, random_state=42, shuffle=False,
)
# Convert to DataFrame
df = pd.DataFrame(X, columns=[f'Feature_{i}' for i in range(20)])
df['Target'] = y
# Split data into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('Target', axis=1), df['Target'], test_size=0.3, random_state=42
)
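
As a quick sanity check on the dataset above, the following sketch (not part of the original notebook) regresses each remaining column on the two informative columns. It assumes the same `make_classification` parameters, so that columns 0-1 are informative, 2-11 redundant, and 12-19 noise; the helper name `r2_from_informative` is made up for illustration. Redundant columns should be reproduced almost perfectly, while noise columns should not.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression

# Same parameters as the dataset above (shuffle=False keeps the column order)
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=2, n_redundant=10,
    n_repeated=0, n_classes=2, random_state=42, shuffle=False,
)

informative = X[:, :2]   # columns 0-1
redundant = X[:, 2:12]   # columns 2-11
noise = X[:, 12:]        # columns 12-19


def r2_from_informative(block):
    """R^2 of predicting each column of `block` from the informative columns."""
    return np.array([
        LinearRegression().fit(informative, col).score(informative, col)
        for col in block.T
    ])


print("Redundant columns R^2:", r2_from_informative(redundant).round(3))
print("Noise columns R^2:    ", r2_from_informative(noise).round(3))
```

Columns that can be rebuilt from others add no new information, which is one reason a smaller feature set can perform as well as the full one.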

4. Feature Selection with Univariate Statistical Tests:

Univariate statistical tests score each feature independently against the target variable, and you can then keep the k features with the strongest relationship to it:

In [None]:
# Score every feature with the ANOVA F-test and keep the 10 highest-scoring ones
best_features = SelectKBest(score_func=f_classif, k=10)
best_features.fit(X_train, y_train)
# Pair each feature name with its score for easier inspection
feature_scores = pd.DataFrame({'Feature': X_train.columns, 'Score': best_features.scores_})
print(feature_scores.nlargest(10, 'Score'))  # Print the 10 best features
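
The imports above also bring in `mutual_info_classif`, which can pick up non-linear relationships that the F-test may miss. The sketch below is one possible way to use it, not the notebook's own method: it rebuilds the same synthetic data so that it runs on its own, fixes the scorer's seed with `functools.partial` (an assumption made here for reproducibility), and uses `get_support()` and `transform()` to actually reduce the data to the selected columns.

```python
from functools import partial

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

# Rebuild the synthetic data so this cell is self-contained
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=2, n_redundant=10,
    n_repeated=0, n_classes=2, random_state=42, shuffle=False,
)
X = pd.DataFrame(X, columns=[f'Feature_{i}' for i in range(20)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# mutual_info_classif is randomized, so fix its seed for repeatable scores
score_func = partial(mutual_info_classif, random_state=42)
selector = SelectKBest(score_func=score_func, k=10)
selector.fit(X_train, y_train)

# Boolean mask -> names of the selected columns
selected_columns = X_train.columns[selector.get_support()]
print(list(selected_columns))

# Reduce both splits with the selector fitted on the training data only
X_train_reduced = selector.transform(X_train)
X_test_reduced = selector.transform(X_test)
print(X_train_reduced.shape, X_test_reduced.shape)
```

Fitting the selector on the training split and only transforming the test split keeps information from the test set out of the selection step.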

5. Feature Selection Using Feature Importance from Model:

You can also use an ensemble method like Random Forest to estimate the importance of features:

In [None]:
# Train a random forest classifier (fixed seed so the importances are reproducible)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Get impurity-based feature importances and sort them in descending order
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]
# Print the feature ranking with feature names
print("Feature ranking:")
for rank, idx in enumerate(indices, start=1):
    print(f"{rank}. {X_train.columns[idx]} ({importances[idx]:.4f})")
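
Rather than ranking importances and slicing by hand, scikit-learn's `SelectFromModel` can wrap the estimator and apply a threshold for you. The following is a minimal sketch under assumptions of my own: the data is regenerated so the cell runs on its own, and `threshold="median"` is chosen purely for illustration (it keeps the features whose importance is at least the median importance).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Rebuild the synthetic data so this cell is self-contained
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=2, n_redundant=10,
    n_repeated=0, n_classes=2, random_state=42, shuffle=False,
)
X = pd.DataFrame(X, columns=[f'Feature_{i}' for i in range(20)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the forest inside the selector and keep features at or above the median importance
forest = RandomForestClassifier(n_estimators=100, random_state=42)
selector = SelectFromModel(forest, threshold="median")
selector.fit(X_train, y_train)

kept = X_train.columns[selector.get_support()]
print(f"Kept {len(kept)} of {X_train.shape[1]} features:")
print(list(kept))

# The fitted forest is available as selector.estimator_
importances = selector.estimator_.feature_importances_
```

A numeric threshold or the `max_features` argument could be used instead of `"median"` if a specific cut-off or feature count is wanted.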

6. Evaluating Model Performance:

To see how effective feature selection is, you can compare the performance of the model with and without feature selection:

In [None]:
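# Baseline: accuracy of the model from step 5, which was trained on all features
baseline_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy with all features: {baseline_accuracy:.2f}")
# Select the 10 most important features
selected_features = X_train.columns[indices[:10]]
# Rebuild the model on the selected features only
model.fit(X_train[selected_features], y_train)
y_pred = model.predict(X_test[selected_features])
# Check the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy with selected features: {accuracy:.2f}")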
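
A single train/test split can give a noisy comparison. One way to make it more reliable, sketched here as a suggestion rather than as the notebook's method, is to use cross-validation and to put the selection step inside a `Pipeline` so that it is re-fitted on each training fold. Note that this sketch swaps in `SelectKBest` with `f_classif` as the selection step instead of the random forest importances used above; the fold count and seeds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Rebuild the synthetic data so this cell is self-contained
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=2, n_redundant=10,
    n_repeated=0, n_classes=2, random_state=42, shuffle=False,
)

# Shuffle within the folds because the generated samples are ordered
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Baseline: random forest on all 20 features
baseline = RandomForestClassifier(n_estimators=100, random_state=42)

# Selection + model in one pipeline, so selection only sees each training fold
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
])

scores_all = cross_val_score(baseline, X, y, cv=cv, scoring="accuracy")
scores_selected = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")

print(f"All features:      {scores_all.mean():.3f} +/- {scores_all.std():.3f}")
print(f"Selected features: {scores_selected.mean():.3f} +/- {scores_selected.std():.3f}")
```

If the two means differ by less than the spread across folds, the selection step has not clearly helped or hurt on this data.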

7. Conclusion:

These examples show how reducing the number of features in your dataset can improve model performance, reduce overfitting, and help generalization, although the size of the gain depends on the dataset and should be checked on held-out data. Libraries like scikit-learn make these techniques straightforward to implement, so data scientists can spend more time on model optimization and less on manual feature selection. Univariate tests and model-based importances together give a practical starting point for automating the selection of the most informative features.