# **Feature Selection By Filter Method**

Let's consider a dataset containing various features related to houses, such as size, number of bedrooms, number of bathrooms, location, and age. We want to predict the price of the houses based on these features. However, not all features may be equally relevant for predicting house prices. We can use feature selection techniques to identify the most important features for our prediction task.

In [8]:
#Loading Data
import pandas as pd

# Load the dataset (example)
data = {
    'size': [1000, 1500, 1200, 1800],
    'bedrooms': [2, 3, 2, 4],
    'bathrooms': [1, 2, 1, 2],
    'location': ['A', 'B', 'A', 'C'],
    'age': [10, 5, 8, 15],
    'price': [200000, 300000, 250000, 350000]
}

df = pd.DataFrame(data)
X = df.drop(columns=['price'])  # Features
y = df['price']  # Target variable
y = pd.to_numeric(y)  # Ensure the target is numeric

In [27]:
print(y.head())

0    200000.0
1    300000.0
2    250000.0
3    350000.0
Name: price, dtype: float64


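We can use various feature selection techniques to identify the most relevant features. One common technique is to calculate feature importance scores using a tree-based model such as a random forest. Strictly speaking, this is an embedded method rather than a filter method, because the scores come from a fitted model instead of a model-independent statistic.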

In [29]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
# Define the preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(), ['location'])  # Apply one-hot encoding to 'location'
    ],
    remainder='passthrough'
)

Based on the feature importances, we can select the top k most important features.

In [64]:
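# Define the RandomForestRegressor model
model = RandomForestRegressor()

# Define the pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

# Train the model
pipeline.fit(X, y)

# Get feature importances (one value per transformed column, not per original column)
feature_importances = pipeline.named_steps['model'].feature_importances_
encoded_names = pipeline.named_steps['preprocessor'].get_feature_names_out()

# Map each transformed column back to its original column,
# e.g. 'onehot__location_A' -> 'location', 'remainder__size' -> 'size'
original_names = [
    'location' if name.startswith('onehot__') else name.split('__', 1)[1]
    for name in encoded_names
]

# Sum the importances of the one-hot columns so 'location' gets a single score
importance_by_column = pd.Series(feature_importances, index=original_names).groupby(level=0).sum()

k = 3  # Number of features to keep
k = min(k, len(X.columns))
selected_features = importance_by_column.sort_values(ascending=False).index[:k]
print("Selected features:", selected_features)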


Selected features: Index(['location', 'age', 'size'], dtype='object')


In [37]:
print(k)


3
[3 6 0]
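
For comparison, a filter method scores each feature without fitting a model. The following is a minimal sketch, assuming absolute Pearson correlation with the target as the score and leaving out the categorical `location` column (it would need encoding first). The names `houses`, `scores`, `top_k`, and `filter_selected` are illustrative.

```python
import pandas as pd

# Numeric columns of the toy house table (rebuilt so the snippet runs on its own)
houses = pd.DataFrame({
    'size': [1000, 1500, 1200, 1800],
    'bedrooms': [2, 3, 2, 4],
    'bathrooms': [1, 2, 1, 2],
    'age': [10, 5, 8, 15],
    'price': [200000, 300000, 250000, 350000],
})

# Score every numeric feature by |Pearson correlation| with the target
scores = houses.drop(columns=['price']).corrwith(houses['price']).abs()

top_k = 2  # illustrative choice
filter_selected = scores.sort_values(ascending=False).index[:top_k].tolist()

print(scores.sort_values(ascending=False))
print("Selected by correlation filter:", filter_selected)
```

On this table `size` and `bedrooms` have the strongest correlation with `price`. With only four rows the ranking is illustrative rather than statistically meaningful.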


# **Feature Extraction using PCA**

In [39]:
from sklearn.decomposition import PCA
import numpy as np

# Generate some example data
X = np.random.rand(100, 10)  # 100 samples, 10 features

# Apply PCA for dimensionality reduction
pca = PCA(n_components=3)  # Reduce to 3 principal components
X_pca = pca.fit_transform(X)

print("Original data shape:", X.shape)
print("Transformed data shape (after PCA):", X_pca.shape)


Original data shape: (100, 10)
Transformed data shape (after PCA): (100, 3)
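
A natural follow-up question is how much of the original variance the retained components keep. The sketch below reads PCA's `explained_variance_ratio_` attribute; the fixed seed and the `_demo` names are assumptions added so the run is repeatable.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
X_demo = rng.random((100, 10))   # 100 samples, 10 features, as above

pca_demo = PCA(n_components=3).fit(X_demo)

# Fraction of total variance captured by each retained component
ratios = pca_demo.explained_variance_ratio_
print("Explained variance ratio per component:", ratios)
print("Total variance retained:", ratios.sum())
```

Because the example data is uniform random noise with no correlated features, no single direction dominates and three components retain only part of the variance. Data with strongly correlated features would typically retain far more.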


# **Dimensionality**


Dimensionality refers to the number of features or variables that are used to represent each data point in a dataset. In other words, it measures the number of dimensions or axes needed to describe a dataset's feature space.

For example:

1. In a dataset containing only one feature (e.g., height), each data point would be represented as a single point along a one-dimensional axis.

2. In a dataset containing two features (e.g., height and weight), each data point would be represented as a point in a two-dimensional space, with one axis representing height and the other representing weight.

3. In a dataset containing three features (e.g., height, weight, and age), each data point would be represented as a point in a three-dimensional space, with one axis representing height, another representing weight, and the third representing age.

In general, as the number of features or dimensions increases, the complexity of the dataset's representation grows, and the computational and analytical challenges associated with working with the data also increase. This is known as the curse of dimensionality. Dimensionality reduction techniques are often employed to address these challenges and extract the most relevant information from high-dimensional datasets.

# **Curse of dimensionality**


The "curse of dimensionality" refers to various challenges and issues that arise when dealing with high-dimensional data in machine learning and data analysis. As the number of dimensions (features or variables) in a dataset increases, the amount of data required to adequately cover the space grows exponentially. This can lead to various problems such as increased computational complexity, sparsity of data points, and difficulty in visualization and interpretation.

Here are some key aspects of the curse of dimensionality:

**Increased Computational Complexity:**

Algorithms operating in high-dimensional spaces often require more computational resources and time for processing and analysis.
The cost of a single distance calculation grows only linearly with the number of dimensions, but the volume of the feature space, and with it the number of samples or grid cells needed to cover that space, grows exponentially.

**Sparsity of Data:**

In high-dimensional spaces, data points become increasingly sparse, meaning that the available data is spread thinly across the feature space.
Sparse data can lead to difficulties in accurately estimating statistical properties, making it challenging to generalize from the data.

**Difficulty in Visualization and Interpretation:**

It becomes increasingly difficult to visualize and interpret data in high-dimensional spaces beyond three dimensions.
Traditional visualization techniques, such as scatter plots and heatmaps, are limited in their effectiveness for high-dimensional data.

**Overfitting:**

High-dimensional spaces increase the risk of overfitting, where a model captures noise or irrelevant patterns in the data rather than the underlying structure.
Overfitting can lead to poor generalization performance when applying the model to new, unseen data.

To address the challenges posed by high-dimensional data, dimensionality reduction techniques are often employed.
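
One way to make the sparsity effect concrete is to watch distances lose contrast as dimensions are added. This is a minimal sketch, assuming points drawn uniformly from the unit hypercube and a "relative contrast" measure (farthest minus nearest distance, divided by the nearest); the point count and dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 500

contrast = {}
for dim in (2, 10, 100, 1000):
    points = rng.random((n_points, dim))   # uniform samples in the unit hypercube
    query = rng.random(dim)                # one reference point
    dists = np.linalg.norm(points - query, axis=1)
    # Relative contrast: how much farther the farthest point is than the nearest
    contrast[dim] = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}  relative contrast={contrast[dim]:.3f}")
```

The contrast shrinks steadily as `dim` grows: in high dimensions the nearest and farthest points sit at almost the same distance, which weakens methods that rely on nearest neighbors.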

# **Dimensionality reduction**


Dimensionality reduction refers to the process of reducing the number of input variables or features in a dataset while preserving the most relevant information.

There are two main approaches to dimensionality reduction in machine learning:

**Feature Selection:**

Feature selection methods aim to identify a subset of the most informative features from the original feature set.
Common techniques include filter methods (e.g., correlation-based feature selection), wrapper methods (e.g., recursive feature elimination), and embedded methods (e.g., L1 regularization).


**Feature Extraction:**

Feature extraction methods transform the original high-dimensional data into a lower-dimensional representation using techniques such as Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), and autoencoders.
These methods learn new representations of the data that capture the most relevant information while reducing dimensionality.

By reducing the dimensionality of the data, these techniques can alleviate the curse of dimensionality, improve computational efficiency, and mitigate overfitting in machine learning models. Feature selection can also make models easier to interpret, because the retained features keep their original meaning, whereas extracted features are combinations of the originals. It is essential to select and apply dimensionality reduction methods based on the characteristics of the data and the specific requirements of the task at hand.

# **Feature selection**

Feature selection is the process of selecting a subset of relevant features (variables or attributes) from the original set of features in a dataset. This subset of features is chosen based on certain criteria to improve model performance, reduce computational complexity, and enhance interpretability. Feature selection is particularly important when dealing with high-dimensional datasets to mitigate the curse of dimensionality and avoid overfitting.

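There are three general classes of feature selection algorithms: filter methods, which score features with a statistic that does not depend on any model; wrapper methods, which search over feature subsets by repeatedly training a model; and embedded methods, which select features as part of model training.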
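
The three classes can be sketched side by side with scikit-learn. This is an illustration under assumptions, not a recommended recipe: the data is synthetic (`make_regression`), and the choices of `f_regression`, `LinearRegression`, `Lasso(alpha=1.0)`, and three features to keep are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: 10 features, only 3 of which actually drive the target
X_syn, y_syn = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# Filter: score each feature independently with a univariate F-test
filter_sel = SelectKBest(score_func=f_regression, k=3).fit(X_syn, y_syn)

# Wrapper: repeatedly fit a model and drop the weakest feature (RFE)
wrapper_sel = RFE(estimator=LinearRegression(), n_features_to_select=3).fit(X_syn, y_syn)

# Embedded: L1 regularization drives some coefficients to zero during training
embedded_sel = SelectFromModel(Lasso(alpha=1.0)).fit(X_syn, y_syn)

for name, sel in [("filter", filter_sel), ("wrapper", wrapper_sel), ("embedded", embedded_sel)]:
    print(name, sel.get_support(indices=True))
```

Each selector exposes `get_support`, so the chosen column indices can be compared directly. The filter and wrapper are told how many features to keep, while the embedded selector keeps whatever the L1 penalty leaves with a non-zero coefficient.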

# **Feature extraction**

Feature extraction is the process of transforming raw data into a set of features (representations) that are more suitable for modeling and analysis. In the context of machine learning, feature extraction aims to capture the most relevant information from the original data while reducing its dimensionality or complexity.

Here's an overview of the feature extraction process:

**Data Representation:**

Feature extraction begins with the selection or creation of an appropriate representation of the raw data.
This representation could be raw data itself (e.g., pixel values in images, word frequencies in text) or preprocessed data (e.g., normalized values, tokenized text).

**Feature Engineering:**

Feature engineering involves creating new features or transforming existing features to capture relevant information.
This may include techniques such as scaling, normalization, binarization, one-hot encoding, or creating new features through mathematical operations.

**Dimensionality Reduction:**

Dimensionality reduction techniques are often applied to reduce the number of features while preserving the most important information.
Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), t-distributed Stochastic Neighbor Embedding (t-SNE), and autoencoders are common dimensionality reduction methods used in feature extraction.

**Feature Selection:**

Feature selection aims to select a subset of the most relevant features from the original set of features.
This can be done using various criteria such as statistical tests, feature importance scores, or domain knowledge.

**Model-Specific Feature Extraction:**

Some machine learning models, particularly deep learning models, may involve specific feature extraction techniques tailored to the model architecture.
For example, convolutional neural networks (CNNs) automatically extract hierarchical features from images, while recurrent neural networks (RNNs) may learn sequential features from text data.

Feature extraction is crucial for improving the performance, interpretability, and efficiency of machine learning models. It helps in reducing the dimensionality of the data, mitigating the curse of dimensionality, removing irrelevant features, and focusing on the most informative aspects of the data.
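
The feature engineering and dimensionality reduction steps above are often chained. A minimal sketch, assuming hypothetical numeric data whose columns sit on very different scales, with standardization followed by PCA; the step names `scale` and `pca` and the choice of two components are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical raw data: 200 samples, 6 features on very different scales
raw = rng.normal(size=(200, 6)) * np.array([1, 10, 100, 1000, 0.1, 0.01])

extractor = Pipeline(steps=[
    ('scale', StandardScaler()),    # put every feature on a comparable scale
    ('pca', PCA(n_components=2)),   # project onto the 2 leading components
])

features = extractor.fit_transform(raw)
print("Raw shape:", raw.shape)
print("Extracted shape:", features.shape)
```

Scaling first matters because PCA follows variance: without it, the column with the largest numeric range would dominate the components regardless of how informative it is.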