<table style="background-color:#F5F5F5;" width="100%">
<tr><td style="background-color:#F5F5F5;"><img src="../images/logo.png" width="150" align='right'/></td></tr>     <tr><td>
            <h2><center>Aprendizagem Automática em Engenharia Biomédica</center></h2>
            <h3><center>1st Semester - 2024/2025</center></h3>
            <h4><center>Universidade Nova de Lisboa - Faculdade de Ciências e Tecnologia</center></h4>
</td></tr>
    <tr><td><h2><b><center>Lab 9 - Dimensionality Reduction</center></b></h2>
    <h4><i><b><center>Human Activity Recognition Dataset</center></b></i></h4></td></tr>
</table>

## 1. The Curse of Dimensionality

The Curse of Dimensionality is a well-known problem in data modelling. The term was coined by Richard Bellman in the 1950s and refers to the difficulties that arise when working in high-dimensional spaces.

When we increase the number of dimensions of a problem, i.e. the number of features, the volume of the space grows exponentially, so a fixed number of samples becomes increasingly sparse.

<div>
<img src="https://drek4537l1klr.cloudfront.net/rhys/Figures/fig13-1_alt.jpg" width="600"/>
</div>
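
The sparsity effect can be checked numerically. The sketch below is only an illustration under assumed settings (200 random points in the unit hypercube, an arbitrary choice): it measures the average distance from each point to its nearest neighbour as the number of dimensions grows.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 200  # fixed sample budget (arbitrary value for illustration)

mean_nn_distance = {}
for n_dims in (1, 2, 10, 100):
    # Draw points uniformly from the unit hypercube [0, 1]^n_dims
    points = rng.random((n_samples, n_dims))

    # Squared pairwise Euclidean distances: |a|^2 + |b|^2 - 2 a.b
    sq_norms = (points ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * points @ points.T
    sq_dists = np.maximum(sq_dists, 0)   # clip tiny negative round-off
    np.fill_diagonal(sq_dists, np.inf)   # a point is not its own neighbour

    # Average distance from each point to its nearest neighbour
    mean_nn_distance[n_dims] = np.sqrt(sq_dists.min(axis=1)).mean()
    print(f"{n_dims:>3} dims: mean nearest-neighbour distance = "
          f"{mean_nn_distance[n_dims]:.3f}")
```

With the number of samples held constant, the nearest neighbour of each point moves further away as dimensions are added, which is what "sparser" means in practice.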

As examples are represented in a larger space, it may become easier for the model to find a decision boundary that splits the training data, which can lead to apparently good performance. However, this raises the concern of overfitting.

## 2. Feature Selection

The feature extraction process improves the learning process by uncovering hidden and helpful relationships between data. However, it comes at the cost of increasing the complexity of the data and potentially leading to the Curse of Dimensionality problem.

To avoid these problems while ensuring best performance, feature selection is a commonly applied process in Machine Learning pipelines where the best characteristics of the data, i.e. features, are identified. The most discriminative features are kept and used for the optimization process (with hyperparameter tuning).

Feature selection can be carried out with several techniques, which usually fall into three main groups:

* __Filter Methods__: These methods analyse the structure of the data, without training a classifier, to measure the dependence between features.

* __Wrapper Methods__: Using a classifier, wrapper methods find the best features by looking at the classification performance.

* __Embedded Methods__: In embedded methods, the feature selection process is part of the learning process of the classifier.


### 2.1. The UCI Human Activity Recognition Using Smartphones Data Set 

Let's return to the [UCI Human Activity Recognition dataset](https://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions). For this class, the features extracted with TSFEL (Labs 5 and 6) are available on our GitHub page. Load the data directly from the links.

##### 2.1.1. Loading the HAR data

In [None]:
import numpy as np
import pandas as pd

X_train = pd.read_csv("https://raw.githubusercontent.com/hgamboa/nova-aaeb/main/Labs/Data/HAR/train_features.csv")
X_test = pd.read_csv("https://raw.githubusercontent.com/hgamboa/nova-aaeb/main/Labs/Data/HAR/test_features.csv")
y_train = np.loadtxt("https://raw.githubusercontent.com/hgamboa/nova-aaeb/main/Labs/Data/HAR/y_train.txt")
y_test = np.loadtxt("https://raw.githubusercontent.com/hgamboa/nova-aaeb/main/Labs/Data/HAR/y_test.txt")

In [None]:
X_train.head()

### 2.2. Exercise: Defining a Testing Classifier

As we will test different feature selection methods, it is important to guarantee a consistent comparison methodology. Therefore, we should use the same classifier, under the same conditions, to evaluate the results.

__Exercise 1__: Define a function using a tuned decision tree from previous experiments on the HAR dataset. The function should train the model, make predictions, and then display the performance results. Additionally, it should report the number of features to help evaluate the model's complexity.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def train_test_classifier(X_train, y_train, X_test, y_test):
    
    # Tuned decision tree
    model = DecisionTreeClassifier(criterion='entropy', max_depth=5, max_features=100,
                       max_leaf_nodes=30, min_samples_leaf=25,
                       min_samples_split=50, random_state=42)
    
    # train model
    
    
    # predictions
    
    
    # performance results
    
    print("Performance:")
    print("\tTrain Accuracy:", )    
    print("\tTest  Accuracy:", )
    
    print("Complexity:")
    print("\tFeature Size:", )

__Exercise 2__: Verify the baseline performance using the original dataset.

In [None]:
train_test_classifier(X_train, y_train, X_test, y_test)

### 2.3. Filter Methods

Filter methods are the simplest way to select the most relevant features.

In these methods, the structure of the data is analysed to detect _independence_ among the features. Several measures can be used to extract information about the data structure:

* __Correlation__

* __Chi-Square__

* __Information Gain__

##### 2.3.1. Exploring Correlation

The pandas library includes a method to compute the correlation matrix of a dataset automatically. [`df.corr()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) computes the pairwise correlation of the numeric columns of a DataFrame, using the Pearson method by default. It is possible to explore other coefficients, such as the Spearman rank or the Kendall Tau correlation coefficients, through the `method` argument.
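
To see why the choice of coefficient can matter, here is a small sketch on made-up data (the `toy` DataFrame and its column names are hypothetical and unrelated to the HAR features). The two columns are related monotonically but not linearly, so the rank-based Spearman coefficient should be higher than Pearson's.

```python
import numpy as np
import pandas as pd

# Toy data: exp_x grows monotonically with x, but not linearly
x = np.linspace(0.1, 5, 50)
toy = pd.DataFrame({"x": x, "exp_x": np.exp(x)})

# Same call, different coefficient selected through `method`
pearson = toy.corr(method="pearson").loc["x", "exp_x"]
spearman = toy.corr(method="spearman").loc["x", "exp_x"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Pearson measures linear association only, while Spearman works on ranks and therefore rewards any monotonic relationship.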

Let's compute and evaluate the correlation matrix.

In [None]:
corr_pearson = X_train.corr()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

def corr_heatmap(corr):
    plt.figure(figsize=(16,14))
    sns.heatmap(data=corr, cmap=plt.cm.Reds, vmin=-1, vmax=1)
    
corr_heatmap(corr_pearson)

##### 2.3.2. Selecting Features

To decide which features should be kept, one needs to identify highly correlated pairs and keep only one feature from each pair.

We will define a function to identify the most correlated features using a reference threshold of 0.95.

In [None]:
import numpy as np

def correlated_features(corr_matrix, threshold=0.95):
    
    # Absolute value
    corr_matrix = corr_matrix.abs()
    
    # Select upper triangle of correlation matrix
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    
    # Flag every column that is highly correlated with at least one earlier column
    corr_features = [column for column in upper.columns if any(upper[column] > threshold)]
    
    return corr_features

In [None]:
corr_features = correlated_features(corr_pearson, threshold=0.95)

print("No. correlated features:", len(corr_features))
corr_features
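
The upper-triangle trick is easier to follow on a tiny example. The sketch below uses a made-up DataFrame (`toy` and its column names are hypothetical) and repeats the same steps as `correlated_features`, then drops the flagged columns with `DataFrame.drop`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 2 * a + 1,          # perfectly correlated with "a"
    "noise": rng.normal(size=100),  # independent of "a"
})

corr = toy.corr().abs()
# Keep only entries above the diagonal so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

toy_reduced = toy.drop(columns=to_drop)
print("Flagged:", to_drop)
print("Remaining:", list(toy_reduced.columns))
```

Because only the upper triangle is inspected, the first column of each correlated pair is kept and the later one is flagged. When applying this to real data, the same columns would need to be dropped from both the train and the test sets.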

### 2.4. Exercise

__Exercise 3__: Now that we have identified the most correlated features, remove them from the datasets and verify the attained performance. What do you conclude?

### 2.5. Wrapper Methods

With wrapper methods, the best features are selected by evaluating the classification performance of a model on different subsets of features.

The feature selection is done sequentially, in an iterative process that follows one of two strategies:

* __Forward Feature Selection__

* __Backward Feature Selection__

Sklearn includes the [SequentialFeatureSelector()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html) class, which implements both strategies using a given estimator.

##### 2.5.1. Forward Feature Selection

The forward feature selection process sequentially adds new features to the train dataset until the performance stops improving.

The iterative testing and update runs as follows:

1. Start with an empty feature set $Y_0 = []$, an accuracy $a_0 = 0$, an objective function $J$ and an iteration counter $k = 0$;

2. Select the feature $x^+$ that maximizes $J(Y_k + x)$;

3. If $J(Y_k + x^+) > a_k$, update $Y_{k+1} = Y_k + x^+$, $a_{k+1} = J(Y_k + x^+)$ and $k=k+1$ and go back to 2., otherwise continue;

4. Keep only the feature set $Y_k$ and discard the rest.
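
The four steps above can be sketched in plain Python. This is a minimal illustration, not the sklearn implementation: `forward_selection`, `toy_objective` and the `usefulness` scores are hypothetical names and values, and in practice $J$ would be something like a cross-validated accuracy.

```python
def forward_selection(candidates, J):
    """Greedy forward selection following steps 1-4 above."""
    selected = []            # Y_k, starts empty (step 1)
    best_score = 0.0         # a_k
    remaining = list(candidates)

    while remaining:
        # Step 2: candidate x+ that maximizes J(Y_k + x)
        scores = {x: J(selected + [x]) for x in remaining}
        x_plus = max(scores, key=scores.get)

        # Step 3: accept x+ only if the objective strictly improves
        if scores[x_plus] <= best_score:
            break
        selected.append(x_plus)
        remaining.remove(x_plus)
        best_score = scores[x_plus]

    # Step 4: keep Y_k and discard the rest
    return selected, best_score


# Hypothetical objective: each feature has a fixed usefulness, and every
# extra feature costs a small complexity penalty.
usefulness = {"f1": 0.50, "f2": 0.30, "f3": 0.02, "f4": 0.00}

def toy_objective(subset):
    return sum(usefulness[f] for f in subset) - 0.05 * len(subset)

selected, score = forward_selection(usefulness, toy_objective)
print("Selected:", selected, "score:", round(score, 2))
```

Note that the search is greedy: once a feature is added it is never removed, so the result is not guaranteed to be the best possible subset.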

### 2.6. Exercise

__Exercise 4__: Using the `SequentialFeatureSelector()` class, implement a Forward Feature Selection strategy. Use the same estimator as in the performance comparison function.

Use cross-validation with the train dataset after removing correlated features (`X_train_corr`), define a performance tolerance and set `n_jobs` equal to -1.

__Exercise 5__: Evaluate the performance with the new set of features. What do you conclude?

### 2.7. Dimensionality Reduction - PCA

Another way to reduce the number of features is by applying dimensionality reduction techniques, which aim to summarize the information content of large datasets by means of a smaller set of representative variables. These new variables aggregate information from multiple original variables.

__Principal Component Analysis (PCA)__ is the best-known dimensionality reduction technique. It first identifies the hyperplane that lies closest to the data and then projects the data onto it. The principal components are the directions that successively capture the largest share of the variance of the training data.

Sklearn includes an implementation of [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
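
A minimal sketch of the sklearn API on synthetic data follows. The data are made up for illustration (10 observed features generated from 3 hidden factors plus a little noise), and `n_components=0.95` is an arbitrary choice that asks PCA to keep enough components to explain 95% of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 observed features driven by only 3 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X_toy = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

# A float in (0, 1) keeps the smallest number of components whose
# cumulative explained variance reaches that fraction
pca = PCA(n_components=0.95, svd_solver="full")
X_proj = pca.fit_transform(X_toy)

print("Original shape: ", X_toy.shape)
print("Projected shape:", X_proj.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

Since PCA is driven by variance, features measured on very different scales are usually standardized first so that no single feature dominates the components.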

### 2.8. Exercise

__Exercise 6__: Verify the impact of PCA on the train data after removing correlated features. Use `n_components='mle'` for the automatic estimation of the number of components.