<a href="https://colab.research.google.com/github/txusser/Master_IA_Sanidad/blob/main/Modulo_2/2_3_3_Extraccion_de_caracteristicas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Extraction

## Principal Component Analysis (PCA)

Principal Component Analysis, or PCA, is a statistical technique used to reduce the dimensionality of a dataset. PCA allows us to simplify the information present in a dataset with multiple variables and transform it into a reduced dataset that still retains much of the original information.

The goal of PCA is to find a lower-dimensional representation of the data that preserves as much of its variance as possible, which often makes the data easier to visualize and interpret.

To perform PCA, the covariance matrix of the centered data is first calculated. Then the eigenvectors and eigenvalues of this matrix are computed: each eigenvector defines a direction in the space of the original variables, and its eigenvalue is the variance of the data along that direction. The data is then projected onto the eigenvectors with the largest eigenvalues, resulting in a new dataset with fewer variables that still captures a significant portion of the original variance.
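
The steps described above can be sketched directly with NumPy. This is only an illustrative sketch on synthetic data: the array `X_toy` and every variable name in it are assumptions made for the example and are not part of the notebook's dataset.

```python
import numpy as np

# Hypothetical toy data: 200 samples, 3 variables, two of them strongly correlated
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X_toy = np.hstack([
    latent + 0.1 * rng.normal(size=(200, 1)),
    2.0 * latent + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# 1. Center the data and compute the covariance matrix
X_centered = X_toy - X_toy.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# 2. Eigen-decomposition (eigh is intended for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort the directions by decreasing variance (eigenvalue)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 4. Project the centered data onto the first two directions
X_projected = X_centered @ eigenvectors[:, :2]

# Fraction of the total variance carried by each direction
explained_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratio:", explained_ratio)
```

The variance of each projected column equals the corresponding eigenvalue, which is why the eigenvalues can be read as "variance explained".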


In [None]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

In [None]:
# Load the data
cancer_data = load_breast_cancer()
df = pd.DataFrame(data=cancer_data.data, columns=cancer_data.feature_names)

# Display the first rows and summary statistics of each variable
print(df.head())
print(df.describe())

In [None]:
"""
We will use scikit-learn to apply the preprocessing technique StandardScaler. 
The goal is to transform the data so that it has a mean of zero and a unit standard deviation.
"""
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Rescale the data using the mean and standard deviation of each variable.
# The "fit" method computes these two statistics from the original data.
scaler.fit(df.values)

# Use the "transform" function from the StandardScaler class to apply 
# the transformation to the original data. The result of this transformation 
# is stored in the variable "X_scaled"
X_scaled = scaler.transform(df.values)
print("X_scaled:\n", X_scaled)


In [None]:
# We will use scikit-learn functions for PCA analysis
from sklearn.decomposition import PCA

# To evaluate the results, we will use the full set of variables.
# "n_components = 30" specifies that PCA should fit the data to find 
# the 30 principal components.
pca = PCA(n_components=30, random_state=2020)
pca.fit(X_scaled)

# Store the values of the (30) principal components in the variable X_pca
X_pca = pca.transform(X_scaled)
print("X_pca:\n", X_pca)

# Since we selected the full set of variables, the selected components 
# should account for 100% of the variance in the data.
print("\n => Variance explained by the components:", sum(pca.explained_variance_ratio_ * 100))


In [None]:
# If we plot the variance as a function of the number of components, we can observe
# the minimum number of components needed to explain a certain percentage of the variance.
import matplotlib.pyplot as plt
import numpy as np

plt.plot(np.cumsum(pca.explained_variance_ratio_ * 100))
plt.xlabel("Number of Components")
plt.ylabel("Percentage of Variance Explained")
plt.title("Cumulative Variance Explained by PCA Components")
plt.grid(True)
plt.show()


In [None]:
# We see that the first 10 components (a third of the original 30 variables) explain about 95% of the variance
n_var = np.cumsum(pca.explained_variance_ratio_ * 100)[9]
print("Variance explained by the first 10 components:", n_var)
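
The number of components needed for a given share of the variance can also be computed instead of being read off the plot. The following is a minimal, self-contained sketch; the names `X_std`, `cumulative`, `threshold` and `n_needed` are illustrative assumptions, not variables from the cells above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the data and fit a PCA that keeps every component
X_std = StandardScaler().fit_transform(load_breast_cancer().data)
cumulative = np.cumsum(PCA().fit(X_std).explained_variance_ratio_)

threshold = 0.95
# argmax returns the index of the first True value; add 1 to get a count
n_needed = int(np.argmax(cumulative >= threshold)) + 1
print(f"Components needed for {threshold:.0%} of the variance: {n_needed}")
```

Passing a float between 0 and 1 as `n_components` makes scikit-learn perform the same selection internally.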

In [None]:
# Alternatively, we can construct the subset that accommodates 95% of the variance
# as follows
pca_95 = PCA(n_components=0.95, random_state=2020)
pca_95.fit(X_scaled)
X_pca_95 = pca_95.transform(X_scaled)

# A good practice is to visualize the relationship between the principal components
import seaborn as sns
sns.scatterplot(x=X_pca_95[:, 0], y=X_pca_95[:, 1], hue=cancer_data.target, palette="Set1")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("Scatter Plot of First Two Principal Components")
plt.show()


## Description of the Principal Components Plot

1. **Dimensionality Reduction**: PCA is a dimensionality reduction technique that transforms the original variables into a new set of variables (principal components) that are orthogonal (uncorrelated) to each other. The components are ordered by the variance they explain, so the first few typically capture most of the variability present in the original data.

2. **Visualization of Principal Components**: By plotting the first two principal components (`X_pca_95[:, 0]` and `X_pca_95[:, 1]`), this graph shows the dispersion of the data in the two directions that capture the most variability. Each point in the graph represents an observation in this reduced component space.

3. **Interpretation of Axes**: The axes of the graph (x-axis and y-axis) do not have inherent meaning in terms of the original variables since each principal component is a linear combination of them. However, the relative position of the points can indicate patterns and relationships among the samples.

This type of graph is useful for visually exploring the structure of the data. For example:
- If points from different categories are clearly separated, it suggests that the first two principal components are effective at distinguishing these categories.
- Conversely, if there is significant overlap, it may indicate that more information (additional components or alternative techniques) is needed to effectively differentiate the classes.
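
Two of the statements above can be checked numerically: that the component scores are uncorrelated, and that the classes occupy different regions of the PC1/PC2 plane. The sketch below is self-contained and uses illustrative names (`data`, `scores`, `corr`, `centroid`); comparing class centroids is only a rough indicator of separation and is an assumption of this example, not a method used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_std = StandardScaler().fit_transform(data.data)
scores = PCA(n_components=2).fit_transform(X_std)

# The correlation between the PC1 and PC2 scores should be numerically zero
corr = np.corrcoef(scores, rowvar=False)
print("Correlation between PC1 and PC2:", corr[0, 1])

# Mean position of each class in the PC1/PC2 plane
for label, name in enumerate(data.target_names):
    centroid = scores[data.target == label].mean(axis=0)
    print(f"{name}: PC1 = {centroid[0]:.2f}, PC2 = {centroid[1]:.2f}")
```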


In [None]:
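# Finally, we can create a new DataFrame with the result of the PCA analysis.
# The number of columns is taken from the data instead of being hardcoded,
# since the number of components kept for 95% of the variance depends on the dataset.
cols = ['PCA' + str(i) for i in range(X_pca_95.shape[1])]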
df_pca = pd.DataFrame(X_pca_95, columns=cols)
print("Data (PCA - 95%):\n", df_pca)


In [None]:
# Obtain the component matrix
components = pca.components_

# Create a DataFrame with the component loadings
df_loadings = pd.DataFrame(components.T, columns=['PC' + str(i + 1) for i in range(components.shape[0])], index=df.columns)

# Display the loadings
print(df_loadings)

# For each principal component, find the original variable with the greatest influence
for i in range(components.shape[0]):
    pc = f'PC{i + 1}'
    most_influential_variable = df_loadings[pc].abs().idxmax()
    print(f"{pc}: {most_influential_variable}")


## Feature Selection with SelectKBest


In [None]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

# Load the example dataset (Iris)
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Apply SelectKBest
k = 3  # Select the top 3 features
selector = SelectKBest(score_func=f_classif, k=k)
X_new = selector.fit_transform(X, y)

# Get the indices of the selected features
selected_feature_indices = selector.get_support(indices=True)

# Get the names of the selected features
selected_features = X.columns[selected_feature_indices].tolist()

print("Selected Features:", selected_features)

# Create a new DataFrame with the selected features
X_selected = pd.DataFrame(X_new, columns=selected_features)

print("\nData with Selected Features:")
print(X_selected.head())

# Get the scores for all features
scores = selector.scores_
feature_scores = pd.DataFrame({'Feature': X.columns, 'Score': scores})
feature_scores = feature_scores.sort_values('Score', ascending=False)

print("\nScores for All Features:")
print(feature_scores)
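
The score function `f_classif` computes a one-way ANOVA F-value for each feature. As a hedged cross-check, the sketch below recomputes the same statistic with `scipy.stats.f_oneway`; the names `X_iris`, `y_iris`, `f_sklearn` and `f_scipy` are illustrative and independent of the variables defined above.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

iris_data = load_iris()
X_iris, y_iris = iris_data.data, iris_data.target

# F-values and p-values as used by SelectKBest
f_sklearn, p_values = f_classif(X_iris, y_iris)

# Same statistic, computed feature by feature: one group of values per class
classes = np.unique(y_iris)
f_scipy = np.array([
    f_oneway(*(X_iris[y_iris == c, j] for c in classes)).statistic
    for j in range(X_iris.shape[1])
])

for name, a, b in zip(iris_data.feature_names, f_sklearn, f_scipy):
    print(f"{name}: f_classif = {a:.2f}, f_oneway = {b:.2f}")
```

A large F-value means the class means of that feature differ strongly relative to the spread within each class, which is why such features are ranked first.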


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns


def plot_feature_importance(selector, feature_names, figsize=(10, 6), palette="viridis"):
    """
    Generates a horizontal bar chart showing the score of each feature
    from a fitted SelectKBest selector and returns the matplotlib figure.
    """

    # Retrieve scores and create a DataFrame
    scores = selector.scores_
    feature_scores = pd.DataFrame({
        'Feature': feature_names,
        'Score': scores
    })

    # Sort by descending score so the highest-scoring feature is drawn at the top
    feature_scores = feature_scores.sort_values('Score', ascending=False)

    # Create the figure and axes
    fig, ax = plt.subplots(figsize=figsize)

    # Generate the horizontal bar chart
    sns.barplot(
        data=feature_scores,
        y='Feature',
        x='Score',
        palette=palette,
        hue='Feature',
        ax=ax
    )

    # Customize the plot
    ax.set_title('Feature Importance According to SelectKBest', pad=20)
    ax.set_xlabel('F-Score')
    ax.set_ylabel('Feature')

    # Add values to the bars
    for i, v in enumerate(feature_scores['Score']):
        ax.text(v, i, f'{v:.2f}', va='center')

    return fig


# Generate the plot
fig = plot_feature_importance(selector, X.columns)
plt.show()


## Independent Component Analysis (ICA)
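
ICA decomposes a set of observed signals into components that are as statistically independent as possible. Before applying it to fMRI data, the idea can be illustrated on a synthetic example. The sketch below is an assumption-laden toy case: the two source signals, the mixing matrix and all variable names (`sources`, `mixing`, `observed`, `ica_toy`, `recovered`) are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two hypothetical independent sources: a sine wave and a square wave
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
sources = np.column_stack([np.sin(2 * t), np.sign(np.sin(3 * t))])
sources += 0.05 * rng.normal(size=sources.shape)

# Each observed signal is a linear mixture of the two sources
mixing = np.array([[1.0, 0.5],
                   [0.5, 2.0]])
observed = sources @ mixing.T

# Estimate the independent components from the mixtures alone
ica_toy = FastICA(n_components=2, random_state=0)
recovered = ica_toy.fit_transform(observed)

# ICA recovers sources only up to order, sign and scale,
# so we compare them through absolute correlations
corr = np.abs(np.corrcoef(sources.T, recovered.T)[:2, 2:])
print(corr.round(2))
```

Each row of `corr` corresponds to a true source and each column to an estimated component; a value close to 1 in every row indicates that the source was recovered.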


In [None]:
# We will use fMRI data for our example with ICA.
# To do this, we start by installing the nilearn library.
!python -m pip install nilearn

In [None]:
from nilearn import datasets

# Download an fMRI subject from the developmental study
dataset = datasets.fetch_development_fmri(n_subjects=1)
file_name = dataset.func[0]

# Preprocessing the image
# NiftiMasker is provided by nilearn.maskers since nilearn 0.9
from nilearn.maskers import NiftiMasker

# Apply a mask to discard the background of the image (non-brain voxels)
masker = NiftiMasker(smoothing_fwhm=8, memory='nilearn_cache', memory_level=1,
                     mask_strategy='epi', standardize=True)
data_masked = masker.fit_transform(file_name)


In [None]:
from sklearn.decomposition import FastICA
import numpy as np

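# Select 10 components.
# data_masked has shape (time points, voxels); transposing it makes each voxel
# a sample, so ICA returns spatial components (one map per component).
ica = FastICA(n_components=10, random_state=42)
components_masked = ica.fit_transform(data_masked.T).T

# Normalize the components (zero mean, unit standard deviation) and then
# set to zero every value whose absolute value is below 0.8, keeping only the strongest signal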
components_masked -= components_masked.mean(axis=0)
components_masked /= components_masked.std(axis=0)
components_masked[np.abs(components_masked) < .8] = 0

# Invert the transformation to recover the 3D structure
component_img = masker.inverse_transform(components_masked)


In [None]:
# Finally, we visualize the results of the dimensionality reduction operations
from nilearn import image
from nilearn.plotting import plot_stat_map, show
from nilearn import datasets

# Functional image of the same subject (the download is cached after the first call)
dataset = datasets.fetch_development_fmri(n_subjects=1)
func_filename = dataset.func[0]

# Calculate the mean image
mean_img = image.mean_img(func_filename)

# Plot the first and second independent components over the mean image
plot_stat_map(image.index_img(component_img, 0), mean_img)
plot_stat_map(image.index_img(component_img, 1), mean_img)
show()
