<a href="https://colab.research.google.com/github/txusser/Master_IA_Sanidad/blob/main/Modulo_2/2_3_3_Preprocesado_y_estructuracion_de_datos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic 2.3.3: Data Preprocessing and Structuring

Mount Google Drive to access the dataset

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
from rich.console import Console
console = Console()

## Data Cleaning

In [None]:
console.rule("Data Cleaning")

import pandas as pd
pd.set_option('display.max_columns', None)

# Load the dataset
df = pd.read_csv("/content/drive/MyDrive/Documentos/Master IA/Datasets/stroke-data.csv")
console.log(df)

# Check for any NaN values
nans = df.isnull().values.any()
print("\n - Are there any NaN values in the dataset:", nans)
print("\n - Show NaN values by row/column: \n", df.isna())

# Select rows that contain any NaN values
df_n = df[df.isna().any(axis=1)]
print("\n - Rows with NaNs:\n", df_n)

# Drop NaN values
df = df.dropna()
nans = df.isnull().values.any()
print("\n - Are there any NaN values in the dataset: \n", nans)
print("\n - Cleaned dataset: \n", df)
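
Before dropping rows, it can help to see how many values are missing in each column and how many rows would be lost. The following is a minimal sketch on a hypothetical toy DataFrame (the `toy` frame and its values are assumptions, not the stroke dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the stroke dataset
toy = pd.DataFrame({
    "age": [67, 61, 80, 49],
    "bmi": [36.6, np.nan, 32.5, np.nan],
    "gender": ["Male", "Female", "Male", "Female"],
})

# Number of missing values in each column
nan_counts = toy.isna().sum()
print(nan_counts)

# Rows remaining after dropping every row that has a missing value
toy_clean = toy.dropna()
print(len(toy), "->", len(toy_clean), "rows")
```

A per-column count is usually easier to read than the full Boolean mask returned by `isna()`.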


## Data Transformation

`MinMaxScaler`, a class from the Scikit-learn (sklearn) library, is a preprocessing tool that rescales the feature values of a dataset to a given range. Each feature is transformed independently so that its values fall within that range, which is [0, 1] by default; other ranges can be set with the `feature_range` parameter.

For the default range [0, 1], MinMaxScaler scales each feature as follows:

```
X_scaled = (X - X_min) / (X_max - X_min)
```

where:
- `X` is the original value of a feature.
- `X_min` is the minimum value of that feature in the dataset.
- `X_max` is the maximum value of that feature in the dataset.

MinMaxScaler is useful when all features need to be on the same scale. This benefits machine learning algorithms that are sensitive to feature scale, such as models trained with gradient descent and distance- or margin-based methods (k-NN, SVM, etc.). Because the minimum and maximum of each feature define the mapping, the result is sensitive to outliers.
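
The formula can be checked by hand on a few values. The sketch below uses NumPy only; the helper `min_max_scale` and the sample ages are hypothetical, and the extension to an arbitrary `feature_range` is the usual stretch-and-shift of the [0, 1] result rather than a copy of Scikit-learn's code:

```python
import numpy as np

def min_max_scale(x, feature_range=(0.0, 1.0)):
    """Scale a 1-D array to feature_range with the min-max formula."""
    lo, hi = feature_range
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    # Formula from the text: maps the values onto [0, 1]
    x_unit = (x - x_min) / (x_max - x_min)
    # Stretch and shift onto [lo, hi]
    return x_unit * (hi - lo) + lo

ages = np.array([20.0, 35.0, 50.0, 80.0])  # hypothetical values
scaled = min_max_scale(ages)
print(scaled)  # values: 0.0, 0.25, 0.5, 1.0
```

This sketch does not guard against a constant feature, where `X_max - X_min` is zero.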

In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Create the MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))

# Select columns with numerical values of interest
columns_to_scale = ['age', 'avg_glucose_level', 'bmi']
df_s = df[columns_to_scale]

# Print the original data
print(" => Original Data: \n", df_s.head())

# Apply the scaling function (keep the original index so rows stay aligned with df)
df_scaled = pd.DataFrame(scaler.fit_transform(df_s), columns=columns_to_scale, index=df_s.index)

# Print the scaled data
print(" => Scaled Data: \n", df_scaled.head())


In [None]:
# Standardization (z-score normalization): zero mean and unit standard deviation per column
df_norm = (df_scaled - df_scaled.mean()) / df_scaled.std()

# Visualize the result
import matplotlib.pyplot as plt
fig = plt.figure(1, figsize=(12, 5))

# Original data
plt.subplot(131)
plt.hist(df['bmi'].values, bins=20, color='blue', alpha=0.7)
plt.title("Original")

# Scaled data
plt.subplot(132)
plt.hist(df_scaled['bmi'].values, bins=20, color='green', alpha=0.7)
plt.title("Scaled")

# Normalized data
plt.subplot(133)
plt.hist(df_norm['bmi'].values, bins=20, color='orange', alpha=0.7)
plt.title("Normalized")

plt.tight_layout()
plt.show()
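
The cell above standardizes the min-max scaled data rather than the raw data. Since min-max scaling is a linear rescaling, the z-scores should be the same either way. A small check on hypothetical BMI values (the series and the `zscore` helper are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values
bmi = pd.Series([18.5, 22.0, 27.3, 31.8, 40.1])

def zscore(s):
    # pandas .std() uses the sample standard deviation (ddof=1)
    return (s - s.mean()) / s.std()

# Min-max scaling to [0, 1]
bmi_scaled = (bmi - bmi.min()) / (bmi.max() - bmi.min())

z_from_raw = zscore(bmi)
z_from_scaled = zscore(bmi_scaled)
print(np.allclose(z_from_raw, z_from_scaled))  # True
```

Note that Scikit-learn's `StandardScaler` divides by the population standard deviation (ddof=0), so its output differs slightly from this pandas-based computation.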


## Encoding Categorical Variables

In [None]:
# Example of one-hot encoding with pandas
df_sub = df[['id', 'age', 'bmi', 'gender']]  # keep df intact so earlier cells can be rerun
cod_gen = pd.get_dummies(df_sub, columns=['gender'], prefix='gender')
print(cod_gen.head())

### LabelBinarizer and OneHotEncoder

* **LabelBinarizer**: A Scikit-learn class that binarizes the labels (categories) of a categorical variable, converting them into a 0/1 representation. Calling `fit_transform()` on a DataFrame column with three or more categories returns one binary column per category, with a 1 where the row belongs to that category and 0 elsewhere. With exactly two categories, it returns a single column instead.
  In the example code, `LabelBinarizer` is applied to the 'gender' column of a sample DataFrame whose categories are 'Male' and 'Female'. Because there are only two categories, the result is a single-column matrix in which 'Female' is encoded as 0 and 'Male' as 1 (classes are sorted alphabetically).

* **OneHotEncoder**: Converts categorical variables into "one-hot" (or "one-of-N") encoding, in which each category is represented by a binary vector with exactly one 1 and 0 everywhere else.
  In the example code, `LabelBinarizer` is applied first to obtain a 0/1 column, and `OneHotEncoder` then expands that column into one-hot format. The result is a matrix with one column per category, where each row contains a 1 in the column of its category and 0 in all other columns. `OneHotEncoder` returns a sparse matrix by default, so the example calls `.toarray()` before printing.
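
How the output shape of `LabelBinarizer` depends on the number of categories can be seen in a short sketch. The label lists here are hypothetical:

```python
from sklearn.preprocessing import LabelBinarizer

# Two categories: a single 0/1 column
two = LabelBinarizer().fit_transform(["Male", "Female", "Female"])
# Three categories: one column per category
three = LabelBinarizer().fit_transform(["Male", "Female", "Other"])

print(two.shape)    # (3, 1)
print(three.shape)  # (3, 3)
```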

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

# Example test DataFrame
data = pd.DataFrame({'gender': ['Male', 'Female', 'Female', 'Male', 'Male']})

# LabelBinarizer
label_binarizer = LabelBinarizer()
data_lb = label_binarizer.fit_transform(data['gender'])
print("LabelBinarizer Output:\n", data_lb)

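# OneHotEncoder: expands the 0/1 column from LabelBinarizer into one column per category
onehot_encoder = OneHotEncoder()
data_ohe = onehot_encoder.fit_transform(data_lb)  # sparse matrix by default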
print("OneHotEncoder Output:\n", data_ohe.toarray())
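
`OneHotEncoder` also accepts string categories directly, so the intermediate `LabelBinarizer` step is optional. A minimal sketch on the same sample data, assuming a reasonably recent Scikit-learn version:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({'gender': ['Male', 'Female', 'Female', 'Male', 'Male']})

encoder = OneHotEncoder()
# OneHotEncoder expects 2-D input, hence the double brackets
encoded = encoder.fit_transform(data[['gender']]).toarray()

print(encoder.categories_)  # categories learned for each input column
print(encoded)
```

The column order of `encoded` follows `encoder.categories_`, so here the first column is 'Female' and the second is 'Male'.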