# Example: Imputation using MIPLS2

This example demonstrates how to use the `MIPLS2` class, imported below from the `MIPLS2_v8` module, to impute missing values in a dataset. Auxiliary functions live in the `Miscellaneous_Funcs` class in `Miscellaneous.py`; this notebook does not import that class directly. The notebook walks through creating two datasets (`X` and `Y`), defining the required parameters, and imputing a copy of `Y` that contains missing values (`YI`).

In [None]:
# Loading necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import MIPLS2_v8 as mipls2  # Module that provides the MIPLS2 imputation class


## Generating Synthetic Data

We will generate two synthetic datasets: `X` and `Y`. These datasets will have the same number of rows (samples) but different numbers of columns. Then, we will create `YI`, which is a version of `Y` with some missing values.

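We generate the random data with `numpy`, fixing the random seed for reproducibility. Note that `X` and `Y` are drawn independently of each other, so `X` carries no real information about `Y`; the example exercises the workflow rather than demonstrating predictive accuracy.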

In [None]:
# Defining the number of samples and columns for X and Y
np.random.seed(42)  # Setting seed for reproducibility
n_samples = 100  # Number of samples (rows)
n_features_X = 10  # Number of columns in X
n_features_Y = 5   # Number of columns in Y

# Generating random data for X and Y
X = np.random.rand(n_samples, n_features_X)
Y = np.random.rand(n_samples, n_features_Y)

# Introducing missing values in Y to create YI (simulated incomplete data)
YI = Y.copy()
missing_rate = 0.2  # 20% missing values
mask = np.random.rand(n_samples, n_features_Y) < missing_rate
YI[mask] = np.nan  # Assigning NaN to introduce missing values

# Displaying a sample of the data
print('First 5 rows of X:', X[:5], sep='\n')
print('\nFirst 5 rows of Y (complete):', Y[:5], sep='\n')
print('\nFirst 5 rows of YI (with missing values):', YI[:5], sep='\n')
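
Before imputing, it can help to confirm how much of the data is actually missing, since a random mask only approximates the requested `missing_rate`. The following is a minimal, self-contained sketch: the `*_demo` names are hypothetical and rebuild a matrix in the same way as the cell above rather than reusing its variables.

```python
import numpy as np

# Rebuild an incomplete matrix the same way as above (illustrative names).
rng = np.random.RandomState(42)
Y_demo = rng.rand(100, 5)
mask_demo = rng.rand(100, 5) < 0.2
YI_demo = Y_demo.copy()
YI_demo[mask_demo] = np.nan

# Fraction of missing entries overall and per column.
overall_missing = np.isnan(YI_demo).mean()
per_column_missing = np.isnan(YI_demo).mean(axis=0)

# Rows in which every Y value is missing carry no information about Y.
fully_missing_rows = int(np.isnan(YI_demo).all(axis=1).sum())

print(f'Overall missing fraction: {overall_missing:.3f}')
print('Missing fraction per column:', np.round(per_column_missing, 3))
print('Rows with all Y values missing:', fully_missing_rows)
```

The realized fraction will generally differ slightly from 0.2, and individual columns can deviate more than the overall figure.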

## Defining Parameters for Imputation

We now define the parameters required for imputation with the `MIPLS2` class. The key parameters include the application mode (`App`), the maximum number of latent variables (`Max_LV`), the cross-validation mode (`cv_mode`), and others.

Below are the details of some key parameters:
- `App`: Application mode, which can be `A0xy`, `A1xy`, `A2xy`, or `A3xy`.
- `Just_do_min`: Whether to minimize the number of latent variables or not.
- `Opt_LV`: Determines how the number of latent variables is selected.
- `Max_LV`: Maximum number of latent variables for the PLS model.
- `cv_mode`: Cross-validation mode, either `KFold` or `Venetian`.
- `NSplits`: Number of splits for cross-validation.
- `gm_type`: Type of generalized mean for imputation.
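
The two `cv_mode` options correspond to two common ways of assigning samples to folds. The sketch below illustrates the general idea with plain `numpy`; it is not taken from the MIPLS2 code, the function names are hypothetical, and the package's `KFold` mode may additionally shuffle samples using `rnd_stat`.

```python
import numpy as np

def contiguous_folds(n_samples, n_splits):
    """Assign samples to folds in contiguous blocks (unshuffled k-fold style)."""
    fold_sizes = np.full(n_splits, n_samples // n_splits)
    fold_sizes[: n_samples % n_splits] += 1  # spread the remainder over the first folds
    return np.repeat(np.arange(n_splits), fold_sizes)

def venetian_folds(n_samples, n_splits):
    """Assign every n_splits-th sample to the same fold (venetian blinds)."""
    return np.arange(n_samples) % n_splits

fold_kfold = contiguous_folds(10, 3)
fold_venetian = venetian_folds(10, 3)
print('contiguous:', fold_kfold)     # [0 0 0 0 1 1 1 2 2 2]
print('venetian:  ', fold_venetian)  # [0 1 2 0 1 2 0 1 2 0]
```

Venetian blinds interleave the folds, which matters when rows are ordered (for example, by time or by batch); contiguous blocks keep neighbouring rows together.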


In [None]:
# Defining necessary parameters for the code execution
App = 'A0xy'  # Application mode
Just_do_min = True  # Minimize the number of latent variables
Opt_LV = 'pervar'  # Select latent variables based on percentage of variance explained
gm_type = 3  # Generalized mean type
Max_LV = 30  # Maximum number of latent variables
cv_mode = 'KFold'  # Cross-validation mode ('KFold' or 'Venetian')
NSplits = 10  # Number of splits for KFold cross-validation
rnd_stat = 42  # Random seed for reproducibility
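
For readers unfamiliar with the term behind `gm_type`, a generalized (power) mean with exponent `p` is `(mean(x**p))**(1/p)`, with the geometric mean as the limit at `p = 0`. The sketch below shows only that textbook definition. How the integer `gm_type` selects a particular mean inside MIPLS2 is not documented here, so the function name and the exponents used are purely illustrative assumptions.

```python
import numpy as np

def generalized_mean(values, p):
    """Power mean of positive values; p = 0 is treated as the geometric mean."""
    values = np.asarray(values, dtype=float)
    if p == 0:
        return float(np.exp(np.mean(np.log(values))))
    return float(np.mean(values ** p) ** (1.0 / p))

sample = [1.0, 2.0, 4.0]
for p in (-1, 0, 1, 2):
    # p = -1: harmonic, p = 0: geometric, p = 1: arithmetic, p = 2: quadratic
    print(f'p = {p:2d}: {generalized_mean(sample, p):.4f}')
```

For positive inputs the result increases with `p`, so the harmonic mean is the smallest of the four values printed and the quadratic mean the largest.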

## Performing PLS-Based Imputation

We now call the `PLS2Based_Imputation` method of the `MIPLS2` class. It predicts the missing values in `YI` from `X` and the observed values in `YI`, using cross-validation to select the number of latent variables.

In [None]:
# Performing PLS-based imputation using the MIPLS2 class
BPI = mipls2.MIPLS2()

Thresh = 1e-5  # Threshold for the stopping criterion
CNT = 16  # Number of iterations used internally

# Perform the imputation; only the second return value (the imputed matrix) is kept
_, Yhat, _, _, _, _, _, _, _, _, _, _ = BPI.PLS2Based_Imputation(
    X, YI, App, Just_do_min, Opt_LV, Max_LV, cv_mode, Nsplits=NSplits,
    rnd_stat=rnd_stat, gm_type=gm_type, Thresh=Thresh, CNT=CNT, YT=None, verbose=True
)

# Display a sample of the imputed data
print('\nFirst 5 rows of YI (original with missing values):', YI[:5], sep='\n')
print('\nFirst 5 rows of Yhat (imputed values):', Yhat[:5], sep='\n')
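
The roles of a stopping threshold and an iteration limit, such as `Thresh` and `CNT` above, can be illustrated with a generic iterative imputation loop. The sketch below is not the MIPLS2 algorithm: it swaps in ordinary least squares (via `np.linalg.lstsq`) for PLS, and the function and variable names are hypothetical. It also uses data in which `Y` really depends on `X`, so that the regression has something to learn.

```python
import numpy as np

def iterative_regression_impute(X, YI, thresh=1e-5, max_iter=16):
    """Fill NaNs in YI by repeatedly regressing Y on X (ordinary least squares)."""
    missing = np.isnan(YI)
    # Start from the column means of the observed entries.
    Y_filled = np.where(missing, np.nanmean(YI, axis=0), YI)
    X1 = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    for iteration in range(1, max_iter + 1):
        coef, *_ = np.linalg.lstsq(X1, Y_filled, rcond=None)
        Y_pred = X1 @ coef
        Y_new = np.where(missing, Y_pred, YI)  # observed entries never change
        change = np.max(np.abs(Y_new - Y_filled))
        Y_filled = Y_new
        if change < thresh:  # stop once the imputed values have settled
            break
    return Y_filled, iteration

# Synthetic data where Y is a noisy linear function of X.
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 10)
Y_true = X_demo @ rng.rand(10, 5) + 0.05 * rng.randn(100, 5)
YI_demo = Y_true.copy()
YI_demo[rng.rand(100, 5) < 0.2] = np.nan

Y_imputed, n_iter = iterative_regression_impute(X_demo, YI_demo)
print('Iterations used:', n_iter)
print('NaNs remaining:', int(np.isnan(Y_imputed).sum()))
```

The loop ends either when the largest change in any imputed value falls below the threshold or when the iteration limit is reached, whichever comes first.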

## Evaluating the Imputation

Because the true values (`Y`) are known in this synthetic setup, we can assess the imputation by computing the RMSE (Root Mean Square Error) and R-squared between the imputed values (`Yhat`) and `Y`. Note that the code below compares all entries of the two matrices, including those that were never missing.

In [None]:
# Defining functions for RMSE and R-squared calculation
def rmse(y_true, y_pred):
    return np.sqrt(np.nanmean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    ss_res = np.nansum((y_true - y_pred) ** 2)
    ss_tot = np.nansum((y_true - np.nanmean(y_true)) ** 2)
    return 1 - (ss_res / ss_tot)

# Calculating RMSE and R-squared for the imputed values
imputation_rmse = rmse(Y, Yhat)
imputation_r_squared = r_squared(Y, Yhat)

# Display the results
print(f'RMSE of imputation: {imputation_rmse:.4f}')
print(f'R-squared of imputation: {imputation_r_squared:.4f}')
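
A metric computed over every entry can look better than the imputation really is, because entries that were never missing may contribute little or no error. A stricter check restricts the comparison to the positions that were masked. The sketch below is self-contained and uses column-mean imputation as a stand-in so that it runs without MIPLS2; the `*_demo` names and `rmse_on_mask` are hypothetical, and it assumes the imputed matrix returns observed entries unchanged, which has not been verified for `Yhat`.

```python
import numpy as np

def rmse_on_mask(y_true, y_pred, mask):
    """RMSE restricted to the entries selected by a boolean mask."""
    diff = y_true[mask] - y_pred[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

rng = np.random.RandomState(42)
Y_demo = rng.rand(100, 5)
mask_demo = rng.rand(100, 5) < 0.2
YI_demo = np.where(mask_demo, np.nan, Y_demo)

# Stand-in imputation (column means of observed values), not the MIPLS2 result.
Yhat_demo = np.where(mask_demo, np.nanmean(YI_demo, axis=0), YI_demo)

rmse_all = rmse_on_mask(Y_demo, Yhat_demo, np.ones_like(mask_demo))
rmse_missing = rmse_on_mask(Y_demo, Yhat_demo, mask_demo)
rmse_observed = rmse_on_mask(Y_demo, Yhat_demo, ~mask_demo)

print(f'RMSE over all entries:      {rmse_all:.4f}')
print(f'RMSE over imputed entries:  {rmse_missing:.4f}')
print(f'RMSE over observed entries: {rmse_observed:.4f}')
```

With the notebook's own variables, the same function could be called as `rmse_on_mask(Y, Yhat, mask)`.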

## Conclusion

In this notebook, we demonstrated how to use the `MIPLS2` class to perform Partial Least Squares (PLS) based imputation on a dataset with missing values. The imputation was evaluated using RMSE and R-squared against the known true values.

This notebook can be expanded or modified to use real-world datasets, and further tuning of parameters can be done to optimize the imputation results.