# **Transformer-based Encoding and Bayesian Model Evaluation**


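This script implements a two-stage pipeline: it first encodes the feature matrix with a transformer, then evaluates the encoded features with Bayesian cross-validation across several machine learning models.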


## **1. Feature Encoding with a Transformer**

### **Step 1: Initialize Seed**
`setSeed(1)` fixes the random number generators (RNGs) of NumPy, Python's `random`, TensorFlow, and PyTorch so that runs are reproducible.
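
The body of `setSeed` is not shown here. As a rough sketch of what such a helper usually does, the hypothetical `set_all_seeds` below seeds each library only if it can be imported; it is an illustration under that assumption, not the code from `TransformerBasedVariantEncoder`.

```python
import random


def set_all_seeds(seed: int) -> None:
    """Seed every RNG that is importable in the current environment."""
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed

    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed

    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass  # TensorFlow not installed


# Re-seeding with the same value replays the same random sequence.
set_all_seeds(1)
first_draw = random.random()
set_all_seeds(1)
second_draw = random.random()
```

Seeding alone may not make GPU runs bit-identical, since some operations are non-deterministic by default.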


### **Step 2: Load Data**
- The data is loaded from `CrystalFeatureMatrix.csv` using `pandas.read_csv()`, with all columns read as strings (`dtype=str`).
- The number of feature columns (`dimModel`) is the width of the label-based slice running from the first column through `FA15`, with both endpoints included.
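
The column count can be checked on a small in-memory table. In this sketch the CSV text and every column name except `FA15` are invented for illustration; only the slicing pattern mirrors the step above.

```python
import io

import pandas as pd

# Hypothetical stand-in for CrystalFeatureMatrix.csv (made-up columns and values).
csv_text = "FA1,FA2,FA15,label\n0.1,0.2,0.3,x\n0.4,0.5,0.6,y\n"

df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# .loc slices by label and includes both endpoints, so FA15 is counted.
features = df.loc[:, df.columns[0]:"FA15"]
dim_model = features.shape[1]

# dtype=str keeps every value as text, so convert before any numeric use.
numeric = features.astype(float)
```

Here `dim_model` covers `FA1`, `FA2`, and `FA15`, while the trailing `label` column falls outside the slice.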


### **Step 3: Run Transformer Encoding**
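- An instance of `TransformerBasedVariantEncoderRunner` is created with the following parameters:
  - `dataPath`: the input CSV (`CrystalFeatureMatrix.csv`)
  - `modelDir`: the directory where each layer's model is saved (`checkpoints/transformer_models`)
  - `outputDir`: the directory where the encoded features are written (`outputs/encoded_features`)
  - `dimModel`: the feature-column count computed in Step 2
  - `numHeads`: `5` (presumably the number of attention heads)
- `runner.run(maxLayers=8)` then starts the encoding, which processes layers 1 through 8 in turn.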


In [None]:
from TransformerBasedVariantEncoder import TransformerBasedVariantEncoderRunner, setSeed
import pandas as pd

if __name__ == "__main__":
    # Example execution
    setSeed(1)
    dataPath = 'CrystalFeatureMatrix.csv'
    modelDir = 'checkpoints/transformer_models'
    outputDir = 'outputs/encoded_features'

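    # Read every column as a string; here the frame is only used to count columns.
    df = pd.read_csv(dataPath, dtype=str)
    # .loc label slices are inclusive: first column through 'FA15'.
    dimModel = df.loc[:, df.columns[0]:'FA15'].shape[1]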

    runner = TransformerBasedVariantEncoderRunner(
        dataPath=dataPath,
        modelDir=modelDir,
        outputDir=outputDir,
        dimModel=dimModel,
        numHeads=5
    )
    runner.run(maxLayers=8)

Starting the encoding process with 8 layers...
Processing Layers: [■■■■■□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□] 12.50%
----------------------------------------------------------------------------------------------------
Processing layer 1/8...
[1/4] Model for layer 1 saved at: checkpoints/transformer_models/encoderLayer1.pth
[2/4] Encoding the input tensor using the transformer model...
[3/4] Processing the encoded output...
[4/4] Layer 1 encoding completed. Output saved to: outputs/encoded_features/encodedOutputLayer1.csv


Processing Layers: [■■■■■■■■■■□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□] 25.00%
----------------------------------------------------------------------------------------------------
Processing layer 2/8...
[1/4] Model for layer 2 saved at: checkpoints/transformer_models/encoderLayer2.pth
[2/4] Encoding the input tensor using the transformer model...

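The runner is created with `numHeads=5`. Its internals are not shown, but if it builds on standard multi-head attention (for example PyTorch's `nn.MultiheadAttention`, which is an assumption here), the model width must divide evenly by the number of heads. A hypothetical pre-flight check, `check_head_split`, could look like this:

```python
def check_head_split(dim_model: int, num_heads: int) -> int:
    """Return the per-head width, or raise if dim_model does not split evenly."""
    if num_heads <= 0:
        raise ValueError("num_heads must be positive")
    if dim_model % num_heads != 0:
        raise ValueError(
            f"dim_model={dim_model} is not divisible by num_heads={num_heads}"
        )
    return dim_model // num_heads


# Illustrative values only; the real dimModel depends on the CSV.
head_width = check_head_split(15, 5)
```

Running such a check before constructing the runner would turn a shape error deep inside the model into an immediate, readable message.
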

## **2. Bayesian Cross-Validation with Machine Learning Models**

### **Step 1: Ignore UserWarnings**
- The script globally suppresses `UserWarning` using the following code:

In [None]:
import warnings
import re

warnings.simplefilter("ignore", UserWarning)
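
`warnings.simplefilter("ignore", UserWarning)` applies to the whole process from that point on. When only one noisy call needs silencing, the same filter can be scoped with `warnings.catch_warnings()`. The sketch below uses a hypothetical `noisy_step` function to show the effect:

```python
import warnings


def noisy_step() -> int:
    """Hypothetical stand-in for a library call that emits a UserWarning."""
    warnings.warn("feature names were not checked", UserWarning)
    return 1


def count_user_warnings(suppress: bool) -> int:
    """Run noisy_step() and report how many warnings got through."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # start from a known filter state
        if suppress:
            # Same call as in the cell above, but it only lasts until the
            # enclosing catch_warnings() block exits.
            warnings.simplefilter("ignore", UserWarning)
        noisy_step()
    return len(caught)


suppressed = count_user_warnings(suppress=True)
unsuppressed = count_user_warnings(suppress=False)
```

Filters added inside the `with` block are discarded on exit, so warnings raised elsewhere in the notebook remain visible.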

### **Step 2: Set Seed for Bayesian Models**

- `set_seed(42)`, imported from `BayesianCV`, is called to ensure consistent results for the Bayesian cross-validation models (the encoding stage used `setSeed(1)`):

In [None]:
from BayesianCV import BayesianCrossValidatedModel, set_seed
set_seed(42)
folder_path = 'outputs/encoded_features'
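
The folder set in `folder_path` receives one file per layer, named like `encodedOutputLayer1.csv` in the log above. How `run_on_folder` orders these files is not shown. If you want to inspect them in layer order yourself, plain string sorting would place `Layer10` before `Layer2`, so a numeric sort is safer. The helper `list_layer_files` below is a hypothetical sketch, demonstrated on empty placeholder files in a temporary directory:

```python
import re
import tempfile
from pathlib import Path

# File-name pattern taken from the encoder log output.
LAYER_PATTERN = re.compile(r"encodedOutputLayer(\d+)\.csv$")


def list_layer_files(folder):
    """Return (layer_number, path) pairs sorted by layer number."""
    found = []
    for path in Path(folder).glob("*.csv"):
        match = LAYER_PATTERN.search(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return sorted(found, key=lambda pair: pair[0])


# Demo with placeholder files instead of real encoder output.
with tempfile.TemporaryDirectory() as tmp:
    for n in (10, 2, 1):
        (Path(tmp) / f"encodedOutputLayer{n}.csv").touch()
    (Path(tmp) / "notes.csv").touch()  # ignored: does not match the pattern
    layer_numbers = [layer for layer, _ in list_layer_files(tmp)]
```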

### **Step 3: Run Bayesian Cross-Validation on Encoded Features**

The script then loads the encoded feature files from the `outputs/encoded_features` directory and runs Bayesian cross-validation using the following machine learning models:

- **Random Forest (`rf`)**:

In [None]:
model_rf = BayesianCrossValidatedModel(model_type="rf")
model_rf.run_on_folder(folder_path, 'bayesian_rf.csv')

- **Multi-layer Perceptron (`mlp`)**:

In [None]:
model_mlp = BayesianCrossValidatedModel(model_type="mlp")
model_mlp.run_on_folder(folder_path, 'bayesian_mlp.csv')

- **K-Nearest Neighbors (`knn`)**:

In [None]:
model_knn = BayesianCrossValidatedModel(model_type="knn")
model_knn.run_on_folder(folder_path, 'bayesian_knn.csv')

- **Support Vector Machine (`svm`)**:

In [None]:
model_svm = BayesianCrossValidatedModel(model_type="svm")
model_svm.run_on_folder(folder_path, 'bayesian_svm.csv')

## **3. Output Files**
Each model's results are saved to its own CSV file (`bayesian_rf.csv`, `bayesian_mlp.csv`, `bayesian_knn.csv`, and `bayesian_svm.csv`), which contains the performance metrics from the cross-validation.
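
To compare models side by side, the four result files can be merged into one list of rows tagged with the model name. The column layout of these files is not documented here, so the sketch below keeps whatever columns it finds. The `file` and `score` columns in the demo are invented, and `collect_results` is a hypothetical helper.

```python
import csv
import tempfile
from pathlib import Path


def collect_results(results_dir, model_types=("rf", "mlp", "knn", "svm")):
    """Read each bayesian_<model>.csv that exists and tag its rows with the model."""
    rows = []
    for model_type in model_types:
        path = Path(results_dir) / f"bayesian_{model_type}.csv"
        if not path.exists():
            continue  # skip models that have not been run yet
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                rows.append({"model": model_type, **row})
    return rows


# Demo with made-up files; real columns and values will differ.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("rf", "svm"):
        with (Path(tmp) / f"bayesian_{name}.csv").open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["file", "score"])
            writer.writerow(["encodedOutputLayer1.csv", "0.5"])
    rows = collect_results(tmp)
```

Pass whichever directory the result files were actually written to as `results_dir`.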