# **Transformer-based Encoding and Bayesian Model Evaluation**


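This script implements a two‑stage pipeline: a transformer first encodes the feature matrix, and Bayesian‑optimized machine learning models are then cross‑validated on the encoded features.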


## **1. Feature Encoding with a Transformer**

### **Step 1: Initialize Seed**
`setSeed(1)` fixes the random number generators (RNGs) of Python's `random` module, NumPy, TensorFlow, and PyTorch so that runs are reproducible.
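
The body of `setSeed` is not shown in this notebook. As a minimal sketch of what such a helper typically does, assuming a hypothetical name `seed_everything` and treating NumPy and PyTorch as optional dependencies, one might write:

```python
import random


def seed_everything(seed: int) -> None:
    """Hypothetical stand-in for setSeed: seed every RNG that is installed."""
    random.seed(seed)  # Python's built-in generator
    try:
        import numpy as np
        np.random.seed(seed)  # NumPy's legacy global generator
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # PyTorch generators on all devices
    except ImportError:
        pass


# Re-seeding with the same value reproduces the same random draw.
seed_everything(1)
first = random.random()
seed_everything(1)
second = random.random()
print(first == second)
```

TensorFlow would be seeded in the same way with `tf.random.set_seed(seed)`; it is left out of the sketch only to keep the example light.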


### **Step 2: Load Data**
- The data is loaded from `CrystalFeatureMatrix.csv` with `pandas.read_csv()`; every column is read as a string (`dtype=str`).
- The model dimension (`dimModel`) is the number of feature columns, counted from the first column through `FA15`. Label-based `.loc` slices include both endpoints, so `FA15` itself is counted.
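
The column count can be tried out on a small stand-in table. In the sketch below the CSV content is made up for illustration; only the `FA15` column name comes from the script.

```python
import io

import pandas as pd

# Made-up stand-in for CrystalFeatureMatrix.csv (assumption: FA15 is the last feature column).
csv_text = "FA1,FA2,FA15,label\n0.1,0.2,0.3,a\n0.4,0.5,0.6,b\n"
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# .loc label slices are inclusive, so FA15 is part of the selection.
dim_model = df.loc[:, df.columns[0]:"FA15"].shape[1]
print(dim_model)  # 3

# dtype=str keeps every cell as text; convert before doing arithmetic.
features = df.loc[:, df.columns[0]:"FA15"].astype(float)
```

Because the slice is label-based, columns that come after `FA15` (here the illustrative `label` column) are excluded from the count.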


### **Step 3: Run Transformer Encoding**
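- A `TransformerBasedVariantEncoderRunner` is created with the data path, the model and output directories, `dimModel`, and `numHeads=5`. Calling `run(maxLayers=8)` then processes layers 1 through 8, saving a model checkpoint and an encoded CSV for each: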


In [1]:
from TransformerBasedVariantEncoder import TransformerBasedVariantEncoderRunner, setSeed
import pandas as pd

if __name__ == "__main__":
    # Example execution
    setSeed(1)
    dataPath = 'CrystalFeatureMatrix.csv'
    modelDir = 'checkpoints/transformer_models'
    outputDir = 'outputs/encoded_features'

    df = pd.read_csv(dataPath, dtype=str)
    dimModel = df.loc[:, df.columns[0]:'FA15'].shape[1]

    runner = TransformerBasedVariantEncoderRunner(
        dataPath=dataPath,
        modelDir=modelDir,
        outputDir=outputDir,
        dimModel=dimModel,
        numHeads=5
    )
    runner.run(maxLayers=8)

Starting the encoding process with 8 layers...
Processing Layers: [■■■■■□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□] 12.50%
----------------------------------------------------------------------------------------------------
Processing layer 1/8...
[1/4] Model for layer 1 saved at: checkpoints/transformer_models/encoderLayer1.pth
[2/4] Encoding the input tensor using the transformer model...
[3/4] Processing the encoded output...
[4/4] Layer 1 encoding completed. Output saved to: outputs/encoded_features/encodedOutputLayer1.csv


Processing Layers: [■■■■■■■■■■□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□] 25.00%
----------------------------------------------------------------------------------------------------
Processing layer 2/8...
[1/4] Model for layer 2 saved at: checkpoints/transformer_models/encoderLayer2.pth
[2/4] Encoding the input tensor using the transformer model...
[3/4] Processing the encoded output...
[4/4] Layer 2 encoding completed. Output saved to: outputs/encoded_features/encodedOutputLaye
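
Standard multi-head attention splits the model dimension evenly across heads, so implementations such as PyTorch's `nn.MultiheadAttention` require `dimModel` to be divisible by `numHeads`. Whether `TransformerBasedVariantEncoderRunner` checks this itself is not shown. A small guard that could run before the runner is constructed (the helper name `check_heads` and the value 15 are illustrative assumptions) might look like this:

```python
def check_heads(dim_model: int, num_heads: int) -> int:
    """Return the per-head dimension, or raise if the split is uneven."""
    if dim_model <= 0 or num_heads <= 0:
        raise ValueError("dim_model and num_heads must be positive")
    if dim_model % num_heads != 0:
        raise ValueError(
            f"dim_model={dim_model} is not divisible by num_heads={num_heads}"
        )
    return dim_model // num_heads


# 15 is an illustrative value, not the real dimModel of the dataset.
head_dim = check_heads(dim_model=15, num_heads=5)
print(head_dim)  # 3
```

Failing early with a clear message is usually easier to debug than a shape error raised deep inside the attention layer.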

## **2. Bayesian Cross-Validation with Machine Learning Models**

### **Step 1: Ignore UserWarnings**
- The script globally suppresses `UserWarning` using the following code:

In [2]:
import warnings
import re

warnings.simplefilter("ignore", UserWarning)
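
`warnings.simplefilter` changes the process-wide filter list, so the suppression stays in effect for everything that runs afterwards. When only one noisy call needs silencing, `warnings.catch_warnings()` offers a scoped alternative. The sketch below (the function name is illustrative) applies the same filter inside such a context and shows that other warning categories still get through:

```python
import warnings


def visible_warning_names():
    """Emit two warnings under the script's filter and report which survive."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")               # record everything by default
        warnings.simplefilter("ignore", UserWarning)  # same filter as the script
        warnings.warn("noisy but harmless", UserWarning)
        warnings.warn("still reported", RuntimeWarning)
    # Filters are restored to their previous state when the block exits.
    return [w.category.__name__ for w in caught]


visible = visible_warning_names()
print(visible)  # ['RuntimeWarning']
```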

### **Step 2: Set Seed for Bayesian Models**

- `set_seed(42)`, imported from `BayesianCV`, is called so that the Bayesian cross-validation models give consistent results. The folder containing the encoded features is set at the same time:

In [3]:
from BayesianCV import BayesianCrossValidatedModel, set_seed
set_seed(42)
folder_path = 'outputs/encoded_features'
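
How `run_on_folder` discovers its input files is not shown. Assuming the files keep the `encodedOutputLayer<N>.csv` naming seen in the stage 1 log, a plain alphabetical sort would place layer 10 before layer 2 if more than nine layers were ever used. A sketch of a numeric ordering (the helper name is hypothetical) could be:

```python
import re
import tempfile
from pathlib import Path

LAYER_PATTERN = re.compile(r"encodedOutputLayer(\d+)\.csv$")


def list_encoded_files(folder):
    """Return the encoded CSVs in `folder`, sorted by layer number."""
    matches = []
    for path in Path(folder).iterdir():
        found = LAYER_PATTERN.match(path.name)
        if found:
            matches.append((int(found.group(1)), path))
    return [path for _, path in sorted(matches)]


# Demonstration on a temporary folder with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for layer in (10, 2, 1):
        (Path(tmp) / f"encodedOutputLayer{layer}.csv").touch()
    (Path(tmp) / "notes.txt").touch()  # unrelated file, ignored
    names = [p.name for p in list_encoded_files(tmp)]

print(names)
```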

### **Step 3: Run Bayesian Cross-Validation on Encoded Features**

The script then loads the encoded feature files from the `outputs/encoded_features` directory and runs Bayesian cross-validation using the following machine learning models:

- **Random Forest (`rf`)**:

In [4]:
model_rf = BayesianCrossValidatedModel(model_type="rf")
model_rf.run_on_folder(folder_path, './checkpoints/BayesianResult/bayesian_rf.csv')

|   iter    |  target   | max_depth | min_sa... | min_sa... | n_esti... |
-------------------------------------------------------------------------
| 1         | 0.3818    | 5.919     | 7.483     | 2.002     | 37.21     |
| 2         | 0.3303    | 4.027     | 1.831     | 5.353     | 41.1      |
| 3         | 0.3817    | 5.777     | 5.849     | 9.546     | 71.67     |
| 4         | 0.3307    | 4.431     | 8.903     | 2.493     | 70.34     |
| 5         | 0.3822    | 5.921     | 6.028     | 4.527     | 27.83     |
| 6         | 0.3822    | 5.313     | 6.079     | 4.415     | 27.75     |
| 7         | 0.3796    | 5.187     | 8.909     | 2.394     | 31.82

- **Multi-layer Perceptron (`mlp`)**:

In [5]:
model_mlp = BayesianCrossValidatedModel(model_type="mlp")
model_mlp.run_on_folder(folder_path, './checkpoints/BayesianResult/bayesian_mlp.csv')

|   iter    |  target   | activa... |  epochs   |  layers   |   units   |
-------------------------------------------------------------------------
| 1         | 0.4055    | 0.417     | 44.41     | 1.0       | 37.21     |
| 2         | 0.4025    | 0.1468    | 31.85     | 1.745     | 41.1      |
| 3         | 0.4687    | 0.3968    | 40.78     | 2.677     | 71.67     |
| 4         | 0.4162    | 0.2045    | 47.56     | 1.11      | 70.34     |
| 5         | 0.393     | 0.4173    | 41.17     | 1.562     | 27.83     |
| 6         | 0.4428    | 0.1926    | 41.22     | 2.876     | 70.95     |
| 7         | 0.4687    | 0.2583    | 40.0      | 2.503     | 71.92

- **K-Nearest Neighbors (`knn`)**:

In [6]:
model_knn = BayesianCrossValidatedModel(model_type="knn")
model_knn.run_on_folder(folder_path, './checkpoints/BayesianResult/bayesian_knn.csv')

|   iter    |  target   | leaf_size | n_neig... |     p     |
-------------------------------------------------------------
| 1         | 0.4606    | 26.68     | 14.69     | 1.0       |
| 2         | 0.3987    | 22.09     | 3.788     | 1.092     |
| 3         | 0.4523    | 17.45     | 7.566     | 1.397     |
| 4         | 0.4568    | 31.55     | 8.965     | 1.685     |
| 5         | 0.4592    | 18.18     | 17.68     | 1.027     |
| 6         | 0.4606    | 26.49     | 14.68     | 1.16      |
| 7         | 0.4574    | 12.91     | 12.51     | 1.725     |
| 8         | 0.4482    | 34.86     | 16.23     | 2.0       |
| 9         | 0.4549    | 24.39

- **Support Vector Machine (`svm`)**:

In [7]:
model_svm = BayesianCrossValidatedModel(model_type="svm")
model_svm.run_on_folder(folder_path, './checkpoints/BayesianResult/bayesian_svm.csv')

|   iter    |  target   |     C     |   gamma   |
-------------------------------------------------
| 1         | 0.181     | 2.143     | 0.7763    |
| 2         | 0.1185    | 0.1006    | 0.4419    |
| 3         | 0.4929    | 0.8191    | 0.2739    |
| 4         | 0.3671    | 1.013     | 0.4764    |
| 5         | 0.2684    | 2.044     | 0.6311    |
| 6         | 0.505     | 1.03      | 0.2       |
| 7         | 0.5037    | 1.527     | 0.2       |
| 8         | 0.1759    | 0.8316    | 0.703     |
| 9         | 0.5049    | 1.01      | 0.2       |
| 10        | 0.4885    | 5.0       | 0.2       |
| 11        | 0.1338    | 5.0       | 0.8862
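
The optimizer explores a continuous search space, which is why the logs report fractional values even for hyperparameters that must be whole numbers, such as `max_depth` or `leaf_size`. How `BayesianCrossValidatedModel` converts them is not shown. A common approach, sketched here with a hypothetical helper, is to round the integer-valued parameters before the model is built:

```python
def cast_integer_params(params, integer_keys):
    """Round the named parameters to int and leave the rest untouched."""
    return {
        name: int(round(value)) if name in integer_keys else value
        for name, value in params.items()
    }


# leaf_size and p as reported for the first KNN iteration in the log.
suggested = {"leaf_size": 26.68, "p": 1.0}
model_kwargs = cast_integer_params(suggested, integer_keys={"leaf_size"})
print(model_kwargs)  # {'leaf_size': 27, 'p': 1.0}
```

Genuinely continuous parameters, such as `C` and `gamma` for the SVM, would simply be left out of `integer_keys`.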

## **3. Output Files**
Each model's cross-validation results are saved to its own CSV file under `./checkpoints/BayesianResult/`: `bayesian_rf.csv`, `bayesian_mlp.csv`, `bayesian_knn.csv`, and `bayesian_svm.csv`. These files contain the performance metrics from the cross-validation.
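
A quick way to confirm that all four runs finished is to check that the expected files exist. The sketch below uses a hypothetical helper and demonstrates it on a temporary directory; in practice it would be pointed at the results folder.

```python
import tempfile
from pathlib import Path

MODEL_TYPES = ("rf", "mlp", "knn", "svm")


def missing_results(result_dir):
    """Return the expected result files that are absent from result_dir."""
    expected = [f"bayesian_{model}.csv" for model in MODEL_TYPES]
    return [name for name in expected if not (Path(result_dir) / name).is_file()]


# Demonstration: only the random forest result has been written so far.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "bayesian_rf.csv").touch()
    missing = missing_results(tmp)

print(missing)  # ['bayesian_mlp.csv', 'bayesian_knn.csv', 'bayesian_svm.csv']
```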