# BAGECO 2025 - Metagenomics Workshop

**From Sequences to Predictions – End-to-end analysis using BioAutoML**

*This class was developed for the 17th Symposium on Bacterial Genetics and Ecology (BAGECO 2025) and will be presented at the Metagenomics Workshop on 30.06.2025.*

*For any questions regarding this material, please contact:*

*Breno L. S. de Almeida: breno.livio-silva.de.almeida@ufz.de*

---

This Jupyter Notebook demonstrates how to use [BioAutoML](https://academic.oup.com/bib/article/23/4/bbac218/6618238) without requiring any local installation.

The demonstration includes three use cases:

1. Multi-class prediction of DNA/RNA sequences (more than two classes)

2. Binary classification of protein sequences (two classes)

3. Prediction using structured data

---

## 1. Installing BioAutoML

First, let's clone the BioAutoML repository from GitHub using Git. Then, we'll navigate into the BioAutoML directory.

In [1]:
!git clone https://github.com/Bonidia/BioAutoML.git BioAutoML
%cd /content/BioAutoML

Cloning into 'BioAutoML'...
remote: Enumerating objects: 1081, done.
remote: Counting objects: 100% (357/357), done.
remote: Compressing objects: 100% (206/206), done.
remote: Total 1081 (delta 195), reused 256 (delta 148), pack-reused 724 (from 1)
Receiving objects: 100% (1081/1081), 82.17 MiB | 10.89 MiB/s, done.
Resolving deltas: 100% (470/470), done.
Updating files: 100% (416/416), done.
/content/BioAutoML


We’ll use Git to switch to the workshop-specific branch. Next, we'll download the required code to run MathFeature within BioAutoML.

In [2]:
!git checkout bageco
!git submodule init
!git submodule update

Branch 'bageco' set up to track remote branch 'bageco' from 'origin'.
Switched to a new branch 'bageco'
Submodule 'MathFeature' (https://github.com/Bonidia/MathFeature.git) registered for path 'MathFeature'
Submodule 'MathFeature-WebServer' (https://github.com/Bonidia/MathFeature-WebServer.git) registered for path 'MathFeature-WebServer'
Cloning into '/content/BioAutoML/MathFeature'...
Cloning into '/content/BioAutoML/MathFeature-WebServer'...
Submodule path 'MathFeature': checked out '69d2a3206c6db02be9d46db74ad61fc05cc6aee2'
Submodule path 'MathFeature-WebServer': checked out '5d13e073b30d0f9b333adb411bc843722c09d0ba'


Let's now retrieve all the necessary packages to run BioAutoML. For our demonstration we will use [uv](https://astral.sh/blog/uv), a very fast package manager.

If you prefer to install BioAutoML locally, you can use Miniconda on your machine. Follow the steps above, with the exception of `!git checkout bageco` (and without the leading `!`, which is only needed inside a notebook), and then run the following:

```bash
conda env create -f BioAutoML-env.yml -n bioautoml
```

This installs the packages required by BioAutoML in an isolated environment, which you can activate with:

```bash
conda activate bioautoml
```

And exit the environment with:

```bash
conda deactivate
```

Let's continue with uv and run `uv sync` to download all the necessary packages.

In [3]:
#!uv add scikit-learn==1.1.0 pandas catboost lightgbm matplotlib-inline  hyperopt setuptools biopython xgboost imbalanced-learn igraph tpot
!uv sync

[2K[1AUsing CPython [36m3.8.20[39m
Creating virtual environment at: [36m.venv[39m
[2mResolved [1m80 packages[0m [2min 6ms[0m[0m
[2K[37m⠙[0m [2mPreparing packages...[0m (0/49)
[2K[1A[37m⠙[0m [2mPreparing packages...[0m (0/49)
[2K[1A[37m⠙[0m [2mPreparing packages...[0m (0/49)
[2mpytz      [0m [32m[2m------------------------------[0m[0m     0 B/497.29 KiB
[2K[2A[37m⠙[0m [2mPreparing packages...[0m (0/49)
[2mpytz      [0m [32m-[2m-----------------------------[0m[0m 14.91 KiB/497.29 KiB
[2K[2A[37m⠙[0m [2mPreparing packages...[0m (0/49)
[2mtzdata    [0m [32m[2m------------------------------[0m[0m     0 B/339.69 KiB
[2mpytz      [0m [32m-[2m-----------------------------[0m[0m 14.91 KiB/497.29 KiB
[2K[3A[37m⠙[0m [2mPreparing packages...[0m (0/49)
[2mtzdata    [0m [32m--[2m----------------------------[0m[0m 14.91 KiB/339.69 KiB
[2mpytz      [0m [32m-[2m-----------------------------[0m[0m 14.91 KiB/497.29 KiB
[2K

## 2. Running BioAutoML

BioAutoML provides multiple scripts for automated classification tasks, including:

- `BioAutoML-feature.py`: Performs automated classification for DNA/RNA sequences. It extracts descriptors (e.g., using MathFeature), selects the most relevant features, and proceeds with binary or multiclass classification.

- `BioAutoML-feature-protein.py`: Executes automated classification for protein sequences, following the same workflow as the DNA/RNA script—feature extraction, selection, and classification.

- `BioAutoML-binary.py`: Handles binary classification tasks. This script can be called by the DNA/RNA or protein scripts or run independently using structured input data.

- `BioAutoML-multiclass.py`: Manages multiclass classification. Like the binary script, it can be invoked by other BioAutoML scripts or used standalone with structured data.


## 2.1 DNA/RNA classification

Let us use `BioAutoML-feature.py` for DNA/RNA classification.

BioAutoML only handles DNA/RNA sequences written in the four-letter nucleotide alphabet (A, C, G, and T) and removes sequences that contain any other symbol. Uracil (U) in RNA is treated as the counterpart of thymine (T) in DNA.
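The filtering rule described above can be sketched in a few lines of Python. This is a minimal illustration, not BioAutoML's actual implementation: the function name `normalize_and_filter` and the example records are hypothetical, and it assumes that U is mapped to T before the alphabet check.

```python
from typing import Dict

VALID_NUCLEOTIDES = set("ACGT")


def normalize_and_filter(records: Dict[str, str]) -> Dict[str, str]:
    """Map U to T, then keep only sequences made of A, C, G and T."""
    kept = {}
    for name, seq in records.items():
        seq = seq.upper().replace("U", "T")  # treat RNA uracil as thymine
        if seq and set(seq) <= VALID_NUCLEOTIDES:
            kept[name] = seq
        # sequences with any other symbol (e.g. N) are dropped
    return kept


# Hypothetical records used only for illustration.
example = {
    "seq_rna": "AUGGCU",        # RNA: becomes ATGGCT
    "seq_dna": "ATGCGT",        # already valid
    "seq_ambiguous": "ATGNNT",  # contains N: removed
}
filtered = normalize_and_filter(example)
print(filtered)
```

Running a check like this on your own FASTA records before training can help explain why the number of samples reported by the tool is smaller than the number of sequences in the input files.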

Locally, inside the conda environment, you would run the script with `python BioAutoML-feature.py` followed by the required arguments. Because this notebook uses the uv package manager, we will use `uv run` to run the script in Google Colab.

We have the following arguments that can be used to run the script:

`--fasta_train`: List of paths containing FASTA files used for training;

`--fasta_label_train`: List of labels for the FASTA files used for training;

`--fasta_test`: List of paths containing FASTA files used for testing (optional);

`--fasta_label_test`: List of labels for the FASTA files used for testing (optional);

`--output`: Folder to create and save results;

`--n_cpu`: Number of CPU cores to use (`-1` will use all cores available);

`--estimations`: Number of trials using Bayesian optimization to find the best combination of descriptors and classifiers for classification.
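Because `--fasta_train` and `--fasta_label_train` are parallel lists, the paths and labels must be given in the same order. The sketch below assembles the command from a label-to-path mapping so the two lists cannot drift apart. The helper `build_feature_command` is hypothetical (it is not part of BioAutoML); only the flags listed above are used, and the default of 50 trials simply follows the recommendation in this tutorial.

```python
import shlex
from typing import Dict, List, Optional


def build_feature_command(
    train: Dict[str, str],
    output: str,
    test: Optional[Dict[str, str]] = None,
    n_cpu: int = -1,
    estimations: int = 50,
) -> List[str]:
    """Assemble the argument list for BioAutoML-feature.py.

    Keys are class labels and values are FASTA paths, so paths and
    labels are always emitted in the same order.
    """
    cmd = ["uv", "run", "BioAutoML-feature.py"]
    cmd += ["--fasta_train", *train.values()]
    cmd += ["--fasta_label_train", *train.keys()]
    if test:  # the test set is optional
        cmd += ["--fasta_test", *test.values()]
        cmd += ["--fasta_label_test", *test.keys()]
    cmd += ["--output", output, "--n_cpu", str(n_cpu)]
    cmd += ["--estimations", str(estimations)]
    return cmd


train_files = {
    "rRNA": "exemplo_fasta/train/rRNA.fasta",
    "sRNA": "exemplo_fasta/train/sRNA.fasta",
    "tRNA": "exemplo_fasta/train/tRNA.fasta",
}
command = build_feature_command(train_files, output="experiment_1", estimations=5)
print(shlex.join(command))
```

The printed string can be pasted after a `!` in a notebook cell, or the list can be passed to `subprocess.run` in a script.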

We will run the script using an example from the BioAutoML paper, which involves classifying three classes of RNA: rRNA, sRNA, and tRNA. We will use just five trials for a quick example, but it is recommended to choose at least 50 trials to generate a more accurate model for your problem.


In [4]:
!uv run BioAutoML-feature.py \
--fasta_train exemplo_fasta/train/rRNA.fasta exemplo_fasta/train/sRNA.fasta exemplo_fasta/train/tRNA.fasta \
--fasta_label_train rRNA sRNA tRNA \
--fasta_test exemplo_fasta/test/rRNA.fasta exemplo_fasta/test/sRNA.fasta exemplo_fasta/test/tRNA.fasta \
--fasta_label_test rRNA sRNA tRNA \
--output experiment_1 --n_cpu -1 --estimations 5

Streaming output truncated to the last 5000 lines.
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.164686 seconds.
You can set `force_row_wise=true` to remove the overhead.

[LightGBM] [Info] Total Bins 5289
[LightGBM] [Info] Number of data points in the train set: 185, number of used features: 355
[LightGBM] [Info] Start training from score -1.853060
[LightGBM] [Info] Start training from score -0.432864
[LightGBM] [Info] Start training from score -1.636837
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.271508 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 5243
[LightGBM] [Info] Number of data points in the train set: 185, number of used features: 355


The script writes several result files to the output folder it creates.

Let's use the pandas library to inspect them. For example, we can look at the best features selected for training (which are therefore also used for testing).

In [8]:
import pandas as pd

df_best_feature_train = pd.read_csv("experiment_1/best_feature_train.csv").sample(5)
df_best_feature_train

FileNotFoundError: [Errno 2] No such file or directory: 'experiment_1/best_feature_train.csv'
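A `FileNotFoundError` like the one above means that the path did not exist, relative to the current working directory, when the cell ran. Before reading a result file it can help to list what the output folder actually contains. The following is a minimal sketch using only the standard library; the helper name `list_result_files` is hypothetical and not part of BioAutoML.

```python
from pathlib import Path
from typing import List


def list_result_files(output_dir: str) -> List[str]:
    """Return the sorted names of the CSV files found in output_dir.

    Returns an empty list if the folder does not exist yet.
    """
    folder = Path(output_dir)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.glob("*.csv"))


available = list_result_files("experiment_1")
print(available if available else "No CSV results found in experiment_1 yet")
```

If the expected file is missing from the list, check that the run has finished and that the notebook is still in the directory from which the script was launched.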

BioAutoML employs 10-fold cross-validation on the training dataset to evaluate and recommend the most suitable descriptors and classifier for a given dataset. This approach ensures robust model selection by assessing average performance metrics along with their standard deviations.
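To make the cross-validation summary concrete, the sketch below splits sample indices into 10 folds and reports the mean and standard deviation of a per-fold score. It illustrates the general technique in plain Python and is not BioAutoML's code: the helper names are hypothetical, the per-fold accuracies are made up, the split is not stratified, and using the population standard deviation is an assumption.

```python
import random
import statistics
from typing import List, Tuple


def kfold_indices(
    n_samples: int, k: int = 10, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Shuffle the indices and return k (train, validation) index pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    splits = []
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Return the mean and population standard deviation of fold scores."""
    return statistics.mean(scores), statistics.pstdev(scores)


splits = kfold_indices(n_samples=205, k=10)
print([len(validation) for _, validation in splits])  # fold sizes

# Made-up per-fold accuracies, one value per validation fold.
fold_accuracy = [1.0, 1.0, 0.95, 1.0, 1.0, 1.0, 0.95, 1.0, 1.0, 1.0]
mean_acc, std_acc = summarize(fold_accuracy)
print(f"ACC = {mean_acc:.2f} +/- {std_acc:.2f}")
```

Each sample is used for validation exactly once, so the mean describes typical performance and the standard deviation shows how much it varies between folds.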

In [11]:
df_training_kfold = pd.read_csv("experiment_1/training_kfold(10)_metrics.csv")
df_training_kfold

Unnamed: 0,ACC,std_ACC,MCC,std_MCC,F1_micro,std_F1_micro,F1_macro,std_F1_macro,F1_w,std_F1_w,kappa,std_kappa
0,0.99,0.03,0.981,0.06,0.99,0.03,0.9865,0.04,0.9887,0.03,0.9787,0.06


We have the following confusion matrix for the training dataset.

In [19]:
df_training_confusion_matrix = pd.read_csv("experiment_1/training_confusion_matrix.csv", header=None)
df_training_confusion_matrix

Unnamed: 0,0,1,2,3,4
0,REAL,rRNA,sRNA,tRNA,All
1,rRNA,32,0,0,32
2,sRNA,0,133,0,133
3,tRNA,0,2,38,40
4,All,32,135,38,205


A test dataset is optional for model evaluation in BioAutoML. If provided, it serves as an independent validation set, ensuring unbiased performance assessment. BioAutoML automatically applies the previously selected features to the test dataset for consistency.

In [14]:
df_best_feature_test = pd.read_csv("experiment_1/best_feature_test.csv").sample(5)
df_best_feature_test

Unnamed: 0,AAC,AGC,AGT,ATA,ATG,ATT,CAT,CCC,CGA,CGG,...,T_ACG,T_ATC,T_CAC,T_CCA,T_CCC,T_GCG,T_GGG,T_TGA,average_ORF_length,maximum_GC_content_ORF
33,0.407943,-0.411942,-0.262346,1.11687,-0.72437,2.689235,0.90693,-0.205443,-0.673999,-0.433216,...,-0.57725,-0.613017,-0.587293,0.241295,-0.471188,-0.594507,-0.251639,-0.457552,1.521763,0.647435
15,-1.043542,-1.069199,-0.978176,0.133031,0.939737,-0.752259,-0.028481,-0.720289,-0.133629,0.427538,...,-0.57725,-0.258021,-0.587293,-0.470438,-0.471188,0.301625,-0.459647,-0.457552,-0.653143,-0.827785
23,-1.043542,-1.069199,0.142663,0.010493,0.723885,-0.752259,1.003437,-0.720289,-0.252084,0.238852,...,-0.260843,-0.258021,-0.587293,-0.470438,-0.471188,-0.146441,-0.459647,-0.457552,-0.653143,-0.827785
9,1.316113,0.640389,-0.512691,1.310778,-0.52017,0.206846,0.614353,0.284083,0.519953,-0.076214,...,-0.260843,-0.613017,-0.042519,-0.114572,-0.141517,-0.594507,-0.043631,-0.14593,0.917622,1.195374
20,0.13147,-0.005068,0.180787,-0.572904,1.896282,-0.354263,1.675304,-0.303509,-1.419709,-0.659471,...,0.055564,0.096974,-0.042519,-0.114572,-0.471188,-0.594507,-0.459647,-0.457552,2.246731,1.090728


We can check the metrics regarding the test dataset.

In [15]:
df_metrics_test = pd.read_csv("experiment_1/metrics_test.csv")
df_metrics_test

Unnamed: 0.1,Unnamed: 0,precision,recall,f1-score,support
0,rRNA,1.0,1.0,1.0,8.0
1,sRNA,0.970588,1.0,0.985075,33.0
2,tRNA,1.0,0.9,0.947368,10.0
3,accuracy,0.980392,0.980392,0.980392,0.980392
4,macro avg,0.990196,0.966667,0.977481,51.0
5,weighted avg,0.980969,0.980392,0.980022,51.0


We can also check the predictions made for individual samples.

In [18]:
df_test_predictions = pd.read_csv("experiment_1/test_predictions.csv", header=None).sample(5)
df_test_predictions

Unnamed: 0,0,1
42,U00096.3/225500-225572,tRNA
16,U00096.3/4325981-4325905,sRNA
45,U00096.3/3982375-3982448,tRNA
22,U00096.3/4094503-4094573,sRNA
8,U00096.3/4027375-4027452,sRNA


Next we have the following confusion matrix for the test dataset.

In [21]:
df_test_confusion_matrix = pd.read_csv("experiment_1/test_confusion_matrix.csv", header=None)
df_test_confusion_matrix

Unnamed: 0,0,1,2,3,4
0,REAL,rRNA,sRNA,tRNA,All
1,rRNA,8,0,0,8
2,sRNA,0,33,0,33
3,tRNA,0,1,9,10
4,All,8,34,9,51
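The confusion matrix and the per-class metrics reported for the test set describe the same predictions, so one can be checked against the other. The sketch below recomputes accuracy, precision, recall, and F1 from the matrix in plain Python; the function name is hypothetical, and it assumes that rows are the real classes and columns the predicted classes.

```python
from typing import Dict, List

labels = ["rRNA", "sRNA", "tRNA"]
# Rows are the real classes, columns the predicted classes
# (values copied from the test confusion matrix, without the "All" margins).
matrix = [
    [8, 0, 0],
    [0, 33, 0],
    [0, 1, 9],
]


def per_class_metrics(
    labels: List[str], matrix: List[List[int]]
) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall and F1 for each class from a confusion matrix."""
    report = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total
report = per_class_metrics(labels, matrix)

print(f"accuracy = {accuracy:.6f}")
for label, values in report.items():
    print(label, {name: round(value, 6) for name, value in values.items()})
```

The single misclassified tRNA sequence, predicted as sRNA, explains both the tRNA recall of 0.9 and the sRNA precision of 33/34.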


Finally, we can also check the feature importance for the dataset. For example, let's look at the ten most important features.

In [24]:
df_feature_importance = pd.read_csv("experiment_1/feature_importance.csv", header=None).head(10)
df_feature_importance

Unnamed: 0,0
0,1. Feature (GGT): (0.079014)
1,2. Feature (GGG): (0.065344)
2,3. Feature (T_GCG): (0.046055)
3,4. Feature (A_TCC): (0.030537)
4,5. Feature (G_TTC): (0.029295)
5,6. Feature (C_ATG): (0.027328)
6,7. Feature (G_ACT): (0.026743)
7,8. Feature (T_CCA): (0.024867)
8,9. Feature (TAG): (0.023178)
9,10. Feature (A_GCC): (0.023124)
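Each row of the feature importance file is a single string, which is awkward to sort or plot. A small parser can split it into rank, feature name, and score. This sketch assumes every line follows the `N. Feature (NAME): (SCORE)` layout shown above; the function name `parse_importance` is hypothetical.

```python
import re
from typing import List, Tuple

# Matches lines such as "1. Feature (GGT): (0.079014)".
PATTERN = re.compile(r"^(\d+)\. Feature \((.+?)\): \(([-+0-9.eE]+)\)$")


def parse_importance(lines: List[str]) -> List[Tuple[int, str, float]]:
    """Turn importance strings into (rank, feature name, score) tuples."""
    parsed = []
    for line in lines:
        match = PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the expected layout
        rank, name, score = match.groups()
        parsed.append((int(rank), name, float(score)))
    return parsed


rows = [
    "1. Feature (GGT): (0.079014)",
    "2. Feature (GGG): (0.065344)",
    "3. Feature (T_GCG): (0.046055)",
]
for rank, name, score in parse_importance(rows):
    print(rank, name, score)
```

With the notebook's DataFrame, the same function could be applied to the single column of strings, for example `parse_importance(df_feature_importance[0].tolist())`.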


## 2.2 Protein classification

Let’s use `BioAutoML-feature-protein.py` for protein classification.

BioAutoML handles protein sequences represented with the standard 20-letter IUPAC amino acid notation, automatically filtering out sequences that contain ambiguous amino acids.

The script accepts the same command-line arguments as the DNA/RNA classification script.

In this example, we’ll run it on a dataset of [non-classically secreted proteins](https://academic.oup.com/bioinformatics/article/36/3/704/5545087) from Zhang et al.


In [12]:
!uv run BioAutoML-feature-protein.py \
--fasta_train MathFeature/Case\ Studies/CS-I/train_P.fasta MathFeature/Case\ Studies/CS-I/train_N.fasta \
--fasta_label_train positive negative \
--fasta_test MathFeature/Case\ Studies/CS-I/test_P.fasta MathFeature/Case\ Studies/CS-I/test_N.fasta \
--fasta_label_test positive negative \
--output experiment_2 --n_cpu -1 --estimations 1



###################################################################################
###################################################################################
##########         BioAutoML- Automated Feature Engineering             ###########
##########              Author: Robson Parmezan Bonidia                 ###########
##########         WebPage: https://bonidia.github.io/website/          ###########
###################################################################################
###################################################################################


Train - MathFeature/Case Studies/CS-I/train_P.fasta: Found File
Train - MathFeature/Case Studies/CS-I/train_N.fasta: Found File
Test - MathFeature/Case Studies/CS-I/test_P.fasta: Found File
Test - MathFeature/Case Studies/CS-I/test_N.fasta: Found File
Error: experiment_2/feat_extraction - No such file or directory.
Creating Directory...
Extracting features with MathFeature...
Automated Feature Engineering

We can check the best features selected for training.

In [25]:
df_best_feature_train = pd.read_csv("experiment_2/best_feature_train.csv").sample(5)
df_best_feature_train

Unnamed: 0,AA,AE,AV,GV
34,0.4975,0.435258,0.151608,1.77079
99,1.466538,0.50953,-1.080399,0.901097
351,-0.938113,-1.149214,-0.20093,0.946971
572,-0.938113,-1.149214,-0.406959,1.252372
241,1.048448,0.808418,-0.268581,0.551747


Average and standard deviation for metrics using 10-fold cross-validation.

In [26]:
df_training_kfold = pd.read_csv("experiment_2/training_kfold(10)_metrics.csv")
df_training_kfold

Unnamed: 0,ACC,std_ACC,MCC,std_MCC,F1,std_F1,balanced_ACC,std_balanced_ACC,kappa,std_kappa,gmean,std_gmean
0,0.818,0.04,0.482,0.12,0.5905,0.09,0.7248,0.05,0.4752,0.12,0.6998,0.06


Confusion matrix for training dataset.

In [27]:
df_training_confusion_matrix = pd.read_csv("experiment_2/training_confusion_matrix.csv", header=None)
df_training_confusion_matrix

Unnamed: 0,0,1,2,3
0,REAL,negative,positive,All
1,negative,403,43,446
2,positive,64,77,141
3,All,467,120,587


Test dataset used for prediction using selected descriptors and features.

In [28]:
df_best_feature_test = pd.read_csv("experiment_2/best_feature_test.csv").sample(5)
df_best_feature_test

Unnamed: 0,AA,AE,AV,GV
53,-0.938113,2.323296,-1.080399,-1.108193
7,1.045835,0.219326,2.821334,1.378442
52,0.450139,0.12762,0.905207,-0.334858
21,-0.125273,-0.775412,-1.080399,0.2502
56,-0.448438,-0.473653,-1.080399,0.528465


Now let's look at the binary classification metrics for the test dataset.

In [29]:
df_metrics_test = pd.read_csv("experiment_2/metrics_test.csv")
df_metrics_test

Unnamed: 0,Metrics: Test Set
0,Accuracy: 0.6176470588235294
1,Recall: 0.8529411764705882
2,Precision: 0.58
3,F1: 0.6904761904761905
4,AUC: 0.7006920415224913
5,balanced ACC: 0.6176470588235294
6,gmean: 0.5710731717337529
7,MCC: 0.26666666666666666


We can inspect the predictions made for the test dataset.

In [30]:
df_test_predictions = pd.read_csv("experiment_2/test_predictions.csv", header=None).sample(5)
df_test_predictions

Unnamed: 0,0,1
59,sp|Q65K78|MOBA_BACLD,positive
17,sp|P80868|EFG_BACSU,negative
36,sp|Q8DPJ8|RNC_STRR6,negative
65,sp|P45947|ARSC_BACSU,negative
37,sp|Q88VH0|APT_LACPL,negative


Confusion matrix for test dataset.

In [31]:
df_test_confusion_matrix = pd.read_csv("experiment_2/test_confusion_matrix.csv", header=None)
df_test_confusion_matrix

Unnamed: 0,0,1,2,3
0,REAL,negative,positive,All
1,negative,29,5,34
2,positive,21,13,34
3,All,50,18,68
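The test metrics reported earlier for this experiment can be reproduced from this confusion matrix, which also shows which class they refer to. In the sketch below (plain Python, hypothetical function name), the reported Recall and Precision are recovered only when `negative` is taken as the reference class. This is an observation from the numbers, not documented BioAutoML behavior.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Standard binary metrics from the four confusion-matrix counts.

    Assumes every row and column total is non-zero.
    """
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "recall": recall,
        "precision": precision,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "balanced_acc": (recall + specificity) / 2,
        "gmean": math.sqrt(recall * specificity),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Counts read from the matrix with "negative" as the reference class:
# real negative -> 29 predicted negative, 5 predicted positive
# real positive -> 21 predicted negative, 13 predicted positive
metrics = binary_metrics(tp=29, fn=5, fp=21, tn=13)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Read this way, the classifier recovers most `negative` proteins (29 of 34) but only 13 of the 34 `positive` ones, which is why the balanced accuracy and MCC are modest.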


Feature importance for our dataset.

In [32]:
df_feature_importance = pd.read_csv("experiment_2/feature_importance.csv", header=None).head(10)
df_feature_importance

Unnamed: 0,0
0,1. Feature (AA): (27.673667)
1,2. Feature (AV): (25.485922)
2,3. Feature (AE): (24.590570)
3,4. Feature (GV): (22.249840)


## 2.3 Structured data classification

BioAutoML can also be applied to structured data, without any biological sequences, by calling the binary and multiclass classification scripts directly.

As an example, let us use `BioAutoML-binary.py` for binary classification. `BioAutoML-multiclass.py` is very similar, except that it does not take feature selection as an argument.

The structured dataset requires separate CSV files for features and labels, where the labels file must include a column named `label`, and this applies to both training and test sets. For test set evaluation, an additional CSV file is required containing sample names in a column titled `nameseq`.
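The file layout described above can be illustrated with a toy example written with Python's `csv` module. The folder name, file names, feature names, and values are all made up; only the column names `label` and `nameseq` come from the description. Whether BioAutoML expects further columns (such as an index) is not covered here, so compare with the files in `example_csv/AntiCancer` before preparing real data.

```python
import csv
from pathlib import Path
from typing import List, Sequence


def write_csv(path: Path, header: Sequence[str], rows: List[Sequence]) -> None:
    """Write a header line followed by the data rows."""
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


out_dir = Path("toy_structured_data")  # hypothetical folder name
out_dir.mkdir(exist_ok=True)

# Features: one row per sample, one column per descriptor (made-up values).
write_csv(out_dir / "test.csv", ["feat_1", "feat_2"], [[0.12, 3.4], [0.98, 1.1]])
# Labels: a single column that must be called "label".
write_csv(out_dir / "test_labels.csv", ["label"], [[1], [0]])
# Sample names: a single column that must be called "nameseq".
write_csv(out_dir / "test_names.csv", ["nameseq"], [["sample_a"], ["sample_b"]])

print(sorted(p.name for p in out_dir.glob("*.csv")))
```

The three files must list the samples in the same order, since rows are matched by position rather than by a shared key in this layout.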

We have the following arguments that can be used to run the script:

`--train`: Path to the CSV file consisting of the training dataset;

`--train_label`: Path to the CSV file consisting of the labels for the training dataset;

`--test`: Path to the CSV file consisting of the test dataset;

`--test_label`: Path to the CSV file consisting of the labels for the test dataset;

`--test_nameseq`: Path to the CSV file consisting of the names for the samples used in the test dataset;

`--normalization`: Whether to normalize the features (default = False);

`--featureselection`: Whether to apply feature selection (default = True);

`--n_cpu`: Number of CPU cores to use (`-1` will use all cores available);

`--classifier`: Classifier to use (0: CatBoost, 1: Random Forest, 2: LightGBM, 3: XGBoost);

`--estimations`: Number of trials using Bayesian optimization to find the best combination of descriptors and classifiers for classification. Note: in this script, the original BioAutoML uses a fixed number of trials;

`--imbalance`: Whether to handle class imbalance in the dataset (default = False);

`--tuning`: Whether to tune the classifier (default = False);

`--output`: Folder to create and save results.

We will use the AntiCancer dataset as our example of structured data.

In [9]:
!uv run BioAutoML-binary.py \
--train example_csv/AntiCancer/train.csv \
--train_label example_csv/AntiCancer/train_labels.csv \
--test example_csv/AntiCancer/test.csv \
--test_label example_csv/AntiCancer/test_labels.csv \
--test_nameseq example_csv/AntiCancer/test_names.csv \
--normalization True \
--featureselection 1 \
--classifier 1 \
--tuning True \
--output experiment_3 --n_cpu -1 --estimations 5



###################################################################################
###################################################################################
#####################        BioAutoML - Binary             #######################
##########              Author: Robson Parmezan Bonidia                 ###########
##########         WebPage: https://bonidia.github.io/website/          ###########
###################################################################################
###################################################################################


Train - example_csv/AntiCancer/train.csv: Found File
Train_labels - example_csv/AntiCancer/train_labels.csv: Found File
Test - example_csv/AntiCancer/test.csv: Found File
Test_labels - example_csv/AntiCancer/test_labels.csv: Found File
Test_nameseq - example_csv/AntiCancer/test_names.csv: Found File
Number of samples (train): 240
Number of samples (test): 104
Number of features (train): 32
Number of featu

The training dataset with selected features.

In [10]:
df_best_feature_train = pd.read_csv("experiment_3/best_feature_train.csv").sample(5)
df_best_feature_train

Unnamed: 0,AB.1,AD.1,ASSD.1,AS.1,APL.1,TALU.1,NE.1,MOT3.1,MOT4.1,A,...,L,M,N,P,Q,R,S,T,W,Y
69,-0.070511,-0.112009,-1.472136,0.933721,-0.116351,3.742713,-0.906931,-0.258154,-0.10964,-0.155858,...,-0.222618,-0.078331,-0.092564,-0.065427,-0.06245,-0.11284,-0.076544,-0.913765,-0.076728,-0.578649
75,-0.071851,-0.113304,-0.770558,-0.610568,-0.117061,-0.399544,-1.630867,-1.275034,-0.811338,-0.164907,...,-0.222618,-0.078331,-0.085988,-0.050145,-0.058464,-0.110528,-0.09726,-0.913765,-0.076728,2.274133
203,-0.064074,-0.112548,-0.426219,-0.419043,-0.11282,-0.399544,0.299629,0.177651,0.007309,-0.159209,...,-0.218235,-0.064786,-0.092564,-0.072736,-0.051353,-0.114409,-0.093999,-0.913765,-0.053717,0.583595
200,-0.066801,-0.112993,-0.529521,-0.77634,-0.111925,-0.399544,-0.786275,-0.766594,-0.606676,-0.162058,...,-0.218235,-0.071559,-0.092564,-0.072736,-0.069758,-0.105071,-0.09726,0.211279,-0.042212,-0.578649
204,-0.062177,-0.112548,-0.450814,-0.350154,-0.109171,-0.399544,-0.062339,-0.330788,-0.402014,-0.164907,...,4.708229,-0.068172,-0.086536,-0.067559,-0.059406,-0.114076,-0.089923,-0.069982,-0.050841,-0.578649


Average and standard deviation of metrics using 10-fold cross-validation.

In [11]:
df_training_kfold = pd.read_csv("experiment_3/training_kfold(10)_metrics.csv")
df_training_kfold

Unnamed: 0,ACC,std_ACC,MCC,std_MCC,F1,std_F1,balanced_ACC,std_balanced_ACC,kappa,std_kappa,gmean,std_gmean
0,0.8708,0.07,0.7158,0.18,0.7879,0.14,0.8357,0.1,0.6991,0.18,0.8204,0.11


Confusion matrix for training dataset.

In [12]:
df_training_confusion_matrix = pd.read_csv("experiment_3/training_confusion_matrix.csv", header=None)
df_training_confusion_matrix

Unnamed: 0,0,1,2,3
0,REAL,0,1,All
1,0,149,5,154
2,1,22,64,86
3,All,171,69,240


Selected features for test dataset.

In [13]:
df_best_feature_test = pd.read_csv("experiment_3/best_feature_test.csv").sample(5)
df_best_feature_test

Unnamed: 0,AB.1,AD.1,ASSD.1,AS.1,APL.1,TALU.1,NE.1,MOT3.1,MOT4.1,A,...,L,M,N,P,Q,R,S,T,W,Y
97,11.491439,-0.112548,-0.433704,-0.397521,-0.110657,-0.399544,0.178973,0.105017,-0.021928,-0.162935,...,-0.213515,-0.073642,-0.089782,-0.072736,-0.06498,-0.117077,-0.093874,0.643989,-0.076728,0.628297
66,-0.068385,-0.10992,1.809837,1.160465,-0.116684,0.019541,1.023565,1.267165,1.439942,-0.160245,...,-0.208274,-0.074637,-0.092564,-0.072736,-0.065994,-0.103979,-0.094592,-0.913765,-0.057901,-0.578649
89,-0.067703,-0.113021,-0.549197,-0.758169,-0.112658,-0.399544,-0.906931,-0.839228,-0.635913,-0.161891,...,-0.220298,-0.078331,-0.088309,-0.072736,-0.069758,-0.108602,-0.09726,-0.913765,-0.040182,-0.578649
90,-0.06663,-0.111793,1.492747,0.730698,-0.113442,-0.399544,-0.062339,-0.18552,-0.197352,-0.16277,...,-0.220975,-0.078331,-0.092564,-0.04685,-0.069758,-0.102069,-0.089923,0.773802,-0.050841,0.728876
56,-0.067543,-0.111659,1.882024,0.404497,-0.113058,-0.399544,-0.424307,-0.476057,-0.431252,-0.160023,...,-0.22074,-0.078331,-0.085675,-0.066819,-0.052011,-0.117077,-0.088875,0.050559,-0.076728,-0.578649


Metrics regarding the test dataset.

In [14]:
df_metrics_test = pd.read_csv("experiment_3/metrics_test.csv")
df_metrics_test

Unnamed: 0,Metrics: Test Set
0,Accuracy: 0.8557692307692307
1,Recall: 0.9807692307692307
2,Precision: 0.7846153846153846
3,F1: 0.8717948717948717
4,AUC: 0.9375
5,balanced ACC: 0.8557692307692307
6,gmean: 0.8465907962713515
7,MCC: 0.7348737631265355


Predictions for the test dataset.

In [15]:
df_test_predictions = pd.read_csv("experiment_3/test_predictions.csv", header=None).sample(5)
df_test_predictions

Unnamed: 0,0,1
96,non-ACP_188,0
75,non-ACP_33,0
0,ACP_1,1
45,ACP_89,1
7,ACP_8,1


Confusion matrix for the test dataset.

In [16]:
df_test_confusion_matrix = pd.read_csv("experiment_3/test_confusion_matrix.csv", header=None)
df_test_confusion_matrix

Unnamed: 0,0,1,2,3
0,REAL,0,1,All
1,0,51,1,52
2,1,14,38,52
3,All,65,39,104


The ten most important features for the dataset.

In [17]:
df_feature_importance = pd.read_csv("experiment_3/feature_importance.csv", header=None).head(10)
df_feature_importance

Unnamed: 0,0
0,1. Feature (K): (0.123231)
1,2. Feature (C): (0.092589)
2,3. Feature (R): (0.079057)
3,4. Feature (P): (0.069036)
4,5. Feature (Q): (0.062090)
5,6. Feature (F): (0.049899)
6,7. Feature (AB.1): (0.047528)
7,8. Feature (I): (0.044462)
8,9. Feature (APL.1): (0.040998)
9,10. Feature (A): (0.040895)
