<a href="https://colab.research.google.com/github/brenoslivio/BAGECO2025/blob/main/02_BioAutoML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BAGECO 2025 - Metagenomics Workshop

**From Sequences to Predictions – End-to-end analysis using BioAutoML**

*This class was developed for the 17th Symposium on Bacterial Genetics and Ecology (BAGECO 2025) and will be presented at the Metagenomics Workshop on 30.06.2025.*

*For any questions regarding this material, please contact:*

*Breno L. S. de Almeida: breno-livio.silva-de-almeida@ufz.de*

---

This Jupyter Notebook demonstrates how to use [BioAutoML](https://academic.oup.com/bib/article/23/4/bbac218/6618238) without requiring any local installation.

The demonstration includes three use cases:

1. Multi-class prediction of DNA/RNA sequences (more than two classes)

2. Binary classification of protein sequences (two classes)

3. Prediction using structured data

---

## BioAutoML: Automated Feature Engineering and Metalearning for Classification of Biological Sequences

BioAutoML is an automated machine learning (AutoML) tool that simplifies the analysis of biological sequences, such as noncoding RNAs (ncRNAs), by handling feature extraction, algorithm selection, and model tuning. Traditionally, these tasks require deep expertise and manual effort, but BioAutoML automates the process, transforming raw sequences into structured data, selecting the best features, and optimizing predictive models. This enables faster, more accurate classification of ncRNAs, helping researchers uncover biological insights without extensive machine learning knowledge.

It is also a user-friendly package for binary and multi-class classification, automating feature engineering and metalearning, as demonstrated by the next figure. It requires only biological sequence data (FASTA files) for end-to-end ML experiments, from feature extraction to predictive model generation. Alternatively, its modules can be used independently—either to generate optimal numerical representations for other ML tools or to build models from externally extracted features. BioAutoML consists of two main components:

- Automated Feature Engineering (extraction and selection)

- Metalearning (algorithm recommendation and hyperparameter tuning).

![modules](https://github.com/Bonidia/BioAutoML/raw/main/img/bio-v2-1.png)

The first module uses the MathFeature package to extract feature descriptors, both Mathematical (Fourier, Shannon, Tsallis) and Conventional (NAC, DNC, TNC, ORF, kGap, Fickett score, etc.). In total, more than 15 techniques are available to represent biological sequence information numerically.

The second module automates feature engineering, using Bayesian optimization to select the best feature vector and ML algorithm (including ensembles), as shown by the next figure. This approach efficiently handles the NP-hard problem of optimizing numerous feature descriptor combinations.

![optimization](https://github.com/Bonidia/BioAutoML/raw/main/img/bio-v4-1.png)

This module takes as input:

- All feature descriptors from the first module;

- An objective function (balanced accuracy for binary problems, weighted F1-score for multi-class);

- ML algorithms (CatBoost, AdaBoost, Random Forest, LightGBM) for wrapper-based feature selection. These classifiers were chosen for their strong predictive performance, model interpretability, and widespread use in bioinformatics.

The search space is represented by a partially binary vector (e.g., [1, 0, 1, 0, 0, 1, [2]]), where the last position encodes one of four ML algorithms (0–3) and the others indicate selected (1) or excluded (0) features. BioAutoML uses Bayesian optimization to efficiently find a quasi-optimal feature-algorithm pair, balancing performance and speed. The process evaluates combinations until performance plateaus or after 50 trials (user-adjustable). The output may recommend a single algorithm or an ensemble.
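
To make the encoding above concrete, here is a minimal sketch of how such a candidate vector could be decoded. The descriptor names and the index-to-algorithm order are assumptions chosen for illustration; the actual mapping inside BioAutoML may differ.

```python
# Hypothetical descriptor names and algorithm order, for illustration only.
DESCRIPTORS = ["NAC", "DNC", "TNC", "kGap", "ORF", "Fickett"]
ALGORITHMS = ["CatBoost", "AdaBoost", "Random Forest", "LightGBM"]


def decode_candidate(vector):
    """Split a candidate into (selected descriptors, algorithm name)."""
    *mask, algorithm_slot = vector
    # The text writes the last slot as [2]; accept a bare integer as well.
    if isinstance(algorithm_slot, list):
        algorithm_index = algorithm_slot[0]
    else:
        algorithm_index = algorithm_slot
    if len(mask) != len(DESCRIPTORS):
        raise ValueError("mask length must match the number of descriptors")
    # A 1 keeps the descriptor, a 0 excludes it.
    selected = [name for name, bit in zip(DESCRIPTORS, mask) if bit == 1]
    return selected, ALGORITHMS[algorithm_index]


selected, algorithm = decode_candidate([1, 0, 1, 0, 0, 1, [2]])
print(selected, algorithm)
```

Each trial of the optimizer would evaluate one such vector, so the search covers feature subsets and classifiers at the same time.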

## 1. Installing BioAutoML

First, let's clone the BioAutoML repository from GitHub using Git. Then, we'll navigate into the BioAutoML directory.

In [None]:
!git clone https://github.com/Bonidia/BioAutoML.git BioAutoML
%cd /content/BioAutoML

Cloning into 'BioAutoML'...
remote: Enumerating objects: 1081, done.
remote: Counting objects: 100% (357/357), done.
remote: Compressing objects: 100% (206/206), done.
remote: Total 1081 (delta 195), reused 256 (delta 148), pack-reused 724 (from 1)
Receiving objects: 100% (1081/1081), 82.17 MiB | 13.98 MiB/s, done.
Resolving deltas: 100% (470/470), done.
Updating files: 100% (416/416), done.
/content/BioAutoML


We’ll use Git to switch to the workshop-specific branch. Next, we'll download the required code to run MathFeature within BioAutoML.

In [None]:
!git checkout bageco
!git submodule init
!git submodule update

Branch 'bageco' set up to track remote branch 'bageco' from 'origin'.
Switched to a new branch 'bageco'
Submodule 'MathFeature' (https://github.com/Bonidia/MathFeature.git) registered for path 'MathFeature'
Submodule 'MathFeature-WebServer' (https://github.com/Bonidia/MathFeature-WebServer.git) registered for path 'MathFeature-WebServer'
Cloning into '/content/BioAutoML/MathFeature'...
Cloning into '/content/BioAutoML/MathFeature-WebServer'...
Submodule path 'MathFeature': checked out '69d2a3206c6db02be9d46db74ad61fc05cc6aee2'
Submodule path 'MathFeature-WebServer': checked out '5d13e073b30d0f9b333adb411bc843722c09d0ba'


Let's now retrieve all the necessary packages to run BioAutoML. For our demonstration we will use [uv](https://astral.sh/blog/uv), a very fast package manager.

If you prefer to install BioAutoML locally, you can use Miniconda on your machine. Follow the same steps as before, except for `!git checkout bageco` (and without the leading `!`, which is only needed inside a notebook), then run the following:

```bash
conda env create -f BioAutoML-env.yml -n bioautoml
```

This installs the packages BioAutoML needs in an isolated environment, which you can activate with:

```bash
conda activate bioautoml
```

And you can exit the environment with:

```bash
conda deactivate
```

Let's continue with uv. If you would rather add the BioAutoML dependencies to a uv project yourself, you can use:

```bash
uv add scikit-learn==1.1.0 pandas catboost lightgbm matplotlib-inline hyperopt setuptools biopython xgboost imbalanced-learn igraph tpot
```

Since the BioAutoML folder already contains the list of required packages, we only need the `sync` command to download and install them.

In [None]:
!uv sync

[2K[1AUsing CPython [36m3.8.20[39m
Creating virtual environment at: [36m.venv[39m
[2mResolved [1m80 packages[0m [2min 5ms[0m[0m
[2K[37m⠙[0m [2mPreparing packages...[0m (0/49)
[2K[1A[37m⠙[0m [2mPreparing packages...[0m (0/49)
[2mcertifi   [0m [32m[2m------------------------------[0m[0m     0 B/153.96 KiB
[2K[2A[37m⠙[0m [2mPreparing packages...[0m (0/49)
[2murllib3   [0m [32m[2m------------------------------[0m[0m     0 B/123.38 KiB
[2mcertifi   [0m [32m[2m------------------------------[0m[0m     0 B/153.96 KiB
[2K[3A[37m⠙[0m [2mPreparing packages...[0m (0/49)
[2mcloudpickle[0m [32m[2m------------------------------[0m[0m     0 B/20.50 KiB
[2murllib3   [0m [32m[2m------------------------------[0m[0m     0 B/123.38 KiB
[2mcertifi   [0m [32m[2m------------------------------[0m[0m     0 B/153.96 KiB
[2K[4A[37m⠙[0m [2mPreparing packages...[0m (0/49)
[2mthreadpoolctl[0m [32m[2m------------------------------[0m[0

## 2. Running BioAutoML

BioAutoML provides multiple scripts for automated classification tasks, including:

- `BioAutoML-feature.py`: Performs automated classification for DNA/RNA sequences. It extracts descriptors (e.g., using MathFeature), selects the most relevant features, and proceeds with binary or multiclass classification.

- `BioAutoML-feature-protein.py`: Executes automated classification for protein sequences, following the same workflow as the DNA/RNA script—feature extraction, selection, and classification.

- `BioAutoML-binary.py`: Handles binary classification tasks. This script can be called by the DNA/RNA or protein scripts or run independently using structured input data.

- `BioAutoML-multiclass.py`: Manages multiclass classification. Like the binary script, it can be invoked by other BioAutoML scripts or used standalone with structured data.


## 2.1 DNA/RNA classification

Let us use `BioAutoML-feature.py` for DNA/RNA classification.

BioAutoML only handles DNA/RNA sequences written in the 4-letter IUPAC nucleotide notation (A, G, C, and T) and discards sequences that contain any other symbol. Uracil (U) in RNA is treated as the counterpart of thymine (T) in DNA.
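
The filtering rule described above can be sketched in a few lines of plain Python. This is not BioAutoML's own code: the function names are hypothetical, and uppercasing the input before the check is an assumption made for this example.

```python
VALID_NUCLEOTIDES = set("ACGT")


def normalize_sequence(sequence):
    """Uppercase the sequence and map RNA uracil (U) to thymine (T)."""
    # Uppercasing is an assumption of this sketch, not a documented behavior.
    return sequence.upper().replace("U", "T")


def keep_unambiguous(records):
    """Keep only records whose normalized sequence uses A, C, G, and T."""
    kept = {}
    for name, sequence in records.items():
        normalized = normalize_sequence(sequence)
        if normalized and set(normalized) <= VALID_NUCLEOTIDES:
            kept[name] = normalized
    return kept


# Toy records: seq2 contains the ambiguity code N and is dropped.
records = {"seq1": "AUGGCUUAA", "seq2": "ACGTNNACGT", "seq3": "acgtacgt"}
filtered = keep_unambiguous(records)
print(filtered)
```

Running a check like this on your own FASTA files before training shows how many sequences would be discarded.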

You can run the script locally inside the conda environment with `python BioAutoML-feature.py` followed by the required arguments. Since we are using the uv package manager, we will use `uv run` to run the script in Google Colab.

We have the following arguments that can be used to run the script:

`--fasta_train`: List of paths containing FASTA files used for training;

`--fasta_label_train`: List of labels for the FASTA files used for training;

`--fasta_test`: List of paths containing FASTA files used for testing (optional);

`--fasta_label_test`: List of labels for the FASTA files used for testing (optional);

`--output`: Folder to create and save results;

`--n_cpu`: Number of CPU cores to use (`-1` will use all cores available);

`--estimations`: Number of trials using Bayesian optimization to find the best combination of descriptors and classifiers for classification.
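
Because files and labels are passed as two parallel lists, their order has to match. The sketch below assembles the argument list from a label-to-path mapping so the two lists cannot drift apart. `build_command` is a hypothetical helper, not part of BioAutoML, and it only uses the flags listed above.

```python
import shlex


def build_command(train, test=None, output="experiment_1", n_cpu=-1, estimations=50):
    """Build the uv command; train/test map each class label to a FASTA path."""
    command = ["uv", "run", "BioAutoML-feature.py",
               "--fasta_train", *train.values(),
               "--fasta_label_train", *train.keys()]
    if test:
        # The test set is optional, so its flags are added only when given.
        command += ["--fasta_test", *test.values(),
                    "--fasta_label_test", *test.keys()]
    command += ["--output", output,
                "--n_cpu", str(n_cpu),
                "--estimations", str(estimations)]
    return command


classes = ["rRNA", "sRNA", "tRNA"]
train = {name: f"exemplo_fasta/train/{name}.fasta" for name in classes}
test = {name: f"exemplo_fasta/test/{name}.fasta" for name in classes}
command = build_command(train, test, estimations=5)
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)` when working outside a notebook.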

We will run the script using an example from the BioAutoML paper, which involves classifying three classes of RNA: rRNA, sRNA, and tRNA. We will use just five trials for a quick example, but it is recommended to choose at least 50 trials to generate a more accurate model for your problem.


In [None]:
!uv run BioAutoML-feature.py \
--fasta_train exemplo_fasta/train/rRNA.fasta exemplo_fasta/train/sRNA.fasta exemplo_fasta/train/tRNA.fasta \
--fasta_label_train rRNA sRNA tRNA \
--fasta_test exemplo_fasta/test/rRNA.fasta exemplo_fasta/test/sRNA.fasta exemplo_fasta/test/tRNA.fasta \
--fasta_label_test rRNA sRNA tRNA \
--output experiment_1 --n_cpu -1 --estimations 5

Streaming output truncated to the last 5000 lines.
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000554 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2273
[LightGBM] [Info] Number of data points in the train set: 185, number of used features: 110
[LightGBM] [Info] Start training from score -1.853060
[LightGBM] [Info] Start training from score -0.432864
[LightGBM] [Info] Start training from score -1.636837
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000165 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2433
[LightGBM] [Info] Number of data points in the train set: 205, number of used features: 110
[LightGBM] [Info] Start training from score -1.857274
[LightGBM] [Info] Start training from score -0.432661
[LightGBM] [Info] Start t

The script writes several result files to the output folder it creates.

Let's use the pandas library to inspect the generated results. For example, we can check the best features selected for training (and therefore also applied to the test set).

In [None]:
import pandas as pd

df_best_feature_train = pd.read_csv("experiment_1/best_feature_train.csv").sample(5)
df_best_feature_train

Unnamed: 0,A,C,G,T,A_AA,A_AC,A_CA,A_CC,A_GA,A_GC,...,fminimum,fnone_levated_peak,fsample_standard_deviation,fpercentile15,fpercentile25,fpercentile75,finterquartile_range,fcoefficient_of_variation,fskewness,fkurtosis
109,-1.345878,0.79988,0.822194,-0.048919,-0.441973,-0.464853,-0.402537,-0.306832,-0.473865,-0.257023,...,-0.409937,-0.487202,-0.425184,-0.4156,-0.38907,-0.418632,-0.425868,-1.072879,0.35849,-0.173172
23,0.647306,-0.828351,0.551648,-0.503856,3.192843,3.084594,3.299759,3.495221,2.981792,3.382541,...,-0.840442,2.943152,3.261952,3.082188,3.128148,3.257152,3.288479,0.249884,0.242949,-0.061293
164,0.516396,-0.681095,1.13673,-1.062747,-0.441973,-0.31053,-0.402537,-0.373535,-0.243488,-0.480996,...,-0.790384,-0.310417,-0.355742,-0.376844,-0.363672,-0.370348,-0.371934,0.584781,0.094706,-0.508955
22,0.647306,-0.828351,0.551648,-0.503856,3.192843,3.084594,3.299759,3.561923,2.981792,3.382541,...,1.007191,2.941471,3.256341,3.212329,3.201782,3.213481,3.215813,0.235656,0.150882,-0.118698
79,2.324133,-2.594021,-1.021115,0.777163,0.089951,-0.387691,-0.320264,-0.440238,-0.32028,-0.425003,...,0.23237,-0.244568,-0.34051,-0.287667,-0.301333,-0.343359,-0.353689,-0.414428,-0.178851,-0.778003


BioAutoML employs 10-fold cross-validation on the training dataset to evaluate and recommend the most suitable descriptors and classifier for a given dataset. This approach ensures robust model selection by assessing average performance metrics along with their standard deviations.

In [None]:
df_training_kfold = pd.read_csv("experiment_1/training_kfold(10)_metrics.csv")
df_training_kfold

Unnamed: 0,ACC,std_ACC,MCC,std_MCC,F1_micro,std_F1_micro,F1_macro,std_F1_macro,F1_w,std_F1_w,kappa,std_kappa
0,0.995,0.02,0.9904,0.03,0.995,0.01,0.994,0.02,0.9947,0.02,0.9898,0.03
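
The idea behind the averaged scores and standard deviations above can be sketched as follows. This is a simplified stand-in, not BioAutoML's implementation: it uses plain (non-stratified) folds, accuracy as the only metric, and a toy majority-class predictor. The class counts are taken from the training confusion matrix of this experiment.

```python
import random
import statistics


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(labels, predict, k=10):
    """Return mean and standard deviation of accuracy over k held-out folds."""
    scores = []
    for fold in k_fold_indices(len(labels), k):
        held_out = set(fold)
        train_labels = [labels[i] for i in range(len(labels)) if i not in held_out]
        predictions = predict(train_labels, len(fold))
        correct = sum(p == labels[i] for p, i in zip(predictions, fold))
        scores.append(correct / len(fold))
    return statistics.mean(scores), statistics.stdev(scores)


def majority_class(train_labels, n_predictions):
    """Toy predictor: always answer with the most frequent training label."""
    most_common = max(set(train_labels), key=train_labels.count)
    return [most_common] * n_predictions


labels = ["rRNA"] * 32 + ["sRNA"] * 133 + ["tRNA"] * 40
mean_acc, std_acc = cross_validate(labels, majority_class)
print(f"ACC = {mean_acc:.4f} (std {std_acc:.2f})")
```

A baseline like this also gives a reference point: any useful model should score clearly above the majority-class accuracy.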


We have the following confusion matrix for the training dataset.

In [None]:
df_training_confusion_matrix = pd.read_csv("experiment_1/training_confusion_matrix.csv", header=None)
df_training_confusion_matrix

Unnamed: 0,0,1,2,3,4
0,REAL,rRNA,sRNA,tRNA,All
1,rRNA,32,0,0,32
2,sRNA,0,133,0,133
3,tRNA,0,1,39,40
4,All,32,134,39,205


A test dataset is optional for model evaluation in BioAutoML. If provided, it serves as an independent validation set, ensuring unbiased performance assessment. BioAutoML automatically applies the previously selected features to the test dataset for consistency.

In [None]:
df_best_feature_test = pd.read_csv("experiment_1/best_feature_test.csv").sample(5)
df_best_feature_test

Unnamed: 0,A,C,G,T,A_AA,A_AC,A_CA,A_CC,A_GA,A_GC,...,fminimum,fnone_levated_peak,fsample_standard_deviation,fpercentile15,fpercentile25,fpercentile75,finterquartile_range,fcoefficient_of_variation,fskewness,fkurtosis
47,1.229234,-1.181425,-1.271011,0.953439,-0.353319,-0.464853,-0.402537,-0.440238,-0.473865,-0.425003,...,0.040502,-0.470425,-0.41835,-0.402672,-0.368749,-0.413217,-0.42414,-0.837449,-0.69113,-1.174629
21,2.864131,-1.994169,-1.399818,0.023458,-0.353319,-0.464853,-0.402537,-0.440238,-0.32028,-0.480996,...,-0.466381,-0.604259,-0.446499,-0.467512,-0.463309,-0.452453,-0.449691,0.441449,-0.026436,0.1103
29,0.903768,-0.972491,-1.331619,1.180598,-0.353319,-0.31053,-0.155717,-0.306832,-0.397073,-0.369009,...,2.450072,-0.342319,-0.380088,-0.274594,-0.324833,-0.380452,-0.394137,-0.892113,0.423859,-1.647557
36,-0.521347,1.217591,0.433211,-0.940024,-0.441973,-0.387691,-0.402537,-0.373535,-0.473865,-0.20103,...,-0.126074,-0.439249,-0.401363,-0.408353,-0.400215,-0.421657,-0.426884,0.332392,-1.094954,-2.447554
18,-0.912791,1.566814,0.507657,-0.897705,-0.441973,-0.464853,-0.320264,-0.373535,-0.473865,-0.257023,...,-0.092022,-0.448304,-0.406175,-0.394241,-0.412207,-0.416914,-0.418005,-0.002717,-0.11567,-1.456941


We can check the metrics regarding the test dataset.

In [None]:
df_metrics_test = pd.read_csv("experiment_1/metrics_test.csv")
df_metrics_test

Unnamed: 0.1,Unnamed: 0,precision,recall,f1-score,support
0,rRNA,0.888889,1.0,0.941176,8.0
1,sRNA,0.969697,0.969697,0.969697,33.0
2,tRNA,1.0,0.9,0.947368,10.0
3,accuracy,0.960784,0.960784,0.960784,0.960784
4,macro avg,0.952862,0.956566,0.952747,51.0
5,weighted avg,0.962963,0.960784,0.960845,51.0


We can also check the predictions made for individual samples.

In [None]:
df_test_predictions = pd.read_csv("experiment_1/test_predictions.csv", header=None).sample(5)
df_test_predictions

Unnamed: 0,0,1
33,U00096.3/1437118-1437238,sRNA
15,U00096.3/4094319-4094389,sRNA
43,U00096.3/225381-225454,tRNA
25,U00096.3/984395-984437,sRNA
48,U00096.3/696740-696667,tRNA


Next we have the following confusion matrix for the test dataset.

In [None]:
df_test_confusion_matrix = pd.read_csv("experiment_1/test_confusion_matrix.csv", header=None)
df_test_confusion_matrix

Unnamed: 0,0,1,2,3,4
0,REAL,rRNA,sRNA,tRNA,All
1,rRNA,8,0,0,8
2,sRNA,1,32,0,33
3,tRNA,0,1,9,10
4,All,9,33,9,51
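
The per-class metrics reported for the test set follow directly from this confusion matrix. As a worked check, the sketch below recomputes precision, recall, F1, accuracy, and the weighted F1 from the counts above (rows are the real classes, columns the predicted ones). The variable names are ours.

```python
labels = ["rRNA", "sRNA", "tRNA"]
# Rows are real classes, columns are predicted classes.
matrix = [
    [8, 0, 0],
    [1, 32, 0],
    [0, 1, 9],
]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

report = {}
for i, label in enumerate(labels):
    support = sum(matrix[i])                    # samples that truly belong here
    predicted = sum(row[i] for row in matrix)   # samples assigned to this class
    precision = matrix[i][i] / predicted
    recall = matrix[i][i] / support
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1, support)

# Weighted F1 averages the per-class F1 scores using the supports as weights.
weighted_f1 = sum(f1 * support for _, _, f1, support in report.values()) / total

for label, (precision, recall, f1, support) in report.items():
    print(f"{label}: precision={precision:.6f} recall={recall:.6f} f1={f1:.6f} support={support}")
print(f"accuracy={accuracy:.6f} weighted_f1={weighted_f1:.6f}")
```

The values agree with `metrics_test.csv`, for example a precision of 8/9 for rRNA and an accuracy of 49/51.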


Finally, we can also check the feature importance computed for the dataset. For example, let's look at the ten most important features.

In [None]:
df_feature_importance = pd.read_csv("experiment_1/feature_importance.csv", header=None).head(10)
df_feature_importance

Unnamed: 0,0
0,1. Feature (G): (204.000000)
1,2. Feature (G_TTC): (142.000000)
2,3. Feature (A_TCC): (99.000000)
3,4. Feature (T_GCG): (95.000000)
4,5. Feature (G_TC): (79.000000)
5,6. Feature (A): (70.000000)
6,7. Feature (A_GC): (68.000000)
7,8. Feature (G_GG): (63.000000)
8,9. Feature (fickett_score-ORF): (57.000000)
9,10. Feature (G_TCG): (55.000000)
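
Each row of this file is a single formatted string, which is awkward to sort or plot. A small parser can turn it into (name, score) pairs. The sketch assumes every row follows the `N. Feature (name): (score)` pattern shown above; `parse_importance` is a hypothetical helper.

```python
import re

# Matches rows such as "1. Feature (G): (204.000000)".
ROW_PATTERN = re.compile(r"^\s*(\d+)\.\s+Feature \((.+)\): \(([-+0-9.eE]+)\)\s*$")


def parse_importance(rows):
    """Return (feature name, score) pairs, skipping rows that do not match."""
    parsed = []
    for row in rows:
        match = ROW_PATTERN.match(row)
        if match:
            parsed.append((match.group(2), float(match.group(3))))
    return parsed


rows = [
    "1. Feature (G): (204.000000)",
    "2. Feature (G_TTC): (142.000000)",
    "9. Feature (fickett_score-ORF): (57.000000)",
]
importance = parse_importance(rows)
print(importance)
```

With the DataFrame loaded above, the rows could be passed as `df_feature_importance[0].tolist()`.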



## 2.2 Protein classification

Let’s use `BioAutoML-feature-protein.py` for protein classification.

BioAutoML handles protein sequences represented with the standard 20-letter IUPAC amino acid notation, automatically filtering out sequences that contain ambiguous amino acids.

The script accepts the same command-line arguments as the DNA/RNA classification script.

In this example, we’ll run it on a dataset of [non-classically secreted proteins](https://academic.oup.com/bioinformatics/article/36/3/704/5545087) from Zhang et al.


In [None]:
!uv run BioAutoML-feature-protein.py \
--fasta_train MathFeature/Case\ Studies/CS-I/train_P.fasta MathFeature/Case\ Studies/CS-I/train_N.fasta \
--fasta_label_train positive negative \
--fasta_test MathFeature/Case\ Studies/CS-I/test_P.fasta MathFeature/Case\ Studies/CS-I/test_N.fasta \
--fasta_label_test positive negative \
--output experiment_2 --n_cpu -1 --estimations 1



###################################################################################
###################################################################################
##########         BioAutoML- Automated Feature Engineering             ###########
##########              Author: Robson Parmezan Bonidia                 ###########
##########         WebPage: https://bonidia.github.io/website/          ###########
###################################################################################
###################################################################################


Train - MathFeature/Case Studies/CS-I/train_P.fasta: Found File
Train - MathFeature/Case Studies/CS-I/train_N.fasta: Found File
Test - MathFeature/Case Studies/CS-I/test_P.fasta: Found File
Test - MathFeature/Case Studies/CS-I/test_N.fasta: Found File
Error: experiment_2/feat_extraction - No such file or directory.
Creating Directory...
Extracting features with MathFeature...
Automated Feature Engineering

We can check the best features selected for training.

In [None]:
df_best_feature_train = pd.read_csv("experiment_2/best_feature_train.csv").sample(5)
df_best_feature_train

Unnamed: 0,A_A,G_N,L_N,AA,AE,AV,EG,GA,GV,IK,KL,LL,QI,SK,TE,VA,VG
484,0.320398,-0.796494,0.803822,-0.938113,0.440584,-0.844942,-0.054795,0.416353,-0.007751,0.506367,0.317572,-0.581984,-0.557296,-0.467538,-0.051507,-1.082512,-0.454238
236,-1.020591,-0.796494,-0.070772,-0.938113,-1.149214,-1.080399,-1.082223,-1.14757,-1.108193,-1.056024,-1.306153,1.166336,-0.983263,5.538689,2.400019,-1.082512,-1.088005
497,-0.014849,-0.796494,-0.508069,-0.938113,0.092073,-0.222476,-0.146325,0.277028,0.395418,0.977123,-1.306153,-0.434371,0.568815,0.613915,0.741342,-0.64702,0.066609
136,-1.020591,-0.796494,-0.945366,1.133587,-1.149214,-1.080399,2.150225,-1.14757,-1.108193,-1.056024,-1.306153,-1.269994,-0.983263,-1.091667,-1.016247,-1.082512,-1.088005
522,1.32614,-0.796494,1.241119,-0.202263,-1.149214,-0.729573,-1.082223,-0.759201,-1.108193,0.939348,0.421933,0.438543,0.920789,1.233185,-0.05795,-0.726345,-1.088005


Average and standard deviation for metrics using 10-fold cross-validation.

In [None]:
df_training_kfold = pd.read_csv("experiment_2/training_kfold(10)_metrics.csv")
df_training_kfold

Unnamed: 0,ACC,std_ACC,MCC,std_MCC,F1,std_F1,balanced_ACC,std_balanced_ACC,kappa,std_kappa,gmean,std_gmean
0,0.8621,0.04,0.6024,0.12,0.6738,0.11,0.7754,0.07,0.5892,0.13,0.7523,0.09


Confusion matrix for training dataset.

In [None]:
df_training_confusion_matrix = pd.read_csv("experiment_2/training_confusion_matrix.csv", header=None)
df_training_confusion_matrix

Unnamed: 0,0,1,2,3
0,REAL,negative,positive,All
1,negative,420,26,446
2,positive,54,87,141
3,All,474,113,587


Test dataset used for prediction using selected descriptors and features.

In [None]:
df_best_feature_test = pd.read_csv("experiment_2/best_feature_test.csv").sample(5)
df_best_feature_test

Unnamed: 0,A_A,G_N,L_N,AA,AE,AV,EG,GA,GV,IK,KL,LL,QI,SK,TE,VA,VG
18,0.655646,-0.796494,-0.070772,4.792119,-1.149214,-1.080399,3.388183,-1.14757,1.285855,-1.056024,0.712398,0.725723,-0.983263,-1.091667,7.379045,-1.082512,-1.088005
48,-1.020591,-0.796494,-0.945366,-0.938113,2.458151,-1.080399,-1.082223,0.922477,-1.108193,-1.056024,0.53602,-1.269994,-0.983263,-1.091667,-1.016247,-1.082512,-1.088005
23,-1.020591,-0.796494,-0.508069,0.253573,-1.149214,-1.080399,2.636522,0.739287,2.874825,2.17542,3.731294,-1.269994,-0.983263,-1.091667,-1.016247,-1.082512,1.205896
36,-1.020591,-0.136124,-0.508069,-0.355167,-0.344974,0.587166,0.736903,-0.224562,-1.108193,-0.265649,1.979454,3.602665,-0.983263,1.118452,-1.016247,0.610437,-1.088005
40,0.655646,0.524246,-0.070772,-0.938113,-0.517311,-0.425284,-0.367566,-0.42235,1.188139,-0.435015,-0.660766,0.006179,-0.983263,-1.091667,-0.121511,0.247663,-0.206336


Now let's look at the binary classification metrics for the test dataset.

In [None]:
df_metrics_test = pd.read_csv("experiment_2/metrics_test.csv")
df_metrics_test

Unnamed: 0,Metrics: Test Set
0,Accuracy: 0.6029411764705882
1,Recall: 0.7647058823529411
2,Precision: 0.5777777777777777
3,F1: 0.6582278481012658
4,AUC: 0.6548442906574395
5,balanced ACC: 0.6029411764705882
6,gmean: 0.5808358134744558
7,MCC: 0.21758445525607323


We can inspect the predictions made for the test dataset.

In [None]:
df_test_predictions = pd.read_csv("experiment_2/test_predictions.csv", header=None).sample(5)
df_test_predictions

Unnamed: 0,0,1
53,sp|Q7A827|HDOX2_STAAN,negative
9,sp|P9WN39|GLN1B_MYCTU,negative
6,sp|P54375|SODM_BACSU,negative
46,sp|Q72X78|SPEE_BACC1,negative
44,sp|Q92A79|NDK_LISIN,negative


Confusion matrix for test dataset.

In [None]:
df_test_confusion_matrix = pd.read_csv("experiment_2/test_confusion_matrix.csv", header=None)
df_test_confusion_matrix

Unnamed: 0,0,1,2,3
0,REAL,negative,positive,All
1,negative,26,8,34
2,positive,19,15,34
3,All,45,23,68
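
Most of the binary metrics reported for the test set can be recomputed from this confusion matrix. The sketch below does so with the counts above (rows are the real classes, columns the predicted ones). One observation from the arithmetic, not a documented behavior: the reported Recall, Precision, and F1 match the values of the class labelled `negative`. AUC is left out because it needs prediction scores, not only counts.

```python
import math

# Rows are real classes, columns are predicted classes.
neg_as_neg, neg_as_pos = 26, 8
pos_as_neg, pos_as_pos = 19, 15

total = neg_as_neg + neg_as_pos + pos_as_neg + pos_as_pos
accuracy = (neg_as_neg + pos_as_pos) / total

# Per-class recall: correctly predicted samples over real samples of the class.
recall_negative = neg_as_neg / (neg_as_neg + neg_as_pos)
recall_positive = pos_as_pos / (pos_as_neg + pos_as_pos)
precision_negative = neg_as_neg / (neg_as_neg + pos_as_neg)
f1_negative = (2 * precision_negative * recall_negative
               / (precision_negative + recall_negative))

balanced_accuracy = (recall_negative + recall_positive) / 2
gmean = math.sqrt(recall_negative * recall_positive)

# Matthews correlation coefficient; it is symmetric in the two classes.
mcc = (neg_as_neg * pos_as_pos - neg_as_pos * pos_as_neg) / math.sqrt(
    (neg_as_neg + neg_as_pos) * (pos_as_neg + pos_as_pos)
    * (neg_as_neg + pos_as_neg) * (neg_as_pos + pos_as_pos)
)

print(f"accuracy={accuracy:.4f} recall={recall_negative:.4f} "
      f"precision={precision_negative:.4f} f1={f1_negative:.4f}")
print(f"balanced_acc={balanced_accuracy:.4f} gmean={gmean:.4f} mcc={mcc:.4f}")
```

Since only 15 of the 34 real positives are recovered, the recall for the `positive` class is about 0.44, which is worth keeping in mind when reading the reported recall of 0.76.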


Feature importance for our dataset.

In [None]:
df_feature_importance = pd.read_csv("experiment_2/feature_importance.csv", header=None).head(10)
df_feature_importance

Unnamed: 0,0
0,1. Feature (AA): (12.128527)
1,2. Feature (QI): (7.163575)
2,3. Feature (AV): (7.062948)
3,4. Feature (EG): (7.017453)
4,5. Feature (SK): (6.795861)
5,6. Feature (GV): (6.687290)
6,7. Feature (LL): (5.995464)
7,8. Feature (L_N): (5.774325)
8,9. Feature (G_N): (5.747755)
9,10. Feature (GA): (5.489212)


## 2.3 Structured data classification

BioAutoML can also work with structured data, without any biological sequences. In that case we call the binary and multiclass classification scripts directly.

As an example, let us use `BioAutoML-binary.py` for binary classification. Note that `BioAutoML-multiclass.py` is very similar to the binary classification script (except that it does not take feature selection as an argument).

The structured dataset requires separate CSV files for features and labels, where the labels file must include a column named `label`, and this applies to both training and test sets. For test set evaluation, an additional CSV file is required containing sample names in a column titled `nameseq`.
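
As an illustration of this layout, the sketch below writes a tiny test split as three CSV files. Only the column names `label` and `nameseq` come from the description above. The feature column names, the sample values, and the assumption that rows appear in the same order in all three files are ours.

```python
import csv
import tempfile
from pathlib import Path


def write_csv(path, header, rows):
    """Write one CSV file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


# Hypothetical toy samples: (name, feature values, class label).
samples = [
    ("sample_1", [0.12, 0.40], 1),
    ("sample_2", [0.33, 0.05], 0),
]
feature_names = ["feat_a", "feat_b"]  # hypothetical column names

folder = Path(tempfile.mkdtemp())
write_csv(folder / "test.csv", feature_names, [values for _, values, _ in samples])
write_csv(folder / "test_labels.csv", ["label"], [[label] for _, _, label in samples])
write_csv(folder / "test_names.csv", ["nameseq"], [[name] for name, _, _ in samples])

# Show the header of each file that was written.
for path in sorted(folder.iterdir()):
    print(path.name, "->", path.read_text().splitlines()[0])
```

The training split follows the same pattern with a features file and a labels file. Comparing your files with those in `example_csv/AntiCancer` is the safest way to confirm the exact format the script expects.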

We have the following arguments that can be used to run the script:

`--train`: Path to the CSV file consisting of the training dataset;

`--train_label`: Path to the CSV file consisting of the labels for the training dataset;

`--test`: Path to the CSV file consisting of the test dataset;

`--test_label`: Path to the CSV file consisting of the labels for the test dataset;

`--test_nameseq`: Path to the CSV file consisting of the names for the samples used in the test dataset;

`--normalization`: Normalization - Features (default = False);

`--featureselection`: Feature Selection (default = True);

`--n_cpu`: Number of CPU cores to use (`-1` will use all cores available);

`--classifier`: Classifier to be used; 0: CatBoost, 1: Random Forest, 2: LightGBM, 3: XGBoost;

`--estimations`: Number of trials using Bayesian optimization to find the best combination of descriptors and classifiers for classification. Note: In this script the original BioAutoML uses a fixed number of trials;

`--imbalance`: To deal with the imbalanced dataset problem (default = False);

`--tuning`: Tuning Classifier (default = False);

`--output`: Folder to create and save results.

We will use the AntiCancer dataset as our example of structured data.

In [None]:
!uv run BioAutoML-binary.py \
--train example_csv/AntiCancer/train.csv \
--train_label example_csv/AntiCancer/train_labels.csv \
--test example_csv/AntiCancer/test.csv \
--test_label example_csv/AntiCancer/test_labels.csv \
--test_nameseq example_csv/AntiCancer/test_names.csv \
--normalization True \
--featureselection 1 \
--classifier 1 \
--tuning True \
--output experiment_3 --n_cpu -1 --estimations 5



###################################################################################
###################################################################################
#####################        BioAutoML - Binary             #######################
##########              Author: Robson Parmezan Bonidia                 ###########
##########         WebPage: https://bonidia.github.io/website/          ###########
###################################################################################
###################################################################################


Train - example_csv/AntiCancer/train.csv: Found File
Train_labels - example_csv/AntiCancer/train_labels.csv: Found File
Test - example_csv/AntiCancer/test.csv: Found File
Test_labels - example_csv/AntiCancer/test_labels.csv: Found File
Test_nameseq - example_csv/AntiCancer/test_names.csv: Found File
Number of samples (train): 240
Number of samples (test): 104
Number of features (train): 32
Number of featu

The training dataset with selected features.

In [None]:
df_best_feature_train = pd.read_csv("experiment_3/best_feature_train.csv").sample(5)
df_best_feature_train

Unnamed: 0,C,K,P,Q,R
60,-0.0291,-0.12956,-0.063861,-0.069758,-0.117077
195,-0.073475,-0.13187,-0.064167,-0.061191,-0.114593
100,-0.073475,-0.130348,-0.072736,-0.06245,-0.108602
24,-0.073475,-0.107231,-0.072736,-0.069758,-0.117077
83,-0.073475,-0.129217,-0.072736,-0.069758,-0.111536


Average and standard deviation of metrics using 10-fold cross-validation.

In [None]:
df_training_kfold = pd.read_csv("experiment_3/training_kfold(10)_metrics.csv")
df_training_kfold

Unnamed: 0,ACC,std_ACC,MCC,std_MCC,F1,std_F1,balanced_ACC,std_balanced_ACC,kappa,std_kappa,gmean,std_gmean
0,0.8417,0.06,0.6516,0.13,0.7587,0.09,0.8112,0.07,0.642,0.13,0.8014,0.07


Confusion matrix for training dataset.

In [None]:
df_training_confusion_matrix = pd.read_csv("experiment_3/training_confusion_matrix.csv", header=None)
df_training_confusion_matrix

Unnamed: 0,0,1,2,3
0,REAL,0,1,All
1,0,142,12,154
2,1,27,59,86
3,All,169,71,240


Selected features for test dataset.

In [None]:
df_best_feature_test = pd.read_csv("experiment_3/best_feature_test.csv").sample(5)
df_best_feature_test

Unnamed: 0,C,K,P,Q,R
83,-0.073475,-0.134026,-0.065833,-0.055955,-0.117077
9,-0.073475,-0.134026,-0.072736,-0.060202,-0.117077
28,-0.073475,-0.120864,-0.072736,-0.069758,-0.117077
55,-0.073475,-0.12956,-0.063861,-0.065321,-0.117077
79,-0.063121,-0.134026,-0.054098,-0.069758,-0.106271


Metrics regarding the test dataset.

In [None]:
df_metrics_test = pd.read_csv("experiment_3/metrics_test.csv")
df_metrics_test

Unnamed: 0,Metrics: Test Set
0,Accuracy: 0.8269230769230769
1,Recall: 0.9038461538461539
2,Precision: 0.7833333333333333
3,F1: 0.8392857142857143
4,AUC: 0.8946005917159764
5,balanced ACC: 0.8269230769230769
6,gmean: 0.8233374857156787
7,MCC: 0.6617241025372945


Predictions for the test dataset.

In [None]:
df_test_predictions = pd.read_csv("experiment_3/test_predictions.csv", header=None).sample(5)
df_test_predictions

Unnamed: 0,0,1
9,ACP_10,0
62,non-ACP_20,0
84,non-ACP_176,1
19,ACP_29,1
90,non-ACP_182,0


Confusion matrix for the test dataset.

In [None]:
df_test_confusion_matrix = pd.read_csv("experiment_3/test_confusion_matrix.csv", header=None)
df_test_confusion_matrix

Unnamed: 0,0,1,2,3
0,REAL,0,1,All
1,0,47,5,52
2,1,13,39,52
3,All,60,44,104


Feature importance for the dataset with the most important features.

In [None]:
df_feature_importance = pd.read_csv("experiment_3/feature_importance.csv", header=None).head(10)
df_feature_importance

Unnamed: 0,0
0,1. Feature (K): (0.294172)
1,2. Feature (C): (0.218803)
2,3. Feature (Q): (0.180643)
3,4. Feature (P): (0.158340)
4,5. Feature (R): (0.148042)
