<h1 align="center" style="font-variant: small-caps">Tutorial: How to build a genome-scale GBA model</h1>
<h2 align="center">INTRODUCTION</h2>
<h5 align="center">(<code>Version 9</code>, March 2025)</h5>

<div align="center" style="max-width:100px;display:block;margin:auto;">

![image-2.png](attachment:image-2.png)

</div>

# Table of contents

- [1. The <strong>GBApy</strong> module](#gbapy)
- [2. Constructing a genome-scale GBA model: how-to](#introduction)
  - [Step 1: Collect data](#introduction_1)
  - [Step 2: Check and correct the consistency of the model](#introduction_2)
  - [Step 3: Edit and compress the model](#introduction_3)
  - [Step 4: Build the GBA model](#introduction_4)
  - [Step 5 (Optional): Reduce the problem to a full column rank problem](#introduction_5)
- [3. Software usage](#software)
- [4. A minimal cell tutorial to guide the user](#tutorial)

# 1. The <strong>GBApy</strong> module <a id="gbapy"></a>

<strong>GBApy</strong> has been specifically developed to handle GBA models in Python. 
It is versatile and easy to use, and lets users build, test and optimize models of any size, including genome-scale models.
Ultimately, <strong>GBApy</strong> provides tools to simulate evolution and predict the future adaptation steps of the organism of interest.

<strong>GBApy</strong> is freely distributed as a Python module under the GNU-GPLv3 license (see https://github.com/charlesrocabert/gbapy for installation instructions and documentation).

In addition, a C++ framework, <strong>GBAcpp</strong> (see https://github.com/charlesrocabert/gbacpp), has been implemented specifically to search for optimal solutions of genome-scale models, which are computationally demanding.
<strong>GBAcpp</strong> is intended to be deployed on high-performance computing (HPC) clusters.

# 2. Constructing a genome-scale GBA model: how-to <a id="introduction"></a>

Building a genome-scale GBA model usually requires a pre-existing model of the organism of interest. A good starting point is a published and annotated SBML model (see <em>e.g.</em> the <a href="http://bigg.ucsd.edu/" target="_blank">BiGG database</a>).

To obtain a functional model, it is necessary to check for mass conservation inconsistencies, and to collect a substantial amount of data to infer metabolite and enzyme molecular masses, as well as kinetic parameters.

A proper annotation (including metabolites, proteins and enzyme composition) and a genome assembly are usually also needed to extract database identifiers, chemical formulas, or nucleotide and amino-acid sequences (<em>e.g.</em> from the <a href="https://www.ncbi.nlm.nih.gov/datasets/genome/" target="_blank">NCBI database</a>).

The construction can be divided into five main steps, each including substeps, as presented below:

<div align="center" style="max-width:1000px;display:block;margin:auto;">

![image.png](attachment:image.png)

</div>

## Step 1. Collect data <a id="introduction_1"></a>

The first step is to collect all the necessary data to build the model:

- 1) Collect all proteins involved in the model together with their amino-acid sequences, and calculate their molecular masses. If the protein catalyzing a reaction is unknown, a dummy protein can be used (<em>e.g.</em> an average protein).
- 2) Collect all metabolites involved in the model. Depending on the nature of the metabolite (DNA, RNA, small molecule, macro-molecule, ...), different solutions are necessary to calculate the molecular masses. <strong>GBApy</strong> provides methods for DNA, RNA and chemical formulas without unspecified radicals (`R`). For other molecules, manual curation is necessary.
- 3) Collect all the reactions, as well as their enzyme composition, except typical FBA pseudo-reactions such as exchange (`R_EX_...`) or biomass reactions (`R_BIOMASS_...`).
- 4) Collect kinetic parameters for each reaction (mainly $K_\text{M}$ and $k_\text{cat}$ values). These can also include inhibition/activation interactions. The user has to collect (<em>e.g.</em> from the <a href="https://www.brenda-enzymes.org/" target="_blank">BRENDA database</a>), or predict (<em>e.g.</em> with <a href="https://esp.cs.hhu.de/" target="_blank">DeepMolecules</a>), the kinetic parameter values before providing them to <strong>GBApy</strong>.
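
The mass calculations of substeps 1 and 2 can be sketched in plain Python. This is only an illustration and does not use the <strong>GBApy</strong> API: the function names `formula_mass` and `protein_mass` are hypothetical, the atomic and residue masses are standard average values, and only formulas made of the listed elements are supported. A formula containing an unspecified radical (`R`) is rejected, which mirrors the need for manual curation in that case.

```python
import re

# Standard average atomic masses (g/mol) of common metabolite elements.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007,
               "O": 15.999, "P": 30.974, "S": 32.06}

# Average residue masses (g/mol) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.015  # One water molecule is added for the free termini.


def formula_mass(formula):
    """Return the molecular mass (g/mol) of a flat formula such as 'C6H12O6'."""
    tokens = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    # The tokens must reconstruct the whole string (no brackets, charges, ...).
    if not tokens or "".join(s + n for s, n in tokens) != formula:
        raise ValueError(f"Cannot parse formula: {formula!r}")
    mass = 0.0
    for symbol, count in tokens:
        if symbol not in ATOMIC_MASS:
            # Includes the unspecified radical 'R': manual curation is needed.
            raise ValueError(f"Unsupported symbol {symbol!r} in {formula!r}")
        mass += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return mass


def protein_mass(sequence):
    """Return the average molecular mass (g/mol) of an amino-acid sequence."""
    unknown = set(sequence) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"Unknown residues: {sorted(unknown)}")
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS


glucose_mass = formula_mass("C6H12O6")   # about 180.16 g/mol
dipeptide_mass = protein_mass("GA")      # about 146.15 g/mol
```

Raising an error on unknown symbols, rather than silently skipping them, keeps incomplete formulas from producing plausible but wrong masses.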

## Step 2. Check and correct the consistency of the model <a id="introduction_2"></a>

Second, the consistency of the model structure must be tested to detect potential mishaps:

- 1) Check for various annotation inconsistencies (missing molecular masses or kinetic parameters, empty objects, ...).
- 2) Check for unproduced metabolites breaking the conservation of mass.
- 3) Check for infeasible loops (unproduced metabolites hidden in a loop breaking mass conservation).
- 4) Check for isolated metabolites (imported but never used; it is usually the case for metabolites that only relate to the FBA biomass function).
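
Checks 2 and 4 above can be illustrated on a toy network. The sketch below is not the <strong>GBApy</strong> implementation: the function names and the dictionary layout are hypothetical, all reactions are assumed irreversible, and an "isolated" metabolite is taken here to mean one that is produced but never consumed. Infeasible loops (check 3) need a dedicated analysis and are not covered.

```python
def produced_and_consumed(reactions):
    """Collect the metabolites produced and consumed by irreversible reactions."""
    produced, consumed = set(), set()
    for stoichiometry in reactions.values():
        for metabolite, coefficient in stoichiometry.items():
            if coefficient > 0:
                produced.add(metabolite)
            elif coefficient < 0:
                consumed.add(metabolite)
    return produced, consumed


def unproduced_metabolites(reactions, external):
    """Internal metabolites that are consumed but produced by no reaction."""
    produced, consumed = produced_and_consumed(reactions)
    return sorted(consumed - produced - set(external))


def isolated_metabolites(reactions, ignore=()):
    """Metabolites that are produced but never consumed.

    `ignore` lists legitimate end products (e.g. the total protein pool).
    """
    produced, consumed = produced_and_consumed(reactions)
    return sorted(produced - consumed - set(ignore))


# Toy network: negative coefficients are substrates, positive are products.
toy_reactions = {
    "transport_A": {"x_A": -1, "A": 1},
    "transport_B": {"x_B": -1, "B": 1},          # B is imported but never used
    "synthesis": {"A": -1, "C": -1, "Protein": 1},  # C is never produced
}

unproduced = unproduced_metabolites(toy_reactions, external={"x_A", "x_B"})
isolated = isolated_metabolites(toy_reactions, ignore={"Protein"})
print(unproduced)  # ['C']
print(isolated)    # ['B']
```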

## Step 3. Edit and compress the model <a id="introduction_3"></a>

After the mandatory steps 1 and 2, modeling decisions can be made to reduce the model size and/or remove unused pathways:

- 1) Edit reaction and metabolite names to fit the GBA formalism (by convention, external metabolites receive a prefix `x_`; internal metabolites have no prefix/suffix).
- 2) Remove unnecessary pathways and/or simplify them. This step is at the user's discretion.
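
As an illustration of the naming convention in substep 1, the sketch below converts BiGG-style SBML identifiers into GBA-style names. The mapping is an assumption made for this example (the `M_` prefix is dropped, the `_e` compartment suffix marks external metabolites and the `_c` suffix marks internal ones); the helper `to_gba_name` is hypothetical and is not part of <strong>GBApy</strong>.

```python
def to_gba_name(sbml_id, external_suffix="_e", internal_suffix="_c"):
    """Map a BiGG-style metabolite identifier to a GBA-style name.

    Assumed mapping (illustration only):
      'M_glc__D_e' -> 'x_glc__D'  (external: prefix 'x_')
      'M_atp_c'    -> 'atp'       (internal: no prefix/suffix)
    """
    # Drop the SBML metabolite prefix if present.
    name = sbml_id[2:] if sbml_id.startswith("M_") else sbml_id
    if name.endswith(external_suffix):
        return "x_" + name[: -len(external_suffix)]
    if name.endswith(internal_suffix):
        return name[: -len(internal_suffix)]
    # Unknown compartment: leave the name untouched for manual inspection.
    return name


renamed = {m: to_gba_name(m) for m in ["M_glc__D_e", "M_glc__D_c", "M_atp_c"]}
print(renamed)
```

Names with an unrecognized compartment suffix are returned unchanged, so that they can be reviewed by hand rather than renamed incorrectly.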

## Step 4. Build the GBA model <a id="introduction_4"></a>

Finally, the GBA model can be built:

- 1) Check for mass balance. It is usually necessary to adjust the mass of a few metabolites to keep mass balance in all reactions.
- 2) Build a realistic ribosomal reaction. This involves collecting the list of ribosomal proteins composing the enzyme, and defining a stoichiometry for the protein synthesis reaction.
- 3) Collect information to build external conditions (<em>e.g.</em> from a known medium content).
- 4) Find at least one initial solution. <strong>GBApy</strong> routines will help.
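
The mass balance check of substep 1 amounts to verifying that, for each reaction, the stoichiometry-weighted sum of molecular masses is zero. The following is a minimal sketch on toy data; the function names, the tolerance and the masses are assumptions for illustration, not <strong>GBApy</strong> routines or real values.

```python
def mass_imbalance(stoichiometry, masses):
    """Mass of products minus mass of substrates for one reaction."""
    return sum(coef * masses[met] for met, coef in stoichiometry.items())


def unbalanced_reactions(reactions, masses, tolerance=1e-6):
    """Return {reaction id: imbalance} for reactions violating mass balance."""
    result = {}
    for reaction_id, stoichiometry in reactions.items():
        imbalance = mass_imbalance(stoichiometry, masses)
        if abs(imbalance) > tolerance:
            result[reaction_id] = imbalance
    return result


# Toy molecular masses (g/mol) and reactions (negative = substrate).
toy_masses = {"A": 100.0, "B": 50.0, "C": 150.0, "D": 90.0}
toy_network = {
    "balanced": {"A": -1, "B": -1, "C": 1},  # 100 + 50 -> 150
    "leaky": {"A": -1, "D": 1},              # 100 -> 90, loses 10 g/mol
}

report = unbalanced_reactions(toy_network, toy_masses)
print(report)  # {'leaky': -10.0}
```

A reaction reported here would require adjusting the mass of one of its metabolites, as described in substep 1.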

## Step 5 (Optional). Reduce the problem to a full column rank problem <a id="introduction_5"></a>

By reducing the internal mass fraction matrix to a full column rank matrix, a convex GBA problem and the existence of a single optimal growth rate are guaranteed.
A good approach is to identify an elementary flux mode (EFM) and to remove inactive reactions.
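
Whether a matrix has full column rank can be tested numerically with NumPy. The sketch below uses a small toy matrix (rows are metabolites, columns are reactions), not an actual GBA mass fraction matrix, and the helper `has_full_column_rank` is a hypothetical name. It only shows that removing a linearly dependent column restores full column rank; it does not identify an EFM.

```python
import numpy as np


def has_full_column_rank(matrix):
    """True if the columns of `matrix` are linearly independent."""
    matrix = np.asarray(matrix, dtype=float)
    return bool(np.linalg.matrix_rank(matrix) == matrix.shape[1])


# Toy matrix: the third column is the sum of the first two.
toy_matrix = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [-1.0, -1.0, -2.0],
])
full_before = has_full_column_rank(toy_matrix)

# Dropping the redundant reaction (third column) restores full column rank.
reduced_matrix = toy_matrix[:, :2]
full_after = has_full_column_rank(reduced_matrix)

print(full_before, full_after)  # False True
```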


# 3. Software usage <a id="software"></a>

Two software packages are available to build and run GBA models:

- <strong>GBApy</strong> (https://github.com/charlesrocabert/gbapy): Python module dedicated to building GBA models, from toy models to genome-scale models (see the next steps of this tutorial). <strong>GBApy</strong> can also be used to run evolutionary algorithms on small models (typically, fewer than 20 reactions) on a personal laptop.
- <strong>GBAcpp</strong> (https://github.com/charlesrocabert/gbacpp): C++ framework specifically optimized to run genome-scale models. <strong>GBAcpp</strong> is particularly fast and useful when deployed on HPC clusters.

<div align="center" style="max-width:1000px;display:block;margin:auto;">

![image-2.png](attachment:image-2.png)

</div>

# 4. A minimal cell tutorial to guide the user <a id="tutorial"></a>

To guide the user towards a functional genome-scale GBA model, a complete tutorial using the synthetic minimal cell (<a href="https://www.nature.com/articles/s41586-023-06288-x" target="_blank">Moger-Reischer et al. 2023</a>) is available.