# Hands-On Session III: Workflows in the workflow collection SPECIMEN
In this notebook, you will get first-hand experience running some of the steps of the high-quality template-based 
(HQTB) pipeline within ``SPECIMEN``. You will also learn when to use which workflow, how to obtain the required 
data, and how to run the workflows.

## Background on workflows
``SPECIMEN`` contains two workflows, both intended to produce a high-quality model with minimal manual input. Hence, they 
share a similar structure, while their implementations differ.

Basic structure
- Draft generation
- Model extension 
- Annotation and Clean-up 
- Analyses (final and after various steps)

One major difference between the HQTB and the CarveMe-ModelPolisher-based (CMPB) workflow is that the steps of HQTB can be run 
individually and that a template model is required for the draft reconstruction. The CMPB workflow, in contrast, can not only 
generate a new model but also improve an already existing one. 

### When to use which workflow?
Since the two workflows share a similar structure, one could ask why more than one is needed. 
The table below answers this question:

|  | **HQTB** | **CMPB** |
| :--- | :--- | :--- |
| Requirements | Template model, <br> annotated genome of the template organism, <br> annotated genome of the organism to be built | Existing model or <br> annotated genome |
| Ideal for | Lab strains <br> for which models of closely related organisms exist | Novel organisms <br> without template models, <br> existing models |
| "Philosophy" | Maximise the benefit from previous work <br> to avoid wasting time and resources <br> on the same questions again | Automation of the "classic" workflow |

We will take a closer look at the HQTB workflow for this workshop.

### Running HQTB
Before running HQTB, the necessary input data needs to be gathered, the correct paths added and the options set in 
one of the possible configuration files for HQTB.

#### Getting the input data

**Template**

Since the HQTB workflow is a template-based approach, one of the first steps is to find a suitable template model. 
For the reconstruction of the model of *Pseudomonas putida* KT2440, the *Pseudomonas aeruginosa* PA14 model from Dahal 
et al.[^1] was chosen (see XML file in `../data/template/`). 

Because the first part of the workflow performs a bidirectional BLAST, the model alone is not enough to run the workflow.
We need the genome file as well. Usually, one would check the model and the corresponding paper for the sequence itself or for a database identifier under which the sequence can be found. 
Here, we already provide you with the annotated genome used for creating the model, which can be found on NCBI under the assembly number [ASM1462v1](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000014625.1/) (see FASTA file in `../data/template/`). 
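
To make the idea behind the bidirectional BLAST more concrete, here is a minimal sketch of how bidirectional best hits can be derived from two hit lists. It is an illustration only and not the implementation used in `SPECIMEN`; the function names, the tuple layout and the toy identifiers are assumptions made for this example.

```python
def best_hits(hits):
    """Return the best-scoring subject for each query.

    `hits` is an iterable of (query, subject, bitscore) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(forward, reverse):
    """Keep only pairs that are each other's best hit in both directions."""
    fwd = best_hits(forward)  # new organism -> template organism
    rev = best_hits(reverse)  # template organism -> new organism
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


# Toy data (made up): PP_* are proteins of the new organism,
# PA14_* are proteins of the template organism.
forward = [
    ("PP_0001", "PA14_A", 310.0),
    ("PP_0001", "PA14_B", 120.0),
    ("PP_0002", "PA14_B", 95.0),
]
reverse = [
    ("PA14_A", "PP_0001", 305.0),
    ("PA14_B", "PP_0001", 118.0),
]

pairs = bidirectional_best_hits(forward, reverse)
print(pairs)
```

Only `PP_0001` and `PA14_A` select each other in both directions, so only this pair is kept. This is why both genomes are needed: without the template sequences, the reverse search cannot be performed.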

**Subject**

For the subject organism, the annotated genome is needed as well. It can either be provided directly by the laboratory or downloaded from databases like NCBI. In our case, we will use the NCBI entry [ASM756v2](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000007565.2/) (see FASTA file in `../data/`). Some steps in the workflow additionally require the GFF file (either the GenBank one, starting with `GCA`, or the RefSeq one, starting with `GCF`).
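
Before using a FASTA file as input, it can help to check which identifiers its headers actually carry. The following sketch counts the `[key=value]` qualifiers found in the header lines. The helper name and the example headers are made up for illustration; the headers only imitate the style of NCBI translated CDS files.

```python
import re
from collections import Counter

# Matches qualifiers of the form [key=value] in a FASTA header line.
QUALIFIER = re.compile(r"\[(\w+)=([^\]]*)\]")


def header_qualifiers(fasta_lines):
    """Count how often each [key=value] qualifier occurs in FASTA headers."""
    counts = Counter()
    n_headers = 0
    for line in fasta_lines:
        if line.startswith(">"):
            n_headers += 1
            counts.update(key for key, _ in QUALIFIER.findall(line))
    return n_headers, counts


# Made-up example records, not taken from the workshop data.
example = [
    ">lcl|X_prot_AAA00001.1_1 [locus_tag=PP_0001] [protein_id=AAA00001.1] [gbkey=CDS]",
    "MSTNPKPQRK",
    ">lcl|X_prot_2 [locus_tag=PP_0002] [gbkey=CDS]",
    "MKKLLPTAAA",
]

n_headers, counts = header_qualifiers(example)
print(n_headers, dict(counts))
```

For a real file, pass the open file handle instead of the list, for example `with open(path) as fh: header_qualifiers(fh)`. A qualifier that occurs in every header is a safer choice for matching entries than one that is missing for some sequences.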

**Other**

Depending on which steps and options are enabled, additional data files or values are needed, for example:

- To set the reaction direction based on BioCyc, a table from BioCyc needs to be provided.
- To run the KEGGapFiller, a KEGG organism ID is needed.
- To run the BioCycGapFiller, one table for the genes and one for the reactions are needed. 
- The GeneGapFiller requires a DIAMOND database and, optionally, a mapping table to improve the runtime.
- etc.

### Running steps 1 & 2 for HQTB

While the workflow can be run in a couple of hours or less, depending on which options are enabled, this would still take too long for this workshop. 
Therefore, only the first two steps of the workflow will be tested. However, you are encouraged to also fill out a full configuration file to get a feeling for what is needed to build a complete model. 

📝 **Task**
1. Fill out the provided example configuration file (see `../data/hqtb_example_config.yaml`).
2. Run step 1: Bidirectional BLAST.
    > Note: While the workflow can run with both the FASTA and the GBFF, it relies on matching the feature qualifiers of the GBFF or the IDs in the FASTA. Make sure to use the correct one as input.
3. Run step 2: Generate draft model.
    > Tip: Since the template organism belongs to a different species, try out different PID (percentage identity) values and observe how they influence the results and why closely related organisms work best for this kind of modelling approach.
4. Perform basic analyses on the generated model.
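
To get an intuition for the effect of the PID value before running step 2, the following sketch counts how many hit pairs of a small, made-up table would pass different thresholds. The table content and its column names (`query`, `subject`, `pid`) are hypothetical and do not necessarily match the file written by the workflow.

```python
import csv
import io

# Made-up, tab-separated table; the column names are assumptions.
TOY_BBH = """query\tsubject\tpid
PP_0001\tPA14_A\t95.2
PP_0002\tPA14_B\t88.0
PP_0003\tPA14_C\t71.5
PP_0004\tPA14_D\t45.3
"""


def count_above(table_text, threshold):
    """Count hit pairs whose percentage identity reaches the threshold."""
    rows = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return sum(float(row["pid"]) >= threshold for row in rows)


# Number of retained pairs for three example thresholds.
retained = {t: count_above(TOY_BBH, t) for t in (50.0, 80.0, 90.0)}
print(retained)
```

The stricter the threshold, the fewer pairs remain. With a distantly related template, few hits reach a high PID, so less of the template model can be transferred to the draft.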

<font color="grey">

[^1]: Dahal S, Renz A, Dräger A, Yang L. Genome-scale model of Pseudomonas aeruginosa metabolism unveils virulence and drug potentiation. Commun Biol. 2023 Feb 10;6(1):165. doi: 10.1038/s42003-023-04540-8. PMID: 36765199; PMCID: PMC9918512.

```python
# Your code goes here
```

<details>
<summary>🔑 Click to see the answer 🔑</summary>

Here is the code for the task:

```python
from specimen.hqtb.core import bidirectional_blast
from specimen.hqtb.core import generate_draft_model

# Note: adjust the paths if your data folder lives elsewhere (e.g. ../data/).

# Step 1: bidirectional BLAST between the template and the new organism
bidirectional_blast.run(
    template="./data/template/GCA_000014625.1_ASM1462v1_translated_cds.faa",
    input="./data/GCA_000007565.2_ASM756v2_translated_cds.faa",
    dir="./hqtb_out/",  # can be a path of your choice
)

# Step 2: generate the draft model from the template model and the BLAST hits
generate_draft_model.run(
    template="./data/template/iSD1509.xml",
    # output of the previous function
    bpbbh="./hqtb_out/GCA_000007565.2_ASM756v2_translated_cds_GCA_000014625.1_ASM1462v1_translated_cds_bbh.tsv",
    dir="./hqtb_out/",  # can be a path of your choice
    pid=90.0,
    name="iPpuKT2440WS25",
)

```

</details>
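
For the basic analyses in task 4, one possible starting point is to compare simple size statistics of the draft model with those of the template. The sketch below assumes that the model is loaded with COBRApy; the helper names are made up for this example, and a small stand-in object is used so that the code runs without a model file.

```python
from types import SimpleNamespace


def summarise(model):
    """Return basic size statistics for a COBRApy-style model object."""
    with_genes = sum(1 for rxn in model.reactions if len(rxn.genes) > 0)
    return {
        "reactions": len(model.reactions),
        "metabolites": len(model.metabolites),
        "genes": len(model.genes),
        "reactions_with_genes": with_genes,
    }


def summarise_file(path):
    """Load an SBML file with COBRApy and summarise it."""
    from cobra.io import read_sbml_model  # requires COBRApy to be installed

    return summarise(read_sbml_model(path))


# Stand-in object (made up) that mimics the attributes used above.
toy = SimpleNamespace(
    reactions=[SimpleNamespace(genes={"g1"}), SimpleNamespace(genes=set())],
    metabolites=["a", "b", "c"],
    genes=["g1"],
)

stats = summarise(toy)
print(stats)
```

With a real model, call `summarise_file(...)` with the path of the draft model written to your output directory, and again with `./data/template/iSD1509.xml`. Repeating this for drafts built with different PID values shows how much of the template each setting retains.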

## ⭐️ Bonus
If you have already finished the tasks and want to go further, you can try out more options from the configuration 
file. Alternatively, you can run additional analyses on your model.

📝 **Task**

Try out more options/steps for HQTB or more analyses.

*Note:* More options/steps for HQTB will most likely take more time than the workshop provides.