In [1]:
from specimen.util import set_up
import specimen


# HowTo: Collect Data needed to run the workflow

The workflow requires a set of data and database files to run correctly. The following overview shows which data is required (or optional) and where it can be obtained.

----
## Setting up a structure to save the data to
*"Tidiness is half the battle"*: To easily keep track of which data is already there and which is still missing, construct a directory structure with one folder for each type of data(base) that is needed and fill the folders step by step. This can be done with the tool by running the function `set_up.build_data_directories`.

The following directory structure will be created:

- your_folder_name/
    - annotated_genomes/
    - <span style="color:green">BiGG-namespace/ </span>
    - BioCyc/
    - medium/
    - <span style="color:green">MetaNetX/ </span>
    - pan-core-models/
    - RefSeqs/
    - template-models/
    - universal-models/

Most of the folders will be empty. However, the function already downloads the files needed in the <span style="color:green">MetaNetX/ </span> and <span style="color:green">BiGG-namespace/ </span> folders, so those two need no further adjustment.  

In [2]:
# enable logging for more verbose output (requires `import logging`)
# logging.basicConfig(level=logging.INFO)
set_up.build_data_directories('your_folder_name')

INFO:root:Creating directory structure...
INFO:root:Downloading BiGG-namespace...
INFO:root:Downloading MetaNetX...


Directory test_data_collection/annotated_genomes/ already exists.
Directory test_data_collection/BiGG-namespace/ already exists.
Directory test_data_collection/BioCyc/ already exists.
Directory test_data_collection/RefSeqs/ already exists.
Directory test_data_collection/medium/ already exists.
Directory test_data_collection/MetaNetX/ already exists.
Directory test_data_collection/pan-core-models/ already exists.
Directory test_data_collection/template-models/ already exists.
Directory test_data_collection/universal-models/ already exists.


test_data_collection/MetaNetX/chem_prop.tsv: 645MiB [04:02, 2.79MiB/s] 
test_data_collection/MetaNetX/chem_xref.tsv: 552MiB [03:29, 2.77MiB/s] 
test_data_collection/MetaNetX/reac_prop.tsv: 8.69MiB [00:03, 2.71MiB/s]
test_data_collection/MetaNetX/reac_xref.tsv: 66.2MiB [00:25, 2.72MiB/s]
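
To see at a glance which folders still need to be filled, a small helper along the following lines may be useful. This is only a sketch and not part of the package: the function name `summarise_data_directories` is hypothetical, and the folder names are simply copied from the listing above.

```python
from pathlib import Path

# Folder names copied from the directory listing above.
EXPECTED_SUBFOLDERS = [
    "annotated_genomes",
    "BiGG-namespace",
    "BioCyc",
    "medium",
    "MetaNetX",
    "pan-core-models",
    "RefSeqs",
    "template-models",
    "universal-models",
]


def summarise_data_directories(root):
    """Map each expected subfolder to its number of entries (None if missing).

    Hypothetical helper for illustration, not part of the package.
    """
    root = Path(root)
    summary = {}
    for name in EXPECTED_SUBFOLDERS:
        folder = root / name
        summary[name] = len(list(folder.iterdir())) if folder.is_dir() else None
    return summary


# Print one status line per expected folder.
for name, count in summarise_data_directories("your_folder_name").items():
    status = "missing" if count is None else f"{count} entries"
    print(f"{name}/: {status}")
```

Folders reported with 0 entries are the ones that still have to be filled by hand.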


----
Next, the input for the remaining directories will be discussed.

## template-models/
As this workflow is based on a high-quality template, having a template model is <span style="color:red">required</span> and not optional. Either copy a template model into this folder or remember to adjust the configuration file accordingly (...).

## annotated_genomes/
For the workflow, two annotated genomes, either `.gbff` (NCBI annotation pipeline) or `.faa` (PROKKA annotation pipeline), are <span style="color:red">required</span>: one for the genome that was used to generate the template model and one for the genome that will be used to generate the new model.
> note: if the PROKKA annotation pipeline was used for the new genome, keep the `.fna` file as well, since the whole genome is needed later
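
A quick check of the folder content could look like the sketch below. Both function names are hypothetical and not part of the package, and the `.fna` check assumes that the `.fna` file sits next to the `.faa` file and shares its file name stem.

```python
from pathlib import Path

# Extensions named in the text: .gbff (NCBI) and .faa (PROKKA).
ANNOTATION_EXTENSIONS = {".gbff", ".faa"}


def find_annotated_genomes(folder):
    """Return the sorted .gbff/.faa files found directly in `folder`."""
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in ANNOTATION_EXTENSIONS
    )


def faa_without_fna(genome_files):
    """Return the .faa files that have no .fna file with the same stem.

    Assumption: the .fna file is stored next to the .faa file under the
    same name, differing only in the extension.
    """
    return [
        path
        for path in genome_files
        if path.suffix.lower() == ".faa" and not path.with_suffix(".fna").exists()
    ]
```

If `find_annotated_genomes('your_folder_name/annotated_genomes')` returns fewer than two files, one of the two required genomes is still missing.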

## BioCyc/
- smart table of reactions from BioCyc/MetaCyc
- optional, used to check for directionality
- columns needed: Reactions, EC-Number, KEGG reaction, METANETX, Reaction-Direction
> the column names are listed above; Reactions here means the reaction ID 
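
Before running the workflow, it may be worth checking that an exported smart table really contains these columns. The sketch below assumes that the table was exported as a tab-separated text file with a single header row; the function name is hypothetical.

```python
import csv

# Column names as listed above.
REQUIRED_COLUMNS = [
    "Reactions",
    "EC-Number",
    "KEGG reaction",
    "METANETX",
    "Reaction-Direction",
]


def missing_smart_table_columns(path, delimiter="\t"):
    """Return the required columns that are absent from the table header.

    Assumption: the first line of the file is the header row and the
    file is tab-separated (pass another delimiter if it is not).
    """
    with open(path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle, delimiter=delimiter), [])
    present = {column.strip() for column in header}
    return [column for column in REQUIRED_COLUMNS if column not in present]
```

An empty list means that all five columns were found.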

## medium/
- default medium database already part of the package (no input required, see HowTo about handling media)
- place to add / store additional (new) media
- for more information, take a look at the HowTo..... notebook

## pan-core-models/
- optional, needed for analysis and/or gapfilling
- see other notebook about building a pan-core model

## universal-models/
- optional, needed for gapfilling if no 'big enough' pan-core model is available

## RefSeqs/
<span style="color:red">required</span> for the second *DIAMOND* run in the extension step of the pipeline

*DIAMOND* database
- first, download a set of FASTA files that should be part of the database, e.g. in **faa** format, and collect them in one folder
- use the function `create_DIAMOND_db_from_folder` to create a *DIAMOND* database file from the folder

In [None]:
specimen.util.util.create_DIAMOND_db_from_folder('/User/path/input/directory',
                                                 '/User/Path/for/output/', 
                                                 name = 'database',
                                                 extention = 'faa')
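
The first step of the list above, collecting the downloaded FASTA files in one folder, can be sketched as follows and would be run before `create_DIAMOND_db_from_folder`. The helper `collect_fasta_files` is hypothetical and not part of the package; it simply copies files by extension and skips name clashes instead of overwriting.

```python
import shutil
from pathlib import Path


def collect_fasta_files(sources, target_dir, extension="faa"):
    """Copy all *.<extension> files below the source folders into target_dir.

    Hypothetical helper for illustration. Files whose name already exists
    in target_dir are skipped rather than overwritten.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in sources:
        # rglob searches the source folder recursively.
        for fasta in sorted(Path(source).rglob(f"*.{extension}")):
            destination = target / fasta.name
            if destination.exists():
                continue
            shutil.copy2(fasta, destination)
            copied.append(destination)
    return copied
```

The returned list shows which files were actually copied, which makes skipped duplicates easy to spot.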

NCBI mapping (genome locus tags against old locus tags, EC number etc.)
- download the annotated genome files, ideally gbff, for the genomes used in the previous step
- run the function `create_NCBIinfo_mapping` on the folder with the files to create the mapping

In [None]:
specimen.util.util.create_NCBIinfo_mapping('/User/path/input/directory',
                                           '/User/Path/for/output/', 
                                           extention = 'gbff')

NCBI information files of the genomes
- this file is optional
- it has to be created manually
    - simple CSV file with the following columns: NCBI genome, organism, locus_tag (start) and KEGG.organism
    - the information for the first three columns can be taken from the previous two steps; for the last column, the user needs to check whether the genomes have been entered into KEGG and have an organism identifier
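
Writing such a file with Python's `csv` module might look like the sketch below. The header spelling is an assumption derived from the column list above (in particular, whether the third column is named `locus_tag` or something else should be checked against the package documentation), and the function name is hypothetical.

```python
import csv

# Assumption: header names derived from the column list above; the exact
# spelling expected by the workflow should be verified before use.
NCBI_INFO_COLUMNS = ["NCBI genome", "organism", "locus_tag", "KEGG.organism"]


def write_ncbi_info(rows, path):
    """Write a list of dicts (one per genome) as a CSV file.

    Keys missing from a row, e.g. KEGG.organism for a genome that is not
    in KEGG, are written as empty cells.
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=NCBI_INFO_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

Each row is a dictionary keyed by the column names, so a genome without a KEGG organism identifier can simply omit that key.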

----
## Creating a configuration file

The final input needed for the pipeline is the configuration file, in which the parameters that will be used are stored.

There are two types of configuration files:

- basic: a smaller version that includes fewer parameters but is easier to complete
- advanced: complete version, gives full control of all parameters to the user
   
A new template for either version can be downloaded from a Python script or a notebook by using

In [None]:
set_up.download_config(filename='./my_basic_config.yaml', type='basic')


or via the commandline tool:

- default: `specimen setup config`
- set type: `specimen setup config -t advanced`
- set name: `specimen setup config -f new_config.yaml`


After downloading the default configuration file, open it and adjust the parameters as needed.