# Running EVcouplings pipeline jobs

## Content

This notebook demonstrates how to run EVcouplings jobs by specifying the job settings in a configuration file and then executing it using the EVcouplings command line applications.

## Configuration files

The configuration file drives all aspects of the pipeline (which query protein to run, and which parameters to use, such as bitscore thresholds and databases). Config files are written in YAML, so they can be read and modified programmatically (see below for an example).

> For an example configuration file, see config/sample_config.txt in this repository.

### Parameters that might need to be modified (non-exhaustive):

The following parameters are the most important ones to consider modifying when running your job:

1) In "global" section:
* prefix: output location and filename prefix. Everything after the last "/" becomes the prefix of all filenames, and folders are created automatically. E.g. output/run4/RASH_HUMAN creates the subfolders output/run4/ in the current directory, and filenames start with RASH_HUMAN (the name can be arbitrary). A good convention is e.g. RASH_HUMAN_4-169
* sequence_id (UniProt ID or AC): sequence will be automatically fetched from UniProt
* region: leave blank for full sequence or put [start_pos, stop_pos]

2) In "align" section:
* domain_threshold (bitscore or evalue): 0.5 is a good starting point for bitscore; use e.g. 0.3 for a more inclusive and 0.7 or 0.8 for a less inclusive alignment
* sequence_threshold (bitscore or evalue): should typically be the same as domain_threshold
* iterations: number of jackhmmer iterations
* database: uniref, uniprot (see databases section for options or to modify path)
* seqid_filter: Filter alignment at x% sequence identity. Leave empty or set to 100 not to filter any sequences.
* minimum_sequence_coverage (formerly called "-f" in buildali)
* minimum_column_coverage (formerly called "-m" in buildali, but now the other way round: -m 30 corresponds to minimum_column_coverage=70)

3) In "couplings" section:
* theta (note that this is 1 - theta as previously used in pipelines, e.g. this pipeline uses 0.8 rather than 0.2 to cluster at 80% sequence identity)
* iterations: how many plmc iterations to run
* ignore_gaps: exclude gaps from EC calculation
* save_model: save binary file with model parameters or not

4) In "compare" section:

* by_alignment: If True, structures for comparison will be identified by homology search; otherwise use only structures for given sequence_id (must be UniProt ID or AC)
* pdb_ids: Used if by_alignment is False. If pdb_ids is None, compares to all PDB structures for given sequence_id. If list of PDB IDs, compare to that subset of structures only.
* compare_multimer: Besides intra-chain contacts, also identify homomultimer contacts and use in evaluation.
* distance_cutoff: Maximum distance for a residue pair to be considered as a contact
* min_sequence_distance: Only use pairs that are at least this distant in sequence for evaluation
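Several of the parameters above involve small conversions (prefix splitting, the inverted buildali "-m" value, and the inverted theta). The following is a minimal sketch of that arithmetic using only the standard library. The helper names `split_prefix`, `column_coverage_from_buildali_m` and `theta_from_legacy` are hypothetical and are not part of EVcouplings.

```python
import os


def split_prefix(prefix):
    """Split a job prefix into (output folder, filename prefix)."""
    # everything after the last "/" is the filename prefix
    folder, name = os.path.split(prefix)
    return folder, name


def column_coverage_from_buildali_m(m):
    """Convert a buildali "-m" percentage to minimum_column_coverage."""
    # the meaning is inverted: -m 30 corresponds to 70
    return 100 - m


def theta_from_legacy(legacy_theta):
    """Convert the previously used theta (e.g. 0.2) to the value used here."""
    # rounding only guards against floating point noise
    return round(1 - legacy_theta, 10)


print(split_prefix("output/run4/RASH_HUMAN"))
print(column_coverage_from_buildali_m(30))
print(theta_from_legacy(0.2))
```

With the values from the list above, the prefix splits into the folder output/run4 and the filename prefix RASH_HUMAN, "-m 30" maps to a minimum_column_coverage of 70, and a legacy theta of 0.2 maps to 0.8.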

### Configuration rules

Configuration files are handled internally according to the following rules:

1) Global settings override settings of the same name for stages

2) Outputs of a stage are merged into "global" and fed into input of subsequent stages. This allows results to be passed on from stage to stage.

3) All settings are explicitly specified in the configuration. No hidden defaults in code.

4) Each stage is also passed the databases and tools sections
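To make rules 1, 2 and 4 more concrete, here is a rough sketch of how the input of a stage could be assembled from plain dictionaries. This is an illustration of the rules only, not the code EVcouplings actually uses: the functions `stage_input` and `merge_outputs`, the key layout, and the file paths are all assumptions made for this example.

```python
def stage_input(config, stage):
    """Assemble the settings passed to one stage (illustrative only)."""
    # start from the stage's own settings
    settings = dict(config.get(stage) or {})
    # rule 1: global settings override stage settings of the same name
    settings.update(config.get("global") or {})
    # rule 4: every stage also receives the databases and tools sections
    settings["databases"] = config.get("databases")
    settings["tools"] = config.get("tools")
    return settings


def merge_outputs(config, outputs):
    """Rule 2: merge the outputs of a stage into "global" for later stages."""
    merged = dict(config)
    merged["global"] = {**(config.get("global") or {}), **outputs}
    return merged


# placeholder configuration; paths and values are made up for this sketch
config = {
    "global": {"prefix": "output/RASH_HUMAN", "theta": 0.8},
    "couplings": {"theta": 0.9, "iterations": 100},
    "databases": {"uniref100": "/path/to/uniref100.fasta"},
    "tools": {"plmc": "/path/to/plmc"},
}

couplings_in = stage_input(config, "couplings")
print(couplings_in["theta"], couplings_in["iterations"])

# hypothetical output of an earlier stage, made visible to later stages
config = merge_outputs(config, {"alignment_file": "output/RASH_HUMAN.a2m"})
print(stage_input(config, "couplings")["alignment_file"])
```

In this sketch the global theta wins over the stage-level theta, while iterations, which is only set for the stage, passes through unchanged.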


### Batch jobs

By specifying the "batch" section in the config file, one can easily generate multiple jobs:

    batch:
        _b0.75:
           align: {domain_threshold: 0.75, sequence_threshold: 0.75}
        _b0.3:
           align: {domain_threshold: 0.3, sequence_threshold: 0.3}

This example will create two jobs, which extend the global job prefix by _b0.75 and _b0.3, and replace the settings in the "align" section with the given values. All other configuration settings stay the same for both jobs.

To execute batch jobs, the *evcouplings* application has to be used (see below for details, not possible using *evcouplings_runcfg*).
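The expansion of a "batch" section into individual jobs can be pictured roughly as follows. This is a simplified sketch on plain dictionaries, not the implementation used by the *evcouplings* application: the function `expand_batch` is hypothetical, and dropping the "batch" section from each sub-job is an assumption.

```python
from copy import deepcopy


def expand_batch(config):
    """Create one independent job configuration per batch entry."""
    jobs = {}
    for suffix, overrides in (config.get("batch") or {}).items():
        job = deepcopy(config)
        # assumption: sub-jobs do not carry the batch section themselves
        job.pop("batch", None)
        # extend the global prefix by the batch key, e.g. "_b0.75"
        job["global"]["prefix"] = config["global"]["prefix"] + suffix
        # replace only the listed settings; everything else is kept
        for section, values in overrides.items():
            job[section].update(values)
        jobs[job["global"]["prefix"]] = job
    return jobs


config = {
    "global": {"prefix": "output/RASH_HUMAN", "sequence_id": "RASH_HUMAN"},
    "align": {"domain_threshold": 0.5, "sequence_threshold": 0.5, "iterations": 5},
    "batch": {
        "_b0.75": {"align": {"domain_threshold": 0.75, "sequence_threshold": 0.75}},
        "_b0.3": {"align": {"domain_threshold": 0.3, "sequence_threshold": 0.3}},
    },
}

jobs = expand_batch(config)
for prefix, job in jobs.items():
    print(prefix, job["align"])
```

Using a deep copy keeps the jobs independent of each other, so changing the thresholds of one job does not affect the original configuration or the other job.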

### Modifying the config file from within Python:

The configuration files used by EVcouplings are standard YAML files that directly translate to standard data structures such as lists and dictionaries. This means that configuration files can be easily loaded, modified programmatically (e.g. when running large numbers of jobs), and written back to file.

Also, the output state after running a pipeline is stored in YAML files, which means the results can be easily accessed and passed on to other code.

In [None]:
from evcouplings.utils import read_config_file, write_config_file

config = read_config_file("../config/sample_config.txt", preserve_order=True)
config["global"]["prefix"] = "output/RASH_HUMAN"
config["global"]["sequence_id"] = "RASH_HUMAN"
config["align"]["domain_threshold"] = 0.5

write_config_file("test_config.txt", config)

## Running the pipeline

EVcouplings provides three ways of executing configuration files. The applications *evcouplings_runcfg* and *evcouplings* will be created automatically and put on your PATH when installing the Python package (i.e. they should be available from any directory):

### evcouplings_runcfg

Execute a single configuration file (this will ignore the "batch" section and run a *single* thread of the pipeline):

```bash
evcouplings_runcfg <config_file>
```

If running in a batch computing environment, the user is responsible for submitting jobs etc.

### evcouplings

This is a wrapper around *evcouplings_runcfg* which provides three major additional functions:
* for convenience, overwrite parameters in a config file using command line flags (e.g. to simply change proteins or specify a list of E-value thresholds)
* execute batch jobs (e.g. for scanning different evolutionary depths)
* submit to batch computing environments to parallelize jobs, and create summary statistics and visualizations of results

Running evcouplings requires the "environment" section to be specified in the configuration file. Currently, we only support local execution using multithreading, and LSF. If you need another environment (e.g. SGE, Torque or Slurm), please consider implementing it and submitting a pull request!

For a list of the available command line arguments, please run

```bash
evcouplings --help
```

To run a config file, execute:

```bash
evcouplings [options] <config_file>
```

Example of overriding the settings in the config file to run a protein at different bitscore thresholds (this will create six independent jobs, one for each bitscore):

```bash
evcouplings -P output/RASH_HUMAN -p RASH_HUMAN -b "0.1, 0.2, 0.3, 0.4, 0.5, 0.6" sample_config.txt
```
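When launching such scans from a script, it can be convenient to assemble the call to the *evcouplings* application in Python. The sketch below only builds and prints the argument list, using the flags shown in the example above (-P, -p, -b). The helper `build_evcouplings_command` is hypothetical and not part of the package.

```python
import shlex


def build_evcouplings_command(config_file, prefix, protein, bitscores):
    """Assemble the argument list for one evcouplings call."""
    # -b takes all thresholds as a single comma-separated string
    thresholds = ", ".join(str(b) for b in bitscores)
    return ["evcouplings", "-P", prefix, "-p", protein, "-b", thresholds, config_file]


cmd = build_evcouplings_command(
    "sample_config.txt",
    "output/RASH_HUMAN",
    "RASH_HUMAN",
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
)

# print a shell-quoted version of the command for inspection
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to actually start the jobs. Passing a list rather than a single string avoids shell quoting problems with the comma-separated threshold argument.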

### Running pipeline from within Python

Configuration files can also be directly run from within Python:

In [None]:
from evcouplings.utils import read_config_file
from evcouplings.utils.pipeline import execute

config = read_config_file("test_config.txt")
outcfg = execute(**config)