# Running EVcouplings pipeline jobs


------

## Content

This notebook demonstrates how to run EVcouplings jobs, and covers the following topics:

(1) __Configuration files__: how to set up and specify settings

(2) __Running jobs__: executing the config file (via the EVcouplings command line applications or directly via python)

(3) __Pipeline stages__: The stages of the pipeline, how to run only certain stages, and how to restart jobs from an arbitrary stage

------

## Configuration files

The config file drives all aspects of the pipeline and is required for every job. Config files are written in YAML, so they can be read and modified programmatically (see below for an example).

> For an example configuration file, see config/sample_config.txt in this repository.

### Setup

Before you run any jobs, you must set up paths to external tools and databases, and specify your batch submission engine (if applicable). Modify the following sections of your config file:

1) in "tools" section:

Several external software tools must be installed in order to run EVcouplings. See the README for a list of tools and installation instructions. Once installed, absolute paths to the binary file for each tool must be specified.

2) in "databases" section:

Several external databases must be downloaded in order to run EVcouplings. See the README for instructions. Absolute paths to these databases must be specified once downloaded.

3) in "environment" section:

If using the evcouplings command line application, specify the batch submission engine and memory requirements for your jobs. 
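
Before submitting a job, it can help to confirm that every configured path actually exists. The following is a minimal sketch, not part of EVcouplings: `find_missing_paths` is a hypothetical helper, the entries in `example_config` are placeholders, and it assumes that any non-path entries in these sections (for example download URLs) should be skipped.

```python
import os

def find_missing_paths(config, sections=("tools", "databases")):
    """Return (section, name, path) triples whose path is unset or missing."""
    missing = []
    for section in sections:
        for name, path in (config.get(section) or {}).items():
            # Assumption: URL-valued entries are not local paths, so skip them
            if isinstance(path, str) and path.startswith(("http://", "https://")):
                continue
            if not path or not os.path.exists(str(path)):
                missing.append((section, name, path))
    return missing

# Placeholder entries for illustration; use the keys from your own config file
example_config = {
    "tools": {"jackhmmer": "/nonexistent/bin/jackhmmer"},
    "databases": {"uniref100": None},
}

for section, name, path in find_missing_paths(example_config):
    print(f"{section}.{name}: path not found ({path})")
```

With a real configuration, the dictionary returned by `read_config_file` could be passed in place of `example_config`.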

### Essential parameters to modify:

The following parameters are needed to specify your job:

1) In "global" section:
* __prefix__: Unique prefix for your run. The prefix includes the output folder, and any string after the last "/" will be the prefix of filenames. Folders will be made automatically. For example, the prefix "output/run4/RASH_HUMAN" would create the subfolders output/run4/ in the current directory, and filenames will start with RASH\_HUMAN. We suggest the prefix convention {UniprotID}\_{region_start}\-{region_end}, e.g. RASH_HUMAN_4-169.
* __sequence_id__ (UniProt ID or AC): Identifier of your sequence. The sequence will be fetched automatically from UniProt.
* __region__: Region of the protein to run, in UniProt numbering. Leave blank for the full sequence, or specify [start_pos, stop_pos].
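
The prefix convention above can be sketched in a few lines. `make_prefix` and `split_prefix` are hypothetical helpers written for this illustration, not EVcouplings functions; they only show how the folder part and the filename part relate to the last "/".

```python
def make_prefix(output_dir, uniprot_id, region=None):
    """Build a prefix following the {UniprotID}_{region_start}-{region_end} convention."""
    name = uniprot_id
    if region is not None:
        start, end = region
        name = f"{uniprot_id}_{start}-{end}"
    return f"{output_dir.rstrip('/')}/{name}"

def split_prefix(prefix):
    """Split a prefix into (output folder, filename prefix) at the last '/'."""
    folder, _, file_prefix = prefix.rpartition("/")
    return folder, file_prefix

prefix = make_prefix("output/run4", "RASH_HUMAN", region=(4, 169))
print(prefix)                # output/run4/RASH_HUMAN_4-169
print(split_prefix(prefix))  # ('output/run4', 'RASH_HUMAN_4-169')
```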

### Additional parameters to modify (non-exhaustive):

The following additional parameters are the most important ones to consider modifying when running your job. See example configuration files for full parameter explanations.

1) In "align" section:
* __domain_threshold__ (bitscore or evalue): Inclusion threshold for sequence alignment. 0.5 is a good starting point for bitscore; use 0.3 for a more inclusive alignment and 0.7 or 0.8 for a less inclusive one.
* __sequence_threshold__ (bitscore or evalue): typically should be same as domain_threshold
* __use_bitscores__: default true, set to false to use evalues for domain_threshold and sequence_threshold
* __iterations__: number of jackhmmer iterations
* __database__: uniref, uniprot (see databases section for options or to modify path)
* __seqid_filter__: Filter alignment at x% sequence identity. Leave empty or set to 100 not to filter any sequences.
* __minimum_sequence_coverage__: Only keep sequences that align to at least x% of the target sequence (formerly called "-f" in buildali)
* __minimum_column_coverage__: Only include alignment columns with at least x% residues (rather than gaps) during model inference. (formerly called "-m" in buildali, but now the other way round: -m 30 corresponds to minimum_column_coverage=70)
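
To make the relationship between the old "-m" value and __minimum_column_coverage__ concrete, here is a small sketch on a toy alignment. It assumes "-" is the only gap character and that coverage is the percentage of non-gap characters per column; the function names are hypothetical and the pipeline's own implementation may differ in detail.

```python
def buildali_m_to_column_coverage(m):
    """Convert an old buildali -m value to minimum_column_coverage (in percent)."""
    return 100 - m

def column_coverage(alignment, gap="-"):
    """Percentage of non-gap characters in each alignment column."""
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    return [
        100.0 * sum(seq[i] != gap for seq in alignment) / n_seqs
        for i in range(n_cols)
    ]

# Toy alignment; real alignments are far larger
alignment = ["MKV-L", "MK--L", "M-VAL", "MKV-L"]

threshold = buildali_m_to_column_coverage(30)  # -m 30 corresponds to 70
coverage = column_coverage(alignment)
kept_columns = [i for i, c in enumerate(coverage) if c >= threshold]

print(threshold)     # 70
print(coverage)      # [100.0, 75.0, 75.0, 25.0, 100.0]
print(kept_columns)  # [0, 1, 2, 4]
```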

2) in "couplings" section:
* __theta__: Clustering threshold for downweighting redundant sequences when inferring the model and computing the Meff. For example, 0.8 will cluster sequences at an 80% sequence identity cutoff (note that this is 1 - theta used in previous pipelines)
* __iterations__: how many plmc iterations to run
* __ignore_gaps__: exclude gaps from EC calculation
* __save_model__: save binary file with model parameters or not. These files can be very large.
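
The effect of __theta__ on the effective number of sequences (Meff) can be illustrated with a common reweighting scheme: each sequence gets weight 1/n, where n is the number of sequences (including itself) that are at least theta identical to it, and Meff is the sum of the weights. This is a sketch under stated assumptions, not the plmc implementation; the treatment of gaps and of the exact boundary (">=" versus ">") is an assumption here.

```python
def fraction_identical(a, b):
    """Fraction of positions at which two aligned sequences are identical."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def sequence_weights(alignment, theta=0.8):
    """Weight each sequence by 1 / (number of sequences within the identity cutoff)."""
    weights = []
    for a in alignment:
        # Each sequence counts itself as a neighbour, so n_neighbours >= 1
        n_neighbours = sum(fraction_identical(a, b) >= theta for b in alignment)
        weights.append(1.0 / n_neighbours)
    return weights

# Toy alignment: three near-identical sequences and one unrelated sequence
alignment = ["MKVAL", "MKVAL", "MKVAI", "QRSTW"]

weights = sequence_weights(alignment, theta=0.8)
meff = sum(weights)
print(weights)
print(meff)
```

In this toy case the three similar sequences share one unit of weight and the unrelated sequence keeps full weight, so four sequences contribute an Meff of about two.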

3) in "compare" section:
* __by_alignment__: If True, structures for comparison will be identified by homology search; otherwise use only structures for given sequence_id (must be UniProt ID or AC)
* __pdb_alignment_method__: jackhmmer or hmmsearch. Method for searching the PDB sequence database for homologous structures. Jackhmmer is more stringent, hmmsearch is more lenient. 
* __pdb_ids__: Filters the list of PDB IDs used for comparison. If pdb_ids is None, compares to all PDB structures for the given sequence_id. If it is a list of PDB IDs, compares to that subset of structures only.
* __compare_multimer__: Besides intra-chain contacts, also identify homomultimer contacts and use in evaluation.
* __distance_cutoff__: Maximum distance in angstroms for a residue pair to be considered as a contact
* __min_sequence_distance__: Only use pairs that are at least this distant in sequence for evaluation
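
The roles of __distance_cutoff__ and __min_sequence_distance__ in evaluating couplings can be sketched as follows. `ec_precision` is a hypothetical helper, the default values are illustrative only, and the pair list and distances are made-up toy data; the pipeline's own evaluation may treat missing structural data differently.

```python
def ec_precision(ranked_pairs, distances, distance_cutoff=5.0, min_sequence_distance=6):
    """Fraction of evaluated residue pairs that are contacts in the structure.

    ranked_pairs: list of (i, j) residue index pairs, strongest coupling first
    distances: dict mapping (i, j) to a distance in angstroms
    """
    hits = 0
    evaluated = 0
    for i, j in ranked_pairs:
        # Skip pairs that are too close in sequence to be informative
        if abs(i - j) < min_sequence_distance:
            continue
        # Skip pairs without structural information
        if (i, j) not in distances:
            continue
        evaluated += 1
        if distances[(i, j)] <= distance_cutoff:
            hits += 1
    return hits / evaluated if evaluated else float("nan")

# Toy data for illustration only
pairs = [(10, 50), (12, 14), (20, 80), (30, 60)]
distances = {(10, 50): 4.2, (12, 14): 3.8, (20, 80): 12.5}

print(ec_precision(pairs, distances))  # 0.5
```

Here (12, 14) is excluded by the sequence distance filter and (30, 60) has no structural data, leaving one contact out of two evaluated pairs.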

4) in "mutate" section:


### Configuration rules

Configuration files are handled internally according to the following rules:

1) Global settings override settings of the same name for stages

2) Outputs of a stage are merged into "global" and fed into input of subsequent stages. This allows results to be passed on from stage to stage.

3) All settings are explicitly specified in the configuration. No hidden defaults in code.

4) Each stage is also passed the databases and tools sections
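
Rules 1, 2 and 4 can be pictured with plain dictionaries. The sketch below is an illustration of the rules as stated, not the pipeline's actual code; `stage_input` and `merge_stage_output` are hypothetical names, and the paths and the `alignment_file` output key are placeholders.

```python
import copy

def stage_input(config, stage):
    """Assemble the settings passed to one stage."""
    settings = dict(config.get(stage) or {})
    # Rule 1: global settings override stage settings of the same name
    settings.update(config.get("global") or {})
    # Rule 4: each stage also receives the databases and tools sections
    settings["databases"] = config.get("databases")
    settings["tools"] = config.get("tools")
    return settings

def merge_stage_output(config, output):
    """Rule 2: merge the outputs of a stage into "global" (returns a new config)."""
    merged = copy.deepcopy(config)
    merged["global"].update(output)
    return merged

config = {
    "global": {"prefix": "output/RASH_HUMAN", "theta": 0.8},
    "couplings": {"theta": 0.9, "iterations": 100},
    "tools": {"plmc": "/path/to/plmc"},
    "databases": {"uniref100": "/path/to/uniref100.fasta"},
}

# Pretend the align stage produced this output (placeholder key and path)
config_after_align = merge_stage_output(
    config, {"alignment_file": "output/RASH_HUMAN/align/RASH_HUMAN.a2m"}
)

couplings_input = stage_input(config_after_align, "couplings")
print(couplings_input["theta"])           # 0.8, the global value wins
print(couplings_input["alignment_file"])  # passed on from the previous stage
```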


### Batch jobs

By specifying the "batch" section in the config file, one can easily generate multiple jobs:

    batch:
        _b0.75:
           align: {domain_threshold: 0.75, sequence_threshold: 0.75}
        _b0.3:
           align: {domain_threshold: 0.3, sequence_threshold: 0.3}

This example will create two jobs, which extend the global job prefix by _b0.75 and _b0.3 and replace the settings in the "align" section with the given values. All other configuration settings remain identical across the jobs.

Batch jobs can only be executed with the *evcouplings* application (see below for details); *evcouplings_runcfg* ignores the "batch" section.
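
Conceptually, the "batch" section expands one configuration into several. The following sketch mimics that expansion with plain dictionaries to show what each resulting job looks like; `expand_batch` is a hypothetical helper, not the code the *evcouplings* application runs, and the `iterations` value is a placeholder.

```python
import copy

def expand_batch(config):
    """Return {job_prefix: job_config} for every entry in the "batch" section."""
    jobs = {}
    base_prefix = config["global"]["prefix"]
    for suffix, overrides in (config.get("batch") or {}).items():
        job = copy.deepcopy(config)
        job.pop("batch", None)
        # The batch key extends the global job prefix
        job["global"]["prefix"] = base_prefix + suffix
        # Given values replace settings in the named section; others are kept
        for section, settings in overrides.items():
            job[section] = {**(job.get(section) or {}), **settings}
        jobs[job["global"]["prefix"]] = job
    return jobs

config = {
    "global": {"prefix": "output/RASH_HUMAN"},
    "align": {"domain_threshold": 0.5, "sequence_threshold": 0.5, "iterations": 5},
    "batch": {
        "_b0.75": {"align": {"domain_threshold": 0.75, "sequence_threshold": 0.75}},
        "_b0.3": {"align": {"domain_threshold": 0.3, "sequence_threshold": 0.3}},
    },
}

for prefix, job in expand_batch(config).items():
    print(prefix, job["align"])
```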

### Modifying the config file from within Python:

The configuration files used by EVcouplings are standard YAML files that translate directly to standard data structures such as lists and dictionaries. This means that configuration files can easily be loaded, modified programmatically (e.g. when running large numbers of jobs), and written back to file.

Also, the output state after running a pipeline is stored in YAML files, which means the results can be easily accessed and passed on to other code.

```python
from evcouplings.utils import read_config_file, write_config_file

# Load the sample configuration as nested dictionaries and lists
config = read_config_file("../config/sample_config_monomer.txt", preserve_order=True)

# Change the settings that define this job
config["global"]["prefix"] = "output/RASH_HUMAN"
config["global"]["sequence_id"] = "RASH_HUMAN"
config["align"]["domain_threshold"] = 0.5

# Store the modified configuration in a new file
write_config_file("test_config.txt", config)
```

------

## Running the pipeline

EVcouplings provides three ways of executing configuration files. The applications *evcouplings_runcfg* and *evcouplings* will be created automatically and put on your PATH when installing the Python package (i.e. they should be available from any directory):

### evcouplings_runcfg

Execute a single configuration file (this will ignore the "batch" section and run a *single* thread of the pipeline):

```bash
evcouplings_runcfg <config_file>
```

### evcouplings

This is a wrapper around *evcouplings_runcfg* which provides three major additional functions:
* for convenience, overwrite parameters in a config file using command line flags (e.g. to simply change proteins or specify a list of E-value thresholds)
* execute batch jobs (e.g. for scanning different evolutionary depths)
* submit to batch computing environments to parallelize jobs, and create summary statistics and visualizations of results

Running evcouplings requires specifying the "environment" section in the configuration file. Currently, we support local execution using multithreading, LSF, and Slurm. If you need another environment (e.g. SGE or Torque), please consider implementing it and submitting a pull request!

For a list of the available command line arguments, please run

```bash
evcouplings --help
```

To run a config file, execute:

```bash
evcouplings [options] <config_file>
```

Example of overriding the settings in the config file to run different bitscore thresholds of a protein (this will create six independent jobs, one for each bitscore):

```bash
evcouplings -P output/RASH_HUMAN -p RASH_HUMAN -b "0.1, 0.2, 0.3, 0.4, 0.5, 0.6" sample_config.txt
```

Please note: currently, overriding settings using the command line is only supported for monomer jobs. Future releases will support command line arguments for complex jobs.

### Running pipeline from within Python

Configuration files can also be directly run from within Python:

```python
from evcouplings.utils import read_config_file
from evcouplings.utils.pipeline import execute

# Load the configuration and run the pipeline;
# the output state of the run is returned
config = read_config_file("test_config.txt")
outcfg = execute(**config)
```

------

## Pipeline stages

The evcouplings pipeline is split into different stages for different aspects of computation. This allows users to run only the desired stages, and to re-start computation from an arbitrary stage. The stages are as follows:

__align__: Creates sequence alignments or reads in an existing alignment, and pre-processes alignment for couplings calculation.

__couplings__: Calculates evolutionary couplings, post-processes output into correct numbering, and fits statistical scoring model to couplings. 

__compare__: Searches for structures against which to compare the couplings, visualizes couplings in a contact map, calculates precision of couplings against structure. 

__mutate__: Generates and visualizes matrices of the predicted effects of mutations at every site. 

__fold__: Generates folded models using the evolutionary couplings as input, compares folded models to existing crystal structures (if applicable)


### Specifying stages to run

In the "stages" section of the configuration file, simply comment out the stages you wish to omit using a "#". Note that stages must be run in sequential order. For example, you cannot run "compare" without first running "align" and "couplings". However, you can run "align" and "couplings" without running "compare", "mutate", or "fold".
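
When a configuration is modified from Python rather than by hand, the same effect can be achieved by editing the list of stages. The sketch below assumes that the "stages" section loads as a plain list of stage names in the order listed above; `PIPELINE_STAGES` and `stages_up_to` are hypothetical names introduced for this example.

```python
# Stage order as listed in this document
PIPELINE_STAGES = ["align", "couplings", "compare", "mutate", "fold"]

def stages_up_to(last_stage):
    """Return all stages from the start of the pipeline up to last_stage."""
    return PIPELINE_STAGES[: PIPELINE_STAGES.index(last_stage) + 1]

# Stand-in for a loaded configuration (assumed to hold a list of stage names)
config = {"stages": list(PIPELINE_STAGES)}

# Equivalent to commenting out "compare", "mutate" and "fold" in the file
config["stages"] = stages_up_to("couplings")
print(config["stages"])  # ['align', 'couplings']
```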

### Re-starting jobs at an arbitrary stage

The user can re-run certain stages of a job without re-running all of them by restarting at any stage for which all previous stages have completed successfully. This is done by using an identical prefix and commenting out the stages to skip. __Be aware__ that this will overwrite the results of any stage that is being re-run, so this is risky.


Example 1: The user has run "align", "couplings", and "compare", and now wants to run "fold" and "mutate" for their protein. Copy the configuration file and then comment out "align", "couplings", and "compare" under "stages", while leaving "fold" and "mutate" uncommented. Then submit the new configuration file. The evcouplings application will validate that all outputs of the previous stages exist (based on the user-defined prefix) before submitting the new job.

Example 2: The user has run "align", "couplings", and "compare", and wishes to re-run "compare" with different parameters. Copy the configuration file and then comment out "align" and "couplings" under "stages". Then modify parameters in the "compare" stage and re-submit the job using the newly modified config file. This is somewhat risky, because it will overwrite previous results for the "compare" stage.
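
The two examples can be expressed as a small check: a restart is valid if every stage before a requested stage has either completed already or is itself requested, and any requested stage that has completed before will be overwritten. This is a sketch of that rule as described here, not the validation code of the evcouplings application; `check_restart` is a hypothetical helper.

```python
PIPELINE_ORDER = ["align", "couplings", "compare", "mutate", "fold"]

def check_restart(requested, completed, order=PIPELINE_ORDER):
    """Validate a restart and return the stages whose results would be overwritten."""
    requested, completed = set(requested), set(completed)
    unknown = (requested | completed) - set(order)
    if unknown:
        raise ValueError(f"Unknown stages: {sorted(unknown)}")
    for stage in order:
        if stage not in requested:
            continue
        # All earlier stages must have been run before or be part of this run
        earlier = order[: order.index(stage)]
        missing = [s for s in earlier if s not in completed and s not in requested]
        if missing:
            raise ValueError(f"Cannot run {stage!r}: missing earlier stages {missing}")
    return [s for s in order if s in requested and s in completed]

done = ["align", "couplings", "compare"]

# Example 1: run "fold" and "mutate" after the first three stages
print(check_restart(["fold", "mutate"], done))  # []

# Example 2: re-run "compare"; its previous results would be overwritten
print(check_restart(["compare"], done))  # ['compare']
```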

By default the evcouplings app will not allow the user to submit jobs that overwrite previous results. Use the "yolo" flag to live dangerously and force an override.  

```bash
evcouplings --yolo [options] <config_file>
```