# Running EVcouplings complex pipeline jobs

## Content

This notebook demonstrates how to run EVcouplings for heteromultimeric complex predictions by specifying the job settings in a configuration file and then executing it using the EVcouplings command line applications.

## Configuration files

The configuration file drives all aspects of the pipeline (which query proteins to run, and which parameters to use, such as bitscore thresholds and databases). Configuration files are in machine-readable YAML format (see below for an example of how to modify them programmatically).

> For an example configuration file, see config/sample_config_complex.txt in this repository.

### Parameters that might need to be modified (non-exhaustive):

The following parameters are the most important ones to consider modifying when running your job:

1) In the "global" section:
* prefix: output folder and filename prefix. Everything up to the last "/" is the output folder (folders are created automatically), and everything after the last "/" becomes the prefix of all filenames. For example, output/run4/PhoP_PhoQ creates the subfolders output/run4/ in the current directory, and all filenames start with PhoP_PhoQ.
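
As a minimal sketch of how such a prefix decomposes, the following splits a prefix at its last "/". The helper name `split_prefix` is hypothetical and only illustrates the rule described above; it is not part of EVcouplings.

```python
def split_prefix(prefix):
    """Split a prefix at the last "/" into (output folder, filename prefix).

    Hypothetical helper for illustration only.
    """
    folder, _, file_prefix = prefix.rpartition("/")
    return folder, file_prefix


folder, file_prefix = split_prefix("output/run4/PhoP_PhoQ")
print(folder)       # output/run4
print(file_prefix)  # PhoP_PhoQ
```

A prefix without any "/" yields an empty folder, i.e. files would be written to the current directory.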

2) In the "align_1" and "align_2" sections:
* alignment_protocol: choose either 'existing' to use an input alignment or 'standard' to generate an alignment using the monomer alignment protocol (see notebooks/running_jobs_monomers.ipynb for more explanation)
* input_alignment: input alignment file, required for the 'existing' alignment protocol
* override_annotation_file: input annotation file, suggested for the 'existing' alignment protocol. This overrides the annotations generated when postprocessing the input_alignment, which may be incomplete depending on the alignment format
* sequence_id: UniProt identifier of the sequence to run
* region: region of the sequence to run, leave blank for the full sequence
* domain_threshold (bitscore or E-value): domain inclusion threshold for the alignment
* sequence_threshold: typically should be the same as domain_threshold

3) In the "concatenate" section:
* protocol: two protocols are currently available, which pair sequences either by closest reciprocal distance on the genome (genome_distance) or by best hit to the target sequence for each genome (best_hit)
* minimum_sequence_coverage (formerly called "-f" in buildali)
* minimum_column_coverage (formerly called "-m" in buildali, but with inverted meaning: -m 30 corresponds to minimum_column_coverage=70)
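
The inverted meaning of the column coverage parameter can be sketched as a simple conversion. The function name below is hypothetical, and the sketch assumes both values are percentages between 0 and 100, as the example "-m 30 corresponds to 70" suggests.

```python
def column_coverage_from_buildali_m(m):
    """Convert an old buildali "-m" value to minimum_column_coverage.

    Hypothetical helper; assumes both values are percentages (0-100).
    """
    if not 0 <= m <= 100:
        raise ValueError("expected a percentage between 0 and 100")
    return 100 - m


print(column_coverage_from_buildali_m(30))  # 70
```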

4) In the "couplings" section:
* theta (note that this is 1 - theta as used in previous pipelines, e.g. this pipeline uses 0.8 rather than 0.2 to cluster at 80% sequence identity)
* iterations: how many plmc iterations to run
* ignore_gaps: exclude gaps from EC calculation
* save_model: whether to save a binary file with the model parameters
* scoring: scoring model used to assess confidence in the computed ECs
* use_all_ecs_for_scoring: if True, run the scoring model on inter- and intra-protein ECs simultaneously
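
When porting a theta value from an older setup, the relationship above amounts to taking the complement. The sketch below uses a hypothetical helper name and a plain dictionary standing in for the "couplings" section; the rounding only guards against floating-point noise.

```python
def theta_from_old_convention(old_theta):
    """Convert an old-style theta (e.g. 0.2) to the identity threshold
    used by this pipeline (e.g. 0.8). Hypothetical helper."""
    if not 0.0 <= old_theta <= 1.0:
        raise ValueError("theta must be between 0 and 1")
    return round(1.0 - old_theta, 10)


# plain dictionary standing in for the "couplings" section of a config
couplings = {"theta": theta_from_old_convention(0.2)}
print(couplings["theta"])  # 0.8
```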

5) In the "compare" section:

Note: arguments with the prefix "first" apply to the first monomer, and arguments with the prefix "second" apply to the second monomer. The general usage is explained here, but each parameter needs to be set independently for each monomer.
* by_alignment: If True, structures for comparison will be identified by homology search; otherwise use only structures for given sequence_id (must be UniProt ID or AC)
* pdb_ids: Used if by_alignment is False. If pdb_ids is None, compares to all PDB structures for given sequence_id. If list of PDB IDs, compare to that subset of structures only.
* compare_multimer: Besides intra-chain contacts, also identify homomultimer contacts and use in evaluation.
* distance_cutoff: Maximum distance for a residue pair to be considered as a contact
* min_sequence_distance: Only use pairs that are at least this distant in sequence for evaluation
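
To illustrate how distance_cutoff and min_sequence_distance interact, here is a minimal sketch for intra-chain pairs. It is not the EVcouplings implementation: the function name, the tuple layout, the example values, and the choice to treat the cutoff as inclusive are all assumptions made for illustration.

```python
def evaluate_pairs(pairs, distance_cutoff, min_sequence_distance):
    """Label residue pairs as contact / non-contact for evaluation.

    pairs: iterable of (i, j, distance) tuples for one chain.
    Pairs closer than min_sequence_distance in sequence are skipped;
    the remaining pairs count as contacts if distance <= distance_cutoff
    (inclusive cutoff is an assumption).
    """
    evaluated = []
    for i, j, dist in pairs:
        if abs(i - j) < min_sequence_distance:
            continue  # too close in sequence, not used for evaluation
        evaluated.append((i, j, dist <= distance_cutoff))
    return evaluated


# hypothetical example values
pairs = [(10, 12, 4.1), (10, 30, 4.8), (15, 60, 9.3)]
print(evaluate_pairs(pairs, distance_cutoff=5.0, min_sequence_distance=6))
# [(10, 30, True), (15, 60, False)]
```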

### Configuration rules

Configuration files are handled internally according to the following rules:

1) Global settings override settings of the same name for stages

2) Outputs of a stage are merged into "global" and fed into input of subsequent stages. This allows results to be passed on from stage to stage.

3) All settings are explicitly specified in the configuration. No hidden defaults in code.

4) Each stage is also passed the databases and tools sections
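
The rules above can be sketched with plain dictionaries. This is an illustration of the described behaviour under stated assumptions, not the actual EVcouplings code: the function names, the example keys (`iterations`, `alignment_file`, `uniref100`, `plmc`), the presence of `theta` in "global", and the choice to pass "databases" and "tools" as nested entries are all hypothetical.

```python
import copy


def stage_input(config, stage):
    """Assemble the settings passed to one stage (rules 1 and 4)."""
    settings = copy.deepcopy(config.get(stage, {}))
    # rule 1: global settings override stage settings of the same name
    settings.update(copy.deepcopy(config["global"]))
    # rule 4: every stage also receives the databases and tools sections
    for section in ("databases", "tools"):
        settings[section] = copy.deepcopy(config.get(section, {}))
    return settings


def merge_outputs(config, stage_output):
    """Rule 2: merge the outputs of a stage into "global"."""
    config["global"].update(stage_output)


config = {
    "global": {"prefix": "output/run4/PhoP_PhoQ", "theta": 0.8},
    "couplings": {"theta": 0.9, "iterations": 100},
    "databases": {"uniref100": "/path/to/uniref100.fasta"},
    "tools": {"plmc": "/path/to/plmc"},
}

# output of a hypothetical earlier stage becomes visible to later stages
merge_outputs(config, {"alignment_file": "output/run4/PhoP_PhoQ.a2m"})

settings = stage_input(config, "couplings")
print(settings["theta"])           # 0.8 (global value wins)
print(settings["alignment_file"])  # output/run4/PhoP_PhoQ.a2m
```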


### Batch jobs

Batch jobs are currently not available for the complex pipeline.

### Modifying the config file from within Python:

The configuration files used by EVcouplings are standard YAML files that translate directly to standard data structures such as lists and dictionaries. This means that configuration files can easily be loaded, modified programmatically (e.g. when running large numbers of jobs), and written back to file.

Also, the output state after running a pipeline is stored in YAML files, which means the results can be easily accessed and passed on to other code.

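```python
from evcouplings.utils import read_config_file, write_config_file

# load the sample configuration, keeping the order of entries in the file
config = read_config_file("../config/sample_config_complex.txt", preserve_order=True)

# change the output prefix and the settings for the first monomer
config["global"]["prefix"] = "output/complex_test"
config["align_1"]["sequence_id"] = "RASH_HUMAN"
config["align_1"]["domain_threshold"] = 0.5

# store the modified configuration in a new file
write_config_file("test_config.txt", config)
```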
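
When many jobs are needed, the same idea extends to generating several configuration variants in a loop. The sketch below works on a plain dictionary so that it runs without EVcouplings installed; the function name, the filename pattern, and the `_b<threshold>` prefix suffix are hypothetical naming choices, and setting sequence_threshold equal to domain_threshold follows the recommendation in the parameter list above.

```python
import copy


def make_threshold_variants(config, thresholds):
    """Return {filename: config} with one config copy per threshold."""
    variants = {}
    base_prefix = config["global"]["prefix"]
    for threshold in thresholds:
        variant = copy.deepcopy(config)  # leave the original untouched
        variant["global"]["prefix"] = "{}_b{}".format(base_prefix, threshold)
        for section in ("align_1", "align_2"):
            variant[section]["domain_threshold"] = threshold
            variant[section]["sequence_threshold"] = threshold
        variants["config_b{}.txt".format(threshold)] = variant
    return variants


# minimal stand-in for a loaded configuration
base = {
    "global": {"prefix": "output/complex_test"},
    "align_1": {"domain_threshold": 0.5, "sequence_threshold": 0.5},
    "align_2": {"domain_threshold": 0.5, "sequence_threshold": 0.5},
}

variants = make_threshold_variants(base, [0.3, 0.5])
for filename, cfg in variants.items():
    print(filename, cfg["global"]["prefix"])
    # with EVcouplings installed, each variant could be stored using
    # write_config_file(filename, cfg)
```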

## Running the pipeline

EVcouplings provides three ways of executing configuration files. The applications *evcouplings_runcfg* and *evcouplings* are created automatically and added to your PATH when the Python package is installed (i.e. they should be available from any directory):

### evcouplings_runcfg

Execute a single configuration file (this ignores the "batch" section and runs a *single* thread of the pipeline):

```bash
evcouplings_runcfg <config_file>
```

If running in a batch computing environment, the user is responsible for submitting jobs etc.
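
If jobs are launched from a script rather than by hand, one option is to call the command through Python's `subprocess` module. This is a minimal sketch, not an EVcouplings feature: the helper names are hypothetical, and only the command form `evcouplings_runcfg <config_file>` shown above is used.

```python
import shutil
import subprocess


def build_runcfg_command(config_file):
    """Build the argument list for running a single configuration file."""
    return ["evcouplings_runcfg", config_file]


def run_config(config_file):
    """Run one configuration file, failing early if the tool is missing."""
    if shutil.which("evcouplings_runcfg") is None:
        raise RuntimeError(
            "evcouplings_runcfg not found on PATH; is EVcouplings installed?"
        )
    # check=True raises CalledProcessError on a non-zero exit status
    return subprocess.run(build_runcfg_command(config_file), check=True)


print(build_runcfg_command("test_config.txt"))
# run_config("test_config.txt")  # uncomment to actually launch the job
```

In a batch computing environment, the same command list would typically be wrapped in the scheduler's own submission command.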

### evcouplings

This is a wrapper around *evcouplings_runcfg* which provides the following major additional functions:
* for convenience, overwrite parameters in a config file using command line flags (e.g. to simply change proteins or specify a list of E-value thresholds)
* execute batch jobs (e.g. for scanning different evolutionary depths)

Running evcouplings requires the "environment" section to be specified in the configuration file. Currently, only local execution using multithreading and LSF are supported. If you need another environment (e.g. SGE, Torque or Slurm), please consider implementing it and submitting a pull request!

For a list of the available command line arguments, please run

```bash
evcouplings --help
```

To run a config file, execute:

```bash
evcouplings [options] <config_file>
```

### Running pipeline from within Python

Configuration files can also be directly run from within Python:

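```python
from evcouplings.utils import read_config_file
from evcouplings.utils.pipeline import execute

config = read_config_file("test_config.txt")

# uncomment to run the pipeline; this requires the databases and tools
# specified in the configuration file to be available on your system
# outcfg = execute(**config)
```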

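If a database file referenced in the configuration is not available on your system, execution fails with an error such as:

```
ResourceError: Input file does not exist or is empty:
/groups/marks/databases/jackhmmer/uniref100/uniref100_2017_04.fasta
```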
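
An error like the ResourceError above can be caught before starting a run by checking the configured database paths first. The following is a minimal sketch under assumptions: the helper name is hypothetical, it assumes string entries in the "databases" section are local paths, and it skips anything that looks like a URL. Entries that are neither files nor directories on your system will be reported.

```python
import os


def missing_databases(config):
    """Return (name, path) pairs from "databases" that are missing or empty."""
    missing = []
    for name, path in (config.get("databases") or {}).items():
        if not isinstance(path, str) or "://" in path:
            continue  # skip non-path entries such as URLs (assumption)
        if os.path.isdir(path):
            continue  # existing directory, accept as-is
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append((name, path))
    return missing


# hypothetical configuration fragment
example = {"databases": {"uniref100": "/path/that/does/not/exist.fasta"}}
for name, path in missing_databases(example):
    print("Database '{}' not found or empty: {}".format(name, path))
```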