SJShaw edited this page Feb 13, 2023 · 7 revisions

antiSMASH 5+

Layout concept

General layout

The antismash package has several main sub-packages:

  • detection: for modules that predict clusters, find domains such as PFAMs, NRPS domains, and so on
  • modules: for modules that perform specific predictions or searches, such as finding clusters or classifying clusters
  • common: for code which can be reused by multiple modules. This covers areas such as running external programs (e.g. common.subprocessing.run_blastp()), interacting with FASTA formats (e.g. common.fasta.read_fasta()), and common module components (e.g. common.module_results.ModuleResults)
  • config: contains command line argument management code, along with managing the global antiSMASH run configuration
  • outputs: modules that generate output from the results, for example the HTML output

There is also a databases subdirectory, which is the default location that antiSMASH searches for large databases (e.g. the PFAM database). It is not intended to be populated in this repository due to the size of the databases.

Module layout

Apart from __init__.py, data, test, and external, all other names below can (and should) be changed to an informative name.

modulename/
├── __init__.py
├── modulename.py
├── othername.py
├── data/
│   ├── known_proteins.fasta
│   └── scripts/
│       └── rebuild.py
├── external/      # contains a subdir with each included external program/lib/etc, if relevant
└── test/
    ├── data/
    │   ├── diamond_output_sample.txt
    │   └── test_sequence.fasta
    ├── __init__.py
    ├── integration_modulename.py
    ├── test_modulename.py
    └── test_othername.py

Each module should contain an __init__.py, intended to contain the functions that every module is expected to have (e.g. get_args(), check_prereqs(), run_on_record()). See here for an analysis module template. Multiple files should be used to split up code into relevant chunks. These files should be named based on the code they contain. Reference data used by the module should be in data/. External libraries, scripts or programs shipped with the module should be stored in external instead of at the top level of the module. Tests for the module should be located in a test directory, with any test data being in test/data.
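As a rough illustration of the layout described above, a module's __init__.py might look something like the following. This is only a sketch: the function names come from the text, but the signatures, the ExampleResults class, the REQUIRED_BINARIES tuple, and the placeholder analysis are assumptions made for this example and are not the actual antiSMASH interfaces. The analysis module template remains the authoritative reference.

```python
"""Illustrative __init__.py for a hypothetical module; not the antiSMASH API."""
import argparse
from dataclasses import dataclass, field
from shutil import which
from typing import Dict, List, Sequence

NAME = "modulename"               # hypothetical module name
REQUIRED_BINARIES = ("diamond",)  # assumption: external programs this module needs


@dataclass
class ExampleResults:
    """Stand-in results container (a real module would build on ModuleResults)."""
    record_id: str
    values: Dict[str, float] = field(default_factory=dict)


def get_args() -> argparse.ArgumentParser:
    """Declare the command line options this module understands."""
    parser = argparse.ArgumentParser(prog=NAME, add_help=False)
    parser.add_argument(f"--{NAME}-enabled", action="store_true",
                        help="Run the example analysis")
    return parser


def check_prereqs(binaries: Sequence[str] = REQUIRED_BINARIES) -> List[str]:
    """Return a list of problems found; an empty list means the module can run."""
    return [f"missing required binary: {name}"
            for name in binaries if which(name) is None]


def run_on_record(record_id: str, sequence: str) -> ExampleResults:
    """Run the analysis for one record; the real work would live in modulename.py."""
    results = ExampleResults(record_id)
    if sequence:
        # placeholder analysis: fraction of G and C bases in the sequence
        gc_count = sum(sequence.upper().count(base) for base in "GC")
        results.values["gc_fraction"] = gc_count / len(sequence)
    return results
```

Keeping __init__.py limited to these entry points, with the analysis itself in separately named files, makes it easier to see at a glance what a module offers.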

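Tests should be split into three sections. The first two live within the module itself; the third, end-to-end tests, is described further below: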

  • Unit tests (file prefix test_): These test small chunks of code. They should not take more than a few seconds and they should not invoke any external commands (though mocking is fine).
  • Integration tests (file prefix integration_): These tests can take longer and invoke external commands. By this point the module itself should be thoroughly tested.

Testing is performed with pytest. As such, all test functions should begin with test_, whether they are standalone functions within a test file or methods of a unittest.TestCase.
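A unit test that avoids invoking an external command could be sketched as below. The run_search() helper and the some_search_tool program are hypothetical and exist only for this example; the point is that subprocess.run is mocked, so the test stays fast and needs nothing installed. Both naming styles are shown: a method of a unittest.TestCase and a standalone function.

```python
import subprocess
import unittest
from typing import List
from unittest.mock import patch


def run_search(query_path: str) -> List[str]:
    """Hypothetical helper: run an external tool and return the hit names."""
    completed = subprocess.run(["some_search_tool", query_path],
                               capture_output=True, text=True, check=True)
    # the first tab-separated column of each non-empty line is the hit name
    return [line.split("\t")[0]
            for line in completed.stdout.splitlines() if line.strip()]


def fake_process(stdout: str) -> subprocess.CompletedProcess:
    """Build the object that the mocked subprocess.run will return."""
    return subprocess.CompletedProcess(args=[], returncode=0,
                                       stdout=stdout, stderr="")


class TestRunSearch(unittest.TestCase):
    def test_hit_names_are_parsed(self):
        fake = fake_process("hitA\t98.5\nhitB\t75.0\n")
        with patch("subprocess.run", return_value=fake) as mocked_run:
            assert run_search("query.fasta") == ["hitA", "hitB"]
        mocked_run.assert_called_once()


def test_empty_output_gives_no_hits():
    with patch("subprocess.run", return_value=fake_process("")):
        assert run_search("query.fasta") == []
```

An integration test for the same helper would live in a file with the integration_ prefix and would call the real program instead of a mock.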

While not part of a specific module, end-to-end tests should exist. These tests are for the whole of antiSMASH, including runs of multiple modules at once and checking the outputs. A module should also have its output tested.

Contribution requirements

If you're new to git and/or GitHub, see GitHub's Getting Started guide and pull request guide.

In order to keep things reliable and to simplify future work, all contributions should include the following:

  • A high-level description of the goal of the module and how the goal is achieved
  • Tests that cover the meaningful sections of the contribution
  • Code that is easily readable and understandable
  • Only the code required: no dead code, no duplicated functions from antiSMASH's core or from other modules
  • An accurately described method of regenerating any provided datasets, including where to acquire any data needed. This is to help when base datasets are updated or any issues are found.

NOTE: It is vital that all external dependencies used have a compatible license. Any license that applies restrictions to the output generated by a dependency will cause issues with integrating the contribution.

If a module has sections which could be used independently of each other, they should be written so that they are independent. This helps with testing, and if problems are detected in specific sections of a module, only those sections need to be enabled or disabled.

If reference data is part of the contribution, the following should also be included:

  • A description of the data source and data format
  • A method of regenerating any reference data used in case bugs are found or updates are required

Further things to avoid:

  • Reloading static data continually (due to being inefficient)
  • Using strings as data structures (inefficient and error-prone)
  • Arcane variable and function identifiers (The time saved writing for v in w: for l in v: ... is negligible compared to the time it takes to fix or update that code).
  • Huge functions: if a function is over 50 lines, start looking for sections that can reasonably be carved out into separate functions
  • Scripts that can only be run on the command line
  • Circular dependencies between imports: if the sections are that tightly bound, there will be problems later on
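Some of the points above can be illustrated with a short sketch. It parses reference data once and caches the result instead of reloading it on every call, stores each entry in a small dataclass rather than passing delimited strings around, and uses descriptive identifiers. The data, the ReferenceProtein class, and both function names are invented for this example; in a real module the text would be read from a file under data/.

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import List, Tuple

# hypothetical stand-in for the contents of a reference file in data/
_REFERENCE_TEXT = "protA\t120\nprotB\t95\n"


@dataclass(frozen=True)
class ReferenceProtein:
    """A structured record instead of a tab-separated string."""
    name: str
    length: int


@lru_cache(maxsize=None)
def load_reference_proteins() -> Tuple[ReferenceProtein, ...]:
    """Parse the reference data once; later calls return the cached tuple."""
    proteins = []
    for line in _REFERENCE_TEXT.splitlines():
        name, length = line.split("\t")
        proteins.append(ReferenceProtein(name, int(length)))
    # a tuple is returned so callers cannot modify the cached data
    return tuple(proteins)


def names_longer_than(minimum_length: int) -> List[str]:
    """Return the names of reference proteins longer than the given length."""
    return [protein.name for protein in load_reference_proteins()
            if protein.length > minimum_length]
```

Because the loader takes no arguments and returns an immutable tuple, every caller shares the same parsed data without being able to alter it.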