DRAM 1.4.4+ -> Point Release

@rmFlynn rmFlynn released this 06 Jan 03:50
· 30 commits to master since this release
ac83ba7

This is the official release of DRAM1.4.4. The 1.4.0 release has significant changes that could impact your research. The 1.4.4 point release is less significant, but still important for dram-v and dram users. Please review these changes and help us validate this release!

Install / upgrade:

If DRAM is installed with Bioconda, it can be upgraded like any Conda package. Note that the Conda package for DRAM may be delayed slightly while it is validated, but it should be available within a day or two of the release.

If you already have a DRAM environment and want to upgrade:

# Activate your old DRAM environment first!
# Save your old config
DRAM-setup.py export_config > my_old_config.txt
# install DRAM
wget https://raw.githubusercontent.com/shafferm/DRAM/master/environment.yaml
conda env update -f environment.yaml -n DRAM --prune
# import your old databases
DRAM-setup.py import_config --config_loc my_old_config.txt

If you are using an old database, like in the example above, you may need to check out a special version of dram from GitHub.

git clone https://github.com/WrightonLabCSU/DRAM.git
cd DRAM
git checkout dbcan_no_ec
conda env update -f environment.yaml -n DRAM --prune
conda activate DRAM
conda install pip
pip install ./

To install DRAM in a new Conda environment, follow the instructions in the README.

Change log DRAM1.4.4 addendum:

  1. Bug fixes have been made to the setup script to support the many ways the DRAM databases get built. You will see them in the merge history.
  2. Previously, the DRAM-v AMG summary did not add match data for AMGs that were matched to the AMG Database only. This was confusing, and so now information relevant to the AMG Database is in the AMG summary along with the Metabolic Database. This adds the new columns "metabolism", "reference", and "verified", and the "gene_id_origin" field which tells you where this Gene ID came from. Remember that a sequence can match to more than one sequence and this is more common in the AMG Database, so your AMG Summary will be longer and contain more duplicates.
  3. DRAM1.4.X collects subfamily EC numbers for the raw annotations, but does not use them in the distillation process. We have future plans for these EC numbers, but in the meantime they make it impossible to use older versions of the DRAM databases with the newer DRAM1.4.X. This is not ideal, as we do strive for backwards compatibility. Sadly, the only solution at this time is to create a branch that does not look for the EC numbers. Use the instructions above or in the README to install the dbcan_no_ec branch from git.
  4. Most output arguments are now required, with only a few exceptions. Most people will not notice this.

Change log DRAM1.4.0:

  1. DRAM distill now includes a new metabolism for methylation. Although this was planned for DRAM2, you can already include this tool in annotation and distillation, provided you follow the instructions below.

    In order to distill with methyl, you need only download the new FASTA file and point to it with the dram custom database options that were introduced in DRAM1.3. Note that in order to distill correctly, you will need to use the correct name ‘methyl’ and must use DRAM 1.4.

    To annotate with methyl, do something like:

    wget https://raw.githubusercontent.com/shafferm/DRAM/master/data/methylotrophy/methylotrophy.faa
    DRAM.py annotate -i '/some/path/*.fasta' -o dram_output --threads 30 --custom_db_name methyl --custom_fasta_loc methylotrophy.faa
    

    To distill with methyl:

    wget https://raw.githubusercontent.com/shafferm/DRAM/master/data/methylotrophy/methylotrophy_distillate.tsv
    DRAM.py distill -i dram_output/annotations.tsv -o dram_output/distillate --custom_distillate methylotrophy_distillate.tsv
    

    Learn more about custom databases in the Wiki.

  2. Glycoside hydrolase subfamily calls: subfamily calls are now being incorporated into annotations through changes in the databases and code. This affects what gets pulled into the distillate and product, because these look for family-level IDs (e.g. AA1), not subfamily-level IDs (e.g. AA1_1, AA2_2).

    In response, DRAM is changing the output of the dbCAN database in DRAM1.4. Raw cazyme subfamilies will be output into the cazy_id column, and the corresponding description for the cazyme family will be put into the cazy_hit column.

    The distillation in DRAM1.4 will count cazymes marked at the subfamily level at the family level. This means that for cazyme family AA1 there will be four entries in the distillate (AA1, AA1_1, AA1_2, and AA1_3), and the sum of these four will be the total number of AA1 cazymes. In DRAM1.3 and earlier, the distillate entry for this example, AA1 with no underscore, would include cazymes that can be assigned to family AA1 but do not have a subfamily designation.

    The DRAM Product will also count cazymes at the family level. For the AA1 example, AA1_1, AA1_2, and AA1_3 will be counted as AA1 for the current rules in assigning cazymes to compounds.
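
The family-level counting described in this item can be illustrated with a small shell sketch. It is not DRAM's own code: the helper name `count_cazy_families` and the sample IDs are hypothetical, and it assumes a subfamily ID is a family ID followed by an underscore and a number.

```shell
# Hypothetical helper (not part of DRAM): collapse CAZy subfamily IDs such as
# AA1_1 to their family (AA1) and count hits per family.
# Reads one ID per line on stdin, writes "family<TAB>count" lines.
count_cazy_families() {
    sed 's/_[0-9][0-9]*$//' | sort | uniq -c | awk '{print $2 "\t" $1}'
}

# Made-up example IDs, not real DRAM output.
printf '%s\n' AA1 AA1_1 AA1_2 AA1_3 GH5_2 | count_cazy_families
# prints two tab-separated lines: "AA1 4" and "GH5 1"
```

Here the four AA1 entries (AA1, AA1_1, AA1_2, AA1_3) are summed into a single AA1 total, which mirrors how the product is described as treating subfamilies.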

  3. More changes are also being made that will affect CAZY IDs in DRAM1.4. The cutoff e-value is being changed to 1e-18 to conform to best practices for the database.

    DRAM1.4 also introduced a new column, cazy_best_hit, for the best hit per gene from the dbCAN database. This column will be the match to the gene that has the highest coverage and lowest full-sequence e-value as calculated by mmseqs, with priority on e-value. The cazy_best_hit column will be the only column considered downstream in the distillate and product. DRAM1.3 pulls and counts all dbCAN hits above e-value 1e-15, rather than profiling best hits.

    A new column, cazy_subfamily_ec, holding the EC number information from subfamilies has been added in DRAM1.4. These EC numbers will also be used as part of the distillate along with those from KEGG, as part of pathways and other tools. For now, incomplete EC numbers will be included, but not considered for the distillate. The subfamilies will be excluded from the product in order to facilitate its goal of being a larger overview.
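
The best-hit rule described in this item (lowest e-value first, highest coverage as the tie-breaker, after applying the 1e-18 cutoff) can be sketched in shell. This is a rough illustration under stated assumptions, not DRAM's implementation: the helper name, the four-column input layout, and the choice to keep hits exactly at the cutoff are all hypothetical.

```shell
# Hypothetical helper (not part of DRAM). Input: tab-separated lines of
#   gene <TAB> cazy_id <TAB> evalue <TAB> coverage
# Output: one "gene<TAB>cazy_id" line per gene, choosing the lowest e-value
# and breaking ties by the highest coverage. Hits with an e-value greater
# than the cutoff are dropped (treatment of the boundary is an assumption).
best_hit_per_gene() {
    awk -F '\t' -v cutoff=1e-18 '
        {
            e = $3 + 0; c = $4 + 0
            if (e > cutoff + 0) next
            if (!($1 in best_e) || e < best_e[$1] ||
                (e == best_e[$1] && c > best_c[$1])) {
                best_e[$1] = e; best_c[$1] = c; best_id[$1] = $2
            }
        }
        END { for (g in best_id) print g "\t" best_id[g] }
    ' | sort
}

# Made-up hits: g1 is decided by e-value, g2 by coverage, g3 fails the cutoff.
printf 'g1\tGH5_2\t1e-20\t0.8\ng1\tGH5\t1e-30\t0.6\ng2\tAA1_1\t1e-25\t0.7\ng2\tAA1_3\t1e-25\t0.9\ng3\tGT2\t1e-10\t0.9\n' |
    best_hit_per_gene
# prints two tab-separated lines: "g1 GH5" and "g2 AA1_3"
```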

  4. Logging is now fully implemented in DRAM1.4. Log files will be created for almost all DRAM functions. The log file for annotations will appear in the annotations folder by default, and the log file for DRAM distillation will by default be in the distillation folder. You can also use the --log_file_path argument to set the log path. A log file for database processing is set by the config file, and by default it will be in the databases directory. All content that DRAM prints to the command line will also appear in the log file.

  5. The dram config now stores when databases were downloaded, citation information and version information when applicable. This information is printed to the log at the beginning of each run. The old format can still be imported if you want to keep your DRAM1.3 databases.

  6. In 1.4 you can set a config file to use in dram annotation and distillation at run time in 2 ways. (1) use --config_loc with DRAM.py or DRAM-v.py or (2) set the environment variable DRAM_CONFIG_LOCATION. This will not store or import the config, and that config will only be used for that run.
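
The two ways of supplying a config at run time might look like the following wrappers. This is only a sketch: the function names are hypothetical, DRAM.py is assumed to be on the PATH, and placing --config_loc after the subcommand is an assumption worth confirming with the command's help output.

```shell
# Hypothetical wrappers (not part of DRAM); both assume DRAM.py is on PATH.
# Arguments: $1 = input FASTA path or quoted glob, $2 = output directory,
#            $3 = config file to use for this run only.

# Way 1: pass the config explicitly with --config_loc.
# The flag position after the subcommand is an assumption.
annotate_with_config() {
    DRAM.py annotate -i "$1" -o "$2" --config_loc "$3"
}

# Way 2: set DRAM_CONFIG_LOCATION for this one command only, so the
# variable does not leak into the rest of the shell session.
annotate_with_env_config() {
    DRAM_CONFIG_LOCATION="$3" DRAM.py annotate -i "$1" -o "$2"
}
```

For example, `annotate_with_config '/some/path/*.fasta' dram_output my_old_config.txt` would run a single annotation against that config without importing it.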

  7. Significant bug fixes are also included in this release.

  • When the input FASTA files contain duplicate header names, the DRAM annotate step should fail with an error immediately, not at the end of the annotation process. This will save some people a lot of time. It may be that this is only a problem when annotating genomes, but in any case the check must be in place across workflows.
  • Some users have firewalls on their HPC environments that prevent downloads via ftp. In some cases, converting to http can solve download problems. In DRAM1.4, if ftp links fail, a back-up http link will be attempted before an error is thrown. See issue #206.
  • DRAM1.4 will ensure that if no databases are downloaded, DRAM setup will still work. Previously, some databases depended on data being downloaded and could not be set up with a provided data set.
  • Reduced unnecessary warnings in various repetitive tasks in DRAM distillation by refactoring pandas code.
  • BIO-RELATED: This bug fix could affect biological results. In the past, the counting of EC numbers was inconsistent. When counting the EC numbers in a single row of the annotations file, duplicates were not counted; however, when counting the EC numbers for the full set of data, the count included such duplicates. This is now corrected, but it could have some small unexpected downstream effects.
  • Glycoside hydrolase subfamily calls.
  • In response to issue #122, you can now pass a config file at run time or by setting the environment variable DRAM_CONFIG_LOCATION. Read more in the Wiki.
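
Because duplicate FASTA headers now stop an annotation run, it can be useful to check inputs ahead of time. The sketch below is one possible pre-check, not DRAM's own logic: the helper name is hypothetical, and it assumes the record ID is the header text up to the first whitespace, which may differ from how DRAM compares headers.

```shell
# Hypothetical pre-check (not part of DRAM): print every FASTA record ID
# that occurs more than once across the given files.
# Assumes the ID is the header text after ">" up to the first whitespace.
duplicate_fasta_ids() {
    grep -h '^>' "$@" | sed 's/^>//; s/[[:space:]].*$//' | sort | uniq -d
}
```

Running `duplicate_fasta_ids /some/path/*.fasta` prints nothing when no ID is repeated. Passing several files checks IDs across all of them; whether cross-file duplicates matter for a given workflow is an assumption, so pass one file at a time to check each file on its own.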

Known issues:

  • Speed and memory remain a big problem for DRAM and the estimates in the wiki and other documentation are woefully out of date. Fixing this is a major priority.
  • The annotation merging tool lacks sufficient checks, and fails when files are missing.
  • Code coverage remains low, especially for the less prominent tools.