v3.5.1
chewBBACA no longer checks if input files have unique basename prefixes shorter than 30 characters. This check was previously performed to ensure that sequence identifiers did not exceed the 50-character limit enforced by BLAST when creating a database. The main changes to file name processing are the following:
- chewBBACA uses the file basename without the file extension as unique identifier (e.g. GCF_008632635.1.fasta is converted to GCF_008632635.1), instead of trying to determine the shortest unique prefix that can be used to identify each input file. It is still necessary for each file to have a unique identifier after the removal of the file extension (e.g. GCF_008632635.1.fasta and GCF_008632635.1.fna have different file extensions but the same identifier after removing the file extension, which is not allowed).
- The CreateSchema module uses the input file basenames without the file extension to define the identifiers for the loci in the created schemas (e.g. loci initially identified in the genomes GCF_008632635.1.fasta and GCA_000006785.2_ASM678v2.fasta are named as GCF_008632635.1-proteinN.fasta and GCA_000006785.2_ASM678v2-proteinN.fasta, respectively). We still recommend using short and unique file names without special characters (e.g.: !@#?$^*()+) for conciseness and to avoid potential issues.
- The AlleleCall module accepts and uses the new loci identifier format used by the CreateSchema module. The input genome or CDS files can also have basenames of any length as long as the basename without the file extension for each input file is unique. The output files created by the AlleleCall module use the full unique basenames (e.g. for the genome GCA_000006785.2_ASM678v2.fasta, the genome identifier used in the output files will be GCA_000006785.2_ASM678v2, instead of GCA_000006785 used up until chewBBACA v3.5.0).
- The PrepExternalSchema module accepts schemas containing loci FASTA files with basenames longer than 30 characters.
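
The identifier rule described in the list above can be checked before running any module. The following is a minimal sketch, not chewBBACA's own code: the function names and the set of recognized extensions are assumptions made for illustration, and the way chewBBACA actually strips extensions may differ.

```python
from collections import defaultdict
from pathlib import Path

# Assumption: an illustrative set of FASTA extensions. The extensions
# that chewBBACA recognizes may differ.
FASTA_EXTENSIONS = {".fasta", ".fna", ".ffn", ".fa", ".fas"}


def file_identifier(path):
    """Return the basename with a recognized FASTA extension removed."""
    name = Path(path).name
    suffix = Path(name).suffix
    if suffix.lower() in FASTA_EXTENSIONS:
        return name[: -len(suffix)]
    # Unrecognized or missing extension: keep the basename unchanged.
    return name


def find_identifier_clashes(paths):
    """Map each identifier shared by more than one file to those files."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_identifier(path)].append(path)
    return {ident: files for ident, files in groups.items() if len(files) > 1}


if __name__ == "__main__":
    inputs = [
        "genomes/GCF_008632635.1.fasta",
        "genomes/GCF_008632635.1.fna",
        "genomes/GCA_000006785.2_ASM678v2.fasta",
    ]
    for ident, files in find_identifier_clashes(inputs).items():
        print(f"Duplicate identifier {ident}: {', '.join(files)}")
```

With these inputs, the two GCF_008632635.1 files are reported as a clash because they differ only in their file extension.
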
Additionally, the CDS identifiers are converted to a different format (lcl|SEQ1, lcl|SEQ2, ..., lcl|SEQN) before creating a BLAST database with makeblastdb and the -parse_seqids option. This avoids issues caused by some sequence identifiers being interpreted and modified (e.g. interpreted as PDB Chain IDs) during database creation, which resulted in errors when a modified identifier no longer matched the original identifier. This change made it possible to remove the check that verified that unique prefixes were not modified by BLAST during database creation.
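
The relabeling step can be pictured with a short sketch. This is an illustration under stated assumptions rather than chewBBACA's implementation: the function names are hypothetical, and keeping a lookup table to translate the new identifiers back to the originals is an assumed approach.

```python
def relabel_for_blast(records):
    """Replace each identifier with lcl|SEQ1, lcl|SEQ2, ..., lcl|SEQN.

    `records` is an iterable of (identifier, sequence) pairs. Returns the
    relabeled records and a dict that maps each new identifier back to
    the original one.
    """
    relabeled = []
    new_to_original = {}
    for index, (identifier, sequence) in enumerate(records, start=1):
        new_id = f"lcl|SEQ{index}"
        new_to_original[new_id] = identifier
        relabeled.append((new_id, sequence))
    return relabeled, new_to_original


def to_fasta(records):
    """Serialize (identifier, sequence) pairs as FASTA text."""
    return "".join(f">{identifier}\n{sequence}\n" for identifier, sequence in records)


if __name__ == "__main__":
    cds = [
        ("GCA_000006785.2_ASM678v2-protein1", "ATGAAATAA"),
        ("GCF_008632635.1-protein2", "ATGCCCTAA"),
    ]
    relabeled, lookup = relabel_for_blast(cds)
    print(to_fasta(relabeled), end="")
    print(lookup["lcl|SEQ2"])
```

Because the new identifiers are generated rather than derived from file names, their format stays the same no matter how long or unusual the original identifiers are.
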
Additional changes:
- Added the --output-masked option to the AlleleCall module to create a TSV file with the masked profiles (INF- prefixes are removed and the NIPH, NIPHEM, ASM, ALM, PLOT3, PLOT5, LOTSC, and PAMA classes are converted to 0).
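
The masking rule can be sketched as follows. This is not chewBBACA's code: the function names are hypothetical, and the sketch assumes a tab-separated profile table with a header row and the sample identifier in the first column.

```python
# Classes listed in the release note as being converted to 0.
MASKED_CLASSES = {"NIPH", "NIPHEM", "ASM", "ALM", "PLOT3", "PLOT5", "LOTSC", "PAMA"}


def mask_value(value):
    """Strip an INF- prefix or convert a masked class to "0"."""
    if value.startswith("INF-"):
        return value[len("INF-"):]
    if value in MASKED_CLASSES:
        return "0"
    return value


def mask_profiles(tsv_text):
    """Mask every allele call in a TSV table.

    Assumption: the first row is a header and the first column holds the
    sample identifier, so neither is modified.
    """
    lines = tsv_text.splitlines()
    if not lines:
        return ""
    masked = [lines[0]]
    for row in lines[1:]:
        sample, *calls = row.split("\t")
        masked.append("\t".join([sample] + [mask_value(call) for call in calls]))
    return "\n".join(masked) + "\n"


if __name__ == "__main__":
    profiles = (
        "FILE\tlocus1\tlocus2\tlocus3\n"
        "GCA_000006785.2_ASM678v2\tINF-4\tNIPH\t7\n"
    )
    print(mask_profiles(profiles), end="")
```

In this example the call INF-4 becomes 4, NIPH becomes 0, and 7 is left unchanged.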