- Subclones are now properly assigned.
- The order of exported sequences is now consistent.
- Updates for consistency with Python 3 test cases.
- ImmuneDB can now process T-cell sequences.
- Clonal assignment now includes an optional "subclone" process for locally-aligned sequences. Subclones are clones which share features of their parent, but contain insertions or deletions. See the documentation for more information.
- External clonal assignments can now be imported with `immunedb_clone_import`.
- Optional rollbar support has been added to `immunedb_rest` to track errors.
- Logging has been overhauled and is more consistent with best practices.
- The package has been renamed from AIRRDB to ImmuneDB.
- Tests now check for the presence of a local-alignment binary.
- SciPy has been removed as a dependency and a custom hypergeom function has been included.
- The package has been renamed from SLDB to AIRRDB.
- J-gene offsets are now set to human values by default.
- Local alignment has been updated and should properly work for most sequences.
- Selection pressure can now be calculated for mutations happening exactly a specified number of times.
- Clonal overlap calculations are now faster.
- A `sldb_sql` command has been added to ease direct interface with MySQL.
- The API call for clonal overlap now properly pages.
- Improved error handling in lineage construction.
- Docker compose is now used to separate the different AIRRDB components.
- J-genes are now properly assigned.
- Alleles are no longer annotated.
- Sequences can be optionally trimmed during identification or importing.
- Sequences with stop codons can optionally be excluded from lineages.
- Sequences are now properly assigned to clones regardless of CDR3 length.
- Sequences with ambiguous bases in their J-genes are now properly identified.
- V- and J-gene tie code has been consolidated.
- Total clone copy number is now properly calculated for statistics.
- Sequences with ambiguous CDR3s are now properly added to clones.
- V-ties for locally-aligned sequences are now properly annotated.
- Mutation rate for each sample is now stored in the underlying database.
- Lineage node copy numbers are now correct for collapsed sequences.
- Clone overlap queries are now much faster.
- Local alignment now uses external libraries.
- Insertions and deletions are now included in sequence records.
- A Dockerfile is now available for AIRRDB.
- Exporting clones by sample now works properly.
- Memory usage and run-time for local alignment have been reduced.
- Documentation has been cleaned up.
- Selection pressure can now be calculated at any level.
- The API has been simplified and re-organized.
- URLs for API calls now use run length encoding to specify which samples to analyze. This fixes issues when many samples are selected and cause the URL to be too long.
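As an illustration of why run length encoding keeps such URLs short, here is a minimal sketch. The `T`/`F` run format and the function names `encode_samples` and `decode_samples` are assumptions made for this example; they are not taken from the package and may differ from the encoding it actually uses.

```python
import re
from itertools import groupby


def encode_samples(sample_ids):
    """Encode positive integer sample IDs as runs of selected/unselected IDs.

    Hypothetical format: 'T3F2T1' means IDs 1-3 are selected, 4-5 are not,
    and 6 is selected.
    """
    selected = set(sample_ids)
    if not selected:
        return ""
    # One boolean per ID from 1 up to the largest selected ID.
    flags = [i in selected for i in range(1, max(selected) + 1)]
    return "".join(
        ("T" if flag else "F") + str(len(list(run)))
        for flag, run in groupby(flags)
    )


def decode_samples(encoded):
    """Invert encode_samples, returning the selected IDs in ascending order."""
    ids = []
    position = 1
    for flag, length in re.findall(r"([TF])(\d+)", encoded):
        length = int(length)
        if flag == "T":
            ids.extend(range(position, position + length))
        position += length
    return ids
```

With this scheme, selecting samples 1 through 500 costs four characters (`T500`) instead of a 500-item list.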
- Grouped quality scores are now properly calculated.
- Rarefaction calculations have been removed.
- Clone mutations can now have arbitrary thresholds.
- The clone overlap query has been optimized and now properly filters functional and non-functional clones.
- `sldb_admin` has been added to simplify creating, deleting, backing up, and restoring SLDB instances.
- `sldb_local_align` has been added for locally aligning sequences marked as having insertions or deletions.
- `sldb_clone_selection_pressure` has been added to calculate clonal selection pressure.
- `sldb_clone_stats` now only calculates mutations and overlap, but much more quickly.
- Duplicate sequences, regardless of ambiguous characters, are automatically collapsed during identification.
- API call `get_stats` now takes a `percentages` parameter which will return statistics as percentages.
- SLDB no longer uses two databases and now only requires one configuration file.
- Identification speed has been increased.
- Samples can now be annotated with an `ig_class` specifying the isotype of the sample (e.g. IgA, IgE).
- Sequence instances are now counted at the subject level.
- Exporting clone overlap now includes selected and all samples.
- Identification will no longer fail for samples with zero identifiable reads.
- It is no longer possible to have multiple input files for one sample. Additionally, identification will not allow sequences to be added to existing samples.
- Hypergeometric probabilities for V-ties are now cached, greatly improving identification performance.
- A `--trim INT` parameter has been added to identification, allowing reads to be trimmed prior to identification.
- TokuDB has been dropped in favor of InnoDB for the purpose of easier installation.
- Identification tests have been re-written.
- Sequences that cannot be inserted due to a field-length restriction are added as `NoResult`s whenever possible.
- Clonal assignment now includes partial reads by default.
- Collapsing of sequences now occurs within V, J, CDR3 length buckets for efficiency.
- Identification now looks for `D.....C` in sequences if all other anchors fail.
- Sample-level duplicate sequences now have the correct clone ID after.
- V-match percentage is now correct for partial sequences.
- Quality strings are now properly oriented for reverse-complement sequences.
- Trees will no longer have zero-mutation roots.
- Clones can now be created with a specifiable minimum copy number.
- Sequence exports can now optionally only include sequences assigned to clones.
- Total sequence counts in sample statistics now work with sample-level collapsing.
- Tree creation will now emit a warning when mutation information is unavailable.
- Multiprocess workers now emit a warning when uncaught errors occur.
- VDJ alignment now uses exceptions to indicate alignment failures.
- String fields in models are now verified to be of correct length or a `ValueError` is thrown.
- CDR3s are now limited to the lesser of 32 amino acids or 96 nucleotides.
- Models now consistently use `cdr3` instead of `junction` for the CDR3 region.
- Identification has been refactored to be cleaner and more efficient.
- Regression testing has been added in the `tests` directory.
- The `CloneGroup` model has been removed.
- Exporting has been refactored.
- J gene germlines are now specified by a FASTA file rather than hard-coded sequences.
- Clone lineages can now be created only from mutations that occur in a given number of samples.
- Various performance enhancements to clone statistics.
- Selection pressure is pre-calculated for all mutations as well as those which occur at least twice.
- Removed clone collapse level since it will never result in further collapsing past the subject level.
- Delimited importing has been updated to match the new models.
- Sequences with varying capitalization are now normalized.
- V identification no longer looks for hard-coded anchors.
- Versioning will now follow the Semantic Versioning Standard.
- Major Feature: Sequences must now be collapsed at three different levels: the sample, subject, and clone. This collapsing is to take into account Ns added from quality filtering. Sequences that are identical except for Ns are considered the same and will be collapsed into the highest copy-number sequence. Equality checking ignoring Ns is written in C for efficiency.
- Feature: Fully aligned sequences with V and J assignments can now be imported from CSV files.
- Feature: Phred quality scores can now be analyzed from FASTQ files.
- Feature: Rarefaction for samples can now be calculated with the `rarefaction` API call.
- Feature: V-gene diversity for samples can now be calculated with the `diversity` API call.
- Feature: Sequences can now be discarded based on number of V-ties and a minimum identity-to-germline threshold during identification.
- Enhancement: When checking if a sequence has a similar CDR3 for clonal assignment, only unique CDR3 amino-acid sequences are checked (#24).
- Enhancement: Clone overlap in a sample-context can now be exported as a CSV via the `clone_overlap` API call.
- Bug fix: Workers will no longer prematurely terminate due to blocking on the task queue.
- Bug fix: Grouping of sample statistics no longer inflates distribution values.
- Bug fix: Copy numbers for duplicate sequences during `sldb_identify` are now correct.
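The N-aware collapsing described in the "Major Feature" entry above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `equal_ignoring_ns` and `collapse` are hypothetical, the input is assumed to be `(sequence, copy_number)` pairs of aligned, equal-length sequences, and the package itself performs the equality check in C.

```python
def equal_ignoring_ns(a, b):
    """Return True if two equal-length sequences match at every position
    where neither has an N."""
    return len(a) == len(b) and all(
        x == y or x == "N" or y == "N" for x, y in zip(a, b)
    )


def collapse(sequences):
    """Collapse (sequence, copy_number) pairs that differ only by Ns.

    Sequences are visited from highest to lowest copy number, so each
    sequence is merged into the highest copy-number sequence it matches.
    """
    collapsed = []
    for seq, copies in sorted(sequences, key=lambda s: s[1], reverse=True):
        for entry in collapsed:
            if equal_ignoring_ns(entry[0], seq):
                entry[1] += copies  # fold copies into the larger sequence
                break
        else:
            collapsed.append([seq, copies])
    return [tuple(entry) for entry in collapsed]
```

Note that equality ignoring Ns is not transitive (`ACNT` matches both `ACGT` and `ACTT`, which do not match each other), so a greedy pass like this one depends on the visiting order. Ordering by copy number is the choice assumed here.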
- Baseline has now been integrated to calculate clonal selection pressure during clone statistic calculations.
- Clone comparison now only allows one clone to be selected.
- Modification log messages are now added at each pipeline stage.
- Mutations are now precalculated for all sequences and clones.
- Mutations can now be exported for both clones and samples.
- All pipeline stages now use the multiprocessing module to parallelize processing.
- Mutations can now be filtered by occurrence frequency via the REST API.
- `sldb_sample_stats` now accepts the `--clones-only` flag which, when set, will cause sample statistics only to be generated for clone filters. Useful for updates to clonal assignment methods.
- Fixed a bug where exporting sequences did not return the CDR3 NTs, AAs, or length.
- V-gene names are now lexicographically sorted when requested via the REST API.
- `sldb_clones` now accepts the `--order` flag which will sort sequences by copy number for clonal assignment.
- Clone stats are now properly updated when the `--force` flag is passed.
- Indels are now flagged for percentage mismatch in addition to windowed mutations.
- The `get_stats` API call now allows for stats to be grouped by any attribute defined in the `Sample` model.
- `sldb_sample_stats` now takes a `--clones-only` flag to only regenerate clone statistics for samples.
- CDR3 AA and CDR3 length are now properly exported.
- Lineage trees generated by neighbor joining can no longer have a zero-mutation node as the root.
- The `v_usage` API call now provides a list of groupings for the selected samples and sorts the returned V-genes.
- All duplicate sequences are now properly assigned an entry in the `Sequences` table, removing cycles from `DuplicateSequences`.
- Clones can now have a different V gene assigned or gaps added manually via the `sldb_modify_clone` command.
  - Modifications via `sldb_modify_clone` are recorded and can be fetched with the `modification_log` API call. Other manual modification should make use of the `ModificationLog` model.
- Neighbor joining now properly calculates copy number.
- `sldb_clone_stats` can now be limited by clone ID.
- Rows summing total/unique sequences across all samples can be included in clone exports.
- Neighbor joining has been added as a method of lineage tree creation.
- Fixed incorrect unique sequence count in clone comparison.
- V gene usage API tweak to allow for exporting via website.
- Mutation frequency is now calculated for various thresholds.
- HighV-Quest output can now be imported with the `sldb_hvquest` binary.
- Optionally, V-ties can be calculated in addition to the Vs specified.
- Sequences with probable indels or misalignments are excluded from clonal assignment by default. Override this behavior in `sldb_clones` with `--include-indels`.
- Major data-model changes:
  - Consolidated `SequenceMapping` to `Sequence` model to reduce joining.
  - `SampleStatistics` changed to match updated `Sequence` model.
  - Added `CloneStats` model to reduce API query time.
  - Added binary `sldb_clone_stats` to populate `CloneStats` models.
  - Renamed `aggregation/stats.py` to `aggregation/sample_stats.py` and `sldb_stats` binary to `sldb_sample_stats` for new clone statistics scripts.
- Can now export both sequences and clones. Re-architected exporting classes.
- Fixed case where duplicate sequences were incorrectly inserted.
- First stable release
- Duplicate sequences are now detected during identification and not re-identified.
- Metadata now properly falls through to the "all" block during identification.
- pRESTO references removed in favor of "R1+R2".
- Insertion/deletion check based on sliding window.
- V and J ties now calculated on a per-sample basis.
- Major changes and fixes to V/J identification:
- V gene alleles can now be identified and must be separated with an asterisk (e.g. IGHV4-34*01).
- Anchors are now found using reversed frame-shifting if forward frame-shifting yields a no-result.
- V and J genes now match into the CDR3 based on sliding window.
- Germlines can now be specified during identification. Note each germline name must refer to a unique sequence.
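The two naming rules above (alleles separated from the gene name by an asterisk, and each germline name referring to a unique sequence) can be checked with a short sketch like the one below. The helper names `split_allele` and `check_unique_germlines` are hypothetical and not part of the package. The uniqueness rule is interpreted here as "no two names may share the same sequence", which is an assumption.

```python
def split_allele(name):
    """Split a germline name such as 'IGHV4-34*01' into (gene, allele).

    The allele is None when the name contains no asterisk.
    """
    gene, sep, allele = name.partition("*")
    return gene, (allele if sep else None)


def check_unique_germlines(germlines):
    """Raise ValueError if two germline names map to the same sequence.

    germlines: dict mapping germline name -> nucleotide sequence.
    Comparison is case-insensitive.
    """
    seen = {}
    for name, sequence in germlines.items():
        key = sequence.upper()
        if key in seen:
            raise ValueError(
                "{} and {} have the same sequence".format(seen[key], name)
            )
        seen[key] = name
```

For example, `split_allele("IGHV4-34*01")` yields the gene `IGHV4-34` and the allele `01`.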