This repository has been deprecated and archived. It remains available for reproducibility purposes only.
For the latest versions of CZ ID data processing workflows, please see https://github.com/chanzuckerberg/czid-workflows.
This is a CLI that allows you to execute the different data processing stages required in CZ ID.
To install after cloning:
$ pip install -e .
- When merging a commit to master, you need to increase the version number in czid_pipeline/version/__init__.py:
- if results are expected to change, increase the 2nd number
- if results are not expected to change, increase the 3rd number.
- 1.8.7
- Bug fix for count_reads and non-host read counts.
- 1.8.4 ... 1.8.6
- Minor code quality, documentation, and logging improvements.
- 1.8.0 ... 1.8.3
- Upload a status file that indicates when a job has completed.
- Add a dedicated semaphore for S3 uploads.
- Code quality and documentation improvements.
- Restore capability to run non-host alignment from the development environment.
- Try a more relaxed LZW fraction if the initial filter leaves 0 reads
- 1.7.2 ... 1.7.5
- General code style changes and code cleanup.
- Convert string exceptions and generic exceptions to RuntimeErrors.
- Change some print statements for python3.
- Add more documentation.
- 1.7.1
- Truncate enormous inputs to 75 mil paired end / 150 mil unpaired reads.
- Support input fasta with pre-filtered host, e.g. project NID.
- Many operational improvements.
- 1.7.0
- Add capability to further filter out host reads by filtering all the hits from gsnapping host genomes. (i.e. gsnap hg38/patron5 for humans).
- 1.6.3 ... 1.6.1
- Handle bogus 0-length alignments output by gsnap without crashing.
- Fix crash for reruns which reuse compatible results from a previous run.
- Fix crash for samples with unpaired reads.
- Improve hit calling performance.
- 1.6.0
- Fix fasta downloads broken by release 1.5.0, making sure only hits at the correct level are output in the deduped m8.
- Fix fasta download for samples with unpaired reads by eliminating merged fasta for those samples.
- Extend the partial fix in release 1.5.1 to repair more of the broken reports. Full fix requires rerun with updated webapp.
- Correctly aggregate counts for species with unclassified genera, such as e.g. genus-less species 1768803 from family 80864.
- Fix total count in samples with unpaired reads (no longer doubled).
- Fix crash when zero reads remain after host filtering.
- Fix bug in enforcing command timeouts that could lead to hangs.
- Fix performance regression in stage 2 (non-host alignment) introduced with 1.5.0.
- Deduplicate and simplify much of stage 2, and improve performance by parallelizing uploads and downloads.
- 1.5.1
- Fix bug introduced in 1.5.0 breaking samples with non-species-specific deuterostome hits.
- 1.5.0
- Identify hits that match multiple species within the same genus as "non species specific" hits to the genus.
- 1.4.0
- Version result folder.
- 1.3.0
- Fix bug causing alignment to run before host subtraction in samples with unpaired reads.
- Include ERCC gene counts from STAR.
- 1.2.0
- Synchronize pair order after STAR to improve sensitivity in 10% of samples with paired-end reads.