07 May 19:17

alecw

93a0242

Shell script standardization and other minor changes Latest

Latest

Changes

Standardize shell scripts, including alignment, reference metadata creation, and tool wrapper scripts. Thanks to @mschilli87.
Add ability to request multiple cores in alignment script.
Remove dependent jars and instead get from Maven Central. Thanks to @lbergelson.
In various programs, filter out reads with UMI == -, which is how STARsolo marks chimeric reads.
Change the dependency between our R packages from Imports to Suggests.

Installation

Java

Download and unzip dropseq-3.0.1.zip . Wrapper scripts for all the command-line programs will be in the dropseq-3.0.1 directory that is created when unzipping.

R packages

Do this in the order below, because DropSeq.dropulation depends on DropSeq.utilities, and if you don't install that package first, the source version of DropSeq.utilities will be installed, which may be unstable.

install.packages("https://github.com/broadinstitute/Drop-seq/releases/download/v3.0.0/DropSeq.utilities_3.0.1.tar.gz") 
install.packages("https://github.com/broadinstitute/Drop-seq/releases/download/v3.0.0/DropSeq.dropulation_3.0.1.tar.gz")

Contributors

lbergelson and mschilli87

Assets 5

14 Mar 13:22

alecw

v3.0.0

6a9ae19

Java 21, R packages for Dropulation and Census-seq

New and changed

Java 21 and Gradle

This is the first release built with Java 21. You will need to use a JVM version >= 21.

R packages for Dropulation and Census-seq

Our Java programs have long been available for donor assignment and Census-Seq, but we've typically analyzed their outputs using R for standard QC plots and analysis. This release marks the first time we're making these tools public, offering the community a standard set of downstream analyses for this data. This ensures proper analysis of outputs and provides users with plots to help diagnose technical or biological issues with donor pools.

We will shortly release a new "cookbook" detailing how to run donor assignment on both scRNASeq and scATACSeq data, with more in depth explanations of the program outputs and explanations of how to interpret each QC plot. Please look for that on this github's front page next to the other cookbooks.

New program FilterReadsByUMISupport

For some analysis, it may be useful to only evaluate UMIs that have at least some minimum or maximum number of reads supporting them. This tool counts the number of UMIs supporting each read, and emits a BAM file that only contains reads within the bounds of MIN_READ_SUPPORT and MAX_READ_SUPPORT.

GatherDigitalAlleleCounts changes

Tracking of alternate alleles improved. Multiallelic sites now report the alternate allele with the highest allele frequency instead of randomly selecting one of the alternate alleles. When POLYMORPHIC_SNPS_ONLY=false, in rare cases GDAC lost track of the alternate allele and emitted an N instead with 0 counts. This was fixed to emit the proper alternate allele and counts data.

DigitalExpression changes - smells like STARSolo

We've conducted a recent assessment of how Optimus/STARSolo expression measurements stack up against DropSeq's techniques. STARSolo outperformed DropSeq in retrieving UMIs from reads that align to multiple genomic locations but only support a single gene. To replicate this capability, simply set READ_MQ=0, allowing for the recovery of these UMIs in a similar fashion.

STARSolo uses different functional annotation strategy - that is, the way annotations of which regions of the genome are exonic, intronic, and antisense are interpreted and prioritized. The STARSOLO strategy priority is very similar to DropSeq, except in cases where a read overlaps both an intron on the sense strand and a coding region on the antisense strand. In these cases, DropSeq favors the intronic interpretation, while STARSolo interprets this as a technical artifact and labels the read as coming from the antisense coding gene, and the read does not contribute to the expression counts matrix. This functional strategy is enabled in STARsolo using the flag --soloFeatures GeneFull_Ex50pAS. This can be summarized by this schematic:

We have added a new parameter FUNCTIONAL_STRATEGY to all programs that interpret gene annotations. This can be set to interpret the annotations with method's priority. When READ_MQ=0 and FUNCTIONAL_STRATEGY=STARSOLO, the methods generate very similar results. The plot below is the summed expression of each gene across cells for a single experiment, where the expression was generated by STARSolo and DropSeq on the same set of sequencing reads aligned by STARSolo.

Cell barcode correction tools

These tools emulate Cell Ranger's cell barcode correction algorithm.

CountBarcodeSequences tallies the appearance of cell barcodes in the input paired-end BAM. Recommended usages is to pass the list of expected barcodes via ALLOWED_BARCODES option, so that only these cell barcodes are counted.
CorrectAndSplitScrnaReadPairs corrects cell barcodes that are edit-distance 1 away from one of the barcodes in the ALLOWED_BARCODE_COUNTS file (produced by CountBarcodeSequences). The program can also split the read pairs into multiple BAM files, in which all the reads for a cell barcode are in a single BAM. This can facilitate parallel processing of the reads.

Installation

Java

Download and unzip dropseq-3.0.0.zip . Wrapper scripts for all the command-line programs will be in the dropseq-3.0.0 directory that is created when unzipping.

R packages

install.packages("https://github.com/broadinstitute/Drop-seq/releases/download/v3.0.0/DropSeq.utilities_3.0.0.tar.gz") 
install.packages("https://github.com/broadinstitute/Drop-seq/releases/download/v3.0.0/DropSeq.dropulation_3.0.0.tar.gz")

Assets 5

02 Aug 13:37

jamesnemesh

v2.5.4

2d826c7

Donor Assignment Minor Release

Update to donor assignment and Census-Seq code

We have recently noticed that as the sequence length increases and SNP backbone density improves, the fraction of reads or UMIs observing more than one transcribed SNP also increases. This trend was particularly evident when comparing the number of "informative" UMIs to the total number of UMIs expressed by each cell. In some instances, the count of informative UMIs surpassed the total UMIs produced by DGE for a cell. In such cases, multiple SNPs were observed on the same read or UMI, which might not represent independent observations. This essentially leads to "double counting" of these observations. To address this issue, AssignCellsToSamples and DetectDoublets now pre-sort the reads by cell and molecular barcode, selecting only the "best" SNP for each pileup. SNP quality is measured using the mean GQ score across all donors, and in case of a tie, a random SNP is chosen. For algorithms not employing UMIs (CensusSeq tools), this filtering occurs at the read level.

Changes in results in a modest test data set of 1614 cells

Donor assignment changes were minor - all discordant donor assignment cells were also flagged as doublets, so would not affect the final donor labels. Probability of assignment was lower on average due to removal of non-independent allele observations.

This change alters the doublet classification of a small percentage (~3%) of cells, with roughly equal numbers of cells flipping between singlet/doublet status. The cells undergoing status changes have very few informative UMIs, making them underpowered for this classification, so it is less surprising that they might change assignment. The vast majority of cells classified as singlets by both the previously released algorithm and the new algorithm show no disagreements in donor assignment. In general, this update does not change outcomes significantly but does affect the likelihoods of assignment due to fewer observations.

Quality of life improvements

Moreover, we have implemented several quality of life improvements to enhance the transparency of the read filtering methods used. This should help in better understanding situations of low-quality data (sequencing or VCF) or mismatches between the sequence data and VCF data, which result in low amounts of data used in the likelihood calculations. It will also assist us in addressing any issues you may submit. Additionally, we have incorporated a "faster-fail" threshold to detect when no UMIs detect a transcribed SNP. This threshold, TRANSCRIBED_SNP_FAIL_FAST_THRESHOLD, represents the number of UMIs that can be observed without detecting a transcribed SNP. If this threshold is exceeded, the program will stop running early and report the number of UMIs observed.

For API users, FiltererIterator has a new method, logFilterResults, which can be overridden by your subclass to automatically generate logging when the iterator is empty.

Assets 3

23 Mar 18:31

jamesnemesh

v2.5.3

e21ba24

MarkChimericReads release

When investigating differences in expression matrixes generated by 10x and our software, we discovered that 10x has implemented a new cleanup step for UMI sequences that are observed by multiple genes with in a cell. This is implemented in 10x's CellRanger software, please see this method:

Cell Ranger again groups the reads by barcode, UMI (possibly corrected), and gene annotation. If two or more groups of reads have the same barcode and UMI, but different gene annotations, the gene annotation with the most supporting reads is kept for UMI counting, and the other read groups are discarded. In case of a tie for maximal read support, all read groups are discarded, as the gene cannot be confidently assigned.

We found this technique to be useful, but wanted to extend this correction to apply not only to a counts matrix, but to the reads in a BAM file. MarkChimericReads accomplishes this by identifying UMIs that are shared across gene within a cell using the same heuristics, then marking problematic reads as map quality 0 so they are ignored by other downstream processes. There are two options for this filtering. We refer to the 10x strategy as RETAIN_MOST_SUPPORTED (our default), and there is an additional strategy REMOVE_ALL that removes all reads where UMIs are observed on multiple genes. We found this strategy to be too conservative in practice, but have left it available for experimentation.

This program is best run after cell barcode selection, but before running any programs that rely on UMIs, such as DigitalExpression or AssignCellsToSamples. The software can be run on a BAM file specifying a cell barcode list, but the memory required to track problematic UMIs for all cell barcodes in a BAM may be prohibitive. We have found this UMI cleanup to yield small incremental improvements in donor assignment and doublet detection.

Assets 3

18 Jan 18:20

jamesnemesh

v2.5.2

98eb326

Donor assignment / doublet detection tools in support of Cell Stem Cell publication

This release is in support of our manuscript "Natural variation in gene expression and viral susceptibility revealed by neural progenitor cell villages" by Wells, et al., which has been accepted for publication in Cell Stem Cell. The following software is relevant for release.

AssignCellsToSamples - This program takes as input a BAM file containing sequencing reads from a pool of donors, and a VCF file containing genotypes from those donors. The program emits the most likely donor for each cell.
DetectDoublets - This program takes as input a BAM file containing sequencing reads from a pool of donors, a VCF file containing genotypes from those donors, and the output from AssignCellsToSamples. The program emits the probability that each cell barcode is a doublet where a cell from two different donors have been co-encapsulated in the same droplet.
TagReadWithGeneFunction - This program adds additional gene function tags to a BAM file that are required to run donor assignment code. This programs are run as part of the standard Drop-seq pipeline, but can also be run on standard 10x BAM files to make them compatible with our toolkit. Please see the Drop-seq alignment cookbook and command line help for details.

Please also see our biorxiv manuscript Natural variation in gene expression and Zika virus susceptibility revealed by villages of neural progenitor cells which is already available. We will publish a computational protocols guide in support of this work in the following weeks.

See the Drop-seq alignment cookbook and Census-seq computational protocols for detailed usage instructions, diagrams, and explanations of how these new systems work. For more information about Census-seq, see our manuscript on bioRxiv

Assets 3

0 Join discussion

13 Jan 14:48

alecw

v2.5.1

680c2ea

Upgrade to Picard 2.26.10, which includes log4j 2.17.1

Picard and log4j upgrades

Picard 2.26.10, which includes log4j 2.17.1

Minor changes

Reduce memory requirement of MergeDgeSparse by making it multi-pass.

Assets 3

17 Dec 00:44

alecw

v2.5.0

f85da7c

Dropulation + upgrade to picard 2.26.8, which includes log4j.16.0

Picard and log4j upgrades

picard.jar v2.26.8, which includes log4j 2.16.0
remove stand-alone log4j.jar
upgrade to biojava 6.0.3
remove slf4j-log4j12

Dropulation programs - these programs are in support of our manuscript Natural variation in gene expression and Zika virus susceptibility revealed by villages of neural progenitor cells.

AssignCellsToSamples - This program takes as input a BAM file containing sequencing reads from a pool of donors, and a VCF file containing genotypes from those donors. The program emits the most likely donor for each cell.
DetectDoublets - This program takes as input a BAM file containing sequencing reads from a pool of donors, a VCF file containing genotypes from those donors, and the output from AssignCellsToSamples. The program emits the probability that each cell barcode is a doublet where a cell from two different donors have been co-encapsulated in the same droplet.

GTF parsing

pass command-line VALIDATION_STRINGENCY to GTFParser
Set default command-lin VALIDATION_STRINGENCY to LENIENT

Minor changes

improvements to sparse matrix I/O
allow merging of DGEs with same prefix if there are no cell barcode collisions.

Assets 3

04 Aug 19:46

jamesnemesh

v2.4.1

92c1eb2

Support for SBARRO experiments

TagReadWithRabiesBarcodes - Tags an unaligned BAM with rabies virus barcode sequences and associated metrics. Barcode sequences are extracted using invariant sequences which flank the barcode regions.
FilterValidRabiesBarcodes - Filters rabies virus tags into pass/fail BAMs based on metrics associated with the extracted viral barcodes.
BipartiteRabiesVirusCollapse - Collapse rabies virus sequences in which half the barcode matches within some edit distance, within a cell.

Other updates

TrimStartingSequence update: New algorithm with mismatch rate and sequence not require to be anchored to start of read. Note that different sequence is recommended for 10X and Drop-seq. Old, buggy behavior can be enabled with LEGACY=true.
PolyATrimmer now optionally stores length of polyA and trimmed adapter
ConvertTagToReadGroup improved error messages when read group missing from input data
FilterBamByGeneFunction - new program to filter reads to specified gene functions, in the same way DigtialGeneExpression and related programs internally filter data.
CensusSeq and related software can now take as input a list of BAMs.
Csi analysis now generates a contamination confidence interval.
Improvements to make GTF parsing more robust

Assets 3

1 Join discussion

15 Sep 21:33

alecw

v2.4.0

3d2b3d8

Release 2.4.0

Census-seq tools and documentation (CensusSeq, RollCall, CsiAnalysis)
New program SplitBamByCell - partitions a large BAM into many smaller BAMs by cell barcodes, so that each cell barcode is contained in only one output BAM, and BAMs have roughly the same size.
More flexible DGE merging
New program CountUnmatchedSampleIndices
Upgrade to Picard 2.20.5
Add option to FilterBam and FilterBamByTag to fail if not enough reads pass filter
Many bug fixes -- git log v2.3.0..v2.4.0 to view

Assets 3

16 Apr 19:51

alecw

v2.3.0

34e6572

Release 2.3.0

New program ConvertTagToReadGroup.
Encode quality tags as strings in TagBamWithReadSequenceExtended.
FilterBam bug fix: output file was not being close properly.
In DigitalExpression, allow null STRAND_STRATEGY and/or null LOCUS_FUNCTION_LIST, for BAMs that are not tagged with gene function.
In FilterBam and FilterBamByTag, add ability to output a summary file.

See the Drop-seq alignment cookbook for detailed usage instructions, diagrams, and explanations of how these new systems work.

Assets 3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changes

Installation

Java

R packages

Contributors

New and changed

Java 21 and Gradle

R packages for Dropulation and Census-seq

New program FilterReadsByUMISupport

GatherDigitalAlleleCounts changes

DigitalExpression changes - smells like STARSolo

Cell barcode correction tools

Installation

Java

R packages

Update to donor assignment and Census-Seq code

Changes in results in a modest test data set of 1614 cells

Quality of life improvements

This release is in support of our manuscript "Natural variation in gene expression and viral susceptibility revealed by neural progenitor cell villages" by Wells, et al., which has been accepted for publication in Cell Stem Cell. The following software is relevant for release.

Picard and log4j upgrades

Minor changes

Picard and log4j upgrades

Dropulation programs - these programs are in support of our manuscript Natural variation in gene expression and Zika virus susceptibility revealed by villages of neural progenitor cells.

GTF parsing

Minor changes

Support for SBARRO experiments

Other updates

Releases: broadinstitute/Drop-seq

Shell script standardization and other minor changes

Changes

Installation

Java

R packages

Contributors

Java 21, R packages for Dropulation and Census-seq

New and changed

Java 21 and Gradle

R packages for Dropulation and Census-seq

New program FilterReadsByUMISupport

GatherDigitalAlleleCounts changes

DigitalExpression changes - smells like STARSolo

Cell barcode correction tools

Installation

Java

R packages

Donor Assignment Minor Release

Update to donor assignment and Census-Seq code

Changes in results in a modest test data set of 1614 cells

Quality of life improvements

MarkChimericReads release

Donor assignment / doublet detection tools in support of Cell Stem Cell publication

This release is in support of our manuscript "Natural variation in gene expression and viral susceptibility revealed by neural progenitor cell villages" by Wells, et al., which has been accepted for publication in Cell Stem Cell. The following software is relevant for release.

Upgrade to Picard 2.26.10, which includes log4j 2.17.1

Picard and log4j upgrades

Minor changes

Dropulation + upgrade to picard 2.26.8, which includes log4j.16.0

Picard and log4j upgrades

Dropulation programs - these programs are in support of our manuscript Natural variation in gene expression and Zika virus susceptibility revealed by villages of neural progenitor cells.

GTF parsing

Minor changes

Support for SBARRO experiments

Support for SBARRO experiments

Other updates

Release 2.4.0

Release 2.3.0