Skip to content

reformat inDrop fastqs so they look like 10X genomics fastqs

Notifications You must be signed in to change notification settings

VanAndelInstitute/intent

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

41 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

intent

Convert fastq files from inDrop v2 libraries to 10X genomics style fastqs to facilitate analysis by some downstream tools like alevin.

This program takes an R1 and and R2 fastq from an inDrop run, and reorganizes the barcodes. R1 and R2 are swapped, and the cellular barcodes 1 and 2 are extracted and put at the start of each read and the UMI is put after the cellular barcode. Note that inDrop v2 uses just 6 bases for their UMIs. This sofware extends the UMIs to 10-14 basepairs by adding some or all of cell barcode 2 on to it. By default, a 12 bp UMI is created in this way. This is shortened to 10 bp if the -t flag is given (to simulate v2 10X Genomics chemistry), or lengthened to 14 if the -x flag is given. In addition, the combined cell barcode for v2 inDrop chemistry is 16-20 basepairs. This is truncated to the first 16 bases of BC2+BC1 when run in default mode or with the -t flag. If the -x flag is given, a 20bp cell barcode is provided (which in some cases will require padding with up to 4 G's since BC1 can be anywhere between 8 and 12 bp).

If barcode 1 is shorter than 8 bp or if the UMI is shorter than 6bp, the barcode (and corresponding transcript read) is discarded. If barcode 1 is 8-11 basepairs, it is padded to 12 basepairs with G's. If the W1 portion of R2 aligns poorly to the expected W1 sequence (GAGTGATTGCTTGTGACGCCTT), then the barcode is discarded. "Poorly" is defined as exceeding the Levenshtein distance limit set by the parameter -d, which by default is set to '2'.

Note that barcode 1 and barcode 2 are compared to the official V2 barcode list provided with the indrop pipeline (prior to any padding). If there is no match, then the barcode is corrected if possible if there is a unique barcode within a Levenshtein distance of 1 of the read barcode.

Due to the way that the barcodes are designed, the padding and truncation performed should in fact not introduce any ambiguity to the barcodes.

The quality metrics for the reads are preserved and re-ordered along with the barcodes.

The output files are as follows (C = cell barcode, U = UMI):

output_r1.fastq: 

    default (Chromium v3 style):
    CCCCCCCCCCCCCCCCUUUUUUUUUUUU

    -t (Chromium v2 style):
    CCCCCCCCCCCCCCCCUUUUUUUUUU

    -x (Entire cell barcode, padded if needed, and augmented UMI)
    CCCCCCCCCCCCCCCCCCCCUUUUUUUUUUUUUU

output_r2.fastq:

    All transcript reads corresponding to high quality barcodes.

Note that all input files must be gzipped, and the program outputs gzipped files.

CAVEAT EMPTOR This software is brand spanking new and minimally tested. Please verify results as your mileage may vary.

Installation

Clone the repository, then:

cd intent
make
cp bin/intent /somewhere/on/your/PATH

Usage

intent [options] read1.fastq read2.fastq output    

    read1.fastq.gz      read 1 fastq file (gzipped) from a v2 inDrop sequencing run.
    read2.fastq.gz      read 2 fastq file (gzipped) from a v2 inDrop sequencing run.
    output              root name of the output; transformed fastqs will be  
                        written to output_R1.fastq and output_R2.fastq. R1 
                        and R2 are swapped, and ambigious reads dropped  
                        (i.e. those with low quality w1 read or poorly formed
                        barcodes or UMIs); the resulting R1 fastq will have 16
                        bases of cell barcode follwed by 12 of UMI (unless
                        -x or -t flags are specified); the R2 fastq will
                        contain the corresponding transcript reads from the 
                        original R1 fastq (excluding those corresponding to 
                        dropped low quality barcodes. 

    Options:
        -v              report the version and exit.
        -x              expanded barcodes and umi format (20 / 14) 
        -t              Chromium v2 barcode format (16 / 10) 
        -d 2            maximum W1 alignment distance (default = 2)
        -h              display this help message

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Notably, only V2 of inDrop chemistry is supported. Naively, it should be pretty easy to add support for V1 or V3. I just don't have access to that fastq files nor deep understanding of their barcoding scheme. I would be happy to discuss further or receive pull requests.

License

MIT

About

reformat inDrop fastqs so they look like 10X genomics fastqs

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published