GitHub - csawye01/nf-core-demultiplex: Demultiplexing pipeline for sequencing data

Introduction

nf-core/demultiplex is a bioinformatics pipeline used to demultiplex the raw data produced by next-generation sequencing machines. At present, only Illumina sequencing data is supported.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with Docker containers, making installation trivial and results highly reproducible.

Pipeline summary

Reformatting the input sample sheet
- Searches for the [Data] tag
- Splits 10X single cell samples into separate 10X, 10X-ATAC and 10X-DNA .csv files by searching the sample sheet column DataAnalysisType for 10X-3prime, 10X-ATAC and 10X-CNV.
- Outputs which processes in the pipeline need to run (only 10X single cell samples, a mix of 10X single cell and non single cell samples, or only non single cell samples)
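The reformatting step can be pictured with a short Python sketch. It is not the pipeline's own code: the function names, the "standard" group label and the example header lines are assumptions made for illustration. Only the [Data] tag, the DataAnalysisType column and its three 10X values come from the description above.

```python
import csv
import io

# DataAnalysisType value -> output group, as listed in the summary above.
TENX_GROUPS = {"10X-3prime": "10X", "10X-ATAC": "10X-ATAC", "10X-CNV": "10X-DNA"}


def read_data_section(text):
    """Return the rows found after the [Data] tag as a list of dicts."""
    lines = text.splitlines()
    # Locate the line that opens the [Data] section.
    start = next(i for i, line in enumerate(lines) if line.strip().startswith("[Data]"))
    return list(csv.DictReader(io.StringIO("\n".join(lines[start + 1:]))))


def split_by_analysis_type(rows):
    """Group rows into 10X, 10X-ATAC, 10X-DNA and 'standard' (everything else)."""
    groups = {}
    for row in rows:
        group = TENX_GROUPS.get(row.get("DataAnalysisType", ""), "standard")
        groups.setdefault(group, []).append(row)
    return groups


# Illustrative sample sheet; the [Header] content is made up.
example = """[Header]
Date,2019-01-01
[Data]
Lane,Sample_ID,index,index2,Sample_Project,ReferenceGenome,DataAnalysisType
1,ABC11A2,TCGATGTG,CTCGATGA,PM10000,Homo sapiens,Whole Exome
2,SAG100A10,SI-GA-C1,,SC18100,Mus musculus,10X-3prime
3,CAP200A11,CTCGATGA,,PM18200,Homo sapiens,Other
"""

groups = split_by_analysis_type(read_data_section(example))
print({name: [r["Sample_ID"] for r in rows] for name, rows in groups.items()})
```

Which groups come back non-empty is enough to tell the three cases apart: only 10X groups, only the standard group, or a mix of both.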
Checking the sample sheet for samples that would cause downstream errors, such as:
- a mix of short and long indexes on the same lane
- a mix of single and dual indexes on the same lane
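A minimal sketch of these two checks in Python is shown here. It assumes that any difference in index length within a lane counts as a mix of short and long indexes, and that a sample is dual-indexed when its index2 field is non-empty. The function name and the example rows are hypothetical and not taken from the pipeline.

```python
from collections import defaultdict


def find_lane_conflicts(rows):
    """Return {lane: [problem descriptions]} for lanes with mixed index layouts."""
    lengths = defaultdict(set)  # lane -> set of index lengths seen
    layouts = defaultdict(set)  # lane -> subset of {"single", "dual"}
    for row in rows:
        lane = row["Lane"]
        lengths[lane].add(len(row["index"]))
        layouts[lane].add("dual" if row.get("index2") else "single")

    conflicts = defaultdict(list)
    for lane in lengths:
        if len(lengths[lane]) > 1:
            conflicts[lane].append("mixed index lengths")
        if len(layouts[lane]) > 1:
            conflicts[lane].append("mixed single and dual indexes")
    return dict(conflicts)


# Made-up rows: lane 1 mixes single and dual, lane 2 mixes lengths, lane 3 is clean.
rows = [
    {"Lane": "1", "Sample_ID": "S1", "index": "TCGATGTG", "index2": "CTCGATGA"},
    {"Lane": "1", "Sample_ID": "S2", "index": "CTCGATGA", "index2": ""},
    {"Lane": "2", "Sample_ID": "S3", "index": "TCGATG", "index2": ""},
    {"Lane": "2", "Sample_ID": "S4", "index": "CTCGATGA", "index2": ""},
    {"Lane": "3", "Sample_ID": "S5", "index": "TCGATGTG", "index2": ""},
]
conflicts = find_lane_conflicts(rows)
print(conflicts)
```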
Processes that only run if the sample sheet check finds issues within the sample sheet (CONDITIONAL):
1. Creates a new sample sheet with any samples that would cause an error removed, and creates a .txt file listing the removed problem samples.
2. Runs bcl2fastq on the newly created sample sheet and outputs the Stats.json file.
3. Parses the Stats.json file for the indexes that were in the problem samples list.
4. Rechecks the newly made sample sheet for any errors or problem samples that did not match any indexes in the Stats.json file. If there is still an issue, the pipeline will exit at this stage.
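The Stats.json lookup in step 3 might look roughly like the Python sketch below. It relies on the UnknownBarcodes section that bcl2fastq writes to Stats.json (one entry per lane, each with a Lane number and a Barcodes mapping from sequence to read count). How the pipeline itself matches problem indexes is not described here, so the prefix-matching rule, the function name and the sample data are all assumptions.

```python
import json


def match_problem_indexes(stats, problem_indexes):
    """Map each problem index to the unknown barcodes (per lane) that start with it."""
    matches = {}
    for entry in stats.get("UnknownBarcodes", []):
        lane = entry["Lane"]
        for barcode, count in entry["Barcodes"].items():
            for index in problem_indexes:
                # Assumed rule: a problem index matches if the barcode begins with it.
                if barcode.startswith(index):
                    matches.setdefault(index, []).append((lane, barcode, count))
    return matches


# A trimmed, made-up Stats.json fragment for illustration only.
stats = json.loads("""
{"UnknownBarcodes": [
  {"Lane": 1, "Barcodes": {"TCGATG+AGATCT": 5000, "GGGGGG+AGATCT": 120}}
]}
""")

matches = match_problem_indexes(stats, ["TCGATG", "AAAAAA"])
print(matches)
```

In this sketch, an index that is missing from the result ("AAAAAA" here) corresponds to a problem sample that did not match anything in Stats.json, which is the case step 4 rechecks.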
Single cell 10X sample processes (CONDITIONAL). NOTE: A config must be created to point to the Cell Ranger genome references.
1. Cell Ranger mkfastq runs only when 10X samples exist. The process runs with Cell Ranger, Cell Ranger ATAC and Cell Ranger DNA, depending on which sample sheets have been created.
2. Cell Ranger count runs only when 10X samples exist. The process runs with Cell Ranger count, Cell Ranger ATAC count and Cell Ranger DNA CNV, depending on the output from Cell Ranger mkfastq. 10X reference genomes can be downloaded from the 10X site. When the pipeline is used outside the Crick profile, a new config must be created to point Cell Ranger to the location of these references.
bcl2fastq (CONDITIONAL):
1. Runs on either the original sample sheet, if it had no error-prone samples, or on the new sample sheet created by the extra steps.
2. Only runs when there are samples left on the sample sheet after removing the single cell samples.
3. The arguments passed to bcl2fastq are parameters that can be set on the command line when launching the pipeline. Takes into account whether index reads will also be written as FastQ files.
FastQC runs on the pooled FastQ files from all the conditional processes.
FastQ Screen runs on the pooled results from all the conditional processes. You must provide your own fastq_screen config for it to use.
MultiQC runs on each project's FastQC results.
MultiQC_all runs on all FastQC results produced.

Sample sheet format

The input sample sheet must adhere to Illumina standards as outlined in the table below. Additional columns for DataAnalysisType and ReferenceGenome are required for the correct processing of 10X samples. The order of the columns does not matter, but the case of the column names does.

Lane	Sample_ID	index	index2	Sample_Project	ReferenceGenome	DataAnalysisType
1	ABC11A2	TCGATGTG	CTCGATGA	PM10000	Homo sapiens	Whole Exome
2	SAG100A10	SI-GA-C1		SC18100	Mus musculus	10X-3prime
3	CAP200A11	CTCGATGA		PM18200	Homo sapiens	Other
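Because column order is free but column names are case-sensitive, a sample sheet header can be checked by name rather than by position. The Python sketch below illustrates the idea. Treating all seven columns from the table as required, and the function name itself, are assumptions rather than the pipeline's documented behaviour.

```python
import csv
import io

# Column names exactly as they appear in the table above (assumed all required).
REQUIRED_COLUMNS = {
    "Lane", "Sample_ID", "index", "index2",
    "Sample_Project", "ReferenceGenome", "DataAnalysisType",
}


def missing_columns(header_line, delimiter=","):
    """Return the required column names absent from the header (case-sensitive)."""
    header = next(csv.reader(io.StringIO(header_line), delimiter=delimiter))
    return sorted(REQUIRED_COLUMNS - {name.strip() for name in header})


# Reordered columns are accepted.
reordered = missing_columns(
    "DataAnalysisType,index,Lane,Sample_ID,index2,Sample_Project,ReferenceGenome"
)
# Wrong case ("lane", "Index") is reported as missing.
wrong_case = missing_columns(
    "lane,Sample_ID,Index,index2,Sample_Project,ReferenceGenome,DataAnalysisType"
)
print(reordered, wrong_case)
```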

Documentation

The nf-core/demultiplex pipeline comes with documentation about the pipeline, found in the docs/ directory.

Credits

The nf-core/demultiplex pipeline was written by Chelsea Sawyer from The Bioinformatics & Biostatistics Group for use at The Francis Crick Institute, London.

Many thanks to others who have helped out along the way too, including (but not limited to): @ChristopherBarrington, @drpatelh, @danielecook, @escudem, @crickbabs

Citation

You can cite the nf-core pre-print as follows:
Ewels PA, Peltzer A, Fillinger S, Alneberg JA, Patel H, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. nf-core: Community curated bioinformatics pipelines. bioRxiv. 2019. p. 610741. doi: 10.1101/610741.
