Using the work that Matthew Galbraith outlined, we make a Nextflow workflow that allows this to be run and scaled.
Create a conda environment to ensure all required pieces are installed cleanly. Here we call it picard.
Conda environment management is rich and nearly complete, but there are currently some problems - for example, installing samtools via conda does not work. Search anaconda.org for packages.
conda create -n picard -y
Install nextflow
We develop and run our workflow using Nextflow, so we need to install it :)
conda install -c bioconda nextflow
Keep track of everything on GitHub - the final workflow will be main.nf,
with configuration in nextflow.config.
We need the GitHub command line interface routine, gh.
conda install -c conda-forge gh
Need to use your GitHub token to authenticate.
gh auth login
Sorry, the editor in JupyterLab is great, but I like emacs.
conda install -c conda-forge emacs -y
Two routines are used - picard FilterSamReads and samtools fastq.
samtools fastq --reference data/Homo_sapiens_assembly38.fasta -1 HTP0003A.1.fastq -2 HTP0003A.2.fastq HTP0003A_filtered.cram
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 1613932 reads
Checking results (4 lines per read in a fastq file)
(picard) wc -l *.fastq
3227864 HTP0003A.1.fastq
3227864 HTP0003A.2.fastq
6455728 total
Double-check the numbers:
3227864 / 4 = 806966
806966 * 2 = 1613932 Reads
It adds up.
docker run -it -v $PWD:$PWD -w $PWD samtools samtools fastq --reference data/Homo_sapiens_assembly38.fasta -1 HTP0003A.1.fastq -2 HTP0003A.2.fastq 2022Apr23ManualRun/HTP0003A_filtered.cram
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 1613932 reads
The response is identical to the earlier manual run.
We made two containers; now we stitch those together as two processes, making sure the file names are updated based upon the input file name, and combine them into a workflow.
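The wiring of containers to processes lives in nextflow.config. A minimal sketch, assuming hypothetical process names and locally built image tags (neither is taken from the repository):

```nextflow
// nextflow.config sketch - process names and image tags are assumptions.
docker.enabled = true

process {
    withName: 'filter_cram'   { container = 'samtools' } // assumed local samtools image
    withName: 'cram_to_fastq' { container = 'picard' }   // assumed local picard image
}
```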
We follow the Nextflow documentation FAQ entry "How do I process multiple input files in parallel?" (updated Jan 8 2022).
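In that pattern, each file emitted by a channel becomes its own task, so multiple crams run in parallel. A minimal sketch, assuming DSL2; the glob default and the count_reads process are purely illustrative, not part of the pipeline:

```nextflow
nextflow.enable.dsl=2

// illustrative default, not the pipeline's actual one
params.input = 'data/*.cram'

// hypothetical process, just to show the pattern
process count_reads {
    input:
    path cram

    output:
    stdout

    script:
    """
    samtools view -c ${cram}
    """
}

workflow {
    // one task is spawned per file matched by the glob, so inputs run in parallel
    Channel.fromPath(params.input) | count_reads | view
}
```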
Using the steps outlined by our colleague 'Nick':
Pseudo code:
- samtools view, in three steps:
  a. samtools view to convert the cram to sam with the header.
  b. samtools view to start the new output file with the header only.
  c. samtools view to filter based upon a provided filter string (see the script below - narrow to chromosome, gene of interest).
- picard SamToFastq to extract the paired reads overlapping the newly extracted region and produce the R1_fastq and R2_fastq files.
- fastqc performs quality control analysis on the extracted fastq files, using a container built from fastqc-docker.
- multiqc creates a final quality control report from the fastqc output, using a container built from multiqc-docker.
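A sketch of the filtering and extraction steps above as a single Nextflow process, assuming DSL2. The samtools commands follow the pseudo code; the process name, container tag, and output naming are assumptions:

```nextflow
// Sketch only - names and details beyond the pseudo code are assumptions.
process filter_and_extract {
    container 'picard' // assumed image providing both samtools and picard

    input:
    path cram
    path reference
    val  filter_string

    output:
    path "${cram.baseName}.R1.fastq"
    path "${cram.baseName}.R2.fastq"

    script:
    """
    # a. convert cram to sam, keeping the header
    samtools view -h --reference ${reference} ${cram} > ${cram.baseName}.sam
    # b. start the new output file with the header only
    samtools view -H --reference ${reference} ${cram} > ${cram.baseName}.filtered.sam
    # c. append the reads matching the provided filter string
    samtools view --reference ${reference} ${cram} | ${filter_string} >> ${cram.baseName}.filtered.sam
    # extract the paired reads overlapping the filtered region
    picard SamToFastq \
        I=${cram.baseName}.filtered.sam \
        FASTQ=${cram.baseName}.R1.fastq \
        SECOND_END_FASTQ=${cram.baseName}.R2.fastq \
        MAX_RECORDS_IN_RAM=${params.max_records_in_ram}
    """
}
```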
The input requirements can be obtained by running with the option --help:
(picard) nichdm02209715:picard-filterSamReads deslattesmaysa2$ nextflow run main.nf --help
N E X T F L O W ~ version 21.10.6
Launching `main.nf` [modest_austin] - revision: 9d1ff459fd
Usage:
The typical command for running the pipeline is as follows:
Inputs Options:
--input Input cram
--max_records_in_ram For picard tools to specify the maximum records in ram (default is 500000).
--filter_string String to use to filter region from cram (e.g. "grep -e chr6 -e HLA -e "*"")
--outdir The directory for the filtered cram (default is CRAMS_filtered).
--outputfile test place holder - shouldn't be necessary
--reference_fasta The assembly reference (Homo sapiens assembly as a fasta file).
--tracedir Where the traces and DAG and reports are kept.
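A sketch of how these options might be declared near the top of main.nf; the defaults shown are assumptions except where the help text above states them:

```nextflow
// Defaults are assumptions except where the --help text states them.
params.input              = null
params.max_records_in_ram = 500000            // stated default
params.filter_string      = null
params.outdir             = 'CRAMS_filtered'  // stated default
params.outputfile         = null
params.reference_fasta    = null
params.tracedir           = 'execution_trace' // assumption
params.help               = false
```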
To execute the test file, the following command was run:
Note for the test, we used the filter string `grep -e chr22 -e USP18 -e "*"`
nextflow run main.nf \
--input data/test.chr22.Aligned.sortedByCoord.out.cram \
--filter_string "grep -e chr22 -e USP18 -e \"*\"" \
--outdir "test_output" \
--outputfile "test_outputfile" \
--reference_fasta "data/GRCh38.primary_assembly.genome.chr22.fa" \
--tracedir "execution_trace" \
-with-trace \
-with-report \
-with-dag "execution_trace/test_output.png" \
-with-timeline
The command, when executed on my MacBook Pro, ran very quickly with the limited data files isolated to the chr22 segment of the human genome. GitHub Actions can be set up to ensure that this Nextflow script continues to run as changes are made.
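A minimal sketch of such a GitHub Actions workflow, assuming the test data is committed to the repository; the file name and trigger are illustrative, and the setup actions are the standard published ones:

```yaml
# .github/workflows/test.yml - a sketch, not the repository's actual CI file
name: test-nextflow
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: nf-core/setup-nextflow@v1
      - run: |
          nextflow run main.nf \
            --input data/test.chr22.Aligned.sortedByCoord.out.cram \
            --filter_string "grep -e chr22 -e USP18 -e \"*\"" \
            --reference_fasta data/GRCh38.primary_assembly.genome.chr22.fa
```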
Nextflow has nice features for creating an execution report with timeline and resource details. These may be found in the execution trace directory.
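Instead of passing -with-trace, -with-report, -with-dag, and -with-timeline on the command line each time, these can be enabled once in nextflow.config; a sketch, assuming the params.tracedir value above:

```nextflow
// Equivalent of the -with-* command line flags, set in nextflow.config.
trace {
    enabled = true
    file    = "${params.tracedir}/execution_trace.txt"
}
report {
    enabled = true
    file    = "${params.tracedir}/execution_report.html"
}
timeline {
    enabled = true
    file    = "${params.tracedir}/execution_timeline.html"
}
dag {
    enabled = true
    file    = "${params.tracedir}/pipeline_dag.png"
}
```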
| Nextflow Execution Artefacts | Test Output |
|---|---|
| The Nextflow execution report | |
| The Nextflow tasks details | |
| The Nextflow resource usage | |
To view the details, you can download the html files to your own computer and view within your browser (Chrome preferred)
Phil Ewels continues to produce so many wonderful tools, including MultiQC.
The output from running fastqc and multiqc on the test files may be found in the test output directory.
| Multiqc & Fastqc Execution Artefacts | Test Output |
|---|---|
| The Multiqc report | |
| The Fastqc details | |
To view the complete details, download the html files to your own computer and view within your browser (Chrome preferred)
A nice feature of the Cavatica platform: we can now upload our Nextflow workflow onto Cavatica.
The steps are as follows:
- Create your workflow and put it into GitHub.
- Create a credentials file placed at ~/.sevenbridges/credentials. The content looks like this:
[deslattesmaysa2]
api_endpoint = https://cavatica-api.sbgenomics.com/v2
auth_token = [your developers token]
The name in between the [] is your username on the platform.
- git clone your GitHub workflow in a clean directory. This is important because the process of uploading as an application onto Cavatica zips up the directory - and you do not want your old work directories to be zipped inside!
- Install the sbpack_nf routine. This is done with pip:
pip3 install sbpack
- Now use the sbpack_nf command. The link gives all the details for the options.
sbpack_nf --profile deslattesmaysa2 --appid matthew.galbraith/kf-htp-cram-hla-extractions/picard-filterSamReads-nf --sb-doc /Users/deslattesmaysa2/projects/picard-filterSamReads/README.md --entrypoint main.nf --workflow-path /Users/deslattesmaysa2/projects/picard-filterSamReads
- Edit the sb_nextflow_schema.yaml to accept the input files, using the details as outlined in the Cavatica Nextflow help pages. The final form of the sb_nextflow_schema.yaml may be found in this repository: sb_nextflow_schema.yaml.
- Upload the edited sb_nextflow_schema.yaml. Note that this is done with the sbpack command.
sbpack deslattesmaysa2 matthew.galbraith/picard-test/picard-filtercramfile-nf sb_nextflow_schema.yaml
And then it is an application ready to be used on the Cavatica platform.
Generated when running Nextflow, here is the process graph: