WDL-based workflows for BAM-to-PBAM-to-BAM conversions. For additional information on the protocol and file formats, see http://privaseq3.gersteinlab.org/docs/.
The workflows can be run using caper. Install caper by following its installation instructions.
In the wdl directory you will find workflows for all the supported formats. Alongside each wdl file, an input json template is provided.
As an example, assume you have a bam file that originates from a bulk RNA sequencing experiment and has been aligned to the human GRCh38 reference, and you want to make a privacy-aware bam from it. The workflow you need is located in wdl/genome/make_pbam_genome.wdl, and the input template in wdl/genome/genome_pbam_input_template.json.
Fill in the locations of your bam and reference files into the template. In addition to your local machine, acceptable file storages are https://, gs:// and s3://. In all the workflows there is an option to provide the reference files (reference genome, reference transcriptome and the annotation) either as plain files or as files compressed with gzip. If you want to use a gzipped file, add the _gz suffix to the input parameter name (see the second example below).
{
"genome.bam": "<your bam location here>",
"genome.reference_fasta": "<GRCh38.fasta location here>",
"genome.cpu": 1,
"genome.memory_gb": 2,
"genome.disk": "local-disk 20 SSD"
}
If you want to provide the reference genome as a compressed file, genome.reference_fasta needs to be changed into genome.reference_fasta_gz:
{
"genome.bam": "<your bam location here>",
"genome.reference_fasta_gz": "<GRCh38.fasta.gz location here>",
"genome.cpu": 1,
"genome.memory_gb": 2,
"genome.disk": "local-disk 20 SSD"
}
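The choice between the plain and the gzipped parameter name can also be made programmatically. The following is a minimal sketch that assumes only the parameter names shown in the two examples above; the helper name build_genome_input and the file names are hypothetical and are not part of the repository.

```python
import json


def build_genome_input(bam, reference, cpu=1, memory_gb=2, disk="local-disk 20 SSD"):
    """Build the input dict for the genome workflow (hypothetical helper).

    The reference parameter gets the _gz suffix when the path ends in .gz.
    """
    if reference.endswith(".gz"):
        ref_key = "genome.reference_fasta_gz"
    else:
        ref_key = "genome.reference_fasta"
    return {
        "genome.bam": bam,
        ref_key: reference,
        "genome.cpu": cpu,
        "genome.memory_gb": memory_gb,
        "genome.disk": disk,
    }


if __name__ == "__main__":
    # Placeholder locations; replace with your own bam and reference.
    inputs = build_genome_input("sample.bam", "GRCh38.fasta.gz")
    print(json.dumps(inputs, indent=2))
```

Redirecting the printed JSON into a file gives an input that can be passed to caper with -i.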
Save the input json containing the locations of your input files.
Memory and disk requirements depend on the size of the input. A good starting point for disk is 5x the size of your bam file, and for memory 16GB should be sufficient for most bam files. For now, most of the steps run as a single process. A parallelized version will be available in the future.
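The 5x rule of thumb can be turned into a value for the disk parameter. This is a rough sketch for a local bam file; it assumes the number in the disk string is a size in GB and uses the 20 from the template as a lower bound. Both function names are hypothetical.

```python
import math
import os


def disk_string_for_size(size_bytes, factor=5, minimum_gb=20):
    """Return a disk string sized at `factor` times the input, rounded up to whole GB.

    Assumption: the number in "local-disk N SSD" is a size in GB.
    """
    disk_gb = max(minimum_gb, math.ceil(size_bytes * factor / 10**9))
    return f"local-disk {disk_gb} SSD"


def disk_string_for_bam(bam_path):
    """Same as above, reading the size of a bam stored on the local machine."""
    return disk_string_for_size(os.path.getsize(bam_path))
```

For example, disk_string_for_size(10 * 10**9) gives "local-disk 50 SSD" for a 10 GB bam. For files stored under https://, gs:// or s3://, the size has to be looked up by other means.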
Run the workflow:
caper run -i <your_input.json> wdl/genome/make_pbam_genome.wdl -m metadata.json --docker
If you are using singularity, use the --singularity option instead of --docker.
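If you launch runs from a script, the command above can be assembled as an argument list. This sketch uses only the options that appear in the command shown in this README; the function names are hypothetical, and nothing is executed unless run_workflow is called.

```python
import subprocess


def caper_command(wdl, input_json, metadata="metadata.json", container="docker"):
    """Assemble the caper command line as a list of arguments."""
    if container not in ("docker", "singularity"):
        raise ValueError("container must be 'docker' or 'singularity'")
    return ["caper", "run", "-i", input_json, wdl, "-m", metadata, f"--{container}"]


def run_workflow(wdl, input_json, **kwargs):
    """Run caper and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(caper_command(wdl, input_json, **kwargs), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_workflow(...) to actually start a run.
    print(" ".join(caper_command("wdl/genome/make_pbam_genome.wdl", "my_input.json")))
```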
After the run finishes, metadata.json, containing detailed information about the run, is written. In the outputs section of metadata.json you will find the location of the pbam file.
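The outputs section can also be read with a few lines of Python. This sketch assumes that outputs is a top-level key of metadata.json mapping output names to file locations; the exact output names depend on the workflow and are not listed here, so the example prints all of them.

```python
import json
import os


def load_outputs(metadata_path):
    """Return the outputs section of a metadata file, or an empty dict if absent."""
    with open(metadata_path) as handle:
        metadata = json.load(handle)
    return metadata.get("outputs", {})


if __name__ == "__main__":
    if os.path.exists("metadata.json"):
        for name, location in load_outputs("metadata.json").items():
            print(f"{name}: {location}")
```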
If you do not intend anyone to be able to restore the information contained in the bam file, you are done. If you need to be able to reverse the transformation, you will need to create a diff file corresponding to your bam.
Note that if your bam file was aligned to the human transcriptome, you can use the workflow located in wdl/transcriptome/make_pbam_transcriptome.wdl. For single-cell RNA-Seq data that was aligned to the human reference genome using STAR, you can still use the workflow located in wdl/genome/make_pbam_genome.wdl.
The workflow for creating the diff file and the corresponding input template are located in the wdl/diff directory. As above, fill in the location of the bam file into the template:
{
"diff.bam": "<your bam location here>",
"diff.cpu": 1,
"diff.memory_gb": 2,
"diff.disk": "local-disk 20 SSD"
}
Memory and disk requirements are as above, and the workflow is run in the same way.
To restore a regular bam from the pbam and diff files, use the workflow and input template located in wdl/pbam2bam. The process is very similar to the previous steps. Fill in the input files into the template:
{
"pbam2bam.pbam": "<your pbam location here>",
"pbam2bam.diff": "<corresponding diff location here>",
"pbam2bam.run_type": "genome",
"pbam2bam.reference_fasta": "<reference fasta location here>",
"pbam2bam.cpu": 1,
"pbam2bam.memory_gb": 2,
"pbam2bam.disk": "local-disk 20 SSD"
}
If you want to use a gzip-compressed reference file, use the following:
{
"pbam2bam.pbam": "<your pbam location here>",
"pbam2bam.diff": "<corresponding diff location here>",
"pbam2bam.run_type": "genome",
"pbam2bam.reference_fasta_gz": "<reference fasta.gz location here>",
"pbam2bam.cpu": 1,
"pbam2bam.memory_gb": 2,
"pbam2bam.disk": "local-disk 20 SSD"
}
Running the workflow and locating the outputs work exactly as before.
We recommend running the WDL workflows, but it is also possible to run the scripts directly, although some setup will be necessary (it cannot be guaranteed that, even after following these instructions, your environment will be exactly the same as the docker image we use to run the WDL):
- Install python 3.8.5
- Install the following python packages: numpy==1.19.2 biopython==1.78 pandas==1.1.3
- Get Picard 2.23.8 from https://github.com/broadinstitute/picard/releases/download/2.23.8/picard.jar, and add the .jar file to PATH.
- Install samtools 1.11
- Install a java runtime, for example openjdk 11.0.8.
- Add the scripts from directories 10xscell, diff, genome, pbam2bam and transcriptome to PATH.
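Before running the scripts directly, it can help to check the local setup against the list above. The following is a minimal sketch using only the Python standard library; it compares installed package versions with the pinned ones and checks that samtools and java can be found on PATH. The function names are hypothetical, and the sketch does not verify the versions of samtools, java or Picard.

```python
import shutil
from importlib import metadata

# Versions pinned in the setup list above.
EXPECTED_PACKAGES = {"numpy": "1.19.2", "biopython": "1.78", "pandas": "1.1.3"}
REQUIRED_TOOLS = ("samtools", "java")


def package_mismatches(expected, installed_version=metadata.version):
    """Map each package whose installed version differs to the version found.

    A package that is not installed is reported with the value None.
    """
    problems = {}
    for name, wanted in expected.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            problems[name] = found
    return problems


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    print("package mismatches:", package_mismatches(EXPECTED_PACKAGES))
    print("missing tools:", missing_tools(REQUIRED_TOOLS))
```

An empty dict and an empty list mean the pinned packages and the two tools were found; this is still no guarantee that the environment matches the docker image.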
Additional READMEs for running the bash scripts are provided in each folder.