# Running vsn

First, generate a metadata file as described here: https://vsn-pipelines.readthedocs.io/en/latest/scatac-seq.html

You can do this manually, but I did it automatically using some python string manipulations (see notebook 1b)

In [1]:
cat metadata.35k.tsv

sample_name	technology	fastq_PE1_path	fastq_barcode_path	fastq_PE2_path
BIO_ddseq_1	biorad	35k_fastq/BIO_ddseq_1__R1.LIBDS.fastq.gz		35k_fastq/BIO_ddseq_1__R2.LIBDS.fastq.gz
BIO_ddseq_2	biorad	35k_fastq/BIO_ddseq_2__R1.LIBDS.fastq.gz		35k_fastq/BIO_ddseq_2__R2.LIBDS.fastq.gz
BIO_ddseq_3	biorad	35k_fastq/BIO_ddseq_3__R1.LIBDS.fastq.gz		35k_fastq/BIO_ddseq_3__R2.LIBDS.fastq.gz
BIO_ddseq_4	biorad	35k_fastq/BIO_ddseq_4__R1.LIBDS.fastq.gz		35k_fastq/BIO_ddseq_4__R2.LIBDS.fastq.gz
BRO_mtscatac_1	atac_revcomp	35k_fastq/BRO_mtscatac_1__R1.LIBDS.fastq.gz	35k_fastq/BRO_mtscatac_1__R2.LIBDS.fastq.gz	35k_fastq/BRO_mtscatac_1__R3.LIBDS.fastq.gz
BRO_mtscatac_2	atac_revcomp	35k_fastq/BRO_mtscatac_2__R1.LIBDS.fastq.gz	35k_fastq/BRO_mtscatac_2__R2.LIBDS.fastq.gz	35k_fastq/BRO_mtscatac_2__R3.LIBDS.fastq.gz
CNA_10xmultiome_1	multiome_revcomp	35k_fastq/CNA_10xmultiome_1__R1.LIBDS.fastq.gz	35k_fastq/CNA_10xmultiome_1__R2.LIBDS.fastq.gz	35k_fastq/CNA_10xmultiome_1__R3.LIBDS.fastq.gz
CNA_10xmultiome_2	multio

Generate a config file read by the vsn pipeline. $nwork is a temporary file destination.

In [3]:
VSN=../0_resources/vsn-pipelines/main_atac.nf

In [17]:
nwork=${VSC_SCRATCH}/ATAC_BENCHMARK_FULL/
mkdir $nwork
export NXF_WORK=$nwork


../0_resources/nextflow/nextflow-21.04.3-all config $VSN \
    -profile atac_preprocess_bap,vsc \
    > atac_preprocess.config

mkdir: cannot create directory ‘/scratch/leuven/330/vsc33042/ATAC_BENCHMARK_FULL/’: File exists


### now make some changes to the config:
* redirect to correct metadata_auto.tsv file
* fix BWA parameters
    * change the bwa index directory to the right, in this case hg39
    * change the bwa executor to local to run on current node
    * number of bwa CPUs: better to have 2 forks running with 17 threads than to have 1 fork with 36 threads due to I/O overhead
* add whitelists for each sample
* check if bap parameters are correct
* make sure all output dirs etc. exist
* Increase the time limit

Here is a functional config file tailored to our computing environment:

In [1]:
cat atac_preprocess_rapid.config

manifest {
   name = 'vib-singlecell-nf/vsn-pipelines'
   description = 'A repository of pipelines for single-cell data in Nextflow DSL2'
   homePage = 'https://github.com/vib-singlecell-nf/vsn-pipelines'
   version = '0.27.0'
   mainScript = 'main.nf'
   defaultBranch = 'master'
   nextflowVersion = '!21.04.3'
}

params {
   global {
      project_name = 'downsampling_series'
      outdir = 'preprocessing_out'
   }
   misc {
      test {
         enabled = false
      }
   }
   utils {
      container = 'vibsinglecellnf/utils:0.4.0'
      file_converter {
         off = 'h5ad'
         tagCellWithSampleId = true
         remove10xGEMWell = false
         useFilteredMatrix = true
         makeVarIndexUnique = false
      }
      publish {
         compressionLevel = 6
         annotateWithBatchVariableName = false
         mode = 'copy'
      }
   }
   atac_preprocess_tools {
      mark_duplicates_method = 'MarkDuplicates'
      adapter_trimming_method = 'Trim_Galore'
   }
   data {
  

I suggest you run a difftool on this to compare this to your config.

Then, in a tmux session to avoid interruption:

In [None]:
VSN=../0_resources/vsn-pipelines_/main_atac.nf
module load  Nextflow/21.04.3
nextflow -C atac_preprocess_rapid.config run $VSN -entry atac_preprocess_rapid -resume

Jaccard:

In [None]:
img_path=/lustre1/project/stg_00090/scatac_benchmark/0_resources/vsn_cache/vibsinglecellnf-singlecelltoolkit-2022-07-07-0638c1d.img
singularity exec --cleanenv -H $PWD -B /lustre1,/staging,/data,${VSC_SCRATCH},${VSC_SCRATCH}/tmp:/tmp,${HOME}/.nextflow/assets/,/local_scratch/ $img_path bash

In [1]:
export POLARS_MAX_THREADS=8
export OMP_NUM_THREADS=8

subset=
for fragments in preprocessing_out/data/fragments/*${subset}*fragments.raw.tsv.gz
do
    # echo $fragments
    sample=(`basename $fragments`)
    sample=${sample%.fragments.raw.tsv.gz}
    # echo $sample
    subsample=(${sample//\./ })
    # echo ${subsample}
    jaccardfile=/lustre1/project/stg_00090/scatac_benchmark/public_2_vsn_preprocessing/public_preprocessing_output/data/reports/barcard/$subsample.FULL.tomerge.tsv
    infile=$fragments
    outfile=preprocessing_out/data/fragments/$sample.fragments.tsv
    outfile2=$outfile.gz
    if [ -f "$outfile" ]; then
        # echo "$outfile exists, skipping"
        echo ""
    elif [ -f "$outfile2" ]; then
        # echo "$outfile2 exists, skipping"
        echo ""
    else 
        echo "$outfile does not exist, generating"
        merge_cbs_over_jaccard_index_threshold.py -j $jaccardfile -i $infile -o $outfile &
    fi
done

bash: syntax error near unexpected token `elif'
 exists, skipping
bash: syntax error near unexpected token `else'
 does not exist, generating
bash: syntax error near unexpected token `fi'
bash: syntax error near unexpected token `done'


: 2

In [None]:
export POLARS_MAX_THREADS=8
export OMP_NUM_THREADS=8

subset=
for fragments in preprocessing_out/data/fragments/*${subset}*fragments.tsv
do
    if [ -f "$fragments.gz" ]; then
        echo "exists"
    else 
        echo "$outfile does not exist, generating"
        bgzip -@ 4 $fragments &
    fi
done