# 0. Resources
Running ATACflow requires several files. In the tutorial notebooks, I place these files in a subdirectory of the current working directory called `resources/`.

In [1]:
if [ ! -d resources ]
then
    mkdir resources
fi

## 1. PUMATAC
Clone the latest version of the pipeline from GitHub:

In [2]:
if [ ! -d ATACflow ]; then
    echo "cloning PUMATAC"
    git clone https://github.com/aertslab/ATACflow
else
    echo "pulling PUMATAC"
    cd ATACflow
    git pull
    cd ..
fi

pulling PUMATAC
Already up to date.


## 2. Example FASTQ (optional)
Example FASTQ files used to test this pipeline can be found at https://resources.aertslab.org/papers/

In [None]:
wget -r -np -nH -R index.html https://resources.aertslab.org/papers/PUMATAC/PUMATAC_example_fastq/

In [3]:
md5sum PUMATAC_example_fastq/*

0c8b78a5687d117f71d04a090eeffa80  PUMATAC_example_fastq/BIO_ddseq_4__R1.LIBDS.fastq.gz
6a5319b566331e1c2cb92e0950863075  PUMATAC_example_fastq/BIO_ddseq_4__R2.LIBDS.fastq.gz
0a9f8d48581694a85dc5b12a54fe7e59  PUMATAC_example_fastq/EPF_hydrop_1__R1.LIBDS.fastq.gz
9001b4deff9eef9d6c605eadae7dff92  PUMATAC_example_fastq/EPF_hydrop_1__R2.LIBDS.fastq.gz
b3300d28ed425417dc21d796d7a8e623  PUMATAC_example_fastq/EPF_hydrop_1__R3.LIBDS.fastq.gz
9fa3339b8aea6b7b2f9ef4d4bf73a086  PUMATAC_example_fastq/OHS_s3atac_1__R1.LIBDS.fastq.gz
f08b6c814abb4d3c25329dbc2f6a67ab  PUMATAC_example_fastq/OHS_s3atac_1__R2.LIBDS.fastq.gz
448652e81c27d2000fca275bc7f94990  PUMATAC_example_fastq/OHS_s3atac_1__R3.LIBDS.fastq.gz
dbc5f0fd828f9f81a9f2b77642395c0b  PUMATAC_example_fastq/VIB_10xv2_1__R1.LIBDS.fastq.gz
2e00f84113992335632517f67703adf9  PUMATAC_example_fastq/VIB_10xv2_1__R2.LIBDS.fastq.gz
9d39e83afc86000c8fda42da4e69e4db  PUMATAC_example_fastq/VIB_10xv2_1__R3.LIBDS.fastq.gz
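
To compare a fresh download against the checksums listed above, the listing can be saved to a file and passed to `md5sum --check`. This is a minimal sketch: the helper name `verify_md5` and the checksum file name in the example call are hypothetical and not part of PUMATAC.

```bash
# verify_md5 CHECKSUM_FILE
# Hypothetical helper (not part of PUMATAC): checks every file named in
# CHECKSUM_FILE against its recorded MD5 sum. Paths inside the checksum
# file are resolved relative to the current working directory.
verify_md5() {
    local checksum_file="$1"
    if md5sum --check --quiet "$checksum_file"; then
        echo "all files match"
    else
        echo "checksum mismatch or missing file" >&2
        return 1
    fi
}

# Example call (assumed file name): save the md5sum listing above as
# PUMATAC_example_fastq.md5 in the working directory, then run
# verify_md5 PUMATAC_example_fastq.md5
```

The `--quiet` flag suppresses the per-file `OK` lines, so only failing files are reported.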


## 3. PUMATAC dependencies
All other files can be downloaded from https://resources.aertslab.org/papers/PUMATAC/PUMATAC_dependencies/. The rest of this notebook only documents how they were generated.

In [None]:
wget -r -np -nH -R index.html https://resources.aertslab.org/papers/PUMATAC/PUMATAC_dependencies/

In [4]:
tree -L 2 PUMATAC_dependencies/

PUMATAC_dependencies/
├── cache
│   ├── broadinstitute-gatk-4.2.0.0.img
│   ├── cache
│   ├── cflerin-seurat-4.0.3-plus.copy.sif
│   ├── cflerin-seurat-4.0.3-plus.sif
│   ├── seurat.sh
│   ├── vibsinglecellnf-bap-2021-04-27-3b48f4b.img
│   ├── vibsinglecellnf-bwamaptools-bwa-mem2-2.2.1-zlibng-2.0.6.img
│   ├── vibsinglecellnf-popscle-2021-05-05-da70fc7.img
│   ├── vibsinglecellnf-singlecelltoolkit-2022-04-15-16314db.img
│   ├── vibsinglecellnf-singlecelltoolkit-2022-07-07-0638c1d.img
│   ├── vibsinglecellnf-sinto-0.7.3.1.img
│   └── vibsinglecellnf-trimgalore-0.6.6.img
├── genomes
│   ├── hg38_bwamem2
│   └── mm10_bwamem2
├── jupyter_kernels
│   └── 20230419_pycistopic.sif
├── nextflow
│   ├── nextflow-21.04.3-all
│   ├── nextflow-21.10.4-all
│   └── nextflow-22.10.7-all
├── regions
│   ├── 1000Genomes.wgs.GRCH38.sites.minAF-0.1.freemuxlet-scATAC.vcf.gz
│   ├── V2.hg38-rDHS-Unfiltered.blacklisted.bed
│   ├── V2.mm10-rDHS-Unfiltered.blacklisted.bed
│   ├── hg38-blacklist.v2.bed
│   └── 

## 4. Jupyter kernels
For the Jupyter notebooks, we need specific versions of some Python packages. I have included a Singularity image that contains the right versions:

In [43]:
# Use command substitution so KERNEL_PATH is a plain string rather than an array
KERNEL_PATH=$(realpath PUMATAC_dependencies/jupyter_kernels/20230419_pycistopic.sif)
echo "$KERNEL_PATH"

/lustre1/project/stg_00002/lcb/fderop/data/20230411_ATACflow_tutorial/PUMATAC_dependencies/jupyter_kernels/20230419_pycistopic.sif


We now need to redirect Jupyter to this singularity image. Kernels are usually located in `~/.local/share/jupyter/kernels/`:

In [54]:
KERNELS_DIR=~/.local/share/jupyter/kernels/
ls -1 $KERNELS_DIR

bash
cistopic_20220628
cistopic_20220714
cistopic_20220722
cistopic_20220815
cistopic_20221014
cistopic_20230419
cistopic_cbravo
pycistopic_singularity


Create a new directory (`cistopic_20230419`) and add the `kernel.json` file. Make sure that the path to the Singularity image is updated:

In [55]:
if [ ! -d $KERNELS_DIR/cistopic_20230419/ ]
then
    mkdir $KERNELS_DIR/cistopic_20230419/
fi

Then, we will write a new `kernel.json` file there that redirects Jupyter to our `KERNEL_PATH`. The `kernel.json` file also requires you to define which paths the container should mount (since it executes a `singularity exec` command).

In [52]:
MOUNTS=/staging,/lustre1,/data,/vsc-hard-mounts,/scratch,/scratch/leuven/330/vsc33042/tmp/:/tmp,/local_scratch

In [53]:
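# Substitute the image path and the mounts in the template in one pass,
# then write the result into the kernel directory (tee also prints it).
sed -e "s+/lustre1/project/stg_00090/scatac_benchmark/0_resources/cistopic_image/20230419_pycistopic.sif+$KERNEL_PATH+g" \
    -e "s+/staging,/lustre1,/data,/vsc-hard-mounts,/scratch,/scratch/leuven/330/vsc33042/tmp/:/tmp,/local_scratch+$MOUNTS+g" \
    "kernel_template.json" | tee "$KERNELS_DIR/cistopic_20230419/kernel.json"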

{
    "argv": [
     "singularity",
     "exec",
     "-B",
     "/staging,/lustre1,/data,/vsc-hard-mounts,/scratch,/scratch/leuven/330/vsc33042/tmp/:/tmp,/local_scratch",
     "/lustre1/project/stg_00002/lcb/fderop/data/20230411_ATACflow_tutorial/PUMATAC_dependencies/jupyter_kernels/20230419_pycistopic.sif",
     "ipython",
     "kernel",
     "-f",
     "{connection_file}"
    ],
    "display_name": "20230419_pycistopic.sif",
    "language": "python"
}


If all is well, the following should return our Jupyter `kernel.json`, with the image path pointing to `KERNEL_PATH`:

In [56]:
cat $KERNELS_DIR/cistopic_20230419/kernel.json

{
    "argv": [
     "singularity",
     "exec",
     "-B",
     "/staging,/lustre1,/data,/vsc-hard-mounts,/scratch,/scratch/leuven/330/vsc33042/tmp/:/tmp,/local_scratch",
     "/lustre1/project/stg_00002/lcb/fderop/data/20230411_ATACflow_tutorial/PUMATAC_dependencies/jupyter_kernels/20230419_pycistopic.sif",
     "ipython",
     "kernel",
     "-f",
     "{connection_file}"
    ],
    "display_name": "20230419_pycistopic.sif",
    "language": "python"
}
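
A quick sanity check on the installed kernel is to confirm that the image path it references actually exists on disk. The sketch below assumes the image is given as an absolute path ending in `.sif`, as in the `kernel.json` shown above. The helper name `check_kernel_image` is hypothetical.

```bash
# check_kernel_image KERNEL_JSON
# Hypothetical helper (not part of PUMATAC): extracts the first absolute
# path ending in .sif from KERNEL_JSON and checks that the file exists.
check_kernel_image() {
    local kernel_json="$1"
    local image
    # Match a quoted string that starts with "/" and ends with ".sif";
    # this skips the display_name entry, which is not an absolute path.
    image=$(grep -o '"/[^"]*\.sif"' "$kernel_json" | head -n 1 | tr -d '"') || true
    if [ -z "$image" ]; then
        echo "no .sif path found in $kernel_json" >&2
        return 1
    fi
    if [ -f "$image" ]; then
        echo "image found: $image"
    else
        echo "image missing: $image" >&2
        return 1
    fi
}

# Example call, using the KERNELS_DIR variable defined earlier:
# check_kernel_image "$KERNELS_DIR/cistopic_20230419/kernel.json"
```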


## 5. ghuls single_cell_toolkit

In [9]:
if [ ! -d PUMATAC_dependencies/single_cell_toolkit ]; then
    echo "cloning single_cell_toolkit"
    cd PUMATAC_dependencies
    git clone https://github.com/aertslab/single_cell_toolkit
    cd ..
else
    echo "pulling single_cell_toolkit"
    cd PUMATAC_dependencies/single_cell_toolkit
    git pull
    # Go back up two levels to the original working directory
    cd ../..
fi

cloning single_cell_toolkit
Cloning into 'single_cell_toolkit'...
remote: Enumerating objects: 464, done.
remote: Counting objects: 100% (220/220), done.
remote: Compressing objects: 100% (143/143), done.
remote: Total 464 (delta 123), reused 160 (delta 74), pack-reused 244
Receiving objects: 100% (464/464), 148.31 KiB | 111.00 KiB/s, done.
Resolving deltas: 100% (266/266), done.


# Documentation
How the dependencies were generated or downloaded
### a. Nextflow
Download Nextflow version 21.04.3: https://github.com/nextflow-io/nextflow/releases/tag/v21.04.3

In [33]:
if [ ! -f resources/nextflow/nextflow-21.04.3-all ]; then
    wget -P resources/nextflow/ https://github.com/nextflow-io/nextflow/releases/download/v21.04.3/nextflow-21.04.3-all
fi

### b. A bwa-mem2 genome index for your species
This genome index can be generated from a `.fasta` file using `bwa-mem2`. First, follow the GitHub tutorial to install `bwa-mem2`: https://github.com/bwa-mem2/bwa-mem2  
Then, generate a genome index as follows:
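
A minimal sketch of the indexing step, assuming `bwa-mem2` is on your `PATH`. The helper name `build_bwamem2_index` and the file names in the example call are hypothetical, so substitute your own FASTA file and output prefix.

```bash
# build_bwamem2_index FASTA PREFIX
# Hypothetical helper (not part of PUMATAC): builds a bwa-mem2 index for
# FASTA and writes the index files using PREFIX as their common prefix.
build_bwamem2_index() {
    local fasta="$1"
    local prefix="$2"
    if [ ! -f "$fasta" ]; then
        echo "FASTA not found: $fasta" >&2
        return 1
    fi
    if ! command -v bwa-mem2 >/dev/null 2>&1; then
        echo "bwa-mem2 not found on PATH" >&2
        return 1
    fi
    # Make sure the output directory exists before indexing
    mkdir -p "$(dirname "$prefix")"
    bwa-mem2 index -p "$prefix" "$fasta"
}

# Example call (assumed file names):
# build_bwamem2_index genome.fa PUMATAC_dependencies/genomes/hg38_bwamem2/genome.fa
```

Indexing a mammalian genome needs a substantial amount of memory and time, so it is usually run as a cluster job rather than in the notebook itself.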

### c. Barcode whitelists for 10x Genomics and HyDrop
For 10x Genomics, you can get the whitelist for multiome and ATAC from here:  
https://kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist-  
https://kb.10xgenomics.com/hc/en-us/articles/4412343032205-Where-can-I-find-the-barcode-whitelist-s-for-Single-Cell-Multiome-ATAC-GEX-product-  
  
For HyDrop, the whitelist can be sourced from https://github.com/aertslab/hydrop_data_analysis.  
For ddSEQ, a whitelist-agnostic strategy is employed.  
For s3-ATAC, a custom whitelist should be compiled based on the indices you used.  
A tutorial for implementing your own technique will be added.

### d. Regions for fragment counting
If you also want to use our QC diagnostics tool, you will need a set of regions in which to count the fragments from your `.fragments.tsv.gz` files. There are three possible strategies:
1. Align your data in a bulk ATAC-seq fashion, perform de-novo peak calling on the aggregate alignment, and use these peak regions to generate your count matrix. This strategy can miss regions that are only accessible in rare cell types.
2. Use a region set from a previous experiment on your sample.
3. Use the ENCODE SCREEN regions: https://screen.encodeproject.org/. We subtract the ENCODE blacklist () from the ENCODE SCREEN regions.
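
The blacklist subtraction in strategy 3 can be sketched with `bedtools subtract`, assuming `bedtools` is installed. The notebook does not state which tool was used to produce the `.blacklisted.bed` files, so treat this as one possible approach. The helper name `subtract_blacklist` and the input file name in the example call are hypothetical.

```bash
# subtract_blacklist REGIONS BLACKLIST OUT
# Hypothetical helper (not part of PUMATAC): removes the parts of REGIONS
# that overlap BLACKLIST and writes the remaining intervals to OUT.
subtract_blacklist() {
    local regions="$1"
    local blacklist="$2"
    local out="$3"
    local f
    for f in "$regions" "$blacklist"; do
        if [ ! -f "$f" ]; then
            echo "input not found: $f" >&2
            return 1
        fi
    done
    if ! command -v bedtools >/dev/null 2>&1; then
        echo "bedtools not found on PATH" >&2
        return 1
    fi
    # By default only the overlapping portion of each region is removed;
    # adding -A would drop any region with an overlap entirely.
    bedtools subtract -a "$regions" -b "$blacklist" > "$out"
}

# Example call (the first file name is an assumption):
# subtract_blacklist hg38-rDHS-Unfiltered.bed hg38-blacklist.v2.bed \
#     V2.hg38-rDHS-Unfiltered.blacklisted.bed
```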

### The `PUMATAC_dependencies` directory should look like so:

In [48]:
tree -L 2 PUMATAC_dependencies

PUMATAC_dependencies
├── cache
│   ├── broadinstitute-gatk-4.2.0.0.img
│   ├── cache
│   ├── cflerin-seurat-4.0.3-plus.copy.sif
│   ├── cflerin-seurat-4.0.3-plus.sif
│   ├── seurat.sh
│   ├── vibsinglecellnf-bap-2021-04-27-3b48f4b.img
│   ├── vibsinglecellnf-bwamaptools-bwa-mem2-2.2.1-zlibng-2.0.6.img
│   ├── vibsinglecellnf-popscle-2021-05-05-da70fc7.img
│   ├── vibsinglecellnf-singlecelltoolkit-2022-04-15-16314db.img
│   ├── vibsinglecellnf-singlecelltoolkit-2022-07-07-0638c1d.img
│   ├── vibsinglecellnf-sinto-0.7.3.1.img
│   └── vibsinglecellnf-trimgalore-0.6.6.img
├── genomes
│   ├── hg38_bwamem2
│   └── mm10_bwamem2
├── nextflow
│   ├── nextflow-21.04.3-all
│   ├── nextflow-21.10.4-all
│   └── nextflow-22.10.7-all
├── regions
│   ├── 1000Genomes.wgs.GRCH38.sites.minAF-0.1.freemuxlet-scATAC.vcf.gz
│   ├── V2.hg38-rDHS-Unfiltered.blacklisted.bed
│   ├── V2.mm10-rDHS-Unfiltered.blacklisted.bed
│   ├── hg38-blacklist.v2.bed
│   └── mm10-blacklist.v2.bed
└── whitelists
    ├── 20200227