# 0. Resources
Running PUMATAC requires several files.

## 1. PUMATAC
Clone the latest version of the pipeline from GitHub, or pull the newest changes if it has already been cloned:

In [1]:
if [ ! -d PUMATAC ]; then
    echo "cloning PUMATAC"
    git clone https://github.com/aertslab/PUMATAC
else
    echo "pulling PUMATAC"
    cd PUMATAC
    git pull
    cd ..
fi

cloning PUMATAC
Cloning into 'PUMATAC'...
remote: Enumerating objects: 761, done.
remote: Counting objects: 100% (761/761), done.
remote: Compressing objects: 100% (589/589), done.
remote: Total 761 (delta 179), reused 693 (delta 128), pack-reused 0
Receiving objects: 100% (761/761), 623.70 KiB | 7.25 MiB/s, done.
Resolving deltas: 100% (179/179), done.


## 2. PUMATAC dependencies
Necessary files, such as annotated regions, can be downloaded from https://resources.aertslab.org/papers/PUMATAC/PUMATAC_dependencies/. The rest of this notebook only documents how they were generated.

In [None]:
wget -r -np -nH --cut-dirs 2 -R index.html https://resources.aertslab.org/papers/PUMATAC/PUMATAC_dependencies/

--2023-04-25 16:53:44--  https://resources.aertslab.org/papers/PUMATAC/PUMATAC_dependencies/
Resolving resources.aertslab.org (resources.aertslab.org)... 134.58.50.9
Connecting to resources.aertslab.org (resources.aertslab.org)|134.58.50.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2021 (2.0K) [text/html]
Saving to: ‘PUMATAC_dependencies/index.html.tmp’


2023-04-25 16:53:45 (159 MB/s) - ‘PUMATAC_dependencies/index.html.tmp’ saved [2021/2021]

Loading robots.txt; please ignore errors.
--2023-04-25 16:53:45--  https://resources.aertslab.org/robots.txt
Reusing existing connection to resources.aertslab.org:443.
HTTP request sent, awaiting response... 404 Not Found
2023-04-25 16:53:45 ERROR 404: Not Found.

--2023-04-25 16:53:45--  https://resources.aertslab.org/papers/PUMATAC/PUMATAC_dependencies/?C=N;O=D
Reusing existing connection to resources.aertslab.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 2021 (2.0K) [text/html]
Saving to: ‘PUMAT

In [2]:
tree -L 2 PUMATAC_dependencies/

PUMATAC_dependencies/
├── cache
│   ├── broadinstitute-gatk-4.2.0.0.img
│   ├── cache
│   ├── cflerin-seurat-4.0.3-plus.copy.sif
│   ├── cflerin-seurat-4.0.3-plus.sif
│   ├── seurat.sh
│   ├── vibsinglecellnf-bap-2021-04-27-3b48f4b.img
│   ├── vibsinglecellnf-bwamaptools-bwa-mem2-2.2.1-zlibng-2.0.6.img
│   ├── vibsinglecellnf-popscle-2021-05-05-da70fc7.img
│   ├── vibsinglecellnf-singlecelltoolkit-2022-04-15-16314db.img
│   ├── vibsinglecellnf-singlecelltoolkit-2022-07-07-0638c1d.img
│   ├── vibsinglecellnf-sinto-0.7.3.1.img
│   └── vibsinglecellnf-trimgalore-0.6.6.img
├── calculate_saturation_from_fragments.py
├── data
│   ├── fixedcells_cto_merged.tsv
│   └── fixedcells_general_statistics.tsv
├── genomes
│   ├── hg38_bwamem2
│   └── mm10_bwamem2
├── jupyter_kernels
│   └── 20230419_pycistopic.sif
├── nextflow
│   ├── nextflow-21.04.3-all
│   ├── nextflow-21.10.4-all
│   └── nextflow-22.10.7-all
├── regions
│   ├── 1000Genomes.wgs.GRCH38.sites.minAF-0.1.freemuxlet-scATAC.vcf.gz
│   ├─

## 3. Jupyter kernels
For the Jupyter notebooks, we need specific versions of some Python packages. I have included a Singularity image that contains the right versions:

In [2]:
KERNEL_PATH="/data2/florian/x000_PUMATAC_dependencies/jupyter_kernels/20230504_pycistopic.sif"
ls "$KERNEL_PATH"

/data2/florian/x000_PUMATAC_dependencies/jupyter_kernels/20230504_pycistopic.sif


We now need to point Jupyter to this Singularity image. Kernels are usually located in `~/.local/share/jupyter/kernels/`; on this machine they live inside the miniforge installation:

In [3]:
KERNELS_DIR=/home/florian/miniforge3/share/jupyter/kernels
ls -1 $KERNELS_DIR

20231114_scenicplus
bash
cistopic_20230504
python3


In [4]:
echo $KERNELS_DIR

/home/florian/miniforge3/share/jupyter/kernels


Create a new directory (`cistopic_20230504`) and add the `kernel.json` file. Make sure that the path to the Singularity image is updated:

In [5]:
if [ ! -d $KERNELS_DIR/cistopic_20230504/ ]
then
    echo "making new kernels dir"
    mkdir $KERNELS_DIR/cistopic_20230504/
fi

In [6]:
NEWKERNELNAME=cistopic_20230504

Then, we write a new `kernel.json` file that points Jupyter to our `KERNEL_PATH`. The `kernel.json` file also needs to list the directories that should be bind-mounted into the container, since it executes a `singularity exec` command.

In [7]:
MOUNTS=/data,/data2,/data2/florian/tmp/:/tmp

In [8]:
cp kernel_template.json kernel.json

In [12]:
sed -i "s+/lustre1/project/stg_00090/scatac_benchmark/0_resources/cistopic_image/20230419_pycistopic.sif+$KERNEL_PATH+g" "kernel.json"
sed -i "s+/staging,/lustre1,/data,/vsc-hard-mounts,/scratch,/scratch/leuven/330/vsc33042/tmp/:/tmp,/local_scratch+$MOUNTS+g" "kernel.json"
sed -i "s+20230419_pycistopic.sif+$NEWKERNELNAME+g" "kernel.json"

In [13]:
cat kernel.json

{
    "argv": [
     "singularity",
     "exec",
     "-B",
     "/data,/data2,/data2/florian/tmp/:/tmp",
     "/data2/florian/x000_PUMATAC_dependencies/jupyter_kernels/20230504_pycistopic.sif",
     "ipython",
     "kernel",
     "-f",
     "{connection_file}"
    ],
    "display_name": "cistopic_20230504",
    "language": "python"
}
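
If no `kernel_template.json` is at hand, the same file can be written directly. The following is a minimal sketch, assuming the variable values from the cells above; it writes to a scratch directory (`outdir`, a name introduced here for illustration) so that the real kernels directory is not touched:

```shell
# Values taken from the cells above; adjust them for your own system.
KERNEL_PATH="/data2/florian/x000_PUMATAC_dependencies/jupyter_kernels/20230504_pycistopic.sif"
MOUNTS="/data,/data2,/data2/florian/tmp/:/tmp"
NEWKERNELNAME="cistopic_20230504"

# Scratch location for this sketch (hypothetical); copy the result to
# $KERNELS_DIR/$NEWKERNELNAME/ once it looks right.
outdir="$(mktemp -d)"

# Unquoted heredoc: the shell expands $MOUNTS, $KERNEL_PATH and $NEWKERNELNAME,
# while {connection_file} stays literal for Jupyter to fill in at launch.
cat > "$outdir/kernel.json" <<EOF
{
    "argv": [
     "singularity",
     "exec",
     "-B",
     "$MOUNTS",
     "$KERNEL_PATH",
     "ipython",
     "kernel",
     "-f",
     "{connection_file}"
    ],
    "display_name": "$NEWKERNELNAME",
    "language": "python"
}
EOF
cat "$outdir/kernel.json"
```

This avoids the dependence of the `sed` approach on the exact strings in the template, at the cost of keeping the JSON layout in the notebook itself.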


Finally, copy `kernel.json` into the new kernel directory. If all is well, the following prints the installed `kernel.json`:

In [14]:
cat kernel.json > $KERNELS_DIR/cistopic_20230504/kernel.json
cat $KERNELS_DIR/cistopic_20230504/kernel.json

{
    "argv": [
     "singularity",
     "exec",
     "-B",
     "/data,/data2,/data2/florian/tmp/:/tmp",
     "/data2/florian/x000_PUMATAC_dependencies/jupyter_kernels/20230504_pycistopic.sif",
     "ipython",
     "kernel",
     "-f",
     "{connection_file}"
    ],
    "display_name": "cistopic_20230504",
    "language": "python"
}


## 4. Example FASTQ (optional)
Example FASTQ files used to test this pipeline can be found at https://resources.aertslab.org/papers/PUMATAC/PUMATAC_example_fastq/

In [3]:
wget -r -np -nH --cut-dirs 2 -R index.html https://resources.aertslab.org/papers/PUMATAC/PUMATAC_example_fastq/

--2023-04-25 16:46:00--  https://resources.aertslab.org/papers/PUMATAC/PUMATAC_example_fastq/
Resolving resources.aertslab.org (resources.aertslab.org)... 134.58.50.9
Connecting to resources.aertslab.org (resources.aertslab.org)|134.58.50.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3523 (3.4K) [text/html]
Saving to: ‘PUMATAC_example_fastq/index.html.tmp’


2023-04-25 16:46:00 (213 MB/s) - ‘PUMATAC_example_fastq/index.html.tmp’ saved [3523/3523]

Loading robots.txt; please ignore errors.
--2023-04-25 16:46:00--  https://resources.aertslab.org/robots.txt
Reusing existing connection to resources.aertslab.org:443.
HTTP request sent, awaiting response... 404 Not Found
2023-04-25 16:46:00 ERROR 404: Not Found.

--2023-04-25 16:46:00--  https://resources.aertslab.org/papers/PUMATAC/PUMATAC_example_fastq/?C=N;O=D
Reusing existing connection to resources.aertslab.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 3523 (3.4K) [text/html]
Saving to: ‘P

In [4]:
md5sum PUMATAC_example_fastq/*

0c8b78a5687d117f71d04a090eeffa80  PUMATAC_example_fastq/BIO_ddseq_4__R1.LIBDS.fastq.gz
6a5319b566331e1c2cb92e0950863075  PUMATAC_example_fastq/BIO_ddseq_4__R2.LIBDS.fastq.gz
0a9f8d48581694a85dc5b12a54fe7e59  PUMATAC_example_fastq/EPF_hydrop_1__R1.LIBDS.fastq.gz
9001b4deff9eef9d6c605eadae7dff92  PUMATAC_example_fastq/EPF_hydrop_1__R2.LIBDS.fastq.gz
b3300d28ed425417dc21d796d7a8e623  PUMATAC_example_fastq/EPF_hydrop_1__R3.LIBDS.fastq.gz
9fa3339b8aea6b7b2f9ef4d4bf73a086  PUMATAC_example_fastq/OHS_s3atac_1__R1.LIBDS.fastq.gz
f08b6c814abb4d3c25329dbc2f6a67ab  PUMATAC_example_fastq/OHS_s3atac_1__R2.LIBDS.fastq.gz
448652e81c27d2000fca275bc7f94990  PUMATAC_example_fastq/OHS_s3atac_1__R3.LIBDS.fastq.gz
dbc5f0fd828f9f81a9f2b77642395c0b  PUMATAC_example_fastq/VIB_10xv2_1__R1.LIBDS.fastq.gz
2e00f84113992335632517f67703adf9  PUMATAC_example_fastq/VIB_10xv2_1__R2.LIBDS.fastq.gz
9d39e83afc86000c8fda42da4e69e4db  PUMATAC_example_fastq/VIB_10xv2_1__R3.LIBDS.fastq.gz
eb2c490a8c21c1ba9a602d0300fad076  PUM
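
To check a download against a stored checksum list instead of comparing hashes by eye, `md5sum -c` can be used. The sketch below assumes GNU coreutils and uses a toy file in a temporary directory (`workdir`, `sample__R1.fastq.gz` and `checksums.md5` are hypothetical names); in practice, `checksums.md5` would hold the two-column list printed above.

```shell
# Toy stand-in for the download directory (hypothetical file name and content).
workdir="$(mktemp -d)"
mkdir -p "$workdir/PUMATAC_example_fastq"
printf 'example\n' > "$workdir/PUMATAC_example_fastq/sample__R1.fastq.gz"

# Record checksums once, from a copy that is known to be good.
( cd "$workdir" && md5sum PUMATAC_example_fastq/* > checksums.md5 )

# Verify later: md5sum -c exits non-zero if any listed file is missing or differs.
# --quiet suppresses the per-file "OK" lines.
if ( cd "$workdir" && md5sum -c --quiet checksums.md5 ); then
    verify_status="ok"
else
    verify_status="mismatch"
fi
echo "checksum verification: $verify_status"
```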

## 5. Gert Hulselmans' single_cell_toolkit (optional)

In [13]:
if [ ! -d PUMATAC_dependencies/single_cell_toolkit ]; then
    echo "cloning single_cell_toolkit"
    cd PUMATAC_dependencies
    git clone https://github.com/aertslab/single_cell_toolkit
    cd ..
else
    echo "pulling single_cell_toolkit"
    cd PUMATAC_dependencies/single_cell_toolkit
    git pull
    # Return to the starting directory (two levels up)
    cd ../..
fi

pulling single_cell_toolkit
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 6 (delta 4), reused 5 (delta 3), pack-reused 0
Unpacking objects: 100% (6/6), 707 bytes | 21.00 KiB/s, done.
From https://github.com/aertslab/single_cell_toolkit
   df625f0..ef1962c  master     -> origin/master
Updating df625f0..ef1962c
error: Your local changes to the following files would be overwritten by merge:
	calculate_saturation_from_fragments.py
Please commit your changes or stash them before you merge.
Aborting


# Documentation
This section documents how the dependencies were generated or downloaded.
### a. Nextflow
Nextflow 21.04.3 can be downloaded here: https://github.com/nextflow-io/nextflow/releases/tag/v21.04.3

In [33]:
wget -P resources/nextflow/ https://github.com/nextflow-io/nextflow/releases/download/v21.04.3/nextflow-21.04.3-all

### b. A bwa-mem2 genome index for your species
This genome index can be generated from a `.fasta` file using `bwa-mem2`. First, follow the instructions in the GitHub repository to install `bwa-mem2`: https://github.com/bwa-mem2/bwa-mem2  
Then, generate a genome index as follows:
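
A minimal sketch of the indexing step, assuming `bwa-mem2` is on your `PATH`. The file names below (`genome.fa`, `my_genome_bwamem2`) are placeholders, not the paths used to build the published `hg38_bwamem2` and `mm10_bwamem2` indices:

```shell
# Hypothetical paths; adjust to your own genome FASTA and output location.
genome_fasta="genome.fa"
index_prefix="PUMATAC_dependencies/genomes/my_genome_bwamem2/genome.fa"

if command -v bwa-mem2 >/dev/null 2>&1 && [ -f "$genome_fasta" ]; then
    mkdir -p "$(dirname "$index_prefix")"
    # -p sets the prefix of the index files that bwa-mem2 writes.
    bwa-mem2 index -p "$index_prefix" "$genome_fasta"
else
    echo "bwa-mem2 or $genome_fasta not found; skipping index build" >&2
fi
```

Indexing a mammalian genome takes a substantial amount of memory and time, so it is usually worth running this on a compute node rather than inside the notebook.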

### c. Barcode whitelists for 10x Genomics and HyDrop
For 10x Genomics, you can get the whitelist for multiome and ATAC from here:  
https://kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist-  
https://kb.10xgenomics.com/hc/en-us/articles/4412343032205-Where-can-I-find-the-barcode-whitelist-s-for-Single-Cell-Multiome-ATAC-GEX-product-  
  
For HyDrop, the whitelist can be sourced from https://github.com/aertslab/hydrop_data_analysis.
For ddSEQ, a whitelist-agnostic strategy is employed.  
For s3-ATAC, a custom whitelist should be compiled based on the indices you used.  
A tutorial for implementing your own technique will be added.

### d. Regions for fragment counting
If you also want to employ our QC diagnostics tool, you will need a set of regions in which to count the fragments from your `.fragments.tsv.gz` files. There are three possible strategies:
1. Align your data in a bulk ATAC-seq fashion, perform de-novo peak calling on the aggregate alignment, and use these peak regions to generate your count matrix. This strategy can miss regions that are only accessible in rare cell types.
2. Use a region set from a previous experiment on your sample.
3. Use the ENCODE SCREEN regions: https://screen.encodeproject.org/. We subtract the ENCODE blacklist (`hg38-blacklist.v2.bed` or `mm10-blacklist.v2.bed` in the `regions` directory) from the ENCODE SCREEN regions.
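
The blacklist subtraction in strategy 3 can be illustrated with plain `awk` on toy BED files. This is a sketch under assumptions, not the command used to produce the `*.blacklisted.bed` files: it drops every region that overlaps a blacklist interval in full, whereas the original procedure may have trimmed regions instead. All file names and coordinates are made up. With `bedtools` installed, `bedtools subtract -A` offers the same whole-region behaviour.

```shell
# Toy inputs (tab-separated BED, 0-based half-open coordinates); hypothetical data.
workdir="$(mktemp -d)"
printf 'chr1\t100\t200\nchr1\t500\t600\nchr2\t100\t200\n' > "$workdir/regions.bed"
printf 'chr1\t150\t300\n' > "$workdir/blacklist.bed"

# First file: store blacklist intervals per chromosome.
# Second file: print a region only if it overlaps none of them.
awk -F '\t' '
    NR == FNR { n[$1]++; s[$1, n[$1]] = $2 + 0; e[$1, n[$1]] = $3 + 0; next }
    {
        keep = 1
        for (i = 1; i <= n[$1]; i++)
            if (($2 + 0) < e[$1, i] && ($3 + 0) > s[$1, i]) { keep = 0; break }
        if (keep) print
    }
' "$workdir/blacklist.bed" "$workdir/regions.bed" > "$workdir/regions.blacklisted.bed"

cat "$workdir/regions.blacklisted.bed"
```

In this toy case only the first region (`chr1:100-200`) overlaps the blacklist and is removed. The linear scan per region is fine for a blacklist of a few hundred intervals, which is why no sorting or indexing is attempted here.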

# Compiling a hybrid mouse/human genome

In [None]:
#################### SETUP ####################


human_genome="GRCh38"
mouse_genome="GRCm39"
version="2024-A"


build="GRCh38_and_GRCm39_GENCODEv44-GENCODEvm33"
mkdir -p "$build"


# Download source files if they do not exist in reference_sources/ folder
source="reference_sources"
mkdir -p "$source"


# Using release 109 for GRCh38 instead of release 110 -- release 110 moved from GRCh38.p13 to GRCh38.p14,
# which unmasked the pseudo-autosomal region. This causes ambiguous mappings to PAR locus genes.
human_fasta_url="http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
human_fasta_in="${source}/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
human_gtf_url="http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.primary_assembly.annotation.gtf.gz"
human_gtf_in="${source}/gencode.v44.primary_assembly.annotation.gtf"
mouse_fasta_url="http://ftp.ensembl.org/pub/release-110/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz"
mouse_fasta_in="${source}/Mus_musculus.GRCm39.dna.primary_assembly.fa"
mouse_gtf_url="http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M33/gencode.vM33.primary_assembly.annotation.gtf.gz"
mouse_gtf_in="${source}/gencode.vM33.primary_assembly.annotation.gtf"


if [ ! -f "$human_fasta_in" ]; then
    curl -sS "$human_fasta_url" | zcat > "$human_fasta_in"
fi
if [ ! -f "$human_gtf_in" ]; then
    curl -sS "$human_gtf_url" | zcat > "$human_gtf_in"
fi
if [ ! -f "$mouse_fasta_in" ]; then
    curl -sS "$mouse_fasta_url" | zcat > "$mouse_fasta_in"
fi
if [ ! -f "$mouse_gtf_in" ]; then
    curl -sS "$mouse_gtf_url" | zcat > "$mouse_gtf_in"
fi


# String patterns used for both genomes
ID="(ENS(MUS)?[GTE][0-9]+)\.([0-9]+)"


BIOTYPE_PATTERN=\
"(protein_coding|protein_coding_LoF|lncRNA|\
IG_C_gene|IG_D_gene|IG_J_gene|IG_LV_gene|IG_V_gene|\
IG_V_pseudogene|IG_J_pseudogene|IG_C_pseudogene|\
TR_C_gene|TR_D_gene|TR_J_gene|TR_V_gene|\
TR_V_pseudogene|TR_J_pseudogene)"
GENE_PATTERN="gene_type \"${BIOTYPE_PATTERN}\""
TX_PATTERN="transcript_type \"${BIOTYPE_PATTERN}\""
READTHROUGH_PATTERN="tag \"readthrough_transcript\""

#################### HUMAN ####################
# Process FASTA -- translate chromosome names
human_fasta_modified="$build/$(basename "$human_fasta_in").modified"
cat "$human_fasta_in" \
    | sed -E 's/^>(\S+).*/>\1 \1/' \
    | sed -E 's/^>([0-9]+|[XY]) />chr\1 /' \
    | sed -E 's/^>MT />chrM /' \
    > "$human_fasta_modified"


# Process GTF -- split Ensembl IDs from version suffixes
human_gtf_modified="$build/$(basename "$human_gtf_in").modified"
cat "$human_gtf_in" \
    | sed -E 's/gene_id "'"$ID"'";/gene_id "\1"; gene_version "\3";/' \
    | sed -E 's/transcript_id "'"$ID"'";/transcript_id "\1"; transcript_version "\3";/' \
    | sed -E 's/exon_id "'"$ID"'";/exon_id "\1"; exon_version "\3";/' \
    > "$human_gtf_modified"


# Process GTF -- filter based on gene/transcript tags
cat "$human_gtf_modified" \
    | awk '$3 == "transcript"' \
    | grep -E "$GENE_PATTERN" \
    | grep -E "$TX_PATTERN" \
    | grep -Ev "$READTHROUGH_PATTERN" \
    | sed -E 's/.*(gene_id "[^"]+").*/\1/' \
    | sort \
    | uniq \
    > "${build}/gene_allowlist"


human_gtf_filtered="${build}/$(basename "$human_gtf_in").filtered"
grep -E "^#" "$human_gtf_modified" > "$human_gtf_filtered"
grep -Ff "${build}/gene_allowlist" "$human_gtf_modified" \
    | awk -F "\t" '$1 != "chrY" || $1 == "chrY" && $4 >= 2752083 && $4 < 56887903 && !/ENSG00000290840/' \
    >> "$human_gtf_filtered"


#################### MOUSE ####################
# Please see the GRCm39-2024-A build documentation for details on these steps.


# Process FASTA -- translate chromosome names
mouse_fasta_modified="$build/$(basename "$mouse_fasta_in").modified"
cat "$mouse_fasta_in" \
    | sed -E 's/^>(\S+).*/>\1 \1/' \
    | sed -E 's/^>([0-9]+|[XY]) />chr\1 /' \
    | sed -E 's/^>MT />chrM /' \
    > "$mouse_fasta_modified"


# Process GTF -- split Ensembl IDs from version suffixes
mouse_gtf_modified="$build/$(basename "$mouse_gtf_in").modified"
cat "$mouse_gtf_in" \
    | sed -E 's/gene_id "'"$ID"'";/gene_id "\1"; gene_version "\3";/' \
    | sed -E 's/transcript_id "'"$ID"'";/transcript_id "\1"; transcript_version "\3";/' \
    | sed -E 's/exon_id "'"$ID"'";/exon_id "\1"; exon_version "\3";/' \
    > "$mouse_gtf_modified"


# Process GTF -- filter based on gene/transcript tags
cat "$mouse_gtf_modified" \
    | awk '$3 == "transcript"' \
    | grep -E "$GENE_PATTERN" \
    | grep -E "$TX_PATTERN" \
    | grep -Ev "$READTHROUGH_PATTERN" \
    | sed -E 's/.*(gene_id "[^"]+").*/\1/' \
    | sort \
    | uniq \
    > "${build}/gene_allowlist"


mouse_gtf_filtered="${build}/$(basename "$mouse_gtf_in").filtered"
grep -E "^#" "$mouse_gtf_modified" > "$mouse_gtf_filtered"
grep -Ff "${build}/gene_allowlist" "$mouse_gtf_modified" \
    >> "$mouse_gtf_filtered"


#################### MKREF ####################


cellranger mkref --ref-version="$version" \
    --genome="$human_genome" --fasta="$human_fasta_modified" --genes="$human_gtf_filtered" \
    --genome="$mouse_genome" --fasta="$mouse_fasta_modified" --genes="$mouse_gtf_filtered" \
    --nthreads=16
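
The ID-splitting `sed` calls in the script above can be tried on a single attribute string to see what they do. The identifiers below are made up for illustration; only the `ID` pattern and the two substitutions are taken from the script.

```shell
# Same pattern as in the build script: group 1 is the stable ID, group 3 the version.
ID="(ENS(MUS)?[GTE][0-9]+)\.([0-9]+)"

# Made-up attribute string in GENCODE GTF style.
attrs='gene_id "ENSG00000000001.5"; transcript_id "ENST00000000002.3";'

split_attrs="$(printf '%s\n' "$attrs" \
    | sed -E 's/gene_id "'"$ID"'";/gene_id "\1"; gene_version "\3";/' \
    | sed -E 's/transcript_id "'"$ID"'";/transcript_id "\1"; transcript_version "\3";/')"

echo "$split_attrs"
# prints: gene_id "ENSG00000000001"; gene_version "5"; transcript_id "ENST00000000002"; transcript_version "3";
```

The optional `(MUS)?` group is what lets the same pattern serve both the human and the mouse GTF, which is why the version suffix is referenced as `\3` rather than `\2`.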



### The `PUMATAC_dependencies` directory should look like this:

In [48]:
tree -L 2 PUMATAC_dependencies

PUMATAC_dependencies
├── cache
│   ├── broadinstitute-gatk-4.2.0.0.img
│   ├── cache
│   ├── cflerin-seurat-4.0.3-plus.copy.sif
│   ├── cflerin-seurat-4.0.3-plus.sif
│   ├── seurat.sh
│   ├── vibsinglecellnf-bap-2021-04-27-3b48f4b.img
│   ├── vibsinglecellnf-bwamaptools-bwa-mem2-2.2.1-zlibng-2.0.6.img
│   ├── vibsinglecellnf-popscle-2021-05-05-da70fc7.img
│   ├── vibsinglecellnf-singlecelltoolkit-2022-04-15-16314db.img
│   ├── vibsinglecellnf-singlecelltoolkit-2022-07-07-0638c1d.img
│   ├── vibsinglecellnf-sinto-0.7.3.1.img
│   └── vibsinglecellnf-trimgalore-0.6.6.img
├── genomes
│   ├── hg38_bwamem2
│   └── mm10_bwamem2
├── nextflow
│   ├── nextflow-21.04.3-all
│   ├── nextflow-21.10.4-all
│   └── nextflow-22.10.7-all
├── regions
│   ├── 1000Genomes.wgs.GRCH38.sites.minAF-0.1.freemuxlet-scATAC.vcf.gz
│   ├── V2.hg38-rDHS-Unfiltered.blacklisted.bed
│   ├── V2.mm10-rDHS-Unfiltered.blacklisted.bed
│   ├── hg38-blacklist.v2.bed
│   └── mm10-blacklist.v2.bed
└── whitelists
    ├── 20200227