# Notebook - ATAC-seq and scATAC-seq (in-class)

We thank the ENCODE consortium as well as the Satija and Stuart teams for their software and package tutorials, from which much of the material below is adapted. We also thank the Epigenomics Workshop 2024 for posting educational materials online, which were adapted in creating this notebook.

### [[Important!!]] Instructions for running this notebook on CoCalc:
Please follow the instructions below to configure your own environment to run this notebook. 

1. In the side bar, find the "Settings" button, and click the triangle button next to it.
2. Under the "Control" drop-down menu, change the Software environment to "2024-02-07", then save changes.
3. Open a Linux Terminal.
4. Run the following command in UNIX (at the prompt `$>`):

    `$> mkdir ~/Rlibs`


5. Start an R session from the UNIX command line (at the prompt `$>`):

    `$> R`
    
    
6. Type the following commands sequentially (each line one at a time, **not** copying and running the entire block of commands all at once!). Always **skip** any software updates by hitting "enter" if/when prompted.
 
    `.libPaths("~/Rlibs")`
    
    `require(devtools)`
    
    `install_version("SeuratObject", version="5.0.1")`
    
    `install_version("Signac", version="1.13.0")`
    
    `install.packages("irlba")`
    
    `devtools::install_github("immunogenomics/presto")`
    
    `BiocManager::install("EnsDb.Hsapiens.v86")`
    
 
7. You can quit the R session via the `q()` function. This will return you to the UNIX command line.

### Loading packages

Now, let's return to the notebook and load libraries that we have installed to check that the setup is working.

First, run the cell below to add the local directory where libraries were installed to this notebook.

In [None]:
.libPaths("~/Rlibs")
.libPaths()

Now, let's load our libraries in two groups:

In [None]:
# run this
library(Seurat)
library(Signac)
packageVersion("Seurat")
library(tidyverse)

In [None]:
# now run this
library(biovizBase)
library(EnsDb.Hsapiens.v86)
library(patchwork)
library("presto")

In [None]:
# Now run this to set up for annotations
library(AnnotationHub)

In [None]:
# Now run this library to set up enrichment analyses
library(clusterProfiler)
library(org.Hs.eg.db)
library(enrichplot)

## Objectives of this notebook

The objective of this notebook is to familiarize students with the analysis of bulk and single-cell ATAC-seq data. This notebook will guide students through the steps of quality control, clustering, exploratory data analysis, and differential accessibility analysis. By the end of this process, we aim to uncover insights into the chromatin accessibility landscape across different conditions or cell types, and to be able to process both public and primary datasets.

## Setup -- bulk ATAC-seq pipeline

The ENCODE consortium provides a very useful tool for analyzing bulk ATAC-seq data. Please take a look at their workflow, and answer Q1. https://www.encodeproject.org/pipelines/ENCPL787FUN/ (we recommend opening the website using Safari, as you may encounter display issues with Chrome).

**Q1.** What are the inputs and outputs of the pipeline?

To run the ENCODE pipeline, a `.json` input file is required. The experimental summary information and the locations of the R1 and R2 `.fastq` read files must all be provided in this file so that the pipeline can find them. Here, we will not actually run the whole pipeline, but we will edit the following `.json` file to mimic what we would need to do to run it.

**Q2.** Instructions for editing the input `.json` file can be found here: https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/master/docs/input_short.md  Please read the instructions and edit the `.json` file in the text block below, using the information provided here:

1. The data are collected from mouse liver
2. You don't know the adapter, so you want the algorithm to automatically detect the adapter

In your analysis, don't forget to double check the input `.fastq` files. 
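
One simple way to double check the `.fastq` entries is to confirm that every replicate lists the same number of R1 and R2 files. The sketch below does this in base R on a hypothetical list that mirrors the structure of the template; the `fastqs` list and the `check_pairs()` helper are illustrative names, not part of the ENCODE pipeline.

```r
# Hypothetical file lists mirroring the template's structure (not real data)
fastqs <- list(
  rep1_R1 = c("rep1_R1_L1.fastq.gz", "rep1_R1_L2.fastq.gz", "rep1_R1_L3.fastq.gz"),
  rep1_R2 = c("rep1_R2_L1.fastq.gz", "rep1_R2_L2.fastq.gz", "rep1_R2_L3.fastq.gz"),
  rep2_R1 = c("rep2_R1_L1.fastq.gz", "rep2_R1_L2.fastq.gz"),
  rep2_R2 = c("rep2_R2_L1.fastq.gz", "rep2_R2_L2.fastq.gz")
)

# For paired-end data, each replicate needs as many R2 files as R1 files
check_pairs <- function(fq) {
  reps <- unique(sub("_R[12]$", "", names(fq)))
  sapply(reps, function(r) {
    length(fq[[paste0(r, "_R1")]]) == length(fq[[paste0(r, "_R2")]])
  })
}

pairs_ok <- check_pairs(fastqs)
print(pairs_ok)
```

With real files, you would typically also confirm that each path exists (for example with `file.exists()`) before launching the pipeline.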

In [None]:
{
    "atac.title" : "Example (paired end)",
    "atac.description" : "This is a template input JSON for paired ended sample.",

    "atac.pipeline_type" : "atac",
    "atac.align_only" : false,
    "atac.true_rep_only" : false,

    "atac.genome_tsv" : "/path_to_genome_data/hg38/hg38.tsv",

    "atac.paired_end" : true,

    "atac.fastqs_rep1_R1" : [ "rep1_R1_L1.fastq.gz", "rep1_R1_L2.fastq.gz", "rep1_R1_L3.fastq.gz" ],
    "atac.fastqs_rep1_R2" : [ "rep1_R2_L1.fastq.gz", "rep1_R2_L2.fastq.gz", "rep1_R2_L3.fastq.gz" ],
    "atac.fastqs_rep2_R1" : [ "rep2_R1_L1.fastq.gz", "rep2_R1_L2.fastq.gz" ],
    "atac.fastqs_rep2_R2" : [ "rep2_R2_L1.fastq.gz", "rep2_R2_L2.fastq.gz" ],

    "atac.auto_detect_adapter" : false,
    "atac.adapter" : "AATTCCGG",
    "atac.adapters_rep1_R1" : [ "AATTCCGG", "AATTCCGG", "AATTCCGG" ],
    "atac.adapters_rep1_R2" : [ "AATTCCGG", "AATTCCGG" ],
    "atac.adapters_rep2_R1" : [ "AATTCCGG", "AATTCCGG", "AATTCCGG" ],
    "atac.adapters_rep2_R2" : [ "AATTCCGG", "AATTCCGG" ],

    "atac.multimapping" : 4
}

Once you have this .json file, you can obtain a pipeline which runs end-to-end QC and processing of ATAC-Seq data, which you can run on a computer cluster:

`https://github.com/ENCODE-DCC/atac-seq-pipeline`

We provide this for you here as many of you may end up working with bulk ATAC-Seq data in your careers. You can go to this GitHub page, get this pipeline installed on your local cluster (perhaps with the help of systems admins!), and then use the `.json` template you created above to execute the pipeline.

For the purposes of more extensive in-class work, we'll turn now to setup for performing analysis of scATAC-Seq data.

## Setup -- scATAC-seq

Just like for scRNA-seq, many tools have been developed for the analysis of scATAC-seq data. These packages include `Signac`, `ArchR`, `snapATAC (v1)` in **R** and `snapATAC (v2)`, `EpiScanpy` in **Python**. Since we have used `Seurat` for analyzing scRNA-seq data, we will use `Signac` (developed by the same lab) for the scATAC-seq data analysis. 

Just like `Seurat`, `Signac` has many useful vignettes as well: https://stuartlab.org/signac/ 

Here, we will be analyzing a single-cell ATAC-seq dataset collected from human PBMCs by 10x Genomics. 

## Loading scATAC-seq data
Four files from the `CellRanger-ATAC` pipeline will be used for constructing the `Signac` object:

- A count file. The rows are regions (peaks) and the columns are cells. Each entry i,j is the number of reads mapping to region i in cell j. In this assignment, our file is 

`atac_pbmc_500_nextgem_filtered_peak_bc_matrix.h5`


- A meta data file, with some overall statistics for each cell. In this assignment, our file is 

`atac_pbmc_500_nextgem_singlecell.csv`


- A fragment file, with information on all sequenced fragments (where it maps to the genome, which cell barcode is associated and how many PCR duplicates were found). In this assignment, our file is 

`atac_pbmc_500_nextgem_fragments_sub.tsv.gz`


- An index file connected to the fragment file. This is like an index file for a bam file: it makes it possible to quickly find fragments for a certain genomic region without having to search the entire file. In this assignment, our file is 

`atac_pbmc_500_nextgem_fragments_sub.tsv.gz.tbi`


(Note that you won't specify this file, but it is required. We pre-created this file for you using `tabix` in UNIX.)
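
To make the structure of the count file concrete, here is a minimal base R sketch of a peak-by-cell matrix. All values, barcodes, and peak coordinates are made up for illustration; the peak names use the `chr:start-end` form, which is why the names are split on `:` and `-`.

```r
# Toy peak-by-cell count matrix (made-up values, not from the PBMC dataset)
peaks <- c("chr1:1000-1500", "chr1:9000-9400", "chr2:200-800")
cells <- c("AAAC-1", "CCGT-1", "GGTA-1", "TTAG-1")
counts_toy <- matrix(
  c(2, 0, 1, 0,
    0, 0, 3, 1,
    1, 1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(peaks, cells)
)

# Split "chr:start-end" peak names on ":" and "-"
parts <- do.call(rbind, strsplit(peaks, "[:-]"))
peak_table <- data.frame(
  chr = parts[, 1],
  start = as.integer(parts[, 2]),
  end = as.integer(parts[, 3])
)
peak_table$width <- peak_table$end - peak_table$start

# Counts in peaks per cell (column sums) and number of cells with signal per peak
frags_per_cell <- colSums(counts_toy)
cells_per_peak <- rowSums(counts_toy > 0)
```

Summaries like `frags_per_cell` and `cells_per_peak` are the kind of quantities that minimum-cell and minimum-feature thresholds act on when an assay is created.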

### Create Seurat object

Using the above files, edit the code below to specify the file names in the places indicated.

In [None]:
counts <- Read10X_h5(filename = "") ##Specify the count file here

metadata <- read.csv(
  file = "", ##Specify the meta data file here
  header = TRUE,
  row.names = 1
)

chrom_assay <- CreateChromatinAssay(
  counts = counts,
  sep = c(":", "-"),
  fragments = "", ##Specify the fragment file here
  min.cells = 10,
  min.features = 200
)

pbmc <- CreateSeuratObject(
  counts = chrom_assay,
  assay = "peaks",
  meta.data = metadata
)

Next, let's add gene annotations. 

For this, we will take advantage of precomputed annotations which you can search for at `AnnotationHub`.

This will allow downstream functions to pull the gene annotation information directly from the object.

In [None]:
hub <- AnnotationHub()
query(hub, c("ensdb", "homo sapiens"))

From the above, you can see a list of many annotations you could choose from, and from many species. For example, you *could* replace `homo sapiens` in the above with mouse (`mus musculus`) or zebrafish (`danio rerio`) and also obtain annotation sets. 

But here, we are working with human data, so let's select a recent annotated genome data base from humans (version 111). As you can see, this recent database corresponds to the database ID `AH116291`. Let's store that annotation in an object called `ensdb`.

In [None]:
query(hub, c("ensdb","homo sapiens", "111"))
ensdb <- hub[["AH116291"]]

In [None]:
# # extract gene annotations from EnsDb
annotations <- GetGRangesFromEnsDb(ensdb = ensdb)

# # change to UCSC style since the data was mapped to hg38
seqlevels(annotations) <- paste0('chr', seqlevels(annotations))
genome(annotations) <- "hg38"

# # add the gene information to the object
Annotation(pbmc) <- annotations
Annotation(pbmc)

### Computing QC metrics 

We have introduced these common QC metrics in our prelab notebook. Please refer back to that notebook or the Signac website (https://stuartlab.org/signac/articles/pbmc_vignette) for more information. 

The enrichment of Tn5 integration events at transcriptional start sites (TSSs) can also be an important quality control metric to assess the targeting of Tn5 in ATAC-seq experiments. The ENCODE consortium defined a TSS enrichment score as the number of Tn5 integration sites around the TSS normalized to the number of Tn5 integration sites in flanking regions. See the ENCODE documentation for more information about the TSS enrichment score (https://www.encodeproject.org/data-standards/terms/). 

We can calculate the TSS enrichment score for each cell using the `TSSEnrichment()` function in Signac.
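
The idea behind the score can be sketched with a toy signal in base R. This is a simplified illustration of the definition above, not the exact ENCODE or Signac implementation: the simulated insertion profile, the 100 bp flank size, and the two summary choices (peak value vs. mean over the central 1 kb) are all assumptions made for the example.

```r
# Toy Tn5 insertion counts at positions -1000..1000 around aggregated TSSs
# (simulated shape, not real data)
pos <- -1000:1000
insertions <- 5 + 45 * exp(-(pos / 150)^2)

# Background level: mean signal in the outermost 100 bp on each side
flank <- abs(pos) > 900
flank_mean <- mean(insertions[flank])

# Normalize by the flanks, then summarize the signal near the TSS
normalized <- insertions / flank_mean
tss_score_peak <- max(normalized)
tss_score_center <- mean(normalized[abs(pos) <= 500])
```

A profile with no enrichment would give values close to 1 under either summary, while a strong pile-up of insertions at the TSS pushes the score well above 1.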

In [None]:
# compute nucleosome signal score per cell
pbmc <- NucleosomeSignal(object = pbmc)

# # compute TSS enrichment score per cell
pbmc <- TSSEnrichment(object = pbmc)

# add fraction of reads in peaks
pbmc$pct_reads_in_peaks <- pbmc$peak_region_fragments / pbmc$passed_filters * 100

# # add blacklist ratio
pbmc$blacklist_ratio <- FractionCountsInRegion(
   object = pbmc, 
   assay = 'peaks',
   regions = blacklist_hg38_unified
)

We can look at the relationship between variables stored in our object (`pbmc`), for example, using `DensityScatter`. This can be helpful in deciding suitable cutoff values for the different QC metrics that we have calculated. For example, let's look at the relationship between read count and `TSS.enrichment`:

In [None]:
DensityScatter(pbmc, x = 'nCount_peaks', y = 'TSS.enrichment', log_x = TRUE, quantiles = TRUE)

The data here are more *sparse* than what you would see in a typical experiment, as we have created a subset of data that you can work with within the CoCalc environment. However, what you can see is that there is a central density of data, with some outliers (high and low peak counts, for example). 

Next, it is also helpful to look at the fragment length distribution, as we expect nucleosome positioning periodicity in the data. We can look at this using the `nucleosome signal`; let's look at characteristics for scores less than 4 and greater than 4:

In [None]:
pbmc$nucleosome_group <- ifelse(pbmc$nucleosome_signal > 4, 'NS > 4', 'NS < 4')
FragmentHistogram(object = pbmc, group.by = 'nucleosome_group')

The plot on the left shows that for NS < 4, we can see the periodicity we expect from a successful ATAC-Seq experiment. Although it is hard to see on the right, cells with NS > 4 show a slight excess of mononucleosomal fragments relative to nucleosome-free fragments (this holds in the larger data sets). As such, we may want to remove these cells downstream.
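
To see what a mononucleosomal to nucleosome-free ratio looks like numerically, here is a small base R sketch on made-up fragment lengths. The size ranges used (nucleosome-free below 147 bp, mononucleosomal from 147 to 294 bp) are an assumption based on our reading of the Signac documentation, and `nucleosome_ratio()` is a hypothetical helper, not a Signac function.

```r
# Toy fragment lengths (bp) for two hypothetical cells
frag_lengths <- list(
  cell_A = c(60, 80, 95, 110, 120, 130, 180, 200, 250, 400),
  cell_B = c(70, 160, 170, 185, 190, 210, 220, 240, 260, 280)
)

# Ratio of mononucleosomal (147-294 bp) to nucleosome-free (< 147 bp) fragments
nucleosome_ratio <- function(len) {
  mono <- sum(len >= 147 & len <= 294)
  nfr <- sum(len < 147)
  mono / nfr
}

ns <- sapply(frag_lengths, nucleosome_ratio)

# Group cells with the same cutoff used in this notebook
ns_group <- ifelse(ns > 4, "NS > 4", "NS < 4")
```

In this toy example `cell_A` is dominated by short, nucleosome-free fragments, while `cell_B` is dominated by mononucleosomal fragments and would land in the `NS > 4` group.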

Now, let's plot distribution of each QC metric separately via a violin plot:

In [None]:
VlnPlot(
  object = pbmc,
  features = c('nCount_peaks', 'TSS.enrichment', 'blacklist_ratio', 'nucleosome_signal', 'pct_reads_in_peaks'),
  pt.size = 0.1,
  ncol = 3
)

### Filtering based on QC metric

Here, we will set some initial QC filters, to retain only cells with the following properties:

- read count in peaks greater than 9,000 (i.e., insist on a minimum read depth per cell)
- read count in peaks less than 100,000 (i.e., exclude cells with unusually high depth)
- percentage of reads in peaks greater than 40%
- blacklist ratio less than 1% (i.e., 0.01)
- nucleosome signal less than 4
- TSS enrichment greater than 4
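
Before applying all filters at once, it can be useful to see which filter removes which cells. The sketch below applies the thresholds listed above to a made-up per-cell QC table in base R; the table values and the `pass` column names are hypothetical.

```r
# Toy per-cell QC table (made-up values)
qc <- data.frame(
  nCount_peaks = c(12000, 3000, 150000, 20000, 15000),
  pct_reads_in_peaks = c(65, 50, 70, 30, 55),
  blacklist_ratio = c(0.001, 0.002, 0.001, 0.003, 0.02),
  nucleosome_signal = c(1.2, 0.8, 1.5, 2.0, 1.1),
  TSS.enrichment = c(6.5, 5.0, 7.2, 4.8, 5.5),
  row.names = paste0("cell", 1:5)
)

# One logical column per filter, so each filter's effect is visible
pass <- with(qc, data.frame(
  depth_min = nCount_peaks > 9000,
  depth_max = nCount_peaks < 100000,
  frip = pct_reads_in_peaks > 40,
  blacklist = blacklist_ratio < 0.01,
  nucleosome = nucleosome_signal < 4,
  tss = TSS.enrichment > 4,
  row.names = rownames(qc)
))

# Number of cells failing each filter, and cells passing all filters
cells_removed_by <- colSums(!pass)
kept <- rownames(qc)[rowSums(pass) == ncol(pass)]
```

A breakdown like `cells_removed_by` helps you judge whether a single threshold is discarding most of your cells and may need adjusting for your dataset.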

In [None]:
pbmc <- subset(
  x = pbmc,
  subset = nCount_peaks > 9000 &
    nCount_peaks < 100000 &
    pct_reads_in_peaks > 40 &
    blacklist_ratio < 0.01 &
    nucleosome_signal < 4 &
    TSS.enrichment > 4
)
pbmc

### Normalization and linear dimensional reduction

Next, as described in the prelab, we will normalize the peak counts of the cells retained above using Term Frequency-Inverse Document Frequency (TF-IDF), and then perform linear dimensionality reduction using singular value decomposition (SVD).
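
To make the two steps concrete, here is a minimal base R sketch on a toy matrix. It uses one common log-scaled TF-IDF variant with a scale factor of 1e4; both the variant and the scale factor are assumptions for illustration and may differ from the defaults of the functions used in this notebook.

```r
# Toy peak-by-cell counts (rows = peaks, columns = cells; made-up values)
m <- matrix(
  c(1, 0, 2, 0,
    0, 1, 0, 1,
    1, 1, 1, 1,
    0, 0, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("peak", 1:4), paste0("cell", 1:4))
)

# Term frequency: each cell's counts divided by that cell's total
tf <- sweep(m, 2, colSums(m), "/")

# Inverse document frequency: number of cells / total counts for each peak
idf <- ncol(m) / rowSums(m)

# Log-scaled TF-IDF (one common variant; the 1e4 scale factor is an assumption)
tfidf <- log1p(sweep(tf, 1, idf, "*") * 1e4)

# SVD of the normalized matrix; cell embeddings are V scaled by singular values
s <- svd(tfidf)
cell_embeddings <- s$v %*% diag(s$d)
rownames(cell_embeddings) <- colnames(m)
```

The term-frequency step corrects for differences in depth between cells, while the inverse-document-frequency step gives more weight to peaks that are open in fewer cells.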

In [None]:
pbmc <- RunTFIDF(pbmc)
pbmc <- FindTopFeatures(pbmc, min.cutoff = 'q0')
pbmc <- RunSVD(pbmc)

The first LSI component often captures sequencing depth (technical variation) rather than biological variation. If this is the case, the component should be removed from downstream analysis. We can assess the correlation between each LSI component and sequencing depth using the DepthCor() function:

In [None]:
DepthCor(pbmc)

Here we see there is a very strong (negative) correlation between the first LSI component and the total number of counts for the cell, so we will perform downstream steps without this component.
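
The check itself is just a correlation between each component and the total counts per cell. The base R sketch below illustrates this on simulated data in which cells differ mainly in depth; the simulation settings and the simple TF-IDF stand-in are assumptions for the example, and the exact correlations depend on the random draw.

```r
set.seed(1)

# Simulate a toy peak-by-cell matrix in which cells differ mainly in depth
n_peaks <- 200
n_cells <- 50
depth_factor <- runif(n_cells, min = 0.5, max = 3)
m <- sapply(depth_factor, function(d) rpois(n_peaks, lambda = 0.3 * d))

# Drop empty peaks, then apply a log-scaled TF-IDF as a simple stand-in
m <- m[rowSums(m) > 0, , drop = FALSE]
tf <- sweep(m, 2, colSums(m), "/")
tfidf <- log1p(sweep(tf, 1, ncol(m) / rowSums(m), "*") * 1e4)

# Cell embeddings from SVD, then correlation of each component with depth
s <- svd(tfidf)
embeddings <- s$v %*% diag(s$d)
depth <- colSums(m)
depth_cor <- apply(embeddings[, 1:10], 2, function(comp) cor(comp, depth))
round(depth_cor, 2)
```

A component whose correlation with depth is close to 1 or -1 is mostly capturing technical variation, which is the motivation for starting from component 2 in the downstream steps.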

### Non-linear dimension reduction and clustering
Now that the cells are embedded in a low-dimensional space, we can use methods commonly applied for the analysis of scRNA-seq data to perform graph-based clustering, and non-linear dimension reduction for visualization. The functions `RunUMAP()`, `FindNeighbors()`, and `FindClusters()` all come from the Seurat package.

In [None]:
pbmc <- RunUMAP(object = pbmc, reduction = 'lsi', dims = 2:30)
pbmc <- FindNeighbors(object = pbmc, reduction = 'lsi', dims = 2:30)
pbmc <- FindClusters(object = pbmc, verbose = FALSE, algorithm = 3)
DimPlot(object = pbmc, label = TRUE) + NoLegend()

**Note**: In this notebook, we will skip the integration with scRNA-seq part. Based on various benchmark efforts, the integration between unmatched scRNA-seq and scATAC-seq can be very challenging. We recommend running this type of analysis with caution. 

### Find differentially accessible peaks between clusters
To find differentially accessible regions between clusters of cells, we can perform a differential accessibility (DA) test. A simple approach is to perform a Wilcoxon rank sum test, and the presto package has implemented an extremely fast Wilcoxon test that can be run on a Seurat object.
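
To see what the test does for a single peak, here is a base R sketch using `wilcox.test()` on made-up counts for two hypothetical clusters. The multiple-testing step uses a Bonferroni correction over a handful of hypothetical p-values, which is one conservative choice and is shown here only as an illustration.

```r
# Toy accessibility counts for one peak in two hypothetical clusters
cluster0 <- c(3, 4, 2, 5, 4, 3, 6, 4)
cluster1 <- c(0, 1, 0, 0, 2, 1, 0, 1)

# Two-sided Wilcoxon rank sum test for this single peak
# (exact = FALSE uses the normal approximation, since the data contain ties)
res <- wilcox.test(cluster0, cluster1, exact = FALSE)
res$p.value

# With many peaks, p-values are adjusted for multiple testing.
# The extra p-values below are hypothetical.
p_raw <- c(res$p.value, 0.04, 0.2, 0.6)
p_adj <- p.adjust(p_raw, method = "bonferroni")
```

Because the test is rank-based, it makes no assumption that the counts are normally distributed, which suits sparse single-cell data; running it over every peak is what makes a fast implementation valuable.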

In [None]:
DefaultAssay(pbmc) <- 'peaks'

da_peaks <- FindMarkers(
  object = pbmc,
  ident.1 = c("0"), 
  ident.2 = c("1"), 
  test.use = 'wilcox',
  min.pct = 0.1
)

head(da_peaks)

In [None]:
plot1 <- VlnPlot(
  object = pbmc,
  features = rownames(da_peaks)[1],
  pt.size = 0.1,
  idents = c("0","1")
)
plot2 <- FeaturePlot(
  object = pbmc,
  features = rownames(da_peaks)[1],
  pt.size = 0.1
)

plot1 | plot2

Finally, let's look at the annotations for some of these significant peaks. Let's look at those where cluster 0 is open relative to cluster 1, and vice versa, where cluster 1 is open relative to cluster 0.

Let us also filter this by significance of association as well as the log fold change:
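
The sign of the log fold change tells us the direction: positive values indicate higher accessibility in the first group of the comparison, and negative values indicate higher accessibility in the second. The base R sketch below applies this kind of filter to a made-up table with the same column names; the peak names and values are hypothetical.

```r
# Toy differential accessibility table (made-up values)
da_toy <- data.frame(
  avg_log2FC = c(4.2, -3.8, 0.5, 3.5, -5.1),
  p_val_adj = c(1e-8, 1e-10, 1e-12, 0.2, 1e-6),
  row.names = c("chr1-100-600", "chr2-50-400", "chr3-10-300",
                "chr4-700-900", "chr5-20-500")
)

# More open in the first group: large positive fold change and significant
open_in_first <- rownames(
  da_toy[da_toy$avg_log2FC > 3 & da_toy$p_val_adj < 1e-5, ]
)

# More open in the second group: large negative fold change and significant
open_in_second <- rownames(
  da_toy[da_toy$avg_log2FC < -3 & da_toy$p_val_adj < 1e-5, ]
)
```

Note that a peak can be highly significant with a small fold change (like `chr3-10-300` here) or have a large fold change without being significant (like `chr4-700-900`), which is why both criteria are applied together.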

In [None]:
open_c0 <- rownames(da_peaks[da_peaks$avg_log2FC > 3 & da_peaks$p_val_adj < 1e-5, ])
open_c1 <- rownames(da_peaks[da_peaks$avg_log2FC < -3 & da_peaks$p_val_adj < 1e-5, ])

closest_genes_c0 <- ClosestFeature(pbmc, regions = open_c0)
closest_genes_c1 <- ClosestFeature(pbmc, regions = open_c1)

### Gene Ontology Enrichment Analysis
Just like in the proteomics module, we can also perform enrichment analyses for our peaks, using the gene whose TSS is nearest to each peak. We have done this below for the peaks that are more open in cluster 0 (c0) and in cluster 1 (c1), but you could compare different clusters by editing the code above.
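
Over-representation analysis of this kind boils down to a hypergeometric test for each GO term: given how many background genes carry the term, is the overlap with our gene list larger than expected by chance? The base R sketch below works through one term with hypothetical numbers (none of them come from the PBMC analysis).

```r
# Hypothetical numbers for one GO term
N <- 20000  # genes in the background
K <- 200    # background genes annotated to the term
n <- 100    # genes nearest to the selected peaks
k <- 8      # selected genes annotated to the term

# Expected overlap by chance, and P(overlap >= k) under the hypergeometric model
expected <- n * K / N
p_over <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# Across many terms, p-values are adjusted (BH here); the extra values are hypothetical
p_terms <- c(p_over, 0.01, 0.03, 0.5)
p_bh <- p.adjust(p_terms, method = "BH")
```

Here we would expect about one annotated gene by chance but observe eight, so the term would be reported as enriched. Keep in mind that the result depends on the choice of background gene set.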

In [None]:
cd0_ego <- enrichGO(gene = closest_genes_c0$gene_id,
                keyType = "ENSEMBL",
                OrgDb = org.Hs.eg.db,
                ont = "BP",
                pAdjustMethod = "BH",
                pvalueCutoff = 0.05,
                qvalueCutoff = 0.05,
                readable = TRUE)

barplot(cd0_ego,showCategory = 20)

In [None]:
cd1_ego <- enrichGO(gene = closest_genes_c1$gene_id,
                keyType = "ENSEMBL",
                OrgDb = org.Hs.eg.db,
                ont = "BP",
                pAdjustMethod = "BH",
                pvalueCutoff = 0.05,
                qvalueCutoff = 0.05,
                readable = TRUE)

barplot(cd1_ego,showCategory = 20)