# Chicken Eye: Clustering
Date: July 7 2025

Author: Ben Zazycki

Adapted from: Jared Tangeman

Professor: Dr. Chun Liang


## Workspace Setup

In [None]:
from google.colab import drive
drive.mount('/content/drive')
!rm -rf /content/sample_data
!sudo apt-get install -y libgsl-dev
!sudo apt-get install -y libhdf5-dev
%load_ext rpy2.ipython
%R .libPaths(c('/content/drive/MyDrive/Bioinformatics/Colab_Lib/R', .libPaths()))
# ^ NOTE: change this based on your individual drive setup

Load relevant packages from library:

In [None]:
%%R
library(Seurat)
library(Signac)
library(ggpubr)
library(ggplot2)
library(future)
library(DT)
library(gprofiler2)
library(scCustomize)
library(Matrix)
library(plotly)
library(ensembldb)
library(JASPAR2024)
library(DirichletMultinomial)
library(TFBSTools)
library(motifmatchr)
library(chromVAR)
library(ggforce)
library(GenomicRanges)
library(BSgenomeForge)
library(BSgenome)
library(biovizBase)
library(patchwork)
library(glmGamPoi)
library(presto)
library(GenomeInfoDb)
library(Biostrings)
library(rtracklayer)
library(BSgenome.Ggallus.ensembl.GRCg7b)

Load the Seurat object (saved as an .rds file by the previous notebook):

In [None]:
%%R
rds_path <- '/content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110/Data_Outputs/seu_merged_processed.rds'
seu_merged_processed <- readRDS(rds_path)

Load in annotation, saved in the same way:

In [None]:
%%R
ann_path <- '/content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110/Data_Outputs/annotation.rds'
annotation <- readRDS(ann_path)

## Initial Clustering/UMAP Reduction

Run PCA for dimensionality reduction:

In [None]:
%%R
seu_merged_processed <- RunPCA(seu_merged_processed,
                          features = VariableFeatures(object = seu_merged_processed))

Create elbow plot to judge effectiveness of PCs:

In [None]:
%%R
ElbowPlot(seu_merged_processed, ndims = 50)

Construct the k-nearest-neighbor (KNN) graph from the first 50 PCs:

In [None]:
%R seu_merged_processed <- FindNeighbors(seu_merged_processed, reduction = "pca", dims = 1:50)
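
To illustrate what the neighbor graph encodes, here is a minimal plain-Python sketch (not the Seurat implementation). It assumes Euclidean distance, a neighbor set that includes the cell itself, and a shared-nearest-neighbor weight equal to the Jaccard overlap of two cells' neighbor sets. The toy coordinates and the names `knn` and `snn_weight` are illustrative only.

```python
import math

def knn(points, k):
    """Return, for each point, the set of indices of its k nearest points."""
    neighbors = []
    for p in points:
        order = sorted(range(len(points)), key=lambda j: math.dist(p, points[j]))
        neighbors.append(set(order[:k]))  # includes the point itself (distance 0)
    return neighbors

def snn_weight(neighbors, i, j):
    """Jaccard overlap between the neighbor sets of cells i and j."""
    shared = len(neighbors[i] & neighbors[j])
    return shared / len(neighbors[i] | neighbors[j])

# Two well-separated toy groups of three "cells" in a 2-D embedding.
toy_embedding = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
                 (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
neighbors = knn(toy_embedding, k=3)
within = snn_weight(neighbors, 0, 1)   # cells from the same group
between = snn_weight(neighbors, 0, 3)  # cells from different groups
print(within, between)
```

Cells in the same group share all of their neighbors, while cells in different groups share none, which is the structure a graph-based clustering algorithm then partitions.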

Run default clustering algorithm. This will use Louvain.

In [None]:
%R seu_merged_processed <- FindClusters(seu_merged_processed, cluster.name = "RNA_clusters")

Run UMAP reduction:

In [None]:
%%R
seu_merged_processed <- RunUMAP(seu_merged_processed, dims = 1:50,
                        reduction.name = "umap.rna",
                        reduction.key = "rnaUMAP_")

View simple UMAP visualization:

In [None]:
%%R
DimPlot(seu_merged_processed, reduction = "umap.rna", group.by = "RNA_clusters")

## ATAC Analysis

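Run Term Frequency-Inverse Document Frequency (TF-IDF) normalization, which adjusts for differences in sequencing depth between cells and up-weights peaks that are open in fewer cells:

In [None]:
%%R
DefaultAssay(seu_merged_processed) <- "ATAC"
# Assign back to seu_merged_processed so the later steps see the normalized data
seu_merged_processed <- RunTFIDF(seu_merged_processed, assay = "ATAC")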
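
To make the normalization concrete, here is a minimal plain-Python sketch of the computation. It assumes the log(1 + TF × IDF × scale factor) form that Signac documents as its default method. The toy matrix, the function name `tfidf`, and the scale factor of 10,000 are illustrative assumptions, not values taken from this dataset.

```python
import math

def tfidf(counts, scale_factor=10_000):
    """Toy TF-IDF for a peak-by-cell count matrix (rows = peaks, columns = cells)."""
    n_peaks = len(counts)
    n_cells = len(counts[0])
    # Column sums: total counts per cell, used for the term-frequency step.
    cell_totals = [sum(counts[p][c] for p in range(n_peaks)) for c in range(n_cells)]
    # Row sums: total counts per peak, used for the inverse-document-frequency step.
    peak_totals = [sum(row) for row in counts]
    result = []
    for p in range(n_peaks):
        idf = n_cells / peak_totals[p]
        row = []
        for c in range(n_cells):
            tf = counts[p][c] / cell_totals[c]
            row.append(math.log1p(tf * idf * scale_factor))
        result.append(row)
    return result

toy_counts = [
    [4, 2, 8],  # peak open in every cell
    [0, 2, 0],  # peak open in one cell only
]
normalized = tfidf(toy_counts)
for row in normalized:
    print([round(v, 2) for v in row])
```

In the first row, the first and third cells end up with the same value even though the third cell was sequenced twice as deeply, and the peak that is open in only one cell receives a higher weight in that cell than the ubiquitous peak does.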

Keep peaks with more than 5 total counts across all cells as the variable features:

In [None]:
%R seu_merged_processed <- FindTopFeatures(seu_merged_processed, min.cutoff = 5)

Run Singular Value Decomposition to project onto fewer dimensions:

In [None]:
%R seu_merged_processed <- RunSVD(seu_merged_processed)

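NOTE: I will use the terms SVD (Singular Value Decomposition) and LSI (Latent Semantic Indexing) interchangeably in this notebook. Strictly, LSI refers to the combination of TF-IDF followed by SVD, which is why Signac stores the result as the "lsi" reduction.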

Check correlation of LSI components with sequencing depth:

In [None]:
%%R
DepthCor(seu_merged_processed)

Visualize the variance explained by each LSI dimension:

In [None]:
%%R
ElbowPlot(seu_merged_processed, reduction = "lsi", ndims = 50)

In ATAC data, LSI dimension 1 often captures technical variation (in this case, sequencing depth) rather than biological signal. If the depth correlation plot above confirms this, the dimension is not biologically meaningful, so it is excluded from the downstream steps (hence `dims = 2:30`).
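
The check behind this decision can be sketched in plain Python as a Pearson correlation between each component's per-cell values and per-cell sequencing depth. The toy numbers, the 0.5 cutoff, and the helper name `pearson` are assumptions for illustration; the notebook itself relies on visual inspection of the DepthCor plot.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical per-cell depth and two hypothetical LSI components.
depth = [1000, 2000, 3000, 4000, 5000]
component_1 = [-2.0, -1.1, 0.1, 0.9, 2.1]   # rises with depth
component_2 = [0.5, -0.7, 0.9, -0.6, -0.1]  # no clear relation to depth

correlations = [pearson(comp, depth) for comp in (component_1, component_2)]
# Keep only components whose absolute depth correlation is below the assumed cutoff.
keep = [i + 1 for i, r in enumerate(correlations) if abs(r) < 0.5]
print([round(r, 2) for r in correlations], keep)
```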

Construct KNN:

In [None]:
%R seu_merged_processed <- FindNeighbors(object = seu_merged_processed, reduction = 'lsi', dims = 2:30)

Run clustering algorithm. 'algorithm = 3' specifies using SLM instead of Louvain here.

In [None]:
%%R
seu_merged_processed <- FindClusters(object = seu_merged_processed,
                    algorithm = 3, cluster.name = "ATAC_clusters")

Run UMAP:

In [None]:
%%R
# group.by is a plotting argument (used in DimPlot), not a RunUMAP argument, so it is omitted here
seu_merged_processed <- RunUMAP(seu_merged_processed, reduction = 'lsi', dims = 2:30,
                        assay = "ATAC", slot = "data",
                        reduction.name = "umap.atac",
                        reduction.key = "atacUMAP_")

Simple UMAP visualization:

In [None]:
%%R
DimPlot(seu_merged_processed, reduction = "umap.atac", group.by = "ATAC_clusters")

## Calling Peaks

We use MACS2 to detect regions of open chromatin (called "peaks") in the single-cell ATAC-seq data. Instead of calling peaks on all cells together, we call them separately for each group of cells, where groups are defined by the RNA clusters found earlier. Per-cluster calling can recover peaks that are specific to less abundant cell populations and would be diluted in a pooled call.

Once the peaks are identified, we clean and format them so that they line up with the chicken genome reference and every coordinate falls within a known chromosome. We then build a new matrix that counts how many ATAC-seq fragments fall into each peak for each cell, which is effectively a large table showing which regions of the genome are open in which cells.

Finally, we create a new "assay" within the Seurat object that stores this information, so we can use it later to study differences in chromatin accessibility between cell types.

First, install MACS2 and find its location within the colab environment:

In [None]:
!pip install MACS2

In [None]:
!which macs2
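
Because the install location can differ between environments, the path can also be looked up from the Python side of the notebook instead of being copied by hand. A minimal sketch using only the standard library; the variable name `macs2_path` is illustrative:

```python
import shutil

# Search the directories on PATH for the macs2 executable.
macs2_path = shutil.which("macs2")

if macs2_path is None:
    print("macs2 was not found on PATH; check that the pip install step succeeded")
else:
    print(f"Use this value for macs2.path: {macs2_path}")
```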

Then, using that path from above, run the Signac CallPeaks method. Note: effective genome size is specific to each species and is hardcoded here for chicken data.

In [None]:
%%R
peaks <- CallPeaks(object = seu_merged_processed,
                   group.by = "RNA_clusters",
                   effective.genome.size = 1049948333,
                   macs2.path = "/usr/local/bin/macs2",
                   assay = "ATAC")

Extract the first three columns of 'peaks' (chromosome, start, end) and give them more descriptive names:

In [None]:
%%R
peaksFormatted <- data.frame(peaks)[,1:3]
colnames(peaksFormatted) <- c("chr", "start", "end")

Next, I will construct a GRanges object from this new 'peaksFormatted' object. GRanges objects work with Signac and other tools, storing genomic intervals (our ATAC peak regions).

In [None]:
%R peaksGR <- makeGRangesFromDataFrame(peaksFormatted)

Now, we will assign chromosome lengths using our custom-forged chicken genome package.

Find the chromosomes that are shared between the package and our GRanges object:

In [None]:
%%R
common_chroms <- intersect(seqlevels(peaksGR),
  seqlevels(BSgenome.Ggallus.ensembl.GRCg7b))

Drop all other (non-shared) chromosomes from GRanges object:

In [None]:
%%R
# Assign back to peaksGR so the sequence-length step operates on the filtered ranges
peaksGR <- keepSeqlevels(peaksGR,
  common_chroms, pruning.mode = "coarse")

Assign correct sequence lengths for shared chromosomes and call trim method to ensure ranges stay within boundaries:

In [None]:
%%R
seqlengths(peaksGR) <- seqlengths(BSgenome.Ggallus.ensembl.GRCg7b)[common_chroms]
peaksGR <- trim(peaksGR)
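
The three steps above (find shared chromosomes, drop the rest, clamp ranges to chromosome boundaries) can be sketched in plain Python. This illustrates the logic only and is not the GenomicRanges implementation. It assumes 1-based inclusive coordinates, and the chromosome lengths and peaks are made-up toy values.

```python
def filter_and_trim(peaks, seqlengths):
    """Drop peaks on unknown chromosomes and clamp the rest to [1, chromosome length]."""
    # Chromosomes present in both the peak set and the reference.
    common = {chrom for chrom, _, _ in peaks} & set(seqlengths)
    cleaned = []
    for chrom, start, end in peaks:
        if chrom not in common:
            continue  # analogous to pruning non-shared sequence levels
        cleaned.append((chrom, max(start, 1), min(end, seqlengths[chrom])))
    return cleaned

toy_seqlengths = {"1": 1000, "2": 500}   # hypothetical reference lengths
toy_peaks = [
    ("1", 950, 1100),    # runs past the end of chromosome 1
    ("2", 10, 200),      # already valid
    ("scaffold_x", 5, 50),  # not in the reference
]
cleaned = filter_and_trim(toy_peaks, toy_seqlengths)
print(cleaned)
```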

## Individual Cluster ATAC Processing

The steps in the previous section have enabled us to create (and analyze) an "ATAC_IC" data assay. Peaks have been called separately for each cluster, rather than globally for all cells.

First, create an insertion count matrix (a sparse peak-by-cell matrix, storing how many fragments from each cell mapped to each peak).

In [None]:
%%R
seu_merged_processed.counts <- FeatureMatrix(fragments = Fragments(seu_merged_processed),
                                   features = peaksGR,
                                   cells = colnames(seu_merged_processed))
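
As a rough illustration of what such a matrix contains, the plain-Python sketch below counts, for each (peak, cell) pair, the fragments that overlap the peak. It assumes that any overlap counts once. The exact counting rule used by FeatureMatrix may differ, and the fragment records, peak names, and function name are toy values.

```python
from collections import defaultdict

def count_fragments(fragments, peaks):
    """Return a sparse {(peak_name, cell): count} mapping of overlapping fragments."""
    counts = defaultdict(int)
    for chrom, start, end, cell in fragments:
        for peak_chrom, peak_start, peak_end in peaks:
            overlaps = (chrom == peak_chrom
                        and start <= peak_end
                        and end >= peak_start)
            if overlaps:
                peak_name = f"{peak_chrom}-{peak_start}-{peak_end}"
                counts[(peak_name, cell)] += 1
    return dict(counts)

# (chromosome, start, end, cell barcode); all values are hypothetical.
toy_fragments = [
    ("1", 100, 180, "cellA"),
    ("1", 150, 260, "cellA"),
    ("1", 400, 450, "cellB"),
    ("2", 50, 90, "cellB"),
]
toy_peaks = [("1", 120, 200), ("2", 40, 100)]
matrix = count_fragments(toy_fragments, toy_peaks)
print(matrix)
```

Only non-zero entries are stored, which mirrors why the real matrix is kept in a sparse format: most peak and cell combinations have no fragments at all.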

Next, I will use that matrix to construct a new assay for Individual Cluster ATAC analysis.

In [None]:
%%R
seu_merged_processed[["ATAC_IC"]] <- CreateChromatinAssay(seu_merged_processed.counts,
                                      sep = c(":", "-"),
                                      fragments = Fragments(seu_merged_processed),
                                      annotation = annotation)

Next, I will call Signac's Term Frequency–Inverse Document Frequency (TF-IDF) transformation method. This transformation helps adjust for differences in sequencing depth and highlights biologically meaningful variation in chromatin accessibility.

In [None]:
%%R
DefaultAssay(seu_merged_processed) <- "ATAC_IC"
# Assign back to seu_merged_processed so the later steps see the normalized data
seu_merged_processed <- RunTFIDF(seu_merged_processed, assay = "ATAC_IC")

I will keep peaks with more than 5 total counts across all cells as the variable features:

In [None]:
%R seu_merged_processed <- FindTopFeatures(seu_merged_processed, min.cutoff = 5)

Similarly to the previous ATAC analysis section, I will apply SVD to reduce dimensionality:

In [None]:
%R seu_merged_processed <- RunSVD(seu_merged_processed)

I'll again visualize the correlation between LSI/SVD component and sequencing depth:

In [None]:
%%R
DepthCor(seu_merged_processed)

Visualize elbow plot for first 50 components:

In [None]:
%%R
ElbowPlot(seu_merged_processed, reduction = "lsi", ndims = 50)

Construct the KNN graph with LSI components 2-30, naming the graphs explicitly so the clustering step can reference them:

In [None]:
%%R
seu_merged_processed <- FindNeighbors(seu_merged_processed,
  reduction = "lsi", dims = 2:30, graph.name = c("ATAC_IC_nn", "ATAC_IC_snn"))

Again, run SLM clustering algorithm:

In [None]:
%%R
seu_merged_processed <- FindClusters(object = seu_merged_processed,
                          algorithm = 3, cluster.name = "ATAC_IC_clusters",
                          graph.name = "ATAC_IC_snn")

Run UMAP:

In [None]:
%%R
seu_merged_processed <- RunUMAP(seu_merged_processed, reduction = 'lsi', dims = 2:30,
                          assay = "ATAC_IC", slot = "data",
                          reduction.name = "umap.atacIC",
                          reduction.key = "atacicUMAP_")

Visualize UMAP with clusters:

In [None]:
%%R
DimPlot(seu_merged_processed, reduction = "umap.atacIC", group.by = "ATAC_IC_clusters")

## Constructing Gene Activity Assay

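Next, I am going to create an assay for "gene activity". Gene activity scores estimate how active each gene is from chromatin accessibility alone, by counting ATAC fragments that fall over the gene body and promoter region. They provide a link between the ATAC_IC and RNA assays.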

First, construct the Gene Activity assay using the Gene Activity Counts from the ATAC_IC assay:

In [None]:
%%R
seu_merged_processed[['Gene_Activity']] <- CreateAssayObject(
  counts = GeneActivity(seu_merged_processed, assay = "ATAC_IC"))

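Next, log-normalize the Gene_Activity assay, using the median per-cell ATAC_IC count as the scale factor.

In [None]:
%%R
# Assign back to seu_merged_processed so the normalized data is kept in the object that gets saved
seu_merged_processed <- NormalizeData(object = seu_merged_processed, assay = 'Gene_Activity',
  normalization.method = 'LogNormalize', scale.factor = median(seu_merged_processed$nCount_ATAC_IC))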
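
For reference, log-normalization divides each count by the cell's total, multiplies by the scale factor, and applies log(1 + x). The plain-Python sketch below uses toy counts and hypothetical per-cell depths standing in for `nCount_ATAC_IC`; the function name `log_normalize` is illustrative.

```python
import math
from statistics import median

def log_normalize(cell_counts, scale_factor):
    """Log-normalize one cell's counts: log1p(count / cell total * scale factor)."""
    total = sum(cell_counts)
    return [math.log1p(c / total * scale_factor) for c in cell_counts]

# Gene activity counts for two toy cells; cellB has twice the depth of cellA.
toy_cells = {"cellA": [2, 0, 6], "cellB": [4, 0, 12]}
# Hypothetical per-cell totals from another assay, used only to choose the scale factor.
toy_depths = [800, 1200, 1000]
scale_factor = median(toy_depths)

normalized = {cell: log_normalize(counts, scale_factor)
              for cell, counts in toy_cells.items()}
print(scale_factor, normalized)
```

Both cells receive the same normalized values because their counts are proportional, and the median-based scale factor keeps the normalized values on a scale comparable to a typical cell's depth.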

Next, I will construct a Weighted Shared Nearest Neighbor (WSNN) graph with the FindMultiModalNeighbors method. This is an integrated graph that combines multiple modalities (in this case, assays). Thus, I need to make sure I set the RNA assay as default first. Note that I list two types of reduction and both of their relevant dimensions.

In [None]:
%%R
DefaultAssay(seu_merged_processed) <- "RNA"
seu_merged_processed <- FindMultiModalNeighbors(object = seu_merged_processed,
  reduction.list = list("pca", "lsi"), dims.list = list(1:50, 2:30))

Run SLM clustering on the WSNN graph:

In [None]:
%R seu_merged_processed <- FindClusters(seu_merged_processed, graph.name = "wsnn", algorithm = 3)

Save WSNN clusters to a separate metadata field:

In [None]:
%R seu_merged_processed$WSNN_clusters <- seu_merged_processed$seurat_clusters

Run UMAP:

In [None]:
%%R
seu_merged_processed <- RunUMAP(object = seu_merged_processed,
                          reduction.name = "umap.wnn",
                          nn.name = "weighted.nn")

Visualize UMAP:

In [None]:
%%R
DimPlot(seu_merged_processed, reduction = "umap.wnn", group.by = "WSNN_clusters")

## Notebook Conclusions

In this notebook, we have run clustering and UMAP reduction four times, each on a different view of the data. Here is an explanation of each:

**scRNA-seq:** Used PCA to reduce the dimensionality of the gene expression data, then clustered on transcriptional similarity. This groups cells by mRNA expression levels, identifying transcriptionally distinct cell states.

**ATAC-seq:** Performed dimensionality reduction using LSI on the TF-IDF-normalized ATAC peak matrix, followed by clustering and UMAP. This reflects differences in genome accessibility patterns, highlighting regulatory landscape variation between cells.

**Individual Cluster ATAC-seq:** Redefined the ATAC peak features by calling peaks per RNA cluster, then used this refined set for LSI-based dimensionality reduction and clustering. This approach is intended to provide more targeted chromatin accessibility profiles, with greater sensitivity to biologically meaningful variation.

**Gene Activity Clustering:** Created a Gene_Activity assay from ATAC_IC and normalized it, then combined PCA (RNA) and LSI (ATAC_IC) via FindMultiModalNeighbors and clustered on the resulting WSNN graph. This integrative method jointly considers gene expression and chromatin accessibility when defining cell identity.

I will save the current state of the Seurat object for the next and final notebook:

In [None]:
%%R
output_path <- '/content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110/Data_Outputs/seu_clustered.rds'
saveRDS(seu_merged_processed, file = output_path)