# ArchR multi-sample recipe step 3 -- run normalization, dimensionality reduction, and clustering
**Author**: Adam Klie (last modified: 11/06/2023)<br>
***
**Description**: This notebook runs normalization, dimensionality reduction, and clustering on the ArchR project object. Run it after the 2A_preprocess_archr_proj.ipynb notebook. It is intended to be run on the cluster.

# Set-up

In [1]:
# Load libraries
suppressMessages(library(Seurat))
suppressMessages(library(ArchR))
suppressMessages(library(parallel))
suppressMessages(library(tidyverse))

In [2]:
# Params
archr_proj_path = "/cellar/users/aklie/data/datasets/igvf_sc-islet_10X-Multiome/annotation/2023_11_15/archr/H1_control"
clustering_resolution = 0.8
umap_neighbors = 30
umap_min_dist = 0.5
umap_metric = "cosine"
threads = 4
seed = 1234
run_harmony = FALSE

In [3]:
# Set the random seed and thread count, then move to the project directory
set.seed(seed)
addArchRThreads(threads)
setwd(archr_proj_path)

Setting default number of Parallel threads to 4.



The precompiled version of the hg38 genome in ArchR uses BSgenome.Hsapiens.UCSC.hg38, TxDb.Hsapiens.UCSC.hg38.knownGene, org.Hs.eg.db, and a blacklist that was merged using ArchR::mergeGR() from the hg38 v2 blacklist regions and from mitochondrial regions that show high mappability to the hg38 nuclear genome from Caleb Lareau and Jason Buenrostro. To set a global genome default to the precompiled hg38 genome:

In [4]:
# Add annotation
addArchRGenome("hg38")

Setting default genome to Hg38.



# Load the ArchR project

In [5]:
# Load the ArchR project
proj = loadArchRProject(path = "./")
proj

Successfully loaded ArchRProject!


                                                   / |
                                                 /    \
            .                                  /      |.
            \\\                              /        |.
              \\\                          /           `|.
                \\\                      /              |.
                  \                    /                |\
                  \\#####\           /                  ||
                ==###########>      /                   ||
                 \\##==......\    /                     ||
            ______ =       =|__ /__                     ||      \\\
       \               '        ##_______ _____ ,--,__,=##,__   ///
        ,    __==    ___,-,__,--'#'  ==='      `-'    | ##,-/
        -,____,---'       \\####\\________________,--\\_##,/
           ___      .______        ______  __    __  .______      
          /   \     |   _  \      /      ||  |  |  | |   _ 

class: ArchRProject 
outputDirectory: /cellar/users/aklie/data/datasets/igvf_sc-islet_10X-Multiome/annotation/2023_11_15/archr/H1_control 
samples(6): mo38 mo22 ... mo14 mo29
sampleColData names(1): ArrowFiles
cellColData names(20): Sample TSSEnrichment ... timecourse
  rna_annotation
numberOfCells(1): 14614
medianTSS(1): 17.039
medianFrags(1): 21553.5

# Add RNA metadata
Then subset the project to only the cells that are present in the RNA metadata.

In [66]:
# Read in the TSV; the first column holds the cell IDs and becomes the row names
tsv.path = "/cellar/users/aklie/data/datasets/igvf_sc-islet_10X-Multiome/annotation/2023_11_14/cellcommander/H1_control/integrated/rna/annotate_metadata.tsv"
rna_annotations = read.csv(tsv.path, row.names = 1, sep = "\t")
rna_sample = sapply(strsplit(rownames(rna_annotations), "#"), function(x) x[1])
rna_annotations$sample = rna_sample
rna_cellids = rownames(rna_annotations)
head(rna_annotations)

| | total_counts | pct_counts_mt | sctransform_none_leiden_1 | integrated_manual_cellid_annotation | sample |
|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<int>` | `<chr>` | `<chr>` |
| mo1#AAACAGCCAAAGGTAC-1 | 8767 | 0.58172697 | 5 | SC.beta | mo1 |
| mo1#AAACAGCCACCTACTT-1 | 4636 | 0.06471096 | 1 | SC.EC | mo1 |
| mo1#AAACAGCCAGCAATAA-1 | 7781 | 0.02570363 | 10 | SC.alpha | mo1 |
| mo1#AAACCAACAACCGCCA-1 | 5564 | 0.0 | 15 | SC.delta | mo1 |
| mo1#AAACCGCGTATTGTGG-1 | 9338 | 0.02141786 | 10 | SC.alpha | mo1 |
| mo1#AAACGCGCAAGCCACT-1 | 6430 | 0.0466563 | 6 | SC.EC | mo1 |
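
The row names above follow the `sample#barcode` pattern, which is why splitting on `#` recovers the sample name. A minimal base R sketch on toy IDs (the `mo14` ID is made up for illustration) shows the split, plus a way to recover the bare barcode:

```r
# Toy cell IDs in "sample#barcode" form; the mo14 ID is hypothetical.
cell_ids <- c("mo1#AAACAGCCAAAGGTAC-1", "mo14#AAACAGCCACCTACTT-1")

# Split on the literal "#" and keep the first piece (the sample name).
# vapply() guarantees a character vector, unlike sapply().
sample_prefix <- vapply(
  strsplit(cell_ids, "#", fixed = TRUE),
  function(x) x[1],
  character(1)
)

# Drop everything up to and including the first "#" to get the barcode.
barcode <- sub("^[^#]*#", "", cell_ids)

print(data.frame(cell_id = cell_ids, sample = sample_prefix, barcode = barcode))
```

Using `fixed = TRUE` treats `#` as a literal character rather than a regular expression, which is slightly safer if the separator ever changes to a regex metacharacter.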


In [67]:
# Check if all the cells are in the ArchR project
matched_ids = intersect(rna_cellids, rownames(proj@cellColData))
ids_not_in_atac = setdiff(rna_cellids, rownames(proj@cellColData))
ids_not_in_rna = setdiff(rownames(proj@cellColData), rna_cellids)
print(paste0("In both RNA and ATAC: ", length(matched_ids)))
print(paste0("In RNA but not ATAC: ", length(ids_not_in_atac)))
print(paste0("In ATAC but not RNA: ", length(ids_not_in_rna)))

[1] "In both RNA and ATAC: 14682"
[1] "In RNA but not ATAC: 5856"
[1] "In ATAC but not RNA: 0"
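
The three counts come from plain set operations on the cell IDs. Assuming the IDs are unique, the shared and RNA-only counts should add up to the number of rows in the RNA metadata (here 14682 + 5856 = 20538). A small sketch on toy IDs (all hypothetical) makes that bookkeeping explicit:

```r
# Toy ID vectors standing in for rna_cellids and the ArchR cell names.
rna_ids  <- c("s1#AAA-1", "s1#CCC-1", "s1#GGG-1")
atac_ids <- c("s1#CCC-1", "s1#GGG-1", "s1#TTT-1")

shared    <- intersect(rna_ids, atac_ids)  # in both modalities
rna_only  <- setdiff(rna_ids, atac_ids)    # in RNA but not ATAC
atac_only <- setdiff(atac_ids, rna_ids)    # in ATAC but not RNA

# Sanity check: every unique ID of a modality is either shared or modality-specific.
stopifnot(length(shared) + length(rna_only) == length(unique(rna_ids)))
stopifnot(length(shared) + length(atac_only) == length(unique(atac_ids)))

print(c(shared = length(shared), rna_only = length(rna_only), atac_only = length(atac_only)))
```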


In [68]:
# What cells are in both RNA and ATAC?
table(rna_annotations[matched_ids,]$sample)


 mo1 mo14 mo22 mo29  mo3 mo38 
2128 1804 3155 1136 1968 4491 

In [69]:
# What cells are in RNA but not ATAC?
table(rna_annotations[ids_not_in_atac,]$sample)


 mo1 mo14 mo22 mo29  mo3 mo38 
1044  644 1323  350  867 1628 

In [70]:
# What cells are in ATAC but not RNA?
table(proj@cellColData[ids_not_in_rna,]$Sample)

< table of extent 0 >

In [71]:
# Subset the proj to only cells that are in both RNA and ATAC
idxSample <- BiocGenerics::which(proj$cellNames %in% matched_ids)
length(idxSample)

In [72]:
# Actually subset the proj
proj <- proj[idxSample,]
proj


           ___      .______        ______  __    __  .______      
          /   \     |   _  \      /      ||  |  |  | |   _  \     
         /  ^  \    |  |_)  |    |  ,----'|  |__|  | |  |_)  |    
        /  /_\  \   |      /     |  |     |   __   | |      /     
       /  _____  \  |  |\  \\___ |  `----.|  |  |  | |  |\  \\___.
      /__/     \__\ | _| `._____| \______||__|  |__| | _| `._____|
    



class: ArchRProject 
outputDirectory: /cellar/users/aklie/data/datasets/igvf_sc-islet_10X-Multiome/annotation/07Nov23/archr/MO_control 
samples(6): mo38 mo22 ... mo14 mo29
sampleColData names(1): ArrowFiles
cellColData names(21): Sample TSSEnrichment ... rna_annotation Clusters
numberOfCells(1): 14682
medianTSS(1): 17.044
medianFrags(1): 21581.5

In [76]:
# Reorder (and subset) the RNA annotations to match the cell order of the ArchR project
rna_annotations = rna_annotations[proj$cellNames, ]

In [77]:
# Add the RNA annotations
proj$rna_annotation <- rna_annotations$integrated_manual_cellid_annotation
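
Assigning a column this way relies on the annotation rows being in exactly the same order as the project's cells. A minimal base R sketch of that alignment, using a toy data frame and toy cell names (all hypothetical), checks for missing IDs before indexing and verifies the order afterwards:

```r
# Toy annotation table keyed by cell ID (stand-in for rna_annotations).
annotations <- data.frame(
  label = c("SC.beta", "SC.alpha", "SC.EC"),
  row.names = c("s1#AAA-1", "s1#CCC-1", "s1#GGG-1"),
  stringsAsFactors = FALSE
)

# Toy cell order (stand-in for proj$cellNames); note it differs from the table order.
cell_names <- c("s1#GGG-1", "s1#AAA-1")

# Indexing by a missing row name would silently produce NA rows, so check first.
missing_ids <- setdiff(cell_names, rownames(annotations))
stopifnot(length(missing_ids) == 0)

# Row-name indexing both subsets and reorders the table.
aligned <- annotations[cell_names, , drop = FALSE]
stopifnot(identical(rownames(aligned), cell_names))

print(aligned)
```

The `drop = FALSE` argument keeps the result a data frame even when only one column is present.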

In [78]:
proj@cellColData$rna_annotation %>% table()

.
   other SC.alpha  SC.beta SC.delta    SC.EC 
     803     3727     5021      186     4945 

# Dimensionality reduction

In [33]:
# ArchR dimensionality reduction with iterative LSI on the tile matrix
proj = addIterativeLSI(
    proj,  
    useMatrix = "TileMatrix",
    name = "IterativeLSI", 
    force = TRUE
)

Checking Inputs...



ArchR logging to : ArchRLogs/ArchR-addIterativeLSI-349345437865b5-Date-2023-11-14_Time-15-06-24.371375.log
If there is an issue, please report to github with logFile!

2023-11-14 15:07:09.353914 : Computing Total Across All Features, 0.27 mins elapsed.

2023-11-14 15:07:39.008897 : Computing Top Features, 0.764 mins elapsed.

###########
2023-11-14 15:07:39.911728 : Running LSI (1 of 2) on Top Features, 0.779 mins elapsed.
###########

2023-11-14 15:07:39.955793 : Sampling Cells (N = 10002) for Estimated LSI, 0.78 mins elapsed.

2023-11-14 15:07:39.960797 : Creating Sampled Partial Matrix, 0.78 mins elapsed.

2023-11-14 15:08:10.833876 : Computing Estimated LSI (projectAll = FALSE), 1.295 mins elapsed.

Filtering 1 dims correlated > 0.75 to log10(depth + 1)

2023-11-14 15:09:39.208501 : Identifying Clusters, 2.768 mins elapsed.

2023-11-14 15:09:49.572423 : Identified 6 Clusters, 2.94 mins elapsed.

2023-11-14 15:09:49.646146 : Saving LSI Iteration, 2.942 mins elapsed.

2023-11-14 15:1
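
The log line "Filtering 1 dims correlated > 0.75 to log10(depth + 1)" indicates that LSI components tracking sequencing depth are dropped before downstream steps. The sketch below illustrates that idea in base R on simulated data. It is not ArchR's internal code, and the matrix, depths, and use of the absolute correlation are assumptions made for illustration:

```r
set.seed(1234)
n_cells <- 200

# Simulated per-cell fragment counts and their log10 transform.
depth <- round(10^runif(n_cells, min = 3, max = 5))
log_depth <- log10(depth + 1)

# Simulated reduced dimensions: LSI1 is built to track depth, the others are noise.
lsi <- cbind(
  LSI1 = log_depth + rnorm(n_cells, sd = 0.05),
  LSI2 = rnorm(n_cells),
  LSI3 = rnorm(n_cells)
)

# Absolute Pearson correlation of each component with log10(depth + 1).
cor_to_depth <- abs(apply(lsi, 2, cor, y = log_depth))

# Keep only components at or below the 0.75 cutoff.
keep <- cor_to_depth <= 0.75
lsi_filtered <- lsi[, keep, drop = FALSE]

print(round(cor_to_depth, 3))
print(colnames(lsi_filtered))
```

In this toy setup only the depth-driven component is removed, which mirrors the "29 numeric columns" that UMAP later reports if the LSI was computed with 30 dimensions.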

# Clustering

In [34]:
# Run clustering
proj = addClusters(
    proj, 
    reducedDims = "IterativeLSI",
    method = "Seurat",
    name = "Clusters",
    resolution = clustering_resolution,
    force = TRUE
)

ArchR logging to : ArchRLogs/ArchR-addClusters-349345411e631-Date-2023-11-14_Time-15-13-05.798936.log
If there is an issue, please report to github with logFile!



Filtering 1 dims correlated > 0.75 to log10(depth + 1)

2023-11-14 15:13:09.444488 : Running Seurats FindClusters (Stuart et al. Cell 2019), 0.002 mins elapsed.

Computing nearest neighbor graph

Computing SNN



Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 14682
Number of edges: 416277

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.7677
Number of communities: 9
Elapsed time: 1 seconds


2023-11-14 15:13:21.224174 : Testing Biased Clusters, 0.198 mins elapsed.

2023-11-14 15:13:21.25685 : Testing Outlier Clusters, 0.198 mins elapsed.

2023-11-14 15:13:21.263397 : Assigning Cluster Names to 9 Clusters, 0.199 mins elapsed.

2023-11-14 15:13:21.32324 : Finished addClusters, 0.2 mins elapsed.
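
Because the project already carries the RNA-derived labels, one quick way to sanity-check the 9 ATAC clusters is to cross-tabulate them against those labels. The sketch below uses toy vectors (hypothetical values) in place of `proj$Clusters` and `proj$rna_annotation`:

```r
# Toy stand-ins for proj$Clusters and proj$rna_annotation.
clusters  <- c("C1", "C1", "C2", "C2", "C2", "C3")
rna_label <- c("SC.beta", "SC.beta", "SC.alpha", "SC.alpha", "SC.EC", "SC.EC")

# Rows are ATAC clusters, columns are RNA annotations.
confusion <- table(cluster = clusters, rna = rna_label)

# Fraction of each cluster assigned to each RNA label.
row_frac <- prop.table(confusion, margin = 1)

# Most frequent RNA label per cluster.
dominant <- colnames(confusion)[apply(confusion, 1, which.max)]
names(dominant) <- rownames(confusion)

print(confusion)
print(round(row_frac, 2))
print(dominant)
```

Clusters whose cells spread evenly over several RNA labels are candidates for a closer look, for example at a different `clustering_resolution`.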



# Optionally run Harmony

In [35]:
if (run_harmony) {
    # Run Harmony
    print("Running Harmony")
    proj = addHarmony(
        ArchRProj = proj, 
        reducedDims = "IterativeLSI", 
        name = "Harmony", 
        groupBy = "SampleID"
    )
    reduction = "Harmony"
} else {
    print("Using IterativeLSI")
    reduction = "IterativeLSI"
}

[1] "Using IterativeLSI"
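
If Harmony is enabled, `groupBy` needs to name a column of `cellColData`. The project summary above lists `Sample`, while `SampleID` does not appear among the names shown (the list is elided, so it may still exist). A minimal pre-flight check, sketched here on a toy data frame standing in for `cellColData`, could catch a wrong column name before the batch correction runs:

```r
# Toy stand-in for the project's cellColData (values are hypothetical).
cell_meta <- data.frame(
  Sample = c("mo1", "mo1", "mo3"),
  Clusters = c("C1", "C2", "C1"),
  stringsAsFactors = FALSE
)

# Column name intended for addHarmony(groupBy = ...).
group_by <- "SampleID"

has_column <- group_by %in% colnames(cell_meta)
if (!has_column) {
  message(
    "Column '", group_by, "' not found; available columns: ",
    paste(colnames(cell_meta), collapse = ", ")
  )
}
```

On the real object the same test would be applied to `colnames(proj@cellColData)`.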


# UMAP

In [36]:
# Run UMAP with the parameters defined in the Params cell
proj <- addUMAP(
    ArchRProj = proj, 
    reducedDims = reduction,
    name = "UMAP", 
    nNeighbors = umap_neighbors, 
    minDist = umap_min_dist, 
    metric = umap_metric
)

Filtering 1 dims correlated > 0.75 to log10(depth + 1)

15:13:21 UMAP embedding parameters a = 0.583 b = 1.334

15:13:21 Read 14682 rows and found 29 numeric columns

15:13:21 Using Annoy for neighbor search, n_neighbors = 30

15:13:21 Building Annoy index with metric = cosine, n_trees = 50


15:13:22 Writing NN index file to temp file /tmp/RtmpZaF9Ty/file3493451072c7c9

15:13:22 Searching Annoy index using 64 threads, search_k = 3000

15:13:23 Annoy recall = 100%

15:13:24 Commencing smooth kNN distance calibration using 64 threads
 with target n_neighbors = 30

15:13:25 Initializing from normalized Laplacian + noise (using irlba)

15:13:25 Commencing optimization for 200 epochs, with 708230 positive edges

15:13:32 Optimization finished

15:13:32 Creating temp model dir /tmp/RtmpZaF9Ty/dir3493455c59f2e0

15:13:32 Creating dir /tmp/RtmpZaF9Ty/dir3493455c59f2e0

15:13:33 Changing to /tmp/RtmpZaF9Ty/dir3493455c59f2e0

15:13:33 Creating /cellar/users/aklie/data/datasets/igvf_sc-islet_10X-Multiome/annotation/07Nov23/archr/MO_control/Embeddings/Save-Uwot-UMAP-Params-IterativeLSI-3493457aa7004b-Date-2023-11-14_Time-15-13-32.535832.tar



# Save

In [84]:
# Save the project with the new reduced dimensions, clusters, and embedding
saveArchRProject(
  ArchRProj = proj,
  outputDirectory = "./"
)

Saving ArchRProject...

Loading ArchRProject...

Successfully loaded ArchRProject!



class: ArchRProject 
outputDirectory: /cellar/users/aklie/data/datasets/igvf_sc-islet_10X-Multiome/annotation/07Nov23/archr/MO_control 
samples(6): mo38 mo22 ... mo14 mo29
sampleColData names(1): ArrowFiles
cellColData names(21): Sample TSSEnrichment ... rna_annotation Clusters
numberOfCells(1): 14682
medianTSS(1): 17.044
medianFrags(1): 21581.5

# DONE

---