# ArchR multi-sample recipe step 1 -- initialize an ArchR project
**Author**: Adam Klie (last modified: 11/06/2023)<br>
***
**Description**: This script creates an ArchR project from a set of fragment or BAM files. It performs QC and filtering while creating a separate arrow file for each input file, detects and filters doublets, and saves the ArchR project.

# Set-up

In [1]:
# Load libraries
suppressMessages(library(Seurat))
suppressMessages(library(ArchR))
suppressMessages(library(parallel))
suppressMessages(library(tidyverse))

“package ‘S4Vectors’ was built under R version 4.3.2”
“package ‘BiocGenerics’ was built under R version 4.3.2”
“package ‘GenomicRanges’ was built under R version 4.3.2”
“package ‘IRanges’ was built under R version 4.3.2”
“package ‘GenomeInfoDb’ was built under R version 4.3.2”
“package ‘SummarizedExperiment’ was built under R version 4.3.2”
“package ‘MatrixGenerics’ was built under R version 4.3.2”
“package ‘Biobase’ was built under R version 4.3.2”


In [2]:
# Set the random seed, thread count, and working directory
set.seed(1234)
addArchRThreads(4)
setwd("/cellar/users/aklie/data/datasets/igvf_sc-islet_10X-Multiome/annotation/all/archr")

Setting default number of Parallel threads to 4.



The precompiled version of the hg38 genome in ArchR uses BSgenome.Hsapiens.UCSC.hg38, TxDb.Hsapiens.UCSC.hg38.knownGene, org.Hs.eg.db, and a blacklist that was merged using ArchR::mergeGR() from the hg38 v2 blacklist regions and from mitochondrial regions that show high mappability to the hg38 nuclear genome from Caleb Lareau and Jason Buenrostro. To set a global genome default to the precompiled hg38 genome:

In [3]:
# Add annotation
addArchRGenome("hg38")

Setting default genome to Hg38.



In [4]:
# List samples to work with
samples <- c("mo1", "mo3")
inputFiles = paste0(
    "/cellar/users/aklie/data/datasets/igvf_sc-islet_10X-Multiome/processed/21Jul23/igvf_",
    samples,
    "_deep/outs/atac_fragments.tsv.gz"
)
names(inputFiles) = samples
inputFiles
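Because the input paths are assembled with `paste0()`, a typo in a sample name only surfaces later, when arrow file creation fails. The sketch below shows one way to check the paths up front. `check_input_files` is a hypothetical helper (not part of ArchR), and the temporary files stand in for real fragment files.

```r
# Hypothetical helper (not an ArchR function): stop early if any named input
# path does not exist on disk.
check_input_files <- function(files) {
  stopifnot(!is.null(names(files)), !anyDuplicated(names(files)))
  missing <- files[!file.exists(files)]
  if (length(missing) > 0) {
    stop("Missing input files for sample(s): ",
         paste(names(missing), collapse = ", "))
  }
  invisible(files)
}

# Toy demonstration: one temporary file exists, the other does not
tmp <- tempfile(fileext = ".tsv.gz")
file.create(tmp)
toy_files <- c(sampleA = tmp, sampleB = tempfile(fileext = ".tsv.gz"))

# Capture the error message instead of halting the session
result <- tryCatch(
  check_input_files(toy_files),
  error = function(e) conditionMessage(e)
)
print(result)
```

In the notebook itself, the equivalent call would be `check_input_files(inputFiles)` just before creating the arrow files.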

# Make the arrow files

In [None]:
# Make arrow files 
ArrowFiles <- createArrowFiles(
  inputFiles = inputFiles,
  sampleNames = names(inputFiles),
  minTSS = 4,
  minFrags = 1000, 
  excludeChr = c("chrM"),
  addTileMat = TRUE,
  addGeneScoreMat = TRUE
)
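To make the two QC thresholds concrete, here is a minimal base-R sketch of per-cell filtering on TSS enrichment and fragment count. The table values are invented, and treating both cutoffs as inclusive (`>=`) is an assumption rather than a statement about ArchR's internals.

```r
# Toy per-cell QC table; values are invented for illustration only
qc <- data.frame(
  cell = c("c1", "c2", "c3", "c4"),
  TSSEnrichment = c(12.1, 3.2, 8.5, 4.0),
  nFrags = c(25000, 5000, 800, 1000),
  stringsAsFactors = FALSE
)

# Same thresholds as the createArrowFiles() call above
min_tss <- 4
min_frags <- 1000

# Keep cells that meet both thresholds (boundary assumed inclusive)
keep <- qc$TSSEnrichment >= min_tss & qc$nFrags >= min_frags
passing <- qc$cell[keep]
print(passing)
```

Here `c2` fails on TSS enrichment and `c3` fails on fragment count, so only `c1` and `c4` pass.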

# Create project

In [None]:
# Make archr project
proj <- ArchRProject(
    ArrowFiles = ArrowFiles, 
    outputDirectory = "./"   
)

In [None]:
# Check memory size
paste0("Memory Size = ", round(object.size(proj) / 10^6, 3), " MB")

# Add doublet scores

In [None]:
# add doublet scores and filter
proj = addDoubletScores(proj, k = 10, knnMethod = "LSI")
proj = filterDoublets(proj)
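`filterDoublets()` is called with its defaults here. As described in the ArchR documentation for its `filterRatio` argument, the maximum number of cells removed per sample scales with the square of the cell count: `filterRatio * nCells^2 / 100000`. The sketch below only reproduces that arithmetic. `max_doublets_removed` is a hypothetical helper, and rounding down is an assumption.

```r
# Hypothetical helper reproducing the documented upper bound on removed
# doublets per sample: filterRatio * nCells^2 / 100000.
# Rounding down with floor() is an assumption.
max_doublets_removed <- function(n_cells, filter_ratio = 1) {
  floor(filter_ratio * n_cells^2 / 100000)
}

small_sample <- max_doublets_removed(5000)       # 250 cells, i.e. 5% of 5000
large_sample <- max_doublets_removed(10000, 2)   # 2000 cells with filterRatio = 2
print(c(small_sample, large_sample))
```

Because the bound grows quadratically, samples with more cells lose a larger fraction of cells at the same `filterRatio`.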

# Save

In [None]:
# Save the post-filtering metadata; these cells can be considered high quality
proj_meta = as.data.frame(proj@cellColData)
write.csv(proj_meta, file = "snatac_metadata.csv")

In [None]:
# Save the archr project
saveArchRProject(
  ArchRProj = proj,
  outputDirectory = "./"
)

# DONE!

---

# Scratch

## Optional additions to project

In [17]:
# Load the ArchR project
proj = loadArchRProject(path = "./")
proj

Successfully loaded ArchRProject!


                                                   / |
                                                 /    \
            .                                  /      |.
            \\\                              /        |.
              \\\                          /           `|.
                \\\                      /              |.
                  \                    /                |\
                  \\#####\           /                  ||
                ==###########>      /                   ||
                 \\##==......\    /                     ||
            ______ =       =|__ /__                     ||      \\\
       \               '        ##_______ _____ ,--,__,=##,__   ///
        ,    __==    ___,-,__,--'#'  ==='      `-'    | ##,-/
        -,____,---'       \\####\\________________,--\\_##,/
           ___      .______        ______  __    __  .______      
          /   \     |   _  \      /      ||  |  |  | |   _ 

class: ArchRProject 
outputDirectory: /cellar/users/aklie/data/datasets/igvf_sc-islet_10X-Multiome/annotation/all/archr 
samples(26): dm44a mo9 ... dm24a mo29
sampleColData names(1): ArrowFiles
cellColData names(15): Sample TSSEnrichment ... DoubletScore
  DoubletEnrichment
numberOfCells(1): 136058
medianTSS(1): 13.264
medianFrags(1): 18404

### Add project metadata

In [18]:
project_metadata_path <- "/cellar/users/aklie/data/datasets/igvf_sc-islet_10X-Multiome/metadata/sample_metadata.tsv"

In [19]:
# Load the project metadata
project_metadata <- read.csv(project_metadata_path, sep = "\t")
head(project_metadata)

,sample_id,batch,timepoint,condition,timecourse
,<chr>,<chr>,<int>,<chr>,<chr>
1,dm11a,A2,6,3-cyt,A2_3-cyt
2,dm12b,A2,6,IFNg,A2_IFNg
3,dm14b,A2,6,Ex-4_HG,A2_Ex-4_HG
4,dm21a,A2,24,3-cyt,A2_3-cyt
5,dm23a,A2,24,dex,A2_dex
6,dm24a,A2,24,Ex-4_HG,A2_Ex-4_HG


In [20]:
# Clean up the archr proj metadata
archr_metadata = as.data.frame(proj@cellColData)
archr_metadata$sample_id = archr_metadata$Sample
head(archr_metadata)

,Sample,TSSEnrichment,ReadsInTSS,ReadsInPromoter,ReadsInBlacklist,PromoterRatio,PassQC,NucleosomeRatio,nMultiFrags,nMonoFrags,nFrags,nDiFrags,BlacklistRatio,DoubletScore,DoubletEnrichment,sample_id
,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
dm44a#GATTAGTGTTTCGCGC-1,dm44a,11.788,32032,33143,1130,0.1661153,1,2.050827,26723,32699,99759,40337,0.005663649,5.87071689,2.022222,dm44a
dm44a#GGAACTAAGTTAGCTA-1,dm44a,12.077,33606,34004,1240,0.1705367,1,2.116797,26474,31987,99697,41236,0.006218843,0.0,1.377778,dm44a
dm44a#GAGCTTAGTAGGATTT-1,dm44a,12.325,35365,36072,1240,0.1809481,1,1.937926,26038,33927,99675,39710,0.006220216,0.09224421,1.577778,dm44a
dm44a#CCTAAATCAATAGCCC-1,dm44a,12.135,34004,34171,1210,0.1715412,1,1.979182,26265,33432,99600,39903,0.006074297,11.74596347,2.355556,dm44a
dm44a#AACAGATAGGCGCTAC-1,dm44a,10.153,25548,27767,1375,0.1396646,1,1.518329,23185,39473,99406,36748,0.006916082,7.1510717,2.111111,dm44a
dm44a#AGGACGTAGGCGGGTA-1,dm44a,11.22,32064,33825,1250,0.1719292,1,2.114717,26159,31582,98369,40628,0.006353628,37.85614035,3.488889,dm44a


In [21]:
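# Merge in batch, timepoint, condition, and timecourse by sample_id.
# left_join keeps the row order of archr_metadata but drops its row names (cell names).
# TODO: make this just a join that gets added to the dataframe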
archr_metadata = dplyr::left_join(archr_metadata, project_metadata, by = "sample_id")
head(archr_metadata)

,Sample,TSSEnrichment,ReadsInTSS,ReadsInPromoter,ReadsInBlacklist,PromoterRatio,PassQC,NucleosomeRatio,nMultiFrags,nMonoFrags,nFrags,nDiFrags,BlacklistRatio,DoubletScore,DoubletEnrichment,sample_id,batch,timepoint,condition,timecourse
,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<int>,<chr>,<chr>
1,dm44a,11.788,32032,33143,1130,0.1661153,1,2.050827,26723,32699,99759,40337,0.005663649,5.87071689,2.022222,dm44a,A2,72,Ex-4_HG,A2_Ex-4_HG
2,dm44a,12.077,33606,34004,1240,0.1705367,1,2.116797,26474,31987,99697,41236,0.006218843,0.0,1.377778,dm44a,A2,72,Ex-4_HG,A2_Ex-4_HG
3,dm44a,12.325,35365,36072,1240,0.1809481,1,1.937926,26038,33927,99675,39710,0.006220216,0.09224421,1.577778,dm44a,A2,72,Ex-4_HG,A2_Ex-4_HG
4,dm44a,12.135,34004,34171,1210,0.1715412,1,1.979182,26265,33432,99600,39903,0.006074297,11.74596347,2.355556,dm44a,A2,72,Ex-4_HG,A2_Ex-4_HG
5,dm44a,10.153,25548,27767,1375,0.1396646,1,1.518329,23185,39473,99406,36748,0.006916082,7.1510717,2.111111,dm44a,A2,72,Ex-4_HG,A2_Ex-4_HG
6,dm44a,11.22,32064,33825,1250,0.1719292,1,2.114717,26159,31582,98369,40628,0.006353628,37.85614035,3.488889,dm44a,A2,72,Ex-4_HG,A2_Ex-4_HG
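The merged table above has lost its cell-name row names. The join would also silently duplicate rows if a `sample_id` appeared more than once in the sample metadata. A `match()`-based lookup is one alternative that keeps row names and row count fixed. The sketch below uses toy tables; `cells` and `samples` are hypothetical stand-ins for `archr_metadata` and `project_metadata`.

```r
# Toy stand-ins for the per-cell and per-sample tables
cells <- data.frame(
  sample_id = c("s2", "s1", "s2", "s3"),
  row.names = c("s2#AAA-1", "s1#CCC-1", "s2#GGG-1", "s3#TTT-1"),
  stringsAsFactors = FALSE
)
samples <- data.frame(
  sample_id = c("s1", "s2"),
  batch = c("A1", "A2"),
  timepoint = c(6L, 24L),
  stringsAsFactors = FALSE
)

# The lookup assumes exactly one metadata row per sample
stopifnot(!anyDuplicated(samples$sample_id))

# For each cell, find the row of its sample in the metadata table
idx <- match(cells$sample_id, samples$sample_id)
cells$batch <- samples$batch[idx]
cells$timepoint <- samples$timepoint[idx]

# Samples missing from the metadata show up as NA rather than being dropped
unmatched <- unique(cells$sample_id[is.na(idx)])
print(cells)
print(unmatched)
```

Checking `unmatched` before copying columns onto the project makes missing sample annotations visible early.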


In [23]:
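# Copy the merged sample-level columns onto the project.
# This assumes archr_metadata is still in the same row order as proj@cellColData.
proj$batch = archr_metadata$batch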
proj$timepoint = archr_metadata$timepoint
proj$condition = archr_metadata$condition
proj$timecourse = paste(proj$batch, proj$condition, sep = "_")
head(proj@cellColData)

DataFrame with 6 rows and 19 columns
                         Sample TSSEnrichment ReadsInTSS ReadsInPromoter
                          <Rle>       <array>    <array>         <array>
dm44a#GATTAGTGTTTCGCGC-1  dm44a        11.788      32032           33143
dm44a#GGAACTAAGTTAGCTA-1  dm44a        12.077      33606           34004
dm44a#GAGCTTAGTAGGATTT-1  dm44a        12.325      35365           36072
dm44a#CCTAAATCAATAGCCC-1  dm44a        12.135      34004           34171
dm44a#AACAGATAGGCGCTAC-1  dm44a        10.153      25548           27767
dm44a#AGGACGTAGGCGGGTA-1  dm44a         11.22      32064           33825
                         ReadsInBlacklist     PromoterRatio  PassQC
                                  <array>           <array> <array>
dm44a#GATTAGTGTTTCGCGC-1             1130 0.166115337964494       1
dm44a#GGAACTAAGTTAGCTA-1             1240  0.17053672628063       1
dm44a#GAGCTTAGTAGGATTT-1             1240 0.180948081264108       1
dm44a#CCTAAATCAATAGCCC-1             12

In [25]:
# Save initial project metadata
proj_meta = as.data.frame(proj@cellColData)
write.table(proj_meta, file = "initial_archr_proj_meta.tsv", sep = "\t", quote = FALSE, row.names = TRUE)
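One caveat when reading this file back: ArchR cell names contain `#`, which `read.table()` treats as a comment character by default. The round-trip sketch below uses a toy table (values invented) and disables comment handling with `comment.char = ""`.

```r
# Toy metadata with ArchR-style "sample#barcode" row names
meta <- data.frame(
  Sample = c("s1", "s2"),
  nFrags = c(1200, 3400),
  row.names = c("s1#AAA-1", "s2#CCC-1"),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".tsv")
write.table(meta, file = path, sep = "\t", quote = FALSE, row.names = TRUE)

# The header line has one fewer field than the data lines, so read.table()
# uses the first column as row names. comment.char = "" keeps the "#" in the
# cell names from truncating each line.
meta_back <- read.table(
  path,
  header = TRUE,
  sep = "\t",
  comment.char = "",
  stringsAsFactors = FALSE
)
print(meta_back)
```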

In [26]:
# Save the project
saveArchRProject(
  ArchRProj = proj,
  outputDirectory = "./"
)

Saving ArchRProject...

Loading ArchRProject...

Successfully loaded ArchRProject!



class: ArchRProject 
outputDirectory: /cellar/users/aklie/data/datasets/igvf_sc-islet_10X-Multiome/annotation/all/archr 
samples(26): dm44a mo9 ... dm24a mo29
sampleColData names(1): ArrowFiles
cellColData names(19): Sample TSSEnrichment ... condition timecourse
numberOfCells(1): 136058
medianTSS(1): 13.264
medianFrags(1): 18404

### Remove AMULET doublets

In [5]:
# Load in the barcode list called as doublets
amulet_bcs <- read.csv("/cellar/users/aklie/data/datasets/igvf_sc-islet_10X-Multiome/annotation/06Nov23/amulet/amulet_bcs_archr.txt", header = FALSE, sep = "\t")$V1
head(amulet_bcs)

In [9]:
# Grab cell id as column
proj_meta = as.data.frame(proj@cellColData)
proj_meta$cell_id = rownames(proj_meta)

In [10]:
# Check how many doublets are in the project
sum(proj_meta$cell_id %in% amulet_bcs)

In [11]:
# Get cell names NOT in amulet
cells_doublet_filt = proj$cellNames[!(proj_meta$cell_id %in% amulet_bcs)]
length(cells_doublet_filt)
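The filtering above only works if the AMULET barcodes use the same `sample#barcode` format as the ArchR cell names. The toy sketch below (all barcodes invented) shows the same `%in%` logic, and also reports doublet calls that match no cell in the project, which can reveal a format mismatch.

```r
# Invented cell names and doublet calls in ArchR's "sample#barcode" format
cell_names <- c("mo1#AAAC-1", "mo1#CCGT-1", "mo3#AAAC-1", "mo3#GGTA-1")
doublet_bcs <- c("mo1#CCGT-1", "mo3#AAAC-1", "mo9#TTTT-1")

# Flag cells called as doublets, then keep the rest
is_doublet <- cell_names %in% doublet_bcs
n_flagged <- sum(is_doublet)
cells_kept <- cell_names[!is_doublet]

# Doublet calls that match no cell in the project; a large number here would
# suggest the two barcode formats disagree
not_in_project <- setdiff(doublet_bcs, cell_names)

# A bare barcode without the sample prefix matches nothing
bare_match <- "AAAC-1" %in% cell_names

print(cells_kept)
print(not_in_project)
```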

In [12]:
# Create new arrow files with doublets removed
proj = subsetArchRProject(
    ArchRProj = proj,
    outputDirectory = "../AMULET_filt/",
    cells = cells_doublet_filt,
    dropCells = TRUE,
    force = TRUE
)

Copying ArchRProject to new outputDirectory : /cellar/users/aklie/data/datasets/igvf_sc-islet_10X-Multiome/annotation/07Nov23/archr/AMULET_filt

Copying Arrow Files...

Getting ImputeWeights

No imputeWeights found, returning NULL

Copying Other Files...

Copying Other Files (1 of 58): ArchRLogs

Copying Other Files (2 of 58): dm0b

Copying Other Files (3 of 58): dm0b.arrow

Copying Other Files (4 of 58): dm11a

Copying Other Files (5 of 58): dm11a.arrow

Copying Other Files (6 of 58): dm12b

Copying Other Files (7 of 58): dm12b.arrow

Copying Other Files (8 of 58): dm14b

Copying Other Files (9 of 58): dm14b.arrow

Copying Other Files (10 of 58): dm21a

Copying Other Files (11 of 58): dm21a.arrow

Copying Other Files (12 of 58): dm23a

Copying Other Files (13 of 58): dm23a.arrow

Copying Other Files (14 of 58): dm24a

Copying Other Files (15 of 58): dm24a.arrow

Copying Other Files (16 of 58): dm25a

Copying Other Files (17 of 58): dm25a.arrow

Copying Other Files (18 of 58): dm31a

C