# Assemble cell type-specific ATAC datasets

To begin our analysis, we'll retrieve the .arrow files containing ATAC data and metadata produced by our TEA-seq QC and demultiplexing pipeline. We'll then extract the metadata for cells to use for cell filtering and QC plots.

## Load packages

hise: The Human Immune System Explorer R SDK package  
dplyr: Dataframe handling functions   
ArchR: .arrow file handling  
purrr: Functional programming tools  


In [1]:
quiet_library <- function(...) { suppressPackageStartupMessages(library(...)) }
quiet_library(hise)
quiet_library(dplyr)
quiet_library(ArchR)
quiet_library(purrr)



## Retrieve files

Now, we'll use the HISE SDK package to retrieve the TEA-seq .arrow file outputs based on their file UUIDs. These will be placed in the `cache/` subdirectory by default.

In [2]:
atac_file_uuids <- list(
    "052b769d-cbdf-41f6-8fe8-0d34564b442a",
    "16e0c562-5d36-431f-bb27-b443aabc7077",
    "30167a93-70a8-4c38-b615-2252dabe417e",
    "57dde81e-bdaa-4138-add6-2551968672f4",
    "6853fb68-fa85-43d5-9071-c5e42667a75e",
    "6d8185bf-8a35-492d-a6ab-5783006b3b8e",
    "8c2a93be-de53-4d6f-ae2e-ab40a356edc3",
    "9ab975f8-7763-4892-96a7-a7438ecc9470",
    "a56fd2ba-a055-4ff8-9ab1-69df113bc032",
    "ad2f347d-e961-4a70-b294-4df83c12355d",
    "c299a55a-3325-4eb3-ba8e-c8ceccafaa8c",
    "ff8fe67e-cfe0-482a-bad1-aa189390a1c0"
)

In [3]:
fres <- hise::cacheFiles(
    atac_file_uuids
)

[1] "Initiating file download for EXP-00454-P1_PC02184-038_archr.arrow"
[1] "Download successful."
[1] "Initiating file download for EXP-00454-P1_PC02184-040_archr.arrow"
[1] "Download successful."
[1] "Initiating file download for EXP-00454-P1_PC02184-041_archr.arrow"
[1] "Download successful."
[1] "Initiating file download for EXP-00454-P1_PC02184-039_archr.arrow"
[1] "Download successful."
[1] "Initiating file download for EXP-00454-P1_PC02184-045_archr.arrow"
[1] "Download successful."
[1] "Initiating file download for EXP-00454-P1_PC02184-046_archr.arrow"
[1] "Download successful."
[1] "Initiating file download for EXP-00454-P1_PC02184-048_archr.arrow"
[1] "Download successful."
[1] "Initiating file download for EXP-00454-P1_PC02184-044_archr.arrow"
[1] "Download successful."
[1] "Initiating file download for EXP-00454-P1_PC02184-043_archr.arrow"
[1] "Download successful."
[1] "Initiating file download for EXP-00454-P1_PC02184-049_archr.arrow"
[1] "Download successful."
[1] "Initi

We'll also need the cell type labels identified using scRNA-seq and ADT markers to select cells from each cell type:

In [4]:
label_uuids <- list(
    "ebd4bee7-2f5d-46e1-b2fc-22157f1b8d04", # CD4 type labels
    "4d6aade9-288c-452f-8f0d-ac59e539f4cc"  # CD8 type labels
)

In [5]:
fres <- cacheFiles(label_uuids)

[1] "Initiating file download for cd4_cell_type_labels_2023-09-05.csv"
[1] "Download successful."
[1] "Initiating file download for cd8_cell_type_labels_2023-09-05.csv"
[1] "Download successful."


## Assemble full ArchR project

First, we'll assemble all cells into a single large ArchR project. We'll then add cell type labels and subset the project by cell type for downstream analyses.

In [6]:
addArchRGenome("hg38")
addArchRThreads(12)

Setting default genome to Hg38.

Setting default number of Parallel threads to 12.



In [7]:
arrow_files <- list.files(
    "cache",
    pattern = "\\.arrow$",
    recursive = TRUE,
    full.names = TRUE
)
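
Before building the project, it can help to confirm that the cache holds one `.arrow` file per requested UUID. The sketch below demonstrates such a check on a temporary directory with empty placeholder files. The helper name `check_arrow_count()`, the placeholder subdirectory names, and the one-file-per-UUID assumption are ours for illustration; they are not part of hise or ArchR.

```r
# Hypothetical helper: stop early if the number of cached .arrow files
# differs from the number of requested file UUIDs.
# Assumes each UUID resolves to exactly one .arrow file.
check_arrow_count <- function(cache_dir, n_expected) {
    found <- list.files(
        cache_dir,
        pattern = "\\.arrow$",   # escape the dot so it matches literally
        recursive = TRUE,
        full.names = TRUE
    )
    if (length(found) != n_expected) {
        stop(
            "Expected ", n_expected, " .arrow files in '", cache_dir,
            "' but found ", length(found)
        )
    }
    invisible(found)
}

# Toy demonstration with empty placeholder files in a temporary cache
demo_cache <- file.path(tempdir(), "demo_cache")
dir.create(file.path(demo_cache, "uuid-a"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_cache, "uuid-b"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_cache, "uuid-a", "sample1_archr.arrow"))
file.create(file.path(demo_cache, "uuid-b", "sample2_archr.arrow"))
file.create(file.path(demo_cache, "uuid-b", "notes_arrow.txt"))  # not matched

demo_files <- check_arrow_count(demo_cache, n_expected = 2)
basename(demo_files)
```

In this notebook, the equivalent call would be `check_arrow_count("cache", length(atac_file_uuids))`.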

In [8]:
full_proj <- ArchRProject(
    ArrowFiles = arrow_files,
    outputDirectory = "vrdtea_ArchR_all"
)

Using GeneAnnotation set by addArchRGenome(Hg38)!

Using GeneAnnotation set by addArchRGenome(Hg38)!

Validating Arrows...

Getting SampleNames...



Copying ArrowFiles to Ouptut Directory! If you want to save disk space set copyArrows = FALSE

1 
2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 


Getting Cell Metadata...



Merging Cell Metadata...

Initializing ArchRProject...



### Load cell type labels

In [9]:
label_files <- list.files(
    "cache",
    pattern = "cell_type_labels",
    recursive = TRUE,
    full.names = TRUE
)

In [10]:
cell_type_labels <- map_dfr(
    label_files,
    read.csv
)
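
`map_dfr()` applies `read.csv()` to each path and row-binds the results, so the CD4 and CD8 label tables end up in one data frame. A minimal sketch with two toy CSV files follows; the file names and contents are invented for illustration. In purrr 1.0 and later, `map_dfr()` is marked as superseded by `map()` followed by `list_rbind()`, though it continues to work.

```r
library(purrr)

# Write two small label tables with identical columns to a temporary folder
demo_dir <- file.path(tempdir(), "demo_labels")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(
    data.frame(
        barcodes = c("bc1", "bc2"),
        aifi_cell_type = c("t_cd4_naive", "t_cd4_cm")
    ),
    file.path(demo_dir, "cd4_cell_type_labels_demo.csv"),
    row.names = FALSE
)
write.csv(
    data.frame(barcodes = "bc3", aifi_cell_type = "t_cd8_naive"),
    file.path(demo_dir, "cd8_cell_type_labels_demo.csv"),
    row.names = FALSE
)

demo_label_files <- list.files(
    demo_dir,
    pattern = "cell_type_labels",
    full.names = TRUE
)

# Read every file and stack the rows into a single data frame
demo_labels <- map_dfr(demo_label_files, read.csv)

# Rows from both files are present in the combined table
nrow(demo_labels)
table(demo_labels$aifi_cell_type)
```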

In [11]:
head(cell_type_labels)

| | barcodes | treatment | timepoint | predicted.celltype.l1.score | predicted.celltype.l1 | predicted.celltype.l2.score | predicted.celltype.l2 | predicted.celltype.l3.score | predicted.celltype.l3 | aifi_cell_type |
|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` |
| 1 | 2db4ad86fb8111eda35df29f570c0793 | dmso | 24 | 1 | CD8 T | 0.9935699 | CD8 TCM | 0.7719808 | CD8 TCM_2 | t_cd8_memory |
| 2 | 2db5b3acfb8111eda35df29f570c0793 | dmso | 24 | 1 | CD8 T | 0.9516342 | CD8 Naive | 0.5145206 | CD8 Naive | t_cd8_naive |
| 3 | 2db85e7cfb8111eda35df29f570c0793 | dmso | 24 | 1 | CD8 T | 0.9234621 | CD8 Naive | 0.5688105 | CD8 Naive | t_cd8_naive |
| 4 | 2dbf2874fb8111eda35df29f570c0793 | dmso | 24 | 1 | CD8 T | 0.6689411 | CD8 Naive | 0.5051748 | CD8 Naive | t_cd8_naive |
| 5 | 2df6a268fb8111eda35df29f570c0793 | dmso | 24 | 1 | CD8 T | 0.8623866 | CD8 Naive | 0.8623866 | CD8 Naive | t_cd8_naive |
| 6 | 2df88042fb8111eda35df29f570c0793 | dmso | 24 | 1 | CD8 T | 0.679953 | CD8 Naive | 0.6599393 | CD8 Naive | t_cd8_naive |


### Add labels to ArchR metadata

In [12]:
archr_meta <- getCellColData(full_proj)
archr_meta <- as.data.frame(archr_meta)

In [13]:
archr_barcodes <- data.frame(
    archr_name = rownames(archr_meta),
    barcodes = archr_meta$barcodes
)

In [14]:
cell_type_labels <- cell_type_labels %>%
  left_join(archr_barcodes)

[1m[22mJoining with `by = join_by(barcodes)`


In [15]:
sum(is.na(cell_type_labels$archr_name))
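
Because `left_join()` keeps every row of the label table, any labeled cell without a matching ArchR barcode gets `NA` in `archr_name`, which is what the count above detects. The sketch below reproduces this on toy data and shows one way to drop unmatched rows before cell names are passed on. The toy barcodes, the `sampleA#` cell-name prefix, and the decision to filter are illustrative assumptions rather than steps taken in this notebook. Passing `by = "barcodes"` explicitly also suppresses the "Joining with" message.

```r
library(dplyr)

demo_labels <- data.frame(
    barcodes = c("bc1", "bc2", "bc3"),
    aifi_cell_type = c("t_cd4_naive", "t_cd8_naive", "t_cd8_memory")
)

# "bc3" has a label but is absent from the toy ATAC metadata
demo_archr <- data.frame(
    archr_name = c("sampleA#bc1", "sampleA#bc2"),
    barcodes = c("bc1", "bc2")
)

# Every label row is kept; unmatched rows receive NA in archr_name
demo_joined <- left_join(demo_labels, demo_archr, by = "barcodes")

# Number of labeled cells with no ATAC counterpart
n_unmatched <- sum(is.na(demo_joined$archr_name))

# Keep only rows that can be mapped to an ArchR cell name
demo_matched <- filter(demo_joined, !is.na(archr_name))
```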

In [16]:
full_proj <- addCellColData(
    full_proj,
    data = cell_type_labels$aifi_cell_type,
    name = "aifi_cell_type",
    cells = cell_type_labels$archr_name
)

## Split ArchR project

In [17]:
dir.create("output")

In [18]:
archr_meta <- getCellColData(full_proj)
archr_meta <- as.data.frame(archr_meta)
cell_type_meta_list <- split(archr_meta, archr_meta$aifi_cell_type)

In [19]:
names(cell_type_meta_list)

In [20]:
cell_types <- names(cell_type_meta_list)
subset_paths <- paste0("output/vrdtea_ArchR-", cell_types, "_", Sys.Date())
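
`split()` groups the metadata rows by `aifi_cell_type`, and rows whose label is `NA` are not placed in any group, so cells without a label are absent from every subset. The toy example below illustrates this behavior and the resulting path names. The metadata values and cell names are invented, and a fixed date stands in for `Sys.Date()` to keep the example reproducible.

```r
# Toy cell metadata; row names stand in for ArchR cell names
demo_meta <- data.frame(
    nFrags = c(5200, 4100, 3900, 6100),
    aifi_cell_type = c("t_cd4_naive", "t_cd8_naive", NA, "t_cd4_naive"),
    row.names = c("s1#bc1", "s1#bc2", "s2#bc3", "s2#bc4")
)

# One data frame per cell type; the NA-labeled cell joins no group
demo_meta_list <- split(demo_meta, demo_meta$aifi_cell_type)

demo_cell_types <- names(demo_meta_list)
demo_counts <- vapply(demo_meta_list, nrow, integer(1))

# Row names survive the split, so they can be passed on as cell names
demo_cd4_cells <- rownames(demo_meta_list[["t_cd4_naive"]])

# A fixed date keeps the example reproducible; the notebook uses Sys.Date()
demo_paths <- paste0(
    "output/vrdtea_ArchR-", demo_cell_types, "_", as.Date("2023-10-02")
)
```

Comparing `sum(demo_counts)` with `nrow(demo_meta)` gives a quick count of how many cells were left out for lack of a label.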

In [21]:
walk2(
    subset_paths,
    cell_type_meta_list,
    function(subset_path, cell_type_meta) {

        sub_proj <- subsetArchRProject(
            full_proj,
            cells = rownames(cell_type_meta),
            outputDirectory = subset_path
        )
        
    }
)

Copying ArchRProject to new outputDirectory : /home/jupyter/repro-vrd-tea-seq/04-atac-analysis/output/vrdtea_ArchR-t_cd4_cm_2023-10-02

Copying Arrow Files...

Getting ImputeWeights

No imputeWeights found, returning NULL

Copying Other Files...

Saving ArchRProject...

Loading ArchRProject...

Successfully loaded ArchRProject!



### Bundle projects for storage

To store and retrieve these ArchR projects, we'll bundle these cell type projects using `tar` for later use.

In [22]:
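walk(
    subset_paths,
    function(proj_path) {
        # Name each archive after its project directory
        proj_tar <- paste0(proj_path, ".tar")

        # Quote paths so the shell treats each one as a single argument
        command <- paste(
            "tar -cf",
            shQuote(proj_tar),
            shQuote(proj_path)
        )

        system(command)
    }
)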
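
As a possible alternative to shelling out, base R's `utils::tar()` can bundle a directory, and `utils::untar(list = TRUE)` can confirm what was archived. The sketch below runs on a toy directory in a temporary folder; the directory and file names are placeholders. This is not what the notebook does, and R's documentation notes portability limits for its internal tar implementation, so the system `tar` may remain preferable for large projects.

```r
# Build a toy "project" directory inside a temporary folder
demo_root <- file.path(tempdir(), "demo_bundle")
dir.create(
    file.path(demo_root, "demo_proj"),
    recursive = TRUE,
    showWarnings = FALSE
)
writeLines("placeholder", file.path(demo_root, "demo_proj", "placeholder.txt"))

# Work from the parent directory so the archive stores relative paths
old_wd <- setwd(demo_root)

# tar = "internal" selects R's built-in implementation rather than a system tar
utils::tar(
    "demo_proj.tar",
    files = "demo_proj",
    compression = "none",
    tar = "internal"
)

# List the archive contents to confirm the project files were captured
demo_listing <- utils::untar("demo_proj.tar", list = TRUE, tar = "internal")

# Restore the original working directory
setwd(old_wd)
```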

## Store results in HISE

Finally, we store the output files in our Collaboration Space for later retrieval and use. We need to provide the UUID for our Collaboration Space (aka `studySpaceId`), as well as a title for this step in our analysis process.

The hise function `uploadFiles()` also requires the FileIDs from the original filesets for reference.

In [23]:
study_space_uuid <- "40df6403-29f0-4b45-ab7d-f46d420c422e"
title <- paste("VRd TEA-seq T Cell Type ArchR", Sys.Date())

In [24]:
out_files <- list.files(
    "output",
    pattern = "\\.tar$",
    full.names = TRUE
)
out_list <- as.list(out_files)

In [25]:
out_list

In [26]:
file_uuids <- c(
    atac_file_uuids,
    label_uuids
)
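
Calling `c()` on two lists concatenates their elements into one flat list, which is the shape passed as `inputFileIds`. The sketch below shows this with a few of the UUIDs from above and adds an optional format check before upload. The `uuid_pattern` check is our own illustration and not something hise requires.

```r
demo_atac_uuids <- list(
    "052b769d-cbdf-41f6-8fe8-0d34564b442a",
    "16e0c562-5d36-431f-bb27-b443aabc7077"
)
demo_label_uuids <- list(
    "ebd4bee7-2f5d-46e1-b2fc-22157f1b8d04"
)

# c() on two lists yields a single flat list, not a nested one
demo_file_uuids <- c(demo_atac_uuids, demo_label_uuids)

# Hypothetical check: lower-case hexadecimal groups of 8-4-4-4-12 characters
uuid_pattern <- "^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
looks_like_uuid <- grepl(uuid_pattern, unlist(demo_file_uuids))

# Stop if any ID is malformed or listed twice
stopifnot(all(looks_like_uuid), !anyDuplicated(unlist(demo_file_uuids)))
```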

In [27]:
uploadFiles(
    files = out_list,
    studySpaceId = study_space_uuid,
    title = title,
    inputFileIds = file_uuids,
    store = "project",
    doPrompt = FALSE
)

In [28]:
sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/lib/libopenblasp-r0.3.23.so;  LAPACK version 3.11.0

Random number generation:
 RNG:     L'Ecuyer-CMRG 
 Normal:  Inversion 
 Sample:  Rejection 
 
locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
 [1] parallel  stats4    grid      stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.68.0                  
 [3] rtracklayer_1.60.1                Biostrings_2.68.1                
 [5] XVector_0.40.0                 