In [1]:
quiet_library <- function(...) { suppressPackageStartupMessages(library(...)) }

quiet_library(H5weaver)
quiet_library(hise)
quiet_library(dplyr)
quiet_library(furrr)
quiet_library(purrr)
quiet_library(Seurat)
quiet_library(SeuratObject)

## Prepare Azimuth PBMC reference

To label our cells with Seurat, we'll use the PBMC reference provided by the HuBMAP Azimuth project. This reference is derived from data in this publication from the Satija lab:

Hao, Y. and Hao, S. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e29 (2021)

The version of record for this reference dataset is provided in a Zenodo repository at this accession:  
https://zenodo.org/records/4546839

Additional information is available [on the Azimuth website](https://azimuth.hubmapconsortium.org/references/#Human%20-%20PBMC).

We'll download the reference from Zenodo for label transfer:

In [2]:
if(!dir.exists("reference")) {
    dir.create("reference")
}
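# Skip the download if the file is already present; mode = "wb" keeps the binary .Rds intact on Windows
if(!file.exists("reference/ref.Rds")) {
    download.file(
        "https://zenodo.org/records/4546839/files/ref.Rds?download=1",
        "reference/ref.Rds",
        mode = "wb"
    )
}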

In [3]:
reference <- readRDS("reference/ref.Rds")

Between the Level 2 (L2) and Level 3 (L3) labels provided by the Satija lab, we add an intermediate level that we call L2.5. It splits Tregs into Treg Naive and Treg Memory cells based on L3, and assigns a CD8 TEMRA label to cells with the L3 labels CD8 TEM_4 and CD8 TEM_5. All other cell types keep their L2 assignments.

In [4]:
l3 <- as.character(reference@meta.data$celltype.l3)
l2 <- as.character(reference@meta.data$celltype.l2)
l2.5 <- l2
l2.5[l3 == "Treg Naive"] <- "Treg Naive"
l2.5[l3 == "Treg Memory"] <- "Treg Memory"
l2.5[l3 %in% c("CD8 TEM_4", "CD8 TEM_5")] <- "CD8 TEMRA"

reference <- AddMetaData(reference, metadata = l2.5, col.name = "celltype.l2.5")
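
The relabeling logic can be checked on a small toy example before applying it to the full reference. The sketch below uses made-up label vectors (the `toy_` names are hypothetical and are not part of the reference object) and cross-tabulates L3 against the derived L2.5 labels:

```r
# Toy L2/L3 labels for six cells (hypothetical values for illustration only)
toy_l2 <- c("Treg", "Treg", "CD8 TEM", "CD8 TEM", "CD8 TEM", "B naive")
toy_l3 <- c("Treg Naive", "Treg Memory", "CD8 TEM_4", "CD8 TEM_1", "CD8 TEM_5", "B naive kappa")

# Start from L2, then override selected cells based on their L3 label
toy_l2.5 <- toy_l2
toy_l2.5[toy_l3 == "Treg Naive"] <- "Treg Naive"
toy_l2.5[toy_l3 == "Treg Memory"] <- "Treg Memory"
toy_l2.5[toy_l3 %in% c("CD8 TEM_4", "CD8 TEM_5")] <- "CD8 TEMRA"

# Cross-tabulate to see which L3 labels map to which L2.5 labels
print(table(l3 = toy_l3, l2.5 = toy_l2.5))
```

The same `table()` call on the real `l3` and `l2.5` vectors is a quick way to confirm that only the intended L3 labels were reassigned.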

## Retrieve sample metadata

In an earlier step, we assembled and stored sample metadata in HISE. We'll pull this file and use it to retrieve the files for our labeling process.

In [5]:
sample_meta_uuid <- "223b4aa9-19fc-41e1-8bea-43682e5ac278"

In [6]:
res <- cacheFiles(list(sample_meta_uuid))
sample_meta_file <- list.files(
    paste0("cache/", sample_meta_uuid), 
    pattern = "\\.csv$",
    full.names = TRUE
)

ERROR: Error in if (thisDesc$file$id != idsExpanded[[fidx]]) {: argument is of length zero


In [None]:
hise_meta <- read.csv(sample_meta_file)

## Divide data into chunks for parallel processing

For labeling, we'll process files in batches of up to 10. We'll label each batch together, then write out the results separately for each sample in the batch.

In [7]:
hise_meta <- hise_meta %>% 
  arrange(file.batchID)

In [None]:
# Assign rows to consecutive chunks of up to 10 files, however many rows there are
b <- ceiling(seq_len(nrow(hise_meta)) / 10)
df_chunk_list <- split(hise_meta, b)
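
To illustrate the chunking step, here is a minimal sketch on a made-up metadata table. The `toy_meta` data frame, its values, and the `chunk_size` variable are assumptions for illustration. The chunk index is computed from the row number, so the number of chunks does not need to be known in advance:

```r
# Hypothetical stand-in for hise_meta with 23 files across 3 batches
toy_meta <- data.frame(
    file.id = sprintf("file-%02d", 1:23),
    file.batchID = rep(c("B001", "B002", "B003"), length.out = 23)
)
toy_meta <- toy_meta[order(toy_meta$file.batchID), ]

# Rows 1-10 go to chunk 1, rows 11-20 to chunk 2, and so on
chunk_size <- 10
toy_b <- ceiling(seq_len(nrow(toy_meta)) / chunk_size)
toy_chunks <- split(toy_meta, toy_b)

# Number of files in each chunk
print(sapply(toy_chunks, nrow))
```

Because the rows are sorted by `file.batchID` before splitting, files from the same batch tend to land in the same chunk.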

In [9]:
length(df_chunk_list)

## Prepare output directory

We'll store results for each sample in `output/Hao_PBMC/`.

In [9]:
out_dir <- "output/Hao_PBMC/"
if(!dir.exists(out_dir)) {
    dir.create(out_dir, recursive = TRUE)
}

## Functions for parallel label transfer of chunks

For each chunk, we'll retrieve the data from HISE, perform label transfer and ADT imputation using Seurat, and then store the labels and imputed ADT matrices to allow us to assess cell type identity.

The main function to perform these steps is `label_chunk()`, provided below. Four helper functions assist with these steps for each sample:  
`read_so()`: reads the .h5 files stored in HISE as Seurat objects for analysis  
`write_labels()`: writes the labeling results to a .csv file for each sample  
`get_sample_adt()`: subsets the imputed ADT matrix to the cells of each sample  
`write_adt()`: writes the imputed ADT matrix values to a .h5 file for later use

In [None]:
read_so <- function(h5_uuid) {
    res <- cacheFiles(h5_uuid)
    h5_file <- list.files(
        paste0("cache/", h5_uuid), 
        pattern = "\\.h5$", 
        full.names = TRUE
    )
    
    counts <- read_h5_dgCMatrix(h5_file)
    rownames(counts) <- make.unique(rownames(counts))
    meta <- read_h5_cell_meta(h5_file)
    rownames(meta) <- meta$barcodes

    so <- CreateSeuratObject(
      counts,
      meta.data = meta,
      assay = "RNA")

    # Remove the cached copy to free disk space once the data is in memory
    unlink(file.path("cache", h5_uuid), recursive = TRUE)

    return(so)
}

write_labels <- function(sample_labels, sample_id) {
    out_file <- paste0("output/Hao_PBMC/", sample_id, "_Hao_PBMC.csv")
    write.csv(
        sample_labels,
        out_file,
        row.names = FALSE,
        quote = FALSE
    )
}

get_sample_adt <- function(sample_meta, adt_data) {
    adt_data[, sample_meta$barcodes, drop = FALSE]
}

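write_adt <- function(sample_adt, sample_id) {
    out_file <- paste0("output/Hao_PBMC/", sample_id, "_ADT.h5")
    # Coerce to a column-compressed sparse matrix so the @i, @p, and @x slots exist
    mat <- as(sample_adt, "CsparseMatrix")
    list_mat <- list(
        i = mat@i,
        p = mat@p,
        x = mat@x,
        Dim = dim(mat),
        rownames = rownames(mat),
        colnames = colnames(mat)
    )

    # h5createFile() and h5write() come from the rhdf5 package
    rhdf5::h5createFile(out_file)
    rhdf5::h5write(list_mat, out_file, "mat")
}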

label_chunk <- function(meta_data){
    
    so_list <- map(meta_data$file.id, read_so)
    combined <- Reduce(merge, so_list)
    rm(so_list)
    
    combined <- SCTransform(
        combined,
        method = "glmGamPoi", 
        verbose = FALSE
    )
    
    #find anchors
    anchors <- FindTransferAnchors(
      reference = reference,
      query = combined,
      normalization.method = "SCT",
      reference.reduction = "spca",
      dims = 1:50
    )  
        
    #perform projection to get labels
    combined <- MapQuery(
      anchorset = anchors,
      query = combined,
      reference = reference,
      refdata = list(
        celltype.l1 = "celltype.l1",
        celltype.l2 = "celltype.l2",
        celltype.l3 = "celltype.l3",
        celltype.l2.5 = "celltype.l2.5",
        predicted_ADT = "ADT"
      ),
      reference.reduction = "spca", 
      reduction.model = "wnn.umap"
    )
    
    sample_meta_list <- split(combined@meta.data, combined@meta.data$pbmc_sample_id)

    walk2(
        sample_meta_list, names(sample_meta_list),
        write_labels
    )
    
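    # Assumes MapQuery() stored the imputed ADT values in the "predicted_ADT" assay
    # (named after the refdata entry) and that they are held in its data slot
    sample_adt_list <- map(
        sample_meta_list, 
        get_sample_adt, 
        adt_data = combined[["predicted_ADT"]]@data
    )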

    walk2(
        sample_adt_list, names(sample_meta_list),
        write_adt
    )
    
    rm(combined)
}
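
The list layout that `write_adt()` stores (`i`, `p`, `x`, `Dim`, `rownames`, `colnames`) is enough to rebuild the sparse matrix later. The sketch below shows the round trip in memory using only the Matrix package; the toy matrix, its feature names, and its barcodes are hypothetical, and reading the list back from the .h5 file is left out:

```r
library(Matrix)

# Small sparse matrix standing in for one sample's imputed ADT values
# (feature and barcode names are hypothetical)
toy_mat <- sparseMatrix(
    i = c(1, 3, 2),
    j = c(1, 1, 2),
    x = c(0.5, 1.2, 2.0),
    dims = c(3, 2),
    dimnames = list(c("CD3", "CD4", "CD8"), c("cell_a", "cell_b"))
)

# Same list layout that write_adt() stores
toy_list <- list(
    i = toy_mat@i,
    p = toy_mat@p,
    x = toy_mat@x,
    Dim = dim(toy_mat),
    rownames = rownames(toy_mat),
    colnames = colnames(toy_mat)
)

# Rebuild the matrix; the @i and @p slots are zero-based, hence index1 = FALSE
rebuilt <- sparseMatrix(
    i = toy_list$i,
    p = toy_list$p,
    x = toy_list$x,
    dims = toy_list$Dim,
    dimnames = list(toy_list$rownames, toy_list$colnames),
    index1 = FALSE
)
```

When the list comes from an HDF5 file rather than from memory, the vectors may need `as.vector()` before being passed to `sparseMatrix()`, depending on how the reader returns them.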

## Apply label transfer to all chunks

We'll use `mclapply()` from the `parallel` package to perform our label transfer steps in parallel across chunks. 

Because `SCTransform()` is extremely memory-intensive, we're limited in the number of chunks that we can process simultaneously.

In [None]:
parallel::mclapply(
    df_chunk_list,
    label_chunk,
    mc.cores = 3
)
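
When chunks run in parallel, a failure in one chunk can be easy to miss. One possible way to keep track, sketched here under assumptions, is to wrap the worker so that errors are returned as condition objects and then list the chunks that failed. `toy_label_chunk()` and `safely_run()` are hypothetical helpers for illustration, not part of this notebook:

```r
# Hypothetical stand-in for label_chunk() that fails on one chunk
toy_label_chunk <- function(chunk_id) {
    if (chunk_id == 2) stop("simulated failure")
    chunk_id
}

# Wrap a worker so errors are returned as condition objects instead of being raised
safely_run <- function(f) {
    function(x) tryCatch(f(x), error = function(e) e)
}

toy_results <- parallel::mclapply(
    list(`1` = 1, `2` = 2, `3` = 3),
    safely_run(toy_label_chunk),
    mc.cores = 1  # raise on Unix-like systems; Windows supports only 1
)

# Names of the chunks whose worker returned an error
failed <- names(toy_results)[sapply(toy_results, inherits, what = "error")]
print(failed)
```

The failed chunk names can then be used to subset `df_chunk_list` and rerun only those chunks, for example with fewer cores if memory was the cause.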