# Retrieve RNA and ADT Metadata

To begin our analysis, we'll retrieve the .h5 files produced by our TEA-seq QC and demultiplexing pipeline, which contain RNA and ADT data along with per-cell metadata. We'll then extract the cell metadata for use in cell filtering and QC plots.

## Setup

Install BarMixer if not present. BarMixer is an R package that is part of the BarWare tools for barcoded scRNA-seq data, and has helper functions for easily reading cell metadata from our .h5 files.

BarMixer repository: https://github.com/AllenInstitute/BarMixer  
BarWare paper: [Swanson, et al., BMC Bioinformatics (2022)](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04620-2)

In [1]:
ip <- installed.packages()
if(!"BarMixer" %in% rownames(ip)) {
    devtools::install_github(
        "alleninstitute/BarMixer",
        upgrade = "never"
    )
}

## Load packages

hise: The Human Immune System Explorer R SDK package  
BarMixer: .h5 file handling  
purrr: Functional programming tools  


In [2]:
quiet_library <- function(...) { suppressPackageStartupMessages(library(...)) }
quiet_library(hise)
quiet_library(BarMixer)
quiet_library(purrr)

In [3]:
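# Return the local path of a HISE file, caching it first if needed
read_path_uuid <- function(uuid) {
    # This helper expects cached files under cache/<uuid>/
    uuid_path <- paste0("cache/", uuid)
    if(!dir.exists(uuid_path)) {
        cacheFiles(list(uuid))
    }
    # Take the first file found in the UUID directory
    list.files(uuid_path, full.names = TRUE)[1]
}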

In [4]:
# Cache a .csv file by UUID if needed and read it as a data.frame
read_csv_uuid <- function(uuid) {
    filename <- read_path_uuid(uuid)
    read.csv(filename)
}
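
The helpers above follow a "download only if not already cached" pattern. The sketch below reproduces that pattern without HISE access so it can be run anywhere. The names `fake_cache_files()` and `cached_path()`, the placeholder file name, and the all-zero UUID are hypothetical stand-ins introduced for illustration; they are not part of the hise package.

```r
# Hypothetical stand-in for hise::cacheFiles(): writes one placeholder file
# into <cache_root>/<uuid>/ so the pattern can run without HISE access.
fake_cache_files <- function(uuid, cache_root) {
    uuid_path <- file.path(cache_root, uuid)
    dir.create(uuid_path, recursive = TRUE)
    writeLines("placeholder", file.path(uuid_path, "example.h5"))
}

# Same logic as read_path_uuid(): fetch only when the UUID directory
# is missing, then return the first file found inside it.
cached_path <- function(uuid, cache_root, fetch = fake_cache_files) {
    uuid_path <- file.path(cache_root, uuid)
    if (!dir.exists(uuid_path)) {
        fetch(uuid, cache_root)
    }
    list.files(uuid_path, full.names = TRUE)[1]
}

cache_root <- file.path(tempdir(), "cache_demo")
demo_uuid <- "00000000-0000-0000-0000-000000000000"  # made-up UUID

first_call <- cached_path(demo_uuid, cache_root)   # directory missing: fetches
second_call <- cached_path(demo_uuid, cache_root)  # directory present: reuses it
```

One caveat of this design: if the UUID directory exists but is empty (for example, after an interrupted download), `list.files(...)[1]` returns `NA` rather than raising an error, so it can be worth checking the returned paths before reading them.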

## Get file metadata stored in HISE

In [5]:
meta_uuid <- "5e3115d4-9207-4020-8e3a-3792dd28ea6b"
sample_meta <- read_csv_uuid(meta_uuid)

## Retrieve files

Now, we'll use the HISE SDK package to retrieve the TEA-seq .h5 file outputs based on their file UUIDs. These will be placed in the `cache/` subdirectory by default.

In [6]:
h5_files <- map_chr(
    sample_meta$rna_file.id,
    read_path_uuid
)

## Assemble metadata

Here, we read cell metadata from each of the retrieved .h5 files using the BarMixer function `read_h5_cell_meta()`. purrr's `map_dfr()` iterates over the file paths and row-binds the per-file results into a single table with metadata for all cells.

In [7]:
all_metadata <- map_dfr(
    h5_files,
    read_h5_cell_meta
)
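
To make the row concatenation step concrete, here is a minimal base R sketch of the same idea. `mock_read_cell_meta()` is a hypothetical stand-in for `read_h5_cell_meta()` that returns made-up values, and the file paths are invented; no .h5 files are read.

```r
# Hypothetical reader standing in for BarMixer's read_h5_cell_meta():
# returns a small per-file metadata table with made-up values.
mock_read_cell_meta <- function(path) {
    data.frame(
        barcodes = paste0(basename(path), "_cell", 1:2),
        n_umis = c(100L, 200L),
        source_file = basename(path),
        stringsAsFactors = FALSE
    )
}

# Invented paths; they are only used as labels here
demo_files <- c("cache/uuid-a/well1.h5", "cache/uuid-b/well2.h5")

# Read each file into its own data.frame, then stack the rows
per_file <- lapply(demo_files, mock_read_cell_meta)
demo_metadata <- do.call(rbind, per_file)
```

The base R version requires every per-file table to have the same columns. `map_dfr()` combines tables with `dplyr::bind_rows()`, which instead fills columns missing from some files with `NA`.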

In [8]:
head(all_metadata)

| | barcodes | adt_qc_flag | adt_umis | batch_id | cell_name | chip_id | hto_barcode | hto_category | n_genes | n_mito_umis | ⋯ | n_umis | original_barcodes | pbmc_sample_id | pool_id | rna_cell_uuid | seurat_pbmc_type | seurat_pbmc_type_score | umap_1 | umap_2 | well_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<int>` | ⋯ | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| 1 | dc61da0a31b011ef80e742c13d66f8da | Good | 4196 | B065 | predacious_direful_terrier | B065-P1C1 | AGTAAGTTCAGCGTA | singlet | 1587 | 304 | ⋯ | 3572 | AAACAGCCAATTTGGT | PB00395-02 | B065-P1 | 73476db231bc11efbd03567616973a79 | CD8 effector | 0.4440226 | 0.8125938 | -10.7439552 | B065-P1C1W1 |
| 2 | dc687e5a31b011ef80e742c13d66f8da | Good | 4066 | B065 | oozy_copacetic_frogmouth | B065-P1C1 | AGTAAGTTCAGCGTA | singlet | 1437 | 454 | ⋯ | 2959 | AAACCGAAGGAGCAAC | PB00395-02 | B065-P1 | 734772d031bc11efbd03567616973a79 | CD8 Naive | 0.3389567 | 0.298588 | 6.8719075 | B065-P1C1W1 |
| 3 | dc6d9d6831b011ef80e742c13d66f8da | Good | 2489 | B065 | genius_atrophic_silkworm | B065-P1C1 | AGTAAGTTCAGCGTA | singlet | 1783 | 289 | ⋯ | 3605 | AAACGCGCAAACTAAG | PB00395-02 | B065-P1 | 73477bf431bc11efbd03567616973a79 | CD4 Memory | 0.8498093 | 0.1556549 | -0.8671478 | B065-P1C1W1 |
| 4 | dc6f966831b011ef80e742c13d66f8da | Good | 1684 | B065 | glamorous_immediate_blacklemur | B065-P1C1 | AGTAAGTTCAGCGTA | singlet | 1093 | 127 | ⋯ | 2004 | AAACGCGCATTGTCCT | PB00395-02 | B065-P1 | 73477f0a31bc11efbd03567616973a79 | CD4 Memory | 0.5115535 | -0.9462299 | 4.3294579 | B065-P1C1W1 |
| 5 | dc79690431b011ef80e742c13d66f8da | Good | 3756 | B065 | juiced_amiable_waterdogs | B065-P1C1 | AGTAAGTTCAGCGTA | singlet | 1220 | 243 | ⋯ | 2330 | AAAGCCCGTTTGCAGA | PB00395-02 | B065-P1 | 73478cde31bc11efbd03567616973a79 | CD8 effector | 0.517388 | 0.7238464 | -6.1364371 | B065-P1C1W1 |
| 6 | dc7cd92231b011ef80e742c13d66f8da | Good | 3893 | B065 | goldleaf_flavorous_fly | B065-P1C1 | AGTAAGTTCAGCGTA | singlet | 1520 | 388 | ⋯ | 3312 | AAAGCTTGTCACCAAA | PB00395-02 | B065-P1 | 7347932831bc11efbd03567616973a79 | CD8 Naive | 0.377219 | -0.3871612 | 8.022188 | B065-P1C1W1 |
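
As a rough illustration of how these columns might feed into cell filtering, the sketch below derives a per-cell mitochondrial fraction from the first three rows of the `head()` output above. It assumes that `n_mito_umis` counts the mitochondrial UMIs included in `n_umis`, which the column names suggest but this notebook does not state. The 0.10 cutoff is purely illustrative and is not a threshold used in this analysis.

```r
# Values copied from the first three rows of the head() output above
demo_meta <- data.frame(
    barcodes = c("dc61da0a31b011ef80e742c13d66f8da",
                 "dc687e5a31b011ef80e742c13d66f8da",
                 "dc6d9d6831b011ef80e742c13d66f8da"),
    n_umis = c(3572L, 2959L, 3605L),
    n_mito_umis = c(304L, 454L, 289L),
    n_genes = c(1587L, 1437L, 1783L)
)

# Fraction of UMIs that are mitochondrial, per cell
# (assumes n_mito_umis is a subset of n_umis)
demo_meta$frac_mito <- demo_meta$n_mito_umis / demo_meta$n_umis

# Illustrative cutoff only; not a threshold taken from this analysis
max_frac_mito <- 0.10
demo_meta$pass_mito <- demo_meta$frac_mito < max_frac_mito
```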


## Write output file

Write the metadata as a .csv for later use. We remove `row.names` and set `quote = FALSE` to simplify the outputs and increase compatibility with other tools.

In [9]:
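# Create the output directory only if it doesn't already exist,
# so re-running this cell does not raise a warning
if(!dir.exists("output")) {
    dir.create("output")
}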


In [10]:
write.csv(
    all_metadata,
    "output/rna_adt_cell_metadata.csv",
    row.names = FALSE,
    quote = FALSE
)
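
Because `quote = FALSE` writes values without surrounding quotes, any value that itself contains a comma would shift columns when the file is read back. A quick check along the following lines can confirm that unquoted output is safe. The table here is made up, and one label deliberately contains a comma to show what the check catches.

```r
# Made-up table; the second cell type label contains a comma on purpose
demo_meta <- data.frame(
    cell_name = c("cell_a", "cell_b"),
    cell_type = c("CD8 Naive", "CD4 Memory, activated"),
    n_umis = c(100L, 200L),
    stringsAsFactors = FALSE
)

# Flag character columns where any value contains the field separator
has_comma <- vapply(
    demo_meta,
    function(x) is.character(x) && any(grepl(",", x, fixed = TRUE)),
    logical(1)
)
cols_with_commas <- names(has_comma)[has_comma]
```

If `cols_with_commas` is empty, writing with `quote = FALSE` will not introduce ambiguous fields; otherwise, either clean the flagged columns or leave quoting enabled.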

## Store results in HISE

Finally, we store the output file in our Collaboration Space for later retrieval and use. We need to provide the UUID for our Collaboration Space (aka `studySpaceId`), as well as a title for this step in our analysis process.

The hise function `uploadFiles()` also requires the FileIDs of the original input files for reference. We assemble these below as `in_list`, using the `rna_file.id` column of `sample_meta` that we used to retrieve the files.

In [11]:
study_space_uuid <- "00a53fa5-18da-4333-84cb-3cc0b0761201"
title <- "TEA-seq demo unfiltered TE cell metadata"

In [12]:
search_id <- ids::adjective_animal()
search_id

In [13]:
in_list <- as.list(sample_meta$rna_file.id)

In [14]:
in_list

In [15]:
out_list <- list("output/rna_adt_cell_metadata.csv")

In [16]:
out_list

In [17]:
uploadFiles(
    files = out_list,
    studySpaceId = study_space_uuid,
    title = title,
    inputFileIds = in_list,
    destination = search_id
)

[1] "Cannot determine the current notebook."
[1] "1) /home/jupyter/certpro-workflow-demos/adult_vs_pediatric_teaseq/01-R_get_h5_metadata.ipynb"
[1] "2) /home/jupyter/certpro-workflow-demos/adult_vs_pediatric_teaseq/00-R_select_samples.ipynb"
[1] "3) /home/jupyter/examples/Visualization_apps/dash/save_visualization_app_example.ipynb"


Please select (1-3)  1


You are trying to upload the following files:  output/rna_adt_cell_metadata.csv



(y/n) y


In [18]:
sessionInfo()

R version 4.3.2 (2023-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/lib/libopenblasp-r0.3.25.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] purrr_1.0.2       BarMixer_1.0.1    rhdf5_2.46.1      Matrix_1.6-4     
[5] data.table_1.15.4 hise_2.16.0      

loaded via a namespace (and not attached):
 [1] jsonlite_1.8.8      dplyr_1.1.4         compiler_4.3.2     
 [4] crayon_1.5.2        tidyselect_1.2.0    IRdisplay_1.1      
 [7] stringr_1.5.1     