# Retrieve RNA and ADT Metadata

To begin our analysis, we'll retrieve the .h5 files produced by our TEA-seq QC and demultiplexing pipeline, which contain RNA and ADT data along with cell metadata. We'll then extract the cell metadata for use in cell filtering and QC plots.

## Setup

Install BarMixer if not present. BarMixer is an R package that is part of the BarWare tools for barcoded scRNA-seq data, and has helper functions for easily reading cell metadata from our .h5 files.

BarMixer repository: https://github.com/AllenInstitute/BarMixer  
BarWare paper: [Swanson, et al., BMC Bioinformatics (2022)](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04620-2)

In [1]:
ip <- installed.packages()
if(!"BarMixer" %in% rownames(ip)) {
    devtools::install_github(
        "AllenInstitute/BarMixer",
        upgrade = "never"
    )
}
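
As an alternative to scanning `installed.packages()`, which can be slow on large or network-mounted libraries, the same check can be sketched with base R's `requireNamespace()`. The helper name `needs_install` is hypothetical and is not part of BarMixer or devtools.

```r
# Hypothetical helper: TRUE when a package cannot be loaded from the library.
needs_install <- function(pkg) {
    !requireNamespace(pkg, quietly = TRUE)
}

# Base packages such as stats are always available, so this returns FALSE.
stats_missing <- needs_install("stats")
stats_missing
```

The install call above could then be guarded with `if (needs_install("BarMixer")) { ... }` instead of the `installed.packages()` lookup.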

## Load packages

hise: The Human Immune System Explorer R SDK package  
BarMixer: .h5 file handling  
purrr: Functional programming tools  


In [2]:
quiet_library <- function(...) { suppressPackageStartupMessages(library(...)) }
quiet_library(hise)
quiet_library(BarMixer)
quiet_library(purrr)

## Retrieve files

Now, we'll use the HISE SDK package to retrieve the TEA-seq .h5 file outputs based on their file UUIDs. These will be placed in the `cache/` subdirectory by default.

In [3]:
sample_meta <- read.csv("sample_meta.csv")
project_store <- "PedvsSenior"

In [4]:
file_res <- map(
    sample_meta$h5_file,
    function(file_name) {
        downloadFileFromProjectStore(
            storeName = project_store,
            file_name
        )
    }
)
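
Before reading the files, it can help to confirm that every expected path exists on disk. The sketch below uses only base R. `find_missing_files` is a hypothetical helper, and it assumes that the values in `sample_meta$h5_file` are paths relative to the working directory, which this notebook does not state explicitly.

```r
# Hypothetical helper: return the subset of paths that do not exist on disk.
find_missing_files <- function(paths) {
    paths[!file.exists(paths)]
}

# Toy demonstration: one real temporary file and one path that does not exist.
present <- tempfile(fileext = ".h5")
file.create(present)
absent <- file.path(tempdir(), "no_such_file.h5")

not_found <- find_missing_files(c(present, absent))
not_found
```

In this notebook, the equivalent call would be `find_missing_files(sample_meta$h5_file)`; an empty result means all files are in place.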

## Assemble metadata

Here, we take the file paths from `sample_meta$h5_file` and read cell metadata from each using the BarMixer function `read_h5_cell_meta()`. purrr's `map_dfr()` handles iteration over the files and assembles a single table with metadata for all cells by row concatenation.

In [5]:
h5_files <- sample_meta$h5_file

In [6]:
all_metadata <- map_dfr(
    h5_files,
    read_h5_cell_meta
)
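
To illustrate what row concatenation does here, the sketch below applies the same pattern to toy data using base R only. `read_toy_meta` is a hypothetical stand-in for `read_h5_cell_meta()`, and its columns and sample IDs are invented for the example. Note that in purrr 1.0.0 and later, `map_dfr()` is superseded in favor of `map()` followed by `list_rbind()`.

```r
# Hypothetical stand-in for read_h5_cell_meta(): one small table per sample.
read_toy_meta <- function(sample_id) {
    data.frame(
        barcodes = paste0(sample_id, "_cell", 1:2),
        pbmc_sample_id = sample_id,
        stringsAsFactors = FALSE
    )
}

sample_ids <- c("S1", "S2", "S3")

# Read each sample separately, then stack the tables by row.
per_sample <- lapply(sample_ids, read_toy_meta)
toy_metadata <- do.call(rbind, per_sample)

nrow(toy_metadata)
```

Unlike `map_dfr()`, which fills columns missing from some tables with `NA`, base `rbind()` requires every table to have the same column names, so this version fails loudly if one file has a different set of metadata columns.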

In [7]:
head(all_metadata)

| | barcodes | adt_qc_flag | adt_umis | batch_id | cell_name | chip_id | hto_barcode | hto_category | n_genes | n_mito_umis | ⋯ | n_umis | original_barcodes | pbmc_sample_id | pool_id | rna_cell_uuid | seurat_pbmc_type | seurat_pbmc_type_score | umap_1 | umap_2 | well_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<int>` | ⋯ | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| 1 | 970c3f98e40811eba89d42010a19c839 | Good | 2184 | B065 | equatorial_wornout_skimmer | B065-P1C1 | CTCCTCTGCAATTAC | singlet | 1227 | 306 | ⋯ | 2338 | AAACCGCGTTTGGGCG | PB00173-02 | B065-P1 | 6041c6b0e40b11ebbdd742010a19c839 | CD4 Memory | 0.5934867 | 1.347388 | 4.8437943 | B065-P1C1W1 |
| 2 | 970eb142e40811eba89d42010a19c839 | Good | 1922 | B065 | manly_hillocked_cat | B065-P1C1 | CTCCTCTGCAATTAC | singlet | 1717 | 257 | ⋯ | 3656 | AAACGGATCGCTAGCA | PB00173-02 | B065-P1 | 6041cb56e40b11ebbdd742010a19c839 | CD4 Memory | 0.9915653 | -2.247861 | -5.5957509 | B065-P1C1W1 |
| 3 | 970f7834e40811eba89d42010a19c839 | Good | 3646 | B065 | berkelium_botanic_pangolin | B065-P1C1 | CTCCTCTGCAATTAC | singlet | 1934 | 180 | ⋯ | 3971 | AAACGTACAGCAATAA | PB00173-02 | B065-P1 | 6041cc14e40b11ebbdd742010a19c839 | CD8 effector | 0.5723163 | -10.456034 | -0.5250534 | B065-P1C1W1 |
| 4 | 9713353ce40811eba89d42010a19c839 | Good | 3182 | B065 | clearcut_barbarous_tilefish | B065-P1C1 | CTCCTCTGCAATTAC | singlet | 1975 | 274 | ⋯ | 4622 | AAAGCCGCAATATACC | PB00173-02 | B065-P1 | 6041d042e40b11ebbdd742010a19c839 | CD4 Memory | 0.4698943 | -8.026187 | -1.8615069 | B065-P1C1W1 |
| 5 | 971a68e8e40811eba89d42010a19c839 | Good | 4279 | B065 | daring_pricey_barb | B065-P1C1 | CTCCTCTGCAATTAC | singlet | 1798 | 271 | ⋯ | 3721 | AAATCCGGTTAGCATG | PB00173-02 | B065-P1 | 6041d89ee40b11ebbdd742010a19c839 | CD8 effector | 0.5886705 | -8.415179 | -1.6606372 | B065-P1C1W1 |
| 6 | 971b0bd6e40811eba89d42010a19c839 | Good | 3075 | B065 | artycrafty_graceful_robberfly | B065-P1C1 | CTCCTCTGCAATTAC | singlet | 1954 | 337 | ⋯ | 4382 | AAATGCCTCCCTCAAC | PB00173-02 | B065-P1 | 6041d948e40b11ebbdd742010a19c839 | CD4 Memory | 0.6832154 | -1.591888 | 6.168363 | B065-P1C1W1 |


## Write output file

Write the metadata as a .csv for later use. We set `row.names = FALSE` and `quote = FALSE` to simplify the output and increase compatibility with other tools.

In [8]:
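# Create the output directory only if it is missing, which avoids the
# "'output' already exists" warning on repeated runs.
if (!dir.exists("output")) {
    dir.create("output")
}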


In [9]:
write.csv(
    all_metadata,
    "output/te_rna_adt_cell_metadata.csv",
    row.names = FALSE,
    quote = FALSE
)
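
Because `quote = FALSE` leaves fields unquoted, any value containing a comma would shift columns when the file is read back. A minimal pre-write check could look like the sketch below; `columns_with_commas` is a hypothetical helper, and the toy table is invented for the demonstration. It inspects character columns only, so factor columns would need converting first.

```r
# Hypothetical helper: names of character columns that contain a comma.
columns_with_commas <- function(df) {
    is_text <- vapply(df, is.character, logical(1))
    has_comma <- vapply(
        df[is_text],
        function(x) any(grepl(",", x, fixed = TRUE)),
        logical(1)
    )
    names(df)[is_text][has_comma]
}

# Toy table with one safe column and one column containing a comma.
toy <- data.frame(
    cell_name = c("daring_pricey_barb", "manly_hillocked_cat"),
    note = c("ok", "a,b"),
    stringsAsFactors = FALSE
)

flagged <- columns_with_commas(toy)
flagged
```

If `columns_with_commas(all_metadata)` returns any column names, writing with the default `quote = TRUE` is the safer choice.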

## Store results in HISE

Finally, we store the output file in our Collaboration Space for later retrieval and use. We need to provide the UUID for our Collaboration Space (aka `studySpaceId`), as well as a title for this step in our analysis process.

The hise function `uploadFiles()` also requires the FileIDs from the original fileset for reference, which we assemble below from the `h5_uuid` column of `sample_meta` (`in_list`).

In [10]:
study_space_uuid <- "4743c203-6af9-469c-b71d-0f66e3518820"
title <- "TEA-seq unfiltered TE cell metadata"

In [11]:
search_id <- ids::adjective_animal()
search_id

In [12]:
in_list <- as.list(sample_meta$h5_uuid)

In [13]:
in_list

In [14]:
out_list <- list("output/te_rna_adt_cell_metadata.csv")

In [15]:
out_list
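
Before uploading, a quick sanity check on the two lists can catch missing UUIDs or output files early. This is a sketch under assumptions: `check_upload_inputs` is a hypothetical helper that is not part of the hise package, and the toy UUIDs and paths below are invented for the demonstration.

```r
# Hypothetical helper: return a character vector describing any problems
# found in the input UUID list and output file list (empty if none).
check_upload_inputs <- function(input_ids, output_files) {
    ids <- unlist(input_ids)
    files <- unlist(output_files)
    problems <- character(0)
    if (length(ids) == 0 || anyNA(ids) || any(!nzchar(ids))) {
        problems <- c(problems, "missing or empty input file UUIDs")
    }
    if (anyDuplicated(ids) > 0) {
        problems <- c(problems, "duplicated input file UUIDs")
    }
    if (length(files) == 0 || !all(file.exists(files))) {
        problems <- c(problems, "output files not found on disk")
    }
    problems
}

# Toy demonstration: a valid pair of lists, then a pair with two problems.
toy_output <- tempfile(fileext = ".csv")
writeLines("barcodes", toy_output)

ok <- check_upload_inputs(list("toy-uuid-1", "toy-uuid-2"), list(toy_output))
bad <- check_upload_inputs(list("toy-uuid-1", NA), list("no/such/file.csv"))
bad
```

In this notebook, the equivalent call would be `check_upload_inputs(in_list, out_list)`, with an empty result indicating that both lists look complete.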

In [16]:
sessionInfo()

R version 4.3.2 (2023-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/lib/libopenblasp-r0.3.25.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] purrr_1.0.2       BarMixer_1.0.1    rhdf5_2.46.1      Matrix_1.6-4     
[5] data.table_1.15.4 hise_2.16.0      

loaded via a namespace (and not attached):
 [1] ids_1.0.1           crayon_1.5.2        vctrs_0.6.5        
 [4] httr_1.4.7          cli_3.6.2           rlang_1.1.3        
 [7] generics_0.1.3    