<h1><b>Performance of expression/data vs expression/streamed-data</b></h1>
    
1. Dataset: <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE134809">GSE134809</a>
2. Paper: Martin et al. <a href="https://pubmed.ncbi.nlm.nih.gov/31474370/">"Single-Cell Analysis of Crohn’s Disease Lesions Identifies a Pathogenic Cellular Module Associated with Resistance to Anti-TNF Therapy"</a>, 2019.
3. Size: 82,417 lamina propria cells from 22 paired inflamed and uninflamed ileum tissues of 11 iCD patients

[Note]: The "omics/expression/streamed-data" endpoint exports only the source ID and expression values. Relevant expression metadata (and sample metadata, if needed) must therefore be attached afterwards using "source.ID", which combines the sample-level ID and the cell-level barcode.
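
As a minimal sketch of how such a key might be taken apart, the snippet below splits a source ID into its sample-level part and its cell-level part. It assumes the `<sample>/<sample>-<barcode>` layout seen in the streamed output of this notebook (for example `128/128-AAAGTAGGTTCGAATC-1`). `split_source_id` is a hypothetical helper and not part of the Genestack client library.

```r
# Hypothetical helper: split "sample/cell" source IDs at the first "/".
# Assumes the part before the first "/" is the sample-level ID.
split_source_id <- function(source_id) {
  data.frame(
    source_id = source_id,
    sample_id = sub("/.*$", "", source_id),    # text before the first "/"
    cell_id   = sub("^[^/]*/", "", source_id), # text after the first "/"
    stringsAsFactors = FALSE
  )
}

ids <- c("128/128-AAAGTAGGTTCGAATC-1", "128/128-AACTCCCAGAGAGCTC-1")
id_parts <- split_source_id(ids)
print(id_parts)
```

The resulting `sample_id` column could then serve as a join key against sample metadata, if the sample-level IDs in that metadata use the same values.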
    

<h2><b>Set instance and token</b></h2>

In [1]:
suppressMessages(library(tidyverse))
suppressMessages(library(data.table))
suppressMessages(library(httr))
suppressMessages(library(jsonlite))
suppressMessages(library(integrationCurator)) # Genestack client library

# Enter your token
PRED_SPOT_TOKEN = '<your token>' 

# Change the following settings according to your instance
PRED_SPOT_HOST = 'inc-stage.genestack.com'
PRED_SPOT_VERSION = 'v0.1'
BASE_URL = 'frontend/rs/genestack' 
API_VERSION = 'v0.1'
page_limit = 2000

Sys.setenv(PRED_SPOT_HOST=PRED_SPOT_HOST,
           PRED_SPOT_TOKEN=PRED_SPOT_TOKEN,
           PRED_SPOT_VERSION=PRED_SPOT_VERSION)

<h2><b>Set parameters</b></h2>

In [2]:
study_id = 'GSF456605'
study_filter = sprintf('"%s"="%s"', 'genestack:accession', study_id)

tissue_filter = sprintf('"%s"="%s"', 'status.ch1', 'Involved')
patient_filter_pos = sprintf('"%s"="%s"', 'Patient Status', 'GIMATS+')
patient_filter_neg = sprintf('"%s"="%s"', 'Patient Status', 'GIMATS-')
sample_filter = sprintf('%s AND (%s OR %s)', tissue_filter, patient_filter_pos, patient_filter_neg)

gene = 'TNF'
expression_filter = sprintf('Gene=%s MinValue=0', gene)
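
The filter strings above are plain text assembled with `sprintf`. The sketch below shows one way the same sample filter might be composed from small helpers, which can be convenient when the list of accepted values grows. `make_filter` and `any_of_values` are hypothetical names introduced here for illustration only.

```r
# Hypothetical helper: format one metadata condition as "key"="value"
make_filter <- function(key, value) sprintf('"%s"="%s"', key, value)

# Hypothetical helper: accept any of several values for one key, joined with OR
any_of_values <- function(key, values) {
  paste0("(", paste(make_filter(key, values), collapse = " OR "), ")")
}

sample_filter_demo <- paste(
  make_filter("status.ch1", "Involved"),
  any_of_values("Patient Status", c("GIMATS+", "GIMATS-")),
  sep = " AND "
)

cat(sample_filter_demo, "\n")
```

This yields the same string as `sample_filter` in the cell above.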

<h2><b>Get expression data</b></h2>

<h3><b>Non-streaming data</b></h3>

In [3]:
start = Sys.time()

# Extract expression data from omics/expression/data
expressions = as_tibble(do.call(cbind, OmicsQueriesApi_search_expression_data(
    study_filter = study_filter,
    sample_filter = sample_filter,
    ex_query = expression_filter,
    page_limit = page_limit
)$content$data))

end = Sys.time()

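# Convert explicitly to seconds: a bare difftime chooses its own units (secs, mins, ...)
cat(sprintf('Time to get %s expression values: %s seconds\n\n', 
    nrow(expressions), round(as.numeric(difftime(end, start, units = 'secs')), digits = 1)))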

str(expressions)


Time to get 1791 expression values: 4.1 seconds

tibble [1,791 × 14] (S3: tbl_df/tbl/data.frame)
 $ itemId                        : chr [1:1791] "777059-TNF" "777065-TNF" "777086-TNF" "777124-TNF" ...
 $ metadata.Experimental Platform: chr [1:1791] "val2" "val2" "val2" "val2" ...
 $ metadata.Expression Source    : chr [1:1791] "val1" "val1" "val1" "val1" ...
 $ metadata.Arvados URL          : chr [1:1791] "https://arvados.inc-s.genestack.com/collections/41y7k-4zz18-tuxuo2ztf0ycn4y/GSE134809.mex" "https://arvados.inc-s.genestack.com/collections/41y7k-4zz18-tuxuo2ztf0ycn4y/GSE134809.mex" "https://arvados.inc-s.genestack.com/collections/41y7k-4zz18-tuxuo2ztf0ycn4y/GSE134809.mex" "https://arvados.inc-s.genestack.com/collections/41y7k-4zz18-tuxuo2ztf0ycn4y/GSE134809.mex" ...
 $ metadata.Genome Version       : chr [1:1791] "val5" "val5" "val5" "val5" ...
 $ metadata.Scale                : chr [1:1791] "val4" "val4" "val4" "val4" ...
 $ metadata.Normalization Method : chr [1:1791] "val3" "val
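
The start/end timing pattern used in these cells could be wrapped in a small helper so that every measurement is reported in the same unit. This is only a sketch: `time_seconds` is a hypothetical function, and the workload passed to it below is a stand-in for an API call.

```r
# Hypothetical helper: evaluate an expression and report elapsed wall-clock seconds
time_seconds <- function(expr) {
  start <- Sys.time()
  value <- expr  # the argument is lazy, so it is evaluated here, after start
  end <- Sys.time()
  list(
    value = value,
    seconds = as.numeric(difftime(end, start, units = "secs"))
  )
}

# Stand-in workload in place of a request to the API
timed <- time_seconds(sum(sqrt(seq_len(1e5))))
cat(sprintf("Elapsed: %.1f seconds\n", timed$seconds))
```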

<h3><b>Streaming data</b></h3>

In [4]:
# Extract expression data from omics/expression/streamed-data
start = Sys.time()

expression_group = as_tibble(ExpressionIntegrationApi_get_parents_by_study(id=study_id)$content)

# Get group accession
group_accession = expression_group$itemId
                     
# Get expression data: the endpoint returns a 2-row CSV (a header row of source IDs and one row of values for the requested gene)
streamed_expressions = httr::GET(sprintf('https://%s/%s/integrationCurator/%s/omics/expression/streamed-data', PRED_SPOT_HOST, BASE_URL, API_VERSION),
    add_headers(accept = "gzip", `Genestack-API-Token` = PRED_SPOT_TOKEN), 
    query = list(
        groupAccession = group_accession,
        sampleFilter = sample_filter,
        featureList = gene
    ))$content               
 
end = Sys.time()

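# Convert explicitly to seconds: a bare difftime chooses its own units (secs, mins, ...)
cat(sprintf('Time to extract expression values from the endpoint: %s seconds\n', round(as.numeric(difftime(end, start, units = 'secs')), digits = 1)))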
                         

Time to extract expression values from the endpoint: 1.9 seconds
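
The request above reads `$content` directly, so an HTTP error would only surface later as a parsing problem. A hedged sketch of a slightly more defensive variant follows. `streamed_data_url` and `fetch_streamed_csv` are hypothetical wrappers; the URL layout, headers and query parameters are copied from the cell above, and `httr::stop_for_status` and `httr::content` are standard httr functions.

```r
# Hypothetical helper: build the streamed-data endpoint URL from its parts
streamed_data_url <- function(host, base_url, api_version) {
  sprintf("https://%s/%s/integrationCurator/%s/omics/expression/streamed-data",
          host, base_url, api_version)
}

# Hypothetical helper: send the request and fail early on an HTTP error.
# Requires the httr package and a reachable instance when actually called.
fetch_streamed_csv <- function(url, token, query) {
  resp <- httr::GET(
    url,
    httr::add_headers(accept = "gzip", `Genestack-API-Token` = token),
    query = query
  )
  httr::stop_for_status(resp)      # raise an R error for 4xx/5xx responses
  httr::content(resp, as = "raw")  # raw bytes, same as resp$content
}

demo_url <- streamed_data_url("inc-stage.genestack.com", "frontend/rs/genestack", "v0.1")
cat(demo_url, "\n")
```

With a valid token, `fetch_streamed_csv(demo_url, PRED_SPOT_TOKEN, list(groupAccession = group_accession, sampleFilter = sample_filter, featureList = gene))` would be expected to return the same raw bytes as the original call.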


In [5]:
# Transform the expression/streamed-data output into a data frame comparable with the expression/data output
start = Sys.time()

# Get metadata for expression data
metadata = expression_group$metadata %>%
    rename_all(function(x) paste0("metadata.", x)) %>%
    mutate(groupId = group_accession)
               
# Pivot data from wide to long, and merge with metadata 
               
processed_streamed_expressions = as_tibble(fread(rawToChar(streamed_expressions), showProgress = FALSE)) %>%
    pivot_longer(!NAME, names_to = "source.ID", values_to = "expression") %>%
    filter(!is.na(expression)) %>%
    add_column(metadata) %>%
    rename(`metadata.Run Source ID`= source.ID, gene=NAME)

end = Sys.time()
                         
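# Convert explicitly to seconds: a bare difftime chooses its own units (secs, mins, ...)
cat(sprintf('Time to get %s expression values: %s seconds\n\n',
    nrow(processed_streamed_expressions), round(as.numeric(difftime(end, start, units = 'secs')), digits = 1)))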
                                          
str(processed_streamed_expressions)   

Time to get 1791 expression values: 0.6 seconds

tibble [1,791 × 10] (S3: tbl_df/tbl/data.frame)
 $ gene                          : chr [1:1791] "TNF" "TNF" "TNF" "TNF" ...
 $ metadata.Run Source ID        : chr [1:1791] "128/128-AAAGTAGGTTCGAATC-1" "128/128-AACTCCCAGAGAGCTC-1" "128/128-AACTCCCGTCTAGCCG-1" "128/128-AACTCCCTCGAACTGT-1" ...
 $ expression                    : num [1:1791] 4 1 9 3 1 12 2 8 1 1 ...
 $ metadata.Experimental Platform: chr [1:1791] "val2" "val2" "val2" "val2" ...
 $ metadata.Expression Source    : chr [1:1791] "val1" "val1" "val1" "val1" ...
 $ metadata.Arvados URL          : chr [1:1791] "https://arvados.inc-s.genestack.com/collections/41y7k-4zz18-tuxuo2ztf0ycn4y/GSE134809.mex" "https://arvados.inc-s.genestack.com/collections/41y7k-4zz18-tuxuo2ztf0ycn4y/GSE134809.mex" "https://arvados.inc-s.genestack.com/collections/41y7k-4zz18-tuxuo2ztf0ycn4y/GSE134809.mex" "https://arvados.inc-s.genestack.com/collections/41y7k-4zz18-tuxuo2ztf0ycn4y/GSE134809.mex" ...
 $ met
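
To make the wide-to-long step easier to follow, here is a toy version of the same reshaping in base R. The CSV text is invented for illustration (shortened barcodes, made-up values); it only mimics the assumed layout of the streamed output, with a `NAME` column followed by one column per source ID.

```r
# Toy stand-in for the streamed CSV: one header row, one row for the gene.
# The empty field represents a cell with no reported value.
csv_text <- "NAME,128/128-AAAG-1,128/128-AACT-1,129/129-AAAG-1\nTNF,4,,9\n"

wide <- read.csv(text = csv_text, check.names = FALSE, stringsAsFactors = FALSE)
value_cols <- setdiff(names(wide), "NAME")

# Stack the per-cell columns into one row per (gene, source ID) pair
long <- data.frame(
  gene       = rep(wide$NAME, times = length(value_cols)),
  source_id  = rep(value_cols, each = nrow(wide)),
  expression = as.numeric(unlist(wide[value_cols], use.names = FALSE)),
  stringsAsFactors = FALSE
)

# Drop cells without a value, mirroring filter(!is.na(expression))
long <- long[!is.na(long$expression), ]
print(long)
```

The notebook cell does the same thing with `fread` and `pivot_longer`, which are likely the better choice for the full dataset.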

<h2><b>Compare outputs from the two endpoints</b></h2>

In [6]:
suppressMessages(library(janitor))
compare_df_cols(expressions, processed_streamed_expressions)


column_name,expressions,processed_streamed_expressions
<chr>,<chr>,<chr>
expression,numeric,numeric
gene,character,character
groupId,character,character
itemId,character,
metadata.Arvados URL,character,character
metadata.Experimental Platform,character,character
metadata.Expression Source,character,character
metadata.Genome Version,character,character
metadata.Normalization Method,character,character
metadata.Run Source ID,character,character
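
Matching column types do not by themselves show that the two endpoints return the same values. A possible follow-up check, sketched here on toy data frames with invented barcodes and values, joins both outputs on `metadata.Run Source ID` and compares the expression values cell by cell.

```r
key <- "metadata.Run Source ID"

# Toy stand-ins for the two outputs; the row order differs on purpose
non_streamed <- data.frame(
  id = c("128/128-AAAG-1", "128/128-AACT-1", "129/129-AAAG-1"),
  expression = c(4, 1, 9),
  stringsAsFactors = FALSE
)
streamed <- data.frame(
  id = c("129/129-AAAG-1", "128/128-AAAG-1", "128/128-AACT-1"),
  expression = c(9, 4, 1),
  stringsAsFactors = FALSE
)
names(non_streamed)[1] <- key
names(streamed)[1] <- key

# Join on the source ID so that the comparison does not depend on row order
merged <- merge(non_streamed, streamed, by = key, suffixes = c(".data", ".streamed"))

same_cells  <- setequal(non_streamed[[key]], streamed[[key]])
same_values <- isTRUE(all.equal(merged$expression.data, merged$expression.streamed))

cat("Same set of cells:", same_cells, "\n")
cat("Same expression values:", same_values, "\n")
```

Applied to the real data, `non_streamed` and `streamed` would be replaced by `expressions` and `processed_streamed_expressions`.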
