# Single-cell RNA-Seq: Cohort Selection and Data Retrieval

This notebook demonstrates how to use the **Genestack ODM API** to access and explore single-cell RNA-seq data stored in an ODM instance. It explains how to configure the API connection, retrieve metadata and data for selected entities, and interpret the returned results in a reproducible, programmatic way.
The notebook is organized into three main parts:
* **Prerequisites** – loads the required Python libraries and helper functions. This section can be minimized when running the notebook end-to-end if all dependencies are already installed.
* **ODM API Configuration** – an interactive setup for establishing a secure connection to your ODM instance using an API token.
* **Working with Data** – examples of typical ODM API endpoints for metadata and data retrieval, with explanations of the API response structure and its relevance for downstream analysis. Most sections are split into data retrieval and visualization subsections.

## 1. Prerequisites

Before running the notebook, make sure your environment is ready. You will need Python 3.10+ and `pip`. Install all dependencies with:
```
pip install numpy pandas matplotlib seaborn scipy ipywidgets ipykernel requests scanpy anndata itables plotly nbformat

python -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple odm-sdk==1.0.0-2610
```
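
Before running the cells below, it can help to confirm that the interpreter version and the packages are in place. The following is a minimal sketch that uses only the standard library; the helper name `missing_packages` is hypothetical and is not part of the notebook or of the ODM SDK. Note that it checks import names (for example `odm_api`), not pip package names.

```python
import sys
from importlib.util import find_spec


def missing_packages(names):
    """Return the import names from `names` that cannot be located."""
    return [name for name in names if find_spec(name) is None]


# import names used by this notebook
required = ["numpy", "pandas", "matplotlib", "seaborn", "scipy", "ipywidgets",
            "scanpy", "anndata", "itables", "plotly", "odm_api"]

print("Python 3.10+:", sys.version_info >= (3, 10))
print("Missing packages:", missing_packages(required))
```

An empty list in the second line of output suggests that all imports in section 1.1 should resolve.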


### 1.1 Imports

In [None]:
# standard library (come with Python)
import os
import re
import json
import time
import warnings
from getpass import getpass

# third-party (need installation)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets as widgets
from IPython.display import display
from scipy.sparse import coo_matrix
import odm_api
from odm_api.models.cr_request import CRRequest
from odm_api.models.de_request_case_group import DERequestCaseGroup
from odm_api.models.de_request import DERequest
from odm_api.models.gs_request import GSRequest
import scanpy as sc
import anndata as ad
from itables import show
import plotly.graph_objects as go
import plotly.io as pio

# set default matplotlib style
plt.style.use('default')

# set warnings to ignore FutureWarning
warnings.filterwarnings('ignore', category=FutureWarning)

# allow interactive graphs to be displayed in notebook
pio.renderers.default = "notebook"


### 1.2 Functions

This section defines utility functions used across the notebook to streamline interaction with the ODM API and to support visualization of retrieved data. Collecting them here keeps the workflow sections concise and focused on analysis rather than implementation details.

In [None]:
def set_api_credentials(odm_url, api_prefix='/api/v1'):
    """
    Set ODM API credentials interactively, using getpass or fallback to widget-based UI.

    Attempts to use `getpass` to prompt for an API token (works in terminal environments).
    If that fails (e.g., in JupyterLab or web-based notebooks), falls back to a widget-based input form.

    Parameters
    ----------
    odm_url : str
        The ODM server URL, e.g. 'q001-sc-demo.trial.genestack.com/'
    api_prefix : str, optional
        The API endpoint prefix (default is '/api/v1').

    Sets global variables
    ------------
        odm_base_url : str
            The base ODM API URL with prefix.
        token : str
            The provided API authentication token.
    """
    # normalize the URL and ensure it carries the API prefix
    # (base_url is always defined, and a trailing slash does not produce '//')
    base_url = odm_url.rstrip('/')
    if not re.search(r'/api.+', base_url):
        base_url = base_url + api_prefix

    global odm_base_url, token
    try:
        # enter token via getpass (works in terminal-based environments)
        token = getpass("Auth Token: ")
        odm_base_url = base_url

    except (EOFError, OSError):
        # fallback to widget-based input (works in web environments)
        set_api_credentials_ui(base_url)


def set_api_credentials_ui(base_url):
    """
    Displays widgets for ODM API server and token selection.

    Args:
        base_url (str or None): Default server URL for ODM with API prefix (e.g. /api/v1).
            If None, the 'ODM_BASE_URL' environment variable is used, falling back to an empty string.
    """
    odm_base_url_widget = widgets.Text(
        value=base_url if base_url is not None else os.getenv('ODM_BASE_URL', ''),
        description='Base URL:',
        layout=widgets.Layout(width='600px')
    )
    token_widget = widgets.Password(
        value=os.getenv('ODM_API_TOKEN', ''),
        description='Auth Token:',
        layout=widgets.Layout(width='400px')
    )
    set_button = widgets.Button(description='Set Credentials', button_style='primary')
    status_html = widgets.HTML()

    def _set_credentials(_):
        global odm_base_url, token
        odm_base_url = odm_base_url_widget.value.strip()
        token = token_widget.value.strip()
        masked = ('***' if not token else (token[:4] + '…' + token[-4:] if len(token) >= 8 else '***'))
        status_html.value = f"<span style='color: green;'>Credentials set. Token: {masked}</span>"

    set_button.on_click(_set_credentials)
    display(widgets.VBox([odm_base_url_widget, token_widget, set_button, status_html]))


def clean_metadata(expression):
    """
    Transforms expression metadata from dictionaries to a clean DataFrame

    Parameters:
        expression : DataFrame
            Data frame containing gene expression.
    """
    # transforms extracted metadata from gene expression query
    df = expression.copy()
           
    # extract keys from itemOrigin
    for key in ['runSourceId', 'runId', 'groupId']:
        df[key] = df['itemOrigin'].apply(lambda x: x.get(key))
    
    # metadata
    df['Data Class'] = df['metadata'].apply(lambda x: x.get('Data Class'))

    # gene ids
    df['Features (string)'] = df['feature'].apply(lambda x: ', '.join(x.keys()))
    for key in ['ensembl_id', 'gene_id']:
        df[key] = df['feature'].apply(lambda x: x.get(key))
    
    # expression
    df['value'] = df['value'].apply(lambda x: x.get('value'))

    columns_keep = ['runSourceId', 'runId', 'groupId', 'Data Class', 'Features (string)',
                    'ensembl_id', 'gene_id', 'value']
    new_df = df[columns_keep]
    return new_df


def create_full_matrix(cells, expression):
    """
    Creates sparse matrix containing cell counts and gene expression values

    Parameters:
        cells: DataFrame
            Data frame containing cell counts and metadata.
        expression : DataFrame
            Data frame containing gene expression and metadata.
    """
    
    # filter expression data to only include cells in provided barcodes
    valid_cells = cells.index
    ex_df_filt = expression[expression['runSourceId'].isin(valid_cells)].copy()

    # get unique genes and cells
    genes_unique = ex_df_filt['gene_id'].unique()
    cells_unique = cells.index

    # create mapping gene / cell - matrix index
    gene_to_idx = {gene: i for i, gene in enumerate(genes_unique)}
    cell_to_idx = {cell: i for i, cell in enumerate(cells_unique)}

    # map expression data to matrix indices. Rows = genes, cols = cells
    row_idx = ex_df_filt['gene_id'].map(gene_to_idx).values
    col_idx = ex_df_filt['runSourceId'].map(cell_to_idx).values 
    data = ex_df_filt['value'].values

    # build sparse matrix (genes x cells)
    counts_matrix = coo_matrix(
        (data, (row_idx, col_idx)),
        shape=(len(genes_unique), len(cells_unique))
    )

    counts_matrix = counts_matrix.tocsr()
    return counts_matrix


def extract_stats(df):
    """
    Extracts cell ratio statistics and formats data frame

    Parameters:
        df: DataFrame
            Data frame containing cell statistics.

    Note: reads the global variable `cell_type`, which the calling loop
    must set before each call.
    """
    # transpose, add cell type
    df_t = df.set_index(df.columns[0]).T
    df_t["cellType"] = cell_type
    # round values
    df_t[['count_selected', 'count_available']] = df_t[
        ['count_selected', 'count_available']
    ].astype(int).round(0)
    df_t[['ratio']] = df_t[['ratio']].round(2)

    return df_t


def make_trace(df, name, color):
    """
    Generates go scatter objects for interactive visualization

    Parameters:
        df: DataFrame
            Data frame containing expression values.
        name: str
            Desired column attribute
        color: str
            Desired color
    """
    
    return go.Scatter(
        x=df["log2FC"],
        y=df["-log10(pvalue)"],
        mode="markers",
        name=name,
        marker=dict(size=6, color=color, opacity=0.7),
        text=df.apply(
            lambda row:
                f"Gene: {row['gene_id']}<br>log2FC: {row['log2FC']:.2f}<br>p-value: {row['p_value']:.3g}", 
            axis=1
        ),
        hoverinfo="text"
    )


def boxplot_gene_summary(
    df,
    fill_colors,
    plot_title="",
    plot_subtitle="",
    log1p=False
):
    """
    Generates boxplot from gene expression summary table
    """
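    # work on a copy so the caller's frame is not modified, then duplicate
    # the quantiles column so it can be exploded into one row per value
    df = df.copy()
    df['quantiles_list'] = df['quantiles']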

    gene_summary_expanded = df.explode('quantiles_list')
    gene_summary_expanded = gene_summary_expanded.rename(
        columns={'quantiles_list': 'expression'}
    )

    gene_summary_expanded = gene_summary_expanded[
        ['gene_id', 'cellType', 'expression']
    ]

    gene_summary_expanded['expression'] = pd.to_numeric(
        gene_summary_expanded['expression'], errors='coerce'
    )

    # optional log1p transform
    if log1p:
        gene_summary_expanded['expression'] = np.log1p(
            gene_summary_expanded['expression']
        )
        y_label = "log1p(Expression)"
    else:
        y_label = "Expression"

    # boxplot
    ax = sns.boxplot(
        data=gene_summary_expanded,
        x='gene_id',
        y='expression',
        hue='cellType',
        palette=fill_colors,
        dodge=True
    )

    ax.set_title(plot_title, fontsize=14, fontweight="bold", y=1.05)
    ax.set_xlabel("Gene")
    ax.set_ylabel(y_label)

    if plot_subtitle:
        ax.text(
            0.5, 1.02, plot_subtitle,
            ha="center",
            transform=ax.transAxes,
            fontsize=11)

    plt.xticks(rotation=45, ha="right", fontsize=10)
    plt.legend(title="Cell Type")
    sns.despine()
    plt.tight_layout()
    plt.show()

## 2. ODM API Configuration

**Configuring Access to Your ODM Instance**

Before querying data, establish a connection to your ODM deployment and authenticate using an API token.  
The ODM API uses token-based authentication, allowing secure programmatic access while preserving user-level permissions.

In this section:
* **Specify the ODM instance URL** – defines the environment you are connecting to.  
* **Provide the API token** – identifies and authorizes your user session.  
* **Initialize the ODM API client** – creates the communication layer for all subsequent requests.
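
The URL and token handling behind these steps can be sketched in isolation. This is a minimal, standalone illustration: `build_base_url` and `mask_token` are hypothetical helper names, `odm.example.org` is a placeholder host, and the logic only mirrors what the helper functions in section 1.2 appear to do (append the API prefix when it is missing, and show only the ends of a token).

```python
import re


def build_base_url(odm_url, api_prefix="/api/v1"):
    """Join server URL and API prefix without doubling slashes."""
    url = odm_url.rstrip("/")
    # append the prefix only when the URL does not already contain an /api... path
    if not re.search(r"/api.+", url):
        url += api_prefix
    return url


def mask_token(token):
    """Show only the first and last four characters of a token."""
    if len(token) < 8:
        return "***"
    return token[:4] + "…" + token[-4:]


print(build_base_url("https://odm.example.org/"))
print(mask_token("abcd1234efgh5678"))
```

Stripping the trailing slash first means the same result is produced whether or not the server address ends with `/`.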


In [None]:
# input ODM server address
server = 'https://q001-sc-demo.trial.genestack.com/'

# input API token
set_api_credentials(server)

In [None]:
# credentials sanity check
if len(token)==0:
    print("API token is empty! "
          "Set the token manually (e.g. via `token = 'your_token'`).")
else:
    print("Token successfully set!")

In [None]:
# initialize API client
configuration = odm_api.Configuration(
    host=server,
    api_key={'Genestack-API-Token': token}
)
api_client = odm_api.ApiClient(configuration)

# the odm-api Python documentation is available at
print(f"{server.rstrip('/')}/user-docs/tools/odm-api/python/generated/")

## 3. Working with Data

### 3.1 Exploring Sample Endpoints

In ODM, each entity—such as samples, datasets, or assays—can be accessed through dedicated API endpoints. Omics query endpoints provide access to quantitative datasets such as gene expression, variant, or flow cytometry data, along with their associated metadata. These endpoints extend integration capabilities, enabling direct retrieval and filtering of omics measurements linked to specific samples or studies.

In this step, the `OmicsQueriesAsUserApi` class is instantiated and its `*_as_user` methods are listed to illustrate the range of omics data types accessible through the ODM API.
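
Listing methods by suffix is plain Python introspection and does not depend on the ODM client. A minimal sketch with a stand-in class follows; `DemoApi` and its method names are made up for illustration and do not correspond to real ODM endpoints.

```python
class DemoApi:
    """Stand-in for a generated API class; the method names are invented."""

    def search_items_as_user(self): ...
    def get_item_as_user(self): ...
    def search_items_as_user_with_http_info(self): ...
    def _internal_helper(self): ...


def list_endpoints(api, suffix="_as_user"):
    """Return attribute names of `api` that end with `suffix`."""
    return [name for name in dir(api) if name.endswith(suffix)]


print(list_endpoints(DemoApi()))
```

Filtering on the suffix hides variants such as `..._with_http_info`, which is presumably why the notebook uses `endswith` rather than a substring match.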

In [None]:
# initialize API class
omics_api = odm_api.OmicsQueriesAsUserApi(api_client)

# list all available omics_api endpoints
for item in [item for item in dir(omics_api) if item.endswith("_as_user")]:
    print(item)

### 3.2 Samples

#### 3.2.1 Search

The `omics_search_samples_as_user` endpoint enables sample metadata search within the omics query interface. It allows combining study-level filters with sample attributes and linking downstream omics data types.

In this example, the search parameters are defined to retrieve left ventricle samples of the patients with hypertrophic cardiomyopathy from the "single-nuclei" study and filter them by age.
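
Filter strings like the one used here can also be assembled from small helpers instead of hand-written literals. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the quoting rule (quote field names that are not purely alphanumeric, always quote string values) is inferred only from the examples in this notebook, not from the ODM query language documentation. It does not escape quotes inside values.

```python
def quote(text):
    """Wrap a field name or value in double quotes."""
    return f'"{text}"'


def equals(field, value):
    """Build a field="value" clause; quote field names with spaces or punctuation."""
    name = field if field.isalnum() else quote(field)
    return f"{name}={quote(value)}"


def all_of(*clauses):
    """Combine clauses so that every one of them must match."""
    return " AND ".join(clauses)


sample_filter_sketch = all_of(
    equals("Organism", "Homo sapiens"),
    equals("Age Unit", "year"),
    "(Age > 30 AND Age < 50)",
)
print(sample_filter_sketch)
```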

In [None]:
# define search parameters
study_query = "single-nuclei"

sample_filter = " AND ".join([
    'Organism="Homo sapiens"',
    'Tissue="heart left ventricle"',
    'Disease="hypertrophic cardiomyopathy"',
    '"Age Unit"="year"',
    '(Age > 30 AND Age < 50)'
])

# search samples
start_time = time.time()
samples = omics_api.omics_search_samples_as_user(
    study_query=study_query,
    sample_filter=sample_filter,
    returned_metadata_fields='original_data_included'
)
end_time = time.time()
elapsed_time = round(end_time - start_time, 2)

print("Transcriptomics samples search log:")
print(json.dumps(samples.log, indent=2))
print("")
print(f"Elapsed time: {elapsed_time} seconds")

#### 3.2.2 Exploring Sample Metadata Summary

This section illustrates how metadata attributes can be summarized and explored visually. By converting the retrieved sample metadata list into a structured DataFrame, we can examine the distribution of key attributes such as disease state, sex and somatometric parameters.

The following summary statistics table shows how samples are distributed across the attribute categories, enabling quick inspection of cohort balance and potential biases. Visualizing attribute frequencies at this stage helps verify that metadata relationships are consistent before performing downstream omics queries or expression analyses.

In [None]:
# convert samples.data list of nested dicts to a DataFrame
samples_df = pd.DataFrame([
    item['metadata'] for item in samples.data
])
samples_df.dropna(axis=1, how='all', inplace=True)
samples_df = samples_df.applymap(lambda x: ', '.join(x) if isinstance(x, list) else x)

# compute summary statistics for most common attribute values
samples_summary = pd.DataFrame({
    'unique': samples_df.nunique(),
    'total': samples_df.shape[0],
    'top_values': samples_df.apply(
        lambda col: " / ".join(col.value_counts(dropna=False).index.astype(str))
    )
})

# show summary statistics
samples_summary.sort_values('top_values')

### 3.3 Cells

#### 3.3.1 Retrieval 

The `omics_search_cells_as_user` endpoint enables cell metadata search within the omics query interface. It allows combining sample-level filters with cell attributes and linking downstream omics data types.

In this example, the `cell_query` is defined to retrieve high-quality nuclei with low mitochondrial content (`percentMito <= 5`) and a total UMI count between 1,000 and 15,000 (`nCounts`). The cells can be searched for each sample in the cohort separately, as done here, or together in a single request using the `sample_filter` parameter.
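
To make the thresholds concrete, the same condition can be applied locally to a few toy records. This is only a sketch of the filter's logic: the barcodes and values are invented, and `passes_qc` is a hypothetical helper, not something the API provides.

```python
def passes_qc(cell, max_mito=5, min_counts=1000, max_counts=15000):
    """Local equivalent of: percentMito <= 5 AND (nCounts > 1000 AND nCounts < 15000)."""
    return (cell["percentMito"] <= max_mito
            and min_counts < cell["nCounts"] < max_counts)


toy_cells = [
    {"barcode": "AAAC-1", "percentMito": 1.2, "nCounts": 4200},   # passes
    {"barcode": "AAAG-1", "percentMito": 7.5, "nCounts": 3000},   # mitochondrial content too high
    {"barcode": "AACT-1", "percentMito": 0.4, "nCounts": 1000},   # nCounts is not > 1000
    {"barcode": "AAGT-1", "percentMito": 5.0, "nCounts": 14999},  # passes at the boundaries
]

kept = [cell["barcode"] for cell in toy_cells if passes_qc(cell)]
print(kept)
```

Note that the mitochondrial bound is inclusive while both count bounds are exclusive, exactly as written in the query string.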

In [None]:
# define high-quality cell filter
cell_query = " AND ".join([
    'percentMito <= 5',
    '(nCounts > 1000 AND nCounts < 15000)'
])

# iterate over samples and search high-quality cells
cells_list = []
for sample in samples_df["genestack:accession"]:
    start_time = time.time()
    cells_sample = omics_api.omics_search_cells_as_user(
        sample_filter='"genestack:accession"="' + sample + '"',
        cell_query=cell_query,
        page_limit=20000
    )
    end_time = time.time()
    
    # generate data frame, assign sample name and store
    df = pd.DataFrame([item for item in cells_sample.data])
    df["sampleId"] = sample
    cells_list.append(df)

    # print search statistics
    elapsed_time = round(end_time - start_time, 2)    
    print(f"Found {len(df)} cells in sample {sample} in {elapsed_time} seconds")

# combine all cells into a single metadata table and establish barcode as index
cells_df = pd.concat(cells_list, ignore_index=True)
cells_df = cells_df.set_index("barcode")

#### 3.3.2 Visualization 
We will use the combined cell metadata table to create an `AnnData` object that holds the cell metadata and UMAP coordinates. This makes it possible to visualize the available cell type populations with the `Scanpy` package.
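
Scanpy's UMAP plot reads coordinates from a two-dimensional array with one row per cell. A minimal sketch of that conversion follows, assuming (as the notebook code implies) that each cell's `umap` entry is a two-element list; the coordinate values are invented.

```python
import numpy as np

# toy per-cell UMAP coordinates, one [x, y] pair per cell
umap_column = [[0.1, 1.5], [2.3, -0.7], [-1.0, 0.2]]

# stack the pairs into an array of shape (n_cells, 2)
coords = np.stack(umap_column)
print(coords.shape)
```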

In [None]:
# create Scanpy object with placeholder expression matrix
adata = ad.AnnData(
    X=None,
    obs=cells_df.copy()
)

# transfer umap coordinates to obsm
adata.obsm["X_umap"] = np.stack(adata.obs["umap"].values)

# plot UMAP with cellType labels
sc.pl.umap(
    adata,
    color='cellType',
    legend_loc='on data',
    title="UMAP by Cell Type",
    show=False
)

# modify text characteristics
ax = plt.gca()
for text in ax.texts:
    text.set_fontweight('light')
    text.set_fontsize(10)

plt.show()

### 3.4 Cell Expression Data

#### 3.4.1 Retrieval
The `omics_search_cells_expression_data_as_user` endpoint enables cell expression data search within the omics query interface. It allows combining sample-level filters and cell attributes with an expression query.

In this example, the `ex_query` is used to retrieve expression data for selected known cell type marker genes, one gene per request. The expression data is searched for the same pool of cells within the sample cohort, which we defined previously by the `sample_filter` and `cell_query` parameters.
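
Restricting a request to the whole cohort amounts to joining one clause per sample accession with `OR`. The sketch below shows that step on its own; `accession_filter` is a hypothetical helper and the accessions are placeholders. The guard against an empty list is a design choice of this sketch, on the assumption that an empty filter string might otherwise match more samples than intended.

```python
def accession_filter(accessions, field="genestack:accession"):
    """Join accessions into one OR-combined filter string."""
    accessions = list(accessions)
    if not accessions:
        # refuse to build an empty filter rather than silently dropping the restriction
        raise ValueError("at least one accession is required")
    return " OR ".join(f'"{field}"="{acc}"' for acc in accessions)


print(accession_filter(["GSF000001", "GSF000002"]))
```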

In [None]:
# define search parameters
genes = "TNNT2", "DCN", "PECAM1", "CD163", "PDGFRB", "FABP4", "MAP2", "CD4", "CD8A", "CPA3"

cell_query = " AND ".join([
    'percentMito <= 5',
    '(nCounts > 1000 AND nCounts < 15000)'
])

sample_filter = " OR ".join(
    '"genestack:accession"="' + sample + '"'
    for sample in samples_df["genestack:accession"]
)

# iterate over genes
ex_list = []
for gene in genes:
    start_time = time.time()
    ex_gene = omics_api.omics_search_cells_expression_data_as_user(
        sample_filter=sample_filter,
        ex_query="feature=" + gene,
        cell_query=cell_query,
        page_limit=20000
    )
    end_time = time.time()
    
    df = pd.DataFrame([item for item in ex_gene.data])
    ex_list.append(df)

    elapsed_time = round(end_time - start_time, 2) 

    print(f"Found {len(df)} cells with gene {gene} in {elapsed_time} seconds")

# combine expression records for all genes into a single table and flatten nested fields
ex_df = clean_metadata(pd.concat(ex_list, ignore_index=True))

#### 3.4.2 Visualization
The purpose of this section is to validate the cell type assignment by visualizing the expression of selected cell type-specific markers. We will create an `AnnData` object with both cell metadata and expression data, and visualize the expression of the selected genes using the `pl.umap` and `pl.violin` plotting functions from the `Scanpy` package.
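
The expression records arrive in long format (one row per cell and gene), while `AnnData` expects a cells x genes matrix. The index-mapping idea behind that conversion can be sketched in plain Python with invented records; the notebook's own helper builds a sparse matrix with `scipy`, whereas this illustration fills a small dense list of lists for readability.

```python
# (cell barcode, gene, value) toy triplets
toy_records = [
    ("cell_a", "TNNT2", 3.0),
    ("cell_b", "DCN", 1.0),
    ("cell_a", "DCN", 2.0),
]
toy_barcodes = ["cell_a", "cell_b", "cell_c"]  # cell_c has no records, so its row stays zero

# unique genes in first-seen order, then name -> index lookups
toy_genes = list(dict.fromkeys(gene for _, gene, _ in toy_records))
gene_to_idx = {gene: i for i, gene in enumerate(toy_genes)}
cell_to_idx = {cell: i for i, cell in enumerate(toy_barcodes)}

# cells x genes matrix, zero wherever no value was reported
matrix = [[0.0] * len(toy_genes) for _ in toy_barcodes]
for cell, gene, value in toy_records:
    matrix[cell_to_idx[cell]][gene_to_idx[gene]] = value

print(toy_genes)
print(matrix)
```

Cells without any record for a gene end up with an implicit zero, which is the same convention a sparse matrix uses.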

In [None]:
# create AnnData object
counts_matrix = create_full_matrix(cells_df, ex_df)
genes_unique = ex_df['gene_id'].unique()
cells_unique = cells_df.index
adata = ad.AnnData(
    X=counts_matrix.T,
    obs=cells_df.loc[cells_unique].copy(),
    var=pd.DataFrame(index=genes_unique)
)

# transfer umap coordinates to obsm
adata.obsm["X_umap"] = np.stack(adata.obs["umap"].values)

# visualize gene expression on UMAP
sc.pl.umap(
    adata,
    color=genes,
    ncols=2,
    size=50,
    color_map='viridis',
    show=True
)

In [None]:
# plot violins in a 5x2 grid (5 rows, 2 columns)
fig, axes = plt.subplots(5, 2, figsize=(10, 20))
axes = axes.flatten()

for idx, gene in enumerate(genes):
    sc.pl.violin(
        adata,
        keys=gene,
        groupby='cellType',
        stripplot=False,
        rotation=90,
        multi_panel=False,
        ax=axes[idx],
        size=1,
        show=False
    )
    axes[idx].set_title(gene)

plt.tight_layout()
plt.show()

### 3.5 Retrieving Analytical Omics Endpoints

The `BETAAnalyticsOmicsQueriesAsUserApi` class provides access to analytical omics endpoints that enable advanced statistical analyses and cell population comparisons. These endpoints include cell ratio statistics, differential expression analysis, and gene summary statistics, which are essential for identifying cell type-specific expression patterns and uncovering meaningful biological insights.

In [None]:
# initialize API class
omics_analytics_api = odm_api.BETAAnalyticsOmicsQueriesAsUserApi(api_client)

# list all available omics_analytics_api endpoints
for item in [item for item in dir(omics_analytics_api) if item.endswith("_as_user")]:
    print(item)

### 3.6 Cell Ratio Statistics
The `cell_ratio_as_user` endpoint enables cell ratio statistics retrieval within the analytical omics interface. It allows quantifying the proportion of cells that meet specific criteria (`count_selected`, e.g. cell type or expression threshold) relative to a defined reference group or the total cell population (`count_available`, e.g. defined by study or samples metadata).

Since fibrosis is among the hallmarks of hypertrophic cardiomyopathy, we are interested in evaluating the fibroblast subpopulations in our patient cohort. We will compare the proportions of the `Activated_fibroblast` subtype and the major fibroblast subtype (i.e. `Fibroblast_I`) among all the cells that meet the established quality criteria.
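
As a rough illustration of what such a proportion expresses, the sketch below divides a selected count by an available count. The numbers are invented, and the formula `count_selected / count_available` is an assumption based on the field names; the endpoint's actual computation is not confirmed here.

```python
def cell_ratio(count_selected, count_available):
    """Fraction of available cells that fall into the selected group."""
    if count_available == 0:
        raise ValueError("count_available must be positive")
    return count_selected / count_available


# invented counts: cells per subtype among 6000 cells passing quality control
toy_counts = {"Activated_fibroblast": 150, "Fibroblast_I": 1200}
total_qc_cells = 6000

ratios = {name: round(cell_ratio(n, total_qc_cells), 3) for name, n in toy_counts.items()}
print(ratios)
```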

In [None]:
# define search parameters
cell_types = "Activated_fibroblast", "Fibroblast_I"

cell_query = " AND ".join([
    'percentMito <= 5',
    '(nCounts > 1000 AND nCounts < 15000)'
])

sample_filter = " OR ".join(
    '"genestack:accession"="' + sample + '"'
    for sample in samples_df["genestack:accession"]
)

# iterate over cell types and retrieve cell ratio statistics
cell_ratio_list = []
for cell_type in cell_types:
    start_time = time.time()
    cell_ratio = omics_analytics_api.cell_ratio_as_user(
        cr_request=CRRequest(
            cellGroup=DERequestCaseGroup(
                sampleFilter=sample_filter,
                cellQuery="cellType=" + cell_type + " AND " + cell_query
            )
        )
    )
    end_time = time.time()

    # generate and format data frame
    df = pd.DataFrame([item for item in cell_ratio])
    df_t = extract_stats(df)
    cell_ratio_list.append(df_t)

    # print search statistics
    elapsed_time = round(end_time - start_time, 2)
    
    print(f"Retrieved cell ratio statistics for {cell_type} cell type "
          f"in {elapsed_time} seconds")

# combine the statistics for all cell types into a single table
cell_ratio_df = pd.concat(cell_ratio_list)
cell_ratio_df

### 3.7 Differential Expression Analysis

#### 3.7.1 Retrieval

The `differential_expression_as_user` endpoint enables differential gene expression analysis retrieval within the analytical omics interface. It allows identifying genes that are differentially expressed between two cell populations or cell types, defined by the `cell_query` parameter. 

In this example, the request is defined to retrieve differential gene expression between the `Activated_fibroblast` and `Fibroblast_I` cell types. The cells are defined by the `cell_query` parameter, which also includes the quality control criteria. To narrow down the cell population, we will use the `sample_filter` parameter to define the samples cohort. 

The results can be retrieved for selected genes or for all genes in the cohort. We retrieve results for all genes in the cohort in chunks of 20,000 genes (the maximum page limit for the endpoint). The results are converted to a DataFrame and sorted by p-value and absolute log2 fold change. Only genes measured in at least 6 cells in each group are kept. Below we show results for the top 100 differentially expressed genes.
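
Chunked retrieval follows a generic offset/limit pattern: request a page, advance the offset, and stop once a page comes back shorter than the limit. The sketch below demonstrates the loop against a stand-in function; `fetch_page` is hypothetical and simply slices a range of integers, so no API call is involved.

```python
def fetch_page(offset, limit, total=45):
    """Stand-in for a paged endpoint: returns up to `limit` items starting at `offset`."""
    return list(range(offset, min(offset + limit, total)))


def fetch_all(limit=20):
    """Collect all pages by advancing the offset until a short page is returned."""
    items, offset = [], 0
    while True:
        page = fetch_page(offset, limit)
        items.extend(page)
        # a short (or empty) page means the last page was reached
        if len(page) < limit:
            break
        offset += limit
    return items


all_items = fetch_all()
print(len(all_items))
```

When the total happens to be an exact multiple of the limit, the loop makes one extra request that returns an empty page and then stops.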


In [None]:
# define search parameters
cell_types = "Activated_fibroblast", "Fibroblast_I"

cell_query = " AND ".join([
    'percentMito <= 5',
    '(nCounts > 1000 AND nCounts < 15000)'
])

sample_filter = " OR ".join(
    '"genestack:accession"="' + sample + '"'
    for sample in samples_df["genestack:accession"]
)


# retrieve differential gene expression in chunks
offset = 0
limit = 20000
de_results_list = []
start_time = time.time()

while True:
    de_results = omics_analytics_api.differential_expression_as_user(
        de_request=DERequest(
            caseGroup=DERequestCaseGroup(
                sampleFilter=sample_filter,
                cellQuery="cellType=" + cell_types[0] + " AND " + cell_query
            ),
            controlGroup=DERequestCaseGroup(
                sampleFilter=sample_filter,
                cellQuery="cellType=" + cell_types[1] + " AND " + cell_query
            ),
            limit=limit,
            offset=offset            
        )
    )
    offset = offset + limit
    
    # generate and format data frame
    df = pd.DataFrame(de_results.results_per_gene)
    de_results_list.append(df)
    
    if len(de_results.results_per_gene) < limit:
        break
end_time = time.time()

# combine all result pages into a single table
de_results_df = pd.concat(de_results_list)
col_names = [cell[0] for cell in de_results_df.iloc[0]]
de_results_df = de_results_df.applymap(lambda x: x[1])
de_results_df.columns = col_names

# print search statistics
elapsed_time = round(end_time - start_time, 2)
print(f"Retrieved differential gene expression for {len(de_results_df)} genes "
      f"in {elapsed_time} seconds")

# calculate Log2FC and sort by p-value and absolute log2FC
de_results_df["log2FC"] = (np.log2(de_results_df["fold_change"])).round(2)
de_results_df.sort_values(
    by=["p_value", "log2FC"], 
    ascending=[True, False], 
    key=lambda col: col.abs() if col.name == 'log2FC' else col,
    inplace=True
)

# set minimal sample size
de_results_df = de_results_df[
    de_results_df[["case_cell_count", "control_cell_count"]].min(axis=1) >= 6
]

# keep only columns of interest
filt_results_df = de_results_df[[
    "gene_id", "case_avg_expression", "control_avg_expression", 
    "log2FC", "p_value", "mann_whitney_u"
]].copy()

# extract top DE genes
filt_results_df = filt_results_df.head(100)
show(filt_results_df, paging=True)

#### 3.7.2 Visualization

The purpose of this section is to visualize the differentially expressed genes in an interactive volcano plot built with `plotly`. The plot shows the log2 fold change on the x-axis and the -log10 p-value on the y-axis. Genes with p-value <= 0.05 and log2 fold change >= 0.5 are shown in red (up-regulated), genes with p-value <= 0.05 and log2 fold change <= -0.5 in blue (down-regulated), and all remaining genes in grey. Dashed lines mark the thresholds.
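
The grouping rule behind the plot can be written as a small standalone function. This sketch uses invented gene names and values; `classify` is a hypothetical helper that applies the same cutoffs (p-value at most 0.05, absolute log2 fold change at least 0.5) to one gene at a time.

```python
import math


def classify(fold_change, p_value, lfc_cut=0.5, p_cut=0.05):
    """Label a gene as 'up', 'down', or 'ns' (not significant)."""
    log2fc = math.log2(fold_change)
    if p_value <= p_cut and log2fc >= lfc_cut:
        return "up"
    if p_value <= p_cut and log2fc <= -lfc_cut:
        return "down"
    return "ns"


# invented (fold change, p-value) pairs
toy = {
    "GENE_A": (4.0, 0.001),   # log2FC = 2, significant
    "GENE_B": (0.25, 0.01),   # log2FC = -2, significant
    "GENE_C": (1.1, 0.0001),  # significant p-value but small effect
    "GENE_D": (8.0, 0.2),     # large effect but p-value above the cutoff
}
labels = {gene: classify(fc, p) for gene, (fc, p) in toy.items()}
print(labels)
```

A gene needs to pass both cutoffs to be highlighted, so a very small p-value alone is not enough.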


In [None]:
# log-transform p-values; exact zeros are first replaced with 0.001
# so that -log10 stays finite
de_results_df["p_value"] = de_results_df["p_value"].replace(0, 0.001)
de_results_df["-log10(pvalue)"] = -np.log10(de_results_df["p_value"])

# highlight down- or up- regulated genes
down = de_results_df[
    (de_results_df['log2FC']<= -0.5) &
    (de_results_df['p_value'] <= 0.05)
]
up = de_results_df[
    (de_results_df['log2FC'] >= 0.5) &
    (de_results_df['p_value'] <= 0.05)
]
ns = de_results_df[
    ~de_results_df.index.isin(down.index) & 
    ~de_results_df.index.isin(up.index)
] 

fig = go.Figure()

# add traces
fig.add_trace(make_trace(ns, 'Not Significant', 'grey'))
fig.add_trace(make_trace(down, 'Down-regulated', 'blue'))
fig.add_trace(make_trace(up, 'Up-regulated', 'red'))

# add threshold lines
fig.add_hline(y=-np.log10(0.05), line_dash="dash", line_color="grey")
fig.add_vline(x=-0.5, line_dash="dash", line_color="grey")
fig.add_vline(x=0.5, line_dash="dash", line_color="grey")

fig.update_layout(
    xaxis_title="log2 Fold Change",
    yaxis_title="-log10(p-value)",
    template="plotly_white",
    width=800,
    height=600,
    hovermode="closest"
)

fig.show()

### 3.8 Gene Summary Statistics

#### 3.8.1 Summary for Top Differentially Expressed Genes

The `gene_summary_as_user` endpoint enables gene summary statistics retrieval within the analytical omics interface. It allows retrieving the expression distribution of selected genes in the defined cell populations, defined by the `cell_query` parameter. To narrow down the cell population, we will use the `sample_filter` parameter to define the samples cohort. The results can be retrieved for selected genes or for all genes in the cohort.

We will retrieve the expression distribution of the top 3 upregulated and top 3 downregulated genes (ranked by p-value) in the `Activated_fibroblast` and `Fibroblast_I` cell types.
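
For orientation, the kind of per-gene summary discussed here can be computed locally with the standard library. The values are invented and the dictionary keys only mirror the column names used in this notebook; how the endpoint defines its quantiles, or whether its standard deviation is the sample or population variant, is not confirmed here (this sketch uses quartiles with the inclusive method and the sample standard deviation).

```python
import statistics

# invented expression values of one gene across seven cells
values = [0.0, 1.0, 1.0, 2.0, 3.0, 5.0, 8.0]

summary = {
    "cell_count": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "std_dev": statistics.stdev(values),  # sample standard deviation
    "min": min(values),
    "max": max(values),
    "quantiles": statistics.quantiles(values, n=4, method="inclusive"),
}
print(summary)
```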

In [None]:
# define search parameters
up_genes = (
    de_results_df
    .loc[de_results_df["log2FC"] > 0, "gene_id"]
    .head(3)
    .tolist()
)
down_genes = (
    de_results_df
    .loc[de_results_df["log2FC"] < 0, "gene_id"]
    .head(3)
    .tolist()
)
top_genes = up_genes + down_genes

cell_types = "Activated_fibroblast", "Fibroblast_I"

cell_query = " AND ".join([
    'percentMito <= 5',
    '(nCounts > 1000 AND nCounts < 15000)'
])

sample_filter = " OR ".join(
    '"genestack:accession"="' + sample + '"'
    for sample in samples_df["genestack:accession"]
)

# iterate over cell types and retrieve summary statistics
gene_summary_list = []
for cell_type in cell_types:

    # retrieve summary statistics
    start_time = time.time()
    gene_summary_cell_type = omics_analytics_api.gene_summary_as_user(
        gs_request=GSRequest(
            geneNames=top_genes,
            cellGroup=DERequestCaseGroup(
                sampleFilter=sample_filter,
                cellQuery="cellType=" + cell_type + " AND " + cell_query
            )
        )
    )
    end_time = time.time()
    
    # generate and format data frame
    df = pd.DataFrame(gene_summary_cell_type.results_per_gene)
    df["cellType"] = [("cellType", cell_type)] * len(df)
    gene_summary_list.append(df)

    # print search statistics
    n_genes = len(gene_summary_cell_type.results_per_gene)
    elapsed_time = round(end_time - start_time, 2)
    print(
        f"Retrieved summary for {n_genes} genes in {cell_type} "
        f"cell type in {elapsed_time} seconds"
    )
    
# combine results for all cell types into a single table
gene_summary_df = pd.concat(gene_summary_list)

col_names = [cell[0] for cell in gene_summary_df.iloc[0]]
gene_summary_df = gene_summary_df.applymap(lambda x: x[1])
gene_summary_df.columns = col_names

# order genes
gene_summary_df = gene_summary_df.sort_values('gene_id')

# keep only columns of interest
gene_summary_df = gene_summary_df[[
    "gene_id", "cell_count", "mean", "median", "std_dev",
     "min", "max", "quantiles", "cellType"
]]

show(gene_summary_df, paging=True)

#### 3.8.2 Visualization for Top Differentially Expressed Genes

The purpose of this section is to visualize the expression distribution of the top 6 differentially expressed genes (3 overexpressed and 3 downregulated). The boxplot shows the expression distribution of the genes in the `Activated_fibroblast` and `Fibroblast_I` cell types.

In [None]:
# fill colors mapping
fill_colors = {
    "Activated_fibroblast": "#E69F00",
    "Fibroblast_I": "#56B4E9"
}

boxplot_gene_summary(
    df=gene_summary_df,
    fill_colors=fill_colors,
    plot_title="Gene Expression Distribution by Cell Type",
    log1p=True
)    

#### 3.8.3 Summary for Commonly Expressed Genes

The `gene_summary_as_user` endpoint allows rapid retrieval of the expression distribution of genes not only within the defined cell populations, but also across all the cells indexed in ODM. In the example below, we check how many cells across the database express each of a few commonly expressed genes (expression value of at least 1).
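
What "cells expressing a gene" means under a threshold can be illustrated on a handful of invented records. The sketch counts, per gene, the records whose value is at least 1; it only demonstrates the counting logic and says nothing about how the endpoint computes its `cell_count`.

```python
from collections import Counter

# invented (cell, gene, value) records
records = [
    ("c1", "MALAT1", 12.0),
    ("c2", "MALAT1", 0.0),
    ("c2", "INS", 3.0),
    ("c3", "MALAT1", 1.0),
    ("c3", "INS", 0.5),
]

# count one hit per record whose value reaches the threshold
expressing = Counter(gene for _, gene, value in records if value >= 1)
print(expressing.most_common())
```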

In [None]:
# define search parameters
common_genes = ["MALAT1", "INS", "NEAT1", "ZBTB20", "REG1A"]

# retrieve gene summary statistics
start_time = time.time()
common_gene_summary = omics_analytics_api.gene_summary_as_user(
    gs_request=GSRequest(
        geneNames=common_genes,
        exQuery="value >= 1"
    )
)
end_time = time.time()

# print search statistics
n_genes = len(common_gene_summary.results_per_gene)
elapsed_time = round(end_time - start_time, 2)
print(f"Retrieved summary for {n_genes} genes in {elapsed_time} seconds")

# convert gene summary to dataframe
common_gene_summary_df = pd.DataFrame(common_gene_summary.results_per_gene)
col_names = [cell[0] for cell in common_gene_summary_df.iloc[0]]
common_gene_summary_df = common_gene_summary_df.applymap(lambda x: x[1])
common_gene_summary_df.columns = col_names

# order genes
common_gene_summary_df.sort_values('cell_count', ascending=False, inplace=True)

# keep only columns of interest
common_gene_summary_df = common_gene_summary_df[[
    "gene_id", "cell_count", "mean", "median", "std_dev",
    "min", "max", "quantiles"
]]

show(common_gene_summary_df, paging=True)