# FAUST Annotation Embedding

This notebook generates an annotation-driven embedding from the FAUST annotations of the MMRF study.

In [1]:
import jscatter
import numpy as np
import pandas as pd
from cmaps import glasbey_dark, gray

In [2]:
name = 'MMRF_1267'

## Data Loading

In [3]:
data = pd.read_csv(f'data/{name}_output.csv')
data.head(5)

Unnamed: 0,umapX,umapY,sampleOfOrigin,faustLabels,CD3,HLADR,NKG2D,CD8,CD45,CD4,...,CD56_faust_annotation,KLRG1_faust_annotation,Ki67_faust_annotation,PDL1_faust_annotation,Tbet_faust_annotation,ICOS_faust_annotation,NKG2A_faust_annotation,CD138_faust_annotation,CD1c_faust_annotation,DNAM1_faust_annotation
0,1.538682,0.374683,MMRF_1267.fcs,CD3-HLADR+NKG2D-CD8-CD45+CD4-CD5-CD14-CD19+CD1...,1.07537,5.342214,0.0,0.000268,4.862887,0.824602,...,-,-,-,-,-,-,-,-,+,-
1,-3.134375,11.461645,MMRF_1267.fcs,0_0_0_0_0,5.824924,0.915367,3.522006,4.770885,4.734933,0.0,...,-,+,-,-,-,-,-,-,-,+
2,-0.935311,12.303033,MMRF_1267.fcs,0_0_0_0_0,5.358062,1.360134,1.814536,2.393424,4.235237,0.0,...,-,-,+,-,+,+,-,-,+,+
3,2.128425,-5.859528,MMRF_1267.fcs,CD3-HLADR-NKG2D-CD8-CD45+CD4-CD5-CD14-CD19-CD1...,0.0,2.271876,0.347916,0.0,1.779969,0.547474,...,-,-,-,-,-,-,-,-,-,-
4,-2.191941,13.085436,MMRF_1267.fcs,0_0_0_0_0,5.449534,0.0,2.883831,4.385998,4.845957,0.0,...,-,+,-,-,-,-,-,+,-,-


## Data Preparation

We'll start by extracting all markers.

In [4]:
suffix = '_faust_annotation'
all_markers = [c[:-len(suffix)] for c in list(data.columns) if c.endswith(suffix)]

In [5]:
markers = all_markers[:20]

Then, we extract the "raw" marker expression values.

In [6]:
raw_expression = data[markers].values

Next, we create a new column for the _complete_ FAUST annotation labels. Unlike `faustLabels`, the complete FAUST label does not depend on the selected phenotypes; it is simply the concatenation of all marker labels.

Think of the _complete FAUST label_ as the cluster name that each data entry belongs to.

In [7]:
data['complete_faust_label'] = ''
for marker in markers:
    data['complete_faust_label'] += marker + data[f'{marker}_faust_annotation']
    
data[['faustLabels', 'complete_faust_label']].head(5)

Unnamed: 0,faustLabels,complete_faust_label
0,CD3-HLADR+NKG2D-CD8-CD45+CD4-CD5-CD14-CD19+CD1...,CD3-HLADR+NKG2D-CD8-CD45+CD4-CD5-CD14-CD19+CD1...
1,0_0_0_0_0,CD3+HLADR-NKG2D+CD8+CD45+CD4-CD5+CD14-CD19-CD1...
2,0_0_0_0_0,CD3+HLADR-NKG2D+CD8-CD45+CD4-CD5+CD14-CD19-CD1...
3,CD3-HLADR-NKG2D-CD8-CD45+CD4-CD5-CD14-CD19-CD1...,CD3-HLADR-NKG2D-CD8-CD45+CD4-CD5-CD14-CD19-CD1...
4,0_0_0_0_0,CD3+HLADR-NKG2D+CD8+CD45+CD4-CD5+CD14-CD19-CD1...


Finally, we extract the expression levels detected by FAUST.

Think of expression levels as simply a discretization of the "raw" expression values. For example, if a protein's expression range is `[0,10]`, we could discretize it into low and high, where low covers `[0,5)` and high covers `[5,10]`.
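
As a minimal sketch of that idea (not FAUST's actual procedure, which determines its own thresholds), the snippet below splits a made-up expression vector at an assumed cutoff of 5. Both the `values` array and the `threshold` are hypothetical and only serve as an illustration.

```python
import numpy as np

# Hypothetical raw expression values in the range [0, 10]
values = np.array([0.3, 4.9, 5.0, 7.2, 9.8])

# Assumed cutoff for illustration only; FAUST determines its own thresholds
threshold = 5.0

# Values below the cutoff become '-' (low), all others '+' (high)
levels = np.where(values < threshold, '-', '+')
print(levels.tolist())
```

Here the boundary value 5.0 is assigned to the high level, which is an arbitrary choice made for this sketch.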

In [8]:
expression_levels = list(data.CD3_faust_annotation.unique())
expression_levels

['-', '+']

## Data Transformation

The following steps are the core of creating an annotation-driven embedding.

For each cell type (i.e., cluster) and marker (i.e., feature), we winsorize and standardize the expression values to have zero mean and unit variance. We then translate the expression values of each marker according to the marker's expression level (positive or negative) to separate their expression ranges.

In [9]:
from scipy.stats.mstats import winsorize
from time import time

expression_level_translation = { '-': 1000, '+': 2000 }

faust_labels = data.complete_faust_label.unique()

marker_annotation_cols = [f'{m}_faust_annotation' for m in markers]

embedding_expression = raw_expression.copy()

t = 0

# For each cluster (i.e., cell phenotype defined by the FAUST label)
for i, faust_label in enumerate(faust_labels):
    if i % 1000 == 0:
        t = time()
        print(f'Transform {i}-{i + 999} of {len(faust_labels)} clusters... ', end='')
        
    # First, we get the indices of all data points belonging to the cluster (i.e., cell phenotype)
    idxs = data.query(f'complete_faust_label == "{faust_label}"').index
    
    # 1. We winsorize the expression values to the [1st, 99th] percentile range
    embedding_expression[idxs] = winsorize(
        embedding_expression[idxs],
        limits=[0.01, 0.01],
        axis=0,
    )
    
    # 2. Then we standardize the expression values
    # to have zero mean and unit standard deviation
    mean = embedding_expression[idxs].mean(axis=0)
    sd = np.nan_to_num(embedding_expression[idxs].std(axis=0))
    sd[sd == 0] = 1
    
    embedding_expression[idxs] -= mean
    embedding_expression[idxs] /= sd

    # 3. Next, we translate the expression values based on their expression levels
    embedding_expression[idxs] += data.iloc[idxs[0]][marker_annotation_cols].map(
        expression_level_translation
    ).values
    
    if i % 1000 == 999:
        print(f'done! ({round(time() - t)}s)')

Transform 0-999 of 2626 clusters... done! (7s)
Transform 1000-1999 of 2626 clusters... done! (6s)
Transform 2000-2999 of 2626 clusters... 
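
To see what the three steps do, the following toy sketch applies them to two made-up clusters that differ in the annotation of a single marker. It is an illustration under stated assumptions, not the notebook's code: the data are random, the `transform` helper is hypothetical, and percentile clipping via `np.clip` stands in for `scipy.stats.mstats.winsorize`. Only the `-`/`+` offsets are taken from the cell above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two clusters of 50 cells with 2 markers each
cluster_a = rng.normal(loc=3.0, scale=1.0, size=(50, 2))  # annotated '-', '+'
cluster_b = rng.normal(loc=3.0, scale=1.0, size=(50, 2))  # annotated '-', '-'

offsets = {'-': 1000, '+': 2000}  # same translation as in the notebook


def transform(x, annotation):
    # 1. Clip each marker to its 1st/99th percentile (stand-in for winsorize)
    lo, hi = np.percentile(x, [1, 99], axis=0)
    x = np.clip(x, lo, hi)
    # 2. Standardize each marker; guard against a zero standard deviation
    sd = x.std(axis=0)
    sd[sd == 0] = 1
    x = (x - x.mean(axis=0)) / sd
    # 3. Shift each marker by the offset of its annotation
    return x + np.array([offsets[a] for a in annotation])


a = transform(cluster_a, ['-', '+'])
b = transform(cluster_b, ['-', '-'])

# Per-marker spread within a cluster and the gap between cluster centers
spread = a.std(axis=0)
gap = np.abs(a.mean(axis=0) - b.mean(axis=0))
print(spread, gap)
```

Within each cluster every marker ends up with unit spread, while the two cluster centers sit 1000 units apart along the marker whose annotation differs. That gap is what lets a distance-based method such as UMAP keep differently annotated clusters apart.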

### UMAP Embedding

The last step is to embed the data using UMAP (or any other kind of dimensionality reduction tool).

In [10]:
from umap import UMAP

embedding = UMAP().fit_transform(embedding_expression)



Let's save the embedded data for easy access later on.

In [11]:
df_embedding = pd.concat(
    [
        pd.DataFrame(data.complete_faust_label.values, columns=['cellType']),
        pd.DataFrame(embedding, columns=['umapX', 'umapY'])
    ],
    axis=1
)
df_embedding.cellType = df_embedding.cellType.where(
    data.faustLabels != '0_0_0_0_0',
    '0_0_0_0_0'
).astype('category')

df_embedding.to_parquet(f'data/{name}_embedding-new-v4.pq', compression='gzip')

df_embedding.head()

Unnamed: 0,cellType,umapX,umapY
0,CD3-HLADR+NKG2D-CD8-CD45+CD4-CD5-CD14-CD19+CD1...,4.584349,11.086934
1,0_0_0_0_0,2.101406,-3.889677
2,0_0_0_0_0,1.886085,1.678656
3,CD3-HLADR-NKG2D-CD8-CD45+CD4-CD5-CD14-CD19-CD1...,8.504686,4.114351
4,0_0_0_0_0,6.513772,11.100647


## Visualize Embedding

In [17]:
from IPython.display import display

# Uncomment the line below to load previously embedded data
# df_embedding = pd.read_parquet(f'data/{name}_embedding-new-v4.pq')

scatter = jscatter.Scatter(
    data=df_embedding.sort_values(by=['cellType']),
    x='umapX',
    y='umapY',
    opacity=0.66,
    color_by='cellType',
    color_map=[gray]+glasbey_dark+glasbey_dark+glasbey_dark,
    height=640,
)

display(
    scatter.show(),
    scatter.widget.size_widget,
    scatter.widget.opacity_widget,
)
