# Annotation Embedding

In this notebook, we explain how to generate an annotation embedding from expression-level annotations obtained with our [FAUST method](https://github.com/RGLab/FAUST) on data from [Mair et al., 2022, Nature](https://www.nature.com/articles/s41586-022-04718-w).

In [19]:
import jscatter
import numpy as np
import pandas as pd
from glob import glob
from colors import glasbey_dark, gray_light

In [6]:
dataset_name = 'TUMOR_006'

## Data Loading

In [7]:
dataset = glob(f'data/mair-2022/{dataset_name}*')[0]
df = pd.read_parquet(dataset)
df.head(5)

Unnamed: 0,umapX,umapY,sampleOfOrigin,faustLabels,CD4,CD8,CD3,CD45RA,CD27,CD19,...,CD69_faust_annotation,PD1_faust_annotation,HLADR_faust_annotation,GranzymeB_faust_annotation,CD25_faust_annotation,ICOS_faust_annotation,TCRgd_faust_annotation,CD38_faust_annotation,CD127_faust_annotation,Tim3_faust_annotation
0,-12.197959,10.958616,TUMOR_006_samples_FM92_Tumor_006_033_CD45_live...,CD4+CD8-CD3+CD45RA-CD27+CD19-CD103-CD28+CD69-P...,134.479553,64.721008,160.687744,59.179371,123.690056,65.03624,...,-,+,-,-,-,-,-,-,+,-
1,-0.338861,-4.746182,TUMOR_006_samples_FM92_Tumor_006_033_CD45_live...,CD4-CD8+CD3+CD45RA-CD27+CD19-CD103-CD28+CD69-P...,27.485004,168.428329,157.620132,60.775932,86.70932,57.068966,...,-,-,-,+,-,-,-,-,-,-
2,-6.862342,-14.07071,TUMOR_006_samples_FM92_Tumor_006_033_CD45_live...,CD4-CD8+CD3+CD45RA-CD27+CD19-CD103+CD28+CD69+P...,23.172733,176.405228,158.386978,60.367004,144.181641,50.614777,...,+,+,-,-,-,-,-,+,-,-
3,-9.874931,0.273441,TUMOR_006_samples_FM92_Tumor_006_033_CD45_live...,0_0_0_0_0,116.68148,67.511528,155.634628,56.426315,149.26651,62.5065,...,-,+,-,-,-,+,-,-,-,+
4,7.248039,4.996465,TUMOR_006_samples_FM92_Tumor_006_033_CD45_live...,CD4-CD8-CD3-CD45RA+CD27-CD19+CD103-CD28-CD69-P...,28.431238,62.975693,39.712109,176.277328,46.269531,172.01149,...,-,-,-,-,-,-,-,-,-,-


## Data Preparation

We start by extracting the names of all markers from the FAUST annotation columns.

In [10]:
suffix = '_faust_annotation'
markers = [c[:-len(suffix)] for c in list(df.columns) if c.endswith(suffix)]

Then, we extract the "raw" marker expression values.

In [11]:
raw_expression = df[markers].values

Next, we create a new column for the _complete_ FAUST annotation labels. In contrast to `faustLabels`, the complete FAUST label does not depend on selected phenotypes; it is simply the concatenation of all marker labels.

Think of the _complete FAUST label_ as the cluster name that each data entry belongs to.

In [12]:
df['complete_faust_label'] = ''
for marker in markers:
    df['complete_faust_label'] += marker + df[f'{marker}_faust_annotation']
    
df[['faustLabels', 'complete_faust_label']].head(5)

Unnamed: 0,faustLabels,complete_faust_label
0,CD4+CD8-CD3+CD45RA-CD27+CD19-CD103-CD28+CD69-P...,CD4+CD8-CD3+CD45RA-CD27+CD19-CD103-CD28+CD69-P...
1,CD4-CD8+CD3+CD45RA-CD27+CD19-CD103-CD28+CD69-P...,CD4-CD8+CD3+CD45RA-CD27+CD19-CD103-CD28+CD69-P...
2,CD4-CD8+CD3+CD45RA-CD27+CD19-CD103+CD28+CD69+P...,CD4-CD8+CD3+CD45RA-CD27+CD19-CD103+CD28+CD69+P...
3,0_0_0_0_0,CD4+CD8-CD3+CD45RA-CD27+CD19-CD103-CD28+CD69-P...
4,CD4-CD8-CD3-CD45RA+CD27-CD19+CD103-CD28-CD69-P...,CD4-CD8-CD3-CD45RA+CD27-CD19+CD103-CD28-CD69-P...
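
Because the labels above are truncated, a toy version of the same concatenation may help. The sketch below uses a hypothetical two-marker frame (`toy` and `toy_markers` are illustrative names, not part of the dataset) and applies the same loop as the cell above.

```python
import pandas as pd

# Hypothetical toy data with two markers; the real data has many more
toy = pd.DataFrame({
    'CD4_faust_annotation': ['+', '-', '-'],
    'CD8_faust_annotation': ['-', '+', '-'],
})
toy_markers = ['CD4', 'CD8']

# Append "<marker><annotation>" for every marker, in marker order
toy['complete_faust_label'] = ''
for marker in toy_markers:
    toy['complete_faust_label'] += marker + toy[f'{marker}_faust_annotation']

print(toy['complete_faust_label'].tolist())
# ['CD4+CD8-', 'CD4-CD8+', 'CD4-CD8-']
```

Every distinct string produced this way identifies one cluster.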


Finally, we extract the expression levels detected by FAUST.

Think of expression levels as a discretization of the "raw" expression values. For example, if a protein's expression range is `[0,10]`, we could discretize it into low and high, where low represents values in `[0,5)` and high represents values in `[5,10]`.
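
As a minimal sketch of such a discretization, assuming a fixed threshold of 5 (FAUST derives its own thresholds from the data, so both `raw_values` and `threshold` here are purely hypothetical):

```python
import numpy as np

# Hypothetical raw expression values in the range [0, 10]
raw_values = np.array([0.4, 2.5, 4.9, 5.0, 7.3, 9.8])

# Hypothetical cut point between "low" and "high"
threshold = 5.0

# Values below the threshold become '-', all others '+'
levels = np.where(raw_values < threshold, '-', '+')

print(levels.tolist())
# ['-', '-', '-', '+', '+', '+']
```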

In [13]:
expression_levels = list(df.CD3_faust_annotation.unique())
expression_levels

['+', '-']

## Data Transformation

The following steps are the core of creating an annotation-driven embedding.

For each cell type (i.e., cluster) and marker (i.e., feature), we winsorize and standardize the expression values to have zero mean and unit variance. We then translate the expression values of each marker according to the marker's expression level (positive or negative) to separate their expression ranges.
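
To make the three steps concrete, here is a small sketch on one marker of a single hypothetical cluster of ten cells. The values are made up, and the winsorization limit is set to 10% per side (instead of the 1% used in this notebook) so that it has a visible effect on so few cells.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical expression values of one marker in one cluster;
# the last value is an outlier
values = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9., 100.])

# 1. Winsorize: the lowest and highest 10% of values are replaced
#    by the nearest remaining value
clipped = np.asarray(winsorize(values, limits=[0.1, 0.1]))

# 2. Standardize to zero mean and unit standard deviation
standardized = (clipped - clipped.mean()) / clipped.std()

# 3. Translate by expression level, using the same offsets as the notebook
offset = {'-': 0, '+': 1000}
as_negative = standardized + offset['-']
as_positive = standardized + offset['+']

print(clipped.max())                           # the outlier is gone
print(as_negative.max() < as_positive.min())   # the two ranges do not overlap
```

Since standardized values stay within a few units of zero, an offset of 1000 keeps the positive and negative ranges of a marker far apart.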

In [14]:
from scipy.stats.mstats import winsorize
from time import time

expression_level_translation = { '-': 0, '+': 1000 }

faust_labels = df.complete_faust_label.unique()

marker_annotation_cols = [f'{m}_faust_annotation' for m in markers]

embedding_expression = raw_expression.copy()

t = 0

# For each cluster (i.e., cell phenotype defined by the FAUST label)
for i, faust_label in enumerate(faust_labels):
    if i % 1000 == 0:
        t = time()
        print(f'Transform {i}-{i + 999} of {len(faust_labels)} clusters... ', end='')
        
    # First, we get the indices of all data points belonging to the cluster (i.e., cell phenotype)
    idxs = df.query(f'complete_faust_label == "{faust_label}"').index
    
    # 1. We winsorize the lowest and highest 1% of the expression values
    embedding_expression[idxs] = winsorize(
        embedding_expression[idxs],
        limits=[0.01, 0.01],
        axis=0,
    )
    
    # 2. Then we standardize the expression values
    # to have zero mean and unit standard deviation
    mean = embedding_expression[idxs].mean(axis=0)
    sd = np.nan_to_num(embedding_expression[idxs].std(axis=0))
    sd[sd == 0] = 1
    
    embedding_expression[idxs] -= mean
    embedding_expression[idxs] /= sd

    # 3. Next, we translate the expression values based on their expression levels
    embedding_expression[idxs] += df.iloc[idxs[0]][marker_annotation_cols].map(
        expression_level_translation
    ).values
    
    if i % 1000 == 999:
        print(f'done! ({round(time() - t)}s)')

Transform 0-999 of 5388 clusters... done! (17s)
Transform 1000-1999 of 5388 clusters... done! (16s)
Transform 2000-2999 of 5388 clusters... done! (15s)
Transform 3000-3999 of 5388 clusters... done! (14s)
Transform 4000-4999 of 5388 clusters... done! (15s)
Transform 5000-5999 of 5388 clusters... 
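
The loop above scans the whole DataFrame once per cluster via `df.query`. One possible alternative, sketched here on a hypothetical toy frame and not benchmarked against the real data, is to compute all positional row indices in a single pass with pandas' `GroupBy.indices`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `df`
toy = pd.DataFrame({
    'complete_faust_label': ['CD4+CD8-', 'CD4-CD8+', 'CD4+CD8-', 'CD4-CD8-'],
})

# Maps each label to a NumPy array of positional row indices
cluster_indices = toy.groupby('complete_faust_label').indices

for label, idxs in cluster_indices.items():
    print(label, idxs.tolist())
```

Each `idxs` array could then take the place of the `idxs` obtained from `df.query` in the loop, assuming `df` has a default integer index so that positions and index labels coincide.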

### UMAP Embedding

The last step is to embed the transformed data using UMAP (or any other dimensionality reduction tool). We initialize UMAP with a two-dimensional PCA projection of the `_Windsorized` marker columns.

In [15]:
from sklearn.decomposition import PCA
from umap import UMAP

pca = PCA(n_components=2).fit_transform(
    df[[f'{m}_Windsorized' for m in markers]].values
)

embedding = UMAP(init=pca, random_state=42).fit_transform(embedding_expression)
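
The cell above passes a two-dimensional PCA projection to UMAP as the initial layout. As a rough illustration of what such a projection contains, the following NumPy-only sketch computes one via SVD on random, hypothetical data. It stands in for `sklearn.decomposition.PCA` and is not what the notebook runs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 cells x 5 markers
X = rng.normal(size=(100, 5))

# Center each marker, then project onto the two directions of largest variance
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
init_2d = X_centered @ Vt[:2].T

print(init_2d.shape)
# (100, 2)
```

An array of this shape, with one row per cell, is the form that the `init` argument receives in the cell above.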

Let's save the embedded data for easy access later on. Cells whose `faustLabels` value is `0_0_0_0_0` keep that placeholder as their cell type.

In [16]:
df_embedding = pd.concat(
    [
        pd.DataFrame(df.complete_faust_label.values, columns=['cellType']),
        pd.DataFrame(embedding, columns=['x', 'y'])
    ],
    axis=1
)
df_embedding.cellType = df_embedding.cellType.where(
    df.faustLabels != '0_0_0_0_0',
    '0_0_0_0_0'
).astype('category')

df_embedding.to_parquet(f'data/{dataset_name}_embedding_umap.pq', compression='gzip')

df_embedding.head()

Unnamed: 0,cellType,x,y
0,CD4+CD8-CD3+CD45RA-CD27+CD19-CD103-CD28+CD69-P...,1.344373,-7.968547
1,CD4-CD8+CD3+CD45RA-CD27+CD19-CD103-CD28+CD69-P...,4.135707,8.410807
2,CD4-CD8+CD3+CD45RA-CD27+CD19-CD103+CD28+CD69+P...,-10.298281,7.296144
3,0_0_0_0_0,0.350603,-3.498238
4,CD4-CD8-CD3-CD45RA+CD27-CD19+CD103-CD28-CD69-P...,13.752163,6.420627


## Visualize Embedding

In [20]:
# Load the previously embedded data (skip this line if `df_embedding` is still in memory)
df_embedding = pd.read_parquet(f'data/{dataset_name}_embedding_umap.pq')

scatter = jscatter.Scatter(
    data=df_embedding.sort_values(by=['cellType']),
    x='x',
    y='y',
    opacity=0.66,
    color_by='cellType',
    color_map=[gray_light]+glasbey_dark+glasbey_dark+glasbey_dark,
    height=640,
)
scatter.show()
