# Cell BLAST tutorial

In [None]:
import time
import warnings
import numpy as np
import pandas as pd
import tensorflow as tf
import Cell_BLAST as cb

warnings.filterwarnings("ignore")
np.set_printoptions(threshold=200)
pd.set_option("display.max_rows", 6)
tf.logging.set_verbosity(0)
cb.config.N_JOBS = 4
cb.config.RANDOM_SEED = 0

## Preparing database

In this tutorial, we demonstrate how to run Cell BLAST queries based on DIRECTi models.

Again, we use the human pancreatic islet datasets as an example.

In [None]:
baron_human = cb.data.ExprDataSet.read_dataset("../../Datasets/data/Baron_human/data.h5")

Cell BLAST uses multiple models to increase specificity.

Here we first train 4 DIRECTi models, each with a different random seed.

> Please refer to the accompanying [DIRECTi](DIRECTi.html) notebook for a more detailed introduction to model training.

In [None]:
%%capture
start_time = time.time()
models = []
for i in range(4):
    models.append(cb.directi.fit_DIRECTi(
        baron_human, genes=baron_human.uns["seurat_genes"],
        latent_dim=10, cat_dim=20, random_seed=i
    ))

In [None]:
print("Time elapsed: %.1fs" % (time.time() - start_time))

Then we build a Cell BLAST "database" by passing the previously trained models and the reference dataset to the `BLAST` constructor.

In [None]:
blast = cb.blast.BLAST(models, baron_human)

Like DIRECTi models, [`BLAST`](../modules/Cell_BLAST.blast.html#Cell_BLAST.blast.BLAST) objects can be easily saved and loaded.

In [None]:
blast.save("./baron_human_blast")
del blast
blast = cb.blast.BLAST.load("./baron_human_blast")

## Querying

We load another human pancreatic islet dataset to demonstrate the querying process.

Note that we do **NOT** perform data normalization or gene subsetting here. Both are handled internally by the BLAST object during querying.

In [None]:
lawlor = cb.data.ExprDataSet.read_dataset("../../Datasets/data/Lawlor/data.h5")

To query the database, we first use the [`query()`](../modules/Cell_BLAST.blast.html#Cell_BLAST.blast.BLAST.query) method to obtain initial hits in the reference database. This is done by an efficient Euclidean-distance-based nearest-neighbor search in the latent space. The nearest neighbors found in the latent space of each model are then merged. Though highly efficient, latent-space Euclidean distance is not the best metric for cell-cell similarity. To increase accuracy and specificity, posterior distribution distances and empirical p-values are also computed for these nearest neighbors.

In [None]:
start_time = time.time()
lawlor_hits = blast.query(lawlor)
print("Time per query: %.1fms" % (
    (time.time() - start_time) * 1000 / lawlor.shape[0]
))
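
The two ideas behind the initial hits, merging per-model nearest neighbors and scoring a hit with an empirical p-value, can be illustrated with a small standalone sketch. This is not Cell BLAST's internal implementation: the function names, the toy 2-D "latent spaces", and the null distance sample are all hypothetical, and the p-value uses a generic add-one estimator that is assumed here purely for illustration.

```python
import math

def nearest_neighbors(query, reference, k):
    """Indices of the k reference points closest to `query` (Euclidean)."""
    order = sorted(range(len(reference)),
                   key=lambda j: math.dist(query, reference[j]))
    return order[:k]

def merged_hits(query_by_model, reference_by_model, k):
    """Union of the per-model nearest neighbors for one query cell."""
    hits = set()
    for query, reference in zip(query_by_model, reference_by_model):
        hits.update(nearest_neighbors(query, reference, k))
    return sorted(hits)

def empirical_pvalue(distance, null_distances):
    """Share of null distances no larger than `distance` (add-one estimator)."""
    n_as_small = sum(d <= distance for d in null_distances)
    return (n_as_small + 1) / (len(null_distances) + 1)

# Toy data (hypothetical): two models embed the same 4 reference cells.
reference_by_model = [
    [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)],
    [(0.0, 0.0), (4.0, 4.0), (0.5, 0.0), (6.0, 6.0)],
]
# The same query cell, embedded once per model.
query_by_model = [(0.1, 0.0), (0.1, 0.0)]

hits = merged_hits(query_by_model, reference_by_model, k=2)
pval = empirical_pvalue(0.2, [0.1, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
print(hits, pval)
```

In this toy case the two models disagree on the second-nearest neighbor, so the merged hit list is larger than `k`; this is why a stricter per-hit score is useful afterwards.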

Then we use [`reconcile_models()`](../modules/Cell_BLAST.blast.html#Cell_BLAST.blast.Hits.reconcile_models) to pool information from multiple models, and [`filter()`](../modules/Cell_BLAST.blast.html#Cell_BLAST.blast.Hits.filter) to reduce the initial hits to the significant ones.

In [None]:
lawlor_hits = lawlor_hits.reconcile_models().filter(by="pval", cutoff=0.05)
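
As a rough mental model of this step, the sketch below pools per-model p-values for each hit and keeps only hits under a cutoff. The averaging rule, the `<=` comparison, and the helper and cell names are assumptions made for illustration; the actual pooling used by `reconcile_models()` may differ.

```python
from statistics import mean

def reconcile(pvals_by_model):
    """Pool per-model p-values for each hit by averaging (assumed rule)."""
    return {hit: mean(pvals) for hit, pvals in pvals_by_model.items()}

def filter_hits(pooled, cutoff=0.05):
    """Keep hits whose pooled p-value does not exceed `cutoff`."""
    return {hit: p for hit, p in pooled.items() if p <= cutoff}

# Hypothetical p-values of three reference hits under four models.
pvals_by_model = {
    "ref_cell_a": [0.01, 0.02, 0.01, 0.04],
    "ref_cell_b": [0.01, 0.30, 0.20, 0.40],
    "ref_cell_c": [0.03, 0.03, 0.04, 0.01],
}

pooled = reconcile(pvals_by_model)
significant = filter_hits(pooled, cutoff=0.05)
print(sorted(significant))
```

Here `ref_cell_b` looks significant under one model only, so pooling across models removes it; this is the sense in which multiple models increase specificity.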

Optionally, we may use the [`to_data_frames()`](../modules/Cell_BLAST.blast.html#Cell_BLAST.blast.Hits.to_data_frames) method to extract detailed information about the query hits.

The return value is a Python dict, with query cell names as keys and the metadata tables of their hits as values.

In [None]:
hits_dict = lawlor_hits[0:5].to_data_frames()
hits_dict.keys()

In [None]:
hits_dict["1st-61_S27"]

Finally, we can use the [`annotate()`](../modules/Cell_BLAST.blast.html#Cell_BLAST.blast.Hits.annotate) method to obtain cell type predictions.

In [None]:
lawlor_predictions = lawlor_hits.annotate("cell_ontology_class")
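
One common way to turn a set of hits into a single prediction is majority voting over the hits' labels. The sketch below shows that idea only; the `min_hits` and `majority_threshold` parameters and the `"rejected"` / `"ambiguous"` fallback labels are assumptions for this example and should be checked against the `annotate()` documentation.

```python
from collections import Counter

def majority_vote(hit_labels, min_hits=2, majority_threshold=0.5):
    """Predict a label from the labels of a query cell's significant hits."""
    # Too few hits: decline to predict (fallback label is an assumption).
    if len(hit_labels) < min_hits:
        return "rejected"
    label, count = Counter(hit_labels).most_common(1)[0]
    # No label holds a strict majority: report the call as ambiguous.
    if count / len(hit_labels) <= majority_threshold:
        return "ambiguous"
    return label

# Hypothetical hit labels for three query cells.
print(majority_vote(["beta", "beta", "alpha"]))
print(majority_vote(["beta"]))
print(majority_vote(["alpha", "beta"]))
```

Declining to predict when evidence is thin or split keeps unseen or poorly matched cell types from being forced into a reference label.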

For the "Lawlor" dataset, we also have author-provided "ground truth" cell type annotations.

Comparing the predictions with this "ground truth" in the Sankey plot below shows how closely they agree.

In [None]:
fig = cb.blast.sankey(
    lawlor.obs["cell_ontology_class"].values,
    lawlor_predictions.values.ravel(),
    title="Lawlor to Baron_human", tint_cutoff=2
)
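
To complement the plot with a number, agreement between predictions and "ground truth" can be summarized as an accuracy plus a table of (truth, prediction) counts. The sketch below uses hypothetical toy labels and helper names rather than the actual Lawlor results; with real data, the two lists would be the arrays passed to `cb.blast.sankey` above.

```python
from collections import Counter

def accuracy(truth, predicted):
    """Fraction of cells whose predicted label equals the true label."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)

def confusion_counts(truth, predicted):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(truth, predicted))

# Hypothetical labels for five query cells.
truth = ["alpha", "alpha", "beta", "beta", "delta"]
predicted = ["alpha", "alpha", "beta", "alpha", "delta"]

acc = accuracy(truth, predicted)
counts = confusion_counts(truth, predicted)
print(acc)
for (t, p), n in sorted(counts.items()):
    print(t, "->", p, n)
```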