# Preprocessor Seurat Recipe Test on Zebrafish Data

Import the packages and silence warnings (mostly the `is_categorical_dtype` warning from anndata).

In [1]:
import dynamo as dyn
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
from dynamo.configuration import DKM
import warnings
import scvelo as scv
warnings.filterwarnings('ignore')


Printing the installed package versions is like R's `sessionInfo()`: it helps you debug version-related bugs, if any.
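One way to collect such version information with only the standard library is sketched below. The helper name `package_versions` and the distribution names in the list are assumptions made for illustration (in particular, `dynamo-release` is assumed to be the installed distribution name for dynamo); this is not a dynamo API.

```python
from importlib import metadata


def package_versions(names):
    """Return a {distribution name: version string} mapping.

    Distributions that are not installed are reported as "not installed"
    instead of raising, so the report is always complete.
    """
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Assumed distribution names; adjust to the environment being debugged.
report = package_versions(["dynamo-release", "scvelo", "anndata", "numpy"])
for name, version in report.items():
    print(f"{name}: {version}")
```

Pasting this report into a bug report gives roughly the same information as a `sessionInfo()` dump in R.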

## Load data

In [2]:
adata = dyn.sample_data.zebrafish()
scv_adata = dyn.sample_data.zebrafish()

|-----> Downloading data to ./data/zebrafish.h5ad
|-----> Downloading data to ./data/zebrafish.h5ad


## Apply Seurat Flavor Gene Selection

In [3]:
# adata = dyn.sample_data.zebrafish()
from dynamo.preprocessing import Preprocessor
import pearson_residual_normalization_recipe
preprocessor = Preprocessor()
preprocessor.config_monocle_recipe(adata) # use monocle as default base config
preprocessor.config_seurat_recipe()
preprocessor.select_genes(adata, **preprocessor.select_genes_kwargs)
# preprocessor.preprocess_adata(adata)


|-----------> <insert> {} to uns['pp'] in AnnData Object.
|-----> filtering genes by dispersion...
|-----> select genes by recipe: seurat
|-----------> choose 2000 top genes
|-----> <insert> pp_gene_means to var in AnnData Object.
|-----> <insert> gene_vars to var in AnnData Object.
|-----> <insert> gene_highly_variable to var in AnnData Object.
|-----> number of selected highly variable genes: 2000
|-----> [filter genes by dispersion] in progress: 100.0000%
|-----> [filter genes by dispersion] finished [1.6386s]
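The log above reports per-gene means and variances followed by a dispersion-based choice of the top 2000 genes. As a rough illustration of that idea, here is a simplified NumPy sketch of Seurat-style selection: compute dispersion as variance over mean, bin genes by mean expression, z-score the log dispersion within each bin, and keep the top-scoring genes. The function name `seurat_like_hvg_mask`, the bin count, and the handling of single-gene bins are assumptions; this is not the code used by dynamo or scvelo and will not reproduce their exact gene lists.

```python
import numpy as np


def seurat_like_hvg_mask(X, n_top_genes=2000, n_bins=20):
    """Boolean mask of highly variable genes for a dense cells x genes matrix."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    var = X.var(axis=0, ddof=1)
    # Genes with zero mean or zero variance have no usable dispersion.
    valid = (mean > 0) & (var > 0)

    log_disp = np.full(X.shape[1], np.nan)
    log_disp[valid] = np.log(var[valid] / mean[valid])

    # Bin genes by log1p(mean) so dispersion is compared among similar genes.
    log_mean = np.log1p(mean)
    edges = np.linspace(log_mean.min(), log_mean.max(), n_bins + 1)
    bin_idx = np.digitize(log_mean, edges[1:-1])

    # Invalid genes keep a score of -inf and are therefore ranked last.
    score = np.full(X.shape[1], -np.inf)
    for b in np.unique(bin_idx):
        in_bin = (bin_idx == b) & valid
        if not in_bin.any():
            continue
        vals = log_disp[in_bin]
        sd = vals.std(ddof=1) if vals.size > 1 else 0.0
        # Assumption: bins without spread get a neutral score of 0.
        score[in_bin] = (vals - vals.mean()) / sd if sd > 0 else 0.0

    n_keep = min(n_top_genes, int(valid.sum()))
    mask = np.zeros(X.shape[1], dtype=bool)
    mask[np.argsort(-score)[:n_keep]] = True
    return mask


# Toy data: 200 cells x 30 genes of Poisson counts, with gene 0 never expressed.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=rng.uniform(1, 10, size=30), size=(200, 30)).astype(float)
counts[:, 0] = 0
mask = seurat_like_hvg_mask(counts, n_top_genes=10)
print(mask.sum(), "genes selected")
```

Normalizing within mean-expression bins is what keeps highly expressed genes from dominating the selection purely because their raw variance is larger.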


In [4]:
# Compare the preprocessing results with those of scvelo's Seurat implementation
temp_adata = scv_adata.copy()
scv.pp.filter_genes_dispersion(temp_adata, flavor="seurat", n_top_genes=2000)
preprocess_genes = adata.var_names[adata.var[DKM.VAR_GENE_HIGHLY_VARIABLE_KEY]]
scv_genes = temp_adata.var_names

assert not set(scv_genes).difference(set(preprocess_genes))
assert not set(preprocess_genes).difference(set(scv_genes))

Extracted 2000 highly variable genes.
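The two assertions above only say whether the gene sets match. When they do not, it helps to see which genes differ. A small sketch of such a report follows; the helper `compare_gene_sets` and the toy gene lists are hypothetical and not part of dynamo or scvelo.

```python
def compare_gene_sets(genes_a, genes_b):
    """Summarize how two collections of gene names differ."""
    a, b = set(genes_a), set(genes_b)
    return {
        "only_in_a": sorted(a - b),  # selected by the first method only
        "only_in_b": sorted(b - a),  # selected by the second method only
        "shared": len(a & b),        # number of genes both methods selected
    }


# Toy lists for illustration only.
summary = compare_gene_sets(["tfec", "pnp4a"], ["pnp4a", "sox10"])
print(summary)
```

In the notebook, the same call could be made with `preprocess_genes` and `scv_genes`; both "only" lists being empty corresponds to the assertions passing.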


In [5]:
# scv.pp.filter_genes(adata, min_shared_counts=20)
# scv.pp.normalize_per_cell(adata)
# scv.pp.filter_genes_dispersion(adata, n_top_genes=2000)
# scv.pp.log1p(adata)

# The two lines below are equivalent to the code commented out above
scv.pp.filter_and_normalize(scv_adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(scv_adata, n_pcs=30, n_neighbors=30)
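# scv.tl.recover_dynamics(adata, n_jobs=16)
# NOTE: the two calls below operate on `adata`, not `scv_adata`, so the
# 'velocity' layer and 'velocity_graph' are added to `adata` only.
scv.tl.velocity(adata)  # , mode='dynamical')
scv.tl.velocity_graph(adata)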

Filtered out 11388 genes that are detected 20 counts (shared).
Normalized count data: X, spliced, unspliced.
Extracted 2000 highly variable genes.
Logarithmized X.
computing neighbors
    finished (0:00:02) --> added 
    'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
computing moments based on connectivities
    finished (0:00:00) --> added 
    'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
Normalized count data: X, spliced, unspliced.
computing neighbors
    finished (0:00:00) --> added 
    'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
computing moments based on connectivities
    finished (0:00:02) --> added 
    'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
computing velocities
    finished (0:00:03) --> added 
    'velocity', velocity vectors for each individual cell (adata.layers)
computing velocity graph (using 1/16 cores)



    finished (0:00:14) --> added 
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)
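The log above mentions 'Ms' and 'Mu', moments of spliced and unspliced abundances computed from neighbor connectivities. The underlying idea, smoothing each cell's counts by averaging over its nearest neighbors, can be sketched in plain NumPy. This is a simplified illustration with a brute-force neighbor search and uniform weights; the function name `knn_first_moments` is hypothetical, and scvelo's actual implementation differs.

```python
import numpy as np


def knn_first_moments(embedding, layer, n_neighbors=3):
    """Average `layer` (cells x genes) over each cell's nearest neighbors.

    Neighbors are found by brute-force Euclidean distance in `embedding`
    (cells x dims). A cell is normally its own nearest neighbor (distance 0),
    so it is included in its own average.
    """
    embedding = np.asarray(embedding, dtype=float)
    layer = np.asarray(layer, dtype=float)
    diff = embedding[:, None, :] - embedding[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    neighbors = np.argsort(dist, axis=1)[:, :n_neighbors]
    # layer[neighbors] has shape (cells, n_neighbors, genes).
    return layer[neighbors].mean(axis=1)


# Toy data: 6 cells in a 2-D embedding with 4 genes.
rng = np.random.default_rng(1)
coords = rng.normal(size=(6, 2))
spliced = rng.poisson(lam=5, size=(6, 4)).astype(float)
Ms_toy = knn_first_moments(coords, spliced, n_neighbors=3)
print(Ms_toy.shape)
```

With `n_neighbors` equal to the number of cells, every row collapses to the global mean, which makes the smoothing effect easy to check.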


In [6]:
# scv.tl.umap(scv_adata)
# scv.pl.velocity_embedding_stream(scv_adata, basis='umap')

computing velocity embedding


KeyError: 'velocity'

Conclusion: our implementation and scvelo's select the same set of genes.

## Zebrafish Visualization Routine 

In [7]:
dyn.tl.reduceDimension(adata, basis="pca")
dyn.tl.dynamics(adata)
dyn.pl.streamline_plot(adata, color=['Cell_type'], basis='umap', show_legend='on data', show_arrowed_spines=True)
dyn.pl.phase_portraits(adata, genes=['tfec', 'pnp4a'], figsize=(6, 4), color='Cell_type')

|-----> retrive data for non-linear dimension reduction...
|-----> perform umap...
|-----> [dimension_reduction projection] in progress: 100.0000%
|-----> [dimension_reduction projection] finished [38.5518s]
|-----> dynamics_del_2nd_moments_key is None. Using default value from DynamoAdataConfig: dynamics_del_2nd_moments_key=False
|-----> calculating first/second moments...
|-----> [moments calculation] in progress: 100.0000%
|-----> [moments calculation] finished [10.3293s]


KeyError: 'M_us'

In [8]:
dyn.tl.reduceDimension(scv_adata, basis="pca")
dyn.tl.dynamics(scv_adata)
dyn.pl.streamline_plot(scv_adata, color=['Cell_type'], basis='umap', show_legend='on data', show_arrowed_spines=True)
dyn.pl.phase_portraits(scv_adata, genes=['tfec', 'pnp4a'], figsize=(6, 4), color='Cell_type')

|-----> retrive data for non-linear dimension reduction...
|-----? adata already have basis umap. dimension reduction umap will be skipped! 
set enforce=True to re-performing dimension reduction.
|-----> [dimension_reduction projection] in progress: 100.0000%
|-----> [dimension_reduction projection] finished [0.0010s]
|-----> dynamics_del_2nd_moments_key is None. Using default value from DynamoAdataConfig: dynamics_del_2nd_moments_key=False


ValueError: 
Please run `dyn.pp.receipe_monocle(adata)` before running this function!
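The errors in this notebook all point to a key or preprocessing result that a function expected but did not find. A cheap way to surface this earlier is to check for the required layers before calling the downstream function. The sketch below uses a hypothetical helper, `missing_layers`, and a stand-in object with a `layers` dictionary; it assumes only that the real object exposes a mapping-like `.layers` attribute, as AnnData does.

```python
from types import SimpleNamespace


def missing_layers(adata_like, required):
    """Return the names in `required` that are absent from `adata_like.layers`."""
    return [key for key in required if key not in adata_like.layers]


# Stand-in for an AnnData object that was never passed through velocity estimation.
toy = SimpleNamespace(layers={"spliced": None, "unspliced": None})

absent = missing_layers(toy, ["spliced", "velocity"])
if absent:
    print(f"Missing layers: {absent}; run the corresponding step first.")
```

Checking up front turns a bare `KeyError` deep inside a plotting or dynamics call into a message that names the missing prerequisite.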