# Zebrafish pigmentation

This tutorial uses data from [Saunders et al. (2019)](https://elifesciences.org/articles/45181). Special thanks go to [Lauren](https://twitter.com/LSaund11) for helping to improve the tutorial.

In this [study](https://elifesciences.org/articles/45181), the authors profiled thousands of neural crest-derived cells from trunks of post-embryonic zebrafish. These cell classes include pigment cells, multipotent pigment cell progenitors, peripheral neurons, Schwann cells, chromaffin cells and others. These cells were collected during an active period of post-embryonic development, which has many similarities to fetal and neonatal development in mammals, when many of these cell types are migrating and differentiating as the animal transitions into its adult form. This study also explores the role of thyroid hormone (TH), a common endocrine factor, on the development of these different cell types. 

Such developmental and other dynamical processes are especially suitable for dynamo analysis, as dynamo is designed to accurately estimate the direction and magnitude of expression dynamics (`RNA velocity`), predict the entire lineage trajectory from any initial cell state (`vector field`), characterize the structure of the full gene expression space (`vector field topology`), and quantify fate commitment potential (`single cell potential`).

Import the package and silence some warnings (mostly the `is_categorical_dtype` warning from anndata).

In [1]:
import warnings
warnings.filterwarnings('ignore')

import dynamo as dyn 
from dynamo.configuration import DKM
import numpy as np

This is like R's `sessionInfo()`; it helps you debug version-related bugs, if any.

In [2]:
dyn.get_all_dependencies_version()

| package | version |
| --- | --- |
| dynamo-release | 1.0.0 |
| pre-commit | 2.15.0 |
| colorcet | 2.0.6 |
| cvxopt | 1.2.7 |
| hdbscan | 0.8.27 |
| loompy | 3.0.6 |
| matplotlib | 3.4.3 |
| networkx | 2.6.3 |
| numba | 0.54.0 |
| numdifftools | 0.9.40 |
| numpy | 1.20.3 |
| pandas | 1.3.3 |
| pynndescent | 0.5.4 |
| python-igraph | 0.9.6 |
| scikit-learn | 0.24.2 |
| scipy | 1.7.1 |
| seaborn | 0.11.2 |
| setuptools | 58.0.4 |
| statsmodels | 0.12.2 |
| tqdm | 4.62.3 |
| trimap | 1.0.15 |
| umap-learn | 0.5.1 |
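
If dynamo itself is not yet importable (for example, while debugging a broken install), a similar version report can be produced with the standard library alone. The following is a minimal sketch, not dynamo's implementation; the helper name `package_versions` and the short package list are illustrative assumptions.

```python
from importlib import metadata


def package_versions(names):
    """Return {distribution name: installed version, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


# Hypothetical subset of the dependencies listed above.
report = package_versions(["dynamo-release", "numpy", "pandas", "scipy"])
for name, version in report.items():
    print(f"{name}: {version if version is not None else 'not installed'}")
```

Including such a report in a bug report makes it easier to spot version mismatches between environments.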


## Load data 

Dynamo comes with a few built-in sample datasets so you can familiarize yourself with dynamo before analyzing your own dataset.
You can read your own data via `read`, `read_loom`, `read_h5ad`, `read_h5` (powered by the [anndata](https://anndata.readthedocs.io/en/latest/anndata.AnnData.html) package), `load_NASC_seq`, etc. Here I simply load the zebrafish sample data that comes with dynamo. This dataset has 4181 cells and 16940 genes. Its `.obs` attribute also includes the `condition` and `batch` information from the original study (you should also store such information in your own `.obs` attribute, which is essentially a Pandas DataFrame; see more at [anndata](https://anndata.readthedocs.io/en/latest/)). `Cluster`, `Cell_type`, and the UMAP coordinates from the original [Monocle 3](https://cole-trapnell-lab.github.io/monocle3/) analysis are also provided.

In [3]:
adata = dyn.sample_data.zebrafish()


|-----> Downloading data to ./data/zebrafish.h5ad
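
Since `.obs` is essentially a Pandas DataFrame indexed by cell, per-cell metadata for your own data can be prepared with plain pandas. The sketch below is illustrative only: the cell identifiers and the `control`/`treated` and `b1`/`b2` labels are hypothetical and are not the values stored in the zebrafish dataset.

```python
import pandas as pd

# Hypothetical cell identifiers and metadata values, for illustration only.
cell_ids = ["cell_0", "cell_1", "cell_2", "cell_3"]
obs = pd.DataFrame(
    {
        "condition": ["control", "control", "treated", "treated"],
        "batch": ["b1", "b2", "b1", "b2"],
    },
    index=pd.Index(cell_ids, name="cell_id"),
)

# Store labels as categoricals so grouping and plotting treat them as discrete.
obs["condition"] = obs["condition"].astype("category")
obs["batch"] = obs["batch"].astype("category")

# Count cells per condition/batch combination as a quick sanity check.
cells_per_group = obs.groupby(["condition", "batch"], observed=True).size()
print(cells_per_group)
```

With an AnnData object, the same columns would typically be added one at a time, e.g. `adata.obs["condition"] = ...`, with values ordered to match the cells.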


In [4]:
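def select_genes_func(adata):
    # Flag highly variable genes with the Seurat-style dispersion recipe.
    dyn.preprocessing.select_genes_by_dispersion_general(adata, recipe="seurat")

# Alternative: plug the function into the full preprocessing pipeline.
# preprocessor = dyn.preprocessing.Preprocessor(select_genes_function=select_genes_func, use_log1p=False)
# preprocessor.preprocess_adata(adata)
select_genes_func(adata)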

|-----> filtering genes by dispersion...
|-----> n_top_genes is None, reserve all genes and add filter gene information
|-----> select genes by recipe: seurat
|-----> <insert> pp_gene_means to var in AnnData Object.
|-----> <insert> gene_vars to var in AnnData Object.
|-----> <insert> gene_highly_variable to var in AnnData Object.
|-----> number of selected highly variable genes: 2384
|-----> [filter genes by dispersion] in progress: 100.0000%
|-----> [filter genes by dispersion] finished [1.6228s]


In [5]:

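# Boolean flag per gene: True if the gene was selected as highly variable.
print(adata.var[DKM.VAR_USE_FOR_PCA])
print("#genes selected:", adata.var[DKM.VAR_USE_FOR_PCA].sum())
# Names of the selected genes, obtained by boolean-mask indexing.
pp_gene_list = adata.var_names[adata.var[DKM.VAR_USE_FOR_PCA].to_numpy()].tolist()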

tmsb4x        False
rpl8          False
ppiaa         False
rpl10a        False
rps4x         False
              ...  
cdc42ep1a     False
camk1da       False
zdhhc22       False
zgc:153681    False
mmp16b        False
Name: use_for_pca, Length: 16940, dtype: bool
#genes selected: 2384


In [6]:
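import scvelo

# Load a fresh copy so that scVelo's filtering does not modify `adata`.
scvelo_adata = dyn.sample_data.zebrafish()
# dyn.preprocessing.utils.convert2symbol(scvelo_adata)
print("#genes:", len(scvelo_adata.var_names))
# Seurat-flavored dispersion filtering; this subsets `scvelo_adata` in place.
scvelo.pp.filter_genes_dispersion(scvelo_adata, flavor="seurat", log=True)  # optionally: n_top_genes=500
print("#genes:", len(scvelo_adata.var_names))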

|-----> Downloading data to ./data/zebrafish.h5ad


#genes: 16940
Extracted 1653 highly variable genes.
#genes: 1653


In [7]:
scvelo_genes = scvelo_adata.var_names

In [8]:
# Set intersection is symmetric, so both directions must give the same count.
intersection = set(pp_gene_list).intersection(scvelo_genes)
print("#genes intersection:", len(intersection))
intersection = set(scvelo_genes).intersection(pp_gene_list)
print("#genes intersection:", len(intersection))

#genes intersection: 1652
#genes intersection: 1652
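
A raw intersection count is easier to interpret relative to the sizes of the two gene lists. The helper below is a minimal sketch of such a summary; the name `overlap_summary` and the toy gene lists are illustrative assumptions and are not part of dynamo or scVelo.

```python
def overlap_summary(genes_a, genes_b):
    """Summarize the overlap between two gene lists using set arithmetic."""
    a, b = set(genes_a), set(genes_b)
    shared = a & b
    union = a | b
    return {
        "n_shared": len(shared),
        # Fraction of each list that is also present in the other list.
        "frac_of_a": len(shared) / len(a) if a else 0.0,
        "frac_of_b": len(shared) / len(b) if b else 0.0,
        # Jaccard index: shared genes divided by all distinct genes.
        "jaccard": len(shared) / len(union) if union else 0.0,
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
    }


# Toy lists for illustration; in the notebook one would pass
# `pp_gene_list` and `scvelo_genes` instead.
toy_a = ["tmsb4x", "rpl8", "ppiaa", "rpl10a"]
toy_b = ["rpl8", "ppiaa", "mmp16b"]
summary = overlap_summary(toy_a, toy_b)
print(summary)
```

In the run above, 1652 of the 1653 genes kept by scVelo are also among the 2384 genes selected by dynamo, so scVelo's selection is almost entirely a subset of dynamo's.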
