# Annotation Notebook


This notebook lets you annotate your own data using the Tabula Sapiens dataset as a reference.

By default, it runs the following methods: `onclass`, `scanVI`, `svm`, and `singleCellNet`. Compute permitting, we suggest running all of them. If your dataset exceeds 100k cells, total runtime will be around 2-3 hours on a GPU (this estimate has not yet been verified).


## Arguments:
- **annotation_method**: list of methods to run, chosen from [`"onclass"`, `"scanvi"`, `"svm"`, `"singlecellnet"`]
- **tissue**: `None` or one of [`Bladder`, `Blood`, `Bone_Marrow`, `Kidney`, `Large_Intestine`, `Lung`, `Lymph_Node`, `Muscle`, `Pancreas`, `Skin`, `Small_Intestine`, `Spleen`, `Thymus`, `Trachea`, `Vasculature`]. If `None`, the entire Tabula Sapiens dataset is used.
- **input_anndata**: path to your input AnnData file
- **output_folder_name**: name of the folder inside `data/` to save outputs to. The folder must not already exist.
- **use_gpu**: if `True`, the GPU is used for training. Note: runtimes are significantly longer on CPU.
- **use_10X_only**: if `True`, only the 10X data from Tabula Sapiens is used. Set this to `True` only if `input_anndata` is 10X data; do not set it if you have Smart-seq2 data. Based on our observations, scanVI performs better when only 10X data is used.
- **batch_correction_conditions**: list chosen from [`"donor"`, `"method"`], or `None`

Optional arguments for scanVI:
- **scvi_model**: path to a pretrained scVI model. Default: `None`.
- **scanvi_model**: path to a pretrained scanVI model. Default: `None`.
- **n_scvi_epochs**: number of epochs to train scVI for. Default: `400`
- **n_scanvi_epochs**: number of epochs to train scanVI for. Default: `15`
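
A misspelled method or tissue name would otherwise only surface after the data has been loaded, so it can help to check the arguments up front. The following is a minimal sketch and not part of the `annotation` module: `validate_arguments`, `ALLOWED_METHODS`, `ALLOWED_TISSUES`, and `ALLOWED_BATCH_KEYS` are hypothetical names, and the allowed values are copied from the lists above.

```python
# Hypothetical helper; the allowed values mirror the argument lists in this notebook.
ALLOWED_METHODS = {"onclass", "scanvi", "svm", "singlecellnet"}
ALLOWED_TISSUES = {
    "Bladder", "Blood", "Bone_Marrow", "Kidney", "Large_Intestine",
    "Lung", "Lymph_Node", "Muscle", "Pancreas", "Skin",
    "Small_Intestine", "Spleen", "Thymus", "Trachea", "Vasculature",
}
ALLOWED_BATCH_KEYS = {"donor", "method"}


def validate_arguments(annotation_method, tissue, batch_correction_conditions):
    """Raise ValueError if an argument falls outside the documented values."""
    unknown_methods = sorted(set(annotation_method) - ALLOWED_METHODS)
    if unknown_methods:
        raise ValueError("Unknown annotation method(s): {}".format(unknown_methods))

    # tissue may be None, meaning "use the entire reference dataset".
    if tissue is not None and tissue not in ALLOWED_TISSUES:
        raise ValueError("Unknown tissue: {!r}".format(tissue))

    # batch_correction_conditions may be None, meaning "no batch correction keys".
    if batch_correction_conditions is not None:
        unknown_keys = sorted(set(batch_correction_conditions) - ALLOWED_BATCH_KEYS)
        if unknown_keys:
            raise ValueError("Unknown batch correction key(s): {}".format(unknown_keys))
```

Note that the method names are case-sensitive in this sketch, so `"scanVI"` would be rejected while `"scanvi"` is accepted.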


In [9]:
annotation_method = ['onclass', 'scanvi', 'svm', 'singlecellnet']
tissue = 'Blood'
input_anndata = 'data/adata_small_test.h5ad'
output_folder_name = 'outputs3'
use_gpu = True
use_10X_only = True
batch_correction_conditions = ['donor', 'method']

# Optional arguments for scanVI:
scvi_model = None
scanvi_model = None
n_scvi_epochs = 400
n_scanvi_epochs = 15

In [10]:
import os
from annotation import setup_dataset, scanvi_annotation, svm_annotation, onclass_annotation, singlecellnet_annotation

# Outputs go to data/<output_folder_name>; refuse to overwrite an existing folder.
output_folder = os.path.join('data', output_folder_name)
if not os.path.exists(output_folder):
    os.makedirs(output_folder)
else:
    raise ValueError("{} already exists. Please provide a directory that does not exist yet to save outputs.".format(output_folder))

# Placeholder reference file; will be changed to the actual Tabula Sapiens dataset.
tabula_sapiens_filepath = 'data/OnClass_data/data_used_for_training/tabula-muris-senis-facs_cell_ontology_test.h5ad'
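
The check-then-create pattern above can also be written as a single step. As a sketch using only the standard library: `os.makedirs` with `exist_ok=False` (its default) raises `FileExistsError` when the target already exists, so the existence check and the creation cannot disagree. The helper name `create_fresh_output_folder` is hypothetical and not part of the `annotation` module.

```python
import os
import tempfile


def create_fresh_output_folder(parent, name):
    """Create parent/name and return its path; refuse to reuse an existing folder."""
    path = os.path.join(parent, name)
    try:
        # exist_ok=False makes the call fail if the folder is already there.
        os.makedirs(path, exist_ok=False)
    except FileExistsError:
        raise ValueError(
            "{} already exists. Please provide a directory that does not exist yet.".format(path)
        ) from None
    return path


# Demonstration in a throwaway directory, so no real output folder is touched.
with tempfile.TemporaryDirectory() as tmp:
    created = create_fresh_output_folder(tmp, "outputs3")
    print(os.path.isdir(created))
```

In the notebook this would be called as `create_fresh_output_folder('data', output_folder_name)`; a second call with the same name raises `ValueError`, matching the behaviour of the cell above.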


In [6]:

full = setup_dataset(input_anndata, tabula_sapiens_filepath, tissue, use_10X_only, batch_correction_conditions)

if 'scanvi' in annotation_method:
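    # Pass the notebook arguments through instead of hard-coded values.
    # Note: the output recorded below comes from a quick test run with
    # n_scvi_epochs = n_scanvi_epochs = 1.
    scanvi_annotation(
        full_dataset=full,
        batch_key='batch_indices',
        output_folder=output_folder,
        ts_label_key='manual_cell_ontology_class',
        scvi_model=scvi_model,
        scanvi_model=scanvi_model,
        n_scvi_epochs=n_scvi_epochs,
        n_scanvi_epochs=n_scanvi_epochs,
        use_gpu=use_gpu)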
    
if "onclass" in annotation_method:
    onclass_annotation(input_anndata, output_folder)

if "svm" in annotation_method:
    svm_annotation(
        full,
        batch_key='_batch_indices',
        output_folder=output_folder)

if "singlecellnet" in annotation_method:
    singlecellnet_annotation()

Trying to set attribute `.obs` of view, copying.


data preprocessing
[2020-08-18 14:56:37,756] INFO - scvi.dataset.anndataset | Dense size under 1Gb, casting to dense format (np.ndarray).
[2020-08-18 14:56:37,898] INFO - scvi.dataset.dataset | Remapping batch_indices to [0,N]
[2020-08-18 14:56:37,900] INFO - scvi.dataset.dataset | Remapping labels to [0,N]
[2020-08-18 14:56:38,271] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2020-08-18 14:56:38,325] INFO - scvi.dataset.dataset | Downsampled from 1044 to 1044 cells
Training scVI
[2020-08-18 14:56:41,265] INFO - scvi.inference.inference | KL warmup phase exceeds overall training phaseIf your applications rely on the posterior quality, consider training for more epochs or reducing the kl warmup.
[2020-08-18 14:56:41,269] INFO - scvi.inference.inference | KL warmup for 10 epochs




[2020-08-18 14:56:50,477] INFO - scvi.inference.inference | Training is still in warming up phase. If your applications rely on the posterior quality, consider training for more epochs or reducing the kl warmup.
training scanpy
[2020-08-18 14:56:50,824] INFO - scvi.inference.inference | KL warmup for 1 epochs




[2020-08-18 14:57:00,780] INFO - scvi.inference.inference | Training is still in warming up phase. If your applications rely on the posterior quality, consider training for more epochs or reducing the kl warmup.
[2020-08-18 14:57:00,993] INFO - scvi.dataset.anndataset | Dense size under 1Gb, casting to dense format (np.ndarray).
[2020-08-18 14:57:01,082] INFO - scvi.dataset.dataset | Remapping batch_indices to [0,N]
[2020-08-18 14:57:01,084] INFO - scvi.dataset.dataset | Remapping labels to [0,N]




[2020-08-18 14:57:01,483] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2020-08-18 14:57:01,524] INFO - scvi.dataset.dataset | Downsampled from 1019 to 1019 cells


Trying to set attribute `.obs` of view, copying.
... storing 'tissue' as categorical
... storing 'manual_cell_ontology_class' as categorical
... storing 'method' as categorical
... storing 'donor' as categorical
... storing '_batch' as categorical
... storing 'age' as categorical
... storing 'cell_ontology_class' as categorical
... storing 'cell_ontology_id' as categorical
... storing 'free_annotation' as categorical
... storing 'mouse.id' as categorical
... storing 'sex' as categorical
... storing 'subtissue' as categorical
... storing 'tissue_free_annotation' as categorical
... storing '_dataset' as categorical


Embed the cell ontology
init OnClass
Here, we used the pretrain cell type embedding file tp2emb_500
100.000000 precentage of labels are in the Cell Ontology
Loading the new dataset.
97.055937 precentage of labels are in the Cell Ontology
Predicting the labels of cells

For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

Instructions for updating:
Colocations handled automatically by placer.




Instructions for updating:
Use standard file APIs to check for files with this prefix.




INFO:tensorflow:Restoring parameters from data/OnClass_data/pretrain/BilinearNN_50019




training finished
number of intersection genes 20138


Trying to set attribute `.obs` of view, copying.
... storing 'OnClass_annotation_ontology_name' as categorical


989 989


... storing 'tissue' as categorical
... storing 'manual_cell_ontology_class' as categorical
... storing 'method' as categorical
... storing 'donor' as categorical
... storing '_batch' as categorical
... storing 'age' as categorical
... storing 'cell' as categorical
... storing 'cell_ontology_class' as categorical
... storing 'cell_ontology_id' as categorical
... storing 'free_annotation' as categorical
... storing 'mouse.id' as categorical
... storing 'sex' as categorical
... storing 'subtissue' as categorical
... storing 'tissue_free_annotation' as categorical
... storing '_dataset' as categorical
Trying to set attribute `.obs` of view, copying.
  if is_strin