# Using Tabula Sapiens as a reference for annotating new datasets
This notebook allows you to annotate your data with a number of annotation methods using the Tabula Sapiens dataset as the reference. 

This notebook is also available as a [Google Colab notebook](https://colab.research.google.com/drive/1KEsTbySmXtnOeQo4lJu8OGtPQjHRs1R4#scrollTo=ObY9kyf7exOx).

Integration methods provided:
- scVI
- BBKNN
- Scanorama

Annotation methods provided:
- KNN on integrated spaces
- scANVI
- OnClass
- SVM
- Random forest

**User action is only required in Step 2 and Step 3.**


# Step 1: Set Up Environment
No user input required here.

In [2]:
import sys
import os
import anndata
import numpy as np
import scanpy as sc
import scvi

# Step 2: Load your data

Load your data into `query_adata`.

In [3]:
# Read in your data (an .h5ad file) with the following command
query_adata = anndata.read_h5ad("data/LCA_raw.h5ad")


In [4]:
query_adata

AnnData object with n_obs × n_vars = 75071 × 23681
    obs: 'cell_id', 'method', 'donor', 'cell_ontology_type', 'donor_method', 'cell_ontology_id'
    var: 'feature_types.0.0-0', 'n_cells.0.0-0', 'gene_symbol.0.0-0', 'n_cells.1.0-0', 'n_cells.0-0', 'n_cells.1.1-0', 'feature_types.0.0.0.1-0', 'gene_symbol.0.0.0.1-0', 'n_cells.1.0.0.1-0', 'n_cells.1.0.1-0', 'n_cells-0', 'len-0', 'ensembl_id-0', 'contamination_prop-0-0', 'contamination_prop-1-0', 'contamination_prop-10-0', 'contamination_prop-11-0', 'contamination_prop-12-0', 'contamination_prop-13-0', 'contamination_prop-14-0', 'contamination_prop-2-0', 'contamination_prop-3-0', 'contamination_prop-4-0', 'contamination_prop-5-0', 'contamination_prop-6-0', 'contamination_prop-7-0', 'contamination_prop-8-0', 'contamination_prop-9-0'
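
scVI and scANVI are generally trained on raw counts, so it can be worth confirming that the matrix you loaded holds non-negative integer values before continuing. The helper below is a hypothetical sketch (the name `looks_like_raw_counts` is not part of any library). It accepts a dense NumPy array or a SciPy sparse matrix, such as `query_adata.X` or an entry of `query_adata.layers`.

```python
import numpy as np


def looks_like_raw_counts(matrix):
    """Return True if every stored value is a non-negative whole number."""
    # SciPy sparse matrices expose tocsr(); their stored values live in .data.
    if hasattr(matrix, "tocsr"):
        values = matrix.tocsr().data
    else:
        values = np.asarray(matrix)
    return bool(np.all(values >= 0) and np.all(np.mod(values, 1) == 0))


# Toy example: integer counts pass, log-transformed values do not.
counts = np.array([[0, 3, 1], [2, 0, 0]])
normalized = np.log1p(counts)
print(looks_like_raw_counts(counts), looks_like_raw_counts(normalized))
```

If the check fails for `query_adata.X`, the raw counts may be stored in a layer instead; in that case `query_layers_key` in Step 3 is the parameter to set.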

# Step 3: Setting Up Annotation Parameters (User Action Required)

Here is where you set the parameters for the automated annotation.

Arguments:
- **tissue:** Tabula Sapiens tissue to use as the reference for annotating your data. Available tissues: ["Bladder", "Blood", "Bone_Marrow", "Kidney", "Large_Intestine", "Lung", "Lymph_Node", "Pancreas", "Small_Intestine", "Spleen", "Thymus", "Trachea", "Vasculature"]
- **save_folder:** folder to save results to. By default, results are saved to a folder named `annotation_results`.
- **query_batch_key:** key in `query_adata.obs` to use for batch correction. Set to `None` for no batch correction.
- **methods:** list of methods to run. By default, all methods are run.
- **training_mode:** either `online` or `offline`. With `offline`, the scVI and scANVI models are trained from scratch. With `online`, pretrained models are used.
- **query_layers_key:** key in `query_adata.layers` that holds the count data.

Less commonly used parameters:
- **query_labels_key:** scANVI can optionally use labeled cells from the query dataset during training. To use prelabeled query cells, set `query_labels_key` to the corresponding key in `query_adata.obs`.
- **unknown_celltype_label:** if `query_labels_key` is not `None`, every cell whose label is not `unknown_celltype_label` is treated as a labeled cell.
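
A quick sanity check of these settings before running the pipeline can catch typos early. The sketch below is not part of `annotation.py`: the helper name `check_annotation_params` is hypothetical, and the accepted values are simply the ones listed in this notebook. It takes the `obs` column names and layer names as plain lists, so it could be called with `list(query_adata.obs.columns)` and `list(query_adata.layers.keys())`.

```python
TISSUES = ["Bladder", "Blood", "Bone_Marrow", "Kidney", "Large_Intestine",
           "Lung", "Lymph_Node", "Pancreas", "Small_Intestine", "Spleen",
           "Thymus", "Trachea", "Vasculature"]
METHODS = ["bbknn", "scvi", "scanvi", "svm", "rf", "onclass", "scanorama"]


def check_annotation_params(obs_columns, layer_names, tissue, methods,
                            training_mode, query_batch_key=None,
                            query_layers_key=None, query_labels_key=None):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if tissue not in TISSUES:
        problems.append(f"unknown tissue: {tissue!r}")
    unknown = [m for m in methods if m not in METHODS]
    if unknown:
        problems.append(f"unknown methods: {unknown}")
    if training_mode not in ("online", "offline"):
        problems.append(f"training_mode must be 'online' or 'offline', got {training_mode!r}")
    # Keys that must name a column of query_adata.obs when they are set.
    for name, key in (("query_batch_key", query_batch_key),
                      ("query_labels_key", query_labels_key)):
        if key is not None and key not in obs_columns:
            problems.append(f"{name}={key!r} is not a column of query_adata.obs")
    if query_layers_key is not None and query_layers_key not in layer_names:
        problems.append(f"query_layers_key={query_layers_key!r} is not in query_adata.layers")
    return problems


# Toy call with hand-written column names standing in for query_adata.obs.columns.
problems = check_annotation_params(
    obs_columns=["cell_id", "method", "donor"],
    layer_names=[],
    tissue="Lung",
    methods=["bbknn", "scvi"],
    training_mode="offline",
    query_batch_key="method",
)
print(problems)
```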

In [5]:
""" 
tissue options: 
["Bladder", "Blood", "Bone_Marrow", "Kidney", "Large_Intestine", "Lung",
 "Lymph_Node", "Pancreas", "Small_Intestine", "Spleen", "Thymus",
 "Trachea", "Vasculature"]
"""
tissue = 'Lung'
save_folder = 'data/docker_container_test'
query_batch_key = 'method'
methods = ['bbknn', 'scvi', 'scanvi', 'svm', 'rf', 'onclass', 'scanorama']
training_mode = 'offline'
query_layers_key = None

# Less commonly used parameters
query_labels_key = None
unknown_celltype_label = 'unknown'

# Step 4: Downloading Reference Data and Pretrained Models
No more user input required! Just run all the following code blocks.
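
The `if`/`elif` chain in the download cell leaves `refdata_url` and `pretrained_url` undefined when `tissue` is misspelled. One possible alternative, sketched here under the assumption that failing early is preferable, is a dictionary lookup that raises a clear error. The helper name `get_download_urls` is hypothetical, and only two entries (copied from the download cell) are shown; the remaining tissues would be added the same way.

```python
# Maps tissue name -> (reference data URL, pretrained models URL).
DOWNLOAD_URLS = {
    "Blood": (
        "https://ndownloader.figshare.com/files/27388853",
        "https://www.dropbox.com/s/kyh9nv202n0db65/Blood.tar.gz?dl=1",
    ),
    "Lung": (
        "https://ndownloader.figshare.com/files/27388832",
        "https://www.dropbox.com/s/e4al4ia9hm9qtcg/Lung.tar.gz?dl=1",
    ),
}


def get_download_urls(tissue):
    """Return (refdata_url, pretrained_url) or raise ValueError for an unknown tissue."""
    try:
        return DOWNLOAD_URLS[tissue]
    except KeyError:
        known = ", ".join(sorted(DOWNLOAD_URLS))
        raise ValueError(f"Unknown tissue {tissue!r}; expected one of: {known}") from None


refdata_url, pretrained_url = get_download_urls("Lung")
print(refdata_url)
print(pretrained_url)
```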

In [6]:
# Here we download the necessary data:
if tissue == 'Bladder':
  refdata_url = 'https://ndownloader.figshare.com/files/27388874'
  pretrained_url='https://www.dropbox.com/s/rb89y577l6vs2mm/Bladder.tar.gz?dl=1'
elif tissue == 'Blood':
  refdata_url = 'https://ndownloader.figshare.com/files/27388853'
  pretrained_url = 'https://www.dropbox.com/s/kyh9nv202n0db65/Blood.tar.gz?dl=1'
elif tissue == 'Bone_Marrow':
  refdata_url = 'https://ndownloader.figshare.com/files/27388841'
  pretrained_url = 'https://www.dropbox.com/s/a3r4ddg7o7kua7z/Bone_Marrow.tar.gz?dl=1'
elif tissue == 'Kidney':
  refdata_url = 'https://ndownloader.figshare.com/files/27388838'
  pretrained_url = 'https://www.dropbox.com/s/k41r1a346z0tuip/Kidney.tar.gz?dl=1'
elif tissue == 'Large_Intestine':
  refdata_url = 'https://ndownloader.figshare.com/files/27388835'
  pretrained_url = 'https://www.dropbox.com/s/jwvpk727hd54byd/Large_Intestine.tar.gz?dl=1'
elif tissue == 'Lung':
  refdata_url = 'https://ndownloader.figshare.com/files/27388832'
  pretrained_url = 'https://www.dropbox.com/s/e4al4ia9hm9qtcg/Lung.tar.gz?dl=1'
elif tissue == 'Lymph_Node':
  refdata_url = 'https://ndownloader.figshare.com/files/27388715'
  pretrained_url = 'https://www.dropbox.com/s/mbejy9tcbx9e1yv/Lymph_Node.tar.gz?dl=1'
elif tissue == 'Pancreas':
  refdata_url = 'https://ndownloader.figshare.com/files/27388613'
  pretrained_url = 'https://www.dropbox.com/s/r3klvr22m6kq143/Pancreas.tar.gz?dl=1'
elif tissue == 'Small_Intestine':
  refdata_url = 'https://ndownloader.figshare.com/files/27388559'
  pretrained_url = 'https://www.dropbox.com/s/7eiv2mke70jinzc/Small_Intestine.tar.gz?dl=1'
elif tissue == 'Spleen':
  refdata_url = 'https://ndownloader.figshare.com/files/27388544'
  pretrained_url = 'https://www.dropbox.com/s/6j3iwahsjnb8rb3/Spleen.tar.gz?dl=1'
elif tissue == 'Thymus':
  refdata_url = 'https://ndownloader.figshare.com/files/27388505'
  pretrained_url='https://www.dropbox.com/s/9k0mneu2wvpiudz/Thymus.tar.gz?dl=1'
elif tissue == 'Trachea':
  refdata_url = 'https://ndownloader.figshare.com/files/27388460'
  pretrained_url = 'https://www.dropbox.com/s/57tthfgkl8jtxk6/Trachea.tar.gz?dl=1'
elif tissue == 'Vasculature':
  refdata_url = 'https://ndownloader.figshare.com/files/27388451'
  pretrained_url='https://www.dropbox.com/s/1wt3r871kxjas5o/Vasculature.tar.gz?dl=1'

#TODO: save into data folder (will automatically be saved to the folder mounted by user)
    
# Download reference dataset
output_fn = 'TS_{}.h5ad'.format(tissue)
!wget -O data/$output_fn $refdata_url

# Download pretrained scVI and scANVI models.
output_fn = '{}.tar.gz'.format(tissue)
output_fn = os.path.join('data', output_fn)
!wget -O $output_fn $pretrained_url
# Extract the downloaded archive into the working directory
!tar -xvzf $output_fn

# Download onclass files
!wget -O data/cl.obo -q https://www.dropbox.com/s/hodp0etapzrd8ak/cl.obo?dl=1 
!wget -O data/cl.ontology -q https://www.dropbox.com/s/nes0zprzfbwbgj5/cl.ontology?dl=1
!wget -O data/cl.ontology.nlp.emb https://www.dropbox.com/s/y9x9yt2pi7s0d1n/cl.ontology.nlp.emb?dl=1

# Download annotation code
!wget -O data/annotation.py -q https://www.dropbox.com/s/id8sallwrunjc5c/annotation.py?dl=1

--2021-05-07 02:34:26--  https://ndownloader.figshare.com/files/27388832
Resolving ndownloader.figshare.com (ndownloader.figshare.com)... 52.48.216.177, 18.200.61.225, 34.247.134.83, ...
Connecting to ndownloader.figshare.com (ndownloader.figshare.com)|52.48.216.177|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2673660872 (2.5G) [application/octet-stream]
Saving to: ‘data/TS_Lung.h5ad’


2021-05-07 02:36:45 (18.5 MB/s) - ‘data/TS_Lung.h5ad’ saved [2673660872/2673660872]

--2021-05-07 02:36:46--  https://www.dropbox.com/s/e4al4ia9hm9qtcg/Lung.tar.gz?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.7.18, 2620:100:601a:18::a27d:712
Connecting to www.dropbox.com (www.dropbox.com)|162.125.7.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/dl/e4al4ia9hm9qtcg/Lung.tar.gz [following]
--2021-05-07 02:36:46--  https://www.dropbox.com/s/dl/e4al4ia9hm9qtcg/Lung.tar.gz
Reusing existing connection to www.dropbox.c

In [7]:
!pip install obonet



In [8]:
# Here we set up the query dataset together with the reference dataset
import importlib
import os

import annotation
importlib.reload(annotation)  # reload so that edits to annotation.py take effect
from annotation import process_query

# Following parameters are specific to Tabula Sapiens dataset
ref_labels_key='Annotation'
ref_adata_path = 'data/TS_{}.h5ad'.format(tissue)
ref_layers_key = 'raw_counts'

pretrained_scanvi_path = os.path.join(tissue, tissue + "_scanvi_model")
pretrained_scvi_path = os.path.join(tissue, tissue + "_scvi_model")

adata = process_query(query_adata,
                      tissue=tissue,
                      save_folder=save_folder,
                      query_batch_key=query_batch_key,
                      query_layers_key=query_layers_key,
                      query_labels_key=query_labels_key,
                      unknown_celltype_label=unknown_celltype_label,
                      pretrained_scvi_path=pretrained_scvi_path,
                      ref_labels_key=ref_labels_key, 
                      ref_layers_key=ref_layers_key,
                      training_mode=training_mode,
                      ref_adata_path=ref_adata_path)

Is not subset, training offline.
Sampling 100 per label


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ref_adata.obs["_ref_subsample"][ref_subsample_idx] = True
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
  res = method(*args, **kwargs)
  df.loc[: int(n_top_genes), 'highly_variable'] = True


INFO     Using batches from adata.obs["_batch_annotation"]
INFO     Using labels from adata.obs["_labels_annotation"]
INFO     Using data from adata.layers["scvi_counts"]
INFO     Computing library size prior per batch
INFO     Successfully registered anndata object containing 104505 cells, 4000 vars, 6
         batches, 36 labels, and 0 proteins. Also registered 0 extra categorical covariates
         and 0 extra continuous covariates.
INFO     Please do not further modify adata until model is trained.


... storing 'cell_id' as categorical
... storing 'method' as categorical
... storing 'donor' as categorical
... storing 'cell_ontology_type' as categorical
... storing 'donor_method' as categorical
... storing 'cell_ontology_id' as categorical
... storing '_batch_annotation' as categorical
... storing '_dataset' as categorical
... storing 'final_annotation_cell_ontology_id' as categorical
... storing '_labels_annotation' as categorical
... storing 'Annotation' as categorical
... storing 'Manually Annotated' as categorical
... storing 'Donor' as categorical
... storing 'Method' as categorical
... storing 'Organ' as categorical
... storing 'Compartment' as categorical
... storing 'Anatomical Information' as categorical
... storing '_batch' as categorical
... storing '_batch_annotation' as categorical
... storing '_dataset' as categorical
... storing 'final_annotation_cell_ontology_id' as categorical
... storing '_labels_annotation' as categorical


In [9]:
query_adata


AnnData object with n_obs × n_vars = 75071 × 23681
    obs: 'cell_id', 'method', 'donor', 'cell_ontology_type', 'donor_method', 'cell_ontology_id', '_batch_annotation', '_dataset', '_ref_subsample', 'final_annotation_cell_ontology_id', '_labels_annotation'
    var: 'feature_types.0.0-0', 'n_cells.0.0-0', 'gene_symbol.0.0-0', 'n_cells.1.0-0', 'n_cells.0-0', 'n_cells.1.1-0', 'feature_types.0.0.0.1-0', 'gene_symbol.0.0.0.1-0', 'n_cells.1.0.0.1-0', 'n_cells.1.0.1-0', 'n_cells-0', 'len-0', 'ensembl_id-0', 'contamination_prop-0-0', 'contamination_prop-1-0', 'contamination_prop-10-0', 'contamination_prop-11-0', 'contamination_prop-12-0', 'contamination_prop-13-0', 'contamination_prop-14-0', 'contamination_prop-2-0', 'contamination_prop-3-0', 'contamination_prop-4-0', 'contamination_prop-5-0', 'contamination_prop-6-0', 'contamination_prop-7-0', 'contamination_prop-8-0', 'contamination_prop-9-0'
    layers: 'scvi_counts'

In [None]:
!pip install bbknn

# Step 5: Run Automated Cell Annotation Methods
No user action required. Takes about 30 minutes to run. 

Your results will be saved to the folder you provided as **save_folder**.

There will be the following files:
- `annotated_query.h5ad` containing annotated query cells. The consensus annotations will be in `consensus_prediction`. There will also be a `consensus_percentage` field which is the percentage of methods that had the same prediction. 
- `annotated_query_plus_ref.h5ad` containing your query and the reference cells with predicted annotations. 
- `confusion_matrices.pdf` containing the confusion matrices between the consensus predictions (`consensus_prediction`) and each individual method.
- `csv` files containing the metrics for each confusion matrix. 
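
To make the two consensus fields concrete, here is a minimal sketch of a majority vote over the per-method predictions for a single cell. It illustrates the idea only and is not the code in `annotation.py`: the function name `consensus_vote`, the tie-breaking rule, and the 0 to 100 scale are all assumptions.

```python
from collections import Counter


def consensus_vote(predictions):
    """Return (most common label, percentage of methods that predicted it)."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    # most_common(1) yields the label with the highest count; on ties,
    # the label that was encountered first wins.
    label, n_votes = counts.most_common(1)[0]
    return label, 100.0 * n_votes / len(predictions)


# Four hypothetical methods voting on one cell.
label, pct = consensus_vote(["T cell", "T cell", "B cell", "T cell"])
print(label, pct)  # T cell 75.0
```

A low percentage means the methods disagree on that cell, which is why the plots in Step 6 focus on this field.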


In [None]:
import annotation
importlib.reload(annotation)
from annotation import annotate_data

annotate_data(adata,
              methods, 
              save_folder,
              pretrained_scvi_path=pretrained_scvi_path,
              pretrained_scanvi_path=pretrained_scanvi_path)

Integrating data with bbknn.
Classifying with knn on bbknn distances.


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  adata.obs[result_key][query_idx] = knn_pred


Saved knn on bbknn results to adata.obs["knn_on_bbknn_pred"]


... storing 'knn_on_bbknn_pred' as categorical
... storing 'knn_on_bbknn_pred' as categorical
  return torch._C._cuda_getDeviceCount() > 0
GPU available: False, used: False
TPU available: None, using: 0 TPU cores


Running scVI.
Training scvi offline.
Epoch 1/77:   0%|          | 0/77 [00:00<?, ?it/s]



# Step 6: Generate Statistics and Figures
No user action required.

## Agreements
First, we define some variables from the results file to keep the code below cleaner.

In [15]:
!pip install bbknn

Collecting bbknn
  Downloading bbknn-1.4.1-py3-none-any.whl (9.4 kB)
Collecting Cython
  Downloading Cython-0.29.23-cp38-cp38-manylinux1_x86_64.whl (1.9 MB)
     |████████████████████████████████| 1.9 MB 4.5 MB/s eta 0:00:01
Installing collected packages: Cython, bbknn
Successfully installed Cython-0.29.23 bbknn-1.4.1


In [16]:
import pandas as pd
import matplotlib.pyplot as plt

results_file = os.path.join(save_folder,'annotated_query_plus_ref.h5ad')
results = anndata.read_h5ad(results_file)
is_query = results.obs._dataset == "query"
methods = [x for x in results.obs.columns if x.endswith("_pred")]
labels = results.obs.consensus_prediction.astype(str)
labels[~is_query] = results[~is_query].obs._labels_annotation.astype(str)
celltypes = np.unique(labels)
latent_methods = results.obsm.keys()

AttributeError: 'DataFrame' object has no attribute 'consensus_prediction'

### Distribution of consensus percentage
The more the algorithms agree with each other, the more reliable the annotation is likely to be.


In [None]:
agreement_counts = pd.DataFrame(
    np.unique(results[is_query].obs["consensus_percentage"], return_counts=True)
).T

agreement_counts.columns = ["Percent Agreement", "Count"]
agreement_counts.plot.bar(
    x="Percent Agreement", y="Count", legend=False, figsize=(4, 3)
)
plt.ylabel("Frequency")
plt.xlabel("Percent of Algorithms Agreeing with Majority Vote")
figpath = os.path.join(save_folder, "Concensus_Percentage_barplot.pdf")
plt.savefig(figpath, bbox_inches="tight")

### Per cell type agreement
Some cell types are predicted more reliably than others. Looking at the per-cell-type agreement highlights the cell types that are poorly predicted. Cells are grouped by their consensus prediction.

In [None]:
mean_agreement = [
    np.mean(results[is_query & (labels == x)].obs["consensus_percentage"].astype(float))
    for x in celltypes
]
mean_agreement = pd.DataFrame([mean_agreement], index=["agreement"]).T
mean_agreement.index = celltypes

mean_agreement = mean_agreement.sort_values("agreement", ascending=True)
mean_agreement.plot.bar(y="agreement", figsize=(15, 2), legend=False)
plt.ylabel("Mean Agreement")
plt.xticks(rotation=290, ha="left")
figpath = os.path.join(save_folder, "percelltype_agreement_barplot.pdf")
plt.savefig(figpath, bbox_inches="tight")

### Cell type proportion plot

In [None]:
prop = pd.DataFrame(index=celltypes, columns=["ref", "query"])
for x in celltypes:
    prop.loc[x, "query"] = np.sum(labels[is_query] == x)
    prop.loc[x, "ref"] = np.sum(labels[~is_query] == x)


In [None]:
prop.loc[mean_agreement.index].plot(kind='bar', figsize=(len(celltypes)*0.5,4),logy=True)
plt.legend(bbox_to_anchor=(1, 0.9))
plt.ylabel('log Celltype Abundance')
plt.tight_layout()
figpath = os.path.join(save_folder, 'celltype_prop_barplot.pdf')
plt.savefig(figpath, bbox_inches="tight")
plt.show()
plt.close()
