# Annotation Tutorial

**NB**: please refer to the scVI-dev notebook for an introduction to the scVI package.

In this notebook, we investigate how semi-supervised learning combined with the probabilistic modelling of latent variables in scVI can help address the annotation problem.

The annotation problem consists of labelling cells, i.e. **inferring their cell types**, when only some of the labels are known.
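To make the setup concrete, here is a minimal sketch of how a dataset can be split into a small labelled set and a large unlabelled set with a fixed number of labelled cells per class. The helper `split_labelled` and the toy labels are hypothetical and only illustrate the idea; they are not part of scVI.

```python
import random
from collections import defaultdict


def split_labelled(labels, n_per_class, seed=0):
    """Hypothetical helper: pick at most n_per_class labelled cells per class.

    Returns (labelled_idx, unlabelled_idx) as two sorted lists of indices.
    """
    rng = random.Random(seed)

    # Group cell indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Draw n_per_class cells at random from every class.
    labelled = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        labelled.extend(indices[:n_per_class])

    # Every remaining cell is treated as unlabelled.
    labelled_set = set(labelled)
    unlabelled = [i for i in range(len(labels)) if i not in labelled_set]
    return sorted(labelled), unlabelled


# Toy example: 3 cell types with 50, 30 and 20 cells.
labels = [0] * 50 + [1] * 30 + [2] * 20
labelled_idx, unlabelled_idx = split_labelled(labels, n_per_class=10)
print(len(labelled_idx), len(unlabelled_idx))  # 30 70
```

In the annotation setting, the model sees the labels of the first group only, and the labels of the second group are held back to measure accuracy.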

In [1]:
cd ~/scVI

/home/ubuntu/scVI


In [5]:
from scvi.dataset import CortexDataset
from scvi.models import SCANVI, VAE
from scvi.inference import JointSemiSupervisedTrainer

We instantiate the SCANVI model and train it for 100 epochs. Only the labels of the labelled set (`trainer.labelled_set`, 10 cells per class here) are used during training; to validate the results, the labels of the unlabelled set (`trainer.unlabelled_set`) are used at test time only. In the run shown below, the accuracy on the unlabelled set is about 92% at the end of training.

In [6]:
gene_dataset = CortexDataset()

use_batches=False
use_cuda=True

scanvi = SCANVI(gene_dataset.nb_genes, gene_dataset.n_batches, gene_dataset.n_labels)
trainer = JointSemiSupervisedTrainer(scanvi, gene_dataset, n_labelled_samples_per_class=10, classification_ratio=100)
trainer.train(n_epochs=100)

trainer.unlabelled_set.accuracy()

File data/expression.bin already downloaded
Preprocessing Cortex data
Finished preprocessing Cortex data
training: 100%|██████████| 100/100 [00:42<00:00,  2.38it/s]


0.91686541737649063

**Benchmarking against other algorithms**

We can compare SCANVI against random forest and SVM classifiers, whose hyperparameters are tuned by grid search with 3-fold cross-validation. This is performed automatically by the functions **`compute_accuracy_svc`** and **`compute_accuracy_rf`**.

These functions take as input the NumPy arrays corresponding to the labelled and unlabelled sets, which can be obtained with the **`raw_data`** method of `trainer.labelled_set` and `trainer.unlabelled_set`.
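As an illustration of what a grid search with 3-fold cross-validation looks like, here is a minimal sketch using scikit-learn's `GridSearchCV` on synthetic data. The parameter grids and the synthetic dataset are assumptions chosen for the example; the grids used inside `compute_accuracy_svc` and `compute_accuracy_rf` may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for (data_train, labels_train, data_test, labels_test).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

# Illustrative grids only (assumption, not the grids used by scVI).
searches = {
    "svc": GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=3),
    "rf": GridSearchCV(
        RandomForestClassifier(random_state=0),
        {"n_estimators": [20, 50], "max_depth": [None, 5]},
        cv=3,
    ),
}

test_accuracy = {}
for name, search in searches.items():
    # Cross-validate every grid point on the training data, then refit the
    # best configuration on the whole training set.
    search.fit(X_train, y_train)
    # Score the refitted best estimator on the held-out test data.
    test_accuracy[name] = search.score(X_test, y_test)
    print(name, search.best_params_, round(test_accuracy[name], 3))
```

The hyperparameters are selected using the training data only, so the test accuracy remains an honest estimate of how each baseline generalises.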

The results are reported as `Accuracy` named tuples, which break the accuracy down in more detail through the following attributes:

- **unweighted**: the standard definition of accuracy

- **weighted**: the accuracy obtained when every class is given the same weight, i.e. the mean of the per-class accuracies. It differs from the unweighted accuracy only when the dataset is unbalanced.

- **worst**: the lowest accuracy among the classes

- **accuracy_classes**: the accuracy for each class
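The four attributes above can be sketched in a few lines of plain Python. The `Accuracy` tuple and the function `accuracy_summary` below are stand-ins written for this illustration, under the assumption that the weighted score is the mean of the per-class accuracies; they are not the scVI implementation.

```python
from collections import namedtuple

# Stand-in for the named tuple described above (illustration only).
Accuracy = namedtuple(
    "Accuracy", ["unweighted", "weighted", "worst", "accuracy_classes"]
)


def accuracy_summary(y_true, y_pred):
    """Hypothetical helper computing the four accuracy attributes."""
    classes = sorted(set(y_true))
    correct = [t == p for t, p in zip(y_true, y_pred)]

    # Standard accuracy: fraction of correctly predicted cells.
    unweighted = sum(correct) / len(y_true)

    # Accuracy restricted to the cells of each true class.
    per_class = []
    for c in classes:
        hits = [ok for t, ok in zip(y_true, correct) if t == c]
        per_class.append(sum(hits) / len(hits))

    # Every class counts equally, whatever its size.
    weighted = sum(per_class) / len(per_class)
    return Accuracy(unweighted, weighted, min(per_class), per_class)


# Unbalanced toy example: 8 cells of class 0, 2 cells of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
scores = accuracy_summary(y_true, y_pred)
print(scores)
# Accuracy(unweighted=0.9, weighted=0.75, worst=0.5, accuracy_classes=[1.0, 0.5])
```

On this unbalanced example the unweighted accuracy (0.9) hides the poor performance on the rare class, which the weighted (0.75) and worst (0.5) scores expose.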


Compute the accuracy scores for the random forest (RF) and the SVC:

In [7]:
from scvi.inference.annotation import compute_accuracy_rf, compute_accuracy_svc

data_train, labels_train = trainer.labelled_set.raw_data()
data_test, labels_test = trainer.unlabelled_set.raw_data()
svc_scores = compute_accuracy_svc(data_train, labels_train, data_test, labels_test)
rf_scores = compute_accuracy_rf(data_train, labels_train, data_test, labels_test)

# Both scores are read from the same position, [0][1], i.e. the evaluation on data_test.
print("\nSVC score test :\n", svc_scores[0][1])
print("\nRF score test :\n", rf_scores[0][1])


SVC score test :
 Accuracy(unweighted=0.87018739352640551, weighted=0.84659088617012479, worst=0.72236503856041134, accuracy_classes=[0.79439252336448596, 0.90666666666666662, 0.88571428571428568, 0.79545454545454541, 0.9345679012345679, 0.88697524219590962, 0.72236503856041134])

RF score test :
 Accuracy(unweighted=0.8933560477001703, weighted=0.87433300965302296, worst=0.65038560411311053, accuracy_classes=[0.88317757009345799, 0.89777777777777779, 0.99642857142857144, 0.81818181818181823, 0.96049382716049381, 0.91388589881593107, 0.65038560411311053])
