## Generating the matrix of OpenCell target localization encodings

__Keith Cheveralls__<br>
__October 2021__

This notebook documents the generation of the matrix of OpenCell target localization encodings. This matrix is encapsulated by an [anndata](https://anndata.readthedocs.io/en/latest/) object that is used in the `clustering-figures` and `clustering-performance` notebooks. The final version of this object used for the 2021-opencell paper can be downloaded [here](https://figshare.com/articles/dataset/Consensus_protein_localization_encodings_for_all_OpenCell_targets/16754965).

Note: the generation of the target localization encodings here depends on a trained cytoself model, which is not part of the OpenCell project itself. Please refer to [the cytoself project](https://github.com/royerlab/cytoself) for more information. 

Internal note: construction of the adata object depends on filtering out targets that are not publication-ready. This is done at the very beginning of the processing (i.e., before the kNN matrix is calculated), so everything downstream (both the UMAP and the clusters) depends on it. The filtering uses the `publication_ready` annotations in a cache of the `/lines` endpoint from the OpenCell API. This was a point of fragility in the original analysis, because the cache had no versioning or even a timestamp, so there was no way to know which cache was used for which figures or exports. On 2021-06-09, I generated a new, final cache and manually verified that the `publication_ready` annotations in this cached payload are the same as they were back in March 2021.
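
One way to make such a cache traceable is to store a small manifest alongside it. The sketch below is illustrative only and is not how the cache was actually generated: the `describe_cache` helper and the toy payload are hypothetical. It records a UTC timestamp and a SHA-256 hash of the payload, so that two caches can later be compared without inspecting their contents.

```python
import hashlib
import json
from datetime import datetime, timezone


def describe_cache(payload: bytes) -> dict:
    """Return a small manifest identifying one cached API payload (hypothetical helper)."""
    return {
        # when the manifest was generated (UTC, ISO 8601)
        "created_at": datetime.now(timezone.utc).isoformat(),
        # content hash: changes whenever any byte of the payload changes
        "sha256": hashlib.sha256(payload).hexdigest(),
        "num_bytes": len(payload),
    }


# a made-up payload standing in for the cached /lines response
payload = json.dumps(
    [{"target_name": "TARGET1", "publication_ready": True}], sort_keys=True
).encode("utf-8")

manifest = describe_cache(payload)
print(json.dumps(manifest, indent=2))
```

If the manifest hash were written into each exported figure or `anndata` object, the cache behind any given export could be identified after the fact.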

In [None]:
import anndata as ad
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import sys
import os

%load_ext autoreload
%autoreload 1

sys.path.append('../../')
%aimport scripts.cytoself_analysis.clustering_workflows
from scripts.cytoself_analysis import (
    loaders, clustering_workflows, ground_truth_labels
)

sc.settings.set_figure_params(dpi=80, facecolor='white', frameon=False)

In [None]:
# the expected number of OpenCell targets in the dataset
num_targets = 1294

# the expected dimensionality of the VQ2 vectors 
# (that is, the localization encodings)
num_vq2_features = 12 * 12 * 64

# the expected number of patches in the test dataset
num_test_patches = 109995

### Load the cytoself model results

These results include the arrays of image-patch encodings (from both the VQ1 and VQ2 layers) from the test dataset used in training the cytoself model. 

In [None]:
clustering_results_dirpath = '/Users/keith.cheveralls/clustering-results'
results = loaders.load_december_results(
    root_dirpath=clustering_results_dirpath, dataset='full', rep=3
)

In [None]:
# concatenate the C4orf32 orphan
results.concatenate_orphans()

In [None]:
assert results.test_labels.shape[0] == num_test_patches
results.test_labels.shape, results.test_vq2.shape, results.test_vq2_ind.shape

### Merge OpenCell annotations

This step uses a cache of the `/lines` endpoint from the OpenCell API. The merged annotations are needed to filter out non-publication-ready targets from the results.
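
As a rough illustration of what such a merge involves (this is not the `ground_truth_labels.merge_all` implementation, and the column names other than `publication_ready` are assumptions), per-target annotations can be attached to a per-patch table with a left merge. A left merge keeps one row per patch in the original order, which matters if the label rows are aligned with the rows of the encoding arrays.

```python
import pandas as pd

# per-patch labels (hypothetical columns; the real table has more)
patch_labels = pd.DataFrame({
    "target_name": ["A", "A", "B", "C", "C"],
    "patch_index": [0, 1, 2, 3, 4],
})

# per-target annotations, standing in for the cached /lines payload
annotations = pd.DataFrame({
    "target_name": ["A", "B", "C"],
    "publication_ready": [True, False, True],
})

# validate= raises if a target appears more than once in the annotations,
# which would otherwise silently duplicate patch rows
merged = patch_labels.merge(
    annotations, on="target_name", how="left", validate="many_to_one"
)

# patches from targets that are not publication-ready can then be masked out
pub_ready = merged[merged["publication_ready"]]
print(pub_ready["target_name"].unique())
```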

In [None]:
results.test_labels = ground_truth_labels.merge_all(
    df=results.test_labels, data_dirpath=os.path.abspath('../../data')
)

### Export the matrix of consensus (patch-averaged) target encodings

This constructs an `anndata` object that represents the consensus localization encodings for each OpenCell target. These encodings are the flattened VQ2 vectors, averaged over all patches for each target. 
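
The flatten-then-average step can be sketched in plain NumPy. This is a toy example under stated assumptions, not the `export_adata` implementation: the array sizes are shrunk (the real VQ2 vectors are 12 x 12 x 64 per patch) and the target names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy dimensions standing in for (patches, 12, 12, 64)
num_patches, height, width, channels = 6, 2, 2, 3
vq2 = rng.normal(size=(num_patches, height, width, channels))

# the target that each patch belongs to (hypothetical names)
targets = np.array(["A", "A", "B", "B", "B", "C"])

# flatten each patch encoding into a single feature vector
flat = vq2.reshape(num_patches, -1)

# sum the flattened vectors per target, then divide by the patch counts
target_names, inverse = np.unique(targets, return_inverse=True)
sums = np.zeros((len(target_names), flat.shape[1]))
np.add.at(sums, inverse, flat)
counts = np.bincount(inverse)
consensus = sums / counts[:, None]

print(consensus.shape)  # (3, 12): one row per target, one column per feature
```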

In [None]:
adata = results.export_adata(
    vq='vq2', kind='vectors', using='mean', rerun=True, pub_ready_only=True
)

assert adata.obs.shape[0] == num_targets

### Create the `ClusteringWorkflow` instance

The `ClusteringWorkflow` instance is used here only to minimally preprocess the raw target localization encodings (by using PCA to reduce dimensionality from 9216 to 200) and then to calculate the matrix of nearest neighbors (which is later used for both the UMAP embedding and the Leiden clustering).
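
For readers unfamiliar with these two steps, the following NumPy sketch shows the underlying idea on a small random matrix. It is not the `ClusteringWorkflow` code, whose internals are not shown in this notebook: it uses an exact SVD-based PCA and a brute-force Euclidean neighbor search, and the matrix sizes are toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stand-in for the (targets x features) matrix
X = rng.normal(size=(50, 20))
n_pcs, n_neighbors = 5, 3

# PCA via SVD of the mean-centered matrix
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_pca = U[:, :n_pcs] * S[:n_pcs]

# pairwise Euclidean distances in PC space
diffs = X_pca[:, None, :] - X_pca[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=-1))

# exclude each point from its own neighbor list, then keep the closest few
np.fill_diagonal(dists, np.inf)
knn_indices = np.argsort(dists, axis=1)[:, :n_neighbors]

print(X_pca.shape, knn_indices.shape)  # (50, 5) (50, 3)
```

Because both the UMAP embedding and the Leiden clustering start from the same neighbor graph, any change to the inputs of this step (for example, which targets are filtered out) propagates to both.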

In [None]:
cwv = clustering_workflows.ClusteringWorkflow(adata=adata)

# preprocess the VQ2 features and calculate the principal components
cwv.preprocess(do_log1p=False, do_scaling=False, n_top_genes=None, n_pcs=200)

# calculate the kNN matrix
cwv.calculate_neighbors(n_neighbors=10, n_pcs=200, metric='euclidean')

In [None]:
assert cwv.adata.X.shape[0] == num_targets
assert cwv.adata.X.shape[1] == num_vq2_features

### The target UMAP

This generates the UMAP of localization encodings shown in Figures 3 and 4 (among others).

In [None]:
# the publication UMAP uses a random seed of 51 and a min_dist of zero
sc.tl.umap(cwv.adata, init_pos='spectral', min_dist=0.0, random_state=51)

In [None]:
sc.pl.umap(cwv.adata, color='grade_3_annotation', palette='tab10', alpha=0.5)

### Export the `anndata` object

This object is used for further analysis in the `clustering-figures` and `clustering-performance` notebooks, and is identical to the object available on Figshare [here](https://figshare.com/articles/dataset/Consensus_protein_localization_encodings_for_all_OpenCell_targets/16754965). 

In [None]:
# drop unused/internal metadata columns
cwv.adata.obs.drop(labels=['plate_id', 'well_id', 'oc_categories'], axis=1, inplace=True)

In [None]:
adata_filepath = '../../data/figshare/final-opencell-target-localization-encodings.h5ad'
cwv.adata.write_h5ad(adata_filepath)