CAP Encoding for AnnData file

Contact: [...]

Version: 1.0.0

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14, RFC2119, and RFC8174 when, and only when, they appear in all capitals, as shown here.

Background

Much of this schema has been based on the v4.0.0 CELLxGENE schema, documented at this GitHub commit here.

Required Ontologies

The following ontologies are utilized within this schema:

Ontology	OBO Prefix	Download
Cell Ontology	CL	cl.owl
Experimental Factor Ontology	EFO	efo.owl
Mondo Disease Ontology	MONDO	mondo.owl
NCBI organismal classification	NCBITaxon	ncbitaxon.owl
Uberon multi-species anatomy ontology	UBERON	uberon.owl

Sections

Sections are as follows, based on the AnnData file format:

X (Matrix layers)
obs (Cell metadata), metadata on each cell
- Dataset-specific metadata
- Cell annotation metadata
var and raw.var (Gene metadata), metadata on each gene
obsm (Embeddings)
uns (Dataset metadata), metadata related to the dataset itself

`X` (Matrix Layers)

This is the standard X layer within the AnnData file, see here.

This is the data matrix of dimension (#observations, #variables).

Users MUST provide the raw count matrix in either the .X or the .raw.X field. If the raw count matrix is provided in .X, the .raw layer MUST be empty.

Users MAY provide a normalized matrix in the .X field. It is STRONGLY RECOMMENDED that users provide both a raw count matrix and a normalized one, and that the normalized matrix be computed with the following algorithm:

Normalize counts per cell up to 10 000 reads per cell (NOTE: this is done simply so that counts become comparable among cells. See the default values used in the ScanPy tutorial here or the Seurat function NormalizeData() here)
Use log(1+x) transformation
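
The two steps above can be sketched in NumPy as follows. This is a minimal illustration, not CAP's own implementation: the function name `normalize_log1p` and the handling of cells with zero total counts are assumptions made for the example. In Scanpy, the corresponding calls are `scanpy.pp.normalize_total(adata, target_sum=1e4)` followed by `scanpy.pp.log1p(adata)`.

```python
import numpy as np


def normalize_log1p(raw_counts, target_sum=10_000):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    counts = np.asarray(raw_counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Assumption: cells with zero total counts are left as all zeros
    # instead of causing a division by zero.
    totals[totals == 0] = 1.0
    return np.log1p(counts / totals * target_sum)


# Toy raw count matrix: 2 cells (rows) x 3 genes (columns).
raw = np.array([[1, 3, 0],
                [0, 0, 5]])
normalized = normalize_log1p(raw)
```

Undoing the log transform with `np.expm1(normalized)` gives rows that each sum to 10 000, which is a quick way to check whether a matrix follows this algorithm.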

If the normalized matrix provided by the user differs from the one expected from the algorithm above, the user-provided matrix will be moved to an additional AnnData layer, AnnData.layers['user-provided']. In this case, the .X field will be filled with the raw count matrix re-normalized using the algorithm above. NOTE: The re-normalization is needed for the CAP calculations to work reliably.

After the CAP preprocessing, any AnnData file downloadable via CAP:

MUST have the raw count matrix in .raw.X,
MUST have a matrix normalized by the algorithm above in .X,
MAY have a matrix normalized by another algorithm in .layers['user-provided'].

In any layer, if a matrix has 50% or more values that are zeros, it is STRONGLY RECOMMENDED that the matrix be encoded as a scipy.sparse.csr_matrix.
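
As a rough sketch of the 50% rule, the fraction of zeros in a dense matrix can be measured with NumPy before deciding on the encoding. The helper names `zero_fraction` and `should_be_sparse` are hypothetical and exist only for this example.

```python
import numpy as np


def zero_fraction(matrix):
    """Return the fraction of entries in a dense matrix that are exactly zero."""
    dense = np.asarray(matrix)
    return 1.0 - np.count_nonzero(dense) / dense.size


def should_be_sparse(matrix, threshold=0.5):
    """True when at least `threshold` of the values are zeros."""
    return zero_fraction(matrix) >= threshold


# 5 of the 8 entries are zero, so this matrix meets the 50% threshold.
counts = np.array([[0, 2, 0, 0],
                   [1, 0, 0, 3]])
```

When `should_be_sparse(counts)` is true, the matrix can be converted with `scipy.sparse.csr_matrix(counts)`.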

`obs` (Cell Metadata)

obs is a pandas.DataFrame.

Dataset-specific Metadata

organism_ontology_term_id

column	`organism_ontology_term_id`
dtype	string
value	This MUST be a child of `NCBITaxon:33208` for Metazoa.
source	file or UI
required for publication on CAP	yes
example	`'NCBITaxon:9606'` or `'NCBITaxon:10090'`

organism

column	`organism`
dtype	string
value	This MUST be the human-readable term assigned to the value of `organism_ontology_term_id`. The ontology term and ontology term ID MUST match.
source	file
required for publication on CAP	yes
example	`'Homo sapiens'` or `'Mus musculus'`

disease_ontology_term_id

column	`disease_ontology_term_id`
dtype	string
value	This MUST be a MONDO term or `'PATO:0000461'` for normal or healthy.
source	file or UI
required for publication on CAP	yes
example	`'MONDO:0004975'` or `'MONDO:0018177'` or `'PATO:0000461'`

disease

column	`disease`
dtype	string
value	This MUST be the human-readable term which corresponds to the value of `disease_ontology_term_id`. The ontology term and ontology term ID MUST match.
source	file
required for publication on CAP	yes
example	`'Alzheimer's Disease'` or `'Adult Brain Glioblastoma'` or `'normal'`

assay_ontology_term_id

column	`assay_ontology_term_id`
dtype	string
value	This MUST be an EFO term and SHOULD be the most accurate EFO term for this assay. Valid terms are the more specific terms under either `"assay by molecule"`, i.e. `'EFO:0002772'`, or `"single cell library construction"`, i.e. `'EFO:0010183'`. Recommended values for commonly used assays: `'10x 3' v2'` corresponds to `'EFO:0009899'`; `'10x 3' v3'` corresponds to `'EFO:0009922'`; `'10x 5' v1'` corresponds to `'EFO:0011025'`; `'10x 5' v2'` corresponds to `'EFO:0009900'`; `'Smart-seq2'` corresponds to `'EFO:0008931'`; `'Visium Spatial Gene Expression'` corresponds to `'EFO:0010961'`.
source	file or UI
required for publication on CAP	yes
example	`'EFO:0009922'` or `'EFO:0008931'`
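
For convenience, the recommended values listed above can be kept as a lookup table when filling in `assay_ontology_term_id`. The dictionary below only restates the pairs given in this section; the names `RECOMMENDED_ASSAY_TERMS` and `assay_term_id` are hypothetical.

```python
# Recommended assay labels and their EFO term IDs, as listed in this section.
RECOMMENDED_ASSAY_TERMS = {
    "10x 3' v2": "EFO:0009899",
    "10x 3' v3": "EFO:0009922",
    "10x 5' v1": "EFO:0011025",
    "10x 5' v2": "EFO:0009900",
    "Smart-seq2": "EFO:0008931",
    "Visium Spatial Gene Expression": "EFO:0010961",
}


def assay_term_id(assay_label):
    """Return the recommended EFO ID for a label, or None if it is not listed."""
    return RECOMMENDED_ASSAY_TERMS.get(assay_label)
```

Assays that are not in this table still need an EFO term chosen by hand from the ontology.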

assay

column	`assay`
dtype	string
value	This MUST be the human-readable term which corresponds to the value of `assay_ontology_term_id`. The ontology term and ontology term ID MUST match.
source	file
required for publication on CAP	yes
example	`'10x 3' v3'` or `'Smart-seq2'`

tissue_ontology_term_id

column	`tissue_ontology_term_id`
dtype	string
value	This MUST be the most accurate child of `UBERON:0001062` for anatomical entity.
source	file or UI
required for publication on CAP	yes
example	`'UBERON:0000451'` or `'UBERON:0000966'`

tissue

column	`tissue`
dtype	string
value	This MUST be the human-readable term assigned to the value of `tissue_ontology_term_id`. The ontology term and ontology term ID MUST match.
source	file
required for publication on CAP	yes
example	`'prefrontal cortex'` or `'retina'`
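
A lightweight sanity check for the `*_ontology_term_id` columns above is to confirm that each value carries the OBO prefix this schema expects. The sketch below checks the prefix only. It does not verify ontology ancestry (for example, that an organism term is a child of `NCBITaxon:33208`), which requires the ontology files themselves. All names in it are hypothetical.

```python
import re

# Expected OBO prefix for each dataset-specific ID column in this schema.
EXPECTED_PREFIX = {
    "organism_ontology_term_id": "NCBITaxon",
    "disease_ontology_term_id": "MONDO",
    "assay_ontology_term_id": "EFO",
    "tissue_ontology_term_id": "UBERON",
}
# Exact IDs allowed in addition to the prefix rule (normal or healthy).
EXACT_EXCEPTIONS = {"disease_ontology_term_id": {"PATO:0000461"}}

# Assumption: term IDs have the form PREFIX:digits.
TERM_ID = re.compile(r"([A-Za-z]+):(\d+)")


def has_expected_prefix(column, term_id):
    """Check that `term_id` is well-formed and uses the prefix expected for `column`."""
    if term_id in EXACT_EXCEPTIONS.get(column, ()):
        return True
    match = TERM_ID.fullmatch(term_id)
    return bool(match) and match.group(1) == EXPECTED_PREFIX[column]
```

For instance, `has_expected_prefix("tissue_ontology_term_id", "CL:0000751")` is false because a Cell Ontology term is not a valid tissue ID.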

Clustering

Users MAY include either a single clustering field in an AnnData file or multiple clustering fields, e.g. for different clustering algorithms or multiple clustering resolutions.

We therefore REQUIRE that clustering is clearly denoted within the AnnData file if it contains clustering fields.

ScanPy has set an AnnData community standard of naming the *.obs column after the type of algorithm. For example, the function scanpy.tl.louvain (documented here) by default saves the clustering as anndata.obs['louvain']. Similarly, leiden (documented here) is often encoded as anndata.obs['leiden'].

column	`'cluster'`, `'leiden'`, `'louvain'` or `'cluster + _ + [ALGORITHM_TYPE] + _ + [SUFFIX]'`, whereby `[ALGORITHM_TYPE]` and `[SUFFIX]` are OPTIONAL.
dtype	string
value	`'cluster'`, `'leiden'` or `'louvain'`: MUST be used to denote clustering in `AnnData.obs`. `[ALGORITHM_TYPE]`: Denotes the algorithm used, e.g. `'leiden'` or `'louvain'`. OPTIONAL. `[SUFFIX]`: Denotes a descriptive tag informative enough for third-party users; used to distinguish between multiple clusterings. OPTIONAL.
source	file
required for publication on CAP	no
example column name	`'cluster_leiden'` or `'cluster_leiden_broad'` or `'louvain'` or `'leiden_fine'`
example value	`'0'` or `'1'` or `'2'`
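
One possible reading of this naming convention is a regular expression that accepts `cluster`, `leiden` or `louvain`, optionally followed by underscore-separated tags. The pattern below is an interpretation chosen to match the example column names above, not an official CAP validator, and the names in it are hypothetical.

```python
import re

# Base name, then zero or more "_tag" parts (algorithm type and/or suffix).
CLUSTER_COLUMN = re.compile(r"(?:cluster|leiden|louvain)(?:_[A-Za-z0-9]+)*")


def clustering_columns(obs_columns):
    """Return the column names that follow the clustering naming convention."""
    return [name for name in obs_columns if CLUSTER_COLUMN.fullmatch(name)]


columns = ["cluster_leiden", "cluster_leiden_broad", "louvain",
           "leiden_fine", "cell_type", "n_genes"]
found = clustering_columns(columns)
```

Here `found` holds the first four names, while `cell_type` and `n_genes` are not treated as clustering columns.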

Cell Annotation Metadata

[cellannotation_setname]

The string specified by the user for [cellannotation_setname] will be used as the pandas DataFrame column name (key) to encode the following cell annotation metadata columns in *.obs.

NOTE: A dataset may have multiple sets of cell annotations, each with a corresponding set of cell annotation metadata, e.g. 'cell_type' and 'broadclustering_celltype'.

Format: The column name is the string [cellannotation_setname] and the values are the strings of cell_label. Refer to the fields cellannotation_setname and cell_label in the JSON Schema.

column	`[cellannotation_setname]`
index	Cell barcode names
dtype	string
value	Any free-text term which the author uses to annotate cells, the preferred cell label name used by the author.
source	file or UI
required for publication on CAP	yes
example	`'HBC2'` or `'rod bipolar'`

[cellannotation_setname]--cell_fullname

Format: The column name is the value [cellannotation_setname] concatenated with the string 'cell_fullname' and two hyphens, i.e. [cellannotation_setname] + '--' + 'cell_fullname'

For example, if the user specified the cell annotation as broad_cells1, then the name of the column in the pandas DataFrame will be broad_cells1--cell_fullname.
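
The naming rule is plain string concatenation, so it can be expressed with two small helpers. The function names below are hypothetical. The split helper assumes that the annotation set name itself does not contain two consecutive hyphens.

```python
SEPARATOR = "--"


def metadata_column(setname, field):
    """Build the obs column name for a metadata field of an annotation set."""
    return f"{setname}{SEPARATOR}{field}"


def split_metadata_column(column):
    """Split 'setname--field' into (setname, field); field is None if absent."""
    setname, separator, field = column.partition(SEPARATOR)
    return (setname, field) if separator else (column, None)


name = metadata_column("broad_cells1", "cell_fullname")
```

With the example above, `name` is `'broad_cells1--cell_fullname'`, and `split_metadata_column(name)` recovers the set name and the field.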

column	`[cellannotation_setname]--cell_fullname`
index	Cell barcode names
dtype	string
value	The full-length name for the biological entity listed in `[cellannotation_setname]` by the author.
source	file or UI
required for publication on CAP	yes
example	`'rod bipolar'`

[cellannotation_setname]--cell_ontology_exists

Format: The column name is the string prefix [cellannotation_setname]-- concatenated with the string value cell_ontology_exists, i.e. [cellannotation_setname] + '--' + 'cell_ontology_exists'

column	`[cellannotation_setname]--cell_ontology_exists`
index	Cell barcode names
dtype	boolean
value	Boolean value in Python (either True or False).
source	file or UI
required for publication on CAP	yes
example	`True`

[cellannotation_setname]--cell_ontology_term_id

Format: The column name is the value [cellannotation_setname] concatenated with the string 'cell_ontology_term_id' and two hyphens, i.e. [cellannotation_setname] + '--' + 'cell_ontology_term_id'

column	`[cellannotation_setname]--cell_ontology_term_id`
index	Cell barcode names
dtype	string
value	This MUST be a term from either the Cell Ontology or from some ontology that extends it by classifying cell types under terms from the Cell Ontology e.g. the Provisional Cell Ontology or the Drosophila Anatomy Ontology (DAO).
source	file or UI
required for publication on CAP	yes
example	`'CL:0000751'`

[cellannotation_setname]--cell_ontology_term

Format: The column name is the value [cellannotation_setname] concatenated with the string 'cell_ontology_term' and two hyphens, i.e. [cellannotation_setname] + '--' + 'cell_ontology_term'

column	`[cellannotation_setname]--cell_ontology_term`
index	Cell barcode names
dtype	string
value	The human-readable name assigned to the value of `'cell_ontology_term_id'`.
source	file or UI
required for publication on CAP	yes
example	`'rod bipolar cell'`

[cellannotation_setname]--rationale

Format: The column name is the value [cellannotation_setname] concatenated with the string 'rationale' and two hyphens, i.e. [cellannotation_setname] + '--' + 'rationale'

column	`[cellannotation_setname]--rationale`
index	Cell barcode names
dtype	string
value	The free-text rationale which users provide as justification/evidence for their cell annotations.
source	file or UI
required for publication on CAP	yes
example	`'This cell was annotated with [blank] given the canonical markers in the field [X], [Y], [Z]. We noticed [X] and [Y] running differential expression.'`

[cellannotation_setname]--rationale_dois

Format: The column name is the value [cellannotation_setname] concatenated with the string 'rationale_dois' and two hyphens, i.e. [cellannotation_setname] + '--' + 'rationale_dois'

column	`[cellannotation_setname]--rationale_dois`
index	Cell barcode names
dtype	string
value	Comma-separated string of valid publication DOIs cited by the author to support or provide justification/evidence/context for `cell_label`.
source	file or UI
required for publication on CAP	no
example	`'10.1038/s41587-022-01468-y, 10.1038/s41556-021-00787-7, 10.1038/s41586-021-03465-8'`
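
A comma-separated DOI string such as the example above can be split and given a basic format check as sketched below. The regular expression is a common heuristic for modern DOIs (a `10.` prefix, a 4 to 9 digit registrant code, a slash, and a suffix), not a rule defined by this schema. The sketch also assumes that no DOI contains a comma. The names are hypothetical.

```python
import re

# Heuristic DOI shape: "10.<registrant>/<suffix>".
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def parse_dois(value):
    """Split a comma-separated DOI string into a list of trimmed DOIs."""
    return [doi.strip() for doi in value.split(",") if doi.strip()]


def invalid_dois(value):
    """Return the entries that do not look like DOIs."""
    return [doi for doi in parse_dois(value) if not DOI_PATTERN.fullmatch(doi)]


dois = parse_dois(
    "10.1038/s41587-022-01468-y, 10.1038/s41556-021-00787-7, 10.1038/s41586-021-03465-8"
)
```

A format check like this does not confirm that a DOI resolves to a real publication.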

[cellannotation_setname]--marker_gene_evidence

Format: The column name is the value [cellannotation_setname] concatenated with the string 'marker_gene_evidence' and two hyphens, i.e. [cellannotation_setname] + '--' + 'marker_gene_evidence'

column	`[cellannotation_setname]--marker_gene_evidence`
index	Cell barcode names
dtype	string
value	Comma-separated string of gene names explicitly used as evidence for this cell annotation. Each gene MUST be included in the matrix of the AnnData/Seurat file.
source	file or UI
required for publication on CAP	yes
example	`'TP53, KRAS, BRCA1'`
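
The requirement that every marker gene be present in the matrix can be checked as sketched below. The sketch assumes the caller supplies the gene names available in the file. How those names are derived from the ENSEMBL-indexed `var` is not specified in this section, so `gene_names` is a hypothetical input, as is the function name.

```python
def missing_marker_genes(marker_gene_evidence, gene_names):
    """Return the marker genes that are not among the genes in the matrix."""
    markers = [gene.strip() for gene in marker_gene_evidence.split(",") if gene.strip()]
    known = set(gene_names)
    return [gene for gene in markers if gene not in known]


# Hypothetical gene names present in the expression matrix.
genes_in_matrix = ["TP53", "KRAS", "GATA3"]
missing = missing_marker_genes("TP53, KRAS, BRCA1", genes_in_matrix)
```

An empty result means every listed marker gene was found. In this example, `BRCA1` is reported as missing.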

[cellannotation_setname]--canonical_marker_genes

Format: The column name is the value [cellannotation_setname] concatenated with the string 'canonical_marker_genes' and two hyphens, i.e. [cellannotation_setname] + '--' + 'canonical_marker_genes'

column	`[cellannotation_setname]--canonical_marker_genes`
index	Cell barcode names
dtype	string
value	Comma-separated string of gene names considered to be canonical markers for the biological entity used in the cell annotation.
source	file or UI
required for publication on CAP	yes
example	`'GATA3, CD3D, CD3E'`

[cellannotation_setname]--synonyms

Format: The column name is the value [cellannotation_setname] concatenated with the string 'synonyms' and two hyphens, i.e. [cellannotation_setname] + '--' + 'synonyms'

column	`[cellannotation_setname]--synonyms`
index	Cell barcode names
dtype	string
value	Comma-separated string of synonyms for values in `[cellannotation_setname]`. Abbreviations are acceptable.
source	file or UI
required for publication on CAP	yes
example	`'neuroglial cell, glial cell, neuroglia'` or `'amacrine cell'` or `'FMB cell'`

[cellannotation_setname]--category_fullname

Format: The column name is the string prefix [cellannotation_setname]-- concatenated with the string value category_fullname, i.e. [cellannotation_setname] + '--' + 'category_fullname'. This MUST be the full-length name for the biological entity, not an abbreviation.

column	`[cellannotation_setname]--category_fullname`
index	Cell barcode names
dtype	string
value	A single value of the category/parent term for the cell label value in `[cellannotation_setname]`.
source	file or UI
required for publication on CAP	yes
example	`'ON-bipolar cell'`

[cellannotation_setname]--category_cell_ontology_exists

Format: The column name is the string prefix [cellannotation_setname]-- concatenated with the string value category_cell_ontology_exists, i.e. [cellannotation_setname] + '--' + 'category_cell_ontology_exists'

column	`[cellannotation_setname]--category_cell_ontology_exists`
index	Cell barcode names
dtype	boolean
value	Boolean value in Python (either True or False).
source	file or UI
required for publication on CAP	yes
example	`True`

[cellannotation_setname]--category_cell_ontology_term_id

Format: The column name is the value [cellannotation_setname] concatenated with the string 'category_cell_ontology_term_id' and two hyphens, i.e. [cellannotation_setname] + '--' + 'category_cell_ontology_term_id'

column	`[cellannotation_setname]--category_cell_ontology_term_id`
index	Cell barcode names
dtype	string
value	The ID from either the Cell Ontology or from some ontology that extends it by classifying cell types under terms from the Cell Ontology.
source	file or UI
required for publication on CAP	yes
example	`'CL:0000749'`

[cellannotation_setname]--category_cell_ontology_term

Format: The column name is the string prefix [cellannotation_setname]-- concatenated with the string value category_cell_ontology_term, i.e. [cellannotation_setname] + '--' + 'category_cell_ontology_term'

column	`[cellannotation_setname]--category_cell_ontology_term`
index	Cell barcode names
dtype	string
value	The human-readable name assigned to the value of `'category_cell_ontology_term_id'`.
source	file or UI
required for publication on CAP	yes
example	`'ON-bipolar cell'`

[cellannotation_setname]--cell_ontology_assessment

Format: The column name is the string prefix [cellannotation_setname]-- concatenated with the string value cell_ontology_assessment, i.e. [cellannotation_setname] + '--' + 'cell_ontology_assessment'

column	`[cellannotation_setname]--cell_ontology_assessment`
index	Cell barcode names
dtype	string
value	Free-text field for researchers to express disagreements with any aspect of the Cell Ontology for this cell annotation.
source	file or UI
required for publication on CAP	no
example	`'Amacrine cell should have four child terms: glycinergic, GABAergic, GABAergic Glycinergic amacrine cells and non-GABAergic non-glycinergic amacrine cells; which then contain their corresponding child terms'`
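
To summarize the cell annotation metadata defined above, the sketch below lists the fields marked as required for publication on CAP and reports which of the corresponding columns are absent from a list of `obs` column names. The field lists are transcribed from the tables in this section. The function and variable names are hypothetical.

```python
# Fields marked "required for publication on CAP: yes" in this section.
REQUIRED_FIELDS = [
    "cell_fullname",
    "cell_ontology_exists",
    "cell_ontology_term_id",
    "cell_ontology_term",
    "rationale",
    "marker_gene_evidence",
    "canonical_marker_genes",
    "synonyms",
    "category_fullname",
    "category_cell_ontology_exists",
    "category_cell_ontology_term_id",
    "category_cell_ontology_term",
]
# Fields marked "required for publication on CAP: no".
OPTIONAL_FIELDS = ["rationale_dois", "cell_ontology_assessment"]


def missing_required_columns(setname, obs_columns):
    """Return the required columns for `setname` that are absent from obs."""
    present = set(obs_columns)
    expected = [setname] + [f"{setname}--{field}" for field in REQUIRED_FIELDS]
    return [column for column in expected if column not in present]


obs_columns = ["cell_type", "cell_type--cell_fullname", "cell_type--rationale"]
missing = missing_required_columns("cell_type", obs_columns)
```

Because several of these fields list "file or UI" as their source, columns reported as missing here may still be supplied later through the CAP UI.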

`var` and `raw.var` (Gene Metadata)

CAP requires that gene names be provided as ENSEMBL terms. These MUST be encoded in the index of the var fields following the AnnData standard.

NOTE: var is a pandas.DataFrame object. ENSEMBL terms MUST be used to index these rows, i.e. pandas.DataFrame.index.
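
A simple format check can flag index entries that are gene symbols instead of ENSEMBL terms. The pattern below follows the usual shape of Ensembl gene stable IDs (`ENS`, an optional species prefix such as `MUS`, the letter `G`, and 11 digits). Accepting an optional version suffix such as `.12` is an assumption of this sketch, since this section does not say whether versioned IDs are allowed. The names are hypothetical.

```python
import re

# Ensembl gene stable ID, e.g. ENSG... (human) or ENSMUSG... (mouse).
ENSEMBL_GENE_ID = re.compile(r"ENS[A-Z]*G\d{11}(?:\.\d+)?")


def non_ensembl_ids(var_index):
    """Return the index entries that do not look like Ensembl gene IDs."""
    return [name for name in var_index if not ENSEMBL_GENE_ID.fullmatch(name)]


# Illustrative index: two well-formed IDs and one gene symbol.
var_index = ["ENSG00000141510", "ENSMUSG00000059552", "TP53"]
invalid = non_ensembl_ids(var_index)
```

With an AnnData object, the same check could be applied to `adata.var.index` and, when present, `adata.raw.var.index`.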

NOTE: the UI will convert the ENSEMBL terms to common gene names based on the organism specified. We currently support Homo sapiens and Mus musculus.

If there are other species you wish to upload to CAP, please contact support@celltype.info and we will work to accommodate your request.

`obsm` (Embeddings)

Users MUST include at least one two-dimensional embedding, whose key in obsm must begin with the prefix X_.

That is, given a data matrix X of dimension (#observations, #variables), the dimensions of this embedding MUST be (#observations, 2).

NOTE: Embeddings with more than two dimensions may be encoded in the AnnData file, but these embeddings will not be accessible via the CAP UI.

It is RECOMMENDED that the names of embeddings in obsm follow this format:

'X + _ + [EMBEDDING_TYPE] + _ + [SUFFIX]'

whereby:

'X_': MUST be used to denote embeddings in AnnData.obsm. REQUIRED.
[EMBEDDING_TYPE]: MUST denote the algorithm used to generate the embedding (e.g. UMAP, tSNE, pca, etc.). REQUIRED.
[SUFFIX]: Denotes a descriptive tag informative enough for third-party users; used to distinguish between multiple embeddings of the same type. OPTIONAL.

examples: 'X_pca', 'X_tsne', 'X_tSNE', 'X_umap', 'X_UMAP_nneigbors15', 'X_umap_2'
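
The embedding rules can be checked from the keys and array shapes in `obsm` alone. The sketch below takes a mapping of key to shape and returns the keys that start with `X_` and have exactly two columns. The function name and the toy shapes are hypothetical.

```python
def two_dimensional_embeddings(obsm_shapes, n_obs):
    """Return obsm keys that start with 'X_' and have shape (n_obs, 2)."""
    return [
        key
        for key, shape in obsm_shapes.items()
        if key.startswith("X_") and len(key) > 2 and tuple(shape) == (n_obs, 2)
    ]


# Hypothetical shapes for a dataset with 100 cells.
shapes = {
    "X_umap": (100, 2),
    "X_pca": (100, 50),
    "umap": (100, 2),
    "X_tsne": (100, 2),
}
usable = two_dimensional_embeddings(shapes, n_obs=100)
```

A file satisfies the requirement of at least one two-dimensional embedding when this list is non-empty. With an AnnData object, the mapping could be built as `{key: adata.obsm[key].shape for key in adata.obsm.keys()}`.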

`uns` (Dataset Metadata)

NOTE: Each time a cell annotation cellannotation_setname is modified, these values potentially change.