CAP Encoding for AnnData file

Contact: [...]

Version: 1.0.0

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14, RFC2119, and RFC8174 when, and only when, they appear in all capitals, as shown here.

Background

Much of this schema is based on the v4.0.0 CELLxGENE schema, documented on GitHub.

Required Ontologies

The following ontologies are utilized within this schema:

| Ontology | OBO Prefix | Download |
| --- | --- | --- |
| Cell Ontology | CL | cl.owl |
| Experimental Factor Ontology | EFO | efo.owl |
| Mondo Disease Ontology | MONDO | mondo.owl |
| NCBI organismal classification | NCBITaxon | ncbitaxon.owl |
| Uberon multi-species anatomy ontology | UBERON | uberon.owl |

Sections

Sections are as follows, based on the AnnData file format:

X (Matrix Layers)

This is the standard X layer within the AnnData file, see here.

This is the data matrix of dimensions (#observations, #variables).

Users MUST provide the raw count matrix either in .X or .raw.X fields. If the user provides the raw count matrix in .X the .raw layer MUST be empty.

Users MAY provide the normalized matrix in the .X field. It is STRONGLY RECOMMENDED that users provide both a raw count matrix and a normalized one, and that the matrix be normalized with the following algorithm:

  1. Normalize the counts of each cell to a total of 10,000 reads per cell (NOTE: this is done simply so that counts become comparable among cells. See the default values used in the ScanPy tutorial here or the Seurat function NormalizeData() here)
  2. Use log(1+x) transformation
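
The two steps above can be sketched with NumPy as follows. This is a minimal illustration on a small dense array, assuming cells are rows and genes are columns; the function name `normalize_log1p` and the handling of cells with zero total counts are assumptions made for this example, not part of the schema. In Scanpy, the equivalent calls are `sc.pp.normalize_total(adata, target_sum=1e4)` followed by `sc.pp.log1p(adata)`.

```python
import numpy as np

def normalize_log1p(counts, target_sum=10_000):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Assumption for this sketch: cells with zero total counts stay all-zero.
    totals[totals == 0] = 1.0
    return np.log1p(counts / totals * target_sum)

# Two cells, three genes (hypothetical raw counts).
raw = np.array([[1, 3, 0], [0, 0, 5]])
normalized = normalize_log1p(raw)
```

Undoing the log transform with `np.expm1(normalized)` should give rows that each sum to 10,000, which is a quick way to check whether a matrix follows this algorithm.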

If the normalized matrix provided by the user differs from the one expected from the algorithm above, then the user-provided matrix will be moved to an additional AnnData layer named AnnData.layers['user-provided']. In this case, the .X field will be filled with the raw count matrix re-normalized using the algorithm above. NOTE: The re-normalization is needed for the CAP calculations to work reliably.

After the CAP preprocessing, any AnnData file downloadable via CAP:

  • MUST have the raw count matrix in .raw.X,
  • MUST have a normalized (by the algorithm above) matrix in .X
  • MAY have a normalized (by another algorithm) matrix in the .layers['user-provided']

In any layer, if a matrix has 50% or more values that are zeros, it is STRONGLY RECOMMENDED that the matrix be encoded as a scipy.sparse.csr_matrix.
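
The 50% rule above can be checked with a few lines of NumPy. This is a sketch only; the helper name `should_be_sparse` is hypothetical and the input is assumed to be a non-empty dense array.

```python
import numpy as np

def should_be_sparse(matrix, threshold=0.5):
    """Return True when the fraction of zero entries is at least `threshold`."""
    matrix = np.asarray(matrix)
    zero_fraction = 1.0 - np.count_nonzero(matrix) / matrix.size
    return zero_fraction >= threshold

# Four of the six entries are zero, so this matrix qualifies.
dense = np.array([[0, 0, 3], [1, 0, 0]])
use_sparse = should_be_sparse(dense)
```

When the check passes, the matrix can be converted with `scipy.sparse.csr_matrix(dense)` before it is stored in the AnnData object.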

obs (Cell Metadata)

obs is a pandas.DataFrame.

Dataset-specific Metadata

organism_ontology_term_id

column organism_ontology_term_id
dtype string
value This MUST be a child of NCBITaxon:33208 for Metazoa.
source file or UI
required for publication on CAP yes
example 'NCBITaxon:9606' or 'NCBITaxon:10090'

organism

column organism
dtype string
value This MUST be the human-readable term assigned to the value of organism_ontology_term_id. The ontology term and ontology term ID MUST match.
source file
required for publication on CAP yes
example 'Homo sapiens' or 'Mus musculus'

disease_ontology_term_id

column disease_ontology_term_id
dtype string
value This MUST be a MONDO term or 'PATO:0000461' for normal or healthy.
source file or UI
required for publication on CAP yes
example 'MONDO:0004975' or 'MONDO:0018177' or 'PATO:0000461'

disease

column disease
dtype string
value This MUST be the human-readable term which corresponds to the value of disease_ontology_term_id. The ontology term and ontology term ID MUST match.
source file
required for publication on CAP yes
example 'Alzheimer's Disease' or 'Adult Brain Glioblastoma' or 'normal'

assay_ontology_term_id

column assay_ontology_term_id
dtype string
value This MUST be an EFO term and SHOULD be the most accurate EFO term for this assay. The two options are more specific terms under "assay by molecule" i.e. 'EFO:0002772' or "single cell library construction" i.e. 'EFO:0010183'.

Recommended values for commonly-used assays:
source file or UI
required for publication on CAP yes
example 'EFO:0009922' or 'EFO:0008931'

assay

column assay
dtype string
value This MUST be the human-readable term which corresponds to the value of assay_ontology_term_id. The ontology term and ontology term ID MUST match.
source file
required for publication on CAP yes
example '10x 3' v3' or 'Smart-seq2'

tissue_ontology_term_id

column tissue_ontology_term_id
dtype string
value This MUST be the most accurate child of UBERON:0001062 for anatomical entity.
source file or UI
required for publication on CAP yes
example 'UBERON:0000451' or 'UBERON:0000966'

tissue

column tissue
dtype string
value This MUST be the human-readable term assigned to the value of tissue_ontology_term_id. The ontology term and ontology term ID MUST match.
source file
required for publication on CAP yes
example 'prefrontal cortex' or 'retina'
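
As a first sanity check on the ontology term ID columns above, the ID syntax can be tested with regular expressions. This sketch only checks the prefix and digit layout; it does not verify that a term exists or that it is a child of the required parent term, which needs the ontology files themselves. The mapping `TERM_ID_PATTERNS`, the helper name and the seven-digit assumption for MONDO, EFO and UBERON IDs (taken from the examples in this section) are illustrative assumptions.

```python
import re

# Hypothetical mapping from obs column name to an accepted ID pattern.
TERM_ID_PATTERNS = {
    "organism_ontology_term_id": r"NCBITaxon:\d+",
    "disease_ontology_term_id": r"MONDO:\d{7}|PATO:0000461",
    "assay_ontology_term_id": r"EFO:\d{7}",
    "tissue_ontology_term_id": r"UBERON:\d{7}",
}

def has_valid_term_syntax(column, term_id):
    """Check only the syntax of `term_id` for the given obs column."""
    return re.fullmatch(TERM_ID_PATTERNS[column], term_id) is not None

ok = has_valid_term_syntax("disease_ontology_term_id", "PATO:0000461")
```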

Clustering

Users MAY include a single clustering field within AnnData files, or multiple clustering fields, e.g. from different clustering algorithms or multiple clustering resolutions.

We therefore REQUIRE that clustering is clearly denoted within the AnnData file if it contains clustering fields.

ScanPy has set an AnnData community standard of naming the *.obs column after the type of algorithm. For example, the function scanpy.tl.louvain (documented here) by default saves the clustering as anndata.obs['louvain']. Similarly, Leiden clustering (documented here) is often encoded as anndata.obs['leiden'].

column 'cluster', 'leiden', 'louvain' or 'cluster + _ + [ALGORITHM_TYPE] + _ + [SUFFIX]' whereby [ALGORITHM_TYPE] and [SUFFIX] are OPTIONAL.
dtype string
value
  • 'cluster', 'leiden' or 'louvain': MUST be used to denote clustering in AnnData.obs
  • [ALGORITHM_TYPE]: Denotes the algorithm used, e.g. either 'leiden' or 'louvain'. OPTIONAL.
  • [SUFFIX]: Denotes a descriptive tag informative enough for third-party users; used to distinguish between multiple clusterings. OPTIONAL.
source file
required for publication on CAP no
example column name 'cluster_leiden' or 'cluster_leiden_broad' or 'louvain' or 'leiden_fine'
example value '0' or '1' or '2'
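
One way to pick out clustering columns that follow this naming convention is a regular expression. The pattern below is an interpretation, not a definition from the schema: it accepts 'cluster', 'leiden' or 'louvain', optionally followed by underscore-separated alphanumeric tags, so that all of the example column names above match. The helper name `is_cluster_column` is hypothetical.

```python
import re

# Assumed pattern: a required base name plus optional "_tag" parts.
CLUSTER_COLUMN = re.compile(r"(cluster|leiden|louvain)(_[A-Za-z0-9]+)*")

def is_cluster_column(name):
    """Return True if `name` looks like a clustering column in obs."""
    return CLUSTER_COLUMN.fullmatch(name) is not None

columns = ["cluster_leiden_broad", "leiden_fine", "louvain", "cell_type"]
cluster_columns = [c for c in columns if is_cluster_column(c)]
```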

Cell Annotation Metadata

[cellannotation_setname]

The string specified by the user for [cellannotation_setname] will be used as the pandas DataFrame column name (key) to encode the following cell annotation metadata columns in *.obs.

NOTE: A dataset may have multiple sets of cell annotations, each with a corresponding set of cell annotation metadata, e.g. 'cell_type' and 'broadclustering_celltype'.

Format: The column name is the string [cellannotation_setname] and the values are the strings of cell_label. Refer to the fields cellannotation_setname and cell_label in the JSON Schema.

column [cellannotation_setname]
index Cell barcode names
dtype string
value Any free-text term which the author uses to annotate cells, the preferred cell label name used by the author.
source file or UI
required for publication on CAP yes
example 'HBC2' or 'rod bipolar'

[cellannotation_setname]--cell_fullname

Format: The column name is the value [cellannotation_setname] concatenated with the string 'cell_fullname' and two hyphens, i.e. [cellannotation_setname] + '--' + 'cell_fullname'

For example, if the user specified the cell annotation as broad_cells1, then the name of the column in the pandas DataFrame will be broad_cells1--cell_fullname.
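
The same naming rule applies to every cell annotation metadata column in this section, so the full list of expected column names can be generated from the set name. This is a small sketch; the constant `CELL_ANNOTATION_FIELDS` and the function `annotation_columns` are names chosen for this example, and the suffixes are the ones defined in this section.

```python
# Suffixes of the cell annotation metadata columns defined in this section.
CELL_ANNOTATION_FIELDS = [
    "cell_fullname",
    "cell_ontology_exists",
    "cell_ontology_term_id",
    "cell_ontology_term",
    "rationale",
    "rationale_dois",
    "marker_gene_evidence",
    "canonical_marker_genes",
    "synonyms",
    "category_fullname",
    "category_cell_ontology_exists",
    "category_cell_ontology_term_id",
    "category_cell_ontology_term",
    "cell_ontology_assessment",
]

def annotation_columns(setname):
    """Return the label column plus all '<setname>--<field>' column names."""
    return [setname] + [f"{setname}--{field}" for field in CELL_ANNOTATION_FIELDS]

columns = annotation_columns("broad_cells1")
```

Comparing this list against the obs column names is a simple way to find metadata columns that are missing for a given annotation set.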

column [cellannotation_setname]--cell_fullname
index Cell barcode names
dtype string
value The full-length name for the biological entity listed in [cellannotation_setname] by the author.
source file or UI
required for publication on CAP yes
example 'rod bipolar'

[cellannotation_setname]--cell_ontology_exists

Format: The column name is the string prefix [cellannotation_setname]-- concatenated with the string value cell_ontology_exists, i.e. [cellannotation_setname] + '--' + 'cell_ontology_exists'

column [cellannotation_setname]--cell_ontology_exists
index Cell barcode names
dtype boolean
value Boolean value in Python (either True or False).
source file or UI
required for publication on CAP yes
example True

[cellannotation_setname]--cell_ontology_term_id

Format: The column name is the value [cellannotation_setname] concatenated with the string 'cell_ontology_term_id' and two hyphens, i.e. [cellannotation_setname] + '--' + 'cell_ontology_term_id'

column [cellannotation_setname]--cell_ontology_term_id
index Cell barcode names
dtype string
value This MUST be a term from either the Cell Ontology or from some ontology that extends it by classifying cell types under terms from the Cell Ontology e.g. the Provisional Cell Ontology or the Drosophila Anatomy Ontology (DAO).
source file or UI
required for publication on CAP yes
example 'CL:0000751'

[cellannotation_setname]--cell_ontology_term

Format: The column name is the value [cellannotation_setname] concatenated with the string 'cell_ontology_term' and two hyphens, i.e. [cellannotation_setname] + '--' + 'cell_ontology_term'

column [cellannotation_setname]--cell_ontology_term
index Cell barcode names
dtype string
value The human-readable name assigned to the value of 'cell_ontology_term_id'.
source file or UI
required for publication on CAP yes
example 'rod bipolar cell'

[cellannotation_setname]--rationale

Format: The column name is the value [cellannotation_setname] concatenated with the string 'rationale' and two hyphens, i.e. [cellannotation_setname] + '--' + 'rationale'

column [cellannotation_setname]--rationale
index Cell barcode names
dtype string
value The free-text rationale which users provide as justification/evidence for their cell annotations.
source file or UI
required for publication on CAP yes
example 'This cell was annotated with [blank] given the canonical markers in the field [X], [Y], [Z]. We noticed [X] and [Y] running differential expression.'

[cellannotation_setname]--rationale_dois

Format: The column name is the value [cellannotation_setname] concatenated with the string 'rationale_dois' and two hyphens, i.e. [cellannotation_setname] + '--' + 'rationale_dois'

column [cellannotation_setname]--rationale_dois
index Cell barcode names
dtype string
value Comma-separated string of valid publication DOIs cited by the author to support or provide justification/evidence/context for cell_label.
source file or UI
required for publication on CAP no
example '10.1038/s41587-022-01468-y, 10.1038/s41556-021-00787-7, 10.1038/s41586-021-03465-8'

[cellannotation_setname]--marker_gene_evidence

Format: The column name is the value [cellannotation_setname] concatenated with the string 'marker_gene_evidence' and two hyphens, i.e. [cellannotation_setname] + '--' + 'marker_gene_evidence'

column [cellannotation_setname]--marker_gene_evidence
index Cell barcode names
dtype string
value Comma-separated string of gene names explicitly used as evidence for this cell annotation. Each gene MUST be included in the matrix of the AnnData/Seurat file.
source file or UI
required for publication on CAP yes
example 'TP53, KRAS, BRCA1'
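
The requirement that each marker gene be present in the matrix can be checked by splitting the comma-separated string and comparing against the genes available in the dataset. This is a sketch under one assumption worth stating: the examples use gene symbols, while the var index holds ENSEMBL terms, so `gene_names` here is assumed to be a list of gene symbols for the dataset. The function name is hypothetical.

```python
def missing_marker_genes(marker_string, gene_names):
    """Return marker genes from a comma-separated string that are not in `gene_names`."""
    genes = [g.strip() for g in marker_string.split(",") if g.strip()]
    known = set(gene_names)
    return [g for g in genes if g not in known]

# Hypothetical dataset that lacks KRAS.
missing = missing_marker_genes("TP53, KRAS, BRCA1", ["TP53", "BRCA1", "GATA3"])
```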

[cellannotation_setname]--canonical_marker_genes

Format: The column name is the value [cellannotation_setname] concatenated with the string 'canonical_marker_genes' and two hyphens, i.e. [cellannotation_setname] + '--' + 'canonical_marker_genes'

column [cellannotation_setname]--canonical_marker_genes
index Cell barcode names
dtype string
value Comma-separated string of gene names considered to be canonical markers for the biological entity used in the cell annotation.
source file or UI
required for publication on CAP yes
example 'GATA3, CD3D, CD3E'

[cellannotation_setname]--synonyms

Format: The column name is the value [cellannotation_setname] concatenated with the string 'synonyms' and two hyphens, i.e. [cellannotation_setname] + '--' + 'synonyms'

column [cellannotation_setname]--synonyms
index Cell barcode names
dtype string
value Comma-separated string of synonyms for values in [cellannotation_setname]. Abbreviations are acceptable.
source file or UI
required for publication on CAP yes
example 'neuroglial cell, glial cell, neuroglia' or 'amacrine cell' or 'FMB cell'

[cellannotation_setname]--category_fullname

Format: The column name is the string prefix [cellannotation_setname]-- concatenated with the string value category_fullname, i.e. [cellannotation_setname] + '--' + 'category_fullname'. This MUST be the full-length name for the biological entity, not an abbreviation.

column [cellannotation_setname]--category_fullname
index Cell barcode names
dtype string
value A single value of the category/parent term for the cell label value in [cellannotation_setname].
source file or UI
required for publication on CAP yes
example 'ON-bipolar cell'

[cellannotation_setname]--category_cell_ontology_exists

Format: The column name is the string prefix [cellannotation_setname]-- concatenated with the string value category_cell_ontology_exists, i.e. [cellannotation_setname] + '--' + 'category_cell_ontology_exists'

column [cellannotation_setname]--category_cell_ontology_exists
index Cell barcode names
dtype boolean
value Boolean value in Python (either True or False).
source file or UI
required for publication on CAP yes
example True

[cellannotation_setname]--category_cell_ontology_term_id

Format: The column name is the value [cellannotation_setname] concatenated with the string 'category_cell_ontology_term_id' and two hyphens, i.e. [cellannotation_setname] + '--' + 'category_cell_ontology_term_id'

column [cellannotation_setname]--category_cell_ontology_term_id
index Cell barcode names
dtype string
value The ID from either the Cell Ontology or from some ontology that extends it by classifying cell types under terms from the Cell Ontology.
source file or UI
required for publication on CAP yes
example 'CL:0000749'

[cellannotation_setname]--category_cell_ontology_term

Format: The column name is the string prefix [cellannotation_setname]-- concatenated with the string value category_cell_ontology_term, i.e. [cellannotation_setname] + '--' + 'category_cell_ontology_term'

column [cellannotation_setname]--category_cell_ontology_term
index Cell barcode names
dtype string
value The human-readable name assigned to the value of 'category_cell_ontology_term_id'.
source file or UI
required for publication on CAP yes
example 'ON-bipolar cell'

[cellannotation_setname]--cell_ontology_assessment

Format: The column name is the string prefix [cellannotation_setname]-- concatenated with the string value cell_ontology_assessment, i.e. [cellannotation_setname] + '--' + 'cell_ontology_assessment'

column [cellannotation_setname]--cell_ontology_assessment
index Cell barcode names
dtype string
value Free-text field for researchers to express disagreements with any aspect of the Cell Ontology for this cell annotation.
source file or UI
required for publication on CAP no
example 'Amacrine cell should have four child terms: glycinergic, GABAergic, GABAergic Glycinergic amacrine cells and non-GABAergic non-glycinergic amacrine cells; which then contain their corresponding child terms'

var and raw.var (Gene Metadata)

CAP requires that gene names be provided by ENSEMBL terms. These MUST be encoded in the index of the var fields following the AnnData standard.

NOTE: var is a pandas.DataFrame object. ENSEMBL terms MUST be used to index these rows, i.e. pandas.DataFrame.index.
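
A quick way to spot rows that are not indexed by ENSEMBL terms is to test each index value against the usual Ensembl gene ID layout ('ENS', an optional species prefix, 'G', then 11 digits). The sketch below is illustrative: the helper name is hypothetical, and accepting an optional version suffix such as '.5' is an assumption that the schema does not address.

```python
import re

# Ensembl gene IDs, e.g. ENSG00000141510 (human) or ENSMUSG00000059552 (mouse).
# Assumption: an optional ".<version>" suffix is tolerated.
ENSEMBL_GENE_ID = re.compile(r"ENS[A-Z]*G\d{11}(\.\d+)?")

def non_ensembl_ids(index):
    """Return the entries of a var index that do not look like Ensembl gene IDs."""
    return [gene for gene in index if ENSEMBL_GENE_ID.fullmatch(gene) is None]

bad = non_ensembl_ids(["ENSG00000141510", "ENSMUSG00000059552", "TP53"])
```

With an AnnData object, the same check could be run on `adata.var.index` and `adata.raw.var.index`.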

NOTE: the UI will convert the ENSEMBL terms to common gene names based on the organism specified. We currently support Homo sapiens and Mus musculus.

If there are other species you wish to upload to CAP, please contact support@celltype.info and we will work to accommodate your request.

obsm (Embeddings)

Users MUST include at least one two-dimensional embedding, whose key in obsm MUST begin with the prefix X_.

That is, given a data matrix X of dimensions (#observations, #variables), the required embedding MUST have dimensions (#observations, 2).

NOTE: Embeddings with more than two dimensions MAY be encoded in the AnnData file, but these embeddings will not be accessible via the CAP UI.

The RECOMMENDED format for the names of embeddings in obsm is:

'X + _ + [EMBEDDING_TYPE] + _ + [SUFFIX]'

whereby:

  • 'X_': MUST be used to denote embeddings in AnnData.obsm. REQUIRED.
  • [EMBEDDING_TYPE]: MUST denote the algorithm used to generate the embedding (e.g. UMAP, tSNE, pca, etc.). REQUIRED.
  • [SUFFIX]: Denotes a descriptive tag informative enough for third-party users; used to distinguish between multiple embeddings of the same type. OPTIONAL.

examples: 'X_pca', 'X_tsne', 'X_tSNE', 'X_umap', 'X_UMAP_nneigbors15', 'X_umap_2'
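
The naming and shape rules for embeddings can be combined into one check that lists which obsm entries would be usable in the CAP UI. This is a sketch under stated assumptions: the regular expression is one reading of the recommended name format, the function name is hypothetical, and the input is a plain mapping from obsm key to array shape rather than an AnnData object.

```python
import re

# Assumed reading of 'X + _ + [EMBEDDING_TYPE] + _ + [SUFFIX]' with an optional suffix.
EMBEDDING_KEY = re.compile(r"X_[A-Za-z0-9]+(_[A-Za-z0-9]+)*")

def displayable_embeddings(obsm_shapes, n_obs):
    """Return obsm keys that follow the name format and have shape (n_obs, 2)."""
    return [
        key
        for key, shape in obsm_shapes.items()
        if EMBEDDING_KEY.fullmatch(key) and tuple(shape) == (n_obs, 2)
    ]

# Hypothetical dataset with 100 cells.
shapes = {"X_umap": (100, 2), "X_pca": (100, 50), "umap": (100, 2)}
usable = displayable_embeddings(shapes, n_obs=100)
```

With an AnnData object, the mapping could be built as `{key: value.shape for key, value in adata.obsm.items()}`.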

uns (Dataset metadata)

NOTE: Each time a cell annotation set (cellannotation_setname) is modified, these values may change.

cellannotation_schema_version

Key-value pair in the uns dictionary

key cellannotation_schema_version
type string
value The schema version of the cell annotation open standard. This versioning MUST follow the format '[MAJOR].[MINOR].[PATCH]' as defined by Semantic Versioning 2.0.0. The current version MUST be 0.1.0.
source software
required for publication on CAP yes
example '0.1.0'

publication_timestamp

Key-value pair in the uns dictionary

key publication_timestamp
type string
value The timestamp of the dataset published on CAP. This MUST be a string in the format %yyyy-%MM-%dd'T'%hh:%mm:%ss.
source software
required for publication on CAP yes
example '2023-11-21T04:12:36'
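
In Python, this timestamp layout corresponds to the strftime format string '%Y-%m-%dT%H:%M:%S'. The sketch below shows formatting and a strict validity check; it assumes a 24-hour clock, and the function names are hypothetical. The round trip in `is_valid_timestamp` is there because `strptime` on its own also accepts values that are not zero-padded.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"  # assumes a 24-hour clock

def format_timestamp(moment):
    """Format a datetime as a publication_timestamp string."""
    return moment.strftime(TIMESTAMP_FORMAT)

def is_valid_timestamp(text):
    """Return True if `text` parses and re-formats to exactly the same string."""
    try:
        parsed = datetime.strptime(text, TIMESTAMP_FORMAT)
    except ValueError:
        return False
    return parsed.strftime(TIMESTAMP_FORMAT) == text

stamp = format_timestamp(datetime(2023, 11, 21, 4, 12, 36))
```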

publication_version

Key-value pair in the uns dictionary

key publication_version
type string of 'v' + '[integer]'
value This versioning MUST follow the format 'v' + '[integer]', whereby newer versions must be naturally incremented.
source software
required for publication on CAP yes
example 'v1' or 'v3'
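
The 'v' + integer format and the increment rule can be sketched as a small helper. The function name is hypothetical, and incrementing by exactly one is an interpretation of "naturally incremented".

```python
import re

def next_publication_version(current):
    """Return the version that follows `current`, e.g. 'v1' -> 'v2'."""
    match = re.fullmatch(r"v(\d+)", current)
    if match is None:
        raise ValueError(f"not a valid publication_version: {current!r}")
    return f"v{int(match.group(1)) + 1}"

following = next_publication_version("v1")
```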

title

Key-value pair in the uns dictionary

key title
type string
value The title of the dataset on CAP. This MUST be less than or equal to 200 characters.
source file or UI
required for publication on CAP yes
example 'Human retina cell atlas - retinal ganglion cells'

description

Key-value pair in the uns dictionary

key description
type string
value The description of the dataset on CAP.
source file or UI
required for publication on CAP yes
example 'A total of 15 retinal ganglion cell clusters were identified from over 99K retinal ganglion cell nuclei in the current atlas. Utilizing previously characterized markers from macaque, 5 clusters can be annotated.'

dataset_url

Key-value pair in the uns dictionary

key dataset_url
type string
value A persistent URL of the dataset on CAP.
source software
required for publication on CAP yes
example 'https://celltype.info/CAP_DataCuration/A-Single-Cell-Transcriptome-Atlas-of-the-Human-Pancreas/1/dataset/20'

cap_publication_title

Key-value pair in the uns dictionary

key cap_publication_title
type string
value The title of the publication on CAP. This MUST be less than or equal to 200 characters.

NOTE: the term "publication" refers to the workspace published on CAP with a version and timestamp.
source file or UI
required for publication on CAP yes
example 'Integrated multi-omics single cell atlas of the human retina'

cap_publication_description

Key-value pair in the uns dictionary

key cap_publication_description
type string
value The description of the publication on CAP.

NOTE: the term "publication" refers to the workspace published on CAP with a version and timestamp.
source file or UI
required for publication on CAP yes
example Similar to an abstract in a scientific publication, the cap_publication_description should provide enough information for other scientists unfamiliar with the work.

cap_publication_url

Key-value pair in the uns dictionary

key cap_publication_url
type string
value A persistent URL of the publication on CAP.

NOTE: the term "publication" refers to the workspace published on CAP with a version and timestamp.
source software
required for publication on CAP yes
example 'https://celltype.info/CAP_DataCuration/A-Single-Cell-Transcriptome-Atlas-of-the-Human-Pancreas/1'

authors_list

Key-value pair in the uns dictionary

key authors_list
type string
value This field stores a list of CAP users who are included in the CAP project as collaborators, regardless of their specific role (Viewer, Editor, or Owner).
source software
required for publication on CAP yes
example '['John Smith', 'Cody Miller', 'Sarah Jones']'

author_name

Key-value pair in the uns dictionary

key author_name
type string
value This MUST be a string in the format [FIRST NAME] [LAST NAME].
source file or UI
required for publication on CAP yes
example 'John Smith'

author_contact

Key-value pair in the uns dictionary

key author_contact
type string
value This MUST be a valid email address of the author.
source file or UI
required for publication on CAP yes
example 'jsmith@university.edu'

author_orcid

Key-value pair in the uns dictionary

key author_orcid
type string
value This MUST be a valid ORCID for the author.
source file or UI
required for publication on CAP yes
example '0000-0002-3843-3472'
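
An ORCID iD consists of 16 characters in four hyphen-separated groups, where the last character is a check digit computed with the ISO 7064 MOD 11-2 algorithm. The schema only requires "a valid ORCID", so the following is one possible offline check rather than CAP's own validation; the function names are hypothetical, and the check cannot confirm that the iD is actually registered.

```python
import re

def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(orcid):
    """Check the dddd-dddd-dddd-dddX layout and the final check character."""
    if re.fullmatch(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]", orcid) is None:
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]

valid = is_valid_orcid("0000-0002-3843-3472")
```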

cellannotation_metadata

Python dictionary within the uns dictionary, with the key the string [cellannotation_setname]

cellannotation_metadata

key [cellannotation_setname]
type python dictionary
value The rest of the dictionary as defined below.
source file or UI
required for publication on CAP yes
example '{'cell_type': {'annotation_method':'algorithmic', ...}}'

description

key 'description'
type string
value Description of the cellannotation_set created. This is free-text for collaborators and third-parties to understand the context/background for the creation of this cell annotation set.

We STRONGLY RECOMMEND that this field be descriptive enough for other scientists unfamiliar with this project to understand why this set of cell annotations exists.
source file or UI
required for publication on CAP yes
example 'Cell annotations based on resolution broad clustering using the Leiden algorithm.'

annotation_method

key 'annotation_method'
type string
value 'algorithmic', 'manual', or 'both'

NOTE: If 'algorithmic' or 'both', more details are required. If 'manual', the values in the following 'algorithm_' and 'reference_' fields will be 'NA'.
source file or UI
required for publication on CAP yes
example 'algorithmic' or 'manual' or 'both'

algorithm_name

key 'algorithm_name'
type string
value The name of the algorithm used.
source file or UI
required for publication on CAP yes
example 'scArches' or if 'manual' then 'NA'

algorithm_version

key 'algorithm_version'
type string
value The string of the algorithm's version, which is typically in the format '[MAJOR].[MINOR]', but other versioning systems are permitted based on the algorithm's versioning.
source file or UI
required for publication on CAP yes
example '0.5.9' or if 'manual' then 'NA'

algorithm_repo_url

key 'algorithm_repo_url'
type string
value The string of the URL of the version control repository associated with the algorithm used (if applicable). It MUST be a string of a valid URL.
source file or UI
required for publication on CAP yes
example 'https://github.com/theislab/scarches' or if 'manual' then 'NA'

reference_location

key 'reference_location'
type string
value The string of the URL pointing to the reference dataset.
source file or UI
required for publication on CAP no
example 'https://figshare.com/projects/Tabula_Muris_Senis/64982' or if 'manual' then 'NA'

reference_description

key 'reference_description'
type string
value Free-text description of the reference used for automated annotation for this cell annotation set. Users are welcome to write out context which may be useful for other researchers.
source file or UI
required for publication on CAP no
example 'Tabula Muris Senis: a single cell transcriptomic atlas across the life span of Mus musculus which includes data from 18 tissues and organs.' or if 'manual' then 'NA'
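
The cellannotation_metadata entry for one annotation set can be assembled as a plain dictionary. The sketch below applies the annotation_method rule described above, filling the 'algorithm_' and 'reference_' fields with 'NA' when the method is 'manual'. The function name is hypothetical, and defaulting unspecified fields to 'NA' for the other methods is an assumption made to keep the example short.

```python
METHOD_DETAIL_KEYS = [
    "algorithm_name",
    "algorithm_version",
    "algorithm_repo_url",
    "reference_location",
    "reference_description",
]

def build_annotation_metadata(description, annotation_method, **details):
    """Build the metadata dictionary for a single cell annotation set."""
    if annotation_method not in ("algorithmic", "manual", "both"):
        raise ValueError(f"unknown annotation_method: {annotation_method!r}")
    entry = {"description": description, "annotation_method": annotation_method}
    for key in METHOD_DETAIL_KEYS:
        # 'manual' always yields 'NA'; otherwise missing details default to 'NA'
        # (an assumption of this sketch).
        entry[key] = "NA" if annotation_method == "manual" else details.get(key, "NA")
    return entry

cellannotation_metadata = {
    "cell_type": build_annotation_metadata(
        "Cell annotations based on broad Leiden clustering.",
        "algorithmic",
        algorithm_name="scArches",
        algorithm_version="0.5.9",
        algorithm_repo_url="https://github.com/theislab/scarches",
    )
}
```

The resulting dictionary is keyed by [cellannotation_setname], here 'cell_type', which matches the layout shown in the example for cellannotation_metadata.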

Appendix: Changelog

schema version 1.0.0

  • Renamed dataset_title to title
  • Renamed dataset_description to description
  • Renamed cellannotation_setdescription to description