Contact: [...]
Version: 1.0.0
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14, RFC2119, and RFC8174 when, and only when, they appear in all capitals, as shown here.
Much of this schema has been based on the v4.0.0 CELLxGENE schema, documented at this GitHub commit here.
The following ontologies are utilized within this schema:
Ontology | OBO Prefix | Download |
---|---|---|
Cell Ontology | CL | cl.owl |
Experimental Factor Ontology | EFO | efo.owl |
Mondo Disease Ontology | MONDO | mondo.owl |
NCBI organismal classification | NCBITaxon | ncbitaxon.owl |
Uberon multi-species anatomy ontology | UBERON | uberon.owl |
Sections are as follows, based on the AnnData file format:
- X (Matrix layers)
- obs (Cell metadata), metadata on each cell
- var and raw.var (Gene metadata), metadata on each gene
- obsm (Embeddings)
- uns (Dataset metadata), metadata related to the dataset itself
This is the standard X layer within the AnnData file, see here.
This is the data matrix of dimension (#observations, #variables).
Users MUST provide the raw count matrix in either the .X or the .raw.X field. If the user provides the raw count matrix in .X, the .raw layer MUST be empty.
Users MAY provide the normalized matrix in the .X field. We STRONGLY RECOMMEND that users provide both a raw count matrix and a normalized one, and that they normalize the matrix with the following algorithm:
- Normalize counts per cell to 10,000 reads per cell (NOTE: this is done simply so that counts become comparable among cells. See the default values used in the ScanPy tutorial here or the Seurat function NormalizeData() here)
- Use the log(1+x) transformation
If the normalized matrix provided by the user differs from the one expected under the algorithm above, the user-provided matrix will be moved to an additional AnnData layer named AnnData.layers['user-provided']. In this case, the .X field will be filled with a raw count matrix re-normalized using the algorithm above. NOTE: The re-normalization is needed for the CAP calculations to work reliably.
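The two normalization steps above can be sketched in plain Python. This is only an illustration on a small dense list of lists, assuming the natural logarithm; the function name normalize_log1p and the toy matrix are hypothetical. In ScanPy the corresponding calls are scanpy.pp.normalize_total(adata, target_sum=1e4) followed by scanpy.pp.log1p(adata), although this document does not state which implementation CAP itself runs.

```python
import math


def normalize_log1p(counts, target_sum=10_000):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    normalized = []
    for row in counts:
        total = sum(row)
        # Cells with zero total counts stay all-zero to avoid division by zero.
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(value * scale) for value in row])
    return normalized


# Toy raw count matrix: 2 cells (rows) x 4 genes (columns).
raw = [
    [1, 3, 0, 6],  # cell 1: 10 counts in total
    [0, 0, 5, 5],  # cell 2: 10 counts in total
]
norm = normalize_log1p(raw)
```

Zero counts remain zero after both steps, so a sparse raw matrix stays sparse once normalized.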
After the CAP preprocessing, any AnnData file downloadable via CAP:
- MUST have the raw count matrix in .raw.X,
- MUST have a matrix normalized by the algorithm above in .X,
- MAY have a matrix normalized by another algorithm in .layers['user-provided'].
In any layer, if a matrix has 50% or more values that are zeros, it is STRONGLY RECOMMENDED that the matrix be encoded as a scipy.sparse.csr_matrix.
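As a rough illustration of the 50% rule, the sketch below computes the fraction of zeros in a small dense matrix and reports whether sparse encoding would be recommended. The helper names zero_fraction and should_be_sparse are hypothetical; for real data, passing a dense array to scipy.sparse.csr_matrix performs the actual conversion.

```python
def zero_fraction(matrix):
    """Return the fraction of entries in a dense list-of-lists matrix that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total if total else 0.0


def should_be_sparse(matrix, threshold=0.5):
    """True when the zero fraction reaches the threshold (50% in this schema)."""
    return zero_fraction(matrix) >= threshold
```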
obs is a pandas.DataFrame.
column | organism_ontology_term_id |
dtype | string |
value | This MUST be a child of NCBITaxon:33208 for Metazoa. |
source | file or UI |
required for publication on CAP | yes |
example | 'NCBITaxon:9606' or 'NCBITaxon:10090' |
column | organism |
dtype | string |
value | This MUST be the human-readable term assigned to the value of organism_ontology_term_id . The ontology term and ontology term ID MUST match. |
source | file |
required for publication on CAP | yes |
example | 'Homo sapiens' or 'Mus musculus' |
column | disease_ontology_term_id |
dtype | string |
value | This MUST be a MONDO term or 'PATO:0000461' for normal or healthy. |
source | file or UI |
required for publication on CAP | yes |
example | 'MONDO:0004975' or 'MONDO:0018177' or 'PATO:0000461' |
column | disease |
dtype | string |
value | This MUST be the human-readable term which corresponds to the value of disease_ontology_term_id . The ontology term and ontology term ID MUST match. |
source | file |
required for publication on CAP | yes |
example | 'Alzheimer's Disease' or 'Adult Brain Glioblastoma' or 'normal' |
column | assay_ontology_term_id |
dtype | string |
value | This MUST be an EFO term and SHOULD be the most accurate EFO term for this assay. The two options are more specific terms under "assay by molecule" i.e. 'EFO:0002772' or "single cell library construction" i.e. 'EFO:0010183' . Recommended values for commonly-used assays:
|
source | file or UI |
required for publication on CAP | yes |
example | 'EFO:0009922' or 'EFO:0008931' |
column | assay |
dtype | string |
value | This MUST be the human-readable term which corresponds to the value of assay_ontology_term_id . The ontology term and ontology term ID MUST match. |
source | file |
required for publication on CAP | yes |
example | '10x 3' v3' or 'Smart-seq2' |
column | tissue_ontology_term_id |
dtype | string |
value | This MUST be the most accurate child of UBERON:0001062 for anatomical entity. |
source | file or UI |
required for publication on CAP | yes |
example | 'UBERON:0000451' or 'UBERON:0000966' |
column | tissue |
dtype | string |
value | This MUST be the human-readable term assigned to the value of tissue_ontology_term_id . The ontology term and ontology term ID MUST match. |
source | file |
required for publication on CAP | yes |
example | 'prefrontal cortex' or 'retina' |
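A first-pass syntactic check of the ontology term ID columns above might look like the following sketch. It only verifies the OBO prefix and the numeric part of each ID; whether a term is actually a child of NCBITaxon:33208 or UBERON:0001062 cannot be decided without loading the ontology itself. The names OBS_TERM_ID_PREFIXES and has_valid_prefix are hypothetical.

```python
import re

# Expected OBO prefixes per obs column, taken from the tables above.
OBS_TERM_ID_PREFIXES = {
    "organism_ontology_term_id": ("NCBITaxon",),
    "disease_ontology_term_id": ("MONDO", "PATO"),
    "assay_ontology_term_id": ("EFO",),
    "tissue_ontology_term_id": ("UBERON",),
}


def has_valid_prefix(column, term_id):
    """Check that `term_id` looks like PREFIX:digits with a prefix allowed for `column`."""
    match = re.fullmatch(r"([A-Za-z]+):(\d+)", term_id)
    if match is None or match.group(1) not in OBS_TERM_ID_PREFIXES[column]:
        return False
    # The only PATO term the schema allows for disease is 'PATO:0000461' (normal).
    if match.group(1) == "PATO":
        return term_id == "PATO:0000461"
    return True
```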
Users may OPTIONALLY include a single field for clustering within AnnData files, or multiple fields denoting clustering, e.g. different clustering algorithms, multiple resolutions of clustering, etc.
We therefore REQUIRE that clustering is clearly denoted within the AnnData file if it contains clustering fields.
ScanPy has set an AnnData community standard of naming the *.obs column after the type of algorithm, e.g. the function scanpy.tl.louvain (documented here) by default saves the clustering as anndata.obs['louvain']. Similarly, leiden (documented here) is often encoded as anndata.obs['leiden'].
column | 'cluster' , 'leiden' , 'louvain' or 'cluster + _ + [ALGORITHM_TYPE] + _ + [SUFFIX]' whereby [ALGORITHM_TYPE] and [SUFFIX] are OPTIONAL. |
dtype | string |
value |
|
source | file |
required for publication on CAP | no |
example column name | 'cluster_leiden' or 'cluster_leiden_broad' or 'louvain' or 'leiden_fine' |
example value | '0' or '1' or '2' |
The string specified by the user for [cellannotation_setname] will be used as the pandas DataFrame column name (key) to encode the following cell annotation metadata columns in *.obs.
NOTE: A dataset may have multiple sets of cell annotations, each with a corresponding set of cell annotation metadata, e.g. 'cell_type' and 'broadclustering_celltype'.
Format: The column name is the string [cellannotation_setname] and the values are the strings of cell_label. Refer to the fields cellannotation_setname and cell_label in the JSON Schema.
column | [cellannotation_set] |
index | Cell barcode names |
dtype | string |
value | Any free-text term which the author uses to annotate cells, the preferred cell label name used by the author. |
source | file or UI |
required for publication on CAP | yes |
example | 'HBC2' or 'rod bipolar' |
Format: The column name is the value [cellannotation_setname] concatenated with two hyphens and the string 'cell_fullname', i.e. [cellannotation_setname] + '--' + 'cell_fullname'.
For example, if the user specified the cell annotation set name as broad_cells1, then the name of the column in the pandas DataFrame will be broad_cells1--cell_fullname.
column | [cellannotation_set]--cell_fullname |
index | Cell barcode names |
dtype | string |
value | The full-length name for the biological entity listed in [cellannotation_setname] by the author. |
source | file or UI |
required for publication on CAP | yes |
example | 'rod bipolar' |
Format: The column name is the string prefix [cellannotation_setname]-- concatenated with the string value cell_ontology_exists, i.e. [cellannotation_setname] + '--' + 'cell_ontology_exists'.
column | [cellannotation_set]--cell_ontology_exists |
index | Cell barcode names |
dtype | boolean |
value | Boolean value in Python (either True or False). |
source | file or UI |
required for publication on CAP | yes |
example | 'True' |
Format: The column name is the value [cellannotation_setname] concatenated with two hyphens and the string 'cell_ontology_term_id', i.e. [cellannotation_setname] + '--' + 'cell_ontology_term_id'.
column | [cellannotation_set]--cell_ontology_term_id |
index | Cell barcode names |
dtype | string |
value | This MUST be a term from either the Cell Ontology or from some ontology that extends it by classifying cell types under terms from the Cell Ontology e.g. the Provisional Cell Ontology or the Drosophila Anatomy Ontology (DAO). |
source | file or UI |
required for publication on CAP | yes |
example | 'CL:0000751' |
Format: The column name is the value [cellannotation_setname] concatenated with two hyphens and the string 'cell_ontology_term', i.e. [cellannotation_setname] + '--' + 'cell_ontology_term'.
column | [cellannotation_set]--cell_ontology_term |
index | Cell barcode names |
dtype | string |
value | The human-readable name assigned to the value of 'cell_ontology_term_id' . |
source | file or UI |
required for publication on CAP | yes |
example | 'rod bipolar cell' |
Format: The column name is the value [cellannotation_setname] concatenated with two hyphens and the string 'rationale', i.e. [cellannotation_setname] + '--' + 'rationale'.
column | [cellannotation_set]--rationale |
index | Cell barcode names |
dtype | string |
value | The free-text rationale which users provide as justification/evidence for their cell annotations. |
source | file or UI |
required for publication on CAP | yes |
example | 'This cell was annotated with [blank] given the canonical markers in the field [X], [Y], [Z]. We noticed [X] and [Y] running differential expression.' |
Format: The column name is the value [cellannotation_setname] concatenated with two hyphens and the string 'rationale_dois', i.e. [cellannotation_setname] + '--' + 'rationale_dois'.
column | [cellannotation_set]--rationale_dois |
index | Cell barcode names |
dtype | string |
value | Comma-separated string of valid publication DOIs cited by the author to support or provide justification/evidence/context for cell_label . |
source | file or UI |
required for publication on CAP | no |
example | '10.1038/s41587-022-01468-y, 10.1038/s41556-021-00787-7, 10.1038/s41586-021-03465-8' |
Format: The column name is the value [cellannotation_setname] concatenated with two hyphens and the string 'marker_gene_evidence', i.e. [cellannotation_setname] + '--' + 'marker_gene_evidence'.
column | [cellannotation_set]--marker_gene_evidence |
index | Cell barcode names |
dtype | string |
value | Comma-separated string of gene names explicitly used as evidence for this cell annotation. Each gene MUST be included in the matrix of the AnnData/Seurat file. |
source | file or UI |
required for publication on CAP | yes |
example | 'TP53, KRAS, BRCA1' |
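The comma-separated fields above (marker genes, canonical markers, DOIs, synonyms) can be split with a small helper, and the marker gene requirement can then be checked against the genes present in the matrix. This is a sketch under one assumption: the caller already has the list of gene names in the same namespace as the marker genes, since this document does not spell out how gene names are matched against the ENSEMBL-indexed var. Both function names are hypothetical.

```python
def parse_comma_separated(value):
    """Split a comma-separated string into stripped, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def missing_marker_genes(marker_gene_evidence, genes_in_matrix):
    """Return the marker genes that do not appear among the genes of the matrix."""
    known = set(genes_in_matrix)
    return [gene for gene in parse_comma_separated(marker_gene_evidence) if gene not in known]
```

An empty result means every listed marker gene was found.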
Format: The column name is the value [cellannotation_setname] concatenated with two hyphens and the string 'canonical_marker_genes', i.e. [cellannotation_setname] + '--' + 'canonical_marker_genes'.
column | [cellannotation_setname]--canonical_marker_genes |
index | Cell barcode names |
dtype | string |
value | Comma-separated string of gene names considered to be canonical markers for the biological entity used in the cell annotation. |
source | file or UI |
required for publication on CAP | yes |
example | 'GATA3, CD3D, CD3E' |
Format: The column name is the value [cellannotation_setname] concatenated with two hyphens and the string 'synonyms', i.e. [cellannotation_setname] + '--' + 'synonyms'.
column | [cellannotation_set]--synonyms |
index | Cell barcode names |
dtype | string |
value | Comma-separated string of synonyms for values in [cellannotation_setname] . Abbreviations are acceptable. |
source | file or UI |
required for publication on CAP | yes |
example | 'neuroglial cell, glial cell, neuroglia' or 'amacrine cell' or 'FMB cell' |
Format: The column name is the string prefix [cellannotation_setname]-- concatenated with the string value category_fullname, i.e. [cellannotation_setname] + '--' + 'category_fullname'. This MUST be the full-length name for the biological entity, not an abbreviation.
column | [cellannotation_set]--category_fullname |
index | Cell barcode names |
dtype | string |
value | A single value of the category/parent term for the cell label value in [cellannotation_setname] . |
source | file or UI |
required for publication on CAP | yes |
example | 'ON-bipolar cell' |
Format: The column name is the string prefix [cellannotation_setname]-- concatenated with the string value category_cell_ontology_exists, i.e. [cellannotation_setname] + '--' + 'category_cell_ontology_exists'.
column | [cellannotation_set]--category_cell_ontology_exists |
index | Cell barcode names |
dtype | boolean |
value | Boolean value in Python (either True or False). |
source | file or UI |
required for publication on CAP | yes |
example | 'True' |
Format: The column name is the value [cellannotation_setname] concatenated with two hyphens and the string 'category_cell_ontology_term_id', i.e. [cellannotation_setname] + '--' + 'category_cell_ontology_term_id'.
column | [cellannotation_set]--category_cell_ontology_term_id |
index | Cell barcode names |
dtype | string |
value | The ID from either the Cell Ontology or from some ontology that extends it by classifying cell types under terms from the Cell Ontology. |
source | file or UI |
required for publication on CAP | yes |
example | 'CL:0000749' |
Format: The column name is the string prefix [cellannotation_setname]-- concatenated with the string value category_cell_ontology_term, i.e. [cellannotation_setname] + '--' + 'category_cell_ontology_term'.
column | [cellannotation_set]--category_cell_ontology_term |
index | Cell barcode names |
dtype | string |
value | The human-readable name assigned to the value of 'category_cell_ontology_term_id' . |
source | file or UI |
required for publication on CAP | yes |
example | 'ON-bipolar cell' |
Format: The column name is the string prefix [cellannotation_setname]-- concatenated with the string value cell_ontology_assessment, i.e. [cellannotation_setname] + '--' + 'cell_ontology_assessment'.
column | [cellannotation_set]--cell_ontology_assessment |
index | Cell barcode names |
dtype | string |
value | Free-text field for researchers to express disagreements with any aspect of the Cell Ontology for this cell annotation. |
source | file or UI |
required for publication on CAP | no |
example | 'Amacrine cell should have four child terms: glycinergic, GABAergic, GABAergic Glycinergic amacrine cells and non-GABAergic non-glycinergic amacrine cells; which then contain their corresponding child terms' |
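The naming convention used by the cell annotation columns above can be summarized in a short sketch that builds the expected obs column names for one annotation set and reports which are absent. The split into required and optional fields follows the "required for publication on CAP" rows of the tables; the constant and function names are hypothetical.

```python
SEPARATOR = "--"

# Fields marked "required for publication on CAP: yes" in the tables above.
REQUIRED_FIELDS = [
    "cell_fullname",
    "cell_ontology_exists",
    "cell_ontology_term_id",
    "cell_ontology_term",
    "rationale",
    "marker_gene_evidence",
    "canonical_marker_genes",
    "synonyms",
    "category_fullname",
    "category_cell_ontology_exists",
    "category_cell_ontology_term_id",
    "category_cell_ontology_term",
]
# Fields marked "no".
OPTIONAL_FIELDS = ["rationale_dois", "cell_ontology_assessment"]


def annotation_columns(setname, include_optional=False):
    """Return the label column followed by its metadata columns for one annotation set."""
    fields = REQUIRED_FIELDS + (OPTIONAL_FIELDS if include_optional else [])
    return [setname] + [f"{setname}{SEPARATOR}{field}" for field in fields]


def missing_columns(setname, obs_columns):
    """Return the required columns for `setname` that are not in `obs_columns`."""
    present = set(obs_columns)
    return [column for column in annotation_columns(setname) if column not in present]
```

With an AnnData object, missing_columns('cell_type', adata.obs.columns) would list what still needs to be filled in before publication.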
CAP requires that gene names be provided as ENSEMBL terms. These MUST be encoded in the index of the var fields following the AnnData standard.
NOTE: var is a pandas.DataFrame object. ENSEMBL terms MUST be used to index these rows, i.e. pandas.DataFrame.index.
NOTE: the UI will convert the ENSEMBL terms to common gene names based on the organism specified. We currently support Homo sapiens and Mus musculus.
If there are other species you wish to upload to CAP, please contact support@celltype.info and we will work to accommodate your request.
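A quick way to spot non-ENSEMBL entries in the var index is a pattern check such as the sketch below. It relies on the standard Ensembl stable gene ID layout (ENSG or ENSMUSG followed by 11 digits) for the two supported organisms. Rejecting IDs that carry a version suffix such as '.18' is an assumption of this sketch, not a rule stated by the schema, and the names used here are hypothetical.

```python
import re

# Ensembl stable gene ID patterns for the two organisms currently supported.
ENSEMBL_GENE_PATTERNS = {
    "Homo sapiens": re.compile(r"ENSG\d{11}"),
    "Mus musculus": re.compile(r"ENSMUSG\d{11}"),
}


def invalid_gene_ids(var_index, organism):
    """Return the entries of `var_index` that do not look like Ensembl gene IDs."""
    pattern = ENSEMBL_GENE_PATTERNS[organism]
    return [gene_id for gene_id in var_index if pattern.fullmatch(gene_id) is None]
```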
Users MUST include at least one two-dimensional embedding, whose key in obsm MUST start with the prefix X_.
That is, given a data matrix X of dimension (#observations, #variables), the dimensions of such an embedding MUST be (#observations, 2).
NOTE: Embeddings with more than two dimensions may be encoded in the AnnData file, but these embeddings will not be accessible via the CAP UI.
The format for the name of embeddings in obsm is RECOMMENDED to be the following format: 'X + _ + [EMBEDDING_TYPE] + _ + [SUFFIX]' whereby:
- 'X_': MUST be used to denote embeddings in AnnData.obsm. REQUIRED.
- [EMBEDDING_TYPE]: MUST denote the algorithm used to generate the embedding (e.g. UMAP, tSNE, pca, etc.). REQUIRED.
- [SUFFIX]: Denotes a descriptive tag informative enough for third-party users; used to distinguish between multiple embeddings of the same type. OPTIONAL.
Examples: 'X_pca', 'X_tsne', 'X_tSNE', 'X_umap', 'X_UMAP_nneigbors15', 'X_umap_2'
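The recommended key format and the two-dimensional requirement can be illustrated with the sketch below. The regular expression is one possible reading of 'X + _ + [EMBEDDING_TYPE] + _ + [SUFFIX]', assuming the embedding type is alphanumeric and everything after the second underscore is the suffix. The names parse_embedding_key and is_cap_viewable are hypothetical.

```python
import re

# 'X_' + embedding type + optional '_' + suffix.
EMBEDDING_KEY = re.compile(r"X_(?P<type>[A-Za-z0-9]+)(?:_(?P<suffix>.+))?")


def parse_embedding_key(key):
    """Return (embedding_type, suffix), or None if the key does not follow the format."""
    match = EMBEDDING_KEY.fullmatch(key)
    if match is None:
        return None
    return match.group("type"), match.group("suffix")


def is_cap_viewable(shape, n_obs):
    """True when an embedding has shape (#observations, 2)."""
    return tuple(shape) == (n_obs, 2)
```

For an AnnData object, is_cap_viewable(adata.obsm['X_umap'].shape, adata.n_obs) would apply the shape check to one embedding.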
NOTE: Each time a cell annotation cellannotation_setname is modified, these values potentially change.
Key-value pair in the uns dictionary
key | cellannotation_schema_version |
type | string |
value | The schema version, the cell annotation open standard.
This versioning MUST follow the format '[MAJOR].[MINOR].[PATCH]' as defined by Semantic Versioning 2.0.0. Current version MUST follow 0.1.0 |
source | software |
required for publication on CAP | yes |
example | '0.1.0' |
Key-value pair in the uns dictionary
key | publication_timestamp |
type | string |
value | The timestamp of the dataset published on CAP. This MUST be a string in the format %yyyy-%MM-%dd'T'%hh:%mm:%ss . |
source | software |
required for publication on CAP | yes |
example | '2023-11-21T04:12:36' |
Key-value pair in the uns dictionary
key | publication_version |
type | string of 'v' + '[integer]' |
value | This versioning MUST follow the format 'v' + '[integer]' , whereby newer versions must be naturally incremented. |
source | software |
required for publication on CAP | yes |
example | 'v1' or 'v3' |
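The three software-written keys above (cellannotation_schema_version, publication_timestamp, publication_version) each follow a fixed string format. A minimal sketch of checks for them follows, assuming a plain [MAJOR].[MINOR].[PATCH] version without pre-release or build tags, and zero-padded timestamp fields as in the example. All function names are hypothetical.

```python
import re
from datetime import datetime


def is_schema_version(value):
    """Check '[MAJOR].[MINOR].[PATCH]' with non-negative integers and no leading zeros."""
    return re.fullmatch(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)", value) is not None


def is_publication_timestamp(value):
    """Check the 'yyyy-MM-ddTHH:mm:ss' layout and that it is a real date and time."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}", value) is None:
        return False
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
    except ValueError:
        return False
    return True


def next_publication_version(current):
    """Return the next 'v' + integer version, e.g. 'v1' -> 'v2'."""
    match = re.fullmatch(r"v(\d+)", current)
    if match is None:
        raise ValueError(f"not a valid publication_version: {current!r}")
    return f"v{int(match.group(1)) + 1}"
```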
Key-value pair in the uns dictionary
key | title |
type | string |
value | The title of the dataset on CAP. This MUST be less than or equal to 200 characters. |
source | file or UI |
required for publication on CAP | yes |
example | 'Human retina cell atlas - retinal ganglion cells' |
Key-value pair in the uns dictionary
key | description |
type | string |
value | The description of the dataset on CAP. |
source | file or UI |
required for publication on CAP | yes |
example | 'A total of 15 retinal ganglion cell clusters were identified from over 99K retinal ganglion cell nuclei in the current atlas. Utilizing previously characterized markers from macaque, 5 clusters can be annotated.' |
Key-value pair in the uns dictionary
key | dataset_url |
type | string |
value | A persistent URL of the dataset on CAP. |
source | software |
required for publication on CAP | yes |
example | 'https://celltype.info/CAP_DataCuration/A-Single-Cell-Transcriptome-Atlas-of-the-Human-Pancreas/1/dataset/20' |
Key-value pair in the uns dictionary
key | cap_publication_title |
type | string |
value | The title of the publication on CAP. This MUST be less than or equal to 200 characters. NOTE: the term "publication" refers to the workspace published on CAP with a version and timestamp. |
source | file or UI |
required for publication on CAP | yes |
example | 'Integrated multi-omics single cell atlas of the human retina' |
Key-value pair in the uns dictionary
key | cap_publication_description |
type | string |
value | The description of the publication on CAP. NOTE: the term "publication" refers to the workspace published on CAP with a version and timestamp. |
source | file or UI |
required for publication on CAP | yes |
example | Similar to an abstract in a scientific publication, the cap_publication_description should provide enough information for other scientists unfamiliar with the work. |
Key-value pair in the uns dictionary
key | cap_publication_url |
type | string |
value | A persistent URL of the publication on CAP. NOTE: the term "publication" refers to the workspace published on CAP with a version and timestamp. |
source | software |
required for publication on CAP | yes |
example | 'https://celltype.info/CAP_DataCuration/A-Single-Cell-Transcriptome-Atlas-of-the-Human-Pancreas/1' |
Key-value pair in the uns dictionary
key | authors_list |
type | string |
value | This field stores a list of CAP users who are included in the CAP project as collaborators, regardless of their specific role (Viewer, Editor, or Owner). |
source | software |
required for publication on CAP | yes |
example | '['John Smith', 'Cody Miller', 'Sarah Jones']' |
Key-value pair in the uns dictionary
key | author_name |
type | string |
value | This MUST be a string in the format [FIRST NAME] [LAST NAME] . |
source | file or UI |
required for publication on CAP | yes |
example | 'John Smith' |
Key-value pair in the uns dictionary
key | author_contact |
type | string |
value | This MUST be a valid email address of the author. |
source | file or UI |
required for publication on CAP | yes |
example | 'jsmith@university.edu' |
Key-value pair in the uns dictionary
key | author_orcid |
type | string |
value | This MUST be a valid ORCID for the author. |
source | file or UI |
required for publication on CAP | yes |
example | '0000-0002-3843-3472' |
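One way to test whether an author_orcid value is well-formed is to verify its layout and its final check character, which ORCID computes with the ISO 7064 MOD 11-2 algorithm. The sketch below does only that; it does not confirm that the ORCID is registered or belongs to the named author. The function names are hypothetical.

```python
import re


def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for char in base_digits:
        total = (total + int(char)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the 0000-0000-0000-000X layout and the trailing check character."""
    if re.fullmatch(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]", orcid) is None:
        return False
    digits = orcid.replace("-", "")
    return orcid_check_character(digits[:15]) == digits[15]
```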
Python dictionary within the uns dictionary, keyed by the string [cellannotation_setname]
key | [cellannotation_setname] |
type | python dictionary |
value | The rest of the dictionary as defined below. |
source | file or UI |
required for publication on CAP | yes |
example | '{'cell_type': {'annotation_method':'algorithmic', ...}}' |
key | 'description' |
type | string |
value | Description of the cellannotation_set created. This is free-text for collaborators and third-parties to understand the context/background for the creation of this cell annotation set. We STRONGLY recommend this field be descriptive enough for other scientists unfamiliar with this project to understand why this set of cell annotations exists. |
source | file or UI |
required for publication on CAP | yes |
example | 'Cell annotations based on resolution broad clustering using the Leiden algorithm.' |
key | 'annotation_method' |
type | string |
value | 'algorithmic' , 'manual' , or 'both' NOTE: If 'algorithmic' or 'both' , more details are required. If 'manual' , the values in the following 'algorithm_' and 'reference_' fields will be 'NA' . |
source | file or UI |
required for publication on CAP | yes |
example | 'algorithmic' or 'manual' or 'both' |
key | 'algorithm_name' |
type | string |
value | The name of the algorithm used. |
source | file or UI |
required for publication on CAP | yes |
example | 'scArches' or if 'manual' then 'NA' |
key | 'algorithm_version' |
type | string |
value | The string of the algorithm's version, which is typically in the format '[MAJOR].[MINOR]' , but other versioning systems are permitted based on the algorithm's versioning. |
source | file or UI |
required for publication on CAP | yes |
example | '0.5.9' or if 'manual' then 'NA' |
key | 'algorithm_repo_url' |
type | string |
value | The string of the URL of the version control repository associated with the algorithm used (if applicable). It MUST be a string of a valid URL. |
source | file or UI |
required for publication on CAP | yes |
example | 'https://github.com/theislab/scarches' or if 'manual' then 'NA' |
key | 'reference_location' |
type | string |
value | The string of the URL pointing to the reference dataset. |
source | file or UI |
required for publication on CAP | no |
example | 'https://figshare.com/projects/Tabula_Muris_Senis/64982' or if 'manual' then 'NA' |
key | 'reference_description' |
type | string |
value | Free-text description of the reference used for automated annotation for this cell annotation set. Users are welcome to write out context which may be useful for other researchers. |
source | file or UI |
required for publication on CAP | no |
example | 'Tabula Muris Senis: a single cell transcriptomic atlas across the life span of Mus musculus which includes data from 18 tissues and organs.' or if 'manual' then 'NA' |
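The interplay between 'annotation_method' and the 'algorithm_' and 'reference_' keys above can be expressed as a small consistency check. This sketch takes one reading of the rules: for 'manual' all five detail keys must be 'NA', while for 'algorithmic' or 'both' the three 'algorithm_' keys (marked as required) must be filled in. The function and constant names are hypothetical.

```python
ALLOWED_METHODS = {"algorithmic", "manual", "both"}
ALGORITHM_KEYS = ["algorithm_name", "algorithm_version", "algorithm_repo_url"]
REFERENCE_KEYS = ["reference_location", "reference_description"]


def check_annotation_set(meta):
    """Return a list of human-readable problems found in one annotation-set dictionary."""
    method = meta.get("annotation_method")
    if method not in ALLOWED_METHODS:
        return [f"annotation_method must be one of {sorted(ALLOWED_METHODS)}"]
    problems = []
    if method == "manual":
        # Manual annotation: every algorithm_/reference_ field is expected to be 'NA'.
        for key in ALGORITHM_KEYS + REFERENCE_KEYS:
            if meta.get(key, "NA") != "NA":
                problems.append(f"{key} should be 'NA' when annotation_method is 'manual'")
    else:
        # Algorithmic or both: the algorithm_ fields must carry real values.
        for key in ALGORITHM_KEYS:
            if meta.get(key, "NA") == "NA":
                problems.append(f"{key} is required when annotation_method is '{method}'")
    return problems
```

Calling check_annotation_set(adata.uns['cell_type']) on a real file would return an empty list when the dictionary is consistent under this reading.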
schema version 1.0.0
- Renamed dataset_title to title
- Renamed dataset_description to description
- Renamed cellannotation_setdescription to description