API Documentation
The UploadValidator class performs a series of validations on the uploaded AnnData file to ensure it meets the required criteria.
class UploadValidator(adata_path: str) -> None
Parameters
- adata_path: string, the path to the target AnnData file.
Attributes
- adata_path: str, returns the path of the AnnData file.
- organism: str, returns the detected organism from the dataset. It is filled during validation.
- ensembl_ids: pandas.Index, returns a pandas Index of cleaned Ensembl IDs from the dataset with gene version removed.
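To illustrate what "gene version removed" means for ensembl_ids, here is a minimal sketch in plain Python. It is not the library's implementation (which is not shown here and operates on a pandas.Index); the helper name strip_gene_version and the version suffixes in the sample data are assumptions made for illustration.

```python
def strip_gene_version(ensembl_id: str) -> str:
    """Return the Ensembl ID without a trailing version such as '.5'.

    Hypothetical helper: the version is taken to be everything after the
    first dot; IDs without a dot are returned unchanged.
    """
    return ensembl_id.split(".", 1)[0]


# Illustrative input: the version suffixes are made up for this example.
raw_ids = ["ENSG00000223972.5", "ENSG00000227232", "ENSMUSG00000051951.6"]
cleaned = [strip_gene_version(gene_id) for gene_id in raw_ids]
print(cleaned)
```

Stripping versions first means that the comparison against a gene map does not depend on which annotation release produced the IDs.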
validate(report_success: bool = True) -> None
Validates the input AnnData file. This method:
- Checks if the file has a proper .h5ad extension.
- Calls internal methods to check the count matrix, embeddings, obs columns, and var index.
- Raises a CapMultiException if any validation errors are encountered.
- Prints a success message if report_success=True; otherwise, silently finishes validation if no errors are found.
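The first step listed above, the extension check, could look roughly like the sketch below. The function name has_h5ad_extension is hypothetical, and treating the comparison as case-sensitive is an assumption; the library's own check may differ.

```python
from pathlib import Path


def has_h5ad_extension(path: str) -> bool:
    """Hypothetical helper: True if the final suffix of the path is '.h5ad'."""
    # Only the last suffix is compared, so "sample.backup.h5ad" passes.
    # Assumption: the comparison is case-sensitive.
    return Path(path).suffix == ".h5ad"


print(has_h5ad_extension("plateletOutput.h5ad"))
print(has_h5ad_extension("plateletOutput.csv"))
```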
Examples
>>> from cap_upload_validator import UploadValidator
>>> uv = UploadValidator("multiple_error.h5ad")
>>> uv.validate()
Traceback (most recent call last):
  File "<python-input-2>", line 1, in <module>
    uv.validate()
    ~~~~~~~~~~~^^
  File "/home/rm/.pyenv/versions/test/lib/python3.13/site-packages/cap_upload_validator/upload_validator.py", line 59, in validate
    raise self.multi_exception
cap_upload_validator.errors.CapMultiException: CapMultiException:
AnnDataFileMissingCountMatrix: DataFile Incorrect format: raw data matrix is missing in .raw.X or .X.
AnnDataMissingEmbeddings:
The embedding is missing or is incorrectly named: embeddings must be a [n_cells x 2]
numpy array saved with the prefix X_, for example: X_tsne, X_pca or X_umap.
AnnDataMisingObsColumns:
Required obs column(s) missing: file must contain
'assay', 'disease', 'organism' and 'tissue' fields with valid values,
see (link to obs section of upload requirements) for more information.
AnnDataNonStandardVarError:
File does not contain valid ENSEMBL terms in var.
We currently support Homo sapiens and Mus musculus.
In the case of multiple species in the dataset, orthologous Homo sapiens genes are required.
If there are other species you wish to upload to CAP, please contact
support@celltype.info and we will work to accommodate your request.
>>>
>>>
>>> uv = UploadValidator("plateletOutput.h5ad")
>>> uv.validate()
Validation passed!
find_missing_genes() -> Optional[pd.DataFrame]
- Finds missing genes from the gene map for the validated AnnData var section.
Added in v1.5.0
from cap_upload_validator import UploadValidator
df = UploadValidator("anndata_file.h5ad").find_missing_genes()
print(df)
The cap_upload_validator.errors module defines a hierarchy of custom exceptions that are raised during the validation process. Key exceptions include:
- BadAnnDataFile: Raised when the provided file does not have the correct .h5ad extension.
- AnnDataFileMissingCountMatrix: Raised if the count matrix is missing or invalid.
- AnnDataMissingEmbeddings: Raised if valid embeddings (e.g., X_tsne, X_pca) are not found in the file.
- AnnDataMisingObsColumns: Raised when one or more required general metadata adata.obs columns are missing.
- AnnDataNonStandardVarError: Raised when var.index is not a list of proper gene ENSEMBL ids.
The CapMultiException class is used to collect and raise multiple exceptions at once, providing a consolidated report of all validation errors encountered.
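The collect-then-raise pattern described above can be sketched with stand-in classes. Everything in this block (ValidationError, MultiError, run_checks, and the three check functions) is hypothetical and only mirrors the idea; it does not reproduce the actual CapMultiException interface.

```python
class ValidationError(Exception):
    """Hypothetical stand-in for a single validation failure."""


class MultiError(Exception):
    """Hypothetical stand-in that accumulates several errors."""

    def __init__(self):
        super().__init__()
        self.errors = []

    def append(self, error):
        self.errors.append(error)

    def has_errors(self):
        return bool(self.errors)

    def __str__(self):
        # One line per collected error, prefixed with its class name.
        return "\n".join(f"{type(e).__name__}: {e}" for e in self.errors)


def run_checks(checks):
    """Run every check, collect failures, and raise them together."""
    multi = MultiError()
    for check in checks:
        try:
            check()
        except ValidationError as exc:
            multi.append(exc)  # keep going so later checks still run
    if multi.has_errors():
        raise multi


def check_matrix():
    raise ValidationError("count matrix is missing")


def check_embeddings():
    pass  # this check passes


def check_obs():
    raise ValidationError("required obs columns are missing")


try:
    run_checks([check_matrix, check_embeddings, check_obs])
except MultiError as exc:
    collected = exc.errors
    report = str(exc)
    print(report)
```

Collecting errors instead of stopping at the first one lets a user fix every problem in a file in a single pass, which matches the consolidated report shown in the validate() example.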
Base class representing an abstract organism. It should not be used directly by users.
@dataclass(frozen=True)
class Organism:
    name: str
    ontology_id: Union[str, None]
    gene_prefix: Union[str, None]
    gene_map_path: Union[str, None]
Class to represent Homo sapiens.
@dataclass(frozen=True)
class HomoSapiens(Organism):
    name = "Homo sapiens"
    ontology_id = "NCBITaxon:9606"
    gene_prefix = "ENSG"
    gene_map_path = HUMAN_GENE_MAP_PATH
Class to represent Mus musculus.
@dataclass(frozen=True)
class MusMusculus(Organism):
    name = "Mus musculus"
    ontology_id = "NCBITaxon:10090"
    gene_prefix = "ENSMUSG"
    gene_map_path = MOUSE_GENE_MAP_PATH
Class to represent multi-species datasets. According to CAP validation rules, multi-species datasets are validated with the same gene map as Homo sapiens datasets.
@dataclass(frozen=True)
class MultiSpecies(HomoSapiens):
    name = "Multi species"
    ontology_id = None
Class to represent an unsupported organism; no gene validation is applied.
@dataclass(frozen=True)
class UnsupportedOrganism(Organism):
    name = "Unsupported"
    ontology_id = None
    gene_prefix = None
    gene_map_path = None
The GeneMap class returns the gene map pandas.DataFrame for a given organism if it is supported on CAP. CAP requires that all ENSEMBL ids placed in adata.var.index come from those mapping files.
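The requirement that every ENSEMBL id in adata.var.index comes from the mapping files amounts to a set-membership check. The sketch below shows that idea in plain Python with a tiny, made-up gene map; find_unmapped is a hypothetical helper, not the library's find_missing_genes method, and the id ENSG99999999999 is a placeholder chosen to be absent from the sample map.

```python
def find_unmapped(var_ids, gene_map_ids):
    """Hypothetical helper: return ids from var_ids absent from the gene map."""
    known = set(gene_map_ids)  # set lookup keeps the check linear in size
    return [gene_id for gene_id in var_ids if gene_id not in known]


# A three-row stand-in for a gene map; the real maps are much larger.
gene_map_ids = ["ENSG00000290825", "ENSG00000223972", "ENSG00000227232"]

# Illustrative var index: the last id is a placeholder not in the map.
var_ids = ["ENSG00000223972", "ENSG00000227232", "ENSG99999999999"]

unmapped = find_unmapped(var_ids, gene_map_ids)
print(unmapped)
```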
class GeneMap() -> None
data_frame(organisms: str | Organism = None, index_col: int | None = None) -> pd.DataFrame
Parameters
- organism: string | Organism - a valid organism class or its string name; it must match one of the Organism classes' values. If None (default), both the Homo sapiens and Mus musculus maps are returned concatenated.
- index_col: int | None - the index column passed to pandas.read_csv. If None (default), the ENSEMBL ids are read as a regular column; set it to 0 to put them in the DataFrame index.
Returns
- Gene Map: pandas.DataFrame - the DataFrame with columns:
  - ENSEMBL_gene - ENSEMBL ids. Can be placed in the index depending on the index_col value.
  - HGNC_symbol - Corresponding gene symbol.
  - HGNC_symbol_unique - Gene symbol from HGNC_symbol concatenated with the integer part of the ENSEMBL id. Used for gene mapping when several genes in the dataset share the same HGNC_symbol.
>>> from cap_upload_validator.gene_mapping import GeneMap, HomoSapiens
>>> GeneMap.data_frame(HomoSapiens).head()
ENSEMBL_gene HGNC_symbol a b HGNC_symbol_unique
0 ENSG00000290825 DDX11L2 1 1657 DDX11L2-290825
1 ENSG00000223972 DDX11L1 6 632 DDX11L1-223972
2 ENSG00000227232 WASH7P 5 1351 WASH7P-227232
3 ENSG00000278267 MIR6859-1 1 68 MIR6859-1-278267
4 ENSG00000243485 MIR1302-2HG 5 1021 MIR1302-2HG-243485
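The HGNC_symbol_unique values in the output above are consistent with joining the gene symbol and the integer part of the ENSEMBL id with a hyphen. The sketch below reproduces that rule for two rows of the table; the function unique_symbol is a hypothetical illustration, not part of the GeneMap API.

```python
import re


def unique_symbol(ensembl_id: str, symbol: str) -> str:
    """Hypothetical helper: join a symbol with the integer part of an id."""
    match = re.search(r"(\d+)$", ensembl_id)
    if match is None:
        raise ValueError(f"no numeric part in {ensembl_id!r}")
    # int() drops the leading zeros, e.g. "00000290825" -> 290825.
    return f"{symbol}-{int(match.group(1))}"


print(unique_symbol("ENSG00000290825", "DDX11L2"))
print(unique_symbol("ENSG00000278267", "MIR6859-1"))
```

Because the numeric part of an ENSEMBL id is unique within an organism, appending it disambiguates genes that share a symbol.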