In [1]:
from pprint import pprint
import warnings

import numpy as np
import pandas as pd
import scanpy as sc

warnings.simplefilter("ignore", UserWarning)
warnings.simplefilter("ignore", FutureWarning)

# metadata-related functions are in cellhive.md
import cellhive.md as ch

# or - for development:
# %reload_ext autoreload
# %autoreload 2
# %aimport cellhive
# ch = cellhive

hi


In [2]:
f"Using cellhive version: {ch.__version__}"

'Using cellhive version: 0.1.1'

### Load an h5ad file

In [3]:
# using the demonstration pbmc file
raw = sc.datasets.pbmc3k()

adata = sc.datasets.pbmc3k_processed()
adata.layers['counts'] = raw[adata.obs_names, adata.var_names].X.copy()

## Cellhive annotation demonstration

We use cellhive to add structured metadata to an AnnData object (`adata`).

Start with the experimental metadata. Although this is free-form key/value data, I suggest the following fields:

- **author**
- **title**: Short title
- **organism**: human, mouse, etc.
- **year**: (integer) - the year of the study.
- **url**: URL source of this dataset
- **description**: Long form description of this dataset.
- **version**: (string) - the version number of this dataset, in case it needs to be updated; approximately following semantic versioning.
- **study**: short identifier for the overarching study, used to group a number of experiments together.
- **experiment**: short unique identifier for this dataset. Note: different versions of a dataset should have a different experiment identifier.
- **pubmed**: PubMed ID of a paper. Note: if you specify just the PubMed ID, the author, abstract, year and title will be downloaded from the internet.
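
Since these fields are only a suggestion, it can be handy to see which of them a metadata dictionary still lacks before annotating. The following is a minimal sketch in plain Python; `SUGGESTED_FIELDS` and `missing_fields` are hypothetical names used for illustration and are not part of cellhive.

```python
# Hypothetical helper, not part of cellhive: report which of the
# suggested metadata fields are absent from a metadata dictionary.
SUGGESTED_FIELDS = (
    "author", "title", "organism", "year", "url",
    "description", "version", "study", "experiment", "pubmed",
)


def missing_fields(metadata, suggested=SUGGESTED_FIELDS):
    """Return the suggested field names that `metadata` lacks, in list order."""
    return [name for name in suggested if name not in metadata]


demo = {
    "author": "10x",
    "title": "3k PBMCs from a Healthy Donor",
    "year": 2016,
    "organism": "human",
    "study": "cellhive_demo",
    "experiment": "pbmc",
    "version": "1.0",
}
print(missing_fields(demo))  # ['url', 'description', 'pubmed']
```

A missing field is not necessarily an error here, since the metadata is free-form; the list is only a reminder of what could still be filled in.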


In [4]:
ch.md(adata,
      author='10x',
      title='3k PBMCs from a Healthy Donor',
      year=2016,
      organism='human',
      study='cellhive_demo',
      url='https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k?',
      experiment='pbmc',
      version='1.0'
      )

{'author': '10x',
 'title': '3k PBMCs from a Healthy Donor',
 'year': 2016,
 'organism': 'human',
 'study': 'cellhive_demo',
 'url': 'https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k?',
 'experiment': 'pbmc',
 'version': '1.0'}

### Layer annotation

Next, add some metadata to the layers.

Running `ch.layers(adata)` without any further arguments gives an overview of the layers in this adata object (including the main `.X`). Cellhive will try to assign layer types. It recognizes the following layer types (although the field is free-form):

- **count**: for raw counts (integer)
- **rpm**: reads per million - library size normalized counts
- **logrpm**: library size normalized, and converted to log space
- **cell_abundance**: For cell abundance analysis.

In [5]:
ch.layers(adata)



False 10.0
True 190.0


| | 0 | 1 |
|---|---|---|
| name | X | counts |
| dtype | float32 | float32 |
| rows | 2638 | 2638 |
| columns | 1838 | 1838 |
| entries | 4848644 | 4848644 |
| min | -2.849105 | 0.0 |
| max | 10.0 | 190.0 |
| no_zeros | 0 | 4439901 |
| % zeros | 0.0 | 91.569952 |
| ignore? | - | - |
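
The `entries`, `no_zeros` and `% zeros` rows are related by simple arithmetic, assuming `no_zeros` is the number of zero-valued entries (an interpretation of the output, not something stated by cellhive). A small sketch in plain Python, with `zero_stats` as a hypothetical helper:

```python
def zero_stats(rows, columns, n_zeros):
    """Return (entries, percent_zeros) for a rows x columns matrix."""
    entries = rows * columns
    return entries, 100.0 * n_zeros / entries


# Numbers taken from the `counts` column of the table above.
entries, pct = zero_stats(2638, 1838, 4439901)
print(entries)        # 4848644
print(round(pct, 2))  # 91.57
```

The high fraction of zeros in `counts` is typical for sparse single-cell count data, whereas the scaled `.X` matrix contains no exact zeros.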


It is possible to force the layer type and/or description using the same function. The function recognizes the following arguments:

- **ltype**: Layer type (see above)
- **description**: longer description of what this layer is
- **ignore**: (boolean) - specify if this layer is to be ignored for database import.

In [6]:
ch.layers(adata, 'counts', ltype='count', description='Raw counts')

| | 0 | 1 |
|---|---|---|
| name | X | counts |
| dtype | float32 | float32 |
| rows | 2638 | 2638 |
| columns | 1838 | 1838 |
| entries | 4848644 | 4848644 |
| min | -2.849105 | 0.0 |
| max | 10.0 | 190.0 |
| no_zeros | 0 | 4439901 |
| % zeros | 0.0 | 91.569952 |
| ignore? | - | - |


## Dimensionality reduction annotation (`obsm`)

Cellhive can annotate the dimensionality reduction data using the `ch.obsm` function with the following arguments:

- **description**: longer description of what this dimensionality reduction is
- **ignore**: (boolean) - specify if this dimensionality reduction is to be ignored for database import.

Without any extra arguments, the function prints information on the obsm data.

In [7]:
ch.obsm(adata)

| name | X_pca | X_tsne | X_umap | X_draw_graph_fr |
|---|---|---|---|---|
| dim | 50 | 2 | 2 | 2 |
| ignore | - | - | - | - |
| description | - | - | - | - |


In [10]:
ch.obsm(adata, 'X_draw_graph_fr', ignore=True)

| name | X_pca | X_tsne | X_umap | X_draw_graph_fr |
|---|---|---|---|---|
| dim | 50 | 2 | 2 | 2 |
| ignore | - | - | - | True |
| description | - | - | - | - |


## Cell annotation (`obs` table)

In the same vein as above, the cell metadata is annotated using the `ch.obs` function.

The output shows the data type, the number of unique entries, and three example values for each column.

The function recognizes the following arguments:

- **dtype**: force the data type of this obs column
- **description**: longer description of what this obs column is
- **ignore**: (boolean) - specify if this obs column is to be ignored for database import.

In [11]:
ch.obs(adata, 'louvain', description='Louvain clustered cells')
ch.obs(adata, 'n_genes', dtype='int')   # superfluous!
ch.obs(adata)

INFO cellhive.metadata:416 - Convert obs column n_genes to int [09:05:06] 


| | name | dtype | no_uniq | example | description | ignore |
|---|---|---|---|---|---|---|
| 0 | n_genes | int | 935 | 781, 1352, 1131 | - | - |
| 1 | percent_mito | float | 2540 | 0.0302, 0.0379, 0.0089 | - | - |
| 2 | n_counts | float | 1736 | 2.42e+03, 4.9e+03, 3.15e+03 | - | - |
| 3 | louvain | cat | 8 | Megakaryocytes, Dendritic cells, FCGR3A+ Monoc... | Louvain clustered cells | - |
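
The `no_uniq` and `example` columns can be approximated in a few lines of plain Python. This is only a sketch of the idea, assuming the examples are simply the first few distinct values; `summarize_column` is a hypothetical helper and the input is a made-up toy list, so cellhive may well pick its examples differently.

```python
def summarize_column(values, n_examples=3):
    """Return the number of distinct values and the first few of them."""
    # dict.fromkeys de-duplicates while keeping first-seen order.
    distinct = list(dict.fromkeys(values))
    return {"no_uniq": len(distinct), "example": distinct[:n_examples]}


# Toy cluster labels, not the real `louvain` column.
clusters = ["B cells", "NK cells", "B cells", "Megakaryocytes", "Dendritic cells"]
print(summarize_column(clusters))
# {'no_uniq': 4, 'example': ['B cells', 'NK cells', 'Megakaryocytes']}
```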


## Check 

To check if all required fields are present:

In [12]:
ch.check(adata)

All seems fine


## Metadata storage

All metadata is stored in `adata.uns['cellhive']`:

In [13]:
pprint(adata.uns['cellhive'])

{'layers': {'X': {'type': 'logrpm'},
            'counts': {'description': 'Raw counts', 'type': 'count'}},
 'metadata': {'author': '10x',
              'experiment': 'pbmc',
              'organism': 'human',
              'study': 'cellhive_demo',
              'title': '3k PBMCs from a Healthy Donor',
              'url': 'https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k?',
              'version': '1.0',
              'year': 2016},
 'obs': {'louvain': {'description': 'Louvain clustered cells'},
         'n_genes': {'dtype': 'int'}},
 'obsm': {'X_draw_graph_fr': {'ignore': True}}}
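
Because the store is a plain nested dictionary, downstream code can query it directly. The sketch below works on a hand-copied subset of the dictionary printed above; `ignored_entries` is a hypothetical helper for illustration, not a cellhive function.

```python
def ignored_entries(store, section):
    """Return the names in `store[section]` that are flagged to be ignored."""
    return sorted(
        name
        for name, annotation in store.get(section, {}).items()
        # bool() also accepts non-native truthy values such as NumPy booleans.
        if bool(annotation.get("ignore", False))
    )


# Subset of the structure shown above, copied by hand.
store = {
    "layers": {"X": {"type": "logrpm"},
               "counts": {"description": "Raw counts", "type": "count"}},
    "obs": {"louvain": {"description": "Louvain clustered cells"},
            "n_genes": {"dtype": "int"}},
    "obsm": {"X_draw_graph_fr": {"ignore": True}},
}
print(ignored_entries(store, "obsm"))    # ['X_draw_graph_fr']
print(ignored_entries(store, "layers"))  # []
```

With a real object, the same function could be called on `adata.uns['cellhive']` instead of the hand-copied `store`.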


## Save

Saving the h5ad file as usual stores the metadata as well:

In [14]:
adata.write_h5ad('pbmc.annotated.h5ad')