## Getting started with anndata

This section follows the [official tutorial](https://anndata.readthedocs.io/en/latest/tutorials/notebooks/getting-started.html).

`AnnData` is specifically designed for matrix-like data. By this we mean that we have $n$ observations, each of which can be represented as a $d$-dimensional vector, where each dimension corresponds to a variable or feature. Both the rows and columns of this $n \times d$ matrix are special in the sense that they are indexed.

In [1]:
import numpy as np
import pandas as pd
import anndata as ad
from scipy.sparse import csr_matrix
print(ad.__version__)

0.10.7


Let’s start by building a basic AnnData object with some sparse count information, perhaps representing gene expression counts.

In [2]:
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)
adata = ad.AnnData(counts)
adata

AnnData object with n_obs × n_vars = 100 × 2000

The initial data we passed are accessible as a sparse matrix using `adata.X`.

In [3]:
adata.X

<100x2000 sparse matrix of type '<class 'numpy.float32'>'
	with 126320 stored elements in Compressed Sparse Row format>

Next, we assign an index to each of the `obs` and `var` axes using `.obs_names` and `.var_names`, respectively.

In [4]:
adata.obs_names = [f"Cell_{i:d}" for i in range(adata.n_obs)]
adata.var_names = [f"Gene_{i:d}" for i in range(adata.n_vars)]
print(adata.obs_names[:10])

Index(['Cell_0', 'Cell_1', 'Cell_2', 'Cell_3', 'Cell_4', 'Cell_5', 'Cell_6',
       'Cell_7', 'Cell_8', 'Cell_9'],
      dtype='object')


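These index values can be used to subset the `AnnData` object, which returns a view of the selected observations and variables.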

In [5]:
adata[["Cell_1", "Cell_10"], ["Gene_5", "Gene_1900"]]

View of AnnData object with n_obs × n_vars = 2 × 2

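Next, we add observation-level metadata to `adata.obs`, a pandas DataFrame indexed by the observation names. Here we assign each cell a randomly chosen cell type.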

In [10]:
np.random.seed(1984)
ct = np.random.choice(["B", "T", "Monocyte"], size=(adata.n_obs,))
adata.obs["cell_type"] = pd.Categorical(ct)  # Categoricals are preferred for efficiency
adata.obs

|  | cell_type |
|---|---|
| Cell_0 | B |
| Cell_1 | T |
| Cell_2 | B |
| Cell_3 | Monocyte |
| Cell_4 | B |
| ... | ... |
| Cell_95 | Monocyte |
| Cell_96 | T |
| Cell_97 | T |
| Cell_98 | B |


We can also subset the object using this metadata, for example to keep only the B cells.

In [11]:
bdata = adata[adata.obs.cell_type == "B"]
bdata

View of AnnData object with n_obs × n_vars = 37 × 2000
    obs: 'cell_type'

Metadata can also be matrix-valued. Here we store a randomly generated matrix in `.obsm`, which we can interpret as a UMAP embedding of the data, along with some random gene-level metadata in `.varm`.

In [12]:
np.random.seed(1984)
adata.obsm["X_umap"] = np.random.normal(0, 1, size=(adata.n_obs, 2))
adata.varm["gene_stuff"] = np.random.normal(0, 1, size=(adata.n_vars, 5))
adata.obsm

AxisArrays with keys: X_umap

AnnData has `.uns`, which allows for any unstructured metadata. This can be anything, like a list or a dictionary with some general information that was useful in the analysis of our data.

In [13]:
adata.uns["random"] = [1, 2, 3]
adata.uns

OrderedDict([('random', [1, 2, 3])])

Finally, we may have different forms of our original core data, perhaps one that is normalised and one that is not. These can be stored in different layers in `AnnData`. For example, let’s log transform the original data and store it in a layer.

In [14]:
adata.layers["log_transformed"] = np.log1p(adata.X)
adata

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'
    uns: 'random'
    obsm: 'X_umap'
    varm: 'gene_stuff'
    layers: 'log_transformed'

We can also export a layer as a pandas DataFrame with `.to_df()`.

In [15]:
adata.to_df(layer="log_transformed")

|  | Gene_0 | Gene_1 | Gene_2 | Gene_3 | Gene_4 | Gene_5 | Gene_6 | Gene_7 | Gene_8 | Gene_9 | ... | Gene_1990 | Gene_1991 | Gene_1992 | Gene_1993 | Gene_1994 | Gene_1995 | Gene_1996 | Gene_1997 | Gene_1998 | Gene_1999 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cell_0 | 0.693147 | 1.098612 | 0.693147 | 0.693147 | 0.000000 | 0.693147 | 0.693147 | 1.098612 | 0.693147 | 0.693147 | ... | 0.000000 | 0.000000 | 0.693147 | 1.098612 | 0.693147 | 0.000000 | 0.693147 | 1.098612 | 0.693147 | 0.693147 |
| Cell_1 | 0.000000 | 0.000000 | 0.693147 | 1.098612 | 0.693147 | 0.693147 | 0.000000 | 1.098612 | 0.693147 | 1.098612 | ... | 0.693147 | 0.693147 | 0.000000 | 1.386294 | 1.098612 | 0.693147 | 0.000000 | 0.000000 | 0.000000 | 1.098612 |
| Cell_2 | 1.386294 | 1.609438 | 0.693147 | 1.098612 | 0.000000 | 0.000000 | 0.000000 | 1.386294 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 1.386294 | 0.000000 | 0.693147 | 0.693147 | 0.000000 | 0.000000 | 1.098612 | 0.000000 |
| Cell_3 | 0.693147 | 0.000000 | 1.098612 | 0.693147 | 0.693147 | 0.000000 | 0.693147 | 1.609438 | 0.000000 | 0.000000 | ... | 0.000000 | 0.693147 | 1.098612 | 0.693147 | 0.000000 | 0.000000 | 0.000000 | 1.098612 | 1.386294 | 0.693147 |
| Cell_4 | 1.098612 | 0.693147 | 1.098612 | 1.386294 | 1.098612 | 0.000000 | 0.693147 | 1.098612 | 0.693147 | 1.098612 | ... | 0.693147 | 0.000000 | 0.693147 | 0.693147 | 1.098612 | 0.693147 | 0.000000 | 0.000000 | 0.693147 | 1.098612 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Cell_95 | 0.000000 | 1.609438 | 0.693147 | 1.098612 | 0.000000 | 1.098612 | 0.693147 | 0.000000 | 0.693147 | 0.693147 | ... | 1.098612 | 0.000000 | 1.609438 | 0.693147 | 1.609438 | 0.000000 | 0.693147 | 0.693147 | 0.693147 | 1.386294 |
| Cell_96 | 0.000000 | 1.386294 | 1.386294 | 0.693147 | 0.000000 | 1.098612 | 0.693147 | 1.098612 | 0.000000 | 0.693147 | ... | 1.098612 | 1.386294 | 1.098612 | 1.098612 | 0.693147 | 0.693147 | 1.098612 | 1.098612 | 0.000000 | 1.098612 |
| Cell_97 | 1.098612 | 0.693147 | 0.000000 | 0.693147 | 0.693147 | 0.693147 | 0.693147 | 0.693147 | 1.098612 | 1.386294 | ... | 1.098612 | 1.386294 | 0.693147 | 1.386294 | 0.000000 | 1.098612 | 0.693147 | 0.693147 | 0.693147 | 1.098612 |
| Cell_98 | 1.386294 | 0.693147 | 0.000000 | 0.693147 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.693147 | ... | 0.693147 | 0.000000 | 0.693147 | 0.000000 | 0.000000 | 0.693147 | 1.609438 | 0.693147 | 0.000000 | 0.693147 |


`AnnData` comes with its own persistent HDF5-based file format: `h5ad`. When writing, `AnnData` automatically converts string columns with a small number of categories to categoricals if they are not categoricals already.

In [None]:
adata.write('my_results.h5ad', compression="gzip")

`AnnData` has become the standard for single-cell analysis in Python, and for good reason: it's straightforward to use and facilitates more reproducible analyses with its key-based storage. It's even becoming easier to convert to the popular R-based formats for single-cell analysis.