# Data structures for single-cell and multi-modal data

In this exercise, we'll get familiar with the AnnData and MuData frameworks for working with single-cell and multi-modal data in Python.
These frameworks make working with data much more convenient than using, for example, plain NumPy arrays or pandas DataFrames. In the last part of the exercise, we'll process some actual data using these frameworks.

## AnnData

You can think of an AnnData object as a data matrix of observations × variables with additional metadata. See the [documentation](https://anndata.readthedocs.io/en/latest/index.html) for a [quick introduction](https://anndata.readthedocs.io/en/latest/tutorials/notebooks/getting-started.html).

To demonstrate basic usage, we first import relevant Python packages and then create an AnnData object with random data.

In [1]:
import anndata
import numpy as np

# Simulate count matrix with 100 rows (observations) and 2000 columns (features)
counts = np.random.poisson(1, size=(100, 2000))
adata = anndata.AnnData(counts)
adata

AnnData object with n_obs × n_vars = 100 × 2000

The .X attribute contains the data matrix.

In [2]:
adata.X

array([[0, 1, 0, ..., 0, 1, 0],
       [1, 2, 2, ..., 3, 0, 2],
       [0, 1, 2, ..., 1, 0, 1],
       ...,
       [1, 1, 1, ..., 2, 1, 0],
       [3, 1, 1, ..., 1, 0, 0],
       [2, 1, 1, ..., 1, 1, 2]])

For very large and sparse data matrices (many zeros), it's recommended to convert the NumPy array to a sparse matrix object first, because this is more memory efficient.

In [3]:
from scipy.sparse import csr_matrix

# convert counts to a sparse matrix
counts_sparse = csr_matrix(counts)
adata_sparse = anndata.AnnData(counts_sparse)
adata_sparse

AnnData object with n_obs × n_vars = 100 × 2000

Now the .X attribute is a sparse matrix.

In [4]:
adata_sparse.X

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 126636 stored elements and shape (100, 2000)>
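
Whether a sparse matrix actually saves memory depends on the fraction of zeros. The following sketch, using only NumPy and SciPy, compares the memory footprint of a dense array with that of its CSR counterpart. The Poisson rate of 0.05 and the helper name `csr_nbytes` are assumptions chosen for illustration and are not part of the exercise data.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)


def csr_nbytes(m):
    """Total bytes used by the three arrays that back a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


# Very sparse counts: with a Poisson rate of 0.05, about 95% of entries are zero
sparse_counts = rng.poisson(0.05, size=(100, 2000))

dense_bytes = sparse_counts.nbytes
sparse_bytes = csr_nbytes(csr_matrix(sparse_counts))

print(f"dense:  {dense_bytes} bytes")
print(f"sparse: {sparse_bytes} bytes")
```

For the Poisson(1) counts used above, roughly 63% of the entries are nonzero, so the savings there are small. The benefit grows as the matrix becomes sparser.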

Our AnnData object does not have any metadata yet. We can assign names to observations (cells) and variables (genes).

In [5]:
adata.obs_names = [f"Cell_{i}" for i in range(adata.n_obs)]
adata.var_names = [f"Gene_{i}" for i in range(adata.n_vars)]
adata.obs_names

Index(['Cell_0', 'Cell_1', 'Cell_2', 'Cell_3', 'Cell_4', 'Cell_5', 'Cell_6',
       'Cell_7', 'Cell_8', 'Cell_9', 'Cell_10', 'Cell_11', 'Cell_12',
       'Cell_13', 'Cell_14', 'Cell_15', 'Cell_16', 'Cell_17', 'Cell_18',
       'Cell_19', 'Cell_20', 'Cell_21', 'Cell_22', 'Cell_23', 'Cell_24',
       'Cell_25', 'Cell_26', 'Cell_27', 'Cell_28', 'Cell_29', 'Cell_30',
       'Cell_31', 'Cell_32', 'Cell_33', 'Cell_34', 'Cell_35', 'Cell_36',
       'Cell_37', 'Cell_38', 'Cell_39', 'Cell_40', 'Cell_41', 'Cell_42',
       'Cell_43', 'Cell_44', 'Cell_45', 'Cell_46', 'Cell_47', 'Cell_48',
       'Cell_49', 'Cell_50', 'Cell_51', 'Cell_52', 'Cell_53', 'Cell_54',
       'Cell_55', 'Cell_56', 'Cell_57', 'Cell_58', 'Cell_59', 'Cell_60',
       'Cell_61', 'Cell_62', 'Cell_63', 'Cell_64', 'Cell_65', 'Cell_66',
       'Cell_67', 'Cell_68', 'Cell_69', 'Cell_70', 'Cell_71', 'Cell_72',
       'Cell_73', 'Cell_74', 'Cell_75', 'Cell_76', 'Cell_77', 'Cell_78',
       'Cell_79', 'Cell_80', 'Cell_81', 'Cell_82',

Now we can subset the AnnData object using cell and gene names.

In [6]:
adata[["Cell_1", "Cell_10"], ["Gene_5", "Gene_1900"]]

View of AnnData object with n_obs × n_vars = 2 × 2

Note that we obtained a view of an AnnData object. The view does not store any data but points to the original AnnData object instead to save memory. The view will automatically convert itself to a full AnnData object by copying the relevant subset of the original object if we try to add some data to the view.

We can also add some metadata describing the observations. To demonstrate this, we randomly assign one of three cell types to each cell and store the result as a column in the `.obs` attribute, which is a pandas DataFrame.

In [7]:
adata.obs["cell_type"] = np.random.choice(["B", "T", "Monocyte"], size=(adata.n_obs,))
adata.obs

| | cell_type |
|---|---|
| Cell_0 | T |
| Cell_1 | B |
| Cell_2 | T |
| Cell_3 | B |
| Cell_4 | B |
| ... | ... |
| Cell_95 | T |
| Cell_96 | Monocyte |
| Cell_97 | Monocyte |
| Cell_98 | B |


Note that the description of the AnnData object now includes the new column.

In [8]:
adata

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'

We can also subset the AnnData object using boolean indexing. For example, we can get all B cells.

In [9]:
bdata = adata[adata.obs["cell_type"] == "B"]
bdata

View of AnnData object with n_obs × n_vars = 38 × 2000
    obs: 'cell_type'

AnnData can store multiple data matrices in the `.layers` attribute. This is useful if one wants to keep both original and transformed, e.g. normalized, data. All matrices must have the same shape as the original `.X`.

In [10]:
adata.layers["log_transformed"] = np.log1p(adata.X)
adata

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'
    layers: 'log_transformed'

### Exercise
Try to add a new layer called `counts` to the AnnData object with the reverse transformation, `np.expm1()`.

In [11]:
# Your solution here
# ...


Inspect the resulting matrix. Is it different from the original values in `.X`?

In [12]:
# Your solution here
# ...

Now repeat the transformation steps with the `adata_sparse` object. To display the values of a sparse matrix, you can use the `.toarray()` method.

## MuData

MuData objects contain a dictionary of AnnData objects and are used for multimodal data. AnnData objects within a MuData container are aligned and can be jointly subsetted.

To demonstrate this, we will use the `adata` and `bdata` objects from above. Since MuData assumes that variables are unique to each modality, we first change `bdata`'s variable names and then create a MuData object.

In [13]:
import mudata
mudata.set_options()

# rename the variable names in bdata to "Protein_0", "Protein_1", ...
bdata.var_names = [f"Protein_{i}" for i in range(bdata.n_vars)]
bdata.var_names

Index(['Protein_0', 'Protein_1', 'Protein_2', 'Protein_3', 'Protein_4',
       'Protein_5', 'Protein_6', 'Protein_7', 'Protein_8', 'Protein_9',
       ...
       'Protein_1990', 'Protein_1991', 'Protein_1992', 'Protein_1993',
       'Protein_1994', 'Protein_1995', 'Protein_1996', 'Protein_1997',
       'Protein_1998', 'Protein_1999'],
      dtype='object', length=2000)

In [14]:
# create a multimodal MuData object from the two AnnData objects
mdata = mudata.MuData({"rna": adata, "prot": bdata})
mdata

We can access individual modalities using a dictionary-like interface.

In [15]:
mdata["rna"]

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'
    layers: 'log_transformed'

We can subset the MuData object using integer or string based indexing.

In [16]:
mdata[20:42, ["Gene_42", "Protein_42"]]

Note that fewer than the expected 22 cells from the `prot` modality are included in the subset. This is because we created the `prot` modality from only a subset of all cells, and MuData uses the cell names (`.obs_names`) to match observations in different modalities to each other.
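
The effect of name-based matching can be reproduced with plain pandas indices. The sketch below uses hypothetical cell names in which the "prot" set contains only every third cell. It illustrates the idea and is not MuData's internal implementation.

```python
import pandas as pd

# Hypothetical cell names: "rna" has all 100 cells, "prot" only every third one
rna_names = pd.Index([f"Cell_{i}" for i in range(100)])
prot_names = rna_names[::3]

# Cells at positions 20 to 41 of the full set, as selected by the slice 20:42
selected = rna_names[20:42]

# Only selected cells that also exist in "prot" can be returned for that modality
shared = selected.intersection(prot_names)
print(len(selected), len(shared))
```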

### Exercise

Pick two random Protein features and plot them against each other using a scatterplot.

In [22]:
# Your solution here
# ...

Simulate missing measurements by placing `np.nan` in the first 10 cells of the "prot" modality.

In [None]:
# Your solution here
# ...