# `proteodata`: the ProteoPy data format

This tutorial explains the **proteodata** convention — the set of assumptions that every ProteoPy function relies on when working with an `AnnData` object. You will learn:

- What constitutes a valid proteodata object at the protein level and at the peptide level
- How to construct proteodata from scratch
- How to use `is_proteodata()` and `check_proteodata()` to validate your data
- Common pitfalls that break the format, and how to avoid them

**Prerequisites:** Basic familiarity with `AnnData` (observations, variables, `.obs`, `.var`, `.X`).

In [None]:
import numpy as np
import pandas as pd
from anndata import AnnData

from proteopy.utils.anndata import is_proteodata, check_proteodata

---

## 1. Anatomy of a proteodata object

ProteoPy stores proteomics data in the [AnnData](https://anndata.readthedocs.io/) format, where:

| Slot | Content |
|------|---------|
| `.X` | Intensity matrix (samples x proteins/peptides), containing `float`, `int`, or `np.nan` |
| `.obs` | Sample metadata — **must** include a `sample_id` column |
| `.var` | Protein / peptide metadata — **must** include `protein_id` (and `peptide_id` for peptide-level data) |
| `.obs_names` | Sample index — must be unique |
| `.var_names` | Protein/peptide index — must be unique |

The `is_proteodata()` function checks all of these assumptions and returns a tuple: `(True, "protein")`, `(True, "peptide")`, or `(False, None)`.
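
Because the return value is a plain tuple, it can be unpacked directly. The sketch below branches on the three possible results. `describe_result` is a hypothetical helper written for this tutorial, and the tuples are passed in as literals so the snippet runs without ProteoPy installed.

```python
def describe_result(result):
    """Turn an is_proteodata-style tuple into a short message."""
    is_valid, level = result  # e.g. (True, "protein") or (False, None)
    if not is_valid:
        return "not proteodata"
    return f"valid {level}-level proteodata"


for outcome in [(True, "protein"), (True, "peptide"), (False, None)]:
    print(outcome, "->", describe_result(outcome))
```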

---

## 2. Building a valid protein-level proteodata

In [None]:
# -- Sample metadata --
sample_ids = ["sample_A", "sample_B", "sample_C"]
obs = pd.DataFrame({"sample_id": sample_ids}, index=sample_ids)

# -- Protein metadata --
protein_ids = ["P12345", "Q67890", "O11223", "P44556"]
var = pd.DataFrame({"protein_id": protein_ids}, index=protein_ids)

# -- Intensity matrix (3 samples x 4 proteins) --
X = np.array([
    [100.0, 200.0,  50.0, 300.0],
    [110.0, np.nan, 55.0, 280.0],
    [ 95.0, 210.0,  48.0, 310.0],
])

adata_protein = AnnData(X=X, obs=obs, var=var)
adata_protein

In [None]:
is_proteodata(adata_protein)

The key rules for protein-level proteodata:

1. `.var["protein_id"]` must exist
2. `.var["protein_id"]` values must exactly match `.var_names` (same values, same order)
3. `.obs["sample_id"]` must exist
4. All indices must be unique
5. `.X` must not contain `np.inf` or `-np.inf`
6. `protein_id` must not contain `NaN`
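
To make the six rules concrete, here is a rough re-implementation using only NumPy and pandas. It is an illustrative sketch, not ProteoPy's actual validation code: `protein_rule_violations` is a hypothetical name, and it takes the matrix and the two metadata tables as separate arguments rather than an `AnnData` object.

```python
import numpy as np
import pandas as pd


def protein_rule_violations(X, obs, var):
    """Return a list of violated protein-level rules (empty if none)."""
    problems = []
    if "protein_id" not in var.columns:
        problems.append("rule 1: .var['protein_id'] is missing")
    else:
        if var["protein_id"].isna().any():
            problems.append("rule 6: protein_id contains NaN")
        if list(var["protein_id"]) != list(var.index):
            problems.append("rule 2: protein_id does not match var_names")
    if "sample_id" not in obs.columns:
        problems.append("rule 3: .obs['sample_id'] is missing")
    if not (obs.index.is_unique and var.index.is_unique):
        problems.append("rule 4: indices are not unique")
    if np.isinf(np.asarray(X, dtype=float)).any():
        problems.append("rule 5: X contains infinite values")
    return problems


obs = pd.DataFrame({"sample_id": ["s1", "s2"]}, index=["s1", "s2"])
var = pd.DataFrame({"protein_id": ["P1", "P2"]}, index=["P1", "P2"])
X = np.array([[1.0, np.nan], [3.0, 4.0]])  # NaN in X is allowed
print(protein_rule_violations(X, obs, var))

# Renaming the index alone breaks rule 2.
var_bad = var.copy()
var_bad.index = ["GeneA", "GeneB"]
print(protein_rule_violations(X, obs, var_bad))
```

Returning a list of messages rather than a single boolean makes it easier to see every broken rule at once when debugging.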

---

## 3. Building a valid peptide-level proteodata

Peptide-level data requires an additional `peptide_id` column in `.var`. Each peptide must map to exactly **one** protein.

In [None]:
# -- Sample metadata --
sample_ids = ["sample_A", "sample_B"]
obs = pd.DataFrame({"sample_id": sample_ids}, index=sample_ids)

# -- Peptide metadata --
# Two peptides from PROT_X and one from PROT_Y
peptide_ids = ["PEPTIDE_1", "PEPTIDE_2", "PEPTIDE_3"]
protein_ids = ["PROT_X", "PROT_X", "PROT_Y"]
var = pd.DataFrame(
    {"peptide_id": peptide_ids, "protein_id": protein_ids},
    index=peptide_ids,
)

# -- Intensity matrix (2 samples x 3 peptides) --
X = np.array([
    [500.0, 300.0, 800.0],
    [520.0, 310.0, 790.0],
])

adata_peptide = AnnData(X=X, obs=obs, var=var)
is_proteodata(adata_peptide)

The additional rules for peptide-level proteodata:

1. `.var["peptide_id"]` must exist and match `.var_names` exactly
2. `.var["protein_id"]` must exist — every peptide needs a parent protein
3. Each peptide maps to **exactly one** protein (no multi-mapping like `"PROT_A;PROT_B"`)
4. Neither `peptide_id` nor `protein_id` may contain `NaN`
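
Rule 3 is the one most often violated by search-engine exports. The pandas sketch below finds multi-mapped peptides in a `.var`-like table. It assumes the semicolon is the only delimiter, as in the `"PROT_A;PROT_B"` example above. `multi_mapped_peptides` is a hypothetical helper, and dropping the ambiguous peptides is just one possible policy, not something ProteoPy prescribes.

```python
import pandas as pd


def multi_mapped_peptides(var, sep=";"):
    """Return peptide IDs whose protein_id names more than one protein."""
    has_sep = var["protein_id"].astype(str).str.contains(sep, regex=False)
    return var.loc[has_sep, "peptide_id"].tolist()


var = pd.DataFrame(
    {
        "peptide_id": ["PEPTIDE_1", "PEPTIDE_2", "PEPTIDE_3"],
        "protein_id": ["PROT_X", "PROT_X;PROT_Y", "PROT_Y"],
    },
    index=["PEPTIDE_1", "PEPTIDE_2", "PEPTIDE_3"],
)
offenders = multi_mapped_peptides(var)
print(offenders)

# One possible policy (an assumption): drop the ambiguous peptides.
var_unique = var[~var["peptide_id"].isin(offenders)]
print(var_unique)
```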

---

## 4. `is_proteodata` vs `check_proteodata`

ProteoPy provides two validation functions:

| Function | On failure | Use case |
|----------|------------|----------|
| `is_proteodata(adata)` | Returns `(False, None)` | Conditional logic — "is this proteodata?" |
| `check_proteodata(adata)` | Raises `ValueError` | Guard clauses — "this *must* be proteodata" |

`is_proteodata` also accepts `raise_error=True` to behave like `check_proteodata`.

Both accept a `layers` parameter to additionally validate layer matrices for infinite values.

In [None]:
# is_proteodata: soft check — returns a tuple
result = is_proteodata(adata_protein)
print(f"Valid: {result[0]}, Level: {result[1]}")

In [None]:
# check_proteodata: hard check — raises on failure
try:
    check_proteodata(adata_protein)
    print("Validation passed!")
except ValueError as e:
    print(f"Validation failed: {e}")

---

## 5. Pitfalls: how valid proteodata can break

Even if you start with a valid proteodata object, common operations can silently break the format.

### 5.1 Renaming the index without updating `protein_id`

If you rename `.var_names` (the index) without updating the `protein_id` column, or change the column without updating the index, the two no longer match and validation fails.

In [None]:
# Start with a valid proteodata
obs = pd.DataFrame(
    {"sample_id": ["s1", "s2"]},
    index=["s1", "s2"],
)
proteins = ["PROT_A", "PROT_B", "PROT_C"]
var = pd.DataFrame(
    {"protein_id": proteins},
    index=proteins,
)
adata = AnnData(X=np.ones((2, 3)), obs=obs, var=var)

print("Before:", is_proteodata(adata))

In [None]:
# Rename the index to gene names
adata.var_names = ["GeneA", "GeneB", "GeneC"]

print("After:", is_proteodata(adata))

The `protein_id` column still contains `["PROT_A", "PROT_B", "PROT_C"]`, but the index now says `["GeneA", "GeneB", "GeneC"]`. To fix this, always update **both** the index and the ID column together.

In [None]:
# Fix: copy the new index into the protein_id column
adata.var["protein_id"] = adata.var_names

print("Repaired:", is_proteodata(adata))
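
One way to avoid the mismatch in the first place is to rename through a small helper that always touches both places. The sketch below works on a `.var`-like DataFrame. `rename_protein_ids` is a hypothetical name, not part of ProteoPy.

```python
import pandas as pd


def rename_protein_ids(var, mapping):
    """Return a copy of var with index and protein_id renamed together."""
    renamed = var.rename(index=mapping)  # rename() returns a new DataFrame
    renamed["protein_id"] = renamed.index.to_numpy()
    return renamed


var = pd.DataFrame(
    {"protein_id": ["PROT_A", "PROT_B", "PROT_C"]},
    index=["PROT_A", "PROT_B", "PROT_C"],
)
var_genes = rename_protein_ids(
    var, {"PROT_A": "GeneA", "PROT_B": "GeneB", "PROT_C": "GeneC"}
)
print(var_genes)
```

Because the helper returns a copy, the original table stays valid until the renamed one is assigned back.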

### 5.2 Renaming `sample_id` without updating `obs_names` (or vice versa)

In [None]:
obs = pd.DataFrame(
    {"sample_id": ["s1", "s2"]},
    index=["s1", "s2"],
)
proteins = ["P1", "P2"]
var = pd.DataFrame({"protein_id": proteins}, index=proteins)
adata = AnnData(X=np.ones((2, 2)), obs=obs, var=var)

print("Before:", is_proteodata(adata))

In [None]:
# Rename sample_id — obs_names is now stale
adata.obs["sample_id"] = ["new_s1", "new_s2"]
print("obs_names:", list(adata.obs_names))
print("sample_id:", list(adata.obs["sample_id"]))

print("After:", is_proteodata(adata))

In [None]:
# Rename the observations to match: obs_names and sample_id are back in sync
adata.obs_names = ["new_s1", "new_s2"]
print("obs_names:", list(adata.obs_names))
print("sample_id:", list(adata.obs["sample_id"]))

print("Fix:", is_proteodata(adata))

`is_proteodata` does **not** enforce that `sample_id` matches `obs_names`. This is by design, since sample IDs may legitimately differ from the AnnData index. However, downstream ProteoPy functions (`pr.read`, `pr.ann`) assume that `sample_id` and `obs_names` are equal, and many plotting functions use `sample_id` for labelling. Keeping the two in sync avoids confusion.
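
Since the validator will not flag this mismatch, a manual check can help. The pandas sketch below compares the two and re-syncs them on an `.obs`-like table. Treating the index as the source of truth is an assumption made for the example, and copying in the other direction works just as well.

```python
import pandas as pd

# The index was renamed but the sample_id column was left behind.
obs = pd.DataFrame({"sample_id": ["s1", "s2"]}, index=["new_s1", "new_s2"])

in_sync = list(obs["sample_id"]) == list(obs.index)
print("In sync before:", in_sync)

# Copy the index into the column so both carry the same labels.
obs["sample_id"] = obs.index.to_numpy()
print("In sync after:", list(obs["sample_id"]) == list(obs.index))
```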

### 5.3 Infinite values from mathematical operations

Infinite values can easily arise from common data transformations. A typical example is log-transforming data that contains zeros:

In [None]:
obs = pd.DataFrame(
    {"sample_id": ["s1", "s2"]},
    index=["s1", "s2"],
)
proteins = ["P1", "P2"]
var = pd.DataFrame({"protein_id": proteins}, index=proteins)

X = np.array([[100.0, 0.0], [200.0, 50.0]])
adata = AnnData(X=X, obs=obs, var=var)

print("Before log2:", is_proteodata(adata))

In [None]:
# log2(0) = -inf!
adata.X = np.log2(adata.X)
print("Matrix after log2:")
print(adata.X)

print("After log2:", is_proteodata(adata))

In [None]:
# The detailed error message tells you what went wrong
try:
    is_proteodata(adata, raise_error=True)
except ValueError as e:
    print(e)

**Fix:** Replace zeros with `np.nan` **before** log-transforming.

```python
adata.X[adata.X == 0] = np.nan
adata.X = np.log2(adata.X)  # log2(NaN) = NaN, safe
```
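
The same fix can be tried on a bare NumPy array to confirm that no infinite values survive. This sketch uses `np.where` to mask zeros on a copy instead of modifying the matrix in place. That is a stylistic choice for the example, not a requirement.

```python
import numpy as np

X = np.array([[100.0, 0.0], [200.0, 50.0]])

# Mask zeros on a copy so the original intensities stay untouched.
X_masked = np.where(X == 0, np.nan, X)
X_log = np.log2(X_masked)

print(X_log)
print("Any inf:", np.isinf(X_log).any())
print("Missing values:", int(np.isnan(X_log).sum()))
```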

Another source of `inf` is division by zero, for example when computing fold changes:

In [None]:
obs = pd.DataFrame(
    {"sample_id": ["s1", "s2"]},
    index=["s1", "s2"],
)
proteins = ["P1", "P2"]
var = pd.DataFrame({"protein_id": proteins}, index=proteins)

# Condition A and condition B intensities
X_a = np.array([[100.0, 50.0]])
X_b = np.array([[0.0, 50.0]])  # P1 has zero intensity in condition B

# Fold change = A / B — division by zero yields inf
fold_change = X_a / X_b
print("Fold change matrix:", fold_change)

In [None]:
# fold_change has a single row, so keep only the first sample's metadata
adata = AnnData(X=fold_change, obs=obs.iloc[:1], var=var)
print(is_proteodata(adata))
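
A possible way to keep fold changes free of `inf` is to divide only where the denominator is non-zero and leave the remaining entries as `np.nan`. The sketch below uses `np.divide` with its `out` and `where` arguments. Whether NaN is the right representation for an undefined ratio depends on your analysis, so treat this as one option rather than a recommendation.

```python
import numpy as np

X_a = np.array([[100.0, 50.0]])
X_b = np.array([[0.0, 50.0]])  # P1 has zero intensity in condition B

# Divide only where the denominator is non-zero; other entries keep the NaN fill.
fold_change = np.divide(
    X_a,
    X_b,
    out=np.full_like(X_a, np.nan),
    where=X_b != 0,
)
print(fold_change)
```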

---

## 6. Validating layers

ProteoPy functions sometimes store intermediate results in `adata.layers` (e.g., raw intensities before transformation). Both `is_proteodata` and `check_proteodata` accept a `layers` parameter to validate those matrices as well.

In [None]:
obs = pd.DataFrame(
    {"sample_id": ["s1", "s2"]},
    index=["s1", "s2"],
)
proteins = ["P1", "P2"]
var = pd.DataFrame({"protein_id": proteins}, index=proteins)
X = np.array([[10.0, 20.0], [30.0, 40.0]])

adata = AnnData(X=X, obs=obs, var=var)

# .X is fine
print("Without layers check:", is_proteodata(adata))

In [None]:
# Add a layer with an infinite value
bad_layer = X.copy()
bad_layer[0, 0] = np.inf
adata.layers["transformed"] = bad_layer

# Still passes without the layers parameter
print("Ignoring layers:    ", is_proteodata(adata))

In [None]:
# Fails when we explicitly check that layer
print("Checking layer:     ", is_proteodata(adata, layers="transformed"))

In [None]:
try:
    check_proteodata(adata, layers="transformed")
except ValueError as e:
    print(e)

In [None]:
# You can check multiple layers at once
adata.layers["raw"] = X.copy()  # this one is fine

try:
    check_proteodata(adata, layers=["raw", "transformed"])
except ValueError as e:
    print(e)
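
To find out which layers are affected before calling the validator, a short scan over the layer matrices can help. The sketch below uses a plain dict of dense NumPy arrays as a stand-in for `adata.layers`. `layers_with_inf` is a hypothetical helper, and sparse matrices would need different handling.

```python
import numpy as np


def layers_with_inf(layers):
    """Return the names of matrices that contain +inf or -inf."""
    return [name for name, mat in layers.items() if np.isinf(mat).any()]


X = np.array([[10.0, 20.0], [30.0, 40.0]])
bad = X.copy()
bad[0, 0] = np.inf

layers = {"raw": X.copy(), "transformed": bad}
print(layers_with_inf(layers))
```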

---

## 7. Quick-reference checklist

Use this checklist when constructing or debugging a proteodata object:

| # | Check | Applies to |
|---|-------|------------|
| 1 | `.obs["sample_id"]` exists | Both |
| 2 | `.var["protein_id"]` exists | Both |
| 3 | `.var["peptide_id"]` exists and matches `.var_names` | Peptide only |
| 4 | `.var["protein_id"]` matches `.var_names` | Protein only |
| 5 | Each peptide maps to exactly one protein | Peptide only |
| 6 | No `NaN` in `protein_id` or `peptide_id` | Both |
| 7 | No `np.inf` or `-np.inf` in `.X` or checked layers | Both |
| 8 | All indices (obs, var) are unique | Both |
| 9 | `.X` is 2-dimensional | Both |
| 10 | `protein_id`/`peptide_id` not in `.obs`; `sample_id` not in `.var` | Both |
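
Items 9 and 10 are not demonstrated elsewhere in this tutorial, so here is a small sketch of what such checks could look like. `layout_problems` is a hypothetical helper that takes the matrix and metadata tables directly. It illustrates the idea and is not ProteoPy's implementation.

```python
import numpy as np
import pandas as pd


def layout_problems(X, obs, var):
    """Check checklist items 9 and 10: matrix shape and column placement."""
    problems = []
    if np.ndim(X) != 2:
        problems.append(".X is not 2-dimensional")
    for col in ("protein_id", "peptide_id"):
        if col in obs.columns:
            problems.append(f"'{col}' found in .obs")
    if "sample_id" in var.columns:
        problems.append("'sample_id' found in .var")
    return problems


obs = pd.DataFrame(
    {"sample_id": ["s1", "s2"], "protein_id": ["P1", "P2"]},  # misplaced column
    index=["s1", "s2"],
)
var = pd.DataFrame({"protein_id": ["P1", "P2"]}, index=["P1", "P2"])
X = np.ones((2, 2))

print(layout_problems(X, obs, var))
print(layout_problems(X, obs.drop(columns="protein_id"), var))
```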

---

## Summary

The proteodata format is a thin but strict contract on top of AnnData that ensures ProteoPy functions can safely assume:

- **Who are the samples?** — defined by `.obs["sample_id"]`
- **What are the variables?** — defined by `.var["protein_id"]` (protein-level) or `.var["peptide_id"]` + `.var["protein_id"]` (peptide-level)
- **Is the data clean?** — no infinite values, no NaN identifiers, no duplicates

Always validate with `check_proteodata()` after constructing or modifying an `AnnData` object. ProteoPy does this automatically in every public function, so you will get a clear error message if something is wrong.