# Metabolomics Workbench API Annotator Demo

This notebook demonstrates the **MetabolomicsWorkbenchAnnotator** implemented for GitHub Issue #20.

The annotator queries the [Metabolomics Workbench RefMet API](https://www.metabolomicsworkbench.org/databases/refmet/index.php) to retrieve vocabulary IDs for metabolite entities.

## API Fields Captured

The annotator returns **raw API field names**; the Normalizer maps them to standard vocabularies:

| API Field | Description |
|-----------|-------------|
| `pubchem_cid` | PubChem Compound ID |
| `inchi_key` | Standard InChIKey |
| `smiles` | SMILES notation |
| `refmet_id` | RefMet ID (with "RM" prefix) |
| `ChEBI_ID` | ChEBI ID |
| `HMDB_ID` | HMDB ID |
| `LM_ID` | LipidMaps ID |
| `KEGG_ID` | KEGG ID |

In [1]:
# Imports
import pandas as pd

from biomapper2.core.annotation_engine import AnnotationEngine
from biomapper2.core.annotators.metabolomics_workbench import MetabolomicsWorkbenchAnnotator

## 1. Basic Usage - Single Entity Annotation

The simplest way to use the annotator is to annotate a single metabolite by name.

In [2]:
# Create the annotator
annotator = MetabolomicsWorkbenchAnnotator()

# Annotate a single metabolite
entity = {"name": "Carnitine"}
result = annotator.get_annotations(entity, name_field="name")

print("Annotator slug:", annotator.slug)
print("\nAnnotation result:")
result

Annotator slug: metabolomics-workbench

Annotation result:


{'metabolomics-workbench': {'pubchem.compound': defaultdict(dict,
              {'10917': {}}),
  'inchikey': defaultdict(dict, {'PHIQHXFUZVPYII-ZCFIWIBFSA-N': {}}),
  'smiles': defaultdict(dict, {'C[N+](C)(C)C[C@@H](CC(=O)[O-])O': {}}),
  'rm': defaultdict(dict, {'0008606': {}})}}

In [None]:
# Extract individual vocabulary IDs (raw API field names)
mw_annotations = result["metabolomics-workbench"]

print("PubChem CID:", list(mw_annotations.get("pubchem_cid", {}).keys()))
print("InChIKey:", list(mw_annotations.get("inchi_key", {}).keys()))
print("SMILES:", list(mw_annotations.get("smiles", {}).keys()))
print("RefMet ID:", list(mw_annotations.get("refmet_id", {}).keys()))
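
The recorded Carnitine output above is keyed by `pubchem.compound`, `inchikey`, `smiles`, and `rm`, while the cell just above looks up raw field names such as `pubchem_cid`. Which spelling you see may depend on the biomapper2 version, so the following is a minimal sketch of a lookup that accepts several candidate keys. `ids_for` is a hypothetical helper introduced here for illustration, not part of biomapper2, and the sample payload is copied from the recorded output.

```python
def ids_for(annotations, *candidate_keys):
    """Return the IDs stored under the first candidate key that is present."""
    for key in candidate_keys:
        # Membership tests do not insert missing keys into a defaultdict.
        if key in annotations:
            return list(annotations[key])
    return []


# Payload copied from the recorded Carnitine output (plain dicts for brevity).
carnitine = {
    "pubchem.compound": {"10917": {}},
    "inchikey": {"PHIQHXFUZVPYII-ZCFIWIBFSA-N": {}},
    "smiles": {"C[N+](C)(C)C[C@@H](CC(=O)[O-])O": {}},
    "rm": {"0008606": {}},
}

print("PubChem CID:", ids_for(carnitine, "pubchem_cid", "pubchem.compound"))
print("RefMet ID:", ids_for(carnitine, "refmet_id", "rm"))
```

Passing both spellings keeps the extraction working whichever key set the annotator emits.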

## 2. Annotating Multiple Metabolites

Try annotating several common metabolites:

In [None]:
metabolites = ["Glucose", "Alanine", "ATP", "Cholesterol", "Caffeine"]

for name in metabolites:
    result = annotator.get_annotations({"name": name}, name_field="name")
    if "metabolomics-workbench" in result:
        vocabs = list(result["metabolomics-workbench"].keys())
        pubchem = list(result["metabolomics-workbench"].get("pubchem_cid", {}).keys())
        print(f"{name:15} -> PubChem: {pubchem[0] if pubchem else 'N/A':10} | Fields: {vocabs}")
    else:
        print(f"{name:15} -> No annotations found")

## 3. Bulk Annotation with DataFrames

For efficiency, use `get_annotations_bulk()` to annotate multiple entities at once. This method:
- Deduplicates API calls for repeated names
- Returns a Series with the same index as the input DataFrame
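
The deduplication idea can be sketched without pandas or network access. This is an illustration under stated assumptions, not the body of `get_annotations_bulk()`: `annotate_bulk` and `fake_lookup` are hypothetical names, and a plain list stands in for the DataFrame column.

```python
def annotate_bulk(names, lookup):
    """Call lookup once per distinct name, then fan results back out in input order."""
    cache = {}
    for name in names:
        if name not in cache:
            cache[name] = lookup(name)
    return [cache[name] for name in names]


calls = []


def fake_lookup(name):
    """Stand-in for the network call; records every invocation."""
    calls.append(name)
    return {"queried": name}


names = ["Carnitine", "Glucose", "Alanine", "Carnitine", "Unknown123"]
results = annotate_bulk(names, fake_lookup)
print(len(results), "results from", len(calls), "lookups")
```

With the five names used in this section, the sketch performs four lookups because "Carnitine" appears twice, and the output keeps one result per input position.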

In [5]:
# Create a DataFrame with metabolites
df = pd.DataFrame({
    "name": ["Carnitine", "Glucose", "Alanine", "Carnitine", "Unknown123"],
    "sample_id": ["S1", "S2", "S3", "S4", "S5"]
})

print("Input DataFrame:")
display(df)

Input DataFrame:


| | name | sample_id |
|---|------|-----------|
| 0 | Carnitine | S1 |
| 1 | Glucose | S2 |
| 2 | Alanine | S3 |
| 3 | Carnitine | S4 |
| 4 | Unknown123 | S5 |


In [6]:
# Bulk annotate
annotations = annotator.get_annotations_bulk(df, name_field="name")

# Add annotations to DataFrame
df["annotations"] = annotations

print("DataFrame with annotations:")
display(df)

DataFrame with annotations:


| | name | sample_id | annotations |
|---|------|-----------|-------------|
| 0 | Carnitine | S1 | {'metabolomics-workbench': {'pubchem.compound'... |
| 1 | Glucose | S2 | {'metabolomics-workbench': {'pubchem.compound'... |
| 2 | Alanine | S3 | {'metabolomics-workbench': {'pubchem.compound'... |
| 3 | Carnitine | S4 | {'metabolomics-workbench': {'pubchem.compound'... |
| 4 | Unknown123 | S5 | {'metabolomics-workbench': {}} |


In [None]:
# Extract PubChem IDs from annotations (raw field name: pubchem_cid)
def extract_pubchem(annot):
    if "metabolomics-workbench" in annot:
        pubchem = annot["metabolomics-workbench"].get("pubchem_cid", {})
        return list(pubchem.keys())[0] if pubchem else None
    return None

df["pubchem_cid"] = df["annotations"].apply(extract_pubchem)
print("Extracted PubChem IDs:")
display(df[["name", "sample_id", "pubchem_cid"]])

## 4. Using the Annotation Engine

The `AnnotationEngine` automatically selects the MetabolomicsWorkbenchAnnotator for metabolite, lipid, and small molecule entity types.

In [8]:
# Create the annotation engine
engine = AnnotationEngine()

# Check which annotators are selected for metabolites
annotators = engine._select_annotators("metabolite")
print("Annotators for 'metabolite':")
for a in annotators:
    print(f"  - {type(a).__name__} (slug: {a.slug})")

Annotators for 'metabolite':
  - MetabolomicsWorkbenchAnnotator (slug: metabolomics-workbench)


In [9]:
# Annotate a single entity through the engine
entity = {"name": "Dopamine"}

result = engine.annotate(
    item=entity,
    name_field="name",
    provided_id_fields=[],
    entity_type="metabolite",
    mode="all"
)

print("Annotation result for Dopamine:")
result["assigned_ids"]

Annotation result for Dopamine:


{'metabolomics-workbench': {'pubchem.compound': defaultdict(dict, {'681': {}}),
  'inchikey': defaultdict(dict, {'VYFYYTLLBUKUHU-UHFFFAOYSA-N': {}}),
  'smiles': defaultdict(dict, {'c1cc(c(cc1CCN)O)O': {}}),
  'rm': defaultdict(dict, {'0135878': {}})}}

In [10]:
# Annotate a DataFrame through the engine
df_lipids = pd.DataFrame({
    "name": ["Palmitic acid", "Oleic acid", "Linoleic acid"],
})

result_df = engine.annotate(
    item=df_lipids,
    name_field="name",
    provided_id_fields=[],
    entity_type="lipid",
    mode="all"
)

# Combine with original data
df_lipids["assigned_ids"] = result_df["assigned_ids"]
print("Lipid annotations:")
display(df_lipids)

Lipid annotations:


| | name | assigned_ids |
|---|------|--------------|
| 0 | Palmitic acid | {'metabolomics-workbench': {'pubchem.compound'... |
| 1 | Oleic acid | {'metabolomics-workbench': {'pubchem.compound'... |
| 2 | Linoleic acid | {'metabolomics-workbench': {'pubchem.compound'... |


## 5. Edge Cases

The annotator returns an empty result instead of raising an error for unknown, missing, or empty names:

In [11]:
# Unknown metabolite - returns empty annotations
result = annotator.get_annotations({"name": "ThisDoesNotExist12345"}, name_field="name")
print("Unknown metabolite:", result)

Unknown metabolite: {'metabolomics-workbench': {}}


In [12]:
# Missing name field - returns empty dict
result = annotator.get_annotations({"other_field": "value"}, name_field="name")
print("Missing name field:", result)

Missing name field: {}


In [13]:
# Empty name - returns empty dict
result = annotator.get_annotations({"name": ""}, name_field="name")
print("Empty name:", result)

Empty name: {}


In [None]:
# Special characters in name - handled via URL encoding
result = annotator.get_annotations({"name": "5-hydroxyindoleacetic acid"}, name_field="name")
print("Special characters (5-hydroxyindoleacetic acid):")
if "metabolomics-workbench" in result:
    print(f"  PubChem: {list(result['metabolomics-workbench'].get('pubchem_cid', {}).keys())}")
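
The first three edge cases follow a simple pattern in the recorded outputs: no usable name yields `{}`, and a name with no match yields the slug with an empty payload. A minimal sketch of that guard logic is shown here. `annotate` and the `lookup` callable are hypothetical stand-ins, not the annotator's actual code.

```python
SLUG = "metabolomics-workbench"


def annotate(entity, name_field, lookup):
    """Mirror the recorded edge-case behavior around a pluggable lookup."""
    name = entity.get(name_field)
    if not name:
        # Missing or empty name: nothing to query, so no slug entry at all.
        return {}
    # lookup is assumed to return a dict of IDs, or None when nothing matches.
    return {SLUG: lookup(name) or {}}


def no_match(name):
    """Hypothetical lookup that never finds anything."""
    return None


print(annotate({"other_field": "value"}, "name", no_match))
print(annotate({"name": ""}, "name", no_match))
print(annotate({"name": "ThisDoesNotExist12345"}, "name", no_match))
```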

## 6. API Details

The annotator uses the following API endpoint:

```
GET https://www.metabolomicsworkbench.org/rest/refmet/name/{metabolite_name}/all/
```
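
Filling the `{metabolite_name}` placeholder safely requires percent-encoding, since names can contain spaces and other reserved characters. The sketch below uses the standard library's `urllib.parse.quote`. `refmet_url` is a hypothetical helper, and whether the annotator encodes names exactly this way is an assumption.

```python
from urllib.parse import quote

BASE_URL = "https://www.metabolomicsworkbench.org/rest/refmet/name"


def refmet_url(metabolite_name):
    """Build the RefMet lookup URL for a metabolite name."""
    # safe="" also escapes "/", so a name cannot introduce extra path segments.
    return f"{BASE_URL}/{quote(metabolite_name, safe='')}/all/"


print(refmet_url("5-hydroxyindoleacetic acid"))
```

Hyphens are left untouched because they are unreserved characters, while the space becomes `%20`.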

Example response for "Carnitine":

In [15]:
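import requests

# Direct API call example (the timeout keeps the cell from hanging on a slow network)
response = requests.get(
    "https://www.metabolomicsworkbench.org/rest/refmet/name/Carnitine/all/",
    timeout=30,
)
response.raise_for_status()  # fail loudly on HTTP errors
print("Raw API response:")
response.json()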

Raw API response:


{'name': 'Carnitine',
 'pubchem_cid': '10917',
 'inchi_key': 'PHIQHXFUZVPYII-ZCFIWIBFSA-N',
 'exactmass': '161.105194',
 'formula': 'C7H15NO3',
 'super_class': 'Organic nitrogen compounds',
 'main_class': 'Carnitines',
 'sub_class': 'Carnitines',
 'smiles': 'C[N+](C)(C)C[C@@H](CC(=O)[O-])O',
 'regno': '42914',
 'refmet_id': 'RM0008606'}
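
To connect this raw response to the annotation structure recorded in section 1, the sketch below converts one into the other. The field-to-key mapping and the stripping of the "RM" prefix are inferred only from the outputs in this notebook. `FIELD_TO_KEY` and `to_annotations` are hypothetical names, and the real annotator or Normalizer may divide this work differently.

```python
SLUG = "metabolomics-workbench"

# Hypothetical mapping, inferred from the recorded outputs in this notebook.
FIELD_TO_KEY = {
    "pubchem_cid": "pubchem.compound",
    "inchi_key": "inchikey",
    "smiles": "smiles",
    "refmet_id": "rm",
}


def to_annotations(response):
    """Turn a raw RefMet response into {slug: {vocab_key: {local_id: {}}}}."""
    vocab_ids = {}
    if isinstance(response, dict):
        for field, key in FIELD_TO_KEY.items():
            value = str(response.get(field) or "")
            if not value:
                continue
            if field == "refmet_id" and value.startswith("RM"):
                value = value[2:]  # recorded output shows '0008606', not 'RM0008606'
            vocab_ids[key] = {value: {}}
    # Anything that is not a dict is treated as "no match" (an assumption).
    return {SLUG: vocab_ids}


# Subset of the raw response shown above.
carnitine_response = {
    "name": "Carnitine",
    "pubchem_cid": "10917",
    "inchi_key": "PHIQHXFUZVPYII-ZCFIWIBFSA-N",
    "smiles": "C[N+](C)(C)C[C@@H](CC(=O)[O-])O",
    "refmet_id": "RM0008606",
}
print(to_annotations(carnitine_response))
```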

## 7. Kestrel API Integration

The biomapper2 pipeline uses **two APIs**:

1. **Metabolomics Workbench RefMet API** - For annotating metabolites/lipids with vocabulary IDs
2. **Kestrel API** - For text search (fallback annotator) and linking CURIEs to Knowledge Graph nodes

### Kestrel Text Search Annotator

For entity types other than metabolite/lipid/smallmolecule, the `KestrelTextSearchAnnotator` is used as a fallback:

In [16]:
# Check which annotators are selected for different entity types
engine = AnnotationEngine()

print("Annotators by entity type:")
for entity_type in ["metabolite", "lipid", "protein", "gene", "disease"]:
    annotators = engine._select_annotators(entity_type)
    names = [type(a).__name__ for a in annotators]
    print(f"  {entity_type:12} -> {names}")

Annotators by entity type:
  metabolite   -> ['MetabolomicsWorkbenchAnnotator']
  lipid        -> ['MetabolomicsWorkbenchAnnotator']
  protein      -> ['KestrelTextSearchAnnotator']
  gene         -> ['KestrelTextSearchAnnotator']
  disease      -> ['KestrelTextSearchAnnotator']
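
The selection pattern in this output can be summarized as a small dispatch rule. This is a sketch of the observed behavior only, not the engine's `_select_annotators` implementation. `select_annotator_names` is a hypothetical function, and the normalization of the entity type string (lowercasing, removing spaces and underscores) is an assumption.

```python
METABOLITE_TYPES = {"metabolite", "lipid", "smallmolecule"}


def select_annotator_names(entity_type):
    """Return annotator class names for an entity type, per the recorded output."""
    # Assumed normalization so "small molecule" and "small_molecule" also match.
    normalized = entity_type.lower().replace(" ", "").replace("_", "")
    if normalized in METABOLITE_TYPES:
        return ["MetabolomicsWorkbenchAnnotator"]
    # Every other entity type falls back to Kestrel text search.
    return ["KestrelTextSearchAnnotator"]


for entity_type in ["metabolite", "lipid", "protein", "gene", "disease"]:
    print(f"  {entity_type:12} -> {select_annotator_names(entity_type)}")
```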


In [17]:
# Use Kestrel text search for a protein (non-metabolite entity type)
from biomapper2.core.annotators.kestrel_text import KestrelTextSearchAnnotator

kestrel_annotator = KestrelTextSearchAnnotator()
print("Kestrel annotator slug:", kestrel_annotator.slug)

# Annotate a protein by name
protein_entity = {"name": "insulin"}
protein_result = kestrel_annotator.get_annotations(protein_entity, name_field="name")

print("\nKestrel text search result for 'insulin':")
protein_result

Kestrel annotator slug: kestrel-text-search

Kestrel text search result for 'insulin':


{'kestrel-text-search': defaultdict(<function biomapper2.core.annotators.kestrel_text.KestrelTextSearchAnnotator.get_annotations.<locals>.<lambda>()>,
             {'CHEBI': defaultdict(dict,
                          {'5931': {'score': 1838.9493701473646}})})}
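
Unlike the Metabolomics Workbench output, each Kestrel hit carries a `score`. When a search returns several hits, one option is to keep the highest-scoring one. The sketch below assumes scores are comparable across vocabularies, which this notebook does not confirm, and `best_hit` is a hypothetical helper.

```python
def best_hit(vocab_ids):
    """Return (vocabulary, local_id, score) for the highest-scoring entry, or None."""
    hits = [
        (vocab, local_id, info.get("score", 0.0))
        for vocab, ids in vocab_ids.items()
        for local_id, info in ids.items()
    ]
    return max(hits, key=lambda hit: hit[2]) if hits else None


# Payload copied from the recorded 'insulin' result (plain dicts for brevity).
insulin = {"CHEBI": {"5931": {"score": 1838.9493701473646}}}
print(best_hit(insulin))
```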

### Full Pipeline with Mapper

The `Mapper` class orchestrates the full pipeline:
1. **Annotation** - Add vocabulary IDs (using MW API for metabolites, Kestrel for others)
2. **Normalization** - Convert IDs to standard CURIEs
3. **Linking** - Map CURIEs to Knowledge Graph node IDs (using Kestrel API)
4. **Resolution** - Resolve multiple matches to best KG node
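
For the normalization step, the recorded pipeline output later in this section shows CURIEs such as `PUBCHEM.COMPOUND:10917` and `RM:0008606`, which suggests an uppercase prefix joined to the local ID. The sketch below illustrates only that pattern. `to_curie` is a hypothetical helper, and the real Normalizer likely does more (for example, the SMILES string in the pipeline output differs from the one the annotator returned).

```python
def to_curie(vocab_key, local_id):
    """Join an uppercased vocabulary key and a local ID into a CURIE."""
    return f"{vocab_key.upper()}:{local_id}"


# Vocabulary keys and IDs copied from the recorded Carnitine annotation.
for vocab_key, local_id in [
    ("pubchem.compound", "10917"),
    ("rm", "0008606"),
    ("inchikey", "PHIQHXFUZVPYII-ZCFIWIBFSA-N"),
]:
    print(to_curie(vocab_key, local_id))
```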

In [18]:
# Full pipeline: Map a metabolite to Knowledge Graph
from biomapper2.mapper import Mapper

mapper = Mapper()

# Map Carnitine through the full pipeline
metabolite = {"name": "Carnitine"}
mapped_result = mapper.map_entity_to_kg(
    item=metabolite,
    name_field="name",
    provided_id_fields=[],
    entity_type="metabolite"
)

print("Full pipeline result for Carnitine:")
print(f"  Name: {mapped_result.get('name')}")

curies = list(mapped_result.get('curies', []))
print(f"  CURIEs ({len(curies)} total): {curies}")

# kg_ids is a dict mapping KG node ID -> list of CURIEs that linked to it
kg_ids_dict = mapped_result.get('kg_ids', {})
print(f"  KG IDs: {list(kg_ids_dict.keys())}")

# chosen_kg_id is the resolved best match
print(f"  Resolved KG ID: {mapped_result.get('chosen_kg_id')}")

Full pipeline result for Carnitine:
  Name: Carnitine
  CURIEs (4 total): ['SMILES:C[N+](C)(C)C[C@H](O)CC(=O)[O-]', 'PUBCHEM.COMPOUND:10917', 'RM:0008606', 'INCHIKEY:PHIQHXFUZVPYII-ZCFIWIBFSA-N']
  KG IDs: ['CHEBI:16347']
  Resolved KG ID: CHEBI:16347
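
Here only one KG node was linked, so resolution is trivial. When `kg_ids` holds several candidates, one plausible rule is to prefer the node supported by the most CURIEs. The sketch below shows that rule purely as an illustration: it is an assumption, not biomapper2's documented resolution strategy, and `resolve` and the `EX:` identifiers are made up for the example.

```python
def resolve(kg_ids):
    """Pick the KG node backed by the most CURIEs; break ties alphabetically."""
    if not kg_ids:
        return None
    return min(kg_ids, key=lambda node: (-len(kg_ids[node]), node))


# Made-up candidates: EX:2 is supported by two CURIEs, EX:1 by one.
candidates = {
    "EX:1": ["VOCAB_A:111"],
    "EX:2": ["VOCAB_B:222", "VOCAB_C:333"],
}
print(resolve(candidates))
```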


In [19]:
# Map multiple metabolites through the full pipeline
df_metabolites = pd.DataFrame({
    "name": ["Carnitine", "Glucose", "Dopamine"],
})

# Use map_dataset_to_kg for DataFrames (writes results to TSV)
# For this demo, we'll map entities individually
results = []
for _, row in df_metabolites.iterrows():
    result = mapper.map_entity_to_kg(
        item=row.to_dict(),
        name_field="name",
        provided_id_fields=[],
        entity_type="metabolite"
    )
    results.append({
        "name": result.get("name"),
        "chosen_kg_id": result.get("chosen_kg_id"),
        "num_curies": len(result.get("curies", [])),
        "kg_ids": list(result.get("kg_ids", {}).keys())
    })

df_results = pd.DataFrame(results)
print("Mapped metabolites to Knowledge Graph:")
display(df_results)

Mapped metabolites to Knowledge Graph:


| | name | chosen_kg_id | num_curies | kg_ids |
|---|------|--------------|------------|--------|
| 0 | Carnitine | CHEBI:16347 | 4 | [CHEBI:16347] |
| 1 | Glucose | CHEBI:4167 | 4 | [CHEBI:4167] |
| 2 | Dopamine | CHEBI:4698 | 4 | [CHEBI:4698] |
