# Standalone Extractor Usage

The NexusLIMS extractor system can be used as a **standalone Python library**
without a full NexusLIMS deployment.  No `.env` file, database, NEMO instance,
or CDCS server is required.

This notebook demonstrates:

1. Downloading example microscopy files from the project repository
2. Extracting metadata with the high-level `parse_metadata()` API
3. Using the low-level registry API (`ExtractionContext` + `get_registry()`)
4. What degrades gracefully when no NexusLIMS config is present

:::{note}
{download}`Download this notebook <standalone_extractor_usage.ipynb>` to run it locally.
:::

## Installation

```bash
pip install nexusLIMS
# or, if you use uv:
uv add nexusLIMS
```

## 1. Download example data files

We pull a handful of microscopy files straight from the NexusLIMS test suite
on GitHub so this notebook is self-contained.  Some larger files are stored as
`.tar.gz` archives in the repository; those are downloaded and extracted here.

In [1]:
import tarfile
import urllib.request
from pathlib import Path

# Raw-content base URL for test files in the NexusLIMS repository
_BASE = "https://raw.githubusercontent.com/datasophos/NexusLIMS/main/tests/unit/files/"

data_dir = Path("example_data")
data_dir.mkdir(exist_ok=True)


def fetch(remote_name: str, extract_to: Path | None = None) -> Path:
    """Download a file from the NexusLIMS test suite.

    If *extract_to* is given the file is treated as a .tar.gz archive:
    it is downloaded into ``data_dir`` and the member named after the
    archive stem is extracted into *extract_to*.
    Returns the path of the downloaded (or extracted) file.
    """
    url = _BASE + remote_name
    dest = data_dir / remote_name

    if extract_to is None:
        # Plain file download
        if dest.exists():
            print(f"  already present : {remote_name}")
        else:
            print(f"  downloading {remote_name} …", end=" ", flush=True)
            urllib.request.urlretrieve(url, dest)
            print(f"done ({dest.stat().st_size:,} bytes)")
        return dest
    else:
        # Archive download + extraction
        # The member file has the same stem as the archive (e.g., foo.dm3.tar.gz → foo.dm3)
        stem = remote_name.removesuffix(".tar.gz")
        extracted = extract_to / stem
        if extracted.exists():
            print(f"  already present : {stem}")
        else:
            archive_path = data_dir / remote_name
            if not archive_path.exists():
                print(f"  downloading {remote_name} …", end=" ", flush=True)
                urllib.request.urlretrieve(url, archive_path)
                print(f"done ({archive_path.stat().st_size:,} bytes)")
            print(f"  extracting  {stem} …", end=" ", flush=True)
            with tarfile.open(archive_path) as tf:
                tf.extract(stem, path=extract_to, filter="data")
            print(f"done ({extracted.stat().st_size:,} bytes)")
        return extracted


# --- Files stored directly in the repo ---
dm3_file = fetch("test_STEM_image.dm3")
orion_tif = fetch("orion-zeiss_dataZeroed.tif")
msa_file = fetch("leo_edax_test.msa")
spc_file = fetch("leo_edax_test.spc")

# --- Files stored as archives ---
multi_dm3 = fetch("TEM_list_signal_dataZeroed.dm3.tar.gz", extract_to=data_dir)
neoarm_dm4 = fetch("neoarm-gatan_SI_dataZeroed.dm4.tar.gz", extract_to=data_dir)

print("\nAll files ready.")

  already present : test_STEM_image.dm3
  already present : orion-zeiss_dataZeroed.tif
  already present : leo_edax_test.msa
  already present : leo_edax_test.spc
  already present : TEM_list_signal_dataZeroed.dm3
  already present : neoarm-gatan_SI_dataZeroed.dm4

All files ready.


## 2. High-level API — `parse_metadata()`

`parse_metadata()` is the main entry point.  It returns a pair: a list of
metadata dicts (one per signal in the file) and a list of previews.  Pass
`write_output=False, generate_preview=False` to skip the NexusLIMS
deployment-specific steps (writing JSON sidecars and generating thumbnails).

### 2a. Gatan DM3 — single STEM image

In [2]:
from nexusLIMS.extractors import parse_metadata

metadata_list, _previews = parse_metadata(
    dm3_file,
    write_output=False,
    generate_preview=False,
)

print(f"Signals found : {len(metadata_list)}")
nx = metadata_list[0]["nx_meta"]
print(f"DatasetType   : {nx['DatasetType']}")
print(f"Data Type     : {nx['Data Type']}")
print(f"Creation Time : {nx['Creation Time']}")

Signals found : 1
DatasetType   : Image
Data Type     : STEM_Imaging
Creation Time : 2026-02-18T14:54:03.330955-07:00


The full `nx_meta` dictionary contains everything the extractor could pull from the file.
A small helper makes nested dicts and Pint Quantity objects readable:

In [3]:
from decimal import Decimal

import numpy as np


def _to_display(obj):
    """Convert quantities and numpy scalars to printable strings."""
    if hasattr(obj, "magnitude"):  # pint Quantity
        return f"{obj.magnitude} {obj.units}"
    if isinstance(obj, (np.integer, np.floating)):
        return obj.item()
    if isinstance(obj, Decimal):
        return float(obj)
    return obj


def pretty(d, indent=0):
    for k, v in d.items():
        if isinstance(v, dict):
            print(" " * indent + f"{k}:")
            pretty(v, indent + 2)
        else:
            print(" " * indent + f"{k}: {_to_display(v)}")


pretty(nx)

acceleration_voltage: 200000.0 volt
acquisition_device: DigiScan
Creation Time: 2026-02-18T14:54:03.330955-07:00
Data Dimensions: (68, 68)
Data Type: STEM_Imaging
DatasetType: Image
dwell_time: 3.5 microsecond
extensions:
  Cs: 0.0 millimeter
  GMS Version: 2.31.734.0
  Illumination Mode: STEM NANOPROBE
  Imaging Mode: DIFFRACTION
  Microscope: FEI Titan
  Name: TEST Titan Remote
  Operation Mode: SCANNING
  STEM Camera Length: 135.0 millimeter
horizontal_field_width: 0.5090058644612631 micrometer
magnification: 225000.0
stage_position:
  tilt_alpha: 24.950478513002935 degree
  x: -461.276 micrometer
  y: 52.0039 micrometer
  z: 35.033899999999996 millimeter
NexusLIMS Extraction:
  Date: 2026-02-18T15:13:06.884944-07:00
  Module: nexusLIMS.extractors.plugins.dm3_extractor
  Version: 2.4.2.dev0

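If the goal is to save the metadata rather than print it, the same duck-typing idea can produce a JSON-serializable structure. This is a minimal sketch under stated assumptions: `FakeQuantity` is a hypothetical stand-in for a Pint Quantity so the example runs without NexusLIMS or Pint installed, and the `{"value": ..., "units": ...}` layout is an arbitrary choice, not the format NexusLIMS writes to its JSON sidecars.

```python
import json
from decimal import Decimal


class FakeQuantity:
    """Hypothetical stand-in for a Pint Quantity, for illustration only."""

    def __init__(self, magnitude, units):
        self.magnitude = magnitude
        self.units = units


def to_jsonable(obj):
    """Recursively convert a metadata structure into JSON-friendly types."""
    if isinstance(obj, dict):
        return {str(k): to_jsonable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(v) for v in obj]
    if hasattr(obj, "magnitude") and hasattr(obj, "units"):  # quantity-like
        return {"value": to_jsonable(obj.magnitude), "units": str(obj.units)}
    if isinstance(obj, Decimal):
        return float(obj)
    if hasattr(obj, "tolist"):  # numpy scalars and arrays
        return obj.tolist()
    return obj


# Sample shaped like a few entries of the nx_meta output above
sample = {
    "dwell_time": FakeQuantity(Decimal("3.5"), "microsecond"),
    "Data Dimensions": (68, 68),
    "extensions": {"Microscope": "FEI Titan"},
}
encoded = json.dumps(to_jsonable(sample), sort_keys=True)
```

Keeping the value and the unit as separate fields avoids having to parse strings such as `"3.5 microsecond"` when the JSON is read back.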

### 2b. Multi-signal DM3 file

Some DM3/DM4 files contain multiple signals (e.g., a survey image alongside
an EELS spectrum).  `parse_metadata()` returns one metadata dict per signal.

In [4]:
multi_list, _ = parse_metadata(
    multi_dm3,
    write_output=False,
    generate_preview=False,
)

print(f"Signals in file: {len(multi_list)}")
for i, m in enumerate(multi_list):
    nx = m["nx_meta"]
    dims = nx.get("Data Dimensions", "?")
    print(f"  [{i}] {nx['DatasetType']:15s}  {nx['Data Type']:25s}  dims={dims}")

Signals in file: 2
  [0] Image            STEM_Imaging               dims=(512, 512)
  [1] Image            TEM_Imaging                dims=(512,)


### 2c. Gatan DM4 — multi-signal spectrum image (EELS/EDS SI)

In [5]:
dm4_list, _ = parse_metadata(
    neoarm_dm4,
    write_output=False,
    generate_preview=False,
)

print(f"Signals in file: {len(dm4_list)}")
for i, m in enumerate(dm4_list):
    nx = m["nx_meta"]
    dims = nx.get("Data Dimensions", "?")
    print(f"  [{i}] {nx['DatasetType']:15s}  {nx['Data Type']:25s}  dims={dims}")

Signals in file: 4
  [0] Image            STEM_Imaging               dims=(512, 512)
  [1] Image            STEM_Imaging               dims=(52, 106)
  [2] Image            STEM_Imaging               dims=(52, 106)
  [3] SpectrumImage    EDS_Spectrum_Imaging       dims=(52, 106, 2048)

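With several signals per file, it is often useful to pick out only those of one kind. The sketch below filters on `DatasetType`. The `signals_of_type()` helper is hypothetical, and `fake_list` is hand-written stand-in data shaped like the output above so that the example runs without the DM4 file.

```python
def signals_of_type(metadata_list, dataset_type):
    """Return (index, nx_meta) pairs whose DatasetType equals *dataset_type*."""
    return [
        (i, m["nx_meta"])
        for i, m in enumerate(metadata_list)
        if m["nx_meta"].get("DatasetType") == dataset_type
    ]


# Stand-in for the list returned by parse_metadata(); values copied from above
fake_list = [
    {"nx_meta": {"DatasetType": "Image", "Data Type": "STEM_Imaging",
                 "Data Dimensions": (512, 512)}},
    {"nx_meta": {"DatasetType": "Image", "Data Type": "STEM_Imaging",
                 "Data Dimensions": (52, 106)}},
    {"nx_meta": {"DatasetType": "Image", "Data Type": "STEM_Imaging",
                 "Data Dimensions": (52, 106)}},
    {"nx_meta": {"DatasetType": "SpectrumImage", "Data Type": "EDS_Spectrum_Imaging",
                 "Data Dimensions": (52, 106, 2048)}},
]

spectrum_images = signals_of_type(fake_list, "SpectrumImage")
image_indices = [i for i, _ in signals_of_type(fake_list, "Image")]
```

Returning the original index alongside each dict keeps the link back to the signal's position in the file.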

### 2d. Zeiss Orion HIM TIFF

In [6]:
orion_list, _ = parse_metadata(
    orion_tif,
    write_output=False,
    generate_preview=False,
)

nx = orion_list[0]["nx_meta"]
print(f"DatasetType   : {nx['DatasetType']}")
print(f"Data Type     : {nx['Data Type']}")
print(f"Creation Time : {nx['Creation Time']}")

DatasetType   : Image
Data Type     : HIM_Imaging
Creation Time : 2026-02-18T21:54:03.444135+00:00


### 2e. EDAX spectrum files (`.spc` / `.msa`)

In [7]:
for path in [spc_file, msa_file]:
    result, _ = parse_metadata(path, write_output=False, generate_preview=False)
    nx = result[0]["nx_meta"]
    print(f"{path.name}")
    print(f"  DatasetType  : {nx['DatasetType']}")
    print(f"  Data Type    : {nx['Data Type']}")
    print(f"  Creation Time: {nx['Creation Time']}")
    print()

leo_edax_test.spc
  DatasetType  : Spectrum
  Data Type    : EDS_Spectrum
  Creation Time: 2026-02-18T21:54:03.966308+00:00

leo_edax_test.msa
  DatasetType  : Spectrum
  Data Type    : EDS_Spectrum
  Creation Time: 2026-02-18T21:54:03.851536+00:00

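The loop above generalizes to a small batch helper that flattens any number of files into one row per signal. In this sketch, `summarize()` and `fake_parse()` are hypothetical names. The parser is passed in as an argument so the example runs without NexusLIMS installed, and `fake_parse()` only mimics the `(metadata_list, previews)` return shape seen in this notebook.

```python
from pathlib import Path


def summarize(paths, parse):
    """Return one (file name, signal index, DatasetType, Data Type) row per signal.

    *parse* is any callable that returns (metadata_list, previews).
    """
    rows = []
    for path in paths:
        metadata_list, _previews = parse(path)
        for i, m in enumerate(metadata_list):
            nx = m["nx_meta"]
            rows.append(
                (Path(path).name, i, nx.get("DatasetType"), nx.get("Data Type"))
            )
    return rows


def fake_parse(path):
    """Hypothetical stand-in for parse_metadata(), for illustration only."""
    meta = {"nx_meta": {"DatasetType": "Spectrum", "Data Type": "EDS_Spectrum"}}
    return [meta], [None]


rows = summarize(["leo_edax_test.spc", "leo_edax_test.msa"], fake_parse)
```

With NexusLIMS installed, the real parser could be supplied as `lambda p: parse_metadata(p, write_output=False, generate_preview=False)`.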


## 3. Low-level API — `ExtractionContext` + `get_registry()`

For more control you can work directly with the extractor registry.  This lets
you inspect which extractor was selected, or call `extract()` yourself without
going through `parse_metadata()`.

In [8]:
from nexusLIMS.extractors.base import ExtractionContext
from nexusLIMS.extractors.registry import get_registry

# instrument=None is fine when there is no database to look up
context = ExtractionContext(file_path=dm3_file, instrument=None)

registry = get_registry()
extractor = registry.get_extractor(context)

print(f"Selected extractor  : {extractor.name}")
print(f"Supported extensions: {extractor.supported_extensions}")

Selected extractor  : dm3_extractor
Supported extensions: {'dm4', 'dm3'}


In [9]:
# Call extract() directly — returns list[dict], one dict per signal
raw_list = extractor.extract(context)

print(f"Signals returned: {len(raw_list)}")
nx = raw_list[0]["nx_meta"]
print(f"  DatasetType : {nx['DatasetType']}")
print(f"  Data Type   : {nx['Data Type']}")

Signals returned: 1
  DatasetType : Image
  Data Type   : STEM_Imaging


### Listing all registered extractors

In [10]:
print(f"{'Name':<35} {'Priority':>8}  Extensions")
print("-" * 70)
for ext in registry.all_extractors:
    exts = (
        ", ".join(sorted(ext.supported_extensions))
        if ext.supported_extensions
        else "(any)"
    )
    print(f"{ext.name:<35} {ext.priority:>8}  {exts}")

Name                                Priority  Extensions
----------------------------------------------------------------------
orion_HIM_tif_extractor                  150  tif, tiff
tescan_tif_extractor                     150  tif, tiff
dm3_extractor                            100  dm3, dm4
msa_extractor                            100  msa
spc_extractor                            100  spc
ser_emi_extractor                        100  ser
quanta_tif_extractor                     100  tif, tiff
basic_file_info_extractor                  0  (any)

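The listing suggests a selection scheme in which higher-priority extractors are tried first and an extractor with no declared extensions acts as a catch-all. The toy model below illustrates that idea only. `ToyExtractor` and `select()` are hypothetical and are not the NexusLIMS implementation, which presumably also inspects file contents to tell apart extractors that share an extension (such as the three TIFF extractors).

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ToyExtractor:
    """Hypothetical stand-in for an extractor; not a NexusLIMS class."""

    name: str
    priority: int
    extensions: frozenset = frozenset()  # empty means "any extension"


def select(extractors, path):
    """Pick the highest-priority extractor whose extensions match *path*."""
    ext = Path(path).suffix.lower().lstrip(".")
    candidates = [e for e in extractors if not e.extensions or ext in e.extensions]
    return max(candidates, key=lambda e: e.priority)


toys = [
    ToyExtractor("toy_dm", 100, frozenset({"dm3", "dm4"})),
    ToyExtractor("toy_msa", 100, frozenset({"msa"})),
    ToyExtractor("toy_fallback", 0),  # accepts anything, lowest priority
]

chosen_dm4 = select(toys, "scan.DM4").name
chosen_txt = select(toys, "notes.txt").name
```

Because the fallback accepts every file, `select()` always has at least one candidate, which mirrors the role that `basic_file_info_extractor` appears to play in the table above.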

## 4. Graceful degradation without NexusLIMS configuration

| Feature | Without config | With config |
|---------|:--------------:|:-----------:|
| Metadata extraction | ✅ Full | ✅ Full |
| Schema validation | ✅ Full | ✅ Full |
| Instrument ID from database | ⚠️ Returns `None` | ✅ Looks up DB |
| JSON sidecar write (`write_output`) | ⚠️ Skipped + warning | ✅ Written |
| Preview thumbnail (`generate_preview`) | ⚠️ Skipped + warning | ✅ Generated |
| Local instrument profiles | ⚠️ Skipped (built-ins active) | ✅ Loaded |

If you call `parse_metadata()` with the defaults (`write_output=True,
generate_preview=True`) and no config is present, NexusLIMS logs a warning,
skips the sidecar and thumbnail steps, and still returns the list of metadata
dicts; the previews list holds `None` entries.  The cell below demonstrates that:

In [11]:
import logging

import nexusLIMS.extractors as _ext

# Show warnings inline
logging.basicConfig(level=logging.WARNING)

# Temporarily pretend config is unavailable
_real = _ext._config_available
_ext._config_available = lambda: False

try:
    result, previews = parse_metadata(
        dm3_file,
        write_output=True,  # normally writes JSON — skipped with a warning
        generate_preview=True,  # normally makes thumbnail — skipped with a warning
    )
    print(f"Metadata returned despite missing config: {result is not None}")
    print(f"Previews list (all None when config missing): {previews}")
finally:
    _ext._config_available = _real  # restore

Metadata returned despite missing config: True
Previews list (all None when config missing): [None]

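The save, replace, and restore pattern in the cell above can also be written with the standard library's `unittest.mock.patch.object`, which restores the original attribute even if the body raises. The sketch below uses a `SimpleNamespace` named `fake_module` as a hypothetical stand-in for `nexusLIMS.extractors`, so it runs without NexusLIMS installed.

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for a module exposing a _config_available() function
fake_module = SimpleNamespace(_config_available=lambda: True)

# Inside the block the attribute is replaced; on exit it is restored
with mock.patch.object(fake_module, "_config_available", lambda: False):
    inside = fake_module._config_available()

after = fake_module._config_available()
```

Since `_config_available` is a private name, patching it is best kept to demonstrations and tests like this one.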