# Run Pipeline

This notebook provides a guided, no‑terminal entry point to run the triples extractor.
**Workflow:** run the cells in order, editing the configuration cell in Section 4 before you run the execution cell in Section 5.

## 1) Import Libraries and Configure Paths
This cell sets the project root and makes sure the pipeline sources are importable.

In [7]:
from pathlib import Path
import os
import sys
import subprocess
import time

def find_repo_root(start: Path) -> Path:
    """Find repo root by walking up to a folder that contains config.yaml."""
    current = start.resolve()
    for parent in [current] + list(current.parents):
        if (parent / "config.yaml").exists():
            return parent
    return start.resolve()

# Resolve project root (this repo) and pipeline src directory
project_root = find_repo_root(Path.cwd())
pipeline_src = project_root / "pipeline" / "src"
print("Project root:", project_root)
print("Pipeline src:", pipeline_src)

# Make pipeline/src importable
if str(pipeline_src) not in sys.path:
    sys.path.insert(0, str(pipeline_src))

Project root: /Users/fernandafreire/Documents/Projects/Text+/Github/BeyondEntities_DHd26
Pipeline src: /Users/fernandafreire/Documents/Projects/Text+/Github/BeyondEntities_DHd26/pipeline/src
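
The root lookup above can be tried in isolation. The following is a minimal sketch that rebuilds the same walk-up search inside a throwaway folder tree; the `marker` parameter and the `notebooks/drafts` folder names are assumptions added for illustration and are not part of the notebook or the pipeline.

```python
import tempfile
from pathlib import Path

def find_repo_root(start: Path, marker: str = "config.yaml") -> Path:
    """Walk up from `start` until a folder containing `marker` is found."""
    current = start.resolve()
    for parent in [current, *current.parents]:
        if (parent / marker).exists():
            return parent
    # No marker found: fall back to the starting folder, as the cell above does.
    return current

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    # Hypothetical layout: a marker file at the root and a nested working folder.
    (root / "config.yaml").write_text("# placeholder\n")
    nested = root / "notebooks" / "drafts"
    nested.mkdir(parents=True)
    found = find_repo_root(nested)
    print("Resolved from nested folder:", found == root)
```

Because the function falls back to the starting folder when no `config.yaml` is found, a wrong "Project root" in the printed output usually means the notebook was started outside the repository.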


## 2) Validate and Display `main.py` Metadata
This verifies that `main.py` exists and shows basic file details.

In [8]:
main_py = pipeline_src / "main.py"
if not main_py.exists():
    raise FileNotFoundError(f"main.py not found at: {main_py}")

stat = main_py.stat()
print("main.py:", main_py.resolve())
print("Size (bytes):", stat.st_size)
print("Last modified:", time.ctime(stat.st_mtime))

main.py: /Users/fernandafreire/Documents/Projects/Text+/Github/BeyondEntities_DHd26/pipeline/src/main.py
Size (bytes): 8939
Last modified: Fri Feb  6 15:23:30 2026


## 3) Define a Programmatic Entry Point
This function runs `main.py` with arguments and streams output back to the notebook.

In [9]:
def run_main(args, env=None, on_line=None):
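    """Run main.py with a list of CLI args and stream its output line by line.

    env: optional dict of extra environment variables for the child process.
    on_line: optional callback that receives each output line; defaults to print.
    Returns the exit code of the process.
    """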
    cmd = [sys.executable, str(main_py)] + args
    merged_env = os.environ.copy()
    if env:
        merged_env.update(env)
    print("Running:", " ".join(cmd))
    process = subprocess.Popen(
        cmd,
        cwd=str(project_root),
        env=merged_env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1
    )
    for line in process.stdout:
        line = line.rstrip("\n")
        if on_line:
            on_line(line)
        else:
            print(line)
    return process.wait()
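
To see the streaming pattern without touching the pipeline, the sketch below applies the same `Popen` settings to a tiny inline Python command and collects the lines through a callback. The helper name `stream_command` and the child script are assumptions made for this example; `run_main` above remains the function the notebook actually uses.

```python
import subprocess
import sys

def stream_command(cmd, on_line=print):
    """Run `cmd`, pass each output line to `on_line`, and return the exit code."""
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
        bufsize=1,  # line-buffered in text mode
    ) as process:
        for line in process.stdout:
            on_line(line.rstrip("\n"))
    # Leaving the `with` block waits for the child, so returncode is set here.
    return process.returncode

captured = []
child_code = "print('step 1'); print('step 2'); raise SystemExit(3)"
demo_exit_code = stream_command([sys.executable, "-c", child_code], on_line=captured.append)
print(captured, demo_exit_code)
```

Passing `captured.append` as the callback keeps the lines in a list, which is handy if you want to search the log for errors after the run instead of only printing it.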

## 4) Configure Run Settings
Edit the values in the configuration cell at the end of this section to match the run configuration described in `README_cf.md` (e.g., a specific filename, a batch size via `limit`, or a granularity level).

### Project Structure (Quick Map)
- Entry point: [pipeline/src/main.py](pipeline/src/main.py)
- Configuration loader: [pipeline/src/config_loader.py](pipeline/src/config_loader.py)
- API client: [pipeline/src/openwebui_client.py](pipeline/src/openwebui_client.py)
- Processing logic: [pipeline/src/processor.py](pipeline/src/processor.py)
- Default config: [config.yaml](config.yaml)
- Prompt text: [prompt.txt](prompt.txt)

### Configuration You Can Change
The variables in the next cell map to CLI flags and `config.yaml`:
- `filename`: run one file (e.g., `I_99.xml`)
- `limit`: batch size (max number of files)
- `granularity`: abstraction level (1–5)
- `skip_existing`: resume without reprocessing
- `no_graphs`: skip HTML graph generation
- `raw_xml`: send raw XML without TEI optimization

### Where to Add API Keys
API keys live in [config.yaml](config.yaml) under the active API profile; add yours in the `api_key:` field. Make sure the profile selected in `active_profile` matches the credentials you set.

### How to Change the Prompt
Edit the system prompt in [prompt.txt](prompt.txt), or point `prompt_path` in the configuration cell below to a different file.

In [12]:
# ---- Run configuration ----
config_path = "config.yaml"
prompt_path = "prompt.txt"
log_file = "logs/processing.log"

# Data source: "file" or "db"
source = "file"

# Optional: process a single file (leave empty to process all)
filename = "I_96.xml"  # e.g., "I_99.xml"

# Optional: granularity override (1-5). Leave None to use config default.
granularity = 5  # e.g., 3

# Optional: batch size (max number of files). Leave None for all.
limit = None  # e.g., 10

# Flags
skip_existing = False
update_metadata = False
no_graphs = False
raw_xml = False

def build_args_from_config():
    """Translate the settings above into a list of CLI arguments for main.py."""
    args = []
    if config_path:
        args += ["--config", config_path]
    if prompt_path:
        args += ["--prompt", prompt_path]
    if log_file:
        args += ["--log-file", log_file]
    if source:
        args += ["--source", source]
    if filename:
        args += ["--filename", filename]
    if granularity is not None:
        args += ["--granularity", str(granularity)]
    if limit is not None:
        args += ["--limit", str(limit)]
    if skip_existing:
        args.append("--skip-existing")
    if update_metadata:
        args.append("--update-metadata")
    if no_graphs:
        args.append("--no-graphs")
    if raw_xml:
        args.append("--raw-xml")
    return args
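
Before launching a run, it can help to check the edited values against the ranges described above. The sketch below is one possible check; the helper `check_run_settings` is hypothetical, and the rule that `limit` must be a positive integer is an assumption, since this notebook does not show how `main.py` validates its arguments.

```python
def check_run_settings(source, granularity, limit):
    """Return a list of problems found in the run settings (empty if none)."""
    problems = []
    # The configuration cell documents "file" and "db" as the two data sources.
    if source not in ("file", "db"):
        problems.append(f"source must be 'file' or 'db', got {source!r}")
    # Granularity is documented as an abstraction level from 1 to 5.
    if granularity is not None and granularity not in range(1, 6):
        problems.append(f"granularity must be 1-5 or None, got {granularity!r}")
    # Assumption: a batch size only makes sense as a positive integer.
    if limit is not None and (not isinstance(limit, int) or limit < 1):
        problems.append(f"limit must be a positive integer or None, got {limit!r}")
    return problems

print(check_run_settings("file", 5, None))
print(check_run_settings("web", 7, 0))
```

An empty list means the three values look plausible; any messages point to the setting that should be corrected before running Section 5.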

## 5) Run the Pipeline
This cell builds CLI arguments from the configuration above and runs `main.py`.

In [13]:
args = build_args_from_config()
print("Arguments:", args)
exit_code = run_main(args, on_line=print)
print("Exit code:", exit_code)

Arguments: ['--config', 'config.yaml', '--prompt', 'prompt.txt', '--log-file', 'logs/processing.log', '--source', 'file', '--filename', 'I_96.xml', '--granularity', '5']
Running: /Users/fernandafreire/Documents/Projects/Text+/Github/BeyondEntities_DHd26/.venv/bin/python /Users/fernandafreire/Documents/Projects/Text+/Github/BeyondEntities_DHd26/pipeline/src/main.py --config config.yaml --prompt prompt.txt --log-file logs/processing.log --source file --filename I_96.xml --granularity 5
2026-02-06 16:58:39,531 - __main__ - INFO - Starte Beschreibungsverarbeitung
2026-02-06 16:58:39,532 - __main__ - INFO - Lade Konfiguration aus: config.yaml
2026-02-06 16:58:39,533 - __main__ - INFO - Lade Prompt aus: prompt.txt
2026-02-06 16:58:39,533 - __main__ - INFO - Initialisiere File-Client
2026-02-06 16:58:39,533 - __main__ - INFO - Initialisiere OpenWebUI-Client
2026-02-06 16:58:39,533 - __main__ - INFO - Initialisiere Processor
2026-02-06 16:58:39,533 - processor - INFO - Output-Verzeichnis berei

## 6) Outputs and Logs
- JSON results: `output_json/` (or `pipeline/output_json/`, depending on your config)
- Logs: `logs/processing.log` (or the `log_file` you set)
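
After a run, a quick look at the result folder can confirm that files were written. The sketch below lists the JSON files in a folder and reports the top-level type of each one. It assumes the results are `*.json` files stored directly in that folder; the helper name `summarize_json_outputs` is hypothetical, and nothing is assumed about the schema of the extracted triples.

```python
import json
from pathlib import Path

def summarize_json_outputs(output_dir):
    """Map each *.json file name in `output_dir` to its top-level JSON type."""
    summary = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
            summary[path.name] = type(data).__name__
        except json.JSONDecodeError:
            # Keep going so one broken file does not hide the others.
            summary[path.name] = "invalid JSON"
    return summary

# Adjust the folder name if your config writes results elsewhere.
print(summarize_json_outputs("output_json"))
```

An empty dictionary means no JSON files were found in that folder, in which case the log file is the next place to look.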