# Issue Report Pipeline — `emd:horizontal_grid_cells`

Stages 2–4 each return a **result object** with two properties:
- `.data` — structured dict you can inspect or pass downstream
- `.md`   — ready-to-render Markdown section
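
As a rough illustration of that contract, a result object could be modelled as a small dataclass. `StageResult` and its Markdown rendering are hypothetical; they are not the classes shipped in `cmipld.utils.similarity`.

```python
from dataclasses import dataclass, field


@dataclass
class StageResult:
    """Hypothetical stand-in for a stage result; not the cmipld class."""
    title: str
    data: dict = field(default_factory=dict)

    @property
    def md(self) -> str:
        # Render the dict as a Markdown section with one bullet per key.
        lines = [f"## {self.title}", ""]
        lines += [f"- **{key}**: {value}" for key, value in self.data.items()]
        return "\n".join(lines)


result = StageResult("Links", {"overlap_pct": 80.0, "n_links": 5})
print(result.data["overlap_pct"])   # structured access
print(result.md)                    # ready-to-render Markdown
```

Keeping `.md` as a derived property means the Markdown can never drift from the underlying dict.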

| Stage | Class | Returns |
|-------|-------|---------|
| 1 | `GraphLoader` | loader object with `.items` list |
| 2 | `LinkAnalyzer` | `LinkResult` — external `@id` links + overlap % |
| 3 | `PydanticValidator` | `ValidationResult` — pre-validated / failed / unmodelled |
| 4 | `TextSimilarityAnalyzer` | `SimilarityResult` — semantic / structural similarity |
| 5 | `ReportBuilder` | full Markdown report string |

**All data is defined once** in the setup cells below; after that, each stage is a single call.

---
## Setup — imports & pipeline classes

In [None]:
import sys
sys.path.insert(0, '/Users/daniel.ellis/WIPwork/CMIP-LD')  # local CMIP-LD checkout; adjust to your own path

import cmipld
from IPython.display import Markdown, display

from cmipld.utils.similarity import (
    GraphLoader,
    LinkAnalyzer,
    PydanticValidator,
    TextSimilarityAnalyzer,
    ReportBuilder,
    short,          # short('https://.../x_resolution') → 'x_resolution'
)

---
## Definitions — all data lives here

In [None]:
# ── Folder & kind ──────────────────────────────────────────────────────────
FOLDER_URL = 'emd:horizontal_grid_cells'
KIND       = 'horizontal_grid_cell'

# ── Fetch the live graph (one call, reused by every stage) ─────────────────
graph_data = cmipld.expand(f'{FOLDER_URL}/_graph.json', depth=2)

# ── Derive field URI prefix from the live data ─────────────────────────────
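# Assumes the graph has a 'contents' key and that its first item carries a
# description field; next() raises StopIteration otherwise.
contents_key = next(k for k in graph_data if 'contents' in k.lower())
sample       = graph_data[contents_key][0]
desc_key     = next(k for k in sample if short(k) == 'description')
FP           = desc_key[:-len('description')]   # strip only the trailing field name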

# ── Submitted item — the hypothetical new entry being reviewed ─────────────
#    Mirrors the real data structure (fully-expanded URI keys).
submitted = {
    '@id'  : 'https://emd.mipcvs.dev/horizontal_grid_cells/g107',
    '@type': [FP[:-1], 'wcrp:horizontal_grid_cells', 'esgvoc:HorizontalGridCells'],
    # Link-type fields
    f'{FP}grid_type'          : {'@id': 'https://constants.mipcvs.dev/grid_type/regular-latitude-longitude'},
    f'{FP}grid_mapping'       : {'@id': 'https://constants.mipcvs.dev/grid_mapping/latitude-longitude'},
    f'{FP}region'             : {'@id': 'https://constants.mipcvs.dev/region/global'},
    f'{FP}temporal_refinement': {'@id': 'https://constants.mipcvs.dev/temporal_refinement/static'},
    f'{FP}units'              : {'@id': 'https://constants.mipcvs.dev/units/degree'},
    # Text / value fields
    f'{FP}description'           : {'@value': 'Global regular latitude-longitude grid with 0.5° x 0.5° resolution and 259200 cells.'},
    f'{FP}n_cells'               : {'@value': 259200},
    f'{FP}x_resolution'          : {'@value': 0.5},
    f'{FP}y_resolution'          : {'@value': 0.5},
    f'{FP}southernmost_latitude' : {'@value': -89.75},
    f'{FP}westernmost_longitude' : {'@value': 0.0},
    # DRS / identifier (auto-skipped by the pipeline)
    f'{FP}validation_key': {'@value': 'g107'},
    f'{FP}ui_label'      : {'@value': ''},
}

print('Submitted item fields:')
for k, v in submitted.items():
    print(f'  {short(k):<28} {v}')

---
## Stage 1 — `GraphLoader`

Load the folder graph. `graph_data=` reuses the already-fetched dict — no extra network call.
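
The reuse pattern can be sketched as follows. `SketchLoader` and its `fetch` argument are hypothetical and only illustrate skipping the fetch when a dict is supplied; the real `GraphLoader` may behave differently.

```python
from typing import Callable, Optional


class SketchLoader:
    """Hypothetical loader: uses a pre-fetched graph when one is given."""

    def __init__(self, folder_url: str,
                 fetch: Callable[[str], dict],
                 graph_data: Optional[dict] = None):
        self.folder_url = folder_url
        # Only call `fetch` (the stand-in for a network request) when no dict was passed in.
        self.graph = graph_data if graph_data is not None else fetch(folder_url)


calls = []


def fake_fetch(url: str) -> dict:
    # Stand-in for a network call; records each invocation.
    calls.append(url)
    return {"contents": []}


prefetched = fake_fetch("emd:horizontal_grid_cells")   # one fetch up front
SketchLoader("emd:horizontal_grid_cells", fake_fetch, graph_data=prefetched)
SketchLoader("emd:horizontal_grid_cells", fake_fetch, graph_data=prefetched)
print(len(calls))   # neither loader triggered a further fetch
```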

In [None]:
loader = GraphLoader(FOLDER_URL, graph_data=graph_data)
print(loader)
print(f'\nItem IDs: {list(loader.to_data_dict().keys())}')

In [None]:
# Inspect a single item — fields displayed with short()
g100 = loader.get('g100')
for k, v in g100.items():
    print(f'  {short(k):<28} {v}')

---
## Stage 2 — `LinkAnalyzer`

Extracts all external `@id` URI references via **RDFLib** and computes the
Jaccard overlap of the submitted item's link set with that of every item in the folder.

`result.link_fields` feeds into Stage 4 as exclusions.
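
A minimal sketch of the overlap computation on plain Python structures follows. It walks the nested dict directly instead of going through RDFLib, so `collect_ids` and `jaccard_pct` are illustrative helpers, not the library's functions. The `example.org` URIs are made up, and the item's own top-level `@id` is skipped on the assumption that only outgoing references count.

```python
def collect_ids(node, top_level=True):
    """Return the set of '@id' values nested inside a JSON-LD-like structure."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "@id":
                if not top_level:   # assumption: the item's own @id is not a link
                    found.add(value)
            else:
                found |= collect_ids(value, top_level=False)
    elif isinstance(node, list):
        for element in node:
            found |= collect_ids(element, top_level=False)
    return found


def jaccard_pct(a, b):
    """Jaccard overlap of two sets as a percentage; 0.0 when both are empty."""
    union = a | b
    return 100.0 * len(a & b) / len(union) if union else 0.0


new_item = {
    "@id": "https://example.org/grid/new",
    "grid_type": {"@id": "https://example.org/grid_type/regular"},
    "region": {"@id": "https://example.org/region/global"},
}
existing = {
    "@id": "https://example.org/grid/old",
    "grid_type": {"@id": "https://example.org/grid_type/regular"},
    "region": {"@id": "https://example.org/region/arctic"},
}

overlap = jaccard_pct(collect_ids(new_item), collect_ids(existing))
print(f"{overlap:.1f}%")   # one shared link out of three distinct links
```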

In [None]:
link_result = LinkAnalyzer(loader).analyze(submitted)
print(link_result)

In [None]:
link_result.data

In [None]:
display(Markdown(link_result.md))

---
## Stage 3 — `PydanticValidator`

Validates via **`pycmipld`** against the esgvoc `HorizontalGridCells` model.

- `validation_key` is translated to `drs_name` internally — both are excluded from similarity.
- `result.validated_fields` feeds into Stage 4 as exclusions.
- `result.unmodelled_fields` are the candidates for text similarity.
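
The three-way split can be pictured with a small stdlib-only sketch. The `SCHEMA` dict, the `partition_fields` helper, and the plain `isinstance` checks are assumptions that stand in for the real pydantic model; only the renaming of `validation_key` to `drs_name` comes from the notes above.

```python
# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"drs_name": str, "n_cells": int, "x_resolution": float}

# Renaming described above: validation_key is checked as drs_name.
ALIASES = {"validation_key": "drs_name"}


def partition_fields(item: dict):
    """Split field names into validated, failed and unmodelled sets."""
    validated, failed, unmodelled = set(), set(), set()
    for name, value in item.items():
        model_name = ALIASES.get(name, name)
        if model_name not in SCHEMA:
            unmodelled.add(name)      # the model has no such field
        elif isinstance(value, SCHEMA[model_name]):
            validated.add(name)
        else:
            failed.add(name)
    return validated, failed, unmodelled


item = {
    "validation_key": "g107",
    "n_cells": "many",                # wrong type on purpose
    "x_resolution": 0.5,
    "description": "Global regular grid.",
}
validated, failed, unmodelled = partition_fields(item)
print(sorted(validated), sorted(failed), sorted(unmodelled))
```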

In [None]:
val_result = PydanticValidator(KIND, submitted).validate()
print(val_result)

In [None]:
val_result.data

In [None]:
display(Markdown(val_result.md))

---
## Stage 4 — `TextSimilarityAnalyzer`

Compares the remaining content fields using transformer embeddings
(`all-MiniLM-L6-v2`), falling back to structural field comparison.

Always excluded automatically: `@*`, `drs*`, `validation*`, link-carrying fields.  
Also excluded here: pydantic-validated fields from Stage 3.
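
One way to picture the exclusion rules is glob matching on short field names. This is a sketch under that assumption: `DEFAULT_PATTERNS`, `comparable_fields`, and the inline `short_name` helper are hypothetical, the `example.org` URIs are made up, and the real analyzer may select fields differently.

```python
from fnmatch import fnmatchcase

# Name patterns from the list above; link-carrying fields are detected by value.
DEFAULT_PATTERNS = ("@*", "drs*", "validation*")


def short_name(key: str) -> str:
    """Last path segment of a URI key (hypothetical equivalent of short())."""
    return key.rstrip("/").rsplit("/", 1)[-1]


def comparable_fields(item: dict, exclude=frozenset()) -> dict:
    """Keep only the fields that should take part in text similarity."""
    kept = {}
    for key, value in item.items():
        name = short_name(key)
        if any(fnmatchcase(name, pattern) for pattern in DEFAULT_PATTERNS):
            continue                  # always-excluded names
        if isinstance(value, dict) and "@id" in value:
            continue                  # link-carrying field
        if key in exclude or name in exclude:
            continue                  # e.g. fields already validated in Stage 3
        kept[key] = value
    return kept


item = {
    "@id": "https://example.org/grid/g1",
    "https://example.org/vocab/validation_key": {"@value": "g1"},
    "https://example.org/vocab/region": {"@id": "https://example.org/region/global"},
    "https://example.org/vocab/n_cells": {"@value": 259200},
    "https://example.org/vocab/description": {"@value": "Global grid."},
}
kept = comparable_fields(item, exclude={"n_cells"})
print([short_name(k) for k in kept])
```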

In [None]:
sim_result = TextSimilarityAnalyzer(
    loader,
    exclude = link_result.link_fields | val_result.validated_fields,
).analyze(submitted)

print(sim_result)

In [None]:
sim_result.data

In [None]:
display(Markdown(sim_result.md))

---
## Stage 5 — `ReportBuilder`

Runs all stages internally and assembles a single Markdown report.
The checklist auto-ticks pydantic-validated fields and leaves the rest for manual review.

`graph_data=` prevents a second network fetch.
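
The checklist idea can be sketched with Markdown task-list syntax. `build_checklist` is a hypothetical helper, not part of `ReportBuilder`; it ticks the fields passed in as validated and leaves the others open for manual review.

```python
def build_checklist(fields, validated):
    """Return a Markdown task list; validated fields are pre-ticked."""
    lines = []
    for name in sorted(fields):
        mark = "x" if name in validated else " "
        lines.append(f"- [{mark}] `{name}`")
    return "\n".join(lines)


checklist = build_checklist(
    fields={"description", "n_cells", "x_resolution"},
    validated={"n_cells", "x_resolution"},
)
print(checklist)
```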

In [None]:
report = ReportBuilder(
    folder_url = FOLDER_URL,
    kind       = KIND,
    item       = submitted,
    graph_data = graph_data,
).build()

In [None]:
display(Markdown(report))

In [None]:
ReportBuilder(FOLDER_URL, KIND, submitted, graph_data=graph_data).write('g107_report.md')

---
## Bonus — direct item comparison: g100 vs g104

Both are `regular-latitude-longitude` grids with 55296 cells, so they are nearly identical.
The cells below show how link and text scores differ for such near-duplicates.

In [None]:
from cmipld.utils.similarity import extract_links

g100 = loader.get('g100')
g104 = loader.get('g104')

links_g100 = extract_links(g100)
links_g104 = extract_links(g104)

union   = links_g100 | links_g104
overlap = len(links_g100 & links_g104) / len(union) * 100 if union else 0.0  # guard: neither item has links
print(f'Link Jaccard overlap: {overlap:.1f}%')
print('Only in g100:', sorted(links_g100 - links_g104))
print('Only in g104:', sorted(links_g104 - links_g100))

In [None]:
from cmipld.utils.similarity import strip_text_fields, analyze_differences

la = LinkAnalyzer(loader)
lf = la.analyze(g100).link_fields

text_g100 = strip_text_fields(g100, exclude=lf)
text_g104 = strip_text_fields(g104, exclude=lf)

print('Fields compared:', [short(k) for k in sorted(text_g100)])

display(Markdown(analyze_differences(text_g100, text_g104, name1='g100', name2='g104')))

---
*All classes live in `cmipld.utils.similarity` and can be used independently or chained.  
Every result object exposes `.data` (dict) and `.md` (Markdown string).*