# Pipeline trial run

We have selected some PDF samples that already contain encoded text to test the complete pipeline workflow, up to and including the structured data extraction.

## MLflow experiment setup

Make sure the MLflow server is running. Start it with this command from the `prompt_enhancing/` directory:

```sh
just serve-tracer
```

In [None]:
import mlflow
import pandas as pd
from archaeo_super_prompt.env import getenv_or_throw

EXP_NAME = "Numerical PDFs first evaluation"
# Single quotes inside the f-string keep this line valid on Python versions before 3.12
mlflow.set_tracking_uri(f"http://{getenv_or_throw('MLFLOW_HOST')}:{getenv_or_throw('MLFLOW_PORT')}")
mlflow.set_experiment(EXP_NAME)
mlflow.dspy.autolog()

pd.set_option('display.max_columns', None)
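
The setup cell above reads the tracking server address from the `MLFLOW_HOST` and `MLFLOW_PORT` environment variables and fails if one is missing. The project's `getenv_or_throw` is not shown here, so the following is only a minimal sketch of how such a fail-fast lookup might behave; `require_env` and `build_tracking_uri` are hypothetical names, not part of `archaeo_super_prompt`.

```python
import os


def require_env(name: str) -> str:
    """Hypothetical stand-in for a fail-fast environment lookup."""
    value = os.environ.get(name)
    if not value:
        # Fail early with an explicit message instead of building a broken URI
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


def build_tracking_uri(host_var: str = "MLFLOW_HOST", port_var: str = "MLFLOW_PORT") -> str:
    """Assemble the HTTP tracking URI from two environment variables."""
    host = require_env(host_var)
    port = require_env(port_var)
    return f"http://{host}:{port}"
```

Checking the variables before calling `mlflow.set_tracking_uri` makes a missing configuration show up as a clear error in the first cell rather than as a connection failure later in the run.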

## Sample selection

In [None]:
from pathlib import Path

from archaeo_super_prompt.dataset.load import MagohDataset

MAX_SAMPLES_FETCHED = 300
SEED = 0.3

ds = MagohDataset(MAX_SAMPLES_FETCHED, SEED, True)
_selected_ids = [
    # very good
    33799, 34439, 38005, 36837, 36937, 37614, 37026, 37971, 36846, 36304, 34423, 36052,
    37043, 36554, 989, 37007, 30897, 36351, 36308, 38013, 36011, 33828, 1221,
    38039, 35429, 37065, 37116, 34452, 33441, 33062, 34939, 35918, 33689, 34508, 31035,
    38220, 38092, 36979, 36854, 36207, 34915, 35688, 36359,
    # not that good
    31164, 32600, 33760, 32714, 31208, 30712,
    ]
selected_ids = set(_selected_ids)
inputs = ds.get_files_for_batch(selected_ids)

## Pipeline run

We use a DataFrame-compatible version of the scikit-learn pipeline (`feature_engine.pipeline.Pipeline`) to chain the modules in this order:

- OCR
- layout text reading + chunking
- structured data extraction

The LLM calls are traced by the MLflow integration and can be viewed through the links displayed by the cell below.

In [None]:
from feature_engine.pipeline import Pipeline
from typing import cast
from archaeo_super_prompt.pdf_to_text import OCR_Transformer, TextExtractor
from archaeo_super_prompt.main_transformer import MagohDataExtractor
import archaeo_super_prompt.visualization.mlflow_logging as mmlflow
import random

pipeline = Pipeline(
    [
        ("ocr", OCR_Transformer()),
        ("pdf_reader", TextExtractor()),
        ("extractor", MagohDataExtractor()),
    ]
)
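# Keep a typed handle on the final step to read its detailed results after scoring
extraction_model = cast(MagohDataExtractor, pipeline.named_steps["extractor"])
# Pick one random sample to pass as the input example when saving the models
input_example = ds.get_files_for_batch([
    random.sample(sorted(selected_ids), 1)[0]
])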

with mlflow.start_run():
    mmlflow.save_models(pipeline, input_example)
    score_value = pipeline.score(inputs, ds)
    score_results = extraction_model.score_results
    mmlflow.save_metric_scores(score_value, score_results)
    mmlflow.save_table_in_artifacts(score_results)
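
The run above relies on the scikit-learn scoring convention: `score` sends the inputs through every intermediate step's `transform`, then calls `score` on the final step. The sketch below imitates that convention with standard-library toy classes only; `Upper`, `ExactMatchScorer` and `MiniPipeline` are hypothetical names, and the way the final step keeps a `score_results` attribute is an assumption about how `MagohDataExtractor` exposes its per-sample results.

```python
class Upper:
    """Toy intermediate step: only transforms its input."""

    def transform(self, X):
        return [x.upper() for x in X]


class ExactMatchScorer:
    """Toy final step: scores inputs against the reference answers."""

    def score(self, X, y):
        # Keep the per-sample details so they can be inspected after scoring
        self.score_results = [(pred, truth, pred == truth) for pred, truth in zip(X, y)]
        hits = sum(match for _, _, match in self.score_results)
        return hits / len(y)


class MiniPipeline:
    """Minimal imitation of the pipeline scoring convention."""

    def __init__(self, steps):
        self.steps = steps
        self.named_steps = dict(steps)

    def score(self, X, y):
        # Every step except the last one transforms the data in order
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        # The last step compares the transformed data with the references
        return self.steps[-1][1].score(X, y)


toy = MiniPipeline([("upper", Upper()), ("scorer", ExactMatchScorer())])
toy_score = toy.score(["a", "b", "c"], ["A", "B", "x"])
toy_details = toy.named_steps["scorer"].score_results
```

Here `toy_score` is the aggregate metric while `toy_details` holds one entry per sample, which mirrors why the notebook logs both `score_value` and `score_results`.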

## Evaluation result inspection

In [None]:
from archaeo_super_prompt.visualization import (
    init_complete_vizualisation_engine,
    run_display_server,
)

init_complete_vizualisation_engine(score_results)

In [None]:
run_display_server()

In [None]:
from pathlib import Path

# Save the evaluation results as a CSV file in the current working directory
score_results.to_csv(str(Path("./results.csv").resolve()))