# Pipeline trial run

We selected a few PDF samples that already contain encoded text to test the complete pipeline workflow, up to and including the structured data extraction.

## MLflow experiment setup

Make sure the MLflow server is running before executing the cells below. Start it with this command from the `prompt_enhancing/` directory:

```sh
just serve-tracer
```
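
The setup cell reads the server address from the `MLFLOW_HOST` and `MLFLOW_PORT` environment variables through `getenv_or_throw`. The project's implementation of that helper is not shown in this notebook, so the following is only a minimal sketch of what such a helper might do. The names `getenv_or_throw_sketch` and `tracking_uri_sketch` are hypothetical.

```python
import os


def getenv_or_throw_sketch(name: str) -> str:
    """Hypothetical stand-in for getenv_or_throw: fail loudly if a variable is unset."""
    value = os.environ.get(name)
    if value is None:
        raise KeyError(f"Environment variable {name} is not set")
    return value


def tracking_uri_sketch() -> str:
    """Build the tracking URI the same way the setup cell does."""
    host = getenv_or_throw_sketch("MLFLOW_HOST")
    port = getenv_or_throw_sketch("MLFLOW_PORT")
    return f"http://{host}:{port}"
```

Failing early on a missing variable avoids pointing the experiment at a malformed URI such as `http://None:None`.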

In [None]:
import mlflow
import pandas as pd
from archaeo_super_prompt.env import getenv_or_throw

EXP_NAME = "NAIVE CHUNK SELECTION"
# Single quotes inside the f-string keep this valid on Python versions before 3.12.
mlflow.set_tracking_uri(f"http://{getenv_or_throw('MLFLOW_HOST')}:{getenv_or_throw('MLFLOW_PORT')}")
mlflow.set_experiment(EXP_NAME)
mlflow.dspy.autolog()

pd.set_option('display.max_columns', None)

## Sample selection

In [None]:
from pathlib import Path

from archaeo_super_prompt.dataset.load import MagohDataset
from archaeo_super_prompt.types.pdfpaths import buildPdfPathDataset

ds = MagohDataset(200, 0.8, True)
selected_ids = {31049, 30913}

selected_files = [
    (31049, Path(".cache/pdfs/31049/Relazione_storica_Pasquinucci.pdf").resolve()),
    (30913, Path(".cache/pdfs/30913/Relazione_assistenza.pdf").resolve()),
]

## Pipeline run

We use a DataFrame-compatible version of the scikit-learn pipelines to chain the modules in this order:

- ocr
- layout text reading + chunking
- structured data extraction

The LLM calls are traced by the MLflow integration and can be inspected through the links displayed by the cell below.
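
The chaining idea can be illustrated with a toy example. This is only a sketch using plain scikit-learn and pandas, not the project's transformers: `UppercaseText`, `SplitWords` and `ChunkCounter` are hypothetical stand-ins for the OCR, chunking and extraction steps, and the `text` and `chunks` columns are assumptions.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class UppercaseText(TransformerMixin, BaseEstimator):
    """Toy first step (hypothetical): rewrites the 'text' column."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        out = X.copy()
        out["text"] = out["text"].str.upper()
        return out


class SplitWords(TransformerMixin, BaseEstimator):
    """Toy second step (hypothetical): adds a 'chunks' column of word lists."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        out = X.copy()
        out["chunks"] = out["text"].str.split()
        return out


class ChunkCounter(BaseEstimator):
    """Toy final step (hypothetical): predicts the chunk count and scores it."""

    def fit(self, X, y=None):
        # A trailing-underscore attribute marks the estimator as fitted.
        self.fitted_ = True
        return self

    def predict(self, X):
        return X["chunks"].str.len()

    def score(self, X, y):
        # Fraction of rows whose predicted count matches the expected one.
        return float((self.predict(X).to_numpy() == pd.Series(y).to_numpy()).mean())


toy_pipeline = Pipeline(
    [
        ("upper", UppercaseText()),
        ("split", SplitWords()),
        ("count", ChunkCounter()),
    ]
)
frame = pd.DataFrame({"id": [1, 2], "text": ["a b c", "d e"]})
expected = [3, 2]
toy_pipeline.fit(frame, expected)
toy_score = toy_pipeline.score(frame, expected)
```

Each intermediate step receives the DataFrame returned by the previous one, and `score` is delegated to the last step after all transforms have run.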

In [None]:
from typing import cast

from feature_engine.pipeline import Pipeline

from archaeo_super_prompt.main_transformer import MagohDataExtractor
from archaeo_super_prompt.pdf_to_text import OCR_Transformer, TextExtractor

pipeline = Pipeline(
    [
        ("ocr", OCR_Transformer),
        ("pdf_reader", TextExtractor),
        ("extractor", MagohDataExtractor()),
    ]
)
inputs = buildPdfPathDataset(selected_files)
with mlflow.start_run():
    score_value = pipeline.score(inputs, ds)
score_results = cast(MagohDataExtractor,
                     pipeline.named_steps["extractor"]).score_results

## Evaluation result inspection
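
The dashboard cells in this section split `score_results` into one table per field by grouping on `field_name` and `evaluation_method`. Here is a minimal sketch of that grouping on a toy frame. The `id` column and the values are made up for illustration, and the real table may contain other columns.

```python
import pandas as pd

# Toy stand-in for score_results (values are illustrative only).
toy_results = pd.DataFrame(
    {
        "field_name": ["a", "a", "b"],
        "evaluation_method": ["exact", "exact", "fuzzy"],
        "id": [1, 2, 1],
        "metric_value": [1.0, 0.0, 0.5],
    }
)
keys = ["field_name", "evaluation_method"]

# Grouping on a list of keys yields (field, method) tuples as group labels.
per_field = {
    field: {"method": method, "table": group.drop(columns=keys)}
    for (field, method), group in toy_results.groupby(keys)
}

# Records in the shape expected by a Dash DataTable's `data` property.
records_for_a = per_field["a"]["table"].to_dict("records")
```

Keying the dictionary on the field name alone works as long as each field has a single evaluation method. If a field had several, later groups would overwrite earlier ones.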

In [None]:
score_results

In [None]:
from dash import Dash, html, callback, Output, Input, dash_table, dcc
import plotly.express as px

In [None]:
app = Dash()

In [None]:
from IPython.display import Markdown, display

field_grouping_keys = ["field_name", "evaluation_method"]

resultsPerField = {fieldName: {"method": evalMethod, "table": resultForField.drop(columns=field_grouping_keys)}
                   for (fieldName, evalMethod), resultForField in
                   score_results.groupby(field_grouping_keys)}
fieldNames = list(resultsPerField.keys())

app.layout = [
    html.H1(children='Results', style={'textAlign': 'center'}),
    html.H2(children='Global results'),
    dcc.Graph(figure=px.histogram(
        score_results, y='field_name', x='metric_value', histfunc='avg')
             ),
    html.H2(children='Per field results'),
    dcc.Dropdown(fieldNames, 'university__Sigla', id='dropdown-selection'),
    html.H3(children="Evaluation method used"),
    html.Blockquote(id='eval-method-description'),
    dash_table.DataTable(id='table-content', page_size=10)
]

@callback(
    Output('eval-method-description', 'children'),
    Input('dropdown-selection', 'value')
)
def updateEvalMethod(fieldName: str):
    # Single quotes inside the f-string keep this valid on Python versions before 3.12.
    return f"Evaluation method used: {resultsPerField[fieldName]['method']}"


@callback(
    Output('table-content', 'data'),
    Input('dropdown-selection', 'value')
)
def updatePerFieldResultTable(fieldName: str):
    return resultsPerField[fieldName]["table"].to_dict('records')

In [None]:
app.run(debug=True)