In [None]:
from feature_engine.pipeline import Pipeline
from archaeo_super_prompt.dataset.load import MagohDataset
from archaeo_super_prompt.pdf_to_text import OCR_Transformer, TextExtractor

After exploring the layout and profile of the downloaded reports, we select a set of intervention identifiers whose reports can be processed, in order to study the chunks and LLM interpretability.

In [None]:
ds = MagohDataset(200, 0.3, True)
_selected_ids = [
    # very good
    33799, 34439, 38005, 36837, 36937, 37614, 37026, 37971, 36846, 36304, 34423, 36052,
    37043, 36055, 36554, 989, 37007, 30897, 36351, 36308, 38013, 36011, 33828, 1221,
    38039, 35429, 37065, 37116, 34452, 33441, 33062, 34939, 35918, 33689, 34508, 31035,
    38220, 38092, 36979, 36854, 36207, 34915, 33439, 35688, 36359,
    # not that good
    31164, 32600, 33760, 32714, 31208, 30712, 
    ]
selected_ids = set(_selected_ids)
inputs = ds.get_files_for_batch(selected_ids)

In [None]:
pipeline = Pipeline([
    ("ocr", OCR_Transformer()),
    ("pdf_reader", TextExtractor())
])

In [None]:
texts = pipeline.transform(inputs)

In [None]:
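# Nest the chunks by intervention id, then by source file name
groupedChunks = {
    id_: {filename: fileChunks for filename, fileChunks in inpt.groupby("filename")}
    for id_, inpt in texts.groupby("id")
}
# Note: 31049 is not listed in _selected_ids above; this lookup assumes it is present in `texts`
groupedChunks[31049]["Relazione_storica_Pasquinucci.pdf"]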

In [None]:
chunks = texts[(texts["id"] == 31049) & (texts["filename"] == "Relazione_storica_Pasquinucci.pdf")]
chunks

## Chunk information

Each chunk provides the following information:

- a simple type (paragraph, list item, table)
- its page number, which gives its approximate position in the document (beginning, middle, end, ...)
- its context text, which includes:
  - the description of the predicted section it belongs to
  - the text rendering of the chunk content
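As a minimal sketch of the structure described above, the record below gathers the three kinds of information and derives a coarse position from the page number. The `ChunkInfo` class, its field names, the `approximate_position` helper and the thirds-based thresholds are all hypothetical and chosen for illustration; they are not the schema used by `PDFChunkPerInterventionDataset`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkInfo:
    """Hypothetical record mirroring the information listed above."""

    chunk_type: str  # "paragraph", "list_item" or "table"
    page: int        # 1-based page number of the chunk
    section: str     # description of the predicted section
    text: str        # text rendering of the chunk content

    def context_text(self) -> str:
        """Join the section description and the chunk content."""
        return f"{self.section}\n{self.text}"


def approximate_position(page: int, page_count: int) -> str:
    """Map a 1-based page number to a coarse position in the document."""
    if not 1 <= page <= page_count:
        raise ValueError("page must lie between 1 and page_count")
    # Fraction of the document located before this page
    relative = (page - 1) / page_count
    if relative < 1 / 3:
        return "beginning"
    if relative < 2 / 3:
        return "middle"
    return "end"


# Invented example values, not taken from the dataset
chunk = ChunkInfo("paragraph", 2, "Historical background", "The site was surveyed.")
position = approximate_position(chunk.page, page_count=10)
```

The thirds are an arbitrary choice; finer buckets or the raw relative position may suit some fields better.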

In [None]:
from IPython.display import Markdown
from archaeo_super_prompt.types.pdfchunks import PDFChunkPerInterventionDataset

SAMPLE_CHUNK_NUMBER = 4
sample_chunks = chunks.sample(SAMPLE_CHUNK_NUMBER)

Markdown(str(PDFChunkPerInterventionDataset(sample_chunks)))

## Exploit the chunk information for the prompts

The contextual text can be compared with the query using an embedding model.  
The other information (content type, position in the document, entity occurrences) can also help select the best chunks for a given extraction query, depending on the nature of the field to extract and on what we know about the excavation reports that make up the dataset.
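The selection strategy described above can be sketched as a metadata filter followed by a similarity ranking. This is only an illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the dictionary keys (`type`, `page`, `context`) and the `select_chunks` helper are hypothetical, and the sample chunks are invented.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_chunks(chunks, query, allowed_types, max_page, top_k=2):
    """Filter chunks on metadata, then rank the survivors against the query."""
    query_vector = embed(query)
    candidates = [
        c for c in chunks
        if c["type"] in allowed_types and c["page"] <= max_page
    ]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(query_vector, embed(c["context"])),
        reverse=True,
    )
    return ranked[:top_k]


# Invented sample chunks (hypothetical keys, not the real dataset schema)
sample = [
    {"type": "paragraph", "page": 1,
     "context": "Introduction. The excavation started in March 2015."},
    {"type": "table", "page": 2,
     "context": "Finds. Inventory of ceramic fragments."},
    {"type": "paragraph", "page": 12,
     "context": "Conclusions. The excavation ended in 2016."},
    {"type": "paragraph", "page": 2,
     "context": "Stratigraphy. Description of the layers."},
]

# A start date is more likely to appear in early paragraphs of a report
best = select_chunks(sample, "excavation start date", {"paragraph"},
                     max_page=5, top_k=1)
```

With this toy similarity, the chunk on page 12 would outrank the introduction if the page filter were dropped, which illustrates why combining the metadata with the embedding score can help.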