In [None]:
from archaeo_super_prompt.dataset.load import MagohDataset

In [None]:
ds = MagohDataset(200, 0.8, True)

In [None]:
from try_dataload import pipeline

After exploring the layout and profile of the downloaded reports, we select a few intervention identifiers whose reports can be processed, in order to study the chunks and the LLM interpretability.

In [None]:
from pathlib import Path
from archaeo_super_prompt.types.pdfpaths import buildPdfPathDataset
selected_ids = {31049, 30913}
inputs = buildPdfPathDataset([
    (31049, Path(".cache/pdfs/31049/Relazione_storica_Pasquinucci.pdf").resolve()),
    (30913, Path(".cache/pdfs/30913/Relazione_assistenza.pdf").resolve()),
])

In [None]:
texts = pipeline.fit_transform(inputs)

In [None]:
# Nested mapping: intervention id -> filename -> DataFrame of that file's chunks
groupedChunks = {
    id_: {filename: file_chunks for filename, file_chunks in by_id.groupby("filename")}
    for id_, by_id in texts.groupby("id")
}
groupedChunks[31049]["Relazione_storica_Pasquinucci.pdf"]

In [None]:
chunks = texts[(texts["id"] == 31049) & (texts["filename"] == "Relazione_storica_Pasquinucci.pdf")]
chunks

## Chunk information

Each chunk carries the following information:

- a simple type (paragraph, list item, table)
- its page number, which gives its approximate position in the document (beginning, middle, end, ...)
- its context text, including:
  - the description of the predicted section it belongs to
  - the text rendering of the chunk content
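
As an illustration of how a page number can be turned into an approximate position, here is a minimal sketch. The helper `position_bucket` is hypothetical and not part of `archaeo_super_prompt`. It assumes that page numbers are 1-based and that the total page count of the document is known; splitting the document into thirds is an arbitrary choice.

```python
def position_bucket(page: int, total_pages: int) -> str:
    """Map a page number to a coarse position label.

    Hypothetical helper: assumes 1-based page numbers and a known page count.
    """
    if total_pages < 1 or not 1 <= page <= total_pages:
        raise ValueError("page must lie in 1..total_pages")
    # Relative position of the middle of the page, strictly between 0 and 1.
    relative = (page - 0.5) / total_pages
    if relative < 1 / 3:
        return "beginning"
    if relative < 2 / 3:
        return "middle"
    return "end"


# Labels for the first, a central, and the last page of a 12-page report.
labels = [position_bucket(p, 12) for p in (1, 6, 12)]
```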

In [None]:
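SAMPLE_CHUNK_NUMBER = 4
# DataFrame.sample raises if asked for more rows than available, so cap the size.
sample_chunks = chunks.sample(min(SAMPLE_CHUNK_NUMBER, len(chunks)))

# Human-readable labels for the chunk types
TAG_TO_STRING = {
    "para": "Paragraph",
    "list_item": "List item",
    "table": "Table",
    "header": "Header",
}
DEFAULT_ITEM = "Unknown pdf item"


for _, chunk in sample_chunks.iterrows():
    # Single quotes inside the f-string keep it valid on Python versions before 3.12.
    print(f"Page {chunk['chunk_page_position']} ({TAG_TO_STRING.get(chunk['chunk_type'], DEFAULT_ITEM)})")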
    print("-"*60)
    print(chunk["chunk_content"])
    print("-"*60 + "\n")

## Exploit the chunk information for the prompts

The context text of a chunk can be compared with the query using an embedding model.  
The other information (content type, position in the document, entity occurrences) can also help select the best chunks for a given extraction query, depending on the nature of the field to extract and on what we know about the excavation reports that make up the dataset.
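
The selection described above could look like the following sketch, which first filters on the chunk metadata and then ranks the remaining chunks by similarity with the query. Everything here is an assumption made for illustration: the `Chunk` class, the `rank_chunks` function and its parameters are hypothetical, the sample texts are made up, and `embed` is a toy word-count vector standing in for a real embedding model.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    # Hypothetical container mirroring the chunk information listed earlier.
    chunk_type: str  # e.g. "para", "list_item", "table", "header"
    page: int
    content: str


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: lowercase word counts.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_chunks(chunks, query, allowed_types=None, max_page=None, top_k=3):
    # 1. Metadata filter: keep only the chunk types and pages of interest.
    candidates = [
        c for c in chunks
        if (allowed_types is None or c.chunk_type in allowed_types)
        and (max_page is None or c.page <= max_page)
    ]
    # 2. Similarity ranking: score each remaining chunk against the query.
    query_vector = embed(query)
    scored = [(cosine(embed(c.content), query_vector), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up chunks, for illustration only.
toy_chunks = [
    Chunk("header", 1, "Relazione di assistenza archeologica"),
    Chunk("para", 1, "Lo scavo è stato eseguito nel mese di marzo"),
    Chunk("para", 7, "Elenco dei materiali ceramici rinvenuti"),
    Chunk("table", 8, "US 12 ceramica frammenti"),
]

# A date-like field is assumed to appear in a paragraph near the beginning.
best = rank_chunks(toy_chunks, "data dello scavo", allowed_types={"para"}, max_page=2)
for score, chunk in best:
    print(f"{score:.2f}  page {chunk.page}  {chunk.content}")
```

Filtering before scoring keeps the number of embedding calls low, which matters once `embed` is replaced by a real model. The filter values (paragraphs only, first two pages) are only one possible heuristic for one kind of field.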