# Making the Most of Markdown within Argilla TextFields

## Introduction

As you may have noticed, Argilla supports Markdown within its text fields. This lets you format your text, for example with **bold** or *italic* text, or with [links](https://www.google.com). Because Markdown fields also render HTML, you can embed images, videos, and even iframes, which makes them a powerful tool to have at your disposal.

In this notebook, we go over how to use Markdown and HTML within Argilla, covering the following topics:

- multi-modality
  - image
  - video
  - audio
- tables
- exploiting displacy
  - NER
  - dependencies

## Installing Dependencies

We will be working with built-in Python libraries and the `argilla` library. Additionally, we will use an unstructured document processor with an externally callable public API (to avoid overhead). This tool is called [IBM Deep Search](https://github.com/DS4SD/deepsearch-toolkit); for a fully open-source alternative, I recommend taking a look at [Unstructured](https://unstructured.io). To install the dependencies used in this notebook, including `deepsearch-toolkit`, run the following commands:

In [None]:
!pip install argilla==1.17 
!pip install datasets
!pip install spacy spacy-transformers
!pip install Pillow
!pip install span_marker
!pip install soundfile librosa
!pip install deepsearch-toolkit
!python -m spacy download en_core_web_sm

### Setup Argilla

First, you will need to deploy an Argilla instance, or use your own if one is already deployed. The easiest and most straightforward way to do so is via Hugging Face Spaces at https://huggingface.co/new-space?template=argilla/argilla-template-space, or via the Docker Quickstart image (installation notes can be found at https://docs.argilla.io/en/latest/getting_started/installation/deployments/docker.html).

```python
import argilla as rg
rg.init(api_key="YOUR_API_KEY", api_url="YOUR_SPACE_ID")
```
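To keep credentials out of the notebook, the two values passed to `rg.init` can be read from the environment instead. The sketch below is only an illustration: the helper name `argilla_credentials` is hypothetical, and the variable names `ARGILLA_API_URL` and `ARGILLA_API_KEY` are assumed here as a naming convention rather than taken from this tutorial.

```python
import os
from typing import Dict, Mapping, Optional


def argilla_credentials(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect the keyword arguments for `rg.init` from environment variables.

    Hypothetical helper: the variable names below are an assumed convention.
    """
    env = os.environ if env is None else env
    required = ("ARGILLA_API_URL", "ARGILLA_API_KEY")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {"api_url": env["ARGILLA_API_URL"], "api_key": env["ARGILLA_API_KEY"]}
```

With both variables set, the connection could then be opened with `rg.init(**argilla_credentials())`.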

In [209]:
import base64
import glob
import json
import math
import re
from pathlib import Path
from subprocess import CalledProcessError, check_call
from typing import List, Optional
from zipfile import ZipFile

import argilla as rg
import deepsearch
import pandas as pd
import spacy
import span_marker
from datasets import load_dataset
from spacy import displacy

## Get Coding

### Exploiting `displacy`

displaCy is spaCy's built-in visualizer for the output of its NLP pipelines. We will use it to visualize the output of the dependency parser and the NER model. To do so, we call the `displacy.render` function, which takes a processed document and produces an HTML string that can be rendered within Argilla. Inside a notebook, pass `jupyter=False` to get the string back instead of displaying it directly.

Displacy can render dependencies, named entities, and even relationships. We will be using the first two, but if you want to learn more about displacy, you can check out the [documentation](https://spacy.io/usage/visualizers).

In [210]:
nlp = spacy.load(
    "en_core_web_sm", 
    exclude=["ner"]
)
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-bert-tiny-fewnerd-coarse-super"})

<span_marker.spacy_integration.SpacySpanMarkerWrapper at 0x2b35a0130>

In [211]:
doc = nlp("Rats are various medium-sized, long-tailed rodents.")
x = displacy.render(doc, style="dep")

In [212]:
text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."
doc2 = nlp(text)
displacy.render(doc2, style="ent")

We can now create an Argilla `FeedbackDataset` to render the displacy output. We configure it with three fields: the plain text, the dependency parse, and the named entities. Additionally, we add a label question to indicate whether the text is relevant, and a multi-label question to flag incorrect dependency parsing or named entities. Finally, we add two text questions in which the user can correct the dependencies and the entities.

In [213]:
try:
    ds = rg.FeedbackDataset(
        fields=[
            rg.TextField(name="text", use_markdown=True),
            rg.TextField(name="dep", use_markdown=True),
            rg.TextField(name="ent", use_markdown=True)
        ],
        questions=[
            rg.LabelQuestion(name="relevant", labels=["yes", "no"]),
            rg.MultiLabelQuestion(name="question-multi", labels=["flag-pos", "flag-ner"]),
            rg.TextQuestion(name="dep-correction", use_markdown=True),
            rg.TextQuestion(name="ner-correction", use_markdown=True)
        ]
    )
    ds = ds.push_to_argilla("exploiting-displacy")
except Exception as e:
    ds = rg.FeedbackDataset.from_argilla("exploiting-displacy")

Now, we will load the [few-nerd dataset from Hugging Face](https://huggingface.co/datasets/DFKI-SLT/few-nerd). We only load the first ten sentences of the training split, each of which comes as a list of tokens. We will use these sentences to show how to use displacy within Argilla.

In [214]:
dataset_fewnerd = load_dataset("DFKI-SLT/few-nerd", "supervised", split="train[:10]")

Next, we will use this dataset to populate our Argilla `FeedbackDataset`. We use the `displacy.render` function to render the dependency parse and the named entities as HTML, and add both to each record together with the plain text. Finally, we add Markdown-formatted tables as suggestions, which give annotators a basic starting point for correcting the NER and dependency annotations.

In [215]:
texts = [" ".join(x["tokens"]) for x in dataset_fewnerd]
docs = nlp.pipe(texts)

def wrap_in_max_width(html):
    html = html.replace("max-width: none;", "")
    # Remove the fixed width and height attributes (e.g. width="750") set by displacy
    html = re.sub(r"width=\"\d+\"", "", html)
    html = re.sub(r"height=\"\d+\"", "", html)
    
    # Find the SVG element in the HTML output
    svg_start = html.find("<svg")
    svg_end = html.find("</svg>") + len("</svg>")
    svg = html[svg_start:svg_end]

    # Set the width and height attributes of the SVG element to 100%
    svg = svg.replace("<svg", "<svg width='100%' height='100%'")

    # Wrap the SVG element in a div with max-width and horizontal scrolling
    return f"<div style='max-width: 100%; overflow-x: auto;'>{svg}</div>"

records = []
for doc in docs:
    record = rg.FeedbackRecord(
        fields={
            "text": doc.text, 
            "dep": wrap_in_max_width(displacy.render(doc, style="dep", jupyter=False)), 
            "ent": displacy.render(doc, style="ent", jupyter=False)
        },
        suggestions=[{
                "question_name": "dep-correction", 
                "value": pd.DataFrame([{"Label": token.dep_, "Text": token.text} for token in doc]).to_markdown(index=False)

            },
            {
                "question_name": "ner-correction", 
                "value": pd.DataFrame([{"Label": ent.label_, "Text": ent.text} for ent in doc.ents]).to_markdown(index=False),
            }
        ]
    )
    records.append(record)
ds.add_records(records)

Pushing records to Argilla...: 100%|██████████| 1/1 [00:00<00:00, 10.74it/s]
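To see what `wrap_in_max_width` is meant to achieve, here is a minimal, dependency-free sketch that runs on a hand-written SVG string. The function name `make_svg_responsive` and the sample markup are made up for illustration; this variant drops the fixed `width` and `height` attributes of the opening `<svg>` tag, sets both to 100%, and wraps the element in a horizontally scrollable `div`.

```python
import re


def make_svg_responsive(html: str) -> str:
    """Return the first <svg> element of `html`, scaled to 100% inside a scrollable div."""
    start = html.find("<svg")
    end = html.find("</svg>")
    if start == -1 or end == -1:
        return html  # no SVG found, nothing to wrap
    svg = html[start : end + len("</svg>")]

    # Split off the opening tag so that only its own attributes are changed
    tag_end = svg.find(">")
    opening, rest = svg[:tag_end], svg[tag_end:]

    # Drop fixed dimensions; the leading whitespace keeps e.g. stroke-width intact
    opening = re.sub(r'\s(?:width|height)="[^"]*"', "", opening)
    opening = opening.replace("<svg", '<svg width="100%" height="100%"', 1)
    return f'<div style="max-width: 100%; overflow-x: auto;">{opening}{rest}</div>'


# Hypothetical sample markup, not actual displacy output
sample = '<figure><svg id="demo" width="750" height="224.5"><text>hi</text></svg></figure>'
print(make_svg_responsive(sample))
```

Restricting the substitution to the opening tag avoids touching attributes of nested elements.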


### Multi-Modality: audio, image and video

Yes, Argilla can work with images, video, and audio when they are formatted as HTML and used within a Markdown field. Besides linking to publicly available sources, we can also use something called DataURLs.

#### What are DataURLs?

A DataURL is a way to encode binary data into a string, which can then be used to embed the data into a webpage. This is a very useful tool, as it allows us to embed images, videos, and audio files directly into html, without having to worry about hosting them externally. This is done by prepending the data with a header, which specifies the type of data being encoded, and the encoding used. We will define three different functions, one for each modality, which will take a file path as input, and return a DataURL as output.

In [216]:
def get_file_type(path):
    return Path(path).suffix[1:]

def video_to_dataurl(path, file_type: str = None):
    # Open the video file and read its contents
    with open(path, 'rb') as f:
        video_data = f.read()

    # Encode the video data as base64
    video_base64 = base64.b64encode(video_data).decode('utf-8')

    # Get the file type (e.g. mp4)
    file_type = file_type or get_file_type(path)
    
    # Prepend the Data URL prefix to the base64-encoded data
    data_url = f'data:video/{file_type};base64,' + video_base64

    # Create HTML
    html = f"<video controls><source src='{data_url}' type='video/{file_type}'></video>"
    return html
    
def audio_to_dataurl(path, file_type: str = None):
    # Open the audio file and read its contents
    with open(path, 'rb') as f:
        audio_data = f.read()
    
    # Encode the audio data as base64
    audio_base64 = base64.b64encode(audio_data).decode('utf-8')
    
    # Get the file type (e.g. mp3)
    file_type = file_type or get_file_type(path)
    
    # Prepend the Data URL prefix to the base64-encoded data
    data_url = f'data:audio/{file_type};base64,' + audio_base64
    
    # Create HTML
    html = f"<audio controls autoplay><source src='{data_url}' type='audio/{file_type}'></audio>"
    return html

def image_to_dataurl(path, file_type: str = None):
    # open the image file and read its contents
    with open(path, 'rb') as f:
        image_data = f.read()
        
    # Encode the image data as base64
    image_base64 = base64.b64encode(image_data).decode('utf-8')
    
    # Get the file type (e.g. png)
    file_type = file_type or get_file_type(path)
    
    # Prepend the Data URL prefix to the base64-encoded data
    data_url = f'data:image/{file_type};base64,' + image_base64
    
    # Create HTML
    html = f'<img src="{data_url}">'
    return html
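The header described above has the shape `data:<mime type>;base64,<payload>`. As a minimal sketch of that layout, the two hypothetical helpers below (`bytes_to_dataurl` and `guess_mime_type`, neither of which is part of Argilla) build a DataURL from raw bytes and look up the MIME type with the standard-library `mimetypes` module instead of deriving it from the file suffix.

```python
import base64
import mimetypes


def bytes_to_dataurl(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a `data:` URL with a base64 payload."""
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{payload}"


def guess_mime_type(path: str, default: str = "application/octet-stream") -> str:
    """Look up the MIME type from the file extension, e.g. `.png` -> `image/png`."""
    mime_type, _ = mimetypes.guess_type(path)
    return mime_type or default


print(bytes_to_dataurl(b"hello", "text/plain"))  # data:text/plain;base64,aGVsbG8=
print(guess_mime_type("figure.png"))  # image/png
```

Looking up the type can matter when the suffix and the MIME subtype differ, which is one reason the functions above accept an explicit `file_type` argument.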

Next, we will define our `FeedbackDataset` and add the DataURLs to it. We will also add a question to ask the user to describe the image, video, or audio file.

In [217]:
try:
    ds = rg.FeedbackDataset(
        fields=[rg.TextField(name="content", use_markdown=True)],
        questions=[rg.TextQuestion(name="describe")],
    )
    ds = ds.push_to_argilla("multi-modal")
    
except:
    ds = rg.FeedbackDataset.from_argilla("multi-modal")


We then create one record per file by calling the matching DataURL function, and pass the resulting records to the `add_records` method.

In [218]:

records = [
    rg.FeedbackRecord(fields={"content": audio_to_dataurl("data/making-most-of-markdown/heath_ledger.mp3")}),
    rg.FeedbackRecord(fields={"content": audio_to_dataurl("data/making-most-of-markdown/heath_ledger_2.mp3")}),
    rg.FeedbackRecord(fields={"content": image_to_dataurl("data/making-most-of-markdown/deepsearch.png")}),
    rg.FeedbackRecord(fields={"content": video_to_dataurl("data/making-most-of-markdown/snapshot.mp4")})
]
ds.add_records(records)

Pushing records to Argilla...: 100%|██████████| 1/1 [00:00<00:00,  2.56it/s]
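When a folder mixes several modalities, picking the right conversion function by hand gets tedious. The sketch below shows one way to decide the modality from the file suffix. The mapping `MEDIA_KIND_BY_SUFFIX` and the function `media_kind` are hypothetical names, and the list of suffixes is an assumption that you would adapt to your own data.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to modality; extend it as needed
MEDIA_KIND_BY_SUFFIX = {
    ".mp3": "audio",
    ".wav": "audio",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".mp4": "video",
    ".webm": "video",
}


def media_kind(path: str) -> str:
    """Return "audio", "image" or "video" based on the (case-insensitive) suffix."""
    suffix = Path(path).suffix.lower()
    try:
        return MEDIA_KIND_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None


print(media_kind("data/making-most-of-markdown/snapshot.mp4"))  # video
```

The returned kind can then be used to select `audio_to_dataurl`, `image_to_dataurl`, or `video_to_dataurl` from a small dictionary of functions.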


#### Let's create `FeedbackDataset`s

##### Audio Classification

For this example audio classification dataset, we will be using the [ccmusic-database/bel_folk](https://huggingface.co/datasets/ccmusic-database/bel_folk) dataset from Hugging Face. This dataset contains 1 minute audio clips of Chinese folk music, and the genre of the music. We will be using this dataset to create a dataset for audio classification.

In [219]:
my_audio_dataset = load_dataset("ccmusic-database/bel_folk")
my_audio_dataset = my_audio_dataset.shuffle()
my_audio_dataset["train"][0], my_audio_dataset["train"].features

({'audio': {'path': '/Users/davidberenstein/.cache/huggingface/datasets/downloads/extracted/16960b86c4c6c0594ad82b11290ee5a7bb8f111068e9c6032444a2005fe715ad/dataset/audio/Bel_f (19).wav',
   'array': array([0., 0., 0., ..., 0., 0., 0.]),
   'sampling_rate': 44100},
  'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=476x349>,
  'label': 1,
  'gender': 0,
  'singing_method': 1},
 {'audio': Audio(sampling_rate=44100, mono=True, decode=True, id=None),
  'image': Image(decode=True, id=None),
  'label': ClassLabel(names=['m_bel', 'f_bel', 'm_folk', 'f_folk'], id=None),
  'gender': ClassLabel(names=['female', 'male'], id=None),
  'singing_method': ClassLabel(names=['Folk Singing', 'Bel Canto'], id=None)})

We will now define the `label`, `gender`, and `singing_method` columns as `LabelQuestion`s and infer the label sets from the `features` attribute of the dataset.

In [220]:
label_general = my_audio_dataset["train"].features["label"].names
label_gender = my_audio_dataset["train"].features["gender"].names
label_singing_method = my_audio_dataset["train"].features["singing_method"].names
rg_ds_audio = rg.FeedbackDataset(
    fields=[rg.TextField(name="audio", use_markdown=True)],
    questions=[
        rg.LabelQuestion(name="general", labels=label_general),
        rg.LabelQuestion(name="gender", labels=label_gender),
        rg.LabelQuestion(name="singing_method", labels=label_singing_method)
    ]
)
try:
    rg_ds_audio = rg.FeedbackDataset.from_argilla("audio")
except:
    rg_ds_audio = rg_ds_audio.push_to_argilla("audio")
rg_ds_audio

RemoteFeedbackDataset(
   id=8005e3b0-c932-462a-8f49-d0acb0044e2e
   name=audio
   workspace=Workspace(id=268dbd87-2970-46cd-a84a-e3785b7178b7, name=argilla, inserted_at=2023-08-14 13:02:15, updated_at=2023-08-14 13:02:15)
   url=http://localhost:6900/dataset/8005e3b0-c932-462a-8f49-d0acb0044e2e/annotation-mode
   fields=[RemoteTextField(id=UUID('16121a5f-a9b3-4a59-92cb-eebdbcbeb957'), client=None, name='audio', title='Audio', required=True, type='text', use_markdown=True)]
   questions=[RemoteLabelQuestion(id=UUID('cf389c89-cf1e-416a-b7da-5f59838cf1d3'), client=None, name='general', title='General', description=None, required=True, type='label_selection', labels=['m_bel', 'f_bel', 'm_folk', 'f_folk'], visible_labels=None), RemoteLabelQuestion(id=UUID('08edc7e7-b8c2-41b3-8526-b9ae04fa15f5'), client=None, name='gender', title='Gender', description=None, required=True, type='label_selection', labels=['female', 'male'], visible_labels=None), RemoteLabelQuestion(id=UUID('54e321b7-7a6c-4df7

Next, we will create the records and embed the audio as DataURLs by using the base64-based `audio_to_dataurl` function. We will also add suggestions for each of the label questions, taken from the labels that ship with the dataset.

In [221]:
records = []
my_audio_dataset_slice = my_audio_dataset["train"].select(range(20))
for entry in my_audio_dataset_slice:
    record = rg.FeedbackRecord(
        fields={"audio": audio_to_dataurl(entry["audio"]["path"])},
        suggestions=[
            {"question_name": "general", "value": label_general[entry["label"]]},
            {"question_name": "gender", "value": label_gender[entry["gender"]]},
            {"question_name": "singing_method", "value": label_singing_method[entry["singing_method"]]}
        ]
    )
    try:
        rg_ds_audio.add_records(record, show_progress=False)
    except Exception as e:
        print(e)
        pass

Argilla server returned an error with http status: 500
Error details: [{'code': 'argilla.api.errors::GenericServerError', 'params': {'type': 'opensearchpy.exceptions.TransportError', 'message': 'TransportError(429, \'{"error":{"root_cause":[{"type":"es_rejected_execution_exception","reason":"rejected execution of coordinating operation [coordinating_and_primary_bytes=0, replica_bytes=0, all_bytes=0, coordinating_operation_bytes=58731109, max_coordinating_and_primary_bytes=53687091]"}],"type":"es_rejected_execution_exception","reason":"rejected execution of coordinating operation [coordinating_and_primary_bytes=0, replica_bytes=0, all_bytes=0, coordinating_operation_bytes=58731109, max_coordinating_and_primary_bytes=53687091]"},"status":429}\')'}}]
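The error above reports an operation of 58,731,109 bytes against a limit of 53,687,091 bytes, which suggests that the embedded audio was too large for the search backend. Base64 turns every 3 input bytes into 4 output characters, so a DataURL is roughly a third larger than the file it encodes. The sketch below estimates the encoded size up front. Both function names are hypothetical, and the default limit is simply copied from the error message, so it will differ between deployments.

```python
def base64_size(n_bytes: int) -> int:
    """Number of characters produced by base64-encoding `n_bytes` bytes (with padding)."""
    return 4 * ((n_bytes + 2) // 3)


def fits_in_request(n_bytes: int, limit: int = 53_687_091) -> bool:
    """Check whether the base64 payload alone stays within `limit` bytes.

    The default limit is taken from the error message above (an assumption
    about this particular deployment, not a general Argilla constant).
    """
    return base64_size(n_bytes) <= limit


print(base64_size(5))  # 8
print(fits_in_request(45_000_000))  # False
```

Files that do not fit could be shortened, compressed, or hosted externally and referenced by URL instead.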


##### Image Classification 

In this example, we will be using the [zishuod/pokemon-icons](https://huggingface.co/datasets/zishuod/pokemon-icons) dataset from Hugging Face. This dataset contains images of Pokemon that need to be classified. We will use it to create a dataset for image classification, but feel free to use any of the other image classification datasets listed below.

Some examples are:
- "zishuod/pokemon-icons"
- "keremberke/shoe-classification"
- "sampath017/plants"
- "sin3142/memes-500"
- "adamkatav/mtg_subsample"
- "sxdave/emotion_detection"

In [222]:
my_image_dataset = load_dataset("zishuod/pokemon-icons")
my_image_dataset = my_image_dataset.shuffle()
my_image_dataset["train"][0], my_image_dataset["train"].features

Resolving data files:   0%|          | 0/427 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/165 [00:00<?, ?it/s]

({'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=72x72>,
  'label': 11},
 {'image': Image(decode=True, id=None),
  'label': ClassLabel(names=['aegislash', 'amoonguss', 'arcanine', 'azumarill', 'blastoise', 'blaziken', 'blissey', 'calyrex', 'celesteela', 'centiskorch', 'chansey', 'charizard', 'cinderace', 'clefable', 'clefairy', 'coalossal', 'comfey', 'cresselia', 'darmanitan', 'dialga', 'diggersby', 'dracovish', 'dracozolt', 'dragapult', 'drifblim', 'durant', 'dusclops', 'eternatus', 'excadrill', 'ferrothorn', 'garchomp', 'gastrodon', 'gigalith', 'glastrier', 'gothitelle', 'grimmsnarl', 'groudon', 'gyarados', 'heatran', 'hippowdon', 'hooh', 'hydreigon', 'incineroar', 'indeedee', 'kartana', 'kyogre', 'landorus', 'lapras', 'ludicolo', 'mamoswine', 'metagross', 'mewtwo', 'mienshao', 'milotic', 'mimikyu', 'moltres', 'necrozma', 'nihilego', 'ninetales', 'pheromosa', 'pikachu', 'porygon2', 'primarina', 'quagsire', 'raichu', 'raikou', 'regieleki', 'regigigas', 'registeel', 'rhy

We will now define the `label` column as a `LabelQuestion` and infer the label set from the `features` attribute of the dataset. Also, we will use a basic `TextField` to add the image to the `FeedbackDataset`.

In [223]:
label_image = my_image_dataset["train"].features["label"].names
rg_ds_image = rg.FeedbackDataset(
    fields=[rg.TextField(name="image", use_markdown=True)],
    questions=[rg.LabelQuestion(name="label", labels=label_image)]
)
try:
    rg_ds_image = rg.FeedbackDataset.from_argilla("image")
except:
    rg_ds_image = rg_ds_image.push_to_argilla("image")
rg_ds_image



RemoteFeedbackDataset(
   id=c29c5a97-eb64-49d9-a57e-ec0a5f6e6038
   name=image
   workspace=Workspace(id=268dbd87-2970-46cd-a84a-e3785b7178b7, name=argilla, inserted_at=2023-08-14 13:02:15, updated_at=2023-08-14 13:02:15)
   url=http://localhost:6900/dataset/c29c5a97-eb64-49d9-a57e-ec0a5f6e6038/annotation-mode
   fields=[RemoteTextField(id=UUID('80f85ff4-052a-4e11-aec0-93e63857fcc2'), client=None, name='image', title='Image', required=True, type='text', use_markdown=True)]
   questions=[RemoteLabelQuestion(id=UUID('81d57bd8-f69b-48fa-83cd-90fb805acd84'), client=None, name='label', title='Label', description=None, required=True, type='label_selection', labels=['aegislash', 'amoonguss', 'arcanine', 'azumarill', 'blastoise', 'blaziken', 'blissey', 'calyrex', 'celesteela', 'centiskorch', 'chansey', 'charizard', 'cinderace', 'clefable', 'clefairy', 'coalossal', 'comfey', 'cresselia', 'darmanitan', 'dialga', 'diggersby', 'dracovish', 'dracozolt', 'dragapult', 'drifblim', 'durant', 'dusclops

Now, we will create the records and embed the images as DataURLs by using the base64-based `image_to_dataurl` function. Each image is first saved to a temporary PNG file. We will also add a suggestion for the `label` question, taken from the label that ships with the dataset.

In [224]:
temp_img_path = "data/making-most-of-markdown/temp_img.png"
records = []
for entry in my_image_dataset["train"]:
    entry["image"].save(temp_img_path, format="png")
    record = rg.FeedbackRecord(
        fields={"image": image_to_dataurl(temp_img_path, file_type="png")},
        suggestions=[{"question_name": "label", "value": label_image[entry["label"]]}]
    )
    try:
        rg_ds_image.add_records(record, show_progress=False)
    except Exception as e:
        print(e)
        pass

### Argilla Markdown for document processing

As mentioned above, we will be using IBM Deep Search for unstructured document processing.

#### Configure Deep Search

##### Signup

Go to https://deepsearch-experience.res.ibm.com/ and sign up for an account using the Google OAuth integration. Afterwards, you can use the command below to configure a profile for the `deepsearch` command-line tool, which was installed together with `deepsearch-toolkit`.

![authenticate](data/making-most-of-markdown/deepsearch.png)


```bash
deepsearch profile config --profile-name "ds-experience" --host "https://deepsearch-experience.res.ibm.com/" --verify-ssl --username "<your-email>"
```

When prompted in the terminal, enter your API key.

##### Save settings

In [225]:
OUTPUT_DIR = Path("data/making-most-of-markdown/pdf/processed")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)  # also create missing parent folders
PROJECT_KEY = "1234567890abcdefghijklmnopqrstvwyz123456" # always fixed
INPUT_FILE = Path("data/making-most-of-markdown/pdf")
INPUT_FILE

PosixPath('data/making-most-of-markdown/pdf')

In [226]:
api = deepsearch.CpsApi.from_env()
api

<deepsearch.cps.client.api.CpsApi at 0x28082b2e0>

#### Process Deep Search documents

Deep Search exposes a public API, so the documents are processed remotely and no local processing pipeline is needed. We use the `convert_documents` function to convert and process the documents; it takes a source path and returns the conversion results. We then call `download_all` on those results to download all converted documents to `OUTPUT_DIR`.

In [227]:
documents = deepsearch.convert_documents(
    api=api, proj_key=PROJECT_KEY, source_path=INPUT_FILE, progress_bar=True
)
documents.download_all(result_dir=OUTPUT_DIR, progress_bar=True)

Processing input:     : 100%|██████████████████████████████| 2/2 [00:00<00:00, 202.90it/s]
Submitting input:     : 100%|██████████████████████████████| 1/1 [00:04<00:00,  4.86s/it]
Converting input:     : 100%|██████████████████████████████| 1/1 [00:29<00:00, 29.49s/it]
Downloading result:   : 100%|██████████████████████████████| 2/2 [00:02<00:00,  1.16s/it]


#### Creating QnA datasets

We will be using the [PDF examples from IBM Deep Search](https://github.com/DS4SD/deepsearch-examples/tree/main/data/samples) to showcase how we might extract tables and figures from PDF files and use them to create a multi-modal QnA dataset for LLM fine-tuning. First, we create a `FeedbackDataset` with a `content` field that allows for Markdown. Additionally, we add two `TextQuestion`s that correspond to the `question` and the `answer` delivered by the LLM.

In [228]:
rg_llm_qna_ds = rg.FeedbackDataset(
    fields=[rg.TextField(name="content", use_markdown=True)],
    questions=[
        rg.TextQuestion(name="question"),
        rg.TextQuestion(name="answer")
    ]
)
try:
    rg_llm_qna_ds = rg.FeedbackDataset.from_argilla("llm-qna")
except:
    rg_llm_qna_ds = rg_llm_qna_ds.push_to_argilla("llm-qna")
rg_llm_qna_ds

RemoteFeedbackDataset(
   id=9988cd27-42fe-406a-a9de-91e36f50654a
   name=llm-qna
   workspace=Workspace(id=268dbd87-2970-46cd-a84a-e3785b7178b7, name=argilla, inserted_at=2023-08-14 13:02:15, updated_at=2023-08-14 13:02:15)
   url=http://localhost:6900/dataset/9988cd27-42fe-406a-a9de-91e36f50654a/annotation-mode
   fields=[RemoteTextField(id=UUID('2ccc8978-7c55-496f-9ff9-523af401052e'), client=None, name='content', title='Content', required=True, type='text', use_markdown=True)]
   questions=[RemoteTextQuestion(id=UUID('640f2234-a80d-49dc-9575-8fcc7cbe9890'), client=None, name='question', title='Question', description=None, required=True, type='text', use_markdown=False), RemoteTextQuestion(id=UUID('13dd145f-2ab6-4528-967f-16c49e43f366'), client=None, name='answer', title='Answer', description=None, required=True, type='text', use_markdown=False)]
   guidelines=None)

##### Extracting tables

Deep Search allows for [extracting tables from PDF-files](https://github.com/DS4SD/deepsearch-examples/blob/main/examples/document_conversion_extract_tables/extract_tables.ipynb). We will be using this to extract tables from the PDF-files and add them to the `FeedbackDataset`. We will be using the converted documents to do this. First, we will define a function that loads the converted documents and extracts the tables from them. Next, we will add the tables to the `FeedbackDataset` by using the `add_records`-method.

In [229]:
def get_tablecell_span(cell, ix):
    span = set([s[ix] for s in cell['spans']])
    if len(span) == 0:
        return 1, None, None
    return len(span), min(span), max(span)


def write_table(item):
    """
    Convert the JSON table representation to HTML, including column and row spans.
    
    Parameters
    ----------
    item :
        JSON table
    """
    
    table = item
    body = ""

    nrows = table['#-rows']
    ncols = table['#-cols']

    body += "<table>\n"
    for i in range(nrows):
        body += "  <tr>\n"
        for j in range(ncols):
            cell = table['data'][i][j]

            rowspan,rowstart,rowend = get_tablecell_span(cell, 0)
            colspan,colstart,colend = get_tablecell_span(cell, 1)

            if rowstart is not None and rowstart != i: continue
            if colstart is not None and colstart != j: continue

            if rowstart is None:
                rowstart = i
            if colstart is None:
                colstart = j

            content = cell['text']
            if content == '':
                content = '&nbsp;'

            label = cell['type']
            label_class = 'body'
            if label in ['row_header', 'row_multi_header', 'row_title']:
                label_class = 'header'
            elif label in ['col_header', 'col_multi_header']:
                label_class = 'header'
            
            
            celltag = 'th' if label_class == 'header' else 'td'
            style = 'style="text-align: center;"' if label_class == 'header' else ''

            body += f'    <{celltag} rowstart="{rowstart}" colstart="{colstart}" rowspan="{rowspan}" colspan="{colspan}" {style}>{content}</{celltag}>\n'

        body += "  </tr>\n"

    body += "</table>"

    return body

def get_document_tables(doc_jsondata):
    """
    Convert the tables identified in the converted document to a list of HTML strings.
    
    Parameters
    ----------
    doc_jsondata :
        Converted document
    """

    tables = []
    page_counters = {}
    # Iterate through all the tables identified in the converted document
    for table in doc_jsondata.get("tables", []):
        prov = table["prov"][0]
        page = prov["page"]
        page_counters.setdefault(page, 0)
        page_counters[page] += 1
        
        output_html = write_table(table)
        tables.append(output_html)
    return tables
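The span logic in `get_tablecell_span` is easier to follow on a concrete cell. The sketch below restates it in a self-contained form under the name `cell_span` and applies it to a hypothetical cell. It assumes, as the code above implies, that `spans` holds `[row, column]` pairs for every grid position the cell covers.

```python
def cell_span(cell: dict, axis: int):
    """Return (span length, first index, last index) along `axis` (0 = rows, 1 = columns)."""
    indices = {position[axis] for position in cell["spans"]}
    if not indices:
        return 1, None, None  # no span information: treat it as a single cell
    return len(indices), min(indices), max(indices)


# Hypothetical cell covering row 0 and columns 2 to 4
cell = {"text": "% of Total", "spans": [[0, 2], [0, 3], [0, 4]]}
print(cell_span(cell, 0))  # (1, 0, 0)
print(cell_span(cell, 1))  # (3, 2, 4)
```

In `write_table`, the cell is only emitted at its first grid position, with the lengths written to the `rowspan` and `colspan` attributes.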


In [230]:
tables = []
for output_file in Path(OUTPUT_DIR).rglob("json*.zip"):
    with ZipFile(output_file) as archive:
        all_files = archive.namelist()
        for name in all_files:
            if not name.endswith(".json"):
                continue
            
            # Read the JSON member directly; str.rstrip('.json') strips characters, not the suffix
            doc_jsondata = json.loads(archive.read(name))

            tables += get_document_tables(doc_jsondata)
tables

['<table>\n  <tr>\n    <th rowstart="0" colstart="0" rowspan="1" colspan="1" style="text-align: center;">&nbsp;</th>\n    <th rowstart="0" colstart="1" rowspan="1" colspan="1" style="text-align: center;">&nbsp;</th>\n    <td rowstart="0" colstart="2" rowspan="1" colspan="3" >% of Total </td>\n    <td rowstart="0" colstart="5" rowspan="1" colspan="7" >triple inter-annotator mAP @ 0.5-0.95 (%) </td>\n  </tr>\n  <tr>\n    <td rowstart="1" colstart="0" rowspan="1" colspan="1" >class label </td>\n    <td rowstart="1" colstart="1" rowspan="1" colspan="1" >Count </td>\n    <td rowstart="1" colstart="2" rowspan="1" colspan="1" >Train </td>\n    <td rowstart="1" colstart="3" rowspan="1" colspan="1" >Test </td>\n    <td rowstart="1" colstart="4" rowspan="1" colspan="1" >Val </td>\n    <td rowstart="1" colstart="5" rowspan="1" colspan="1" >All </td>\n    <td rowstart="1" colstart="6" rowspan="1" colspan="1" >Fin </td>\n    <td rowstart="1" colstart="7" rowspan="1" colspan="1" >Man </td>\n    <td 

In [231]:
records = [
    rg.FeedbackRecord(
        fields = {"content": table},
    ) for table in tables
]
rg_llm_qna_ds.add_records(records)

Pushing records to Argilla...:   0%|          | 0/1 [00:00<?, ?it/s]

Pushing records to Argilla...: 100%|██████████| 1/1 [00:00<00:00, 29.92it/s]


##### Extracting figures

Deep Search allows for [extracting figures from PDF-files](https://github.com/DS4SD/deepsearch-examples/blob/main/examples/document_conversion_extract_figures/extract_figures.py). We will use this to extract figures from the converted PDF files and add them to the `FeedbackDataset`. First, we define functions that load the converted documents and crop the figures out of the original PDFs. Next, we add the figures to the `FeedbackDataset` by using the `add_records`-method.

In [232]:
def crop_pdf_to_image(
    pdf_filename: Path, page: int, bbox: list, output_filename: Path, resolution=72
):
    """
    Invoke the pdftoppm executable to crop the given bounding box from the PDF document.

    Parameters
    ----------
    pdf_filename : Path
        Input PDF file.
    page : int
        Page number where the bounding box is located.
    bbox : List[int]
        Bounding box to extract, in the format [x0, y0, x1, y1], where the origin is the top-left corner.
    output_filename : Path
        Output filename where the PNG image is saved to.
    resolution : int, Default=72
        Resolution of the extracted image.
    """
    cmd = [
        "pdftoppm",
        "-png",
        "-singlefile",
        "-f",
        str(page),
        "-l",
        str(page),
        "-cropbox",
        "-r",
        str(resolution),
        "-x",
        str(bbox[0]),
        "-y",
        str(bbox[1]),
        "-W",
        str(bbox[2] - bbox[0]),
        "-H",
        str(bbox[3] - bbox[1]),
        str(pdf_filename),
        str(output_filename),
    ]
    try:
        check_call(cmd)
    except CalledProcessError as cpe:
        raise RuntimeError(
            f"PDFTOPPM PROCESSING ERROR. Exited with: {cpe.returncode}"
        ) from cpe

def extract_figures_from_json_doc(
    pdf_filename: Path, document: dict, output_dir: Path, resolution: int
):
    """
    Iterate through the converted document format and extract the figures as PNG files

    Parameters
    ----------
    pdf_filename : Path
        Input PDF file.
    document :
        The converted document from Deep Search.
    output_dir : Path
        Output directory where all extracted images will be saved.
    resolution : int
        Resolution of the extracted image.
    """

    # Use the file name without its extension; str.rstrip strips characters, not a suffix
    output_base = output_dir / Path(document["file-info"]["filename"]).stem
    page_counters = {}
    # Iterate through all the figures identified in the converted document
    for figure in document.get("figures", []):
        prov = figure["prov"][0]
        page = prov["page"]
        page_counters.setdefault(page, 0)
        page_counters[page] += 1

        # Retrieve the page dimensions, needed for shifting the coordinates of the bounding boxes
        page_dims = next(
            (dims for dims in document["page-dimensions"] if dims["page"] == page), None
        )
        if page_dims is None:
            continue

        # Convert the Deep Search bounding box in the coordinate frame used to extract images.
        # From having the origin in the bottom-left corner, to the top-left corner
        # The bounding box is expanded to the closest integer coordinates, because of the format
        # requirements of the tools used in the extraction.
        bbox = [
            math.floor(prov["bbox"][0]),
            math.floor(page_dims["height"] - prov["bbox"][3]),
            math.ceil(prov["bbox"][2]),
            math.ceil(page_dims["height"] - prov["bbox"][1]),
        ]

        # Extract the bounding box
        output_filename = output_base.with_name(
            f"{output_base.name}_{page}_{page_counters[page]}"
        )
        crop_pdf_to_image(
            pdf_filename, page, bbox, output_filename, resolution=resolution
        )

for pdf_filename in glob.glob("data/making-most-of-markdown/pdf/*.pdf"):
    for output_file in Path(OUTPUT_DIR).rglob("json*.zip"):
        with ZipFile(output_file) as archive:
            all_files = archive.namelist()
            for name in all_files:
                if name.endswith(".json"):
                    document = json.loads(archive.read(name))
                    extract_figures_from_json_doc(
                        pdf_filename, document, OUTPUT_DIR, 72
                    )

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
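The coordinate conversion inside `extract_figures_from_json_doc` can be isolated into a small function. The sketch below mirrors that step under the hypothetical name `to_top_left_bbox`: it flips the vertical axis from a bottom-left origin to a top-left origin and expands the box outward to integer coordinates. The page height and box values are made up for illustration.

```python
import math
from typing import List, Sequence


def to_top_left_bbox(bbox: Sequence[float], page_height: float) -> List[int]:
    """Convert [x0, y0, x1, y1] with a bottom-left origin to integer top-left coordinates."""
    x0, y0, x1, y1 = bbox
    return [
        math.floor(x0),                # left edge, rounded outward
        math.floor(page_height - y1),  # the old top edge becomes the new y0
        math.ceil(x1),                 # right edge, rounded outward
        math.ceil(page_height - y0),   # the old bottom edge becomes the new y1
    ]


# Hypothetical figure on a page that is 792 points high
print(to_top_left_bbox([100.2, 500.5, 300.7, 700.1], page_height=792))  # [100, 91, 301, 292]
```

The width and height passed to `pdftoppm` then follow as the differences `x1 - x0` and `y1 - y0` of the converted box.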

We will now reuse the multi-modal function `image_to_dataurl` to convert these `.png` files to DataURLs and add them as records to the QnA dataset, so that a `question` and an `answer` can be written for each figure.

In [233]:
files = glob.glob("data/making-most-of-markdown/pdf/processed/*.png")
records = [
    rg.FeedbackRecord(
        fields = {"content": image_to_dataurl(file, file_type="png")}
    ) for file in files
]
rg_llm_qna_ds.add_records(records)

Pushing records to Argilla...:   0%|          | 0/1 [00:00<?, ?it/s]

Pushing records to Argilla...: 100%|██████████| 1/1 [00:00<00:00,  9.66it/s]
