# Making the Most of Markdown within Argilla TextFields

## Introduction

As you may have noticed, Argilla supports Markdown within its text fields. This allows you to add formatting to your text, such as **bold** or *italic* text, or even [links](https://www.google.com). It also allows you to add HTML content, such as images, videos, and even iframes, which is a powerful tool to have at your disposal.

Within this notebook, we will go over the basics of Markdown, and how to use it within Argilla.

- multi-modality
  - image
  - video
  - audio
- table
- exploiting displacy
  - ner
  - relationships

## Installing Dependencies

We will be working with built-in Python libraries as well as the `argilla` library. Additionally, we will use an unstructured document processor with an externally callable public API (to avoid overhead). This tool is called [IBM Deep Search](https://github.com/DS4SD/deepsearch-toolkit), but for a fully open-source alternative, I recommend taking a look at [Unstructured](https://unstructured.io). To install the dependencies, run the following commands:

In [83]:
!pip install argilla==1.17 
!pip install datasets
!pip install spacy spacy-transformers
!pip install Pillow
!pip install span_marker
!pip install soundfile librosa
!pip install deepsearch-toolkit
!python -m spacy download en_core_web_sm


In [94]:
import argilla as rg
import re
import pandas as pd
import spacy
import span_marker
from datasets import load_dataset
from spacy import displacy

nlp = spacy.load(
    "en_core_web_sm", 
    exclude=["ner"]
)
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-bert-tiny-fewnerd-coarse-super"})

<span_marker.spacy_integration.SpacySpanMarkerWrapper at 0x2bd635ea0>

### Sign up for Deep Search


Go to https://deepsearch-experience.res.ibm.com/ and sign up for an account using the Google OAuth integration. Afterwards, you can use the command below to configure a profile for the toolkit.

![authenticate](img/making-most-of-markdown/deepsearch.png)


```bash
deepsearch profile config --profile-name "ds-experience" --host "https://deepsearch-experience.res.ibm.com/" --verify-ssl --username "<your-email>"
```

When prompted in the terminal, enter your API key.

## Get Coding

### Exploiting `displacy`

displaCy is spaCy's built-in visualizer for the output of its NLP models. We will use it to visualize dependency parses and named entities through the `displacy.render` function, which takes a processed `Doc` and returns an HTML string that can be rendered within Argilla. Inside a notebook, pass `jupyter=False` to get the HTML string back instead of having it displayed directly.

Displacy can render dependencies, named entities, and even relationships. We will be using the first two, but if you want to learn more about displacy, you can check out the [documentation](https://spacy.io/usage/visualizers).

In [88]:
doc = nlp("Rats are various medium-sized, long-tailed rodents.")
x = displacy.render(doc, style="dep")

In [89]:
text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."
doc2 = nlp(text)
displacy.render(doc2, style="ent")

We can now create an Argilla `FeedbackDataset` to hold the displacy output. We configure it to show the plain text, the dependency parse, and the entities. Additionally, we add a label question to indicate whether the text is relevant, a multi-label question to flag issues (`flag-pos`, `flag-ner`), and two text questions in which the user can apply a correction to the dependencies and/or entities.

In [101]:
try:
    ds = rg.FeedbackDataset(
        fields=[
            rg.TextField(name="text", use_markdown=True),
            rg.TextField(name="dep", use_markdown=True),
            rg.TextField(name="ent", use_markdown=True)
        ],
        questions=[
            rg.LabelQuestion(name="relevant", labels=["yes", "no"]),
            rg.MultiLabelQuestion(name="question-multi", labels=["flag-pos", "flag-ner"]),
            rg.TextQuestion(name="dep-correction", use_markdown=True),
            rg.TextQuestion(name="ner-correction", use_markdown=True)
        ]
    )
    ds = ds.push_to_argilla("exploiting-displacy")
except Exception as e:
    ds = rg.FeedbackDataset.from_argilla("exploiting-displacy")

Now, we will load the basic [few-nerd dataset from Hugging Face](https://huggingface.co/datasets/DFKI-SLT/few-nerd). This dataset contains tokenized sentences together with their named-entity annotations. We will use a small slice of it to show how to use displacy within Argilla.

In [95]:
dataset_fewnerd = load_dataset("DFKI-SLT/few-nerd", "supervised", split="train[:10]")

Next, we will use this dataset to populate our Argilla `FeedbackDataset`. We use the `displacy.render` function to render the dependency parse and the entities as HTML and add them, together with the plain text, as fields. Finally, we add Markdown-formatted tables as suggestions to provide basic support for correcting the NER and dependency annotations.
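To make the table format concrete, here is a minimal, dependency-free sketch of the kind of Markdown table that ends up as a suggestion value. The helper name `rows_to_markdown` and the sample rows are made up for illustration; the notebook itself relies on pandas' `to_markdown`, whose exact spacing may differ.

```python
def rows_to_markdown(rows: list[dict]) -> str:
    """Render a list of same-keyed dicts as a pipe-delimited Markdown table."""
    if not rows:
        return ""
    headers = list(rows[0])
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)


# Hand-written entities for illustration; not actual model output.
table = rows_to_markdown([
    {"Label": "person", "Text": "Sebastian Thrun"},
    {"Label": "organization", "Text": "Google"},
])
print(table)
```

Because the text questions are created with `use_markdown=True`, a string in this shape should render as a table that annotators can edit.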

In [102]:
texts = [" ".join(x["tokens"]) for x in dataset_fewnerd]
docs = nlp.pipe(texts)

def wrap_in_max_width(html):
    html = html.replace("max-width: none;", "")
    # Remove the fixed width and height attributes (e.g. width="1450")
    html = re.sub(r"\swidth=\"[\d.]+\"", "", html)
    html = re.sub(r"\sheight=\"[\d.]+\"", "", html)
    
    # Find the SVG element in the HTML output
    svg_start = html.find("<svg")
    svg_end = html.find("</svg>") + len("</svg>")
    svg = html[svg_start:svg_end]

    # Set the width and height attributes of the SVG element to 100%
    svg = svg.replace("<svg", "<svg width='100%' height='100%'")

    # Wrap the SVG element in a div with max-width and horizontal scrolling
    return f"<div style='max-width: 100%; overflow-x: auto;'>{svg}</div>"

records = []
for doc in docs:
    record = rg.FeedbackRecord(
        fields={
            "text": doc.text, 
            "dep": wrap_in_max_width(displacy.render(doc, style="dep", jupyter=False)), 
            "ent": displacy.render(doc, style="ent", jupyter=False)
        },
        suggestions=[{
                "question_name": "dep-correction", 
                "value": pd.DataFrame([{"Label": token.dep_, "Text": token.text} for token in doc]).to_markdown(index=False)

            },
            {
                "question_name": "ner-correction", 
                "value": pd.DataFrame([{"Label": ent.label_, "Text": ent.text} for ent in doc.ents]).to_markdown(index=False),
            }
        ]
    )
    records.append(record)
ds.add_records(records)

Pushing records to Argilla...: 100%|██████████| 1/1 [00:00<00:00, 11.15it/s]
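To illustrate what the wrapping step does, the sketch below applies the same idea to a hand-written SVG string. The function names and the sample markup are assumptions for illustration only; real displaCy output carries more attributes.

```python
import re


def strip_fixed_size(svg_html: str) -> str:
    """Drop fixed width/height attributes so the SVG can scale with its container."""
    # The leading \s avoids touching attributes such as stroke-width.
    svg_html = re.sub(r'\swidth="[\d.]+"', "", svg_html)
    svg_html = re.sub(r'\sheight="[\d.]+"', "", svg_html)
    return svg_html


def make_scrollable(svg_html: str) -> str:
    """Wrap the markup in a div that scrolls horizontally when it overflows."""
    return f"<div style='max-width: 100%; overflow-x: auto;'>{svg_html}</div>"


# Hand-written stand-in for displaCy output (not real displaCy markup).
sample = '<svg class="displacy" width="1450" height="399.5"><text>Rats</text></svg>'
wrapped = make_scrollable(strip_fixed_size(sample))
print(wrapped)
```

Wide dependency parses then scroll inside the field instead of stretching the annotation view.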


### Multi-Modality

Argilla can also work with images, video, and audio when they are formatted as HTML and used within a Markdown field. Besides pointing to publicly available sources, we can also use something called DataURLs.

#### DataURLs

A DataURL is a way to encode binary data into a string, which can then be used to embed the data into a webpage. This is a very useful tool, as it allows us to embed images, videos, and audio files directly into HTML, without having to worry about hosting them externally. This is done by prepending the data with a header, which specifies the type of data being encoded and the encoding used. We will define three functions, one for each modality, each of which takes a file path as input and returns an HTML snippet that embeds the file as a DataURL.

In [None]:
import base64
from pathlib import Path

def get_file_type(path):
    return Path(path).suffix[1:]

def video_to_dataurl(path, file_type: str = None):
    # Open the video file and read its contents
    with open(path, 'rb') as f:
        video_data = f.read()

    # Encode the video data as base64
    video_base64 = base64.b64encode(video_data).decode('utf-8')

    # Get the file type (e.g. mp4)
    file_type = file_type or get_file_type(path)
    
    # Prepend the Data URL prefix to the base64-encoded data
    data_url = f'data:video/{file_type};base64,' + video_base64

    # Create HTML
    html = f"<video controls><source src='{data_url}' type='video/{file_type}'></video>"
    return html
    
def audio_to_dataurl(path, file_type: str = None):
    # Open the audio file and read its contents
    with open(path, 'rb') as f:
        audio_data = f.read()
    
    # Encode the audio data as base64
    audio_base64 = base64.b64encode(audio_data).decode('utf-8')
    
    # Get the file type (e.g. mp3)
    file_type = file_type or get_file_type(path)
    
    # Prepend the Data URL prefix to the base64-encoded data
    data_url = f'data:audio/{file_type};base64,' + audio_base64
    
    # Create HTML
    html = f"<audio controls autoplay><source src='{data_url}' type='audio/{file_type}'></audio>"
    return html

def image_to_dataurl(path, file_type: str = None):
    # open the image file and read its contents
    with open(path, 'rb') as f:
        image_data = f.read()
        
    # Encode the image data as base64
    image_base64 = base64.b64encode(image_data).decode('utf-8')
    
    # Get the file type (e.g. png)
    file_type = file_type or get_file_type(path)
    
    # Prepend the Data URL prefix to the base64-encoded data
    data_url = f'data:image/{file_type};base64,' + image_base64
    
    # Create HTML
    html = f'<img src="{data_url}">'
    return html
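
The helpers above build the media type from the file extension, which gives, for example, `audio/mp3` for an `.mp3` file. As a hedged alternative, the standard-library `mimetypes` module can look up the registered type instead (typically `audio/mpeg` for `.mp3`). The function name `to_data_url` is an assumption for this sketch and is not part of the notebook.

```python
import base64
import mimetypes


def to_data_url(data: bytes, filename: str) -> str:
    """Build a Data URL, guessing the MIME type from the file name."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        # Generic fallback for unknown extensions
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(data).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Dummy bytes for illustration; a real call would pass the file contents.
print(to_data_url(b"hello", "picture.png"))
```

The result has the same `data:<type>;base64,<payload>` shape that the three functions above embed in their HTML.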

Next, we will define our `FeedbackDataset` and add the DataURLs to it. We will also add a question to ask the user to describe the image, video, or audio file.

In [None]:
try:
    ds = rg.FeedbackDataset(
        fields=[rg.TextField(name="content", use_markdown=True)],
        questions=[rg.TextQuestion(name="describe")],
    )
    ds = ds.push_to_argilla("multi-modal")
    
except:
    ds = rg.FeedbackDataset.from_argilla("multi-modal")


We can now create records from the generated HTML and pass them to the `add_records` method.

In [None]:

records = [
    rg.FeedbackRecord(fields={"content": audio_to_dataurl("img/making-most-of-markdown/heath_ledger.mp3")}),
    rg.FeedbackRecord(fields={"content": audio_to_dataurl("img/making-most-of-markdown/heath_ledger_2.mp3")}),
    rg.FeedbackRecord(fields={"content": image_to_dataurl("img/making-most-of-markdown/deepsearch.png")}),
    rg.FeedbackRecord(fields={"content": video_to_dataurl("img/making-most-of-markdown/snapshot.mp4")})
]
ds.add_records(records)

#### Let's create proper datasets

##### For Audio Classification

For this example audio classification dataset, we will be using the [ccmusic-database/bel_folk](https://huggingface.co/datasets/ccmusic-database/bel_folk) dataset from Hugging Face. This dataset contains 1-minute audio clips of Chinese folk music, along with the genre of the music. We will use it to create a dataset for audio classification.

In [107]:
from datasets import load_dataset

my_audio_dataset = load_dataset("ccmusic-database/bel_folk")
my_audio_dataset = my_audio_dataset.shuffle()
my_audio_dataset["train"][0], my_audio_dataset["train"].features

({'audio': {'path': '/Users/davidberenstein/.cache/huggingface/datasets/downloads/extracted/16960b86c4c6c0594ad82b11290ee5a7bb8f111068e9c6032444a2005fe715ad/dataset/audio/folk_f (39).wav',
   'array': array([-0.00036986, -0.00120128, -0.00222397, ...,  0.00090062,
           0.00168412,  0.0010872 ]),
   'sampling_rate': 44100},
  'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=476x349>,
  'label': 3,
  'gender': 0,
  'singing_method': 0},
 {'audio': Audio(sampling_rate=44100, mono=True, decode=True, id=None),
  'image': Image(decode=True, id=None),
  'label': ClassLabel(names=['m_bel', 'f_bel', 'm_folk', 'f_folk'], id=None),
  'gender': ClassLabel(names=['female', 'male'], id=None),
  'singing_method': ClassLabel(names=['Folk Singing', 'Bel Canto'], id=None)})

We will now define the `label`, `gender`, and `singing_method` columns as `LabelQuestion`s and infer the label sets from the `features` attribute of the dataset.

In [105]:
label_general = my_audio_dataset["train"].features["label"].names
label_gender = my_audio_dataset["train"].features["gender"].names
label_singing_method = my_audio_dataset["train"].features["singing_method"].names
rg_ds_audio = rg.FeedbackDataset(
    fields=[rg.TextField(name="audio", use_markdown=True)],
    questions=[
        rg.LabelQuestion(name="general", labels=label_general),
        rg.LabelQuestion(name="gender", labels=label_gender),
        rg.LabelQuestion(name="singing_method", labels=label_singing_method)
    ]
)
try:
    rg_ds_audio = rg.FeedbackDataset.from_argilla("audio")
except:
    rg_ds_audio = rg_ds_audio.push_to_argilla("audio")
rg_ds_audio

RemoteFeedbackDataset(
   id=8005e3b0-c932-462a-8f49-d0acb0044e2e
   name=audio
   workspace=Workspace(id=268dbd87-2970-46cd-a84a-e3785b7178b7, name=argilla, inserted_at=2023-08-14 13:02:15, updated_at=2023-08-14 13:02:15)
   url=http://localhost:6900/dataset/8005e3b0-c932-462a-8f49-d0acb0044e2e/annotation-mode
   fields=[RemoteTextField(id=UUID('16121a5f-a9b3-4a59-92cb-eebdbcbeb957'), client=None, name='audio', title='Audio', required=True, type='text', use_markdown=True)]
   questions=[RemoteLabelQuestion(id=UUID('cf389c89-cf1e-416a-b7da-5f59838cf1d3'), client=None, name='general', title='General', description=None, required=True, type='label_selection', labels=['m_bel', 'f_bel', 'm_folk', 'f_folk'], visible_labels=None), RemoteLabelQuestion(id=UUID('08edc7e7-b8c2-41b3-8526-b9ae04fa15f5'), client=None, name='gender', title='Gender', description=None, required=True, type='label_selection', labels=['female', 'male'], visible_labels=None), RemoteLabelQuestion(id=UUID('54e321b7-7a6c-4df7

Next, we will create the records and embed each audio clip as a DataURL using the `audio_to_dataurl` function. We also add the dataset labels as suggestions for each of the label questions.

In [119]:
records = []
my_audio_dataset_slice = my_audio_dataset["train"].select(range(20))
for entry in my_audio_dataset_slice:
    record = rg.FeedbackRecord(
        fields={"audio": audio_to_dataurl(entry["audio"]["path"])},
        suggestions=[
            {"question_name": "general", "value": label_general[entry["label"]]},
            {"question_name": "gender", "value": label_gender[entry["gender"]]},
            {"question_name": "singing_method", "value": label_singing_method[entry["singing_method"]]}
        ]
    )
    try:
        rg_ds_audio.add_records(record, show_progress=False)
    except:
        print("too large")
        pass

too large
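
The `too large` message suggests that some encoded clips exceed what can be stored in a field. Base64 turns every 3 bytes into 4 characters, so the encoded size can be estimated before reading the file. The sketch below assumes a made-up threshold, `MAX_FIELD_CHARS`; it is not an Argilla setting, and the real limit depends on your deployment.

```python
import math
import os


def estimated_base64_size(num_bytes: int) -> int:
    """Length of the padded base64 encoding of num_bytes bytes."""
    return 4 * math.ceil(num_bytes / 3)


# Hypothetical threshold for illustration only; not an Argilla setting.
MAX_FIELD_CHARS = 5_000_000


def fits_in_field(path: str, max_chars: int = MAX_FIELD_CHARS) -> bool:
    """Check whether the file at path would stay under max_chars once encoded."""
    return estimated_base64_size(os.path.getsize(path)) <= max_chars


# A 1 MiB file grows to roughly 1.33 times its size once encoded.
print(estimated_base64_size(1024 * 1024))
```

A check like `fits_in_field(entry["audio"]["path"])` could be used to skip oversized clips up front instead of relying on the exception.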


##### For Image Classification 

In this example, we will be using the [zishuod/pokemon-icons](https://huggingface.co/datasets/zishuod/pokemon-icons) dataset from Hugging Face. This dataset contains images of Pokemon that need to be classified. We will use it to create a dataset for image classification, but feel free to use any other image classification dataset, such as the ones listed below.

Some examples are:
- "zishuod/pokemon-icons"
- "keremberke/shoe-classification"
- "sampath017/plants"
- "sin3142/memes-500"
- "adamkatav/mtg_subsample"
- "sxdave/emotion_detection"

In [120]:
my_image_dataset  = load_dataset("zishuod/pokemon-icons")
my_image_dataset = my_image_dataset.shuffle()
my_image_dataset["train"][0], my_image_dataset["train"].features

Resolving data files:   0%|          | 0/427 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/165 [00:00<?, ?it/s]

({'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=75x75>,
  'label': 70},
 {'image': Image(decode=True, id=None),
  'label': ClassLabel(names=['aegislash', 'amoonguss', 'arcanine', 'azumarill', 'blastoise', 'blaziken', 'blissey', 'calyrex', 'celesteela', 'centiskorch', 'chansey', 'charizard', 'cinderace', 'clefable', 'clefairy', 'coalossal', 'comfey', 'cresselia', 'darmanitan', 'dialga', 'diggersby', 'dracovish', 'dracozolt', 'dragapult', 'drifblim', 'durant', 'dusclops', 'eternatus', 'excadrill', 'ferrothorn', 'garchomp', 'gastrodon', 'gigalith', 'glastrier', 'gothitelle', 'grimmsnarl', 'groudon', 'gyarados', 'heatran', 'hippowdon', 'hooh', 'hydreigon', 'incineroar', 'indeedee', 'kartana', 'kyogre', 'landorus', 'lapras', 'ludicolo', 'mamoswine', 'metagross', 'mewtwo', 'mienshao', 'milotic', 'mimikyu', 'moltres', 'necrozma', 'nihilego', 'ninetales', 'pheromosa', 'pikachu', 'porygon2', 'primarina', 'quagsire', 'raichu', 'raikou', 'regieleki', 'regigigas', 'registeel', 'rhy

We will now define the `label` column as a `LabelQuestion` and infer the label set from the `features` attribute of the dataset. Also, we will use a basic `TextField` to add the image to the `FeedbackDataset`.

In [125]:
label_image = my_image_dataset["train"].features["label"].names
rg_ds_image = rg.FeedbackDataset(
    fields=[rg.TextField(name="image", use_markdown=True)],
    questions=[rg.LabelQuestion(name="label", labels=label_image)]
)
try:
    rg_ds_image = rg.FeedbackDataset.from_argilla("image")
except:
    rg_ds_image = rg_ds_image.push_to_argilla("image")
rg_ds_image



RemoteFeedbackDataset(
   id=c29c5a97-eb64-49d9-a57e-ec0a5f6e6038
   name=image
   workspace=Workspace(id=268dbd87-2970-46cd-a84a-e3785b7178b7, name=argilla, inserted_at=2023-08-14 13:02:15, updated_at=2023-08-14 13:02:15)
   url=http://localhost:6900/dataset/c29c5a97-eb64-49d9-a57e-ec0a5f6e6038/annotation-mode
   fields=[RemoteTextField(id=UUID('80f85ff4-052a-4e11-aec0-93e63857fcc2'), client=None, name='image', title='Image', required=True, type='text', use_markdown=True)]
   questions=[RemoteLabelQuestion(id=UUID('81d57bd8-f69b-48fa-83cd-90fb805acd84'), client=None, name='label', title='Label', description=None, required=True, type='label_selection', labels=['aegislash', 'amoonguss', 'arcanine', 'azumarill', 'blastoise', 'blaziken', 'blissey', 'calyrex', 'celesteela', 'centiskorch', 'chansey', 'charizard', 'cinderace', 'clefable', 'clefairy', 'coalossal', 'comfey', 'cresselia', 'darmanitan', 'dialga', 'diggersby', 'dracovish', 'dracozolt', 'dragapult', 'drifblim', 'durant', 'dusclops

Now, we will create the records and embed each image as a DataURL using the `image_to_dataurl` function. We also add the dataset label as a suggestion for the label question.

In [126]:
temp_img_path = "data/making-most-of-markdown/temp_img.png"
records = []
for entry in my_image_dataset["train"]:
    entry["image"].save(temp_img_path, format="png")
    record = rg.FeedbackRecord(
        fields={"image": image_to_dataurl(temp_img_path, file_type="png")},
        suggestions=[{"question_name": "label", "value": label_image[entry["label"]]}]
    )
    try:
        rg_ds_image.add_records(record, show_progress=False)
    except Exception as e:
        print(e)
        print("too large")
        pass
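
The loop above writes every image to a temporary file before encoding it. As a sketch of an alternative, the bytes can be encoded directly from memory. The function name `image_bytes_to_html` is an assumption for illustration, and the Pillow lines are shown only as comments.

```python
import base64


def image_bytes_to_html(image_bytes: bytes, file_type: str = "png") -> str:
    """Embed raw image bytes in an <img> tag as a base64 Data URL."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f'<img src="data:image/{file_type};base64,{encoded}">'


# With Pillow, the bytes could come from an in-memory buffer, for example:
#   buffer = io.BytesIO()
#   entry["image"].save(buffer, format="PNG")
#   html = image_bytes_to_html(buffer.getvalue())

# Placeholder bytes (only the PNG signature), used here instead of a real image.
html = image_bytes_to_html(b"\x89PNG\r\n\x1a\n")
print(html)
```

This avoids the round trip through disk while producing the same kind of HTML snippet as `image_to_dataurl`.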