## Data Labelling! 

Imagine you have a large pile of science-y texts, say logs from a lab doing materials research. You want to annotate them and pull out important entities like 'processes', 'materials', etc. You also want to fine-tune a BERT model to do this for you later on.

So what do you do? First, you look online and find a [fine-tuning guide](https://www.datasciencecentral.com/how-to-fine-tune-bert-transformer-with-spacy-3/). 
> "To fine-tune BERT using spaCy 3, we need to provide training and dev data in the spaCy 3 JSON format (see here) which will be then converted to a .spacy binary file. We will provide the data in IOB format contained in a TSV file then convert to spaCy JSON format."

Example IOB: 

```text
MS B-DIPLOMA
in O
electrical B-DIPLOMA_MAJOR
engineering I-DIPLOMA_MAJOR
or O
computer B-DIPLOMA_MAJOR
engineering I-DIPLOMA_MAJOR
. O
```
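
In this scheme, `B-` marks the first token of an entity, `I-` marks a token that continues it, and `O` marks a token outside any entity. As a minimal sketch of how such lines can be read back into entity phrases (the helper name `iob_to_spans` is made up for illustration and is not part of spaCy or GrammarFlow):

```python
def iob_to_spans(iob_text):
    """Group 'token TAG' lines into (label, phrase) pairs."""
    spans = []
    current_label, current_tokens = None, []

    def flush():
        # Close the entity being built, if any.
        nonlocal current_label, current_tokens
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))
        current_label, current_tokens = None, []

    for line in iob_text.strip().splitlines():
        token, tag = line.rsplit(maxsplit=1)
        if tag.startswith("B-"):
            flush()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            flush()
    flush()
    return spans


example = """MS B-DIPLOMA
in O
electrical B-DIPLOMA_MAJOR
engineering I-DIPLOMA_MAJOR
or O
computer B-DIPLOMA_MAJOR
engineering I-DIPLOMA_MAJOR
. O"""

print(iob_to_spans(example))
```

Run on the example above, this groups the tokens into `MS`, `electrical engineering` and `computer engineering`, each paired with its label.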

After reading this, you understand you need to create a pipeline which can generate these tags in the above format. **This seems like something GrammarFlow can help with!**

> Note: The focus of this notebook is data labelling, not fine-tuning!

## Helper Functions

In [1]:
import re
import time
import random
from typing import List, Optional
from pydantic import BaseModel, Field
from grammarflow import *
import openai
import os
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

In [3]:
class LLM:
    def __init__(self):
        self.client = openai.OpenAI(
            api_key=os.environ["OPENAI_API_KEY"],
        )

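    def invoke(self, config: dict):
        # Fill the prompt from the config, then send it through __call__
        # (the class defines no `request` method).
        with PromptContextManager(config) as filled_prompt:
            return self(filled_prompt, temperature=0.01)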

    def __call__(self, prompt, temperature=0.2, context=None):
        response = self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        return response.choices[0].message.content


llm = LLM()

In [4]:
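def tag_abstract(abstract, materials, conditions, processes):
    """Return the abstract as IOB lines ("token TAG"), one token per line."""
    # Tokenize into words and standalone punctuation marks.
    words = re.findall(r"\w+|[^\w\s]", abstract)

    # Every token starts as "O" (outside any entity).
    tags = ["O"] * len(words)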

    def tag_words(word_list, tag_suffix):
        for word in word_list:
            start = 0
            while start < len(abstract):
                found_at = abstract.find(word, start)
                if found_at == -1:
                    break
                # The match's token index is the number of tokens before it.
                token_index = len(re.findall(r"\w+|[^\w\s]", abstract[:found_at]))
                if token_index < len(tags):
                    tags[token_index] = "B" + tag_suffix
                    for i in range(
                        token_index + 1,
                        min(
                            token_index + len(re.findall(r"\w+|[^\w\s]", word)),
                            len(tags),
                        ),
                    ):
                        tags[i] = "I" + tag_suffix
                start = found_at + 1

    tag_words(materials, "-MATERIALS")
    tag_words(conditions, "-CONDITIONS")
    tag_words(processes, "-PROCESSES")

    tagged_abstract = "\n".join([f"{word} {tag}" for word, tag in zip(words, tags)])

    return tagged_abstract
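
Note that `tag_abstract` locates phrases with a case-sensitive substring search (`abstract.find`). A phrase returned in a different case than the text is therefore skipped (in the output further down, "External calibration approach" stays `O` because the model returned it in lower case), and a phrase can also match inside a longer word. One possible alternative, sketched here under the assumption that whole-token, case-insensitive matching is what you want, compares token sequences instead. `tag_tokens` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import re

TOKEN_RE = re.compile(r"\w+|[^\w\s]")  # same tokenizer pattern as tag_abstract


def tag_tokens(text, phrases, label):
    """Return (token, tag) pairs, tagging whole-token, case-insensitive matches."""
    tokens = TOKEN_RE.findall(text)
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    for phrase in phrases:
        target = [t.lower() for t in TOKEN_RE.findall(phrase)]
        if not target:
            continue  # nothing to match for an empty phrase
        n = len(target)
        # Slide a window of n tokens over the text and compare.
        for i in range(len(tokens) - n + 1):
            if lowered[i:i + n] == target:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
    return list(zip(tokens, tags))


pairs = tag_tokens(
    "External calibration was used. Seawater and water were tested.",
    ["external calibration", "water"],
    "DEMO",
)
print(pairs)
```

Here "External calibration" is tagged despite the case difference, and "water" is tagged only as a standalone token, not inside "Seawater".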

## Pydantic Model

In [5]:
class Annotations(BaseModel):
    materials: List[str] = Field(
        ...,
        description="Only nouns that count as materials in experiments (such as types of chromatography, purification, methods, preparation, study, etc)",
    )
    conditions: List[str] = Field(
        ...,
        description="Conditions used within the experiments (like temperature units, quantity units, mathematical units, percentages, coefficients, any numbers)",
    )
    processes: List[str] = Field(
        ...,
        description="Only nouns and verbs for processes within the experiments (like factor names, rate, yield, etc)",
    )

In [6]:
SampleAnnotations = Annotations(
    materials=["silica resin"],
    conditions=[
        "10µm, 20µm, 50µm",
        "only 10µm silica resin",
        "yield (>96%)",
        "productivity (> 1kg/kg-resin/day)",
    ],
    processes=["chiral separation"],
)

## Making Prompt

In [7]:
prompt = PromptBuilder()
prompt.add_section(
    text="""
Your role is that of a DATA ANNOTATOR for research paper abstracts. You are expected to identify the materials used, different processes involved, and conditions mentioned in the abstract. 
I want you to look at the abstract given below and return all the key phrases you find. Every str object within the list you return for each of the tags must contain at least 2 words and not exceed 8 words. 
"""
)
prompt.add_section(define_grammar=True)
prompt.add_section(
    text="""
Here is an example: 

Abstract: Of the three particle sizes studied (10µm, 20µm, 50µm) only 10µm silica resin was able to produce purified API at the yield (>96%) and productivity (> 1kg/kg-resin/day) necessitated by the project. The second case study uses DoE studies to identify critical process parameters of column load, mobile phase solvent ratio and basic modifier level for a low-resolution, preparative, chiral separation.
Annotations: 
{sample} 
""",
    placeholders=["sample"],
)
prompt.add_section(
    text="Remember that your role is to automate the data annotation process for a chemistry based project. Begin!\nAbstract: {abstract}\nAnnotations:",
    placeholders=["abstract"],
)
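
The prompt asks for phrases of 2 to 8 words, but the request alone does not guarantee the model complies. A small post-hoc check can flag phrases that fall outside that range. This is only a sketch: `split_by_word_count` is a hypothetical helper, and counting words by splitting on whitespace is an assumption about what "word" means here.

```python
def split_by_word_count(phrases, min_words=2, max_words=8):
    """Separate phrases that satisfy the length rule from those that do not."""
    kept, rejected = [], []
    for phrase in phrases:
        n_words = len(phrase.split())  # whitespace-separated words
        if min_words <= n_words <= max_words:
            kept.append(phrase)
        else:
            rejected.append(phrase)
    return kept, rejected


kept, rejected = split_by_word_count(["silica resin", "water", "chiral separation"])
print(kept)
print(rejected)
```

Whether to drop the rejected phrases or keep them anyway is a judgment call; single-word materials can still be valid annotations.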

## Constraining

In [8]:
abstract = """ 
The simultaneous determination of multi-mycotoxins in food commodities are highly desirable due to their potential toxic effects and mass consumption of foods. 
Herein, liquid chromatography-quadrupole exactive orbitrap mass spectrometry was proposed to analyze multi-mycotoxins in commercial vegetable oils. 
Specifically, the method featured a successive liquid–liquid extraction process, in which the complementary solvents consisted of acetonitrile and water were optimized. 
Resultantly, matrix effects were reduced greatly. External calibration approach revealed good quantification property for each analyte. 
Under optimal conditions, the recovery ranging from 80.8% to 109.7%, relative standard deviation less than 11.7%, and good limit of quantification (0.35 to 45.4ng/g) were achieved. 
The high accuracy of proposed method was also validated. The detection of 20 commercial vegetable oils revealed that aflatoxins B1 and B2, zearalenone were observed in 10 real samples. 
The as-developed method is simple and low-cost, which merits the wide applications for scanning mycotoxins in oil matrices.
"""

In [9]:
with Constrain(prompt) as manager:
    manager.set_config(format="xml")

    manager.format_prompt(
        placeholders={"abstract": abstract, "sample": XML.format(SampleAnnotations)},
        grammars=[
            {"model": [Annotations]},
        ],
    )

    llm_response = llm(manager.prompt)
    response = manager.parse(llm_response)

In [10]:
materials, conditions, processes = (
    response.Annotations.materials,
    response.Annotations.conditions,
    response.Annotations.processes,
)

In [11]:
materials, conditions, processes

(['vegetable oils', 'acetonitrile', 'water'],
 ['80.8% to 109.7%', '11.7%', '0.35 to 45.4ng/g'],
 ['liquid chromatography-quadrupole exactive orbitrap mass spectrometry', 'liquid–liquid extraction process', 'external calibration approach'])

In [12]:
tagged_abstract = tag_abstract(abstract, materials, conditions, processes)

In [13]:
print(tagged_abstract)

The O
simultaneous O
determination O
of O
multi O
- O
mycotoxins O
in O
food O
commodities O
are O
highly O
desirable O
due O
to O
their O
potential O
toxic O
effects O
and O
mass O
consumption O
of O
foods O
. O
Herein O
, O
liquid B-PROCESSES
chromatography I-PROCESSES
- I-PROCESSES
quadrupole I-PROCESSES
exactive I-PROCESSES
orbitrap I-PROCESSES
mass I-PROCESSES
spectrometry I-PROCESSES
was O
proposed O
to O
analyze O
multi O
- O
mycotoxins O
in O
commercial O
vegetable B-MATERIALS
oils I-MATERIALS
. O
Specifically O
, O
the O
method O
featured O
a O
successive O
liquid B-PROCESSES
– I-PROCESSES
liquid I-PROCESSES
extraction I-PROCESSES
process I-PROCESSES
, O
in O
which O
the O
complementary O
solvents O
consisted O
of O
acetonitrile B-MATERIALS
and O
water B-MATERIALS
were O
optimized O
. O
Resultantly O
, O
matrix O
effects O
were O
reduced O
greatly O
. O
External O
calibration O
approach O
revealed O
good O
quantification O
property O
for O
each O
analyte O
. O
Under O
optimal 

## Voilà!

And we're done! Next, follow the remaining steps in the guide: save the tagged output as a TSV file, then convert it to the spaCy JSON format.
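
As a rough sketch of the first of those steps, the space-separated "token TAG" lines can be rewritten with a tab between token and tag. `iob_lines_to_tsv` is a hypothetical helper, and the exact layout the guide's converter expects (for example, how sentences are separated) is an assumption to check against the guide.

```python
import io


def iob_lines_to_tsv(tagged_text):
    """Turn 'token TAG' lines into 'token<TAB>TAG' lines."""
    rows = []
    for line in tagged_text.splitlines():
        parts = line.rsplit(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or incomplete lines
        token, tag = parts
        rows.append(f"{token}\t{tag}")
    return "\n".join(rows) + "\n"


sample = "vegetable B-MATERIALS\noils I-MATERIALS\n. O"
tsv_text = iob_lines_to_tsv(sample)

# An in-memory buffer stands in for a real file here; a path such as
# "train.tsv" (hypothetical) could be opened for writing instead.
buffer = io.StringIO()
buffer.write(tsv_text)
print(buffer.getvalue())
```

In the notebook, the same function could be called on `tagged_abstract` instead of `sample`.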