# Advanced Topics in Weak Supervision - Named Entity Recognition
## IMD3011 - Datacentric AI
### [Dr. Elias Jacob de Menezes Neto](https://docente.ufrn.br/elias.jacob)

## Keypoints 
- **Named Entity Recognition (NER)**: An NLP task that identifies and classifies named entities in text into predefined categories (like persons, organizations, locations, medications), transforming unstructured text into structured, machine-processable data.

- **Weak Supervision Framework**: An approach that uses programmatic labeling functions instead of manual annotation to create training data for NER models, dramatically reducing annotation time and costs.

- **Diverse Labeling Functions**: Multiple methods to generate weak labels, including gazetteers (entity lists), pre-trained models, regular expressions, zero-shot learning models (GLiNER), and large language models with function calling capabilities.

- **Label Aggregation**: Techniques to combine outputs from multiple labeling functions using generative models (Hidden Markov Models) or majority voting to produce high-quality aggregated labels.

- **Document-Level Consistency**: Application of document-level labeling functions to ensure entity labels are consistent throughout a document, reducing annotation errors and improving coherence.

- **Transfer Learning for NER**: Using pre-trained language models like BERT and fine-tuning them on weakly labeled data to develop effective NER systems for specialized domains.

- **Performance Comparison**: Models trained with weak supervision (HMM labels: 0.87 F1-score, Majority vote: 0.87 F1-score) can achieve performance close to those trained on manually annotated data (True labels: 0.89 F1-score).

- **Time Efficiency Analysis**: Weak supervision significantly reduces annotation time compared to manual labeling (estimated 600 hours of human labor saved for a 10,000 document dataset).

- **Iterative Improvement Cycle**: A cyclical process of generating weak labels, training models, evaluating performance, and refining labeling functions to continuously improve NER models.


## Learning goals
By the end of this class, you will be able to:

1. **Explain** the fundamental concepts of Named Entity Recognition (NER), including how entities are defined, why they are important, and how NER transforms unstructured text into structured data.

2. **Apply** various weak supervision techniques (such as regex rules, gazetteers, pre-trained transformer models, and zero-shot methods) to generate and refine noisy labels effectively.

3. **Combine** outputs from multiple labeling functions using majority voting or generative modeling (e.g., Hidden Markov Models) to produce high-quality aggregated labels.

4. **Execute** document-level labeling strategies to ensure consistency of entity annotations across entire documents, using context for more accurate recognition.

5. **Fine-tune** pre-trained language models (e.g., BERT) on weakly labeled data for specialized domains, and evaluate their performance compared to models trained on manually annotated gold data.

6. **Analyze** the trade-offs between different labeling strategies and assess how iterative refinement can progressively improve NER accuracy and efficiency.

7. **Evaluate** the cost-effectiveness and scalability of weak supervision, quantifying the time savings and benefits relative to fully manual annotation for large-scale datasets.


In [1]:
import os
import pandas as pd

os.environ["TOKENIZERS_PARALLELISM"] = "false"
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# Review: Named Entity Recognition


## What is Named Entity Recognition (NER)?

Named Entity Recognition (NER) is a foundational task in Natural Language Processing (NLP), within the broader field of Information Extraction. It involves identifying and classifying named entities in a text into predefined categories. The primary goal of NER is to convert unstructured text into structured information that machines can easily read and analyze, which makes it vital for a wide range of applications.

### Key Concepts

**Entities**: In NER, entities refer to specific pieces of information within a text that are of interest. Any word or phrase that represents something specific in the real world can be considered an entity.

**Named Entities**: These are specific, identifiable entities mentioned in the text. They are categorized into various groups, such as:

- **`PERSON`**: Names of individuals.
- **`ORG`**: Names of organizations.
- **`GPE`**: Geopolitical entities, including countries, cities, and states.
- **`TIME`**: Time expressions, such as specific times of the day.
- **`DATE`**: Date expressions, including specific dates and periods.
- **...among others**: There are other categories, depending on the application and domain.

### Objectives of NER

The primary objective of NER is to extract structured information from unstructured text. This process involves:

1. **Detection**: Identifying the spans of text that correspond to named entities.
2. **Classification**: Assigning each detected entity to a predefined category.

### Importance of NER

NER is important for various applications, including:

- **Information Retrieval**: Enhancing search engines to retrieve more relevant results.
- **Question Answering Systems**: Improving the accuracy of systems designed to answer user queries.
- **Content Recommendation**: Personalizing content based on identified entities.
- **Text Summarization**: Extracting key information to generate concise summaries.
- **Graph Databases**: Populating knowledge graphs with structured information.

### Example

Consider the following sentence:

> "Elias Jacob is a professor at the Federal University of Rio Grande do Norte, located in Natal, Brazil."

In this sentence, NER would identify and classify the named entities as:

- **`PERSON`**: Elias Jacob
- **`ORG`**: Federal University of Rio Grande do Norte
- **`GPE`**: Natal, Brazil
- **`PROFESSION`**: professor

> Note: GPE stands for Geopolitical Entity, which includes countries, cities, and states.
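
To make "structured data" concrete, the entities in the example above can be stored as character-offset spans. The snippet below is a minimal sketch in plain Python: the `Entity` dataclass and the `find_entity` helper are hypothetical names introduced only for illustration, and the labels are assigned by hand rather than predicted by a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    label: str
    start: int  # character offset where the entity begins
    end: int    # character offset just past the end of the entity


def find_entity(sentence: str, surface: str, label: str) -> Entity:
    # Locate the first occurrence of the surface form in the sentence
    start = sentence.index(surface)
    return Entity(surface, label, start, start + len(surface))


sentence = (
    "Elias Jacob is a professor at the Federal University of "
    "Rio Grande do Norte, located in Natal, Brazil."
)

# Labels are assigned by hand here; a real NER model would predict them
entities = [
    find_entity(sentence, "Elias Jacob", "PERSON"),
    find_entity(sentence, "Federal University of Rio Grande do Norte", "ORG"),
    find_entity(sentence, "Natal", "GPE"),
    find_entity(sentence, "Brazil", "GPE"),
]

for ent in entities:
    print(ent.label, ent.start, ent.end, repr(sentence[ent.start:ent.end]))
```

Storing offsets rather than only the strings keeps each entity tied to its exact position, which matters when the same name appears more than once in a document.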

### Visual Representation

To better understand how NER works, refer to the visual example below, which highlights named entities within a legal text:

<p align="center">
<img src="images/NER.png" alt="NER Example" style="width: 80%; height: 80%"/>
</p>

### Potential Questions

- **How does NER handle ambiguous cases?**
    - NER systems often rely on context and sophisticated algorithms to disambiguate entities. For instance, "Apple" could refer to a fruit or a tech company, and the surrounding text helps determine the correct classification.

- **What are the challenges in NER?**
    - Some of the main challenges include handling ambiguous entities, dealing with different languages and dialects, and recognizing entities in noisy or informal text (e.g., social media posts).

- **Can NER be customized for specific domains?**
    - Yes, NER systems can be trained on domain-specific data to improve their accuracy for particular applications, such as medical texts or legal documents.

In [2]:
# Import the necessary libraries from the spaCy package
import spacy
from spacy import displacy

# Load the pre-trained spaCy model for Portuguese
# 'pt_core_news_lg' is a large model with more accuracy and features
# You can also use 'pt_core_news_sm' for a smaller, faster model with fewer features
# To install the model, run: python -m spacy download pt_core_news_lg
nlp = spacy.load("pt_core_news_lg")

# Define a sample text in Portuguese for Named Entity Recognition (NER)
text_example = (
    "Meu nome é Elias Jacob e eu moro em Natal, Rio Grande do Norte. "
    "Eu trabalho no Instituto Metrópole Digital, que é a unidade mais bacana da UFRN. "
    "Quando a pandemia começou, eu estava com as malas prontas para uma viagem de férias para o Japão. "
    "Eu até fui buscar meu visto no Consulado em Recife, mas, quando chegou mais perto da viagem, meus voos foram todos cancelados pela United Airlines e eu não viajei. "
    "Somente em 2025 eu conseguirei ir à Ásia, mas, desta vez, iremos para a China. "
    "Em 2023, eu estive no show do Roger Waters em São Paulo. Sempre que visito SP, gosto de comer no restaurante Kidoairaku, meu restaurante japonês favorito fora de Natal. "
    "No RN, meu restaurante favorito chama-se 'En', um lugar super simples, simpático e agradável em Nísia Floresta."
)
# Process the text using the spaCy model
# This step performs tokenization, part-of-speech tagging, and named entity recognition
doc = nlp(text_example)

# Use the displaCy visualizer to render the named entities in the text
# The 'style' parameter is set to 'ent' to visualize named entities
# The 'jupyter' parameter is set to True to display the visualization in a Jupyter Notebook
displacy.render(doc, style="ent", jupyter=True)

# If you find any issues running the code above, please check the instructions on https://github.com/pytorch/captum/issues/936

In [3]:
def extract_named_entities_from_text(input_text: str) -> list[tuple[str, str]]:
    """
    Extract named entities from a given text.

    Args:
        input_text (str): The input text.

    Returns:
        list[tuple[str, str]]: A list of tuples where each tuple contains the named entity and its label.
    """
    # Parse the text with spaCy
    # The nlp object processes the input text and returns a parsed Doc object
    parsed_text = nlp(input_text)

    # Initialize an empty list to store the named entities
    named_entities = []

    # Iterate over the named entities in the parsed text
    # The parsed_text.ents attribute contains a list of named entities identified in the text
    for entity in parsed_text.ents:
        # Append the entity text and label to the list
        # entity.text is the named entity, and entity.label_ is its label (e.g., PER, ORG, LOC)
        named_entities.append((entity.text, entity.label_))

    # Return the list of named entities and their labels
    return named_entities


# Test the function with an example text
# This will extract named entities from the text_example and print them
extract_named_entities_from_text(text_example)

[('Elias Jacob', 'PER'),
 ('Natal', 'LOC'),
 ('Rio Grande do Norte', 'LOC'),
 ('Instituto Metrópole Digital', 'MISC'),
 ('UFRN', 'LOC'),
 ('Japão', 'LOC'),
 ('Consulado', 'MISC'),
 ('Recife', 'LOC'),
 ('United Airlines', 'ORG'),
 ('Ásia', 'LOC'),
 ('China', 'LOC'),
 ('Roger Waters', 'PER'),
 ('São Paulo', 'LOC'),
 ('SP', 'LOC'),
 ('Kidoairaku', 'LOC'),
 ('Natal', 'LOC'),
 ('RN', 'LOC'),
 ('En', 'MISC'),
 ('Nísia Floresta', 'LOC')]


# Illustrative Use Cases of NER

Let's dive deeper into real-world applications of NER:

- **Healthcare Text Analysis**: In clinical note analysis, NER can help identify diseases, symptoms, treatments, and medications, which can support better medical decision-making.
- **News Articles**: For a news reading app that curates articles, NER can help extract information about people, organizations, and locations mentioned in the article. This can then be used to categorize or tag the articles or improve article recommendations.
- **Customer Support**: In a customer support scenario, NER can identify and separate the important pieces of information in a customer’s query, such as name, email address, and phone number. This allows more efficient handling of customer requests.

Fundamentally, NER is useful across many domains because it turns unstructured text into structured, machine-readable data, which enables more sophisticated and nuanced analyses.

## Why not use regular expressions?

Regex is a powerful tool for text processing and pattern matching. However, it is not a good choice for NER, mainly because regular expressions do not generalize well to unseen data. For instance, to extract all the names of people in a text, we could use a regular expression such as `[A-Z][a-z]*\s[A-Z][a-z]*`.
This regular expression matches names made of a first name and a last name. However, it does not fully match names that have a middle name or initial, a hyphen, connectives (such as de, da, do, dos, das), or suffixes (such as Júnior or Neto). It also matches capitalized word pairs that are not names, such as "New York", and it captures titles as part of the name, as in "Doctor Silva".
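
The behaviour of this pattern is easy to check with Python's standard `re` module. The sentences below are made-up examples chosen only to illustrate the failure modes; they are not taken from the dataset.

```python
import re

# The naive "first name + last name" pattern discussed above
NAME_PATTERN = re.compile(r"[A-Z][a-z]*\s[A-Z][a-z]*")

examples = [
    "Elias Jacob teaches in Natal.",
    "Elias Jacob de Menezes Neto teaches in Natal.",
    "Doctor Silva moved to New York.",
]

for text in examples:
    print(text, "->", NAME_PATTERN.findall(text))
```

The first sentence yields the expected name. In the second, the connective "de" splits the full name into two separate fragments. In the third, the pattern treats a title plus surname and a city name as if they were person names.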

## Why not use a dictionary?

A dictionary is a good choice for NER when we want to extract a small, closed set of entities. It does not work well for large or open-ended sets. For instance, to extract all the names of people in a text, we would need a dictionary containing every person's name in the world. Such a dictionary would be very large, difficult to maintain, and always out of date as new names appear.
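
A minimal sketch of the dictionary approach makes the coverage problem visible. The `KNOWN_NAMES` set and the `gazetteer_lookup` helper are hypothetical and deliberately tiny; a real gazetteer would be far larger and would still face the same issue.

```python
# A tiny, hypothetical dictionary of person names (illustration only)
KNOWN_NAMES = {"Elias Jacob", "Roger Waters"}


def gazetteer_lookup(text: str, gazetteer: set[str]) -> list[str]:
    # Return the dictionary entries that appear verbatim in the text, sorted
    return sorted(name for name in gazetteer if name in text)


# Both names are in the dictionary, so both are found
print(gazetteer_lookup("Elias Jacob saw Roger Waters in São Paulo.", KNOWN_NAMES))

# "Maria Souza" is a valid name, but it is missing from the dictionary
print(gazetteer_lookup("Maria Souza filed the lawsuit.", KNOWN_NAMES))
```

The second call returns an empty list: any name that was never added to the dictionary is silently missed.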

# Weak Supervision for Named Entity Recognition

## Setting the Stage

Let's imagine a scenario where we want to train a Named Entity Recognition (NER) model to identify entities in a text. However, we have limited labeled data available for training. Specifically, imagine the following:

1. You work for the legal department of the local government.
2. Every day, people file lawsuits demanding that the government pay for drugs they may need to treat their diseases. Remember: according to the Brazilian Constitution, the government must provide free healthcare to everyone, including free drugs.
3. At the same time, the government has [UNICAT](http://www.unicat.rn.gov.br/), an agency responsible for providing drugs to the population. However, the agency doesn't know exactly which drugs the population is demanding.
4. Remember: each lawsuit represents an extra cost to the government. Therefore, it would be better to avoid lawsuits that are not necessary by making the drugs available to the population before the lawsuits are filed.
5. It would be very useful to have a system that could automatically identify the names of the drugs mentioned in the lawsuits. This would allow you to quickly analyze past lawsuits and identify patterns that could help you make better decisions in the future.
6. Your goal is to build a NER model that can identify the names of the drugs mentioned in the lawsuits, so that you can analyze the data and give UNICAT the information they need to make the drugs available to the population before the lawsuits are filed.

In this scenario, you have a limited number of labeled examples where the names of the drugs are annotated. You could manually annotate more examples, but this would be time-consuming and expensive. Instead, you decide to use weak supervision to train your NER model with limited labeled data.

## Our Data

You'll work with a dataset of legal documents containing lawsuits filed against the government. Each document contains a text description of the lawsuit, and your task is to identify the names of the drugs mentioned in these texts.
The data comes from the [LexCare.BR](https://github.com/eliasjacob/lexcare.br) dataset, which contains legal documents related to healthcare lawsuits in Brazil.
Your dataset is divided into four parts:
- **Train Gold Data**: Contains 914 legal documents with gold annotations for the names of the drugs. We'll use this at the end to verify whether our WSL pipeline can match the performance of a model trained with gold data.
- **Train WSL Unlabeled Data**: Contains 10,000 legal documents without any annotations. We'll apply the labeling functions to generate weak labels for these documents.
- **Development Data**: Contains 102 legal documents with gold annotations for evaluation.
- **Test Data**: Contains 254 legal documents with gold annotations for final evaluation.


Labels use [IOB tagging](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), where "B" indicates the beginning of an entity, and "I" indicates the continuation of an entity.
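
To illustrate how IOB tags encode entities, the sketch below groups `(token, tag)` pairs back into entity strings. The `iob_to_entities` helper is a hypothetical name, and the second entity in the example is invented to show a multi-token case; it is not taken from the dataset.

```python
def iob_to_entities(tagged_tokens):
    """Group (token, tag) pairs in IOB format into (entity_text, label) pairs."""
    entities = []
    current_tokens, current_label = [], None

    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            # A new entity starts: close the previous one, if any
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the entity that is currently open
            current_tokens.append(token)
        else:
            # "O" tag (or an inconsistent "I-" tag): close any open entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    # Close an entity that runs until the end of the sequence
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


tagged = [
    ("princípio", "O"), ("ativo", "O"), ("(", "O"),
    ("TEMOZOLOMIDA", "B-MEDICAMENTO"), (")", "O"),
    # Invented multi-token example: "B-" opens the entity, "I-" continues it
    ("ácido", "B-MEDICAMENTO"), ("zoledrônico", "I-MEDICAMENTO"),
]
print(iob_to_entities(tagged))
```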

In [4]:
df = pd.read_parquet("data/ner/dataset.parquet")
df

Unnamed: 0,uid,text,label_iob,label_span,split
37,bf9e07d680a0c85e8c4ff7b5ecbf0a3a,REQUERENTE: RONNY ROBERT BERTO FREITAS REQUERI...,"[[REQUERENTE, O], [:, O], [RONNY, O], [ROBERT,...",[],train
969,dab3c550c6ad85849fdeef689ecac95a,"No caso em comento, não foi produzido nos auto...","[[No, O], [caso, O], [em, O], [comento, O], [,...",[],train
611,79d65b619aa4afd1e9fea5ecced1d030,Cuida-se de pedido de tratamento público de sa...,"[[Cuida, O], [-, O], [se, O], [de, O], [pedido...","[{'end': 109, 'label': 'MEDICAMENTO', 'start':...",train
468,3ba71b948d5dd8bf0b3185f5ad4c430d,9. Súmula do julgamento: A Turma Recursal dos ...,"[[9, O], [., O], [Súmula, O], [do, O], [julgam...",[],train
769,b1e195ed146f6b10316c71856e45a1c6,"No tocante ao juízo de verossimilhança, apoiad...","[[No, O], [tocante, O], [ao, O], [juízo, O], [...","[{'end': 1156, 'label': 'MEDICAMENTO', 'start'...",train
...,...,...,...,...,...
9995,f051c712d6a96c70fb51f8d51b4758e5,"Com efeito, em que pese a CONITEC não ter reco...",,,unlabel
9996,d0eb685e20b7936ff19253ca5571dc32,I - Consoante o decidido pelo Plenário desta C...,,,unlabel
9997,679598e0b22e279c0c4769a32d6b1185,"Sendo assim, não obstante o Rituximabe (Mabthe...",,,unlabel
9998,9691214ffb6e0828d1d2883f1046661a,"9. No caso dos autos, o autor é detentor de di...",,,unlabel


In [5]:
df.split.value_counts()

split
unlabel    10000
train        914
test         254
valid        102
Name: count, dtype: int64

In [6]:
df_train = df[df.split == "train"].copy()
df_dev = df[df.split == "valid"].copy()
df_test = df[df.split == "test"].copy()
df_unlabeled = df[df.split == "unlabel"].copy()


df_train.reset_index(drop=True, inplace=True)
df_dev.reset_index(drop=True, inplace=True)
df_test.reset_index(drop=True, inplace=True)
df_unlabeled.reset_index(drop=True, inplace=True)

df_train.shape, df_dev.shape, df_test.shape, df_unlabeled.shape

((914, 5), (102, 5), (254, 5), (10000, 5))

In [7]:
df_train

Unnamed: 0,uid,text,label_iob,label_span,split
0,bf9e07d680a0c85e8c4ff7b5ecbf0a3a,REQUERENTE: RONNY ROBERT BERTO FREITAS REQUERI...,"[[REQUERENTE, O], [:, O], [RONNY, O], [ROBERT,...",[],train
1,dab3c550c6ad85849fdeef689ecac95a,"No caso em comento, não foi produzido nos auto...","[[No, O], [caso, O], [em, O], [comento, O], [,...",[],train
2,79d65b619aa4afd1e9fea5ecced1d030,Cuida-se de pedido de tratamento público de sa...,"[[Cuida, O], [-, O], [se, O], [de, O], [pedido...","[{'end': 109, 'label': 'MEDICAMENTO', 'start':...",train
3,3ba71b948d5dd8bf0b3185f5ad4c430d,9. Súmula do julgamento: A Turma Recursal dos ...,"[[9, O], [., O], [Súmula, O], [do, O], [julgam...",[],train
4,b1e195ed146f6b10316c71856e45a1c6,"No tocante ao juízo de verossimilhança, apoiad...","[[No, O], [tocante, O], [ao, O], [juízo, O], [...","[{'end': 1156, 'label': 'MEDICAMENTO', 'start'...",train
...,...,...,...,...,...
909,39d246fe5e3e2b54617b03733447e22c,JUSTIÇA FEDERAL DA 5a REGIÃO SEÇÃO JUDICIÁRIA ...,"[[JUSTIÇA, O], [FEDERAL, O], [DA, O], [5a, O],...",[],train
910,0061871fa52981dde4cac8d4b4acc6d2,"Trata-se de execução provisória, que determino...","[[Trata, O], [-, O], [se, O], [de, O], [execuç...",[],train
911,1d52ef97483cabbb900520db56f7047a,"[9][4] RE-AgR 534908/PE, AI-AgR 486816/RJ, RE-...","[[[, O], [9, O], [], O], [[, O], [4, O], [], O...",[],train
912,fd0396071d4f6bf5dbea71b188225010,"(.) Sobre os antiangiogênicos, é oportuno menc...","[[(, O], [., O], [), O], [Sobre, O], [os, O], ...","[{'end': 70, 'label': 'MEDICAMENTO', 'start': ...",train


In [8]:
def tuplify_list_of_arrays(list_of_arrays):
    return tuple(tuple(array) for array in list_of_arrays)


df_train["label_iob"] = df_train.label_iob.apply(tuplify_list_of_arrays)
df_dev["label_iob"] = df_dev.label_iob.apply(tuplify_list_of_arrays)
df_test["label_iob"] = df_test.label_iob.apply(tuplify_list_of_arrays)

In [9]:
df_dev

Unnamed: 0,uid,text,label_iob,label_span,split
0,b9d3df422ce015c153c50305030e0314,"Registre-se que, em razão do alto custo do med...","((Registre, O), (-, O), (se, O), (que, O), (,,...","[{'end': 201, 'label': 'MEDICAMENTO', 'start':...",valid
1,4cc46f51c096890a07e8adbdea07da84,Cuida-se de Ação Especial proposta por EDSON F...,"((Cuida, O), (-, O), (se, O), (de, O), (Ação, ...","[{'end': 275, 'label': 'MEDICAMENTO', 'start':...",valid
2,62c251c2b452ce68c0ef495820147dc2,CONSTITUCIONAL. TUTELA ESPECÍFICA DE OBRIGAÇÃO...,"((CONSTITUCIONAL, O), (., O), (TUTELA, O), (ES...",[],valid
3,a40e8231716223938444aa29e9858cef,"Ante o exposto, DOU PROVIMENTO aos embargos de...","((Ante, O), (o, O), (exposto, O), (,, O), (DOU...",[],valid
4,1b8fe98375220bcf35b98a06e2e4661a,"Colhe-se dos autos, de acordo com a perícia ju...","((Colhe, O), (-, O), (se, O), (dos, O), (autos...",[],valid
...,...,...,...,...,...
97,84d1d66c9cf1932a0d1c0b2bb9ed3e13,"De acordo com tais princípios, nenhuma espécie...","((De, O), (acordo, O), (com, O), (tais, O), (p...","[{'end': 2972, 'label': 'MEDICAMENTO', 'start'...",valid
98,cb45a88f1067ee34b63dd7ea9a7a29e9,ADMINISTRATIVO - MOLÉSTIA GRAVE - FORNECIMENTO...,"((ADMINISTRATIVO, O), (-, O), (MOLÉSTIA, O), (...",[],valid
99,fa66b746254c628a217135471fe92523,JUSTIÇA FEDERAL DA 5.a REGIÃO SEÇÃO JUDICIÁRIA...,"((JUSTIÇA, O), (FEDERAL, O), (DA, O), (5, O), ...",[],valid
100,86f874baa22edef6a14565988036b1a0,(iii) diante da necessidade de evitar cenário ...,"(((, O), (iii, O), (), O), (diante, O), (da, O...","[{'end': 658, 'label': 'MEDICAMENTO', 'start':...",valid


In [10]:
print(df_dev.iloc[0]["text"])

Registre-se que, em razão do alto custo do medicamento, dos próprios orçamentos apresentados pelo autor, bem como da presumida eficácia do medicamento genérico com o mesmo princípio ativo (TEMOZOLOMIDA), o fornecimento do medicamento genérico atende à determinação judicial. No caso de impedimento objetivo ao cumprimento do prazo de vinte dias, devidamente provado por documentos, faculta-se ao Estado de Pernambuco, em substituição, o depósito em dinheiro da quantia necessária e suficiente para a aquisição do medicamento para os primeiros 3 (três) ciclos do tratamento, segundo o valor de mercado, tomando-se por base o menor orçamento idôneo apresentado pelo autor (anexo 9). A fim de instruir melhor o feito, nos termos do acordo de cooperação institucional celebrado, em 18/12/2017, pelo Tribunal Regional Federal da 5a Região e pelo Tribunal de Justiça do Estado de Pernambuco, para a prestação de informações técnicas relevantes em demandas judiciais de saúde pelo Núcleo de Assessoria Técni

In [11]:
df_dev.iloc[0]["label_iob"]

(('Registre', 'O'),
 ('-', 'O'),
 ('se', 'O'),
 ('que', 'O'),
 (',', 'O'),
 ('em', 'O'),
 ('razão', 'O'),
 ('do', 'O'),
 ('alto', 'O'),
 ('custo', 'O'),
 ('do', 'O'),
 ('medicamento', 'O'),
 (',', 'O'),
 ('dos', 'O'),
 ('próprios', 'O'),
 ('orçamentos', 'O'),
 ('apresentados', 'O'),
 ('pelo', 'O'),
 ('autor', 'O'),
 (',', 'O'),
 ('bem', 'O'),
 ('como', 'O'),
 ('da', 'O'),
 ('presumida', 'O'),
 ('eficácia', 'O'),
 ('do', 'O'),
 ('medicamento', 'O'),
 ('genérico', 'O'),
 ('com', 'O'),
 ('o', 'O'),
 ('mesmo', 'O'),
 ('princípio', 'O'),
 ('ativo', 'O'),
 ('(', 'O'),
 ('TEMOZOLOMIDA', 'B-MEDICAMENTO'),
 (')', 'O'),
 (',', 'O'),
 ('o', 'O'),
 ('fornecimento', 'O'),
 ('do', 'O'),
 ('medicamento', 'O'),
 ('genérico', 'O'),
 ('atende', 'O'),
 ('à', 'O'),
 ('determinação', 'O'),
 ('judicial', 'O'),
 ('.', 'O'),
 ('No', 'O'),
 ('caso', 'O'),
 ('de', 'O'),
 ('impedimento', 'O'),
 ('objetivo', 'O'),
 ('ao', 'O'),
 ('cumprimento', 'O'),
 ('do', 'O'),
 ('prazo', 'O'),
 ('de', 'O'),
 ('vinte', 'O'),
 

## Why Nots?

Considering the scenario described above, let's discuss why other methods, when considered alone, might not be the best fit for this task.

### Why Not Manual Annotation?

While manual annotation is a reliable method for creating labeled datasets, it has several limitations:

- **Time-Consuming**: Manual annotation is labor-intensive and time-consuming, especially for large datasets.
- **Costly**: Hiring annotators or using crowdsourcing platforms can be expensive.
- **Subjectivity**: Annotations may vary between annotators, leading to inconsistencies.
- **Scalability Issues**: Manual annotation does not scale easily to large datasets or frequent updates.


### Why Not Use Regular Expressions?

While regular expressions (regex) are powerful for text processing and pattern matching, they have significant limitations for NER:

- **Lack of Generalization**: Regex rules are often too rigid to generalize well to unseen data. For example, a pattern designed to match names may not account for variations like middle names, hyphens, or cultural variations.
- **Overgeneralization**: Regex might incorrectly identify non-entity text that matches the pattern (e.g., mistaking "New York" for a person's name).
- **Contextual Blindness**: Regex cannot consider the surrounding context, which is vital for accurate entity recognition.
- **Maintenance Challenges**: As language evolves, maintaining an exhaustive set of regex patterns becomes increasingly complex and time-consuming.

### Why Not Use a Dictionary?

While dictionaries can be useful for NER in certain scenarios, they also have significant drawbacks:

- **Scalability Issues**: Maintaining an up-to-date dictionary for broad entity categories (like all person names) is practically impossible due to the vast and ever-changing nature of language.
- **Ambiguity Challenges**: Many words can be both entity and non-entity, making dictionary-based approaches prone to errors.
- **Incomplete Coverage**: A dictionary might miss newly coined terms or entities, leading to incomplete extraction.

> **Note**: Machine learning models can learn context and generalize from training data, making them more adaptable and accurate for NER tasks compared to regex or dictionary-based approaches.

### Why Not Use a Pre-Trained Model?

While pre-trained models offer a quick and effective way to perform NER, they may not always be suitable for specific use cases due to the following reasons:

- **Domain-Specific Knowledge**: Pre-trained models might not capture domain-specific entities or terminologies.
- **Fine-Grained Control**: Customizing pre-trained models for specific entity types or constraints can be challenging.
- **Data Privacy Concerns**: Using hosted pre-trained models (for example, through a third-party API) might expose sensitive data to external services, raising privacy concerns.

> **Note**: Training a custom NER model on domain-specific data can address these limitations and provide more tailored entity recognition capabilities.

### Why Not Use Zero-Shot Learning?

Zero-shot learning is a powerful technique that allows models to generalize to unseen classes. However, it has some limitations for NER tasks:

- **Limited Contextual Understanding**: Zero-shot learning may struggle with complex entity relationships and context-dependent entity recognition.
- **Fine-Grained Entity Recognition**: For tasks requiring fine-grained entity classification, zero-shot learning may not provide the necessary granularity.

> **Note**: Zero-shot learning can be a valuable tool for NER tasks, especially when dealing with novel entities or limited labeled data, but it may not always outperform supervised learning approaches in all scenarios.


While these methods have their advantages, they also come with limitations that can impact the accuracy, scalability, and adaptability of NER systems. The good news is that weak supervision lets us combine the strengths of these methods: it uses several noisy sources of supervision together to train an NER model effectively.

## New Tool: [Skweak](https://github.com/NorskRegnesentral/skweak)

Skweak is a Python library for weak supervision in Natural Language Processing (NLP), with a particular focus on sequence labeling tasks such as Named Entity Recognition (our case). Snorkel is a popular tool for weak supervision, but it does not directly support sequence labeling, so using it for NER would require too many workarounds.

### Features of Skweak

- **Labeling Functions**:
    - These are custom rules or heuristics that generate noisy labels based on patterns, external sources, or other criteria.
    - Labeling functions are essential for reducing the manual effort in data annotation.

- **Label Aggregation**:
    - Skweak combines the outputs of multiple labeling functions into a single, aggregated label for each data point.
    - This process leverages models like HMM to ensure that the final labels are as accurate as possible despite the noise in individual labeling functions.

- **Support for Named Entity Recognition (NER)**:
    - Skweak is particularly useful for NER tasks, allowing developers to create labeling functions aimed at recognizing entities within text.
    - This feature is vital for tasks requiring the identification of names, dates, locations, and other entities in text data.

### Additional Resources

- For more details on Skweak, refer to the [GitHub repository](https://github.com/NorskRegnesentral/skweak).
- The tool is discussed in depth in the paper [skweak: Weak Supervision Made Easy for NLP](https://arxiv.org/abs/2104.09683).

> **Note**: Understanding the basic principles of weak supervision and generative models like HMM can significantly enhance the effective use of Skweak. I recommend exploring these concepts further to exploit Skweak optimally for NER tasks.

## Workflow with skweak for Named Entity Recognition (NER)

### 1. Data Preparation

- **Organize Text Data**: Structure your text data in a format suitable for processing. This often involves tokenizing the text, performing basic preprocessing, and converting the text into the format used by skweak (spaCy `Doc` objects).

- **Create Data Splits**:
    - **Labeled Data**: Prepare a small set of examples manually annotated with entities for training and validation.
    - **Unlabeled Data**: Gather a larger corpus without annotations to apply skweak's weak supervision techniques.
    - **Held-out Test Set**: Reserve a set of data with ground truth annotations for final evaluation.

### 2. Labeling Functions Creation

Develop a set of labeling functions that generate noisy labels based on various heuristics and rules:

- **Patterns**: Use regular expressions or pattern-matching techniques (e.g., capitalization, specific sequences of words).
- **Gazetteers**: Apply lists of known entities.
- **Context-based Rules**: Identify entities based on surrounding words or typical contexts.
- **External Knowledge**: Call APIs or use databases such as Wikipedia for additional context.
- **Pre-trained Models**: Incorporate outputs from existing models like spaCy or the Google Natural Language API.

> Aim for coverage and diversity to capture different aspects of entity recognition. Start with high-precision rules and gradually introduce more complex functions for better recall.
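
The idea of a labeling function can be sketched without any library: a function that receives the tokens of a document and yields `(start, end, label)` token spans. The two functions below are hypothetical examples written in plain Python; skweak provides its own annotator classes for this purpose, and the gazetteer entries and dosage heuristic here are assumptions made only for illustration.

```python
import re

# Hypothetical gazetteer of drug names, lowercased (illustration only)
DRUG_GAZETTEER = {"temozolomida", "rituximabe"}

# Hypothetical heuristic: dosage units that often follow a drug name and a number
DOSAGE_UNIT = re.compile(r"^(mg|ml|mcg|g)$", re.IGNORECASE)


def lf_gazetteer(tokens):
    # Label every token found in the gazetteer
    for i, token in enumerate(tokens):
        if token.lower() in DRUG_GAZETTEER:
            yield i, i + 1, "MEDICAMENTO"


def lf_dosage_context(tokens):
    # Label a word when the next two tokens look like "<number> <unit>"
    for i in range(len(tokens) - 2):
        if tokens[i + 1].isdigit() and DOSAGE_UNIT.match(tokens[i + 2]):
            if tokens[i].isalpha():
                yield i, i + 1, "MEDICAMENTO"


tokens = "Fornecimento de Temozolomida 100 mg e Avastin 400 mg ao autor".split()
for lf in (lf_gazetteer, lf_dosage_context):
    print(lf.__name__, list(lf(tokens)))
```

The gazetteer function is precise but only finds the drug it already knows, while the context rule also catches the drug missing from the list. This is the kind of complementary coverage the note above asks for.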

### 3. Weak Supervision

- **Apply Labeling Functions**: Use your functions to generate noisy labels for the unlabeled data.
- **Aggregate Labels**: Use skweak's generative models (e.g., Hidden Markov Model) to combine these noisy labels into a coherent set of annotations. This step resolves conflicts and leverages the collective insights of all labeling functions.
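
As a rough illustration of aggregation, the sketch below implements a simple token-level majority vote in plain Python. It is not skweak's implementation: the `majority_vote` helper is hypothetical, it assumes that an `"O"` output means the labeling function abstained, and ties go to the tag that appears first. An HMM goes further by also estimating how reliable each labeling function is, which this sketch does not attempt.

```python
from collections import Counter


def majority_vote(votes_per_token):
    """Pick the most frequent non-"O" tag for each token; fall back to "O"."""
    aggregated = []
    for votes in votes_per_token:
        # Assumption: a labeling function that outputs "O" abstained on this token
        fired = [vote for vote in votes if vote != "O"]
        if not fired:
            aggregated.append("O")
            continue
        # most_common(1) returns the tag with the highest count
        tag, _count = Counter(fired).most_common(1)[0]
        aggregated.append(tag)
    return aggregated


# One inner list per token, one vote per labeling function (made-up values)
votes_per_token = [
    ["B-MEDICAMENTO", "B-MEDICAMENTO", "B-ORG"],  # conflict: two against one
    ["O", "O", "O"],                              # no function fired
    ["O", "B-MEDICAMENTO", "O"],                  # only one function fired
]
print(majority_vote(votes_per_token))
```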

### 4. Model Training

- **Prepare Training Data**: Combine the small set of manually labeled data with the weakly supervised data.
- **Choose a Model**: Select an appropriate NER model, such as a Conditional Random Field (CRF) or a Transformer-based model like BERT.
- **Train the Model**: Use the aggregated labels as targets. Consider transfer learning to exploit pre-existing language models.

### 5. Evaluation

- **Performance Assessment**: Evaluate the trained model on the held-out test set with ground truth annotations.
- **Metrics**: Measure performance using precision, recall, and F1 score. More advanced metrics like span-based F1 or partial matching can also be useful.
- **Error Analysis**: Examine misclassified examples to identify common errors and areas for improvement.
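The span-based metrics mentioned above can be sketched as follows, under the assumption of strict matching: a predicted entity counts as correct only if its start, end, and label all match a gold entity. The function name and the example spans are illustrative.

```python
def span_prf(gold_spans, predicted_spans):
    """Compute exact-match precision, recall, and F1 over (start, end, label) spans."""
    gold, pred = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(1, 2, "MEDICAMENTO"), (4, 5, "MEDICAMENTO"), (7, 8, "MEDICAMENTO")]
# The second prediction has a wrong boundary, so it counts as an error
pred = [(1, 2, "MEDICAMENTO"), (4, 6, "MEDICAMENTO")]

precision, recall, f1 = span_prf(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # P=0.50 R=0.33 F1=0.40
```

Partial-matching schemes would instead give some credit to the second prediction, which overlaps a gold entity but has the wrong boundary.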

### 6. Iterative Refinement
- **Refine Labeling Functions**: Improve existing functions or develop new ones based on the errors observed during evaluation.
- **Tune Aggregation Methods**: Experiment with different aggregation models or parameters to improve label quality.
- **Model Improvements**: Apply architectural changes, hyperparameter tuning, or advanced techniques to enhance model performance.
- **Data Augmentation**: Create targeted functions or increase data for underrepresented entity types.

> Remember that weak supervision and iterative refinement are key to improving the NER model over time. Each refinement cycle helps improve the model's precision and robustness.
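When deciding which labeling function to refine next, it helps to look at each function separately on the small labeled set. The sketch below computes two token-level diagnostics per function: coverage (how often it fires) and precision (how often it is right when it fires). The function name `lf_report` and the toy data are assumptions for this example.

```python
def lf_report(lf_outputs, gold, outside="O"):
    """Return {lf_name: (coverage, precision)} computed at the token level."""
    report = {}
    for name, labels in lf_outputs.items():
        # Keep only the tokens where the labeling function did not abstain
        fired = [(pred, true) for pred, true in zip(labels, gold) if pred != outside]
        coverage = len(fired) / len(gold) if gold else 0.0
        correct = sum(pred == true for pred, true in fired)
        precision = correct / len(fired) if fired else 0.0
        report[name] = (coverage, precision)
    return report


# Hypothetical gold labels and labeling-function outputs for four tokens
gold = ["O", "MEDICAMENTO", "O", "MEDICAMENTO"]
lf_outputs = {
    "gazetteer": ["O", "MEDICAMENTO", "O", "MEDICAMENTO"],
    "regex": ["MEDICAMENTO", "MEDICAMENTO", "O", "O"],
}

for name, (coverage, precision) in lf_report(lf_outputs, gold).items():
    print(f"{name}: coverage={coverage:.2f} precision={precision:.2f}")
# gazetteer: coverage=0.50 precision=1.00
# regex: coverage=0.50 precision=0.50
```

A function with low precision is a candidate for tightening, while low coverage across all functions suggests that a new function is needed.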


## Step 1 - Convert Text Data to spaCy Doc Objects

The first step in the skweak workflow is to convert your text data into spaCy `Doc` objects. spaCy is a popular NLP library that provides efficient tokenization, part-of-speech tagging, and named entity recognition. skweak's labeling functions read from and write to these `Doc` objects, so every text must go through the spaCy pipeline first.

In [None]:
# Import the spaCy library for natural language processing
# spaCy provides tools for tokenization, part-of-speech tagging, named entity recognition, and more
import spacy

# Import the skweak library for weak supervision
# skweak allows us to combine multiple weak supervision sources to create high-quality training data
import skweak

# Load the spaCy model for Portuguese
# 'pt_core_news_lg' is a large model with more accuracy and features
# You can also use 'pt_core_news_sm' for a smaller, faster model with fewer features
# To install the model, run: python -m spacy download pt_core_news_lg
nlp = spacy.load("pt_core_news_lg")

In [None]:
# Process the training dataset using the spaCy pipeline
# The nlp.pipe method processes the text in batches, which is more efficient than processing each text individually
# df_unlabeled['text'].values contains the text of the unlabeled documents used for training
spacy_docs_train = list(nlp.pipe(df_unlabeled["text"].values))

# Process the validation dataset using the spaCy pipeline
# df_dev['text'].values contains the text data from the validation dataset
spacy_docs_dev = list(nlp.pipe(df_dev["text"].values))

# Process the test dataset using the spaCy pipeline
# df_test['text'].values contains the text data from the test dataset
spacy_docs_test = list(nlp.pipe(df_test["text"].values))

# Save the processed spaCy documents to disk
# This avoids the need to run the spaCy pipeline again, saving time in future runs
# skweak.utils.docbin_writer writes the spaCy documents to a binary file
# The first argument is the list of spaCy documents, and the second argument is the file path
skweak.utils.docbin_writer(spacy_docs_train, "data/bin/ner/spacy_docs_train.bin")
skweak.utils.docbin_writer(spacy_docs_dev, "data/bin/ner/spacy_docs_dev.bin")
skweak.utils.docbin_writer(spacy_docs_test, "data/bin/ner/spacy_docs_test.bin")

Write to data/bin/ner/spacy_docs_train.bin...done
Write to data/bin/ner/spacy_docs_dev.bin...done
Write to data/bin/ner/spacy_docs_test.bin...done


In [14]:
# Load the processed spaCy documents from disk
# This avoids the need to run the spaCy pipeline again, saving time in future runs
# skweak.utils.docbin_reader reads the spaCy documents from a binary file
# The first argument is the file path, and the second argument is the name of the spaCy model used for processing

# Load the training documents
spacy_docs_train = skweak.utils.docbin_reader(
    "data/bin/ner/spacy_docs_train.bin", spacy_model_name="pt_core_news_lg"
)

# Load the validation documents
spacy_docs_dev = skweak.utils.docbin_reader(
    "data/bin/ner/spacy_docs_dev.bin", spacy_model_name="pt_core_news_lg"
)

# Load the test documents
spacy_docs_test = skweak.utils.docbin_reader(
    "data/bin/ner/spacy_docs_test.bin", spacy_model_name="pt_core_news_lg"
)

# Convert the loaded documents to lists
# This step ensures that the documents are in a list format, which is easier to work with in subsequent steps
spacy_docs_train = list(spacy_docs_train)
spacy_docs_dev = list(spacy_docs_dev)
spacy_docs_test = list(spacy_docs_test)

In [15]:
spacy_docs_train[:2]

[Trata-se de ação proposta por CLEÔNIA LÚCIA DA SILVA, com pedido de tutela antecipada, contra a UNIÃO, ESTADO DO RIO GRANDE DO NORTE e MUNICÍPIO DO NATAL, requerendo o fornecimento de medicamentos não disponibilizados pelos entes réus, qual sejam LIMBITROL, DEPAKOTE 500mg, FLUOXITINA 20mg, MIGRANE e RIVOTRIL, indispensáveis ao tratamento da patologia TRANSTORNO MISTO ANSIOSO E DEPRESSIVO (CID F41.2), da qual é portadora. II – FUNDAMENTAÇÃO Inicialmente, torna-se interessante discorrer acerca da questão preliminar suscitada pela União. Quanto à legitimidade passiva, em que pese divergência jurisprudencial neste ponto, tenho que os entes públicos das três esferas federativas respondem solidariamente, já que há uma sucessão de atos para o efetivo cumprimento dos serviços públicos de saúde prestados à população, seja quanto à origem dos recursos, os repasses legais, até a sua distribuição, demonstrando uma cadeia complexa de atos que resulta na solidariedade e, em conseqüência, na legitim

### Step 2 - Create Labeling Functions

Labeling functions are at the core of skweak's approach to weak supervision. These functions programmatically assign labels to text data, providing an efficient alternative to manual annotation. Each labeling function:

- Takes a spaCy Doc object as input
- Returns a list of spans with associated labels
- Operates at the token level for precise entity marking

For sequence labeling tasks (like Named Entity Recognition), these spans identify specific entities within the text. In text classification tasks (such as sentiment analysis), the span typically covers the entire text unit requiring classification, either a single sentence or the complete document.
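Because spans are defined over token offsets, they can be converted into the per-token BIO tags that most NER models train on. The sketch below assumes end-exclusive `(start, end, label)` spans, the same convention spaCy uses for token spans; the function name and the example sentence are illustrative.

```python
def spans_to_bio(n_tokens, spans):
    """Convert end-exclusive (start, end, label) token spans into BIO tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        # The first token of an entity gets B-, the remaining ones get I-
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


tokens = ["com", "ácido", "glutâmico", "e", "Rivotril"]
spans = [(1, 3, "MEDICAMENTO"), (4, 5, "MEDICAMENTO")]

for token, tag in zip(tokens, spans_to_bio(len(tokens), spans)):
    print(f"{token}\t{tag}")
# com        O
# ácido      B-MEDICAMENTO
# glutâmico  I-MEDICAMENTO
# e          O
# Rivotril   B-MEDICAMENTO
```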

Several heuristics can be used to create labeling functions with skweak. I recommend checking the [documentation](https://github.com/NorskRegnesentral/skweak/wiki/Step-1:-Labelling-functions) for more details.

#### 2.1 Using a predefined list of drugs (gazetteer)

One common approach to creating labeling functions is to use a predefined list of entities, also known as a gazetteer. In our case, we can use a list of drug names to create a labeling function that identifies drug entities in the text.
We'll load the drug list from the ["Preço Máximo ao Consumidor"](https://www.gov.br/anvisa/pt-br/assuntos/medicamentos/cmed/precos) (Maximum Price to the Consumer) database, which lists the maximum prices of drugs sold in Brazil. For each product it provides both the active substances and the brand name, and we will use both.
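Before building the real gazetteer, it may help to see what gazetteer matching does in principle. The sketch below is a simplified stand-in written in plain Python, not skweak's trie-based implementation: it stores each entry as a tuple of lowercase tokens and scans the text for the longest match at each position. The function names and the tiny entry list are assumptions for this example.

```python
def build_gazetteer(entries):
    """Store each entry as a tuple of lowercase tokens."""
    return {tuple(entry.lower().split()) for entry in entries}


def gazetteer_match(tokens, gazetteer, label="MEDICAMENTO"):
    """Return end-exclusive (start, end, label) spans for the longest matches."""
    max_len = max((len(entry) for entry in gazetteer), default=0)
    lowered = [t.lower() for t in tokens]
    spans = []
    i = 0
    while i < len(lowered):
        match_end = None
        # Try the longest candidate first so multi-word names win over their prefixes
        for length in range(min(max_len, len(lowered) - i), 0, -1):
            if tuple(lowered[i : i + length]) in gazetteer:
                match_end = i + length
                break
        if match_end is None:
            i += 1
        else:
            spans.append((i, match_end, label))
            i = match_end
    return spans


gazetteer = build_gazetteer(["ÁCIDO GLUTÂMICO", "ACIDO GLUTAMICO", "RIVOTRIL", "LIDOCAÍNA"])
tokens = "medicado com ácido glutâmico , lidocaína e Rivotril .".split()
print(gazetteer_match(tokens, gazetteer))
# [(2, 4, 'MEDICAMENTO'), (5, 6, 'MEDICAMENTO'), (7, 8, 'MEDICAMENTO')]
```

Matching is only as good as the entry list, which is why the cells that follow spend most of their effort on normalizing names: case, accents, hydration suffixes, and salts.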

In [16]:
pmc = pd.read_excel("data/ner/pmc_20250307.xls", header=None)
pmc

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,64,65,66,67,68,69,70,71,72,73
0,Secretaria Executiva - CMED,,,,,,,,,,...,,,,,,,,,,
1,LISTA DE PREÇOS DE MEDICAMENTOS - PREÇOS FÁBRI...,,,,,,,,,,...,,,,,,,,,,
2,"Publicada em 07/03/2025 às 15h00min, atualizad...",,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,Esta lista apresenta os preços dos medicamento...,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25745,ÓXIDO CÚPRICO;ACETATO DE RACEALFATOCOFEROL;BET...,60.726.692/0001-81,MARJAN INDÚSTRIA E COMÉRCIO LTDA,524825010013107,1015500910166,7896226100272,-,-,VITERGAN ZINCO,COM REV CT BL AL PLAS PVC/PE/PVDC TRANS X 4,...,14.10,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,
25746,ÓXIDO CÚPRICO;ACETATO DE RACEALFATOCOFEROL;BET...,60.726.692/0001-81,MARJAN INDÚSTRIA E COMÉRCIO LTDA,524825010013207,1015500910174,7896226100265,-,-,VITERGAN ZINCO,COM REV CT BL AL PLAS PVC/PE/PVDC TRANS X 4,...,16.22,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,
25747,ÓXIDO CÚPRICO;SELENATO DE SÓDIO;ACETATO DE RAC...,60.659.463/0029-92,ACHE LABORATORIOS FARMACEUTICOS SA,500500101118422,1057302060014,7896658000010,-,-,ACCUVIT,COM REV CT FR PLAS OPC X 30,...,142.63,Não,Não,Não,Não,,Negativa,Sim,Tarja Sem Tarja,
25748,ÓXIDO DE MAGNÉSIO;SIMETICONA;HIDRÓXIDO DE ALUM...,61.190.096/0001-92,EUROFARMA LABORATORIOS S.A.,508011804138416,1004306960107,7891317469610,7891317020118,-,SIMECO PLUS,120 MG/ML + 60 MG/ML + 7 MG/ML SUS OR CT FR VD...,...,14.60,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,


In [17]:
pmc.head(50)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,64,65,66,67,68,69,70,71,72,73
0,Secretaria Executiva - CMED,,,,,,,,,,...,,,,,,,,,,
1,LISTA DE PREÇOS DE MEDICAMENTOS - PREÇOS FÁBRI...,,,,,,,,,,...,,,,,,,,,,
2,"Publicada em 07/03/2025 às 15h00min, atualizad...",,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,Esta lista apresenta os preços dos medicamento...,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,A lista de Preços de Medicamentos contempla o ...,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,Nesta lista foi incluída a alíquota de ICMS 0%...,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,


In [18]:
# drop the first 41 rows
pmc = pmc.drop(range(41))
pmc

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,64,65,66,67,68,69,70,71,72,73
41,SUBSTÂNCIA,CNPJ,LABORATÓRIO,CÓDIGO GGREM,REGISTRO,EAN 1,EAN 2,EAN 3,PRODUTO,APRESENTAÇÃO,...,PMC 23 % ALC,RESTRIÇÃO HOSPITALAR,CAP,CONFAZ 87,ICMS 0%,ANÁLISE RECURSAL,LISTA DE CONCESSÃO DE CRÉDITO TRIBUTÁRIO (PIS/...,COMERCIALIZAÇÃO 2022,TARJA,DESTINAÇÃO COMERCIAL
42,21-ACETATO DE DEXAMETASONA;CLOTRIMAZOL,18.459.628/0001-15,BAYER S.A.,538912020009303,1705600230032,7891106000956,-,-,BAYCUTEN N,"10 MG/G + 0,443 MG/G CREM DERM CT BG AL X 40 G",...,46.30,Não,Não,Não,Não,,Negativa,Sim,- (*),
43,ABATACEPTE,56.998.982/0001-07,BRISTOL-MYERS SQUIBB FARMACÊUTICA LTDA,505107701157215,1018003900019,7896016806469,-,-,ORENCIA,250 MG PO LIOF SOL INJ CT 1 FA + SER DESCARTÁVEL,...,,Sim,Sim,Não,Não,,Positiva,Sim,Tarja Vermelha,
44,ABATACEPTE,56.998.982/0001-07,BRISTOL-MYERS SQUIBB FARMACÊUTICA LTDA,505113100020505,1018003900078,7896016808197,-,-,ORENCIA,125 MG/ML SOL INJ SC CT 4 SER PREENC VD TRANS ...,...,11529.15,Não,Sim,Sim,Não,,Positiva,Sim,- (*),
45,ABEMACICLIBE,43.940.618/0001-44,ELI LILLY DO BRASIL LTDA,507619060021902,1126001990018,7896382708442,-,-,VERZENIOS,50 MG COM REV CT BL AL AL X 30,...,4874.97,Não,Não,Não,Não,,Negativa,Sim,Tarja Vermelha,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25745,ÓXIDO CÚPRICO;ACETATO DE RACEALFATOCOFEROL;BET...,60.726.692/0001-81,MARJAN INDÚSTRIA E COMÉRCIO LTDA,524825010013107,1015500910166,7896226100272,-,-,VITERGAN ZINCO,COM REV CT BL AL PLAS PVC/PE/PVDC TRANS X 4,...,14.10,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,
25746,ÓXIDO CÚPRICO;ACETATO DE RACEALFATOCOFEROL;BET...,60.726.692/0001-81,MARJAN INDÚSTRIA E COMÉRCIO LTDA,524825010013207,1015500910174,7896226100265,-,-,VITERGAN ZINCO,COM REV CT BL AL PLAS PVC/PE/PVDC TRANS X 4,...,16.22,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,
25747,ÓXIDO CÚPRICO;SELENATO DE SÓDIO;ACETATO DE RAC...,60.659.463/0029-92,ACHE LABORATORIOS FARMACEUTICOS SA,500500101118422,1057302060014,7896658000010,-,-,ACCUVIT,COM REV CT FR PLAS OPC X 30,...,142.63,Não,Não,Não,Não,,Negativa,Sim,Tarja Sem Tarja,
25748,ÓXIDO DE MAGNÉSIO;SIMETICONA;HIDRÓXIDO DE ALUM...,61.190.096/0001-92,EUROFARMA LABORATORIOS S.A.,508011804138416,1004306960107,7891317469610,7891317020118,-,SIMECO PLUS,120 MG/ML + 60 MG/ML + 7 MG/ML SUS OR CT FR VD...,...,14.60,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,


In [19]:
# Make the first row the header
pmc.columns = pmc.iloc[0]
pmc = pmc.drop(41)
pmc

41,SUBSTÂNCIA,CNPJ,LABORATÓRIO,CÓDIGO GGREM,REGISTRO,EAN 1,EAN 2,EAN 3,PRODUTO,APRESENTAÇÃO,...,PMC 23 % ALC,RESTRIÇÃO HOSPITALAR,CAP,CONFAZ 87,ICMS 0%,ANÁLISE RECURSAL,LISTA DE CONCESSÃO DE CRÉDITO TRIBUTÁRIO (PIS/COFINS),COMERCIALIZAÇÃO 2022,TARJA,DESTINAÇÃO COMERCIAL
42,21-ACETATO DE DEXAMETASONA;CLOTRIMAZOL,18.459.628/0001-15,BAYER S.A.,538912020009303,1705600230032,7891106000956,-,-,BAYCUTEN N,"10 MG/G + 0,443 MG/G CREM DERM CT BG AL X 40 G",...,46.30,Não,Não,Não,Não,,Negativa,Sim,- (*),
43,ABATACEPTE,56.998.982/0001-07,BRISTOL-MYERS SQUIBB FARMACÊUTICA LTDA,505107701157215,1018003900019,7896016806469,-,-,ORENCIA,250 MG PO LIOF SOL INJ CT 1 FA + SER DESCARTÁVEL,...,,Sim,Sim,Não,Não,,Positiva,Sim,Tarja Vermelha,
44,ABATACEPTE,56.998.982/0001-07,BRISTOL-MYERS SQUIBB FARMACÊUTICA LTDA,505113100020505,1018003900078,7896016808197,-,-,ORENCIA,125 MG/ML SOL INJ SC CT 4 SER PREENC VD TRANS ...,...,11529.15,Não,Sim,Sim,Não,,Positiva,Sim,- (*),
45,ABEMACICLIBE,43.940.618/0001-44,ELI LILLY DO BRASIL LTDA,507619060021902,1126001990018,7896382708442,-,-,VERZENIOS,50 MG COM REV CT BL AL AL X 30,...,4874.97,Não,Não,Não,Não,,Negativa,Sim,Tarja Vermelha,
46,ABEMACICLIBE,43.940.618/0001-44,ELI LILLY DO BRASIL LTDA,507619060022102,1126001990034,7896382708466,-,-,VERZENIOS,100 MG COM REV CT BL AL AL X 30,...,9749.90,Não,Não,Não,Não,,Negativa,Sim,Tarja Vermelha,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25745,ÓXIDO CÚPRICO;ACETATO DE RACEALFATOCOFEROL;BET...,60.726.692/0001-81,MARJAN INDÚSTRIA E COMÉRCIO LTDA,524825010013107,1015500910166,7896226100272,-,-,VITERGAN ZINCO,COM REV CT BL AL PLAS PVC/PE/PVDC TRANS X 4,...,14.10,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,
25746,ÓXIDO CÚPRICO;ACETATO DE RACEALFATOCOFEROL;BET...,60.726.692/0001-81,MARJAN INDÚSTRIA E COMÉRCIO LTDA,524825010013207,1015500910174,7896226100265,-,-,VITERGAN ZINCO,COM REV CT BL AL PLAS PVC/PE/PVDC TRANS X 4,...,16.22,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,
25747,ÓXIDO CÚPRICO;SELENATO DE SÓDIO;ACETATO DE RAC...,60.659.463/0029-92,ACHE LABORATORIOS FARMACEUTICOS SA,500500101118422,1057302060014,7896658000010,-,-,ACCUVIT,COM REV CT FR PLAS OPC X 30,...,142.63,Não,Não,Não,Não,,Negativa,Sim,Tarja Sem Tarja,
25748,ÓXIDO DE MAGNÉSIO;SIMETICONA;HIDRÓXIDO DE ALUM...,61.190.096/0001-92,EUROFARMA LABORATORIOS S.A.,508011804138416,1004306960107,7891317469610,7891317020118,-,SIMECO PLUS,120 MG/ML + 60 MG/ML + 7 MG/ML SUS OR CT FR VD...,...,14.60,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,


In [20]:
substance_names = pmc["SUBSTÂNCIA"].values
substance_names[:5]

array(['21-ACETATO DE DEXAMETASONA;CLOTRIMAZOL', 'ABATACEPTE',
       'ABATACEPTE', 'ABEMACICLIBE', 'ABEMACICLIBE'], dtype=object)

In [21]:
brand_names = pmc["PRODUTO"].values
brand_names[:5]

array(['BAYCUTEN N', 'ORENCIA', 'ORENCIA', 'VERZENIOS', 'VERZENIOS'],
      dtype=object)

In [22]:
import re

# Initialize an empty list to store drug names
drugs = []

# Split pharmacological substances
for substance in substance_names:
    # Split the substance names by commas, semicolons, or plus signs
    split_substance = re.split(r"[,;\+]", substance)
    # Strip any leading or trailing whitespace from each split part
    split_substance = [s.strip() for s in split_substance]
    # Extend the drugs list with the split parts
    drugs.extend(split_substance)

# Extend the drugs list with brand names
drugs.extend(brand_names)

# Remove duplicates by converting the list to a set and back to a list
drugs = list(set(drugs))

In [23]:
drugs[:10]

['XYLOCAINA',
 'NEUTROFER',
 'POLISSACARÍDEO DE NEISSERIA MENINGITIDIS DO SOROGRUPO W CONJUGADO A PROTEÍNA CARREADORA TOXOIDE TETÂNICO',
 'FLAGYL NISTATINA',
 'MIOCALVEN D',
 'ESPIRONOLACTONA',
 'RENOVI B',
 'SENE HERBARIUM',
 'STREPTOCOCCUS PYOGENES',
 'RENALAPRIL']

In [24]:
"ACIDO GLUTAMICO" in drugs

False

In [25]:
from helpers.text import remove_accented_characters

# Add an unaccented variant of every drug name (e.g. "ÁCIDO" -> "ACIDO")
# Keeping both forms lets the gazetteer match text written with or without accents
drugs.extend([remove_accented_characters(d) for d in drugs])

# Remove duplicates by converting the list to a set and back to a list
# This ensures that each drug name appears only once in the list
drugs = list(set(drugs))

# Display the first 10 drug names
# This is useful for quickly checking the contents of the list
drugs[:10]

['XYLOCAINA',
 'NEUTROFER',
 'POLISSACARÍDEO DE NEISSERIA MENINGITIDIS DO SOROGRUPO W CONJUGADO A PROTEÍNA CARREADORA TOXOIDE TETÂNICO',
 'FLAGYL NISTATINA',
 'POLIOVIRUS TIPO 1 (MAHONEY)',
 'MIOCALVEN D',
 'ESPIRONOLACTONA',
 'ACIDO PARAMINOBENZOICO',
 'RENOVI B',
 'SENE HERBARIUM']

In [26]:
"ACIDO GLUTAMICO" in drugs

True

In [27]:
# Convert all drug names to lowercase
# This ensures that the comparison of drug names is case-insensitive
# The list comprehension iterates over each drug name in the 'drugs' list and converts it to lowercase
drugs = [d.lower() for d in drugs]

# Remove duplicate drug names
# The set() function removes duplicates by converting the list to a set, which only keeps unique elements
# The list() function converts the set back to a list
drugs = list(set(drugs))

# Display the first 10 drug names
# This allows us to inspect a sample of the processed drug names
# The slicing operation [:10] retrieves the first 10 elements from the 'drugs' list
drugs[:10]

['castanha da india globo',
 'cis',
 'abatacepte',
 'ritmonorm',
 'lfm-paracetamol',
 'senan',
 'lfm-aciclovir',
 'adapaleno',
 'concárdio',
 'soro anti-rabico']

In [28]:
def remove_hydrated_compounds(text: str) -> str:
    """
    Remove words related to hydrated compounds from the input text.

    Args:
        text (str): The input text containing chemical compound names.

    Returns:
        str: The text with hydrated compound names removed.
    """
    # Replace words like "monoidratada" with an empty string
    cleaned_text = re.sub(r" \b\S*idratad[oa]\b", "", text, flags=re.IGNORECASE)

    # Remove any extra spaces that may result from the substitution
    cleaned_text = re.sub(r"\s+", " ", cleaned_text).strip()

    return cleaned_text


print(
    remove_hydrated_compounds("ACIDO GLUTAMICO MONOIDRATADO")
)  # Expected: 'ACIDO GLUTAMICO'
print(
    remove_hydrated_compounds("cloridrato de lidocaína monoidratada")
)  # Expected: 'cloridrato de lidocaína'
print(remove_hydrated_compounds("Metilprednisolona"))  # Expected: 'Metilprednisolona'

ACIDO GLUTAMICO
cloridrato de lidocaína
Metilprednisolona


In [29]:
# Apply the remove_hydrated_compounds function to each drug name in the drugs list
drugs = [remove_hydrated_compounds(d) for d in drugs]

# Remove duplicates by converting the list to a set and back to a list
# This ensures that each drug name appears only once in the list
drugs = list(set(drugs))

# Calculate the number of unique drug names
# This is useful for understanding the size of the dataset after processing
len(drugs)

8152

In [30]:
def extract_active_ingredient(full_name: str) -> str:
    """
    Extract the active ingredient name from the full medication name.

    Args:
        full_name (str): The full name of the medication.

    Returns:
        str: The name of the active ingredient.
    """
    # Prepositions that separate the salt (e.g. "cloridrato") from the active ingredient that follows
    indicators = ["de", "do", "da", "dos", "das"]

    # Split the full name into words and convert to lowercase
    words = full_name.lower().split()

    # Find the index of the first indicator word
    active_ingredient_index = None
    for i, word in enumerate(words):
        if word in indicators and i < len(words) - 1:
            active_ingredient_index = i
            break

    # If an indicator is found, return the words after it
    if active_ingredient_index is not None:
        return " ".join(words[active_ingredient_index + 1 :])
    else:
        # If no indicator is found, return the full name
        return full_name


# Examples of usage
print(
    extract_active_ingredient("micofenolato de pantoprazol")
)  # Should return "pantoprazol"
print(
    extract_active_ingredient("cloridrato de metformina")
)  # Should return "metformina"
print(extract_active_ingredient("paracetamol"))  # Should return "paracetamol"

pantoprazol
metformina
paracetamol


In [31]:
# Apply the extract_active_ingredient function to each drug name in the drugs list
drugs.extend([extract_active_ingredient(d) for d in drugs])

# Remove duplicates by converting the list to a set and back to a list
# This ensures that each active ingredient appears only once in the list
drugs = list(set(drugs))

In [32]:
# Filter the drugs list to keep only the names with more than 4 characters
# This step helps to remove very short names that are likely not meaningful drug names
drugs = [d for d in drugs if len(d) > 4]

# Calculate the number of remaining drug names
# This gives an idea of how many drug names are left after filtering
len(drugs)

8722

In [33]:
drugs[:5]

['castanha da india globo',
 'abatacepte',
 'ritmonorm',
 'lfm-paracetamol',
 'senan']

In [34]:
# Import the json module to handle JSON data
import json

# Initialize an empty dictionary to store the drug names
json_drugs = {}

# Add the list of drug names to the dictionary under the key 'MEDICAMENTO'
json_drugs["MEDICAMENTO"] = drugs

# Save the dictionary as a JSON file
# The ensure_ascii=False parameter allows for non-ASCII characters to be saved correctly
# The indent=4 parameter makes the JSON file more readable by adding indentation
with open("data/ner/drugs_gazetteer.json", "w") as f:
    json.dump(json_drugs, f, ensure_ascii=False, indent=4)

# Load the JSON file into a gazetteer for weak supervision
# The extract_json_data function reads the JSON file and prepares it for use with skweak
tries_drugs = skweak.gazetteers.extract_json_data(
    "data/ner/drugs_gazetteer.json", spacy_model="pt_core_news_lg"
)

# Create a GazetteerAnnotator for labeling the data
# The GazetteerAnnotator uses the gazetteer to annotate text with drug names
# The case_sensitive=False parameter makes the annotation case-insensitive
lf_drugs_gazetteer = skweak.gazetteers.GazetteerAnnotator(
    "drugs_gazetteer", tries_drugs, case_sensitive=False
)

Extracting data from data/ner/drugs_gazetteer.json
Populating trie for class MEDICAMENTO (number: 8722)


In [None]:
# Define a sample text containing drug names for testing the function
text = "O paciente foi medicado com ácido glutâmico monoidratado , Cloridrato de lidocaína e Rivotril ."

# Process the text using the spaCy NLP pipeline
# This step tokenizes the text and applies linguistic annotations
doc = nlp(text)

# Apply the GazetteerAnnotator to the processed text
# This annotates the text with drug names using the gazetteer
lf_drugs_gazetteer(doc)

# Display the annotated entities in the text
# This function highlights the recognized drug names in the text for visualization
skweak.utils.display_entities(doc, "drugs_gazetteer")


# You may get an error in this cell because of recent changes in IPython's display functionality.
# If so, read the error message and edit the indicated skweak utils file inside your virtual environment.

In [36]:
# Select the first document from the training set of spaCy documents
# This document will be used to demonstrate the annotation process
doc = spacy_docs_train[0]

# Apply the GazetteerAnnotator to the selected document
# This annotates the document with drug names using the gazetteer
lf_drugs_gazetteer(doc)

# Display the annotated entities in the document
# This function highlights the recognized drug names in the document for visualization
skweak.utils.display_entities(doc, "drugs_gazetteer")

In [37]:
# Select the second document from the training set of spaCy documents
# This document will be used to demonstrate the annotation process
doc = spacy_docs_train[1]

# Apply the GazetteerAnnotator to the selected document
# This annotates the document with drug names using the gazetteer
lf_drugs_gazetteer(doc)

# Display the annotated entities in the document
# This function highlights the recognized drug names in the document for visualization
skweak.utils.display_entities(doc, "drugs_gazetteer")

#### 2.2 Utilizing Pretrained Transformer Models for Labeling Functions

In addition to rule-based approaches, pretrained language models offer a powerful alternative for constructing labeling functions. These models, trained on vast amounts of text data, can be employed to identify entities within our target text.

This section focuses on using a pretrained Named Entity Recognition (NER) model from the Hugging Face Transformers library to build a labeling function specifically for identifying **drug entities**. We'll be applying the `pucpr/clinicalnerpt-chemical` model. This model is particularly well-suited for our purpose as it has been fine-tuned on a corpus of clinical text data, enabling it to effectively recognize chemical entities, including drug names.

**Why this model?**

- **Domain Specificity:** Fine-tuning on clinical text makes the model more accurate for our use case compared to a general-purpose NER model.
- **Direct Applicability:** The model's output directly aligns with our goal of identifying drug entities, simplifying the labeling function creation process.

This approach leverages the power of transfer learning, allowing us to benefit from the extensive training these models have undergone and apply their knowledge to our specific task. You can find more details about this model on its Hugging Face model card: [https://huggingface.co/pucpr/clinicalnerpt-chemical](https://huggingface.co/pucpr/clinicalnerpt-chemical).

In [38]:
# Import the pipeline function from the transformers library
# This function is used to create a named entity recognition (NER) pipeline
from transformers import pipeline

# Create an NER pipeline using a pre-trained model for clinical chemical entities
# The model 'pucpr/clinicalnerpt-chemical' is specifically trained for recognizing chemical entities in clinical texts
# aggregation_strategy="first" groups sub-word tokens into whole words and labels each word with the tag of its first sub-token
# device=0 runs the pipeline on the first GPU; use device=-1 to run on the CPU
ner_pipeline = pipeline(
    "ner", model="pucpr/clinicalnerpt-chemical", aggregation_strategy="first", device=0
)

# Apply the NER pipeline to the text of the spaCy document
# This step performs named entity recognition on the text, identifying chemical entities
ner_results = ner_pipeline(doc.text)

# Display the NER results
# This will show the recognized chemical entities along with their positions and labels
print(ner_results)

[{'entity_group': 'ChemicalDrugs', 'score': 0.77123636, 'word': 'olanzapina', 'start': 45, 'end': 55}, {'entity_group': 'ChemicalDrugs', 'score': 0.79542863, 'word': 'olanzapina', 'start': 284, 'end': 294}, {'entity_group': 'ChemicalDrugs', 'score': 0.79000133, 'word': 'olanzapina', 'start': 632, 'end': 642}, {'entity_group': 'ChemicalDrugs', 'score': 0.78814346, 'word': 'olanzapina', 'start': 760, 'end': 770}, {'entity_group': 'ChemicalDrugs', 'score': 0.77936846, 'word': 'olanzapina', 'start': 906, 'end': 916}, {'entity_group': 'ChemicalDrugs', 'score': 0.7389247, 'word': 'litio', 'start': 965, 'end': 970}, {'entity_group': 'ChemicalDrugs', 'score': 0.93128145, 'word': 'valproato', 'start': 974, 'end': 983}, {'entity_group': 'ChemicalDrugs', 'score': 0.8313691, 'word': 'olanzapina', 'start': 1148, 'end': 1158}, {'entity_group': 'ChemicalDrugs', 'score': 0.95426273, 'word': 'cloridrato de bupropiona', 'start': 1338, 'end': 1362}]
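The pipeline reports each entity with character offsets (`start`, `end`), whereas skweak spans are expressed in token offsets. The sketch below shows one way to map between the two, using whitespace tokens so it runs without extra dependencies; the helper names and the example text are assumptions. On a real spaCy `Doc`, `Doc.char_span` serves the same purpose.

```python
import re


def whitespace_token_offsets(text):
    """Return the (start_char, end_char) offsets of each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(token_offsets, start_char, end_char):
    """Return the end-exclusive token span overlapping [start_char, end_char), or None."""
    covered = [
        i
        for i, (tok_start, tok_end) in enumerate(token_offsets)
        if tok_start < end_char and tok_end > start_char
    ]
    if not covered:
        return None
    return covered[0], covered[-1] + 1


text = "uso de cloridrato de bupropiona e litio"
offsets = whitespace_token_offsets(text)

# Hypothetical entities in the same shape as the pipeline output above
entities = []
for word in ["cloridrato de bupropiona", "litio"]:
    start = text.index(word)
    entities.append({"entity_group": "ChemicalDrugs", "start": start, "end": start + len(word)})

token_spans = [char_span_to_token_span(offsets, e["start"], e["end"]) for e in entities]
print(token_spans)  # [(2, 5), (6, 7)]
```

Note that this overlap-based mapping expands an entity to whole tokens when the model's boundaries fall inside a token.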


Here's how the pipeline object works for any given text:

In [39]:
from helpers.ner import render_entity_data_from_pipeline

# Define a sample text containing drug names for testing the NER pipeline
sample_text = "O paciente foi medicado com ácido glutâmico monoidratado, Cloridrato de lidocaína e Rivotril."

# Apply the NER pipeline to the sample text
# This step performs named entity recognition on the text, identifying chemical entities
ner_results = ner_pipeline(sample_text)

# Render the annotated entities in the sample text
# This function highlights the recognized chemical entities in the text for visualization
render_entity_data_from_pipeline(sample_text, ner_results)

In [40]:
df_lf_transfomer_1 = pd.DataFrame(
    [doc.text for doc in spacy_docs_train], columns=["text"]
)
df_lf_transfomer_1

Unnamed: 0,text
0,Trata-se de ação proposta por CLEÔNIA LÚCIA DA...
1,Informações sobre os medicamentos postulados O...
2,A negativa de fornecimento de um medicamento d...
3,"A parte autora requer, em caráter de urgência,..."
4,Já o anexo 07 menciona que: “Na ocasião de sua...
...,...
9995,"Com efeito, em que pese a CONITEC não ter reco..."
9996,I - Consoante o decidido pelo Plenário desta C...
9997,"Sendo assim, não obstante o Rituximabe (Mabthe..."
9998,"9. No caso dos autos, o autor é detentor de di..."


In [41]:
# This cell may take a few minutes to run, depending on your hardware. For me, it took about 3 minutes.

df_lf_transfomer_1["weak_label"] = ner_pipeline(df_lf_transfomer_1.text.values.tolist())
df_lf_transfomer_1

Unnamed: 0,text,weak_label
0,Trata-se de ação proposta por CLEÔNIA LÚCIA DA...,"[{'entity_group': 'ChemicalDrugs', 'score': 0...."
1,Informações sobre os medicamentos postulados O...,"[{'entity_group': 'ChemicalDrugs', 'score': 0...."
2,A negativa de fornecimento de um medicamento d...,"[{'entity_group': 'ChemicalDrugs', 'score': 0...."
3,"A parte autora requer, em caráter de urgência,...",[]
4,Já o anexo 07 menciona que: “Na ocasião de sua...,"[{'entity_group': 'ChemicalDrugs', 'score': 0...."
...,...,...
9995,"Com efeito, em que pese a CONITEC não ter reco...","[{'entity_group': 'ChemicalDrugs', 'score': 0...."
9996,I - Consoante o decidido pelo Plenário desta C...,"[{'entity_group': 'ChemicalDrugs', 'score': 0...."
9997,"Sendo assim, não obstante o Rituximabe (Mabthe...","[{'entity_group': 'ChemicalDrugs', 'score': 0...."
9998,"9. No caso dos autos, o autor é detentor de di...",[]


In [42]:
import torch
import gc

ner_pipeline = None
gc.collect()
torch.cuda.empty_cache()

In [43]:
def remap_label(label_key, label_mapping, label_list_of_dicts):
    """
    Remap a label key using a label mapping dictionary.

    Args:
        label_key (str): The label key to remap.
        label_mapping (dict): The label mapping dictionary.
        label_list_of_dicts (list[dict]): A list of dictionaries containing the labels.

    Returns:
        str: The remapped label key.
    """
    output = []
    for label_dict in label_list_of_dicts:
        if label_key in label_dict:
            label_dict[label_key] = label_mapping.get(label_dict[label_key])
        output.append(label_dict)
    return output

In [44]:
# Define a custom label mapping for the NER annotator
# This mapping translates the model's "ChemicalDrugs" label to "MEDICAMENTO"
custom_label_mapping_lf_transformer_1 = {
    "ChemicalDrugs": "MEDICAMENTO",
}

# Apply the remap_label function to the weak label
# This step remaps the label from "ChemicalDrugs" to "MEDICAMENTO"
df_lf_transfomer_1["weak_label"] = df_lf_transfomer_1["weak_label"].apply(
    lambda x: remap_label("entity_group", custom_label_mapping_lf_transformer_1, x)
)
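Note that `remap_label` modifies the dictionaries in place, and because it calls `label_mapping.get(...)` without a default, any label missing from the mapping becomes `None`. The sketch below shows one possible variant that copies each dictionary and keeps unmapped labels unchanged. The name `remap_label_copy` and the `"Disease"` label are hypothetical and used only for illustration.

```python
def remap_label_copy(label_key, label_mapping, label_list_of_dicts):
    """Return new dicts with `label_key` remapped; unmapped labels are kept as-is."""
    remapped = []
    for label_dict in label_list_of_dicts:
        new_dict = dict(label_dict)  # shallow copy, so the input is not mutated
        if label_key in new_dict:
            old_label = new_dict[label_key]
            # Fall back to the old label when it is not in the mapping
            new_dict[label_key] = label_mapping.get(old_label, old_label)
        remapped.append(new_dict)
    return remapped


entities = [
    {"entity_group": "ChemicalDrugs", "word": "olanzapina"},
    {"entity_group": "Disease", "word": "esquizofrenia"},  # hypothetical label, not in the mapping
]
result = remap_label_copy("entity_group", {"ChemicalDrugs": "MEDICAMENTO"}, entities)
print(result)
```

Whether unmapped labels should be kept, dropped, or set to `None` depends on what the downstream annotator expects, so treat this as one option rather than a replacement for the function above.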

In [None]:
# Let's remove salts and hydrated compounds from the weak labels. This is a peculiarity of our dataset (I know because I created it): the annotations ignore the salt name in a medication.
# So "Cloridrato de Propranolol" is labeled as "Propranolol", and "Cloridrato de Metformina Monoidratada" is labeled as "Metformina".
# You could adopt a different labeling strategy, but this is the one I used for this dataset. Since we will compare the model trained with weak supervision against the model trained on the fully labeled dataset, it's a good idea to keep the same labeling strategy.
# The salts are in the list below. The hydrated compounds are removed by the function remove_hydrated_compounds.

salt_list = [
    "21-acetato",
    "aceponato",
    "acetato",
    "acetilsalicilato",
    "adipato",
    "alendronato",
    "alfaoxofenilpropionato",
    "alginato",
    "aminossalicilato",
    "antimoniato",
    "arginato",
    "arsenito",
    "ascorbato",
    "aspartato",
    "axetil",
    "benzoato",
    "besilato",
    "betacipionato",
    "bicarbonato",
    "bissulfato",
    "bitartarato",
    "borato",
    "brometo",
    "bromidrato",
    "butilbrometo",
    "caproato",
    "carbonato",
    "carboxilato",
    "ciclossilicato",
    "cipionato",
    "citrato",
    "clatrato",
    "clavulanato",
    "clonixinato",
    "cloranfenicol",
    "cloreto",
    "cloridrato",
    "colistimetato",
    "cromacato",
    "cromato",
    "cromoglicato",
    "decanoato",
    "di-hidrato",
    "diaspartato",
    "diatrizoato",
    "dicloreto",
    "dicloridrato",
    "difosfato",
    "diidrato",
    "dimaleato",
    "dimesilato",
    "dinitrato",
    "dinitrobenzoato",
    "dipropionato",
    "ditosilato",
    "divalproato",
    "dobesilato",
    "docusato",
    "embonato",
    "enantato",
    "esilato",
    "estearato",
    "estolato",
    "etabonato",
    "etanolato",
    "etexilato",
    "etilsuccinato",
    "fempropionato",
    "fendizoato",
    "fenilpropionato",
    "ferededato",
    "ferrocianeto",
    "fluoreto",
    "folinato",
    "fosfatidilcolina",
    "fosfato",
    "fosfito",
    "fumarato",
    "furoato",
    "fusidato",
    "gadobenato",
    "gadopentetato",
    "glicerofosfato",
    "glicinato",
    "glicirrizato",
    "gliconato",
    "gluceptato",
    "gluconato",
    "glutamato",
    "hemi-hidrato",
    "hemifumarato",
    "hemisulfato",
    "hemitartarato",
    "hexafluoreto",
    "hialuronato",
    "hiclato",
    "hidrobrometo",
    "hidrocloreto",
    "hidrogenotartarato",
    "hidroxibenzoato",
    "hidroxinaftoato",
    "hipofosfito",
    "ibandronato",
    "iodeto",
    "isetionato",
    "isocaproato",
    "lactato",
    "lactobionato",
    "laurato",
    "laurilsulfato",
    "levolisinato",
    "levomalato",
    "levulinato",
    "lisetil",
    "lisina",
    "lisinato",
    "malato",
    "maleato",
    "mepesuccinato",
    "mesilato",
    "metilbrometo",
    "metilsulfato",
    "metotrexato",
    "micofenolato",
    "molibdato",
    "mono-hidrato",
    "monofosfato",
    "mononitrato",
    "mucato",
    "naftoato",
    "nicotinato",
    "nitrato",
    "nitrito",
    "nitroprusseto",
    "oleato",
    "orotato",
    "oxalato",
    "oxoglurato",
    "palmitato",
    "pamoato",
    "pantotenato",
    "pantotênico",
    "permanganato",
    "piconato",
    "picossulfato",
    "pidolato",
    "pivalato",
    "poliestirenossulfonato",
    "polissulfato",
    "propilenoglicolato",
    "propionato",
    "racealfa-hidroxigamametiltiobutanoato",
    "racealfaoxobetametilbutanoato",
    "racealfaoxobetametilpentanoato",
    "racealfaoxogamametilpentanoato",
    "resinato",
    "sacarato",
    "salicilato",
    "selenato",
    "selenito",
    "silicato",
    "subacetato",
    "subgalato",
    "succinato",
    "sulfato",
    "sulfeto",
    "sulfito",
    "sódico",
    "tanato",
    "tartarato",
    "teoclato",
    "tetra-hidrato",
    "tiocianato",
    "tosilato",
    "triclofenato",
    "trifenatato",
    "undecanoato",
    "undecilato",
    "undecilenato",
    "valerato",
    "valproato",
    "xinafoato",
    "zirconato",
    "zíncico",
]

import re


def process_medication_entities(entities, text):
    """
    Process medication entities by extracting active ingredients and removing hydrated compounds.
    Updates the 'word', 'start', and 'end' fields in each entity.

    Args:
        entities (list): List of entity dictionaries with 'start', 'end', and 'word' fields.
        text (str): The source text containing the entities.

    Returns:
        list: Updated list of entities with processed fields.
    """
    processed_entities = []

    for entity in entities:
        # Create a copy of the original entity
        processed_entity = entity.copy()

        text_key_name = "word" if "word" in entity else "text"

        # Get the original text from the source using start/end indices
        entity_text = text[entity["start"] : entity["end"]]

        if len(entity_text) < 3:
            # Skip entities with less than 3 characters
            continue
        if entity_text.lower() in salt_list:
            continue

        # Apply processing functions
        active_ingredient = extract_active_ingredient(entity_text)
        cleaned_text = remove_hydrated_compounds(active_ingredient)

        # Update the 'text_key_name' field with the processed text
        processed_entity[text_key_name] = cleaned_text

        # Find the new position of the processed text in the original text
        if cleaned_text.lower() in text.lower():
            # Try to find the position that's closest to the original start position
            original_start = entity["start"]
            text_lower = text.lower()
            cleaned_lower = cleaned_text.lower()

            # Find all occurrences of the processed text
            positions = []
            pos = text_lower.find(cleaned_lower)
            while pos != -1:
                positions.append(pos)
                pos = text_lower.find(cleaned_lower, pos + 1)

            # Choose the position closest to the original start
            if positions:
                closest_pos = min(positions, key=lambda x: abs(x - original_start))
                processed_entity["start"] = closest_pos
                processed_entity["end"] = closest_pos + len(cleaned_text)

        processed_entities.append(processed_entity)

    return processed_entities
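The function above re-anchors each cleaned entity at the occurrence closest to the original start offset. When the same drug name also appears on its own shortly before a salt form, that heuristic can pick the standalone mention instead of the one inside the original span. The sketch below shows one possible alternative: look inside the original span first, then fall back to the closest occurrence. The helper name `relocate_span` is hypothetical and not part of the notebook.

```python
def relocate_span(text, cleaned, start, end):
    """Return (new_start, new_end) for `cleaned`, preferring a match inside [start, end)."""
    text_lower, cleaned_lower = text.lower(), cleaned.lower()

    # 1) Prefer an occurrence that lies entirely inside the original entity span
    pos = text_lower.find(cleaned_lower, start, end)
    if pos == -1:
        # 2) Otherwise collect every occurrence and keep the one closest to `start`
        positions = []
        pos = text_lower.find(cleaned_lower)
        while pos != -1:
            positions.append(pos)
            pos = text_lower.find(cleaned_lower, pos + 1)
        if not positions:
            return start, end  # nothing found: keep the original offsets
        pos = min(positions, key=lambda p: abs(p - start))
    return pos, pos + len(cleaned)


sample = "Olanzapina e Cloridrato de Olanzapina"
# The original entity "Cloridrato de Olanzapina" covers characters 13 to 37
new_start, new_end = relocate_span(sample, "Olanzapina", 13, 37)
print(new_start, new_end, sample[new_start:new_end])
```

In this sample, the closest-occurrence rule alone would select the first "Olanzapina" (offset 0 is 13 characters from the original start, while offset 27 is 14 characters away), whereas the in-span search keeps the mention that was actually detected.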

In [82]:
df_lf_transfomer_1["weak_label"].iloc[1]

[{'entity_group': 'MEDICAMENTO',
  'score': 0.77123636,
  'word': 'Olanzapina',
  'start': 45,
  'end': 55},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.79542863,
  'word': 'olanzapina',
  'start': 284,
  'end': 294},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.79000133,
  'word': 'olanzapina',
  'start': 632,
  'end': 642},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.78814346,
  'word': 'olanzapina',
  'start': 760,
  'end': 770},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.77936846,
  'word': 'olanzapina',
  'start': 906,
  'end': 916},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.7389247,
  'word': 'lítio',
  'start': 965,
  'end': 970},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.93128145,
  'word': 'valproato',
  'start': 974,
  'end': 983},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.8313691,
  'word': 'olanzapina',
  'start': 1148,
  'end': 1158},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.95426273,
  'word': 'bupropiona',
  'start': 1352,
  'end': 1362}]

In [None]:
process_medication_entities(
    entities=df_lf_transfomer_1["weak_label"].iloc[1],
    text=df_lf_transfomer_1["text"].iloc[1],
)

[{'entity_group': 'MEDICAMENTO',
  'score': 0.77123636,
  'word': 'Olanzapina',
  'start': 45,
  'end': 55},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.79542863,
  'word': 'olanzapina',
  'start': 284,
  'end': 294},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.79000133,
  'word': 'olanzapina',
  'start': 632,
  'end': 642},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.78814346,
  'word': 'olanzapina',
  'start': 760,
  'end': 770},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.77936846,
  'word': 'olanzapina',
  'start': 906,
  'end': 916},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.7389247,
  'word': 'lítio',
  'start': 965,
  'end': 970},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.8313691,
  'word': 'olanzapina',
  'start': 1148,
  'end': 1158},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.95426273,
  'word': 'bupropiona',
  'start': 1352,
  'end': 1362}]

In [84]:
df_lf_transfomer_1["text"].iloc[1]

'Informações sobre os medicamentos postulados Olanzapina - está incluído na lista do RENAME 2022, e, portanto, é fornecido pelo SUS (http:/url_placeholder.com.br - integra o componente especializado da assistência farmacêutica (grupo 1A); - segundo a bula, este fármaco é indicado: “A olanzapina é indicado para o tratamento agudo e de manutenção da esquizofrenia e outras psicoses em adultos, nas quais sintomas positivos (exemplo: delírios, alucinações, alterações de pensamento, hostilidade e desconfiança) e/ou sintomas negativos (exemplo: afeto diminuído, isolamento emocional/social e pobreza de linguagem) são proeminentes. A olanzapina alivia também os sintomas afetivos secundários, comumente associados com esquizofrenia e transtornos relacionados. A olanzapina é eficaz na manutenção da melhora clínica durante o tratamento contínuo nos pacientes adultos que responderam ao tratamento inicial. A olanzapina é indicado, em monoterapia ou em combinação com lítio ou valproato, para o tratame

In [85]:
df_lf_transfomer_1["weak_label"] = df_lf_transfomer_1.apply(
    lambda x: process_medication_entities(x["weak_label"], x["text"]), axis=1
)

In [86]:
df_lf_transfomer_1["weak_label"].iloc[397]

[{'entity_group': 'MEDICAMENTO',
  'score': 0.80986035,
  'word': 'umeclidínio',
  'start': 11,
  'end': 22},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.96012425,
  'word': 'vilanterol',
  'start': 40,
  'end': 50},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.9751471,
  'word': 'tiotrópio',
  'start': 99,
  'end': 108},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.9619494,
  'word': 'olodaterol',
  'start': 138,
  'end': 148},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.9329622,
  'word': 'NATJUS',
  'start': 379,
  'end': 385},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.9592293,
  'word': 'agentes',
  'start': 1022,
  'end': 1029},
 {'entity_group': 'MEDICAMENTO',
  'score': 0.9727674,
  'word': 'açãoprolongada',
  'start': 1050,
  'end': 1064}]

Now we can turn it into a labelling function that can be used with skweak.
I've created a `DataFrameAnnotator` class that extends the `skweak.annotator.Annotator` class. It builds a labelling function that annotates text based on a pandas DataFrame of precomputed annotations. This way, we can use the same class to create labelling functions for any text data, regardless of the source, as long as the DataFrame follows the same format.

In [87]:
from helpers.ner import (
    DataFrameAnnotator,
)  # You can check the code for this class later to understand better what it does.

lf_transformer_1 = DataFrameAnnotator(
    annotator_name="lf_transformer_1",
    data_frame=df_lf_transfomer_1,  # The DataFrame containing the weak labels
    column_weak_label="weak_label",
    column_text="text",
    column_uid="uid",
    label_key_name="entity_group",
)

In [88]:
# Apply the DataFrameAnnotator labelling function to the spaCy document
# This step annotates the document with the weak labels precomputed by the pre-trained model
lf_transformer_1(doc)

# Display the annotated entities in the document
# This function highlights the recognized entities in the document for visualization
skweak.utils.display_entities(doc, "lf_transformer_1")

#### 2.3 Using Zero-Shot NER with GLiNER 


Named Entity Recognition (NER) systems traditionally identify a predefined set of entity types, which is limiting when real-world data contains diverse and evolving entity types. GLiNER overcomes this limitation by using bidirectional transformer encoders, similar to BERT's architecture, to detect **any** entity type specified at inference time. This distinguishes GLiNER from traditional NER systems and makes it a cheaper alternative to large language models (LLMs), particularly in resource-constrained environments where deploying LLMs is impractical.

##### Evolution and Advancements in GLiNER Architecture

Early iterations of GLiNER relied on older encoder architectures such as BERT and DeBERTa. These versions were trained on relatively small datasets, lacked modern optimizations such as flash attention, and were restricted to a context window of 512 tokens, which limited their performance on tasks requiring broader textual context.

To address these limitations, recent developments in GLiNER have focused on:

- **Advanced Encoder Architectures:** Shifting from older architectures to newer ones built with the LLM2Vec method, which converts the original decoder model into a bidirectional encoder and leads to better performance.

- **Extensive Pre-training:** Pre-training the model at scale on the Wikipedia corpus with a masked token prediction task. This brings several advantages:

    - **Flash Attention Integration:** Enabling faster training and inference.

    - **Expanded Context Window:** Extending the context window up to 32k tokens, so the model can capture longer-range dependencies within the text, which matters for tasks requiring broader textual context.

    - **Improved Generalization:** Significantly improving the model's ability to perform well on unseen data.

##### Key Advantages of GLiNER 

The latest GLiNER model offers substantial improvements over its predecessors, including:

- **Enhanced Performance and Generalization:** Exhibiting superior performance and better generalization capabilities due to architectural improvements and extensive pre-training.

- **Flash Attention Support:** Integrating flash attention for faster training and inference, making it more efficient for real-world applications.

- **Extended Context Window:** Expanding the context window to accommodate up to 32k tokens, allowing for a more thorough understanding of textual relationships and improved accuracy in tasks requiring a wider range of textual information.

For a more in-depth understanding of the GLiNER model and its evolution, refer to the research paper available [here](https://arxiv.org/pdf/2406.12925).

In [89]:
# Import the GLiNER class from the gliner library
# GLiNER is used for named entity recognition (NER) tasks
from gliner import GLiNER

# Import the helpers.ner module
# This module contains helper functions for NER tasks
import helpers.ner

# Load a pre-trained GLiNER model
# We'll use 3 different GLiNER flavours as Labeling Functions
# The from_pretrained method loads the model weights and configuration
model_gliner_llama = GLiNER.from_pretrained(
    "knowledgator/gliner-llama-1.3B-v1.0",
)

model_gliner_bi_large = GLiNER.from_pretrained(
    "knowledgator/gliner-bi-large-v1.0",
)

model_gliner_qwen = GLiNER.from_pretrained(
    "knowledgator/gliner-qwen-1.5B-v1.0",
)

model_gliner_llama = model_gliner_llama.half()
model_gliner_bi_large = model_gliner_bi_large.half()
model_gliner_qwen = model_gliner_qwen.half()



Here's how a GLiNER model works on a given text:

In [90]:
# Extract the text content from the spaCy document
# This text will be used as input for the GLiNER model
text = doc.text

# Define the list of labels to be recognized by the GLiNER model
# In this case, we are interested in recognizing entities labeled as 'medicamento'
labels = ["medicamento"]

In [91]:
# Use the GLiNER model to predict entities in the text
# The predict_entities method takes the text, labels, and a confidence threshold as input
# The threshold parameter specifies the minimum confidence score for an entity to be considered valid
result = model_gliner_llama.predict_entities(text, labels, threshold=0.5)

# Display the prediction results
# This will show the recognized entities along with their positions and confidence scores
print(result)

# Render the annotated entities in the text using the helpers.ner module
helpers.ner.render_entity_data_from_pipeline(text, result, label_key_name="label")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'start': 284, 'end': 294, 'text': 'olanzapina', 'label': 'medicamento', 'score': 0.54296875}]


In [92]:
# Use the GLiNER model to predict entities in the text
# The predict_entities method takes the text, labels, and a confidence threshold as input
# The threshold parameter specifies the minimum confidence score for an entity to be considered valid
result = model_gliner_bi_large.predict_entities(text, labels, threshold=0.5)

# Display the prediction results
# This will show the recognized entities along with their positions and confidence scores
print(result)

# Render the annotated entities in the text using the helpers.ner module
helpers.ner.render_entity_data_from_pipeline(text, result, label_key_name="label")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'start': 45, 'end': 55, 'text': 'Olanzapina', 'label': 'medicamento', 'score': 0.60498046875}, {'start': 284, 'end': 294, 'text': 'olanzapina', 'label': 'medicamento', 'score': 0.6435546875}, {'start': 632, 'end': 642, 'text': 'olanzapina', 'label': 'medicamento', 'score': 0.66455078125}, {'start': 760, 'end': 770, 'text': 'olanzapina', 'label': 'medicamento', 'score': 0.6484375}, {'start': 906, 'end': 916, 'text': 'olanzapina', 'label': 'medicamento', 'score': 0.63037109375}, {'start': 1148, 'end': 1158, 'text': 'olanzapina', 'label': 'medicamento', 'score': 0.6376953125}, {'start': 1338, 'end': 1362, 'text': 'Cloridrato de bupropiona', 'label': 'medicamento', 'score': 0.52294921875}]


In [93]:
# Use the GLiNER model to predict entities in the text
# The predict_entities method takes the text, labels, and a confidence threshold as input
# The threshold parameter specifies the minimum confidence score for an entity to be considered valid
result = model_gliner_qwen.predict_entities(text, labels, threshold=0.5)

# Display the prediction results
# This will show the recognized entities along with their positions and confidence scores
print(result)

# Render the annotated entities in the text using the helpers.ner module
helpers.ner.render_entity_data_from_pipeline(text, result, label_key_name="label")

[{'start': 45, 'end': 55, 'text': 'Olanzapina', 'label': 'medicamento', 'score': 0.82421875}, {'start': 282, 'end': 294, 'text': 'A olanzapina', 'label': 'medicamento', 'score': 0.8251953125}, {'start': 630, 'end': 642, 'text': 'A olanzapina', 'label': 'medicamento', 'score': 0.76806640625}, {'start': 758, 'end': 770, 'text': 'A olanzapina', 'label': 'medicamento', 'score': 0.92626953125}, {'start': 904, 'end': 916, 'text': 'A olanzapina', 'label': 'medicamento', 'score': 0.939453125}, {'start': 974, 'end': 983, 'text': 'valproato', 'label': 'medicamento', 'score': 0.6787109375}, {'start': 1146, 'end': 1158, 'text': 'A olanzapina', 'label': 'medicamento', 'score': 0.94482421875}, {'start': 1338, 'end': 1362, 'text': 'Cloridrato de bupropiona', 'label': 'medicamento', 'score': 0.939453125}]
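The `threshold` argument controls how many candidate spans survive. To get a feel for its effect without re-running the model, the sketch below applies a stricter cut-off to a few entries copied from the output above. The helper `filter_by_score` is hypothetical, and using `>=` for the comparison is an assumption rather than a statement about how GLiNER compares scores internally.

```python
# A few predictions copied from the model_gliner_qwen output above
predictions = [
    {"start": 45, "end": 55, "text": "Olanzapina", "label": "medicamento", "score": 0.82421875},
    {"start": 630, "end": 642, "text": "A olanzapina", "label": "medicamento", "score": 0.76806640625},
    {"start": 974, "end": 983, "text": "valproato", "label": "medicamento", "score": 0.6787109375},
    {"start": 1338, "end": 1362, "text": "Cloridrato de bupropiona", "label": "medicamento", "score": 0.939453125},
]


def filter_by_score(entities, threshold):
    """Keep only the entities whose confidence score is at least `threshold`."""
    return [entity for entity in entities if entity["score"] >= threshold]


strict = filter_by_score(predictions, threshold=0.8)
print([entity["text"] for entity in strict])
```

Raising the threshold trades recall for precision: with 0.8, the lower-scored "A olanzapina" and "valproato" spans are dropped, while the two highest-scored spans remain.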


In [94]:
def create_batches(data, batch_size):
    """
    Create batches of data with a specified batch size.

    Args:
        data (list): The input data to be batched.
        batch_size (int): The size of each batch.

    Returns:
        list: A list of batches containing the input data.
    """
    # Initialize an empty list to store the batches
    batches = []

    # Iterate over the data in steps of the batch size
    for i in range(0, len(data), batch_size):
        # Extract a batch of data with the specified size
        batch = data[i : i + batch_size]
        # Append the batch to the list of batches
        batches.append(batch)

    return batches
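As a quick sanity check of the batching logic, the sketch below uses a generator-based variant (the name `iter_batches` is hypothetical) on a toy list and confirms that the number of batches equals the list length divided by the batch size, rounded up. The same arithmetic determines the number of steps a progress bar reports when iterating over batches.

```python
import math


def iter_batches(data, batch_size):
    """Yield consecutive slices of `data`, each with at most `batch_size` items."""
    for i in range(0, len(data), batch_size):
        yield data[i : i + batch_size]


toy_docs = list(range(10))  # stand-in for a list of documents
toy_batches = list(iter_batches(toy_docs, batch_size=4))

print(toy_batches)
print(len(toy_batches) == math.ceil(len(toy_docs) / 4))
```

A generator avoids materializing every batch up front, although for a list of references to already-loaded documents the memory difference is small.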

In [95]:
from tqdm import tqdm
import torch
import gc


def process_labels_gliner(model, spacy_docs, batch_size):
    """
    Process weak labels using a GLiNER model.

    Args:
        model: The GLiNER model for predicting entities.
        spacy_docs: A list of spaCy documents to process.
        batch_size: The size of each batch for processing.

    Returns:
        pd.DataFrame: A DataFrame with the document texts ("text") and the weak labels predicted by the model ("weak_label").
    """
    # Send the model to the GPU for faster processing
    model = model.cuda()

    # Initialize an empty list to store the weak labels
    weak_labels = []

    # Create a DataFrame to store the text data
    df = pd.DataFrame([doc.text for doc in spacy_docs], columns=["text"])

    # Create batches of spaCy documents for processing
    batches = create_batches(spacy_docs, batch_size)

    # Iterate over the batches of data
    for batch in tqdm(batches):
        # Extract the text content from the spaCy documents in the batch
        texts = [doc.text for doc in batch]
        # Use the GLiNER model to predict entities in the batch of text
        # The batch_predict_entities method takes a list of texts, the labels, and a confidence threshold as input
        # (`labels` is the global list defined earlier in the notebook)
        # The threshold parameter specifies the minimum confidence score for an entity to be considered valid
        result = model.batch_predict_entities(texts, labels, threshold=0.5)
        # Append the prediction results to the list of weak labels
        weak_labels.extend(result)

    # Assign the weak labels to the 'weak_label' column in the DataFrame
    df["weak_label"] = weak_labels

    # Send the model back to the CPU to free up GPU memory
    model = model.cpu()
    model = None
    gc.collect()
    torch.cuda.empty_cache()

    return df

In [96]:
df_lf_gliner_llama = process_labels_gliner(model_gliner_llama, spacy_docs_train, 24)
df_lf_gliner_llama.iloc[1]["weak_label"]

100%|██████████| 417/417 [07:45<00:00,  1.12s/it]


[{'start': 284,
  'end': 294,
  'text': 'olanzapina',
  'label': 'medicamento',
  'score': 0.5439453125}]

In [97]:
df_lf_gliner_qwen = process_labels_gliner(model_gliner_qwen, spacy_docs_train, 4)
df_lf_gliner_qwen.iloc[1]["weak_label"]

100%|██████████| 2500/2500 [08:33<00:00,  4.87it/s]


[{'start': 45,
  'end': 55,
  'text': 'Olanzapina',
  'label': 'medicamento',
  'score': 0.82421875},
 {'start': 282,
  'end': 294,
  'text': 'A olanzapina',
  'label': 'medicamento',
  'score': 0.82373046875},
 {'start': 630,
  'end': 642,
  'text': 'A olanzapina',
  'label': 'medicamento',
  'score': 0.76513671875},
 {'start': 758,
  'end': 770,
  'text': 'A olanzapina',
  'label': 'medicamento',
  'score': 0.92626953125},
 {'start': 904,
  'end': 916,
  'text': 'A olanzapina',
  'label': 'medicamento',
  'score': 0.93896484375},
 {'start': 974,
  'end': 983,
  'text': 'valproato',
  'label': 'medicamento',
  'score': 0.67822265625},
 {'start': 1146,
  'end': 1158,
  'text': 'A olanzapina',
  'label': 'medicamento',
  'score': 0.9443359375},
 {'start': 1338,
  'end': 1362,
  'text': 'Cloridrato de bupropiona',
  'label': 'medicamento',
  'score': 0.939453125}]

In [98]:
df_lf_gliner_bi_large = process_labels_gliner(
    model_gliner_bi_large, spacy_docs_train, 20
)
df_lf_gliner_bi_large.iloc[1]["weak_label"]

100%|██████████| 500/500 [04:25<00:00,  1.88it/s]


[{'start': 45,
  'end': 55,
  'text': 'Olanzapina',
  'label': 'medicamento',
  'score': 0.60498046875},
 {'start': 284,
  'end': 294,
  'text': 'olanzapina',
  'label': 'medicamento',
  'score': 0.6435546875},
 {'start': 632,
  'end': 642,
  'text': 'olanzapina',
  'label': 'medicamento',
  'score': 0.66455078125},
 {'start': 760,
  'end': 770,
  'text': 'olanzapina',
  'label': 'medicamento',
  'score': 0.6484375},
 {'start': 906,
  'end': 916,
  'text': 'olanzapina',
  'label': 'medicamento',
  'score': 0.630859375},
 {'start': 1148,
  'end': 1158,
  'text': 'olanzapina',
  'label': 'medicamento',
  'score': 0.6376953125},
 {'start': 1338,
  'end': 1362,
  'text': 'Cloridrato de bupropiona',
  'label': 'medicamento',
  'score': 0.52294921875}]

In [99]:
df_lf_gliner_bi_large

Unnamed: 0,text,weak_label
0,Trata-se de ação proposta por CLEÔNIA LÚCIA DA...,"[{'start': 247, 'end': 256, 'text': 'LIMBITROL..."
1,Informações sobre os medicamentos postulados O...,"[{'start': 45, 'end': 55, 'text': 'Olanzapina'..."
2,A negativa de fornecimento de um medicamento d...,"[{'start': 1207, 'end': 1215, 'text': 'Mabther..."
3,"A parte autora requer, em caráter de urgência,...",[]
4,Já o anexo 07 menciona que: “Na ocasião de sua...,"[{'start': 200, 'end': 213, 'text': 'Oxibutina..."
...,...,...
9995,"Com efeito, em que pese a CONITEC não ter reco...","[{'start': 118, 'end': 126, 'text': 'glargina'..."
9996,I - Consoante o decidido pelo Plenário desta C...,"[{'start': 1360, 'end': 1370, 'text': 'Carvedi..."
9997,"Sendo assim, não obstante o Rituximabe (Mabthe...",[]
9998,"9. No caso dos autos, o autor é detentor de di...",[]


In [None]:
df_lf_gliner_llama["weak_label"] = df_lf_gliner_llama.apply(
    lambda x: process_medication_entities(x["weak_label"], x["text"]), axis=1
)

df_lf_gliner_qwen["weak_label"] = df_lf_gliner_qwen.apply(
    lambda x: process_medication_entities(x["weak_label"], x["text"]), axis=1
)

df_lf_gliner_bi_large["weak_label"] = df_lf_gliner_bi_large.apply(
    lambda x: process_medication_entities(x["weak_label"], x["text"]), axis=1
)

Now we can turn them into labelling functions that can be used with skweak.

In [101]:
lf_gliner_llama = DataFrameAnnotator(
    annotator_name="lf_gliner_llama",
    data_frame=df_lf_gliner_llama,
    column_weak_label="weak_label",
    column_text="text",
    column_uid="uid",
    label_key_name="label",
)

In [102]:
lf_gliner_qwen = DataFrameAnnotator(
    annotator_name="lf_gliner_qwen",
    data_frame=df_lf_gliner_qwen,
    column_weak_label="weak_label",
    column_text="text",
    column_uid="uid",
    label_key_name="label",
)

In [103]:
lf_gliner_bi_large = DataFrameAnnotator(
    annotator_name="lf_gliner_bi_large",
    data_frame=df_lf_gliner_bi_large,
    column_weak_label="weak_label",
    column_text="text",
    column_uid="uid",
    label_key_name="label",
)

In [104]:
# Apply the labelling function built from the GLiNER "llama" model to the spaCy document
lf_gliner_llama(doc)

# Apply the labelling function built from the GLiNER "qwen" model to the spaCy document
lf_gliner_qwen(doc)

# Apply the labelling function built from the GLiNER "bi_large" model to the spaCy document
lf_gliner_bi_large(doc)

Informações sobre os medicamentos postulados Olanzapina - está incluído na lista do RENAME 2022, e, portanto, é fornecido pelo SUS (http:/url_placeholder.com.br - integra o componente especializado da assistência farmacêutica (grupo 1A); - segundo a bula, este fármaco é indicado: “A olanzapina é indicado para o tratamento agudo e de manutenção da esquizofrenia e outras psicoses em adultos, nas quais sintomas positivos (exemplo: delírios, alucinações, alterações de pensamento, hostilidade e desconfiança) e/ou sintomas negativos (exemplo: afeto diminuído, isolamento emocional/social e pobreza de linguagem) são proeminentes. A olanzapina alivia também os sintomas afetivos secundários, comumente associados com esquizofrenia e transtornos relacionados. A olanzapina é eficaz na manutenção da melhora clínica durante o tratamento contínuo nos pacientes adultos que responderam ao tratamento inicial. A olanzapina é indicado, em monoterapia ou em combinação com lítio ou valproato, para o tratamen

In [105]:
# Display the annotated entities in the document
# This function highlights the recognized entities in the document for visualization
skweak.utils.display_entities(doc, "lf_gliner_llama")

In [106]:
# Display the annotated entities in the document
# This function highlights the recognized entities in the document for visualization
skweak.utils.display_entities(doc, "lf_gliner_qwen")

In [107]:
# Display the annotated entities in the document
# This function highlights the recognized entities in the document for visualization
skweak.utils.display_entities(doc, "lf_gliner_bi_large")

#### 2.4 Using Zero-Shot NER with LLMs and Function Calling Capabilities

Instruction-tuned large language models (LLMs) have revolutionized natural language processing tasks, including Named Entity Recognition (NER). These powerful models, such as GPT-4, can perform zero-shot learning, enabling them to tackle tasks without explicit training on labeled data. By using the function calling capabilities of LLMs, we can create custom functions that extract entities from text, effectively performing NER without extensive training or fine-tuning.


[LangChain](https://www.langchain.com/) serves as an important intermediary, simplifying interaction with LLMs for a range of language tasks, including NER. Several features make it a convenient choice for integrating LLMs into NER workflows:

1. **Seamless LLM Integration**: LangChain supports many popular LLM providers, including OpenAI's GPT models and open models available through Hugging Face, so the same code can target different language models.

2. **Intuitive API and Documentation**: The platform provides a well-documented, user-friendly API, complete with code examples and tutorials, which makes it easier to incorporate LLMs into applications.

3. **Flexible Input and Output Handling**: LangChain supports various input formats and offers customizable output handling, allowing for versatile processing of diverse content types.

4. **Task-Specific Modules**: Pre-configured components for common language tasks, including structured extraction of entities, simplify the process of getting good results.



**Applying NER with LangChain and LLMs**

To perform NER using LangChain and LLMs, follow these general steps:

1. **Setup and Installation**: Install LangChain and configure necessary API keys for the chosen LLM.

2. **Model Initialization**: Import required LangChain modules and initialize the LLM instances with desired configurations.

3. **Input Preparation**: Preprocess and format the input text to ensure compatibility with the chosen LLM.

4. **Entity Extraction**: Use LangChain's API to pass the prepared input to the LLM and generate output containing extracted entities.

5. **Post-processing**: Process the generated output to extract relevant entity information and integrate it into your application's workflow.


We'll use OpenAI GPT-4o-mini to perform zero-shot NER on legal documents, extracting entities with high accuracy and minimal manual intervention. This approach showcases the power of LLMs in automating complex NER tasks and streamlining information extraction processes.

> Note: You'll need an OpenAI API key to access GPT-4o-mini through LangChain.

In [108]:
from pydantic import BaseModel, Field
from typing import Optional, List
from langchain_core.prompts import ChatPromptTemplate
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
import os

# Load environment variables from .env file
# This is useful for keeping sensitive information like API keys out of your code
load_dotenv(
    override=True
)  # You are expected to have a .env file with the OpenAI API KEY `OPENAI_API_KEY`

# Retrieve the OpenAI API key from environment variables
# We're only displaying the first 5 characters for security reasons
# This is a good practice to verify the key is loaded without exposing it entirely
api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    # Fail early with a clear message instead of a TypeError on slicing None
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
print(f"First 5 characters of API key: {api_key[:5]}")

First 5 characters of API key: sk-pr


In [None]:
# Define a Pydantic model for Medicamento
# This model represents information about a medication
# Note: the docstrings and field descriptions are written in Portuguese because they are passed to the model; this primes it to work with Portuguese texts.


class Medicamento(BaseModel):
    """
    Representa informações sobre um medicamento mencionado.
    Não inclui suplementos ou fórmulas alimentares/nutricionais.
    """

    # Every field may be missing from the text, so each one is Optional and defaults to None
    nome_comercial: Optional[str] = Field(
        default=None, description="O nome comercial do medicamento."
    )
    principio_ativo: Optional[str] = Field(
        default=None,
        description="O princípio ativo farmacologicamente ativo do medicamento, sem o nome do sal. NÃO INCLUIR O NOME DO SAL (ex.: citrato de sildenafila -> sildenafila).",
    )
    dosagem: Optional[str] = Field(
        default=None,
        description="A dosagem do medicamento, especificando a quantidade e a unidade de medida.",
    )


# Define a Pydantic model for a list of Medicamentos
# This model represents a list of medications
class ListaMedicamentos(BaseModel):
    """
    Lista de medicamentos.
    """

    medicamentos: Optional[List[Medicamento]] = Field(
        default_factory=list, description="Lista de medicamentos."
    )

In [None]:
# Create a chat prompt template for the model
# The system message primes the model to extract named entities related to medications
# The human message is a placeholder for the text input

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Você é um especialista em extração de entidades nomeadas com precisão excepcional. "
            "Sua tarefa é identificar e extrair informações específicas do texto fornecido, seguindo estas diretrizes:"
            "\n\n1. Extraia as informações exatamente como aparecem no texto, sem interpretações ou alterações."
            "\n2. Se uma informação solicitada não estiver presente ou for ambígua, retorne null para esse campo."
            "\n3. Mantenha-se estritamente dentro do escopo das entidades e atributos definidos no esquema fornecido."
            "\n4. Preste atenção especial para manter a mesma ortografia, pontuação e formatação das informações extraídas."
            "\n5. Não infira ou adicione informações que não estejam explicitamente presentes no texto."
            "\n6. Se houver múltiplas menções da mesma entidade, extraia todas as ocorrências relevantes."
            "\n7. Ignore informações irrelevantes ou fora do contexto das entidades solicitadas."
            "\n\nLembre-se: sua precisão e aderência ao texto original são cruciais para o sucesso desta tarefa.",
        ),
        ("human", "{text}"),
    ]
)

In [111]:
# Initialize the ChatOpenAI model
# - model: Specify the version of the model to use
# - temperature: Controls the randomness of the output (0.0 gives the most deterministic output)

model_openai = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)

# Invoke the model with a test message to ensure it's working
model_openai.invoke("Oi!")

AIMessage(content='Oi! Como posso ajudar você hoje?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 9, 'prompt_tokens': 9, 'total_tokens': 18, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_06737a9306', 'finish_reason': 'stop', 'logprobs': None}, id='run-7bd33188-c4a8-4127-a418-7e93ac0e5c14-0', usage_metadata={'input_tokens': 9, 'output_tokens': 9, 'total_tokens': 18, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})

In [112]:
# Create an extractor by combining the prompt template with the OpenAI model
# The extractor is configured to output structured data in the form of ListaMedicamentos
# include_raw=False returns only the parsed ListaMedicamentos object, without the raw model response
# with_retry handles transient failures, with a maximum of 10 attempts
# wait_exponential_jitter=True adds randomness to the wait time between retries to avoid collisions
model_openai = prompt | model_openai.with_structured_output(
    ListaMedicamentos, include_raw=False
).with_retry(
    stop_after_attempt=10,  # Retry up to 10 times in case of failure
    wait_exponential_jitter=True,  # Add randomness to the wait time between retries
)

Here's how the zero-shot extractor works on a single text:

In [113]:
result = model_openai.invoke(spacy_docs_train[1].text)
print(result)

medicamentos=[Medicamento(nome_comercial='Olanzapina', principio_ativo='olanzapina', dosagem=None), Medicamento(nome_comercial='Cloridrato de bupropiona', principio_ativo='bupropiona', dosagem=None)]


Now we can transform it into a labelling function that can be used with Skweak.
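
Turning extracted names into weak labels requires locating each name in the original text as character offsets. The following is a minimal, self-contained sketch of that idea, assuming case-insensitive exact matching; `find_mentions` is a hypothetical helper written for illustration, not part of skweak or LangChain. It also skips duplicate spans, which appear when the commercial name and the active ingredient are the same string.

```python
import re
from typing import Dict, List, Optional


def find_mentions(
    text: str, names: List[Optional[str]], label: str = "medicamento"
) -> List[Dict]:
    """Locate every case-insensitive occurrence of each name as a character span."""
    seen = set()
    spans = []
    for name in names:
        if not name:  # skip None or empty strings returned by the model
            continue
        # re.escape keeps characters such as "+" or "." from acting as regex syntax
        for match in re.finditer(re.escape(name), text, flags=re.IGNORECASE):
            key = (match.start(), match.end())
            if key in seen:  # the same span can be reached through two names
                continue
            seen.add(key)
            spans.append(
                {
                    "start": match.start(),
                    "end": match.end(),
                    "text": match.group(),
                    "label": label,
                }
            )
    return sorted(spans, key=lambda span: span["start"])


text = "Olanzapina 10mg. A olanzapina é indicada para esquizofrenia."
mentions = find_mentions(text, ["Olanzapina", "olanzapina", None])
print(mentions)
```

Both names point to the same two mentions, so only two spans are returned. Exact string matching will miss names that the model normalized or corrected, which is one reason the prompt asks for text to be extracted exactly as written.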

In [114]:
# Process the documents in batches of 100
batch_size = 100
texts = [doc.text for doc in spacy_docs_train]
text_batches = [texts[i : i + batch_size] for i in range(0, len(texts), batch_size)]

# Run the extractor on each batch
results = []
for batch in tqdm(text_batches):
    try:
        out = model_openai.batch(batch, config={"max_concurrency": 100})
        results.extend(out)
    except Exception as e:
        # Keep results aligned with the inputs even when a whole batch fails
        print(e)
        results.extend([None] * len(batch))

100%|██████████| 100/100 [14:28<00:00,  8.68s/it]


In [115]:
import re
from typing import Any


def convert_pydantic_to_gliner_format(
    text: str, lista_medicamentos: BaseModel
) -> List[dict[str, Any]]:
    """
    Convert Pydantic model output to GLiNER format.

    Args:
        text (str): Original text.
        lista_medicamentos (BaseModel): Pydantic model instance with medication information.

    Returns:
        List[Dict[str, Any]]: List of entity dictionaries in GLiNER format.
    """
    result = []
    if lista_medicamentos is None:
        return []

    if lista_medicamentos.medicamentos is None:
        lista_medicamentos.medicamentos = []

    # Iterate over each medication in the Pydantic model
    for medicamento in lista_medicamentos.medicamentos:
        if medicamento.nome_comercial:
            # Find all matches of the medication name in the text
            for match in re.finditer(
                re.escape(medicamento.nome_comercial), text, re.IGNORECASE
            ):
                result.append(
                    {
                        "start": match.start(),
                        "end": match.end(),
                        "text": match.group(),
                        "label": "medicamento",
                        "score": 1.0,
                    }
                )

        # If the medication has an active ingredient, find all matches in the text
        if medicamento.principio_ativo:
            for match in re.finditer(
                re.escape(medicamento.principio_ativo), text, re.IGNORECASE
            ):
                result.append(
                    {
                        "start": match.start(),
                        "end": match.end(),
                        "text": match.group(),
                        "label": "medicamento",
                        "score": 1.0,
                    }
                )

    # Sort the results by the start index of the matches
    result.sort(key=lambda x: x["start"])
    return result

In [116]:
results[0]

ListaMedicamentos(medicamentos=[Medicamento(nome_comercial='LIMBITROL', principio_ativo=None, dosagem=None), Medicamento(nome_comercial='DEPAKOTE', principio_ativo=None, dosagem='500mg'), Medicamento(nome_comercial='FLUOXITINA', principio_ativo=None, dosagem='20mg'), Medicamento(nome_comercial='MIGRANE', principio_ativo=None, dosagem=None), Medicamento(nome_comercial='RIVOTRIL', principio_ativo=None, dosagem=None)])

In [117]:
df_lf_openai = pd.DataFrame([doc.text for doc in spacy_docs_train], columns=["text"])
df_lf_openai["weak_label"] = results
df_lf_openai["weak_label"] = df_lf_openai.apply(
    lambda x: convert_pydantic_to_gliner_format(x.text, x.weak_label), axis=1
)

df_lf_openai

| | text | weak_label |
|---|---|---|
| 0 | Trata-se de ação proposta por CLEÔNIA LÚCIA DA... | [{'start': 247, 'end': 256, 'text': 'LIMBITROL... |
| 1 | Informações sobre os medicamentos postulados O... | [{'start': 45, 'end': 55, 'text': 'Olanzapina'... |
| 2 | A negativa de fornecimento de um medicamento d... | [{'start': 1195, 'end': 1205, 'text': 'RITUXIM... |
| 3 | A parte autora requer, em caráter de urgência,... | [] |
| 4 | Já o anexo 07 menciona que: “Na ocasião de sua... | [{'start': 200, 'end': 209, 'text': 'Oxibutina... |
| ... | ... | ... |
| 9995 | Com efeito, em que pese a CONITEC não ter reco... | [{'start': 79, 'end': 87, 'text': 'insulina', ... |
| 9996 | I - Consoante o decidido pelo Plenário desta C... | [{'start': 1331, 'end': 1350, 'text': 'Bissulf... |
| 9997 | Sendo assim, não obstante o Rituximabe (Mabthe... | [{'start': 28, 'end': 38, 'text': 'Rituximabe'... |
| 9998 | 9. No caso dos autos, o autor é detentor de di... | [{'start': 149, 'end': 164, 'text': 'Freestyle... |


In [None]:
df_lf_openai["weak_label"] = df_lf_openai.apply(
    lambda x: process_medication_entities(x["weak_label"], x["text"]), axis=1
)

In [119]:
df_lf_openai.iloc[0]

text          Trata-se de ação proposta por CLEÔNIA LÚCIA DA...
weak_label    [{'start': 247, 'end': 256, 'text': 'LIMBITROL...
Name: 0, dtype: object

In [120]:
lf_openai = DataFrameAnnotator(
    annotator_name="lf_openai",
    data_frame=df_lf_openai,
    column_weak_label="weak_label",
    column_text="text",
    column_uid="uid",
    label_key_name="label",
)
lf_openai(doc)
skweak.utils.display_entities(doc, "lf_openai")

#### 2.5 Using Regex Patterns for Labeling Functions

Regular expressions (regex) are used for matching specific patterns within text data. They define patterns that correspond to target entities, so we can design labeling functions that identify these entities automatically. This approach is effective in tasks such as Named Entity Recognition, especially when the entities follow a predictable format.



##### Application to Pharmacological Data

When processing pharmacological data, such as drug names written in Portuguese, regex can be very practical. Drug names often have a structure that includes a salt and substance name. For example, in the name "Cloridrato de Propranolol":

- **Salt:** "Cloridrato"
- **Substance name:** "Propranolol"

Identifying such structures allows the development of regex patterns that accurately capture the parts of the drug name.



##### Generic Structure

Typically, the pattern to match a drug name in Portuguese follows a format like:

- **Format:**  `salt + " de " + substance name`

For instance, "Cloridrato de Propranolol" fits this structure. A regex pattern aiming to match such names needs to consider the following:

- Recognize the salt (e.g., "Cloridrato", "Sulfato").
- Detect the specific connector "de".
- Capture the subsequent substance name.



##### Framework for Designing Regex Patterns

When crafting regex patterns for labeling functions, consider the following steps:

1. **Begin with a Broad Pattern and Refine:**  
 Start with a general pattern to capture the overall structure, then adjust the regex to exclude unwanted matches and increase its precision.

2. **Consider Variations:**  
 Drug names may vary in spelling or include abbreviations. Account for these variations by allowing optional whitespace, case-insensitive matching, or alternate forms in the pattern.

3. **Utilize Capturing Groups:**  
 Capturing groups isolate parts of the matched text for further processing. For example, the pattern can define one capturing group for the salt and another for the substance name. 

4. **Apply Word Boundaries:**  
 Use word boundary symbols (typically denoted as `\b`) in regex to ensure that partial words are not mistakenly matched. This confines the match to whole words, increasing accuracy.
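
The steps above can be sketched with Python's standard `re` module. This is a minimal illustration, not the course's `SaltAnnotator`: the `SALTS` list is a hypothetical, deliberately short sample, and the pattern assumes the substance name is a single word.

```python
import re

# Hypothetical, deliberately short list of salts; a real list would be much longer.
SALTS = ["cloridrato", "sulfato", "citrato"]

# \b        -> word boundaries, so "bissulfato" does not match "sulfato"
# (?P<...>) -> named capturing groups for the salt and the substance
# [^\W\d_]+ -> one or more letters, including accented ones
SALT_PATTERN = re.compile(
    r"\b(?P<salt>"
    + "|".join(map(re.escape, SALTS))
    + r")\s+de\s+(?P<substance>[^\W\d_]+)\b",
    flags=re.IGNORECASE,
)

text = "Prescrito Cloridrato de Propranolol 40mg e sulfato de morfina."
pairs = [
    (match.group("salt"), match.group("substance"))
    for match in SALT_PATTERN.finditer(text)
]
print(pairs)
```

Because the salt and the substance are captured separately, a labeling function can choose to label the whole match or only the substance name.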



##### Considerations, Challenges, and Testing

When applying regex-based labeling functions, several practical aspects should be kept in mind:

- **Coverage:**  
 Ensure that the regex pattern accommodates a broad range of common salts and their possible variants.

- **Handling Edge Cases:**  
 Acknowledge that some drug names might deviate from the typical format or include additional descriptors. The pattern should be flexible enough to either handle these cases or flag them for manual review.

- **False Positives and False Negatives:**  
 Make sure to test the regex pattern against a diverse set of examples to minimize false positives (incorrectly identifying non-drug names) and false negatives (missing actual drug names).

- **Iterative Refinement and Validation:**  
 Develop and test the regex patterns in stages. Refine them based on feedback from annotated datasets to improve accuracy. Solid testing helps ensure that the regex performs reliably in real applications.
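
One way to make this testing concrete is to run the pattern over a small set of labeled examples and count the errors. The sketch below uses a hypothetical, intentionally broad pattern and five made-up examples; it only illustrates the test loop and says nothing about how the real pattern performs.

```python
import re

# Hypothetical broad pattern: any word ending in ato/ito/eto, then "de", then a word.
pattern = re.compile(r"\b\w+(?:ato|ito|eto)\s+de\s+\w+", flags=re.IGNORECASE)

# (text, is_drug) pairs; made-up examples for illustration only
examples = [
    ("Cloridrato de Propranolol 40mg", True),
    ("sulfato de morfina", True),
    ("Contrato de locação", False),  # ends in "ato" but is not a drug
    ("Olanzapina 10mg", False),      # a drug, but labeled False for this pattern's scope? No: see note
    ("Dipirona sódica", True),       # a drug that does not follow the format
]
# Note: "Olanzapina 10mg" is marked False only to act as a plain negative here;
# replace the list with examples that reflect your own annotation scheme.

tp = fp = fn = 0
false_positives, false_negatives = [], []
for text, is_drug in examples:
    predicted = pattern.search(text) is not None
    if predicted and is_drug:
        tp += 1
    elif predicted and not is_drug:
        fp += 1
        false_positives.append(text)
    elif not predicted and is_drug:
        fn += 1
        false_negatives.append(text)

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(f"precision={precision:.2f} recall={recall:.2f}")
print("false positives:", false_positives)
print("false negatives:", false_negatives)
```

Listing the individual errors is usually more useful than the scores alone: here the false positive suggests restricting the pattern to a known list of salts, and the false negative shows a naming format the pattern was never meant to cover.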

> **Note:** While regex is a reliable method for automating NER tasks, maintaining a balance is necessary. The goal is to capture relevant entities accurately without making the pattern overly complex, which may lead to misidentifications.

In [121]:
# Let's get some ideas for salt names

# Initialize an empty list to store potential salt names
salts = []

# Iterate over each drug name in the drugs list
for drug in drugs:
    # Split the drug name into words
    split_drug = drug.split()

    # Keep the first word when the name looks like "<salt> de <substance>":
    # the second word is 'de' and the first word ends with 'ato', 'ito', or 'eto'
    if (
        len(split_drug) > 1
        and split_drug[1] == "de"
        and split_drug[0].endswith(("ato", "ito", "eto"))
    ):
        # Append the first word to the salts list
        salts.append(split_drug[0])

# Remove duplicates from the salts list by converting it to a set and back to a list
salts = list(set(salts))

# Display the unique salt names
salts  # we'll use this list to create a regex pattern to extract drugs from the text. Check the next cell.

['poliestirenossulfonato',
 'xinafoato',
 'aceponato',
 'fosfato',
 'salicilato',
 'mononitrato',
 'fluoreto',
 'embonato',
 'clavulanato',
 'dipropionato',
 'cetoconazol+dipropionato',
 'butilbrometo',
 'dimesilato',
 'racealfa-hidroxigamametiltiobutanoato',
 'hexafluoreto',
 'fendizoato',
 'pidolato',
 'cipionato',
 'palmitato',
 'amoxicilina+clavulanato',
 'clatrato',
 'valerato',
 'brometo',
 'triancinolona+sulfato',
 'pantotenato',
 '21-acetato',
 'decanoato',
 'hemifumarato',
 'hidrogenotartarato',
 'laurilsulfato',
 'acetato',
 'benzoato',
 'ciclossilicato',
 'sulfato',
 'oxalato',
 'isetionato',
 'difosfato',
 'resinato',
 'bicarbonato',
 'maleato',
 'divalproato',
 'estolato',
 'polissulfato',
 'tartarato',
 'succinato',
 'micofenolato',
 'ibandronato',
 'dimaleato',
 'levomalato',
 'etanolato',
 'nitrato',
 'pamoato',
 'etexilato',
 'citrato',
 'racealfaoxobetametilbutanoato',
 'gliconato',
 'hialuronato',
 'picossulfato',
 'hidroxinaftoato',
 'tosilato',
 'hiclato',
 'diclor

In [122]:
# Select the second document (index 1) from the training set of spaCy documents
doc = spacy_docs_train[1]

# Print the text of the selected document for reference
print(doc.text)

Informações sobre os medicamentos postulados Olanzapina - está incluído na lista do RENAME 2022, e, portanto, é fornecido pelo SUS (http:/url_placeholder.com.br - integra o componente especializado da assistência farmacêutica (grupo 1A); - segundo a bula, este fármaco é indicado: “A olanzapina é indicado para o tratamento agudo e de manutenção da esquizofrenia e outras psicoses em adultos, nas quais sintomas positivos (exemplo: delírios, alucinações, alterações de pensamento, hostilidade e desconfiança) e/ou sintomas negativos (exemplo: afeto diminuído, isolamento emocional/social e pobreza de linguagem) são proeminentes. A olanzapina alivia também os sintomas afetivos secundários, comumente associados com esquizofrenia e transtornos relacionados. A olanzapina é eficaz na manutenção da melhora clínica durante o tratamento contínuo nos pacientes adultos que responderam ao tratamento inicial. A olanzapina é indicado, em monoterapia ou em combinação com lítio ou valproato, para o tratamen

In [None]:
import importlib

importlib.reload(helpers.ner)

# Initialize the SaltAnnotator from the helpers.ner module
lf_salt = helpers.ner.SaltAnnotator(remove_salt=True)

# Apply the SaltAnnotator to the selected document
lf_salt(doc)

# Display the entities annotated by the SaltAnnotator using skweak's display_entities function
skweak.utils.display_entities(doc, "lf_salt")

#### 2.6 Apply Labeling Functions

**Working with Individual Labeling Functions**

When applying a single labeling function to a document:

1. Execute the function with the document as its argument:
   ```python
   name_of_lf_object(doc)
   ```

2. This operation modifies the document by adding annotations from the labeling function (it makes changes in place).

3. The annotations are stored in the document's `spans` attribute under a key matching the labeling function's name.

**Verification**

To confirm that your labeling function has operated correctly:

- Examine the spans it produced: `doc.spans["name_of_your_labeling_function"]`
- Each span contains information about:
  - The text covered
  - The start and end character positions (`start_char` and `end_char`)
  - The assigned label (accessible via the `label_` attribute)



##### Integrating Multiple Labeling Functions

Managing numerous labeling functions individually can become cumbersome and inefficient. skweak provides a mechanism to simplify this process through the `CombinedAnnotator` class from the `base` module.

**Creating a Combined Annotator**

The process involves two main steps:

1. **Initialize the combined annotator**:
   ```python
   combined = CombinedAnnotator()
   ```

2. **Add each labeling function** to the combined annotator:
   ```python
   combined.add_annotator(lf_object1)
   combined.add_annotator(lf_object2)
   ```

This approach creates a unified interface that simplifies the application of multiple labeling functions.

From a mathematical perspective, we can view a combined annotator $C$ as applying a series of labeling functions $L_1, L_2, \dots, L_n$ to a document $d$:

$$C(d) = L_1(d) \cup L_2(d) \cup \dots \cup L_n(d) = \bigcup_{i=1}^{n} L_i(d)$$

where each labeling function $L_i$ produces a set of labeled spans for document $d$. In practice, the spans from each function are kept under that function's own key in `doc.spans`, so their origin is preserved for the aggregation step.
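
As a rough illustration of this idea, the sketch below applies two toy labeling functions to a plain string and keeps each function's spans under its own name. It is a simplified stand-in written for this explanation, not skweak's `CombinedAnnotator`; the two functions, their patterns, and the example text are hypothetical.

```python
import re
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)
LabelingFunction = Callable[[str], List[Span]]


def lf_gazetteer(text: str) -> List[Span]:
    """Toy gazetteer: exact matches against a two-item list."""
    return [
        (m.start(), m.end(), "MEDICAMENTO")
        for m in re.finditer(r"\b(?:Rituximabe|Mabthera)\b", text)
    ]


def lf_dosage_context(text: str) -> List[Span]:
    """Toy heuristic: a capitalised word directly followed by a dosage in mg."""
    return [
        (m.start(1), m.end(1), "MEDICAMENTO")
        for m in re.finditer(r"\b([A-Z]\w+)\s+\d+\s?mg\b", text)
    ]


def apply_all(text: str, lfs: Dict[str, LabelingFunction]) -> Dict[str, List[Span]]:
    """Apply every labeling function and keep its spans under its own name."""
    return {name: lf(text) for name, lf in lfs.items()}


text = "Rituximabe (Mabthera) e Depakote 500mg"
spans = apply_all(text, {"gazetteer": lf_gazetteer, "dosage": lf_dosage_context})

# The union of all spans corresponds to C(d) in the formula above
union = sorted(set().union(*spans.values()))
print(spans)
print(union)
```

Keeping the per-function dictionary, rather than only the union, matters later: the aggregation step needs to know which function proposed which span.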


**For Single Documents**

To apply the combined annotator to a single document:

```python
combined(doc)
```

This is functionally equivalent to applying each labeling function sequentially.

**For Document Collections**

For processing multiple documents efficiently:

```python
docs = list(combined.pipe(docs))
```

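The `pipe` method is particularly advantageous because:

- It implements **lazy evaluation**: documents are annotated only as they are consumed
- Documents can therefore be streamed, which reduces memory usage for large collections; note that wrapping the call in `list(...)`, as above, materializes every annotated document at once
- Processing speed can improve when the underlying annotators process documents in batches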
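
Lazy evaluation can be illustrated with a plain Python generator. The sketch below is a simplified analogy, not skweak's implementation: `annotate` and `pipe` are hypothetical stand-ins, and the "documents" are ordinary strings.

```python
from typing import Callable, Iterable, Iterator, List

calls: List[str] = []  # records the order in which documents are processed


def annotate(doc: str) -> str:
    """Stand-in for an annotator: logs the call and returns a modified document."""
    calls.append(doc)
    return doc.upper()


def pipe(docs: Iterable[str], annotator: Callable[[str], str]) -> Iterator[str]:
    """Generator version of 'apply to every document': work happens on demand."""
    for doc in docs:
        yield annotator(doc)


stream = pipe(["doc a", "doc b", "doc c"], annotate)
calls_before = list(calls)       # nothing has been processed yet
first = next(stream)             # processes only the first document
calls_after_first = list(calls)
rest = list(stream)              # consumes the remaining documents

print(calls_before, calls_after_first, calls)
```

Creating the stream does no work; each document is processed only when it is requested. Calling `list(...)` on the stream forces all remaining documents to be processed and held in memory at once.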



##### Performance Considerations

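When working with large document collections, a combined annotator is mainly a practical convenience. The amount of work is unchanged: applying $m$ labeling functions to $n$ documents still requires on the order of $O(n \times m)$ function applications, where:
- $n$ is the number of documents
- $m$ is the number of labeling functions

The gain is organizational: a single call applies every labeling function, and the collection can be streamed through `pipe` instead of being looped over by hand once per function.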

The labeled spans generated by these functions serve as the weak supervision signals that will later be combined through statistical methods to produce a unified, higher-quality labeling model. The quality and diversity of your labeling functions directly impact the performance of the final model.

In [133]:
# Create a CombinedAnnotator instance
# The CombinedAnnotator allows us to combine multiple weak supervision sources (annotators)
# Each annotator will provide its own annotations, which will be combined to create a final set of annotations
combined = skweak.base.CombinedAnnotator()

# Add various annotators to the CombinedAnnotator
# Each annotator is responsible for a different type of annotation or uses a different method to generate annotations

# Add a gazetteer-based annotator for drug names
combined.add_annotator(lf_drugs_gazetteer)

# Add a transformer-based annotator (e.g., BERT, RoBERTa)
combined.add_annotator(lf_transformer_1)

# Add a GLINER-based annotator using the LLaMA model
combined.add_annotator(lf_gliner_llama)

# Add a GLINER-based annotator using the Qwen model
combined.add_annotator(lf_gliner_qwen)

# Add a GLINER-based annotator using a large BiLSTM model
combined.add_annotator(lf_gliner_bi_large)

# Add the GPT-4o-mini annotator built with OpenAI's model
combined.add_annotator(lf_openai)

# Add a regex-based annotator for salt names
combined.add_annotator(lf_salt)

# Apply the combined annotators to the training documents
# The combined.pipe method processes the documents in batches, applying all the annotators
# This step generates the combined annotations for each document in the training dataset
spacy_docs_train = list(combined.pipe(spacy_docs_train))

# Note: The running time of this cell may vary depending on the number of documents and the complexity of the annotators
# For our 10k documents, it took about 1m22s to run.

In [134]:
# We can save the weakly annotated spacy_docs to disk to avoid running the labelling functions again
skweak.utils.docbin_writer(
    spacy_docs_train, "data/bin/ner/spacy_docs_train_annotated.bin"
)

Write to data/bin/ner/spacy_docs_train_annotated.bin...done


In [1]:
import skweak

# Now we can load them back, so we don't need to run the labelling functions again
spacy_docs_train = skweak.utils.docbin_reader(
    "data/bin/ner/spacy_docs_train_annotated.bin", spacy_model_name="pt_core_news_lg"
)
spacy_docs_train = list(spacy_docs_train)

#### 2.7 Document-Level Labelling Functions

Skweak offers a mechanism for developing labelling functions that operate at the document level, allowing the model to take advantage of the overall context. By considering the global context, these functions help ensure that all occurrences of an entity within a document are assigned a consistent label. For instance, the entity [“Frontal”](https://www.saudedireta.com.br/catinc/drugs/bulas/frontal.pdf) might be ambiguous in isolation, as it could denote the brand name of Alprazolam or refer to a spatial descriptor. However, within the context of a single document, it is usually clear which interpretation is correct.

##### DocumentMajorityAnnotator

The **DocumentMajorityAnnotator** is designed to enforce label consistency within a single document. It operates as follows:

1. **Initial Predictions**:  
 Use predictions generated by an existing labelling function on the document. These predictions provide a starting point with preliminary labels for each identified entity.

2. **Frequency Computation**:  
 For each unique entity string in the document, compute the frequency of each label assigned by the initial predictions. Mathematically, for an entity $ e $ and set of possible labels $ \{l_1, l_2, \dots, l_k\} $, calculate:
   $$
   \text{freq}(l_i, e) = \text{number of times } l_i \text{ is assigned to } e
   $$ 

3. **Majority Label Selection**:  
 Identify the label with the highest frequency for each unique entity. This can be expressed as:
   $$
   L(e) = \arg \max_{l} \; \text{freq}(l, e)
   $$
   where $ L(e) $ is the majority label for the entity $ e $.

4. **Label Assignment**:  
 Assign the majority label to every occurrence of the entity $ e $ in the document. This step reinforces intra-document consistency.
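
The four steps can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not skweak's `DocumentMajorityAnnotator`; the function name, the `ANATOMIA` label, and the example mentions are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def document_majority(
    mentions: List[Tuple[str, str]], case_sensitive: bool = False
) -> Dict[str, str]:
    """Map each entity string to its most frequent label within one document."""
    freq: Dict[str, Counter] = defaultdict(Counter)
    for text, label in mentions:
        key = text if case_sensitive else text.lower()
        freq[key][label] += 1  # freq(l, e)
    # argmax over labels; ties go to the label encountered first
    return {entity: counts.most_common(1)[0][0] for entity, counts in freq.items()}


# Initial predictions for one document: (entity text, predicted label)
mentions = [
    ("Frontal", "MEDICAMENTO"),
    ("frontal", "MEDICAMENTO"),
    ("Frontal", "ANATOMIA"),
    ("Rivotril", "MEDICAMENTO"),
]

majority = document_majority(mentions)
# Assign the majority label to every occurrence of each entity string
relabelled = [(text, majority[text.lower()]) for text, _ in mentions]
print(majority)
print(relabelled)
```

The single `ANATOMIA` prediction for "Frontal" is overruled by the two `MEDICAMENTO` predictions, which is the intended effect when the document uses the word consistently and a source of errors when it does not.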

##### Considerations for Document-Level Labelling

- **Contextual Consistency**:  
 Since document-level labelling functions rely on the assumption of consistency, it is important to verify that the document's context supports a single interpretation of its entities.
  
- **Handling Ambiguity**:  
 In situations where an entity could reasonably have multiple labels, it is generally a good idea to avoid using document-level labelling functions. 

In [3]:
# Create a MajorityVoter instance for aggregating annotations
# The MajorityVoter combines annotations from multiple annotators using a majority voting scheme
# "doclevel_voter" is the name of the voter
# initial_weights={"doc_majority":0.0} sets the initial weight for the "doc_majority" annotator to 0.0
# This means we do not want to include the "doc_majority" annotator itself in the vote
majority_voter = skweak.aggregation.MajorityVoter(
    "doclevel_voter", ["MEDICAMENTO"], initial_weights={"doc_majority": 0.0}
)

# Apply the MajorityVoter to the training documents
# majority_voter.pipe streams the documents through the voter
# This step adds the "doclevel_voter" spans, which serve as the initial predictions for the document-level annotator
spacy_docs_train = list(majority_voter.pipe(spacy_docs_train))

In [4]:
# Create a DocumentMajorityAnnotator instance
# The DocumentMajorityAnnotator gives every occurrence of an entity string in a document its majority label
# "doc_majority" is the name of the annotator
# "doclevel_voter" is the name of the span group that provides the initial predictions
# case_sensitive=False indicates that entity strings are compared case-insensitively
doc_majority = skweak.doclevel.DocumentMajorityAnnotator(
    "doc_majority", "doclevel_voter", case_sensitive=False
)

# Apply the DocumentMajorityAnnotator to the training documents
# doc_majority.pipe streams the documents through the annotator
# This step adds the "doc_majority" spans to each document in the training dataset
spacy_docs_train = list(doc_majority.pipe(spacy_docs_train))

### Step 3 - Generating Aggregated Labels

Aggregating labels is the process of merging outputs from multiple labeling functions to produce a final, high-quality set of annotations for a Named Entity Recognition (NER) model. This consolidation is necessary because individual labeling functions can introduce errors and inconsistencies. Skweak implements two approaches for this task: **Generative Models** and **Majority Vote**.

> If you are curious about other, more sophisticated methods, you can check [this](https://aclanthology.org/2021.acl-long.482.pdf) and [this](https://dl.acm.org/doi/pdf/10.1145/3534678.3539247) papers.



#### 3.1 Aggregating Labels with Generative Models

Generative models build an understanding of the latent structure behind noisy labels. The **Hidden Markov Model (HMM)** is a typical example used in sequence tasks such as NER.

##### Key Concepts

- **Sequence Representation:**  
 The text is viewed as a sequence of tokens $ (x_1, x_2, \dots, x_n) $ where each token has an associated hidden true label $ (S_1, S_2, \dots, S_n) $ and an observed noisy label $ (O_1, O_2, \dots, O_n) $.

- **Emission Probabilities:**  
 Represent the probability of observing $ O_i $ given the hidden state $ S_i $:  
 $$
  P(O_i \mid S_i)
  $$

- **Transition Probabilities:**  
 Define the probability of moving from one hidden state to the next:
  $$
  P(S_i \mid S_{i-1})
  $$

- **Parameter Estimation:**  
 The Baum-Welch algorithm, a variant of Expectation-Maximization, is used to estimate both emission and transition probabilities. This algorithm applies the forward-backward procedure to compute the likelihood of the observed sequence while refining model parameters.

##### Process of Label Aggregation

1. **Application of Labeling Functions:**  
 Apply several labeling functions to the unlabeled dataset to produce noisy labels.

2. **Parameter Estimation:**  
 Use the Baum-Welch algorithm to compute the emission and transition probabilities for the HMM.

3. **Label Inference:**  
 Infer the most likely sequence of true labels from the model, smoothing out inconsistencies across the noisy labels:
   $$
   \hat{S} = \arg \max_{S} \; P(S \mid O)
   $$
   where $\hat{S} $ is the sequence of aggregated labels and $ O $ is the sequence of observed labels.
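
One standard way to compute this argmax is the Viterbi algorithm. The sketch below is a toy illustration under strong assumptions: a single noisy labeling function, two states, and hand-picked probabilities instead of parameters estimated with Baum-Welch. It does not reproduce skweak's HMM, which handles several labeling functions at once.

```python
import math
from typing import Dict, List

STATES = ["O", "MEDICAMENTO"]

# Hand-picked probabilities, for illustration only. In practice they are
# estimated from the data (for example with Baum-Welch).
START = {"O": 0.8, "MEDICAMENTO": 0.2}
TRANSITION = {  # P(S_i | S_{i-1})
    "O": {"O": 0.8, "MEDICAMENTO": 0.2},
    "MEDICAMENTO": {"O": 0.4, "MEDICAMENTO": 0.6},
}
EMISSION = {  # P(O_i | S_i) for a single noisy labeling function
    "O": {"O": 0.9, "MEDICAMENTO": 0.1},
    "MEDICAMENTO": {"O": 0.3, "MEDICAMENTO": 0.7},
}


def viterbi(observations: List[str]) -> List[str]:
    """Return the most likely hidden label sequence, working in log space."""
    scores = {
        s: math.log(START[s]) + math.log(EMISSION[s][observations[0]])
        for s in STATES
    }
    backpointers: List[Dict[str, str]] = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state for reaching state s at this position
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANSITION[p][s]))
            pointers[s] = prev
            new_scores[s] = (
                scores[prev]
                + math.log(TRANSITION[prev][s])
                + math.log(EMISSION[s][obs])
            )
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final state
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Noisy labels for five tokens, e.g. a drug name whose middle token was missed
observed = ["O", "MEDICAMENTO", "O", "MEDICAMENTO", "O"]
decoded = viterbi(observed)
print(decoded)
```

With these parameters, the isolated `O` between two `MEDICAMENTO` observations is decoded as `MEDICAMENTO`, because staying inside an entity is more probable than leaving it and entering it again. This is the smoothing effect described in step 3.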

##### Advantages

- **Context Awareness:**  
 Because it considers the transitions between tokens, the HMM captures the dependency between adjacent labels, leading to more coherent output.

- **Noise Reduction:**  
 The model smooths over conflicting signals from different labeling functions, reducing the chance of wrong annotations.

- **Uncertainty Quantification:**  
 The posterior probabilities provided by the model offer insights into the confidence of each predicted label.

> **Note:** The final aggregated labels stored in `doc.spans["name_of_aggregator"]` display only the most likely labels. The full posterior probabilities, which indicate the confidence for each label, are available in `doc.spans["name_of_aggregator"].attrs['probs']`.


#### 3.2 Aggregating Labels with Majority Vote

The Majority Vote approach is a simpler alternative, where the final label for each token is determined by selecting the label that appears most frequently across all labeling functions.

##### How It Works

- For every token in the sequence, count how many labeling functions assign each label.
- The final label is the one with the highest count:
  $$
  \hat{l} = \operatorname{mode}\{l_1, l_2, \dots, l_k\}
  $$
  where $ l_i $ is the label from the $ i $-th labeling function.
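
A minimal token-level sketch of this rule is shown below. It is a simplification written for illustration, not skweak's `MajorityVoter`: the function name, the convention that `None` means a labeling function abstained, and the example votes are all assumptions.

```python
from collections import Counter
from typing import Dict, List, Optional


def majority_vote(
    votes: Dict[str, List[Optional[str]]], default: str = "O"
) -> List[str]:
    """Token-level majority vote; None means the labeling function abstained."""
    n_tokens = len(next(iter(votes.values())))
    result = []
    for i in range(n_tokens):
        counts = Counter(
            labels[i] for labels in votes.values() if labels[i] is not None
        )
        # Ties go to the label that was encountered first
        result.append(counts.most_common(1)[0][0] if counts else default)
    return result


tokens = ["Cloridrato", "de", "Propranolol", "40mg"]
votes = {
    "gazetteer": ["O", "O", "MEDICAMENTO", "O"],
    "regex": ["MEDICAMENTO", "MEDICAMENTO", "MEDICAMENTO", "O"],
    "llm": ["MEDICAMENTO", "MEDICAMENTO", "MEDICAMENTO", None],
}
aggregated = majority_vote(votes)
print(list(zip(tokens, aggregated)))
```

Each token is decided independently of its neighbours, which is what makes the method transparent and also what prevents it from using sequential context.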

##### Advantages

- **Simplicity:**  
 The method is straightforward to understand and apply.

- **Efficiency:**  
 It requires less computational power compared to generative models.

- **Transparency:**  
 The decision-making process is easy to trace, which can help diagnose errors in label assignments.

##### Limitations

- **Lack of Context:**  
 Majority Vote does not use sequential context or token-to-token dependencies, which may reduce accuracy in complex labeling scenarios.

- **Influence of Poor-Quality Functions:**  
 If low-quality labeling functions outnumber high-quality ones, the final label may be biased.



#### Choosing the Right Aggregation Method

Deciding between generative models and Majority Vote depends on several factors:

1. **Data Complexity:**  
 For tasks with substantial dependencies between labels, generative models provide better accuracy by accounting for sequence context.

2. **Computational Resources:**  
 When resources are limited, the simplicity and speed of Majority Vote can be advantageous.

3. **Quality of Labeling Functions:**  
 Majority Vote performs well when most labeling functions are reliable. In cases with noisy functions, generative models better handle inconsistent outputs.

4. **Interpretability:**  
 A more transparent process might be needed in some cases. Majority Vote offers clear, count-based decisions that are easier to understand.

> **Best Practice:** Compare performance on a validation set using both methods if possible. This comparison will inform which method best fits the specific requirements of your NER task.


In [5]:
spacy_docs_train[2].spans

{'drugs_gazetteer': [RITUXIMABE, Mabthera], 'lf_transformer_1': [RITUXIMABE, Mabthera, medicamento, medicação, medicação], 'lf_gliner_llama': [], 'lf_gliner_qwen': [Mabthera, RITUXIMABE], 'lf_gliner_bi_large': [Mabthera], 'lf_openai': [RITUXIMABE, Mabthera], 'lf_salt': [], 'doclevel_voter': [medicamento, RITUXIMABE, Mabthera, medicação, medicação], 'doc_majority': [medicamento, medicamento, medicamento, medicação, medicação, medicação]}

In [None]:
# Define the initial weights for the weak supervision sources. These weights will be used
# by the Hidden Markov Model (HMM) to combine the weak labels. You don't need to worry too
# much about them, as they will be optimized by the HMM. They are just a proxy for my
# feeling about the quality of the labelling functions.

initial_weights = {
    "drugs_gazetteer": 1.0,
    "lf_transformer_1": 0.1,
    "lf_gliner_llama": 0.3,
    "lf_gliner_qwen": 0.5,
    "lf_gliner_bi_large": 0.5,
    "lf_openai": 1.0,
    "lf_salt": 1.0,
    "doclevel_voter": 0.0,  # Let's disable the doclevel_voter for now. I have a feeling it's not very good for our task.
    "doc_majority": 0.0,  # Let's disable the doc_majority for now. I have a feeling it's not very good for our task.
}
# Create an instance of the Hidden Markov Model (HMM) for weak supervision
# The HMM is used to combine multiple weak labels into a single probabilistic label
# "hmm" is the name of the Hidden Markov Model
# labels=["MEDICAMENTO"] specifies the entity types that the HMM will consider
hmm = skweak.generative.HMM(
    "hmm", labels=["MEDICAMENTO"], initial_weights=initial_weights
)

# Fit the HMM model to the training documents
# The fit method trains the HMM model using the weak labels from the training dataset
# This step involves learning the transition and emission probabilities from the weak labels
# The spacy_docs_train contains the training documents with weak labels generated by the annotators
hmm.fit(spacy_docs_train)

Starting iteration 1
Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Number of processed documents: 5000
Number of processed documents: 6000
Number of processed documents: 7000
Number of processed documents: 8000
Number of processed documents: 9000
Finished E-step with 9018 documents
Starting iteration 2


         1 -198635.25860047             +nan


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Number of processed documents: 5000
Number of processed documents: 6000
Number of processed documents: 7000
Number of processed documents: 8000
Number of processed documents: 9000
Finished E-step with 9018 documents
Starting iteration 3


         2 -191959.99030201   +6675.26829846


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Number of processed documents: 5000
Number of processed documents: 6000
Number of processed documents: 7000
Number of processed documents: 8000
Number of processed documents: 9000
Finished E-step with 9018 documents
Starting iteration 4


         3 -191740.85857541    +219.13172660


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Number of processed documents: 5000
Number of processed documents: 6000
Number of processed documents: 7000
Number of processed documents: 8000
Number of processed documents: 9000
Finished E-step with 9018 documents


         4 -191733.82091174      +7.03766366


In [7]:
# Print the learned parameters of the Hidden Markov Model (HMM)
# The pretty_print method displays the start, transition, and emission distributions in a readable format
# This helps us understand how the HMM combines the weak labels into a single probabilistic label
# The transition model shows the probabilities of moving from one state (label) to the next
# Each emission model shows, for one labelling function, the probabilities of observing
# each of its outputs given the underlying state (label)
hmm.pretty_print()

HMM model with following parameters:
Output labels: ['O', 'B-MEDICAMENTO', 'I-MEDICAMENTO']
--------
Start distribution:
O                1.0
B-MEDICAMENTO    0.0
I-MEDICAMENTO    0.0
dtype: float64
--------
Transition model:
                  O  B-MEDICAMENTO  I-MEDICAMENTO
O              0.99           0.01           0.00
B-MEDICAMENTO  0.86           0.00           0.14
I-MEDICAMENTO  0.66           0.00           0.34
--------
Labelling functions in model: ['lf_transformer_1', 'lf_gliner_qwen', 'lf_gliner_llama', 'lf_gliner_bi_large', 'drugs_gazetteer', 'lf_openai', 'lf_salt']
Emission model for: drugs_gazetteer
                 O  B-MEDICAMENTO  I-MEDICAMENTO
O              1.0            0.0            0.0
B-MEDICAMENTO  0.0            1.0            0.0
I-MEDICAMENTO  0.0            0.0            1.0
weights        1.0            1.0            1.0
--------
Emission model for: lf_gliner_bi_large
                 O  B-MEDICAMENTO  I-MEDICAMENTO
O              1.0            0.0 
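
One way to read the transition model printed above is to ask how long an entity is expected to be under the HMM's Markov assumption. The short calculation below uses the rounded values from the printout, so the result is only approximate; the variable names are illustrative.

```python
# Rounded transition probabilities copied from the printed transition model.
p_b_to_i = 0.14  # B-MEDICAMENTO -> I-MEDICAMENTO
p_i_to_i = 0.34  # I-MEDICAMENTO -> I-MEDICAMENTO

# After the B token, the entity continues with probability p_b_to_i. Once inside,
# the number of I tokens follows a geometric distribution with mean 1 / (1 - p_i_to_i).
expected_i_tokens = p_b_to_i / (1.0 - p_i_to_i)
expected_length = 1.0 + expected_i_tokens

print(f"Share of multi-token entities: {p_b_to_i:.0%}")
print(f"Expected entity length: {expected_length:.2f} tokens")
```

Under these rounded values most drug mentions are single tokens, which fits examples such as `RITUXIMABE` and `Mabthera` seen earlier.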

In [8]:
# Create an instance of the SequentialMajorityVoter for weak supervision
# The SequentialMajorityVoter combines annotations from multiple annotators using a majority voting scheme
# "maj_voter" is the name of the voter
# This voter will sequentially process the annotations and assign the most common label to each token
maj_voter = skweak.voting.SequentialMajorityVoter(
    "maj_voter", labels=["MEDICAMENTO"], initial_weights=initial_weights
)

In [None]:
# Apply the Hidden Markov Model (HMM) to the training documents
# The hmm.pipe method processes the documents in batches, applying the HMM to generate probabilistic labels
# This step refines the weak labels by combining them into a single probabilistic label for each token
# The output is a list of spaCy documents with updated annotations based on the HMM
spacy_docs_train = list(hmm.pipe(spacy_docs_train))

In [None]:
# Apply the SequentialMajorityVoter to the training documents
# The maj_voter.pipe method processes the documents in batches, applying the majority voting scheme
# The voter aggregates the same labelling function outputs as the HMM; the "hmm" layer is not one of
# its sources, as the 'sources' list in the output further below shows
# The result is stored in a separate "maj_voter" span layer, alongside the "hmm" layer
spacy_docs_train = list(maj_voter.pipe(spacy_docs_train))

In [10]:
spacy_docs_train[0].spans["hmm"].attrs

{'probs': {44: {'B-MEDICAMENTO': 0.7874327340197151},
  46: {'B-MEDICAMENTO': 0.7874327340197179},
  56: {'B-MEDICAMENTO': 0.7874327340197151}},
 'aggregated': True,
 'sources': ['drugs_gazetteer',
  'lf_transformer_1',
  'lf_gliner_qwen',
  'lf_gliner_bi_large',
  'lf_openai']}

In [11]:
spacy_docs_train[2].spans["hmm"].attrs

{'probs': {220: {'B-MEDICAMENTO': 0.7874327098359736},
  222: {'B-MEDICAMENTO': 0.7874327340197165}},
 'aggregated': True,
 'sources': ['drugs_gazetteer',
  'lf_transformer_1',
  'lf_gliner_qwen',
  'lf_gliner_bi_large',
  'lf_openai']}

In [12]:
spacy_docs_train[2].spans["maj_voter"].attrs

{'probs': {219: {'B-MEDICAMENTO': 0.10040578},
  220: {'B-MEDICAMENTO': 0.7864711},
  222: {'B-MEDICAMENTO': 0.7864711},
  275: {'B-MEDICAMENTO': 0.10040578},
  305: {'B-MEDICAMENTO': 0.10040578}},
 'aggregated': True,
 'sources': ['drugs_gazetteer',
  'lf_transformer_1',
  'lf_gliner_qwen',
  'lf_gliner_bi_large',
  'lf_openai']}

In [13]:
print(spacy_docs_train[2].spans)

{'drugs_gazetteer': [RITUXIMABE, Mabthera], 'lf_transformer_1': [RITUXIMABE, Mabthera, medicamento, medicação, medicação], 'lf_gliner_llama': [], 'lf_gliner_qwen': [Mabthera, RITUXIMABE], 'lf_gliner_bi_large': [Mabthera], 'lf_openai': [RITUXIMABE, Mabthera], 'lf_salt': [], 'doclevel_voter': [medicamento, RITUXIMABE, Mabthera, medicação, medicação], 'doc_majority': [medicamento, medicamento, medicamento, medicação, medicação, medicação], 'hmm': [RITUXIMABE, Mabthera], 'maj_voter': [RITUXIMABE, Mabthera]}


In [14]:
skweak.utils.display_entities(spacy_docs_train[0], "hmm")

In [15]:
skweak.utils.display_entities(spacy_docs_train[0], "maj_voter")

In [16]:
skweak.utils.display_entities(spacy_docs_train[2], "hmm")

In [17]:
skweak.utils.display_entities(spacy_docs_train[2], "maj_voter")

In [18]:
skweak.utils.get_spans_with_probs(spacy_docs_train[2], "maj_voter")

[(RITUXIMABE, 0.7864711284637451), (Mabthera, 0.7864711284637451)]

In [19]:
skweak.utils.get_spans_with_probs(spacy_docs_train[2], "hmm")

[(RITUXIMABE, 0.7874327098359736), (Mabthera, 0.7874327340197165)]

In [20]:
from transformers import (
    AutoTokenizer,
)  # Import the AutoTokenizer from the transformers library
from tqdm.auto import tqdm  # Import tqdm for progress bars
import helpers.ner  # Import custom helper functions for Named Entity Recognition (NER)

# Load the tokenizer for the BERT model
# The tokenizer will handle tokenization and padding/truncation of input sequences
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")

# List of entity names to remove from the extracted entities
# These are common terms that we do not want to include in our NER annotations
entities_to_remove = [
    "medicamento",
    "medicamentos",
    "medicação",
    "fármaco",
    "fármacos",
    "droga",
    "drogas",
]
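
The filtering step that uses `entities_to_remove` happens inside the project helper `helpers.ner.extract_entities_in_gliner_format`, whose code is not shown here. The sketch below illustrates the idea on plain dictionaries with the same keys used later in this notebook (`start`, `end`, `text`, `label`). The function name `drop_generic_entities` and the case-insensitive comparison are assumptions, not the helper's actual behavior.

```python
def drop_generic_entities(entities, terms_to_remove):
    """Keep only entities whose surface text is not one of the generic terms."""
    # casefold() gives a case-insensitive comparison (an assumption of this sketch).
    blocked = {term.casefold() for term in terms_to_remove}
    return [ent for ent in entities if ent["text"].strip().casefold() not in blocked]


# Hypothetical entities; offsets are made up for illustration.
entities = [
    {"start": 10, "end": 20, "text": "RITUXIMABE", "label": "MEDICAMENTO"},
    {"start": 35, "end": 46, "text": "Medicamento", "label": "MEDICAMENTO"},
    {"start": 60, "end": 68, "text": "Mabthera", "label": "MEDICAMENTO"},
]

kept = drop_generic_entities(entities, ["medicamento", "medicação", "fármaco"])
kept_texts = [ent["text"] for ent in kept]
print(kept_texts)
```

Removing generic words such as "medicamento" keeps the end model from learning that the category name itself is a drug mention.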

In [21]:
# Initialize an empty list to store the training labels generated by the majority voter
train_labels_maj_voter = []

# Iterate over each document in the training dataset
# tqdm is used to display a progress bar
for doc in tqdm(spacy_docs_train):
    text = doc.text  # Extract the text from the spaCy document
    # Extract entities using the majority voter annotations
    # The entities_to_remove list is used to filter out unwanted entities
    entities = helpers.ner.extract_entities_in_gliner_format(
        doc, "maj_voter", entities_to_remove
    )
    # Convert the extracted entities to IOB format
    # IOB format is commonly used for NER tasks and stands for Inside-Outside-Beginning
    iob_format = helpers.ner.convert_to_IOB(
        entity_spans=[
            (ent["start"], ent["end"], ent["text"], ent["label"]) for ent in entities
        ],
        input_text=text,
        tokenizer=tokenizer,
    )
    # Append the text, entities, and IOB format annotations to the list
    train_labels_maj_voter.append(
        {"text": text, "entities": entities, "iob": iob_format}
    )


In [22]:
# Initialize an empty list to store the training labels generated by the HMM
train_labels_hmm = []

# Iterate over each document in the training dataset
# tqdm is used to display a progress bar
for doc in tqdm(spacy_docs_train):
    text = doc.text  # Extract the text from the spaCy document
    # Extract entities using the HMM annotations
    # The entities_to_remove list is used to filter out unwanted entities
    entities = helpers.ner.extract_entities_in_gliner_format(
        doc, "hmm", entities_to_remove
    )
    # Convert the extracted entities to IOB format
    # IOB format is commonly used for NER tasks and stands for Inside-Outside-Beginning
    iob_format = helpers.ner.convert_to_IOB(
        entity_spans=[
            (ent["start"], ent["end"], ent["text"], ent["label"]) for ent in entities
        ],
        input_text=text,
        tokenizer=tokenizer,
    )
    # Append the text, entities, and IOB format annotations to the list
    train_labels_hmm.append({"text": text, "entities": entities, "iob": iob_format})


In [23]:
# Define a function to render an example with named entities
# This function uses a helper function to visualize the named entities in the text
def render_example(example):
    # Call the helper function to render the entity data
    # text: The input text containing the named entities
    # pipeline_results: The list of entities extracted from the text
    # label_key_name: The key in the entity dictionary that contains the label (e.g., 'MEDICAMENTO')
    # colors: A dictionary specifying the colors to use for different entity labels
    helpers.ner.render_entity_data_from_pipeline(
        text=example["text"],  # The input text to be rendered
        pipeline_results=example["entities"],  # The entities extracted from the text
        label_key_name="label",  # The key in the entity dictionary that contains the label
        colors={
            "MEDICAMENTO": "lightgreen"
        },  # The color to use for the 'MEDICAMENTO' label
    )

In [24]:
render_example(train_labels_maj_voter[2])

In [25]:
render_example(train_labels_hmm[2])

In [None]:
render_example(train_labels_maj_voter[5])

In [None]:
render_example(train_labels_hmm[5])
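
The cells above rely on `helpers.ner.convert_to_IOB`, a project helper whose implementation is not shown in this notebook. To clarify what such a conversion does, here is a minimal sketch that turns character-level spans into IOB tags. It assumes whitespace tokenization and `(start, end, label)` tuples, whereas the real helper takes a Hugging Face tokenizer and four-element tuples, so treat `spans_to_iob` as a hypothetical stand-in rather than the actual code.

```python
import re


def spans_to_iob(text, spans):
    """Turn character-level (start, end, label) spans into (token, tag) pairs.

    Tokens are whitespace-separated chunks; a token belongs to a span when
    their character ranges overlap.
    """
    tagged = []
    open_span = None  # span that the previous token belonged to, if any
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        current = next(
            (s for s in spans if tok_start < s[1] and tok_end > s[0]), None
        )
        if current is None:
            tag = "O"
        elif current is open_span:
            tag = f"I-{current[2]}"  # continuation of the same span
        else:
            tag = f"B-{current[2]}"  # first token of a new span
        tagged.append((match.group(), tag))
        open_span = current
    return tagged


text = "Fornecer acetato de abiraterona ao autor"
start = text.index("acetato")
end = start + len("acetato de abiraterona")
iob = spans_to_iob(text, [(start, end, "MEDICAMENTO")])
print(iob)
```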

### Step 4 - Training a Named Entity Recognition (NER) End Model Using Weak Labels

With the aggregated weak labels in place, we can now train a Named Entity Recognition (NER) model that identifies drug entities in legal documents. This involves three key steps: selecting an appropriate model architecture, preparing the training data, and fine-tuning the model.

#### Selecting the Model Architecture

For this task, we will use a **BERT-based model**. **BERT (Bidirectional Encoder Representations from Transformers)** performs strongly on NER because it represents each token using both its left and right context. This bidirectional context is what a sequence labeling task like NER needs to decide whether a token belongs to an entity.

#### Transfer Learning and Fine-tuning

Instead of training a BERT model from scratch, we will use **transfer learning**: we start from a BERT model pre-trained on a large corpus of Portuguese text (`neuralmind/bert-base-portuguese-cased`) and fine-tune it on our specific task of drug entity recognition. Here are the steps involved:

1. **Loading the Pre-trained Model**: Initialize the model with weights learned from a vast Portuguese text corpus.
2. **Adapting to the Task**: Replace the model's output layer with a token classification layer that predicts the drug entity labels (`O`, `B-MEDICAMENTO`, `I-MEDICAMENTO`).
3. **Training on the Data**: Fine-tune the model on our weakly labeled datasets, which adjusts the pre-trained parameters for our NER task.

#### Preparing the Training Data

We will use three different versions of the annotated training data to fine-tune the BERT model:

1. **HMM Dataset**:
    - Contains labels aggregated using the Hidden Markov Model (HMM) method.
    - Provides a probabilistic approach to label aggregation.

2. **Majority Vote Dataset**:
    - Uses the weak labels determined by the most common annotation among different labeling functions.
    - A simple yet effective method for consensus labeling.

3. **True Labels Dataset**:
    - Includes manually annotated true labels, serving as the gold standard.
    - This dataset is used only for educational purposes and allows us to benchmark the performance of our weakly supervised models. In real-world scenarios, we typically do not have access to such fully annotated data.

#### Training the Model

The training process will involve fine-tuning the BERT model on each of the three versions of the dataset. By comparing the performance of the models trained on HMM and Majority Vote datasets to the one trained on the True dataset, we can assess the effectiveness of our weak labeling strategies.

> **Note**: The True dataset is included solely for educational insights. In practical applications, the focus is primarily on applying weak labels to minimize manual annotation efforts while still achieving high-performance NER models.
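
The comparison between the three models relies on an entity-level F1-score. As a rough sketch of how such a score can be computed from BIO tags, the code below extracts entity spans and counts exact matches for a single sequence. It is a strict simplification written for illustration (stray `I-` tags without a preceding `B-` are ignored), so its numbers may differ from those of a dedicated evaluation library; the function names are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (start, end, label) entity spans in a BIO sequence (end exclusive)."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1; only exact span matches count."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    return precision, recall, f1


gold = ["O", "B-MEDICAMENTO", "I-MEDICAMENTO", "O", "B-MEDICAMENTO"]
pred = ["O", "B-MEDICAMENTO", "I-MEDICAMENTO", "O", "O"]
precision, recall, f1 = entity_f1(gold, pred)
print(precision, recall, round(f1, 3))
```

For a whole test set, the true-positive, predicted, and gold counts would be summed over all documents before computing the ratios.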


In [30]:
from helpers.text import calculate_md5
import pandas as pd

In [31]:
df = pd.read_parquet("data/ner/dataset.parquet")
df_train = df[df.split == "train"].copy()
df_dev = df[df.split == "valid"].copy()
df_test = df[df.split == "test"].copy()
df_unlabeled = df[df.split == "unlabel"].copy()


df_train.reset_index(drop=True, inplace=True)
df_dev.reset_index(drop=True, inplace=True)
df_test.reset_index(drop=True, inplace=True)
df_unlabeled.reset_index(drop=True, inplace=True)


def tuplify_list_of_arrays(list_of_arrays):
    return tuple(tuple(array) for array in list_of_arrays)


df_train["label_iob"] = df_train.label_iob.apply(tuplify_list_of_arrays)
df_dev["label_iob"] = df_dev.label_iob.apply(tuplify_list_of_arrays)
df_test["label_iob"] = df_test.label_iob.apply(tuplify_list_of_arrays)

df_train.shape, df_dev.shape, df_test.shape, df_unlabeled.shape

((914, 5), (102, 5), (254, 5), (10000, 5))

In [32]:
df_train

| | uid | text | label_iob | label_span | split |
|---|---|---|---|---|---|
| 0 | bf9e07d680a0c85e8c4ff7b5ecbf0a3a | REQUERENTE: RONNY ROBERT BERTO FREITAS REQUERI... | ((REQUERENTE, O), (:, O), (RONNY, O), (ROBERT,... | [] | train |
| 1 | dab3c550c6ad85849fdeef689ecac95a | No caso em comento, não foi produzido nos auto... | ((No, O), (caso, O), (em, O), (comento, O), (,... | [] | train |
| 2 | 79d65b619aa4afd1e9fea5ecced1d030 | Cuida-se de pedido de tratamento público de sa... | ((Cuida, O), (-, O), (se, O), (de, O), (pedido... | [{'end': 109, 'label': 'MEDICAMENTO', 'start':... | train |
| 3 | 3ba71b948d5dd8bf0b3185f5ad4c430d | 9. Súmula do julgamento: A Turma Recursal dos ... | ((9, O), (., O), (Súmula, O), (do, O), (julgam... | [] | train |
| 4 | b1e195ed146f6b10316c71856e45a1c6 | No tocante ao juízo de verossimilhança, apoiad... | ((No, O), (tocante, O), (ao, O), (juízo, O), (... | [{'end': 1156, 'label': 'MEDICAMENTO', 'start'... | train |
| ... | ... | ... | ... | ... | ... |
| 909 | 39d246fe5e3e2b54617b03733447e22c | JUSTIÇA FEDERAL DA 5a REGIÃO SEÇÃO JUDICIÁRIA ... | ((JUSTIÇA, O), (FEDERAL, O), (DA, O), (5a, O),... | [] | train |
| 910 | 0061871fa52981dde4cac8d4b4acc6d2 | Trata-se de execução provisória, que determino... | ((Trata, O), (-, O), (se, O), (de, O), (execuç... | [] | train |
| 911 | 1d52ef97483cabbb900520db56f7047a | [9][4] RE-AgR 534908/PE, AI-AgR 486816/RJ, RE-... | (([, O), (9, O), (], O), ([, O), (4, O), (], O... | [] | train |
| 912 | fd0396071d4f6bf5dbea71b188225010 | (.) Sobre os antiangiogênicos, é oportuno menc... | (((, O), (., O), (), O), (Sobre, O), (os, O), ... | [{'end': 70, 'label': 'MEDICAMENTO', 'start': ... | train |


Let's prepare the data so that our datasets are ready for 🤗 Transformers.

In [33]:
# Extract the labels from the training dataset, one entry per example
# Each entry is a tuple of (word, IOB tag) pairs for a single example in the training dataset
train_labels_true_iob = df_train["label_iob"].values.tolist()

train_texts_true = df_train["text"].values.tolist()
train_texts_hash = [calculate_md5(text) for text in train_texts_true]

In [34]:
# Extract the labels from the validation dataset, one entry per example
# Each entry is a tuple of (word, IOB tag) pairs for a single example in the validation dataset
valid_labels_true_iob = df_dev["label_iob"].values.tolist()

# Extract the labels from the test dataset, one entry per example
# Each entry is a tuple of (word, IOB tag) pairs for a single example in the test dataset
test_labels_true_iob = df_test["label_iob"].values.tolist()

# Extract the text from the validation dataset as a list
valid_texts_true = df_dev["text"].values.tolist()

# Extract the text from the test dataset as a list
test_texts_true = df_test["text"].values.tolist()

# Calculate the MD5 hash for each text in the validation dataset
valid_texts_hash = [calculate_md5(text) for text in valid_texts_true]

# Calculate the MD5 hash for each text in the test dataset
test_texts_hash = [calculate_md5(text) for text in test_texts_true]
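
The `calculate_md5` function comes from the project module `helpers.text` and its code is not shown here. A plausible equivalent, assuming it hashes the UTF-8 bytes of the text, is sketched below together with one possible use of such hashes: checking whether the same text appears in two splits. Both the function `md5_of_text` and the overlap check are illustrative assumptions, and the short texts are made up.

```python
import hashlib


def md5_of_text(text):
    """Hex MD5 digest of a string, computed over its UTF-8 encoding."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical short texts standing in for two dataset splits.
train_texts = ["Cuida-se de pedido de tratamento", "Trata-se de execução provisória"]
test_texts = ["Trata-se de execução provisória", "Sobre os antiangiogênicos"]

# A fixed-length hash is a convenient key for comparing long documents.
train_hashes = {md5_of_text(t) for t in train_texts}
overlap = [t for t in test_texts if md5_of_text(t) in train_hashes]
print(overlap)
```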

In [35]:
# Separate words and tags for each example in the training dataset
# The list comprehension iterates over each example in train_labels_true_iob
# For each example, it extracts the words and tags, creating two separate lists
train_words_list = [[word for word, _ in example] for example in train_labels_true_iob]
train_tags_list = [[tag for _, tag in example] for example in train_labels_true_iob]

# Create a dictionary to store the tokens and named entity tags for the training dataset
# 'tokens' contains the list of words for each example
# 'ner_tags' contains the list of named entity tags for each example
train_true_dicts = {
    "tokens": train_words_list,
    "ner_tags": train_tags_list,
    "text": train_texts_true,
    "hash": train_texts_hash,
}

In [36]:
# Separate words and tags for each example in the validation dataset
# The list comprehension iterates over each example in valid_labels_true_iob
# For each example, it extracts the words and tags, creating two separate lists
valid_words_list = [[word for word, _ in example] for example in valid_labels_true_iob]
valid_tags_list = [[tag for _, tag in example] for example in valid_labels_true_iob]

# Create a dictionary to store the tokens and named entity tags for the validation dataset
# 'tokens' contains the list of words for each example
# 'ner_tags' contains the list of named entity tags for each example
valid_true_dicts = {
    "tokens": valid_words_list,
    "ner_tags": valid_tags_list,
    "text": valid_texts_true,
    "hash": valid_texts_hash,
}

In [37]:
# Separate words and tags for each example in the test dataset
# The list comprehension iterates over each example in test_labels_true_iob
# For each example, it extracts the words and tags, creating two separate lists
test_words_list = [[word for word, _ in example] for example in test_labels_true_iob]
test_tags_list = [[tag for _, tag in example] for example in test_labels_true_iob]

# Create a dictionary to store the tokens and named entity tags for the test dataset
# 'tokens' contains the list of words for each example
# 'ner_tags' contains the list of named entity tags for each example
test_true_dicts = {
    "tokens": test_words_list,
    "ner_tags": test_tags_list,
    "text": test_texts_true,
    "hash": test_texts_hash,
}

In [38]:
# Extract the IOB format annotations from the HMM-labeled training data
# The list comprehension iterates over each entry in train_labels_hmm
# For each entry, it extracts the 'iob' field, which contains the IOB format annotations
hmm_iob_train = [i["iob"] for i in train_labels_hmm]

# Separate words and tags for each example in the HMM-labeled training data
# The first list comprehension iterates over each example in hmm_iob_train
# For each example, it extracts the words, creating a list of words for each example
hmm_words_list = [[word for word, _ in example] for example in hmm_iob_train]

# The second list comprehension iterates over each example in hmm_iob_train
# For each example, it extracts the tags, creating a list of tags for each example
hmm_tags_list = [[tag for _, tag in example] for example in hmm_iob_train]

# Extract the text content from the HMM-labeled training data
hmm_texts_list = [i["text"] for i in train_labels_hmm]

# Extract the md5 hash for each text in the HMM-labeled training data
hmm_texts_hash = [calculate_md5(text) for text in hmm_texts_list]

# Create a dictionary to store the tokens and named entity tags for the HMM-labeled training data
# 'tokens' contains the list of words for each example
# 'ner_tags' contains the list of named entity tags for each example
train_hmm_dicts = {
    "tokens": hmm_words_list,
    "ner_tags": hmm_tags_list,
    "text": hmm_texts_list,
    "hash": hmm_texts_hash,
}

In [39]:
# Extract the IOB format annotations from the maj_voter-labeled training data
# The list comprehension iterates over each entry in train_labels_maj_voter
# For each entry, it extracts the 'iob' field, which contains the IOB format annotations
maj_voter_iob_train = [i["iob"] for i in train_labels_maj_voter]

# Separate words and tags for each example in the maj_voter-labeled training data
# The first list comprehension iterates over each example in maj_voter_iob_train
# For each example, it extracts the words, creating a list of words for each example
maj_voter_words_list = [
    [word for word, _ in example] for example in maj_voter_iob_train
]

# The second list comprehension iterates over each example in maj_voter_iob_train
# For each example, it extracts the tags, creating a list of tags for each example
maj_voter_tags_list = [[tag for _, tag in example] for example in maj_voter_iob_train]

# Extract the text content from the maj_voter-labeled training data
maj_voter_texts_list = [i["text"] for i in train_labels_maj_voter]

# Extract the md5 hash for each text in the maj_voter-labeled training data
maj_voter_texts_hash = [calculate_md5(text) for text in maj_voter_texts_list]

# Create a dictionary to store the tokens and named entity tags for the maj_voter-labeled training data
# 'tokens' contains the list of words for each example
# 'ner_tags' contains the list of named entity tags for each example
train_maj_voter_dicts = {
    "tokens": maj_voter_words_list,
    "ner_tags": maj_voter_tags_list,
    "text": maj_voter_texts_list,
    "hash": maj_voter_texts_hash,
}

In [40]:
from datasets import (
    ClassLabel,
    Dataset,
    Features,
    Sequence,
    Value,
)  # Import necessary classes from the datasets library

# Define the mapping from label names to label IDs
# 'O' represents tokens that are not part of any named entity
# 'B-MEDICAMENTO' represents the beginning of a 'MEDICAMENTO' entity
# 'I-MEDICAMENTO' represents the inside of a 'MEDICAMENTO' entity
label_to_id = {"O": 0, "B-MEDICAMENTO": 1, "I-MEDICAMENTO": 2}

# Create the reverse mapping from label IDs to label names
# This is useful for converting label IDs back to label names
id_to_label = {v: k for k, v in label_to_id.items()}

# Extract the label names from the mapping
# This list will be used to define the ClassLabel feature in the dataset
label_names = list(label_to_id.keys())

# Define the dataset features
# The Features class specifies the schema of the dataset
# 'tokens' is a sequence of strings, representing the tokens in the text
# 'ner_tags' is a sequence of ClassLabel, representing the named entity recognition tags for each token
dataset_features = Features(
    {
        "tokens": Sequence(Value("string")),  # Sequence of tokens
        "ner_tags": Sequence(
            ClassLabel(names=label_names)
        ),  # Sequence of named entity recognition tags,
        "text": Value("string"),  # The original text
        "hash": Value("string"),  # The MD5 hash of the text
    }
)
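
A `ClassLabel` feature stores each tag as an integer id given by its position in `names`. The plain-Python sketch below mimics that encoding and decoding with the same mapping, without using the `datasets` library; the helper names `encode_tags` and `decode_ids` are hypothetical.

```python
label_to_id = {"O": 0, "B-MEDICAMENTO": 1, "I-MEDICAMENTO": 2}
id_to_label = {idx: name for name, idx in label_to_id.items()}


def encode_tags(tags):
    """Map string tags to integer ids; an unknown tag raises KeyError."""
    return [label_to_id[tag] for tag in tags]


def decode_ids(ids):
    """Map integer ids back to their string tags."""
    return [id_to_label[idx] for idx in ids]


tags = ["O", "B-MEDICAMENTO", "I-MEDICAMENTO", "O"]
ids = encode_tags(tags)
print(ids)
print(decode_ids(ids))
```

Failing loudly on an unknown tag is useful here, since a misspelled label in a weakly labeled dataset would otherwise go unnoticed until training.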

In [41]:
from datasets import Dataset  # Import the Dataset class from the datasets library

# Create a Hugging Face Dataset from the HMM-labeled training data
# The Dataset.from_dict method converts a dictionary to a Dataset object
# train_hmm_dicts contains the tokens and named entity tags for the HMM-labeled training data
# dataset_features specifies the schema of the dataset (tokens, ner_tags, text, and hash)
hf_dataset_train_hmm = Dataset.from_dict(train_hmm_dicts, features=dataset_features)

# Create a Hugging Face Dataset from the majority voter-labeled training data
# train_maj_voter_dicts contains the tokens and named entity tags for the majority voter-labeled training data
hf_dataset_train_maj_voter = Dataset.from_dict(
    train_maj_voter_dicts, features=dataset_features
)

# Create a Hugging Face Dataset from the true-labeled training data
# train_true_dicts contains the tokens and named entity tags for the true-labeled training data
hf_dataset_train_true = Dataset.from_dict(train_true_dicts, features=dataset_features)

# Create a Hugging Face Dataset from the true-labeled validation data
# valid_true_dicts contains the tokens and named entity tags for the true-labeled validation data
hf_dataset_dev_true = Dataset.from_dict(valid_true_dicts, features=dataset_features)

# Create a Hugging Face Dataset from the true-labeled test data
# test_true_dicts contains the tokens and named entity tags for the true-labeled test data
hf_dataset_test_true = Dataset.from_dict(test_true_dicts, features=dataset_features)

In [42]:
# Save the Hugging Face Datasets to disk

# Save the HMM-labeled training dataset to disk
hf_dataset_train_hmm.save_to_disk("outputs/ner/hf_dataset_train_hmm")

# Save the majority voter-labeled training dataset to disk
hf_dataset_train_maj_voter.save_to_disk("outputs/ner/hf_dataset_train_maj_voter")

# Save the true-labeled training dataset to disk
hf_dataset_train_true.save_to_disk("outputs/ner/hf_dataset_train_true")

# Save the true-labeled validation dataset to disk
hf_dataset_dev_true.save_to_disk("outputs/ner/hf_dataset_dev_true")

# Save the true-labeled test dataset to disk
hf_dataset_test_true.save_to_disk("outputs/ner/hf_dataset_test_true")

Saving the dataset (0/1 shards):   0%|          | 0/10000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/10000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/914 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/102 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/254 [00:00<?, ? examples/s]

In [43]:
# You can load it later

from datasets import load_from_disk  # Import the load_from_disk function

# Load the Hugging Face Datasets from disk
# The load_from_disk function loads the datasets saved in the specified directory

hf_dataset_train_hmm = load_from_disk("outputs/ner/hf_dataset_train_hmm")
hf_dataset_train_maj_voter = load_from_disk("outputs/ner/hf_dataset_train_maj_voter")
hf_dataset_train_true = load_from_disk("outputs/ner/hf_dataset_train_true")
hf_dataset_dev_true = load_from_disk("outputs/ner/hf_dataset_dev_true")
hf_dataset_test_true = load_from_disk("outputs/ner/hf_dataset_test_true")

In [44]:
hf_dataset_train_hmm.features["ner_tags"].feature

ClassLabel(names=['O', 'B-MEDICAMENTO', 'I-MEDICAMENTO'], id=None)

In [45]:
# Rebuild the label-to-id mapping from the ClassLabel feature of the loaded dataset
label_to_id = {
    k: v for v, k in enumerate(hf_dataset_train_hmm.features["ner_tags"].feature.names)
}
id_to_label = {v: k for k, v in label_to_id.items()}

label_to_id, id_to_label

({'O': 0, 'B-MEDICAMENTO': 1, 'I-MEDICAMENTO': 2},
 {0: 'O', 1: 'B-MEDICAMENTO', 2: 'I-MEDICAMENTO'})

In [46]:
# Import necessary classes from the transformers library
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForTokenClassification,
)

MODEL_NAME = "neuralmind/bert-base-portuguese-cased"  # Specify the name of the pretrained BERT model

# Load the tokenizer for the BERT model
# The tokenizer will handle tokenization and padding/truncation of input sequences
# 'neuralmind/bert-base-portuguese-cased' is the name of the pretrained model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Create a DataCollator for token classification tasks
# The DataCollatorForTokenClassification dynamically pads the input sequences to the maximum length in the batch
# This ensures that all sequences in a batch have the same length, which is required for efficient processing
# The tokenizer is passed to the DataCollator to handle tokenization and padding
token_classification_collator = DataCollatorForTokenClassification(tokenizer)
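
To show what dynamic padding means for token classification, the sketch below pads a small batch by hand. It is a simplified stand-in for the collator, written under a few assumptions: the padding token id is taken to be 0, the token ids are made up, and plain lists are returned instead of tensors. Labels are padded with -100, the value the collator uses by default and the index that PyTorch's cross-entropy loss ignores by default.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad every example in the batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        missing = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * missing)
        # 1 marks real tokens, 0 marks padding that the model should not attend to.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * missing)
        # Padded positions get label_pad_id so they do not contribute to the loss.
        batch["labels"].append(f["labels"] + [label_pad_id] * missing)
    return batch


# Two hypothetical tokenized examples of different lengths.
features = [
    {"input_ids": [101, 7, 8, 9, 102], "labels": [-100, 0, 1, 2, -100]},
    {"input_ids": [101, 5, 102], "labels": [-100, 0, -100]},
]
batch = pad_batch(features)
print(batch)
```

Padding per batch rather than to a fixed global length keeps batches of short documents small, which saves computation during fine-tuning.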

In [47]:
from typing import Dict, List  # Import necessary types for type annotations


def tokenize_and_align_labels(examples: Dict[str, List[list]]) -> Dict[str, List[list]]:
    """
    Tokenizes a batch of pre-split examples and aligns the word-level labels with the sub-word tokens.

    Args:
        examples: A batch of examples (the function expects batched input).
                  - "tokens": For each example, the list of words.
                  - "ner_tags": For each example, the list of NER tag ids, one per word.

    Returns:
        The tokenizer output (a dict-like object) with the aligned labels added.
        - "input_ids": List of token IDs for each example.
        - "attention_mask": List of attention masks for each example.
        - "labels": List of aligned label ids for each example, with -100 at positions to ignore.
    """
    # If True, every sub-word piece inherits the label of its word (so all pieces of a
    # B-MEDICAMENTO word are tagged B-MEDICAMENTO). If False, only the first piece is
    # labeled and the remaining pieces receive -100.
    label_all_tokens = True
    # Tokenize the input words
    # truncation=True: Truncate sequences to the maximum length
    # is_split_into_words=True: The input is already split into words
    # max_length=512: Maximum length of the tokenized sequences
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True, max_length=512
    )

    aligned_labels = []  # List to store the aligned labels for each example
    # Iterate over each example in the input data
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(
            batch_index=i
        )  # Get the word IDs for the current example
        previous_word_idx = None  # Initialize the previous word index
        label_ids = []  # List to store the aligned labels for the current example

        # Iterate over each word ID in the tokenized input
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(
                    -100
                )  # Append -100 for special tokens (e.g., [CLS], [SEP])
            elif word_idx != previous_word_idx:
                label_ids.append(
                    label[word_idx]
                )  # Append the label for the current word
            else:
                # Append the label for the current word if label_all_tokens is True, otherwise append -100
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx  # Update the previous word index

        aligned_labels.append(
            label_ids
        )  # Append the aligned labels for the current example

    tokenized_inputs["labels"] = (
        aligned_labels  # Add the aligned labels to the tokenized inputs
    )
    return tokenized_inputs  # Return the tokenized inputs with aligned labels


# Apply the tokenization and label alignment to the training and validation datasets
# The map method applies the tokenize_and_align_labels function to each example in the dataset
# batched=True: Process the examples in batches for efficiency
tokenized_hf_dataset_train_hmm = hf_dataset_train_hmm.map(
    tokenize_and_align_labels, batched=True
)
tokenized_hf_dataset_train_maj_voter = hf_dataset_train_maj_voter.map(
    tokenize_and_align_labels, batched=True
)
tokenized_hf_dataset_train_true = hf_dataset_train_true.map(
    tokenize_and_align_labels, batched=True
)
tokenized_hf_dataset_dev_true = hf_dataset_dev_true.map(
    tokenize_and_align_labels, batched=True
)
tokenized_hf_dataset_test_true = hf_dataset_test_true.map(
    tokenize_and_align_labels, batched=True
)


In [48]:
# Display the features of the 'ner_tags' field in the test dataset
# The features attribute provides information about the schema of the dataset
# 'ner_tags' is a sequence of named entity recognition tags for each token in the dataset
# This line of code will show the details of the 'ner_tags' field, such as the possible tag values and their corresponding IDs
hf_dataset_test_true.features["ner_tags"]

Sequence(feature=ClassLabel(names=['O', 'B-MEDICAMENTO', 'I-MEDICAMENTO'], id=None), length=-1, id=None)
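
The alignment logic inside `tokenize_and_align_labels` can be illustrated without loading a tokenizer. The sketch below is a minimal, self-contained version of the same loop; `align_labels` is a hypothetical helper name and the `word_ids` list is written by hand to mimic what a fast tokenizer would return for three words, the second of which is split into three sub-word tokens.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    label_all_tokens: bool = True,
) -> List[int]:
    """Map word-level label IDs onto sub-word tokens using word_ids."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens such as [CLS] and [SEP] have no source word
            aligned.append(IGNORE_INDEX)
        elif word_idx != previous_word_idx:
            # First sub-word token of a word always gets the word's label
            aligned.append(word_labels[word_idx])
        else:
            # Later sub-word tokens: repeat the label or mask it out
            aligned.append(word_labels[word_idx] if label_all_tokens else IGNORE_INDEX)
        previous_word_idx = word_idx
    return aligned


# Hand-written example (assumption): word labels O, B-MEDICAMENTO, O -> 0, 1, 0
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [0, 1, 0]

all_tokens = align_labels(word_ids, word_labels, label_all_tokens=True)
first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
print(all_tokens)  # [-100, 0, 1, 1, 1, 0, -100]
print(first_only)  # [-100, 0, 1, -100, -100, 0, -100]
```

With `label_all_tokens=True`, every sub-word piece of a word tagged `B-MEDICAMENTO` receives `B-MEDICAMENTO` again, so the token-level sequence no longer follows strict BIO. A common variant assigns the matching `I-` tag to continuation pieces, or masks them with `-100` as in the second call.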

In [49]:
import numpy as np  # Import the NumPy library for numerical operations
from typing import Tuple, Dict  # Import Tuple and Dict for type annotations
import evaluate  # Import the evaluate library for computing evaluation metrics


def compute_metrics_for_evaluation(
    predictions_and_labels: Tuple[np.ndarray, np.ndarray],
) -> Dict[str, float]:
    """
    Computes metrics for model evaluation.

    Args:
        predictions_and_labels: A tuple containing the model predictions and the true labels.
                                - predictions: A NumPy array of shape (batch_size, sequence_length, num_labels)
                                - labels: A NumPy array of shape (batch_size, sequence_length)

    Returns:
        A dictionary containing precision, recall, f1, and accuracy metrics.
    """
    predictions, labels = (
        predictions_and_labels  # Unpack the predictions and labels from the input tuple
    )

    # Convert logits to actual predictions
    # np.argmax(predictions, axis=2) selects the index of the maximum value along the last axis (num_labels)
    # This converts the logits to the predicted label IDs
    predictions = np.argmax(predictions, axis=2)

    # Filter out the special tokens and convert IDs to labels
    # true_predictions and true_labels will contain the predicted and true labels, excluding special tokens
    true_predictions = [
        [
            id_to_label[pred_id]
            for pred_id, label_id in zip(pred_row, label_row)
            if label_id != -100
        ]
        for pred_row, label_row in zip(predictions, labels)
    ]
    true_labels = [
        [id_to_label[label_id] for label_id in label_row if label_id != -100]
        for label_row in labels
    ]

    # Load the seqeval metric for named entity recognition
    # The seqeval metric computes precision, recall, f1, and accuracy for NER tasks
    metric = evaluate.load("seqeval")

    # Compute the evaluation metrics using the true predictions and true labels
    results = metric.compute(predictions=true_predictions, references=true_labels)

    # Return the computed metrics as a dictionary
    # The dictionary contains the overall precision, recall, f1, and accuracy
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
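
The precision, recall, and F1 values returned above are computed over whole entities, while accuracy is computed over tokens. The sketch below shows why the two can diverge sharply. It is a simplified illustration, not `seqeval`'s implementation: `extract_spans` and `entity_prf` are hypothetical helpers, and an `I-` tag that does not continue an open entity is simply ignored here.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (simplified)."""
    spans = set()
    start, ent_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, ent_type))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return spans


def entity_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 using exact span matches."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up sentence: one two-token medication among ten tokens
gold = ["O", "B-MEDICAMENTO", "I-MEDICAMENTO"] + ["O"] * 7
pred = ["O"] * len(gold)  # a model that never predicts an entity

token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
precision, recall, f1 = entity_prf(gold, pred)
print(token_accuracy)         # 0.8
print(precision, recall, f1)  # 0.0 0.0 0.0
```

Because most tokens are tagged `O`, a model that predicts `O` everywhere still reaches high token accuracy while its entity-level F1 is zero. This is one plausible reading of any epoch that reports an accuracy near 0.99 together with an F1 of 0.0.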

In [50]:
# Load the pretrained model identified by MODEL_NAME (the tokenizer was loaded earlier)
# The model is configured for token classification with the specified number of labels
pretrained_language_model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(id_to_label),  # Number of unique labels in the dataset
    id2label=id_to_label,  # Mapping from label IDs to label names
    label2id=label_to_id,  # Mapping from label names to label IDs
)

# Define the training arguments for the Trainer
training_args = TrainingArguments(
    output_dir="./outputs/ner/bert-base-ner-true-labels",  # Directory to save model checkpoints and logs
    num_train_epochs=7,  # Number of training epochs
    per_device_train_batch_size=24,  # Batch size for training
    per_device_eval_batch_size=24,  # Batch size for evaluation
    weight_decay=0.01,  # Weight decay for regularization
    seed=271828,  # Random seed for reproducibility
    bf16=True,  # Use bfloat16 mixed precision (requires hardware support)
    fp16=False,  # FP16 mixed precision is disabled because bf16 is used instead
    save_total_limit=1,  # Limit the total number of saved checkpoints
    logging_steps=1,  # Log training metrics every step
    eval_steps=1,  # No effect here: eval_strategy="epoch" evaluates once per epoch
    save_steps=1,  # No effect here: save_strategy="epoch" saves once per epoch
    metric_for_best_model="eval_f1",  # Metric to determine the best model
    greater_is_better=True,  # Higher metric value is better
    logging_strategy="steps",  # Log metrics at each step
    eval_strategy="epoch",  # Evaluate the model at the end of each epoch
    save_strategy="epoch",  # Save the model at the end of each epoch
    load_best_model_at_end=True,  # Load the best model at the end of training
    do_train=True,  # Perform training
    do_eval=True,  # Perform evaluation
    gradient_accumulation_steps=1,  # No gradient accumulation: update weights after every batch
    push_to_hub=False,  # Do not push the model to the Hugging Face Hub
    learning_rate=2e-5,  # Learning rate for the optimizer
    overwrite_output_dir=True,  # Overwrite the output directory if it exists
)

# Create a Trainer instance to handle training and evaluation
trainer = Trainer(
    model=pretrained_language_model,  # The model to be trained
    args=training_args,  # Training arguments defined above
    train_dataset=tokenized_hf_dataset_train_true,  # Tokenized training dataset
    eval_dataset=tokenized_hf_dataset_dev_true,  # Tokenized validation dataset
    processing_class=tokenizer,  # Tokenizer for preprocessing the data
    data_collator=token_classification_collator,  # Data collator for dynamic padding
    compute_metrics=compute_metrics_for_evaluation,  # Function to compute evaluation metrics
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at neuralmind/bert-base-portuguese-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
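
The three training runs in this section use the same hyperparameters and differ only in `output_dir` and in the training dataset. As a minimal sketch, that shared configuration could be factored into one helper. `build_training_kwargs` is a hypothetical name, not part of the notebook or of `transformers`; the keys and values are copied from the `TrainingArguments` call above, leaving out `eval_steps` and `save_steps`, which are not used when the strategies are set to `"epoch"`.

```python
from typing import Any, Dict


def build_training_kwargs(run_name: str, base_dir: str = "./outputs/ner") -> Dict[str, Any]:
    """Return keyword arguments shared by all runs; only output_dir changes."""
    return {
        "output_dir": f"{base_dir}/bert-base-ner-{run_name}-labels",
        "num_train_epochs": 7,
        "per_device_train_batch_size": 24,
        "per_device_eval_batch_size": 24,
        "weight_decay": 0.01,
        "learning_rate": 2e-5,
        "seed": 271828,
        "bf16": True,
        "save_total_limit": 1,
        "logging_strategy": "steps",
        "logging_steps": 1,
        "eval_strategy": "epoch",
        "save_strategy": "epoch",
        "metric_for_best_model": "eval_f1",
        "greater_is_better": True,
        "load_best_model_at_end": True,
        "push_to_hub": False,
        "overwrite_output_dir": True,
    }


for run_name in ("true", "hmm", "maj_voter"):
    kwargs = build_training_kwargs(run_name)
    print(kwargs["output_dir"])
    # In the notebook this would become: training_args = TrainingArguments(**kwargs)
```

Keeping the configuration in one place makes it harder for the three runs to drift apart, which matters when their test scores are compared directly.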


In [51]:
# Start the training process using the Trainer instance
# This will train the model on the training dataset and evaluate it on the validation dataset
# The training process will follow the configurations specified in the TrainingArguments
trainer.train()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33meliasjacob[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin




| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|-------|---------------|-----------------|-----------|--------|----|----------|
| 1 | 0.0089 | 0.040877 | 0.0 | 0.0 | 0.0 | 0.986724 |
| 2 | 0.0548 | 0.019393 | 0.703313 | 0.828014 | 0.760586 | 0.994566 |
| 3 | 0.003 | 0.017162 | 0.735614 | 0.838652 | 0.783761 | 0.995184 |
| 4 | 0.0004 | 0.016903 | 0.751181 | 0.845745 | 0.795663 | 0.995513 |
| 5 | 0.006 | 0.016586 | 0.761675 | 0.838652 | 0.798312 | 0.995616 |
| 6 | 0.0022 | 0.016421 | 0.761218 | 0.842199 | 0.799663 | 0.995616 |
| 7 | 0.0021 | 0.01647 | 0.763666 | 0.842199 | 0.801012 | 0.995657 |




TrainOutput(global_step=140, training_loss=0.04603014853284029, metrics={'train_runtime': 259.444, 'train_samples_per_second': 24.66, 'train_steps_per_second': 0.54, 'total_flos': 1671679487062152.0, 'train_loss': 0.04603014853284029, 'epoch': 7.0})

In [52]:
# Evaluate the model on the tokenized test dataset using the Trainer instance
# This will return a dictionary of evaluation metrics such as loss, precision, recall, and F1 score
metrics_true_labels = trainer.evaluate(tokenized_hf_dataset_test_true)

# Display the evaluation metrics
# These metrics help us understand the performance of the model on the test dataset
metrics_true_labels



{'eval_loss': 0.00876336358487606,
 'eval_precision': 0.8616236162361623,
 'eval_recall': 0.9296615792966157,
 'eval_f1': 0.894350462815193,
 'eval_accuracy': 0.9973300437670041,
 'eval_runtime': 3.3054,
 'eval_samples_per_second': 76.844,
 'eval_steps_per_second': 1.815,
 'epoch': 7.0}

In [53]:
import gc  # Import the garbage collection module
import torch  # Import the PyTorch library

# Set the pretrained language model and trainer to None
# This helps in releasing the memory allocated to these objects
pretrained_language_model = None
trainer = None

# Force the garbage collector to release unreferenced memory
# This is useful to free up memory that is no longer needed
gc.collect()

# Empty the CUDA cache
# This releases GPU memory that was allocated by PyTorch but is no longer needed
torch.cuda.empty_cache()

In [54]:
# Load a fresh copy of the pretrained model (the tokenizer was loaded earlier)
# The model is configured for token classification with the specified number of labels
pretrained_language_model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,  # Pretrained model name
    num_labels=len(id_to_label),  # Number of unique labels in the dataset
    id2label=id_to_label,  # Mapping from label IDs to label names
    label2id=label_to_id,  # Mapping from label names to label IDs
)

# Define the training arguments for the Trainer
training_args = TrainingArguments(
    output_dir="./outputs/ner/bert-base-ner-hmm-labels",  # Directory to save model checkpoints and logs
    num_train_epochs=7,  # Number of training epochs
    per_device_train_batch_size=24,  # Batch size for training
    per_device_eval_batch_size=24,  # Batch size for evaluation
    weight_decay=0.01,  # Weight decay for regularization
    seed=271828,  # Random seed for reproducibility
    bf16=True,  # Use bfloat16 mixed precision (requires hardware support)
    fp16=False,  # FP16 mixed precision is disabled because bf16 is used instead
    save_total_limit=1,  # Limit the total number of saved checkpoints
    logging_steps=1,  # Log training metrics every step
    eval_steps=1,  # No effect here: eval_strategy="epoch" evaluates once per epoch
    save_steps=1,  # No effect here: save_strategy="epoch" saves once per epoch
    metric_for_best_model="eval_f1",  # Metric to determine the best model
    greater_is_better=True,  # Higher metric value is better
    logging_strategy="steps",  # Log metrics at each step
    eval_strategy="epoch",  # Evaluate the model at the end of each epoch
    save_strategy="epoch",  # Save the model at the end of each epoch
    load_best_model_at_end=True,  # Load the best model at the end of training
    do_train=True,  # Perform training
    do_eval=True,  # Perform evaluation
    gradient_accumulation_steps=1,  # No gradient accumulation: update weights after every batch
    push_to_hub=False,  # Do not push the model to the Hugging Face Hub
    learning_rate=2e-5,  # Learning rate for the optimizer
    overwrite_output_dir=True,  # Overwrite the output directory if it exists
)

# Create a Trainer instance to handle training and evaluation
trainer = Trainer(
    model=pretrained_language_model,  # The model to be trained
    args=training_args,  # Training arguments defined above
    train_dataset=tokenized_hf_dataset_train_hmm,  # Tokenized training dataset
    eval_dataset=tokenized_hf_dataset_dev_true,  # Tokenized validation dataset
    processing_class=tokenizer,  # Tokenizer for preprocessing the data
    data_collator=token_classification_collator,  # Data collator for dynamic padding
    compute_metrics=compute_metrics_for_evaluation,  # Function to compute evaluation metrics
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at neuralmind/bert-base-portuguese-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [55]:
# Start the training process using the Trainer instance
# This will train the model on the training dataset and evaluate it on the validation dataset
# The training process will follow the configurations specified in the TrainingArguments
# During training, the model's parameters will be updated to minimize the loss function
# The evaluation metrics will be logged at each step and at the end of each epoch
trainer.train()



| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|-------|---------------|-----------------|-----------|--------|----|----------|
| 1 | 0.0144 | 0.013281 | 0.892578 | 0.810284 | 0.849442 | 0.996316 |
| 2 | 0.0143 | 0.015865 | 0.885375 | 0.794326 | 0.837383 | 0.996027 |
| 3 | 0.0097 | 0.019444 | 0.858527 | 0.785461 | 0.82037 | 0.995822 |
| 4 | 0.0111 | 0.020522 | 0.871542 | 0.781915 | 0.824299 | 0.995616 |
| 5 | 0.0018 | 0.022729 | 0.860194 | 0.785461 | 0.821131 | 0.995801 |
| 6 | 0.0066 | 0.025123 | 0.860941 | 0.746454 | 0.79962 | 0.995307 |
| 7 | 0.0116 | 0.025157 | 0.859127 | 0.76773 | 0.810861 | 0.995554 |




TrainOutput(global_step=1463, training_loss=0.017432101916185036, metrics={'train_runtime': 1546.0277, 'train_samples_per_second': 45.277, 'train_steps_per_second': 0.946, 'total_flos': 1.744000142472288e+16, 'train_loss': 0.017432101916185036, 'epoch': 7.0})

In [56]:
# Evaluate the model on the tokenized test dataset using the Trainer instance
# This will return a dictionary of evaluation metrics such as loss, precision, recall, and F1 score
# The evaluation metrics help us understand how well the model performs on unseen data
metrics_hmm_labels = trainer.evaluate(tokenized_hf_dataset_test_true)

# Display the evaluation metrics
# These metrics provide insights into the model's performance and can be used to compare different models
metrics_hmm_labels



{'eval_loss': 0.012193058617413044,
 'eval_precision': 0.9245562130177515,
 'eval_recall': 0.8294625082946251,
 'eval_f1': 0.8744316194473593,
 'eval_accuracy': 0.9964175270797776,
 'eval_runtime': 3.3072,
 'eval_samples_per_second': 76.803,
 'eval_steps_per_second': 1.814,
 'epoch': 7.0}

In [57]:
# Set the pretrained language model and trainer to None
# This helps in releasing the memory allocated to these objects
# By setting these variables to None, we remove their references, making them eligible for garbage collection
pretrained_language_model = None
trainer = None

# Force the garbage collector to release unreferenced memory
# This is useful to free up memory that is no longer needed
# The garbage collector will clean up any objects that are no longer referenced in the code
gc.collect()

# Empty the CUDA cache
# This releases GPU memory that was allocated by PyTorch but is no longer needed
# Clearing the CUDA cache helps in managing GPU memory more efficiently, especially when working with large models
torch.cuda.empty_cache()

In [58]:
# Load a fresh copy of the pretrained model (the tokenizer was loaded earlier)
# The model is configured for token classification with the specified number of labels
pretrained_language_model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,  # Pretrained model name
    num_labels=len(id_to_label),  # Number of unique labels in the dataset
    id2label=id_to_label,  # Mapping from label IDs to label names
    label2id=label_to_id,  # Mapping from label names to label IDs
)


# Define the training arguments for the Trainer
training_args = TrainingArguments(
    output_dir="./outputs/ner/bert-base-ner-maj_voter-labels",  # Directory to save model checkpoints and logs
    num_train_epochs=7,  # Number of training epochs
    per_device_train_batch_size=24,  # Batch size for training
    per_device_eval_batch_size=24,  # Batch size for evaluation
    weight_decay=0.01,  # Weight decay for regularization
    seed=271828,  # Random seed for reproducibility
    bf16=True,  # Use bfloat16 mixed precision (requires hardware support)
    fp16=False,  # FP16 mixed precision is disabled because bf16 is used instead
    save_total_limit=1,  # Limit the total number of saved checkpoints
    logging_steps=1,  # Log training metrics every step
    eval_steps=1,  # No effect here: eval_strategy="epoch" evaluates once per epoch
    save_steps=1,  # No effect here: save_strategy="epoch" saves once per epoch
    metric_for_best_model="eval_f1",  # Metric to determine the best model
    greater_is_better=True,  # Higher metric value is better
    logging_strategy="steps",  # Log metrics at each step
    eval_strategy="epoch",  # Evaluate the model at the end of each epoch
    save_strategy="epoch",  # Save the model at the end of each epoch
    load_best_model_at_end=True,  # Load the best model at the end of training
    do_train=True,  # Perform training
    do_eval=True,  # Perform evaluation
    gradient_accumulation_steps=1,  # No gradient accumulation: update weights after every batch
    push_to_hub=False,  # Do not push the model to the Hugging Face Hub
    learning_rate=2e-5,  # Learning rate for the optimizer
    overwrite_output_dir=True,  # Overwrite the output directory if it exists
)

# Create a Trainer instance to handle training and evaluation
trainer = Trainer(
    model=pretrained_language_model,  # The model to be trained
    args=training_args,  # Training arguments defined above
    train_dataset=tokenized_hf_dataset_train_maj_voter,  # Tokenized training dataset
    eval_dataset=tokenized_hf_dataset_dev_true,  # Tokenized validation dataset
    processing_class=tokenizer,  # Tokenizer for preprocessing the data
    data_collator=token_classification_collator,  # Data collator for dynamic padding
    compute_metrics=compute_metrics_for_evaluation,  # Function to compute evaluation metrics
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at neuralmind/bert-base-portuguese-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [59]:
# Start the training process using the Trainer instance
# This will train the model on the training dataset and evaluate it on the validation dataset
# The training process will follow the configurations specified in the TrainingArguments
# During training, the model's parameters will be updated to minimize the loss function
# The evaluation metrics will be logged at each step and at the end of each epoch
trainer.train()



Epoch,Training Loss,Validation Loss




TrainOutput(global_step=1463, training_loss=0.0157559212491041, metrics={'train_runtime': 1545.9905, 'train_samples_per_second': 45.278, 'train_steps_per_second': 0.946, 'total_flos': 1.744000142472288e+16, 'train_loss': 0.0157559212491041, 'epoch': 7.0})

In [60]:
# Evaluate the model on the tokenized test dataset using the Trainer instance
# This will return a dictionary of evaluation metrics such as loss, precision, recall, and F1 score
# The evaluation metrics help us understand how well the model performs on unseen data
metrics_maj_voter_labels = trainer.evaluate(tokenized_hf_dataset_test_true)

# Display the evaluation metrics
# These metrics provide insights into the model's performance and can be used to compare different models
metrics_maj_voter_labels



{'eval_loss': 0.016582684591412544,
 'eval_precision': 0.92,
 'eval_recall': 0.8241539482415395,
 'eval_f1': 0.8694434721736087,
 'eval_accuracy': 0.9962062963651419,
 'eval_runtime': 3.4512,
 'eval_samples_per_second': 73.597,
 'eval_steps_per_second': 1.739,
 'epoch': 7.0}

In [61]:
# Set the pretrained language model and trainer to None
# This helps in releasing the memory allocated to these objects
# By setting these variables to None, we remove their references, making them eligible for garbage collection
pretrained_language_model = None
trainer = None

# Force the garbage collector to release unreferenced memory
# This is useful to free up memory that is no longer needed
# The garbage collector will clean up any objects that are no longer referenced in the code
gc.collect()

# Empty the CUDA cache
# This releases GPU memory that was allocated by PyTorch but is no longer needed
# Clearing the CUDA cache helps in managing GPU memory more efficiently, especially when working with large models
torch.cuda.empty_cache()

In [62]:
import pandas as pd

# Create a DataFrame to compare evaluation metrics from different models
# The DataFrame will contain metrics from three different models: True Labels, HMM Labels, and Maj Voter Labels

# Create a DataFrame with metrics from the three models
# Each row in the DataFrame corresponds to a different metric (e.g., loss, precision, recall, F1 score)
# Each column corresponds to a different model (True Labels, HMM Labels, Maj Voter Labels)
df_metrics = pd.DataFrame(
    [
        metrics_true_labels,
        metrics_hmm_labels,
        metrics_maj_voter_labels,
    ],  # List of dictionaries containing metrics
    index=[
        "True Labels",
        "HMM Labels",
        "Maj Voter Labels",
    ],  # Index labels for the DataFrame
)

# Display the DataFrame
# This will show the evaluation metrics for each model in a tabular format
# The DataFrame provides a clear and organized way to compare the performance of different models
df_metrics

| | eval_loss | eval_precision | eval_recall | eval_f1 | eval_accuracy | eval_runtime | eval_samples_per_second | eval_steps_per_second | epoch |
|---|---|---|---|---|---|---|---|---|---|
| True Labels | 0.008763 | 0.861624 | 0.929662 | 0.89435 | 0.99733 | 3.3054 | 76.844 | 1.815 | 7.0 |
| HMM Labels | 0.012193 | 0.924556 | 0.829463 | 0.874432 | 0.996418 | 3.3072 | 76.803 | 1.814 | 7.0 |
| Maj Voter Labels | 0.016583 | 0.92 | 0.824154 | 0.869443 | 0.996206 | 3.4512 | 73.597 | 1.739 | 7.0 |


### Performance Evaluation

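We evaluated the three models on the same manually annotated test set. They differ only in the labels used for training. Here are the results:


| Training labels | Precision | Recall | F1-Score | Loss |
|----------|----------|----------|----------|----------|
| True labels | 0.861624 | **0.929662** | **0.894350** | **0.008763** |
| HMM labels | **0.924556** | 0.829463 | 0.874432 | 0.012193 |
| Majority vote labels | 0.920000 | 0.824154 | 0.869443 | 0.016583 |

The weakly supervised models are more precise but find fewer entities than the model trained on true labels: precision rises from about 0.86 to about 0.92, while recall drops from about 0.93 to about 0.83.

- **True labels**: The gold-standard manual annotations. The model trained on them serves as the reference point, and its score should be considered the upper bound of model performance.
- **HMM labels**: Labels generated by a Hidden Markov Model, which combines the outputs of multiple labeling functions using a probabilistic approach.
- **Majority vote labels**: Labels obtained by assigning each token the label chosen by most of the labeling functions.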
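
As a quick check on these numbers, the F1-scores can be recomputed from the precision and recall values that `trainer.evaluate` reported for each model, since F1 is their harmonic mean. The sketch below does this and prints each model's gap to the true-label baseline; the dictionary layout and the helper name `f1_score` are illustrative choices, not part of the notebook.

```python
# Test-set precision and recall copied from the evaluation outputs above
results = {
    "True labels": {"precision": 0.861624, "recall": 0.929662},
    "HMM labels": {"precision": 0.924556, "recall": 0.829463},
    "Majority vote labels": {"precision": 0.920000, "recall": 0.824154},
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


f1_by_model = {name: f1_score(**pr) for name, pr in results.items()}
baseline = f1_by_model["True labels"]

for name, f1 in f1_by_model.items():
    gap = baseline - f1  # absolute F1 difference to the gold-label model
    print(f"{name:<22} F1={f1:.4f}  gap to true labels={gap:.4f}")
```

The recomputed values agree with the reported F1-scores to four decimal places, and the gaps come out at roughly 0.02 for the HMM labels and 0.025 for the majority vote labels.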

### Time Efficiency

Manual labeling of data is time-consuming and often impractical for large datasets. In our case:

- Labeling one document manually takes approximately 216 seconds of human labor (72 seconds per person, with three people involved).
- For 914 documents (like our labeled training dataset), this results in 914 x 216 seconds = 197,424 seconds, which translates to approximately 55 hours of human labor.
- Had we labeled the 10,000 documents in our unlabeled dataset manually, it would have taken 10,000 x 216 seconds = 2,160,000 seconds, or around 600 hours of human labor.

With weak supervision techniques, we can significantly reduce labeling time, making the annotation process cheaper and more flexible.
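
The estimate above is simple arithmetic and can be reproduced for any dataset size. A minimal sketch, using the per-document figures stated in this section; the function name `manual_labeling_hours` is an illustrative choice.

```python
SECONDS_PER_ANNOTATOR = 72   # time one person spends on one document
ANNOTATORS_PER_DOCUMENT = 3  # each document is labeled by three people


def manual_labeling_hours(num_documents: int) -> float:
    """Total human labor, in hours, to label num_documents manually."""
    seconds_per_document = SECONDS_PER_ANNOTATOR * ANNOTATORS_PER_DOCUMENT
    return num_documents * seconds_per_document / 3600


print(round(manual_labeling_hours(914), 1))  # 54.8
print(manual_labeling_hours(10_000))         # 600.0
```

The cost grows linearly with the number of documents, whereas a set of labeling functions is written once and then applied to every document at machine speed.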


> **Note**: In real-world scenarios, the benefits of weak supervision become even more pronounced as the size of the dataset increases. The ability to efficiently and accurately label large volumes of data is key for the practical application of NER models.


## Takeaways
- **Practical Annotation Efficiency**: Weak supervision can replace hundreds of hours of manual annotation with the much smaller effort of writing and running labeling functions, making NER feasible for large-scale applications across domains like healthcare, legal, and finance.

- **Quality Through Diversity**: Combining diverse labeling functions (pattern-based, knowledge-based, model-based) creates strong training data that captures different aspects of entity recognition.

- **Cost-Benefit Balance**: The slight decrease in model performance using weak supervision (about 2 to 2.5 F1 points in our experiment, 0.87 versus 0.89) is often outweighed by the massive reduction in annotation costs and time.

- **Specialized Domain Adaptation**: Weak supervision enables rapid adaptation of NER to specialized domains (like identifying medications in legal documents) without extensive domain-specific manual annotation.

- **Flexible NLP Systems**: The approach makes it feasible to build and maintain NER systems for many entity types across multiple domains, supporting applications that would be impractical with traditional annotation methods.

- **Complementary Strengths**: Different labeling functions and aggregation methods have complementary strengths, with generative models (HMM) slightly outperforming majority voting by better capturing dependencies between labels.

- **Programmatic Data Creation**: Shifting focus from manual data annotation to creating effective labeling functions transforms the data creation workflow, emphasizing programming skills over repetitive labeling tasks.

- **Continuous Improvement Framework**: The iterative nature of weak supervision enables continuous refinement of NER systems as requirements evolve and new data becomes available.


# Questions

1. What is Named Entity Recognition (NER), and why is it important in NLP?

2. How do labeling functions reduce the reliance on manual data annotation in the weak supervision framework?

3. In what ways can pre-trained models, such as BERT or GLiNER, contribute to the labeling process?

4. What is the role of document-level labeling in enhancing the consistency of entity annotations?

5. How does Skweak combine outputs from multiple labeling functions to produce a single, higher-quality set of annotations?

6. What advantage does an HMM-based generative model have over simple majority voting when aggregating labels?

7. Why is transfer learning particularly useful for specialized NER tasks, such as identifying drug names in legal documents?

8. How can iterative refinement improve the quality of both the labeling functions and the resulting NER model?

9. What are some key time-saving benefits of applying weak supervision techniques compared to fully manual labeling?

10. How do the weakly supervised NER models trained on HMM or majority vote labels compare in performance to those trained on manual gold annotations?

`Answers are commented inside this cell.`

<!-- 
1. NER is a process of identifying and classifying specific entities—like persons, organizations, locations, or medications—in unstructured text. It is critical because it converts textual data into structured form, aiding efficient information retrieval and analysis.

2. Labeling functions programmatically assign tags to text segments based on rules, dictionaries, or model outputs, thus greatly reducing the need for labor-intensive manual annotations and enabling faster creation of large labeled datasets.

3. Pre-trained models like BERT or GLiNER already capture extensive language knowledge. When integrated as labeling functions or initial annotators, they can provide more accurate entity predictions and help detect entities even in contexts with sparse training data.

4. Document-level labeling ensures that an entity consistently receives the same label throughout an entire document. This is particularly useful in domains where the same entity can be mentioned in several forms or in ambiguous contexts within the same text.

5. Skweak aggregates labels through methods such as majority voting or generative modeling, effectively reconciling conflicts and utilizing the strengths of each labeling source to form a strong consensus for every token or segment.

6. Hidden Markov Models (HMMs) can model dependencies between labels and learn the reliability of each labeling function. This statistical approach often outperforms straightforward majority voting by better handling conflicting or noisy labels.

7. Transfer learning allows practitioners to take advantage of language knowledge gained from large corpora, then adapt these pre-trained models to domain-specific tasks—like identifying pharmaceutical terms in legal texts—without needing an enormous domain-annotated dataset.

8. Iterative refinement involves examining where the system mislabels or produces inconsistencies, then refining or adding new labeling functions accordingly. This cycle of improvement can gradually boost both annotation quality and model accuracy.

9. Weak supervision drastically cuts the time spent on manual annotation. Instead of labeling thousands of documents by hand, developers can create a handful of labeling functions that scale quickly, saving hundreds of hours of tedious work.

10. Although models trained on weakly supervised labels (HMM- or majority vote-derived) may not perfectly match the performance of fully gold-labeled models, they often come close. The small drop in accuracy can be a worthwhile trade-off for the significant reduction in labeling costs and time. -->