# Advanced Topics in Weak Supervision - Named Entity Recognition
## Learning with Limited Labels: Weak Supervision and Uncertainty-Aware Training
### [Dr. Elias Jacob de Menezes Neto](https://docente.ufrn.br/elias.jacob)

# Summary

## Keypoints
- **Named Entity Recognition (NER)**: Task in NLP to identify and classify named entities in text into predefined categories, transforming unstructured text into structured data.

- **Weak Supervision**: Techniques to efficiently annotate large datasets for NER tasks, reducing the reliance on labor-intensive manual labeling.

- **Labeling Functions**: Use of multiple methods such as gazetteers, pre-trained transformer models, regex patterns, and zero-shot learning (GLiNER, NuNER, LangChain).

- **Skweak Framework**: Framework for combining weak labels using generative models (e.g., Hidden Markov Models) and majority voting to enhance the quality of annotations.

- **Document-Level Labeling**: Application of document-level labeling functions to ensure label consistency within documents, improving annotation accuracy.

- **Transfer Learning**: Using pre-trained models like BERT, fine-tuning them on weakly labeled datasets for tasks like drug entity recognition in legal documents.

- **Iterative Refinement**: Continuous improvement of labeling functions and model performance through iterative refinement.

- **Time Efficiency**: Highlighting significant time savings achieved through weak supervision compared to traditional manual labeling efforts.

## Takeaways
- **Efficiency in Data Annotation**: Weak supervision techniques, as implemented in the Skweak framework, can drastically reduce the time and effort required for data annotation, especially for large-scale datasets.

- **Enhanced Model Performance**: Combining multiple labeling functions and using generative models for label aggregation can significantly improve the quality of annotations, leading to better-performing NER models.

- **Adaptability and Scalability**: The iterative refinement process allows NER models to continuously improve, making them adaptable to evolving data requirements and scalable across different domains.

- **Cost-Effective Approach**: Weak supervision provides a cost-effective alternative to manual labeling, easing the development of robust NER models with minimal human intervention.

- **Transfer Learning Benefits**: Leveraging pre-trained models, such as BERT, fine-tuned on weakly labeled data, can achieve performance close to models trained on fully annotated datasets.

In [1]:
import os
import pandas as pd

os.environ['TOKENIZERS_PARALLELISM'] = "false"
#os.environ['CUDA_VISIBLE_DEVICES'] = "-1"



# Introduction to Named Entity Recognition


## What is Named Entity Recognition (NER)?

Named Entity Recognition (NER) is a foundational task in Natural Language Processing (NLP) and part of the broader field of Information Extraction. It identifies named entities in a text and classifies them into predefined categories. The primary goal is to convert unstructured text into structured information that machines can easily read and analyze, which makes NER a building block for many downstream applications.

### Key Concepts

**Entities**: In NER, entities refer to specific pieces of information within a text that are of interest. Any word or phrase that represents something specific in the real world can be considered an entity.

**Named Entities**: These are specific, identifiable entities mentioned in the text. They are categorized into various groups, such as:

- **`PERSON`**: Names of individuals.
- **`ORG`**: Names of organizations.
- **`GPE`**: Geopolitical entities, including countries, cities, and states.
- **`TIME`**: Time expressions, such as specific times of the day.
- **`DATE`**: Date expressions, including specific dates and periods.
- **...among others**: There are other categories, depending on the application and domain.

### Objectives of NER

The primary objective of NER is to extract structured information from unstructured text. This process involves:

1. **Detection**: Identifying the spans of text that correspond to named entities.
2. **Classification**: Assigning each detected entity to a predefined category.

### Importance of NER

NER is essential for various applications, including:

- **Information Retrieval**: Enhancing search engines to retrieve more relevant results.
- **Question Answering Systems**: Improving the accuracy of systems designed to answer user queries.
- **Content Recommendation**: Personalizing content based on identified entities.
- **Text Summarization**: Extracting key information to generate concise summaries.
- **Graph Databases**: Populating knowledge graphs with structured information.

### Example

Consider the following sentence:

> "Elias Jacob is a professor at the Federal University of Rio Grande do Norte, located in Natal, Brazil."

In this sentence, NER would identify and classify the named entities as:

- **`PERSON`**: Elias Jacob
- **`ORG`**: Federal University of Rio Grande do Norte
- **`GPE`**: Natal, Brazil
- **`PROFESSION`**: professor

> Note: GPE stands for Geopolitical Entity, which includes countries, cities, and states.

### Visual Representation

To better understand how NER works, refer to the visual example below, which highlights named entities within a legal text:

<p align="center">
<img src="images/NER.png" alt="NER Example" style="width: 80%; height: 80%"/>
</p>

### Potential Questions

- **How does NER handle ambiguous cases?**
    - NER systems often rely on context and sophisticated algorithms to disambiguate entities. For instance, "Apple" could refer to a fruit or a tech company, and the surrounding text helps determine the correct classification.

- **What are the challenges in NER?**
    - Some of the main challenges include handling ambiguous entities, dealing with different languages and dialects, and recognizing entities in noisy or informal text (e.g., social media posts).

- **Can NER be customized for specific domains?**
    - Yes, NER systems can be trained on domain-specific data to improve their accuracy for particular applications, such as medical texts or legal documents.

In [2]:
# Import the necessary libraries from the spaCy package
import spacy
from spacy import displacy

# Load the pre-trained spaCy model for Portuguese
# 'pt_core_news_lg' is a large model with more accuracy and features
# You can also use 'pt_core_news_sm' for a smaller, faster model with fewer features
# To install the model, run: python -m spacy download pt_core_news_lg
nlp = spacy.load('pt_core_news_lg')

# Define a sample text in Portuguese for Named Entity Recognition (NER)
text_example = (
    "Meu nome é Elias Jacob e eu moro em Natal, Rio Grande do Norte. "
    "Eu trabalho no Instituto Metrópole Digital, que é a unidade mais bacana da UFRN. "
    "Desde 2021 eu também trabalho como Corregedor da UFRN. "
    "Quando a pandemia começou, no início de 2020, eu estava com as malas prontas para uma viagem de férias para o Japão. "
    "Eu até fui buscar meu visto no Consulado em Recife, mas, quando chegou mais perto da viagem, meus voos foram todos cancelados pela United Airlines e eu não viajei. "
    "No dia 11 de novembro de 2023, eu estive no show do Roger Waters em São Paulo. "
    "Sempre que visito a cidade, eu dou uma passada no Kidoairaku, meu restaurante japonês favorito lá."
)

# Process the text using the spaCy model
# This step performs tokenization, part-of-speech tagging, and named entity recognition
doc = nlp(text_example)

# Use the displaCy visualizer to render the named entities in the text
# The 'style' parameter is set to 'ent' to visualize named entities
# The 'jupyter' parameter is set to True to display the visualization in a Jupyter Notebook
displacy.render(doc, style='ent', jupyter=True)

In [3]:
from typing import List, Tuple  # Import List and Tuple from the typing module for type annotations

def extract_named_entities_from_text(input_text: str) -> List[Tuple[str, str]]:
    """
    Extract named entities from a given text.

    Args:
        input_text (str): The input text.

    Returns:
        List[Tuple[str, str]]: A list of tuples where each tuple contains the named entity and its label.
    """
    # Parse the text with spaCy
    # The nlp object processes the input text and returns a parsed Doc object
    parsed_text = nlp(input_text)
    
    # Initialize an empty list to store the named entities
    named_entities = []
    
    # Iterate over the named entities in the parsed text
    # The parsed_text.ents attribute contains a list of named entities identified in the text
    for entity in parsed_text.ents:
        # Append the entity text and label to the list
        # entity.text is the named entity, and entity.label_ is its label (e.g., PER, ORG, LOC, MISC for this model)
        named_entities.append((entity.text, entity.label_))
    
    # Return the list of named entities and their labels
    return named_entities

# Test the function with an example text
# This will extract named entities from the text_example and print them
extract_named_entities_from_text(text_example)

[('Elias Jacob', 'PER'),
 ('Natal', 'LOC'),
 ('Rio Grande do Norte', 'LOC'),
 ('Instituto Metrópole Digital', 'ORG'),
 ('UFRN', 'LOC'),
 ('Corregedor da UFRN', 'MISC'),
 ('Japão', 'LOC'),
 ('Consulado', 'MISC'),
 ('Recife', 'LOC'),
 ('United Airlines', 'ORG'),
 ('Roger Waters', 'PER'),
 ('São Paulo', 'LOC'),
 ('Kidoairaku', 'ORG')]


# Illustrative Use Cases of NER

Let's dive deeper into real-world applications of NER:

- **Healthcare Text Analysis**: In clinical note analysis, NER can help identify diseases, symptoms, treatments, and medications, which can support better medical decision making.
- **News Articles**: For a news reading app that curates articles, NER can help extract information about people, organizations, and locations mentioned in the article. This can then be used to categorize or tag the articles or improve article recommendations.
- **Customer Support**: In a customer support scenario, NER can be used to identify and separate out the important pieces of information in a customer’s query, such as name, email address, and phone number. This allows for more efficient handling of customer requests.

In essence, NER's utility stretches across various domains, transforming unstructured textual data into structured, machine-readable data, thereby enabling more sophisticated and nuanced analyses.

## Why not use regular expressions?

Regular expressions (regex) are a powerful tool for text processing and pattern matching, but on their own they are a poor fit for NER. The main reason is that they do not generalize well to unseen data. For instance, to extract all the names of people in a text, we could use a regular expression such as `[A-Z][a-z]*\s[A-Z][a-z]*`.
This pattern matches names made of a capitalized first name followed by a capitalized last name. However, it fails to capture in full names that have a middle name or initial, hyphens, accented letters, connectives (such as de, da, do, dos, das), or suffixes (such as Júnior or Neto). It also matches capitalized word pairs that are not names at all, such as “New York” or “Doctor Silva”.
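
To make these failure modes concrete, here is a minimal sketch that runs the pattern above on a few hand-picked strings. The sample strings are illustrative choices, not data from the course dataset.

```python
import re

# The two-capitalized-words pattern discussed above.
NAME_PATTERN = re.compile(r"[A-Z][a-z]*\s[A-Z][a-z]*")

samples = [
    "Elias Jacob",                  # first + last name: matched in full
    "Maria da Silva",               # lowercase connective "da": nothing is matched
    "Elias Jacob de Menezes Neto",  # split into two unrelated fragments
    "New York",                     # not a person, but matched anyway
]

for sample in samples:
    print(f"{sample!r} -> {NAME_PATTERN.findall(sample)}")
```

The pattern has no notion of context, so the only way to fix each miss is to add yet another special case, which is exactly the maintenance problem described above.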

## Why not use a dictionary?

A dictionary (gazetteer) is a good choice for NER when the set of entities to extract is small and closed. It does not scale to large, open-ended sets. For instance, extracting every person name in a text would require a dictionary containing all the names in the world. Such a dictionary would be very large, difficult to maintain, and outdated as soon as new names appear.
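
The coverage problem can be illustrated with a small lookup sketch. The gazetteer below is a tiny hypothetical list of drug names (the entity type this notebook later focuses on), and `find_known_entities` is a helper name invented for this example.

```python
import re
from typing import List

# Hypothetical gazetteer: a tiny, illustrative list of known drug names.
DRUG_GAZETTEER = {"insulina", "rituximabe", "omeprazol"}

def find_known_entities(text: str) -> List[str]:
    """Return the tokens of `text` that appear in the gazetteer (case-insensitive)."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token in DRUG_GAZETTEER]

# Entities that are in the list are found...
print(find_known_entities("O autor requer o fornecimento de Rituximabe e Insulina."))
# ...but anything missing from the list is silently ignored.
print(find_known_entities("O autor requer o fornecimento de Nusinersena."))
```

A dictionary gives high precision for the entries it contains, but its recall is bounded by how complete the list is.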

## Illustrative Use Cases of Named Entity Recognition (NER)

Let's dive deeper into real-world applications of NER:


### Healthcare Text Analysis
In the healthcare sector, NER can be used to extract and structure critical information from clinical notes and medical literature. This includes identifying and categorizing:

- **Diseases**: Recognizing mentions of diseases.
- **Symptoms**: Identifying symptoms described in patient records.
- **Treatments**: Extracting information about various treatments administered.
- **Medications**: Identifying prescribed medications.

**Benefits**:
- **Improved Medical Decision Making**: Providing clinicians with structured and accessible patient data.
- **Automated Cohort Identification**: For clinical trials and research studies based on specific medical conditions.
- **Enhanced Patient Risk Stratification**: By analyzing medical histories for potential risk factors.


### News Articles
For news aggregation and analysis platforms, NER offers significant advantages:

- **Content Categorization**: By extracting entities like people, organizations, and locations, articles can be efficiently tagged and categorized.
- **Enhanced Recommendations**: By understanding user preferences based on frequently engaged entities, more personalized content suggestions can be made.
- **Trend Analysis**: Tracking the frequency of certain entities over time helps in identifying and analyzing trends.


### Customer Support
NER streamlines customer support operations by:

- **Automating Ticket Routing**: Directing queries to the appropriate department based on extracted entities like product names or issue types.
- **Enhancing Chatbot Capabilities**: Enabling chatbots to understand and respond to user requests more effectively by identifying key entities within conversations.
- **Personalizing Customer Interactions**: Recognizing customer information such as names and order histories to provide more tailored support.

**Benefits**:
- **Improved Efficiency**: Faster and more accurate handling of customer queries.
- **Better Customer Experience**: More personalized and context-relevant interactions.

# Weak Supervision for Named Entity Recognition

## Setting the Stage

Let's imagine a scenario where we want to train a Named Entity Recognition (NER) model to identify entities in a text. However, we have limited labeled data available for training. Specifically, imagine the following:

1. You work for the legal department of the local government.
2. Every day, people file lawsuits demanding that your government pay for drugs they need to treat their diseases. Remember: according to the Brazilian Constitution, the government must provide free healthcare to everyone, including free drugs.
3. At the same time, the government has [UNICAT](http://www.unicat.rn.gov.br/), an agency responsible for providing drugs to the population. However, the agency doesn't know exactly which drugs are being demanded by the population.
4. Remember: each lawsuit represents an extra cost to the government. Therefore, it would be better to avoid lawsuits that are not necessary by making the drugs available to the population before the lawsuits are filed.
5. It would be very useful to have a system that could automatically identify the names of the drugs mentioned in the lawsuits. This would allow you to quickly analyze past lawsuits and identify patterns that could help you make better decisions in the future.
6. Your goal is to build a NER model that can identify the names of the drugs mentioned in the lawsuits, so that you can analyze the data and give UNICAT the information they need to make the drugs available to the population before the lawsuits are filed.

In this scenario, you have a limited number of labeled examples where the names of the drugs are annotated. You could manually annotate more examples, but this would be time-consuming and expensive. Instead, you decide to use weak supervision to train your NER model with limited labeled data.

## Our Data

You'll work with a dataset of legal documents containing lawsuits filed against the government. Each document contains a text description of the lawsuit, and your task is to identify the names of the drugs mentioned in these texts.
Your training data contains 826 legal documents with no annotations, but you also have small sets of labeled examples where the names of the drugs are annotated. Your development (validation) set contains 255 legal documents with annotations for evaluation. Your test set contains 100 legal documents with annotations for final evaluation. Labels use [IOB tagging](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), where "B" indicates the beginning of an entity, "I" indicates the continuation of an entity, and "O" indicates a token outside any entity.
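
To show how IOB tags encode entities, the sketch below groups tagged tokens back into entity spans. The helper `iob_to_spans` and the example sentence are made up for illustration. The function accepts bare `B`/`I`/`O` tags as well as suffixed ones such as `B-DRUG`, because the exact tag names used in the dataset are an assumption here.

```python
from typing import List, Tuple

def iob_to_spans(tagged_tokens: List[Tuple[str, str]]) -> List[Tuple[int, int, str]]:
    """Convert (token, tag) pairs into (start, end, text) spans; `end` is exclusive."""
    boundaries = []
    start = None
    for index, (_, tag) in enumerate(tagged_tokens):
        if tag.startswith("B"):
            if start is not None:      # a new entity starts right after another one
                boundaries.append((start, index))
            start = index
        elif tag.startswith("I"):
            if start is None:          # tolerate an "I" without a preceding "B"
                start = index
        else:                          # "O": close any open entity
            if start is not None:
                boundaries.append((start, index))
                start = None
    if start is not None:              # entity running until the last token
        boundaries.append((start, len(tagged_tokens)))
    return [
        (s, e, " ".join(token for token, _ in tagged_tokens[s:e]))
        for s, e in boundaries
    ]

# Made-up example, not taken from the dataset.
example = [
    ("fornecimento", "O"), ("de", "O"),
    ("insulina", "B"), ("glargina", "I"),
    ("e", "O"), ("omeprazol", "B"),
]
print(iob_to_spans(example))
```

Each span starts at a "B" tag and extends over the "I" tags that follow it, so a two-token drug name is recovered as a single entity.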

In [4]:
df_train = pd.read_parquet('data/ner/train.parquet')
df_valid = pd.read_parquet('data/ner/valid.parquet')
df_test = pd.read_parquet('data/ner/test.parquet')

df_train.shape, df_valid.shape, df_test.shape

((826, 2), (255, 3), (100, 3))

In [5]:
df_train

Unnamed: 0,uid,text
0,8ddbcd4624582098ea5a635311e2fdce,FUNDAMENTAÇÃO\nCuida-se de ação ordinária sob ...
1,e7f8185d32e003c0fe9df13ab08011fd,"RELATÓRIO\nTrata-se de Ação Ordinária, com ped..."
2,579cecacdd9fed378f341a5d5375ea16,"Nessa linha de intelecção, subsumindo as dispo..."
3,be2d81920265b7ffb168cee17076178f,Sugiro liberar a medicação em questão quando a...
4,848d8b4cdd2cb5d03a4a5060ce9ce0f2,"Trata-se de Ação Especial, com pedido de tutel..."
...,...,...
821,97b26e1d778d9c0a2604e4713241d351,Considerando que o perito nomeado por este juí...
822,015cf835a1eb4549347d93076c3e615e,"Relatório dispensado, nos termos do art. 38 da..."
823,f2377f7fe9be82680bc340884135d2ec,A hipossuficiência financeira para as prestaçõ...
824,22856c7022cc2cdba9fde68ac31bb4c7,PODER JUDICIÁRIO\nJUSTIÇA FEDERAL NO RIO GRAND...


In [6]:
df_valid

Unnamed: 0,uid,text,labels
0,cc7d620236e02ee78d9977dbc7985584,TIPO “A”\nTrata-se de ação proposta por FELIPE...,"[('TIPO', 'O'), ('“', 'O'), ('A', 'O'), ('”', ..."
1,3d8df5105d43c30a7dcd7aa602fb895e,Como não houve inclusão pelo Ministério da Saú...,"[('Como', 'O'), ('não', 'O'), ('houve', 'O'), ..."
2,fb6d2d54a1357f72dd4fdd36afb2f2a8,"8. Ressalte-se, ademais, que a Nota Técnica n....","[('8', 'O'), ('.', 'O'), ('Ressalte', 'O'), ('..."
3,a47ff22ae748832734a1fa90a870cc5a,. Acerca do fármacopleiteado e da patologia ap...,"[('.', 'O'), ('Acerca', 'O'), ('do', 'O'), ('f..."
4,fe2d89c89c60d342bec6d7ab725ba26e,"Ademais, há nos autos laudo médico emitido por...","[('Ademais', 'O'), (',', 'O'), ('há', 'O'), ('..."
...,...,...,...
250,404eb0d966568da63bbcad6931e5427e,11) Considerando a natureza da patologia e em ...,"[('11', 'O'), (')', 'O'), ('Considerando', 'O'..."
251,4f2b2cbcb1a2587786e5f716e08b49cb,"Assim, tratando-se de medicamentosque não fora...","[('Assim', 'O'), (',', 'O'), ('tratando', 'O')..."
252,08be5705a599185821d1c2bbcd5395a9,DEFIRO O PEDIDO DE TUTELA PROVISÓRIA DE URGÊNC...,"[('DEFIRO', 'O'), ('O', 'O'), ('PEDIDO', 'O'),..."
253,9cb6d5b410655255080adc9af99930aa,"Outrossim, seguindo a linha de diversos preced...","[('Outrossim', 'O'), (',', 'O'), ('seguindo', ..."


In [7]:
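import ast
# The labels are stored as strings; ast.literal_eval parses them into lists of tuples without executing arbitrary code
df_valid['labels'] = df_valid['labels'].apply(ast.literal_eval)
df_test['labels'] = df_test['labels'].apply(ast.literal_eval)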

In [8]:
print(df_valid.iloc[0]['text'])

TIPO “A”
Trata-se de ação proposta por FELIPE EMANUEL MONTEIRO CAVALCANTE, representado por sua genitora FRANCISCA FLAUBÉRIA QUEIROZ MONTEIRO, com pedido de antecipação dos efeitos da tutela, em desfavor da UNIÃO FEDERAL, ESTADO DO RIO GRANDE DO NORTE e do MUNICÍPIO DE PAU DOS FERROS, objetivando o fornecimento da segunda dose da vacina MENINGOCÓCIDA B, a ser custeado solidariamente pelos entes federativos.Relatório dispensado, na forma do art. 38 da Lei 9.099/95 c/c o art. 1o da Lei 10.259/01.
I – FUNDAMENTAÇÃO
1. PRELIMINAR: ILEGITIMIDADE PASSIVA 
Não merecer acolhida a preliminar de ilegitimidade passiva suscitada pela União, por ser a saúde competência comum aos entes federados, reclamando uma ação conjunta no propósito de cumprir o dever estatal consubstanciado na Constituição Federal, como se extrai das prescrições legais:
 “Art. 23. É competência comum da União, dos Estados, do Distrito Federal e dos Municípios: (.)
II - cuidar da saúde e assistência pública, da proteção e garan

In [9]:
df_valid.iloc[0]['labels']

[('TIPO', 'O'),
 ('“', 'O'),
 ('A', 'O'),
 ('”', 'O'),
 ('Trata', 'O'),
 ('-', 'O'),
 ('se', 'O'),
 ('de', 'O'),
 ('ação', 'O'),
 ('proposta', 'O'),
 ('por', 'O'),
 ('FELIPE', 'O'),
 ('EMANUEL', 'O'),
 ('MONTEIRO', 'O'),
 ('CAVALCANTE', 'O'),
 (',', 'O'),
 ('representado', 'O'),
 ('por', 'O'),
 ('sua', 'O'),
 ('genitora', 'O'),
 ('FRANCISCA', 'O'),
 ('FLAUBÉRIA', 'O'),
 ('QUEIROZ', 'O'),
 ('MONTEIRO', 'O'),
 (',', 'O'),
 ('com', 'O'),
 ('pedido', 'O'),
 ('de', 'O'),
 ('antecipação', 'O'),
 ('dos', 'O'),
 ('efeitos', 'O'),
 ('da', 'O'),
 ('tutela', 'O'),
 (',', 'O'),
 ('em', 'O'),
 ('desfavor', 'O'),
 ('da', 'O'),
 ('UNIÃO', 'O'),
 ('FEDERAL', 'O'),
 (',', 'O'),
 ('ESTADO', 'O'),
 ('DO', 'O'),
 ('RIO', 'O'),
 ('GRANDE', 'O'),
 ('DO', 'O'),
 ('NORTE', 'O'),
 ('e', 'O'),
 ('do', 'O'),
 ('MUNICÍPIO', 'O'),
 ('DE', 'O'),
 ('PAU', 'O'),
 ('DOS', 'O'),
 ('FERROS', 'O'),
 (',', 'O'),
 ('objetivando', 'O'),
 ('o', 'O'),
 ('fornecimento', 'O'),
 ('da', 'O'),
 ('segunda', 'O'),
 ('dose', 'O'),
 (

## Why Nots?

Considering the scenario described above, let's discuss why other methods, when considered alone, might not be the best fit for this task.

### Why Not Manual Annotation?

While manual annotation is a reliable method for creating labeled datasets, it has several limitations:

- **Time-Consuming**: Manual annotation is labor-intensive and time-consuming, especially for large datasets.
- **Costly**: Hiring annotators or using crowdsourcing platforms can be expensive.
- **Subjectivity**: Annotations may vary between annotators, leading to inconsistencies.
- **Scalability Issues**: Manual annotation is not easily scalable to large datasets or frequent updates.


### Why Not Use Regular Expressions?

While regular expressions (regex) are powerful for text processing and pattern matching, they have significant limitations for NER:

- **Lack of Generalization**: Regex rules are often too rigid to generalize well to unseen data. For example, a pattern designed to match names may not account for variations like middle names, hyphens, or cultural variations.
- **Overgeneralization**: Regex might incorrectly identify non-entity text that matches the pattern (e.g., mistaking "New York" for a person's name).
- **Contextual Blindness**: Regex cannot consider the surrounding context, which is vital for accurate entity recognition.
- **Maintenance Challenges**: As language evolves, maintaining an exhaustive set of regex patterns becomes increasingly complex and time-consuming.

### Why Not Use a Dictionary?

While dictionaries can be useful for NER in certain scenarios, they also have significant drawbacks:

- **Scalability Issues**: Maintaining an up-to-date dictionary for broad entity categories (like all person names) is practically impossible due to the vast and ever-changing nature of language.
- **Ambiguity Challenges**: Many words can be both entity and non-entity, making dictionary-based approaches prone to errors.
- **Incomplete Coverage**: A dictionary might miss newly coined terms or entities, leading to incomplete extraction.

> **Note**: Machine learning models can learn context and generalize from training data, making them more adaptable and accurate for NER tasks compared to regex or dictionary-based approaches.

### Why Not Use a Pre-Trained Model?

While pre-trained models offer a quick and effective way to perform NER, they may not always be suitable for specific use cases due to the following reasons:

- **Domain-Specific Knowledge**: Pre-trained models might not capture domain-specific entities or terminologies.
- **Fine-Grained Control**: Customizing pre-trained models for specific entity types or constraints can be challenging.
- **Data Privacy Concerns**: Pre-trained models served through third-party APIs require sending your text to an external service, which can expose sensitive data and raise privacy concerns.

> **Note**: Training a custom NER model on domain-specific data can address these limitations and provide more tailored entity recognition capabilities.

### Why Not Use Zero-Shot Learning?

Zero-shot learning is a powerful technique that allows models to generalize to unseen classes. However, it has some limitations for NER tasks:

- **Limited Contextual Understanding**: Zero-shot learning may struggle with complex entity relationships and context-dependent entity recognition.
- **Data Efficiency**: Training zero-shot models often requires large amounts of data to generalize effectively to unseen entities.
- **Fine-Grained Entity Recognition**: For tasks requiring fine-grained entity classification, zero-shot learning may not provide the necessary granularity.

> **Note**: Zero-shot learning can be a valuable tool for NER tasks, especially when dealing with novel entities or limited labeled data, but it may not always outperform supervised learning approaches in all scenarios.


While these methods have their advantages, they also come with limitations that can impact the accuracy, scalability, and adaptability of NER systems. The good news is that weak supervision lets us combine the strengths of these methods: it treats each of them as a source of noisy (weak) labels and aggregates those labels to train a NER model effectively.

## New Tool: [Skweak](https://github.com/NorskRegnesentral/skweak)

Skweak is a versatile, Python-based software toolkit designed for Natural Language Processing (NLP) developers. It facilitates the application of weak supervision to various NLP tasks, particularly sequence labeling, which is our case. While Snorkel is a popular tool for weak supervision, it does not directly support sequence labeling tasks like Named Entity Recognition (NER), and using it for NER would require too many workarounds. Skweak, in contrast, is specifically tailored for sequence labeling.

### Features of Skweak

- **Labeling Functions**:
    - These are custom rules or heuristics that generate noisy labels based on patterns, external sources, or other criteria.
    - Labeling functions are essential for reducing the manual effort in data annotation.

- **Label Aggregation**:
    - Skweak combines the outputs of multiple labeling functions into a single, aggregated label for each data point.
    - This process uses models such as a Hidden Markov Model (HMM) to make the final labels as accurate as possible despite the noise in individual labeling functions.

- **Support for Named Entity Recognition (NER)**:
    - Skweak is particularly useful for NER tasks, allowing developers to create labeling functions aimed at recognizing entities within text.
    - This feature is crucial for tasks requiring the identification of names, dates, locations, and other entities in text data.

### Detailed Explanation

**Labeling Functions**:
- Labeling functions are at the heart of Skweak. They are designed to apply domain-specific heuristics to label data points. For example, a labeling function might tag all capitalized words as potential entities.
- These functions can incorporate a variety of sources, including dictionaries, regular expressions, and pre-trained models, to generate noisy labels.

**Generative Model for Aggregation**:
- After the initial labeling, Skweak uses a generative model to aggregate these noisy labels.
- The Hidden Markov Model (HMM) is one such model used for this purpose. It helps in accounting for dependencies between labels and smoothing out inconsistencies.
- Alternatively, a Majority Vote approach can be used, where the most common label among the functions is chosen as the final label.
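
The majority-vote alternative can be sketched in a few lines of plain Python. This is an illustrative, library-free version and not Skweak's implementation: the function name `majority_vote`, the use of `None` for abstentions, and the three vote lists are all assumptions made for the example.

```python
from collections import Counter
from typing import List, Optional

def majority_vote(votes_per_function: List[List[Optional[str]]], default: str = "O") -> List[str]:
    """Aggregate token-level votes; None means the labeling function abstained."""
    aggregated = []
    # zip(*...) regroups the votes so that each item holds all votes for one token.
    for token_votes in zip(*votes_per_function):
        counts = Counter(vote for vote in token_votes if vote is not None)
        # Fall back to the default tag when every function abstained.
        aggregated.append(counts.most_common(1)[0][0] if counts else default)
    return aggregated

votes = [
    ["O", "B", "I", None],   # hypothetical gazetteer function
    ["O", "B", "O", None],   # hypothetical regex function
    [None, "B", "I", None],  # hypothetical pre-trained model
]
print(majority_vote(votes))
```

Majority voting treats every labeling function as equally reliable and every token as independent. A generative model such as an HMM relaxes both assumptions by estimating how reliable each function is and by modeling the dependencies between consecutive labels.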

**Named Entity Recognition (NER)**:
- In NER tasks, Skweak can be particularly powerful. For instance, you can create multiple labeling functions that identify entities based on different criteria, such as context or specific keywords.
- The aggregated results from these functions can be considerably more accurate than the output of any single labeling function, at a fraction of the cost of manual annotation.

### Additional Resources

- For more details on Skweak, refer to the [GitHub repository](https://github.com/NorskRegnesentral/skweak).
- The tool is discussed in depth in the paper [skweak: Weak Supervision Made Easy for NLP](https://arxiv.org/abs/2104.09683).

> **Note**: Understanding the basic principles of weak supervision and generative models like HMM can significantly enhance the effective use of Skweak. I recommend exploring these concepts further to exploit Skweak optimally for NER tasks.

## Workflow with Skweak for Named Entity Recognition (NER)

This section outlines an end-to-end workflow for using Skweak, a framework for weak supervision in machine learning, in a Named Entity Recognition (NER) task.

### 1. Data Preparation

- **Organize Text Data**: Structure your text data in a format suitable for processing. This often involves tokenizing the text and performing basic preprocessing, then converting the result into the format Skweak works with (spaCy `Doc` objects).

- **Create Data Splits**:
    - **Labeled Data**: Prepare a small set of examples manually annotated with entities for training and validation.
    - **Unlabeled Data**: Gather a larger corpus without annotations to apply Skweak's weak supervision techniques.
    - **Held-out Test Set**: Reserve a set of data with ground truth annotations for final evaluation.

> Note: The quality and diversity of the data are crucial for the success of your NER model.

### 2. Labeling Functions Creation

Develop a set of labeling functions that generate noisy labels based on various heuristics and rules:

- **Patterns**: Use regular expressions or pattern-matching techniques (e.g., capitalization, specific sequences of words).
- **Gazetteers**: Apply lists of known entities.
- **Context-based Rules**: Identify entities based on surrounding words or typical contexts.
- **External Knowledge**: Call APIs or use databases such as Wikipedia for additional context.
- **Pre-trained Models**: Incorporate outputs from existing models like spaCy or the Google Natural Language API.

> Aim for coverage and diversity to capture different aspects of entity recognition. Start with high-precision rules and gradually introduce more complex functions for better recall.
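
The kinds of rules listed above can be sketched as small functions that each emit candidate spans. This is a library-free illustration, not Skweak's API: the span format `(start, end, label)`, the `DRUG` label, the word lists, and the function names are all hypothetical.

```python
import re
from typing import Iterator, List, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, label)

# Hypothetical resources, for illustration only.
DRUG_GAZETTEER = {"insulina", "omeprazol"}
DOSAGE_PATTERN = re.compile(r"^\d+(mg|ml|mcg)$", re.IGNORECASE)
TRIGGER_WORDS = {"medicamento", "fármaco"}

def gazetteer_lf(tokens: List[str]) -> Iterator[Span]:
    """Gazetteer rule: label tokens found in a list of known drugs."""
    for i, token in enumerate(tokens):
        if token.lower() in DRUG_GAZETTEER:
            yield i, i + 1, "DRUG"

def dosage_lf(tokens: List[str]) -> Iterator[Span]:
    """Pattern rule: the word right before a dosage such as '20mg' is likely a drug."""
    for i in range(1, len(tokens)):
        if DOSAGE_PATTERN.match(tokens[i]) and tokens[i - 1].isalpha():
            yield i - 1, i, "DRUG"

def context_lf(tokens: List[str]) -> Iterator[Span]:
    """Context rule: a capitalized token right after a trigger word is likely a drug."""
    for i in range(len(tokens) - 1):
        if tokens[i].lower() in TRIGGER_WORDS and tokens[i + 1][:1].isupper():
            yield i + 1, i + 2, "DRUG"

tokens = "Requer o medicamento Rituximabe e omeprazol 20mg".split()
for labeling_function in (gazetteer_lf, dosage_lf, context_lf):
    print(labeling_function.__name__, list(labeling_function(tokens)))
```

Each function is noisy and covers only part of the data: the gazetteer misses "Rituximabe", while the context rule catches it. That complementarity is what the aggregation step relies on.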

### 3. Weak Supervision

- **Apply Labeling Functions**: Use your functions to generate noisy labels for the unlabeled data.
- **Aggregate Labels**: Use Skweak's generative models (e.g., Hidden Markov Model) to combine these noisy labels into a coherent set of annotations. This step resolves conflicts and leverages the collective insights of all labeling functions.

### 4. Model Training

- **Prepare Training Data**: Combine the small set of manually labeled data with the weakly supervised data.
- **Choose a Model**: Select an appropriate NER model, such as a Conditional Random Field (CRF) or a Transformer-based model like BERT.
- **Train the Model**: Use the aggregated labels as targets. Consider transfer learning to exploit pre-existing language models.
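
Before training, aggregated entity spans usually need to be turned into one tag per token so that they can serve as targets. A minimal sketch of that conversion follows, assuming non-overlapping spans in a hypothetical `(start, end, label)` format and a hypothetical `DRUG` label.

```python
from typing import List, Tuple

def spans_to_iob(num_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn (start, end, label) spans (end exclusive) into one IOB tag per token."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"           # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"           # remaining tokens of the entity
    return tags

# Six tokens, with entities on tokens 2-3 and on token 5.
print(spans_to_iob(6, [(2, 4, "DRUG"), (5, 6, "DRUG")]))
```

The resulting tag sequence has the same length as the token sequence, which is the shape that token-classification models such as a CRF or a fine-tuned BERT expect.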

### 5. Evaluation

- **Performance Assessment**: Evaluate the trained model on the held-out test set with ground truth annotations.
- **Metrics**: Measure performance using precision, recall, and F1 score. More advanced metrics like span-based F1 or partial matching can also be useful.
- **Error Analysis**: Examine misclassified examples to identify common errors and areas for improvement.
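
As a rough illustration of the metrics above, the following sketch computes span-level precision, recall, and F1 under exact matching, where a prediction counts as correct only if its boundaries and label both match the gold annotation. The helper name `span_prf` and the example spans are made up.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, label)

def span_prf(gold: Set[Span], predicted: Set[Span]) -> Dict[str, float]:
    """Exact-match span precision, recall, and F1."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

gold = {(2, 4, "DRUG"), (5, 6, "DRUG")}
predicted = {(2, 4, "DRUG"), (8, 9, "DRUG")}  # one correct span, one spurious span
print(span_prf(gold, predicted))
```

Exact matching is strict: a prediction that covers only part of a multi-token drug name scores as both a false positive and a false negative, which is why partial-matching variants are sometimes reported alongside it.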

### 6. Iterative Refinement

The Skweak workflow's strength lies in its iterative nature:

- **Refine Labeling Functions**: Improve existing functions or develop new ones based on the errors observed during evaluation.
- **Tune Aggregation Methods**: Experiment with different models or parameters to optimize label quality.
- **Model Improvements**: Apply architectural changes, hyperparameter tuning, or advanced techniques to enhance model performance.
- **Data Augmentation**: Create targeted functions or increase data for underrepresented entity types.

> Remember that weak supervision and iterative refinement are key to improving the NER model over time. Each cycle of refinements will help in enhancing the model's precision and robustness.

By following this workflow, you can exploit Skweak to build a robust NER system with minimal manual labeling, focusing on continuous improvement and adaptation.

## Step 1 - Convert Text Data to spaCy Doc Objects

The first step in the Skweak workflow is to convert your text data into spaCy `Doc` objects. spaCy is a popular NLP library that provides efficient tokenization, part-of-speech tagging, and named entity recognition. Skweak's labeling functions take `Doc` objects as input, so every later step builds on this conversion.

In [10]:
# Import the spaCy library for natural language processing
# spaCy provides tools for tokenization, part-of-speech tagging, named entity recognition, and more
import spacy

# Import the skweak library for weak supervision
# skweak allows us to combine multiple weak supervision sources to create high-quality training data
import skweak

In [11]:
# Load the spaCy model for Portuguese
# 'pt_core_news_lg' is a large model with more accuracy and features
# You can also use 'pt_core_news_sm' for a smaller, faster model with fewer features
# To install the model, run: python -m spacy download pt_core_news_lg
nlp = spacy.load("pt_core_news_lg")

# Process the training dataset using the spaCy pipeline
# The nlp.pipe method processes the text in batches, which is more efficient than processing each text individually
# df_train['text'].values contains the text data from the training dataset
spacy_docs_train = list(nlp.pipe(df_train['text'].values))

# Process the validation dataset using the spaCy pipeline
# df_valid['text'].values contains the text data from the validation dataset
spacy_docs_valid = list(nlp.pipe(df_valid['text'].values))

# Process the test dataset using the spaCy pipeline
# df_test['text'].values contains the text data from the test dataset
spacy_docs_test = list(nlp.pipe(df_test['text'].values))

# Save the processed spaCy documents to disk
# This avoids the need to run the spaCy pipeline again, saving time in future runs
# skweak.utils.docbin_writer writes the spaCy documents to a binary file
# The first argument is the list of spaCy documents, and the second argument is the file path
skweak.utils.docbin_writer(spacy_docs_train, "data/bin/ner/spacy_docs_train.bin")
skweak.utils.docbin_writer(spacy_docs_valid, "data/bin/ner/spacy_docs_valid.bin")
skweak.utils.docbin_writer(spacy_docs_test, "data/bin/ner/spacy_docs_test.bin")

Write to data/bin/ner/spacy_docs_train.bin...done
Write to data/bin/ner/spacy_docs_valid.bin...done
Write to data/bin/ner/spacy_docs_test.bin...done


In [12]:
# Load the processed spaCy documents from disk
# This avoids the need to run the spaCy pipeline again, saving time in future runs
# skweak.utils.docbin_reader reads the spaCy documents from a binary file
# The first argument is the file path, and the second argument is the name of the spaCy model used for processing

# Load the training documents
spacy_docs_train = skweak.utils.docbin_reader("data/bin/ner/spacy_docs_train.bin", spacy_model_name="pt_core_news_lg")

# Load the validation documents
spacy_docs_valid = skweak.utils.docbin_reader("data/bin/ner/spacy_docs_valid.bin", spacy_model_name="pt_core_news_lg")

# Load the test documents
spacy_docs_test = skweak.utils.docbin_reader("data/bin/ner/spacy_docs_test.bin", spacy_model_name="pt_core_news_lg")

# Convert the loaded documents to lists
# This step ensures that the documents are in a list format, which is easier to work with in subsequent steps
spacy_docs_train = list(spacy_docs_train)
spacy_docs_valid = list(spacy_docs_valid)
spacy_docs_test = list(spacy_docs_test)

In [13]:
spacy_docs_train[:2]

[FUNDAMENTAÇÃO
 Cuida-se de ação ordinária sob o rito sumaríssimo, com pedido de tutela antecipada, manejada por Natália Samara Araújo Rosalem em face da União e do Estado da Paraíba, objetivando a condenação dos réus no dever de fornecer à parte autora o medicamento Micofenolato Mofetil (CELLCEPT) – 500mg, para a utilização de 3 comprimidos ao dia, nos termos da prescrição médica e enquanto perdurar a indicação clínica.
 Em apertada síntese, a parte autora afirma ser portadora de lúpus eritematoso sistêmico (CID M 32.8) e que necessita do tratamento ora requerido. Aduz que há urgência no atendimento do seu pleito, sob risco de piora irreversível no seu caso clínico em caso de não realização do procedimento pleiteado.
 Conforme relatado pelo médico particular Dr. Eduardo Sérgio Ramalho (CRM – 3295/PB), o medicamento requerido é a única alterantiva possível e que ainda não está sendo utilizada.
 Contudo, segundo alega, tal medicamento não é fornecido pelo SUS, não tendo a requerente con

### Step 2 - Create Labeling Functions

Labelling functions are at the core of skweak. They can be constructed in several ways, but the key idea is always the same: a labelling function takes a `Doc` object as input and returns a list of (token-level) spans with associated labels.

For sequence labelling, the spans simply correspond to the entities one wishes to detect. For text classification tasks (such as sentiment analysis), the span corresponds to the full text you wish to classify (which may be a sentence, or perhaps the full document).

Several heuristics can be used to create labelling functions with skweak. See the [documentation](https://github.com/NorskRegnesentral/skweak/wiki/Step-1:-Labelling-functions) for more details.
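
To show the shape of such a function, the sketch below yields `(start, end, label)` token spans, with `end` exclusive. It is a simplification: a plain list of strings stands in for the spaCy `Doc`, and the cue words, the capitalization rule, and the function name are assumptions made for this example.

```python
# Hypothetical cue words that often precede a drug name in these documents.
CUE_WORDS = {"medicamento", "fármaco", "remédio"}


def cue_word_labeling_function(tokens):
    """Yield (start, end, label) token spans; `end` is exclusive."""
    for i, token in enumerate(tokens[:-1]):
        if token.lower() in CUE_WORDS and tokens[i + 1][:1].isupper():
            start, end = i + 1, i + 2
            # Extend the span over consecutive capitalized tokens.
            while end < len(tokens) and tokens[end][:1].isupper():
                end += 1
            yield start, end, "MEDICAMENTO"


tokens = "fornecer o medicamento Micofenolato Mofetil à parte autora".split()
spans = list(cue_word_labeling_function(tokens))
print(spans, tokens[spans[0][0]:spans[0][1]])
```

With a real `Doc`, the same logic would iterate over `doc` and read `token.text` instead of plain strings.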

#### 2.1 Using a predefined list of drugs (gazetteer)

One common approach to creating labelling functions is to use a predefined list of entities, also known as a gazetteer. In our case, we can use a list of drug names to create a labelling function that identifies drug entities in the text.
We'll load the drug list from the ["Preço Máximo ao Consumidor"](https://www.gov.br/anvisa/pt-br/assuntos/medicamentos/cmed/precos) (Maximum Price to the Consumer) database, which contains information about the maximum prices of drugs in Brazil. This list contains the names of various drugs that are commonly prescribed and used in healthcare.
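
Before building the real gazetteer, it may help to see the matching idea in isolation. The sketch below performs a greedy longest-match lookup of token n-grams against a set of known names. It is a stand-in for illustration only: skweak builds a trie for this purpose, and the function name, the `max_len` parameter, and the tiny gazetteer are assumptions.

```python
def gazetteer_match(tokens, gazetteer, label="MEDICAMENTO", max_len=4):
    """Greedy longest-match lookup of token n-grams in a set of known names."""
    known = {name.lower() for name in gazetteer}
    spans, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so multi-word names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in known:
                spans.append((i, i + n, label))
                i += n
                break
        else:
            i += 1  # no name starts at this token
    return spans


toy_gazetteer = {"cloridrato de lidocaína", "lidocaína", "rivotril"}
tokens = "medicado com Cloridrato de lidocaína e Rivotril".split()
print(gazetteer_match(tokens, toy_gazetteer))
```

Trying the longest candidate first is what makes `Cloridrato de lidocaína` match as a single entity rather than as the shorter `lidocaína`.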

In [14]:
pmc = pd.read_excel('data/ner/pmc_20240903.xls', header=None)
pmc

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,54,55,56,57,58,59,60,61,62,63
0,Secretaria Executiva - CMED,,,,,,,,,,...,,,,,,,,,,
1,LISTA DE PREÇOS DE MEDICAMENTOS - PREÇOS FÁBRI...,,,,,,,,,,...,,,,,,,,,,
2,Publicada em 03/09/2024 às 16h00min.,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,Esta lista apresenta os preços dos medicamento...,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24682,ÓXIDO CÚPRICO;ACETATO DE RACEALFATOCOFEROL;BET...,60.726.692/0001-81,MARJAN INDÚSTRIA E COMÉRCIO LTDA,524821050012007,1015500910141,7896226109657,-,-,VITERGAN ZINCO,COM REV CT BL AL PLAS PVC/PE/PVDC TRANS X 15,...,50.50,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,Sim
24683,ÓXIDO CÚPRICO;ACETATO DE RACEALFATOCOFEROL;BET...,60.726.692/0001-81,MARJAN INDÚSTRIA E COMÉRCIO LTDA,524821050012107,1015500910158,7896226109664,-,-,VITERGAN ZINCO,COM REV CT BL AL PLAS PVC/PE/PVDC TRANS X 15,...,57.01,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,Sim
24684,ÓXIDO CÚPRICO;SELENATO DE SÓDIO;ACETATO DE RAC...,60.659.463/0029-92,ACHÉ LABORATÓRIOS FARMACÊUTICOS S.A,500500101118422,1057302060014,7896658000010,-,-,ACCUVIT,COM REV CT FR PLAS OPC X 30,...,140.79,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,Sim
24685,ÓXIDO DE MAGNÉSIO;SIMETICONA;HIDRÓXIDO DE ALUM...,61.190.096/0001-92,EUROFARMA LABORATÓRIOS S.A.,508011804138416,1004306960107,7891317469610,7891317020118,-,SIMECO PLUS,120 MG/ML + 60 MG/ML + 7 MG/ML SUS OR CT FR VD...,...,14.42,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,Sim


In [15]:
pmc.head(50)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,54,55,56,57,58,59,60,61,62,63
0,Secretaria Executiva - CMED,,,,,,,,,,...,,,,,,,,,,
1,LISTA DE PREÇOS DE MEDICAMENTOS - PREÇOS FÁBRI...,,,,,,,,,,...,,,,,,,,,,
2,Publicada em 03/09/2024 às 16h00min.,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,Esta lista apresenta os preços dos medicamento...,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,A lista de Preços de Medicamentos contempla o ...,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,Nesta lista foi incluída a alíquota de ICMS 0%...,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,


In [16]:
# drop the first 41 rows
pmc = pmc.drop(range(41))
pmc

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,54,55,56,57,58,59,60,61,62,63
41,SUBSTÂNCIA,CNPJ,LABORATÓRIO,CÓDIGO GGREM,REGISTRO,EAN 1,EAN 2,EAN 3,PRODUTO,APRESENTAÇÃO,...,PMC 22% ALC,RESTRIÇÃO HOSPITALAR,CAP,CONFAZ 87,ICMS 0%,ANÁLISE RECURSAL,LISTA DE CONCESSÃO DE CRÉDITO TRIBUTÁRIO (PIS/...,COMERCIALIZAÇÃO 2022,TARJA,DESTINAÇÃO COMERCIAL
42,21-ACETATO DE DEXAMETASONA;CLOTRIMAZOL,18.459.628/0001-15,BAYER S.A.,538912020009303,1705600230032,7891106000956,-,-,BAYCUTEN N,"10 MG/G + 0,443 MG/G CREM DERM CT BG AL X 40 G",...,45.70,Não,Não,Não,Não,,Negativa,Sim,- (*),Sim
43,ABATACEPTE,56.998.982/0001-07,BRISTOL-MYERS SQUIBB FARMACÊUTICA LTDA,505107701157215,1018003900019,7896016806469,-,-,ORENCIA,250 MG PO LIOF SOL INJ CT 1 FA + SER DESCARTÁVEL,...,,Sim,Sim,Não,Não,,Positiva,Sim,Tarja Vermelha,Sim
44,ABATACEPTE,56.998.982/0001-07,BRISTOL-MYERS SQUIBB FARMACÊUTICA LTDA,505113100020505,1018003900078,7896016808197,-,-,ORENCIA,125 MG/ML SOL INJ SC CT 4 SER PREENC VD TRANS ...,...,11381.34,Não,Sim,Sim,Não,,Positiva,Sim,- (*),Sim
45,ABEMACICLIBE,43.940.618/0001-44,ELI LILLY DO BRASIL LTDA,507619060021902,1126001990018,7896382708442,-,-,VERZENIOS,50 MG COM REV CT BL AL AL X 30,...,4812.49,Não,Não,Não,Não,,Negativa,Sim,Tarja Vermelha,Não
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24682,ÓXIDO CÚPRICO;ACETATO DE RACEALFATOCOFEROL;BET...,60.726.692/0001-81,MARJAN INDÚSTRIA E COMÉRCIO LTDA,524821050012007,1015500910141,7896226109657,-,-,VITERGAN ZINCO,COM REV CT BL AL PLAS PVC/PE/PVDC TRANS X 15,...,50.50,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,Sim
24683,ÓXIDO CÚPRICO;ACETATO DE RACEALFATOCOFEROL;BET...,60.726.692/0001-81,MARJAN INDÚSTRIA E COMÉRCIO LTDA,524821050012107,1015500910158,7896226109664,-,-,VITERGAN ZINCO,COM REV CT BL AL PLAS PVC/PE/PVDC TRANS X 15,...,57.01,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,Sim
24684,ÓXIDO CÚPRICO;SELENATO DE SÓDIO;ACETATO DE RAC...,60.659.463/0029-92,ACHÉ LABORATÓRIOS FARMACÊUTICOS S.A,500500101118422,1057302060014,7896658000010,-,-,ACCUVIT,COM REV CT FR PLAS OPC X 30,...,140.79,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,Sim
24685,ÓXIDO DE MAGNÉSIO;SIMETICONA;HIDRÓXIDO DE ALUM...,61.190.096/0001-92,EUROFARMA LABORATÓRIOS S.A.,508011804138416,1004306960107,7891317469610,7891317020118,-,SIMECO PLUS,120 MG/ML + 60 MG/ML + 7 MG/ML SUS OR CT FR VD...,...,14.42,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,Sim


In [17]:
# Make the first row the header
pmc.columns = pmc.iloc[0]
pmc = pmc.drop(41)
pmc


41,SUBSTÂNCIA,CNPJ,LABORATÓRIO,CÓDIGO GGREM,REGISTRO,EAN 1,EAN 2,EAN 3,PRODUTO,APRESENTAÇÃO,...,PMC 22% ALC,RESTRIÇÃO HOSPITALAR,CAP,CONFAZ 87,ICMS 0%,ANÁLISE RECURSAL,LISTA DE CONCESSÃO DE CRÉDITO TRIBUTÁRIO (PIS/COFINS),COMERCIALIZAÇÃO 2022,TARJA,DESTINAÇÃO COMERCIAL
42,21-ACETATO DE DEXAMETASONA;CLOTRIMAZOL,18.459.628/0001-15,BAYER S.A.,538912020009303,1705600230032,7891106000956,-,-,BAYCUTEN N,"10 MG/G + 0,443 MG/G CREM DERM CT BG AL X 40 G",...,45.70,Não,Não,Não,Não,,Negativa,Sim,- (*),Sim
43,ABATACEPTE,56.998.982/0001-07,BRISTOL-MYERS SQUIBB FARMACÊUTICA LTDA,505107701157215,1018003900019,7896016806469,-,-,ORENCIA,250 MG PO LIOF SOL INJ CT 1 FA + SER DESCARTÁVEL,...,,Sim,Sim,Não,Não,,Positiva,Sim,Tarja Vermelha,Sim
44,ABATACEPTE,56.998.982/0001-07,BRISTOL-MYERS SQUIBB FARMACÊUTICA LTDA,505113100020505,1018003900078,7896016808197,-,-,ORENCIA,125 MG/ML SOL INJ SC CT 4 SER PREENC VD TRANS ...,...,11381.34,Não,Sim,Sim,Não,,Positiva,Sim,- (*),Sim
45,ABEMACICLIBE,43.940.618/0001-44,ELI LILLY DO BRASIL LTDA,507619060021902,1126001990018,7896382708442,-,-,VERZENIOS,50 MG COM REV CT BL AL AL X 30,...,4812.49,Não,Não,Não,Não,,Negativa,Sim,Tarja Vermelha,Não
46,ABEMACICLIBE,43.940.618/0001-44,ELI LILLY DO BRASIL LTDA,507619060022102,1126001990034,7896382708466,-,-,VERZENIOS,100 MG COM REV CT BL AL AL X 30,...,9624.92,Não,Não,Não,Não,,Negativa,Sim,Tarja Vermelha,Não
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24682,ÓXIDO CÚPRICO;ACETATO DE RACEALFATOCOFEROL;BET...,60.726.692/0001-81,MARJAN INDÚSTRIA E COMÉRCIO LTDA,524821050012007,1015500910141,7896226109657,-,-,VITERGAN ZINCO,COM REV CT BL AL PLAS PVC/PE/PVDC TRANS X 15,...,50.50,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,Sim
24683,ÓXIDO CÚPRICO;ACETATO DE RACEALFATOCOFEROL;BET...,60.726.692/0001-81,MARJAN INDÚSTRIA E COMÉRCIO LTDA,524821050012107,1015500910158,7896226109664,-,-,VITERGAN ZINCO,COM REV CT BL AL PLAS PVC/PE/PVDC TRANS X 15,...,57.01,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,Sim
24684,ÓXIDO CÚPRICO;SELENATO DE SÓDIO;ACETATO DE RAC...,60.659.463/0029-92,ACHÉ LABORATÓRIOS FARMACÊUTICOS S.A,500500101118422,1057302060014,7896658000010,-,-,ACCUVIT,COM REV CT FR PLAS OPC X 30,...,140.79,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,Sim
24685,ÓXIDO DE MAGNÉSIO;SIMETICONA;HIDRÓXIDO DE ALUM...,61.190.096/0001-92,EUROFARMA LABORATÓRIOS S.A.,508011804138416,1004306960107,7891317469610,7891317020118,-,SIMECO PLUS,120 MG/ML + 60 MG/ML + 7 MG/ML SUS OR CT FR VD...,...,14.42,Não,Não,Não,Não,,Negativa,Não,Tarja Sem Tarja,Sim


In [18]:
substance_names = pmc['SUBSTÂNCIA'].values
substance_names[:5]

array(['21-ACETATO DE DEXAMETASONA;CLOTRIMAZOL', 'ABATACEPTE',
       'ABATACEPTE', 'ABEMACICLIBE', 'ABEMACICLIBE'], dtype=object)

In [19]:
brand_names = pmc['PRODUTO'].values
brand_names[:5]

array(['BAYCUTEN N', 'ORENCIA', 'ORENCIA', 'VERZENIOS', 'VERZENIOS'],
      dtype=object)

In [20]:
import re

# Initialize an empty list to store drug names
drugs = []

# Split pharmacological substances
for substance in substance_names:
    # Split the substance names by commas, semicolons, or plus signs
    split_substance = re.split(r'[,;\+]', substance)
    # Strip any leading or trailing whitespace from each split part
    split_substance = [s.strip() for s in split_substance]
    # Extend the drugs list with the split parts
    drugs.extend(split_substance)

# Extend the drugs list with brand names
drugs.extend(brand_names)

# Remove duplicates by converting the list to a set and back to a list
drugs = list(set(drugs))

In [21]:
drugs[:10]

['FORTEN',
 'TACROLIMO',
 'ATEROMA',
 'EZET',
 'LOCERYL',
 'RIVACRIST',
 'TRAZIMERA',
 'CONTRACTUBEX',
 'LISADOR DIP',
 'MEPSEVII']

In [22]:
'ACIDO GLUTAMICO' in drugs

False

In [23]:
from helpers.text import remove_accented_characters

# Add an unaccented variant of every drug name to the list
# This lets the gazetteer match names written with or without accents
drugs.extend([remove_accented_characters(d) for d in drugs])

# Remove duplicates by converting the list to a set and back to a list
# This ensures that each drug name appears only once in the list
drugs = list(set(drugs))

# Display the first 10 drug names
# This is useful for quickly checking the contents of the list
drugs[:10]

['EZET',
 'FORTEN',
 'TACROLIMO',
 'ATEROMA',
 'LOCERYL',
 'RIVACRIST',
 'TRAZIMERA',
 'CONTRACTUBEX',
 'LISADOR DIP',
 'MEPSEVII']

In [24]:
'ACIDO GLUTAMICO' in drugs

True
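
The implementation of `remove_accented_characters` lives in the course's `helpers` package and is not shown here. One common way to write such a helper, offered only as an assumption about what it might look like, is to decompose each character with `unicodedata` and drop the combining marks. The name `strip_accents` is hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Return `text` with diacritics removed (e.g. 'Á' -> 'A', 'ç' -> 'c')."""
    # NFD splits an accented character into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("ÁCIDO GLUTÂMICO"))
print(strip_accents("cloridrato de lidocaína"))
```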

In [25]:
# Convert all drug names to lowercase
# This ensures that the comparison of drug names is case-insensitive
# The list comprehension iterates over each drug name in the 'drugs' list and converts it to lowercase
drugs = [d.lower() for d in drugs]

# Remove duplicate drug names
# The set() function removes duplicates by converting the list to a set, which only keeps unique elements
# The list() function converts the set back to a list
drugs = list(set(drugs))

# Display the first 10 drug names
# This allows us to inspect a sample of the processed drug names
# The slicing operation [:10] retrieves the first 10 elements from the 'drugs' list
drugs[:10]

['picbam',
 'lfm-leflunomida',
 'riduzi',
 'ecalta',
 'hormotrop',
 'diuremida',
 'brometo de ipratroprio',
 'obinutuzumabe',
 'nidazofarma',
 'neumosin']

In [26]:
def remove_hydrated_compounds(text: str) -> str:
    """
    Remove words related to hydrated compounds from the input text.

    Args:
        text (str): The input text containing chemical compound names.

    Returns:
        str: The text with hydrated compound names removed.
    """
    # Replace words like "monoidratada" with an empty string
    cleaned_text = re.sub(r' \b\S*idratad[oa]\b', '', text, flags=re.IGNORECASE)

    # Remove any extra spaces that may result from the substitution
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()

    return cleaned_text

print(remove_hydrated_compounds('ACIDO GLUTAMICO MONOIDRATADO')) # Expected: 'ACIDO GLUTAMICO'
print(remove_hydrated_compounds('cloridrato de lidocaína monoidratada')) # Expected: 'cloridrato de lidocaína'
print(remove_hydrated_compounds('Metilprednisolona')) # Expected: 'Metilprednisolona'

ACIDO GLUTAMICO
cloridrato de lidocaína
Metilprednisolona


In [27]:
# Apply the remove_hydrated_compounds function to each drug name in the drugs list
drugs = [remove_hydrated_compounds(d) for d in drugs]

# Remove duplicates by converting the list to a set and back to a list
# This ensures that each drug name appears only once in the list
drugs = list(set(drugs))

# Calculate the number of unique drug names
# This is useful for understanding the size of the dataset after processing
len(drugs)

8016

In [28]:
def extract_active_ingredient(full_name: str) -> str:
    """
    Extract the active ingredient name from the full medication name.

    Args:
        full_name (str): The full name of the medication.

    Returns:
        str: The name of the active ingredient.
    """
    # List of words that indicate the start of the active ingredient name
    indicators = ['de', 'do', 'da', 'dos', 'das']
    
    # Split the full name into words and convert to lowercase
    words = full_name.lower().split()
    
    # Find the index of the first indicator word
    active_ingredient_index = None
    for i, word in enumerate(words):
        if word in indicators and i < len(words) - 1:
            active_ingredient_index = i
            break
    
    # If an indicator is found, return the words after it
    if active_ingredient_index is not None:
        return ' '.join(words[active_ingredient_index + 1:])
    else:
        # If no indicator is found, return the full name
        return full_name

# Examples of usage
print(extract_active_ingredient("micofenolato de pantoprazol"))  # Should return "pantoprazol"
print(extract_active_ingredient("cloridrato de metformina"))     # Should return "metformina"
print(extract_active_ingredient("paracetamol"))                  # Should return "paracetamol"

pantoprazol
metformina
paracetamol


In [29]:
# Apply the extract_active_ingredient function to each drug name in the drugs list
drugs.extend([extract_active_ingredient(d) for d in drugs])

# Remove duplicates by converting the list to a set and back to a list
# This ensures that each active ingredient appears only once in the list
drugs = list(set(drugs))

In [30]:
# Filter the drugs list to keep only the names with more than 4 characters
# This step helps to remove very short names that are likely not meaningful drug names
drugs = [d for d in drugs if len(d) > 4]

# Calculate the number of remaining drug names
# This gives an idea of how many drug names are left after filtering
len(drugs)

8579

In [31]:
drugs[:5]

['picbam', 'lfm-leflunomida', 'riduzi', 'ecalta', 'hormotrop']

In [32]:
# Import the json module to handle JSON data
import json

# Initialize an empty dictionary to store the drug names
json_drugs = {}

# Add the list of drug names to the dictionary under the key 'MEDICAMENTO'
json_drugs['MEDICAMENTO'] = drugs

# Save the dictionary as a JSON file
# The ensure_ascii=False parameter allows for non-ASCII characters to be saved correctly
# The indent=4 parameter makes the JSON file more readable by adding indentation
with open('data/ner/drugs_gazetteer.json', 'w', encoding='utf-8') as f:
    json.dump(json_drugs, f, ensure_ascii=False, indent=4)

# Load the JSON file into a gazetteer for weak supervision
# The extract_json_data function reads the JSON file and prepares it for use with skweak
tries_drugs = skweak.gazetteers.extract_json_data('data/ner/drugs_gazetteer.json', spacy_model="pt_core_news_lg")

# Create a GazetteerAnnotator for labeling the data
# The GazetteerAnnotator uses the gazetteer to annotate text with drug names
# The case_sensitive=False parameter makes the annotation case-insensitive
lf_drugs_gazetteer = skweak.gazetteers.GazetteerAnnotator("drugs_gazetteer", tries_drugs, case_sensitive=False)

Extracting data from data/ner/drugs_gazetteer.json
Populating trie for class MEDICAMENTO (number: 8579)


In [33]:
# Define a sample text containing drug names for testing the function
text = "O paciente foi medicado com ácido glutâmico monoidratado , Cloridrato de lidocaína e Rivotril ."

# Process the text using the spaCy NLP pipeline
# This step tokenizes the text and applies linguistic annotations
doc = nlp(text)

# Apply the GazetteerAnnotator to the processed text
# This annotates the text with drug names using the gazetteer
lf_drugs_gazetteer(doc)

# Display the annotated entities in the text
# This function highlights the recognized drug names in the text for visualization
skweak.utils.display_entities(doc, "drugs_gazetteer")

In [34]:
# Select the first document from the training set of spaCy documents
# This document will be used to demonstrate the annotation process
doc = spacy_docs_train[0]

# Apply the GazetteerAnnotator to the selected document
# This annotates the document with drug names using the gazetteer
lf_drugs_gazetteer(doc)

# Display the annotated entities in the document
# This function highlights the recognized drug names in the document for visualization
skweak.utils.display_entities(doc, "drugs_gazetteer")

In [35]:
# Select the second document (index 1) from the training set of spaCy documents
# This document will be used to demonstrate the annotation process
doc = spacy_docs_train[1]

# Apply the GazetteerAnnotator to the selected document
# This annotates the document with drug names using the gazetteer
lf_drugs_gazetteer(doc)

# Display the annotated entities in the document
# This function highlights the recognized drug names in the document for visualization
skweak.utils.display_entities(doc, "drugs_gazetteer")

#### 2.2 Leveraging Pretrained Transformer Models for Labeling Functions

In addition to rule-based approaches, pretrained language models offer a powerful alternative for constructing labeling functions. These models, trained on vast amounts of text data, can be employed to identify entities within our target text.

This section focuses on using a pretrained Named Entity Recognition (NER) model from the Hugging Face Transformers library to build a labeling function specifically for identifying **drug entities**. We'll be applying the `pucpr/clinicalnerpt-chemical` model. This model is particularly well-suited for our purpose as it has been fine-tuned on a corpus of clinical text data, enabling it to effectively recognize chemical entities, including drug names.

**Why this model?**

- **Domain Specificity:** Fine-tuning on clinical text makes the model more accurate for our use case compared to a general-purpose NER model.
- **Direct Applicability:** The model's output directly aligns with our goal of identifying drug entities, simplifying the labeling function creation process.

This approach leverages the power of transfer learning, allowing us to benefit from the extensive training these models have undergone and apply their knowledge to our specific task. You can find more details about this model on its Hugging Face model card: [https://huggingface.co/pucpr/clinicalnerpt-chemical](https://huggingface.co/pucpr/clinicalnerpt-chemical).

In [36]:
# Import the pipeline function from the transformers library
# This function is used to create a named entity recognition (NER) pipeline
from transformers import pipeline

# Create an NER pipeline using a pre-trained model for clinical chemical entities
# The model 'pucpr/clinicalnerpt-chemical' is specifically trained for recognizing chemical entities in clinical texts
# With aggregation_strategy="first", sub-tokens are merged back into whole words and each word takes the label of its first sub-token
# The device=-1 parameter indicates that the pipeline should run on the CPU (use 0 or a positive integer for GPU)
ner_pipeline = pipeline("ner", model='pucpr/clinicalnerpt-chemical', aggregation_strategy="first", device=-1)

# Apply the NER pipeline to the text of the spaCy document
# This step performs named entity recognition on the text, identifying chemical entities
ner_results = ner_pipeline(doc.text)

# Display the NER results
# This will show the recognized chemical entities along with their positions and labels
print(ner_results)

[{'entity_group': 'ChemicalDrugs', 'score': 0.5152051, 'word': 'medicamento', 'start': 278, 'end': 289}, {'entity_group': 'ChemicalDrugs', 'score': 0.92696726, 'word': 'reuquinol', 'start': 301, 'end': 310}, {'entity_group': 'ChemicalDrugs', 'score': 0.9846222, 'word': 'hidroxicloroquina', 'start': 312, 'end': 329}, {'entity_group': 'ChemicalDrugs', 'score': 0.8043461, 'word': 'pregabalina', 'start': 339, 'end': 350}, {'entity_group': 'ChemicalDrugs', 'score': 0.77164274, 'word': 'medicamento', 'start': 715, 'end': 726}, {'entity_group': 'ChemicalDrugs', 'score': 0.89936185, 'word': 'reuquinol', 'start': 727, 'end': 736}, {'entity_group': 'ChemicalDrugs', 'score': 0.9620483, 'word': 'hidroxicloquina', 'start': 738, 'end': 753}]
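
The effect of `aggregation_strategy="first"` can be illustrated with a toy example. The sketch below merges WordPiece-style pieces (continuations start with `##`) back into words and keeps the label of the first piece. The sub-token split and the `B-`/`I-` labels are invented for illustration, and the real pipeline additionally groups adjacent words into entity spans, which is not reproduced here.

```python
def aggregate_first(subtokens):
    """Merge '##' continuation pieces into words, keeping the first piece's label."""
    words = []
    for piece, label in subtokens:
        if piece.startswith("##") and words:
            words[-1][0] += piece[2:]  # extend the word, keep its first label
        else:
            words.append([piece, label])
    return [tuple(word) for word in words]


# Invented sub-token predictions: the middle piece disagrees with the others.
subtokens = [
    ("hidro", "B-ChemicalDrugs"),
    ("##xic", "O"),
    ("##loroquina", "I-ChemicalDrugs"),
    ("diária", "O"),
]
print(aggregate_first(subtokens))
```

Because the whole word inherits one label, a word can never end up half inside and half outside an entity.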


Here is how the pipeline behaves on an arbitrary piece of text:

In [37]:
from helpers.ner import TransformerNERAnnotator, render_entity_data_from_pipeline

# Define a sample text containing drug names for testing the NER pipeline
sample_text = "O paciente foi medicado com ácido glutâmico monoidratado, Cloridrato de lidocaína e Rivotril."

# Apply the NER pipeline to the sample text
# This step performs named entity recognition on the text, identifying chemical entities
ner_results = ner_pipeline(sample_text)

# Render the annotated entities in the sample text
# This function highlights the recognized chemical entities in the text for visualization
render_entity_data_from_pipeline(sample_text, ner_results)

Now we can transform it into a labelling function that can be used with Skweak.

In [38]:
# Load a spaCy model for Portuguese language processing
# The "pt_core_news_lg" model is a large model that provides detailed linguistic annotations
nlp = spacy.load("pt_core_news_lg")

# Define a custom label mapping for the NER annotator
# This mapping translates the model's "ChemicalDrugs" label to "MEDICAMENTO"
custom_label_mapping_lf_transformer_1 = {
    "ChemicalDrugs": "MEDICAMENTO",
}

# Create an instance of the TransformerNERAnnotator with the specified parameters
# - "lf_transformer_1": Name of the annotator
# - model_name: Pre-trained model for recognizing chemical entities in clinical texts
# - label_mapping: Custom label mapping defined above
# - score_threshold: Minimum confidence score for an entity to be considered valid
# - words_to_skip: List of words to ignore during annotation. This is useful for excluding false positives
lf_transformer_1 = TransformerNERAnnotator(
    "lf_transformer_1",
    model_name='pucpr/clinicalnerpt-chemical',
    label_mapping=custom_label_mapping_lf_transformer_1,
    score_threshold=0.5,  # Set your desired threshold here
    words_to_skip=['medicamento', 'medicamentos']
)
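
`TransformerNERAnnotator` is a course helper whose source is not shown, so the following is only a guess at the kind of filtering that `label_mapping`, `score_threshold`, and `words_to_skip` imply. The function name is hypothetical, as is the choice of `>=` for the threshold comparison. The sample entities are copied from the pipeline output printed earlier.

```python
def filter_entities(entities, label_mapping, score_threshold=0.5, words_to_skip=()):
    """Keep confident, mapped entities and return (start, end, label) tuples."""
    skip = {word.lower() for word in words_to_skip}
    kept = []
    for ent in entities:
        label = label_mapping.get(ent["entity_group"])
        if label is None:  # entity type we are not interested in
            continue
        if ent["score"] < score_threshold:  # prediction not confident enough
            continue
        if ent["word"].lower() in skip:  # known false positive
            continue
        kept.append((ent["start"], ent["end"], label))
    return kept


raw_entities = [
    {"entity_group": "ChemicalDrugs", "score": 0.5152051, "word": "medicamento", "start": 278, "end": 289},
    {"entity_group": "ChemicalDrugs", "score": 0.92696726, "word": "reuquinol", "start": 301, "end": 310},
    {"entity_group": "ChemicalDrugs", "score": 0.9846222, "word": "hidroxicloroquina", "start": 312, "end": 329},
]
mapping = {"ChemicalDrugs": "MEDICAMENTO"}

kept = filter_entities(raw_entities, mapping, 0.5, ["medicamento", "medicamentos"])
print(kept)
```

The returned offsets are character positions, as produced by the pipeline. An annotator for skweak would still need to align them with spaCy token boundaries, a step left out of this sketch.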

In [39]:
# Apply the TransformerNERAnnotator to the spaCy document
# This step annotates the document with named entities using the pre-trained model
lf_transformer_1(doc)

# Display the annotated entities in the document
# This function highlights the recognized entities in the document for visualization
skweak.utils.display_entities(doc, "lf_transformer_1")

#### 2.3 Using Zero-Shot NER with GLiNER for Enhanced Entity Recognition


Named Entity Recognition (NER) systems traditionally focus on identifying a pre-defined set of entity types. However, this approach proves limiting when dealing with diverse and evolving entity types in real-world data. GLiNER, a novel NER model, overcomes this limitation by employing bidirectional transformer encoders, similar to the architecture of BERT, to detect **any** entity type. This capability distinguishes GLiNER from traditional NER systems and positions it as an efficient alternative to large language models (LLMs), particularly in resource-constrained environments where deploying large LLMs can be impractical.

##### Evolution and Advancements in GLiNER Architecture

Early iterations of GLiNER relied on older encoder architectures such as BERT and DeBERTa. Trained on relatively small datasets, these versions lacked modern optimizations such as flash attention and were limited to a context window of 512 tokens, which hindered their performance on tasks requiring broader textual context.

To address these limitations, recent developments in GLiNER have focused on:

* **Advanced Encoder Architectures:** Shifting from older architectures to more advanced ones that capitalize on the LLM2Vec method. This method transforms the initial decoder model into a bidirectional encoder, leading to enhanced performance.

* **Extensive Pre-training:** Pre-training the model on a massive scale using the Wikipedia corpus and masked token prediction tasks. This results in several advantages, including:
    * **Integration of Flash Attention:** Enabling faster training and inference processes.
    * **Expanded Context Window:** Extending the context window up to 32k tokens, allowing the model to capture longer-range dependencies within the text, which is crucial for understanding complex relationships and improving accuracy in tasks requiring broader textual context.
    * **Improved Generalization:** The model's ability to generalize and perform well on unseen data is significantly enhanced.

These advancements collectively contribute to a more robust and efficient GLiNER model capable of handling diverse NER tasks.

##### Key Advantages of the Enhanced GLiNER Model

The latest GLiNER model offers substantial improvements over its predecessors, including:

* **Enhanced Performance and Generalization:** Exhibiting superior performance and better generalization capabilities due to architectural improvements and extensive pre-training.
* **Flash Attention Support:** Integrating flash attention for faster training and inference, making it more efficient for real-world applications.
* **Extended Context Window:** Expanding the context window to accommodate up to 32k tokens, allowing for a more thorough understanding of textual relationships and improved accuracy in tasks requiring a wider range of textual information.

For a more in-depth understanding of the GLiNER model and its evolution, refer to the research paper available [here](https://arxiv.org/pdf/2406.12925).

In [40]:
# Import the GLiNER class from the gliner library
# GLiNER is used for named entity recognition (NER) tasks
from gliner import GLiNER

# Import the helpers.ner module
# This module contains helper functions for NER tasks
import helpers.ner 

# Load a pre-trained GLiNER model
# The model "knowledgator/gliner-llama-1.3B-v1.0" is specifically trained for NER tasks
# The from_pretrained method loads the model weights and configuration
model_gliner_llama = GLiNER.from_pretrained("knowledgator/gliner-llama-1.3B-v1.0")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading decoder model using LLM2Vec...


Here's how the GLiNER model works on a given text:

In [41]:
# Extract the text content from the spaCy document
# This text will be used as input for the GLiNER model
text = doc.text

# Define the list of labels to be recognized by the GLiNER model
# In this case, we are interested in recognizing entities labeled as 'medicamento'
labels = ['medicamento']

# Use the GLiNER model to predict entities in the text
# The predict_entities method takes the text, labels, and a confidence threshold as input
# The threshold parameter specifies the minimum confidence score for an entity to be considered valid
result = model_gliner_llama.predict_entities(text, labels, threshold=0.5)

# Display the prediction results
# This will show the recognized entities along with their positions and confidence scores
print(result)

# Render the annotated entities in the text using the helpers.ner module
helpers.ner.render_entity_data_from_pipeline(text, result, label_key_name='label')

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'start': 301, 'end': 310, 'text': 'REUQUINOL', 'label': 'medicamento', 'score': 0.95865797996521}, {'start': 312, 'end': 329, 'text': 'HIDROXICLOROQUINA', 'label': 'medicamento', 'score': 0.9063383936882019}, {'start': 339, 'end': 350, 'text': 'PREGABALINA', 'label': 'medicamento', 'score': 0.745392918586731}, {'start': 727, 'end': 736, 'text': 'REUQUINOL', 'label': 'medicamento', 'score': 0.9443965554237366}, {'start': 738, 'end': 753, 'text': 'HIDROXICLOQUINA', 'label': 'medicamento', 'score': 0.8267961144447327}]


In [42]:
# Load a pre-trained GLiNER model
# The model "knowledgator/gliner-qwen-1.5B-v1.0" is specifically trained for NER tasks
# The from_pretrained method loads the model weights and configuration
model_gliner_qwen = GLiNER.from_pretrained("knowledgator/gliner-qwen-1.5B-v1.0")

# Extract the text content from the spaCy document
# This text will be used as input for the GLiNER model
text = doc.text

# Define the list of labels to be recognized by the GLiNER model
# In this case, we are interested in recognizing entities labeled as 'medicamento'
labels = ['medicamento']

# Use the GLiNER model to predict entities in the text
# The predict_entities method takes the text, labels, and a confidence threshold as input
# The threshold parameter specifies the minimum confidence score for an entity to be considered valid
result = model_gliner_qwen.predict_entities(text, labels, threshold=0.5)

# Print the prediction results
# This will show the recognized entities along with their positions and confidence scores
print(result)

# Render the annotated entities in the text
# This function highlights the recognized entities in the text for visualization
helpers.ner.render_entity_data_from_pipeline(text, result, label_key_name='label')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading decoder model using LLM2Vec...


                            You should to consider manually add new tokens to tokenizer or to load tokenizer with added tokens.


[]


In [43]:
# Load a pre-trained GLiNER model
# The model "knowledgator/gliner-bi-large-v1.0" is specifically trained for NER tasks
# The from_pretrained method loads the model weights and configuration
model_gliner_bi_large = GLiNER.from_pretrained("knowledgator/gliner-bi-large-v1.0")

# Extract the text content from the spaCy document
# This text will be used as input for the GLiNER model
text = doc.text

# Define the list of labels to be recognized by the GLiNER model
# In this case, we are interested in recognizing entities labeled as 'medicamento'
labels = ['medicamento']

# Use the GLiNER model to predict entities in the text
# The predict_entities method takes the text, labels, and a confidence threshold as input
# The threshold parameter specifies the minimum confidence score for an entity to be considered valid
result = model_gliner_bi_large.predict_entities(text, labels, threshold=0.5)

# Render the prediction results
# This function highlights the recognized entities in the text for visualization
# The label_key_name parameter specifies the key name for the entity labels in the result
helpers.ner.render_entity_data_from_pipeline(text, result, label_key_name='label')

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Now we can turn these models into labelling functions that can be used with Skweak.
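The `helpers.ner.GLiNERAnnotator` class used in the next cell is not shown in this section. As a rough sketch of the kind of post-processing such a wrapper might apply (an assumption, not the helper's actual code), the function below filters GLiNER-style prediction dictionaries by a score threshold and drops spans whose text appears in a skip list. The name `filter_predictions` is hypothetical. The first three sample predictions come from the `gliner-llama` output shown earlier (scores rounded); the last one is made up to exercise the skip list.

```python
from typing import Dict, List, Sequence


def filter_predictions(
    predictions: List[Dict],
    score_threshold: float,
    words_to_skip: Sequence[str] = (),
) -> List[Dict]:
    """Keep predictions that meet the threshold and are not in the skip list.

    Hypothetical helper: the real GLiNERAnnotator may work differently.
    """
    skip = {word.lower() for word in words_to_skip}
    return [
        pred
        for pred in predictions
        if pred["score"] >= score_threshold and pred["text"].lower() not in skip
    ]


sample_predictions = [
    # Taken from the gliner-llama output above (scores rounded)
    {"start": 301, "end": 310, "text": "REUQUINOL", "label": "medicamento", "score": 0.9587},
    {"start": 312, "end": 329, "text": "HIDROXICLOROQUINA", "label": "medicamento", "score": 0.9063},
    {"start": 339, "end": 350, "text": "PREGABALINA", "label": "medicamento", "score": 0.7454},
    # Hypothetical prediction: a generic word that should not count as a drug name
    {"start": 268, "end": 279, "text": "medicamento", "label": "medicamento", "score": 0.91},
]

kept = filter_predictions(
    sample_predictions,
    score_threshold=0.8,
    words_to_skip=["medicamento", "medicamentos"],
)
print([pred["text"] for pred in kept])
```

Raising the threshold trades recall for precision: at 0.8 the lower-scoring `PREGABALINA` span is dropped, while the 0.6 threshold used in the next cell would keep it.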

In [44]:
# Create an instance of GLiNERAnnotator for the "llama" model
# This annotator is configured to recognize entities labeled as "medicamento"
# - annotator_name: Unique name for the annotator instance
# - model_name: Pre-trained model for NER tasks
# - labels: List of entity labels to recognize
# - score_threshold: Minimum confidence score for an entity to be considered valid
# - words_to_skip: List of words to ignore during annotation
lf_gliner_llama = helpers.ner.GLiNERAnnotator(
    annotator_name="lf_gliner_llama",
    model_name="knowledgator/gliner-llama-1.3B-v1.0",
    labels=["medicamento"],
    score_threshold=0.6,
    words_to_skip=["medicamento", "medicamentos"],
)

# Create an instance of GLiNERAnnotator for the "qwen" model
# This annotator is configured similarly to the "llama" annotator but uses a different pre-trained model
lf_gliner_qwen = helpers.ner.GLiNERAnnotator(
    annotator_name="lf_gliner_qwen",
    model_name="knowledgator/gliner-qwen-1.5B-v1.0",
    labels=["medicamento"],
    score_threshold=0.6,
    words_to_skip=["medicamento", "medicamentos"],
)

# Create an instance of GLiNERAnnotator for the "bi_large" model
# This annotator is configured similarly to the previous annotators but uses another different pre-trained model
lf_gliner_bi_large = helpers.ner.GLiNERAnnotator(
    annotator_name="lf_gliner_bi_large",
    model_name="knowledgator/gliner-bi-large-v1.0",
    labels=["medicamento"],
    score_threshold=0.6,
    words_to_skip=["medicamento", "medicamentos"],
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading decoder model using LLM2Vec...


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading decoder model using LLM2Vec...


In [45]:
# Apply the GLiNER model with the "llama" configuration to the spaCy document
# This step annotates the document with named entities using the "llama" pre-trained model
lf_gliner_llama(doc)

# Apply the GLiNER model with the "qwen" configuration to the spaCy document
# This step annotates the document with named entities using the "qwen" pre-trained model
lf_gliner_qwen(doc)

# Apply the GLiNER model with the "bi_large" configuration to the spaCy document
# This step annotates the document with named entities using the "bi_large" pre-trained model
lf_gliner_bi_large(doc)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


RELATÓRIO
Trata-se de Ação Ordinária, com pedido de tutela antecipada, proposta por WIGNETT NASCIMENTO SILVA em face da UNIÃO, do ESTADO DO RIO GRANDE DO NORTE e do MUNICÍPIO DE MOSSORÓ/RN, por meio da qual a parte autora pleiteia o fornecimento imediato, de forma gratuita, do medicamento denominado REUQUINOL (HIDROXICLOROQUINA) 400mg e PREGABALINA 75 mg, nas quantidades e formas prescritas pela médica que acompanha a parte autora, no escopo de tratar SÍNDROME DE SJÖGREN ASSOCIADO A POLIARTRALGIA CRÔNICA (CID 10 M35 / M 25.5), de que se diz portadora. 
É o que importa relatar.Decido.
PRELIMINAR DE FALTA DE INTERESSE DE AGIR (UNIÃO)
A União alega que faltaria à parte autora interesse de agir, uma vez que o medicamento REUQUINOL (HIDROXICLOQUINA) é fornecido pelo SUS.
Ocorre que, consoante declaração expedida pelos próprios agentes do Poder Público, o medicamento se encontra em falta.
Assim sendo, o provimento jurisdicional buscado reveste-se de utilidade, uma vez que, não obstante previ

In [46]:
# Display the annotated entities in the document
# This function highlights the recognized entities in the document for visualization
skweak.utils.display_entities(doc, "lf_gliner_llama")

In [47]:
# Display the annotated entities in the document
# This function highlights the recognized entities in the document for visualization
skweak.utils.display_entities(doc, "lf_gliner_qwen")

In [48]:
# Display the annotated entities in the document
# This function highlights the recognized entities in the document for visualization
skweak.utils.display_entities(doc, "lf_gliner_bi_large")

#### 2.4 Using Zero-Shot NER with NuNER for Enhanced Entity Recognition

**NuNER** is a novel approach to building a specialized NER model that leverages the power of large language models (LLMs) to enhance the efficiency of smaller, task-specific models. This approach addresses the limitations of traditional NER models, which often struggle with data efficiency and generalization across different entity types and domains.

##### NuNER: A Task-Specific Foundation Model

Traditional NER models, typically based on pre-trained transformer encoders like BERT, often require extensive fine-tuning on human-annotated data for specific entity types and domains. While large language models (LLMs) like GPT have shown impressive zero-shot NER capabilities, their size makes them computationally expensive for real-world applications.

NuNER aims to bridge this gap by creating a **task-specific foundation model** specifically for NER. Unlike **domain-specific** foundation models like SciBERT (for scientific text) or BioBERT (for biomedical text), which are common, task-specific models are rare due to the lack of suitable datasets. NuNER leverages the power of LLMs to overcome this bottleneck.

##### NuNER Creation Process:

1. **Dataset Creation with LLM Annotation**:
    - A large, diverse dataset (a subset of the C4 corpus) is automatically annotated with entity labels using an LLM (GPT-3.5). This eliminates the need for costly and time-consuming human annotation.
    - The LLM is prompted to identify a wide range of entities and assign them appropriate concepts (entity types/topics). This unconstrained approach allows for the extraction of a diverse set of entities, going beyond traditional NER datasets.
    - This results in a dataset with millions of entity annotations spanning hundreds of thousands of unique concepts, exhibiting a more realistic long-tailed distribution of concept frequencies compared to curated datasets.

2. **Pre-training with Contrastive Learning**:
    - A smaller model (RoBERTa-base) is further pre-trained on this LLM-annotated dataset using contrastive learning. This method helps the model learn to distinguish between similar entities by focusing on contrasting features.
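To make the contrastive step more concrete, here is a toy sketch in pure Python. It assumes a setup in which token embeddings from a text encoder are compared with concept embeddings from a concept encoder through a dot product, and each (token, concept) pair is trained as a binary match with a cross-entropy loss. The two-dimensional vectors and the function names are illustrative assumptions; the real model learns its embeddings from the LLM-annotated data.

```python
import math
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x: float) -> float:
    """Logistic function mapping a similarity score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def contrastive_bce_loss(
    token_embs: List[List[float]],
    concept_embs: List[List[float]],
    targets: List[List[int]],
) -> float:
    """Mean binary cross-entropy over all (token, concept) pairs.

    targets[i][j] is 1 when token i is annotated with concept j, else 0.
    """
    total, count = 0.0, 0
    for i, token in enumerate(token_embs):
        for j, concept in enumerate(concept_embs):
            prob = sigmoid(dot(token, concept))
            y = targets[i][j]
            total += -(y * math.log(prob) + (1 - y) * math.log(1.0 - prob))
            count += 1
    return total / count


# Toy embeddings (assumed values): token 0 points toward concept 0, token 1 toward concept 1
token_embs = [[2.0, 0.0], [0.0, 2.0]]
concept_embs = [[2.0, -2.0], [-2.0, 2.0]]

aligned_loss = contrastive_bce_loss(token_embs, concept_embs, [[1, 0], [0, 1]])
misaligned_loss = contrastive_bce_loss(token_embs, concept_embs, [[0, 1], [1, 0]])
print(f"aligned: {aligned_loss:.4f}  misaligned: {misaligned_loss:.4f}")
```

The loss is small when each token embedding sits close to its annotated concept and far from the others, which is the behaviour the pre-training step pushes the encoder toward.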

##### Factors Influencing NuNER's Performance:

Ablation studies reveal that **concept diversity** and **dataset size** are crucial factors contributing to NuNER’s performance.

- Increasing the variety of concepts in the pre-training dataset leads to better generalization across different entity types.
- Larger pre-training datasets further improve performance, indicating the model’s capacity to learn from more data.

Surprisingly, **text diversity** (using C4 vs. Wikipedia) has a less significant impact when the LLM annotation process remains consistent. This highlights the importance of the LLM-driven annotation in capturing diverse entities and concepts.

NuNER demonstrates the feasibility of leveraging LLMs to develop highly effective and efficient task-specific models for NER. This approach holds immense potential for other NLP tasks, paving the way for more accessible and adaptable solutions in real-world applications.

> For a more in-depth understanding of the NuNER model and its evolution, refer to the research paper available [here](https://arxiv.org/abs/2402.15343).

Here's how the NuNER Zero model works on a given text:

In [49]:
# Load a pre-trained NuNerZero model
# The from_pretrained method loads the model weights and configuration
nuner_zero = GLiNER.from_pretrained("numind/NuNerZero")

# Use the NuNER Zero model (loaded through the GLiNER class) to predict entities in the text
# The predict_entities method takes the text and labels as input
# The labels specify the types of entities to recognize
result = nuner_zero.predict_entities(text, labels)

# Merge adjacent entities in the prediction results
# This step combines entities that are next to each other into a single entity
# The merge_adjacent_entities function takes the prediction results and the original text as input
result = helpers.ner.merge_adjacent_entities(result, text)

# Display the final prediction results
# This will show the recognized entities along with their positions and confidence scores
print(result)

# Render the annotated entities in the text
# This function highlights the recognized entities in the text for visualization
helpers.ner.render_entity_data_from_pipeline(text, result, label_key_name='label')


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'start': 301, 'end': 310, 'text': 'REUQUINOL', 'label': 'medicamento', 'score': 0.9951326251029968}, {'start': 312, 'end': 329, 'text': 'HIDROXICLOROQUINA', 'label': 'medicamento', 'score': 0.6647377610206604}, {'start': 339, 'end': 350, 'text': 'PREGABALINA', 'label': 'medicamento', 'score': 0.9287405014038086}, {'start': 727, 'end': 736, 'text': 'REUQUINOL', 'label': 'medicamento', 'score': 0.989434540271759}]
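The `helpers.ner.merge_adjacent_entities` helper used above is not shown in this section. NuNER Zero classifies individual tokens, so a multi-word entity can come back as several consecutive spans that need to be joined. The sketch below is one way such a merge step might look. The function name, the rule that spans are merged when they touch or are separated by a single character, and the choice to keep the lowest score are all assumptions, and the token-level predictions in the example are made up for illustration.

```python
from typing import Dict, List


def merge_adjacent_entities_sketch(entities: List[Dict], text: str) -> List[Dict]:
    """Merge same-label spans that touch or are separated by one character.

    Hypothetical stand-in for helpers.ner.merge_adjacent_entities.
    """
    if not entities:
        return []

    ordered = sorted(entities, key=lambda entity: entity["start"])
    merged = [dict(ordered[0])]  # copy so the input list is left untouched

    for entity in ordered[1:]:
        current = merged[-1]
        gap = entity["start"] - current["end"]
        if entity["label"] == current["label"] and 0 <= gap <= 1:
            # Extend the current span and re-read its text from the source string
            current["end"] = entity["end"]
            current["text"] = text[current["start"]:current["end"]]
            # Assumption: a merged span is only as reliable as its weakest part
            current["score"] = min(current["score"], entity["score"])
        else:
            merged.append(dict(entity))

    return merged


sample_text = "uso de cloridrato de propranolol 40 mg"
# Made-up token-level predictions for illustration
token_predictions = [
    {"start": 7, "end": 17, "text": "cloridrato", "label": "medicamento", "score": 0.9},
    {"start": 18, "end": 20, "text": "de", "label": "medicamento", "score": 0.6},
    {"start": 21, "end": 32, "text": "propranolol", "label": "medicamento", "score": 0.8},
]

merged_entities = merge_adjacent_entities_sketch(token_predictions, sample_text)
print(merged_entities)
```

Re-reading the text from the source string, rather than concatenating the pieces, keeps the merged entity consistent with its character offsets.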


Now we can transform it into a labelling function that can be used with Skweak.

In [50]:
# Create a GLiNERAnnotator for the NuNER Zero model, apply it to the document,
# and display the entities it annotated
lf_nuner_zero = helpers.ner.GLiNERAnnotator(
    annotator_name="lf_nuner_zero",
    model_name="numind/NuNerZero",
    labels=["medicamento"],
    score_threshold=0.6,
    words_to_skip=["medicamento", "medicamentos"],
)

lf_nuner_zero(doc)
skweak.utils.display_entities(doc, "lf_nuner_zero")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


#### 2.5 Using Zero-Shot NER with LLMs and Function Calling Capabilities

Instruction-tuned large language models (LLMs) have revolutionized natural language processing tasks, including Named Entity Recognition (NER). These powerful models, such as GPT-4, can perform zero-shot learning, enabling them to tackle tasks without explicit training on labeled data. By leveraging the function calling capabilities of LLMs, we can create custom functions that extract entities from text, effectively performing NER without extensive training or fine-tuning.

##### LangChain: Bridging LLMs and NER Tasks

[LangChain](https://www.langchain.com/) serves as a crucial intermediary, simplifying the process of interacting with LLMs for various language-related tasks, including NER. This platform offers several key features that make it an ideal choice for integrating LLMs into NER workflows:

1. **Seamless LLM Integration**: LangChain supports a wide range of popular LLMs, including OpenAI's GPT models and Google's Gemini models, ensuring compatibility with state-of-the-art language models.

2. **Intuitive API and Documentation**: The platform provides a well-documented, user-friendly API, complete with code examples and tutorials, which eases the incorporation of LLMs into applications.

3. **Flexible Input and Output Handling**: LangChain supports various input formats and offers customizable output handling, allowing for versatile processing of diverse content types.

4. **Task-Specific Modules**: Pre-configured modules optimized for common language tasks, including NER, streamline the process of achieving high-quality results.

##### Implementing NER with LangChain and LLMs

To perform NER using LangChain and LLMs, follow these general steps:

1. **Setup and Installation**: Install LangChain and configure necessary API keys for the chosen LLM.

2. **Model Initialization**: Import required LangChain modules and initialize the LLM instances with desired configurations.

3. **Input Preparation**: Preprocess and format the input text to ensure compatibility with the chosen LLM.

4. **Entity Extraction**: Use LangChain's API to pass the prepared input to the LLM and generate output containing extracted entities.

5. **Post-processing**: Process the generated output to extract relevant entity information and integrate it into your application's workflow.

##### Structured Data Extraction with Pydantic

To systematically extract and structure entity information, we employ Pydantic, a powerful library for data validation and settings management using Python type annotations. Pydantic allows us to define schemas that act as blueprints for the entities we wish to extract, ensuring consistency and adherence to predefined formats.

###### Key Benefits of Using Pydantic for NER

- **Data Consistency**: Schemas ensure extracted data follows a uniform structure.
- **Type Validation**: Pydantic's type checking reduces the risk of errors in extracted data.
- **Improved Readability**: Declarative schema definitions enhance code maintainability.
- **Efficient Processing**: Structured data facilitates easier analysis and downstream task integration.

> With LLMs, we transform unstructured text into structured, actionable data, enhancing both the efficiency and effectiveness of legal text analysis.

We'll use OpenAI GPT-4o-mini to perform zero-shot NER on legal documents, extracting entities with high accuracy and minimal manual intervention. This approach showcases the power of LLMs in automating complex NER tasks and streamlining information extraction processes.

> *Note: You'll need an OpenAI API key to access GPT-4o-mini through LangChain. Ensure you have the necessary permissions and credentials to use the model.*

In [51]:
from langchain_core.pydantic_v1 import BaseModel, Field
from typing import Optional, List
from langchain_core.prompts import ChatPromptTemplate
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
import os

# Load environment variables from .env file
# This is useful for keeping sensitive information like API keys out of your code
load_dotenv()  # You are expected to have a .env file with the OpenAI API KEY `OPENAI_API_KEY`

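# Retrieve the OpenAI API key from environment variables
# and fail early with a clear message if it is missing
api_key = os.getenv('OPENAI_API_KEY')
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file.")

# We're only displaying the first 5 characters for security reasons
# This is a good practice to verify the key is loaded without exposing it entirely
api_key_preview = api_key[:5]
print(f"First 5 characters of API key: {api_key_preview}")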

First 5 characters of API key: sk-zY


In [52]:
# Define a Pydantic model for Medicamento
# This model represents information about a medication
# Note: the class is defined in Portuguese because its docstrings and field descriptions are passed to the OpenAI model, which helps prime the model to work with Portuguese texts.
class Medicamento(BaseModel):
    """
    Informação sobre um medicamento.
    """
    nome: Optional[str] = Field(
        default_factory=str,
        description="Nome comercial ou genérico do medicamento."
    )
    principio_ativo: Optional[str] = Field(
        default_factory=str,
        description="Princípio ativo do medicamento."
    )
    dosagem: Optional[str] = Field(
        default_factory=str,
        description="Dosagem do medicamento."
    )

# Define a Pydantic model for a list of Medicamentos
# This model represents a list of medications
class ListaMedicamentos(BaseModel):
    """
    Lista de medicamentos.
    """
    medicamentos: Optional[List[Medicamento]] = Field(
        default_factory=list,
        description="Lista de medicamentos."
    )

# Create a chat prompt template for the model
# The system message primes the model to extract named entities related to medications
# The human message is a placeholder for the text input
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Você é um algoritmo perfeito para extração de entidades nomeadas sobre medicamentos. "
            "Você deve extrair informações relevantes do texto, exatamente como está escrito no texto. "
            "Se você não souber o valor de um atributo solicitado para extrair, retorne nulo para o valor do atributo.",
        ),
        ("human", "{text}"),
    ]
)

# Initialize the ChatOpenAI model
# - model: Specify the version of the model to use
# - temperature: Controls the randomness of the output (0.0 gives the most deterministic results)

model_openai = ChatOpenAI(
    model='gpt-4o-mini',
    temperature=0.0
)

# Invoke the model with a test message to ensure it's working
model_openai.invoke('Oi!')


AIMessage(content='Oi! Como posso ajudar você hoje?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 8, 'prompt_tokens': 9, 'total_tokens': 17, 'completion_tokens_details': {'reasoning_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_1bb46167f9', 'finish_reason': 'stop', 'logprobs': None}, id='run-27aa44c8-dff1-4a03-8687-b24805a78b5a-0', usage_metadata={'input_tokens': 9, 'output_tokens': 8, 'total_tokens': 17})

In [53]:
# Create an extractor by combining the prompt template with the OpenAI model
# The extractor is configured to output structured data in the form of ListaMedicamentos
# include_raw=False means only the parsed ListaMedicamentos object is returned, without the raw model response
# with_retry is used to handle retries in case of failures, with a maximum of 3 attempts
# wait_exponential_jitter=True adds randomness to the wait time between retries to avoid collision
openai_ner = prompt | model_openai.with_structured_output(ListaMedicamentos, include_raw=False).with_retry(
    stop_after_attempt=3,  # Retry up to 3 times in case of failure
    wait_exponential_jitter=True  # Add randomness to the wait time between retries
)
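To illustrate the idea behind the retry settings above, here is a standard-library sketch of retrying with exponential backoff and jitter. It is not LangChain's implementation. The function `call_with_retry`, its parameters, and the specific delay formula are assumptions chosen for illustration, and the `sleep` and `rng` arguments are injectable only so the example runs instantly.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call func, retrying on any exception with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles on each attempt; the random part spreads out retries
            # so that many clients do not hit the API at the same moment
            sleep(base_delay * 2 ** (attempt - 1) + rng())

    raise RuntimeError("unreachable")


# Demo: a call that fails twice and then succeeds
calls = {"count": 0}


def flaky_call() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays: List[float] = []
retry_result = call_with_retry(flaky_call, sleep=delays.append, rng=lambda: 0.0)
print(retry_result, delays)
```

With the jitter fixed at zero for the demo, the two waits are 1 and 2 seconds; in real use the random component would make each wait slightly longer and different across clients.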

Here's how the zero-shot extractor works on a given text:

In [54]:
result = openai_ner.invoke(text)
print(result)

medicamentos=[Medicamento(nome='REUQUINOL', principio_ativo='HIDROXICLOROQUINA', dosagem='400mg'), Medicamento(nome='PREGABALINA', principio_ativo=None, dosagem='75 mg')]
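The structured result contains entity strings but no character offsets, whereas a labelling function needs spans in the document. How `helpers.ner.LangChainAnnotator` handles this is not shown in this section; one plausible approach, sketched below under that assumption, is to search the source text for every occurrence of each extracted name. The function name `find_spans` and the shortened sample sentence are illustrative.

```python
import re
from typing import List, Tuple


def find_spans(text: str, names: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_text) for every occurrence of each name.

    Hypothetical helper: matching is case-insensitive and literal.
    """
    spans = set()
    for name in names:
        if not name:  # skip empty strings and None values returned by the model
            continue
        for match in re.finditer(re.escape(name), text, flags=re.IGNORECASE):
            spans.add((match.start(), match.end(), match.group(0)))
    return sorted(spans)


# Shortened, illustrative sentence based on the document used in this section
sample_text = (
    "fornecimento do medicamento REUQUINOL (HIDROXICLOROQUINA) 400mg e "
    "PREGABALINA 75 mg. O medicamento REUQUINOL é fornecido pelo SUS."
)
extracted_names = ["REUQUINOL", "PREGABALINA"]

spans = find_spans(sample_text, extracted_names)
for start, end, matched in spans:
    print(start, end, matched)
```

A literal search like this can also match inside longer words, so a production version would likely add boundary checks and decide how to treat names the model returned in a form that does not appear verbatim in the text.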


Now we can transform it into a labelling function that can be used with Skweak.

In [55]:
lf_langchain_openai = helpers.ner.LangChainAnnotator(
    annotator_name="lf_langchain_openai",
    langchain_runnable=openai_ner,
    pydantic_model=ListaMedicamentos,
    words_to_skip=["medicamento", "medicamentos"],
)

lf_langchain_openai(doc)
skweak.utils.display_entities(doc, "lf_langchain_openai")


#### 2.6 Using Regex Patterns for Labelling Functions

Regular expressions (regex) are a powerful tool for pattern matching in text data. By defining specific patterns that correspond to entities of interest, we can create labeling functions that automatically identify these entities in the text. This approach is particularly useful for Named Entity Recognition (NER) tasks where entities follow consistent patterns or formats.

##### Application to Drug Names

In pharmacological data, especially when handling drug names in Portuguese, regex is exceptionally practical. The typical format for drug names includes the salt and the substance name. For instance, in "Cloridrato de Propranolol," "Cloridrato" is the salt, and "Propranolol" is the substance. Recognizing such patterns is crucial for creating effective regex patterns.

##### Example Pattern: Drug Names in Portuguese

- **Generic Structure**: The pattern typically follows the format `salt + " de " + substance name`.
- **Example**: "Cloridrato de Propranolol"
- **Regex Pattern Design**: The pattern must:
    - Identify common salts (e.g., "Cloridrato", "Sulfato").
    - Recognize the connector "de".
    - Capture the subsequent substance name.
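The pattern design above can be sketched as follows. This is a minimal illustration, not the `SaltAnnotator` used later in the notebook: the function name `build_salt_pattern`, the three-item salt list, and the sample sentence are assumptions chosen to keep the example short.

```python
import re
from typing import List


def build_salt_pattern(salts: List[str]) -> re.Pattern:
    """Compile a pattern matching '<salt> de <substance>' for the given salts."""
    # Longest alternatives first, so a longer salt name is tried before a shorter one
    alternatives = "|".join(
        re.escape(salt) for salt in sorted(salts, key=len, reverse=True)
    )
    return re.compile(
        rf"\b(?P<salt>{alternatives})\s+de\s+(?P<substance>\w+)\b",
        flags=re.IGNORECASE,
    )


# Illustrative subset; a real list would be much longer
pattern = build_salt_pattern(["cloridrato", "sulfato", "brometo"])

sample = "Prescrito Cloridrato de Propranolol 40 mg e BROMETO DE TIOTRÓPIO."
matches = [
    (m.group(0), m.group("salt"), m.group("substance"))
    for m in pattern.finditer(sample)
]
for full, salt, substance in matches:
    print(f"{full!r}: salt={salt!r}, substance={substance!r}")
```

Named groups keep the salt and the substance separately accessible, and `re.IGNORECASE` covers the upper-case spellings that are common in legal documents. In Python 3, `\w` matches accented letters in string patterns, so names such as "TIOTRÓPIO" are captured whole.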

##### Crafting Effective Regex Patterns

To create robust labeling functions using regex:

1. **Start Broad, Refine Gradually**: Begin with a general pattern and iteratively refine it to improve precision.
2. **Account for Variations**: Consider different spellings, abbreviations, or formatting of the same entity.
3. **Use Capturing Groups**: Isolate specific parts of the match for further processing or validation.
4. **Incorporate Boundaries**: Use word boundaries (`\b`) to ensure you're matching whole words or phrases.

##### Challenges and Considerations

- **Coverage**: Ensure the regex pattern covers various common salts and their variations.
- **Edge Cases**: Be mindful of drug names that do not follow the standard format or include additional descriptors.
- **False Positives/Negatives**: Regularly validate the regex against a diverse dataset to minimize incorrect labeling.
- **Iterative Refinement**: Develop regex patterns iteratively, refining them based on practical results and edge cases encountered.
- **Testing and Validation**: Consistently test regex patterns against annotated datasets to ensure accuracy and reliability.

> **Note**: While regex is a powerful method for automating NER tasks, balance is essential. Capture as many relevant entities as possible without introducing excessive complexity that might lead to errors.

In [56]:
# Let's get some ideas for salt names

# Initialize an empty list to store potential salt names
salts = []

# Iterate over each drug name in the drugs list
for drug in drugs:
    # Split the drug name into words
    split_drug = drug.split()

    # Keep the first word when the name looks like "<salt> de <substance>":
    # more than one word, second word is 'de', first word ends with 'ato', 'ito', or 'eto'
    if (
        len(split_drug) > 1
        and split_drug[1] == 'de'
        and split_drug[0].endswith(('ato', 'ito', 'eto'))
    ):
        salts.append(split_drug[0])

# Remove duplicates from the salts list by converting it to a set and back to a list
salts = list(set(salts))

# Display the unique salt names
salts  # we'll use this list to create a regex pattern to extract drugs from the text. Check the next cell.

['nitrato',
 'amoxicilina+clavulanato',
 'oxalato',
 'sulfeto',
 'selenito',
 'isocaproato',
 'mononitrato',
 'bitartarato',
 'metotrexato',
 'hialuronato',
 'antimoniato',
 'maleato',
 'fusidato',
 'malato',
 'subacetato',
 'propilenoglicolato',
 'mesilato',
 'furoato',
 'butilbrometo',
 'gadopentetato',
 'fumarato',
 'bicarbonato',
 'sulfato',
 'palmitato',
 'lisinato',
 'cloridrato',
 'isetionato',
 'polissulfato',
 'glicerofosfato',
 'dinitrato',
 'besilato',
 'acetato',
 'extrato',
 'tosilato',
 'pidolato',
 'bissulfato',
 'brimonidina+maleato',
 'undecilato',
 'borato',
 'iodeto',
 'tartarato',
 'hemifumarato',
 'docusato',
 'carbonato',
 'alfaoxofenilpropionato',
 'hidroxinaftoato',
 'estolato',
 'hexamidina+cloridrato',
 'aspartato',
 'paracetamol+fosfato',
 'decanoato',
 'betacipionato',
 'metilbrometo',
 'hidroxibenzoato',
 'esilato',
 'etabonato',
 'estradiol+acetato',
 'pantotenato',
 'subgalato',
 'levodopa+cloridrato',
 'mucato',
 'racealfaoxobetametilbutanoato',
 'dimale

In [57]:
# Select the 9th document from the training set of spaCy documents
doc = spacy_docs_train[8]

# Print the text of the selected document for reference
print(doc.text)


24.No tocante à existência de outros medicamentos no mercado brasileiro com efeitos farmacológicos idênticos ou similares ao SPIRIVA RESPIMAT, o perito informou que “O medicamento em questão faz parte de um grupo farmacológico de medicamentos denominados anticolinérgicos; neste grupo temos os de curta duração, resentado no Brasil pelo BROMETO DE IPATRÓPIO que tem indicação para sintomas eventuais da DPOC e os de longa duração representados pelo BROMETO DE TIOTRÓPIO que segundo o consenso brasileiro de DPOC reduz o número de exacerbações e hospitalizações e melhora a qualidade de vida relacionada ao estado de saúde, comparado com placebo e ipratrópio”. 
25.Conquanto tal informação conflite, em princípio, com aquela constante na Cartilha de Apoio Médico e Científico ao Judiciário, disponibilizada no site da Federação das Unimeds do Estado de São Paulo (URL da qual se extrai que “Existe evidência de efetividade, mas não de superioridade da droga brometo de tiotrópio no controle dos pacien

In [58]:
# Initialize the SaltAnnotator from the helpers.ner module
lf_salt = helpers.ner.SaltAnnotator()

# Apply the SaltAnnotator to the selected document
lf_salt(doc)

# Display the entities annotated by the SaltAnnotator using skweak's display_entities function
skweak.utils.display_entities(doc, "lf_salt")

#### 2.7 Apply Labeling Functions

Labelling functions are fundamental to skweak. These functions can be created using various methods. Essentially, labelling functions operate by taking a Doc object as input and outputting a list of spans at the token level, each accompanied by a label.

In the context of sequence labelling, the spans represent the entities that need to be identified. For tasks involving text classification, such as sentiment analysis, the span pertains to the entire text segment that requires classification, which could be a single sentence or the entire document.

##### Applying a Single Labeling Function

To apply an individual labeling function to a document:

- Use the command: `name_of_lf_object(doc)`
- This operation modifies the document by adding new annotations from the labeling function.
- To verify that your labeling function has worked correctly, inspect `doc.spans["name_of_your_labeling_function"]`. This attribute should contain the detected spans along with their labels, accessible via the attribute `label_`.
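
The input/output contract described above can be sketched without skweak or spaCy. The example assumes a document is simply a list of token strings (a stand-in for a `Doc`) and uses a hypothetical two-entry gazetteer; it only illustrates the shape of what a labeling function yields.

```python
# Hypothetical gazetteer; the notebook builds its own from a drug list.
DRUGS = {"escitalopram", "alprazolam"}

def drug_labeling_function(tokens):
    """Yield (start, end, label) for each token found in DRUGS.

    `start` and `end` are token indices, with `end` exclusive.
    """
    for i, token in enumerate(tokens):
        if token.lower() in DRUGS:
            yield i, i + 1, "MEDICAMENTO"

tokens = "O medicamento Escitalopram foi prescrito".split()
spans = list(drug_labeling_function(tokens))
```

In skweak the resulting spans are stored on the document under the labeling function's name rather than returned as a list, which is why the verification step inspects `doc.spans`.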

##### Combining Multiple Labeling Functions

If you have multiple labeling functions, it can be cumbersome to manage and apply each one individually. To streamline this process, use the `CombinedAnnotator` from the `base` module.

##### Steps to Combine Labeling Functions:

1. **Instantiate CombinedAnnotator:**
    - Create an instance of `CombinedAnnotator`.

2. **Add Each Labeling Function:**
    - Use the `add_annotator` method to include all your labeling functions.
    - Example:

    ```python
    from skweak.base import CombinedAnnotator

    combined = CombinedAnnotator()
    combined.add_annotator(lf_object1)
    combined.add_annotator(lf_object2)
    ```

##### Applying Combined Labeling Functions

Once you have created a combined annotator:

1. For a single document, apply the combined annotator as you would an individual function.
2. For multiple documents, use the `pipe` method for efficient processing:
    - Example: `docs = list(combined.pipe(docs))`
        - **Lazy Evaluation:** The `pipe` method processes documents in a lazy manner, computing results only when needed, saving time and memory.
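
The lazy behaviour can be illustrated with a plain Python generator. This is a sketch of the general idea, not skweak's `pipe` implementation; the `pipe` and `tag_length` functions and the dictionary-based documents are hypothetical stand-ins.

```python
def pipe(docs, annotators):
    """Lazily apply every annotator to every document (a generator)."""
    for doc in docs:
        for annotate in annotators:
            doc = annotate(doc)
        yield doc

calls = []

def tag_length(doc):
    """Toy annotator: record the call and attach the text length."""
    calls.append(doc["text"])
    return {**doc, "length": len(doc["text"])}

docs = [{"text": "abc"}, {"text": "de"}]
stream = pipe(docs, [tag_length])   # nothing has been computed yet
calls_before = list(calls)          # still empty at this point
annotated = list(stream)            # consuming the generator does the work
```

Wrapping the call in `list(...)`, as in the example above, forces every document to be processed immediately, so the memory benefit of laziness only applies when the results are consumed one at a time.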


##### Important Considerations

- **Efficiency:** Using combined annotators and the `pipe` method can significantly speed up processing for large document collections.
- **Flexibility:** You can easily add or remove labeling functions from the combined annotator as needed.
- **Verification:** Always check the output of your labeling functions to ensure they are working as expected.

> **Note**: After applying labeling functions, consider using the `DocBin` format for efficient storage and retrieval of large batches of annotated documents (refer to the previous section for details).

In [None]:
# Create a CombinedAnnotator instance
# The CombinedAnnotator allows us to combine multiple weak supervision sources (annotators)
# Each annotator will provide its own annotations, which will be combined to create a final set of annotations
combined = skweak.base.CombinedAnnotator()

# Add various annotators to the CombinedAnnotator
# Each annotator is responsible for a different type of annotation or uses a different method to generate annotations

# Add a gazetteer-based annotator for drug names
combined.add_annotator(lf_drugs_gazetteer)

# Add a transformer-based annotator (e.g., BERT, RoBERTa)
combined.add_annotator(lf_transformer_1)

# Add a GLiNER-based annotator using the LLaMA model
combined.add_annotator(lf_gliner_llama)

# Add a GLiNER-based annotator using the Qwen model
combined.add_annotator(lf_gliner_qwen)

# Add a GLiNER-based annotator (the "bi_large" variant defined earlier)
combined.add_annotator(lf_gliner_bi_large)

# Add a NuNERZero-based annotator
combined.add_annotator(lf_nuner_zero)

# Add a LangChain-based annotator using OpenAI's models
combined.add_annotator(lf_langchain_openai)

# Add a regex-based annotator for salt names
combined.add_annotator(lf_salt)

# Apply the combined annotators to the training documents
# The combined.pipe method processes the documents in batches, applying all the annotators
# This step generates the combined annotations for each document in the training dataset
spacy_docs_train = list(combined.pipe(spacy_docs_train))

# Note: The running time for this process is approximately 1 hour and 40 minutes
# The time required may vary depending on the size of the dataset and the complexity of the annotators

In [15]:
# We can save the weakly annotated spacy_docs to disk to avoid running the labelling functions again
skweak.utils.docbin_writer(spacy_docs_train, "data/bin/ner/spacy_docs_train_annotated.bin")

Write to data/bin/ner/spacy_docs_train_annotated.bin...done


In [60]:
import skweak

# Now we can load them back
spacy_docs_train = skweak.utils.docbin_reader("data/bin/ner/spacy_docs_train_annotated.bin", spacy_model_name="pt_core_news_lg")
spacy_docs_train = list(spacy_docs_train)

#### 2.8 Document-Level Labelling Functions

Skweak provides a mechanism for creating document-level labelling functions that capitalize on the global context of a document. One significant feature is the ability to exploit label consistency within a document: entities appearing multiple times within the same document are likely to belong to the same category. For example, ["Frontal"](https://www.saudedireta.com.br/catinc/drugs/bulas/frontal.pdf) could refer either to a brand name of the drug Alprazolam or to a region at the front of something, but within a single document it is unlikely to refer to both.

##### DocumentMajorityAnnotator

The **DocumentMajorityAnnotator** is a labelling function designed to use this label consistency. Here’s how it works:

1. **Initial Predictions**: The process begins by using predictions from another labelling function. This could be any function that assigns preliminary labels to entities within the document.
2. **Frequency Computation**: For each unique entity string in the document, the frequency of each assigned label is calculated.
3. **Majority Label Selection**: The most common label for each entity string is selected.
4. **Label Assignment**: This majority label is then consistently assigned to every occurrence of the entity throughout the document.
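
The four steps above can be sketched in plain Python. This is an illustration of the idea rather than skweak's `DocumentMajorityAnnotator`; the function name and the `LOCAL` label are hypothetical.

```python
from collections import Counter, defaultdict

def document_majority(mentions, case_sensitive=False):
    """mentions: (entity_text, label) pairs predicted within ONE document.

    Returns a dict mapping each entity string to its most frequent label.
    """
    counts = defaultdict(Counter)
    for text, label in mentions:
        key = text if case_sensitive else text.lower()
        counts[key][label] += 1  # step 2: label frequency per entity string
    # step 3: keep the most common label for each entity string
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

# Step 1: initial predictions from some other labelling function
mentions = [("Frontal", "MEDICAMENTO"), ("frontal", "MEDICAMENTO"), ("Frontal", "LOCAL")]
majority = document_majority(mentions)

# Step 4: assign the majority label to every occurrence
relabelled = [(text, majority[text.lower()]) for text, _ in mentions]
```

With `case_sensitive=False`, "Frontal" and "frontal" are counted as the same entity string, so the single dissenting `LOCAL` prediction is overruled.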

##### Aggregating Labelling Functions

To accurately count the label frequencies, it is essential to aggregate predictions from multiple labelling functions. There are two main approaches to achieve this:

1. **MajorityVoter**:
    - This is the simplest and quickest method.
    - It aggregates the predictions from all available labelling functions by taking a majority vote.
    - This approach ensures that the most frequently predicted label across all functions is selected.

2. **Generative Model**:
    - This method is more complex and involves fitting a full generative model.
    - The model is first fitted without the document-level functions to establish baseline predictions.
    - It is then refitted with the document-level functions included, allowing for more refined and accurate predictions.
    - Although this approach is more computationally intensive, it can provide more nuanced results by considering the dependencies between different labelling functions.

In [61]:
# Create a MajorityVoter instance for aggregating annotations
# The MajorityVoter combines annotations from multiple annotators using a majority voting scheme
# "doclevel_voter" is the name of the voter
# ["MEDICAMENTO"] specifies the entity types to consider for voting (in this case, "MEDICAMENTO")
# initial_weights={"doc_majority":0.0} sets the initial weight for the "doc_majority" annotator to 0.0
# This means we do not want to include the "doc_majority" annotator itself in the vote
majority_voter = skweak.aggregation.MajorityVoter(
    "doclevel_voter", 
    ["MEDICAMENTO"], 
    initial_weights={"doc_majority": 0.0}
)

# Apply the MajorityVoter to the training documents
# The majority_voter.pipe method applies the majority voting scheme to each document
# The result is stored in doc.spans["doclevel_voter"] and serves as input to the document-level annotator
spacy_docs_train = list(majority_voter.pipe(spacy_docs_train))

In [62]:
# Create a DocumentMajorityAnnotator instance
# The DocumentMajorityAnnotator assigns each entity string the label it most often receives within the same document
# "doc_majority" is the name of the annotator
# "doclevel_voter" is the name of the span group (produced by the MajorityVoter) whose predictions are counted
# case_sensitive=False indicates that entity strings are matched case-insensitively
doc_majority = skweak.doclevel.DocumentMajorityAnnotator("doc_majority", "doclevel_voter", case_sensitive=False)

# Apply the DocumentMajorityAnnotator to the training documents
# The doc_majority.pipe method applies the document-level majority labels to each document
# The result is stored in doc.spans["doc_majority"] and acts as one more labelling function in Step 3
spacy_docs_train = list(doc_majority.pipe(spacy_docs_train))


### Step 3 - Generating Aggregated Labels

Having generated annotations from multiple labeling functions, we now need to combine these potentially noisy labels into a final, high-quality dataset for our NER model. This process of consolidating annotations from various sources is known as **label aggregation**. Skweak provides two primary methods for achieving this: generative models and majority voting.

#### 3.1 Aggregating Labels with Generative Models

Generative models offer a powerful approach to label aggregation by learning the underlying structure and dependencies between labels. In the context of NER, a **Hidden Markov Model (HMM)** is a particularly well-suited generative model. Let's break down how it works:

1. **Sequence Representation**: The HMM treats the text as a sequence of tokens, with each token assigned a corresponding label.

2. **States and Transitions**: The model assumes that hidden "states" represent the true underlying labels and attempts to infer these states from the observed, noisy labels assigned by our labeling functions. Transitions between states are governed by probabilities, reflecting the likelihood of moving from one label to another.

3. **Learning Model Parameters**: The HMM estimates two crucial sets of parameters:
    - **Emission Probabilities**: These represent the probability of observing a particular noisy label given a specific hidden state (true label).
    - **Transition Probabilities**: These represent the probability of transitioning between hidden states.

4. **Baum-Welch Algorithm**: This is an Expectation-Maximization (EM) algorithm variant used to estimate the emission and transition probabilities. It applies the forward-backward algorithm to compute the necessary statistics.

**Process:**
1. **Application of Labeling Functions**: Apply multiple labeling functions to the unlabeled data, resulting in a set of noisy labels.
2. **Parameter Estimation**: Use the Baum-Welch algorithm to estimate the emission and transition probabilities for the HMM.
3. **Label Aggregation**: The HMM combines the noisy labels to generate the most likely sequence of labels, smoothing out inconsistencies.
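
The aggregation step can be sketched as Viterbi decoding with fixed parameters. This is a simplified illustration, not skweak's implementation: parameter estimation with Baum-Welch is omitted, the transition values are the rounded ones from the fitted model printed later in this notebook, the emission values are invented, and the two labeling functions are assumed to be conditionally independent given the true state.

```python
import math

STATES = ["O", "B-MEDICAMENTO", "I-MEDICAMENTO"]
START = {"O": 1.0, "B-MEDICAMENTO": 0.0, "I-MEDICAMENTO": 0.0}
TRANS = {  # P(next state | current state), rounded values
    "O":             {"O": 0.99, "B-MEDICAMENTO": 0.01, "I-MEDICAMENTO": 0.00},
    "B-MEDICAMENTO": {"O": 0.70, "B-MEDICAMENTO": 0.05, "I-MEDICAMENTO": 0.25},
    "I-MEDICAMENTO": {"O": 0.66, "B-MEDICAMENTO": 0.00, "I-MEDICAMENTO": 0.34},
}
EMIT = {  # made-up P(observed label | true state), shared by both functions
    "O":             {"O": 0.95, "B-MEDICAMENTO": 0.04, "I-MEDICAMENTO": 0.01},
    "B-MEDICAMENTO": {"O": 0.30, "B-MEDICAMENTO": 0.69, "I-MEDICAMENTO": 0.01},
    "I-MEDICAMENTO": {"O": 0.40, "B-MEDICAMENTO": 0.05, "I-MEDICAMENTO": 0.55},
}

def log(p):
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(observations):
    """observations: one tuple of labeling-function outputs per token."""
    def emit(state, obs):
        return sum(log(EMIT[state][o]) for o in obs)

    scores = [{s: log(START[s]) + emit(s, observations[0]) for s in STATES}]
    back = []
    for obs in observations[1:]:
        prev, row, ptr = scores[-1], {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + log(TRANS[p][s]))
            row[s] = prev[best] + log(TRANS[best][s]) + emit(s, obs)
            ptr[s] = best
        scores.append(row)
        back.append(ptr)
    # Follow the back-pointers from the best final state
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# The two labeling functions disagree on the third token (I vs. O)
observations = [("O", "O"), ("B-MEDICAMENTO", "B-MEDICAMENTO"),
                ("I-MEDICAMENTO", "O"), ("O", "O")]
decoded = viterbi(observations)
```

With these numbers the disagreement on the third token is resolved in favour of `I-MEDICAMENTO`, because continuing an entity after a confident `B-MEDICAMENTO` is more likely under the transition model than closing it immediately.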

**Advantages:**
- **Context-Aware**: They consider the relationships between adjacent labels, leading to more coherent annotations.
- **Noise Reduction**: By leveraging patterns in the data, they can effectively filter out noise from individual labeling functions.
- **Uncertainty Quantification**: These models provide posterior probabilities for each label, offering insights into the model's confidence.

> **Note**: While the final annotations in `doc.spans["name_of_aggregator"]` show only the most likely labels, the full posterior probabilities can be accessed via `doc.spans["name_of_aggregator"].attrs['probs']`.
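
One possible use of these probabilities is to flag low-confidence tokens for review. The sketch below assumes the mapping has the shape `{token_index: {label: probability}}`; the values are illustrative and the helper `low_confidence_tokens` is hypothetical, not part of skweak.

```python
def low_confidence_tokens(probs, threshold=0.9):
    """Return {token_index: (best_label, prob)} where the best label is below threshold."""
    flagged = {}
    for index, label_probs in probs.items():
        label, prob = max(label_probs.items(), key=lambda item: item[1])
        if prob < threshold:
            flagged[index] = (label, prob)
    return flagged

# Illustrative values in the assumed shape
probs = {
    94: {"B-MEDICAMENTO": 0.75},
    97: {"B-MEDICAMENTO": 0.99},
    266: {"B-MEDICAMENTO": 0.47, "I-MEDICAMENTO": 0.47},
}
flagged = low_confidence_tokens(probs)
```

Tokens flagged this way are natural candidates for manual inspection when refining labeling functions.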

#### 3.2 Aggregating Labels with Majority Vote

For situations where simplicity and computational efficiency are priorities, Skweak also offers the **Majority Vote** method.

**How Majority Vote Works:**
1. For each token, count the labels assigned by all labeling functions.
2. Select the label with the highest count as the final annotation.
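
A plain, unweighted version of these two steps can be sketched as follows. It is not skweak's voter, which also supports per-function weights; the `majority_vote` function and the sample votes are illustrative.

```python
from collections import Counter

def majority_vote(votes_per_token):
    """votes_per_token: one list of labels per token, one label per labeling function."""
    # Counter.most_common orders equal counts by first appearance,
    # so ties go to the label that appears first among the votes.
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_token]

votes = [
    ["O", "O", "O"],
    ["B-MEDICAMENTO", "B-MEDICAMENTO", "O"],
    ["I-MEDICAMENTO", "O", "O"],
]
labels = majority_vote(votes)
```

Each token is decided in isolation here, so nothing prevents an `I-` tag from following an `O` tag when the votes happen to fall that way.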

**Advantages:**
- **Simplicity**: Easy to implement and understand.
- **Efficiency**: Computationally less demanding than generative models.
- **Transparency**: The decision process is straightforward to interpret.

**Limitations:**
- **Lack of Context**: Doesn't consider label dependencies or sequence information.
- **Susceptibility to Noise**: Can be heavily influenced by low-quality labeling functions if they outnumber high-quality ones.

#### Choosing the Right Aggregation Method

The choice between generative models and Majority Vote depends on your specific NER task and resources:

1. **Data Complexity**: For tasks with complex label dependencies, generative models are often superior.
2. **Computational Resources**: If processing time or computational power is limited, Majority Vote might be preferable.
3. **Labeling Function Quality**: With highly reliable labeling functions, Majority Vote can perform well. For noisier functions, generative models are more robust.
4. **Interpretability Needs**: If understanding the decision process is crucial, Majority Vote offers more transparency.

> **Best Practice**: When possible, experiment with both methods and compare their performance on a validation set to determine the most effective approach for your specific NER task.
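
Such a comparison could be sketched with exact-match span metrics. The spans below are hypothetical `(start, end, label)` triples for a single validation document, and `span_prf` is an illustrative helper rather than a function from the notebook.

```python
def span_prf(predicted, gold):
    """Exact-match precision, recall and F1 over sets of (start, end, label) spans."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical spans for one validation document
gold = [(10, 22, "MEDICAMENTO"), (40, 46, "MEDICAMENTO")]
hmm_spans = [(10, 22, "MEDICAMENTO")]
maj_spans = [(10, 22, "MEDICAMENTO"), (40, 46, "MEDICAMENTO"), (0, 11, "MEDICAMENTO")]

hmm_scores = span_prf(hmm_spans, gold)
maj_scores = span_prf(maj_spans, gold)
```

In this toy case the first aggregator is more precise and the second has higher recall, which is the kind of trade-off a validation set makes visible.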

In [63]:
# Create an instance of the Hidden Markov Model (HMM) for weak supervision
# The HMM is used to combine multiple weak labels into a single probabilistic label
# "hmm" is the name of the Hidden Markov Model
# labels=["MEDICAMENTO"] specifies the entity types that the HMM will consider (in this case, "MEDICAMENTO")
hmm = skweak.generative.HMM("hmm", labels=["MEDICAMENTO"])

# Fit the HMM model to the training documents
# The fit method trains the HMM model using the weak labels from the training dataset
# This step involves learning the transition and emission probabilities from the weak labels
# The spacy_docs_train contains the training documents with weak labels generated by the annotators
hmm.fit(spacy_docs_train)

Starting iteration 1
Finished E-step with 826 documents
Starting iteration 2


         1  -32201.73755650             +nan


Finished E-step with 826 documents
Starting iteration 3


         2  -31813.62040693    +388.11714957


Finished E-step with 826 documents
Starting iteration 4


         3  -31771.67621353     +41.94419340


Finished E-step with 826 documents


         4  -31759.32330449     +12.35290903


In [64]:
# Print the learned parameters of the Hidden Markov Model (HMM)
# The pretty_print method displays the transition and emission probabilities in a readable format
# This helps us understand how the HMM has combined the weak labels into a single probabilistic label
# The output will show the probabilities of transitioning between different states (labels)
# and, for each labelling function, the probabilities of emitting each observed label given a state
hmm.pretty_print()

HMM model with following parameters:
Output labels: ['O', 'B-MEDICAMENTO', 'I-MEDICAMENTO']
--------
Start distribution:
O                1.0
B-MEDICAMENTO    0.0
I-MEDICAMENTO    0.0
dtype: float64
--------
Transition model:
                  O  B-MEDICAMENTO  I-MEDICAMENTO
O              0.99           0.01           0.00
B-MEDICAMENTO  0.70           0.05           0.25
I-MEDICAMENTO  0.66           0.00           0.34
--------
Labelling functions in model: ['lf_gliner_bi_large', 'lf_salt', 'lf_gliner_qwen', 'lf_langchain_openai', 'lf_transformer_1', 'doc_majority', 'drugs_gazetteer', 'lf_gliner_llama', 'lf_nuner_zero']
Emission model for: doc_majority
                  O  B-MEDICAMENTO  I-MEDICAMENTO
O              1.00           0.00           0.00
B-MEDICAMENTO  0.29           0.71           0.00
I-MEDICAMENTO  0.43           0.05           0.52
weights        1.00           1.00           1.00
--------
Emission model for: drugs_gazetteer
                  O  B-MEDICAMENTO  I-MED

In [65]:
# Create an instance of the SequentialMajorityVoter for weak supervision
# The SequentialMajorityVoter combines annotations from multiple annotators using a majority voting scheme
# "maj_voter" is the name of the voter
# labels=["MEDICAMENTO"] specifies the entity types to consider for voting (in this case, "MEDICAMENTO")
# This voter will sequentially process the annotations and assign the most common label to each token
maj_voter = skweak.voting.SequentialMajorityVoter("maj_voter", labels=["MEDICAMENTO"])

In [66]:
# Apply the Hidden Markov Model (HMM) to the training documents
# The hmm.pipe method processes the documents in batches, applying the HMM to generate probabilistic labels
# This step refines the weak labels by combining them into a single probabilistic label for each token
# The output is a list of spaCy documents with updated annotations based on the HMM
spacy_docs_train = list(hmm.pipe(spacy_docs_train))

# Apply the SequentialMajorityVoter to the training documents
# The maj_voter.pipe method applies the majority voting scheme to each document
# This is an alternative aggregation of the same labelling-function outputs, stored under doc.spans["maj_voter"]
# It lets us compare the majority-vote labels with the HMM labels stored under doc.spans["hmm"]
spacy_docs_train = list(maj_voter.pipe(spacy_docs_train))

In [67]:

spacy_docs_train[0].spans['hmm'].attrs

{'probs': {47: {'B-MEDICAMENTO': 0.999999999993701},
  48: {'I-MEDICAMENTO': 0.9999999999693117},
  49: {'I-MEDICAMENTO': 0.9999942257605258},
  50: {'B-MEDICAMENTO': 0.9999999994686206},
  51: {'I-MEDICAMENTO': 0.9999957757910435}},
 'aggregated': True,
 'sources': ['drugs_gazetteer',
  'lf_transformer_1',
  'lf_gliner_llama',
  'lf_gliner_qwen',
  'lf_nuner_zero',
  'lf_langchain_openai']}

In [68]:

spacy_docs_train[2].spans['hmm'].attrs

{'probs': {97: {'B-MEDICAMENTO': 0.9999998236204718},
  265: {'B-MEDICAMENTO': 0.9999999637318389},
  266: {'I-MEDICAMENTO': 0.9997333780839212},
  268: {'B-MEDICAMENTO': 0.9999998236204718}},
 'aggregated': True,
 'sources': ['drugs_gazetteer',
  'lf_transformer_1',
  'lf_gliner_llama',
  'lf_nuner_zero',
  'doc_majority']}

In [69]:
spacy_docs_train[2].spans['maj_voter'].attrs

{'probs': {94: {'B-MEDICAMENTO': 0.75},
  97: {'B-MEDICAMENTO': 0.98780483},
  164: {'B-MEDICAMENTO': 0.75},
  265: {'B-MEDICAMENTO': 0.9473684},
  266: {'B-MEDICAMENTO': 0.4736842, 'I-MEDICAMENTO': 0.4736842},
  268: {'B-MEDICAMENTO': 0.98780483},
  303: {'B-MEDICAMENTO': 0.75}},
 'aggregated': True,
 'sources': ['drugs_gazetteer',
  'lf_transformer_1',
  'lf_gliner_llama',
  'lf_nuner_zero',
  'doc_majority']}

In [70]:
print(spacy_docs_train[2].spans)

{'drugs_gazetteer': [escitalopram, Espran, escitalopram], 'lf_transformer_1': [escitalopram, escitalopram, medicamento Espran], 'lf_gliner_llama': [escitalopram, escitalopram, Espran], 'lf_gliner_qwen': [], 'lf_gliner_bi_large': [], 'lf_nuner_zero': [escitalopram, escitalopram, Espran], 'lf_langchain_openai': [], 'lf_salt': [], 'doclevel_voter': [escitalopram, medicamento, Espran, escitalopram], 'doc_majority': [medicamento, escitalopram, medicamento, medicamento, escitalopram, medicamento], 'hmm': [escitalopram, medicamento Espran, escitalopram], 'maj_voter': [medicamento, escitalopram, medicamento, medicamento, Espran, escitalopram, medicamento]}


In [71]:
skweak.utils.display_entities(spacy_docs_train[0], "hmm")

In [72]:
skweak.utils.display_entities(spacy_docs_train[0], "maj_voter")

In [73]:
skweak.utils.display_entities(spacy_docs_train[2], "hmm")

In [74]:
skweak.utils.display_entities(spacy_docs_train[2], "maj_voter")

In [75]:
skweak.utils.get_spans_with_probs(spacy_docs_train[2], "maj_voter")

[(medicamento, 0.75),
 (escitalopram, 0.9878048300743103),
 (medicamento, 0.75),
 (medicamento, 0.9473683834075928),
 (Espran, 0.9473683834075928),
 (escitalopram, 0.9878048300743103),
 (medicamento, 0.75)]

In [76]:
skweak.utils.get_spans_with_probs(spacy_docs_train[2], "hmm")

[(escitalopram, 0.9999998236204718),
 (medicamento Espran, 0.9998666709078801),
 (escitalopram, 0.9999998236204718)]

In [77]:
from transformers import AutoTokenizer  # Import the AutoTokenizer from the transformers library
from tqdm.auto import tqdm  # Import tqdm for progress bars
import helpers.ner  # Import custom helper functions for Named Entity Recognition (NER)

# Load the tokenizer for the BERT model
# The tokenizer will handle tokenization and padding/truncation of input sequences
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")

# List of entity names to remove from the extracted entities
# These are common terms that we do not want to include in our NER annotations
entities_to_remove = ['medicamento', 'medicamentos', 'medicação', 'fármaco', 'fármacos', 'droga', 'drogas']

# Initialize an empty list to store the training labels generated by the majority voter
train_labels_maj_voter = []

# Iterate over each document in the training dataset
# tqdm is used to display a progress bar
for doc in tqdm(spacy_docs_train):
    text = doc.text  # Extract the text from the spaCy document
    # Extract entities using the majority voter annotations
    # The entities_to_remove list is used to filter out unwanted entities
    entities = helpers.ner.extract_entities_in_gliner_format(doc, "maj_voter", entities_to_remove)
    # Convert the extracted entities to IOB format
    # IOB format is commonly used for NER tasks and stands for Inside-Outside-Beginning
    iob_format = helpers.ner.convert_to_IOB(
        entity_spans=[(ent['start'], ent['end'], ent['text'], ent['label']) for ent in entities],
        input_text=text,
        tokenizer=tokenizer
    )
    # Append the text, entities, and IOB format annotations to the list
    train_labels_maj_voter.append({
        'text': text,
        'entities': entities,
        'iob': iob_format
    })


  0%|          | 0/826 [00:00<?, ?it/s]

In [78]:

# Initialize an empty list to store the training labels generated by the HMM
train_labels_hmm = []

# Iterate over each document in the training dataset
# tqdm is used to display a progress bar
for doc in tqdm(spacy_docs_train):
    text = doc.text  # Extract the text from the spaCy document
    # Extract entities using the HMM annotations
    # The entities_to_remove list is used to filter out unwanted entities
    entities = helpers.ner.extract_entities_in_gliner_format(doc, "hmm", entities_to_remove)
    # Convert the extracted entities to IOB format
    # IOB format is commonly used for NER tasks and stands for Inside-Outside-Beginning
    iob_format = helpers.ner.convert_to_IOB(
        entity_spans=[(ent['start'], ent['end'], ent['text'], ent['label']) for ent in entities],
        input_text=text,
        tokenizer=tokenizer
    )
    # Append the text, entities, and IOB format annotations to the list
    train_labels_hmm.append({
        'text': text,
        'entities': entities,
        'iob': iob_format
    })

  0%|          | 0/826 [00:00<?, ?it/s]
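
The conversion from character spans to IOB tags is handled by `helpers.ner.convert_to_IOB`, whose code is not shown here. A minimal sketch of the idea, assuming whitespace tokens instead of the BERT tokenizer and a hypothetical function name, could look like this:

```python
import re

def spans_to_iob(text, entity_spans):
    """entity_spans: list of (start_char, end_char, label); returns [(token, tag), ...]."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in entity_spans:
            if match.start() >= start and match.end() <= end:
                # The first token of the entity gets B-, the rest get I-
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((match.group(), tag))
    return tagged

text = "Uso de brometo de tiotrópio diário"
start = text.index("brometo")
iob = spans_to_iob(text, [(start, start + len("brometo de tiotrópio"), "MEDICAMENTO")])
```

Whitespace tokens keep trailing punctuation attached, so a token such as "tiotrópio." would fall outside the span; a real tokenizer avoids this.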

In [79]:
# Define a function to render an example with named entities
# This function uses a helper function to visualize the named entities in the text
def render_example(example):
    # Call the helper function to render the entity data
    # text: The input text containing the named entities
    # pipeline_results: The list of entities extracted from the text
    # label_key_name: The key in the entity dictionary that contains the label (e.g., 'MEDICAMENTO')
    # colors: A dictionary specifying the colors to use for different entity labels
    helpers.ner.render_entity_data_from_pipeline(
        text=example['text'],  # The input text to be rendered
        pipeline_results=example['entities'],  # The entities extracted from the text
        label_key_name='label',  # The key in the entity dictionary that contains the label
        colors={'MEDICAMENTO': 'lightgreen'}  # The color to use for the 'MEDICAMENTO' label
    )

In [80]:
render_example(train_labels_maj_voter[2])

In [81]:
render_example(train_labels_hmm[2])

In [82]:
render_example(train_labels_maj_voter[22])


In [83]:
render_example(train_labels_hmm[22])

### Step 4 - Training a Named Entity Recognition (NER) Model Using Weak Labels

With the aggregated weak labels in place, we can now proceed to train a Named Entity Recognition (NER) model targeted at identifying drug entities in legal documents. This involves several key steps: selecting an appropriate model architecture, preparing the training data, and fine-tuning our model.

#### Selecting the Model Architecture

For this task, we will use a **BERT-based model**. **BERT (Bidirectional Encoder Representations from Transformers)** performs strongly on NER tasks because it captures bidirectional context in text, a capability that is crucial for sequence labeling.

#### Transfer Learning and Fine-tuning

Instead of training a BERT model from scratch, we will use **transfer learning**. This involves leveraging a pre-trained BERT model trained on a large corpus of Portuguese text. This model will be fine-tuned on our specific task of drug entity recognition. Here are the steps involved:

1. **Loading the Pre-trained Model**: Initialize the model with weights learned from a vast Portuguese text corpus.
2. **Adapting to the Task**: Modify the model's output layer to predict drug entity labels.
3. **Training on the Data**: Fine-tune the model using our weakly labeled datasets. This involves adjusting the parameters of the pre-trained model specifically for our NER task.
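
Adapting the output layer mostly comes down to defining the label set and aligning word-level tags with subword tokens. The sketch below shows one common convention, assuming three IOB labels and a hypothetical `word_ids` list of the kind a subword tokenizer provides; `-100` is the default ignore index of PyTorch's cross-entropy loss.

```python
LABELS = ["O", "B-MEDICAMENTO", "I-MEDICAMENTO"]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def align_labels(word_tags, word_ids):
    """Map word-level tags onto subword tokens.

    word_ids: for each subword token, the index of the word it came from,
    or None for special tokens such as [CLS] and [SEP].
    Only the first subword of each word keeps the label.
    """
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(label2id[word_tags[word_id]])
        previous = word_id
    return aligned

word_tags = ["O", "B-MEDICAMENTO", "I-MEDICAMENTO"]
# Hypothetical tokenization: [CLS] w0 w1 w1 w2 [SEP]
word_ids = [None, 0, 1, 1, 2, None]
aligned = align_labels(word_tags, word_ids)
```

The classification head then needs one output unit per entry in `LABELS`, and `id2label` lets predictions be decoded back into tags.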

#### Preparing the Training Data

We will use three different versions of the annotated training data to fine-tune the BERT model:

1. **HMM Dataset**:
    - Contains labels aggregated using the Hidden Markov Model (HMM) method.
    - Provides a probabilistic approach to label aggregation.

2. **Majority Vote Dataset**:
    - Uses the weak labels determined by the most common annotation among different labeling functions.
    - A simple yet effective method for consensus labeling.

3. **True Labels Dataset**:
    - Includes manually annotated true labels, serving as the gold standard.
    - This dataset is used only for educational purposes and allows us to benchmark the performance of our weakly supervised models. In real-world scenarios, we typically do not have access to such fully annotated data.

#### Training the Model

The training process will involve fine-tuning the BERT model on each of the three versions of the dataset. By comparing the performance of the models trained on HMM and Majority Vote datasets to the one trained on the True dataset, we can assess the effectiveness of our weak labeling strategies.

##### Key Considerations

- **Bidirectional Context**: BERT's ability to understand context from both directions is particularly advantageous for NER tasks.
- **Transfer Learning**: Using a pre-trained model significantly reduces the computational resources and data required for training.
- **Weak Labels**: Despite being weak, these labels provide considerable value and reduce the need for extensive manual annotation.
- **Model Evaluation**: The True Labels dataset serves as a benchmark to understand the performance ceiling of our models trained with weak labels.

> **Note**: The True dataset is included solely for educational insights. In practical applications, the focus is primarily on applying weak labels to minimize manual annotation efforts while still achieving high-performance NER models.

By combining the BERT architecture with carefully prepared training data, we can train an NER model to identify drug entities in legal documents from weakly supervised annotations, without the cost of full manual annotation.


In [126]:
from helpers.text import calculate_md5


Let's perform some data preparation to make our dataset ready for 🤗 Transformers.

In [127]:
import ast

# Load the training dataset with labels from a Parquet file
df_train_with_labels = pd.read_parquet('data/ner/train_with_labels.parquet')

# Convert the string representation of the labels to a list of tuples
# ast.literal_eval parses Python literals only, so unlike eval it cannot execute arbitrary code
# Each tuple contains a token and its corresponding label
df_train_with_labels['labels'] = df_train_with_labels['labels'].apply(ast.literal_eval)

# Extract the labels from the training dataset as a list of lists
# Each inner list contains the labels for a single example in the training dataset
train_labels_true_iob = df_train_with_labels['labels'].values.tolist()

train_texts_true = df_train_with_labels['text'].values.tolist()
train_texts_hash = [calculate_md5(text) for text in train_texts_true]
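
The `calculate_md5` helper lives in `helpers.text` and is not shown here. A plausible equivalent, assuming it hashes the UTF-8 bytes of the text and returns a hexadecimal digest, is sketched below under the hypothetical name `md5_of_text`.

```python
import hashlib

def md5_of_text(text):
    """Return the hexadecimal MD5 digest of a UTF-8 encoded string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

digest = md5_of_text("abc")
```

A hash like this gives each text a short, stable key, which can be useful for matching the same example across different versions of the dataset.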

In [128]:

# Extract the labels from the validation dataset as a list of lists
# Each inner list contains the labels for a single example in the validation dataset
valid_labels_true_iob = df_valid['labels'].values.tolist()

# Extract the labels from the test dataset as a list of lists
# Each inner list contains the labels for a single example in the test dataset
test_labels_true_iob = df_test['labels'].values.tolist()

# Extract the text from the validation dataset as a list
valid_texts_true = df_valid['text'].values.tolist()

# Extract the text from the test dataset as a list
test_texts_true = df_test['text'].values.tolist()

# Calculate the MD5 hash for each text in the validation dataset
valid_texts_hash = [calculate_md5(text) for text in valid_texts_true]

# Calculate the MD5 hash for each text in the test dataset
test_texts_hash = [calculate_md5(text) for text in test_texts_true]

In [129]:
# Separate words and tags for each example in the training dataset
# The list comprehension iterates over each example in train_labels_true_iob
# For each example, it extracts the words and tags, creating two separate lists
words_list = [[word for word, _ in example] for example in train_labels_true_iob]
tags_list = [[tag for _, tag in example] for example in train_labels_true_iob]

# Create a dictionary to store the tokens and named entity tags for the training dataset
# 'tokens' contains the list of words for each example
# 'ner_tags' contains the list of named entity tags for each example
train_true_dicts = {
    "tokens": words_list,
    "ner_tags": tags_list,
    "text": train_texts_true,
    "hash": train_texts_hash
}


In [130]:
# Separate words and tags for each example in the validation dataset
# The list comprehension iterates over each example in valid_labels_true_iob
# For each example, it extracts the words and tags, creating two separate lists
words_list = [[word for word, _ in example] for example in valid_labels_true_iob]
tags_list = [[tag for _, tag in example] for example in valid_labels_true_iob]

# Create a dictionary to store the tokens and named entity tags for the validation dataset
# 'tokens' contains the list of words for each example
# 'ner_tags' contains the list of named entity tags for each example
valid_true_dicts = {
    "tokens": words_list,
    "ner_tags": tags_list,
    "text": valid_texts_true,
    "hash": valid_texts_hash
}


In [131]:
# Separate words and tags for each example in the test dataset
# The list comprehension iterates over each example in test_labels_true_iob
# For each example, it extracts the words and tags, creating two separate lists
words_list = [[word for word, _ in example] for example in test_labels_true_iob]
tags_list = [[tag for _, tag in example] for example in test_labels_true_iob]

# Create a dictionary to store the tokens and named entity tags for the test dataset
# 'tokens' contains the list of words for each example
# 'ner_tags' contains the list of named entity tags for each example
test_true_dicts = {
    "tokens": words_list,
    "ner_tags": tags_list,
    "text": test_texts_true,
    "hash": test_texts_hash
    
}
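
The three cells above repeat the same steps for the training, validation, and test splits. One possible way to factor that out is sketched below; `build_split_dict` is a hypothetical helper introduced only for illustration, and the toy inputs are invented.

```python
from typing import Dict, List, Sequence, Tuple


def build_split_dict(
    iob_examples: Sequence[Sequence[Tuple[str, str]]],
    texts: List[str],
    hashes: List[str],
) -> Dict[str, list]:
    """Hypothetical helper: turn (token, tag) pairs into the column dict used above."""
    if not (len(iob_examples) == len(texts) == len(hashes)):
        raise ValueError("All inputs must describe the same number of examples")
    return {
        "tokens": [[token for token, _ in example] for example in iob_examples],
        "ner_tags": [[tag for _, tag in example] for example in iob_examples],
        "text": list(texts),
        "hash": list(hashes),
    }


# Toy example with a single document (values are illustrative)
toy_iob = [[("Tomar", "O"), ("dipirona", "B-MEDICAMENTO"), ("sódica", "I-MEDICAMENTO")]]
toy_dict = build_split_dict(toy_iob, ["Tomar dipirona sódica"], ["abc123"])
```

The length check is a small safeguard: if the labels, texts, and hashes ever fall out of step, the error surfaces here rather than later during training.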

In [132]:
# Extract the IOB format annotations from the HMM-labeled training data
# The list comprehension iterates over each entry in train_labels_hmm
# For each entry, it extracts the 'iob' field, which contains the IOB format annotations
iob_train_hmm = [i['iob'] for i in train_labels_hmm]

# Separate words and tags for each example in the HMM-labeled training data
# The first list comprehension iterates over each example in iob_train_hmm
# For each example, it extracts the words, creating a list of words for each example
words_list = [[word for word, _ in example] for example in iob_train_hmm]

# The second list comprehension iterates over each example in iob_train_hmm
# For each example, it extracts the tags, creating a list of tags for each example
tags_list = [[tag for _, tag in example] for example in iob_train_hmm]

# Create a dictionary to store the tokens and named entity tags for the HMM-labeled training data
# 'tokens' contains the list of words for each example
# 'ner_tags' contains the list of named entity tags for each example
train_hmm_dicts = {
    "tokens": words_list,
    "ner_tags": tags_list,
    "text": train_texts_true,
    "hash": train_texts_hash
}

In [133]:
assert [i[0][0] for i in iob_train_hmm] == [i[0][0] for i in train_labels_true_iob]

In [134]:
# Extract the IOB format annotations from the majority voter-labeled training data
# The list comprehension iterates over each entry in train_labels_maj_voter
# For each entry, it extracts the 'iob' field, which contains the IOB format annotations
iob_train_maj_voter = [i['iob'] for i in train_labels_maj_voter]

# Separate words and tags for each example in the majority voter-labeled training data
# The first list comprehension iterates over each example in iob_train_maj_voter
# For each example, it extracts the words, creating a list of words for each example
words_list = [[word for word, _ in example] for example in iob_train_maj_voter]

# The second list comprehension iterates over each example in iob_train_maj_voter
# For each example, it extracts the tags, creating a list of tags for each example
tags_list = [[tag for _, tag in example] for example in iob_train_maj_voter]

# Create a dictionary to store the tokens and named entity tags for the majority voter-labeled training data
# 'tokens' contains the list of words for each example
# 'ner_tags' contains the list of named entity tags for each example
train_maj_voter_dicts = {
    "tokens": words_list,  # List of words for each example
    "ner_tags": tags_list,  # List of named entity tags for each example
    "text": train_texts_true,
    "hash": train_texts_hash
}

In [135]:
assert [i[0][0] for i in iob_train_maj_voter] == [i[0][0] for i in train_labels_true_iob]
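
The assertions above compare only the first token of each example. A stricter check, sketched below under the assumption that the weak and gold labels were produced with the same tokenization, compares the full token sequences and reports which examples differ. `find_token_mismatches` is a hypothetical helper, and the toy data is invented.

```python
from typing import List, Sequence, Tuple

IobExample = Sequence[Tuple[str, str]]


def find_token_mismatches(weak: Sequence[IobExample], gold: Sequence[IobExample]) -> List[int]:
    """Return the indices of examples whose token sequences differ."""
    if len(weak) != len(gold):
        raise ValueError("Both label sets must contain the same number of examples")
    mismatched = []
    for idx, (weak_example, gold_example) in enumerate(zip(weak, gold)):
        weak_tokens = [token for token, _ in weak_example]
        gold_tokens = [token for token, _ in gold_example]
        if weak_tokens != gold_tokens:
            mismatched.append(idx)
    return mismatched


# Toy data: the second example differs in one token ("remédio" vs "remedio")
gold = [[("Tomar", "O"), ("dipirona", "B-MEDICAMENTO")], [("Sem", "O"), ("remédio", "O")]]
weak = [[("Tomar", "O"), ("dipirona", "O")], [("Sem", "O"), ("remedio", "O")]]
mismatches = find_token_mismatches(weak, gold)
```

Differences in tags are ignored on purpose: the goal is only to confirm that both label sets describe the same tokens in the same order.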

In [136]:
from datasets import ClassLabel, Dataset, Features, Sequence, Value  # Import necessary classes from the datasets library

# Define the mapping from label names to label IDs
# 'O' represents tokens that are not part of any named entity
# 'B-MEDICAMENTO' represents the beginning of a 'MEDICAMENTO' entity
# 'I-MEDICAMENTO' represents the inside of a 'MEDICAMENTO' entity
label_to_id = {'O': 0, 'B-MEDICAMENTO': 1, 'I-MEDICAMENTO': 2}

# Create the reverse mapping from label IDs to label names
# This is useful for converting label IDs back to label names
id_to_label = {v: k for k, v in label_to_id.items()}

# Extract the label names from the mapping
# This list will be used to define the ClassLabel feature in the dataset
label_names = list(label_to_id.keys())

# Define the dataset features
# The Features class specifies the schema of the dataset
# 'tokens' is a sequence of strings, representing the tokens in the text
# 'ner_tags' is a sequence of ClassLabel, representing the named entity recognition tags for each token
dataset_features = Features(
    {
        'tokens': Sequence(Value('string')),  # Sequence of tokens
        'ner_tags': Sequence(ClassLabel(names=label_names)),  # Sequence of named entity recognition tags
        'text': Value('string'),  # The original text
        'hash': Value('string')  # The MD5 hash of the text
    }
)
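
With a `ClassLabel` feature, string tags are stored as integer IDs. The plain-Python sketch below mirrors that round trip using the same mapping as the cell above. `encode_tags` and `decode_ids` are hypothetical helpers written for illustration, not part of the `datasets` API.

```python
label_to_id = {"O": 0, "B-MEDICAMENTO": 1, "I-MEDICAMENTO": 2}
id_to_label = {idx: name for name, idx in label_to_id.items()}


def encode_tags(tags):
    """Map string tags to integer IDs, failing loudly on unknown tags."""
    unknown = sorted(set(tags) - set(label_to_id))
    if unknown:
        raise ValueError(f"Unknown tags: {unknown}")
    return [label_to_id[tag] for tag in tags]


def decode_ids(ids):
    """Map integer IDs back to string tags."""
    return [id_to_label[i] for i in ids]


tags = ["O", "B-MEDICAMENTO", "I-MEDICAMENTO", "O"]
ids = encode_tags(tags)
```

Rejecting unknown tags early is useful with weak labels, since a labeling function that emits an unexpected entity type would otherwise fail only when the dataset is built.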


In [137]:
from datasets import Dataset  # Import the Dataset class from the datasets library

# Create a Hugging Face Dataset from the HMM-labeled training data
# The Dataset.from_dict method converts a dictionary to a Dataset object
# train_hmm_dicts contains the tokens and named entity tags for the HMM-labeled training data
# dataset_features specifies the schema of the dataset (tokens and ner_tags)
hf_dataset_train_hmm = Dataset.from_dict(train_hmm_dicts, features=dataset_features)

# Create a Hugging Face Dataset from the majority voter-labeled training data
# train_maj_voter_dicts contains the tokens and named entity tags for the majority voter-labeled training data
hf_dataset_train_maj_voter = Dataset.from_dict(train_maj_voter_dicts, features=dataset_features)

# Create a Hugging Face Dataset from the true-labeled training data
# train_true_dicts contains the tokens and named entity tags for the true-labeled training data
hf_dataset_train_true = Dataset.from_dict(train_true_dicts, features=dataset_features)

# Create a Hugging Face Dataset from the true-labeled validation data
# valid_true_dicts contains the tokens and named entity tags for the true-labeled validation data
hf_dataset_valid_true = Dataset.from_dict(valid_true_dicts, features=dataset_features)

# Create a Hugging Face Dataset from the true-labeled test data
# test_true_dicts contains the tokens and named entity tags for the true-labeled test data
hf_dataset_test_true = Dataset.from_dict(test_true_dicts, features=dataset_features)

In [138]:
# Save the Hugging Face Datasets to disk

# Save the HMM-labeled training dataset to disk
hf_dataset_train_hmm.save_to_disk('outputs/ner/hf_dataset_train_hmm')

# Save the majority voter-labeled training dataset to disk
hf_dataset_train_maj_voter.save_to_disk('outputs/ner/hf_dataset_train_maj_voter')

# Save the true-labeled training dataset to disk
hf_dataset_train_true.save_to_disk('outputs/ner/hf_dataset_train_true')

# Save the true-labeled validation dataset to disk
hf_dataset_valid_true.save_to_disk('outputs/ner/hf_dataset_valid_true')

# Save the true-labeled test dataset to disk
hf_dataset_test_true.save_to_disk('outputs/ner/hf_dataset_test_true')



Saved 826, 826, 826, 255, and 100 examples (HMM train, majority-voter train, true train, validation, test).

In [139]:
from typing import Any, Dict, List  # Import necessary types for type annotations

def tokenize_and_align_labels(examples: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
    """
    Tokenizes the input words and aligns the labels with the tokens.

    Args:
        examples: A dictionary containing the input words and the corresponding labels.
                  - "tokens": List of words (tokens) for each example.
                  - "ner_tags": List of named entity recognition tags for each token.

    Returns:
        A dictionary containing the tokenized input words and the aligned labels.
        - "input_ids": List of token IDs for each example.
        - "attention_mask": List of attention masks for each example.
        - "labels": List of aligned labels for each token.
    """
    label_all_tokens = True  # Whether to label all tokens or only the first token of each word
    # Tokenize the input words
    # truncation=True: Truncate sequences to the maximum length
    # is_split_into_words=True: The input is already split into words
    # max_length=512: Maximum length of the tokenized sequences
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True, max_length=512)

    aligned_labels = []  # List to store the aligned labels for each example
    # Iterate over each example in the input data
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Get the word IDs for the current example
        previous_word_idx = None  # Initialize the previous word index
        label_ids = []  # List to store the aligned labels for the current example

        # Iterate over each word ID in the tokenized input
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)  # Append -100 for special tokens (e.g., [CLS], [SEP])
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])  # Append the label for the current word
            else:
                # Append the label for the current word if label_all_tokens is True, otherwise append -100
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx  # Update the previous word index

        aligned_labels.append(label_ids)  # Append the aligned labels for the current example

    tokenized_inputs["labels"] = aligned_labels  # Add the aligned labels to the tokenized inputs
    return tokenized_inputs  # Return the tokenized inputs with aligned labels

# Apply the tokenization and label alignment to the training, validation, and test datasets
# With batched=True, the map method passes batches of examples to tokenize_and_align_labels,
# which is why the function indexes each example with word_ids(batch_index=i)
tokenized_hf_dataset_train_hmm = hf_dataset_train_hmm.map(tokenize_and_align_labels, batched=True)
tokenized_hf_dataset_train_maj_voter = hf_dataset_train_maj_voter.map(tokenize_and_align_labels, batched=True)
tokenized_hf_dataset_train_true = hf_dataset_train_true.map(tokenize_and_align_labels, batched=True)
tokenized_hf_dataset_valid_true = hf_dataset_valid_true.map(tokenize_and_align_labels, batched=True)
tokenized_hf_dataset_test_true = hf_dataset_test_true.map(tokenize_and_align_labels, batched=True)

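
The alignment logic in `tokenize_and_align_labels` can be followed on a small hand-made case. The sketch below reimplements only the inner loop in plain Python. The `word_ids` list is written by hand for a hypothetical subword split, so no tokenizer is needed, and `align_labels` is a name introduced for illustration.

```python
from typing import List, Optional


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    label_all_tokens: bool = True,
) -> List[int]:
    """Spread word-level labels over subword tokens, as in the loop above."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(-100)  # special tokens are ignored by the loss
        elif word_idx != previous:
            aligned.append(word_labels[word_idx])  # first piece of a word
        else:
            # later pieces: repeat the word label or mask them out
            aligned.append(word_labels[word_idx] if label_all_tokens else -100)
        previous = word_idx
    return aligned


# Hypothetical split: [CLS] Tomar dip ##irona [SEP]
word_ids = [None, 0, 1, 1, None]
word_labels = [0, 1]  # O, B-MEDICAMENTO
all_tokens = align_labels(word_ids, word_labels, label_all_tokens=True)
first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
```

With `label_all_tokens=True`, the continuation piece receives the same `B-MEDICAMENTO` ID as the first piece, which an IOB-aware scorer may read as two adjacent entities. Predictions and references follow the same convention here, but it is worth keeping in mind when comparing scores with setups that label only the first piece.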

In [140]:
# Display the features of the 'ner_tags' field in the test dataset
# The features attribute provides information about the schema of the dataset
# 'ner_tags' is a sequence of named entity recognition tags for each token in the dataset
# This line of code will show the details of the 'ner_tags' field, such as the possible tag values and their corresponding IDs
hf_dataset_test_true.features['ner_tags']

Sequence(feature=ClassLabel(names=['O', 'B-MEDICAMENTO', 'I-MEDICAMENTO'], id=None), length=-1, id=None)

In [109]:
import numpy as np  # Import the NumPy library for numerical operations
from typing import Tuple, Dict  # Import Tuple and Dict for type annotations
import evaluate  # Import the evaluate library for computing evaluation metrics

def compute_metrics_for_evaluation(predictions_and_labels: Tuple[np.ndarray, np.ndarray]) -> Dict[str, float]:
    """
    Computes metrics for model evaluation.

    Args:
        predictions_and_labels: A tuple containing the model predictions and the true labels.
                                - predictions: A NumPy array of shape (batch_size, sequence_length, num_labels)
                                - labels: A NumPy array of shape (batch_size, sequence_length)

    Returns:
        A dictionary containing precision, recall, f1, and accuracy metrics.
    """
    predictions, labels = predictions_and_labels  # Unpack the predictions and labels from the input tuple

    # Convert logits to actual predictions
    # np.argmax(predictions, axis=2) selects the index of the maximum value along the last axis (num_labels)
    # This converts the logits to the predicted label IDs
    predictions = np.argmax(predictions, axis=2)

    # Filter out the special tokens and convert IDs to labels
    # Positions labeled -100 (special tokens) are excluded from both lists
    true_predictions = [
        [id_to_label[pred_id] for pred_id, label_id in zip(pred_row, label_row) if label_id != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]
    true_labels = [
        [id_to_label[label_id] for label_id in label_row if label_id != -100]
        for label_row in labels
    ]

    # Load the seqeval metric for named entity recognition
    # The seqeval metric computes precision, recall, f1, and accuracy for NER tasks
    metric = evaluate.load("seqeval")
    
    # Compute the evaluation metrics using the true predictions and true labels
    results = metric.compute(predictions=true_predictions, references=true_labels)

    # Return the computed metrics as a dictionary
    # The dictionary contains the overall precision, recall, f1, and accuracy
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
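
Precision, recall, and F1 from seqeval are computed over whole entities, while accuracy is computed over tokens. The sketch below illustrates the difference with a simplified span extractor. It is not seqeval's implementation: in particular, treating an `I-` tag with no preceding entity as the start of a new entity is an assumption of this sketch, and the function names are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a sequence of IOB tags."""
    spans = set()
    start, current_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing entity
        prefix, _, ent_type = tag.partition("-")
        continues = prefix == "I" and ent_type == current_type
        if current_type is not None and not continues:
            spans.add((current_type, start, i))
            start, current_type = None, None
        if prefix == "B" or (prefix == "I" and current_type is None):
            start, current_type = i, ent_type
    return spans


def entity_prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 over exact span matches."""
    tp = n_pred = n_gold = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans, pred_spans = extract_spans(gold_tags), extract_spans(pred_tags)
        tp += len(gold_spans & pred_spans)
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["O", "B-MEDICAMENTO", "I-MEDICAMENTO", "O", "B-MEDICAMENTO"]]
pred = [["O", "B-MEDICAMENTO", "O", "O", "B-MEDICAMENTO"]]
precision, recall, f1 = entity_prf(gold, pred)
token_accuracy = sum(g == p for g, p in zip(gold[0], pred[0])) / len(gold[0])
```

In this toy case a single wrong token breaks one of the two entities, so entity-level F1 is 0.5 while token accuracy is 0.8. This is one reason accuracy tends to sit well above F1 in NER evaluations.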

In [110]:
# Import necessary classes from the transformers library
from transformers import AutoModelForTokenClassification, AutoTokenizer, Trainer, TrainingArguments, DataCollatorForTokenClassification

MODEL_NAME = "neuralmind/bert-base-portuguese-cased"  # Specify the name of the pretrained BERT model

# Load the tokenizer for the BERT model
# The tokenizer will handle tokenization and padding/truncation of input sequences
# 'neuralmind/bert-base-portuguese-cased' is the name of the pretrained model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Create a DataCollator for token classification tasks
# The DataCollatorForTokenClassification dynamically pads the input sequences to the maximum length in the batch
# This ensures that all sequences in a batch have the same length, which is required for efficient processing
# The tokenizer is passed to the DataCollator to handle tokenization and padding
token_classification_collator = DataCollatorForTokenClassification(tokenizer)
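
Dynamic padding can be pictured with a short plain-Python sketch. It assumes right-side padding, a padding token ID of 0, and a label padding value of -100. `pad_batch` is a hypothetical function and the token IDs are made up, so this shows the idea rather than the collator's actual code.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        # Padded label positions use -100 so they do not contribute to the loss
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Two toy examples of different lengths (IDs are illustrative)
features = [
    {"input_ids": [101, 7, 8, 9, 102], "labels": [-100, 0, 1, 2, -100]},
    {"input_ids": [101, 7, 102], "labels": [-100, 0, -100]},
]
batch = pad_batch(features)
```

Padding per batch rather than to a fixed 512 tokens keeps batches of short documents small, which saves memory and compute.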

In [127]:
# Load the pretrained model identified by MODEL_NAME (the tokenizer was already loaded above)
# The model is configured for token classification with the specified number of labels
pretrained_language_model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, 
    num_labels=len(id_to_label),  # Number of unique labels in the dataset
    id2label=id_to_label,         # Mapping from label IDs to label names
    label2id=label_to_id          # Mapping from label names to label IDs
)

# Define the training arguments for the Trainer
training_args = TrainingArguments(
    output_dir='./outputs/ner/bert-large-ner-true-labels',  # Directory to save model checkpoints and logs
    num_train_epochs=7,  # Number of training epochs
    per_device_train_batch_size=6,  # Batch size for training
    per_device_eval_batch_size=6,   # Batch size for evaluation
    weight_decay=0.01,  # Weight decay for regularization
    seed=271828,  # Random seed for reproducibility
    bf16=True,  # Use bfloat16 precision for training (if supported by hardware)
    save_total_limit=1,  # Limit the total number of saved checkpoints
    logging_steps=1,  # Log training metrics every step
    eval_steps=1,  # Ignored here: evaluation_strategy="epoch" evaluates once per epoch
    save_steps=1,  # Ignored here: save_strategy='epoch' saves once per epoch
    metric_for_best_model="eval_f1",  # Metric to determine the best model
    greater_is_better=True,  # Higher metric value is better
    logging_strategy="steps",  # Log metrics at each step
    evaluation_strategy="epoch",  # Evaluate the model at the end of each epoch
    save_strategy='epoch',  # Save the model at the end of each epoch
    load_best_model_at_end=True,  # Load the best model at the end of training
    do_train=True,  # Perform training
    do_eval=True,  # Perform evaluation
    gradient_accumulation_steps=4,  # Accumulate gradients over multiple steps
    push_to_hub=False,  # Do not push the model to the Hugging Face Hub
    learning_rate=3e-5,  # Learning rate for the optimizer
    overwrite_output_dir=True,  # Overwrite the output directory if it exists
)

# Create a Trainer instance to handle training and evaluation
trainer = Trainer(
    model=pretrained_language_model,  # The model to be trained
    args=training_args,  # Training arguments defined above
    train_dataset=tokenized_hf_dataset_train_true,  # Tokenized training dataset
    eval_dataset=tokenized_hf_dataset_valid_true,  # Tokenized validation dataset
    tokenizer=tokenizer,  # Tokenizer for preprocessing the data
    data_collator=token_classification_collator,  # Data collator for dynamic padding
    compute_metrics=compute_metrics_for_evaluation,  # Function to compute evaluation metrics
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at neuralmind/bert-base-portuguese-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [128]:
# Start the training process using the Trainer instance
# This will train the model on the training dataset and evaluate it on the validation dataset
# The training process will follow the configurations specified in the TrainingArguments
trainer.train()



| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 0 | 0.0636 | 0.056807 | 0.685925 | 0.916758 | 0.784718 | 0.982304 |
| 1 | 0.0484 | 0.040741 | 0.752227 | 0.901972 | 0.820321 | 0.985769 |
| 2 | 0.0286 | 0.029915 | 0.766569 | 0.934283 | 0.842157 | 0.988515 |
| 4 | 0.0147 | 0.021161 | 0.868915 | 0.93839 | 0.902317 | 0.992855 |
| 5 | 0.0142 | 0.02043 | 0.884257 | 0.935104 | 0.90897 | 0.993209 |
| 6 | 0.0102 | 0.020196 | 0.88831 | 0.934283 | 0.910717 | 0.993258 |




TrainOutput(global_step=119, training_loss=0.05570434930757815, metrics={'train_runtime': 257.6219, 'train_samples_per_second': 22.444, 'train_steps_per_second': 0.462, 'total_flos': 1370250539182008.0, 'train_loss': 0.05570434930757815, 'epoch': 6.898550724637682})
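
The step count in the output above can be related to the batch settings with a little arithmetic. The sketch below is an illustration rather than the Trainer's actual bookkeeping, which may differ between library versions. The value `n_devices=2` is an inference that makes the numbers agree, not something stated in the notebook.

```python
import math


def training_schedule(n_examples, per_device_batch, n_devices, grad_accum, epochs):
    """Estimate optimizer updates for a gradient-accumulation setup (illustrative)."""
    examples_per_forward = per_device_batch * n_devices
    batches_per_epoch = math.ceil(n_examples / examples_per_forward)
    # A leftover partial accumulation window is not counted as a full update here
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_updates = epochs * updates_per_epoch
    return {
        "effective_batch_size": examples_per_forward * grad_accum,
        "batches_per_epoch": batches_per_epoch,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": total_updates,
        "fractional_epoch": total_updates * grad_accum / batches_per_epoch,
    }


# 826 training examples, batch size 6 per device, 4 accumulation steps, 7 epochs
schedule = training_schedule(
    n_examples=826, per_device_batch=6, n_devices=2, grad_accum=4, epochs=7
)
```

Under these assumptions the sketch gives 119 updates and a fractional epoch of about 6.8986, in line with `global_step=119` and the logged `epoch` value. With a single device the same arithmetic would give 238 updates, which is why two devices are assumed here.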

In [129]:
# Evaluate the model on the tokenized test dataset using the Trainer instance
# This will return a dictionary of evaluation metrics such as loss, precision, recall, and F1 score
metrics_true_labels = trainer.evaluate(tokenized_hf_dataset_test_true)

# Display the evaluation metrics
# These metrics help us understand the performance of the model on the test dataset
metrics_true_labels



{'eval_loss': 0.019314151257276535,
 'eval_precision': 0.9295039164490861,
 'eval_recall': 0.9523809523809523,
 'eval_f1': 0.9408033826638478,
 'eval_accuracy': 0.9940722112448357,
 'eval_runtime': 2.2595,
 'eval_samples_per_second': 44.258,
 'eval_steps_per_second': 3.983,
 'epoch': 6.898550724637682}
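
As a quick consistency check, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the precision and recall reported above.

```python
# Values copied from the evaluation output above
precision = 0.9295039164490861
recall = 0.9523809523809523
reported_f1 = 0.9408033826638478

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
difference = abs(f1 - reported_f1)
```

The recomputed value agrees with the reported `eval_f1` up to floating-point rounding.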

In [130]:
import gc  # Import the garbage collection module
import torch  # Import the PyTorch library

# Set the pretrained language model and trainer to None
# This helps in releasing the memory allocated to these objects
pretrained_language_model = None
trainer = None

# Force the garbage collector to release unreferenced memory
# This is useful to free up memory that is no longer needed
gc.collect()

# Empty the CUDA cache
# This releases GPU memory that was allocated by PyTorch but is no longer needed
torch.cuda.empty_cache()
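
Because the same cleanup is repeated after every training run, it could be wrapped in a small function. The sketch below is one possible version. `release_memory` is a hypothetical name, and the guarded import only serves to let the snippet run on machines without PyTorch or without a GPU.

```python
import gc


def release_memory() -> int:
    """Run the garbage collector and, when CUDA is available, empty PyTorch's cache."""
    collected = gc.collect()  # number of unreachable objects found
    try:
        import torch
    except ImportError:
        return collected  # PyTorch is not installed: nothing else to release
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return collected


n_collected = release_memory()
```

Setting a variable to `None` frees the underlying object only when no other reference to it remains. In a notebook, other names or the output history may still point to the model or trainer, in which case the memory stays allocated.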

In [131]:
# Load a fresh copy of the pretrained model (the tokenizer loaded earlier is reused)
# The model is configured for token classification with the specified number of labels
pretrained_language_model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,  # Pretrained model name
    num_labels=len(id_to_label),  # Number of unique labels in the dataset
    id2label=id_to_label,  # Mapping from label IDs to label names
    label2id=label_to_id  # Mapping from label names to label IDs
)

# Define the training arguments for the Trainer
training_args = TrainingArguments(
    output_dir='./outputs/ner/bert-large-ner-hmm-labels',  # Directory to save model checkpoints and logs
    num_train_epochs=7,  # Number of training epochs
    per_device_train_batch_size=6,  # Batch size for training
    per_device_eval_batch_size=6,  # Batch size for evaluation
    weight_decay=0.01,  # Weight decay for regularization
    seed=271828,  # Random seed for reproducibility
    bf16=True,  # Use bfloat16 precision for training (if supported by hardware)
    save_total_limit=1,  # Limit the total number of saved checkpoints
    logging_steps=1,  # Log training metrics every step
    eval_steps=1,  # Ignored here: evaluation_strategy="epoch" evaluates once per epoch
    save_steps=1,  # Ignored here: save_strategy='epoch' saves once per epoch
    metric_for_best_model="eval_f1",  # Metric to determine the best model
    greater_is_better=True,  # Higher metric value is better
    logging_strategy="steps",  # Log metrics at each step
    evaluation_strategy="epoch",  # Evaluate the model at the end of each epoch
    save_strategy='epoch',  # Save the model at the end of each epoch
    load_best_model_at_end=True,  # Load the best model at the end of training
    do_train=True,  # Perform training
    do_eval=True,  # Perform evaluation
    gradient_accumulation_steps=4,  # Accumulate gradients over multiple steps
    push_to_hub=False,  # Do not push the model to the Hugging Face Hub
    learning_rate=3e-5,  # Learning rate for the optimizer
    overwrite_output_dir=True,  # Overwrite the output directory if it exists
)

# Create a Trainer instance to handle training and evaluation
trainer = Trainer(
    model=pretrained_language_model,  # The model to be trained
    args=training_args,  # Training arguments defined above
    train_dataset=tokenized_hf_dataset_train_hmm,  # Tokenized training dataset
    eval_dataset=tokenized_hf_dataset_valid_true,  # Tokenized validation dataset
    tokenizer=tokenizer,  # Tokenizer for preprocessing the data
    data_collator=token_classification_collator,  # Data collator for dynamic padding
    compute_metrics=compute_metrics_for_evaluation,  # Function to compute evaluation metrics
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at neuralmind/bert-base-portuguese-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
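
The training configuration is the same for every label source except for the output directory. One way to make that explicit is to keep the shared settings in a single dictionary, as sketched below. `SHARED_TRAINING_KWARGS` and `training_kwargs_for` are hypothetical names, and the values are copied from the cells in this section.

```python
SHARED_TRAINING_KWARGS = {
    "num_train_epochs": 7,
    "per_device_train_batch_size": 6,
    "per_device_eval_batch_size": 6,
    "weight_decay": 0.01,
    "seed": 271828,
    "bf16": True,
    "save_total_limit": 1,
    "logging_steps": 1,
    "eval_steps": 1,
    "save_steps": 1,
    "metric_for_best_model": "eval_f1",
    "greater_is_better": True,
    "logging_strategy": "steps",
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "load_best_model_at_end": True,
    "do_train": True,
    "do_eval": True,
    "gradient_accumulation_steps": 4,
    "push_to_hub": False,
    "learning_rate": 3e-5,
    "overwrite_output_dir": True,
}


def training_kwargs_for(label_source: str) -> dict:
    """Return a fresh copy of the shared settings with a run-specific output directory."""
    kwargs = dict(SHARED_TRAINING_KWARGS)
    kwargs["output_dir"] = f"./outputs/ner/bert-large-ner-{label_source}-labels"
    return kwargs


hmm_kwargs = training_kwargs_for("hmm")
# These could then be passed on as TrainingArguments(**hmm_kwargs)
```

Keeping one source of truth for the hyperparameters makes it harder for the three runs to drift apart by accident, which matters here because the comparison between label sources relies on everything else being equal.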


In [132]:
# Start the training process using the Trainer instance
# This will train the model on the training dataset and evaluate it on the validation dataset
# The training process will follow the configurations specified in the TrainingArguments
# During training, the model's parameters will be updated to minimize the loss function
# The training loss is logged at every step; the evaluation metrics are computed at the end of each epoch
trainer.train()



| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 0 | 0.0825 | 0.055876 | 0.688912 | 0.910186 | 0.78424 | 0.982718 |
| 1 | 0.0631 | 0.046057 | 0.707692 | 0.932092 | 0.804538 | 0.984302 |
| 2 | 0.0322 | 0.039361 | 0.787234 | 0.952355 | 0.861958 | 0.988918 |
| 4 | 0.0237 | 0.038968 | 0.797115 | 0.953176 | 0.868188 | 0.988367 |
| 5 | 0.0226 | 0.041672 | 0.789809 | 0.950712 | 0.862823 | 0.987609 |
| 6 | 0.0213 | 0.041599 | 0.788732 | 0.950712 | 0.86218 | 0.987688 |




TrainOutput(global_step=119, training_loss=0.06485164079408184, metrics={'train_runtime': 259.2494, 'train_samples_per_second': 22.303, 'train_steps_per_second': 0.459, 'total_flos': 1370250539182008.0, 'train_loss': 0.06485164079408184, 'epoch': 6.898550724637682})

In [133]:
# Evaluate the model on the tokenized test dataset using the Trainer instance
# This will return a dictionary of evaluation metrics such as loss, precision, recall, and F1 score
# The evaluation metrics help us understand how well the model performs on unseen data
metrics_hmm_labels = trainer.evaluate(tokenized_hf_dataset_test_true)

# Display the evaluation metrics
# These metrics provide insights into the model's performance and can be used to compare different models
metrics_hmm_labels



{'eval_loss': 0.03901085630059242,
 'eval_precision': 0.8185982592762254,
 'eval_recall': 0.9561262707330123,
 'eval_f1': 0.8820335636722606,
 'eval_accuracy': 0.9868870127537274,
 'eval_runtime': 2.1678,
 'eval_samples_per_second': 46.13,
 'eval_steps_per_second': 4.152,
 'epoch': 6.898550724637682}

In [134]:
# Set the pretrained language model and trainer to None
# This helps in releasing the memory allocated to these objects
# By setting these variables to None, we remove their references, making them eligible for garbage collection
pretrained_language_model = None
trainer = None

# Force the garbage collector to release unreferenced memory
# This is useful to free up memory that is no longer needed
# The garbage collector will clean up any objects that are no longer referenced in the code
gc.collect()

# Empty the CUDA cache
# This releases GPU memory that was allocated by PyTorch but is no longer needed
# Clearing the CUDA cache helps in managing GPU memory more efficiently, especially when working with large models
torch.cuda.empty_cache()

In [135]:
# Load a fresh copy of the pretrained model (the tokenizer loaded earlier is reused)
# The model is configured for token classification with the specified number of labels
pretrained_language_model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,  # Pretrained model name
    num_labels=len(id_to_label),  # Number of unique labels in the dataset
    id2label=id_to_label,  # Mapping from label IDs to label names
    label2id=label_to_id  # Mapping from label names to label IDs
)


# Define the training arguments for the Trainer
training_args = TrainingArguments(
    output_dir='./outputs/ner/bert-large-ner-maj_voter-labels',  # Directory to save model checkpoints and logs
    num_train_epochs=7,  # Number of training epochs
    per_device_train_batch_size=6,  # Batch size for training
    per_device_eval_batch_size=6,  # Batch size for evaluation
    weight_decay=0.01,  # Weight decay for regularization
    seed=271828,  # Random seed for reproducibility
    bf16=True,  # Use bfloat16 precision for training (if supported by hardware)
    save_total_limit=1,  # Limit the total number of saved checkpoints
    logging_steps=1,  # Log training metrics every step
    eval_steps=1,  # Ignored here: evaluation_strategy="epoch" evaluates once per epoch
    save_steps=1,  # Ignored here: save_strategy='epoch' saves once per epoch
    metric_for_best_model="eval_f1",  # Metric to determine the best model
    greater_is_better=True,  # Higher metric value is better
    logging_strategy="steps",  # Log metrics at each step
    evaluation_strategy="epoch",  # Evaluate the model at the end of each epoch
    save_strategy='epoch',  # Save the model at the end of each epoch
    load_best_model_at_end=True,  # Load the best model at the end of training
    do_train=True,  # Perform training
    do_eval=True,  # Perform evaluation
    gradient_accumulation_steps=4,  # Accumulate gradients over multiple steps
    push_to_hub=False,  # Do not push the model to the Hugging Face Hub
    learning_rate=3e-5,  # Learning rate for the optimizer
    overwrite_output_dir=True,  # Overwrite the output directory if it exists
)

# Create a Trainer instance to handle training and evaluation
trainer = Trainer(
    model=pretrained_language_model,  # The model to be trained
    args=training_args,  # Training arguments defined above
    train_dataset=tokenized_hf_dataset_train_maj_voter,  # Tokenized training dataset
    eval_dataset=tokenized_hf_dataset_valid_true,  # Tokenized validation dataset
    tokenizer=tokenizer,  # Tokenizer for preprocessing the data
    data_collator=token_classification_collator,  # Data collator for dynamic padding
    compute_metrics=compute_metrics_for_evaluation,  # Function to compute evaluation metrics
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at neuralmind/bert-base-portuguese-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [136]:
# Start the training process using the Trainer instance
# This will train the model on the training dataset and evaluate it on the validation dataset
# The training process will follow the configurations specified in the TrainingArguments
# During training, the model's parameters will be updated to minimize the loss function
# The training loss is logged at every step; the evaluation metrics are computed at the end of each epoch
trainer.train()



| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 0 | 0.0838 | 0.057611 | 0.675343 | 0.916484 | 0.777649 | 0.982304 |
| 1 | 0.0609 | 0.050697 | 0.688613 | 0.942223 | 0.795699 | 0.983702 |
| 2 | 0.0312 | 0.047453 | 0.757675 | 0.952903 | 0.844148 | 0.98758 |
| 4 | 0.0244 | 0.050153 | 0.779556 | 0.96057 | 0.860648 | 0.987855 |
| 5 | 0.02 | 0.052997 | 0.766681 | 0.956462 | 0.851121 | 0.986861 |
| 6 | 0.0309 | 0.053237 | 0.766754 | 0.958653 | 0.852032 | 0.986901 |




TrainOutput(global_step=119, training_loss=0.06407456434167483, metrics={'train_runtime': 256.9282, 'train_samples_per_second': 22.504, 'train_steps_per_second': 0.463, 'total_flos': 1370250539182008.0, 'train_loss': 0.06407456434167483, 'epoch': 6.898550724637682})

In [137]:
# Evaluate the model on the tokenized test dataset using the Trainer instance
# This will return a dictionary of evaluation metrics such as loss, precision, recall, and F1 score
# The evaluation metrics help us understand how well the model performs on unseen data
metrics_maj_voter_labels = trainer.evaluate(tokenized_hf_dataset_test_true)

# Display the evaluation metrics
# These metrics provide insights into the model's performance and can be used to compare different models
metrics_maj_voter_labels



{'eval_loss': 0.05154959857463837,
 'eval_precision': 0.7837837837837838,
 'eval_recall': 0.962011771000535,
 'eval_f1': 0.8638001441268317,
 'eval_accuracy': 0.984834098899125,
 'eval_runtime': 2.2025,
 'eval_samples_per_second': 45.402,
 'eval_steps_per_second': 4.086,
 'epoch': 6.898550724637682}

In [138]:
# Set the pretrained language model and trainer to None
# This helps in releasing the memory allocated to these objects
# By setting these variables to None, we remove their references, making them eligible for garbage collection
pretrained_language_model = None
trainer = None

# Force the garbage collector to release unreferenced memory
# This is useful to free up memory that is no longer needed
# The garbage collector will clean up any objects that are no longer referenced in the code
gc.collect()

# Empty the CUDA cache
# This releases GPU memory that was allocated by PyTorch but is no longer needed
# Clearing the CUDA cache helps in managing GPU memory more efficiently, especially when working with large models
torch.cuda.empty_cache()

In [140]:
# Build a DataFrame comparing the evaluation metrics of the three models,
# trained on True Labels, HMM Labels, and Maj Voter Labels respectively
# Each row corresponds to one model (set by the index below)
# Each column corresponds to one metric (e.g., loss, precision, recall, F1 score)
df_metrics = pd.DataFrame(
    [metrics_true_labels, metrics_hmm_labels, metrics_maj_voter_labels],  # List of dictionaries containing metrics
    index=['True Labels', 'HMM Labels', 'Maj Voter Labels']  # Index labels for the DataFrame
)

# Display the DataFrame
# This will show the evaluation metrics for each model in a tabular format
# The DataFrame provides a clear and organized way to compare the performance of different models
df_metrics

,eval_loss,eval_precision,eval_recall,eval_f1,eval_accuracy,eval_runtime,eval_samples_per_second,eval_steps_per_second,epoch
True Labels,0.019314,0.929504,0.952381,0.940803,0.994072,2.2595,44.258,3.983,6.898551
HMM Labels,0.039011,0.818598,0.956126,0.882034,0.986887,2.1678,46.13,4.152,6.898551
Maj Voter Labels,0.05155,0.783784,0.962012,0.8638,0.984834,2.2025,45.402,4.086,6.898551


## Final Considerations

In this class, we explored the **Skweak framework** for **weak supervision** in **Named Entity Recognition (NER)** tasks. This framework combines several techniques to annotate text data efficiently and train NER models with minimal manual effort. The sections below summarize the key components and results of our approach.

### Key Components of Skweak Framework

1. **Labeling Functions**:
    - These are user-defined functions that apply heuristic rules or external knowledge sources to label data automatically. Labeling functions can be tailored to specific tasks or domains, providing initial annotations without human intervention.

2. **Generative Models**:
    - Generative models in Skweak combine the outputs of multiple labeling functions to create a consensus label for each data instance. These models can learn the accuracy and correlations of labeling functions, thereby improving the quality of the annotations.

3. **Majority Voting**:
    - This is a simpler technique where the label assigned to a data instance is the one most frequently suggested by the labeling functions. While less sophisticated than generative models, majority voting is a quick way to aggregate labels.
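To make the aggregation step concrete, here is a minimal sketch of token-level majority voting in plain Python. It is an illustration only, not Skweak's implementation: the function name `majority_vote`, the `B-DRUG`/`I-DRUG` tags, and the choices to ignore abstentions (`"O"`) and to break ties in favor of the first labeling function are all assumptions made for this example.

```python
from collections import Counter


def majority_vote(token_votes, abstain="O"):
    """Return the most frequent non-abstain label for each token.

    token_votes holds one list per token, containing the label proposed
    by each labeling function. Tokens on which every labeling function
    abstains keep the abstain label.
    """
    aggregated = []
    for votes in token_votes:
        counts = Counter(v for v in votes if v != abstain)
        if not counts:
            aggregated.append(abstain)
            continue
        # most_common keeps first-seen order among equal counts,
        # so ties go to the earliest labeling function (an assumption)
        aggregated.append(counts.most_common(1)[0][0])
    return aggregated


# Hypothetical votes from three labeling functions on three tokens
votes = [
    ["O", "O", "O"],
    ["B-DRUG", "B-DRUG", "O"],
    ["I-DRUG", "O", "O"],
]
labels = majority_vote(votes)
print(labels)
```

A generative model such as an HMM differs from this sketch in that it weights each labeling function by its estimated reliability instead of counting every vote equally.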

### Iterative Refinement Process

The Skweak framework supports an iterative refinement process, allowing continuous improvement of model performance. This process involves:

- **Initial Annotation**: Using labeling functions and weak supervision techniques to generate initial labels.
- **Model Training**: Training NER models on these weakly labeled datasets.
- **Evaluation and Feedback**: Assessing model performance and refining labeling functions or generative models based on feedback.
- **Re-annotation**: Updating annotations with improved labeling functions and repeating the cycle.

This iterative approach ensures that the NER model can adapt to evolving data requirements and improve over time.

### Performance Evaluation

We trained a model on each set of training labels and evaluated it on the test dataset. Here are the results:

| Training labels | F1-Score | Loss |
|----------|----------|----------|
| **Real labels** | **0.940803** | **0.019314** |
| HMM labels | 0.882034 | 0.039011 |
| Majority vote labels | 0.863800 | 0.051550 |

- **Real labels**: The gold-standard, manually annotated labels. The model trained on them sets the upper bound for this comparison.
- **HMM labels**: Labels generated by a Hidden Markov Model, which combines the outputs of multiple labeling functions probabilistically.
- **Majority vote labels**: Labels obtained by taking, for each instance, the label most frequently suggested by the labeling functions.
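The gap between the three models is easier to read with a few derived numbers. The sketch below copies the test-set precision, recall, and F1 from the `df_metrics` output above, recomputes F1 as the harmonic mean of precision and recall as a sanity check, and expresses each model's F1 as a share of the F1 obtained with real labels. The dictionary layout and variable names are chosen for this example only.

```python
# Test-set metrics copied from the df_metrics output above
results = {
    "Real labels": {"precision": 0.929504, "recall": 0.952381, "f1": 0.940803},
    "HMM labels": {"precision": 0.818598, "recall": 0.956126, "f1": 0.882034},
    "Majority vote labels": {"precision": 0.783784, "recall": 0.962012, "f1": 0.863800},
}

gold = results["Real labels"]
summary = {}
for name, m in results.items():
    p, r = m["precision"], m["recall"]
    summary[name] = {
        # F1 is the harmonic mean of precision and recall
        "f1_recomputed": 2 * p * r / (p + r),
        # Fraction of the gold-label F1 that this model reaches
        "share_of_gold_f1": m["f1"] / gold["f1"],
        # How much precision is lost relative to the gold-label model
        "precision_gap": gold["precision"] - p,
    }

for name, s in summary.items():
    print(
        f"{name}: F1 check={s['f1_recomputed']:.4f}, "
        f"share of gold F1={s['share_of_gold_f1']:.1%}, "
        f"precision gap={s['precision_gap']:.3f}"
    )
```

The models trained on HMM and majority vote labels reach roughly 94% and 92% of the gold-label F1. Recall is nearly identical across the three models, so the shortfall comes almost entirely from precision, which is about 0.11 and 0.15 lower than with real labels.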

### Time Efficiency

Manual labeling of data is time-consuming and often impractical for large datasets. In our case:

- Labeling one document manually takes approximately 144 seconds of human labor (72 seconds per person, with two people involved).
- For 826 documents, this amounts to 826 × 144 seconds = 118,944 seconds, or approximately 33 hours of human labor.

By using weak supervision techniques, we can significantly reduce this time, making the annotation process more efficient and scalable.
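The manual labeling estimate above can be reproduced with a few lines of arithmetic. The constant names are introduced here for illustration; the figures are the ones stated in this section.

```python
SECONDS_PER_ANNOTATOR = 72  # time one person spends on one document
ANNOTATORS = 2              # each document is labeled by two people
DOCUMENTS = 826

seconds_per_document = SECONDS_PER_ANNOTATOR * ANNOTATORS
total_seconds = seconds_per_document * DOCUMENTS
total_hours = total_seconds / 3600

print(f"{seconds_per_document} s per document")
print(f"{total_seconds:,} s in total, about {total_hours:.0f} hours")
```

Because the cost grows linearly with the number of documents, changing `DOCUMENTS` shows how quickly manual labeling becomes impractical for larger corpora.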

### Conclusion

The Skweak framework provides a powerful and flexible approach for weak supervision in NER tasks. By leveraging labeling functions, generative models, and majority voting, we can efficiently annotate large datasets and continuously improve model performance through iterative refinement. This approach not only saves substantial time and effort but also enables the development of robust NER models adaptable to various domains and evolving data requirements.

> **Note**: In real-world scenarios, the benefits of weak supervision become even more pronounced as the size of the dataset increases. The ability to efficiently and accurately label large volumes of data is crucial for the practical application of NER models.

# Questions

1. What is Named Entity Recognition (NER) and what is its purpose?

2. How can weak supervision techniques reduce the reliance on manual data labeling for NER tasks?

3. What are some examples of methods used in labeling functions for NER?

4. How does the Skweak framework contribute to improving the quality of annotations in NER?

5. What is the significance of document-level labeling in the context of NER?

6. How can transfer learning be leveraged to enhance NER performance, especially for specialized domains like legal documents?

7. What is meant by "iterative refinement" in the context of NER models and their labeling functions?

8. How does the time efficiency of weak supervision compare to traditional manual labeling for NER tasks?

9. What key benefits does weak supervision offer in terms of cost-effectiveness and scalability for NER model development?

10. Can models trained on weakly labeled data achieve comparable performance to those trained on fully annotated datasets?

`Answers are commented inside this cell.`

<!--
1. Named Entity Recognition (NER) is a task in Natural Language Processing (NLP) that identifies and classifies named entities mentioned in unstructured text into predefined categories such as person names, locations, organizations, dates, etc. This process essentially transforms unstructured text into structured data, which is invaluable for various downstream applications.

2. Weak supervision techniques offer a way to efficiently annotate large datasets for NER tasks without the need for extensive manual labeling. Instead of manually labeling every word in a sentence, weak supervision leverages labeling functions that automatically generate noisy labels based on predefined rules, heuristics, or existing knowledge sources.

3. Labeling functions can apply a variety of methods including: Gazetteer lookups (matching against lists of known entities), regular expression patterns, leveraging pre-trained transformer models like BERT, or employing zero-shot learning techniques like those used in tools such as GLiNER, NuNER, and LangChain.

4. Skweak is a framework designed to improve the quality of annotations generated by these labeling functions. It combines the outputs (weak labels) from multiple labeling functions using generative models like Hidden Markov Models and applies techniques like majority voting to reconcile discrepancies, leading to more accurate and robust annotations.

5. Document-level labeling functions play a crucial role in ensuring label consistency across an entire document. Instead of treating sentences in isolation, these functions consider the broader context and the relationships between entities mentioned within a document. This document-wide view improves annotation accuracy, especially when entity relationships and co-references matter for correct classification.

6. Transfer learning is extremely valuable in NER, particularly when dealing with specialized domains like legal documents. Pre-trained language models like BERT capture rich linguistic representations from massive text corpora. Fine-tuning these models on a weakly labeled dataset of legal texts allows them to adapt their knowledge to the specific terminology and context of legal language, achieving high performance on tasks like drug entity recognition without requiring vast amounts of manually labeled legal data.

7. Iterative refinement is a cyclical process of improvement. It involves analyzing the performance of the NER model, identifying areas where it makes mistakes, and then refining the labeling functions to address these weaknesses. This might involve adding new rules, modifying existing ones, or incorporating additional knowledge sources. The refined labels are then used to retrain or further fine-tune the model, leading to incremental improvements in its accuracy over time.

8. Weak supervision offers significant time savings compared to manual labeling. By automating a substantial portion of the annotation process through labeling functions, weak supervision reduces the need for human annotators to meticulously label each data instance, especially in large-scale datasets. This efficiency is particularly valuable in real-world scenarios where data is abundant, and time constraints are often a major factor.

9. From a cost-effectiveness standpoint, weak supervision reduces the reliance on expensive manual annotation, making it a more budget-friendly approach for developing NER models. Additionally, its iterative nature and ability to incorporate diverse labeling functions make it a scalable solution, adaptable to evolving data requirements and applicable across a wide range of domains.

10. They can come close. With transfer learning from pre-trained models and iterative refinement of labeling functions, models trained on weakly labeled data can approach the performance of models trained on fully annotated datasets. In this class, the model trained on HMM labels reached an F1 of 0.882 on the test set, against 0.941 for the model trained on real labels, which shows that weak supervision is practical for real-world NER applications. -->