# Token Classification / Named Entity Recognition (NER)
## IMD1107 - Natural Language Processing
### [Dr. Elias Jacob de Menezes Neto](https://docente.ufrn.br/elias.jacob)

# Summary

## Keypoints

- Named Entity Recognition (NER) is a crucial task in Information Extraction, focusing on identifying and classifying specific entities within text.

- NER transforms unstructured textual data into structured information, bridging the gap between human language and machine-readable formats.

- Common entity types in NER include PERSON, ORG (Organization), GPE (Geopolitical Entity), TIME, DATE, LOCATION, PRODUCT, and EVENT.

- NER faces challenges such as ambiguity, handling new or rare entities, nested entities, and domain specificity.

- Techniques for NER include rule-based approaches, machine learning methods (supervised and unsupervised), and deep learning approaches.

- The IOB2 (Inside-Outside-Beginning) tagging scheme is widely used in NER for labeling tokens as part of entities.

- Transfer learning is a powerful technique in NER, allowing models to leverage knowledge from pre-trained language models.

- Domain adaptation is crucial when applying NER to specialized fields like legal or medical texts.

- Transformer models, such as BERT, have revolutionized NER by capturing context more effectively than traditional methods.

- Large language models can be used for weak supervision in NER tasks, helping to generate initial labels for datasets.

## Takeaways

- NER is a fundamental task in NLP with wide-ranging applications across industries, from information retrieval to compliance monitoring.

- The choice of NER technique depends on the specific task, available data, and computational resources. Hybrid approaches often yield the best results.

- Transfer learning and domain adaptation significantly improve NER performance, especially when working with limited labeled data in specialized domains.

- The effectiveness of NER models heavily relies on the quality and relevance of training data. Domain-specific datasets are crucial for high-performance NER in specialized fields.

- Transformer-based models have set new benchmarks in NER tasks, but they come with challenges such as handling long texts and computational requirements.

- Weak supervision techniques, like using large language models for initial labeling, can significantly reduce the manual effort required in preparing NER datasets.

- Evaluating NER models requires considering multiple metrics, including precision, recall, and F1 score, as well as performance on specific entity types.

- Implementing NER in real-world applications often requires addressing practical challenges such as text preprocessing, handling long documents, and integrating with existing systems.

- The field of NER is rapidly evolving, with new techniques and models constantly emerging. Staying updated with the latest advancements is crucial for practitioners.

- While automated NER systems have become highly sophisticated, human expertise remains valuable for fine-tuning models and handling complex or ambiguous cases.

# Loading our data

In [1]:
# Load our data.
import pandas as pd

df_train = pd.read_parquet('data/healthcare/train.parquet')
df_test = pd.read_parquet('data/healthcare/test.parquet')
df_valid = pd.read_parquet('data/healthcare/valid.parquet')

# Let's see the first rows of our data.
df_train.head()

Unnamed: 0,text,uid,text_pt,Asthma,CAD,CHF,Depression,Diabetes,Gallstones,GERD,Gout,Hypercholesterolemia,Hypertension,Hypertriglyceridemia,OA,Obesity,OSA,PVD,Venous Insufficiency
0,490646815 | WMC | 31530471 | | 9629480 | 11/23...,0,490646815 | WMC | 31530471 | | 9629480 | 11/23...,0,1,1,0,1,0,1,0,0,1,0,1,0,0,1,1
1,368346277 | EMH | 64927307 | | 815098 | 3/29/1...,2,368346277 | EMH | 64927307 | | 815098 | 29/03/...,0,1,0,0,0,0,0,0,1,1,0,0,1,0,1,0
2,908761918 | MMC | 45427009 | | 0927689 | 5/26/...,4,908761918 | MMC | 45427009 | | 0927689 | 5/26/...,1,1,1,0,1,0,1,0,1,1,1,0,1,0,0,1
3,614370301 | OH | 58149804 | | 530586 | 10/21/1...,7,614370301 | OH | 58149804 | | 530586 | 21/10/1...,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0
4,279607396 | PMH | 77790323 | | 371979 | 10/13/...,10,279607396 | PMH | 77790323 | | 371979 | 10/13/...,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0


In [2]:
import re
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
import unicodedata
import spacy
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

def remove_excessive_whitespace(text: str) -> str:
    """
    Remove excessive whitespace from a string.

    Args:
        text (str): The input string.

    Returns:
        str: The string with excessive whitespace removed.
    """
    return re.sub(r'\s+', ' ', text).strip() 

def remove_repeated_non_word_characters(text: str) -> str:
    """
    Remove repeated non-word characters from a string.

    Args:
        text (str): The input string.

    Returns:
        str: The string with repeated non-word characters removed.
    """
    return re.sub(r'(\W)\1+', r'\1', text).strip() # \W matches any non-word character (equivalent to [^a-zA-Z0-9_ ])

def remove_first_line_of_text(text: str) -> str:
    """
    Remove the first line of a string.

    Args:
        text (str): The input string.

    Returns:
        str: The string with the first line removed.
    """
    return re.sub(r'^.*\n', '', text).strip()

def remove_last_line_of_text(text: str) -> str:
    """
    Remove the last line of a string.

    Args:
        text (str): The input string.

    Returns:
        str: The string with the last line removed.
    """
    return re.sub(r'\n.*$', '', text).strip()

def correct_isolated_commas(text: str) -> str:
    """
    Correct isolated commas in a string.

    Args:
        text (str): The input string.

    Returns:
        str: The string with isolated commas corrected.
    """
    # Replace punctuation with a blank character before
    text = re.sub(r' ([.,:;!?])', r'\1', text)
    return text.strip()

def remove_accents_from_text(text: str) -> str:
    """
    Remove accents from a string.

    Args:
        text (str): The input string.

    Returns:
        str: The string with accents removed.
    """
    return unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode('ASCII') 

# Define a text cleaning pipeline using scikit-learn's Pipeline
# The pipeline consists of a series of transformations applied sequentially to the text data
text_cleaning_pipeline = Pipeline([
    # Remove the first line from the text
    ('remove_first_line', FunctionTransformer(remove_first_line_of_text)),
    # Remove the last line from the text
    ('remove_last_line', FunctionTransformer(remove_last_line_of_text)),
    # Remove excessive whitespace from the text
    ('remove_excessive_whitespace', FunctionTransformer(remove_excessive_whitespace)),
    # Remove repeated non-word characters from the text
    ('remove_repeated_non_word_characters', FunctionTransformer(remove_repeated_non_word_characters)),
    # Correct isolated commas in the text
    ('correct_isolated_commas', FunctionTransformer(correct_isolated_commas)),
    # Optionally remove accents from the text
    # Uncomment the following line if you want to remove accents
    # One good tip to decide if you'll should use this or not is to think about the language of your data. Are the use of accents relatively consistent in your data? If so, you should keep them. If not, you should remove them. Since our data was automatically translated from English to Portuguese, we'll keep the accents because the translation process doesn't generally misuses accents.
    # ('remove_accents', FunctionTransformer(remove_accents_from_text)),
])

# Apply the text cleaning pipeline to the 'text_pt' column of the training DataFrame
# - .apply(text_cleaning_pipeline.transform) applies the pipeline to each element in the 'text_pt' column
# - The cleaned text is stored in a new column 'cleaned_text'
df_train['cleaned_text'] = df_train['text_pt'].apply(text_cleaning_pipeline.transform)

# Apply the text cleaning pipeline to the 'text_pt' column of the validation DataFrame
df_valid['cleaned_text'] = df_valid['text_pt'].apply(text_cleaning_pipeline.transform)

# Apply the text cleaning pipeline to the 'text_pt' column of the test DataFrame
df_test['cleaned_text'] = df_test['text_pt'].apply(text_cleaning_pipeline.transform)

In [3]:
df = pd.concat([df_train, df_valid, df_test], ignore_index=True)
df.head()

Unnamed: 0,text,uid,text_pt,Asthma,CAD,CHF,Depression,Diabetes,Gallstones,GERD,Gout,Hypercholesterolemia,Hypertension,Hypertriglyceridemia,OA,Obesity,OSA,PVD,Venous Insufficiency,cleaned_text
0,490646815 | WMC | 31530471 | | 9629480 | 11/23...,0,490646815 | WMC | 31530471 | | 9629480 | 11/23...,0,1,1,0,1,0,1,0,0,1,0,1,0,0,1,1,Data de Alta: 6/20/2006 MÉDICO EM ATENDIMENTO:...
1,368346277 | EMH | 64927307 | | 815098 | 3/29/1...,2,368346277 | EMH | 64927307 | | 815098 | 29/03/...,0,1,0,0,0,0,0,0,1,1,0,0,1,0,1,0,Data de alta: 10/12/1993 DIAGNÓSTICO PRINCIPAL...
2,908761918 | MMC | 45427009 | | 0927689 | 5/26/...,4,908761918 | MMC | 45427009 | | 0927689 | 5/26/...,1,1,1,0,1,0,1,0,1,1,1,0,1,0,0,1,Data de Alta: 9/29/2007 MÉDICO ASSISTENTE: SAL...
3,614370301 | OH | 58149804 | | 530586 | 10/21/1...,7,614370301 | OH | 58149804 | | 530586 | 21/10/1...,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,Data de Alta: 15/10/1995 DIAGNÓSTICO PRINCIPAL...
4,279607396 | PMH | 77790323 | | 371979 | 10/13/...,10,279607396 | PMH | 77790323 | | 371979 | 10/13/...,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,Data de Alta: 06/11/1992 DIAGNÓSTICO DE ADMISS...


# Named Entity Recognition (NER)

Named Entity Recognition (NER) is a crucial task within the broader field of Information Extraction, focusing on the identification and classification of specific entities within text. This process transforms unstructured textual data into structured information, bridging the gap between human language and machine-readable formats.
Named Entity Recognition is a fundamental task in natural language processing, serving as a bridge between unstructured text and structured data. Its applications span across various industries and use cases, making it a critical component in many AI and data processing systems. As NLP technologies advance, NER continues to evolve, tackling more complex scenarios and improving accuracy across diverse domains.
## Core Concepts

1. **Definition**: NER involves locating and categorizing named entities in text into predefined classes.

2. **Purpose**: The primary goal is to extract structured information from unstructured text, enabling machines to understand and process human language more effectively.

3. **Entity Types**: Common categories include:
   - `PERSON`: Individual names (e.g., "John Doe", "Marie Curie")
   - `ORG`: Organizations (e.g., "Apple Inc.", "United Nations")
   - `GPE`: Geopolitical entities (e.g., "New York", "France")
   - `TIME`: Time expressions (e.g., "3:30 PM", "noon")
   - `DATE`: Date expressions (e.g., "July 4, 1776", "next Monday")
   - `LOCATION`: Physical locations not classified as GPE (e.g., "Eiffel Tower", "Amazon Rainforest")
   - `PRODUCT`: Names of products (e.g., "iPhone", "Coca-Cola")
   - `EVENT`: Named events (e.g., "World War II", "Olympics")

   > Note: The specific entity types can vary depending on the NER system and its intended application.

## Importance and Applications

1. **Information Retrieval**: Enhances search capabilities by allowing queries based on specific entity types.
2. **Question Answering Systems**: Facilitates more accurate responses by identifying key entities in questions and potential answers.
3. **Content Classification**: Aids in categorizing documents based on mentioned entities.
4. **Recommendation Systems**: Improves personalization by recognizing user preferences for specific entities.
5. **Legal and Compliance**: Assists in identifying sensitive information for redaction or special handling.

## Challenges in NER

1. **Ambiguity**: Words can have multiple meanings or belong to different entity types based on context.
2. **New or Rare Entities**: Dealing with entities not present in training data.
3. **Nested Entities**: Handling entities that contain other entities (e.g., "Bank of America" contains "America").
4. **Domain Specificity**: NER systems often require customization for specific domains (e.g., medical, legal).

## NER Techniques

1. **Rule-Based Approaches**: Utilize hand-crafted rules and dictionaries.
2. **Machine Learning Methods**: 
   - Supervised Learning: Use labeled data to train classifiers.
   - Unsupervised Learning: Attempt to cluster similar entities without labeled data.
3. **Deep Learning Approaches**: Employ neural networks, particularly sequence models like RNNs and transformers.

## Evaluation Metrics

- **Precision**: Accuracy of identified entities.
- **Recall**: Proportion of actual entities correctly identified.
- **F1 Score**: Harmonic mean of precision and recall.

## Visual Example

The image provided illustrates NER in action for legal entities:

<br><br>
<p align="center">
  <img src="images/NER.png"  alt="" style="width: 80%; height: 80%"/>
</p>
<br><br>

This example demonstrates how NER can identify and categorize various entities within a legal context, showcasing its practical application in specialized domains.

## Why is Named Entity Recognition Crucial?

Named Entity Recognition (NER) plays a pivotal role in both **Natural Language Processing (NLP)** and **Natural Language Understanding (NLU)**. Its ability to extract detailed, structured insights from textual data makes it critical for various applications and scenarios. Below are some key instances where NER is essential:

- **Information Retrieval**: By identifying key entities in documents, NER enhances search functionality, enabling users to search using named entities.
- **Question Answering Systems**: NER helps in accurately understanding user queries, allowing for more precise answers in question-answering systems.
- **Machine Translation**: NER assists in generating highly accurate translations by recognizing and preserving entity names in the translated text.

Additionally, NER is crucial for other applications such as:

- **Relation Extraction**
- **Knowledge Graph Construction**
- **Text Mining**

## Illustrative Use Cases of NER

- **Healthcare Text Analysis**: In clinical note analysis, NER can identify information about diseases, symptoms, treatments, and medications, aiding in improved medical decision-making.
- **News Articles**: For a news reading app that curates articles, NER can extract information about people, organizations, and locations mentioned in the articles. This information can be used to categorize or tag articles or improve article recommendations.
- **Customer Support**: In customer support scenarios, NER can identify and separate out important pieces of information from a customer’s query, such as name, email address, and phone number, allowing for more efficient handling of customer requests.

In essence, NER transforms unstructured textual data into structured, machine-readable data, enabling more sophisticated and nuanced analyses.

## Why Not Use Regular Expressions?

Regular expressions (regex) are powerful tools for text processing and pattern matching. However, they are not suitable for NER due to their limitations in generalizing to unseen data. Here’s why:

- **Limited Flexibility**: Regex can be rigid. For example, a regex pattern to match names like `[A-Z][a-z]*\s[A-Z][a-z]*` would only match names with a first and a last name. It would fail to match:
  - Names with a middle name or initial.
  - Hyphenated names.
  - Names with connectives (e.g., "de", "a", "do", "dos", "das").
  - Names with suffixes (e.g., "Júnior", "Neto").
- **False Positives**: Regex might match strings that are not names, such as titles like "Doctor" or "Professor".

## Why Not Use a Dictionary?

Using a dictionary for NER can be effective for a small number of entities, but it becomes impractical for larger datasets. Consider the following:

- **Scalability Issues**: A dictionary containing all names of people worldwide would be enormous, making it difficult to maintain and update.
- **Dynamic Nature**: Names and entities constantly evolve, and keeping a dictionary up-to-date with new names would be challenging.


In [4]:
# Import the necessary modules from the spaCy library
# - spacy: The main spaCy library
# - displacy: A module for visualizing spaCy's results
import spacy
from spacy import displacy

# Load the spaCy model for Portuguese
# - 'pt_core_news_lg' is a large model with more accuracy and features
# - You can also use 'pt_core_news_sm' for a smaller, faster model with fewer features
# - To install the model, run: python -m spacy download pt_core_news_lg
nlp = spacy.load('pt_core_news_lg')

# Define a sample text in Portuguese
# This text contains various entities such as names, locations, organizations, and dates
text_example = (
    "Meu nome é Elias Jacob e eu moro em Natal, Rio Grande do Norte. "
    "Eu trabalho no Instituto Metrópole Digital, que é a unidade mais bacana da UFRN. "
    "Desde 2021 eu também trabalho como Corregedor da UFRN. "
    "Quando a pandemia começou, no início de 2020, eu estava com as malas prontas para uma viagem de férias para o Japão. "
    "Eu até fui buscar meu visto no Consulado em Recife, mas, quando chegou mais perto da viagem, meus voos foram todos cancelados pela United Airlines e eu não viajei. "
    "No dia 11 de novembro de 2023, eu estive no show do Roger Waters em São Paulo. "
    "Sempre que visito a cidade, eu dou uma passada no Kidoairaku, meu restaurante japonês favorito lá."
)

# Process the text using the spaCy model
# - nlp() processes the text and returns a Doc object containing the processed text
# - The Doc object includes information about the entities, parts of speech, and other linguistic features
doc = nlp(text_example)

# Visualize the named entities in the text using displaCy
# - displacy.render() renders the named entities in the text
# - doc: The Doc object containing the processed text
# - style='ent': Specifies that the visualization should show named entities
# - jupyter=True: Specifies that the visualization should be displayed in a Jupyter notebook
displacy.render(doc, style='ent', jupyter=True)

In [5]:
from typing import List, Tuple

def extract_named_entities_from_text(input_text: str) -> List[Tuple[str, str]]:
    """
    Extract named entities from a given text.

    Args:
        input_text (str): The input text.

    Returns:
        List[Tuple[str, str]]: A list of tuples where each tuple contains the named entity and its label.
    """
    # Parse the text with spaCy
    # - nlp(input_text) processes the input text and returns a Doc object containing the parsed text
    parsed_text = nlp(input_text)
    
    # Initialize an empty list to store the named entities
    named_entities = []
    
    # Iterate over the named entities in the parsed text
    # - parsed_text.ents contains the named entities identified in the text
    for entity in parsed_text.ents:
        # Append the entity text and label to the list
        # - entity.text is the text of the named entity
        # - entity.label_ is the label of the named entity (e.g., PERSON, ORG, GPE)
        named_entities.append((entity.text, entity.label_))
    
    # Return the list of named entities
    return named_entities

# Test the function with an example text
extract_named_entities_from_text(text_example)

[('Elias Jacob', 'PER'),
 ('Natal', 'LOC'),
 ('Rio Grande do Norte', 'LOC'),
 ('Instituto Metrópole Digital', 'ORG'),
 ('UFRN', 'LOC'),
 ('Corregedor da UFRN', 'MISC'),
 ('Japão', 'LOC'),
 ('Consulado', 'MISC'),
 ('Recife', 'LOC'),
 ('United Airlines', 'ORG'),
 ('Roger Waters', 'PER'),
 ('São Paulo', 'LOC'),
 ('Kidoairaku', 'ORG')]

## Differences Between Named Entity Recognition (NER) and Part-of-Speech (POS) Tagging

In the field of Natural Language Processing (NLP), **Named Entity Recognition (NER)** and **Part-of-Speech (POS) tagging** are two different techniques. Both involve labeling words or phrases in a text, but they serve distinct purposes and provide different types of information.

### Named Entity Recognition (NER)

**Named Entity Recognition (NER)** focuses on identifying and classifying named entities within a text into predefined categories such as:

- **Person names** (`PERSON`)
- **Organizations** (`ORG`)
- **Locations** (`LOC`)
- **Medical codes** (`MEDICAL`)
- **Time expressions** (`TIME`)
- **Quantities** (`QUANTITY`)
- **Monetary values** (`MONEY`)
- **Percentages** (`PERCENT`)

**Example:**
In the sentence, "Apple Inc. was established in April 1976 by Steve Jobs, Steve Wozniak, and Ronald Wayne," NER would identify:
- 'Apple Inc.' as an **organization** (`ORG`)
- 'April 1976' as a **time expression** (`TIME`)
- 'Steve Jobs', 'Steve Wozniak', and 'Ronald Wayne' as **persons** (`PERSON`)

### Part-of-Speech (POS) Tagging

**Part-of-Speech (POS) tagging** involves labeling each word in a sentence with its corresponding part of speech. This includes:

- **Nouns** (`NN`, `NNP`)
- **Verbs** (`VB`, `VBD`, `VBG`)
- **Adjectives** (`JJ`)
- **Adverbs** (`RB`)
- **Pronouns** (`PR`)
- **Prepositions** (`IN`)
- **Conjunctions** (`CC`)

**Example:**
Using the same sentence, "Apple Inc. was established in April 1976 by Steve Jobs, Steve Wozniak, and Ronald Wayne," POS tagging would assign:
- `NNP` (proper noun) to 'Apple Inc.', 'Steve Jobs', 'Steve Wozniak', and 'Ronald Wayne'
- `VBD` (past tense verb) to 'was established'
- `IN` (preposition) to 'in', 'by'

### Key Differences

**Objective:**
- **NER:** Identifies and categorizes entities in the text into predefined classes.
- **POS Tagging:** Identifies the grammatical roles of words within the sentence.

**Scope:**
- **NER:** Requires a broader context to accurately classify an entity.
- **POS Tagging:** Focuses on individual words and their immediate neighbors.

**Output:**
- **NER:** Provides labels for multi-word phrases.
- **POS Tagging:** Provides labels for individual words.

### Additional Considerations

- **Context Sensitivity:** NER needs to consider the broader context to disambiguate entities (e.g., 'Apple' as a company vs. a fruit), whereas POS tagging relies on local context.
- **Complexity:** NER can be more complex due to variability and ambiguity of named entities, while POS tagging deals with a more fixed set of grammatical categories.
- **Applications:** NER is crucial for tasks like information extraction, question answering, and summarization. POS tagging is essential for syntactic parsing, text-to-speech systems, and machine translation.

### Interplay Between NER and POS Tagging

While distinct, NER and POS tagging can complement each other in NLP pipelines:
- POS information can aid in NER by helping to identify potential entity boundaries and types.
- NER results can enhance POS tagging accuracy, especially for proper nouns and domain-specific terms.

We won't cover POS tagging in this notebook, but understanding the differences between NER and POS tagging is essential for developing a comprehensive understanding of text analysis techniques.


In [6]:
# We won't discuss POS tagging in this class, but here's an example of how to do it with Spacy

# Define a dictionary to map POS tags to their Portuguese equivalents
# This dictionary will help translate the POS tags from English to Portuguese
pos_tags_pt = {
    "ADJ": "Adjetivo",
    "ADP": "Adposição",
    "ADV": "Advérbio",
    "AUX": "Auxiliar",
    "CONJ": "Conjunção",
    "CCONJ": "Conjunção Coordenativa",
    "DET": "Determinante",
    "INTJ": "Interjeição",
    "NOUN": "Substantivo",
    "NUM": "Numeral",
    "PART": "Partícula",
    "PRON": "Pronome",
    "PROPN": "Nome próprio",
    "PUNCT": "Pontuação",
    "SCONJ": "Conjunção Subordinativa",
    "SYM": "Símbolo",
    "VERB": "Verbo",
    "X": "Outros",
    "SPACE": "Espaço"
}

def extract_part_of_speech_tags(input_text: str) -> List[Tuple[str, str, str]]:
    """
    Extract part-of-speech (POS) tags from a given text.

    Args:
        input_text (str): The input text.

    Returns:
        List[Tuple[str, str, str]]: A list of tuples where each tuple contains the token, its POS tag, and the corresponding Portuguese POS tag.
    """
    # Parse the text with spaCy
    # - nlp(input_text) processes the input text and returns a Doc object containing the parsed text
    parsed_text = nlp(input_text)
    
    # Initialize an empty list to store the POS tags
    pos_tags = []
    
    # Iterate over the tokens in the parsed text
    # - token.text is the text of the token
    # - token.pos_ is the POS tag of the token
    for token in parsed_text:
        # Append the token text, its POS tag, and the corresponding Portuguese POS tag to the list
        pos_tags.append((token.text, token.pos_, pos_tags_pt[token.pos_]))
    
    # Return the list of POS tags
    return pos_tags

# Test the function with an example text
# - text_example is a sample text containing various tokens
# - The function call extracts the POS tags from the text and returns them as a list of tuples
extract_part_of_speech_tags(text_example)

[('Meu', 'DET', 'Determinante'),
 ('nome', 'NOUN', 'Substantivo'),
 ('é', 'AUX', 'Auxiliar'),
 ('Elias', 'PROPN', 'Nome próprio'),
 ('Jacob', 'PROPN', 'Nome próprio'),
 ('e', 'CCONJ', 'Conjunção Coordenativa'),
 ('eu', 'PRON', 'Pronome'),
 ('moro', 'VERB', 'Verbo'),
 ('em', 'ADP', 'Adposição'),
 ('Natal', 'PROPN', 'Nome próprio'),
 (',', 'PUNCT', 'Pontuação'),
 ('Rio', 'PROPN', 'Nome próprio'),
 ('Grande', 'PROPN', 'Nome próprio'),
 ('do', 'ADP', 'Adposição'),
 ('Norte', 'PROPN', 'Nome próprio'),
 ('.', 'PUNCT', 'Pontuação'),
 ('Eu', 'PRON', 'Pronome'),
 ('trabalho', 'NOUN', 'Substantivo'),
 ('no', 'ADP', 'Adposição'),
 ('Instituto', 'PROPN', 'Nome próprio'),
 ('Metrópole', 'PROPN', 'Nome próprio'),
 ('Digital', 'PROPN', 'Nome próprio'),
 (',', 'PUNCT', 'Pontuação'),
 ('que', 'PRON', 'Pronome'),
 ('é', 'AUX', 'Auxiliar'),
 ('a', 'DET', 'Determinante'),
 ('unidade', 'NOUN', 'Substantivo'),
 ('mais', 'ADV', 'Advérbio'),
 ('bacana', 'ADJ', 'Adjetivo'),
 ('da', 'ADP', 'Adposição'),
 ('UFRN

In [7]:
# Visualize the syntactic dependencies in the text using displaCy
# - displacy.render() renders the syntactic dependencies in the text
# - doc: The Doc object containing the processed text
# - style='dep': Specifies that the visualization should show syntactic dependencies
# - jupyter=True: Specifies that the visualization should be displayed in a Jupyter notebook
# - options={'distance': 90}: Sets the distance between words in the visualization for better readability
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

# To know more about the Universal Dependencies standard, check: https://universaldependencies.org/u/dep/index.html
# Universal Dependencies is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages.

## IOB2 Tagging Scheme

The **Inside-outside-beginning (IOB)** tagging scheme is a widely-used format in Named Entity Recognition (NER). The **IOB2** tagging scheme, also known as the **Begin-inside-outside (BIO)** tagging scheme, enhances the basic concept by providing clear distinctions between different parts of entities.

### Key Concepts of IOB2

- **B (Beginning)**: Indicates the beginning of an entity.
- **I (Inside)**: Marks the continuation of an entity.
- **O (Outside)**: Denotes tokens that do not belong to any entity.

This scheme helps in distinguishing between different entities and managing multi-word entities, such as full names or locations.

### Example of IOB2 Notation

Consider the sentence:

> "Meu nome é Elias Jacob e eu vivo em Natal"

With IOB2 notation, the tagged sentence would be:

```
Meu O
nome O
é O
Elias B-PER
Jacob I-PER
e O
eu O
vivo O
em O
Natal B-LOC
```

- `B-PER` indicates the beginning of a person's name ('Elias').
- `I-PER` continues the same person's name ('Jacob').
- `B-LOC` signifies the beginning of a location entity ('Natal').
- `O` labels non-entity tokens.

### Flexibility of Entity Definition in IOB2

The IOB2 scheme is flexible and can accommodate various entity types beyond the common labels like `PER` (person), `LOC` (location), `ORG` (organization), and `MISC` (miscellaneous). For instance, if your task involves identifying `CITY` and `STATE`, the tagging would adapt accordingly:

```
Meu O
nome O
é O
Elias O
Jacob O
e O
eu O
vivo O
em O
Natal B-CITY
```

Here, `B-CITY` marks 'Natal' as a city entity.

### Contemporary Approaches and Preferences

While IOB2 tagging traditionally assumes each token is a word, modern NLP often uses subword tokens such as wordpieces or byte-pair encodings (BPE). In these cases, the IOB2 scheme can be less effective due to ambiguity in classifying subword tokens.

#### Alternative Scheme

To address this, a more flexible schema might be used, consisting of a tuple with four elements: the start and end index of the entity, the entity type, and the entity value. For the sentence "Meu nome é Elias Jacob e eu vivo em Natal":

```python
(11, 22, 'person', 'Elias Jacob')
(36, 41, 'location', 'Natal')
```

This method is explicit and accommodates any tokenization type, though it is more verbose and only focuses on entities, ignoring non-entity tokens.

### Applying IOB2 Tagging in Portuguese Texts

For practical applications, such as training models with Portuguese text, we'll need to perform some data preparation. Typically, data is formatted in the CoNLL format, a tab-separated structure where each line contains a word and its named entity tag.

> Note: The example below is derived from a real Portuguese dataset. Sensitive information has been removed, and the file follows the CoNLL format.
>
> For more details, refer to the [CoNLL format documentation](https://www.aclweb.org/anthology/W03-0419.pdf).

When working with such data, ensure correct formatting to effectively train your NER models.


In [8]:
with open('data/healthcare/example_medical.conll', 'r') as f:
    text_example_medical = f.read()
print(text_example_medical)


10	O
:	O
00	O
:	O
AVALIADO	O
RAIO	B-Procedimento de diagnóstico
X	I-Procedimento de diagnóstico
,	O
SNE	B-Dispositivo médico
MAL	B-Conceito qualitativo
POSICIONADA	I-Conceito qualitativo
,	O
REPASSADA	O
SEM	B-Negação
INTERCORRÊNCIA	I-Negação
NO	I-Negação
PROCEDIMENTO	B-Atividade de saúde
.	O
AUSCULTA	B-Sinal ou sintoma
ABDOMINAL	I-Sinal ou sintoma
POSITIVA	I-Sinal ou sintoma
.	O
RETIRO	O
FIO	B-Dispositivo médico
GUIA	I-Dispositivo médico
,	O
SOLICITO	O
RAIO	O
X	O
PARA	O
AVALIAÇÃO	O
DA	O
POSIÇÃO	O
.	O
17	O
:	O
50	O
:	O
CLIENTE	B-Paciente ou Grupo Deficiente
RELATOU	O
CEFALÉIA	B-Sinal ou sintoma
NO	O
PERÍODO	B-Conceito temporal
DA	I-Conceito temporal
MANHÃ	I-Conceito temporal
E	O
FOI	O
MEDICADO	B-Procedimento terapêutico ou preventivo
.	O
CONSCIENTE	B-Achado
,	O
PUPILAS	B-Achado
ISOCÓRICAS	I-Achado
FOTORREAGENTES	I-Achado
,	O
COMUNICATIVO	B-Achado
,	O
EUPNÉICO	B-Achado
,	O
REPOUSO	B-Achado
NO	I-Achado
LEITO	I-Achado
,	O
ACEITANDO	O
BEM	O
A	O
DIETA	B-Comida
,	O
SEM	O
ACESSO	B-Dispositivo 

In [9]:
with open('data/healthcare/example_medical_raw.txt', 'r') as f:
    raw_text_example_medical = f.read()

print(raw_text_example_medical)

10:00: AVALIADO RAIO-X, SNE MAL POSICIONADA, REPASSADA SEM INTERCORRÊNCIA NO PROCEDIMENTO. AUSCULTA ABDOMINAL POSITIVA. RETIRO FIO GUIA, SOLICITO RAIO-X PARA AVALIAÇÃO DA POSIÇÃO. 17:50: CLIENTE RELATOU CEFALÉIA NO PERÍODO DA MANHÃ E FOI MEDICADO. CONSCIENTE, PUPILAS ISOCÓRICAS FOTORREAGENTES, COMUNICATIVO, EUPNÉICO, REPOUSO NO LEITO, ACEITANDO BEM A DIETA, SEM ACESSO PERIFÉRICO, APRESENTANDO EDEMA EM OMBRO E, MANTENDO TIPÓIA DE ATADURA EM MSE, ABDÔMEN PLANO E FLÁCIDO, ELIMINAÇÕES PRESENTES. 20:50 h: Paciente com cuff furado, realizado troca de TOT, instalado TOT 7.5, VM modo controlado VC 500, PEEP 07, FIO2 a 50%. 22:30h: ESQUEMA DE SUPERVISÃO NOTURNA: TRAQUEOSTOMIZADO, EM DESMAME VENTILATÓRIO EM PS DE 17, PEEP 5, FIO2 50%. PERMANECE EXTREMAMENTE AGITADO (não responde à sedação intermitente). REINICIADO  DORMONID 7 ML/H, FENTANIL 2 ML/H (continua não respondendo). PERMANECE AGITADO, ARRANCOU SOE. ACESSO VENOSO PERIFÉRICO EM MSD COM PLANO BÁSICO DE HIDRATAÇÃO + SEDAÇÃO. POLISSECRETIVO 

In [10]:
# Initialize an empty list to store the entities
# This list will hold the entities extracted from each line of the input text
list_of_entities = []

# Iterate over each line in the input text
# - text_example_medical.split('\n') splits the input text into lines based on the newline character
for line in text_example_medical.split('\n'):
    # Check if the line is not empty
    # - len(line) > 1 ensures that the line contains more than one character (i.e., it is not empty)
    if len(line) > 1:
        # Split the line into entities based on the tab character
        # - line.split('\t') splits the line into a list of entities using the tab character as the delimiter
        # - Append the list of entities to the list_of_entities
        list_of_entities.append(line.split('\t'))

# Print the list of entities
# This list contains the entities extracted from each non-empty line of the input text
list_of_entities

[['10', 'O'],
 [':', 'O'],
 ['00', 'O'],
 [':', 'O'],
 ['AVALIADO', 'O'],
 ['RAIO', 'B-Procedimento de diagnóstico'],
 ['X', 'I-Procedimento de diagnóstico'],
 [',', 'O'],
 ['SNE', 'B-Dispositivo médico'],
 ['MAL', 'B-Conceito qualitativo'],
 ['POSICIONADA', 'I-Conceito qualitativo'],
 [',', 'O'],
 ['REPASSADA', 'O'],
 ['SEM', 'B-Negação'],
 ['INTERCORRÊNCIA', 'I-Negação'],
 ['NO', 'I-Negação'],
 ['PROCEDIMENTO', 'B-Atividade de saúde'],
 ['.', 'O'],
 ['AUSCULTA', 'B-Sinal ou sintoma'],
 ['ABDOMINAL', 'I-Sinal ou sintoma'],
 ['POSITIVA', 'I-Sinal ou sintoma'],
 ['.', 'O'],
 ['RETIRO', 'O'],
 ['FIO', 'B-Dispositivo médico'],
 ['GUIA', 'I-Dispositivo médico'],
 [',', 'O'],
 ['SOLICITO', 'O'],
 ['RAIO', 'O'],
 ['X', 'O'],
 ['PARA', 'O'],
 ['AVALIAÇÃO', 'O'],
 ['DA', 'O'],
 ['POSIÇÃO', 'O'],
 ['.', 'O'],
 ['17', 'O'],
 [':', 'O'],
 ['50', 'O'],
 [':', 'O'],
 ['CLIENTE', 'B-Paciente ou Grupo Deficiente'],
 ['RELATOU', 'O'],
 ['CEFALÉIA', 'B-Sinal ou sintoma'],
 ['NO', 'O'],
 ['PERÍODO', 'B-

In [11]:
def extract_entity_spans(list_of_entities: List[Tuple[str, str]], raw_text: str) -> List[Tuple[int, int, str]]:
    """
    Extract the spans of entities in the raw text.

    Args:
        list_of_entities (List[Tuple[str, str]]): A list of tuples where each tuple contains the entity and its label.
        raw_text (str): The raw text.

    Returns:
        List[Tuple[int, int, str]]: A list of tuples where each tuple contains the start index, end index, and label of an entity.
    """
    # Initialize an empty list to store the entity spans
    entity_spans = []
    
    # Iterate over the entities
    for entity in list_of_entities:
        # If the entity label is not 'O' (i.e., the entity is not outside any named entity)
        if entity[1] != 'O':
            # Find the start index of the entity in the raw text
            start_index = raw_text.find(entity[0])
            # Calculate the end index of the entity in the raw text
            end_index = start_index + len(entity[0])
            # Append the start index, end index, and label of the entity to the list
            entity_spans.append((start_index, end_index, entity[1]))
    
    # Sort the entity spans by the start index
    # This ensures that the entities are in the order they appear in the text
    entity_spans = sorted(entity_spans, key=lambda x: x[0])
    
    # Remove duplicate entity spans
    # Using dict.fromkeys() removes duplicates while preserving order
    entity_spans = list(dict.fromkeys(entity_spans))
    
    return entity_spans

# Example usage:
# - list_of_entities contains the entities extracted from the text
# - raw_text_example_medical is the raw text from which the entities were extracted
# - The function call extracts the spans of the entities in the raw text and returns them as a list of tuples
entities_span = extract_entity_spans(list_of_entities, raw_text_example_medical)
entities_span

[(16, 20, 'B-Procedimento de diagnóstico'),
 (21, 22, 'I-Procedimento de diagnóstico'),
 (22, 23, 'I-Dispositivo médico'),
 (22, 23, 'I-Procedimento terapêutico ou preventivo'),
 (24, 27, 'B-Dispositivo médico'),
 (26, 27, 'I-Procedimento terapêutico ou preventivo'),
 (28, 31, 'B-Conceito qualitativo'),
 (32, 43, 'I-Conceito qualitativo'),
 (41, 43, 'I-Conceito temporal'),
 (55, 58, 'B-Negação'),
 (59, 73, 'I-Negação'),
 (74, 76, 'I-Negação'),
 (74, 76, 'I-Achado'),
 (77, 89, 'B-Atividade de saúde'),
 (91, 99, 'B-Sinal ou sintoma'),
 (100, 109, 'I-Sinal ou sintoma'),
 (110, 118, 'I-Sinal ou sintoma'),
 (127, 130, 'B-Dispositivo médico'),
 (131, 135, 'I-Dispositivo médico'),
 (187, 194, 'B-Paciente ou Grupo Deficiente'),
 (203, 211, 'B-Sinal ou sintoma'),
 (215, 222, 'B-Conceito temporal'),
 (226, 231, 'I-Conceito temporal'),
 (229, 230, 'I-Abreviação'),
 (234, 236, 'B-Lesão ou envenenamento'),
 (238, 246, 'B-Procedimento terapêutico ou preventivo'),
 (248, 258, 'B-Achado'),
 (260, 267,

In [12]:
from typing import List, Tuple

def merge_consecutive_entities_of_same_type(text: str, entity_spans: List[Tuple[int, int, str]]) -> List[Tuple[int, int, str]]:
    """
    Merge consecutive entities of the same type in the text.

    Args:
        text (str): The input text.
        entity_spans (List[Tuple[int, int, str]]): A list of tuples where each tuple contains the start index, end index, and label of an entity.

    Returns:
        List[Tuple[int, int, str]]: A list of tuples where each tuple contains the start index, end index, and label of a merged entity.
    """
    # Initialize an empty list to store the merged entity spans
    merged_entities = []
    
    # Iterate over the entity spans
    for i in range(len(entity_spans)):
        # Check if the current entity is the beginning of a chunk (label starts with 'B')
        if entity_spans[i][2].startswith('B'):
            # Check if the next entity is part of the same chunk and has the same type
            if (i < len(entity_spans) - 1 and 
                entity_spans[i+1][2].startswith('I') and 
                entity_spans[i+1][2][2:] == entity_spans[i][2][2:]):
                # Merge the current entity with the next one
                merged_entities.append((entity_spans[i][0], entity_spans[i+1][1], entity_spans[i][2][2:]))
            else:
                # If the entity is not part of a chunk, add it to the list as is
                merged_entities.append((entity_spans[i][0], entity_spans[i][1], entity_spans[i][2][2:]))
        elif entity_spans[i][2].startswith('I'):
            # Skip entities that are inside a chunk but not preceded by a 'B' entity
            continue
        else:
            # Add entities that are outside any chunk directly to the list
            merged_entities.append(entity_spans[i])
    
    return merged_entities

# Example usage:
# - raw_text_example_medical is the raw text from which the entities were extracted
# - entities_span is a list of tuples where each tuple contains the start index, end index, and label of an entity
# - The function call merges consecutive entities of the same type and returns them as a list of tuples
merged_entities = merge_consecutive_entities_of_same_type(raw_text_example_medical, entities_span)
merged_entities

[(16, 22, 'Procedimento de diagnóstico'),
 (24, 27, 'Dispositivo médico'),
 (28, 43, 'Conceito qualitativo'),
 (55, 73, 'Negação'),
 (77, 89, 'Atividade de saúde'),
 (91, 109, 'Sinal ou sintoma'),
 (127, 135, 'Dispositivo médico'),
 (187, 194, 'Paciente ou Grupo Deficiente'),
 (203, 211, 'Sinal ou sintoma'),
 (215, 231, 'Conceito temporal'),
 (234, 236, 'Lesão ou envenenamento'),
 (238, 246, 'Procedimento terapêutico ou preventivo'),
 (248, 258, 'Achado'),
 (260, 278, 'Achado'),
 (295, 307, 'Achado'),
 (309, 317, 'Achado'),
 (319, 335, 'Achado'),
 (353, 358, 'Comida'),
 (364, 381, 'Dispositivo médico'),
 (396, 401, 'Sinal ou sintoma|Doença ou Síndrome'),
 (405, 410, 'Localização do corpo ou região'),
 (423, 440, 'Dispositivo médico'),
 (444, 447, 'Localização do corpo ou região'),
 (449, 456, 'Localização do corpo ou região'),
 (457, 472, 'Procedimento terapêutico ou preventivo'),
 (474, 495, 'Achado'),
 (506, 514, 'Paciente ou Grupo Deficiente'),
 (519, 530, 'Dispositivo médico'),
 (5

In [14]:
from typing import List, Dict
from spacy import displacy

def visualize_entities(text: str, entities: List[Dict[str, str]], colors: Dict[str, str]) -> None:
    """
    Visualize named entities in the text using spaCy's displaCy.

    Args:
        text (str): The input text containing named entities.
        entities (List[Dict[str, str]]): A list of dictionaries where each dictionary contains the start index, end index, and label of an entity.
        colors (Dict[str, str]): A dictionary mapping entity labels to their corresponding colors.

    Returns:
        None
    """
    # Define options for displaCy visualization
    # - "ents": List of entity labels to be visualized
    # - "colors": Dictionary mapping entity labels to colors
    displacy_options = {"ents": list(colors.keys()), "colors": colors}
    
    # Prepare the data for displaCy rendering
    # - "text": The input text
    # - "ents": List of entities with their start index, end index, and label
    render_data = [{"text": text, "ents": [{"start": entity[0], "end": entity[1], "label": entity[2]} for entity in entities]}]
    
    # Render the named entities using displaCy
    # - style="ent": Specifies that the visualization should show named entities
    # - jupyter=True: Specifies that the visualization should be displayed in a Jupyter notebook
    # - options: Options for displaCy visualization
    displacy.render(render_data, style="ent", manual=True, jupyter=True, options=displacy_options)

# Define a dictionary mapping entity labels to colors
entity_colors = {
    'Abreviação': '#E74C3C', 
    'Localização do corpo ou região': '#3498DB',
    'Achado': '#9B59B6', 
    'Organização relacionada à assistência médica': '#34495E',
    'Dispositivo médico': '#2ECC71', 
    'Procedimento terapêutico ou preventivo': '#1ABC9C',
    'Parte do corpo, órgão ou componente do órgão': '#F39C12', 
    'Procedimento de diagnóstico': '#16A085',
    'Negação': '#27AE60', 
    'Sinal ou sintoma': '#2980B9',
    'Comida': '#8E44AD', 
    'Atividade de saúde': '#2C3E50',
    'Organização relacionada ao cuidado de saúde': '#D35400', 
    'Lesão ou envenenamento': '#7F8C8D',
    'Função patológica|Sinal ou sintoma': '#C0392B', 
    'Paciente ou Grupo Deficiente': '#BDC3C7',
    'Substância farmacológica': '#7D3C98', 
    'Conceito qualitativo': '#76D7C4',
    'Conceito quantitativo': '#5DADE2', 
    'Sinal ou sintoma|Doença ou Síndrome': '#F1948A',
    'Conceito espacial': '#BB8FCE', 
    'Conceito temporal': '#73C6B6'
}

# Example usage:
# - raw_text_example_medical is the raw text containing named entities
# - merged_entities is a list of dictionaries where each dictionary contains the start index, end index, and label of an entity
# - The function call visualizes the named entities in the text using displaCy
visualize_entities(raw_text_example_medical, merged_entities, entity_colors)



## Annotating your own data

Let's assume you have a dataset of 100 medical records and you want to annotate medical procedures and medications. You'll need to use a an annotation tool. Data annotation is an essential step in training our machine learning models. Today, we will guide you through the process of labeling your data using web-based graphical user interfaces (GUIs).

#### Step 1: Choosing the Right Tool for Annotation

Selecting the appropriate tool for data labeling is crucial for efficient and accurate annotation. Here are some of the recommended tools, each offering unique features:

- **[Label Studio](https://labelstud.io/)**: A versatile and feature-rich tool, ideal for a wide range of annotation tasks.
- **[Doccano](https://github.com/doccano/)**: A user-friendly tool, particularly well-suited for text annotation.
- **[Argilla](https://docs.argilla.io/en/latest/index.html)**: A powerful option with advanced features like weak supervision and active learning. This tool is highly recommended for complex annotation tasks. For a deeper understanding of Human-in-the-Loop Machine Learning, refer to [this book](https://www.manning.com/books/human-in-the-loop-machine-learning).

While we will primarily focus on Label Studio due to its extensive feature set, feel free to explore Doccano or other tools listed in [this exhaustive list](https://github.com/doccano/awesome-annotation-tools).

#### Step 2: Setup and Installation 

Setting up these annotation tools is generally straightforward. To streamline the process, a ready-to-use Label Studio instance has been prepared for you. Follow these steps to get started:

1. **Create Your Account**: Click [this link](https://labelclasse.jacob.al/user/signup/?token=36cf05fc40c431c9) to create your Label Studio account.
2. **Login**: Use [this link](https://labelclasse.jacob.al) to log in to your account.
3. **Data Conversion**: Convert your data into a format that Label Studio can process. Label Studio expects a JSON file structured as follows:

```json
[
    {
        "id": 1,
        "data": {
            "text": "Here's some text to label",
            "meta_info": {
                "some_meta": 123,
                "another_meta": 456
            }
        }
    }
]
```

#### Important Considerations

- **Tool Selection**: Choose the tool that best fits your annotation needs. For instance, if your project involves complex tasks, Argilla might be the best option due to its advanced features.
- **Data Preparation**: Ensure your data is properly formatted before importing it into the annotation tool. This step is crucial for smooth operation and accurate labeling.
- **User Interface Familiarity**: Spend some time getting familiar with the user interface of the chosen tool. This will improve your efficiency and accuracy during the annotation process.



In [15]:
df

Unnamed: 0,text,uid,text_pt,Asthma,CAD,CHF,Depression,Diabetes,Gallstones,GERD,Gout,Hypercholesterolemia,Hypertension,Hypertriglyceridemia,OA,Obesity,OSA,PVD,Venous Insufficiency,cleaned_text
0,490646815 | WMC | 31530471 | | 9629480 | 11/23...,0,490646815 | WMC | 31530471 | | 9629480 | 11/23...,0,1,1,0,1,0,1,0,0,1,0,1,0,0,1,1,Data de Alta: 6/20/2006 MÉDICO EM ATENDIMENTO:...
1,368346277 | EMH | 64927307 | | 815098 | 3/29/1...,2,368346277 | EMH | 64927307 | | 815098 | 29/03/...,0,1,0,0,0,0,0,0,1,1,0,0,1,0,1,0,Data de alta: 10/12/1993 DIAGNÓSTICO PRINCIPAL...
2,908761918 | MMC | 45427009 | | 0927689 | 5/26/...,4,908761918 | MMC | 45427009 | | 0927689 | 5/26/...,1,1,1,0,1,0,1,0,1,1,1,0,1,0,0,1,Data de Alta: 9/29/2007 MÉDICO ASSISTENTE: SAL...
3,614370301 | OH | 58149804 | | 530586 | 10/21/1...,7,614370301 | OH | 58149804 | | 530586 | 21/10/1...,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,Data de Alta: 15/10/1995 DIAGNÓSTICO PRINCIPAL...
4,279607396 | PMH | 77790323 | | 371979 | 10/13/...,10,279607396 | PMH | 77790323 | | 371979 | 10/13/...,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,Data de Alta: 06/11/1992 DIAGNÓSTICO DE ADMISS...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1232,637843490 | AEH | 54848709 | | 5419602 | 11/28...,1215,637843490 | AEH | 54848709 | | 5419602 | 11/28...,0,1,0,0,1,0,1,0,1,1,0,1,0,0,0,0,Data de Alta: 29/01/2007 * ORDENS DE ALTA FINA...
1233,482506854 | ERH | 18073415 | | 775999 | 1/27/2...,1217,482506854 | ERH | 18073415 | | 775999 | 1/27/2...,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,Data de Alta: 12/10/2002 * ORDEM DE ALTA * ROR...
1234,747959081 | BH | 31199129 | | 3498389 | 8/6/20...,1219,747959081 | BH | 31199129 | | 3498389 | 8/6/20...,0,1,1,1,1,0,0,0,1,1,0,0,0,0,0,0,Data de Alta: 21/4/2006 MÉDICO ASSISTENTE: NIC...
1235,781979567 | MCH | 91301170 | | 8750646 | 7/12/...,1222,781979567 | MCH | 91301170 | 8750646 | 7/12/20...,0,1,1,0,1,0,0,0,1,1,0,0,1,0,0,0,"* ORDENS DE ALTA * GOUD, JODY 499-78-41-6 Hunt..."


In [16]:
df.columns

Index(['text', 'uid', 'text_pt', 'Asthma', 'CAD', 'CHF', 'Depression',
       'Diabetes', 'Gallstones', 'GERD', 'Gout', 'Hypercholesterolemia',
       'Hypertension', 'Hypertriglyceridemia', 'OA', 'Obesity', 'OSA', 'PVD',
       'Venous Insufficiency', 'cleaned_text'],
      dtype='object')

In [17]:
text_col = 'cleaned_text'
meta_cols = [ # Just as an example, these are the columns that we'll use as metadata and are not part of the task itself.
    'uid',
    'Asthma',
    'CAD' # etc
]



In [18]:
# Initialize an empty list to store dictionaries
list_of_dicts = []

# Iterate over the first 100 rows of the DataFrame 'df'
for row in df.head(100).itertuples():
    # Append a dictionary to 'list_of_dicts' for each row
    list_of_dicts.append({
        'id': row.Index,  # Use the row index as the 'id'
        'data': {
            'text': getattr(row, text_col),  # Extract the text from the specified column
            'meta_info': {m: getattr(row, m) for m in meta_cols}  # Extract meta information from specified columns
        }
    })

# Return the length of the list of dictionaries
len(list_of_dicts)

100

In [19]:
list_of_dicts[0]

{'id': 0,
 'data': {'text': 'Data de Alta: 6/20/2006 MÉDICO EM ATENDIMENTO: TRUKA, DEON XAVIER M.D. SERVIÇO: BH. DIAGNÓSTICO PRINCIPAL: Anemia e Sangramento Gastrointestinal. DIAGNÓSTICOS SECUNDÁRIOS: Diabetes, substituição da válvula mitral, fibrilação atrial e doença renal crônica. HISTÓRICO DA DOENÇA ATUAL: A paciente é uma mulher de 86 anos com histórico de diabetes, doença renal crônica, insuficiência cardíaca congestiva com fração de ejeção de 45% a 50%, que se apresenta com queixa principal de fadiga e fraqueza há uma semana. Ela apresentava piora da dor na virilha e no quadril direito, pós-artroplastia total do quadril há aproximadamente 13 anos, que piorou nas últimas duas semanas, e também recentemente completou um tratamento com Levaquin para infecção do trato urinário. Ela se apresentou ao consultório do Dr. Parent queixando-se de fadiga e fraqueza há uma semana. Ela teve algumas dores abdominais em uma distribuição semelhante a uma faixa ao redor do lado direito. Foi const

In [20]:
# Import the json module to handle JSON data
import json

# Open a file in write mode to save the JSON data
with open('outputs/cn_first_100_ner.json', 'w') as file:
    # Serialize 'list_of_dicts' to a JSON formatted string and write it to the file
    json.dump(list_of_dicts, file)

## Step 4: Reading back your data

Once you've annotated and exported your data to a JSON format, you can read it back using the following code:


In [22]:
# reading back a json exported from labelstudio

# Open the JSON file exported from Label Studio in read mode
with open('outputs/cn_first_100_ner_fromlabelstudio.json', 'r') as file:
    # Load the JSON data from the file into the variable 'tmp_data'
    tmp_data = json.load(file)

# Display the loaded JSON data
tmp_data


[{'id': 743,
  'annotations': [{'id': 612,
    'completed_by': 1,
    'result': [{'value': {'start': 182,
       'end': 212,
       'text': 'substituição da válvula mitral',
       'labels': ['Procedimento']},
      'id': 'kstD2SIBL8',
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels',
      'origin': 'manual'},
     {'value': {'start': 576,
       'end': 605,
       'text': 'artroplastia total do quadril',
       'labels': ['Procedimento']},
      'id': '_phchysu_z',
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels',
      'origin': 'manual'},
     {'value': {'start': 721,
       'end': 729,
       'text': 'Levaquin',
       'labels': ['Medicamento']},
      'id': 'jSbewazLvL',
      'from_name': 'label',
      'to_name': 'text',
      'type': 'labels',
      'origin': 'manual'},
     {'value': {'start': 1099,
       'end': 1109,
       'text': 'transfusão',
       'labels': ['Procedimento']},
      'id': '-jKNCewNDD',
      'from_na

In [23]:
# Extract the first item from the loaded JSON data
item = tmp_data[0]

# Extract the text data from the 'data' field of the item
txt = item['data']['text']

# Extract the first annotation dictionary from the 'annotations' field
annotation_dict = item['annotations'][0]

# Initialize an empty list to store entity spans
entities_span = []

# Iterate over each annotation result in the annotation dictionary
for annotation in annotation_dict['result']:
    # Append a tuple containing the start, end, and label of the entity to 'entities_span'
    entities_span.append((annotation['value']['start'], annotation['value']['end'], annotation['value']['labels'][0]))

# Display the list of entity spans
entities_span

[(182, 212, 'Procedimento'),
 (576, 605, 'Procedimento'),
 (721, 729, 'Medicamento'),
 (1099, 1109, 'Procedimento'),
 (1158, 1173, 'Medicamento'),
 (1216, 1222, 'Medicamento'),
 (1263, 1270, 'Medicamento'),
 (1335, 1340, 'Medicamento'),
 (1366, 1376, 'Medicamento'),
 (1401, 1411, 'Medicamento'),
 (1431, 1440, 'Medicamento'),
 (1461, 1470, 'Medicamento'),
 (1489, 1504, 'Medicamento'),
 (1535, 1550, 'Medicamento'),
 (2014, 2051, 'Procedimento'),
 (2361, 2368, 'Medicamento'),
 (2371, 2379, 'Medicamento'),
 (2913, 2922, 'Procedimento'),
 (2769, 2787, 'Procedimento'),
 (2791, 2811, 'Procedimento'),
 (3047, 3055, 'Procedimento'),
 (3305, 3322, 'Procedimento'),
 (3445, 3448, 'Procedimento'),
 (3601, 3616, 'Procedimento'),
 (3702, 3727, 'Procedimento'),
 (3744, 3781, 'Procedimento'),
 (3969, 3978, 'Procedimento'),
 (3908, 3911, 'Procedimento'),
 (4052, 4061, 'Procedimento'),
 (4150, 4159, 'Procedimento'),
 (4246, 4265, 'Procedimento'),
 (4423, 4435, 'Procedimento'),
 (4591, 4628, 'Procedimento

In [24]:
# Import the displacy module from the spaCy library for visualizing entities
from spacy import displacy

# Define a dictionary to map entity labels to specific colors for visualization
colors = {
    'Medicamento': '#2ECC71',  # Green color for 'Medicamento' entities
    'Procedimento': '#3498DB',  # Blue color for 'Procedimento' entities
}

# Define options for displacy visualization, including the entity labels and their corresponding colors
options = {"ents": list(colors.keys()), "colors": colors}

# Prepare the data for rendering by displacy
# 'render_data' is a list of dictionaries, each containing the text and its corresponding entities
render_data = [{"text": txt, "ents": [{"start": i[0], "end": i[1], "label": i[2]} for i in entities_span]}]

# Render the entities in the text using displacy with the specified options
# 'style="ent"' specifies that we are visualizing named entities
# 'manual=True' indicates that we are providing the entities manually
# 'jupyter=True' ensures the visualization is displayed correctly in a Jupyter notebook
displacy.render(render_data, style="ent", manual=True, jupyter=True, options=options)

### Using Weak Supervision Techniques to Warm-Start Your Annotation Task

Named Entity Recognition (NER) is a task that can be both time-consuming and resource-intensive due to the significant effort required for data labeling. Any approach that can reduce the amount of data needing manual annotation will save considerable time and resources. **Weak supervision** is one such technique, offering a powerful means to conserve effort and expedite the annotation process.

#### What is Weak Supervision?

Weak supervision is a machine learning approach that leverages both human expertise and heuristic methods to train models. This technique is particularly valuable when dealing with limited labeled data and can enhance the performance of models that are already trained on some labeled data.

> **Note:** Although we won't get into the theoretical aspects of weak supervision here, those interested can explore this topic further in my comprehensive course available at [Programa de Pós-Graduação em Tecnologia da Informação/UFRN](https://posgraduacao.ufrn.br/ppgti).

#### Strategy: Using Pretrained Models for Label Generation

One strategy within the weak supervision framework involves utilizing pretrained models to generate labels for your dataset. The principle is straightforward: the more advanced the pretrained model, the higher the quality of the generated labels.

- **Pretrained Models**: These are models that have been previously trained on large datasets and can be repurposed to generate annotations for new data.
- **Generated Labels**: While these labels may not be perfect, they provide a cost-effective alternative to manual annotation, particularly in specialized fields like medicine where expert annotators are required.

#### Practical Application in NER

In our class, we will demonstrate how to use Large Language Models (LLMs) to initiate the NER annotation task. Specifically, we will employ GPT-4 to generate labels for the following entities:
- **Medicamento** (Medication)
- **Doença** (Disease)
- **Procedimento** (Procedure)

##### Steps to Follow:

1. **Model Selection**: Ensure you have the necessary clearance and budget to use a large language model. Be mindful of data protection regulations such as LGPD (Lei Geral de Proteção de Dados).
2. **Data Preparation**: Prepare your dataset for annotation.
3. **Label Generation**: Use GPT-4 to generate annotations for your dataset. These annotations will serve as a preliminary labeling, which can then be refined further if needed.

> **Important:** The generated annotations might not be flawless, but they can significantly reduce the workload compared to having a human expert manually annotate the entire dataset.

#### Example Application

To illustrate this, we will apply the annotation method to an example dataset. This practical demonstration will show how GPT-4 can identify and label entities related to 'Medicamento', 'Doença', and 'Procedimento'. 

Let's proceed with the example from earlier to see how this works in practice.


In [25]:
input_text = "10:00: AVALIADO RAIO-X, SNE MAL POSICIONADA, REPASSADA SEM INTERCORRÊNCIA NO PROCEDIMENTO. AUSCULTA ABDOMINAL POSITIVA. RETIRO FIO GUIA, SOLICITO RAIO-X PARA AVALIAÇÃO DA POSIÇÃO. 17:50: CLIENTE RELATOU CEFALÉIA NO PERÍODO DA MANHÃ E FOI MEDICADO. CONSCIENTE, PUPILAS ISOCÓRICAS FOTORREAGENTES, COMUNICATIVO, EUPNÉICO, REPOUSO NO LEITO, ACEITANDO BEM A DIETA, SEM ACESSO PERIFÉRICO, APRESENTANDO EDEMA EM OMBRO E, MANTENDO TIPÓIA DE ATADURA EM MSE, ABDÔMEN PLANO E FLÁCIDO, ELIMINAÇÕES PRESENTES. 20:50 h: Paciente com cuff furado, realizado troca de TOT, instalado TOT 7.5, VM modo controlado VC 500, PEEP 07, FIO2 a 50%. 22:30h: ESQUEMA DE SUPERVISÃO NOTURNA: TRAQUEOSTOMIZADO, EM DESMAME VENTILATÓRIO EM PS DE 17, PEEP 5, FIO2 50%. PERMANECE EXTREMAMENTE AGITADO (não responde à sedação intermitente). REINICIADO  DORMONID 7 ML/H, FENTANIL 2 ML/H (continua não respondendo). PERMANECE AGITADO, ARRANCOU SOE. ACESSO VENOSO PERIFÉRICO EM MSD COM PLANO BÁSICO DE HIDRATAÇÃO + SEDAÇÃO. POLISSECRETIVO (mantendo trach care). FO SEM SINAIS FLOGÍSTICOS EM FACE, EDEMA EM FACE LD. MONITORIZAÇÃO DE MÚLTIPLOS PARÂMETROS CONFORME ROTINA DA UTI. DIURESE EM UROPEN. 23:30 h: Retorna do CC, onde realizou relaparotomia exploradora, instalado sistema de aspiração fechado em traqueostomia (trach care) no CC, trajeto de volta do CC sem intercorrências, mantém monitorização de parâmetros, TOT nº 7,5 em rima 22, VM modo VCV 440ml, peep 6, fio2 a 40%, pupilas isocóricas fotorreagentes, SNG com débito em média quantidade de aspecto borra de café, diurese via SVD em média quantidade, sem grumos, de cor amarelo claro, PAM em radial esquerda, acesso venoso central em subclávia à direita, evacuação ausente no período."

In [26]:
from langchain_openai import ChatOpenAI
from langchain.pydantic_v1 import BaseModel, Field
from typing import Optional, Sequence
from langchain_core.prompts import ChatPromptTemplate

class Medicamento(BaseModel):
    """Identifica o nome do medicamento"""

    principio_ativo: str = Field(..., description="Nome do princípio ativo do medicamento (ex.: clonazepam)")
    nome_comercial: Optional[str] = Field(None, description="Nome comercial do medicamento (ex.: Rivotril)")
    dosagem: Optional[str] = Field(None, description="Dosagem do medicamento (ex.: 2mg)")

class Doenca(BaseModel):
    """Identifica o nome da doença"""

    nome: str = Field(..., description="Nome da doença (ex.: hipertensão arterial)")
    parte_do_corpo: Optional[str] = Field(None, description="Parte do corpo afetada pela doença (ex.: coração)")

class ProcedimentoMedico(BaseModel):
    """Identifica o nome do procedimento médico"""

    nome: str = Field(..., description="Nome do procedimento médico (ex.: cirurgia cardíaca)")
    
class TodosSaude(BaseModel):
    """Identifica as informações sobre todos os medicamentos mencionados em um texto"""
    
    doencas: Sequence[Doenca] = Field(..., description="Lista de doenças identificadas no texto")
    procedimentos_medicos: Sequence[ProcedimentoMedico] = Field(..., description="Lista de procedimentos médicos identificados no texto")
    medicamentos: Sequence[Medicamento] = Field(..., description="Lista de medicamentos identificados no texto")



In [27]:
# Import the load_dotenv function from the dotenv module to load environment variables from a .env file
from dotenv import load_dotenv

# Load environment variables from a .env file into the environment
load_dotenv()

# Initialize the ChatOpenAI model with specific parameters
# temperature: controls the randomness of the model's output
# model: specifies the model version to use
# timeout: sets the maximum time to wait for a response
# max_retries: defines the number of retry attempts in case of failure
gpt4_model = ChatOpenAI(temperature=0, model='gpt-4o-mini', timeout=10, max_retries=10)

# Define a prompt template for the chat model using ChatPromptTemplate
# The prompt consists of a system message and a human message
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Você é um algoritmo perfeito para extração de entidades nomeadas sobre medicamentos. "
            "Você deve extrair informações relevantes do texto, exatamente como está escrito no texto. "
            "Se você não souber o valor de um atributo solicitado para extrair, retorne nulo para o valor do atributo.",
        ),
        ("human", "{text}"),  # Placeholder for the human input text
    ]
)

# Create a processing chain by combining the prompt template and the chat model
# The chain will take the prompt and pass it to the model, expecting structured output in the TodosSaude format
chain = prompt | gpt4_model.with_structured_output(TodosSaude)

In [28]:
gpt_output = chain.invoke(input_text)
gpt_output

TodosSaude(doencas=[Doenca(nome='cefaleia', parte_do_corpo=None)], procedimentos_medicos=[ProcedimentoMedico(nome='raio-x'), ProcedimentoMedico(nome='troca de TOT'), ProcedimentoMedico(nome='desmame ventilatório'), ProcedimentoMedico(nome='relaparotomia exploradora')], medicamentos=[Medicamento(principio_ativo='dormonid', nome_comercial=None, dosagem='7 ml/h'), Medicamento(principio_ativo='fentanil', nome_comercial=None, dosagem='2 ml/h')])

In [29]:
gpt_output.doencas

[Doenca(nome='cefaleia', parte_do_corpo=None)]

In [30]:
gpt_output.procedimentos_medicos

[ProcedimentoMedico(nome='raio-x'),
 ProcedimentoMedico(nome='troca de TOT'),
 ProcedimentoMedico(nome='desmame ventilatório'),
 ProcedimentoMedico(nome='relaparotomia exploradora')]

In [31]:
gpt_output.medicamentos

[Medicamento(principio_ativo='dormonid', nome_comercial=None, dosagem='7 ml/h'),
 Medicamento(principio_ativo='fentanil', nome_comercial=None, dosagem='2 ml/h')]

In [32]:
type(gpt_output)

__main__.TodosSaude

In [36]:
import re
from typing import List, Tuple

def extract_entities(gpt_output: TodosSaude, text: str) -> List[Tuple[int, int, str]]:
    """
    Extract entities (medications, medical procedures, diseases) from the input text.

    Args:
        gpt_output (TodosSaude): The output from GPT containing medications, medical procedures, and diseases.
        text (str): The input text from which to extract entities.

    Returns:
        List[Tuple[int, int, str]]: A list of tuples, each containing the start index, end index, and type of an entity.
    """
    # Initialize lists to store the spans of medications, medical procedures, and diseases
    medication_spans = []
    procedure_spans = []
    disease_spans = []

    # Helper function to find and append spans
    def find_spans(entity_name: str, entity_type: str, spans_list: List[Tuple[int, int, str]]):
        if entity_name:
            for match in re.finditer(entity_name, text, re.IGNORECASE):
                spans_list.append((match.start(), match.end(), entity_type))

    # Extract medication spans
    for medication in gpt_output.medicamentos:
        find_spans(medication.principio_ativo, 'Medicamento', medication_spans)
        find_spans(medication.nome_comercial, 'Medicamento', medication_spans)

    # Extract medical procedure spans
    for procedure in gpt_output.procedimentos_medicos:
        find_spans(procedure.nome, 'Procedimento', procedure_spans)

    # Extract disease spans
    for disease in gpt_output.doencas:
        find_spans(disease.nome, 'Doença', disease_spans)

    # Combine all the spans into one list
    all_spans = medication_spans + procedure_spans + disease_spans

    # Sort the entities by start index
    sorted_spans = sorted(all_spans, key=lambda x: x[0])

    return sorted_spans

# Example usage
entities_span = extract_entities(gpt_output, input_text)
print(entities_span)

[(16, 22, 'Procedimento'), (146, 152, 'Procedimento'), (542, 554, 'Procedimento'), (683, 703, 'Procedimento'), (817, 825, 'Medicamento'), (834, 842, 'Medicamento'), (1195, 1220, 'Procedimento')]


In [37]:
# Import the displacy module from spaCy for visualizing named entities
from spacy import displacy

# Define a dictionary to map entity labels to specific colors for visualization
colors = {
    'Medicamento': '#2ECC71',  # Green color for 'Medicamento' entities
    'Procedimento': '#3498DB',  # Blue color for 'Procedimento' entities
    'Doença': '#E74C3C'        # Red color for 'Doença' entities
}

# Define options for displacy visualization, including the entity labels and their corresponding colors
options = {"ents": list(colors.keys()), "colors": colors}

# Prepare the data for rendering by displacy
# 'render_data' is a list of dictionaries, each containing the text and its corresponding entities
render_data = [{"text": input_text, "ents": [{"start": i[0], "end": i[1], "label": i[2]} for i in entities_span]}]

# Render the entities in the text using displacy with the specified options
# 'style="ent"' specifies that we are visualizing named entities
# 'manual=True' indicates that we are providing the entities manually
# 'jupyter=True' ensures the visualization is displayed correctly in a Jupyter notebook
displacy.render(render_data, style="ent", manual=True, jupyter=True, options=options)

## Leveraging Pre-trained Models for Named Entity Recognition (NER)

### What are Pre-trained Models?

In the context of Natural Language Processing (NLP), pre-trained models are powerful tools that can significantly expedite and enhance your NER tasks.  Imagine them as neural networks already trained on vast amounts of data. These models come "pre-loaded" with knowledge about language structure, grammar, and even specific entities, depending on their training data. This means you can leverage their existing knowledge without having to train a model from scratch.

Instead of training your own model, you can use these pre-trained models as a starting point. This is particularly beneficial when:

- **Limited Data:** You have a small dataset for your specific NER task.
- **Time Constraints:** Training a robust NER model from scratch can be computationally expensive and time-consuming. 

###  Choosing the Right Pre-trained Model: Factors to Consider

While pre-trained models offer convenience and efficiency, it's crucial to understand their limitations and select them carefully:

- **Domain Specificity:** Pre-trained models are often trained on specific types of data. For instance, a model trained on legal documents might not perform well on medical records. Always choose a model trained on data similar to your target domain.

    - **Example:** If you're identifying diseases in medical records, a model pre-trained on biomedical literature would be a better choice than a model trained on news articles.

- **Label Alignment:**  The labels used during the pre-training phase dictate the entities a model can recognize. 

    - **Example:** A model trained to identify people and locations in news articles won't be effective in recognizing chemical compounds in scientific papers.

- **Performance Evaluation:** Even within a relevant domain, different pre-trained models will have varying levels of performance. It's essential to evaluate and compare different models on a subset of your data to determine the best fit for your specific NER task.  

> **Key Takeaway:**  Using a pre-trained model doesn't mean you can skip the evaluation process. Always test and fine-tune the chosen model on your data to ensure optimal performance for your specific NER application. 

One example of a pretrained model use case was presented before in this notebook, as seen in our first example discussed earlier:


In [38]:
# Import the spaCy library and the displacy module for visualizing named entities
import spacy
from spacy import displacy

# Load the Portuguese language model 'pt_core_news_lg'
# Note: You can also use 'pt_core_news_sm' for a smaller model. To install it, run: python -m spacy download pt_core_news_lg
nlp = spacy.load('pt_core_news_lg')

# Define a text example containing various entities such as names, locations, and organizations
text_example = (
    "Meu nome é Elias Jacob e eu moro em Natal, Rio Grande do Norte. Eu trabalho no Instituto Metrópole Digital, "
    "que é a unidade mais bacana da UFRN. Desde 2021 eu também trabalho como Corregedor da UFRN. Quando a pandemia "
    "começou, no início de 2020, eu estava com as malas prontas para uma viagem de férias para o Japão. Eu até fui "
    "buscar meu visto no Consulado em Recife, mas, quando chegou mais perto da viagem, meus voos foram todos cancelados "
    "pela United Airlines e eu não viajei. No dia 11 de novembro de 2023, eu fui no show do Roger Waters em São Paulo. "
    "Felizmente, eu conseguir dar uma passada no Kidoairaku, meu restaurante japonês favorito lá."
)

# Process the text example using the loaded spaCy model to create a Doc object
doc = nlp(text_example)

# Render the named entities in the text using displacy with the 'ent' style
# 'jupyter=True' ensures the visualization is displayed correctly in a Jupyter notebook
displacy.render(doc, style='ent', jupyter=True)


In fact, when we load a Spacy model, we're essentially leveraging a pretrained model previously trained by Spacy developers to recognize specific entities such as: LOC, MISC, ORG, PER. 


## Harnessing the Power of HuggingFace Transformers for NLP Tasks

The [HuggingFace Transformers](https://huggingface.co/transformers/) library stands as one of the most comprehensive libraries crafted specifically to support Natural Language Processing (NLP) tasks. This powerful tool provides access to a vast collection of pre-trained models that have been fine-tuned on an assortment of NLP tasks. These tasks range from text classification, question-answering, and named entity recognition to text generation, and much more.

What makes HuggingFace Transformers truly remarkable is its compatibility with popular deep learning frameworks like PyTorch and TensorFlow. This implies that with just a handful of lines of code, we can harness these state-of-the-art pre-trained models and execute complex NLP tasks such as Named Entity Recognition (NER) on our datasets.

Here's a step-by-step guide to illustrate how we can utilize HuggingFace Transformers:

### Step 1: Accessing HuggingFace's Model Repository

Our journey begins by visiting the [HuggingFace model repository](https://huggingface.co/models), which serves as a treasure trove of various pre-trained models ready for use.

### Step 2: Filtering Models Based on Task and Language

Once we're at the repository, we will be presented with various options to filter the available models. For our purpose of performing Token Classification (NER is a form of token classification) in Portuguese, we need to set the task filter to 'Token Classification' and the language filter to 'Portuguese'.

### Step 3: Selecting the Right Model 

Having applied the filters, we now get to choose the specific model suitable for our data. Here are some model suggestions based on their specializations:

* [Clinical NER PT - Diagnostic](https://huggingface.co/pucpr/clinicalnerpt-diagnostic)
* [Clinical NER PT - Disorder](https://huggingface.co/pucpr/clinicalnerpt-disorder)
* [Clinical NER PT - Disease](https://huggingface.co/pucpr/clinicalnerpt-disease)
* [Clinical NER PT - Chemical](https://huggingface.co/pucpr/clinicalnerpt-chemical)
* [Clinical NER PT - Laboratory](https://huggingface.co/pucpr/clinicalnerpt-laboratory)
* [Clinical NER PT - Finding](https://huggingface.co/pucpr/clinicalnerpt-finding)
* [Clinical NER PT - Medical](https://huggingface.co/pucpr/clinicalnerpt-medical)
* [Clinical NER PT - Procedure](https://huggingface.co/pucpr/clinicalnerpt-procedure)
* [Clinical NER PT - Healthcare](https://huggingface.co/pucpr/clinicalnerpt-healthcare)
* [TempClin BioBertPT - All](https://huggingface.co/pucpr-br/tempclin-biobertpt-all)

These links lead towards the detailed overview of each model, including their usage scripts, which you can directly apply to your data.

This comprehensive guide hopefully clarifies not just the importance but also the process of implementing HuggingFace Transformers for NLP tasks such as NER.

In [40]:
with open('data/healthcare/example_medical_raw.txt', 'r') as f:
    txt_example = f.read()

print(txt_example)

10:00: AVALIADO RAIO-X, SNE MAL POSICIONADA, REPASSADA SEM INTERCORRÊNCIA NO PROCEDIMENTO. AUSCULTA ABDOMINAL POSITIVA. RETIRO FIO GUIA, SOLICITO RAIO-X PARA AVALIAÇÃO DA POSIÇÃO. 17:50: CLIENTE RELATOU CEFALÉIA NO PERÍODO DA MANHÃ E FOI MEDICADO. CONSCIENTE, PUPILAS ISOCÓRICAS FOTORREAGENTES, COMUNICATIVO, EUPNÉICO, REPOUSO NO LEITO, ACEITANDO BEM A DIETA, SEM ACESSO PERIFÉRICO, APRESENTANDO EDEMA EM OMBRO E, MANTENDO TIPÓIA DE ATADURA EM MSE, ABDÔMEN PLANO E FLÁCIDO, ELIMINAÇÕES PRESENTES. 20:50 h: Paciente com cuff furado, realizado troca de TOT, instalado TOT 7.5, VM modo controlado VC 500, PEEP 07, FIO2 a 50%. 22:30h: ESQUEMA DE SUPERVISÃO NOTURNA: TRAQUEOSTOMIZADO, EM DESMAME VENTILATÓRIO EM PS DE 17, PEEP 5, FIO2 50%. PERMANECE EXTREMAMENTE AGITADO (não responde à sedação intermitente). REINICIADO  DORMONID 7 ML/H, FENTANIL 2 ML/H (continua não respondendo). PERMANECE AGITADO, ARRANCOU SOE. ACESSO VENOSO PERIFÉRICO EM MSD COM PLANO BÁSICO DE HIDRATAÇÃO + SEDAÇÃO. POLISSECRETIVO 

In [41]:
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Define the model name for the clinical named entity recognition (NER) task
model_name = 'pucpr/clinicalnerpt-chemical'

# Load the tokenizer associated with the specified model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the pre-trained model for token classification
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create a pipeline for named entity recognition (NER) using the loaded model and tokenizer
# 'aggregation_strategy' is set to 'first' to aggregate tokens into entities based on the first token
# 'device=-1' indicates that the pipeline should run on the CPU (use 'device=0' for GPU)
nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy='first', device=-1)



In [42]:
entities = nlp(txt_example)
entities

[{'entity_group': 'ChemicalDrugs',
  'score': 0.93268013,
  'word': 'dormonid',
  'start': 817,
  'end': 825},
 {'entity_group': 'ChemicalDrugs',
  'score': 0.9052683,
  'word': 'fentanil',
  'start': 834,
  'end': 842}]

In [43]:
import random
from typing import List

def get_random_colors(num_colors: int) -> List[str]:
    """
    Returns a list of random colors from a predefined set of color codes.

    Args:
        num_colors (int): The number of random colors to return. Must be less than or equal to the number of available colors.

    Returns:
        List[str]: A list of random color codes in hexadecimal format.

    Raises:
        AssertionError: If num_colors is greater than the number of available colors.

    Example:
        >>> get_random_colors(5)
        ['#FF0000', '#00FF00', '#0000FF', '#FFFF00', '#00FFFF']
    """
    # Predefined list of color codes
    color_codes = [
        '#FF0000',  # Red
        '#00FF00',  # Lime
        '#0000FF',  # Blue
        '#FFFF00',  # Yellow
        '#00FFFF',  # Aqua
        '#FF00FF',  # Fuchsia
        '#C0C0C0',  # Silver
        '#808080',  # Gray
        '#800000',  # Maroon
        '#808000',  # Olive
        '#008000',  # Green
        '#800080',  # Purple
        '#008080',  # Teal
        '#000080',  # Navy
        '#FFA500',  # Orange
        '#A52A2A',  # Brown
        '#8B4513',  # SaddleBrown
        '#5F9EA0',  # CadetBlue
        '#7FFF00',  # Chartreuse
        '#D2691E',  # Chocolate
        '#FF7F50',  # Coral
        '#6495ED',  # CornflowerBlue
        '#DC143C',  # Crimson
        '#00FFFF',  # Cyan
        '#00008B',  # DarkBlue
        '#008B8B',  # DarkCyan
        '#B8860B',  # DarkGoldenRod
        '#A9A9A9'   # DarkGray
    ]

    # Ensure the requested number of colors does not exceed the available colors
    assert num_colors <= len(color_codes), f'You can only get up to {len(color_codes)} colors'

    # Return a random sample of the requested number of colors
    return random.sample(color_codes, num_colors)

# Example usage
get_random_colors(5)

['#008000', '#800000', '#FFFF00', '#A52A2A', '#00FFFF']

In [44]:
from typing import List, Tuple, Dict, Optional
from spacy import displacy

def render_named_entities(text: str, entity_spans: List[Tuple[int, int, str]], colors: Optional[Dict[str, str]] = None) -> None:
    """
    Render named entities in the text using displaCy.

    Args:
        text (str): The text to render.
        entity_spans (List[Tuple[int, int, str]]): A list of tuples, each containing the start index, end index, and type of an entity.
        colors (Optional[Dict[str, str]]): A dictionary mapping entity types to colors. If None, random colors will be used.

    Returns:
        None

    Example:
        >>> text = "John Doe works at OpenAI."
        >>> entity_spans = [(0, 8, 'PERSON'), (18, 24, 'ORG')]
        >>> render_named_entities(text, entity_spans)
    """
    # If no colors are provided, generate random colors for each entity type
    if colors is None:
        entity_types = list(set([span[2] for span in entity_spans]))
        random_colors = get_random_colors(len(entity_types))
        colors = {entity_type: random_colors[i] for i, entity_type in enumerate(entity_types)}

    # Define the options for displaCy visualization
    displacy_options = {"ents": list(colors.keys()), "colors": colors}

    # Prepare the data for displaCy rendering
    displacy_data = [{"text": text, "ents": [{"start": span[0], "end": span[1], "label": span[2]} for span in entity_spans]}]

    # Render the named entities in the text using displaCy
    displacy.render(displacy_data, style="ent", manual=True, jupyter=True, options=displacy_options)

In [45]:
from typing import List, Tuple, Dict, Optional
from transformers import Pipeline

def perform_inference_and_render_entities(huggingface_pipeline: Pipeline, text: str, colors: Optional[Dict[str, str]] = None) -> None:
    """
    Perform inference using a Hugging Face pipeline and render the entities in the text.

    Args:
        huggingface_pipeline (Pipeline): The Hugging Face pipeline to use for inference.
        text (str): The text to perform inference on and render.
        colors (Optional[Dict[str, str]]): A dictionary mapping entity types to colors. If None, random colors will be used.

    Returns:
        None
    """
    # Perform inference using the Hugging Face pipeline
    extracted_entities = huggingface_pipeline(text)
    
    # Prepare a list of entity spans
    entity_spans = []
    for entity in extracted_entities:
        entity_spans.append((entity['start'], entity['end'], entity['entity_group']))
    
    # Render the entities in the text
    render_named_entities(text, entity_spans, colors)

In [46]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Initialize an empty dictionary to store the loaded models
models_dict = {}

# List of model names to be loaded
models_name_list = [
    'pucpr/clinicalnerpt-diagnostic',
    'pucpr/clinicalnerpt-disorder',
    'pucpr/clinicalnerpt-disease',
    'pucpr/clinicalnerpt-chemical',
    'pucpr/clinicalnerpt-laboratory',
    'pucpr/clinicalnerpt-finding',
    'pucpr/clinicalnerpt-medical',
    'pucpr/clinicalnerpt-procedure',
    'pucpr/clinicalnerpt-healthcare',
    'pucpr-br/tempclin-biobertpt-all',
]

# Loop through each model name in the list to load and initialize the models
for model_name in models_name_list:
    print(f'Loading model {model_name}')  # Print the name of the model being loaded
    
    # Load the tokenizer for the current model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Load the pre-trained model for token classification
    model = AutoModelForTokenClassification.from_pretrained(model_name)
    
    # Create a named entity recognition (NER) pipeline using the loaded model and tokenizer
    # 'aggregation_strategy' is set to 'first' to aggregate tokens into entities based on the first token
    # 'device=-1' indicates that the pipeline should run on the CPU (use 'device=0' for GPU)
    nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy='first', device=-1)
    
    # Store the NER pipeline in the dictionary with the model name as the key
    models_dict[model_name] = nlp

Loading model pucpr/clinicalnerpt-diagnostic


tokenizer_config.json:   0%|          | 0.00/151 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.08k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/709M [00:00<?, ?B/s]

Loading model pucpr/clinicalnerpt-disorder
Loading model pucpr/clinicalnerpt-disease


tokenizer_config.json:   0%|          | 0.00/151 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.07k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/709M [00:00<?, ?B/s]

Loading model pucpr/clinicalnerpt-chemical
Loading model pucpr/clinicalnerpt-laboratory


tokenizer_config.json:   0%|          | 0.00/151 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.09k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/709M [00:00<?, ?B/s]

Loading model pucpr/clinicalnerpt-finding


tokenizer_config.json:   0%|          | 0.00/151 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.03k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/709M [00:00<?, ?B/s]

Loading model pucpr/clinicalnerpt-medical


tokenizer_config.json:   0%|          | 0.00/151 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/709M [00:00<?, ?B/s]

Loading model pucpr/clinicalnerpt-procedure


tokenizer_config.json:   0%|          | 0.00/151 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.04k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/709M [00:00<?, ?B/s]

Loading model pucpr/clinicalnerpt-healthcare


tokenizer_config.json:   0%|          | 0.00/151 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.07k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/709M [00:00<?, ?B/s]

Loading model pucpr-br/tempclin-biobertpt-all


tokenizer_config.json:   0%|          | 0.00/621 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.92M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.54k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/709M [00:00<?, ?B/s]

In [47]:
# Iterate over each model in the models_dict dictionary
for model_name, ner_pipeline in models_dict.items():
    # Print the name of the model being used for inference
    print(f'Inference with model {model_name}')
    
    # Perform inference using the NER pipeline and render the entities in the example text
    perform_inference_and_render_entities(ner_pipeline, txt_example)
    
    # Print four newline characters to separate the output of different models
    print('\n' * 4)

Inference with model pucpr/clinicalnerpt-diagnostic







Inference with model pucpr/clinicalnerpt-disorder







Inference with model pucpr/clinicalnerpt-disease







Inference with model pucpr/clinicalnerpt-chemical







Inference with model pucpr/clinicalnerpt-laboratory







Inference with model pucpr/clinicalnerpt-finding







Inference with model pucpr/clinicalnerpt-medical







Inference with model pucpr/clinicalnerpt-procedure







Inference with model pucpr/clinicalnerpt-healthcare







Inference with model pucpr-br/tempclin-biobertpt-all









In [48]:
# Extract the cleaned text from the 6th row of the DataFrame 'df'
txt_example_2 = df.cleaned_text.iloc[5]

# Iterate over each model in the models_dict dictionary
for model_name, ner_pipeline in models_dict.items():
    # Print the name of the model being used for inference
    print(f'Inference with model {model_name}')
    
    # Perform inference using the NER pipeline and render the entities in the extracted text
    perform_inference_and_render_entities(ner_pipeline, txt_example_2)
    
    # Print four newline characters to separate the output of different models
    print('\n' * 4)

Inference with model pucpr/clinicalnerpt-diagnostic







Inference with model pucpr/clinicalnerpt-disorder







Inference with model pucpr/clinicalnerpt-disease







Inference with model pucpr/clinicalnerpt-chemical







Inference with model pucpr/clinicalnerpt-laboratory







Inference with model pucpr/clinicalnerpt-finding







Inference with model pucpr/clinicalnerpt-medical







Inference with model pucpr/clinicalnerpt-procedure







Inference with model pucpr/clinicalnerpt-healthcare







Inference with model pucpr-br/tempclin-biobertpt-all









## Dealing with Large Texts using HuggingFace Transformers

HuggingFace Transformers is a powerful library that simplifies the use of Transformer models. However, a common limitation in these models is their restriction on input text length, typically capped at 512 tokens. This limitation stems from the computational demands of the self-attention mechanism inherent to Transformer architectures.

### Understanding the Bottleneck: Self-Attention and Quadratic Complexity

The self-attention mechanism, crucial for Transformers to capture relationships between words in a sequence, involves calculating attention scores between every pair of tokens. This process results in quadratic complexity: as the number of tokens (text length) increases, the number of connections and computations required grows quadratically. This quadratic scaling leads to significant memory and computational demands, making it impractical for standard Transformers to handle very long sequences. 

<center>
<img src="images/transformer_quadratic.webp" alt="" style="width: 50%; height: 50%"/>
</center>

For a deeper dive into this constraint, refer to this [informative article](http://jalammar.github.io/illustrated-transformer/) or explore [this interactive visualization](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb). While research is exploring less computationally intensive attention mechanisms, this discussion will focus on practical strategies for managing the existing constraint.

### Text Segmentation: Addressing the 512 Token Limit

When working with Transformer-based models, ensure your text doesn't exceed the 512 token limit. But what defines a token in this context?

> **What is a Token?**
>
> In essence, a token is a meaningful unit of text. While we often perceive words as individual tokens, Transformer models utilize **subword tokenization**. This approach breaks down words into smaller units, which may include parts of words or byte-pair encodings (BPE). Subword tokenization allows the model to represent any word, even those absent from its vocabulary, by combining these smaller units.

This approach offers several advantages:

- **Handling Out-of-Vocabulary Words:** By breaking down words into smaller units, the model can represent unseen words as combinations of known subword tokens.
- **Reduced Vocabulary Size:** Representing words through subword units leads to a smaller overall vocabulary size, which can improve efficiency. 

Consider the example below for a clearer understanding:

In [49]:
from transformers import AutoTokenizer

# Load the pre-trained tokenizer for the BERT model in Portuguese
# The model "neuralmind/bert-base-portuguese-cased" is a BERT model trained on Portuguese text
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")

# Define a sample text in Portuguese
text_1 = 'Eu gosto de farofa.'

# Tokenize the sample text using the loaded tokenizer
# The tokenizer converts the text into a format suitable for input to the BERT model
text_1_tokenized = tokenizer(text_1)

# Display the tokenized output
# The output includes token IDs, attention masks, and other information required by the model
text_1_tokenized

{'input_ids': [101, 3396, 10303, 125, 5546, 6288, 22278, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Above, we can see the sentence `"Eu gosto de farofa"` has 9 tokens in total (`[101, 3396, 10303, 125, 5546, 6288, 22278, 119, 102]`). Let's check what they mean

In [50]:
# Iterate over each token ID in the 'input_ids' list from the tokenized text
for token_id in text_1_tokenized['input_ids']:
    # Print the token ID and its corresponding decoded token (i.e., the original text representation)
    # The 'decode' method converts the token ID back to its string representation
    print(token_id, '\t\t', f'"{tokenizer.decode(token_id)}"')

101 		 "[CLS]"
3396 		 "Eu"
10303 		 "gosto"
125 		 "de"
5546 		 "far"
6288 		 "##of"
22278 		 "##a"
119 		 "."
102 		 "[SEP]"


### Key Special Tokens

In Transformer-based models, special tokens such as `[CLS]` and `[SEP]` play essential roles. These tokens are added by the tokenizer at specific locations in the text to provide the model with crucial context about the text's structure.

- **[CLS]**: Known as the Classification token, `[CLS]` is automatically inserted at the beginning of the text during the tokenization process. The model uses this token to understand the start of a piece of text and often utilizes the representation of this token for classification tasks.

- **[SEP]**: The Separator token `[SEP]` is placed at the end of the text or between two texts when comparing texts. This token helps the model discern where one piece of information ends and another begins, aiding in tasks such as sentence pair classification and question-answering.

To further clarify, consider the following analogy:

- **[CLS] Token**: Think of it as the title of a book. It gives you an idea of what the content is about right from the beginning.
- **[SEP] Token**: Imagine it as a chapter divider in a book. It clearly indicates where one chapter (or piece of information) ends and another begins, helping you navigate through the content.


### Tokenization Process

When processing input data, Transformer models use a tokenization process to break down words into manageable pieces called tokens. This process is needed for handling the diverse vocabulary and structure of natural language. Here are some examples to illustrate how tokenization works:

- **"Eu"**: This word is present in the model's vocabulary and is represented as a single token.
- **"gosto"**: Also present in the model's vocabulary and converted into one token.
- **"de"**: Another word from the model's vocabulary represented as a single token.
- **"farofa"**: A word not present in the model's vocabulary. Hence, it gets broken down into three subword tokens: `"far"`, `"##of"`, and `"##a"`.

This tokenization strategy ensures that every kind of word, even unknown ones, can be processed by the model, enabling it to handle a wide variety of linguistic inputs.

### Importance of Tokenizer-Specific Conventions

It's important to note that special tokens are a convention dependent on the tokenizer used to train the model. For instance:

- **BERT-based models**: Most will have the `[CLS]` and `[SEP]` tokens.
- **GPT-2 models**: These models do not use `[CLS]` and `[SEP]` tokens.

Moreover, different tokenizers from the same architecture may have different tokens as well. Understanding the specific conventions of the tokenizer you are working with is crucial for effectively utilizing Transformer models.


In [51]:
# Import the AutoTokenizer class from the transformers library
from transformers import AutoTokenizer

# Load the pre-trained tokenizer for the multilingual BERT model
# The model "bert-base-multilingual-cased" is a BERT model trained on multiple languages
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Define a sample text in Portuguese
text_1 = 'Eu gosto de farofa.'

# Tokenize the sample text using the loaded tokenizer
# The tokenizer converts the text into a format suitable for input to the BERT model
text_1_tokenized = tokenizer(text_1)

# Display the tokenized output
# The output includes token IDs, attention masks, and other information required by the model
text_1_tokenized

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

{'input_ids': [101, 41859, 11783, 15340, 10104, 13301, 89549, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [52]:
# Iterate over each token ID in the 'input_ids' list from the tokenized text
for token_id in text_1_tokenized['input_ids']:
    # Print the token ID and its corresponding decoded token (i.e., the original text representation)
    # The 'decode' method converts the token ID back to its string representation
    print(token_id, '\t\t', f'"{tokenizer.decode(token_id)}"')

101 		 "[CLS]"
41859 		 "Eu"
11783 		 "go"
15340 		 "##sto"
10104 		 "de"
13301 		 "far"
89549 		 "##ofa"
119 		 "."
102 		 "[SEP]"


In [53]:
# Import the AutoTokenizer class from the transformers library
from transformers import AutoTokenizer

# Load the pre-trained tokenizer for the GPT-2 model
# The model "gpt2" is a generative pre-trained transformer model
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Define a sample text in Portuguese
text_1 = 'Eu gosto de farofa.'

# Tokenize the sample text using the loaded tokenizer
# The tokenizer converts the text into a format suitable for input to the GPT-2 model
text_1_tokenized = tokenizer(text_1)

# Display the tokenized output
# The output includes token IDs, attention masks, and other information required by the model
text_1_tokenized

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

{'input_ids': [36, 84, 308, 455, 78, 390, 1290, 1659, 64, 13], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [54]:
# Iterate over each token ID in the 'input_ids' list from the tokenized text
for token in text_1_tokenized['input_ids']:
    # Print the token ID and its corresponding decoded token (i.e., the original text representation)
    # The 'decode' method converts the token ID back to its string representation
    print(token, '\t\t', f'"{tokenizer.decode(token)}"')

36 		 "E"
84 		 "u"
308 		 " g"
455 		 "ost"
78 		 "o"
390 		 " de"
1290 		 " far"
1659 		 "of"
64 		 "a"
13 		 "."


### Strategies for Handling Large Texts

Transformer models, while highly effective, have limitations when it comes to handling large texts. Here are several strategies to manage these limitations:

#### 1. **Text Pruning**
Text pruning involves trimming the text to fit within the transformer's token limit. While straightforward, this technique has significant drawbacks:
- **Risk of Information Loss**: Important information might be discarded, potentially affecting the model's performance on tasks requiring comprehensive understanding.
- **Context Disruption**: Removing parts of the text can disrupt the context, making it harder for the model to grasp the overall meaning.

#### 2. **Chunking**
Chunking is a more balanced approach, especially suitable for lengthy documents. Here’s how it works:
- **Segment Division**: The text is divided into smaller chunks or segments, each containing fewer than the maximum number of tokens the transformer can handle.
- **Maintaining Context**: By carefully choosing chunk boundaries, you can preserve essential context within each segment.
- **Example**: Think of chunking as dividing a book into chapters. Each chapter is self-contained but contributes to the overall story.

#### 3. **Sliding Window Approach**
The sliding window approach enhances chunking by maintaining context across segments:
- **Overlapping Segments**: Instead of simply partitioning the text, this method uses overlapping windows, where part of the previous segment’s content is included in the current segment.
- **Context Preservation**: This overlap ensures that the context flows smoothly from one segment to the next, much like how a series of overlapping photographs can create a panoramic view.
- **Example**: Imagine reading a story where each page shares a few sentences with the previous page. This overlap helps you remember the context as you move forward.

#### 4. **Using Transformers with Larger Input Sizes**
Some transformer models are designed to handle larger inputs. These include:
- **Longformer**: Developed to handle longer sequences by combining local and global attention mechanisms.
- **BigBird**: Uses sparse attention mechanisms to manage longer texts efficiently.
- **Current Limitations**: While these models are promising, they are still being refined and may not yet match the maturity and accuracy of established transformer models.
- **Technical Trade-offs**: These models use different techniques to approximate full self-attention mechanisms, balancing memory efficiency with potential accuracy trade-offs.

### Focus on Chunking
For simplicity and practicality, we will focus on the chunking strategy. Chunking is effective for managing large texts without significantly compromising context or important information.

> **Note**: When implementing chunking, consider the natural breaks in the text, such as paragraphs or sentences, to ensure each chunk is as meaningful and self-contained as possible.


In [55]:
# Lets check how many tokens each clinical note has considering a given tokenizer

# Define the model checkpoint
model_checkpoint = 'pucpr/clinicalnerpt-disorder'

# Load the tokenizer and the model from the checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

# Create a named entity recognition pipeline using the model and tokenizer
ner_pipeline = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy='first', device=-1)

# Tokenize the cleaned text from the dataframe
tokenized_texts = tokenizer(df['cleaned_text'].values.tolist(), truncation=False)

# Calculate the length of each tokenized text
tokenized_text_lengths = [len(tokenized_text) for tokenized_text in tokenized_texts['input_ids']]

# Create a pandas Series from the lengths and describe it
token_lengths_series = pd.Series(tokenized_text_lengths)
token_lengths_description = token_lengths_series.describe()

# Print the description
print(token_lengths_description)


Token indices sequence length is longer than the specified maximum sequence length for this model (3283 > 512). Running this sequence through the model will result in indexing errors


count    1237.000000
mean     2091.936944
std       891.235993
min        23.000000
25%      1470.000000
50%      1968.000000
75%      2606.000000
max      5475.000000
dtype: float64


In [58]:
import plotly.express as px
import pandas as pd
from typing import List

def plot_token_distribution(token_lengths: List[int], with_gpt4=False) -> None:
    """
    Plot a histogram of token lengths with vertical lines indicating important thresholds.

    Args:
        token_lengths (List[int]): The lengths of the tokenized texts.
        with_gpt4 (bool): Whether to include the GPT-4 maximum input size in the plot.

    Returns:
        None
    """
    # Create a histogram of the token lengths
    histogram_figure = px.histogram(pd.Series(token_lengths), nbins=100)

    # Add vertical lines at 512, 7680, and 8192 tokens
    histogram_figure.add_vline(x=512, line_width=3, line_dash="dash", line_color="red")
    histogram_figure.add_vline(x=7680, line_width=3, line_dash="dash", line_color="green")
    histogram_figure.add_vline(x=8192, line_width=3, line_dash="dash", line_color="blue")

    # Add annotations for the vertical lines
    histogram_figure.add_annotation(x=512, y=100, text="512 tokens", showarrow=True, arrowhead=1)
    histogram_figure.add_annotation(x=7680, y=100, text="I have trained a model that handles sentences up to here", showarrow=True, arrowhead=1)

    if with_gpt4:
        # Add a vertical line at 128000 tokens (GPT-4 max input size)
        histogram_figure.add_vline(x=128000, line_width=3, line_dash="dash", line_color="purple")
        histogram_figure.add_annotation(x=128000, y=100, text="128000 tokens (GPT-4 max input size)", showarrow=True, arrowhead=1)
    # Hide the legend
    histogram_figure.update_layout(showlegend=False)

    # Display the plot
    histogram_figure.show()

# Call the function with the token lengths
plot_token_distribution(token_lengths_description)

In [59]:
plot_token_distribution(token_lengths_description, with_gpt4=True)

In [61]:
# Find the index of the second longest tokenized text
# 'tokenized_text_lengths' is a list of lengths of tokenized texts
# 'sorted(tokenized_text_lengths)[-2]' gets the second largest length
# 'index()' finds the index of this length in the original list
id_second_biggest_text = tokenized_text_lengths.index(sorted(tokenized_text_lengths)[-2])

# Retrieve the cleaned text corresponding to the second longest tokenized text
# 'df.iloc[id_second_biggest_text]' gets the row at the specified index
# '.cleaned_text' accesses the 'cleaned_text' column of that row
huge_text = df.iloc[id_second_biggest_text].cleaned_text

# Display the retrieved text
print(huge_text)

Data de Desligamento: MÉDICO RESPONSÁVEL: JOSILOWSKY, ODELL MD MÉDICO DE CUIDADOS PRIMÁRIOS: Dr. Freddie Fluegge, 216-754-4187 PRINCIPAL QUEIXA: Sepse. HISTÓRICO DA DOENÇA ATUAL: A Sra. Wiest é uma mulher de 83 anos com vários problemas médicos, incluindo doença arterial coronariana pós-CABG, diabetes mellitus, ICC (FE 55%) e fibrilação atrial, que apresentou hipotensão, provavelmente secundária a sepse de celulite na extremidade inferior direita. Ela desenvolveu celulite na extremidade inferior direita há um mês. Dois dias antes da admissão, a paciente desenvolveu piora de eritema, edema e sensibilidade junto com calafrios e suores. Ela foi levada para a sala de emergência em 07/08/2006 e apresentava hipotensão com pressão arterial sistólica na faixa dos 80. Ela recebeu líquidos intravenosos e iniciou vancomicina, aztreonam e clindamicina, mas permaneceu hipotensa. Ela começou a receber dopamina, Levophed e vasopressina, bem como Decadron. A cirurgia foi consultada para realizar uma b

In [62]:
len(huge_text), len(huge_text.split())

(19077, 2931)

In [63]:
len(tokenizer(huge_text, truncation=True)['input_ids'])

512

In [64]:
len(tokenizer(huge_text, truncation=False)['input_ids'])

5387

In [65]:
# As you can see below, everything after the 512th token is ignored

perform_inference_and_render_entities(nlp, huge_text)

### Simple chunking

Simple Chunking is a strategy that involves dividing or 'chunking' the text into numerous segments, each containing no more than 512 tokens. This allows larger bodies of text to be processed by the model incrementally, circumventing the token cap.
An inherent problem with the Simple Chunking approach lies in its lack of context preservation between chunks. As the text is divided into separate blocks of 512 tokens, information or meaning that spans across these segments may get lost in the process.

This absence of contextual continuity might affect the model's understanding and performance, especially when working with long texts where the meaningful connection between sentences or paragraphs is critical.

Nevertheless, Simple Chunking serves as a practical starting point for handling large volumes of text data with HuggingFace Transformers. It lays the groundwork for other, more sophisticated techniques which seek to maintain context between chunks, thus resulting in enhanced performance when dealing with extensive text data.

Let's see how to implement it

In [66]:
from transformers import PreTrainedTokenizer
from typing import List

def simple_chunk_text_into_segments(text: str, chunk_size: int, tokenizer: PreTrainedTokenizer) -> List[str]:
    """
    Chunk a large text into smaller segments using a tokenizer.

    Args:
        text (str): The text to be chunked.
        chunk_size (int): The size of each chunk.
        tokenizer (PreTrainedTokenizer): The tokenizer to use for chunking.

    Returns:
        List[str]: A list of text chunks.
    """
    # Tokenize the text without truncation, special tokens, and with offset mapping
    tokens = tokenizer(text, truncation=False, return_offsets_mapping=True, add_special_tokens=False)['input_ids'][1:-1]

    # Initialize an empty list to store the chunks
    text_chunks = []

    # Iterate over the tokens in steps of chunk_size
    for i in range(0, len(tokens), chunk_size):
        # Decode the current chunk of tokens into text
        decoded_chunk = tokenizer.decode(tokens[i:i+chunk_size])
        # Append the decoded chunk to the list of chunks
        text_chunks.append(decoded_chunk)

    return text_chunks

# Call the function with the huge text, chunk size, and tokenizer
text_chunks = simple_chunk_text_into_segments(huge_text, 512, tokenizer)

In [67]:
len(text_chunks)

11

In [69]:
print(text_chunks[0])

de desligamento : medico responsavel : josilowsky, odell md medico de cuidados primarios : dr. freddie fluegge, 216 - 754 - 4187 principal queixa : sepse. historico da doenca atual : a sra. wiest e uma mulher de 83 anos com varios problemas medicos, incluindo doenca arterial coronariana pos - cabg, diabetes mellitus, icc ( fe 55 % ) e fibrilacao atrial, que apresentou hipotensao, provavelmente secundaria a sepse de celulite na extremidade inferior direita. ela desenvolveu celulite na extremidade inferior direita ha um mes. dois dias antes da admissao, a paciente desenvolveu piora de eritema, edema e sensibilidade junto com calafrios e suores. ela foi levada para a sala de emergencia em 07 / 08 / 2006 e apresentava hipotensao com pressao arterial sistolica na faixa dos 80. ela recebeu liquidos intravenosos e iniciou vancomicina, aztreonam e clindamicina, mas permaneceu hipotensa. ela comecou a receber dopamina, levophed e vasopressina, bem como decadron. a cirurgia foi consultada para r

In [70]:
# Iterate over each chunk of text in the 'text_chunks' list
for idx, chunk in enumerate(text_chunks):
    # Print the index of the current chunk
    print(f'Chunk {idx}')
    
    # Perform named entity recognition (NER) on the current chunk and render the entities
    # 'nlp' is the NER pipeline
    # 'chunk' is the current text chunk being processed
    # 'colors' specifies the color to use for the 'Disorder' entity type
    perform_inference_and_render_entities(nlp, chunk, colors={'Disorder': '#FFA500'})
    
    # Print two newline characters to separate the output of different chunks
    print('\n\n')

# Note: The comment below highlights a potential issue where entities might be split between chunks.
# This can happen if an entity spans the boundary between two chunks, leading to incomplete entity recognition.
# See the split in the middle of the entity between chunks 5 and 6 or 8 and 9.

Chunk 0





Chunk 1





Chunk 2





Chunk 3





Chunk 4





Chunk 5





Chunk 6





Chunk 7





Chunk 8





Chunk 9





Chunk 10







### Advanced Chunking: Enhancing Text Segmentation with Semantic Awareness

Advanced Chunking represents a significant evolution from Simple Chunking, offering a more sophisticated approach to text segmentation. This method goes beyond mere token count-based division, incorporating semantic understanding to create more meaningful and context-preserving chunks.

#### Key Principles of Advanced Chunking

1. **Semantic-Aware Segmentation**: 
   - Unlike Simple Chunking, Advanced Chunking considers the meaning and structure of the text.
   - It aims to preserve semantic units such as sentences, paragraphs, or even thematic sections.

2. **Boundary Respect**: 
   - Chunks are designed to end at natural linguistic boundaries (e.g., sentence or paragraph endings).
   - This approach maintains the integrity of ideas and reduces the risk of splitting coherent thoughts.

3. **Flexible Token Limits**: 
   - While still adhering to overall token constraints, Advanced Chunking allows for some flexibility to accommodate semantic units.
   - This might result in slightly variable chunk sizes, prioritizing meaning over strict token counts.

#### Advantages of Advanced Chunking

- **Improved Context Preservation**: By keeping semantic units intact, each chunk retains more coherent information.
- **Enhanced Readability**: For both humans and AI models, chunks aligned with natural language structures are easier to process and understand.
- **Better Input for Transformers**: Providing semantically complete chunks can lead to improved performance in various NLP tasks.

#### Challenges and Considerations

1. **Increased Complexity**: 
   - Implementing Advanced Chunking requires more sophisticated algorithms to identify and respect semantic boundaries.
   - This complexity can increase processing time and resource requirements.

2. **Balancing Act**: 
   - There's often a trade-off between maintaining semantic integrity and adhering to token limits.
   - Decisions must be made on when to prioritize semantic completeness over strict token count adherence.

3. **Text Analysis Overhead**: 
   - Advanced Chunking necessitates a deeper analysis of the text structure, potentially including:
     - Sentence boundary detection
     - Paragraph identification
     - Thematic content analysis

4. **Handling Edge Cases**: 
   - Long sentences or paragraphs that exceed token limits require special handling.
   - Strategies might include further subdivision or allowing occasional oversized chunks.

#### Implementation Strategies

1. **Pre-processing**: 
   - Analyze the text to identify sentence and paragraph boundaries.
   - Consider using NLP libraries for more advanced linguistic analysis.

2. **Adaptive Chunking**: 
   - Develop algorithms that can adjust chunk sizes based on semantic units while staying close to token limits.

3. **Hierarchical Approach**: 
   - Consider a multi-level chunking strategy that respects both large (paragraphs) and small (sentences) semantic units.

4. **Post-processing**: 
   - After initial chunking, refine chunks to ensure they don't break mid-sentence or mid-paragraph where possible.

#### Practical Considerations

- **Domain Specificity**: The effectiveness of Advanced Chunking can vary based on the type of text (e.g., technical documents vs. narrative text).
- **Performance Metrics**: Develop ways to measure the quality of chunks beyond just token count adherence.
- **Iterative Refinement**: Be prepared to fine-tune your chunking algorithm based on observed results and specific use case requirements.

> **Challenge**: Implementing Advanced Chunking requires a blend of NLP techniques and creative problem-solving. As you approach this task, consider how you might balance semantic integrity with computational efficiency. What strategies could you employ to handle various text types and structures effectively?


#### Streamlining Advanced Chunking with Langchain

Working with Large Language Models (LLMs) often involves managing extensive texts that exceed the models' token limits. [Langchain](https://www.langchain.com/) simplifies this process by offering intuitive tools for advanced chunking. This section explores how Langchain facilitates this crucial task, enabling seamless interaction with LLMs.

Langchain's strength lies in its ability to break down complex tasks into manageable components. For advanced chunking, it leverages specialized modules called **Text Splitters**. These components intelligently segment large texts into smaller, semantically coherent units, ensuring optimal compatibility with LLM input requirements.

##### Understanding Langchain's Text Splitters

Text Splitters operate on the principle of balancing chunk size with semantic integrity. Here's a breakdown of their typical workflow:

1. **Fine-grained Segmentation**: The text is initially divided into small, meaningful units, often at the sentence level. This step ensures that the fundamental units of meaning are preserved. 

2. **Adaptive Chunk Aggregation**: These small units are then progressively combined to form larger chunks. The aggregation process is carefully controlled, aiming to create chunks that approach the desired size limit without breaking the flow of information.

3. **Context Overlap**: To maintain coherence across chunks, Langchain incorporates an overlap mechanism. This means that consecutive chunks share a small amount of overlapping text, ensuring smoother transitions and preserving context for the LLM.

##### Benefits of Langchain for Advanced Chunking

- **Simplicity and Efficiency**: Langchain abstracts away the complexities of manual chunking, providing a streamlined and efficient solution.
- **Semantic Awareness**: By focusing on meaningful units like sentences, Langchain ensures that chunks retain contextual relevance.
- **Customization Options**: Langchain offers flexibility in configuring chunk size, overlap, and choice of splitting criteria, allowing for adaptation to specific use cases.


In [71]:
# Import the RecursiveCharacterTextSplitter class from the langchain_text_splitters module
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Define the model name for the tokenizer
model_name = 'pucpr/clinicalnerpt-disorder'

# Load the tokenizer for the specified model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Initialize the RecursiveCharacterTextSplitter with the tokenizer and specified parameters
# chunk_size=512: Maximum size of each chunk
# chunk_overlap=0: Number of overlapping characters between chunks
# separators=["\t", "\n", ". "]: List of separators to use for splitting the text
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=512, chunk_overlap=0, separators=["\t", "\n", ". "]
)

# Split the large text into smaller chunks using the text splitter
text_chunks_langchain = text_splitter.split_text(huge_text)

# Print the lengths of the new text chunks, the langchain text chunks, and the original text chunks
len(text_chunks_langchain), len(text_chunks)


`clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884



(13, 11)

In [73]:
print(text_chunks_langchain[0])

Data de Desligamento: MÉDICO RESPONSÁVEL: JOSILOWSKY, ODELL MD MÉDICO DE CUIDADOS PRIMÁRIOS: Dr. Freddie Fluegge, 216-754-4187 PRINCIPAL QUEIXA: Sepse. HISTÓRICO DA DOENÇA ATUAL: A Sra. Wiest é uma mulher de 83 anos com vários problemas médicos, incluindo doença arterial coronariana pós-CABG, diabetes mellitus, ICC (FE 55%) e fibrilação atrial, que apresentou hipotensão, provavelmente secundária a sepse de celulite na extremidade inferior direita. Ela desenvolveu celulite na extremidade inferior direita há um mês. Dois dias antes da admissão, a paciente desenvolveu piora de eritema, edema e sensibilidade junto com calafrios e suores. Ela foi levada para a sala de emergência em 07/08/2006 e apresentava hipotensão com pressão arterial sistólica na faixa dos 80. Ela recebeu líquidos intravenosos e iniciou vancomicina, aztreonam e clindamicina, mas permaneceu hipotensa. Ela começou a receber dopamina, Levophed e vasopressina, bem como Decadron. A cirurgia foi consultada para realizar uma b

In [74]:
print(text_chunks[0])

de desligamento : medico responsavel : josilowsky, odell md medico de cuidados primarios : dr. freddie fluegge, 216 - 754 - 4187 principal queixa : sepse. historico da doenca atual : a sra. wiest e uma mulher de 83 anos com varios problemas medicos, incluindo doenca arterial coronariana pos - cabg, diabetes mellitus, icc ( fe 55 % ) e fibrilacao atrial, que apresentou hipotensao, provavelmente secundaria a sepse de celulite na extremidade inferior direita. ela desenvolveu celulite na extremidade inferior direita ha um mes. dois dias antes da admissao, a paciente desenvolveu piora de eritema, edema e sensibilidade junto com calafrios e suores. ela foi levada para a sala de emergencia em 07 / 08 / 2006 e apresentava hipotensao com pressao arterial sistolica na faixa dos 80. ela recebeu liquidos intravenosos e iniciou vancomicina, aztreonam e clindamicina, mas permaneceu hipotensa. ela comecou a receber dopamina, levophed e vasopressina, bem como decadron. a cirurgia foi consultada para r

In [75]:
# Iterate over each chunk in the list of text chunks generated by the langchain text splitter
for idx, chunk in enumerate(text_chunks_langchain):
    # Tokenize the current chunk and calculate the number of tokens
    # 'truncation=True' ensures that the text is truncated to fit the model's maximum input size
    # 'input_ids' contains the token IDs for the chunk
    num_tokens = len(tokenizer(chunk, truncation=True)['input_ids'])
    
    # Print the index of the chunk and the number of tokens it contains
    print(f'Chunk {idx} has {num_tokens} tokens')

Chunk 0 has 400 tokens
Chunk 1 has 425 tokens
Chunk 2 has 443 tokens
Chunk 3 has 444 tokens
Chunk 4 has 408 tokens
Chunk 5 has 371 tokens
Chunk 6 has 437 tokens
Chunk 7 has 424 tokens
Chunk 8 has 438 tokens
Chunk 9 has 421 tokens
Chunk 10 has 406 tokens
Chunk 11 has 405 tokens
Chunk 12 has 389 tokens


In [76]:
# Iterate over each chunk in the list of text chunks
for idx, chk in enumerate(text_chunks):
    # Tokenize the current chunk and calculate the number of tokens
    # 'truncation=True' ensures that the text is truncated to fit the model's maximum input size
    # 'input_ids' contains the token IDs for the chunk
    sz = len(tokenizer(chk, truncation=True)['input_ids'])
    
    # Print the index of the chunk and the number of tokens it contains
    print(f'Chunk {idx} has {sz} tokens')

Chunk 0 has 512 tokens
Chunk 1 has 512 tokens
Chunk 2 has 512 tokens
Chunk 3 has 512 tokens
Chunk 4 has 512 tokens
Chunk 5 has 512 tokens
Chunk 6 has 512 tokens
Chunk 7 has 512 tokens
Chunk 8 has 512 tokens
Chunk 9 has 512 tokens
Chunk 10 has 265 tokens


## Training your Custom Named Entity Recognition (NER) Model: An In-Depth Guide

Now, we're going to dive deeper into the process of training our own NER model. Leveraging the power of machine learning techniques and pre-annotated data, let's understand how to make such a model possible. The crucial technique that will enable us to achieve this feat is known as **Transfer Learning**.

### Understanding Transfer Learning

**Transfer learning** is an efficient technique in machine learning where a model, initially trained for one specific task, is adapted or 're-purposed' for another related task.

<p align="center">
  <img src="images/transfer_learning2.png"  alt="" style="width: 60%; height: 60%"/>
</p>

The above diagram compares traditional vs transfer learning approaches. In the traditional approach, we train a model from scratch for a specific task, which requires a large amount of data and significant time to train effectively. On the other hand, transfer learning allows us to leverage a pre-trained model and fine-tune it for our specific task. This approach is much faster and requires less data.

#### Transfer Learning in Natural Language Processing (NLP)

In the context of Natural Language Processing (NLP), transfer learning is typically executed through 'fine-tuning' a pre-trained language model using a new dataset.

- **Pre-trained Language Models**: These are models trained on extensive and diverse text corpora. They have already learned various linguistic patterns, relationships, and structures, which are beneficial for numerous NLP tasks.
- **Fine-tuning**: This involves adapting the pre-trained model to a specific task by training it on a smaller, task-specific dataset.

Fine-tuning is particularly powerful because it allows the model to build upon the knowledge acquired during its initial training phase. The model has previously learned general language features like vocabulary, syntax, and semantics, which are pertinent to the new task. 

This approach is especially useful when you have a limited amount of data for the new task. Despite this limitation, the fine-tuned model can still perform impressively well.

<p align="center">
  <img src="images/transfer_learning.png"  alt="" style="width: 80%; height: 80%"/>
</p>

As demonstrated in the illustration above, the model initially learns from a larger, more general dataset. This already provides the model with much of the vocabulary and grammar rules of the language. Following this, the model is fine-tuned using a smaller, more specialized dataset. This helps the model understand the specific nuances and requirements of the new task.

### Applying Transfer Learning to NER

Now, let's apply the concept of transfer learning to our Named Entity Recognition (NER) task. We will fine-tune a pre-trained language model on our annotated dataset to train an efficient NER model. This model will be capable of performing pretty well even with a small amount of data.

- **Step 1: Select a Pre-trained Model**: Choose a model that has been pre-trained on a large and diverse text corpus. Popular choices include BERT, GPT-3, and RoBERTa.
- **Step 2: Prepare Your Dataset**: Ensure your dataset is annotated with the entities you want your model to recognize. Common entities include names, dates, locations, and more.
- **Step 3: Fine-tune the Model**: Using your annotated dataset, fine-tune the pre-trained model. This will involve adjusting the model's parameters to better fit the specific patterns and entities in your data.
- **Step 4: Evaluate and Adjust**: After fine-tuning, evaluate the model's performance on a validation set. Make necessary adjustments to improve accuracy and efficiency.

### Key Points to Remember

- **Efficiency**: Transfer learning is much faster and requires less data than training a model from scratch.
- **Performance**: Fine-tuned models can perform exceptionally well even with limited task-specific data.
- **Applicability**: This approach is particularly useful for tasks like NER, where pre-trained language models can leverage their understanding of linguistic structures.


Now, coming back to our case, we're going to use the same concept of transfer learning. We will fine-tune a pre-trained language model on our annotated dataset to train an efficient NER model.

## Transitioning to Legal NER: Exploring the LeNER-BR Dataset

While medical NER presents a valuable application of NLP, our focus shifts to legal NER, utilizing the specialized **LeNER-BR dataset**. This dataset provides a focused resource for training and evaluating models specifically on legal text.

### Understanding LeNER-BR

LeNER-BR comprises 70 legal documents meticulously annotated for Named Entity Recognition within a legal context.  These annotations categorize entities into six key classes: 

* **ORGANIZACAO:** This class encompasses the names of various organizations relevant to legal documents, such as government agencies, corporations, and institutions.
* **PESSOA:**  This category identifies individuals mentioned within the legal text, including plaintiffs, defendants, witnesses, lawyers, and judges. 
* **TEMPO:**  Representations of temporal information are crucial in legal documents. This class captures dates, times, and durations relevant to the legal proceedings.
* **LOCAL:**  Locations play a significant role in legal cases. This category identifies places like courthouses, addresses, cities, and countries mentioned in the documents.
* **LEGISLACAO:**  Legal documents often refer to specific laws, regulations, and legal codes. This class annotates these references, providing valuable context to the legal content.
* **JURISPRUDENCIA:** This category captures references to past legal cases, precedents, and rulings that hold relevance to the current document.

These categories provide a structured framework for understanding the key elements within legal documents, enabling us to extract valuable information and insights for various legal NLP tasks.  

>
> If your work involves medical NER tasks in Portuguese language, you can consider exploring datasets like [SemClinBR](https://rdcu.be/cNgqV), [BRATECA](https://physionet.org/content/brateca/1.1/) or many others available at [Physionet](https://physionet.org/content/?topic=medical-text-processing). 
>
> However, please note that while I have the necessary access to use these medical datasets, you don't, so I am unable to share them with the class.
>

### Our tools

We'll be using the following tools for this exercise:

- [spaCy](https://spacy.io/)
- [HuggingFace Transformers](https://huggingface.co/transformers/)

### SpaCy
We'll start by using spaCy's pre-trained model for Portuguese, and then we'll fine-tune it using our own data.

Spacy uses an ensemble of a CNN, a BiLSTM (Bidirectional LSTM), and a transition-based model to perform tasks like NER. They are used in the following way:

1. **Convolutional Neural Network (CNN)**: The CNN in spaCy is used to extract features from the input text. These features could include things like the word itself, its prefix or suffix, its shape (e.g., is it capitalized, is it a digit, etc.), and the features of surrounding words. The CNN helps to convert these raw features into a more abstract representation that can be used for further processing. This is very useful in tasks where local and position-invariant features are helpful, such as in Named Entity Recognition (NER).

2. **Bidirectional Long Short-Term Memory (BiLSTM)**: The BiLSTM takes the features extracted by the CNN and processes them in both forward and backward directions. This allows it to capture the context of each word from both before and after in the sentence. LSTM models are particularly good at handling sequences of data (like text), and the bidirectional part means that they can look at the context from both directions, which can be very helpful for understanding the meaning of a word in a sentence.

3. **Transition-based Model**: This model takes the contextual embeddings from the BiLSTM and makes a prediction at each word about the named entities in the text or the grammatical relations in the sentence. For instance, in Named Entity Recognition, it would predict whether each word starts an entity, ends an entity, is inside an entity, or is not part of an entity at all.

So, the CNN is used for feature extraction, the BiLSTM is used to capture context, and the transition-based model is used to make the final predictions based on the context. This combination allows spaCy to leverage the strengths of each type of model for high performance on tasks like NER and dependency parsing.

In [77]:
# Loading our dataset
from datasets import load_dataset

dataset = load_dataset('lener_br')
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 7828
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1177
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1390
    })
})

In [87]:
# Create a dictionary that maps integer IDs to label strings for named entity recognition (NER) tags
# The dataset contains NER tags in the 'ner_tags' feature of the 'train' split
# 'str2int' converts the string label to its corresponding integer ID
# 'names' provides the list of all NER tag names
id2label = {
    dataset['train'].features['ner_tags'].feature.str2int(tag_name): tag_name
    for tag_name in dataset['train'].features['ner_tags'].feature.names
}

# Display the created dictionary
id2label

{0: 'O',
 1: 'B-ORGANIZACAO',
 2: 'I-ORGANIZACAO',
 3: 'B-PESSOA',
 4: 'I-PESSOA',
 5: 'B-TEMPO',
 6: 'I-TEMPO',
 7: 'B-LOCAL',
 8: 'I-LOCAL',
 9: 'B-LEGISLACAO',
 10: 'I-LEGISLACAO',
 11: 'B-JURISPRUDENCIA',
 12: 'I-JURISPRUDENCIA'}

In [88]:
lener_train = dataset['train']
lener_valid = dataset['validation']
lener_test = dataset['test']

In [89]:
lener_train[0]

{'id': '0',
 'tokens': ['EMENTA',
  ':',
  'APELAÇÃO',
  'CÍVEL',
  '-',
  'AÇÃO',
  'DE',
  'INDENIZAÇÃO',
  'POR',
  'DANOS',
  'MORAIS',
  '-',
  'PRELIMINAR',
  '-',
  'ARGUIDA',
  'PELO',
  'MINISTÉRIO',
  'PÚBLICO',
  'EM',
  'GRAU',
  'RECURSAL',
  '-',
  'NULIDADE',
  '-',
  'AUSÊNCIA',
  'DE',
  'INTERVENÇÃO',
  'DO',
  'PARQUET',
  'NA',
  'INSTÂNCIA',
  'A',
  'QUO',
  '-',
  'PRESENÇA',
  'DE',
  'INCAPAZ',
  '-',
  'PREJUÍZO',
  'EXISTENTE',
  '-',
  'PRELIMINAR',
  'ACOLHIDA',
  '-',
  'NULIDADE',
  'RECONHECIDA',
  '.'],
 'ner_tags': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  2,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0]}

In [91]:
from typing import List, Tuple, Dict

def convert_to_entity_tuples(data: Dict[str, List[str]]) -> Tuple[List[Tuple[int, int, str]], str]:
    """
    Converts a dictionary containing token and NER tag information to a list of tuples containing start index,
    end index, and label for each named entity.

    Args:
        data (Dict[str, List[str]]): A dictionary containing 'tokens' and 'ner_tags' keys.

    Returns:
        Tuple[List[Tuple[int, int, str]], str]: A list of tuples containing start index, end index, and label for each named entity,
        and the original text formed by joining the tokens.

    Example:
        >>> data = {'tokens': ['John', 'Doe', 'is', 'a', 'doctor'], 'ner_tags': [1, 2, 0, 0, 3]}
        >>> convert_to_entity_tuples(data)
        ([(0, 8, 'PERSON'), (15, 21, 'PROFESSION')], 'John Doe is a doctor')
    """
    # Extract tokens and NER tags from the input data
    tokens = data['tokens']
    ner_tags = data['ner_tags']
    entity_tuples = []

    i = 0
    while i < len(tokens):
        # Check if the current token is part of a named entity
        if ner_tags[i] != 0:
            label = ner_tags[i]
            # Calculate the start index of the named entity
            start_index = len(' '.join(tokens[:i])) + int(i > 0)
            # Find the end index of the named entity
            while i < len(tokens) and ner_tags[i] == label:
                i += 1
            end_index = len(' '.join(tokens[:i]))
            # Add the named entity to the list of tuples
            entity_tuples.append((start_index, end_index, label))
        else:
            i += 1

    # Join the tokens to form the original text
    original_text = ' '.join(tokens)
    return entity_tuples, original_text


def merge_entities_of_same_type(text: str, entity_spans: List[Tuple[int, int, str]]) -> List[Tuple[int, int, str]]:
    """
    Merges B- and I- entities of the same type.

    Args:
        text (str): A string containing the text to be analyzed.
        entity_spans (List[Tuple[int, int, str]]): A list of tuples containing start index, end index, and label for each named entity.

    Returns:
        List[Tuple[int, int, str]]: A list of tuples containing start index, end index, and label for each named entity with B- and I- entities of the same type merged and B- removed.

    Example:
        >>> text = "John Doe is a doctor."
        >>> entity_spans = [(0, 8, 'B-PERSON'), (15, 21, 'B-PROFESSION')]
        >>> merge_entities_of_same_type(text, entity_spans)
        [(0, 8, 'PERSON'), (15, 21, 'PROFESSION')]
    """
    merged_entity_spans = []
    for i in range(len(entity_spans)):
        # Check if the current entity is a B- entity
        if entity_spans[i][2][0] == 'B':
            # Check if the next entity is an I- entity of the same type
            if i < len(entity_spans) - 1 and entity_spans[i+1][2][0] == 'I' and entity_spans[i+1][2][2:] == entity_spans[i][2][2:]:
                # Merge the two entities
                merged_entity_spans.append((entity_spans[i][0], entity_spans[i+1][1], entity_spans[i][2][2:]))
            # If the next entity is not an I- entity of the same type, add the current entity to the list
            else:
                merged_entity_spans.append((entity_spans[i][0], entity_spans[i][1], entity_spans[i][2][2:]))
        else:
            merged_entity_spans.append(entity_spans[i])
    return merged_entity_spans

In [92]:
ner_tuple, text = convert_to_entity_tuples(lener_train[0])
ner_tuple, text

([(91, 101, 1), (102, 109, 2)],
 'EMENTA : APELAÇÃO CÍVEL - AÇÃO DE INDENIZAÇÃO POR DANOS MORAIS - PRELIMINAR - ARGUIDA PELO MINISTÉRIO PÚBLICO EM GRAU RECURSAL - NULIDADE - AUSÊNCIA DE INTERVENÇÃO DO PARQUET NA INSTÂNCIA A QUO - PRESENÇA DE INCAPAZ - PREJUÍZO EXISTENTE - PRELIMINAR ACOLHIDA - NULIDADE RECONHECIDA .')

In [93]:
ner_tuple = [(i[0], i[1], id2label[i[2]]) for i in ner_tuple]
ner_tuple

[(91, 101, 'B-ORGANIZACAO'), (102, 109, 'I-ORGANIZACAO')]

In [94]:
ner_tuple = merge_entities_of_same_type(text, ner_tuple)
ner_tuple

[(91, 109, 'ORGANIZACAO'), (102, 109, 'I-ORGANIZACAO')]

In [95]:
render_named_entities(text, ner_tuple)

In [96]:
from typing import List, Tuple, Dict

def prepare_lener_data_for_training(lener_data: List, id2label: Dict) -> List[Tuple[str, List[Tuple[int, int, str]]]]:
    """
    Prepare the data for training by converting it to tuples and merging entities of the same type.

    Args:
        lener_data (List): The data to prepare for training.
        id2label (Dict): A dictionary mapping entity IDs to labels.

    Returns:
        List[Tuple[str, List[Tuple[int, int, str]]]]: A list of tuples, each containing a text and a list of entity tuples.
    """
    # Initialize an empty list to store the prepared data
    prepared_data = []

    # Iterate over the data
    for data_item in lener_data:
        # Convert the data item to tuples
        entity_tuples, text = convert_to_entity_tuples(data_item)
        # Replace the entity IDs with labels
        entity_tuples = [(entity[0], entity[1], id2label[entity[2]]) for entity in entity_tuples]
        # Merge entities of the same type
        entity_tuples = merge_entities_of_same_type(text, entity_tuples)
        # Append the text and the entity tuples to the prepared data
        prepared_data.append((text, entity_tuples))

    return prepared_data

# Prepare the training, validation, and test data
training_data = prepare_lener_data_for_training(lener_train, id2label)
validation_data = prepare_lener_data_for_training(lener_valid, id2label)
test_data = prepare_lener_data_for_training(lener_test, id2label)

In [97]:
import random
random.seed(271828)

random.shuffle(training_data) # shuffle the training data

In [100]:

training_data[3]

('Eis o teor da lei impugnada : Art . 1º As Quotas de Produtividade , que compõem o prêmio de produtividade , a que se refere o art . 66 da Lei Complementar nº 92 , de 05 de julho de 2002 , devidas aos Auditores Fiscais da Coordenação da Receita do Estado , a qualquer título , constituem parcela de sua remuneração e por isso , incorporam-se aos proventos de aposentadoria e são extensivas aos auditores fiscais aposentados e seus pensionistas .',
 [(126, 160, 'LEGISLACAO'),
  (130, 160, 'I-LEGISLACAO'),
  (166, 185, 'TEMPO'),
  (169, 185, 'I-TEMPO'),
  (221, 253, 'ORGANIZACAO'),
  (233, 253, 'I-ORGANIZACAO')])

In [101]:
len(training_data), len(validation_data), len(test_data)

(7828, 1177, 1390)

In [110]:
import spacy
from spacy.tokens import DocBin

# Load a new blank spaCy model for Portuguese
nlp = spacy.blank("pt")

def convert_to_spacy(data_list):
    """
    Converts a list of texts and annotations to a spaCy DocBin object.

    Args:
        data_list (list): A list of tuples containing the text and annotations.

    Returns:
        DocBin: A spaCy DocBin object containing the labeled texts.
    """
    # Create a DocBin object to store the processed documents
    db = DocBin()

    # Iterate over each text and its corresponding annotations in the data list
    for text, annot in data_list:
        # Create a spaCy Doc object from the text
        doc = nlp.make_doc(text)
        ents = []
        seen_tokens = set()
        
        # Iterate over each annotation (start index, end index, label)
        for start, end, label in annot:
            # Create a span for the entity using character indexes
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            
            # If the span is None, it means the character span does not align with token boundaries
            if span is None:
                print(f"Skipping entity [{start}, {end}, {label}] in the following text because the character span '{text[start:end]}' does not align with token boundaries:\n\n{text}\n")
            else:
                # Check for overlapping tokens
                if any(token.i in seen_tokens for token in span):
                    continue
                    #print(f"Skipping overlapping entity [{start}, {end}, {label}] in the following text")
                else:
                    # Add the span to the list of entities
                    ents.append(span)
                    seen_tokens.update(token.i for token in span)
        
        # Assign the list of entities to the doc.ents attribute
        doc.ents = ents
        
        # Add the processed document to the DocBin object
        db.add(doc)

    # Return the DocBin object containing all the processed documents
    return db

# Convert the training, validation, and test data to spaCy DocBin objects
db_train = convert_to_spacy(training_data)
db_valid = convert_to_spacy(validation_data)
db_test = convert_to_spacy(test_data)

In [111]:
len(db_train), len(db_valid), len(db_test)

(7828, 1177, 1390)

In [113]:
from pathlib import Path

output_dir = Path('./outputs/spacy') # output directory
output_dir.mkdir(parents=True, exist_ok=True) # create the output directory

In [114]:
# Save the data to disk
db_train.to_disk("./outputs/spacy/train.spacy")
db_valid.to_disk("./outputs/spacy/valid.spacy")
db_test.to_disk("./outputs/spacy/test.spacy")

In [115]:
# If we want to load it later
from spacy.tokens import DocBin
db_train = DocBin().from_disk("./outputs/spacy/train.spacy")
db_valid = DocBin().from_disk("./outputs/spacy/valid.spacy")
db_test = DocBin().from_disk("./outputs/spacy/test.spacy")
len(db_train), len(db_valid), len(db_test)

(7828, 1177, 1390)

In [116]:
# Creating the config file
! python -m spacy init config "./outputs/spacy/config.cfg" --lang pt --pipeline ner --optimize efficiency --force

[38;5;3m⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.[0m
[38;5;4mℹ Generated config template specific for your use case[0m
- Language: pt
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
[38;5;2m✔ Auto-filled config with all values[0m
[38;5;2m✔ Saved config[0m
outputs/spacy/config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [117]:
# Training

! python -m spacy train "./outputs/spacy/config.cfg" \
                        --output "./outputs/spacy" \
                        --paths.train "./outputs/spacy/train.spacy" \
                        --paths.dev "./outputs/spacy/valid.spacy"

[38;5;4mℹ Saving to output directory: outputs/spacy[0m
[38;5;4mℹ Using CPU[0m
[1m
[38;5;2m✔ Initialized pipeline[0m
[1m
[38;5;4mℹ Pipeline: ['tok2vec', 'ner'][0m
[38;5;4mℹ Initial learn rate: 0.001[0m
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     45.42    0.00    0.00    0.00    0.00
  0     200       3948.38   3077.83    3.77    5.97    2.75    0.04
  0     400        632.11   1614.60   31.80   34.59   29.42    0.32
  0     600       5515.43   1307.05   52.98   58.31   48.54    0.53
  0     800        305.55   1169.88   56.33   62.38   51.35    0.56
  0    1000        650.63   1185.06   58.01   61.19   55.15    0.58
  0    1200        436.25   1360.52   68.19   71.62   65.07    0.68
  1    1400        550.49   1239.53   69.58   72.76   66.67    0.70
  1    1600        877.33   1226.74   68.55   71.04   66.23    0.69
  1    1800        710.35   1389.31   7

In [145]:
# Evaluation
from spacy.training.example import Example

nlp_ner = spacy.load("./outputs/spacy/model-best")
db_test = DocBin().from_disk("./outputs/spacy/test.spacy")


examples = []
for doc in db_test.get_docs(nlp_ner.vocab):
    examples.append(Example(nlp_ner.make_doc(doc.text), doc))

results_spacy = nlp_ner.evaluate(examples)
results_spacy

{'token_acc': 1.0,
 'token_p': 1.0,
 'token_r': 1.0,
 'token_f': 1.0,
 'ents_p': 0.8417630057803468,
 'ents_r': 0.7584635416666666,
 'ents_f': 0.797945205479452,
 'ents_per_type': {'JURISPRUDENCIA': {'p': 0.8133333333333334,
   'r': 0.6594594594594595,
   'f': 0.7283582089552239},
  'PESSOA': {'p': 0.8186046511627907,
   'r': 0.7553648068669528,
   'f': 0.7857142857142857},
  'TEMPO': {'p': 0.8784530386740331, 'r': 0.828125, 'f': 0.8525469168900803},
  'ORGANIZACAO': {'p': 0.832579185520362,
   'r': 0.7345309381237525,
   'f': 0.7804878048780488},
  'LOCAL': {'p': 0.6444444444444445,
   'r': 0.6170212765957447,
   'f': 0.6304347826086957},
  'LEGISLACAO': {'p': 0.886039886039886,
   'r': 0.8227513227513228,
   'f': 0.8532235939643347}},
 'speed': 34316.68674538908}

In [146]:
pd.DataFrame(results_spacy['ents_per_type']).T

Unnamed: 0,p,r,f
JURISPRUDENCIA,0.813333,0.659459,0.728358
PESSOA,0.818605,0.755365,0.785714
TEMPO,0.878453,0.828125,0.852547
ORGANIZACAO,0.832579,0.734531,0.780488
LOCAL,0.644444,0.617021,0.630435
LEGISLACAO,0.88604,0.822751,0.853224


In [120]:
example = """TRIBUTÁRIO. IMPOSTO DE RENDA. NÃO INCIDÊNCIA. AUXÍLIO-ALIMENTAÇÃO (AUXÍLIO-ALMOÇO). VERBA DE NATUREZA INDENIZATÓRIA. PRECEDENTE DA TRU - 5a. REGIÃO. RECURSO INOMINADO DA FAZENDA NACIONAL IMPROVIDO.

A parte ré interpõe recurso inominado contra sentença que julgou procedente o pedido de declaração da inexistência de declaração de não incidência de imposto de renda sobre as verbas decorrentes do pagamento de auxílio-almoço, bem como condenou a ré na repetição do indébito tributário.

Em seu recurso, a União sustenta que a verba, recebida a título de auxílio-almoço, não teria natureza indenizatória, razão por que incidiria o imposto de renda.

A sentença não merece ser reformada. Explico.

O imposto de renda incide sobre a renda ou o acréscimo patrimonial de qualquer natureza, a teor do art. 43 do Código Tributário Nacional, in verbis:

Art. 43. O imposto, de competência da União, sobre a renda e proventos de qualquer natureza tem como fato gerador a aquisição da disponibilidade econômica ou jurídica:
I – de renda, assim entendido o produto do capital, do trabalho ou da combinação de ambos;
II – de proventos de qualquer natureza, assim entendidos os acréscimos patrimoniais não compreendidos no inciso anterior.

A verba recebida a título de auxílio-alimentação (auxílio-almoço) possui natureza indenizatória, uma vez que se destina a cobrir os custos de refeição do empregado, não configurando acréscimo patrimonial.

O §1º. do art. 22 da Lei nº. 8.460/92, incluído pela Lei nº 9.257/97, reconhece a natureza indenizatória do auxílio-alimentação ao consignar que “a concessão do auxílio-alimentação será feita em pecúnia e terá caráter indenizatório”. Logo, tal verba não sofre a incidência do imposto de renda.

Embora o referido dispositivo seja destinado aos servidores públicos federais, não se pode tratar de forma diferenciada os empregados públicos, em face do princípio da isonomia tributária, previsto no art. 150, II, da Constituição Federal, segundo o qual é vedado instituir tratamento desigual entre contribuintes que se encontrem em situação equivalente.

No caso dos autos, os documentos comprovam que o autor percebe o auxílio-almoço, dotado de natureza indenizatória, cujo pagamento deu azo à incidência do imposto de renda. É cabível, portanto a restituição do seu indébito, bem como a declaração de não incidência do imposto sobre tal verba. Nesse sentido, invoco ainda os seguintes precedentes:

TRIBUTÁRIO - PROCESSUAL CIVIL - INEXISTÊNCIA DE VIOLAÇÃO DO ART. 557 DO CPC - IMPOSTO DE RENDA - NÃO INCIDÊNCIA SOBRE VERBAS INDENIZATÓRIAS - AUXÍLIO-ALIMENTAÇÃO - AUXÍLIO-TRANSPORTE. 1. A eventual nulidade da decisão monocrática calcada no art. 557 do CPC fica superada com a reapreciação do recurso pelo órgão colegiado, na via de agravo regimental. 2. O fato gerador do imposto de renda é a aquisição de disponibilidade econômica ou jurídica decorrente de acréscimo patrimonial (art. 43 do CTN). 3. Não incide imposto de renda sobre as verbas recebidas a título de indenização. Precedentes. 4. O pagamento de verbas a título de auxílio-alimentação e auxílio-transporte correspondem ao pagamento de verbas indenizatórias, portanto, não incide na espécie imposto de renda. Agravo regimental improvido. (AGRESP 201000172325, HUMBERTO MARTINS, STJ - SEGUNDA TURMA, DJE DATA:23/04/2010)

EMENTA TRIBUTÁRIO. IMPOSTO DE RENDA DA PESSOA FÍSICA - IRPF. “AUXÍLIO-ALMOÇO”. VERBA DESTINADA A RESSARCIR PARCIALMENTE AS DESPESAS DO EMPREGADO COM A SUA PRÓPRIA ALIMENTAÇÃO. NATUREZA INDENIZATÓRIA. NÃO INCIDÊNCIA DO IRPF. “AUXÍLIO-ENSINO”. VERBA DESTINADA A RESSARCIR PARCIALMENTE AS DESPESAS DO EMPREGADO COM A EDUCAÇÃO DE SEUS FILHOS. NATUREZA REMUNERATÓRIA INCIDÊNCIA DE IMPOSTO DE RENDA. ART. 43 DO CTN. PEDIDO DE UNIFORMIZAÇÃO CONHECIDO E PARCIALMENTE PROVIDO. (Recursos 05032827820154058312, MARCOS ANTONIO GARAPA DE CARVALHO - Turma Regional de Uniformização de Jurisprudência da 5a. Região, Creta - Data::06/04/2017 - Página N/I.)

Recurso inominado da Fazenda Nacional improvido.

Condenação da União em honorários advocatícios, arbitrados em dez por cento sobre o valor condenação.

 

ACÓRDÃO

Decide a 3ª Turma Recursal dos Juizados Especiais Federais de Pernambuco, por maioria, vencido o Juiz Federal Isaac Batista de Carvalho Neto, NEGAR PROVIMENTO AO RECURSO INOMINADO, nos termos da ementa supra.

Recife, data do julgamento.

Joaquim Lustosa Filho

Juiz Federal Relator"""

In [121]:
# Perform named entity recognition (NER) on the example text using the spaCy model
doc_pred = nlp_ner(example)

# Define a dictionary to map entity labels to specific colors for visualization
colors = {
    'ORGANIZACAO': '#2ECC71',  # Green for organizations
    'PESSOA': '#3498DB',       # Blue for persons
    'TEMPO': '#E74C3C',        # Red for time expressions
    'LOCAL': '#F1C40F',        # Yellow for locations
    'LEGISLACAO': '#8E44AD',   # Purple for legislation
    'JURISPRUDENCIA': '#D35400' # Orange for jurisprudence
}

# Render the named entities in the text using spaCy's displaCy visualizer
# 'style="ent"' specifies that the visualization should highlight entities
# 'jupyter=True' ensures the visualization is displayed correctly in a Jupyter notebook
# 'options={'colors': colors}' applies the custom colors defined above to the entities
spacy.displacy.render(doc_pred, style="ent", jupyter=True, options={'colors': colors})

In [122]:
# Here's the proof that the model has never seen the law number 9.257/97 before, but it was able to recognize it as a LEGISLACAO entity
for i in training_data:
    if '9.257/97' in i[0]:
        print(i[0])


In [123]:
# Here's the proof that the model has never seen the AGRESP acronym before, but it was able to recognize it as a JURISPRUDENCIA entity
for i in training_data:
    if 'agresp' in i[0].lower():
        print(i[0])


### HuggingFace Transformers

Now, we'll use HuggingFace's pre-trained model for Portuguese, then fine-tune it with our own data.

HuggingFace Transformers implements models like BERT (Bidirectional Encoder Representations from Transformers), GPT-2 (Generative Pretrained Transformer 2), and others. These models perform tasks like Named Entity Recognition (NER) as follows:

1. **Tokenization**: All transformer models start by converting the input text into tokens. These tokens are words or parts of words that the model understands. They're usually shorter than a whole word and can include things like morphological roots, prefixes, suffixes, etc.

2. **Embedding**: After tokenizing the text, each token is converted into a numerical representation called an embedding. These embeddings are vectors in a high-dimensional space that the model has learned during pre-training. Nearby vectors in this space correspond to similar meanings, so these embeddings capture the semantics of the words in some sense.

3. **Transformer Architecture (BERT, GPT-2, etc.)**: Depending on the specific transformer model, different architectural patterns apply. But generally speaking, transformer architectures consist of several layers of self-attention mechanisms and feed-forward neural networks. Self-attention allows the model to weigh the importance of each token when considering a particular one. It essentially captures the context of each word by processing the entire text at once.

4. **Fine-tuning**: The last step is to use our own training data to specialize ('fine-tune') the pre-trained transformer model for a specific task, like NER. We add a task-specific layer on top of the pre-trained model, train it on our data, and gradually adjust the parameters of both the task-specific layer and the pre-existing model.

Therefore, tokenization and embedding are used for feature extraction, the transformer architecture (BERT, GPT-2, etc.) is used for context capturing, and fine-tuning is performed for task-specific predictions. This composite layout enables HuggingFace Transformers to exploit the advantages of each model type for great performance on tasks such as NER and dependency parsing.





### Harnessing the Power of Pretrained Transformer Models in NLP Applications

Transformer models have revolutionized the landscape of Natural Language Processing (NLP) applications. By leveraging a pretrained model from the Hugging Face Hub, we can directly fine-tune it to function effectively for our specific task.

This technique, known as transfer learning, carries the advantage of saving significant amounts of time and computational resources. As long as the corpus utilized during pretraining isn't vastly different from that used during the fine-tuning phase, good results are generally achievable.

#### The Need for Domain Adaptation

Despite the efficiency of transfer learning, there are scenarios where it's beneficial to conduct an extra step of fine-tuning the language models on your specific dataset before going ahead with training a task-specific head. This is particularly relevant when dealing with niche or specialized text such as legal contracts or scientific articles.

In these cases, a generic Transformer model like BERT might treat the domain-specific words in your data as rare tokens. This could negatively affect performance as these 'rare tokens' might actually be integral or frequently occurring elements in your specific application. Hence, by first fine-tuning the language model using in-domain data, we can notably improve the performance of numerous downstream tasks. And the best part? Usually, this step needs to be conducted only once!

#### What Exactly is Domain Adaptation?

Domain Adaptation essentially refers to this process of fine-tuning a pretrained language model on in-domain data. The method gained popularity in 2018, thanks to an architecture called ULMFiT - one of the pioneering neural architectures that really made transfer learning work for NLP.

<br><br>
<p align="center">
  <img src="images/ulmfit.svg"  alt="" style="width: 60%; height: 60%"/>
</p>
<br><br>

ULMFiT's approach was based on LSTMs (Long Short-Term Memory units), a type of recurrent neural network well-suited for sequence prediction tasks. It demonstrated impressive success with domain adaptation. 

Through the combined power of transfer learning and domain adaptation, we can harness the potent capabilities of pretrained Transformer models, tailored precisely for our unique NLP tasks. We'll explore how to perform a similar procedure of domain adaptation but with a Transformer rather than an LSTM.

### Detailed Steps to Train a Custom Named Entity Recognition (NER) Model

In this section, we break down the process of training a custom NER model into three key steps. We aim to offer a clear and comprehensive explanation suitable for a classroom setting.

#### Step 1: Loading a Pre-Trained Language Model

To kick off our custom NER model training, we start by utilizing a pre-trained language model - specifically, one that is already fine-tuned to understand Portuguese. The power of using such a pre-trained model lies in its existing linguistic understanding. By leveraging previous learning, it provides our model with a significant headstart, which results in quicker training times and improved performance. This pre-trained model serves as a robust foundation on which we can construct our task-specific training.

#### Step 2: Fine-Tuning the Language Model with Domain-Specific Text

Once we have the base foundation set up, we proceed to the fine-tuning phase. During this stage, we 'specialize' our pre-trained language model further by exposing it to our specific dataset. This dataset contains text that directly pertains to the domain in which we'd like our model to operate. By aligning our pre-trained model with the peculiarities, styles, and common patterns found in our particular dataset, we help it grasp and adjust to these unique contextual aspects. In essence, the model learns the kind of language "jargon" or nuances associated with our specific domain.

#### Step 3: Training the NER Model using the Fine-Tuned LM

With a domain-adapted, fine-tuned language model in place, we're ready to focus on our primary goal - named entity recognition. We train our NER model using the outputs from the previously fine-tuned language model. The fine-tuned model generates what's known as 'embeddings' for each token in our text. These embeddings are high-dimensional numerical representations that capture various semantic aspects of the tokens.

Our NER model then learns how to utilize these embeddings to identify and classify named entities within our domain-specific content. It essentially maps these learned embeddings to specific named entity categories. Consequently, our model becomes proficient at identifying and classifying entities within the input text as per the categories defined by us. 

The entire process, from initial loading of a pre-trained model through fine-tuning it with domain-specific data, to finally using it for training the NER model, creates a powerful sequence that enables our custom model to perform excellently at its designated Named Entity Recognition tasks.

#### Step 1: Loading a Pre-Trained Language Model

In [124]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, AutoModelForMaskedLM, pipeline, AutoConfig
from datasets import load_dataset
from pathlib import Path
import torch

model_checkpoint = "neuralmind/bert-base-portuguese-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

lener_train = load_dataset('lener_br', split='train')
lener_valid = load_dataset('lener_br', split='validation')
lener_test = load_dataset('lener_br', split='test')


`clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [126]:
train_dataset = [' '.join(tokens) for tokens in lener_train['tokens']]
valid_dataset = [' '.join(tokens) for tokens in lener_valid['tokens']]
test_dataset = [' '.join(tokens) for tokens in lener_test['tokens']]

# merge train and valid datasets
train_dataset = train_dataset + valid_dataset

train_dataset[10]

'Verificando-se que a incapaz foi acompanhada de seu representante legal , bem como por Advogado , o qual exerceu com regularidade o contraditório e a ampla defesa , a ausência de intimação do parquet não se erige em vício suficiente para macular de nulidade o processo .'

In [127]:
# create huggingface datasets from the lists of texts

from datasets import Dataset

train_dataset = Dataset.from_dict({'text': train_dataset})
test_dataset = Dataset.from_dict({'text': test_dataset})

train_dataset = train_dataset.shuffle(seed=271828)
train_dataset


Dataset({
    features: ['text'],
    num_rows: 9005
})

#### Step 2: Fine-Tuning the Language Model with Domain-Specific Text


In [128]:
path_to_save_lm = Path('./outputs/bert_masked_lm')
path_to_save_lm.mkdir(parents=True, exist_ok=True)

In [129]:
def tokenize_function(examples):
    """
    Tokenizes the input text in the given examples using the tokenizer object.

    Args:
    - examples: A dictionary containing the input text to be tokenized.

    Returns:
    - A dictionary containing the tokenized input text.
    """
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

# Tokenize the datasets. This is the step where we convert the text to numerical representations of the tokens
tokenized_train = train_dataset.map(
    tokenize_function, batched=True, remove_columns=["text"]
)
tokenized_test = test_dataset.map(
    tokenize_function, batched=True, remove_columns=["text"]
)


Map:   0%|          | 0/9005 [00:00<?, ? examples/s]

Map:   0%|          | 0/1390 [00:00<?, ? examples/s]

In [130]:
tokenized_train

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'],
    num_rows: 9005
})

In [131]:
tokenized_train[0]

{'input_ids': [101,
  177,
  22328,
  5650,
  22339,
  22317,
  248,
  16484,
  11635,
  22317,
  19893,
  18504,
  119,
  102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'word_ids': [None, 0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 3, None]}

In [132]:
for i in range(30):
    print(f'The number of tokens in the {i}th text is {len(tokenized_train[i]["input_ids"])}')

The number of tokens in the 0th text is 14
The number of tokens in the 1th text is 13
The number of tokens in the 2th text is 12
The number of tokens in the 3th text is 20
The number of tokens in the 4th text is 15
The number of tokens in the 5th text is 23
The number of tokens in the 6th text is 132
The number of tokens in the 7th text is 65
The number of tokens in the 8th text is 20
The number of tokens in the 9th text is 59
The number of tokens in the 10th text is 23
The number of tokens in the 11th text is 4
The number of tokens in the 12th text is 108
The number of tokens in the 13th text is 73
The number of tokens in the 14th text is 168
The number of tokens in the 15th text is 22
The number of tokens in the 16th text is 55
The number of tokens in the 17th text is 23
The number of tokens in the 18th text is 31
The number of tokens in the 19th text is 64
The number of tokens in the 20th text is 68
The number of tokens in the 21th text is 18
The number of tokens in the 22th text is

In [133]:
def group_texts(examples):
    """
    This function groups together a set of texts as contiguous text of fixed length (chunk_size). It's useful for training masked language models.

    Args:
    - examples: A dictionary containing the examples to group. Each key corresponds to a feature, and each value is a list of lists of tokens.

    Returns:
    - A dictionary containing the grouped examples. Each key corresponds to a feature, and each value is a list of lists of tokens.
    """
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

chunk_size = 512

tokenized_train = tokenized_train.map(
    group_texts,
    batched=True,
)

tokenized_test = tokenized_test.map(
    group_texts,
    batched=True,
)

Map:   0%|          | 0/9005 [00:00<?, ? examples/s]

Map:   0%|          | 0/1390 [00:00<?, ? examples/s]

In [134]:
for i in range(30):
    print(f'The number of tokens in the {i}th text is {len(tokenized_train[i]["input_ids"])}')

The number of tokens in the 0th text is 512
The number of tokens in the 1th text is 512
The number of tokens in the 2th text is 512
The number of tokens in the 3th text is 512
The number of tokens in the 4th text is 512
The number of tokens in the 5th text is 512
The number of tokens in the 6th text is 512
The number of tokens in the 7th text is 512
The number of tokens in the 8th text is 512
The number of tokens in the 9th text is 512
The number of tokens in the 10th text is 512
The number of tokens in the 11th text is 512
The number of tokens in the 12th text is 512
The number of tokens in the 13th text is 512
The number of tokens in the 14th text is 512
The number of tokens in the 15th text is 512
The number of tokens in the 16th text is 512
The number of tokens in the 17th text is 512
The number of tokens in the 18th text is 512
The number of tokens in the 19th text is 512
The number of tokens in the 20th text is 512
The number of tokens in the 21th text is 512
The number of tokens

In [135]:
tokenizer.decode(tokenized_train[1]["input_ids"])

"do fato de o desertor encontrar - se com sua namorada ( jovem ) grávida, revela - se importante trazermos à baila os ensinamentos dos ilustres professores Eugênio Zaffaroni e José Henrique Pierangeli, verbis :'Todo sujeito age numa circunstância dada e com um âmbito de autodeterminação também dado. [SEP] [CLS] NA AÇÃO CÍVEL ORIGINÁRIA 2. 821 PROCED. : MATO GROSSO RELATOR : MIN. RICARDO LEWANDOWSKI AGTE. ( S ) : UNIÃO ADV. ( A / S ) : ADVOGADO - GERAL DA UNIÃO AGDO. ( A / S ) : ESTADO DE MATO GROSSO PROC. ( A / S ) ( ES ) : PROCURADOR - GERAL DO ESTADO DE MATO GROSSO Decisão : O Tribunal, por unanimidade, negou provimento ao agravo regimental, nos termos do voto do Relator. [SEP] [CLS] PRESSUPOSTO RECURSAL NÃO OBSERVADO. [SEP] [CLS] O STF, em julgados recentes, concluiu que a Constituição Federal não define o momento em que ocorrem o fato gerador, a base de cálculo e a exigibilidade da contribuição previdenciária, podendo assim tais matérias ser disciplinadas por lei ordinária. [SEP] [

In [136]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)


In [141]:
from transformers import TrainingArguments

# Define the batch size for training
training_batch_size = 24

# Extract the model name from the model checkpoint
model_name = model_checkpoint.split("/")[-1]

# Define the training arguments
training_arguments = TrainingArguments(
    # Define the output directory for the trained model
    output_dir=path_to_save_lm / f"{model_name}-finetuned-lener",
    # Overwrite the output directory if it already exists
    overwrite_output_dir=True,
    # Evaluate the model after every epoch
    eval_strategy="epoch",
    # Set the learning rate
    learning_rate=3e-5,
    # Set the weight decay
    weight_decay=0.01,
    # Set the batch size for training
    per_device_train_batch_size=training_batch_size,
    # Set the batch size for evaluation
    per_device_eval_batch_size=training_batch_size,
    # Use mixed precision training
    bf16=True,
    # Log the training loss after every 5 steps
    logging_steps=5,
    # Log strategy
    logging_strategy='steps',
    # Train the model for 20 epochs
    num_train_epochs=20,
    # Save only the best model
    save_total_limit=1,
    # Save the model after every epoch
    save_strategy='epoch',
    # Load the best model at the end of training
    load_best_model_at_end=True,
    # Use the evaluation loss to determine the best model
    metric_for_best_model="eval_loss",
    # Lower evaluation loss is better
    greater_is_better=False,
    # Accumulate gradients for 1 steps before performing an update
    gradient_accumulation_steps=1,
    # Set the random seed for reproducibility
    seed=271828,
)

In [142]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    data_collator=data_collator,
    tokenizer=tokenizer,
)


In [143]:
trainer.train()


`torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.


Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.



Epoch,Training Loss,Validation Loss
1,1.2858,1.110343
2,1.2054,1.07417
3,1.2005,1.060949
4,1.1761,1.036157
5,1.1358,1.058085
6,1.115,1.031998
7,1.1002,1.023526
8,1.0816,1.056674
9,1.0835,0.998244
10,1.0716,1.017125



`torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.


Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.


`torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.


Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.


`torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.


Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.


`torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.


Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.


`torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.aut

TrainOutput(global_step=360, training_loss=1.0889260013898214, metrics={'train_runtime': 629.3905, 'train_samples_per_second': 26.724, 'train_steps_per_second': 0.572, 'total_flos': 4427067334778880.0, 'train_loss': 1.0889260013898214, 'epoch': 20.0})

In [147]:

# Save the trained model
trainer.save_model(path_to_save_lm / f"{model_name}-finetuned-lener")
tokenizer.save_pretrained(path_to_save_lm / f"{model_name}-finetuned-lener")

('outputs/bert_masked_lm/bert-base-portuguese-cased-finetuned-lener/tokenizer_config.json',
 'outputs/bert_masked_lm/bert-base-portuguese-cased-finetuned-lener/special_tokens_map.json',
 'outputs/bert_masked_lm/bert-base-portuguese-cased-finetuned-lener/vocab.txt',
 'outputs/bert_masked_lm/bert-base-portuguese-cased-finetuned-lener/added_tokens.json',
 'outputs/bert_masked_lm/bert-base-portuguese-cased-finetuned-lener/tokenizer.json')

In [148]:
print(path_to_save_lm / f"{model_name}-finetuned-lener")

outputs/bert_masked_lm/bert-base-portuguese-cased-finetuned-lener


#### Step 3: Training the NER Model using the Fine-Tuned LM

In [149]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, AutoModelForMaskedLM, pipeline, AutoConfig
from datasets import load_dataset
from pathlib import Path
import torch

In [150]:
path_to_save_ner = Path('./outputs/bert_lener')
path_to_save_ner.mkdir(parents=True, exist_ok=True)

In [151]:
lener_train = load_dataset('lener_br', split='train')
lener_valid = load_dataset('lener_br', split='validation')
lener_test = load_dataset('lener_br', split='test')

In [152]:
lener_train

Dataset({
    features: ['id', 'tokens', 'ner_tags'],
    num_rows: 7828
})

In [153]:
# Clean GPU memory
import torch
import gc

model = None
trainer = None
tokenizer = None
gc.collect()
torch.cuda.empty_cache()

In [154]:
# Create a dictionary that maps integer IDs to label strings for named entity recognition (NER) tags
# The dataset contains NER tags in the 'ner_tags' feature of the 'train' split
# 'str2int' converts the string label to its corresponding integer ID
# 'names' provides the list of all NER tag names
id2label = {
    lener_train.features['ner_tags'].feature.str2int(tag_name): tag_name
    for tag_name in lener_train.features['ner_tags'].feature.names
}

# Display the created dictionary
id2label

{0: 'O',
 1: 'B-ORGANIZACAO',
 2: 'I-ORGANIZACAO',
 3: 'B-PESSOA',
 4: 'I-PESSOA',
 5: 'B-TEMPO',
 6: 'I-TEMPO',
 7: 'B-LOCAL',
 8: 'I-LOCAL',
 9: 'B-LEGISLACAO',
 10: 'I-LEGISLACAO',
 11: 'B-JURISPRUDENCIA',
 12: 'I-JURISPRUDENCIA'}

In [155]:
from transformers import AutoTokenizer

model_checkpoint = Path('./outputs/bert_masked_lm/bert-base-portuguese-cased-finetuned-lener')

# Load the tokenizer from the checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Create a label2id mapping by reversing the id2label mapping
label2id_mapping = {v: k for k, v in id2label.items()}
# Load the model from the checkpoint
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(id2label), id2label=id2label, label2id=label2id_mapping)



Some weights of BertForTokenClassification were not initialized from the model checkpoint at outputs/bert_masked_lm/bert-base-portuguese-cased-finetuned-lener and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [156]:
label_all_tokens = True

def tokenize_and_align_labels(examples: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """
    Tokenize the input words and align the labels with the tokens.

    Args:
        examples (Dict[str, List[str]]): A dictionary containing the input words and the corresponding labels.

    Returns:
        Dict[str, List[str]]: A dictionary containing the tokenized input words and the aligned labels.
    """
    # Tokenize the input words
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True, max_length=512)

    # Initialize an empty list to store the aligned labels
    aligned_labels = []

    # Iterate over the labels
    for i, label in enumerate(examples["ner_tags"]):
        # Get the word IDs for the current batch
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []

        # Iterate over the word IDs
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        # Append the label IDs to the aligned labels
        aligned_labels.append(label_ids)

    # Add the aligned labels to the tokenized inputs
    tokenized_inputs["labels"] = aligned_labels

    return tokenized_inputs

In [157]:
tokenized_train = lener_train.map(tokenize_and_align_labels, batched=True)
tokenized_valid = lener_valid.map(tokenize_and_align_labels, batched=True)
tokenized_test = lener_test.map(tokenize_and_align_labels, batched=True)

Map:   0%|          | 0/7828 [00:00<?, ? examples/s]

Map:   0%|          | 0/1177 [00:00<?, ? examples/s]

Map:   0%|          | 0/1390 [00:00<?, ? examples/s]

In [158]:
import evaluate
import numpy as np

metric = evaluate.load("seqeval")

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)


    # Remove ignored index (special tokens)
    true_predictions = [
        [id2label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [id2label[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
        "eval_f1": results["overall_f1"],
    }

In [159]:
from transformers import DataCollatorForTokenClassification, TrainingArguments, Trainer

# Initialize a data collator for token classification
token_classification_collator = DataCollatorForTokenClassification(tokenizer)

# Extract the model name from the model checkpoint
model_name = str(model_checkpoint).split("/")[-1]

# Define the training arguments
training_arguments = TrainingArguments(
    # Define the output directory for the trained model
    output_dir=path_to_save_ner / f"{model_name}-lener",
    # Overwrite the output directory if it already exists
    overwrite_output_dir=True,
    # Set the learning rate
    learning_rate=2e-5,
    # Set the batch size for training
    per_device_train_batch_size=8,
    # Set the batch size for evaluation
    per_device_eval_batch_size=4,
    # Accumulate gradients for 1 step before performing an update
    gradient_accumulation_steps=1,
    # Train the model for 4 epochs
    num_train_epochs=4,
    # Set the weight decay
    weight_decay=0.01,
    # Save only the best model
    save_total_limit=1,
    # Log the training loss after every 5 steps
    logging_steps=5,
    # Log strategy
    logging_strategy='steps',
    # Evaluate the model after every 1 epoch
    eval_steps=1,
    # Save the model after every 1 epoch
    save_steps=1,
    # Evaluate the model after every epoch
    eval_strategy="epoch",
    # Save the model after every epoch
    save_strategy='epoch',
    # Load the best model at the end of training
    load_best_model_at_end=True,
    # Use the evaluation F1 score to determine the best model
    metric_for_best_model="eval_f1",
    # Higher evaluation F1 score is better
    greater_is_better=True,
    # Perform training
    do_train=True,
    # Perform evaluation
    do_eval=True,
    # Perform prediction
    do_predict=True,
    # Use mixed precision training
    bf16=True,
    # Do not push the model to the Hugging Face Model Hub
    push_to_hub=False,
    # Set the random seed for reproducibility
    seed=271828
)


In [160]:
# Initialize the Trainer
trainer = Trainer(
    # The model to train
    model,
    # The training arguments
    training_arguments,
    # The training dataset
    train_dataset=tokenized_train,
    # The evaluation dataset
    eval_dataset=tokenized_valid,
    # The data collator
    data_collator=token_classification_collator,
    # The tokenizer
    tokenizer=tokenizer,
    # The function to compute the metrics
    compute_metrics=compute_metrics,
)

In [161]:
trainer.train()


`torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.


Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.



Epoch,Training Loss,Validation Loss,F1,Precision,Recall,Accuracy
1,0.0813,,0.810766,0.778879,0.845376,0.969083
2,0.0347,,0.848454,0.802493,0.9,0.972298
3,0.0311,,0.846532,0.807377,0.889677,0.971536
4,0.0074,,0.862236,0.827695,0.899785,0.974507



`torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.


Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.


`torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.


Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.


`torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.


Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.


`torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.


Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.



TrainOutput(global_step=1960, training_loss=0.06744579444972, metrics={'train_runtime': 543.4568, 'train_samples_per_second': 57.616, 'train_steps_per_second': 3.607, 'total_flos': 2632698013761336.0, 'train_loss': 0.06744579444972, 'epoch': 4.0})

In [162]:
trainer.evaluate(tokenized_test)


`torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.


Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.



{'eval_f1': 0.8947635993899339,
 'eval_loss': 0.061111580580472946,
 'eval_precision': 0.8835341365461847,
 'eval_recall': 0.9062821833161689,
 'eval_accuracy': 0.9841192818386343,
 'eval_runtime': 16.3482,
 'eval_samples_per_second': 85.025,
 'eval_steps_per_second': 10.643,
 'epoch': 4.0}

In [163]:
# Save the trained model
trainer.save_model(path_to_save_ner / f"{model_name}-lener")
tokenizer.save_pretrained(path_to_save_ner / f"{model_name}-lener")


('outputs/bert_lener/bert-base-portuguese-cased-finetuned-lener-lener/tokenizer_config.json',
 'outputs/bert_lener/bert-base-portuguese-cased-finetuned-lener-lener/special_tokens_map.json',
 'outputs/bert_lener/bert-base-portuguese-cased-finetuned-lener-lener/vocab.txt',
 'outputs/bert_lener/bert-base-portuguese-cased-finetuned-lener-lener/added_tokens.json',
 'outputs/bert_lener/bert-base-portuguese-cased-finetuned-lener-lener/tokenizer.json')

In [164]:
print(path_to_save_ner / f"{model_name}-lener")

outputs/bert_lener/bert-base-portuguese-cased-finetuned-lener-lener


In [165]:
predictions, labels, _ = trainer.predict(tokenized_test)

predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)
true_predictions = [
    [id2label[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [id2label[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

results_huggingface = metric.compute(predictions=true_predictions, references=true_labels)
results_huggingface

{'JURISPRUDENCIA': {'precision': 0.839754816112084,
  'recall': 0.8912639405204461,
  'f1': 0.8647430117222723,
  'number': 1076},
 'LEGISLACAO': {'precision': 0.9219600725952813,
  'recall': 0.910394265232975,
  'f1': 0.9161406672678089,
  'number': 558},
 'LOCAL': {'precision': 0.65,
  'recall': 0.7647058823529411,
  'f1': 0.7027027027027027,
  'number': 68},
 'ORGANIZACAO': {'precision': 0.8433734939759037,
  'recall': 0.8795811518324608,
  'f1': 0.8610968733982574,
  'number': 955},
 'PESSOA': {'precision': 0.924956369982548,
  'recall': 0.9742647058823529,
  'f1': 0.9489704565801255,
  'number': 544},
 'TEMPO': {'precision': 0.9828660436137072,
  'recall': 0.9238653001464129,
  'f1': 0.9524528301886793,
  'number': 683},
 'overall_precision': 0.8835341365461847,
 'overall_recall': 0.9062821833161689,
 'overall_f1': 0.8947635993899339,
 'overall_accuracy': 0.9841192818386343}

In [166]:
results_spacy

{'token_acc': 1.0,
 'token_p': 1.0,
 'token_r': 1.0,
 'token_f': 1.0,
 'ents_p': 0.8417630057803468,
 'ents_r': 0.7584635416666666,
 'ents_f': 0.797945205479452,
 'ents_per_type': {'JURISPRUDENCIA': {'p': 0.8133333333333334,
   'r': 0.6594594594594595,
   'f': 0.7283582089552239},
  'PESSOA': {'p': 0.8186046511627907,
   'r': 0.7553648068669528,
   'f': 0.7857142857142857},
  'TEMPO': {'p': 0.8784530386740331, 'r': 0.828125, 'f': 0.8525469168900803},
  'ORGANIZACAO': {'p': 0.832579185520362,
   'r': 0.7345309381237525,
   'f': 0.7804878048780488},
  'LOCAL': {'p': 0.6444444444444445,
   'r': 0.6170212765957447,
   'f': 0.6304347826086957},
  'LEGISLACAO': {'p': 0.886039886039886,
   'r': 0.8227513227513228,
   'f': 0.8532235939643347}},
 'speed': 34316.68674538908}

In [167]:
def convert_spacy_to_hf_format(input_data):
    """
    Converts spaCy evaluation metrics to a format compatible with Hugging Face.

    Args:
        input_data (dict): A dictionary containing spaCy evaluation metrics.

    Returns:
        dict: A dictionary containing the converted metrics in Hugging Face format.
    """
    # Initialize an empty dictionary to store the converted metrics
    output = {}
    
    # Process entity-specific metrics
    for entity_type, metrics in input_data['ents_per_type'].items():
        # Convert precision, recall, and F1-score for each entity type
        output[entity_type] = {
            'precision': metrics['p'],
            'recall': metrics['r'],
            'f1': metrics['f'],
            'number': 0  # Placeholder for the number of entities, not available in input
        }
    
    # Add overall metrics to the output dictionary
    output['overall_precision'] = input_data['ents_p']
    output['overall_recall'] = input_data['ents_r']
    output['overall_f1'] = input_data['ents_f']
    output['overall_accuracy'] = input_data['token_acc']
    
    # Return the converted metrics
    return output

convert_spacy_to_hf_format(results_spacy)

{'JURISPRUDENCIA': {'precision': 0.8133333333333334,
  'recall': 0.6594594594594595,
  'f1': 0.7283582089552239,
  'number': 0},
 'PESSOA': {'precision': 0.8186046511627907,
  'recall': 0.7553648068669528,
  'f1': 0.7857142857142857,
  'number': 0},
 'TEMPO': {'precision': 0.8784530386740331,
  'recall': 0.828125,
  'f1': 0.8525469168900803,
  'number': 0},
 'ORGANIZACAO': {'precision': 0.832579185520362,
  'recall': 0.7345309381237525,
  'f1': 0.7804878048780488,
  'number': 0},
 'LOCAL': {'precision': 0.6444444444444445,
  'recall': 0.6170212765957447,
  'f1': 0.6304347826086957,
  'number': 0},
 'LEGISLACAO': {'precision': 0.886039886039886,
  'recall': 0.8227513227513228,
  'f1': 0.8532235939643347,
  'number': 0},
 'overall_precision': 0.8417630057803468,
 'overall_recall': 0.7584635416666666,
 'overall_f1': 0.797945205479452,
 'overall_accuracy': 1.0}

In [168]:
import pandas as pd

df_res_spacy = pd.DataFrame(convert_spacy_to_hf_format(results_spacy)).T
df_res_spacy.sort_index()['f1']

JURISPRUDENCIA       0.728358
LEGISLACAO           0.853224
LOCAL                0.630435
ORGANIZACAO          0.780488
PESSOA               0.785714
TEMPO                0.852547
overall_accuracy     1.000000
overall_f1           0.797945
overall_precision    0.841763
overall_recall       0.758464
Name: f1, dtype: float64

In [169]:
df_res_huggingface = pd.DataFrame(results_huggingface).T
df_res_huggingface.sort_index()['f1']

JURISPRUDENCIA       0.864743
LEGISLACAO           0.916141
LOCAL                0.702703
ORGANIZACAO          0.861097
PESSOA               0.948970
TEMPO                0.952453
overall_accuracy     0.984119
overall_f1           0.894764
overall_precision    0.883534
overall_recall       0.906282
Name: f1, dtype: float64

In [170]:
# Deploying the model for inference
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_checkpoint = Path('./outputs/bert_lener/bert-base-portuguese-cased-finetuned-lener-lener')
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, model_max_length=512)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)
pipeline_ner = pipeline('ner', model=model, tokenizer=tokenizer,  aggregation_strategy='first', device=-1)



In [171]:
example = """TRIBUTÁRIO. IMPOSTO DE RENDA. NÃO INCIDÊNCIA. AUXÍLIO-ALIMENTAÇÃO (AUXÍLIO-ALMOÇO). VERBA DE NATUREZA INDENIZATÓRIA. PRECEDENTE DA TRU - 5a. REGIÃO. RECURSO INOMINADO DA FAZENDA NACIONAL IMPROVIDO.

A parte ré interpõe recurso inominado contra sentença que julgou procedente o pedido de declaração da inexistência de declaração de não incidência de imposto de renda sobre as verbas decorrentes do pagamento de auxílio-almoço, bem como condenou a ré na repetição do indébito tributário.

Em seu recurso, a União sustenta que a verba, recebida a título de auxílio-almoço, não teria natureza indenizatória, razão por que incidiria o imposto de renda.

A sentença não merece ser reformada. Explico.

O imposto de renda incide sobre a renda ou o acréscimo patrimonial de qualquer natureza, a teor do art. 43 do Código Tributário Nacional, in verbis:

Art. 43. O imposto, de competência da União, sobre a renda e proventos de qualquer natureza tem como fato gerador a aquisição da disponibilidade econômica ou jurídica:
I – de renda, assim entendido o produto do capital, do trabalho ou da combinação de ambos;
II – de proventos de qualquer natureza, assim entendidos os acréscimos patrimoniais não compreendidos no inciso anterior.

A verba recebida a título de auxílio-alimentação (auxílio-almoço) possui natureza indenizatória, uma vez que se destina a cobrir os custos de refeição do empregado, não configurando acréscimo patrimonial.

O §1º. do art. 22 da Lei nº. 8.460/92, incluído pela Lei nº 9.257/97, reconhece a natureza indenizatória do auxílio-alimentação ao consignar que “a concessão do auxílio-alimentação será feita em pecúnia e terá caráter indenizatório”. Logo, tal verba não sofre a incidência do imposto de renda.

Embora o referido dispositivo seja destinado aos servidores públicos federais, não se pode tratar de forma diferenciada os empregados públicos, em face do princípio da isonomia tributária, previsto no art. 150, II, da Constituição Federal, segundo o qual é vedado instituir tratamento desigual entre contribuintes que se encontrem em situação equivalente.

No caso dos autos, os documentos comprovam que o autor percebe o auxílio-almoço, dotado de natureza indenizatória, cujo pagamento deu azo à incidência do imposto de renda. É cabível, portanto a restituição do seu indébito, bem como a declaração de não incidência do imposto sobre tal verba. Nesse sentido, invoco ainda os seguintes precedentes:

TRIBUTÁRIO - PROCESSUAL CIVIL - INEXISTÊNCIA DE VIOLAÇÃO DO ART. 557 DO CPC - IMPOSTO DE RENDA - NÃO INCIDÊNCIA SOBRE VERBAS INDENIZATÓRIAS - AUXÍLIO-ALIMENTAÇÃO - AUXÍLIO-TRANSPORTE. 1. A eventual nulidade da decisão monocrática calcada no art. 557 do CPC fica superada com a reapreciação do recurso pelo órgão colegiado, na via de agravo regimental. 2. O fato gerador do imposto de renda é a aquisição de disponibilidade econômica ou jurídica decorrente de acréscimo patrimonial (art. 43 do CTN). 3. Não incide imposto de renda sobre as verbas recebidas a título de indenização. Precedentes. 4. O pagamento de verbas a título de auxílio-alimentação e auxílio-transporte correspondem ao pagamento de verbas indenizatórias, portanto, não incide na espécie imposto de renda. Agravo regimental improvido. (AGRESP 201000172325, HUMBERTO MARTINS, STJ - SEGUNDA TURMA, DJE DATA:23/04/2010)

EMENTA TRIBUTÁRIO. IMPOSTO DE RENDA DA PESSOA FÍSICA - IRPF. “AUXÍLIO-ALMOÇO”. VERBA DESTINADA A RESSARCIR PARCIALMENTE AS DESPESAS DO EMPREGADO COM A SUA PRÓPRIA ALIMENTAÇÃO. NATUREZA INDENIZATÓRIA. NÃO INCIDÊNCIA DO IRPF. “AUXÍLIO-ENSINO”. VERBA DESTINADA A RESSARCIR PARCIALMENTE AS DESPESAS DO EMPREGADO COM A EDUCAÇÃO DE SEUS FILHOS. NATUREZA REMUNERATÓRIA INCIDÊNCIA DE IMPOSTO DE RENDA. ART. 43 DO CTN. PEDIDO DE UNIFORMIZAÇÃO CONHECIDO E PARCIALMENTE PROVIDO. (Recursos 05032827820154058312, MARCOS ANTONIO GARAPA DE CARVALHO - Turma Regional de Uniformização de Jurisprudência da 5a. Região, Creta - Data::06/04/2017 - Página N/I.)

Recurso inominado da Fazenda Nacional improvido.

Condenação da União em honorários advocatícios, arbitrados em dez por cento sobre o valor condenação.

 

ACÓRDÃO

Decide a 3ª Turma Recursal dos Juizados Especiais Federais de Pernambuco, por maioria, vencido o Juiz Federal Isaac Batista de Carvalho Neto, NEGAR PROVIMENTO AO RECURSO INOMINADO, nos termos da ementa supra.

Recife, data do julgamento.

Joaquim Lustosa Filho

Juiz Federal Relator"""

In [172]:
pipeline_ner(example)

[{'entity_group': 'ORGANIZACAO',
  'score': 0.49633655,
  'word': 'TRU',
  'start': 131,
  'end': 134},
 {'entity_group': 'ORGANIZACAO',
  'score': 0.5309282,
  'word': '5a',
  'start': 137,
  'end': 139},
 {'entity_group': 'ORGANIZACAO',
  'score': 0.7600396,
  'word': 'REGIÃO',
  'start': 141,
  'end': 147},
 {'entity_group': 'ORGANIZACAO',
  'score': 0.97347915,
  'word': 'União',
  'start': 505,
  'end': 510},
 {'entity_group': 'LEGISLACAO',
  'score': 0.99607575,
  'word': 'art. 43 do Código Tributário Nacional',
  'start': 795,
  'end': 832},
 {'entity_group': 'ORGANIZACAO',
  'score': 0.9583481,
  'word': 'União',
  'start': 884,
  'end': 889},
 {'entity_group': 'LEGISLACAO',
  'score': 0.99337995,
  'word': '§ 1º. do art. 22 da Lei nº. 8. 460 / 92',
  'start': 1436,
  'end': 1471},
 {'entity_group': 'LEGISLACAO',
  'score': 0.99689144,
  'word': 'Lei nº 9. 257 / 97',
  'start': 1487,
  'end': 1502}]

In [173]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=512, chunk_overlap=0, separators=["\t", "\n", "\n\n", ". "])
chunks = text_splitter.split_text(example)

In [174]:
token_classes = model.config.id2label
colors = get_random_colors(len(token_classes))
colors = {token_classes[i]: colors[i] for i in range(len(token_classes))}

In [175]:
colors

{'O': '#6495ED',
 'B-ORGANIZACAO': '#008B8B',
 'I-ORGANIZACAO': '#00FFFF',
 'B-PESSOA': '#FFFF00',
 'I-PESSOA': '#008080',
 'B-TEMPO': '#00FF00',
 'I-TEMPO': '#A9A9A9',
 'B-LOCAL': '#B8860B',
 'I-LOCAL': '#808080',
 'B-LEGISLACAO': '#FF0000',
 'I-LEGISLACAO': '#FF00FF',
 'B-JURISPRUDENCIA': '#7FFF00',
 'I-JURISPRUDENCIA': '#C0C0C0'}

In [176]:
for idx, chunk in enumerate(chunks):
    print(f'Chunk {idx}:')
    perform_inference_and_render_entities(pipeline_ner, chunk)
    print('\n')

Chunk 0:




Chunk 1:




Chunk 2:




Chunk 3:




Chunk 4:






# Questions

1. What is Named Entity Recognition (NER) and why is it important in NLP?

2. What are some common challenges faced in NER?

3. What is the IOB2 tagging scheme used for in NER?

4. How does transfer learning benefit NER tasks?

5. What is the LeNER-BR dataset, and what types of entities does it include?

6. How does spaCy's approach to NER differ from HuggingFace Transformers?

7. What is the purpose of domain adaptation in NLP?

8. Describe the process of fine-tuning a pre-trained language model for NER.

9. What are some strategies for handling large texts in Transformer models?

10. Compare the performance metrics of spaCy and HuggingFace models on the LeNER-BR dataset.


`Answers are commented inside this cell.`
<!-- 
1. Named Entity Recognition (NER) is a task in Information Extraction that focuses on identifying and classifying specific entities within text, such as names, dates, and organizations. It transforms unstructured text into structured information, facilitating tasks like information retrieval, question answering, and content classification.

2. Common challenges in NER include ambiguity, handling new or rare entities, nested entities, and domain specificity. These challenges make it difficult for models to accurately identify and classify entities across different contexts and domains.

3. The IOB2 tagging scheme is used in NER for labeling tokens as part of entities. It includes tags like B (Beginning), I (Inside), and O (Outside) to indicate whether a token is at the beginning of an entity, inside an entity, or outside any entity.

4. Transfer learning benefits NER tasks by allowing models to leverage knowledge from pre-trained language models, which have already learned linguistic patterns from large datasets. This reduces the amount of data and time needed to train models for specific NER tasks.

5. The LeNER-BR dataset is a legal NER dataset comprising 70 annotated legal documents in Brazilian Portuguese. It includes entities such as ORGANIZACAO (organizations), PESSOA (persons), TEMPO (time expressions), LOCAL (locations), LEGISLACAO (legislation), and JURISPRUDENCIA (jurisprudence).

6. SpaCy's approach to NER uses a combination of Convolutional Neural Networks (CNNs), Bidirectional Long Short-Term Memory networks (BiLSTMs), and transition-based models. HuggingFace Transformers, on the other hand, utilize transformer architectures like BERT and GPT-2, which rely on self-attention mechanisms and pre-trained embeddings.

7. Domain adaptation in NLP involves fine-tuning a pre-trained language model on domain-specific data to improve its performance on tasks within that domain. This is particularly useful for specialized fields like legal or medical texts, where language usage may differ from general corpora.

8. Fine-tuning a pre-trained language model for NER involves three steps: loading a pre-trained language model, fine-tuning it with domain-specific text, and training the NER model using the fine-tuned language model. This process allows the model to adapt to specific linguistic patterns and entity types in the target domain.

9. Strategies for handling large texts in Transformer models include text pruning, chunking, the sliding window approach, and using transformers with larger input sizes like Longformer or BigBird. These strategies help manage the token limit constraint while preserving context and meaning.

10. The performance metrics of spaCy and HuggingFace models on the LeNER-BR dataset can be compared by evaluating precision, recall, F1 score, and accuracy. Both models have their strengths and weaknesses, and their performance may vary depending on the specific task and dataset. -->

