# NLP Datasets
## IMD1107 - Processamento de Linguagem Natural
### [Dr. Elias Jacob de Menezes Neto](https://docente.ufrn.br/elias.jacob)

## Keypoints

- A **corpus** is a fundamental collection of text data used to train and evaluate Natural Language Processing (NLP) models.

- Corpora can be **annotated** (labeled, for supervised learning like text classification or NER) or **unannotated** (raw text, for unsupervised learning like language modeling).

- Corpora possess a **hierarchical structure**: documents contain paragraphs, paragraphs contain sentences, sentences contain words, and words contain subword units (morphemes, characters).

- **Domain-specific corpora** (e.g., legal, medical) contain specialized language, improving model performance in targeted areas but requiring careful construction and often expert annotation.

- Key NLP tasks reliant on corpora include **Language Modeling**, **Text Classification**, **Named Entity Recognition (NER)**, Information Extraction, Summarization, and Machine Translation.

- Building a corpus involves defining scope, **data collection** (including web scraping or PDF extraction), cleaning, preprocessing (tokenization, lowercasing, stemming/lemmatization), and validation.

- Effective **data annotation** is crucial for supervised NLP; tools like Label Studio, Doccano, and Argilla facilitate this process, often requiring data in specific formats (e.g., JSON).

- Publicly available corpora, especially for Portuguese, can be found on platforms like **Hugging Face Datasets**, Papers with Code, and Kaggle. Examples include BrWaC, OSCAR, Portuguese Wikipedia, and LeNER-Br.

- NER often uses **BIO notation** (Beginning, Inside, Outside) to tag entities within text.


## Learning goals

By the end of this class, you will be able to:

1. Explain the concept, structure, and purpose of a textual corpus in the context of Natural Language Processing (NLP).

2. Distinguish between annotated and unannotated corpora, including their use cases and roles in supervised, semi-supervised, and unsupervised learning tasks.

3. Identify and describe the primary hierarchical components of a corpus, from documents down to subword units, such as morphemes and phonemes.

4. Explore and evaluate various publicly available Portuguese NLP datasets, including BrWaC, OSCAR, Carolina, FineWeb-2, and the Portuguese Wikipedia corpus, identifying their suitability for different NLP tasks.

5. Apply practical methodologies for building your own textual corpus through data collection, cleaning, preprocessing, categorization, and validation procedures.

6. Implement techniques to extract textual data from diverse formats such as websites (web scraping) and PDF documents using Python libraries like requests, BeautifulSoup, and pypdf.

7. Create structured datasets compatible with annotation tools such as Label Studio, understanding and adhering to the required data formats (e.g., JSON).

8. Annotate textual data effectively using Label Studio for tasks including text classification (e.g., sentiment analysis) and Named Entity Recognition (NER).

9. Evaluate the advantages and limitations of using domain-specific corpora, along with the challenges in domain-specific corpus construction, specifically regarding data collection, annotation quality, and the need for domain expertise.

10. Formulate relevant NLP projects based on real-world datasets, like B2W e-commerce product reviews and Brazilian legal documents, leveraging annotation to aid model training and evaluation for applications such as sentiment analysis, NER, and recommendation systems.

## What can you do with NLP?

Natural Language Processing (NLP) projects are usually built from a small set of fundamental tasks. Because these tasks recur across applications, they have been studied extensively, and knowing them well makes it much easier to design effective NLP solutions. Each task is described below, with its purpose and typical applications.

### 1. Language Modeling

Language modeling revolves around predicting the next word in a sentence based on preceding words. This task is foundational for many NLP applications, such as speech recognition and machine translation. By learning the probability of word sequences, language models can generate coherent and contextually relevant text.
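To make the idea concrete, here is a minimal sketch of a bigram model, which estimates the probability of the next word from counts of adjacent word pairs. The toy corpus and the function names are illustrative assumptions, not part of any library. Real language models are trained on far larger corpora and rely on smoothing or neural networks.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical sentences, for illustration only)
toy_corpus = [
    "o gato dorme no sofá",
    "o gato come peixe",
    "o cachorro dorme no quintal",
]


def train_bigram_counts(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for previous, current in zip(words, words[1:]):
            counts[previous][current] += 1
    return counts


def next_word_probabilities(counts, previous):
    """Turn raw follow-up counts into relative frequencies."""
    followers = counts[previous]
    total = sum(followers.values())
    return {word: n / total for word, n in followers.items()}


bigram_counts = train_bigram_counts(toy_corpus)
probs = next_word_probabilities(bigram_counts, "o")
print(probs)
```

In this toy corpus, "gato" follows "o" twice and "cachorro" once, so the model prefers "gato" as the next word after "o".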

### 2. Text Classification

Text classification involves categorizing text into predefined categories based on its content. This is one of the most common tasks in NLP and has numerous applications, such as spam detection, sentiment analysis, and topic classification. By training models to recognize patterns in text, we can automate the process of sorting and labeling large volumes of text data.

### 3. Information Extraction

Information extraction focuses on extracting relevant information from text. For example, it can identify dates and events from emails or extract names and locations from news articles. This task is essential for transforming unstructured text into structured data that can be easily analyzed and queried.
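As a small illustration of turning unstructured text into structured data, the sketch below pulls dates out of a message with a regular expression. The e-mail text, the `DD/MM/YYYY` date format, and the function name are assumptions made for this example. Production systems usually combine such rules with trained models.

```python
import re

# Hypothetical e-mail text used only for illustration
email_text = (
    "A reunião foi remarcada para 15/03/2025. "
    "O relatório final deve ser entregue até 30/04/2025."
)

# Matches dates written as DD/MM/YYYY
DATE_PATTERN = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")


def extract_dates(text):
    """Return each date found in the text as a structured record."""
    return [
        {"day": int(day), "month": int(month), "year": int(year)}
        for day, month, year in DATE_PATTERN.findall(text)
    ]


dates = extract_dates(email_text)
print(dates)
```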

### 4. Information Retrieval

Information retrieval is about finding relevant documents based on a user's query. This task is the backbone of search engines like Google, where the goal is to deliver the most relevant results from a vast collection of documents. Effective information retrieval requires understanding both the user's query and the content of the documents to provide accurate matches.
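One classic way to match a query against documents is TF-IDF weighting: a term counts more when it is frequent in a document and rare in the collection. The following is a minimal sketch with a hypothetical three-document collection and a made-up `rank_documents` helper. Real search engines use far more elaborate ranking functions.

```python
import math
from collections import Counter

# Hypothetical mini-collection of documents
documents = {
    "doc1": "receita de bolo de chocolate",
    "doc2": "manual de instalação do software",
    "doc3": "bolo de cenoura com cobertura de chocolate",
}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def rank_documents(query, docs):
    """Rank documents by the summed TF-IDF weight of the query terms."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(
        term for tokens in tokenized.values() for term in set(tokens)
    )
    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in term_freq:
                idf = math.log(n_docs / doc_freq[term])
                score += (term_freq[term] / len(tokens)) * idf
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_documents("bolo de cenoura", documents)
print(ranking)
```

Note that "de" appears in every document, so its inverse document frequency is zero and it does not influence the ranking.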

### 5. Conversational Agents

Conversational agents, or chatbots, are designed to interact with users in natural language. These systems range from simple rule-based bots to advanced AI-driven assistants like ChatGPT and Siri. The goal is a smooth experience in which the system understands user queries and responds to them effectively.

### 6. Text Summarization

Text summarization aims to create concise summaries of longer documents while preserving the central meaning. This is particularly useful for quickly understanding the gist of lengthy reports, articles, or research papers without reading the entire text. Summarization can be either extractive (selecting key sentences) or abstractive (generating new sentences).
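The extractive variant can be sketched with a simple frequency heuristic: score each sentence by how frequent its words are in the whole text and keep the best-scoring ones. The scoring rule, the sample text, and the function name below are assumptions chosen for brevity, not a standard algorithm from a specific library.

```python
import re
from collections import Counter


def extractive_summary(text, n_sentences=1):
    """Pick the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        if not words:
            return 0.0
        # Average corpus frequency of the words in this sentence
        return sum(word_freq[w] for w in words) / len(words)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Preserve the original order of the selected sentences
    return [s for s in sentences if s in top]


# Hypothetical text for illustration
sample_text = (
    "O corpus reúne textos jurídicos. "
    "Os textos jurídicos do corpus foram anotados por especialistas. "
    "O clima estava agradável."
)
summary = extractive_summary(sample_text, n_sentences=1)
print(summary)
```

An abstractive summarizer, by contrast, would generate new sentences and typically requires a trained sequence-to-sequence model.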

### 7. Question Answering

Question answering systems are designed to automatically answer questions posed in natural language. These systems can range from simple FAQs to complex AI-driven models that understand context and provide detailed answers. Effective question answering requires a deep understanding of language and the ability to retrieve and integrate information from various sources.

### 8. Machine Translation

Machine translation involves translating text from one language to another. This task is essential for breaking down language barriers and enabling global communication. Systems such as Google Translate rely on neural machine translation models to produce accurate, context-appropriate translations.

### 9. Topic Modeling

Topic modeling is about uncovering the hidden thematic structure within a large set of documents. This technique is widely used in text mining to discover the main topics discussed in a corpus. Applications range from analyzing literature and academic papers to understanding customer feedback and social media trends.

## Corpus in Natural Language Processing (NLP)

In NLP, a **corpus** (plural: *corpora*) is a carefully compiled collection of textual data written in human language. These datasets support the training and evaluation of models, enabling them to interpret and work with natural language. The following sections provide a detailed overview suitable for graduate-level study, including structure, types, and considerations for building corpora.

### Size and Scope of Corpora

Corpora vary widely in size:

- **Small and Targeted:** Collections designed for specific tasks, such as sentiment analysis within a niche domain.
- **Large and Diverse:** Datasets used to train language models for applications like machine translation or summarization. These corpora may contain billions of words.

For instance, consider these examples:
- The [HPLT (High Performance Language Technologies v2.0)](https://hplt-project.org/datasets/v2.0) dataset includes data in several languages, with the Portuguese subset containing 237.81 million documents and 146.2 billion words.
- The [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) dataset contains 189.85 million documents and 105.27 billion words in Portuguese, among other languages.
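A quick back-of-the-envelope calculation helps put these figures in perspective. The sketch below only divides the word counts by the document counts quoted above to estimate the average document length in each Portuguese subset. The variable names are illustrative.

```python
# Figures quoted above for the Portuguese subsets
corpora_stats = {
    "HPLT v2.0": {"documents": 237.81e6, "words": 146.2e9},
    "FineWeb-2": {"documents": 189.85e6, "words": 105.27e9},
}

# Average number of words per document in each corpus
avg_words = {
    name: stats["words"] / stats["documents"]
    for name, stats in corpora_stats.items()
}

for name, value in avg_words.items():
    print(f"{name}: about {value:.0f} words per document")
```

Both corpora average a few hundred words per document, which is consistent with typical web pages rather than books or long reports.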


### Types of Corpora: Annotated vs. Unannotated

Corpora can be classified based on whether the data includes additional labels or tags:

1. **Annotated Corpora:**
   - **Description:** These datasets include labels or tags that provide information about linguistic or semantic properties of the text.
   - **Usage Example:** In sentiment analysis, each sentence may be marked as positive, negative, or neutral.
   - **Relevance:** Essential for **supervised learning**, where models learn to predict outcomes from clearly defined examples.

2. **Unannotated Corpora:**
   - **Description:** These consist solely of raw textual data without labels.
   - **Usage Example:** A large set of news articles without any topic tags.
   - **Relevance:** Crucial for **unsupervised learning** methods, where models identify patterns and structures in the absence of explicit guidance.

> **Note:** Annotating text is a resource-intensive process. Many modern NLP methods start with large unannotated datasets and then refine the model using smaller annotated collections. This pretrain-then-fine-tune approach is the basis of transfer learning in NLP.

### Structure of a Corpus: From Documents to Components of Words

A corpus is often organized into hierarchical units, which aids in efficient processing and analysis. The typical organization includes:

1. **Documents:**  
   - Self-contained units such as news articles, research papers, or social media posts.
2. **Paragraphs:**  
   - Divisions within documents that group related sentences.
3. **Sentences:**  
   - Complete thoughts or statements that can be analyzed independently.
4. **Words and Punctuation:**  
   - Fundamental elements that carry meaning and provide grammatical structure.
5. **Subword Units:**  
   - **Syllables:** Units of pronunciation, e.g., "nat-u-ral."
   - **Phonemes:** Individual sounds making up words.
   - **Morphemes:** Smallest units with semantic meaning, e.g., prefixes like "un-" indicating negation.
   - **Characters:** Basic symbols such as letters.

This detailed breakdown allows for analysis at different levels of granularity, from overall document themes down to individual linguistic units.
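The hierarchy can be made tangible with a few lines of Python. The sketch below splits a hypothetical document into paragraphs, sentences, word/punctuation tokens, and characters using simple regular expressions. The splitting rules are deliberately naive assumptions: real tokenizers handle abbreviations, numbers, and other edge cases.

```python
import re

# Hypothetical two-paragraph document
document = (
    "O tribunal julgou o recurso. A decisão foi unânime.\n\n"
    "As partes foram intimadas."
)


def split_hierarchy(text):
    """Break a document into paragraphs -> sentences -> word/punctuation tokens."""
    hierarchy = []
    for paragraph in text.split("\n\n"):
        # Split after sentence-final punctuation followed by whitespace
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        # Keep words and punctuation marks as separate tokens
        hierarchy.append([re.findall(r"\w+|[^\w\s]", s) for s in sentences])
    return hierarchy


hierarchy = split_hierarchy(document)
print(len(hierarchy), "paragraphs")
print(hierarchy[0][0])           # tokens of the first sentence
print(list(hierarchy[0][0][1]))  # characters of the word "tribunal"
```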


### Domain-Specific Corpora

Certain tasks require texts that represent specialized language, making a general-purpose corpus insufficient. **Domain-specific corpora** focus on the language, terminology, and conventions of a specific field, such as:

- **Medical Corpus:**  
  Includes clinical notes, research articles, and patient records. The vocabulary is specific to healthcare and medical research.
- **Legal Corpus:**  
  Contains legal documents, court transcripts, contracts, and legislation, exhibiting the precise language of law.
- **Financial Corpus:**  
  Encompasses financial news, company reports, and market analyses that rely on domain-specific terminology.

#### Advantages

- **Increased Accuracy:**  
  NLP models trained on data that closely matches the target domain often make more accurate predictions.
- **Focused Insights:**  
  Analysis of such corpora can reveal trends and patterns unique to the field.

#### Challenges

- **Data Collection:**  
  Assembling high-quality data from reliable sources demands significant effort and time.
- **Annotation:**  
  The process typically requires domain experts to assign correct labels and categories to the text.


### Constructing a Domain-Specific Corpus

When building a valuable domain-specific corpus, consider the following steps:

1. **Define the Domain:**  
   Clearly specify the area or industry of focus. This definition guides all subsequent stages.
2. **Data Collection:**  
   Gather a variety of documents from reputable and relevant sources. This may involve accessing public databases, web scraping (with permission), or collaborating with organizations.
3. **Data Cleaning:**  
   Remove non-relevant content (e.g., advertisements, duplicate records, boilerplate text) to ensure data quality.
4. **Preprocessing:**  
   Prepare the text through techniques such as:
 
   - **Lowercasing:** Ensures uniformity.  
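   - **Stemming/Lemmatization:** Reduces inflected forms to a common base. Lemmatization maps "pulará", "pulou", and "pulei" to the dictionary form "pular", whereas stemming only strips affixes and yields a truncated stem such as "pul".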
   - **Removing Stop Words:** Excludes common words like "o", "os", "as", "a", "que" that contribute less semantic content.  
   - **Tokenization:** Splits text into manageable units such as words or meaningful phrases.

5. **Categorization/Classification:**  
   Organize documents into relevant categories aligned with the domain's needs.
6. **Validation:**  
   Assess whether the corpus reflects the intended domain accurately and evaluate the effectiveness of cleaning and categorization.
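The preprocessing techniques listed in step 4 can be chained into a small pipeline. The sketch below is a toy version: the stop-word set and the lemma dictionary are tiny, hand-written assumptions standing in for the full resources that a real stop-word list, stemmer, or lemmatizer would provide.

```python
import re

# Small illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"o", "os", "a", "as", "que", "de", "e"}

# Toy lemma dictionary (hypothetical); in practice a lemmatizer fills this role
LEMMAS = {"pulará": "pular", "pulou": "pular", "pulei": "pular"}


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and map words to their lemma."""
    # Lowercasing and tokenization into word tokens
    tokens = re.findall(r"\w+", text.lower())
    # Stop-word removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Dictionary-based lemmatization (unknown words are kept unchanged)
    return [LEMMAS.get(t, t) for t in tokens]


processed = preprocess("O menino pulou a cerca e eu pulei depois")
print(processed)
```

Whether each step helps depends on the task: for example, lowercasing can hurt NER, where capitalization is a useful signal.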

## Locating High-Quality Portuguese Corpora

Looking for corpora in Portuguese? The same strategies used to find English corpora apply. This section lists well-established resources that host Portuguese corpora:

### 1. [Hugging Face Datasets](https://huggingface.co/datasets?language=language:pt&sort=trending)

Hugging Face, the company behind widely used NLP libraries such as `transformers` and `datasets`, hosts an extensive collection of datasets, including many in Portuguese. The datasets can be downloaded directly from the site or loaded through the `datasets` Python library.

### 2. [Papers with Code](https://paperswithcode.com/datasets?lang=portuguese&page=1)

A valuable resource for finding papers and corresponding code for specific tasks. Search for papers utilizing Portuguese datasets, and you'll also find the code that trained the models.

### 3. [Kaggle](https://www.kaggle.com/datasets?search=portuguese)

An established platform for data science competitions, Kaggle also provides a rich collection of datasets, some of which are in Portuguese.

### 4. [Brazilian Federal Government's Open Data Portal](https://dados.gov.br/)

The Brazilian government's open data portal hosts a variety of datasets. However, these datasets may require some data cleaning and preprocessing before use due to their non-standard format.

### 5. [GLUE - General Language Understanding Evaluation](https://gluebenchmark.com/)

GLUE is an assortment of resources designed for training, evaluating, and analyzing Natural Language Understanding (NLU) systems. It includes nine distinct tasks, each utilizing a unique dataset. Among these tasks you will find:

- **MNLI**: Sentence-pair classification task using 433k sentence pairs annotated for textual entailment.

- **QQP**: Binary classification task determining semantic equivalence of over 400,000 pairs of Quora questions.

- **QNLI**: Binary classification task converted from the Stanford Question Answering Dataset (SQuAD): given a question and a sentence, decide whether the sentence contains the answer.

- **SST-2**: Single-sentence binary classification task using sentences from movie reviews.

- **CoLA**: Single-sentence binary classification task labelling English sentences as grammatically correct or incorrect.

- **STS-B**: Sentence-pair regression task using news headlines and other sources with human-annotated similarity scores.

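- **MRPC**: Sentence-pair binary classification task that decides whether two sentences, automatically extracted from online news sources, are paraphrases of each other.

- **RTE**: Sentence-pair binary classification task for textual entailment, combining several Recognizing Textual Entailment challenge datasets.

- **WNLI**: Sentence-pair binary classification task derived from the Winograd Schema Challenge: decide whether a sentence in which an ambiguous pronoun is replaced by a candidate referent is entailed by the original sentence.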

GLUE is a monolingual **English** benchmark. For Portuguese, there is a machine-translated version called [PLUE](https://github.com/ju-resplande/PLUE). Machine translation introduces some noise, but PLUE can still be useful for certain tasks. It is available on Hugging Face Datasets [here](https://huggingface.co/datasets/dlb/plue).

### Highlighted Datasets 

Let's discuss a few Portuguese datasets that could prove especially beneficial:
- [BrWaC](https://huggingface.co/datasets/brwac): The Brazilian Portuguese Web as Corpus, built from texts extracted from the web. It reflects common, everyday written language, which makes it a good fit for tasks that need broad general-domain coverage, such as language modeling.

- [OSCAR](https://huggingface.co/datasets/oscar): A huge multilingual corpus obtained by classifying and filtering the Common Crawl corpus. It contains 138GB of text data in 67 languages, including Portuguese.

- [Portuguese Wikipedia](https://huggingface.co/datasets/olm/wikipedia): As the name suggests, this dataset contains the Portuguese Wikipedia content.

- [Carolina](https://huggingface.co/datasets/carolina-c4ai/corpus-carolina): An Open Corpus for Linguistics and Artificial Intelligence containing documents and texts in contemporary Brazilian Portuguese (1970-2021) extracted from the web, with associated metadata about their provenance and typology.

- [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2): A large-scale web corpus in several languages, including Portuguese, with 189.85 million documents and 105.27 billion words.

- [HPLT (High Performance Language Technologies v2.0)](https://hplt-project.org/datasets/v2.0): A multilingual dataset whose Portuguese subset contains 237.81 million documents and 146.2 billion words.




### Working with Real-World Datasets: The B2W Product Reviews Example

NLP work depends on substantial, well-structured datasets. They serve as training material for our models and provide insight into real-world language use.

Many resources offer ready-made datasets. One of them is Hugging Face Datasets, a hub with a large collection of datasets for various NLP tasks. Let's examine a specific example: product reviews from B2W, a Brazilian e-commerce platform. You can find this dataset on Hugging Face [here](https://huggingface.co/datasets/ruanchaves/b2w-reviews01).

This dataset is a good opportunity to explore real-world customer feedback. Such data is central to sentiment analysis, where we aim to understand customer opinions and attitudes toward products or services. It can also be used for other tasks, such as topic modeling or building recommendation systems.


In [1]:
from datasets import load_dataset

ds_b2w = load_dataset("ruanchaves/b2w-reviews01")

In [2]:
ds_b2w

DatasetDict({
    train: Dataset({
        features: ['submission_date', 'reviewer_id', 'product_id', 'product_name', 'product_brand', 'site_category_lv1', 'site_category_lv2', 'review_title', 'overall_rating', 'recommend_to_a_friend', 'review_text', 'reviewer_birth_year', 'reviewer_gender', 'reviewer_state'],
        num_rows: 132373
    })
})

In [3]:
ds_b2w["train"]

Dataset({
    features: ['submission_date', 'reviewer_id', 'product_id', 'product_name', 'product_brand', 'site_category_lv1', 'site_category_lv2', 'review_title', 'overall_rating', 'recommend_to_a_friend', 'review_text', 'reviewer_birth_year', 'reviewer_gender', 'reviewer_state'],
    num_rows: 132373
})

In [4]:
ds_b2w["train"][:5]

{'submission_date': ['2018-01-01 00:11:28',
  '2018-01-01 00:13:48',
  '2018-01-01 00:26:02',
  '2018-01-01 00:35:54',
  '2018-01-01 01:00:28'],
 'reviewer_id': ['d0fb1ca69422530334178f5c8624aa7a99da47907c44de0243719b15d50623ce',
  '014d6dc5a10aed1ff1e6f349fb2b059a2d3de511c7538a9008da562ead5f5ecd',
  '44f2c8edd93471926fff601274b8b2b5c4824e386ae4f210329b9b71890277fd',
  'ce741665c1764ab2d77539e18d0e4f66dde6213c9f0863f165ffedb1e8147984',
  '7d7b6b18dda804a897359276cef0ca252f9932bf4b5c8e72bce7e88850efa0fc'],
 'product_id': ['132532965',
  '22562178',
  '113022329',
  '113851581',
  '131788803'],
 'product_name': ['Notebook Asus Vivobook Max X541NA-GO472T Intel Celeron Quad Core 4GB 500GB Tela LED 15,6" Windows - 10 Branco',
  'Copo Acrílico Com Canudo 500ml Rocie',
  'Panela de Pressão Elétrica Philips Walita Daily 5L com Timer',
  'Betoneira Columbus - Roma Brinquedos',
  'Smart TV LED 43" LG 43UJ6525 Ultra HD 4K com Conversor Digital 4 HDMI 2 USB WebOS 3.5 Painel Ips HDR e Magic Mobile 

In [5]:
# Import the pandas library, which is essential for data manipulation and analysis
import pandas as pd

# Select the first 100 rows of the 'train' split of the 'ds_b2w' dataset
# and convert them into a pandas DataFrame
df = pd.DataFrame(ds_b2w["train"][:100])

# Display the DataFrame to see the selected rows
df

Unnamed: 0,submission_date,reviewer_id,product_id,product_name,product_brand,site_category_lv1,site_category_lv2,review_title,overall_rating,recommend_to_a_friend,review_text,reviewer_birth_year,reviewer_gender,reviewer_state
0,2018-01-01 00:11:28,d0fb1ca69422530334178f5c8624aa7a99da47907c44de...,132532965,Notebook Asus Vivobook Max X541NA-GO472T Intel...,,Informática,Notebook,Bom,4,Yes,Estou contente com a compra entrega rápida o ú...,1958.0,F,RJ
1,2018-01-01 00:13:48,014d6dc5a10aed1ff1e6f349fb2b059a2d3de511c7538a...,22562178,Copo Acrílico Com Canudo 500ml Rocie,,Utilidades Domésticas,"Copos, Taças e Canecas","Preço imbatível, ótima qualidade",4,Yes,"Por apenas R$1994.20,eu consegui comprar esse ...",1996.0,M,SC
2,2018-01-01 00:26:02,44f2c8edd93471926fff601274b8b2b5c4824e386ae4f2...,113022329,Panela de Pressão Elétrica Philips Walita Dail...,philips walita,Eletroportáteis,Panela Elétrica,ATENDE TODAS AS EXPECTATIVA.,4,Yes,SUPERA EM AGILIDADE E PRATICIDADE OUTRAS PANEL...,1984.0,M,SP
3,2018-01-01 00:35:54,ce741665c1764ab2d77539e18d0e4f66dde6213c9f0863...,113851581,Betoneira Columbus - Roma Brinquedos,roma jensen,Brinquedos,Veículos de Brinquedo,presente mais que desejado,4,Yes,MEU FILHO AMOU! PARECE DE VERDADE COM TANTOS D...,1985.0,F,SP
4,2018-01-01 01:00:28,7d7b6b18dda804a897359276cef0ca252f9932bf4b5c8e...,131788803,"Smart TV LED 43"" LG 43UJ6525 Ultra HD 4K com C...",lg,TV e Home Theater,TV,"Sem duvidas, excelente",5,Yes,"A entrega foi no prazo, as americanas estão de...",1994.0,M,MG
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,2018-01-01 06:46:15,a2dc589ed43957ef1d4986474b3d221eb4c5e263368d37...,16302400,Cafeteira Italiana Em Inox 9 Xícaras Kehome,,Eletroportáteis,Cafeteira,terrível,1,No,"Além de vazar po no café a jarra escurece , pé...",1961.0,M,SP
96,2018-01-01 06:46:52,fb5a2615c60b3abab4d5859e68320d7a800e772aab67c1...,25196807,Boneco Falcon Turbocóptero - Estrela,,Brinquedos,Bonecos,Recomendo,5,Yes,Produto muito bom. Atende ao que se propõe. Xx...,1976.0,M,RJ
97,2018-01-01 06:47:42,a2dc589ed43957ef1d4986474b3d221eb4c5e263368d37...,14284347,Fone Ouvido Sem Fio Favix B01 Bluetooth Fm Sd ...,,Áudio,Fones de Ouvido,Péssima qualidade,1,No,A primeira vez de uso quando fui guarda o fone...,1961.0,M,SP
98,2018-01-01 06:48:31,5060e6fb478a7b1f01c8a48d739f9dc84eb111142612e3...,131657539,Smartphone Samsung Galaxy J2 Prime TV Dual Chi...,samsung,Celulares e Smartphones,Smartphone,Muito satisfeito,4,Yes,"Recomendo tanto o produto quanto a loja, produ...",1977.0,M,RS


`Stop for a moment and think about some ideas for projects using the data above`
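One possible direction, sketched below, is to turn the star ratings into sentiment labels so that the reviews become an annotated corpus for text classification. The thresholds (4 to 5 stars positive, 3 neutral, 1 to 2 negative) and the function name are assumptions made for this example. The sample pairs are copied from the rows displayed above.

```python
def rating_to_sentiment(rating):
    """Map a 1-5 star rating to a coarse sentiment label (thresholds are an assumption)."""
    if rating >= 4:
        return "positive"
    if rating == 3:
        return "neutral"
    return "negative"


# A few (review_title, overall_rating) pairs taken from the rows displayed above
sample_reviews = [
    ("Bom", 4),
    ("Sem duvidas, excelente", 5),
    ("terrível", 1),
    ("Péssima qualidade", 1),
]

labeled = [(title, rating_to_sentiment(rating)) for title, rating in sample_reviews]
print(labeled)
```

Labels derived this way are sometimes called weak or distant labels: they are cheap to obtain, but a rating does not always match the tone of the written review, so a manual check of a sample is advisable.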

### Annotated Corpus Example - Named Entity Recognition
Named entity recognition (NER) is the task of tagging entities in text with their corresponding type. 

<br><br>
<p align="center">
  <img src="images/NER.png"  alt="" style="width: 80%; height: 80%"/>
</p>
<br><br>

Approaches typically use BIO notation, which differentiates the beginning (B) and the inside (I) of entities. O is used for non-entity tokens. Let us look at an example with [LeNER-Br](https://huggingface.co/datasets/lener_br), which is a dataset for NER in Brazilian legal documents.

In [6]:
from datasets import load_dataset

ds_lener = load_dataset("lener_br", trust_remote_code=True)

In [7]:
ds_lener

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 7828
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1177
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1390
    })
})

In [8]:
ds_lener["train"][0]

{'id': '0',
 'tokens': ['EMENTA',
  ':',
  'APELAÇÃO',
  'CÍVEL',
  '-',
  'AÇÃO',
  'DE',
  'INDENIZAÇÃO',
  'POR',
  'DANOS',
  'MORAIS',
  '-',
  'PRELIMINAR',
  '-',
  'ARGUIDA',
  'PELO',
  'MINISTÉRIO',
  'PÚBLICO',
  'EM',
  'GRAU',
  'RECURSAL',
  '-',
  'NULIDADE',
  '-',
  'AUSÊNCIA',
  'DE',
  'INTERVENÇÃO',
  'DO',
  'PARQUET',
  'NA',
  'INSTÂNCIA',
  'A',
  'QUO',
  '-',
  'PRESENÇA',
  'DE',
  'INCAPAZ',
  '-',
  'PREJUÍZO',
  'EXISTENTE',
  '-',
  'PRELIMINAR',
  'ACOLHIDA',
  '-',
  'NULIDADE',
  'RECONHECIDA',
  '.'],
 'ner_tags': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  2,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0]}

In [9]:
# Build a dictionary that maps each integer NER tag ID to its string label,
# so the integer-encoded tags can be read in a human-readable form.

# The 'ner_tags' feature is a sequence of ClassLabel values; its 'names' list
# is ordered by tag ID, so enumerating it gives the same ID -> label mapping
# as calling 'str2int' on every name.

ner_label_names = ds_lener["train"].features["ner_tags"].feature.names
mapping_key_lener = dict(enumerate(ner_label_names))

# Display the mapping dictionary to verify its contents
mapping_key_lener

{0: 'O',
 1: 'B-ORGANIZACAO',
 2: 'I-ORGANIZACAO',
 3: 'B-PESSOA',
 4: 'I-PESSOA',
 5: 'B-TEMPO',
 6: 'I-TEMPO',
 7: 'B-LOCAL',
 8: 'I-LOCAL',
 9: 'B-LEGISLACAO',
 10: 'I-LEGISLACAO',
 11: 'B-JURISPRUDENCIA',
 12: 'I-JURISPRUDENCIA'}

In [10]:
# Display the entity tokens of one training sentence together with their NER labels
# Parameters:
# - idx: Index of the sentence to display
# - ds: The dataset containing the sentences and their NER tags
# - mapping_key: Dictionary mapping integer NER tags to their string names
def show_tagged_sentence(idx, ds, mapping_key):
    # Fetch the example once instead of indexing the dataset twice
    example = ds["train"][idx]
    for token, tag in zip(example["tokens"], example["ner_tags"]):
        # Skip tokens tagged 0, which is the 'O' (outside any entity) label
        if tag != 0:
            print(f"{token}\t\t{mapping_key[tag]}")


# Call the function to display the tokens and NER tags for the first sentence in the dataset
show_tagged_sentence(0, ds_lener, mapping_key_lener)

MINISTÉRIO		B-ORGANIZACAO
PÚBLICO		I-ORGANIZACAO


In [11]:
show_tagged_sentence(1, ds_lener, mapping_key_lener)

art		B-LEGISLACAO
.		I-LEGISLACAO
178		I-LEGISLACAO
,		I-LEGISLACAO
II		I-LEGISLACAO
,		I-LEGISLACAO
do		I-LEGISLACAO
CPC		I-LEGISLACAO
Ministério		B-ORGANIZACAO
Público		I-ORGANIZACAO
Ministério		B-ORGANIZACAO
Público		I-ORGANIZACAO
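Token-level BIO tags are usually grouped back into whole entities before being used downstream. The sketch below shows one way to do that: a `B-` tag opens an entity, matching `I-` tags extend it, and anything else closes it. The function name is hypothetical, and the example reuses tokens and tags from the output above (the `O` tokens between entities were not printed, so they are omitted here).

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: close the previous one, if any
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the currently open entity
            current_tokens.append(token)
        else:
            # "O" tag (or an I- tag with no matching open entity) closes any open entity
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["art", ".", "178", ",", "II", ",", "do", "CPC", "Ministério", "Público"]
tags = ["B-LEGISLACAO"] + ["I-LEGISLACAO"] * 7 + ["B-ORGANIZACAO", "I-ORGANIZACAO"]

spans = bio_to_spans(tokens, tags)
print(spans)
```

The `B-` prefix is what lets two adjacent entities of the same type be told apart, which a plain "inside/outside" scheme could not do.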


`Stop for a moment and think about some ideas for projects using these datasets, but now think about NER in general`

### Unannotated corpus

In [None]:
# Import the load_dataset function from the datasets library
# This function is used to load various datasets, including large text corpora

from datasets import load_dataset

# Load the Portuguese Wikipedia dataset using the 'olm/wikipedia' dataset identifier
# Specify the language as Portuguese ('pt') and the date of the Wikipedia dump as '20250220'
# The date format is YYYYMMDD, and you can check the available dump dates at: https://dumps.wikimedia.org/backup-index.html
# Note: This dataset is very large and may take a significant amount of time to download and process
# It is recommended to have a powerful machine with ample RAM and a fast internet connection to handle this dataset

ds_wiki = load_dataset(
    "olm/wikipedia", language="pt", date="20250220", trust_remote_code=True
)


# FYI, it took 42 minutes to download, parse and process the dataset on a 48-core machine with 256GB of RAM and a 1Gbps internet connection.


In [13]:
ds_wiki

DatasetDict({
    train: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 1144009
    })
})

In [14]:
ds_wiki["train"][10]

{'id': '5024936',
 'url': 'https://pt.wikipedia.org/wiki/The%20Four%20Feathers%20%281939%29',
 'title': 'The Four Feathers (1939)',
 'text': '{{Info/Filme\n|título=The Four Feathers\n|título-pt= \n|título-br=As Quatro Plumas, ouAs Quatro Penas Brancas<ref>{{citar web|URL=http://memoria.bn.br/DocReader/154083_01/11704|título=As Quatro Penas Brancas|autor=Azeredo, Ely|data=9/3/1953|publicado=Tribuna da Imprensa (Rio de Janeiro), p. 2|acessodata=22/1/2018}}</ref>\n|imagem=Four-Feathers-1939.jpg|thumb|250px\n|legenda = John Clements e Ralph Richardson em cena do filme\n|ano=1939\n|duração=130\n|idade=\n|idioma=inglês  Árabe\n|gênero=Épico Guerra\n|produtor = Alexander Korda\n|música = Miklós Rózsa\n|direção=Zoltan Korda\n|roteiro=R. C. Sherriff(roteiro)Lajos Bíró(roteiro)Arthur Wimperis(roteiro)A.E.W. Mason (livro de 1902)\n|elenco=John ClementsJune DuprezRalph Richardson\n|código-IMDB=0031334\n|tipo=LF\n|país=\n|cor-pb=Technicolor\n}}The Four Feathers é um filme épico britânico de 1939, d

Along with [BrWaC](https://huggingface.co/datasets/brwac), the Wikipedia corpus is commonly used for unsupervised tasks, especially _language modeling_. Later in the course, we will see what language modeling entails and how to use these datasets for it. As a preview: a language model is trained to assign probabilities to word sequences, typically by predicting the next word from the preceding ones, and in doing so it captures syntactic, semantic, and contextual relations between words.

## Building Your Own Corpora

A corpus is a structured collection of texts used for research, natural language processing, and training machine learning models. The process of creating a corpus can range from setting up a simple collection of plain text files to assembling a diverse database that includes multiple file formats and data sources.

### Step 1: Setting Up a Basic Corpus

- **Definition and Storage**:  
  At its simplest, a corpus might reside as a directory on your computer containing plain text files. Each file becomes a document that the model can learn from.
  
- **Plain Text Files**:  
  A *plain text file* is one with a `.txt` extension that stores unformatted text. These files are easy to process and are often the starting point for corpus construction.
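A directory-based corpus like this can be loaded with a few lines of standard-library Python. In the sketch below, the function name and the file names are illustrative assumptions, and a temporary directory stands in for your real corpus folder so the example runs on its own.

```python
import tempfile
from pathlib import Path


def load_plain_text_corpus(corpus_dir):
    """Read every .txt file in a directory into a {filename: text} dictionary."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(Path(corpus_dir).glob("*.txt"))
    }


# Build a throwaway corpus directory so the example is self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, "doc1.txt").write_text("Primeiro documento.", encoding="utf-8")
    Path(tmp_dir, "doc2.txt").write_text("Segundo documento.", encoding="utf-8")
    corpus = load_plain_text_corpus(tmp_dir)

print(corpus)
```

Passing `encoding="utf-8"` explicitly matters for Portuguese text, since accented characters are easily corrupted when the platform default encoding differs.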

### Step 2: Expanding the Scope of the Corpus

To achieve a more representative set of data, consider including a variety of sources:

- **Multiple Formats**:  
  While plain text files form the basic structure, you can also include:
  - HTML files
  - Microsoft Word documents
  - PDF files

- **Diverse Data Sources**:  
  The richness of your corpus increases when you add content from:
  - **Databases**: Both SQL and NoSQL databases can serve as storage for textual data.
  - **APIs**: Data collected from open or proprietary APIs.
  - **Websites**: Information gathered through web scraping.

*Example Analogy*: Imagine your corpus as a large library. A basic library may have a single genre, but expanding the collection to include various genres and materials (books, magazines, newspapers) will provide a more extensive resource for research.
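When HTML files are part of the corpus, the markup has to be stripped so that only the visible text remains. The sketch below uses Python's built-in `html.parser` as a dependency-free stand-in; dedicated libraries such as BeautifulSoup are more convenient for real pages. The class name, helper function, and sample HTML are assumptions for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style elements
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string as a single line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical HTML snippet
sample_html = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Bula</h1><p>Dipirona em gotas.</p></body></html>"
)
extracted_text = html_to_text(sample_html)
print(extracted_text)
```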

### Step 3: Case Study – Extracting Data from a PDF

A practical case study involves creating a corpus from a PDF document, such as a drug leaflet for Metamizole (Dipyrone). Here is the overall approach:

- **Data Extraction Process**:  
  The process begins by downloading the PDF, followed by extracting text. This extraction allows the content of the PDF to be transformed into a format suitable for analysis.

- **Tools and Techniques**:  
  In practice, Python libraries automate both steps: `requests` handles the HTTP download and `pypdf` extracts the text. The goal is to convert the information contained in the PDF into a text file or another format that can easily be integrated into your corpus.

### Key Considerations

- **Diversity and Quality**:  
  The performance of a machine learning model is influenced by the breadth and quality of the corpus. A diversified corpus that represents various sources and file formats can capture a wider range of language patterns and contexts.

- **Resource Management**:  
  Be mindful of the storage, processing power, and time required to build and maintain a large corpus. Proper data management ensures that the corpus remains useful and up-to-date.


In [15]:
import requests
import pypdf
from pathlib import Path

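# Download the PDF file from the URL
url = "https://www.ache.com.br/wp-content/uploads/application/pdf/bula-paciente-dipirona-monoidratada-gotas.pdf"
# verify=False skips SSL certificate validation to work around certificate errors
# on this host; avoid it when downloading from untrusted sources
response = requests.get(url, verify=False, timeout=30)
response.raise_for_status()  # stop early if the download failed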

# Create the tmp path if it doesn't exist
Path("outputs/tmp").mkdir(parents=True, exist_ok=True)

# Save the file locally
with open("outputs/tmp/bula.pdf", "wb") as f:
    f.write(response.content)

# Open the PDF file
pdf = pypdf.PdfReader("outputs/tmp/bula.pdf")

# Get the number of pages
num_pages = len(pdf.pages)

# Print the number of pages
print(f"The number of pages is: {num_pages}")

# Loop through each page and extract the text
for page in pdf.pages:
    # Extract the text from the page
    text = page.extract_text()

    # Print the text
    print(text)



The number of pages is: 2
dipirona monoidratada
Medicamento Genérico Lei nº 9.787, de 1999
APRESENTAÇÕES
Solução oral (gotas) de 500 mg/ml: frascos com 10 ml 
e 20 ml.
USO ORAL 
USO ADULTO E PEDIÁTRICO ACIMA DE 3 
MESES
COMPOSIÇÃO
Cada ml (= 20 gotas) de solução oral de dipirona 
monoidratada contém:
dipirona monoidratada .....................................500 mg
Excipientes: sacarina sódica di-hidratada, metil pa ra-
beno, glicerol, edetato de cálcio dissódico hidratado, 
metabissulfito de sódio, sorbitol, amarelo de tartrazina 
e água purificada.
1 gota equivale a 25 mg de dipirona monoidratada. 
INFORMAÇÕES AO PACIENTE
1. PARA QUE ESTE MEDICAMENTO É IN DI­
CA DO?
Este medicamento é indicado como analgésico (para 
dor) e antitérmico (para febre).
2. COMO ESTE MEDICAMENTO FUNCIONA?
Dipirona é um medicamento utilizado no tratamento 
de febre e dor. Tempo médio de início de ação: de 30 
a 60 minutos após a administração e geralmente dura 
aproximadamente 4 horas.
3. QUANDO NÃO DEVO US
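The raw output above shows typical PDF extraction artifacts, such as words hyphenated across line breaks and invisible soft hyphens inside headings. The sketch below applies a few conservative fixes before the text goes into a corpus. The function name `clean_extracted_text` and the simplified sample string are assumptions for illustration, and rejoining hyphenated words can wrongly merge genuine compounds that happen to break at a line end, so the result should be spot-checked.

```python
import re


def clean_extracted_text(text: str) -> str:
    """Apply a few conservative fixes to text extracted from a PDF."""
    # Remove soft hyphens (U+00AD) together with the line break that follows them
    text = re.sub(r"\u00ad[ \t]*\n?", "", text)
    # Rejoin words hyphenated across a line break: "para-\nbeno" -> "parabeno"
    text = re.sub(r"(\w)-[ \t]*\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs, then trim spaces around line breaks
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    return text.strip()


# Simplified sample modeled on the leaflet output
sample = "Excipientes: metil para-\nbeno,  glicerol \ne água purificada."
print(clean_extracted_text(sample))
```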

You can also scrape data from the web and parse it with BeautifulSoup. The example below downloads the Wikipedia article on Metamizole and extracts its main text.

In [16]:
import requests
from bs4 import BeautifulSoup

# Define the URL of the Wikipedia article
url = "https://en.wikipedia.org/wiki/Metamizole"

# Send a GET request to the URL and get the HTML content
response = requests.get(url)
html_content = response.content

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

# Find the main text content of the article
article_content = soup.find("div", {"class": "mw-parser-output"})

# Remove tables, footnote markers, spans, and style blocks, which add noise to the text
for element in article_content.find_all(["table", "sup", "span", "style"]):
    element.decompose()

# Extract the text from the article content
text = article_content.get_text()

# Print the text
print(text.strip())

Medication


Not to be confused with methimazole or methazolamide.
Pharmaceutical compound


Oral solution bottle of dypirone in its packaging (Portuguese lettering)
Metamizole or dipyrone is a painkiller, spasm reliever, and fever reliever drug. It is most commonly given by mouth or by intravenous infusion. It belongs to the ampyrone sulfonate family of medicines and was patented in 1922. Metamizole is marketed under various trade names. It was first used medically in Germany under the brand name "Novalgin",  later becoming widely known in Slavic nations and India under the name "Analgin".
Sale of Metamizole is restricted in some jurisdictions following studies in the 1970s which correlated it to severe adverse effects, including  agranulocytosis. Other studies have disputed this judgement, instead claiming that it is a safer drug than other  painkillers. Metamizole is popular in many countries, where it is typically available as an  over-the-counter medication. 


Medical uses
Metami
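To see the tag-filtering idea in isolation, without a network request, here is a small sketch that uses only Python's standard-library `html.parser` instead of BeautifulSoup. It is a swapped-in, simplified technique: the class name `TextExtractor`, the set of skipped tags, and the inline HTML snippet are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text while skipping the content of noisy tags."""

    SKIP_TAGS = {"script", "style", "table", "sup"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while we are inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Invented snippet that mimics an article with a footnote marker and a table
snippet = (
    "<div><p>Metamizole is a painkiller.<sup>[1]</sup></p>"
    "<table><tr><td>infobox</td></tr></table>"
    "<p>It was patented in 1922.</p></div>"
)
print(html_to_text(snippet))
```

For real pages, BeautifulSoup remains the more robust choice because it tolerates malformed HTML and offers richer selectors.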



## Annotating Your Own Data

Data annotation is a crucial step in training machine learning models. This section walks you through labeling your data using web-based graphical user interfaces (GUIs).

### Step 1: Choosing the Right Tool for Annotation

Selecting the appropriate tool for data labeling is essential for efficient and effective annotation. Here are some recommended tools, each with unique strengths:

- **[Label Studio](https://labelstud.io/):** A versatile tool with extensive features suitable for various annotation tasks.
- **[Doccano](https://github.com/doccano/):** An open-source tool ideal for text annotation tasks with a straightforward interface.
- **[Argilla](https://docs.argilla.io/en/latest/index.html):** An advanced option with features like weak supervision and active learning, making it suitable for complex projects.

> **Reading Suggestion** For those interested in diving deeper into Human-in-the-Loop Machine Learning, consider reading [this book](https://www.manning.com/books/human-in-the-loop-machine-learning). It provides valuable insights into more advanced annotation techniques and strategies.

In this exercise, we will primarily focus on **Label Studio** due to its complete feature set, but feel free to explore Doccano or other tools from [this amazing list](https://github.com/doccano/awesome-annotation-tools).

### Step 2: Setting Up Your Annotation Environment

For convenience, we have prepared a ready-to-use Label Studio account for you. Follow these steps to get started:

1. Click on [this link](https://labelclasse.jacob.al/user/signup/?token=36cf05fc40c431c9) to create your account.
2. Complete the registration process.
3. [Log in](https://labelclasse.jacob.al) to your new Label Studio account.

> **Note:** These platforms also offer native installation processes if you're interested in a local setup.

### Step 3: Preparing the Data for Annotation

Proper data preparation is crucial for a smooth annotation process. In this exercise, we import the data into Label Studio as a JSON list of tasks in the following format:

```json
[
    {
        "id": 1,
        "data": {
            "text": "Text content goes here",
            "meta_info": {
                "meta1": "value1",
                "meta2": "value2"
            }
        }
    }
]
```

Key components of this format include:

- **id:** A unique identifier for each data point.
- **data:** Contains the actual content to be annotated.
  - **text:** The main text or content for annotation.
  - **meta_info:** Additional metadata that may be relevant to the annotation task.

To start, we'll take the first 100 rows of our dataset `ds_b2w` and convert them into the required format. The metadata (`meta_info`) can include any additional information that is useful for the labeling task.

### Additional Considerations for Data Annotation

- **Consistency:** Ensure that all annotators follow the same guidelines to maintain consistency across the dataset.
- **Quality Control:** Apply quality control measures such as cross-checking annotations or using multiple annotators for the same data points.
- **Scalability:** Choose a tool that can handle the volume of data you plan to annotate and can scale as your dataset grows.
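When several annotators label the same data points, one common way to quantify consistency is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal pure-Python version for two annotators; the function name and the example labels are invented for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random with their own label frequencies
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented annotations for six reviews
annotator_1 = ["Positive", "Positive", "Negative", "Negative", "Positive", "Negative"]
annotator_2 = ["Positive", "Negative", "Negative", "Negative", "Positive", "Positive"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Here the annotators agree on 4 of 6 items, but since half of that agreement would be expected by chance, kappa comes out at about 0.33. A low value like this usually signals that the annotation guidelines need clarification.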


In [17]:
ds_b2w["train"]

Dataset({
    features: ['submission_date', 'reviewer_id', 'product_id', 'product_name', 'product_brand', 'site_category_lv1', 'site_category_lv2', 'review_title', 'overall_rating', 'recommend_to_a_friend', 'review_text', 'reviewer_birth_year', 'reviewer_gender', 'reviewer_state'],
    num_rows: 132373
})

In [18]:
df_ds_b2w = pd.DataFrame(ds_b2w["train"][:100])
df_ds_b2w.head()

| | submission_date | reviewer_id | product_id | product_name | product_brand | site_category_lv1 | site_category_lv2 | review_title | overall_rating | recommend_to_a_friend | review_text | reviewer_birth_year | reviewer_gender | reviewer_state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-01 00:11:28 | d0fb1ca69422530334178f5c8624aa7a99da47907c44de... | 132532965 | Notebook Asus Vivobook Max X541NA-GO472T Intel... | | Informática | Notebook | Bom | 4 | Yes | Estou contente com a compra entrega rápida o ú... | 1958.0 | F | RJ |
| 1 | 2018-01-01 00:13:48 | 014d6dc5a10aed1ff1e6f349fb2b059a2d3de511c7538a... | 22562178 | Copo Acrílico Com Canudo 500ml Rocie | | Utilidades Domésticas | Copos, Taças e Canecas | Preço imbatível, ótima qualidade | 4 | Yes | Por apenas R$1994.20,eu consegui comprar esse ... | 1996.0 | M | SC |
| 2 | 2018-01-01 00:26:02 | 44f2c8edd93471926fff601274b8b2b5c4824e386ae4f2... | 113022329 | Panela de Pressão Elétrica Philips Walita Dail... | philips walita | Eletroportáteis | Panela Elétrica | ATENDE TODAS AS EXPECTATIVA. | 4 | Yes | SUPERA EM AGILIDADE E PRATICIDADE OUTRAS PANEL... | 1984.0 | M | SP |
| 3 | 2018-01-01 00:35:54 | ce741665c1764ab2d77539e18d0e4f66dde6213c9f0863... | 113851581 | Betoneira Columbus - Roma Brinquedos | roma jensen | Brinquedos | Veículos de Brinquedo | presente mais que desejado | 4 | Yes | MEU FILHO AMOU! PARECE DE VERDADE COM TANTOS D... | 1985.0 | F | SP |
| 4 | 2018-01-01 01:00:28 | 7d7b6b18dda804a897359276cef0ca252f9932bf4b5c8e... | 131788803 | Smart TV LED 43" LG 43UJ6525 Ultra HD 4K com C... | lg | TV e Home Theater | TV | Sem duvidas, excelente | 5 | Yes | A entrega foi no prazo, as americanas estão de... | 1994.0 | M | MG |


In [19]:
# Define the name of the column that contains the review text
# This column will be used for text analysis or processing tasks
text_col = "review_text"

# Define a list of column names that contain metadata about the reviews
# These columns provide additional context and information about each review
meta_cols = [
    "submission_date",  # The date when the review was submitted
    "reviewer_id",  # Unique identifier for the reviewer
    "product_id",  # Unique identifier for the product being reviewed
    "site_category_lv1",  # Top-level site category of the reviewed product
    "site_category_lv2",  # Second-level site category of the reviewed product
    "product_name",  # Name of the product being reviewed
]

In [20]:
# Initialize an empty list to store dictionaries
# Each dictionary will represent a row from the DataFrame with specific structure
list_of_dicts = []

# Iterate over each row in the DataFrame 'df_ds_b2w' using itertuples for efficient row-wise access
for row in df_ds_b2w.itertuples():
    list_of_dicts.append(
        {
            "id": row.Index,
            "data": {
                "text": getattr(row, text_col),
                "meta_info": {m: getattr(row, m) for m in meta_cols},
            },
        }
    )

# Output the length of the list to verify the number of dictionaries created
len(list_of_dicts)

100

In [21]:
list_of_dicts[0]

{'id': 0,
 'data': {'text': 'Estou contente com a compra entrega rápida o único problema com as Americanas é se houver troca ou devolução do produto o consumidor tem problemas com espera.',
  'meta_info': {'submission_date': '2018-01-01 00:11:28',
   'reviewer_id': 'd0fb1ca69422530334178f5c8624aa7a99da47907c44de0243719b15d50623ce',
   'product_id': '132532965',
   'site_category_lv1': 'Informática',
   'site_category_lv2': 'Notebook',
   'product_name': 'Notebook Asus Vivobook Max X541NA-GO472T Intel Celeron Quad Core 4GB 500GB Tela LED 15,6" Windows - 10 Branco'}}}

In [22]:
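# Save the list of tasks to a JSON file
import json

with open("outputs/tmp/b2w_first_100.json", "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps accented characters readable in the file
    json.dump(list_of_dicts, f, ensure_ascii=False)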
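Before uploading a file like this, it can help to run a quick sanity check on the task list. The sketch below is one possible check, not part of Label Studio itself: the function `validate_tasks` and the sample tasks are assumptions for illustration. It flags missing or duplicate ids and empty texts, which can occur when a review has no text.

```python
def validate_tasks(tasks):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    seen_ids = set()
    for position, task in enumerate(tasks):
        task_id = task.get("id")
        if task_id is None:
            problems.append(f"task {position}: missing id")
        elif task_id in seen_ids:
            problems.append(f"task {position}: duplicate id {task_id!r}")
        seen_ids.add(task_id)

        # The text must be a non-empty string (NaN or None would fail here)
        text = task.get("data", {}).get("text")
        if not isinstance(text, str) or not text.strip():
            problems.append(f"task {position}: missing or empty data.text")
    return problems


# Invented tasks: the second one reuses an id and has blank text
sample_tasks = [
    {"id": 0, "data": {"text": "Estou contente com a compra", "meta_info": {}}},
    {"id": 0, "data": {"text": "   ", "meta_info": {}}},
]
problems = validate_tasks(sample_tasks)
print(problems)
```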

#### Using Label Studio for Text Classification and Named Entity Recognition

During class, we will use Label Studio in practice to create labeled data for two important Natural Language Processing (NLP) tasks:

1. **Text Classification**
    This task assigns each piece of text to one of a set of predefined classes. In our example, the classes are "Positive" and "Negative", which makes this a sentiment analysis task. The same approach can distinguish between different types of content.

2. **Named Entity Recognition**
    The objective of NER is to identify and classify named entities in the text, such as organizations, individuals, and locations. For example, we could label "Bicicleta" as a product, "São Paulo" as a location, and "Americanas" as an organization.
    These labels let us extract structured information from the text, such as which products and companies a review mentions.
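To connect the NER example with the BIO notation mentioned in the keypoints, the sketch below converts token-level entity spans into BIO tags. The sentence, the label names (`PRODUCT`, `ORG`, `LOC`), and the function `spans_to_bio` are invented for illustration; annotation tools typically export character offsets, which would first need to be aligned to tokens.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) into BIO tags."""
    tags = ["O"] * len(tokens)  # everything is Outside by default
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the entity
    return tags


# Invented example sentence, split on whitespace
tokens = ["Comprei", "uma", "Bicicleta", "nas", "Americanas", "em", "São", "Paulo"]
spans = [(2, 3, "PRODUCT"), (4, 5, "ORG"), (6, 8, "LOC")]

for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(f"{token}\t{tag}")
```

The multi-token entity "São Paulo" shows why the scheme distinguishes `B-` from `I-`: the first token opens the entity and the second continues it.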



## Takeaways

- High-quality, relevant corpora are the bedrock of successful NLP model development; their characteristics directly influence model capabilities and performance.

- The choice between using existing general/domain-specific corpora or building a custom one depends heavily on the specific NLP task, available resources, and required performance level.

- Data annotation, while often resource-intensive, is indispensable for creating labeled datasets that enable supervised machine learning models to perform tasks like classification and entity recognition accurately.

- Programmatic data acquisition (e.g., PDF extraction, web scraping), manipulation (e.g., using Pandas), and formatting (e.g., preparing JSON for annotation tools) are essential practical skills for NLP practitioners.

- Understanding the nuances of different NLP tasks is critical for selecting appropriate datasets and designing effective annotation schemes to meet project goals.

- Familiarity with annotation platforms like Label Studio and the principles of creating consistent, high-quality labels is vital for building robust supervised NLP systems.

## Questions

1. What is the fundamental concept in NLP that refers to a carefully compiled collection of text data?

2. What are the two main types of corpora based on the presence or absence of labels?

3. What is the primary task of language modeling in NLP?

4. Which NLP task involves categorizing text into predefined categories based on its content?

5. What is the goal of text summarization in NLP?

6. What is the role of annotated corpora in supervised learning tasks?

7. Name one example of a domain-specific corpus mentioned in the content.

8. What is the purpose of information retrieval in NLP?

9. What format does Label Studio require for data input?

10. What is the hierarchical structure of a corpus from the highest to the smallest units?


`Answers are commented inside this cell.`

<!-- 1. In NLP, a corpus (plural: corpora) is a fundamental concept that refers to a carefully compiled collection of text data.

2. The two main types of corpora are annotated corpora and unannotated corpora.

3. The primary task of language modeling in NLP is to predict the next word in a sentence based on preceding words.

4. Text classification involves categorizing text into predefined categories based on its content.

5. The goal of text summarization in NLP is to create concise summaries of longer documents while preserving the central meaning.

6. Annotated corpora are essential for supervised learning tasks, where the model learns to make predictions by generalizing from labeled examples.

7. One example of a domain-specific corpus mentioned is the medical corpus, which includes clinical notes, research articles, medical textbooks, and patient records.

8. The purpose of information retrieval in NLP is to find relevant documents based on a user's query, such as in search engines.

9. Label Studio requires data to be in a specific JSON format, with each entry containing an "id" and a "data" field that includes the text content and any relevant metadata.

10. The hierarchical structure of a corpus is: Documents, Paragraphs, Sentences, Words and Punctuation, Syllables, Phonemes, Morphemes, and Characters. -->
