<a href="https://colab.research.google.com/github/arkeodev/nlp/blob/main/Hugging_Face/2-HF_Base_Libraries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Base Libraries (transformers and evaluate)

## Libraries

Each of the libraries selected below serves a specific purpose in the machine learning workflow, from model training and deployment (`transformers`, `accelerate`) to data handling (`datasets`) and performance evaluation (`evaluate`). These tools collectively provide a robust infrastructure for building, deploying, and managing NLP applications.

### Transformers


**Definition:** The `transformers` library, developed by Hugging Face, is a comprehensive library for Natural Language Processing (NLP) which provides a wide range of pre-trained models designed to perform various tasks such as text classification, information extraction, question answering, summarization, translation, and more.

**Some Models Overview:**
- **BERT (Bidirectional Encoder Representations from Transformers):** Designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.
- **GPT (Generative Pre-trained Transformer):** An autoregressive model that uses the left context to predict the next word in a sentence.
- **RoBERTa (Robustly Optimized BERT Approach):** A variant of BERT that modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.
- **T5 (Text-To-Text Transfer Transformer):** Converts all NLP problems into a text-to-text format, using a unified approach to handle any task.

### Evaluate

**Definition:** The `evaluate` library complements `datasets` and `transformers` by providing a set of tools for model evaluation and comparison across a wide range of metrics. This library is designed to help in assessing the performance of machine learning models, particularly in the field of NLP.

**Evaluation Methods:**
- **Accuracy:** Measures the proportion of correct predictions among total predictions, commonly used in classification tasks.
- **F1 Score:** Harmonic mean of precision and recall, useful in situations with imbalanced datasets.
- **BLEU (Bilingual Evaluation Understudy):** Used primarily for evaluating machine translated text against a set of reference translations.
- **ROUGE (Recall-Oriented Understudy for Gisting Evaluation):** Commonly used in summarization tasks to compare the overlap between the generated summaries and a set of reference summaries.
- **Perplexity:** Often used in language modeling to assess how well a probability model predicts a sample.
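
To make the definitions above concrete, here is a minimal plain-Python sketch of accuracy, binary F1, and perplexity. It is not how `evaluate` implements these metrics; the helper names and the label lists are hypothetical and chosen only to illustrate the formulas.

```python
import math

def accuracy(predictions, references):
    """Share of predictions that equal their reference label."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def f1_score(predictions, references, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def perplexity(token_probs):
    """exp of the average negative log-probability given to each observed token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical labels and probabilities, for illustration only
references = [1, 0, 1, 1, 0, 1]
predictions = [1, 0, 0, 1, 1, 1]

acc = accuracy(predictions, references)
f1 = f1_score(predictions, references)
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
print(acc, f1, ppl)
```

A model that assigns probability 0.25 to every observed token has a perplexity of about 4: it is as uncertain as a uniform choice among four options.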

## Implementation

In the realm of NLP tasks using transformer models, Hugging Face provides a highly efficient and user-friendly interface through the `pipeline` class. However, there are times when one might need more control over the process than what is offered by the `pipeline` class. In such cases, creating a manual pipeline can be beneficial. Below, I'll provide definitions and examples for both using the `pipeline` class and creating a manual pipeline.

### Pipeline Class

**Definition:**
The `pipeline` function in Hugging Face's `transformers` library (referred to in this notebook as the pipeline class, since it is a factory that returns a task-specific `Pipeline` object) provides a high-level, easy-to-use interface for applying pre-trained models to various NLP tasks such as text classification, question answering, and text generation. It abstracts away much of the preprocessing and postprocessing needed to run predictions, making it accessible to users who may not be familiar with the intricacies of the models or NLP tasks.

Let's first import the required libraries.

In [None]:
import transformers
print(transformers.__version__)
import evaluate
print(evaluate.__version__)

4.37.2
0.4.1


Here's how you would use the `pipeline` class for sentiment analysis:

#### 1. Sentiment Analysis

**Model Used:** The `cardiffnlp/twitter-roberta-base-sentiment` model is a variant of the RoBERTa-base transformer model that has been specifically fine-tuned for sentiment analysis tasks on Twitter data. This model is hosted on Hugging Face's model hub and was developed by researchers at Cardiff University's NLP group.

In [None]:
from transformers import pipeline

# RoBERTa-base checkpoint fine-tuned on tweets for three-class sentiment analysis
sentiment_analyzer = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment")

result = sentiment_analyzer("Transformers are making NLP much simpler.")
print(result)


[{'label': 'LABEL_2', 'score': 0.824418306350708}]


The output shows a label and a score indicating the model's confidence in its prediction.

For the `cardiffnlp/twitter-roberta-base-sentiment` model, these labels correspond to sentiment classes that the model was trained to predict. These are:

- LABEL_0: Negative
- LABEL_1: Neutral
- LABEL_2: Positive
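
Since the raw `LABEL_n` names are not self-explanatory, one option is to translate them after the pipeline call. The sketch below uses the mapping listed above and the output copied from the cell above; `LABEL_NAMES` and `readable` are hypothetical helper names, not part of `transformers`.

```python
# Mapping taken from the label list above
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}

def readable(predictions):
    """Return copies of pipeline-style predictions with human-readable labels."""
    return [
        {**pred, "label": LABEL_NAMES.get(pred["label"], pred["label"])}
        for pred in predictions
    ]

# Output copied from the sentiment-analysis cell above
raw = [{"label": "LABEL_2", "score": 0.824418306350708}]
print(readable(raw))
```

Unknown labels are passed through unchanged, so the helper does not hide unexpected output.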

#### 2. Zero-Shot Classification

**Model Used:** BART large MNLI, designed for natural language inference tasks but adaptable for zero-shot classification. The model classifies an input sentence into predefined categories without needing labeled data for those categories.

In [None]:
zero_shot_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Classify the given text into provided categories
result = zero_shot_classifier(
    "Discover the latest trends in machine learning.",
    candidate_labels=["health", "technology", "finance", "education"]
)

# Display the classification results
print(result)

{'sequence': 'Discover the latest trends in machine learning.', 'labels': ['technology', 'health', 'finance', 'education'], 'scores': [0.9637537002563477, 0.016171658411622047, 0.010842711664736271, 0.00923205353319645]}


The output includes the input text, the candidate labels sorted by score, and the score for each label, showing how well the text fits each category. As expected, the top label is `technology`.
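
The result is a plain dictionary, so it can be post-processed without any library support. A minimal sketch, using the output copied from the cell above:

```python
# Result copied from the zero-shot cell above
result = {
    "sequence": "Discover the latest trends in machine learning.",
    "labels": ["technology", "health", "finance", "education"],
    "scores": [0.9637537002563477, 0.016171658411622047,
               0.010842711664736271, 0.00923205353319645],
}

# Pair each label with its score
label_scores = dict(zip(result["labels"], result["scores"]))

# Select the best label explicitly instead of relying on the list order
top_label = max(label_scores, key=label_scores.get)

# In the default single-label mode the scores form a distribution over the labels
total = sum(result["scores"])
print(top_label, round(total, 4))
```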

#### 3. Text Generation

**Model Used:** GPT-2, well-known for its ability to generate coherent and contextually relevant text.

In [None]:
text_generator = pipeline("text-generation", model="gpt2")

# Generate text starting from the given prompt
generated_text = text_generator("Artificial intelligence in healthcare ", max_length=50, truncation=True)

# Output the generated text
print(generated_text)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Artificial intelligence in healthcare \xa0has been used extensively with "personal health" technologies for years as it is not even the first, common, technology used in pharmaceutical, diagnostic, and so on. However, what is interesting is that it was a'}]


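The model continues the input prompt. `max_length=50` caps the total length of the output, prompt included, at 50 tokens, which is why the text stops mid-sentence. The continuation is fluent and stays on topic, but it may differ from run to run and is not guaranteed to be factually accurate.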

#### 4. Masked Language Modeling

**Model Used:** XLM-RoBERTa, a model pre-trained on 2.5TB of filtered CommonCrawl data covering 100 languages. It was introduced in the paper *Unsupervised Cross-lingual Representation Learning at Scale* by Conneau et al.

In [None]:
mlm = pipeline("fill-mask", model="FacebookAI/xlm-roberta-base")

# Fill in the blank in the sentence below
result = mlm("The current president of the United States is <mask>.")

# Print the results
print(result)

Some weights of the model checkpoint at FacebookAI/xlm-roberta-base were not used when initializing XLMRobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.24085475504398346, 'token': 16692, 'token_str': 'Donald', 'sequence': 'The current president of the United States is Donald.'}, {'score': 0.059821899980306625, 'token': 76140, 'token_str': 'Barack', 'sequence': 'The current president of the United States is Barack.'}, {'score': 0.05501473322510719, 'token': 14787, 'token_str': 'George', 'sequence': 'The current president of the United States is George.'}, {'score': 0.045015595853328705, 'token': 34202, 'token_str': 'President', 'sequence': 'The current president of the United States is President.'}, {'score': 0.04419546574354172, 'token': 70, 'token_str': 'the', 'sequence': 'The current president of the United States is the.'}]


The output suggests top fill-ins for the masked token, each with a confidence score.

Even though the model doesn't find the current president, its top suggestions are the first names of previous presidents. This shows that it captured the context of the sentence.
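
Each candidate is a dictionary, so weak suggestions can be filtered out with ordinary Python. The sketch below uses the scores copied from the output above (token ids and full sequences omitted); the `MIN_SCORE` threshold is a hypothetical value chosen for illustration.

```python
# Candidates copied from the fill-mask output above
candidates = [
    {"token_str": "Donald", "score": 0.24085475504398346},
    {"token_str": "Barack", "score": 0.059821899980306625},
    {"token_str": "George", "score": 0.05501473322510719},
    {"token_str": "President", "score": 0.045015595853328705},
    {"token_str": "the", "score": 0.04419546574354172},
]

MIN_SCORE = 0.05  # hypothetical cut-off

# Keep only the suggestions the model is reasonably confident about
kept = [(c["token_str"], round(c["score"], 3)) for c in candidates if c["score"] >= MIN_SCORE]

# Probability mass covered by the five returned candidates
covered = sum(c["score"] for c in candidates)
print(kept, round(covered, 3))
```

The five candidates together cover less than half of the probability mass, which is a reminder that the top suggestion is far from certain.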

#### 5. Question Answering

**Model Used:** The `distilbert-base-cased-distilled-squad` model used here is a distilled version of BERT, fine-tuned for question answering on the Stanford Question Answering Dataset (SQuAD). DistilBERT retains most of BERT's predictive power but is lighter and faster, making it suitable for environments where resources are limited.

In [None]:
qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

# Provide a context and a question whose answer lies within the context
result = qa_pipeline({
    'question': 'Who is known as the father of computer science?',
    'context': 'Alan Turing is widely considered as the father of computer science.'
})

# Display the answer
print(result)

{'score': 0.9983620643615723, 'start': 0, 'end': 11, 'answer': 'Alan Turing'}


The output indicates a successful identification of the answer with high confidence, demonstrating the model's capability to effectively parse and understand natural language within the specified domain (as trained on SQuAD).
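
The `start` and `end` fields are character offsets into the context, which makes it possible to locate or highlight the answer in the original text. A minimal check, using the context and the output from the cell above:

```python
context = "Alan Turing is widely considered as the father of computer science."

# Output copied from the question-answering cell above
result = {"score": 0.9983620643615723, "start": 0, "end": 11, "answer": "Alan Turing"}

# Slicing the context with the offsets recovers the answer span
span = context[result["start"]:result["end"]]
print(span, span == result["answer"])
```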

#### 6. Translation Model Using Pipeline Class & BLEU Metric

**Model Used:** Helsinki-NLP's translation model trained for translating English to French. These models are part of the Opus-MT project and have been trained on a diverse set of languages from the OPUS collection.

In [None]:
from transformers import pipeline
from evaluate import load

# Load the translation pipeline
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")

# Translate the text from English to French
translated_text = translator("This is an example of using a Hugging Face model for translation.")

# This is our reference translation for evaluation
reference_translation = ["Ceci est un exemple d'utilisation d'un modèle Hugging Face pour la traduction."]
print(f"Reference translation: {reference_translation[0]}")

# Extract the translated text
translated_text = translated_text[0]['translation_text']

# Print the results
print(f"Translated text: {translated_text}")

# Load BLEU metric
bleu = load('bleu')

# Calculate BLEU score
results = bleu.compute(predictions=[translated_text], references=[reference_translation])

# Print the BLEU score
print(f"BLEU score: {results['bleu']}")

Reference translation: Ceci est un exemple d'utilisation d'un modèle Hugging Face pour la traduction.
Translated text: C'est un exemple d'utilisation d'un modèle Hugging Face pour la traduction.
BLEU score: 0.8313539764691034



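The `translation_text` field in the output contains the French translation of the English input text. The translation differs from the reference only at the start (`C'est` versus `Ceci est`), which is reflected in the BLEU score of about 0.83. BLEU ranges from 0 to 1, where 1 means the prediction matches the reference exactly.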
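
To see where a BLEU score like this comes from, the sketch below recomputes it by hand: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. This is an illustration, not the implementation used by `evaluate`; in particular, `tokenize` is a simplified, hypothetical tokenizer that only splits on whitespace and separates periods.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Simplified tokenizer: whitespace-separated tokens, with periods split off."""
    return re.findall(r"[^\s.]+|\.", text)

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(prediction, reference, max_order=4):
    pred, ref = tokenize(prediction), tokenize(reference)
    log_precisions = []
    for n in range(1, max_order + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference
        matches = sum((pred_counts & ref_counts).values())
        total = sum(pred_counts.values())
        if matches == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # The brevity penalty lowers the score of predictions shorter than the reference
    bp = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty
    return bp * math.exp(sum(log_precisions) / max_order)

reference = "Ceci est un exemple d'utilisation d'un modèle Hugging Face pour la traduction."
translated = "C'est un exemple d'utilisation d'un modèle Hugging Face pour la traduction."

score = bleu(translated, reference)
print(f"BLEU score: {score:.4f}")
```

With this tokenization the prediction has 12 tokens against 13 in the reference, so the brevity penalty accounts for part of the gap to 1.0; the rest comes from the single unmatched token `C'est`.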

#### 7. Text Summarization Using Pipeline Class & ROUGE Metric


**Model Used:** BART large CNN (`facebook/bart-large-cnn`), a BART checkpoint fine-tuned for summarization on the CNN/DailyMail dataset.

In [None]:
from transformers import pipeline
from evaluate import load

# Load the summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Text to summarize
article_text = """
The Hugging Face library offers state-of-the-art machine learning models for natural language processing tasks.
It provides simple interfaces for NLP tasks such as text classification, translation, and more.
The library is built on top of PyTorch and TensorFlow, giving users flexibility and ease of use in model training and deployment.
"""

# Perform summarization
summary_text = summarizer(article_text, max_length=45, min_length=5, do_sample=False)

# Print the summarized text
print(f"Summarized text: {summary_text[0]['summary_text']}")

# This is our reference summary for evaluation (typically a human-generated summary)
reference_summary = "Hugging Face offers machine learning models for NLP based on PyTorch and TensorFlow."
print(f"Reference Summary: {reference_summary}")

# Load ROUGE metric
rouge = load('rouge')

# Calculate ROUGE score
rouge_results = rouge.compute(predictions=[summary_text[0]['summary_text']], references=[reference_summary])

# Print the ROUGE scores
print(f"ROUGE scores: {rouge_results}")

Summarized text: The Hugging Face library offers state-of-the-art machine learning models for natural language processing tasks. The library is built on top of PyTorch and TensorFlow, giving users flexibility and ease
Reference Summary: Hugging Face offers machine learning models for NLP based on PyTorch and TensorFlow.
ROUGE scores: {'rouge1': 0.4888888888888889, 'rouge2': 0.27906976744186046, 'rougeL': 0.4888888888888889, 'rougeLsum': 0.4888888888888889}


The ROUGE scores are a set of metrics that evaluate the quality of a summary by comparing the machine-generated summary to a reference summary. The `rouge` metric in `evaluate` reports each score as an F-measure: the harmonic mean of precision (the share of the generated summary's n-grams that appear in the reference) and recall (the share of the reference's n-grams that appear in the generated summary). Let's interpret each score:

1. **ROUGE-1 (0.4888888888888889)**: This score reflects the overlap of unigrams (individual tokens) between the generated summary and the reference summary. Here 11 of the 13 reference tokens appear in the generated summary (recall of about 0.85), but the generated summary is 32 tokens long (precision of about 0.34), so the F-measure is 2 × 11 / (13 + 32) ≈ 0.489. The generated summary covers most of the reference but adds a lot of text that the reference does not contain.

2. **ROUGE-2 (0.27906976744186046)**: This metric looks at the overlap of bigrams (pairs of consecutive tokens). Six bigrams are shared, out of 12 in the reference and 31 in the generated summary, giving 2 × 6 / (12 + 31) ≈ 0.279. ROUGE-2 is typically lower than ROUGE-1 because a bigram matches only when two words appear next to each other in the same order, so it is sensitive to phrasing and not just vocabulary.

3. **ROUGE-L (0.4888888888888889)**: ROUGE-L is based on the Longest Common Subsequence (LCS): the longest sequence of tokens that appears in both summaries in the same order, though not necessarily contiguously. The score equals ROUGE-1 here because all 11 shared tokens occur in the same order in both texts, so the LCS has length 11.

4. **ROUGE-Lsum (0.4888888888888889)**: This is the summary-level variant of ROUGE-L. It splits each text into sentences at newline characters and computes the LCS sentence by sentence. Neither text contains a newline here, so the score is identical to ROUGE-L.

**Overall Interpretation:**
The scores reported here are F-measures, so they balance how much of the reference is covered against how much of the generated summary is relevant. The generated summary contains most of the reference summary's words but is more than twice as long, which lowers precision and puts ROUGE-1 and ROUGE-L at about 0.49. The lower ROUGE-2 score suggests that while many of the right keywords are included, the exact phrasing is matched less closely. In other words, the summary captures the key concepts but surrounds them with extra detail and different wording.
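
The scores printed by the summarization cell can be reproduced by hand, which helps when checking what they measure. The sketch below is a plain-Python illustration, not the code used by `evaluate`: it assumes a tokenizer that lowercases the text and keeps alphanumeric runs, and the function names are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep alphanumeric runs (hyphenated words are split)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def f_measure(matches, pred_total, ref_total):
    """Harmonic mean of precision and recall."""
    if matches == 0:
        return 0.0
    precision, recall = matches / pred_total, matches / ref_total
    return 2 * precision * recall / (precision + recall)

def rouge_n(prediction, reference, n):
    pred, ref = tokenize(prediction), tokenize(reference)
    pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Clipped overlap between the two n-gram multisets
    matches = sum((pred_ngrams & ref_ngrams).values())
    return f_measure(matches, sum(pred_ngrams.values()), sum(ref_ngrams.values()))

def rouge_l(prediction, reference):
    pred, ref = tokenize(prediction), tokenize(reference)
    # Dynamic-programming table for the longest common subsequence
    table = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, 1):
        for j, r in enumerate(ref, 1):
            if p == r:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return f_measure(table[-1][-1], len(pred), len(ref))

# Texts copied from the output of the summarization cell
summary = ("The Hugging Face library offers state-of-the-art machine learning models "
           "for natural language processing tasks. The library is built on top of "
           "PyTorch and TensorFlow, giving users flexibility and ease")
reference = "Hugging Face offers machine learning models for NLP based on PyTorch and TensorFlow."

r1 = rouge_n(summary, reference, 1)
r2 = rouge_n(summary, reference, 2)
rl = rouge_l(summary, reference)
print(r1, r2, rl)
```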

### Creating a Manual Pipeline

A manual pipeline involves more granular control over the steps in the processing of data through a transformer model. This typically includes:

- loading the tokenizer and model
- performing tokenization
- running the model on these tokens
- processing the output.

This method is useful when customization is needed beyond what the `pipeline` class offers, such as batch processing, custom preprocessing or postprocessing steps, or using specific model configurations.

Here’s how you would manually create a pipeline for sentiment analysis:

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Prepare input
text = "I love using transformers for machine learning."
encoded_input = tokenizer(text, return_tensors='pt')

# Run model (gradients are not needed for inference)
with torch.no_grad():
    output = model(**encoded_input)

# Process output
scores = torch.nn.functional.softmax(output.logits, dim=-1)
predicted_label_index = scores.argmax().item()
predicted_label = model.config.id2label[predicted_label_index]
confidence = scores.max().item()

print({"label": predicted_label, "score": confidence})

{'label': 'POSITIVE', 'score': 0.9977321624755859}


Here the code outputs a sentiment label and a confidence score. As can be seen, the model predicts `POSITIVE` with high confidence.
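
The post-processing step (softmax, argmax, label lookup) is where a manual pipeline most often needs adapting, for example when several texts are processed at once. The sketch below shows that step in plain Python for a small batch. The logits are hypothetical values, and the `id2label` mapping is an assumption standing in for `model.config.id2label`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical values; in the cell above they come from `output.logits`
# and `model.config.id2label`
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
batch_logits = [[-4.1, 4.4], [2.3, -1.9]]

results = []
for logits in batch_logits:
    probs = softmax(logits)
    # The argmax is taken per example, not over the whole batch
    best = max(range(len(probs)), key=probs.__getitem__)
    results.append({"label": id2label[best], "score": probs[best]})

print(results)
```

With tensors, the same idea means taking the argmax along the last dimension. A single global `argmax()` as in the cell above is only correct for one input at a time.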

### Conclusion

Both methods offer powerful ways to leverage transformer models for NLP tasks, with the `pipeline` class being more suited for straightforward applications and quick testing. In contrast, the manual pipeline method provides greater flexibility and control, which can be essential for production environments or research purposes where custom behavior is necessary.