
 # Hugging Face Transformers - A Complete Guide

 This notebook provides a complete guide on how to use Hugging Face Transformers to perform common Natural Language Processing (NLP) tasks such as:
 - Sentiment Analysis
 - Text Summarization
 - Question Answering
 - Text Translation
 - Text Generation


Additionally, it will demonstrate how to fine-tune a pre-trained model on a custom dataset for specific tasks.

# 1. Installing Necessary Libraries
Before we can start, we need to install the required Python packages. We will use the Hugging Face `transformers` and `datasets` libraries along with `torch`, which is the backend framework that runs the models.
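
The packages can be installed with `pip install transformers datasets torch`. Later cells also import `numpy` and `sklearn` (installed as `scikit-learn`). As a minimal sketch, the check below reports which of these are missing from the current environment; the helper name `missing_packages` is made up for this illustration.

```python
import importlib.util

# Import name -> pip name for the packages used in this notebook
REQUIRED = {
    "transformers": "transformers",
    "datasets": "datasets",
    "torch": "torch",
    "numpy": "numpy",
    "sklearn": "scikit-learn",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages that cannot be imported."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```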


# 2. Using Hugging Face Pipelines

Hugging Face provides a high-level abstraction called `pipeline`, which lets you apply a model to a task without worrying about underlying details such as tokenization and post-processing.

You can use the `pipeline` function to load a pre-trained model for different tasks such as sentiment analysis, text generation, summarization, etc.

Let's start by importing the `pipeline` function from the Hugging Face Transformers library.


In [1]:
from transformers import pipeline
import torch

device = (
    torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
)

### Task 1: Sentiment Analysis
Sentiment analysis is the task of classifying a given text as positive, negative, or neutral.

In this example, we use a pre-trained model for sentiment analysis. The `pipeline` automatically downloads and loads the checkpoint named in `model_name`, which, as its name indicates, was fine-tuned for sentiment analysis of financial news.

In [2]:
model_name = "mrm8488/deberta-v3-ft-financial-news-sentiment-analysis"


classifier = pipeline("sentiment-analysis", model=model_name, device=device)
result = classifier(
    "The development of a recombinant polyclonal antibody therapy for COVID-19 by GigaGen represents an early-stage positive news in response to a global health crisis. However, such initiatives often come with high risk and uncertainty given the complexity and time required for clinical trials and approval processes. Additionally, competition in the COVID-19 treatment space is intense, with many companies pursuing similar therapies. These factors make it essential to remain cautious, monitoring further developments and data closely."
)


print(f"Sentiment Analysis Result: {result}")

Sentiment Analysis Result: [{'label': 'positive', 'score': 0.966559886932373}]
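
The pipeline returns a list with one dictionary per input text, each holding a `label` and a `score`. A minimal sketch of reading that structure follows; the helper `confident_label` and its `threshold` parameter are hypothetical and not part of the Transformers API.

```python
# Result copied from the output above: one dict per input text
result = [{"label": "positive", "score": 0.966559886932373}]


def confident_label(prediction, threshold=0.9):
    """Return the label if its score reaches the threshold, else None."""
    return prediction["label"] if prediction["score"] >= threshold else None


for prediction in result:
    print(prediction["label"], round(prediction["score"], 3), confident_label(prediction))
```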


# 3. Fine-Tuning Pre-trained Models
While the pre-trained models provided by Hugging Face are powerful, you may want to fine-tune them for a specific task or dataset.

Fine-tuning involves taking a pre-trained model and training it further on your own data. This can improve the model’s performance for specific use cases.

For this section, we'll load a custom CSV dataset of texts with three sentiment labels and fine-tune a pre-trained model for sentiment classification.

### Step 1: Load Dataset
We'll use Hugging Face's `datasets` library to load the CSV file and declare its column types with `Features`.

Datasets from the `datasets` library often come with pre-defined splits, such as `train` and `test`. A CSV file loaded this way has only a `train` split, so we create a held-out test set ourselves with `train_test_split`.

You can also take a subset of a dataset by index with the `select` method.

In [3]:
from datasets import load_dataset, Features, ClassLabel, Value

# Define the features
features = Features(
    {
        "text": Value("string"),
        "labels": ClassLabel(names=["0", "1", "2"]),
    }
)

# Load the dataset with the specified features
dataset = load_dataset(
    "csv",
    data_files="/Users/akseljoonas/Documents/predtrade/bert fine tuning/datasets_big/dataset_finetuning_3_labels.csv",
    features=features,
)

# Now you can stratify by 'labels'
split_dataset = dataset["train"].train_test_split(
    test_size=0.1, stratify_by_column="labels"
)

train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]



In [4]:
train_dataset

Dataset({
    features: ['text', 'labels'],
    num_rows: 3492
})
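
The `stratify_by_column="labels"` argument above keeps the label proportions roughly equal in both splits. The idea can be sketched in plain Python as follows. This is an illustration under simplifying assumptions, not the library's implementation, and `stratified_split` is a made-up name.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.1, seed=0):
    """Return (train_indices, test_indices) with similar label proportions."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class share of the test set
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = [0] * 20 + [1] * 60 + [2] * 20  # toy, imbalanced labels
train_idx, test_idx = stratified_split(labels, test_size=0.1)
print(len(train_idx), len(test_idx))
print(Counter(labels[i] for i in test_idx))
```

Each class contributes its own share to the test set, so a rare class cannot end up missing from it by chance.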

### Step 2: Tokenize the Dataset
The dataset needs to be tokenized before it can be fed into the model. Tokenization converts the text data into numerical format (tokens) that the model can process.

We'll use the `AutoTokenizer` class from Hugging Face to tokenize the data. `AutoTokenizer` automatically selects the appropriate tokenizer for the model based on `model_name`.

Tokenization, like any other transformation of the dataset, can be done with the `map` method, which applies a function to every element of the dataset. Define a function that tokenizes the text and pass it to `map`. With `batched=True`, the function receives a batch of examples per call instead of a single example, which is faster because fast tokenizers can process a whole list of texts at once.
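
To show what a batched function receives and returns, here is a toy sketch in plain Python. Both `toy_tokenize` (a whitespace splitter standing in for a real tokenizer) and `batched_map` are hypothetical helpers written for this illustration; they only mimic the shape of the data, a dictionary of lists going in and a dictionary of lists coming out.

```python
def toy_tokenize(batch, max_length=4):
    """Stand-in for a tokenizer: split on whitespace and truncate."""
    return {"tokens": [text.split()[:max_length] for text in batch["text"]]}


def batched_map(rows, fn, batch_size=2):
    """Apply fn to column-wise batches and merge the new column into each row."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start : start + batch_size]
        batch = {"text": [row["text"] for row in chunk]}  # dict of lists
        result = fn(batch)
        for row, tokens in zip(chunk, result["tokens"]):
            out.append({**row, "tokens": tokens})
    return out


rows = [{"text": "a b c d e f"}, {"text": "g h"}, {"text": "i"}]
for row in batched_map(rows, toy_tokenize):
    print(row)
```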

In [5]:
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


def tokenize_function(examples):
    # This checkpoint defines no maximum length, so truncation=True on its own
    # is skipped (the tokenizer falls back to no truncation). Set max_length
    # explicitly; 512 is an assumed limit, adjust it for your model.
    # Padding is left to data_collator, which pads each batch to its longest sequence.
    return tokenizer(examples["text"], truncation=True, max_length=512)


tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_test = test_dataset.map(tokenize_function, batched=True)

In [6]:
tokenized_test


Dataset({
    features: ['text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 388
})

### Step 3: Load a Pre-trained Model
Now that the data is tokenized, we'll load a pre-trained model that we'll fine-tune for sentiment classification.

We'll reuse the checkpoint stored in `model_name`, the same one used in the pipeline example.

We need to import `AutoModelForSequenceClassification` for that. This class adds a classification head on top of the pre-trained transformer so that it can classify sequences into categories (e.g., positive/negative sentiment, spam/ham). The `from_pretrained` method loads the pre-trained weights and configuration. The optional `num_labels` parameter sets the number of output classes. Our dataset has three labels; the code below does not pass `num_labels` and relies on the checkpoint's existing head.

In [7]:
from transformers import AutoModelForSequenceClassification

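# The checkpoint already includes a classification head, so it is loaded as-is.
# If your label count differs from the checkpoint's, pass num_labels=<count>
# and ignore_mismatched_sizes=True to re-initialize the head.
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)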

### Step 4: Set Up the Trainer
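Hugging Face provides the `Trainer` class to help with training and fine-tuning models. We set up the trainer by providing the model, training arguments, and the datasets. The cells below also define the evaluation metrics, compute class weights for the imbalanced labels, and subclass `Trainer` to use a custom loss.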


In [8]:
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="weighted", labels=[0, 1, 2]
    )
    acc = accuracy_score(labels, predictions)
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}
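
With `average="weighted"`, each class's score is weighted by its number of true examples (its support). A plain-Python sketch of the weighted F1 computation, written for illustration only (`weighted_f1` is a made-up helper, not the scikit-learn implementation):

```python
from collections import Counter


def weighted_f1(labels, predictions):
    """Average the per-class F1 scores, weighted by each class's support."""
    support = Counter(labels)
    total = len(labels)
    score = 0.0
    for cls, n_true in support.items():
        true_pos = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        n_pred = sum(1 for p in predictions if p == cls)
        precision = true_pos / n_pred if n_pred else 0.0
        recall = true_pos / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score


print(weighted_f1([0, 0, 1, 1], [0, 1, 1, 1]))
```

In this toy case class 0 has F1 = 2/3 and class 1 has F1 = 0.8, each with half of the examples, so the weighted result is about 0.733.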

In [9]:
from sklearn.utils.class_weight import compute_class_weight

class_weights = compute_class_weight(
    class_weight="balanced",
    classes=np.unique(dataset["train"]["labels"]),
    y=dataset["train"]["labels"],
)
class_weights = torch.tensor(class_weights, dtype=torch.float)
print("Class Weights:", class_weights)

Class Weights: tensor([2.8055, 0.4608, 2.1133])
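
The `"balanced"` mode assigns each class the weight `n_samples / (n_classes * class_count)`, so rarer classes get larger weights. A small sketch of that formula on toy labels (the helper name `balanced_class_weights` is an assumption for this example):

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * counts[cls]) for cls in sorted(counts)}


toy_labels = [0, 1, 1, 1, 1, 1, 1, 2, 2, 2]  # class 1 dominates
weights = balanced_class_weights(toy_labels)
print(weights)
```

This matches the pattern in the printed weights above, where the most frequent class (label 1) receives the smallest weight.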


In [10]:
import torch.nn as nn
import torch.nn.functional as F


# Define custom loss function
class WeightedCrossEntropyLoss(nn.Module):
    def __init__(self, class_weights):
        super().__init__()
        self.class_weights = class_weights

    def forward(self, logits, labels):
        # Weight each class's loss term so that rare classes count more
        return F.cross_entropy(logits, labels, weight=self.class_weights)

In [11]:
from transformers import Trainer, TrainingArguments


# Define custom trainer
class CustomTrainer(Trainer):
    def __init__(self, *args, class_weights, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights

    # Recent versions of Trainer also pass num_items_in_batch; it is accepted but unused here
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Move class_weights to the same device as logits
        loss_fct = WeightedCrossEntropyLoss(self.class_weights.to(logits.device))
        loss = loss_fct(logits, labels)
        return (loss, outputs) if return_outputs else loss

In [12]:
training_args = TrainingArguments(
    output_dir="./results",  # Output directory
    eval_strategy="epoch",  # Evaluate after each epoch
    save_strategy="epoch",
    learning_rate=2e-5,  # Learning rate
    per_device_train_batch_size=8,  # Batch size for training
    per_device_eval_batch_size=8,  # Batch size for evaluation
    num_train_epochs=10,  # Number of epochs
    weight_decay=0.01,  # Strength of weight decay
    load_best_model_at_end=True,  # Load the best model at the end
    metric_for_best_model="f1",  # Use F1 score to select the best model
    save_total_limit=1,  # Limit the total amount of checkpoints
)

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,  # Tokenized training split
    eval_dataset=tokenized_test,  # Tokenized held-out split
    compute_metrics=compute_metrics,
    class_weights=class_weights,  # Pass the class weights
    data_collator=data_collator,
)

### Step 5: Fine-tune the Model
Now that the trainer is set up, we can start the fine-tuning process.

Run the following cell to fine-tune the model.

In [13]:
trainer.train()

  0%|          | 0/4370 [00:00<?, ?it/s]

RuntimeError: MPS backend out of memory (MPS allocated: 16.19 GB, other allocations: 29.39 GB, max allowed: 45.90 GB). Tried to allocate 375.29 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
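
The total of 4370 steps shown in the progress bar follows from the dataset size and the training arguments. The arithmetic below is a sketch that assumes a single device and no gradient accumulation, with the numbers taken from the cells above.

```python
import math

num_train_examples = 3492  # rows in train_dataset
batch_size = 8             # per_device_train_batch_size
num_epochs = 10            # num_train_epochs

steps_per_epoch = math.ceil(num_train_examples / batch_size)  # last batch is partial
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```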

### Step 6: Evaluate the Model
After training, we can evaluate the model’s performance on the test set.

In [None]:
eval_results = trainer.evaluate()
print(f"Evaluation Results: {eval_results}")

### Step 7: Try Out the Model

In [None]:
input_string = "BioVie Announces Alignment with FDA on Clinical Trial to Assess Bezisterim in Parkinson’s Disease,SUNRISE-PD to evaluate the effect of bezisterim (NE3107) on motor and non-motor symptoms in ~60 patients with Parkinson’s disease who are naïve to carbidopa/levodopa,SUNRISE-PD to evaluate the effect of bezisterim (NE3107) on motor and non-motor symptoms in ~60 patients with Parkinson’s disease who are naïve to carbidopa/levodopa"

# Tokenize the input string
inputs = tokenizer(input_string, return_tensors="pt").to(device)

# Get predictions (logits)
with torch.no_grad():  # Disable gradient computation since we're just doing inference
    outputs = model(**inputs)
    logits = outputs.logits
print(logits)
predicted_label = torch.argmax(logits, dim=1).item()


print(f"Predicted label: {predicted_label}")
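
The logits are unnormalized scores, one per class. Applying a softmax turns them into probabilities, and the largest one corresponds to the predicted label. A plain-Python sketch, using hypothetical logit values rather than real model output:

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [-1.2, 0.3, 2.1]  # hypothetical values for three classes
probs = softmax(example_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 3) for p in probs], predicted)
```

With a tensor of logits, `torch.softmax(logits, dim=1)` performs the same computation.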

### Step 8: Save the Fine-tuned Model
After training, it is often useful to save the fine-tuned model, so you can use it later without needing to re-train it.

In [None]:
model.save_pretrained("./fine-tuned-model")
tokenizer.save_pretrained("./fine-tuned-model")