# Text Summarization

<sup>This notebook is part of the Natural Language Processing class at the University of Ljubljana, Faculty of Computer and Information Science. Please contact [ales.zagar@fri.uni-lj.si](mailto:ales.zagar@fri.uni-lj.si) with any comments.</sup>

This notebook requires a GPU-accelerated environment. We suggest using Google Colab to run it.

# Google Colab specifics
Before running this notebook, you should select a GPU-accelerated environment.

In [1]:
# !pip install datasets transformers rouge-score nltk openai

In [2]:
import transformers
import nltk
import os
nltk.download('punkt')

print(transformers.__version__)


4.39.3


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\alesz\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Fine-tuning a model on a summarization task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models for a summarization task. We will use the [XSum dataset](https://arxiv.org/pdf/1808.08745.pdf) (for extreme summarization), which contains BBC articles accompanied by single-sentence summaries.

We will see how to easily load the dataset for this task using 🤗 Datasets and how to fine-tune a model on it using the `Trainer` API.

In [3]:
model_checkpoint = "t5-small"

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models), as long as that model has a sequence-to-sequence version in the Transformers library. Here we picked the [`t5-small`](https://huggingface.co/t5-small) checkpoint.

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [4]:
from datasets import load_dataset, load_metric

raw_datasets = load_dataset("xsum")
metric = load_metric("rouge")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  metric = load_metric("rouge")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


The `raw_datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation and test sets.

We will select a small number of examples and perform a few-shot fine-tuning to reduce training time. Increase the number of examples to get better results. 

In [5]:
num_of_samples = 100
raw_datasets['train'] = raw_datasets['train'].select(list(range(num_of_samples)))
raw_datasets['validation'] = raw_datasets['validation'].select(list(range(num_of_samples)))
raw_datasets['test'] = raw_datasets['test'].select(list(range(num_of_samples)))

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [6]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=3):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [7]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,document,summary,id
0,"The team went into administration in October but, as revealed by BBC Sport, have secured investment from Stephen Fitzpatrick, boss of energy firm Ovo.\nFormer Sainsbury's boss Justin King has joined as interim chairman.\nHe said he was confident that Manor had ""the right people, the right values and sheer hard work"" and would be ""competitive at the highest level"".\nKing is not financially involved in the team but will take a leading role on the business side of the operation.\nFitzpatrick's investment is a personal one and the money he has put into the team does not come from Ovo.\nHe said: ""I have a lifelong passion for F1 and can't wait for the season ahead.""\nManor Marussia have announced Englishman Will Stevens will be one of their drivers and said a deal to sign the second would be completed soon.\nThe team's new car, a modified version of the 2014 model, must pass F1's mandatory crash tests before they can race at the season-opening Australian Grand Prix in Melbourne from 13-15 March.\nThose tests are scheduled to take place this week.\nGraeme Lowdon, who has been with the team through its various guises as Manor, Virgin and Marussia, remains as president and sporting director.\nHe said: ""It has been a challenging period for all of us but we've come through it and now we just want to go racing again.\n""With formidable new business leadership in Stephen Fitzpatrick and the board presence of Justin King, we are now in a great place ahead of the new season. This is a fantastic and very rewarding moment for all those involved with the team.""",The Manor Marussia team have confirmed they intend to return to Formula 1 in time for the start of the season.,31723471
1,"Riding shotgun, Mrs Obama sang along to hits by Beyonce and Stevie Wonder - although her security limited the drive to the White House compound.\nMrs Obama confessed she had only ridden in the passenger's seat of a car once in the last seven years.\nCorden began hosting CBS's The Late, Late Show in March last year.\nCorden producer: How we got Michelle Obama to sing\nSinger, Missy Elliott, joined the ride for the song ""This Is For My Girls"", which is promoting Mrs Obama's Let Girls Learn initiative that supports girls' education worldwide.\nThe First Lady also rapped along to Missy Elliott's 2001 hit, Get Ur Freak On.\nOther guests on Cordon's Karaoke have included Adele, Sir Rod Stewart, Sir Elton John, Justin Bieber, Mariah Carey, Jennifer Lopez and One Direction.",The US First Lady joined comedian and television star James Cordon for Wednesday night's episode of Carpool Karaoke on The Late Late Show.,36855749
2,"A video was released via social media from MotoGP's Valencia Grand Prix, appearing to show the Italian colliding with the fan while riding a motorcycle.\nThe nine-time world champion apologised for the incident and said that he hoped she was ok.\nRossi, 37, added it was difficult for him to move quickly around the paddock.\nFan Ana Cabanillas Vazquez told Spanish radio station COPE she would have accepted the apology if she thought it ""had been an accident"".\n""Seeing the video, you can tell that it was done on purpose,"" she said.\n""I have a small bruise on my leg. I'll consider pressing charges.""\nRossi finished fourth in Valencia, the final race of the MotoGP season and came second in the championship standings behind Spain's Marc Marquez.",A fan has threatened to press charges against Valentino Rossi following an incident in the paddock that occurred while she was taking a selfie.,37977826


## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in a format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, model_max_length=512)

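If you are using one of the five T5 checkpoints, we have to prefix the inputs with "summarize: " (the model can also translate, and it needs the prefix to know which task it has to perform). Before setting the prefix, the next cell shows what the tokenizer returns when it is called directly on two target sentences through the `text_target` argument.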

In [9]:
tokenizer(text_target=["Hello, this one sentence!", "This is another sentence."], truncation=True, padding=True)

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0]]}
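
The second sequence in the output above is one token shorter than the first, so it was padded with the pad token ID (0 for this tokenizer) and its attention mask ends in 0. The following is a minimal pure-Python sketch of that padding step, not the tokenizer's actual implementation; `pad_batch` is a hypothetical helper name, and the token IDs are copied from the output above.

```python
def pad_batch(sequences, pad_token_id=0):
    """Pad token-ID lists to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded token IDs taken from the tokenizer output shown above
batch = pad_batch([[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]])
print(batch["input_ids"][1])       # [100, 19, 430, 7142, 5, 1, 0]
print(batch["attention_mask"][1])  # [1, 1, 1, 1, 1, 1, 0]
```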

In [10]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model. Padding is dealt with later (in a data collator), so that examples are padded to the longest length in the batch rather than in the whole dataset.

In [11]:
max_input_length = 512
max_target_length = 128

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [12]:
preprocess_function(raw_datasets['train'][:2])

{'input_ids': [[21603, 10, 37, 423, 583, 13, 1783, 16, 20126, 16496, 6, 80, 13, 8, 844, 6025, 4161, 6, 19, 341, 271, 14841, 5, 7057, 161, 19, 4912, 16, 1626, 5981, 11, 186, 7540, 16, 1276, 15, 2296, 7, 5718, 2367, 14621, 4161, 57, 4125, 387, 5, 15059, 7, 30, 8, 4653, 4939, 711, 747, 522, 17879, 788, 12, 1783, 44, 8, 15763, 6029, 1813, 9, 7472, 5, 1404, 1623, 11, 5699, 277, 130, 4161, 57, 18368, 16, 20126, 16496, 227, 8, 2473, 5895, 15, 147, 89, 22411, 139, 8, 1511, 5, 1485, 3271, 3, 21926, 9, 472, 19623, 5251, 8, 616, 12, 15614, 8, 1783, 5, 37, 13818, 10564, 15, 26, 3, 9, 3, 19513, 1481, 6, 18368, 186, 1328, 2605, 30, 7488, 1887, 3, 18, 8, 711, 2309, 9517, 89, 355, 5, 3966, 1954, 9233, 15, 6, 113, 293, 7, 8, 16548, 13363, 106, 14022, 84, 47, 14621, 4161, 6, 243, 255, 228, 59, 7828, 8, 1249, 18, 545, 11298, 1773, 728, 8, 8347, 1560, 5, 611, 6, 255, 243, 72, 1709, 1528, 161, 228, 43, 118, 4006, 91, 12, 766, 8, 3, 19513, 1481, 410, 59, 5124, 5, 96, 196, 17, 19, 1256, 68, 27, 103, 317, 132

To apply this function to all document-summary pairs in our dataset, we just use the `map` method of the `raw_datasets` object we created earlier. This applies the function to all the elements of all the splits in `raw_datasets`, so our training, validation and test data are preprocessed with a single command.

In [13]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and the cached data must therefore not be reused). For instance, it will properly detect if you change the model checkpoint and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts by batches together. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.
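
As a rough illustration of what `batched=True` means for the function passed to `map`, the sketch below slices a column-oriented dataset into batches and hands each batch to the function as a dictionary of lists, which is the shape `preprocess_function` receives. `batched_map` and `toy_preprocess` are hypothetical stand-ins (the real library also handles caching, storage and parallelism), and the whitespace "tokenizer" is only there for illustration.

```python
def batched_map(columns, function, batch_size=2):
    """Apply `function` to slices of a dict-of-lists and merge the new columns."""
    num_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Keep the original columns and add the newly computed ones
    return {**columns, **new_columns}


def toy_preprocess(examples):
    # Stand-in for the real tokenizer: split on whitespace
    return {
        "input_tokens": [("summarize: " + doc).split() for doc in examples["document"]],
        "label_tokens": [summary.split() for summary in examples["summary"]],
    }


toy_data = {
    "document": ["a b c", "d e", "f"],
    "summary": ["a", "d", "f"],
}
result = batched_map(toy_data, toy_preprocess)
print(result["input_tokens"])
```

With three rows and a batch size of two, `toy_preprocess` is called twice: once with two examples and once with the remaining one.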

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is of the sequence-to-sequence kind, we use the `AutoModelForSeq2SeqLM` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [14]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Note that we don't get a warning like in our classification example. This means we used all the weights of the pretrained model and there is no randomly initialized head in this case.

To instantiate a `Seq2SeqTrainer`, we will need to define three more things. The most important is the [`Seq2SeqTrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [15]:
batch_size = 16
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-xsum",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    # fp16=True,  # uncomment this for training on gpu
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the cell and customize the weight decay. Since the `Seq2SeqTrainer` saves the model regularly and checkpoints take up disk space, we tell it to keep at most three of them. Lastly, we use the `predict_with_generate` option (to properly generate summaries). Mixed precision training (`fp16=True`, to go a bit faster) is commented out; uncomment it when training on a GPU.

The original version of this notebook also passes `push_to_hub=True` so that the model is pushed to the [Hub](https://huggingface.co/models) regularly during training; that argument is not used here. If you add it and want to save your model locally under a name that differs from the repository it will be pushed to, or if you want to push your model under an organization rather than your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/t5-finetuned-xsum"` or `"huggingface/t5-finetuned-xsum"`).

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels:

In [16]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
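
To illustrate why the labels need their own padding value: `DataCollatorForSeq2Seq` pads labels with -100 by default, a value the loss function ignores, so padded positions do not contribute to training. The sketch below mimics that idea in pure Python; `pad_labels` and `restore_pad_ids` are hypothetical helpers, not library functions, and the token IDs are arbitrary.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def pad_labels(label_batch, ignore_index=IGNORE_INDEX):
    """Pad label sequences to the longest one in the batch with `ignore_index`."""
    max_len = max(len(labels) for labels in label_batch)
    return [labels + [ignore_index] * (max_len - len(labels)) for labels in label_batch]


def restore_pad_ids(label_batch, pad_token_id=0, ignore_index=IGNORE_INDEX):
    """Swap `ignore_index` back to the pad token ID so the labels can be decoded."""
    return [[pad_token_id if tok == ignore_index else tok for tok in labels]
            for labels in label_batch]


padded = pad_labels([[37, 1], [100, 19, 430, 1]])
print(padded)                   # [[37, 1, -100, -100], [100, 19, 430, 1]]
print(restore_pad_ids(padded))  # [[37, 1, 0, 0], [100, 19, 430, 1]]
```

The second helper corresponds to the `np.where(labels != -100, ...)` step in the `compute_metrics` function of this notebook.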

The last thing to define for our `Seq2SeqTrainer` is how to compute the metrics from the predictions. We need to define a function for this, which will just use the `metric` we loaded earlier, and we have to do a bit of pre-processing to decode the predictions into texts:

In [17]:
import nltk
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}

Then we just need to pass all of this along with our datasets to the `Seq2SeqTrainer`:

In [18]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)


We can now fine-tune our model by just calling the `train` method:

In [19]:
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,No log,3.686377,16.3754,1.8904,13.0638,13.2695,18.97




TrainOutput(global_step=7, training_loss=4.007627759660993, metrics={'train_runtime': 95.2307, 'train_samples_per_second': 1.05, 'train_steps_per_second': 0.074, 'total_flos': 13445362089984.0, 'train_loss': 4.007627759660993, 'epoch': 1.0})
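
The `global_step=7` in the output above can be checked by hand: with 100 training examples, a batch size of 16 and one epoch, the trainer performs ceil(100 / 16) = 7 optimizer steps. A small sketch of that arithmetic, assuming a single device and no gradient accumulation (`optimizer_steps` is a hypothetical helper):

```python
import math


def optimizer_steps(num_examples, batch_size, num_epochs=1):
    """Number of optimizer steps when the last, smaller batch is kept."""
    return math.ceil(num_examples / batch_size) * num_epochs


print(optimizer_steps(num_examples=100, batch_size=16, num_epochs=1))  # 7
```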

# Summarization evaluation

**ROUGE** (Recall-Oriented Understudy for Gisting Evaluation) is the metric most commonly used to evaluate automatically generated summaries. It measures the quality of a summary by the number of overlapping units (n-grams, word sequences, etc.) between summaries written by humans and summaries produced by summarization systems. ROUGE is not a single metric but a family of metrics. The most commonly used are ROUGE-N and ROUGE-L. The first measures the overlap of n-grams (typically unigrams and bigrams), while the second measures the longest common subsequence found in both summaries.

The original formula expresses the ratio between overlapping units $\sum matching\_units$ and the number of units in the reference summary $\sum reference\_units$:

\begin{equation}
    R_{ROUGE} = \frac{\sum matching\_units} {\sum reference\_units}
\end{equation}

Usually, the $F_1$ version is reported. To calculate the $F_1$ variant, we need to calculate the precision which is the ratio between overlapping units $\sum matching\_units$ and the number of units in the system summary $\sum system\_units$:

\begin{equation}
    P_{ROUGE} = \frac{\sum matching\_units} {\sum system\_units}
\end{equation}

After that, we can calculate the harmonic mean of precision and recall to get the $F_1$ score:

\begin{equation}
    F_{1_{ROUGE}} = 2 \cdot \frac{R_{ROUGE} \cdot P_{ROUGE}}{R_{ROUGE} + P_{ROUGE}}
\end{equation}
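
As a worked example of the three formulas above, the following sketch computes ROUGE-N recall, precision and $F_1$ from clipped n-gram counts. It is a simplified illustration (lowercasing and whitespace tokenization only, no stemming), not the implementation used by the `rouge` metric; `ngram_counts` and `rouge_n` are hypothetical helpers.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(system, reference, n=1):
    """Return (recall, precision, f1) for n-gram overlap between two strings."""
    sys_counts = ngram_counts(system.lower().split(), n)
    ref_counts = ngram_counts(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram
    matching = sum((sys_counts & ref_counts).values())
    recall = matching / max(sum(ref_counts.values()), 1)
    precision = matching / max(sum(sys_counts.values()), 1)
    f1 = 0.0 if matching == 0 else 2 * recall * precision / (recall + precision)
    return recall, precision, f1


print(rouge_n("I am a student", "I am a professor", n=1))  # (0.75, 0.75, 0.75)
print(rouge_n("I am a student", "I am a professor", n=2))
```

For this pair, three of four unigrams and two of three bigrams match, which agrees with the `rouge1` (75.0) and `rouge2` (66.67) values reported for the same toy pair later in this notebook.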

The ROUGE metric can be misleading and inaccurate in many cases. For example, observe the following two sequences:

1. The black cat is on the mat.
2. The mat is on the black cat.  

ROUGE-1 gives this pair a perfect score because both sentences contain exactly the same words, although their meanings differ. On the other hand, sentences with similar meanings and no overlapping words get a score of zero:

1. It is cold.
2. Chilly weather persists.

[BERTScore](https://arxiv.org/pdf/1904.09675.pdf) is an alternative that addresses this problem by comparing contextual token embeddings instead of exact word matches.
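
Both failure cases can be reproduced with a few lines of Python. The sketch below assumes lowercased whitespace tokens with the final period stripped. It shows that the two cat/mat sentences share exactly the same unigrams, while their longest common subsequence (the basis of ROUGE-L) is shorter than either sentence. `tokens`, `unigram_f1` and `lcs_length` are hypothetical helpers, not part of any metric library.

```python
from collections import Counter


def tokens(sentence):
    """Lowercase, drop the trailing period and split on whitespace."""
    return sentence.lower().rstrip(".").split()


def unigram_f1(system, reference):
    """F1 over clipped unigram matches (a simplified ROUGE-1)."""
    sys_counts, ref_counts = Counter(tokens(system)), Counter(tokens(reference))
    matching = sum((sys_counts & ref_counts).values())
    if matching == 0:
        return 0.0
    recall = matching / sum(ref_counts.values())
    precision = matching / sum(sys_counts.values())
    return 2 * recall * precision / (recall + precision)


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


s1 = "The black cat is on the mat."
s2 = "The mat is on the black cat."
print(unigram_f1(s1, s2))                  # 1.0
print(lcs_length(tokens(s1), tokens(s2)))  # 4 (out of 7 tokens)
print(unigram_f1("It is cold.", "Chilly weather persists."))  # 0.0
```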

In [20]:
# ROUGE score on toy example (the metric expects plain strings)
generated = ["I am a student"]
golden = ["I am a professor"]
result = metric.compute(predictions=generated, references=golden, use_stemmer=True)
result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
result = {k: round(v, 4) for k, v in result.items()}
print(result)

{'rouge1': 75.0, 'rouge2': 66.6667, 'rougeL': 75.0, 'rougeLsum': 75.0}


In [21]:
# Evaluate the model on validation and test sets
print(f"Evaluating model {model_name}")
val_results = trainer.evaluate()
test_results = trainer.predict(test_dataset=tokenized_datasets['test'])

print('Val results: ', val_results)
print('Test results:', test_results.metrics)

Evaluating model t5-small


Val results:  {'eval_loss': 3.6863772869110107, 'eval_rouge1': 16.3754, 'eval_rouge2': 1.8904, 'eval_rougeL': 13.0638, 'eval_rougeLsum': 13.2695, 'eval_gen_len': 18.97, 'eval_runtime': 40.2779, 'eval_samples_per_second': 2.483, 'eval_steps_per_second': 0.174, 'epoch': 1.0}
Test results: {'test_loss': 3.745168924331665, 'test_rouge1': 18.6558, 'test_rouge2': 3.0408, 'test_rougeL': 15.028, 'test_rougeLsum': 15.4121, 'test_gen_len': 19.0, 'test_runtime': 39.8343, 'test_samples_per_second': 2.51, 'test_steps_per_second': 0.176}


# Manual evaluation

Automatic evaluation metrics are great for getting a cheap and fast insight into how generative models perform. However, as we saw, they have many drawbacks, and a manual evaluation by humans is often needed. Manual evaluation of summaries involves assessing the quality and accuracy of a summarized text. This process is subjective to some extent but crucial for understanding how well a summary captures the essence of the original text. Here are some evaluation criteria that are commonly used in research.

### Evaluation Criteria
- **Content Coverage**: Does the summary capture the key points, arguments, and data from the original text?
- **Accuracy**: Is the information in the summary true to the original text, without misinterpretation or distortion?
- **Conciseness**: Does the summary eliminate redundant or non-essential information, presenting the core message efficiently?
- **Coherence and Structure**: Is the summary logically structured, easy to follow, and coherent in its presentation?
- **Readability**: Is the summary well-written, with proper grammar, punctuation, and an appropriate style for the intended audience?
- **Non-Redundancy**: Does the summary avoid repeating the same information unnecessarily?
- **Referential Clarity**: Are the pronouns and terms used in the summary clearly linked to their referents?
- **Focus**: Does the summary maintain a focus on the most important aspects of the original text without diverging into less relevant details?

### Scoring
Manual evaluation usually consists of assigning a score (e.g., 1-10) to each criterion based on the summary's performance and then calculating statistics over those scores.
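
As a small illustration of the scoring step, the sketch below aggregates per-criterion scores from several annotators into a mean and a sample standard deviation. The scores are made-up placeholder values, and the data layout (one dictionary per annotator) is an assumption chosen for the example; `summarize_scores` is a hypothetical helper.

```python
from statistics import mean, stdev

# Hypothetical 1-10 scores for one summary, one dictionary per annotator
annotations = [
    {"Content Coverage": 7, "Accuracy": 9, "Conciseness": 6},
    {"Content Coverage": 8, "Accuracy": 9, "Conciseness": 5},
    {"Content Coverage": 6, "Accuracy": 8, "Conciseness": 7},
]


def summarize_scores(annotations):
    """Return {criterion: (mean, sample standard deviation)} across annotators."""
    criteria = annotations[0].keys()
    return {
        criterion: (
            mean(a[criterion] for a in annotations),
            stdev(a[criterion] for a in annotations),
        )
        for criterion in criteria
    }


for criterion, (avg, sd) in summarize_scores(annotations).items():
    print(f"{criterion}: mean={avg:.2f}, stdev={sd:.2f}")
```

A large standard deviation for a criterion suggests that the annotators disagree and that its definition may need to be made more precise.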

# Inference

In [22]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

example_id = 0
sum_text = raw_datasets['test'][example_id]['document']
input_ids = tokenizer(f"summarize: {sum_text}", return_tensors="pt", max_length=512, truncation=True).input_ids
input_ids = input_ids.to(device)
outputs = model.generate(input_ids,
                          max_length=256,
                          no_repeat_ngram_size=2,
                          num_beams=5,
                          num_return_sequences=5
                          )

In [23]:
print("\nDOCUMENT:\n", raw_datasets['test'][example_id]['document'])
decoded_preds = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\nGENERATED:\n", decoded_preds)


DOCUMENT:
 Prison Link Cymru had 1,099 referrals in 2015-16 and said some ex-offenders were living rough for up to a year before finding suitable accommodation.
Workers at the charity claim investment in housing would be cheaper than jailing homeless repeat offenders.
The Welsh Government said more people than ever were getting help to address housing problems.
Changes to the Housing Act in Wales, introduced in 2015, removed the right for prison leavers to be given priority for accommodation.
Prison Link Cymru, which helps people find accommodation after their release, said things were generally good for women because issues such as children or domestic violence were now considered.
However, the same could not be said for men, the charity said, because issues which often affect them, such as post traumatic stress disorder or drug dependency, were often viewed as less of a priority.
Andrew Stevens, who works in Welsh prisons trying to secure housing for prison leavers, said the need fo

# Save and reload your models

In [24]:
from google.colab import drive
drive.mount('/content/drive')

ModuleNotFoundError: No module named 'google'

Save your model to disk. The path below assumes that Google Drive has been mounted in Colab.

In [None]:
trainer.save_model("/content/drive/MyDrive/Colab Notebooks/models/summarization/sum-model")

Reload the model from disk.

In [None]:
import torch
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
model_checkpoint = "/content/drive/MyDrive/Colab Notebooks/models/summarization/sum-model"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint).to(DEVICE)

# Machine translation

In [25]:
from transformers import MarianMTModel, MarianTokenizer

# Sample dataset: English source texts and their French translations
dataset = [
    {"en": "I love natural language processing.", "fr": "J'aime le traitement automatique des langues."},
    {"en": "This is an example of text translation.", "fr": "Ceci est un exemple de traduction de texte."},
]

In [26]:
# Load the tokenizer and model for English to French translation
model_name = 'Helsinki-NLP/opus-mt-en-fr'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

In [27]:
# Load the BLEU metric from the evaluate library
import evaluate
bleu_metric = evaluate.load("bleu")

In [28]:
# Updated function to translate text using Hugging Face Transformers
def translate(texts, model, tokenizer):
    # Prepare the input data using the tokenizer's __call__ method
    encoded_texts = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=512)
    
    # Generate translated texts
    translated_texts = model.generate(**encoded_texts)
    
    # Decode the generated tokens to text
    return [tokenizer.decode(t, skip_special_tokens=True) for t in translated_texts]

In [29]:
# Translate the source texts
source_texts = [item["en"] for item in dataset]
translated_texts = translate(source_texts, model, tokenizer)

# Prepare references and predictions for BLEU calculation
references = [[item["fr"]] for item in dataset]
predictions = list(translated_texts)

print(references)
print(predictions)

[["J'aime le traitement automatique des langues."], ['Ceci est un exemple de traduction de texte.']]
["J'adore le traitement du langage naturel.", "C'est un exemple de traduction de texte."]


## Machine translation evaluation: BLEU Score

The BLEU (Bilingual Evaluation Understudy) score provides a measure for automatically evaluating machine-translated text. The essence of BLEU is to compare n-grams of the machine's output with those of human translations, calculating precision scores for each n-gram level. The final BLEU score is a geometric mean of these precisions, adjusted by a penalty for short translations.

### Key Components

- **N-grams**: Sequences of 'n' words used for comparison between the machine translation and reference translations.
- **Precision**: The ratio of the number of matching n-grams in the machine translation to the total number of n-grams in the machine translation.
- **Brevity Penalty (BP)**: A penalty applied to discourage overly short translations, which can artificially inflate precision scores.

### BLEU Score Formula

The BLEU score is calculated as follows:

1. Compute n-gram precision scores for various n-gram lengths (usually 1 to 4).
2. Calculate the geometric mean of these precision scores.
3. Apply the brevity penalty to penalize short machine translations.

The formula for the BLEU score is:

$$
\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log(p_n)\right)
$$

Where:
- $p_n$ is the precision for n-grams.
- $w_n$ is the weight for each n-gram precision (often set to 1/N for equal weighting).
- $\text{BP}$ (Brevity Penalty) is calculated as:

$$
\text{BP} = \left\{
  \begin{array}{ll}
  1 & \text{if } c > r \\
  e^{(1-r/c)} & \text{if } c \leq r
  \end{array}
\right.
$$

In the BP formula, $c$ is the length of the machine-translated text, and $r$ is the effective reference corpus length.
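
The formula can be turned into a short from-scratch sketch. It handles a single candidate with a single reference, uses whitespace tokenization and uniform weights $w_n = 1/N$, and returns 0 when any $p_n$ is 0 (no smoothing). These are simplifying assumptions, so its numbers will not necessarily match the `evaluate` implementation. `ngram_counts` and `simple_bleu` are hypothetical helpers.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU for one candidate and one reference string."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped matches: each n-gram counts at most as often as in the reference
        matches = sum((cand_counts & ref_counts).values())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # log(0) is undefined; no smoothing in this sketch
        log_precisions.append(math.log(matches / total))
    c, r = len(cand), len(ref)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of the precisions with uniform weights w_n = 1 / max_n
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat today"
print(simple_bleu(reference, reference))                 # 1.0
print(simple_bleu("the cat sat on the mat", reference))  # all p_n = 1, BP = exp(1 - 7/6)
```

In the second call every n-gram of the candidate appears in the reference, so the score is lowered only by the brevity penalty.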


In [30]:
# We import tokenizer from nltk
from nltk.tokenize import word_tokenize

# Calculate and print the BLEU score
bleu_score = bleu_metric.compute(predictions=predictions, references=references, tokenizer=word_tokenize)
print(f"BLEU Score: {bleu_score['bleu'] * 100}")

BLEU Score: 48.54690995445692


# Exercises

- Try other pretrained models for text summarization: https://huggingface.co/models?pipeline_tag=summarization&sort=downloads
- Change fine-tuning and inference parameters
- Fine-tune models for other text generation tasks (e.g., question-answering, story generation, ...): https://github.com/huggingface/notebooks/tree/main/examples

# References

This notebook is an adaptation of https://github.com/huggingface/notebooks/blob/main/examples/summarization.ipynb