## [CamemBERTsum](https://huggingface.co/mrm8488/camembert2camembert_shared-finetuned-french-summarization) fine-tuned on [mlsum-fr](https://huggingface.co/datasets/viewer/?dataset=mlsum)

> MLSUM is the first large-scale MultiLingual SUMmarization dataset. Obtained from **online newspapers**, it contains **1.5M+ article/summary pairs** in five different languages -- namely, **French**, German, Spanish, Russian, Turkish.
> https://huggingface.co/mrm8488/camembert2camembert_shared-finetuned-french-summarization#dataset

>* Size of downloaded dataset files: 591.27 MB
>* Size of the generated dataset: 1537.36 MB
>* Total amount of disk used: 2128.63 MB
>* An example of 'validation' looks as follows.
> ```json
> {
>     "date": "01/01/2001",
>     "summary": "A text",
>     "text": "This is a text",
>     "title": "A sample",
>     "topic": "football",
>     "url": "https://www.google.com"
> }
> ```
>https://huggingface.co/datasets/mlsum#fr

In [1]:
from pathlib import Path

import torch
from transformers import RobertaTokenizerFast, EncoderDecoderModel

from book_loader import BookLoader

In [2]:
data_path = Path("data/D5627-Dolan.docx").expanduser().resolve()

start_marker = r"^Introduction$"
compiled_header_marker = (rf"(?:^Chapitre \d+ /.+"
                          rf"|{start_marker}"
                          rf"|^Stress, santé et performance au travail$)")
chapter_marker = r"^Chapitre (\d+) /$"
na_span_markers = (
    r"^exerCiCe \d\.\d /$",
    '|'.join([chapter_marker,
              r"^Les caractéristiques personnelles\.",
              r"/\tLocus de contrôle$",
              r"^L'observation de sujets a amené Rotter",
              r"^Lorsqu'une personne souffre de stress"]))

book = BookLoader(data_path,
                  {"start_marker": start_marker,
                   "end_marker": r"^Annexe /$",
                   "chapter_marker": chapter_marker,
                   "header_marker": compiled_header_marker,
                   "ps_marker": r"^Conclusion$",
                   "na_span_markers": na_span_markers})

chapters: list[list[str]] = book.chapters

  warn("Skipping unexpected tag: %s" % (current.tag),
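
The marker patterns above can be sanity-checked on their own before handing them to `BookLoader`. The sketch below only shows which kinds of lines the regular expressions match. `BookLoader` is a project-local module whose internals are not shown here, so how it actually applies the patterns is an assumption; the sample lines and the `classify` helper are hypothetical and made up for illustration.

```python
import re

# Patterns copied from the cell above.
start_marker = r"^Introduction$"
chapter_marker = r"^Chapitre (\d+) /$"
header_marker = (rf"(?:^Chapitre \d+ /.+"
                 rf"|{start_marker}"
                 rf"|^Stress, santé et performance au travail$)")

# Hypothetical paragraph lines, invented for illustration only.
samples = [
    "Chapitre 3 /",                      # bare chapter line
    "Chapitre 3 / Un titre quelconque",  # chapter number followed by more text
    "Introduction",
    "Un paragraphe ordinaire.",
]


def classify(line: str) -> str:
    """Label a line by the first pattern that matches it (hypothetical helper)."""
    if re.search(chapter_marker, line):
        return "chapter"
    if re.search(header_marker, line):
        return "header"
    return "text"


labels = [classify(s) for s in samples]
for s, label in zip(samples, labels):
    print(f"{label:8} {s!r}")
```

Note that `chapter_marker` requires the line to end right after the slash, whereas the first alternative of the header pattern requires at least one more character after it, so the two never match the same line.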


In [3]:
torch.cuda.is_available()
# TODO: Investigate why CUDA is unavailable. It may be related to the exe install task that needs Samuel.
#       Alternatively, run this directly in WSL as a .py script.

False

In [4]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ckpt = 'mrm8488/camembert2camembert_shared-finetuned-french-summarization'
tokenizer = RobertaTokenizerFast.from_pretrained(ckpt)
model = EncoderDecoderModel.from_pretrained(ckpt).to(device)

The following encoder weights were not tied to the decoder ['roberta/pooler']
The following encoder weights were not tied to the decoder ['roberta/pooler']


In [5]:
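def generate_summary(text):
    """Summarize `text`; the input is truncated to 512 tokens."""
    # Tokenize a single-item batch, padded and truncated to 512 tokens.
    inputs = tokenizer([text], padding="max_length", truncation=True,
                       max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    # No generation arguments are passed, so the checkpoint's defaults apply.
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)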
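
Because the tokenizer call uses `truncation=True` with `max_length=512`, only the beginning of a long chapter reaches the model. One possible workaround, sketched below, is to pack consecutive paragraphs into chunks that fit a token budget and summarize each chunk separately. `chunk_paragraphs` and its `count_tokens` parameter are hypothetical names, not part of `transformers`; the default counter splits on whitespace, which only approximates the tokenizer's subword count.

```python
from typing import Callable, Iterable


def chunk_paragraphs(paragraphs: Iterable[str],
                     budget: int = 512,
                     count_tokens: Callable[[str], int] = lambda s: len(s.split()),
                     ) -> list[str]:
    """Greedily pack consecutive paragraphs into chunks of at most `budget` tokens.

    A single paragraph longer than `budget` becomes its own chunk and would
    still be truncated by the tokenizer downstream.
    """
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for para in paragraphs:
        n = count_tokens(para)
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and used + n > budget:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(para)
        used += n
    if current:
        chunks.append("\n".join(current))
    return chunks


# Toy example with a budget of 5 whitespace-separated words.
demo = chunk_paragraphs(["a b c", "d e", "f g h", "i"], budget=5)
print(demo)
```

In this notebook the counter could plausibly be swapped for `lambda s: len(tokenizer(s).input_ids)` and each chunk passed to `generate_summary`, leaving some margin in the budget for special tokens. That combination has not been run here.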

In [6]:
# Join each chapter's paragraphs into one string (rebinds `chapters` from lists to strings).
chapters = ['\n'.join(chapter) for chapter in book.chapters]

In [7]:
summaries = {idx: generate_summary(chapter) for idx, chapter in enumerate(chapters)}

Setting `pad_token_id` to `eos_token_id`:6 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:6 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:6 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:6 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:6 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:6 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:6 for open-end generation.


In [8]:
for chap_idx, summ in summaries.items():
    print(f"Chapter {chap_idx}:\n{summ}\n")

# TODO: Investigate how these summaries compare to the references
# TODO: Investigate why it's one sentence per summary
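
As a first step on the second TODO, the number of sentences in each summary can be counted rather than eyeballed. The sketch below applies a naive splitter to two of the summaries printed by this cell; `count_sentences` is a hypothetical helper, and splitting on end punctuation followed by whitespace is an assumption that would miscount text containing abbreviations.

```python
import re

# Two of the summaries printed by this cell, copied verbatim.
printed = {
    2: ("Une grande partie du stress que les gens ressentent ne vient pas "
        "d'avoir trop à faire, il vient de ne pas finir ce qu'ils ont commencé."),
    3: ("Le stress est un élément normal de notre vie quotidienne. Bien que ce "
        "ne soit pas toujours le cas, il est généralement associé à un état "
        "négatif ou à une expérience préjudiciable qu'il faut éliminer à tout prix."),
}


def count_sentences(text: str) -> int:
    """Naive count: split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return sum(1 for p in parts if p)


counts = {idx: count_sentences(s) for idx, s in printed.items()}
print(counts)  # {2: 1, 3: 2}
```

On these two samples the count is one and two sentences respectively, so the summaries are short but not uniformly a single sentence.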

Chapter 0:
Le monde organisationnel est soumis à des exigences de productivité historiques, exigences qui demeurent, faut-il l'admettre, complexes à concilier avec une promotion et un soutien de la santé psychologique au travail.

Chapter 1:
Le coin du coach. Si la main-d'œuvre, mieux formée qu'auparavant, cherche à maîtriser son travail par la participation.

Chapter 2:
Une grande partie du stress que les gens ressentent ne vient pas d'avoir trop à faire, il vient de ne pas finir ce qu'ils ont commencé.

Chapter 3:
Le stress est un élément normal de notre vie quotidienne. Bien que ce ne soit pas toujours le cas, il est généralement associé à un état négatif ou à une expérience préjudiciable qu'il faut éliminer à tout prix.

Chapter 4:
Le coin du coach. Le lecteur pourrait être surpris de voir au cœur d'un livre sur le stress professionnel un chapitre entier consacré à la notion d'estime de soi. Effectivement, nul ne peut entièrement contrôler l'émergence de situations potentiellement 