## [CamemBERTsum](https://huggingface.co/mrm8488/camembert2camembert_shared-finetuned-french-summarization) fine-tuned on [mlsum-fr](https://huggingface.co/datasets/viewer/?dataset=mlsum)

> MLSUM is the first large-scale MultiLingual SUMmarization dataset. Obtained from **online newspapers**, it contains **1.5M+ article/summary pairs** in five different languages -- namely, **French**, German, Spanish, Russian, Turkish.
> https://huggingface.co/mrm8488/camembert2camembert_shared-finetuned-french-summarization#dataset

>* Size of downloaded dataset files: 591.27 MB
>* Size of the generated dataset: 1537.36 MB
>* Total amount of disk used: 2128.63 MB
>* An example of 'validation' looks as follows.
>```json
>{
>    "date": "01/01/2001",
>    "summary": "A text",
>    "text": "This is a text",
>    "title": "A sample",
>    "topic": "football",
>    "url": "https://www.google.com"
>}
>```
>https://huggingface.co/datasets/mlsum#fr

In [1]:
from pathlib import Path

import torch
from transformers import RobertaTokenizerFast, EncoderDecoderModel

from book_loader import BookLoader

In [2]:
data_path = Path("data/D5627-Dolan.docx").expanduser().resolve()

start_marker = r"^Introduction$"
compiled_header_marker = (rf"(?:^Chapitre \d+ /.+"
                          rf"|{start_marker}"
                          rf"|^Stress, santé et performance au travail$)")
chapter_marker = r"^Chapitre (\d+) /$"
na_span_markers = (
    r"^exerCiCe \d\.\d /$",
    '|'.join([chapter_marker,
              r"^Les caractéristiques personnelles\.",
              r"/\tLocus de contrôle$",
              r"^L'observation de sujets a amené Rotter",
              r"^Lorsqu'une personne souffre de stress"]))

book = BookLoader(data_path,
                  {"start_marker": start_marker,
                   "end_marker": r"^Annexe /$",
                   "chapter_marker": chapter_marker,
                   "header_marker": compiled_header_marker,
                   "ps_marker": r"^Conclusion$",
                   "na_span_markers": na_span_markers})

chapters: list[list[str]] = book.chapters

  warn("Skipping unexpected tag: %s" % (current.tag),
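
The marker patterns above are easier to reason about when run against a few lines. The sketch below only shows what the regular expressions themselves match; how `BookLoader` applies them (search vs. match, order of precedence) is not shown in this notebook, so the `classify` helper and the sample lines are hypothetical.

```python
import re

# Patterns copied from the cell above.
start_marker = r"^Introduction$"
chapter_marker = r"^Chapitre (\d+) /$"
compiled_header_marker = (rf"(?:^Chapitre \d+ /.+"
                          rf"|{start_marker}"
                          rf"|^Stress, santé et performance au travail$)")

# Hypothetical paragraph strings, made up for illustration only.
samples = [
    "Chapitre 3 /",
    "Chapitre 3 / Le stress au travail",
    "Introduction",
    "Stress, santé et performance au travail",
    "Le stress est une réaction de l'organisme.",
]


def classify(line: str) -> str:
    """Label a line by the first pattern it matches (assumed precedence)."""
    if re.search(chapter_marker, line):
        return "chapter"
    if re.search(compiled_header_marker, line):
        return "header"
    return "body"


labels = [classify(s) for s in samples]
for s, label in zip(samples, labels):
    print(f"{label:8} {s}")
```

Note that a bare `Chapitre 3 /` matches only `chapter_marker`, because the header alternative requires at least one character after the slash.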


In [3]:
torch.cuda.is_available()
# TODO: Investigate why CUDA is unavailable. It may be related to the .exe install
#       task that needs Samuel's help.
#       Another option is to run this directly in WSL as a .py script.

False

In [4]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ckpt = 'mrm8488/camembert2camembert_shared-finetuned-french-summarization'
tokenizer = RobertaTokenizerFast.from_pretrained(ckpt)
model = EncoderDecoderModel.from_pretrained(ckpt).to(device)

The following encoder weights were not tied to the decoder ['roberta/pooler']
The following encoder weights were not tied to the decoder ['roberta/pooler']


In [7]:
# Join each chapter's paragraphs into one string, skipping the first and last entries.
chapters = ['\n'.join(chapter) for chapter in book.chapters[1:-1]]

In [41]:
# model.config.max_length = 64
len_out_seq = model.config.max_length
def generate_summary(text):
    """Summarize `text` with beam search; input beyond 512 tokens is truncated."""
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    # https://huggingface.co/docs/transformers/main_classes/text_generation
    output = model.generate(input_ids,
                            attention_mask=attention_mask,
                            # NOTE: min_length == max_length forces a fixed output
                            #       length, so a summary may stop mid-sentence.
                            min_length=len_out_seq * 8,
                            max_length=len_out_seq * 8,
                            # NOTE: values below 1.0 reward repetition; use a value
                            #       above 1.0 to penalize it.
                            repetition_penalty=0.9,
                            num_beams=10)
    return tokenizer.decode(output[0], skip_special_tokens=True)
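
To see what `repetition_penalty=0.9` does to the scores, here is a pure-Python sketch of the CTRL-style rescaling rule that, as far as I understand it, the library applies to tokens already present in the sequence. The function name, the three-token vocabulary, and the scores are hypothetical; the real implementation works on tensors.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Rescale the scores of tokens that were already generated."""
    adjusted = list(logits)
    for tok in set(seen_token_ids):
        score = adjusted[tok]
        # Negative scores are multiplied and positive ones divided, so a
        # penalty above 1 lowers the score and a penalty below 1 raises it.
        adjusted[tok] = score * penalty if score < 0 else score / penalty
    return adjusted


logits = [2.0, -1.0, 0.5]  # hypothetical scores for a 3-token vocabulary
seen = [0, 1]              # tokens 0 and 1 were already generated

boosted = apply_repetition_penalty(logits, seen, penalty=0.9)
damped = apply_repetition_penalty(logits, seen, penalty=1.2)
print("penalty=0.9:", boosted)
print("penalty=1.2:", damped)
```

With a penalty of 0.9 the already-used tokens become more likely, which may explain some of the doubled words in the generated summaries.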

In [42]:
summaries = {idx + 1: generate_summary(chapter) for idx, chapter in enumerate(chapters)}

Setting `pad_token_id` to `eos_token_id`:6 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:6 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:6 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:6 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:6 for open-end generation.
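
Because `generate_summary` truncates its input to 512 tokens, each chapter summary is based only on the chapter's opening. One possible workaround, sketched here under assumptions, is to pack paragraphs into smaller chunks and summarize each chunk separately. Word counts stand in for token counts, the 300-word budget is arbitrary, and `chunk_paragraphs`, `summarize_long`, and `first_words` are hypothetical helpers, not part of this notebook.

```python
from typing import Callable


def chunk_paragraphs(paragraphs: list[str], max_words: int = 300) -> list[str]:
    """Greedily pack consecutive paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words becomes its own oversized chunk.
    """
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n".join(current))
    return chunks


def summarize_long(paragraphs: list[str],
                   summarize: Callable[[str], str],
                   max_words: int = 300) -> str:
    """Summarize each chunk on its own and join the partial summaries."""
    return " ".join(summarize(c) for c in chunk_paragraphs(paragraphs, max_words))


def first_words(text: str) -> str:
    """Stand-in summarizer for the demo: keep the first five words."""
    return " ".join(text.split()[:5])


demo = ["un deux trois quatre", "cinq six sept", "huit neuf dix onze douze"]
chunks = chunk_paragraphs(demo, max_words=7)
combined = summarize_long(demo, first_words, max_words=7)
print(chunks)
print(combined)
```

In the notebook, `generate_summary` could be passed as the `summarize` argument, with each entry of `book.chapters` as the paragraph list.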


In [43]:
for chap_idx, summ in summaries.items():
    print(f"Chapter {chap_idx}:\n{summ}\n")

# Note: The summaries appear to be hybrid (partly extractive, partly abstractive)
#       and lean towards extracting the leading spans of each chapter.

Chapter 1:
Le coin du coach. L'amélioration de l'efficience organisationnelle nécessite de considérer la qualité de vie et la santé psychologique au travail comme des leviers de maximisation de la performance. Or trois facteurs ont, entre autres, rapidement transformé cette réalité : l'allongement des cycles de récession ; la Le Revue Revue St St Stimhihihi (voir l'esprit du coin) et l'irruption de nouvelles approches managériales sur le plan de la productivité, et non comme des antagonistes naturels. Or, la main-d'œuvre, les organisations, se sentent de plus en plus contraintes d'accroître leur productivité. Or Or qu'il n'y a-t-il la nécessité de rehausser du même coup que l'entreprise. Or l'ensemble de la collectivité et sur la capacité concurrentielle. Or donc, c'est la nécessité d'améliorer la productivité et la capacité capacité à maîtriser son travail par la participation des travailleurs, les conditions conditions de travail, l'évolution démographique et le bien-être au travail 