## [CamemBERTsum](https://huggingface.co/mrm8488/camembert2camembert_shared-finetuned-french-summarization) fine-tuned on [mlsum-fr](https://huggingface.co/datasets/viewer/?dataset=mlsum)

> MLSUM is the first large-scale MultiLingual SUMmarization dataset. Obtained from **online newspapers**, it contains **1.5M+ article/summary pairs** in five different languages -- namely, **French**, German, Spanish, Russian, Turkish.
> https://huggingface.co/mrm8488/camembert2camembert_shared-finetuned-french-summarization#dataset

>* Size of downloaded dataset files: 591.27 MB
>* Size of the generated dataset: 1537.36 MB
>* Total amount of disk used: 2128.63 MB
>* An example of 'validation' looks as follows.
> ```json
> {
>     "date": "01/01/2001",
>     "summary": "A text",
>     "text": "This is a text",
>     "title": "A sample",
>     "topic": "football",
>     "url": "https://www.google.com"
> }
> ```
>https://huggingface.co/datasets/mlsum#fr

In [1]:
from pathlib import Path
import re
from tqdm import tqdm

import torch
from transformers import RobertaTokenizerFast, EncoderDecoderModel

from spacy.lang.fr import French

from book_loader import BookLoader

In [2]:
doc_path = Path("data/D5627-Dolan.docx").expanduser().resolve()

start_marker = r"^Introduction$"
slice_markers = (start_marker, re.compile(r"^Annexe /$"))
conclusion_marker = r"^Conclusion$"
compiled_header_marker = re.compile(
    rf"^Chapitre \d+ /.+"
    rf"|{start_marker}"
    rf"|^Stress, santé et performance au travail$"
    rf"|{conclusion_marker}")
chapter_marker = rf"^Chapitre \d+ /$|{conclusion_marker}"
na_span_markers = (
        r"^exerCiCe \d\.\d /$",
        '|'.join([chapter_marker,
                  r"^Les caractéristiques personnelles\.",
                  r"/\tLocus de contrôle$",
                  r"^L'observation de sujets a amené Rotter",
                  r"^Lorsqu'une personne souffre de stress"]))

book = BookLoader(doc_path,
                  {"slice_markers": slice_markers,
                   "chapter_marker": chapter_marker,
                   "header_marker": compiled_header_marker,
                   "na_span_markers": na_span_markers})

chapters: list[list[str]] = book.chapters

  warn("Skipping unexpected tag: %s" % (current.tag),
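
The marker patterns above are easier to reason about when run against a few lines. The sketch below rebuilds two of them and applies `re.search` to sample paragraphs. How `BookLoader` actually applies the markers is not shown here, so the use of `re.search` is an assumption, and the sample lines (including the chapter title) are invented for illustration.

```python
import re

start_marker = r"^Introduction$"
conclusion_marker = r"^Conclusion$"
header_marker = re.compile(
    r"^Chapitre \d+ /.+"
    rf"|{start_marker}"
    r"|^Stress, santé et performance au travail$"
    rf"|{conclusion_marker}")
chapter_marker = re.compile(rf"^Chapitre \d+ /$|{conclusion_marker}")

# Hypothetical paragraph lines, invented for illustration only
sample_lines = [
    "Introduction",
    "Chapitre 1 /",
    "Chapitre 1 / Un titre inventé",
    "Un paragraphe ordinaire.",
    "Conclusion",
]

for line in sample_lines:
    is_header = bool(header_marker.search(line))
    is_chapter = bool(chapter_marker.search(line))
    print(f"{line!r}: header={is_header}, chapter={is_chapter}")
```

Note that a bare `Chapitre 1 /` line matches the chapter marker but not the header marker, because the header pattern requires at least one character after the slash.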


In [3]:
torch.cuda.is_available()
# TODO: Investigate why CUDA is unavailable. It may be related to the exe install task that needs Samuel's help.
#       Another option is to run this directly in WSL as a .py script.

False

In [4]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ckpt = 'mrm8488/camembert2camembert_shared-finetuned-french-summarization'
tokenizer = RobertaTokenizerFast.from_pretrained(ckpt)
model = EncoderDecoderModel.from_pretrained(ckpt).to(device)

The following encoder weights were not tied to the decoder ['roberta/pooler']


In [5]:
# model.config.max_length = 64
len_out_seq = model.config.max_length
def generate_summary(text):
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    # https://huggingface.co/docs/transformers/main_classes/text_generation
    output = model.generate(input_ids,
                            attention_mask=attention_mask,
                            min_length=len_out_seq * 8,
                            max_length=len_out_seq * 8,
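                            # NOTE: 1.0 means no penalty; a value below 1.0 such as 0.5
                            # rewards already-generated tokens, so use > 1.0 to discourage repeats
                            repetition_penalty=0.5,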
                            num_beams=10)
    return tokenizer.decode(output[0], skip_special_tokens=True)
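
Because the tokenizer call truncates its input at 512 tokens, any part of a chapter beyond that point never reaches the model. One possible workaround, sketched below, is to group paragraphs into smaller chunks and summarize each chunk separately. The helper name `chunk_paragraphs` and the `max_words` budget are hypothetical, and whitespace word counts are only a crude stand-in for the tokenizer's real token counts.

```python
def chunk_paragraphs(paragraphs, max_words=300):
    """Group consecutive paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept whole in its own chunk.
    Word counts are whitespace-based, which only approximates token counts.
    """
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n_words = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget
        if current and count + n_words > max_words:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n_words
    if current:
        chunks.append("\n".join(current))
    return chunks


# Hypothetical chapter: three paragraphs of 4, 3 and 5 words
demo = ["un deux trois quatre", "cinq six sept", "huit neuf dix onze douze"]
print(chunk_paragraphs(demo, max_words=8))
```

Each chunk could then be passed to `generate_summary` and the partial summaries joined. Whether that improves quality for this checkpoint has not been tested here.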

In [12]:
nlp = French()
nlp.add_pipe("sentencizer")

# Remove last sentence as the decoder tends to generate it incompletely
def trim(text):
    assert isinstance(text, str)
    sents = map(str, nlp(text).sents)
    all_sents_but_last = list(sents)[:-1]
    return '\n'.join(all_sents_but_last)
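
`trim` drops the final sentence unconditionally, even when the decoder happened to finish it. A more conservative variant, sketched below, drops the last sentence only when the text does not end in terminal punctuation. This is an assumption-laden alternative, not the notebook's method: it swaps spaCy's sentencizer for a naive regex split, so abbreviations such as "M. Dupont" would be split incorrectly, and the name `trim_if_incomplete` is hypothetical.

```python
import re

# Naive sentence boundary: whitespace that follows ., !, ? or an ellipsis
_SENT_BOUNDARY = re.compile(r"(?<=[.!?…])\s+")
# Characters accepted as the end of a complete sentence
_TERMINAL = (".", "!", "?", "…", "»", '"')


def trim_if_incomplete(text):
    """Return one sentence per line, dropping the last one only if it looks cut off."""
    sents = [s for s in _SENT_BOUNDARY.split(text.strip()) if s]
    if sents and not sents[-1].endswith(_TERMINAL):
        sents = sents[:-1]
    return "\n".join(sents)


print(trim_if_incomplete("Ceci est une phrase. Ceci en est une autre. Une phrase coup"))
```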

In [13]:
chapters_to_summarize = book.chapters[1:-3]  # Chapters 1 to 3, to align with available references

chapters_to_summarize = ['\n'.join(chapter) for chapter in chapters_to_summarize]
summaries = {
    idx + 1: trim(generate_summary(chapter))
    for idx, chapter in tqdm(enumerate(chapters_to_summarize))
}



0it [00:00, ?it/s]
Setting `pad_token_id` to `eos_token_id`:6 for open-end generation.
1it [01:24, 84.20s/it]
Setting `pad_token_id` to `eos_token_id`:6 for open-end generation.
2it [02:45, 82.45s/it]
Setting `pad_token_id` to `eos_token_id`:6 for open-end generation.
3it [03:56, 78.83s/it]
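
Once `summaries` is populated, it could be persisted so that a later evaluation step does not have to rerun generation. The sketch below writes one JSON object per line (JSONL). The file name, the record keys `chapter` and `summary`, and the helper `write_summaries` are all hypothetical choices, and the demo dict stands in for the real `summaries`.

```python
import json
import tempfile
from pathlib import Path


def write_summaries(summaries, out_path):
    """Write a {chapter_index: summary} dict as one JSON object per line."""
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for chap_idx, summary in sorted(summaries.items()):
            record = {"chapter": chap_idx, "summary": summary}
            # ensure_ascii=False keeps accented French characters readable
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


# Placeholder data standing in for the real `summaries` dict
demo_summaries = {1: "Premier résumé.", 2: "Deuxième résumé."}
demo_path = write_summaries(
    demo_summaries, Path(tempfile.gettempdir()) / "summaries_demo.jsonl")
print(demo_path.read_text(encoding="utf-8"))
```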


In [14]:
# nlp = French()
# nlp.add_pipe("sentencizer")

# doc = nlp("Ceci est une phrase. Ceci en est une autre.")
# sents = list(map(str, doc.sents))
# assert len(sents) == 2
# sents
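
Generated summaries are usually compared with reference summaries using ROUGE. As a rough illustration of what that measures, the sketch below computes a simplified ROUGE-1 (clipped unigram overlap) with lowercased whitespace tokenization. It applies no stemming or punctuation handling, so its scores will differ from those of an established ROUGE package. The function name `rouge1` and the example sentences are hypothetical.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Simplified ROUGE-1: precision, recall and F1 over clipped unigram counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical candidate and reference, invented for illustration
print(rouge1("le chat dort", "le chat noir dort ici"))
```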

In [15]:
for chap_idx, summ in summaries.items():
    print(f"CHAPTER {chap_idx}:\n")
    print(summ)
#     for sent in nlp(summ).sents:
#         print(f"{sent}\n")
    print("-------------------------------------")

# Note: Summaries seem to be hybrid (extractive and abstractive) and lean towards extracting leading spans

# TODO: Write these out to a file in a format that would facilitate ROUGE evaluation
# TODO: Write an "evaluation.ipynb"

CHAPTER 1:

Le coin du coach.
L'amélioration de l'efficience organisationnelle nécessite de considérer la qualité de vie et la santé psychologique au travail comme des leviers de maximisation de la performance.
Or trois facteurs ont, entre autres, rapidement transformé cette réalité, et non comme des antagonistes naturels.
Or, Le L'''C''s'Cen's, l'avenir du travail, est de plus en plus besoin de nouvelles approches managériales permettant de rehausser du même coup leur qualité et leur productivité.
Or Or, la main-d'œuvre, force est en effet nécessaire pour l'organisation, estime l'entreprise est en train de s'affranchir de la concurrence mondiale et de la capacité concurrentielle.
Or la capacité de vie au travail au travail et de leur productivité, comme l'ensemble de leur travail.
Or cette quête de la productivité est maintenant nécessaire pour simplement assurer la pérennité de leur capacité de travail.
Le travail, au-delà de la force concurrentielle, s'il faut s'appuyer sur la capac