## [CamemBERTsum](https://huggingface.co/mrm8488/camembert2camembert_shared-finetuned-french-summarization) fine-tuned on [mlsum-fr](https://huggingface.co/datasets/viewer/?dataset=mlsum)

> MLSUM is the first large-scale MultiLingual SUMmarization dataset. Obtained from **online newspapers**, it contains **1.5M+ article/summary pairs** in five different languages -- namely, **French**, German, Spanish, Russian, Turkish.
> https://huggingface.co/mrm8488/camembert2camembert_shared-finetuned-french-summarization#dataset

>* Size of downloaded dataset files: 591.27 MB
>* Size of the generated dataset: 1537.36 MB
>* Total amount of disk used: 2128.63 MB
>* An example of 'validation' looks as follows.
>```json
>{
>    "date": "01/01/2001",
>    "summary": "A text",
>    "text": "This is a text",
>    "title": "A sample",
>    "topic": "football",
>    "url": "https://www.google.com"
>}
>```
>https://huggingface.co/datasets/mlsum#fr

In [1]:
from pathlib import Path

import torch
from transformers import RobertaTokenizerFast, EncoderDecoderModel

from book_loader import BookLoader

In [2]:
data_path = Path("data/D5627-Dolan.docx").expanduser().resolve()

start_marker = r"^Introduction$"
compiled_header_marker = (rf"(?:^Chapitre \d+ /.+"
                          rf"|{start_marker}"
                          rf"|^Stress, santé et performance au travail$)")
chapter_marker = r"^Chapitre (\d+) /$"
na_span_markers = (
        [r"^exerCiCe \d\.\d /$"],
        [chapter_marker,
         r"^Les caractéristiques personnelles\.",
         r"/\tLocus de contrôle$",
         r"^L'observation de sujets a amené Rotter",
         r"^Lorsqu'une personne souffre de stress"])

book = BookLoader(data_path,
                  {"start_marker": start_marker,
                   "end_marker": r"^Annexe /$",
                   "chapter_marker": chapter_marker,
                   "header_marker": compiled_header_marker,
                   "ps_marker": r"^Conclusion$",
                   "na_span_markers": na_span_markers})

chapters: list[list[str]] = book.chapters

  warn("Skipping unexpected tag: %s" % (current.tag),
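
The marker patterns above can be sanity-checked on their own before loading the book. The sketch below copies three of the patterns from the cell and applies them to a few hypothetical lines invented for illustration. How `BookLoader` actually applies the markers is not shown in this notebook, so the `classify` helper and its labels are assumptions, not the loader's logic.

```python
import re

# Patterns copied from the cell above.
start_marker = r"^Introduction$"
chapter_marker = r"^Chapitre (\d+) /$"
compiled_header_marker = (rf"(?:^Chapitre \d+ /.+"
                          rf"|{start_marker}"
                          rf"|^Stress, santé et performance au travail$)")

# Hypothetical paragraph texts, invented for illustration only.
samples = [
    "Chapitre 3 /",
    "Chapitre 3 / Le stress au travail",
    "Introduction",
    "Un paragraphe ordinaire.",
]


def classify(line: str) -> str:
    """Label a line 'chapter', 'header', or 'body' (hypothetical helper)."""
    if re.search(chapter_marker, line):
        return "chapter"
    if re.search(compiled_header_marker, line):
        return "header"
    return "body"


labels = {line: classify(line) for line in samples}
chapter_number = re.search(chapter_marker, "Chapitre 3 /").group(1)
```

Note that `chapter_marker` only matches a line that ends right after the slash, while the first alternative of `compiled_header_marker` requires at least one character after it, so the two patterns never match the same line.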


In [3]:
torch.cuda.is_available()
# TODO: Investigate why CUDA is unavailable. It may be related to the exe install task that requires Samuel.
#       Another option is to run this directly in WSL as a .py script.

False

In [4]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ckpt = 'mrm8488/camembert2camembert_shared-finetuned-french-summarization'
tokenizer = RobertaTokenizerFast.from_pretrained(ckpt)
model = EncoderDecoderModel.from_pretrained(ckpt).to(device)

The following encoder weights were not tied to the decoder ['roberta/pooler']


In [5]:
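def generate_summary(text):
    """Summarize `text` with the fine-tuned CamemBERT encoder-decoder model."""
    # Inputs longer than 512 tokens are truncated; shorter ones are padded to 512.
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    # Generate with the model's default generation settings.
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)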
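
Because `generate_summary` truncates its input to 512 tokens, most of a full chapter never reaches the model. One possible workaround is to split each chapter into smaller pieces and summarize them separately. The sketch below is only an illustration: `chunk_paragraphs` and its `max_words` parameter are hypothetical, and whitespace-separated words are used as a rough stand-in for tokenizer tokens, so the default of 300 is an arbitrary guess rather than a tuned value.

```python
def chunk_paragraphs(paragraphs, max_words=300):
    """Group consecutive paragraphs into chunks of at most `max_words` words.

    A single paragraph longer than `max_words` becomes its own chunk and
    would still be truncated by the tokenizer.
    """
    chunks, current, count = [], [], 0
    for paragraph in paragraphs:
        n_words = len(paragraph.split())
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(paragraph)
        count += n_words
    if current:
        chunks.append("\n".join(current))
    return chunks


example_chunks = chunk_paragraphs(["a b c", "d e", "f g h i"], max_words=5)
```

Each chunk could then be passed to `generate_summary` in turn, for example `[generate_summary(c) for c in chunk_paragraphs(chapter)]`, where `chapter` is one list of paragraph strings.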

In [6]:
# TODO: Patch out tables on the fly until this is handled in book_loader.py
# chapters = ['\n'.join(paragraph for paragraph in chapter if isinstance(paragraph, str))
#             for chapter in book.chapters]
# print(chapters[1][:5])

In [8]:
chapters = ['\n'.join(chapter) for chapter in book.chapters]

[['Introduction',
  "Nous croyons que cette nouvelle édition paraît à point nommé, car le monde organisationnel est soumis à des exigences de productivité historiques, exigences qui demeurent, faut-il l'admettre, complexes à conjuguer avec une promotion et un soutien de la santé psychologique au travail. Si certains cadres envisagent stratégiquement d'accroître la productivité qualitative et quantitative de leur personnel en les poussant à la limite de leurs capacités, trop souvent ces mêmes gestionnaires ont de la difficulté à entrevoir les effets humains collatéraux de cette quête productiviste. Ils ne voient pas (ou simplement, sous-estiment) l'incidence directe qu'a sur leur propre vie et leur santé les pressions psychologiques qu'ils subissent ainsi que celles qu'ils font conséquemment subir à leur équipe de travail. Il serait évidemment utopique d'espérer aujourd'hui rencontrer des contextes de travail exempts de toutes formes de stress. Le père du stress, Hans Selye, l'exprimait

In [7]:
summaries = {idx: generate_summary(chapter) for idx, chapter in enumerate(chapters)}

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
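
The tokenizer raises this error when an item in the batch is not a string. The likely cause here is that `chapters` still held lists of paragraphs (the `list[list[str]]` taken from `book.chapters`) when this cell ran, since it was executed as `In [7]`, before the join in `In [8]`. A minimal sketch of a guard, assuming chapters are either already-joined strings or lists that may contain non-string items such as tables; `join_chapter` and `as_texts` are hypothetical helpers, not part of `book_loader`.

```python
def join_chapter(chapter):
    """Join the string paragraphs of one chapter, skipping non-string items."""
    return "\n".join(p for p in chapter if isinstance(p, str))


def as_texts(chapters):
    """Return one string per chapter; chapters that are already strings pass through."""
    return [c if isinstance(c, str) else join_chapter(c) for c in chapters]


# Hypothetical input mixing the shapes described above.
texts = as_texts([["a", "b"], "c", ["d", None, "e"]])
```

Running the summarization loop over `as_texts(book.chapters)` would make it independent of the order in which the cells are executed.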

In [None]:
summaries[0]

In [None]:
for chap_idx, summ in summaries.items():
    print(f"Chapter {chap_idx}:\n{summ}\n")

# TODO: Investigate how these summaries compare to the references
# TODO: Investigate why it's one sentence per summary
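
For the first TODO, a quick way to get a rough number is a unigram-overlap F1 score between each generated summary and its reference. The sketch below is a simplified stand-in in the spirit of ROUGE-1, not a ROUGE implementation: it lowercases, splits on whitespace, and does no stemming. It also assumes that reference summaries for the chapters exist, which this notebook does not show.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two texts (lowercased, whitespace-tokenized)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared word at most min(count) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = unigram_f1("le chat", "le chat dort")
```

For reportable results, an established ROUGE package would be the better choice; this helper is only meant for a first look.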