# LLMs for summarization and translation
In both tasks the input and the target are sequences of tokens, so both can be framed as sequence-to-sequence problems.

## Summarization

Extractive summarization
> select and combine parts of the original text. Typically built on encoder models such as BERT, which score sentences for inclusion.

Abstractive summarization
> generate a new summary token by token (sequence to sequence). Typically uses encoder/decoder models such as T5.
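
To make the extractive idea concrete, here is a minimal sketch. It is not a BERT-based approach: it scores each sentence by the average frequency of its words in the whole text and keeps the top-scoring sentences in their original order. The function name `extractive_summary` and the frequency heuristic are assumptions chosen for illustration.

```python
import re
from collections import Counter


def extractive_summary(text: str, n_sentences: int = 2) -> str:
    """Pick the n highest-scoring sentences, kept in original order."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text (lowercased)
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically
        return sum(freqs[w] for w in words) / max(len(words), 1)

    # Rank sentence indices by score, then restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Every sentence in the output is copied verbatim from the input, which is the defining property of an extractive summary.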

In [1]:
from datasets import load_dataset

dataset = load_dataset("ILSUM/ILSUM-1.0", "English")
print(f"Features: {dataset['train'].column_names}")

Features: ['id', 'Article', 'Heading', 'Summary']





In [2]:
example = dataset['train'][21]
example['Article']

'This is how an Apple Watch saved a man\'s life after detecting accidentIt all started when Gabe Burdett was waiting for his father Bob at their pre-designated location for some mountain biking at the Riverside State Park when he received a text alert from his dad\'s Apple Watch, saying it had detected a "hard fall".Burdett, from city of Spokane in Washington State later received another update from the Watch, saying his father had reached Sacred Heart Medical Center."We drove straight there but he was gone when we arrived. I get another update from the Watch saying his location has changed with a map location of SHMC. Dad flipped his bike at the bottom of Doomsday, hit his head and was knocked out until sometime during the ambulance ride," Burdett wrote in a Facebook post.The Watch notified 911 with the location and within 30 minutes, emergency medical services (EMS) took the injured Bob to the hospital."If you own an Apple Watch, set up your hard fall detection, it\'s not just for wh

Attributes of interest:
* Article: full text, used as the model input
* Summary: reference summary, used as ground truth when evaluating predictions
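
Since `Summary` serves as ground truth, a predicted summary can be compared against it. One common family of metrics for this is ROUGE. The sketch below computes a simplified unigram-overlap (ROUGE-1 style) precision, recall, and F1 in plain Python. The function name `rouge1` and the whitespace tokenization are assumptions made for illustration; established packages apply further normalization that is omitted here.

```python
from collections import Counter


def rouge1(prediction: str, reference: str) -> dict:
    """Simplified ROUGE-1: unigram overlap between prediction and reference."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Once a summary has been generated for `example`, a call such as `rouge1(summary, example['Summary'])` would score it against the reference.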

Load a pre-trained model and summarize the example article

In [3]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"  # encoder/decoder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

input_ids = tokenizer.encode(
    'summarize: ' + example['Article'], # specify task by adding a prefix
    return_tensors='pt', max_length=512, truncation=True
)

summary_ids = model.generate(input_ids, max_length=150)
summary = tokenizer.decode(
    summary_ids[0], skip_special_tokens=True)

print(summary)



a man was waiting for his father when he received a text alert from his dad's apple watch. the watch notified 911 with the location and within 30 minutes, emergency medical services took the injured Bob to the hospital. the watch notified 911 with the location and within 30 minutes, emergency medical services took the injured Bob to the hospital.
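
The generated summary above repeats its last sentence. `generate` accepts decoding options such as `num_beams` and `no_repeat_ngram_size` that are commonly used to discourage this kind of repetition; their effect on this example has not been checked here. As a model-independent alternative, the sketch below removes repeated sentences after decoding. The helper name `drop_repeated_sentences` is hypothetical.

```python
import re


def drop_repeated_sentences(text: str) -> str:
    """Keep only the first occurrence of each sentence, preserving order."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    seen = set()
    kept = []
    for sentence in sentences:
        key = sentence.lower()  # compare case-insensitively
        if key not in seen:
            seen.add(key)
            kept.append(sentence)
    return " ".join(kept)
```

This only catches exact sentence-level duplicates; near-duplicates with small wording changes pass through unchanged.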


## Another example of summarization

In [10]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# opinosis is loaded through a dataset script; this version of `datasets`
# asks for explicit opt-in via trust_remote_code=True
dataset = load_dataset("opinosis", trust_remote_code=True)
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


In [12]:
print(f"Number of instances: {len(dataset['train'])}")
# Show the feature names in the training split of the dataset
print(f"Feature names: {dataset['train'].column_names}")

Number of instances: 51
Feature names: ['review_sents', 'summaries']


In [15]:
# Encode the input example, obtain the summary, and decode it
example = dataset['train'][-2]['review_sents']
input_ids = tokenizer.encode("summarize: " + example, return_tensors='pt', max_length=512, truncation=True)
summary_ids = model.generate(input_ids, max_length=150)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("\nOriginal text\n", example[:400])
print("\nSummary\n", summary)


Original text
 I bought the 8, gig Ipod Nano that has the built, in video camera .
  Itunes has an on, line store, where you may purchase and download music and videos which will install onto the ipod .
I have lots of music cd's and dvd's, so currently I'm just interested in storing some of my music and videos on the ipod so I can enjoy them on my vacation, and while at work .
There's a right way and wrong wa

Summary
 I bought the 8, gig Ipod Nano that has the built, in video camera. Itunes has an on, line store, where you may purchase and download music and videos which will install onto the ipod.
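
With `max_length=512, truncation=True`, everything beyond the first 512 tokens of the reviews is discarded before the model sees it, which may be one reason the summary above covers only the opening sentences. One possible workaround is to split the text into chunks, summarize each chunk, and join the results. The sketch below shows only the splitting step, using a word budget as a rough stand-in for the token limit. `chunk_lines` and `max_words` are illustrative names, not part of any library.

```python
def chunk_lines(text: str, max_words: int = 300) -> list:
    """Group consecutive lines into chunks of at most max_words words.

    A single line longer than max_words becomes its own chunk.
    """
    chunks, current, count = [], [], 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        n_words = len(line.split())
        # Start a new chunk when adding this line would exceed the budget
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(line)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk could then go through the same `tokenizer.encode`, `model.generate`, and `tokenizer.decode` steps as above, with the partial summaries concatenated afterwards.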


## Translation

In [9]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Helsinki-NLP/opus-mt-fr-en"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# input_seq = "Je suis le poinçonneur des Lilas, le gars qu'on croise et qu'on ne regarde pas."
input_seq = """Je revois la ville en fête et en délire
Suffoquant sous le soleil et sous la joie
Et j'entends dans la musique les cris, les rires
Qui éclatent et rebondissent autour de moi"""
input_ids = tokenizer.encode(input_seq, return_tensors='pt')
translated_ids = model.generate(input_ids)
translated_text = tokenizer.decode(
    translated_ids[0], skip_special_tokens=True
)
translated_text

'I see the city in celebration and delirium Suffocating under the sun and joy And I hear in the music the screams, the laughter that burst and bounce around me'
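
The input above spans four lines, but the translation comes back as a single line. If the line structure matters, one option is to translate each line separately. The sketch below takes the translation step as a callable so that it does not depend on a particular model. `translate_lines` and `translate_fn` are hypothetical names.

```python
from typing import Callable


def translate_lines(text: str, translate_fn: Callable[[str], str]) -> str:
    """Translate each non-empty line separately and keep the line breaks."""
    translated = []
    for line in text.splitlines():
        # Leave blank lines untouched so stanza breaks survive
        translated.append(translate_fn(line) if line.strip() else line)
    return "\n".join(translated)
```

With the model loaded above, `translate_fn` could wrap the same `encode`, `generate`, and `decode` calls for a single line. Translating lines in isolation removes cross-line context, so the result may differ from translating the whole passage at once.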