<a href="https://colab.research.google.com/github/adalves-ufabc/2022.Q2-PLN/blob/main/2022_Q2_PLN_Notebook_31.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Natural Language Processing [2022.Q2]**
Prof. Alexandre Donizeti Alves

### **Abstractive Text Summarization**

### **Using Transformers [Huggingface]**

In summarization, it is hard to answer the question of whether a summary is good. One of the most important questions to ask is: is the summary informative enough? Producing the summaries themselves, however, is not that hard with `Huggingface Transformers`.
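One common way to make "is the summary good?" measurable is to compare a generated summary with a human-written reference using word overlap, as ROUGE-1 does. The sketch below is a simplified, illustrative version (plain word tokenization, no stemming); the function name `rouge1` and the example strings are assumptions made for illustration and are not part of the notebook.

```python
import re
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Simplified ROUGE-1: unigram overlap between a candidate and a reference summary."""
    # Lowercase and split into word tokens (an illustrative tokenizer, not the official one).
    cand_counts = Counter(re.findall(r"\w+", candidate.lower()))
    ref_counts = Counter(re.findall(r"\w+", reference.lower()))

    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the tower is tall", "the tower is very tall")
print(scores)
```

Recall tells how much of the reference is covered by the summary, while precision tells how much of the summary is supported by the reference. Neither says anything about fluency or factual correctness, so overlap scores are only a partial answer to the question above.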

In [2]:
!pip install transformers


In [3]:
text = '''The tower is 324 meters (1,063 ft) tall, about the same height 
as an 81-storey building, and the tallest structure in Paris. Its base is square, 
measuring 125 meters (410 ft) on each side. During its construction, the Eiffel 
Tower surpassed the Washington Monument to become the tallest man-made structure 
in the world, a title it held for 41 years until the Chrysler Building in New York
City was finished in 1930. It was the first structure to reach a height of 300 meters. 
Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is 
now taller than the Chrysler Building by 5.2 meters (17 ft). Excluding transmitters, 
the Eiffel Tower is the second tallest free-standing structure in France
after the Millau Viaduct.'''

In [4]:
text

'The tower is 324 meters (1,063 ft) tall, about the same height \nas an 81-storey building, and the tallest structure in Paris. Its base is square, \nmeasuring 125 meters (410 ft) on each side. During its construction, the Eiffel \nTower surpassed the Washington Monument to become the tallest man-made structure \nin the world, a title it held for 41 years until the Chrysler Building in New York\nCity was finished in 1930. It was the first structure to reach a height of 300 meters. \nDue to the addition of a broadcasting aerial at the top of the tower in 1957, it is \nnow taller than the Chrysler Building by 5.2 meters (17 ft). Excluding transmitters, \nthe Eiffel Tower is the second tallest free-standing structure in France\nafter the Millau Viaduct.'

In [9]:
from transformers import pipeline

# using pipeline API for summarization task
summarization = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
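
The warning above suggests naming the model and revision explicitly. A minimal sketch, reusing the model name and revision hash printed in the log; `build_summarizer` is a hypothetical helper name, and the import is done lazily inside it so that defining the helper does not load the library.

```python
# Model name and revision copied from the log message above.
MODEL_NAME = "sshleifer/distilbart-cnn-12-6"
MODEL_REVISION = "a4f8f3e"


def build_summarizer():
    """Create a summarization pipeline pinned to a fixed model and revision."""
    # Imported here so that defining the helper does not require loading transformers.
    from transformers import pipeline

    return pipeline("summarization", model=MODEL_NAME, revision=MODEL_REVISION)
```

Calling `summarization = build_summarizer()` should then behave like the unpinned pipeline above, but it will keep loading the same weights even if the library's default model changes. The first call downloads the model, so it needs network access.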


In [10]:
summary_text = summarization(text)[0]['summary_text']
summary_text

' The tower is 324 meters (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris . During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world .'

The Huggingface website has a `Models` section where you can filter by the task you want to tackle; in our case, we choose Summarization.

The summarization task uses a standard encoder-decoder `transformer`, a neural network built around an attention mechanism. `Transformers` rely on 'attention', which captures the relationship between all the words that occur in a sentence. In this tutorial we use one example text and three models.
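
To make the idea of attention concrete, the sketch below implements scaled dot-product self-attention on a toy sequence in plain Python. It is an illustrative simplification: real transformers apply learned projections to obtain queries, keys and values and use several attention heads, whereas here the token vectors are used directly. The function names and the toy vectors are assumptions.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this token to every token in the sequence, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        output = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy 2-dimensional "word" vectors (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` describes how strongly one token attends to every token in the sequence, which is the "relationship between all the words" referred to above.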

We will use the following models:

     BART (default)
     Pegasus
     T5

**BART**

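BART is a standard `Transformer` encoder-decoder. It is pretrained as a denoising autoencoder: the input text is corrupted (for example by masking spans or shuffling sentences) and the model learns to reconstruct the original. (The pretraining task that resembles extractive summarization, in which important sentences are removed from an input document and generated together as one output sequence from the remaining sentences, is the one used by Pegasus.)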
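
The pretraining idea described above, in which important sentences are removed from the input and become the target output, is called gap-sentence generation in the Pegasus paper. The toy sketch below illustrates it under simplifying assumptions: importance is approximated by word overlap with the rest of the document, and the `<mask_1>` placeholder, the function names and the example sentences are illustrative choices, not the actual pretraining code.

```python
import re


def words(sentence):
    """Lowercased set of word tokens in a sentence."""
    return set(re.findall(r"\w+", sentence.lower()))


def gap_sentence_example(sentences, num_gaps=1, mask="<mask_1>"):
    """Mask the sentences that overlap most with the rest of the document.

    Returns (corrupted_input, target): a model would be trained to
    generate `target` given `corrupted_input`.
    """
    scored = []
    for i, sentence in enumerate(sentences):
        rest = set().union(*(words(s) for j, s in enumerate(sentences) if j != i))
        own = words(sentence)
        # Fraction of this sentence's words that also occur elsewhere in the document.
        score = len(own & rest) / len(own) if own else 0.0
        scored.append((score, i))

    # Pick the highest-scoring sentences (ties broken by earliest position).
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    chosen = {i for _, i in ranked[:num_gaps]}
    corrupted = " ".join(mask if i in chosen else s for i, s in enumerate(sentences))
    target = " ".join(s for i, s in enumerate(sentences) if i in chosen)
    return corrupted, target


document = [
    "The Eiffel Tower is in Paris.",
    "The tower is 324 meters tall.",
    "Many tourists visit Paris to see the tower.",
]
corrupted, target = gap_sentence_example(document)
print(corrupted)
print(target)
```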

In [6]:
from transformers import pipeline

summarizer = pipeline("summarization", model = "facebook/bart-large-cnn")
summarizer(text)

[{'summary_text': 'The tower is 324 meters (1,063 ft) tall, about the same height as an 81-storey building. Its base is square,  \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0measuring 125 meters (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world.'}]

**Pegasus**

In [5]:
from transformers import pipeline

summarizer = pipeline("summarization", model = "google/pegasus-xsum")
summarizer(text)

[{'summary_text': 'The Eiffel Tower is a free-standing structure in Paris, France.'}]

**T5**

In [12]:
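import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [13]:
# AutoModelForSeq2SeqLM replaces the deprecated AutoModelWithLMHead for encoder-decoder models such as T5
tokenizer = AutoTokenizer.from_pretrained('t5-base')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-base', return_dict=True)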

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [14]:
inputs = tokenizer.encode("summarize: " + text,
                          return_tensors='pt',
                          max_length=512,
                          truncation=True)

In [15]:
summary_ids = model.generate(inputs, max_length=150, min_length=80, length_penalty=5., num_beams=2)
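
As a rough illustration of how `length_penalty` interacts with beam search, the sketch below ranks two hypothetical candidate summaries. It assumes the commonly used normalization in which a beam's summed log-probability is divided by `length ** length_penalty`; the numbers are made up, and the library's exact scoring may differ between versions.

```python
def beam_score(sum_logprob: float, length: int, length_penalty: float) -> float:
    """Length-normalized beam score, higher is better (assumed formula)."""
    return sum_logprob / (length ** length_penalty)


# Two made-up hypotheses: (summed log-probability, number of tokens).
short_hyp = (-4.0, 5)
long_hyp = (-6.0, 10)

for penalty in (0.0, 1.0, 5.0):
    short_score = beam_score(*short_hyp, penalty)
    long_score = beam_score(*long_hyp, penalty)
    winner = "long" if long_score > short_score else "short"
    print(f"length_penalty={penalty}: short={short_score:.5f} long={long_score:.5f} -> {winner}")
```

Because log-probabilities are negative, dividing by a larger power of the length moves the score of long sequences closer to zero, so under this assumption a value such as the `5.` used above favors longer summaries.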

In [16]:
summary = tokenizer.decode(summary_ids[0])
summary

'<pad> the tower is 324 meters (1,063 ft) tall, about the same height as an 81-storey building. it is the second tallest free-standing structure in France after the millau Viaduct. the tower is the second tallest free-standing structure in france after the millau Viaduct. it was the first structure to reach a height of 300 meters.</s>'
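
The decoded string above still contains the `<pad>` and `</s>` special tokens. Passing `skip_special_tokens=True` to `tokenizer.decode` removes them at decoding time; the sketch below shows an equivalent plain-string cleanup, using a hypothetical helper `strip_special_tokens` and a shortened copy of the output above.

```python
def strip_special_tokens(decoded: str, special_tokens=("<pad>", "</s>")) -> str:
    """Remove the listed special tokens from a decoded string and trim whitespace."""
    for token in special_tokens:
        decoded = decoded.replace(token, "")
    return decoded.strip()


# Shortened copy of the decoded T5 output shown above.
raw = "<pad> the tower is 324 meters (1,063 ft) tall, about the same height as an 81-storey building.</s>"
print(strip_special_tokens(raw))

# With the tokenizer at hand, the tokens can be dropped directly while decoding:
# summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```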

**Further reading:**

> https://rubikscode.net/2022/04/25/text-summarization-with-huggingface-transformers/

> https://betterprogramming.pub/how-to-summarize-text-with-googles-t5-4dd1ae6238b6