Team members:

Alex Echeverria <br>
Heloisy Rodrigues <br>
Luiz Fernando Vidal <br>
Luiz Guilherme Corrêa <br>

## Setup

### Installation

In [None]:
!pip install transformers==4.20.0
!pip install keras_nlp==0.3.0
!pip install datasets
!pip install huggingface-hub
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.20.0
  Downloading transformers-4.20.0-py3-none-any.whl (4.4 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.0 tokenizers-0.12.1 transformers-4.20.0
Looking in indexes: https://pypi.org/simple, ht

### Imports

In [None]:
import os
import logging

import nltk
import numpy as np
import tensorflow as tf
from tensorflow import keras

tf.get_logger().setLevel(logging.ERROR)

os.environ["TOKENIZERS_PARALLELISM"] = "false"

### Fine-tuning configuration

In [None]:
TRAIN_TEST_SPLIT = 0.1

MAX_INPUT_LENGTH = 1024
MIN_TARGET_LENGTH = 5
MAX_TARGET_LENGTH = 512
BATCH_SIZE = 8
LEARNING_RATE = 1e-4
MAX_EPOCHS = 5


MODEL_CHECKPOINT = "t5-small"

## Downloading the dataset

This notebook uses the [Extreme Summarization (XSum)](https://arxiv.org/abs/1808.08745) dataset, which contains 226,711 BBC articles, each paired with its summary.

The Hugging Face `load_dataset` function is used to load it. Only the `train` split (204,045 examples) is loaded here.

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("xsum", split="train")

Downloading and preparing dataset xsum/default to /root/.cache/huggingface/datasets/xsum/default/1.2.0/082863bf4754ee058a5b6f6525d0cb2b18eadb62c7b370b095d1364050a52b71...


Generating train split:   0%|          | 0/204045 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11332 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11334 [00:00<?, ? examples/s]

Dataset xsum downloaded and prepared to /root/.cache/huggingface/datasets/xsum/default/1.2.0/082863bf4754ee058a5b6f6525d0cb2b18eadb62c7b370b095d1364050a52b71. Subsequent calls will reuse this data.


Features in the dataset:

- **document**: the BBC article itself
- **summary**: the summary of the article
- **id**: the ID of the article-summary pair

In [None]:
print(raw_datasets)

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 204045
})


In [None]:
# Print one example
print(f"Article: {raw_datasets[0]['document']}\nSummary: {raw_datasets[0]['summary']}")

Article: The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.
Repair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.
Trains on the west coast mainline face disruption due to damage at the Lamington Viaduct.
Many businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.
First Minister Nicola Sturgeon visited the area to inspect the damage.
The waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.
Jeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.
However, she said more preventative work could have been carried out to ensure the retaining wall did not fail.
"It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate that - but i

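Splitting the data: `train_test_split` samples 10% of the examples for training and a separate 10% for testing (both fractions come from `TRAIN_TEST_SPLIT`), so the remaining 80% is left unused.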

In [None]:
raw_datasets = raw_datasets.train_test_split(
    train_size=TRAIN_TEST_SPLIT, test_size=TRAIN_TEST_SPLIT
)
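
To make the effect of passing the same fraction as both `train_size` and `test_size` concrete, here is a minimal sketch in plain Python. It is not the `datasets` implementation: `sample_split` is a hypothetical helper, and rounding the sizes with `int()` is an assumption, so the real counts may differ by one example.

```python
import random


def sample_split(n_examples, train_frac, test_frac, seed=0):
    """Shuffle the indices, then take two disjoint subsets by fraction."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_examples * train_frac)  # rounding rule is an assumption
    n_test = int(n_examples * test_frac)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    return train, test


# 204,045 is the size of the XSum train split loaded above.
train_idx, test_idx = sample_split(204_045, 0.1, 0.1)
unused = 204_045 - len(train_idx) - len(test_idx)
print(len(train_idx), len(test_idx), unused)
```

With fractions of 0.1 and 0.1, roughly four fifths of the examples end up in neither subset, which keeps fine-tuning fast at the cost of using less data.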

## Data preprocessing


Because T5 is the model being fine-tuned, the tokenizer must also be the one used by T5. It splits the text into tokens from the T5 vocabulary and converts each token to its ID.


In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


Because T5 can perform many tasks in a text-to-text format, a prefix is defined here and prepended to the input text to tell T5 that it should summarize that text.

In [None]:
prefix = "summarize: "

The function below prepends the prefix to the input texts and tokenizes them. It also tokenizes the target variable, i.e. the summaries.

In [None]:
def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=MAX_TARGET_LENGTH, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs
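
The shape of what `preprocess_function` returns can be illustrated without downloading a tokenizer. The sketch below swaps in a hypothetical whitespace tokenizer (`toy_tokenize`, with a vocabulary that grows as it sees new words); it is only a stand-in for the T5 tokenizer, and the tiny length limits are chosen so that truncation is visible.

```python
PREFIX = "summarize: "

# Toy vocabulary: 0 is padding and 1 is end-of-sequence (an assumption of this sketch).
vocab = {"<pad>": 0, "</s>": 1}


def toy_tokenize(texts, max_length):
    """Whitespace tokenizer that truncates and appends an end-of-sequence ID."""
    input_ids, attention_mask = [], []
    for text in texts:
        # setdefault assigns the next free ID to words not seen before.
        ids = [vocab.setdefault(tok, len(vocab)) for tok in text.split()]
        ids = ids[: max_length - 1] + [vocab["</s>"]]
        input_ids.append(ids)
        attention_mask.append([1] * len(ids))
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def toy_preprocess(examples, max_input_length=8, max_target_length=4):
    """Mirror the structure of preprocess_function with the toy tokenizer."""
    inputs = [PREFIX + doc for doc in examples["document"]]
    model_inputs = toy_tokenize(inputs, max_input_length)
    labels = toy_tokenize(examples["summary"], max_target_length)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


batch = {
    "document": [
        "heavy rain flooded the town centre on Monday night after the river rose",
        "the match ended in a draw",
    ],
    "summary": ["Flooding hit the town centre after heavy rain .", "Match drawn ."],
}
out = toy_preprocess(batch)
for key, value in out.items():
    print(key, value)
```

Each example ends up with three variable-length lists (`input_ids`, `attention_mask`, `labels`); padding to a common length is left to the data collator at batching time.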


Next, the loaded dataset is transformed with the function defined above.

In [None]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

  0%|          | 0/21 [00:00<?, ?ba/s]

  0%|          | 0/21 [00:00<?, ?ba/s]

## Creating the model

Because this is a text task that converts one sequence into another, the model to be fine-tuned is loaded with the `TFAutoModelForSeq2SeqLM` class.

In [None]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(MODEL_CHECKPOINT)

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


The data collator below assembles the examples into batches and pads them where needed.

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
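
As a rough picture of the padding step, the sketch below pads a list of tokenized examples to the longest sequence in the batch using NumPy. `pad_batch` is a hypothetical helper, not the library code; it assumes a pad token ID of 0 for inputs and -100 for labels, and it omits the decoder inputs that the real collator can also prepare when it is given the model.

```python
import numpy as np


def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad inputs and labels to the longest sequence in this batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_in - len(f["input_ids"])
        n_lab = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_in)
        # The attention mask is 1 for real tokens and 0 for padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_in)
        # Labels are padded with -100 so the loss can skip those positions.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_lab)
    return {key: np.array(value) for key, value in batch.items()}


features = [
    {"input_ids": [21, 22, 23, 24, 1], "labels": [31, 32, 1]},
    {"input_ids": [25, 26, 1], "labels": [33, 1]},
]
padded = pad_batch(features)
print(padded["input_ids"])
print(padded["attention_mask"])
print(padded["labels"])
```

Padding per batch rather than to `MAX_INPUT_LENGTH` keeps batches of short articles small, which is the main reason to pad in the collator instead of in the tokenizer call.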

Below, the data is split into training, test, and generation sets. The generation set, 200 examples sampled from the test set, is used to compute the ROUGE metrics during training.

The data is also divided into batches.

In [None]:
train_dataset = tokenized_datasets["train"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    collate_fn=data_collator,
)
test_dataset = tokenized_datasets["test"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)
generation_dataset = (
    tokenized_datasets["test"]
    .shuffle()
    .select(list(range(200)))
    .to_tf_dataset(
        batch_size=BATCH_SIZE,
        columns=["input_ids", "attention_mask", "labels"],
        shuffle=False,
        collate_fn=data_collator,
    )
)

Compiling the model:

In [None]:
optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
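
The internal loss mentioned in this message is, for sequence-to-sequence models, a token-level cross-entropy that skips label positions set to -100. The NumPy sketch below illustrates the idea only; `masked_cross_entropy` is a hypothetical function, and the library's exact reduction may differ.

```python
import numpy as np


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over the positions whose label is not ignore_index."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    mask = labels != ignore_index
    # Replace ignored labels with a valid index; their loss is zeroed by the mask.
    safe_labels = np.where(mask, labels, 0)
    token_loss = -np.take_along_axis(log_probs, safe_labels[..., None], axis=-1)[..., 0]
    return (token_loss * mask).sum() / mask.sum()


# One sequence of three positions over a vocabulary of four tokens;
# the last position is padding and must not affect the loss.
logits = np.zeros((1, 3, 4))
labels = np.array([[2, 1, -100]])
loss = masked_cross_entropy(logits, labels)
print(loss)
```

With uniform logits over four tokens, every real position contributes ln 4, and changing the logits at the padded position leaves the result unchanged.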


## Training and evaluating the model

The model is evaluated with the ROUGE-1, ROUGE-2, and ROUGE-L metrics.

In [None]:
!pip install rouge

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [None]:
from rouge import Rouge

Function that computes the metrics during training:

In [None]:
import keras_nlp
#import rouge_score

#rouge_l = keras_nlp.metrics.RougeL()
rouge = Rouge()


def metric_fn(eval_predictions):
    predictions, labels = eval_predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Padded label positions hold negative IDs, which cannot be decoded;
    # replace them with the pad token ID before decoding.
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    #result = rouge_l(decoded_labels, decoded_predictions)
    # get_scores takes (hypotheses, references); avg=True averages over all pairs.
    result = rouge.get_scores(decoded_predictions, decoded_labels, avg=True)

    # Keep only the F1 score of each ROUGE variant.
    result = {"Rouge-1": result["rouge-1"]["f"], "Rouge-2": result["rouge-2"]["f"], "RougeL": result["rouge-l"]["f"]}

    return result
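
For intuition about what these scores measure, here is a simplified sketch of the F1 variants of ROUGE-N and ROUGE-L on whitespace tokens. The function names are hypothetical, and the `rouge` package may tokenize and count differently, so its numbers will not necessarily match this sketch.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, n_hyp, n_ref):
    """Harmonic mean of precision (overlap/n_hyp) and recall (overlap/n_ref)."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / n_hyp, overlap / n_ref
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(hypothesis, reference, n=1):
    """ROUGE-N: overlap of n-grams between hypothesis and reference."""
    hyp, ref = ngrams(hypothesis.split(), n), ngrams(reference.split(), n)
    overlap = sum((hyp & ref).values())  # clipped n-gram matches
    return f1(overlap, sum(hyp.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(hypothesis, reference):
    """ROUGE-L: based on the longest common subsequence of tokens."""
    hyp, ref = hypothesis.split(), reference.split()
    return f1(lcs_length(hyp, ref), len(hyp), len(ref))


hyp = "a man was shot in west bromwich"
ref = "a man has died after being shot in west bromwich"
print(rouge_n_f1(hyp, ref, 1), rouge_n_f1(hyp, ref, 2), rouge_l_f1(hyp, ref))
```

ROUGE-1 and ROUGE-2 reward shared words and word pairs, while ROUGE-L rewards words that appear in the same order even when they are not adjacent.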


Starting training:

In [None]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(
    metric_fn, eval_dataset=generation_dataset, predict_with_generate=True
)

callbacks = [metric_callback]


model.fit(
    train_dataset, validation_data=test_dataset, epochs=MAX_EPOCHS, callbacks=callbacks
)



Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f5c76285a30>

Saving the trained model:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
model.save_weights('/content/drive/MyDrive/summary_model_weights.h5')

## Inference

In [None]:
model.load_weights('/content/drive/MyDrive/summary_model_weights.h5')

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework="tf")

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

def calculate_tfidf(text, texts):
    """Return the 5 terms of `text` with the highest TF-IDF scores, formatted as hashtags."""
    # Fit the vocabulary and the IDF weights on the whole corpus.
    tfidf = TfidfVectorizer()
    tfidf.fit(texts)
    feature_names = tfidf.get_feature_names_out()
    # Score the single document against the fitted vocabulary.
    tfidf_values = tfidf.transform([text]).toarray()[0]
    tfidf_df = pd.DataFrame({'feature_names': feature_names, 'tfidf_values': tfidf_values})
    hashtags = tfidf_df.nlargest(5, 'tfidf_values').feature_names.to_numpy()
    result = [f"#{h}" for h in hashtags]
    return result
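
Common words such as "the" can reach the top of this ranking, because the smoothed IDF used by default in scikit-learn, ln((1 + n) / (1 + df)) + 1, never drops below 1 and a word repeated many times in one document still accumulates a high score. The plain-Python sketch below reproduces that effect on a toy corpus and shows how a stop-word set changes the result. `top_tfidf_terms` and the corpus are hypothetical, and the sketch skips the L2 normalization, which does not change the ranking within a single document.

```python
import math
from collections import Counter


def top_tfidf_terms(text, corpus, k=3, stop_words=frozenset()):
    """Rank the terms of `text` by term frequency times smoothed IDF."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    tf = Counter(
        term for term in text.lower().split()
        if term in df and term not in stop_words
    )
    scores = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [f"#{term}" for term, _ in ranked[:k]]


corpus = [
    "the police closed the street after the shooting",
    "the county signed the batsman",
    "the river flooded the town",
]
print(top_tfidf_terms(corpus[0], corpus))
print(top_tfidf_terms(corpus[0], corpus, stop_words={"the", "after"}))
```

In the notebook's own function, one way to get a similar effect would be to construct the vectorizer as `TfidfVectorizer(stop_words="english")`, at the cost of changing the hashtags shown in the outputs below.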


In [None]:
def generate_summary_news(index):
    document = raw_datasets["test"][index]["document"]
    summary = summarizer(
        document,
        min_length=MIN_TARGET_LENGTH,
        max_length=128,
    )[0]["summary_text"]

    # Hashtags are the top TF-IDF terms of the article relative to the test set.
    hashtags = calculate_tfidf(document, raw_datasets["test"]["document"])
    print(f"Text:\n{document}\n")
    print(f"Summary:\n{summary}\n")
    for item in hashtags:
        print(item, end=" ")


In [None]:
generate_summary_news(17)

Text:
The man, who was a passenger in the car, was shot in the head in West Bromwich shortly after 14:30 GMT, West Midlands Police said.
The victim, who was in his 30s, died at the scene in Dartmouth Street and police have opened a murder investigation.
Police said the junction of High Street and Dartmouth Street was cordoned off while forensic inquiries took place.
A post-mortem examination is due to take place.
Det Insp Martin Slevin said: "The investigation is at an early stage, my officers are currently carrying out inquiries at the scene and house to house and CCTV. There will also be extra reassurance patrols in the local area."
He appealed for witnesses to come forward.

Summary:
A man has died after being shot in the head in a car in West Bromwich.

#dartmouth #street #inquiries #slevin #the 

In [None]:
generate_summary_news(11)

Text:
Brisbane-born Selman, 20, scored 62 and 78 in two games for the county's second XI in August.
He has represented Queensland at Under-19 level, but has dual Australian and United Kingdom citizenship.
Selman spent most of the 2015 season with Kent's second XI and also turned out once for Gloucestershire seconds.
He has been playing for Tunbridge Wells in the Kent Premier League for the past two seasons but does not have first-class experience.
Selman was also a talented Australian Rules footballer at state age-group level.
He could challenge Will Bragg and James Kettleborough for places alongside captain Jacques Rudolph in the top three of the Glamorgan batting order.

Summary:
Glamorgan have signed former Gloucestershire batsman James Selman on a two-year contract.

#selman #xi #kent #australian #kettleborough 

In [None]:
print(raw_datasets['test'][17]['document'])

The man, who was a passenger in the car, was shot in the head in West Bromwich shortly after 14:30 GMT, West Midlands Police said.
The victim, who was in his 30s, died at the scene in Dartmouth Street and police have opened a murder investigation.
Police said the junction of High Street and Dartmouth Street was cordoned off while forensic inquiries took place.
A post-mortem examination is due to take place.
Det Insp Martin Slevin said: "The investigation is at an early stage, my officers are currently carrying out inquiries at the scene and house to house and CCTV. There will also be extra reassurance patrols in the local area."
He appealed for witnesses to come forward.
