In [1]:
import torch
import pandas as pd
import numpy as np
from transformers import T5Tokenizer, T5ForConditionalGeneration
from bert_score import score

train_df = pd.read_csv("dataset/train_processed.csv")

In [2]:
tokenizer_t5 = T5Tokenizer.from_pretrained("google-t5/t5-large")

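# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_t5 = T5ForConditionalGeneration.from_pretrained("google-t5/t5-large")
model_t5.to(device)

# Raise the tokenizer limit above its 512-token default so long inputs are not truncated
tokenizer_t5.model_max_length = 4096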

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [3]:
def ReadingWithJumpingWindow(model, tokenizer, text):
    max_len = tokenizer.model_max_length

    def summarize(passage):
        # One summarization pass; the input is truncated to max_len tokens.
        inputs = tokenizer("summarize: " + passage,
                           return_tensors="pt",
                           max_length=max_len,
                           truncation=True).to(device)
        outputs = model.generate(**inputs, min_length=0, max_new_tokens=max_len,
                                 num_beams=4, early_stopping=True)
        return tokenizer.decode(outputs[0], skip_special_tokens=True)

    torch.cuda.empty_cache()
    words = text.split()

    # Short texts (length measured in whitespace-separated words) fit in one pass.
    if len(words) < max_len:
        return summarize(text)

    JUMP = 100           # words skipped in the middle of each window
    half = max_len // 2  # each window reads two runs of up to `half` words
    chunks = []
    start = 0

    while start < len(words):
        end = min(start + max_len + JUMP, len(words))
        # Read `half` words, jump over JUMP words, then read up to `end`.
        chunks.append(' '.join(words[start:start + half])
                      + " " + ' '.join(words[start + half + JUMP:end]))
        start += half

    print(len(chunks))

    summarized_chunks = []
    for chunk in chunks:
        summarized_chunks.append(summarize(chunk))
        torch.cuda.empty_cache()

    summarized_text = ' '.join(summarized_chunks)

    # Recurse while the joined chunk summaries are still too long for one pass.
    if len(summarized_text.split()) > max_len:
        return ReadingWithJumpingWindow(model, tokenizer, summarized_text)
    return summarize(summarized_text)
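
The chunking loop in ReadingWithJumpingWindow can be hard to picture, so here is a minimal sketch that reproduces only its index arithmetic, without loading any model. The helper name `jumping_window_spans` and its return format are hypothetical and exist only for illustration; the defaults (4096 and 100) mirror `tokenizer_t5.model_max_length` and `JUMP` from the notebook.

```python
def jumping_window_spans(n_words, max_len=4096, jump=100):
    """Hypothetical helper: list the word-index ranges each chunk reads.

    Every chunk is described by two (start, stop) ranges: the run read
    before the jump and the run read after it.
    """
    half = max_len // 2
    spans = []
    start = 0
    while start < n_words:
        end = min(start + max_len + jump, n_words)
        # First run: up to `half` words starting at `start`.
        first = (start, min(start + half, n_words))
        # Second run: resumes `jump` words later and stops at `end`.
        second = (min(start + half + jump, end), end)
        spans.append((first, second))
        # Windows advance by half a window, so consecutive chunks overlap.
        start += half
    return spans


# Document lengths taken from the long-text test in this notebook.
for n in (8443, 19979, 5394):
    spans = jumping_window_spans(n)
    print(n, len(spans), spans[0])
```

Under these assumptions the chunk counts can be compared with the numbers printed by `print(len(chunks))` for the long documents. Because the window advances by half of `max_len` while each chunk spans a full `max_len` plus the jump, most words are read by more than one chunk.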

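The following test assesses the performance of the model on texts that it can summarize in a single pass, without splitting them into chunks (fewer than 4096 reference tokens). Only the first three matching rows are evaluated.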

In [4]:
F1_t5_less_4096 = []

for index, row in train_df[train_df['reference_tokens'] < 4096][:3].iterrows():
    reference_text = row["reference"]
    reference_summary = row["summary"]
    print(row["reference_tokens"])

    result_summary = ReadingWithJumpingWindow(model_t5, tokenizer_t5, reference_text)
    print(result_summary)
    P, R, F1 = score([result_summary], [reference_summary], lang='en', verbose=False)
    print(f"T5 BertScore F1: {F1.item():.2f}")
    F1_t5_less_4096.append(F1.item())
    torch.cuda.empty_cache()

np.save('F1_t5_less_4096.npy', F1_t5_less_4096)

# Mean F1 over the evaluated rows (avoids shadowing the built-in sum)
print(np.mean(F1_t5_less_4096))

654
equipment using ultra-wideband technology is defined as 'equipment incorporating a technology for short-range radiocommunication' decision is to harmonise the technical condition for the availability and efficient use of radio spectrum in the union.


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


T5 BertScore F1: 0.80
3116
this agreement shall apply to the civil aviation regulatory system of the people's republic of china and the civil aviation regulatory system of the european union. each party shall ensure that the other party is kept informed of all it relevant law, regulation, standard and procedure.


T5 BertScore F1: 0.82
92
this regulation shall enter into force on the day following that of it publication in the official journal of the european union. it shall apply not later than 30 month after it entry into force.


T5 BertScore F1: 0.77
0.7944852709770203


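The following test assesses the performance of the model on texts that it must split into chunks before summarizing (more than 4096 reference tokens). Only the first three matching rows are evaluated; the second number printed for each document is the chunk count.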

In [5]:
F1_t5_more_4096 = []

for index, row in train_df[train_df['reference_tokens'] > 4096][:3].iterrows():
    reference_text = row["reference"]
    reference_summary = row["summary"]
    print(row["reference_tokens"])

    result_summary = ReadingWithJumpingWindow(model_t5, tokenizer_t5, reference_text)
    print(result_summary)
    P, R, F1 = score([result_summary], [reference_summary], lang='en', verbose=False)
    print(f"T5 BertScore F1: {F1.item():.2f}")
    F1_t5_more_4096.append(F1.item())
    torch.cuda.empty_cache()

np.save('F1_t5_more_4096.npy', F1_t5_more_4096)

# Mean F1 over the evaluated rows (avoids shadowing the built-in sum)
print(np.mean(F1_t5_more_4096))

8443
5
eu has adopted a set of rules for the preparation of prospectuses. they include minimum information to be included in the registration document. additional information with respect to an entity other than the issuer shall be included in the security note.


T5 BertScore F1: 0.80
19979
10
national administrator shall open an operator holding account in the union registry. account holder may request the removal of an authorised representative. national administrator may refuse to approve an authorised representative if the information and document provided are incomplete, out-of-date or false.


T5 BertScore F1: 0.80
5394
3
up to 40 % of the total amount of the innovation fund support to a specific project shall be disbursed upon reaching the pre-determined milestone. the amount paid or to be paid to the project proponent in accordance with article 5 of the financial regulation shall be proportionately recovered or reduced.


T5 BertScore F1: 0.79
0.7953208883603414
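
To put the two mean scores side by side, the sketch below compares the gap between the groups with the spread inside each group. It uses only the standard library and the per-document F1 values as printed above, which are rounded to two decimals, so its means differ slightly from the exact ones reported by the notebook. The variable names are illustrative.

```python
from statistics import mean, stdev

# Per-document BERTScore F1 values as printed above (rounded to two decimals).
f1_short = [0.80, 0.82, 0.77]  # texts summarized in a single pass
f1_long = [0.80, 0.80, 0.79]   # texts split into chunks first

# Exact means reported by the notebook.
mean_short = 0.7944852709770203
mean_long = 0.7953208883603414

gap = abs(mean_long - mean_short)
print(f"gap between reported means: {gap:.4f}")
print(f"short texts: mean {mean(f1_short):.4f}, stdev {stdev(f1_short):.4f}")
print(f"long texts:  mean {mean(f1_long):.4f}, stdev {stdev(f1_long):.4f}")
```

With three documents per group, the gap between the means is smaller than the sample standard deviation of either group, so these runs alone do not show a difference between single-pass and chunked summarization.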
