In [1]:
import torch
import pandas as pd
import numpy as np
from transformers import T5Tokenizer, T5ForConditionalGeneration
from bert_score import score

train_df = pd.read_csv("dataset/train_processed.csv")

In [2]:
tokenizer_t5 = T5Tokenizer.from_pretrained("google-t5/t5-large")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # use the GPU when one is available
model_t5 = T5ForConditionalGeneration.from_pretrained("google-t5/t5-large")
model_t5.to(device)

# Raise the tokenizer limit (512 by default for T5) so long inputs are not truncated early.
tokenizer_t5.model_max_length = 4096

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [3]:
def _summarize(model, tokenizer, text):
    # Summarize one text in a single pass, truncating at the tokenizer limit.
    inputs = tokenizer("summarize: " + text,
                       return_tensors="pt",
                       max_length=tokenizer.model_max_length,
                       truncation=True).to(device)
    outputs = model.generate(**inputs, min_length=0, max_new_tokens=tokenizer.model_max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


def ReadingWithJumpingWindow(model, tokenizer, text):
    torch.cuda.empty_cache()
    max_len = tokenizer.model_max_length
    words = text.split()

    # Length is measured in whitespace-separated words, which only
    # approximates the token count seen by the model.
    if len(words) < max_len:
        return _summarize(model, tokenizer, text)

    JUMP = 100           # words skipped in the middle of each chunk
    half = max_len // 2  # integer step, so slice indices stay integers
    chunks = []
    start = 0

    # Each chunk is `half` words, a gap of JUMP words, then up to `half` more
    # words. The window start advances by `half` words per iteration.
    while start < len(words):
        end = min(start + max_len + JUMP, len(words))
        chunks.append(' '.join(words[start:start + half])
                      + " " + ' '.join(words[start + half + JUMP:end]))
        start += half

    print(len(chunks))

    summarized_chunks = []
    for chunk in chunks:
        summarized_chunks.append(_summarize(model, tokenizer, chunk))
        torch.cuda.empty_cache()

    summarized_text = ' '.join(summarized_chunks)

    # If the joined chunk summaries are still too long, repeat the procedure;
    # otherwise summarize them one last time.
    if len(summarized_text.split()) > max_len:
        return ReadingWithJumpingWindow(model, tokenizer, summarized_text)
    return _summarize(model, tokenizer, summarized_text)
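
ReadingWithJumpingWindow builds each chunk from two half-windows separated by a gap of JUMP words, and moves the window start forward by half a window each time. The sketch below is only an illustration of that indexing: the helper name `jumping_window_spans` and the toy sizes (20 words, a window of 8, a gap of 2) are assumptions chosen for readability and are not used anywhere in the notebook.

```python
def jumping_window_spans(n_words, window, jump):
    """Return, per chunk, the (start, stop) index ranges of its two halves.

    Mirrors the slicing used in ReadingWithJumpingWindow, with the ranges
    clamped to n_words the way Python slicing would clamp them.
    """
    half = window // 2
    spans = []
    start = 0
    while start < n_words:
        first = (start, min(start + half, n_words))
        second = (min(start + half + jump, n_words),
                  min(start + window + jump, n_words))
        spans.append((first, second))
        start += half
    return spans


# Toy sizes, for illustration only.
spans = jumping_window_spans(n_words=20, window=8, jump=2)
for first, second in spans:
    print(first, second)
# (0, 4) (6, 10)
# (4, 8) (10, 14)
# (8, 12) (14, 18)
# (12, 16) (18, 20)
# (16, 20) (20, 20)
```

The words skipped by one chunk (indices 4 and 5 in the first row) are covered by the first half of the next chunk, and the last chunk's second half can be empty.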

The following test assesses the model's performance on texts that are short enough to be summarized without splitting them into chunks.

In [4]:
F1_t5_less_4096 = []

for index, row in train_df[train_df['reference_tokens'] < 4096][:5].iterrows():
    reference_text = row["reference"]
    reference_summary = row["summary"]
    print(row["reference_tokens"])

    result_summary = ReadingWithJumpingWindow(model_t5, tokenizer_t5, reference_text)
    print(result_summary)
    P, R, F1 = score([result_summary], [reference_summary], lang='en', verbose=False)
    print(f"T5 BertScore F1: {F1.item():.2f}")
    F1_t5_less_4096.append(F1.item())
    torch.cuda.empty_cache()

np.save('F1_t5_less_4096.npy', F1_t5_less_4096)

# Mean F1 over the evaluated rows (avoids shadowing the built-in `sum`).
print(np.mean(F1_t5_less_4096))

654
equipment using ultra-wideband technology is defined as equipment. equipment using ultra-wideband technology must meet technical condition. equipment using ultra-wideband technology must be used indoors.


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


T5 BertScore F1: 0.79
3116
agreement shall apply to civil aviation regulatory system of the people's republic of china and the civil aviation regulatory system of the european union. parties agree to cooperate in all areas of civil aviation safety. agreement shall be binding on both parties and shall remain in force until terminated by either party.



T5 BertScore F1: 0.81
92
the technical specification for electronic ship reporting in inland navigation shall be a set out in the annex. regulation shall enter into force on the day following that of it publication in the official journal of the european union. it shall apply not later than 30 month after it entry into force.



T5 BertScore F1: 0.78
2577
eu regulation establishes requirement for the labelling of electric mains-operated refrigerating appliance with a direct sale function. it shall enter into force on the twentieth day following that of it publication in the official journal of the european union.



T5 BertScore F1: 0.81
2877
light source and separate control gear must comply with ecodesign requirement. they must be placed on the market in a containing product and be replaceable. light source and separate control gear must be available on a free-access website.



T5 BertScore F1: 0.80
0.7979308128356933


The following test assesses the model's performance on texts that are too long for a single pass and must be split into chunks before summarization.
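
Because the window start advances by half a window (2048 words when the limit is 4096), the number of chunks for a text should be roughly the word count divided by 2048, rounded up. The sketch below checks that arithmetic against the lengths printed in the output of the next cell. It assumes that `reference_tokens` is close to the whitespace word count that ReadingWithJumpingWindow actually uses, which the notebook does not confirm; `expected_chunk_count` is a hypothetical helper name.

```python
import math


def expected_chunk_count(n_words, window=4096):
    """Chunks produced when the window start advances by window // 2 words."""
    return math.ceil(n_words / (window // 2))


# Lengths taken from the `reference_tokens` values printed below
# (assumed here to approximate the word counts).
lengths = [8443, 19979, 5394, 7139, 10476]
counts = [expected_chunk_count(n) for n in lengths]
print(counts)  # [5, 10, 3, 4, 6]
```

Under that assumption, these values agree with the chunk counts the function prints for the five documents.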

In [5]:
F1_t5_more_4096 = []

for index, row in train_df[train_df['reference_tokens'] > 4096][:5].iterrows():
    reference_text = row["reference"]
    reference_summary = row["summary"]
    print(row["reference_tokens"])

    result_summary = ReadingWithJumpingWindow(model_t5, tokenizer_t5, reference_text)
    print(result_summary)
    P, R, F1 = score([result_summary], [reference_summary], lang='en', verbose=False)
    print(f"T5 BertScore F1: {F1.item():.2f}")
    F1_t5_more_4096.append(F1.item())
    torch.cuda.empty_cache()

np.save('F1_t5_more_4096.npy', F1_t5_more_4096)

# Mean F1 over the evaluated rows (avoids shadowing the built-in `sum`).
print(np.mean(F1_t5_more_4096))

8443
5
eu has adopted a set of rules to regulate the disclosure of information in prospectuses and security notes. the rules apply to all securities issued by a regulated market or by a third country. the information must be provided in the order in which they are required.



T5 BertScore F1: 0.80
19979
10
account governed by the law and fall under the jurisdiction of the member state of their administrator. account holder may object to the suspension of access to account if it is not resolved within a reasonable period. account holder may report any fraud or suspected fraud to the competent national law enforcement authority.



T5 BertScore F1: 0.78
5394
3
eu innovation fund is intended to support projects demonstrating high level of innovation. the fund is intended to be used to support projects reducing greenhouse gas emissions. the commission shall invite the proponent of those project to submit a full application.



T5 BertScore F1: 0.82
7139
4
uas operator shall declare to the competent authority that they are able to carry out the operation. uas operator shall register themselves in the member state where they have their residence or principal place of business. uas operator shall provide the competent authority with the information required to carry out the operation.



T5 BertScore F1: 0.79
10476
6
eu regulation applies to uas intended to be operated in the 'open' category. it also applies to uas intended to be operated in the'specific' category. it establishes rule on making uas available on the market and their free movement.



T5 BertScore F1: 0.80
0.7999849915504456
