## GPT2 text generation summarizer pipeline

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!cp /content/drive/MyDrive/TFM-MUECIM/*.py /content
!cp /content/drive/MyDrive/TFM-MUECIM/GPT2_trained_model_202504-5epochs-medium.tar.gz /content
!cd /content; tar xzf GPT2_trained_model_202504-5epochs-medium.tar.gz

In [3]:
!pip install transformers



In [4]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, set_seed

In [5]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')  # gpt2
tokenizer.pad_token = tokenizer.eos_token

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/718 [00:00<?, ?B/s]

In [6]:
trainedModelFile = 'GPT2_trained_model_202504'
model = GPT2LMHeadModel.from_pretrained(trainedModelFile)

In [8]:
document = '''
THE COMMISSION OF THE EUROPEAN COMMUNITIES
,
Having regard to the Treaty establishing the European Economic Community;
Having regard to Council Regulation No 1009/67/EEC (1) of 18 December 1967 on the common organisation of the market in sugar, as amended by Regulation (EEC) No 2100/68, (2) and in particular Articles 8 (3) and 32 (4) thereof;
Whereas it should be made clear that the notification referred to in the first indent of Article 32 (2) of Regulation No 1009/67/EEC places an obligation on the factory or undertaking concerned to keep the quantity carried forward in store during the period referred to in the second indent of the same subparagraph without any reimbursement of storage costs ; whereas it should also be specified that if, despite the obligation to store, that quantity is disposed of during the above-mentioned period storage costs shall not be reimbursed ; whereas, in order to prevent such disposal from making the carry forward system ineffective, it is important that the quantity concerned should continue to be treated as production within the basic quota for the following marketing year ; whereas, if that system is to be effective, the production levy applicable during the marketing year in which the sugar was produced should be charged in the event of premature disposal;
Whereas the measures provided for in this Regulation are in accordance with the Opinion of the Management Committee for Sugar;
The following Article 4a shall be inserted in Regulation (EEC) No 103/69 (3):
"Article 4a
1. By virtue of its notification to the Member State concerned pursuant to the first indent of the first subparagraph of Article 32 (2) of Regulation No 1009/67/EEC, the factory or undertaking incurs an obligation to keep the quantity carried forward in store during the period referred to in the second indent of that subparagraph.
2. The quantity in respect of which the factory or undertaking does not fulfil the obligation referred to in paragraph 1:  (a) shall not qualify for reimbursement of storage costs pursuant to Article 8 of Regulation No 1009/67/EEC for that part of the period referred to in paragraph 1 during which that quantity was kept in store;
(b) shall be subject to the production levy applicable during the marketing year in which that quantity was produced;  (1) OJ No 308, 18.12.1967, p. 1. (2) OJ No L 309, 24.12.1968, p. 4. (3) OJ No L 14, 21.1.1969, p. 9.
(c) shall be treated as production within the basic quota for the factory or undertaking concerned for the marketing year to which that quantity should have been carried forward."
'''
TEXT_LENGTH = 760
MAX_LEN = 760 # TOKENS
if len(document) > TEXT_LENGTH:
    document = document[:TEXT_LENGTH]
document += ' TL;DR: '

tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
tokenizer.pad_token = tokenizer.eos_token

input_ids = tokenizer.encode(
    document,
    return_tensors='pt',
    truncation=True,
    max_length=MAX_LEN)

unknown_token_id = tokenizer.unk_token_id
if unknown_token_id in input_ids:
    print(f"Found unknown token id {unknown_token_id} in input_ids. Replacing with eos_token_id.")
    input_ids = torch.where(input_ids == unknown_token_id, tokenizer.eos_token_id, input_ids)


In [9]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Check the dtype of input_ids and ensure it's torch.long
if input_ids.dtype != torch.long:
    input_ids = input_ids.type(torch.long)


# Clamp input_ids to be within the vocabulary range
input_ids = torch.clamp(input_ids, 0, tokenizer.vocab_size - 1)

input_ids = input_ids.to(device)

MAX_LEN_GEN = max(7 * input_ids.shape[1] / 4, 1024)

generated_outputs = model.generate(
  input_ids,
  do_sample = True,
  top_k = 50,
  top_p = 0.85,
  pad_token_id = tokenizer.eos_token_id,
  max_length = MAX_LEN_GEN
)

generatedText = ''

for _, generated_output in enumerate(generated_outputs):
    generatedText += tokenizer.decode(generated_output, skip_special_tokens = True)

textFields =  generatedText.split('TL;DR:')
docLen = len(document)
baseText = textFields[0]
baseLen = len(baseText)
discarded =  docLen - baseLen
summary = textFields[1]
summaryLen = len(summary)
print(f'\n\nSUMMARY:\nDocument length: {docLen} chars\nInput length: {baseLen} chars (discarded: {discarded} chars)\nResponse length: {len(summary)} chars')
print(100 * '-')
print(summary)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.




SUMMARY:
Document length: 768 chars
Input length: 761 chars (discarded: 7 chars)
Response length: 2045 chars
----------------------------------------------------------------------------------------------------
  (a) the factory or undertaking has not paid the costs referred to in paragraph 1, (b) no reimbursement of storage costs may be collected;
Whereas the first indent of the second subparagraph of Article 32 (2) of Regulation No 1009/67/EEC requires the factory or undertaking concerned to keep the quantity carried forward in store, subject to the observance of the rules laid down in Article 4 of the said Regulation;
Whereas the second indent of the second subparagraph of Article 32 (2) of Regulation No 1009/67/EEC requires the factory or undertaking concerned to pay to the consumer the amounts due on the quantities in question, without reimbursement of storage costs, provided that such reimbursement is not later than the date of expiry of the storage contract  or, in the case of 