# Package installation

In [1]:
# PyTorch comes preinstalled on Google Colab
!pip install accelerate -U
!pip install transformers[torch]

Collecting accelerate
  Downloading accelerate-0.33.0-py3-none-any.whl.metadata (18 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch>=1.10.0->

In [75]:
import transformers
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import pipeline

# Function to compute perplexity

In [84]:
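def calculate_perplexity(text, tokenizer, model, device):
    """Return exp(mean cross-entropy loss) of `model` on `text`."""
    # eval() disables dropout so repeated calls give the same value
    model = model.to(device).eval()
    inputs = tokenizer(text, return_tensors="pt").to(device)
    # No gradients are needed when only measuring the loss
    with torch.no_grad():
        loss = model(input_ids=inputs["input_ids"], labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()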
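
To make the quantity concrete: perplexity is the exponential of the average negative log-likelihood per predicted token, which is what `torch.exp(loss)` computes from the model's cross-entropy loss. The sketch below reproduces that arithmetic in plain Python. The function name `perplexity_from_token_probs` and the probabilities are hypothetical values chosen for illustration, not outputs of GPT-2.

```python
import math

def perplexity_from_token_probs(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the observed tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities assigned to each actual next token.
uniform = perplexity_from_token_probs([0.25, 0.25, 0.25, 0.25])
confident = perplexity_from_token_probs([0.9, 0.9, 0.9])

print(uniform, confident)
```

A model that spreads probability evenly over four options has a perplexity of about 4, and a model that is nearly certain approaches 1. When `labels` equals `input_ids`, GPT-2 shifts the labels internally, so a prompt of n tokens is scored on n - 1 predictions and very short prompts give noisy values.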


In [79]:
if torch.cuda.is_available():
    device = "cuda:0"
else:
    device = "cpu"

# Base model

In [191]:
base_model_id = 'gpt2'

# Move the model to the same device as the tokenized inputs used below
base_model = AutoModelForCausalLM.from_pretrained(base_model_id).to(device)

base_tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Generate text about Duitama and compute its perplexity

In [334]:
base_inputs = base_tokenizer("Duitama is", return_tensors = "pt").to(device)


In [335]:
base_tokenizer.decode(base_model.generate(base_inputs['input_ids'],max_new_tokens=200)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"Duitama is a small, but very well-built, and very well-designed, and very well-designed. It's a very good design, and it's very well-designed. It's a very good design. It's a very good design. It's a very good design. It's a very good design. It's a very good design. It's a very good design. It's a very good design. It's a very good design. It's a very good design. It's a very good design. It's a very good design. It's a very good design. It's a very good design. It's a very good design. It's a very good design. It's a very good design. It's a very good design. It's a very good design. It's a very good design. It's a very good design. It's a very good design. It's a very good design. It's a very good design. It's a"
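
The warning printed above comes from calling `generate` with only `input_ids`. Below is a minimal sketch of one way to address it, assuming the standard `generate` keyword arguments `attention_mask`, `pad_token_id` and `no_repeat_ngram_size`. The helper `build_generate_kwargs` is a hypothetical name introduced here, and blocking repeated n-grams is only one of several possible remedies for the looping greedy output.

```python
def build_generate_kwargs(encoded, eos_token_id, max_new_tokens=200):
    """Collect arguments for model.generate() from a tokenizer output.

    `encoded` is any mapping with "input_ids" and "attention_mask",
    such as the object returned by tokenizer(..., return_tensors="pt").
    """
    return {
        "input_ids": encoded["input_ids"],
        "attention_mask": encoded["attention_mask"],  # pass the mask explicitly
        "pad_token_id": eos_token_id,  # GPT-2 defines no pad token of its own
        "max_new_tokens": max_new_tokens,
        "no_repeat_ngram_size": 3,  # assumption: forbid repeating any 3-gram
    }

# Intended use in this notebook (not executed in this sketch):
# kwargs = build_generate_kwargs(base_inputs, base_tokenizer.eos_token_id)
# base_tokenizer.decode(base_model.generate(**kwargs)[0])
```

Passing the mask and the pad token id should remove both messages. The n-gram constraint changes the generated text, so outputs produced this way are not comparable with the ones recorded in this notebook.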

In [336]:
calculate_perplexity("Duitama is",base_tokenizer,base_model,device) # initial perplexity

1399.706787109375

In [330]:
base_inputs = base_tokenizer("Duitama is important", return_tensors = "pt").to(device)

In [332]:
base_tokenizer.decode(base_model.generate(base_inputs['input_ids'],max_new_tokens=200)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"Duitama is important to me because it's a place where I can go to learn about the world and the people I love. It's a place where I can go to learn about the world and the people I love. It's a place where I can go to learn about the world and the people I love.\n\nI'm a big fan of the show, and I'm really excited to see what the next season will bring. I'm a big fan of the show, and I'm really excited to see what the next season will bring. I'm a big fan of the show, and I'm really excited to see what the next season will bring.\n\nI'm a big fan of the show, and I'm really excited to see what the next season will bring. I'm a big fan of the show, and I'm really excited to see what the next season will bring.\n\nI'm a big fan of the show, and I'm really excited to see what the next"

In [333]:
calculate_perplexity("Duitama is important",base_tokenizer,base_model,device) # initial perplexity

1598.3819580078125

# Model to fine-tune

In [202]:

model_id = 'gpt2'

model = AutoModelForCausalLM.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)

In [203]:
#grouped_data['line_text']=grouped_data['line_text']+tokenizer.eos_token

In [167]:
tokenizer.pad_token = tokenizer.eos_token

# A text file about Duitama, extracted from Wikipedia, is turned into a dataset

In [204]:
train_data=transformers.TextDataset(tokenizer=tokenizer,file_path='duitama.es.en.txt',block_size=64)

Token indices sequence length is longer than the specified maximum sequence length for this model (8837 > 1024). Running this sequence through the model will result in indexing errors
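
The warning above appears because the whole file is tokenized in one pass (8837 tokens) before being cut into blocks. Since each training example is only `block_size=64` tokens long, no example exceeds the 1024-token limit. The sketch below shows the chunking as I understand `TextDataset` to perform it for GPT-2: consecutive full blocks, with the trailing partial block dropped. The function `split_into_blocks` and the placeholder ids are hypothetical; only the lengths 8837 and 64 come from this notebook.

```python
def split_into_blocks(token_ids, block_size):
    """Cut a token list into consecutive full blocks, dropping the remainder."""
    return [
        token_ids[i : i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

# Placeholder ids; 8837 is the sequence length reported in the warning.
token_ids = list(range(8837))
blocks = split_into_blocks(token_ids, block_size=64)

print(len(blocks))      # number of training examples
print(len(blocks[-1]))  # every block has exactly block_size tokens
```

Under these assumptions the file yields 138 examples of 64 tokens each, and the last 5 tokens are discarded.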



# Define the training arguments



In [205]:
training_args = transformers.TrainingArguments(
    output_dir="/content/output3",
    overwrite_output_dir=True,
    num_train_epochs=100,
    per_device_train_batch_size=32,
    # The two eval settings below have no effect here: the Trainer receives no eval_dataset
    per_device_eval_batch_size=32,
    eval_steps=400,
    save_steps=800,
    warmup_steps=100,
)

In [206]:
data_collator = transformers.DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # causal language modeling instead of masked language modeling
)
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data,
)
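
With `mlm=False`, the collator builds the labels as a copy of the input ids and marks padding positions with -100 so the loss ignores them. Here is a plain-Python sketch of that rule, as I understand the collator's behavior. The function `causal_lm_labels` and the sample ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def causal_lm_labels(input_ids, pad_token_id=None):
    """Copy the inputs as labels, masking out padding positions."""
    return [
        IGNORE_INDEX if pad_token_id is not None and tok == pad_token_id else tok
        for tok in input_ids
    ]

# Hypothetical batch row padded with GPT-2's eos id (50256) used as pad id.
labels = causal_lm_labels([35, 5013, 1689, 50256, 50256], pad_token_id=50256)
print(labels)
```

One consequence of reusing the eos token as the pad token is that genuine end-of-text tokens are masked as well. That is harmless for fixed-length blocks with no padding, but worth keeping in mind for padded batches.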


# Training (the number of epochs was kept low for the practicality of this test and because Google Colab limits GPU usage)

In [207]:
trainer.train()

Step,Training Loss
500,1.0724


TrainOutput(global_step=500, training_loss=1.072354248046875, metrics={'train_runtime': 254.3246, 'train_samples_per_second': 54.261, 'train_steps_per_second': 1.966, 'total_flos': 450728755200000.0, 'train_loss': 1.072354248046875, 'epoch': 100.0})
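
The reported `global_step=500` can be cross-checked from the other numbers in `TrainOutput`. The sketch below assumes a single device and no gradient accumulation (the defaults); the helper `total_optimizer_steps` is a hypothetical name used only for this check.

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs):
    """Batches per epoch (last partial batch rounded up) times epochs."""
    return math.ceil(num_examples / batch_size) * epochs

# samples per second * runtime = examples processed over all 100 epochs
num_examples = round(54.261 * 254.3246 / 100)
steps = total_optimizer_steps(num_examples, batch_size=32, epochs=100)

print(num_examples, steps)
```

The throughput implies 138 training examples, which gives 5 batches per epoch and 500 steps over 100 epochs, matching the log.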

# Set up the pipeline for inference

In [220]:
bot=pipeline('text-generation',model='/content/output3/checkpoint-500',tokenizer=model_id,device=0,max_length=200,truncation=True)

In [318]:
!zip -r /content/final_model.zip /content/output3

  adding: content/output3/ (stored 0%)
  adding: content/output3/runs/ (stored 0%)
  adding: content/output3/runs/Aug01_02-04-26_2f7ed00d4317/ (stored 0%)
  adding: content/output3/runs/Aug01_02-04-26_2f7ed00d4317/events.out.tfevents.1722477873.2f7ed00d4317.858.4 (deflated 60%)
  adding: content/output3/checkpoint-500/ (stored 0%)
  adding: content/output3/checkpoint-500/trainer_state.json (deflated 56%)
  adding: content/output3/checkpoint-500/optimizer.pt (deflated 8%)
  adding: content/output3/checkpoint-500/config.json (deflated 51%)
  adding: content/output3/checkpoint-500/model.safetensors (deflated 7%)
  adding: content/output3/checkpoint-500/rng_state.pth (deflated 25%)
  adding: content/output3/checkpoint-500/training_args.bin (deflated 51%)
  adding: content/output3/checkpoint-500/scheduler.pt (deflated 56%)
  adding: content/output3/checkpoint-500/generation_config.json (deflated 24%)


# Run the same prompts used with the base model and compute the perplexities

In [315]:
text="Duitama is"
final_output=bot(text)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [316]:
final_output[0]['generated_text']

'Duitama is a Colombian municipality, located in the department of Boyacá, in central-eastern Colombia, in the Alto Chicamocha region. It is the capital and largest urban center in the province of Tundama. It is known as "The Civic Capital of Boyacá" and "The Civic Capital of Boyacá" and "The Civic Capital of Boyacá" for its excellent industrial activity, commercial, transportation and cultural activities. The city is known as the "Holland Highway" and "The Civic Capital of Boyacá" for its excellent access to the water resources of the rivers and streams that bathe the local area. The city is famous for its fruit orchards of apples, peaches, pears, curubas and plums; also grapes, wheat, barley and wines. The fruit orchards of apples, peaches, curubas and plums are known as the "Pears of Boyacá'

In [317]:
calculate_perplexity(text,tokenizer,model,device)



1.0838747024536133

In [306]:
text="Duitama is important"
final_output=bot(text)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [307]:
final_output[0]['generated_text']

"Duitama is important for the transportation of cargo and it is the objective of the companies that provide it to carry out the maintenance of the structures that constitute the Contracting Post of the Municipality of Duitama. It is a strategic point for the provision of services and it is known as the 'Bundesliga of Cargo' because of its completeness and completeness as well as the fact that it is the first cargo transportation center in the country. Furthermore, it presents a productive industrial area and, above all, it presents a productive agricultural area. The agricultural area of Duitama is classified as one of the best in the world by the World Agricultural Organization, awarded the best possible status in the category of Agricultural Areas and, due to its large variety of crops and vegetables, it is one of the main food providers for the local population. The biological resources of the rivers and streams that bathe the landscape are recognized as being of exceptional quality

In [308]:
calculate_perplexity(text,tokenizer,model,device)


70.11921691894531
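
To put the four perplexity values recorded in this notebook side by side, the short sketch below prints the base and fine-tuned value for each prompt, together with their ratio. The dictionary layout is only an illustrative choice; the numbers are copied from the cell outputs.

```python
# prompt -> (base GPT-2 perplexity, fine-tuned perplexity), from the outputs above
results = {
    "Duitama is": (1399.706787109375, 1.0838747024536133),
    "Duitama is important": (1598.3819580078125, 70.11921691894531),
}

ratios = {}
for prompt, (base_ppl, tuned_ppl) in results.items():
    ratios[prompt] = base_ppl / tuned_ppl
    print(f"{prompt!r}: {base_ppl:.1f} -> {tuned_ppl:.2f} ({ratios[prompt]:.0f}x lower)")
```

Both prompts are only a few tokens long and the first one opens the training text's own phrasing, so these ratios mostly indicate how well the model memorized the file rather than how well it generalizes.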