## Text Generation with Greedy Search using GPT-2

GPT-2 has a **decoder-only** architecture: it is trained to predict the next token in a sequence from the tokens that precede it. This makes it a natural fit for generating text from a *prompt* or initial context.
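
Next-token prediction corresponds to the standard autoregressive factorization, and greedy search keeps only the single most likely token at each step. The symbols $x$ (the prompt), $y_t$ (the token generated at step $t$) and $N$ (the number of generated tokens) are notation introduced here for illustration:

$$
P(y_1, \dots, y_N \mid x) = \prod_{t=1}^{N} P(y_t \mid y_{<t}, x),
\qquad
\hat{y}_t = \arg\max_{y_t} P(y_t \mid \hat{y}_{<t}, x)
$$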



### 1. Loading the Model and Tokenizer

First, we load the model and tokenizer from Hugging Face's `transformers` library:

* `device`: uses the GPU when CUDA is available and falls back to the CPU otherwise
* `model`: the pretrained model, in this case `gpt2-xl`
* `tokenizer`: the tokenizer for our model; it is chosen to match the selected model

ref: https://huggingface.co/docs/transformers/v4.51.3/en/model_doc/auto

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

In [45]:
device

'cuda'

In [7]:
tokenizer

GPT2TokenizerFast(name_or_path='gpt2-xl', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
)

### 2. Greedy Search Decoding: PyTorch implementation

In [46]:
import pandas as pd
from src.greedy_search import greedy_search_generation

In [47]:
input_txt = "I"
iterations = greedy_search_generation(model, tokenizer, input_txt)
df = pd.DataFrame(iterations)
df

| Step | Input | Choice 1 | Choice 2 | Choice 3 | Choice 4 | Choice 5 |
|---|---|---|---|---|---|---|
| 0 | I | 'm (9.32%) | think (6.08%) | am (4.13%) | have (3.90%) | don (3.75%) |
| 1 | I'm | not (17.84%) | going (5.71%) | a (5.38%) | sure (3.83%) | sorry (3.70%) |
| 2 | I'm not | sure (13.76%) | going (11.80%) | a (8.03%) | saying (7.66%) | trying (2.16%) |
| 3 | I'm not sure | if (15.01%) | what (13.82%) | how (10.56%) | I (8.58%) | that (6.73%) |
| 4 | I'm not sure if | I (14.76%) | it (13.96%) | you (12.76%) | this (10.01%) | the (6.76%) |
| 5 | I'm not sure if I | 'm (20.09%) | can (10.83%) | 'll (7.12%) | should (7.07%) | 've (6.32%) |
| 6 | I'm not sure if I'm | going (9.95%) | a (4.82%) | the (4.74%) | ready (4.09%) | supposed (4.02%) |
| 7 | I'm not sure if I'm going | to (96.20%) | through (0.27%) | back (0.27%) | too (0.25%) | crazy (0.25%) |
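
Each row above lists the five most likely next tokens for the current input, and the next row's input is the previous input plus Choice 1. The helper `greedy_search_generation` comes from `src.greedy_search`, which this notebook does not show, so what follows is only a minimal sketch of such a loop, not the author's implementation. To keep it runnable without downloading a model, a hypothetical lookup table (`TOY_TABLE`, with made-up probabilities) stands in for the softmax over the model's logits at the last position. The names `greedy_search_sketch` and `toy_next_token_probs`, and their parameters, are assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a language model: maps the current text to a
# next-token distribution. The probabilities are invented for this sketch.
TOY_TABLE: Dict[str, Dict[str, float]] = {
    "I": {"'m": 0.5, " think": 0.3, " am": 0.2},
    "I'm": {" not": 0.6, " going": 0.25, " a": 0.15},
    "I'm not": {" sure": 0.7, " going": 0.2, " a": 0.1},
}


def toy_next_token_probs(text: str) -> Dict[str, float]:
    """Return the (made-up) next-token distribution for `text`."""
    return TOY_TABLE[text]


def greedy_search_sketch(
    next_token_probs: Callable[[str], Dict[str, float]],
    input_txt: str,
    n_steps: int = 3,
    choices_per_step: int = 3,
) -> List[Dict[str, str]]:
    """Record the top choices at each step and always append the best one."""
    iterations = []
    text = input_txt
    for _ in range(n_steps):
        probs = next_token_probs(text)
        # Rank candidate tokens from most to least likely.
        ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
        row = {"Input": text}
        for rank, (token, p) in enumerate(ranked[:choices_per_step], start=1):
            row[f"Choice {rank}"] = f"{token} ({100 * p:.2f}%)"
        iterations.append(row)
        # Greedy step: extend the text with the single most likely token.
        text += ranked[0][0]
    return iterations


toy_iterations = greedy_search_sketch(toy_next_token_probs, "I")
for row in toy_iterations:
    print(row)
```

With the notebook's objects, the lookup would instead be a forward pass through `model` followed by a softmax over the logits of the last position, with token ids converted back to text by `tokenizer`.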


In [48]:
iterations[-1]

{'Input': "I'm not sure if I'm going",
 'Choice 1': ' to (96.20%)',
 'Choice 2': ' through (0.27%)',
 'Choice 3': ' back (0.27%)',
 'Choice 4': ' too (0.25%)',
 'Choice 5': ' crazy (0.25%)'}
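
Because the model assigns a probability to every step, the Choice 1 percentages from the eight steps shown earlier can be multiplied to estimate how likely the whole greedy continuation is under the model. The sketch below does that arithmetic with the rounded values displayed in the output, summing logarithms, which avoids numerical underflow on longer sequences. The variable names are illustrative.

```python
import math

# Choice 1 probabilities (in percent) for steps 0-7, copied from the output above.
choice1_percent = [9.32, 17.84, 13.76, 15.01, 14.76, 20.09, 9.95, 96.20]

# Sum of log-probabilities equals the log of the product of probabilities.
log_prob = sum(math.log(p / 100.0) for p in choice1_percent)
sequence_prob = math.exp(log_prob)

print(f"log-probability: {log_prob:.3f}")
print(f"probability:     {sequence_prob:.3e}")
```

Even though every individual choice is the most likely one at its step, the product comes out on the order of one in a million, because probabilities below 1 shrink quickly when multiplied.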

In [49]:
n_steps = 128
input_txt = """In a shocking finding, scientist discovered \
a herd of unicorns living in a remote, previously unexplored \
valley, in the Andes Mountains. Even more surprising to the \
researchers was the fact that the unicorns spoke perfect English.\n\n
"""
iterations = greedy_search_generation(model, tokenizer, input_txt, n_steps=n_steps)
df = pd.DataFrame(iterations)
df

| Step | Input | Choice 1 | Choice 2 | Choice 3 | Choice 4 | Choice 5 |
|---|---|---|---|---|---|---|
| 0 | In a shocking finding, scientist discovered a ... | The (29.40%) | " (7.33%) | According (4.11%) | In (2.38%) | A (1.94%) |
| 1 | In a shocking finding, scientist discovered a ... | researchers (9.20%) | discovery (6.21%) | scientists (5.67%) | herd (3.55%) | unic (3.12%) |
| 2 | In a shocking finding, scientist discovered a ... | , (12.27%) | were (8.31%) | believe (5.64%) | from (5.59%) | found (3.50%) |
| 3 | In a shocking finding, scientist discovered a ... | from (27.48%) | who (16.70%) | led (16.54%) | Dr (3.07%) | a (1.64%) |
| 4 | In a shocking finding, scientist discovered a ... | the (62.88%) | Argentina (1.19%) | University (0.99%) | Columbia (0.85%) | Arizona (0.81%) |
| ... | ... | ... | ... | ... | ... | ... |
| 123 | In a shocking finding, scientist discovered a ... | to (97.53%) | that (0.80%) | by (0.59%) | at (0.27%) | when (0.24%) |
| 124 | In a shocking finding, scientist discovered a ... | find (99.49%) | learn (0.19%) | see (0.16%) | discover (0.05%) | finding (0.02%) |
| 125 | In a shocking finding, scientist discovered a ... | that (97.64%) | the (1.04%) | unic (0.39%) | a (0.26%) | out (0.11%) |
| 126 | In a shocking finding, scientist discovered a ... | the (97.43%) | unic (0.80%) | they (0.34%) | there (0.14%) | , (0.13%) |


In [50]:
# show the last step
df.iloc[-1]['Input']

'In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.\n\n\nThe researchers, from the University of California, Davis, and the University of Colorado, Boulder, were conducting a study on the Andean cloud forest, which is home to the rare species of cloud forest trees.\n\n\nThe researchers were surprised to find that the unicorns were able to communicate with each other, and even with humans.\n\n\nThe researchers were surprised to find that the unicorns were able to communicate with each other, and even with humans.\n\nThe researchers were surprised to find that the unicorns were able to communicate with each other, and even with humans.\n\nThe researchers were surprised to find that the'
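
The text above ends up repeating the same sentence, a commonly observed weakness of greedy decoding. One simple way to make the repetition visible is to count how often each complete sentence occurs. The sketch below does this on the tail of the generated text. The helper name `count_repeated_sentences` and the naive split on periods are assumptions for illustration, not part of the notebook's code.

```python
from collections import Counter
from typing import Dict


def count_repeated_sentences(text: str) -> Dict[str, int]:
    """Return the sentences that occur more than once, with their counts."""
    # Naive sentence split: break on '.', trim whitespace, drop empty pieces.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    counts = Counter(sentences)
    return {s: n for s, n in counts.items() if n > 1}


# Tail of the generated text shown above (the final sentence is cut off).
sentence = (
    "The researchers were surprised to find that the unicorns were able to "
    "communicate with each other, and even with humans."
)
generated_excerpt = (
    sentence + "\n\n\n"
    + sentence + "\n\n"
    + sentence + "\n\n"
    + "The researchers were surprised to find that the"
)

repeated = count_repeated_sentences(generated_excerpt)
for s, n in repeated.items():
    print(f"{n}x: {s}")
```

The truncated final sentence is counted separately from the complete ones, so only the fully repeated sentence is reported.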