# Fine-tune a DialoGPT model

Adapted from the notebook in [this Medium post](https://towardsdatascience.com/make-your-own-rick-sanchez-bot-with-transformers-and-dialogpt-fine-tuning-f85e6d1f4e30?gi=e4a72d1510f0).

## Setup

In [1]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [2]:
!pip -q install transformers

In [3]:
import os
os.chdir("/content/drive/My Drive/Colab Notebooks")

In [4]:
# all the imports

import glob
import logging
import os
import pickle
import random
import re
import shutil
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm.notebook import tqdm, trange

from pathlib import Path

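from torch.optim import AdamW  # transformers' own AdamW is deprecated in favor of the PyTorch implementation

from transformers import (
    MODEL_WITH_LM_HEAD_MAPPING,
    WEIGHTS_NAME,
    AutoConfig,
    PreTrainedModel,
    PreTrainedTokenizer,
    get_linear_schedule_with_warmup,
)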


try:
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    from tensorboardX import SummaryWriter

## Get Data from Kaggle

In [None]:
!mkdir -p ~/.kaggle
!cp /content/kaggle.json ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets download -d thedevastator/friends-tv-show-dialog-sequences

!unzip friends-tv-show-dialog-sequences.zip -d my_dataset

friends-tv-show-dialog-sequences.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  friends-tv-show-dialog-sequences.zip
  inflating: my_dataset/sequences.csv  


In [6]:
data = pd.read_csv('HIMYM.csv')

In [7]:
data

Unnamed: 0,Season,Episode,Character,Line
0,1,1,Marshall,(Opens ring) Will you marry me.
1,1,1,Ted,"Yes, perfect! And then you're engaged, you pop..."
2,1,1,Marshall,"Got it. Thanks for helping me plan this out, Ted."
3,1,1,Ted,"Dude, are you kidding? It's you and Lily! I've..."
4,1,1,Marshall,"(laughs) yeah, sorry. We thought you were asleep."
...,...,...,...,...
22779,8,24,Marshall,It is so much fun.
22780,8,24,Marshall,I love crushed nuts.
22781,8,24,Barney,"""I'm probably saying some political stuff righ..."
22782,8,24,Barney,Whoa. Is there going to be a fight?


In [8]:
# The script contains stage directions in parentheses, so strip every parenthesized expression.

# Remove parenthesized expressions from a line of dialogue
def eliminar_expresiones(texto):
    return re.sub(r'\([^)]*\)', '', texto)

# Apply the function to the 'Line' column
data['Line'] = data['Line'].apply(eliminar_expresiones)
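
To see what the pattern does on its own, here is a minimal sketch that runs it on two lines taken from the table above. The helper name `strip_parentheticals` and the extra whitespace cleanup are assumptions added for illustration; the notebook itself only applies the first `re.sub`.

```python
import re

def strip_parentheticals(text: str) -> str:
    # Same pattern as eliminar_expresiones: drop "(...)" spans without nesting.
    without_directions = re.sub(r'\([^)]*\)', '', text)
    # Assumption (not in the notebook): collapse the whitespace left behind.
    return re.sub(r'\s+', ' ', without_directions).strip()

print(strip_parentheticals("(Opens ring) Will you marry me."))
print(strip_parentheticals("(laughs) yeah, sorry. We thought you were asleep."))
```

The pattern `\([^)]*\)` stops at the first closing parenthesis, so nested parentheses would leave a stray `)` behind; that case does not appear in the sample rows shown here.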

In [9]:
data

Unnamed: 0,Season,Episode,Character,Line
0,1,1,Marshall,Will you marry me.
1,1,1,Ted,"Yes, perfect! And then you're engaged, you pop..."
2,1,1,Marshall,"Got it. Thanks for helping me plan this out, Ted."
3,1,1,Ted,"Dude, are you kidding? It's you and Lily! I've..."
4,1,1,Marshall,"yeah, sorry. We thought you were asleep."
...,...,...,...,...
22779,8,24,Marshall,It is so much fun.
22780,8,24,Marshall,I love crushed nuts.
22781,8,24,Barney,"""I'm probably saying some political stuff righ..."
22782,8,24,Barney,Whoa. Is there going to be a fight?


In [10]:
CHARACTER_NAME = 'Barney'

In [22]:
contexted = []

# Use the 7 previous lines as context
n = 7


# Only CHARACTER_NAME's lines are used as responses; context lines come from every character
for i in data[data.Character == CHARACTER_NAME].index:
    if i < n:
        continue
    row = []
    current_episode = data.Episode[i]
    prev = i - 1 - n  # subtract 1 so the row holds the current response plus the 7 previous lines
    for j in range(i, prev, -1):
        if data.Episode[j] == current_episode:
            row.append(data.Line[j])
        else:
            # Stop adding context once a line belongs to a different episode
            break
    contexted.append(row)

columns = ['response', 'context']
columns = columns + ['context/' + str(i) for i in range(n - 1)]

df = pd.DataFrame.from_records(contexted, columns=columns)

# Fill empty cells with an empty string
df = df.fillna('')
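
The windowing logic is easier to follow on a tiny example. The sketch below re-implements the same loop over plain tuples instead of a DataFrame; `build_context_rows` and the toy dialogue are hypothetical and exist only for illustration. Each row starts with the response, followed by the context from most recent to oldest, and is padded with empty strings when the window would cross an episode boundary.

```python
from typing import List, Tuple

def build_context_rows(lines: List[Tuple[int, str, str]], character: str, n: int) -> List[List[str]]:
    """lines holds (episode, speaker, text); returns [response, newest context, ..., oldest context]."""
    rows = []
    for i, (episode, speaker, _) in enumerate(lines):
        if speaker != character or i < n:
            continue
        row = []
        # Walk backwards from the response over at most n previous lines.
        for j in range(i, i - 1 - n, -1):
            if lines[j][0] != episode:
                break  # do not borrow context from another episode
            row.append(lines[j][2])
        row += [''] * (n + 1 - len(row))  # pad short rows, like fillna('')
        rows.append(row)
    return rows

toy = [
    (1, "Ted", "Hi."),
    (1, "Barney", "Suit up!"),
    (1, "Ted", "No."),
    (1, "Barney", "Legendary."),
    (2, "Ted", "New episode."),
    (2, "Barney", "Challenge accepted."),
]
rows = build_context_rows(toy, "Barney", n=2)
for row in rows:
    print(row)
```

With `n=2`, the first Barney line is skipped because it has fewer than two predecessors, and the last row is padded because its window reaches back into episode 1.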

In [23]:
df.sample(6)

Unnamed: 0,response,context,context/0,context/1,context/2,context/3,context/4,context/5
33,"You hate olives? Lily loves them, you can't st...",Yeah...,"So, Marshall. This ""Olive Theory"" based on you...",Thank you.,For starters,Are you trying to get me drunk?,Would you like those olives with some Gin and ...,I was just hoping to get those olives... that ...
555,"Oh, no. It just takes a while to power down.","Video's pretty good on this phone, huh?",Sorry. Don't buy it. You're making it up. You'...,Yeah.,Nope.,Yeah.,Nope.,"My life rocks! Money, suits and s*x. These are..."
3413,"Yeah, because we're friends!",10 minutes?,"You have 10 minutes to get here, the window cl...",And I know our relationship has suffered in th...,You're an animal!,"Ted, will you calm down? I'm your pal!",You let him with Barney?,"Ted, change of plans!"
2420,I'll take the vichyssoise with a hint of cream...,Go. You're wrong.,I love the guys here. This is real guys. Witho...,You went to the hospital?,"And you? Hey, man! You got anything in the leg?",Turn drink.,This is Robin.,Hi everyone.
4423,"Yeah, me, too.",Scooby got out!,"Oh, my God. Scooby ate the whole tray. Wait. W...","Guys, I found this on the floor.",Dude. Listen to me. You have nothing to be ash...,"Okay, I'll tell you. Okay. One time Barney saw...","What did Barney mean when he said ""calzone""?",Yeah?
2004,"Oh, I understand. Duck Stan is a good bro. I'm...","I need a new bro, what do you say?","Starry Porten, Barney Stinson.",So what do you think? 9:30 or 10 o'clock?,"Yeah. Yeah. So, listen you should meet me in M...","Pitt, Barney Stinson!","No offence Randy, but there is a long list of ...","Look, I'm crazy about this girl, if waiting is..."


In [25]:
# Split into training and validation sets
trn_df, val_df = train_test_split(df, test_size=0.1)
trn_df

Unnamed: 0,response,context,context/0,context/1,context/2,context/3,context/4,context/5
599,And then you slip it to the guy with a discree...,So badly.,Show me you're psyched! Let's do this! Ow! Tha...,"Wow, that was really specific.","What was that? Marshall, I should feel tremors...",I'm psyched!,"Yeah, but it's one thing to say it, it's anoth...","Oh, I am. I'm, I'm psyched."
215,Was that chick at the end really a client?,"Love Solution's Ellen Pierce, a beacon of hope...","I'm here with Ellen Pierce, New York's premier...","Ooh, my story's on. Ted, pay attention. Carl, ...","Oh, yeah, sorry, my bad. You're a man.","It, it was a mouse.",Marshall ran away from a cockroach.,"But those things coming out of his head, those..."
4111,Want to go touch a bunch of stuff?,You're right.,This is ridiculous. We are two grown adults st...,"Mmm. Ah, this Scotch is good. How's your drink?","Yeah, old stuff's great.","So, what are we protesting tonight? Rising cos...","Zoey? Well, well, well.","Look, I don't hate all of it. Tonight's fun. T..."
1828,"Ah, there she is.","Yeah, I know, Barney, you showed me. And that ...","Right, and if you recall, her computer had a w...",Yes. You told her you were Ted and that you we...,"So you remember who this one is, right?",No! What's the matter with you? Get off of him...,And I forgive you. I love you.,"Look, Meg, we need to talk."
3161,"T'was the night before new year's, and the wea...",Not really.,"Ted, many a man--nay, many a soul--has their o...","What the hell is ""the sexless innkeeper""?","Oh, my god! You're right! He's totally the sex...",You're the sexless innkeeper.,Westchester. Why?,"Ted, let me ask you a question. Where does thi..."
...,...,...,...,...,...,...,...,...
3923,How dare you!,"Yeah, that's what your mom said.","Whoa, Ted, that thing you're packing is way to...",How did he do that?,You may be able to talk the brain surgeons you...,No one believes that story.,You guys are adorable. You seriously believe t...,Unsubscribe.
4324,"What? No, don't go! You want to see a magic t...",I'm telling you. The power of Valentine's Day...,I thought we didn't care about Valentine's Day.,What...? So that's it? A couple of white Urke...,What?,Well...,"Oh, yeah, I agreed to that.","Oh, it is. Plus, if you win, you get free piz..."
2653,2: They want to kill you. As my lunch with Wendy.,It is this server?,"The Lilium, not stupid. Sorry, I have no scoot...","As my high school boyfriend, Scooter.","Good question, there are four reasons for a ""m...",Why do you want to lunch with an ex?,You have to lunch?,Because I will not invite him to lunch... again.
401,"Here, you need a mint.",I can't believe you made it.,"Hey, uh, come on in.",Derek?! Derek.,Cheers. Well said.,"Ranjit, put her in park. Dudes, I'm sure party...",Three minutes!,"Move, you stupid taxi!"


In [14]:
# Build a dataset suited to our model

# construct_conv takes a DataFrame row (row), a tokenizer, and an optional eos flag (True by default; currently unused).
# It tokenizes each utterance, appends the end-of-sequence (EOS) token, and reverses the list of utterances
# so that the oldest context comes first and the response comes last.
# It returns the tokens of all utterances as one flat list (conv).

def construct_conv(row, tokenizer, eos = True):
    flatten = lambda l: [item for sublist in l for item in sublist]
    conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
    conv = flatten(conv)
    return conv

class ConversationDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, args, df, block_size=512):

        # Adjust block_size for the gap between the model's maximum length and the tokenizer's maximum single-sentence length.
        # Note: block_size is only used in the cache file name below; sequences are not truncated here.
        block_size = block_size - (tokenizer.model_max_length - tokenizer.max_len_single_sentence)


        # Load cached features from disk if the cache file exists, unless args.overwrite_cache asks to rebuild it.
        # If there is no cache or it must be overwritten, build the dataset features from scratch.
        directory = args.cache_dir
        cached_features_file = os.path.join(
            directory, args.model_type + "_cached_lm_" + str(block_size)
        )

        if os.path.exists(cached_features_file) and not args.overwrite_cache:
            logger.info("Loading features from cached file %s", cached_features_file)
            with open(cached_features_file, "rb") as handle:
                self.examples = pickle.load(handle)
        else:
            logger.info("Creating features from dataset file at %s", directory)


            # For each row of the DataFrame (df), call construct_conv to get the token list (conv) and append it to self.examples.
            # The resulting features are cached for faster access next time.
            self.examples = []
            for _, row in df.iterrows():
                conv = construct_conv(row, tokenizer)
                self.examples.append(conv)

            logger.info("Saving features into cached file %s", cached_features_file)
            with open(cached_features_file, "wb") as handle:
                pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Return the size of the dataset, i.e. the number of examples.
    def __len__(self):
        return len(self.examples)

    # Return the example at position item as a long tensor (torch.long).
    # The example is the tokenized conversation (the flat token list produced by construct_conv).
    def __getitem__(self, item):
        return torch.tensor(self.examples[item], dtype=torch.long)
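
The ordering produced by `construct_conv` can be checked without downloading a model. The sketch below uses `ToyTokenizer`, a hypothetical stand-in that assigns one id per whitespace-separated word and uses `0` as the EOS id; a real Hugging Face tokenizer would produce different ids, but the layout of the flat sequence is the same.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer: one id per word, EOS id 0."""
    eos_token_id = 0

    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        # Assign ids 1, 2, 3, ... in order of first appearance.
        return [self.vocab.setdefault(word, len(self.vocab) + 1) for word in text.split()]

def construct_conv(row, tokenizer):
    # Same logic as the notebook's construct_conv, written without the helper lambda.
    turns = [tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]
    return [token for turn in reversed(turns) for token in turn]

tok = ToyTokenizer()
row = ["I am fine", "How are you", "Hi"]  # response, context, context/0
ids = construct_conv(row, tok)
print(ids)  # [7, 0, 4, 5, 6, 0, 1, 2, 3, 0]
```

The oldest utterance ("Hi") ends up first and the response last, each followed by EOS, which is the turn layout DialoGPT is trained on.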

In [15]:
# Helpers for dataset caching, seeding, and checkpoint management.

# Load the examples and cache them
def load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False):
    # Return a ConversationDataset, which handles building the dataset
    return ConversationDataset(tokenizer, args, df_val if evaluate else df_trn)

# Set the random seeds for reproducibility
def set_seed(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)

# Return the list of checkpoints in sorted order
def _sorted_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> List[str]:
    ordering_and_checkpoint_path = []

    # Find every path that matches the checkpoint prefix
    glob_checkpoints = glob.glob(os.path.join(args.output_dir, "{}-*".format(checkpoint_prefix)))

    # For each path, record either its modification time or the step number in its name
    for path in glob_checkpoints:
        if use_mtime:
            ordering_and_checkpoint_path.append((os.path.getmtime(path), path))
        else:
            regex_match = re.match(".*{}-([0-9]+)".format(checkpoint_prefix), path)
            if regex_match and regex_match.groups():
                ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))

    # Sort the checkpoints by modification time or by the step number in the name
    checkpoints_sorted = sorted(ordering_and_checkpoint_path)
    checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
    return checkpoints_sorted

# Rotate (delete) the oldest checkpoints once there are more than args.save_total_limit
def _rotate_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> None:
    if not args.save_total_limit:
        return
    if args.save_total_limit <= 0:
        return

    # Check whether any old checkpoints need to be deleted
    checkpoints_sorted = _sorted_checkpoints(args, checkpoint_prefix, use_mtime)
    if len(checkpoints_sorted) <= args.save_total_limit:
        return

    # Work out how many checkpoints to delete
    number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - args.save_total_limit)
    checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]

    # Delete the old checkpoints
    for checkpoint in checkpoints_to_be_deleted:
        logger.info("Deleting older checkpoint [{}] due to args.save_total_limit".format(checkpoint))
        shutil.rmtree(checkpoint)
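
`_sorted_checkpoints` parses the step number out of each directory name instead of sorting the names as strings, and that matters once step counts have different numbers of digits. The sketch below reproduces the ordering and the rotation rule on plain strings, without touching the filesystem; `sort_checkpoints`, `checkpoints_to_delete`, and the example paths are illustrative assumptions.

```python
import re
from typing import List, Optional

def sort_checkpoints(paths: List[str], prefix: str = "checkpoint") -> List[str]:
    # Order by the integer step in "<prefix>-<step>", as _sorted_checkpoints does with use_mtime=False.
    numbered = []
    for path in paths:
        match = re.match(r".*{}-([0-9]+)".format(prefix), path)
        if match:
            numbered.append((int(match.group(1)), path))
    return [path for _, path in sorted(numbered)]

def checkpoints_to_delete(paths: List[str], save_total_limit: Optional[int]) -> List[str]:
    # Mirror _rotate_checkpoints: keep the newest save_total_limit checkpoints, delete the rest.
    if not save_total_limit or save_total_limit <= 0:
        return []
    ordered = sort_checkpoints(paths)
    return ordered[: max(0, len(ordered) - save_total_limit)]

paths = ["out/checkpoint-10500", "out/checkpoint-3500", "out/checkpoint-7000"]
print(sorted(paths))                     # string order puts 10500 first
print(sort_checkpoints(paths))           # numeric order puts 3500 first
print(checkpoints_to_delete(paths, 2))   # only the oldest checkpoint is removed
```

With `save_total_limit = None`, as in the configuration used in this notebook, nothing is ever deleted.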

## Build Model

In [16]:
# Load the tokenizer and the model from Hugging Face
from transformers import AutoModelWithLMHead, AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
# AutoModelWithLMHead is deprecated; AutoModelForCausalLM is the matching class for GPT-2-style models
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

In [17]:
# Logger for the current module
logger = logging.getLogger(__name__)

# Configuration classes available in the library
# The list is built from the configuration classes of models that have a language-modeling head
MODEL_CONFIG_CLASSES = list(MODEL_WITH_LM_HEAD_MAPPING.keys())

# Model types, taken from the configuration classes
# These identify the kind of model being fine-tuned
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)

In [18]:
# Arguments
class Args():
    def __init__(self):
        self.output_dir = '/content/output-small'
        self.model_type = 'gpt2'
        self.model_name_or_path = 'microsoft/DialoGPT-small'
        self.config_name = 'microsoft/DialoGPT-small'
        self.tokenizer_name = 'microsoft/DialoGPT-small'
        self.cache_dir = 'cached'
        self.block_size = 512
        self.do_train = True
        self.do_eval = True
        self.evaluate_during_training = False
        self.per_gpu_train_batch_size = 4
        self.per_gpu_eval_batch_size = 4
        self.gradient_accumulation_steps = 1
        self.learning_rate = 5e-5
        self.weight_decay = 0.0
        self.adam_epsilon = 1e-8
        self.max_grad_norm = 1.0
        self.num_train_epochs = 10
        self.max_steps = -1
        self.warmup_steps = 50
        self.logging_steps = 1000
        self.save_steps = 3500
        self.save_total_limit = None
        self.eval_all_checkpoints = False
        self.no_cuda = False
        self.overwrite_output_dir = True
        self.overwrite_cache = True
        self.should_continue = False
        self.seed = 42
        self.local_rank = -1
        self.fp16 = False
        self.fp16_opt_level = 'O1'

args = Args()
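
Two of these settings interact: the number of optimization steps follows from the dataset size, batch size, accumulation steps, and epochs, and `warmup_steps` and `learning_rate` feed the linear schedule imported earlier (`get_linear_schedule_with_warmup`). The sketch below is a plain-Python approximation of both calculations, assuming the schedule ramps linearly from 0 to the peak rate during warmup and then decays linearly to 0; the function names and the figure of 4001 examples are hypothetical.

```python
def total_optimization_steps(num_examples, batch_size, accumulation_steps, epochs):
    # With drop_last=True the final partial batch of each epoch is discarded.
    batches_per_epoch = num_examples // batch_size
    return batches_per_epoch // accumulation_steps * epochs

def linear_warmup_factor(step, warmup_steps, total_steps):
    # Multiplier applied to the peak learning rate at a given optimization step.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

learning_rate = 5e-5   # args.learning_rate
warmup_steps = 50      # args.warmup_steps
total_steps = total_optimization_steps(num_examples=4001, batch_size=4, accumulation_steps=1, epochs=10)

for step in (0, 25, 50, total_steps // 2, total_steps):
    print(step, learning_rate * linear_warmup_factor(step, warmup_steps, total_steps))
```

With 50 warmup steps out of roughly ten thousand, the warmup phase is very short, so the learning rate is close to its peak almost immediately and then decays for the rest of training.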

## Train and Evaluate

In [19]:
# Train the model
def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:
    """ Train the model """
    if args.local_rank in [-1, 0]:
        tb_writer = SummaryWriter()  # TensorBoard summary writer

    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)

    # Build a batch, padding the sequences to a common length
    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    # Use a random sampler for single-process training and a distributed sampler otherwise
    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
    # Create the DataLoader for the training set
    train_dataloader = DataLoader(
        train_dataset, sampler=train_sampler, batch_size=args.train_batch_size, collate_fn=collate, drop_last=True
    )

    # Compute the total number of optimization steps
    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
    else:
        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    # Unwrap the model if needed, resize its embeddings to the tokenizer, then set up the optimizer and scheduler
    model = model.module if hasattr(model, "module") else model  # Handle distributed/parallel training
    model.resize_token_embeddings(len(tokenizer))

    # Parameter groups for the optimizer, with and without weight decay
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
    )

    # Check whether saved optimizer and scheduler states exist
    if (
        args.model_name_or_path
        and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt"))
        and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt"))
    ):
        # Load the optimizer and scheduler states
        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))

    # Initialize AMP (Apex) for mixed-precision training if enabled
    if args.fp16:
        try:
            from apex import amp
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)

    # Multi-GPU training (should come after Apex fp16 initialization)
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Distributed training (should come after Apex fp16 initialization)
    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
        )

    # Train
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
    logger.info(
        "  Total train batch size (with parallelism, distribution and accumulation) = %d",
        args.train_batch_size
        * args.gradient_accumulation_steps
        * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
    )
    logger.info("  Gradient accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info("  Total optimization steps = %d", t_total)

    global_step = 0
    epochs_trained = 0
    steps_trained_in_current_epoch = 0

    # Check whether training is resuming from a checkpoint
    if args.model_name_or_path and os.path.exists(args.model_name_or_path):
        try:
            # Set global_step to the step of the last saved checkpoint, parsed from the model path
            checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
            global_step = int(checkpoint_suffix)
            epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
            steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)

            logger.info("  Continuing training from a checkpoint, will skip to the saved global_step")
            logger.info("  Continuing training from epoch %d", epochs_trained)
            logger.info("  Continuing training from global step %d", global_step)
            logger.info("  Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)
        except ValueError:
            logger.info("  Starting fine-tuning.")

    tr_loss, logging_loss = 0.0, 0.0

    model.zero_grad()
    # Training iterator
    train_iterator = trange(
        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]
    )
    set_seed(args)  # Added here for reproducibility
    for _ in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
        for step, batch in enumerate(epoch_iterator):

            # Skip any steps already trained when resuming
            if steps_trained_in_current_epoch > 0:
                steps_trained_in_current_epoch -= 1
                continue

            inputs, labels = (batch, batch)
            # Skip batches longer than 1024 tokens
            if inputs.shape[1] > 1024:
                continue
            inputs = inputs.to(args.device)
            labels = labels.to(args.device)
            model.train()
            outputs = model(inputs, labels=labels)
            loss = outputs[0]  # The loss is the first element of the output when labels are passed

            if args.n_gpu > 1:
                loss = loss.mean()  # mean() to average across GPUs in multi-GPU parallel training
            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            # Apply loss scaling when fp16 is enabled
            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()

            tr_loss += loss.item()

            # Update the parameters and the learning-rate schedule
            if (step + 1) % args.gradient_accumulation_steps == 0:
                # Clip gradients (on the Apex master parameters when fp16 is enabled)
                if args.fp16:
                    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
                else:
                    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)

                optimizer.step()
                scheduler.step()  # Update the learning-rate schedule
                model.zero_grad()
                global_step += 1

                # Log metrics and save checkpoints when due
                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
                    # Log metrics
                    if args.local_rank == -1 and args.evaluate_during_training:
                        # Only evaluate with a single GPU, otherwise the metrics may not average well.
                        # NOTE: evaluate() also requires df_trn and df_val, which train() does not receive,
                        # so this branch fails as written; it is never reached while evaluate_during_training is False.
                        results = evaluate(args, model, tokenizer)
                        for key, value in results.items():
                            tb_writer.add_scalar("eval_{}".format(key), value, global_step)
                    tb_writer.add_scalar("lr", scheduler.get_last_lr()[0], global_step)
                    tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
                    logging_loss = tr_loss

                # Save a model checkpoint when due
                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
                    checkpoint_prefix = "checkpoint"
                    # Save the model checkpoint
                    output_dir = os.path.join(args.output_dir, "{}-{}".format(checkpoint_prefix, global_step))
                    os.makedirs(output_dir, exist_ok=True)
                    model_to_save = (
                        model.module if hasattr(model, "module") else model
                    )  # Handle distributed/parallel training
                    model_to_save.save_pretrained(output_dir)
                    tokenizer.save_pretrained(output_dir)

                    torch.save(args, os.path.join(output_dir, "training_args.bin"))
                    logger.info("Saving model checkpoint to %s", output_dir)

                    # Rotate old checkpoints if needed
                    _rotate_checkpoints(args, checkpoint_prefix)

                    torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
                    torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
                    logger.info("Saving optimizer and scheduler states to %s", output_dir)

            # Stop training once the maximum number of optimization steps is reached
            if args.max_steps > 0 and global_step > args.max_steps:
                epoch_iterator.close()
                break
        if args.max_steps > 0 and global_step > args.max_steps:
            train_iterator.close()
            break

    # Close the TensorBoard summary writer on the main process
    if args.local_rank in [-1, 0]:
        tb_writer.close()

    # Return the global step count and the average loss per step
    return global_step, tr_loss / global_step

# Evaluate a model
def evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, df_trn, df_val, prefix="") -> Dict:
    # Evaluation results are written to the training output directory
    eval_output_dir = args.output_dir

    # Load and cache the evaluation examples
    eval_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=True)
    os.makedirs(eval_output_dir, exist_ok=True)
    args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)

    # Group examples into a batch, padding the sequences to a common length
    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    # Set up the sampler and DataLoader for evaluation
    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(
        eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate, drop_last=True
    )


    # Multi-GPU evaluation
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Log the evaluation setup and initialize the counters
    logger.info("***** Running evaluation {} *****".format(prefix))
    logger.info("  Num examples = %d", len(eval_dataset))
    logger.info("  Batch size = %d", args.eval_batch_size)
    eval_loss = 0.0
    nb_eval_steps = 0

    # Evaluate
    model.eval()

    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        inputs, labels = (batch, batch)
        inputs = inputs.to(args.device)
        labels = labels.to(args.device)

        with torch.no_grad():
            outputs = model(inputs, labels=labels)
            lm_loss = outputs[0]
            eval_loss += lm_loss.mean().item()
        nb_eval_steps += 1
    # Compute the mean loss and the perplexity as metrics for the model
    eval_loss = eval_loss / nb_eval_steps
    perplexity = torch.exp(torch.tensor(eval_loss))

    result = {"perplexity": perplexity}


    # Write the results to a .txt file
    output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
    with open(output_eval_file, "w") as writer:
        logger.info("***** Eval results {} *****".format(prefix))
        for key in sorted(result.keys()):
            logger.info("  %s = %s", key, str(result[key]))
            writer.write("%s = %s\n" % (key, str(result[key])))

    return result
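
The metric reported by `evaluate` is the exponential of the mean cross-entropy loss over the evaluation batches. The following sketch shows that relationship with plain floats; `perplexity_from_losses` and the loss values are illustrative, not output from this notebook.

```python
import math

def perplexity_from_losses(batch_losses):
    # Perplexity is exp(mean loss), where each loss is a mean cross-entropy in nats.
    if not batch_losses:
        raise ValueError("need at least one batch loss")
    mean_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(mean_loss)

# A loss of 0 means the model is certain of every token: perplexity 1.
print(perplexity_from_losses([0.0]))
# Hypothetical batch losses of ln(2) and ln(8) average to ln(4): perplexity 4.
print(perplexity_from_losses([math.log(2), math.log(8)]))
```

A perplexity of 4 can be read as the model being, on average, as uncertain as if it were choosing uniformly among four tokens at each position. Because the batches are averaged with equal weight, a batch with few tokens counts as much as a long one.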

In [20]:
# Main entry point
def main(df_trn, df_val):

    # Training arguments (set in the Args class rather than on the command line)
    args = Args()

    # Check whether to resume from an existing checkpoint
    if args.should_continue:
        sorted_checkpoints = _sorted_checkpoints(args)
        if len(sorted_checkpoints) == 0:
            raise ValueError("Used --should_continue but no checkpoint was found in --output_dir.")
        else:
            args.model_name_or_path = sorted_checkpoints[-1]

    # Fail if the output directory exists and is not empty (unless overwriting was requested)
    if (
        os.path.exists(args.output_dir)
        and os.listdir(args.output_dir)
        and args.do_train
        and not args.overwrite_output_dir
        and not args.should_continue
    ):
        raise ValueError(
            "Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
                args.output_dir
            )
        )

    # Configuración para CUDA, GPU y entrenamiento distribuido
    device = torch.device("cuda")
    args.n_gpu = torch.cuda.device_count()
    args.device = device

    # Set up logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
    )
    logger.warning(
        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
        args.local_rank,
        device,
        args.n_gpu,
        bool(args.local_rank != -1),
        args.fp16,
    )

    # Set the seed for reproducibility
    set_seed(args)

    # Load the configuration, tokenizer, and model
    config = AutoConfig.from_pretrained(args.config_name, cache_dir=args.cache_dir)
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
    model = AutoModelWithLMHead.from_pretrained(
        args.model_name_or_path,
        from_tf=False,
        config=config,
        cache_dir=args.cache_dir,
    )
    model.to(args.device)

    logger.info("Training/evaluation parameters %s", args)

    # Training
    if args.do_train:
        train_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False)

        global_step, tr_loss = train(args, train_dataset, model, tokenizer)
        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

    # Save the trained model and the results
    if args.do_train:
        # Create the output directory if it does not exist
        os.makedirs(args.output_dir, exist_ok=True)

        logger.info("Saving model checkpoint to %s", args.output_dir)
        # Save the trained model, configuration, and tokenizer with `save_pretrained()`
        model_to_save = (
            model.module if hasattr(model, "module") else model
        )  # Unwrap the model if it was wrapped for distributed/parallel training
        model_to_save.save_pretrained(args.output_dir)
        tokenizer.save_pretrained(args.output_dir)

        # Good practice: save the training arguments together with the trained model
        torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

        # Reload the trained model and vocabulary
        model = AutoModelWithLMHead.from_pretrained(args.output_dir)
        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
        model.to(args.device)

    # Evaluation
    results = {}
    if args.do_eval and args.local_rank in [-1, 0]:
        checkpoints = [args.output_dir]
        if args.eval_all_checkpoints:
            checkpoints = list(
                os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
            )
            logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging noise
        logger.info("Evaluate the following checkpoints: %s", checkpoints)
        for checkpoint in checkpoints:
            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
            prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""

            model = AutoModelWithLMHead.from_pretrained(checkpoint)
            model.to(args.device)
            result = evaluate(args, model, tokenizer, df_trn, df_val, prefix=prefix)
            result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
            results.update(result)

    return results
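
The output-directory guard near the top of `main` can be tried in isolation. The following is a minimal sketch, assuming the same four flags. `check_output_dir` is a hypothetical helper name, and the example works in a temporary directory so nothing on Drive is touched.

```python
import os
import tempfile

def check_output_dir(output_dir, do_train=True, overwrite_output_dir=False, should_continue=False):
    """Raise if training would write into a non-empty directory without permission."""
    if (
        os.path.exists(output_dir)
        and os.listdir(output_dir)
        and do_train
        and not overwrite_output_dir
        and not should_continue
    ):
        raise ValueError(
            "Output directory ({}) already exists and is not empty.".format(output_dir)
        )

with tempfile.TemporaryDirectory() as tmp:
    check_output_dir(tmp)  # empty directory: passes silently
    open(os.path.join(tmp, "pytorch_model.bin"), "w").close()  # simulate an earlier run
    try:
        check_output_dir(tmp)
        guard_fired = False
    except ValueError:
        guard_fired = True
    check_output_dir(tmp, overwrite_output_dir=True)  # explicit overwrite: passes

print(guard_fired)
```

In practice this means a second training run into the same folder needs either `overwrite_output_dir` or `should_continue` set on `Args`.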


## Run the Main Function

In [26]:
# Finally, run the whole training and evaluation process.
main(trn_df, val_df)

Token indices sequence length is longer than the specified maximum sequence length for this model (4949 > 1024). Running this sequence through the model will result in indexing errors


Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Iteración:   0%|          | 0/1074 [00:00<?, ?it/s]




Evaluating:   0%|          | 0/119 [00:00<?, ?it/s]

{'perplexity_': tensor(4.5510)}
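
The trailing underscore in `perplexity_` comes from how `main` labels each result: the key is suffixed with the checkpoint's global step, which is left empty when only one checkpoint is evaluated. A small sketch of that string handling follows. `label_for` is a hypothetical helper and the paths are made-up examples.

```python
def label_for(checkpoint, n_checkpoints):
    """Mirror the global_step / prefix logic used in main()."""
    global_step = checkpoint.split("-")[-1] if n_checkpoints > 1 else ""
    prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""
    return global_step, prefix

# Single checkpoint: no step suffix, so the key ends with a bare underscore
single = label_for("output-small", 1)
key_single = "perplexity" + "_{}".format(single[0])

# Several checkpoints (eval_all_checkpoints): the step number is appended
multi = label_for("output-small/checkpoint-3500", 2)
key_multi = "perplexity" + "_{}".format(multi[0])

print(key_single, key_multi)
```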

## Load the Trained Model

In [27]:
tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-small')
model = AutoModelWithLMHead.from_pretrained('/content/output-small')

In [30]:
# Let's chat for 4 lines
for step in range(4):
    # encode the new user input, append the eos_token, and return a PyTorch tensor
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')
    # print(new_user_input_ids)

    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # generate a response, capping the total sequence (chat history plus reply) at 200 tokens
    chat_history_ids = model.generate(
        bot_input_ids, max_length=200,
        pad_token_id=tokenizer.eos_token_id,
        no_repeat_ngram_size=3,
        do_sample=True,
        top_k=100,
        top_p=0.7,
        temperature=0.8
    )

    # pretty-print the bot's latest reply (only the newly generated tokens)
    print("RickBot: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

>> User:Ey, Barney!


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


RickBot: Hey, wait. Listen. You two. You want to go to the bar?
>> User:Of course! This can be a great night!


RickBot: Yeah, we'll just stay here, okay?
>> User:And the bar?


RickBot: That's the point. We're trying to find out who won the Superbowl.
>> User:You?


RickBot: I did. I won!
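
The loop above keeps appending to `chat_history_ids`, so a longer conversation will eventually approach `max_length` and leave little or no room for a reply. One possible remedy, not used in this notebook, is to keep only the most recent tokens before calling `generate`. The sketch below shows the idea on a plain list of token ids. `trim_history` and the budget of 150 tokens are assumptions for illustration.

```python
def trim_history(token_ids, max_history_tokens=150):
    """Keep only the most recent max_history_tokens ids (the oldest are dropped first)."""
    if max_history_tokens <= 0:
        raise ValueError("max_history_tokens must be positive")
    return token_ids[-max_history_tokens:]

history = list(range(400))       # stand-in for 400 token ids
trimmed = trim_history(history)  # keeps ids 250..399
print(len(trimmed), trimmed[0], trimmed[-1])
```

For the 2-D tensor used in the chat loop, the equivalent slice would be `bot_input_ids[:, -150:]`, applied just before `model.generate`.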


## Push Model to Hugging Face

In [32]:
!sudo apt-get install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.


In [33]:
!git config --global user.email "davidgonveg@gmail.com"
# Tip: using the same email as your huggingface.co account will link your commits to your profile
!git config --global user.name "Dagonez"

In [40]:
MY_MODEL_NAME = 'DialoGPT-small-Barney-Bot'

# Log in with a Hugging Face token from https://huggingface.co/settings/tokens
from huggingface_hub import notebook_login

notebook_login()


In [41]:
model.push_to_hub(MY_MODEL_NAME)
tokenizer.push_to_hub(MY_MODEL_NAME)

model.safetensors:   0%|          | 0.00/498M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Dagonez/DialoGPT-small-Barney-Bot/commit/feac1a1727bdee34084e714c82d2ef633deb3215', commit_message='Upload tokenizer', commit_description='', oid='feac1a1727bdee34084e714c82d2ef633deb3215', pr_url=None, pr_revision=None, pr_num=None)

## All Done!
