The Sarcastic Wizard Chatbot

**Project Description:** Our project offers Harry Potter fans a unique opportunity to interact with a witty and entertaining AI, designed to emulate the personas of the beloved characters from the Harry Potter universe.

**Approach:**


1. **Data Pre-processing:** A significant portion of our effort is dedicated to data pre-processing. We extract conversations per character, starting with Harry as the primary character: each training example pairs one of Harry's lines with the seven lines that precede it. This lets the chatbot simulate conversations as if they were coming from Harry himself. The framework is designed to be extensible, so other characters like Hermione and Ron can be added in the future.
2. **Harry Potter Dataset:** We use a specially curated dataset of dialogues and interactions from the Harry Potter series. The dataset has been modified to have a humorous twist, so that the chatbot's responses are not only relevant but also entertaining.
3. **Pre-trained DialoGPT Model:** We use Microsoft's DialoGPT, a conversational model built on GPT-2, as the foundation for our chatbot. It is pre-trained to generate human-like responses in a chat context.
4. **Fine-tuning:** We fine-tune the DialoGPT model on our Harry Potter dataset for three epochs. This lets the model adapt to the language and style of the Harry Potter universe, so that the chatbot's responses are contextually appropriate and engaging.
5. **Character-based Conversations:** The character is selected by a single variable (`Character_name`) during pre-processing, so the same pipeline can be rerun to build a chatbot for a different character from the Harry Potter series. This adds depth to the chatbot experience, making it more immersive and personalized.






In [None]:
# Import libraries
import glob
import logging
import os
import pickle
import random
import re
import shutil
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm.notebook import tqdm, trange

from pathlib import Path

# Transformer model utilities import
from transformers import (
    MODEL_WITH_LM_HEAD_MAPPING, # Mapping of model with language model head
    WEIGHTS_NAME, # Weight file name
    AdamW, # Adam optimizer
    AutoConfig, # Models auto-configuration
    PreTrainedModel, # Pre-trained model class
    PreTrainedTokenizer, #Pre-trained tokenizer class for pre-trained models
    get_linear_schedule_with_warmup, # Schedule with warm-up steps
)

# Import SummaryWriter from PyTorch; fall back to tensorboardX if it is not available
try:
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    from tensorboardX import SummaryWriter

In [None]:
# Load the dataset
data = pd.read_csv('./HarryPotter1.csv', sep=';')
# data = pd.read_csv('./unified_data.csv', sep=';', engine='python')
data.head()

Unnamed: 0,Character,Sentence
0,Dumbledore,"I should've known that you would be here, Prof..."
1,McGonagall,"Good evening, Professor Dumbledore."
2,McGonagall,"Are the rumors true, Albus?"
3,Dumbledore,"I'm afraid so, professor."
4,Dumbledore,The good and the bad.


In [None]:
sum((data['Character'] == 'Harry') | (data['Character'] == 'Ron')) # Occurrences of Harry or Ron


239

In [None]:
len(data) # Number of rows in the dataset


1587

In [None]:
data.rename(columns={'Character':'name','Sentence':'line'},inplace=True) # Rename columns for consistency


In [None]:
# Create a context dataframe for one character only
Character_name = 'Harry'
contexted = []

# Context window: each response is paired with the n = 7 lines that precede it
n = 7
# Loop over every line spoken by the chosen character
for i in data[data.name == Character_name].index:
    if i < n:
        continue  # Skip early entries that do not have n preceding lines
    row = []
    prev = i - 1 - n  # Subtract 1 more so the row holds the current response plus the 7 previous lines
    for j in range(i, prev, -1):
        row.append(data.line[j])  # Walk backwards: response first, then progressively older lines
    # Append once per response, after the inner loop has filled the row; appending inside
    # the loop would add the same list n + 1 times and produce runs of identical rows
    contexted.append(row)
# Column names for the dataframe
columns = ['response', 'context']
columns = columns + ['context/' + str(i) for i in range(n - 1)]
# Build a dataframe from the list of context rows
df = pd.DataFrame.from_records(contexted, columns=columns)
df

Unnamed: 0,response,context,context/0,context/1,context/2,context/3,context/4,context/5
0,"Yes, Aunt Petunia.","Why don't you just cook the breakfast, and try...","Happy birthday, son.","Here he comes, the birthday boy.",We're going to the zoo!,"Wake up, cousin!",Now!,Up. Get up!
1,"Yes, Aunt Petunia.","Why don't you just cook the breakfast, and try...","Happy birthday, son.","Here he comes, the birthday boy.",We're going to the zoo!,"Wake up, cousin!",Now!,Up. Get up!
2,"Yes, Aunt Petunia.","Why don't you just cook the breakfast, and try...","Happy birthday, son.","Here he comes, the birthday boy.",We're going to the zoo!,"Wake up, cousin!",Now!,Up. Get up!
3,"Yes, Aunt Petunia.","Why don't you just cook the breakfast, and try...","Happy birthday, son.","Here he comes, the birthday boy.",We're going to the zoo!,"Wake up, cousin!",Now!,Up. Get up!
4,"Yes, Aunt Petunia.","Why don't you just cook the breakfast, and try...","Happy birthday, son.","Here he comes, the birthday boy.",We're going to the zoo!,"Wake up, cousin!",Now!,Up. Get up!
...,...,...,...,...,...,...,...,...
1235,I'm not going home. Not really.,"Feels strange to be going home, doesn't it?","I do. But your cousin don't, do he? Eh? Off yo...","But Hagrid, we're not allowed to do magic away...","Oh, listen, Harry, if that dolt of a cousin of...",Oh. Go on...on with you.,"Thanks, Hagrid.",This is for you.
1236,I'm not going home. Not really.,"Feels strange to be going home, doesn't it?","I do. But your cousin don't, do he? Eh? Off yo...","But Hagrid, we're not allowed to do magic away...","Oh, listen, Harry, if that dolt of a cousin of...",Oh. Go on...on with you.,"Thanks, Hagrid.",This is for you.
1237,I'm not going home. Not really.,"Feels strange to be going home, doesn't it?","I do. But your cousin don't, do he? Eh? Off yo...","But Hagrid, we're not allowed to do magic away...","Oh, listen, Harry, if that dolt of a cousin of...",Oh. Go on...on with you.,"Thanks, Hagrid.",This is for you.
1238,I'm not going home. Not really.,"Feels strange to be going home, doesn't it?","I do. But your cousin don't, do he? Eh? Off yo...","But Hagrid, we're not allowed to do magic away...","Oh, listen, Harry, if that dolt of a cousin of...",Oh. Go on...on with you.,"Thanks, Hagrid.",This is for you.
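
The windowing step can be illustrated on a toy transcript. This is a minimal sketch rather than the notebook's own code: `build_context_rows`, `toy_script`, and the placeholder lines `l0` to `l4` are hypothetical, and it works on plain `(name, line)` tuples instead of a DataFrame. Each row holds the character's response followed by the `n` preceding lines, newest first, and is appended exactly once per response.

```python
from typing import List, Tuple

def build_context_rows(script: List[Tuple[str, str]], character: str, n: int) -> List[List[str]]:
    """Return one row per line spoken by `character`: [response, previous line, ..., n-th previous line]."""
    rows = []
    for i, (name, _) in enumerate(script):
        if name != character or i < n:
            continue  # Not the target character, or not enough history yet
        # Walk backwards from the response through the n preceding lines
        row = [script[j][1] for j in range(i, i - n - 1, -1)]
        rows.append(row)  # Appended once, after the row is complete
    return rows

# Placeholder transcript (hypothetical data, not taken from the dataset)
toy_script = [
    ("Ron", "l0"),
    ("Hermione", "l1"),
    ("Harry", "l2"),
    ("Ron", "l3"),
    ("Harry", "l4"),
]
rows = build_context_rows(toy_script, "Harry", n=2)
for row in rows:
    print(row)
```

With `n=2` the toy transcript yields two rows, one for each of Harry's lines, and no row is repeated.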


In [None]:
# Split the data into training and validation sets (90% / 10%)
# The model is fine-tuned on the training set and evaluated on the held-out validation set
trn_df, val_df = train_test_split(df, test_size=0.1)
trn_df.head()

Unnamed: 0,response,context,context/0,context/1,context/2,context/3,context/4,context/5
124,"Mummy, Dad, come here!",I never knew my parents either.,I see. That's me as well.,Do you miss your family?,Was it nice there?,"You're from Burma, aren't you?","I mean, do you talk to people often?",Do you...?
982,Nicholas Flamel...Who's Nicholas Flamel?,I should not have said that.,I should not have said that.,I shouldn't have said that.,Nicholas Flamel?,What that dog is guarding is strictly between ...,It's dangerous.,You're meddlin' in things that ought not to be...
922,Let's go this way.,"The staircases change, remember?",What's happening?,Ahh!,Who doesn't?,She knows more about you than you do.,"I'm telling you, it's spooky.",I-I didn't know.
828,"Clearly, Hermione knows.","Clearly, fame isn't everything is it, Mr. Potter?",Pity.,"I don't know, Sir.",And what is the difference between monkshood a...,"I don't know, Sir.","Where, Mr Potter, would you look if I asked yo...","You don't know? Well, let's try again."
255,"Sorry, no.","Of course, you know all about Hogwarts.","Rubeus Hagrid, Keeper of Keys and Grounds at H...","Excuse me, who are you?",It's not every day that your young man turns 1...,Thank you.,"Baked it myself, words and all.","Afraid I might have sat on it at somepoint, bu..."


In [None]:
# Create a dataset suitable for our model
def construct_conv(row, tokenizer, eos=True):
    # Helper that flattens a list of lists into a single list
    flatten = lambda l: [item for sublist in l for item in sublist]
    # Each row is ordered [response, newest context, ..., oldest context]. Tokenize every
    # line, append the EOS token as a turn separator, then reverse so the oldest line comes first
    conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))

    conv = flatten(conv)  # Concatenate the tokenized turns into one sequence
    return conv

# Custom Dataset class for conversation data
class ConversationDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, args, df, block_size=512):
        # Reserve room for the special tokens the tokenizer adds to a single sequence.
        # Note: block_size is only used in the cache file name below; sequences are not truncated to it
        block_size = block_size - (tokenizer.model_max_length - tokenizer.max_len_single_sentence)
        # Path of the cache file that stores the processed features
        directory = args.cache_dir
        cached_features_file = os.path.join(
            directory, args.model_type + "_cached_lm_" + str(block_size)
        )
        # Load the features from the cache if the file exists and we are not overwriting it
        if os.path.exists(cached_features_file) and not args.overwrite_cache:
            logger.info("Loading features from cached file %s", cached_features_file)
            with open(cached_features_file, "rb") as handle:
                self.examples = pickle.load(handle)
        else:  # Otherwise process the dataframe and store the resulting features
            logger.info("Creating features from dataset file at %s", directory)

            self.examples = []
            for _, row in df.iterrows():  # Build one token sequence per row and add it to the examples
                conv = construct_conv(row, tokenizer)
                self.examples.append(conv)
            # Save the features to the cache file
            logger.info("Saving features into cached file %s", cached_features_file)
            with open(cached_features_file, "wb") as handle:
                pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)
    # Return the number of examples in the dataset
    def __len__(self):
        return len(self.examples)
    # Return one example as a tensor of token ids
    def __getitem__(self, item):
        return torch.tensor(self.examples[item], dtype=torch.long)
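
To make the flattening step concrete, here is a small sketch that uses a stand-in tokenizer. `ToyTokenizer` and `flatten_conversation` are hypothetical names introduced only for illustration (the notebook uses the DialoGPT tokenizer and `construct_conv`); the toy tokenizer assigns one integer id per whitespace-separated word and reserves `0` for EOS.

```python
from typing import List

class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer: one integer id per whitespace-separated word."""
    eos_token_id = 0

    def __init__(self):
        self.vocab = {}

    def encode(self, text: str) -> List[int]:
        # Assign ids starting at 1 so that 0 stays reserved for EOS
        return [self.vocab.setdefault(word, len(self.vocab) + 1) for word in text.split()]

def flatten_conversation(row: List[str], tokenizer) -> List[int]:
    """Encode each turn, append EOS, and concatenate from the oldest turn to the newest."""
    turns = [tokenizer.encode(text) + [tokenizer.eos_token_id] for text in row]
    return [token for turn in reversed(turns) for token in turn]

# Row layout as in the context dataframe: response first, then newest-to-oldest context
row = ["fine thanks", "how are you", "hello"]
ids = flatten_conversation(row, ToyTokenizer())
print(ids)
```

The oldest turn (`hello`) ends up first and the response (`fine thanks`) last, with an EOS id closing every turn, which is the layout a causal language model needs to learn "context, then reply".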

In [None]:
# Caching and storing of data/checkpoints
# Load (and cache) the conversation dataset examples
def load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False):
    return ConversationDataset(tokenizer, args, df_val if evaluate else df_trn) # Return the validation or the training set as a ConversationDataset

# Set the random seeds so that runs are reproducible
def set_seed(args):
    random.seed(args.seed) # Seed Python's random module
    np.random.seed(args.seed) # Seed NumPy's random module
    torch.manual_seed(args.seed) # Seed PyTorch
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed) # Seed all GPUs

# Sort the checkpoints by modification time or by checkpoint number
def _sorted_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> List[str]:
    ordering_and_checkpoint_path = []
    # Glob to find all checkpoints matching the prefix
    glob_checkpoints = glob.glob(os.path.join(args.output_dir, "{}-*".format(checkpoint_prefix)))
    # Pair every checkpoint path with its sort key
    for path in glob_checkpoints:
        if use_mtime: # Sort key: modification time
            ordering_and_checkpoint_path.append((os.path.getmtime(path), path))
        else: # Sort key: checkpoint number extracted with a regex
            regex_match = re.match(".*{}-([0-9]+)".format(checkpoint_prefix), path)
            if regex_match and regex_match.groups():
                ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))
    # Sort by the key (oldest first) and keep only the paths
    checkpoints_sorted = sorted(ordering_and_checkpoint_path)
    checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
    return checkpoints_sorted

# Delete the oldest checkpoints once their number exceeds args.save_total_limit
def _rotate_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> None:
    if not args.save_total_limit: # No limit set: nothing to do
        return
    if args.save_total_limit <= 0: # Non-positive limit: nothing to do
        return

    # Check if we should delete older checkpoint(s)
    checkpoints_sorted = _sorted_checkpoints(args, checkpoint_prefix, use_mtime)
    if len(checkpoints_sorted) <= args.save_total_limit:
        return # Within the limit: nothing to delete
    # Work out how many checkpoints have to be deleted
    number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - args.save_total_limit)
    checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]
    for checkpoint in checkpoints_to_be_deleted: # Delete the oldest checkpoints
        logger.info("Deleting older checkpoint [{}] due to args.save_total_limit".format(checkpoint))
        shutil.rmtree(checkpoint) # Remove the checkpoint directory
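
The rotation logic can be tried out without touching the file system. The sketch below is an illustration under assumptions: `sort_by_step` and `select_for_deletion` are hypothetical helpers, and the paths are made up. It shows why the step number is parsed as an integer (a plain string sort would place `checkpoint-10500` before `checkpoint-3500`) and which checkpoints a given limit would remove.

```python
import re
from typing import List

def sort_by_step(paths: List[str], prefix: str = "checkpoint") -> List[str]:
    """Order checkpoint paths by the integer step in their name, oldest first."""
    numbered = []
    for path in paths:
        match = re.match(r".*{}-([0-9]+)".format(prefix), path)
        if match:
            numbered.append((int(match.group(1)), path))
    return [path for _, path in sorted(numbered)]

def select_for_deletion(paths: List[str], save_total_limit: int) -> List[str]:
    """Return the oldest checkpoints that exceed the limit (nothing is deleted here)."""
    ordered = sort_by_step(paths)
    excess = max(0, len(ordered) - save_total_limit)
    return ordered[:excess]

# Hypothetical checkpoint directories
paths = ["out/checkpoint-10500", "out/checkpoint-3500", "out/checkpoint-7000"]
print(sort_by_step(paths))
print(select_for_deletion(paths, save_total_limit=2))
```

Note that `Args` in this notebook sets `save_total_limit = None`, so no rotation happens unless that value is changed.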

In [None]:
# Build the model
from transformers import AutoModelWithLMHead, AutoModelForCausalLM, AutoTokenizer
import torch
# Instantiate the tokenizer and the pre-trained DialoGPT model
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
# AutoModelForCausalLM is the current class for GPT-2-style models (AutoModelWithLMHead is deprecated)
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")


In [None]:
"""
Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
using a masked language modeling (MLM) loss.
"""

# Set up logging for the script
logger = logging.getLogger(__name__)
# Configuration classes of all models with a language-modeling head in the transformers library
MODEL_CONFIG_CLASSES = list(MODEL_WITH_LM_HEAD_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES) # Model type names taken from those configuration classes


In [None]:
# Arguments collected in a class to allow easy conversion of the Python script to a notebook
class Args():
    def __init__(self):
        self.output_dir = 'output-large'
        self.model_type = 'gpt2'
        self.model_name_or_path = 'microsoft/DialoGPT-large'
        self.config_name = 'microsoft/DialoGPT-large'
        self.tokenizer_name = 'microsoft/DialoGPT-large'
        self.cache_dir = 'cached'
        self.block_size = 512
        self.do_train = True
        self.do_eval = True
        self.evaluate_during_training = False
        self.per_gpu_train_batch_size = 4
        self.per_gpu_eval_batch_size = 4
        self.gradient_accumulation_steps = 1
        self.learning_rate = 5e-5
        self.weight_decay = 0.0
        self.adam_epsilon = 1e-8
        self.max_grad_norm = 1.0
        self.num_train_epochs = 3 #50
        self.max_steps = -1
        self.warmup_steps = 0
        self.logging_steps = 1000
        self.save_steps = 3500
        self.save_total_limit = None
        self.eval_all_checkpoints = False
        self.no_cuda = False
        self.overwrite_output_dir = True
        self.overwrite_cache = True
        self.should_continue = False
        self.seed = 42
        self.local_rank = -1
        self.fp16 = False
        self.fp16_opt_level = 'O1'

args = Args()
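
The batch size, accumulation, and epoch settings together determine how many optimizer updates a run performs. The sketch below reproduces that arithmetic in isolation; `total_optimization_steps` is a hypothetical helper, and the figure of 1000 training examples is an assumption chosen for illustration, not the size of the notebook's training set.

```python
def total_optimization_steps(num_examples: int, batch_size: int, accumulation_steps: int, num_epochs: int) -> int:
    """Number of optimizer updates for a DataLoader created with drop_last=True."""
    # drop_last=True discards the final incomplete batch
    batches_per_epoch = num_examples // batch_size
    # One optimizer update happens every `accumulation_steps` batches
    updates_per_epoch = batches_per_epoch // accumulation_steps
    return updates_per_epoch * num_epochs

# Assumed dataset size; the other values match Args above
steps = total_optimization_steps(num_examples=1000, batch_size=4, accumulation_steps=1, num_epochs=3)
print(steps)
```

Under these assumed numbers the run performs 750 updates, which is below both `logging_steps = 1000` and `save_steps = 3500`, so no intermediate log entry or checkpoint would be written. For small datasets it may be worth lowering those two values.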

In [None]:
# Defining function to train the model
def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:
    """ Train the model """
    if args.local_rank in [-1, 0]: # Initialize the TensorBoard writer on the main process only
        tb_writer = SummaryWriter()
    # Effective batch size across all GPUs
    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
    # Collate function that pads the examples in a batch to the same length
    def collate(examples: List[torch.Tensor]):
        # Use the padding token if one is defined; otherwise pad_sequence pads with 0
        if tokenizer.pad_token_id is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)
    # Random sampling for single-process training, distributed sampling otherwise
    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
    train_dataloader = DataLoader(
        train_dataset, sampler=train_sampler, batch_size=args.train_batch_size, collate_fn=collate, drop_last = True
    )
    # Setting the total numbers of training steps
    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
    else:
        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
    # Unwrap the model if it was wrapped for distributed/parallel training
    model = model.module if hasattr(model, "module") else model
    model.resize_token_embeddings(len(tokenizer))
    # add_special_tokens_(model, tokenizer)


    # Prepare optimizer and schedule (linear warmup and decay)
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
    )
    # Check if saved optimizer or scheduler states exist
    if (
        args.model_name_or_path
        and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt"))
        and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt"))
    ):
        # Load in optimizer and scheduler states
        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))
    # Setting up mixed-precision training
    if args.fp16:
        try:
            from apex import amp # NVIDIA's amp for efficient mixed-precision training
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)

    # multi-gpu training (should be after apex fp16 initialization)
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Distributed training (should be after apex fp16 initialization)
    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
        )

    # Log the training configuration
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
    logger.info(
        "  Total train batch size (w. parallel, distributed & accumulation) = %d",
        args.train_batch_size
        * args.gradient_accumulation_steps
        * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
    )
    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info("  Total optimization steps = %d", t_total)

    global_step = 0
    epochs_trained = 0
    steps_trained_in_current_epoch = 0
    # Check if continuing training from a checkpoint
    if args.model_name_or_path and os.path.exists(args.model_name_or_path):
        try:
            # Set global_step to the global_step of the last saved checkpoint from the model path
            checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
            global_step = int(checkpoint_suffix)
            epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
            steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)

            logger.info("  Continuing training from checkpoint, will skip to saved global_step")
            logger.info("  Continuing training from epoch %d", epochs_trained)
            logger.info("  Continuing training from global step %d", global_step)
            logger.info("  Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)
        except ValueError:
            logger.info("  Starting fine-tuning.")

    tr_loss, logging_loss = 0.0, 0.0

    model.zero_grad()
    train_iterator = trange(
        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]
    )
    set_seed(args)  # Set the seed before the first epoch for reproducibility
    for _ in train_iterator:  # Main training loop over epochs
        # tqdm displays a progress bar for the iterations within the epoch
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
        for step, batch in enumerate(epoch_iterator):

            # Skip past any already trained steps if resuming training
            if steps_trained_in_current_epoch > 0:
                steps_trained_in_current_epoch -= 1
                continue
            # Use the batch as both inputs and labels (the model shifts the labels internally)
            inputs, labels = (batch, batch)
            if inputs.shape[1] > 1024: continue  # Skip batches that exceed the model's maximum sequence length
            # Moving inputs and labels to GPU or CPU
            inputs = inputs.to(args.device)
            labels = labels.to(args.device)
            model.train()  # Put the model in training mode
            outputs = model(inputs, labels=labels)  # Forward pass; passing labels makes the model return the loss
            loss = outputs[0]  # The loss is the first element of the model's outputs
            if args.n_gpu > 1:  # Average the per-GPU losses under DataParallel
                loss = loss.mean()
            if args.gradient_accumulation_steps > 1:  # Scale the loss so accumulated gradients average over the steps
                loss = loss / args.gradient_accumulation_steps

            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:  # Using NVIDIA's apex for mixed-precision training
                    scaled_loss.backward()
            else:
                loss.backward()

            tr_loss += loss.item()  # Accumulate the training loss for logging
            if (step + 1) % args.gradient_accumulation_steps == 0:  # Perform an optimization step
                # Clip gradients to guard against exploding gradients
                if args.fp16:
                    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
                else:
                    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
                optimizer.step()  # Update the weights from the computed gradients
                scheduler.step()  # Update the learning rate schedule
                model.zero_grad()  # Clear the gradients for the next step
                global_step += 1  # Increment the global step counter
                # Logging metrics and evaluating model
                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
                    # Log metrics
                    if (
                        args.local_rank == -1 and args.evaluate_during_training
                    ):  # Only evaluate when single GPU otherwise metrics may not average well
                        # NOTE: evaluate() is defined with df_trn and df_val parameters that train() does not
                        # receive, so this call fails if args.evaluate_during_training is set to True
                        results = evaluate(args, model, tokenizer)
                        for key, value in results.items():
                            tb_writer.add_scalar("eval_{}".format(key), value, global_step)
                    # Log the current learning rate and the mean loss since the last log
                    tb_writer.add_scalar("lr", scheduler.get_last_lr()[0], global_step)
                    tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
                    logging_loss = tr_loss
                    print(global_step)
                # Save a checkpoint every save_steps updates, independently of the logging interval
                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
                    checkpoint_prefix = "checkpoint"
                    # Save model checkpoint
                    output_dir = os.path.join(args.output_dir, "{}-{}".format(checkpoint_prefix, global_step))
                    os.makedirs(output_dir, exist_ok=True)
                    # Save the model and tokenizer config to the output directory
                    model_to_save = (
                        model.module if hasattr(model, "module") else model
                    )  # Take care of distributed/parallel training
                    model_to_save.save_pretrained(output_dir)
                    tokenizer.save_pretrained(output_dir)
                    # Save the training arguments
                    torch.save(args, os.path.join(output_dir, "training_args.bin"))
                    logger.info("Saving model checkpoint to %s", output_dir)
                    # Remove old checkpoints if a limit is set
                    _rotate_checkpoints(args, checkpoint_prefix)
                    # Save the optimizer and scheduler states
                    torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
                    torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
                    logger.info("Saving optimizer and scheduler states to %s", output_dir)
            # Leave the inner loop once the maximum number of training steps has been exceeded
            if args.max_steps > 0 and global_step > args.max_steps:
                epoch_iterator.close()
                break
        if args.max_steps > 0 and global_step > args.max_steps: # Same condition for the outer loop
            train_iterator.close()
            break

    if args.local_rank in [-1, 0]:
        tb_writer.close()

    return global_step, tr_loss / global_step
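
The learning rate in `train` follows a linear warm-up and decay schedule. The sketch below shows the shape of such a schedule in plain Python. `linear_warmup_decay` is a hypothetical helper written to mirror the multiplier that `get_linear_schedule_with_warmup` is understood to apply, and the total of 100 steps is an assumption for illustration. With `warmup_steps = 0`, as in `Args`, the rate simply decays linearly from `learning_rate` to zero.

```python
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to 1 during warm-up
        return step / max(1, warmup_steps)
    # Then decay linearly from 1 to 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 5e-5      # learning_rate from Args
total = 100         # assumed number of optimizer steps
lrs = [base_lr * linear_warmup_decay(s, warmup_steps=0, total_steps=total) for s in (0, 50, 100)]
print(lrs)
```

Setting a non-zero `warmup_steps` would start the run at a rate near zero and reach `learning_rate` only after the warm-up phase, which can help stabilise the first updates of a fine-tuning run.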

In [None]:
# Defining function to evaluate the model's performance
def evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, df_trn, df_val, prefix="") -> Dict:
    # Evaluation results are written to the training output directory
    eval_output_dir = args.output_dir
    # Load the evaluation dataset
    eval_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=True)
    os.makedirs(eval_output_dir, exist_ok=True) # Make sure the output directory exists
    args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) # Effective batch size across GPUs
    # Note that DistributedSampler samples randomly
    # Collate function that pads the examples in a batch to the same length
    def collate(examples: List[torch.Tensor]):
        # Use the padding token if one is defined; otherwise pad_sequence pads with 0
        if tokenizer.pad_token_id is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)
    # The DataLoader iterates over the evaluation set in order
    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(
        eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate, drop_last = True
    )

    # If using multiple GPUs then we wrap the model for data parallelism
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Logging evaluation start
    logger.info("***** Running evaluation {} *****".format(prefix))
    logger.info("  Num examples = %d", len(eval_dataset))
    logger.info("  Batch size = %d", args.eval_batch_size)
    # Initializing variables to track evaluation loss and the steps
    eval_loss = 0.0
    nb_eval_steps = 0
    model.eval() # Setting model to evaluation mode
    # Iterate over the evaluation batches
    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        # Use the batch as both inputs and labels
        inputs, labels = (batch, batch)
        # Move inputs and labels to the configured device
        inputs = inputs.to(args.device)
        labels = labels.to(args.device)
        # Disable gradient calculation during evaluation
        with torch.no_grad():
            outputs = model(inputs, labels=labels)
            lm_loss = outputs[0]
            eval_loss += lm_loss.mean().item()
        nb_eval_steps += 1
    # Average the loss over all evaluation steps; perplexity is its exponential
    eval_loss = eval_loss / nb_eval_steps
    perplexity = torch.exp(torch.tensor(eval_loss))
    # Prepare the result dictionary
    result = {"perplexity": perplexity}
    # Write the evaluation results to a file
    output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
    with open(output_eval_file, "w") as writer:
        logger.info("***** Eval results {} *****".format(prefix))
        for key in sorted(result.keys()):
            logger.info("  %s = %s", key, str(result[key]))
            writer.write("%s = %s\n" % (key, str(result[key])))

    return result
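
The metric reported by `evaluate` is perplexity, the exponential of the mean cross-entropy loss over the evaluation batches. A minimal sketch of that calculation, using only the standard library (`perplexity_from_losses` is a hypothetical helper and the loss values are made up):

```python
import math
from typing import List

def perplexity_from_losses(batch_losses: List[float]) -> float:
    """Perplexity = exp(mean cross-entropy loss), with the loss measured in nats."""
    mean_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(mean_loss)

# Made-up per-batch losses with a mean of 3.0
ppl = perplexity_from_losses([2.0, 3.0, 4.0])
print(round(ppl, 2))

# A model that guesses uniformly over V tokens has loss ln(V), hence perplexity V
uniform_ppl = perplexity_from_losses([math.log(50257)])
print(round(uniform_ppl))
```

Lower is better: a perplexity of about 20 means the model is, on average, as uncertain as if it were choosing uniformly among roughly 20 tokens at each position.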

In [None]:
# Defining main function for script
def main(df_trn, df_val):
    # Initializing arguments
    args = Args()

    # Checking for continuation from last checkpoint
    if args.should_continue:
        sorted_checkpoints = _sorted_checkpoints(args)
        if len(sorted_checkpoints) == 0:
            raise ValueError("Used --should_continue but no checkpoint was found in --output_dir.")
        else:
            args.model_name_or_path = sorted_checkpoints[-1]
    # Ensuring the output directory is ready to save training outputs
    if (
        os.path.exists(args.output_dir)
        and os.listdir(args.output_dir)
        and args.do_train
        and not args.overwrite_output_dir
        and not args.should_continue
    ):
        raise ValueError(
            "Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
                args.output_dir
            )
        )

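    # Select the device: use CUDA when it is available and not disabled via args.no_cuda
    device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
    args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()
    args.device = device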

    # Initializing logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
    )
    logger.warning(
        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
        args.local_rank,
        device,
        args.n_gpu,
        bool(args.local_rank != -1),
        args.fp16,
    )

    # Setting a random seed for reproducibility
    set_seed(args)

    # Loading config, tokenizer and model
    config = AutoConfig.from_pretrained(args.config_name, cache_dir=args.cache_dir)
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
    model = AutoModelWithLMHead.from_pretrained(
        args.model_name_or_path,
        from_tf=False,
        config=config,
        cache_dir=args.cache_dir,
    )
    model.to(args.device)
    # Log training and evaluation parameters
    logger.info("Training/evaluation parameters %s", args)

    # Starting training if requested
    if args.do_train:
        train_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False) # Load and cache the training dataset
        # Training the model
        global_step, tr_loss = train(args, train_dataset, model, tokenizer)
        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

    # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
    if args.do_train:
        # Saving trained model and tokenizer
        os.makedirs(args.output_dir, exist_ok=True)

        logger.info("Saving model checkpoint to %s", args.output_dir)
        # Save a trained model, configuration and tokenizer using `save_pretrained()`.
        # They can then be reloaded using `from_pretrained()`
        model_to_save = (
            model.module if hasattr(model, "module") else model
        )  # Take care of distributed/parallel training
        model_to_save.save_pretrained(args.output_dir)
        tokenizer.save_pretrained(args.output_dir)

        # Good practice: save your training arguments together with the trained model
        torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

        # Reloading the model and tokenizer from saved checkpoint for evaluation
        model = AutoModelWithLMHead.from_pretrained(args.output_dir)
        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
        model.to(args.device)

    # Initializing results dictionary for evaluation results
    results = {}
    if args.do_eval and args.local_rank in [-1, 0]:  # Conducting evaluation if requested
        checkpoints = [args.output_dir]
        if args.eval_all_checkpoints:  # Evaluating all checkpoints if requested
            checkpoints = list(
                os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
            )
            logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
        logger.info("Evaluate the following checkpoints: %s", checkpoints)
        # Evaluating each checkpoint and updating results
        for checkpoint in checkpoints:
            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
            prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""

            model = AutoModelWithLMHead.from_pretrained(checkpoint)
            model.to(args.device)
            result = evaluate(args, model, tokenizer, df_trn, df_val, prefix=prefix)
            result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
            results.update(result)

    return results
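
When `eval_all_checkpoints` is set, `main` finds checkpoints by globbing recursively for the weights file and keeping the parent directories. The sketch below reproduces that lookup on a throwaway directory tree. The value of `WEIGHTS_NAME` is an assumption (the file name used by older `transformers` releases), and `find_checkpoint_dirs` and `demo` are hypothetical helper names.

```python
import glob
import os
import tempfile

# Assumption: the weights file name used by older transformers releases.
WEIGHTS_NAME = "pytorch_model.bin"


def find_checkpoint_dirs(output_dir, weights_name=WEIGHTS_NAME):
    """List every directory at or below output_dir that contains a weights file."""
    pattern = output_dir + "/**/" + weights_name
    return [os.path.dirname(path) for path in sorted(glob.glob(pattern, recursive=True))]


def demo():
    """Build a temporary tree with three weight files and list what is found."""
    with tempfile.TemporaryDirectory() as root:
        for sub in ("", "checkpoint-500", "checkpoint-1000"):
            folder = os.path.join(root, sub)
            os.makedirs(folder, exist_ok=True)
            open(os.path.join(folder, WEIGHTS_NAME), "w").close()
        # Report paths relative to the root so the output is readable.
        return [os.path.relpath(d, root) for d in find_checkpoint_dirs(root)]


print(demo())
```

Two details are worth noticing: the `**` pattern also matches the output directory itself, and `sorted` orders paths as strings, so `checkpoint-1000` is evaluated before `checkpoint-500`.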

In [None]:
main(trn_df, val_df)




Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Iteration:   0%|          | 0/279 [00:00<?, ?it/s]

Iteration:   0%|          | 0/279 [00:00<?, ?it/s]

Iteration:   0%|          | 0/279 [00:00<?, ?it/s]



Evaluating:   0%|          | 0/31 [00:00<?, ?it/s]

{'perplexity_': tensor(1.0402)}
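
The trailing underscore in the key above comes from how `main` renames each metric: it appends `_` plus a step string, and that string is empty when only one checkpoint is evaluated. A small sketch of the naming rule follows. `result_key` is a hypothetical helper, and the directory name `output-large` is borrowed from the model-loading cell; whether `args.output_dir` actually has that value is an assumption.

```python
def result_key(metric, checkpoint, checkpoints):
    """Rebuild the key main() stores for one metric of one checkpoint."""
    global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
    return metric + "_{}".format(global_step)


# Single checkpoint: the step string is empty, leaving a bare underscore.
print(result_key("perplexity", "output-large", ["output-large"]))  # perplexity_

# Several checkpoints: the text after the last hyphen becomes the suffix.
several = ["output-large/checkpoint-500", "output-large"]
print([result_key("perplexity", c, several) for c in several])
# ['perplexity_500', 'perplexity_large']
```

The second case shows a quirk of splitting on `-`: a hyphen in the output directory name itself ends up in the key.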

In [None]:
# Load the trained model
tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-large')
model = AutoModelWithLMHead.from_pretrained('output-large')

In [None]:
# Looping for a conversation of up to 10 turns
for step in range(10):
    # Prompting the user for input and tokenizing it with an end-of-sequence token appended
    new_user_input_ids = tokenizer.encode(input(">> User:")+tokenizer.eos_token, return_tensors='pt') # 'pt' requests a PyTorch tensor
    # Appending the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids # Feeding the whole conversation so far to the model so that responses stay in context

    # Generating a response from the model
    chat_history_ids = model.generate(
        bot_input_ids, max_length=200,
        pad_token_id=tokenizer.eos_token_id,
        no_repeat_ngram_size=3,
        do_sample=True,
        top_k=100,
        top_p=0.7,
        temperature=0.8
    )
    # Printing out the model's response to user's input
    print("Bot: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

>> User:Where can I find Nicholas Famel?


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Bot: Nicholas Flamel?
>> User:yes the one who has free tickets


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Bot: All right, Ringwald.


KeyboardInterrupt: Interrupted by user
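
The chat loop keeps one growing sequence of token IDs: each prompt is the full history plus the new user turn, and the reply is whatever `generate` returns beyond the prompt length. The sketch below walks through that bookkeeping with plain lists and made-up token IDs. `build_prompt`, `extract_reply`, and the `max_prompt_tokens` truncation are hypothetical additions that are not in the notebook; the truncation is one possible way to leave room for a reply, on the understanding that `max_length=200` counts prompt and generated tokens together.

```python
EOS_ID = 50256  # end-of-sequence token ID in the GPT-2 vocabulary used by DialoGPT


def build_prompt(history_ids, user_ids, max_prompt_tokens=None):
    """Append the new user turn to the history, optionally keeping only the newest tokens."""
    prompt = history_ids + user_ids + [EOS_ID]
    if max_prompt_tokens is not None:
        if max_prompt_tokens <= 0:
            raise ValueError("max_prompt_tokens must be positive")
        prompt = prompt[-max_prompt_tokens:]
    return prompt


def extract_reply(generated_ids, prompt_len):
    """Return only the tokens produced after the prompt, as the print statement does."""
    return generated_ids[prompt_len:]


# Made-up token IDs standing in for one exchange.
prompt = build_prompt([], [11, 12, 13])
generated = prompt + [21, 22, EOS_ID]  # stand-in for what generate() hands back
print(extract_reply(generated, len(prompt)))  # [21, 22, 50256]

# Second turn: the whole first exchange is part of the prompt, trimmed to 6 tokens.
prompt2 = build_prompt(generated, [14, 15], max_prompt_tokens=6)
print(prompt2)  # [21, 22, 50256, 14, 15, 50256]
```

Without some cap on the prompt, a long conversation would eventually fill the 200-token budget and leave the model no room to answer.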