<a href="https://colab.research.google.com/github/be-mich/LIGN-167-Project/blob/main/Experiment_5_hyperparameters.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook runs the fifth experiment. I vary the generation hyperparameters, such as top-k and top-p sampling, to see how they affect the model's subjective performance. I use the model from experiment 2 (back-translated data with shuffling), because that model had the lowest perplexity.
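
To make the two sampling hyperparameters concrete, here is a minimal pure-Python sketch of how top-k and top-p (nucleus) filtering restrict a next-token distribution before sampling. The function name `top_k_top_p_filter` and the toy probabilities are illustrative assumptions; the model's actual generation routine works on logits over the full vocabulary and is not reproduced here.

```python
def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Return a renormalized {token_id: prob} dict after top-k and top-p filtering.

    probs: list of next-token probabilities (hypothetical toy input).
    top_k: keep only the k most probable tokens (0 disables the filter).
    top_p: keep the smallest set of top tokens whose mass reaches top_p.
    """
    # Rank token ids from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    if top_k > 0:
        ranked = ranked[:top_k]

    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept

    # Renormalize so the surviving probabilities sum to 1.
    total = sum(probs[i] for i in ranked)
    return {i: probs[i] / total for i in ranked}


toy_probs = [0.5, 0.3, 0.15, 0.05]  # assumed distribution over 4 tokens
print(top_k_top_p_filter(toy_probs, top_k=2))    # only tokens 0 and 1 survive
print(top_k_top_p_filter(toy_probs, top_p=0.9))  # tokens 0, 1 and 2 survive
```

A smaller `top_k` or `top_p` removes more of the low-probability tail, which tends to make responses safer but more repetitive; larger values allow more varied output.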

In [None]:
#setup 
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
#clear cuda memory
import torch  # torch must be imported before torch.cuda is used
torch.cuda.empty_cache()

In [None]:
import os
os.chdir("/content/drive/My Drive/Colab Notebooks")

In [None]:
# all the imports

import glob
import logging
import os
import pickle
import random
import re
import shutil
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm.notebook import tqdm, trange

from pathlib import Path

!pip install transformers

from transformers import (
    MODEL_WITH_LM_HEAD_MAPPING,
    WEIGHTS_NAME,
    AdamW,
    #SGD, #try other optimizer- not available in transformers
    AutoConfig,
    PreTrainedModel,
    PreTrainedTokenizer,
    get_linear_schedule_with_warmup,
)


try:
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    from tensorboardX import SummaryWriter
# Configs
logger = logging.getLogger(__name__)

MODEL_CONFIG_CLASSES = list(MODEL_WITH_LM_HEAD_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)




Collecting transformers
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting tokenizers!=0.11.3,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml


In [None]:
# Args to allow for easy conversion of python script to notebook
class Args():
    def __init__(self):
        self.output_dir = 'output-small'
        self.model_type = 'gpt2'
        self.model_name_or_path = 'microsoft/DialoGPT-small'
        self.config_name = 'microsoft/DialoGPT-small'
        self.tokenizer_name = 'microsoft/DialoGPT-small'
        self.cache_dir = 'cached'
        self.block_size = 512
        self.do_train = True
        self.do_eval = True
        self.evaluate_during_training = False
        self.per_gpu_train_batch_size = 2 #try for memory issues
        self.per_gpu_eval_batch_size = 4
        self.gradient_accumulation_steps = 1
        self.learning_rate = 5e-5
        self.weight_decay = 0.0
        self.adam_epsilon = 1e-8
        self.max_grad_norm = 1.0
        self.num_train_epochs = 3
        self.max_steps = -1
        self.warmup_steps = 0
        self.logging_steps = 1000
        self.save_steps = 3500
        self.save_total_limit = None
        self.eval_all_checkpoints = False
        self.no_cuda = False
        self.overwrite_output_dir = True
        self.overwrite_cache = True
        self.should_continue = False
        self.seed = 42
        self.local_rank = -1
        self.fp16 = False
        self.fp16_opt_level = 'O1'

args = Args()

Import the Seinfeld scripts data from Google Drive (previously obtained from Kaggle).

In [None]:
! gdown https://drive.google.com/uc?id=1wSKl_642G6lsk4AmrYXpa_8XWb4hcY_B

Downloading...
From: https://drive.google.com/uc?id=1wSKl_642G6lsk4AmrYXpa_8XWb4hcY_B
To: /content/drive/MyDrive/Colab Notebooks/scripts.csv
100% 4.52M/4.52M [00:00<00:00, 44.0MB/s]


In [None]:
seinfeld_scripts = pd.read_csv("./scripts.csv")

In [None]:
seinfeld_scripts.head()

Unnamed: 0.1,Unnamed: 0,Character,Dialogue,EpisodeNo,SEID,Season
0,0,JERRY,Do you know what this is all about? Do you kno...,1.0,S01E01,1.0
1,1,JERRY,"(pointing at Georges shirt) See, to me, that b...",1.0,S01E01,1.0
2,2,GEORGE,Are you through?,1.0,S01E01,1.0
3,3,JERRY,"You do of course try on, when you buy?",1.0,S01E01,1.0
4,4,GEORGE,"Yes, it was purple, I liked it, I dont actuall...",1.0,S01E01,1.0


In [None]:
#confirm the data loaded as a DataFrame
type(seinfeld_scripts)

pandas.core.frame.DataFrame

In [None]:
#check for empty dialogue
seinfeld_scripts[seinfeld_scripts['Dialogue'].isnull()]


Unnamed: 0.1,Unnamed: 0,Character,Dialogue,EpisodeNo,SEID,Season
13529,13529,** Pies - Just in case you did not know what t...,,7.0,S04E07,4.0
14938,14938,"[On the bright side, Kramer and George arrive ...",,12.0,S04E12,4.0
18353,18353,(The show begins. There are three different se...,,24.0,S04E24,4.0
18354,18354,(Jerry's doing his stand-up routine at a comed...,,24.0,S04E24,4.0
18355,18355,"We see the title 'Jerry', then, sitting at the...",,24.0,S04E24,4.0
42939,42939,Definitions of several items in the Chicken Ro...,,8.0,S08E08,8.0
45847,45847,HAL,,18.0,S08E18,8.0
49651,49651,Notice,,8.0,S09E08,9.0
50013,50013,The definition of Sari or Saree is,,8.0,S09E08,9.0
53142,53142,MONTAGE,,18.0,S09E18,9.0


In [None]:
#drop rows with missing values

seinfeld_scripts = seinfeld_scripts.dropna().reset_index(drop=True)
print(len(seinfeld_scripts))

54606


Clean up data 

In [None]:
#can get rid of some columns- we only need character and dialogue
seinfeld_scripts = seinfeld_scripts.drop(columns=['Unnamed: 0','EpisodeNo','SEID','Season'])

In [None]:
seinfeld_scripts.head()

Unnamed: 0,Character,Dialogue
0,JERRY,Do you know what this is all about? Do you kno...
1,JERRY,"(pointing at Georges shirt) See, to me, that b..."
2,GEORGE,Are you through?
3,JERRY,"You do of course try on, when you buy?"
4,GEORGE,"Yes, it was purple, I liked it, I dont actuall..."


In [None]:
len(seinfeld_scripts)

54606

In [None]:
sum(seinfeld_scripts.Character == 'GEORGE')

9708

In [None]:
NAME = 'GEORGE'

First, restructure the dataframe so that each row has one of George's lines as the response and the 4 preceding lines of dialogue as context. The context can include lines from other characters, but the response is always George's.
Next, try the 10 preceding lines of dialogue as context.
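
As a minimal sketch of this restructuring, the snippet below applies the same backwards walk to a tiny made-up conversation, using plain lists instead of the dataframe. The helper name `build_context_rows` and the toy lines are assumptions for illustration only.

```python
def build_context_rows(lines, speaker, n):
    """Build [response, newest context, ..., oldest context] rows.

    lines: list of (character, dialogue) tuples in script order.
    speaker: character whose lines become responses.
    n: number of preceding lines to keep as context.
    """
    rows = []
    for i, (character, _) in enumerate(lines):
        # Skip other characters and lines without n full lines of history.
        if character != speaker or i < n:
            continue
        # Walk backwards from the response through the n preceding lines.
        rows.append([lines[j][1] for j in range(i, i - n - 1, -1)])
    return rows


toy_script = [  # hypothetical dialogue, not taken from the dataset
    ("JERRY", "a"),
    ("ELAINE", "b"),
    ("GEORGE", "c"),
    ("JERRY", "d"),
    ("GEORGE", "e"),
]
for row in build_context_rows(toy_script, "GEORGE", n=2):
    print(row)
```

Each row has `n + 1` entries, which is why a window of 10 yields the columns `response`, `context`, and `context/0` through `context/8`.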

In [None]:
contexted = []
#context window of size 4
#or window of 10
#n = 4
n = 10

for i in seinfeld_scripts[seinfeld_scripts.Character == NAME].index:
    if i < n:
        continue
    row = []
    prev = i - 1 - n #subtract 1 so the row holds the current response plus the n (4 or 10) previous lines
    for j in range(i, prev, -1):
        row.append(seinfeld_scripts.Dialogue[j])
    contexted.append(row)
columns = ['response','context']
columns = columns + ['context/' + str(i) for i in range(n-1)]

formatted_df = pd.DataFrame.from_records(contexted, columns=columns)
    

In [None]:
formatted_df.sample(6)

Unnamed: 0,response,context,context/0,context/1,context/2,context/3,context/4,context/5,context/6,context/7,context/8
1724,"Uh, can we cut to the chase?",He's good.,"Really John Mollika, they guy that used to bar...",When I was outside I ran into John Mollika.,What are you? A *baby*!? All right. Tell her.,"I know, but I'm distracted now.","Hey, hey! We're discussing something!",What about?,"[To Jerry] Oh, remind me to talk to you about ...","No, don't say a word. If she thinks my friends...",I'll tell her I was the one who laughed.
3056,What?,You can't break up with her now. Her life is s...,"What is here comes the judge, here comes the j...",Because I kissed her in the meeting. Russell f...,Why did he fire her?,(Hangs up phone and pauses) this is great! He ...,I'll speak to you later.,(To the TV once again) What is the cha-cha? Oo...,I just feel terrible This is just terrible.,You didn't realize? How could you not realize?...,But I had no... I didn't realize.
8980,Thanks. What are you guys doin' here?,"(waiting with Elaine and Peggy) No, germ-o-pho...","Excuse me. Is this, uh, Rage-aholics?","Oh, germs. Germs. Germs!",Mm-hmm. I prepared it as I bathed.,This food was in the shower with you?,"Oh, yeah, and I use it all the time. Yeah, I m...",You have a garbage disposal in your bathtub?,Yeah. And here's to David Puddy for helping me...,"Here's to Peggy, on her first week of being ge...","Well, thank you."
6080,"That's the question, Jimmy.",Are you gonna say it again?,She probably never heard it. Don't you see wha...,What?,Oh my god.,"Yeah, her boss told me that she can't hear ver...",What?,"Well, I'm sorry. Well, George, I tried to put ...","Well, I don't think that's gonna happen.","Nothing doing. Jerry, I didn't do anything. It...",Well?
4103,(laughing) You think you're going to the other...,"I don't know, ""Hi"".",What are you going to say?,I can't.,"Wait a minute Jerry, there's a bigger issue he...",I have to. I won't be able to live with myself.,They're not mocked and humiliated when they ge...,I'm going to psyche myself into it like those ...,Cold? How are you going to do that? You're not...,I gotta talk to her. What do you think?,Boy you are really smitten.
5988,You don't like the move?,"Well, stop it!","(pops his head out of the covers, looking a bi...","Ow, George! (crawls out from beneath the cover...",Good!,Fine.,"All right, how about the next time your car br...","Well, I'll tell you what I'll do, you know. If...","Are you through, 'cuz, uh, I gotta get back to...","You can't come up with your own stuff , so you...","Yeah, that's right."


In [None]:
#check for missing values

formatted_df.isnull().values.any()

False

In [None]:
print(len(formatted_df))

9704


In [None]:
! gdown https://drive.google.com/uc?id=1TwdjAjuJybmFcVbEVtiXKi0KHRrHbd6G

Downloading...
From: https://drive.google.com/uc?id=1TwdjAjuJybmFcVbEVtiXKi0KHRrHbd6G
To: /content/drive/MyDrive/Colab Notebooks/back_translated_tenpercent_sample.csv
100% 602k/602k [00:00<00:00, 78.8MB/s]


In [None]:
#let's append the back-translated data

back_translated = pd.read_csv("./back_translated_tenpercent_sample.csv", index_col=[0])

In [None]:
back_translated

Unnamed: 0,response,context,context/0,context/1,context/2,context/3,context/4,context/5,context/6,context/7,context/8
2304,"Alright, stop it Kramer, you're freakin' me ou...","Oh, yeah. I can barely see you, George.",Me?,"Because you see, George, having the keys to Je...",I suppose I could.,"Well, you can get 'em back.","Should I give you my keys, is that the transac...","Say yes! Yes, George. Yes!","Gee, Kramer, I uh...I don't know what to say.","Yeah. Yeah, I think it would be for the best.",You want 'em back?
4450,You nailed it,he asked me,horning on my rock climbing trip. It's just su...,what?,hey nice move today,hey,he's the first cool guy I've ever been friends...,"cool guy? what are you, in 8th grade?",yeah that's right. I like it. He's such a cool...,yeah yeah what is it with you and Tony? what a...,alright alright
4560,cinnamon. Why didn't Cha get chocolate?,Cinnamon babka.,"I spilled some Chardonnay. So, what did you get?",What for?,I had to give it to the liquor store guy.,"Hey, what happened to your coat? And what is t...",Somebody put a cane on my foot. Just like the ...,"What, what happened to you?","S-s-somebody double parked, we couldn't help i...","How about a nice box of ""scram"".","Oh, no thanks."
8918,it's a pleasure to meet you.,"(Introducing the two) George, this is Nina.",Yeah!,It's been years!,Hi.,Nina?,Jerry?,Five' eight. Five' seven.,Really?,"Yeah, and a whole new me. I'm up two inches in...","Almost. (Notices his shoes) Hey, new Timberlands?"
5581,Hey.,(entering) Hey.,"Many times. You love velvet, you want to live ...",I've said that before?,"I know, you would drape yourself in velvet.","I gotta find a way to work this out, I love th...",We'll take a check please.,Ho ho. This bizarre ?harrod? experiment must end!,Not you.,I look like me and I'm working from the outsid...,He looks like you and he's working from the in...
...,...,...,...,...,...,...,...,...,...,...,...
856,It's the pesto of cities. So..?,Everybody's moving to Seattle.,Seattle.,"He's from a.. Yakima, right?","(Slightly embarrassed) Yes, a guy.",A guy?,"Elaine is having a ""houseguest."" She's picking...",Who?,I don't understand why he couldn't take a cab.,That's a tough minute. It's like waiting in th...,"(To Elaine) Oh, one more thing about the car. ..."
1071,That's pretty long.,About three years.,So how long did you live there?,"On my block, a lot of ah, people walk their do...","Well, thank you very much. I'm telling you, on...",It's just we want you to go.,"I know, it's not that.","Forget it. Go ahead, you'll have a good time.",What about this--,No.,Oh! I tell you what. How about if I come back ...
5846,(enthusiastic) Yeah. You get a couch. I get ri...,(not sure) I guess.,"No, the cushion's turned over.",But it's got a pee-stain on it.,"Yeah, sure. (big smile) Then my father will ha...",The one with the Poppie stain?!,(crafty) Not necessarily. Why don't you take b...,Now we have to buy a new couch?!,We have to replace the couch.,"Anyway, Jerry... Jerry?",And they think they're better than us?
3688,Why?,I don't know about defraying.,"Hey, guess what? The Drake broke up.",Hey.,"Don't worry, they'll make friends fast with th...",They don't know anybody in Chicago.,I hate the Drake! Maybe the whole thing was a ...,"Boy, I am really starting to dislike the Drake!",I know it is!,"Well, she can't keep it, it's not fair, that's...",He gave her all the gifts; he felt guilty.


In [None]:
#let's concatenate the two data frames

data_augmented = pd.concat([formatted_df,back_translated],ignore_index = True)

In [None]:
data_augmented

Unnamed: 0,response,context,context/0,context/1,context/2,context/3,context/4,context/5,context/6,context/7,context/8
0,How come youre not doin the second show tomorrow?,Trust me George. No one has any interest in se...,"Can you relax, its a cup of coffee. Claire is ...","Its missing, I have to do it in my head decaf ...","Are, are you sure this is decaf? Wheres the or...",Mr. Seinfeld. Mr. Costanza.,"Well, senator, Id just like to know, what you ...","(on an imaginary microphone) Uh, no, not at th...","Oh, you dont recall?","Yes, it was purple, I liked it, I dont actuall...","You do of course try on, when you buy?"
1,"Wait a second, wait a second, what coming in, ...","Well, theres this uh, woman might be comin in.",How come youre not doin the second show tomorrow?,Trust me George. No one has any interest in se...,"Can you relax, its a cup of coffee. Claire is ...","Its missing, I have to do it in my head decaf ...","Are, are you sure this is decaf? Wheres the or...",Mr. Seinfeld. Mr. Costanza.,"Well, senator, Id just like to know, what you ...","(on an imaginary microphone) Uh, no, not at th...","Oh, you dont recall?"
2,"No, you didnt!","I told you about Laura, the girl I met in Mich...","Wait a second, wait a second, what coming in, ...","Well, theres this uh, woman might be comin in.",How come youre not doin the second show tomorrow?,Trust me George. No one has any interest in se...,"Can you relax, its a cup of coffee. Claire is ...","Its missing, I have to do it in my head decaf ...","Are, are you sure this is decaf? Wheres the or...",Mr. Seinfeld. Mr. Costanza.,"Well, senator, Id just like to know, what you ..."
3,Ha.,"I thought I told you about it, yes, she teache...","No, you didnt!","I told you about Laura, the girl I met in Mich...","Wait a second, wait a second, what coming in, ...","Well, theres this uh, woman might be comin in.",How come youre not doin the second show tomorrow?,Trust me George. No one has any interest in se...,"Can you relax, its a cup of coffee. Claire is ...","Its missing, I have to do it in my head decaf ...","Are, are you sure this is decaf? Wheres the or..."
4,"Wait wait wait, what is she... (takes the milk...","(looks in the creamer) Theres no milk in here,...",Ha.,"I thought I told you about it, yes, she teache...","No, you didnt!","I told you about Laura, the girl I met in Mich...","Wait a second, wait a second, what coming in, ...","Well, theres this uh, woman might be comin in.",How come youre not doin the second show tomorrow?,Trust me George. No one has any interest in se...,"Can you relax, its a cup of coffee. Claire is ..."
...,...,...,...,...,...,...,...,...,...,...,...
10669,It's the pesto of cities. So..?,Everybody's moving to Seattle.,Seattle.,"He's from a.. Yakima, right?","(Slightly embarrassed) Yes, a guy.",A guy?,"Elaine is having a ""houseguest."" She's picking...",Who?,I don't understand why he couldn't take a cab.,That's a tough minute. It's like waiting in th...,"(To Elaine) Oh, one more thing about the car. ..."
10670,That's pretty long.,About three years.,So how long did you live there?,"On my block, a lot of ah, people walk their do...","Well, thank you very much. I'm telling you, on...",It's just we want you to go.,"I know, it's not that.","Forget it. Go ahead, you'll have a good time.",What about this--,No.,Oh! I tell you what. How about if I come back ...
10671,(enthusiastic) Yeah. You get a couch. I get ri...,(not sure) I guess.,"No, the cushion's turned over.",But it's got a pee-stain on it.,"Yeah, sure. (big smile) Then my father will ha...",The one with the Poppie stain?!,(crafty) Not necessarily. Why don't you take b...,Now we have to buy a new couch?!,We have to replace the couch.,"Anyway, Jerry... Jerry?",And they think they're better than us?
10672,Why?,I don't know about defraying.,"Hey, guess what? The Drake broke up.",Hey.,"Don't worry, they'll make friends fast with th...",They don't know anybody in Chicago.,I hate the Drake! Maybe the whole thing was a ...,"Boy, I am really starting to dislike the Drake!",I know it is!,"Well, she can't keep it, it's not fair, that's...",He gave her all the gifts; he felt guilty.


In [None]:
#load the DialoGPT tokenizer used for preprocessing
!pip install transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")




In [None]:
#shuffle and split data

from sklearn.model_selection import train_test_split
trn_df, test_df = train_test_split(data_augmented, test_size = 0.2, shuffle= True)
trn_df.head()

Unnamed: 0,response,context,context/0,context/1,context/2,context/3,context/4,context/5,context/6,context/7,context/8
4529,"Ah, great. All right, with the wine I'm in ove...",6.75.,Well you know there is an unusual number of pe...,"Ohh, it's real.",It's not real. They're all made up.,"Come on, did you ever read one of these?","Oh, that's nice.","Why? No, thatll make great dinner party conver...",I'm not getting a Penthouse Forum.,"Here, George, get a Penthouse Forum.",Were up to two dollars here.
8465,"Yeah, with crutches everyone has questions.","No more crutches, that must be a relief.","Hey hey hey, check me out, huh?",(resigned) Dammit.,"Stop it, George. Stop it. I'm sorry, you've go...",Buzz cuts? Parachute pants!,No.,What if we grew muttonchops?,Because you said this would be better. Remembe...,Well why didn't we?,"I told you, we should have taken some kind of ..."
9510,What?,I look about the same.,"Well, now I can't see Jerry.",I can't eat with you leanin' over like this. J...,"Jerry, you wouldn't believe what it's like dow...",Ooh.,"Hey, hey, hey. Look at that.",I don't know what this is.,40? I'm payin' 60 to my maid. She doesn't do l...,40.,How much you pay this maid?
908,"We have a, three-o'clock appointments.",And you don't have to pay.,"I'll tell you, but don't ask her anything abou...",What's the name of this physical therapist?,"Right, your friend Roy.",Well I've never actually done it but if I real...,So where do you get this note?,Not if you have a doctor's note.,You don't have to pay for the massage?,Yeah.,Physical therapy is covered by insurance?
652,I'm not kidding.,I don't think that's it.,(Still holding the note) I think I'm having a ...,"I don't know. That's like asking ""Where's Waldo?""","Hey, where's Kramer?",Let me see that. (Studies the note),"(Joking) Did you mess with Johnny, Jerry?",Johnny? Johnny who? Johnny Carson? Did I insul...,No. Let me see that. (Takes the paper from Jer...,(Trying to read the note) What have I done? I ...,They refuse to put cucumber in the salad. I ne...


In [None]:
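# create dataset suitable for our model
def construct_conv(row, tokenizer, eos = True):
    # row is ordered [response, context, context/0, ...], i.e. newest turn first.
    # Encode each turn, append the EOS token, then reverse the turns so the
    # oldest context comes first and George's response comes last.
    # Note: the eos argument is currently unused; EOS is always appended.
    flatten = lambda l: [item for sublist in l for item in sublist]
    conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
    conv = flatten(conv)
    return conv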

class ConversationDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, args, df, block_size=512):

        block_size = block_size - (tokenizer.model_max_length - tokenizer.max_len_single_sentence)

        directory = args.cache_dir
        cached_features_file = os.path.join(
            directory, args.model_type + "_cached_lm_" + str(block_size)
        )

        if os.path.exists(cached_features_file) and not args.overwrite_cache:
            logger.info("Loading features from cached file %s", cached_features_file)
            with open(cached_features_file, "rb") as handle:
                self.examples = pickle.load(handle)
        else:
            logger.info("Creating features from dataset file at %s", directory)

            self.examples = []
            for _, row in df.iterrows():
                conv = construct_conv(row, tokenizer)
                self.examples.append(conv)

            logger.info("Saving features into cached file %s", cached_features_file)
            with open(cached_features_file, "wb") as handle:
                pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, item):
        return torch.tensor(self.examples[item], dtype=torch.long)
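
To show what `construct_conv` produces, here is a small sketch that runs the same logic with a stand-in tokenizer. `ToyTokenizer` is a hypothetical helper that assigns one id per word and uses 0 as the end-of-sequence id; the real DialoGPT tokenizer uses byte-pair encoding and a different EOS id.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer (one id per word)."""

    eos_token_id = 0  # assumed EOS id for this toy example

    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        # Assign the next free id (starting at 1) to each unseen word.
        return [self.vocab.setdefault(w, len(self.vocab) + 1) for w in text.split()]


def construct_conv(row, tokenizer):
    # Encode every turn and terminate it with EOS.
    turns = [tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]
    # Reverse so the oldest context comes first and the response last, then flatten.
    return [tok for turn in reversed(turns) for tok in turn]


row = ["fine thanks", "how are you", "hi"]  # response, context, context/0
ids = construct_conv(row, ToyTokenizer())
print(ids)
```

The flattened sequence reads in chronological order, with each turn closed by the EOS id, so the model learns to produce the response as the continuation of the preceding context.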

In [None]:
# Caching and storing of data/checkpoints

def load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False):
    return ConversationDataset(tokenizer, args, df_val if evaluate else df_trn)


def set_seed(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)


def _sorted_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> List[str]:
    ordering_and_checkpoint_path = []

    glob_checkpoints = glob.glob(os.path.join(args.output_dir, "{}-*".format(checkpoint_prefix)))

    for path in glob_checkpoints:
        if use_mtime:
            ordering_and_checkpoint_path.append((os.path.getmtime(path), path))
        else:
            regex_match = re.match(".*{}-([0-9]+)".format(checkpoint_prefix), path)
            if regex_match and regex_match.groups():
                ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))

    checkpoints_sorted = sorted(ordering_and_checkpoint_path)
    checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
    return checkpoints_sorted


def _rotate_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> None:
    if not args.save_total_limit:
        return
    if args.save_total_limit <= 0:
        return

    # Check if we should delete older checkpoint(s)
    checkpoints_sorted = _sorted_checkpoints(args, checkpoint_prefix, use_mtime)
    if len(checkpoints_sorted) <= args.save_total_limit:
        return

    number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - args.save_total_limit)
    checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]
    for checkpoint in checkpoints_to_be_deleted:
        logger.info("Deleting older checkpoint [{}] due to args.save_total_limit".format(checkpoint))
        shutil.rmtree(checkpoint)
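
The checkpoint helpers above sort directories by the step number in their name and delete the oldest ones once `save_total_limit` is exceeded. The sketch below reproduces that selection logic on plain strings, without touching the filesystem. The function names and the example paths are assumptions for illustration.

```python
import re


def sort_checkpoints(paths, prefix="checkpoint"):
    """Order checkpoint paths by their numeric step suffix."""
    ordered = []
    for path in paths:
        match = re.match(r".*{}-([0-9]+)".format(prefix), path)
        if match:
            # Compare as integers so 10500 sorts after 7000.
            ordered.append((int(match.group(1)), path))
    return [path for _, path in sorted(ordered)]


def checkpoints_to_delete(paths, save_total_limit):
    """Return the oldest checkpoints that exceed save_total_limit."""
    if not save_total_limit or save_total_limit <= 0:
        return []  # no limit configured: keep everything
    ordered = sort_checkpoints(paths)
    return ordered[: max(0, len(ordered) - save_total_limit)]


example_paths = [  # hypothetical directory names
    "output-small/checkpoint-7000",
    "output-small/checkpoint-10500",
    "output-small/checkpoint-3500",
]
print(sort_checkpoints(example_paths))
print(checkpoints_to_delete(example_paths, save_total_limit=2))
```

Sorting on the integer step rather than the raw string matters because a plain string sort would place `checkpoint-10500` before `checkpoint-3500`. With `save_total_limit = None`, as in the `Args` class above, nothing is deleted.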

In [None]:
from transformers import AutoModelWithLMHead, AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
# AutoModelWithLMHead is deprecated; AutoModelForCausalLM loads the same GPT-2 LM-head model
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")




Training and evaluating functions
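
The training function computes the total number of optimization steps and hands it to a linear warmup-and-decay learning-rate schedule. The sketch below works through that arithmetic in plain Python. The names `total_steps` and `linear_lr_factor` and the figure of 1,000 batches per epoch are assumptions; the multiplier follows my understanding of a linear schedule with warmup and is not copied from the library.

```python
def total_steps(batches_per_epoch, grad_accum_steps, num_epochs):
    """Number of optimizer updates over the whole run."""
    return batches_per_epoch // grad_accum_steps * num_epochs


def linear_lr_factor(step, warmup_steps, total):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    return max(0.0, (total - step) / max(1, total - warmup_steps))


base_lr = 5e-5  # learning_rate from Args
t_total = total_steps(batches_per_epoch=1000, grad_accum_steps=1, num_epochs=3)
for step in (0, t_total // 2, t_total):
    print(step, base_lr * linear_lr_factor(step, warmup_steps=0, total=t_total))
```

With `warmup_steps = 0`, as configured in `Args`, the learning rate starts at its full value and decays linearly to zero by the final step.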

In [None]:
def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:
    """ Train the model """
    if args.local_rank in [-1, 0]:
        tb_writer = SummaryWriter()

    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
    train_dataloader = DataLoader(
        train_dataset, sampler=train_sampler, batch_size=args.train_batch_size, collate_fn=collate, drop_last = True
    )

    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
    else:
        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    model = model.module if hasattr(model, "module") else model  # Take care of distributed/parallel training
    model.resize_token_embeddings(len(tokenizer))
    # add_special_tokens_(model, tokenizer)


    # Prepare optimizer and schedule (linear warmup and decay)
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
    #optimizer = SGD(optimizer_grouped_parameters, lr=args.learning_rate, momentum=0.9)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
    )

    # Check if saved optimizer or scheduler states exist
    if (
        args.model_name_or_path
        and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt"))
        and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt"))
    ):
        # Load in optimizer and scheduler states
        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))

    if args.fp16:
        try:
            from apex import amp
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)

    # multi-gpu training (should be after apex fp16 initialization)
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Distributed training (should be after apex fp16 initialization)
    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
        )

    # Train!
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
    logger.info(
        "  Total train batch size (w. parallel, distributed & accumulation) = %d",
        args.train_batch_size
        * args.gradient_accumulation_steps
        * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
    )
    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info("  Total optimization steps = %d", t_total)

    global_step = 0
    epochs_trained = 0
    steps_trained_in_current_epoch = 0
    # Check if continuing training from a checkpoint
    if args.model_name_or_path and os.path.exists(args.model_name_or_path):
        try:
            # set global_step to global_step of last saved checkpoint from model path
            checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
            global_step = int(checkpoint_suffix)
            epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
            steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)

            logger.info("  Continuing training from checkpoint, will skip to saved global_step")
            logger.info("  Continuing training from epoch %d", epochs_trained)
            logger.info("  Continuing training from global step %d", global_step)
            logger.info("  Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)
        except ValueError:
            logger.info("  Starting fine-tuning.")

    tr_loss, logging_loss = 0.0, 0.0

    model.zero_grad()
    train_iterator = trange(
        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]
    )
    set_seed(args)  # Added here for reproducibility
    for _ in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
        for step, batch in enumerate(epoch_iterator):

            # Skip past any already trained steps if resuming training
            if steps_trained_in_current_epoch > 0:
                steps_trained_in_current_epoch -= 1
                continue

            inputs, labels = (batch, batch)
            if inputs.shape[1] > 1024: continue
            inputs = inputs.to(args.device)
            labels = labels.to(args.device)
            model.train()
            outputs = model(inputs, labels=labels)
            loss = outputs[0]  # model outputs are always tuple in transformers (see doc)

            if args.n_gpu > 1:
                loss = loss.mean()  # mean() to average on multi-gpu parallel training
            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()

            tr_loss += loss.item()
            if (step + 1) % args.gradient_accumulation_steps == 0:
                if args.fp16:
                    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
                else:
                    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
                optimizer.step()
                scheduler.step()  # Update learning rate schedule
                model.zero_grad()
                global_step += 1

                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
                    # Log metrics
                    if (
                        args.local_rank == -1 and args.evaluate_during_training
                    ):  # Only evaluate when single GPU otherwise metrics may not average well
                        results = evaluate(args, model, tokenizer)
                        for key, value in results.items():
                            tb_writer.add_scalar("eval_{}".format(key), value, global_step)
                    tb_writer.add_scalar("lr", scheduler.get_last_lr()[0], global_step)  # get_lr() is deprecated for reading the current rate
                    tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
                    logging_loss = tr_loss

                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
                    checkpoint_prefix = "checkpoint"
                    # Save model checkpoint
                    output_dir = os.path.join(args.output_dir, "{}-{}".format(checkpoint_prefix, global_step))
                    os.makedirs(output_dir, exist_ok=True)
                    model_to_save = (
                        model.module if hasattr(model, "module") else model
                    )  # Take care of distributed/parallel training
                    model_to_save.save_pretrained(output_dir)
                    tokenizer.save_pretrained(output_dir)

                    torch.save(args, os.path.join(output_dir, "training_args.bin"))
                    logger.info("Saving model checkpoint to %s", output_dir)

                    _rotate_checkpoints(args, checkpoint_prefix)

                    torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
                    torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
                    logger.info("Saving optimizer and scheduler states to %s", output_dir)

            if args.max_steps > 0 and global_step > args.max_steps:
                epoch_iterator.close()
                break
        if args.max_steps > 0 and global_step > args.max_steps:
            train_iterator.close()
            break

    if args.local_rank in [-1, 0]:
        tb_writer.close()

    return global_step, tr_loss / global_step

# Evaluation: compute perplexity on the held-out set

def evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, df_trn, df_val, prefix="") -> Dict:
    # Evaluation results are written under args.output_dir
    eval_output_dir = args.output_dir

    eval_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=True)
    os.makedirs(eval_output_dir, exist_ok=True)
    args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
    # SequentialSampler (used below) keeps the evaluation order fixed

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(
        eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate, drop_last = True
    )

    # multi-gpu evaluate
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Eval!
    logger.info("***** Running evaluation {} *****".format(prefix))
    logger.info("  Num examples = %d", len(eval_dataset))
    logger.info("  Batch size = %d", args.eval_batch_size)
    eval_loss = 0.0
    nb_eval_steps = 0
    model.eval()

    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        inputs, labels = (batch, batch)
        inputs = inputs.to(args.device)
        labels = labels.to(args.device)

        with torch.no_grad():
            outputs = model(inputs, labels=labels)
            lm_loss = outputs[0]
            eval_loss += lm_loss.mean().item()
        nb_eval_steps += 1

    eval_loss = eval_loss / nb_eval_steps
    perplexity = torch.exp(torch.tensor(eval_loss))

    result = {"perplexity": perplexity}

    output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
    with open(output_eval_file, "w") as writer:
        logger.info("***** Eval results {} *****".format(prefix))
        for key in sorted(result.keys()):
            logger.info("  %s = %s", key, str(result[key]))
            writer.write("%s = %s\n" % (key, str(result[key])))

    return result
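
The perplexity reported by `evaluate` is the exponential of the mean per-batch language-modeling loss. A minimal sketch of that calculation in plain Python follows; `perplexity_from_losses` is a hypothetical helper name, not part of the notebook, and the loss values are made up for illustration.

```python
import math

def perplexity_from_losses(batch_losses):
    """Return exp(mean loss), mirroring the eval_loss / nb_eval_steps step."""
    mean_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(mean_loss)

# Illustrative losses only (not taken from a real run).
example_losses = [math.log(4.0), math.log(4.0), math.log(4.0)]
print(perplexity_from_losses(example_losses))
```

Note that the mean is taken over batches, so every batch counts equally regardless of how many tokens it contains.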

In [None]:
# Main runner

def main(df_trn, df_val):
    args = Args()
    
    if args.should_continue:
        sorted_checkpoints = _sorted_checkpoints(args)
        if len(sorted_checkpoints) == 0:
            raise ValueError("Used --should_continue but no checkpoint was found in --output_dir.")
        else:
            args.model_name_or_path = sorted_checkpoints[-1]

    if (
        os.path.exists(args.output_dir)
        and os.listdir(args.output_dir)
        and args.do_train
        and not args.overwrite_output_dir
        and not args.should_continue
    ):
        raise ValueError(
            "Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
                args.output_dir
            )
        )

    # Setup CUDA, GPU & distributed training
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU when no GPU is attached
    args.n_gpu = torch.cuda.device_count()
    args.device = device

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
    )
    logger.warning(
        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
        args.local_rank,
        device,
        args.n_gpu,
        bool(args.local_rank != -1),
        args.fp16,
    )

    # Set seed
    set_seed(args)

    config = AutoConfig.from_pretrained(args.config_name, cache_dir=args.cache_dir)
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
    model = AutoModelWithLMHead.from_pretrained(
        args.model_name_or_path,
        from_tf=False,
        config=config,
        cache_dir=args.cache_dir,
    )
    model.to(args.device)
    
    logger.info("Training/evaluation parameters %s", args)

    # Training
    if args.do_train:
        train_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False)

        global_step, tr_loss = train(args, train_dataset, model, tokenizer)
        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

    # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
    if args.do_train:
        # Create output directory if needed
        os.makedirs(args.output_dir, exist_ok=True)

        logger.info("Saving model checkpoint to %s", args.output_dir)
        # Save a trained model, configuration and tokenizer using `save_pretrained()`.
        # They can then be reloaded using `from_pretrained()`
        model_to_save = (
            model.module if hasattr(model, "module") else model
        )  # Take care of distributed/parallel training
        model_to_save.save_pretrained(args.output_dir)
        tokenizer.save_pretrained(args.output_dir)

        # Good practice: save your training arguments together with the trained model
        torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

        # Load a trained model and vocabulary that you have fine-tuned
        model = AutoModelWithLMHead.from_pretrained(args.output_dir)
        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
        model.to(args.device)

    # Evaluation
    results = {}
    if args.do_eval and args.local_rank in [-1, 0]:
        checkpoints = [args.output_dir]
        if args.eval_all_checkpoints:
            checkpoints = list(
                os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
            )
            logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
        logger.info("Evaluate the following checkpoints: %s", checkpoints)
        for checkpoint in checkpoints:
            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
            prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""

            model = AutoModelWithLMHead.from_pretrained(checkpoint)
            model.to(args.device)
            result = evaluate(args, model, tokenizer, df_trn, df_val, prefix=prefix)
            result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
            results.update(result)

    return results
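
The keys of the dictionary returned by `main` are built from the metric name plus a step suffix parsed from the checkpoint path. The sketch below isolates that naming logic; `result_key` is a hypothetical helper written for illustration, and the checkpoint paths are example values.

```python
def result_key(metric, checkpoint, n_checkpoints):
    """Rebuild the key naming used in the evaluation loop of main()."""
    # The suffix is only filled in when several checkpoints are evaluated.
    global_step = checkpoint.split("-")[-1] if n_checkpoints > 1 else ""
    return "{}_{}".format(metric, global_step)

print(result_key("perplexity", "output-small", 1))                  # perplexity_
print(result_key("perplexity", "output-small/checkpoint-3500", 3))  # perplexity_3500
```

With a single checkpoint the suffix is empty, which yields the key `perplexity_` with a trailing underscore.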

In [None]:
main(trn_df, test_df)

03/17/2022 18:10:19 - INFO - __main__ -   Training/evaluation parameters <__main__.Args object at 0x7f19aac97c90>
03/17/2022 18:10:19 - INFO - __main__ -   Creating features from dataset file at cached
03/17/2022 18:10:40 - INFO - __main__ -   Saving features into cached file cached/gpt2_cached_lm_512
03/17/2022 18:10:41 - INFO - __main__ -   ***** Running training *****
03/17/2022 18:10:41 - INFO - __main__ -     Num examples = 8539
03/17/2022 18:10:41 - INFO - __main__ -     Num Epochs = 3
03/17/2022 18:10:41 - INFO - __main__ -     Instantaneous batch size per GPU = 2
03/17/2022 18:10:41 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 2
03/17/2022 18:10:41 - INFO - __main__ -     Gradient Accumulation steps = 1
03/17/2022 18:10:41 - INFO - __main__ -     Total optimization steps = 12807


Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Iteration:   0%|          | 0/4269 [00:00<?, ?it/s]

03/17/2022 18:31:15 - INFO - __main__ -   Saving model checkpoint to output-small/checkpoint-3500
03/17/2022 18:31:35 - INFO - __main__ -   Saving optimizer and scheduler states to output-small/checkpoint-3500


Iteration:   0%|          | 0/4269 [00:00<?, ?it/s]

03/17/2022 18:52:01 - INFO - __main__ -   Saving model checkpoint to output-small/checkpoint-7000
03/17/2022 18:52:26 - INFO - __main__ -   Saving optimizer and scheduler states to output-small/checkpoint-7000


Iteration:   0%|          | 0/4269 [00:00<?, ?it/s]

03/17/2022 19:12:40 - INFO - __main__ -   Saving model checkpoint to output-small/checkpoint-10500
03/17/2022 19:12:52 - INFO - __main__ -   Saving optimizer and scheduler states to output-small/checkpoint-10500
03/17/2022 19:26:09 - INFO - __main__ -    global_step = 12807, average loss = 2.3499287157527102
03/17/2022 19:26:09 - INFO - __main__ -   Saving model checkpoint to output-small
03/17/2022 19:26:21 - INFO - __main__ -   Evaluate the following checkpoints: ['output-small']
03/17/2022 19:26:24 - INFO - __main__ -   Creating features from dataset file at cached
03/17/2022 19:26:42 - INFO - __main__ -   Saving features into cached file cached/gpt2_cached_lm_512
03/17/2022 19:26:42 - INFO - __main__ -   ***** Running evaluation  *****
03/17/2022 19:26:42 - INFO - __main__ -     Num examples = 2135
03/17/2022 19:26:42 - INFO - __main__ -     Batch size = 4


Evaluating:   0%|          | 0/533 [00:00<?, ?it/s]

03/17/2022 19:28:19 - INFO - __main__ -   ***** Eval results  *****
03/17/2022 19:28:19 - INFO - __main__ -     perplexity = tensor(5.9296)


{'perplexity_': tensor(5.9296)}
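
The step counts in the log above can be reproduced with integer arithmetic. The sketch below assumes the training dataloader drops the last incomplete batch, as the evaluation dataloader does with `drop_last = True`; the training dataloader is defined outside this section, so treat that as an assumption. `batches_per_epoch` and `total_optimization_steps` are hypothetical helper names.

```python
def batches_per_epoch(num_examples, batch_size):
    """Number of full batches when the last incomplete batch is dropped."""
    return num_examples // batch_size

def total_optimization_steps(num_examples, batch_size, grad_accum_steps, num_epochs):
    """Optimizer updates for the whole run: batches per epoch, grouped by accumulation, times epochs."""
    return batches_per_epoch(num_examples, batch_size) // grad_accum_steps * num_epochs

print(batches_per_epoch(8539, 2))               # 4269
print(total_optimization_steps(8539, 2, 1, 3))  # 12807
print(batches_per_epoch(2135, 4))               # 533
```

These values agree with the logged 4269 iterations per epoch, the 12807 total optimization steps, and the 533 evaluation batches.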

Now let's generate a conversation

In [None]:
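# Imports repeated here so this cell also runs on its own after a runtime restart
import torch
from transformers import AutoModelWithLMHead, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-small')
model = AutoModelWithLMHead.from_pretrained('output-small')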

# Let's chat for 5 lines
for step in range(5):
    # encode the new user input, add the eos_token and return a PyTorch tensor
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')
    # print(new_user_input_ids)

    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # generate a response while limiting the total sequence (chat history plus reply) to 300 tokens
    chat_history_ids = model.generate(
        bot_input_ids, max_length=300,
        pad_token_id=tokenizer.eos_token_id,  
        no_repeat_ngram_size=0,       
        do_sample=True, 
        top_k=50, 
        top_p=0.9,
        temperature = 0.8,
        #add bad words list
        bad_words_ids = [tokenizer(bad_word).input_ids for bad_word in ["!"]]
    )
    
    # pretty print the last output tokens from the bot
    print("GeorgeBot: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

NameError: ignored
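
The `temperature`, `top_k`, and `top_p` arguments passed to `model.generate` reshape the next-token distribution before a token is sampled. The following is a simplified, pure-Python sketch of that idea, not the transformers implementation: it scales the logits by the temperature, keeps the `top_k` most likely tokens, then keeps the smallest prefix of those whose cumulative probability reaches `top_p`. The function `sampling_distribution` and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sampling_distribution(logits, top_k=50, top_p=0.9, temperature=0.8):
    """Return renormalized probabilities after temperature, top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    probs = softmax([x / temperature for x in logits])
    # Token indices sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:  # top_k == 0 is treated as "no top-k filtering"
        order = order[:top_k]
    kept_mass = sum(probs[i] for i in order)
    # Nucleus step: keep tokens until their cumulative share reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    result = [0.0] * len(probs)
    for i in kept:
        result[i] = probs[i] / total
    return result

toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a 4-token vocabulary
print(sampling_distribution(toy_logits, top_k=0, top_p=0.5, temperature=1.0))  # [1.0, 0.0, 0.0, 0.0]
print(sampling_distribution(toy_logits, top_k=2, top_p=1.0, temperature=0.8))
```

In the first call the most likely token alone already covers more than half of the probability mass, so the nucleus contains only that token. Raising `top_p` or `temperature` widens the set of candidates, which is the trade-off between coherence and variety that this experiment explores.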