<a href="https://colab.research.google.com/github/dreamingjudith/KoGPT2-personachat/blob/dev/personachat_kogpt2_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

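# Prerequisites

- After changing the runtime type, verify that a GPU was actually allocated
- Mount Google Drive so checkpoints can be saved
- Install the required modules
- Set up logging so the `logger` calls left in the existing code work unchanged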

In [1]:
!pip install transformers==4.10.3 tokenizers==0.10.3 pytorch-lightning==1.5.10

Collecting transformers==4.10.3
  Downloading transformers-4.10.3-py3-none-any.whl (2.8 MB)
Collecting tokenizers==0.10.3
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pytorch-lightning==1.5.10
  Downloading pytorch_lightning-1.5.10-py3-none-any.whl (527 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting huggingface-hub>=0.0.12
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (59

In [2]:
!nvidia-smi

Sat Apr 16 06:17:21 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
!ls '/content/drive/My Drive'

'Colab Notebooks'   KoGPT2-personachat	 korquad_2.1


In [5]:
import logging

logger = logging.getLogger('cm_kogpt2')
logging.basicConfig(level=logging.INFO)
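
The named logger configured above is what produces the `INFO:cm_kogpt2:` prefix in later cell outputs. As a minimal sketch of that behaviour, the snippet below sends records to an in-memory buffer, using the same layout as the default `logging.basicConfig` format. The logger name `cm_kogpt2_demo` and the message are hypothetical, chosen so the sketch does not interfere with the notebook's own logger.

```python
import io
import logging

# Route records to an in-memory buffer so the formatted line can be inspected.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
# Same layout as the default format used by logging.basicConfig.
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

demo_logger = logging.getLogger("cm_kogpt2_demo")  # hypothetical name for this sketch
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(handler)
demo_logger.propagate = False  # keep the demo record out of the root logger

demo_logger.info("Reading %s", "sample.json")   # emitted
demo_logger.debug("hidden at INFO level")       # filtered out by the level

log_text = buffer.getvalue()
print(log_text, end="")
```

Only the `info` call reaches the buffer, because the logger level is `INFO`.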

In [6]:
import torch

# Model and tokenizer
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast

# Dataset
import json
import os
from torch.utils.data import DataLoader, TensorDataset
from itertools import chain

# Model definition with PyTorch Lightning
from pytorch_lightning.core.lightning import LightningModule
from transformers.optimization import AdamW, get_cosine_schedule_with_warmup

# Other modules needed to run the script
import argparse
from collections import defaultdict

import torch.nn.functional as F
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger

INFO:numexpr.utils:NumExpr defaulting to 2 threads.


# Functions for loading the KoGPT2 model and tokenizer

In [7]:
def get_kogpt2_model():
    """Get KoGPT2 model after downloading"""

    model = GPT2LMHeadModel.from_pretrained('skt/kogpt2-base-v2')
    model.eval()

    return model


def get_kogpt2_tokenizer():
    """Get KoGPT2 Tokenizer after downloading"""

    tokenizer = PreTrainedTokenizerFast.from_pretrained("skt/kogpt2-base-v2",
                                                        bos_token='<s>',
                                                        eos_token='</s>',
                                                        unk_token='<unk>',
                                                        pad_token='<pad>',
                                                        mask_token='<mask>')

    return tokenizer

# Functions for building the dataset

In [8]:
SPECIAL_TOKENS = ["<s>", "</s>", "<usr>", "<sys>", "<pad>"]
ATTR_TO_SPECIAL_TOKEN = {'bos_token': '<s>', 'eos_token': '</s>', 'pad_token': '<pad>',
                         'additional_special_tokens': ['<usr>', '<sys>']}
MODEL_INPUTS = ["input_ids", "labels", "token_type_ids"]
PADDED_INPUTS = ["input_ids", "labels", "token_type_ids"]

In [9]:
def get_dataset(tokenizer, dataset_path, dataset_cache):
    """Read PersonaChat json file and return tokenized dataset"""

    if dataset_cache and os.path.isfile(dataset_cache):
        logger.info("Load tokenized dataset from cache at %s", dataset_cache)
        dataset = torch.load(dataset_cache)

    else:
        dataset_basename = os.path.basename(dataset_path).split(".")[0]
        dataset_cache = "/content/drive/My Drive/KoGPT2-personachat/dataset/dataset_cache_{}".format(dataset_basename)

        logger.info("Reading {}".format(dataset_path))
        with open(dataset_path, "r", encoding="utf-8") as f:
            dataset = json.loads(f.read())

        logger.info("Tokenize and encode the dataset")

        def tokenize(obj):
            if isinstance(obj, str):
                return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(obj))
            if isinstance(obj, dict):
                return dict((n, tokenize(o)) for n, o in obj.items())
            return list(tokenize(o) for o in obj)
        dataset = tokenize(dataset)
        torch.save(dataset, dataset_cache)

    return dataset


def pad_dataset(dataset, padding=0):
    """ Pad the dataset.
    This could be optimized by defining a Dataset class and padding at the batch level,
    but this is simpler. """
    max_l = min(args.max_len, max(len(x) for x in dataset["input_ids"]))

    for name in PADDED_INPUTS:
        dataset[name] = [x + [padding if name != "labels" else -100] * (max_l - len(x)) if len(x) < args.max_len else x[:args.max_len] for x in dataset[name]]

    return dataset


def build_input_from_segments(persona, history, reply, tokenizer, labels=False, with_eos=True):
    """ Build a sequence of input from 3 segments: persona, history and last reply. """
    bos, eos, speaker1, speaker2 = tokenizer.convert_tokens_to_ids(SPECIAL_TOKENS[:-1])
    sequence = [[bos] + list(chain(*persona))] + \
        history + [reply + ([eos] if with_eos else [])]
    sequence = [sequence[0]] + [[speaker2 if (len(sequence)-i) %
                                 2 else speaker1] + s for i, s in enumerate(sequence[1:])]
    instance = {}
    instance["input_ids"] = list(chain(*sequence))
    instance["token_type_ids"] = [speaker2 if i %
                                  2 else speaker1 for i, s in enumerate(sequence) for _ in s]
    instance["labels"] = [-100] * len(instance["input_ids"])
    if labels:
        instance["labels"] = ([-100] * sum(len(s) for s in sequence[:-1])) + [-100] + sequence[-1][1:]

    return instance


def get_data_loaders(args, tokenizer):
    """ Prepare the dataset for training and evaluation """
    personachat = get_dataset(tokenizer, args.dataset_path, args.dataset_cache)

    logger.info("Build inputs and labels")
    datasets = {"train": defaultdict(list), "valid": defaultdict(list)}
    for dataset_name, dataset in personachat.items():
        num_candidates = len(dataset[0]["utterances"][0]["candidates"])
        if args.num_candidates > 0 and dataset_name == 'train':
            num_candidates = min(args.num_candidates, num_candidates)
        for dialog in dataset:
            persona = dialog["personality"].copy()
            for _ in range(args.personality_permutations):
                for utterance in dialog["utterances"]:
                    history = utterance["history"][-(2 * args.max_history + 1):]
                    for j, candidate in enumerate(utterance["candidates"][-num_candidates:]):
                        labels = bool(j == num_candidates - 1)
                        instance = build_input_from_segments(
                            persona, history, candidate, tokenizer, labels)
                        for input_name, input_array in instance.items():
                            datasets[dataset_name][input_name].append(
                                input_array)
                    datasets[dataset_name]["n_candidates"] = num_candidates
                # permuted personalities
                persona = [persona[-1]] + persona[:-1]

    logger.info("Pad inputs and convert to Tensor")
    tensor_datasets = {"train": [], "valid": []}
    for dataset_name, dataset in datasets.items():
        dataset = pad_dataset(
            dataset, padding=tokenizer.convert_tokens_to_ids(SPECIAL_TOKENS[-1]))
        for input_name in MODEL_INPUTS:
            tensor = torch.tensor(dataset[input_name])
            tensor = tensor.view(
                (-1, datasets[dataset_name]["n_candidates"]) + tensor.shape[1:])
            tensor_datasets[dataset_name].append(tensor)

    logger.info("Build train and validation dataloaders")
    train_dataset, valid_dataset = TensorDataset(*tensor_datasets["train"]), TensorDataset(*tensor_datasets["valid"])
    train_loader = DataLoader(train_dataset,
                              batch_size=args.train_batch_size,
                              num_workers=args.num_workers,
                              shuffle=True)
    valid_loader = DataLoader(valid_dataset,
                              batch_size=args.valid_batch_size,
                              num_workers=args.num_workers,
                              shuffle=False)

    return train_loader, valid_loader
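
To make the layout produced by `build_input_from_segments` and `pad_dataset` concrete, here is a small self-contained sketch that restates both with fixed integer IDs instead of a real tokenizer. The IDs (`0` to `4`), the helper names `build_instance` and `pad_batch`, and the toy persona, history, and reply are all assumptions made for illustration. Unlike `pad_dataset`, `pad_batch` takes `max_len` as a parameter rather than reading the global `args`.

```python
from itertools import chain

# Hypothetical IDs standing in for <s>, </s>, <usr>, <sys>, <pad> (same order as SPECIAL_TOKENS).
BOS, EOS, SPEAKER1, SPEAKER2, PAD = 0, 1, 2, 3, 4


def build_instance(persona, history, reply, labels=False, with_eos=True):
    """Toy restatement of build_input_from_segments with fixed special-token IDs."""
    sequence = [[BOS] + list(chain(*persona))] + history + [reply + ([EOS] if with_eos else [])]
    n_segments = len(sequence)
    # Prefix each utterance with a speaker token, alternating backwards from the reply.
    sequence = [sequence[0]] + [
        [SPEAKER2 if (n_segments - i) % 2 else SPEAKER1] + s
        for i, s in enumerate(sequence[1:])
    ]
    instance = {
        "input_ids": list(chain(*sequence)),
        "token_type_ids": [SPEAKER2 if i % 2 else SPEAKER1
                           for i, s in enumerate(sequence) for _ in s],
    }
    instance["labels"] = [-100] * len(instance["input_ids"])
    if labels:
        # Only the reply tokens (after its speaker token) contribute to the loss.
        instance["labels"] = [-100] * sum(len(s) for s in sequence[:-1]) + [-100] + sequence[-1][1:]
    return instance


def pad_batch(sequences, pad_value, max_len):
    """Pad to the longest sequence (capped at max_len) and truncate anything longer."""
    target = min(max_len, max(len(s) for s in sequences))
    return [s + [pad_value] * (target - len(s)) if len(s) < max_len else s[:max_len]
            for s in sequences]


instance = build_instance(persona=[[10, 11], [12]], history=[[20, 21]], reply=[30, 31], labels=True)
print(instance["input_ids"])       # [0, 10, 11, 12, 3, 20, 21, 2, 30, 31, 1]
print(instance["token_type_ids"])  # [2, 2, 2, 2, 3, 3, 3, 2, 2, 2, 2]
print(instance["labels"])          # eight -100 entries, then [30, 31, 1]

padded_ids = pad_batch([instance["input_ids"], [0, 10, 2, 30, 1]], pad_value=PAD, max_len=8)
padded_labels = pad_batch([instance["labels"], [-100, -100, -100, 30, 1]], pad_value=-100, max_len=8)
print(padded_ids)
print(padded_labels)
```

Labels are padded with `-100` because that is the default `ignore_index` of PyTorch's cross-entropy loss. Truncation cuts from the right, so in this toy batch the first sequence loses its reply tokens once it is clipped to `max_len=8`.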

# PersonaChat model defined for PyTorch Lightning

In [10]:
class CMPersonaChat(LightningModule):
    def __init__(self, **hparams):  # accept hparams as keyword arguments so callers can pass **vars(args)
    # def __init__(self, hparams):  # a single positional hparams object does not work here
        super(CMPersonaChat, self).__init__()
        self.save_hyperparameters()
        self.kogpt2 = get_kogpt2_model()

    @staticmethod
    def add_model_specific_args(parent_parser):
        # add model specific args
        parser = argparse.ArgumentParser(parents=[parent_parser], add_help=False)
        parser.add_argument('--lr',
                            type=float,
                            default=5e-5,
                            help='The initial learning rate')
        parser.add_argument('--warmup_ratio',
                            type=float,
                            default=0.1,
                            help='warmup ratio')
        return parser

    @property
    def num_training_steps(self) -> int:
        """Total training steps inferred from datamodule and devices.
        https://github.com/PyTorchLightning/pytorch-lightning/issues/5449#issuecomment-757863689
        https://github.com/Zasder3/train-CLIP/issues/29#issuecomment-1056339940
        """
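        dataloader = self.trainer._data_connector._train_dataloader_source.dataloader()

        # Honor max_steps only when it is positive (it may be -1 or None when unset)
        if self.trainer.max_steps and self.trainer.max_steps > 0:
            return self.trainer.max_steps

        # len(dataloader) is the number of batches per epoch;
        # limit_train_batches is either a fraction (float) or a batch count (int)
        limit_batches = self.trainer.limit_train_batches
        num_batches = len(dataloader)
        if isinstance(limit_batches, int):
            num_batches = min(num_batches, limit_batches)
        else:
            num_batches = int(num_batches * limit_batches)

        num_devices = max(1, self.trainer.num_gpus, self.trainer.num_processes)
        if self.trainer.tpu_cores:
            num_devices = max(num_devices, self.trainer.tpu_cores)

        # One optimizer step consumes accumulate_grad_batches batches on every device
        batches_per_step = self.trainer.accumulate_grad_batches * num_devices
        return (num_batches // batches_per_step) * self.trainer.max_epochs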

    def forward(self, inputs, token_type_ids, labels=None):
        outputs = self.kogpt2(inputs, token_type_ids=token_type_ids, labels=labels)
        return outputs

    def training_step(self, batch, batch_idx):
        # Batch order follows MODEL_INPUTS: input_ids, labels, token_type_ids
        token_ids, label, token_type_ids = batch
        # forward: input(batch,max_sentence_length) -> output(batch_size, max_sentence_length,vocab)
        # e.g. (4,768) -> (4,768,50000)
        outputs = self(token_ids, token_type_ids=token_type_ids, labels=label)
        self.log("loss/train_loss", outputs.loss)

        return outputs.loss

    def validation_step(self, batch, batch_idx):
        # batch = tuple(input_tensor.to(self.hparams.device) for input_tensor in batch)
        # Batch order follows MODEL_INPUTS: input_ids, labels, token_type_ids
        token_ids, label, token_type_ids = batch
        outputs = self(token_ids, token_type_ids=token_type_ids, labels=label)
        self.log("loss/val_loss", outputs.loss)

        return outputs.loss

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack(outputs).mean()
        self.log("loss/avg_val_loss", avg_loss)

    def configure_optimizers(self):
        # TODO: rework the warmup to use manual optimization so num_training_steps can be computed without the dataloader
        # Prepare optimizer
        param_optimizer = list(self.named_parameters())
        no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
        optimizer_grouped_parameters = [
            {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
            {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
        ]
        optimizer = AdamW(optimizer_grouped_parameters,
                          lr=self.hparams.lr, correct_bias=False)

        # Prepare learning rate scheduler
        num_train_steps = self.num_training_steps
        num_warmup_steps = int(num_train_steps * self.hparams.warmup_ratio)
        scheduler = get_cosine_schedule_with_warmup(
            optimizer,
            num_warmup_steps=num_warmup_steps, num_training_steps=num_train_steps)
        lr_scheduler = {'scheduler': scheduler, 'name': 'cosine_schedule_with_warmup',
                        'monitor': 'loss', 'interval': 'step',
                        'frequency': 1}
        return [optimizer], [lr_scheduler]
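
The scheduler set up in `configure_optimizers` needs the total number of optimizer steps, and then scales the base learning rate by a warmup-then-cosine multiplier. The following pure-Python sketch shows that arithmetic under stated assumptions: the run size (128 batches per epoch) is hypothetical, the helper names are made up for this example, and the multiplier follows the half-cycle cosine shape that `get_cosine_schedule_with_warmup` uses by default, rather than calling the library itself.

```python
import math


def optimizer_steps(batches_per_epoch, accumulate_grad_batches, max_epochs):
    """Number of optimizer updates when gradients are accumulated over several batches."""
    return (batches_per_epoch // accumulate_grad_batches) * max_epochs


def cosine_with_warmup(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear warmup, then half-cosine decay to zero."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Hypothetical run: 128 batches per epoch, accumulate_grad_batches=8, max_epochs=3.
total_steps = optimizer_steps(128, 8, 3)   # 48
warmup_steps = int(total_steps * 0.1)      # 4, with warmup_ratio=0.1
base_lr = 5e-5

for step in (0, 2, 4, 26, 48):
    print(step, base_lr * cosine_with_warmup(step, warmup_steps, total_steps))
```

The multiplier rises linearly to 1 over the warmup steps, passes through 0.5 halfway through the decay phase, and reaches 0 at the final step. If the step count passed to the scheduler is wrong, this curve is stretched or compressed accordingly.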

# argparser

In [11]:
parser = argparse.ArgumentParser()
parser.add_argument("--dataset_path", type=str,
                    default="dataset/personachat_google_translated.json",
                    help="Path of the dataset.")
parser.add_argument("--dataset_cache", type=str,
                    default='./dataset_cache',
                    help="Path or url of the dataset cache")
parser.add_argument("--num_candidates", type=int, default=1,
                    help="Number of candidates for training")
parser.add_argument("--personality_permutations", type=int, default=1,
                    help="Number of permutations of personality sentences")
parser.add_argument("--max_history", type=int, default=2,
                    help="Number of previous exchanges to keep in history")
parser.add_argument("--name", type=str,
                    default="cm_kogpt2",
                    help="Model name for logging")
parser.add_argument("--ckpt_path", type=str,
                    help="Checkpoint path for training or evaluation")

# Shared arguments for dataloader and training
parser.add_argument('--max_len',
                    type=int,
                    default=768,
                    help='max sentence length on input (default: 768)')
parser.add_argument("--train_batch_size", type=int,
                    default=4, help="Batch size for training")
parser.add_argument("--valid_batch_size", type=int,
                    default=4, help="Batch size for validation")
parser.add_argument("--num_workers", type=int,
                    default=min(os.cpu_count(), 8), help="Number of workers for DataLoader")

# Special arguments for inference
parser.add_argument("--temperature", type=float, default=0.7, help="Sampling softmax temperature")
parser.add_argument("--top_k", type=int, default=0, help="Filter top-k tokens before sampling (<=0: no filtering)")
parser.add_argument("--top_p", type=float, default=0.9, help="Nucleus filtering (top-p) before sampling (<=0.0: no filtering)")
parser.add_argument("--no_sample", action='store_true', help="Set to use greedy decoding instead of sampling")
parser.add_argument("--min_length", type=int, default=1, help="Minimum length of the output utterances")
parser.add_argument("--max_length", type=int, default=20, help="Maximum length of the output utterances")

# Select train/inference
parser.add_argument('--mode', type=str, choices=['train', 'chat'],
                    required=True,
                    help='Script mode to execute (train, chat)')

# Model configuration arguments
parser = CMPersonaChat.add_model_specific_args(parser)
parser = Trainer.add_argparse_args(parser)

# Fine-tune KoGPT2 for PersonaChat

For the arguments to be picked up correctly in Colab, `parse_args` must be given an explicit `args` list, as in the following:
```
args = parser.parse_args(args=['--mode', 'train', '--dataset_path', '/content/drive/My Drive/KoGPT2-personachat/dataset/sample.json', '--gpus', '1'])
```
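
The reason for the explicit list is that `parse_args()` without arguments reads `sys.argv[1:]`, which in a notebook holds the kernel's own launch arguments rather than yours. A minimal sketch with a stand-in parser follows. `demo_parser` is hypothetical and mirrors only two of the arguments defined above.

```python
import argparse

# A small stand-in parser; the argument names mirror a subset of the ones defined above.
demo_parser = argparse.ArgumentParser()
demo_parser.add_argument("--mode", type=str, choices=["train", "chat"], required=True)
demo_parser.add_argument("--valid_batch_size", type=int, default=4)

# Passing a list bypasses sys.argv entirely.
demo_args = demo_parser.parse_args(args=["--mode", "train", "--valid_batch_size", "2"])
print(demo_args.mode, demo_args.valid_batch_size)

# Switching modes later only requires parsing a new list; omitted options keep their defaults.
chat_args = demo_parser.parse_args(args=["--mode", "chat"])
print(chat_args.mode, chat_args.valid_batch_size)
```

Every value in the list is a string, which is why numeric options are written as `'1'` or `'2'` and converted by the `type=` callable.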

In [12]:
args = parser.parse_args(args=['--mode', 'train',
                               '--dataset_path', '/content/drive/My Drive/KoGPT2-personachat/dataset/personachat_manual_translated.json',
                               '--gpus', '1',
                               '--valid_batch_size', '2'])

tokenizer = get_kogpt2_tokenizer()
train_loader, val_loader = get_data_loaders(args, tokenizer)

# TensorBoard logger settings
tb_logger = TensorBoardLogger("/content/drive/My Drive/KoGPT2-personachat/logs", name=args.name, default_hp_metric=False)
checkpoint_callback = ModelCheckpoint(
    dirpath=f'{tb_logger.log_dir}/checkpoints',
    filename='model_epoch-{epoch:02d}_avg_val_loss-{loss/avg_val_loss:.4f}',
    auto_insert_metric_name=False,
    verbose=True,
    save_last=True,
    save_top_k=10,
    mode='min',
    monitor='loss/avg_val_loss'
)

if args.ckpt_path is None:
    trainer = Trainer.from_argparse_args(
        args,
        callbacks=[checkpoint_callback],
        gradient_clip_val=1.0,
        max_epochs=3,
        accumulate_grad_batches=8,
        logger=tb_logger)

    model = CMPersonaChat(**vars(args))

# Fine-tune from saved checkpoint
else:
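    # Build the Trainer from args here too, so options such as --gpus still apply when resuming
    trainer = Trainer.from_argparse_args(
        args,
        resume_from_checkpoint=args.ckpt_path,
        callbacks=[checkpoint_callback],
        gradient_clip_val=1.0,
        max_epochs=3,
        accumulate_grad_batches=8,
        logger=tb_logger)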

    model = CMPersonaChat.load_from_checkpoint(args.ckpt_path)

model.train()
trainer.fit(model, train_loader, val_loader)
logging.info('best model path {}'.format(checkpoint_callback.best_model_path))

Downloading:   0%|          | 0.00/2.83M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.00k [00:00<?, ?B/s]

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'GPT2Tokenizer'. 
The class this function is called from is 'PreTrainedTokenizerFast'.
INFO:cm_kogpt2:Reading /content/drive/My Drive/KoGPT2-personachat/dataset/personachat_manual_translated.json
INFO:cm_kogpt2:Tokenize and encode the dataset
INFO:cm_kogpt2:Build inputs and labels
INFO:cm_kogpt2:Pad inputs and convert to Tensor
INFO:cm_kogpt2:Build train and validation dataloaders
INFO:pytorch_lightning.utilities.distributed:GPU available: True, used: True
INFO:pytorch_lightning.utilities.distributed:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.distributed:IPU available: False, using: 0 IPUs


Downloading:   0%|          | 0.00/513M [00:00<?, ?B/s]

INFO:pytorch_lightning.accelerators.gpu:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name   | Type            | Params
-------------------------------------------
0 | kogpt2 | GPT2LMHeadModel | 125 M 
-------------------------------------------
125 M     Trainable params
0         Non-trainable params
125 M     Total params
500.656   Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.distributed:Epoch 0, global step 15: loss/avg_val_loss reached 4.05268 (best 4.05268), saving model to "/content/drive/My Drive/KoGPT2-personachat/logs/cm_kogpt2/version_0/checkpoints/model_epoch-00_avg_val_loss-4.0527.ckpt" as top 10


Validating: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.distributed:Epoch 1, global step 31: loss/avg_val_loss reached 3.86469 (best 3.86469), saving model to "/content/drive/My Drive/KoGPT2-personachat/logs/cm_kogpt2/version_0/checkpoints/model_epoch-01_avg_val_loss-3.8647.ckpt" as top 10


Validating: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.distributed:Epoch 2, global step 47: loss/avg_val_loss reached 4.09366 (best 3.86469), saving model to "/content/drive/My Drive/KoGPT2-personachat/logs/cm_kogpt2/version_0/checkpoints/model_epoch-02_avg_val_loss-4.0937.ckpt" as top 10
INFO:pytorch_lightning.utilities.distributed:Saving latest checkpoint...
INFO:root:best model path /content/drive/My Drive/KoGPT2-personachat/logs/cm_kogpt2/version_0/checkpoints/model_epoch-01_avg_val_loss-3.8647.ckpt
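
In the log above, validation loss is lowest after epoch 1 (3.86469) and rises again in epoch 2, so `last.ckpt` is not the best checkpoint. `checkpoint_callback.best_model_path` reports the best one during training. If you need to pick it later from the checkpoint directory, the metric encoded in the file names is enough. The sketch below assumes the file-name pattern configured in `ModelCheckpoint` above, and the helper `best_checkpoint` is hypothetical.

```python
import re

# File names as written by the ModelCheckpoint callback in the log above.
checkpoint_names = [
    "model_epoch-00_avg_val_loss-4.0527.ckpt",
    "model_epoch-01_avg_val_loss-3.8647.ckpt",
    "model_epoch-02_avg_val_loss-4.0937.ckpt",
]

PATTERN = re.compile(r"model_epoch-(\d+)_avg_val_loss-(\d+\.\d+)\.ckpt$")


def best_checkpoint(names):
    """Return (name, epoch, val_loss) for the lowest validation loss encoded in the names."""
    parsed = []
    for name in names:
        match = PATTERN.search(name)
        if match:  # skip files such as last.ckpt that carry no metric
            parsed.append((name, int(match.group(1)), float(match.group(2))))
    return min(parsed, key=lambda item: item[2])


best_name, best_epoch, best_loss = best_checkpoint(checkpoint_names + ["last.ckpt"])
print(best_name, best_epoch, best_loss)
```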


# Interactive chatting

In [13]:
import random
import warnings


def top_filtering(logits, top_k=0., top_p=0.9, threshold=-float('Inf'), filter_value=-float('Inf')):
    """ Filter a distribution of logits using top-k, top-p (nucleus) and/or threshold filtering
        Args:
            logits: logits distribution shape (vocabulary size)
            top_k: <=0: no filtering, >0: keep only top k tokens with highest probability.
            top_p: <=0.0: no filtering, >0.0: keep only a subset S of candidates, where S is the smallest subset
                whose total probability mass is greater than or equal to the threshold top_p.
                In practice, we select the highest probability tokens whose cumulative probability mass exceeds
                the threshold top_p.
            threshold: a minimal threshold to keep logits
    """
    assert logits.dim() == 1  # Only work for batch size 1 for now - could update but it would obfuscate a bit the code
    top_k = min(top_k, logits.size(-1))
    if top_k > 0:
        # Remove all tokens with a probability less than the last token in the top-k tokens
        indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
        logits[indices_to_remove] = filter_value

    if top_p > 0.0:
        # Compute cumulative probabilities of sorted tokens
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative_probabilities = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

        # Remove tokens with cumulative probability above the threshold
        sorted_indices_to_remove = cumulative_probabilities > top_p
        # Shift the indices to the right to keep also the first token above the threshold
        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
        sorted_indices_to_remove[..., 0] = 0

        # Back to unsorted indices and set them to -infinity
        indices_to_remove = sorted_indices[sorted_indices_to_remove]
        logits[indices_to_remove] = filter_value

    indices_to_remove = logits < threshold
    logits[indices_to_remove] = filter_value

    return logits


def sample_sequence(personality, history, tokenizer, model, args, current_output=None):
    special_tokens_ids = tokenizer.convert_tokens_to_ids(SPECIAL_TOKENS)
    if current_output is None:
        current_output = []

    for i in range(args.max_length):  # cap on generated tokens, not on input length (max_len)
        instance = build_input_from_segments(personality, history, current_output, tokenizer, with_eos=False)

        input_ids = torch.tensor(instance["input_ids"]).unsqueeze(0)
        token_type_ids = torch.tensor(instance["token_type_ids"]).unsqueeze(0)

        logits = model(input_ids, token_type_ids=token_type_ids).logits
        if isinstance(logits, tuple):  # for gpt2 and maybe others
            logits = logits[0]
        logits = logits[0, -1, :] / args.temperature
        logits = top_filtering(logits, top_k=args.top_k, top_p=args.top_p)
        probs = F.softmax(logits, dim=-1)

        prev = torch.topk(probs, 1)[1] if args.no_sample else torch.multinomial(probs, 1)
        if i < args.min_length and prev.item() in special_tokens_ids:
            while prev.item() in special_tokens_ids:
                if probs.max().item() == 1:
                    warnings.warn("Warning: model generating special token with probability 1.")
                    break  # avoid infinitely looping over special token
                prev = torch.multinomial(probs, num_samples=1)

        if prev.item() in special_tokens_ids:
            break
        current_output.append(prev.item())

    return current_output


def chat(args, tokenizer):
    dataset = get_dataset(tokenizer, args.dataset_path, args.dataset_cache)
    model = CMPersonaChat.load_from_checkpoint(args.ckpt_path)

    personalities = [dialog["personality"] for dataset in dataset.values() for dialog in dataset]
    personality = random.choice(personalities)
    for sentence in personality:
        print("Selected personality: %s" % tokenizer.decode(sentence))
    history = []

    while True:
        raw_text = input(">>> ")
        while not raw_text:
            print('Prompt should not be empty!')
            raw_text = input(">>> ")
        history.append(tokenizer.encode(raw_text))
        with torch.no_grad():
            out_ids = sample_sequence(personality, history, tokenizer, model, args)
        history.append(out_ids)
        history = history[-(2 * args.max_history + 1):]
        out_text = tokenizer.decode(out_ids)
        print(out_text)
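
To see what `top_filtering` does to the next-token distribution, the following pure-Python sketch reproduces only the nucleus (top-p) step on a four-token toy vocabulary, along with the effect of dividing the logits by a temperature. The helpers `softmax` and `top_p_keep` and the probabilities are assumptions for illustration. The function above operates on torch tensors and additionally supports top-k and threshold filtering.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_keep(logits, top_p):
    """Indices kept by nucleus filtering: the highest-probability tokens, taken in order,
    up to and including the first one that pushes the cumulative mass above top_p."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative > top_p:
            break
    return sorted(kept)


# Toy distribution over four tokens (hypothetical probabilities).
logits = [math.log(p) for p in (0.15, 0.5, 0.05, 0.3)]

kept = top_p_keep(logits, top_p=0.9)
print(kept)  # token 2 (p=0.05) falls outside the nucleus

# Temperature below 1 sharpens the distribution before filtering.
scaled = [x / 0.7 for x in logits]
print(softmax(logits)[1], softmax(scaled)[1])
```

Tokens outside the kept set have their logits set to `-inf` in `top_filtering`, so they receive zero probability after the final softmax.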


In [14]:
args = parser.parse_args(args=['--mode', 'chat',
                               '--dataset_path', '/content/drive/My Drive/KoGPT2-personachat/dataset/personachat_manual_translated.json',
                               '--gpus', '1',
                               '--ckpt_path', '/content/drive/My Drive/KoGPT2-personachat/logs/cm_kogpt2/version_0/checkpoints/last.ckpt'])
tokenizer = get_kogpt2_tokenizer()

chat(args, tokenizer)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'GPT2Tokenizer'. 
The class this function is called from is 'PreTrainedTokenizerFast'.
INFO:cm_kogpt2:Reading /content/drive/My Drive/KoGPT2-personachat/dataset/personachat_manual_translated.json
INFO:cm_kogpt2:Tokenize and encode the dataset


Selected personality: 저는 20 년 전에 태어났습니다.
Selected personality: 나는 미국에 산다.
Selected personality: 내가 가장 좋아하는 색은 파란색입니다.
Selected personality: 나는 남자로 태어 났고 17 살 때 여자로 전환했습니다.
>>> 안녕하세요
안녕하세요! 나는 그들을 사랑한다. 당신은요?
>>> 저도 당신을 사랑합니다. 가장 좋아하는 색은 무엇인가요?
저는 패션 디자이너를 좋아합니다.


KeyboardInterrupt: ignored
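
The chat loop keeps only the most recent `2 * max_history + 1` utterances after each reply, so older turns eventually drop out of the model's context. A minimal sketch of that sliding window follows, using placeholder strings in place of token-ID lists. The helper `update_history` is hypothetical and folds the two `append` calls and the slice from `chat` into one step.

```python
def update_history(history, user_turn, reply_turn, max_history=2):
    """Append one user/reply exchange and keep the most recent 2 * max_history + 1 utterances."""
    history = history + [user_turn, reply_turn]
    return history[-(2 * max_history + 1):]


history = []
for turn in range(1, 4):
    history = update_history(history, f"user{turn}", f"reply{turn}")
    print(history)
```

With the default `max_history=2`, the window holds five utterances, so the first user turn is dropped after the third exchange.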