# Script Writing

GPT-2 Model Experiment 2<br>
Script writing in Korean
- Data: [짤툰](https://www.youtube.com/c/%EC%A7%A4%ED%88%B01) script data
- Model: [SKT AI KoGPT2](https://github.com/SKT-AI/KoGPT2) fine-tuning

Author: [Seongbum Seo](https://github.com/Seongbuming)

In [1]:
import torch
torch.cuda.empty_cache()

## Background Setup

In [2]:
# Install transformers library
%pip install -q git+https://github.com/huggingface/transformers.git
# Install helper functions
%pip install -q git+https://github.com/gmihaila/ml_things.git
%pip install -q fastai==2.2.5



In [3]:
# Clone base model
!git clone https://github.com/SKT-AI/KoGPT2
%pip install matplotlib==3.1.3

fatal: destination path 'KoGPT2' already exists and is not an empty directory.
Collecting matplotlib==3.1.3
  Using cached matplotlib-3.1.3-cp38-cp38-manylinux1_x86_64.whl (13.1 MB)
Installing collected packages: matplotlib
  Attempting uninstall: matplotlib
    Found existing installation: matplotlib 3.5.2
    Uninstalling matplotlib-3.5.2:
      Successfully uninstalled matplotlib-3.5.2
ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

ml-things 0.0.1 requires matplotlib>=3.4.0, but you'll have matplotlib 3.1.3 which is incompatible.
Successfully installed matplotlib-3.1.3

## Model setup

In [19]:
import io
import os
import torch
import transformers
import fastai
import re
from typing import Optional
from tqdm.notebook import tqdm
from torch.utils.data import Dataset, DataLoader, random_split
from ml_things import fix_text
from transformers import AutoModelWithLMHead, PreTrainedTokenizerFast, GPT2Config
from fastai.text.all import *

# Number of training epochs
epochs = 10

# Batch size - depends on the max sequence length and GPU memory
# For a sequence length of 512, a batch size of 10 works without CUDA memory issues
# For shorter sequences, a batch size of 32 or higher may fit
batch_size = 8

# Pad or truncate text sequences to this length
# If `None`, the maximum sequence length allowed by the model is used
max_length = 256

# Look for GPU to use
# Will use 'cpu' by default if no GPU found
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Name of the base model to use
model_name_or_path = 'skt/kogpt2-base-v2'

# Path of data to use for training
train_data_path = './dataset/jjaltoon_scripts'

## Data

In [20]:
class ScriptDataset(Dataset):
    r'''PyTorch dataset that holds one raw script (as a string) per file in `path`.
    
    Note: `use_tokenizer` is accepted but not used here; tokenization happens in the collator.
    '''
    
    def __init__(self, path, use_tokenizer):
        # Check if path exists
        if not os.path.isdir(path):
            # Raise error if path is invalid
            raise ValueError('Invalid `path` variable. Needs to be a directory.')
        
        self.examples = []
        
        # Get all files from path
        files_names = os.listdir(path)
        # Go through each file and read its content
        for file_name in tqdm(files_names, desc='script files'):
            file_path = os.path.join(path, file_name)
            
            # Read content and make sure the file handle is closed afterwards
            with io.open(file_path, mode='r', encoding='utf-8') as f:
                content = f.read()
            # Fix any unicode issues
            content = fix_text(content)
            # Save content
            self.examples.append(content)
        
        # Number of examples
        self.n_examples = len(self.examples)
    
    def __len__(self):
        r'''Return the number of examples when `len` is called on the dataset.
        '''
        
        return self.n_examples
    
    def __getitem__(self, item):
        r'''Given an index return an example from the position.
        
        Arguments:
            item(:obj:`int`):
                Index position to pick an example to return.
        
        Returns:
            :obj:`str`: Script of the index position.
        '''
        
        return self.examples[item]
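
The file-reading step of `ScriptDataset` can be tried without PyTorch. The sketch below is a minimal standalone variant, not the notebook's own code: `read_scripts` is a hypothetical helper, and skipping non-file entries and sorting the file names are assumptions added here so the example order is reproducible.

```python
import os
import tempfile
from typing import List


def read_scripts(path: str) -> List[str]:
    """Hypothetical helper: return the text of every regular file in `path`."""
    if not os.path.isdir(path):
        raise ValueError('Invalid `path` variable. Needs to be a directory.')
    scripts = []
    # Sort file names so the example order does not depend on the filesystem
    for file_name in sorted(os.listdir(path)):
        file_path = os.path.join(path, file_name)
        # Skip sub-directories and other non-file entries (assumption, not in the notebook)
        if not os.path.isfile(file_path):
            continue
        with open(file_path, mode='r', encoding='utf-8') as f:
            scripts.append(f.read())
    return scripts


# Usage on a throwaway directory with two script files and one sub-directory
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [('b.txt', 'second'), ('a.txt', 'first')]:
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as f:
            f.write(text)
    os.mkdir(os.path.join(tmp, 'subdir'))
    demo_scripts = read_scripts(tmp)

print(demo_scripts)  # ['first', 'second']
```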

In [None]:
class TransformersTokenizer(Transform):
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
    
    # fastai `Transform` looks for methods named `encodes` / `decodes`
    def encodes(self, x):
        # Text -> tensor of token ids
        tokens = self.tokenizer.tokenize(x)
        return tensor(self.tokenizer.convert_tokens_to_ids(tokens))
    
    def decodes(self, x):
        # Tensor of token ids -> text
        return TitledStr(self.tokenizer.decode(x.cpu().numpy()))

In [6]:
class Gpt2ScriptWritingCollator(object):
    r'''Data Collator used for GPT-2 in a script writing task.
    
    It uses a given tokenizer and its encoder to convert any text to numbers that can go straight into a GPT-2 model.
    
    Arguments:
        use_tokenizer(:obj:`transformers.tokenization_?`):
            Transformer type tokenizer used to process raw text into numbers.
        max_sequence_len(:obj:`int`, `optional`):
            Maximum sequence length used to truncate or pad text sequences.
            If no value is passed, it will use the maximum sequence size supported by the tokenizer and model.
    '''
    
    def __init__(self, use_tokenizer, max_sequence_len=None):
        # Tokenizer to be used inside the class
        self.use_tokenizer = use_tokenizer
        # Check max sequence length
        self.max_sequence_len = use_tokenizer.model_max_length if max_sequence_len is None else max_sequence_len
    
    def __call__(self, sequences):
        r'''This function allows the class object to be used as a function call.
        
        Since the PyTorch DataLoader needs a collator function, this class can be used as one.
        
        Arguments:
            sequences(:obj:`list`):
                List of texts.
        
        Returns:
            :obj:`Dict[str, object]`: Dictionary of inputs that feed into the model.
            It can be passed to the model as `model(**inputs)`.
        '''
        
        # `ScriptDataset` yields plain strings, so the batch is already a list of texts
        text = list(sequences)
        # Call tokenizer on all texts to convert into tensors of numbers with appropriate padding
        inputs = self.use_tokenizer(text=text, return_tensors='pt', padding=True, truncation=True, max_length=self.max_sequence_len)
        
        return inputs
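
To make the collator's padding behavior concrete, here is a minimal pure-Python sketch of what happens to a batch of token ids. It is an illustration under stated assumptions, not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the pad id `0` is arbitrary, and it assumes tokens are cut from the right and padding is added on the left (the notebook sets `padding_side = 'left'`) up to the longest sequence in the batch.

```python
from typing import Dict, List


def pad_and_truncate(batch: List[List[int]], max_len: int, pad_id: int) -> Dict[str, List[List[int]]]:
    """Toy stand-in for tokenizer padding: truncate on the right, pad on the left."""
    # Keep at most `max_len` tokens from the start of each sequence
    truncated = [ids[:max_len] for ids in batch]
    # Pad every sequence to the longest one remaining in the batch
    width = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = width - len(ids)
        input_ids.append([pad_id] * n_pad + ids)
        # 0 marks padding positions the model should ignore, 1 marks real tokens
        attention_mask.append([0] * n_pad + [1] * len(ids))
    return {'input_ids': input_ids, 'attention_mask': attention_mask}


toy_batch = pad_and_truncate([[11, 12, 13, 14, 15], [21, 22]], max_len=4, pad_id=0)
print(toy_batch['input_ids'])       # [[11, 12, 13, 14], [0, 0, 21, 22]]
print(toy_batch['attention_mask'])  # [[1, 1, 1, 1], [0, 0, 1, 1]]
```

With `max_length = 256`, any script longer than 256 tokens loses its tail, which is worth keeping in mind when each training example is a whole script.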

In [7]:
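class Dropout(Callback):
    # The GPT-2 model returns several outputs; keep only the first one (the logits)
    # so the loss function receives a plain tensor. Despite its name, this is not a dropout layer.
    def after_pred(self):
        self.learn.pred = self.pred[0]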

In [8]:
def train(dataloader, optimizer_, scheduler_, device_):
    r'''Fine-tune the model with a fastai `Learner`.
    
    It uses the global variable `model`, which is the transformer model loaded on `device_`, and the global `epochs`.
    
    Arguments:
        dataloader(fastai `DataLoaders`):
            Parsed data into batches of tensors. The `Learner` expects fastai `DataLoaders`
            (with training and validation splits), not a bare PyTorch `DataLoader`.
        optimizer_(:obj:`transformers.optimization.AdamW`):
            Optimizer. Currently unused: the `Learner` creates its own optimizer.
        scheduler_(:obj:`torch.optim.lr_scheduler.LambdaLR`):
            PyTorch scheduler. Currently unused.
        device_(:obj:`torch.device`):
            Device the model is loaded on. Currently unused.
    
    Returns:
        :obj:`None`
    '''
    
    # Use global variable for model; `Learner` takes it as its second argument
    learn = Learner(dataloader, model, loss_func=CrossEntropyLossFlat(), cbs=[Dropout], metrics=Perplexity()).to_fp16()
    # Suggest a learning rate from a short range test
    lr = learn.lr_find()
    print(f'learning rate: {lr}')
    learn.fine_tune(epochs)
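
The `Perplexity()` metric passed to the `Learner` is the exponential of the average cross-entropy loss. As a rough illustration only (the `perplexity` function below is a hypothetical pure-Python helper, not fastai's implementation), it can be computed from the probabilities the model assigned to the correct next tokens:

```python
import math
from typing import List


def perplexity(target_probs: List[float]) -> float:
    """Perplexity from the probabilities assigned to the correct next tokens."""
    # Cross-entropy is the mean negative log-probability of the correct tokens
    cross_entropy = -sum(math.log(p) for p in target_probs) / len(target_probs)
    # Perplexity is the exponential of the cross-entropy
    return math.exp(cross_entropy)


# A model that always gives the correct token probability 0.25 is as
# uncertain as a uniform choice among 4 tokens, so the result is close to 4
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
print(round(uniform_ppl, 6))
```

Lower is better: a perplexity of 1 would mean the model assigned probability 1 to every correct token.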

## Model

In [9]:
# Get model configuration
print('Loading configuration...')
model_config = GPT2Config.from_pretrained(
    pretrained_model_name_or_path=model_name_or_path
)

# Get model's tokenizer
print('Loading tokenizer...')
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    model_name_or_path,
    bos_token='</s>',
    eos_token='</s>',
    unk_token='<unk>',
    pad_token='<pad>',
    mask_token='<mask>'
)
# Pad sequences on the left
tokenizer.padding_side = 'left'
# Reuse the EOS token as the PAD token (this overrides the `<pad>` token passed above)
tokenizer.pad_token = tokenizer.eos_token

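# Get the actual model
# `AutoModelWithLMHead` is deprecated in transformers; `AutoModelForCausalLM` is its replacement for GPT-2 style models
print('Loading model...')
model = transformers.AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name_or_path,
    config=model_config
)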

# Resize model embedding to match new tokenizer
model.resize_token_embeddings(len(tokenizer))
# Fix model padding token id
model.config.pad_token_id = model.config.eos_token_id
# Move the model to the selected device
model.to(device)
print(f'Model loaded to `{device}`.')

Loading configuration...
Loading tokenizer...

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'GPT2Tokenizer'.
The class this function is called from is 'PreTrainedTokenizerFast'.

Loading model...
Model loaded to `cuda`.

In [None]:
# Create data collator to encode texts into numbers
gpt2_script_writing_collator = Gpt2ScriptWritingCollator(
    use_tokenizer=tokenizer,
    max_sequence_len=max_length
)

# Create PyTorch dataset
print('Dealing with train...')
train_dataset = ScriptDataset(path=train_data_path, use_tokenizer=tokenizer)
print(f'Created `train_dataset` with {len(train_dataset)} examples.')

# Move PyTorch dataset into dataloader
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=gpt2_script_writing_collator)
print(f'Created `train_dataloader` with {len(train_dataloader)} batches.')

Dealing with train...


Created `train_dataset` with 10 examples.
Created `train_dataloader` with 2 batches.
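
The batch count reported above follows from the dataset size and `batch_size`: 10 scripts with a batch size of 8 give one full batch and one partial batch of 2. A small sketch of that arithmetic, where `num_batches` is a hypothetical helper that mirrors how a PyTorch `DataLoader` reports its length with and without `drop_last`:

```python
import math


def num_batches(n_examples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a data loader yields per epoch."""
    if drop_last:
        # Incomplete final batch is discarded
        return n_examples // batch_size
    # Incomplete final batch is kept
    return math.ceil(n_examples / batch_size)


print(num_batches(10, 8))                  # 2
print(num_batches(10, 8, drop_last=True))  # 1
```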


## Test

## Train