<a href="https://colab.research.google.com/github/akulczy/StoryGeneration/blob/main/Story_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **ECS7022P Computational Creativity Assignment**

---

### **Computational Creativity Assignment: Utilising a fine-tuned GPT-2 model in automatic story generation**

Agata Kulczynska, Student ID: 180489015

This project fine-tunes the GPT-2 model to generate textual descriptions and backstories for fictional characters.
Two GPT-2 models are fine-tuned on custom datasets scraped from the Internet and then used to produce text.

Implementation of the system is based on the Colab Notebook linked below:
https://colab.research.google.com/drive/1vnpMoZoenRrWeaxMyfYK4DDbtlBu-M8V

Drive folder including both models and datasets:<br>
https://drive.google.com/drive/folders/1fr2C-3qzVQik6YtuJ3xsHGhTPtqI2kCY

# Logistics Code:

The logistics code comprises library and package imports, dataset preprocessing, parameter definitions, structuring of the data into the GPT-2 input format, and loading of the GPT-2 model and tokenizer.

### Imports

In [None]:
!pip install transformers


In [None]:
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config
import numpy as np
import random
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AdamW, get_linear_schedule_with_warmup, \
                         AutoTokenizer, AutoConfig, AutoModelForPreTraining, \
                         TrainingArguments, BeamScorer, Trainer
from tqdm import tqdm, trange
import torch.nn.functional as F
import csv
import os
import re
import string
import nltk
from nltk.tokenize import word_tokenize
import ipywidgets as widgets

In [None]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# CSS to wrap lines when displaying the output
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [None]:
# Mount drive
"""from google.colab import drive
drive.mount('/content/drive')"""

In [None]:
#Empty cache 
import gc
gc.collect()
torch.cuda.empty_cache()

### Dataset Preprocessing

For the task of text generation, two separate GPT-2 models were trained. The former (referred to as Model 1) produces textual descriptions of fictional characters, while the latter (referred to as Model 2) generates backstories.
<br>
Both models are trained on data scraped from the Internet. The training data for the backstory model is additionally complemented with data extracted from the publicly available DnD-characters dataset. Consequently, two separate textual corpora were prepared and pre-processed for the training phase.
<br>
The code written to scrape the data from the Internet is commented out and included at the end of this section.

In [None]:
# Method to preprocess the dataset
"""def preprocess_datasets(dataset, dataset_columns):
    preprocessed_corpus = []
    for row in dataset.values:
        # Tokenize to check the length of the text input
        tokens = word_tokenize(row[3])
        # Keep only texts of at most 450 tokens
        if len(tokens) <= 450:
            # Remove selected special characters
            line = re.sub('‹_ÕÊœâ€[€@#$]', '', row[3])
            preprocessed_corpus.append([row[0], row[1], row[2], line])
    return pd.DataFrame(preprocessed_corpus, columns=dataset_columns)"""

In [None]:
# Load the datasets
"""data1 = pd.read_csv('dd_bios.csv', encoding='ISO-8859-1')
data2 = pd.read_csv('data1.csv', encoding='ISO-8859-1')
data3 = pd.read_csv('data2.csv', encoding='ISO-8859-1')"""

In [None]:
# Corpus 1 - for characters' descriptions generation
"""columns1 = ['name', 'race', 'charClass', 'description']
# Combine the scraped datasets and choose relevant columns
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
corpus1 = pd.concat([data2, data3])[columns1]

# Drop rows with empty values
corpus1 = corpus1.replace('', np.nan).dropna(subset=columns1)

# Apply preprocessing
corpus1 = preprocess_datasets(corpus1, columns1)

# Check the final length
print("Corpus 1 length: " + str(len(corpus1)))

# Corpus 2 - for characters' backstories generation
columns2 = ['name', 'race', 'charClass', 'background']
# Choose relevant columns
corpus2a = pd.concat([data2, data3])[columns2]
# Append data
corpus2 = pd.concat([corpus2a, data1])

# Drop rows with empty values
corpus2[columns2] = corpus2[columns2].replace('', np.nan)
corpus2 = corpus2.dropna(subset=columns2)

# Apply preprocessing
corpus2 = preprocess_datasets(corpus2, columns2)

# Check the final length
print("Corpus 2 length: " + str(len(corpus2)))"""

In [None]:
# Review final corpus
"corpus1"


In [None]:
"corpus2"


In [None]:
# Turn into lists
"""name1 = corpus1['name'].values.tolist()
charClass1 = corpus1['charClass'].values.tolist()
race1 = corpus1['race'].values.tolist()
description = corpus1['description'].values.tolist()

name2 = corpus2['name'].values.tolist()
charClass2 = corpus2['charClass'].values.tolist()
race2 = corpus2['race'].values.tolist()
background = corpus2['background'].values.tolist()"""

In [None]:
### Code implemented to scrape the data

"""
import requests
import mechanize
import http.cookiejar as cookielib
from bs4 import BeautifulSoup
import pandas as pd

# Url of the website from which the data was scraped
url = "https://3edb.com/selection.asp"

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}

r = requests.get("https://3edb.com/selection.asp", headers=headers)
c = r.content

soup = BeautifulSoup(r.content, 'html.parser')

post_params = {'cn': '', 'on': ''}
response = requests.post(url, data=post_params, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

final_df = pd.DataFrame()
rowsCounter = 0
# Range determined based on the IDs of forum posts on the website
for i in range(10000, 23691):
    try:
        r = requests.get("https://3edb.com//viewCharacter.asp?cid="+str(i), headers=headers)
        soup = BeautifulSoup(r.content, 'html.parser')
        tables = soup.find_all("table")
        name = tables[1].find_all("tr")[0].find_all("td")[0].get_text()
        if not(name == "An Error Occurred On The Server When Processing The Url. Please Contact The System Administrator.  If You Are The System Administrator Please Click Here To Find Out More About This Error."):
            description = tables[20].find_all("tr")[1].find_all("td")[0].get_text()
            personality = tables[21].find_all("tr")[1].find_all("td")[0].get_text()
            background = tables[22].find_all("tr")[1].find_all("td")[0].get_text()

            # Only add to dataset if the descriptions are included
            if not(description == "No Description Assigned.") and not(background == "No Background Assigned.") and not(personality == "No Personality Assigned."):
                charClass = tables[2].find_all("tr")[2].find_all("td")[0].get_text()
                charRace = tables[2].find_all("tr")[2].find_all("td")[1].get_text()
                age = tables[2].find_all("tr")[4].find_all("td")[2].get_text()
                gender = tables[2].find_all("tr")[4].find_all("td")[3].get_text()
                height = tables[2].find_all("tr")[6].find_all("td")[0].get_text()
                weight = tables[2].find_all("tr")[6].find_all("td")[1].get_text()
                eyes = tables[2].find_all("tr")[6].find_all("td")[2].get_text()
                hair = tables[2].find_all("tr")[6].find_all("td")[3].get_text()
                # Append to the dataframe
                df = pd.DataFrame(data={'name': [name],
                                        'charClass': [charClass],
                                        'race': [charRace],
                                        'age': [age],
                                        'gender': [gender],
                                        'height': [height],
                                        'weight': [weight],
                                        'eyes': [eyes],
                                        'hair': [hair],
                                        'description': [description],
                                        'personality': [personality],
                                        'background': [background]
                                        }
                                  )

                # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
                final_df = pd.concat([final_df, df])
                print(str(i) + " " + str(rowsCounter))
                rowsCounter += 1
    except Exception:
        # Skip pages that are missing or do not match the expected layout
        print("Continue")

final_df.to_csv("data2.csv",index=False)
"""


### Parameters

Setting up the parameters for the training phase.
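
Two of the parameters set in this section, `LR` and `WARMUP_STEPS`, shape the learning-rate curve. The sketch below shows what that curve looks like under a linear warmup followed by a linear decay, which is assumed here to be the schedule in effect (it is the default of the Hugging Face `Trainer`). The total of 500 optimisation steps is a made-up figure for illustration, not a value taken from the training runs.

```python
LR = 5e-4           # peak learning rate, as set in the parameters
WARMUP_STEPS = 100  # number of warmup steps (1e2 in the parameters)


def linear_warmup_lr(step, total_steps, base_lr=LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at a given optimisation step (illustrative helper)."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining


TOTAL_STEPS = 500  # hypothetical number of optimisation steps
for step in (0, 50, 100, 300, 500):
    print(step, linear_warmup_lr(step, TOTAL_STEPS))
```

The rate peaks at `LR` exactly when warmup ends and reaches zero on the final step.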

In [None]:
"""DEBUG = False
INPUT_DIR = '/'
USE_APEX = True
APEX_OPT_LEVEL = 'O1'
MODEL = 'gpt2' 
UNFREEZE_LAST_N = 6 
# Special tokens for the model
SPECIAL_TOKENS  = { "bos_token": "<|BOS|>",
                    "eos_token": "<|EOS|>",
                    "unk_token": "<|UNK|>",                    
                    "pad_token": "<|PAD|>",
                    "sep_token": "<|SEP|>"}
                    
MAXLEN = 1024  # Maximum sequence length in tokens (GPT-2 context window)
# Size of the training set
TRAIN_SIZE      = 0.8
if USE_APEX:
    TRAIN_BATCHSIZE = 4
    BATCH_UPDATE    = 16
else:
    TRAIN_BATCHSIZE = 2
    BATCH_UPDATE    = 32
EPOCHS = 4
LR = 5e-4
EPS = 1e-8
WARMUP_STEPS = 1e2
SEED = 2020"""


### Structuring textual data into GPT-2 input format

The textual data is structured into the format accepted by the GPT-2 model. Special tokens delimit the context keywords: BOS (Beginning of Sentence) opens each training example, SEP (Separation) separates the keywords from each other and from the text, and EOS (End of Sentence) closes it.
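
To make the layout concrete, here is a minimal sketch of how one training text is assembled from keywords and a description. `build_training_text` is a hypothetical helper written for illustration (the notebook's own function additionally tokenizes and pads each text), and the character values are invented.

```python
# Special tokens as defined in the notebook's parameters
SPECIAL_TOKENS = {"bos_token": "<|BOS|>",
                  "eos_token": "<|EOS|>",
                  "sep_token": "<|SEP|>"}


def build_training_text(keywords, text):
    """Join keywords and text with the special tokens (illustrative helper)."""
    sep = SPECIAL_TOKENS["sep_token"]
    # Every keyword is followed by a separator; the text is closed by EOS
    return (SPECIAL_TOKENS["bos_token"] + sep.join(keywords) + sep
            + text + SPECIAL_TOKENS["eos_token"])


# Invented example values, not taken from the dataset
example = build_training_text(["Ranger", "Elf", "Aelar"], "A tall elf with a longbow.")
print(example)
```

Because the keywords sit at the start of every training text, the fine-tuned model can later be prompted with keywords alone and asked to continue.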

In [None]:
# Processing data into format accepted by GPT2, utilising GPT2 tokenizer

"""def preprocess_data(name, race, charClass, descbag, tokenizer):
    outputs = []

    for i, row in enumerate(descbag):
        # Join the keywords and the text using the special tokens
        input_data = SPECIAL_TOKENS['bos_token'] + charClass[i] + \
        SPECIAL_TOKENS['sep_token'] + race[i] + SPECIAL_TOKENS['sep_token'] + \
        name[i] + SPECIAL_TOKENS['sep_token'] + \
        row + SPECIAL_TOKENS['eos_token']

        # Encode data using GPT2 tokenizer
        encoding = tokenizer(input_data,
                            truncation=True,
                            max_length=MAXLEN,
                            padding="max_length")

        # The labels are the input ids themselves (language modelling objective)
        outputs.append({'label': torch.tensor(encoding['input_ids']),
                'input_ids': torch.tensor(encoding['input_ids']),
                'attention_mask': torch.tensor(encoding['attention_mask'])})

    return outputs"""

### Tokenizer and model

Download the pre-trained GPT-2 model and tokenizer.
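
Adding special tokens enlarges the tokenizer's vocabulary, which is why the model's embedding matrix has to be resized to match. The arithmetic can be sketched as follows, assuming the stock GPT-2 vocabulary of 50257 entries and that none of the five special tokens is already part of it.

```python
GPT2_BASE_VOCAB = 50257  # size of the stock GPT-2 vocabulary

# Special tokens as defined in the notebook's parameters
SPECIAL_TOKENS = {"bos_token": "<|BOS|>",
                  "eos_token": "<|EOS|>",
                  "unk_token": "<|UNK|>",
                  "pad_token": "<|PAD|>",
                  "sep_token": "<|SEP|>"}

# Each new special token needs its own row in the embedding matrix
new_vocab_size = GPT2_BASE_VOCAB + len(SPECIAL_TOKENS)
print(new_vocab_size)
```

The result, 50262, agrees with the `Embedding(50262, 768)` layer shown in the printed model structure later in the notebook.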

In [None]:
# Get the GPT-2 Tokenizer

"""def get_tokenizer(special_tokens=None):
    tokenizer = AutoTokenizer.from_pretrained('gpt2') 
    # Add special tokens
    if special_tokens:
        tokenizer.add_special_tokens(special_tokens)
    return tokenizer"""

In [None]:
# Get the GPT-2 model

"""def get_model(tokenizer, special_tokens=None, load_model_path=None):

    if special_tokens:
        config = AutoConfig.from_pretrained('gpt2', 
                                            bos_token_id=tokenizer.bos_token_id,
                                            eos_token_id=tokenizer.eos_token_id,
                                            sep_token_id=tokenizer.sep_token_id,
                                            pad_token_id=tokenizer.pad_token_id,
                                            output_hidden_states=False)
    else: 
        config = AutoConfig.from_pretrained('gpt2',                                     
                                            pad_token_id=tokenizer.eos_token_id,
                                            output_hidden_states=False)    

    model = AutoModelForPreTraining.from_pretrained('gpt2', config=config)

    if special_tokens:
        # Resize the model based on the supplied special tokens
        model.resize_token_embeddings(len(tokenizer))

    if load_model_path:
        model.load_state_dict(torch.load(load_model_path))

    model.cuda()
    return model"""

# Training Code

The code cells below outline the training phases of both models.

### Get the tokenizer and the model

In [None]:
# Get the tokenizer, pass in the list of special tokens
"tokenizer1 = get_tokenizer(special_tokens=SPECIAL_TOKENS)"

# Get the model
"""gpt2_model_1 = get_model(tokenizer1, 
                  special_tokens=SPECIAL_TOKENS
                 )
"""


In [None]:
"""tokenizer2 = get_tokenizer(special_tokens=SPECIAL_TOKENS)

gpt2_model_2 = get_model(tokenizer2, 
                  special_tokens=SPECIAL_TOKENS
                 )"""


### Supply data

The `preprocess_data` method is called on both corpora, and the `train_test_split` function is employed to split each into training and validation sets in a 0.8:0.2 ratio.

In [None]:
### Model 1

# Structure data into format of the GPT2 input for the first corpus
# Note: classes and races are passed in swapped positions relative to the
# preprocess_data(name, race, charClass, ...) signature, so each training
# text starts with the race, followed by the class and the name
"dataset1 = preprocess_data(name1, charClass1, race1, description, tokenizer1)"
# Split data to obtain train and validation datasets
"train_dataset1, val_dataset1 = train_test_split(dataset1, test_size=0.2, random_state=42)"

In [None]:
### Model 2

# Structure data into format of the GPT2 input for the second corpus
# Note: classes and races are passed in swapped positions relative to the
# preprocess_data(name, race, charClass, ...) signature, so each training
# text starts with the race, followed by the class and the name
"""dataset2 = preprocess_data(name2, charClass2, race2, background, tokenizer2)
train_dataset2, val_dataset2 = train_test_split(dataset2, test_size=0.2, random_state=42)"""

### Train GPT-2 Model 1 - For characters' descriptions

The first model is the GPT-2 model trained on fantasy characters' descriptions. Supplementary 'keywords' are: characters' names, races (e.g. Elf), and classes (e.g. Ranger). 
The model is trained using the Trainer module.<br><br>The model is evaluated and checkpointed at the end of every epoch (`evaluation_strategy="epoch"`, `save_strategy="epoch"`), while the training loss is logged every 16 steps. With `load_best_model_at_end=True`, the best-performing checkpoint is reloaded once training finishes. The same strategy is applied to Model 2.
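
The batch size and gradient accumulation settings determine how many optimisation steps fit into an epoch. The sketch below works through that arithmetic for a single GPU; it approximates how the `Trainer` counts steps, and the training-set size of 8000 examples is a made-up figure, since the corpus sizes are not reported here.

```python
import math

TRAIN_BATCHSIZE = 4   # per-device batch size (USE_APEX = True)
BATCH_UPDATE = 16     # gradient accumulation steps
EPOCHS = 4
LOGGING_STEPS = 16


def schedule_summary(n_train_examples):
    """Approximate step counts for one training run (illustrative helper)."""
    batches_per_epoch = math.ceil(n_train_examples / TRAIN_BATCHSIZE)
    # One optimisation step is taken after every BATCH_UPDATE batches
    updates_per_epoch = max(batches_per_epoch // BATCH_UPDATE, 1)
    return {
        "effective_batch_size": TRAIN_BATCHSIZE * BATCH_UPDATE,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": updates_per_epoch * EPOCHS,
        "approx_loss_logs_per_epoch": updates_per_epoch // LOGGING_STEPS,
        "evaluations": EPOCHS,  # one evaluation at the end of each epoch
    }


summary = schedule_summary(8000)  # 8000 is a hypothetical training-set size
print(summary)
```

With these settings each optimisation step sees 64 examples, so the training loss is logged several times per epoch while evaluation happens only once per epoch.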

In [None]:
# Specify training arguments
"""training_args = TrainingArguments(
    output_dir="/",
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=TRAIN_BATCHSIZE,
    per_device_eval_batch_size=TRAIN_BATCHSIZE,
    gradient_accumulation_steps=BATCH_UPDATE,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_steps = 16,
    fp16=True,
    fp16_opt_level=APEX_OPT_LEVEL,
    warmup_steps=WARMUP_STEPS,    
    learning_rate=LR,
    adam_epsilon=EPS,
    weight_decay=0.01,        
    save_total_limit=1,
    load_best_model_at_end=True,     
)"""


In [None]:
### Train Model 1 (for descriptions) using the Trainer module

"""trainer1 = Trainer(
    model=gpt2_model_1,
    args=training_args,    
    train_dataset=train_dataset1,
    eval_dataset=val_dataset1,
    tokenizer=tokenizer1
)"""

# Train
"trainer1.train()"


In [None]:
# Save model 1
#trainer1.save_model("/content/drive/MyDrive/compcreativity/modele")

### Train GPT-2 Model 2 - For characters' background stories

The second model is the GPT-2 model trained on the fantasy characters' background stories. Supplementary 'keywords' are: characters' names, races (e.g. Elf), and classes (e.g. Ranger). The model is trained using the Trainer module.

In [None]:
### Train Model 2 (for background stories) using the Trainer module

"""trainer2 = Trainer(
    model=gpt2_model_2,
    args=training_args,    
    train_dataset=train_dataset2,
    eval_dataset=val_dataset2,
    tokenizer=tokenizer2
)"""

# Train
"trainer2.train()"


In [None]:
# Save model 2
#trainer2.save_model("/content/drive/MyDrive/compcreativity/modelf")

### Evaluate Model 1

Final loss of Model 1: 0.45804083347320557


In [None]:
"trainer1.evaluate()"


### Evaluate Model 2
Final loss of Model 2: 0.6762203574180603

In [None]:
"trainer2.evaluate()"


# Generation Code

#### Set parameters for samples generation

In [None]:
DEBUG = False
INPUT_DIR = '/'
USE_APEX = True
APEX_OPT_LEVEL = 'O1'
MODEL = 'gpt2' 
UNFREEZE_LAST_N = 6 
# Special tokens for the model
SPECIAL_TOKENS  = { "bos_token": "<|BOS|>",
                    "eos_token": "<|EOS|>",
                    "unk_token": "<|UNK|>",                    
                    "pad_token": "<|PAD|>",
                    "sep_token": "<|SEP|>"}
                    
MAXLEN = 1024  # Maximum sequence length in tokens (GPT-2 context window)
# Size of the training set
TRAIN_SIZE      = 0.8
if USE_APEX:
    TRAIN_BATCHSIZE = 4
    BATCH_UPDATE    = 16
else:
    TRAIN_BATCHSIZE = 2
    BATCH_UPDATE    = 32
EPOCHS = 4
LR = 5e-4
EPS = 1e-8
WARMUP_STEPS = 1e2
SEED = 2020

#### Get GPT-2 Model and Tokenizer

In [None]:
# Method to download the pre-trained GPT2 Tokenizer

def get_tokenizer(special_tokens=None):
    tokenizer = AutoTokenizer.from_pretrained('gpt2') 
    # Add special tokens
    if special_tokens:
        tokenizer.add_special_tokens(special_tokens)
    return tokenizer

In [None]:
# Method to download the pre-trained GPT2 model

def get_model(tokenizer, special_tokens=None, load_model_path=None):

    if special_tokens:
        config = AutoConfig.from_pretrained('gpt2', 
                                            bos_token_id=tokenizer.bos_token_id,
                                            eos_token_id=tokenizer.eos_token_id,
                                            sep_token_id=tokenizer.sep_token_id,
                                            pad_token_id=tokenizer.pad_token_id,
                                            output_hidden_states=False)
    else: 
        config = AutoConfig.from_pretrained('gpt2',                                     
                                            pad_token_id=tokenizer.eos_token_id,
                                            output_hidden_states=False)    

    model = AutoModelForPreTraining.from_pretrained('gpt2', config=config)

    if special_tokens:
        # Resize the model based on the supplied special tokens
        model.resize_token_embeddings(len(tokenizer))

    if load_model_path:
        model.load_state_dict(torch.load(load_model_path))

    model.cuda()
    return model

In [None]:
# Get the tokenizer, pass in the list of special tokens
tokenizer1 = get_tokenizer(special_tokens=SPECIAL_TOKENS)

# Get the model (for Model 1)
gpt2_model_1 = get_model(tokenizer1, 
                  special_tokens=SPECIAL_TOKENS
                 )


In [None]:
# Get the tokenizer and pre-trained GPT-2 model for Model 2
tokenizer2 = get_tokenizer(special_tokens=SPECIAL_TOKENS)

gpt2_model_2 = get_model(tokenizer2, 
                  special_tokens=SPECIAL_TOKENS
                 )

#### Download models

In [None]:
#!pip install --upgrade --no-cache-dir gdown

In [None]:
import gdown, os

# Folder with the models
url = "https://drive.google.com/drive/folders/1t_ePeea0GCxYxP7FWUzbjP3Dab3mRKvq"

# Retry until the folder download succeeds, clearing gdown's cached cookies between attempts
download_successful = None
while download_successful is None:
    download_successful = gdown.download_folder(url, quiet=True, use_cookies=False)
    os.system('rm ~/.cache/gdown/cookies.json')

In [None]:
# Load fine-tuned Model 1
gpt2_model_1.load_state_dict(torch.load('Models/Model1/pytorch_model.bin'))
gpt2_model_1.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50262, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dro

In [None]:
# Load fine-tuned Model 2
gpt2_model_2.load_state_dict(torch.load('Models/Model2/pytorch_model.bin'))
gpt2_model_2.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50262, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dro

#### Generate samples

The functionality below generates text samples in each of the configurations described in this section.

In [None]:
# Data for drop-down lists
char_classes = [('', 0), ('Ranger', 1), ('Cleric', 2), ('Monk', 3), ('Paladin', 4), ('Wizard', 5), ('Bard', 6), ('Druid', 7), ('Sorcerer', 8), ('Fighter', 9), ('Rogue', 10), ('Witch', 11), ('Knight', 12),
                ('Hunter', 13), ('Villager', 14)]

char_races = [('', 0), ('Human', 1), ('Elf', 2), ('Half-elf', 3), ('Dark-elf', 4), ('Dragonborn', 5), ('Ogre', 6), ('Half-Ogre', 7), ('Gnome', 8), ('Minotaur', 9), ('Dwarf', 10), ('Goblin', 11)]

<br>![20.png](https://i.imgur.com/Qi8teH1.png)<br><br>

**Fantasy Generator** is a tool for generating textual descriptions and background stories for fictional characters.
<br>
<br>
<br>
Keywords for the character's class and race can be chosen from the drop-down lists. The lists were built from the most frequently occurring labels in the dataset.
<br>
<br>
For more variability, the fields can be left empty, in which case the model generates the characteristics (race and class) itself.
<br>
The character's race can be chosen while the class is left empty. However, if you choose a class, please choose a race as well, because the model was trained on sequences in which the race precedes the class.
<br>
<br>
The following configurations of text generation are available:<br>
* Generation of the character's description, with optional keywords - Please press the "Generate Description" button to produce text.
<br>
* Generation of the character's backstory, with optional keywords - Please press the "Generate Story" button to produce text.
<br>
* Generation of the character's backstory, with an optional prompt sentence - Please press the "Generate Story 2" button to produce text.
<br>
<br>

The keywords chosen at the beginning are transferred to the other functionalities automatically. To generate samples without keywords, please leave the drop-down lists empty (select the first, empty item).<br>
The characteristics (name, race, class) output during description generation are transferred to the other functionalities as well. To produce a background story without any keywords, please skip the description generation and start from story generation.<br>
To re-run all the configurations, please re-run the "Generate samples" code. <br><br>
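
The ordering rule described above (race first, then class) can be illustrated with a minimal sketch of how a keyword prompt might be assembled. The token strings are assumptions, taken from the markers that are stripped from the decoded output elsewhere in this notebook, and `build_prompt` is a hypothetical helper rather than part of the notebook's code.

```python
# Assumed token strings; the notebook defines its own SPECIAL_TOKENS dictionary.
SPECIAL_TOKENS = {"bos_token": "<|BOS|>", "sep_token": "<|SEP|>"}


def build_prompt(race="", char_class=""):
    """Return a generation prompt for the chosen keywords.

    Hypothetical helper: the class may only be given together with the race,
    because the race comes first in the sequence.
    """
    if char_class and not race:
        raise ValueError("Choose a race when choosing a class.")
    if not race:
        # No keywords: the model generates race and class by itself.
        return " "
    prompt = SPECIAL_TOKENS["bos_token"] + race + SPECIAL_TOKENS["sep_token"]
    if char_class:
        prompt += char_class + SPECIAL_TOKENS["sep_token"]
    return prompt


print(build_prompt("Elf", "Ranger"))  # <|BOS|>Elf<|SEP|>Ranger<|SEP|>
print(build_prompt("Elf"))            # <|BOS|>Elf<|SEP|>
```

Raising an error for a class without a race is one possible design; the notebook itself prints a message and returns instead.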

In [None]:
#@title Selection: { display-mode: "form" }

chosen_r = widgets.Dropdown(
    options=char_races,
    value=0,
    description='Race:',
    disabled=False,
)

chosen_r

Dropdown(description='Race:', options=(('', 0), ('Human', 1), ('Elf', 2), ('Half-elf', 3), ('Dark-elf', 4), ('…

In [None]:
#@title  { display-mode: "form" }
chosen_c = widgets.Dropdown(
    options=char_classes,
    value=0,
    description='Class:',
    disabled=False,
)

chosen_c

Dropdown(description='Class:', options=(('', 0), ('Ranger', 1), ('Cleric', 2), ('Monk', 3), ('Paladin', 4), ('…

<br>![20.png](https://i.imgur.com/YBzTqRq.png)<br><br>

Please press the "Generate Description" button to generate a textual sample. 

In [None]:
# Characteristics shared between the generation functions
# (a `global` statement is only needed inside the functions that assign them)
name = None
race = None
charClass = None
def get_description(arg):
    global name
    global race
    global charClass
    
    print("")
    print("Starting description generation...")
    print("")
    # Default prompt (a single space) when no keywords are chosen
    prompt = ' '
    # The drop-down values are indices into the (label, index) lists
    char_race = char_races[chosen_r.value][0]
    char_class = char_classes[chosen_c.value][0]
    # If race is not chosen and class is chosen
    if chosen_r.value == 0 and chosen_c.value != 0:   
        print("If selecting the character's class, please choose the character's race too. Alternatively, both fields can be left empty.")
        return     
    # If race is chosen and class is not    
    elif chosen_r.value != 0 and chosen_c.value == 0:        
        prompt = SPECIAL_TOKENS['bos_token'] + char_race + \
         SPECIAL_TOKENS['sep_token'] 
    # If both characteristics are chosen
    elif chosen_r.value != 0 and chosen_c.value != 0:
        prompt = SPECIAL_TOKENS['bos_token'] + char_race + \
         SPECIAL_TOKENS['sep_token'] + char_class + SPECIAL_TOKENS['sep_token']

    generated = torch.tensor(tokenizer1.encode(prompt)).unsqueeze(0)
    # Place the input on the same device as the model
    generated = generated.to(gpt2_model_1.device)

    gpt2_model_1.eval()

    sample_outputs = gpt2_model_1.generate(generated, 
                                do_sample=True,   
                                min_length=50, 
                                max_length=MAXLEN,
                                top_k=30,                                 
                                top_p=0.7,        
                                temperature=0.9,
                                repetition_penalty=2.0,
                                num_return_sequences=1
                                )

    for i, sample_output in enumerate(sample_outputs):
        text = tokenizer1.decode(sample_output, skip_special_tokens=False)
        text = text.replace("<|BOS|>", '')
        text = text.replace("<|EOS|>", '')
        text = text.replace("<|PAD|>", '')
        # Split on the SEP token: race, class, name, description
        text = text.split("<|SEP|>")
        print("Name: ")
        if(len(text) >= 3):
            print(text[2])
            name = text[2]
        print("")
        print("Race: ")
        if(len(text) >= 1):            
            print(text[0].strip())
            race = text[0].strip()
        print("")
        print("Class: ")
        if(len(text) >= 2):            
            print(text[1])
            charClass = text[1]
        print("")
        print("Description: ")
        if(len(text) >= 4):
            print(text[3])
        print("")


#@title Generate description
button1 = widgets.Button(
    description='Generate Description',
    disabled=False,
    button_style='', 
    tooltip='Generate',
    icon='check'
)

button1


Button(description='Generate Description', icon='check', style=ButtonStyle(), tooltip='Generate')

In [None]:
button1.on_click(get_description)
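
The field extraction performed in `get_description` can be sketched as a small standalone function. This is a minimal illustration, assuming the decoded text has the layout race, class, name, description separated by `<|SEP|>`; `parse_generated` is a hypothetical helper and the sample string is made up for the example, not model output.

```python
def parse_generated(text, sep="<|SEP|>"):
    """Split decoded text into race, class, name and description fields.

    Hypothetical helper; fields missing from the output are returned as None.
    """
    # Remove the remaining special tokens before splitting.
    for token in ("<|BOS|>", "<|EOS|>", "<|PAD|>"):
        text = text.replace(token, "")
    parts = text.split(sep)
    keys = ("race", "class", "name", "description")
    fields = dict.fromkeys(keys)
    # zip stops at the shorter sequence, so absent fields stay None.
    for key, part in zip(keys, parts):
        fields[key] = part.strip()
    return fields


# Made-up sample string in the assumed layout
sample = "<|BOS|>Elf<|SEP|>Ranger<|SEP|>Aelar<|SEP|>A quiet scout.<|EOS|>"
print(parse_generated(sample))
```

Unlike the cell above, this sketch strips whitespace from every field, not only from the race.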


<br>![20.png](https://i.imgur.com/2kt6ZC1.png)<br><br>

Please press the "Generate Story" button to generate a textual sample.

In [None]:
global bg_story
def get_background(arg):
    global bg_story
    
    print("")
    print("Starting story generation...")
    print("")
    # Default prompt (a single space) when no keywords are available
    prompt = ' '
    # Structure the prompt with special keywords (character type)
    # Keywords are supplied by a previous description generation
    if race is not None and charClass is not None and name is not None:
        prompt = SPECIAL_TOKENS['bos_token'] + race + \
                SPECIAL_TOKENS['sep_token'] + charClass + SPECIAL_TOKENS['sep_token'] + \
                name + SPECIAL_TOKENS['sep_token']
            
    generated = torch.tensor(tokenizer2.encode(prompt)).unsqueeze(0)
    # Place the input on the same device as the model
    generated = generated.to(gpt2_model_2.device)

    gpt2_model_2.eval()

    # Output samples based on the prompt
    sample_outputs = gpt2_model_2.generate(generated, 
                                    do_sample=True,   
                                    min_length=50, 
                                    max_length=MAXLEN,
                                    top_k=30,                                 
                                    top_p=0.7,        
                                    temperature=0.9,
                                    repetition_penalty=2.0,
                                    num_return_sequences=1
                                    )

    for i, sample_output in enumerate(sample_outputs):
        text = tokenizer2.decode(sample_output, skip_special_tokens=False)
        text = text.replace("<|BOS|>", '')
        text = text.replace("<|EOS|>", '')
        text = text.replace("<|PAD|>", '')
        text = text.replace("<|SEP|>", " - ")
        print("{}: {}\n\n".format(i+1,  text))


#@title Generate Background Story
button2 = widgets.Button(
    description='Generate Story',
    disabled=False,
    button_style='', 
    tooltip='Generate',
    icon='check'
)

button2

Button(description='Generate Story', icon='check', style=ButtonStyle(), tooltip='Generate')

In [None]:
button2.on_click(get_background)
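
Both generation calls above sample with `top_k=30`, `top_p=0.7` and `temperature=0.9`. The following is a simplified, pure-Python illustration of what these three settings do to a next-token distribution. `filter_next_token_probs` is a hypothetical helper: it ignores `repetition_penalty` and the length constraints, and it is not the transformers implementation.

```python
import math


def filter_next_token_probs(logits, top_k=30, top_p=0.7, temperature=0.9):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [value / temperature for value in logits]
    # Top-k: keep only the k highest-scoring token indices.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    ranked = ranked[:top_k]
    # Softmax over the surviving tokens (shifted by the max for stability).
    peak = scaled[ranked[0]]
    weights = {i: math.exp(scaled[i] - peak) for i in ranked}
    total = sum(weights.values())
    # Top-p: keep the smallest high-probability prefix whose mass reaches top_p.
    kept, cumulative = {}, 0.0
    for i in ranked:
        kept[i] = weights[i] / total
        cumulative += kept[i]
        if cumulative >= top_p:
            break
    # Renormalise so the kept probabilities sum to 1.
    norm = sum(kept.values())
    return {i: p / norm for i, p in kept.items()}


probs = filter_next_token_probs([2.0, 1.0, 0.0, -1.0], top_k=3, top_p=0.7, temperature=1.0)
print(sorted(probs))  # [0, 1]
```

In this toy example the lowest-scoring token is removed by top-k and the third by top-p, so sampling would choose only between the two most likely tokens.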

In [None]:
#@title Additional story generation
#@markdown Please enter the introductory sentence in the text area below and press the "Generate Story 2" button to prompt the story generation. 
#@markdown Alternatively, leave the text area empty to allow for an increased variability.
def generate_stories(arg):
    user_input = txtarea.value

    print("")
    print("Starting story generation...")
    print("")

    # Default prompt (a single space) when nothing is supplied
    prompt = ' '

    # Use the keywords from a previous description generation, if any
    if race is not None and charClass is not None and name is not None:
        prompt = SPECIAL_TOKENS['bos_token'] + race + \
                SPECIAL_TOKENS['sep_token'] + charClass + SPECIAL_TOKENS['sep_token'] + \
                name + SPECIAL_TOKENS['sep_token'] + \
                user_input
    # A typed prompt sentence takes priority: the keyword slots are left blank
    if user_input not in ('', ' ', 'None'):
        prompt = SPECIAL_TOKENS['bos_token'] + " " + \
                SPECIAL_TOKENS['sep_token'] + " " + SPECIAL_TOKENS['sep_token'] + \
                " " + SPECIAL_TOKENS['sep_token'] + \
                user_input
            
    generated = torch.tensor(tokenizer2.encode(prompt)).unsqueeze(0)
    # Place the input on the same device as the model
    generated = generated.to(gpt2_model_2.device)

    gpt2_model_2.eval()

    # Output samples based on the prompt
    sample_outputs = gpt2_model_2.generate(generated, 
                                    do_sample=True,   
                                    min_length=50, 
                                    max_length=MAXLEN,
                                    top_k=30,                                 
                                    top_p=0.7,        
                                    temperature=0.9,
                                    repetition_penalty=2.0,
                                    num_return_sequences=1
                                    )

    for i, sample_output in enumerate(sample_outputs):
        text = tokenizer2.decode(sample_output, skip_special_tokens=False)
        text = text.replace("<|BOS|>", '')
        text = text.replace("<|EOS|>", '')
        text = text.replace("<|PAD|>", '')
        if user_input not in ('', ' ', 'None'):
            text = text.replace("<|SEP|>", "")
        else: 
            text = text.replace("<|SEP|>", " - ")
        print("{}: {}\n\n".format(i+1,  text))
            


# Define widgets
txtarea = widgets.Textarea(
    value='',
    placeholder='Enter prompt',
    description='Enter prompt',
    disabled=False
)

# Button that triggers the prompted story generation
button3 = widgets.Button(
    description='Generate Story 2',
    disabled=False,
    button_style='', 
    tooltip='Generate',
    icon='check'
)



In [None]:
# Prompt sentence can be entered in the text area placed below, e.g. "He fell sick."
txtarea


Textarea(value='', description='Enter prompt', placeholder='Enter prompt')

In [None]:
# To process the text from the text area and generate a story, please press the button below
button3


Button(description='Generate Story 2', icon='check', style=ButtonStyle(), tooltip='Generate')

In [None]:
button3.on_click(generate_stories)
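
The prompt selection in `generate_stories` can be summarised in a small standalone sketch. It assumes the same token strings that are stripped from the decoded output above; `story_prompt` is a hypothetical helper written for illustration, not part of the notebook's code.

```python
# Assumed token strings; the notebook defines its own SPECIAL_TOKENS dictionary.
SPECIAL_TOKENS = {"bos_token": "<|BOS|>", "sep_token": "<|SEP|>"}


def story_prompt(user_input, race=None, char_class=None, name=None):
    """Return the prompt that would be passed to the story model."""
    bos, sep = SPECIAL_TOKENS["bos_token"], SPECIAL_TOKENS["sep_token"]
    if user_input not in ("", " ", "None"):
        # A typed sentence takes priority; the three keyword slots stay blank.
        return bos + " " + sep + " " + sep + " " + sep + user_input
    if None not in (race, char_class, name):
        # No sentence typed: fall back to the stored characteristics.
        return bos + race + sep + char_class + sep + name + sep + user_input
    # Nothing supplied at all: let the model generate freely.
    return " "


print(story_prompt("He fell sick."))
print(story_prompt("", "Elf", "Ranger", "Aelar"))
```

Note that a typed sentence replaces the stored race, class and name rather than being combined with them, which matches the behaviour of the cell above.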