## Setup Prompts

Turn the dataset into text prompts that the transformer can learn to generate and complete. The data is split across the following files:

|File|Description|Columns|
|-|-|-|
|`documents.tsv`|Entire documents, including articles, books, screenplays, etc. in Toki Pona and their translations.|`name`: the name of the content. `content_type`: the type of content (such as `"biblical text"`). `tok`: the Toki Pona content string. `eng`: the English content string. `cmn`: the Chinese content string.|
|`chapters.tsv`|Chapters in Toki Pona and their translations.|`name`: the name of the content. `chapter_number`: the number of this chapter. `content_type`: the type of content (such as `"biblical text"`). `tok`: the Toki Pona content string. `eng`: the English content string. `cmn`: the Chinese content string.|
|`sentence_translations.tsv`|Translations of sentences in Toki Pona, Chinese, and English.|`tok`: the Toki Pona sentence. `eng`: the English sentence. `cmn`: the Chinese sentence.|
|`sentence.tsv`|A collection of miscellaneous sentences in Toki Pona.|`content_type`: the type of content associated with the sentence. `tok`: the sentence in Toki Pona.|
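
To make the file layout concrete, here is a minimal sketch of reading rows shaped like `sentence_translations.tsv` and mapping missing cells to `None`. The notebook itself does this with pandas further down; this version uses only the standard library, and the sample rows and the `read_tsv_rows` helper are made up for illustration rather than taken from the dataset.

```python
import csv
import io

# Made-up sample in the shape of sentence_translations.tsv (not real dataset rows).
SAMPLE_TSV = "tok\teng\tcmn\nmi moku e kili.\tI eat fruit.\t\ntoki!\tHello!\t你好！\n"


def read_tsv_rows(handle):
    """Read tab-separated rows into dicts, mapping empty cells to None."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [{key: (value or None) for key, value in row.items()} for row in reader]


rows = read_tsv_rows(io.StringIO(SAMPLE_TSV))
for row in rows:
    # A missing translation shows up as None, so it can be skipped later on.
    print(row["tok"], "->", row["eng"], "| cmn present:", row["cmn"] is not None)
```

Treating a missing translation as `None` lets later code decide with a simple truthiness check whether a translation prompt can be built for that row.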

In [1]:
%pip install pandas numpy datasets transformers torch huggingface_hub

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
!git clone https://github.com/adam-mcdaniel/toki-pona-dataset
!cp toki-pona-dataset/processed/* .
!rm -Rf toki-pona-dataset

Cloning into 'toki-pona-dataset'...
remote: Enumerating objects: 2102, done.
remote: Counting objects: 100% (2102/2102), done.
remote: Compressing objects: 100% (2048/2048), done.
remote: Total 2102 (delta 49), reused 2092 (delta 39), pack-reused 0
Receiving objects: 100% (2102/2102), 5.73 MiB | 1.31 MiB/s, done.
Resolving deltas: 100% (49/49), done.


In [3]:
import pandas as pd
import numpy as np
from math import exp
from random import choice
from datasets import Dataset, DatasetDict
from huggingface_hub import notebook_login
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments


CONTENT_TYPES = [
    ENCYCLOPEDIA_ARTICLE := 'encyclopedia article',
    ARTICLE := 'article',
    BLOG_POST := 'blog post',
    MAGAZINE := 'magazine',
    BIBLE := 'biblical text',
    STORY := 'story',
    POEM := 'poem',
    SCREENPLAY := 'screenplay',
    BOOK := 'book',
    CHAPTER := 'chapter',
    ESSAY := 'essay',
    CHAT := 'chat',
    OTHER := 'other',
]

FORMATS = [
    TEXT := 'text',
    MARKDOWN := 'markdown',
    IRC_LOG := 'irc log',
]

2023-05-13 23:41:36.510724: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [4]:
# Wrap an instruction/response pair in the Alpaca-style prompt template
def prompt_eng(instruction, response):
  return f'''Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{response}'''


def collate(toki_pona_text, english_text, content_type, title=None):
    # Yield one {'text': prompt} dict per training example built from this text.
    # `title` is accepted for future use but is not used yet.
    english_verb = 'Write'

    if content_type == STORY:
        yield {'text': prompt_eng(f'{english_verb} a story in Toki Pona.', toki_pona_text)}
        yield {'text': prompt_eng(f'o pali e lipu musi.', toki_pona_text)}

    elif content_type == ARTICLE:
        yield {'text': prompt_eng(f'{english_verb} an essay in Toki Pona.', toki_pona_text)}
        yield {'text': prompt_eng(f'o pali e lipu pi pana sona.', toki_pona_text)}

    elif content_type == POEM:
        yield {'text': prompt_eng(f'{english_verb} a poem in Toki Pona.', toki_pona_text)}
        yield {'text': prompt_eng(f'o pali e toki musi.', toki_pona_text)}
    
    elif content_type == BIBLE:
        yield {'text': prompt_eng(f'{english_verb} some verses from the Bible in Toki Pona.', toki_pona_text)}
        yield {'text': prompt_eng(f'o toki e toki sewi.', toki_pona_text)}
    
    elif content_type == SCREENPLAY:
        yield {'text': prompt_eng(f'{english_verb} a screenplay in Toki Pona.', toki_pona_text)}
        yield {'text': prompt_eng(f'o pali e lipu musi pi jan musi.', toki_pona_text)}
    
    elif content_type == BOOK:
        yield {'text': prompt_eng(f'{english_verb} a book in Toki Pona.', toki_pona_text)}
        yield {'text': prompt_eng(f'o pali e lipu.', toki_pona_text)}

    elif content_type == MAGAZINE:
        yield {'text': prompt_eng(f'{english_verb} a magazine in Toki Pona.', toki_pona_text)}
        yield {'text': prompt_eng(f'o pali e lipu tenpo.', toki_pona_text)}

    elif content_type == CHAT:
        yield {'text': prompt_eng(f'Write some chat dialogue in Toki Pona.', toki_pona_text)}

    # Add translation prompts in both directions when an English version exists
    if english_text:
        yield {'text': prompt_eng(f'Translate the following English text into Toki Pona:\n{english_text}', toki_pona_text)}
        yield {'text': prompt_eng(f'Translate the following Toki Pona text into English:\n{toki_pona_text}', english_text)}
        yield {'text': prompt_eng(f'o toki ante e ni. ni li toki pi toki pona. o ante e ni tawa toki pi toki Inli.\n\n{toki_pona_text}', english_text)}
        yield {'text': prompt_eng(f'o toki ante e ni. ni li toki pi toki Inli. o ante e ni tawa toki pi toki pona.\n\n{english_text}', toki_pona_text)}

def collate_documents():
    # Load the documents and convert NaNs to None
    documents_copy = pd.read_csv("documents.tsv", sep='\t')
    documents_copy = documents_copy.replace(np.nan, None)

    # Add the prompts to the documents
    for i, row in documents_copy.iterrows():
        # The genre of content
        content_type = row['content_type']

        # Toki Pona text
        toki_pona_text = row['tok']
    
        # English text
        english_text = row['eng']

        for pair in collate(toki_pona_text, english_text, content_type):
            yield pair        

def collate_chapters():
    # Load the chapters and convert NaNs to None
    chapters_copy = pd.read_csv("chapters.tsv", sep='\t')
    chapters_copy = chapters_copy.replace(np.nan, None)

    # Add the prompts to the chapters
    for i, row in chapters_copy.iterrows():
        # The genre of content
        content_type = row['content_type']

        # Toki Pona text
        toki_pona_text = row['tok']
    
        # English text
        english_text = row['eng']

        for pair in collate(toki_pona_text, english_text, content_type):
            yield pair

def collate_sentences():
    # Load the sentence translations and convert NaNs to None
    sentence_translations = pd.read_csv("sentence_translations.tsv", sep='\t')
    sentence_translations = sentence_translations.replace(np.nan, None)

    # Add the prompts to the sentences
    for i, row in sentence_translations.iterrows():
        # Toki Pona text
        toki_pona_text = row['tok']
    
        # English text
        english_text = row['eng']

        for pair in collate(toki_pona_text, english_text, None):
            yield pair
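
To show what a single training example looks like, the sketch below rebuilds the prompt layout used by `prompt_eng` and applies it to one sentence pair in both translation directions. `TEMPLATE` and `build_prompt` are hypothetical stand-ins so the snippet runs on its own, and the sentence pair is made up for illustration.

```python
# Hypothetical stand-in for the notebook's prompt template (same Alpaca-style layout).
TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}"
)


def build_prompt(instruction, response):
    """Fill the template with one instruction/response pair."""
    return TEMPLATE.format(instruction=instruction, response=response)


# Made-up sentence pair, for illustration only.
tok, eng = "mi moku e kili.", "I eat fruit."

examples = [
    {"text": build_prompt(f"Translate the following English text into Toki Pona:\n{eng}", tok)},
    {"text": build_prompt(f"Translate the following Toki Pona text into English:\n{tok}", eng)},
]
print(examples[0]["text"])
```

Each source text therefore produces several examples: the same pair is reused with different instructions, so the model sees both translation directions.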


In [5]:
def collate_all():
    for pair in collate_documents():
        yield pair
    
    for pair in collate_chapters():
        yield pair
    
    for pair in collate_sentences():
        yield pair

dataset = Dataset.from_generator(collate_all)

train_dataset, test_dataset = dataset.train_test_split(test_size=0.1).values()
dataset = DatasetDict({"train": train_dataset,"test": test_dataset})
print(len(train_dataset), len(test_dataset))


Downloading and preparing dataset generator/default to /home/adam/.cache/huggingface/datasets/generator/default-d76f22e9a0c4cf58/0.0.0...


Generating train split: 0 examples [00:00, ? examples/s]

Dataset generator downloaded and prepared to /home/adam/.cache/huggingface/datasets/generator/default-d76f22e9a0c4cf58/0.0.0. Subsequent calls will reuse this data.
166650 18517


In [6]:
model_to_finetune = "vicgalle/gpt2-alpaca-gpt4"

In [7]:
tokenizer = AutoTokenizer.from_pretrained(model_to_finetune, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_to_finetune)

In [8]:
def split_into_sequences(inputs):
    # The maximum sequence length to use
    max_sequence_length = 128

    # First get all of the texts together into a collated dictionary
    collated = {}
    for k in inputs.keys():
        if k not in collated:
            collated[k] = []
        for v in inputs[k]:
            collated[k] += v

    # Use the first column (e.g. input_ids) to measure the total token count,
    # rounded down to a whole number of sequences
    first_id = next(iter(inputs))
    total_length_of_texts = len(collated[first_id]) // max_sequence_length * max_sequence_length

    # Now split the texts into sequences
    result = {}
    for k, v in collated.items():
        sequences = []
        # Split the text into chunks of max_sequence_length
        for i in range(0, total_length_of_texts, max_sequence_length):
            sequence = v[i:i + max_sequence_length]
            sequences.append(sequence)
        # Add the sequences to the result
        result[k] = sequences
    result["labels"] = result["input_ids"].copy()
    return result
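
The chunking step above can be hard to picture, so here is a toy sketch of the same idea on plain lists: concatenate every tokenized text in the batch, then cut the result into fixed-size blocks and drop the leftover tail. `chunk_token_ids` and the block size of 4 are illustrative assumptions; the notebook uses 128 and applies the same slicing to every column.

```python
def chunk_token_ids(batch_of_ids, block_size):
    """Concatenate token-id lists and split them into equal-sized blocks."""
    # Join all texts in the batch into one long token stream.
    stream = []
    for ids in batch_of_ids:
        stream += ids

    # Round the usable length down to a whole number of blocks.
    usable_length = len(stream) // block_size * block_size

    # Slice the stream into consecutive blocks; the remainder is discarded.
    return [stream[i:i + block_size] for i in range(0, usable_length, block_size)]


toy_batch = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = chunk_token_ids(toy_batch, block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that blocks can span the boundary between two texts, and that up to `block_size - 1` tokens per batch are dropped. Because every block is shorter than the model's 1024-token limit, the "sequence length is longer than the specified maximum" warnings from tokenization do not affect training.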

In [9]:
tokenized = dataset.map(lambda x: tokenizer(x["text"]), batched=True, num_proc=4, remove_columns=["text"])
sequences = tokenized.map(split_into_sequences, batched=True, batch_size=1000, num_proc=4)

Map (num_proc=4):   0%|          | 0/166650 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (6972 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2829 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2951 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1547 > 1024). Running this sequence through the model will result in indexing errors


Map (num_proc=4):   0%|          | 0/18517 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (4423 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2376 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1912 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (7622 > 1024). Running this sequence through the model will result in indexing errors


Map (num_proc=4):   0%|          | 0/166650 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/18517 [00:00<?, ? examples/s]

In [10]:
notebook_login()


In [None]:
args = training_args = TrainingArguments(
    "adamthekiwi/toki-pona-gpt2-alpaca-better", evaluation_strategy="epoch",
    learning_rate=2e-5, warmup_steps=500, num_train_epochs=7, weight_decay=0.01,
    push_to_hub=True, save_total_limit=10)
trainer = Trainer(model=model, args=args, train_dataset=sequences["train"], eval_dataset=sequences["test"])
trainer.train()
trainer.push_to_hub()
results = trainer.evaluate()
perplexity = exp(results['eval_loss'])
print('Perplexity =', perplexity)
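
As a sanity check on the last two lines, perplexity is the exponential of the mean cross-entropy loss (in nats) that `trainer.evaluate()` reports as `eval_loss`. The sketch below uses a hypothetical helper, `perplexity_from_loss`, and a made-up loss value: a model that guesses uniformly among 4 tokens has a loss of ln 4 and so a perplexity of about 4.

```python
from math import exp, isclose, log


def perplexity_from_loss(mean_cross_entropy):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return exp(mean_cross_entropy)


# Illustrative value only: uniform guessing over 4 equally likely tokens.
uniform_loss = log(4)
uniform_perplexity = perplexity_from_loss(uniform_loss)
print(uniform_perplexity)
```

Lower is better: a perplexity of about 4 means the model is, on average, as uncertain as if it were choosing among 4 equally likely next tokens.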