# Toki Pona Language Model (Iteration \#1)

### Author
- Adam McDaniel (`amcdan23`)

## Setup Prompts

Convert the data into text that the transformer can be trained to generate and complete. The data is split across the following files:

|File|Description|Columns|
|-|-|-|
|`documents.tsv`|Entire documents, including articles, books, screenplays, etc. in Toki Pona and their translations.|`name`: the name of the content. `content_type`: the type of content (such as `"biblical text"`). `tok`: the Toki Pona content string. `eng`: the English content string. `cmn`: the Chinese content string.|
|`chapters.tsv`|Chapters in Toki Pona and their translations.|`name`: the name of the content. `chapter_number`: the number of this chapter. `content_type`: the type of content (such as `"biblical text"`). `tok`: the Toki Pona content string. `eng`: the English content string. `cmn`: the Chinese content string.|
|`sentence_translations.tsv`|Translations of sentences in Toki Pona, Chinese, and English.|`tok`: the Toki Pona sentence. `eng`: the English sentence. `cmn`: the Chinese sentence.|
|`sentences.tsv`|A collection of miscellaneous sentences in Toki Pona.|`content_type`: the type of content associated with the sentence. `tok`: the sentence in Toki Pona.|
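
To make the file layout concrete, here is a minimal sketch that parses tab-separated text with the columns of `sentence_translations.tsv` and maps empty cells to `None`, much as the notebook does with `replace(np.nan, None)`. It uses the standard library's `csv` module instead of pandas so that it runs without extra dependencies. The sample rows and the helper name `read_rows` are invented for illustration and are not taken from the corpus.

```python
import csv
import io

# Invented sample in the layout of sentence_translations.tsv (not real corpus rows).
SAMPLE_TSV = (
    "tok\teng\tcmn\n"
    "toki!\tHello!\t你好!\n"
    "mi moku.\tI eat.\t\n"
)


def read_rows(tsv_text):
    """Parse tab-separated text into dicts, mapping empty cells to None."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        {column: (value if value != "" else None) for column, value in row.items()}
        for row in reader
    ]


rows = read_rows(SAMPLE_TSV)
print(rows[1])
```

The second sample row has no Chinese translation, so its `cmn` field comes back as `None`, which is the case the `if english_text:` check in the notebook guards against for the `eng` column.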

In [1]:
%pip install pandas numpy datasets transformers torch huggingface_hub

Defaulting to user installation because normal site-packages is not writeable


In [3]:
import pandas as pd
import numpy as np
from math import exp
from datasets import Dataset, DatasetDict
from huggingface_hub import notebook_login
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments


CONTENT_TYPES = [
    ENCYCLOPEDIA_ARTICLE := 'encyclopedia article',
    ARTICLE := 'article',
    BLOG_POST := 'blog post',
    MAGAZINE := 'magazine',
    BIBLE := 'biblical text',
    STORY := 'story',
    POEM := 'poem',
    SCREENPLAY := 'screenplay',
    BOOK := 'book',
    CHAPTER := 'chapter',
    ESSAY := 'essay',
    CHAT := 'chat',
    OTHER := 'other',
]

FORMATS = [
    TEXT := 'text',
    MARKDOWN := 'markdown',
    IRC_LOG := 'irc log',
]



In [4]:
def collate(toki_pona_text, english_text, content_type):
    yield {'text': toki_pona_text}

    if content_type == STORY:
        yield {'text': f'Write a story in Toki Pona: {toki_pona_text}'}
        yield {'text': f'o toki e lipu pi toki pona: {toki_pona_text}'}

    elif content_type == ARTICLE:
        yield {'text': f'Write an article in Toki Pona: {toki_pona_text}'}
        yield {'text': f'o toki e lipu pi pana sona pi toki pona: {toki_pona_text}'}

    elif content_type == POEM:
        yield {'text': f'Write a poem in Toki Pona: {toki_pona_text}'}
        yield {'text': f'o toki e toki musi pi toki pona: {toki_pona_text}'}
    
    elif content_type == BIBLE:
        yield {'text': f'o toki e toki sewi pi jan Jawe lon toki pona: {toki_pona_text}'}
        yield {'text': f'Write some bible verses in Toki Pona: {toki_pona_text}'}
    
    elif content_type == SCREENPLAY:
        yield {'text': f'Write a screenplay in Toki Pona: {toki_pona_text}'}
        yield {'text': f'o toki e lipu musi pi toki pona: {toki_pona_text}'}
    
    elif content_type == BOOK:
        yield {'text': f'Write a book in Toki Pona: {toki_pona_text}'}
        yield {'text': f'o toki e lipu lon toki pona: {toki_pona_text}'}
    
    elif content_type == MAGAZINE:
        yield {'text': f'Write a magazine article in Toki Pona: {toki_pona_text}'}
        yield {'text': f'o toki e lipu tenpo lon toki pona: {toki_pona_text}'}

    elif content_type == CHAT:
        yield {'text': f'Write a chat in Toki Pona: {toki_pona_text}'}
        yield {'text': f'o toki e toki pi toki pona: {toki_pona_text}'}

    # Translation prompts in both directions, when an English version exists
    if english_text:
        yield {'text': f'o toki ante e ni:\n\n{toki_pona_text}\n---\n{english_text}'}
        yield {'text': f'o toki ante e ni:\n\n{english_text}\n---\n{toki_pona_text}'}
        yield {'text': f'Translate this to Toki Pona:\n\n{english_text}\n---\n{toki_pona_text}'}
        yield {'text': f'Translate this to English:\n\n{toki_pona_text}\n---\n{english_text}'}

def collate_documents():
    # Load the documents and convert NaNs to None
    documents_copy = pd.read_csv("documents.tsv", sep='\t')
    documents_copy = documents_copy.replace(np.nan, None)

    # Yield the prompts for each document
    for _, row in documents_copy.iterrows():
        # The genre of content
        content_type = row['content_type']

        # Toki Pona text
        toki_pona_text = row['tok']
    
        # English text
        english_text = row['eng']

        for pair in collate(toki_pona_text, english_text, content_type):
            yield pair        

def collate_chapters():
    # Load the chapters and convert NaNs to None
    chapters_copy = pd.read_csv("chapters.tsv", sep='\t')
    chapters_copy = chapters_copy.replace(np.nan, None)

    # Yield the prompts for each chapter
    for _, row in chapters_copy.iterrows():
        # The genre of content
        content_type = row['content_type']

        # Toki Pona text
        toki_pona_text = row['tok']
    
        # English text
        english_text = row['eng']

        for pair in collate(toki_pona_text, english_text, content_type):
            yield pair

def collate_sentences():
    # Load the translated sentences and convert NaNs to None
    sentence_translations = pd.read_csv("sentence_translations.tsv", sep='\t')
    sentence_translations = sentence_translations.replace(np.nan, None)

    # Yield the prompts for each sentence pair (sentences have no content type)
    for _, row in sentence_translations.iterrows():
        # Toki Pona text
        toki_pona_text = row['tok']
    
        # English text
        english_text = row['eng']

        for pair in collate(toki_pona_text, english_text, None):
            yield pair

    # Monolingual Toki Pona sentences: yield them as plain text only
    sentences = pd.read_csv("sentences.tsv", sep='\t')
    sentences = sentences.replace(np.nan, None)
    for _, row in sentences.iterrows():
        toki_pona_text = row['sentence']
        yield {'text': toki_pona_text}
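
Every translation example that `collate` yields has the same shape: an instruction line, a blank line, the source text, a `---` separator, and then the target text. The sketch below reproduces that shape with a hypothetical helper, `format_pair`, which is not part of the notebook. The sample sentence pair is invented for illustration.

```python
def format_pair(instruction, source_text, target_text):
    """Build one training text: instruction, blank line, source, '---', target."""
    return f"{instruction}\n\n{source_text}\n---\n{target_text}"


# Invented sentence pair, used only to show the layout of a training text.
example = format_pair("Translate this to Toki Pona:", "I eat.", "mi moku.")
print(example)
```

Because the model is trained as a plain causal language model, the instruction and source act as the prompt at inference time, and the text after `---` is what the model is expected to complete.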


In [5]:
def collate_all():
    for pair in collate_documents():
        yield pair
    
    for pair in collate_chapters():
        yield pair
    
    for pair in collate_sentences():
        yield pair


dataset = Dataset.from_generator(collate_all)

train_dataset, test_dataset = dataset.train_test_split(test_size=0.1).values()
dataset = DatasetDict({"train": train_dataset,"test": test_dataset})
print(len(train_dataset), len(test_dataset))

Found cached dataset generator (/home/adam/.cache/huggingface/datasets/generator/default-af790aad65d3536b/0.0.0)


256303 28479
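
As a quick sanity check on the split sizes printed above, the arithmetic below confirms that the two numbers are consistent with a 10% test split of the full dataset. The rounding rule (test share rounded up to a whole example) is an assumption about how `train_test_split` behaves, not something the notebook states.

```python
from math import ceil

# Sizes printed by the notebook cell above.
train_size, test_size = 256303, 28479
total = train_size + test_size

# Assumption: the test share is rounded up to a whole example,
# and the training set receives everything that remains.
expected_test = ceil(0.1 * total)
expected_train = total - expected_test

print(total, expected_test, expected_train)
```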


In [6]:
model_to_finetune = "distilgpt2"

In [7]:
tokenizer = AutoTokenizer.from_pretrained(model_to_finetune, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_to_finetune)


In [8]:
def split_into_sequences(inputs):
    # The maximum sequence length to use
    max_sequence_length = 128

    # First get all of the texts together into a collated dictionary
    collated = {}
    for k in inputs.keys():
        if k not in collated:
            collated[k] = []
        for v in inputs[k]:
            collated[k] += v

    # Round the total token count down to a multiple of max_sequence_length;
    # any leftover tokens at the end of the batch are dropped
    first_id = next(iter(inputs))
    total_length_of_texts = len(collated[first_id]) // max_sequence_length * max_sequence_length

    # Now split the texts into sequences
    result = {}
    for k, v in collated.items():
        sequences = []
        # Split the text into chunks of max_sequence_length
        for i in range(0, total_length_of_texts, max_sequence_length):
            sequence = v[i:i + max_sequence_length]
            sequences.append(sequence)
        # Add the sequences to the result
        result[k] = sequences
    result["labels"] = result["input_ids"].copy()
    return result
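
The chunking step is easier to follow on a toy input. The sketch below applies the same idea as `split_into_sequences` to a single list of token ids: concatenate everything, round the length down to a multiple of the chunk size, and slice. The helper name `chunk_tokens` and the toy ids are illustrative and not part of the notebook.

```python
def chunk_tokens(token_lists, max_sequence_length):
    """Concatenate token lists, then cut them into equal-length chunks."""
    flat = [token for tokens in token_lists for token in tokens]
    # Round the usable length down to a multiple of the chunk size;
    # the leftover tokens at the end are discarded.
    usable = len(flat) // max_sequence_length * max_sequence_length
    return [flat[i:i + max_sequence_length] for i in range(0, usable, max_sequence_length)]


# Three toy "documents" holding 10 token ids in total.
chunks = chunk_tokens([[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]], max_sequence_length=4)
print(chunks)
```

With a chunk size of 4, the ten tokens yield two full chunks and the last two tokens are dropped. Chunks can also span document boundaries, which is why the overlong inputs reported by the tokenizer warnings never reach the model at their original length.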

In [9]:
tokenized = dataset.map(lambda x: tokenizer(x["text"]), batched=True, num_proc=4, remove_columns=["text"])
sequences = tokenized.map(split_into_sequences, batched=True, batch_size=1000, num_proc=4)

Map (num_proc=4):   0%|          | 0/256303 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (3831 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1115 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1511 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1663 > 1024). Running this sequence through the model will result in indexing errors


Map (num_proc=4):   0%|          | 0/28479 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (6517 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (12557 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1432 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (4645 > 1024). Running this sequence through the model will result in indexing errors



In [None]:
notebook_login()

In [None]:
args = TrainingArguments(
    output_dir="adamthekiwi/toki-pona",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
)
trainer = Trainer(model=model, args=args, train_dataset=sequences["train"], eval_dataset=sequences["test"])
trainer.train()
trainer.push_to_hub()

# Perplexity is the exponential of the mean evaluation loss
results = trainer.evaluate()
perplexity = exp(results['eval_loss'])
print('Perplexity =', perplexity)
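
The perplexity calculation above relies on the evaluation loss being the mean per-token cross-entropy measured in nats. A small worked example, using a hypothetical helper `perplexity_from_loss`, shows how to read the number: a model that guesses uniformly among 50 tokens has a loss of ln(50) and therefore a perplexity of 50.

```python
from math import exp, log


def perplexity_from_loss(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return exp(mean_cross_entropy)


# A model guessing uniformly among 50 tokens has loss ln(50), so perplexity 50.
uniform_loss = log(50)
print(round(perplexity_from_loss(uniform_loss), 6))
```

Lower is better: a perplexity of 1 would mean the model assigns probability 1 to every correct next token.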