# Generate Toki-Pona datasets

This notebook consolidates the data in each of the different folders into a single dataset. It generates a file for sentence translations between Toki Pona, English, and optionally Chinese, a file for sentences in Toki Pona, and file containing entire documents in each language (if available).

In [47]:
import pandas as pd
from glob import glob
import os

In [48]:
CONTENT_TYPES = [
    ENCYCLOPEDIA_ARTICLE := 'encyclopedia article',
    ARTICLE := 'article',
    BLOG_POST := 'blog post',
    MAGAZINE := 'magazine',
    BIBLE := 'biblical text',
    STORY := 'story',
    POEM := 'poem',
    SCREENPLAY := 'screenplay',
    BOOK := 'book',
    CHAPTER := 'chapter',
    ESSAY := 'essay',
    CHAT := 'chat',
    OTHER := 'other',
]

FORMATS = [
    TEXT := 'text',
    MARKDOWN := 'markdown',
    IRC_LOG := 'irc log',
]

sentence_translations = pd.DataFrame(columns=['id', 'tok', 'eng', 'cmn'])
sentences = pd.DataFrame(columns=['id', 'content_type', 'sentence'])
documents = pd.DataFrame(columns=['id', 'name', 'content_type', 'tok', 'eng', 'cmn'])
chapters = pd.DataFrame(columns=['id', 'name', 'chapter_number', 'content_type', 'tok', 'eng', 'cmn'])

## Sentence translations

Go through the files in the `phrases` folder and generate a file containing the sentence translations. These files are:

|File|Language|Description|Length|
|----|--------|-----------|------|
|`common.md`|Toki Pona and English|Common phrases and responses|~100 pairs|
|`common2.tsv`|Toki Pona and English|Common sentences|~2000 pairs|
|`tatoeba-dev.eng-toki.tsv`|Toki Pona and English|Some Tatoeba translations between Toki Pona and English ([from this dataset dated to 2021](https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt/blob/main/dev/tatoeba-dev.eng-toki.tsv))|~8000 pairs|
|`tatoeba-test.eng-toki.tsv`|Toki Pona and English|Some Tatoeba translations between Toki Pona and English ([from this dataset dated to 2021](https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt/blob/main/test/tatoeba-test.eng-toki.tsv))|~5000 pairs|
|`translations.tsv`|Toki Pona, English, and Chinese|Tatoeba translations between Toki Pona, English, and Chinese (dated 4/14/2023)|~33000 pairs|

In [49]:
f = open(os.path.expanduser("phrases/common2.tsv"), "r", encoding="utf-8")
tsv = [line.strip().split("\t") for line in f]
f.close()
for tok, eng in tsv:
    sentences.loc[len(sentences)] = [len(sentences), OTHER, tok]
    sentence_translations.loc[len(sentence_translations)] = [len(sentence_translations), tok, eng, None]

print(len(sentences), len(sentence_translations))

1907 1907


In [50]:
f = open(os.path.expanduser("phrases/tatoeba-dev.eng-toki.tsv"), "r", encoding="utf-8")
tsv = [line.strip().split("\t") for line in f]
f.close()
for _, _, eng, tok in tsv:
    sentences.loc[len(sentences)] = [len(sentences), OTHER, tok]
    sentence_translations.loc[len(sentence_translations)] = [len(sentence_translations), tok, eng, None]

print(len(sentences), len(sentence_translations))

10609 10609


In [51]:
f = open(os.path.expanduser("phrases/tatoeba-test.eng-toki.tsv"), "r", encoding="utf-8")
tsv = [line.strip().split("\t") for line in f]
f.close()
for _, _, eng, tok in tsv:
    sentences.loc[len(sentences)] = [len(sentences), OTHER, tok]
    sentence_translations.loc[len(sentence_translations)] = [len(sentence_translations), tok, eng, None]

print(len(sentences), len(sentence_translations))

15599 15599


In [52]:
f = open(os.path.expanduser("phrases/translations.tsv"), "r", encoding="utf-8")
tsv = [line.strip().split("\t") for line in f]
f.close()
for row in tsv:
    if len(row) == 4:
        _, tok, eng, cmn = row
    elif len(row) == 3:
        _, tok, eng = row
        cmn = None
    if eng == '':
        eng = None
    if cmn == '':
        cmn = None

    sentences.loc[len(sentences)] = [len(sentences), OTHER, tok]
    sentence_translations.loc[len(sentence_translations)] = [len(sentence_translations), tok, eng, cmn]

print(len(sentences), len(sentence_translations))

48740 48740


In [53]:
print('Toki Pona:', sentences['sentence'].str.split().str.len().sum())
print('Toki Pona:', int(sentence_translations['tok'].str.split().str.len().sum()), 'English:', int(sentence_translations['eng'].str.split().str.len().sum()), 'Chinese:', sentence_translations['cmn'].str.split().str.len().sum())

Toki Pona: 402085
Toki Pona: 402085 English: 254765 Chinese: 5761.0


In [59]:
# Save the sentences and translations to a file.
sentences.to_csv(os.path.expanduser("phrases/sentences.tsv"), sep='\t', index=False)
sentence_translations.to_csv(os.path.expanduser("phrases/sentence_translations.tsv"), sep='\t', index=False)

# Reload them and confirm that they are the same.
sentences_copy = pd.read_csv(os.path.expanduser("phrases/sentences.tsv"), sep='\t')
assert sentences.equals(sentences_copy)

sentence_translations_copy = pd.read_csv(os.path.expanduser("phrases/sentence_translations.tsv"), sep='\t')
assert sentence_translations.equals(sentence_translations_copy)

## Documents and translations

Go through the files in each of the folders and add their entire contents to each field in the dataset. These files are in:

|Folder|Language|Description|Length|
|------|--------|-----------|------|
|`articles`|Toki Pona and English|Articles from Lipu Kule|Unknown|
|`chat`|Toki Pona and English|Chat logs from Unknown|Unknown|
|`comments`|Toki Pona|Comments on blog posts and reviews of books|Unknown|
|`dictionary`|Toki Pona and English|Toki Pona dictionary|Unknown|
|`encyclopedia`|Toki Pona|Articles from Wikipesija. The name of the document is the subject of the article.|Unknown|
|`magazines`|Toki Pona|Entire copies of Lipu Tenpo|Unknown|
|`stories`|Toki Pona and English|Stories in Toki Pona and English.|Unknown|
|`poems`|Toki Pona|Poems in Toki Pona.|Unknown|
|`screenplays`|Toki Pona and English|Screenplays and their translations.|Unknown|
|`bible`|Toki Pona and English|Texts relating to the bible.|Unknown|
|`livejournal-blog`|Toki Pona and English|Texts from LiveJournal blogs.|Unknown|

In [60]:
documents = pd.DataFrame(columns=['id', 'name', 'content_type', 'tok', 'eng', 'cmn'])
def get_files(dir, ext):
    # Get all the files in articles/tok/ and articles/eng/
    tok_files = glob(os.path.expanduser(f"{dir}/tok/*.{ext}"))
    eng_files = glob(os.path.expanduser(f"{dir}/eng/*.{ext}"))

    # Strip the path and extension from the filenames
    tok_files = [os.path.basename(f) for f in tok_files]
    eng_files = [os.path.basename(f) for f in eng_files]

    # Get the shared set of files
    tok_files = set(tok_files)
    eng_files = set(eng_files)
    shared_files = tok_files.intersection(eng_files)

    # Get the set of files that are only in tok/ or eng/
    tok_only_files = tok_files.difference(eng_files)
    eng_only_files = eng_files.difference(tok_files)

    return shared_files, tok_only_files, eng_only_files

shared_files, tok_only_files, eng_only_files = get_files("articles", "*")

# Get the shared files and save them in the documents table
for f in shared_files:
    tok = open(os.path.expanduser(f"articles/tok/{f}"), "r", encoding="utf-8").read()
    eng = open(os.path.expanduser(f"articles/eng/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    if eng == '':
        eng = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), ARTICLE, tok, eng, None]

# Get the files that are only in tok/ and save them in the documents table
for f in tok_only_files:
    tok = open(os.path.expanduser(f"articles/tok/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), ARTICLE, tok, None, None]

print(len(documents))

52


In [61]:
shared_files, tok_only_files, eng_only_files = get_files("stories", "*")

# Get the shared files and save them in the documents table
for f in shared_files:
    tok = open(os.path.expanduser(f"stories/tok/{f}"), "r", encoding="utf-8").read()
    eng = open(os.path.expanduser(f"stories/eng/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    if eng == '':
        eng = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), STORY, tok, eng, None]

# Get the files that are only in tok/ and save them in the documents table
for f in tok_only_files:
    tok = open(os.path.expanduser(f"stories/tok/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), STORY, tok, None, None]

print(len(documents))

114


In [62]:
shared_files, tok_only_files, eng_only_files = get_files("bible", "*")

# Get the shared files and save them in the documents table
for f in shared_files:
    tok = open(os.path.expanduser(f"bible/tok/{f}"), "r", encoding="utf-8").read()
    eng = open(os.path.expanduser(f"bible/eng/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    if eng == '':
        eng = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), BIBLE, tok, eng, None]

# Get the files that are only in tok/ and save them in the documents table
for f in tok_only_files:
    tok = open(os.path.expanduser(f"bible/tok/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), BIBLE, tok, None, None]

print(len(documents))

137


In [64]:
shared_files, tok_only_files, eng_only_files = get_files("bible", "*")

# Get the shared files and save them in the documents table
for f in shared_files:
    tok = open(os.path.expanduser(f"bible/tok/{f}"), "r", encoding="utf-8").read()
    eng = open(os.path.expanduser(f"bible/eng/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    if eng == '':
        eng = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), BLOG_POST, tok, eng, None]

# Get the files that are only in tok/ and save them in the documents table
for f in tok_only_files:
    tok = open(os.path.expanduser(f"bible/tok/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), BLOG_POST, tok, None, None]

print(len(documents))

160


In [65]:
shared_files, tok_only_files, eng_only_files = get_files("comments", "*")

# Get the shared files and save them in the documents table
for f in shared_files:
    tok = open(os.path.expanduser(f"comments/tok/{f}"), "r", encoding="utf-8").read()
    eng = open(os.path.expanduser(f"comments/eng/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    if eng == '':
        eng = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), CHAT, tok, eng, None]

# Get the files that are only in tok/ and save them in the documents table
for f in tok_only_files:
    tok = open(os.path.expanduser(f"comments/tok/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), CHAT, tok, None, None]

print(len(documents))

184


In [66]:
shared_files, tok_only_files, eng_only_files = get_files("jan Kipu Corpus", "*")

# Get the shared files and save them in the documents table
for f in shared_files:
    tok = open(os.path.expanduser(f"jan Kipu Corpus/tok/{f}"), "r", encoding="utf-8").read()
    eng = open(os.path.expanduser(f"jan Kipu Corpus/eng/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    if eng == '':
        eng = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), OTHER, tok, eng, None]

# Get the files that are only in tok/ and save them in the documents table
for f in tok_only_files:
    tok = open(os.path.expanduser(f"jan Kipu Corpus/tok/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), OTHER, tok, None, None]

print(len(documents))

1524


In [67]:
shared_files, tok_only_files, eng_only_files = get_files("magazines", "*")

# Get the shared files and save them in the documents table
for f in shared_files:
    tok = open(os.path.expanduser(f"magazines/tok/{f}"), "r", encoding="utf-8").read()
    eng = open(os.path.expanduser(f"magazines/eng/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    if eng == '':
        eng = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), MAGAZINE, tok, eng, None]

# Get the files that are only in tok/ and save them in the documents table
for f in tok_only_files:
    tok = open(os.path.expanduser(f"magazines/tok/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), MAGAZINE, tok, None, None]

print(len(documents))

1542


In [68]:
shared_files, tok_only_files, eng_only_files = get_files("poems", "*")

# Get the shared files and save them in the documents table
for f in shared_files:
    tok = open(os.path.expanduser(f"poems/tok/{f}"), "r", encoding="utf-8").read()
    eng = open(os.path.expanduser(f"poems/eng/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    if eng == '':
        eng = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), POEM, tok, eng, None]

# Get the files that are only in tok/ and save them in the documents table
for f in tok_only_files:
    tok = open(os.path.expanduser(f"poems/tok/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), POEM, tok, None, None]

print(len(documents))

1624


In [69]:
shared_files, tok_only_files, eng_only_files = get_files("screenplays", "*")

# Get the shared files and save them in the documents table
for f in shared_files:
    tok = open(os.path.expanduser(f"screenplays/tok/{f}"), "r", encoding="utf-8").read()
    eng = open(os.path.expanduser(f"screenplays/eng/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    if eng == '':
        eng = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), SCREENPLAY, tok, eng, None]

# Get the files that are only in tok/ and save them in the documents table
for f in tok_only_files:
    tok = open(os.path.expanduser(f"screenplays/tok/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), SCREENPLAY, tok, None, None]

print(len(documents))

1625


In [70]:
shared_files, tok_only_files, eng_only_files = get_files("encyclopedia", "*")

# Get the shared files and save them in the documents table
for f in shared_files:
    tok = open(os.path.expanduser(f"encyclopedia/tok/{f}"), "r", encoding="utf-8").read()
    eng = open(os.path.expanduser(f"encyclopedia/eng/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    if eng == '':
        eng = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), ENCYCLOPEDIA_ARTICLE, tok, eng, None]

# Get the files that are only in tok/ and save them in the documents table
for f in tok_only_files:
    tok = open(os.path.expanduser(f"encyclopedia/tok/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), ENCYCLOPEDIA_ARTICLE, tok, None, None]

print(len(documents))

1984


In [71]:
shared_files, tok_only_files, eng_only_files = get_files("chat", "*")

# Get the shared files and save them in the documents table
for f in shared_files:
    tok = open(os.path.expanduser(f"chat/tok/{f}"), "r", encoding="utf-8").read()
    eng = open(os.path.expanduser(f"chat/eng/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    if eng == '':
        eng = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), CHAT, tok, eng, None]

# Get the files that are only in tok/ and save them in the documents table
for f in tok_only_files:
    tok = open(os.path.expanduser(f"chat/tok/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), CHAT, tok, None, None]

print(len(documents))

2025


In [72]:
# Save the sentences and translations to a file.
documents.to_csv(os.path.expanduser("documents.tsv"), sep='\t', index=False)

# Reload them and confirm that they are the same.
documents_copy = pd.read_csv(os.path.expanduser("documents.tsv"), sep='\t', dtype={'id': 'int64', 'name': str, 'content_type': str, 'tok': str, 'eng': str, 'cmn': str if not pd.isnull('cmn') else None})
pd.testing.assert_frame_equal(documents, documents_copy, check_dtype=True)


# Find a story with an English translation
story = documents[(documents['content_type'] == STORY) & (documents['eng'].notnull())].sample(1).iloc[0]
print(story['name'])
print(story['tok'])
print(story['eng'])

Bible.txt
lipu nanpa wan pi jan Mose.

tenpo wan la jan sewi Jawe li pali e sewi laso en ma suli. taso, ma suli li jo e ala. ale li pimeja.
jan kon pi jan sewi Jawe li sewi pi telo suli.
jan sewi Jawe li toki e ni: 'o suno li wile'. suno li kama.
jan sewi Jawe li lukin e suno li pilin: suno li pona. jan sewi Jawe li pana e suno sama ala pi kon pimeja.
jan sewi Jawe li pana e nimi pi suno: tenpo suno.
jan sewi Jawe li pana e nimi pi kon pimeja: tenpo pimeja.
tenpo ni li tenpo suno nanpa wan.

 

jan sewi Jawe li toki e ni: mi pana e sike sewi laso pi telo. ona li pali e ni: telo wan li sama ala pi telo tu.
jan sewi Jawe li pali e ni. sike sewi laso li lon.
 telo sewi pi sike sewi laso li sama ala pi telo anpa pi sike sewi laso.
jan sewi Jawe li pana e nimi  pi sike sewi laso. tenpo ni li tenpo suno nanpa tu.

 

tenpo sin la jan sewi Jawe li toki e ni: o telo ale li kama insa telo wan. o ma kiwen li lon. ni li lon.
jan sewi Jawe li nimi e ma en telo suli. jan sewi li lukin. ale li pona.

In [73]:
# For each document, find the word count and add it all up
print('Toki Pona:', int(documents['tok'].str.split().str.len().sum()), 'English:', int(documents['eng'].str.split().str.len().sum()), 'Chinese:', documents['cmn'].str.split().str.len().sum())

Toki Pona: 1183457 English: 88437 Chinese: 0
