# Generate Toki Pona datasets

This notebook consolidates the data from the individual folders into a single dataset. It generates a file of sentence translations between Toki Pona, English, and optionally Chinese; a file of Toki Pona sentences; and a file containing entire documents in each language (where available).

In [50]:
import pandas as pd
from glob import glob
import os

In [51]:
CONTENT_TYPES = [
    ENCYCLOPEDIA_ARTICLE := 'encyclopedia article',
    BLOG_ARTICLE := 'blog article',
    MAGAZINE := 'magazine',
    BIBLE := 'biblical text',
    STORY := 'story',
    POEM := 'poem',
    SCREENPLAY := 'screenplay',
    BOOK := 'book',
    CHAPTER := 'chapter',
    ESSAY := 'essay',
    CHAT := 'chat',
    OTHER := 'other',
]

FORMATS = [
    TEXT := 'text',
    MARKDOWN := 'markdown',
    IRC_LOG := 'irc log',
]

sentence_translations = pd.DataFrame(columns=['id', 'tok', 'eng', 'cmn'])
sentences = pd.DataFrame(columns=['id', 'content_type', 'sentence'])
documents = pd.DataFrame(columns=['id', 'name', 'content_type', 'tok', 'eng', 'cmn'])
chapters = pd.DataFrame(columns=['id', 'name', 'chapter_number', 'content_type', 'tok', 'eng', 'cmn'])

## Sentence translations

Go through the files in the `phrases` folder and generate a file containing the sentence translations. These files are:

|File|Language|Description|Length|
|----|--------|-----------|------|
|`common.md`|Toki Pona and English|Common phrases and responses|~100 pairs|
|`common2.tsv`|Toki Pona and English|Common sentences|~2000 pairs|
|`tatoeba-dev.eng-toki.tsv`|Toki Pona and English|Some Tatoeba translations between Toki Pona and English ([from this dataset dated to 2021](https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt/blob/main/dev/tatoeba-dev.eng-toki.tsv))|~8000 pairs|
|`tatoeba-test.eng-toki.tsv`|Toki Pona and English|Some Tatoeba translations between Toki Pona and English ([from this dataset dated to 2021](https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt/blob/main/test/tatoeba-test.eng-toki.tsv))|~5000 pairs|
|`translations.tsv`|Toki Pona, English, and Chinese|Tatoeba translations between Toki Pona, English, and Chinese (dated 4/14/2023)|~33000 pairs|

In [52]:
with open(os.path.expanduser("phrases/common2.tsv"), "r", encoding="utf-8") as f:
    tsv = [line.strip().split("\t") for line in f]
for tok, eng in tsv:
    sentences.loc[len(sentences)] = [len(sentences), OTHER, tok]
    sentence_translations.loc[len(sentence_translations)] = [len(sentence_translations), tok, eng, None]

print(len(sentences), len(sentence_translations))

1907 1907
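
Appending with `.loc[len(df)]` enlarges the DataFrame one row at a time, which gets slow as the tables grow to tens of thousands of rows. A minimal sketch of an alternative, assuming the same column names as the tables defined above: collect plain dicts and build each DataFrame once. The helper `build_frames` and the sample pairs are hypothetical and not part of the original notebook.

```python
import pandas as pd

def build_frames(pairs, content_type="other", start_id=0):
    """Hypothetical helper: build both tables from (tok, eng, cmn) tuples in one pass."""
    sentence_rows = []
    translation_rows = []
    for offset, (tok, eng, cmn) in enumerate(pairs):
        row_id = start_id + offset
        sentence_rows.append({"id": row_id, "content_type": content_type, "sentence": tok})
        translation_rows.append({"id": row_id, "tok": tok, "eng": eng, "cmn": cmn})
    # Construct each DataFrame once instead of growing it row by row.
    sentences_df = pd.DataFrame(sentence_rows, columns=["id", "content_type", "sentence"])
    translations_df = pd.DataFrame(translation_rows, columns=["id", "tok", "eng", "cmn"])
    return sentences_df, translations_df

# Made-up sample data for illustration only.
sample_pairs = [("toki!", "Hello!", None), ("mi moku", "I eat", None)]
sentences_demo, translations_demo = build_frames(sample_pairs)
```

Each loader cell could extend a shared list of pairs, with the two DataFrames built a single time before saving.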


In [53]:
with open(os.path.expanduser("phrases/tatoeba-dev.eng-toki.tsv"), "r", encoding="utf-8") as f:
    tsv = [line.strip().split("\t") for line in f]
for _, _, eng, tok in tsv:
    sentences.loc[len(sentences)] = [len(sentences), OTHER, tok]
    sentence_translations.loc[len(sentence_translations)] = [len(sentence_translations), tok, eng, None]

print(len(sentences), len(sentence_translations))

10609 10609


In [54]:
with open(os.path.expanduser("phrases/tatoeba-test.eng-toki.tsv"), "r", encoding="utf-8") as f:
    tsv = [line.strip().split("\t") for line in f]
for _, _, eng, tok in tsv:
    sentences.loc[len(sentences)] = [len(sentences), OTHER, tok]
    sentence_translations.loc[len(sentence_translations)] = [len(sentence_translations), tok, eng, None]

print(len(sentences), len(sentence_translations))

15599 15599


In [55]:
with open(os.path.expanduser("phrases/translations.tsv"), "r", encoding="utf-8") as f:
    tsv = [line.strip().split("\t") for line in f]
for row in tsv:
    if len(row) == 4:
        _, tok, eng, cmn = row
    elif len(row) == 3:
        _, tok, eng = row
        cmn = None
    if eng == '':
        eng = None
    if cmn == '':
        cmn = None

    sentences.loc[len(sentences)] = [len(sentences), OTHER, tok]
    sentence_translations.loc[len(sentence_translations)] = [len(sentence_translations), tok, eng, cmn]

print(len(sentences), len(sentence_translations))

48740 48740
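
The loop above only handles rows with three or four fields. A row of any other length would silently reuse `tok`, `eng`, and `cmn` from the previous iteration, and `line.strip()` also removes trailing tabs, so a row whose last columns are empty can end up shorter than expected. A minimal sketch of a more defensive parser, assuming the column order `id, tok, eng, cmn` used in the cell above; `parse_translation_row` and the sample rows are hypothetical.

```python
from typing import Optional, Tuple

def parse_translation_row(line: str) -> Optional[Tuple[str, Optional[str], Optional[str]]]:
    """Hypothetical helper: parse one line of translations.tsv.

    Returns (tok, eng, cmn), or None when the row does not have 3 or 4 fields.
    """
    # Strip only the line terminator so trailing empty columns survive the split.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) not in (3, 4):
        return None  # malformed row: the caller decides whether to skip or report it
    if len(fields) == 3:
        fields.append("")  # pad the optional Chinese column
    _, tok, eng, cmn = fields
    # Empty strings become None, matching the cell above.
    return tok, (eng or None), (cmn or None)

# Made-up sample rows for illustration only.
rows = ["1\tmi moku\tI eat\t我吃\n", "2\ttoki\tHello\n", "3\tpona\t\t\n", "broken\n"]
parsed = [parse_translation_row(r) for r in rows]
```

Using this parser in place of the loop body could change the row counts printed above if the source file contains malformed rows, so the totals would need to be rechecked.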


In [56]:
# Save the sentences and translations to a file.
sentences.to_csv(os.path.expanduser("phrases/sentences.tsv"), sep='\t', index=False)
sentence_translations.to_csv(os.path.expanduser("phrases/sentence_translations.tsv"), sep='\t', index=False)

# Reload them and confirm that they are the same.
sentences_copy = pd.read_csv(os.path.expanduser("phrases/sentences.tsv"), sep='\t')
assert sentences.equals(sentences_copy)

sentence_translations_copy = pd.read_csv(os.path.expanduser("phrases/sentence_translations.tsv"), sep='\t')
assert sentence_translations.equals(sentence_translations_copy)

## Documents and translations

Go through the files in each folder and store each file's entire contents in the matching language field of the dataset. The folders are:

|Folder|Language|Description|Length|
|------|--------|-----------|------|
|`articles`|Toki Pona and English|Articles from Lipu Kule|Unknown|
|`chat`|Toki Pona and English|Chat logs (source unknown)|Unknown|
|`comments`|Toki Pona|Comments on blog posts and reviews of books|Unknown|
|`dictionary`|Toki Pona and English|Toki Pona dictionary|Unknown|
|`encyclopedia`|Toki Pona|Articles from Wikipesija. The name of the document is the subject of the article.|Unknown|
|`magazines`|Toki Pona|Entire copies of Lipu Tenpo|Unknown|
|`stories`|Toki Pona and English|Stories in Toki Pona and English.|Unknown|
|`poems`|Toki Pona|Poems in Toki Pona.|Unknown|
|`screenplays`|Toki Pona and English|Screenplays and their translations.|Unknown|

In [111]:
documents = pd.DataFrame(columns=['id', 'name', 'content_type', 'tok', 'eng', 'cmn'])
def get_files(folder, ext):
    # Get all the files in <folder>/tok/ and <folder>/eng/
    tok_files = glob(os.path.expanduser(f"{folder}/tok/*.{ext}"))
    eng_files = glob(os.path.expanduser(f"{folder}/eng/*.{ext}"))

    # Strip the directory path from the filenames (the extension is kept)
    tok_files = [os.path.basename(f) for f in tok_files]
    eng_files = [os.path.basename(f) for f in eng_files]

    # Get the shared set of files
    tok_files = set(tok_files)
    eng_files = set(eng_files)
    shared_files = tok_files.intersection(eng_files)

    # Get the set of files that are only in tok/ or eng/
    tok_only_files = tok_files.difference(eng_files)
    eng_only_files = eng_files.difference(tok_files)

    return shared_files, tok_only_files, eng_only_files

shared_files, tok_only_files, eng_only_files = get_files("articles", "*")

# Get the shared files and save them in the documents table
for f in shared_files:
    tok = open(os.path.expanduser(f"articles/tok/{f}"), "r", encoding="utf-8").read()
    eng = open(os.path.expanduser(f"articles/eng/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    if eng == '':
        eng = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), BLOG_ARTICLE, tok, eng, None]

# Get the files that are only in tok/ and save them in the documents table
for f in tok_only_files:
    tok = open(os.path.expanduser(f"articles/tok/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), BLOG_ARTICLE, tok, None, None]

print(len(documents))

51


In [112]:
shared_files, tok_only_files, eng_only_files = get_files("stories", "*")

# Get the shared files and save them in the documents table
for f in shared_files:
    tok = open(os.path.expanduser(f"stories/tok/{f}"), "r", encoding="utf-8").read()
    eng = open(os.path.expanduser(f"stories/eng/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    if eng == '':
        eng = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), STORY, tok, eng, None]

# Get the files that are only in tok/ and save them in the documents table
for f in tok_only_files:
    tok = open(os.path.expanduser(f"stories/tok/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), STORY, tok, None, None]

print(len(documents))

106


In [113]:
shared_files, tok_only_files, eng_only_files = get_files("poems", "*")

# Get the shared files and save them in the documents table
for f in shared_files:
    tok = open(os.path.expanduser(f"poems/tok/{f}"), "r", encoding="utf-8").read()
    eng = open(os.path.expanduser(f"poems/eng/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    if eng == '':
        eng = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), POEM, tok, eng, None]

# Get the files that are only in tok/ and save them in the documents table
for f in tok_only_files:
    tok = open(os.path.expanduser(f"poems/tok/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), POEM, tok, None, None]

print(len(documents))

188


In [114]:
shared_files, tok_only_files, eng_only_files = get_files("screenplays", "*")

# Get the shared files and save them in the documents table
for f in shared_files:
    tok = open(os.path.expanduser(f"screenplays/tok/{f}"), "r", encoding="utf-8").read()
    eng = open(os.path.expanduser(f"screenplays/eng/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    if eng == '':
        eng = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), SCREENPLAY, tok, eng, None]

# Get the files that are only in tok/ and save them in the documents table
for f in tok_only_files:
    tok = open(os.path.expanduser(f"screenplays/tok/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), SCREENPLAY, tok, None, None]

print(len(documents))

189


In [115]:
shared_files, tok_only_files, eng_only_files = get_files("encyclopedia", "*")

# Get the shared files and save them in the documents table
for f in shared_files:
    tok = open(os.path.expanduser(f"encyclopedia/tok/{f}"), "r", encoding="utf-8").read()
    eng = open(os.path.expanduser(f"encyclopedia/eng/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    if eng == '':
        eng = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), ENCYCLOPEDIA_ARTICLE, tok, eng, None]

# Get the files that are only in tok/ and save them in the documents table
for f in tok_only_files:
    tok = open(os.path.expanduser(f"encyclopedia/tok/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), ENCYCLOPEDIA_ARTICLE, tok, None, None]

print(len(documents))

548


In [116]:
shared_files, tok_only_files, eng_only_files = get_files("chat", "*")

# Get the shared files and save them in the documents table
for f in shared_files:
    tok = open(os.path.expanduser(f"chat/tok/{f}"), "r", encoding="utf-8").read()
    eng = open(os.path.expanduser(f"chat/eng/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    if eng == '':
        eng = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), CHAT, tok, eng, None]

# Get the files that are only in tok/ and save them in the documents table
for f in tok_only_files:
    tok = open(os.path.expanduser(f"chat/tok/{f}"), "r", encoding="utf-8").read()
    if tok == '':
        tok = None
    documents.loc[len(documents)] = [len(documents), os.path.basename(f).replace('__', '_').replace('__', '_').replace('_', ' '), CHAT, tok, None, None]

print(len(documents))

589
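
The six cells above repeat the same steps with a different folder and content type each time, and they open files without closing them. A minimal sketch of how that pattern could be folded into one helper, assuming the same `<folder>/tok/` and `<folder>/eng/` layout. The names `clean_name`, `read_or_none`, and `load_folder` are hypothetical. Unlike the cells above, this version sorts the filenames so the ids are reproducible between runs.

```python
import os
import tempfile
from glob import glob

def clean_name(filename):
    """Same cleanup as the cells above: collapse underscores, then turn them into spaces."""
    return filename.replace('__', '_').replace('__', '_').replace('_', ' ')

def read_or_none(path):
    """Return the file contents, or None if the file is missing or empty."""
    if not os.path.exists(path):
        return None
    with open(path, "r", encoding="utf-8") as fh:
        text = fh.read()
    return text or None

def load_folder(folder, content_type, start_id=0):
    """Hypothetical helper: build one record per file found in <folder>/tok/.

    Files that exist only in <folder>/eng/ are ignored, as in the cells above.
    """
    names = sorted(os.path.basename(p) for p in glob(os.path.join(folder, "tok", "*.*")))
    records = []
    for offset, name in enumerate(names):
        records.append({
            "id": start_id + offset,
            "name": clean_name(name),
            "content_type": content_type,
            "tok": read_or_none(os.path.join(folder, "tok", name)),
            "eng": read_or_none(os.path.join(folder, "eng", name)),
            "cmn": None,
        })
    return records

# Demo on a throwaway directory with made-up files.
with tempfile.TemporaryDirectory() as root:
    for lang in ("tok", "eng"):
        os.makedirs(os.path.join(root, lang))
    with open(os.path.join(root, "tok", "my__story.txt"), "w", encoding="utf-8") as fh:
        fh.write("mi moku.")
    with open(os.path.join(root, "eng", "my__story.txt"), "w", encoding="utf-8") as fh:
        fh.write("I eat.")
    with open(os.path.join(root, "tok", "empty.txt"), "w", encoding="utf-8") as fh:
        fh.write("")
    demo_records = load_folder(root, "story", start_id=5)
```

The records from every folder could be accumulated in one list and passed to `pd.DataFrame` once, with each call's `start_id` set to the number of records collected so far.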


In [127]:
# Save the documents to a file.
documents.to_csv(os.path.expanduser("documents.tsv"), sep='\t', index=False)

# Reload them and confirm that they are the same.
# Read every text column as str; pd.isnull('cmn') on a string literal is always False, so 'cmn' was always str.
documents_copy = pd.read_csv(os.path.expanduser("documents.tsv"), sep='\t', dtype={'id': 'int64', 'name': str, 'content_type': str, 'tok': str, 'eng': str, 'cmn': str})
pd.testing.assert_frame_equal(documents, documents_copy, check_dtype=True)


# Find a story with an English translation
story = documents[(documents['content_type'] == STORY) & (documents['eng'].notnull())].sample(1).iloc[0]
print(story['name'])
print(story['tok'])
print(story['eng'])

sermon.txt
jan Jesu li lukin e kulupu li tawa sewi nena li anpa e monsi ona. jan ona li kama tawa poka ona. jan Jesu li pana e sona tawa jan. ona li toki e ni:
jan mute li sona e ni: insa mi li ike. jan ni li pona. lawa pi jan sewi Jawe li pi ona.
jan mute li pilin ike. jan sewi Jawe li pona e pilin ona.
jan mute li sona e ni: mi anpa. jan ni li kama jo e ma tan jan sewi Jawe.
jan mute li wile pona. jan sewi Jawe li pona mute tawa jan ni.
jan mute li pona tawa jan ante la jan ni li pona. pali ike ante pi jan ni li lili tawa jan sewi Jawe.
jan mute li wile ala sona e ike. jan ni li lon poka pi jan sewi Jawe.
jan mute li wile pini e utala. jan ni li jan lili pi jan sewi Jawe.
jan mute li pona. taso jan ike li pakala e jan pona ni. lawa pi jan sewi Jawe li pi jan pona ni.
jan li ike tawa sina. ona li toki e ijo ike tawa sina tan ni: sina sona e mi. taso sina pona mute. o pilin pona. o pilin pona mute tan ni: ijo pona li pi sina lon ma sewi. tenpo pini la jan mute li ike tawa jan pona ante