Corpora

Cara Warner edited this page Apr 24, 2018 · 3 revisions

What are they?

The pantheon-generator package comes with 6 corpora. Each one is a collection of tokenized texts from which related words will be pulled when generating your pantheon. Think of them as Gene Pools. In alphabetical order they are:

  1. Erotica
  2. Fairytales
  3. Fantasy
  4. Mathematics
  5. Plants and Animals
  6. Sci-Fi

Find example output for each pool here.

How Tos...

  1. Pick your Pantheon's Gene Pool
  2. Define a new Gene Pool
  3. Add your own source material

Pick your Pantheon's Gene Pool

Before you generate any Gods, call tokens.pick_gene_pool() and pass it the name of a corpus. A corpus's name is the name of the directory that contains it, e.g. "sci-fi".

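# Assumes tokens, God and Pantheon have already been imported.
tokens.pick_gene_pool("sci-fi")  # select the corpus before creating any Gods
egg_donor = God("art", "philosophy", "XX")
sperm_donor = God("war", "diplomacy", "XY")
pantheon = Pantheon(egg_donor, sperm_donor)
pantheon.spawn(5)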
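
Because a corpus's name is simply the name of its directory, the valid names can be discovered by listing directories. The following is a minimal sketch of that idea, not part of pantheon-generator: `list_gene_pools` and its `root` argument are hypothetical, and the folder that actually holds the corpus directories may differ in your checkout.

```python
import os


def list_gene_pools(root):
    """Return the sorted names of the sub-directories of `root`.

    `root` is a hypothetical folder holding one directory per corpus;
    the real location inside pantheon-generator is an assumption.
    """
    return sorted(
        name
        for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )
```

Any name returned this way could then be passed to tokens.pick_gene_pool(), assuming `root` points at the folder where the corpus directories live.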

Define a new Gene Pool

A Gene Pool is a collection of tokenized texts. There are ~40 tokenized texts in the /data/corpora/ directory. By combining them in new ways you can produce new Gene Pools. For example, you could combine culinary-poisons.json with dutch-fairy-tales.json and deductive-logic.json to produce a Gene Pool named "eclectic". Here's how:

  1. From a terminal, cd to the pantheon directory and start the Python interpreter.
  2. Declare a list containing the JSON files you want to include in your corpus.
  3. Generate a tokens directory and sources.txt file for your corpus using make_tokens_dir().
  4. Generate the primary tokens list using make_tokens_list(). It consists of plural nouns (tag "NNS").
  5. Generate the secondary, or "mutant", tokens list the same way. It consists of gerunds (tag "VBG").

Here's the code:

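from tokens import *

list_tokenized_texts()  # list the tokenized texts you can choose from
sources = ["culinary-poisons.json", "dutch-fairy-tales.json", "deductive-logic.json"]
make_tokens_dir("eclectic", sources)   # create the tokens directory and sources.txt
make_tokens_list("eclectic", ["NNS"])  # primary tokens: plural nouns
make_tokens_list("eclectic", ["VBG"])  # secondary ("mutant") tokens: gerunds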

Your corpus is ready. Select it with tokens.set_tokens_lists(<dirname>).
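
"NNS" and "VBG" are Penn Treebank part-of-speech tags: NNS marks plural nouns and VBG marks gerunds or present participles. To make the filtering step concrete, here is a small sketch of selecting words by tag. It is an illustration only: `filter_by_tags` is a hypothetical helper, and the (word, tag) pair layout is an assumption, not necessarily how the package stores its tokenized texts.

```python
def filter_by_tags(tagged_words, tags):
    """Keep unique lowercase words whose POS tag is in `tags`, in first-seen order."""
    wanted = set(tags)
    seen = set()
    kept = []
    for word, tag in tagged_words:
        word = word.lower()
        if tag in wanted and word not in seen:
            seen.add(word)
            kept.append(word)
    return kept


# Hypothetical tagged sample; the package's JSON layout may differ.
sample = [
    ("Wolves", "NNS"), ("were", "VBD"), ("hunting", "VBG"),
    ("rabbits", "NNS"), ("and", "CC"), ("wolves", "NNS"),
]

primary = filter_by_tags(sample, ["NNS"])  # plural nouns
mutant = filter_by_tags(sample, ["VBG"])   # gerunds
```

With this sample, `primary` holds the plural nouns and `mutant` holds the single gerund, which mirrors the split between the primary and "mutant" tokens lists.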

Add your own source material

You can add your own sources to the /data/corpora directory. Here's how:

  1. Download .txt files into /data/corpora.
  2. From a terminal, cd to the pantheon directory and start the Python interpreter.
  3. Import tokens and call tokenize_texts() to automatically detect these new files and tokenize them into JSON files.

Here's the code:

from tokens import *
tokenize_texts()

Your tokenized texts are now available to be included in a new corpus.
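
For readers curious how "detect and tokenize" might work, here is a rough sketch using only the standard library. It is not the package's implementation: `tokenize_new_texts` is a hypothetical function, "new" is assumed to mean a .txt file with no matching .json file, and the regex word split stands in for whatever tokenizer the package really uses (which presumably also records part-of-speech tags).

```python
import json
import os
import re


def tokenize_new_texts(corpora_dir):
    """Write a .json word list for every .txt file that lacks one.

    Returns the sorted names of the JSON files created.
    """
    created = []
    for name in sorted(os.listdir(corpora_dir)):
        stem, ext = os.path.splitext(name)
        if ext != ".txt":
            continue
        json_path = os.path.join(corpora_dir, stem + ".json")
        if os.path.exists(json_path):
            continue  # already tokenized, leave it alone
        with open(os.path.join(corpora_dir, name), encoding="utf-8") as f:
            # Naive tokenization: lowercase runs of letters and apostrophes.
            words = re.findall(r"[a-z']+", f.read().lower())
        with open(json_path, "w", encoding="utf-8") as f:
            json.dump(words, f)
        created.append(stem + ".json")
    return created
```

Skipping files that already have a JSON counterpart keeps repeated runs cheap and avoids overwriting earlier output.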