Corpora
The pantheon-generator package comes with 6 corpora. Each one is a collection of tokenized texts from which related words will be pulled when generating your pantheon. Think of them as Gene Pools. In alphabetical order they are:
- Erotica
- Fairytales
- Fantasy
- Mathematics
- Plants and Animals
- Sci-Fi
Find example output for each pool here.
How Tos...
Before you generate any Gods, call `tokens.pick_gene_pool()` and pass it the name of a corpus. A corpus's name is the name of the directory that contains it, e.g. "sci-fi".
```python
tokens.pick_gene_pool("sci-fi")
egg_donor = God("art", "philosophy", "XX")
sperm_donor = God("war", "diplomacy", "XY")
pantheon = Pantheon(egg_donor, sperm_donor)
pantheon.spawn(5)
```
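The spawn step draws related words from the active Gene Pool to build each child. As a toy illustration of that idea (the pool data and helper below are hypothetical, not the package's actual data structures):

```python
import random

# Toy illustration of the Gene Pool idea: a child deity's traits are drawn
# from words related to its parents' domains. The pool contents and helper
# are invented; the real package pulls tokens from its corpora instead.
GENE_POOL = {
    "art": ["sculptures", "murals", "sonnets"],
    "war": ["sieges", "banners", "shields"],
}

def spawn_trait(domain_a, domain_b, rng=None):
    """Pick one related word from the combined pools of both parent domains."""
    rng = rng or random.Random()
    return rng.choice(GENE_POOL[domain_a] + GENE_POOL[domain_b])

print(spawn_trait("art", "war", random.Random(0)))
```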
A Gene Pool is a collection of tokenized texts. There are ~40 tokenized texts in the /data/corpora/ directory. By combining them in new ways you can produce new Gene Pools. For example, you could combine culinary-poisons.json
with dutch-fairy-tales.json
and deductive-logic.json
to produce a Gene Pool named "eclectic". Here's how:
- From a terminal, `cd` to the pantheon directory and enter the Python interpreter.
- Declare a list variable containing the JSON files you want to include in your corpus.
- Generate a tokens directory and sources.txt file for your corpus using `make_tokens_dir()`.
- Generate the primary tokens list. It consists of plural nouns.
- Generate the secondary or "mutant" tokens list. It consists of gerunds.
Here's the code:

```python
from tokens import *

list_tokenized_texts()
sources = ["culinary-poisons.json", "dutch-fairy-tales.json", "deductive-logic.json"]
make_tokens_dir("eclectic", sources)
make_tokens_list("eclectic", ["NNS"])  # primary list: plural nouns
make_tokens_list("eclectic", ["VBG"])  # mutant list: gerunds
```
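Conceptually, building a tokens list means filtering part-of-speech-tagged tokens by tag (NNS for plural nouns, VBG for gerunds). A minimal self-contained sketch of that filtering step; the package's internals may differ, and the sample tokens below are invented:

```python
# Sample (word, POS-tag) pairs, as a POS tagger might produce them.
tagged_tokens = [
    ("poisons", "NNS"),    # plural noun -> primary list
    ("simmering", "VBG"),  # gerund -> "mutant" list
    ("logic", "NN"),
    ("tales", "NNS"),
    ("deducing", "VBG"),
]

def make_list(tagged, wanted_tags):
    """Keep only the words whose POS tag is in wanted_tags."""
    return [word for word, tag in tagged if tag in wanted_tags]

primary = make_list(tagged_tokens, ["NNS"])  # ['poisons', 'tales']
mutants = make_list(tagged_tokens, ["VBG"])  # ['simmering', 'deducing']
```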
Your corpus is ready. Select it with `tokens.set_tokens_lists(<dirname>)`.
You can add your own sources to the `/data/corpora` directory. Here's how:
- Download .txt files into `/data/corpora`.
- From a terminal, `cd` to the pantheon directory and enter the Python interpreter.
- Use the `tokenize_texts()` method to automatically detect and tokenize these new files.
Here's the code:
```python
from tokens import *

tokenize_texts()
```
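Conceptually, this step turns each .txt file into a .json file of tokens. A naive, self-contained sketch of that behavior (the package's real tokenizer uses proper NLP tagging, not a simple regex, and this helper is illustrative only):

```python
import json
import os
import re
import tempfile

def tokenize_texts_sketch(directory):
    """Write a <name>.json token list next to each <name>.txt in directory."""
    for name in os.listdir(directory):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(directory, name)
        with open(path) as f:
            # Naive tokenization: runs of letters become tokens.
            tokens = re.findall(r"[A-Za-z]+", f.read())
        with open(os.path.splitext(path)[0] + ".json", "w") as f:
            json.dump(tokens, f)

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "demo.txt"), "w") as f:
        f.write("Gods of logic and tales.")
    tokenize_texts_sketch(d)
    with open(os.path.join(d, "demo.json")) as f:
        print(json.load(f))  # ['Gods', 'of', 'logic', 'and', 'tales']
```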
Your tokenized texts are now available to be included in a new corpus.