RESOURCES! #11

Open
dariusk opened this Issue Nov 1, 2013 · 48 comments

@dariusk
Owner
dariusk commented Nov 1, 2013

This is an open issue where you can comment and add resources that might come in handy for NaNoGenMo.

NOTE: at some point I will turn this into a more organized document, probably on the wiki for this repo.

@willf
willf commented Nov 1, 2013

I wrote a "Samsa bot" that uses Bing's Ngram database to generate text. You might find it and the associated libraries useful (all Ruby).

https://github.com/willf/microsoft_ngram/blob/master/examples/samsabot.rb

General library:

https://github.com/willf/microsoft_ngram

@dariusk
Owner
dariusk commented Nov 1, 2013

Since @willf is too humble to plug it, Wordnik is an indispensable resource for all things text-related: definitions, parts of speech, random words, rhymes, hypernyms, etc:

http://developer.wordnik.com/docs.html#!/word

@vitorio
vitorio commented Nov 1, 2013

Here's a dump of my notes about generating stories:

@rfreebern researched this problem a few years back for this game project of his:

Curses! is a single-player open-ended adventure game with the basic premise that the player is a fairy tale villain bent on wrecking many potential fairy tales as completely as possible. Fairy tale plots would be generated on-the-fly based on a basic generator template that attempts to intelligently combine dozens or hundreds of very basic fairy tale elements to create situations that are both unique and familiar. The PC's goal is not to just thwart the happy ending but to do it thoroughly: not just kill the handsome prince, but cripple and disfigure him while making the princess hate him and get exiled from her kingdom, for example.

Fairy tales are really well-explored variants of the standard storytelling archetypes described by people like Joseph Campbell. There are a couple of ways that fairy tales are organized, which include their plot outlines (although not their cultural or moral implications): Aarne-Thompson, and Propp. http://en.wikipedia.org/wiki/Aarne-Thompson_classification_system

Propp's classification system has been used as the basis for a number of generators and is still the most-used mechanism in the academic literature for such things: http://en.wikipedia.org/wiki/Vladimir_Propp

Propp generators are things like: http://www.fdi.ucm.es/profesor/fpeinado/projects/kiids/apps/protopropp/

Clicking through to their later Bard system shows examples at the bottom, and that whole KIIDS thing is for interactive narrative and computational narratology, which are the academic terms for this sort of thing (I call my work in this area automated storytelling with post-hoc computational narratives, as my use and implementation aren't for interaction).

Mark Finlayson's work out of MIT is a little more recent: http://www.mit.edu/~markaf/research.html

Plugging any of that research into Google Scholar and looking at recent citations of those papers is a good way to catch up.

The massively-multiplayer video game Star Wars Galaxies tried something along these lines with their Dynamic Points of Interest, but they weren't really well executed from a design and technical implementation perspective. They had a lot of potential, but Raph Koster describes their problems here: http://www.raphkoster.com/2010/04/30/dynamic-pois/

Outside of fairy tales, there are works like Plotto, which provide narrative guides to plot generation, and the monomyth-related works by Campbell, etc.: http://www.brainpickings.org/index.php/2012/01/06/plotto/

Plotto is actually in the public domain, and can be found in the Internet Archive here: https://archive.org/details/plottonewmethodo00cook

And journalism is getting into it, too. A program at Northwestern that took sports stats and turned them into sports articles worked out so well that they didn't publish much research at all and went right into a startup. The Wired article is here: http://www.wired.com/gadgetlab/2012/04/can-an-algorithm-write-a-better-news-story-than-a-human-reporter/all/1

The one paper I found by the Northwestern group cites one major paper from 1977 about "Tale-spin." You can look for citations from the Tale-spin article, and that brings up some interesting recent work from elsewhere: http://scholar.google.com/scholar?cites=8316499405683938909&as_sdt=5,44&sciodt=0,44&hl=en

Finally, there's this failed Kickstarter: http://www.kickstarter.com/projects/storybricks/storybricks-the-mmorpg-storytelling-toolset

Even more finally, I also found this PDF in a second set of notes: https://research.cc.gatech.edu/inc/content/sequential-recommendation-approach-interactive-personalized-story-generation

@darrentorpey

Thanks, @vitorio! That looks helpful.

@smadin
smadin commented Nov 1, 2013

(OK, I made a github account.)
https://pypi.python.org/pypi/wikipedia/1.0.3 is a Python interface to Wikipedia, which may also be helpful for the quick-and-dirty Markov-chain approach. It was very easy to hack together a script to fetch random Wikipedia tables for source text and churn out a "novel" of a given word count.
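
For anyone who hasn't seen the Markov-chain approach before, here is a minimal sketch of the idea in Python. It is an illustration only, not the script described above: `build_chain`, `generate`, and the `order` parameter are hypothetical names, and the source text is assumed to be a plain string you have already fetched from somewhere.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` words to the words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain


def generate(chain, word_count, seed=None):
    """Walk the chain until `word_count` words have been produced."""
    rng = random.Random(seed)
    state = rng.choice(list(chain))
    output = list(state)
    while len(output) < word_count:
        followers = chain.get(state)
        if not followers:
            # Dead end: jump to a random state and keep going.
            state = rng.choice(list(chain))
            output.extend(state)
            continue
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output[:word_count])
```

Calling `generate(build_chain(source_text), 50000)` would give a 50,000-word "novel", provided the source text is longer than `order` words. A higher `order` stays closer to the source; a lower one produces more nonsense.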

@dariusk
Owner
dariusk commented Nov 1, 2013

While in-browser DOM manipulation is obviously ruled by jQuery, my favorite NodeJS DOM parser/manipulator is Cheerio, which uses jQuery-style selectors.

Also if you're in Ruby and need to do HTML/XML parsing, Nokogiri rules the roost.

@rfreebern

I'm hanging out in #nanogenmo on FreeNode if anyone wants to join. We can toss ideas around on a casual basis there.

@dariusk
Owner
dariusk commented Nov 1, 2013

For those who aren't super IRC-literate, or just don't want to install an irc client, you can go here, pick a username, and visit #nanogenmo from your web browser:

http://webchat.freenode.net/?channels=#nanogenmo

@jiko
jiko commented Nov 1, 2013

The Bard project looks awesome. Thanks @vitorio!

@jiko
jiko commented Nov 1, 2013

Some Python resources:

@agladysh
agladysh commented Nov 2, 2013

An article about a generator of Recursive Fairy Tales in Haskell (in Russian): http://habrahabr.ru/post/136007/

Google Translate: http://translate.google.com/translate?hl=en&sl=ru&tl=en&u=http%3A%2F%2Fhabrahabr.ru%2Fpost%2F136007%2F

@darkliquid

Not strictly related, but there are several story-based/narrative-focused roleplaying games that could be used/formalised into a system for generating overall plot structures. I'm currently looking at Microscope, Fiasco and FATE Core as potential systems for having characters 'play' through a game and recording what they do and what actions they take to generate stories.

@jiko
jiko commented Nov 2, 2013

Here's some of my Python code for generating sentences based on supplied text. None of the Twitter-related code has been tested with v1.1 of the Twitter API, but it worked fine on v1.

  • Jambot, my first Twitter bot. Uses a 3-gram Markov model by default.
  • JamLitBot, a site that generates random 'sentences' and runs on Heroku. Here is the source code, which builds on JamBot's.
  • @lovecraft_ebooks also builds on JamBot, but uses a 4-gram Markov model.
  • omnibot simplifies bot creation and management. It includes three distinct text-generation methods.
  • wikov makes Lorem Ipsum from Wikipedia pages using a 2-gram Markov model.
@jiko
jiko commented Nov 3, 2013

The Dada Engine, which powers the infamous Postmodernism Generator, might come in handy. There's an online manual and a clone on GitHub.

@erkyrath
erkyrath commented Nov 3, 2013

Not a resource, but a suggestion: when you complete a novel, change the title of your issue to "$NovelTitle by $Author", so that we can easily browse them.

(Yeah, someone is now going to actually title their novel "$NovelTitle".)

If I were an over-organizational nerd, I would suggest setting up appropriate issue tags ("In Progress", "Complete", "Stupid Ideas", etc). But I leave that up to whether Darius is an over-organizational nerd.

@dariusk
Owner
dariusk commented Nov 3, 2013

I agree with you @erkyrath -- I'll try and prod people to do that when they're done. Issue tags... I might start labeling things myself!

@dariusk
Owner
dariusk commented Nov 3, 2013

Okay, I opened a new Issue ( #42 ) for general discussion. This thread remains the place for technical resources; the other thread is open to everything else.

@vitorio
vitorio commented Nov 3, 2013

Ficly ( http://ficly.com/stories and its predecessor Ficlets http://ficlets.ficly.com/ ) is a very-short-story writing community, where you have a 1024 character limit. There are lots of tiny stories on the site, but also, you can fork any story and write prequels and sequels to it. Some stories have multiple prequels and sequels, like an unintentional choose-your-own-adventure.

All of the Ficly and Ficlets content is licensed CC-BY-SA.

In late May 2013, I scraped all of Ficly and dumped 13,144 stories, all of which had at least one prequel or sequel, into a matching number of JSON files (there should be no standalone 1k-character stories). Each JSON file records the ID, URL and title of the story; the author's avatar, name and URL; the IDs and URLs of prequels and sequels; and the story content in Markdown.

The scraper (in Python) is probably a little prickly, as it's mostly uncommented, but the .zip of 13k JSON files could be dumped straight into a JSON document store and worked with directly. Perhaps someone wants to generate 50k words of choose-your-own-adventure stories or something.

https://github.com/vitorio/NaNoGenMo2013
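
To make the choose-your-own-adventure idea concrete, here is a rough Python sketch that loads a directory of story files and follows a random chain of sequels. The key names `"id"` and `"sequels"` are assumptions for illustration; check the actual JSON files for the real field names before using anything like this. `load_stories` and `random_thread` are hypothetical helpers, not part of the scraper.

```python
import json
import random
from pathlib import Path


def load_stories(directory):
    """Read every *.json file in `directory` into a dict keyed by story ID."""
    stories = {}
    for path in Path(directory).glob("*.json"):
        story = json.loads(path.read_text(encoding="utf-8"))
        stories[story["id"]] = story  # "id" is an assumed key name
    return stories


def random_thread(stories, start_id, max_length=50, seed=None):
    """Follow randomly chosen sequels from `start_id`, never revisiting a story."""
    rng = random.Random(seed)
    thread, seen, current = [], set(), start_id
    while current in stories and current not in seen and len(thread) < max_length:
        seen.add(current)
        thread.append(stories[current])
        # "sequels" is an assumed key holding a list of story IDs.
        sequels = [s for s in stories[current].get("sequels", []) if s not in seen]
        current = rng.choice(sequels) if sequels else None
    return thread
```

Concatenating the content of each story in a thread, and repeating from different starting stories, is one way to pile up words toward the 50k target.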

@darkliquid

I've done some basic gathering of info over a few sources to generate a bunch of sentence structures using parts-of-speech tagging while I've been researching. Others might find this useful, so you can find them here: https://github.com/darkliquid/NaNoGenMo/tree/master/data

The data is basically one sentence to a line, each line containing a stream of space separated parts-of-speech tags. There are likely to be mistakes in the set as I've hacked this together without any real understanding of what it is I'm doing or what I yet hope to achieve from it, but have at it and good luck!
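
One way such a file could be used, sketched in Python: treat each line as a template and fill every tag with a random word that carries it. The tag names and the tiny `LEXICON` below are assumptions for illustration (Penn Treebank-style tags are a guess at the format), and `fill_structure` is a hypothetical helper.

```python
import random

# A toy lexicon keyed by Penn Treebank-style tags (an assumption; the real
# data may use a different tag set, and a real lexicon would be far larger).
LEXICON = {
    "DT": ["the", "a"],
    "JJ": ["crooked", "silent"],
    "NN": ["prince", "forest", "shoe"],
    "VBD": ["crossed", "forgot"],
}


def fill_structure(line, lexicon, rng=random):
    """Replace each tag on a line with a random word carrying that tag."""
    words = []
    for tag in line.split():
        choices = lexicon.get(tag)
        # Keep unknown tags visible rather than guessing at a word.
        words.append(rng.choice(choices) if choices else f"<{tag}>")
    return " ".join(words)
```

For example, `fill_structure("DT JJ NN VBD DT NN", LEXICON)` yields a six-word sentence such as "the silent prince crossed a forest", with the exact words varying from run to run.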

@dariusk
Owner
dariusk commented Nov 3, 2013

To be clear, @darkliquid's output can be interpreted by looking at this list of part of speech tags.

@catseye
catseye commented Nov 3, 2013

It would be very difficult to use it in an automated way (and I realize it may be unpopular with some participants) but if you haven't heard of it, there's this site called TVTropes. It contains a vast array of, well, tropes (from fiction in general, mostly mass-media but not exclusively television,) pre-deconstructed for your convenience. For example, Applied Phlebotinum.

@lazerwalker

Speaking of parts-of-speech tagging (cc @darkliquid), if you're literate in Objective-C, Apple's NSLinguisticTagger API is fantastic. (http://nshipster.com/nslinguistictagger/)

@darkliquid

Wow, that is nice. Sadly it's of no use to me in the Linux world, but that looks like a much richer source of data for the kinds of analysis I'm looking to do.

On another note, I've started annotating the parts-of-speech tag definitions with example words and some extra rules for their use in sentences where applicable (which hopefully I can then use to scan my sentence structure list to bin structures that are grammatically incorrect). https://github.com/darkliquid/NaNoGenMo/blob/master/data/tag_types.txt

@enkiv2
enkiv2 commented Nov 4, 2013

WordNet can be coaxed into doing part-of-speech tagging (in addition to providing synonyms, antonyms, and other related words), although the tagging requires a hack: iterate over parts of speech until the word has a synonym in that group, then guess which part of speech the word is actually being used as. I'd recommend using that on *nix, since it has other (more useful) functions.

Tangentially, I have a resource to contribute. https://github.com/enkiv2/synonym-warp will take a text document and randomly replace some words with synonyms (which slightly warps the semantics, since the synonyms it uses aren't necessarily appropriate to the context). It expects to run on a Unix under zsh, with WordNet in the path. I'm planning to run input texts through it before training a Markov model, to add a little noise.
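
As an illustration of the synonym-replacement idea (not the synonym-warp code itself, which is a zsh script driving WordNet), here is a minimal Python sketch. The `SYNONYMS` table is a made-up stand-in for WordNet lookups, and `warp` and its `rate` parameter are hypothetical names.

```python
import random
import re

# Stand-in thesaurus; the tool described above queries WordNet instead.
SYNONYMS = {
    "house": ["dwelling", "residence"],
    "old": ["ancient", "aged"],
    "walked": ["strolled", "marched"],
}


def warp(text, synonyms, rate=0.5, seed=None):
    """Swap a random share of known words for a randomly chosen synonym."""
    rng = random.Random(seed)

    def replace(match):
        word = match.group(0)
        options = synonyms.get(word.lower())
        if options and rng.random() < rate:
            return rng.choice(options)
        return word

    # Only alphabetic runs are candidates; punctuation and spacing pass through.
    return re.sub(r"[A-Za-z]+", replace, text)
```

With `rate=0.0` the text comes back unchanged, and with `rate=1.0` every word found in the table is replaced, so the parameter controls how much noise reaches the Markov model.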

@jiko
jiko commented Nov 4, 2013

@darkliquid Nice work! Part of speech tagging seems like a fruitful avenue.

I've played with this JavaScript PoS tagger in the last few days. I found it through The node.js Natural Language Story, a blog post by the maintainer of a package of general natural language facilities for Node. I also found another interesting Node package that generates random sentences from BNF grammars, along the lines of the Dada Engine mentioned above.
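
For readers new to grammar-based generation, the core technique is small enough to sketch in a few lines of Python. This is not the Node package or the Dada Engine, just the general idea: pick one alternative for a symbol at random and expand it recursively. The `GRAMMAR` and the `expand` function are invented for illustration.

```python
import random

# Hypothetical toy grammar: each nonterminal maps to a list of alternatives,
# and each alternative is a sequence of nonterminals and literal words.
GRAMMAR = {
    "<sentence>": [["<noun-phrase>", "<verb>", "<noun-phrase>"]],
    "<noun-phrase>": [["the", "<noun>"], ["the", "<adjective>", "<noun>"]],
    "<noun>": [["villain"], ["princess"], ["kingdom"]],
    "<adjective>": [["wicked"], ["handsome"]],
    "<verb>": [["curses"], ["exiles"]],
}


def expand(symbol, grammar, rng=random):
    """Recursively expand `symbol`; anything not in the grammar is a literal."""
    if symbol not in grammar:
        return [symbol]
    words = []
    for part in rng.choice(grammar[symbol]):
        words.extend(expand(part, grammar, rng))
    return words
```

`" ".join(expand("<sentence>", GRAMMAR))` produces sentences like "the wicked villain exiles the princess". A grammar whose rules refer back to themselves needs a depth limit or a non-recursive alternative, or expansion may never terminate.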

@enkiv2
enkiv2 commented Nov 6, 2013

For anybody rolling their own grammars, I found a constraint solver in Python: https://github.com/switham/constrainer

On Wed, Nov 6, 2013 at 4:37 AM, Andrew Montgomery-Hurrell wrote:

Some lists of names, places, occupations, etc. for generating character details.

  • Names: http://stackoverflow.com/questions/1803628/raw-list-of-person-names
  • Titles: http://www.gutenberg.org/dirs/GUTINDEX.ALL
  • US Cities: http://wiki.skullsecurity.org/images/5/54/US_Cities.txt
  • Job Titles: http://www.bls.gov/soc/soc_2010_direct_match_title_file.xls
  • Adjectives: http://www.enchantedlearning.com/wordlist/adjectives.shtml
  • Nouns: http://www.momswhothink.com/reading/list-of-nouns.html

@elib
elib commented Nov 6, 2013

I don't know if anyone has referenced this crucial resource.
https://www.youtube.com/watch?v=FUa7oBsSDk8

@darkliquid

I've been running a term extraction for the last couple of days that just finished running. It has various 'terms' i.e. the key noun or noun phrase/topic that a sentence is about, extracted from around half a million sentences across a wide range of sources (gutenberg novels, news articles, etc). I'm not sure I'll even use it now, but it might be of use for people looking to seed their stories with random topics.

https://github.com/darkliquid/NaNoGenMo/blob/master/data/terms_cleaned.txt.gz

@enkiv2
enkiv2 commented Nov 8, 2013

I was inspired by somebody's example of dialogue generation, and so I wrote some code to parse an ontology and create some question/answer pairs based on categories: https://github.com/enkiv2/NaNoGenMo2013

At some point, I'll need to hack it to generate other kinds of dialogue.
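
To show what category-based question/answer generation can look like, here is a small Python sketch. It is not the code in the repository above; the `IS_A` ontology, the question wording, and both function names are assumptions made up for the example.

```python
# Hypothetical ontology: each entry names its parent category.
IS_A = {
    "sparrow": "bird",
    "bird": "animal",
    "oak": "tree",
    "tree": "plant",
}


def ancestors(term, is_a):
    """Return every category above `term`, nearest first."""
    chain = []
    # The second condition stops the walk if the ontology contains a cycle.
    while term in is_a and is_a[term] not in chain:
        term = is_a[term]
        chain.append(term)
    return chain


def question_answer_pairs(is_a):
    """Build a yes/no exchange for every term and every other category."""
    terms = sorted(set(is_a) | set(is_a.values()))
    pairs = []
    for term in is_a:
        above = ancestors(term, is_a)
        for category in terms:
            if category == term:
                continue
            answer = "Yes." if category in above else "No."
            question = f"Does '{term}' belong to the category '{category}'?"
            pairs.append((question, answer))
    return pairs
```

Alternating the questions and answers between two named speakers turns the list into dialogue, and the number of pairs grows roughly with the square of the ontology's size.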

@warnaars
warnaars commented Nov 9, 2013

You might find this an interesting take on 'automated content authorship'
http://youtu.be/SkS5PkHQphY

@MichaelPaulukonis

@warnaars Philip M. Parker! I would love to see some of his novelistic output.... I'd really love to see some of his code. I've got some more links on him at http://www.xradiograph.com/WordSalad/AutomaticForThePeople

@lilinx
lilinx commented Nov 9, 2013

"If the atoms have by chance formed so many sorts of figures, why did it never fall out that they made a house or a shoe? Why at the same rate should we not believe that an infinite number of Greek letters, strewed all over a certain place, might fall into the contexture of the Iliad?"
Michel de Montaigne (1533-1592), Essais

@ikarth
ikarth commented Nov 12, 2013

For that matter, how about a Library of Babel generator? (Not mine) http://dicelog.com/babel

@notio
notio commented Nov 12, 2013

Not open source, but still! The Fiction Idea Generator is interesting: http://figapps.net/fig.html

It's free this month (iTunes): https://itunes.apple.com/app/fiction-idea-generator-ef/id507536455?mt=8

@lilinx
lilinx commented Nov 14, 2013

Also, you might be interested in the works of Jean-Pierre Balpe.
This man has been doing generative literature experiments for a while. He has countless bot-blogs generating the weirdest things. Unfortunately he seems to do everything in French: it's very difficult to find anything about him in English (there isn't even an English Wikipedia article). But there is this short article: http://www.digitalarti.com/blog/digitalarti_mag/portrait_jean_pierre_balpe_inventor_of_literature

@catseye
catseye commented Nov 21, 2013

In one issue here somewhere I obliquely suggested generating a graphic novel -- that is to say, a comic book. While I would love to try, I definitely won't have the time to do this in what remains of November, but here are some resources I found while researching it:

http://openclipart.org is a collection of SVG images, all in the public domain. It can also render them as PNGs for you, at the scale you choose. It has a JSON API: http://openclipart.org/developers

If you wanted to use that JSON API on your own web page (perhaps to display these images on an HTML5 canvas element) you could use this generic JSONP proxy to make a mockery of the same-origin policy: http://jsonp.jit.su/

Here is a library of onomatopoeic sound-effects: http://www.writtensound.com/index.php Not sure how easy it would be to scrape, but probably wouldn't be hard to pick a random item from a desired category, like: http://www.writtensound.com/index.php?term=movement

Here is a list of catchphrases: https://en.wikipedia.org/wiki/List_of_catchphrases

And, just for that extra dadaist touch & in no way limited to graphic novels, here is a list of various abuses of the statistical meaning of p-value, collected from various academic papers: http://mchankins.wordpress.com/2013/04/21/still-not-significant-2/

I imagine the result of using these resources would be something like:

a sombrero with a word balloon saying "Cowabunga" next to Tux (the Linux penguin) with a thought bubble saying "did not quite reach conventional levels of statistical significance (p=0.079)"... with the word SCHHWAFF at a slight angle and in a large-point font, in the background

@MichaelPaulukonis

@catseye check out blotcomics and the graphic novel harsh noise.

I can't shake the feeling that the end result of your automation, however, will end up looking like ELER.

@ikarth
ikarth commented Nov 21, 2013

If we're going graphical I should probably mention the billion-year archives of the webcomic mezzacotta: http://www.mezzacotta.net/

@bredfern

You can take a look at the text of my Automated Lovecraft project here: https://github.com/bredfern/automated-lovecraft/blob/master/automated_lovecraft.md

@bredfern

The interesting thing I learned is that more firepower doesn't produce a better result: there's a sweet spot between the size of the data set and the number of layers, so to train on all of Lovecraft's text I got the best results using Torch with just 4 layers. Since I was running off char nn, most of the code I wrote was actually just bash scripts to run Torch processes. I want to get deeper into this stuff so I can go further with it, but it's exciting to see the training result, never having done this before.

@hugovk
Contributor
hugovk commented Nov 18, 2015

@bredfern Wrong repo! This is the 2013 one, here's this year's: dariusk/NaNoGenMo-2015#1
