Who Lives in a Pineapple Under the Sea? MIS-TER DAR-CY! #133

Open
toomuchpete opened this Issue Nov 5, 2015 · 10 comments

@toomuchpete

My goal is to process a book from Project Gutenberg's Top 100 list, possibly Pride and Prejudice. The book will remain largely intact, but the quotes will be replaced with quotes generated from a corpus compiled from Spongebob Squarepants fanfic (collected from FanFiction.net).
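A minimal sketch of the quote-swapping step, assuming the source text marks dialogue with paired double quotes (straight or curly). `replace_quotes` and `fake_generator` are hypothetical names; the generator is only a stand-in for whatever model gets built from the fanfic corpus.

```python
import re
from typing import Callable

# Matches the shortest span between an opening and a closing double quote.
# Assumption: dialogue uses paired straight (") or curly (\u201c \u201d) quotes.
QUOTE_RE = re.compile(r'(["\u201c])(.+?)(["\u201d])', re.DOTALL)


def replace_quotes(text: str, generate: Callable[[str], str]) -> str:
    """Swap the contents of every quoted span; narration and quote marks stay."""
    def swap(match):
        opening, original, closing = match.groups()
        return f"{opening}{generate(original)}{closing}"
    return QUOTE_RE.sub(swap, text)


def fake_generator(original: str) -> str:
    # Hypothetical stand-in for a generator trained on the fanfic corpus.
    return "I'm ready!"


sample = ('"My dear Mr. Bennet," said his lady to him one day, '
          '"have you heard that Netherfield Park is let at last?"')
result = replace_quotes(sample, fake_generator)
print(result)
```

Multi-paragraph speeches, where the opening quote repeats without a closing one, would confuse this pattern and need separate handling.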

Probably the hardest thing to solve is getting the names right. It would be disorienting to see Spongebob's name littered around Pride and Prejudice, but maybe that will be funny? Otherwise, some replacement could probably be done, translating character names between the two.

@toomuchpete

As requested, @kyfast.

@dariusk
Owner
dariusk commented Nov 5, 2015

Love it.

@KyFaSt
KyFaSt commented Nov 5, 2015

🍍

@MichaelPaulukonis

There's been dialogue swapping in the past, and I did character/noun swapping between two texts as well. But nobody has tackled the problem of getting references straight. I thought about it as one of my projects this year, but don't know if I'll get to it.

I won't be sad if you do the work for the rest of us!

@enkiv2
enkiv2 commented Nov 5, 2015

The word2vec-related projects have managed to translate references. If you
make an explicit list of proper names in each source, you can probably make
an explicit translation or use word2vec to produce correspondences for you.

@MichaelPaulukonis

I would be intrigued to see this work; one problem is handling eponyms, nicknames, gender references, and titles. "King" posed a particular problem for me, as the POS tagger I was using always decided it was a verb. @enkiv2 - can you link to one or more projects that managed to translate references?

@enkiv2
enkiv2 commented Nov 5, 2015

Take a look at the translated titles and authors in #72; this is what I mean.
Word2vec correctly figured out that certain proper nouns were similar in
the same way that it figured out that certain nouns are similar in general,
from what I understand. If you whitelist proper nouns and have an explicit
list of identical ways of referring to the same person which you normalize,
you can do that with better reliability, but at that point you've done most
of the work of creating a correspondence table between sets of characters
and you might as well just do string replacement on them.
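The string-replacement route could look roughly like the sketch below. The alias lists and the cast mapping are hypothetical examples, not a real correspondence table; the idea is only that every way of referring to a character is normalized to one id, and that id is then looked up in the table.

```python
import re

# Hypothetical alias lists: each surface form normalizes to one canonical id.
PP_ALIASES = {
    "darcy": ["Mr. Darcy", "Fitzwilliam Darcy", "Darcy"],
    "elizabeth": ["Miss Elizabeth Bennet", "Elizabeth", "Lizzy", "Eliza"],
}

# Hypothetical correspondence table between the two casts.
PP_TO_SPONGEBOB = {"darcy": "SpongeBob", "elizabeth": "Sandy"}


def build_pattern(aliases):
    # Longest forms first, so "Mr. Darcy" is tried before the bare "Darcy".
    forms = sorted(
        (form for forms in aliases.values() for form in forms),
        key=len,
        reverse=True,
    )
    return re.compile(r"\b(" + "|".join(re.escape(f) for f in forms) + r")\b")


def translate_names(text, aliases, table):
    # Invert the alias lists: surface form -> canonical id.
    canonical = {form: cid for cid, forms in aliases.items() for form in forms}
    pattern = build_pattern(aliases)
    return pattern.sub(lambda m: table[canonical[m.group(1)]], text)


translated = translate_names(
    "Elizabeth could not forgive Mr. Darcy.", PP_ALIASES, PP_TO_SPONGEBOB
)
print(translated)
```

Pronouns and bare titles ("he", "the gentleman") are not covered by this; they would still need whatever gender and title handling the rest of the thread is worried about.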

@ikarth
ikarth commented Nov 6, 2015

My Gutenberg Shuffle from 2013 attempted to respect references, but it turned out to be a bigger project than anticipated. It sort of got gender right, though I'd redo it if I went that way again.

Note that, at least for the libraries in gensim, pos-taggers work better on sentences rather than individual words.

@enkiv2
enkiv2 commented Nov 6, 2015

I was thinking you'd operate on the whole sentences, but then only pay
attention to the whitelisted words.
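Something like the following sketch, where the tagger is passed in as a function so that it always sees the full sentence. `toy_tagger` is a deliberately crude stand-in, not a real tagger; in practice something like NLTK's `nltk.pos_tag` applied to a tokenized sentence could fill that slot. All names here are hypothetical.

```python
from typing import Callable, List, Set, Tuple

Tagged = List[Tuple[str, str]]


def whitelisted_tags(
    tokens: List[str],
    tag_sentence: Callable[[List[str]], Tagged],
    whitelist: Set[str],
) -> Tagged:
    """Tag the whole sentence for context, then keep only whitelisted tokens."""
    return [(tok, tag) for tok, tag in tag_sentence(tokens) if tok in whitelist]


def toy_tagger(tokens: List[str]) -> Tagged:
    # Stand-in only: calls any capitalized, non-initial token a proper noun.
    return [
        (tok, "NNP" if i > 0 and tok[:1].isupper() else "OTHER")
        for i, tok in enumerate(tokens)
    ]


tokens = "Then King Neptune spoke to Patrick".split()
kept = whitelisted_tags(tokens, toy_tagger, {"King", "Neptune", "Patrick"})
print(kept)
```

The tags on non-whitelisted words are simply discarded, so a tagger mistake on an ordinary word does no harm.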

@michelleful

It sounds like what you'd need (if you did choose to somehow "translate" the names) is to have the names in the Spongebob corpus tagged for named entities, but in case it's useful to have a version of P&P that is name-tagged, the P&P e-text at Pemberley.com is conveniently so.

<P>``<A HREF="ppdrmtis.html#MrBennet">Mr.&#32;Bennet</A>, how can you abuse your own children in such way?  You take delight in vexing me.  You have no compassion on my poor nerves.''</P>
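A rough sketch of pulling those name tags out with Python's standard `html.parser`. It assumes that every link into `ppdrmtis.html#...` marks a character reference and that the fragment after `#` is a usable character id; I have only checked that against the one line above. `NameTagParser` is a hypothetical name.

```python
from html.parser import HTMLParser


class NameTagParser(HTMLParser):
    """Collect (character_id, surface_text) pairs from name-tagged links."""

    def __init__(self):
        super().__init__()  # default convert_charrefs=True decodes &#32;
        self.names = []
        self._current_id = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, not attribute values.
        href = dict(attrs).get("href") or ""
        if tag == "a" and "ppdrmtis.html#" in href:
            self._current_id = href.split("#", 1)[1]
            self._buffer = []

    def handle_data(self, data):
        if self._current_id is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_id is not None:
            self.names.append((self._current_id, "".join(self._buffer)))
            self._current_id = None


snippet = (
    '<P>``<A HREF="ppdrmtis.html#MrBennet">Mr.&#32;Bennet</A>, '
    "how can you abuse your own children in such way?''</P>"
)
parser = NameTagParser()
parser.feed(snippet)
parser.close()
print(parser.names)
```

The collected ids could then serve as the canonical side of a name correspondence table.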