fimfiction.net is the main place where bronies publish fanfiction. It's home to over 130,000 stories (and growing daily) and 9 GB of unformatted text. With this much data on our hooves, we can finally figure out who the best pony is (among other things).
Sentiment analysis is the act of looking at some text and quantitatively
measuring its sentiment.
For example, we might assign a score of -0.6 to the text
Cows are terrifying because they're fat and they talk a lot
because it's spoken in a "negative" manner. On the other hand,
Your smile fills me with glee
might get a score of +0.8.
By analyzing the sentiment of each sentence in which a given character appears across the entire set of stories available to us, we can glean some information about which characters are spoken of in the most positive light. We can also track this data across time to see how the general mood of writings on fimfiction has changed over time, or even the ways in which it tends to change throughout individual stories (beginning, middle, end).
Besides sentiment analysis, we can also track which words occur nearby each character's name inside the text and use this to see which words (or non-words) are most associated with each character. We can also see which characters are most often paired with each other.
Finally, we can perform analysis on the story metadata to see how story ratings correlate to things like the length of the story or the time at which it was published.
Performing the analysis requires
ebook-convert, as well as a python 3
installation, the python3
nltk library and the SentiWordNet database
installed systemwide or into
pyenchant Python library is also needed (specifically, the
For Arch users, the relevant packages can be installed via:
# pacman -S ebook-convert python-nltk nltk-data python-pyenchant
You will also need to download the fimfiction dump and extract it to
res/fimfarchive-20160525 (the path may be edited at the top of the Makefile
for future dumps).
The fimfiction dumps can be found on JockeTF's
reddit thread or they can be obtained by running the fimfarchive code
Just navigate to the repository directory and type
This will first convert the .epub books to plaintext, perform
sentiment and word analysis on each story in isolation, aggregate the
results, and finally generate plots in the
If you have multiple cores, I highly advise making use of those with
make -j<num_cores>. The entire process takes 2-3 days on a modern mobile i5 processor and expect the
build/ directory to grow to around 20 GB.
You can interrupt the build process at any time, and
make will pick up where it left off the next time you invoke it.
Results can be found here along with a brief write-up.
All code in this repository is licensed under the terms of the GPLv3. Refer to the provided
LICENSE file for the complete text.