*** NOTE: Discontinued. *** This project is discontinued. The author now considers this to be a poor method. A superior method is to extract the text from HTML files instead. See the Anne of Green Garbles notes for more information.
Guten-gutter is a command-line filter for stripping the boilerplate off of text files from Project Gutenberg. I was using gutenizer for this purpose, but it has some shortcomings and there were several Project Gutenberg texts which it failed to properly strip, so I wrote this as a more robust replacement. It's also (like Project Gutenberg texts themselves) in the public domain.
If you want to get just the book's text out of a Project Gutenberg text file:
script/guten-gutter pg10662.txt > The_Night_Land.txt
If you want to do that to an entire collection of Project Gutenberg files:
mkdir cleaned
script/guten-gutter ../gutenberg/*.txt --output-dir=cleaned
To use Guten-gutter from any working directory, add the script directory in this repository to your PATH. For example, you might add this line to your
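A minimal sketch of the PATH approach, assuming a Bourne-style shell and a checkout at the hypothetical location $HOME/guten-gutter (both are assumptions, not something this README specifies). The variable name GUTEN_GUTTER_HOME is likewise made up for illustration. Lines like these would typically go in a shell startup file such as ~/.bashrc, though which file applies depends on your shell.

```shell
# Hypothetical checkout location; adjust to wherever this repository lives.
GUTEN_GUTTER_HOME="$HOME/guten-gutter"

# Append the repository's script directory to PATH so the shell can find
# the guten-gutter script from any working directory.
export PATH="$PATH:$GUTEN_GUTTER_HOME/script"
```

With that in place, and provided the script is executable, a command such as guten-gutter pg10662.txt > The_Night_Land.txt should work without the script/ prefix.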
An easy way to accomplish this is to dock Guten-gutter using shelf:
A small test script, test.sh, is included with this distribution.
Rewrite ProducedByProcessor as a StartSentinelProcessor (or otherwise have it ignore the end sentinel)
Make IllustrationProcessor handle multiple lines