This repository has been archived by the owner on Jul 28, 2021. It is now read-only.

catseye/Guten-gutter

*** NOTE: Discontinued. *** This project is discontinued. The author now considers this to be a poor method. A superior method is to extract the text from HTML files instead. See the Anne of Green Garbles notes for more information.


Guten-gutter

Guten-gutter is a command-line filter that strips the boilerplate from Project Gutenberg text files. I had been using gutenizer for this purpose, but it has some shortcomings and failed to properly strip several Project Gutenberg texts, so I wrote this as a more robust replacement. It's also (like Project Gutenberg texts themselves) in the public domain.
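To illustrate the general idea of stripping boilerplate, here is a minimal Python sketch. It is not Guten-gutter's actual code. It assumes the file contains the common `*** START OF THIS PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THIS PROJECT GUTENBERG EBOOK ... ***` marker lines. The function name `strip_boilerplate` and both patterns are hypothetical, and real files vary enough in wording that a robust tool needs more than this.

```python
import re

# Assumed marker patterns; wording differs between Project Gutenberg files.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (THE|THIS) PROJECT GUTENBERG EBOOK", re.IGNORECASE
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (THE|THIS) PROJECT GUTENBERG EBOOK", re.IGNORECASE
)


def strip_boilerplate(text):
    """Return the text between the start and end marker lines.

    If a marker is missing, that side of the text is left untrimmed.
    """
    lines = text.splitlines()
    start = 0
    end = len(lines)
    for i, line in enumerate(lines):
        if START_RE.match(line):
            start = i + 1  # the book begins after the start marker
        elif END_RE.match(line):
            end = i  # the book ends before the end marker
            break
    # Drop blank lines left at either edge of the kept region.
    return "\n".join(lines[start:end]).strip("\n")
```

A sketch like this fails on files whose markers are worded differently or missing, which is the kind of case a more robust tool has to handle.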

Usage

If you want to get just the book's text out of a Project Gutenberg text file:

script/guten-gutter pg10662.txt > The_Night_Land.txt

If you want to do that to an entire collection of Project Gutenberg files:

mkdir cleaned
script/guten-gutter ../gutenberg/*.txt --output-dir=cleaned

To use Guten-gutter from any working directory, add the script directory in this repository to your PATH. For example, you might add this line to your .bashrc:

export PATH=/path/to/this/repo/script:$PATH

An easy way to accomplish this is to dock Guten-gutter using shelf:

shelf_dockgh catseye/Guten-gutter

Tests

A small test script, test.sh, is included with this distribution.

TODO

Rewrite ProducedByProcessor as a StartSentinelProcessor (or otherwise have it ignore the end sentinel)

Make IllustrationProcessor handle multiple lines
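As a rough illustration of the second item, the following Python sketch removes `[Illustration ...]` markers even when the caption spans several lines. It is only one possible approach, not the project's IllustrationProcessor. The function name `remove_illustrations` and the pattern are hypothetical, and the pattern assumes captions do not themselves contain a `]` character.

```python
import re

# Matches "[Illustration]" and "[Illustration: caption]". The negated
# character class also matches newlines, so captions may span lines.
ILLUSTRATION_RE = re.compile(r"\[Illustration[^\]]*\]")


def remove_illustrations(text):
    """Delete illustration markers, including multi-line ones."""
    return ILLUSTRATION_RE.sub("", text)
```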

About

Strips boilerplate from Project Gutenberg text files
