Ebook Corpus - A parser and extractor for electronic books

Ebook Corpus is a set of tools for parsing and extracting the text of ebooks in various formats, designed for the purpose of creating large multilingual ebook-based text corpora.

Many people have amassed enormous collections of ebooks, often containing millions of lines of text when taken as a whole, so it is always surprising to find that there aren't more tools and libraries available to work with ebooks as a corpus source. It seems that almost all the existing tools are focused on consuming (reading) ebooks, while the remaining few provide the functionality to create ebooks to be thus consumed.

As wonderful as ebooks are, they are often packaged in formats that are incredibly underspecified, or worse, that don't follow the specifications that do exist. A remarkable number of parsing libraries choke on very simple books even in presumably well-supported formats like EPUB3.

There are many ways for an ebook to defy the expectations of the parser -- perhaps it has been written in Unicode and the parser only handles US-ASCII, or the parser expects Unicode and it's written in KOI-8. Maybe the ebook contains an OPF file called content.opf in the root directory, or maybe it's in a separate CONTENT subfolder -- or called something completely different, like mytoc.opf or 目录.opf.
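The encoding mismatch described above can be softened with a fallback chain. The following is only a minimal sketch, not code from this project: the helper name `to_utf8` and the choice of KOI8-R as the default fallback are assumptions made for illustration. It accepts the bytes as UTF-8 when they are valid UTF-8, and otherwise reinterprets them in each fallback encoding in turn.

```ruby
# Hypothetical helper (not part of ebook.rb): return a UTF-8 copy of `bytes`.
# Assumption: anything that is not valid UTF-8 is in one of `fallbacks`.
def to_utf8(bytes, fallbacks: [Encoding::KOI8_R])
  utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  fallbacks.each do |enc|
    candidate = bytes.dup.force_encoding(enc)
    next unless candidate.valid_encoding?

    begin
      return candidate.encode(Encoding::UTF_8)
    rescue EncodingError
      next # this encoding cannot be converted cleanly; try the next one
    end
  end

  # Last resort: keep the valid UTF-8 parts and replace the rest.
  utf8.scrub
end
```

Single-byte encodings such as KOI8-R accept every byte sequence, so the first such fallback always "succeeds" and the order of the list is really a guess about the collection. For a multilingual corpus, a proper detection step would be more reliable than a fixed list.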

The Ebook Corpus tools won't solve all of these problems, but they nevertheless provide a number of options to make it easier to work with large, multilingual collections of ebooks as a raw text source.

Usage

Invoking the program on the command-line is straightforward:

./ebook.rb [options] [filename]

Here, [filename] is the path to the ebook file that you want to work with. If the file has a standard extension (*.epub, *.mobi, *.fb2), its format should be detected automatically.
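Extension-based detection of this kind can be sketched in a few lines. This is an illustration rather than the actual logic in ebook.rb: the `FORMATS` table and the `detect_format` name are hypothetical, and the extensions are taken from the "Supported formats" section.

```ruby
# Hypothetical mapping from file extension to format (not taken from ebook.rb).
FORMATS = {
  ".epub" => :epub,
  ".fb2"  => :fb2,
  ".mobi" => :mobi,
  ".prc"  => :mobi,
  ".azw"  => :mobi
}.freeze

# Return a format symbol for `path`, or nil if the extension is unknown.
# The comparison is case-insensitive, so "Book.EPUB" is treated as EPUB.
def detect_format(path)
  FORMATS[File.extname(path).downcase]
end
```

A caller could fall back to inspecting the file contents when `detect_format` returns nil, since an extension alone says nothing about what is actually inside the file.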

Options

-a or --all: Extract all contents of epub
-c or --cover: Extract cover image
-f or --flatten-dir: Save all files to the current folder rather than an individual directory
-h or --html: Extract raw html
-i or --images: Extract images to a separate folder
-m or --metadata: Print metadata
- -T or --title: Print title metadata only
- -A or --author: Print author metadata only
- -I or --isbn: Print ISBN metadata only
- -L or --language: Print language metadata only
- -P or --publisher: Print publisher metadata only
- -D or --description: Print description metadata only
-o or --output-dir DIR: Save output to specified directory
-s or --save: Save (text or html) to file instead of printing
-t or --text: Extract plain text
-T or --tests: Run test suite
-p or --pager: View text in pager
-v or --view: Open images in viewer

Supported formats

Format	File extension
EPUB	`.epub`
FictionBook	`.fb2`
Mobipocket	`.mobi`, `.prc`, `.azw`

Support for Mobipocket files is provided via a wrapper for the Python script mobiunpack.py by @kevinhendricks (released under the GPL3). If you know of a drop-in replacement library in Ruby for parsing MOBI files (or are interested in writing one), please let me know!

Note that only ebooks without DRM will work with this script.

Contributing

PRs, suggestions, examples of ebooks that don't parse properly, and other contributions are always welcome! Providing support for additional formats or opening issues for bugs are examples of ways to help.

MOBI support has only been tested against files with the .mobi extension. It should in theory also work for other extensions. If you have access to ebooks with a .prc or .azw file extension and can confirm this, that would be appreciated!

To do

The code is pretty ad hoc at the moment and generally in need of a cleanup. Different formats are handled separately but should probably be merged.

Other things:

Guess alternately-named content.opf files
Figure out cross-platform way of opening images in default viewer (current kludge is hard-coded to open image folder in Gwenview since xdg-open doesn't play nicely with cleaning up temporary files after viewing)
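For the first item, one possible approach is to avoid guessing altogether: the EPUB container format requires a META-INF/container.xml file whose rootfile element names the package (OPF) file in its full-path attribute. The sketch below assumes the text of container.xml has already been read out of the archive (not shown here), and the helper name `opf_path_from_container` is hypothetical. It uses REXML, which ships with Ruby as a bundled gem.

```ruby
require "rexml/document"

# Hypothetical helper (not part of ebook.rb): given the text of
# META-INF/container.xml, return the path of the first declared OPF file,
# or nil if no <rootfile> element is present.
def opf_path_from_container(container_xml)
  doc = REXML::Document.new(container_xml)
  # Compare local element names so the default container namespace
  # does not get in the way.
  rootfile = REXML::XPath.match(doc, "//*").find { |el| el.name == "rootfile" }
  rootfile && rootfile.attributes["full-path"]
end
```

Books that ship a missing or malformed container.xml would still need a fallback, such as scanning the archive for any file ending in .opf, and REXML raises a parse exception on malformed XML that the caller would have to rescue.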

License

MIT.
