Scraper for downloading the entire ebook repository of Project Gutenberg

######################################################################

Project Gutenberg Offline

######################################################################

A scraper that downloads the whole repository of [Project Gutenberg](http://www.gutenberg.org), puts it into a locally browsable directory, and then packs it into a ZIM file (http://www.openzim.org), a clean and user-friendly format for storing content for offline use. It was created during a Kiwix Hackathon in Lyon, France, in July 2014.

Setting up the environment

It's recommended that you use virtualenv.

Install the dependencies

Linux

sudo apt-get install python-pip python-dev libxml2-dev libxslt-dev advancecomp jpegoptim pngquant p7zip-full gifsicle
sudo pip install virtualenvwrapper

Mac OS X

sudo easy_install pip
sudo pip install virtualenvwrapper
brew install advancecomp jpegoptim pngquant p7zip gifsicle

Finalize the setup

Finally, add this to your .bashrc:

source /usr/local/bin/virtualenvwrapper.sh

Set up the project

git clone git@github.com:kiwix/gutenberg.git
cd gutenberg
mkvirtualenv gut  # or any name you want

Working in the environment

  • Activate the environment: workon gut
  • Quit the environment: deactivate
  • Install the python dependencies: pip install -r requirements.pip
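
Before the first run, it can help to confirm that the external tools installed above are actually on your PATH. The following is a minimal sketch, not part of the project: the executable names in `ASSUMED_TOOLS` are assumptions about what the listed packages install (for example `advdef` for advancecomp and `7z` for p7zip) and may differ on your platform, and `missing_tools` is a hypothetical helper. The script's own `--check` option remains the authoritative dependency check.

```python
import shutil

# Assumed executable names for the packages listed above; adjust for your system.
ASSUMED_TOOLS = ["advdef", "jpegoptim", "pngquant", "7z", "gifsicle"]


def missing_tools(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(ASSUMED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All assumed tools were found.")
```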

Getting started

After setting up the whole environment, you can run the main script, dump-gutenberg.py.
It will download, process, and export the content.

./dump-gutenberg.py 

Arguments

You can also specify parameters to customize the content.
Want only the books with IDs 100 to 200? Only books in French, only in English, or in both languages? No problem!
You can also include or exclude book formats.

./dump-gutenberg.py -l en,fr -f pdf --books 100-200

This will download the English and French books with IDs 100 to 200 in the HTML (default) and PDF formats.
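
To make the `--books` syntax concrete, here is a small sketch of how a value mixing commas and dashes could be expanded into individual IDs. It is an illustration only: `expand_book_ids` is a hypothetical function, not the script's actual parser, and it assumes that intervals are inclusive at both ends.

```python
def expand_book_ids(spec):
    """Expand a string such as "1,5,100-200" into a sorted list of unique IDs.

    Assumption: a dash denotes an inclusive interval (100-200 includes 200).
    """
    ids = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError("interval start exceeds end: " + part)
            ids.update(range(start, end + 1))
        else:
            ids.add(int(part))
    return sorted(ids)
```

Under these assumptions, `expand_book_ids("100-200")` yields 101 IDs, and `expand_book_ids("1,3-5")` yields `[1, 3, 4, 5]`.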

You can find the full arguments list below.

-h --help                       Display this help message
-k --keep-db                    Do not wipe the DB during parse stage

-l --languages=<list>           Comma-separated list of lang codes to filter export to (preferably ISO 639-1, else ISO 639-3)
-f --formats=<list>             Comma-separated list of formats to filter export to (epub, html, pdf, all)

-m --mirror=<url>               Use URL as base for all downloads.
-r --rdf-folder=<folder>        Don't download rdf-files.tar.bz2 and use extracted folder instead
-e --static-folder=<folder>     Use-as/Write-to this folder static HTML
-z --zim-file=<file>            Write ZIM into this file path
-d --dl-folder=<folder>         Folder to use/write-to downloaded ebooks
-u --rdf-url=<url>              Alternative rdf-files.tar.bz2 URL
-b --books=<ids>                Execute the processes for specific books, separated by commas, or dashes for intervals

-x --zim-title=<title>          Custom title for the ZIM file
-q --zim-desc=<desc>            Custom description for the ZIM file

--check                         Check dependencies
--prepare                       Download & extract rdf-files.tar.bz2
--parse                         Parse all RDF files and fill-up the DB
--download                      Download ebooks based on filters
--export                        Export downloaded content to zim-friendly static HTML
--zim                           Create a ZIM file
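
The stage options at the end of the list suggest that the pipeline can be run one step at a time. The sketch below only builds the command lines and does not execute anything. It assumes that the stages are meant to run in the order listed and that the filter options can be combined with a single stage option; neither is confirmed by the help text. `build_command` and `STAGES` are hypothetical names introduced for illustration.

```python
# Stage flags taken from the help text above; the order is an assumption.
STAGES = ["check", "prepare", "parse", "download", "export", "zim"]


def build_command(stage, languages=None, formats=None, books=None,
                  script="./dump-gutenberg.py"):
    """Return an argv list that runs one stage with optional filters."""
    if stage not in STAGES:
        raise ValueError("unknown stage: " + stage)
    command = [script, "--" + stage]
    if languages:
        command += ["-l", ",".join(languages)]
    if formats:
        command += ["-f", ",".join(formats)]
    if books:
        command += ["--books", books]
    return command


if __name__ == "__main__":
    # Print one command per stage for the example used earlier.
    for stage in STAGES:
        print(" ".join(build_command(stage, ["en", "fr"], ["pdf"], "100-200")))
```

Each returned list can be passed to `subprocess.run` once you have verified that the combination behaves as expected on your installation.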

Screenshots