Have you ever wanted to rip your favorite blog? Like, download all posts, convert them to PDFs, print them out, or whatever.
No?
Well, you have a chance to do that now!
This script intelligently transforms all HTML files into beautiful PDFs. Additionally, it can strip special characters from the final filenames! Yeah!
The process looks like this:
- the script searches for all HTML files in the current directory
- it identifies the website's core content and extracts it (extractor.py)
- inserts your own headers and footers (extractor.py)
- converts the result to beautiful PDF files using Prince XML
- finally, it copies all the PDFs to one directory
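The real extraction step uses the Beautiful Soup library (see extractor.py), but the idea can be sketched with nothing but the standard library. This illustration is written in modern Python 3 (the project itself targets Python ~2.7), and the `<div id="content">` marker, `HEADER`, and `FOOTER` are placeholders, not the script's actual selectors:

```python
from html.parser import HTMLParser

class CoreExtractor(HTMLParser):
    """Collects everything inside the first <div id="content"> element.
    The id is a placeholder; real blogs use different content markers."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # div-nesting depth inside the target element
        self.chunks = []    # raw HTML fragments of the core content

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += tag == "div"
            self.chunks.append(self.get_starttag_text())
        elif tag == "div" and dict(attrs).get("id") == "content":
            self.depth = 1  # entered the wrapper; don't keep the wrapper itself

    def handle_startendtag(self, tag, attrs):
        if self.depth:
            self.chunks.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            if tag == "div":
                self.depth -= 1
                if not self.depth:
                    return      # closed the wrapper; drop its closing tag
            self.chunks.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

HEADER = "<html><body><h1>My header</h1>"   # placeholder custom header
FOOTER = "<p>My footer</p></body></html>"   # placeholder custom footer

def extract_core(html: str) -> str:
    """Return just the core content, wrapped in our own header/footer."""
    parser = CoreExtractor()
    parser.feed(html)
    return HEADER + "".join(parser.chunks) + FOOTER
```

The output of `extract_core` is what would then be handed to Prince XML for the PDF step.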
You will need:
- downloaded HTML files (grabbed with the ScrapBook Firefox add-on, for example)
- Prince XML
- Python ~2.7
- the Beautiful Soup library
- Linux
To convert:
$ ripper-start.sh <HTMLs directory> <PDFs directory>
To clean filenames:
$ ripper-rename.sh
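The filename-cleaning step can be sketched as a tiny shell function. The exact character rules here are an assumption for illustration; the real ripper-rename.sh may keep or drop a different set:

```shell
#!/bin/sh
# Sketch of the special-character cleanup (assumed rules, not the
# script's actual ones): keep letters, digits, dots and dashes,
# squeeze every other run of characters into a single underscore.
clean_name() {
    printf '%s' "$1" | tr -cs 'A-Za-z0-9.-' '_'
}
```

Applied over a PDF directory, something like `mv "$f" "$(clean_name "$f")"` in a loop would give you the tidy filenames.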
- Karol Bonenberg
This project is licensed under the GNU GPL Version 3 - see the LICENSE.md file for details