CleanScrape

About

This is a no-nonsense web scraping tool which uses pycurl to fetch public web page content, readability-lxml to strip it of ads, navigation bars, sidebars, and other irrelevant boilerplate, and wkhtmltopdf and pandoc to preserve the cleaned result in PDF and EPUB document formats.
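
Under the hood, the fetch-and-clean step amounts to something like the following minimal sketch, using pycurl and readability-lxml directly (the function and variable names here are illustrative, not the exact ones used in CleanScraper.py):

from io import BytesIO

import pycurl
from readability import Document

def fetch_and_clean(url, user_agent):
    """Fetch the raw html at url with pycurl, then reduce it to the readable article body."""
    buffer = BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.USERAGENT, user_agent)
    curl.setopt(pycurl.FOLLOWLOCATION, True)
    curl.setopt(pycurl.WRITEDATA, buffer)
    curl.perform()
    curl.close()
    html = buffer.getvalue().decode('utf-8', errors='replace')
    doc = Document(html)
    # return the cleaned title and article html, ready to hand off to pandoc/wkhtmltopdf
    return doc.short_title(), doc.summary()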

Motivation

I was getting tired of stale bookmarked links: a lot of useful blog articles disappear, and neither web.archive.org nor Google's cache is very helpful.

Additionally, too many otherwise-useful pages are cluttered with ads, sidebars, and other crap, so the focus is on preserving text, using the readability algorithm built into readability-lxml.

Installation

You need python, pip, wkhtmltopdf, and pandoc installed and running on your computer.
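
On a Debian or Ubuntu system, for example, the non-python dependencies can usually be installed with:

$ sudo apt-get install wkhtmltopdf pandoc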

Clone this repo to your computer and install the remaining python requirements from the requirements.txt file like so:

$ pip install -r requirements.txt

Edit the settings.py file as necessary, to match your computer's environment.

You can also create a local_settings.py file which will override anything in settings.py, without affecting the code checked in here.

For epub output, a default cover image and css file are provided in this repo, but you can supply your own by editing the settings.py file, or by overriding those definitions in a local_settings.py file.
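
For example, a minimal local_settings.py might look like this (UA is the user agent setting described under Troubleshooting below; for any other overrides, copy the variable names exactly as they appear in settings.py):

# local_settings.py -- personal overrides, kept out of the checked-in code
UA = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6"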

Usage

Run CleanScrape from a command line prompt, passing the url to fetch and clean, and the file name to use for the final output (both the pdf and epub files will have this filename, with '.pdf' and '.epub' extensions, respectively).

$ ./CleanScraper.py "http://chadfowler.com/blog/2014/01/26/the-magic-of-strace/" "strace"

The same, but from inside a python shell:

>>> from CleanScraper import scrape
>>> scrape("http://chadfowler.com/blog/2014/01/26/the-magic-of-strace/", "strace")                                          

If successful, the output looks like this, with the final results saved to /tmp/strace.epub and /tmp/strace.pdf:

/usr/bin/pandoc -f html -t epub --epub-metadata="/tmp/metadata.xml" -o /tmp/strace.epub --epub-cover-image="epub_cover.jpg" -s --smart --parse-raw /tmp/strace_epub.html 

/usr/local/bin/wkhtmltopdf --page-size Letter /tmp/strace.html /tmp/strace.pdf
Loading pages (1/6)
Counting pages (2/6)                                               
Resolving links (4/6)                                                       
Loading headers and footers (5/6)                                           
Printing pages (6/6)
Done                                                                      
                                                           

Cleaning with readability is optional; if you want to keep the retrieved html as-is, use the --noclean option:

$ ./CleanScraper.py "http://chadfowler.com/blog/2014/01/26/the-magic-of-strace/" "strace" --noclean

Or inside the python shell like this:

>>> from CleanScraper import scrape
>>> scrape("http://chadfowler.com/blog/2014/01/26/the-magic-of-strace/", "strace", clean_it=False)                                          

Troubleshooting

Some sites will resist being scraped: even though a given url is visible in a browser, it will not work here, and you get an error like this:

Sorry, could not read  http://someblog.com/probably/not/worth/saving/anyway/

When this occurs, there are two things to try:

  1. Change the user agent

    By default, it is the string below, which is a big tipoff that the page view is not coming from a human being:

    UA = "CleanScrape/1.0 +http://github.com/dpapathanasiou/CleanScrape"

    So create a local_settings.py file that redefines the UA variable to a common user agent string instead, e.g.:

    UA = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6"

    If that still doesn't work, try step 2, below.

  2. Save the page from the browser as HTML on your computer (on linux, /tmp is a good place for it)

    Then, prefix the file path with file:// (for example, /tmp/someblog.html becomes file:///tmp/someblog.html).

    That is a valid url pycurl can read and process.
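
    For example, if the page was saved as /tmp/someblog.html:

    $ ./CleanScraper.py "file:///tmp/someblog.html" "someblog"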
