CleanScrape

About

This is a no-nonsense web scraping tool which uses pycurl to fetch public web page content, readability-lxml to strip it of ads, navigation bars, sidebars, and other irrelevant boilerplate, and wkhtmltopdf and pandoc to preserve the cleaned result in PDF and EPUB document formats.
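
Under the hood, the fetch-and-clean step amounts to something like the following minimal sketch, using pycurl and readability-lxml directly (the function and variable names here are illustrative, not the exact ones used in CleanScraper.py):

from io import BytesIO

import pycurl
from readability import Document

def fetch_and_clean(url, user_agent):
    """Fetch the raw html at url with pycurl, then reduce it to the readable article body."""
    buffer = BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.USERAGENT, user_agent)
    curl.setopt(pycurl.FOLLOWLOCATION, True)
    curl.setopt(pycurl.WRITEDATA, buffer)
    curl.perform()
    curl.close()
    html = buffer.getvalue().decode('utf-8', errors='replace')
    doc = Document(html)
    # return the cleaned title and article html, ready to hand off to pandoc/wkhtmltopdf
    return doc.short_title(), doc.summary()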

Motivation

I was getting tired of stale bookmarked links: a lot of useful blog articles disappear, and neither web.archive.org nor Google's cache is very helpful.

Additionally, too many otherwise-useful pages are cluttered with ads, sidebars, and other crap, so the focus is on preserving text, using the readability algorithm built into readability-lxml.

Installation

You need python, pip, wkhtmltopdf, and pandoc installed and running on your computer.
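
On a Debian or Ubuntu system, for example, the non-python dependencies can usually be installed with:

$ sudo apt-get install wkhtmltopdf pandoc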

Clone this repo to your computer and install the remaining python requirements from the requirements.txt file like so:

$ pip install -r requirements.txt

Edit the settings.py file as necessary, to match your computer's environment.

You can also create a local_settings.py file which will override anything in settings.py, without affecting the code checked in here.

For epub output, a default cover image and css file are provided in this repo, but you can supply your own by editing the settings.py file, or by overriding those definitions in a local_settings.py file.
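
For example, a minimal local_settings.py might look like this (UA is the user agent setting described under Troubleshooting below; for any other overrides, copy the variable names exactly as they appear in settings.py):

# local_settings.py -- personal overrides, kept out of the checked-in code
UA = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6"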

Usage

Run CleanScrape from a command line prompt, passing the url to fetch and clean, and the file name to use for the final output (both the pdf and epub files will have this filename, with '.pdf' and '.epub' extensions, respectively).

$ ./CleanScraper.py "http://chadfowler.com/blog/2014/01/26/the-magic-of-strace/" "strace"

The same, but from inside a python shell:

>>> from CleanScraper import scrape
>>> scrape("http://chadfowler.com/blog/2014/01/26/the-magic-of-strace/", "strace")                                          

If successful, the output looks like this, with the final results saved to /tmp/strace.epub and /tmp/strace.pdf:

/usr/bin/pandoc -f html -t epub --epub-metadata="/tmp/metadata.xml" -o /tmp/strace.epub --epub-cover-image="epub_cover.jpg" -s --smart --parse-raw /tmp/strace_epub.html 

/usr/local/bin/wkhtmltopdf --page-size Letter /tmp/strace.html /tmp/strace.pdf
Loading pages (1/6)
Counting pages (2/6)                                               
Resolving links (4/6)                                                       
Loading headers and footers (5/6)                                           
Printing pages (6/6)
Done                                                                      
                                                           

Cleaning with readability is optional; if you want to keep the retrieved html as-is, use the --noclean option:

$ ./CleanScraper.py "http://chadfowler.com/blog/2014/01/26/the-magic-of-strace/" "strace" --noclean

Or inside the python shell like this:

>>> from CleanScraper import scrape
>>> scrape("http://chadfowler.com/blog/2014/01/26/the-magic-of-strace/", "strace", clean_it=False)                                          

Troubleshooting

Some sites will resist being scraped: even though a given url is visible in a browser, it will not work here, and you get an error like this:

Sorry, could not read  http://someblog.com/probably/not/worth/saving/anyway/

When this occurs, there are two things to try:

  1. Change the user agent

    By default, it is the string below, which is a big tipoff that the page view is not coming from a human being:

    UA = "CleanScrape/1.0 +http://github.com/dpapathanasiou/CleanScrape"

    So create a local_settings.py file that redefines the UA variable to a common user agent string instead, e.g.:

    UA = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6"

    If that still doesn't work, try step 2, below.

  2. Save the page from the browser as HTML on your computer (on linux, /tmp is a good place for it)

    Then, prefix the file path with file:// (for example, /tmp/someblog.html becomes file:///tmp/someblog.html).

    That is a valid url pycurl can read and process.
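
    For example, if the page was saved as /tmp/someblog.html:

    $ ./CleanScraper.py "file:///tmp/someblog.html" "someblog"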
