branch: master
Commits on Jun 2, 2015
  1. Minor changes

Commits on Jun 1, 2015
  1. Save a MANIFEST file with a list of all filenames written,

    to make it easier for a downloader to fetch them.
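A minimal sketch of the MANIFEST idea described above, assuming a plain one-filename-per-line format. The function names `write_manifest` and `read_manifest` and the exact file layout are hypothetical; the commit message only says that a list of written filenames is saved.

```python
import os
import tempfile


def write_manifest(outdir, filenames, manifest_name="MANIFEST"):
    """Write one filename per line so a downloader can fetch each in turn.

    The one-per-line layout is an assumption, not the project's
    confirmed format.
    """
    path = os.path.join(outdir, manifest_name)
    with open(path, "w", encoding="utf-8") as fp:
        for name in filenames:
            fp.write(name + "\n")
    return path


def read_manifest(path):
    """Return the list of filenames recorded in a manifest file."""
    with open(path, encoding="utf-8") as fp:
        return [line.rstrip("\n") for line in fp if line.strip()]
```

A downloader could then call `read_manifest` and request each listed file instead of guessing which files exist.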
Commits on May 26, 2015
  1. Remove entire <head> section -- which means we'll

    miss meta tags, but the important thing is that we'll
    skip foreign style tags that mess up our rendering and
    cause a lot of unwanted data downloads. Try to notice
    any meta charset tags inside the old head and save them.
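One way the head removal described above could look, as a rough sketch. It assumes a regex-based approach and assumes the saved charset tags are put back inside a fresh, minimal `<head>`; the commit message does not say how the tags are stored, and `strip_head` is a hypothetical name.

```python
import re

# Match the whole <head>...</head> section (but not <header>).
HEAD_RE = re.compile(r"<head\b[^>]*>.*?</head\s*>", re.IGNORECASE | re.DOTALL)
# Match any <meta> tag that mentions a charset.
META_CHARSET_RE = re.compile(r"<meta\b[^>]*charset[^>]*>", re.IGNORECASE)


def strip_head(html):
    """Drop the <head> section, keeping only meta charset tags (assumption)."""
    match = HEAD_RE.search(html)
    if not match:
        return html
    metas = META_CHARSET_RE.findall(match.group(0))
    # Re-emit a minimal head only if there is a charset tag worth keeping.
    new_head = "<head>" + "".join(metas) + "</head>" if metas else ""
    return html[:match.start()] + new_head + html[match.end():]
```

Style tags, scripts and the title all disappear with the old head; only the encoding hint survives.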
Commits on May 21, 2015
  1. Add Los Alamos Daily Post

Commits on May 1, 2015
  1. Try to guard against errors from bad unicode characters in URLs.

    Those seem to be showing up on Longreads, in particular.
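A guard of this kind might percent-encode any non-ASCII characters before the URL is requested. The sketch below is an assumption about the technique, not the code from the commit; `quote_url` is a hypothetical helper, and it leaves the host name untouched.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url(url):
    """Percent-encode non-ASCII characters in the path, query and fragment.

    '%' is kept in the safe set so sequences that are already escaped
    are not escaped a second time.
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%:@!$&'()*+,;=~")
    query = quote(parts.query, safe="=&%+/:?@!$'()*,;~")
    fragment = quote(parts.fragment, safe="%/?:@!$&'()*+,;=~")
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))
```

Because `%` is treated as safe, calling the function twice on the same URL gives the same result.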
Commits on Mar 5, 2015
Commits on Feb 28, 2015
  1. Obey base href on pages that have it, for image rewriting.

    This may solve a lot of the images that weren't downloading.
    Guard against images that don't download; don't bomb out with
    errors, and rewrite the URL to an absolute one so the image
    will at least show if there's a live net connection.
  2. Strip RSS content before deciding it's blank.

    Los Alamos Daily Post has whitespace-only content.
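To illustrate the base href handling from the first commit above, here is a small sketch that resolves an image `src` against the page's `<base href>` when one is present, and against the page URL otherwise. The regex lookup and the name `absolute_image_url` are assumptions made for the example.

```python
import re
from urllib.parse import urljoin

# Find the href of a <base> tag (but not <basefont>).
BASE_RE = re.compile(
    r'<base\b[^>]*\bhref\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def absolute_image_url(page_url, html, img_src):
    """Resolve img_src to an absolute URL, obeying <base href> if present."""
    match = BASE_RE.search(html)
    # A base href may itself be relative to the page URL.
    base = urljoin(page_url, match.group(1)) if match else page_url
    return urljoin(base, img_src)
```

If the image later fails to download, the same absolute URL can be written into the `img` tag so the image still shows when a network connection is available.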
Commits on Feb 27, 2015
  1. Rewrite img tags in the indexstr to use locally fetched images.

    Include a note in the indexstr for pages with blank content.
Commits on Feb 26, 2015
  1. Add User-Agent to every urllib2.Request we make,

    including images and referred requests.
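A sketch of routing every request through one helper so the header is never forgotten. It uses Python 3's `urllib.request`, the successor to the `urllib2` module named in the commit; the `USER_AGENT` string and the `make_request` helper are hypothetical.

```python
import urllib.request

# Hypothetical agent string; the real value is not given in the commit log.
USER_AGENT = "ExampleFetcher/0.1"


def make_request(url, referer=None):
    """Build a Request that always carries the User-Agent header."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", USER_AGENT)
    if referer:
        # Image and follow-on requests can name the page that referred them.
        req.add_header("Referer", referer)
    return req
```

The returned object can be passed to `urllib.request.urlopen` for pages and images alike.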
Commits on Feb 25, 2015
  1. Set the User-Agent

Commits on Nov 18, 2014
Commits on Oct 31, 2014
  1. Set a default socket timeout of 100 seconds.

    This is the only way to set a feedparser timeout for the RSS URLs.
  2. Add new continue_on_timeout config parameter

    to control whether a timeout means we skip to the next site,
    or to the next story (true means stay on site, skip to next story).
  3. Add a timeout of 100 seconds on stories.

    If the timeout is exceeded, we'll note that in the log
    and skip the rest of the site, assuming that the site is broken.
    (Hmm, but this may be wrong with Xtraurls or sites like
    Longreads where stories come from different sources.)
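The three timeout commits above could fit together roughly as follows. This is a sketch under assumptions: `fetch_site`, `fetch_story` and `log` are hypothetical names, and only `continue_on_timeout` and the 100-second value come from the commit messages.

```python
import socket

# Process-wide default; libraries that open sockets without their own
# timeout (such as an RSS parser fetching a feed URL) inherit it.
socket.setdefaulttimeout(100)


def fetch_site(story_urls, fetch_story, continue_on_timeout=False, log=print):
    """Fetch each story; on a timeout, skip the story or the whole site."""
    fetched = []
    for url in story_urls:
        try:
            fetched.append(fetch_story(url))
        except socket.timeout:
            log("Timed out on %s" % url)
            if continue_on_timeout:
                continue  # stay on this site, move on to the next story
            break  # assume the site is broken, give up on the rest of it
    return fetched
```

Setting `continue_on_timeout` to true suits aggregator sites whose stories come from different servers, which is the case the commit message worries about.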
Commits on Sep 26, 2014
  1. Make shell=False explicit in subprocess calls.

    (It was already the default, but let's be sure.)
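For illustration, a subprocess call with `shell=False` spelled out. The command here is a stand-in (it just echoes its argument through the Python interpreter), not one of the project's actual calls.

```python
import subprocess
import sys

# With shell=False the argument list goes straight to the program, so
# shell metacharacters such as ';' are plain text, not command separators.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])",
     "two words; echo oops"],
    shell=False,
    capture_output=True,
    text=True,
    check=True,
)
```

Passing the command as a list rather than a single string is what makes `shell=False` safe with filenames and URLs that contain spaces or punctuation.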
Commits on Jul 15, 2014
Commits on Jul 14, 2014
  1. Eliminate href links in RSS that only span images we're removing,

    and links that only span spaces. (E.g. Slashdot RSS.)
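A possible shape for this cleanup, sketched with a regular expression. The pattern and the name `drop_empty_links` are assumptions; the commit does not say whether the project uses a regex or an HTML parser here.

```python
import re

# An <a> element whose content is only whitespace, &nbsp; or <img> tags.
EMPTY_LINK_RE = re.compile(
    r"<a\b[^>]*>(?:\s|&nbsp;|<img\b[^>]*>)*</a\s*>", re.IGNORECASE)


def drop_empty_links(html):
    """Remove links that would be empty once their images are stripped."""
    return EMPTY_LINK_RE.sub("", html)
```

Links that wrap real text are left alone, so only the clickable-nothing anchors vanish from the feed content.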
Commits on Jun 24, 2014
Commits on May 12, 2014
Commits on Mar 26, 2014
  1. When skipping to next site, write anything we've gathered so far on

    the current site (so we don't end up with no index)
Commits on Feb 11, 2014
Commits on Jan 20, 2014
  1. We weren't removing the ID from suburls in the IO error case.

    Also, it's not necessary to decrement pagenum in case of errors;
    it's calculated again each time from its position in suburls.
Commits on Jan 12, 2014
  1. Escape % in filenames with %25, so that filenames

    with %20 get requested correctly.
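The reason for the escape can be shown in a few lines. When a link to a local file named `my%20story.html` is followed, the `%20` is decoded to a space and the wrong file is requested; escaping the `%` as `%25` makes the decoded name match the file on disk. `escape_percent` is a hypothetical helper name.

```python
from urllib.parse import unquote


def escape_percent(filename):
    """Escape literal '%' so URL decoding yields the original filename."""
    return filename.replace("%", "%25")


on_disk = "my%20story.html"
link = escape_percent(on_disk)
# Decoding the escaped link gives back the literal name on disk.
decoded = unquote(link)
```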
Commits on Jan 11, 2014
  1. Fix links to bogus non-downloaded files.

    When we get a NoContent error for a file, not only do we need
    to decrement itemnum and pagenum, but also remove the generated
    filename from suburls since we didn't actually store a file by
    that name.
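A rough sketch of the bookkeeping described above, assuming each story's filename is appended to `suburls` before the fetch is attempted. `NoContentError`, `fetch_stories`, `fetch` and `make_filename` are hypothetical stand-ins; only `suburls` is named in the commit log.

```python
class NoContentError(Exception):
    """Hypothetical stand-in for the project's NoContent error."""


def fetch_stories(urls, fetch, make_filename):
    """Return the filenames actually stored, dropping failed fetches."""
    suburls = []
    for url in urls:
        # The page number is derived from the current list length, so it
        # needs no separate adjustment after a failure.
        fname = make_filename(len(suburls))
        suburls.append(fname)
        try:
            fetch(url, fname)
        except NoContentError:
            # Nothing was stored under this name, so do not link to it.
            suburls.remove(fname)
    return suburls
```

Because the index is generated from `suburls`, removing the entry keeps the index from linking to a file that was never written.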
Commits on Dec 24, 2013
  1. index_skip_pat needs a default

Commits on Dec 23, 2013
  1. Add an index_skip_pat config option,

    for sites like Pro Publica that put illegal crap like <style>
    tags inline in the RSS page.
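How the option is applied is not spelled out in the commit log. One plausible reading, sketched below, treats `index_skip_pat` as a regular expression whose matches are deleted from the RSS content before it goes into the index, with `None` as the default meaning "skip nothing". The function name `apply_index_skip_pat` is hypothetical.

```python
import re


def apply_index_skip_pat(content, index_skip_pat=None):
    """Delete everything matching index_skip_pat from the index content.

    Assumption: the option is a regex and matches are removed. With the
    default of None the content passes through unchanged.
    """
    if not index_skip_pat:
        return content
    return re.sub(index_skip_pat, "", content,
                  flags=re.IGNORECASE | re.DOTALL)
```

A site that embeds style blocks in its feed could then set the pattern to something like `<style\b.*?</style>`.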
Commits on Dec 11, 2013
  1. Remove a commented-out print

  2. Remove some of the verbose comments on URLError

    and how to debug URLError conditions. I left the comments in
    intentionally the first time, so those suggestions will be available
    in the git history if anyone ever needs them. But they don't need
    to clog the running code forever.