Commits on Oct 12, 2011
  1. Adding an index to makes much, much faster.

    progscrape users probably want to dig the index out of the new database
    schema code in and apply it to their existing prog.db.
    committed Oct 12, 2011
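For an existing database, applying an index by hand could look roughly like the sketch below. The table name `posts`, the column name `thread`, and the generated index name are placeholders chosen for illustration; the real names are whatever the project's schema code defines.

```python
import sqlite3


def add_index(db_path, table="posts", column="thread"):
    """Create an index on table(column) unless it already exists.

    The table, column, and index names are hypothetical placeholders,
    not the project's actual schema.
    """
    index_name = "idx_%s_%s" % (table, column)
    conn = sqlite3.connect(db_path)
    try:
        # IF NOT EXISTS makes the call safe to repeat on the same database.
        conn.execute(
            'CREATE INDEX IF NOT EXISTS "%s" ON "%s" ("%s")'
            % (index_name, table, column)
        )
        conn.commit()
    finally:
        conn.close()
    return index_name
```

Running it twice against the same file is harmless, which suits a one-off migration of an existing prog.db.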
Commits on Nov 15, 2010
  1. Merge branch 'rewind'

    committed Nov 15, 2010
Commits on Nov 12, 2010
  1. Check to see if the db exists.

    committed Nov 12, 2010
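A check like this matters because `sqlite3.connect()` silently creates an empty database file when the path does not exist. A minimal sketch of such a guard, with a hypothetical function name:

```python
import os
import sqlite3


def open_existing_db(path):
    """Open an SQLite database, refusing to create a new empty one."""
    # Without this check, sqlite3.connect() would create the file.
    if not os.path.isfile(path):
        raise FileNotFoundError("database not found: %s" % path)
    return sqlite3.connect(path)
```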
Commits on Nov 9, 2010
  1. Moved to contrib/

    committed Nov 9, 2010
  2. Super Xarn powers.

    committed Nov 9, 2010
Commits on Nov 8, 2010
  1. A trailing slash is now always added to the URL to keep browsers from stomping on the seconds field.

    committed Nov 8, 2010
Commits on Nov 7, 2010
  1. More polish stuff.

    committed Nov 7, 2010
  2. Cleaned some stuff up.

    committed Nov 7, 2010
Commits on Nov 6, 2010
  1. Single thread added.

    FP performance increased.  The profiler says most of the time is spent in sqlite3 queries for the FP.
    I've also stuck the CSS and background image into the file.
    committed Nov 6, 2010
  2. Initial commit of

    Current bugs:
    1) Thread order on fp is wrong.  Can't figure this out atm.
    2) Viewing individual threads hasn't been implemented: you can only view the front page.
    3) Probably not very fast for fp of all of prog.  This is probably a problem of poorly chosen data structures.
    4) There should probably be a function that renders a post.
    committed Nov 6, 2010
Commits on Nov 2, 2010
Commits on Oct 23, 2010
  1. Switched back to urllib2.

    Too bad, but httplib is too fragile in the face of 4chan sysadmins. This change fixes all of the occasional errors loading pages, at the expense of keep-alive connections.
    It's also noticeably faster than it was before, but whether this is because of MVB's configuration changes or because of httplib, I don't know.
    Cairnarvon committed Oct 23, 2010
Commits on Oct 22, 2010
  1. Grammar.

    committed with Cairnarvon Oct 21, 2010
  2. Grammar.

    committed Oct 21, 2010
Commits on Oct 21, 2010
  1. Grammar.

    committed Oct 21, 2010
Commits on Oct 19, 2010
  1. If the DB name isn't specified, it's now derived from the board to be scraped.

    --board /prog/ will use prog.db, --board /sci/ will use sci.db, &c.
    Also, --progress-bar is now the default.
    Cairnarvon committed Oct 19, 2010
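The mapping from board to database name described above can be sketched as follows. The function name is hypothetical, and rejecting an empty board name is an assumption rather than documented behaviour.

```python
def db_name_for_board(board):
    """Derive a database filename from a board path such as '/prog/'."""
    name = board.strip("/")
    if not name:
        # Assumption: an empty board name is treated as an error.
        raise ValueError("empty board name: %r" % board)
    return name + ".db"
```

With this, `db_name_for_board("/prog/")` gives `prog.db` and `db_name_for_board("/sci/")` gives `sci.db`.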
Commits on Oct 17, 2010
  1. Just skip failed threads instead of aborting.

    This is a temporary work-around and a bit of a dirty hack. Though just skipping the thread probably makes more sense than just terminating the scraper thread anyway; errors during scraping haven't terminated the whole application since the move to multi-threading.
    I don't know how he managed it, but the recent world4ch configuration changes have really fucked /prog/scrape in the ass. On the other hand, there's no excuse for httplib to be this intolerant to shitty web servers either. Maybe switching back to urllib2 is the easiest fix.
    Cairnarvon committed Oct 17, 2010
  2. Now handles chunked transfer-encoding properly.

    When using Transfer-Encoding chunked, httplib's HTTPResponse can only be expected to contain headers, apparently, as HTTPConnection.getresponse() returns without waiting for the entire body. If you close the connection before doing a read(), which does wait until the last bit of the reply is received before returning, you will probably have only part or none of the body.
    World4ch recently switched to Transfer-Encoding: chunked. This is what was going wrong, and why /prog/scrape thought subject.txt was empty and the database never needed updating.
    I can see why the httplib developers could have considered this to be a reasonable way to do things, but given the lack of documentation, I still consider this behaviour to be a bug in httplib.
    Cairnarvon committed Oct 17, 2010
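The two commits above can be illustrated together. This is a sketch, not the project's code: it uses Python 3's `http.client` (the renamed `httplib`), and the function names `fetch_body` and `fetch_all` are hypothetical. The first function reads the body before closing the connection, which is the ordering the chunked-encoding fix depends on; the second skips anything that fails instead of aborting.

```python
import http.client


def fetch_body(host, path, port=80, timeout=30):
    """Return (status, body) for a GET, reading the body before closing."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        # getresponse() returns once the status line and headers are parsed.
        response = conn.getresponse()
        # read() blocks until the whole body (every chunk) has arrived.
        body = response.read()
        return response.status, body
    finally:
        # Closing only after read() avoids ending up with a partial body.
        conn.close()


def fetch_all(paths, fetch):
    """Call fetch(path) for each path, skipping the ones that fail."""
    results, skipped = {}, []
    for path in paths:
        try:
            results[path] = fetch(path)
        except (OSError, http.client.HTTPException) as exc:
            # Record the failure and carry on with the next path.
            skipped.append((path, exc))
    return results, skipped
```

Returning the skipped paths alongside the results leaves room to retry or report them later, which a bare `continue` would not.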
Commits on Oct 15, 2010
  1. Let's pay closer attention to HTTP status codes

    This change is specifically to give better feedback to banned users. As it was, /prog/scrape quietly fetched subject.txt, didn't notice it didn't actually get it (it got a blank body), and just reported that there were no new posts. Now, it will display a ``! Error: 302 Banned'' message (on world4ch, at least) and exit.
    Cairnarvon committed Oct 15, 2010
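A status check along these lines might look like the sketch below. The `FetchError` class and `check_response` function are hypothetical names; the message format follows the one quoted above, and treating an empty body as an error is an added assumption.

```python
class FetchError(Exception):
    """Raised when a fetch did not return a usable 200 response."""


def check_response(status, reason, body):
    """Return body if the response is usable, otherwise raise FetchError."""
    if status != 200:
        # Mirrors the message format quoted in the commit message.
        raise FetchError("! Error: %d %s" % (status, reason))
    if not body:
        # Assumption: a blank body on a 200 is also treated as a failure.
        raise FetchError("! Error: empty response body")
    return body
```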
Commits on Oct 3, 2010
  1. More tripcode parsing fixes.

    You'd expect this to be hard to fuck up.
    There's room for improvement in that it could explicitly work around the remainders of the newline bug, but fuck it.
    Cairnarvon committed Oct 3, 2010
  2. Fixed a bug in the tripcode parsing code.

    Base 64 is hard.
    Cairnarvon committed Oct 3, 2010
Commits on Oct 2, 2010
  1. Rewrote the subject.txt parsing section.

    Should be clearer and a bit more efficient now.
    Cairnarvon committed Oct 2, 2010
Commits on Oct 1, 2010
  1. Added --dry-run.

    Dry-run mode fetches subject.txt and calculates how many threads and posts would need to be fetched to bring the database up to date, but doesn't actually fetch them.
    Incidentally, the (approximate) number of posts that will be fetched is now displayed prior to scraping in regular mode as well.
    Cairnarvon committed Oct 1, 2010
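The dry-run calculation can be sketched as a comparison of post counts. This assumes, purely for illustration, that both subject.txt and the database have already been reduced to mappings from thread ID to post count; `plan_scrape` is a hypothetical name.

```python
def plan_scrape(remote_counts, local_counts):
    """Return (threads_to_fetch, posts_to_fetch) without fetching anything.

    remote_counts: thread id -> post count according to subject.txt
    local_counts:  thread id -> post count already in the database
    """
    missing = {}
    for thread_id, remote in remote_counts.items():
        # Threads absent from the database count as having zero posts.
        delta = remote - local_counts.get(thread_id, 0)
        if delta > 0:
            missing[thread_id] = delta
    return len(missing), sum(missing.values())
```

For example, with three remote threads of 10, 5, and 7 posts and a database holding 10 and 3 posts for the first two, the result is 2 threads and 9 posts.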
Commits on Sep 9, 2010
Commits on Sep 8, 2010
  1. *--threads auto* will now try to guess a sensible number of scraper threads to use.

    ``Sensible'' meaning a purely linear function that returns 1 for one thread and 32 for a thousand or more, which may or may not be adequate. I'm open to suggestions as far as that's concerned, but keep in mind.
    Cairnarvon committed Sep 8, 2010
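One linear function that satisfies the two endpoints given above (1 scraper thread for one thread to fetch, 32 for a thousand or more) is sketched here. The function name, the parameter names, and rounding down in between are assumptions; the commit does not say how intermediate values are rounded.

```python
def auto_threads(pending, max_workers=32, saturation=1000):
    """Guess a scraper thread count for `pending` threads to fetch."""
    if pending <= 1:
        return 1
    if pending >= saturation:
        return max_workers
    # Straight line through (1, 1) and (saturation, max_workers),
    # rounded down to a whole number of workers.
    return 1 + (max_workers - 1) * (pending - 1) // (saturation - 1)
```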
Commits on Sep 7, 2010
  1. Considerably fancier.

    Cairnarvon committed Sep 7, 2010
Commits on Sep 6, 2010
  1. One connection per thread.

    Where keep-alive connections aren't supported, this will make no difference. Where they are, it should be a bit nicer.
    Cairnarvon committed Sep 6, 2010
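One way to give each scraper thread its own connection is thread-local storage. The sketch below uses Python 3's `http.client` (the renamed `httplib`); `get_connection` is a hypothetical helper, not the project's code.

```python
import http.client
import threading

_local = threading.local()


def get_connection(host, timeout=30):
    """Return this thread's connection to host, creating it on first use."""
    conns = getattr(_local, "conns", None)
    if conns is None:
        # Each thread sees its own, initially missing, `conns` attribute.
        conns = _local.conns = {}
    if host not in conns:
        # No socket is opened here; that happens on the first request.
        conns[host] = http.client.HTTPConnection(host, timeout=timeout)
    return conns[host]
```

Reusing the same object within a thread lets keep-alive take effect where the server supports it, while never sharing a connection across threads.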
Commits on Sep 5, 2010
  1. Misc. and various.

    Don't let it be said that I'm overly cautious when I decide something isn't ready for the master branch.
    Cairnarvon committed Sep 5, 2010