Commits on Feb 26, 2012
  1. Some doc updates.

    committed Feb 26, 2012
  2. Ignore dmoz download

    committed Feb 26, 2012
  3. Added README for testdata dir

    committed Feb 26, 2012
  4. Add a global timeout to the detector

    5 seconds seems like a reasonable choice, even if it doesn't appear to
    be working like it says. I only need to mass-execute this once though,
    so it's probably good enough.
    committed Feb 26, 2012
  5. Some minor changes to detection.

    A test against 30 sites shows that I should detect about 2,000 forums
    given the list of 10k. As this is much more than the few hundred that
    I need, it's time to stop pissing about with this script and start
    work on the actual forum spider.
    committed Feb 26, 2012
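
The "global timeout" commit above notes that the 5 second limit doesn't seem to behave as advertised. One possible explanation (a guess, the log doesn't say) is that the timeout applies to each network operation rather than to the whole detection run. A minimal sketch of a wall-clock deadline around one detection call, assuming Python; `run_with_deadline` is a hypothetical helper, not something from this repository:

```python
import concurrent.futures
import time

# 5 seconds is the value mentioned in the commit message.
DETECT_TIMEOUT_SECONDS = 5.0


def run_with_deadline(func, *args, timeout=DETECT_TIMEOUT_SECONDS):
    """Return func(*args), or None if no result arrives within `timeout` seconds.

    `func` stands in for whatever the detector does for one site (hypothetical).
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # The worker thread cannot be killed; we only stop waiting for it.
        return None
    finally:
        # Don't block here on a worker that may still be running.
        executor.shutdown(wait=False)
```

This only bounds how long the caller waits. The abandoned thread keeps running until its own call returns, which is probably acceptable for a one-off mass run but would leak threads in a long-lived process.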
Commits on Feb 25, 2012
  1. More detection rules.

    Starting to get messy... A machine learning approach would have
    been a lot better than this mess. Perhaps I should have a go at
    that in the future. Not today though, it's HTML and xpath grind
    day for me.
    committed Feb 25, 2012
  2. Added some rarer forums.

    Keep on keepin' on.
    committed Feb 25, 2012
  3. Added some oddball forum detection rules, for scum who remove attribution

    It appears that webmasters in the SEO world just love to remove copyright
    notices and all forms of attribution; apparently they know best. Like two
    <HEAD>s being better than one (wut?) and JavaScript living up above the
    DTD or outside </HTML>. It's fucking anarchy out there.
    committed Feb 25, 2012
  4. Added a script to download and parse the open directory in search of forums

    It streams the content from ODP, so it doesn't need ~2GiB of disk space,
    but it must start again if it gets disconnected. I like disk space, and
    pipes. Especially the pipes.
    committed Feb 25, 2012
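
The streaming approach from the open directory commit above can be sketched roughly as follows: read the dump line by line from a pipe and keep only what looks like a forum, so nothing has to be stored on disk. This is an illustration, not the script from this repository. It assumes the dump lists each site as an `<ExternalPage about="...">` element, and the "URL contains `forum`" filter and the name `iter_forum_urls` are hypothetical:

```python
import re

# Assumed dump format: one <ExternalPage about="URL"> element per listed site.
EXTERNAL_PAGE = re.compile(r'<ExternalPage about="([^"]+)"')


def iter_forum_urls(lines, keyword="forum"):
    """Yield listed URLs containing `keyword`, one input line at a time.

    `lines` can be any iterable of text lines, such as sys.stdin when the
    decompressed dump is piped in, so memory use stays constant.
    """
    for line in lines:
        match = EXTERNAL_PAGE.search(line)
        if match and keyword in match.group(1).lower():
            yield match.group(1)
```

Fed from a pipe, the caller would loop over `iter_forum_urls(sys.stdin)` and print each URL. As the commit message says, the trade-off is that a dropped connection means starting over, since nothing is kept to resume from.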
Commits on Feb 24, 2012
  1. Wrote a README to start with.

    Now let's think about database design...
    committed Feb 24, 2012