
Uses parselets and rwget to generate CSV files from websites.


csvget (also, jsonget)

What's new in 0.3.0

  1. Added command-line option:

    --filter=RUBY_CODE    RUBY_CODE will be eval'd in a context where @row is a FasterCSV::Row
    # Example:
    ./bin/csvget --require=chronic --require=time --filter "@row['time']=Chronic.parse(@row['time']).iso8601" # ...
  2. Added CHANGELOG
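
To make the --filter behaviour described above more concrete, here is a minimal sketch of evaluating a filter string with @row in scope. It is not csvget's actual implementation: the FilterContext class and its apply method are hypothetical names, the sample row is made up, and the standard library's CSV::Row (the successor of FasterCSV::Row since Ruby 1.9) and Time.parse stand in for FasterCSV and Chronic.

```ruby
require "csv"
require "time"

# Hypothetical stand-in for the object csvget evals --filter code in.
# The only detail taken from the notes above is that @row holds a CSV row.
class FilterContext
  def initialize(row)
    @row = row
  end

  # Evaluate the filter source with @row visible, then hand the row back.
  def apply(code)
    instance_eval(code)
    @row
  end
end

# Made-up sample row; CSV::Row stands in for FasterCSV::Row.
row = CSV::Row.new(%w[title time], ["Buy somebody lunch", "2009-08-28 12:00:00 UTC"])
filter = "@row['time'] = Time.parse(@row['time']).utc.iso8601"

filtered = FilterContext.new(row).apply(filter)
puts filtered["time"]
```

Because the filter string is eval'd as Ruby, it can rewrite any column in place, which is why the example above needs --require to load the libraries the filter calls.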


Running on EC2

> git clone git://
> cd csvget-ec2-recipe
> ./boot.rb

Local Installation

  1. Install the dependencies.

  2. > gem sources -a

  3. > sudo gem install fizx-csvget

Example Usage

> cat hn.let
    "title": ".title a",
    "link": ".title a @href",
    "comments": "match(.subtext a:nth-child(3), '\\d+')",
    "user": ".subtext a:nth-child(2)",
    "score": "match(.subtext span, '\\d+')",
    "time": "match(.subtext, '\\d+\\s+\\w+\\s+ago')"
> csvget --directory-prefix=./data  -A "/x" -w 5 --parselet=hn.let
> head data/headlines.csv
4,Simpson's paradox: why mistrust seemingly simple statistics,2 hours ago,,41,waldrews
67,America's unjust sex laws,2 hours ago,,59,MikeCapone
23,Buy somebody lunch,3 hours ago,,58,DanielBMarkham
10,A design pattern is an artifact of a missing feature in your chosen language,3 hours ago,,31,bensummers
4,API changes in Snow Leopard,1 hour ago,,14,pieter
16,How to run a linux based home web server,3 hours ago,,28,RiderOfGiraffes
1,"OpenCL ""Hello World""",1 hour ago,",World/Example:Hello,World.html",8,pieter
15,US Senate bill allows White House to disconnect private computers from Internet,4 hours ago,,35,drewr
1,Strategy: Solve Only 80 Percent of the Problem,47 minutes ago,,6,alrex021
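
Some rows in the sample above contain quoted fields with embedded commas and doubled quotes, so splitting on commas would miscount the columns. A minimal sketch of reading one such row, assuming Ruby's standard csv library rather than anything specific to csvget:

```ruby
require "csv"

# One line copied from the data/headlines.csv sample above.
line = '1,"OpenCL ""Hello World""",1 hour ago,",World/Example:Hello,World.html",8,pieter'

# parse_line honours the quoting, so commas and "" inside quotes stay in one field.
fields = CSV.parse_line(line)
fields.each_with_index { |field, index| puts "#{index}: #{field}" }
```

The row parses into six fields; the second one comes back with its inner quotes unescaped.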
> csvget -h
Usage: ./bin/csvget [options] SEED_URL [SEED_URL2 ...]
        --parselet=JSON_FILE         JSON_FILE is a parselet.
    -w, --wait=SECONDS               wait SECONDS between retrievals.
    -P, --directory-prefix=PREFIX    save files to PREFIX/...
    -U, --user-agent=AGENT           identify as AGENT instead of RWget/VERSION.
    -A, --accept-pattern=RUBY_REGEX  URLs must match RUBY_REGEX to be saved to the queue.
        --time-limit=AMOUNT          Crawler will stop after this AMOUNT of time has passed.
    -R, --reject-pattern=RUBY_REGEX  URLs must NOT match RUBY_REGEX to be saved to the queue.
        --require=RUBY_SCRIPT        Will execute 'require RUBY_SCRIPT'
        --limit-rate=RATE            limit download rate to RATE.
        --http-proxy=URL             Proxies via URL
        --proxy-user=USER            Sets proxy user to USER
        --proxy-password=PASSWORD    Sets proxy password to PASSWORD
        --fetch-class=RUBY_CLASS     Must implement fetch(uri, user_agent_string) #=> [final_redirected_url, file_object]
        --store-class=RUBY_CLASS     Must implement put(key_string, temp_file)
        --dupes-class=RUBY_CLASS     Must implement dupe?(uri)
        --queue-class=RUBY_CLASS     Must implement put(key_string, depth_int) and get() #=> [key_string, depth_int]
        --links-class=RUBY_CLASS     Must implement urls(base_uri, temp_file) #=> [uri, ...]
    -S, --sitemap=URL                URL of a sitemap to crawl (will ignore inter-page links)
    -V, --version
    -Q, --quota=NUMBER               set retrieval quota to NUMBER.
        --max-redirect=NUM           maximum redirections allowed per page.
    -H, --span-hosts                 go to foreign hosts when recursive
        --connect-timeout=SECS       set the connect timeout to SECS.
    -T, --timeout=SECS               set all timeout values to SECONDS.
    -l, --level=NUMBER               maximum recursion depth (inf or 0 for infinite).
        --[no-]timestampize          Prepend the timestamp of when the crawl started to the directory structure.
        --incremental-from=PREVIOUS  Build upon the indexing already saved in PREVIOUS.
        --protocol-directories       use protocol name in directories.
        --no-host-directories        don't create host directories.
    -v, --[no-]verbose               Run verbosely
    -h, --help                       Show this message
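
The --dupes-class and --queue-class options above only state which methods a class must implement. As a rough sketch of classes with that shape, assuming in-memory storage: MemoryDupes and MemoryQueue are hypothetical names, and how csvget constructs these objects, whether dupe? is expected to record the URI, and what get should return on an empty queue are all assumptions not confirmed by the help text. Such a file would presumably be loaded with --require.

```ruby
require "set"

# Hypothetical --dupes-class candidate: remembers every URI it is asked about.
class MemoryDupes
  def initialize
    @seen = Set.new
  end

  # True when the URI was seen before; otherwise records it and returns false.
  def dupe?(uri)
    !@seen.add?(uri.to_s)
  end
end

# Hypothetical --queue-class candidate: first-in, first-out [key, depth] pairs.
class MemoryQueue
  def initialize
    @items = []
  end

  def put(key_string, depth_int)
    @items.push([key_string, depth_int])
  end

  # Returns [key_string, depth_int]; nil when empty (an assumption, see above).
  def get
    @items.shift
  end
end

# Usage: queue each URL once, skipping the repeated one.
dupes = MemoryDupes.new
queue = MemoryQueue.new
%w[http://example.com/a http://example.com/b http://example.com/a].each do |url|
  queue.put(url, 0) unless dupes.dupe?(url)
end
```

Keeping both structures in memory is the simplest choice for a sketch; a persistent store would need the same two method signatures.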


Copyright © 2009 Kyle Maxwell. See LICENSE for details (MIT).

