csvget (also, jsonget)

Uses parselets and rwget to generate CSV files from websites.

What's new in 0.3.0

  1. Added a command-line option:

    --filter=RUBY_CODE    RUBY_CODE is eval'd in a context where @row is a FasterCSV::Row
                          (see the sketch after this list)

    # Example:
    ./bin/csvget --require=chronic --require=time --filter "@row['time']=Chronic.parse(@row['time']).iso8601" # ...

  2. Added CHANGELOG
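
A minimal sketch of how --filter behaves, assuming the chronic and fastercsv gems are installed. The row values are made up for illustration; this is not csvget's internal code:

  require 'rubygems'
  require 'fastercsv'
  require 'chronic'
  require 'time'

  # csvget eval's the --filter string with @row bound to the current row.
  filter = "@row['time'] = Chronic.parse(@row['time']).iso8601"

  # Hypothetical row shaped like one line of data/headlines.csv.
  @row = FasterCSV::Row.new(%w[title time], ['Example headline', '2 hours ago'])

  eval(filter)        # rewrites the relative time as an ISO 8601 timestamp
  puts @row['time']   # e.g. 2009-08-09T12:00:00-07:00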

Dependencies

Running on EC2

> git clone git://github.com/fizx/csvget-ec2-recipe.git
> cd csvget-ec2-recipe
> ./boot.rb
> ssh YOUR_INSTANCE

Local Installation

  1. Install the dependencies.

  2. > gem sources -a http://gems.github.com

  3. > sudo gem install fizx-csvget

Example Usage

> cat hn.let
{
  "headlines":[{
    "title": ".title a",
    "link": ".title a @href",
    "comments": "match(.subtext a:nth-child(3), '\\d+')",
    "user": ".subtext a:nth-child(2)",
    "score": "match(.subtext span, '\\d+')",
    "time": "match(.subtext, '\\d+\\s+\\w+\\s+ago')"
  }]
}
> csvget --directory-prefix=./data  -A "/x" -w 5 --parselet=hn.let http://news.ycombinator.com/
> head data/headlines.csv
comments,title,time,link,score,user
4,Simpson's paradox: why mistrust seemingly simple statistics,2 hours ago,http://en.wikipedia.org/wiki/Simpson%27s_paradox,41,waldrews
67,America's unjust sex laws,2 hours ago,http://www.economist.com/opinion/displaystory.cfm?story_id=14165460,59,MikeCapone
23,Buy somebody lunch,3 hours ago,http://www.whattofix.com/blog/archives/2009/08/buy-somebody-lu.php,58,DanielBMarkham
10,A design pattern is an artifact of a missing feature in your chosen language,3 hours ago,http://www.snell-pym.org.uk/archives/2008/12/29/design-patterns/,31,bensummers
4,API changes in Snow Leopard,1 hour ago,http://developer.apple.com/mac/library/releasenotes/MacOSX/WhatsNewInOSX/Articles/MacOSX10_6.html#//apple_ref/doc/uid/TP40008898-SW1,14,pieter
16,How to run a linux based home web server,3 hours ago,http://stevehanov.ca/blog/index.php?id=73,28,RiderOfGiraffes
1,"OpenCL ""Hello World""",1 hour ago,"http://developer.apple.com/mac/library/documentation/Performance/Conceptual/OpenCL_MacProgGuide/Example:Hello,World/Example:Hello,World.html",8,pieter
15,US Senate bill allows White House to disconnect private computers from Internet,4 hours ago,http://news.cnet.com/8301-13578_3-10320096-38.html,35,drewr
1,Strategy: Solve Only 80 Percent of the Problem,47 minutes ago,http://highscalability.com/strategy-solve-only-80-percent-problem,6,alrex021
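
The generated file is plain CSV with a header row, so it can be post-processed with FasterCSV. A quick illustration (not part of csvget) that assumes the data/headlines.csv produced above:

  require 'rubygems'
  require 'fastercsv'

  # Print the three highest-scoring headlines from the crawl above.
  rows = FasterCSV.read('data/headlines.csv', :headers => true)
  rows.sort_by { |row| -row['score'].to_i }.first(3).each do |row|
    puts "#{row['score']}  #{row['title']}  (#{row['link']})"
  end
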
> csvget -h
Usage: ./bin/csvget [options] SEED_URL [SEED_URL2 ...]
        --parselet=JSON_FILE         JSON_FILE is a parselet.
    -w, --wait=SECONDS               wait SECONDS between retrievals.
    -P, --directory-prefix=PREFIX    save files to PREFIX/...
    -U, --user-agent=AGENT           identify as AGENT instead of RWget/VERSION.
    -A, --accept-pattern=RUBY_REGEX  URLs must match RUBY_REGEX to be saved to the queue.
        --time-limit=AMOUNT          Crawler will stop after this AMOUNT of time has passed.
    -R, --reject-pattern=RUBY_REGEX  URLs must NOT match RUBY_REGEX to be saved to the queue.
        --require=RUBY_SCRIPT        Will execute 'require RUBY_SCRIPT'
        --limit-rate=RATE            limit download rate to RATE.
        --http-proxy=URL             Proxies via URL
        --proxy-user=USER            Sets proxy user to USER
        --proxy-password=PASSWORD    Sets proxy password to PASSWORD
        --fetch-class=RUBY_CLASS     Must implement fetch(uri, user_agent_string) #=> [final_redirected_url, file_object]
        --store-class=RUBY_CLASS     Must implement put(key_string, temp_file)
        --dupes-class=RUBY_CLASS     Must implement dupe?(uri)
        --queue-class=RUBY_CLASS     Must implement put(key_string, depth_int) and get() #=> [key_string, depth_int]
        --links-class=RUBY_CLASS     Must implement urls(base_uri, temp_file) #=> [uri, ...]
    -S, --sitemap=URL                URL of a sitemap to crawl (will ignore inter-page links)
    -V, --version
    -Q, --quota=NUMBER               set retrieval quota to NUMBER.
        --max-redirect=NUM           maximum redirections allowed per page.
    -H, --span-hosts                 go to foreign hosts when recursive
        --connect-timeout=SECS       set the connect timeout to SECS.
    -T, --timeout=SECS               set all timeout values to SECONDS.
    -l, --level=NUMBER               maximum recursion depth (inf or 0 for infinite).
        --[no-]timestampize          Prepend the timestamp of when the crawl started to the directory structure.
        --incremental-from=PREVIOUS  Build upon the indexing already saved in PREVIOUS.
        --protocol-directories       use protocol name in directories.
        --no-host-directories        don't create host directories.
    -v, --[no-]verbose               Run verbosely
    -h, --help                       Show this message
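
Several of the options above take the name of a Ruby class implementing a small interface (--fetch-class, --store-class, --dupes-class, --queue-class, --links-class). As a rough sketch (not taken from csvget's own code), a minimal --dupes-class could look like this; only the dupe?(uri) contract comes from the help text, while the class name and zero-argument constructor are assumptions:

  # memory_dupes.rb -- hypothetical in-memory dupes tracker.
  class MemoryDupes
    def initialize
      @seen = {}
    end

    # Return true if the URI was already seen; otherwise record it and
    # return false so the crawler will fetch the page.
    def dupe?(uri)
      key = uri.to_s
      seen = @seen.key?(key)
      @seen[key] = true
      seen
    end
  end

It would then be loaded and selected on the command line, for example (assuming the file sits in the current directory):

  > csvget --require=./memory_dupes --dupes-class=MemoryDupes --parselet=hn.let http://news.ycombinator.com/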

Copyright

Copyright © 2009 Kyle Maxwell. See LICENSE for details (MIT).
