Persistence layer for the ExtraLoop data extraction toolkit. This module is implemented as a wrapper around Ohm, an object-hash mapping library that makes it easy to store structured data in Redis. It includes a convenient command line tool for listing, filtering, and deleting harvested datasets, as well as exporting them to local files or remote data stores (currently Google Fusion Tables).
gem install extraloop-redis-storage
ExtraLoop's Redis storage module decorates ExtraLoop::ScraperBase and ExtraLoop::IterativeScraper instances with the set_storage method, a helper that lets you specify how the scraped data should be stored.
require "extraloop/redis-storage"

# Model describing a single scraped review.
class AmazonReview < ExtraLoop::Storage::Record
  attribute :title
  attribute :rank
  attribute :date

  # Reject records whose rank falls outside 0..5.
  def validate
    assert((0..5).include?(rank.to_i), "Rank not in range")
  end
end

scraper = AmazonReviewScraper.new("0262560992")
  .set_storage(AmazonReview, "Amazon reviews of 'The Little Schemer'")
  .run
At each scraper run, the ExtraLoop storage module internally instantiates a session (see ExtraLoop::Storage::ScrapingSession) and associates the extracted records with it. The `AmazonReview` records just created can now be accessed by calling the `#records` method on the scraper's session object.
reviews = scraper.session.records
The set_storage method accepts the following arguments:
- model: A Ruby constant or a symbol specifying the model to be used for storing the extracted data. If a symbol is passed, it is assumed that a model does not exist, and the storage module dynamically generates one by subclassing ExtraLoop::Storage::Record.
- session_title: A human-readable title for the extracted dataset (optional).
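To illustrate the symbol case, here is a minimal sketch of how a model class might be generated on the fly by subclassing a base class. It does not use the gem and is not the storage module's actual code: `StubRecord` and `model_for` are hypothetical names, with `StubRecord` standing in for ExtraLoop::Storage::Record so that the example runs without Redis.

```ruby
# Stand-in for ExtraLoop::Storage::Record (hypothetical, no Redis needed).
class StubRecord
  def self.attribute(name)
    attr_accessor name
  end
end

# Hypothetical helper: returns the model unchanged when given a class;
# otherwise looks up or defines a subclass of `base` named after the symbol.
# The symbol must be a valid constant name (i.e. start with a capital letter).
def model_for(model, base = StubRecord)
  return model if model.is_a?(Class)

  name = model.to_s
  return Object.const_get(name) if Object.const_defined?(name)

  Object.const_set(name, Class.new(base))
end

klass = model_for(:GoogleNewsStory)
puts klass.name        # GoogleNewsStory
puts klass.superclass  # StubRecord
```

Passing a constant skips the generation step entirely, which is why a hand-written model is the better choice when attributes need validation.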
Once installed, the gem also adds the extraloop executable to your system path: a command line interface to the datasets harvested through ExtraLoop. A list of datasets can be obtained by running:
extraloop datastore list
This will generate a table like the following one:
id | title                              | model           | records
--------------------------------------------------------------------
48 | 1330106699 GoogleNewsStory Dataset | GoogleNewsStory | 110
49 | 1330106948 AmazonReview Dataset    | AmazonReview    | 0
51 | 1330107087 GoogleNewsStory Dataset | GoogleNewsStory | 110
52 | 1330111630 AmazonReview Dataset    | AmazonReview    | 10
Datasets can be removed using the delete subcommand:
extraloop datastore delete [id]
where id is either a single scraping session id or a range of session ids (e.g. 48..52).
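As a rough illustration of the two accepted forms, the sketch below turns an id argument into a list of session ids. `parse_session_ids` is a hypothetical helper written for this example, not part of the extraloop command line tool, and the real tool may treat edge cases differently.

```ruby
# Hypothetical helper: turns "48" or "48..52" into an array of integer ids.
def parse_session_ids(arg)
  case arg
  when /\A(\d+)\.\.(\d+)\z/
    first, last = $1.to_i, $2.to_i
    (first..last).to_a
  when /\A\d+\z/
    [arg.to_i]
  else
    raise ArgumentError, "invalid session id or range: #{arg.inspect}"
  end
end

p parse_session_ids("48")      # [48]
p parse_session_ids("48..52")  # [48, 49, 50, 51, 52]
```

In this sketch a descending range such as 52..48 yields an empty list, because that is how Ruby ranges behave.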
From the Redis datastore, ExtraLoop datasets can be exported to disk as CSV, JSON, or YAML documents:
extraloop datastore export 51..52 -f csv
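For readers curious what such an export involves, the following sketch serializes a list of records to the three formats using only Ruby's standard libraries. It is an assumption-laden illustration rather than the tool's implementation: `serialize_records` is a hypothetical helper, and the sample records are made up and represented as plain hashes sharing the same keys.

```ruby
require "csv"
require "json"
require "yaml"

# Hypothetical helper: serializes an array of hashes (all sharing the
# same keys) as a CSV, JSON, or YAML string.
def serialize_records(records, format)
  case format.to_s
  when "csv"
    return "" if records.empty?

    CSV.generate do |csv|
      csv << records.first.keys                       # header row
      records.each { |record| csv << record.values }  # one row per record
    end
  when "json" then JSON.pretty_generate(records)
  when "yaml" then records.to_yaml
  else raise ArgumentError, "unsupported format: #{format}"
  end
end

# Made-up sample data for illustration only.
records = [
  { "title" => "A classic", "rank" => 5, "date" => "2012-02-24" },
  { "title" => "Good, but terse", "rank" => 4, "date" => "2012-02-25" }
]

puts serialize_records(records, :csv)
```

Note that CSV flattens every value to text, so the numeric rank is only preserved as a number in the JSON and YAML output.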
Similarly, stored datasets can be uploaded to a remote datastore:
extraloop datastore push 48..51 fusion_tables -c google_username:password
While Google's Fusion Tables is currently the only remote datastore implemented, support for pushing datasets to others (e.g. CouchDB, CartoDB, and CKAN Webstore) is planned.