A Clojure library for working with Pail within Cascalog.

pail-cascalog

Utilities for working with Pail in Cascalog.

Usage

Add pail-cascalog to your project's dependencies. If you're using Leiningen, your project.clj should look something like this:

(defproject ...
  :dependencies [[pail-cascalog VERSION]])

Here VERSION stands for the latest version published on Clojars, written as a string.

Creating a Tap from a PailStructure

To create a Cascalog tap from a PailStructure, first create a PailSpec from the PailStructure. clj-pail provides a function for this (clj-pail.core/spec). Once you have a PailSpec, pail-cascalog can turn it into a tap:

(require '[clj-pail.core :as pail])
(require '[pail-cascalog.core :as pail-cascalog])

; can be any PailStructure
(def structure (com.backtype.hadoop.pail.DefaultPailStructure.))

(def tap (-> structure
           (pail/spec)
           (pail-cascalog/tap-options :field-name "object")
           (pail-cascalog/tap "path/to/data")))
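Once created, the tap can be used as a generator in a Cascalog query. The following is a minimal sketch rather than something taken from the library's documentation. It assumes cascalog.api is on the classpath and that the underlying Pail tap emits two fields per record (the record's location within the pail, followed by the object), which is why the first position is ignored with _. If your tap emits a single field, drop the _.

```clojure
(require '[cascalog.api :refer [?<- stdout]])
(require '[clj-pail.core :as pail])
(require '[pail-cascalog.core :as pail-cascalog])

;; same tap as in the example above
(def tap (-> (com.backtype.hadoop.pail.DefaultPailStructure.)
             (pail/spec)
             (pail-cascalog/tap-options :field-name "object")
             (pail-cascalog/tap "path/to/data")))

;; Assumption: the tap emits [location object]; the location is ignored.
;; Writes every object stored in the pail to standard output.
(?<- (stdout)
     [?object]
     (tap _ ?object))
```

With DefaultPailStructure the objects are raw byte arrays, so a real query would normally use a PailStructure that deserializes records into something more useful.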

The tap can be customized through the options passed to tap-options. With a vertically partitioned PailStructure, you can read a subset of the data by listing the partition paths in the :attributes option:

; read data only from "foo/bar" and "baz/qux" directories
(pail-cascalog/tap-options structure :attributes [["foo" "bar"] ["baz" "qux"]])
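The :attributes option slots into the same pipeline as the first example. The sketch below wraps that pipeline in a helper; partitioned-tap is a hypothetical name, not part of pail-cascalog, and the code assumes that tap-options receives a PailSpec, as it does in the first example.

```clojure
(require '[clj-pail.core :as pail])
(require '[pail-cascalog.core :as pail-cascalog])

;; Hypothetical helper: builds a tap that reads only the given partitions.
;; `structure` is expected to be a vertically partitioned PailStructure.
(defn partitioned-tap
  [structure path attributes]
  (-> structure
      (pail/spec)
      (pail-cascalog/tap-options :attributes attributes)
      (pail-cascalog/tap path)))

;; Example call, given some vertically partitioned `structure`:
;; (partitioned-tap structure "path/to/data" [["foo" "bar"] ["baz" "qux"]])
```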

Creating a Tap from an Existing Pail

Existing pails can be opened as Cascalog taps using pail-cascalog.core/pail->tap:

(require '[clj-pail.core :as pail])
(require '[pail-cascalog.core :as pail-cascalog])

; open an existing pail
(def pail (pail/pail "path/to/data"))

; convert it to a Cascalog tap
(def tap (pail-cascalog/pail->tap pail))

The pail->tap function accepts the same options as tap-options:

; give the tap a custom field name and read from two partitions: "foo" and "bar"
(def tap (pail-cascalog/pail->tap pail :field-name "object" :attributes [["foo"] ["bar"]]))

License

Copyright © 2013 David Cuddeback

Distributed under the MIT License.