Join one of the email groups:
The main difference between Crunch and Cascading/Pig/Hive is in their data models. In Pig and Cascading, most operations in a pipeline are performed on collections of Tuples (here are the Javadocs for the Pig Tuple and the Cascading Tuple). In my answer to this question on Quora, I refer to this as the "single serializable type" (SST) model. Using the SST data model makes it much easier to implement common operations, and Pig, and Cascading provide big libraries of built-in functions that are designed to operate on their respective Tuple types. They also provide APIs for developers to create their own user-defined functions that interact with their SST data models.
Crunch, like the FlumeJava library that is based on, uses a data model that has multiple serializable types (MST). At each stage in a Crunch pipeline, you specify how the data from that stage should be serialized. The benefit of doing this is that it lets you verify at compile-time that each stage of your pipeline will actually receive the type of data it knows how to process. In this sense, the MST model for building pipelines is similar to using a statically typed language, and the SST model is similar to using a dynamically typed language. We feel that the MST model has definite benefits for MapReduce developers:
Crunch's MST serialization model currently has two different implementations, one based on Writables and the other based on Avro records.
Last edited by brockn,