Scalding MRCube

Scalding CUBE operators

Naive cubify and rollup methods on Scalding's RichPipe.

For each input tuple:

  • cubify generates 2^n tuples, where n is the number of fields we are cubing on
  • rollup generates n+1 tuples, where n is the number of fields we are rolling up on
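The two counts above can be checked with a small sketch in plain Scala collections. This is not the library's implementation: `CubeSketch`, `cubeExpand`, `rollupExpand`, and `NullMarker` are hypothetical names used only for illustration, and the left-to-right masking order in `rollupExpand` is an assumption taken from the sample rollup output in this README.

```scala
object CubeSketch {
  // Placeholder written into a field that has been aggregated away.
  val NullMarker = "null"

  // All 2^n variants of `key`: each field is either kept or replaced by the marker.
  def cubeExpand(key: Seq[String], marker: String = NullMarker): Seq[Seq[String]] =
    key.foldLeft(Seq(Seq.empty[String])) { (acc, field) =>
      acc.flatMap(prefix => Seq(prefix :+ field, prefix :+ marker))
    }

  // The n+1 rollup variants: the first k fields replaced by the marker, for k = 0..n.
  def rollupExpand(key: Seq[String], marker: String = NullMarker): Seq[Seq[String]] =
    (0 to key.length).map { k =>
      Seq.fill(k)(marker) ++ key.drop(k)
    }
}
```

For a three-field key such as ("ipod", "miami", "2012"), `CubeSketch.cubeExpand` yields 8 variants and `CubeSketch.rollupExpand` yields 4.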

Dev

$ git clone https://github.com/ashwanthkumar/mrcube.git
$ cd mrcube
$ sbt test

Dependencies

For Maven,

<dependency>
  <groupId>in.ashwanthkumar</groupId>
  <artifactId>mrcube_2.10</artifactId>
  <version>0.12.0</version>
</dependency>

For SBT,

libraryDependencies += "in.ashwanthkumar" %% "mrcube" % "0.12.0"

Cubify

If the input tuple is ("ipod", "miami", "2012", "200000"), the job emits one (product, location, year, size, sales) tuple for each of the 8 cube combinations:

("ipod", "miami", "2012", "1", "200000.0")
("ipod", "miami", "null", "1", "200000.0")
("ipod", "null", "null", "1", "200000.0")
("ipod", "null", "2012", "1", "200000.0")
("null", "null", "2012", "1", "200000.0")
("null", "miami", "null", "1", "200000.0")
("null", "miami", "2012", "1", "200000.0")
("null", "null", "null", "1", "200000.0")

Instead of "null" you can pass in another custom string to cubify.

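import com.twitter.scalding._
import in.ashwanthkumar.mrcube._

class CubifyJob(args: Args) extends Job(args) {
  // Read (product, location, year, sales) rows from the input CSV.
  Csv(args("input"), fields = ('product, 'location, 'year, 'sales))
    .read
    // Expand each row into its 2^3 = 8 cube variants.
    .cubify(('product, 'location, 'year))
    // Count the rows and total the sales for every cube key
    // (summed as Double, matching the sample output above).
    .groupBy('product, 'location, 'year) { _.size('size).sum[Double]('sales) }
    .write(Csv(args("output")))
}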
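To see why cubing before the groupBy yields every aggregate in one pass, here is a minimal sketch of the same flow using plain Scala collections instead of Scalding pipes. The names `CubeAggregationSketch`, `Sale`, `cubeKeys`, and `aggregate` are hypothetical, and the second input row in the usage note is made up for illustration.

```scala
object CubeAggregationSketch {
  final case class Sale(product: String, location: String, year: String, sales: Double)

  // Every cube variant of the (product, location, year) key: 2^3 = 8 keys per row.
  def cubeKeys(s: Sale, marker: String = "null"): Seq[(String, String, String)] =
    for {
      p <- Seq(s.product, marker)
      l <- Seq(s.location, marker)
      y <- Seq(s.year, marker)
    } yield (p, l, y)

  // Group the expanded rows by key and return key -> (row count, total sales).
  def aggregate(rows: Seq[Sale]): Map[(String, String, String), (Int, Double)] =
    rows
      .flatMap(s => cubeKeys(s).map(k => (k, s.sales)))
      .groupBy(_._1)
      .map { case (k, vs) => k -> ((vs.size, vs.map(_._2).sum)) }
}
```

With the rows ("ipod", "miami", "2012", 200000.0) and ("ipod", "boston", "2012", 100000.0), the key ("ipod", "null", "2012") collects both rows, while ("null", "miami", "null") collects only the first.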

Rollup

If the input tuple is ("ipod", "miami", "2012", "200000"), the job emits one (product, location, year, size, sales) tuple for each of the 4 rollup levels:

("ipod", "miami", "2012", "1", "200000.0")
("null", "miami", "2012", "1", "200000.0")
("null", "null", "2012", "1", "200000.0")
("null", "null", "null", "1", "200000.0")

Similarly, instead of "null" you can pass in another custom string to rollup.

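import com.twitter.scalding._
import in.ashwanthkumar.mrcube._

class RollupJob(args: Args) extends Job(args) {
  // Read (product, location, year, sales) rows from the input CSV.
  Csv(args("input"), fields = ('product, 'location, 'year, 'sales))
    .read
    // Expand each row into its 3 + 1 = 4 rollup variants.
    .rollup(('product, 'location, 'year))
    // Count the rows and total the sales for every rollup key
    // (summed as Double, matching the sample output above).
    .groupBy('product, 'location, 'year) { _.size('size).sum[Double]('sales) }
    .write(Csv(args("output")))
}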
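The rollup job can be pictured the same way with plain Scala collections. This is only a sketch under assumptions: `RollupAggregationSketch`, `rollupKeys`, and `totals` are hypothetical names, the masking order (leftmost field first) is inferred from the sample output above rather than from the library source, and the second row in the usage note is invented.

```scala
object RollupAggregationSketch {
  // Rollup variants of a key: the first k fields masked, for k = 0..n.
  def rollupKeys(key: List[String], marker: String = "null"): List[List[String]] =
    (0 to key.length).toList.map(k => List.fill(k)(marker) ++ key.drop(k))

  // rows are (key fields, sales); the result maps each rollup key to its total sales.
  def totals(rows: List[(List[String], Double)]): Map[List[String], Double] =
    rows
      .flatMap { case (key, sales) => rollupKeys(key).map(_ -> sales) }
      .groupBy(_._1)
      .map { case (k, vs) => k -> vs.map(_._2).sum }
}
```

Given the rows (("ipod", "miami", "2012"), 200000.0) and (("iphone", "miami", "2012"), 50000.0), the two rows stay separate at the most detailed level and merge at every level where the product is masked.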

References

  1. Distributed Cube Materialization on Holistic Measures by Dr. Arnab Nandi et al.
  2. CUBE Operator in Pig - PIG-2167

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0
