Skip to content

chitika/cmr

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CMR

CMR is a perl framework built on top of nanomsg for distributing tasks across a clustered environment. Clients for performing parallel distributed grep, map, or map-reduce tasks have been created to show the capabilities of CMR.

Dependencies

And the following perl libraries

  • NanoMsg::Raw
  • JSON::XS
  • Date::Calc
  • Date::Manip
  • IO::Select
  • POSIX
  • List::Util
  • File::Basename
  • Cwd
  • Data::GUID
  • Getopt::Long

All of these dependencies can be resolved by installing the following debian packages

  • libnanomsg0*
  • libnanomsg-raw-perl*
  • libdata-guid-perl
  • libdate-calc-perl
  • libjson-xs-perl
  • libgetopt-long-descriptive-perl
  • libdate-manip-perl
  • libconfig-tiny-perl
  • liblog-log4perl-perl
  • libuuid-perl

* The Debian repositories currently provide libnanomsg0 and libnanomsg-raw-perl, both required by the cmr-lib Debian package provided. These nanomsg packages are only available in sid but are in the process of being added to testing and backported to Debian Wheezy. Rather than put your system on unstable the preferred method of acquiring these packages is by backporting them. Instructions on backporting debian packages can be found here - https://wiki.debian.org/SimpleBackportCreation

CMR requires a coherent view of a data warehouse from the perspective of all nodes. CMR mandates the use of a POSIX compliant networked or clustered file system such as NFS or Gluster. If being installed on a single sytem, only a POSIX compliant file system is required.

Installation

Package based (server components):

dpkg -i cmr-lib_0.0.1-1_all.deb cmr-server_0.0.1-1_all.deb

Package based (worker components):

dpkg -i cmr-lib_0.0.1-1_all.deb cmr-worker_0.0.1-1_all.deb cmr-utils_0.0.1-1_amd64.deb

Package based (client components):

dpkg -i cmr-lib_0.0.1-1_all.deb cmr-client_0.0.1-1_all.deb cmr-utils_0.0.1-1_amd64.deb

All components can be installed on the same system. The default configuration is near complete when all components are installed on the same system.

Manual (installs everything):

perl Makefile.PL
make
make install

Tested Installation

CMR has been developed on and has been tested with Debian Wheezy. All dependencies are available directly from Debian repositories. Gluster was chosen as the clustered file system and is the only one verified to work well with CMR, although, NFS should work too. Additionally, the network interconnecting all CMR nodes and all Gluster nodes during development of CMR was 40Gb/s Infiniband, known as QDR. As such, some utilities in use by CMR may be out of place on a different file system. Namely, the chunky c binary. It should not ca use any issues however.

In order to realize the benefits we have seen, a similar environment is recommended.

Setup & Configuration

See Configuration

Usage

See Examples

Components

cmr-server      Provisions cmr-worker instances with cmr client requests
cmr-worker      Handles cmr client requests
cmr-caster      Broadcasts events produced by cmr-components
cmr             Map-Reduce client
cmr-grep        Grep client

cmr-server usage

cmr-server [--config <config file>]
cmr-server default configuration file is /etc/cmr/config.ini.

cmr-worker usage

cmr-worker [--config <config file>]
cmr-worker default configuration file is /etc/cmr/config.ini.

cmr-caster usage

cmr-caster [--config <config file>]
cmr-caster default configuration file is /etc/cmr/config.ini.

cmr usage

cmr --input "<glob_pattern>" --mapper <mapper> [--reducer <reducer>] [--config <config file>]
cmr default configuration file is /etc/cmr/config.ini.
Glob patterns must be quoted, failure to do so will cause them to be expanded by the shell and be misinterpreted by the client
Reducer implementation needs to be idempotent, non-idempotent reducers may however be used as a final-reducer
additional optional arguments
    -v --verbose        verbose mode
    -f --final-reducer  reducer to use for final reduce
    -c --cache          cache results [don't cleanup job output when writing to stdout]
    -o --output         output to this location rather than the default output path
    -b --bundle         bundle file with job (places it in scratch space along with job data making it accessible to worker nodes)
    -F --force          force run (overwrite output path)
    --stdout            output on standard out
experimental arguments
    -j --join-reducer   reducer to use for join [requires bucket and aggregate parameters to be specified]
    -B --bucket         split job into buckets to parallelize final reduce [requires aggregates]
    -a --aggregates     number of aggregates in mapped data
    -F --force          force run (overwrite output path)
    -S --sort           sort

cmr-grep usage

cmr-grep --input "<glob_pattern>" --pattern "<grep_pattern>" [--config <config file>]
cmr-grep default configuration file is /etc/cmr/config.ini.
Glob patterns must be quoted, failure to do so will cause them to be expanded by the shell and be misinterpreted by the client
additional optional arguments
    -v --verbose        verbose output
    -o --output         output to this location rather than the default output path
    -f --flags          pass grep flags
    -c --cache          cache results [don't cleanup job output when writing to stdout]
    -F --force          force run (overwrite output path)
    --stdout            output on standard out