Python MapReduce library written in Cython. Visit us in #hadoopy on freenode. See the link below for documentation and tutorials.

README

Brandyn White <bwhite@dappervision.com>
Andrew Miller <amiller@dappervision.com>

Source  https://github.com/bwhite/hadoopy/
Issues  https://github.com/bwhite/hadoopy/issues
Docs    http://bwhite.github.com/hadoopy/

IRC: #hadoopy @ freenode.net

Requirements
Python development headers (python-dev), build tools (build-essential)

Optional
Cython (>= 0.13); without it, the build falls back to the pregenerated .c files

Features
- Oozie support
- Automated job parallelization 'auto-oozie' available in the hadoopy_flow project (maintained out of branch)
- TypedBytes support (very fast)
- Local execution of unmodified MapReduce jobs with launch_local
- Read/write sequence files of TypedBytes directly to HDFS from Python (readtb, writetb)
- Works on OS X
- Allows printing to stdout and stderr in Hadoop tasks without causing problems (uses the 'pipe hopping' technique; output from both streams is available in the task's stderr)
- Critical path is written in Cython
- Works on clusters without any extra installation, Python, or any Python libraries (uses PyInstaller, which is included in this source tree)
- Simple HDFS access (readtb and ls) inside Python, even inside running jobs
- Unit test interface
- Reporting using status and counters (and print statements! no need to be scared of them in Hadoopy)
- Supports design patterns in the Lin/Dyer book (http://www.umiacs.umd.edu/~jimmylin/book.html)
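
As a rough illustration of the features above, a word count job can be written as two plain generator functions. This is a sketch under stated assumptions, not code from this repository: run_locally is a hypothetical pure-Python helper that stands in for the map/shuffle/reduce cycle so the example runs without Hadoop or Hadoopy installed. In a real job script the main block would normally hand mapper and reducer to hadoopy.run instead; see the Docs link for the actual API.

```python
"""Word count written as plain generator functions, in the style of a Hadoopy job."""
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    """Emit (word, 1) for every whitespace-separated word in the input line."""
    for word in value.split():
        yield word, 1


def reducer(key, values):
    """Sum the counts emitted for a single word."""
    yield key, sum(int(v) for v in values)


def run_locally(kvs):
    """Hypothetical stand-in for the map/shuffle/reduce cycle (not part of Hadoopy).

    It only exists so this sketch can be run without a cluster.
    """
    # Map phase: apply the mapper to every input key/value pair.
    mapped = [kv for k, v in kvs for kv in mapper(k, v)]
    # Shuffle phase: sort by key so equal keys become adjacent.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: call the reducer once per distinct key.
    out = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        out.extend(reducer(key, (v for _, v in group)))
    return out


if __name__ == "__main__":
    # Assumption: in a real job script this block would instead be
    #     import hadoopy
    #     hadoopy.run(mapper, reducer, doc=__doc__)
    print(run_locally([(0, "a b a"), (1, "b a")]))
```

Keeping mapper and reducer as ordinary functions means they can be exercised directly in unit tests before the job is submitted to a cluster.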

Limitations
- Hadoop local mode is currently unsupported due to a bug in Hadoop's handling of the distributed cache in this mode.  Use pseudo-distributed mode instead for now.  (https://github.com/bwhite/hadoopy/issues/40)

Used in
- A Case for Query by Image and Text Content: Searching Computer Help using Screenshots and Keywords (to appear in WWW'11)
- Web-Scale Computer Vision using MapReduce for Multimedia Data Mining (at KDD'10)
- Vitrieve: Visual Search engine
- Picarus: Hadoop computer vision toolbox

Ubuntu Install (others are similar)
sudo apt-get install python-dev build-essential
sudo python setup.py install