Source code for the book "Hadoop in Practice", published by Manning

Overview

This repo contains the code, scripts and data files that are referenced from the book Hadoop in Practice, published by Manning.

Issues

If you hit any compilation or execution problems, please create an issue and I'll look into it as soon as I can.

Hadoop Version

All the code has been exercised against CDH3u2, which for the purposes of the code is the same as Hadoop 0.20.x. There are a couple of places where I utilize features of Pig 0.9.1, which won't work with CDH3u1 since it ships with Pig 0.8.1.

I've recently run some basic MapReduce jobs against CDH4, and I also updated the examples so that they would run against Hadoop 2. Please let me know on the Manning forum or in a GitHub ticket if you encounter any issues.

Building and running

Download from github

git clone git://github.com/alexholmes/hadoop-book.git
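
If the git:// protocol is blocked on your network, cloning over HTTPS should also work (the HTTPS URL below is assumed from the repository path above):

git clone https://github.com/alexholmes/hadoop-book.git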

Build

cd hadoop-book
mvn package
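
If the unit tests are slow or fail in your environment and you only need the packaged jar, Maven's standard flag for skipping them is -DskipTests:

mvn package -DskipTests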

Runtime Dependencies

Many of the examples use Snappy and LZOP compression, so you may get runtime errors if those codecs aren't installed and configured on your cluster.

Snappy can be installed on CDH by following the instructions at https://ccp.cloudera.com/display/CDHDOC/Snappy+Installation.

To install LZOP follow the instructions at https://github.com/kevinweil/hadoop-lzo.
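
As a quick sanity check (assuming a CDH-style layout where the client configuration lives under /etc/hadoop/conf), you can confirm which codec classes are registered on the cluster:

# list the compression codecs configured for the cluster
grep -A 1 io.compression.codecs /etc/hadoop/conf/core-site.xml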

Run an example

# copy the input files into HDFS
hadoop fs -mkdir /tmp
hadoop fs -put test-data/ch1/* /tmp/

# replace the path below with the location of your Hadoop installation
# this isn't required if you are running CDH3
export HADOOP_HOME=/usr/local/hadoop

# run the map-reduce job
bin/run.sh com.manning.hip.ch1.InvertedIndexMapReduce /tmp/file1.txt /tmp/file2.txt output
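
Once the job completes, its results are written to the output directory passed as the last argument. Assuming the standard part-file naming used by MapReduce, they can be viewed with:

# view the job output stored in HDFS
hadoop fs -cat output/part-*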