---
layout: page
title: Introduction to Hadoop
subtitle: Streaming MapReduce
minutes: 10
---
> ## Learning Objectives {.objectives}
>
> *   Understand how to run Hadoop MapReduce programs
> *   Create Hadoop MapReduce commands to run external programs as
>     mappers and reducers

All application-related interactions between users, the Hadoop MapReduce
framework, and HDFS go through YARN, Hadoop's default resource manager. The
most common interactions are submitting applications to YARN and querying
their status. The generic syntax is:

    yarn COMMAND --loglevel loglevel [generic_options] [command_options]

The `--loglevel` option accepts, from least to most verbose, FATAL, ERROR,
WARN, INFO, DEBUG, and TRACE. The default level is INFO.

generic_options | Description |
----------------|-------------|
`-archives <comma separated list of archives>` | Specify archives to be unarchived on the compute machines. Applies only to job |
`-conf <configuration file>` | Specify an application configuration file |
`-D <property>=<value>` | Use value for a given property |
`-files <comma separated list of files>` | Specify files to be copied to the cluster. Applies only to job |
`-jt <local> or <resourcemanager:port>` | Specify a ResourceManager. Applies only to job |
`-libjars <comma separated list of jars>` | Specify the jar files to include in the classpath. Applies only to job |

#### COMMAND: jar
Run a jar file as an application on Cypress. Usage:

    yarn jar <jar file> [mainClass] args ...

Typically, YARN applications are written in Java and bundled into jar files to be executed. However, Hadoop also supports non-Java applications via the Hadoop Streaming utility, which lets you use any executable or script as the mapper and/or the reducer of a YARN application. Usage:

    yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar [generic_options] [streaming_options]
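
A streaming mapper or reducer is simply a program that reads records from standard input and writes records to standard output, with Hadoop sorting the mapper output by key before it reaches the reducer. The sketch below approximates that flow locally with a plain pipe, with no Hadoop involved. The sample text and the `mapper` and `reducer` functions are illustrative assumptions, not part of the lesson's data or of the Hadoop Streaming utility.

```shell
# Local approximation of a streaming job: mapper | sort | reducer.
# Hadoop is not involved; this only illustrates the stdin/stdout contract.

# Hypothetical sample input (stands in for a file stored on HDFS).
sample=$(mktemp)
printf 'thou art\nthou wast\nI am\n' > "$sample"

# Mapper: emit one word per line; each line acts as a key.
mapper() { tr -s ' ' '\n'; }

# Reducer: count runs of identical keys, so it relies on sorted input.
reducer() { uniq -c | awk '{print $2 "\t" $1}'; }

# sort stands in for the shuffle-and-sort step between map and reduce.
result=$(mapper < "$sample" | LC_ALL=C sort | reducer)
printf '%s\n' "$result"

rm -f "$sample"
```

Checking a mapper and reducer this way on a small file is a cheap sanity test before submitting the same commands to the cluster, although a real job splits the input across several mappers and may use more than one reducer.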

streaming_options | Optional/Required | Description |
------------------|-------------------|-------------|
`-input directoryname or filename` | Required | Input location for mapper |
`-output directoryname` | Required | Output location for reducer |
`-mapper executable or JavaClassName` | Required | Mapper executable |
`-reducer executable or JavaClassName` | Required | Reducer executable |
`-file filename` | Optional |  Make the mapper, reducer, or combiner executable available locally on the compute nodes |
`-inputformat JavaClassName` | Optional | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default |
`-outputformat JavaClassName` | Optional | Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default |
`-partitioner JavaClassName` | Optional | Class that determines which reducer a key is sent to |
`-combiner executable or JavaClassName` | Optional | Combiner executable for map output |
`-cmdenv name=value` | Optional | Pass environment variable to streaming commands |
`-inputreader JavaClassName` | Optional | For backwards-compatibility: specifies a record reader class (instead of an input format class) |
`-verbose` | Optional | Verbose output |
`-lazyOutput` | Optional | Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to Context.write |
`-numReduceTasks` | Optional | Specify the number of reducers |
`-mapdebug` | Optional | Script to call when map task fails |
`-reducedebug` | Optional | Script to call when reduce task fails |

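To demonstrate Hadoop Streaming functionality, let's look at the problem of
counting how many instances of "thou" there are in Project Gutenberg's complete
works of Shakespeare. As a first step, the job below uses `cat` as the mapper
and `wc` as the reducer, so it counts all lines, words, and bytes rather than
only occurrences of "thou".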

    ssh dsciu001 yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
        -input intro-to-hadoop/gutenberg-shakespeare.txt \
        -output intro-to-hadoop/wc -mapper cat -reducer "wc"

    packageJobJar: [] [/usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar] /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/streamjob3414726412006351394.jar tmpDir=null
    16/07/25 13:01:43 INFO impl.TimelineClientImpl: Timeline service address: http://dscim003.palmetto.clemson.edu:8188/ws/v1/timeline/
    16/07/25 13:01:44 INFO impl.TimelineClientImpl: Timeline service address: http://dscim003.palmetto.clemson.edu:8188/ws/v1/timeline/
    16/07/25 13:01:44 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 857 for lngo on ha-hdfs:dsci
    16/07/25 13:01:44 INFO security.TokenCache: Got dt for hdfs://dsci; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:dsci, Ident: (HDFS_DELEGATION_TOKEN token 857 for lngo)
    16/07/25 13:01:44 INFO mapred.FileInputFormat: Total input paths to process : 1
    16/07/25 13:01:45 INFO mapreduce.JobSubmitter: number of splits:2
    16/07/25 13:01:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1469462752582_0008
    16/07/25 13:01:45 INFO mapred

    ssh dsciu001 hdfs dfs -ls intro-to-hadoop/wc

    Found 2 items
    -rw-r--r--   2 lngo hdfs          0 2016-07-25 13:02 intro-to-hadoop/wc/_SUCCESS
    -rw-r--r--   2 lngo hdfs         25 2016-07-25 13:02 intro-to-hadoop/wc/part-00000


    ssh dsciu001 hdfs dfs -cat intro-to-hadoop/wc/part-00000

     124213  899681 5571957


Two points in the example above deserve attention. First, the input and output
locations are relative paths. Without a leading forward slash (**/**), the path
is resolved from inside the user's home directory on HDFS, which is
**/user/user-name**. Second, the output directory must not exist before the
*yarn jar* call; otherwise, the command fails with an *output directory
already exists* error.
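
The example above counts every line, word, and byte, not just "thou". As a minimal sketch of how the original question could be approached, the pipe below uses `grep` as a hypothetical mapper that emits one line per whole-word, case-insensitive match and `wc -l` as the reducer. It runs locally on a made-up three-line sample rather than on the cluster, and the `mapper` and `reducer` function names are illustrative only.

```shell
# Local sketch: count whole-word, case-insensitive occurrences of "thou".
# The sample text is invented; it is not the Gutenberg file.
sample=$(mktemp)
printf 'Thou art kind.\nWhere art thou, and thou?\nThough I wait.\n' > "$sample"

# Mapper: print each match on its own line. "Though" is skipped because of -w.
# "|| true" keeps the exit status at zero when the input has no match.
mapper() { grep -o -i -w 'thou' || true; }

# Reducer: count the lines it receives; tr strips padding some wc versions add.
reducer() { wc -l | tr -d ' '; }

thou_count=$(mapper < "$sample" | sort | reducer)
echo "$thou_count"

rm -f "$sample"
```

On the cluster, the same two commands could in principle be passed through `-mapper` and `-reducer`, with a new output directory, since reusing `intro-to-hadoop/wc` would trigger the error described above. Two caveats are assumptions worth verifying on your own installation: a mapper that exits with a non-zero status (as `grep` does when a split contains no match) may cause the task to be marked as failed, and with more than one reducer each output file holds only a partial count.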