---
layout: page
title: Introduction to Hadoop
subtitle: Integrating Python Mapper and Reducer in Hadoop
minutes: 15
---
> ## Learning Objectives {.objectives}
>
> *   Run the combination of Python-based mapper and reducer on the Hadoop
>     infrastructure
> *   Customize reducer for questions that require global access to KEYS

With the mapper and reducer created and tested, the final step is to run this
combination on the Hadoop infrastructure.

In [3]:
!ssh dsciu001 yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -input /user/lngo/intro-to-hadoop/ml-10M100K/ratings.dat  \
    -output ratings \
    -file /home/lngo/intro-to-hadoop/mapper02.py \
    -mapper mapper02.py \
    -file /home/lngo/intro-to-hadoop/reducer01.py \
    -reducer reducer01.py \
    -file /home/lngo/intro-to-hadoop/movies.dat

16/07/25 15:31:35 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/home/lngo/intro-to-hadoop/mapper02.py, /home/lngo/intro-to-hadoop/reducer01.py, /home/lngo/intro-to-hadoop/movies.dat] [/usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar] /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/streamjob8030563500250963142.jar tmpDir=null
16/07/25 15:31:37 INFO impl.TimelineClientImpl: Timeline service address: http://dscim003.palmetto.clemson.edu:8188/ws/v1/timeline/
16/07/25 15:31:37 INFO impl.TimelineClientImpl: Timeline service address: http://dscim003.palmetto.clemson.edu:8188/ws/v1/timeline/
16/07/25 15:31:37 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 860 for lngo on ha-hdfs:dsci
16/07/25 15:31:37 INFO security.TokenCache: Got dt for hdfs://dsci; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:dsci, Ident: (HDFS_DELEGATION_TOKEN token 860 for lngo)
16/07/25 15:31:38 INFO mapred.FileInputF

The ratings directory contains an empty file, `_SUCCESS`, which serves as a
flag indicating that the job completed successfully, along with the output
files. The number of output files depends on how many reducers we use.

In [4]:
!ssh dsciu001 hdfs dfs -ls ratings 2>/dev/null

Found 2 items
-rw-r--r--   2 lngo hdfs          0 2016-07-25 15:34 ratings/_SUCCESS
-rw-r--r--   2 lngo hdfs     422298 2016-07-25 15:34 ratings/part-00000


We can use **cat** to view the content of the output file:

In [5]:
!ssh dsciu001 hdfs dfs -cat ratings/part-00000 2>/dev/null | head

"Great Performances" Cats (1998)	3.58333333333
'Round Midnight (1986)	3.72
'Til There Was You (1997)	2.83774834437
'burbs, The (1989)	2.96941489362
'night Mother (1986)	3.45023696682
*batteries not included (1987)	3.15314401623
...All the Marbles (a.k.a. The California Dolls) (1981)	2.21739130435
...And God Created Woman (Et Dieu... créa la femme) (1956)	3.08552631579
...And God Spoke (1993)	3.28260869565
...And Justice for All (1979)	3.65270935961


It is also possible to increase the number of reducers:

In [6]:
!ssh dsciu001 yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -D mapreduce.job.reduces=4 \
    -input /user/lngo/intro-to-hadoop/ml-10M100K/ratings.dat  \
    -output ratings4R \
    -file /home/lngo/intro-to-hadoop/mapper02.py \
    -mapper mapper02.py \
    -file /home/lngo/intro-to-hadoop/reducer01.py \
    -reducer reducer01.py \
    -file /home/lngo/intro-to-hadoop/movies.dat

16/07/25 15:36:57 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/home/lngo/intro-to-hadoop/mapper02.py, /home/lngo/intro-to-hadoop/reducer01.py, /home/lngo/intro-to-hadoop/movies.dat] [/usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar] /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/streamjob4268970964328268399.jar tmpDir=null
16/07/25 15:36:58 INFO impl.TimelineClientImpl: Timeline service address: http://dscim003.palmetto.clemson.edu:8188/ws/v1/timeline/
16/07/25 15:36:59 INFO impl.TimelineClientImpl: Timeline service address: http://dscim003.palmetto.clemson.edu:8188/ws/v1/timeline/
16/07/25 15:36:59 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 861 for lngo on ha-hdfs:dsci
16/07/25 15:36:59 INFO security.TokenCache: Got dt for hdfs://dsci; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:dsci, Ident: (HDFS_DELEGATION_TOKEN token 861 for lngo)
16/07/25 15:37:00 INFO mapred.FileInputF

In [7]:
!ssh dsciu001 hdfs dfs -ls ratings4R 2>/dev/null

Found 5 items
-rw-r--r--   2 lngo hdfs          0 2016-07-25 15:38 ratings4R/_SUCCESS
-rw-r--r--   2 lngo hdfs     106498 2016-07-25 15:38 ratings4R/part-00000
-rw-r--r--   2 lngo hdfs     103491 2016-07-25 15:38 ratings4R/part-00001
-rw-r--r--   2 lngo hdfs     104521 2016-07-25 15:38 ratings4R/part-00002
-rw-r--r--   2 lngo hdfs     107788 2016-07-25 15:38 ratings4R/part-00003
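
Each part file holds the output of one reducer. By default, Hadoop chooses the
reducer for a key by hashing the key modulo the number of reducers, so all
records with the same key reach the same reducer. The sketch below imitates
that assignment in plain Python. It assumes CRC32 as a stand-in for the Java
hash that Hadoop actually uses, and the function names and rating values are
made up for illustration, so the assignments on a real cluster will differ.

```python
import zlib
from collections import defaultdict


def assign_reducer(key, num_reducers):
    """Pick a reducer index for a key (CRC32 is only a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition(pairs, num_reducers):
    """Group (key, value) pairs by the reducer that would receive them."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[assign_reducer(key, num_reducers)].append((key, value))
    return dict(buckets)


# Hypothetical mapper output: real titles from this dataset, made-up ratings.
pairs = [
    ("'Round Midnight (1986)", 4.0),
    ("'burbs, The (1989)", 3.0),
    ("'Round Midnight (1986)", 3.5),
    ("...And God Spoke (1993)", 3.0),
]

# Show which records each of the four simulated reducers would receive.
for reducer_id, records in sorted(partition(pairs, 4).items()):
    print(reducer_id, records)
```

Whatever the hash function, both ratings of 'Round Midnight (1986) land in the
same bucket, which is what allows a reducer to compute a complete average for
each movie it receives.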


Aside from the performance implications, an important difference between using
one reducer and using many shows up when an operation requires a global view of
the data. Suppose the movie company wishes to identify the movie with the
highest average rating.

Create a file called **reducer03.py** with the following content:

~~~ {.output}
#!/usr/bin/env python
import sys

# Running totals for the movie currently being read
current_movie = None
current_rating_sum = 0
current_rating_count = 0

# Best movie seen so far by this reducer
max_movie = ""
max_average = 0

# Input arrives sorted by key, so all ratings of a movie are consecutive
for line in sys.stdin:
  line = line.strip()
  try:
    movie, rating = line.split("\t", 1)
    rating = float(rating)
  except ValueError:
    # Skip records that have no tab or a non-numeric rating
    continue

  if current_movie == movie:
    current_rating_sum += rating
    current_rating_count += 1
  else:
    # A new movie starts: finalize the previous one, if any
    if current_movie:
      rating_average = current_rating_sum / current_rating_count
      if rating_average > max_average:
        max_movie = current_movie
        max_average = rating_average
    current_movie = movie
    current_rating_sum = rating
    current_rating_count = 1

# Finalize the last movie. Testing current_movie (instead of comparing it
# with movie) avoids a NameError when the input is empty.
if current_movie:
  rating_average = current_rating_sum / current_rating_count
  if rating_average > max_average:
    max_movie = current_movie
    max_average = rating_average

print ("%s\t%s" % (max_movie, max_average))
~~~
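
The reducer depends on its input being sorted by key, so that all ratings of a
movie arrive consecutively. Before submitting a job, it can help to check the
same logic locally on a few lines. The following is a minimal sketch of that
idea rather than the script itself: it assumes every line is well formed, the
function name `highest_average` is hypothetical, and the ratings are made up.

```python
from itertools import groupby


def highest_average(sorted_lines):
    """Return (movie, average) with the highest average in key-sorted input."""
    pairs = (line.rstrip("\n").split("\t", 1)
             for line in sorted_lines if line.strip())
    best_movie, best_average = "", 0.0
    # groupby only merges adjacent equal keys, hence the need for sorted input.
    for movie, group in groupby(pairs, key=lambda pair: pair[0]):
        ratings = [float(rating) for _, rating in group]
        average = sum(ratings) / len(ratings)
        if average > best_average:
            best_movie, best_average = movie, average
    return best_movie, best_average


# Hypothetical mapper output; sorted() plays the role of Hadoop's shuffle.
sample = sorted([
    "'burbs, The (1989)\t3.0",
    "'Round Midnight (1986)\t4.0",
    "'burbs, The (1989)\t2.0",
    "'Round Midnight (1986)\t3.5",
])

print("%s\t%s" % highest_average(sample))
```

If the `sorted()` call is dropped, the ratings of a movie are no longer
adjacent and each fragment is averaged separately, which is the kind of error
that unsorted reducer input would cause.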

Rerun the Hadoop program using one and four reducers, respectively:

In [8]:
!ssh dsciu001 yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -D mapreduce.job.reduces=1 \
    -input /user/lngo/intro-to-hadoop/ml-10M100K/ratings.dat  \
    -output ratingsMax \
    -file /home/lngo/intro-to-hadoop/mapper02.py \
    -mapper mapper02.py \
    -file /home/lngo/intro-to-hadoop/reducer03.py \
    -reducer reducer03.py \
    -file /home/lngo/intro-to-hadoop/movies.dat

16/07/25 15:46:22 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/home/lngo/intro-to-hadoop/mapper02.py, /home/lngo/intro-to-hadoop/reducer03.py, /home/lngo/intro-to-hadoop/movies.dat] [/usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar] /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/streamjob4500088075814474419.jar tmpDir=null
16/07/25 15:46:23 INFO impl.TimelineClientImpl: Timeline service address: http://dscim003.palmetto.clemson.edu:8188/ws/v1/timeline/
16/07/25 15:46:24 INFO impl.TimelineClientImpl: Timeline service address: http://dscim003.palmetto.clemson.edu:8188/ws/v1/timeline/
16/07/25 15:46:24 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 862 for lngo on ha-hdfs:dsci
16/07/25 15:46:24 INFO security.TokenCache: Got dt for hdfs://dsci; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:dsci, Ident: (HDFS_DELEGATION_TOKEN token 862 for lngo)
16/07/25 15:46:25 INFO mapred.FileInputF

In [12]:
!ssh dsciu001 yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -D mapreduce.job.reduces=4 \
    -input /user/lngo/intro-to-hadoop/ml-10M100K/ratings.dat  \
    -output ratingsMax4R \
    -file /home/lngo/intro-to-hadoop/mapper02.py \
    -mapper mapper02.py \
    -file /home/lngo/intro-to-hadoop/reducer03.py \
    -reducer reducer03.py \
    -file /home/lngo/intro-to-hadoop/movies.dat

16/07/25 15:53:18 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/home/lngo/intro-to-hadoop/mapper02.py, /home/lngo/intro-to-hadoop/reducer03.py, /home/lngo/intro-to-hadoop/movies.dat] [/usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar] /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/streamjob3588979903194305897.jar tmpDir=null
16/07/25 15:53:19 INFO impl.TimelineClientImpl: Timeline service address: http://dscim003.palmetto.clemson.edu:8188/ws/v1/timeline/
16/07/25 15:53:20 INFO impl.TimelineClientImpl: Timeline service address: http://dscim003.palmetto.clemson.edu:8188/ws/v1/timeline/
16/07/25 15:53:20 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 864 for lngo on ha-hdfs:dsci
16/07/25 15:53:20 INFO security.TokenCache: Got dt for hdfs://dsci; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:dsci, Ident: (HDFS_DELEGATION_TOKEN token 864 for lngo)
16/07/25 15:53:20 INFO mapred.FileInputF

With one reducer, the job produces a single answer for the movie with the
highest average rating. With four reducers, each reducer sees only its own
subset of the movies and reports the best one in that subset, so we get four
candidate answers. The final answer is still easy to infer from them: it is the
candidate with the highest average among the four.

In [10]:
!ssh dsciu001 hdfs dfs -cat ratingsMax/part-00000 2>/dev/null

Blue Light, The (Das Blaue Licht) (1932)	5.0


In [13]:
!ssh dsciu001 hdfs dfs -cat ratingsMax4R/part-00000 2>/dev/null

Blue Light, The (Das Blaue Licht) (1932)	5.0


In [14]:
!ssh dsciu001 hdfs dfs -cat ratingsMax4R/part-00001 2>/dev/null

Satan's Tango (Sátántangó) (1994)	5.0


In [15]:
!ssh dsciu001 hdfs dfs -cat ratingsMax4R/part-00002 2>/dev/null

End of Summer, The (Kohayagawa-ke no aki) (1961)	4.5


In [16]:
!ssh dsciu001 hdfs dfs -cat ratingsMax4R/part-00003 2>/dev/null

Fighting Elegy (Kenka erejii) (1966)	5.0
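
Combining the per-reducer answers is itself a small reduce step. As a minimal
sketch, assuming the lines of all four part files have been collected into a
Python list (the function name `top_candidates` is hypothetical), the final
answer is the candidate with the largest average. Ties are kept, because three
of the four candidates above share an average of 5.0.

```python
def top_candidates(lines):
    """Return (movies, average) for the highest average among candidates."""
    best_movies, best_average = [], None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # The average is the last tab-separated field of each line.
        movie, average = line.rsplit("\t", 1)
        average = float(average)
        if best_average is None or average > best_average:
            best_movies, best_average = [movie], average
        elif average == best_average:
            best_movies.append(movie)
    return best_movies, best_average


# The four per-reducer results shown above, one line per part file.
candidates = [
    "Blue Light, The (Das Blaue Licht) (1932)\t5.0",
    "Satan's Tango (Sátántangó) (1994)\t5.0",
    "End of Summer, The (Kohayagawa-ke no aki) (1961)\t4.5",
    "Fighting Elegy (Kenka erejii) (1966)\t5.0",
]

movies, average = top_candidates(candidates)
print(average, movies)
```

The single-reducer run reports only one of the tied movies because
reducer03.py replaces its current maximum only when a strictly higher average
appears.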


## Check your understanding: Additional conditions on the reduce side {.challenge}
The previous results are counterintuitive, as these movies are not well known.
It is possible that our results are skewed by movies that have too few
ratings. Modify the reducer so that it only considers movies that have more
than one thousand ratings in total. Name this reducer reducer03.py. Run the
Hadoop MapReduce program again with mapper03.py and reducer03.py using one and
four reducers, respectively. Report the outcome.



## Check your understanding: User Study {.challenge}
User feedback plays an important role in marketing strategies. Implement a
Hadoop MapReduce program that identifies the user who has rated the most
movies over time. Identify the genre that this user rates most favorably.