New versions of the following books have been released:
- e516 Lecture Notes Engineering Cloud Computing, Gregor von Laszewski [@las19e516]
- Cloud Computing, Gregor von Laszewski [@las19cloudcomputing]
Please read the Hadoop sections in Cloud Computing [@las19cloudcomputing].
Watch these online videos for Hadoop MapReduce:
- Map Reduce, Hadoop, and Spark (19:02) Hadoop A
- Hadoop (13:19) Hadoop B
- Hadoop (12:57) Hadoop C
- Hadoop (15:14) Hadoop D
Note: All MapReduce/Hadoop lab activities are optional.
The following exercises will guide you through installing Hadoop on a single node and then running a MapReduce job in Python. Please work out the required command lines; they are available in the book sections of Cloud Computing, Gregor von Laszewski, Ed. 2019 [@las19cloudcomputing]. Hedged example sketches for these steps follow the checklist below.
- Install prerequisite software
    - Install Java
    - Install SSH
    - Install Maven
- Configure programming environments
    - JAVA_HOME
    - core-site.xml
    - hdfs-site.xml
- SSH to localhost without a passphrase
- Start Hadoop HDFS
    - Format the HDFS filesystem
    - Start the NameNode and DataNode via the command line
    - Check NameNode status via the web interface: NameNode - http://localhost:9870/
- Start Hadoop YARN
    - Configure
        - mapred-site.xml
        - yarn-site.xml
    - Start YARN via the command line
    - Check YARN status via the web interface: ResourceManager - http://localhost:8088/
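The sketches below assume a Debian/Ubuntu single node and a Hadoop 3.x distribution unpacked into the current directory; package names and paths are assumptions, so adjust them for your platform. Installing the prerequisite software:

    # Debian/Ubuntu package names are assumed; use your platform's equivalents
    sudo apt-get update
    sudo apt-get install -y openjdk-8-jdk ssh maven

The JAVA_HOME value below is the usual OpenJDK 8 location on Ubuntu (an assumption) and should be set in etc/hadoop/hadoop-env.sh:

    # assumed OpenJDK 8 path on Ubuntu; verify on your system
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64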
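A minimal pseudo-distributed configuration, following the standard Hadoop single-node guide, sets the default filesystem in etc/hadoop/core-site.xml and the replication factor in etc/hadoop/hdfs-site.xml:

    <!-- etc/hadoop/core-site.xml -->
    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>

    <!-- etc/hadoop/hdfs-site.xml: replication of 1 is enough on a single node -->
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>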
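Passphraseless SSH to localhost, followed by formatting and starting HDFS; these commands follow the standard single-node guide:

    # generate a key without a passphrase and authorize it for localhost
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 0600 ~/.ssh/authorized_keys

    # format the filesystem once, then start NameNode and DataNode
    bin/hdfs namenode -format
    sbin/start-dfs.sh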
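For YARN, the standard single-node entries select YARN as the MapReduce framework and enable the shuffle service; depending on your Hadoop 3.x version, additional classpath properties may be required, so treat this as a sketch:

    <!-- etc/hadoop/mapred-site.xml -->
    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    </configuration>

    <!-- etc/hadoop/yarn-site.xml -->
    <configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
    </configuration>

YARN is then started with:

    sbin/start-yarn.sh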
After finishing all these steps, you are ready to move on to MapReduce programming.
See the map section of the Python book:
The basic syntax of the map function expects a function object and any number of iterables, such as lists or dictionaries. It executes the function object for each element in the sequence and, in Python 3, returns an iterator over the modified elements, which can be wrapped in list() to obtain a list.
    def multiply(x):
        return x * 2

    list(map(multiply, [2, 4, 6, 8]))
    # Output: [4, 8, 12, 16]
    #!/usr/bin/env python
    import sys

    # read input lines from standard input
    for line in sys.stdin:
        # remove leading/trailing whitespace
        line = line.strip()
        # split the line into words
        words = line.split()
        # emit each word with a count of 1, tab-separated
        for word in words:
            print('%s\t%s' % (word, 1))
This code is used as the Mapper.
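As a quick local sanity check (the file name mapper.py is an assumption), you can pipe a line of text through the Mapper; the output is tab-separated:

    echo "foo foo bar" | python mapper.py
    # foo   1
    # foo   1
    # bar   1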
    #!/usr/bin/env python
    import sys

    current_word = None
    current_count = 0
    word = None

    # Hadoop sorts the Mapper output by key, so all counts
    # for a given word arrive on consecutive lines
    for line in sys.stdin:
        line = line.strip()
        # parse the tab-separated key/value pair emitted by the Mapper
        word, count = line.split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            # skip lines where the count is not a number
            continue
        if current_word == word:
            current_count += count
        else:
            if current_word:
                # emit the total for the previous word
                print('%s\t%s' % (current_word, current_count))
            current_count = count
            current_word = word

    # emit the total for the last word
    if current_word == word:
        print('%s\t%s' % (current_word, current_count))
This code is used as the Reducer.
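Because Hadoop sorts the Mapper output by key before it reaches the Reducer, a local simulation of the whole job must insert a sort between the two scripts. A minimal sketch, assuming the scripts are saved as mapper.py and reducer.py and that input.txt is a local text file:

    cat input.txt | python mapper.py | sort -k1,1 | python reducer.py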
Please note that these code snippets are only meant to demonstrate the idea of a Mapper and a Reducer in Python. Debug and test them against your own input, or write your own version of MapReduce. The job is then submitted via Hadoop Streaming:
    bin/hadoop jar <path_to_hadoop_libs>/hadoop-*streaming*.jar \
        -file /<path_to_mapper>/mapper.py \
        -mapper /<path_to_mapper>/mapper.py \
        -file /<path_to_reducer>/reducer.py \
        -reducer /<path_to_reducer>/reducer.py \
        -input <input_file_path> \
        -output <output_file_path>
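Note that the input must already exist in HDFS before the job is submitted, and the output directory must not exist yet. A sketch of the staging steps, with placeholder paths of my own choosing:

    # copy local data into HDFS
    bin/hdfs dfs -mkdir -p input
    bin/hdfs dfs -put local_data.txt input
    # ... run the streaming job ...
    # inspect the result after the job finishes
    bin/hdfs dfs -cat output/part-*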
Hadoop’s Java configuration is driven by two types of important configuration files:
- Read-only default configuration - core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml.
- Site-specific configuration - etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml, etc/hadoop/yarn-site.xml and etc/hadoop/mapred-site.xml.
Additionally, you can control the Hadoop scripts found in the bin/ directory of the distribution, by setting site-specific values via the etc/hadoop/hadoop-env.sh and etc/hadoop/yarn-env.sh.
To configure the Hadoop cluster you will need to configure the environment in which the Hadoop daemons execute as well as the configuration parameters for the Hadoop daemons.
HDFS daemons are NameNode, SecondaryNameNode, and DataNode. YARN daemons are ResourceManager, NodeManager, and WebAppProxy. If MapReduce is to be used, then the MapReduce Job History Server will also be running. For large installations, these are generally running on separate hosts.
Individual Hadoop daemons can be configured via the following environment variables:
- NameNode: HDFS_NAMENODE_OPTS
- DataNode: HDFS_DATANODE_OPTS
- Secondary NameNode: HDFS_SECONDARYNAMENODE_OPTS
- ResourceManager: YARN_RESOURCEMANAGER_OPTS
- NodeManager: YARN_NODEMANAGER_OPTS
- WebAppProxy: YARN_PROXYSERVER_OPTS
- MapReduce Job History Server: MAPRED_HISTORYSERVER_OPTS
For example, to configure the NameNode to use parallel GC and a 4 GB Java heap, the following statement should be added to hadoop-env.sh:

    export HDFS_NAMENODE_OPTS="-XX:+UseParallelGC -Xmx4g"
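Analogously, a sketch for giving the ResourceManager a larger heap via etc/hadoop/yarn-env.sh (the 2 GB value here is an arbitrary assumption):

    export YARN_RESOURCEMANAGER_OPTS="-Xmx2g"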
To start a Hadoop cluster you will need to start both the HDFS and YARN clusters. The first time you bring up HDFS, it must be formatted. Format a new distributed filesystem as the hdfs user:
- Format the NameNode (on a dedicated node in the cluster)
- Start the NameNode (on a dedicated node in the cluster)
- Start a DataNode (on each node of the cluster)
- Start the ResourceManager (on a dedicated node in the cluster)
- Start a NodeManager (on each node of the cluster)
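A sketch of the corresponding commands for Hadoop 3.x, run as the appropriate user on each host (the cluster name is a placeholder):

    # on the NameNode host
    bin/hdfs namenode -format <cluster_name>
    bin/hdfs --daemon start namenode
    # on each DataNode host
    bin/hdfs --daemon start datanode
    # on the ResourceManager host
    bin/yarn --daemon start resourcemanager
    # on each NodeManager host
    bin/yarn --daemon start nodemanager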