Basic Hadoop Python streaming

Hadoop streaming vs. Native Java execution

Hadoop provides a streaming API that supports any programming language able to read from standard input (stdin) and write to standard output (stdout). The API uses the standard Linux streams as the interface between Hadoop and your program: input data is passed via stdin to the map function, which processes it line by line and writes its results to stdout. The reduce function likewise reads from stdin, where Hadoop guarantees that the input is sorted by key, and writes its results to stdout.
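As a rough illustration of this contract, the sketch below implements a word-count mapper and reducer as functions over text streams and runs them locally, with Python's sorted() standing in for the sort Hadoop performs between the two phases. The function names and the tab-separated `word<TAB>count` line format are assumptions made for illustration (tab is the default key/value separator in Hadoop streaming); the repository's own mapper.py and reducer.py may differ.

```python
import io
import sys
from itertools import groupby


def mapper(stdin, stdout):
    """Emit one 'word<TAB>1' line for every word read from stdin."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word.lower()}\t1\n")


def reducer(stdin, stdout):
    """Sum the counts per word; relies on the input being sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in stdin)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        stdout.write(f"{word}\t{total}\n")


def run_locally(text):
    """Simulate map -> sort -> reduce in one process (no Hadoop involved)."""
    mapped = io.StringIO()
    mapper(io.StringIO(text), mapped)
    # Hadoop sorts the mapper output by key before the reduce phase;
    # sorted() plays that role here.
    shuffled = "".join(sorted(mapped.getvalue().splitlines(keepends=True)))
    reduced = io.StringIO()
    reducer(io.StringIO(shuffled), reduced)
    return reduced.getvalue()


if __name__ == "__main__":
    sys.stdout.write(run_locally("to be or not to be\n"))
```

In a real streaming job each function would live in its own script, start with an interpreter line such as `#!/usr/bin/env python`, and be called with sys.stdin and sys.stdout.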

Steps

1. Set up everything you need for your Hadoop cluster

Prerequisites: an account at MetaCentrum and access to Hadoop - https://www.metacentrum.cz/en/hadoop/

Add this to ~/.ssh/config

# MetaCentrum
Host hador
    HostName hador.ics.muni.cz
    User <yourlogin>
    Port 22

Try to log in to the Hadoop cluster

$ ssh hador

Obtain a Kerberos ticket on your local machine and open the web interface

$ scp <yourlogin>@hador.ics.muni.cz:/etc/krb5.conf .
$ env KRB5_CONFIG=krb5.conf kinit <yourlogin>@META
$ chromium-browser --auth-server-whitelist="*.ics.muni.cz" &

Check https://hador-c1.ics.muni.cz:50470/

2. Download the Hadoop Streaming jar

$ wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-streaming/2.6.0/hadoop-streaming-2.6.0.jar

3. Implement the mapper and reducer. Make sure both files have execute permission.

$ chmod +x mapper.py reducer.py
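Because the job invokes mapper.py and reducer.py directly, each script needs both the execute bit and an interpreter line (shebang) as its first line. A small pre-flight check might look like the sketch below. The helper names are hypothetical, and setting the execute bits for user, group, and others only approximates `chmod +x`, which also takes the umask into account.

```python
import os
import stat


def make_executable(path):
    """Add execute permission for user, group and others (like `chmod a+x`)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


def has_shebang(path):
    """Return True if the first line names an interpreter, e.g. '#!/usr/bin/env python'."""
    with open(path, "rb") as handle:
        return handle.readline().startswith(b"#!")


def check_script(path):
    """Return a list of problems that would stop the script from being run directly."""
    problems = []
    if not has_shebang(path):
        problems.append("missing shebang line")
    if not os.access(path, os.X_OK):
        problems.append("not executable")
    return problems


if __name__ == "__main__":
    for name in ("mapper.py", "reducer.py"):
        if os.path.exists(name):
            print(name, check_script(name) or "ok")
```

An empty list from check_script means the script has a shebang and can be executed by the current user.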

4. Download the file you want to process and put it into Hadoop’s HDFS.

$ hdfs dfs -mkdir input
$ wget https://goo.gl/KyDfc7 -O shake.txt
$ hdfs dfs -D dfs.block.size=1048576 -put shake.txt /user/<yourlogin>/input/shake.txt
$ hdfs dfs -ls
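The `-D dfs.block.size=1048576` option stores the file in 1 MiB blocks (1048576 = 1024 * 1024 bytes). Presumably this is done so that even a small text file spans several blocks and the job gets more than one map task, since by default Hadoop creates roughly one input split per block. A quick sketch of the arithmetic follows; the file size used is made up, because the actual size of shake.txt is not stated here.

```python
import math

BLOCK_SIZE = 1048576  # bytes, the value passed via dfs.block.size above


def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of HDFS blocks needed to store a file of file_size bytes."""
    return math.ceil(file_size / block_size)


if __name__ == "__main__":
    hypothetical_size = 5_000_000  # bytes; illustrative only, not the real file size
    print(block_count(hypothetical_size))
```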

5. Run the MapReduce job

$ hadoop jar hadoop-streaming-2.6.0.jar \
    -D mapreduce.job.name='Word count python streaming' \
    -input input \
    -output output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py

Fetch the output and sort it by count

$ hdfs dfs -get output .
$ sort -k 2 -g -r output/part-00000 > sorted.txt
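The `sort -k 2 -g -r` call orders the lines by their second field (the count), numerically and in descending order. For readers who prefer to post-process in Python, a roughly equivalent sketch is shown below. It assumes each output line has the form `word<TAB>count`, which depends on what reducer.py actually writes, and the function name top_words is hypothetical.

```python
def top_words(lines, limit=None):
    """Return (word, count) pairs sorted by count, highest first."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, count = line.rsplit("\t", 1)
        pairs.append((word, int(count)))
    # The sort is stable, so words with equal counts keep their input order.
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:limit]


if __name__ == "__main__":
    sample = ["be\t2\n", "not\t1\n", "to\t2\n"]  # made-up reducer output
    for word, count in top_words(sample):
        print(f"{word}\t{count}")
```

To use it on real results, pass an open file handle for one of the downloaded part files instead of the sample list, optionally with a limit such as `limit=10`.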

Hador MapReduce Job history - https://hador-c2.ics.muni.cz:19890/jobhistory

Links

http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
https://www.princeton.edu/researchcomputing/computational-hardware/hadoop/mapred-tut/
Hadoop Streaming documentation - http://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html
MetaCentrum Hadoop documentation - https://wiki.metacentrum.cz/w/index.php?title=Hadoop_documentation&setlang=en
MetaCentrum Hadoop Web Accessibility - https://wiki.metacentrum.cz/wiki/Hadoop_documentation#Web_accessibility

danci5/hadoop-streaming-python

Folders and files

Latest commit

History

Repository files navigation

Basic hadoop python streaming

Hadoop streaming vs. Native Java execution

Steps

1. Set all you need for your hadoop cluster

2. Download Hadoop Streaming jar

3. Implement mapper and reducer. Make sure the files have execution permission.

4. Download file which you want to process and put it into Hadoop’s HDFS.

5. Run the MapReduce job

Links

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages