Hadoop provides a streaming API that supports any programming language able to read from standard input (stdin) and write to standard output (stdout). The streaming API uses these standard Linux streams as the interface between Hadoop and your program: input data is passed on stdin to the map function, which processes it line by line and writes its results to stdout. The reduce function then reads the map output on stdin, which Hadoop guarantees to be sorted by key, and writes its results to stdout.
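The mapper.py and reducer.py scripts used in the steps below are not listed in this document. As a rough illustration of the stdin/stdout contract described above, here is a minimal word-count sketch. The function names map_lines, reduce_lines and simulate are hypothetical, and the tab-separated "word<TAB>count" record format is an assumption (it matches the default key/value separator of Hadoop streaming). The simulate helper only imitates the map, sort, reduce sequence locally; it does not involve Hadoop.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper logic: emit one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield "{}\t1".format(word)


def reduce_lines(sorted_lines):
    """Reducer logic: sum the counts of consecutive records sharing a key.

    This relies on the input being sorted by key, which Hadoop guarantees
    for the reducer's stdin.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield "{}\t{}".format(word, total)


def simulate(lines):
    """Local stand-in for the map -> sort -> reduce pipeline (no Hadoop)."""
    return list(reduce_lines(sorted(map_lines(lines))))
```

In an actual mapper.py, the equivalent would be to print every record from map_lines(sys.stdin); a reducer.py would do the same with reduce_lines(sys.stdin). Both scripts would also need a shebang line such as #!/usr/bin/env python so that Hadoop can execute them directly.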
Prerequisites: an account at MetaCentrum and access to Hadoop - https://www.metacentrum.cz/en/hadoop/
Add this to ~/.ssh/config on your local machine
# MetaCentrum
Host hador
HostName hador.ics.muni.cz
User <yourlogin>
Port 22
Try to log in to the Hadoop cluster
$ ssh hador
Obtain a Kerberos ticket on your local machine and start a browser for the web interface
$ scp <yourlogin>@hador.ics.muni.cz:/etc/krb5.conf .
$ env KRB5_CONFIG=krb5.conf kinit <yourlogin>@META
$ chromium-browser --auth-server-whitelist="*.ics.muni.cz" &
Check https://hador-c1.ics.muni.cz:50470/
Download the Hadoop streaming jar and make the mapper and reducer scripts executable
$ wget http://central.maven.org/maven2/org/apache/hadoop/hadoop-streaming/2.6.0/hadoop-streaming-2.6.0.jar
$ chmod +x mapper.py reducer.py
Create an input directory in HDFS and upload the sample text with a 1 MiB block size
$ hdfs dfs -mkdir input
$ wget https://goo.gl/KyDfc7 -O shake.txt
$ hdfs dfs -D dfs.block.size=1048576 -put shake.txt /user/<yourlogin>/input/shake.txt
$ hdfs dfs -ls
Run the word count as a streaming job
$ hadoop jar hadoop-streaming-2.6.0.jar \
-D mapreduce.job.name='Word count python streaming' \
-input input \
-output output \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py
Fetch the output and sort it by word count in descending order
$ hdfs dfs -get output .
$ sort -k 2 -g -r output/part-00000 > sorted.txt
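The same ranking can be sketched in Python, which may be handy when the result needs further processing. This assumes each line of the fetched part-00000 file has the form "word<TAB>count", which depends on what reducer.py actually writes. The function name rank_by_count is hypothetical. Unlike the sort command, ties are broken alphabetically here, so the order of words with equal counts may differ.

```python
def rank_by_count(lines):
    """Return (word, count) pairs sorted by count, highest first."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so the count is always the final field.
        word, count = line.rsplit("\t", 1)
        pairs.append((word, int(count)))
    # Descending by count; ties are ordered alphabetically by word.
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))
```

Passing an open file handle for the fetched part-00000 file to rank_by_count returns the full ranking as a list, and slicing it (for example the first ten entries) gives the most frequent words.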
Hador MapReduce Job history - https://hador-c2.ics.muni.cz:19890/jobhistory
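Writing an Hadoop MapReduce Program in Python (Michael Noll) - http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
Princeton Research Computing Hadoop MapReduce tutorial - https://www.princeton.edu/researchcomputing/computational-hardware/hadoop/mapred-tut/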
Hadoop Streaming documentation - http://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html
MetaCentrum Hadoop documentation - https://wiki.metacentrum.cz/w/index.php?title=Hadoop_documentation&setlang=en
MetaCentrum Hadoop Web Accessibility - https://wiki.metacentrum.cz/wiki/Hadoop_documentation#Web_accessibility