# <center> Introduction to Hadoop MapReduce </center>

Jupyter notebooks support running Linux commands inside notebook cells. This is done by adding **!** to the beginning of the command line. Note that each command beginning with **!** runs in a new bash shell, which is closed once execution is done. As a result:
- Full paths are required
- Temporary results and environment variables will be lost
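
The effect of a fresh shell per command can be reproduced outside the notebook. The sketch below is only an illustration: it uses Python's `subprocess` module to stand in for two separate **!** lines, and the variable name `DEMO_GREETING` is a made-up example that is assumed not to be set in your environment.

```python
import subprocess

# Each call starts its own shell, much like a separate `!` line in a notebook cell.
subprocess.run("export DEMO_GREETING=hello", shell=True, check=True)

# The variable exported above is gone: this second shell never saw it.
second = subprocess.run(
    'echo "DEMO_GREETING=[$DEMO_GREETING]"',
    shell=True, check=True, capture_output=True, text=True,
)
print(second.stdout.strip())

# Chaining both commands inside one shell keeps the variable alive.
chained = subprocess.run(
    'export DEMO_GREETING=hello && echo "DEMO_GREETING=[$DEMO_GREETING]"',
    shell=True, check=True, capture_output=True, text=True,
)
print(chained.stdout.strip())
```

Chaining commands with `&&` on a single **!** line is one way to keep state between them.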

In [1]:
!module list

Currently Loaded Modulefiles:
  1) anaconda3/4.2.0   3) zeromq/4.1.5
  2) matlab/2015a      4) hdp/0.1


We need to initialize the Kerberos authentication mechanism:

In [2]:
!cypress-kinit

In [3]:
!klist

Ticket cache: FILE:/home/lngo/.krb5cc
Default principal: lngo@PALMETTO.CLEMSON.EDU

Valid starting       Expires              Service principal
10/04/2017 12:44:29  10/11/2017 12:44:29  krbtgt/PALMETTO.CLEMSON.EDU@PALMETTO.CLEMSON.EDU


Interaction with the Hadoop Distributed File System (HDFS) is done through `hdfs` and its sub-commands:

In [None]:
!hdfs

In [None]:
!hdfs dfs

### Challenge

Create a directory named **intro-to-hadoop** inside your user directory on HDFS

In [None]:
!hdfs dfs -ls /

In [None]:
!ls /

In [None]:
!hdfs dfs -ls /user/lngo

In [None]:
!hdfs dfs -mkdir intro-to-hadoop

### Challenge

Upload the **text** directory into the newly created **intro-to-hadoop** directory. 

In [None]:
!hdfs dfs -put

In [7]:
!hdfs dfs -put text intro-to-hadoop

### Challenge 

Check the health status of the directories above in HDFS using fsck:
```
hdfs fsck <path-to-directory> -files -blocks -locations
```

In [5]:
!hdfs fsck intro-to-hadoop/text/gutenberg-shakespeare.txt -files -blocks -locations

Connecting to namenode via http://dscim002.palmetto.clemson.edu:50070/fsck?ugi=lngo&files=1&blocks=1&locations=1&path=%2Fuser%2Flngo%2Fintro-to-hadoop%2Fgutenberg-shakespeare.txt
FSCK started by lngo (auth:KERBEROS_SSL) from /10.125.3.168 for path /user/lngo/intro-to-hadoop/gutenberg-shakespeare.txt at Wed Oct 04 12:53:58 EDT 2017
/user/lngo/intro-to-hadoop/gutenberg-shakespeare.txt 5447744 bytes, 1 block(s):  OK
0. BP-1143747467-10.125.40.142-1413584797204:blk_1108454815_34728420 len=5447744 repl=2 [DatanodeInfoWithStorage[10.125.8.217:1019,DS-5960dbe8-cb5f-40a6-834f-f2edbd236732,DISK], DatanodeInfoWithStorage[10.125.8.227:1019,DS-91f408e1-e851-4308-bb72-0be27c3b689c,DISK]]

Status: HEALTHY
 Total size:	5447744 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	1 (avg. block size 5447744 B)
 Minimally replicated blocks:	1 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication

## MapReduce Programming Paradigm

**What is “map”?**
– A function/procedure that is applied to every individual
element of a collection/list/array/…

```
int square(x) { return x*x;}
map square [1,2,3,4] -> [1,4,9,16]
```
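
As a runnable counterpart to the pseudocode, the same idea can be sketched with Python's built-in `map`. The names `square`, `numbers`, and `squares` are chosen here for illustration only.

```python
def square(x):
    """Return x multiplied by itself."""
    return x * x

numbers = [1, 2, 3, 4]

# map() is lazy in Python 3, so list() forces the evaluation.
squares = list(map(square, numbers))
print(squares)
```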

**What is “reduce”?**
– A function/procedure that performs an operation on a list.
This operation will “fold/reduce” this list into a single value
(or a smaller subset)

```
reduce ([1,2,3,4]) using sum -> 10
reduce ([1,2,3,4]) using multiply -> 24
```
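
The two folds above can be sketched in Python with `functools.reduce`, which applies a two-argument function cumulatively across the list. The variable names are illustrative.

```python
from functools import reduce
import operator

numbers = [1, 2, 3, 4]

# Fold the list into a single value, first with addition, then with multiplication.
total = reduce(operator.add, numbers)
product = reduce(operator.mul, numbers)
print(total, product)
```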

MapReduce is an old concept in functional programming. It is naturally applicable in HDFS:
- `map` tasks are performed on individual data blocks (mainly to filter and reduce the raw data content while increasing the data's value).
- `reduce` tasks are performed on the intermediate results from `map` tasks (which should now be significantly smaller) to calculate the final results.
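
To make the two phases concrete, here is a small in-memory sketch. It is not how Hadoop is implemented: the `blocks` list is a hypothetical stand-in for HDFS data blocks, and the `shuffle` function imitates the grouping by key that the framework performs between the map and reduce phases.

```python
from collections import defaultdict

# Hypothetical "blocks": in HDFS these would be chunks of one large file.
blocks = [
    "to be or not to be",
    "that is the question",
    "to be is to do",
]

def map_block(block):
    """Map step: emit a (word, 1) pair for every word in one block."""
    return [(word, 1) for word in block.split()]

def shuffle(pairs):
    """Group values by key, imitating what happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_group(key, values):
    """Reduce step: collapse all values of one key into a single total."""
    return key, sum(values)

# Run the map step on every block, then reduce each group of values.
intermediate = [pair for block in blocks for pair in map_block(block)]
counts = dict(reduce_group(k, v) for k, v in shuffle(intermediate).items())
print(counts)
```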

## 1. The Hello World of Hadoop: Word Count

In [1]:
!mkdir codes

In [8]:
!hdfs dfs -cat intro-to-hadoop/text/gutenberg-shakespeare.txt \
    2>/dev/null | head -n 100

1609

THE SONNETS

by William Shakespeare



                     1
  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But as the riper should by time decease,
  His tender heir might bear his memory:
  But thou contracted to thine own bright eyes,
  Feed'st thy light's flame with self-substantial fuel,
  Making a famine where abundance lies,
  Thy self thy foe, to thy sweet self too cruel:
  Thou that art now the world's fresh ornament,
  And only herald to the gaudy spring,
  Within thine own bud buriest thy content,
  And tender churl mak'st waste in niggarding:
    Pity the world, or else this glutton be,
    To eat the world's due, by the grave and thee.


                     2
  When forty winters shall besiege thy brow,
  And dig deep trenches in thy beauty's field,
  Thy youth's proud livery so gazed on now,
  Will be a tattered weed of small worth held:  
  Then being asked, where all thy beauty lies,
  

In [9]:
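%%writefile codes/wordcountMapper.py
#!/usr/bin/env python
import sys

# Emit one "word<TAB>1" pair for every space-separated token on standard input.
for oneLine in sys.stdin:
    oneLine = oneLine.strip()
    for word in oneLine.split(" "):
        if word != "":  # skip empty tokens produced by consecutive spaces
            print('%s\t%s' % (word, 1))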

Writing codes/wordcountMapper.py


In [15]:
!hdfs dfs -cat intro-to-hadoop/text/gutenberg-shakespeare.txt \
    2>/dev/null \
    | head -n 20 \
    | python ./codes/wordcountMapper.py

1609	1
THE	1
SONNETS	1
by	1
William	1
Shakespeare	1
1	1
From	1
fairest	1
creatures	1
we	1
desire	1
increase,	1
That	1
thereby	1
beauty's	1
rose	1
might	1
never	1
die,	1
But	1
as	1
the	1
riper	1
should	1
by	1
time	1
decease,	1
His	1
tender	1
heir	1
might	1
bear	1
his	1
memory:	1
But	1
thou	1
contracted	1
to	1
thine	1
own	1
bright	1
eyes,	1
Feed'st	1
thy	1
light's	1
flame	1
with	1
self-substantial	1
fuel,	1
Making	1
a	1
famine	1
where	1
abundance	1
lies,	1
Thy	1
self	1
thy	1
foe,	1
to	1
thy	1
sweet	1
self	1
too	1
cruel:	1
Thou	1
that	1
art	1
now	1
the	1
world's	1
fresh	1
ornament,	1
And	1
only	1
herald	1
to	1
the	1
gaudy	1
spring,	1
Within	1
thine	1
own	1
bud	1
buriest	1
thy	1
content,	1


In [16]:
!hdfs dfs -cat intro-to-hadoop/text/gutenberg-shakespeare.txt \
    2>/dev/null \
    | head -n 20 \
    | python ./codes/wordcountMapper.py \
    | sort

1	1
1609	1
a	1
abundance	1
And	1
art	1
as	1
bear	1
beauty's	1
bright	1
bud	1
buriest	1
But	1
But	1
by	1
by	1
content,	1
contracted	1
creatures	1
cruel:	1
decease,	1
desire	1
die,	1
eyes,	1
fairest	1
famine	1
Feed'st	1
flame	1
foe,	1
fresh	1
From	1
fuel,	1
gaudy	1
heir	1
herald	1
his	1
His	1
increase,	1
lies,	1
light's	1
Making	1
memory:	1
might	1
might	1
never	1
now	1
only	1
ornament,	1
own	1
own	1
riper	1
rose	1
self	1
self	1
self-substantial	1
Shakespeare	1
should	1
SONNETS	1
spring,	1
sweet	1
tender	1
that	1
That	1
the	1
the	1
the	1
THE	1
thereby	1
thine	1
thine	1
thou	1
Thou	1
thy	1
thy	1
thy	1
thy	1
Thy	1
time	1
to	1
to	1
to	1
too	1
we	1
where	1
William	1
with	1
Within	1
world's	1


In [12]:
%%writefile codes/wordcountReducer.py
#!/usr/bin/env python
import sys

current_word = None
total_word_count = 0

for line in sys.stdin:
    line = line.strip()
    word, count = line.split("\t", 1)
    try:
        count = int(count)
    except ValueError:
        continue
    
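    if current_word == word:
        total_word_count += count
    else:
        # Input is sorted by word, so a new word means the previous one is complete.
        if current_word is not None:
            print("%s\t%s" % (current_word, total_word_count))
        current_word = word
        total_word_count = count  # start from this pair's count, not a hard-coded 1

# Flush the last word; the None check also covers empty input.
if current_word is not None:
    print("%s\t%s" % (current_word, total_word_count))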

Writing codes/wordcountReducer.py
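
The same grouping logic can also be written with `itertools.groupby`, which relies on the same precondition as the reducer above: lines must arrive sorted by word. This is a sketch of an alternative, not the reducer used in this notebook. The function name `reduce_sorted_pairs` and the sample lines are hypothetical, and the code assumes every line is a well-formed `word<TAB>count` pair.

```python
from itertools import groupby

def reduce_sorted_pairs(lines):
    """Yield (word, total) from 'word<TAB>count' lines that are sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    # groupby only merges adjacent equal keys, hence the sorted-input requirement.
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)

# Hypothetical sample standing in for sorted mapper output.
sample = ["But\t1\n", "But\t1\n", "by\t1\n", "thy\t1\n", "thy\t1\n", "thy\t1\n"]
result = list(reduce_sorted_pairs(sample))
for word, total in result:
    print("%s\t%s" % (word, total))
```

In a streaming job the function would be fed `sys.stdin` instead of the sample list.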


In [14]:
!hdfs dfs -cat intro-to-hadoop/text/gutenberg-shakespeare.txt \
    2>/dev/null \
    | head -n 20 \
    | python ./codes/wordcountMapper.py \
    | sort \
    | python ./codes/wordcountReducer.py

1	1
1609	1
a	1
abundance	1
And	1
art	1
as	1
bear	1
beauty's	1
bright	1
bud	1
buriest	1
But	2
by	2
content,	1
contracted	1
creatures	1
cruel:	1
decease,	1
desire	1
die,	1
eyes,	1
fairest	1
famine	1
Feed'st	1
flame	1
foe,	1
fresh	1
From	1
fuel,	1
gaudy	1
heir	1
herald	1
his	1
His	1
increase,	1
lies,	1
light's	1
Making	1
memory:	1
might	2
never	1
now	1
only	1
ornament,	1
own	2
riper	1
rose	1
self	2
self-substantial	1
Shakespeare	1
should	1
SONNETS	1
spring,	1
sweet	1
tender	1
that	1
That	1
the	3
THE	1
thereby	1
thine	2
thou	1
Thou	1
thy	4
Thy	1
time	1
to	3
too	1
we	1
where	1
William	1
with	1
Within	1
world's	1


In [17]:
!hdfs dfs -rm -R intro-to-hadoop/output-wordcount
!yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -input intro-to-hadoop/text/gutenberg-shakespeare.txt \
    -output intro-to-hadoop/output-wordcount \
    -file ./codes/wordcountMapper.py \
    -mapper wordcountMapper.py \
    -file ./codes/wordcountReducer.py \
    -reducer wordcountReducer.py \

rm: `intro-to-hadoop/output-wordcount': No such file or directory
17/10/04 13:02:39 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./codes/wordcountMapper.py, ./codes/wordcountReducer.py] [/usr/hdp/2.6.0.3-8/hadoop-mapreduce/hadoop-streaming-2.7.3.2.6.0.3-8.jar] /hadoop_java_io_tmpdir/streamjob2536420095245052849.jar tmpDir=null
17/10/04 13:02:41 INFO client.AHSProxy: Connecting to Application History server at dscim003.palmetto.clemson.edu/10.125.8.215:10200
17/10/04 13:02:41 INFO client.AHSProxy: Connecting to Application History server at dscim003.palmetto.clemson.edu/10.125.8.215:10200
17/10/04 13:02:41 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 14303 for lngo on ha-hdfs:dsci
17/10/04 13:02:41 INFO security.TokenCache: Got dt for hdfs://dsci; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:dsci, Ident: (HDFS_DELEGATION_TOKEN token 14303 for lngo)
17/10/04 13:02:42 INFO lzo.GPLNativeCodeLoader: Loaded nativ

In [18]:
!hdfs dfs -ls intro-to-hadoop/output-wordcount

Found 2 items
-rw-r--r--   2 lngo hdfs          0 2017-10-04 13:03 intro-to-hadoop/output-wordcount/_SUCCESS
-rw-r--r--   2 lngo hdfs     713504 2017-10-04 13:03 intro-to-hadoop/output-wordcount/part-00000


In [19]:
!hdfs dfs -cat intro-to-hadoop/output-wordcount/part-00000 \
    2>/dev/null | head -n 100

"	241
"'Tis	1
"A	4
"Air,"	1
"Alas,	1
"Amen"	2
"Amen"?	1
"Amen,"	1
"And	1
"Aroint	1
"B	1
"Black	1
"Break	1
"Brutus"	1
"Brutus,	2
"C	1
"Caesar"?	1
"Caesar,	1
"Caesar."	2
"Certes,"	1
"Come	1
"Cursed	1
"D	1
"Darest	1
"Do	1
"E	1
"Fear	2
"Fly,	1
"Gentle	1
"Give	2
"Glamis	1
"God	2
"Good	1
"Havoc!"	1
"He	1
"Help	1
"Help,	2
"Here	1
"Hold,	2
"I	4
"Indeed!"	1
"King	1
"Liberty,	1
"Lo,	1
"Long	1
"Murther!"	2
"Neither	1
"Now	1
"O	2
"Peace,	1
"Shall	1
"Sing	2
"Sir,	1
"Sleep	2
"Speak,	1
"Sweet	1
"That	1
"The	1
"These	1
"They	2
"This	2
"Thus	2
"Tis	2
"Where	1
"Willow,	1
"You'll	1
"better"?	1
"hem,"	1
"never."	1
"not"	1
"then"	1
"thrusting"	1
"thy	1
"twas	1
"whore"	1
"whore."	1
"willow";	1
&	3
&C.	2
&c.	12
&c.'	2
&c.,	2
'"All	1
'"Among	1
'"And,	1
'"But,	1
'"Gamut"	1
'"How	1
'"Lo,	2
'"Look	1
'"My	1
'"Now	1
'"O	2
'"The	1
'"When	1
''Tis	3
'-on	1
'A	53
'A-down	1
'Above	1


### Challenge

Modify *wordcountMapper.py* so that punctuation and capitalization are no longer factors in determining unique words.

In [None]:
%%writefile codes/wordcountEnhancedMapper.py
#!/usr/bin/env python                                          
import sys                     
import string

translator = str.maketrans('', '', string.punctuation)

for oneLine in sys.stdin:
    oneLine = oneLine.strip()
    for word in oneLine.split(" "):
        if word != "":
            newWord = word.translate(translator).lower()
            print ('%s\t%s' % (_______, 1)) 
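
To see what the translation table does on its own, the sketch below applies it to a few tokens. The token list is a hypothetical sample modeled on the word-count output shown earlier. Note that `str.maketrans` with three arguments is Python 3 syntax.

```python
import string

# Translation table that deletes every ASCII punctuation character.
translator = str.maketrans('', '', string.punctuation)

# Hypothetical tokens resembling entries in the earlier word-count output.
tokens = ['"Amen,"', "Feed'st", "self-substantial", "THE", "&c."]

# Strip punctuation first, then lowercase, so variants collapse to one key.
normalized = [token.translate(translator).lower() for token in tokens]
print(normalized)
```

A token made only of punctuation becomes an empty string after translation, so a mapper may want to skip such results instead of emitting them.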

In [None]:
!hdfs dfs -rm -R intro-to-hadoop/output-wordcount-enhanced
!ssh dsciutil yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -input intro-to-hadoop/text/gutenberg-shakespeare.txt \
    -output intro-to-hadoop/output-wordcount-enhanced \
    -file ____________________________________________________ \
    -mapper _____________________ \
    -file ____________________________________________________ \
    -reducer _____________________ \