# Hadoop Streaming assignment 3: Name Count
Write a WordCount-style program that counts every name in the dataset. A name is a word with the following properties:

* The first character is not a digit (other characters can be digits).
* The first character is uppercase, and all other characters that are letters are lowercase.
* Among all occurrences of the word in the dataset, regardless of case, fewer than 0.5% fail the second condition.
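
To make the definition concrete, the three conditions can be sketched in plain Python, outside Hadoop. This is a minimal in-memory sketch under stated assumptions, not the streaming solution developed in the cells that follow. The helper names `is_capitalized` and `find_names` and the `threshold` parameter are hypothetical. Reporting the number of capitalized occurrences as the count is also an assumption.

```python
from collections import Counter


def is_capitalized(word):
    """Conditions 1 and 2: uppercase first character, all other letters lowercase."""
    if not word or word[0].isdigit() or not word[0].isupper():
        return False
    # Non-letter characters (digits, punctuation) are allowed after the first one.
    return all(not ch.isalpha() or ch.islower() for ch in word[1:])


def find_names(words, threshold=0.005):
    """Condition 3: keep capitalized words whose other case variants are rare."""
    total = Counter()        # occurrences per case-insensitive word
    capitalized = Counter()  # occurrences that satisfy conditions 1 and 2
    for word in words:
        total[word.lower()] += 1
        if is_capitalized(word):
            capitalized[word] += 1

    names = {}
    for word, good in capitalized.items():
        all_occurrences = total[word.lower()]
        bad = all_occurrences - good
        # Multiplying avoids integer division, so this also works on Python 2.
        if bad < threshold * all_occurrences:
            names[word] = good
    return names


# Hypothetical toy input: "Paris" is almost always capitalized, "The" is not.
words = ["Paris"] * 300 + ["paris"] + ["The"] * 10 + ["the"] * 90 + ["2nd", "X1", "McDonald"]
names = find_names(words)
```

Here "Paris" qualifies because only 1 of its 301 occurrences is not capitalized, while "The" is rejected because 90 of its 100 occurrences are lowercase.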

Order the output by count, most popular first. Output format:

<code>name <tab> count</code>

The result is the 5th line in the output.

The result on the sample dataset:

<code>french 5742</code>
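
The ordering and output format described above can be sketched as follows. This is only an illustration: `format_counts` is a hypothetical helper, the counts are made-up toy values, and breaking ties alphabetically is an assumption that the task statement does not specify.

```python
def format_counts(counts):
    """Return 'name<TAB>count' lines, most frequent first."""
    # Sort by descending count; ties are broken by name (an assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ["%s\t%d" % (name, count) for name, count in ordered]


# Hypothetical toy counts, not taken from the dataset.
toy_counts = {"Alpha": 5, "Beta": 9, "Gamma": 7, "Delta": 3, "Epsilon": 1, "Zeta": 2}
lines = format_counts(toy_counts)
fifth = lines[4]  # the "5th line in the output"
```

On a cluster the same ordering is normally delegated to the shuffle sort rather than done in memory.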

If you want to deploy the environment on your own machine, please use the bigdatateam/yarn-notebook Docker container.

In [1]:
%%writefile test.dat

1	For The Horde! For The Horde! For The Horde! For The Horde! For The Horde! For The Horde! For The Horde!
1	For The Horde! For The Horde! For The Horde! For The Horde! For The Horde! For The Horde! For The Horde!
1	For The Horde! For The Horde! For The Horde! For The Horde! For The Horde! For The Horde! For The Horde!
1	For The Horde! For The Horde! For The Horde! For The Horde! For The Horde! For The Horde! For The Horde!
1	For The Horde! For The Horde! For The Horde! For The Horde! For The Horde! For The Horde! For The Horde!
1	For The Horde! For The Horde! For The Horde! For The Horde! For The Horde! For The Horde! For The Horde!
2	for the horde
3	FoR$ ^THe HoRde!!!
42	Good OR baD 

Overwriting test.dat


In [2]:
cat test.dat | tail -5

1	For The Horde! For The Horde! For The Horde! For The Horde! For The Horde! For The Horde! For The Horde!
1	For The Horde! For The Horde! For The Horde! For The Horde! For The Horde! For The Horde! For The Horde!
2	for the horde
3	FoR$ ^THe HoRde!!!
42	Good OR baD 

In [3]:
%%writefile mapper1.py

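import sys
import re

# Python 2 only: make UTF-8 the default codec for implicit str <-> unicode conversion.
reload(sys)
sys.setdefaultencoding('utf-8')

total = 0  # every token seen, including those that are not emitted

for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError:
        continue  # skip lines that are not "id<TAB>text"
    # Trim punctuation at both ends of the text, then split on whitespace
    # together with any punctuation touching it.
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    for word in words:
        total += 1
        if not word:
            continue  # text was punctuation only; word[0] would raise IndexError
        cond1 = word[0].isalpha()
        cond2 = not word[0].islower() and word[1:].islower()
        if cond1:
            # Emit: word, a count of 1, and the capitalization flag.
            print "%s\t%d\t%d" % (word, 1, cond2)

# Hadoop Streaming picks up counter updates written to stderr in this format.
print >> sys.stderr, "reporter:counter:Wiki stats,Total words,%d" % total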

Overwriting mapper1.py


In [4]:
cat test.dat | python2 ./mapper1.py | sort | head

reporter:counter:Wiki stats,Total words,135
baD	1	0
for	1	0
FoR	1	0
For	1	1
For	1	1
For	1	1
For	1	1
For	1	1
For	1	1
For	1	1


In [5]:
%%writefile reducer1.py

import sys


def print_group(arr):
    # arr holds [word, count, cond2] for each case variant of one word;
    # append the case-insensitive total to every row.
    word_sum = 0
    for i in arr:
        word_sum += i[1]
    for i in arr:
        print "%s\t%d\t%d\t%d" % (i[0], i[1], i[2], word_sum)


current_key = None
current_group = []

for line in sys.stdin:
    try:
        key, count, cond2 = line.strip().split('\t', 2)
        count = int(count)
        cond2 = int(cond2)
    except ValueError:
        continue

    if current_key != key:  # a new case-sensitive word starts
        if current_key:
            current_group.append([current_key, key_sum, current_cond2])

            if current_key.lower() != key.lower():  # a new case-insensitive word starts
                print_group(current_group)
                current_group = []

        current_key = key
        key_sum = 0
        current_cond2 = cond2
    key_sum += count

# Flush the variants collected so far. Note that the key still being
# accumulated is not appended here, so the last word of the input is not emitted.
if current_key:
    print_group(current_group)

Overwriting reducer1.py


In [6]:
cat test.dat | python2 ./mapper1.py | sort | python2 ./reducer1.py

reporter:counter:Wiki stats,Total words,135
baD	1	0	1
for	1	0	44
FoR	1	0	44
For	42	1	44
Good	1	1	1
horde	1	0	44
HoRde	1	0	44
Horde	42	1	44
OR	1	0	1
the	1	0	2
THe	1	0	2


In [7]:
%%writefile total_counter.py

import sys

if __name__ == '__main__':
    # Scan the saved job log for the "Total words=<n>" counter line.
    for line in sys.stdin:
        try:
            key, value = line.strip().split('=', 1)
            if key == "Total words":
                print "%d" % int(value)
                break
        except ValueError:
            continue

Overwriting total_counter.py


In [8]:
%%writefile mapper2.py

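import sys

# Python 2 only: make UTF-8 the default codec for implicit str <-> unicode conversion.
reload(sys)
sys.setdefaultencoding('utf-8')

# Total number of words in the dataset, passed as the first argument.
total = float(sys.argv[1])

for line in sys.stdin:
    try:
        key, key_sum, cond2, group_sum = unicode(line.strip()).split('\t', 3)
        key_sum = int(key_sum)
        cond2 = int(cond2)
        group_sum = float(group_sum)
    except ValueError:
        continue

    # True when all case variants of the word together make up
    # less than 0.5% of every word in the dataset.
    cond3 = (group_sum / total) * 100 < 0.5
    if cond2 or cond3:
        # The count goes first so the next stage can sort on it numerically.
        print "%d\t%s" % (key_sum, key)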

Overwriting mapper2.py


In [9]:
cat test.dat | python2 ./mapper1.py | sort | python2 ./reducer1.py | python2 ./mapper2.py "3371"

reporter:counter:Wiki stats,Total words,135
1	baD
42	For
1	Good
42	Horde
1	OR
1	the
1	THe


In [10]:
%%writefile reducer2.py

import sys

# Input arrives sorted by count; swap the columns back to name<TAB>count.
for line in sys.stdin:
    try:
        cnt, key = line.strip().split('\t', 1)
        cnt = int(cnt)
    except ValueError:
        continue
    print "%s\t%d" % (key, cnt)

Overwriting reducer2.py


In [11]:
cat test.dat | python2 ./mapper1.py | sort | python2 ./reducer1.py | python2 ./mapper2.py "3371" | sort -nr | python2 ./reducer2.py

reporter:counter:Wiki stats,Total words,135
Horde	42
For	42
THe	1
the	1
OR	1
Good	1
baD	1


In [12]:
%%bash

OUT_DIR="word_groups_"
LOGS="stderr_logs.txt"

hdfs dfs -rm -r -skipTrash ${OUT_DIR}* > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapred.job.name="Streaming wordGroups" \
    -D mapreduce.job.reduces=8 \
    -files mapper1.py,reducer1.py \
    -mapper "python mapper1.py" \
    -reducer "python reducer1.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR} > /dev/null 2> $LOGS
    
cat $LOGS >&2

rm: `word_groups_*': No such file or directory
19/04/17 13:58:46 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/04/17 13:58:47 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/04/17 13:58:47 INFO mapred.FileInputFormat: Total input files to process : 1
19/04/17 13:58:47 INFO mapreduce.JobSubmitter: number of splits:2
19/04/17 13:58:47 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1555502073095_0020
19/04/17 13:58:47 INFO impl.YarnClientImpl: Submitted application application_1555502073095_0020
19/04/17 13:58:47 INFO mapreduce.Job: The url to track the job: http://05a41fc46e82:8088/proxy/application_1555502073095_0020/
19/04/17 13:58:47 INFO mapreduce.Job: Running job: job_1555502073095_0020
19/04/17 13:58:53 INFO mapreduce.Job: Job job_1555502073095_0020 running in uber mode : false
19/04/17 13:58:53 INFO mapreduce.Job:  map 0% reduce 0%
19/04/17 13:59:09 INFO mapreduce.Job:  map 41% reduce 0%
19/04/17 13:59:15 INFO mapreduce.

In [13]:
!hdfs dfs -cat "word_groups_/part-00000" | head

A	18574	0	18574
A".Two	1	0	1
A$1.18	1	0	1
A$38,850	1	0	1
A$4	1	0	1
A$480	1	0	1
A(0	29	0	29
A(2,A(3,2	1	0	1
A(3,1	1	0	1
A(3,A(4,0	1	0	1
cat: Unable to write to output stream.


In [14]:
%%bash
TOTAL=$(cat stderr_logs.txt | python2 ./total_counter.py)
echo $TOTAL

11937317


In [15]:
%%bash

TOTAL=$(cat stderr_logs.txt | python2 ./total_counter.py)
DIR1="word_groups_"
DIR2="name_counts_"

hdfs dfs -rm -r -skipTrash "name_counts_"* > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapred.job.name="Streaming nameCount" \
    -D mapreduce.job.reduces=1 \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D mapreduce.partition.keycomparator.options="-nr" \
    -files mapper2.py,reducer2.py \
    -mapper "python mapper2.py $TOTAL" \
    -reducer "python reducer2.py" \
    -input ${DIR1} \
    -output ${DIR2} > /dev/null

rm: `name_counts_*': No such file or directory
19/04/17 13:59:46 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/04/17 13:59:47 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/04/17 13:59:47 INFO mapred.FileInputFormat: Total input files to process : 8
19/04/17 13:59:47 INFO mapreduce.JobSubmitter: number of splits:8
19/04/17 13:59:47 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1555502073095_0021
19/04/17 13:59:47 INFO impl.YarnClientImpl: Submitted application application_1555502073095_0021
19/04/17 13:59:47 INFO mapreduce.Job: The url to track the job: http://05a41fc46e82:8088/proxy/application_1555502073095_0021/
19/04/17 13:59:47 INFO mapreduce.Job: Running job: job_1555502073095_0021
19/04/17 13:59:52 INFO mapreduce.Job: Job job_1555502073095_0021 running in uber mode : false
19/04/17 13:59:52 INFO mapreduce.Job:  map 0% reduce 0%
19/04/17 13:59:59 INFO mapreduce.Job:  map 75% reduce 0%
19/04/17 14:00:02 INFO mapreduce.

In [16]:
!hdfs dfs -cat "name_counts_/part-00000" | head -100

The	104250
are	57225
from	56252
or	49258
be	43262
an	43198
which	42144
his	41530
at	41140
it	35840
were	34321
In	34308
also	30380
not	30027
have	29707
has	28756
he	25551
had	23892
this	23673
their	23521
but	23343
its	22348
one	21164
been	20443
other	20355
first	20168
such	18874
A	18574
used	18130
can	17953
more	17632
American	17215
who	16371
they	15741
two	15689
into	15569
all	15465
most	15348
than	14596
This	14576
some	14169
only	13828
It	13664
would	12746
time	12596
between	12488
after	12378
many	12325
when	12102
may	11931
over	10756
He	10643
use	10498
about	10393
known	10231
years	9996
these	9938
there	9932
during	9741
United	9674
new	9621
New	9526
where	9390
called	9251
number	9241
no	9069
made	8834
being	8593
both	8483
through	8429
including	8353
then	8193
I	7958
b	7914
up	7884
any	7870
often	7842
later	7841
them	7835
English	7794
system	7767
out	7736
while	7709
century	7674
under	7530
three	7525