# Hadoop Streaming assignment 4: Word Groups

Calculate statistics for groups of words that are equal up to a permutation of their letters (anagrams). For example, 'emit', 'item' and 'time' are the same word up to a permutation of letters. Find all such groups and sum the counts of the words in each group. Apply a stop-word filter, and filter out groups that consist of only one word.
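One way to recognise words of the same group is to give each word a key made of its letters in sorted order. The sketch below is only an illustration written for Python 3; the helper name `anagram_key` is hypothetical and not part of the assignment.

```python
def anagram_key(word):
    """Return the lowercased letters of `word` in sorted order."""
    return ''.join(sorted(word.lower()))


# Words that are permutations of each other share the same key.
for w in ('emit', 'item', 'Time', 'times'):
    print(w, anagram_key(w))
```

Here 'emit', 'item' and 'Time' all map to `eimt`, while 'times' maps to `eimst` and therefore belongs to a different group.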

Output: the total number of occurrences of the words in the group, the number of unique words in the group, and a comma-separated list of the group's words in lexicographical order:

<code>sum <tab> group size <tab> word1,word2,...</code>

Example: assume 'emit' occurred 3 times, 'item' 2 times and 'time' 5 times. Then 3 + 2 + 5 = 10 and the group contains 3 words, so the result for this group is:

<code>10    3   emit,item,time</code>
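The expected output can be reproduced on a small dictionary of counts without Hadoop. This is a minimal in-memory sketch in Python 3, assuming the per-word counts are already known; `format_groups` is a hypothetical helper, not the streaming solution itself.

```python
from collections import defaultdict


def format_groups(word_counts):
    """Build 'sum<TAB>group size<TAB>word1,word2,...' lines from {word: count}."""
    groups = defaultdict(dict)
    for word, count in word_counts.items():
        # Sorted letters identify the group the word belongs to.
        groups[''.join(sorted(word))][word] = count

    lines = []
    for key in sorted(groups):
        members = groups[key]
        if len(members) < 2:
            continue  # groups of a single word are filtered out
        lines.append('%d\t%d\t%s' % (
            sum(members.values()), len(members), ','.join(sorted(members))))
    return lines


print(format_groups({'emit': 3, 'item': 2, 'time': 5, 'times': 3}))
```

The word 'times' has no partner with the same letters, so only the 'emit,item,time' group is reported.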

The result of the task is the output line that contains the word 'english'.

The result on the sample dataset:

<code>7823    eghilns 5   english,helsing,hesling,shengli,shingle</code>

NB: Do not forget about the lexicographical order of words in the group: 'emit,item,time' is OK, 'emit,time,item' is not.

In [1]:
%%writefile test.dat

1	Emit occurred 3 times emit emiT, item 2 times item, time 5 times time time Time Time; 
2	Calculate statistics for groups of words which are equal up to permutations of letters.
3	english, helsing, hesling, shengli, shingle

Overwriting test.dat


In [2]:
%%writefile mapper1.py

import sys
import re

reload(sys)
sys.setdefaultencoding('utf-8')

for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError as e:
        continue
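    # Strip leading/trailing punctuation, then split on whitespace
    # together with any punctuation adjacent to it.
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)

    for word in words:
        word = word.lower()
        # Skip empty tokens and single-character words.
        if len(word) < 2:
            continue
        # Sorted letters act as the group key: anagrams share the same key.
        permutation = ''.join(sorted(word))
        print "%s\t%s\t%d" % (permutation, word, 1)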

Overwriting mapper1.py
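The task statement asks for a stop-word filter, which this mapper does not apply. The sketch below shows one way the filter could be added. It is written for Python 3 as a plain function so it can be tried locally, and it makes assumptions: the notebook does not say where the stop-word list is stored, so `load_stop_words` simply takes any iterable of lines (for example an open file shipped to the job with `-files`), and both helper names are hypothetical.

```python
import re


def load_stop_words(lines):
    """Build a set of lowercased stop words from an iterable of lines."""
    return set(line.strip().lower() for line in lines if line.strip())


def map_line(line, stop_words):
    """Yield (key, word, 1) for every non-stop word of one 'id<TAB>text' line."""
    parts = line.strip().split('\t', 1)
    if len(parts) != 2:
        return  # malformed line: no article id or no text
    text = re.sub(r"^\W+|\W+$", "", parts[1], flags=re.UNICODE)
    for word in re.split(r"\W*\s+\W*", text, flags=re.UNICODE):
        word = word.lower()
        # Same length check as the mapper, plus the stop-word test.
        if len(word) < 2 or word in stop_words:
            continue
        yield ''.join(sorted(word)), word, 1


stop_words = load_stop_words(['of\n', 'to\n'])  # assumed sample list
for record in map_line('1\tEmit of time, to item.', stop_words):
    print('%s\t%s\t%d' % record)
```

Keeping the stop words in a set makes each membership test a constant-time lookup, which matters when the mapper sees every word of the corpus.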


In [3]:
cat test.dat | python2 ./mapper1.py | sort

aaccelltu	calculate	1
aciisssttt	statistics	1
aeimnoprsttu	permutations	1
aelqu	equal	1
aer	are	1
ccdeorru	occurred	1
chhiw	which	1
dorsw	words	1
eelrstt	letters	1
eghilns	english	1
eghilns	helsing	1
eghilns	hesling	1
eghilns	shengli	1
eghilns	shingle	1
eimst	times	1
eimst	times	1
eimst	times	1
eimt	emit	1
eimt	emit	1
eimt	emit	1
eimt	item	1
eimt	item	1
eimt	time	1
eimt	time	1
eimt	time	1
eimt	time	1
eimt	time	1
fo	of	1
fo	of	1
for	for	1
goprsu	groups	1
ot	to	1
pu	up	1


In [4]:
%%writefile combiner1.py

import sys

current_key=None
current_permutation=None
key_sum = 0  

for line in sys.stdin:
    try:
        permutation, word, count = line.strip().split('\t', 2)
        count = int(count)
    except ValueError as e:
        continue
    
    if current_key != word:
        # Word changed: emit the partial sum accumulated for the previous word.
        if current_key:
            print "%s\t%s\t%d" % (current_permutation, current_key, key_sum)
        current_key = word
        current_permutation = permutation
        key_sum = 0
    key_sum += count

if current_key:
    print "%s\t%s\t%d" % (current_permutation, current_key, key_sum)

Overwriting combiner1.py
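Hadoop may run a combiner zero, one or several times on a mapper's output, so its output has to be valid combiner input as well. The Python 3 sketch below illustrates why summing satisfies this: combining two halves separately and then combining the merged result gives the same totals as a single pass. The function `combine` is a hypothetical stand-in for the script above, working on tuples instead of stdin.

```python
from itertools import groupby


def combine(records):
    """Sum the counts of consecutive records that share (key, word)."""
    for (key, word), group in groupby(records, key=lambda r: (r[0], r[1])):
        yield key, word, sum(count for _, _, count in group)


records = sorted(
    [('eimt', 'time', 1)] * 5 + [('eimt', 'emit', 1)] * 3 + [('eimt', 'item', 1)] * 2)

# A single pass over all records.
single_pass = list(combine(records))

# Two partial passes whose outputs are merged and combined again.
partial = list(combine(records[:4])) + list(combine(records[4:]))
two_passes = list(combine(sorted(partial)))

print(single_pass)
print(single_pass == two_passes)
```

Both variants produce the counts 3, 2 and 5 for 'emit', 'item' and 'time'.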


In [5]:
cat test.dat | python2 ./mapper1.py | sort | python2 ./combiner1.py

aaccelltu	calculate	1
aciisssttt	statistics	1
aeimnoprsttu	permutations	1
aelqu	equal	1
aer	are	1
ccdeorru	occurred	1
chhiw	which	1
dorsw	words	1
eelrstt	letters	1
eghilns	english	1
eghilns	helsing	1
eghilns	hesling	1
eghilns	shengli	1
eghilns	shingle	1
eimst	times	3
eimt	emit	3
eimt	item	2
eimt	time	5
fo	of	2
for	for	1
goprsu	groups	1
ot	to	1
pu	up	1


In [6]:
%%writefile reducer1.py

import sys

current_key=None
key_sum = 0
words = set()

for line in sys.stdin:
    try:
        permutation, word, count = line.strip().split('\t', 2)
        count = int(count)
    except ValueError as e:
        continue
    
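    if current_key != permutation:
        # Key changed: emit the finished group if it has more than one distinct word.
        if current_key and len(words) > 1:
            print "%d\t%d\t%s" % (key_sum, len(words), ','.join(sorted(words)))
        current_key = permutation
        key_sum = 0
        words = set()
    key_sum += count
    # The set deduplicates words that arrive as several partial counts.
    words.add(word)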
    
if current_key and len(words) > 1:    
    print "%d\t%d\t%s" % (key_sum, len(words), ','.join(sorted(words)))

Overwriting reducer1.py
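The same reduce step can be written with `itertools.groupby`, which makes the "one group per key" structure explicit. This is an alternative sketch in Python 3, not the script used in the job; `reduce_groups` is a hypothetical name, and the input is assumed to be sorted by key, as Hadoop guarantees for reducer input.

```python
from itertools import groupby


def reduce_groups(records):
    """Turn (key, word, count) tuples sorted by key into output lines."""
    for _, group in groupby(records, key=lambda r: r[0]):
        total = 0
        words = set()
        for _, word, count in group:
            total += count
            words.add(word)  # the same word may arrive as several partial counts
        if len(words) > 1:
            yield '%d\t%d\t%s' % (total, len(words), ','.join(sorted(words)))


records = [
    ('eimst', 'times', 3),
    ('eimt', 'emit', 2),
    ('eimt', 'emit', 1),
    ('eimt', 'item', 2),
    ('eimt', 'time', 5),
]
print(list(reduce_groups(records)))
```

The two partial counts for 'emit' are added together, but the word is counted only once in the group size.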


In [7]:
cat test.dat | python2 ./mapper1.py | sort | python2 ./combiner1.py | python2 ./reducer1.py

5	5	english,helsing,hesling,shengli,shingle
10	3	emit,item,time


In [8]:
%%bash

OUT_DIR="out_"

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Streaming WordGroup" \
    -D mapreduce.job.reduces=8 \
    -files mapper1.py,combiner1.py,reducer1.py \
    -mapper "python mapper1.py" \
    -combiner "python combiner1.py" \
    -reducer "python reducer1.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR} > /dev/null

hdfs dfs -cat ${OUT_DIR}/* | grep -P '(,|\t)english($|,)'

7825	5	english,helsing,hesling,shengli,shingle


19/04/18 12:03:19 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/04/18 12:03:19 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/04/18 12:03:20 INFO mapred.FileInputFormat: Total input files to process : 1
19/04/18 12:03:20 INFO mapreduce.JobSubmitter: number of splits:2
19/04/18 12:03:20 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1555571073071_0004
19/04/18 12:03:20 INFO impl.YarnClientImpl: Submitted application application_1555571073071_0004
19/04/18 12:03:20 INFO mapreduce.Job: The url to track the job: http://28294daafed9:8088/proxy/application_1555571073071_0004/
19/04/18 12:03:20 INFO mapreduce.Job: Running job: job_1555571073071_0004
19/04/18 12:03:26 INFO mapreduce.Job: Job job_1555571073071_0004 running in uber mode : false
19/04/18 12:03:26 INFO mapreduce.Job:  map 0% reduce 0%
19/04/18 12:03:42 INFO mapreduce.Job:  map 33% reduce 0%
19/04/18 12:03:48 INFO mapreduce.Job:  map 39% reduce 0%
19/04/18 12:03:54 INFO 