Calculate statistics for groups of words that are equal up to a permutation of their letters (anagrams). For example, 'emit', 'item' and 'time' are the same word up to a permutation of letters. Find all such groups and sum the counts of the words in each group. Apply the stop-words filter, and discard groups that consist of only one word.

Output: the total number of occurrences for the group, the number of unique words in the group, and a comma-separated list of the group's words in lexicographical order:

sum <tab> group size <tab> word1,word2,...

Example: assume 'emit' occurred 3 times, 'item' 2 times and 'time' 5 times. Then 3 + 2 + 5 = 10 and the group contains 3 words, so the result for this group is:

10	3	emit,item,time

The result of the task is the output line that contains the word 'english'.

NB: Do not forget about the lexicographical order of words in the group: 'emit,item,time' is OK, 'emit,time,item' is not.
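
As a minimal sketch of the grouping described above (not the Hadoop solution used in this notebook), the snippet below groups a small in-memory dictionary of word counts by the sorted letters of each word. The names `anagram_groups` and `STOP_WORDS`, the tiny stop-word set, and the sample counts are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical stop-word set, for illustration only.
STOP_WORDS = {"the", "a", "of"}


def anagram_groups(word_counts):
    """Return 'sum<TAB>size<TAB>w1,w2,...' lines for groups of 2+ anagrams."""
    groups = defaultdict(dict)
    for word, count in word_counts.items():
        if word in STOP_WORDS:
            continue
        # Words that are permutations of each other share the same sorted letters.
        signature = "".join(sorted(word))
        groups[signature][word] = count

    lines = []
    for members in groups.values():
        if len(members) < 2:  # drop single-word groups
            continue
        words = sorted(members)  # lexicographical order
        total = sum(members.values())
        lines.append("%d\t%d\t%s" % (total, len(words), ",".join(words)))
    return sorted(lines)


example = {"emit": 3, "item": 2, "time": 5, "the": 40, "hadoop": 7}
result = anagram_groups(example)
```

Here 'the' is removed by the stop-word filter and 'hadoop' is dropped because its group has a single member, so only the emit/item/time group remains.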



In [1]:
%%writefile mapper.py

#!/usr/bin/env python

import sys
import re
import itertools

reload(sys)
sys.setdefaultencoding('utf-8')


def find_all_permu(word):
    # All len(word)! orderings of the letters (5040 strings for 'english').
    return [''.join(permu_word) for permu_word in itertools.permutations(word)]


# Build the lookup set once, instead of once per input line.
my_filter = set(find_all_permu('english'))

for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError as e:
        continue
    words = re.split("\W*\s+\W*", text, flags=re.UNICODE)

    for word in words:
        word = word.lower()
        # Emit only words that are anagrams of 'english'.
        if word in my_filter:
            print "%s\t%d" % (word, 1)

Overwriting mapper.py
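
The mapper decides membership by enumerating every permutation of 'english'. An alternative, sketched here under the assumption that only an anagram test is needed, is to compare sorted letters, which avoids building the permutation set at all. The names `TARGET`, `TARGET_SIGNATURE` and `is_anagram_of_target` are hypothetical and do not appear in mapper.py.

```python
TARGET = "english"
TARGET_SIGNATURE = sorted(TARGET)


def is_anagram_of_target(word):
    """True if `word` uses exactly the same letters as TARGET (case-insensitive)."""
    return sorted(word.lower()) == TARGET_SIGNATURE


# A few sample words: three anagrams of 'english' and one non-anagram.
checks = [is_anagram_of_target(w)
          for w in ("english", "Shingle", "engilhs", "england")]
```

The same comparison works for words of any length, whereas the number of permutations grows factorially with the word length.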


In [2]:
%%writefile reducer.py

import sys

current_key = None
word_sum = 0

for line in sys.stdin:
    try:
        key, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError as e:
        continue
    if current_key != key:
        if current_key:
            print "%s\t%d" % (current_key, word_sum)
        word_sum = 0
        current_key = key
    word_sum += count

if current_key:
    print "%s\t%d" % (current_key, word_sum)

Overwriting reducer.py
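
reducer.py only sums counts per key, and summation is associative and commutative, which is why the same script can also be passed as the combiner in the streaming job. The sketch below imitates that behaviour in memory to show that pre-aggregating chunks and then reducing gives the same totals as reducing everything at once. `sum_by_key` and the sample lines are assumptions for illustration, not part of the notebook's scripts.

```python
from itertools import groupby


def sum_by_key(lines):
    """Sum counts of 'key<TAB>count' lines; mimics reducer.py on sorted input."""
    pairs = sorted(line.split("\t", 1) for line in lines)
    return ["%s\t%d" % (key, sum(int(count) for _, count in group))
            for key, group in groupby(pairs, key=lambda pair: pair[0])]


mapper_output = ["english\t1", "shingle\t1", "english\t1", "english\t1"]

# One pass over everything (reducer only).
direct = sum_by_key(mapper_output)

# Pre-aggregate two hypothetical map-side chunks, then reduce the partial sums.
combined = sum_by_key(sum_by_key(mapper_output[:2]) +
                      sum_by_key(mapper_output[2:]))
```

Both `direct` and `combined` hold the same per-word totals, so adding the combiner changes only the amount of data shuffled, not the result.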


In [3]:
%%writefile myScript.py

from __future__ import print_function
import sys


myList = []
cnt = 0

for line in sys.stdin:
    try:
        key, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError as e:
        continue
    
    cnt += count
    myList.append(key)

# Emit: total count <tab> group size <tab> sorted, comma-separated words.
print("%d\t%d\t%s" % (cnt, len(myList), ','.join(sorted(myList))))


Overwriting myScript.py


In [4]:
%%bash

OUT_DIR="wordcount_result"
NUM_REDUCERS=8

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapred.job.name="Streaming wordCount" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files mapper.py,reducer.py \
    -mapper "python mapper.py" \
    -combiner "python reducer.py" \
    -reducer "python reducer.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR} > /dev/null 2> output.log

# Print result
hadoop fs -cat ${OUT_DIR}/part-* | python2 myScript.py | head

# print log to stderr for grader
cat output.log >&2

7820	5	english,helsing,hesling,shengli,shingle

18/01/25 09:26:14 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/01/25 09:26:14 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/01/25 09:26:15 INFO mapred.FileInputFormat: Total input files to process : 1
18/01/25 09:26:15 INFO mapreduce.JobSubmitter: number of splits:2
18/01/25 09:26:15 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1516837645430_0019
18/01/25 09:26:15 INFO impl.YarnClientImpl: Submitted application application_1516837645430_0019
18/01/25 09:26:15 INFO mapreduce.Job: The url to track the job: http://c239dcf23262:8088/proxy/application_1516837645430_0019/
18/01/25 09:26:15 INFO mapreduce.Job: Running job: job_1516837645430_0019
18/01/25 09:26:21 INFO mapreduce.Job: Job job_1516837645430_0019 running in uber mode : false
18/01/25 09:26:21 INFO mapreduce.Job:  map 0% reduce 0%
18/01/25 09:26:33 INFO mapreduce.Job:  map 100% reduce 0%
18/01/25 09:26:38 INFO mapreduce.Job:  map 100% reduce 13%
18/01/25 09:26:39 IN

## Test

In [5]:
%%writefile debug_data.txt
1	engilhs engilhs engilhs 
2	english english english english english english
3	enhilgs enhilgs enhilgs enhilgs enhilgs enhilgs enhilgs

Overwriting debug_data.txt


In [6]:
cat debug_data.txt | python2 mapper.py 

engilhs	1
engilhs	1
engilhs	1
english	1
english	1
english	1
english	1
english	1
english	1
enhilgs	1
enhilgs	1
enhilgs	1
enhilgs	1
enhilgs	1
enhilgs	1
enhilgs	1


In [7]:
cat debug_data.txt | python2 mapper.py | sort | python2 reducer.py

engilhs	3
english	6
enhilgs	7


In [8]:
cat debug_data.txt | python2 mapper.py | sort | python2 reducer.py | python2 myScript.py

16	3	engilhs,english,enhilgs