## Hadoop Streaming Assignment 4: Word Groups

Compute statistics for groups of words that are equal up to a permutation of their letters (anagrams). For example, ‘emit’, ‘item’ and ‘time’ are the same word up to a permutation of letters. Find such groups and sum the counts of all words in each group. Apply the stop-words filter, and filter out groups that consist of only one word.

Output: the total number of occurrences for the group, the number of unique words in the group, and a comma-separated list of the group's words in lexicographical order:

sum <tab> group size <tab> word1,word2,...

Example: assume ‘emit’ occurred 3 times, ‘item’ 2 times, and ‘time’ 5 times. Then 3 + 2 + 5 = 10 and the group contains 3 words, so the result for this group is:

10 <tab> 3 <tab> emit,item,time

The result of the task is the output line containing the word ‘english’.

The result on the sample dataset (the solution below also prints the sorted-letter group key as the second field):
7823    eghilns 5   english,helsing,hesling,shengli,shingle

NB: Do not forget about the lexicographical order of words in the group: 'emit,item,time' is OK, 'emit,time,item' is not.
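
For checking a Hadoop result on a small sample, the whole task can be sketched in memory. This is only a minimal sketch, not the streaming solution: `group_anagrams` is a hypothetical helper, and it assumes the per-word counts are already computed and stop words already removed.

```python
from collections import defaultdict


def group_anagrams(word_counts):
    """Return 'sum<TAB>size<TAB>words' lines for anagram groups of 2+ words.

    word_counts: dict mapping a lowercase word to its number of occurrences
    (assumed to be stop-word filtered already).
    """
    groups = defaultdict(dict)
    for word, count in word_counts.items():
        # Words that are permutations of each other share the same sorted letters.
        groups["".join(sorted(word))][word] = count

    lines = []
    for signature in sorted(groups):
        members = groups[signature]
        if len(members) < 2:
            continue  # single-word groups are filtered out
        words = ",".join(sorted(members))  # lexicographical order
        lines.append("%d\t%d\t%s" % (sum(members.values()), len(members), words))
    return lines


# Counts from the example above, plus one word that has no anagram partner.
print(group_anagrams({"emit": 3, "item": 2, "time": 5, "hadoop": 4}))
```

For these counts the sketch returns a single line with sum 10, group size 3 and the words emit,item,time; the word without a partner is dropped.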

In [1]:
%%writefile mapper.py

import sys
import re

if sys.version_info[0] == 2:
    # Python 2 only: make UTF-8 the default codec so unicode() can decode stdin.
    from imp import reload
    reload(sys)
    sys.setdefaultencoding("utf-8")
else:
    unicode = str  # Python 3 strings are already unicode


def read_stopwords(file_path):
    """Read the stop-word list, one word per line, into a lowercase set."""
    with open(file_path) as stopwords_file:
        return set(word.strip().lower() for word in stopwords_file)


stopwords = read_stopwords("stop_words_en.txt")

for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)

        # Split on whitespace, dropping punctuation adjacent to it.
        words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
        # Keep purely alphabetic, non-stop words, lowercased.
        words = [word.lower() for word in words
                 if word.isalpha() and word.lower() not in stopwords]

        for word in words:
            print("%s\t%d" % (word, 1))

    except Exception as e:
        # Report on stderr so error text never enters the key/value stream.
        sys.stderr.write("Error in mapper.py: %s\n" % (e,))
        continue

Overwriting mapper.py
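
The mapper's tokenization can be tried on a single line without Hadoop. The sketch below mirrors the mapper's split and filter; `tokenize` and `sample_stopwords` are hypothetical names, and the two stop words are an illustrative subset rather than the contents of `stop_words_en.txt`.

```python
import re


def tokenize(text, stopwords):
    """Mirror the mapper's filtering on a single line of article text."""
    tokens = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    return [t.lower() for t in tokens
            if t.isalpha() and t.lower() not in stopwords]


sample_stopwords = {"the", "and"}  # illustrative subset, not the real list
print(tokenize("Time, item! and emit (the time).", sample_stopwords))
```

The split only removes punctuation that touches whitespace, so the final token `time).` keeps its punctuation and is rejected by `isalpha()`. The same happens to a token with leading punctuation at the very start of the text.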


## Step 2. Create the reducer.

In [2]:
%%writefile reducer.py

# Your code for reducer here.
import sys

if sys.version_info[0] == 2:
    # Python 2 only: make UTF-8 the default codec.
    from imp import reload
    reload(sys)
    sys.setdefaultencoding("utf-8")

current_key = None
total = 0

# Input arrives sorted by key, so equal words are on consecutive lines.
for line in sys.stdin:
    try:
        key, count = line.strip().split('\t', 1)
        count = int(count)

        if current_key != key:
            # Key changed: flush the finished word before starting a new one.
            if current_key:
                print("%s\t%d" % (current_key, total))
            total = 0
            current_key = key

        total += count

    except Exception as e:
        # Report on stderr so error text never enters the key/value stream.
        sys.stderr.write("Error in reducer.py: %s\n" % (e,))
        continue

# Flush the last word.
if current_key:
    print("%s\t%d" % (current_key, total))

Overwriting reducer.py
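
The reducer relies on Hadoop delivering its input sorted by key, so a running total can be flushed whenever the key changes. The same pattern can be written with `itertools.groupby`; this is an alternative sketch, not the code used in the job, and `sum_by_key` is a hypothetical name.

```python
from itertools import groupby


def sum_by_key(lines):
    """Sum counts per key for 'key<TAB>count' lines already sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    # groupby only merges consecutive equal keys, which is exactly
    # what sorted reducer input provides.
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(int(count) for _, count in group)


sorted_lines = ["emit\t1\n", "emit\t1\n", "item\t1\n", "time\t1\n", "time\t1\n"]
for key, total in sum_by_key(sorted_lines):
    print("%s\t%d" % (key, total))
```

If the input were not sorted, the same word would be emitted more than once, which is also true of the hand-written loop.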


In [3]:
%%writefile mapper2.py

# Your code for mapper here.
import sys

if sys.version_info[0] == 2:
    # Python 2 only: make UTF-8 the default codec.
    from imp import reload
    reload(sys)
    sys.setdefaultencoding("utf-8")

current_key = None
sorted_key = None
total = 0

# Input is Job 1's output: one "word<TAB>count" line per word.
# The running total only guards against a word repeated on consecutive lines.
for line in sys.stdin:
    try:
        key, count = line.strip().split('\t', 1)
        count = int(count)

        if current_key != key:
            if current_key:
                print("%s\t%d\t%s" % (sorted_key, total, current_key))
            current_key = key
            # Sorted letters are identical for all anagrams of a word,
            # so they serve as the group key for Job 2.
            sorted_key = ''.join(sorted(current_key))
            total = 0

        total += count

    except Exception as e:
        # Report on stderr so error text never enters the key/value stream.
        sys.stderr.write("Error in mapper2.py: %s\n" % (e,))
        continue

# Flush the last word.
if current_key:
    print("%s\t%d\t%s" % (sorted_key, total, current_key))

Overwriting mapper2.py
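
The key step in the second mapper is re-keying each word by its sorted letters, so that the shuffle sends all anagrams to the same reducer call sequence. A minimal sketch of that per-line transformation follows; `to_group_record` is a hypothetical name and the counts are made up for illustration.

```python
def to_group_record(line):
    """Convert 'word<TAB>count' into 'signature<TAB>count<TAB>word'."""
    word, count = line.strip().split("\t", 1)
    signature = "".join(sorted(word))  # identical for all anagrams of the word
    return "%s\t%d\t%s" % (signature, int(count), word)


# Illustrative counts, not taken from the dataset.
for line in ["english\t7", "shingle\t2"]:
    print(to_group_record(line))
```

Both records start with the key `eghilns`, so they end up adjacent in the second reducer's input.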


In [4]:
%%writefile reducer2.py

# Your code for reducer here.
import sys

if sys.version_info[0] == 2:
    # Python 2 only: make UTF-8 the default codec.
    from imp import reload
    reload(sys)
    sys.setdefaultencoding("utf-8")

current_key = None
total_sum = 0
word_set = set()

# Input: "sorted_letters<TAB>count<TAB>word", sorted by sorted_letters.
for line in sys.stdin:
    try:
        sorted_key, key_count, key = line.strip().split('\t', 2)
        key_count = int(key_count)

        if current_key != sorted_key:
            # Group finished: emit it only if it has more than one word.
            if current_key and len(word_set) > 1:
                print("%d\t%s\t%d\t%s" % (total_sum, current_key, len(word_set), ','.join(sorted(word_set))))

            word_set = set()
            current_key = sorted_key
            total_sum = 0

        total_sum += key_count
        word_set.add(key)

    except Exception as e:
        # Report on stderr so error text never enters the key/value stream.
        sys.stderr.write("Error in reducer2.py: %s\n" % (e,))
        continue

# Flush the last group.
if current_key and len(word_set) > 1:
    print("%d\t%s\t%d\t%s" % (total_sum, current_key, len(word_set), ','.join(sorted(word_set))))

Overwriting reducer2.py
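
Before submitting, a result line can be checked against the task's rules: the group size matches the word list, the words are unique and in lexicographical order, and every word is an anagram of the group key. The sketch below assumes the four-field layout this reducer prints (sum, sorted-letter key, size, words); `check_group_line` is a hypothetical helper.

```python
def check_group_line(line):
    """Return True if a 'sum<TAB>key<TAB>size<TAB>words' line is self-consistent."""
    total, key, size, words = line.rstrip("\n").split("\t")
    members = words.split(",")
    return (
        int(total) > 0
        # Declared size matches the list, and the group has 2+ words.
        and int(size) == len(members) > 1
        # Words are unique and in lexicographical order.
        and members == sorted(set(members))
        # Every word is a permutation of the key's letters.
        and all("".join(sorted(word)) == key for word in members)
    )


sample = "7823\teghilns\t5\tenglish,helsing,hesling,shengli,shingle"
print(check_group_line(sample))
print(check_group_line("10\teimt\t3\temit,time,item"))  # wrong word order
```

The sample-dataset line passes, while the line with 'emit,time,item' fails the ordering check.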


In [5]:
%%bash

OUT_DIR_JOB1="wordgroup_job1_"$(date +"%s%6N")
OUT_DIR_JOB2="wordgroup_job2_"$(date +"%s%6N")
NUM_REDUCERS=8
LOGS="stderr_logs.txt"

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Streaming WordGroupTask4 Job 1" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files mapper.py,reducer.py,/datasets/stop_words_en.txt \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR_JOB1} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Streaming WordGroupTask4 Job 2" \
    -D mapreduce.job.reduces=1 \
    -files mapper2.py,reducer2.py \
    -mapper "python mapper2.py" \
    -reducer "python reducer2.py" \
    -input ${OUT_DIR_JOB1} \
    -output ${OUT_DIR_JOB2} > /dev/null

# Code for obtaining the results
hdfs dfs -cat ${OUT_DIR_JOB2}/* | grep -P '(,|\t)english($|,)'

hdfs dfs -rm -r -skipTrash ${OUT_DIR_JOB1} > /dev/null
hdfs dfs -rm -r -skipTrash ${OUT_DIR_JOB2} > /dev/null

#hdfs dfs -rm -r -skipTrash ${OUT_DIR_JOB1}* > /dev/null
#hdfs dfs -rm -r -skipTrash ${OUT_DIR_JOB2}* > /dev/null

18/12/28 05:28:31 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/12/28 05:28:31 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/12/28 05:28:31 INFO mapred.FileInputFormat: Total input files to process : 1
18/12/28 05:28:31 INFO mapreduce.JobSubmitter: number of splits:2
18/12/28 05:28:31 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1545955689172_0038
18/12/28 05:28:32 INFO impl.YarnClientImpl: Submitted application application_1545955689172_0038
18/12/28 05:28:32 INFO mapreduce.Job: The url to track the job: http://3bfb327c519e:8088/proxy/application_1545955689172_0038/
18/12/28 05:28:32 INFO mapreduce.Job: Running job: job_1545955689172_0038
18/12/28 05:28:36 INFO mapreduce.Job: Job job_1545955689172_0038 running in uber mode : false
18/12/28 05:28:36 INFO mapreduce.Job:  map 0% reduce 0%
18/12/28 05:28:50 INFO mapreduce.Job:  map 100% reduce 0%
18/12/28 05:28:55 INFO mapreduce.Job:  map 100% reduce 13%
18/12/28 05:28:56 IN