# Hadoop Streaming assignment 1: Words Rating

Create your own WordCount program and process the Wikipedia dump. Use a second job to sort the words by count in descending order (most popular first). Output format:

`word <tab> count`

The expected answer is the 7th most popular word together with its count.

The result on the sample dataset:

`is <tab> 126420`

> **Hint:** it is possible to use exactly one reducer in the second job to obtain a totally ordered result.
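
To see why the hint matters, here is a minimal sketch that simulates reducers in plain Python. The `partition` function is a deterministic stand-in chosen for illustration, not Hadoop's actual hash partitioner, and all names and sample counts are hypothetical. Each simulated reducer sorts only its own share of the records, so concatenating several part files is generally not globally ordered, while a single reducer sees everything.

```python
def partition(key, num_reducers):
    # Illustrative stand-in for a hash partitioner (assumption, not Hadoop's hash).
    return sum(key.encode("utf-8")) % num_reducers


def run_reducers(records, num_reducers):
    """Return one count-descending list per simulated reducer."""
    parts = [[] for _ in range(num_reducers)]
    for word, count in records:
        parts[partition(word, num_reducers)].append((word, count))
    return [sorted(p, key=lambda kv: kv[1], reverse=True) for p in parts]


# Made-up counts, for illustration only.
records = [("the", 10), ("of", 9), ("is", 7), ("and", 3)]

two_parts = run_reducers(records, 2)
one_part = run_reducers(records, 1)

# Concatenating the two part files, as `hdfs dfs -cat part-*` would.
concatenated = [kv for part in two_parts for kv in part]
print(concatenated)
print(one_part[0])
```

With two simulated reducers each part is sorted internally, but the concatenation starts with a record that is not the global maximum. With one reducer the single part is the full ranking.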

## Step 1. Create mapper and reducer.

In [None]:
%%writefile mapper_wiki_parser.py

import re
import sys

from collections import Counter


def log(message, **kwargs):
    """
    Prints a given message to sys.stderr stream.
    """
    print(message, file=sys.stderr, **kwargs)
    

# Main block
for line in sys.stdin:
    try:
        article_id, text = line.strip().split('\t', 1)
    except ValueError:
        continue
    
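    # Split on whitespace, dropping punctuation adjacent to it
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    counter = Counter(words)
    for word, count in counter.items():
        # Lines of this form on stderr update a Hadoop Streaming counter
        log("reporter:counter:Wiki stats,Total words,%d" % 1)
        print("%s\t%d" % (word.lower(), count))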


In [None]:
%%writefile reducer_wiki_parser.py

import sys


current_word = None
word_count = 0

# Main block
for line in sys.stdin:
    try:
        word, count = line.strip().split('\t', 1)
    except ValueError:
        continue
        
    if current_word != word:
        if current_word:
            print("%s\t%d" % (current_word, word_count))
        
        word_count = 0
        current_word = word
    
    word_count += int(count)
    
if current_word:
    print("%s\t%d" % (current_word, word_count))
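
Before submitting to the cluster, it can help to check the map, sort, reduce contract locally. The following is a minimal in-process sketch, not part of the assignment: the helper names (`map_line`, `reduce_sorted`, `run_local`) and the sample lines are hypothetical, and the in-memory sort only stands in for Hadoop's shuffle phase. The logic mirrors the two scripts above, except that empty tokens are dropped in the map step here, on the assumption that the reducer's `strip()` and `split` would discard such lines anyway.

```python
import re
from collections import Counter
from itertools import groupby

# Same pattern as in mapper_wiki_parser.py
WORD_SPLIT = re.compile(r"\W*\s+\W*", flags=re.UNICODE)


def map_line(line):
    """Yield (word, count) pairs for one 'article_id<TAB>text' line."""
    try:
        _article_id, text = line.strip().split("\t", 1)
    except ValueError:
        return  # malformed line: skip it, as the mapper does
    for word, count in Counter(WORD_SPLIT.split(text)).items():
        if word:  # drop empty tokens (assumption described above)
            yield word.lower(), count


def reduce_sorted(pairs):
    """Sum counts over runs of equal keys; input must be sorted by key."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def run_local(lines):
    mapped = [kv for line in lines for kv in map_line(line)]
    mapped.sort(key=lambda kv: kv[0])  # stands in for the shuffle/sort phase
    return list(reduce_sorted(mapped))


# Made-up input lines, for illustration only.
sample = ["1\tThe cat and the dog", "2\tthe end"]
for word, count in run_local(sample):
    print("%s\t%d" % (word, count))
```

Note that `The` and `the` are counted separately by the mapper's `Counter` and only merged in the reduce step, which is why the reducer must sum counts rather than count lines.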

## Step 2. Run MapReduce jobs.

The first job simply counts the words.

The second job uses a key comparator (`KeyFieldBasedComparator`) to sort the output of the first job by value (the count for each word), numerically and in descending order.
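
As a rough illustration of the ordering the second job should produce, here is a small Python sketch. The function name and the sample counts are made up for this example. It sorts `word<TAB>count` lines on the second field as a number, largest first, and contrasts that with a plain text comparison, which is why a numeric sort option is needed.

```python
def sort_by_count_desc(lines):
    """Order 'word<TAB>count' lines by the numeric count, largest first."""
    return sorted(lines, key=lambda line: int(line.split("\t")[1]), reverse=True)


def sort_by_count_text_desc(lines):
    """Same, but comparing the count as text (shown only for contrast)."""
    return sorted(lines, key=lambda line: line.split("\t")[1], reverse=True)


# Made-up counts, for illustration only.
counts = ["and\t3", "the\t10", "is\t7", "of\t9"]

ranked = sort_by_count_desc(counts)
ranked_as_text = sort_by_count_text_desc(counts)

print(ranked)          # numeric order: 10, 9, 7, 3
print(ranked_as_text)  # text order puts "10" last, after "3"
```

Compared as text, `"10"` sorts below `"3"` because comparison stops at the first differing character, so the most frequent word would end up at the bottom of the ranking.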

In [None]:
%%bash

INPUT="/data/wiki/en_articles_part"
OUT_DIR="coursera_mr_task1"
NUM_REDUCERS=4

hdfs dfs -rm -r -skipTrash ${OUT_DIR}/count > /dev/null

# Count words
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Word Rating (Count)" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files mapper_wiki_parser.py,reducer_wiki_parser.py \
    -mapper "python3 mapper_wiki_parser.py" \
    -reducer "python3 reducer_wiki_parser.py" \
    -input ${INPUT} \
    -output ${OUT_DIR}/count > /dev/null

hdfs dfs -rm -r -skipTrash ${OUT_DIR}/summarize > /dev/null

# Sort counts
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Word Rating (Summarize)" \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D stream.num.map.output.key.fields=2 \
    -D mapreduce.partition.keycomparator.options=-k2,2nr \
    -D mapreduce.job.reduces=1 \
    -files reducer_wiki_parser.py \
    -mapper "cat" \
    -reducer "cat" \
    -input ${OUT_DIR}/count \
    -output ${OUT_DIR}/summarize > /dev/null

# Print the 7th line of the sorted output (the 7th most popular word)
hdfs dfs -cat ${OUT_DIR}/summarize/part-00000 | sed -n '7p;8q'