# Hadoop Streaming assignment 2: Stop Words

Improve the previous program to calculate how many stop words are in the input dataset. The stop word list is in the file `/datasets/stop_words_en.txt`. Use Hadoop counters to count the stop words and the total words in the dataset. The result is the percentage of stop words in the entire dataset (without the percent symbol).

The result on the sample dataset:

`41.603`
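
Hadoop Streaming picks up counter updates from lines written to stderr in the form `reporter:counter:<group>,<counter>,<amount>`. The following is a minimal sketch of emitting such lines. The helper names `format_counter` and `emit_counter` and the group name `Demo Stats` are illustrative assumptions, not part of the assignment.

```python
import sys


def format_counter(group, name, amount):
    """Build one counter update line in the format read from stderr."""
    # Group and counter names should not contain commas,
    # because the comma is the field separator.
    return "reporter:counter:%s,%s,%d" % (group, name, amount)


def emit_counter(group, name, amount=1, stream=sys.stderr):
    """Write one counter update; Hadoop Streaming only inspects stderr."""
    print(format_counter(group, name, amount), file=stream)


if __name__ == "__main__":
    # Hypothetical group and counter names, used only for this demo.
    emit_counter("Demo Stats", "Total words", 3)
```

Keeping the formatting in its own function makes the line format easy to check locally before the job is submitted.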

### Step 1. Create the mapper.

In [None]:
%%writefile mapper_wiki_parser.py

import sys
import re


def log(message, **kwargs):
    """
    Prints a given message to sys.stderr stream.
    """
    print(message, file=sys.stderr, **kwargs)
    
    
def counter(name, value):
    """
    Prints a MapReduce job counter.
    """
    log("reporter:counter:Wiki Stats,%s,%d" % (name, value))
    

def get_stop_words():
    """
    Reads the stop words file (shipped to the task via -files) into a set.
    """
    with open('stop_words_en.txt', 'r', encoding='utf-8') as f:
        # Normalize to lowercase and skip blank lines.
        return {w.strip().lower() for w in f if w.strip()}


stop_words = get_stop_words()


# Main block
for line in sys.stdin:
    try:
        article_id, text = line.strip().split('\t', 1)
    except ValueError:
        # Skip malformed lines that lack a tab-separated article id.
        continue

    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    for word in words:
        # Lowercase before the lookup: the stop word set is stored in lowercase.
        word = word.lower()
        if word in stop_words:
            counter("Stop words", 1)

        counter("Total words", 1)

        print("%s\t%d" % (word, 1))

### Step 2. Create the reducer.

**Note:** this task does not need a reducer, because Hadoop aggregates counter values across map tasks on its own.
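
As a local illustration of how per-task counter increments add up to job-level totals, the sketch below sums counter lines from in-memory stderr output. The function `aggregate_counters` and the sample lines are hypothetical. Hadoop does this bookkeeping internally, and nothing here is Hadoop API.

```python
from collections import Counter

PREFIX = "reporter:counter:"


def aggregate_counters(stderr_lines):
    """Sum counter increments found in an iterable of stderr lines."""
    totals = Counter()
    for line in stderr_lines:
        line = line.strip()
        if not line.startswith(PREFIX):
            continue  # ordinary log output, not a counter update
        # Split from the right so the fields are group, name, amount.
        group, name, amount = line[len(PREFIX):].rsplit(",", 2)
        totals[(group, name)] += int(amount)
    return totals


# Hypothetical stderr output of two separate map tasks.
task_1 = [
    "reporter:counter:Wiki Stats,Stop words,1",
    "reporter:counter:Wiki Stats,Total words,1",
]
task_2 = [
    "reporter:counter:Wiki Stats,Total words,1",
    "some unrelated log line",
]

totals = aggregate_counters(task_1 + task_2)
```

Because the totals depend only on the sum of increments, no shuffle or reduce phase is needed to obtain them.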

### Step 3. Create the parsing function.
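
The job's client log reports each final counter as a `name=value` line. Before wiring this to stdin, the extraction and the percentage calculation can be tried on an in-memory example. In this sketch the function `stop_word_percentage` is a hypothetical helper and the sample numbers are invented for illustration.

```python
import re

COUNTER_RE = re.compile(r"(Stop words|Total words)=(\d+)")


def stop_word_percentage(log_lines):
    """Return stop words as a percentage of total words, or None if unknown."""
    values = {}
    for line in log_lines:
        match = COUNTER_RE.search(line)
        if match:
            values[match.group(1)] = int(match.group(2))
    total = values.get("Total words", 0)
    if total == 0:
        return None  # avoid dividing by zero when the counter is missing
    return values.get("Stop words", 0) / total * 100


# Illustrative log fragment; the numbers are invented.
sample = ["\tWiki Stats", "\t\tStop words=25", "\t\tTotal words=100"]
result = stop_word_percentage(sample)
```

Returning `None` for a missing or zero total keeps the caller in charge of how to report a failed job.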

In [None]:
%%writefile counter_process.py

#! /usr/bin/env python

import sys
import re


# Raw strings keep "\d" from being treated as an invalid string escape.
STOP_WORDS_COUNTER_RE = re.compile(r"Stop words=\d+")
TOTAL_WORDS_COUNTER_RE = re.compile(r"Total words=\d+")


def parse_logs():
    """
    Parses raw logs of MapReduce job and 
    returns values of two counters as tuple.
    """
    stop_words = 0
    total_words = 0
    
    for line in sys.stdin:
        
        if STOP_WORDS_COUNTER_RE.search(line):
            stop_words = int(line.strip().split("=", 1)[1])
            
        if TOTAL_WORDS_COUNTER_RE.search(line):
            total_words = int(line.strip().split("=", 1)[1])
    
    return stop_words, total_words


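if __name__ == '__main__':
    stop_words, total_words = parse_logs()
    if total_words == 0:
        # Without this check a failed job would raise ZeroDivisionError.
        sys.exit("Total words counter not found in the job logs")
    print(stop_words / float(total_words) * 100)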

### Step 4. Bash commands.

In [None]:
%%bash

INPUT="/data/wiki/en_articles_part"
OUT_DIR="coursera_mr_task2"
LOGS="stderr_logs.txt"

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Stop Words" \
    -D mapreduce.job.reduces=0 \
    -files mapper_wiki_parser.py,/datasets/stop_words_en.txt \
    -mapper "python3 mapper_wiki_parser.py" \
    -input ${INPUT} \
    -output ${OUT_DIR} > /dev/null 2> $LOGS
    
python ./counter_process.py < "$LOGS"
cat $LOGS >&2
