# [Honour Task] Hadoop Streaming assignment 3: Name Count

Write a WordCount-style program that counts all the names in the dataset. A name is a word with the following properties:
* The first character is not a digit (other characters can be digits).
* The first character is uppercase, all the other characters that are letters are lowercase.
* Across all occurrences of the word in the dataset, compared case-insensitively, fewer than 0.5% fail the second condition.
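
The three conditions can be sketched on a small in-memory token list before involving Hadoop. This is only an illustration: `count_names`, `max_other_share`, and the sample tokens are hypothetical and not part of the assignment code. As an added assumption, tokens that do not start with a letter are skipped entirely.

```python
from collections import Counter


def count_names(tokens, max_other_share=0.005):
    """Return {lowercased word: name-form count} for words that qualify as names."""
    total = Counter()    # all occurrences, case-insensitive
    as_name = Counter()  # occurrences written in name form
    for tok in tokens:
        # Condition 1 (assumption: also skip tokens starting with punctuation).
        if not tok or not tok[0].isalpha():
            continue
        key = tok.lower()
        total[key] += 1
        # Condition 2: first character uppercase, no uppercase letters after it.
        if tok[0].isupper() and not any(c.isupper() for c in tok[1:]):
            as_name[key] += 1
    # Condition 3: spellings that fail condition 2 make up less than 0.5%.
    return {
        word: n
        for word, n in as_name.items()
        if (total[word] - n) / total[word] < max_other_share
    }


# Hypothetical sample tokens.
sample = ["French", "French", "french", "The", "the", "the", "Paris", "2nd", "NASA"]
result = count_names(sample)
print(result)
```

In this sample only `paris` qualifies: `french` and `the` also appear in lowercase too often, `2nd` starts with a digit, and `NASA` is never written in name form.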

Sort the result by count, most popular first. Output format:

`name <tab> count`

The result is the 5th line in the output.

The result on the sample dataset:

`french 5742`

### Step 1. Create the mapper.

In [1]:
%%writefile mapper.py
"""
Map function:
  map: (<article_id>, <text>) -> [(<lowercased word>, <1 if written as a name, 0 otherwise>), ...]

Skips every token whose first character is not a letter (for example, a digit).
"""

import sys
import re


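def is_name(word):
    """
    Check whether the word is written as a name: the first character is
    uppercase and none of the remaining characters is an uppercase letter.

    :return: True if the word is written as a name, otherwise False.
    """
    # Digits and punctuation after the first character are allowed, so test
    # for the absence of uppercase letters instead of relying on islower(),
    # which returns False for strings such as "2" that contain no letters.
    return word[0].isupper() and not any(c.isupper() for c in word[1:])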
    

# Main block
for line in sys.stdin:
    try:
        article_id, text = line.strip().split('\t', 1)
    except ValueError:
        # Skip malformed lines that have no tab-separated text field.
        continue

    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    for word in words:
        if len(word) > 0 and word[0].isalpha():
            if is_name(word):
                print("%s\t%d" % (word.lower(), 1))
            else:
                print("%s\t%d" % (word.lower(), 0))

Writing mapper.py
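
To see what the mapper emits, the tokenization can be tried on a single sentence. The sketch below reuses the mapper's separator pattern and first-letter filter; the sample sentence is made up, and the name-form check is an inline stand-in that assumes "no uppercase letter after the first character".

```python
import re

# Hypothetical sample text.
text = "In 1889, the Eiffel Tower (Paris) opened."

# Same separator pattern as mapper.py: whitespace plus any adjacent punctuation.
words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)

pairs = []
for word in words:
    # Same filter as the mapper: keep tokens that start with a letter.
    if word and word[0].isalpha():
        # Inline stand-in for the name-form check (an assumption of this sketch).
        flag = int(word[0].isupper() and not any(c.isupper() for c in word[1:]))
        pairs.append((word.lower(), flag))

for key, flag in pairs:
    print("%s\t%d" % (key, flag))
```

Note that the last token keeps its trailing period, because the pattern only strips punctuation that is adjacent to whitespace. In this sketch `opened.` is therefore a different key from `opened`.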


### Step 2. Create the reducer.

In [2]:
%%writefile reducer.py
"""
Reduce function:
  reduce: (<word>, [<1 if written as a name, 0 otherwise>, ...]) -> (<word>, <count as a name>)

  A word is emitted as a name only if fewer than 0.5% of its occurrences,
  compared case-insensitively, are not written as a name.
"""

import sys


def is_name(total, as_name):
    """
    Return True if fewer than 0.5% of the word's occurrences
    are not written as a name.
    """
    # The task statement sets the threshold at 0.5%,
    # i.e. the non-name share must stay below 0.005.
    return (total - as_name) / float(total) < 0.005


current_word = None
word_total_count = 0
word_count_as_name = 0
    

# Main block
for line in sys.stdin:
    try:
        word, occurrence = line.strip().split('\t', 1)
    except ValueError:
        # Skip malformed lines that have no tab-separated flag.
        continue

    if current_word != word:
        if current_word and is_name(word_total_count, word_count_as_name):
            print("%s\t%d" % (current_word, word_count_as_name))

        word_total_count = 0
        word_count_as_name = 0
        current_word = word

    word_count_as_name += int(occurrence)
    word_total_count += 1

if current_word and is_name(word_total_count, word_count_as_name):
    print("%s\t%d" % (current_word, word_count_as_name))

Writing reducer.py
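
The reducer relies on Hadoop delivering the mapper output sorted by key, so that all flags for one word arrive consecutively. A rough local equivalent, using `sorted` as a stand-in for the shuffle phase and `itertools.groupby` for the key-change detection, might look like the sketch below. The function name `reduce_pairs`, its `max_other_share` parameter, and the sample pairs are hypothetical; the 0.5% threshold is taken from the task statement.

```python
from itertools import groupby


def reduce_pairs(sorted_pairs, max_other_share=0.005):
    """Group (word, flag) pairs by word and keep words that qualify as names."""
    out = []
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        flags = [flag for _, flag in group]
        total = len(flags)    # all occurrences of the word
        as_name = sum(flags)  # occurrences written as a name
        if (total - as_name) / total < max_other_share:
            out.append((word, as_name))
    return out


# Hypothetical mapper output, in arbitrary order.
pairs = [("paris", 1), ("the", 0), ("paris", 1), ("the", 1), ("the", 0)]

# sorted() plays the role of the shuffle/sort between map and reduce.
names = reduce_pairs(sorted(pairs))
print(names)
```

Without the sort, `groupby` (like the streaming reducer) would treat each run of equal keys as a separate word, which is why the grouping logic is only valid on sorted input.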


### Step 3. Run MapReduce jobs.

The first job counts the occurrences of each name in the dataset.

The second job uses a MapReduce key comparator to sort the output of the first job by value (the count for each word), in descending numeric order.
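
The effect of the second job can be approximated locally as a numeric, descending sort on the second tab-separated field. The following sketch uses made-up counts, and `sort_by_count` is a hypothetical helper; the order of ties in the real job may differ.

```python
def sort_by_count(lines):
    """Sort 'name<TAB>count' lines by count, largest first."""
    return sorted(lines, key=lambda line: -int(line.split("\t")[1]))


# Hypothetical output of the first job.
counts = ["paris\t12", "london\t40", "berlin\t7", "rome\t25", "madrid\t31", "oslo\t3"]

ranked = sort_by_count(counts)
fifth = ranked[4]  # the 5th line of the sorted output
print(fifth)
```

Comparing the counts as integers matters: a plain string sort would place `7` after `40`.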

In [None]:
%%bash

INPUT="/data/wiki/en_articles_part"
OUT_DIR="coursera_mr_task3"
NUM_REDUCERS=4

hdfs dfs -rm -r -skipTrash ${OUT_DIR}/count > /dev/null

# Count words
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Name Count (Count)" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files mapper.py,reducer.py \
    -mapper "python3 mapper.py" \
    -reducer "python3 reducer.py" \
    -input ${INPUT} \
    -output ${OUT_DIR}/count > /dev/null

hdfs dfs -rm -r -skipTrash ${OUT_DIR}/summarize > /dev/null

# Sort counts
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Name Count (Summarize)" \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D stream.num.map.output.key.fields=2 \
    -D mapreduce.partition.keycomparator.options=-k2,2nr \
    -D mapreduce.job.reduces=1 \
    -mapper "cat" \
    -reducer "cat" \
    -input ${OUT_DIR}/count \
    -output ${OUT_DIR}/summarize > /dev/null

# Code for obtaining the results
hdfs dfs -cat ${OUT_DIR}/summarize/part-00000 | head -5 | tail -1