- Dataset location: /data/wiki/en_articles_part
- The stop words list is in the '/datasets/stop_words_en.txt' file.
- Format: article_id <\t> article_text

Calculate tf*idf for each pair (word, article) from the Wikipedia dump. Apply the stop words filter to speed up calculations.

Term frequency (tf) is a function depending on a term (word) and a document (article):
```
tf(term, doc_id) = Nt/N,
```
    - Nt: the number of occurrences of the term in the document
    - N: the total number of terms in the document (without stop words)
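
The tf definition can be checked locally before running anything on the cluster. The following is a minimal sketch in plain Python; the function name `term_frequency`, the toy token list, and the choice to return 0 for an empty document are illustrative assumptions, not part of the assignment.

```python
def term_frequency(term, tokens, stop_words=frozenset()):
    """Return Nt / N for `term`, where N excludes stop words."""
    # Lowercase every token and drop stop words before counting.
    kept = [t.lower() for t in tokens if t.lower() not in stop_words]
    if not kept:
        return 0.0  # assumption: an empty document has tf = 0
    return kept.count(term.lower()) / float(len(kept))


# Hypothetical toy document, already split into tokens.
tokens = ["The", "labor", "market", "and", "the", "labor", "force"]
stop_words = {"the", "and"}
tf_labor = term_frequency("labor", tokens, stop_words)
print(tf_labor)
```

Here four tokens survive the stop word filter and two of them are "labor", so Nt = 2 and N = 4.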


Inverse document frequency (idf) is a function that depends only on a term:
```
idf(term) = 1/log(1 + Dt),
```
    - Dt: the number of documents in the dataset that contain the term.
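
A small local sketch of this idf formula follows. The function name `inverse_document_frequency`, the toy documents, and the decision to return `None` when no document contains the term are assumptions made for illustration; the formula itself is undefined for Dt = 0 because log(1) = 0.

```python
import math


def inverse_document_frequency(term, documents):
    """Return 1 / log(1 + Dt), where Dt counts documents containing `term`."""
    dt = sum(1 for tokens in documents if term in tokens)
    if dt == 0:
        return None  # assumption: treat idf as undefined, since log(1) == 0
    return 1.0 / math.log(1 + dt)


# Hypothetical toy corpus: each document is a set of lowercase tokens.
documents = [{"labor", "market"}, {"labor", "union"}, {"music"}]
idf_labor = inverse_document_frequency("labor", documents)
print(idf_labor)
```

With this definition the idf value shrinks as Dt grows, so terms that appear in many documents receive a lower weight.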


You can find more information here: https://en.wikipedia.org/wiki/Tf–idf but use only the formulas given above.

Output: tf*idf for term='labor' and article_id=12

## tf

In [1]:
%%writefile mapper_tf.py

import sys
import re

reload(sys)
sys.setdefaultencoding('utf-8') # required to convert to unicode

distributed_cache_file = 'stop_words_en.txt'
def read_stop_words(file_path):
    return set(word.strip().lower() for word in open(file_path))
stop_words = read_stop_words(distributed_cache_file)

my_article_id = 12
my_word = 'labor'

for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError as e:
        continue

    article_id = int(article_id)
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)

    for word in words:
        word = word.lower()
        if word not in stop_words and article_id == my_article_id:
            print >> sys.stderr, "reporter:counter:myGroup,Total_terms,%d" % 1
            if word == my_word:
                print "%s\t%d" % (word, 1)
                print >> sys.stderr, "reporter:counter:myGroup,my_terms,%d" % 1

Writing mapper_tf.py
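
To see what the mapper's two regular expressions do to a line of text, the tokenization step can be run in isolation. This is a sketch only: the helper name `tokenize` and the sample sentence are assumptions, and the code is written for a local Python interpreter rather than the streaming job.

```python
import re


def tokenize(text):
    """Split text the same way as the mapper: trim non-word characters at
    both ends, then split on whitespace together with adjacent punctuation."""
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    return [word.lower() for word in words]


# Hypothetical sample text with punctuation around and between words.
tokens = tokenize('"Labor, (economics) and labour-market."')
print(tokens)
```

Punctuation is removed only where it touches whitespace or the ends of the text, so a hyphenated word such as "labour-market" stays a single token. An empty text produces one empty token, which is not in the stop word list and would therefore be counted as a term.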


In [2]:
%%writefile reducer_tf.py

import sys

current_key = None
word_sum = 0

for line in sys.stdin:
    try:
        key, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError as e:
        continue

    if current_key != key:
        if current_key:
            print "%s\t%d" % (current_key, word_sum)
        word_sum = 0
        current_key = key
    word_sum += count

if current_key:
    print "%s\t%d" % (current_key, word_sum)

Writing reducer_tf.py
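
The reducer relies on its input arriving sorted by key, which Hadoop guarantees for each reducer. The same summing logic can be sketched locally with `itertools.groupby`; the function name `reduce_counts` and the sample lines are hypothetical.

```python
from itertools import groupby


def reduce_counts(lines):
    """Sum counts per key for tab-separated key/count lines grouped by key."""
    pairs = []
    for line in lines:
        try:
            key, count = line.strip().split("\t", 1)
            pairs.append((key, int(count)))
        except ValueError:
            continue  # skip malformed lines, as the reducer does
    # groupby merges only adjacent equal keys, hence the sorted-input requirement.
    return [(key, sum(count for _, count in group))
            for key, group in groupby(pairs, key=lambda pair: pair[0])]


# Hypothetical mapper output, including one malformed line.
mapper_output = ["labor\t1", "labor\t1", "bad line", "labor\t1"]
totals = reduce_counts(mapper_output)
print(totals)
```

Because addition is associative and commutative, the same script can also serve as a combiner: partial sums computed on the map side add up to the same total on the reduce side.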


## idf

In [3]:
%%writefile mapper_idf.py

import sys
import re

reload(sys)
sys.setdefaultencoding('utf-8') # required to convert to unicode

my_word = 'labor'

for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError as e:
        continue

    article_id = int(article_id)
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)

    for word in words:
        word = word.lower()
        if word == my_word:
            print "%s\t%d" % (str(article_id), 1)

Writing mapper_idf.py
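
This mapper emits one `(article_id, 1)` pair per occurrence of the word, so after the reduce step each article containing the word contributes exactly one output line, and the number of lines is Dt. The idea can be sketched locally as below; the function name `document_frequency`, the toy records, and the simplified whitespace tokenization are assumptions.

```python
def document_frequency(records, term):
    """Count distinct article ids whose text contains `term`.

    Uses plain whitespace splitting, a simplification of the mapper's regex.
    """
    article_ids = set()
    for article_id, text in records:
        if term in text.lower().split():
            article_ids.add(article_id)
    return len(article_ids)


# Hypothetical (article_id, text) records.
records = [
    (12, "Labor and labor unions"),
    (25, "Music theory"),
    (39, "Division of labor"),
]
dt = document_frequency(records, "labor")
print(dt)
```

Article 12 contains the word twice but is counted once, which mirrors collapsing all pairs with the same article id into a single line.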


In [4]:
%%writefile reducer_idf.py

import sys

current_key = None
word_sum = 0

for line in sys.stdin:
    try:
        key, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError as e:
        continue

    if current_key != key:
        if current_key:
            print "%s\t%d" % (current_key, word_sum)
        word_sum = 0
        current_key = key
    word_sum += count

if current_key:
    print "%s\t%d" % (current_key, word_sum)

Writing reducer_idf.py


## Script

In [5]:
%%writefile myScript.py

import sys
import math

nums = []
for line in sys.stdin:
    try:
        num = line.strip()
        num = float(num)
    except ValueError as e:
        continue

    nums.append(num)
    
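# results.txt is expected to hold three numbers, in this order:
# N (Total_terms counter), Nt (my_terms counter), Dt (line count of the idf output)
N = nums[0]
Nt = nums[1]
Dt = nums[2]

tf = Nt/N
idf = 1/math.log(1+Dt)
print tf*idf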

Writing myScript.py
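
The final arithmetic fails with a division by zero if article 12 has no terms (N = 0) or if the word appears in no document (Dt = 0, since log(1) = 0). A guarded variant might look like the sketch below; the function name `tf_idf`, the error handling, and the input numbers are hypothetical and are not the counters produced by the jobs.

```python
import math


def tf_idf(total_terms, term_count, doc_count):
    """Combine the three counts into tf * idf = (Nt / N) * (1 / log(1 + Dt))."""
    if total_terms <= 0:
        raise ValueError("N must be positive")
    if doc_count <= 0:
        raise ValueError("Dt must be positive, otherwise log(1 + Dt) is 0")
    tf = term_count / float(total_terms)
    idf = 1.0 / math.log(1 + doc_count)
    return tf * idf


# Hypothetical inputs: N = 1000 terms, Nt = 5 occurrences, Dt = 99 documents.
value = tf_idf(1000, 5, 99)
print(value)
```

Raising an explicit error makes a missing counter or an empty idf output visible instead of producing a traceback from the division.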


In [6]:
%%bash


############################################### tf ###############################################
OUT_DIR_TF="tf_result_"$(date +"%s%6N")
NUM_REDUCERS_TF=8

hdfs dfs -rm -r -skipTrash ${OUT_DIR_TF} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapred.job.name="Streaming tf" \
    -D mapreduce.job.reduces=${NUM_REDUCERS_TF} \
    -files mapper_tf.py,reducer_tf.py,/datasets/stop_words_en.txt \
    -mapper "python mapper_tf.py" \
    -combiner "python reducer_tf.py" \
    -reducer "python reducer_tf.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR_TF} > /dev/null 2> output_tf.log

# extract the Total_terms and my_terms counter values from the job log
grep -E "_terms" output_tf.log | grep -Eo "[0-9]+" > results.txt

# print log to stderr for grader
cat output_tf.log >&2


############################################### idf ###############################################
OUT_DIR_IDF="idf_result_"$(date +"%s%6N")
NUM_REDUCERS_IDF=8

hdfs dfs -rm -r -skipTrash ${OUT_DIR_IDF} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapred.job.name="Streaming idf" \
    -D mapreduce.job.reduces=${NUM_REDUCERS_IDF} \
    -files mapper_idf.py,reducer_idf.py \
    -mapper "python mapper_idf.py" \
    -combiner "python reducer_idf.py" \
    -reducer "python reducer_idf.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR_IDF} > /dev/null 2> output_idf.log

hdfs dfs -cat ${OUT_DIR_IDF}/part-* | wc -l >> results.txt

# print log to stderr for grader
cat output_idf.log >&2

############################################### script ###############################################
cat results.txt | python2 myScript.py

0.00035046896211


rm: `tf_result_1517471915885026': No such file or directory
18/02/01 07:58:39 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/02/01 07:58:39 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/02/01 07:58:41 INFO mapred.FileInputFormat: Total input files to process : 1
18/02/01 07:58:41 INFO mapreduce.JobSubmitter: number of splits:2
18/02/01 07:58:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1517470045575_0001
18/02/01 07:58:42 INFO impl.YarnClientImpl: Submitted application application_1517470045575_0001
18/02/01 07:58:42 INFO mapreduce.Job: The url to track the job: http://9e2c98977772:8088/proxy/application_1517470045575_0001/
18/02/01 07:58:42 INFO mapreduce.Job: Running job: job_1517470045575_0001
18/02/01 07:58:48 INFO mapreduce.Job: Job job_1517470045575_0001 running in uber mode : false
18/02/01 07:58:48 INFO mapreduce.Job:  map 0% reduce 0%
18/02/01 07:59:02 INFO mapreduce.Job:  map 100% reduce 0%
18/02/01 07:59:07 I