# Real-World Applications: TF-IDF

In this task, Hadoop Streaming is used to process a dump of Wikipedia articles (/data/wiki/en_articles_part).

The purpose of this task is to calculate tf*idf for each (word, article) pair from the Wikipedia dump. Apply the stop words filter to speed up calculations. Term frequency (tf) is a function of a term (word) and a document (article):

tf(term, doc_id) = Nt/N,

where Nt is the number of occurrences of the term in the document and N is the total number of terms in the document (excluding stop words).
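As a quick sanity check of the formula, here is a minimal sketch in plain Python, outside Hadoop. The function name `term_frequencies` and the toy token list are illustrative assumptions, not part of the assignment.

```python
from collections import Counter


def term_frequencies(tokens, stop_words=frozenset()):
    """Return {term: Nt / N} for one document, ignoring stop words."""
    kept = [t for t in tokens if t not in stop_words]
    counts = Counter(kept)  # Nt for every remaining term
    total = len(kept)       # N, counted after stop-word removal
    return {term: n / total for term, n in counts.items()}


# Toy document (hypothetical): 4 tokens remain once "the" is dropped.
tf = term_frequencies(["the", "labor", "market", "labor", "union"], {"the"})
print(tf["labor"])  # 2 / 4 = 0.5
```

Because N is counted after filtering, the tf values of one document always sum to 1.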

Inverse document frequency (idf) is a function of a term only:

idf(term) = 1/log(1 + Dt),

where Dt is the number of documents in the dataset that contain the term.
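The idf formula can be sketched the same way. The text does not state the logarithm base; this sketch assumes the natural logarithm, which is what `math.log` in the reducer of this notebook computes. The helper name `idf` and the numbers are hypothetical.

```python
from math import log


def idf(docs_with_term):
    """idf(term) = 1 / log(1 + Dt), assuming the natural logarithm."""
    return 1 / log(1 + docs_with_term)


# Hypothetical numbers: tf = 0.5 for a term that appears in 3 documents.
tf_idf = 0.5 * idf(3)
print(round(tf_idf, 6))
```

Since Dt is at least 1 for any term that occurs in the dataset, the denominator is at least log(2) and never zero. The value decreases as a term appears in more documents.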

You can find more information at https://en.wikipedia.org/wiki/Tf%E2%80%93idf, but use only the formulas given above.

Dataset location: /data/wiki/en_articles_part

The stop words list is in the ‘/datasets/stop_words_en.txt’ file.

Format: article_id <tab> article_text

When parsing the articles, don’t forget about Unicode (even though this is an English Wikipedia dump, it contains many characters from other languages). Remove punctuation marks and convert words to lowercase to get correct counts. To cope with Unicode, we recommend using the following tokenizer:
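A minimal sketch of such a tokenizer, reusing the two regular expressions from mapper.py in this notebook. The `tokenize` wrapper and the sample sentence are illustrative assumptions.

```python
import re


def tokenize(text):
    """Split text into lowercase words, dropping surrounding punctuation."""
    # Trim non-word characters from both ends of the article text.
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    # Split on whitespace together with any punctuation adjacent to it.
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    # Drop the empty token that an empty article would otherwise produce.
    return [w.lower() for w in words if w]


print(tokenize(u"Labor, in économie: the WORK-force!"))
```

Note that punctuation inside a token is kept, so a hyphenated word such as "WORK-force" stays a single term.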

Output: tf*idf for term=’labor’ and article_id=12

The result on the sample dataset: 0.000351

Hint: all Wikipedia article_ids are greater than 0, so you can use a dummy article_id=0 to count the number of documents that contain each term.
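One way to read this hint is sketched below as a local simulation, not as a Hadoop job. It assumes the records reach the reducer sorted by (term, article_id), so the dummy records with id 0 arrive before the real ones; on a cluster that would need a secondary sort on article_id, which is not shown. The names `map_doc` and `reduce_sorted` and the toy tf values are hypothetical.

```python
from itertools import groupby
from math import log


def map_doc(article_id, tf_by_term):
    """Emit a dummy (term, 0, 1) record plus the real (term, id, tf) record."""
    for term, tf in tf_by_term.items():
        yield (term, 0, 1)            # dummy: one more document has this term
        yield (term, article_id, tf)  # real record


def reduce_sorted(records):
    """Expects records sorted by (term, article_id), so id 0 comes first."""
    for term, group in groupby(records, key=lambda r: r[0]):
        docs_with_term = 0
        for _, article_id, value in group:
            if article_id == 0:
                docs_with_term += value  # still counting Dt
            else:
                yield (term, article_id, value / log(1 + docs_with_term))


# Hypothetical tf values for two toy articles.
docs = {12: {"labor": 0.5, "union": 0.5}, 25: {"labor": 1.0}}
records = sorted(r for doc_id, tfs in docs.items() for r in map_doc(doc_id, tfs))
result = list(reduce_sorted(records))
print(result)
```

With this ordering the reducer knows Dt before it sees the first real article, so it does not have to buffer records. The reducer in this notebook takes the simpler route of collecting all articles of a term in a dictionary.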

If you want to deploy the environment on your own machine, please use the bigdatateam/yarn-notebook Docker container.

## Step 1. Create the mapper.

In [10]:
%%writefile mapper.py

from __future__ import division, print_function
import io
import re
import sys
from collections import Counter

if sys.version_info[0] == 2:
    reload(sys)  # Python 2 builtin; re-exposes sys.setdefaultencoding
    sys.setdefaultencoding("utf-8")  # required to convert to unicode
    text_type = unicode
else:
    text_type = str


def read_stopwords(file_path):
    """Read the stop words file into a set of lowercase words."""
    with io.open(file_path, encoding="utf-8") as stop_file:
        return set(word.strip().lower() for word in stop_file)


stopwords = read_stopwords("stop_words_en.txt")

for line in sys.stdin:
    try:
        article_id, text = text_type(line.strip()).split('\t', 1)

        text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
        words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
        # Lowercase, then drop empty tokens and stop words
        words = [word.lower() for word in words]
        words = [word for word in words if word and word not in stopwords]

        words_counter = Counter(words)
        words_total = sum(words_counter.values())  # N: total number of terms in the document (without stop words)

        # No sorting needed here: Hadoop sorts mapper output by key before the reduce phase
        for word, count in words_counter.items():
            tf = count / words_total  # tf(term, doc_id) = Nt / N, where count is Nt
            print("%s\t%s\t%s" % (word, article_id, tf))

    except Exception as e:
        # Log to stderr so error messages never end up in the mapper output
        print("Error in mapper.py", e, file=sys.stderr)
        continue

Overwriting mapper.py


## Step 2. Create the reducer.

In [11]:
%%writefile reducer.py

from __future__ import division, print_function
import sys
from math import log

if sys.version_info[0] == 2:
    reload(sys)  # Python 2 builtin; re-exposes sys.setdefaultencoding
    sys.setdefaultencoding("utf-8")  # required to convert to unicode


def emit(term, tf_by_article):
    """Print tf*idf for every article that contains the term."""
    # idf(term) = 1 / log(1 + Dt), where Dt is the number of articles with the term
    idf = 1 / log(1 + len(tf_by_article))
    # Separate loop names, so the tf parsed from the current input line is not overwritten
    for key_article_id, article_tf in tf_by_article.items():
        print("%s\t%s\t%f" % (term, key_article_id, article_tf * idf))


articles_dict = dict()
current_key = None

for line in sys.stdin:
    try:
        key, article_id, tf = line.strip().split('\t', 2)
        tf = float(tf)
    except Exception as e:
        # Log to stderr so error messages never end up in the reducer output
        print("Error in reducer.py", e, file=sys.stderr)
        continue

    if current_key != key:
        # A new term starts: flush the results for the previous one
        if current_key is not None:
            emit(current_key, articles_dict)
        current_key = key
        articles_dict = dict()

    articles_dict[article_id] = tf

# Flush the last term
if current_key is not None:
    emit(current_key, articles_dict)

Overwriting reducer.py


In [12]:
%%bash

OUT_DIR_JOB="tf_idf"$(date +"%s%6N")
NUM_REDUCERS=4
LOGS="stderr_logs.txt"

hdfs dfs -rm -r -skipTrash ${OUT_DIR_JOB} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Streaming TF-IDF Job 1" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files mapper.py,reducer.py,/datasets/stop_words_en.txt \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR_JOB} > /dev/null

# Code for obtaining the results
hdfs dfs -cat ${OUT_DIR_JOB}/* | grep -P '^labor\t12\t' | cut -f3

#hdfs dfs -rm -r -skipTrash ${OUT_DIR_JOB} > /dev/null


rm: `tf_idf1545975923428444': No such file or directory
18/12/28 05:45:25 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/12/28 05:45:26 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/12/28 05:45:26 INFO mapred.FileInputFormat: Total input files to process : 1
18/12/28 05:45:26 INFO mapreduce.JobSubmitter: number of splits:2
18/12/28 05:45:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1545955689172_0043
18/12/28 05:45:27 INFO impl.YarnClientImpl: Submitted application application_1545955689172_0043
18/12/28 05:45:27 INFO mapreduce.Job: The url to track the job: http://3bfb327c519e:8088/proxy/application_1545955689172_0043/
18/12/28 05:45:27 INFO mapreduce.Job: Running job: job_1545955689172_0043
18/12/28 05:45:32 INFO mapreduce.Job: Job job_1545955689172_0043 running in uber mode : false
18/12/28 05:45:32 INFO mapreduce.Job:  map 0% reduce 0%
18/12/28 05:45:46 INFO mapreduce.Job:  map 50% reduce 0%
18/12/28 05:45:47 INFO m