# Real-World Applications: TF-IDF
In this task, Hadoop Streaming is used to process a dump of Wikipedia articles (/data/wiki/en_articles_part).

The purpose of this task is to calculate <code>tf*idf</code> for each (word, article) pair in the Wikipedia dump. Apply the stop-word filter to speed up the calculations. Term frequency (tf) is a function of a term (word) and a document (article):

<code>tf(term, doc_id) = Nt/N</code>,

where Nt is the number of occurrences of the term in the document and N is the total number of terms in the document (excluding stop words).
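
As an illustration of the tf formula, here is a minimal sketch in Python 3 (the notebook's own scripts target Python 2). The function name `term_frequencies`, the toy token list, and the toy stop-word set are hypothetical and used only for this example.

```python
from collections import Counter

def term_frequencies(tokens, stop_words):
    """Return {term: Nt / N}, where N counts only the non-stop-word tokens."""
    kept = [t for t in tokens if t and t not in stop_words]
    total = len(kept)
    counts = Counter(kept)
    # If every token was a stop word, counts is empty and no division happens.
    return {term: cnt / total for term, cnt in counts.items()}

# Hypothetical toy document and stop-word list, for illustration only.
tokens = ["the", "cat", "sat", "on", "the", "cat", "mat"]
stop_words = {"the", "on"}
tf = term_frequencies(tokens, stop_words)
print(tf["cat"])  # 2 of the 4 remaining tokens -> 0.5
```

Because N excludes stop words, the tf values of one document always sum to 1.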

Inverse document frequency (idf) is a function of a term alone:

<code>idf(term) = 1/log(1 + Dt)</code>,

where Dt is the number of documents in the dataset that contain the term.
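
The idf formula used in this task can be sketched in a few lines of Python 3. The helper name `idf` is hypothetical, and the natural logarithm is assumed because the reducer in this notebook imports `log` from `math`. Note that this task-specific definition differs from the more common `log(N/Dt)` form.

```python
from math import log

def idf(doc_count):
    """idf(term) = 1 / log(1 + Dt), natural logarithm, for Dt >= 1."""
    return 1.0 / log(1 + doc_count)

# The weight shrinks as the term appears in more documents.
for dt in (1, 3, 100):
    print(dt, round(idf(dt), 6))
```

A term that occurs in every document still gets a positive weight under this formula; it is only down-weighted relative to rarer terms.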

You can find more information at https://en.wikipedia.org/wiki/Tf–idf, but use only the formulas given above.

Dataset location: /data/wiki/en_articles_part

The stop-word list is in the file '/datasets/stop_words_en.txt'.

Format: article_id <i>tab</i> article_text

When parsing the articles, remember to handle Unicode (although this is an English Wikipedia dump, it contains many characters from other languages), remove punctuation marks, and convert words to lowercase so that the counts are correct. To cope with Unicode, we recommend using the following tokenizer:
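
The recommended tokenizer is not reproduced in this copy of the task. As a stand-in, here is a minimal Python 3 sketch that mirrors the two regular expressions used in mapper1.py further down. The function name `tokenize` is hypothetical, and this should not be read as the course's official tokenizer.

```python
import re

def tokenize(text):
    """Split text into lowercase tokens, dropping punctuation next to whitespace."""
    # Trim non-word characters from both ends of the text.
    text = re.sub(r"^\W+|\W+$", "", text)
    # Split on whitespace plus any punctuation directly adjacent to it.
    tokens = re.split(r"\W*\s+\W*", text)
    # In Python 3, str patterns are Unicode-aware by default.
    return [t.lower() for t in tokens if t]

print(tokenize("(Word, article) from the Wikipedia dump."))
# ['word', 'article', 'from', 'the', 'wikipedia', 'dump']
```

Punctuation inside a token (for example the asterisk in "tf*idf") is kept, because the split happens only around whitespace.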

Output: <code>tf*idf</code> for term='labor' and article_id=12

The result on the sample dataset:

<code>0.000351</code>

In [1]:
%%writefile test.dat

1	In this task Hadoop Streaming is used to process Wikipedia articles dump calculations.
2	The purpose of this task is to calculate tf*idf for each pair (word, article) from the Wikipedia dump calculations.
3	Apply the stop words filter to speed up calculations calculations calculations calculations calculations. 
4	Term frequency (tf) is a function depending on a term (word) and a document (article)
5	To parse the articles don’t forget about Unicode (even though this is an English Wikipedia dump, there are many characters from other languages), remove punctuation marks and transform words to lowercase to get the correct quantities.

Overwriting test.dat


In [2]:
%%bash
cp '/datasets/stop_words_en.txt' 'stop_words_en.txt'

In [3]:
%%writefile mapper1.py

import sys
import re

reload(sys)
sys.setdefaultencoding('utf-8')

with open('stop_words_en.txt', "r") as f:
    stop_words = set(f.read().splitlines())
    
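for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError:
        # Skip lines that lack the "article_id<TAB>text" layout.
        continue

    # Trim non-word characters from both ends, then split on whitespace
    # together with any punctuation adjacent to it.
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)

    # Count every non-stop-word term in this article.
    counter = dict()
    for w in words:
        w = w.lower()
        if not w or w in stop_words:
            continue
        counter[w] = counter.get(w, 0) + 1

    # Emit term, article_id, tf, where tf = Nt / N.
    total = sum(counter.values())
    for term, cnt in counter.iteritems():
        print "%s\t%s\t%f" % (term, article_id, float(cnt)/total)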

Overwriting mapper1.py


In [4]:
cat test.dat | python2 ./mapper1.py | sort | tail

transform	5	0.055556
unicode	5	0.055556
used	1	0.111111
wikipedia	1	0.111111
wikipedia	2	0.100000
wikipedia	5	0.055556
word	2	0.100000
word	4	0.111111
words	3	0.100000
words	5	0.055556


In [5]:
%%writefile reducer1.py

import sys
from math import log

current_key = None
cache = dict() # TODO avoid in-memory collection

def process_record(term, cache):
    # len(cache) is Dt, the number of articles that contain the term.
    idf = 1/log(1+len(cache))
    for article_id, tf in cache.iteritems():
        print "%s\t%d\t%f" % (term, article_id, tf*idf)

for line in sys.stdin:
    try:
        term, article_id, tf = line.strip().split('\t', 2)
        article_id = int(article_id)
        tf = float(tf)
    except ValueError as e:
        continue
        
    if current_key != term:
        if current_key:            
            process_record(current_key, cache)
        current_key = term
        cache.clear()
    cache[article_id] = tf
    
if current_key:
    process_record(current_key, cache)

Overwriting reducer1.py


In [6]:
cat test.dat | python2 ./mapper1.py | sort | python2 ./reducer1.py | tail

transform	5	0.080150
unicode	5	0.080150
used	1	0.160299
wikipedia	1	0.080150
wikipedia	2	0.072135
wikipedia	5	0.040075
word	2	0.091024
word	4	0.101138
words	3	0.091024
words	5	0.050569
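
As a cross-check on the reducer output above, the same grouping logic can be sketched in Python 3 with `itertools.groupby`, which also relies on the input being sorted by term. The function name `reduce_tfidf` is hypothetical. The three input lines are the "wikipedia" records from the mapper output shown earlier.

```python
from itertools import groupby
from math import log

def reduce_tfidf(sorted_lines):
    """Yield (term, article_id, tf*idf) from mapper lines sorted by term."""
    records = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for term, group in groupby(records, key=lambda r: r[0]):
        # All postings of one term are held in memory, as in reducer1.py.
        postings = [(int(doc), float(tf)) for _, doc, tf in group]
        idf = 1.0 / log(1 + len(postings))  # Dt = number of postings
        for doc, tf in postings:
            yield term, doc, tf * idf

lines = [
    "wikipedia\t1\t0.111111\n",
    "wikipedia\t2\t0.100000\n",
    "wikipedia\t5\t0.055556\n",
]
for term, doc, score in reduce_tfidf(lines):
    print("%s\t%d\t%f" % (term, doc, score))
```

With Dt = 3 the idf is 1/log(4), about 0.721348, so the three scores come out as 0.080150, 0.072135 and 0.040075, matching the "wikipedia" rows above.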


In [7]:
%%bash

OUT_DIR="out_"

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Streaming TF-IDF" \
    -D mapreduce.job.reduces=4 \
    -files mapper1.py,reducer1.py,/datasets/stop_words_en.txt \
    -mapper "python mapper1.py" \
    -reducer "python reducer1.py" \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR} > /dev/null

hdfs dfs -cat ${OUT_DIR}/* |  grep -P 'labor\t12\t' | cut -f3

0.000351


19/04/30 16:07:29 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/04/30 16:07:29 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/04/30 16:07:30 INFO mapred.FileInputFormat: Total input files to process : 1
19/04/30 16:07:30 INFO mapreduce.JobSubmitter: number of splits:2
19/04/30 16:07:30 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1556631347566_0018
19/04/30 16:07:31 INFO impl.YarnClientImpl: Submitted application application_1556631347566_0018
19/04/30 16:07:31 INFO mapreduce.Job: The url to track the job: http://fd5b2ddce85e:8088/proxy/application_1556631347566_0018/
19/04/30 16:07:31 INFO mapreduce.Job: Running job: job_1556631347566_0018
19/04/30 16:07:37 INFO mapreduce.Job: Job job_1556631347566_0018 running in uber mode : false
19/04/30 16:07:37 INFO mapreduce.Job:  map 0% reduce 0%
19/04/30 16:07:53 INFO mapreduce.Job:  map 67% reduce 0%
19/04/30 16:07:55 INFO mapreduce.Job:  map 83% reduce 0%
19/04/30 16:07:56 INFO 