# Real-World Applications: TF-IDF
In this task, Hadoop Streaming is used to process a dump of Wikipedia articles (/data/wiki/en_articles_part).

The purpose of this task is to calculate <code>tf*idf</code> for each (word, article) pair from the Wikipedia dump. Apply the stop words filter to speed up calculations. Term frequency (tf) is a function of a term (word) and a document (article):

<code>tf(term, doc_id) = Nt/N</code>,

where Nt is the number of occurrences of the term in the document and N is the total number of terms in the document (excluding stop words).
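As a small illustration of the tf formula, here is a minimal sketch that computes Nt/N for every term of one article. It assumes the article has already been tokenized and filtered; the function name `term_frequencies` and the sample word list are hypothetical.

```python
from collections import Counter

def term_frequencies(words):
    """Return {term: Nt / N} for one document's stop-word-filtered token list."""
    total = len(words)        # N: all remaining terms, repeats included
    counts = Counter(words)   # Nt for every distinct term
    return {term: count / float(total) for term, count in counts.items()}

# Hypothetical article: 5 terms after filtering, "calculations" occurs twice.
tf = term_frequencies(["apply", "filter", "speed", "calculations", "calculations"])
print(tf["calculations"])  # 2 of 5 terms -> 0.4
```

Note that N counts every occurrence, not only distinct terms, so the tf values of one article sum to 1.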

Inverse document frequency (idf) is a function of a term:

<code>idf(term) = 1/log(1 + Dt)</code>,

where Dt is the number of documents in the dataset that contain the term.
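The two formulas can be combined in a few lines. This is only a sketch: the task does not state the base of the logarithm, so the natural logarithm (`math.log`) is an assumption here, and the function names and input numbers are made up for illustration.

```python
from math import log

def idf(docs_with_term):
    """idf(term) = 1 / log(1 + Dt); assumes Dt >= 1 and the natural logarithm."""
    return 1.0 / log(1 + docs_with_term)

def tf_idf(term_count, doc_length, docs_with_term):
    """tf*idf for one (term, article) pair, using the task's formulas."""
    return (term_count / float(doc_length)) * idf(docs_with_term)

# Hypothetical numbers: the term occurs 2 times in a 5-term article
# and appears in 3 articles of the dataset.
print("%.6f" % tf_idf(2, 5, 3))  # 0.4 / ln(4) -> 0.288539
```

Since Dt is at least 1 for any term that was actually seen, `log(1 + Dt)` is never zero and the division is safe.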

You can find more information here: https://en.wikipedia.org/wiki/Tf–idf, but use only the formulas given above.

Dataset location: /data/wiki/en_articles_part

The stop words list is in the file '/datasets/stop_words_en.txt'.

Format: article_id <i>tab</i> article_text

When parsing the articles, don’t forget about Unicode (even though this is an English Wikipedia dump, it contains many characters from other languages). Remove punctuation marks and convert words to lowercase to get the correct counts. To cope with Unicode, we recommend using the following tokenizer:
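The tokenizer itself is not reproduced in this text. As a stand-in, here is a minimal sketch that mirrors the regular expressions used in the mapper.py cell of this notebook. It is written for Python 3, where `str` is already Unicode; the function name `tokenize` and the sample sentence are hypothetical.

```python
import re

def tokenize(text, stop_words=frozenset()):
    """Split text into lowercase terms, dropping stop words and empty tokens."""
    # Trim non-word characters at both ends of the text.
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    # Split on whitespace together with any punctuation touching it.
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    return [w.lower() for w in words if w and w.lower() not in stop_words]

print(tokenize("The purpose: calculate tf*idf (Wörter, article).", {"the"}))
# ['purpose', 'calculate', 'tf*idf', 'wörter', 'article']
```

Punctuation is removed only where it touches whitespace or the ends of the text, so a token such as `tf*idf` stays in one piece.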

Output: <code>tf*idf</code> for term='labor' and article_id=12

The result on the sample dataset:

<code>0.000351</code>

In [105]:
%%writefile mapper.py

import sys
import re

reload(sys)
sys.setdefaultencoding('utf-8')

with open('/datasets/stop_words_en.txt', "r") as f:
    stop_words = set(f.read().splitlines())
    
for line in sys.stdin:
    try:
        doc, text = unicode(line.strip()).split('\t', 1)
    except ValueError as e:
        continue

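    # Trim punctuation at both ends, then split on whitespace plus adjacent punctuation
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    # Lowercase, drop empty tokens and stop words
    words = [i.lower() for i in words if i and i.lower() not in stop_words]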
    
    tf = dict()
    for w in words:
        if w in tf:
            tf[w] += 1
        else:
            tf[w] = 1
            
    cnt = len(words)  # N: total number of terms in the document (repeats included, stop words excluded)
    for k,v in tf.iteritems():
        print "%s_%s\t%f" % (k, doc, float(v)/cnt) # term_doc <tab> tf

Overwriting mapper.py


In [106]:
%%writefile test.dat

1	In this task Hadoop Streaming is used to process Wikipedia articles dump calculations.
2	The purpose of this task is to calculate tf*idf for each pair (word, article) from the Wikipedia dump calculations.
3	Apply the stop words filter to speed up calculations calculations calculations calculations calculations. 
4	Term frequency (tf) is a function depending on a term (word) and a document (article)
5	To parse the articles don’t forget about Unicode (even though this is an English Wikipedia dump, there are many characters from other languages), remove punctuation marks and transform words to lowercase to get the correct quantities.
42	To cope with Unicode we recommend to use the following tokenizer:

Overwriting test.dat


In [107]:
cat test.dat | python2 ./mapper.py | head

articles_1	0.111111
task_1	0.111111
dump_1	0.111111
process_1	0.111111
wikipedia_1	0.111111
hadoop_1	0.111111
streaming_1	0.111111
used_1	0.111111
calculations_1	0.111111
tf*idf_2	0.100000


In [21]:
%%writefile reducer.py

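import sys
from math import log

def flush(term, docs):
    # Emit term <tab> doc <tab> tf*idf for every article containing the term
    if term is None:
        return
    idf = 1.0 / log(1 + len(docs))  # idf(term) = 1/log(1 + Dt)
    for doc, tf in docs:
        print "%s\t%d\t%f" % (term, doc, tf * idf)

current_term = None
docs = []  # (doc, tf) pairs collected for the current term

# Assumes the input is sorted so that all records of one term are adjacent
# and that the mapper emits each (term, doc) pair exactly once.
for line in sys.stdin:
    try:
        key, tf = line.strip().split('\t', 1)
        term, doc = key.rsplit('_', 1)  # split on the last '_': the term may contain '_'
        doc = int(doc)
        tf = float(tf)
    except ValueError as e:
        continue

    if term != current_term:
        flush(current_term, docs)
        current_term = term
        docs = []
    docs.append((doc, tf))

flush(current_term, docs)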
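To check the reduce-side arithmetic without a cluster, the sort-and-group step that Hadoop performs between mapper and reducer can be imitated in memory. The sketch below is only an illustration: `reduce_tfidf` is a hypothetical helper, the natural logarithm is assumed, and each (term, article) pair is assumed to appear exactly once in the mapper output.

```python
from itertools import groupby
from math import log

def reduce_tfidf(mapper_lines):
    """Turn 'term_doc<TAB>tf' lines into (term, doc_id, tf*idf) tuples."""
    records = []
    for line in mapper_lines:
        key, tf = line.rstrip("\n").split("\t", 1)
        term, doc = key.rsplit("_", 1)    # the term itself may contain '_'
        records.append((term, int(doc), float(tf)))
    records.sort()                         # stands in for Hadoop's sort phase
    results = []
    for term, group in groupby(records, key=lambda r: r[0]):
        group = list(group)
        idf = 1.0 / log(1 + len(group))    # Dt = number of articles with the term
        results.extend((term, doc, tf_value * idf) for _, doc, tf_value in group)
    return results

# A few lines in the mapper's output format.
sample = ["dump_1\t0.111111", "task_1\t0.111111", "dump_2\t0.100000"]
for term, doc, value in reduce_tfidf(sample):
    print("%s\t%d\t%f" % (term, doc, value))
```

Parsing the key into a (term, doc) tuple before sorting keeps all records of one term together, which plain string sorting of `term_doc` keys does not guarantee when a term contains `_`.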