# Real-World Applications: TF-IDF
In this task, Hadoop Streaming is used to process a dump of Wikipedia articles (/data/wiki/en_articles_part).

The purpose of this task is to calculate <code>tf*idf</code> for each (word, article) pair from the Wikipedia dump. Apply the stop-word filter to speed up the calculations. Term frequency (tf) is a function of a term (word) and a document (article):

<code>tf(term, doc_id) = Nt/N</code>,

where Nt is the number of occurrences of the term in the document and N is the total number of terms in the document (excluding stop words).

Inverse document frequency (idf) is a function of a term only:

<code>idf(term) = 1/log(1 + Dt)</code>,

where Dt is the number of documents in the dataset that contain the term.
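
The two formulas can be checked on a small in-memory example before moving to MapReduce. The sketch below is only an illustration: the toy corpus and the function signatures are assumptions, and <code>idf</code> is undefined for a term that appears in no document, because <code>log(1)</code> is zero.

```python
from collections import Counter
from math import log


def tf(term, tokens):
    """Term frequency: occurrences of `term` divided by the number of tokens."""
    counts = Counter(tokens)
    return counts[term] / float(len(tokens))


def idf(term, documents):
    """Inverse document frequency as defined above: 1 / log(1 + Dt)."""
    dt = sum(1 for tokens in documents if term in tokens)
    return 1.0 / log(1 + dt)


# Hypothetical toy corpus: token lists with stop words already removed.
docs = {
    1: ["hadoop", "streaming", "wikipedia", "dump"],
    2: ["wikipedia", "dump", "dump", "filter"],
}

# "dump" makes up 2 of the 4 tokens in article 2 and occurs in both articles.
score = tf("dump", docs[2]) * idf("dump", list(docs.values()))
print("%.6f" % score)
```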

You can find more information here: https://en.wikipedia.org/wiki/Tf–idf but use just the formulas mentioned above.

Dataset location: /data/wiki/en_articles_part

The stop-word list is in the file ‘/datasets/stop_words_en.txt’.

Format: article_id <i>tab</i> article_text

When parsing the articles, remember Unicode (although this is an English Wikipedia dump, it contains many characters from other languages), remove punctuation marks, and convert words to lowercase to get correct counts. To cope with Unicode, we recommend the regular-expression tokenizer used in the mappers below, which passes <code>flags=re.UNICODE</code> to <code>re.sub</code> and <code>re.split</code>.
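
A minimal sketch of such a tokenizer, mirroring the two regular expressions in mapper1.py and mapper2.py. The function name <code>tokenize</code> and the sample string are assumptions for illustration, and the sketch is written so that it also runs under Python 3, where strings are already Unicode.

```python
import re


def tokenize(text):
    """Split text into lowercase words, dropping surrounding punctuation."""
    # Trim non-word characters from both ends of the text.
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    # Split on whitespace together with any punctuation adjacent to it.
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    # Lowercase the tokens and drop the empty one produced by empty input.
    return [w.lower() for w in words if w]


# Hypothetical sample with non-ASCII letters, written as escapes.
print(tokenize(u"Caf\u00e9 culture (na\u00efve art), explained!"))
```

Punctuation inside a token that has no whitespace next to it, as in <code>tf*idf</code>, is kept by this approach.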

Output: <code>tf*idf</code> for term='labor' and article_id=12

The result on the sample dataset:

<code>0.000351</code>

<i>Hint: all Wikipedia article_ids are greater than 0. So you can use a dummy article_id=0 to calculate the number of documents with each term.</i>
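
One way to read the hint: if the per-term value is stored under article_id=0, then after sorting by (term, article_id) that row comes before every real article of the same term, so a single pass can combine it with each tf. The sketch below simulates this in memory. The record layout and the function name <code>join_tf_idf</code> are assumptions, and it sorts ids numerically, whereas Hadoop Streaming compares keys as text.

```python
from itertools import groupby

# Hypothetical records in the form (term, article_id, value):
# article_id 0 carries the idf of the term, other ids carry tf.
records = [
    ("apply", 3, 0.1),
    ("article", 4, 0.111111),
    ("apply", 0, 1.442695),
    ("article", 0, 0.910239),
    ("article", 2, 0.1),
]


def join_tf_idf(records):
    """Yield (term, article_id, tf * idf) using the id-0 row as the idf."""
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    for term, group in groupby(ordered, key=lambda r: r[0]):
        idf = None
        for _, article_id, value in group:
            if article_id == 0:
                idf = value  # the dummy row sorts first within its term
            elif idf is not None:
                yield term, article_id, value * idf


for row in join_tf_idf(records):
    print("%s\t%d\t%f" % row)
```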

In [1]:
%%writefile wiki.dat

1	In this task Hadoop Streaming is used to process Wikipedia articles dump calculations.
2	The purpose of this task is to calculate tf*idf for each pair (word, article) from the Wikipedia dump calculations.
3	Apply the stop words filter to speed up calculations calculations calculations calculations calculations. 
4	Term frequency (tf) is a function depending on a term (word) and a document (article)
5	To parse the articles don’t forget about Unicode (even though this is an English Wikipedia dump



Overwriting wiki.dat


In [2]:
%%bash
cp '/datasets/stop_words_en.txt' 'stop_words_en.txt'

In [3]:
%%writefile mapper1.py

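import sys
import re

# Python 2 only: make UTF-8 the default codec so unicode() can decode the input.
reload(sys)
sys.setdefaultencoding('utf-8')

with open('stop_words_en.txt', "r") as f:
    stop_words = set(f.read().splitlines())

for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError:
        continue  # skip lines without the "article_id<TAB>text" layout

    # Tokenize: trim punctuation at both ends, then split on whitespace
    # together with any punctuation adjacent to it.
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    text = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    # A set keeps each term once per article, so the reducer counts documents.
    text = {w.lower() for w in text if w.lower() not in stop_words}
    for term in text:
        print "%s\t%s" % (term, article_id)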

Overwriting mapper1.py


In [4]:
cat wiki.dat | python2 ./mapper1.py | sort | head

apply	3
article	2
article	4
articles	1
articles	5
calculate	2
calculations	1
calculations	2
calculations	3
depending	4


In [5]:
%%writefile reducer1.py

import sys
from math import log

current_key = None
key_sum = 0  # number of articles that contain current_key


def idf(doc_count):
    return 1 / log(1 + doc_count)


for line in sys.stdin:
    try:
        term, article_id = line.strip().split('\t', 1)
    except ValueError:
        continue

    # Input is sorted by term, so a new term closes the previous group.
    if current_key != term:
        if current_key:
            print "%d\t%s\t%f" % (0, current_key, idf(key_sum))  # dummy article_id=0
        current_key = term
        key_sum = 0
    key_sum += 1

# Flush the last group.
if current_key:
    print "%d\t%s\t%f" % (0, current_key, idf(key_sum))

Overwriting reducer1.py


In [6]:
cat wiki.dat | python2 ./mapper1.py | sort | python2 ./reducer1.py >> wiki.dat

In [7]:
cat wiki.dat | head


1	In this task Hadoop Streaming is used to process Wikipedia articles dump calculations.
2	The purpose of this task is to calculate tf*idf for each pair (word, article) from the Wikipedia dump calculations.
3	Apply the stop words filter to speed up calculations calculations calculations calculations calculations. 
4	Term frequency (tf) is a function depending on a term (word) and a document (article)
5	To parse the articles don’t forget about Unicode (even though this is an English Wikipedia dump
0	apply	1.442695
0	article	0.910239
0	articles	0.910239
0	calculate	1.442695


In [8]:
%%writefile mapper2.py

import sys
import re

reload(sys)
sys.setdefaultencoding('utf-8')

with open('stop_words_en.txt', "r") as f:
    stop_words = set(f.read().splitlines())
    
for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError as e:
        continue
    
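    article_id = int(article_id)

    # Rows with the dummy article_id=0 come from the IDF job:
    # re-emit them keyed by term so they join with the tf rows.
    if article_id == 0:
        term, idf = text.split('\t', 1)
        print "%s\t%d\t%s" % (term, article_id, idf)
        continue

    # Same tokenizer as in mapper1.py.
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    text = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)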
    
    counter = dict()
    for w in text:
        w = w.lower()
        if w in stop_words:
            continue
            
        if w in counter:
            counter[w] += 1
        else:
            counter[w] = 1
    
    total = sum(counter.values())
    
    for term, cnt in counter.iteritems():
        print "%s\t%d\t%f" % (term, article_id, float(cnt)/total) 

Overwriting mapper2.py


In [9]:
cat wiki.dat | python2 ./mapper2.py | sort | head

apply	0	1.442695
apply	3	0.100000
article	0	0.910239
article	2	0.100000
article	4	0.111111
articles	0	0.910239
articles	1	0.111111
articles	5	0.125000
calculate	0	1.442695
calculate	2	0.100000


In [10]:
%%writefile reducer2.py

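import sys

idf = None
idf_term = None  # term that the stored idf belongs to

for line in sys.stdin:
    try:
        term, article_id, tf = line.strip().split('\t', 2)
    except ValueError:
        continue

    article_id = int(article_id)
    tf = float(tf)

    # The dummy article_id=0 row sorts first within a term and carries its idf.
    if article_id == 0:
        idf, idf_term = tf, term
        continue

    if idf_term != term:
        continue  # no idf row was seen for this term

    print "%s\t%d\t%f" % (term, article_id, tf*idf)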

Overwriting reducer2.py


In [11]:
cat wiki.dat | python2 ./mapper2.py | sort  | python2 ./reducer2.py | head

apply	3	0.144270
article	2	0.091024
article	4	0.101138
articles	1	0.101138
articles	5	0.113780
calculate	2	0.144270
calculations	1	0.080150
calculations	2	0.072135
calculations	3	0.360674
depending	4	0.160299


In [12]:
%%bash

WIKI_DIR="/data/wiki/en_articles_part"
IDF_DIR="idf"
TFIDF_DIR="tf_idf"

hdfs dfs -rm -r -skipTrash ${IDF_DIR} > /dev/null
hdfs dfs -rm -r -skipTrash ${TFIDF_DIR} > /dev/null

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Calculate IDF" \
    -D mapreduce.job.reduces=4 \
    -files mapper1.py,reducer1.py,/datasets/stop_words_en.txt \
    -mapper "python mapper1.py" \
    -reducer "python reducer1.py" \
    -input ${WIKI_DIR} \
    -output ${IDF_DIR} > /dev/null
    
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="Calculate TF-IDF" \
    -D mapreduce.job.reduces=4 \
    -D mapreduce.partition.keypartitioner.options="-k1,1" \
    -D stream.num.map.output.key.fields=2 \
    -files mapper2.py,reducer2.py,/datasets/stop_words_en.txt \
    -mapper "python mapper2.py" \
    -reducer "python reducer2.py" \
    -input ${WIKI_DIR},${IDF_DIR} \
    -output ${TFIDF_DIR} \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner > /dev/null

hdfs dfs -cat ${TFIDF_DIR}/* | grep -P '^labor\t12\t' | cut -f3

0.000351


19/05/02 14:26:06 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/05/02 14:26:06 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/05/02 14:26:07 INFO mapred.FileInputFormat: Total input files to process : 1
19/05/02 14:26:07 INFO mapreduce.JobSubmitter: number of splits:2
19/05/02 14:26:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1556715347669_0006
19/05/02 14:26:07 INFO impl.YarnClientImpl: Submitted application application_1556715347669_0006
19/05/02 14:26:07 INFO mapreduce.Job: The url to track the job: http://5e7e19746200:8088/proxy/application_1556715347669_0006/
19/05/02 14:26:07 INFO mapreduce.Job: Running job: job_1556715347669_0006
19/05/02 14:26:13 INFO mapreduce.Job: Job job_1556715347669_0006 running in uber mode : false
19/05/02 14:26:13 INFO mapreduce.Job:  map 0% reduce 0%
19/05/02 14:26:29 INFO mapreduce.Job:  map 83% reduce 0%
19/05/02 14:26:30 INFO mapreduce.Job:  map 100% reduce 0%
19/05/02 14:26:36 INFO