# Real-World Applications: TF-IDF

In this task, Hadoop Streaming is used to process a dump of Wikipedia articles (/data/wiki/en_articles_part).

The purpose of this task is to calculate tf*idf for each (word, article) pair in the Wikipedia dump. Apply a stop-word filter to speed up the calculations. Term frequency (tf) is a function of a term (word) and a document (article):

`tf(term, doc_id) = Nt/N`,

where `Nt` is the number of occurrences of the term in the document and `N` is the total number of terms in the document (excluding stop words).
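
As a small illustration of this definition, the sketch below computes tf for a single document. The whitespace tokenizer, the `term_frequency` name, and the three-word stop list are assumptions made for the example; they are not part of the task.

```python
from collections import Counter


def term_frequency(text, stop_words):
    """Return {term: Nt / N} for one document, ignoring stop words."""
    # Assumption: lowercase whitespace tokenization, for illustration only.
    terms = [w for w in text.lower().split() if w not in stop_words]
    total = len(terms)  # N: number of terms left after stop-word removal
    return {term: count / total for term, count in Counter(terms).items()}


tf = term_frequency("the cat and the other cat saw a dog", {"the", "and", "a"})
print(tf["cat"])  # "cat" is 2 of the 5 remaining terms
```

Stop words are removed before `N` is counted, so they affect neither the numerator nor the denominator.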

Inverse document frequency (idf) is a function of a term:

`idf(term) = 1/log(1 + Dt)`,

where `Dt` is the number of documents in the dataset that contain the term.
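
This is the task's own definition of idf: it depends only on `Dt`, and the reducer further down evaluates it with the natural logarithm. A minimal sketch under that definition follows; the helper name `inverse_document_frequency` and the toy corpus of token sets are hypothetical.

```python
from math import log


def inverse_document_frequency(term, documents):
    """Return 1 / log(1 + Dt), where Dt counts the documents containing term."""
    dt = sum(1 for doc in documents if term in doc)
    if dt == 0:
        # log(1 + 0) == 0, so idf is undefined for a term that never occurs.
        raise ValueError("term does not occur in any document")
    return 1.0 / log(1 + dt)


docs = [{"labor", "market"}, {"labor", "union"}, {"market", "price"}]
idf_labor = inverse_document_frequency("labor", docs)  # Dt = 2, so 1 / ln(3)
print(idf_labor)
```

In the MapReduce jobs every emitted term occurs in at least one article, so the `Dt = 0` case should not arise there.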

Dataset location: */data/wiki/en_articles_part*

The stop-word list is in the file `/datasets/stop_words_en.txt`.

**Input format:** `article_id <tab> article_text`

**Output:** `tf*idf` for term `labor` and article_id 12

The result on the sample dataset:

```
0.000351
```


In [None]:
%%writefile mapper.py

import sys
import re

from collections import Counter


def get_stop_words():
    """
    Reads the stop-word file (shipped to the task via -files) into a set.
    """
    with open('stop_words_en.txt', 'r', encoding='utf-8') as f:
        return {w.strip().lower() for w in f}


stop_words = get_stop_words()

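# Main block: emit "word <tab> article_id <tab> tf" for every distinct word of an article
for line in sys.stdin:
    try:
        article_id, text = line.strip().split('\t', 1)
    except ValueError:
        # Skip lines that do not match "article_id <tab> article_text"
        continue

    # Split on whitespace (dropping punctuation next to it), lowercase, remove stop words
    tokens = (w.lower() for w in re.split(r"\W*\s+\W*", text, flags=re.UNICODE))
    words = [w for w in tokens if w not in stop_words]
    total_words = len(words)
    counter = Counter(words)
    for word, count in counter.items():
        tf = float(count) / float(total_words)
        print("%s\t%s\t%f" % (word, article_id, tf))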


In [None]:
%%writefile reducer_articles_with_word.py
"""
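Reducer: passes every (word, article_id, tf) record through unchanged and,
for each word, emits one extra record with article_id 0 whose value is the
number of articles that contain the word.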
"""

import sys


current_word = None
articles_count = 0
    
# Main block
for line in sys.stdin:
    try:
        word, article_id, tf = line.strip().split('\t', 2)
    except ValueError:  # skip malformed lines
        continue

    if current_word != word:
        if current_word:
            print("%s\t%d\t%d" % (current_word, 0, articles_count))

        articles_count = 0
        current_word = word

    print("%s\t%s\t%f" % (word, article_id, float(tf)))
    articles_count += 1

if current_word:
    print("%s\t%d\t%d" % (current_word, 0, articles_count))

In [None]:
%%writefile reducer_tfidf.py
"""
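Reducer: for each word, reads the count record (article_id 0), which sorts
first, and then emits tf*idf for every (word, article_id) record of that word.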
"""

import sys

from math import log


def tfidf(tf, dt):
    """
    Calculates tf*idf for a word and an article.

    tf: term frequency of the word in the article
    dt: number of articles that contain the word
    """
    idf = 1.0 / log(1 + dt)

    return tf * idf


articles_count = 0
    
# Main block
for line in sys.stdin:
    try:
        word, article_id, tf = line.strip().split('\t', 2)
        article_id, tf = int(article_id), float(tf)
    except ValueError:  # skip malformed lines
        continue

    if article_id == 0:
        articles_count = tf
    else:
        print("%s\t%s\t%f" % (word, article_id, tfidf(tf, articles_count)))

In [None]:
%%bash

INPUT="/data/wiki/en_articles_part"
OUT_DIR="coursera_tfidf"
NUM_REDUCERS=4

hdfs dfs -rm -r -skipTrash ${OUT_DIR}/count > /dev/null

# Job 1: compute tf for each (word, article) and count the articles that contain each word.
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="TF-IDF (Count)" \
    -D mapreduce.job.reduces=${NUM_REDUCERS} \
    -files mapper.py,reducer_articles_with_word.py,/datasets/stop_words_en.txt \
    -mapper "python3 mapper.py" \
    -reducer "python3 reducer_articles_with_word.py" \
    -input ${INPUT} \
    -output ${OUT_DIR}/count > /dev/null

hdfs dfs -rm -r -skipTrash ${OUT_DIR}/tfidf > /dev/null

# Calculate tf-idf
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.name="TF-IDF (Calculate)" \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D stream.num.map.output.key.fields=2 \
    -D mapreduce.partition.keycomparator.options="-k1,1 -k2,2n" \
    -D mapreduce.job.reduces=1 \
    -files reducer_tfidf.py \
    -mapper "cat" \
    -reducer "python3 reducer_tfidf.py" \
    -input ${OUT_DIR}/count \
    -output ${OUT_DIR}/tfidf > /dev/null

# Print TF-IDF for word 'labor' and article with id 12
hdfs dfs -cat ${OUT_DIR}/tfidf/part-00000 | awk -F'\t' '$1 == "labor" && $2 == "12" { print $3; exit }'
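
The second job depends on record order: for each word, the count record written with `article_id` 0 has to reach `reducer_tfidf.py` before that word's tf records, which is what comparing the key by word and then numerically by `article_id` provides. The sketch below imitates that ordering locally with Python's `sorted`. Hadoop is not involved, the sample rows and the `sort_key` helper are made up for illustration, and it assumes that no real article has id 0.

```python
from math import log

# Hypothetical output of the first job: (word, article_id, value) rows.
# Rows with article_id 0 carry Dt, the number of articles containing the word.
rows = [
    ("labor", 12, 0.001),
    ("labor", 0, 3),
    ("labor", 7, 0.002),
    ("apple", 0, 1),
    ("apple", 12, 0.004),
]


def sort_key(row):
    """Order by word as text, then by article_id as a number."""
    word, article_id, _ = row
    return (word, article_id)


results = {}
dt = None
for word, article_id, value in sorted(rows, key=sort_key):
    if article_id == 0:
        dt = value  # the count record arrives first for each word
    else:
        results[(word, article_id)] = value / log(1 + dt)

for key in sorted(results):
    print(key, results[key])
```

Because the count record always precedes the tf records of the same word, the reducer only has to remember one number at a time instead of buffering all records for a word.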