TFIDF calculation

This document explains how to compute TF-IDF with Apache Hive/Hivemall.

What you need to compute TF-IDF is a table/view composing (docid, word) pair, 2 views, and 1 query.

Note that this feature is supported since Hivemall v0.3-beta3 or later. Macro is supported since Hive 0.12 or later.

Define macros used in the TF-IDF computation

create temporary macro max2(x INT, y INT)
if(x>y,x,y);

-- create temporary macro idf(df_t INT, n_docs INT)
-- (log(10, CAST(n_docs as FLOAT)/max2(1,df_t)) + 1.0);

create temporary macro tfidf(tf FLOAT, df_t INT, n_docs INT)
tf * (log(10, CAST(n_docs as FLOAT)/max2(1,df_t)) + 1.0);

Data preparation

To calculate TF-IDF, you need to prepare a relation consists of (docid,word) tuples.

create external table wikipage (
  docid int,
  page string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

cd ~/tmp
wget https://gist.githubusercontent.com/myui/190b91a3a792ccfceda0/raw/327acd192da4f96da8276dcdff01b19947a4373c/tfidf_test.tsv

LOAD DATA LOCAL INPATH '/home/myui/tmp/tfidf_test.tsv' INTO TABLE wikipage;

create or replace view wikipage_exploded
as
select
  docid, 
  word
from
  wikipage LATERAL VIEW explode(tokenize(page,true)) t as word
where
  not is_stopword(word);

You can download the data of the wikipage table from this link.

Define views of TF/DF

create or replace view term_frequency 
as
select
  docid, 
  word,
  freq
from (
select
  docid,
  tf(word) as word2freq
from
  wikipage_exploded
group by
  docid
) t 
LATERAL VIEW explode(word2freq) t2 as word, freq;

create or replace view document_frequency
as
select
  word, 
  count(distinct docid) docs
from
  wikipage_exploded
group by
  word;

TF-IDF calculation for each docid/word pair

-- set the total number of documents
select count(distinct docid) from wikipage;
set hivevar:n_docs=3;

select
  tf.docid,
  tf.word, 
  -- tf.freq * (log(10, CAST(${n_docs} as FLOAT)/max2(1,df.docs)) + 1.0) as tfidf
  tfidf(tf.freq, df.docs, ${n_docs}) as tfidf
from
  term_frequency tf 
  JOIN document_frequency df ON (tf.word = df.word)
order by 
  tfidf desc;

The result will be as follows:

docid  word     tfidf
1       justice 0.1641245850805637
3       knowledge       0.09484606645205085
2       action  0.07033910867777095
1       law     0.06564983513276658
1       found   0.06564983513276658
1       religion        0.06564983513276658
1       discussion      0.06564983513276658
  ...
  ...
2       act     0.017584777169442737
2       virtues 0.017584777169442737
2       well    0.017584777169442737
2       willingness     0.017584777169442737
2       find    0.017584777169442737
2       1       0.014001086678120098
2       experience      0.014001086678120098
2       often   0.014001086678120098

The above result is considered to be appropriate as docid 1, 2, and 3 are the Wikipedia entries of Justice, Wisdom, and Knowledge, respectively.

Analytics

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TFIDF calculation

Define macros used in the TF-IDF computation

Data preparation

Define views of TF/DF

TF-IDF calculation for each docid/word pair

Clone this wiki locally