# Term frequency-inverse document frequency (tf-idf)
Words that occur in many documents typically carry little information for telling those documents apart.
The tf-idf weighting scales such terms down. It is defined as the product of the term frequency and the inverse document frequency:

$$\text{tf-idf}(t,d) = tf(t,d)\cdot idf(t)$$

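Here $tf(t, d)$ is the frequency of term $t$ in document $d$. Spark ML's `IDF` estimator, which is used below, computes the inverse document frequency as:

$$idf(t) = \log\frac{1+n}{1+df(t)}$$

where $n$ is the total number of documents, and $df(t)$ is the number of documents
that contain the term $t$. Note that if $df(t)=n$ then $idf(t)$ has its minimum value of 0, so a term that appears in every document gets zero weight. Some other implementations (scikit-learn, for example) add 1 to this quantity, which makes the minimum value 1 instead.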

The logarithm dampens the ratio so that terms with a low document frequency are not given too much weight.
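
The dense vectors that Spark prints at the end of this notebook can be checked by hand. The sketch below, in plain Scala with no Spark dependency, evaluates $\log[(1+n)/(1+df(t))]$, which is the smoothed form given in the Spark ML documentation for `IDF` (no leading 1 is added), for the three document frequencies that are possible with $n=3$. The helper name `smoothedIdf` is hypothetical and only used for illustration.

```scala
// Smoothed idf as documented for Spark ML's IDF: log((n + 1) / (df + 1)).
// `smoothedIdf` is a hypothetical helper name, not part of Spark.
def smoothedIdf(numDocs: Long, docFreq: Long): Double =
  math.log((numDocs + 1.0) / (docFreq + 1.0))

val n = 3L
val idfInOneDoc  = smoothedIdf(n, 1L) // log(4/2) = log(2)
val idfInTwoDocs = smoothedIdf(n, 2L) // log(4/3)
val idfInAllDocs = smoothedIdf(n, 3L) // log(4/4) = 0

println(Seq(idfInOneDoc, idfInTwoDocs, idfInAllDocs).mkString(", "))
```

The rarer the term, the larger the weight: a term found in one of three documents gets $\log 2$, while a term found in all three gets 0.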

In [1]:
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.{Tokenizer,StopWordsRemover,CountVectorizer,IDF}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.linalg.Vector

## Create a set of documents

In [2]:
val df = List( (0,"The sun is shining"),
                (1,"The weather is sweet, sweet"),
                (2,"The sun is shining and the weather is sweet")).toDF("id","doc")

df = [id: int, doc: string]



In [3]:
df.show()

+---+--------------------+
| id|                 doc|
+---+--------------------+
|  0|  The sun is shining|
|  1|The weather is sw...|
|  2|The sun is shinin...|
+---+--------------------+



## Vectorize the documents

In [4]:
val tokenizer = new Tokenizer().
  setInputCol("doc").
  setOutputCol("raw_words")

tokenizer = tok_852b51e088ea
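
Spark's `Tokenizer` lowercases its input and splits it on whitespace; it does not strip punctuation. A rough stand-in in plain Scala (the function name `simpleTokenize` is hypothetical, and this is a sketch of the behaviour rather than Spark's implementation) shows what that means for the second document:

```scala
// Plain-Scala stand-in for what Tokenizer does to a single string:
// lowercase, then split on whitespace. Punctuation stays attached.
def simpleTokenize(text: String): List[String] =
  text.toLowerCase.split("\\s").toList

val tokens = simpleTokenize("The weather is sweet, sweet")
// "sweet," (with the comma) and "sweet" are two different tokens
println(tokens)
```

Because the comma stays attached, `sweet,` and `sweet` end up as separate vocabulary entries later on. If that is not wanted, Spark also provides `RegexTokenizer`, which splits on a configurable regular expression.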



In [5]:
val remover = new StopWordsRemover().
  setInputCol("raw_words").
  setOutputCol("filtered_words")

remover = stopWords_55c145ba7646



In [6]:
val cv = new CountVectorizer().
  setInputCol("filtered_words").
  setOutputCol("rawFeatures")

cv = cntVec_48fed1ba3ba1



In [7]:
val idf = new IDF().
  setInputCol("rawFeatures").
  setOutputCol("features")

idf = idf_3bd9d5e92022



In [8]:
val pipeline = new Pipeline().
  setStages(Array(tokenizer, remover, cv, idf))

pipeline = pipeline_cd8cd5fcbc6f



In [9]:
val df_v = pipeline.fit(df).transform(df).select("id","doc","features")
df_v.show()

+---+--------------------+--------------------+
| id|                 doc|            features|
+---+--------------------+--------------------+
|  0|  The sun is shining|(5,[1,2],[0.28768...|
|  1|The weather is sw...|(5,[0,3,4],[0.287...|
|  2|The sun is shinin...|(5,[0,1,2,3],[0.2...|
+---+--------------------+--------------------+



df_v = [id: int, doc: string ... 1 more field]



In [10]:
df_v.select("features").collect.map(row => row(0).asInstanceOf[Vector].toDense)

[[0.0,0.28768207245178085,0.28768207245178085,0.0,0.0], [0.28768207245178085,0.0,0.0,0.28768207245178085,0.6931471805599453], [0.28768207245178085,0.28768207245178085,0.28768207245178085,0.28768207245178085,0.0]]
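
The values above can be reproduced without Spark. The following is a minimal sketch in plain Scala, under two assumptions: tokenization is lowercasing plus a whitespace split, and the stop list is a tiny hand-picked set (`the`, `is`, `and`) that happens to cover this corpus, whereas `StopWordsRemover` uses a much longer default English list. The names `docs`, `stopWords`, `docFreq`, `handIdf` and `tfidf` are illustrative only. The result is keyed by term rather than by vocabulary index, since the index order chosen by `CountVectorizer` is not reproduced here.

```scala
// Toy corpus from the notebook
val docs = Seq(
  "The sun is shining",
  "The weather is sweet, sweet",
  "The sun is shining and the weather is sweet")

// Assumption: a hand-picked stop list that is sufficient for this corpus only
val stopWords = Set("the", "is", "and")

// Lowercase, split on whitespace, drop stop words
val filtered: Seq[List[String]] =
  docs.map(_.toLowerCase.split("\\s").toList.filterNot(stopWords))

// Document frequency: number of documents that contain each term
val docFreq: Map[String, Int] =
  filtered.flatMap(_.distinct).groupBy(identity).map { case (t, ts) => t -> ts.size }

val numDocs = docs.size

// Smoothed idf: log((n + 1) / (df + 1))
def handIdf(term: String): Double =
  math.log((numDocs + 1.0) / (docFreq(term) + 1.0))

// tf-idf per document: raw term count times idf
val tfidf: Seq[Map[String, Double]] = filtered.map { words =>
  words.groupBy(identity).map { case (t, ts) => t -> ts.size * handIdf(t) }
}

tfidf.foreach(println)
```

Four terms (`sun`, `shining`, `weather`, `sweet`) each occur in two of the three documents and get $\log(4/3) \approx 0.2877$. The fifth vocabulary entry is `sweet,` with its trailing comma: it occurs only in the second document, which is where the single $\log 2 \approx 0.6931$ comes from.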