# Term frequency-inverse document frequency (tf-idf)
Words that occur frequently across many documents typically carry little information
that helps to tell the documents apart, so tf-idf downweights them.
The tf-idf can be defined as the product of the term frequency and the inverse document frequency:

$$\text{tf-idf}(t,d) = tf(t,d)\cdot idf(t)$$

Here $tf(t, d)$ is the frequency of term $t$ in document $d$. The inverse document frequency $idf(t)$ can be calculated as:

$$idf(t) = 1 + \log\frac{1+n}{1+df(t)}$$

where $n$ is the total number of documents and $df(t)$ is the number of documents
that contain the term $t$. Note that if $df(t)=n$, i.e. the term occurs in every document, then $idf(t)$ takes its minimum value of 1.

The logarithm dampens the ratio so that terms with a low document frequency are not given too much weight.
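
As a quick check of the formula above, the following sketch evaluates $idf(t)$ for a few document frequencies. The values of `n` and `df` are made up for illustration and are not taken from the documents used later in this notebook.

```python
import numpy as np

n = 3                      # assumed total number of documents
df = np.array([1, 2, 3])   # hypothetical document frequencies of three terms

# idf(t) = 1 + log((1 + n) / (1 + df(t))), using the natural logarithm
idf = 1.0 + np.log((1.0 + n) / (1.0 + df))

for d, w in zip(df, idf):
    print("df=%d: idf=%.4f" % (d, w))
# df=1: idf=1.6931
# df=2: idf=1.2877
# df=3: idf=1.0000
```

The term that appears in every document ($df(t)=n$) gets the minimum weight of 1, and the weight grows only slowly as the term becomes rarer.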

In [1]:
import numpy as np
from stemming.porter2 import stem
from sklearn.feature_extraction.text import TfidfVectorizer

## Create a set of documents

In [2]:
docs = [
    'The sun is shining and thus it shines',
    'The weather is sweet',
    'The sun is shining and the weather is sweet']

## Define a stemming tokenizer

In [3]:
def tokenizer_porter(doc):
    """Split a document on whitespace and reduce each word to its Porter2 stem."""
    return [stem(word) for word in doc.split()]

## Vectorize the documents with tf-idf

In [4]:
# Text is lowercased, stemmed by tokenizer_porter, and then English stop words are dropped
vectorizer = TfidfVectorizer(stop_words='english', tokenizer=tokenizer_porter)

In [5]:
bag = vectorizer.fit_transform(docs).toarray()
bag

array([[ 0.89442719,  0.4472136 ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.70710678,  0.70710678],
       [ 0.5       ,  0.5       ,  0.5       ,  0.5       ]])
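
The rows above are not raw $tf(t,d)\cdot idf(t)$ products: with its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` also scales each row to unit Euclidean length. The sketch below reproduces the matrix by hand with NumPy. The count matrix is written out manually and assumes the column order shine, sun, sweet, weather, with 'shining' and 'shines' both counted under the stem 'shine'.

```python
import numpy as np

# Term counts after stemming and stop-word removal (entered by hand).
# Assumed column order: shine, sun, sweet, weather
counts = np.array([
    [2, 1, 0, 0],   # 'shining' + 'shines' -> shine (2), sun (1)
    [0, 0, 1, 1],   # sweet, weather
    [1, 1, 1, 1],   # shine, sun, sweet, weather
], dtype=float)

n = counts.shape[0]                          # number of documents
df = (counts > 0).sum(axis=0)                # documents containing each term
idf = 1.0 + np.log((1.0 + n) / (1.0 + df))   # smoothed idf from the formula above

tfidf = counts * idf                                     # tf(t, d) * idf(t)
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)    # L2-normalize each row

print(tfidf.round(8))
```

Every term here occurs in two of the three documents, so all terms share the same idf and the normalized rows depend only on the term counts.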

## Vocabulary

In [6]:
vectorizer.vocabulary_

{u'shine': 0, u'sun': 1, u'sweet': 2, u'weather': 3}
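
The vocabulary maps each stemmed term to its column index in the tf-idf matrix. To label the columns, the terms can be sorted by that index. This is a minimal sketch that copies the dictionary from the output above rather than reading it from the fitted vectorizer.

```python
# Copy of the mapping shown above (term -> column index)
vocabulary = {'shine': 0, 'sun': 1, 'sweet': 2, 'weather': 3}

# Order the terms by their column index
terms = sorted(vocabulary, key=vocabulary.get)
print(terms)  # ['shine', 'sun', 'sweet', 'weather']
```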