# Term Frequency (TF)


TF is the number of times a term appears in a document divided by the total number of terms in that document.

Say the term "cat" appears 12 times in a 100-word document. The TF is then 12 / 100 = 0.12.
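As a quick check of the arithmetic above, here is a minimal sketch that builds a hypothetical 100-word document and computes the term frequency of "cat". The filler word "other" and the variable names are made up for illustration.

```python
from collections import Counter

# Hypothetical 100-word document: "cat" 12 times plus 88 filler words.
words = ["cat"] * 12 + ["other"] * 88

# Count occurrences of each term, then divide by the document length.
counts = Counter(words)
tf_cat = counts["cat"] / len(words)
print(tf_cat)  # 0.12
```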

# Inverse Document Frequency (IDF)

IDF weights a term by how rare it is across all documents in the corpus. Terms that occur in few documents have a high IDF score.
```
log(total number of documents / number of documents containing the term)
```

Say there are 10 documents and 3 of them contain the term "cat". The IDF is then log10(10/3) ≈ 0.523.
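The same example can be sketched in a couple of lines, assuming a base-10 logarithm as in the text. The variable names are illustrative only.

```python
import math

total_documents = 10
documents_with_term = 3  # documents that contain "cat"

# IDF with a base-10 logarithm.
idf_cat = math.log10(total_documents / documents_with_term)
print(round(idf_cat, 4))  # 0.5229
```

A term that appears in every document would get `log10(10/10) = 0`, which is why very common words contribute nothing under this formula.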

In [157]:
documents = [
    "the cat sat on my face",
    "the dog sat on my bed"
]

In [158]:
parsed_documents = [document.split(' ') for document in documents]
parsed_documents

[['the', 'cat', 'sat', 'on', 'my', 'face'],
 ['the', 'dog', 'sat', 'on', 'my', 'bed']]

In [163]:
from collections import Counter
import math


df_scores = {}
tf_scores = []
for document in documents:
    result = {}
    words = document.split(' ')

    # Count how many times each term occurs in the document.
    counter = Counter(words)
    for term, freq in counter.items():
        # The term frequency is the number of occurrences of the term
        # in the current document, divided by the number of words in
        # the document.
        tf = freq / len(words)
        
        # Store the result for each document.
        result[term] = tf
        
        # For each unique term, add the document_frequency.
        # This will give the number of documents with the given term
        # for the IDF calculation.
        df_scores[term] = df_scores.get(term, 0) + 1
    tf_scores.append(result)

docs = len(documents)

for item in tf_scores:
    for term, tf in item.items():
        df = df_scores[term]
        # This is the textbook formula; sklearn implements it differently.
        idf = math.log10(docs/df)
        tfidf = tf * idf
        print(f'tfidf for {term} is {tfidf}')
    print()

tfidf for the is 0.0
tfidf for cat is 0.050171665943996864
tfidf for sat is 0.0
tfidf for on is 0.0
tfidf for my is 0.0
tfidf for face is 0.050171665943996864

tfidf for the is 0.0
tfidf for dog is 0.050171665943996864
tfidf for sat is 0.0
tfidf for on is 0.0
tfidf for my is 0.0
tfidf for bed is 0.050171665943996864
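The printed values follow directly from the two formulas. As a minimal check, assuming the same two six-word documents, a term unique to one document gets `(1/6) * log10(2/1)`, while a term shared by both gets `(1/6) * log10(2/2) = 0`. The variable names are illustrative.

```python
import math

n_docs = 2
n_words = 6  # each document has six words

# "cat" occurs once in its document and appears in 1 of the 2 documents.
tfidf_cat = (1 / n_words) * math.log10(n_docs / 1)

# "the" occurs once per document but appears in both documents,
# so its IDF is log10(1) = 0 and the product vanishes.
tfidf_the = (1 / n_words) * math.log10(n_docs / 2)

print(tfidf_cat)
print(tfidf_the)  # 0.0
```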



From the scikit-learn `TfidfTransformer` documentation (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer):

If `smooth_idf=True` (the default), the constant “1” is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.

If `smooth_idf=False`, idf(t) = log [ n / df(t) ] + 1.

(Note that the idf formula above differs from the standard textbook notation that defines the idf as idf(t) = log [ n / (df(t) + 1) ]).

In [160]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [161]:
vectorizer = TfidfVectorizer(lowercase=False, use_idf=True, smooth_idf=True)
response = vectorizer.fit_transform(documents)
print(response)

  (0, 7)	0.35464863330313684
  (0, 1)	0.49844627974580596
  (0, 6)	0.35464863330313684
  (0, 5)	0.35464863330313684
  (0, 4)	0.35464863330313684
  (0, 3)	0.49844627974580596
  (1, 7)	0.35464863330313684
  (1, 6)	0.35464863330313684
  (1, 5)	0.35464863330313684
  (1, 4)	0.35464863330313684
  (1, 2)	0.49844627974580596
  (1, 0)	0.49844627974580596
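The sklearn numbers differ from the textbook ones for more than the smoothing term. The sketch below reproduces them in plain Python under a few assumptions about the defaults: the term frequency is the raw count (not divided by document length), the logarithm is natural, each row is L2-normalised, and the column indices follow the alphabetically sorted vocabulary. It is an illustration of the formula quoted above, not sklearn's actual implementation.

```python
import math
from collections import Counter

documents = [
    "the cat sat on my face",
    "the dog sat on my bed",
]
tokenized = [doc.split(" ") for doc in documents]
n = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(term for words in tokenized for term in set(words))

# Alphabetically sorted vocabulary (assumed to mirror sklearn's column order).
vocabulary = sorted(df)

# Smoothed IDF with the natural logarithm.
idf = {term: math.log((1 + n) / (1 + df[term])) + 1 for term in vocabulary}

rows = []
for words in tokenized:
    counts = Counter(words)
    # Raw count times IDF, then L2-normalise the row.
    weights = {term: counts[term] * idf[term] for term in counts}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    rows.append({term: w / norm for term, w in weights.items()})

for row in rows:
    for term in sorted(row, key=vocabulary.index):
        print(vocabulary.index(term), term, round(row[term], 4))
    print()
```

Under these assumptions, terms shared by both documents come out at about 0.3546 and terms unique to one document at about 0.4984, matching the sparse matrix printed above. Shared terms are no longer zero because of the `+ 1` added to the IDF.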


# References

- https://triton.ml/blog/tf-idf-from-scratch
- https://github.com/mayank408/TFIDF/blob/master/TFIDF.ipynb
- https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/