# 10 TF-IDF Weights

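TF-IDF (Term Frequency - Inverse Document Frequency) is a numerical statistic that reflects how important a word is to a document within a collection (corpus). The weight grows with the number of times the word appears in the document and is offset by the number of documents in the corpus that contain the word.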
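
As a rough guide, one common textbook form of the weight is shown below. The symbols are introduced here for illustration: $\mathrm{tf}(t, d)$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents in the corpus, and $\mathrm{df}(t)$ is the number of documents that contain $t$. Libraries often use variants of this formula; scikit-learn's default, for example, smooths the IDF term as $\ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1$ and then L2-normalizes each document vector.

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
$$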

Reference: [https://en.wikipedia.org/wiki/Tf%E2%80%93idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

## Dataset

In [1]:
corpus = ['The dog kicks a green ball.', 
          'The boy with red hat kicks a red ball.', 
          'The man sell a blue ball']

## Term Frequency

In [2]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
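# Learn the vocabulary and count tokens; the result is a sparse matrix
vectorized_X = vectorizer.fit_transform(corpus)
# Convert to a dense ndarray: one row per document, one column per term
frequencies = vectorized_X.toarray()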
frequencies

array([[1, 0, 0, 1, 1, 0, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 1, 1, 0, 2, 0],
       [1, 1, 0, 0, 0, 0, 0, 1, 0, 1]])

In [3]:
dict(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[0]))

{'ball': 0,
 'blue': 1,
 'boy': 2,
 'dog': 3,
 'green': 4,
 'hat': 5,
 'kicks': 6,
 'man': 7,
 'red': 8,
 'sell': 9}

In [4]:
# Sum over documents (axis=0): total count of each term across the whole corpus
tf = np.sum(frequencies, axis=0)
tf

array([3, 1, 1, 1, 1, 1, 2, 1, 2, 1])

In [5]:
for token, index in vectorizer.vocabulary_.items():
    print(f'{token}: {tf[index]}')

dog: 1
kicks: 2
green: 1
ball: 3
boy: 1
red: 2
hat: 1
man: 1
sell: 1
blue: 1
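
The totals above count every occurrence of a term, while the IDF part of the weight depends on how many documents contain the term at least once. The sketch below contrasts the two on the same corpus. The names `counts`, `terms`, `total_count`, and `doc_freq` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['The dog kicks a green ball.',
          'The boy with red hat kicks a red ball.',
          'The man sell a blue ball']

vectorizer = CountVectorizer(stop_words='english')
counts = vectorizer.fit_transform(corpus).toarray()

# Terms ordered by their column index in the count matrix
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Total occurrences of each term across the corpus
total_count = counts.sum(axis=0)
# Document frequency: number of documents containing the term at least once
doc_freq = (counts > 0).sum(axis=0)

for term, total, df in zip(terms, total_count, doc_freq):
    print(f'{term}: total={total}, df={df}')
```

Here `red` has a total count of 2 but a document frequency of 1, because both occurrences are in the second sentence.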


## TF-IDF

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
vectorizer.fit_transform(corpus).todense()[0]

matrix([[0.34520502, 0.        , 0.        , 0.5844829 , 0.5844829 ,
         0.        , 0.44451431, 0.        , 0.        , 0.        ]])

In [7]:
vectorizer.vocabulary_

{'dog': 3,
 'kicks': 6,
 'green': 4,
 'ball': 0,
 'boy': 2,
 'red': 8,
 'hat': 5,
 'man': 7,
 'sell': 9,
 'blue': 1}
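
To see where the weights in the first row come from, the sketch below rebuilds them from raw counts. It assumes the default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`). The names `manual_tfidf`, `doc_freq`, and `reference` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['The dog kicks a green ball.',
          'The boy with red hat kicks a red ball.',
          'The man sell a blue ball']

# Raw term counts: one row per document, one column per term
counts = CountVectorizer(stop_words='english').fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Number of documents in which each term appears
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF, assuming scikit-learn's default smooth_idf=True
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each row to unit L2 norm
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against the library result
reference = TfidfVectorizer(stop_words='english').fit_transform(corpus).toarray()
print(np.allclose(manual_tfidf, reference))
print(manual_tfidf[0].round(4))
```

In the first document, `ball` appears in all three sentences, so its IDF is the minimum value of 1 and it gets the lowest weight among the terms present. `dog` and `green` appear in only one sentence each and get the highest weight.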