# TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how relevant a word is to a document within a collection of documents (a corpus). The weight increases in proportion to the number of times the word appears in the document, but it is offset by how frequently the word occurs across the corpus (data set).

### Terminology

- **Term Frequency:** The term frequency is the number of times a given word *t* occurs in document *d*. A word becomes more relevant to a document the more often it appears there, which is intuitive. Since the ordering of terms is not significant, we can describe the text as a vector in the bag-of-words model: for each distinct term in the document, there is an entry whose value is the term frequency.

The weight of a term that occurs in a document is simply proportional to the term frequency.

![tfidf](tfidf.png)
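
The term-frequency definition above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization, lowercasing, and normalization by document length; the helper name `term_frequency` is hypothetical and not part of any library.

```python
from collections import Counter


def term_frequency(document: str) -> dict[str, float]:
    """Relative frequency of each term in a whitespace-tokenized document."""
    # Assumption: lowercase and split on whitespace to get the tokens.
    tokens = document.lower().split()
    counts = Counter(tokens)  # raw count of each distinct term
    total = len(tokens)
    # Divide each raw count by the document length.
    return {term: count / total for term, count in counts.items()}


tf = term_frequency("Geeks for geeks")
print(tf)
```

Here "geeks" accounts for two of the three tokens and "for" for one, so their relative frequencies are 2/3 and 1/3. Dropping the division by `total` gives the raw-count variant of term frequency.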


- **Document Frequency:** This measures how common a term is across the whole corpus, and it is very similar to TF. The only difference is that TF counts the occurrences of a term *t* within a single document *d*, while DF counts the documents, out of the N documents in the corpus, in which *t* occurs. In other words, DF is the number of documents in which the word is present.

df(t) = number of documents in which t occurs
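
As a rough sketch of the document-frequency count, assuming the same whitespace tokenization and lowercasing as an illustrative choice (the function name `document_frequency` and the variable `toy_corpus` are hypothetical):

```python
def document_frequency(term: str, corpus: list[str]) -> int:
    """Number of documents in `corpus` that contain `term` at least once."""
    # A document counts once, no matter how often the term repeats in it.
    return sum(1 for doc in corpus if term in doc.lower().split())


toy_corpus = ["Geeks for geeks", "Geeks", "r2j"]
df_geeks = document_frequency("geeks", toy_corpus)
df_for = document_frequency("for", toy_corpus)
print(df_geeks, df_for)
```

Although "geeks" occurs three times in total, it appears in only two documents, so its document frequency is 2, while "for" has a document frequency of 1.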

- **Inverse Document Frequency:** This measures how informative a word is. The key aim of a search is to locate the documents that best match the query. Since TF treats all terms as equally significant, term frequencies alone are not enough to measure the weight of a term in a document. First, find the document frequency of a term *t* by counting the number of documents containing the term:

df(t) = N(t)

where,
- df(t) = Document frequency of a term t
- N(t) = Number of documents containing the term t

Term frequency is the number of occurrences of a term in a single document only, whereas document frequency is the number of separate documents in which the term appears, so it depends on the entire corpus.

**The IDF of a word is the number of documents in the corpus divided by the document frequency of the word.**

idf(t) = N/df(t) = N/N(t)

The most common words should be considered less significant, but this raw ratio, which can become very large for rare terms, is too harsh a weighting. We therefore take the logarithm (with base 2) of the inverse document frequency; the choice of base only rescales the scores. So the *idf* of the term *t* becomes:

idf(t) = log(N / df(t))
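
The logarithmic formula above could be sketched as follows. This is an illustration only, assuming whitespace tokenization and a base-2 logarithm by default; the name `inverse_document_frequency` and its `base` parameter are hypothetical.

```python
import math


def inverse_document_frequency(term: str, corpus: list[str], base: float = 2) -> float:
    """idf(t) = log(N / df(t)) for a whitespace-tokenized corpus."""
    n_docs = len(corpus)
    # df(t): number of documents containing the term at least once.
    df = sum(1 for doc in corpus if term in doc.lower().split())
    if df == 0:
        # The ratio N / df(t) is undefined for a term that never occurs.
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(n_docs / df, base)


toy_corpus = ["Geeks for geeks", "Geeks", "r2j"]
print(inverse_document_frequency("geeks", toy_corpus))  # log2(3/2)
print(inverse_document_frequency("r2j", toy_corpus))    # log2(3/1)
```

A term that appears in every document gets N / df(t) = 1 and therefore an idf of 0, which is how the logarithm suppresses words that carry no discriminating information.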

- **Computation**: *tf-idf* is one of the most widely used metrics for determining how significant a term is to a document in a corpus. It is a weighting scheme that assigns a weight to each word in a document based on its term frequency (*tf*) and its inverse document frequency (*idf*). Words with higher weights are deemed more significant.

Usually, the *tf-idf* weight consists of two terms:

- **Normalized Term Frequency (tf)**
- **Inverse Document Frequency (idf)**


```
tf-idf(t, d) = tf(t, d) * idf(t)
```

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# define the documents d0, d1, and d2 as strings; together they
# will form the corpus.

d0 = "Geeks for geeks"
d1 = "Geeks"
d2 = "r2j"


# merge documents into a single corpus


string = [d0, d1, d2]


In [3]:
# create the vectorizer; its fit_transform() method returns the tf-idf values.

tfidf = TfidfVectorizer()

# get tf-idf values
result = tfidf.fit_transform(string)

In [4]:
# display idf values of the words present in the corpus.

print("idf values:")
for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(ele1, ":", ele2)

idf values:
for : 1.6931471805599454
geeks : 1.2876820724517808
r2j : 1.6931471805599454
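
These values are larger than log(N / df(t)) would give. With its default settings (`smooth_idf=True`), `TfidfVectorizer` uses a smoothed, natural-log variant, idf(t) = ln((1 + N) / (1 + df(t))) + 1. The check below is a small sketch of that formula using only the standard library; the function name `smoothed_idf` is hypothetical.

```python
import math


def smoothed_idf(n_docs: int, df: int) -> float:
    """Smoothed idf as used by scikit-learn's default settings."""
    # Adding 1 to numerator and denominator avoids division by zero;
    # the trailing +1 keeps terms that occur in every document from
    # being weighted at exactly zero.
    return math.log((1 + n_docs) / (1 + df)) + 1


idf_for = smoothed_idf(n_docs=3, df=1)    # "for" and "r2j" occur in 1 document
idf_geeks = smoothed_idf(n_docs=3, df=2)  # "geeks" occurs in 2 documents
print(idf_for, idf_geeks)
```

The results agree with the printed `idf_` values, roughly 1.6931 for "for" and "r2j" and 1.2877 for "geeks".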


In [6]:
# Display tf-idf values along with indexing.


# get indexing
print("Word indexes:")
print(tfidf.vocabulary_)

# display tf-idf values
print("\ntf-idf value:")
print(result)

# in matrix form
print("\ntf-idf values in matrix form:")
print(result.toarray())

Word indexes:
{'geeks': 1, 'for': 0, 'r2j': 2}

tf-idf value:
  (0, 0)	0.5493512310263033
  (0, 1)	0.8355915419449176
  (1, 1)	1.0
  (2, 2)	1.0

tf-idf values in matrix form:
[[0.54935123 0.83559154 0.        ]
 [0.         1.         0.        ]
 [0.         0.         1.        ]]
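
Each row of this matrix has unit length because `TfidfVectorizer` applies L2 normalization by default (`norm='l2'`) after multiplying raw term counts by the smoothed idf. The sketch below reproduces the first row by hand, assuming whitespace tokenization in place of scikit-learn's tokenizer; the function name `tfidf_row` and the hand-written `doc_freq` mapping are hypothetical.

```python
import math
from collections import Counter


def tfidf_row(document: str, n_docs: int, doc_freq: dict[str, int]) -> dict[str, float]:
    """L2-normalized tf-idf weights for one document (smoothed idf, raw counts)."""
    counts = Counter(document.lower().split())
    # weight = raw count * (ln((1 + N) / (1 + df)) + 1)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    # Divide by the Euclidean norm so the row has length 1.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


# Document frequencies for the three-document corpus d0, d1, d2.
doc_freq = {"geeks": 2, "for": 1, "r2j": 1}
row0 = tfidf_row("Geeks for geeks", n_docs=3, doc_freq=doc_freq)
print(row0)
```

The weights for "for" and "geeks" come out at about 0.5494 and 0.8356, matching the first row above. Documents `d1` and `d2` each contain a single distinct term, so normalization maps their only weight to 1.0.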


### Python Implementation

In [4]:
import numpy as np
import pandas as pd

In [1]:
# small corpus
corpus = [
    "data science is one of the most important fields of science",
    "this is one of the best data science cources",
    "data scientists analyze data"
]

In [3]:
# creating the word set

words_set = set()

for doc in corpus:
    words = doc.split(" ")
    words_set = words_set.union(set(words))
    
print("Number of words in the corpus: ", len(words_set))
print("The words in the corpus: \n", words_set)

Number of words in the corpus:  14
The words in the corpus: 
 {'one', 'of', 'fields', 'is', 'this', 'the', 'important', 'best', 'science', 'most', 'cources', 'scientists', 'analyze', 'data'}


##### Computing Term Frequency

In [8]:
# Number of documents in the corpus
n_docs = len(corpus)

# Number of unique words in the corpus
n_words_set = len(words_set)

df_tf = pd.DataFrame(np.zeros((n_docs, n_words_set)), columns=list(words_set))

# Compute Term Frequency (TF)
for i in range(n_docs):
    # Words in the document
    words = corpus[i].split(" ")
    
    for w in words:
        # use .loc to avoid chained assignment, which may not update the DataFrame
        df_tf.loc[i, w] += 1 / len(words)

df_tf

|   | one | of | fields | is | this | the | important | best | science | most | cources | scientists | analyze | data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.090909 | 0.181818 | 0.090909 | 0.090909 | 0.0 | 0.090909 | 0.090909 | 0.0 | 0.181818 | 0.090909 | 0.0 | 0.0 | 0.0 | 0.090909 |
| 1 | 0.111111 | 0.111111 | 0.0 | 0.111111 | 0.111111 | 0.111111 | 0.0 | 0.111111 | 0.111111 | 0.0 | 0.111111 | 0.0 | 0.0 | 0.111111 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.25 | 0.5 |


##### Computing Inverse Document Frequency

In [9]:
print("IDF of: ")

idf = {}


for w in words_set:
    # number of documents in the corpus that contain this word
    k = 0
    
    for i in range(n_docs):
        if w in corpus[i].split():
            k+=1
            
    idf[w] = np.log10(n_docs / k)
    print(f"{w:>15}: {idf[w]:>10}")

IDF of: 
            one: 0.17609125905568124
             of: 0.17609125905568124
         fields: 0.47712125471966244
             is: 0.17609125905568124
           this: 0.47712125471966244
            the: 0.17609125905568124
      important: 0.47712125471966244
           best: 0.47712125471966244
        science: 0.17609125905568124
           most: 0.47712125471966244
        cources: 0.47712125471966244
     scientists: 0.47712125471966244
        analyze: 0.47712125471966244
           data:        0.0


##### Putting it Together: Computing TF-IDF

In [10]:
df_tf_idf = df_tf.copy()

for w in words_set:
    for i in range(n_docs):
        # use .loc to avoid chained assignment, which may not update the DataFrame
        df_tf_idf.loc[i, w] = df_tf.loc[i, w] * idf[w]
        
df_tf_idf


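|   | one | of | fields | is | this | the | important | best | science | most | cources | scientists | analyze | data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.016008 | 0.032017 | 0.043375 | 0.016008 | 0.0 | 0.016008 | 0.043375 | 0.0 | 0.032017 | 0.043375 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.019566 | 0.019566 | 0.0 | 0.019566 | 0.053013 | 0.019566 | 0.0 | 0.053013 | 0.019566 | 0.0 | 0.053013 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.11928 | 0.11928 | 0.0 |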


### TF-IDF Using Scikit Learn

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [18]:
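# call the `fit_transform` method on our test corpus.
# This performs the same kind of calculation as above, but scikit-learn's
# defaults use a smoothed natural-log IDF and L2-normalize each row,
# so the scores differ from the manual results.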

tr_idf_model = TfidfVectorizer()
tf_idf_vector = tr_idf_model.fit_transform(corpus)

In [19]:
# after vectorizing the corpus, a `sparse matrix` is obtained.

print(type(tf_idf_vector), tf_idf_vector.shape)

<class 'scipy.sparse._csr.csr_matrix'> (3, 14)


In [20]:
# convert to a regular array to get a better idea of the values.

tf_idf_array = tf_idf_vector.toarray()

print(tf_idf_array)

[[0.         0.         0.         0.18952581 0.32089509 0.32089509
  0.24404899 0.32089509 0.48809797 0.24404899 0.48809797 0.
  0.24404899 0.        ]
 [0.         0.40029393 0.40029393 0.23642005 0.         0.
  0.30443385 0.         0.30443385 0.30443385 0.30443385 0.
  0.30443385 0.40029393]
 [0.54270061 0.         0.         0.64105545 0.         0.
  0.         0.         0.         0.         0.         0.54270061
  0.         0.        ]]


In [21]:
# get the vocabulary terms in the same order as the matrix columns
words_set = tr_idf_model.get_feature_names_out()

print(words_set)

['analyze' 'best' 'cources' 'data' 'fields' 'important' 'is' 'most' 'of'
 'one' 'science' 'scientists' 'the' 'this']


In [22]:
# dataframe to better show the TF-IDF scores of each document

df_tf_idf = pd.DataFrame(tf_idf_array, columns=words_set)

df_tf_idf

|   | analyze | best | cources | data | fields | important | is | most | of | one | science | scientists | the | this |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.189526 | 0.320895 | 0.320895 | 0.244049 | 0.320895 | 0.488098 | 0.244049 | 0.488098 | 0.0 | 0.244049 | 0.0 |
| 1 | 0.0 | 0.400294 | 0.400294 | 0.23642 | 0.0 | 0.0 | 0.304434 | 0.0 | 0.304434 | 0.304434 | 0.304434 | 0.0 | 0.304434 | 0.400294 |
| 2 | 0.542701 | 0.0 | 0.0 | 0.641055 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.542701 | 0.0 | 0.0 |
