# TF-IDF Implementation

- [Python](#python-implementation)
- [scikit-learn](#tf-idf-using-scikit-learn)
- [TensorFlow](#tf-idf-using-tensorflow)

## Python Implementation

We will write a TF-IDF function from scratch using the standard formula given above, but we will not apply any preprocessing operations such as stop-word removal, stemming, punctuation removal, or lowercasing. The results may therefore differ from those produced by a library's built-in function.
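For reference, the quantities computed by the from-scratch code in this section can be written as follows. This is only a restatement of what that code does (relative term frequency and a base-10 logarithm for IDF); other formulations of TF-IDF differ in these details. Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $n_t$ is the number of documents containing $t$.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log_{10}\frac{N}{n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$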

In [12]:
import pandas as pd
import numpy as np

First, let's construct a small corpus.

In [13]:
corpus = [
    "data science is one of the most important fields of science",
    "this is one of the best data science courese",
    "data scientists analysze data"
]

Next, we'll create a word set for the corpus.

In [14]:
words_set = set()

for doc in corpus:
    words = doc.split(" ")
    words_set = words_set.union(set(words))
    
print("Number of words in the corpus:", len(words_set))
print("The words in the corpus: \n", words_set)

Number of words in the corpus: 14
The words in the corpus: 
 {'most', 'this', 'one', 'important', 'analysze', 'courese', 'data', 'best', 'fields', 'scientists', 'is', 'science', 'of', 'the'}


### Computing Term Frequency

Now we can create a dataframe with one row per document and one column per word in the word set, and use it to compute the **term frequency (TF)**.

In [22]:
words_set

{'analysze',
 'best',
 'courese',
 'data',
 'fields',
 'important',
 'is',
 'most',
 'of',
 'one',
 'science',
 'scientists',
 'the',
 'this'}

In [23]:
# Number of documents in the corpus
n_docs = len(corpus)

# Number of unique words in the corpus
n_words_set = len(words_set)

df_tf = pd.DataFrame(
    np.zeros((n_docs, n_words_set)),
    columns=list(words_set)
)


# Compute Term Frequency (TF)
for i in range(n_docs):
    
    # words in the document
    words = corpus[i].split(" ")
    
    for w in words:
        # use .loc rather than chained indexing, which may fail to update df_tf
        df_tf.loc[i, w] += 1 / len(words)
        
        
df_tf

| | most | this | one | important | analysze | courese | data | best | fields | scientists | is | science | of | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.090909 | 0.0 | 0.090909 | 0.090909 | 0.0 | 0.0 | 0.090909 | 0.0 | 0.090909 | 0.0 | 0.090909 | 0.181818 | 0.181818 | 0.090909 |
| 1 | 0.0 | 0.111111 | 0.111111 | 0.0 | 0.0 | 0.111111 | 0.111111 | 0.111111 | 0.0 | 0.0 | 0.111111 | 0.111111 | 0.111111 | 0.111111 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.5 | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 |


The dataframe above has a column for each word and a row for each document. Each cell shows the frequency of that word in that document, relative to the document's length.

### Computing Inverse Document Frequency
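The same per-document calculation can be sketched without a dataframe. This is a minimal illustration using only the standard library; `term_frequency` is a hypothetical helper name, and it assumes the same space-based splitting as the loop above.

```python
from collections import Counter


def term_frequency(doc):
    """Return {term: count / total_terms} for a space-separated document."""
    words = doc.split(" ")
    counts = Counter(words)
    return {term: count / len(words) for term, count in counts.items()}


# Third document of the corpus: "data" makes up 2 of its 4 words
tf_doc = term_frequency("data scientists analysze data")
print(tf_doc)  # {'data': 0.5, 'scientists': 0.25, 'analysze': 0.25}
```

The frequencies within one document always sum to 1, which is why longer documents show smaller values per word.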

Now, we'll compute the **inverse document frequency (IDF)**.

In [25]:
print ("IDF of: ")

idf = {}

for w in words_set:
    # number of documents in the corpus that contain the word
    k = 0
    
    for i in range(n_docs):
        if w in corpus[i].split():
            k+=1
            
    idf[w] = np.log10(n_docs/k)
    
    print(f"{w:>15}: {idf[w]:>10}")

IDF of: 
           most: 0.47712125471966244
           this: 0.47712125471966244
            one: 0.17609125905568124
      important: 0.47712125471966244
       analysze: 0.47712125471966244
        courese: 0.47712125471966244
           data:        0.0
           best: 0.47712125471966244
         fields: 0.47712125471966244
     scientists: 0.47712125471966244
             is: 0.17609125905568124
        science: 0.17609125905568124
             of: 0.17609125905568124
            the: 0.17609125905568124
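As a cross-check on the values above, the document frequency can also be counted with sets. This is a sketch under the same assumptions as the loop (space-based splitting, base-10 logarithm); `inverse_document_frequency` and `idf_check` are illustrative names.

```python
import math

corpus = [
    "data science is one of the most important fields of science",
    "this is one of the best data science courese",
    "data scientists analysze data",
]


def inverse_document_frequency(docs):
    """Return {term: log10(N / df)}, where df counts the documents containing the term."""
    doc_terms = [set(doc.split(" ")) for doc in docs]
    vocabulary = set().union(*doc_terms)
    n_docs = len(docs)
    return {
        term: math.log10(n_docs / sum(term in terms for terms in doc_terms))
        for term in vocabulary
    }


idf_check = inverse_document_frequency(corpus)
# "data" is in 3 of 3 documents, "one" in 2 of 3, "most" in 1 of 3
print(round(idf_check["data"], 6), round(idf_check["one"], 6), round(idf_check["most"], 6))
```

A term that occurs in every document gets `log10(3 / 3) = 0`, while a term found in a single document gets the largest value, `log10(3)`.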


### Computing TF-IDF

In [26]:
df_tf_idf = df_tf.copy()

for w in words_set:
    for i in range(n_docs):
        df_tf_idf.loc[i, w] = df_tf.loc[i, w] * idf[w]
        
df_tf_idf

| | most | this | one | important | analysze | courese | data | best | fields | scientists | is | science | of | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.043375 | 0.0 | 0.016008 | 0.043375 | 0.0 | 0.0 | 0.0 | 0.0 | 0.043375 | 0.0 | 0.016008 | 0.032017 | 0.032017 | 0.016008 |
| 1 | 0.0 | 0.053013 | 0.019566 | 0.0 | 0.0 | 0.053013 | 0.0 | 0.053013 | 0.0 | 0.0 | 0.019566 | 0.019566 | 0.019566 | 0.019566 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.11928 | 0.0 | 0.0 | 0.0 | 0.0 | 0.11928 | 0.0 | 0.0 | 0.0 | 0.0 |


Notice that "data" has an IDF of 0 because it appears in every document. As a result, it is not considered an important term in this corpus. This changes slightly in the following sklearn implementation, where the score for "data" will be non-zero.

## TF-IDF Using scikit-learn

First, we need to import sklearn's TfidfVectorizer:

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

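We first instantiate the class, then call the `fit_transform` method on our corpus. This computes TF-IDF scores for the whole corpus in one step, although with its default settings (lowercasing, a smoothed IDF, and L2 normalization of each row) the values differ from those we computed above.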

In [28]:
tf_idf_model = TfidfVectorizer()

In [34]:
tf_idf_vector = tf_idf_model.fit_transform(corpus)

# print the matrix coordinates with non-zero values
print(tf_idf_vector)

  (0, 4)	0.320895090271992
  (0, 5)	0.320895090271992
  (0, 7)	0.320895090271992
  (0, 12)	0.24404898736823682
  (0, 8)	0.48809797473647365
  (0, 9)	0.24404898736823682
  (0, 6)	0.24404898736823682
  (0, 10)	0.48809797473647365
  (0, 3)	0.18952580966166677
  (1, 2)	0.40029393442429256
  (1, 1)	0.40029393442429256
  (1, 13)	0.40029393442429256
  (1, 12)	0.30443385488725433
  (1, 8)	0.30443385488725433
  (1, 9)	0.30443385488725433
  (1, 6)	0.30443385488725433
  (1, 10)	0.30443385488725433
  (1, 3)	0.2364200460658773
  (2, 0)	0.5427006131762078
  (2, 11)	0.5427006131762078
  (2, 3)	0.6410554491745127


After vectorizing the corpus, we obtain a [sparse matrix](../sparse-matrix.md).

Here's the current shape of the matrix:

In [30]:
print(type(tf_idf_vector), tf_idf_vector.shape)

<class 'scipy.sparse._csr.csr_matrix'> (3, 14)


We can convert it to a regular array to get a better idea of the values:

In [33]:
tf_idf_array = tf_idf_vector.toarray()

print(tf_idf_array)

[[0.         0.         0.         0.18952581 0.32089509 0.32089509
  0.24404899 0.32089509 0.48809797 0.24404899 0.48809797 0.
  0.24404899 0.        ]
 [0.         0.40029393 0.40029393 0.23642005 0.         0.
  0.30443385 0.         0.30443385 0.30443385 0.30443385 0.
  0.30443385 0.40029393]
 [0.54270061 0.         0.         0.64105545 0.         0.
  0.         0.         0.         0.         0.         0.54270061
  0.         0.        ]]


It's now straightforward to obtain the terms in the corpus by using `get_feature_names_out`:

In [35]:
words_set_features_name = tf_idf_model.get_feature_names_out()

print(words_set_features_name)

['analysze' 'best' 'courese' 'data' 'fields' 'important' 'is' 'most' 'of'
 'one' 'science' 'scientists' 'the' 'this']
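The sparse output printed earlier lists entries as `(row, column)` pairs, so the feature names are what connect a column index back to a term. A minimal sketch of that mapping follows. It assumes that splitting on spaces yields the same tokens as the vectorizer for this corpus, and that columns follow the alphabetical order visible in the output above; `vocabulary` and `term_to_column` are illustrative names, not part of sklearn.

```python
corpus = [
    "data science is one of the most important fields of science",
    "this is one of the best data science courese",
    "data scientists analysze data",
]

# Alphabetically sorted unique terms, mirroring the feature-name order shown above
vocabulary = sorted({word for doc in corpus for word in doc.split(" ")})
term_to_column = {term: idx for idx, term in enumerate(vocabulary)}

print(vocabulary[4])           # fields
print(term_to_column["data"])  # 3
```

Read this way, the sparse entry `(0, 4)` is the score of "fields" in the first document, and every `(row, 3)` entry is a score for "data".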


Finally, we'll create a dataframe to better show the TF-IDF scores of each document:

In [36]:
df_tf_idf_2 = pd.DataFrame(tf_idf_array, columns=words_set_features_name)

In [37]:
df_tf_idf_2

| | analysze | best | courese | data | fields | important | is | most | of | one | science | scientists | the | this |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.189526 | 0.320895 | 0.320895 | 0.244049 | 0.320895 | 0.488098 | 0.244049 | 0.488098 | 0.0 | 0.244049 | 0.0 |
| 1 | 0.0 | 0.400294 | 0.400294 | 0.23642 | 0.0 | 0.0 | 0.304434 | 0.0 | 0.304434 | 0.304434 | 0.304434 | 0.0 | 0.304434 | 0.400294 |
| 2 | 0.542701 | 0.0 | 0.0 | 0.641055 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.542701 | 0.0 | 0.0 |
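These values can be reproduced by hand, which also shows why "data" is no longer zero. The following is a sketch, not sklearn's actual code. It assumes the vectorizer's default settings (raw term counts, a smoothed IDF of `ln((1 + N) / (1 + df)) + 1`, and L2 normalization of each row) and that splitting on spaces matches its tokenization for this corpus. All names are illustrative.

```python
import math
from collections import Counter

corpus = [
    "data science is one of the most important fields of science",
    "this is one of the best data science courese",
    "data scientists analysze data",
]

docs = [doc.split(" ") for doc in corpus]
n_docs = len(docs)
vocabulary = sorted({word for words in docs for word in words})

# Document frequency: number of documents that contain each term
doc_freq = {t: sum(t in words for words in docs) for t in vocabulary}
# Smoothed IDF (natural log); a term present in every document gets 1, not 0
smooth_idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}


def tfidf_row(words):
    """Raw count times smoothed IDF, scaled so the row has unit L2 norm."""
    counts = Counter(words)
    weights = [counts[t] * smooth_idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


tfidf_rows = [tfidf_row(words) for words in docs]
data_col = vocabulary.index("data")
# Compare with the "data" column of the table above
print([round(row[data_col], 6) for row in tfidf_rows])
```

Because the smoothed IDF of "data" is 1 rather than 0, its score depends only on its count and on the row's normalization, which is why it is largest in the short third document.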


## TF-IDF Using TensorFlow

In TensorFlow, the **inverse document frequency** is calculated using the following formula:

```
idf = 1 + log((corpus size + 1) / (count of documents containing term + 1))
```
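To get a feel for this formula, it can be evaluated in plain Python for the three-document corpus used here. This is only a worked example of the arithmetic, not TensorFlow code; it assumes `log` is the natural logarithm, and `smoothed_idf` is a hypothetical helper name.

```python
import math


def smoothed_idf(n_docs, doc_count):
    """idf = 1 + log((corpus size + 1) / (documents containing term + 1))."""
    # Assumption: natural logarithm
    return 1 + math.log((n_docs + 1) / (doc_count + 1))


# A term found in 1, 2, or all 3 of the documents
for doc_count in (1, 2, 3):
    print(doc_count, round(smoothed_idf(3, doc_count), 6))
# 1 1.693147
# 2 1.287682
# 3 1.0
```

The added 1 in the numerator and denominator avoids division by zero, and the leading `1 +` keeps the weight of a term that appears in every document from dropping to zero.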

In [39]:
import tensorflow as tf

ModuleNotFoundError: No module named 'distutils'