## TF-IDF *(Term Frequency-Inverse Document Frequency)*

* Term Frequency (TF): how often a word appears in a document, relative to the document's length.
* Inverse Document Frequency (IDF): how rare a word is across a collection of documents. Words that appear in few documents get high scores; words that appear in most documents get low scores.

TF(word, document) = (number of occurrences of the word in the document) / (number of words in the document)

IDF(word) = log(number of documents / number of documents that contain the word), where log is the natural logarithm

### TF-IDF(word, document) = TF(word, document) * IDF(word)
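
As a quick check of the formulas, the sketch below works through a single term by hand. It assumes a two-document corpus built from the sentences used later in this notebook, whitespace tokenization, and the natural logarithm; the variable names are illustrative only.

```python
import math

# Assumed toy corpus: two short documents, tokenized on spaces.
doc_a = "the man went out for a walk".split(" ")
doc_b = "the children sat around the fire".split(" ")
corpus = [doc_a, doc_b]

word = "man"

# TF: occurrences of the word in the document / words in the document.
tf = doc_a.count(word) / len(doc_a)            # 1 / 7

# IDF: log(number of documents / number of documents containing the word).
docs_with_word = sum(1 for doc in corpus if word in doc)
idf = math.log(len(corpus) / docs_with_word)   # log(2 / 1)

tf_idf = tf * idf
print(round(tf, 6), round(idf, 6), round(tf_idf, 6))  # 0.142857 0.693147 0.099021
```

A word such as "the", which occurs in both documents, has IDF log(2 / 2) = 0, so its TF-IDF is 0 no matter how often it appears.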


In [6]:
!pip install -U scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.6 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-3.1.0-py3-none-any.whl (14 kB)
Installing collected packages: threadpoolctl, scikit-learn
Successfully installed scikit-learn-1.2.2 threadpoolctl-3.1.0


In [7]:
import pandas as pd 
from sklearn.feature_extraction.text import TfidfVectorizer


documentA = 'the man went out for a walk'
documentB = 'the children sat around the fire'

In [8]:
bag_of_wordsA = documentA.split(" ")
bag_of_wordsB = documentB.split(" ")
bag_of_wordsB

['the', 'children', 'sat', 'around', 'the', 'fire']

In [10]:
unique_words = set(bag_of_wordsA).union(set(bag_of_wordsB))
unique_words

{'a',
 'around',
 'children',
 'fire',
 'for',
 'man',
 'out',
 'sat',
 'the',
 'walk',
 'went'}

In [5]:
# dict.keys() cannot build a dictionary; dict.fromkeys(unique_words, 0) does.

In [13]:
# Start every vocabulary word at 0, then count its occurrences in each document.
wordsA = dict.fromkeys(unique_words, 0)
for word in bag_of_wordsA:
    wordsA[word] += 1
wordsB = dict.fromkeys(unique_words, 0)
for word in bag_of_wordsB:
    wordsB[word] += 1

In [15]:
wordsA

{'children': 0,
 'man': 1,
 'a': 1,
 'fire': 0,
 'the': 1,
 'sat': 0,
 'around': 0,
 'out': 1,
 'went': 1,
 'for': 1,
 'walk': 1}
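
The same counts can be produced with `collections.Counter` from the standard library. This is a minimal sketch of an alternative, not the approach used in the cells above; the helper name `count_words` is hypothetical.

```python
from collections import Counter

def count_words(tokens, vocabulary):
    """Return a dict mapping every vocabulary word to its count in tokens."""
    counts = Counter(tokens)
    # Counter returns 0 for missing keys, so absent words are filled in with 0.
    return {word: counts[word] for word in vocabulary}

bag_a = "the man went out for a walk".split(" ")
bag_b = "the children sat around the fire".split(" ")
vocabulary = set(bag_a) | set(bag_b)

counts_a = count_words(bag_a, vocabulary)
counts_b = count_words(bag_b, vocabulary)
print(counts_b["the"], counts_a["the"], counts_a["fire"])  # 2 1 0
```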

In [14]:
import nltk
nltk.download('stopwords')  # one-time download; needed before first use
from nltk.corpus import stopwords
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each
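
Stop words such as these are often removed before counting, because they say little about a document's topic. The cells shown here do not apply the list, so the following is only a sketch of how it might be used. To stay runnable without the NLTK corpus, it assumes a small hand-picked subset of the list above, and the helper name `remove_stop_words` is hypothetical.

```python
# Assumed subset of the NLTK English stop words printed above.
# With NLTK available, set(stopwords.words('english')) could be used instead.
stop_words = {"the", "a", "for", "out"}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not stop words, preserving order."""
    return [token for token in tokens if token not in stop_words]

tokens = "the man went out for a walk".split(" ")
filtered = remove_stop_words(tokens, stop_words)
print(filtered)  # ['man', 'went', 'walk']
```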

**Term Frequency (TF)**

The number of times a word appears in a document divided by the total number of words in the document. Each document has its own term frequencies.

In [20]:
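def computeTF(wordDict, bow):
    # wordDict maps each vocabulary word to its count in this document;
    # bow is the document's token list, so len(bow) is its total word count.
    tfDict = {}
    bow_count = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bow_count)
    return tfDict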


In [22]:
tfA = computeTF(wordsA, bag_of_wordsA)
tfB = computeTF(wordsB, bag_of_wordsB)
tfA

{'children': 0.0,
 'man': 0.14285714285714285,
 'a': 0.14285714285714285,
 'fire': 0.0,
 'the': 0.14285714285714285,
 'sat': 0.0,
 'around': 0.0,
 'out': 0.14285714285714285,
 'went': 0.14285714285714285,
 'for': 0.14285714285714285,
 'walk': 0.14285714285714285}

**Inverse Document Frequency (IDF)**

The natural log of the number of documents divided by the number of documents that contain the word w. IDF gives more weight to words that are rare across the corpus and less to words that appear in many documents; a word that appears in every document has an IDF of log(1) = 0.

In [33]:
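import math

def computeIDF(documents):
    # documents is a list of word-count dicts that all share the same keys.
    N = len(documents)

    # Count the number of documents that contain each word.
    idfDict = dict.fromkeys(documents[0].keys(), 0)
    for document in documents:
        for word, val in document.items():
            if val > 0:
                idfDict[word] += 1
    # Convert document counts to IDF. Every word occurs in at least one
    # document here, so val is never 0.
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))
    return idfDict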

In [34]:
idfs = computeIDF([wordsA,wordsB])
idfs

{'children': 0.6931471805599453,
 'man': 0.6931471805599453,
 'a': 0.6931471805599453,
 'fire': 0.6931471805599453,
 'the': 0.0,
 'sat': 0.6931471805599453,
 'around': 0.6931471805599453,
 'out': 0.6931471805599453,
 'went': 0.6931471805599453,
 'for': 0.6931471805599453,
 'walk': 0.6931471805599453}
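
`computeIDF` relies on every count dictionary sharing the same keys. A variant that works directly on token lists avoids that requirement. The sketch below is one possible way to write it, with a hypothetical function name `compute_idf`; it uses the same unsmoothed formula as above.

```python
import math

def compute_idf(tokenized_docs):
    """IDF per word: log(N / number of documents containing the word)."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        # set() so a word repeated within one document is counted once.
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    # Every word in doc_freq occurs in at least one document,
    # so the division below cannot be by zero.
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

docs = [
    "the man went out for a walk".split(" "),
    "the children sat around the fire".split(" "),
]
idf = compute_idf(docs)
print(idf["the"], round(idf["man"], 6))  # 0.0 0.693147
```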

In [35]:
def computeTFIDF(tfBagOfWords, idfs):
    tfidf = {}
    for word, val in tfBagOfWords.items():
        tfidf[word] = val * idfs[word]
    return tfidf

In [36]:
tfidfA = computeTFIDF(tfA, idfs)
tfidfB = computeTFIDF(tfB, idfs)
df = pd.DataFrame([tfidfA, tfidfB])
df

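|   | children | man | a | fire | the | sat | around | out | went | for | walk |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.099021 | 0.099021 | 0.0 | 0.0 | 0.0 | 0.0 | 0.099021 | 0.099021 | 0.099021 | 0.099021 |
| 1 | 0.115525 | 0.0 | 0.0 | 0.115525 | 0.0 | 0.115525 | 0.115525 | 0.0 | 0.0 | 0.0 | 0.0 |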


'the man went out for a walk'

In [46]:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([documentA, documentB])  # sparse matrix, one row per document
feature_names = vectorizer.get_feature_names_out()
dense = vectors.toarray()  # convert the sparse matrix to a dense NumPy array
df = pd.DataFrame(dense, columns=feature_names)
df

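|   | around | children | fire | for | man | out | sat | the | walk | went |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.42616 | 0.42616 | 0.42616 | 0.0 | 0.303216 | 0.42616 | 0.42616 |
| 1 | 0.407401 | 0.407401 | 0.407401 | 0.0 | 0.0 | 0.0 | 0.407401 | 0.579739 | 0.0 | 0.0 |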
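
These values differ from the hand-computed table because the defaults of `TfidfVectorizer` are not the textbook formula. By default it lowercases the text, drops single-character tokens (so "a" is missing from the columns), uses raw counts for TF, smooths the IDF as ln((1 + N) / (1 + df)) + 1, and scales each row to unit Euclidean length. The sketch below reproduces the numbers above under those assumptions in plain Python. The function name `sklearn_style_tfidf` is hypothetical, and this is not scikit-learn's actual implementation.

```python
import math
import re

def sklearn_style_tfidf(documents):
    """Approximate TfidfVectorizer defaults: smoothed IDF, L2-normalized rows."""
    # The default token pattern keeps tokens of two or more word characters.
    tokenized = [re.findall(r"(?u)\b\w\w+\b", doc.lower()) for doc in documents]
    vocabulary = sorted(set(word for tokens in tokenized for word in tokens))
    n_docs = len(documents)

    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
    idf = {}
    for word in vocabulary:
        df = sum(1 for tokens in tokenized if word in tokens)
        idf[word] = math.log((1 + n_docs) / (1 + df)) + 1

    rows = []
    for tokens in tokenized:
        # Raw term count times IDF, then divide by the row's Euclidean norm.
        weights = {word: tokens.count(word) * idf[word] for word in vocabulary}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({word: w / norm for word, w in weights.items()})
    return rows

rows = sklearn_style_tfidf([
    "the man went out for a walk",
    "the children sat around the fire",
])
print(round(rows[0]["the"], 6), round(rows[0]["man"], 6))   # 0.303216 0.42616
print(round(rows[1]["the"], 6), round(rows[1]["fire"], 6))  # 0.579739 0.407401
```

With smoothing, "the" no longer scores 0: its IDF is ln(3 / 3) + 1 = 1, and because it occurs twice in the second document it ends up with the largest weight in that row.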
