# Term Frequency and Inverse Document Frequency

TF-IDF is a method for converting text into numeric vectors. Bag of words serves the same purpose, but it only records each word as 0 or as its raw frequency, so no weighting is given to the words: every word counts the same regardless of how informative it is. That makes bag of words a poor fit for tasks such as semantic analysis. TF-IDF overcomes this problem by giving more weight to the words that play an important role in the meaning of a sentence and less weight to words that appear everywhere.
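To make the limitation of bag of words concrete, here is a minimal pure-Python sketch. The two toy sentences and the helper name `bag_of_words` are assumptions made for illustration; they are not part of the notebook's data.

```python
from collections import Counter

# Toy sentences, assumed purely for illustration.
toy_sentences = ["he is a good boy", "she is a good girl"]

# The vocabulary is every distinct word, in sorted order.
vocabulary = sorted({word for s in toy_sentences for word in s.split()})

def bag_of_words(sentence, vocabulary):
    """Return the raw count of each vocabulary word in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(s, vocabulary) for s in toy_sentences]

print(vocabulary)
for vector in vectors:
    print(vector)
```

In the first vector, "is" and "boy" both get the value 1, even though "boy" says far more about the sentence than "is" does. This is the missing weighting that TF-IDF addresses.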

### How does this method work?

As usual, we first perform the text cleaning tasks, such as removing stop words and applying either lemmatization or stemming. Once we have the cleaned text corpus, the next step is to calculate the Term Frequency.

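Term Frequency (TF) = (Number of times the word occurs in the sentence) / (Total number of words in the sentence)

Inverse Document Frequency (IDF) = log(Total number of sentences / Number of sentences containing the word)

And finally, TF-IDF = TF * IDF

A word that appears in almost every sentence gets an IDF close to 0, so its TF-IDF weight is low, while a word that is frequent in one sentence but rare in the rest of the corpus gets a high weight.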
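The three formulas can be sketched directly in plain Python. This is only an illustration under a few assumptions: the toy sentences and the function names are made up, the logarithm is taken as the natural log (the formula does not fix a base), and every word passed in is assumed to occur in at least one sentence, since otherwise the IDF would divide by zero.

```python
import math

def term_frequency(word, sentence_words):
    """Occurrences of the word divided by the number of words in the sentence."""
    return sentence_words.count(word) / len(sentence_words)

def inverse_document_frequency(word, tokenized_sentences):
    """log(number of sentences / number of sentences containing the word)."""
    containing = sum(1 for words in tokenized_sentences if word in words)
    return math.log(len(tokenized_sentences) / containing)

def tf_idf(word, sentence_words, tokenized_sentences):
    """TF-IDF = TF * IDF for one word in one sentence."""
    tf = term_frequency(word, sentence_words)
    idf = inverse_document_frequency(word, tokenized_sentences)
    return tf * idf

# Toy corpus, assumed purely for illustration.
toy_sentences = ["good boy", "good girl", "boy girl good"]
tokenized = [s.split() for s in toy_sentences]

for words in tokenized:
    print({word: round(tf_idf(word, words, tokenized), 3) for word in words})
```

Here "good" occurs in all three sentences, so its IDF is log(3 / 3) = 0 and its TF-IDF is 0 everywhere. "boy" and "girl" each occur in two of the three sentences, so they keep a positive weight.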

In [2]:
para = '''
Mr. Aakash Patel is 23 years old. The date is 3rd Feburary 2022. Time is 1PM.
By impossible of in difficulty discovered celebrated ye. Justice joy manners boy met resolve produce. Bed head loud next plan rent had easy add him. As earnestly shameless elsewhere defective estimable fulfilled of. Esteem my advice it an excuse enable. Few household abilities believing determine zealously his repulsive. To open draw dear be by side like.

Necessary ye contented newspaper zealously breakfast he prevailed. Melancholy middletons yet understood decisively boy law she. Answer him easily are its barton little. Oh no though mother be things simple itself. Dashwood horrible he strictly on as. Home fine in so am good body this hope.

Knowledge nay estimable questions repulsive daughters boy. Solicitude gay way unaffected expression for. His mistress ladyship required off horrible disposed rejoiced. Unpleasing pianoforte unreserved as oh he unpleasant no inquietude insipidity. Advantages can discretion possession add favourable cultivated admiration far. Why rather assure how esteem end hunted nearer and before. By an truth after heard going early given he. Charmed to it excited females whether at examine. Him abilities suffering may are yet dependent.

Do am he horrible distance marriage so although. Afraid assure square so happen mr an before. His many same been well can high that. Forfeited did law eagerness allowance improving assurance bed. Had saw put seven joy short first. Pronounce so enjoyment my resembled in forfeited sportsman. Which vexed did began son abode short may. Interested astonished he at cultivated or me. Nor brought one invited she produce her.
'''

In [21]:
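import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time setup if the NLTK data is missing (resource names can vary by NLTK version):
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')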

In [22]:
wordnet = WordNetLemmatizer()
sentences = nltk.sent_tokenize(para)
corpus = []

In [23]:
sentences

['\nMr. Aakash Patel is 23 years old.',
 'The date is 3rd Feburary 2022.',
 'Time is 1PM.',
 'By impossible of in difficulty discovered celebrated ye.',
 'Justice joy manners boy met resolve produce.',
 'Bed head loud next plan rent had easy add him.',
 'As earnestly shameless elsewhere defective estimable fulfilled of.',
 'Esteem my advice it an excuse enable.',
 'Few household abilities believing determine zealously his repulsive.',
 'To open draw dear be by side like.',
 'Necessary ye contented newspaper zealously breakfast he prevailed.',
 'Melancholy middletons yet understood decisively boy law she.',
 'Answer him easily are its barton little.',
 'Oh no though mother be things simple itself.',
 'Dashwood horrible he strictly on as.',
 'Home fine in so am good body this hope.',
 'Knowledge nay estimable questions repulsive daughters boy.',
 'Solicitude gay way unaffected expression for.',
 'His mistress ladyship required off horrible disposed rejoiced.',
 'Unpleasing pianoforte unr

In [24]:
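# Build the stop-word set once instead of once per word
stop_words = set(stopwords.words("english"))

for sentence in sentences:
    # Replace everything except letters and digits with a space
    review = re.sub('[^a-zA-Z0-9]', ' ', sentence)
    # Lowercase and split into words
    review = review.lower().split()
    # Drop stop words and lemmatize the remaining words
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))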

In [25]:
corpus

['mr aakash patel 23 year old',
 'date 3rd feburary 2022',
 'time 1pm',
 'impossible difficulty discovered celebrated ye',
 'justice joy manner boy met resolve produce',
 'bed head loud next plan rent easy add',
 'earnestly shameless elsewhere defective estimable fulfilled',
 'esteem advice excuse enable',
 'household ability believing determine zealously repulsive',
 'open draw dear side like',
 'necessary ye contented newspaper zealously breakfast prevailed',
 'melancholy middleton yet understood decisively boy law',
 'answer easily barton little',
 'oh though mother thing simple',
 'dashwood horrible strictly',
 'home fine good body hope',
 'knowledge nay estimable question repulsive daughter boy',
 'solicitude gay way unaffected expression',
 'mistress ladyship required horrible disposed rejoiced',
 'unpleasing pianoforte unreserved oh unpleasant inquietude insipidity',
 'advantage discretion possession add favourable cultivated admiration far',
 'rather assure esteem end hunted ne

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus).toarray()

In [27]:
X

array([[0.        , 0.        , 0.41518962, ..., 0.41518962, 0.        ,
        0.        ],
       [0.        , 0.5       , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.70710678, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

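Each row of `X` represents one sentence and each column represents one word of the vocabulary. A non-zero entry is the TF-IDF weight of that word in that sentence, so a word now carries a weight that depends on how often it appears in the sentence and how rare it is across the corpus, rather than a bare 0 or a raw count.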
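The numbers in `X` do not match the plain TF * IDF formula exactly. With its default settings, `TfidfVectorizer` uses the raw count of the word as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales every row to unit Euclidean length. The sketch below imitates those defaults in pure Python for four sentences taken from `corpus`. It is an approximation for illustration, not scikit-learn's implementation: the helper name `sklearn_style_tfidf` is hypothetical, and it splits on whitespace instead of using the vectorizer's own tokenizer.

```python
import math

def sklearn_style_tfidf(corpus):
    """Return one {word: weight} dict per sentence (hypothetical helper)."""
    tokenized = [sentence.split() for sentence in corpus]
    n = len(tokenized)

    # Document frequency: in how many sentences each word occurs.
    df = {}
    for words in tokenized:
        for word in set(words):
            df[word] = df.get(word, 0) + 1

    rows = []
    for words in tokenized:
        weights = {}
        for word in set(words):
            # Smoothed IDF, then multiplied by the raw count of the word.
            idf = math.log((1 + n) / (1 + df[word])) + 1
            weights[word] = words.count(word) * idf
        # L2 normalization; fall back to 1.0 for an empty sentence.
        norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        rows.append({word: v / norm for word, v in weights.items()})
    return rows

# Four cleaned sentences copied from the corpus above.
sample = [
    'time 1pm',
    'date 3rd feburary 2022',
    'dashwood horrible strictly',
    'mistress ladyship required horrible disposed rejoiced',
]
rows = sklearn_style_tfidf(sample)
print(round(rows[0]['time'], 8))  # 0.70710678
print(round(rows[1]['date'], 8))  # 0.5
```

The two printed values are consistent with the 0.70710678 and 0.5 visible in `X`: when all words of a sentence are equally rare, normalization gives each of them a weight of 1 / sqrt(number of words). In the third sentence, "horrible" receives a smaller weight than "dashwood" because it also occurs in another sentence.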