# TF-IDF Representation/Model 

The TF-IDF representation, also called the TF-IDF model, takes into the account the importance of each word. In the bag-of-words model, each word is assumed to be equally important, which is of course not correct.

The formula to calculate TF-IDF weight of a term in a document is:


$$
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
$$

<span style="font-size: 18px;">Where:</span>
<ul>
    <li><strong>TF(t, d):</strong> Term Frequency of term <em>t</em> in document <em>d</em>.</li>
    <li><strong>IDF(t, D):</strong> Inverse Document Frequency of term <em>t</em> in the corpus <em>D</em>.</li>
</ul>

<span style="font-size: 18px;">Detailed Formulas:</span>
<ul>
    <li><strong>TF:</strong> 
    $$
    \text{TF}(t, d) = \frac{\text{Number of occurrences of term } t \text{ in document } d}{\text{Total number of terms in document } d}
    $$</li>
    <li><strong>IDF:</strong> 
    $$
    \text{IDF}(t, D) = \log \frac{N}{1 + n_t}
    $$
    Where:
    <ul>
        <li><strong>N:</strong> Total number of documents in the corpus <em>D</em>.</li>
        <li><strong>n_t:</strong> Number of documents containing the term <em>t</em>.</li>
    </ul>
    </li>
</ul>


Higher weights are assigned to terms that are present frequently in a document and which are rare among all documents. On the other hand, a low score is assigned to terms which are common across all documents.

In [11]:
# import required libraries
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

In [12]:
# let us define a pre-processing function

stemmer = PorterStemmer()

def preprocess(document):

    # convert to lower case
    document = document.lower()

    # tokenize the text in the document into words
    words = word_tokenize(document)

    # remove the stop words
    words = [word for word in words if words not in stopwords.words('english')]

    # stem
    words = [stemmer.stem(word) for word in words]

    # join stemmed words to make sentence
    document = ' '.join(words)

    return document

In [13]:
# Let us create a TF-IDF for sample corpus of text 

documents = ["Gangs of Wasseypur is a great movie. Wasseypur is a town in Bihar.", "The success of a song depends on the music.", "There is a new movie releasing this week. The movie is fun to watch."]
print(documents)

['Gangs of Wasseypur is a great movie. Wasseypur is a town in Bihar.', 'The success of a song depends on the music.', 'There is a new movie releasing this week. The movie is fun to watch.']


In [14]:
type(documents)

list

In [15]:
# pre-process all the documents
documents = [preprocess(document) for document in documents]
print(documents)

['gang of wasseypur is a great movi . wasseypur is a town in bihar .', 'the success of a song depend on the music .', 'there is a new movi releas thi week . the movi is fun to watch .']


#### TF-IDF Model

In [17]:
vectorizer = TfidfVectorizer()
tfidf_model = vectorizer.fit_transform(documents)
print(tfidf_model)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 27 stored elements and shape (3, 23)>
  Coords	Values
  (0, 3)	0.2831782312982749
  (0, 10)	0.21536434394959103
  (0, 20)	0.5663564625965498
  (0, 6)	0.43072868789918206
  (0, 4)	0.2831782312982749
  (0, 7)	0.21536434394959103
  (0, 19)	0.2831782312982749
  (0, 5)	0.2831782312982749
  (0, 0)	0.2831782312982749
  (1, 10)	0.2707200828407501
  (1, 15)	0.5414401656815002
  (1, 14)	0.355964375670793
  (1, 13)	0.355964375670793
  (1, 1)	0.355964375670793
  (1, 11)	0.355964375670793
  (1, 8)	0.355964375670793
  (2, 6)	0.41856696082601574
  (2, 7)	0.41856696082601574
  (2, 15)	0.20928348041300787
  (2, 16)	0.27518262650373737
  (2, 9)	0.27518262650373737
  (2, 12)	0.27518262650373737
  (2, 17)	0.27518262650373737
  (2, 22)	0.27518262650373737
  (2, 2)	0.27518262650373737
  (2, 18)	0.27518262650373737
  (2, 21)	0.27518262650373737


In [19]:
# print the full sparse matrix
print(tfidf_model.toarray())

[[0.28317823 0.         0.         0.28317823 0.28317823 0.28317823
  0.43072869 0.21536434 0.         0.         0.21536434 0.
  0.         0.         0.         0.         0.         0.
  0.         0.28317823 0.56635646 0.         0.        ]
 [0.         0.35596438 0.         0.         0.         0.
  0.         0.         0.35596438 0.         0.27072008 0.35596438
  0.         0.35596438 0.35596438 0.54144017 0.         0.
  0.         0.         0.         0.         0.        ]
 [0.         0.         0.27518263 0.         0.         0.
  0.41856696 0.41856696 0.         0.27518263 0.         0.
  0.27518263 0.         0.         0.20928348 0.27518263 0.27518263
  0.27518263 0.         0.         0.27518263 0.27518263]]


### Let us now create TF-IDF model on the spam dataset

In [26]:
# Load Data

spam_df = pd.read_csv('SMSSpamCollection.txt',delimiter='\t', header=None)

In [27]:
spam_df.head()

Unnamed: 0,0,1
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [28]:
spam_df.columns = ['label','message']

In [30]:
spam_df.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [31]:
spam_df.shape

(5572, 2)

In [32]:
# Let us fetch the top 50 rows and create the tf-idf model for it

spam_df = spam_df.iloc[0:50,:]
spam_df.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [34]:
# extract the message from spam_df
messages = [message for message in spam_df.message]
print(messages)

['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...', 'Ok lar... Joking wif u oni...', "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", 'U dun say so early hor... U c already then say...', "Nah I don't think he goes to usf, he lives around here though", "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv", 'Even my brother is not like to speak with me. They treat me like aids patent.', "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune", 'WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.', 'Had your mobile 

In [35]:
type(messages)

list

In [36]:
# preprocess the message using the udf
messages = [preprocess(message) for message in messages]

In [37]:
print(messages)

['go until jurong point , crazi .. avail onli in bugi n great world la e buffet ... cine there got amor wat ...', 'ok lar ... joke wif u oni ...', "free entri in 2 a wkli comp to win fa cup final tkt 21st may 2005. text fa to 87121 to receiv entri question ( std txt rate ) t & c 's appli 08452810075over18 's", 'u dun say so earli hor ... u c alreadi then say ...', "nah i do n't think he goe to usf , he live around here though", "freemsg hey there darl it 's been 3 week 's now and no word back ! i 'd like some fun you up for it still ? tb ok ! xxx std chg to send , £1.50 to rcv", 'even my brother is not like to speak with me . they treat me like aid patent .', "as per your request 'mell mell ( oru minnaminungint nurungu vettam ) ' ha been set as your callertun for all caller . press * 9 to copi your friend callertun", 'winner ! ! as a valu network custom you have been select to receivea £900 prize reward ! to claim call 09061701461. claim code kl341 . valid 12 hour onli .', 'had your mo

In [39]:
vectorizer =  TfidfVectorizer()
spam_tfidf_model = vectorizer.fit_transform(messages)
print(spam_tfidf_model)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 743 stored elements and shape (50, 428)>
  Coords	Values
  (0, 155)	0.15740122619251695
  (0, 370)	0.22947785548516664
  (0, 195)	0.2537512927116646
  (0, 278)	0.2537512927116646
  (0, 100)	0.2537512927116646
  (0, 53)	0.2537512927116646
  (0, 262)	0.21225557769305559
  (0, 187)	0.17875380792848203
  (0, 64)	0.2537512927116646
  (0, 162)	0.21225557769305559
  (0, 415)	0.2537512927116646
  (0, 200)	0.2537512927116646
  (0, 63)	0.2537512927116646
  (0, 84)	0.2537512927116646
  (0, 348)	0.1988969412111259
  (0, 159)	0.22947785548516664
  (0, 38)	0.2537512927116646
  (0, 388)	0.2537512927116646
  (1, 259)	0.4017341108428282
  (1, 201)	0.4343305520327375
  (1, 194)	0.4343305520327375
  (1, 403)	0.4802726555443259
  (1, 261)	0.4802726555443259
  (2, 187)	0.13484988527351144
  (2, 147)	0.15004564106574803
  :	:
  (48, 68)	0.31570866072271353
  (48, 98)	0.31570866072271353
  (48, 224)	0.31570866072271353
  (48, 51)	0.315708660722713

In [40]:
print(spam_tfidf_model.toarray())

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [45]:
# Convert the vectorized array into data frame
tf_idf = pd.DataFrame(spam_tfidf_model.toarray(), columns = vectorizer.get_feature_names_out())

tf_idf.head()

Unnamed: 0,000,07732584351,08000930705,08002986030,08452810075over18,09061701461,100,11,12,150p,...,xuhui,xxx,xxxmobilemovieclub,ye,yeah,you,your,yummi,yup,ú1
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.191427,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [47]:
# token names
print(vectorizer.get_feature_names_out())

['000' '07732584351' '08000930705' '08002986030' '08452810075over18'
 '09061701461' '100' '11' '12' '150p' '16' '20' '2005' '21st' '2nd'
 '4403ldnw1a7rw18' '4txt' '50' '6day' '81010' '87077' '87121' '87575'
 '8am' '900' 'abiola' 'about' 'actin' 'aft' 'again' 'ahead' 'ahhh' 'aid'
 'all' 'alreadi' 'alright' 'alway' 'am' 'amor' 'amp' 'an' 'and' 'anymor'
 'anyth' 'apologet' 'appli' 'arabian' 'ard' 'are' 'around' 'as' 'ask' 'at'
 'avail' 'back' 'badli' 'be' 'been' 'bit' 'bless' 'breather' 'brother'
 'bu' 'buffet' 'bugi' 'burn' 'but' 'by' 'ca' 'call' 'caller' 'callertun'
 'camcord' 'camera' 'can' 'car' 'cash' 'catch' 'caught' 'chanc' 'charg'
 'cheer' 'chg' 'child' 'cine' 'claim' 'clear' 'click' 'co' 'code' 'colour'
 'com' 'comin' 'comp' 'confirm' 'convinc' 'copi' 'cost' 'could' 'crave'
 'crazi' 'credit' 'cri' 'csh11' 'cup' 'cuppa' 'custom' 'da' 'darl' 'date'
 'day' 'dbuk' 'decid' 'deliveri' 'did' 'dinner' 'do' 'doe' 'done' 'dont'
 'down' 'dun' 'earli' 'eat' 'eg' 'egg' 'eh' 'endow' 'england' 