# TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure of how important a word is to a document relative to a collection of documents (a corpus). The score increases with the number of times the word appears in the document and is offset by how many documents in the collection contain the word, so words that are common everywhere receive a low weight.

The formula to compute the TF-IDF score for a word in a document is given by:
$$
\text{TF-IDF}(w, d, D) = \text{TF}(w, d) \times \text{IDF}(w, D)
$$
where:
- $w$ is the word
- $d$ is the document
- $D$ is the collection of documents
- $\text{TF}(w, d)$ is the term frequency of word $w$ in document $d$
- $\text{IDF}(w, D)$ is the inverse document frequency of word $w$ in the collection of documents $D$
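
The formula above leaves TF and IDF themselves unspecified. The following is a minimal sketch, assuming one common textbook definition: TF is the word's count divided by the document length, and IDF is $\log(N / \text{df})$, where $N$ is the number of documents and $\text{df}$ is the number of documents containing the word. The function names and the toy corpus are illustrative only. scikit-learn's `TfidfVectorizer`, used later in this notebook, applies a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def term_frequency(word, doc_tokens):
    """Relative frequency of `word` in one tokenized document."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[word] / len(doc_tokens)

def inverse_document_frequency(word, corpus_tokens):
    """log(N / df), where df is the number of documents containing `word`."""
    df = sum(1 for doc in corpus_tokens if word in doc)
    if df == 0:
        return 0.0  # assumption: words absent from the corpus score 0
    return math.log(len(corpus_tokens) / df)

def tf_idf(word, doc_tokens, corpus_tokens):
    """TF-IDF(w, d, D) = TF(w, d) * IDF(w, D)."""
    return term_frequency(word, doc_tokens) * inverse_document_frequency(word, corpus_tokens)

# Hypothetical toy corpus, already tokenized
corpus = [
    ["gas", "meter", "volume"],
    ["gas", "price", "deal"],
    ["cheap", "price", "offer"],
]

# "gas" appears in 2 of 3 documents and "meter" in only 1,
# so "meter" gets the higher score in the first document.
score_gas = tf_idf("gas", corpus[0], corpus)
score_meter = tf_idf("meter", corpus[0], corpus)
print(score_gas, score_meter)
```

Both words occur once in the first document, so their TF is identical; the difference in score comes entirely from the IDF term.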



In [1]:
import pandas as pd 

In [2]:
messages=pd.read_csv('spam_ham_dataset.csv')

In [3]:
messages.head()

,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0


In [4]:
messages.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5171 entries, 0 to 5170
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  5171 non-null   int64 
 1   label       5171 non-null   object
 2   text        5171 non-null   object
 3   label_num   5171 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 161.7+ KB


In [5]:
messages = messages.drop(['label_num', 'Unnamed: 0'], axis=1)

In [6]:
messages.head()

,label,text
0,ham,Subject: enron methanol ; meter # : 988291\r\n...
1,ham,"Subject: hpl nom for january 9 , 2001\r\n( see..."
2,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar..."
3,spam,"Subject: photoshop , windows , office . cheap ..."
4,ham,Subject: re : indian springs\r\nthis deal is t...


In [7]:
import re 
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\AL KHAIR
[nltk_data]     COMPUTER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet', quiet=True)  # WordNetLemmatizer needs the WordNet corpus
wn = WordNetLemmatizer()

In [9]:
stop_words = set(stopwords.words('english'))

In [10]:
# Clean, tokenize, and lemmatize a single message (applied row by row below)
def preprocess_text(text):
    if isinstance(text, str):
        text = re.sub(r'[^a-zA-Z]', ' ', text)  # Replace non-alphabet characters with spaces
        text = text.lower()  # Lowercase
        words = text.split()  # Tokenize
        words = [wn.lemmatize(word) for word in words if word not in stop_words]  # Lemmatize and remove stopwords
        return ' '.join(words)  # Join back into a string
    return ""  # Handle invalid entries

# Apply to the 'text' column of messages
messages['cleaned_text'] = messages['text'].apply(preprocess_text)

In [11]:
messages['cleaned_text']

0       subject enron methanol meter follow note gave ...
1       subject hpl nom january see attached file hpln...
2       subject neon retreat ho ho ho around wonderful...
3       subject photoshop window office cheap main tre...
4       subject indian spring deal book teco pvr reven...
                              ...                        
5166    subject put ft transport volume decreased cont...
5167    subject following noms hpl take extra mmcf wee...
5168    subject calpine daily gas nomination julie men...
5169    subject industrial worksheet august activity a...
5170    subject important online banking alert dear va...
Name: cleaned_text, Length: 5171, dtype: object

## Creating the TF-IDF matrix

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(max_features=100)

In [13]:
X = tf.fit_transform(messages['cleaned_text']).toarray()

In [14]:
import numpy as np
np.set_printoptions(edgeitems=30, linewidth=100000, 
    formatter=dict(float=lambda x: "%.3g" % x))
X

array([[0, 0, 0, 0, 0, 0, 0, 0, 0.29, 0, 0, 0, 0, 0, 0, 0.684, 0.243, 0, 0, 0, 0, 0, 0, 0, 0.21, 0, 0, 0, 0.325, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.093, 0, 0, 0, 0, 0, 0, 0, 0, 0.267, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0.298, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.343, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0.285, 0, 0, 0, 0, 0, 0.108, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.711],
       [0.0912, 0, 0, 0.0983, 0, 0, 0, 0, 0.0897, 0.0746, 0, 0, 0, 0, 0.101, 0, 0, 0, 0, 0, 0, 0.0943, 0, 0, 0, 0, 0, 0.106, 0, 0.0966, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0287, 0, 0, 0, 0, 0, 0, 0.598, 0, 0, 0, 0.208, 0, 0, 0.345, 0.104, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.807, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 

In [15]:
tf.vocabulary_

{'subject': 83,
 'enron': 24,
 'meter': 53,
 'flow': 28,
 'daren': 16,
 'please': 67,
 'daily': 15,
 'volume': 92,
 'gas': 33,
 'change': 8,
 'hpl': 38,
 'nom': 59,
 'see': 77,
 'attached': 2,
 'file': 26,
 'xl': 99,
 'time': 90,
 'know': 43,
 'go': 35,
 'week': 94,
 'like': 45,
 'following': 29,
 'need': 56,
 'get': 34,
 'let': 44,
 'first': 27,
 'would': 97,
 'could': 14,
 'www': 98,
 'com': 9,
 'also': 0,
 'one': 62,
 'email': 21,
 'back': 3,
 'make': 48,
 'deal': 18,
 'price': 69,
 'message': 52,
 'http': 39,
 'take': 85,
 'purchase': 72,
 'day': 17,
 'sale': 76,
 'march': 49,
 'today': 91,
 'mail': 47,
 'new': 58,
 'product': 70,
 'net': 57,
 'delivery': 19,
 'want': 93,
 'forwarded': 31,
 'texas': 86,
 'cc': 7,
 'mmbtu': 54,
 'nomination': 60,
 'stock': 82,
 'inc': 40,
 'month': 55,
 'april': 1,
 'service': 79,
 'company': 10,
 'system': 84,
 'call': 6,
 'information': 41,
 'business': 5,
 'number': 61,
 'help': 36,
 'statement': 81,
 'may': 51,
 'report': 74,
 'july': 42,
 'than

## N-grams

N-grams are the contiguous sequences of n adjacent words (or characters) found in a source text. An n-gram of size 1 is a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram". Larger sizes are usually named by the number followed by "gram", for example "four-gram", "five-gram", and so on.
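
As a minimal sketch of this definition for word-level n-grams, the helper below slides a window of size n over whitespace-separated tokens. The function name `word_ngrams` and the example sentence are illustrative, not part of any library.

```python
def word_ngrams(text, n):
    """Return all contiguous word n-grams of `text` as space-joined strings."""
    tokens = text.split()
    # A text with fewer than n tokens yields an empty list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "see attached file hpl nom"  # hypothetical example text

unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)

print(bigrams)   # ['see attached', 'attached file', 'file hpl', 'hpl nom']
print(trigrams)  # ['see attached file', 'attached file hpl', 'file hpl nom']
```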


In [17]:
# Build a TF-IDF model over bigrams and trigrams
from sklearn.feature_extraction.text import TfidfVectorizer
# Passing binary=True would set all non-zero term counts to 1 before IDF weighting
tf = TfidfVectorizer(max_features=100, ngram_range=(2, 3))
X = tf.fit_transform(messages['cleaned_text']).toarray()

In [18]:
tf.vocabulary_

{'subject enron': 87,
 'subject hpl': 90,
 'hpl nom': 54,
 'see attached': 83,
 'attached file': 5,
 'subject hpl nom': 91,
 'see attached file': 84,
 'let know': 60,
 'enron com': 31,
 'texas utility': 97,
 'com cc': 13,
 'cc subject': 6,
 'enron hpl': 35,
 'hpl actuals': 52,
 'teco tap': 94,
 'tap enron': 93,
 'hpl gas': 53,
 'gas daily': 44,
 'subject enron hpl': 88,
 'enron hpl actuals': 36,
 'teco tap enron': 95,
 'tenaska iv': 96,
 'vance taylor': 98,
 'robert cotten': 80,
 'cotten hou': 17,
 'hou ect': 49,
 'ect ect': 24,
 'ect cc': 22,
 'julie meyers': 59,
 'smith hou': 85,
 'melissa graf': 61,
 'graf hou': 47,
 'ect subject': 29,
 'ect pm': 28,
 'enron enron': 32,
 'gary hank': 43,
 'pat clynes': 72,
 'clynes corp': 11,
 'corp enron': 15,
 'na enron': 64,
 'daren farmer': 19,
 'farmer hou': 41,
 'robert cotten hou': 81,
 'cotten hou ect': 18,
 'hou ect ect': 50,
 'ect ect cc': 25,
 'smith hou ect': 86,
 'melissa graf hou': 62,
 'graf hou ect': 48,
 'ect ect subject': 26,
 'hou

In [19]:
X

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0.374, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.36, 0.377, 0, 0, 0, 0, 0, 0.416, 0.464, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 