# TF-IDF
TF-IDF (term frequency-inverse document frequency) is a numerical statistic that reflects how important a word is to a document in a collection (corpus).
**In simple words:**  
TF-IDF helps us find the words that are most distinctive of a document compared with all the other documents.

**Formulas:**
1. **Term Frequency (TF):**  
    Measures how frequently a term appears in a document.  
    TF = (Number of times term t appears in a document) / (Total number of terms in the document)

2. **Inverse Document Frequency (IDF):**  
    Measures how rare a term is across all documents; rarer terms get a higher weight.  
    IDF = log_e(Total number of documents / Number of documents containing the term t)

3. **TF-IDF:**  
    TF-IDF = TF * IDF

**How it is calculated:**  
- For each word in a document, calculate its TF (how often it appears in that document).  
- Calculate its IDF (how rare it is across all documents).  
- Multiply TF and IDF to get the TF-IDF score for that word in that document.
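
The steps above can be sketched in plain Python. This is a minimal illustration of the textbook formulas only: the function names (`term_frequency`, `inverse_document_frequency`, `tf_idf`) and the three-document toy corpus are assumptions made for this example, and tokens are produced by simple whitespace splitting with no stop-word removal.

```python
import math

def term_frequency(term, tokens):
    """TF = occurrences of the term in the document / total terms in the document."""
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, tokenized_docs):
    """IDF = ln(total documents / documents containing the term)."""
    containing = sum(1 for tokens in tokenized_docs if term in tokens)
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(tokenized_docs) / containing)

def tf_idf(term, tokens, tokenized_docs):
    """TF-IDF = TF * IDF for one term in one document."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, tokenized_docs)

# Hypothetical toy corpus, split on whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]

cat_score = tf_idf("cat", docs[0], docs)  # appears once, in 1 of 3 documents
the_score = tf_idf("the", docs[0], docs)  # appears twice, but in 2 of 3 documents
print(f"cat: {cat_score:.4f}, the: {the_score:.4f}")
```

In this sketch "cat" scores about 0.183 and "the" about 0.135: "the" occurs twice in the first document, but its lower IDF more than cancels the higher TF.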

**Meaning of its outputs:**  
- A high TF-IDF score means the word is important for that document (it appears often in that document but not in many others).  
- A low TF-IDF score means the word is either common across all documents or does not appear often in the document.


In [4]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

stop_words = list(stopwords.words('english'))
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "this log is made of wood"
]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Zainab\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

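tfidf_vectorizer = TfidfVectorizer(
    ngram_range=(1, 1),        # single words only
    lowercase=True,
    stop_words=stop_words,     # drop common English words such as "the" and "on"
    token_pattern=r'(?u)\b[a-zA-Z]{2,}\b'  # alphabetic tokens of two or more letters
)

# Learn the vocabulary and IDF weights, then build the document-term matrix
X = tfidf_vectorizer.fit_transform(documents)

vocab = tfidf_vectorizer.get_feature_names_out()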

In [11]:
import pandas as pd

df = pd.DataFrame(
    X.toarray(),
    columns=vocab,
    index=documents
)
df

| document | cat | cats | dog | dogs | log | made | mat | pets | sat | wood |
|---|---|---|---|---|---|---|---|---|---|---|
| the cat sat on the mat | 0.617614 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.617614 | 0.0 | 0.486934 | 0.0 |
| the dog sat on the log | 0.0 | 0.0 | 0.667679 | 0.0 | 0.526405 | 0.0 | 0.0 | 0.0 | 0.526405 | 0.0 |
| cats and dogs are pets | 0.0 | 0.57735 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.0 |
| this log is made of wood | 0.0 | 0.0 | 0.0 | 0.0 | 0.486934 | 0.617614 | 0.0 | 0.0 | 0.0 | 0.617614 |
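
These values do not match the plain TF * IDF formula given earlier. With its default settings, scikit-learn's `TfidfVectorizer` uses raw term counts for TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each document vector to unit Euclidean (L2) length. The sketch below recomputes the first row by hand under those defaults. The helper names `smoothed_idf` and `sklearn_style_row` are hypothetical, and the token lists are written out as they look after stop-word removal rather than derived from the raw text.

```python
import math

# Tokens left after stop-word removal (written out by hand for this sketch).
docs = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cats", "dogs", "pets"],
    ["log", "made", "wood"],
]

def smoothed_idf(term, docs):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = sum(1 for tokens in docs if term in tokens)
    return math.log((1 + n) / (1 + df)) + 1

def sklearn_style_row(tokens, docs):
    """Raw count * smoothed IDF for each term, then L2-normalize the row."""
    raw = {t: tokens.count(t) * smoothed_idf(t, docs) for t in set(tokens)}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

row = sklearn_style_row(docs[0], docs)
print({t: round(w, 6) for t, w in sorted(row.items())})
```

Under these assumptions "cat" and "mat" come out near 0.617614 and "sat" near 0.486934, which agrees with the first row of the table. "sat" is lower because it also appears in the second document.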


In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love AI", "AI loves me back"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())

['ai' 'back' 'love' 'loves' 'me']
[[0.57973867 0.         0.81480247 0.         0.        ]
 [0.37997836 0.53404633 0.         0.53404633 0.53404633]]
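
Two details of this output are worth noting. "I" is missing from the vocabulary because the default token pattern of `TfidfVectorizer` only keeps tokens of two or more characters. And "ai" has a non-zero weight even though it appears in both documents: under the textbook formula its IDF would be ln(2/2) = 0, but the default smoothed IDF adds 1 so that such terms are not discarded entirely. The sketch below checks the first row by hand; the variable names are hypothetical and the document frequencies are entered manually.

```python
import math

n_docs = 2    # documents in the corpus
df_ai = 2     # "ai" appears in both documents
df_love = 1   # "love" appears only in the first document

# Textbook IDF: a term found in every document gets weight 0.
textbook_idf_ai = math.log(n_docs / df_ai)

# Smoothed IDF, as used by TfidfVectorizer with default settings.
smoothed_idf_ai = math.log((1 + n_docs) / (1 + df_ai)) + 1
smoothed_idf_love = math.log((1 + n_docs) / (1 + df_love)) + 1

# First document tokenizes to ["love", "ai"], each with a count of 1,
# so the unnormalized weights equal the IDF values.
norm = math.hypot(smoothed_idf_ai, smoothed_idf_love)
ai_weight = smoothed_idf_ai / norm
love_weight = smoothed_idf_love / norm

print(textbook_idf_ai, round(ai_weight, 6), round(love_weight, 6))
```

The hand-computed weights agree with the first row printed above (about 0.579739 for "ai" and 0.814802 for "love"), while the textbook IDF for "ai" is exactly 0.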
