# **Text Documents to a TF-IDF Matrix Using TfidfVectorizer**

**Term Frequency (TF)**

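Term Frequency (TF) measures how often a term t occurs in a document d, normalized by the length of the document. The TF formula is:

TF(t, d) = (count of t in d) / (number of words in d)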
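
To make the TF formula concrete, here is a minimal sketch in plain Python. The function name `term_frequency`, the example sentence, and the lowercase-and-split-on-whitespace tokenizer are assumptions made for illustration; real pipelines usually tokenize more carefully.

```python
from collections import Counter


def term_frequency(term, document):
    """TF(t, d) = (count of t in d) / (number of words in d)."""
    # Naive tokenizer (an assumption): lowercase, split on whitespace.
    words = document.lower().split()
    if not words:
        return 0.0
    return Counter(words)[term] / len(words)


doc = "the cat sat on the mat"
tf_the = term_frequency("the", doc)  # 2 occurrences out of 6 words
tf_cat = term_frequency("cat", doc)  # 1 occurrence out of 6 words
print(tf_the, tf_cat)
```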

**Document Frequency (DF)**

Document Frequency (DF) measures how widespread a term is across a corpus. Whereas TF counts the occurrences of a term within a single document, DF counts the number of documents that contain the term at least once. The DF formula is:

DF(t) = number of documents that contain t
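
As a quick illustration of counting documents rather than occurrences, the sketch below computes DF over a three-sentence corpus. The corpus, the helper name `document_frequency`, and the whitespace tokenizer are all hypothetical.

```python
def document_frequency(term, corpus):
    """DF(t) = number of documents that contain t at least once."""
    # A set per document means repeated occurrences count only once.
    return sum(1 for doc in corpus if term in set(doc.lower().split()))


corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "birds sing in the morning",
]
df_the = document_frequency("the", corpus)  # appears in every document
df_cat = document_frequency("cat", corpus)  # appears in two documents
df_dog = document_frequency("dog", corpus)  # appears in one document
print(df_the, df_cat, df_dog)
```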

**Inverse Document Frequency (IDF)**

The informativeness of a term is measured by its Inverse Document Frequency (IDF). TF on its own gives every term equal weight, whereas IDF scales up rare terms and weighs down common ones (such as stop words). The IDF formula is:

IDF(t) = log(N / (DF(t) + 1))

where N is the total number of documents and DF(t) is the number of documents containing the term t.
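
The sketch below evaluates this IDF formula on a small hypothetical corpus (the corpus and the function name `inverse_document_frequency` are assumptions). It also shows two consequences of the +1 in the denominator: a term that appears in no document does not cause a division by zero, and a term that appears in every document gets a slightly negative weight.

```python
import math


def inverse_document_frequency(term, corpus):
    """IDF(t) = log(N / (DF(t) + 1)), using the natural logarithm."""
    n_docs = len(corpus)
    df = sum(1 for doc in corpus if term in set(doc.lower().split()))
    return math.log(n_docs / (df + 1))


corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "birds sing in the morning",
]
idf_the = inverse_document_frequency("the", corpus)      # DF = 3, log(3/4) < 0
idf_cat = inverse_document_frequency("cat", corpus)      # DF = 2, log(3/3) = 0
idf_dog = inverse_document_frequency("dog", corpus)      # DF = 1, log(3/2) > 0
idf_unseen = inverse_document_frequency("zebra", corpus)  # DF = 0, log(3/1)
print(idf_the, idf_cat, idf_dog, idf_unseen)
```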

**TF-IDF**

TF-IDF stands for Term Frequency-Inverse Document Frequency, a statistical measure used to evaluate how important a word is to a document in a collection or corpus. It combines the importance of a term in a document (TF) with the term’s rarity across the corpus (IDF). The formula is:

TF-IDF(t, d) = TF(t, d) × log(N / (DF(t) + 1))
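
Putting the pieces together, the following sketch scores every term of every document with the formula above. It is an illustration under stated assumptions only: `tf_idf_scores`, the toy corpus, and the whitespace tokenizer are hypothetical, and this is not how scikit-learn computes its weights.

```python
import math
from collections import Counter


def tf_idf_scores(corpus):
    """Return one {term: TF-IDF score} dict per document."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)

    # Document frequency: count each term once per document.
    df = Counter()
    for words in tokenized:
        df.update(set(words))

    scores = []
    for words in tokenized:
        counts = Counter(words)
        scores.append({
            term: (count / len(words)) * math.log(n_docs / (df[term] + 1))
            for term, count in counts.items()
        })
    return scores


corpus = [
    "the cat sat on the mat",
    "the dog chased the dog",
    "birds sing in the morning",
]
scores = tf_idf_scores(corpus)
# Highest-scoring term of the second document.
top_term = max(scores[1], key=scores[1].get)
print(top_term, scores[1])
```

In the second document, "the" and "dog" both occur twice, but "the" appears in every document and is pushed below zero, while "dog" is unique to that document and ranks first.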

# TF-IDF Implementation Using TfidfVectorizer From scikit-learn

**Import Libraries**

In [1]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer


**Load Dataset**

In [2]:
# Load the training split of 20 Newsgroups (downloaded and cached on first use)
newsgroups = fetch_20newsgroups(subset='train')


**Initialize TfidfVectorizer**

In [3]:
# Drop English stop words and keep only the 1,000 most frequent terms in the corpus
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)


**Convert Text Documents to TF-IDF Matrix**

In [4]:
# Learn the vocabulary and IDF weights, then return a sparse document-term matrix
tfidf_matrix = vectorizer.fit_transform(newsgroups.data)
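
The values produced by `TfidfVectorizer` will not match the textbook formula given earlier. With its default settings, scikit-learn documents that it uses raw counts for TF, a smoothed IDF of ln((1 + N) / (1 + DF(t))) + 1, and then L2-normalizes each row. The sketch below checks this on a small made-up corpus (`toy_corpus` and the variable names are assumptions, and the 20 Newsgroups data is not used) by rebuilding the matrix from `CountVectorizer` counts.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, chosen so that no document is empty.
toy_corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "birds sing in the morning",
]

# Raw term counts, using the same default tokenizer as TfidfVectorizer.
counts = CountVectorizer().fit_transform(toy_corpus).toarray().astype(float)
n_docs = counts.shape[0]

# Number of documents containing each term.
df = (counts > 0).sum(axis=0)

# Smoothed IDF (the default, smooth_idf=True).
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then L2-normalize each row (the default, norm='l2').
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_matrix = TfidfVectorizer().fit_transform(toy_corpus).toarray()
matches = np.allclose(manual, sklearn_matrix)
print(matches)
```

Because of the added 1 and the row normalization, every weight from the default `TfidfVectorizer` lies between 0 and 1, unlike the unnormalized formula, which can go negative for very common terms.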


**Display TF-IDF Matrix**

In [5]:
# Convert the sparse matrix to a dense array and label columns with the vocabulary terms
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
df_tfidf.head()


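| | 00 | 000 | 01 | 02 | 03 | 04 | 0d | 0t | 10 | 100 | ... | write | writes | written | wrong | wrote | year | years | yes | york | young |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.116702 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.061716 | 0.0 | 0.0 | 0.133857 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.054855 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.120463 | 0.0 | 0.0 |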
