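# TF-IDF from Scratch: Explanation and Implementation
This notebook builds up TF-IDF (Term Frequency-Inverse Document Frequency) from first principles and implements it in Python, moving from the derivation of the formulas to practical use with scikit-learn.

Contents:
1. What is TF-IDF?
2. Implementing TF-IDF from scratch
3. Using scikit-learn's `TfidfVectorizer`
4. Visualization and an application example (text classification)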

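## 1. The Basic Idea of TF-IDF

TF-IDF is a weighting scheme that measures how important a term is to a document within a corpus.
- **TF (Term Frequency)**: how often a term occurs in a document
- **IDF (Inverse Document Frequency)**: how rare the term is across the whole corpus

The quantities are computed as follows: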

$$
\text{TF}(t,d) = \frac{\text{count}(t, d)}{\sum_{t'} \text{count}(t', d)}
$$

$$
\text{IDF}(t) = \log\frac{N}{1 + n_t} + 1
$$

$$
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
$$

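where:
- $\text{count}(t, d)$: the number of times term $t$ occurs in document $d$
- $N$: the total number of documents
- $n_t$: the number of documents that contain term $t$
- $\log$: the natural logarithm, matching `math.log` in the code below
- $\text{TF-IDF}(t, d)$: the importance of term $t$ in document $d$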

## 2. Implementation

#### 1. Implementing TF-IDF by hand

In [None]:
import math
from collections import Counter # For counting term frequencies

docs = [
    "我 喜欢 机器 学习",
    "机器 学习 很好玩",
    "我 不 喜欢 学习"
]

# Pre-Step: The documents after tokenization (tokens/terms separated by spaces)
tokenized_docs = [doc.split() for doc in docs]
print("Tokenized Documents:", tokenized_docs)

# Step 1: Count total number of documents
N = len(tokenized_docs)

# Step 2: Calculate TF (term frequency) for each document
tf_list = []
for doc in tokenized_docs:
    counts = Counter(doc) # maps each term to its raw count, e.g., {'我': 1, '喜欢': 1, '机器': 1, '学习': 1}
    total = len(doc)
    tf = {word: counts[word] / total for word in counts}
    tf_list.append(tf)
print("TF List:", tf_list)

# Step 3: Calculate IDF (inverse document frequency) for each token/term
# - If word appears in df documents, then idf = log(N / (1 + df)) + 1
all_words = set(word for doc in tokenized_docs for word in doc)
print("All Unique Words:", all_words)
idf = {}
for word in all_words:
    df = sum(1 for doc in tokenized_docs if word in doc) # if word appears in the document, count it
    idf[word] = math.log(N / (1 + df)) + 1

# Step 4: Calculate TF-IDF for each document
# - TF-IDF = TF * IDF
tfidf_list = []
for tf in tf_list:
    tfidf = {word: tf[word] * idf[word] for word in tf}
    tfidf_list.append(tfidf)

# Step 5: Output the results
for i, tfidf in enumerate(tfidf_list):
    print(f"Document {i+1} TF-IDF:")
    for word, value in sorted(tfidf.items(), key=lambda x: -x[1]):
        print(f"  {word}: {value:.4f}")
    print()

Tokenized Documents: [['我', '喜欢', '机器', '学习'], ['机器', '学习', '很好玩'], ['我', '不', '喜欢', '学习']]
TF List: [{'我': 0.25, '喜欢': 0.25, '机器': 0.25, '学习': 0.25}, {'机器': 0.3333333333333333, '学习': 0.3333333333333333, '很好玩': 0.3333333333333333}, {'我': 0.25, '不': 0.25, '喜欢': 0.25, '学习': 0.25}]
All Unique Words: {'机器', '很好玩', '我', '学习', '喜欢', '不'}
Document 1 TF-IDF:
  我: 0.2500
  喜欢: 0.2500
  机器: 0.2500
  学习: 0.1781

Document 2 TF-IDF:
  很好玩: 0.4685
  机器: 0.3333
  学习: 0.2374

Document 3 TF-IDF:
  不: 0.3514
  我: 0.2500
  喜欢: 0.2500
  学习: 0.1781
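
As a quick check on the output above, the value for 学习 in document 1 can be worked out by hand. The term appears once among four tokens and occurs in all three documents, so $n_t = 3$ and $N = 3$ (natural logarithm, matching `math.log`):

$$
\text{TF} = \frac{1}{4} = 0.25, \qquad
\text{IDF} = \ln\frac{3}{1 + 3} + 1 \approx 0.7123, \qquad
\text{TF-IDF} \approx 0.25 \times 0.7123 \approx 0.1781
$$

A term that occurs in two of the three documents (such as 我) gets $\ln(3/3) + 1 = 1$, so its TF-IDF equals its TF, while a term found in every document ends up with an IDF below 1 under this variant of the formula.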



#### 2. Using `sklearn.feature_extraction.text.TfidfVectorizer`

**How TfidfVectorizer works (excerpt from the source documentation)**

> **Tf** means *term-frequency* while **tf-idf** means *term-frequency times inverse document-frequency*.  
> This is a common term weighting scheme in information retrieval, that has also found good use in document classification.  
>
> The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document  
> is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically  
> less informative than features that occur in a small fraction of the training corpus.  
>
> The formula that is used to compute the tf-idf for a term *t* of a document *d* in a document set is  
> 
> $$
> \text{tfidf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)
> $$
>
> and the *idf* is computed as  
>
> $$
> \text{idf}(t) = \log \frac{n}{\text{df}(t)} + 1 \quad (\text{if smooth\_idf=False})
> $$
>
> where *n* is the total number of documents in the document set and *df(t)* is the document frequency of *t*;  
> the document frequency is the number of documents in the document set that contain the term *t*.  
> The effect of adding "1" to the idf in the equation above is that terms with zero idf,  
> i.e., terms that occur in all documents in a training set, will not be entirely ignored.  
>
> (Note that the idf formula above differs from the standard textbook notation
> that defines the idf as $\text{idf}(t) = \log \frac{n}{\text{df}(t) + 1}$.)
>
> If `smooth_idf=True` (the default), the constant "1" is added to the numerator and denominator of the idf  
> as if an extra document was seen containing every term in the collection exactly once,  
> which prevents zero divisions:  
>
> $$
> \text{idf}(t) = \log \frac{1 + n}{1 + \text{df}(t)} + 1
> $$
>
> Furthermore, the formulas used to compute tf and idf depend on parameter settings  
> that correspond to the SMART notation used in IR as follows:
>
> - Tf is "n" (natural) by default, "l" (logarithmic) when `sublinear_tf=True`.  
> - Idf is "t" when `use_idf=True`, "n" (none) otherwise.  
> - Normalization is "c" (cosine) when `norm='l2'`, "n" (none) when `norm=None`.  
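
To make the difference between the three idf formulas in the excerpt concrete, here is a minimal sketch that evaluates each of them for a single term. The helper name `idf_variants` and its dictionary keys are made up for illustration and are not part of scikit-learn.

```python
import math

def idf_variants(n_docs, df):
    """Return the three idf variants from the excerpt for one term.

    n_docs: total number of documents
    df:     number of documents that contain the term
    """
    return {
        # smooth_idf=False: log(n / df) + 1
        "unsmoothed": math.log(n_docs / df) + 1,
        # smooth_idf=True (the default): log((1 + n) / (1 + df)) + 1
        "smoothed": math.log((1 + n_docs) / (1 + df)) + 1,
        # textbook form quoted in the excerpt: log(n / (df + 1))
        "textbook": math.log(n_docs / (df + 1)),
    }

# A three-document corpus, as in the examples in this notebook
for df in (1, 2, 3):
    values = idf_variants(3, df)
    print(df, {name: round(v, 4) for name, v in values.items()})
```

For a term that occurs in all three documents, both scikit-learn variants give exactly 1, whereas the textbook form goes negative. The hand-written implementation in part 1 uses the textbook form plus 1, which is one reason its numbers differ from those produced by `TfidfVectorizer`.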

2-1. analyze by characters

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "我喜欢机器学习",
    "机器学习很好玩",
    "我不喜欢学习"
]

vectorizer = TfidfVectorizer(
    analyzer='char',    # analyze by characters
    token_pattern=None, # token_pattern is only used when analyzer='word', so it is left as None here
    ngram_range=(1,1)   # unigrams only (single characters); (1,2) adds bigrams, (1,3) adds bigrams and trigrams
    # You can try different `ngram_range` values to see the effect
)

X = vectorizer.fit_transform(docs)

print("Vocabulary (term -> column index):", vectorizer.vocabulary_)

print("Feature Names:", vectorizer.get_feature_names_out())
print("\nTF-IDF Matrix:")
print(X.toarray())

Vocabulary (term -> column index): {'我': 7, '喜': 2, '欢': 9, '机': 8, '器': 3, '学': 5, '习': 1, '很': 6, '好': 4, '玩': 10, '不': 0}
Feature Names: ['不' '习' '喜' '器' '好' '学' '很' '我' '机' '欢' '玩']

TF-IDF Matrix:
[[0.         0.31173037 0.40140961 0.40140961 0.         0.31173037
  0.         0.40140961 0.40140961 0.40140961 0.        ]
 [0.         0.26806191 0.         0.34517852 0.45386827 0.26806191
  0.45386827 0.         0.34517852 0.         0.45386827]
 [0.53972482 0.31877017 0.41047463 0.         0.         0.31877017
  0.         0.41047463 0.         0.41047463 0.        ]]


Visualizing the results

In [None]:
import pandas as pd

# Each row is a document and each column is a feature (a term in the vocabulary); the values are TF-IDF weights
df_tfidf = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df_tfidf

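| | 不 | 习 | 喜 | 器 | 好 | 学 | 很 | 我 | 机 | 欢 | 玩 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.31173 | 0.40141 | 0.40141 | 0.0 | 0.31173 | 0.0 | 0.40141 | 0.40141 | 0.40141 | 0.0 |
| 1 | 0.0 | 0.268062 | 0.0 | 0.345179 | 0.453868 | 0.268062 | 0.453868 | 0.0 | 0.345179 | 0.0 | 0.453868 |
| 2 | 0.539725 | 0.31877 | 0.410475 | 0.0 | 0.0 | 0.31877 | 0.0 | 0.410475 | 0.0 | 0.410475 | 0.0 |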


2-2. analyze by words

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "我 喜欢 机器 学习",
    "机器 学习 很好玩",
    "我 不 喜欢 学习"
]

vectorizer = TfidfVectorizer(
    analyzer='word', # analyze by words
    token_pattern=r"(?u)\b\w+\b", # match words of one or more characters; the default pattern r"(?u)\b\w\w+\b" would drop single-character tokens such as 我 and 不
    ngram_range=(1,1)
)

X = vectorizer.fit_transform(docs)

print("Feature Names:", vectorizer.get_feature_names_out())
print("\nTF-IDF Matrix:")
print(X.toarray())

Feature Names: ['不' '喜欢' '学习' '很好玩' '我' '机器']

TF-IDF Matrix:
[[0.         0.52682017 0.40912286 0.         0.52682017 0.52682017]
 [0.         0.         0.42544054 0.72033345 0.         0.54783215]
 [0.63174505 0.4804584  0.37311881 0.         0.4804584  0.        ]]
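
These numbers differ from the hand-written implementation in part 1 because `TfidfVectorizer`, with its default settings, uses the raw count as tf, the smoothed idf from the excerpt above, and L2 normalization of each row. The following is a minimal pure-Python sketch of that recipe, assuming those defaults (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`); it is meant as a cross-check and not as a copy of scikit-learn's internals.

```python
import math
from collections import Counter

docs = ["我 喜欢 机器 学习", "机器 学习 很好玩", "我 不 喜欢 学习"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Sort the vocabulary so the columns follow the same order as the feature names above
vocab = sorted({word for doc in tokenized for word in doc})

# Document frequency and smoothed idf: log((1 + n) / (1 + df)) + 1
df = {w: sum(w in doc for doc in tokenized) for w in vocab}
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

rows = []
for doc in tokenized:
    counts = Counter(doc)
    # tf is the raw count here, not the count divided by the document length
    weights = [counts[w] * idf[w] for w in vocab]
    # L2-normalize the row so that it has unit Euclidean length
    norm = math.sqrt(sum(x * x for x in weights))
    rows.append([x / norm for x in weights])

print(vocab)
for row in rows:
    print([round(x, 8) for x in row])
```

Because every row has unit length, the weights are comparable across documents of different sizes, which the length-divided TF in part 1 only achieves approximately.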


2-3. character-level and tokenizer examples (Chinese text)

In [19]:
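import jieba

# Use unsegmented sentences (as in 2-1): the output shown for this cell comes from
# raw text, not from the space-separated `docs` defined in 2-2
docs = [
    "我喜欢机器学习",
    "机器学习很好玩",
    "我不喜欢学习"
]

def jieba_tokenize(text):
    # jieba.lcut segments a Chinese string into a list of words
    return jieba.lcut(text)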

vectorizer_char = TfidfVectorizer(analyzer='char', ngram_range=(1,2))
vectorizer_jieba = TfidfVectorizer(tokenizer=jieba_tokenize, token_pattern=None)

X_char = vectorizer_char.fit_transform(docs)
X_jieba = vectorizer_jieba.fit_transform(docs)

print("Char-level features:", vectorizer_char.get_feature_names_out()[:10])
print("Jieba features:", vectorizer_jieba.get_feature_names_out()[:10])

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.670 seconds.
Prefix dict has been built successfully.


Char-level features: ['不' '不喜' '习' '习很' '喜' '喜欢' '器' '器学' '好' '好玩']
Jieba features: ['不' '喜欢' '好玩' '学习' '很' '我' '机器']
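
To see where character-level features such as '不喜' come from, here is a small sketch of sliding-window n-gram extraction. The helper `char_ngrams` is hypothetical and only illustrates the idea; the vectorizer applies its own preprocessing on top of this and is not implemented this way line for line.

```python
def char_ngrams(text, min_n, max_n):
    """List every character n-gram of length min_n..max_n (inclusive) in text."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the string
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams

# Unigrams and bigrams of the third example sentence
print(char_ngrams("我不喜欢学习", 1, 2))
```

Character n-grams need no word segmentation, but they produce a larger vocabulary with less meaningful units than a tokenizer such as jieba, which is the trade-off visible in the two feature lists above.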


#### 3. Application example: TF-IDF + Naive Bayes classification

In [None]:
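from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Sample data (already segmented, tokens separated by spaces)
corpus = [
    "我 爱 吃 苹果",
    "我 爱 吃 香蕉",
    "今天 天气 真好",
    "明天 要 下雨 了",
    "苹果 很 好吃",
    "香蕉 味道 不错"
]
y = [0, 0, 1, 1, 0, 0]  # 0: fruit, 1: weather

# Split the raw text first so the vocabulary and idf weights are learned from the training set only
docs_train, docs_test, y_train, y_test = train_test_split(corpus, y, test_size=0.3, random_state=42)

vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")  # keep single-character tokens
X_train = vectorizer.fit_transform(docs_train)
X_test = vectorizer.transform(docs_test)  # reuse the training vocabulary and idf

model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# With only six samples the report is illustrative, not a meaningful evaluation
print(classification_report(y_test, y_pred, zero_division=0))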

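## Summary

1. TF-IDF is a statistics-based text representation that measures how important a term is to a document.
2. It combines term frequency (TF) with inverse document frequency (IDF), which down-weights common terms and highlights distinctive ones.
3. In practice, `TfidfVectorizer` handles tokenization, term counting, and matrix construction in one step; Chinese text must either be pre-segmented or passed through a custom tokenizer such as jieba.
4. TF-IDF vectors are commonly used as input features for classical machine learning models such as SVM, NB, and LR.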