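> A baseline is the reference performance established on the current task with the simplest, cheapest, and most reproducible method. It is used to judge whether later, more complex models actually deliver a meaningful improvement.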

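What makes a good baseline
| Criterion | Explanation | Interview phrasing |
| --- | --- | --- |
| Simple | Does not rely on deep learning | "Build the reference model with traditional feature-based methods" |
| Stable | Results do not fluctuate much with random initialization | "Results are stable and reproducible" |
| Fast | Trains within a few minutes | "Verify at low cost that the task is separable" |
| Interpretable | Feature weights can be inspected | "High-weight terms can be analyzed" |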
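
The "stable" criterion can be checked empirically by repeating the split with several random seeds and looking at the spread of the scores. The following is only a minimal sketch: the toy corpus, the `run_once` helper, and the choice of five seeds are assumptions made for illustration, not part of the original notes.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, repeated so that every split has enough samples.
texts = ["great movie loved it", "wonderful acting great plot",
         "terrible movie hated it", "awful acting boring plot"] * 10
labels = [1, 1, 0, 0] * 10


def run_once(seed):
    """Train and score the TF-IDF + logistic regression baseline for one seed."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=0.25, random_state=seed, stratify=labels
    )
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)  # accuracy on the held-out split


scores = np.array([run_once(seed) for seed in range(5)])
print(f"accuracy mean={scores.mean():.3f} std={scores.std():.3f}")
```

A small standard deviation across seeds is one reasonable sign that the baseline is stable enough to serve as a reference point. On real data, the seed count and the metric should be chosen to fit the task.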

Standard structure of a TF-IDF baseline

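The baseline uses TF-IDF + a linear classifier (logistic regression). TF-IDF builds sparse bag-of-words features that weight each term by how distinctive it is within the corpus (the weighting itself is unsupervised; which terms matter for which class is learned by the classifier). The linear model acts as a low-complexity discriminator that tests whether the texts are linearly separable in this term-statistics space. This baseline trains quickly and gives stable results, so it serves as the performance reference for later deep models.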
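
To make the weighting concrete, the sketch below recomputes one TF-IDF row by hand and compares it with scikit-learn. It assumes the default `TfidfVectorizer` settings (`smooth_idf=True`, `norm="l2"`, raw term counts), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1. The three-document corpus and the `idf` helper are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; every token has at least two characters,
# so the default token_pattern keeps all of them.
docs = ["cat sat mat", "cat ate fish", "dog ate bone"]

vec = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
X = vec.fit_transform(docs).toarray()
vocab = vec.vocabulary_  # term -> column index
n_docs = len(docs)


def idf(term):
    """Smoothed inverse document frequency, as used by the default settings."""
    df = sum(term in d.split() for d in docs)  # number of documents containing the term
    return np.log((1 + n_docs) / (1 + df)) + 1


# Recompute the first document's row: each term occurs once (tf = 1),
# so the raw weight is just the idf, followed by L2 normalization.
terms = docs[0].split()
raw = np.array([idf(t) for t in terms])
manual = raw / np.linalg.norm(raw)
from_sklearn = np.array([X[0, vocab[t]] for t in terms])

print(dict(zip(terms, manual.round(3))))
print("matches scikit-learn:", np.allclose(manual, from_sklearn))
```

In this toy corpus "cat" appears in two documents and "sat" in only one, so "cat" receives the lower weight within the first document. That is the sense in which rarer terms are treated as more distinctive.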

In [None]:
import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

# 1. Load the data
df = pd.read_csv("data.csv")  # two columns: text, label

# Chinese word segmentation (can be skipped for English text)
df["text_cut"] = df["text"].apply(lambda x: " ".join(jieba.lcut(x)))

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df["text_cut"], df["label"], test_size=0.2, random_state=42
)

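# 3. TF-IDF features
# Note: the default token_pattern drops single-character tokens,
# so single-character Chinese words are discarded after segmentation.
vectorizer = TfidfVectorizer(
    max_features=20000,
    ngram_range=(1, 2),
    min_df=3,       # ignore terms that appear in fewer than 3 documents
    max_df=0.9      # ignore terms that appear in more than 90% of documents
)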

X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# 4. Train the classifier
# max_iter=1000 is an upper bound on solver iterations, not a fixed count
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_tfidf, y_train)

# 5. Evaluate
y_pred = clf.predict(X_test_tfidf)

print("Accuracy: ", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

In [1]:
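from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
# Note: per the warning in the output, max_length=30 is overridden here by a
# max_new_tokens value of 256, so the generated text runs past 30 tokens.
generator("In this course, we will teach you how to", max_length=30, num_return_sequences=2)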



To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/

[{'generated_text': 'In this course, we will teach you how to create a virtual reality experience using the Oculus Rift and the HTC Vive. There are two options which are currently available:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'},
 {'generated_text': 'In this course, we will teach you how to use the “Pledgehammer“: the key to achieving success in a business. The key is to be able to apply our principles in a business.“\n\n\nThe following topics in this course are not intended to be offered in this course.\n1. The Problem with the “Pledgehammer“: the key to achievin