---
title: "Artificial Intelligence - A Transformer-Based Language Classifier"
date: 2025-05-05T21:40:32+08:00
author: "Liu Zheng"
tags: ["Notes", "Experiments"]
categories: "Experiment Notes"
---

## A Transformer-Based Language Classifier

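This is a practice exercise: pick a large pretrained text-classification model, find a medical Q&A dataset, vectorize it, and label the data for classification. Then build the neural network layers by hand and implement simple training and inference.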

The text-classification model used here is [XLM-RoBERTa](https://huggingface.co/papluca/xlm-roberta-base-language-detection), a base-size XLM-RoBERTa checkpoint fine-tuned for language detection.

In [4]:
import os
import torch

if os.name == 'posix':  # Linux or macOS
    PREFIX = '/home/foliage/projects'
elif os.name == 'nt':  # Windows
    PREFIX = 'D:/Projects'
else:
    raise OSError("Unsupported operating system.")

# Local path to the downloaded model
MODEL_PATH = f"{PREFIX}/huggingface-xlm-roberta-base-language-detection"

# Use the first GPU when available, otherwise fall back to the CPU
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

The code below runs simple language-classification inference through a Pipeline.

In [5]:
from transformers import pipeline

text = [
    'Brevity is the soul of wit.',
    '你大爷终归是你大爷'
]

pipe = pipeline('text-classification', model=MODEL_PATH, device=DEVICE)
pipe(text, top_k=1, truncation=True)

Device set to use cuda:0


[[{'label': 'en', 'score': 0.8889274001121521}],
 [{'label': 'zh', 'score': 0.9208274483680725}]]
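The pipeline returns one inner list per input text, each holding `top_k` candidate dicts. As a minimal sketch of pairing those results back with the inputs, the snippet below flattens that nested structure into rows. `flatten_predictions` is a hypothetical helper name, and the `results` literal is copied from the output above rather than produced by running the model.

```python
from typing import Dict, List, Tuple


def flatten_predictions(
    texts: List[str],
    results: List[List[Dict[str, object]]],
) -> List[Tuple[str, str, float]]:
    """Pair each input text with its highest-scoring (label, score)."""
    rows = []
    for text, candidates in zip(texts, results):
        # Each inner list holds up to top_k candidates; keep the best one
        best = max(candidates, key=lambda c: c["score"])
        rows.append((text, best["label"], float(best["score"])))
    return rows


texts = ['Brevity is the soul of wit.', '你大爷终归是你大爷']

# Copied from the pipeline output shown above (not recomputed here)
results = [[{'label': 'en', 'score': 0.8889274001121521}],
           [{'label': 'zh', 'score': 0.9208274483680725}]]

for text, label, score in flatten_predictions(texts, results):
    print(f"{label}\t{score:.4f}\t{text}")
```

Keeping the text next to its label makes the result easier to inspect once the batch grows beyond a couple of sentences.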

The example below performs the same inference by calling the tokenizer and model directly:

In [6]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

texts = [
    'Brevity is the soul of wit.',
    '非常重要的两个参数。它们的作用是确保输入文本的长度符合模型的要求。'
]

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH).to(DEVICE)

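# Pad every text to the tokenizer's maximum length and truncate anything longer
inputs = tokenizer(texts, padding='max_length', truncation=True, return_tensors='pt')
# Move the input tensors to the same device as the model
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}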

print(f"Model Max Length: {tokenizer.model_max_length}, Model Max Position Embeddings: {model.config.max_position_embeddings}")
print(f"Input IDs Shape: {inputs['input_ids'].shape}")
print(f"Attention Mask Shape: {inputs['attention_mask'].shape}")

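# Inference only, so gradient tracking is disabled
with torch.no_grad():
    outputs = model(**inputs)
    # outputs.attentions is None unless output_attentions=True is passed, so only the logits are kept
    logits = outputs.logits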

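# Convert logits to per-class probabilities
preds = torch.softmax(logits, dim=-1)

# Map each text's most probable class index to its language label
id2lang = model.config.id2label
vals, idxs = torch.max(preds, dim=1)
{id2lang[k.item()]: v.item() for k, v in zip(idxs, vals)}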

Model Max Length: 512, Model Max Position Embeddings: 514
Input IDs Shape: torch.Size([2, 512])
Attention Mask Shape: torch.Size([2, 512])


{'en': 0.8889275193214417, 'zh': 0.9935047626495361}
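The last lines of the cell turn logits into probabilities and pick the most likely label for each text. The sketch below reproduces that step in plain Python so the arithmetic is visible. The logits and the three-entry `id2label` map are made-up values for illustration (the real model's label map is larger), and `softmax` and `top_label` are hypothetical helper names. It returns a list rather than a dict keyed by language, because a dict would silently keep only one entry when two inputs share a language.

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits: List[float], id2label: Dict[int, str]) -> Tuple[str, float]:
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return id2label[idx], probs[idx]


# Hypothetical label map and logits, for illustration only
id2label = {0: "en", 1: "zh", 2: "fr"}
batch_logits = [
    [4.0, 0.5, 1.0],  # first text: class 0 dominates
    [3.5, 0.2, 0.9],  # second text: also class 0
    [0.1, 5.0, 0.3],  # third text: class 1 dominates
]

# One (label, probability) pair per input, in input order
predictions = [top_label(row, id2label) for row in batch_logits]
for label, prob in predictions:
    print(f"{label}: {prob:.4f}")
```

With two English inputs in the batch, the list keeps both predictions, whereas the dict comprehension used in the cell above would retain only the later one.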