# Chinese tour (1): quick start

## Using pipeline()

Pass the name of the task you want to pipeline(). Available tasks include: 'feature-extraction', 'text-classification', 'token-classification', 'question-answering', 'table-question-answering', 'fill-mask', 'summarization', 'translation', 'text2text-generation', 'text-generation', 'zero-shot-classification', 'conversational', 'image-classification', 'translation_XX_to_YY'. The examples below use 'sentiment-analysis', which is an alias for 'text-classification'.

In [1]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
classifier('We are very happy to show you the 🤗 Transformers library.')

[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

In [2]:
# A list of sentences can also be passed in one call
results = classifier(["We are very happy to show you the 🤗 Transformers library.",
                      "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
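
The pipeline returns one dict per input, each holding a `label` and a `score`. The second score above (0.5309) is barely better than a coin flip for a two-class model. Below is a minimal sketch of flagging such low-confidence predictions. The `flag_uncertain` helper and the 0.6 threshold are illustrative assumptions, not part of transformers, and the results are copied from the output above.

```python
# Results copied from the pipeline output shown above.
results = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.5309},
]

def flag_uncertain(results, threshold=0.6):
    """Return the indices of predictions whose score is below `threshold`.

    Hypothetical helper: the threshold is an arbitrary choice for illustration.
    """
    return [i for i, r in enumerate(results) if r["score"] < threshold]

uncertain = flag_uncertain(results)
print(uncertain)
```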


In [3]:
generator = pipeline("text-generation")
generator("i heard a sin from above.")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'i heard a sin from above. They began to pray for him and then their prayers became violent.\n\nWhen they heard that sin was happening to their family they fled the area, leaving the children behind.\n\nThe police discovered the child in'}]

In [4]:
# The default English classifier is not very sensitive to Chinese input
classifier('你太好了！')

[{'label': 'POSITIVE', 'score': 0.7977770566940308}]

The default model behind this classification pipeline is distilbert-base-uncased-finetuned-sst-2-english. To use a different model, pass the model argument.

In [5]:
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")

In [6]:
classifier('你太好了')

[{'label': '5 stars', 'score': 0.6415705680847168}]
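
This model reports star ratings as strings rather than POSITIVE/NEGATIVE. If a numeric rating is more convenient, the label can be parsed, as in the sketch below. It assumes every label follows the "N star(s)" pattern seen in the output above; `stars_to_int` is a hypothetical helper, not a transformers function.

```python
def stars_to_int(label):
    """Convert a label such as '5 stars' or '1 star' to an integer rating.

    Assumes the label starts with the number of stars, as in the output above.
    """
    return int(label.split()[0])

prediction = {"label": "5 stars", "score": 0.6415705680847168}  # copied from above
rating = stars_to_int(prediction["label"])
print(rating)
```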

## Details of pipeline()

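* AutoTokenizer: loads the tokenizer that matches the model
* AutoModelForSequenceClassification: the model class for sequence (sentence) classification; different tasks use different model classes
* from_pretrained(): a method of both classes above; it loads the pretrained weights or vocabulary for the given model name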

In [8]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

If you cannot find a pretrained model trained on data similar to your dataset, you can fine-tune one on your own data.

In [9]:
classifier('你太好了')

[{'label': '5 stars', 'score': 0.6415705680847168}]

## under the hood: pretrained models

In [10]:
## PYTORCH CODE
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
## TENSORFLOW CODE
# from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
# model_name = "distilbert-base-uncased-finetuned-sst-2-english"
# tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
# tokenizer = AutoTokenizer.from_pretrained(model_name)

## using the tokenizer

In [11]:
inputs = tokenizer("我喜欢吸吸果冻。")
print(inputs)

{'input_ids': [101, 1855, 100, 100, 100, 100, 100, 1974, 100, 100, 1916, 1810, 100, 100, 1636, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
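
Most ids in this output are 100. Assuming the standard bert-base-uncased special-token ids (100 for [UNK], 101 for [CLS], 102 for [SEP]), which this DistilBERT checkpoint is expected to share, the English tokenizer does not know most of the characters in the sentence. A small sketch that counts the unknown tokens in the ids printed above:

```python
# input_ids copied from the tokenizer output above.
input_ids = [101, 1855, 100, 100, 100, 100, 100, 1974, 100, 100,
             1916, 1810, 100, 100, 1636, 102]

# Assumed special-token ids of the bert-base-uncased vocabulary.
UNK_ID, CLS_ID, SEP_ID = 100, 101, 102

# Drop the [CLS] and [SEP] markers that frame the sentence.
content = [i for i in input_ids if i not in (CLS_ID, SEP_ID)]
unk_count = sum(1 for i in content if i == UNK_ID)
unk_ratio = unk_count / len(content)
print(f"{unk_count} of {len(content)} content tokens are [UNK] ({unk_ratio:.0%})")
```

A high share of [UNK] tokens is one reason an English-only model gives unreliable results on Chinese text.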


To pass several sentences at once, put them in a batch and pad them to the same length.

In [12]:
lines = ['我是绝世好男人。','不二子是我的女神。']
pt_batch = tokenizer(lines, 
                    padding=True,
                    truncation=True,
                    max_length=512,
                    return_tensors="pt")

In [13]:
for key, value in pt_batch.items():
    print(f"{key}:{value.numpy().tolist()}")

input_ids:[[101, 1855, 100, 100, 100, 1916, 100, 100, 1636, 102, 0], [101, 100, 100, 1916, 1810, 100, 100, 100, 1979, 1636, 102]]
attention_mask:[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
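
The shorter sentence is padded with 0 and its attention mask marks the padded position with 0 so the model ignores it. The sketch below reproduces that behaviour in plain Python. It is an illustration of the idea rather than the library's implementation, and it assumes a pad id of 0, matching the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# The two sequences from the output above, with the padding removed.
unpadded = [
    [101, 1855, 100, 100, 100, 1916, 100, 100, 1636, 102],
    [101, 100, 100, 1916, 1810, 100, 100, 100, 1979, 1636, 102],
]
batch = pad_batch(unpadded)
for key, value in batch.items():
    print(f"{key}:{value}")
```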


## using the model

In [14]:
pt_outputs = pt_model(**pt_batch)  # ** unpacks the batch dict into keyword arguments
print(pt_outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[ 1.4589, -1.2317],
        [ 1.3499, -1.1157]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)


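Without the pipeline, the model returns the values before the final activation function (the logits), because that activation (such as softmax) is usually fused with the loss computation.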

In [18]:
from torch import nn
pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)  # dim=-1: normalize over the classes of each example
print(pt_predictions)

tensor([[0.9365, 0.0635],
        [0.9217, 0.0783]], grad_fn=<SoftmaxBackward>)
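
The choice of dim decides which axis sums to 1. The plain-Python sketch below applies softmax to the logits printed earlier, once per row (the equivalent of dim=-1, one distribution per example) and once per column (the equivalent of dim=0, which mixes examples and is not what a classifier needs). The `softmax` helper is written here for illustration only.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

logits = [[1.4589, -1.2317], [1.3499, -1.1157]]  # copied from the output above

# Equivalent of dim=-1: normalize each row (one example) over its classes.
per_example = [softmax(row) for row in logits]

# Equivalent of dim=0: normalize each column (one class) over the batch.
columns = [softmax(list(col)) for col in zip(*logits)]
per_class = [list(row) for row in zip(*columns)]

print([[round(p, 4) for p in row] for row in per_example])
```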


In [19]:
# When labels are provided, the output also contains a loss
import torch
pt_outputs = pt_model(**pt_batch, labels = torch.tensor([1,0]))
print(pt_outputs)

SequenceClassifierOutput(loss=tensor(1.4189, grad_fn=<NllLossBackward>), logits=tensor([[ 1.4589, -1.2317],
        [ 1.3499, -1.1157]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
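
The loss can be checked by hand. Assuming the model applies mean cross-entropy over the batch (the usual choice for single-label classification), the sketch below recomputes it from the printed logits and the labels [1, 0]. The `cross_entropy` function is a plain-Python illustration, not the library code.

```python
import math

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class under softmax(logits)."""
    losses = []
    for row, label in zip(logits, labels):
        m = max(row)
        # log(sum(exp(row))) computed stably by factoring out the maximum.
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses)

logits = [[1.4589, -1.2317], [1.3499, -1.1157]]  # copied from the output above
labels = [1, 0]
loss = cross_entropy(logits, labels)
print(round(loss, 4))
```

Almost all of the loss comes from the first example, whose label 1 contradicts the model's confident prediction of class 0.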


transformers also provides a Trainer class to help you train your own model.

In [21]:
# A fine-tuned model can be saved together with its tokenizer
# tokenizer.save_pretrained(save_directory)
# model.save_pretrained(save_directory)

# It can then be loaded back with from_pretrained()
# (from_pt=True loads the PyTorch weights into a TensorFlow model)
# from transformers import AutoTokenizer, TFAutoModel
# tokenizer = AutoTokenizer.from_pretrained(save_directory)
# model = TFAutoModel.from_pretrained(save_directory, from_pt=True)

In [23]:
# You can also ask the model to return all hidden states and attention weights
pt_outputs = pt_model(**pt_batch, output_hidden_states=True, output_attentions=True)
all_hidden_states = pt_outputs.hidden_states
all_attentions = pt_outputs.attentions
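
Both results are tuples of tensors. As a rough guide to what to expect, the sketch below lists the shapes for the batch used here (2 sentences, 11 tokens each). The hyperparameters (6 layers, 12 heads, hidden size 768) are assumptions taken from the published DistilBERT-base configuration, and `expected_shapes` is a hypothetical helper, so check `pt_model.config` for the actual values.

```python
def expected_shapes(batch_size, seq_len, n_layers=6, n_heads=12, dim=768):
    """Shapes of the returned tuples, assuming DistilBERT-base hyperparameters."""
    # One hidden state for the embedding output plus one per transformer layer.
    hidden_states = [(batch_size, seq_len, dim)] * (n_layers + 1)
    # One attention map per layer: each head attends from every token to every token.
    attentions = [(batch_size, n_heads, seq_len, seq_len)] * n_layers
    return hidden_states, attentions

hidden_shapes, attention_shapes = expected_shapes(batch_size=2, seq_len=11)
print(len(hidden_shapes), hidden_shapes[0])
print(len(attention_shapes), attention_shapes[0])
```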

## Customizing the model

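1. The classes used above have the Auto prefix, which selects the right architecture automatically from the model name. Below, a specific model class is used directly instead.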

In [25]:
from transformers import BertTokenizer, BertForMaskedLM
model_name = "bert-base-uncased"
model = BertForMaskedLM.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [26]:
masker = pipeline('fill-mask',model = model,tokenizer=tokenizer)

In [31]:
masker('i am a good [MASK].')

[{'sequence': 'i am a good man.',
  'score': 0.20526237785816193,
  'token': 2158,
  'token_str': 'm a n'},
 {'sequence': 'i am a good person.',
  'score': 0.11671063303947449,
  'token': 2711,
  'token_str': 'p e r s o n'},
 {'sequence': 'i am a good guy.',
  'score': 0.08655985444784164,
  'token': 3124,
  'token_str': 'g u y'},
 {'sequence': 'i am a good girl.',
  'score': 0.0763998031616211,
  'token': 2611,
  'token_str': 'g i r l'},
 {'sequence': 'i am a good friend.',
  'score': 0.04764905199408531,
  'token': 2767,
  'token_str': 'f r i e n d'}]
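
In this output token_str is printed with spaces between the letters, so it is easier to recover the predicted word from the sequence field. The sketch below does that by stripping the text around the mask. It assumes the returned sequence keeps the template text unchanged on both sides of the mask, which holds for the results above; `filled_word` is a hypothetical helper.

```python
template = "i am a good [MASK]."
prefix, suffix = template.split("[MASK]")

# Sequences and scores copied from the output above.
results = [
    {"sequence": "i am a good man.", "score": 0.20526237785816193},
    {"sequence": "i am a good person.", "score": 0.11671063303947449},
    {"sequence": "i am a good guy.", "score": 0.08655985444784164},
]

def filled_word(sequence):
    """Strip the text around the mask to leave only the predicted word."""
    return sequence[len(prefix):len(sequence) - len(suffix)]

words = [filled_word(r["sequence"]) for r in results]
print(words)
```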

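2. You can change the model's configuration to change its output; this goes through the model's config class. Here num_labels=10 is passed to from_pretrained(), which overrides that value in the config.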

In [35]:
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
model_name = 'distilbert-base-uncased'
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifi

In [40]:
classifier = pipeline("text-classification",model=model,tokenizer=tokenizer)
classifier("不二子是绝世好女人！")

[{'label': 'LABEL_9', 'score': 0.11321061104536057}]

In [41]:
classifier("牛逼！")

[{'label': 'LABEL_5', 'score': 0.11135512590408325}]
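
Both scores are close to 1/10. That is what a classification head with 10 outputs and freshly initialized weights tends to produce: its logits are all near zero, so softmax spreads the probability almost evenly and the winning label is essentially arbitrary. The sketch below imitates this with small random logits; the value range and the seed are assumptions chosen for illustration, not values taken from the model.

```python
import math
import random

random.seed(0)
num_labels = 10

# Stand-in for the logits of an untrained head: small random values near zero.
logits = [random.uniform(-0.05, 0.05) for _ in range(num_labels)]

# Softmax over the 10 classes.
m = max(logits)
exps = [math.exp(x - m) for x in logits]
probs = [e / sum(exps) for e in exps]

best = max(range(num_labels), key=lambda i: probs[i])
print(f"LABEL_{best}", round(probs[best], 4))
```

The predictions only become meaningful after the new head has been fine-tuned on labeled data.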