# Embedding for Chinese

## Using [shibing624/text2vec-base-chinese](https://huggingface.co/shibing624/text2vec-base-chinese)

shibing624/text2vec-base-chinese is a CoSENT (Cosine Sentence) model.

It maps each sentence to a 768-dimensional dense vector and can be used for tasks such as sentence embedding, text matching, or semantic search.
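For text matching, a common way to compare two sentence embeddings is cosine similarity. The following is a minimal sketch, assuming NumPy is installed. The vectors are made-up 4-dimensional stand-ins for the model's 768-dimensional output, and `cosine_similarity` is a helper name introduced here for illustration, not part of text2vec.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return float(np.dot(a, b) / denom)

# Toy vectors (hypothetical values, not real model output)
u = np.array([1.0, 2.0, 0.0, -1.0])
v = np.array([2.0, 4.0, 0.0, -2.0])  # same direction as u
w = np.array([2.0, -1.0, 3.0, 0.0])  # orthogonal to u

print(cosine_similarity(u, v))  # close to 1: same direction
print(cosine_similarity(u, w))  # 0: orthogonal
```

A score near 1 means the two vectors point in nearly the same direction. What threshold counts as a "match" depends on the model and the data.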

### Using text2vec to simplify the workflow

In [None]:
# !pip install -U text2vec

In [5]:
from text2vec import SentenceModel
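# Two paraphrases: "How do I change the bank card linked to Huabei" / "Change the bank card linked to Huabei"
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']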

model = SentenceModel('shibing624/text2vec-base-chinese')
embeddings = model.encode(sentences)
print(embeddings)


2023-06-09 14:25:01.458 | DEBUG    | text2vec.sentence_model:__init__:74 - Use device: cpu


[[-4.44250618e-04 -2.97346562e-01  8.57901335e-01 ... -5.27700007e-01
  -1.43155485e-01 -1.00078315e-01]
 [ 6.53620481e-01 -7.66668767e-02  9.59623873e-01 ... -6.01224482e-01
  -1.67903851e-03  2.14575961e-01]]
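Semantic search over such embeddings usually comes down to ranking a set of corpus vectors by their similarity to a query vector. The following is a minimal sketch, assuming NumPy and a 2-D array with one embedding per row. The 3-dimensional vectors are hypothetical stand-ins, and `rank_by_similarity` is a name introduced here for illustration.

```python
import numpy as np

def rank_by_similarity(query_vec, corpus_vecs):
    """Return corpus row indices sorted from most to least similar, plus the scores."""
    q = np.asarray(query_vec, dtype=np.float64)
    c = np.asarray(corpus_vecs, dtype=np.float64)
    # L2-normalize so that a dot product equals cosine similarity
    q = q / max(np.linalg.norm(q), 1e-12)
    c = c / np.clip(np.linalg.norm(c, axis=1, keepdims=True), 1e-12, None)
    scores = c @ q
    order = np.argsort(-scores)  # descending by score
    return order, scores

# Hypothetical stand-ins for sentence embeddings (not real model output)
corpus = np.array([
    [0.0, 1.0, 0.0],   # unrelated to the query
    [1.0, 0.1, 0.0],   # close to the query
    [-1.0, 0.0, 0.0],  # opposite of the query
])
query = np.array([1.0, 0.0, 0.0])

order, scores = rank_by_similarity(query, corpus)
print(order)   # row indices, best match first
print(scores)  # cosine similarity of each corpus row to the query
```

With real data, `corpus_vecs` would presumably be the array returned by `model.encode(...)` for the corpus sentences and `query_vec` a single row encoded from the query.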


### Using HuggingFace Transformers

In [6]:
# !pip install transformers

In [7]:
from transformers import BertTokenizer, BertModel
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Load model from HuggingFace Hub
tokenizer = BertTokenizer.from_pretrained('shibing624/text2vec-base-chinese')
model = BertModel.from_pretrained('shibing624/text2vec-base-chinese')
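# Two paraphrases: "How do I change the bank card linked to Huabei" / "Change the bank card linked to Huabei"
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']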
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Sentence embeddings:
tensor([[-4.4425e-04, -2.9735e-01,  8.5790e-01,  ..., -5.2770e-01,
         -1.4316e-01, -1.0008e-01],
        [ 6.5362e-01, -7.6667e-02,  9.5962e-01,  ..., -6.0122e-01,
         -1.6790e-03,  2.1458e-01]])
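To see why the attention mask matters in `mean_pooling`, it helps to run the same arithmetic on a tiny hand-made batch. The following is a sketch that re-implements the pooling step in NumPy rather than PyTorch, purely for illustration. `mean_pooling_np` and the toy values are assumptions introduced here, not part of the model code.

```python
import numpy as np

def mean_pooling_np(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: shape (batch, seq_len, hidden)
    attention_mask:   shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    tokens = np.asarray(token_embeddings, dtype=np.float64)
    mask = np.asarray(attention_mask, dtype=np.float64)[..., None]  # (batch, seq_len, 1)
    summed = (tokens * mask).sum(axis=1)             # padded positions contribute 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

# Toy batch: 2 sentences, 3 token positions, hidden size 2 (hypothetical values)
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],      # no padding
])
mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

pooled = mean_pooling_np(tokens, mask)
print(pooled)  # first row averages only the two real tokens
```

Without the mask, the padding vector `[100, 100]` would be averaged into the first sentence and distort its embedding.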


### Switching the model to [GanymedeNil/text2vec-large-chinese](https://huggingface.co/GanymedeNil/text2vec-large-chinese)

In [12]:
from transformers import BertTokenizer, BertModel
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Load model from HuggingFace Hub
tokenizer = BertTokenizer.from_pretrained('GanymedeNil/text2vec-large-chinese')
model = BertModel.from_pretrained('GanymedeNil/text2vec-large-chinese')
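# Two paraphrases: "How do I change the bank card linked to Huabei" / "Change the bank card linked to Huabei"
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']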
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)

print(len(sentence_embeddings[1]))

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Sentence embeddings:
tensor([[-0.5050, -0.1925,  0.5590,  ...,  0.8610, -0.7712,  0.7617],
        [-0.6504,  0.1314,  0.5595,  ...,  1.0802, -0.4565,  0.7547]])
1024
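The large model produces 1024-dimensional vectors, while the base model is described above as producing 768-dimensional ones, so embeddings from the two models cannot be compared with each other directly. The following is a minimal sketch of a guard that could run before mixing vectors, assuming NumPy. `check_same_dim` is a hypothetical helper, and the zero arrays only stand in for real embeddings.

```python
import numpy as np

def check_same_dim(a, b):
    """Return the shared embedding dimension, or raise if the two differ."""
    a = np.asarray(a)
    b = np.asarray(b)
    if a.shape[-1] != b.shape[-1]:
        raise ValueError(
            f"embedding dimensions differ: {a.shape[-1]} vs {b.shape[-1]}"
        )
    return a.shape[-1]

# Placeholder arrays with the shapes of the two models' outputs (not real embeddings)
base_like = np.zeros((2, 768))
large_like = np.zeros((2, 1024))

print(check_same_dim(base_like, base_like))
try:
    check_same_dim(base_like, large_like)
except ValueError as err:
    print(err)
```

Even when dimensions happen to match, vectors from different models live in different spaces, so a stored index would need to be re-encoded after switching models.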
