## objective

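Semantic retrieval: instead of matching the literal wording of a user's query, the retrieval system captures the intent behind it and uses that intent to find the most relevant results.

Contrastive training: pull positive pairs closer together and push negative pairs further apart.

A good contrastive system satisfies two properties:

- Alignment: similar examples should have similar features, i.e. lie close together in the embedding space.

- Uniformity: the features should preserve as much information as possible. The more uniformly the features are distributed, the more any two of them differ, and the more information is retained.

Training is unsupervised, using SimCSE.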
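Alignment and uniformity can be made concrete with a small numeric sketch. The notes give no formulas, so the definitions below are an assumption: alignment as the mean squared distance between positive pairs, and uniformity as the log of the mean Gaussian potential over all pairs (with `t=2`). The function names and toy vectors are illustrative only.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def alignment(pairs):
    """Mean squared distance between the two embeddings of each positive pair (lower is better)."""
    return sum(sq_dist(u, v) for u, v in pairs) / len(pairs)

def uniformity(embs, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs (lower means more uniform)."""
    vals = [math.exp(-t * sq_dist(embs[i], embs[j]))
            for i in range(len(embs)) for j in range(i + 1, len(embs))]
    return math.log(sum(vals) / len(vals))

# Toy 2-D embeddings on the unit circle: one set spread out, one set clumped together.
spread = [l2_normalize(v) for v in [(1, 0), (0, 1), (-1, 0), (0, -1)]]
clumped = [l2_normalize(v) for v in [(1, 0), (1, 0.1), (1, -0.1), (1, 0.2)]]

print("alignment of identical views:", alignment([(spread[0], spread[0])]))
print("uniformity spread vs clumped:", uniformity(spread), uniformity(clumped))
```

The spread-out set should score lower (better) on uniformity than the clumped one, while a pair of identical views has perfect alignment.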

## experiment

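Method: feed the same input through the encoder twice. Because a different dropout mask is applied on each pass, the two passes yield slightly different embeddings of the same sentence, and these serve as a positive pair.

Handling false negatives: increase the batch size.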
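The dropout trick can be illustrated without a real encoder. The following is a minimal sketch in which a fixed random vector stands in for the encoder output (an assumption, not the ERNIE model used later) and a hand-written `dropout` helper plays the role of the dropout layer.

```python
import math
import random

def dropout(vec, p, rng, training=True):
    """Inverted dropout: zero each element with probability p and rescale the rest."""
    if not training or p == 0.0:
        return list(vec)
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in vec]

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))  # inputs are already unit length

rng = random.Random(0)
features = [rng.uniform(0.5, 1.5) for _ in range(256)]  # stand-in for an encoder output

# Training mode: two passes over the same input use two different dropout masks.
view_a = l2_normalize(dropout(features, 0.1, rng))
view_b = l2_normalize(dropout(features, 0.1, rng))
print("train-mode cosine:", round(cosine(view_a, view_b), 3))

# Evaluation mode: dropout is disabled, so both passes give the same embedding.
eval_a = l2_normalize(dropout(features, 0.1, rng, training=False))
eval_b = l2_normalize(dropout(features, 0.1, rng, training=False))
print("eval-mode cosine:", round(cosine(eval_a, eval_b), 3))
```

In training mode the two views are close but not identical, which is what makes them usable as a positive pair; in evaluation mode the embedding is deterministic.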


In [10]:
# !pip3 install paddlenlp

In [None]:
# !pip3 install hnswlib

In [1]:
# Standard library
import abc
import functools
import os
import random
import sys
import time
from functools import partial
from typing import Dict, List

# Third-party utilities
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from visualdl import LogWriter

# Paddle (the torch DataLoader import was unused and shadowed by paddle.io.DataLoader)
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddle.dataset.common import md5file
from paddle.io import BatchSampler, DataLoader

# PaddleNLP
import paddlenlp
from paddlenlp.data import (DataCollatorWithPadding, JiebaTokenizer, Pad,
                            Stack, Tuple, Vocab)
from paddlenlp.datasets import DatasetBuilder, MapDataset, load_dataset
from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup

In [2]:
os.chdir('/home/u9285752')

In [47]:
# Skip malformed lines (e.g. rows with too many fields)
df = pd.read_csv("./國泰/train.tsv", delimiter="\t", on_bad_lines='skip', header=None, index_col=None)

In [48]:
df.head()

Unnamed: 0,0,1,2
0,用微信都6年，微信没有微粒贷功能,4。号码来微粒贷,0.0
1,微信消费算吗,还有多少钱没还,0.0
2,交易密码忘记了找回密码绑定的手机卡也掉了,怎么最近安全老是要改密码呢好麻烦,0.0
3,你好我昨天晚上申请的没有打电话给我今天之内一定会打吗？,什么时候可以到账,0.0
4,"“微粒贷开通""",你好，我的微粒贷怎么没有开通呢,0.0


In [57]:
df.columns = ["text_a", "text_b", "label"]

In [32]:
len(df)

86196

In [62]:
strange_data = []
for idx,(i,j) in enumerate(zip(df['text_a'], df['text_b'])):
    if len(i.split('\t')) > 1 or len(j.split('\t')) > 1:
        strange_data.append(idx)

In [63]:
df=df.drop(labels=strange_data,axis=0) 

In [64]:
len(df)

86185

In [65]:
# count the length of queries
print((df['text_a']+df['text_b']).map(len).describe())

count    86185.000000
mean        23.726774
std         11.283784
min          3.000000
25%         16.000000
50%         21.000000
75%         28.000000
max        201.000000
dtype: float64


In [72]:
train_data,val_data = train_test_split(df,train_size=0.9,random_state=20)

In [73]:
train_data = train_data.drop(['text_b', 'label'], axis=1)

In [74]:
train_data=train_data.reset_index(drop=True)
val_data=val_data.reset_index(drop=True)

In [75]:
# Build the recall corpus from the unique text_b sentences without modifying df in place
corpus = df['text_b'].drop_duplicates()

In [76]:
train_data.head()

Unnamed: 0,text_a
0,不被邀请就不能借？
1,为什么我把版本更新了还没看到微粒
2,我钱包里没有这个选项
3,能把还款日推迟一日吗
4,为什么还没有


In [77]:
val_data.head()

Unnamed: 0,text_a,text_b,label
0,如何提前分期还款,4万我如果借款20天一共要还款是多少？,0.0
1,两天为什么还没有接到确认电话,QQ申请了他说等待电话确认要多久都过了两天了,1.0
2,你好，请问借款可以用于信用卡还款吗？,不能用于购房还有证劵投资还有呢,0.0
3,为什么我的微信没有微粒贷,4。号码来微粒贷,0.0
4,刚才没有接到你们的审核电话,审批周期多久？,0.0


In [78]:
val_data_nolabel = val_data.drop(['label'], axis=1)
val_data_textb = val_data_nolabel.drop(['text_a'], axis=1)

In [83]:
filt = (val_data['label']==1.0)
val_pair = val_data_nolabel.loc[filt]

In [84]:
train_data.to_csv("./國泰/train.csv", header=None, index=False)
val_data.to_csv("./國泰/val.csv",header=None, index=False)

In [85]:
val_pair.to_csv("./國泰/val_pair.csv", header=None, index=False)
val_data_textb.to_csv("./國泰/val_textb.csv", header=None, index=False)

In [86]:
corpus.to_csv("./國泰/corpus.csv", header=None, index=False)

# model

In [96]:
def read_simcse_text(data_path):
    """Reads data."""
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f:
            data = line.rstrip()
            yield {'text_a': data, 'text_b': data}

In [88]:
def convert_example(example, tokenizer, max_seq_length=512):
    """
    Args:
        max_seq_length(obj:`int`): The maximum total input sequence length after tokenization.
            Sequences longer than this will be truncated; padding is applied later, per batch, by the batchify function.
    Returns:
        input_ids(obj:`list[int]`): The list of query token ids.
        token_type_ids(obj: `list[int]`): List of query sequence pair mask.
    """

    result = []

    for key, text in example.items():

        # do_train
        encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length)
        input_ids = encoded_inputs["input_ids"]
        token_type_ids = encoded_inputs["token_type_ids"]
        result += [input_ids, token_type_ids]
    return result

In [89]:
def create_dataloader(dataset,
                      mode='train',
                      batch_size=1,
                      batchify_fn=None,
                      trans_fn=None):
    if trans_fn:
        dataset = dataset.map(trans_fn)

    shuffle = True if mode == 'train' else False
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(dataset,
                                                          batch_size=batch_size,
                                                          shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(dataset,
                                               batch_size=batch_size,
                                               shuffle=shuffle)

    return paddle.io.DataLoader(dataset=dataset,
                                batch_sampler=batch_sampler,
                                collate_fn=batchify_fn,
                                return_list=True)


In [143]:
class SimCSE(nn.Layer):

    def __init__(self,
                 pretrained_model,
                 dropout=None,
                 margin=0.0,
                 scale=20,
                 output_emb_size=None):

        super().__init__()

        self.ptm = pretrained_model
        self.dropout = nn.Dropout(dropout if dropout is not None else 0.1)

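        # Treat None as 0, which means "keep the encoder's hidden size".
        self.output_emb_size = output_emb_size or 0
        if self.output_emb_size > 0:
            # Truncated-normal initializer: like a normal initializer, but samples more than
            # two standard deviations from the mean are discarded and redrawn.
            # Why std=0.02? If the initial weights are large, sigmoid/tanh activations sit in
            # their flat, saturated regions, so gradients are tiny and learning is slow.
            # If the weights are too small, gradients shrink during backpropagation and vanish.
            weight_attr = paddle.ParamAttr(
                initializer=paddle.nn.initializer.TruncatedNormal(std=0.02))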
            self.emb_reduce_linear = paddle.nn.Linear(768,
                                                      output_emb_size,
                                                      weight_attr=weight_attr)

        self.margin = margin
        # Used scaling cosine similarity to ease converge
        self.sacle = scale

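    # The to_static decorator lets this embedding-extraction function be exported as a static graph.
    # Compared with dynamic graphs, static graphs perform better in deployment: the network
    # structure is built first and the computation is executed afterwards. A pre-built network
    # can run without Python, being parsed and executed on the C++ side, and having the whole
    # graph available also allows graph-level optimizations.
    # Inputs: input_ids, token_type_ids
    @paddle.jit.to_static(input_spec=[
        paddle.static.InputSpec(shape=[None, None], dtype='int64'),
        paddle.static.InputSpec(shape=[None, None], dtype='int64')])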
    def get_pooled_embedding(self,
                             input_ids,
                             token_type_ids=None,
                             position_ids=None,
                             attention_mask=None):

        # Note: cls_embedding is the pooled embedding with tanh activation
        sequence_output, cls_embedding = self.ptm(input_ids, token_type_ids,
                                                  position_ids, attention_mask)

        # output_emb_size=256 in this notebook: a larger size carries richer semantic
        # information but consumes more resources.

        if self.output_emb_size > 0:
            cls_embedding = self.emb_reduce_linear(cls_embedding)

        cls_embedding = self.dropout(cls_embedding)
        cls_embedding = F.normalize(cls_embedding, axis=-1)

        return cls_embedding

    def get_semantic_embedding(self, data_loader):
        self.eval()
        with paddle.no_grad():
            for batch_data in data_loader:
                input_ids, token_type_ids = batch_data
                input_ids = paddle.to_tensor(input_ids)
                token_type_ids = paddle.to_tensor(token_type_ids)

                text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids)

                yield text_embeddings

    def cosine_sim(self,
                   query_input_ids,
                   title_input_ids,
                   query_token_type_ids=None,
                   query_position_ids=None,
                   query_attention_mask=None,
                   title_token_type_ids=None,
                   title_position_ids=None,
                   title_attention_mask=None):

        query_cls_embedding = self.get_pooled_embedding(query_input_ids,
                                                        query_token_type_ids,
                                                        query_position_ids,
                                                        query_attention_mask)

        title_cls_embedding = self.get_pooled_embedding(title_input_ids,
                                                        title_token_type_ids,
                                                        title_position_ids,
                                                        title_attention_mask)

        
        cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1)

        return cosine_sim

    def forward(self,
                query_input_ids,
                title_input_ids,
                query_token_type_ids=None,
                query_position_ids=None,
                query_attention_mask=None,
                title_token_type_ids=None,
                title_position_ids=None,
                title_attention_mask=None):

        # 1st encoding
        query_cls_embedding = self.get_pooled_embedding(query_input_ids,
                                                        query_token_type_ids,
                                                        query_position_ids,
                                                        query_attention_mask) #shape=[batch_size, 256]
 
        # 2nd encoding
        title_cls_embedding = self.get_pooled_embedding(title_input_ids,
                                                        title_token_type_ids,
                                                        title_position_ids,
                                                        title_attention_mask)
        
        # similarity matrix, shape = [batch_size, batch_size]
        cosine_sim = paddle.matmul(query_cls_embedding,
                                   title_cls_embedding,
                                   transpose_y=True)  # matrix multiplication

        # The scale acts as an inverse temperature and controls how much attention hard
        # negatives receive. The smaller the temperature (the larger the scale), the harder the
        # loss pushes each sample away from its most similar negatives, which gives a more
        # uniform representation. However, hard negatives are often genuinely similar to the
        # sample, and many of them are potential positives, so separating them too forcefully
        # can destroy the learned latent semantic structure.
        # When the temperature is very small, the most similar negatives dominate the loss.

        cosine_sim *= self.sacle

        labels = paddle.arange(0, query_cls_embedding.shape[0], dtype='int64') #[0,1,2,3,...,batch_size-1]
        labels = paddle.reshape(labels, shape=[-1, 1]) #[[0],[1],[2],[3]...]


        # Converted to classification task: diagonal - positive examples, rest - negative examples
        loss = F.cross_entropy(input=cosine_sim, label=labels)

        return loss
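The in-batch loss computed in `forward` can be checked by hand on tiny inputs. Below is a pure-Python sketch of the same idea (scaled similarity matrix, then cross-entropy with the diagonal as the target); `in_batch_loss` and the toy embeddings are hypothetical names for illustration, not part of the notebook's model.

```python
import math

def in_batch_loss(query_embs, title_embs, scale=20.0):
    """Mean cross-entropy where row i's positive is column i of the similarity matrix."""
    losses = []
    for i, q in enumerate(query_embs):
        # Scaled inner products between query i and every title in the batch.
        logits = [scale * sum(a * b for a, b in zip(q, t)) for t in title_embs]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Three unit vectors; each "title" is identical to its query.
embs = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
print("scale=20:", in_batch_loss(embs, embs, scale=20.0))
print("scale=1: ", in_batch_loss(embs, embs, scale=1.0))

# If every embedding collapses to the same point, the loss is log(batch_size).
same = [(1.0, 0.0)] * 3
print("collapsed:", in_batch_loss(same, same), "log(3) =", math.log(3))
```

With a large scale the positives dominate and the loss is close to zero; with a small scale the softmax is softer and the loss stays noticeably higher for the same embeddings.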

In [94]:
model_name = "ernie-3.0-base-zh"
pretrained_model = AutoModel.from_pretrained(model_name, hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1)
print("loading model from {}".format(model_name))
tokenizer = AutoTokenizer.from_pretrained(model_name)

[2022-08-01 00:19:45,443] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.modeling.ErnieModel'> to load 'ernie-3.0-base-zh'.
[2022-08-01 00:19:45,446] [    INFO] - Already cached /home/u9285752/.paddlenlp/models/ernie-3.0-base-zh/ernie_3.0_base_zh.pdparams
W0801 00:19:45.781390    96 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.7, Runtime API Version: 10.2
W0801 00:19:45.782960    96 gpu_resources.cc:91] device: 0, cuDNN Version: 8.4.
[2022-08-01 00:19:53,886] [    INFO] - Weights from pretrained model not used in ErnieModel: ['cls.predictions.transform.weight', 'cls.predictions.layer_norm.weight', 'cls.predictions.transform.bias', 'cls.predictions.layer_norm.bias', 'cls.predictions.decoder_bias']
[2022-08-01 00:19:56,019] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'ernie-3.0-base-zh'.
[2022-08-01 00:19:56,021

loading model from ernie-3.0-base-zh


In [97]:
if __name__ == '__main__':
    

    writer=LogWriter(logdir="./國泰/writer")
    train_ds = load_dataset(read_simcse_text, data_path="./國泰/train.csv", lazy=False)

    # partial() fixes some arguments of a function, producing a new function that takes fewer parameters.
    trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=30) # convert example to feature
    # Pad: pads sequences of different lengths to the longest sequence in the batch
    # Tuple: wraps multiple batchify functions together
    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"),  # query_input
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"),  # query_segment
        Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"),  # title_input
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"),  # title_segment
    ): [data for data in fn(samples)]
        
    train_data_loader = create_dataloader(
        train_ds,
        mode='train',
        batch_size=128,
        batchify_fn=batchify_fn,
        trans_fn=trans_func)
    
    scale = 20 
    output_emb_size = 256 
    model = SimCSE(pretrained_model, scale=scale, output_emb_size=output_emb_size)

    init_from_ckpt = None
    if init_from_ckpt and os.path.isfile(init_from_ckpt):
        state_dict = paddle.load(init_from_ckpt)
        model.set_dict(state_dict)
        print("warmup from:{}".format(init_from_ckpt))

    epochs = 15
    num_training_steps = len(train_data_loader) * epochs
    learning_rate = 5e-5
    
    
    # LinearDecayWithWarmup increases the learning rate linearly from 0 to learning_rate during warmup,
    # then decreases it linearly from learning_rate to 0 over the remaining steps.
    warmup_proportion = 0.0 # proportion of training steps used for linear warmup
    lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)

    # Bias and LayerNorm parameters are excluded from weight decay:
    # decaying bias units usually makes only a small difference to the final network, and
    # norm parameters only scale and shift the normalized input of a layer.
    decay_params = [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ]
    optimizer = paddle.optimizer.AdamW(
        learning_rate=lr_scheduler,
        parameters=model.parameters(),
        weight_decay= 0.0,
        apply_decay_param_fun=lambda x: x in decay_params)

    time_start=time.time()
    global_step = 0
    tic_train = time.time()
    
    save_steps = 1000
    eval_step = 2
    metric = paddle.metric.Auc()
    save_dir = "./國泰/"
    loss_base = 9999
    for epoch in range(1, epochs + 1):
        for step, batch in enumerate(train_data_loader, start=1):
            # query_input = title_input
            query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch
            loss = model(
                query_input_ids=query_input_ids,
                title_input_ids=title_input_ids,
                query_token_type_ids=query_token_type_ids,
                title_token_type_ids=title_token_type_ids)
            global_step += 1
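            if global_step % 500 == 0:
                # 500 steps elapse between reports. The log below was recorded with 10 as the
                # numerator, so its step/s values are 50x too low.
                print("global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s"
                    % (global_step, epoch, step, loss,
                       500 / (time.time() - tic_train)))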
                writer.add_scalar(tag="loss", step=global_step, value=loss)
                tic_train = time.time()

            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            
            optimizer.clear_grad()
            if global_step % save_steps == 0:
                save_path = os.path.join(save_dir, "model_%d" % (global_step) + '.pdparams')
                paddle.save(model.state_dict(), save_path)
                tokenizer.save_pretrained(save_dir)
            if loss < loss_base:                    
                loss_base = loss
                paddle.save(model.state_dict(), "./國泰/pick_loss_model.pdparams")
    time_end=time.time()
    print('totally cost time',time_end-time_start)





global step 500, epoch: 1, batch: 500, loss: 0.03672, speed: 0.04 step/s
global step 1000, epoch: 2, batch: 394, loss: 0.02748, speed: 0.05 step/s

[2022-08-01 00:28:33,304] [    INFO] - tokenizer config file saved in ./國泰/train_3/tokenizer_config.json
[2022-08-01 00:28:33,307] [    INFO] - Special tokens file saved in ./國泰/train_3/special_tokens_map.json

global step 1500, epoch: 3, batch: 288, loss: 0.00800, speed: 0.05 step/s
global step 2000, epoch: 4, batch: 182, loss: 0.00821, speed: 0.05 step/s

[2022-08-01 00:35:02,909] [    INFO] - tokenizer config file saved in ./國泰/train_3/tokenizer_config.json
[2022-08-01 00:35:02,910] [    INFO] - Special tokens file saved in ./國泰/train_3/special_tokens_map.json

global step 2500, epoch: 5, batch: 76, loss: 0.00679, speed: 0.05 step/s
global step 3000, epoch: 5, batch: 576, loss: 0.00568, speed: 0.05 step/s

[2022-08-01 00:41:24,753] [    INFO] - tokenizer config file saved in ./國泰/train_3/tokenizer_config.json
[2022-08-01 00:41:24,755] [    INFO] - Special tokens file saved in ./國泰/train_3/special_tokens_map.json

global step 3500, epoch: 6, batch: 470, loss: 0.03622, speed: 0.05 step/s
global step 4000, epoch: 7, batch: 364, loss: 0.00589, speed: 0.05 step/s

[2022-08-01 00:47:48,709] [    INFO] - tokenizer config file saved in ./國泰/train_3/tokenizer_config.json
[2022-08-01 00:47:48,710] [    INFO] - Special tokens file saved in ./國泰/train_3/special_tokens_map.json

global step 4500, epoch: 8, batch: 258, loss: 0.03937, speed: 0.05 step/s
global step 5000, epoch: 9, batch: 152, loss: 0.04064, speed: 0.05 step/s

[2022-08-01 00:54:09,956] [    INFO] - tokenizer config file saved in ./國泰/train_3/tokenizer_config.json
[2022-08-01 00:54:09,957] [    INFO] - Special tokens file saved in ./國泰/train_3/special_tokens_map.json

global step 5500, epoch: 10, batch: 46, loss: 0.00171, speed: 0.05 step/s
global step 6000, epoch: 10, batch: 546, loss: 0.02466, speed: 0.05 step/s

[2022-08-01 01:00:30,359] [    INFO] - tokenizer config file saved in ./國泰/train_3/tokenizer_config.json
[2022-08-01 01:00:30,361] [    INFO] - Special tokens file saved in ./國泰/train_3/special_tokens_map.json

global step 6500, epoch: 11, batch: 440, loss: 0.00230, speed: 0.05 step/s
global step 7000, epoch: 12, batch: 334, loss: 0.01610, speed: 0.05 step/s

[2022-08-01 01:06:55,643] [    INFO] - tokenizer config file saved in ./國泰/train_3/tokenizer_config.json
[2022-08-01 01:06:55,645] [    INFO] - Special tokens file saved in ./國泰/train_3/special_tokens_map.json

global step 7500, epoch: 13, batch: 228, loss: 0.01171, speed: 0.05 step/s
global step 8000, epoch: 14, batch: 122, loss: 0.03830, speed: 0.05 step/s

[2022-08-01 01:13:20,238] [    INFO] - tokenizer config file saved in ./國泰/train_3/tokenizer_config.json
[2022-08-01 01:13:20,242] [    INFO] - Special tokens file saved in ./國泰/train_3/special_tokens_map.json

global step 8500, epoch: 15, batch: 16, loss: 0.00552, speed: 0.05 step/s
global step 9000, epoch: 15, batch: 516, loss: 0.02517, speed: 0.05 step/s

[2022-08-01 01:19:42,835] [    INFO] - tokenizer config file saved in ./國泰/train_3/tokenizer_config.json
[2022-08-01 01:19:42,836] [    INFO] - Special tokens file saved in ./國泰/train_3/special_tokens_map.json


totally cost time 3558.7501447200775
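The learning-rate schedule used in training (linear warmup, then linear decay to zero) is easy to reproduce by hand. This is a hand-written sketch, not PaddleNLP's implementation; `linear_decay_with_warmup` is a hypothetical helper, and the 9090 total steps assume 606 batches per epoch over 15 epochs, as implied by the step and batch counters in the log.

```python
def linear_decay_with_warmup(step, base_lr, total_steps, warmup_proportion):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        # Warmup phase: ramp up from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: ramp down from base_lr to 0 at total_steps.
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining

total_steps = 606 * 15  # assumed from the training log
for step in (0, 4545, 9090):
    print(step, linear_decay_with_warmup(step, 5e-5, total_steps, 0.0))
```

With `warmup_proportion = 0.0`, as in the notebook, there is no warmup phase at all: the rate starts at its full value and simply decays linearly to zero.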


# recall Top3

In [99]:
def gen_id2corpus(corpus_file):
    id2corpus = {}
    with open(corpus_file, 'r', encoding='utf-8') as f:
        for idx, line in enumerate(f):
            id2corpus[idx] = line.rstrip()
    return id2corpus


def gen_text_file(similar_text_pair_file):
    text2similar_text = {}
    texts = []
    with open(similar_text_pair_file, 'r', encoding='utf-8') as f:
        for line in f:
            splited_line = line.rstrip().split(",")
            if len(splited_line) != 2:
                continue

            text, similar_text = line.rstrip().split(",")

            if not text or not similar_text:
                continue

            text2similar_text[text] = similar_text
            texts.append({"text": text})
    return texts, text2similar_text

In [100]:
import hnswlib
from paddlenlp.utils.log import logger

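'''
HNSW (Hierarchical Navigable Small World) is a graph-based algorithm for approximate nearest
neighbor (ANN) search. All vectors in the D-dimensional space are linked into one connected graph,
and the K nearest neighbors of a point are found by searching that graph.
When a new point is inserted, the search starts from a randomly chosen existing point, finds the
m points closest to the new point (m is set by the user), and connects the new point to them.
'''

def build_index(data_loader, model): # index every text in the corpus

    index = hnswlib.Index(space='ip', dim=256) # distance used for nearest-neighbor search; 'ip': inner product

    # ef: size of the dynamic candidate list during search (must be at least the top-k being queried)
    # M: number of neighbors ("friends") per node, a hyperparameter that trades time against accuracy
    # max_elements: the maximum number of elements (capacity)
    # ef_construction: controls index search speed/build speed tradeoff
    # M is tightly connected with the internal dimensionality of the data and strongly affects memory consumption (~M)
    # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction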
    index.init_index(max_elements=hnsw_max_elements, ef_construction=hnsw_ef, M=hnsw_m)

    # higher ef leads to better accuracy, but slower search
    index.set_ef(hnsw_ef)

    # Set number of threads used during batch search/construction
    # By default using all available cores
    index.set_num_threads(16)

    logger.info("start build index..........")

    all_embeddings = []

    for text_embeddings in model.get_semantic_embedding(data_loader):
        all_embeddings.append(text_embeddings.numpy())

    all_embeddings = np.concatenate(all_embeddings, axis=0) #shape = [len(corpus), 256]
    index.add_items(all_embeddings)

    logger.info("Total index number:{}".format(index.get_current_count()))

    return index
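Because HNSW is approximate, an exact brute-force search is a useful reference when checking an index on a small sample. The sketch below is a pure-Python baseline with hypothetical names (`exact_top_k`, `toy_corpus`), not part of hnswlib. For `space='ip'`, hnswlib reports `1 - inner product` as the distance, which is why the notebook converts results back with `1.0 - ...`.

```python
import heapq

def exact_top_k(query, corpus_embs, k=3):
    """Return (index, inner product) for the k corpus vectors most similar to the query."""
    scores = [(sum(q * c for q, c in zip(query, emb)), idx)
              for idx, emb in enumerate(corpus_embs)]
    return [(idx, score) for score, idx in heapq.nlargest(k, scores)]

# Unit-length toy vectors, so the inner product equals the cosine similarity.
toy_corpus = [(1.0, 0.0), (0.6, 0.8), (0.0, 1.0), (-1.0, 0.0)]
query = (1.0, 0.0)

for idx, score in exact_top_k(query, toy_corpus, k=2):
    # The corresponding hnswlib 'ip' distance would be 1 - score.
    print(idx, score, 1.0 - score)
```

Comparing these exact neighbors against `knn_query` results for a few queries gives a quick sanity check on the chosen `M` and `ef` values.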

# Get the corpus embeddings

In [145]:
corpus_file = './國泰/corpus.csv' # recall corpus
recall_result_dir = './國泰/recall'
recall_result_file = 'recall_result.txt'

if __name__ == "__main__":
    
    batch_size = 32
    hnsw_m = 100 # M: maximum number of links per node in the HNSW graph
    hnsw_ef = 100 # ef / ef_construction: size of the candidate list during search and build
    hnsw_max_elements = 1000000 # capacity of the index (maximum number of vectors)
    
    # Load pretrained semantic model
    params_path = './國泰/pick_loss_model.pdparams'
    if params_path and os.path.isfile(params_path):
        state_dict = paddle.load(params_path)
        model.set_dict(state_dict)
        print("Loaded parameters from %s" % params_path)

    id2corpus = gen_id2corpus(corpus_file) # id2corpus = 0:text

    # convert_example's input must be a dict
    corpus_list = [{idx: text} for idx, text in id2corpus.items()]
    corpus_ds = MapDataset(corpus_list)
    
    trans_func = partial(convert_example,
                         tokenizer=tokenizer,
                         max_seq_length=30)

    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"),  # text_input
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"),  # text_segment
    ): [data for data in fn(samples)]
    corpus_data_loader = create_dataloader(corpus_ds,
                                           mode='predict',
                                           batch_size=32,
                                           batchify_fn=batchify_fn,
                                           trans_fn=trans_func)



Loaded parameters from ./國泰/pick_loss_model.pdparams


# Build an index over the corpus embeddings with hnswlib (ANN index)

In [141]:
final_index = build_index(corpus_data_loader, model)

[2022-08-01 02:00:17,693] [    INFO] - start build index..........
[2022-08-01 02:00:30,666] [    INFO] - Total index number:20490


# Embed the evaluation queries, recall the Top3 most similar texts for each, and write the results to the recall_result file

In [146]:
similar_text_pair_file = './國泰/val_pair.csv'
text_list, text2similar_text = gen_text_file(similar_text_pair_file) #text_list=query, text2similar_text=query:title

query_ds = MapDataset(text_list)

query_data_loader = create_dataloader(query_ds,
                                      mode='predict',
                                      batch_size=32,
                                      batchify_fn=batchify_fn,
                                      trans_fn=trans_func)

query_embedding = model.get_semantic_embedding(query_data_loader)


recall_result_file = './國泰/recall/recall_result3.txt'
recall_num = 3
with open(recall_result_file, 'w', encoding='utf-8') as f:
    text_offset = 0  # number of queries processed so far; stays correct when the last batch is smaller
    for batch_query_embedding in query_embedding:
        # For space='ip', hnswlib returns distances equal to 1 - inner product.
        recalled_idx, distances = final_index.knn_query(batch_query_embedding.numpy(), recall_num)

        for row_index in range(len(distances)):
            text_index = text_offset + row_index
            for idx, doc_idx in enumerate(recalled_idx[row_index]):
                similarity = 1.0 - distances[row_index][idx]
                f.write("{}\t{}\t{}\n".format(
                    text_list[text_index]["text"], id2corpus[doc_idx], similarity))
                print('text: ', text_list[text_index]["text"], '\npair: ', id2corpus[doc_idx], '\nsimilarity: ', round(similarity, 3))
        text_offset += len(distances)

text:  两天为什么还没有接到确认电话 
pair:  两天为什么还没有接到确认电话 
similarity:  1.0
text:  两天为什么还没有接到确认电话 
pair:  为什么借款后显示银行电话确认，几个小时还没有接到电话 
similarity:  0.725
text:  两天为什么还没有接到确认电话 
pair:  为什么三个小时了还没有电话核实 
similarity:  0.684
text:  为何放款还没到 
pair:  为何放款还没到 
similarity:  1.0
text:  为何放款还没到 
pair:  如何知道还款扣款成功没 
similarity:  0.666
text:  为何放款还没到 
pair:  为何没有 
similarity:  0.622
text:  一般多久更新额度的 
pair:  一般多久更新额度的 
similarity:  1.0
text:  一般多久更新额度的 
pair:  一般几个月提高额度 
similarity:  0.668
text:  一般多久更新额度的 
pair:  一般多少额度 
similarity:  0.605
text:  你好，怎么更改电话号码 
pair:  你好，怎么更改电话号码 
similarity:  1.0
text:  你好，怎么更改电话号码 
pair:  我的电话需要怎么更改 
similarity:  0.569
text:  你好，怎么更改电话号码 
pair:  您好，可以更换手机号码么 
similarity:  0.568
text:  我需要微粒贷 
pair:  我需要微粒贷 
similarity:  1.0
text:  我需要微粒贷 
pair:  我要微粒贷 
similarity:  0.825
text:  我需要微粒贷 
pair:  我不需要微粒贷 
similarity:  0.731
text:  借的前可以提现吗 
pair:  借的前可以提现吗 
similarity:  1.0
text:  借的前可以提现吗 
pair:  借款可以提现的吗 
similarity:  0.779
text:  借的前可以提现吗 
pair:  你好，你们这个借贷可以取现吗？ 
similarity:  0.60

In [26]:
text = input("Please input query: ")
text_list = []
text_list.append({"text": text})
query_ds = MapDataset(text_list)

query_data_loader = create_dataloader(query_ds,
                                      mode='predict',
                                      batch_size=1,
                                      batchify_fn=batchify_fn,
                                      trans_fn=trans_func)

query_embedding = model.get_semantic_embedding(query_data_loader)


recall_result_file = './國泰/recall/try.txt'
recall_num = 3
with open(recall_result_file, 'w', encoding='utf-8') as f:
    for batch_index, batch_query_embedding in enumerate(query_embedding):
        recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), recall_num)

        batch_size = len(cosine_sims)

        for row_index in range(batch_size):
            text_index = batch_size * batch_index + row_index
            print('\ntext: ', text_list[text_index]["text"])
            for idx, doc_idx in enumerate(recalled_idx[row_index]):
                f.write("{}\t{}\t{}\n".format(
                    text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx]))
                print('\n\npair: ', id2corpus[doc_idx], '\nsimilarity: ', round(1.0 - cosine_sims[row_index][idx],3))

Please input query: 你好，如何更换还款银行卡

text:  你好，如何更换还款银行卡


pair:  如何更换还款银行 
similarity:  0.765


pair:  更换卡后，如何还款 
similarity:  0.697


pair:  如何更换卡还款 
similarity:  0.692


# Count recall

In [135]:
def recall(rs, N=3):
    """
    Ratio of recalled Ground Truth at topN Recalled Docs
    >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
    >>> recall(rs, N=1)
    0.333333
    >>> recall(rs, N=2)
    0.6666667
    >>> recall(rs, N=3)
    1.0
    Returns:
        Recall@N
    """

    recall_flags = [np.sum(r[0:N]) for r in rs]
    return np.mean(recall_flags)


if __name__ == "__main__":
    
    text2similar = {}
    with open(similar_text_pair_file, 'r', encoding='utf-8') as f:
        for line in f:
            if len(line.rstrip().split(",")) != 2:
                continue
            text, similar_text = line.rstrip().split(",")
            text2similar[text] = similar_text
    rs = []
    recall_num = 3
    recall_result_file = './國泰/recall/recall_result3.txt'
    with open(recall_result_file, 'r', encoding='utf-8') as f:
        relevance_labels = []
        for index, line in enumerate(f):
            if index % recall_num == 0 and index != 0:
                rs.append(relevance_labels)
                relevance_labels = []
#             if len(line.rstrip().split("\t"))!= 3:
#                 continue
            text, recalled_text, cosine_sim = line.rstrip().split("\t")
            if text2similar[text] == recalled_text:
                relevance_labels.append(1)
            else:
                relevance_labels.append(0)
        # The loop only flushes a group when the next one starts, so add the last group here.
        if relevance_labels:
            rs.append(relevance_labels)
    recall_N = []
    recall_nums = [3]
    res = []
    for topN in recall_nums:
        R = round(100 * recall(rs, N=topN), 3)
        recall_N.append(str(R))
    for key, val in zip(recall_nums, recall_N):
        print('recall@{}={}'.format(key, val))
        res.append(str(val))
    # Append this run's scores; the with-block closes the file.
    with open('result.tsv', 'a') as result:
        result.write('\t'.join(res) + '\n')

recall@3=1.229
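To make the metric concrete, here is a small worked example of Recall@N on hand-made relevance flags. The helper names (`group_rows`, `recall_at_n`) and the flag values are illustrative assumptions; the flags are laid out the way the result file is, one flag per recalled line with `recall_num` lines per query.

```python
def group_rows(flags, recall_num):
    """Split a flat list of per-line relevance flags into one row per query."""
    return [flags[i:i + recall_num] for i in range(0, len(flags), recall_num)]

def recall_at_n(relevance_rows, n):
    """Average number of ground-truth hits within the top n results of each query."""
    hits = [sum(row[:n]) for row in relevance_rows]
    return sum(hits) / len(hits)

# Nine result lines = three queries with three recalled texts each (1 = matches the ground truth).
flags = [0, 0, 1, 0, 1, 0, 1, 0, 0]
rows = group_rows(flags, recall_num=3)
print(rows)

for n in (1, 2, 3):
    print("recall@{} = {:.4f}".format(n, recall_at_n(rows, n)))
```

Only one of the three queries has its match ranked first, so Recall@1 is one third, while every query has its match somewhere in the top three, so Recall@3 is 1.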


# Customized input

In [147]:
def test_result(query1, query2):

    trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=30)

    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # query_input
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # query_segment
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # title_input
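        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # title_segment
    ): [data for data in fn(samples)]
    # One (query, title) pair. Wrapping the strings in list() would split them into single
    # characters, so only the first two characters of query1 would be compared.
    data = [[query1, query2]]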

    examples = []    
    for idx, text in enumerate(data):
        input_ids, segment_ids = convert_example({idx: text[0]}, tokenizer)
        title_ids, title_segment_ids = convert_example({idx: text[1]}, tokenizer)
        examples.append((input_ids, segment_ids, title_ids, title_segment_ids))
        
    query_ids, query_segment_ids, title_ids, title_segment_ids = batchify_fn(examples)

    scale = 20 
    output_emb_size = 256 # Output_embedding_size, 0 means use hidden_size as output embedding size
    model = SimCSE(pretrained_model,
                   scale=scale,
                   output_emb_size=output_emb_size)

    params_path = './國泰/pick_loss_model.pdparams'
    if params_path and os.path.isfile(params_path):
        state_dict = paddle.load(params_path)
        model.set_dict(state_dict)
        print("Loaded parameters from %s" % params_path)

    model.eval()
    cosine_sims = []
    with paddle.no_grad():
        batch_cosine_sim = model.cosine_sim(
        query_input_ids=query_ids,
        title_input_ids=title_ids,
        query_token_type_ids=query_segment_ids,
        title_token_type_ids=title_segment_ids).numpy()
        cosine_sims.append(batch_cosine_sim)

    cosine_sims = np.concatenate(cosine_sims, axis=0)
    for idx, cosine in enumerate(cosine_sims):
        print('{}'.format(cosine))

In [41]:
query1 = input("Please input query1")
query2 = input("Please input query2")
test_result(query1, query2)

Please input query1怎么更换绑定还款的卡
Please input query2你好，我还款银行怎么更换
Loaded parameters from ./國泰/pick_loss_model.pdparams
0.9819973707199097


In [39]:
query1 = input("Please input query1")
query2 = input("Please input query2")
test_result(query1, query2)

Please input query1微粒贷不见了
Please input query2为什么我的微粒贷突然不见了
Loaded parameters from ./國泰/pick_loss_model.pdparams
0.8149309754371643
