Reproduced the results, but some metrics don't match; a few questions #11

xieyangyi · 2019-11-04T03:51:33Z

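Hello,
Sorry to bother you. I am working on utterance rewriting for multi-turn dialogue and found your paper "Improving Multi-turn Dialogue Modelling with Utterance ReWriter" very inspiring. I downloaded your code and got it running, but I have a few questions.
1. Did you train and evaluate entirely on the 20,000 positive samples released on GitHub?
2. How did you construct the negative samples (utterances that need no rewriting)? I build them by replacing current with the summarization of a positive sample. Is that a reasonable approach?
3. How large are your training and validation sets? I split mine 1:3 (evaluation : training): the training set has 30,000 samples and the evaluation set has 10,000, positives and negatives combined, with a 1:1 positive-to-negative ratio.
4. Is the positive-to-negative ratio in your training and validation sets 1:1?
5. My results with a pointer network (LSTM base network, no coverage mechanism) are:
bleu1: 0.836, bleu2: 0.807, bleu4: 0.746, em_score: 0.449, rouge_1: 0.915, rouge_2: 0.840, rouge_l: 0.865
EM (exact match) is close to the number you reported, but the other two are implausibly high, so I would like to ask:
1) How did you compute BLEU and ROUGE? Perhaps our calculations differ. My code is pasted below.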

# -*- coding: utf-8 -*-
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
from rouge import Rouge

smoothing_function = SmoothingFunction().method4

class Metrics(object):
    def __init__(self):
        pass

    @staticmethod
    def bleu_score(references, candidates):
        """
        Compute BLEU scores.
        :param references: ground-truth sentences, list of string
        :param candidates: predicted sentences, list of string
        :return:
        """
        # compute BLEU for each (reference, candidate) pair at character level
        bleu1s = []
        bleu2s = []
        bleu3s = []
        bleu4s = []
        for ref, cand in zip(references, candidates):
            ref_list = [list(ref)]
            cand_list = list(cand)
            bleu1 = sentence_bleu(ref_list, cand_list, weights=(1, 0, 0, 0), smoothing_function=smoothing_function)
            bleu2 = sentence_bleu(ref_list, cand_list, weights=(0.5, 0.5, 0, 0), smoothing_function=smoothing_function)
            bleu3 = sentence_bleu(ref_list, cand_list, weights=(0.33, 0.33, 0.33, 0), smoothing_function=smoothing_function)
            bleu4 = sentence_bleu(ref_list, cand_list, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smoothing_function)
            # print ("ref: %s, cand: %s, bleus: %.3f, %.3f, %.3f, %.3f"
            #        % (ref, cand, bleu1, bleu2, bleu3, bleu4))
            bleu1s.append(bleu1)
            bleu2s.append(bleu2)
            bleu3s.append(bleu3)
            bleu4s.append(bleu4)

        # compute the averages
        bleu1_average = sum(bleu1s) / len(bleu1s)
        bleu2_average = sum(bleu2s) / len(bleu2s)
        bleu3_average = sum(bleu3s) / len(bleu3s)
        bleu4_average = sum(bleu4s) / len(bleu4s)

        # print the results
        print "average bleus: bleu1: %.3f, bleu2: %.3f, bleu4: %.3f" % (bleu1_average, bleu2_average, bleu4_average)
        return (bleu1_average, bleu2_average, bleu4_average)

    @staticmethod
    def em_score(references, candidates):
        total_cnt = len(references)
        match_cnt = 0
        for ref, cand in zip(references, candidates):
            if ref == cand:
                match_cnt = match_cnt + 1

        em_score = match_cnt / (float)(total_cnt)
        print "em_score: %.3f, match_cnt: %d, total_cnt: %d" % (em_score, match_cnt, total_cnt)
        return em_score

    @staticmethod
    def rouge_score(references, candidates):
        """
        Compute ROUGE, a recall-oriented word-overlap metric for NLG sentence generation
        https://github.com/pltrdy/rouge
        :param references: list string
        :param candidates: list string
        :return:
        """
        rouge = Rouge()

        # compute ROUGE for each (reference, candidate) pair at character level
        rouge1s = []
        rouge2s = []
        rougels = []
        for ref, cand in zip(references, candidates):
            ref = ' '.join(list(ref))
            cand = ' '.join(list(cand))
            rouge_score = rouge.get_scores(cand, ref)
            rouge_1 = rouge_score[0]["rouge-1"]['f']
            rouge_2 = rouge_score[0]["rouge-2"]['f']
            rouge_l = rouge_score[0]["rouge-l"]['f']
            # print "ref: %s, cand: %s" % (ref, cand)
            # print 'rouge_score: %s' % rouge_score

            rouge1s.append(rouge_1)
            rouge2s.append(rouge_2)
            rougels.append(rouge_l)

        # compute the averages
        rouge1_average = sum(rouge1s) / len(rouge1s)
        rouge2_average = sum(rouge2s) / len(rouge2s)
        rougel_average = sum(rougels) / len(rougels)

        # print the results
        print "average rouges, rouge_1: %.3f, rouge_2: %.3f, rouge_l: %.3f" \
              % (rouge1_average, rouge2_average, rougel_average)
        return (rouge1_average, rouge2_average, rougel_average)


if __name__ == '__main__':
    references = ["腊八粥喝了吗", "我的机器人女友好好看啊", "那长沙明天天气呢"]
    candidates = ["腊八粥喝了吗", "机器人女友好好看啊", '长沙明天呢']

    # decode
    references = [ref.decode('utf-8') for ref in references]
    candidates = [cand.decode('utf-8') for cand in candidates]

    # compute metrics
    # Metrics.bleu_score(references, candidates)
    # Metrics.em_score(references, candidates)
    Metrics.rouge_score(references, candidates)
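For readers comparing numbers: the script above splits every sentence into single characters (list(ref)) before scoring and then averages sentence-level scores. The following is a minimal standard-library sketch of what clipped n-gram precision and the brevity penalty look like at character level. The helper names char_ngrams, char_ngram_precision and brevity_penalty are illustrative only; they are not part of NLTK or of this repository, and the sketch is not a replacement for sentence_bleu (single reference, no smoothing).

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return a Counter of the character n-grams in text."""
    chars = list(text)
    return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))


def char_ngram_precision(reference, candidate, n):
    """Clipped n-gram precision of candidate against a single reference."""
    cand = char_ngrams(candidate, n)
    ref = char_ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def brevity_penalty(reference, candidate):
    """Penalty applied when the candidate is shorter than the reference."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1.0 - r / c)


# Second sentence pair from the __main__ block above.
ref = "我的机器人女友好好看啊"
cand = "机器人女友好好看啊"
p1 = char_ngram_precision(ref, cand, 1)
p2 = char_ngram_precision(ref, cand, 2)
bp = brevity_penalty(ref, cand)
print(p1, p2, bp)
```

In this pair the candidate only drops the first two characters, so every candidate unigram and bigram also occurs in the reference and the score is reduced only by the brevity penalty. That may be part of why character-level scores on near-copy rewrites look high, although it does not by itself explain the gap with the paper.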


liu946 · 2019-11-04T04:20:24Z

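@xieyangyi Hello. I am also following this task and this repository, but I ran into problems running the code and could not get it to work. I gave up and ran the experiment with CopyNet directly instead, and got results similar to yours (trained on positive samples only, with a 9:1 train/test split): BLEU and ROUGE are far higher than the numbers reported in the paper. Did you modify the repository's code when reproducing the results? If so, could you share your changes? Many thanks.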

xieyangyi · 2019-11-04T06:25:05Z

Yes, I did modify the code. The main changes are:
1.

yield (record[0]+ MARK_EOS + record[1], record[2], record[2])

->

yield (record[0]+ MARK_EOS + record[1], record[2], record[3])

self.enc_input_extend_vocab, self.context_oovs ,\
                self.current_input_extend_vocab,self.current_oovs = context2ids(
                context_words, query_words, vocab)
abs_ids_extend_vocab = summarization2ids(summarization_words,
                                                     vocab, self.context_oovs, self.current_oovs)

->

self.enc_input_extend_vocab, self.context_oovs = context2ids(context_words, vocab)
self.current_input_extend_vocab,self.current_oovs = context2ids(query_words, vocab)
abs_ids_extend_vocab = summarization2ids(summarization_words, vocab, self.context_oovs)

I used BERT's vocab.txt as the vocabulary for this model; it has a little over 28,000 entries.
In run_summarization.py, add:

tf.app.flags.DEFINE_string('encoder_type', "bi", 'encoder type')

Comment out the unused code:

# convert_sentences_to_idsmasks(batch.enc_batch, batch.e)

            # bert model, not used yet
            # bert_model = modeling.BertModel(
            #     config=bert_config,
            #     is_training=is_training,
            #     input_ids=batch.enc_batch,
            #     input_mask=input_mask,
            #     token_type_ids=segment_ids,
            #     use_one_hot_embeddings=use_one_hot_embeddings
            # )

Add 'encoder_type' to hparam_list.

beam_search.py

enc_states, dec_in_state = model.run_encoder(sess, batch)

->

enc_states, current_states, dec_in_state = model.run_encoder(sess, batch)

beam_search.py

(topk_ids, topk_log_probs, new_states, attn_dists,
         new_coverage) = model.decode_onestep(
             sess=sess,
             batch=batch,
             latest_tokens=latest_tokens,
             enc_states=enc_states,
             dec_init_states=states,
             prev_coverage=prev_coverage)

->

(topk_ids, topk_log_probs, new_states, attn_dists,
         new_coverage, _) = model.decode_onestep(
             sess=sess,
             batch=batch,
             latest_tokens=latest_tokens,
             enc_states=enc_states,
             current_states=current_states,
             dec_init_states=states,
             prev_b_coverage=prev_coverage,
             prev_t_coverage=None)

decode.py
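Rename every original_title to original_context.
Rename every original_titles to original_contexts.
Those are the key changes. For everything else, such as environment issues, I just fixed whatever the error messages pointed to.
Uploading a patch would violate my company's policy, so please bear with me.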

Henryflee · 2019-11-04T10:10:46Z

@xieyangyi Hello, I made roughly the same changes as you. Only two places differ:

1. beam_search.py

hyps = [
    Hypothesis(
        tokens=[vocab.word2id(data.MARK_GO)],
        log_probs=[0.0],
        state=dec_in_state,
        attn_dists=[],
        # zero vector of length attention_length
        t_coverage=np.zeros([batch.enc_batch.shape[1]]),
        b_coverage=np.zeros([batch.current_batch.shape[1]])
    )
    for _ in range(FLAGS.beam_size)
]
prev_t_coverage = [h.t_coverage for h in hyps]
prev_b_coverage = [h.b_coverage for h in hyps]
(topk_ids, topk_log_probs, new_states, attn_dists,
 new_coverage, _) = model.decode_onestep(
    sess=sess,
    batch=batch,
    latest_tokens=latest_tokens,
    enc_states=enc_states,
    current_states=current_states,
    dec_init_states=states,
    prev_b_coverage=prev_t_coverage,
    prev_t_coverage=prev_b_coverage)
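Since you are not using coverage anyway, this has little impact.
2. Loss computation: the original code did not handle the case where pointer mode is not used. I added that branch, following https://github.com/abisee/pointer-generator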

xieyangyi · 2019-11-04T12:28:00Z

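@Henryflee For the coverage part, your change is the right one. Thanks!
Also, have you run into the issues I raised at the start, and if so, how did you resolve them? In particular, BLEU and ROUGE come out much higher than in the paper.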

Henryflee · 2019-11-04T12:38:10Z

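@xieyangyi I have not reached that step yet; I will report back when I do. One more thing I was unsure about: the lambda in the paper, i.e. p_ts and p_bs in the code (in attention_decoder.py), did not seem to be used, so the final distribution was not weighted. (Sorry, my mistake: the code does implement it.)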
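As a reading aid for the lambda discussion, here is a minimal sketch of how a gate can mix two copy distributions, one over the context tokens and one over the current-utterance tokens, into a single distribution over token ids. It assumes the final distribution is a lambda-weighted sum of the two attention distributions. The function mix_copy_distributions and its arguments are hypothetical names; this is not the code in attention_decoder.py, and how p_ts and p_bs map onto the gate there is not shown here.

```python
from collections import defaultdict


def mix_copy_distributions(context_ids, context_attn, current_ids, current_attn, lam):
    """Combine two attention distributions into one distribution over token ids.

    final[w] = lam * sum(context_attn[i] for i where context_ids[i] == w)
             + (1 - lam) * sum(current_attn[j] for j where current_ids[j] == w)
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    final = defaultdict(float)
    # Attention mass on repeated tokens is summed, so a token that appears
    # several times in the input accumulates probability.
    for token_id, weight in zip(context_ids, context_attn):
        final[token_id] += lam * weight
    for token_id, weight in zip(current_ids, current_attn):
        final[token_id] += (1.0 - lam) * weight
    return dict(final)


# Toy ids and attention weights, chosen only for illustration.
dist = mix_copy_distributions(
    context_ids=[5, 7, 5],
    context_attn=[0.5, 0.25, 0.25],
    current_ids=[7, 9],
    current_attn=[0.5, 0.5],
    lam=0.5,
)
print(dist)
```

If both inputs are proper distributions, the mixture sums to 1 for any lam in [0, 1], which is a quick sanity check when verifying that the weighting is actually applied.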

xieyangyi · 2019-11-04T12:55:47Z

@Henryflee I had not noticed that. Could we connect on DingTalk? My DingTalk ID is dkj152m

chin-gyou · 2019-11-12T12:28:32Z

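@xieyangyi @Henryflee The code has been updated. The released data is the latest version of the annotations and its quality is much higher, so better results are to be expected.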

rookiebird · 2019-11-22T05:36:55Z

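@xieyangyi I added you on DingTalk and hope we can discuss this. Although BLEU and ROUGE went up on my side, my EM actually dropped, to a fairly low 0.393. My data split is different from yours, of course. Is your 0.449 averaged over positive and negative samples, or computed on positive samples only?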

zhangyanbo2007 · 2019-11-22T05:39:15Z

I am also following this topic. My WeChat ID is 15821444815. How about we set up a group to discuss?

zhangyanbo2007 · 2019-11-26T07:50:42Z

@rookiebird Add me on WeChat at 15821444815 and let's discuss.

Henryflee · 2019-11-27T03:13:32Z

@rookiebird My EM is 0.58 on positive samples and 0.20 on negative samples. Your 0.39 is the combined figure, right?
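To make the per-class versus combined comparison concrete, here is a minimal sketch that reports exact match separately for positive and negative samples and for the whole set. The function names and the rule used to label a sample as negative (reference equals the current utterance) are assumptions for illustration, and the sample triples are made up.

```python
def exact_match(references, candidates):
    """Fraction of candidates that are identical to their reference."""
    if not references:
        return 0.0
    hits = sum(1 for ref, cand in zip(references, candidates) if ref == cand)
    return hits / len(references)


def em_by_class(samples):
    """Score (current, reference, candidate) triples per class and overall.

    Assumption: a sample counts as negative when its reference equals the
    current utterance, i.e. no rewriting was needed.
    """
    refs = {"positive": [], "negative": [], "overall": []}
    cands = {"positive": [], "negative": [], "overall": []}
    for current, ref, cand in samples:
        label = "negative" if current == ref else "positive"
        for key in (label, "overall"):
            refs[key].append(ref)
            cands[key].append(cand)
    return {key: exact_match(refs[key], cands[key]) for key in refs}


# Made-up toy triples, not taken from the dataset.
samples = [
    ("那明天呢", "那长沙明天天气呢", "那长沙明天天气呢"),    # positive, exact match
    ("好看吗", "我的机器人女友好看吗", "机器人女友好看吗"),  # positive, mismatch
    ("腊八粥喝了吗", "腊八粥喝了吗", "腊八粥喝了吗"),        # negative, exact match
    ("你好", "你好", "你好"),                                # negative, exact match
    ("谢谢", "谢谢", "谢谢你"),                              # negative, mismatch
]
scores = em_by_class(samples)
print(scores)
```

When the two classes are the same size, the overall EM is simply the mean of the two class scores; for example, (0.58 + 0.20) / 2 = 0.39.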

rookiebird · 2019-11-27T03:20:51Z

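> @rookiebird My EM is 0.58 on positive samples and 0.20 on negative samples. Your 0.39 is the combined figure, right?

Yes. Then our numbers are about the same. It seems that many rewritten sentences differ substantially from the original, so EM cannot reach the level reported in the paper. Have you run the entity-extraction experiments? Because tokenization is character-based, my outputs often fail to extract an entity in full, for example 骆驼的祥子 -> 祥子 and 傲慢与偏见 -> 慢与偏见. When I switched to word-level segmentation, the results were very poor: word-level segmentation is too coarse, and the generated sentences are far from fluent...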

Henryflee · 2019-11-27T03:36:01Z

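> @rookiebird My EM is 0.58 on positive samples and 0.20 on negative samples. Your 0.39 is the combined figure, right?
>
> Yes. Then our numbers are about the same. It seems that many rewritten sentences differ substantially from the original, so EM cannot reach the level reported in the paper. Have you run the entity-extraction experiments? Because tokenization is character-based, my outputs often fail to extract an entity in full, for example 骆驼的祥子 -> 祥子 and 傲慢与偏见 -> 慢与偏见. When I switched to word-level segmentation, the results were very poor: word-level segmentation is too coarse, and the generated sentences are far from fluent...

I am still experimenting with word segmentation.
PS: Have you tried the Transformer version?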

rookiebird · 2019-11-27T04:20:30Z

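@Henryflee Not yet. My EM really is far below the paper's result, so I have been reviewing my own code in case there is a bug, but I suspect the cause is the changed dataset. What do you mean by experimenting with word segmentation? Trying different segmentation methods? Previously I segmented with jieba, using the word list at the following link as the vocabulary and keeping words with frequency above 600, roughly 50,000 words: https://github.com/ling0322/webdict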
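The frequency cut-off described above can be sketched as follows. This assumes the frequency list is plain text with one whitespace-separated "word count" pair per line, which has not been checked against the webdict repository; load_frequent_words is a hypothetical helper, and the sample lines and counts are made up.

```python
def load_frequent_words(lines, min_count=600):
    """Keep words whose count is strictly greater than min_count.

    Assumed input format: one "word count" pair per line, separated by
    whitespace. Blank or malformed lines are skipped.
    """
    vocab = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # blank line or unexpected number of fields
        word, count = parts
        if not count.isdigit():
            continue  # count field is not a plain non-negative integer
        if int(count) > min_count:
            vocab.append(word)
    return vocab


# Made-up lines standing in for the real frequency file.
sample_lines = ["天气 12000", "长沙 800", "腊八粥 600", "", "bad line here"]
vocab = load_frequent_words(sample_lines)
print(vocab)
```

With a real file, the same function could be fed an open file handle, since iterating over a text file yields its lines.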

Henryflee · 2019-11-27T04:49:21Z

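> @Henryflee Not yet. My EM really is far below the paper's result, so I have been reviewing my own code in case there is a bug, but I suspect the cause is the changed dataset. What do you mean by experimenting with word segmentation? Trying different segmentation methods? Previously I segmented with jieba, using the word list at the following link as the vocabulary and keeping words with frequency above 600, roughly 50,000 words: https://github.com/ling0322/webdict

About the same approach. Let's see how it turns out first. According to the author, the released dataset is now an improved version, so EM should be even higher than in the paper.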

zhangyanbo2007 · 2019-11-27T08:51:31Z

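This is the result of getting the rewriting model running and plugging it into a seq2seq chit-chat model. It works reasonably well. Feel free to add me on WeChat to discuss: 15821444815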
@Henryflee @rookiebird @xieyangyi

zhangyanbo2007 · 2019-11-28T03:19:50Z

I think negative-sample construction is a problem. Let's discuss it. My WeChat is 15821444815
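For reference, the construction described in the opening post (replace the current utterance with the rewritten one, so that nothing needs rewriting) can be sketched as below. This assumes each record has four fields (two context utterances, the current utterance, the rewrite), as the record[0] to record[3] indexing earlier in the thread suggests. The function names, the toy records, and the choice to split the positives before deriving negatives are all assumptions of this sketch, not the procedure used by the authors or by any poster.

```python
import random


def make_negative(record):
    """Turn a positive record into one that needs no rewriting."""
    first, second, _current, rewrite = record
    # The rewrite is already self-contained, so it serves as both the
    # current utterance and the target.
    return (first, second, rewrite, rewrite)


def build_split(positives, train_fraction=0.75, seed=0):
    """Split the positives, then add one derived negative per positive."""
    shuffled = list(positives)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)

    def with_negatives(records):
        return records + [make_negative(r) for r in records]

    return with_negatives(shuffled[:cut]), with_negatives(shuffled[cut:])


# Placeholder records, not taken from the corpus.
positives = [
    ("ctx1-a", "ctx2-a", "cur-a", "rewrite-a"),
    ("ctx1-b", "ctx2-b", "cur-b", "rewrite-b"),
    ("ctx1-c", "ctx2-c", "cur-c", "rewrite-c"),
    ("ctx1-d", "ctx2-d", "cur-d", "rewrite-d"),
]
train, evaluation = build_split(positives)
print(len(train), len(evaluation))
```

Splitting before deriving negatives keeps a positive and the negative derived from it in the same partition; shuffling everything together could instead place a training rewrite in the evaluation set as a negative.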

zhangyanbo2007 · 2019-11-29T06:42:45Z

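What is val.sh in the example scripts for? It seems pointless: once training finishes, isn't the result already the latest model? Any pointers would be appreciated.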

SivilTaram · 2020-03-05T15:48:48Z

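@zhangyanbo2007 The author's code does not follow the usual pattern of validating after each training epoch. val.sh starts something like a daemon: it repeatedly loads the latest model saved by training, measures its loss on the dev set, and then decides whether to save that checkpoint to the eval folder. So train.sh and val.sh need to be started at the same time.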
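The polling pattern described above can be sketched as a small loop. This is not what val.sh actually runs: watch_and_validate and the three callables passed to it are hypothetical stand-ins, and the checkpoint names and losses below are made up so the loop can run without TensorFlow.

```python
import time


def watch_and_validate(latest_checkpoint, evaluate, save_best,
                       poll_seconds=60.0, max_polls=None):
    """Poll for new checkpoints and keep the one with the lowest dev loss.

    latest_checkpoint() returns the newest checkpoint path or None,
    evaluate(path) returns the dev-set loss, save_best(path) stores a copy.
    """
    best_loss = float("inf")
    last_seen = None
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        path = latest_checkpoint()
        if path is not None and path != last_seen:
            last_seen = path
            loss = evaluate(path)
            if loss < best_loss:
                best_loss = loss
                save_best(path)
        else:
            # Nothing new from the training process yet; wait and retry.
            time.sleep(poll_seconds)
    return best_loss


# Simulated training run: the checkpoint names and losses are made up.
seen = iter(["ckpt-1", "ckpt-1", "ckpt-2", "ckpt-3"])
losses = {"ckpt-1": 2.0, "ckpt-2": 1.5, "ckpt-3": 1.8}
saved = []
best = watch_and_validate(lambda: next(seen), losses.get, saved.append,
                          poll_seconds=0.0, max_polls=4)
print(best, saved)
```

Because the evaluator only ever sees checkpoints written by a concurrently running trainer, starting it alone leaves it with nothing to load, which is consistent with the advice to launch both scripts together.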

SivilTaram · 2020-03-05T15:56:43Z

@liu946 I forked the repository; my fork runs as-is under Python 3 and TensorFlow 1.14. Take a look if you need it: https://github.com/SivilTaram/dialogue-utterance-rewriter-py3

wqw547243068 · 2020-05-13T09:38:18Z

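> @liu946 I forked the repository; my fork runs as-is under Python 3 and TensorFlow 1.14. Take a look if you need it: https://github.com/SivilTaram/dialogue-utterance-rewriter-py3

Thanks for sharing! It does not run out of the box yet, though.

train.sh needs these changes before it runs:
- Fix the two paths (remove ../)
- Change --restore_best_model=1 to 0
Otherwise this error keeps appearing during training: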

tensorflow:Failed to load checkpoint from ./log/extractive/

Also, you can start TensorBoard:

tensorboard --logdir=log/extractive/train --host=10.186.3.100 --port=8079

Then everything is clear at a glance.

SivilTaram · 2020-05-27T00:41:03Z

@wqw547243068 Thanks for the heads-up. I had changed the Windows batch script at the time but did not carry the change over to the Linux script. It is fixed now.

Damcy · 2020-06-08T09:23:49Z

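@SivilTaram It still does not work. After starting train.sh, running val.sh has no effect: loading the model keeps failing because of a shape mismatch. I opened an issue in your repo.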

WenTingTseng · 2020-10-18T12:16:26Z

How do I process the corpus into the train.txt and val.txt format?

gaoxing153 · 2020-12-15T09:08:24Z

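> This is the result of getting the rewriting model running and plugging it into a seq2seq chit-chat model. It works reasonably well. Feel free to add me on WeChat to discuss: 15821444815
> @Henryflee @rookiebird @xieyangyi

Hello, when I searched for your WeChat ID it said the user does not exist. Could you add me instead? My contact is 15571370691. I am also planning to use this in a dialogue system.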

gaoxing153 · 2020-12-15T09:09:36Z

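> How do I process the corpus into the train.txt and val.txt format?

Someone posted a new repository above; you can take a look. Either running that repository's code directly or copying its data files into this folder should work.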

Henryflee · 2020-12-21T07:36:50Z

Our related work has been updated at https://github.com/NetEase-GameAI/SARG.

SivilTaram · 2020-12-21T12:10:51Z

😄 Our related work is here: https://github.com/microsoft/ContextualSP/blob/master/incomplete_utterance_rewriting
