Some questions about the trippy dst #7

Closed
zqwerty opened this issue Oct 11, 2020 · 9 comments


zqwerty commented Oct 11, 2020

I have several questions:

  1. The original performance of TripPy on MultiWOZ 2.1 is 55.3% (in the paper), while your BERT-base DST achieves 56.3%. Where does the improvement come from? I notice that the original TripPy repo mentions:

    With a sequence length of 180, you should expect the following average JGA: 56% for MultiWOZ 2.1
    Best performance can be achieved by using the maximum sequence length of 512.

    Did you change the code, or does the TripPy repo simply perform better than reported in the TripPy paper?

  2. How do I reproduce the DST experiments? My guess:

  • for BERT (56.3%): run DO.example.advanced without any modification (although the max sequence length is set to 180)
  • for CONVBERT-DG (58.57%): run DO.example.advanced with --model_name_or_path="convbert-dg"

Looking forward to your reply :)


Shikib commented Oct 11, 2020

  1. For the BERT-base results, we didn't change the code. We just trained the model for longer.

  2. Yes, that is correct.
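
For reference, a minimal sketch of those two runs (sketch only; the exact name/path of the convbert-dg checkpoint is an assumption, and DO.example.advanced may need to be edited rather than passed flags):

# BERT-base (56.3%): run the advanced example script unmodified
bash DO.example.advanced

# CONVBERT-DG (58.57%): same script, with the run_dst.py argument inside it changed to
#   --model_name_or_path="convbert-dg"   # name/path of the downloaded ConvBERT-DG checkpoint (assumption)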


zqwerty commented Oct 12, 2020

Thank you very much! I ran the BERT script without --mlm_pre --mlm_during and got:

(Pdb) acc_list_s
[('multiwoz_trippy/pred_res.test.final.json', 0.582067281606077)]

and the number for the dev set is 0.623407.

Is that the joint accuracy for DST? If so, why is it higher than 56.3%?


zqwerty commented Oct 12, 2020

How can I evaluate dialoglue locally (without submitting to eval.ai)?


Shikib commented Oct 12, 2020

That is strange. You didn't change anything else to get the 58.2 number? I'm not sure why that happens; in our experiments, the highest we were able to get (with our set of hyperparameters) without MLM was 56.3.

The evaluation can be done by running https://github.com/alexa/dialoglue/blob/master/evaluate.py using gt_test.json as the test_annotation_file and your json as the user_annotation_file.


zqwerty commented Oct 13, 2020

The script I used:

#!/bin/bash

# Parameters ------------------------------------------------------

#TASK="sim-m"
#DATA_DIR="data/simulated-dialogue/sim-M"
#TASK="sim-r"
#DATA_DIR="data/simulated-dialogue/sim-R"
#TASK="woz2"
#DATA_DIR="data/woz2"
TASK="multiwoz21"
DATA_DIR="data/MULTIWOZ2.1"

# Project paths etc. ----------------------------------------------

OUT_DIR=multiwoz_trippy/
mkdir -p ${OUT_DIR}

# Main ------------------------------------------------------------

for step in train dev test; do
    args_add=""
    if [ "$step" = "train" ]; then
	args_add="--do_train --predict_type=dummy"
    elif [ "$step" = "dev" ] || [ "$step" = "test" ]; then
	args_add="--do_eval --predict_type=${step}"
    fi

    CUDA_VISIBLE_DEVICES=7 python3 run_dst.py \
	    --task_name=${TASK} \
	    --data_dir=${DATA_DIR} \
	    --dataset_config=dataset_config/${TASK}.json \
	    --model_type="bert" \
	    --model_name_or_path="/home/data/zhuqi/pre-trained-models/bert-base-uncased" \
	    --do_lower_case \
	    --learning_rate=1e-4 \
	    --num_train_epochs=50 \
	    --max_seq_length=180 \
	    --per_gpu_train_batch_size=48 \
	    --per_gpu_eval_batch_size=1 \
	    --output_dir=${OUT_DIR} \
	    --save_epochs=20 \
	    --logging_steps=10 \
	    --warmup_proportion=0.1 \
	    --adam_epsilon=1e-6 \
	    --label_value_repetitions \
            --swap_utterances \
	    --append_history \
	    --use_history_labels \
	    --delexicalize_sys_utts \
	    --class_aux_feats_inform \
	    --class_aux_feats_ds \
	    --seed 42 \
	    ${args_add} \
        2>&1 | tee ${OUT_DIR}/${step}.log

    if [ "$step" = "dev" ] || [ "$step" = "test" ]; then
    	python3 metric_bert_dst.py \
    		${TASK} \
		dataset_config/${TASK}.json \
    		"${OUT_DIR}/pred_res.${step}*json" \
    		2>&1 | tee ${OUT_DIR}/eval_pred_${step}.log
    fi
done

I copied the MultiWOZ data from dialoglue/multiwoz to data/MULTIWOZ2.1, added mkdir -p ${OUT_DIR}, and removed --mlm_pre --mlm_during. GPU: Quadro RTX 6000.
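
As shell commands, those setup steps look roughly like this (a sketch; the exact layout under dialoglue/multiwoz may differ):

# copy the MultiWOZ data into the path used by DATA_DIR in the script above
mkdir -p data
cp -r dialoglue/multiwoz data/MULTIWOZ2.1
# then drop --mlm_pre --mlm_during from the run_dst.py arguments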

Here are the prediction logs:
eval_pred_dev.log
eval_pred_test.log


zqwerty commented Oct 14, 2020

What are the hyperparameter settings for the few-shot experiments?


Shikib commented Oct 16, 2020

That appears to be identical to our script. I am not sure why the performance is so high.

The hyperparameters for the few-shot setting are identical; we just run the experiment 5x (with random seeds 1-5).
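
A minimal sketch of that loop, reusing the --seed flag from the script above (run_dst.py arguments abbreviated; the few-shot data setup is whatever your configuration uses):

# sketch only: repeat the same train/dev/test run with seeds 1-5 and average the results
for seed in 1 2 3 4 5; do
    OUT_DIR=multiwoz_trippy_seed${seed}/
    mkdir -p ${OUT_DIR}
    # ... same run_dst.py invocation as above, with:
    #     --output_dir=${OUT_DIR} --seed ${seed}
done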

mihail-amazon commented

I'm going to close this issue for now, but feel free to open a new one if you have any additional questions.


zqwerty commented Oct 22, 2020

That appears to be identical to our script. I am not sure why the performance is so high.

The hyperparameters for the few-shot setting are identical; we just run the experiment 5x (with random seeds 1-5).

It seems the number in the paper for few-shot DST is not correct; I got ~48 for BERT.
