DEnsity: Open-domain Dialogue Evaluation Metric using Density Estimation

This repository contains the code and pre-trained models for DEnsity: Open-domain Dialogue Evaluation Metric using Density Estimation (ACL2023 Findings).

0. Preparation

1. Requirements

torch==1.7.1
transformers==4.12.3
scikit-learn
scipy
wget
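
These can be installed with pip, for example (assuming an environment for which torch 1.7.1 wheels are available, roughly Python 3.6-3.9):

pip install torch==1.7.1 transformers==4.12.3 scikit-learn scipy wget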

2. Download pre-trained models

Pretrained models (trained on DailyDialog and ConvAI2): Link

Place the downloaded files as shown below (a shell sketch for creating these directories follows the tree):

DEnsity/
    logs/
        dd/
            reranker.scl-temp0.1-coeff1.epoch10.lr5e-5/
                models/
                    bestmodel.pth
        convai2/
            reranker.scl-temp0.1-coeff1.epoch10.lr5e-5/
                models/
                    bestmodel.pth
    results/
        pickle_save_path/
            dd/
                maha.ref-train.reranker-reranker.scl-temp0.1-coeff1.epoch10.lr5e-5.positive.pck
            convai2/
                maha.ref-train.reranker-reranker.scl-temp0.1-coeff1.epoch10.lr5e-5.positive.pck
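
A minimal shell sketch for creating this layout (the directory names are taken from the tree above; afterwards, copy each downloaded file into its matching directory):

mkdir -p logs/dd/reranker.scl-temp0.1-coeff1.epoch10.lr5e-5/models
mkdir -p logs/convai2/reranker.scl-temp0.1-coeff1.epoch10.lr5e-5/models
mkdir -p results/pickle_save_path/dd results/pickle_save_path/convai2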

1. How to use DEnsity for Evaluation?

from evaluators.model import DEnsity
from utils.utils import load_tokenizer_and_reranker


# Paths to the fine-tuned feature extractor and the pre-computed mean/covariance statistics
lm_name = 'bert-base-uncased'
model_path = './logs/dd/reranker.scl-temp0.1-coeff1.epoch10.lr5e-5/models/bestmodel.pth'
mean_cov_pck_fname = "./results/pickle_save_path/dd/maha.ref-train.reranker-reranker.scl-temp0.1-coeff1.epoch10.lr5e-5.positive.pck"

tokenizer, model = load_tokenizer_and_reranker(lm_name, model_path)
evaluator = DEnsity(None, mean_cov_pck_fname, tokenizer, model)

conversation = ["How are you?", "I'm fine, thank you!", "That's great!!!!"]

# Turn-level and dialogue-level scores for the same conversation
turn_level_score = evaluator.evaluate(conversation, is_turn_level=True)  # -498.25882
dialogue_level_score = evaluator.evaluate(conversation, is_turn_level=False)  # -352.70813

2. How to Train DEnsity from Scratch?

The procedure below is an example of training our feature extractor (i.e., a response selection model) on the DailyDialog dataset.

1. Dataset Preparation

Download the DailyDialog dataset and place it as below.

DEnsity/
    data/
        dd/
            train/
                dialogues_train.txt
            validation/
                dialogues_validation.txt

2. Run Training

source scripts/train.sh

3. How to train on new datasets other than DailyDialog and ConvAI2?

Make a new dataset class (e.g., MyDatasetforSelection(SelectionDataset)). You can refer to the ConvAI2forSelection() class in utils/dataset_util.py.
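
A minimal sketch of such a subclass is shown below; the overridden method name and the (context, response) example format are assumptions for illustration only, so check SelectionDataset and ConvAI2forSelection() in utils/dataset_util.py for the actual interface to implement.

from utils.dataset_util import SelectionDataset


class MyDatasetforSelection(SelectionDataset):
    # Hypothetical dataset class for a new corpus. The real subclasses in
    # utils/dataset_util.py define how raw dialogues are read and turned into
    # (context, response) pairs for the response selection objective.

    def read_raw_file(self, fname):
        # Assumption: each line of the raw file is one dialogue with turns
        # separated by tabs. Build one example per turn, using all preceding
        # turns as the context and the current turn as the positive response.
        examples = []
        with open(fname, encoding="utf-8") as f:
            for line in f:
                turns = [t.strip() for t in line.strip().split("\t") if t.strip()]
                for i in range(1, len(turns)):
                    examples.append({"context": turns[:i], "response": turns[i]})
        return examples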

3. How to Reproduce the Paper Results?

1. Preprocessing evaluation dataset

# DailyDialogue-Zhao
# Download human annotation file from [here](https://drive.google.com/drive/folders/1Y0Gzvxas3lukmTBdAI6cVC4qJ5QM0LBt) to `data/evaluation/dd/dd_annotations.json`.
python preprocess/preprocess_dd_zhao_annotation.py

# GRADE-DailyDialog and GRADE-ConvAI2
# Download human annotation file from [here](https://github.com/li3cmz/GRADE/tree/main/evaluation).
python preprocess/preprocess_grade_annotation.py

# USR-ConvAI2
python preprocess/preprocess_usr_annotation.py

# Dialogue-level FED
preprocess/preprocess_fed_dialogue.ipynb # Run notebook

2. Run Evaluation

Main Results (Turn-level evaluation in Table 1 of the paper)

source scripts/test.sh

To reproduce the results of dialogue-level evaluation on the FED dataset, please use the Python code below.

from tqdm import tqdm

from evaluators.model import DEnsity
from evaluators.eval_utils import get_correlation
from utils.utils import load_tokenizer_and_reranker, read_jsonl


# Load model
lm_name = "bert-base-uncased"
model_path = "./logs/dd/reranker.scl-temp0.1-coeff1.epoch10.lr5e-5/models/bestmodel.pth"
mean_cov_pck_fname = "./results/pickle_save_path/dd/maha.ref-train.reranker-reranker.scl-temp0.1-coeff1.epoch10.lr5e-5.positive.pck"

tokenizer, model = load_tokenizer_and_reranker(lm_name, model_path)
evaluator = DEnsity(None, mean_cov_pck_fname, tokenizer, model)

# Read dataset
fed_fname = "./data/evaluation/fed_dialogue_overall/test_processed.json"
fed_examples = read_jsonl(fed_fname)

# Run evaluation
model_scores = []
human_scores = []
tokenizer.truncation_side = "left"  # keep the most recent turns when a dialogue exceeds the max length

for _, el in enumerate(tqdm(fed_examples)):
    uttrs, score = el["history"], el["score"]
    density_score = evaluator.evaluate(uttrs, is_turn_level=False)
    model_scores.append(density_score)
    human_scores.append(score)


print(round(100 * get_correlation(human_scores, model_scores)["spearman-value"], 2))  # 43.34
