This repository contains the dataset and source code for the paper "Meta-Context Transformers for Domain-Specific Response Generation", published in PAKDD 2021.
This repository builds on the Hugging Face Transformers package and OpenAI GPT-2. The baseline scripts for HRED and VHRED are adapted from MultiTurnDialogueZoo. The results indicate that DSRNet generates natural and adequate responses given the dialogue history, questions, and topics, even in a multi-party conversation setting. It can also be used to train an NLG model with very limited examples.
ArXiv paper: https://arxiv.org/abs/2010.05572
Use the commands below to clone the repository and install the requirements.
git clone <repo>
conda env create -f environment.yml
conda activate transformers
Download the following NLTK packages inside the virtual environment:
python
import nltk
nltk.download('words')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
exit()
python -m spacy download en_core_web_sm
Fetch and extract the MALLET package:
wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.tar.gz
tar -xvf mallet-2.0.8.tar.gz
The Ubuntu IRC dataset contains 77,563 annotated IRC messages. Almost all are from the Ubuntu IRC Logs for the #ubuntu channel. A small set is a re-annotation of the #linux channel data from Elsner and Charniak (2008).
The dataset is present in our repo in the kummerfeld/data folder.
Data Download
Download the dataset here (zip or tar) and unzip. Create the following directories:
mkdir kummerfeld
cd kummerfeld
mkdir data
cd ..
Copy only the train, dev and test folders from the downloaded folder to kummerfeld/data:
cp -r <path-of-downloaded-folder>/data/<train, dev, or test>/ kummerfeld/data/
Conversation Mining
- To extract conversations and use them as context :
python extract_conversation.py --turn_len=3 --add_qstn=False --sliding_win=True
- To use consecutive utterances as context, set --sliding_win=False in the command above.
Argument Description
--turn_len : Number of turns in the dialogue history; default=3
--add_qstn : Adds the query from the dialogue history to the context; default=False
--sliding_win : Processes the dialogue history in a sliding window fashion; default=True
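The effect of --turn_len under a sliding window can be illustrated with a small sketch. The function below is hypothetical and is not the code in extract_conversation.py; it assumes that the window advances one utterance at a time and that the response is the utterance immediately following the window. The actual script may differ in both respects.

```python
from typing import List, Tuple


def sliding_contexts(utterances: List[str], turn_len: int = 3) -> List[Tuple[List[str], str]]:
    """Pair each window of `turn_len` utterances with the utterance that follows it.

    Illustrative only: the window step (1) and the choice of response
    (the next utterance) are assumptions, not taken from the repository.
    """
    if turn_len < 1:
        raise ValueError("turn_len must be at least 1")
    pairs = []
    for end in range(turn_len, len(utterances)):
        context = utterances[end - turn_len:end]  # the previous `turn_len` turns
        response = utterances[end]                # the turn to be generated
        pairs.append((context, response))
    return pairs


conversation = ["u1", "u2", "u3", "u4", "u5"]
pairs = sliding_contexts(conversation, turn_len=3)
for context, response in pairs:
    print(context, "->", response)
```

With five utterances and turn_len=3, this yields two examples whose contexts overlap in two turns.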
Adding Query:
- To generate a question detector :
python question_detect.py
- To add query to the context along with conversation :
python extract_conversation.py --turn_len=3 --add_qstn=True --sliding_win=True
Adding Topic:
- To add topic words to the context :
python extract_conversation.py
python preprocessing.py #need only run once
python topic_modeling.py --add_qstn=False
- To add query and topic words to the context, run topic_modeling.py with --add_qstn=True.
Adding Entity:
We are unable to release the original code for domain-specific entity extraction due to copyright issues. Instead, we provide the code used to inject the entity words into the context, with the entity extraction module replaced by that of spaCy. Note that this will not reproduce the scores reported with entities in the paper. For datasets or use cases with domain-specific vocabulary, we strongly recommend using a domain-specific entity extraction mechanism.
To plug your own domain-specific entity extraction code into our entity data formatting code, refer to lines 28 and 144 of entity_extraction.py.
For natural language datasets, this code can be used directly.
To recreate the domain-specific entity extraction code we used for the technical domain, follow the steps in Mohapatra et al.
- To add entity words to the context :
python extract_conversation.py
python entity_extraction.py --add_qstn=False
- To add query and entity words to the context, run entity_extraction.py with --add_qstn=True.
Data files generated include:
kummerfeld/ctxt-train-{task}.txt : training set in txt format, with fields separated by special tokens.
kummerfeld/ctxt-dev-{task}.txt : development set in txt format, with fields separated by special tokens.
kummerfeld/ctxt-test-{task}.txt : testing set in txt format, with fields separated by special tokens.
- task can be topic/qstn/qstn-topic/None
kummerfeld/data/{mode} : contains the raw IRC dataset files
- mode can be train/dev/test
Data format
Line 1 : I went to propietary drivers i have both selected [eos] drm driver for Intel GMA500 [eos] drm driver for Intel GMA500 [eoc] driver, nvidia, card, explain, boot, adjust, drive, domain, window, machin [eot] [sep] and Intel Cedarview graphics driver [eos]
Line 2 : drm driver for Intel GMA500 drm driver for Intel GMA500 [eos] and Intel Cedarview graphics driver [eos] Only one should be activated [eoc] driver, nvidia, card, explain, boot, adjust, drive, domain, window, machin [eot] [sep] Only one should be activated [eos]
[eos] : indicates end-of-turn
[eoc] : indicates end-of-context
[eot] : indicates end-of-topic
[eoq] : indicates end-of-query
[sep] : separates context from ground truth response
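To make the layout concrete, here is a minimal sketch of a parser for one line in the format shown above. The function name parse_line is hypothetical and not part of the repository. It handles only the context-plus-topic layout from the example lines; the position of [eoq] in query-augmented files is not shown here, so the sketch does not try to interpret it.

```python
from typing import Dict, List


def parse_line(line: str) -> Dict[str, object]:
    """Split one data line into context turns, topic words and the response.

    Assumed layout (taken from the example lines only):
    turn [eos] turn [eos] turn [eoc] word, word, ... [eot] [sep] response [eos]
    """
    left, _, right = line.partition("[sep]")
    context_part, _, meta_part = left.partition("[eoc]")

    # Turns in the context are delimited by [eos].
    turns = [t.strip() for t in context_part.split("[eos]") if t.strip()]

    # Topic words, if present, sit between [eoc] and [eot], comma-separated.
    topic_words: List[str] = []
    if "[eot]" in meta_part:
        topic_text = meta_part.split("[eot]")[0]
        topic_words = [w.strip() for w in topic_text.split(",") if w.strip()]

    # The ground-truth response follows [sep] and ends with [eos].
    response = right.replace("[eos]", "").strip()
    return {"turns": turns, "topic_words": topic_words, "response": response}


example = (
    "I went to propietary drivers i have both selected [eos] "
    "drm driver for Intel GMA500 [eos] drm driver for Intel GMA500 [eoc] "
    "driver, nvidia, card [eot] [sep] and Intel Cedarview graphics driver [eos]"
)
parsed = parse_line(example)
print(parsed["turns"])
print(parsed["topic_words"])
print(parsed["response"])
```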
This step is optional: the model can run without it, but it can improve accuracy. We fine-tuned the GPT-2 language model using this code on a subset of the Ubuntu 2.0 dataset. The fine-tuned model is then used as the initial checkpoint in the following training step.
export CUDA_VISIBLE_DEVICES=0
python train.py --output_dir=MODEL_SAVE_PATH --model_type=gpt2 --model_name_or_path=PRE_TRAINED_MODEL_PATH --do_train --do_eval --eval_data_file=kummerfeld/ctxt-dev.txt --per_gpu_train_batch_size 16 --num_train_epochs EPOCH --learning_rate LR --overwrite_cache --use_tokenize --train_data_file=kummerfeld/ctxt-train.txt --overwrite_output_dir
MODEL_SAVE_PATH : Path for saving the model.
PRE_TRAINED_MODEL_PATH : Initial checkpoint; can be gpt2, gpt2-medium, or a domain-pretrained model.
EPOCH : Number of training epochs; 5 is enough for reasonable performance.
LR : Learning rate; 5e-5, 1e-5, or 1e-4.
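When trying several learning rates, it can help to assemble the training command programmatically. The sketch below only builds and prints the command; it uses the flags from the command above and nothing else. The helper name build_train_command and the checkpoints/ output paths are hypothetical.

```python
import shlex
from typing import List


def build_train_command(
    model_save_path: str,
    pretrained_model_path: str,
    epochs: int = 5,
    learning_rate: float = 5e-5,
    batch_size: int = 16,
    train_file: str = "kummerfeld/ctxt-train.txt",
    eval_file: str = "kummerfeld/ctxt-dev.txt",
) -> List[str]:
    """Return the train.py invocation as an argument list (nothing is executed)."""
    return [
        "python", "train.py",
        f"--output_dir={model_save_path}",
        "--model_type=gpt2",
        f"--model_name_or_path={pretrained_model_path}",
        "--do_train", "--do_eval",
        f"--eval_data_file={eval_file}",
        "--per_gpu_train_batch_size", str(batch_size),
        "--num_train_epochs", str(epochs),
        "--learning_rate", str(learning_rate),
        "--overwrite_cache", "--use_tokenize",
        f"--train_data_file={train_file}",
        "--overwrite_output_dir",
    ]


# One command per suggested learning rate; output paths are illustrative only.
for lr in (5e-5, 1e-5, 1e-4):
    cmd = build_train_command(f"checkpoints/lr_{lr}", "gpt2", learning_rate=lr)
    print(shlex.join(cmd))
```

To actually launch a run, the argument list can be passed to subprocess.run(cmd, check=True) from the repository root, with CUDA_VISIBLE_DEVICES set in the environment as shown above.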
To visualize the training and evaluation loss curves, either:
- check train_results.txt and eval_results.txt in MODEL_SAVE_PATH manually, or
- run
python plot.py --dir MODEL_SAVE_PATH
and look at the loss.png file generated in MODEL_SAVE_PATH.
mkdir output
export CUDA_VISIBLE_DEVICES=0
python generate.py --model_path=CHECKPOINT --test_file 'kummerfeld/ctxt-test.txt' --generate_path 'output/gen.txt' --true_path 'output/true.txt' --json_path 'output/output.json'
Replace CHECKPOINT with the path to your model checkpoint.
Refer to the output.json file for a neat representation of the context, ground truth and generated utterances.
Install nlg-eval following the instructions in NLG-EVAL (we recommend following their custom setup).
nlg-eval --hypothesis=output/gen.txt --references=output/true.txt > output/nlg_eval.txt
For BERTScore:
pip install bert-score
bert-score -r output/true.txt -c output/gen.txt --lang en
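Both tools compare the hypothesis and reference files line by line, so a quick sanity check before scoring can save a confusing result. The sketch below is not part of the repository and its function names are hypothetical; it verifies that the two files have the same number of lines and reports empty generations.

```python
from pathlib import Path
from typing import List


def read_lines(path: str) -> List[str]:
    """Read a text file into a list of lines without trailing newlines."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def check_alignment(hypotheses: List[str], references: List[str]) -> List[int]:
    """Raise if the line counts differ; return the indices of empty hypotheses."""
    if len(hypotheses) != len(references):
        raise ValueError(
            f"{len(hypotheses)} hypotheses vs {len(references)} references"
        )
    return [i for i, hyp in enumerate(hypotheses) if not hyp.strip()]


# In-memory example; for real outputs use
# check_alignment(read_lines("output/gen.txt"), read_lines("output/true.txt")).
empty = check_alignment(["try the nvidia driver", ""], ["use nvidia", "reboot"])
print("empty generations at lines:", empty)
```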
If you use this code and data in your research, please cite our paper:
@misc{kar2020metacontext,
title={Meta-Context Transformers for Domain-Specific Response Generation},
author={Debanjana Kar and Suranjana Samanta and Amar Prakash Azad},
year={2020},
eprint={2010.05572},
archivePrefix={arXiv},
primaryClass={cs.CL}
}