
ConvSDG

Code repository for the paper "ConvSDG: Session Data Generation for Conversational Search", accepted at the WWW 2024 Workshop on Information Retrieval Meets Large Language Models.

Environment Dependency

Main packages:

  • python 3.8
  • torch 1.8.1
  • transformers 4.2.0
  • numpy 1.22
  • faiss-gpu 1.7.2
  • pyserini 0.16
  • openai

Running Steps

1. Data Download and Preprocessing

Four public datasets can be downloaded from the CAsT and TopiOCQA websites. The data preprocessing code for each dataset can be found in the corresponding "preprocess" folder.

2. Generate Session Data

Session data can be generated at the dialogue level and at the query level with the following commands (a sketch of the generation call is given below):

python generate_session_data_dial-level.py
python generate_augmented_query.py
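For reference, the dialogue-level generation is driven by an LLM through the openai package. Below is a minimal, hypothetical sketch assuming the legacy openai (<1.0) client; the prompt wording, model name, and topic source are illustrative, not the repository's actual implementation.

# Hypothetical sketch of LLM-based session generation (legacy openai<1.0 client);
# the prompt and model name are illustrative assumptions.
import json
import openai

openai.api_key = "YOUR_API_KEY"

PROMPT = ("Generate a multi-turn conversational search session about the "
          "following topic. Output one question per line.\n\nTopic: {topic}")

def generate_session(topic, model="gpt-3.5-turbo"):
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(topic=topic)}],
        temperature=0.7,
    )
    text = response["choices"][0]["message"]["content"]
    return [q.strip() for q in text.split("\n") if q.strip()]

if __name__ == "__main__":
    for turn_id, query in enumerate(generate_session("renewable energy storage")):
        print(json.dumps({"turn_id": turn_id, "query": query}))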

3. Retrieval Indexing (Dense and Sparse)

To evaluate the model trained with ConvSDG, we first need to build indexes for both the dense and sparse retrievers.

3.1 Dense

For dense retrieval, we use the pre-trained ad-hoc search model ANCE to generate passage embeddings. A script is provided for each dataset; run:

python dense_index.py
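Conceptually, dense indexing encodes each passage into a vector and stores the vectors in a FAISS index for inner-product search. The sketch below is an illustration under stated assumptions: ANCE is RoBERTa-based and uses the first-token representation, but the real checkpoint includes a projection head that the repository's model code handles, and the actual dense_index.py shards the collection into passage blocks.

# Minimal sketch of dense indexing with FAISS; encoder loading is
# illustrative and omits ANCE's projection head.
import faiss
import numpy as np
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("checkpoints/ad-hoc-ance-msmarco")
encoder = RobertaModel.from_pretrained("checkpoints/ad-hoc-ance-msmarco")
encoder.eval()

@torch.no_grad()
def encode(texts, max_length=384):
    inputs = tokenizer(texts, padding=True, truncation=True,
                       max_length=max_length, return_tensors="pt")
    # Use the first-token ([CLS]) representation as the passage embedding.
    return encoder(**inputs).last_hidden_state[:, 0].numpy()

passages = ["passage one ...", "passage two ..."]  # stand-in for the collection
embeddings = encode(passages).astype(np.float32)

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner-product search
index.add(embeddings)
faiss.write_index(index, "dense_index.faiss")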

3.2 Sparse

For sparse retrieval, first run the format conversion script:

python convert_to_pyserini_format.py
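Pyserini expects a JSONL collection in which each document carries an "id" and a "contents" field. A minimal sketch of such a conversion, assuming a TSV collection (pid, tab, passage text) and illustrative file names:

# Sketch of converting a TSV collection into Pyserini's JSONL format;
# input/output paths are illustrative.
import json

with open("collection.tsv") as src, open("pyserini_collection/docs.jsonl", "w") as dst:
    for line in src:
        pid, text = line.rstrip("\n").split("\t", 1)
        dst.write(json.dumps({"id": pid, "contents": text}) + "\n")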

Then create the index for the collection by running:

bash create_index.sh

4. Generate Supervision Signals

For the dialogue-level generated data (the unsupervised setting, without relevance judgments), supervision signals are assigned by retrieving relevant passages with both sparse and dense retrieval:

python search_relevant_p_sparse.py
python search_relevant_p_dense.py
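As an illustration of the sparse side, the sketch below assigns a pseudo-positive passage to each generated query with BM25 via Pyserini's LuceneSearcher. The paths, BM25 parameters, and top-k choice are assumptions, not the repository's exact settings.

# Hypothetical sketch of pseudo-relevance labeling with BM25: the
# top-ranked passage for each generated query becomes its positive.
import json
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher("indexes/sparse_index")
searcher.set_bm25(k1=0.82, b=0.68)

with open("generated_session.jsonl") as src, open("pseudo_labeled.jsonl", "w") as dst:
    for line in src:
        example = json.loads(line)
        hits = searcher.search(example["query"], k=10)
        if hits:  # take the top hit as the pseudo-positive passage
            example["pos_docid"] = hits[0].docid
            dst.write(json.dumps(example) + "\n")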

For the query-level generated data (the semi-supervised setting, with relevance judgments), we directly use the original annotations as supervision signals. Thus, after generating the augmented query data, run the following command to combine the augmented queries with the original data:

python preprocess_cast_augmented.py
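A hypothetical sketch of this combination step, in which each augmented query inherits the gold relevance labels of the original turn it was generated from (file and field names are illustrative):

# Sketch of merging augmented queries with the original CAsT annotations;
# "sample_id", "source_sample_id", and "pos_docs" are assumed field names.
import json

originals = {}
with open("cast_original.jsonl") as f:
    for line in f:
        item = json.loads(line)
        originals[item["sample_id"]] = item

with open("augmented_queries.jsonl") as src, open("cast_augmented_train.jsonl", "w") as dst:
    for line in src:
        aug = json.loads(line)
        gold = originals[aug["source_sample_id"]]
        aug["pos_docs"] = gold["pos_docs"]  # reuse the original relevance labels
        dst.write(json.dumps(aug) + "\n")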

5. Conversational Dense Retrieval Fine-tuning

To fine-tune the conversational dense retriever, run the command below; use train_conretriever.py for the dialogue-level data and train_conretriever_augment.py for the query-level augmented data. The pre-trained language model we use for dense retrieval is ANCE.

python train_conretriever.py --pretrained_encoder_path="checkpoints/ad-hoc-ance-msmarco" \
  --train_file_path=$train_file_path \
  --log_dir_path=$log_dir_path \
  --model_output_path=$model_output_path \
  --per_gpu_train_batch_size=16 \
  --num_train_epochs=5 \
  --max_query_length=64 \
  --max_doc_length=384 \
  --max_concat_length=512 \
  --is_train=True
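For intuition, dense retrieval fine-tuning of this kind typically optimizes a contrastive objective over query/positive-passage pairs with in-batch negatives. The sketch below shows that standard loss; the actual objective in train_conretriever.py may differ in its details.

# Sketch of the standard contrastive loss with in-batch negatives for
# dense retrieval fine-tuning; not necessarily the repository's exact loss.
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, pos_doc_emb):
    """query_emb, pos_doc_emb: (batch, dim); row i of pos_doc_emb is the
    positive passage for query i, and the other rows act as negatives."""
    scores = query_emb @ pos_doc_emb.T              # (batch, batch) dot products
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)          # diagonal entries are positives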

6. Retrieval Evaluation

Now, we can perform retrieval to evaluate the ConvSDG-trained dense retriever by running the command below, where $qrel_output_path is the output directory and $trec_gold_qrel_file_path is the gold qrel file. Set --rel_threshold=2 for CAsT-20 and CAsT-21 (1 otherwise).

python test_retrieval_cast.py --pretrained_encoder_path=$trained_model_path \
  --passage_embeddings_dir_path=$passage_embeddings_dir_path \
  --passage_offset2pid_path=$passage_offset2pid_path \
  --qrel_output_path=$qrel_output_path \
  --output_trec_file=$output_trec_file \
  --trec_gold_qrel_file_path=$trec_gold_qrel_file_path \
  --per_gpu_train_batch_size=4 \
  --test_type=convq \
  --max_query_length=64 \
  --max_doc_length=384 \
  --max_concat_length=512 \
  --is_train=False \
  --top_k=100 \
  --rel_threshold=1 \
  --passage_block_num=$passage_block_num \
  --use_gpu=True
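The script writes a TREC-format run file, which can be scored against the gold qrels with a standard tool such as pytrec_eval (not listed in the dependencies above; pip install pytrec_eval). A sketch with illustrative file names and metric choices:

# Sketch of scoring a TREC run file against gold qrels with pytrec_eval.
import pytrec_eval

def load_qrels(path):
    qrels = {}
    with open(path) as f:
        for line in f:  # format: qid 0 docid rel
            qid, _, docid, rel = line.split()
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

def load_run(path):
    run = {}
    with open(path) as f:
        for line in f:  # format: qid Q0 docid rank score tag
            qid, _, docid, _, score, _ = line.split()
            run.setdefault(qid, {})[docid] = float(score)
    return run

qrels = load_qrels("gold.qrels")
run = load_run("output.trec")
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.3", "recip_rank"})
results = evaluator.evaluate(run)
mean_ndcg3 = sum(r["ndcg_cut_3"] for r in results.values()) / len(results)
print(f"NDCG@3: {mean_ndcg3:.4f}")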

Cite Format

@inproceedings{mo2024convsdg,
  title={ConvSDG: Session Data Generation for Conversational Search},
  author={Mo, Fengran and Yi, Bole and Mao, Kelong and Qu, Chen and Huang, Kaiyu and Nie, Jian-Yun},
  booktitle={Companion Proceedings of the ACM on Web Conference 2024},
  pages={1634--1642},
  year={2024}
}
