LongEmbed

Extending Embedding Models for Long Context Retrieval

Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li

Please see full details in our pre-print

Paper page on HF · Dataset on HF · Model on HF

This repository is the official implementation for the paper "LongEmbed: Extending Embedding Models for Long Context Retrieval".

This paper explores context window extension of existing embedding models, pushing their limit to 32k tokens without requiring additional training. We construct the LongEmbed benchmark for long-context retrieval and investigate various methods to increase the input length of current embedding models.


⚡Released Models

To facilitate future research in long context embedding models, we release E5-Base-4k and E5-RoPE-Base. E5-Base-4k is further fine-tuned from E5-Base to support a 4k context window, while strictly preserving the original behavior for inputs not exceeding 512 tokens. E5-RoPE-Base follows the same training procedure as E5-Base, except for the substitution of APE with RoPE. It is released to facilitate comparison between APE- and RoPE-based embedding models.

Model | Link
E5-Base-4k | Download Link
E5-RoPE-Base | Download Link

Note that E5-Base-4k simply expands the position embedding matrix to allow for 4,096 position ids. The embedding vectors for the original position ids {0, 1, 2, ..., 511} are remapped to represent {0, 8, 16, ..., 4088}, and the embedding vectors for the remaining position ids are newly trained. So for inputs not exceeding 512 tokens, please multiply the position ids by 8 to maintain the original behavior.
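
To make the remapping concrete, here is a minimal sketch of building such position ids, assuming a Hugging Face BERT-style encoder that accepts an explicit position_ids argument; build_position_ids is our illustrative helper, not part of the released code.

import torch

def build_position_ids(input_ids: torch.Tensor, stride: int = 8) -> torch.Tensor:
    """Position ids for E5-Base-4k-style models.

    Per the note above, inputs of at most 512 tokens multiply the usual positions
    {0, 1, ..., L-1} by 8 so they land on the remapped slots {0, 8, ..., 4088}.
    Falling back to consecutive ids for longer inputs is our assumption, not a
    rule stated in this repo.
    """
    batch_size, seq_len = input_ids.shape
    position_ids = torch.arange(seq_len, device=input_ids.device)
    if seq_len <= 512:
        position_ids = position_ids * stride
    return position_ids.unsqueeze(0).expand(batch_size, seq_len)

# Usage (assuming `model` and `inputs` come from transformers' AutoModel / AutoTokenizer):
# outputs = model(**inputs, position_ids=build_position_ids(inputs["input_ids"]))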

🔍 Overview of LongEmbed

Task Description

LongEmbed includes 4 real-world retrieval tasks curated from long-form QA and summarization. Note that for QA and summarization datasets, we use the questions and summaries as queries, respectively.

  • NarrativeQA: A QA dataset comprising long stories averaging 50,474 words and corresponding questions about specific content such as characters and events. We adopt the test set of the original dataset.
  • 2WikiMultihopQA: A multi-hop QA dataset featuring questions with up to 5 hops, synthesized through manually designed templates to prevent shortcut solutions. We use the test split of the length-uniformly sampled version from LongBench.
  • QMSum: A query-based meeting summarization dataset that requires selecting and summarizing relevant segments of meetings in response to queries. We use the version processed by SCROLLS. Since its test set does not include ground-truth summaries and its validation set only has 60 documents, which is too small for document retrieval, we include the train set in addition to the validation set.
  • SummScreenFD: A screenplay summarization dataset comprising pairs of TV series transcripts and human-written summaries. As in QMSum, plot details are scattered throughout the transcript and must be integrated to form succinct descriptions in the summary. We use the validation set of the version processed by SCROLLS.

We also include two synthetic tasks, namely needle and passkey retrieval. The former is tailored from the Needle-in-a-Haystack Retrieval for LLMs, and the latter is adapted from Personalized Passkey Retrieval, with slight changes for evaluation efficiency. The advantage of synthetic data is that we can flexibly control the context length and the distribution of target information. For both tasks, we evaluate a broad context range of ${0.25,0.5,1,2,4,8,16,32}\times1024$ tokens. For each context length, we include 50 test samples, each comprising 1 query and 100 candidate documents.

Task Statistics

Dataset | Domain | # Queries | # Docs | Avg. Query Words | Avg. Doc Words
NarrativeQA | Literature, Film | 10,449 | 355 | 9 | 50,474
QMSum | Meeting | 1,527 | 197 | 71 | 10,058
2WikimQA | Wikipedia | 300 | 300 | 12 | 6,132
SummScreenFD | ScreenWriting | 336 | 336 | 102 | 5,582
Passkey | Synthetic | 400 | 800 | 11 | -
Needle | Synthetic | 400 | 800 | 7 | -

📈 Evaluation on the LongEmbed Benchmark

Environment Setup

To replicate our results, follow these steps to download the code and necessary dependencies:

git clone https://github.com/dwzhu-pku/LongEmbed.git
cd LongEmbed
pip install -r requirements.txt

Loading Data

LongEmbed contains six datasets: NarrativeQA, QMSum, 2WikiMultihopQA, SummScreenFD, Passkey, and Needle. Each dataset has three splits: corpus, queries, and qrels. The corpus.jsonl file contains the documents, the queries.jsonl file contains the queries, and the qrels.jsonl file describes the relevance. To load a specific split of a dataset, you may use:

from datasets import load_dataset

# dataset_name in ["narrativeqa", "summ_screen_fd", "qmsum", "2wikimqa", "passkey", "needle"]
# split_name in ["corpus", "queries", "qrels"]
dataset_name = "narrativeqa"
split_name = "queries"
data_list = load_dataset(path="dwzhu/LongEmbed", name=dataset_name, split=split_name)
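
As a quick sanity check (our own addition, not part of the repo), you can print the size and first record of each split to inspect its schema:

from datasets import load_dataset

# Inspect one dataset: number of records and the first example of each split.
for split_name in ["corpus", "queries", "qrels"]:
    data = load_dataset(path="dwzhu/LongEmbed", name="narrativeqa", split=split_name)
    print(split_name, len(data), data[0])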

Evaluation

The evaluation of LongEmbed can be easily conducted using MTEB. For the four real tasks, you can evaluate as follows:

from mteb import MTEB

retrieval_task_list = ["LEMBSummScreenFDRetrieval", "LEMBQMSumRetrieval", "LEMBWikimQARetrieval", "LEMBNarrativeQARetrieval"]
output_dict = {}
evaluation = MTEB(tasks=retrieval_task_list)
# TODO: load the model before evaluation; `args` is expected to supply output_dir and batch_size
results = evaluation.run(model, output_folder=args.output_dir, overwrite_results=True, batch_size=args.batch_size, verbosity=0)
for key, value in results.items():
    split = "test" if "test" in value else "validation"
    output_dict[key] = {"ndcg@1": value[split]["ndcg_at_1"], "ndcg@10": value[split]["ndcg_at_10"]}
print(output_dict)
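
The snippet above assumes model and args are already defined. As one minimal way to fill that gap (our choice of model and settings, not prescribed by the repo), any SentenceTransformer-compatible encoder works with MTEB:

from types import SimpleNamespace
from sentence_transformers import SentenceTransformer

# Illustrative stand-ins: pick whichever embedding model and settings you want to benchmark.
model = SentenceTransformer("intfloat/e5-base-v2")
args = SimpleNamespace(output_dir="results", batch_size=32)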

For the two synthetic tasks, we examine a broad context range of {256, 512, 1024, 2048, 4096, 8192, 16384, 32768} tokens. You may evaluate as follows:

from mteb import MTEB

needle_passkey_task_list = ["LEMBNeedleRetrieval", "LEMBPasskeyRetrieval"]
output_dict = {}
context_length_list = [256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
evaluation = MTEB(tasks=needle_passkey_task_list)
# TODO: load the model before evaluation (see above)
results = evaluation.run(model, output_folder=args.output_dir, overwrite_results=True, batch_size=args.batch_size, verbosity=0)
for key, value in results.items():
    needle_passkey_score_list = []
    for ctx_len in context_length_list:
        needle_passkey_score_list.append([ctx_len, value[f"test_{ctx_len}"]["ndcg_at_1"]])
    needle_passkey_score_list.append(["avg", sum(x[1] for x in needle_passkey_score_list) / len(context_length_list)])
    output_dict[key] = {item[0]: item[1] for item in needle_passkey_score_list}
print(output_dict)
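
Optionally (our own addition, not from the repo), you can persist the aggregated per-length scores next to MTEB's raw output:

import json
import os

# Write the summary dict built above to a JSON file inside the output directory.
os.makedirs(args.output_dir, exist_ok=True)
with open(os.path.join(args.output_dir, "needle_passkey_summary.json"), "w") as f:
    json.dump(output_dict, f, indent=2)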

Our code snippet for evaluation can be found in src/test_long_embed.py. You may refer to the scripts in scripts/run_long_embed.sh to reproduce the results.

🌟 Citation

If you find this repo helpful, please cite our paper as follows:

@article{zhu2024longembed,
  title={LongEmbed: Extending Embedding Models for Long Context Retrieval},
  author={Zhu, Dawei and Wang, Liang and Yang, Nan and Song, Yifan and Wu, Wenhao and Wei, Furu and Li, Sujian},
  journal={arXiv preprint arXiv:2404.12096},
  year={2024}
}
