Repository Overview

This repository accompanies our papers “Linking Rationale to Decision on Internet Standards: A Retrieval-Based Approach Using Synthetic Data” and “Beyond the Rules: Understanding the Design Logic of Internet Standards.”
It includes the scripts, models, and resources used in both studies.

Repository Structure

The repository is organized into two main components:

Information Retrieval (`ir/`)

See “Linking Rationale to Decision on Internet Standards: A Retrieval-Based Approach Using Synthetic Data.”
Implements retrieval-based approaches to connect rationales with technical decisions:

i2c (issue/email comments to code/textual edit): Maps discussion threads (rationales/explanations) to decisions in standards.

These tasks leverage the IETF mail archives as a discussion base and RFC/Internet-Draft repositories as decision sources.
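
As a rough illustration of the i2c direction, the sketch below ranks candidate decision texts against a discussion comment by cosine similarity. The bag-of-words `embed` function is only a stand-in for a dense embedding model, and the sample comment and edits are invented; none of the names come from the `ir/` code.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for a dense embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: str, candidates: dict[str, str], top_k: int = 3) -> list[tuple[str, float]]:
    """Return up to top_k (candidate id, score) pairs, most similar first."""
    q = embed(query)
    scored = [(cid, cosine(q, embed(text))) for cid, text in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical data: one mailing-list comment and two candidate draft edits.
comment = "The retransmission timeout should be raised because short timers cause spurious retransmits."
edits = {
    "edit-1": "Raise the initial retransmission timeout from 1 second to 3 seconds.",
    "edit-2": "Rename the security considerations section.",
}
ranking = rank(comment, edits)
print(ranking)
```

Swapping `embed` for a real encoder and `cosine` for a dot product over normalized vectors would leave the ranking logic unchanged.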

Generation (`gen/`)

See “Beyond the Rules: Understanding the Design Logic of Internet Standards.”
Implements retrieval-based approaches to connect rationales with technical decisions:

c2i (code/textual edit to issue/email comments): Retrieves relevant discussions for given technical decisions.

Extends the retrieval pipeline with a RAG (Retrieval-Augmented Generation) component that uses retrieved documents to generate coherent, context-aware explanations for design decisions.
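
To make the retrieve-then-generate idea concrete, here is a minimal sketch of how retrieved discussions might be assembled into a generation prompt. The `Retrieved` class, the `build_prompt` function, the prompt wording, and the sample data are all hypothetical and are not taken from the `gen/` code.

```python
from dataclasses import dataclass


@dataclass
class Retrieved:
    doc_id: str   # hypothetical identifier of a mail or issue comment
    text: str
    score: float  # retrieval score, higher means more relevant


def build_prompt(decision: str, docs: list[Retrieved], top_k: int = 2) -> str:
    """Assemble a prompt from the top_k highest-scoring retrieved documents."""
    best = sorted(docs, key=lambda d: d.score, reverse=True)[:top_k]
    context = "\n".join(
        f"[{i}] ({d.doc_id}) {d.text}" for i, d in enumerate(best, start=1)
    )
    return (
        "Explain the rationale behind the following design decision, "
        "using only the numbered discussion excerpts as evidence.\n\n"
        f"Decision:\n{decision}\n\n"
        f"Discussion excerpts:\n{context}\n\n"
        "Explanation:"
    )


# Invented example data.
docs = [
    Retrieved("mail-17", "Reviewers asked for a longer timer to avoid spurious retransmits.", 0.62),
    Retrieved("mail-42", "Measurements showed that a 1 second timer fires too early on slow links.", 0.81),
    Retrieved("mail-03", "Please fix the typo in the abstract.", 0.05),
]
prompt = build_prompt("Raise the initial retransmission timeout to 3 seconds.", docs)
print(prompt)
```

The resulting string would then be passed to whichever language model performs the generation step.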

Technical Implementation

This implementation adapts the MTEB benchmark framework (https://github.com/embeddings-benchmark/mteb), streamlined for retrieval and reranking tasks. Our evaluation scripts are compatible with any dataset that follows the MTEB format, and we provide a generator that produces context-aware responses from retrieved documents.
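
For orientation, MTEB-style retrieval data consists of a corpus, a set of queries, and relevance judgments. The sketch below shows that layout with invented ids and computes recall@k for a hypothetical ranking. The exact field names expected by this repository's loaders are an assumption, and `recall_at_k` is an illustrative helper, not part of the evaluation scripts.

```python
# Corpus: document id -> title and text (invented examples).
corpus = {
    "d1": {"title": "", "text": "Raise the initial retransmission timeout to 3 seconds."},
    "d2": {"title": "", "text": "Add a new option code to the registry."},
    "d3": {"title": "", "text": "Clarify the handling of unknown option codes."},
}
# Queries: query id -> query text.
queries = {
    "q1": "Short timers cause spurious retransmits.",
    "q2": "Receivers disagree on what to do with unknown options.",
}
# Relevance judgments: query id -> {document id: relevance label}.
qrels = {"q1": {"d1": 1}, "q2": {"d2": 1, "d3": 1}}
# Hypothetical system output: query id -> {document id: score}.
results = {
    "q1": {"d1": 0.9, "d2": 0.2, "d3": 0.1},
    "q2": {"d1": 0.8, "d2": 0.7, "d3": 0.1},
}


def recall_at_k(qrels: dict, results: dict, k: int) -> float:
    """Mean fraction of relevant documents that appear in each query's top k."""
    per_query = []
    for qid, judged in qrels.items():
        gold = {doc_id for doc_id, rel in judged.items() if rel > 0}
        if not gold:
            continue  # skip queries without any relevant document
        ranked = sorted(results.get(qid, {}).items(), key=lambda kv: kv[1], reverse=True)
        top = {doc_id for doc_id, _ in ranked[:k]}
        per_query.append(len(gold & top) / len(gold))
    return sum(per_query) / len(per_query) if per_query else 0.0


for k in (1, 2, 3):
    print(f"recall@{k} = {recall_at_k(qrels, results, k):.2f}")
```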

Installation

module load Miniconda3/22.11.1-1
export PS1=\$
source ${EBROOTMINICONDA3}/etc/profile.d/conda.sh
conda deactivate &>/dev/null
echo "Conda environments: $(conda info --envs)"
echo "EBROOTMINICONDA3: ${EBROOTMINICONDA3}"

conda create -p path/to/conda_env python=3.11
conda activate path/to/conda_env
pip install -e .

The RAG component requires additional packages. See requirements.txt for the full list, for example:

pip install langchain-text-splitters==1.1.0
pip install llama-index-core==0.14.8
pip install ragas==0.4.2

Dataset

The synthetic training data is released:

RFC‑Align
https://huggingface.co/datasets/jiebi/RFCAlign

Run `python download_RFCAlign.py` to download the dataset.

Models

The following models were trained on the synthetic data (V: verbose; N: non-verbose; D: decision; R: rationale):

RFC‑DRAlign‑QV
https://huggingface.co/jiebi/RFC-DRAlign-QV
RFC‑DRAlign‑QL
https://huggingface.co/jiebi/RFC-DRAlign-QL
RFC‑DRAlign‑LV
https://huggingface.co/jiebi/RFC-DRAlign-LV
RFC‑DRAlign‑LN
https://huggingface.co/jiebi/RFC-DRAlign-LN

We strongly recommend downloading the base model (for example, mistralai/Mistral-7B-v0.1) into the base_models folder and the PEFT model into the peft_models folder.
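
A minimal sketch of how the two folders could be combined at load time, assuming the released checkpoints are PEFT adapters applied on top of a base model and that each model sits in a subfolder named after the last part of its Hugging Face id. Both helper functions and the subfolder naming are hypothetical, and the base/adapter pairing shown is illustrative only; check each model card for the actual base model.

```python
from pathlib import Path


def model_paths(base_name: str, adapter_name: str, root: Path = Path(".")) -> tuple[Path, Path]:
    """Map Hugging Face ids to local folders (assumed layout, see lead-in)."""
    base_dir = root / "base_models" / base_name.split("/")[-1]
    adapter_dir = root / "peft_models" / adapter_name.split("/")[-1]
    return base_dir, adapter_dir


def load_adapted_model(base_dir: Path, adapter_dir: Path):
    """Load the base model from disk, then attach the PEFT adapter weights."""
    # Imported lazily so the path helper above works without these packages.
    from peft import PeftModel
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(str(base_dir))
    base = AutoModel.from_pretrained(str(base_dir))
    model = PeftModel.from_pretrained(base, str(adapter_dir))
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


# Illustrative pairing only; the adapter's real base model may differ.
base_dir, adapter_dir = model_paths("mistralai/Mistral-7B-v0.1", "jiebi/RFC-DRAlign-QV")
print(base_dir, adapter_dir)
```

Calling `load_adapted_model(base_dir, adapter_dir)` requires the `transformers` and `peft` packages and both folders to be populated.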

Models Fine-tuning

https://github.com/cheop-byeon/FlagEmbedding

Synthetic Data Generation

https://github.com/cheop-byeon/synthetic-data-kit

Evaluation

To run evaluations on retrieval tasks:

Dense Embedding Models

# See evaluation.sh for comprehensive evaluation setup with all models and datasets
# The script includes all available fine-tuned models (RFC-DRAlign, CodeConvo-based)
sbatch evaluation.sh

BM25 Baseline

# See bm25.sh for BM25-based retrieval evaluation
sbatch bm25.sh
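
For readers unfamiliar with the baseline, the following is a compact sketch of Okapi BM25 scoring (using the non-negative `log(1 + ...)` IDF variant). It is not the implementation invoked by bm25.sh; the class, the tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the sample documents are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    def __init__(self, docs: dict[str, str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        # Term frequencies and lengths per document.
        self.tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
        self.length = {doc_id: sum(tf.values()) for doc_id, tf in self.tf.items()}
        self.avg_len = sum(self.length.values()) / len(docs)
        # Document frequency: number of documents containing each term.
        df = Counter(term for tf in self.tf.values() for term in tf)
        n = len(docs)
        self.idf = {t: math.log((n - d + 0.5) / (d + 0.5) + 1.0) for t, d in df.items()}

    def score(self, query: str, doc_id: str) -> float:
        """Sum the BM25 contribution of every query term found in the document."""
        tf = self.tf[doc_id]
        norm = self.k1 * (1 - self.b + self.b * self.length[doc_id] / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            if term in tf:
                total += self.idf[term] * tf[term] * (self.k1 + 1) / (tf[term] + norm)
        return total

    def search(self, query: str, top_k: int = 3) -> list[tuple[str, float]]:
        """Return up to top_k (document id, score) pairs, best first."""
        scored = [(doc_id, self.score(query, doc_id)) for doc_id in self.tf]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


# Invented corpus and query.
index = BM25({
    "a": "Reduce the congestion window after a timeout.",
    "b": "The congestion signal is carried in the header.",
    "c": "Update the IANA registry for option codes.",
})
hits = index.search("congestion window reduction")
print(hits)
```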

Note: Before running evaluations, download the datasets using:

python download_CodeConvo.py      # Download CodeConvo dataset
python download_RFCAlign.py       # Download RFC-Align dataset

See DATASET_PATH_USAGE.md for detailed dataset download and path resolution instructions.

Acknowledgements

We acknowledge the MTEB benchmark framework developed by Muennighoff et al. (2022):

@article{muennighoff2022mteb,
  author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Loïc and Reimers, Nils},
  title = {MTEB: Massive Text Embedding Benchmark},
  publisher = {arXiv},
  journal={arXiv preprint arXiv:2210.07316},
  year = {2022},
  url = {https://arxiv.org/abs/2210.07316},
  doi = {10.48550/ARXIV.2210.07316},
}

