A comprehensive evaluation framework for assessing large language models' ability to revise IETF Internet-Draft and RFC documents through three complementary text-editing scenarios.
TextFlow provides a reproducible research platform for evaluating text revision capabilities in the context of Internet Engineering Task Force (IETF) RFC documents. The framework implements three distinct evaluation tasks that test different aspects of LLM writing assistance:
- Autocomplete: Improving incomplete/draft text (zero-context revision)
- Startup: Revising text from feedback alone (blind revision)
- Edit: Revising text using both original and feedback (context-aware revision)
This repository contains the implementation and evaluation pipeline used in the following publications:
Conference presentations are available in the slides directory:
- NLDB 2025 - slides/NLDB2025.pdf
  - International Conference on Applications of Natural Language to Information Systems
  - Program: https://www.jaist.ac.jp/event/nldb2025/program.html
- ANRW 2025 - slides/ANRW2025.pdf
  - ACM/IRTF Applied Networking Research Workshop
  - Program: https://www.irtf.org/anrw/2025/program.html
- IETF 124 Montreal - slides/ietf124Montreal.pdf
  - Research and Analysis of Standard-Setting Processes Research Group Session
  - Program: https://datatracker.ietf.org/meeting/124/session/rasprg
- IETF 123 - Applied Networking Research Workshop (ANRW)
  - Date: July 22, 2025 at 09:30 UTC
  - Video: https://www.youtube.com/watch?v=45Y_IwJi9y0
  - Session: Research presentations on LLM-enhanced RFC writing and IETF collaboration tools
- IETF 124 - RASPRG Research Group
  - Date: November 5, 2025 at 15:30-17:30 CET
  - Video: https://www.youtube.com/watch?v=rlYnQ5V8B64
  - Session: Research and Analysis of Standard-Setting Processes Research Group
```bibtex
@inproceedings{bian2025instruction,
  title={Instruction Tuning TextFlow Semi-automatic RFCs Generation},
  author={Bian, Jie and Welzl, Michael},
  booktitle={International Conference on Applications of Natural Language to Information Systems},
  pages={350--364},
  year={2025}
}

@inproceedings{bian2025empowering,
  title={Empowering IETF Collaboration with NLP Search Innovations and LLM-Enhanced RFC Writing},
  author={Bian, Jie and Welzl, Michael},
  booktitle={Proceedings of the 2025 Applied Networking Research Workshop},
  pages={24--31},
  year={2025}
}
```
- Installation
- Evaluation Tasks
- Dataset Format
- Quick Start
- Running Evaluations
- Evaluation Metrics
- Advanced Features
- Reproducibility
- Citation
- Python: 3.11+
- PyTorch: 2.9.1 with CUDA support (recommended)
- GPU: NVIDIA GPU with CUDA 12.6+ support (for inference acceleration)
- Disk Space: ~20GB (model weights + datasets)
Clone and install the repository:
```shell
git clone https://github.com/cheop-byeon/TextFlow.git
cd TextFlow
pip install -e .
```

Or install with conda (recommended for managing CUDA dependencies):
```shell
conda create -p path/to/conda_env python=3.12
conda activate path/to/conda_env
```

For accessing private models or downloading large model weights:
```shell
huggingface-cli login
```

For multi-GPU evaluation:
```shell
accelerate config
```

Objective: Evaluate the model's ability to improve incomplete or draft RFC text.
Evaluation Scenario: Zero-context revision where the model must enhance text quality without external feedback.
Input: Original text passage from an RFC document
Output: Revised/improved version of the text
Prompt Template:
```
You are a professional IETF RFC writer.
Please revise the following text using your knowledge and understanding.
Input:
Original Text:
{old_text}
Output:
Revised Text:
```
Assessment Basis: Lexical and semantic similarity between model-generated revisions and human reference revisions.
Objective: Evaluate the model's ability to revise text based exclusively on feedback guidance (blind revision).
Evaluation Scenario: Feedback-driven revision where the model must generate improved text without seeing the original.
Input: Feedback or review comments describing needed changes
Output: Revised text that incorporates the feedback suggestions
Prompt Template:
```
You are a professional IETF RFC writer.
Below is some feedback discussing changes needed for a text.
Please provide a revised version of the text based solely on the feedback.
Input:
Feedback:
{feedback}
Output:
Revised Text:
```
Assessment Basis: Model's ability to infer and apply improvements from textual guidance alone.
Objective: Evaluate context-aware text revision combining original text and feedback (standard editing scenario).
Evaluation Scenario: Traditional document editing where the model has both source material and revision guidance.
Input: Original text + Feedback/review comments
Output: Revised text incorporating both context and feedback
Prompt Template:
```
You are a professional IETF RFC writer.
Identify the parts of the original text that need revision based on the feedback.
Revise the text accordingly.
Input:
Original Text:
{old_text}
Feedback:
{feedback}
Output:
Revised Text:
```
Use Case: Evaluating the model's ability to perform targeted revisions using both context and feedback.
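As a concrete illustration, the Edit template above can be filled for a single sample with ordinary string formatting. The sample content below is invented; the field names `old_text` and `comments` follow the JSONL dataset format described later.

```python
# Illustrative only: the template text is copied from the Edit task above;
# the sample content is invented for demonstration.
EDIT_TEMPLATE = """You are a professional IETF RFC writer.
Identify the parts of the original text that need revision based on the feedback.
Revise the text accordingly.
Input:
Original Text:
{old_text}
Feedback:
{feedback}
Output:
Revised Text:
"""

sample = {
    "old_text": "The client MUST retries the request.",
    "comments": "Grammar: 'MUST retries' should be 'MUST retry'.",
}

prompt = EDIT_TEMPLATE.format(old_text=sample["old_text"],
                              feedback=sample["comments"])
```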
All tasks inherit from the Task base class and implement:
- `get_dataset()` - Load the evaluation dataset
- `get_prompt()` - Generate the prompt for a sample
- `get_reference()` - Get the reference/gold standard revision
- `postprocess_generation()` - Clean up model output
- `process_results()` - Aggregate evaluation metrics
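A hedged sketch of what such a subclass might look like for the Autocomplete task. The method names follow the list above, but the real base class and exact signatures live in the repository and may differ; the `new_text` reference field here is an assumption for illustration.

```python
class AutoCompleteTaskSketch:
    """Illustrative stand-in for a Task subclass; not the framework's own code."""

    PROMPT = ("You are a professional IETF RFC writer.\n"
              "Please revise the following text using your knowledge and understanding.\n"
              "Input:\nOriginal Text:\n{old_text}\nOutput:\nRevised Text:\n")

    def __init__(self, samples):
        self._samples = samples

    def get_dataset(self):
        return self._samples

    def get_prompt(self, sample):
        return self.PROMPT.format(old_text=sample["old_text"])

    def get_reference(self, sample):
        # "new_text" is an assumed field name for the gold revision.
        return sample.get("new_text", "")

    def postprocess_generation(self, generation):
        # Keep only the text after the final "Revised Text:" marker.
        return generation.rsplit("Revised Text:", 1)[-1].strip()

task = AutoCompleteTaskSketch([{"old_text": "A host MAY sends a probe."}])
out = task.postprocess_generation("...Revised Text:\nA host MAY send a probe.")
```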
Tasks evaluate text revisions using multiple metrics:
- BLEU - Lexical overlap with reference
- SacreBLEU - Corpus-level BLEU score
- Google BLEU - N-gram based similarity
- BERTScore - Semantic similarity (RoBERTa and DeBERTa variants)
- METEOR - Alignment-based metric
- Exact Match - Perfect match with reference
- WER - Word Error Rate
- MAUVE - Distribution distance metric
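Most of these metrics come from standard libraries, but WER is simple enough to sketch. The following illustrative implementation (not necessarily the one the framework uses) computes word-level Levenshtein distance normalized by reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program over word tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```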
Navigate to the evaluation directory and run the main script:
```shell
cd ids_evaluation
python main.py \
  --model meta-llama/Llama-2-7b-hf \
  --tasks ids_edit \
  --batch_size 1 \
  --max_new_tokens 512 \
  --load_dataset_path ../dataset/ids.i2c.test.generation.jsonl \
  --save_generations \
  --metric_output_path metrics.json
```

Generation Only:
```shell
python main.py --model <model_id> --tasks <task> --generation_only --save_generations
```

Evaluation Only (with pre-generated outputs):
```shell
python main.py --model <model_id> --tasks <task> --load_generations_path <path_to_generations.json>
```

PEFT adapter evaluation:

```shell
python main.py \
  --model <base_model_id> \
  --peft_model <path_to_peft_adapter> \
  --tasks ids_edit \
  --batch_size 1
```

4-bit quantization:
```shell
python main.py --model <model_id> --load_in_4bit --tasks ids_edit
```

8-bit quantization:
```shell
python main.py --model <model_id> --load_in_8bit --tasks ids_edit
```

Multi-GPU evaluation with Accelerate:

```shell
accelerate launch main.py \
  --model <model_id> \
  --tasks <task> \
  --batch_size 2
```

For detailed parameter documentation, see ids_evaluation/README.md.
- `--model` - HuggingFace model ID or local path (required)
- `--tasks` - Task names: `ids_auto_complete`, `ids_startup`, `ids_edit`
- `--batch_size` - Batch size per GPU (default: 1)
- `--max_new_tokens` - Maximum new tokens to generate (default: 512)
- `--load_dataset_path` - Path to custom dataset (JSONL format)
- `--seed` - Random seed for reproducibility
Datasets should be in JSONL format with the following structure:
For ids_auto_complete:
```json
{"old_text": "original text content"}
```

For ids_startup:

```json
{"comments": "feedback or review comments"}
```

For ids_edit:

```json
{"old_text": "original text content", "comments": "feedback or review comments"}
```

- Framework: Built with HuggingFace Transformers and Accelerate
- Support: Causal LM (GPT-style) and Seq2Seq models
- Quantization: 4-bit and 8-bit support via bitsandbytes
- Distributed: Multi-GPU evaluation via Accelerate
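Before running an evaluation, a custom dataset in the `ids_edit` JSONL format described above can be written and round-tripped with a few lines. The filename and sample content below are illustrative, not part of the shipped datasets:

```python
import json

# Invented sample content in the ids_edit JSONL schema.
samples = [
    {"old_text": "The sender MAY retransmits the packet.",
     "comments": "Fix the verb form after MAY."},
]

# JSONL: one JSON object per line.
with open("ids.custom.test.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# Read it back to confirm the format round-trips.
with open("ids.custom.test.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```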
- Prepare your evaluation dataset in JSONL format
- Choose a task type based on your evaluation scenario
- Select a model from HuggingFace Hub
- Run the evaluation command with desired parameters
- Results will be saved as JSON with all metrics
This evaluation harness is derived from the BigCode evaluation harness and the lm-evaluation-harness.