This repository contains the code, data, and results for the research paper "Evaluating the Use of LLMs for Documentation to Code Traceability".
This study performs a comprehensive evaluation of Large Language Models (LLMs) like Claude 3.5 Sonnet, GPT-4o, and o3-mini on their ability to establish traceability links between software documentation and source code. Using two novel datasets derived from the Crawl4AI and Unity Catalog open-source projects, the paper assesses three key capabilities:
- Trace link identification accuracy.
- The quality of relationship explanations.
- Multi-step trace chain reconstruction.
The findings indicate that LLMs significantly outperform traditional baselines (TF-IDF, BM25, CodeBERT) but also highlight current limitations, providing a roadmap for future research and practical application in software development workflows.
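To make the baseline comparison concrete, the following is a minimal sketch of how a TF-IDF baseline can rank code artifacts against a documentation snippet, and how predicted links can be scored against ground truth. It is not the repository's implementation: the function names, the toy artifacts, and the choice of precision, recall, and F1 as metrics are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(corpus):
    """Return one sparse TF-IDF vector (a dict) per text in the corpus."""
    counts_per_text = [Counter(tokenize(text)) for text in corpus]
    n_texts = len(counts_per_text)
    doc_freq = Counter()
    for counts in counts_per_text:
        doc_freq.update(counts.keys())
    # Smoothed IDF, so terms that occur in every text keep a small weight.
    idf = {t: math.log((1 + n_texts) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in counts.items()} for counts in counts_per_text]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_code_artifacts(doc_text, artifacts):
    """Rank artifact names by similarity to doc_text, most similar first."""
    names = list(artifacts)
    vectors = tfidf_vectors([doc_text] + [artifacts[name] for name in names])
    query, rest = vectors[0], vectors[1:]
    scored = [(name, cosine(query, vec)) for name, vec in zip(names, rest)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def precision_recall_f1(predicted, ground_truth):
    """Score a set of predicted (doc, artifact) links against the true set."""
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy, hypothetical data; not taken from the Crawl4AI or Unity Catalog datasets.
doc_text = "Call fetch_page to crawl a URL and return the page markdown."
artifacts = {
    "crawler.fetch_page": "def fetch_page(url): crawl the url and return page markdown",
    "catalog.create_table": "def create_table(name, schema): register a table in the catalog",
}
ranking = rank_code_artifacts(doc_text, artifacts)
print(ranking[0][0])  # prints crawler.fetch_page
```

A lexical baseline like this can only match on shared vocabulary, which is one reason it makes a useful lower bound when judging what an LLM adds.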
The research paper can be found on arXiv: https://arxiv.org/abs/2506.16440
- data/: Contains the raw and processed datasets for the Crawl4AI and Unity Catalog projects, including the full documents, code artifacts, and ground-truth trace links.
- src/: Contains all the Python source code for running the experiments.
  - src/experiments/: Scripts for each research question (RQ1, RQ2, RQ3) and baseline evaluations.
  - src/utils/: Utility scripts for data loading, metrics calculation, and interfacing with LLMs.
  - src/config/: Configuration files for the experiments.
- results/: Stores the raw and aggregated results generated by the experiment scripts.
- Clone the repository:
    git clone <repository-url>
    cd evaluating-llm-doc-code-traceability
- Create a virtual environment and activate it:
    python3 -m venv venv
    source venv/bin/activate  # On Windows, use: venv\Scripts\activate
- Install the required dependencies:
    pip install -r requirements.txt
- Set up your environment variables:
  - Copy the example .env file:
      cp .env.example .env
  - Add your API keys for Anthropic and OpenAI to the newly created .env file.
- Configure the experiment:
  - Modify config.py to select which Research Questions (RUN_RQS) to execute and to set the number of runs (NUM_RUNS).
- Execute the experiment scripts:
  - The main scripts for each research question are located in the src/experiments/ directory. Run them directly, for example:
      python src/experiments/rq1_traceability.py
      python src/experiments/rq2_relationships.py
      python src/experiments/rq3_pathways.py
  - Results will be saved in the results/ directory.
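
Before launching a run, it can help to confirm that the .env file is complete and to see which scripts the chosen settings map to. The sketch below is an optional pre-flight helper, not part of the repository. The variable names ANTHROPIC_API_KEY and OPENAI_API_KEY, the string form of the RUN_RQS entries, and all function names are assumptions; check .env.example and config.py for the names and value formats the project actually uses.

```python
import os

# Assumed variable names; the repository's .env.example is the authority.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")

# Script paths as listed in the steps above, keyed by an assumed RQ label.
SCRIPTS = {
    "RQ1": "src/experiments/rq1_traceability.py",
    "RQ2": "src/experiments/rq2_relationships.py",
    "RQ3": "src/experiments/rq3_pathways.py",
}


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def load_env_file(path=".env"):
    """Read and parse an env file; return an empty dict if it is absent."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as handle:
        return parse_env_text(handle.read())


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


def planned_commands(run_rqs):
    """Map the selected research questions to the commands to run."""
    return [["python", SCRIPTS[rq]] for rq in run_rqs]


# Demonstration on in-memory text, so nothing on disk is touched.
sample = 'ANTHROPIC_API_KEY="example-value"\n# a comment\nOPENAI_API_KEY=\n'
print(missing_keys(parse_env_text(sample)))  # prints ['OPENAI_API_KEY']
print(planned_commands(["RQ1", "RQ3"]))
```

In practice one would call `missing_keys(load_env_file())` from the repository root and only start the experiment scripts once it returns an empty list.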