This repository contains the artifact for the paper "Is Reuse All You Need? A Systematic Comparison of Regular Expression Composition Strategies." The paper investigates whether regex composition tasks are unique enough to merit dedicated machinery or whether reuse is sufficient.
Regular expressions (regexes) are prevalent in software engineering but are known to be difficult to compose correctly. This research systematically evaluates three major regex composition strategies:
- Reuse-by-example: Our novel operationalization of regex reuse practices
- Formal regex synthesis: Using algorithmic approaches to generate regexes
- Generative AI: Using large language models (LLMs) to compose regexes
We evaluated these strategies across multiple dimensions including accuracy, syntactic and semantic similarity, constraint balance, and computational efficiency.
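To make the reuse-by-example idea concrete, the following is a minimal sketch. It assumes that reuse means selecting, from a database of existing regexes, the one that best separates positive example strings from negative ones. The function names (`accuracy`, `best_reuse_candidate`), the toy database, and the scoring rule are all hypothetical illustrations and are not taken from the artifact; the code used in the paper is in `modules/evaluator/`.

```python
import re


def accuracy(pattern, positives, negatives):
    """Fraction of examples classified correctly, using full-string matching.

    Hypothetical scoring rule for illustration only.
    """
    try:
        compiled = re.compile(pattern)
    except re.error:
        # A candidate that does not compile in Python's dialect scores zero.
        return 0.0
    total = len(positives) + len(negatives)
    if total == 0:
        return 0.0
    correct = sum(1 for s in positives if compiled.fullmatch(s))
    correct += sum(1 for s in negatives if not compiled.fullmatch(s))
    return correct / total


def best_reuse_candidate(database, positives, negatives):
    """Return (pattern, accuracy) for the highest-scoring candidate.

    Raises ValueError if the database is empty.
    """
    scored = [(p, accuracy(p, positives, negatives)) for p in database]
    return max(scored, key=lambda pair: pair[1])


# Toy data, invented for this example.
database = [r"\d+", r"[a-z]+", r"\d{3}-\d{4}"]
positives = ["555-1234", "867-5309"]
negatives = ["5551234", "abc"]

best, score = best_reuse_candidate(database, positives, negatives)
print(best, score)
```

In this toy run the third candidate accepts both positive strings and rejects both negative ones, so it is the one selected. A realistic setup would also need to handle ties and regex dialect differences, which this sketch ignores.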
- data/: Contains all data used and produced in our experiments
  - regex-composition-bench/: Our novel dataset of regex composition tasks mined from GitHub and RegExLib
  - regex-reuse-database/: Production-ready regexes for the reuse-by-example approach
  - generated-regexes/: Regexes generated by different strategies
  - evaluation-results/: Evaluation results for each strategy
- modules/: Contains the code for all components of our research
  - extractor/: Code for extracting regexes from software repositories
  - evaluator/: Code for reuse-by-example and its evaluation
  - helpfulness_score/: Implementation of our novel "helpfulness" metric
  - regex_semantic_sim/: Semantic similarity comparison between regexes
  - regex_syntactic_sim/: Syntactic similarity comparison between regexes
  - run_strategies/: Code to run different regex composition strategies, including the prompts for LLMs
  - run_analysis/: Scripts for analyzing results
  - make_plots/: Scripts for generating plots and visualizations
More detailed information about each component can be found in its respective directory.