Splits the runtime into focused modules (qa_config, qa_runtime, qa_workflow), adds a --approve flag for human-in-the-loop review of generated tests before they are saved, and introduces HTML replay utilities for investigating failures. Evaluation now includes ROUGE/similarity scoring in the NLP baseline and an overall quality score in Ragas.
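
For reviewers who want a feel for the approval gate, here is a minimal sketch of how a --approve flag could filter tests before they are saved. Only the flag name comes from this change; `build_parser`, `select_tests_to_save`, the prompt text, and the `ask` parameter are hypothetical names used for illustration and may not match the actual code in qa_workflow.

```python
import argparse
from typing import Callable, List, Sequence


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI; only the --approve flag name is taken from the PR.
    parser = argparse.ArgumentParser(description="QA test generation (sketch)")
    parser.add_argument(
        "--approve",
        action="store_true",
        help="ask a human to confirm each generated test before it is saved",
    )
    return parser


def select_tests_to_save(
    tests: Sequence[str],
    approve: bool,
    ask: Callable[[str], str] = input,
) -> List[str]:
    """Return the tests that should be saved.

    Without approval, every test is kept. With approval, each test is
    shown to the reviewer and kept only on an explicit "y" or "yes".
    """
    if not approve:
        return list(tests)

    kept: List[str] = []
    for test in tests:
        answer = ask(f"Save test {test!r}? [y/N] ").strip().lower()
        if answer in ("y", "yes"):
            kept.append(test)
    return kept


if __name__ == "__main__":
    args = build_parser().parse_args()
    # Placeholder test names standing in for generated tests.
    generated = ["test_login_flow", "test_checkout_total"]
    to_save = select_tests_to_save(generated, approve=args.approve)
    print(f"{len(to_save)} of {len(generated)} tests selected for saving")
```

Passing the prompt function in as `ask` keeps the gate testable without a terminal, and defaulting to "no" means a stray Enter key never saves an unreviewed test.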