test(e2e): Add SN2 validation with NIST anchoring by ericchansen · Pull Request #102 · ericchansen/q2mm

ericchansen · 2026-03-20T21:27:18Z

Summary

SN2 end-to-end validation test with external reference anchoring (NIST experimental IR + literature barrier heights). Validates the full Q2MM pipeline produces physically meaningful force fields, not just internally consistent ones.

What's tested (17 slow tests)

CH3F Ground State (9 tests)

3-stage frequency comparison: Default FF (RMSD=164 cm^-1) -> Seminario (157) -> Q2MM-optimized (44)
Validates improvement at each stage
QM harmonic frequencies cross-checked against NIST experimental fundamentals (MAE < 80 cm^-1 with 0.963 scaling factor)

SN2 Transition State (5 tests)

Seminario Method D produces negative force constant for reaction coordinate
Optimization converges on TS geometry
Real-mode frequencies checked against QM reference

Reaction Profile (3 tests)

QM barrier height matches our B3LYP/6-31+G(d) reference (-4.24 kcal/mol)
MM energies finite for both GS and TS with their respective Seminario FFs

Key design decisions

data_idx maps to engine's full 3N frequency array, not the QM-only array (this was a bug in the initial implementation -- near-zero translational/rotational modes shifted the indexing)
All tests marked @pytest.mark.slow -- only run with --run-slow
Uses OpenMM backend (no external binaries needed for CI)
ASCII-only in print statements and assert messages (Windows cp1252 compatibility)

Files

File	Description
`test/integration/test_e2e_sn2_validation.py`	17 E2E tests across 3 test classes
`test/fixtures/sn2_external_reference.json`	NIST frequencies, literature barriers, QM metadata

Test output (diagnostic)

CH3F Ground State -- Frequency Comparison (cm^-1)
     QM (B3LYP/6-31+G(d)): [1040.6, 1191.7, 1191.8, 1495.3, 1512.3, 1512.4, 3058.0, 3145.1, 3145.1]
               Default FF:  [889.0,  889.0,  933.5,  1382.3, 1404.5, 1404.5, 2943.4, 3067.2, 3067.2]  RMSD=164.2
             Seminario FF:  [851.1,  1141.6, 1141.6, 1658.1, 1733.3, 1733.3, 2896.7, 3021.8, 3021.8]  RMSD=156.9
           Q2MM Optimized:  [1005.1, 1150.8, 1150.8, 1490.9, 1491.0, 1618.2, 3023.6, 3150.7, 3150.7]  RMSD=44.2

Closes #101

17 slow tests validating the full Q2MM pipeline on F- + CH3F: CH3F ground state (9 tests): - Default FF -> Seminario -> Q2MM-optimized frequency progression - Optimized RMSD=44.2 cm^-1 vs QM (from 164.2 default) - QM harmonic frequencies validated against NIST experimental IR SN2 transition state (5 tests): - Seminario Method D produces negative FC for reaction coordinate - Optimization converges, real-mode frequencies checked vs QM Reaction profile (3 tests): - QM barrier height matches our B3LYP reference (-4.24 kcal/mol) - MM energies finite for both GS and TS geometries External reference fixture includes NIST CH3F fundamentals, literature barrier heights (Czako 2015, Shaik 1992), and DFT scaling factor for harmonic-to-fundamental comparison. Closes #101 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Adds a slow, OpenMM-backed end-to-end integration validation for the SN2 (F- + CH3F) system, anchored to external experimental/literature reference data to ensure Q2MM optimizations remain physically meaningful.

Changes:

Introduces 3 slow pytest classes covering CH3F ground-state frequencies, SN2 TS behavior, and basic reaction-profile energetics checks.
Adds a JSON fixture containing NIST CH3F fundamentals, literature/QM barrier references, and QM metadata (level of theory + scaling factor).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
test/integration/test_e2e_sn2_validation.py	New slow E2E test suite exercising Seminario → optimization → validation against QM + external references.
test/fixtures/sn2_external_reference.json	External reference dataset (NIST + literature/QM anchors) consumed by the new E2E tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-20T21:30:06Z

+        if len(qm_real) >= 9:
+            qm_unique = np.array([qm_real[i] for i in qm_unique_indices])
+            qm_scaled = qm_unique * scale
+            nist_arr = np.array(nist_ordered)
+
+            mae = _frequency_mae(qm_scaled, nist_arr)
+            # Scaled B3LYP/6-31+G(d) should be within ~50 cm^-1 of experiment
+            assert mae < 80.0, f"Scaled QM vs NIST MAE too high: {mae:.1f} cm^-1"


test_qm_frequencies_vs_nist can silently do nothing if len(qm_real) < 9 due to the conditional, which would let missing/incorrect QM reference data pass unnoticed. It would be safer to assert the expected number of CH3F real modes (or explicitly pytest.skip/xfail with a reason) so the test always validates the NIST comparison when inputs are incomplete.

Copilot · 2026-03-20T21:30:07Z

+    def test_seminario_ts_frequencies(self, engine, ts_mol, seminario_ff):
+        """Seminario TS FF should produce at least one imaginary frequency."""
+        freqs = engine.frequencies(ts_mol, seminario_ff)
+        imag = _imaginary_frequencies(freqs)
+        # May not produce imaginary mode since OpenMM bonded terms are harmonic
+        # and negative FC gives a repulsive parabola. Log either way.
+        print(f"\nSeminario TS frequencies: {[f'{f:.1f}' for f in sorted(freqs) if abs(f) > 10]}")
+        print(f"Imaginary modes: {len(imag)}")


This test doesn't assert anything (it only prints), but its name/docstring suggest it's verifying imaginary frequencies. Either rename it to something like *_logged/diagnostic (consistent with the other logging-only tests) or add an assertion/xfail so it provides a real validation signal.

Copilot · 2026-03-20T21:30:07Z

+        rmsd = _frequency_rmsd(sorted(mm_freqs)[-n:], sorted(qm_real)[-n:])
+        # Generic FF should be noticeably off -- RMSD > 100 cm^-1
+        assert rmsd > 50.0, f"Default FF unexpectedly good (RMSD={rmsd:.1f} cm^-1)"


The comment says the generic FF baseline should have RMSD > 100 cm^-1, but the assertion only requires > 50.0. This makes the baseline check much weaker than documented and could let regressions slip through; either tighten the threshold or update the comment to match the intended expectation.

Copilot · 2026-03-20T21:30:07Z

+        n = min(len(opt_real), len(qm_real))
+        rmsd = _frequency_rmsd(opt_real[-n:], qm_real[-n:])
+        # Optimized FF should reproduce QM frequencies reasonably well
+        assert rmsd < 200.0, f"Optimized FF frequency RMSD too high: {rmsd:.1f} cm^-1"


The acceptance criteria in the PR description/reported output targets ~<50 cm^-1 RMSD for the optimized CH3F frequencies, but this test allows up to 200 cm^-1. Consider tightening this threshold (or aligning the PR description) so the test actually enforces the intended physical accuracy.

Replace Python list dumps with aligned, labeled tables for: - CH3F QM vs NIST experimental frequencies (with mode names) - Ground state frequency progression (mode-by-mode comparison) - TS frequency progression with imaginary mode count - Reaction profile barrier heights with literature comparison Much easier for new users to interpret test output. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Shows Seminario estimation, optimization, and frequency evaluation times alongside the frequency comparison tables. Proves that optimizations run live (not cached) and quantifies performance: - Seminario estimation: ~1-2 ms - L-BFGS-B optimization: ~4-8 sec (750-1300 evaluations) - Single frequency eval: ~97 ms (OpenMM backend) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-20T21:44:39Z

+    lines = path.read_text().strip().splitlines()
+    return np.array([float(line) for line in lines if not line.startswith("#")])
+
+
+def _load_qm_energy(path: Path) -> float:
+    """Load a single QM energy in Hartrees from text file."""
+    lines = path.read_text().strip().splitlines()
+    for line in lines:
+        line = line.strip()
+        if line and not line.startswith("#"):
+            return float(line)
+    raise ValueError(f"No energy value found in {path}")
+
+
+def _load_external_reference() -> dict:
+    """Load NIST + literature reference data."""
+    return json.loads(EXT_REF.read_text())


Path.read_text() is called without an explicit encoding in multiple helpers. On Windows this can default to a locale encoding (e.g., cp1252) and fail if the reference files/fixtures are UTF-8. Use read_text(encoding="utf-8") consistently for QM reference text files and the JSON fixture to make the test deterministic across platforms.

Copilot · 2026-03-20T21:44:39Z

+        nist = np.array(sorted(ext_ref["ch3f_nist_all_cm1"]))
+        scale = ext_ref["qm_level_of_theory"]["dft_scaling_factor"]
+
+        qm_real = sorted(_real_frequencies(qm_freqs))
+        # NIST has 6 unique fundamentals (e-symmetry modes are degenerate)
+        # Pick one representative from each degenerate pair
+        qm_unique_indices = [0, 1, 3, 4, 7, 8]
+        nist_ordered = [1049, 1182, 1464, 1467, 2930, 3006]
+        mode_names = [
+            "v3 (CF str)",
+            "v6 (CH3 rock)",
+            "v2 (CH3 s-def)",
+            "v5 (CH3 a-def)",
+            "v1 (CH3 s-str)",
+            "v4 (CH3 a-str)",
+        ]
+
+        if len(qm_real) >= 9:
+            qm_unique = np.array([qm_real[i] for i in qm_unique_indices])
+            qm_scaled = qm_unique * scale
+            nist_arr = np.array(nist_ordered)
+


This test defines nist = np.array(sorted(ext_ref["ch3f_nist_all_cm1"])) but never uses it, and instead hard-codes nist_ordered. That makes the fixture data effectively ignored and can drift out of sync. Remove the unused variable and derive the ordered NIST array from ext_ref (or document why hard-coding is required).

Copilot · 2026-03-20T21:44:40Z

+        if len(qm_real) >= 9:
+            qm_unique = np.array([qm_real[i] for i in qm_unique_indices])
+            qm_scaled = qm_unique * scale
+            nist_arr = np.array(nist_ordered)
+
+            mae = _frequency_mae(qm_scaled, nist_arr)
+
+            # Print comparison table
+            col1, col2, col3, col4, col5 = 17, 12, 12, 12, 10
+            print("\n")
+            print("=" * 70)
+            print("  CH3F QM vs NIST EXPERIMENTAL (cm^-1)")
+            print(f"  Scaling factor: {scale} (B3LYP/6-31+G(d))")
+            print("=" * 70)
+            print(
+                f"  {'Mode':<{col1}} {'QM harm':>{col2}} {'QM scaled':>{col3}} {'NIST expt':>{col4}} {'Error':>{col5}}"
+            )
+            print("  " + "-" * (col1 + col2 + col3 + col4 + col5))
+            for i in range(len(nist_ordered)):
+                err = qm_scaled[i] - nist_arr[i]
+                print(
+                    f"  {mode_names[i]:<{col1}} {qm_unique[i]:>{col2}.1f} {qm_scaled[i]:>{col3}.1f} {nist_arr[i]:>{col4}.0f} {err:>{col5}.1f}"
+                )
+            print("  " + "-" * (col1 + col2 + col3 + col4 + col5))
+            print(f"  {'MAE':<{col1}} {'':>{col2}} {'':>{col3}} {'':>{col4}} {mae:>{col5}.1f}")
+            print("=" * 70)
+            print()
+
+            assert mae < 80.0, f"Scaled QM vs NIST MAE too high: {mae:.1f} cm^-1"


test_qm_frequencies_vs_nist silently does nothing (and therefore passes) if len(qm_real) < 9. Since this is validating a pinned QM reference file, it’s better to fail fast (assert the expected number of modes is present) or explicitly pytest.skip with a message so unexpected truncation doesn’t get missed.

Suggested change

if len(qm_real) >= 9:

qm_unique = np.array([qm_real[i] for i in qm_unique_indices])

qm_scaled = qm_unique * scale

nist_arr = np.array(nist_ordered)

mae = _frequency_mae(qm_scaled, nist_arr)

# Print comparison table

col1, col2, col3, col4, col5 = 17, 12, 12, 12, 10

print("\n")

print("=" * 70)

print(" CH3F QM vs NIST EXPERIMENTAL (cm^-1)")

print(f" Scaling factor: {scale} (B3LYP/6-31+G(d))")

print("=" * 70)

print(

f" {'Mode':<{col1}} {'QM harm':>{col2}} {'QM scaled':>{col3}} {'NIST expt':>{col4}} {'Error':>{col5}}"

)

print(" " + "-" * (col1 + col2 + col3 + col4 + col5))

for i in range(len(nist_ordered)):

err = qm_scaled[i] - nist_arr[i]

print(

f" {mode_names[i]:<{col1}} {qm_unique[i]:>{col2}.1f} {qm_scaled[i]:>{col3}.1f} {nist_arr[i]:>{col4}.0f} {err:>{col5}.1f}"

)

print(" " + "-" * (col1 + col2 + col3 + col4 + col5))

print(f" {'MAE':<{col1}} {'':>{col2}} {'':>{col3}} {'':>{col4}} {mae:>{col5}.1f}")

print("=" * 70)

print()

assert mae < 80.0, f"Scaled QM vs NIST MAE too high: {mae:.1f} cm^-1"

# Fail fast if the QM reference data is unexpectedly truncated.

# We rely on specific mode indices in `qm_unique_indices`, so ensure

# that all of them are valid for the current QM data.

max_required_index = max(qm_unique_indices)

assert len(qm_real) > max_required_index, (

f"QM reference frequencies truncated: expected at least "

f"{max_required_index + 1} real modes, got {len(qm_real)}"

)

qm_unique = np.array([qm_real[i] for i in qm_unique_indices])

qm_scaled = qm_unique * scale

nist_arr = np.array(nist_ordered)

mae = _frequency_mae(qm_scaled, nist_arr)

# Print comparison table

col1, col2, col3, col4, col5 = 17, 12, 12, 12, 10

print("\n")

print("=" * 70)

print(" CH3F QM vs NIST EXPERIMENTAL (cm^-1)")

print(f" Scaling factor: {scale} (B3LYP/6-31+G(d))")

print("=" * 70)

print(

f" {'Mode':<{col1}} {'QM harm':>{col2}} {'QM scaled':>{col3}} {'NIST expt':>{col4}} {'Error':>{col5}}"

)

print(" " + "-" * (col1 + col2 + col3 + col4 + col5))

for i in range(len(nist_ordered)):

err = qm_scaled[i] - nist_arr[i]

print(

f" {mode_names[i]:<{col1}} {qm_unique[i]:>{col2}.1f} {qm_scaled[i]:>{col3}.1f} {nist_arr[i]:>{col4}.0f} {err:>{col5}.1f}"

)

print(" " + "-" * (col1 + col2 + col3 + col4 + col5))

print(f" {'MAE':<{col1}} {'':>{col2}} {'':>{col3}} {'':>{col4}} {mae:>{col5}.1f}")

print("=" * 70)

print()

assert mae < 80.0, f"Scaled QM vs NIST MAE too high: {mae:.1f} cm^-1"

Copilot · 2026-03-20T21:44:40Z

+        for _, freqs_label, _, _, t_freq in stages[1:]:
+            pass


There’s a no-op loop (for _, freqs_label, _, _, t_freq in stages[1:]: pass) left in the diagnostic logger. It adds noise and introduces unused variables; remove it or replace it with the intended logic.

Suggested change

for _, freqs_label, _, _, t_freq in stages[1:]:

pass

Copilot · 2026-03-20T21:44:40Z

+    def test_seminario_ts_frequencies(self, engine, ts_mol, seminario_ff):
+        """Seminario TS FF should produce at least one imaginary frequency."""
+        freqs = engine.frequencies(ts_mol, seminario_ff)
+        imag = _imaginary_frequencies(freqs)
+        real = _real_frequencies(freqs)
+        # May not produce imaginary mode since OpenMM bonded terms are harmonic
+        # and negative FC gives a repulsive parabola. Log either way.
+        print(f"\n  Seminario TS: {len(real)} real modes, {len(imag)} imaginary modes")


test_seminario_ts_frequencies’s docstring says the Seminario TS FF “should produce at least one imaginary frequency”, but the test never asserts that (and the inline comment says it may not happen). Please align the docstring with the actual behavior (diagnostic-only), or turn this into an explicit assert/xfail/skip based on the expected OpenMM behavior.

Copilot · 2026-03-20T21:44:41Z

+        if n > 0:
+            rmsd = _frequency_rmsd(mm_real[-n:], qm_real[-n:])
+            assert rmsd < 500.0, f"TS frequency RMSD too high: {rmsd:.1f} cm^-1"


test_ts_optimized_real_freqs_match_qm will silently pass if n == 0 (no real modes above threshold). That would hide cases where frequency calculation breaks or returns only near-zero modes. Consider asserting n > 0 (or a more specific expected count) and failing with a helpful message when no comparable real modes are found.

Suggested change

if n > 0:

rmsd = _frequency_rmsd(mm_real[-n:], qm_real[-n:])

assert rmsd < 500.0, f"TS frequency RMSD too high: {rmsd:.1f} cm^-1"

assert n > 0, (

"No comparable real vibrational modes found for TS frequency comparison; "

f"MM real modes: {len(mm_real)}, QM real modes: {len(qm_real)}"

)

rmsd = _frequency_rmsd(mm_real[-n:], qm_real[-n:])

assert rmsd < 500.0, f"TS frequency RMSD too high: {rmsd:.1f} cm^-1"

Copilot · 2026-03-20T21:44:41Z

+    "nist": "NIST Chemistry WebBook, https://webbook.nist.gov/cgi/cbook.cgi?ID=C593533",
+    "czako_2015": "Szabó et al., J. Chem. Phys. 142, 244301 (2015)",
+    "shaik_1992": "Shaik & Pross, J. Chem. Soc. Perkin Trans. 2, 1019 (1992)"


In _sources, the key is named czako_2015 but the citation string is “Szabó et al., J. Chem. Phys. 142, 244301 (2015)”. This mismatch makes it harder to trace provenance later; consider renaming the key or adjusting the citation so the identifier matches the referenced source.

Copilot · 2026-03-20T21:44:41Z

+        qm_real = _real_frequencies(qm_freqs)
+        n = min(len(mm_freqs), len(qm_real))
+        rmsd = _frequency_rmsd(sorted(mm_freqs)[-n:], sorted(qm_real)[-n:])
+        # Generic FF should be noticeably off -- RMSD > 100 cm^-1


The comment says the generic FF baseline should have RMSD > 100 cm^-1, but the assertion only checks rmsd > 50.0. Either update the threshold to match the stated intent, or adjust the comment so the test expectation is unambiguous.

Suggested change

# Generic FF should be noticeably off -- RMSD > 100 cm^-1

# Generic FF should be noticeably off; here we require RMSD > 50 cm^-1

Copilot · 2026-03-20T21:44:42Z

+    """Validate the pipeline on the SN2 F⁻ + CH₃F transition state.
+
+    Transition states require special handling (Seminario Method D/E)
+    and must preserve the imaginary mode through optimization.
+    """


The TS class/docstring (and PR description) says the imaginary mode must be preserved through optimization, but there isn’t currently an assertion that the optimized TS has an imaginary mode (or exactly one). Consider asserting on _imaginary_frequencies(engine.frequencies(ts_mol, optimized_ff)) (count and/or magnitude vs QM), or update the docstring/test plan to match what’s actually verified.

Copilot · 2026-03-20T21:44:42Z

+    """Validate energetics across the reaction coordinate.
+
+    Compares MM barrier heights to QM values at:
+    - Separated reactants (F⁻ + CH₃F)
+    - Ion-dipole complex
+    - Transition state
+    """


This reaction-profile class/docstring says it compares barriers involving the ion–dipole complex, but the tests never load/use COMPLEX_XYZ or compute TS–complex barriers (only TS–separated-reactants). Either add the complex energy/barrier assertions, or update the docstring/PR description to reflect what’s actually tested.

The previous table showed MM CH3F and TS energies as both 0.0000, which was meaningless: Seminario FFs set equilibrium from QM geometry so energy at that geometry is zero by construction. Worse, comparing energies across molecule-specific FFs with different connectivity is apples-to-oranges (arbitrary energy zeros). Now shows the full QM reaction profile instead: - Complexation energy (F- + CH3F -> complex): -13.71 kcal/mol - TS - reactants barrier: -4.24 vs -0.45 (CCSD(T)-F12) - TS - complex barrier: 9.47 vs 11.3 (VB benchmark) Explains clearly why MM barriers aren't computable with separate FFs and notes that reactive potentials (EVB, ReaxFF) would be needed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Displace CH3F along each QM normal mode eigenvector at 0.05, 0.10, and 0.15 Ang and compare MM energy to QM harmonic prediction (0.5 * eigenvalue * q^2). This validates that the optimized FF reproduces the PES shape, not just curvature at the minimum. Pre-compute and save normal mode eigenvectors/eigenvalues as .npz files so the eigendecomposition (< 0.2 ms) doesn't run every test. Tables show both Seminario and Q2MM-optimized FF with per-mode percentage errors and wall clock timing (~3 s for 27 energy evals). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add _TablePrinter helper that collects all lines, measures the widest one, then sizes = and - bars to match. Refactor all 5 diagnostic tables to use it, eliminating hardcoded widths that didn't align with content. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Format error percentages as complete strings with fixed-width right-justification so columns stay aligned regardless of value magnitude. Center d=X.XX labels within their column group using computed group width instead of hardcoded spaces. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 2 out of 4 changed files in this pull request and generated 10 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-20T22:26:19Z

+
+from __future__ import annotations
+
+import copy


import copy appears unused in this module. Even though unused-import linting is relaxed in this repo, removing unused imports helps keep the test file easier to scan.

Suggested change

import copy

Copilot · 2026-03-20T22:26:19Z

+TS_ENERGY = QM_REF / "sn2-ts-energy.txt"
+
+COMPLEX_XYZ = QM_REF / "complex-optimized.xyz"
+EXT_REF = FIXTURES / "sn2_external_reference.json"


COMPLEX_XYZ is defined but never used in this module. Consider removing it to reduce dead code, or add a check that actually uses the complex structure (e.g., an MM energy/frequency sanity check at the complex stationary point).

Copilot · 2026-03-20T22:26:19Z

+        if len(qm_real) >= 9:
+            qm_unique = np.array([qm_real[i] for i in qm_unique_indices])
+            qm_scaled = qm_unique * scale
+            nist_arr = np.array(nist_ordered)
+
+            mae = _frequency_mae(qm_scaled, nist_arr)
+
+            # Print comparison table
+            col1, col2, col3, col4, col5 = 17, 12, 12, 12, 10
+            t = _TablePrinter()
+            t.blank()
+            t.bar()
+            t.title("CH3F QM vs NIST EXPERIMENTAL (cm^-1)")
+            t.title(f"Scaling factor: {scale} (B3LYP/6-31+G(d))")
+            t.bar()
+            t.row(f"{'Mode':<{col1}} {'QM harm':>{col2}} {'QM scaled':>{col3}} {'NIST expt':>{col4}} {'Error':>{col5}}")
+            t.sep()
+            for i in range(len(nist_ordered)):
+                err = qm_scaled[i] - nist_arr[i]
+                t.row(
+                    f"{mode_names[i]:<{col1}} {qm_unique[i]:>{col2}.1f} {qm_scaled[i]:>{col3}.1f} {nist_arr[i]:>{col4}.0f} {err:>{col5}.1f}"
+                )
+            t.sep()
+            t.row(f"{'MAE':<{col1}} {'':>{col2}} {'':>{col3}} {'':>{col4}} {mae:>{col5}.1f}")
+            t.bar()
+            t.blank()
+            t.flush()
+
+            assert mae < 80.0, f"Scaled QM vs NIST MAE too high: {mae:.1f} cm^-1"


This test only asserts inside if len(qm_real) >= 9:; if the QM frequency file changes or is mis-parsed, the test will silently pass without validating anything. Prefer an explicit assertion (or a pytest.skip with a clear reason) when len(qm_real) is not the expected value.

Suggested change

if len(qm_real) >= 9:

qm_unique = np.array([qm_real[i] for i in qm_unique_indices])

qm_scaled = qm_unique * scale

nist_arr = np.array(nist_ordered)

mae = _frequency_mae(qm_scaled, nist_arr)

# Print comparison table

col1, col2, col3, col4, col5 = 17, 12, 12, 12, 10

t = _TablePrinter()

t.blank()

t.bar()

t.title("CH3F QM vs NIST EXPERIMENTAL (cm^-1)")

t.title(f"Scaling factor: {scale} (B3LYP/6-31+G(d))")

t.bar()

t.row(f"{'Mode':<{col1}} {'QM harm':>{col2}} {'QM scaled':>{col3}} {'NIST expt':>{col4}} {'Error':>{col5}}")

t.sep()

for i in range(len(nist_ordered)):

err = qm_scaled[i] - nist_arr[i]

t.row(

f"{mode_names[i]:<{col1}} {qm_unique[i]:>{col2}.1f} {qm_scaled[i]:>{col3}.1f} {nist_arr[i]:>{col4}.0f} {err:>{col5}.1f}"

)

t.sep()

t.row(f"{'MAE':<{col1}} {'':>{col2}} {'':>{col3}} {'':>{col4}} {mae:>{col5}.1f}")

t.bar()

t.blank()

t.flush()

assert mae < 80.0, f"Scaled QM vs NIST MAE too high: {mae:.1f} cm^-1"

assert len(qm_real) >= 9, f"Expected at least 9 real QM frequencies, got {len(qm_real)}"

qm_unique = np.array([qm_real[i] for i in qm_unique_indices])

qm_scaled = qm_unique * scale

nist_arr = np.array(nist_ordered)

mae = _frequency_mae(qm_scaled, nist_arr)

# Print comparison table

col1, col2, col3, col4, col5 = 17, 12, 12, 12, 10

t = _TablePrinter()

t.blank()

t.bar()

t.title("CH3F QM vs NIST EXPERIMENTAL (cm^-1)")

t.title(f"Scaling factor: {scale} (B3LYP/6-31+G(d))")

t.bar()

t.row(f"{'Mode':<{col1}} {'QM harm':>{col2}} {'QM scaled':>{col3}} {'NIST expt':>{col4}} {'Error':>{col5}}")

t.sep()

for i in range(len(nist_ordered)):

err = qm_scaled[i] - nist_arr[i]

t.row(

f"{mode_names[i]:<{col1}} {qm_unique[i]:>{col2}.1f} {qm_scaled[i]:>{col3}.1f} {nist_arr[i]:>{col4}.0f} {err:>{col5}.1f}"

)

t.sep()

t.row(f"{'MAE':<{col1}} {'':>{col2}} {'':>{col3}} {'':>{col4}} {mae:>{col5}.1f}")

t.bar()

t.blank()

t.flush()

assert mae < 80.0, f"Scaled QM vs NIST MAE too high: {mae:.1f} cm^-1"

Copilot · 2026-03-20T22:26:19Z

+    def test_improvement_progression_logged(
+        self, engine, ch3f_mol, default_ff, seminario_result, optimized_result, qm_freqs
+    ):
+        """Log the full progression for inspection (always passes, diagnostic)."""


Several tests appear to be purely diagnostic (they print tables but have no assertions), e.g. test_improvement_progression_logged. Tests that always pass reduce signal in the suite; consider moving this into a helper that only prints on failure, guarding behind an opt-in flag (env var/CLI option), or converting it to assertions that enforce the intended progression.

Copilot · 2026-03-20T22:26:20Z

+    def test_pes_distortion_small_displacement(self, distortion_results):
+        """At small displacements (0.05 Ang), most modes should roughly track QM.


PR description says this adds 17 slow tests, but this module currently defines 19 test_... functions. Please update the PR description/test count (or consolidate/mark some diagnostic ones as non-tests) so expectations match what will run under --run-slow.

Copilot · 2026-03-20T22:26:20Z

+        ha_to_kcal = 627.5094740631
+        reactant_e = qm_energies["f_minus"] + qm_energies["ch3f"]
+        ts_e = qm_energies["ts"]
+        barrier = (ts_e - reactant_e) * ha_to_kcal
+


Same unit-conversion constant is hard-coded here (ha_to_kcal = 627.5094740631). Prefer q2mm.constants.HARTREE_TO_KCALMOL for consistency across the codebase.

Copilot · 2026-03-20T22:26:20Z

+            disp_mol = Q2MMMolecule(
+                symbols=mol.symbols,
+                geometry=mol.geometry + delta_xyz,
+                atom_types=mol.atom_types,
+                charge=mol.charge,
+                multiplicity=mol.multiplicity,
+                bond_tolerance=mol.bond_tolerance,
+            )
+            e_mm = engine.energy(disp_mol, ff) - e_eq
+


_compute_distortions creates a new Q2MMMolecule for every displacement and then calls engine.energy(disp_mol, ff). In the OpenMM backend, energy() will create a fresh OpenMM Context unless you pass an OpenMMHandle, so this loop can end up constructing dozens of contexts per test run. Consider reusing a single handle/context and only updating positions for each displacement to keep the slow test runtime under control.

Copilot · 2026-03-20T22:26:20Z

+    return {
+        "eigenvalues": data["eigenvalues"],
+        "eigenvectors": data["eigenvectors"],
+        "masses_amu": data["masses_amu"],


The _load_normal_modes docstring says the returned dict includes symbols, but the function currently only returns eigenvalues/eigenvectors/masses_amu. Either return data['symbols'] as well (the .npz includes it) or update the docstring to match reality.

Suggested change

"masses_amu": data["masses_amu"],

"masses_amu": data["masses_amu"],

"symbols": data["symbols"],

Copilot · 2026-03-20T22:26:20Z

+    ha_to_kcal = 627.5094740631
+    bohr_to_m = BOHR_TO_ANG * 1e-10
+    sqrt_m = np.sqrt(np.repeat(masses_amu, 3))


ha_to_kcal = 627.5094740631 is duplicated here and elsewhere in this file. The codebase already defines q2mm.constants.HARTREE_TO_KCALMOL; using the shared constant avoids drift and keeps unit conversions consistent.

Copilot · 2026-03-20T22:26:21Z

+        nist = np.array(sorted(ext_ref["ch3f_nist_all_cm1"]))
+        scale = ext_ref["qm_level_of_theory"]["dft_scaling_factor"]
+
+        qm_real = sorted(_real_frequencies(qm_freqs))
+        # NIST has 6 unique fundamentals (e-symmetry modes are degenerate)
+        # Pick one representative from each degenerate pair
+        qm_unique_indices = [0, 1, 3, 4, 7, 8]
+        nist_ordered = [1049, 1182, 1464, 1467, 2930, 3006]
+        mode_names = [


nist = np.array(sorted(ext_ref['ch3f_nist_all_cm1'])) is computed but never used, and nist_ordered hard-codes the same values already present in the JSON fixture. To keep the test self-consistent, consider deriving the ordered NIST array directly from ext_ref['ch3f_nist_fundamentals_cm1'] / ch3f_nist_all_cm1 and dropping the unused local.

ericchansen · 2026-03-21T03:01:03Z

Consolidated into PR #103, which now targets master directly and includes all commits from this branch.

Copilot AI review requested due to automatic review settings March 20, 2026 21:27

Copilot started reviewing on behalf of ericchansen March 20, 2026 21:27 View session

Copilot AI reviewed Mar 20, 2026

View reviewed changes

ericchansen and others added 2 commits March 20, 2026 16:34

Copilot AI review requested due to automatic review settings March 20, 2026 21:40

Copilot started reviewing on behalf of ericchansen March 20, 2026 21:41 View session

Copilot AI reviewed Mar 20, 2026

View reviewed changes

ericchansen and others added 4 commits March 20, 2026 16:45

Copilot AI review requested due to automatic review settings March 20, 2026 22:20

Copilot started reviewing on behalf of ericchansen March 20, 2026 22:21 View session

Copilot AI reviewed Mar 20, 2026

View reviewed changes

ericchansen closed this Mar 21, 2026

ericchansen deleted the test/sn2-e2e-validation-101 branch April 15, 2026 22:18

	# Generic FF should be noticeably off -- RMSD > 100 cm^-1
	# Generic FF should be noticeably off; here we require RMSD > 50 cm^-1

		def test_pes_distortion_small_displacement(self, distortion_results):
		"""At small displacements (0.05 Ang), most modes should roughly track QM.

	"masses_amu": data["masses_amu"],
	"masses_amu": data["masses_amu"],
	"symbols": data["symbols"],

Conversation

ericchansen commented Mar 20, 2026

Summary

What's tested (17 slow tests)

CH3F Ground State (9 tests)

SN2 Transition State (5 tests)

Reaction Profile (3 tests)

Key design decisions

Files

Test output (diagnostic)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment