Skip to content

test(e2e): Add SN2 validation with NIST anchoring#102

Closed
ericchansen wants to merge 7 commits intomasterfrom
test/sn2-e2e-validation-101
Closed

test(e2e): Add SN2 validation with NIST anchoring#102
ericchansen wants to merge 7 commits intomasterfrom
test/sn2-e2e-validation-101

Conversation

@ericchansen
Copy link
Copy Markdown
Owner

Summary

SN2 end-to-end validation test with external reference anchoring (NIST experimental IR + literature barrier heights). Validates the full Q2MM pipeline produces physically meaningful force fields, not just internally consistent ones.

What's tested (17 slow tests)

CH3F Ground State (9 tests)

  • 3-stage frequency comparison: Default FF (RMSD=164 cm^-1) -> Seminario (157) -> Q2MM-optimized (44)
  • Validates improvement at each stage
  • QM harmonic frequencies cross-checked against NIST experimental fundamentals (MAE < 80 cm^-1 with 0.963 scaling factor)

SN2 Transition State (5 tests)

  • Seminario Method D produces negative force constant for reaction coordinate
  • Optimization converges on TS geometry
  • Real-mode frequencies checked against QM reference

Reaction Profile (3 tests)

  • QM barrier height matches our B3LYP/6-31+G(d) reference (-4.24 kcal/mol)
  • MM energies finite for both GS and TS with their respective Seminario FFs

Key design decisions

  • data_idx maps to engine's full 3N frequency array, not the QM-only array (this was a bug in the initial implementation -- near-zero translational/rotational modes shifted the indexing)
  • All tests marked @pytest.mark.slow -- only run with --run-slow
  • Uses OpenMM backend (no external binaries needed for CI)
  • ASCII-only in print statements and assert messages (Windows cp1252 compatibility)

Files

File Description
test/integration/test_e2e_sn2_validation.py 17 E2E tests across 3 test classes
test/fixtures/sn2_external_reference.json NIST frequencies, literature barriers, QM metadata

Test output (diagnostic)

CH3F Ground State -- Frequency Comparison (cm^-1)
     QM (B3LYP/6-31+G(d)): [1040.6, 1191.7, 1191.8, 1495.3, 1512.3, 1512.4, 3058.0, 3145.1, 3145.1]
               Default FF:  [889.0,  889.0,  933.5,  1382.3, 1404.5, 1404.5, 2943.4, 3067.2, 3067.2]  RMSD=164.2
             Seminario FF:  [851.1,  1141.6, 1141.6, 1658.1, 1733.3, 1733.3, 2896.7, 3021.8, 3021.8]  RMSD=156.9
           Q2MM Optimized:  [1005.1, 1150.8, 1150.8, 1490.9, 1491.0, 1618.2, 3023.6, 3150.7, 3150.7]  RMSD=44.2

Closes #101

17 slow tests validating the full Q2MM pipeline on F- + CH3F:

CH3F ground state (9 tests):
- Default FF -> Seminario -> Q2MM-optimized frequency progression
- Optimized RMSD=44.2 cm^-1 vs QM (from 164.2 default)
- QM harmonic frequencies validated against NIST experimental IR

SN2 transition state (5 tests):
- Seminario Method D produces negative FC for reaction coordinate
- Optimization converges, real-mode frequencies checked vs QM

Reaction profile (3 tests):
- QM barrier height matches our B3LYP reference (-4.24 kcal/mol)
- MM energies finite for both GS and TS geometries

External reference fixture includes NIST CH3F fundamentals,
literature barrier heights (Czako 2015, Shaik 1992), and
DFT scaling factor for harmonic-to-fundamental comparison.

Closes #101

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 20, 2026 21:27
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a slow, OpenMM-backed end-to-end integration validation for the SN2 (F- + CH3F) system, anchored to external experimental/literature reference data to ensure Q2MM optimizations remain physically meaningful.

Changes:

  • Introduces 3 slow pytest classes covering CH3F ground-state frequencies, SN2 TS behavior, and basic reaction-profile energetics checks.
  • Adds a JSON fixture containing NIST CH3F fundamentals, literature/QM barrier references, and QM metadata (level of theory + scaling factor).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
test/integration/test_e2e_sn2_validation.py New slow E2E test suite exercising Seminario → optimization → validation against QM + external references.
test/fixtures/sn2_external_reference.json External reference dataset (NIST + literature/QM anchors) consumed by the new E2E tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +263 to +270
if len(qm_real) >= 9:
qm_unique = np.array([qm_real[i] for i in qm_unique_indices])
qm_scaled = qm_unique * scale
nist_arr = np.array(nist_ordered)

mae = _frequency_mae(qm_scaled, nist_arr)
# Scaled B3LYP/6-31+G(d) should be within ~50 cm^-1 of experiment
assert mae < 80.0, f"Scaled QM vs NIST MAE too high: {mae:.1f} cm^-1"
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_qm_frequencies_vs_nist can silently do nothing if len(qm_real) < 9 due to the conditional, which would let missing/incorrect QM reference data pass unnoticed. It would be safer to assert the expected number of CH3F real modes (or explicitly pytest.skip/xfail with a reason) so the test always validates the NIST comparison when inputs are incomplete.

Copilot uses AI. Check for mistakes.
Comment on lines +353 to +360
def test_seminario_ts_frequencies(self, engine, ts_mol, seminario_ff):
"""Seminario TS FF should produce at least one imaginary frequency."""
freqs = engine.frequencies(ts_mol, seminario_ff)
imag = _imaginary_frequencies(freqs)
# May not produce imaginary mode since OpenMM bonded terms are harmonic
# and negative FC gives a repulsive parabola. Log either way.
print(f"\nSeminario TS frequencies: {[f'{f:.1f}' for f in sorted(freqs) if abs(f) > 10]}")
print(f"Imaginary modes: {len(imag)}")
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test doesn't assert anything (it only prints), but its name/docstring suggest it's verifying imaginary frequencies. Either rename it to something like *_logged/diagnostic (consistent with the other logging-only tests) or add an assertion/xfail so it provides a real validation signal.

Copilot uses AI. Check for mistakes.
Comment on lines +178 to +180
rmsd = _frequency_rmsd(sorted(mm_freqs)[-n:], sorted(qm_real)[-n:])
# Generic FF should be noticeably off -- RMSD > 100 cm^-1
assert rmsd > 50.0, f"Default FF unexpectedly good (RMSD={rmsd:.1f} cm^-1)"
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says the generic FF baseline should have RMSD > 100 cm^-1, but the assertion only requires > 50.0. This makes the baseline check much weaker than documented and could let regressions slip through; either tighten the threshold or update the comment to match the intended expectation.

Copilot uses AI. Check for mistakes.
Comment on lines +236 to +239
n = min(len(opt_real), len(qm_real))
rmsd = _frequency_rmsd(opt_real[-n:], qm_real[-n:])
# Optimized FF should reproduce QM frequencies reasonably well
assert rmsd < 200.0, f"Optimized FF frequency RMSD too high: {rmsd:.1f} cm^-1"
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The acceptance criteria in the PR description/reported output targets ~<50 cm^-1 RMSD for the optimized CH3F frequencies, but this test allows up to 200 cm^-1. Consider tightening this threshold (or aligning the PR description) so the test actually enforces the intended physical accuracy.

Copilot uses AI. Check for mistakes.
ericchansen and others added 2 commits March 20, 2026 16:34
Replace Python list dumps with aligned, labeled tables for:
- CH3F QM vs NIST experimental frequencies (with mode names)
- Ground state frequency progression (mode-by-mode comparison)
- TS frequency progression with imaginary mode count
- Reaction profile barrier heights with literature comparison

Much easier for new users to interpret test output.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Shows Seminario estimation, optimization, and frequency evaluation
times alongside the frequency comparison tables. Proves that
optimizations run live (not cached) and quantifies performance:

- Seminario estimation: ~1-2 ms
- L-BFGS-B optimization: ~4-8 sec (750-1300 evaluations)
- Single frequency eval: ~97 ms (OpenMM backend)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 20, 2026 21:40
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +62 to +78
lines = path.read_text().strip().splitlines()
return np.array([float(line) for line in lines if not line.startswith("#")])


def _load_qm_energy(path: Path) -> float:
"""Load a single QM energy in Hartrees from text file."""
lines = path.read_text().strip().splitlines()
for line in lines:
line = line.strip()
if line and not line.startswith("#"):
return float(line)
raise ValueError(f"No energy value found in {path}")


def _load_external_reference() -> dict:
"""Load NIST + literature reference data."""
return json.loads(EXT_REF.read_text())
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Path.read_text() is called without an explicit encoding in multiple helpers. On Windows this can default to a locale encoding (e.g., cp1252) and fail if the reference files/fixtures are UTF-8. Use read_text(encoding="utf-8") consistently for QM reference text files and the JSON fixture to make the test deterministic across platforms.

Copilot uses AI. Check for mistakes.
Comment on lines +264 to +285
nist = np.array(sorted(ext_ref["ch3f_nist_all_cm1"]))
scale = ext_ref["qm_level_of_theory"]["dft_scaling_factor"]

qm_real = sorted(_real_frequencies(qm_freqs))
# NIST has 6 unique fundamentals (e-symmetry modes are degenerate)
# Pick one representative from each degenerate pair
qm_unique_indices = [0, 1, 3, 4, 7, 8]
nist_ordered = [1049, 1182, 1464, 1467, 2930, 3006]
mode_names = [
"v3 (CF str)",
"v6 (CH3 rock)",
"v2 (CH3 s-def)",
"v5 (CH3 a-def)",
"v1 (CH3 s-str)",
"v4 (CH3 a-str)",
]

if len(qm_real) >= 9:
qm_unique = np.array([qm_real[i] for i in qm_unique_indices])
qm_scaled = qm_unique * scale
nist_arr = np.array(nist_ordered)

Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test defines nist = np.array(sorted(ext_ref["ch3f_nist_all_cm1"])) but never uses it, and instead hard-codes nist_ordered. That makes the fixture data effectively ignored and can drift out of sync. Remove the unused variable and derive the ordered NIST array from ext_ref (or document why hard-coding is required).

Copilot uses AI. Check for mistakes.
Comment on lines +281 to +309
if len(qm_real) >= 9:
qm_unique = np.array([qm_real[i] for i in qm_unique_indices])
qm_scaled = qm_unique * scale
nist_arr = np.array(nist_ordered)

mae = _frequency_mae(qm_scaled, nist_arr)

# Print comparison table
col1, col2, col3, col4, col5 = 17, 12, 12, 12, 10
print("\n")
print("=" * 70)
print(" CH3F QM vs NIST EXPERIMENTAL (cm^-1)")
print(f" Scaling factor: {scale} (B3LYP/6-31+G(d))")
print("=" * 70)
print(
f" {'Mode':<{col1}} {'QM harm':>{col2}} {'QM scaled':>{col3}} {'NIST expt':>{col4}} {'Error':>{col5}}"
)
print(" " + "-" * (col1 + col2 + col3 + col4 + col5))
for i in range(len(nist_ordered)):
err = qm_scaled[i] - nist_arr[i]
print(
f" {mode_names[i]:<{col1}} {qm_unique[i]:>{col2}.1f} {qm_scaled[i]:>{col3}.1f} {nist_arr[i]:>{col4}.0f} {err:>{col5}.1f}"
)
print(" " + "-" * (col1 + col2 + col3 + col4 + col5))
print(f" {'MAE':<{col1}} {'':>{col2}} {'':>{col3}} {'':>{col4}} {mae:>{col5}.1f}")
print("=" * 70)
print()

assert mae < 80.0, f"Scaled QM vs NIST MAE too high: {mae:.1f} cm^-1"
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_qm_frequencies_vs_nist silently does nothing (and therefore passes) if len(qm_real) < 9. Since this is validating a pinned QM reference file, it’s better to fail fast (assert the expected number of modes is present) or explicitly pytest.skip with a message so unexpected truncation doesn’t get missed.

Suggested change
if len(qm_real) >= 9:
qm_unique = np.array([qm_real[i] for i in qm_unique_indices])
qm_scaled = qm_unique * scale
nist_arr = np.array(nist_ordered)
mae = _frequency_mae(qm_scaled, nist_arr)
# Print comparison table
col1, col2, col3, col4, col5 = 17, 12, 12, 12, 10
print("\n")
print("=" * 70)
print(" CH3F QM vs NIST EXPERIMENTAL (cm^-1)")
print(f" Scaling factor: {scale} (B3LYP/6-31+G(d))")
print("=" * 70)
print(
f" {'Mode':<{col1}} {'QM harm':>{col2}} {'QM scaled':>{col3}} {'NIST expt':>{col4}} {'Error':>{col5}}"
)
print(" " + "-" * (col1 + col2 + col3 + col4 + col5))
for i in range(len(nist_ordered)):
err = qm_scaled[i] - nist_arr[i]
print(
f" {mode_names[i]:<{col1}} {qm_unique[i]:>{col2}.1f} {qm_scaled[i]:>{col3}.1f} {nist_arr[i]:>{col4}.0f} {err:>{col5}.1f}"
)
print(" " + "-" * (col1 + col2 + col3 + col4 + col5))
print(f" {'MAE':<{col1}} {'':>{col2}} {'':>{col3}} {'':>{col4}} {mae:>{col5}.1f}")
print("=" * 70)
print()
assert mae < 80.0, f"Scaled QM vs NIST MAE too high: {mae:.1f} cm^-1"
# Fail fast if the QM reference data is unexpectedly truncated.
# We rely on specific mode indices in `qm_unique_indices`, so ensure
# that all of them are valid for the current QM data.
max_required_index = max(qm_unique_indices)
assert len(qm_real) > max_required_index, (
f"QM reference frequencies truncated: expected at least "
f"{max_required_index + 1} real modes, got {len(qm_real)}"
)
qm_unique = np.array([qm_real[i] for i in qm_unique_indices])
qm_scaled = qm_unique * scale
nist_arr = np.array(nist_ordered)
mae = _frequency_mae(qm_scaled, nist_arr)
# Print comparison table
col1, col2, col3, col4, col5 = 17, 12, 12, 12, 10
print("\n")
print("=" * 70)
print(" CH3F QM vs NIST EXPERIMENTAL (cm^-1)")
print(f" Scaling factor: {scale} (B3LYP/6-31+G(d))")
print("=" * 70)
print(
f" {'Mode':<{col1}} {'QM harm':>{col2}} {'QM scaled':>{col3}} {'NIST expt':>{col4}} {'Error':>{col5}}"
)
print(" " + "-" * (col1 + col2 + col3 + col4 + col5))
for i in range(len(nist_ordered)):
err = qm_scaled[i] - nist_arr[i]
print(
f" {mode_names[i]:<{col1}} {qm_unique[i]:>{col2}.1f} {qm_scaled[i]:>{col3}.1f} {nist_arr[i]:>{col4}.0f} {err:>{col5}.1f}"
)
print(" " + "-" * (col1 + col2 + col3 + col4 + col5))
print(f" {'MAE':<{col1}} {'':>{col2}} {'':>{col3}} {'':>{col4}} {mae:>{col5}.1f}")
print("=" * 70)
print()
assert mae < 80.0, f"Scaled QM vs NIST MAE too high: {mae:.1f} cm^-1"

Copilot uses AI. Check for mistakes.
Comment on lines +369 to +370
for _, freqs_label, _, _, t_freq in stages[1:]:
pass
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There’s a no-op loop (for _, freqs_label, _, _, t_freq in stages[1:]: pass) left in the diagnostic logger. It adds noise and introduces unused variables; remove it or replace it with the intended logic.

Suggested change
for _, freqs_label, _, _, t_freq in stages[1:]:
pass

Copilot uses AI. Check for mistakes.
Comment on lines +446 to +453
def test_seminario_ts_frequencies(self, engine, ts_mol, seminario_ff):
"""Seminario TS FF should produce at least one imaginary frequency."""
freqs = engine.frequencies(ts_mol, seminario_ff)
imag = _imaginary_frequencies(freqs)
real = _real_frequencies(freqs)
# May not produce imaginary mode since OpenMM bonded terms are harmonic
# and negative FC gives a repulsive parabola. Log either way.
print(f"\n Seminario TS: {len(real)} real modes, {len(imag)} imaginary modes")
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_seminario_ts_frequencies’s docstring says the Seminario TS FF “should produce at least one imaginary frequency”, but the test never asserts that (and the inline comment says it may not happen). Please align the docstring with the actual behavior (diagnostic-only), or turn this into an explicit assert/xfail/skip based on the expected OpenMM behavior.

Copilot uses AI. Check for mistakes.
Comment on lines +467 to +469
if n > 0:
rmsd = _frequency_rmsd(mm_real[-n:], qm_real[-n:])
assert rmsd < 500.0, f"TS frequency RMSD too high: {rmsd:.1f} cm^-1"
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_ts_optimized_real_freqs_match_qm will silently pass if n == 0 (no real modes above threshold). That would hide cases where frequency calculation breaks or returns only near-zero modes. Consider asserting n > 0 (or a more specific expected count) and failing with a helpful message when no comparable real modes are found.

Suggested change
if n > 0:
rmsd = _frequency_rmsd(mm_real[-n:], qm_real[-n:])
assert rmsd < 500.0, f"TS frequency RMSD too high: {rmsd:.1f} cm^-1"
assert n > 0, (
"No comparable real vibrational modes found for TS frequency comparison; "
f"MM real modes: {len(mm_real)}, QM real modes: {len(qm_real)}"
)
rmsd = _frequency_rmsd(mm_real[-n:], qm_real[-n:])
assert rmsd < 500.0, f"TS frequency RMSD too high: {rmsd:.1f} cm^-1"

Copilot uses AI. Check for mistakes.
Comment on lines +4 to +6
"nist": "NIST Chemistry WebBook, https://webbook.nist.gov/cgi/cbook.cgi?ID=C593533",
"czako_2015": "Szabó et al., J. Chem. Phys. 142, 244301 (2015)",
"shaik_1992": "Shaik & Pross, J. Chem. Soc. Perkin Trans. 2, 1019 (1992)"
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In _sources, the key is named czako_2015 but the citation string is “Szabó et al., J. Chem. Phys. 142, 244301 (2015)”. This mismatch makes it harder to trace provenance later; consider renaming the key or adjusting the citation so the identifier matches the referenced source.

Copilot uses AI. Check for mistakes.
qm_real = _real_frequencies(qm_freqs)
n = min(len(mm_freqs), len(qm_real))
rmsd = _frequency_rmsd(sorted(mm_freqs)[-n:], sorted(qm_real)[-n:])
# Generic FF should be noticeably off -- RMSD > 100 cm^-1
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says the generic FF baseline should have RMSD > 100 cm^-1, but the assertion only checks rmsd > 50.0. Either update the threshold to match the stated intent, or adjust the comment so the test expectation is unambiguous.

Suggested change
# Generic FF should be noticeably off -- RMSD > 100 cm^-1
# Generic FF should be noticeably off; here we require RMSD > 50 cm^-1

Copilot uses AI. Check for mistakes.
Comment on lines +380 to +384
"""Validate the pipeline on the SN2 F⁻ + CH₃F transition state.

Transition states require special handling (Seminario Method D/E)
and must preserve the imaginary mode through optimization.
"""
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TS class/docstring (and PR description) says the imaginary mode must be preserved through optimization, but there isn’t currently an assertion that the optimized TS has an imaginary mode (or exactly one). Consider asserting on _imaginary_frequencies(engine.frequencies(ts_mol, optimized_ff)) (count and/or magnitude vs QM), or update the docstring/test plan to match what’s actually verified.

Copilot uses AI. Check for mistakes.
Comment on lines +548 to +554
"""Validate energetics across the reaction coordinate.

Compares MM barrier heights to QM values at:
- Separated reactants (F⁻ + CH₃F)
- Ion-dipole complex
- Transition state
"""
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reaction-profile class/docstring says it compares barriers involving the ion–dipole complex, but the tests never load/use COMPLEX_XYZ or compute TS–complex barriers (only TS–separated-reactants). Either add the complex energy/barrier assertions, or update the docstring/PR description to reflect what’s actually tested.

Copilot uses AI. Check for mistakes.
ericchansen and others added 4 commits March 20, 2026 16:45
The previous table showed MM CH3F and TS energies as both 0.0000,
which was meaningless: Seminario FFs set equilibrium from QM geometry
so energy at that geometry is zero by construction. Worse, comparing
energies across molecule-specific FFs with different connectivity is
apples-to-oranges (arbitrary energy zeros).

Now shows the full QM reaction profile instead:
- Complexation energy (F- + CH3F -> complex): -13.71 kcal/mol
- TS - reactants barrier: -4.24 vs -0.45 (CCSD(T)-F12)
- TS - complex barrier: 9.47 vs 11.3 (VB benchmark)

Explains clearly why MM barriers aren't computable with separate FFs
and notes that reactive potentials (EVB, ReaxFF) would be needed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Displace CH3F along each QM normal mode eigenvector at 0.05, 0.10,
and 0.15 Ang and compare MM energy to QM harmonic prediction
(0.5 * eigenvalue * q^2).  This validates that the optimized FF
reproduces the PES shape, not just curvature at the minimum.

Pre-compute and save normal mode eigenvectors/eigenvalues as .npz
files so the eigendecomposition (< 0.2 ms) doesn't run every test.
Tables show both Seminario and Q2MM-optimized FF with per-mode
percentage errors and wall clock timing (~3 s for 27 energy evals).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add _TablePrinter helper that collects all lines, measures the
widest one, then sizes = and - bars to match.  Refactor all 5
diagnostic tables to use it, eliminating hardcoded widths that
didn't align with content.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Format error percentages as complete strings with fixed-width
right-justification so columns stay aligned regardless of value
magnitude. Center d=X.XX labels within their column group using
computed group width instead of hardcoded spaces.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 20, 2026 22:20
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 4 changed files in this pull request and generated 10 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


from __future__ import annotations

import copy
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import copy appears unused in this module. Even though unused-import linting is relaxed in this repo, removing unused imports helps keep the test file easier to scan.

Suggested change
import copy

Copilot uses AI. Check for mistakes.
Comment on lines +54 to +57
TS_ENERGY = QM_REF / "sn2-ts-energy.txt"

COMPLEX_XYZ = QM_REF / "complex-optimized.xyz"
EXT_REF = FIXTURES / "sn2_external_reference.json"
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

COMPLEX_XYZ is defined but never used in this module. Consider removing it to reduce dead code, or add a check that actually uses the complex structure (e.g., an MM energy/frequency sanity check at the complex stationary point).

Copilot uses AI. Check for mistakes.
Comment on lines +416 to +444
if len(qm_real) >= 9:
qm_unique = np.array([qm_real[i] for i in qm_unique_indices])
qm_scaled = qm_unique * scale
nist_arr = np.array(nist_ordered)

mae = _frequency_mae(qm_scaled, nist_arr)

# Print comparison table
col1, col2, col3, col4, col5 = 17, 12, 12, 12, 10
t = _TablePrinter()
t.blank()
t.bar()
t.title("CH3F QM vs NIST EXPERIMENTAL (cm^-1)")
t.title(f"Scaling factor: {scale} (B3LYP/6-31+G(d))")
t.bar()
t.row(f"{'Mode':<{col1}} {'QM harm':>{col2}} {'QM scaled':>{col3}} {'NIST expt':>{col4}} {'Error':>{col5}}")
t.sep()
for i in range(len(nist_ordered)):
err = qm_scaled[i] - nist_arr[i]
t.row(
f"{mode_names[i]:<{col1}} {qm_unique[i]:>{col2}.1f} {qm_scaled[i]:>{col3}.1f} {nist_arr[i]:>{col4}.0f} {err:>{col5}.1f}"
)
t.sep()
t.row(f"{'MAE':<{col1}} {'':>{col2}} {'':>{col3}} {'':>{col4}} {mae:>{col5}.1f}")
t.bar()
t.blank()
t.flush()

assert mae < 80.0, f"Scaled QM vs NIST MAE too high: {mae:.1f} cm^-1"
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test only asserts inside if len(qm_real) >= 9:; if the QM frequency file changes or is mis-parsed, the test will silently pass without validating anything. Prefer an explicit assertion (or a pytest.skip with a clear reason) when len(qm_real) is not the expected value.

Suggested change
if len(qm_real) >= 9:
qm_unique = np.array([qm_real[i] for i in qm_unique_indices])
qm_scaled = qm_unique * scale
nist_arr = np.array(nist_ordered)
mae = _frequency_mae(qm_scaled, nist_arr)
# Print comparison table
col1, col2, col3, col4, col5 = 17, 12, 12, 12, 10
t = _TablePrinter()
t.blank()
t.bar()
t.title("CH3F QM vs NIST EXPERIMENTAL (cm^-1)")
t.title(f"Scaling factor: {scale} (B3LYP/6-31+G(d))")
t.bar()
t.row(f"{'Mode':<{col1}} {'QM harm':>{col2}} {'QM scaled':>{col3}} {'NIST expt':>{col4}} {'Error':>{col5}}")
t.sep()
for i in range(len(nist_ordered)):
err = qm_scaled[i] - nist_arr[i]
t.row(
f"{mode_names[i]:<{col1}} {qm_unique[i]:>{col2}.1f} {qm_scaled[i]:>{col3}.1f} {nist_arr[i]:>{col4}.0f} {err:>{col5}.1f}"
)
t.sep()
t.row(f"{'MAE':<{col1}} {'':>{col2}} {'':>{col3}} {'':>{col4}} {mae:>{col5}.1f}")
t.bar()
t.blank()
t.flush()
assert mae < 80.0, f"Scaled QM vs NIST MAE too high: {mae:.1f} cm^-1"
assert len(qm_real) >= 9, f"Expected at least 9 real QM frequencies, got {len(qm_real)}"
qm_unique = np.array([qm_real[i] for i in qm_unique_indices])
qm_scaled = qm_unique * scale
nist_arr = np.array(nist_ordered)
mae = _frequency_mae(qm_scaled, nist_arr)
# Print comparison table
col1, col2, col3, col4, col5 = 17, 12, 12, 12, 10
t = _TablePrinter()
t.blank()
t.bar()
t.title("CH3F QM vs NIST EXPERIMENTAL (cm^-1)")
t.title(f"Scaling factor: {scale} (B3LYP/6-31+G(d))")
t.bar()
t.row(f"{'Mode':<{col1}} {'QM harm':>{col2}} {'QM scaled':>{col3}} {'NIST expt':>{col4}} {'Error':>{col5}}")
t.sep()
for i in range(len(nist_ordered)):
err = qm_scaled[i] - nist_arr[i]
t.row(
f"{mode_names[i]:<{col1}} {qm_unique[i]:>{col2}.1f} {qm_scaled[i]:>{col3}.1f} {nist_arr[i]:>{col4}.0f} {err:>{col5}.1f}"
)
t.sep()
t.row(f"{'MAE':<{col1}} {'':>{col2}} {'':>{col3}} {'':>{col4}} {mae:>{col5}.1f}")
t.bar()
t.blank()
t.flush()
assert mae < 80.0, f"Scaled QM vs NIST MAE too high: {mae:.1f} cm^-1"

Copilot uses AI. Check for mistakes.
Comment on lines +446 to +449
def test_improvement_progression_logged(
self, engine, ch3f_mol, default_ff, seminario_result, optimized_result, qm_freqs
):
"""Log the full progression for inspection (always passes, diagnostic)."""
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several tests appear to be purely diagnostic (they print tables but have no assertions), e.g. test_improvement_progression_logged. Tests that always pass reduce signal in the suite; consider moving this into a helper that only prints on failure, guarding behind an opt-in flag (env var/CLI option), or converting it to assertions that enforce the intended progression.

Copilot uses AI. Check for mistakes.
Comment on lines +525 to +526
def test_pes_distortion_small_displacement(self, distortion_results):
"""At small displacements (0.05 Ang), most modes should roughly track QM.
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description says this adds 17 slow tests, but this module currently defines 19 test_... functions. Please update the PR description/test count (or consolidate/mark some diagnostic ones as non-tests) so expectations match what will run under --run-slow.

Copilot uses AI. Check for mistakes.
Comment on lines +829 to +833
ha_to_kcal = 627.5094740631
reactant_e = qm_energies["f_minus"] + qm_energies["ch3f"]
ts_e = qm_energies["ts"]
barrier = (ts_e - reactant_e) * ha_to_kcal

Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same unit-conversion constant is hard-coded here (ha_to_kcal = 627.5094740631). Prefer q2mm.constants.HARTREE_TO_KCALMOL for consistency across the codebase.

Copilot uses AI. Check for mistakes.
Comment on lines +221 to +230
disp_mol = Q2MMMolecule(
symbols=mol.symbols,
geometry=mol.geometry + delta_xyz,
atom_types=mol.atom_types,
charge=mol.charge,
multiplicity=mol.multiplicity,
bond_tolerance=mol.bond_tolerance,
)
e_mm = engine.energy(disp_mol, ff) - e_eq

Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_compute_distortions creates a new Q2MMMolecule for every displacement and then calls engine.energy(disp_mol, ff). In the OpenMM backend, energy() will create a fresh OpenMM Context unless you pass an OpenMMHandle, so this loop can end up constructing dozens of contexts per test run. Consider reusing a single handle/context and only updating positions for each displacement to keep the slow test runtime under control.

Copilot uses AI. Check for mistakes.
return {
"eigenvalues": data["eigenvalues"],
"eigenvectors": data["eigenvectors"],
"masses_amu": data["masses_amu"],
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _load_normal_modes docstring says the returned dict includes symbols, but the function currently only returns eigenvalues/eigenvectors/masses_amu. Either return data['symbols'] as well (the .npz includes it) or update the docstring to match reality.

Suggested change
"masses_amu": data["masses_amu"],
"masses_amu": data["masses_amu"],
"symbols": data["symbols"],

Copilot uses AI. Check for mistakes.
Comment on lines +185 to +187
ha_to_kcal = 627.5094740631
bohr_to_m = BOHR_TO_ANG * 1e-10
sqrt_m = np.sqrt(np.repeat(masses_amu, 3))
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha_to_kcal = 627.5094740631 is duplicated here and elsewhere in this file. The codebase already defines q2mm.constants.HARTREE_TO_KCALMOL; using the shared constant avoids drift and keeps unit conversions consistent.

Copilot uses AI. Check for mistakes.
Comment on lines +399 to +407
nist = np.array(sorted(ext_ref["ch3f_nist_all_cm1"]))
scale = ext_ref["qm_level_of_theory"]["dft_scaling_factor"]

qm_real = sorted(_real_frequencies(qm_freqs))
# NIST has 6 unique fundamentals (e-symmetry modes are degenerate)
# Pick one representative from each degenerate pair
qm_unique_indices = [0, 1, 3, 4, 7, 8]
nist_ordered = [1049, 1182, 1464, 1467, 2930, 3006]
mode_names = [
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nist = np.array(sorted(ext_ref['ch3f_nist_all_cm1'])) is computed but never used, and nist_ordered hard-codes the same values already present in the JSON fixture. To keep the test self-consistent, consider deriving the ordered NIST array directly from ext_ref['ch3f_nist_fundamentals_cm1'] / ch3f_nist_all_cm1 and dropping the unused local.

Copilot uses AI. Check for mistakes.
@ericchansen
Copy link
Copy Markdown
Owner Author

Consolidated into PR #103, which now targets master directly and includes all commits from this branch.

@ericchansen ericchansen deleted the test/sn2-e2e-validation-101 branch April 15, 2026 22:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

test: SN2 end-to-end validation with experimental and literature anchoring

2 participants