fix: Java test instrumentation and context improvements for E2E optimization #1530

Merged
mashraf-222 merged 4 commits into omni-java from fix/java-testgen-context-enhancements
Feb 18, 2026

mashraf-222 commented Feb 18, 2026

Problems fixed

Three categories of bugs that caused Java E2E optimization to fail, plus two systemic issues discovered during an aerospike-client-java --all run (55 functions, 80% compilation failure rate):

1. Timing instrumentation produces invalid Java when tests contain multi-byte UTF-8 characters

During E2E testing of Buffer.stringToUtf8 with hardcoded tests containing Unicode strings ("éñ", "世界"), the instrumented test files had corrupted statements like t len = Buffer.stringToUtf8(...) instead of int len = .... The int keyword was split across lines, producing "not a statement" and "';' expected" compilation errors in every test method with non-ASCII string literals.

Root cause: _add_timing_instrumentation() in instrumentation.py uses tree-sitter, which returns byte offsets. These byte offsets were used directly to slice body_text, which is a Python str (Unicode). When multi-byte UTF-8 characters appear before the target statement, the byte offset is larger than the character offset, so the slice starts too late and cuts into the statement. For example, "éñ" is 4 bytes in UTF-8 but 2 characters in a Python str, so every byte offset after it is 2 larger than the corresponding character offset.

Evidence: Reproduced with a minimal test script:

body = '        String input = "éñ";\n        int len = Buffer.stringToUtf8(input, buf, 0);'
body_bytes = body.encode('utf8')
# byte offset of "int len": 39
# char offset of "int len": 37
# body[39:] → "t len = ..."  (wrong: sliced into the middle of "int")
# body[37:] → "int len = ..." (correct)

Fix: Convert tree-sitter byte offsets to character offsets before slicing: stmt_start = len(body_bytes[:stmt_byte_start].decode("utf8")).
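
The conversion can be tried in isolation. The following is a minimal sketch, not the code in instrumentation.py: the helper name byte_to_char_offset is hypothetical, and the byte offset is found with bytes.index instead of coming from tree-sitter.

```python
def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Map a UTF-8 byte offset (the unit tree-sitter reports) to a str index."""
    # Decode only the prefix; its length in characters is the str index.
    return len(text.encode("utf8")[:byte_offset].decode("utf8"))


body = '        String input = "éñ";\n        int len = Buffer.stringToUtf8(input, buf, 0);'

# Stand-in for a tree-sitter node's start byte (assumption: found by search here).
byte_start = body.encode("utf8").index(b"int len")
char_start = byte_to_char_offset(body, byte_start)

print(repr(body[byte_start:][:9]))  # sliced with the raw byte offset
print(repr(body[char_start:][:9]))  # sliced with the converted offset
```

This relies on the byte offset falling on a character boundary, which holds for node boundaries; an offset in the middle of a multi-byte character would make decode() raise UnicodeDecodeError.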

2. Variable scoping error when target call is inside a variable declaration

After fixing the byte-offset bug, instrumented tests failed with "variable len might not have been initialized". The timing instrumentation wraps the target statement (int len = func()) inside a for { try { ... } } block, which moves the declaration of len into the scope of the try block. Subsequent code that references len (e.g., for (int i = 0; i < len; i++)) can no longer see it.

Fix: Added split_var_declaration() that detects local_variable_declaration AST nodes, hoists the declaration (int len = 0;) before the timing block, and converts the wrapped statement to just an assignment (len = func();). Uses default values (0, 0L, null, etc.) to satisfy Java's definite assignment rules.
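
The hoisting step can be illustrated with a simplified sketch. The PR's split_var_declaration() works on tree-sitter AST nodes; the version below is a regex-based stand-in with hypothetical names (hoist_declaration, JAVA_DEFAULTS) that only handles the plain shape Type name = initializer; with no modifiers and a single declarator.

```python
import re

# Assumed default initializers for Java primitives; other types fall back to null.
JAVA_DEFAULTS = {
    "int": "0", "long": "0L", "short": "0", "byte": "0",
    "float": "0.0f", "double": "0.0", "boolean": "false", "char": "'\\0'",
}

# Matches only `Type name = initializer;` (no modifiers, one declarator).
DECLARATION = re.compile(
    r"^\s*(?P<type>[\w.<>\[\], ?]+?)\s+(?P<name>\w+)\s*=\s*(?P<init>.+);\s*$",
    re.DOTALL,
)


def hoist_declaration(statement: str):
    """Return (hoisted_declaration, assignment), or None if not a simple declaration."""
    match = DECLARATION.match(statement)
    if match is None:
        return None
    java_type, name, init = match["type"], match["name"], match["init"]
    default = JAVA_DEFAULTS.get(java_type, "null")
    # The declaration goes before the timing block; the assignment stays inside it.
    return f"{java_type} {name} = {default};", f"{name} = {init};"


print(hoist_declaration("int len = Buffer.stringToUtf8(input, buf, 0);"))
```

The first element is emitted before the for/try wrapper and the second replaces the original statement inside it, so later references to the variable remain in scope.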

3. AI-generated tests had insufficient type context, causing undeclared variable and missing import errors

During the aerospike-client-java --all run, 19 of 55 functions (35%) failed because the AI generated tests referencing undeclared variables (policy, copy, result, configProvider) and missing class imports (ClientPolicy, Builder). The type skeleton system provided insufficient context: token budget was too low (2000 tokens), wildcard imports were silently skipped, type skeletons lacked constructor summary headers, and types referenced in the target method weren't prioritized.

Root cause analysis from the aerospike run:

  • variable policy — 44 errors: AI referenced policy objects without creating them
  • class ClientPolicy — 80 errors: AI used ClientPolicy without importing com.aerospike.client.policy.ClientPolicy
  • class Builder — 24 errors: AI used DynamicWriteConfig.Builder without proper import
  • variable copy, copy1, copy2 — 38 errors: copy operations never assigned
  • Total: ~148 undeclared variable errors + ~104 missing class import errors across 19 functions

Fixes:

  • Doubled IMPORTED_SKELETON_TOKEN_BUDGET from 2000 to 4000 tokens, giving the AI more complete type information
  • Added _extract_type_names_from_code() to parse the target method's AST and collect all referenced type names
  • Prioritized skeletons for types the target method actually uses (sorted by priority: referenced types first, then others)
  • Added expand_wildcard_import() to import_resolver.py — wildcard imports like com.aerospike.client.policy.* are now expanded to individual class files, so all types in a package are available for skeleton extraction
  • Added _extract_constructor_summaries() that generates one-line // Constructors: ClassName(Type1 param1, Type2 param2) headers at the top of each skeleton, making constructor signatures unambiguous for the AI
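
The prioritization under a token budget can be sketched as follows. This is an illustration only: select_skeletons and estimate_tokens are hypothetical names, and the 4-characters-per-token estimate is an assumption, not how context.py counts tokens.

```python
from __future__ import annotations

IMPORTED_SKELETON_TOKEN_BUDGET = 4000  # value stated in this PR


def estimate_tokens(text: str) -> int:
    # Assumed heuristic: roughly 4 characters per token.
    return max(1, len(text) // 4)


def select_skeletons(
    skeletons: dict[str, str],
    referenced: set[str],
    budget: int = IMPORTED_SKELETON_TOKEN_BUDGET,
) -> list[str]:
    """Pick skeleton names within the budget, referenced types first."""
    # sorted() is stable, so the original order is kept within each group.
    ordered = sorted(skeletons, key=lambda name: name not in referenced)
    chosen, used = [], 0
    for name in ordered:
        cost = estimate_tokens(skeletons[name])
        if used + cost > budget:
            continue  # skip what does not fit; a smaller skeleton may still fit
        chosen.append(name)
        used += cost
    return chosen


skeletons = {"Util": "x" * 40, "ClientPolicy": "y" * 40, "Key": "z" * 40}
print(select_skeletons(skeletons, referenced={"ClientPolicy"}, budget=20))
```

With a budget that fits only two of the three skeletons, the type the target method references is kept even though it was not first in the input.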

4. Existing test instrumentation silently overwrites generated test files (path collision)

During the aerospike run, generated tests and existing tests both used the __perfinstrumented suffix, causing file path collisions. When a function had both generated and existing tests, the existing test instrumentation at line ~1940 in function_optimizer.py overwrote the generated test file, silently destroying generated test content.

Evidence from Fibonacci validation:

Line 304: Wrote behavioral test to .../FibonacciTest__perfinstrumented.java      (generated test 1)
Line 556: Wrote behavioral test to .../FibonacciTest__perfinstrumented_2.java    (generated test 2)
Line 807: Wrote instrumented test to .../FibonacciTest__perfinstrumented.java    (existing test OVERWRITES!)

Fix: Existing tests now use distinct suffixes: __existing_perfinstrumented / __existing_perfonlyinstrumented. Added class name replacement in the generated Java source to keep the file name and class name in sync (Java requirement). Updated the leftover file cleanup regex in optimizer.py to match the new __existing_ prefix variant.
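
A rough sketch of the naming scheme and the class-name fixup, assuming the suffix is appended to the file stem. The function names instrumented_stem and rename_class are hypothetical and do not come from function_optimizer.py.

```python
import re
from pathlib import Path


def instrumented_stem(test_file: Path, existing: bool) -> str:
    """Build the instrumented file stem; existing tests get their own suffix."""
    suffix = "__existing_perfinstrumented" if existing else "__perfinstrumented"
    return test_file.stem + suffix


def rename_class(source: str, old: str, new: str) -> str:
    """Replace whole-word occurrences of the class name so file and class match."""
    # \b keeps longer identifiers such as FibonacciTestHelper untouched.
    return re.sub(rf"\b{re.escape(old)}\b", lambda _match: new, source)


original = Path("FibonacciTest.java")
new_name = instrumented_stem(original, existing=True)
java_source = "public class FibonacciTest { FibonacciTest() {} FibonacciTestHelper h; }"
print(new_name)
print(rename_class(java_source, original.stem, new_name))
```

Because generated and existing tests now map to different stems, writing one can no longer overwrite the other.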

5. Leftover instrumented test files from previous runs cause cascading compilation failures

In multi-function --all runs, a broken instrumented test file from function N persists in the test directory and causes Maven compilation failures for function N+1 (since Maven compiles all test files together). This cascading effect can turn a single bad test into 100% failure for all subsequent functions.

Fix: Added a safety-net cleanup step at the start of each function's optimization cycle in optimizer.py. Before each function is optimized, find_leftover_instrumented_test_files() is called to detect and remove any stale *__perfinstrumented* and *__existing_perfinstrumented* files from the test root.
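
The cleanup idea can be sketched as below. This is not the implementation of find_leftover_instrumented_test_files(): the pattern LEFTOVER and the helpers find_leftovers and remove_leftovers are hypothetical, and the pattern is an assumption based on the file names quoted in this description.

```python
import re
import tempfile
from pathlib import Path

# Assumed pattern: optional __existing_ prefix, optional "only", optional _N counter.
LEFTOVER = re.compile(r".*__(existing_)?perf(only)?instrumented(_\d+)?\.java$")


def find_leftovers(test_root: Path) -> list:
    """List instrumented test files left behind under the test root."""
    return sorted(p for p in test_root.rglob("*.java") if LEFTOVER.match(p.name))


def remove_leftovers(test_root: Path) -> list:
    """Delete leftovers so they cannot break compilation for the next function."""
    removed = find_leftovers(test_root)
    for path in removed:
        path.unlink()
    return removed


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in (
        "FibonacciTest.java",
        "FibonacciTest__perfinstrumented.java",
        "FibonacciTest__perfinstrumented_2.java",
        "FibonacciTest__existing_perfonlyinstrumented.java",
    ):
        (root / name).write_text("// placeholder\n")
    removed = [p.name for p in remove_leftovers(root)]
    remaining = sorted(p.name for p in root.iterdir())

print(removed)
print(remaining)
```

Running this before each function's cycle means a broken file from function N is gone by the time Maven compiles the tests for function N+1.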

Code changes

| File | Change |
| --- | --- |
| instrumentation.py | Added byte-to-char offset conversion in build_instrumented_body(). Added split_var_declaration() helper for variable hoisting with default value initialization. Applied to both single-range and multi-range branches. |
| context.py | Doubled skeleton token budget to 4000. Added _extract_type_names_from_code() for type prioritization via tree-sitter AST. Added _extract_constructor_summaries() for unambiguous constructor headers. Added priority sorting so target-method types get skeletons first. |
| import_resolver.py | Added expand_wildcard_import(), which resolves com.example.* to individual .java files in the package directory, enabling skeleton extraction for all types in wildcard-imported packages. |
| function_optimizer.py | Changed existing test instrumentation to use __existing_perfinstrumented / __existing_perfonlyinstrumented suffixes. Added class name fixup for Java file/class name consistency. |
| optimizer.py | Added per-function leftover instrumented file cleanup before each optimization cycle. Updated cleanup regex for new __existing_ prefix. |
| test_context.py | Updated tests for constructor summary headers and wildcard import expansion (previously tested as "skipped", now tested as "expanded"). |
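
The wildcard expansion listed for import_resolver.py can be approximated as follows. This is a sketch under assumptions: expand_wildcard is a hypothetical stand-in for expand_wildcard_import(), it assumes sources are laid out as <source_root>/<package path>/<Type>.java, and the file names in the demo are placeholders.

```python
import tempfile
from pathlib import Path


def expand_wildcard(import_path: str, source_root: Path) -> list:
    """Expand `pkg.*` into fully qualified names of the .java files in that package."""
    if not import_path.endswith(".*"):
        return [import_path]  # not a wildcard, nothing to expand
    package = import_path[:-2]
    package_dir = source_root.joinpath(*package.split("."))
    if not package_dir.is_dir():
        return []  # package is not part of this source tree
    return sorted(
        f"{package}.{path.stem}"
        for path in package_dir.glob("*.java")
        if path.stem != "package-info"
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    package_dir = root / "com" / "aerospike" / "client" / "policy"
    package_dir.mkdir(parents=True)
    for name in ("ClientPolicy.java", "Policy.java", "package-info.java"):
        (package_dir / name).write_text("// placeholder\n")
    expanded = expand_wildcard("com.aerospike.client.policy.*", root)
    single = expand_wildcard("com.aerospike.client.Key", root)

print(expanded)
print(single)
```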

Testing

  • All 41 existing instrumentation unit tests pass after changes
  • Verified the byte-offset fix produces correct output with multi-byte chars (é, ñ, 世, 界)
  • Verified variable hoisting produces valid Java: int len = 0; before timing block, len = func(); inside try
  • Updated context tests verify constructor summaries appear in skeleton output and wildcard imports are expanded
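
For reference, the kind of header the context tests check for can be produced by a sketch like the one below. It is not _extract_constructor_summaries(): the function constructor_summary is hypothetical, it uses a regex rather than the AST, it would also match anonymous-class instantiations, and joining several signatures with ", " is an assumption about the output format.

```python
import re


def constructor_summary(class_name: str, java_source: str) -> str:
    """Build a one-line `// Constructors: ...` header from constructor signatures."""
    # A constructor looks like `ClassName(params) [throws ...] {`.
    pattern = re.compile(
        rf"\b{re.escape(class_name)}\s*\(([^)]*)\)\s*(?:throws\s+[\w.,\s]+?)?\s*\{{"
    )
    signatures = []
    for match in pattern.finditer(java_source):
        params = " ".join(match.group(1).split())  # normalize whitespace
        signatures.append(f"{class_name}({params})")
    return "// Constructors: " + ", ".join(signatures) if signatures else ""


java_source = """
public class ClientPolicy {
    public ClientPolicy() {}
    public ClientPolicy(ClientPolicy other) { this.user = other.user; }
}
"""
print(constructor_summary("ClientPolicy", java_source))
```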

Known remaining issues

  • The perfonlyinstrumented variant strips assertions via transform_java_assertions, but leaves behind empty for loops (for (int i = 0; i < len; i++) {}) that reference the hoisted variable. These compile correctly now but are dead code. A cleanup pass could remove them.
  • E2E validation of the full optimization pipeline (hardcoded tests → instrumentation → Maven test run → optimization) is pending and requires the companion codeflash-internal PR for the hardcoded test and prompt updates.

mashraf-222 and others added 4 commits February 18, 2026 16:55
- Increase imported type skeleton token budget from 2000 to 4000
- Add constructor signature summary headers to skeleton output
- Expand wildcard imports (e.g., import com.foo.*) into individual types
  instead of silently skipping them
- Prioritize skeleton processing for types referenced in the target method
  so parameter types are guaranteed context before less-critical types
- Fix invalid [no-arg] annotation in constructor summaries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… tests

Use distinct __existing_perfinstrumented prefix for existing test
instrumentation paths to avoid colliding with generated test file paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When Maven compiles all test files together, a broken instrumented test
file from one function's optimization can cause cascading compilation
failures for ALL subsequent functions. This adds pre-iteration cleanup
using find_leftover_instrumented_test_files() as a safety net.

Also updates the Java pattern to match __existing_perfinstrumented
variant files that were missed by the previous pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… Java timing instrumentation

Two bugs in _add_timing_instrumentation that caused instrumented tests to
fail compilation when test code contained multi-byte UTF-8 characters or
variable declarations in the target call statement.

1. Tree-sitter returns byte offsets but body_text is a Python str (Unicode).
   Slicing the str with byte offsets corrupts statements when multi-byte
   chars (é, 世, etc.) precede the target call.

2. Wrapping a local_variable_declaration (e.g., int len = func()) inside
   a for/try block moves the variable out of scope for subsequent code.
   Now hoists the declaration before the timing block.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codeflash-ai bot commented Feb 18, 2026

⚡️ Codeflash found optimizations for this PR

📄 10% (0.10x) speedup for _format_skeleton_for_context in codeflash/languages/java/context.py

⏱️ Runtime: 1.57 milliseconds → 1.42 milliseconds (best of 199 runs)

A dependent PR with the suggested changes has been created. Please review:

If you approve, it will be merged into this PR (branch fix/java-testgen-context-enhancements).

mashraf-222 merged commit 39c000c into omni-java Feb 18, 2026
24 of 32 checks passed
mashraf-222 deleted the fix/java-testgen-context-enhancements branch February 18, 2026 21:33