fix: Java test instrumentation and context improvements for E2E optimization by mashraf-222 · Pull Request #1530 · codeflash-ai/codeflash

mashraf-222 · 2026-02-18T20:07:58Z

Problems fixed

Three categories of bugs that caused Java E2E optimization to fail, plus two systemic issues discovered during an `aerospike-client-java` `--all` run (55 functions, 80% compilation failure rate):

1. Timing instrumentation produces invalid Java when tests contain multi-byte UTF-8 characters

During E2E testing of `Buffer.stringToUtf8` with hardcoded tests containing Unicode strings ("éñ", "世界"), the instrumented test files had corrupted statements like `t len = Buffer.stringToUtf8(...)` instead of `int len = ...`. The `int` keyword was split across lines, producing `not a statement` and `';' expected` compilation errors in every test method with non-ASCII string literals.

Root cause: _add_timing_instrumentation() in instrumentation.py uses tree-sitter, which returns byte offsets. These byte offsets were used directly to slice body_text, which is a Python str (Unicode). When multi-byte UTF-8 characters appear before the target statement, the byte offset is larger than the character offset, causing the slice to start mid-character. For example, "éñ" is 4 bytes in UTF-8 but 2 chars in Python — shifting all subsequent byte offsets by +2.

Evidence: Reproduced with a minimal test script:

body = '        String input = "éñ";\n        int len = Buffer.stringToUtf8(input, buf, 0);'
body_bytes = body.encode('utf8')
# byte offset of "int len": 39  (body_bytes.index(b"int len"))
# char offset of "int len": 37  (body.index("int len"))
# body[39:] → "t len = ..."  (wrong: sliced into the middle of "int")
# body[37:] → "int len = ..." (correct)

Fix: Convert tree-sitter byte offsets to character offsets before slicing: `stmt_start = len(body_bytes[:stmt_byte_start].decode("utf8"))`.
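The conversion can be wrapped in a small helper. The following is a minimal sketch rather than the code in `instrumentation.py`: the name `byte_to_char_offset` is hypothetical, and it assumes the byte offset falls on a character boundary, which holds for tree-sitter node boundaries.

```python
def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Translate a UTF-8 byte offset into an index usable on a Python str.

    Assumes byte_offset lies on a character boundary; otherwise the
    decode call raises UnicodeDecodeError.
    """
    return len(text.encode("utf8")[:byte_offset].decode("utf8"))


# "世" and "界" are 3 bytes each in UTF-8, so offsets after them differ by 4.
body = 'String s = "世界"; int n = f(s);'
stmt_byte_start = body.encode("utf8").index(b"int n")
stmt_start = byte_to_char_offset(body, stmt_byte_start)

print(body[stmt_start:])  # slice starts at the statement, not inside it
```

Encoding the whole body on every call is wasteful for large files; a caller converting many offsets would likely encode once and reuse the bytes.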

2. Variable scoping error when target call is inside a variable declaration

After fixing the byte-offset bug, instrumented tests failed with variable len might not have been initialized. The timing instrumentation wraps the target statement (int len = func()) inside a for { try { ... } } block, which moves the len declaration into the try block scope. Subsequent code referencing len (e.g., for (int i = 0; i < len; i++)) can't find it.

Fix: Added split_var_declaration() that detects local_variable_declaration AST nodes, hoists the declaration (int len = 0;) before the timing block, and converts the wrapped statement to just an assignment (len = func();). Uses default values (0, 0L, null, etc.) to satisfy Java's definite assignment rules.
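To illustrate the hoisting idea, here is a simplified sketch. It is not the PR's `split_var_declaration()`, which works on `local_variable_declaration` AST nodes; this version uses a regular expression on a single statement, the name `hoist_declaration` is hypothetical, and modifiers such as `final` and multi-variable declarations are not handled.

```python
import re

# Default initializers that satisfy Java's definite assignment rules.
# Any type not listed here is treated as a reference type and gets null.
_DEFAULTS = {
    "byte": "0", "short": "0", "int": "0", "long": "0L",
    "float": "0.0f", "double": "0.0", "boolean": "false",
}

# Matches "<type> <name> = <expr>;" for a single declared variable.
_DECL = re.compile(
    r"^\s*(?P<type>[\w.<>\[\], ]+?)\s+(?P<name>\w+)\s*=\s*(?P<expr>.+);\s*$",
    re.S,
)


def hoist_declaration(stmt: str):
    """Split a declaration into (hoisted declaration, plain assignment).

    Returns None when stmt is not a simple variable declaration.
    """
    m = _DECL.match(stmt)
    if m is None:
        return None
    type_, name, expr = m.group("type"), m.group("name"), m.group("expr")
    default = _DEFAULTS.get(type_, "null")
    return f"{type_} {name} = {default};", f"{name} = {expr};"


decl, assign = hoist_declaration("int len = Buffer.stringToUtf8(input, buf, 0);")
print(decl)    # goes before the timing block
print(assign)  # goes inside the try block
```

The hoisted declaration stays in the enclosing scope, so later code such as `for (int i = 0; i < len; i++)` still resolves `len`.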

3. AI-generated tests had insufficient type context, causing undeclared variable and missing import errors

During the aerospike-client-java --all run, 19 of 55 functions (35%) failed because the AI generated tests referencing undeclared variables (policy, copy, result, configProvider) and missing class imports (ClientPolicy, Builder). The type skeleton system provided insufficient context: token budget was too low (2000 tokens), wildcard imports were silently skipped, type skeletons lacked constructor summary headers, and types referenced in the target method weren't prioritized.

Root cause analysis from the aerospike run:

- variable `policy` (44 errors): AI referenced policy objects without creating them
- class `ClientPolicy` (80 errors): AI used `ClientPolicy` without importing `com.aerospike.client.policy.ClientPolicy`
- class `Builder` (24 errors): AI used `DynamicWriteConfig.Builder` without proper import
- variable `copy`, `copy1`, `copy2` (38 errors): copy operations never assigned
- Total: ~148 undeclared variable errors + ~104 missing class import errors across 19 functions

Fixes:

- Doubled `IMPORTED_SKELETON_TOKEN_BUDGET` from 2000 to 4000 tokens, giving the AI more complete type information
- Added `_extract_type_names_from_code()` to parse the target method's AST and collect all referenced type names
- Prioritized skeletons for types the target method actually uses (referenced types first, then others)
- Added `expand_wildcard_import()` to `import_resolver.py`: wildcard imports like `com.aerospike.client.policy.*` are now expanded to individual class files, so all types in a package are available for skeleton extraction
- Added `_extract_constructor_summaries()`, which generates one-line `// Constructors: ClassName(Type1 param1, Type2 param2)` headers at the top of each skeleton, making constructor signatures unambiguous for the AI
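The interaction between prioritization and the token budget can be sketched as follows. This is an illustration under stated assumptions, not the code in `context.py`: `select_skeletons` is a hypothetical name, tokens are approximated by whitespace-separated words, and a skeleton that does not fit is skipped rather than ending the selection.

```python
IMPORTED_SKELETON_TOKEN_BUDGET = 4000  # value stated in the PR


def select_skeletons(skeletons, referenced, budget=IMPORTED_SKELETON_TOKEN_BUDGET):
    """Pick skeleton names to include, referenced types first.

    skeletons:  mapping of type name -> skeleton text
    referenced: type names used by the target method
    """
    # sorted() is stable, so original order is kept within each group;
    # False sorts before True, which puts referenced types first.
    ordered = sorted(skeletons, key=lambda name: name not in referenced)

    chosen, used = [], 0
    for name in ordered:
        cost = len(skeletons[name].split())  # crude token estimate (assumption)
        if used + cost > budget:
            continue  # skip this one; a smaller skeleton may still fit
        chosen.append(name)
        used += cost
    return chosen


skeletons = {"Key": "a b c", "ClientPolicy": "d e f", "Bin": "g h"}
print(select_skeletons(skeletons, {"ClientPolicy"}, budget=5))
```

With a budget of 5, `ClientPolicy` is taken first because the target method references it, `Key` no longer fits, and `Bin` fills the remaining space.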

4. Existing test instrumentation silently overwrites generated test files (path collision)

During the aerospike run, generated tests and existing tests both used the __perfinstrumented suffix, causing file path collisions. When a function had both generated and existing tests, the existing test instrumentation at line ~1940 in function_optimizer.py overwrote the generated test file, silently destroying generated test content.

Evidence from Fibonacci validation:

Line 304: Wrote behavioral test to .../FibonacciTest__perfinstrumented.java      (generated test 1)
Line 556: Wrote behavioral test to .../FibonacciTest__perfinstrumented_2.java    (generated test 2)
Line 807: Wrote instrumented test to .../FibonacciTest__perfinstrumented.java    (existing test OVERWRITES!)

Fix: Existing tests now use distinct suffixes: __existing_perfinstrumented / __existing_perfonlyinstrumented. Added class name replacement in the generated Java source to keep the file name and class name in sync (Java requirement). Updated the leftover file cleanup regex in optimizer.py to match the new __existing_ prefix variant.
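Because Java requires a public class to live in a file of the same name, changing the suffix means renaming the class inside the source as well. A minimal sketch of that fixup, assuming a plain whole-word textual replacement is sufficient (the helper name `rename_test_class` is hypothetical and is not taken from `function_optimizer.py`):

```python
import re


def rename_test_class(source: str, old_name: str, new_name: str) -> str:
    """Replace whole-word occurrences of old_name with new_name.

    Covers the class declaration, constructors, and self-references.
    The word boundary prevents matching names that merely start with
    old_name, because "_" counts as a word character.
    """
    return re.sub(rf"\b{re.escape(old_name)}\b", new_name, source)


src = "public class FibonacciTest {\n    FibonacciTest() {}\n}"
new_name = "FibonacciTest__existing_perfinstrumented"
out = rename_test_class(src, "FibonacciTest", new_name)
file_name = f"{new_name}.java"  # file name and class name stay in sync
print(out)
```

A textual replacement would also rewrite the name inside string literals and comments, which is usually harmless for test classes but is a limitation of this sketch.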

5. Leftover instrumented test files from previous runs cause cascading compilation failures

In multi-function --all runs, a broken instrumented test file from function N persists in the test directory and causes Maven compilation failures for function N+1 (since Maven compiles all test files together). This cascading effect can turn a single bad test into 100% failure for all subsequent functions.

Fix: Added a safety-net cleanup step at the start of each function's optimization cycle in optimizer.py. Before each function is optimized, find_leftover_instrumented_test_files() is called to detect and remove any stale *__perfinstrumented* and *__existing_perfinstrumented* files from the test root.
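A cleanup step of this kind might look like the sketch below. It does not reproduce `find_leftover_instrumented_test_files()` or the regex in `optimizer.py`; the function name, the pattern, and the optional numeric suffix (inferred from the `__perfinstrumented_2.java` file name in the log above) are assumptions.

```python
import re
from pathlib import Path

# Matches file names for generated and existing-test variants, e.g.
#   FooTest__perfinstrumented.java, FooTest__perfinstrumented_2.java,
#   FooTest__existing_perfonlyinstrumented.java
_LEFTOVER = re.compile(r"__(?:existing_)?perf(?:only)?instrumented(?:_\d+)?\.java$")


def remove_stale_instrumented_tests(test_root: Path) -> list:
    """Delete leftover instrumented test files under test_root.

    Returns the paths that were removed, for logging.
    """
    removed = []
    # Materialize the listing first so deletion does not disturb the walk.
    for path in sorted(test_root.rglob("*.java")):
        if _LEFTOVER.search(path.name):
            path.unlink()
            removed.append(path)
    return removed
```

Running this before each function's optimization cycle means a broken file left by function N cannot be picked up when Maven compiles the test tree for function N+1.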

Code changes

| File | Change |
| --- | --- |
| `instrumentation.py` | Added byte-to-char offset conversion in `build_instrumented_body()`. Added `split_var_declaration()` helper for variable hoisting with default value initialization. Applied to both single-range and multi-range branches. |
| `context.py` | Doubled skeleton token budget to 4000. Added `_extract_type_names_from_code()` for type prioritization via tree-sitter AST. Added `_extract_constructor_summaries()` for unambiguous constructor headers. Added priority sorting so target-method types get skeletons first. |
| `import_resolver.py` | Added `expand_wildcard_import()`, which resolves `com.example.*` to individual `.java` files in the package directory, enabling skeleton extraction for all types in wildcard-imported packages. |
| `function_optimizer.py` | Changed existing test instrumentation to use `__existing_perfinstrumented` / `__existing_perfonlyinstrumented` suffixes. Added class name fixup for Java file/class name consistency. |
| `optimizer.py` | Added per-function leftover instrumented file cleanup before each optimization cycle. Updated cleanup regex for new `__existing_` prefix. |
| `test_context.py` | Updated tests for constructor summary headers and wildcard import expansion (previously tested as "skipped", now tested as "expanded"). |

Testing

- All 41 existing instrumentation unit tests pass after changes
- Verified the byte-offset fix produces correct output with multi-byte chars (é, ñ, 世, 界)
- Verified variable hoisting produces valid Java: `int len = 0;` before timing block, `len = func();` inside try
- Updated context tests verify constructor summaries appear in skeleton output and wildcard imports are expanded

Known remaining issues

- The `perfonlyinstrumented` variant strips assertions via `transform_java_assertions`, but leaves behind empty for loops (`for (int i = 0; i < len; i++) {}`) that reference the hoisted variable. These compile correctly now but are dead code. A cleanup pass could remove them.
- E2E validation of the full optimization pipeline (hardcoded tests → instrumentation → Maven test run → optimization) is pending and requires the companion codeflash-internal PR for the hardcoded test and prompt updates.

- Increase imported type skeleton token budget from 2000 to 4000
- Add constructor signature summary headers to skeleton output
- Expand wildcard imports (e.g., import com.foo.*) into individual types instead of silently skipping them
- Prioritize skeleton processing for types referenced in the target method so parameter types are guaranteed context before less-critical types
- Fix invalid [no-arg] annotation in constructor summaries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… tests

Use distinct __existing_perfinstrumented prefix for existing test instrumentation paths to avoid colliding with generated test file paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When Maven compiles all test files together, a broken instrumented test file from one function's optimization can cause cascading compilation failures for ALL subsequent functions. This adds pre-iteration cleanup using find_leftover_instrumented_test_files() as a safety net. Also updates the Java pattern to match __existing_perfinstrumented variant files that were missed by the previous pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… Java timing instrumentation

Two bugs in _add_timing_instrumentation that caused instrumented tests to fail compilation when test code contained multi-byte UTF-8 characters or variable declarations in the target call statement.

1. Tree-sitter returns byte offsets but body_text is a Python str (Unicode). Slicing the str with byte offsets corrupts statements when multi-byte chars (é, 世, etc.) precede the target call.
2. Wrapping a local_variable_declaration (e.g., int len = func()) inside a for/try block moves the variable out of scope for subsequent code. Now hoists the declaration before the timing block.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codeflash-ai · 2026-02-18T20:24:01Z

⚡️ Codeflash found optimizations for this PR

📄 10% (0.10x) speedup for `_format_skeleton_for_context` in `codeflash/languages/java/context.py`

⏱️ Runtime : 1.57 milliseconds → 1.42 milliseconds (best of 199 runs)

A dependent PR with the suggested changes has been created. Please review:

⚡️ Speed up function _format_skeleton_for_context by 10% in PR #1530 (fix/java-testgen-context-enhancements) #1532

If you approve, it will be merged into this PR (branch fix/java-testgen-context-enhancements).

mashraf-222 and others added 4 commits February 18, 2026 16:55

fix: prevent existing test instrumentation from overwriting generated…

543617a

codeflash-ai bot mentioned this pull request Feb 18, 2026

⚡️ Speed up function _format_skeleton_for_context by 10% in PR #1530 (fix/java-testgen-context-enhancements) #1532

Closed

misrasaurabh1 approved these changes Feb 18, 2026

mashraf-222 merged commit 39c000c into omni-java Feb 18, 2026
24 of 32 checks passed

mashraf-222 deleted the fix/java-testgen-context-enhancements branch February 18, 2026 21:33
