Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
175 changes: 175 additions & 0 deletions dev/modules/perl_tidy.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,175 @@
# Perl::Tidy Support Plan

## Goal

Make `./jcpan -t Perl::Tidy` run without errors on PerlOnJava.

## Current Status

**Version:** Perl-Tidy-20260204 (SHANCOCK/Perl-Tidy-20260204.tar.gz)
**Install:** Succeeds — 16 files installed to `~/.perlonjava/lib/`
**Tests:** 7/44 files pass, 37/44 fail

### Test Results Summary (after \G fixes)

| Category | Files | Result |
|----------|-------|--------|
| Passing | t/atee.t, t/filter_example.t, t/test.t, t/test_DEBUG.t, t/testsa.t, t/testss.t, t/zero.t | 7 OK |
| Snippet tests (DESTROY) | t/snippets1.t–t/snippets33.t (33 files) | 33 FAIL |
| Wide char tests (DESTROY) | t/testwide.t (2/3 pass), t/testwide-passthrough.t (2/6), t/testwide-tidy.t (2/6) | 3 FAIL |
| EOL tests (DESTROY) | t/test-eol.t (1/4 pass) | 1 FAIL |

### Progress Tracking

| Date | Milestone | Tests Passing |
|------|-----------|---------------|
| 2025-04-08 | Initial investigation | 5/44 |
| 2025-04-09 | \G regex fixes (pos undef + non-/g) | 7/44 |

## Fixes Applied

### Fix 1: \G Regex Anchor — pos() undef case (2025-04-09)

**File:** `src/main/java/org/perlonjava/runtime/regex/RuntimeRegex.java` (line 651)

**Problem:** When `pos()` was undef, the `\G` anchor check was skipped entirely
(`if (regex.useGAssertion && isPosDefined && matcher.start() != startPos)`).
This allowed `\G(\s+)` to scan forward and match whitespace anywhere in the
string, even though `\G` should anchor at position 0 when pos is undef.

**Impact:** Perl::Tidy's `parse_args` function uses `\G/gc` patterns to
tokenize option strings. With broken `\G`, options like `-dac` were silently
dropped, causing t/atee.t to fail.

**Fix:** Removed `isPosDefined` from the condition. When pos is undef,
`startPos` defaults to 0, so `\G` correctly anchors at 0.

### Fix 2: \G in Non-/g Matches (2025-04-09)

**File:** `src/main/java/org/perlonjava/runtime/regex/RuntimeRegex.java` (line 607)

**Problem:** `pos()` was only looked up for `/g` matches. In Perl, `\G`
should anchor at `pos()` even in non-`/g` matches (e.g. `$str =~ /\Gfoo/`).
PerlOnJava was ignoring pos entirely for non-/g matches containing `\G`.

**Impact:** Perl::Tidy's tokenizer uses `\G` in non-/g matches for signature
detection (line 10060: `$input_line =~ /\G\s*\(/`). Without this fix,
subroutine signatures like `sub foo($bar, %opts)` were misidentified as
prototypes, causing t/filter_example.t to fail.

**Fix:** Changed the pos() lookup condition from `isGlobalMatch()` to
`isGlobalMatch() || useGAssertion`, so pos is looked up whenever `\G` is
present in the pattern.

## Remaining Blocker: Missing DESTROY (33+ test files)

**Symptom:**
```
Attempt to create more than 1 object in Perl::Tidy::Formatter, which is not a true class yet
at .../Perl/Tidy/Formatter.pm line 1108.
```

This error kills the 2nd (and all subsequent) calls to `perltidy()` within a
single process. Since each snippet test file calls `perltidy()` 8–20 times in
a loop, only the first test passes per file.

**Root cause:** `Perl::Tidy::Formatter` and `Perl::Tidy::Tokenizer` use
closure-scoped instance counters that are incremented in `new()` and
decremented in `DESTROY()`. PerlOnJava does not call `DESTROY`, so the counter
never resets to 0.

**Formatter.pm singleton pattern:**
```perl
{ ## begin closure to count instances
my $_count = 0;
sub _increment_count { return ++$_count }
sub _decrement_count { return --$_count }
}

sub DESTROY {
my $self = shift;
_decrement_count();
return;
}

sub new {
...
if ( _increment_count() > 1 ) {
confess "Attempt to create more than 1 object...";
}
...
}
```

`Perl::Tidy::Tokenizer` has an identical pattern (lines 271–284, guard at
line 676).

**Other DESTROY methods in Perl::Tidy:** 10 other classes have empty DESTROY
methods (only to prevent AUTOLOAD dispatch) — these are safe with missing
DESTROY. Only `Formatter` and `Tokenizer` have functional DESTROY code.

**Impact:** ~555 subtests across 33+ test files never run.

**Fix (Bundled overlay — Perl/Tidy.pm):**

Patch `Perl::Tidy.pm`'s `perltidy()` function to explicitly call
`_decrement_count()` on Formatter and Tokenizer before returning. This
compensates for the missing DESTROY call with a 2-line surgical change.

In `Perl/Tidy.pm`, add before the final return in `perltidy()` (~line 1395):
```perl
# PerlOnJava: DESTROY not called on JVM — manually reset singleton counters
Perl::Tidy::Formatter::_decrement_count();
Perl::Tidy::Tokenizer::_decrement_count();
```

**Effort:** Low — 2 lines added to one file.

## Implementation Plan

### Phase 1: Fix DESTROY Singleton (unblocks ~555 subtests)

1. **Create bundled overlay** of `Perl/Tidy.pm`
- Copy upstream `Perl/Tidy.pm` (v20260204) to `src/main/perl/lib/Perl/Tidy.pm`
- Add `_decrement_count()` calls before `perltidy()`'s return points
- Mark changes with `# PerlOnJava:` comments
- Store diff in `dev/patches/cpan/Perl-Tidy-20260204/`

2. **Verify:** Re-run `./jcpan -t Perl::Tidy` — expect snippet tests to
progress past first test case in each file.

3. **Run `make`** — ensure no regressions in PerlOnJava's own tests.

### Phase 2: Wide Character Alignment (nice to have)

1. **Investigate** string width computation for Unicode characters
2. May require changes to PerlOnJava's `length()` or Perl::Tidy's alignment code
3. **Verify:** t/testwide.t, t/testwide-passthrough.t, t/testwide-tidy.t

## Expected Results After Phase 1

With the DESTROY fix alone, the test results should improve dramatically:

| Before | After (estimated) |
|--------|-------------------|
| 7/44 files pass | ~38/44 files pass |
| 4/53 subtests fail | TBD (most snippet tests should fully pass) |
| Result: FAIL | Closer to PASS |

## Dependency on Other Work

- **DESTROY implementation** (`dev/design/destroy_weaken_plan.md`): Would fix
this and all other DESTROY-dependent CPAN modules generically. However, it's
a large project. The targeted Perl::Tidy.pm overlay is the pragmatic
short-term fix.
- **Perl::Critic** (`dev/modules/perl_critic.md`): Already installed
(99.9% pass rate). Its `RequireTidyCode` policy fails because Perl::Tidy's
`Formatter::initialize_self_vars` exceeds the JVM 255-argument method
signature limit. That issue is separate from the test suite failures
documented here.

## Related Documents

- [cpan_patch_plan.md](cpan_patch_plan.md) — CPAN patching strategy (Option A: Bundled Overlays)
- [perl_critic.md](perl_critic.md) — Perl::Critic support (uses Perl::Tidy optionally)
- `dev/design/destroy_weaken_plan.md` — DESTROY/weaken implementation design
4 changes: 2 additions & 2 deletions src/main/java/org/perlonjava/core/Configuration.java
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ public final class Configuration {
* Automatically populated by Gradle/Maven during build.
* DO NOT EDIT MANUALLY - this value is replaced at build time.
*/
public static final String gitCommitId = "6858b39e6";
public static final String gitCommitId = "35c9ee0d5";

/**
* Git commit date of the build (ISO format: YYYY-MM-DD).
Expand All @@ -48,7 +48,7 @@ public final class Configuration {
* Parsed by App::perlbrew and other tools via: perl -V | grep "Compiled at"
* DO NOT EDIT MANUALLY - this value is replaced at build time.
*/
public static final String buildTimestamp = "Apr 9 2026 10:09:12";
public static final String buildTimestamp = "Apr 9 2026 12:17:38";

// Prevent instantiation
private Configuration() {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -777,23 +777,67 @@ private static RuntimeScalar isTextOrBinaryFromHandle(CustomFileChannel cfc, boo

/**
* Common heuristic for text/binary detection.
* Matches Perl's pp_fttext heuristic from pp_sys.c:
* - "odd" chars = null bytes, invalid high-bit bytes, and control chars
* (0-31) except \n, \r, \t, \b (8), \f (12), and ESC (27)
* - Valid UTF-8 multi-byte sequences are treated as text (not odd)
* - File is binary if odd * 3 > length (more than 1/3 odd characters)
*/
private static RuntimeScalar analyzeTextBinary(byte[] buffer, int length, boolean checkForText) {
int textChars = 0;
int totalChars = 0;
int odd = 0;

for (int i = 0; i < length; i++) {
if (buffer[i] == 0) {
return checkForText ? scalarFalse : scalarTrue; // Binary file
}
if ((buffer[i] >= 32 && buffer[i] <= 126) || buffer[i] == '\n' || buffer[i] == '\r' || buffer[i] == '\t') {
textChars++;
int b = buffer[i] & 0xFF; // treat as unsigned
if (b == 0) {
// Null byte — immediately binary
return checkForText ? scalarFalse : scalarTrue;
} else if (b >= 128) {
// Check if this starts a valid UTF-8 multi-byte sequence
int seqLen = utf8SequenceLength(buffer, i, length);
if (seqLen > 1) {
// Valid UTF-8 sequence — skip remaining bytes, not odd
i += seqLen - 1;
} else {
// Invalid high-bit byte — odd
odd++;
}
} else if (b < 32
&& b != '\n' && b != '\r' && b != '\t'
&& b != '\b' && b != '\f' && b != 27) {
// Control characters (except common whitespace and ESC) are "odd"
odd++;
}
totalChars++;
}

double textRatio = (double) textChars / totalChars;
return getScalarBoolean(checkForText ? textRatio > 0.7 : textRatio <= 0.7);
// Perl: odd * 3 > len means binary
boolean isBinary = odd * 3 > length;
return getScalarBoolean(checkForText ? !isBinary : isBinary);
}

/**
* Determines the length of a valid UTF-8 sequence starting at the given position.
* Returns the sequence length (2-4) if valid, or 1 if invalid.
*/
private static int utf8SequenceLength(byte[] buffer, int pos, int length) {
int b = buffer[pos] & 0xFF;
int seqLen;

if ((b & 0xE0) == 0xC0) {
seqLen = 2; // 110xxxxx — 2-byte sequence
} else if ((b & 0xF0) == 0xE0) {
seqLen = 3; // 1110xxxx — 3-byte sequence
} else if ((b & 0xF8) == 0xF0) {
seqLen = 4; // 11110xxx — 4-byte sequence
} else {
return 1; // Not a valid UTF-8 start byte
}

// Verify continuation bytes (10xxxxxx)
if (pos + seqLen > length) return 1; // Not enough bytes
for (int j = 1; j < seqLen; j++) {
if ((buffer[pos + j] & 0xC0) != 0x80) return 1; // Invalid continuation
}
return seqLen;
}

/**
Expand Down
10 changes: 6 additions & 4 deletions src/main/java/org/perlonjava/runtime/regex/RuntimeRegex.java
Original file line number Diff line number Diff line change
Expand Up @@ -598,20 +598,21 @@ private static RuntimeBase matchRegexDirect(RuntimeScalar quotedRegex, RuntimeSc

// hexPrinter(inputStr);

// Only look up pos() for /g matches - non-/g matches always start from 0
// Look up pos() for /g matches and for non-/g matches that use \G.
// In Perl, \G anchors at pos() even in non-/g matches (e.g. $str =~ /\Gfoo/).
RuntimeScalar posScalar = null;
boolean isPosDefined = false;
int startPos = 0;

if (regex.regexFlags.isGlobalMatch()) {
if (regex.regexFlags.isGlobalMatch() || regex.useGAssertion) {
// Use RuntimePosLvalue to get the current position
posScalar = RuntimePosLvalue.pos(string);
isPosDefined = posScalar.getDefinedBoolean();
startPos = isPosDefined ? posScalar.getInt() : 0;

// Check if previous call had zero-length match at this position (for SCALAR context)
// This prevents infinite loops in: while ($str =~ /pat/g)
if (ctx == RuntimeContextType.SCALAR) {
if (regex.regexFlags.isGlobalMatch() && ctx == RuntimeContextType.SCALAR) {
String patternKey = regex.patternString;
if (RuntimePosLvalue.hadZeroLengthMatchAt(string, startPos, patternKey)) {
// Previous match was zero-length at this position - fail to break loop
Expand Down Expand Up @@ -647,7 +648,8 @@ private static RuntimeBase matchRegexDirect(RuntimeScalar quotedRegex, RuntimeSc
try {
while (matcher.find()) {
// If \G is used, ensure the match starts at the expected position
if (regex.useGAssertion && isPosDefined && matcher.start() != startPos) {
// When pos() is undefined, \G anchors at position 0 (startPos defaults to 0)
if (regex.useGAssertion && matcher.start() != startPos) {
break;
}

Expand Down
76 changes: 76 additions & 0 deletions src/test/resources/unit/regex/regex_g_pos.t
Original file line number Diff line number Diff line change
Expand Up @@ -79,6 +79,82 @@ $pattern = qr/\G(\d{3})/; # Use a capture group
# Non-global match should not use \G
$string =~ /$pattern/;
ok(!($1 ne '123'), 'Non-global match does not use \\G, matched \'123\'');
###################
# \G anchoring when pos() is undefined
# \G should anchor at position 0 when pos is undef, not scan forward

# \G(\s+) should NOT match "-dac -tac" at pos 0 (no space at pos 0)
my $cfg = "-dac -tac";
if ($cfg =~ /\G(\s+)/gc) {
ok(0, '\\G(\\s+) should not match when no space at pos 0');
} else {
ok(1, '\\G(\\s+) correctly fails when pos is undef and no space at pos 0');
}

# \G([a-z]+) should NOT match "-dac -tac" at pos 0 (dash at pos 0)
pos($cfg) = undef;
if ($cfg =~ /\G([a-z]+)/gc) {
ok(0, '\\G([a-z]+) should not match when no letter at pos 0');
} else {
ok(1, '\\G([a-z]+) correctly fails when pos is undef and no letter at pos 0');
}

# \G(-) SHOULD match "-dac -tac" at pos 0 (dash at pos 0)
pos($cfg) = undef;
if ($cfg =~ /\G(-)/gc) {
ok($1 eq '-' && pos($cfg) == 1, '\\G(-) correctly matches dash at pos 0');
} else {
ok(0, '\\G(-) should match dash at pos 0');
}

# Simulate parse_args pattern: multiple \G/gc alternations on same string
pos($cfg) = undef;
my @tokens;
my $part = "";
while (1) {
if ($cfg =~ /\G([\"\'])/gc) {
# quote
}
elsif ($cfg =~ /\G(\s+)/gc) {
push @tokens, $part if length($part);
$part = "";
}
elsif ($cfg =~ /\G(.)/gc) {
$part .= $1;
}
else {
push @tokens, $part if length($part);
last;
}
}
ok(scalar(@tokens) == 2 && $tokens[0] eq '-dac' && $tokens[1] eq '-tac',
'\\G/gc tokenizer correctly splits "-dac -tac" into two tokens');

###################
# \G in non-/g matches should still anchor at pos()
# This is used by Perl::Tidy's tokenizer for signature detection

my $line = "sub foo(\$bar) { }";
pos($line) = 7; # after "sub foo"
ok($line =~ /\G\s*\(/, '\\G in non-/g match anchors at pos() - matches ( at pos 7');

pos($line) = 7;
ok(!($line =~ /\Gx/), '\\G in non-/g match anchors at pos() - fails when char does not match');

# \G in non-/g should not change pos()
pos($line) = 7;
$line =~ /\G\s*\(/;
ok(pos($line) == 7, '\\G non-/g match does not change pos()');

# \G with capture in non-/g match
my $data = "hello world";
pos($data) = 6;
if ($data =~ /\G(\w+)/) {
ok($1 eq 'world', '\\G non-/g match with capture works at pos 6');
} else {
ok(0, '\\G non-/g match with capture should match at pos 6');
}

###################
# End of Perl `pos` and `\G` Tests

Expand Down
Loading