Skip to content

tests: fix UTF-8 detection, per-test LC_* settings, CI coverage#17988

Closed
vszakats wants to merge 38 commits intocurl:masterfrom
vszakats:lc-more
Closed

tests: fix UTF-8 detection, per-test LC_* settings, CI coverage#17988
vszakats wants to merge 38 commits intocurl:masterfrom
vszakats:lc-more

Conversation

@vszakats
Copy link
Member

@vszakats vszakats commented Jul 21, 2025

Tests requiring UTF-8 locale:
165, 962, 963, 964, 965, 966, 967, 1448, 1560, 2046, 2047
Tests requiring UTF-8 locale, but passing without one anyway:
955, 956, 957, 958, 959, 960, 961, 968, 1034, 1035

Spec 1997: https://pubs.opengroup.org/onlinepubs/7908799/xbd/envvar.html
Spec 2008: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html

Ref: c221c0e #17938
Ref: 7cf8414
Ref: 4c140a5
Ref: 28faaac #2436
Ref: ecd1d02


LC_* values are a bit inconsistent in the rest of tests:

  1. 1034, 1035: LC_ALL= & LC_CTYPE=en_US.UTF-8
  2. 1148: LC_ALL= & LC_NUMERIC=en_US.UTF-8
  3. rest: LC_ALL=en_US.UTF-8 & LC_CTYPE=en_US.UTF-8

Method 1 clears LC_ALL, to give way to LC_CTYPE, and to configure
this without touching LANG. Later commits mention errors when LC_ALL
is empty, and fill it with the LC_CTYPE value, also keeping LC_CTYPE.
LC_CTYPE is possibly redundant (ignored) when LC_ALL is set.
LC_NUMERIC in method 2 could be replaced with method 1, I think.

I wonder if all could be replace with a single LC_ALL=en_US.UTF-8,
as in 1560 after this patch, because the empty value doesn't always
work and anything else is ignored if LC_ALL is set. [DONE, with C.UTF-8]

@vszakats
Copy link
Member Author

vszakats commented Jul 22, 2025

Still thinking how to return to the original state that once settled this nicely: ecd1d02

23208e3 says there was error when setting LC_ALL to blank. What could this error have been? At that time, runtests deleted that env when specified as LC_ALL= in the test data file:

curl/tests/runtests.pl

Lines 3362 to 3385 in 23208e3

if($s =~ /([^=]*)=(.*)/) {
my ($var, $content) = ($1, $2);
# remember current setting, to restore it once test runs
$oldenv{$var} = ($ENV{$var})?"$ENV{$var}":'notset';
# set new value
if(!$content) {
delete $ENV{$var} if($ENV{$var});
}
else {
if($var =~ /^LD_PRELOAD/) {
if(exe_ext('TOOL') && (exe_ext('TOOL') eq '.exe')) {
# print "Skipping LD_PRELOAD due to lack of OS support\n";
next;
}
if($debug_build || ($has_shared ne "yes")) {
# print "Skipping LD_PRELOAD due to no release shared build\n";
next;
}
}
$ENV{$var} = "$content";
print "setenv $var = $content\n" if($verbose);
}
}
}

This code changed later slightly to allow setting blank values. Deletion required LC_ALL (without =) from that point, but the error was reported before. Also some tests kept using LC_ALL= all along.

curl/tests/runner.pm

Lines 651 to 676 in 7cf8414

if($s =~ /([^=]*)(.*)/) {
my ($var, $content) = ($1, $2);
# remember current setting, to restore it once test runs
$oldenv{$var} = ($ENV{$var})?"$ENV{$var}":'notset';
if($content =~ /^=(.*)/) {
# assign it
$content = $1;
if($var =~ /^LD_PRELOAD/) {
if(exe_ext('TOOL') && (exe_ext('TOOL') eq '.exe')) {
logmsg "Skipping LD_PRELOAD due to lack of OS support\n" if($verbose);
next;
}
if($feature{"debug"} || !$has_shared) {
logmsg "Skipping LD_PRELOAD due to no release shared build\n" if($verbose);
next;
}
}
$ENV{$var} = "$content";
logmsg "setenv $var = $content\n" if($verbose);
}
else {
# remove it
delete $ENV{$var} if($ENV{$var});
}

vszakats added a commit to vszakats/curl that referenced this pull request Jul 22, 2025
No longer necessary after a previous change made sure to strip
the '100.0%' number from the result, before checking it. The dot is
a regex character catching any decimal character.

Follow-up to 17c18fb curl#5194
Ref: curl#2436
Cherry-picked from curl#17988
@github-actions github-actions bot added the CI Continuous Integration label Jul 22, 2025
@vszakats
Copy link
Member Author

The odd thing happens that after stripping all the per-test LC_* settings,
test jobs keep passing. Need to dig deeper

@vszakats
Copy link
Member Author

vszakats commented Jul 22, 2025

I suspect what's happening is that originally (after ecd1d02),
the availability of UTF-8 was verified in tests via precheck.

It worked because these tests also set LC_* env to UTF-8 first, and only after that,
runs precheck. This check failed when the env unsuccessfully tried to set UTF-8
due to lack of support.

This changed in current master via 0b70b23 #15039, moving these
checks to runtests.pl. It made the check run before tests set their
the envs selecting UTF-8.

In effect this makes the check test the calling environment. If it has UTF-8 selected,
the codeset-utf8 feature is enabled, otherwise disabled.

It means that if the calling environment has UTF-8 selected, IDN tests are run,
and finish successfully, even if they don't set LC_* on their own.

If the env has something else selected, the IDN tests are skipped because the
codeset-utf8 feature is disabled:
https://github.com/curl/curl/actions/runs/16447875282/job/46484741182?pr=17988#step:41:131

Except for tests doing IDN which didn't require this feature, e.g. test 1560.

One possible fix is selecting UTF-8 while testing this feature in runtests.pl.

@vszakats
Copy link
Member Author

vszakats commented Jul 22, 2025

After fixing the detection in runtests, the IDN jobs are failing without LC_* settings, as expected:
https://github.com/curl/curl/actions/runs/16451127383/job/46496391285?pr=17988

In Alpine and Slackware by default, and in Linux jobs (I guess those with libidn2).
What's still odd is macOS jobs remaining green.

FAIL 165: 'HTTP over proxy with IDN host name' HTTP, HTTP GET, HTTP proxy, IDN
FAIL 962: 'SMTP without SMTPUTF8 support - UTF-8 based sender (host part only)' SMTP, IDN
FAIL 963: 'SMTP without SMTPUTF8 support (IDN) - UTF-8 recipient (host part only)' SMTP, IDN
FAIL 964: 'SMTP external VRFY without SMTPUTF8 (IDN) - UTF-8 recipient (host part)' SMTP, VRFY, IDN
FAIL 965: 'SMTP with SMTPUTF8 support - UTF-8 based sender' SMTP, IDN
FAIL 966: 'SMTP with SMTPUTF8 support - UTF-8 based recipient' SMTP, IDN
FAIL 967: 'SMTP external VRFY with SMTPUTF8 support' SMTP, VRFY, IDN
FAIL 1448: 'Redirect following to UTF-8 IDN host name' HTTP, HTTP GET, IDN, followlocation, --resolve, --write-out
FAIL 1560: 'URL API' unittest, urlapi
FAIL 2046: 'Connection reuse with IDN host name' HTTP, HTTP GET, IDN, followlocation, --resolve, --write-out
FAIL 2047: 'Connection reuse with IDN host name over HTTP proxy' HTTP, HTTP GET, HTTP proxy, IDN, followlocation, --write-out

@vszakats
Copy link
Member Author

macOS jobs fail (correctly so) after overriding LC_ALL, not just LC_CTYPE:
https://github.com/curl/curl/actions/runs/16452498331/job/46501107703?pr=17988

@vszakats vszakats changed the title tests: tidy up locale settings tests: fix UTF-8 detection, tidy up LC_* settings in tests, improve CI coverage Jul 22, 2025
@vszakats
Copy link
Member Author

vszakats commented Jul 22, 2025

I've found it interesting that on Ubuntu specifically, deleting LC_ALL and setting LC_CTYPE to en_US.UTF-8
does not work to set the locale. (while it does work on macOS, Alpine, Slackware)
This is after manually overriding them both to C on Ubuntu (and macOS) before running the tests.

https://github.com/curl/curl/actions/runs/16454137943/job/46506733830?pr=17988

Neither does setting LC_ALL to blank, or setting LC_CTYPE to C.UTF8

This looks like the issue that led to:
b4c9982 #4743
23208e3 #4738

What works is setting both LC_ALL and LC_CTYPE to en_US.UTF-8.

@dfandrich
Copy link
Contributor

dfandrich commented Jul 22, 2025 via email

@vszakats
Copy link
Member Author

Ubuntu this locale C.utf8. Does that one work?

I tried LC_CTYPE=C.UTF-8 + LC_ALL= in test 962 in the previous push, to no avail.

Testing now with both set to en_US.UTF-8. And just LC_ALL set to it. Both seem to work.
(I thought to have tested the latter and failed, but apparently I missed it.)

I'll try C.UTF-8 next. If it works, I'd prefer it for being culture-agnostic.

vszakats added a commit that referenced this pull request Jul 22, 2025
No longer necessary after a previous change made sure to strip
the '100.0%' number from the result, before checking it. The dot is
a regex character catching any decimal separator.

Follow-up to 17c18fb #5194
Ref: #2436
Cherry-picked from #17988
Closes #17993
@vszakats vszakats changed the title tests: fix UTF-8 detection, tidy up LC_* settings in tests, improve CI coverage tests: fix UTF-8 detection, LC_* settings in tests, extend/fix CI coverage Jul 22, 2025
@vszakats vszakats changed the title tests: fix UTF-8 detection, LC_* settings in tests, extend/fix CI coverage tests: fix UTF-8 detection, per-test LC_* settings, CI coverage Jul 22, 2025
@vszakats
Copy link
Member Author

A single LC_ALL=C.UTF-8 looks like the winner.

@vszakats
Copy link
Member Author

vszakats commented Jul 22, 2025

A bunch of tests are set to require codeset-utf8 and set the locale to UTF-8,
but pass without an UTF-8 locale anyway (as tested in CI):
955, 956, 957, 958, 959, 960, 961, 968, 1034, 1035

  • invalid UTF-8 in domain names
  • getting UTF-8 encoded string via stdin
  • passing UTF-8 encoded string via command-line

The common theme is that no UTF-8 processing is happening in these tests.
It's just a stream of raw bytes.

Dropping these would unlock them in envs without UTF-8 support,
remove 40 lines and a tiny bit of complexity with it.

I'm not sure if there is a risk in some exotic POSIX environments, but
it's probably not worth trying.

@vszakats vszakats marked this pull request as ready for review July 22, 2025 22:40
vszakats added 3 commits July 23, 2025 14:31
This reverts commit bcb9361.

Doesn't look terribly interesting.

It's empty on BSDs, Slackware, Alpine, Ubuntu
macOS: LC_ALL: 'en_US.UTF-8' LC_CTYPE: 'en_US.UTF-8' LC_NUMERIC: ''

Also need these warnings fixed where undefined

Use of uninitialized value in concatenation (.) or string at ../../tests/runtests.pl line 868.
Use of uninitialized value in concatenation (.) or string at ../../tests/runtests.pl line 868.
Use of uninitialized value in concatenation (.) or string at ../../tests/runtests.pl line 868.
vszakats added a commit that referenced this pull request Jul 23, 2025
vszakats added a commit that referenced this pull request Jul 23, 2025
Syncing with autotools, and fixing the `Protocols:` verifier test.

Cherry-picked from #17988
vszakats added a commit that referenced this pull request Jul 23, 2025
vszakats added a commit that referenced this pull request Jul 23, 2025
Fixing:
```
In file included from lib/vtls/vtls.c:50:
In file included from lib/vtls/../urldata.h:314:
lib/vtls/../curl_sspi.h:41:10: fatal error: 'security.h' file not found
   41 | #include <security.h>
      |          ^~~~~~~~~~~~
1 error generated.

lib/curl_sspi.h:41:10: fatal error: 'security.h' file not found
   41 | #include <security.h>
      |          ^~~~~~~~~~~~
1 error generated.
```

Cherry-picked from #17988
@vszakats vszakats closed this in 1cc8a52 Jul 23, 2025
vszakats added a commit that referenced this pull request Jul 23, 2025
After 7cf8414 #12862, `VAR=` no longer
removes the env variable, but sets it to an empty/blank value instead.
To remove an env, `VAR` shall be used (without the assigment operator.)

`SSL_CERT_FILE`, `CURL_HOME`, `HOME`, `XDG_CONFIG_HOME`, were added
before the change above. Make tests unset these envs again, as their
commit messages suggest, instead of blanking them. It does not change
the outcome of the tests.

Ref: 764e4f0 #8213
Ref: e992770 #6600

Folllow-up to 7cf8414 #12862
Cherry-picked from #17988
Closes #17994
@vszakats vszakats deleted the lc-more branch July 23, 2025 20:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI Continuous Integration tests

Development

Successfully merging this pull request may close these issues.

2 participants