[R] cli valgrind errors in nightlies #28946

Closed
asfimport opened this issue Jul 6, 2021 · 11 comments

@asfimport

The r-valgrind nightly job has been failing every day from July 1 through today.

The last known good build was on June 30.

The following were merged on June 30, so one of them is likely the culprit:
ARROW-13128: [C#] TimestampArray conversion logic for nano and micro … …
ARROW-13025: [C++][Python] Add FunctionOptions::Equals/ToString/Seria… …
ARROW-13095: [C++] Implement trig compute functions …
MINOR: [C#] Fixing example to use WriteEndAsync instead of WriteFoote… …
ARROW-13010: [C++][Compute] Support outputting to slices from kleene … …
ARROW-12996: Add bytes_read() to StreamingReader …
ARROW-13072: [C++] Add bit-wise arithmetic kernels …
ARROW-13104: [C++] Fix unsafe cast in ByteStreamSplit implementation …
ARROW-13134: [C++][CI] Pin aws-sdk-cpp to < 1.9 …

For reference:
last commit on June 29: 42048e5
last commit on June 30: ab57479

The error is:

2021-07-06T09:03:49.4267523Z ==3154== HEAP SUMMARY:
2021-07-06T09:03:49.4269578Z ==3154==     in use at exit: 322,229,205 bytes in 61,881 blocks
2021-07-06T09:03:49.4271423Z ==3154==   total heap usage: 3,962,401 allocs, 3,900,520 frees, 2,926,474,902 bytes allocated
2021-07-06T09:03:49.4273205Z ==3154== 
2021-07-06T09:03:56.2319948Z ==3154== 336 bytes in 1 blocks are possibly lost in loss record 174 of 3,640
2021-07-06T09:03:56.2321803Z ==3154==    at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
2021-07-06T09:03:56.2323056Z ==3154==    by 0x40149CA: allocate_dtv (dl-tls.c:286)
2021-07-06T09:03:56.2324056Z ==3154==    by 0x40149CA: _dl_allocate_tls (dl-tls.c:532)
2021-07-06T09:03:56.2324838Z ==3154==    by 0x5721322: allocate_stack (allocatestack.c:622)
2021-07-06T09:03:56.2329162Z ==3154==    by 0x5721322: pthread_create@@GLIBC_2.2.5 (pthread_create.c:660)
2021-07-06T09:03:56.2329726Z ==3154==    by 0x1317E67A: cli__start_thread (thread.c:46)
2021-07-06T09:03:56.2330502Z ==3154==    by 0x1317E6D6: clic_start_thread (thread.c:63)
2021-07-06T09:03:56.2331474Z ==3154==    by 0x494214E: R_doDotCall (dotcode.c:604)
2021-07-06T09:03:56.2331893Z ==3154==    by 0x49B135F: bcEval (eval.c:7671)
2021-07-06T09:03:56.2332311Z ==3154==    by 0x498C2A2: Rf_eval (eval.c:727)
2021-07-06T09:03:56.2332712Z ==3154==    by 0x498F011: R_execClosure (eval.c:1897)
2021-07-06T09:03:56.2333146Z ==3154==    by 0x498ECC4: Rf_applyClosure (eval.c:1823)
2021-07-06T09:03:56.2333567Z ==3154==    by 0x49A084C: bcEval (eval.c:7083)
2021-07-06T09:03:56.2333952Z ==3154==    by 0x498C2A2: Rf_eval (eval.c:727)
2021-07-06T09:03:56.2334279Z ==3154== 
2021-07-06T09:03:56.2361291Z ==3154== LEAK SUMMARY:
2021-07-06T09:03:56.2362383Z ==3154==    definitely lost: 0 bytes in 0 blocks
2021-07-06T09:03:56.2362824Z ==3154==    indirectly lost: 0 bytes in 0 blocks
2021-07-06T09:03:56.2363239Z ==3154==      possibly lost: 336 bytes in 1 blocks
2021-07-06T09:03:56.2363802Z ==3154==    still reachable: 322,228,869 bytes in 61,880 blocks
2021-07-06T09:03:56.2364367Z ==3154==                       of which reachable via heuristic:
2021-07-06T09:03:56.2365602Z ==3154==                         newarray           : 4,264 bytes in 1 blocks
2021-07-06T09:03:56.2366369Z ==3154==         suppressed: 0 bytes in 0 blocks
2021-07-06T09:03:56.2366834Z ==3154== Reachable blocks (those to which a pointer was found) are not shown.
2021-07-06T09:03:56.2367759Z ==3154== To see them, rerun with: --leak-check=full --show-leak-kinds=all
2021-07-06T09:03:56.2368331Z ==3154== 
2021-07-06T09:03:56.2368935Z ==3154== For lists of detected and suppressed errors, rerun with: -s
2021-07-06T09:03:56.2369460Z ==3154== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 2 from 1)

https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=7714&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=13491
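
To make the log above easier to compare across nightly runs, here is a minimal sketch that pulls the LEAK SUMMARY counts out of a timestamped CI log. The function names and the rule in `has_reportable_leak` are assumptions made for illustration; they are not part of the Arrow CI or of valgrind itself.

```python
import re

# Matches LEAK SUMMARY lines such as:
#   2021-07-06T09:03:56.2363239Z ==3154==      possibly lost: 336 bytes in 1 blocks
# The CI timestamp prefix is ignored because the pattern is searched, not anchored.
LEAK_LINE = re.compile(
    r"==\d+==\s+"
    r"(definitely lost|indirectly lost|possibly lost|still reachable|suppressed):"
    r"\s+([\d,]+) bytes in ([\d,]+) blocks"
)


def parse_leak_summary(log_text):
    """Return {category: (bytes, blocks)} for each LEAK SUMMARY line found."""
    summary = {}
    for line in log_text.splitlines():
        match = LEAK_LINE.search(line)
        if match:
            category, n_bytes, n_blocks = match.groups()
            summary[category] = (
                int(n_bytes.replace(",", "")),
                int(n_blocks.replace(",", "")),
            )
    return summary


def has_reportable_leak(summary):
    """Hypothetical rule: flag any non-zero 'lost' category, ignore 'still reachable'."""
    lost = ("definitely lost", "indirectly lost", "possibly lost")
    return any(summary.get(kind, (0, 0))[0] > 0 for kind in lost)
```

Applied to the log above, the only non-zero "lost" category is `possibly lost` (336 bytes in 1 block), which is the single error valgrind reports.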

Reporter: Jonathan Keane / @jonkeane
Assignee: Jonathan Keane / @jonkeane

PRs and other links:

Note: This issue was originally created as ARROW-13265. Please see the migration documentation for further details.

@asfimport

Mauricio 'Pachá' Vargas Sepúlveda / @pachadotdev:
I tried to run the mentioned commits locally over the July 4th holiday, but the Ubuntu 21.04 installation problem (now fixed) was a major blocker to reproducing the problem.

@asfimport

Mauricio 'Pachá' Vargas Sepúlveda / @pachadotdev:
My candidate was https://issues.apache.org/jira/browse/ARROW-13104, but I couldn't move beyond a mere gut feeling.

@asfimport

Antoine Pitrou / @pitrou:
It seems to be a thread that's not stopped before the process ends. Technically it may be harmless, but Valgrind doesn't like it.

It would be nice if you could try to enable tests selectively to try and find out which one produces the leak exactly.

@asfimport

Antoine Pitrou / @pitrou:
Also, @jonkeane, you can reproduce this locally using archery, so you should bisect the aforementioned commits to find which one introduced the issue.
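
The bisection suggested here can be sketched as a binary search over the ordered commit list. This is an illustration only: `first_bad_commit` and the `is_bad` callback are hypothetical names, and `is_bad` stands in for whatever local reproduction step (for example, a valgrind run) reports a failure for a given commit.

```python
def first_bad_commit(commits, is_bad):
    """Return the first commit for which is_bad(commit) is True.

    Assumes `commits` is ordered oldest to newest and that once a commit is
    bad, every later commit is bad too (the usual bisection precondition).
    Returns None if no commit is bad.
    """
    lo, hi = 0, len(commits)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the first bad commit is at mid or earlier
        else:
            lo = mid + 1  # the first bad commit is after mid
    return commits[lo] if lo < len(commits) else None


# Usage with placeholder labels rather than real commit hashes:
commits = ["c0", "c1", "c2", "c3", "c4"]
print(first_bad_commit(commits, lambda c: commits.index(c) >= 3))
```

The search only gives a meaningful answer if everything other than the commit is held fixed between runs; if a dependency version changes between runs, the "once bad, always bad" assumption no longer holds.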

@asfimport

Jonathan Keane / @jonkeane:
Yup, that's exactly what I'm already running locally to see if I can narrow it down. I'll post here when I have more information.

@asfimport

Jonathan Keane / @jonkeane:
This might not actually be us after all. The day that this error started there was a release of the cli package (from 2.5.0 to 3.0.0).

When I tried replicating the last passing build locally, it actually failed by default (with cli 3.0.0), and I was only able to recreate the passing result with cli 2.5.0. I'm now running HEAD with cli 2.5.0 to see if that passes.

@asfimport

Jonathan Keane / @jonkeane:
A ha! Yes, we can see the valgrind issue:

2021-07-01T07:50:16.9999210Z ==3158== 336 bytes in 1 blocks are possibly lost in loss record 175 of 3,689
2021-07-01T07:50:17.0000076Z ==3158==    at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
2021-07-01T07:50:17.0000778Z ==3158==    by 0x40149CA: allocate_dtv (dl-tls.c:286)
2021-07-01T07:50:17.0001425Z ==3158==    by 0x40149CA: _dl_allocate_tls (dl-tls.c:532)
2021-07-01T07:50:17.0001913Z ==3158==    by 0x5721322: allocate_stack (allocatestack.c:622)
2021-07-01T07:50:17.0002419Z ==3158==    by 0x5721322: pthread_create@@GLIBC_2.2.5 (pthread_create.c:660)
2021-07-01T07:50:17.0002939Z ==3158==    by 0x1315E67A: cli__start_thread (thread.c:46)
2021-07-01T07:50:17.0003412Z ==3158==    by 0x1315E6D6: clic_start_thread (thread.c:63)
2021-07-01T07:50:17.0003877Z ==3158==    by 0x494214E: R_doDotCall (dotcode.c:604)
2021-07-01T07:50:17.0004296Z ==3158==    by 0x49B135F: bcEval (eval.c:7671)
2021-07-01T07:50:17.0004716Z ==3158==    by 0x498C2A2: Rf_eval (eval.c:727)
2021-07-01T07:50:17.0005153Z ==3158==    by 0x498F011: R_execClosure (eval.c:1897)
2021-07-01T07:50:17.0005609Z ==3158==    by 0x498ECC4: Rf_applyClosure (eval.c:1823)
2021-07-01T07:50:17.0006034Z ==3158==    by 0x49A084C: bcEval (eval.c:7083)
2021-07-01T07:50:17.0006462Z ==3158==    by 0x498C2A2: Rf_eval (eval.c:727)

which is calling cli__start_thread

I'm still digging to see if I can figure out what is calling that in our tests.

@asfimport

Antoine Pitrou / @pitrou:
Ha! Looking at that code, you may be able to disable the thread creation simply by setting the CLI_NO_THREAD environment variable.

That said, it may be worth reporting the issue to their issue tracker anyway.
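
A minimal sketch of running a command with `CLI_NO_THREAD` set in its environment follows. The variable name comes from the comment above; the value `"1"` is an assumption (the comment only says the variable should be set), and the helper names are hypothetical.

```python
import os
import subprocess
import sys


def env_with_cli_no_thread(base_env=None, value="1"):
    """Return a copy of the environment with CLI_NO_THREAD set.

    The value "1" is an assumption; check the cli source for what it expects.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CLI_NO_THREAD"] = value
    return env


def run_with_cli_no_thread(command):
    """Run `command` (a list of strings) with CLI_NO_THREAD in its environment."""
    return subprocess.run(
        command,
        env=env_with_cli_no_thread(),
        capture_output=True,
        text=True,
        check=False,
    )


# Usage: a child Python process stands in for the real test command.
result = run_with_cli_no_thread(
    [sys.executable, "-c", "import os; print(os.environ['CLI_NO_THREAD'])"]
)
print(result.stdout.strip())
```

Copying the environment instead of mutating `os.environ` keeps the setting scoped to the child process, so other steps in the same job are unaffected.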

@asfimport

Jonathan Keane / @jonkeane:
I've submitted r-lib/cli#311 to hopefully resolve that. We can try setting that in our CI, but AFAIK we cannot set that on CRAN when they run the valgrind tests, so we'll have to ask that they overlook this in our submission since it looks like it's a {cli} issue.

@asfimport

Jonathan Keane / @jonkeane:
Issue resolved by pull request 10676
#10676

@asfimport added this to the 5.0.0 milestone Jan 11, 2023