We had a number of segments go down both on 12/16 (multiple waves of segments going down) as well as 12/18 due to a problem query that was run. The log messages show this query along with a segmentation fault error in each segment log that crashed. I've obfuscated the table and column names for the sake of privacy in the below error message as well as the provided DDL and problem query, but everything else is exactly the same as it was when it was executed
Segmentation fault","Failed process was running: SELECT col1
, col17
, LAG(col14) OVER (PARTITION BY col1, col17 ORDER BY col1, col17, col14)
, col14
, col14 - LAG(col14) OVER (PARTITION BY col1, col17 ORDER BY col1, col17, col14) AS t
, col14 - LAG(col14) OVER (PARTITION BY col1, col17 ORDER BY col1, col17, col14) < '5 mins' AS v
FROM myschema.mytable
WHERE col14 >= '2024-11-01'
AND col14 < '2024-12-01'
AND col17 = 'FOOBAR’;",,,,,,0,,"postmaster.c",4315,
2024-12-16 22:08:13.008648 UTC,,,p3534838,th-2126882688,,,,0,,,seg171,,,,,"LOG","00000","terminating any other active server processes",,,,,,,0,,"postmaster.c",4023,
2024-12-16 22:08:13.136640 UTC,,,p3534856,th-2126882688,,,,0,,,seg171,,,,,"WARNING","01000","ic-proxy-server: received signal 3",,,,,,,0,,"ic_proxy_main.c",474,
2024-12-16 22:08:13.182293 UTC,,,p3534838,th-2126882688,,,,0,,,seg171,,,,,"LOG","00000","background worker ""ic proxy process"" (PID 3534856) exited with exit code 1",,,,,,,0,,"postmaster.c",4293,
As far as I can tell it doesn't seem like this query was ever successful, however it's hard to tell from the logs since they're pretty verbose. However I did confirm with an engineer that this query was first executed on 12/16 (the day of our segments going down the first time with this query), there's no entry of it being executed at all on 12/17, and then it was executed again on 12/18 causing the same error and crash. This is a very large table with a lot of partitions and just from taking a quick sampling each partition is around 100GB+ in size, however I can get exact sizes if that's helpful for analysis.
I've attached the core dump analysis files for all 31 segments that went down during this time (this is directly related to issue #803) provided by @edespino in his comment here #803 (comment). I've also provided the DDL to create the table that I captured using gpbackup and then obfuscated as well as the query itself in files that are attached. Happy to provide any more detail upon request. I've tested this in my local Cloudberry docker environment but haven't been able to recreate the segfault. Mind you though that this test didn't include any data in the table so I imagine that would influence what steps the query execution plan actually takes.
This query shouldn't cause a segmentation fault and should just return the results when it's finished
Apache Cloudberry version
Cloudberry 1.6.0 (this is a pre-Apache release)
What happened
We had a number of segments go down both on 12/16 (multiple waves of segments going down) as well as 12/18 due to a problem query that was run. The log messages show this query along with a segmentation fault error in each segment log that crashed. I've obfuscated the table and column names for the sake of privacy in the below error message as well as the provided DDL and problem query, but everything else is exactly the same as it was when it was executed
As far as I can tell it doesn't seem like this query was ever successful, however it's hard to tell from the logs since they're pretty verbose. However I did confirm with an engineer that this query was first executed on 12/16 (the day of our segments going down the first time with this query), there's no entry of it being executed at all on 12/17, and then it was executed again on 12/18 causing the same error and crash. This is a very large table with a lot of partitions and just from taking a quick sampling each partition is around 100GB+ in size, however I can get exact sizes if that's helpful for analysis.
I've attached the core dump analysis files for all 31 segments that went down during this time (this is directly related to issue #803) provided by @edespino in his comment here #803 (comment). I've also provided the DDL to create the table that I captured using gpbackup and then obfuscated as well as the query itself in files that are attached. Happy to provide any more detail upon request. I've tested this in my local Cloudberry docker environment but haven't been able to recreate the segfault. Mind you though that this test didn't include any data in the table so I imagine that would influence what steps the query execution plan actually takes.
segfault_artifacts.zip
What you think should happen instead
This query shouldn't cause a segmentation fault and should just return the results when it's finished
How to reproduce
I've tested this locally and haven't been able to reproduce the segfault
Operating System
Rocky Linux 8.10 (Green Obsidian)
Anything else
No response
Are you willing to submit PR?
Code of Conduct