
[CI] Some TPCH benchmarks started failing #35383

Closed
ElenaHenderson opened this issue May 1, 2023 · 3 comments · Fixed by #35384

Comments

@ElenaHenderson (Contributor)

Describe the bug, including details regarding any error messages, version, and platform.

When #34912 was merged, some TPCH benchmarks started failing. Here is a build with failed benchmarks: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2835

#35364 fixed some of these failing TPCH benchmarks, but the following are still failing:

[Screenshot: list of TPCH benchmarks still failing, captured May 1, 2023 at 11:29 AM]

Component(s)

Benchmarking

@kou (Member) commented May 1, 2023

@icexelloss @rtpsw Could you take a look at this?

@kou (Member) commented May 1, 2023

BTW, it's difficult to find error messages in the log. Is this a related error?

https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2835#0187d7f0-b6aa-4a1a-b054-5873f056fdda/6-1380



[230501-13:39:32.996] [66823] [buildkite.benchmark.run] INFO: Started executing -> conbench tpch --iterations=3 --all=true --drop-caches=true --run-id=$RUN_ID --run-name="$RUN_NAME" --run-reason="$RUN_REASON"

Traceback (most recent call last):
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/_psosx.py", line 346, in wrapper
    return fun(self, *args, **kwargs)
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/_common.py", line 480, in wrapper
    raise raise_from(err, None)
  File "<string>", line 3, in raise_from
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/_common.py", line 478, in wrapper
    return fun(self)
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/_psosx.py", line 373, in _get_kinfo_proc
    ret = cext.proc_kinfo_oneshot(self.pid)
ProcessLookupError: [Errno 3] assume no such process (originated from sysctl(kinfo_proc), len == 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/__init__.py", line 361, in _init
    self.create_time()
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/__init__.py", line 719, in create_time
    self._create_time = self._proc.create_time()
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/_psosx.py", line 346, in wrapper
    return fun(self, *args, **kwargs)
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/_psosx.py", line 471, in create_time
    return self._get_kinfo_proc()[kinfo_proc_map['ctime']]
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/_psosx.py", line 351, in wrapper
    raise NoSuchProcess(self.pid, self._name)
psutil.NoSuchProcess: process no longer exists (pid=44552)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/homebrew/var/buildkite-agent/builds/test-mac-arm/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/buildkite/benchmark/monitor_memory.py", line 24, in <module>
    mem_rss_bytes = psutil.Process(proc.pid).memory_info().rss
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/__init__.py", line 332, in __init__
    self._init(pid)
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/__init__.py", line 373, in _init
    raise NoSuchProcess(pid, msg='process PID not found')
psutil.NoSuchProcess: process PID not found (pid=44552)

Can we improve log output?
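
For reference, a minimal sketch of the kind of guard that would keep the monitor alive through this race (assuming a polling loop like the one in monitor_memory.py; the function name `poll_rss` and the interval are hypothetical):

```python
import time

import psutil


def poll_rss(pid, interval_s=0.5):
    """Print the target process's RSS until it exits, tolerating the exit race."""
    try:
        proc = psutil.Process(pid)  # raises NoSuchProcess if the pid is already gone
        while True:
            # memory_info() can also raise NoSuchProcess if the process exits
            # between iterations; that is the race visible in the traceback above.
            print(proc.memory_info().rss)
            time.sleep(interval_s)
    except psutil.NoSuchProcess:
        # The benchmark process finished: end monitoring quietly rather than
        # crashing the monitor with an unhandled exception.
        pass
```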

@icexelloss (Contributor)

Sorry that I missed the notification for this. Thanks to @westonpace, there is a fix in the linked PR above.

westonpace added a commit that referenced this issue May 2, 2023
…d segmentation fault (#35384)

### Rationale for this change

The recent change (#34912) calculates the max concurrency using `plan->query_context()->executor()->GetCapacity()`. This value is later used to initialize the kernel states. However, it differs from what was used previously: `plan->query_context()->max_concurrency()`, which is slightly different (if the aggregate node is run in parallel, we initialize one state for each CPU thread, one for each I/O thread, and one for the calling user thread).

This is unfortunately a bit subtle, as `max_concurrency` is not a good indicator of whether the plan is running in parallel. So we need to query both properties and use each in its respective spot.

### What changes are included in this PR?

Now, `max_concurrency` is used to figure out how many thread-local states need to be initialized, and `GetCapacity` is used to figure out whether there are multiple CPU threads.

### Are these changes tested?

The bug was caught by the benchmarks, which is a bit concerning. Most of the CI runners have a very small number of CPU threads and don't experience much concurrency, so I think we just didn't see this pattern. Or possibly, this pattern is only experienced in the legacy way that pyarrow launches exec plans.

### Are there any user-facing changes?

No.
* Closes: #35383

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
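
To make the `max_concurrency` vs. `GetCapacity` distinction in the commit message above concrete, here is a toy numeric sketch (the thread counts are made up, and the out-of-bounds mechanism is an inference from the commit title; the real logic lives in Arrow's C++ aggregate node):

```python
# Hypothetical thread counts, for illustration only.
cpu_threads = 8   # what executor()->GetCapacity() reports
io_threads = 2

# max_concurrency(): one slot per CPU thread, one per I/O thread,
# plus one for the calling user thread (per the commit message above).
max_concurrency = cpu_threads + io_threads + 1   # 11

num_thread_local_states = max_concurrency   # how many kernel states to allocate
runs_in_parallel = cpu_threads > 1          # GetCapacity() answers this question

# The bug: sizing the state array with GetCapacity() (8 slots) while threads
# can be assigned max_concurrency-based indices (up to 10) lets some threads
# index past the end of the array -- the likely source of the segmentation
# fault the PR title mentions.
print(num_thread_local_states, runs_in_parallel)
```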
@westonpace added this to the 13.0.0 milestone May 2, 2023
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this issue May 11, 2023
…o avoid segmentation fault (apache#35384); same commit message as above.
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
…o avoid segmentation fault (apache#35384); same commit message as above.
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
…o avoid segmentation fault (apache#35384); same commit message as above.