
[CI] Some TPCH benchmarks started failing #35383

Closed
ElenaHenderson opened this issue May 1, 2023 · 3 comments · Fixed by #35384

Comments

@ElenaHenderson (Contributor)

Describe the bug, including details regarding any error messages, version, and platform.

When #34912 was merged, some TPCH benchmarks started failing. Here is a build with failed benchmarks: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2835

#35364 fixed some of these failing TPCH benchmarks, but the following are still failing:

[Screenshot: list of TPCH benchmarks still failing, captured May 1, 2023 at 11:29 AM]

Component(s)

Benchmarking

@kou (Member) commented May 1, 2023

@icexelloss @rtpsw Could you take a look at this?

@kou (Member) commented May 1, 2023

BTW, it's difficult to find error messages in the log. Is this a related error?

https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2835#0187d7f0-b6aa-4a1a-b054-5873f056fdda/6-1380



[230501-13:39:32.996] [66823] [buildkite.benchmark.run] INFO: Started executing -> conbench tpch --iterations=3 --all=true --drop-caches=true --run-id=$RUN_ID --run-name="$RUN_NAME" --run-reason="$RUN_REASON"

Traceback (most recent call last):
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/_psosx.py", line 346, in wrapper
    return fun(self, *args, **kwargs)
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/_common.py", line 480, in wrapper
    raise raise_from(err, None)
  File "<string>", line 3, in raise_from
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/_common.py", line 478, in wrapper
    return fun(self)
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/_psosx.py", line 373, in _get_kinfo_proc
    ret = cext.proc_kinfo_oneshot(self.pid)
ProcessLookupError: [Errno 3] assume no such process (originated from sysctl(kinfo_proc), len == 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/__init__.py", line 361, in _init
    self.create_time()
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/__init__.py", line 719, in create_time
    self._create_time = self._proc.create_time()
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/_psosx.py", line 346, in wrapper
    return fun(self, *args, **kwargs)
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/_psosx.py", line 471, in create_time
    return self._get_kinfo_proc()[kinfo_proc_map['ctime']]
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/_psosx.py", line 351, in wrapper
    raise NoSuchProcess(self.pid, self._name)
psutil.NoSuchProcess: process no longer exists (pid=44552)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/homebrew/var/buildkite-agent/builds/test-mac-arm/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/buildkite/benchmark/monitor_memory.py", line 24, in <module>
    mem_rss_bytes = psutil.Process(proc.pid).memory_info().rss
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/__init__.py", line 332, in __init__
    self._init(pid)
  File "/Users/voltrondata/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/psutil/__init__.py", line 373, in _init
    raise NoSuchProcess(pid, msg='process PID not found')
psutil.NoSuchProcess: process PID not found (pid=44552)

Can we improve log output?
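
For reference, a minimal sketch of the kind of guard that would keep the monitor alive through this race (assuming a polling loop like the one in monitor_memory.py; the function name `poll_rss` and the interval are hypothetical):

```python
import time

import psutil


def poll_rss(pid, interval_s=0.5):
    """Print the target process's RSS until it exits, tolerating the exit race."""
    try:
        proc = psutil.Process(pid)  # raises NoSuchProcess if the pid is already gone
        while True:
            # memory_info() can also raise NoSuchProcess if the process exits
            # between iterations; that is the race visible in the traceback above.
            print(proc.memory_info().rss)
            time.sleep(interval_s)
    except psutil.NoSuchProcess:
        # The benchmark process finished: end monitoring quietly rather than
        # crashing the monitor with an unhandled exception.
        pass
```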

@icexelloss (Contributor)

Sorry that I missed the notification for this. Thanks to @westonpace, there is a fix in the linked PR above.

westonpace added a commit that referenced this issue May 2, 2023
…d segmentation fault (#35384)

### Rationale for this change

The recent change (#34912) calculates the max concurrency using `plan->query_context()->executor()->GetCapacity()`. This value is later used to initialize the kernel states. However, it differs from what was used previously: `plan->query_context()->max_concurrency()`, which is slightly different (if the aggregate node is run in parallel, we initialize one state for each CPU thread, one for each I/O thread, and one for the calling user thread).

This is unfortunately a bit subtle, as `max_concurrency` is not a good indicator of whether the plan is running in parallel. So we need to query both properties and use each in its respective spot.

### What changes are included in this PR?

Now, `max_concurrency` is used to figure out how many thread-local states need to be initialized, and `GetCapacity` is used to figure out whether there are multiple CPU threads.

### Are these changes tested?

The bug was caught by the benchmarks, which is a bit concerning. Most of the CI runners have a very small number of CPU threads and don't experience much concurrency, so I think we just didn't see this pattern. Or possibly, this pattern is only experienced in the legacy way that pyarrow launches exec plans.

### Are there any user-facing changes?

No.
* Closes: #35383

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
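
To make the `max_concurrency` vs. `GetCapacity` distinction in the commit message above concrete, here is a toy numeric sketch (the thread counts are made up, and the out-of-bounds mechanism is an inference from the commit title; the real logic lives in Arrow's C++ aggregate node):

```python
# Hypothetical thread counts, for illustration only.
cpu_threads = 8   # what executor()->GetCapacity() reports
io_threads = 2

# max_concurrency(): one slot per CPU thread, one per I/O thread,
# plus one for the calling user thread (per the commit message above).
max_concurrency = cpu_threads + io_threads + 1   # 11

num_thread_local_states = max_concurrency   # how many kernel states to allocate
runs_in_parallel = cpu_threads > 1          # GetCapacity() answers this question

# The bug: sizing the state array with GetCapacity() (8 slots) while threads
# can be assigned max_concurrency-based indices (up to 10) lets some threads
# index past the end of the array -- the likely source of the segmentation
# fault the PR title mentions.
print(num_thread_local_states, runs_in_parallel)
```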
@westonpace added this to the 13.0.0 milestone May 2, 2023
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this issue May 11, 2023
…o avoid segmentation fault (apache#35384); same commit message as above.
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
…o avoid segmentation fault (apache#35384); same commit message as above.
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
…o avoid segmentation fault (apache#35384); same commit message as above.