mkundu1
Contributor

@mkundu1 mkundu1 commented Oct 11, 2023

trying to understand the impact

Looking at how we launch Fluent from PyFluent, the timeout loop in FluentConnection is unnecessary. I have replaced it with a check_health call, which fails only if there is a problem at the gRPC communication level and, in that case, exposes the gRPC error message. Exposing the gRPC error message here will help us triage issues on users' systems.
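
For readers unfamiliar with the pattern, here is a minimal sketch of the idea, not the actual PyFluent code. It assumes a decorator that converts an RPC failure into a `RuntimeError` carrying the error details, similar in spirit to the `raise RuntimeError(ex.details()) from None` line visible in the new traceback. `FakeRpcError`, `catch_grpc_error`, `HealthCheckStub` and `connect` are hypothetical names; real code would catch `grpc.RpcError` instead of the stand-in class.

```python
import functools


class FakeRpcError(Exception):
    """Hypothetical stand-in for grpc.RpcError so the sketch runs without grpcio."""

    def __init__(self, details: str):
        super().__init__(details)
        self._details = details

    def details(self) -> str:
        return self._details


def catch_grpc_error(f):
    """Re-raise an RPC failure as a RuntimeError carrying the error details."""

    @functools.wraps(f)
    def func(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except FakeRpcError as ex:
            # 'from None' suppresses the chained RPC exception in the traceback.
            raise RuntimeError(ex.details()) from None

    return func


class HealthCheckStub:
    """Hypothetical health-check service; 'reachable' simulates a matching port."""

    def __init__(self, reachable: bool):
        self._reachable = reachable

    @catch_grpc_error
    def check_health(self) -> str:
        if not self._reachable:
            raise FakeRpcError("failed to connect to all addresses")
        return "SERVING"


def connect(reachable: bool) -> str:
    # A single health check replaces a polling loop: it either succeeds
    # or fails right away with the underlying error message.
    return HealthCheckStub(reachable).check_health()
```

With this shape, a failed connection surfaces the transport-level message directly instead of a generic timeout message.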

The start_timeout is used only when we wait for the server to finish writing the server-info file.
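
A rough sketch of what waiting on the server-info file under a timeout could look like is shown here. This is an assumption-laden illustration, not the launcher's implementation: `wait_for_server_info` and `poll_interval` are hypothetical names, and "finished writing" is simplified to "the file exists and is non-empty".

```python
import tempfile
import time
from pathlib import Path


def wait_for_server_info(
    path: Path, start_timeout: float, poll_interval: float = 0.1
) -> str:
    """Poll until the file exists and is non-empty, or raise after start_timeout seconds.

    Simplification: a non-empty file is treated as fully written.
    """
    deadline = time.monotonic() + start_timeout
    while time.monotonic() < deadline:
        if path.exists() and path.stat().st_size > 0:
            return path.read_text()
        time.sleep(poll_interval)
    raise TimeoutError(f"{path} was not written within {start_timeout} s")


if __name__ == "__main__":
    # Usage: the file already exists here, so the call returns at once.
    demo = Path(tempfile.mkdtemp()) / "serverinfo.txt"
    demo.write_text("placeholder")
    print(wait_for_server_info(demo, start_timeout=1.0))
```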

Tested by hardcoding a port in FluentConnection so that the gRPC connection fails due to a port mismatch:
Old behaviour (fails after 60 s):

>>> solver = pyfluent.launch_fluent()
pyfluent.launcher ERROR: Exception caught - RuntimeError: The connection to the Fluent server could not be established within the configurable 60 second time limit.
Traceback (most recent call last):
  File "D:\ANSYSDev\PyFluentDev\pyfluent\src\ansys\fluent\core\launcher\launcher.py", line 750, in launch_fluent
    session = new_session.create_from_server_info_file(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ANSYSDev\PyFluentDev\pyfluent\src\ansys\fluent\core\session.py", line 226, in create_from_server_info_file
    fluent_connection=FluentConnection(
                      ^^^^^^^^^^^^^^^^^
  File "D:\ANSYSDev\PyFluentDev\pyfluent\src\ansys\fluent\core\fluent_connection.py", line 283, in __init__
    raise RuntimeError(
RuntimeError: The connection to the Fluent server could not be established within the configurable 60 second time limit.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\ANSYSDev\PyFluentDev\pyfluent\src\ansys\fluent\core\launcher\launcher.py", line 784, in launch_fluent
    raise LaunchFluentError(launch_cmd) from ex
ansys.fluent.core.launcher.launcher.LaunchFluentError:
Fluent Launch string: start "" "D:\ANSYSDev\vNNN\fluent\ntbin\win64\fluent.exe" 3ddp   -sifile="C:\Users\mkundu\AppData\Local\Temp\serverinfo-1aqp0o22.txt" -nm -hidden

New behaviour (fails immediately after Fluent is launched and shows the gRPC error message):

>>> solver = pyfluent.launch_fluent()
pyfluent.launcher ERROR: Exception caught - RuntimeError: failed to connect to all addresses; last error: UNAVAILABLE: ipv6:%5B::1%5D:12345: Connection refused
Traceback (most recent call last):
  File "D:\ANSYSDev\PyFluentDev\pyfluent\src\ansys\fluent\core\launcher\launcher.py", line 745, in launch_fluent
    session = new_session.create_from_server_info_file(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ANSYSDev\PyFluentDev\pyfluent\src\ansys\fluent\core\session.py", line 226, in create_from_server_info_file
    fluent_connection=FluentConnection(
                      ^^^^^^^^^^^^^^^^^
  File "D:\ANSYSDev\PyFluentDev\pyfluent\src\ansys\fluent\core\fluent_connection.py", line 272, in __init__
    self.health_check_service.check_health()
  File "D:\ANSYSDev\PyFluentDev\pyfluent\src\ansys\fluent\core\services\error_handler.py", line 15, in func
    raise RuntimeError(ex.details()) from None
RuntimeError: failed to connect to all addresses; last error: UNAVAILABLE: ipv6:%5B::1%5D:12345: Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\ANSYSDev\PyFluentDev\pyfluent\src\ansys\fluent\core\launcher\launcher.py", line 779, in launch_fluent
    raise LaunchFluentError(launch_cmd) from ex
ansys.fluent.core.launcher.launcher.LaunchFluentError:
Fluent Launch string: start "" "D:\ANSYSDev\vNNN\fluent\ntbin\win64\fluent.exe" 3ddp   -sifile="C:\Users\mkundu\AppData\Local\Temp\serverinfo-z5hlusk9.txt" -nm -hidden

@mkundu1 mkundu1 linked an issue Oct 11, 2023 that may be closed by this pull request
@mkundu1 mkundu1 changed the title from "Remove timeout - testing" to "Remove timeout" Oct 19, 2023
@mkundu1 mkundu1 changed the title from "Remove timeout" to "Remove timeout loop in FluentConnection" Oct 19, 2023
@mkundu1 mkundu1 marked this pull request as ready for review October 20, 2023 00:01
Member

@raph-luc raph-luc left a comment

This is a great idea, but test_recover_grpc_error_from_launch_error is going to leave unreachable zombies behind (the watchdog also won't work because of the incorrect port number). I have an idea for making this test work with containers as well and cleaning up after itself so that it doesn't leave zombies. I'll see if I can make some suggestions.

Edit: if you prefer, we can merge this without the test for now and then add more robust tests in a separate PR.

@mkundu1
Contributor Author

mkundu1 commented Oct 20, 2023

> This is a great idea, but test_recover_grpc_error_from_launch_error is going to leave unreachable zombies behind (the watchdog also won't work because of the incorrect port number). I have an idea for making this test work with containers as well and cleaning up after itself so that it doesn't leave zombies. I'll see if I can make some suggestions.
>
> Edit: if you prefer, we can merge this without the test for now and then add more robust tests in a separate PR.

Thanks @raph-luc, this is a good point. Feel free to merge your suggestions directly to this branch if you prefer. We'll complete the PR sometime next week.

@mkundu1
Contributor Author

mkundu1 commented Oct 25, 2023

Thanks @raph-luc, I've kept both tests. Both are useful examples of how to recover the gRPC error in those scenarios.
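
The test bodies are not shown in this conversation, so the following is only a guess at the general pattern, using stand-in names. Because the launcher raises its error with `raise ... from ex` (as the tracebacks above show), the underlying error stays reachable through the exception's `__cause__` attribute. `LaunchError`, `launch` and `recover_underlying_error` are hypothetical and do not call PyFluent.

```python
from typing import Optional


class LaunchError(Exception):
    """Hypothetical stand-in for LaunchFluentError."""


def launch(port_ok: bool) -> str:
    """Simulate a launch whose connection step may fail."""
    try:
        if not port_ok:
            raise RuntimeError("UNAVAILABLE: Connection refused")
        return "session"
    except Exception as ex:
        # 'from ex' records the original error as __cause__ of the new one.
        raise LaunchError("launch command") from ex


def recover_underlying_error(port_ok: bool) -> Optional[str]:
    """Return the message of the error that caused the launch failure, if any."""
    try:
        launch(port_ok)
    except LaunchError as err:
        return str(err.__cause__)
    return None
```

A test written along these lines could assert on the recovered message rather than on the outer launch error alone.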

@mkundu1 mkundu1 merged commit f30c019 into main Oct 25, 2023
@mkundu1 mkundu1 deleted the fix/timeout branch October 25, 2023 14:12
raph-luc pushed a commit that referenced this pull request Oct 25, 2023
* Remove timeout

* Remove timeout

* Fix test

* Remove timeout argument

* Call the correct health-check

* Add test

* Add test

* Fix

* Fix test

* Disable test for docker

* Add test_recover_grpc_error_from_connection_error
raph-luc added a commit that referenced this pull request Oct 26, 2023
* Fix vale warnings (#2139)

* fix tensor type for displacement variable (#2145)

* Fix set_state implementation for command argument instance. (#2147)

* Update flobject.py (#2148)

* SVAR Doc (#1635)

* Test to catch Watchdog launch errors, and improved Watchdog behavior on Windows (#2144)

* Cavitation Model Example And Example Warning Fix (#2102)

* Add type annotations for some modules under services (#2108)

* More robust Windows launch command for Watchdog (#2167)

* Making h5py an optional dependency, not installed by default (#2171)

* Expose settings root like in pyconsole. (#2149)

* Remove timeout loop in FluentConnection (#2126)

* Fix SVAR doc (#2172)

---------

Co-authored-by: Mainak Kundu <94432368+mkundu1@users.noreply.github.com>
Co-authored-by: Oleg Chernukhin <92750311+ochernuk@users.noreply.github.com>
Co-authored-by: Prithwish Mukherjee <109645853+prmukherj@users.noreply.github.com>
Co-authored-by: Harshal Pohekar <106588300+hpohekar@users.noreply.github.com>
Co-authored-by: Aseem Jain <95020968+ajain-work@users.noreply.github.com>
Co-authored-by: Prithwish Mukherjee <prithwish.mukherjee@ansys.com>
Co-authored-by: Adam Boutin <143635850+ansaboutin@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

grpc error during initial connection