Propagate the k8s exception to avoid waiting forever #2747

Merged
sighingnow merged 1 commit into alibaba:main from ht/hungs-up on May 25, 2023

sighingnow
Collaborator

@sighingnow sighingnow commented May 25, 2023

Fixes #2746

After this pull request, the error in issue #2746 becomes more readable and accurate:

In [1]: import logging
   ...: import graphscope
   ...: graphscope.set_option(show_log=True)
   ...: graphscope.set_option(log_level=logging.DEBUG)
   ...: sess = graphscope.session()
2023-05-25 19:28:48,699 [INFO][session:650]: Initializing graphscope session with parameters: {'addr': None, 'mode': 'eager', 'cluster_type': 'k8s', 'num_workers': 2, 'preemptive': True, 'k8s_namespace': None, 'k8s_service_type': 'NodePort', 'k8s_image_registry': 'registry.cn-hongkong.aliyuncs.com', 'k8s_image_repository': 'graphscope', 'k8s_image_tag': '0.22.0', 'k8s_image_pull_policy': 'IfNotPresent', 'k8s_image_pull_secrets': [], 'k8s_coordinator_cpu': 0.5, 'k8s_coordinator_mem': '512Mi', 'etcd_addrs': None, 'etcd_listening_client_port': 2379, 'etcd_listening_peer_port': 2380, 'k8s_vineyard_image': 'vineyardcloudnative/vineyardd:latest', 'k8s_vineyard_deployment': None, 'k8s_vineyard_cpu': 0.5, 'k8s_vineyard_mem': '512Mi', 'vineyard_shared_mem': '4Gi', 'k8s_engine_cpu': 0.2, 'k8s_engine_mem': '1Gi', 'k8s_mars_worker_cpu': 0.2, 'k8s_mars_worker_mem': '4Mi', 'k8s_mars_scheduler_cpu': 0.2, 'k8s_mars_scheduler_mem': '2Mi', 'k8s_coordinator_pod_node_selector': None, 'k8s_engine_pod_node_selector': None, 'enabled_engines': 'analytical,interactive,learning', 'reconnect': False, 'k8s_volumes': {}, 'k8s_waiting_for_delete': False, 'timeout_seconds': 600, 'dangling_timeout_seconds': 600, 'with_mars': False, 'with_dataset': False, 'hosts': ['localhost'], 'k8s_client_config': {}}
2023-05-25 19:28:48,741 [INFO][cluster:272]: Launching coordinator...
['python3', '-m', 'gscoordinator', '--cluster_type', 'k8s', '--port', '59564', '--num_workers', '2', '--preemptive', 'True', '--instance_id', 'ruxbdy', '--log_level', 'DEBUG', '--k8s_namespace', 'default', '--k8s_service_type', 'NodePort', '--k8s_image_repository', 'graphscope', '--k8s_image_pull_policy', 'IfNotPresent', '--k8s_coordinator_name', 'coordinator-ruxbdy', '--k8s_coordinator_service_name', 'coordinator-ruxbdy', '--k8s_vineyard_image', 'vineyardcloudnative/vineyardd:latest', '--k8s_vineyard_cpu', '0.5', '--k8s_vineyard_mem', '512Mi', '--vineyard_shared_mem', '4Gi', '--k8s_engine_cpu', '0.2', '--k8s_engine_mem', '1Gi', '--k8s_mars_worker_cpu', '0.2', '--k8s_mars_worker_mem', '4Mi', '--k8s_mars_scheduler_cpu', '0.2', '--k8s_mars_scheduler_mem', '2Mi', '--k8s_with_mars', 'False', '--k8s_enabled_engines', 'analytical,interactive,learning', '--k8s_with_dataset', 'False', '--timeout_seconds', '600', '--dangling_timeout_seconds', '600', '--waiting_for_delete', 'False', '--k8s_delete_namespace', 'False', '--k8s_image_registry', 'registry.cn-hongkong.aliyuncs.com', '--k8s_image_tag', '0.22.0']
2023-05-25 19:28:50,776 [INFO][utils:193]: coordinator-ruxbdy-b6674d89-vddbz: Successfully assigned default/coordinator-ruxbdy-b6674d89-vddbz to izj6ch597fca8xxhqh1ua9z
2023-05-25 19:28:50,776 [INFO][utils:193]: coordinator-ruxbdy-b6674d89-vddbz: Pulling image "registry.cn-hongkong.aliyuncs.com/graphscope/coordinator:0.22.0"
2023-05-25 19:28:50,776 [INFO][utils:193]: coordinator-ruxbdy-b6674d89-vddbz: Failed to pull image "registry.cn-hongkong.aliyuncs.com/graphscope/coordinator:0.22.0": rpc error: code = Unknown desc = Error response from daemon: manifest for registry.cn-hongkong.aliyuncs.com/graphscope/coordinator:0.22.0 not found: manifest unknown: manifest unknown
2023-05-25 19:28:50,776 [INFO][utils:193]: coordinator-ruxbdy-b6674d89-vddbz: Error: ErrImagePull
2023-05-25 19:28:50,776 [INFO][utils:193]: coordinator-ruxbdy-b6674d89-vddbz: Back-off pulling image "registry.cn-hongkong.aliyuncs.com/graphscope/coordinator:0.22.0"
2023-05-25 19:28:50,776 [INFO][utils:193]: coordinator-ruxbdy-b6674d89-vddbz: Error: ImagePullBackOff
stop been called ....
2023-05-25 19:28:52,776 [INFO][cluster:555]: Stopping coordinator
2023-05-25 19:28:52,792 [INFO][cluster:575]: Stopped coordinator
2023-05-25 19:28:52,792 [INFO][cluster:555]: Stopping coordinator
2023-05-25 19:28:52,792 [INFO][cluster:575]: Stopped coordinator
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ <ipython-input-1-9097ce55384f>:5 in <module>                                                     │
│                                                                                                  │
│ /opt/gs-for-build/python/graphscope/client/utils.py:400 in wrapper                               │
│                                                                                                  │
│   397 │   │   │   assert len(original_defaults) == len(new_defaults), "set defaults failed"      │
│   398 │   │   │   func.__defaults__ = tuple(new_defaults)                                        │
│   399 │   │   │                                                                                  │
│ ❱ 400 │   │   │   return_value = func(*args, **kwargs)                                           │
│   401 │   │   │                                                                                  │
│   402 │   │   │   # Restore original defaults.                                                   │
│   403 │   │   │   func.__defaults__ = original_defaults                                          │
│                                                                                                  │
│ /opt/gs-for-build/python/graphscope/client/session.py:684 in __init__                            │
│                                                                                                  │
│    681 │   │   atexit.register(self.close)                                                       │
│    682 │   │   # create and connect session                                                      │
│    683 │   │   with CaptureKeyboardInterrupt(self.close):                                        │
│ ❱  684 │   │   │   self._connect()                                                               │
│    685 │   │                                                                                     │
│    686 │   │   self._disconnected: bool = False                                                  │
│    687                                                                                           │
│                                                                                                  │
│ /opt/gs-for-build/python/graphscope/client/session.py:1043 in _connect                           │
│                                                                                                  │
│   1040 │   │                                                                                     │
│   1041 │   │   # launching graphscope service                                                    │
│   1042 │   │   if self._launcher is not None:                                                    │
│ ❱ 1043 │   │   │   self._launcher.start()                                                        │
│   1044 │   │   │   self._coordinator_endpoint = self._launcher.coordinator_endpoint              │
│   1045 │   │                                                                                     │
│   1046 │   │   # waiting service ready                                                           │
│                                                                                                  │
│ /opt/gs-for-build/python/graphscope/deploy/kubernetes/cluster.py:533 in start                    │
│                                                                                                  │
│   530 │   │   │   self._create_services()                                                        │
│   531 │   │   │   time.sleep(1)                                                                  │
│   532 │   │   │                                                                                  │
│ ❱ 533 │   │   │   self._waiting_for_services_ready()                                             │
│   534 │   │   │                                                                                  │
│   535 │   │   │   self._coordinator_endpoint = self._get_coordinator_endpoint()                  │
│   536 │   │   │   logger.info(                                                                   │
│                                                                                                  │
│ /opt/gs-for-build/python/graphscope/deploy/kubernetes/cluster.py:464 in                          │
│ _waiting_for_services_ready                                                                      │
│                                                                                                  │
│   461 │   │   )                                                                                  │
│   462 │   │   self._coordinator_pods_watcher.start()                                             │
│   463 │   │                                                                                      │
│ ❱ 464 │   │   if wait_for_deployment_complete(                                                   │
│   465 │   │   │   api_client=self._api_client,                                                   │
│   466 │   │   │   namespace=self._namespace,                                                     │
│   467 │   │   │   name=self._coordinator_name,                                                   │
│                                                                                                  │
│ /opt/gs-for-build/python/graphscope/deploy/kubernetes/utils.py:116 in                            │
│ wait_for_deployment_complete                                                                     │
│                                                                                                  │
│   113 │   │   │   │   │   value = tp()                                                           │
│   114 │   │   │   │   if value.__traceback__ is not tb:                                          │
│   115 │   │   │   │   │   raise value.with_traceback(tb)                                         │
│ ❱ 116 │   │   │   │   raise value                                                                │
│   117 │   │   response = app_api.read_namespaced_deployment_status(                              │
│   118 │   │   │   namespace=namespace, name=name                                                 │
│   119 │   │   )                                                                                  │
│                                                                                                  │
│ /opt/gs-for-build/python/graphscope/deploy/kubernetes/utils.py:198 in _stream_event_impl         │
│                                                                                                  │
│   195 │   │   │   │   │   │   │   error_message.append(f"Kubernetes event error: {msg}")         │
│   196 │   │   │   │   if error_message:                                                          │
│   197 │   │   │   │   │   try:                                                                   │
│ ❱ 198 │   │   │   │   │   │   raise K8sError('Error when launching Coordinator on kubernetes c   │
│   199 │   │   │   │   │   except:  # noqa: E722,B110, pylint: disable=bare-except                │
│   200 │   │   │   │   │   │   self._exc_info = sys.exc_info()                                    │
│   201 │   │   │   │   │   │   return                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
K8sError: Error when launching Coordinator on kubernetes cluster:
Kubernetes event error: coordinator-ruxbdy-b6674d89-vddbz: Failed to pull image "registry.cn-hongkong.aliyuncs.com/graphscope/coordinator:0.22.0":
rpc error: code = Unknown desc = Error response from daemon: manifest for registry.cn-hongkong.aliyuncs.com/graphscope/coordinator:0.22.0 not found:
manifest unknown: manifest unknown
Kubernetes event error: coordinator-ruxbdy-b6674d89-vddbz: Error: ErrImagePull
Kubernetes event error: coordinator-ruxbdy-b6674d89-vddbz: Error: ImagePullBackOff

In [2]: exit
gsbot@ ➜  python git:(ht/hungs-up) ✗
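
The traceback above suggests the general shape of the fix: the event-watching thread captures the exception with `sys.exc_info()`, and the waiting loop re-raises it instead of polling until the timeout. The following is a minimal, self-contained sketch of that pattern, not the actual GraphScope code. `WatcherError`, `EventWatcher`, and `wait_until_ready` are hypothetical names, and the simple substring check used to spot failing events is an assumption made for illustration.

```python
import sys
import threading
import time


class WatcherError(RuntimeError):
    """Hypothetical stand-in for the K8sError shown in the traceback."""


class EventWatcher:
    """Hypothetical watcher that scans event messages on a background thread."""

    def __init__(self, events):
        self._events = list(events)
        self._exc_info = None  # set by the worker thread when an error is seen
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    @property
    def exc_info(self):
        return self._exc_info

    def _run(self):
        # Assumption: an event counts as an error if it mentions "Error" or "Failed".
        errors = [
            f"Kubernetes event error: {msg}"
            for msg in self._events
            if "Error" in msg or "Failed" in msg
        ]
        if errors:
            try:
                raise WatcherError(
                    "Error when launching Coordinator:\n" + "\n".join(errors)
                )
            except WatcherError:
                # Keep the exception and its traceback for the waiting thread.
                self._exc_info = sys.exc_info()


def wait_until_ready(watcher, is_ready, timeout_seconds=5.0, poll_interval=0.05):
    """Poll is_ready(), but fail fast if the watcher recorded an exception."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if watcher.exc_info is not None:
            _, value, tb = watcher.exc_info
            # Re-raise in the caller's thread so the user sees the real cause.
            raise value.with_traceback(tb)
        if is_ready():
            return True
        time.sleep(poll_interval)
    return False
```

With this shape, a call such as `wait_until_ready(watcher, lambda: False)` raises `WatcherError` shortly after the watcher sees a failing event, rather than blocking for the full timeout.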

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
@sighingnow sighingnow merged commit 3cf0798 into alibaba:main May 25, 2023
23 of 24 checks passed
@sighingnow sighingnow deleted the ht/hungs-up branch May 25, 2023 12:38
Development

Successfully merging this pull request may close these issues.

[BUG] graphscope hungs up when exit the python interpreter if error occurs during creating (k8s) sessions
2 participants