[BUG] session config k8s_coordinator_pod_node_selector Not effective #3476

Closed
JackyYangPassion opened this issue Jan 10, 2024 · 4 comments · Fixed by #3477

@JackyYangPassion
Contributor

Describe the bug
Add a label to the k8s worker node:

kubectl label nodes node-worker graphscope=1

Create a session:

import graphscope

session = graphscope.session(
    k8s_coordinator_cpu=1,
    k8s_coordinator_mem="1Gi",
    k8s_vineyard_cpu=4,
    k8s_vineyard_mem="5Gi",
    vineyard_shared_mem="5Gi",
    k8s_engine_cpu=2,
    k8s_namespace="gs-new-orc-jacky100",
    k8s_engine_mem="5Gi",
    num_workers=3,
    # Pin the coordinator and engine pods to nodes labeled graphscope=1.
    k8s_coordinator_pod_node_selector={"graphscope": "1"},
    k8s_engine_pod_node_selector={"graphscope": "1"},
    k8s_image_tag="latest",
    k8s_client_config="~/.kube/config",
)

Error log:

HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Deployment in 
version \"v1\" cannot be handled as a Deployment: json: cannot unmarshal string into Go struct field 
PodSpec.spec.template.spec.nodeSelector of type map[string]string","reason":"BadRequest","code":400}
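
The message above says the API server received `nodeSelector` as a JSON string where it expected an object mapping string keys to string values. Below is a minimal sketch of that distinction using only the standard library. It is not GraphScope's actual code: `normalize_node_selector` and `build_pod_spec` are hypothetical helpers, and the idea that the selector was serialized to a string somewhere before reaching the Deployment is an assumption, not something confirmed in this thread.

```python
import json


def normalize_node_selector(selector):
    """Return a dict[str, str] suitable for PodSpec.nodeSelector.

    Hypothetical helper: accepts either a dict or a JSON-encoded string
    (the shape the API server rejected) and always returns a mapping.
    """
    if selector is None:
        return {}
    if isinstance(selector, str):
        # Assumption: the string form is a JSON object such as '{"graphscope": "1"}'.
        selector = json.loads(selector)
    if not isinstance(selector, dict):
        raise TypeError("node selector must be a mapping of label keys to values")
    # Kubernetes requires both keys and values to be strings.
    return {str(k): str(v) for k, v in selector.items()}


def build_pod_spec(selector):
    """Build a minimal pod spec fragment (illustrative, not a full manifest)."""
    return {"nodeSelector": normalize_node_selector(selector)}


# A string-valued nodeSelector serializes as "nodeSelector": "{...}",
# which is the shape the 400 response complains about.
bad = json.dumps({"nodeSelector": json.dumps({"graphscope": "1"})})
# After normalization the field serializes as a JSON object.
good = json.dumps(build_pod_spec('{"graphscope": "1"}'))
print(bad)
print(good)
```

Comparing the two printed payloads shows the difference the Go decoder cares about: a quoted string versus an object.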
@dashanji
Collaborator

Hi, @JackyYangPassion. Thanks for the report.

Could you please provide the full error log? Thanks.

@JackyYangPassion
Contributor Author

JackyYangPassion commented Jan 10, 2024

Thanks for your reply, @dashanji.
Here is the full log from the Jupyter notebook:

2024-01-10 20:50:11,794 [INFO][cluster:235]: Launching coordinator...
2024-01-10 20:50:12,802 [INFO][cluster:414]: Stopping coordinator
2024-01-10 20:50:12,825 [INFO][cluster:434]: Stopped coordinator
2024-01-10 20:50:12,825 [INFO][cluster:414]: Stopping coordinator
2024-01-10 20:50:12,826 [INFO][cluster:434]: Stopped coordinator
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>:31                                                                                   │
│                                                                                                  │
│   28 │   │   │   │   │   │   │    k8s_coordinator_pod_node_selector={"graphscope":"1"},          │
│   29 │   │   │   │   │   │   │    k8s_engine_pod_node_selector={"graphscope":"1"},               │
│   30 │   │   │   │   │   │   │    k8s_image_tag="latest",                                        │
│ ❱ 31 │   │   │   │   │   │   │    k8s_client_config='~/.kube/config')                            │
│   32 print('========= Session created. ==========')                                              │
│   33                                                                                             │
│                                                                                                  │
│ /usr/local/python3/lib/python3.7/site-packages/graphscope/client/session.py:563 in __init__      │
│                                                                                                  │
│    560 │   │   atexit.register(self.close)                                                       │
│    561 │   │   # create and connect session                                                      │
│    562 │   │   with CaptureKeyboardInterrupt(self.close):                                        │
│ ❱  563 │   │   │   self._connect()                                                               │
│    564 │   │                                                                                     │
│    565 │   │   self._disconnected: bool = False                                                  │
│    566                                                                                           │
│                                                                                                  │
│ /usr/local/python3/lib/python3.7/site-packages/graphscope/client/session.py:909 in _connect      │
│                                                                                                  │
│    906 │   │                                                                                     │
│    907 │   │   # launching graphscope service                                                    │
│    908 │   │   if self._launcher is not None:                                                    │
│ ❱  909 │   │   │   self._launcher.start()                                                        │
│    910 │   │   │   self._coordinator_endpoint = self._launcher.coordinator_endpoint              │
│    911 │   │                                                                                     │
│    912 │   │   # waiting service ready                                                           │
│                                                                                                  │
│ /usr/local/python3/lib/python3.7/site-packages/graphscope/deploy/kubernetes/cluster.py:389 in    │
│ start                                                                                            │
│                                                                                                  │
│   386 │   │   │   self._create_namespace()                                                       │
│   387 │   │   │   self._create_role_and_binding()                                                │
│   388 │   │   │                                                                                  │
│ ❱ 389 │   │   │   self._create_services()                                                        │
│   390 │   │   │   time.sleep(1)                                                                  │
│   391 │   │   │                                                                                  │
│   392 │   │   │   self._waiting_for_services_ready()                                             │
│                                                                                                  │
│ /usr/local/python3/lib/python3.7/site-packages/graphscope/deploy/kubernetes/cluster.py:301 in    │
│ _create_services                                                                                 │
│                                                                                                  │
│   298 │   │   return args                                                                        │
│   299 │                                                                                          │
│   300 │   def _create_services(self):                                                            │
│ ❱ 301 │   │   self._create_coordinator()                                                         │
│   302 │                                                                                          │
│   303 │   def _waiting_for_services_ready(self):                                                 │
│   304 │   │   response = self._app_api.read_namespaced_deployment_status(                        │
│                                                                                                  │
│ /usr/local/python3/lib/python3.7/site-packages/graphscope/deploy/kubernetes/cluster.py:274 in    │
│ _create_coordinator                                                                              │
│                                                                                                  │
│   271 │   │                                                                                      │
│   272 │   │   deployment = coordinator.get_coordinator_deployment()                              │
│   273 │   │   response = self._app_api.create_namespaced_deployment(                             │
│ ❱ 274 │   │   │   self._namespace, deployment                                                    │
│   275 │   │   )                                                                                  │
│   276 │   │   targets.append(response)                                                           │
│   277                                                                                            │
│                                                                                                  │
│ /usr/local/python3/lib/python3.7/site-packages/kubernetes/client/api/apps_v1_api.py:353 in       │
│ create_namespaced_deployment                                                                     │
│                                                                                                  │
│    350 │   │   │   │    returns the request thread.                                              │
│    351 │   │   """                                                                               │
│    352 │   │   kwargs['_return_http_data_only'] = True                                           │
│ ❱  353 │   │   return self.create_namespaced_deployment_with_http_info(namespace, body, **kwarg  │
│    354 │                                                                                         │
│    355 │   def create_namespaced_deployment_with_http_info(self, namespace, body, **kwargs):  #  │
│    356 │   │   """create_namespaced_deployment  # noqa: E501                                     │
│                                                                                                  │
│ /usr/local/python3/lib/python3.7/site-packages/kubernetes/client/api/apps_v1_api.py:466 in       │
│ create_namespaced_deployment_with_http_info                                                      │
│                                                                                                  │
│    463 │   │   │   _return_http_data_only=local_var_params.get('_return_http_data_only'),  # no  │
│    464 │   │   │   _preload_content=local_var_params.get('_preload_content', True),              │
│    465 │   │   │   _request_timeout=local_var_params.get('_request_timeout'),                    │
│ ❱  466 │   │   │   collection_formats=collection_formats)                                        │
│    467 │                                                                                         │
│    468 │   def create_namespaced_replica_set(self, namespace, body, **kwargs):  # noqa: E501     │
│    469 │   │   """create_namespaced_replica_set  # noqa: E501                                    │
│                                                                                                  │
│ /usr/local/python3/lib/python3.7/site-packages/kubernetes/client/api_client.py:353 in call_api   │
│                                                                                                  │
│   350 │   │   │   │   │   │   │   │      body, post_params, files,                               │
│   351 │   │   │   │   │   │   │   │      response_type, auth_settings,                           │
│   352 │   │   │   │   │   │   │   │      _return_http_data_only, collection_formats,             │
│ ❱ 353 │   │   │   │   │   │   │   │      _preload_content, _request_timeout, _host)              │
│   354 │   │                                                                                      │
│   355 │   │   return self.pool.apply_async(self.__call_api, (resource_path,                      │
│   356 │   │   │   │   │   │   │   │   │   │   │   │   │      method, path_params,                │
│                                                                                                  │
│ /usr/local/python3/lib/python3.7/site-packages/kubernetes/client/api_client.py:184 in __call_api │
│                                                                                                  │
│   181 │   │   │   method, url, query_params=query_params, headers=header_params,                 │
│   182 │   │   │   post_params=post_params, body=body,                                            │
│   183 │   │   │   _preload_content=_preload_content,                                             │
│ ❱ 184 │   │   │   _request_timeout=_request_timeout)                                             │
│   185 │   │                                                                                      │
│   186 │   │   self.last_response = response_data                                                 │
│   187                                                                                            │
│                                                                                                  │
│ /usr/local/python3/lib/python3.7/site-packages/kubernetes/client/api_client.py:397 in request    │
│                                                                                                  │
│   394 │   │   │   │   │   │   │   │   │   │    post_params=post_params,                          │
│   395 │   │   │   │   │   │   │   │   │   │    _preload_content=_preload_content,                │
│   396 │   │   │   │   │   │   │   │   │   │    _request_timeout=_request_timeout,                │
│ ❱ 397 │   │   │   │   │   │   │   │   │   │    body=body)                                        │
│   398 │   │   elif method == "PUT":                                                              │
│   399 │   │   │   return self.rest_client.PUT(url,                                               │
│   400 │   │   │   │   │   │   │   │   │   │   query_params=query_params,                         │
│                                                                                                  │
│ /usr/local/python3/lib/python3.7/site-packages/kubernetes/client/rest.py:285 in POST             │
│                                                                                                  │
│   282 │   │   │   │   │   │   │   post_params=post_params,                                       │
│   283 │   │   │   │   │   │   │   _preload_content=_preload_content,                             │
│   284 │   │   │   │   │   │   │   _request_timeout=_request_timeout,                             │
│ ❱ 285 │   │   │   │   │   │   │   body=body)                                                     │
│   286 │                                                                                          │
│   287 │   def PUT(self, url, headers=None, query_params=None, post_params=None,                  │
│   288 │   │   │   body=None, _preload_content=True, _request_timeout=None):                      │
│                                                                                                  │
│ /usr/local/python3/lib/python3.7/site-packages/kubernetes/client/rest.py:238 in request          │
│                                                                                                  │
│   235 │   │   │   logger.debug("response body: %s", r.data)                                      │
│   236 │   │                                                                                      │
│   237 │   │   if not 200 <= r.status <= 299:                                                     │
│ ❱ 238 │   │   │   raise ApiException(http_resp=r)                                                │
│   239 │   │                                                                                      │
│   240 │   │   return r                                                                           │
│   241                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Audit-Id': '6966ccb2-e7da-461a-a521-e4864dda18c4', 'Cache-Control': 
'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 
'0c2b55b8-02df-4c93-956c-e04dc793d0cb', 'X-Kubernetes-Pf-Prioritylevel-Uid': 
'0ddb7b8c-60c6-44e5-ac99-c6e7df6626ae', 'Date': 'Wed, 10 Jan 2024 12:50:11 GMT', 'Content-Length': '295'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Deployment in 
version \"v1\" cannot be handled as a Deployment: json: cannot unmarshal string into Go struct field 
PodSpec.spec.template.spec.nodeSelector of type map[string]string","reason":"BadRequest","code":400}

@dashanji
Collaborator

@JackyYangPassion Thanks, we will try to reproduce the bug.

Contributor

/cc @yecol @sighingnow, this issue/PR has had no activity for a long time. Please help review its status and assign people to work on it.

siyuan0322 pushed a commit that referenced this issue Jun 11, 2024
…3477)

## What do these changes do?

As titled.

During testing, I found that not all pods are scheduled to the same node.

```yaml
NAME                                             READY   STATUS    RESTARTS   AGE     IP            NODE           NOMINATED NODE   READINESS GATES
coordinator-ozgjbb-6c57549cf7-hhwqt              1/1     Running   0          4m2s    10.244.1.85   kind-worker    <none>           <none>
gs-engine-ozgjbb-0                               3/3     Running   0          3m58s   10.244.1.86   kind-worker    <none>           <none>
gs-engine-ozgjbb-1                               3/3     Running   0          114s    10.244.1.87   kind-worker    <none>           <none>
gs-engine-ozgjbb-2                               3/3     Running   0          108s    10.244.1.88   kind-worker    <none>           <none>
gs-engine-ozgjbb-ozgjbb-vineyard-etcd-0          1/1     Running   0          3m58s   10.244.3.71   kind-worker2   <none>           <none>
gs-interactive-frontend-ozgjbb-8d996bc8b-ctn6x   1/1     Running   0          3m58s   10.244.3.72   kind-worker2   <none>           <none>
```

We should support adding the node selector to the frontend and vineyard
etcd as well.
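
A small sketch of how the listing above can be checked mechanically: it parses `kubectl get pods -o wide` style output and reports pods that are not on the expected node. The function names are hypothetical, and treating `kind-worker` as the node carrying the selector label is an assumption; the listing itself does not say which node was labeled.

```python
# Rows copied from the listing above (whitespace collapsed).
LISTING = """\
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coordinator-ozgjbb-6c57549cf7-hhwqt 1/1 Running 0 4m2s 10.244.1.85 kind-worker <none> <none>
gs-engine-ozgjbb-0 3/3 Running 0 3m58s 10.244.1.86 kind-worker <none> <none>
gs-engine-ozgjbb-1 3/3 Running 0 114s 10.244.1.87 kind-worker <none> <none>
gs-engine-ozgjbb-2 3/3 Running 0 108s 10.244.1.88 kind-worker <none> <none>
gs-engine-ozgjbb-ozgjbb-vineyard-etcd-0 1/1 Running 0 3m58s 10.244.3.71 kind-worker2 <none> <none>
gs-interactive-frontend-ozgjbb-8d996bc8b-ctn6x 1/1 Running 0 3m58s 10.244.3.72 kind-worker2 <none> <none>
"""


def pods_by_node(listing):
    """Map pod name -> node name from a `kubectl get pods -o wide` table.

    Assumes RESTARTS is a single token (as in the listing above), so the
    NODE column is always the seventh whitespace-separated field.
    """
    placement = {}
    for row in listing.strip().splitlines()[1:]:  # skip the header row
        fields = row.split()
        placement[fields[0]] = fields[6]
    return placement


def misplaced_pods(listing, expected_node):
    """Return the sorted names of pods scheduled somewhere other than expected_node."""
    return sorted(
        name for name, node in pods_by_node(listing).items() if node != expected_node
    )


# Assumption: kind-worker is the labeled node.
for name in misplaced_pods(LISTING, "kind-worker"):
    print(name)
```

Under that assumption the check flags only the vineyard etcd pod and the interactive frontend pod, which matches the two components named above.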

## Related issue number

Fixes #3476

---------

Signed-off-by: Ye Cao <caoye.cao@alibaba-inc.com>