[k8s] Support deploying vineyard cluster independently and deploying the engines on demand #2710
Conversation
Example: if the vineyard cluster does not exist, it will be created, and the vineyard cluster and the GAE engine will be deployed.

sess = graphscope.session(k8s_vineyard_deployment='vineyardd-sample', mode='lazy')

$ kubectl get po
NAME                                              READY   STATUS    RESTARTS   AGE
coordinator-dtpouc-6bcb65654f-fvmwt               1/1     Running   0          23s
gs-analytical-dtpouc-0                            1/1     Running   0          12s
vineyardd-sample-64dccb4597-jhgvn                 1/1     Running   0          16s
vineyardd-sample-etcd-0                           1/1     Running   0          20s

Load the graph:

g = load_ogbn_mag(sess, "/testingdata/ogbn_mag_small")

When gremlin is called, the GIE engine is deployed on the vineyard nodes.

interactive = sess.gremlin(g)

$ kubectl get po
NAME                                              READY   STATUS    RESTARTS   AGE
coordinator-dtpouc-6bcb65654f-fvmwt               1/1     Running   0          3m40s
gs-analytical-dtpouc-0                            1/1     Running   0          3m29s
gs-interactive-dtpouc-0                           1/1     Running   0          22s
gs-interactive-frontend-dtpouc-7c45778f7c-pxd7k   1/1     Running   0          22s
vineyardd-sample-64dccb4597-jhgvn                 1/1     Running   0          3m33s
vineyardd-sample-etcd-0                           1/1     Running   0          3m37s

When graphlearn is called, the GLE engine is likewise deployed on the vineyard nodes.

lg = sess.graphlearn(g)

$ kubectl get po
NAME                                              READY   STATUS    RESTARTS   AGE
coordinator-dtpouc-6bcb65654f-fvmwt               1/1     Running   0          4m6s
gs-analytical-dtpouc-0                            1/1     Running   0          3m55s
gs-interactive-dtpouc-0                           1/1     Running   0          48s
gs-interactive-frontend-dtpouc-7c45778f7c-pxd7k   1/1     Running   0          48s
gs-learning-dtpouc-0                              1/1     Running   0          4s
vineyardd-sample-64dccb4597-jhgvn                 1/1     Running   0          3m59s
vineyardd-sample-etcd-0                           1/1     Running   0          4m3s
Force-pushed from 11a3c23 to 23a327e
See comments about with_vineyard. Basically LGTM 👍
@@ -54,6 +54,7 @@ def __init__(
    engine_cpu,
    engine_mem,
    engine_pod_node_selector,
    engine_prefix,
engine_prefix -> engine_pod_prefix
python/graphscope/client/session.py
Outdated
        self._config_params,
    )

    self._print_session_info()
_print_session_info -> _log_session_info
python/graphscope/client/session.py
Outdated
        f"Not a valid engine name: {item}, valid engines are {valid_engines}"
    )
if item == "vineyard":
    self._with_vineyard = True
You could add something like enable_vineyard_scheduler=True/False rather than initializing a variable _with_vineyard=False and matching the name "vineyard" in the enabled engines. When you see with_vineyard=False, you would wonder how graphscope works without vineyard.
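A minimal sketch of this suggestion (all names here are hypothetical, not GraphScope's actual API): pass an explicit enable_vineyard_scheduler flag through instead of inferring it by matching the string "vineyard" in the enabled-engine list.

```python
# Hypothetical sketch: an explicit flag instead of string-matching "vineyard".
VALID_ENGINES = {"analytical", "interactive", "learning"}


def parse_enabled_engines(engines_str, enable_vineyard_scheduler=False):
    """Validate a comma-separated engine list; the vineyard scheduler is an
    explicit, separate option rather than a magic engine name."""
    engines = []
    for item in engines_str.split(","):
        item = item.strip()
        if item not in VALID_ENGINES:
            raise ValueError(
                f"Not a valid engine name: {item}, valid engines are {VALID_ENGINES}"
            )
        engines.append(item)
    return engines, enable_vineyard_scheduler
```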
I agree. The vineyard container is always included.
rename to something like create_vineyard_cluster_if_not_exists
As discussed, it's better to split the create-vineyard-cluster part and the launch-engines-lazily part into two different PRs, as they are orthogonal features.
instance_id=${coordinator_name#*-}

pod_ips=$(kubectl get pod -lapp.kubernetes.io/component=engine,app.kubernetes.io/instance=${instance_id} -o jsonpath='{.items[*].status.podIP}')
pod_names=$(kubectl get pod -lapp.kubernetes.io/component=engine,app.kubernetes.io/instance=${instance_id} -oname | grep ${filter_name} | awk -F '/' '{print $2}' | xargs)
could this line get rid of awk, using only jsonpath?
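A hedged sketch of that alternative: `-o jsonpath='{.items[*].metadata.name}'` already emits bare pod names (no `pod/` prefix), so the awk step becomes unnecessary; grep still applies the name filter. The kubectl output is simulated below since this snippet has no cluster to query.

```shell
# Simulated output of:
#   kubectl get pod -l... -o jsonpath='{.items[*].metadata.name}'
# (bare names, no "pod/" prefix, so no awk needed)
jsonpath_output="gs-analytical-dtpouc-0 gs-interactive-dtpouc-0"
filter_name="interactive"
# Unquoted expansion word-splits the space-separated names onto lines.
pod_names=$(printf '%s\n' $jsonpath_output | grep "${filter_name}" | xargs)
echo "$pod_names"
```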
python/graphscope/client/session.py
Outdated
        f"Not a valid engine name: {item}, valid engines are {valid_engines}"
    )
if item == "vineyard":
    self._with_vineyard = True
rename to something like create_vineyard_cluster_if_not_exists
Thanks for the advice, I will add a flag like that.
Force-pushed from aa9bac8 to 93196da
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #2710      +/-   ##
==========================================
- Coverage   72.96%   67.02%    -5.95%
==========================================
  Files          99       99
  Lines       10381    10451      +70
==========================================
- Hits         7575     7005     -570
- Misses       2806     3446     +640

... and 11 files with indirect coverage changes. Continue to review the full report in Codecov by Sentry.
Force-pushed from 84e0afe to 22013b1
@@ -402,7 +425,7 @@ def get_engine_pod_spec(self):

    engine_volume_mounts = [socket_volume[2], shm_volume[2]]

-   if self._volumes and self._volumes is not None:
+   if self._volumes is not None:
volumes could be empty dict {}
. better keep this.
if self._mode not in ["eager", "lazy"]:
    logger.error(
        "Invalid mode: %s, mode must be one of eager or lazy", self._mode
    )
If the mode is not recognized, set it to the default ('eager'), and also log it: logger.error("Invalid mode %s, choose from 'eager' or 'lazy'. Proceeding with default mode: 'eager'")
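This suggestion can be sketched as a small helper (the name normalize_mode is hypothetical, not from the PR):

```python
import logging

logger = logging.getLogger(__name__)


def normalize_mode(mode):
    """Fall back to the default mode instead of only logging the error."""
    if mode not in ("eager", "lazy"):
        logger.error(
            "Invalid mode %s, choose from 'eager' or 'lazy'. "
            "Proceeding with default mode: 'eager'",
            mode,
        )
        return "eager"
    return mode
```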
# external vineyard deployment. The vineyard objects are not
# shared between the engine pods, so raise an error here.
if self._mode == "lazy" and self._vineyard_deployment is None:
    raise ValueError("If the mode is lazy, the vineyard deployment must be set")
Rephrase the error message to: "Lazy mode is only possible with a vineyard deployment, please add a vineyard deployment name by vineyard_deployment='vy-deploy'. Proceeding as eager mode." And set the mode to eager instead of raising an error; continue in eager mode.
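A minimal sketch of this degrade-instead-of-raise behavior (helper name is hypothetical):

```python
import logging

logger = logging.getLogger(__name__)


def resolve_mode(mode, vineyard_deployment):
    """Lazy mode requires an external vineyard deployment; degrade to eager
    with a warning rather than raising a ValueError."""
    if mode == "lazy" and vineyard_deployment is None:
        logger.warning(
            "Lazy mode is only possible with a vineyard deployment, please add "
            "a vineyard deployment name by vineyard_deployment='vy-deploy'. "
            "Proceeding as eager mode."
        )
        return "eager"
    return mode
```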
engine_pod_host_ip_list = getattr(self, f"_{engine_type}_pod_host_ip_list")

return (
    engine_pod_name_list is not None
The default value of these lists (_{engine_type}_pod_name_list) is [], which is not None, so by default the check always passes. Is this intended?
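The distinction can be shown with a truthiness check, which covers both the [] default and None (a hedged sketch, not the PR's actual code):

```python
def engine_exists(pod_name_list):
    """An empty list (the default) is falsy, so bool() treats both [] and
    None as "engine not deployed", unlike an `is not None` comparison."""
    return bool(pod_name_list)
```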
if not self.check_if_engine_exist(engine_type):
    self._engine_pod_prefix = f"gs-{engine_type}-".replace("_", "-")
    self._engine_cluster = self._build_engine_cluster(
If I called deploy_engine multiple times, would I lose the reference to the previous cluster, since you set self._engine_cluster repeatedly?
def close_interactive_instance(self, object_id):
    pod_name_list, _, _ = self._allocate_interactive_engine()
There could be multiple sets of interactive engines; I think the current design doesn't suffice.
This close_interactive_instance just gets the last created instance.
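One way to address this (a hypothetical sketch, not the PR's implementation) is to track each interactive engine set by object_id, so closing can target a specific instance rather than the last one created:

```python
class InteractiveInstanceRegistry:
    """Hypothetical registry keyed by object_id, so multiple interactive
    engine sets can coexist and be closed individually."""

    def __init__(self):
        self._instances = {}

    def register(self, object_id, pod_name_list):
        self._instances[object_id] = pod_name_list

    def close(self, object_id):
        # Pops the pods of the requested instance, not the last-created one.
        return self._instances.pop(object_id)
```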
if not (self._with_analytical or self._with_analytical_java):

def _allocate_analytical_engine(self):
    # check the engine flag
    if self._with_analytical and self._with_analytical_java:
Don't raise errors. Technically, analytical is a subset of analytical-java, so just go ahead with analytical-java.
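That preference can be sketched as a small selector (hypothetical helper, assuming the subset relationship the reviewer states):

```python
def pick_analytical_engine(with_analytical, with_analytical_java):
    """When both flags are set, proceed with analytical-java (a superset of
    analytical) instead of raising an error."""
    if with_analytical_java:
        return "analytical-java"
    if with_analytical:
        return "analytical"
    return None
```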
@@ -308,6 +328,55 @@ def test_demo_distribute(gs_session_distributed, data_dir, modern_graph_data_dir):
# GNN engine

def test_demo_with_lazy_mode(gs_session_with_lazy_mode, data_dir, modern_graph_data_dir):
Since lazy mode focuses on automated pod creation and destruction, could you add checks for the pods? On launch, check the existence and number of the specific pods; on destroy (close), check that the pods no longer exist.
        return False
    return True

def _deploy_vineyard_deployment_if_not_exist(self):
These "exists" really got me: why check if it exists after it exists? After some study, I think vineyard_deployment_exists is ambiguous in this context. In self.vineyard_deployment_exists() you mean that the value of self.vineyard_deployment is not None, while in self._check_if_vineyard_deployment_exist() you mean that the vineyard deployment of pods actually exists (the more natural reading). I think you could just use self._vineyard_deployment_name is not None and get rid of the method self.vineyard_deployment_exists().
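The disambiguation could look like this (a hypothetical sketch; the cluster check is injected as a callable so the two meanings stay separate):

```python
def should_deploy_vineyard(vineyard_deployment_name, deployment_exists):
    """Use the deployment name directly instead of a wrapper method: a name
    of None means no external vineyard was requested; deployment_exists(name)
    checks whether the vineyard pods are actually running in the cluster."""
    if vineyard_deployment_name is None:
        return False
    return not deployment_exists(vineyard_deployment_name)
```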
# Set the engine pod info
setattr(self, f"_{engine_type}_pod_name_list", self._pod_name_list)
setattr(self, f"_{engine_type}_pod_ip_list", self._pod_ip_list)
setattr(self, f"_{engine_type}_pod_host_ip_list", self._pod_host_ip_list)
Since you use these 3 variables extensively, have you considered extracting them into a higher-level data structure, like a class?
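For instance, a dataclass bundling the three per-engine lists could replace the setattr/getattr pattern with formatted attribute names (a hypothetical sketch, not the PR's code):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EnginePodInfo:
    """Bundles the per-engine pod lists; a dict keyed by engine_type can
    then replace dynamic attributes like _{engine_type}_pod_name_list."""
    pod_name_list: List[str] = field(default_factory=list)
    pod_ip_list: List[str] = field(default_factory=list)
    pod_host_ip_list: List[str] = field(default_factory=list)
```

Usage would then be e.g. `self._pods["interactive"] = EnginePodInfo(names, ips, host_ips)` instead of three setattr calls.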
The main logic is much clearer now; just some minor issues.
Force-pushed from 8dc663b to 25f7424
Force-pushed from 57cae3e to 224c9a5
…engine instances.

* Deploy the vineyard deployment if it does not exist.
* Add a new option for KubernetesLauncher to deploy the engines in eager mode or lazy mode.
* Split create_engine_instances into two steps:
  - Allocate the engine instances.
  - Distribute the relevant process to the engine instances.
* Add a new pytest on kubernetes to test the lazy mode.
* Add a new label for engine pods to indicate the specific engine.
* Set the engine_selector of all engine kubernetes resources.
* Create different engine statefulsets based on object_id for gae and gle.
* Delete all engine kubernetes resources based on engine_selector for gae and gie.
* Bump up upload-artifact to v3.

Signed-off-by: Ye Cao <caoye.cao@alibaba-inc.com>
Force-pushed from 224c9a5 to 13dda85
What do these changes do?
🤖 Generated by Copilot at 84e0afe
This pull request adds a new feature to graphscope that allows deploying engines on demand in a lazy mode, and refactors some existing code to improve quality and consistency. It modifies the EngineCluster, KubernetesClusterLauncher, and Session classes, and adds a new test case and fixture to test_demo_script.py.

Related issue number
Fixes some parts of #2539