
[k8s] Support deploying vineyard cluster independently and deploying the engines on demand #2710

Merged: 1 commit merged into alibaba:main from the integrate-vineyardctl branch on Jun 2, 2023

Conversation

@dashanji (Collaborator) commented on May 18, 2023

What do these changes do?

🤖 Generated by Copilot at 84e0afe

This pull request adds a new feature to graphscope that allows deploying engines on demand in a lazy mode, and refactors some existing code to improve the quality and consistency. It modifies the EngineCluster, KubernetesClusterLauncher, and Session classes, and adds a new test case and fixture to test_demo_script.py.

Related issue number

Fixes some parts of #2539

@dashanji (Collaborator, Author) commented on May 18, 2023

Example:

Create the session; if the vineyard cluster does not exist, it will be created, and the GAE engine will be deployed alongside it:

sess = graphscope.session(k8s_vineyard_deployment='vineyardd-sample', mode='lazy')
$ kubectl get po
NAME                                  READY   STATUS    RESTARTS   AGE
coordinator-dtpouc-6bcb65654f-fvmwt   1/1     Running   0          23s
gs-analytical-dtpouc-0                1/1     Running   0          12s
vineyardd-sample-64dccb4597-jhgvn     1/1     Running   0          16s
vineyardd-sample-etcd-0               1/1     Running   0          20s

Load the graph:

g = load_ogbn_mag(sess, "/testingdata/ogbn_mag_small")

When calling gremlin, the GIE engine will be deployed on the vineyard nodes:

interactive = sess.gremlin(g)
$ kubectl get po
NAME                                              READY   STATUS    RESTARTS   AGE
coordinator-dtpouc-6bcb65654f-fvmwt               1/1     Running   0          3m40s
gs-analytical-dtpouc-0                            1/1     Running   0          3m29s
gs-interactive-dtpouc-0                           1/1     Running   0          22s
gs-interactive-frontend-dtpouc-7c45778f7c-pxd7k   1/1     Running   0          22s
vineyardd-sample-64dccb4597-jhgvn                 1/1     Running   0          3m33s
vineyardd-sample-etcd-0                           1/1     Running   0          3m37s

When calling graphlearn, the GLE engine is also deployed on the vineyard nodes:

lg = sess.graphlearn(g)
$ kubectl get po
NAME                                              READY   STATUS    RESTARTS   AGE
coordinator-dtpouc-6bcb65654f-fvmwt               1/1     Running   0          4m6s
gs-analytical-dtpouc-0                            1/1     Running   0          3m55s
gs-interactive-dtpouc-0                           1/1     Running   0          48s
gs-interactive-frontend-dtpouc-7c45778f7c-pxd7k   1/1     Running   0          48s
gs-learning-dtpouc-0                              1/1     Running   0          4s
vineyardd-sample-64dccb4597-jhgvn                 1/1     Running   0          3m59s
vineyardd-sample-etcd-0                           1/1     Running   0          4m3s

@dashanji force-pushed the integrate-vineyardctl branch 2 times, most recently from 11a3c23 to 23a327e on May 18, 2023
@sighingnow (Collaborator) left a comment:

See comments about with_vineyard.

Basically LGTM 👍

@@ -54,6 +54,7 @@ def __init__(
engine_cpu,
engine_mem,
engine_pod_node_selector,
engine_prefix,
Collaborator:

engine_prefix -> engine_pod_prefix

    self._config_params,
)

self._print_session_info()
Collaborator:

_print_session_info -> _log_session_info

f"Not a valid engine name: {item}, valid engines are {valid_engines}"
)
if item == "vineyard":
self._with_vineyard = True
Collaborator:

You can add something like enable_vineyard_scheduler=True/False rather than initializing a variable _with_vineyard=False and matching the name "vineyard" in the enabled engines...

When you see with_vineyard=False, you would wonder how graphscope works without vineyard.
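A minimal sketch of this suggestion (the flag name enable_vineyard_scheduler and the constructor shape are illustrative, not the actual graphscope API):

# Hypothetical sketch: pass an explicit flag instead of inferring behavior
# from whether "vineyard" appears in the enabled-engines string.
class KubernetesClusterLauncher:
    def __init__(self, enabled_engines="", enable_vineyard_scheduler=False):
        # engines come in as a comma-separated string, e.g. "analytical,interactive"
        self._enabled_engines = [
            e.strip() for e in enabled_engines.split(",") if e.strip()
        ]
        # the vineyard container is always included; this flag only controls
        # whether the vineyard scheduler is used, so its meaning stays clear
        self._enable_vineyard_scheduler = enable_vineyard_scheduler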

Collaborator:

I agree. Vineyard container is always included.

Collaborator:

rename to something like create_vineyard_cluster_if_not_exists

@siyuan0322 (Collaborator) left a comment:

As discussed, it's better to split the vineyard cluster creation and the lazy engine launching into two different PRs, as they are orthogonal features.


instance_id=${coordinator_name#*-}

pod_ips=$(kubectl get pod -lapp.kubernetes.io/component=engine,app.kubernetes.io/instance=${instance_id} -o jsonpath='{.items[*].status.podIP}')
pod_names=$(kubectl get pod -lapp.kubernetes.io/component=engine,app.kubernetes.io/instance=${instance_id} -oname | grep ${filter_name} | awk -F '/' '{print $2}'| xargs)
Collaborator:

Could this line get rid of awk, using only jsonpath?

f"Not a valid engine name: {item}, valid engines are {valid_engines}"
)
if item == "vineyard":
self._with_vineyard = True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to something like create_vineyard_cluster_if_not_exists

@dashanji (Collaborator, Author) commented:

Thanks for the advice, I will add a flag like create_vineyard_cluster_if_not_exists later. As for launching engines on demand, I think we'd better have a discussion. @sighingnow @siyuan0322

@dashanji added this to In progress in GraphScope Sprint April-May on May 23, 2023
@dashanji force-pushed the integrate-vineyardctl branch 3 times, most recently from aa9bac8 to 93196da on May 23, 2023
@codecov-commenter commented on May 23, 2023

Codecov Report

Merging #2710 (73915a4) into main (a119c23) will decrease coverage by 5.95%.
The diff coverage is 6.84%.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #2710      +/-   ##
==========================================
- Coverage   72.96%   67.02%   -5.95%     
==========================================
  Files          99       99              
  Lines       10381    10451      +70     
==========================================
- Hits         7575     7005     -570     
- Misses       2806     3446     +640     
Impacted Files Coverage Δ
python/graphscope/deploy/kubernetes/cluster.py 23.22% <0.00%> (-52.97%) ⬇️
...on/graphscope/tests/kubernetes/test_demo_script.py 0.00% <0.00%> (-72.16%) ⬇️
python/graphscope/client/session.py 72.46% <80.00%> (-3.36%) ⬇️
python/graphscope/config.py 96.07% <100.00%> (+0.07%) ⬆️

... and 11 files with indirect coverage changes


Continue to review the full report in Codecov by Sentry.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a119c23...73915a4.

@dashanji force-pushed the integrate-vineyardctl branch 2 times, most recently from 84e0afe to 22013b1 on May 24, 2023
@@ -402,7 +425,7 @@ def get_engine_pod_spec(self):

engine_volume_mounts = [socket_volume[2], shm_volume[2]]

-        if self._volumes and self._volumes is not None:
+        if self._volumes is not None:
Collaborator:

volumes could be an empty dict {}; better to keep this.
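A quick illustration of the concern (plain Python, nothing graphscope-specific):

# {} is not None but is falsy, so the two checks behave differently.
volumes = {}
print(volumes is not None)                     # True  -> the new check enters the branch
print(bool(volumes and volumes is not None))   # False -> the original check does not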

if self._mode not in ["eager", "lazy"]:
    logger.error(
        "Invalid mode: %s, mode must be one of eager or lazy", self._mode
    )
Collaborator:

If mode is not recognized, set the mode to the default ('eager'), and also log it: logger.error("Invalid mode %s, choose from 'eager' or 'lazy'. Proceeding with default mode: 'eager'")
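A minimal sketch of the suggested fallback, based on the diff context above (not the merged code):

# Fall back to the default mode instead of only logging the error.
if self._mode not in ("eager", "lazy"):
    logger.error(
        "Invalid mode %s, choose from 'eager' or 'lazy'. "
        "Proceeding with default mode: 'eager'",
        self._mode,
    )
    self._mode = "eager"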

# external vineyard deployment. The vineyard objects are not
# shared between the engine pods, so raise an error here.
if self._mode == "lazy" and self._vineyard_deployment is None:
    raise ValueError("If the mode is lazy, the vineyard deployment must be set")
Collaborator:

Rephrase the error message to: "Lazy mode is only possible with a vineyard deployment, please add a vineyard deployment name via vineyard_deployment='vy-deploy'. Proceeding as eager mode."

And set the mode to eager instead of raising an error, continuing in eager mode.
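A sketch of the proposed behavior, again based on the diff context above (not the merged code):

# Warn and continue in eager mode rather than raising.
if self._mode == "lazy" and self._vineyard_deployment is None:
    logger.warning(
        "Lazy mode is only possible with a vineyard deployment, please add "
        "a vineyard deployment name via vineyard_deployment='vy-deploy'. "
        "Proceeding as eager mode."
    )
    self._mode = "eager"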

engine_pod_host_ip_list = getattr(self, f"_{engine_type}_pod_host_ip_list")

return (
    engine_pod_name_list is not None
Collaborator:

The default value of these lists (_{engine_type}_pod_name_list) is [], which is not None, so by default the check always passes. Is this intended?
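A quick illustration of why the check always passes (plain Python):

# An empty list is not None, so an `is not None` check cannot detect
# "no pods allocated yet".
engine_pod_name_list = []
print(engine_pod_name_list is not None)  # True, even though the list is empty
print(len(engine_pod_name_list) > 0)     # False; a length check would differ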


if not self.check_if_engine_exist(engine_type):
    self._engine_pod_prefix = f"gs-{engine_type}-".replace("_", "-")
    self._engine_cluster = self._build_engine_cluster(
Collaborator:

If I called deploy_engine multiple times, would I lose the reference to the previous cluster, since you set self._engine_cluster repeatedly?

def close_interactive_instance(self, object_id):
    pod_name_list, _, _ = self._allocate_interactive_engine()
Collaborator:

There could be multiple sets of interactive engines. I think the current design doesn't suffice.

Collaborator:

This close_interactive_instance just gets the last created instance.

if not (self._with_analytical or self._with_analytical_java):

def _allocate_analytical_engine(self):
    # check the engine flag
    if self._with_analytical and self._with_analytical_java:
Collaborator:

Don't raise errors. Technically, analytical is a subset of analytical-java, so just go ahead with analytical-java.
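A sketch of the suggested handling, mirroring the flag names in the diff context (not the merged code):

# Prefer analytical-java when both flags are set, instead of raising.
if self._with_analytical and self._with_analytical_java:
    logger.warning(
        "Both analytical and analytical-java are enabled; proceeding with "
        "analytical-java, which covers the analytical engine's functionality."
    )
    self._with_analytical = False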

@@ -308,6 +328,55 @@ def test_demo_distribute(gs_session_distributed, data_dir, modern_graph_data_dir
# GNN engine


def test_demo_with_lazy_mode(gs_session_with_lazy_mode, data_dir, modern_graph_data_dir):
Collaborator:

Since this lazy mode focuses on automated pod creation and destruction, could you add checks for the pods? launch -> check the existence and number of the specific pods; destroy (close) -> check the nonexistence of the pods.
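A hedged sketch of such a check using the official kubernetes Python client (the label selector below is illustrative; the real labels would be the ones this PR adds to the engine pods):

from kubernetes import client, config

def count_pods(label_selector, namespace="default"):
    # count pods matching a label selector in the given namespace
    config.load_kube_config()
    pods = client.CoreV1Api().list_namespaced_pod(
        namespace, label_selector=label_selector
    )
    return len(pods.items)

# after launch: the engine pods should exist
assert count_pods("app.kubernetes.io/component=engine") > 0
# after close: the engine pods should be gone
# assert count_pods("app.kubernetes.io/component=engine") == 0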

        return False
    return True

def _deploy_vineyard_deployment_if_not_exist(self):
Collaborator:

These "exists" really got me... Why check if it exists after it exists?
After some study, I think vineyard_deployment_exists is ambiguous in this context: in self.vineyard_deployment_exists() you mean the value of self.vineyard_deployment is not None; in self._check_if_vineyard_deployment_exist() you mean the vineyard deployment actually exists in the cluster (which is the more natural reading).

I think you could just use self._vineyard_deployment_name is not None and get rid of the method self.vineyard_deployment_exists().
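A sketch of the simplification (method names follow the PR's; _deploy_vineyard_deployment is a hypothetical rename of the creation step):

# Test the attribute directly instead of wrapping it in
# vineyard_deployment_exists(); keep the k8s-level check under its
# more natural name.
if (
    self._vineyard_deployment_name is not None
    and not self._check_if_vineyard_deployment_exist()
):
    self._deploy_vineyard_deployment()  # hypothetical: create the deployment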

# Set the engine pod info
setattr(self, f"_{engine_type}_pod_name_list", self._pod_name_list)
setattr(self, f"_{engine_type}_pod_ip_list", self._pod_ip_list)
setattr(self, f"_{engine_type}_pod_host_ip_list", self._pod_host_ip_list)
Collaborator:

Since you use these three variables extensively, have you considered extracting them into a higher-level data structure, like a class?
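A hypothetical sketch of such a structure (names are illustrative; the sample values mirror the pods shown earlier in this thread):

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class EnginePods:
    # groups the three parallel per-engine lists into one value
    name_list: List[str] = field(default_factory=list)
    ip_list: List[str] = field(default_factory=list)
    host_ip_list: List[str] = field(default_factory=list)

# keyed by engine type, replacing setattr(self, f"_{engine_type}_pod_name_list", ...)
engine_pods: Dict[str, EnginePods] = {}
engine_pods["interactive"] = EnginePods(
    name_list=["gs-interactive-dtpouc-0"],
    ip_list=["10.0.0.12"],        # illustrative IPs
    host_ip_list=["192.168.0.5"],
)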

@siyuan0322 (Collaborator) commented on May 24, 2023

The main logic is much clearer now, just some minor issues.

@dashanji force-pushed the integrate-vineyardctl branch 9 times, most recently from 8dc663b to 25f7424 on May 31, 2023
@dashanji force-pushed the integrate-vineyardctl branch 2 times, most recently from 57cae3e to 224c9a5 on May 31, 2023
…engine instances.

* Deploy the vineyard deployment if it does not exist.
* Add a new option for KubernetesLauncher to deploy the engines in eager mode or lazy mode.
* Split create_engine_instances into two steps:
  - Allocate the engine instances.
  - Distribute the relevant processes to the engine instances.
* Add a new pytest on kubernetes to test the lazy mode.
* Add a new label for engine pods to indicate the specific engine.
* Set the engine_selector of all engine kubernetes resources.
* Create different engine statefulsets based on object_id for gae and gle.
* Delete all engine kubernetes resources based on engine_selector for gae and gie.
* Bump up upload-artifact to v3.

Signed-off-by: Ye Cao <caoye.cao@alibaba-inc.com>
@sighingnow merged commit 699ffbe into alibaba:main on Jun 2, 2023
43 checks passed
GraphScope Sprint April-May automation moved this from In progress to Done on Jun 2, 2023