[k8s] Support deploying vineyard cluster independently and deploying the engines on demand #2710
Conversation
Example: if the vineyard cluster does not exist, it will be created, and the vineyard cluster and the GAE engine will be deployed.

sess = graphscope.session(k8s_vineyard_deployment='vineyardd-sample', mode='lazy')

$ kubectl get po
NAME                                              READY   STATUS    RESTARTS   AGE
coordinator-dtpouc-6bcb65654f-fvmwt               1/1     Running   0          23s
gs-analytical-dtpouc-0                            1/1     Running   0          12s
vineyardd-sample-64dccb4597-jhgvn                 1/1     Running   0          16s
vineyardd-sample-etcd-0                           1/1     Running   0          20s

Load the graph:

g = load_ogbn_mag(sess, "/testingdata/ogbn_mag_small")

When gremlin is called, the GIE engine is deployed on the vineyard nodes.

interactive = sess.gremlin(g)

$ kubectl get po
NAME                                              READY   STATUS    RESTARTS   AGE
coordinator-dtpouc-6bcb65654f-fvmwt               1/1     Running   0          3m40s
gs-analytical-dtpouc-0                            1/1     Running   0          3m29s
gs-interactive-dtpouc-0                           1/1     Running   0          22s
gs-interactive-frontend-dtpouc-7c45778f7c-pxd7k   1/1     Running   0          22s
vineyardd-sample-64dccb4597-jhgvn                 1/1     Running   0          3m33s
vineyardd-sample-etcd-0                           1/1     Running   0          3m37s

When graphlearn is called, the GLE engine is likewise deployed on the vineyard nodes.

lg = sess.graphlearn(g)

$ kubectl get po
NAME                                              READY   STATUS    RESTARTS   AGE
coordinator-dtpouc-6bcb65654f-fvmwt               1/1     Running   0          4m6s
gs-analytical-dtpouc-0                            1/1     Running   0          3m55s
gs-interactive-dtpouc-0                           1/1     Running   0          48s
gs-interactive-frontend-dtpouc-7c45778f7c-pxd7k   1/1     Running   0          48s
gs-learning-dtpouc-0                              1/1     Running   0          4s
vineyardd-sample-64dccb4597-jhgvn                 1/1     Running   0          3m59s
vineyardd-sample-etcd-0                           1/1     Running   0          4m3s
Force-pushed from 11a3c23 to 23a327e
See comments about with_vineyard. Basically LGTM 👍
@@ -54,6 +54,7 @@ def __init__(
    engine_cpu,
    engine_mem,
    engine_pod_node_selector,
    engine_prefix,
engine_prefix -> engine_pod_prefix
python/graphscope/client/session.py
Outdated
        self._config_params,
    )

    self._print_session_info()
_print_session_info -> _log_session_info
python/graphscope/client/session.py
Outdated
        f"Not a valid engine name: {item}, valid engines are {valid_engines}"
    )
if item == "vineyard":
    self._with_vineyard = True
You could add something like enable_vineyard_scheduler=True/False rather than initializing a variable _with_vineyard=False and matching the name "vineyard" in the enabled engines. When you see with_vineyard=False, you would wonder how graphscope works without vineyard.
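A minimal sketch of this suggestion (all names here are hypothetical, not GraphScope's actual API): pass an explicit enable_vineyard_scheduler flag through instead of inferring it by matching the string "vineyard" in the enabled-engine list.

```python
# Hypothetical sketch: an explicit flag instead of string-matching "vineyard".
VALID_ENGINES = {"analytical", "interactive", "learning"}


def parse_enabled_engines(engines_str, enable_vineyard_scheduler=False):
    """Validate a comma-separated engine list; the vineyard scheduler is an
    explicit, separate option rather than a magic engine name."""
    engines = []
    for item in engines_str.split(","):
        item = item.strip()
        if item not in VALID_ENGINES:
            raise ValueError(
                f"Not a valid engine name: {item}, valid engines are {VALID_ENGINES}"
            )
        engines.append(item)
    return engines, enable_vineyard_scheduler
```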
I agree. The vineyard container is always included.
rename to something like create_vineyard_cluster_if_not_exists
As discussed, it's better to split the create-vineyard-cluster part and the launch-engines-lazily part into two different PRs, as they are orthogonal features.
instance_id=${coordinator_name#*-}

pod_ips=$(kubectl get pod -lapp.kubernetes.io/component=engine,app.kubernetes.io/instance=${instance_id} -o jsonpath='{.items[*].status.podIP}')
pod_names=$(kubectl get pod -lapp.kubernetes.io/component=engine,app.kubernetes.io/instance=${instance_id} -oname | grep ${filter_name} | awk -F '/' '{print $2}' | xargs)
could this line get rid of awk, using only jsonpath?
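A hedged sketch of that alternative: `-o jsonpath='{.items[*].metadata.name}'` already emits bare pod names (no `pod/` prefix), so the awk step becomes unnecessary; grep still applies the name filter. The kubectl output is simulated below since this snippet has no cluster to query.

```shell
# Simulated output of:
#   kubectl get pod -l... -o jsonpath='{.items[*].metadata.name}'
# (bare names, no "pod/" prefix, so no awk needed)
jsonpath_output="gs-analytical-dtpouc-0 gs-interactive-dtpouc-0"
filter_name="interactive"
# Unquoted expansion word-splits the space-separated names onto lines.
pod_names=$(printf '%s\n' $jsonpath_output | grep "${filter_name}" | xargs)
echo "$pod_names"
```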
python/graphscope/client/session.py
Outdated
        f"Not a valid engine name: {item}, valid engines are {valid_engines}"
    )
if item == "vineyard":
    self._with_vineyard = True
rename to something like create_vineyard_cluster_if_not_exists
Thanks for the advice, I will add a flag like that.
Force-pushed from aa9bac8 to 93196da
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #2710      +/-   ##
==========================================
- Coverage   72.96%   67.02%    -5.95%
==========================================
  Files          99       99
  Lines       10381    10451      +70
==========================================
- Hits         7575     7005     -570
- Misses       2806     3446     +640

... and 11 files with indirect coverage changes. Continue to review the full report in Codecov by Sentry.
Force-pushed from 84e0afe to 22013b1
@@ -402,7 +425,7 @@ def get_engine_pod_spec(self):

    engine_volume_mounts = [socket_volume[2], shm_volume[2]]

-   if self._volumes and self._volumes is not None:
+   if self._volumes is not None:
volumes could be empty dict {}
. better keep this.
if self._mode not in ["eager", "lazy"]:
    logger.error(
        "Invalid mode: %s, mode must be one of eager or lazy", self._mode
    )
If the mode is not recognized, set it to the default ('eager'), and also log it: logger.error("Invalid mode %s, choose from 'eager' or 'lazy'. Proceeding with default mode: 'eager'")
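This suggestion can be sketched as a small helper (the name normalize_mode is hypothetical, not from the PR):

```python
import logging

logger = logging.getLogger(__name__)


def normalize_mode(mode):
    """Fall back to the default mode instead of only logging the error."""
    if mode not in ("eager", "lazy"):
        logger.error(
            "Invalid mode %s, choose from 'eager' or 'lazy'. "
            "Proceeding with default mode: 'eager'",
            mode,
        )
        return "eager"
    return mode
```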
# external vineyard deployment. The vineyard objects are not
# shared between the engine pods, so raise an error here.
if self._mode == "lazy" and self._vineyard_deployment is None:
    raise ValueError("If the mode is lazy, the vineyard deployment must be set")
Rephrase the error message to: "Lazy mode is only possible with a vineyard deployment, please add a vineyard deployment name by vineyard_deployment='vy-deploy'. Proceeding as eager mode." And set the mode to eager instead of raising an error; continue in eager mode.
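A minimal sketch of this degrade-instead-of-raise behavior (helper name is hypothetical):

```python
import logging

logger = logging.getLogger(__name__)


def resolve_mode(mode, vineyard_deployment):
    """Lazy mode requires an external vineyard deployment; degrade to eager
    with a warning rather than raising a ValueError."""
    if mode == "lazy" and vineyard_deployment is None:
        logger.warning(
            "Lazy mode is only possible with a vineyard deployment, please add "
            "a vineyard deployment name by vineyard_deployment='vy-deploy'. "
            "Proceeding as eager mode."
        )
        return "eager"
    return mode
```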
engine_pod_host_ip_list = getattr(self, f"_{engine_type}_pod_host_ip_list")

return (
    engine_pod_name_list is not None
The default value of these lists (_{engine_type}_pod_name_list) is [], which is not None, so by default the check always passes. Is this intended?
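The distinction can be shown with a truthiness check, which covers both the [] default and None (a hedged sketch, not the PR's actual code):

```python
def engine_exists(pod_name_list):
    """An empty list (the default) is falsy, so bool() treats both [] and
    None as "engine not deployed", unlike an `is not None` comparison."""
    return bool(pod_name_list)
```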
if not self.check_if_engine_exist(engine_type):
    self._engine_pod_prefix = f"gs-{engine_type}-".replace("_", "-")
    self._engine_cluster = self._build_engine_cluster(
If I called deploy_engine multiple times, would I lose the reference to the previous cluster, since you set self._engine_cluster repeatedly?
def close_interactive_instance(self, object_id):
    pod_name_list, _, _ = self._allocate_interactive_engine()
There could be multiple sets of interactive engines; I think the current design doesn't suffice.
This close_interactive_instance just gets the last created instance.
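One way to address this (a hypothetical sketch, not the PR's implementation) is to track each interactive engine set by object_id, so closing can target a specific instance rather than the last one created:

```python
class InteractiveInstanceRegistry:
    """Hypothetical registry keyed by object_id, so multiple interactive
    engine sets can coexist and be closed individually."""

    def __init__(self):
        self._instances = {}

    def register(self, object_id, pod_name_list):
        self._instances[object_id] = pod_name_list

    def close(self, object_id):
        # Pops the pods of the requested instance, not the last-created one.
        return self._instances.pop(object_id)
```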
if not (self._with_analytical or self._with_analytical_java):

def _allocate_analytical_engine(self):
    # check the engine flag
    if self._with_analytical and self._with_analytical_java:
Don't raise errors. Technically, analytical is a subset of analytical-java, so just go ahead with analytical-java.
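That preference can be sketched as a small selector (hypothetical helper, assuming the subset relationship the reviewer states):

```python
def pick_analytical_engine(with_analytical, with_analytical_java):
    """When both flags are set, proceed with analytical-java (a superset of
    analytical) instead of raising an error."""
    if with_analytical_java:
        return "analytical-java"
    if with_analytical:
        return "analytical"
    return None
```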
@@ -308,6 +328,55 @@ def test_demo_distribute(gs_session_distributed, data_dir, modern_graph_data_dir):
# GNN engine

def test_demo_with_lazy_mode(gs_session_with_lazy_mode, data_dir, modern_graph_data_dir):
Since lazy mode focuses on automated pod creation and destruction, could you add checks for the pods? On launch, check the existence and number of the specific pods; on destroy (close), check that the pods no longer exist.
        return False
    return True

def _deploy_vineyard_deployment_if_not_exist(self):
These "exists" really got me: why check if it exists after it exists? After some study, I think vineyard_deployment_exists is ambiguous in this context. In self.vineyard_deployment_exists() you mean that the value of self.vineyard_deployment is not None, while in self._check_if_vineyard_deployment_exist() you mean that the vineyard deployment of pods actually exists (the more natural reading). I think you could just use self._vineyard_deployment_name is not None and get rid of the method self.vineyard_deployment_exists().
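The disambiguation could look like this (a hypothetical sketch; the cluster check is injected as a callable so the two meanings stay separate):

```python
def should_deploy_vineyard(vineyard_deployment_name, deployment_exists):
    """Use the deployment name directly instead of a wrapper method: a name
    of None means no external vineyard was requested; deployment_exists(name)
    checks whether the vineyard pods are actually running in the cluster."""
    if vineyard_deployment_name is None:
        return False
    return not deployment_exists(vineyard_deployment_name)
```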
# Set the engine pod info
setattr(self, f"_{engine_type}_pod_name_list", self._pod_name_list)
setattr(self, f"_{engine_type}_pod_ip_list", self._pod_ip_list)
setattr(self, f"_{engine_type}_pod_host_ip_list", self._pod_host_ip_list)
Since you use these 3 variables extensively, have you considered extracting them into a higher-level data structure, like a class?
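For instance, a dataclass bundling the three per-engine lists could replace the setattr/getattr pattern with formatted attribute names (a hypothetical sketch, not the PR's code):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EnginePodInfo:
    """Bundles the per-engine pod lists; a dict keyed by engine_type can
    then replace dynamic attributes like _{engine_type}_pod_name_list."""
    pod_name_list: List[str] = field(default_factory=list)
    pod_ip_list: List[str] = field(default_factory=list)
    pod_host_ip_list: List[str] = field(default_factory=list)
```

Usage would then be e.g. `self._pods["interactive"] = EnginePodInfo(names, ips, host_ips)` instead of three setattr calls.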
The main logic is much clearer now; just some minor issues.
Force-pushed from 8dc663b to 25f7424
Force-pushed from 57cae3e to 224c9a5
…engine instances.

* Deploy the vineyard deployment if it does not exist.
* Add a new option for KubernetesLauncher to deploy the engines in eager mode or lazy mode.
* Split create_engine_instances into two steps:
  - Allocate the engine instances.
  - Distribute the relevant process to the engine instances.
* Add a new pytest on kubernetes to test the lazy mode.
* Add a new label for engine pods to indicate the specific engine.
* Set the engine_selector of all engine kubernetes resources.
* Create different engine statefulsets based on object_id for gae and gle.
* Delete all engine kubernetes resources based on engine_selector for gae and gie.
* Bump up upload-artifact to v3.

Signed-off-by: Ye Cao <caoye.cao@alibaba-inc.com>
Force-pushed from 224c9a5 to 13dda85
What do these changes do?
🤖 Generated by Copilot at 84e0afe
This pull request adds a new feature to graphscope that allows deploying engines on demand in a lazy mode, and refactors some existing code to improve quality and consistency. It modifies the EngineCluster, KubernetesClusterLauncher, and Session classes, and adds a new test case and fixture to test_demo_script.py.

Related issue number
Fixes some parts of #2539