Service with sklearn model fails on my EKS cluster #2371
Comments
Update: in my case, the problem seems to lie in
Hi @amelki - did you set a resource limit/request for this pod? If so, could you share the config?
Hi @parano, I didn't set any request/limit at first, no. But note that even if I set a request/limit manually on a deployment, it would change
@parano FYI, I tried several things:

1/ I added the following to the Deployment resource, in its `resources` section:

```yaml
limits:
  cpu: 2000m
  memory: 2048Mi
requests:
  cpu: 1000m
  memory: 1024Mi
```

Then, of course, the values of the cpu quota/period files have a better shape, but as I told you, since

2/ I added the resources in my values file so that they are applied at Yatai install time. I was expecting to retrieve these resources in the generated Deployment resource, but that is not the case.

3/ I also specified the resources at deployment time, within the Yatai console (more precisely, I left things as is). Interestingly, I don't see any resources in the generated Deployment resource either.

I can report issues 2/ and 3/ in a separate issue in the Yatai repo if you think it's more appropriate.
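For context on why setting those limits/requests changes the cgroup files: Kubernetes translates CPU limits and requests into the cgroup v1 CFS values deterministically. Here is a minimal sketch of that arithmetic (my own illustration, assuming the kubelet's default 100ms CFS period; this is not BentoML or kubelet code):

```python
# Sketch (illustration only): how Kubernetes CPU limits/requests map onto
# the cgroup v1 files that a process reads inside the pod.
CFS_PERIOD_US = 100_000  # the kubelet's default cpu.cfs_period_us

def limit_to_cfs_quota_us(cpu_limit_millicores: int) -> int:
    # limits.cpu of N millicores -> cpu.cfs_quota_us = N/1000 * period
    return cpu_limit_millicores * CFS_PERIOD_US // 1000

def request_to_cpu_shares(cpu_request_millicores: int) -> int:
    # requests.cpu of N millicores -> cpu.shares = N * 1024 / 1000
    return cpu_request_millicores * 1024 // 1000

print(limit_to_cfs_quota_us(2000))   # 2000m limit   -> 200000 (2 full CPUs)
print(request_to_cpu_shares(1000))   # 1000m request -> 1024
```

With no request set at all, Kubernetes gives the container only the minimum share value, which is why the quota/period/shares files look so different between the two configurations.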
#2372 just got merged, which should address this issue.
@aarnphm I have been able to properly test the fix with version
OK @aarnphm @parano I have some more information:
So it means that you've introduced some code in the main branch that breaks sklearn runners... Shall I open a new issue and close that one?
@amelki Are you sure you're using the right branch? #1 has been an error we've seen a couple of times recently and thought we had dealt with: #2369. Perhaps the branch that you're deploying to your pod is the latest release, which does not contain this fix; I don't think we've released it yet. Would that make sense?
Or have you walked through the steps to deploy your locally fixed branch through Yatai?
@timliubentoml thanks for getting back to me. I'm 99% positive that I'm testing the correct version (main). I tried 3 times:

```shell
git clone https://github.com/bentoml/BentoML.git
cd BentoML
python -m venv .bentoml-main
source .bentoml-main/bin/activate
pip install -e .
export BENTOML_BUNDLE_LOCAL_BUILD=True
export SETUPTOOLS_USE_DISTUTILS=stdlib
pip install -U setuptools
pip install sklearn
cd path/to/mybento
bentoml build
bentoml push mybento:myid
```

Here is the complete stack trace:

If I build a bento using https://github.com/bentoml/BentoML/releases/tag/v1.0.0-a6 or https://github.com/amelki/BentoML/tree/test/v1.0.0-a6-with-fix-for-2372, I don't have the problem; my pods start correctly. So I would say there might be a regression in one of these commits: v1.0.0-a6...main
@timliubentoml I found the commit that is causing the issue: f30d529. I tested the commit just before (
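The manual narrowing between v1.0.0-a6 and main described above is the kind of search `git bisect` automates. A hedged sketch, demonstrated on a throwaway repo so it is self-contained (the real run would clone BentoML, mark v1.0.0-a6 good and main bad, and use the bento build/deploy check as the test command):

```shell
# Sketch: automating a regression hunt with git bisect on a toy repo.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
git config user.email you@example.com && git config user.name you
echo ok > f;  git add f;  git commit -qm "good baseline"   # stands in for v1.0.0-a6
echo bug > f; git commit -qam "regression"                 # the breaking commit
echo bug > f2; git add f2; git commit -qm "later work"     # stands in for main
git bisect start HEAD HEAD~2           # bad = HEAD, good = HEAD~2
git bisect run sh -c 'grep -q ok f'    # exit 0 = good, nonzero = bad
git bisect reset
```

`git bisect run` then prints which commit is the first bad one, which is exactly the fact needed here.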
Oh, awesome, I was about to respond. @larme I think we've identified the commit that is causing this issue. Could you take a look at a fix?
Also, not sure if it's related, but I find this line suspicious: `model_store` does not seem to be used... shouldn't it be passed to `bentoml.models.create`?
One of our developers thinks we've identified the issue. Please stand by for a commit and release. Will get back to you with an ETA. Thanks for the help in identifying this issue!
Hi @amelki! We just issued the a7 release to PyPI last night. Could you try upgrading to the latest release? It should fix this issue.
@timliubentoml @parano I could finally test my model on BentoML 1.0.0a7 with Yatai 0.2.1 on my EKS cluster and it is working just fine!
Great to hear :) Let me know if you run into any other trouble.
I have created a simple service:

When I run it on my laptop (MacBook Pro M1), everything works fine when I invoke the generated `classify` API.

Now when I push this service to my Yatai server as a bento and deploy it to my K8s cluster (EKS), I get the following error when I invoke the API:

Looking at the code, the problem lies in `BentoML/bentoml/_internal/frameworks/sklearn.py`, line 163 in 119b103. In my case, `_num_threads` returns 0.

Digging a bit further, `resource_quota.cpu` is computed in `BentoML/bentoml/_internal/runner/utils.py`, line 208 in 119b103.

Here are the values I get on the pod running the API:

- `/sys/fs/cgroup/cpu/cpu.cfs_quota_us`
- `/sys/fs/cgroup/cpu/cpu.cfs_period_us`
- `/sys/fs/cgroup/cpu/cpu.shares`
- `os.cpu_count()`

Given those values, `query_cgroup_cpu_count()` will return 0.001953125, which once rounded will end up as 0, meaning `n_jobs` will always be 0. So the call will always fail on my pods.
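To make the arithmetic above concrete, here is a reconstruction of the failure mode. This is an assumption-laden sketch, not BentoML's actual `query_cgroup_cpu_count`: it assumes the fractional value comes from a `cpu.shares`-based fallback, since a pod with no CPU request gets Kubernetes' minimum of 2 shares and 2/1024 is exactly the 0.001953125 observed.

```python
import math

def cgroup_cpu_estimate(cfs_quota_us: int, cfs_period_us: int,
                        cpu_shares: int, host_cpus: int) -> float:
    # Hypothetical reconstruction: if a CFS quota is set, use quota/period;
    # otherwise fall back to cpu.shares / 1024, capped at the host CPU count.
    if cfs_quota_us > 0:
        return min(cfs_quota_us / cfs_period_us, float(host_cpus))
    return min(cpu_shares / 1024, float(host_cpus))

# No limit/request set: quota is -1 (unlimited) and shares is 2 -> 2/1024.
est = cgroup_cpu_estimate(-1, 100_000, 2, 4)
print(est)                      # 0.001953125, matching the value on the pod
print(round(est))               # 0 -> invalid as sklearn's n_jobs
print(max(1, math.ceil(est)))   # 1 -> clamping to at least 1 avoids the crash
```

Clamping the rounded estimate to a minimum of 1 worker is the kind of guard the eventual fix would need, since any sub-core CPU allocation otherwise truncates to zero.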