
Service with sklearn model fails on my EKS cluster #2371

Closed
amelki opened this issue Mar 24, 2022 · 17 comments

amelki commented Mar 24, 2022

I have created a simple service:

import bentoml
import numpy as np
from bentoml.io import NumpyNdarray

model_runner = bentoml.sklearn.load_runner("mymodel:latest")
svc = bentoml.Service("myservice", runners=[model_runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_series: np.ndarray) -> np.ndarray:
    return model_runner.run(input_series)

When I run it on my laptop (MacBook Pro M1), using

bentoml serve ./service.py:svc --reload

everything works fine when I invoke the generated classify API.

Now when I push this service to my Yatai server as a bento and deploy it to my K8s cluster (EKS), I get the following error when I invoke the API:

[screenshot of the error traceback]

Looking at the code, the problem lies in

return int(round(self.resource_quota.cpu))

In my case, _num_threads returns 0.
Digging a bit further, resource_quota.cpu is computed here:

cfs_quota_us_file = os.path.join(cgroup_root, "cpu", "cpu.cfs_quota_us")
Here are the values I get on the pod running the API:

source                                       value
file /sys/fs/cgroup/cpu/cpu.cfs_quota_us     -1
file /sys/fs/cgroup/cpu/cpu.cfs_period_us    100000
file /sys/fs/cgroup/cpu/cpu.shares           2
call to os.cpu_count()                       2

Given those values, query_cgroup_cpu_count() returns 0.001953125, which rounds to 0, meaning n_jobs will always be 0. So the call always fails on my pods.
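For illustration, the arithmetic above can be reproduced with a small sketch (a simplification of the cgroup v1 logic described, not BentoML's exact implementation):

```python
# Simplified sketch of the cgroup v1 CPU-count logic described above
# (an approximation, not BentoML's actual code).
def query_cgroup_cpu_count(cfs_quota_us, cfs_period_us, cpu_shares, os_cpus):
    if cfs_quota_us > 0:
        # An explicit CFS quota is set: quota / period gives the CPU limit.
        return cfs_quota_us / cfs_period_us
    # No quota (-1): fall back to cpu.shares, where 1024 shares == 1 CPU.
    return min(cpu_shares / 1024, float(os_cpus))

# With the values observed on the pod:
cpus = query_cgroup_cpu_count(-1, 100000, 2, 2)
print(cpus)               # 0.001953125
print(int(round(cpus)))   # 0 -> n_jobs becomes 0 and the runner fails
```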

@amelki amelki changed the title Prediction service with sklearn model fails on my EKS cluster Service with sklearn model fails on my EKS cluster Mar 24, 2022
amelki (Author) commented Mar 24, 2022

Update: in my case, the problem seems to lie in cpu.shares. If it is lower than or equal to 512 and the quota is -1, then n_jobs will always be 0.
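The 512-share boundary follows from the shares fallback: 1024 shares map to one CPU, and Python's round() uses banker's rounding, so exactly 0.5 rounds down to 0. A quick check with a few illustrative share values:

```python
# With quota == -1, the CPU count falls back to cpu.shares / 1024, and
# Python's round() (banker's rounding) sends exactly 0.5 down to 0.
results = {shares: int(round(shares / 1024)) for shares in (2, 512, 513, 1024)}
print(results)  # {2: 0, 512: 0, 513: 1, 1024: 1}
```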

parano (Member) commented Mar 25, 2022

Hi @amelki - did you set a resource limit/request for this pod? If so, could you share the config?

amelki (Author) commented Mar 25, 2022

Hi @parano, I didn't set any request/limit at first, no. But note that even if I set a request/limit manually on a deployment, it would change cpu.shares to, say, 512, but there is still a bug in BentoML and n_jobs will still be 0 - @aarnphm is aware of the issue and told me he is working on a fix :)
I just saw I can customize resources for all new pods here: https://github.com/bentoml/yatai-chart/blob/9dfea715a7297d4bcdd2cdc353d9b0a9c130af37/values.yaml#L77. Will give it a try, thanks!

amelki (Author) commented Mar 28, 2022

@parano FYI, I tried several things:

1/ I added the following resources to the Deployment resource created in my yatai namespace:

resources:
  limits:
    cpu: 2000m
    memory: 2048Mi
  requests:
    cpu: 1000m
    memory: 1024Mi

Then, of course, the values of the cpu quota/period files look healthier:

source                                       value
file /sys/fs/cgroup/cpu/cpu.cfs_quota_us     100000
file /sys/fs/cgroup/cpu/cpu.cfs_period_us    100000
file /sys/fs/cgroup/cpu/cpu.shares           512

but as I said, since cpu.shares is still below 1024, the serve API still does not work because n_jobs is still 0, which just confirms the bug in the BentoML code.

2/ Added the resources in my values file so that they are applied at Yatai install time.

I was expecting these resources to be reflected in the generated Deployment resource, but that is not the case.

3/ I also specified the resources at deployment time, in the Yatai console (more precisely, I left things as they were). Interestingly, I don't see any resources in the generated Deployment resource either.
[screenshot of the Yatai console deployment settings]

I can report issues 2/ and 3/ in a separate issue in the Yatai repo if you think it's more appropriate.

aarnphm (Member) commented Mar 28, 2022

#2372 just got merged, which should address this issue.
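For context, one common shape for such a fix is to floor the computed worker count at 1 (a sketch of the general idea only, not necessarily the exact change in #2372):

```python
import math

# Sketch of the general idea: never let the computed worker count fall
# below 1, even when cgroups report a fractional CPU allowance.
def num_threads(resource_cpu: float) -> int:
    return max(1, int(math.ceil(resource_cpu)))

print(num_threads(0.001953125))  # 1 (instead of the old round() -> 0)
print(num_threads(2.0))          # 2
```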

amelki (Author) commented Mar 28, 2022

@aarnphm I have been able to properly test the fix with version 1.0.0a6.post13+gd77e009c.
Your fix is working, since I no longer see the n_jobs = 0 error.
Unfortunately, I stumbled upon a new issue further down the stack:
[screenshot of the new error traceback]
Does this ring a bell on your side?
As a reminder, my service works perfectly when I serve it on my laptop.

amelki (Author) commented Mar 29, 2022

OK @aarnphm @parano I have some more information:

  1. the new error ('bool' object has no attribute 'get') does not occur at prediction time, but at pod startup time! It seems to be a problem when initializing the runner

  2. I tried @aarnphm's fix on top of v1.0.0-a6 (see https://github.com/amelki/BentoML/tree/test/v1.0.0-a6-with-fix-for-2372) and the good news is that my service does work now!

So it means that some code introduced in the main branch breaks sklearn runners... Shall I open a new issue and close this one?

@amelki amelki closed this as completed Mar 31, 2022
@amelki amelki reopened this Mar 31, 2022
timliubentoml (Collaborator) commented:

@amelki Are you sure you're using the right branch? Error #1 is one we've seen a couple of times recently and thought we had dealt with: #2369

Perhaps the branch that you're deploying to your pod is the latest release, which does not contain this fix - I don't think we've released it yet. Would that make sense?

timliubentoml (Collaborator) commented:

Or have you walked through the steps to deploy your local fixed branch through yatai?

amelki (Author) commented Mar 31, 2022

@timliubentoml thanks for getting back to me. I'm 99% positive that I'm testing the correct version (main). I tried 3 times.
If I request the version on the pod I get: bentoml, version 1.0.0a6.post14+gc6a50e6b
Here is how I build my bento:

git clone https://github.com/bentoml/BentoML.git
python -m venv .bentoml-main
source .bentoml-main/bin/activate
pip install -e .
export BENTOML_BUNDLE_LOCAL_BUILD=True
export SETUPTOOLS_USE_DISTUTILS=stdlib
pip install -U setuptools
pip install sklearn
cd path/to/mybento
bentoml build
bentoml push mybento:myid

Here is the complete stack trace:

  File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 624, in lifespan
    async with self.lifespan_context(app):
  File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 521, in __aenter__
    await self._router.startup()
  File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 603, in startup
    handler()
  File "/opt/conda/lib/python3.9/site-packages/bentoml/_internal/runner/local.py", line 16, in setup
    self._runner._setup()  # type: ignore[reportPrivateUsage]
  File "/opt/conda/lib/python3.9/site-packages/bentoml/_internal/frameworks/sklearn.py", line 170, in _setup
    self._model = load(self._tag, model_store=self.model_store)
  File "/opt/conda/lib/python3.9/site-packages/simple_di/__init__.py", line 139, in _
    return func(*_inject_args(bind.args), **_inject_kwargs(bind.kwargs))
  File "/opt/conda/lib/python3.9/site-packages/bentoml/_internal/frameworks/sklearn.py", line 68, in load
    model = model_store.get(tag)
AttributeError: 'bool' object has no attribute 'get'

If I build a bento using https://github.com/bentoml/BentoML/releases/tag/v1.0.0-a6 or https://github.com/amelki/BentoML/tree/test/v1.0.0-a6-with-fix-for-2372, I don't have the problem, my pods are starting correctly.

So I would say there might be a regression in one of these commits: v1.0.0-a6...main
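The AttributeError at the bottom of the trace can be reproduced in isolation: it happens whenever a boolean ends up where a model store object is expected (a minimal illustration with a hypothetical stand-in class, not BentoML's actual types):

```python
# Minimal stand-in for the failure mode in the trace: a bool is passed
# where an object with a .get() method is expected.
class FakeModelStore:
    def __init__(self):
        self._models = {"mymodel:latest": "model-bytes"}

    def get(self, tag):
        return self._models[tag]

def load(tag, model_store):
    return model_store.get(tag)

print(load("mymodel:latest", FakeModelStore()))  # works: model-bytes

try:
    load("mymodel:latest", model_store=True)  # a bool slipped in
except AttributeError as e:
    print(e)  # 'bool' object has no attribute 'get'
```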

amelki (Author) commented Apr 1, 2022

@timliubentoml I found the commit that is causing the issue: f30d529. I tested the commit just before it (e403eee9a9d436e92ce52dc49986cf30e9ea43dc), and startup is OK.
Starting from this commit (f30d5290e8efb0e242727e47640e7619b13607c7), startup fails.

timliubentoml (Collaborator) commented:

Oh, awesome, I was about to respond. @larme I think we've identified the commit that introduced this regression. Could you take a look at a fix?

amelki (Author) commented Apr 1, 2022

Also, not sure if it's related, but I find this line suspicious:

model_store: "ModelStore" = Provide[BentoMLContainer.model_store],

model_store does not seem to be used... shouldn't it be passed to bentoml.models.create?

timliubentoml (Collaborator) commented:

One of our developers thinks we've identified the issue. Please stand by for a commit and release. We will get back to you with an ETA.

Thanks for the help in identifying this issue!!!

@timliubentoml timliubentoml self-assigned this Apr 1, 2022
timliubentoml (Collaborator) commented:

Hi @amelki! We just published the a7 release to PyPI last night. Could you try upgrading to the latest release? It should fix this issue.

amelki (Author) commented Apr 7, 2022

@timliubentoml @parano I could finally test my model on BentoML 1.0.0a7 with Yatai 0.2.1 on my EKS cluster, and it is working just fine!
Many thanks to you and the team!

@amelki amelki closed this as completed Apr 7, 2022
aarnphm (Member) commented Apr 7, 2022

Great to hear :) Let me know if you run into any other trouble.
