
Service with sklearn model fails on my EKS cluster #2371

Closed
amelki opened this issue Mar 24, 2022 · 17 comments

amelki commented Mar 24, 2022

I have created a simple service:

import bentoml
import numpy as np
from bentoml.io import NumpyNdarray

model_runner = bentoml.sklearn.load_runner("mymodel:latest")
svc = bentoml.Service("myservice", runners=[model_runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_series: np.ndarray) -> np.ndarray:
    return model_runner.run(input_series)

When I run it on my laptop (MacBook Pro M1), using

bentoml serve ./service.py:svc --reload

everything works fine when I invoke the generated classify API.

Now when I push this service to my Yatai server as a bento and deploy it to my K8s cluster (EKS), I get the following error when I invoke the API:

[screenshot of the error traceback]

Looking at the code, the problem lies in

return int(round(self.resource_quota.cpu))

In my case, _num_threads returns 0.
Digging a bit further, resource_quota.cpu is computed here:

cfs_quota_us_file = os.path.join(cgroup_root, "cpu", "cpu.cfs_quota_us")
Here are the values I get on the pod running the API:

source                                       value
file /sys/fs/cgroup/cpu/cpu.cfs_quota_us     -1
file /sys/fs/cgroup/cpu/cpu.cfs_period_us    100000
file /sys/fs/cgroup/cpu/cpu.shares           2
call to os.cpu_count()                       2

Given those values, query_cgroup_cpu_count() returns 0.001953125, which rounds to 0, meaning n_jobs will always be 0. So the call always fails on my pods.
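For illustration, the arithmetic above can be reproduced with a small sketch (a simplification of the cgroup v1 logic described, not BentoML's exact implementation):

```python
# Simplified sketch of the cgroup v1 CPU-count logic described above
# (an approximation, not BentoML's actual code).
def query_cgroup_cpu_count(cfs_quota_us, cfs_period_us, cpu_shares, os_cpus):
    if cfs_quota_us > 0:
        # An explicit CFS quota is set: quota / period gives the CPU limit.
        return cfs_quota_us / cfs_period_us
    # No quota (-1): fall back to cpu.shares, where 1024 shares == 1 CPU.
    return min(cpu_shares / 1024, float(os_cpus))

# With the values observed on the pod:
cpus = query_cgroup_cpu_count(-1, 100000, 2, 2)
print(cpus)               # 0.001953125
print(int(round(cpus)))   # 0 -> n_jobs becomes 0 and the runner fails
```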

@amelki amelki changed the title Prediction service with sklearn model fails on my EKS cluster Service with sklearn model fails on my EKS cluster Mar 24, 2022
amelki (Author) commented Mar 24, 2022

Update: in my case, the problem seems to lie in cpu.shares. If it is lower than or equal to 512 and the quota is -1, then n_jobs will always be 0.
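The 512-share boundary follows from the shares fallback: 1024 shares map to one CPU, and Python's round() uses banker's rounding, so exactly 0.5 rounds down to 0. A quick check with a few illustrative share values:

```python
# With quota == -1, the CPU count falls back to cpu.shares / 1024, and
# Python's round() (banker's rounding) sends exactly 0.5 down to 0.
results = {shares: int(round(shares / 1024)) for shares in (2, 512, 513, 1024)}
print(results)  # {2: 0, 512: 0, 513: 1, 1024: 1}
```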

parano (Member) commented Mar 25, 2022

Hi @amelki - did you set a resource limit/request for this pod? If so, could you share the config?

amelki (Author) commented Mar 25, 2022

Hi @parano, I didn't set any request/limit at first, no. But note that even if I set a request/limit manually on a deployment, it would change cpu.shares to, say, 512, but there is still a bug in BentoML and n_jobs will still be 0 - @aarnphm is aware of the issue and told me he is working on a fix :)
I just saw I can customize resources for all new pods here: https://github.com/bentoml/yatai-chart/blob/9dfea715a7297d4bcdd2cdc353d9b0a9c130af37/values.yaml#L77. Will give it a try, thanks!

amelki (Author) commented Mar 28, 2022

@parano FYI, I tried several things:

1/ I added the following resources to the Deployment resource created in my yatai namespace:

resources:
  limits:
    cpu: 2000m
    memory: 2048Mi
  requests:
    cpu: 1000m
    memory: 1024Mi

Then, of course, the values of the cpu quota/period files look healthier:

source                                       value
file /sys/fs/cgroup/cpu/cpu.cfs_quota_us     100000
file /sys/fs/cgroup/cpu/cpu.cfs_period_us    100000
file /sys/fs/cgroup/cpu/cpu.shares           512

but as I said, since cpu.shares is still below 1024, the serve API still does not work because n_jobs is still 0, which just confirms the bug in the BentoML code.

2/ Added the resources in my values file so that they are applied at Yatai install time.

I was expecting these resources to be reflected in the generated Deployment resource, but that is not the case.

3/ I also specified the resources at deployment time, in the Yatai console (more precisely, I left things as they were). Interestingly, I don't see any resources in the generated Deployment resource either.
[screenshot of the Yatai console deployment settings]

I can report issues 2/ and 3/ in a separate issue in the Yatai repo if you think it's more appropriate.

aarnphm (Member) commented Mar 28, 2022

#2372 just got merged, which should address this issue.
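For context, one common shape for such a fix is to floor the computed worker count at 1 (a sketch of the general idea only, not necessarily the exact change in #2372):

```python
import math

# Sketch of the general idea: never let the computed worker count fall
# below 1, even when cgroups report a fractional CPU allowance.
def num_threads(resource_cpu: float) -> int:
    return max(1, int(math.ceil(resource_cpu)))

print(num_threads(0.001953125))  # 1 (instead of the old round() -> 0)
print(num_threads(2.0))          # 2
```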

amelki (Author) commented Mar 28, 2022

@aarnphm I have been able to properly test the fix with version 1.0.0a6.post13+gd77e009c.
Your fix is working, since I no longer see the n_jobs = 0 error.
Unfortunately, I stumbled upon a new issue further down the stack:
[screenshot of the new error traceback]
Does this ring a bell on your side?
As a reminder, my service works perfectly when I serve it on my laptop.

amelki (Author) commented Mar 29, 2022

OK @aarnphm @parano I have some more information:

  1. the new error ('bool' object has no attribute 'get') does not occur at prediction time, but at pod startup time! It seems to be a problem when initializing the runner

  2. I tried @aarnphm's fix on top of v1.0.0-a6 (see https://github.com/amelki/BentoML/tree/test/v1.0.0-a6-with-fix-for-2372) and the good news is that my service does work now!

So it means that some code introduced in the main branch breaks sklearn runners... Shall I open a new issue and close this one?

@amelki amelki closed this as completed Mar 31, 2022
@amelki amelki reopened this Mar 31, 2022
timliubentoml (Collaborator) commented:

@amelki Are you sure you're using the right branch? Error #1 is one we've seen a couple of times recently and thought we had dealt with: #2369

Perhaps the branch that you're deploying to your pod is the latest release, which does not contain this fix - I don't think we've released it yet. Would that make sense?

timliubentoml (Collaborator) commented:

Or have you walked through the steps to deploy your local fixed branch through yatai?

amelki (Author) commented Mar 31, 2022

@timliubentoml thanks for getting back to me. I'm 99% positive that I'm testing the correct version (main). I tried 3 times.
If I request the version on the pod I get: bentoml, version 1.0.0a6.post14+gc6a50e6b
Here is how I build my bento:

git clone https://github.com/bentoml/BentoML.git
python -m venv .bentoml-main
source .bentoml-main/bin/activate
pip install -e .
export BENTOML_BUNDLE_LOCAL_BUILD=True
export SETUPTOOLS_USE_DISTUTILS=stdlib
pip install -U setuptools
pip install sklearn
cd path/to/mybento
bentoml build
bentoml push mybento:myid

Here is the complete stack trace:

  File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 624, in lifespan
    async with self.lifespan_context(app):
  File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 521, in __aenter__
    await self._router.startup()
  File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 603, in startup
    handler()
  File "/opt/conda/lib/python3.9/site-packages/bentoml/_internal/runner/local.py", line 16, in setup
    self._runner._setup()  # type: ignore[reportPrivateUsage]
  File "/opt/conda/lib/python3.9/site-packages/bentoml/_internal/frameworks/sklearn.py", line 170, in _setup
    self._model = load(self._tag, model_store=self.model_store)
  File "/opt/conda/lib/python3.9/site-packages/simple_di/__init__.py", line 139, in _
    return func(*_inject_args(bind.args), **_inject_kwargs(bind.kwargs))
  File "/opt/conda/lib/python3.9/site-packages/bentoml/_internal/frameworks/sklearn.py", line 68, in load
    model = model_store.get(tag)
AttributeError: 'bool' object has no attribute 'get'

If I build a bento using https://github.com/bentoml/BentoML/releases/tag/v1.0.0-a6 or https://github.com/amelki/BentoML/tree/test/v1.0.0-a6-with-fix-for-2372, I don't have the problem, my pods are starting correctly.

So I would say there might be a regression in one of these commits: v1.0.0-a6...main
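The AttributeError at the bottom of the trace can be reproduced in isolation: it happens whenever a boolean ends up where a model store object is expected (a minimal illustration with a hypothetical stand-in class, not BentoML's actual types):

```python
# Minimal stand-in for the failure mode in the trace: a bool is passed
# where an object with a .get() method is expected.
class FakeModelStore:
    def __init__(self):
        self._models = {"mymodel:latest": "model-bytes"}

    def get(self, tag):
        return self._models[tag]

def load(tag, model_store):
    return model_store.get(tag)

print(load("mymodel:latest", FakeModelStore()))  # works: model-bytes

try:
    load("mymodel:latest", model_store=True)  # a bool slipped in
except AttributeError as e:
    print(e)  # 'bool' object has no attribute 'get'
```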

amelki (Author) commented Apr 1, 2022

@timliubentoml I found the commit that is causing the issue: f30d529. I tested the commit just before it (e403eee9a9d436e92ce52dc49986cf30e9ea43dc), and startup is OK.
Starting from this commit (f30d5290e8efb0e242727e47640e7619b13607c7), startup fails.

timliubentoml (Collaborator) commented:

Oh, awesome, I was about to respond. @larme I think we've identified the commit that introduced this regression. Could you take a look at a fix?

amelki (Author) commented Apr 1, 2022

Also, not sure if it's related, but I find this line suspicious:

model_store: "ModelStore" = Provide[BentoMLContainer.model_store],

model_store does not seem to be used... shouldn't it be passed to bentoml.models.create?

timliubentoml (Collaborator) commented:

One of our developers thinks we've identified the issue. Please stand by for a commit and release. We will get back to you with an ETA.

Thanks for the help in identifying this issue!!!

@timliubentoml timliubentoml self-assigned this Apr 1, 2022
timliubentoml (Collaborator) commented:

Hi @amelki! We just published the a7 release to PyPI last night. Could you try upgrading to the latest release? It should fix this issue.

amelki (Author) commented Apr 7, 2022

@timliubentoml @parano I could finally test my model on BentoML 1.0.0a7 with Yatai 0.2.1 on my EKS cluster, and it is working just fine!
Many thanks to you and the team!

@amelki amelki closed this as completed Apr 7, 2022
aarnphm (Member) commented Apr 7, 2022

Great to hear :) Let me know if you run into any other trouble.
