Skip to content
This repository has been archived by the owner on May 7, 2020. It is now read-only.

Starred interaction on config and rules changes #4716

Closed
mstormi opened this issue Dec 11, 2017 · 25 comments · Fixed by #4785
Closed

Starred interaction on config and rules changes #4716

mstormi opened this issue Dec 11, 2017 · 25 comments · Fixed by #4785
Labels

Comments

@mstormi
Copy link

mstormi commented Dec 11, 2017

Whenever I change an items or rules file, to process this update in OH can take up to 2 or 3 minutes (for a 1k lines rules file on a Pi2). java is runnning at 100% CPU (i.e. 1 core@100%, but 3 idle cores).
Meanwhile, the interaction via REST or Web UI is apparently HALTED until the current file is processed.
As requested by @kaikreuzer, here's a large rules file to help you reproduce it, but it happens on any rule und items file (re)load: it takes the API and UI to react exactly as long as it takes to process the updated file, UI comes back to life as soon as the 'refreshing xxx.rules' log entry shows up.

reloadtest.TXT

@htreu
Copy link
Contributor

htreu commented Dec 12, 2017

Hi @mstormi did you already have a discussion about this in the community forum? Please provide the link.
A quick search did bring up some information about the older RPi2 kernels not being compiled with "Symmetric multi-processing support". I guess we should start the diagnosis on OS level and work our way through to the JVM and on to ESH.
wdyt?

@kaikreuzer
Copy link
Contributor

Just for clarification: Is this a RPi specific behavior or do we see sluggish rule parsing also on x86 PCs (then maybe not 3 minutes, but 10 seconds, but still "too slow and blocking")?

@mstormi
Copy link
Author

mstormi commented Dec 12, 2017

Well I've been having this problem for a long time, on RPi2 but also on my today's RPi3 based on openHABian, on various kernel, JVM and OH versions. My current setup provides multi-processing for sure.
No, no community discussion thus no link, sorry.
Haven't run OH on a PC yet, so can't tell. @lolodomo can you ? You mentioned you saw that symptom, too.

I agree it's the right approach to 'work up the stack' but we quickly arrive at the code layer.
It's not RPi or HW specific, and I don't think it is JVM related either. We shouldn't have this behavior even if we did have a single CPU-core only because we have multithreading, don't we ?
And parsing and API/UI should not be done one after the other by the same thread, right?
So even with a large rule and a single slow CPU core, it would possibly seem sluggish but still the OS would keep switching between threads and these tasks would run in parallel and not block each other.

Let's make it clear again: While the system behavior from a user point of view is 'sluggish', this is not because of lack of processing ressources to be available at that moment in time. It is because the system processes updated rules and does NOT process API input while doing the former.
Rule parsing ok, it's slow, probably the parser is inefficient, and if it was fast (or if you run on a fast machine) we probably wouldn't notice the other issue. But that's not the main point.
That main point is the other issue that while OH is parsing a rule (BTW: or items file), API (+ UI) are blocked/starred.
So there's a locking issue somewhere in the code.

@lolodomo
Copy link
Contributor

My feeling is that all this is unfortunately right.
I think it can be reproduced in Eclipse even on a fast PC by updating the demo rules and immediately try to load a new page in Basic UI. Normally you should see the UI blocked few seconds (not minutes).

@kaikreuzer
Copy link
Contributor

as long as it takes to process the updated file, UI comes back to life as soon as the 'refreshing xxx.rules' log entry shows up.

Using your example file in my openHAB IDE:

2017-12-12 12:21:11.306 [DEBUG] [.s.m.c.i.folder.FolderObserver:231  ] - Going to validate: demo.rules, ScriptServiceUtil: {org.eclipse.smarthome.model.script.ScriptServiceUtil}={service.id=345, service.bundleid=23, service.scope=bundle, component.name=org.eclipse.smarthome.model.script.ScriptServiceUtil, component.id=5}, ModelParsers: [things, rules, persist, items, sitemap, script]
2017-12-12 12:21:11.759 [INFO ] [.e.s.m.c.i.ModelRepositoryImpl:137  ] - Refreshing model 'demo.rules'

So this file takes 400ms to parse on my Mac. This does not seem to be any exceptionally long time, does it?
Can you enable debug logging on org.eclipse.smarthome.model and provide a similar log?

@mstormi
Copy link
Author

mstormi commented Dec 12, 2017

Sure, but what would that tell you ? I feel you're still missing the point. If you had used the UI within these 400ms, you would have noticed it was blocked. Now on a slow machine with a large rule, these 400ms could become 40 seconds or more, and that's an annoying lag then.
EDIT: no output on my box during this time even with ESH debug enabled and even though I accessed the UI.

@kaikreuzer
Copy link
Contributor

My point is that parsing should never take longer than a very few seconds. And if this is the case, it does not matter if access to items and sitemaps is blocked for that moment. If you had to restart the runtime after a configuration file change, the situation would be worse...

I'd rather want to fight the root cause and not the symptoms.

@mstormi
Copy link
Author

mstormi commented Dec 12, 2017

me too. But while yes, parsing should be faster, that's just an 'accelerator' but it isn't the root.
I consider the locking to be the root cause.

rules and items files still can be large and machines slow, you can never know how long that'll take in the end if it's not your ruleset running on your Mac.

@mhilbush
Copy link
Contributor

Here's a data point from a rule that has 1395 lines. Looks like it took almost 12 seconds. This is on a 1.6 GHz 4-core Intel Celeron with 8 GB 1600 MHz RAM.

The file contains a bunch of lambdas and rules.

2017-12-12 07:32:38.649 [DEBUG] [.core.internal.folder.FolderObserver] - Going to validate: russound.rules, ScriptServiceUtil: {org.eclipse.smarthome.model.script.ScriptServiceUtil}={component.name=org.eclipse.smarthome.model.script.ScriptServiceUtil, component.id=144, service.id=253, service.bundleid=149, service.scope=bundle}, ModelParsers: [things, rules, persist, items, script, sitemap]
2017-12-12 07:32:50.269 [INFO ] [el.core.internal.ModelRepositoryImpl] - Refreshing model 'russound.rules'

I will add that I have not observed the behavior with the UI blocking as described above. I will try to test for that while the rule file is being processed. Hopefully my fingers will have the dexterity needed to do this. :-)

@mhilbush
Copy link
Contributor

I can confirm that Basic UI blocks during the 12 seconds that my rule is loading. By blocks, I mean that if I click on an object, such as to go to another frame, the new frame doesn't load until the rule has finished processing.

@kaikreuzer kaikreuzer added the DSL label Dec 12, 2017
@kaikreuzer
Copy link
Contributor

Ok then, let's define this issue to look at the blocking then.
In the code, I see that accessed to Xtext models (i.e. item, sitemap, rule, persistence, etc. files) are synchronized over the Xtext ResourceSet - this is very likely the cause of this observed behavior. Note that such accesses only seem to happen during "regular" runtime in the SitemapProvider, so my assumption would be that requests to /rest/sitemaps are blocked, while other rest endpoints still work (i.e. there should be no issue to load the dashboard, use HABPanel or the Paper UI, etc.) The problem should only appear for Basic/Classic UI and the Android+iOS app, can you confirm this?

As a solution, the sitemap provider should probably best keep an internal copy of the models for itself. I would not dare to relax the synchronization on the resource set as I remember that there were many issues in that spot when doing concurrent accesses.

@mhilbush
Copy link
Contributor

there should be no issue to load the dashboard, use HABPanel or the Paper UI

I see no issue with dashboard, HABpanel, and PaperUI. I'll check the Android app shortly.

@mhilbush
Copy link
Contributor

Confirming that the Android app blocks.

@kaikreuzer
Copy link
Contributor

Thanks for checking, good to hear that my analysis seems to be correct then.

@mstormi
Copy link
Author

mstormi commented Dec 17, 2017

FWIW, I've just encountered a state where Basic UI was blocked again while java was at 100% doing whatever (I did NOT change any DSL files this time, so it must have been doing something different but I don't know what that was) .

@mstormi
Copy link
Author

mstormi commented Jan 15, 2018

Is this supposed to fix the 'UI hangs on processing rules' ?
I've just upgraded to snapshot #1186 (which is including this one, right ?) but it doesn't work any better than before. UI still is blocking until the current rules file is done processing.

@kaikreuzer
Copy link
Contributor

Yes, #4785 is included in distro 1186.
@htreu Could you check why it isn't solving the issue?

@htreu
Copy link
Contributor

htreu commented Jan 16, 2018

When switching frames in Basic UI there are other interactions with the ModelRepository: during frame rendering the category for a widget is calculated. In case the icon is not set directly in the widget (i.e. in the sitemap) the attached item is queried to provide an item. This will access the item model from the GenericItemUIProvider which in turn is loaded from the ModelRepository.
I will propose a fix which also caches item models in the GenericItemUIProvider.

@htreu
Copy link
Contributor

htreu commented Jan 16, 2018

The fix just gets the item from the also present ItemRegistry and removes the ModelRepository from the GenricItemUIProvider. This fixes it for me.

@htreu
Copy link
Contributor

htreu commented Jan 16, 2018

With the next ESH stable this should be fixed. @mstormi & @mhilbush, please report further issues here, thanks!

@mstormi
Copy link
Author

mstormi commented Jan 23, 2018

Just installed OH #1194 (to include yesterday's ESH update) and yes, looks like it is fixed now.
Thanks!

@mhilbush
Copy link
Contributor

1194 looks good from here. Yay!

@mstormi
Copy link
Author

mstormi commented Feb 4, 2018

Now while this one seems to be fixed, I've encountered a similar thing on rule execution:
Execution gets delayed, too, until any ongoing rule change got processed.
Worse even, seems if I do a second and third change to a rule file while the first change is still being processed, processing of changes seems to queue up.
Can anyone confirm either of these behaviors ?
Now with a lot of people using slow machines (RPi) and doing edits of files in-place, this turns into a serious problem as OH feels as 'laggy' as in this original issue.

@htreu
Copy link
Contributor

htreu commented Feb 5, 2018

@mstormi, thanks for the ongoing reports. From the code I can see that org.eclipse.smarthome.model.rule.runtime.internal.engine.ExecuteRuleJob#execute indeed does need the modelRepository (which is right now busy reloading the rules models) also tries to load the rule model for the current execution.
I didn't analyse this further yet, but this seems to be the place where synchronisation needs to be worked around somehow.

@htreu
Copy link
Contributor

htreu commented Mar 6, 2018

I created #5186 to further track this issue.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants