This repository has been archived by the owner on May 7, 2020. It is now read-only.

Introduce a "ESH start level" functionality #1896

Open
tomhoefer opened this issue Jul 22, 2016 · 32 comments

Comments

@tomhoefer
Contributor

Hi all,

in our project we have implemented many event subscribers and registry change listeners, which are naturally called during startup / shutdown of ESH. In the shutdown phase these services update their model accordingly, with the result that the model cannot be rebuilt after the next startup (because the subscribers / listeners have assumed that the item / thing / link has really been deleted). We should distinguish between event sending / listener notification for the addition / removal of items / things / links during the framework startup / shutdown phase and during normal runtime.

For this reason I would like to be able to disable, in our project, the sending of events and the notification of listeners during startup / shutdown. I can imagine two ways to implement this:

  1. Providing a RuntimeStateService that can be queried to find out whether the runtime is started. Initially the service would consist of a single operation, boolean isStarted(), and it would be injected as a dynamic dependency into the registries (things, items, links, rules). I would then skip sending events / notifications if the service is present and the runtime / framework is not fully started.
  2. Each service that requires the runtime state has to implement a RuntimeStateListener interface, which is tracked by a central RuntimeStateService to be provided by solutions based on ESH. As soon as the runtime state changes to started, the service informs all listeners about this. Once the runtime leaves the started state again, all listeners are informed as well. Initially I would have AbstractRegistry implement the new RuntimeStateListener interface (concrete registries can decide whether the runtime state listener is provided as a service).

I think that 2 is the only valid option to implement this requirement. In 1 the runtime state service could unregister too early so that the events are sent / listeners are notified again.
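
To make option 2 more concrete, here is a minimal plain-Java sketch without any OSGi wiring. Only the names RuntimeStateService and RuntimeStateListener come from the proposal above; the method signatures and the SketchRegistry class (a stand-in for AbstractRegistry) are assumptions for illustration, not existing ESH API.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical callback interface; the name is from the proposal, the signature is assumed. */
interface RuntimeStateListener {
    void runtimeStateChanged(boolean started);
}

/** Hypothetical central service that tracks listeners and pushes state changes to them. */
final class RuntimeStateService {
    private final List<RuntimeStateListener> listeners = new CopyOnWriteArrayList<>();
    private volatile boolean started;

    void addListener(RuntimeStateListener listener) {
        listeners.add(listener);
        // Bring late joiners up to date immediately.
        listener.runtimeStateChanged(started);
    }

    void removeListener(RuntimeStateListener listener) {
        listeners.remove(listener);
    }

    void setStarted(boolean newState) {
        if (started == newState) {
            return;
        }
        started = newState;
        for (RuntimeStateListener listener : listeners) {
            listener.runtimeStateChanged(newState);
        }
    }
}

/** Stand-in for AbstractRegistry: notifications are only emitted while the runtime is started. */
final class SketchRegistry implements RuntimeStateListener {
    private volatile boolean runtimeStarted;
    final List<String> sentEvents = new CopyOnWriteArrayList<>();

    @Override
    public void runtimeStateChanged(boolean started) {
        runtimeStarted = started;
    }

    void add(String element) {
        // A real registry would always store the element; only the notification is suppressed.
        if (runtimeStarted) {
            sentEvents.add("added:" + element);
        }
    }

    void remove(String element) {
        if (runtimeStarted) {
            sentEvents.add("removed:" + element);
        }
    }
}
```

Because the registry keeps the last state it was told about, it does not start sending events again just because the state service itself goes away early, which is the weakness of option 1.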

What do you think?

@maggu2810
Contributor

How do you want to decide that the framework is started or shut down?
If I update one Eclipse SmartHome bundle, that triggers some restarts (deactivate, activate) of bundles and / or services. Which one is responsible for the overall "startup" / "shutdown" state?

Isn't it product-specific which services need to be available to signal "all present" / "normal runtime"?

@tomhoefer
Contributor Author

tomhoefer commented Jul 22, 2016

In our product we can rely on our OSGi runtime implementation to tell us that the framework has started properly. For solutions running e.g. on Equinox, I thought of using a framework listener and listening for FrameworkEvent.STARTED.

@maggu2810
Contributor

I had assumed you were talking about the startup and shutdown phase of the Eclipse SmartHome framework. But you refer to the start and stop of the OSGi Framework. Correct?

So there are some options: using a FrameworkListener, using a SynchronousBundleListener for bundle 0 to handle the stopping event, ...

Is the intention that the Eclipse SmartHome framework does not fire any event as long as the OSGi Framework is not fully started (starting up or shutting down)?
But I assume we also need to react to restarts of particular bundles / services etc.
If a bundle is updated or services are restarted, which Framework or Bundle Events are triggered?
The OSGi framework is still "started" but ESH bundles could disappear (but I could be wrong, I have never watched all the events).

Which ones need to be observed to distinguish between "normal runtime" and a non-normal one?

@tomhoefer
Contributor Author

But you refer to the start and stop of the OSGi Framework. Correct?

Yes

Is the intention that the Eclipse SmartHome framework does not fire any event as long as the OSGi Framework is not fully started (starting up or shutting down)?

Yes

If a bundle is updated or services are restarted, which Framework or Bundle Events are triggered?

I think this depends on the OSGi runtime in use. In our project we don't want to be informed if entities are added or removed during startup and shutdown. We already have a dedicated state that declares the framework as started.

Which ones need to be observed to distinguish between "normal runtime" and a non-normal one?

Especially ItemAddedEvent, ItemRemovedEvent, ThingAddedEvent and ThingRemovedEvent.

Can you give me an example for which bundle / service you think we need to react on its restart?

@maggu2810
Contributor

Can you give me an example for which bundle / service you think we need to react on its restart?

No, not ATM.
I need to think about the whole topic in more detail.

You have written you are doing this already, so I assume you know that it is working and how it is working (the architecture). I don't. 😉 Give me some time.

@tomhoefer
Contributor Author

Haven't started with the implementation yet ;) But because it is urgent, I think I will provide a PR next week.

@kaikreuzer
Contributor

I agree that it isn't easy to say whether the system is up or not. What does it mean if the OSGi framework keeps running, but ALL ESH bundles are fully stopped and restarted? I would consider this to mean that ESH is NOT up - hence the feature should not be about the OSGi framework, but about ESH itself.

"Up" means to me that certain services have started and are available. How can this be determined, and how can others be notified about reaching (or leaving) this state?

I see several use cases of such a feature (from recent discussions):

  • avoid Item/Thing/etc added/removed events when the system is only started/stopped and hence only reconstructs the status quo from the last up-time. I have seen myself the log being cluttered on shutdown with 1000 "item removed" events, which clearly makes no sense. Usually, "item removed" should mean that it has been removed from the system and won't re-appear automatically again. This is the use case @tomhoefer describes above.
  • we recently introduced the XML processing vetoing (Reworked XML bundle processing and thing handler initialization #1856) - this also just tries to make sure that a certain state (XMLs loaded) has been reached before starting other services (the thing handlers) (the tricky thing here might be that it is more fine-grained as it blocks single bundles depending on more detailed processing information)
  • Very frequently the right moment for the startup rules is discussed. So far, they are potentially triggered when not all items have been restored in the registry yet, which causes all kinds of problems. For this it would also be helpful to be notified about some "system up" state, so that the rules can be safely executed.

@kaikreuzer kaikreuzer changed the title Avoid sending of events and notification of listeners during startup / shutdown of ESH Introduce a "ESH start level" functionality Aug 1, 2016
@sjsf
Contributor

sjsf commented Aug 4, 2016

IMHO, a single state will not fit all of our needs. As @kaikreuzer pointed out, we e.g. have services that require other services to be up and running and fully loaded (whatever that means). Then again, there might be other services which depend on the previous ones to be started. So we will end up having several different levels of "active", like e.g. the start levels for bundles in OSGi. Additionally, the definition of these levels is going to differ for every solution built on top of ESH.

Generally, the introduction of such a framework state in a dynamic system is usually a workaround to cover up for maybe-not-so-ideal design decisions in other places. I would suggest first looking into the individual use cases to see if we can somehow fix the root causes.

Regarding the Item/Thing/etc added/removed events, the root cause is that we cannot distinguish whether they were loaded or newly created (or removed/unloaded respectively). I'd suggest that we fix this and also let listeners/subscribers decide what kind of event/notification they actually require by either introducing new event types (i.e. ThingLoadedEvent, etc...) plus RegistryLoadListener interface, or amending the existing events and RegistryChangeListener with the corresponding information.
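
The second variant (amending the existing notification with the corresponding information) could look roughly like the following plain-Java sketch. ChangeCause, CauseAwareListener and CauseAwareRegistry are hypothetical names used only for illustration; the existing RegistryChangeListener carries no such cause.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical cause attached to every registry notification. */
enum ChangeCause { CREATED, LOADED, REMOVED, UNLOADED }

/** Hypothetical listener that receives the cause along with the element. */
interface CauseAwareListener<E> {
    void changed(E element, ChangeCause cause);
}

/** Minimal registry that distinguishes restoring/unloading from creating/deleting. */
final class CauseAwareRegistry<E> {
    private final List<E> elements = new ArrayList<>();
    private final List<CauseAwareListener<E>> listeners = new ArrayList<>();

    void addListener(CauseAwareListener<E> listener) {
        listeners.add(listener);
    }

    /** Restores an element from storage, e.g. during startup. */
    void load(E element) {
        elements.add(element);
        fire(element, ChangeCause.LOADED);
    }

    /** Adds an element that did not exist before. */
    void create(E element) {
        elements.add(element);
        fire(element, ChangeCause.CREATED);
    }

    /** Drops an element from memory, e.g. during shutdown; it is expected to come back. */
    void unload(E element) {
        if (elements.remove(element)) {
            fire(element, ChangeCause.UNLOADED);
        }
    }

    /** Deletes an element for good. */
    void delete(E element) {
        if (elements.remove(element)) {
            fire(element, ChangeCause.REMOVED);
        }
    }

    private void fire(E element, ChangeCause cause) {
        for (CauseAwareListener<E> listener : listeners) {
            listener.changed(element, cause);
        }
    }
}
```

A subscriber that maintains its own model would then react only to CREATED and REMOVED and ignore LOADED and UNLOADED, without needing any global notion of "started".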

kaikreuzer added a commit to kaikreuzer/smarthome that referenced this issue Dec 14, 2016
…on until eclipse-archived#1896 is solved

Signed-off-by: Kai Kreuzer <kai@openhab.org>
maggu2810 pushed a commit that referenced this issue Dec 15, 2016
…on until #1896 is solved (#2656)

Signed-off-by: Kai Kreuzer <kai@openhab.org>
chaton78 pushed a commit to chaton78/smarthome that referenced this issue Dec 23, 2016
@sjsf sjsf mentioned this issue Apr 12, 2017
@sjsf
Contributor

sjsf commented May 10, 2017

Okay, it has been a while now... As we can see, there have recently been quite a few topics which relate to this issue, therefore I'd like to get back to it now. I still think we should avoid using such a "startup level" construct wherever possible! But I have to admit that there are some use cases which won't really work without it (e.g. related to the rule engine).

There recently was a blog post by @pkriens which addresses this very topic. And I think we could realize our requirements with exactly this idea, using the OSGi means for our purpose. The relevant services that we need to wait for (e.g. XML processing per binding, providers being up and running) would somehow denote that they are "finished" by registering a marker service into the SCR, carrying some defined properties.

Our "AggregateStateService" however must be configurable, as not all the services are available in every solution. Imagine there would be a solution without support for DSL based configuration, then it really does not make sense to wait for the GenericProviders to finish their loading. I'd suggest using config admin for that purpose.

As a first step, I would drop the BundleProcessorVetoManager and use such OSGi services to mark fully loaded bindings accordingly.

As a next step, I would create an AggregateStateService and make all relevant entities denote that they have finished loading. The idea would be that every service that somehow needs to wait (e.g. a SystemStartupTrigger) would create a dependency on such an aggregated state only, not on the services themselves. That way we would turn the dependency on a concrete service into a configurable one with a semantic meaning. At the same time this allows us to define different levels of "readiness" of the system. Of course, we need to carefully define all the required properties and states, as they somehow become "API for solution providers", i.e. they must not be a big pain to maintain and should change as seldom as possible.
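
A rough plain-Java sketch of the aggregation logic, leaving out the OSGi service registry and config admin entirely: the class name AggregateState, its methods and the marker names in the usage note are assumptions for illustration. In a real implementation the marker callbacks would be driven by service registration events and the required set would come from configuration.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical aggregated state that is "ready" once all configured markers are present. */
final class AggregateState {
    private final Set<String> required;
    private final Set<String> present = new HashSet<>();
    private final Consumer<Boolean> onChange;
    private boolean ready;

    /**
     * @param requiredMarkers the configurable set of markers to wait for (may be empty)
     * @param onChange called with the new readiness whenever it flips
     */
    AggregateState(Set<String> requiredMarkers, Consumer<Boolean> onChange) {
        this.required = new HashSet<>(requiredMarkers);
        this.onChange = onChange;
        // With nothing to wait for, the state is ready right away.
        this.ready = required.isEmpty();
    }

    synchronized void markerRegistered(String marker) {
        present.add(marker);
        update();
    }

    synchronized void markerUnregistered(String marker) {
        present.remove(marker);
        update();
    }

    synchronized boolean isReady() {
        return ready;
    }

    private void update() {
        boolean now = present.containsAll(required);
        if (now != ready) {
            ready = now;
            onChange.accept(now);
        }
    }
}
```

A solution without DSL support would simply leave the corresponding marker (say, a hypothetical "provider.generic.items") out of the required set, so a SystemStartupTrigger depending on this state never waits for a provider that does not exist.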

Does this make sense to you? Any thoughts on this?

@pkriens

pkriens commented May 10, 2017

Aren't there any companies that can run this through OSGi? This is a very foundational service and it belongs somewhere low in the stack like Equinox or Apache Felix?

I could provide an initial implementation, since I already have it running.

@pkriens

pkriens commented Jan 22, 2018

@SJKA I think I share your view. The danger is that you start thinking globally, and that always falls apart in a component model. In general, you need to handle the dependency on the requirer side, which has the actual knowledge of what it needs. I.e. a rule that needs X should not be evaluated before X is present. This is much better than waiting to start the rule engine until all devices have started. You need to address these things where you have concrete information (like X.1) instead of trying to handle them globally.
Hope this helps.

@maggu2810
Contributor

I considered things, rules, ... to be ready once the framework parts are ready (thing handlers could start doing their work, rules could be processed, etc.).
Is waiting for "all things need to be present", e.g. to be ready to execute rules, possible at all?
Think about a binding / thing handler that is fully initialized itself, but needs an undefined amount of time until it can detect its things (if they are online) and communicate with them.
Should the whole rule trigger "system started" wait for an undefined time?
If a rule needs to access those things, perhaps it should be triggered by "thing online" instead.

What are the main "wait conditions" we need at all -- and which part should wait?

@sjsf
Contributor

sjsf commented Jan 22, 2018

What are the main "wait conditions" we need at all -- and which part should wait?

Looking at the tons of issues which are linked to this one, I'm tempted to say: pretty much everything 😉 But that's exactly what I'd like to avoid - as tempting as it is.

However, in the end I think it's mainly about the rule engine(s). The other cases need to be looked into more deeply, and hopefully can be solved locally.

In the rule engine(s), the major pain points are the "system started" triggers - all other triggers won't be triggered or executed anyway, because the system simply is not "ready enough" to generate and/or receive such events (e.g. ItemStateChangeEvent), so no problem there.

The linked issues mostly refer to "items not present" because this is the most obvious error when the language model cannot infer item references - but as you pointed out, this won't be enough: Once the items are there, we will run into the next problem: the linked things (as well as the links themselves, obviously) also need to be there - otherwise the items can be nicely resolved but any sent command ends up in nirvana. Speaking about that, the corresponding ThingHandlers obviously also need to be finished initializing. If they end up being OFFLINE because they cannot reach their devices: tough luck, this might always happen.

In an ideal world, we could analyze the rule actions for the items which are referenced and wait for their things to become ONLINE/OFFLINE/UNKNOWN. This however seems pretty much impossible with more advanced, dynamic scripts where e.g. items are looked up dynamically from the ItemRegistry. And even if we overcome this problem by only considering hard-referenced items and build a 90% approximation, it might still be surprising to users if e.g. multiple items are changed in a rule but one will never become "useable" because the corresponding binding is missing. Why doesn't it execute it for the others? Can't the computer "know" that this binding is missing?

Is waiting for "all things need to be present" e.g. to be ready to execute rules possible at all?

This indeed is the key question! If we build something that isn't capable of solving this, then we won't win anything and don't even need to start.

@jboeddeker

In the rule engine(s), the major pain points are the "system started" triggers - all other triggers won't be triggered or executed anyway, because the system simply is not "ready enough" to generate and/or receive such events (e.g. ItemStateChangeEvent), so no problem there.

No, in my opinion it's not just "system started". More problems are created by the ItemStateChanged triggers that are fired, for example, by the persistence engines.
And some bindings take more time to initialize than others.

@maggu2810
Contributor

More problems are created by the ItemStateChanged triggers that are fired, for example, by the persistence engines.

Can you add more details? A persistence service can access the item registry on service activation and persist all non UnDefType.NULL states (WRT the discussion who is allowed to set the NULL state but that is currently mostly used by the framework on item creation only) to its storage. After it has been activated, it could store every item state change to the storage, too.

@jboeddeker

Sorry, I think that was ambiguous. It's not the persisting of items but the restoring (strategy = restoreOnStartup) which causes the ItemChanged trigger to be fired. In my case this was a major problem, which was mostly solved when I excluded the change from NULL from the trigger condition.

//Item someitem changed
Item someitem changed from X to Y or 
Item someitem changed from Y to X

This change eliminated many of the startup exceptions.

@mherwege
Contributor

I would add two more cases that can cause issues with rules when the system is started. I have seen all of these when starting openHAB. A few restarts usually get me past the problem, but that's not very nice.

  • cron-triggered rules that fire when the system has not fully initialized all its items yet

  • a mix of items defined in items files and through Paper UI: this can cause issues if one set is loaded and the other set is not loaded yet. The rule could be triggered by an item from the loaded set, but still fail because it does not find another item referenced in the rule body. If this happens, the rule engine may generate a syntax error and never run the rule again.

@maggu2810
Contributor

Should a rule be triggered at all if

  • items are used that are not available
  • items are available but not linked
  • items are available, linked, but thing has no handler assigned
  • items are available, linked, handler assigned, but thing is offline
  • ...

Isn't the rule engine a special use case? I don't think that can be solved with a global "system is started and rules can be executed" state at all.
Isn't it something only the rule writer can know: whether the items need to have linked channels (and so things) or not, whether the things already need a handler or not, whether the thing itself needs to be online or not, ...?
Do you really think that every "user" wants the same behavior for the same use case (especially with regard to whether thing communication should be established)?
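
To illustrate what "known by the rule writer only" could mean in practice, here is a small plain-Java sketch in which each rule declares the minimum readiness it needs per item. The Readiness levels mirror the list above; the enum, the RulePrecondition class and the item names in the usage note are hypothetical and not part of the ESH rule engine.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical readiness levels of an item, ordered from least to most ready. */
enum Readiness { MISSING, AVAILABLE, LINKED, HANDLER_ASSIGNED, ONLINE }

/** Hypothetical per-rule precondition: the minimum readiness required for each referenced item. */
final class RulePrecondition {
    private final Map<String, Readiness> minimumPerItem;

    RulePrecondition(Map<String, Readiness> minimumPerItem) {
        this.minimumPerItem = new HashMap<>(minimumPerItem);
    }

    /** Returns true if every referenced item has reached at least its required level. */
    boolean isSatisfied(Map<String, Readiness> currentPerItem) {
        for (Map.Entry<String, Readiness> entry : minimumPerItem.entrySet()) {
            // Items that are not known at all count as MISSING.
            Readiness actual = currentPerItem.getOrDefault(entry.getKey(), Readiness.MISSING);
            // Enum comparison follows declaration order.
            if (actual.compareTo(entry.getValue()) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

One rule might require its "Lamp" item to be merely LINKED while another insists on ONLINE for "Sensor"; neither choice could be derived from a single global "system started" flag.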

@adimova
Contributor

adimova commented Feb 26, 2018

Should a rule be triggered at all if

I agree with @maggu2810, such rules should not become IDLE. The problem is that currently the ModuleHandlers - which have the needed information - have no way to inform the RuleEngine of their state and of changes to their state. I've proposed a solution in my comment in #4468.

@kaikreuzer
Contributor

@maggu2810 for this issue here, we are only talking about services that need to be fully started in the first place as a pre-condition to consider any kind of rule execution. Whatever might happen during normal operation time (items not there, things offline, whatever) is not relevant for this issue here, but is indeed something that needs to be handled in the appropriate components.

@lolodomo
Contributor

Bump 6 months later.
Is there really no solution we could implement?
The various problems caused by rules being started much too early are the most important issue in openHAB. Hopefully, it is not a blocking issue.
Is there no way to add a setting to delay the startup of the rule engine? With such a setting, I would delay the startup by 2 minutes and 99% of the problems would be solved.

@maggu2810
Contributor

maggu2810 commented Aug 24, 2018

@lolodomo For ESH itself we need a clean solution.

For downstream projects, or at least for your setup at home, you can easily delay the startup of the automation part by adding a bundle that does nothing but delay the automation activation.
I tested a simple demo here that delays the bundle start:

You can improve it to start the delay as soon as e.g. smarthome core has been started, special services are available, ...

-- edit --

I improved the implementation to delay the activation of the automation bundle IF other service references are satisfied, and to stop the bundle if those references are no longer available.
See e.g.
https://github.com/maggu2810/shk/blob/delayed-start/bundles/shk-addon-delayed-automation-start/src/main/java/de/maggu2810/shk/addon/das/impl/automationcore/CheckAutomationRequirements.java
If the thing registry and the item registry are available, the automation core bundle will be started with a delay of 15 seconds; otherwise the automation bundle is stopped.
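
The control flow described here can be sketched in plain Java as follows. This is not the code from the linked repository: DelayedStarter and its methods are hypothetical, and the two Runnables stand in for starting and stopping the automation bundle.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: starts something after a delay once requirements are met, stops it otherwise. */
final class DelayedStarter {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final long delay;
    private final TimeUnit unit;
    private final Runnable start;
    private final Runnable stop;
    private ScheduledFuture<?> pending;
    private boolean running;
    private long generation;

    DelayedStarter(long delay, TimeUnit unit, Runnable start, Runnable stop) {
        this.delay = delay;
        this.unit = unit;
        this.start = start;
        this.stop = stop;
    }

    /** Call whenever the required services (e.g. item and thing registry) appear or disappear. */
    synchronized void requirementsChanged(boolean satisfied) {
        if (satisfied) {
            if (!running && pending == null) {
                long token = ++generation;
                pending = scheduler.schedule(() -> startIfCurrent(token), delay, unit);
            }
        } else {
            generation++; // invalidates a start task that is already queued
            if (pending != null) {
                pending.cancel(false);
                pending = null;
            }
            if (running) {
                running = false;
                stop.run();
            }
        }
    }

    private synchronized void startIfCurrent(long token) {
        if (token != generation) {
            return; // requirements were lost while waiting
        }
        pending = null;
        running = true;
        start.run();
    }

    /** Releases the scheduler thread. */
    void shutdown() {
        scheduler.shutdownNow();
    }
}
```

With `new DelayedStarter(15, TimeUnit.SECONDS, startAutomation, stopAutomation)`, where the two Runnables are supplied by the caller, this mirrors the 15 second delay mentioned above; a fixed delay remains a workaround rather than the clean solution ESH itself needs.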

@kaikreuzer
Contributor

I just came across https://github.com/apache/felix/tree/trunk/systemready - this sounds like a very nice fit for our issue and is probably worth investigating further.
@cschneider As you seem to be the main author of that project, please feel free to comment/advise here - if you do not think that it fits, or think that it is still at too early a stage, this would be helpful input as well 😎.

@cschneider

Systemready is still at an early stage. We currently mainly use it to report ready and alive for Kubernetes. There is also a similar concept in Sling called health checks. Last Wednesday I talked with the creators of this and we found quite a few things that should be added to systemready.

The main missing thing we found is having tags for system checks. Each tag could then represent one of the subsystems you talked about. These tags might then replace the ready and alive types.
Other things are executing each check separately and failing it if it takes too long or blocks. I will create some issues on systemready. Any help with that is welcome.
So I think systemready should be usable soon.

Generally, for determining readiness it is not good enough to look at the framework having started or at the fact that all bundles are started. Especially with declarative services, a service might appear completely asynchronously from the bundle start. So a list of required services is the only stable way. Unfortunately we are having quite some difficulty creating and managing such a list for AEM. I wonder if a special annotation could help with that (like adding a tag to a service) that is then reflected in the Manifest.

I am not sure though if I would use this for switching the internal eventing of ESH on and off. Maybe there is a different solution for this. How about having different events for a thing that really appears on the binding and a thing that is merely recreated because of a startup? In the same way, when shutting down it should be clear whether a thing is removed externally or just because of the shutdown.
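
The tag idea can be sketched in a few lines of plain Java. This is not the systemready API: ReadinessChecks, its methods and the check and tag names in the usage note are made up for illustration, and each probe merely stands in for "is the required service present?".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BooleanSupplier;

/** Hypothetical registry of readiness checks, each carrying one or more subsystem tags. */
final class ReadinessChecks {
    private static final class Check {
        final String name;
        final Set<String> tags;
        final BooleanSupplier probe;

        Check(String name, Set<String> tags, BooleanSupplier probe) {
            this.name = name;
            this.tags = tags;
            this.probe = probe;
        }
    }

    private final List<Check> checks = new ArrayList<>();

    void register(String name, BooleanSupplier probe, String... tags) {
        checks.add(new Check(name, new HashSet<>(Arrays.asList(tags)), probe));
    }

    /** Names of the checks with the given tag that currently fail, in registration order. */
    List<String> failing(String tag) {
        List<String> result = new ArrayList<>();
        for (Check check : checks) {
            if (check.tags.contains(tag) && !check.probe.getAsBoolean()) {
                result.add(check.name);
            }
        }
        return result;
    }

    /** A subsystem is ready when none of its tagged checks fail. */
    boolean isReady(String tag) {
        return failing(tag).isEmpty();
    }
}
```

Registering, say, an "ItemRegistry" check under both a "rules" and a "ui" tag lets each subsystem report readiness on its own, and `failing("rules")` names exactly what is still missing.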

10 participants