Issues with necessary gateway restart #54

Closed
melsom opened this issue Sep 3, 2017 · 35 comments

Comments

@melsom

commented Sep 3, 2017

Hello,

As many others are experiencing, once you reach a certain number of bulbs, the gateway needs restarting every 1-2 days. I use IKEA Trådfri with Home Assistant, and feature-wise it works great.

The thing is, whenever I reboot the gateway to be able to reach the bulbs again, all light groups are shown as off in Home Assistant, even though they are on. This happens every time.

Home Assistant says it supports local push, which means updates should be pushed automatically. Is there any way to fix this? Waiting for IKEA to fix their gateway could take some time. This only started happening when I reached bulb number 20.

@lwis

Collaborator

commented Sep 3, 2017

I've always thought the gateway firmware has a memory leak somewhere. Particularly when you're making requests every 15 seconds, this is quite noticeable.

As for the groups, HA queries the gateway; if the gateway reports the group is off, that's all the information we have.

However, if the app (after a restart) is reporting them as on, there may be a bug in the library.

With the switch to observations imminent, I'm hoping this will be more reliable, and the potential memory leak will be triggered less.

I've heard IKEA are developing a public API for the gateway, so we hopefully also have that to look forward to.
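
To put the 15-second polling interval mentioned above in perspective, here is a rough back-of-the-envelope sketch. It assumes one request per bulb per poll, which may not match what HA actually sends, and the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def polling_requests_per_day(bulbs: int, interval_s: float = 15.0) -> int:
    """Requests per day if every bulb is polled once per interval.

    Assumption: one request per bulb per poll. The real request pattern
    may differ (for example, one request per group).
    """
    polls_per_day = SECONDS_PER_DAY / interval_s
    return int(polls_per_day * bulbs)


if __name__ == "__main__":
    # 20 bulbs is the count at which the problem was first reported.
    print(polling_requests_per_day(20))
```

With observation, the gateway sends notifications when state changes instead of answering a fixed stream of polls, so the request volume should drop considerably if the leak really is tied to request count.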

@lwis lwis closed this Sep 3, 2017

@lwis lwis reopened this Sep 3, 2017

@melsom

Author

commented Sep 3, 2017

When I check the TRÅDFRI app after a reboot, it shows the groups that are on as on, and those that are off as off. But HA does not seem to be able to pull this information.

So there might be a bug somewhere. Please let me know if there is anything I can do to help.

@hanpal

commented Sep 16, 2017

One strange thing is that a reboot (power off/on) worked fine for me for a long time, fully automated. But for the last month it hasn't; only a factory reset works. I've automated this as well, but it required soldering wires to the circuit board. My automation always tries power off/on first, then waits a while and checks whether the gateway works; if not, it does a factory reset. A factory reset is initiated almost every day. It may be connected to some pytradfri problem as well, but each time the connection is lost, it is also lost in the app, so the gateway itself seems to be affected at least.

Memory leaks are highly suspected, probably both in RAM and perhaps also some problems with logs written to flash memory, missing or faulty round robin handling.
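
The escalation described above (power cycle first, wait, check, factory reset only as a last resort) could be sketched roughly as follows. This is not the actual automation: the three callables are placeholders for whatever hardware or network hooks are available, and the 60-second settle time is an arbitrary assumption.

```python
import time
from typing import Callable


def recover_gateway(
    is_responsive: Callable[[], bool],
    power_cycle: Callable[[], None],
    factory_reset: Callable[[], None],
    settle_seconds: float = 60.0,  # assumed boot time, tune as needed
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try the mild fix first and escalate only if it did not help."""
    if is_responsive():
        return "ok"

    power_cycle()
    sleep(settle_seconds)  # give the gateway time to boot
    if is_responsive():
        return "power_cycled"

    # Destructive step: pairings and groups are lost after a factory reset.
    factory_reset()
    return "factory_reset"
```

Passing the checks and actions in as functions keeps the decision logic separate from the hardware, so it can be tried out with fakes before it is wired to a relay.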

@lwis

Collaborator

commented Sep 16, 2017

Have you tried not running pytradfri for a few days to see if the issue persists?

@hanpal

commented Sep 17, 2017

Well, not running pytradfri would most certainly remove the issue but also completely remove the functionality I want. Not interested in using only a controller (remote) or the app, since I have a fully automated lighting system using Tradfri.

@spektren

Contributor

commented Oct 6, 2017

Reboot and factory reset can be issued over the network.

REBOOT:
coap-client -m post -u "Client_identity" -k "key" "coaps://192.168.178.71:5684/15011/9030"

!!!RESET!!! (everything like pairing, groups, etc. will be gone):
coap-client -m post -u "Client_identity" -k "key" "coaps://192.168.178.71:5684/15011/9031"

Maybe this would be a handy extension to pytradfri...
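
As a starting point for such an extension, here is a minimal sketch that wraps the two coap-client commands above from Python. The function names are hypothetical and not part of pytradfri; the paths, port, and flags are taken directly from the commands above, and coap-client is assumed to be on the PATH.

```python
import subprocess
from typing import List

# Resource paths as given in the coap-client commands above.
GATEWAY_PATHS = {
    "reboot": "15011/9030",
    "reset": "15011/9031",  # factory reset: pairing, groups, etc. are lost
}


def build_coap_command(host: str, identity: str, key: str, action: str) -> List[str]:
    """Build the coap-client argument list for a gateway action."""
    if action not in GATEWAY_PATHS:
        raise ValueError(f"unknown action: {action!r}")
    url = f"coaps://{host}:5684/{GATEWAY_PATHS[action]}"
    return ["coap-client", "-m", "post", "-u", identity, "-k", key, url]


def run_gateway_action(host: str, identity: str, key: str, action: str) -> int:
    """Run the command and return the exit code of coap-client."""
    completed = subprocess.run(build_coap_command(host, identity, key, action))
    return completed.returncode
```

Calling run_gateway_action("192.168.178.71", "Client_identity", "key", "reboot") would then issue the same request as the REBOOT command above.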
__

@hanpal are you using a kind of automated reconfiguration of your GW after a factory reset or do you reconfigure everything manually?

@ggravlingen

Owner

commented Oct 7, 2017

@spektren perhaps this could be added as a method to gateway.py? Please feel free to give it a try!

@spektren

Contributor

commented Oct 7, 2017

Thanks for the hint. I'll give it a shot next week.
Cheers

@ioangogo

commented Oct 7, 2017

Memory leaks are highly suspected, probably both in RAM and perhaps also some problems with logs written to flash memory, missing or faulty round robin handling.

Has anyone contacted them yet? I am tempted to contact IKEA support about this, just to see if they are aware of or working on the issue. Has anyone found any more evidence that it is caused by a memory issue? (I do agree that it is a memory leak, but IKEA may need convincing.)

@lwis

Collaborator

commented Oct 8, 2017

@jareware

commented Oct 8, 2017

FWIW, I'm now on my 3rd GW, and I've of course let them know how it keeps failing on both returns. I find it hard to believe they're not aware of the issue, just seems like there's no-one actively working on it.

@ioangogo

commented Oct 8, 2017

@lwis Yeah, if we contact them about a suspected problem, they may have the tools to debug it and confirm our suspicions, that is, if they were the ones doing the development.

@chemicalstorm

commented Oct 12, 2017

Just wanted to say that last week I switched to the dev version of Home Assistant, which uses resource observation, and I have not had any issues with the gateway so far (whereas before I needed to cold reboot it every two days).
You may consider this a workaround until IKEA properly fixes the issue.

@jareware

commented Oct 13, 2017

@chemicalstorm

which uses resource observation

Can you elaborate on that?

@ioangogo

commented Oct 13, 2017

@chemicalstorm

commented Oct 13, 2017

@jareware as @ioangogo said. I think all the code related to this functionality has been developed and merged in this PR home-assistant/home-assistant#7815

@ggravlingen

Owner

commented Oct 13, 2017

@chemicalstorm for the pytradfri lib, support for observation was added in #20. The PR you referenced is for the Home Assistant implementation (which is separate from pytradfri). 😄

@chemicalstorm

commented Oct 13, 2017

@ggravlingen thanks for the clarification. I was only focusing on HA since OP mentioned using it :)
I don't want to pollute this thread any further, so here are some potential workarounds:

  • If you use pytradfri directly, you can use any version that has #20 merged in.
  • If you use Home Assistant, you can use any version that has home-assistant/home-assistant#7815 merged in.

@jareware

commented Oct 13, 2017

FWIW, I don't use anything besides the official Android/iOS apps, and my GW still requires a daily restart to be reliable.

@ioangogo

commented Oct 13, 2017

@grischard

commented Oct 26, 2017

Those of you who experience gateway crashes, what's your power supply? Using Ikea's supplied 2A USB power adapter instead of the USB port at the back of my router has solved the issue for me.

@jareware

commented Oct 26, 2017

Using the supplied one since day 1, still crashy.

@hanpal

commented Oct 26, 2017

Has anyone contacted them yet? I am tempted to contact IKEA support about this, just to see if they are aware of or working on the issue. Has anyone found any more evidence that it is caused by a memory issue? (I do agree that it is a memory leak, but IKEA may need convincing.)

I actually have an informal contact in the Trådfri development team. The instabilities are known, and there are no known hardware-related problems with the gateways. My contact didn't answer when I asked whether they have prioritised development of new features (HomeKit/Alexa/Google) over bug fixes, but this is my impression. The market wants the new features; otherwise I'd expect the stability problem to have been fixed several months ago.

The memory issue is rather obvious. I've done some experimenting by sending high traffic to the gateway for several hours. The gateway survives in the short run, but after a certain amount of traffic it hangs. If I send at half the rate, it still hangs, but after about twice the time. I can reproduce a hang within 1-2 hours with very high traffic. The system is also designed for 100 devices, so I don't think this is a CPU overload issue. I have only 6 bulbs.
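
The "half the rate, twice the time" observation is what one would expect if each request leaks a fixed amount of memory. A small sketch of that model follows; the leak size and free-memory figure are made-up numbers purely for illustration, not measurements from the gateway.

```python
def hours_until_hang(
    free_bytes: float,
    leak_per_request: float,
    requests_per_second: float,
) -> float:
    """Time until memory runs out under a constant leak per request.

    Model assumption: every request leaks the same number of bytes and
    nothing is ever freed, so the hang occurs after a fixed request count.
    """
    requests_until_hang = free_bytes / leak_per_request
    return requests_until_hang / requests_per_second / 3600.0


if __name__ == "__main__":
    # Hypothetical numbers: 1 MB free, 10 bytes leaked per request.
    fast = hours_until_hang(1_000_000, 10, requests_per_second=20)
    slow = hours_until_hang(1_000_000, 10, requests_per_second=10)
    print(fast, slow)  # slow is twice fast under this model
```

A CPU overload, by contrast, would be expected to show up at high rates and clear up at low ones rather than merely take longer, which is why the proportional behaviour points towards a leak.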

The last I heard from my contact is about a firmware release next week.

@hanpal

commented Oct 29, 2017

Is it possible to add an API to pytradfri that determines whether the gateway is unresponsive and may need a reset/power cycle? I have added some exception handling to my scripts that call pytradfri, and by experimenting I have found a way to decide whether the gateway is unresponsive. I would prefer it if this could be determined in a better way, e.g. a call to gateway_responsive() for monitoring. Maybe a combination of ping (the gateway is "active" on the LAN) and some calls to determine that the gateway is unresponsive even though it is pingable.
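
A rough sketch of what such a two-stage check might look like. Nothing here exists in pytradfri: the enum, the function name, and the two callables (one for a LAN-level ping, one for any real API request) are all hypothetical.

```python
from enum import Enum
from typing import Callable


class GatewayState(Enum):
    OK = "ok"
    UNRESPONSIVE = "unresponsive"  # on the LAN, but API calls fail
    OFFLINE = "offline"            # not reachable on the LAN at all


def gateway_state(
    ping: Callable[[], bool],
    api_call: Callable[[], object],
) -> GatewayState:
    """Combine a network-level check with an application-level check."""
    if not ping():
        return GatewayState.OFFLINE
    try:
        api_call()  # any cheap request that exercises the gateway
    except Exception:
        return GatewayState.UNRESPONSIVE
    return GatewayState.OK
```

Distinguishing OFFLINE from UNRESPONSIVE matters for monitoring: the first suggests a power or network problem, the second the hang discussed in this issue.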

@ggravlingen

Owner

commented Oct 29, 2017

@hanpal please feel free to raise a PR with a suggested method!

@hanpal

commented Oct 29, 2017

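I've learned that if I call api_factory, an exception is raised when the gateway is unavailable:

import sys

# api_factory, ip and key come from the pytradfri setup earlier in the script
try:
    api = api_factory(ip, key)
except Exception:  # raised when the gateway cannot be reached
    print("### Trådfri: No contact with the gateway")
    sys.exit(1)

There are some other cases as well: sometimes api_factory passes without an exception, but there are no devices:

# devices is the device list fetched from the gateway earlier in the script
try:
    lights = [dev for dev in devices if dev.has_light_control]
except Exception:
    print("### Trådfri: No devices")
    sys.exit(2)

@ggravlingen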

Owner

commented Oct 29, 2017

@hanpal I guess something that mimics this would then need to be added to the GatewayInfo class (https://github.com/ggravlingen/pytradfri/blob/master/pytradfri/gateway.py). Perhaps as an is_available() method?

@hanpal

commented Oct 29, 2017

Yes, would be nice to have, at least for the first case.

It is not obvious how to handle the other case, with no devices. Why an exception in this case? There may be legitimate cases where no devices have has_light_control. The average user should not have to think about exception handling unless it is documented as a prerequisite for using the method.
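
To illustrate the point: filtering a device list does not raise when nothing matches, it simply returns an empty list, so a caller can test for emptiness instead of relying on an exception. FakeDevice below is a stand-in for illustration, not a pytradfri class.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FakeDevice:
    """Stand-in for a device object; only the attribute used here."""
    name: str
    has_light_control: bool


def find_lights(devices: Iterable[FakeDevice]) -> List[FakeDevice]:
    """Return the devices with light control; may legitimately be empty."""
    return [dev for dev in devices if dev.has_light_control]


if __name__ == "__main__":
    devices = [FakeDevice("remote", False), FakeDevice("bulb", True)]
    lights = find_lights(devices)
    if not lights:
        print("No lights found (not necessarily an error)")
    else:
        print([dev.name for dev in lights])
```

An exception in the earlier snippet would therefore have to come from fetching the device list itself, not from the filter.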

@ggravlingen

Owner

commented Oct 29, 2017

@hanpal 👍 Please feel free to implement the code and then submit a pull request for the main branch.

@hanpal

commented Oct 29, 2017

Unfortunately I'm at beginner level regarding python so I don't think this would be a great idea...

@ggravlingen

Owner

commented Nov 5, 2017

Are people still having trouble with instability on the new firmware version, or can this issue be closed now?

@melsom

Author

commented Nov 5, 2017

Haven't had any issues yet. It has only been a few days, though. Hopefully the fixes IKEA has implemented have resolved this issue!

@migromao

commented Nov 5, 2017

@jareware

commented Nov 6, 2017

Also seeing a dramatic improvement in reliability 👍

@ggravlingen

Owner

commented Nov 6, 2017

Anecdotal evidence suggests stability has increased. Closing issue.

@ggravlingen ggravlingen closed this Nov 6, 2017

10 participants