
Webfrontend freezes after some time #1523

Closed
christianlupus opened this issue Sep 29, 2016 · 24 comments

@christianlupus

commented Sep 29, 2016

What were you doing?

I was surfing in the net and had a tab with octoprint open.

What did you expect to happen?

The tab keeps updating information (like temperature, print status, etc) on a regular basis.

What happened instead?

On a longer print (1 h or more, I guess), the site simply stalls. No information is updated anymore (neither temperature nor printing time, which should clearly increase).

In this situation I also cannot use the GUI at all. This can be critical, for example if I need to stop a runaway printer and otherwise have no control.

Only a reload of the browser page reactivates the GUI and updates the information again. However, this takes some time (depending on hardware and network connection), which is bad in case of trouble.

Branch & Commit or Version of OctoPrint

Version: 1.2.16 (master branch): 354042b

Browser and Version of Browser, Operating System running Browser

Firefox 49.0 under Ubuntu 14.04 and under 15.10

Link to octoprint.log

http://paste.debian.net/844935/

The issue arose at 15:50 for sure with the machine 10.1.100.6.

Link to contents of terminal tab or serial.log

http://paste.debian.net/844947/

Link to contents of Javascript console in the browser

There was nothing logged as far as I can tell. Possibly it was deactivated, but I am unsure here.

Screenshot(s) showing the problem:

This is a screenshot from just now, but the printer should be further along in the process.
screenshot-printer saar 2016-09-29 16-02-00
This is a screenshot a few minutes later after reload of the page:
screenshot-printer saar 2016-09-29 16-06-56
You can see the jump of about 15% and 45 min of printing time.

I have read the FAQ and I did not find anything related to my problem.

@foosel

Owner

commented Oct 4, 2016

Nothing at all in the Javascript console? That sounds wrong, it will at the very least log framework initialization messages there. Could you please take another look when it reproduces? The Javascript console and also the network tab will probably be crucial to getting to the bottom of this.

Also please clarify: while you did have the tab open, did you also have it focused? OctoPrint detects whether the tab is focused and will disable e.g. the gcode viewer updates when it isn't (since those are quite expensive client-side). That wouldn't explain the lack of general updates, but it would still be interesting to know if that was the case or not.

Looking at your log:

2016-09-29 14:29:39,145 - octoprint.server.util.sockjs - WARNING - Could not send message to client 10.1.100.6: 
2016-09-29 14:39:14,223 - octoprint.server.util.sockjs - WARNING - Could not send message to client 10.1.100.6: 

This looks like something went seriously wrong in sending data from the server via the websocket to your browser. This message gets produced when an exception is triggered in the underlying sending code, and usually the message of said exception should be logged right after the ":" here. However, instead we got an empty string. It might be that the originating exception was thrown due to a closed websocket.
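
The logging pattern described here can be sketched in Python; the client class and send path below are made up for illustration, only the logger name and message format follow the log excerpt. Notably, some exceptions stringify to an empty message, which would produce exactly the truncated-looking warning seen above:

```python
import logging

logger = logging.getLogger("octoprint.server.util.sockjs")

class FakeClient:
    """Stand-in for a websocket session (hypothetical, not OctoPrint code)."""
    ip = "10.1.100.6"

    def send(self, payload):
        # Simulate a low-level failure whose exception carries no message
        raise IOError()

def send_message(client, payload):
    try:
        client.send(payload)
        return True
    except Exception as e:
        # str(e) is "" for a bare IOError(), so the warning ends right
        # after the colon, just like in the quoted octoprint.log lines
        logger.warning("Could not send message to client %s: %s", client.ip, e)
        return False

ok = send_message(FakeClient(), {"temps": []})
```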

How reliably can you reproduce this? And how are your client and server connected to your LAN, cable or wifi?

@christianlupus

Author

commented Oct 4, 2016

OK, just for clarification: you suggest using the web developer tools in Firefox, right? I will let it run once my Pi is online again and I am printing. This might happen this evening or tomorrow.

The tab was in the background. I had it open to be able to abort in case of a problem with the printer. I was surfing the net while it was running.

I would guess that a closed websocket will not lead to a complete breakdown in Python, will it? I would rather expect to get an exception with the possibility to handle (and log) it. Or am I mistaken?

The server is in my LAN (cable). The named IP is a connection from outside using OpenVPN. But the same problem also arises purely within the LAN (I have to admit, using my laptop over WiFi). I could connect the laptop by cable instead. Furthermore, I want to mention that most of the time I have an SSH connection to the server open (through the VPN), and it is configured to disconnect on ping errors. Thus the network seems reliable to me (no lag or anything either).
The issue seems reliably reproducible: after a period of 1–60 min the timer stops working.

@foosel

Owner

commented Oct 4, 2016

You suggest to use the Webdeveloper tools in firefox, right?

Yes

I would guess that a closed websocket will not lead to a complete breakdown in python, will it?

No, but since the websocket is the transport used to push updates to the frontend, it will lead to the updates stopping. You should also get an offline dialog when the client realizes that the socket was closed; since you didn't get that, apparently something happened that caused the communication to stop getting through while the socket still appeared open to both sides.

Usually it goes like this: the socket goes down, both sides realize it, the Python server logs a message ("connection closed"), the JavaScript client triggers the "Server is offline" overlay (with a short delay to keep it from flickering on unreliable connections) and starts attempting to reconnect. What appears to have happened here is that the server at first ran into some hard trouble sending the usual updates to the client but didn't detect anything wrong with the connection; the client likewise stopped receiving updates but didn't detect anything wrong with the connection either. Since the server logged neither "client closed" nor further warnings or even errors after the two quoted ones, I'm leaning towards the issue lying with the client in your browser here. I remember someone else running into similar issues a while back, but those appear to have been resolved with 91b5fc6 - that's present in 1.2.16 however, so it's probably not the same cause.

So something is fishy here and any logs or network traces from the client side might help to clear it up.

@foosel

Owner

commented Oct 4, 2016

As a side note, I have never seen that happen myself, and I have tabs open pretty much constantly on OctoPrint instances that are busy printing. Hence I need as much data from your side as possible to have even a remote chance of understanding what's happening here :)

@christianlupus

Author

commented Oct 4, 2016

If you like, I can also let Wireshark sniff all the traffic, if that data is interesting for you...

@foosel

Owner

commented Oct 4, 2016

Might give us some additional insights :) Right now anything might give the final clue as to what is happening.

I just sincerely hope it's not another case of "browser decided to do weird stuff with the web socket connection after an update" :/

foosel added a commit that referenced this issue Oct 4, 2016

@foosel

Owner

commented Oct 5, 2016

@christianlupus Forgot to mention, I also pushed a commit to the maintenance branch yesterday that should provide some more verbose logging output in case of issues on the websocket, if octoprint.server.util.sockjs is configured to DEBUG logging in logging.yaml:

loggers:
  octoprint.server.util.sockjs:
    level: DEBUG

If you could reproduce the same issue on maintenance with the logging upped accordingly, we might also get some clue what happens before communication shuts down (the mysterious unlogged exception).
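
For completeness, the YAML snippet above is roughly equivalent to this programmatic call (only a sketch; OctoPrint reads logging.yaml itself at startup, so this merely shows what the setting does):

```python
import logging

# Equivalent of setting octoprint.server.util.sockjs to level DEBUG
# in logging.yaml: messages at DEBUG and above now pass this logger
sockjs_logger = logging.getLogger("octoprint.server.util.sockjs")
sockjs_logger.setLevel(logging.DEBUG)
```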

@christianlupus

Author

commented Oct 6, 2016

Hello,
I did everything, but I did not find a way to export the logs of the web developer tools in Firefox. Can you give me some advice?
Thanks

@christianlupus

Author

commented Oct 6, 2016

OK, I tried my best, but I was unable to export the javascript console.

First here you can have a wireshark sniff from the client filtered for only the relevant IP.

Further a plain text copy of the logger in firefox with network and javascript enabled.

At the end you see three clicks to load an OUTDOOR file. This was not possible/not reflected by the GUI. Nevertheless, these POST requests to the webserver were successful (?). For example, the first of these led to the response header shown, which indicates at least that the octoprint server is still responding, doesn't it?

The information is routed through haproxy as suggested by the wiki page. In the logs of haproxy I do not find anything interesting.

Here you see that javascript logging is enabled and nothing is shown:
bildschirmfoto vom 2016-10-06 13 09 09
Please note that before these lines a small warning was issued: getPreventDefault() sollte nicht mehr verwendet werden. Verwenden Sie stattdessen defaultPrevented. ("getPreventDefault() should no longer be used. Use defaultPrevented instead.") So I think that in general logging is active....

The recent octoprint.log can be found here.
There is not much debug output in the file. Do I need to register the logging.yaml anywhere except for putting it in ~/.octoprint?

OK, I think this is all the information I can directly give you. If you need more, please tell me.

PS: My version is 6c21ee3.

@foosel

Owner

commented Oct 10, 2016

Hm. In the recent octoprint.log there's not the same error sending a message to the client that we saw earlier (and for which I upped the logging in the first place, to see details in the form of a stack trace). But the updates still stalled? If so, that would be bad, because it would mean that the errors when sending a message to the stalled client that we saw in the earlier log are unrelated to the issue after all. Which means we'd be back at square one.

As a side note, it looks like you modified logging.yaml correctly. You don't need to register it anywhere.

I'm not surprised that you could still issue POST requests - what's breaking down here appears to be only the push connection (which is what pushes the progress data and all that stuff to the frontend). Clicking buttons and all that goes over the regular API; the websocket is only used for pushing "real time" (more or less) data from the backend to the frontend.

I guess the next step will be to up the logging on the frontend, to try to see what is happening there. But first I'll check whether your PCAP file (thanks for that) contains some hint.

foosel added a commit that referenced this issue Oct 10, 2016

Some client logging to help with #1523
  * Log heartbeat messages
  * Log disconnects including error code
  * Log when messages are received
  * Log when messages are NOT received after 5s
@foosel

Owner

commented Oct 10, 2016

I sadly was not able to glean anything from the pcap file so far, but I made a branch with some extra logging in the client for you, improve/clientLogging1523.

That branch logs to the javascript console when

  • a heartbeat message is received
  • a regular message is received
  • the connection is closed (including the closing message which contains a code for the reason)
  • no regular message was received for the last 5s

Should look somewhat like this:

image

Please test if you can reproduce the issue with that branch (the commit linked above is the only thing that makes that branch differ from 1.2.17rc1 btw) and if so provide a similar screenshot. It's important that the line "Didn't get a message from the backend for the past 5 seconds, that's unusual!" is on the screenshot then (it's not part of my own screenshot above) - well, and if that's not to be found, that's some interesting info as well! What I want to see is what happens right before (any weird message/heartbeat/close behaviour or something) and also what happens after (are there still heartbeats etc).
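
The watchdog in that branch lives in the JavaScript client; its idea can be sketched as follows (class and method names here are made up, only the 5 s timeout matches the branch, and the demo shortens it so it runs instantly):

```python
import time

class MessageWatchdog:
    """Warn when no message arrived from the backend for `timeout` seconds."""

    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_message = time.monotonic()
        self.warnings = []

    def on_message(self, msg):
        # reset the timer whenever a regular message comes in
        self.last_message = time.monotonic()

    def check(self):
        # meant to be called periodically, e.g. from an interval timer
        if time.monotonic() - self.last_message > self.timeout:
            self.warnings.append(
                "Didn't get a message from the backend for the past "
                "%d seconds, that's unusual!" % self.timeout)

# shortened timeout so the demo finishes quickly
wd = MessageWatchdog(timeout=0.05)
time.sleep(0.1)
wd.check()         # nothing arrived -> a warning is recorded
wd.on_message({})
wd.check()         # fresh message -> no additional warning
```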

@christianlupus

Author

commented Oct 10, 2016

Hello again,

I updated to the named branch and made one more configuration change: I enabled the webcam. Now I have had the web frontend open for more than 2 hours and no problem arose. Next I will go back and unconfigure the webcam; this way I can see if the webcam made any difference. Unfortunately, I cannot give any more information today, as I have to leave soon and it is unlikely that the problem will arise again in the remaining time.
I will post as soon as I have new information.

@foosel

Owner

commented Nov 16, 2016

@christianlupus I've since seen this too (once or twice) but haven't been able to figure out much more. @Salandora appears to be able to reproduce it somewhat reliably, but so far it vanishes once debug probes are inserted.

One thing we noticed in your pcap file is that the socket runs in permessage-deflate mode. Normally a stock OctoPrint install shouldn't support this. Hence a question: could you provide the output of ~/oprint/bin/pip freeze | grep -i tornado and also some more information regarding your network setup, specifically which proxies might be involved in front of OctoPrint (besides the stock haproxy bundled with OctoPi)?

PS: Your log files on paste.debian.net vanished - my guess is they were auto deleted. In the future better use a more reliable pasting service such as gist.github.com

@christianlupus

Author

commented Nov 16, 2016

Hello again,
I have to admit I had mechanical issues with my printer and therefore have not done much experimenting with octoprint. The requested gist has been generated.

Regarding the setup: first, I do not use OctoPi. I use a plain Raspbian image and installed octoprint on it. The server is proxied via the proposed haproxy. The octoprint server listens on port 5000, and haproxy is configured to listen on port 80 and redirect everything except for a few single path prefixes to port 5000 on localhost.
The RPi has one network interface and is connected to the LAN at my home. In the LAN, the default gateway is configured to redirect some IP ranges (namely the LAN of my parents etc.) to a VPN client machine that is connected to a personal VPN. I can connect to this VPN and reach the octoprint server. (At least this was the setup when the issue was last reported. I had already seen the very same effect purely within the LAN at home, but no further investigation was done there.) Do you need all IP ranges?

I am sorry about the vanished logs; I was not aware of this. I used the tool pastebinit and did not see that the pastes were only published temporarily.

@foosel

Owner

commented Nov 16, 2016

Ok, that's the tornado version that should be installed, just as expected. So it's not that.

And your setup (by the way, that description up there is sufficient, thanks; no need for a full-fledged network plan with IP ranges :)) doesn't sound that weird either.

What somewhat troubles me is the permessage_deflate header in your websocket response, because stock OctoPrint shouldn't set it and right now I have no explanation where that might be originating. I first suspected that haproxy might be adding that for whatever reason, but at least in my setup here that was not the case. Just to make sure though, is your haproxy configured as proposed in the install guide or are there any large deviations? And what version of haproxy is that?

In any case, considering that @Salandora is seeing the same (or at least a virtually indistinguishable) behaviour but does not have that permessage_deflate header set, it's probably not the culprit here (and in fact, the header itself wouldn't have been the culprit either, but rather would have hinted at a tornado bug potentially being in play here).

Of course, it's still impossible for me to reproduce this issue on a whim. I only saw it a couple of weeks ago during a semi-long print, and once I tried to look further into it, it of course stopped showing up again. That makes pinning this down extremely hard. Might be a timing issue/race condition somewhere.

@christianlupus

Author

commented Nov 16, 2016

I mainly followed the octoprint wiki to configure haproxy. I only added one additional prefix for lighttpd for other printer-related purposes. This should not make the difference here. haproxy is version 1.5.8-3+deb8u2 according to dpkg.

@foosel

Owner

commented Nov 16, 2016

I agree.

@foosel

Owner

commented Nov 24, 2016

Small update. It might actually be the permessage-deflate after all. By mere coincidence I just figured out that the tornado requirement on current stable & devel was able to pull in a beta version of tornado 4.1, which supports permessage-deflate; that support is known to be buggy with the websocket implementation in use (potentially mangling messages if they exceed a certain size and making things just stop). See this ticket for more info on this.
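
For context on why a stock install shouldn't negotiate permessage-deflate: since tornado 4.1, `WebSocketHandler.get_compression_options()` returns `None` by default, and the server only offers compression when a handler overrides it to return a dict. The classes below are standalone sketches mirroring that behaviour, not OctoPrint or tornado code:

```python
class WebSocketHandlerSketch:
    """Mirrors the default of tornado.websocket.WebSocketHandler (>= 4.1)."""

    def get_compression_options(self):
        # None means: do not offer permessage-deflate to the client
        return None

class CompressingHandlerSketch(WebSocketHandlerSketch):
    """A handler that opts in to compression (what a stock install lacks)."""

    def get_compression_options(self):
        # An empty dict enables permessage-deflate with default settings
        return {}
```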

I've pinned tornado to 4.0.2 (soon to be pushed to devel) and will push out 1.3.0rc2 with that change included within the next couple of hours. Since my goal is to release 1.3.0 within the next few weeks (in the hopes that nothing problematic pops up that prevents this from working out) I'll not backport it to 1.2.x for now.

It would be interesting, though, if you could test once 1.3.0rc2 is out whether you can still observe the issue with it.

@bolsoncerrado

commented Jan 20, 2017

Not sure if this is 100% related or actually the same issue. I just posted this on the TouchUI GitHub:

"Not sure who to blame here, but I'll try it here first since it mainly happens on TouchUI on the LOCAL Pi + touch display....

Thing is, when prints finish overnight and the printer stays idle for a long, long time, the RasPi is laggish as hell and takes ages to respond to touch commands...

Happens on both Pi 2 and Pi 3. Restarting the service seems to solve the issue, so I wonder if there's some sort of memory hog growing during the idle state, like the "terminal" log building up in MEMORY instead of in the on-disk cache, that makes the Pi go slow as hell??

Obviously, the Pi reacts fine when accessed remotely via a computer in the LAN."

And I'm on the latest stable 1.3.0 master :(

EDIT: Just to add, once again, that I notice this only on the local Pi + touch display + TouchUI. And it does not completely freeze, it just lags big time.

@BillyBlaze

Collaborator

commented Jan 21, 2017

Not related 👍 But after reading this issue I am wondering: @christianlupus, is this resolved for you in 1.3.0?

@christianlupus

Author

commented Jan 21, 2017

@BillyBlaze I am not sure yet. I have to rebuild most of my printer and need to wait for spare parts (already ordered but not yet arrived); I broke the head due to a wrong assembly. Until then it does not make much sense to use Octoprint... I will report soon, I hope.

@foosel

Owner

commented Mar 8, 2017

@christianlupus any chance to check? :)

@christianlupus

Author

commented Apr 4, 2017

Not yet, sorry. The cable had the wrong pinning. Reordered...

@christianlupus

Author

commented Jun 8, 2017

OK, I finally got my printer running again. Yay!
I just updated to the rc/maintenance channel in order to get a recent version. So far there has been no problem in the process of printing, so I have good hopes that the problem no longer exists, as multiple prints have completed successfully without lags or problems in the browser.

I will close this issue and reopen it in case the problem reappears.
