Statuspage-error on some TP-Link WR1043v1 with gluon-v2021.1 #2256

Closed
tackin opened this issue Jun 30, 2021 · 4 comments · Fixed by #2262

@tackin

tackin commented Jun 30, 2021

On TP-Link WR1043v1 nodes running gluon-v2021.1-1-g0f9a6334+, requesting the status page randomly fails with:

XML Parsing Error: not well-formed
Location: http://[2001:bf7:fc0f:1:76ea:3aff:fee4:d1da]/cgi-bin/status
Line Number 7, Column 12:
 		<title><!DOCTYPE html>

curl reports:

The called action terminated with an exception:
/usr/lib/lua/gluon/web/template.lua:43: Failed to execute template 'status-page'.
A runtime error occurred: [string "/lib/gluon/status-page/view/status-page.htm..."]:90: attempt to index global 'nodeinfo' (a nil value)

It seems that nodeinfo is sometimes undefined because a gluon-neighbour-info call timed out.

Usually we only see this on small, older node hardware under higher load, but this node was almost idle.

@mweinelt
Contributor

The question is whether we want to relax the timeout a bit, or whether that has downsides.

local nodeinfo = json.parse(util.exec('exec gluon-neighbour-info -d ::1 -p 1001 -t 1 -c 1 -r nodeinfo'))
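To get a feeling for how much slack the one-second timeout leaves, the same query can be repeated from a shell on the node with increasing timeouts. The following is only a minimal sketch, not the change discussed for the template: `query_nodeinfo` and the `QUERY_CMD` override are hypothetical names, and the flags are exactly those from the call above, with only the `-t` value varied.

```shell
#!/bin/sh
# Hypothetical helper: repeat the nodeinfo query with increasing timeouts
# and report the first timeout (in seconds) that produced any output.
# QUERY_CMD is an assumption added for testing; it defaults to the real tool.
QUERY_CMD="${QUERY_CMD:-gluon-neighbour-info}"

query_nodeinfo() {
    for timeout in "$@"; do
        # Same flags as the status-page call; only the timeout differs.
        out="$($QUERY_CMD -d ::1 -p 1001 -t "$timeout" -c 1 -r nodeinfo 2>/dev/null)" || out=""
        if [ -n "$out" ]; then
            echo "timeout=$timeout"
            echo "$out"
            return 0
        fi
    done
    echo "no reply for any timeout" >&2
    return 1
}
```

Called as `query_nodeinfo 1 2 5`, it prints the first timeout that yielded a reply followed by the reply itself, and returns a non-zero status if none of the timeouts produced output.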

@grische
Contributor

grische commented Jul 2, 2021

I was about to report a similar bug on my WR841N v13; the two might be related.

Whenever the router boots without WiFi (disabled via the physical button), I am unable to find any status-page/provider processes:

root@test:~# ps  | grep status-page/provider
30986 root      1216 S    grep status-page/provider

That, in turn, causes gluon-neighbour-info to exit with code 1 and print no output:

root@test:~# gluon-neighbour-info -d ::1 -p 1001 -t 60 -r nodeinfo
root@test:~# echo $?
1

On a working machine, by comparison, I can see plenty of status-page/provider processes:

root@itworkshere:~# ps  | grep "status-page/provider"
10408 root       704 S    sse-multiplex exec /lib/gluon/status-page/providers/stations 'mesh0'
10414 root      1328 S    /lib/gluon/status-page/providers/stations mesh0
10415 root       704 S    sse-multiplex exec /lib/gluon/status-page/providers/stations 'mesh1'
10416 root       704 S    /usr/sbin/sse-multiplex exec /lib/gluon/status-page/providers/neighbours-batadv
10422 root      1328 S    /lib/gluon/status-page/providers/stations mesh1
19920 root      1328 S    /lib/gluon/status-page/providers/stations mesh0
...

Where is this status page provider started? How can I continue debugging here?
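In the failing case above, the only line `ps | grep` returned was the grep process itself, which is easy to misread. As a small debugging aid, the check can be wrapped so that the grep line is excluded and only a count is printed. This is a rough sketch: `count_procs` is a hypothetical helper name, and it drops every line that contains the word "grep", which is an assumption that could hide an unrelated process with "grep" in its command line.

```shell
#!/bin/sh
# Hypothetical helper: count the lines of `ps` output (read from stdin)
# that match a pattern, ignoring the grep process doing the matching.
count_procs() {
    # First grep selects matching lines, second one drops grep itself
    # and prints the number of remaining lines. `|| true` keeps the
    # exit status at 0 even when the count is 0.
    grep -- "$1" | grep -v -c 'grep' || true
}
```

On a node this could be run as `ps | count_procs 'status-page/provider'`. The same helper can be pointed at other process names, for example `ps | count_procs respondd`, to see whether the daemon answering the nodeinfo query is running at all.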


There is actually a second bug overlapping with the one above, which is why the browser error message is so cryptic at first. The response is malformed: it contains half an HTML page, which ends where the nodeinfo call is, followed by a second (well-formed) HTML page containing the error message (probably this one). The two together cannot be rendered, of course.

But I would open a second bug report for this if there are no objections.
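The concatenated response described above can be spotted without a browser by counting DOCTYPE declarations in the raw output. This is a minimal sketch under the assumption that each of the two pages starts with its own `<!DOCTYPE` line, as the browser error suggests; `count_doctypes` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical check: count the lines of a response (read from stdin)
# that contain a DOCTYPE declaration. A value above 1 suggests that an
# error page was appended to a partially rendered page.
count_doctypes() {
    # -i: case-insensitive, -c: print the number of matching lines.
    # `|| true` keeps the exit status at 0 when nothing matches.
    grep -i -c '<!DOCTYPE' || true
}
```

For example, `curl -g -s 'http://[2001:bf7:fc0f:1:76ea:3aff:fee4:d1da]/cgi-bin/status' | count_doctypes` would be expected to print 1 for a healthy page and 2 for the broken response described here.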

@blocktrron
Member

@mweinelt et al

Can you open a PR for increasing the timeout, so we can evaluate on a common basis whether or not it improves our situation?

@grische

This issue is now tracked as #2260

Please try 19381a2, it should fix the segfault. freifunk-gluon/packages@825aa0c#commitcomment-53000911 is still valid

mweinelt added a commit that referenced this issue Jul 3, 2021
It was found that a one second timeout for nodeinfo data may be too low,
so that when a node is otherwise occupied that timeout may be reached
too often.

The nodeinfo query response is also vital to the status-page base
template, so that when it times out, the site will be turned in a broken
state, that it cannot recover from.

Fixes: #2256
@grische
Contributor

grische commented Jul 4, 2021

@blocktrron The problem with the missing nodeinfo is fixed with the most recent commit. I was also able to confirm that it was all due to a crashing respondd.

neocturne pushed a commit that referenced this issue Jul 12, 2021
neocturne pushed a commit that referenced this issue Feb 3, 2022
(cherry picked from commit 76185e3)