Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix rare segfault on wait_for_update #84

Merged

Conversation

LinusCDE
Copy link
Collaborator

Found this, when using this implementation for plato:

grafik

The cause was that sem_timedwait was called on sem_open's result SEM_FAILED. That call failed with errno == 22 ("Invalid value"). Seems that the rust code called the code too fast and and probably made more likely by more frequent wait()'s in plato due to the ui system. Plato uses a bus system for almost everything, so afaik, calls should never be happen in parallel.

The implemenation was 1:1 like the reference implementation. While this didn't ever happen on the official rm2fb client, the cause might still be interesting to investigate and could possibly hint to a problem in the original impl. (cc @raisjn)

Added a way to do several attempts with delays which should provide a good enough workaround. Also turned later occuring segfault into a proper panic, should 3 attempts/10ms delay be not enough for some app.

@raisjn
Copy link

raisjn commented Dec 21, 2021

interesting find! it's a bit surprising that it happens here. rm2fb should have NULL ptr handling added to make sure it doesn't also segfault, i wonder what's happening. EINVAL for sem_open() indicates something weird is going on.

@fenollp
Copy link
Collaborator

fenollp commented Dec 21, 2021

LGTM!
Just maybe use a u8 instead of usize for the counter...

@raisjn
Copy link

raisjn commented Dec 23, 2021

i'm still curious about the cause here - can you add print statements to help diagnose / see what the requested shm path is? if it isn't happening in rm2fb (we would see segfaults in plato if it was, right?), it is likely better to figure out and fix the underlying cause.

EINVAL is given back when the name is invalid in shm_open()

@LinusCDE
Copy link
Collaborator Author

Reverted the attempt stuff and added a lot of print debugging (commit LinusCDE@ae72c35).

Built current plato (master branch) against this lib: dist.tar.gz. You can just extract this on device and run the .plato.sh.

If you wish to see the exact sysv messages sent, you can add this spy lib between: libswtfb_sysv_spy.so.xz (Source)
Is also a libchook, you can add with LD_PRELOAD=/home/root/libsysv_spy.so. You can also add both this and the rm2fb client like this: LD_PRELOAD="/home/root/libsysv_spy.so /opt/lib/librm2fb_client.so.1" (used this originally for fixing bugs when implementing the client).

This really seems to be something that is really timing sensitive. Because running the binary with print statements works perfectly fine. You can see for yourself by replacing the "plato" binary with this one. The only change is that I just commented out every new println! statement.

It is also probably noteworthy to mention, that println! on rust by default locks a mutex. So it will probably cause a slightly longer delay than std::cout or printf because of the extra synchronization.

The O_CREAT option needs the mode parameter specified.
The server code did do this, but not the client side,
resulting in an error only if the client actually got
to creating the semaphore first.
@LinusCDE
Copy link
Collaborator Author

LinusCDE commented Dec 23, 2021

The new commit should fix the issue for good. Seems that the O_CREAT needs the mode for creating a shared memory file if it doesn't exist. If the code waited a bit longer after the sysv message, the update would be complete and the server side would have created the semaphore correctly. The client didn't specify these arguments and if it actually needed to create it (because executed first), the code would error, possibly on missing/wrong arguments for creating the file.

Not sure why the c++ client/wrapper didn't cause this. My only guess would be that the wrapper uses Entware's libc (currently 2.27) which defaults to some correct values if they are not specified. Since we usually use a generic compiler for rust on arm, the libc crate is most likely the system one (currently 2.31) which might just behave differently when the arguments.

Specifying 0o644 as on the server side fixes the issue. @raisjn this might also be a nice upstream change. Why not mandatory this is probably better to make it explicit to prevent problems when compiling with/for different glibc/pthread versions. Also, do you know the use of the 4th argument (0). I doesn't seem to affect anything so I left it out for now.

screenshot4413

Edit: Mixed up which libc is which version. This seems to be more like a, we removed this default behaviour for clarification. So updating it for the c++ client is definitely recommended as it might be a case of future proofing, should entware's libc update.

@LinusCDE
Copy link
Collaborator Author

Had some weird behaviour. Not sure if really needed, added the 0 argument just to be sure.

LinusCDE added a commit to LinusCDE/remarkable2-framebuffer that referenced this pull request Dec 23, 2021
raisjn pushed a commit to ddvk/remarkable2-framebuffer that referenced this pull request Dec 23, 2021
@fenollp fenollp merged commit 22cea77 into canselcik:master Jan 6, 2022
@LinusCDE LinusCDE added this to the 0.6.0 milestone Jan 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants