Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

11 is broken on NFS #735

Closed
1 task done
aep opened this issue Jul 18, 2023 · 6 comments
Closed
1 task done

11 is broken on NFS #735

aep opened this issue Jul 18, 2023 · 6 comments

Comments

@aep
Copy link

aep commented Jul 18, 2023

Regression description

when upgrading from 10 to 11 and /data is on NFS you'll get

FoundryVTT | 2023-07-18 18:01:49 | [error] Database is not open

foundryvtt/foundryvtt#9766

mentiones this is due to the change to leveldb.
leveldb upstream works fine on NFS so this is suspicious.

according to discord this bug is intentional to cause issues for professional setups.

However, that's discord, and i feel like the more likely explanation is that they're wrapping leveldb in some exclusive locking to avoid accidentally starting multiple instances.

opening this issue here because it's more likely there's other people in this community with the same issue.

unfortunately their javascript is somewhat obfuscated, making it harder to find the issue.

Worked up to version

10

Stopped working in version

11

Relevant log output

oundryVTT | 2023-07-18 18:01:49 | [error] Failed to connect to database "messages": Database is not open
FoundryVTT | 2023-07-18 18:01:49 | [error] Failed to connect to database "messages": Database is not open
Error: Failed to connect to database "messages": Database is not open
    at maybeOpened (/home/foundry/resources/app/node_modules/abstract-level/abstract-level.js:133:18)
    at /home/foundry/resources/app/node_modules/abstract-level/abstract-level.js:160:13
    at process.processTicksAndRejections (node:internal/process/task_queues:77:11)
FoundryVTT | 2023-07-18 18:01:49 | [error] Database is not open
Error: Database is not open
    at LevelDatabase.keys (/home/foundry/resources/app/node_modules/abstract-level/abstract-level.js:697:13)
    at LevelDatabase.compactFull (file:///home/foundry/resources/app/dist/database/backend/level-database.mjs:1:1647)
    at LevelDatabase.close (file:///home/foundry/resources/app/dist/database/backend/level-database.mjs:1:1171)
    at Module.disconnect (file:///home/foundry/resources/app/dist/database/database.mjs:1:1658)
    at World.deactivate (file:///home/foundry/resources/app/dist/packages/world.mjs:1:11206)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

Code of Conduct

  • I agree to follow this project's Code of Conduct
@aep
Copy link
Author

aep commented Jul 18, 2023

strace reveals its not even using flock. its creating a LOCK file which suceeds

/home/foundry # strace -f -p 87 2>&1 | grep messages
[pid    87] access("/data/Data/worlds/kmkm/data/messages", F_OK) = 0
[pid    94] mkdir("/data/Data/worlds/kmkm/data/messages", 0777 <unfinished ...>
[pid    96] statx(AT_FDCWD, "/data/Data/worlds/kmkm/data/messages", AT_STATX_SYNC_AS_STAT, STATX_ALL,  <unfinished ...>
[pid    95] mkdir("/data/Data/worlds/kmkm/data/messages", 0755 <unfinished ...>
[pid    95] rename("/data/Data/worlds/kmkm/data/messages/LOG", "/data/Data/worlds/kmkm/data/messages/LOG.old" <unfinished ...>
[pid    95] open("/data/Data/worlds/kmkm/data/messages/LOG", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666 <unfinished ...>
[pid    95] mkdir("/data/Data/worlds/kmkm/data/messages", 0755 <unfinished ...>
[pid    95] open("/data/Data/worlds/kmkm/data/messages/LOCK", O_RDWR|O_CREAT|O_LARGEFILE, 0644 <unfinished ...>
[pid    95] access("/data/Data/worlds/kmkm/data/messages/CURRENT", F_OK <unfinished ...>
[pid    95] open("/data/Data/worlds/kmkm/data/messages/CURRENT", O_RDONLY|O_LARGEFILE) = 39
[pid    95] open("/data/Data/worlds/kmkm/data/messages/MANIFEST-000002", O_RDONLY|O_LARGEFILE) = 39
[pid    95] open("/data/Data/worlds/kmkm/data/messages", O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_DIRECTORY) = 39
[pid    95] open("/data/Data/worlds/kmkm/data/messages/000003.log", O_RDONLY|O_LARGEFILE) = 39
[pid    95] readv(39, [{iov_base="|\360\244_ \r\1\1\0\0\0\0\0\0\0\2\0\0\0\1\32!messages!J"..., iov_len=32767}, {iov_base="", iov_len=1024}], 2) = 3367
[pid    95] open("/data/Data/worlds/kmkm/data/messages/000005.ldb", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 39
[pid    95] writev(39, [{iov_base="", iov_len=0}, {iov_base="\242\32\200\0\"\253\f!messages!JMUZweuXaLnRGbw"..., iov_len=1950}], 2) = 1950
[pid    95] open("/data/Data/worlds/kmkm/data/messages/000005.ldb", O_RDONLY|O_LARGEFILE) = 39
[pid    95] stat("/data/Data/worlds/kmkm/data/messages/000005.ldb", {st_mode=S_IFREG|0644, st_size=2107, ...}) = 0
[pid    95] open("/data/Data/worlds/kmkm/data/messages/000005.sst", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
[pid    95] unlink("/data/Data/worlds/kmkm/data/messages/000005.ldb") = 0

compared with a working tmpfs mount, i see no difference

[pid    87] access("/data/Data/worlds/kmkm/data/messages", F_OK) = 0
[pid    96] mkdir("/data/Data/worlds/kmkm/data/messages", 0777 <unfinished ...>
[pid    94] statx(AT_FDCWD, "/data/Data/worlds/kmkm/data/messages", AT_STATX_SYNC_AS_STAT, STATX_ALL,  <unfinished ...>
[pid    97] mkdir("/data/Data/worlds/kmkm/data/messages", 0755 <unfinished ...>
[pid    97] rename("/data/Data/worlds/kmkm/data/messages/LOG", "/data/Data/worlds/kmkm/data/messages/LOG.old" <unfinished ...>
[pid    97] open("/data/Data/worlds/kmkm/data/messages/LOG", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666 <unfinished ...>
[pid    97] mkdir("/data/Data/worlds/kmkm/data/messages", 0755 <unfinished ...>
[pid    97] open("/data/Data/worlds/kmkm/data/messages/LOCK", O_RDWR|O_CREAT|O_LARGEFILE, 0644) = 39
[pid    97] access("/data/Data/worlds/kmkm/data/messages/CURRENT", F_OK) = 0
[pid    97] open("/data/Data/worlds/kmkm/data/messages/CURRENT", O_RDONLY|O_LARGEFILE) = 40
[pid    97] open("/data/Data/worlds/kmkm/data/messages/MANIFEST-000004", O_RDONLY|O_LARGEFILE) = 40
[pid    97] open("/data/Data/worlds/kmkm/data/messages", O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_DIRECTORY) = 40
[pid    97] open("/data/Data/worlds/kmkm/data/messages/000007.log", O_RDONLY|O_LARGEFILE) = 40
[pid    97] open("/data/Data/worlds/kmkm/data/messages/000011.log", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 40
[pid    97] open("/data/Data/worlds/kmkm/data/messages/MANIFEST-000010", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 41
[pid    97] open("/data/Data/worlds/kmkm/data/messages", O_RDONLY|O_LARGEFILE) = 42
[pid    97] open("/data/Data/worlds/kmkm/data/messages/000010.dbtmp", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 42
[pid    97] rename("/data/Data/worlds/kmkm/data/messages/000010.dbtmp", "/data/Data/worlds/kmkm/data/messages/CURRENT") = 0
[pid    97] open("/data/Data/worlds/kmkm/data/messages", O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_DIRECTORY) = 42
[pid    97] unlink("/data/Data/worlds/kmkm/data/messages/000007.log") = 0
[pid    97] unlink("/data/Data/worlds/kmkm/data/messages/MANIFEST-000004") = 0
[pid    87] access("/data/Data/worlds/kmkm/data/messages.db", F_OK <unfinished ...>
[pid    97] open("/data/Data/worlds/kmkm/data/messages/000009.ldb", O_RDONLY|O_LARGEFILE <unfinished ...>
[pid    97] stat("/data/Data/worlds/kmkm/data/messages/000009.ldb",  <unfinished ...>
[pid    96] open("/data/Data/worlds/kmkm/data/messages/000012.log", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666 <unfinished ...>
[pid    98] unlink("/data/Data/worlds/kmkm/data/messages/000013.ldb") = -1 ENOENT (No such file or directory)
[pid    98] open("/data/Data/worlds/kmkm/data/messages", O_RDONLY|O_LARGEFILE <unfinished ...>
[pid    98] open("/data/Data/worlds/kmkm/data/messages", O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_DIRECTORY <unfinished ...>
[pid    98] unlink("/data/Data/worlds/kmkm/data/messages/000011.log") = 0

@felddy
Copy link
Owner

felddy commented Jul 18, 2023

I saw your message on #v11-update-troubleshooting.

Interestingly my production install has always used an NFS share and it seems to still be working correctly with 11.305.
I'm on Linux using NFS 4.2.

If it helps, my mount point looks like this:

nas:/homes/username on /home/username type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,local_lock=none,addr=y.y.y.y)

Thanks for digging into this in more detail.

@aep
Copy link
Author

aep commented Jul 31, 2023

i got a workaround that works for me. but i'm not sure if its safe yet.

the underlying issue is that the way they use the db library (wrong, according to the libs dev),
causes a weird combination of syscalls that may lead to data loss on NFS.

but i really dont know and my head hurts.

i'll keep this open in the hopes that i remind myself to report after some production tests.

@felddy
Copy link
Owner

felddy commented Aug 2, 2023

Thank you for your work on this. It is quite a rabbit hole you've gone down. I agree that you should keep this issue open in hopes that a solution if found.

Cross-linking your work with the leveldb folks for posterity:

@github-actions
Copy link

This issue has been automatically marked as stale because it has been inactive for 28 days. To reactivate the issue, simply post a comment with the requested information to help us diagnose this issue. If this issue remains inactive for another 7 days, it will be automatically closed.

@github-actions
Copy link

github-actions bot commented Sep 7, 2023

This issue has been automatically closed due to inactivity. If you are still experiencing problems, please open a new issue.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants