Windows docker volume lock file issue #385

Closed

awakecoding opened this issue Feb 6, 2020 · 11 comments

Comments

@awakecoding
Contributor

I am using "library/mongo:4.2-windowsservercore-1809" on Windows Server 2019 with a docker volume. When launching the container on an empty volume, it all works fine, but I am unable to restart my container without encountering a lock file issue.

This issue is similar to this older issue described here: StefanScherer/dockerfiles-windows#322

I am using a docker volume instead of a bind mount to avoid file system issues, but it looks like docker volumes are problematic as well. What should I be verifying? Being unable to use MongoDB in a Windows container is a serious release blocker for me.

Here is the docker log:
2020-02-06T00:19:53.651+0000 I CONTROL [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'
2020-02-06T00:19:53.991+0000 I CONTROL [initandlisten] MongoDB starting : pid=5768 port=27017 dbpath=C:\data\db\ 64-bit host=224c0ea470d2
2020-02-06T00:19:53.991+0000 I CONTROL [initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2
2020-02-06T00:19:53.991+0000 I CONTROL [initandlisten] db version v4.2.3
2020-02-06T00:19:53.991+0000 I CONTROL [initandlisten] git version: 6874650b362138df74be53d366bbefc321ea32d4
2020-02-06T00:19:53.991+0000 I CONTROL [initandlisten] allocator: tcmalloc
2020-02-06T00:19:53.991+0000 I CONTROL [initandlisten] modules: none
2020-02-06T00:19:53.991+0000 I CONTROL [initandlisten] build environment:
2020-02-06T00:19:53.991+0000 I CONTROL [initandlisten]     distmod: 2012plus
2020-02-06T00:19:53.991+0000 I CONTROL [initandlisten]     distarch: x86_64
2020-02-06T00:19:53.991+0000 I CONTROL [initandlisten]     target_arch: x86_64
2020-02-06T00:19:53.991+0000 I CONTROL [initandlisten] options: { net: { bindIp: "*" } }
2020-02-06T00:19:54.066+0000 I STORAGE [initandlisten] Detected data files in C:\data\db\ created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2020-02-06T00:19:54.067+0000 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=3583M,cache_overflow=(file_max=0M),session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000,close_scan_interval=10,close_handle_minimum=250),statistics_log=(wait=0),verbose=[recovery_progress,checkpoint_progress],
2020-02-06T00:19:54.070+0000 E STORAGE [initandlisten] WiredTiger error (16) [1580948394:70320][5768:140703784850016], wiredtiger_open: __win_file_lock, 241: C:\data\db\\WiredTiger.lock: handle-lock: LockFile: The process cannot access the file because another process has locked a portion of the file. : Resource device Raw: [1580948394:70320][5768:140703784850016], wiredtiger_open: __win_file_lock, 241: C:\data\db\\WiredTiger.lock: handle-lock: LockFile: The process cannot access the file because another process has locked a portion of the file. : Resource device
2020-02-06T00:19:54.070+0000 E STORAGE [initandlisten] WiredTiger error (16) [1580948394:70320][5768:140703784850016], wiredtiger_open: __conn_single, 1674: WiredTiger database is already being managed by another process: Resource device Raw: [1580948394:70320][5768:140703784850016], wiredtiger_open: __conn_single, 1674: WiredTiger database is already being managed by another process: Resource device
2020-02-06T00:19:54.071+0000 E STORAGE [initandlisten] WiredTiger error (16) [1580948394:71316][5768:140703784850016], wiredtiger_open: __win_file_lock, 241: C:\data\db\\WiredTiger.lock: handle-lock: LockFile: The process cannot access the file because another process has locked a portion of the file. : Resource device Raw: [1580948394:71316][5768:140703784850016], wiredtiger_open: __win_file_lock, 241: C:\data\db\\WiredTiger.lock: handle-lock: LockFile: The process cannot access the file because another process has locked a portion of the file. : Resource device
2020-02-06T00:19:54.071+0000 E STORAGE [initandlisten] WiredTiger error (16) [1580948394:71316][5768:140703784850016], wiredtiger_open: __conn_single, 1674: WiredTiger database is already being managed by another process: Resource device Raw: [1580948394:71316][5768:140703784850016], wiredtiger_open: __conn_single, 1674: WiredTiger database is already being managed by another process: Resource device
2020-02-06T00:19:54.072+0000 E STORAGE [initandlisten] WiredTiger error (16) [1580948394:72314][5768:140703784850016], wiredtiger_open: __win_file_lock, 241: C:\data\db\\WiredTiger.lock: handle-lock: LockFile: The process cannot access the file because another process has locked a portion of the file. : Resource device Raw: [1580948394:72314][5768:140703784850016], wiredtiger_open: __win_file_lock, 241: C:\data\db\\WiredTiger.lock: handle-lock: LockFile: The process cannot access the file because another process has locked a portion of the file. : Resource device
2020-02-06T00:19:54.072+0000 E STORAGE [initandlisten] WiredTiger error (16) [1580948394:72314][5768:140703784850016], wiredtiger_open: __conn_single, 1674: WiredTiger database is already being managed by another process: Resource device Raw: [1580948394:72314][5768:140703784850016], wiredtiger_open: __conn_single, 1674: WiredTiger database is already being managed by another process: Resource device
2020-02-06T00:19:54.072+0000 W STORAGE [initandlisten] Failed to start up WiredTiger under any compatibility version.
2020-02-06T00:19:54.072+0000 F STORAGE [initandlisten] Reason: 16: Resource device
2020-02-06T00:19:54.073+0000 F - [initandlisten] Fatal Assertion 28595 at src\mongo\db\storage\wiredtiger\wiredtiger_kv_engine.cpp 789
2020-02-06T00:19:54.073+0000 F - [initandlisten] ***aborting after fassert() failure

And the corresponding docker run command:
docker run --name den-mongo -d --network=den-network -v den-mongodata:c:\data\db
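
For reference, the full sequence that triggers the error is roughly the following (the image tag is the one mentioned above; the exact commands are illustrative, not a verbatim copy of my script):

docker network create den-network   # if not already created
docker volume create den-mongodata
docker run --name den-mongo -d --network=den-network -v den-mongodata:c:\data\db mongo:4.2-windowsservercore-1809
docker stop den-mongo
docker rm den-mongo
# the second run on the same, now non-empty, volume fails with the WiredTiger.lock error above
docker run --name den-mongo -d --network=den-network -v den-mongodata:c:\data\db mongo:4.2-windowsservercore-1809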

@tianon
Member

tianon commented Feb 6, 2020

Is the initial container shutting down cleanly? (Do the MongoDB logs show that it was able to shut down and close out the database files successfully?)
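
(One quick way to check, for example, is to look at the tail of the stopped container's logs:

docker logs --tail 20 den-mongo

and verify they end with the normal MongoDB shutdown messages, e.g. "shutdown: removing fs lock" and "shutting down with code", rather than cutting off abruptly.)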

@awakecoding
Contributor Author

I shut down the container with a simple "docker stop". My current code does not pass "--rm" to the docker run command, so the container lingers on after it's been stopped (which makes it easier to run docker logs afterwards). To compensate for not using --rm, I run docker stop + docker rm before docker run for all of my containers, so maybe something in this sequence messes things up.

Could it be that Windows itself accesses the files in the docker volume, leaving them locked by another process when they shouldn't be?

I found a temporary workaround: locate the real path on disk where the docker volume is stored and delete the WiredTiger.lock file while the container is not running. If the container is running, Windows won't let me delete the file because another process is accessing it.

PS C:\Users\wayk-admin\den-test> $volume = $(docker volume inspect den-mongodata) | ConvertFrom-Json
PS C:\Users\wayk-admin\den-test> $volume.MountPoint
C:\ProgramData\docker\volumes\den-mongodata\_data
PS C:\Users\wayk-admin\den-test> $WiredTigerLock = Join-Path $volume.MountPoint 'WiredTiger.lock'
PS C:\Users\wayk-admin\den-test> Remove-Item $WiredTigerLock -Force

@awakecoding
Contributor Author

awakecoding commented Feb 6, 2020

I launched my container, then instead of calling docker rm on it, I stopped it and re-launched it with docker start (probably the best way to do it). I encountered the same issue, but since I didn't remove the container I could get a clean log of the previous shutdown. It all looks good to me, so I'm thinking it might be a timing issue related to docker and the fact that removing the lock file is one of the last operations performed during container shutdown:

2020-02-06T01:01:32.517+0000 I  CONTROL  [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'
2020-02-06T01:01:32.603+0000 I  CONTROL  [initandlisten] MongoDB starting : pid=8480 port=27017 dbpath=C:\data\db\ 64-bit host=875be735f5ee
2020-02-06T01:01:32.603+0000 I  CONTROL  [initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2
2020-02-06T01:01:32.603+0000 I  CONTROL  [initandlisten] db version v4.2.3
2020-02-06T01:01:32.603+0000 I  CONTROL  [initandlisten] git version: 6874650b362138df74be53d366bbefc321ea32d4
2020-02-06T01:01:32.603+0000 I  CONTROL  [initandlisten] allocator: tcmalloc
2020-02-06T01:01:32.603+0000 I  CONTROL  [initandlisten] modules: none
2020-02-06T01:01:32.603+0000 I  CONTROL  [initandlisten] build environment:
2020-02-06T01:01:32.603+0000 I  CONTROL  [initandlisten]     distmod: 2012plus
2020-02-06T01:01:32.603+0000 I  CONTROL  [initandlisten]     distarch: x86_64
2020-02-06T01:01:32.603+0000 I  CONTROL  [initandlisten]     target_arch: x86_64
2020-02-06T01:01:32.603+0000 I  CONTROL  [initandlisten] options: { net: { bindIp: "*" } }
2020-02-06T01:01:32.640+0000 I  STORAGE  [initandlisten] Detected data files in C:\data\db\ created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2020-02-06T01:01:32.641+0000 I  STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=3583M,cache_overflow=(file_max=0M),session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000,close_scan_interval=10,close_handle_minimum=250),statistics_log=(wait=0),verbose=[recovery_progress,checkpoint_progress],
2020-02-06T01:01:32.922+0000 I  STORAGE  [initandlisten] WiredTiger message [1580950892:922421][8480:140704011735648], txn-recover: Recovering log 2 through 3
2020-02-06T01:01:33.034+0000 I  STORAGE  [initandlisten] WiredTiger message [1580950893:34523][8480:140704011735648], txn-recover: Recovering log 3 through 3
2020-02-06T01:01:33.194+0000 I  STORAGE  [initandlisten] WiredTiger message [1580950893:194483][8480:140704011735648], txn-recover: Main recovery loop: starting at 2/6528 to 3/256
2020-02-06T01:01:33.477+0000 I  STORAGE  [initandlisten] WiredTiger message [1580950893:477432][8480:140704011735648], txn-recover: Recovering log 2 through 3
2020-02-06T01:01:33.702+0000 I  STORAGE  [initandlisten] WiredTiger message [1580950893:702420][8480:140704011735648], txn-recover: Recovering log 3 through 3
2020-02-06T01:01:33.813+0000 I  STORAGE  [initandlisten] WiredTiger message [1580950893:812425][8480:140704011735648], txn-recover: Set global recovery timestamp: (0, 0)
2020-02-06T01:01:34.124+0000 I  RECOVERY [initandlisten] WiredTiger recoveryTimestamp. Ts: Timestamp(0, 0)
2020-02-06T01:01:34.794+0000 I  STORAGE  [initandlisten] Timestamp monitor starting
2020-02-06T01:01:34.842+0000 I  CONTROL  [initandlisten]
2020-02-06T01:01:34.842+0000 I  CONTROL  [initandlisten] ** WARNING: Access control is not enabled for the database.
2020-02-06T01:01:34.842+0000 I  CONTROL  [initandlisten] **          Read and write access to data and configuration is unrestricted.
2020-02-06T01:01:34.842+0000 I  CONTROL  [initandlisten]
2020-02-06T01:01:34.911+0000 I  SHARDING [initandlisten] Marking collection local.system.replset as collection version: <unsharded>
2020-02-06T01:01:34.947+0000 I  STORAGE  [initandlisten] Flow Control is enabled on this deployment.
2020-02-06T01:01:34.948+0000 I  SHARDING [initandlisten] Marking collection admin.system.roles as collection version: <unsharded>
2020-02-06T01:01:34.948+0000 I  SHARDING [initandlisten] Marking collection admin.system.version as collection version: <unsharded>
2020-02-06T01:01:35.000+0000 I  SHARDING [initandlisten] Marking collection local.startup_log as collection version: <unsharded>
2020-02-06T01:01:35.572+0000 I  FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory 'C:/data/db/diagnostic.data'
2020-02-06T01:01:35.577+0000 I  SHARDING [LogicalSessionCacheRefresh] Marking collection config.system.sessions as collection version: <unsharded>
2020-02-06T01:01:35.578+0000 I  SHARDING [LogicalSessionCacheReap] Marking collection config.transactions as collection version: <unsharded>
2020-02-06T01:01:35.579+0000 I  NETWORK  [listener] Listening on 0.0.0.0
2020-02-06T01:01:35.579+0000 I  NETWORK  [listener] waiting for connections on port 27017
2020-02-06T01:01:36.004+0000 I  SHARDING [ftdc] Marking collection local.oplog.rs as collection version: <unsharded>
2020-02-06T01:02:02.547+0000 I  CONTROL  [thread1] CTRL_SHUTDOWN_EVENT signal
2020-02-06T01:02:02.547+0000 I  CONTROL  [consoleTerminate] got CTRL_SHUTDOWN_EVENT, will terminate after current cmd ends
2020-02-06T01:02:02.548+0000 I  NETWORK  [consoleTerminate] shutdown: going to close listening sockets...
2020-02-06T01:02:02.587+0000 I  -        [consoleTerminate] Stopping further Flow Control ticket acquisitions.
2020-02-06T01:02:02.587+0000 I  CONTROL  [consoleTerminate] Shutting down free monitoring
2020-02-06T01:02:02.587+0000 I  FTDC     [consoleTerminate] Shutting down full-time diagnostic data capture
2020-02-06T01:02:02.595+0000 I  STORAGE  [consoleTerminate] Deregistering all the collections
2020-02-06T01:02:02.596+0000 I  STORAGE  [consoleTerminate] Timestamp monitor shutting down
2020-02-06T01:02:02.596+0000 I  STORAGE  [consoleTerminate] WiredTigerKVEngine shutting down
2020-02-06T01:02:02.623+0000 I  STORAGE  [consoleTerminate] Shutting down session sweeper thread
2020-02-06T01:02:02.623+0000 I  STORAGE  [consoleTerminate] Finished shutting down session sweeper thread
2020-02-06T01:02:02.623+0000 I  STORAGE  [consoleTerminate] Shutting down journal flusher thread
2020-02-06T01:02:02.688+0000 I  STORAGE  [consoleTerminate] Finished shutting down journal flusher thread
2020-02-06T01:02:02.688+0000 I  STORAGE  [consoleTerminate] Shutting down checkpoint thread
2020-02-06T01:02:02.688+0000 I  STORAGE  [consoleTerminate] Finished shutting down checkpoint thread
2020-02-06T01:02:02.906+0000 I  STORAGE  [consoleTerminate] shutdown: removing fs lock...
2020-02-06T01:02:02.906+0000 I  CONTROL  [consoleTerminate] now exiting
2020-02-06T01:02:02.906+0000 I  CONTROL  [consoleTerminate] shutting down with code:12

@awakecoding
Contributor Author

I just implemented the hack of deleting the WiredTiger.lock file in my PowerShell script. It's far from ideal, but at least it's no longer a release blocker for me. However, we should definitely get to the bottom of this. Since "shutdown: removing fs lock..." is apparently the last operation performed by the process, I strongly suspect it's a problem with file operations done right before docker terminates the container. I wouldn't be surprised if adding a one-second sleep there "solved" the problem.
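
For anyone else who hits this, the hack boils down to the same lock-file deletion shown earlier, run before the container is started again. Roughly like this (a sketch, not my actual script; the volume and container names are the ones used throughout this issue):

$volume = $(docker volume inspect den-mongodata) | ConvertFrom-Json
$WiredTigerLock = Join-Path $volume.MountPoint 'WiredTiger.lock'
if (Test-Path $WiredTigerLock) {
    # only safe to delete while the container is stopped
    Remove-Item $WiredTigerLock -Force
}
docker start den-mongo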

@tianon
Member

tianon commented Feb 7, 2020

I grabbed a Windows machine so I could test this out and I'm not actually able to reproduce -- I started MongoDB with a volume and did docker stop, then started a new container from the same volume and it started up without any issues. 😕

(I also played with db.shutdownServer(), but that test worked fine too, which is less surprising.)

@awakecoding
Contributor Author

I can easily reproduce the issue with my cmdlet (https://github.com/devolutions/waykden-ps). I will see if I can make a smaller reproduction test case. Otherwise, just starting and stopping with my cmdlet that wraps docker containers is sufficient, if you're willing to give it a shot. My latest version includes the workaround that deletes the lock file, but it prints a message when it does, so you'd see it.

@awakecoding
Contributor Author

I am quite sure I've encountered this issue in the past on various Windows environments without hitting it consistently. It may help to know that I can see it happen every time on a very specific Azure VM instance: a dev/test-sized Windows Server 2019 Core + Containers VM exactly like the one from this guide I wrote: https://github.com/Devolutions/WaykDen-ps/blob/master/docs/getting-started-azure.adoc

@awakecoding
Contributor Author

@tianon I have made a simple script to run a MongoDB Windows container with a docker volume, stop it, then check for the remaining WiredTiger.lock file and delete it if present. It happens every single time I run it, no effort required: https://gist.github.com/awakecoding/f24608fad3ed6e867df624cef62e28ce

Let me know if you can reproduce the issue with the script, but I'm guessing it won't happen since you already tried. If you're willing to spend some time on it, I would be willing to create a virtual machine in Azure on our company subscription and give you access to it. I'd make sure that the issue is easy to reproduce on the virtual machine I'd prepare for you. I'd create an identical Windows Server 2019 Core + Containers virtual machine as the one where it happens all the time.

@awakecoding
Contributor Author

Bingo, someone investigated this very same issue and got a smaller program to reproduce it: moby/moby#39088

It sounds like there is an issue with Windows containers and the APIs used for releasing the WiredTiger.lock file.

@mattewre

mattewre commented Oct 7, 2020

Is there a workaround for this? I can't restart the MongoDB container without manually deleting the .lock files.
I'm experiencing the same problem on a Windows Server 2019 host with a MongoDB Windows-based container.
Tested on a clean VM with either Docker CE or Azure IoT Edge (Moby).

@yosifkit
Member

yosifkit commented Oct 7, 2020

The only workarounds are to automate deleting the lock file or to use MongoDB on a Linux host.

Closing this in favor of the upstream issue in Windows Containers (microsoft/Windows-Containers#37)
