Debug Singularity failure on Sherlock #5
@vsoch Thanks again for all your help. I updated the repo to package and name things a little better. One change within the containers themselves was to re-use the .wine configuration created in the Dockerfile's packaging, rather than recreating it. I also made everything batch-only for now. I'll add back the interactive work if it's asked for. Everything worked great on my machine. Then I moved the Singularity container to Sherlock, and got the error above -- I tried both on a dev node and a login node. Wine makes me sad.
I think it would make sense that you need to use the wine environment generated by the container at runtime, given the specific environment. Who knows how the Docker environment (created at build time within the container as root) compares to the Singularity container, which uses the environment from the HPC node and does not run as root. I don't have insight for why it works locally vs. on the cluster, but I would suggest reverting back to generation on the fly (on Sherlock in /tmp), and only if that works look into this different approach.
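A minimal sketch of what that on-the-fly generation could look like (the paths, the `WINEARCH` value, and the winetricks step are assumptions for illustration, not the repo's actual runscript):

```shell
# Hypothetical runscript fragment: build the wine prefix fresh at runtime
# in /tmp, as the invoking user, instead of reusing the prefix created by
# root at Docker build time.
export WINEPREFIX="$(mktemp -d /tmp/wineprefix.XXXXXX)"
export WINEARCH=win64          # assumed; should match whatever the build used
wineboot --init                # initialize a clean prefix as the runtime user
# ...install any needed components here, e.g. with winetricks...
```

The key point is that the prefix is owned by the runtime user and lives on a filesystem that user can write to.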
And don't be sad :) We will figure things out!
Sorry if I wasn't clear -- while the Dockerfile build creates the .wine directory, I am able to successfully run via Singularity on my Ubuntu 18.04 box with Singularity 3.6. The Sherlock machines are CentOS 7.6.1810 and are using Singularity 3.5.3-1.1.el7. Are there Singularity guarantees regarding portability across OS and Singularity versions?
I understood what you said! My suggestion is still to try the original strategy of creating the wine prefix directory at runtime. Singularity can’t technically promise anything, but for the most part it can guarantee the “same” container - that doesn’t account for variables that might leak into the environment, though.
Alas, I'm getting the same error. Switching the wineprefix content creation from buildtime to runtime (in /tmp) gave the same results. I tried interactively with the same result, as well. But interactively, Windows popped up a window with a backtrace, which I'm including here.
Okay, so it’s good we could rule out those two being different! Let’s try adding more isolation - did you try with --containall and/or --cleanenv?
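For reference, a sketch of how those flags might be applied (the image name and bind paths are placeholders):

```shell
# --cleanenv starts the container without the host's environment variables.
singularity run --cleanenv two-photon.sif

# --containall additionally isolates PID/IPC and uses minimal /tmp and home
# mounts, so only explicit binds are visible inside.
singularity run --containall --bind "${PWD}/data:/data" two-photon.sif
```

The trade-off is that stronger isolation can break things that silently depended on host paths or variables, which is itself a useful signal.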
Are you installing a 64-bit wine prefix?
I don't see in your updated recipe where you are calling
Can you please try the unedited version that we merged from my PR? The difference is that:
I would step back and take the simplest approach possible. I think the entrypoint of the previous container is overly complicated for what is needed and is probably giving you a bug. I would remove that, try just using wineboot --init after the variable export, install what you need, and see if that doesn't error out.
I have been trying both with and without wineboot --init/--end-session. I removed it originally because it was not necessary for running on my machine. I believe there are still some 32-bit pieces of Windows. The wine installation does include the i386 architecture to make a multiarch setup. It's not clear it's an architecture size problem because the container runs fine on my 64-bit machine -- but until we solve this, I won't rule anything out. I hadn't used the cleanenv or contain-related flags before. --cleanenv gives the same results. Using --contain or --containall fails due to all sorts of I/O errors. Too much isolation?! I also tried updating to wine 5.16 (development version) from wine 5.02, sticking with the on-the-fly winetricks and using the wineboot. Still getting the same fault. I'm done for this evening!
I believe I have tried the simplest possibilities at this point -- the current setup is exactly what you suggest. I have removed the entrypoint-based winetricks from the Docker image (as noted in a previous comment) and run it instead in the runscript. I have also experimented with wineboot --init. I have also let wine create the directory as my own user on Sherlock, skipping the temp directory altogether. At this point, I've methodically gone through the combinations I can think of. It's not clear to me what to focus on. The fact that it runs on my 64-bit machine seems to indicate that the container is OK and the wineprefix contents are not likely the issue. My next approach will be to try building the image on CentOS instead of Ubuntu. I tried earlier with a docker-in-docker approach, but failed. I'll likely bring up a cloud VM.
Yeah, no worries! My gut is saying that the issue is using the wineprefix generated in the container and then having the wrong architecture, and since you are done for the evening I went ahead and tested the version that was merged from my branch on Sherlock. I just cloned my branch again, and transferred the Singularity image and data to Sherlock. I tested on an interactive node with X11:

```
SINGULARITYENV_DISPLAY=:95 SINGULARITYENV_XVFB_RESOLUTION=320x240x8 SINGULARITYENV_XVFB_SCREEN=0 SINGULARITYENV_XVFB_SERVER=:95 singularity run --bind ${PWD}/profiles:/PROFILES --bind ${PWD}/overview-23:/data two-photon.sif
```

It seems to work equivalently to on my host. Is this what you saw? And the tiff is generated too:

```
$ ls overview-23/
overview-023_Cycle00001_Ch3_000001.ome.tif  overview-023.env  overview-023.xml
```

So this is really good news, because it means we do have a working solution! You must have accidentally added some tiny change that broke the build, and it wasn't apparent on your host. If you feel strongly about keeping your changes, you should probably start from the version I made, make one change at a time, and test as you go - you'll know when it breaks (and perhaps have better information to work with to fix it). I'd be really interested to know what the bug turned out to be! You can also just use the version above, which appears to work as expected. Anyhoo, hopefully this will be some good news to wake up to. I should be off to bed soon too, night!
I was intrigued by your update, so hopped back on. I built at 4adbb20, and still have the same result on Sherlock. Can you verify what OS you are using and what version of Singularity you built with?
Ubuntu 18.04 with 3.6.0!
I'll upload the container for you if you want to reproduce what I did. I would suggest having an automated build of the Docker container to a registry (using CI; GitHub workflows would work well for this) and then pulling it down to Sherlock (using its Singularity).
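That flow might look something like the following sketch (registry, user, and tag names are placeholders, not an existing setup):

```shell
# In CI (e.g. a GitHub workflow): build and push the Docker image.
docker build -t someuser/two-photon:latest .
docker push someuser/two-photon:latest

# On Sherlock: pull it down with Singularity, producing a local .sif file.
singularity pull two-photon.sif docker://someuser/two-photon:latest
```

This keeps the build reproducible in one place and avoids hand-transferring images to the cluster.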
Thanks. Can you just transfer the sif to an OAK location and make it world readable?
I don’t have the same superpowers as my colleagues (e.g., writing to your OAK space) but I can send you a Google Drive link to download and transfer... low tech but it gets the job done! It’s uploading now.
Not my space... just yours (and chmod it)... if you have any OAK space or access to SCRATCH.
I can try that - I've never used OAK before. Can you tell me the command for chmod to get the correct permission?
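For what it's worth, a sketch of the permissions involved (the paths are made up): the file itself needs world read, and every directory above it needs world execute (search) permission:

```shell
# Hypothetical OAK paths; adjust to the real location.
chmod o+x /oak/stanford/groups/somegroup            # search permission on each
chmod o+x /oak/stanford/groups/somegroup/shared     # parent directory
chmod o+r /oak/stanford/groups/somegroup/shared/two-photon.sif  # world-readable file
```

Missing the `o+x` on any ancestor directory is the usual reason a world-readable file still can't be opened by another user.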
Oh -- no worries then. But it would
Ah, that didn't work. I sometimes get confused how locked down things are. Maybe that drive link after all?
(I guess if you made your home directory world readable.... but let's not go there)
Yeah, I always run into these issues with permissions. I think you have to make the directory world readable, which isn't something I want to do. I gave you access to the drive file; hopefully that will do the trick!
It still does not work for me. Hmph. I used your command and your sif file. Perhaps it's something in our setups? (BTW, the XVFB env vars aren't used by Singularity, so the only one that is needed is the one you rely on in the runscript. They are only used by the Docker entrypoint script.) I removed my dotfiles (i.e., .bashrc and .profile) to give me a "standard" Sherlock environment, and my local .ssh/config (to be sure the connection wasn't the issue, due to all the display wrangling). If it's not too much to ask, can you do the same and try again? And can you run on a login node and tell me which one you use? I will recreate the same env as you, and if it still fails, I'll file something with SRCC.
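A sketch of the dotfile-stashing step described above (the filenames are the usual suspects; adjust to taste):

```shell
# Move personal shell configs aside so a fresh login picks up only the
# system defaults, then re-login and rerun the container.
for f in "$HOME/.bashrc" "$HOME/.profile" "$HOME/.bash_profile"; do
  if [ -e "$f" ]; then mv "$f" "$f.bak"; fi
done
# Restore afterwards with, e.g.:
# for f in "$HOME"/.*.bak; do mv "$f" "${f%.bak}"; done
```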
I'll give you complete instructions again to reproduce what I did above, and you can run them by others / use them for a bug report.

**Container**

The container is built first with Docker from the branch here. It's then pulled down to Singularity with the docker daemon. Reproducing this isn't as important because the container is provided for testing on the cluster; if someone else needs access, please have them reach out to me (and you can also share the one you are using).

**Connection**

I make sure to connect to the cluster with ssh:

```
ssh -XY <login>@<username>.edu
```

And then request an interactive node with x11 too:

```
srun --x11 --pty bash
```

**Getting Code and Files**

I don't think this is actually necessary, but I had first cloned the branch:

```
$ git clone -b croat-singularity https://github.com/researchapps/two-photon
$ cd two-photon
```

I also scp'd the container and the data (to unzip) to the node:

```
$ scp two-photon.sif <username>@login.sherlock.stanford.edu:/scratch/users/<username>/two-photon/two-photon.sif
$ scp brucker.zip <username>@login.sherlock.stanford.edu:/scratch/users/<username>/two-photon/brucker.zip
```

Back in the folder, I unzip the brucker data and rename it.

**Running the container**

I chose to still provide XVFB since we are in a headless environment; that said, I did see the wine GUI work fine since I had x11. I did not try it without that.

```
SINGULARITYENV_DISPLAY=:95 SINGULARITYENV_XVFB_RESOLUTION=320x240x8 SINGULARITYENV_XVFB_SCREEN=0 SINGULARITYENV_XVFB_SERVER=:95 singularity run --bind ${PWD}/profiles:/PROFILES --bind ${PWD}/overview-23:/data two-photon.sif
```

It worked as it should. I am not able to reproduce the error you are seeing, but I suspect there is some difference in our environments. I would also check that you don't have anything extra on your python path, and that you have exported the variable to unset the user site (in $HOME/.bin usually). I can confirm that the tif file is generated:

```
$ ls overview-23/
overview-023_Cycle00001_Ch3_000001.ome.tif  overview-023.env  overview-023.xml
```

Good luck!
As previously explained, I am using the stock Sherlock environment -- I have removed all my personal dotfile configs. What's worse is that now when I run your command, I get errors about xvfb problems and about it creating files on my local system.
It is bothersome to me that Singularity isn't hermetic. I am fighting environment issues and some sort of problem with local files. It feels like the state of the system keeps changing under my feet. And to be fair, I also partly blame the fact that we're forced to use Windows via wine. For the moment, I'll likely put this on hold. I'd really like to use Sherlock/OAK for this, but if I cannot get things reproducible (even in the failure mode), then I'd likely not be able to support a dozen labmates using it with their various configurations. Maybe if I come back in the future,
Sorry @chrisroat, I know how frustrating these things are, and I've still been unable to reproduce your issues, so I feel helpless to help. It is true that it's not perfect, and adding Windows/wine to the mix makes it much harder. Perhaps you could take some time away, and come back with fresh eyes? And in the meantime, see if you have colleagues who can try to reproduce the working one? The error hints that you still have some setting, somewhere, pointing to your user site (the .local folder in home), so I suspect that might be an issue. If you have wine installed somewhere on the cluster, and any envars for it, I'd clear those up. You could also look critically at what this windows app is doing, and recreate it in a simple binary (that's probably the best solution moving forward).
I'm about to lose it. Over lunch I got this hunch it was the filesystem. My PWD was on OAK, which is Lustre. I think you were using your home directory, which is NFS. So I ran on my home directory, and it worked! I thought I was on to something, but it was a red herring.... I started trying SCRATCH and also running a script to copy the data before execution. SCRATCH (which is also Lustre, but a different configuration) seemed to work. But then I accidentally ran on OAK again.... and it worked. So I was confused. Then I realized it -- the sample data I had set aside was marked readonly. Depending on where/how I copied, it would remain so or not. So in the end, that big error dump with scary "Native Crash Reporting" and no mention (that I see) of I/O is really about the ripper needing write privileges. I do want people to mark their raw data as readonly, so we do make a copy prior to ripping. We will need to take care to set permissions appropriately. Sigh.
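In other words, the fix is just making sure the working copy is writable before ripping; a minimal sketch, with placeholder paths:

```shell
# Copy read-only raw data into a writable working area.
# --no-preserve=mode (GNU cp) gives the copy default (writable) permissions
# instead of inheriting the read-only bits from the source.
cp -r --no-preserve=mode /path/to/readonly-raw-data ./work-data

# Alternatively, fix up permissions after a plain copy:
# chmod -R u+w ./work-data
```

This lets the original raw data stay marked read-only while the ripper operates on a writable copy.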
What a puzzle! And such persistence! I'm so happy that you figured it out! I will make sure to note next time where I'm running things - I was using data/container/code in my scratch space, and using the sample data that you provided to me from Google Drive. I incorrectly made the assumption that all datasets were equivalent. Anyway, woohoo! And happy Friday!
Woah, great troubleshooting @chrisroat and @vsoch!! Great job working through all of these super subtle issues!
The container runs fine on the local Ubuntu machine where I built it, but fails when running on Sherlock: