Specify version number for installed software #32

Merged: 12 commits, May 2, 2022

Conversation

arisp99
Member

@arisp99 arisp99 commented Mar 16, 2022

This PR specifies the version number for most of the installed software in the MIPTools container.

Closes #31.


Checklist for software

Installed via apt-get

  • A long list of software...
  • We do not specify the version number for these tools

Installed via git

Installed via wget

  • When we install miniconda, the version of conda is automatically updated. However, as we use mamba as our package manager, it is fine to leave this as is. We do specify the version number for mamba.

Installed via install.packages()

Installed via conda

  • mamba

Installed via mamba

  • A long list of software...

@arisp99 arisp99 marked this pull request as ready for review March 22, 2022 18:53
@aydemiro
Contributor

@arisp99 I tried specifying versions before, but it did not work well. In almost all cases, if you have such a big software list and try to specify versions, they will not be compatible with one another. If versions are not specified, mamba does a good job of resolving dependencies and installing the packages. The idea that specifying versions makes each build exactly the same, and hence more reproducible, is very appealing, but it did not work for me before. What do you think? Do the advantages outweigh the troubles in your opinion?
If this is more about documenting which versions are installed, we can generate a list of the installed packages after each build with conda list or something similar.

@JeffAndBailey
Member

So, with a new build you initially let conda resolve, but when you freeze a version, can you lock the versions of the underlying code (or at least dump a list that we then add so someone could force the versions)? This would provide real reproducibility. The other option is to be able to build proprietary stuff like bcl2fastq outside and then add it to another directory.

@arisp99
Member Author

arisp99 commented Mar 22, 2022

That is a very valid point, @aydemiro. I hadn't thought of that. But you are correct: I can totally see instances when you want to update a package and then run into dependency conflicts. To be honest, I am usually more of a fan of updating software, as there are new features and bug fixes that can be useful.

Thinking about what @JeffAndBailey proposed, I see that you can install packages from a requirements.txt file using mamba install --file requirements.txt. Something we could do is let mamba resolve all conflicts on the initial build and, when the build is complete, save a requirements.txt somewhere that lists all of the package versions. To do this we can use mamba list --export. Users may then be able to rebuild the container using this saved requirements.txt file.

We could even have a check in the definition file to see if requirements.txt exists. If it does, then install using the file, whereas otherwise install and let mamba resolve conflicts.
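
A minimal sketch of that check, assuming it runs as plain shell inside the build. The function name install_packages and its arguments are hypothetical, and the package names are whatever the caller passes in; only the two mamba commands come from the comment above.

```shell
# Hypothetical helper: install_packages REQUIREMENTS_FILE [PACKAGE...]
# If REQUIREMENTS_FILE exists, install exactly what it lists.
# Otherwise let mamba resolve the given packages and record the result.
install_packages() {
    requirements="$1"
    shift
    if [ -f "$requirements" ]; then
        # Pinned rebuild from a previously saved list
        mamba install --yes --file "$requirements"
    else
        # Fresh build: resolve freely, then save the versions that were chosen
        mamba install --yes "$@" &&
            mamba list --export > "$requirements"
    fi
}
```

It could be called as install_packages requirements.txt followed by the package names. The saved file is only written when the unpinned install succeeds, so a failed build should not leave a pinned list behind.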

Some quick questions thinking about this more:

  • How can we save a file during the build process to the original directory?
  • How to read a file during the build process? We might be able to use the %files section.

@JeffAndBailey
Member

Let's see if we can download and build bcl2fastq externally and install it as a working version with any needed libraries or accessory files. If that is possible, then our fixed builds sans bcl2fastq will be fine for reproducibility.

@aydemiro
Contributor

aydemiro commented Mar 23, 2022

I was planning to move the conda installation to an environment based system where we have an environment.yml file for the base environment in the repository, instead of listing all packages without the versions in the definition file. We can then employ something like this:

  1. If the file environment_versioned.yml exists:
     mamba env create -f environment_versioned.yml
  2. If the versioned file doesn't exist:
     mamba env create -f environment.yml
     conda activate base
     mamba env export > environment_versioned.yml
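
The two cases above could be sketched as a single shell function. This is an illustration under stated assumptions, not the code in the definition file: the function name create_conda_env and its directory argument are hypothetical, and the sketch passes --name base to the export instead of running conda activate base first, since activation usually needs extra shell setup in a non-interactive build.

```shell
# Hypothetical helper: create_conda_env [DIRECTORY]
# Looks for environment_versioned.yml in DIRECTORY (default: current directory).
create_conda_env() {
    env_dir="${1:-.}"
    versioned="$env_dir/environment_versioned.yml"
    if [ -f "$versioned" ]; then
        # A versioned file exists: reproduce the recorded versions
        mamba env create -f "$versioned"
    else
        # No versioned file: let mamba resolve, then record what was installed.
        # Assumption: the environment defined in environment.yml is "base".
        mamba env create -f "$env_dir/environment.yml" &&
            mamba env export --name base > "$versioned"
    fi
}
```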

As for the bcl2fastq issue, I agree that we should explore building the software outside and providing the binary to the container as a binding. However, this is a compiled C++ program, and creating a portable binary is beyond my capabilities at the moment. Nick is probably the best person to consult on this.

@aydemiro
Contributor

aydemiro commented Mar 23, 2022

@arisp99 I think we have to use the %setup section for copying from the container to the host and %files for copying to the container from the host.

@arisp99
Member Author

arisp99 commented Mar 23, 2022

> I was planning to move the conda installation to an environment based system where we have an environment.yml file for the base environment in the repository.

This seems similar to just using a requirements.txt file. Do you think an environment-based system would be more beneficial, @aydemiro?

> @arisp99 I think we have to use the %setup section for copying from the container to the host and %files for copying to the container from the host.

Yes! That looks right. So, hashing this out a bit further, our %files section would have the following lines:

%files
  # could be either requirements or environment
  environment* /opt/conda

Then as you write:

  1. If the file environment_versioned.yml exists:
     mamba env create -f /opt/conda/environment_versioned.yml
  2. If the versioned file doesn't exist:
     mamba env create -f /opt/conda/environment.yml
     conda activate base
     mamba env export > /opt/conda/environment_versioned.yml

Lastly, in the %setup section, we have

cp ${SINGULARITY_ROOTFS}/opt/conda/environment_versioned.yml environment_versioned.yml 

> As for the bcl2fastq issue, I agree that we should explore building the software outside and providing the binary to the container as a binding. However, this is a compiled C++ program and how to create a portable binary is beyond my capabilities at the moment. Nick is probably the best person to consult on this.

I agree with all this re the bcl2fastq installation. It would be awesome if you could just plop the binary into the container. I think it makes sense to address this as a separate issue, as it seems a bit complex. For now, let's decide whether we want a requirements.txt or an environment.yml file, move ahead with that, and revisit bcl2fastq in a separate issue.

@arisp99
Member Author

arisp99 commented Mar 23, 2022

> I was planning to move the conda installation to an environment based system where we have an environment.yml file for the base environment in the repository.

> This seems similar to just using a requirements.txt file. Do you think an environment-based system would be more beneficial, @aydemiro?

I explored this question a bit more, and it seems that an environment.yml is actually better, as it gives us more options to configure the conda environment. We can specify the channels we want to install packages from and even install pip packages using this framework.
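
To illustrate what that extra configuration looks like, here is a small sketch that writes an example environment.yml. The function name write_example_env is hypothetical, and the channels and packages are placeholders chosen for illustration, not the actual MIPTools list; some-pip-package is not a real package name.

```shell
# Hypothetical helper: write_example_env OUTPUT_FILE
# Writes an illustrative environment.yml showing channels and a pip section.
write_example_env() {
    cat > "$1" <<'EOF'
name: base
channels:
  - conda-forge
  - bioconda
dependencies:
  - python
  - pip
  - pip:
      - some-pip-package
EOF
}
```

A requirements.txt only lists packages, whereas this format keeps the channel list and any pip-installed packages in the same file.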

After installation, we save an `environment_versioned.yml` file that contains all the installed versions of our software
@arisp99
Member Author

arisp99 commented Mar 23, 2022

I have now configured MIPTools to install mamba packages using an environment file. In the definition file, we first check whether an environment_versioned.yml file exists. If it does, we use that for installation. Otherwise, we install from our clean environment.yml file, which does not contain package versions.


One important thing to note is that we are actually unable to copy files to the host during the building of our container. The %setup section is executed before the %post section so we will not have installed our packages yet. Given this, I think the best course of action is to include a note somewhere in the documentation indicating that the user can copy the environment_versioned.yml from the container using singularity exec:

singularity exec miptools.sif cat /opt/environment_versioned.yml > environment_versioned.yml

and that if this environment_versioned.yml file is present in the directory when building, it will be used to specify package versions for software.
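
One caveat with the plain redirect: if singularity exec fails, the shell still creates an empty environment_versioned.yml, which a later build might then pick up. A hedged sketch of a wrapper that avoids this is below. The function name export_versioned_env is hypothetical; the singularity command and the /opt/environment_versioned.yml path are the ones given above.

```shell
# Hypothetical helper: export_versioned_env IMAGE [OUTPUT_FILE]
# Copies the versioned environment file out of the container, writing the
# output only if the command succeeded and produced a non-empty file.
export_versioned_env() {
    image="$1"
    output="${2:-environment_versioned.yml}"
    tmp_file="$(mktemp)" || return 1
    if singularity exec "$image" cat /opt/environment_versioned.yml > "$tmp_file" &&
        [ -s "$tmp_file" ]; then
        mv "$tmp_file" "$output"
    else
        rm -f "$tmp_file"
        echo "could not read environment_versioned.yml from $image" >&2
        return 1
    fi
}
```

For example, export_versioned_env miptools.sif would write environment_versioned.yml into the current directory, ready for the next build.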

@arisp99
Member Author

arisp99 commented Apr 28, 2022

@aydemiro and @JeffAndBailey, if you have no additional comments, I will go ahead and merge this PR early next week.

@arisp99 arisp99 merged commit c4001d2 into master May 2, 2022
@arisp99 arisp99 deleted the version-numbers branch May 2, 2022 17:21

Successfully merging this pull request may close these issues.

Specify version number for installed software
3 participants