
Lockfile implementation of categories instead of category #278


Comments


maresb commented Nov 11, 2022

In order to make it possible to, for instance, reliably install packages in both dev and main, I proposed in mamba-org/mamba#1209 changing the category attribute from a string to a list of strings called categories.

There hasn't been any movement yet (that I'm aware of) toward an implementation. I have an idea that might offer an easy way forward, purely in conda-lock at first...

What would happen if we duplicated the entries in the lockfile, so that the lockfile looks roughly like this?

- name: common-package
  category: dev
- name: common-package
  category: main

If we are lucky, then perhaps the conda-lock and micromamba implementations will follow the expected behavior. (To be tested!) This way we wouldn't need a new lockfile format version.

If this works, this is still rather redundant, but not catastrophically so. We should aim to do better in lockfile format v2. In the meantime, since the micromamba folks believe that extra attributes will be ignored, we could start generating interoperable lockfiles which look like

- name: common-package
  category: dev
  categories:
  - main
  - dev
- name: common-package
  category: main
  categories:
  - main
  - dev

so that this can be read as v1 by ignoring categories, or alternatively read as v2 by ignoring category (effectively containing duplicate entries). This way, we could first develop everything in conda-lock, and then only once we're ready, work on implementing v2 in micromamba.
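To make the dual-format idea concrete, here's a small sketch with plain Python dicts standing in for the parsed YAML entries (none of this is conda-lock's or micromamba's actual API; it just demonstrates that the two reading strategies are compatible):

```python
# The dual-format lockfile entries from above, as parsed dicts.
entries = [
    {"name": "common-package", "category": "dev",
     "categories": ["main", "dev"]},
    {"name": "common-package", "category": "main",
     "categories": ["main", "dev"]},
]

def read_v1(entries):
    """A v1 reader ignores `categories` and sees one entry per category."""
    return [(e["name"], e["category"]) for e in entries]

def read_v2(entries):
    """A v2 reader ignores `category` and deduplicates by package name."""
    packages = {}
    for e in entries:
        packages[e["name"]] = set(e["categories"])
    return packages

assert read_v1(entries) == [("common-package", "dev"),
                            ("common-package", "main")]
assert read_v2(entries) == {"common-package": {"main", "dev"}}
```

Both readers recover a sensible view from the same file, which is the interoperability property we'd be relying on.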


maresb commented Dec 4, 2022

Preliminary experiment with micromamba indicates success. 🚀


srilman commented Dec 12, 2022

Is there an existing branch or implementation to test?


maresb commented Dec 12, 2022

Sorry, not yet. I just did by hand the experiment with micromamba described above. Would you like to try an implementation?


srilman commented Dec 13, 2022

Sure! One question: how should duplicate entries in separate input files be handled? Say, for example, we are running conda-lock -f main.yml -f test.yml on the following files:

# main.yml
category: main
dependencies:
  - python 3.10
  - pandas 1.4

# test.yml
category: test
dependencies:
  - pandas 1.5

Currently, conda-lock will solve for ['python 3.10.*', 'pandas 1.5'] and produce a lockfile where pandas's category is test. Should it be in both main and test, or just test? And what if pandas 1.5 were specified in main.yml; should it be in both categories then?

I would argue that it should be only in the test category in the first case, but both categories in the second case.


maresb commented Dec 13, 2022

As written, that would be a dependency conflict.

The idea is to merge the dependencies into a single simultaneous solve. So what would happen logically is:

dependencies:
- python =3.10
- pandas =1.4
- pandas =1.5

and this goes to the solver, which in this case would fail because there's no pandas package that is simultaneously version 1.4 and version 1.5.

But let's suppose the versions were compatible, for example if main had pandas =1.5.1. Then the simultaneous solve would receive

dependencies:
- python =3.10
- pandas =1.5.1
- pandas =1.5

Then the solver would find the solution python 3.10.8, pandas 1.5.1, and all their corresponding dependencies.

To compute which things should go in the main category, look at the list of the packages in main.yml: python and pandas. These two packages, plus all their dependencies, should be in main.

Similarly, to compute which things should go in test, it is pandas and all its dependencies. Since python happens to be a dependency of pandas, they both go in test.

Therefore, in this case, all packages go into both main and test.
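The logic above can be sketched as a transitive closure over the solved dependency graph (the graph and helper below are illustrative stand-ins, not conda-lock's actual data structures):

```python
# Everything requested in a source file, plus its transitive
# dependencies, goes into that file's category.
dependency_graph = {          # solved package -> its direct dependencies
    "python": [],
    "pandas": ["python"],
}
requested = {"main": ["python", "pandas"], "test": ["pandas"]}

def closure(packages, graph):
    """Return the packages plus all of their transitive dependencies."""
    result, stack = set(), list(packages)
    while stack:
        pkg = stack.pop()
        if pkg not in result:
            result.add(pkg)
            stack.extend(graph[pkg])
    return result

categories = {cat: closure(pkgs, dependency_graph)
              for cat, pkgs in requested.items()}
# python is a dependency of pandas, so both packages land in both
# categories, exactly as argued above.
assert categories == {"main": {"python", "pandas"},
                      "test": {"python", "pandas"}}
```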

Does this logic make sense?


srilman commented Dec 13, 2022

Yes, that makes sense. Concatenating them together seems like the least error-prone solution.

Just to clarify though, do we expect conda-lock to currently fail, or is this the wanted future behavior? Because when I ran the previous example on the current main branch, it did not fail.


maresb commented Dec 13, 2022

It is a bug that solving succeeds. I'm guessing that there is some failed merging behavior. I haven't looked at the code, but I expect in pseudocode (class and parameter names are probably incorrect!!!) that the Pandas dependencies should look something like:

From main.yml: Dependency(name="pandas", specification="=1.4", categories=["main"])
From test.yml: Dependency(name="pandas", specification="=1.5", categories=["test"])

They should merge into: Dependency(name="pandas", specification="=1.4,=1.5", categories=["main", "test"]). I strongly suspect that, rather than being merged, the dependency from test is simply overwriting the other.
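A minimal sketch of that intended merge (again, Dependency here is a hypothetical stand-in for whatever class conda-lock actually uses):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Dependency:
    name: str
    specification: str
    categories: List[str] = field(default_factory=list)

def merge(a: Dependency, b: Dependency) -> Dependency:
    """Combine two occurrences of the same dependency instead of
    letting the later one silently overwrite the earlier."""
    assert a.name == b.name
    return Dependency(
        name=a.name,
        # A comma joins conda version specs as a logical AND.
        specification=f"{a.specification},{b.specification}",
        categories=sorted(set(a.categories) | set(b.categories)),
    )

merged = merge(
    Dependency("pandas", "=1.4", ["main"]),
    Dependency("pandas", "=1.5", ["test"]),
)
# merged.specification is "=1.4,=1.5", which is unsatisfiable, so the
# solver would correctly fail on this pair rather than pick one.
```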


srilman commented Dec 18, 2022

So after some research and planning, I decided to split the implementation into 2 pieces.

First, as you said, the current behavior when seeing repeated dependencies in multiple input files is to overwrite the version constraint rather than merge it. Since this is unwanted behavior, I wrote a mini PR to merge the version strings instead.

I have a second branch with an implementation for supporting multiple categories in lockfiles (built on top of the merging versions branch) located here: https://github.com/srilman/conda-lock/tree/multiple-categories


maresb commented Dec 18, 2022

This sounds excellent!!! Thanks so much for digging into this! I will try to review this soon.


maresb commented Jan 4, 2023

@g-rutter astutely points out in #306 that when we implement categories in the lockfiles, we need to ensure that categories propagate to all subdependencies, and provides a very nice test case. Thanks!

I haven't yet found a chance to verify whether a4d58b6 achieves this.


maresb commented Jan 4, 2023

I made a first pass at thinking through the details. (I might have some misconceptions about the implementation, so please correct any nonsense I write!) I get the impression that we may want to refactor a few things in order to implement this correctly...

First we need to understand the context of aggregate_lock_specs. In make_lock_spec we call parse_source_files to read the list of files into a corresponding list of LockSpecifications. Then we call aggregate_lock_specs to merge these into a single LockSpecification. Moreover, for each file processed within parse_source_files, we parse each platform individually and then call aggregate_lock_specs to merge the platform-specific results into a single LockSpecification. Later in the pipeline, create_lockfile_from_spec splits this fully-merged spec into individual platforms for solving.

It seems to me like we don't need to, and shouldn't, merge platforms in the LockSpecification. Indeed, my mental model (not the current implementation) of what a LockSpecification should be is a dict with platforms as keys, and platform-specific specifications as values.

One minor complication is that we don't know the list of platforms before running parse_source_files. If no platform_overrides is given via --platform, then we need to search the individual files for a platform specification, i.e. the top-level platforms: key. (In case none is found, we fall back to the default platforms.) For this reason, perhaps we want to compute the list of platforms in a first pass.

Perhaps one way to proceed is to refactor source-file-specific functionality into a SourceFile class. Then we can do something like

platforms = platform_overrides or union(sf.platforms for sf in source_files) or DEFAULT_PLATFORMS
spec = {platform: aggregate_lock_specs([sf.spec(platform) for sf in source_files]) for platform in platforms}

This way aggregate_lock_specs only needs to deal with specs for a single platform.
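The two pseudocode lines above can be fleshed out into a runnable toy (SourceFile, aggregate_lock_specs, and build_spec are illustrative stand-ins, not conda-lock's actual API):

```python
DEFAULT_PLATFORMS = ["linux-64", "osx-64"]

class SourceFile:
    """Hypothetical wrapper around one parsed source file."""
    def __init__(self, platforms, deps_by_platform):
        self.platforms = platforms          # platforms the file declares
        self._deps = deps_by_platform       # platform -> list of specs

    def spec(self, platform):
        return self._deps.get(platform, [])

def aggregate_lock_specs(specs):
    """Merge the per-file specs for a SINGLE platform into one list."""
    merged = []
    for spec in specs:
        merged.extend(spec)
    return merged

def build_spec(source_files, platform_overrides=None):
    # First pass: determine the platform list, with fallbacks.
    platforms = (
        platform_overrides
        or sorted(set().union(*(sf.platforms for sf in source_files)))
        or DEFAULT_PLATFORMS
    )
    # Second pass: a dict with platforms as keys, merged specs as values.
    return {
        p: aggregate_lock_specs([sf.spec(p) for sf in source_files])
        for p in platforms
    }

main = SourceFile(["linux-64"],
                  {"linux-64": ["python =3.10", "pandas =1.4"]})
test = SourceFile(["linux-64"], {"linux-64": ["pandas =1.5"]})
assert build_spec([main, test]) == {
    "linux-64": ["python =3.10", "pandas =1.4", "pandas =1.5"]
}
```

The key design point is that the platform loop lives outside aggregate_lock_specs, so merging never has to reason about cross-platform differences.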

Does it make sense what I'm writing?


srilman commented Jan 14, 2023

Sorry for the delay @maresb! Overall, I agree with the approach you suggested, especially:

  1. Treating LockSpecifications as a dict of platform to dependencies.
  2. Having aggregate_lock_specs deal with specs for a single platform.

But to clarify, how will we parse source files in this approach? Right now, I believe we parse a source file given a platform, in order to handle things like OS preprocessing selectors. In this new approach, would we parse the source file in a platform-independent fashion, and then apply a platform to it? I would much prefer that method, since it would make it easier to add more selector-related features in the future, such as

  1. and, or, and not operations
  2. Selectors for other parameters, such as the Python version.

Either way, I would be happy to take a first pass implementing this. Any ideas on how to split this into smaller tasks in order to reduce the size of the overall PR?


maresb commented Jan 15, 2023

No worries, great to hear back from you @srilman! I am not sure I understand your question:

In this new approach, would we parse the source file in a platform-independent fashion, and then apply a platform to it?

I think that multiple parsing passes are required because we don't know at the beginning what the final list of platforms is. So I think we need to parse unfiltered to extract the platforms key, and then reparse for each platform, applying the corresponding filter. Is this consistent with what you have in mind?
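To make the two-pass idea concrete, here's a toy sketch using hand-rolled line parsing instead of a real YAML parser, with the conda-build-style `# [win]` selector syntax handled very naively (all names and the parsing strategy are illustrative only):

```python
import re

SOURCE = """\
platforms:
  - linux-64
  - osx-64
dependencies:
  - python =3.10
  - pandas =1.5
  - pywin32  # [win]
"""

def extract_platforms(text):
    """Pass 1: read the top-level `platforms:` list, unfiltered."""
    platforms, collecting = [], False
    for line in text.splitlines():
        if line.startswith("platforms:"):
            collecting = True
        elif collecting and line.startswith("  - "):
            platforms.append(line[4:].strip())
        else:
            collecting = False
    return platforms

def parse_for_platform(text, platform):
    """Pass 2: re-parse, keeping deps whose selector matches `platform`."""
    deps, collecting = [], False
    for line in text.splitlines():
        if line.startswith("dependencies:"):
            collecting = True
            continue
        if not (collecting and line.startswith("  - ")):
            collecting = False
            continue
        m = re.match(r"  - (.*?)(?:\s*#\s*\[(\w+)\])?\s*$", line)
        dep, selector = m.group(1), m.group(2)
        # A selector like [win] matches platforms starting with "win".
        if selector is None or platform.startswith(selector):
            deps.append(dep)
    return deps

platforms = extract_platforms(SOURCE)
specs = {p: parse_for_platform(SOURCE, p) for p in platforms}
```

Here the first pass yields ["linux-64", "osx-64"], and the second pass drops pywin32 for both of those platforms while a hypothetical win-64 pass would keep it.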


maresb commented Jan 15, 2023

Feel free to take a stab at it. I think you are thinking about this more deeply than I am.
