Refactor into monorepo structure #230

rlrs · 2024-02-06T14:31:46Z

Making this PR now so that we can discuss how to structure the potential monorepo. I've already moved some things around, but this probably broke several things that then need to be fixed. Please write any relevant thoughts you have.

rlrs · 2024-02-08T09:09:08Z

Please comment and let's see if we can get this locked down. It only gets more complicated the longer we wait.
@peterbjorgensen @saattrupdan @MartinBernstorff @HLasse @peter-sk

saattrupdan · 2024-02-08T10:00:17Z

Please comment and let's see if we can get this locked down. It only gets more complicated the longer we wait. @peterbjorgensen @saattrupdan @MartinBernstorff @HLasse @peter-sk

Can you describe the setup and your thoughts behind it?

I'm for instance a bit confused about how the llm-foundry fork interacts with the repo post-PR. Is it an installed module? Is it part of the code base?

Further, is there any reason why we cannot have the standard code structure of a src folder, having separate data_processing and training modules (that can then be installed separately), and a single pyproject.toml in the project root, designating what requirements belong to what projects? Having separate config files floating around seems a bit silly - in that case we're not really mono repo'ing, and we could just as well have two separate repos?

rlrs · 2024-02-08T10:20:49Z

A monorepo is exactly having several different but related projects in one repo. If you just have one big thing in the root, that's a monolith and exactly what this PR is about moving away from. If you think we should stick with a monolith, that's a fair position.

Fundamentally, the training code does not care for nor need the dependencies from the preprocessing code. When I clone this repo on LUMI, I have no reason to install from the pyproject.toml in the root because the setup is completely different and relates to training. Same goes for evaluation. Nonetheless, there are some dependencies between these things, which is the purpose of putting them in the same repo.

MartinBernstorff · 2024-02-08T10:21:21Z

Can you describe the setup and your thoughts behind it?

I'm for instance a bit confused about how the llm-foundry fork interacts with the repo post-PR. Is it an installed module? Is it part of the code base?

Agreed on this 👍

Having separate config files floating around seems a bit silly - in that case we're not really mono repo'ing, and we could just as well have two separate repos?

Just a nitpick, but having different pyproject.toml-files does not make it less of a monorepo. It means we have multiple packages in the same repo, but that's exactly what we're looking for 👍

@rlrs The motivation here was that separate parts of DFM needed the same dependency, but with different versions, right? We don't need to have separate release pipelines or anything?

If that is the case, I think @saattrupdan is right, that the easiest transition is to have a pyproject.toml with different dependency groups. As long as the external APIs of the dependencies is stable (perhaps a big if?), this shouldn't cause problems with type-checking.

We could also go separate packages. That would just require a bit of CI-work, specifically ensuring that each project is tested independently with the right dependencies.

EDIT: @rlrs came swooping in before me 😉

rlrs · 2024-02-08T10:23:12Z

llm-foundry was and still is a submodule. I've put it as a submodule because I not only want to install it using pip, but I want to develop on it and perhaps upstream changes using git. That's what a git submodule is for.

If you really prefer sticking with Poetry to manage the entire repo, perhaps this discussion is relevant python-poetry/poetry#936

saattrupdan · 2024-02-08T10:31:56Z

llm-foundry was and still is a submodule. I've put it as a submodule because I not only want to install it using pip, but I want to develop on it and perhaps upstream changes using git. That's what a git submodule is for.

No reason to get snappy here, I just didn't know what a git submodule was in that case, but thanks for the explanation!

I found this explanation of a submodule:

It often happens that while working on one project, you need to use another project from within it. Perhaps it’s a library that a third party developed or that you’re developing separately and using in multiple parent projects. A common issue arises in these scenarios: you want to be able to treat the two projects as separate yet still be able to use one from within the other.

Here’s an example. Suppose you’re developing a website and creating Atom feeds. Instead of writing your own Atom-generating code, you decide to use a library. You’re likely to have to either include this code from a shared library like a CPAN install or Ruby gem, or copy the source code into your own project tree. The issue with including the library is that it’s difficult to customize the library in any way and often more difficult to deploy it, because you need to make sure every client has that library available. The issue with copying the code into your own project is that any custom changes you make are difficult to merge when upstream changes become available.

Git addresses this issue using submodules. Submodules allow you to keep a Git repository as a subdirectory of another Git repository. This lets you clone another repository into your project and keep your commits separate.

That sounds great.

Also, @rlrs, would keeping the dependencies of the data processing and training separate resolve the repo issue?

rlrs · 2024-02-08T10:39:42Z

No reason to get snappy here

Wasn't meant that way, was just intended as a shorthand for what you linked!

would keeping the dependencies of the data processing and training separate resolve the repo issue?

Splitting dependencies probably goes a long way here. Training requires a separate virtualenv from preprocessing requires a separate virtualenv from evaluation. As for training (and eval on lumi) it all has to run in a specific container, too. Even the venv has to be installed while in this container.

I think the essence is that I don't believe it to be cleaner to have one src/ directory than to have a functional split based on purpose.

saattrupdan · 2024-02-08T10:46:48Z

Wasn't meant that way, was just intended as a shorthand for what you linked!

It's all good 🙂

Splitting dependencies probably goes a long way here. Training requires a separate virtualenv from preprocessing requires a separate virtualenv from evaluation. As for training (and eval on lumi) it all has to run in a specific container, too. Even the venv has to be installed while in this container.

I think the essence is that I don't believe it to be cleaner to have one src/ directory than to have a functional split based on purpose.

Ah, I think I understand it now - the different environments and especially the need to run in a container for the training bit makes it more complicated than simply having the same core setup and installation method. I think I get why a monorepo structure would make things easier in that case, when the setup is really quite different on the two different sides.

Will there be any need to import things across the data processing and training, or are they completely separate? And if so, is that possible with the monorepo structure?

rlrs · 2024-02-08T12:20:39Z

We talked about this at the office, and I believe we achieved a common understanding of this monorepo structure making sense, even though the different projects really aren't connected.
We could probably do a separate repo for each project now, but in the future we would like to have a joint pipeline for data -> training -> evaluation.
@HLasse @KennethEnevoldsen any comments from you besides the above?

KennethEnevoldsen · 2024-02-08T12:27:15Z

We talked about this at the office, and I believe we achieved a common understanding of this monorepo structure making sense, even though the different projects really aren't connected.
We could probably do a separate repo for each project now, but in the future we would like to have a joint pipeline for data -> training -> evaluation.

So what is the conclusion from the discussion above - I am unsure?

edit: Discussed this with @MartinBernstorff - it seems like we can clarify this much better in a 20 min call.

saattrupdan · 2024-02-08T12:54:10Z

Regarding the CI thing, it seems like it's super easy to run separate parts of CI for different subfolders.

See here: https://stackoverflow.com/a/67941131/3154226

rlrs · 2024-02-08T13:03:00Z

Great, do we fix this in another PR? Also there might be some paths that are broken now, and misc. things like the Makefile not making sense.

saattrupdan · 2024-02-08T13:05:49Z

Great, do we fix this in another PR? Also there might be some paths that are broken now, and misc. things like the Makefile not making sense.

Yep, let's fix these things in other PR(s).

saattrupdan

LGTM

rlrs added 3 commits February 6, 2024 14:42

move LUMI training scripts to a directory of its own

fa297e5

move data processing code to data-processing directory

bb77aaa

move pyproject.toml into data-processing

85f9980

rlrs linked an issue Feb 6, 2024 that may be closed by this pull request

Refactor into monorepo structure #221

Closed

rlrs marked this pull request as ready for review February 8, 2024 13:03

saattrupdan approved these changes Feb 8, 2024

View reviewed changes

Merge branch 'main' into 221-refactor-into-monorepo-structure

6799013

rlrs merged commit 19cf8ec into main Feb 8, 2024
1 check failed

rlrs deleted the 221-refactor-into-monorepo-structure branch February 8, 2024 13:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor into monorepo structure #230

Refactor into monorepo structure #230

rlrs commented Feb 6, 2024

rlrs commented Feb 8, 2024

saattrupdan commented Feb 8, 2024

rlrs commented Feb 8, 2024

MartinBernstorff commented Feb 8, 2024 •

edited

rlrs commented Feb 8, 2024 •

edited

saattrupdan commented Feb 8, 2024

rlrs commented Feb 8, 2024

saattrupdan commented Feb 8, 2024 •

edited

rlrs commented Feb 8, 2024

KennethEnevoldsen commented Feb 8, 2024 •

edited

saattrupdan commented Feb 8, 2024

rlrs commented Feb 8, 2024

saattrupdan commented Feb 8, 2024

saattrupdan left a comment

Refactor into monorepo structure #230

Refactor into monorepo structure #230

Conversation

rlrs commented Feb 6, 2024

rlrs commented Feb 8, 2024

saattrupdan commented Feb 8, 2024

rlrs commented Feb 8, 2024

MartinBernstorff commented Feb 8, 2024 • edited

rlrs commented Feb 8, 2024 • edited

saattrupdan commented Feb 8, 2024

rlrs commented Feb 8, 2024

saattrupdan commented Feb 8, 2024 • edited

rlrs commented Feb 8, 2024

KennethEnevoldsen commented Feb 8, 2024 • edited

saattrupdan commented Feb 8, 2024

rlrs commented Feb 8, 2024

saattrupdan commented Feb 8, 2024

saattrupdan left a comment

Choose a reason for hiding this comment

MartinBernstorff commented Feb 8, 2024 •

edited

rlrs commented Feb 8, 2024 •

edited

saattrupdan commented Feb 8, 2024 •

edited

KennethEnevoldsen commented Feb 8, 2024 •

edited