Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor into monorepo structure #230

Merged
merged 4 commits into from Feb 8, 2024
Merged

Conversation

rlrs
Copy link
Collaborator

@rlrs rlrs commented Feb 6, 2024

Making this PR now so that we can discuss how to structure the potential monorepo. I've already moved some things around, but this probably broke several things that then need to be fixed. Please write any relevant thoughts you have.

@rlrs rlrs linked an issue Feb 6, 2024 that may be closed by this pull request
@rlrs
Copy link
Collaborator Author

rlrs commented Feb 8, 2024

Please comment and let's see if we can get this locked down. It only gets more complicated the longer we wait.
@peterbjorgensen @saattrupdan @MartinBernstorff @HLasse @peter-sk

@saattrupdan
Copy link
Collaborator

Please comment and let's see if we can get this locked down. It only gets more complicated the longer we wait. @peterbjorgensen @saattrupdan @MartinBernstorff @HLasse @peter-sk

Can you describe the setup and your thoughts behind it?

I'm for instance a bit confused about how the llm-foundry fork interacts with the repo post-PR. Is it an installed module? Is it part of the code base?

Further, is there any reason why we cannot have the standard code structure of a src folder, having separate data_processing and training modules (that can then be installed separately), and a single pyproject.toml in the project root, designating what requirements belong to what projects? Having separate config files floating around seems a bit silly - in that case we're not really mono repo'ing, and we could just as well have two separate repos?

@rlrs
Copy link
Collaborator Author

rlrs commented Feb 8, 2024

A monorepo is exactly having several different but related projects in one repo. If you just have one big thing in the root, that's a monolith and exactly what this PR is about moving away from. If you think we should stick with a monolith, that's a fair position.

Fundamentally, the training code does not care for nor need the dependencies from the preprocessing code. When I clone this repo on LUMI, I have no reason to install from the pyproject.toml in the root because the setup is completely different and relates to training. Same goes for evaluation. Nonetheless, there are some dependencies between these things, which is the purpose of putting them in the same repo.

@MartinBernstorff
Copy link
Collaborator

MartinBernstorff commented Feb 8, 2024

Can you describe the setup and your thoughts behind it?

I'm for instance a bit confused about how the llm-foundry fork interacts with the repo post-PR. Is it an installed module? Is it part of the code base?

Agreed on this 👍

Having separate config files floating around seems a bit silly - in that case we're not really mono repo'ing, and we could just as well have two separate repos?

Just a nitpick, but having different pyproject.toml-files does not make it less of a monorepo. It means we have multiple packages in the same repo, but that's exactly what we're looking for 👍

@rlrs The motivation here was that separate parts of DFM needed the same dependency, but with different versions, right? We don't need to have separate release pipelines or anything?

If that is the case, I think @saattrupdan is right, that the easiest transition is to have a pyproject.toml with different dependency groups. As long as the external APIs of the dependencies is stable (perhaps a big if?), this shouldn't cause problems with type-checking.

We could also go separate packages. That would just require a bit of CI-work, specifically ensuring that each project is tested independently with the right dependencies.

EDIT: @rlrs came swooping in before me 😉

@rlrs
Copy link
Collaborator Author

rlrs commented Feb 8, 2024

llm-foundry was and still is a submodule. I've put it as a submodule because I not only want to install it using pip, but I want to develop on it and perhaps upstream changes using git. That's what a git submodule is for.

If you really prefer sticking with Poetry to manage the entire repo, perhaps this discussion is relevant python-poetry/poetry#936

@saattrupdan
Copy link
Collaborator

llm-foundry was and still is a submodule. I've put it as a submodule because I not only want to install it using pip, but I want to develop on it and perhaps upstream changes using git. That's what a git submodule is for.

No reason to get snappy here, I just didn't know what a git submodule was in that case, but thanks for the explanation!

I found this explanation of a submodule:

It often happens that while working on one project, you need to use another project from within it. Perhaps it’s a library that a third party developed or that you’re developing separately and using in multiple parent projects. A common issue arises in these scenarios: you want to be able to treat the two projects as separate yet still be able to use one from within the other.

Here’s an example. Suppose you’re developing a website and creating Atom feeds. Instead of writing your own Atom-generating code, you decide to use a library. You’re likely to have to either include this code from a shared library like a CPAN install or Ruby gem, or copy the source code into your own project tree. The issue with including the library is that it’s difficult to customize the library in any way and often more difficult to deploy it, because you need to make sure every client has that library available. The issue with copying the code into your own project is that any custom changes you make are difficult to merge when upstream changes become available.

Git addresses this issue using submodules. Submodules allow you to keep a Git repository as a subdirectory of another Git repository. This lets you clone another repository into your project and keep your commits separate.

That sounds great.

Also, @rlrs, would keeping the dependencies of the data processing and training separate resolve the repo issue?

@rlrs
Copy link
Collaborator Author

rlrs commented Feb 8, 2024

No reason to get snappy here

Wasn't meant that way, was just intended as a shorthand for what you linked!

would keeping the dependencies of the data processing and training separate resolve the repo issue?

Splitting dependencies probably goes a long way here. Training requires a separate virtualenv from preprocessing requires a separate virtualenv from evaluation. As for training (and eval on lumi) it all has to run in a specific container, too. Even the venv has to be installed while in this container.

I think the essence is that I don't believe it to be cleaner to have one src/ directory than to have a functional split based on purpose.

@saattrupdan
Copy link
Collaborator

saattrupdan commented Feb 8, 2024

Wasn't meant that way, was just intended as a shorthand for what you linked!

It's all good 🙂

Splitting dependencies probably goes a long way here. Training requires a separate virtualenv from preprocessing requires a separate virtualenv from evaluation. As for training (and eval on lumi) it all has to run in a specific container, too. Even the venv has to be installed while in this container.

I think the essence is that I don't believe it to be cleaner to have one src/ directory than to have a functional split based on purpose.

Ah, I think I understand it now - the different environments and especially the need to run in a container for the training bit makes it more complicated than simply having the same core setup and installation method. I think I get why a monorepo structure would make things easier in that case, when the setup is really quite different on the two different sides.

Will there be any need to import things across the data processing and training, or are they completely separate? And if so, is that possible with the monorepo structure?

@rlrs
Copy link
Collaborator Author

rlrs commented Feb 8, 2024

We talked about this at the office, and I believe we achieved a common understanding of this monorepo structure making sense, even though the different projects really aren't connected.
We could probably do a separate repo for each project now, but in the future we would like to have a joint pipeline for data -> training -> evaluation.
@HLasse @KennethEnevoldsen any comments from you besides the above?

@KennethEnevoldsen
Copy link
Contributor

KennethEnevoldsen commented Feb 8, 2024

We talked about this at the office, and I believe we achieved a common understanding of this monorepo structure making sense, even though the different projects really aren't connected.
We could probably do a separate repo for each project now, but in the future we would like to have a joint pipeline for data -> training -> evaluation.

So what is the conclusion from the discussion above - I am unsure?

edit: Discussed this with @MartinBernstorff - it seems like we can clarify this much better in a 20 min call.

@saattrupdan
Copy link
Collaborator

Regarding the CI thing, it seems like it's super easy to run separate parts of CI for different subfolders.

See here: https://stackoverflow.com/a/67941131/3154226

@rlrs
Copy link
Collaborator Author

rlrs commented Feb 8, 2024

Great, do we fix this in another PR? Also there might be some paths that are broken now, and misc. things like the Makefile not making sense.

@rlrs rlrs marked this pull request as ready for review February 8, 2024 13:03
@saattrupdan
Copy link
Collaborator

Great, do we fix this in another PR? Also there might be some paths that are broken now, and misc. things like the Makefile not making sense.

Yep, let's fix these things in other PR(s).

Copy link
Collaborator

@saattrupdan saattrupdan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rlrs rlrs merged commit 19cf8ec into main Feb 8, 2024
1 check failed
@rlrs rlrs deleted the 221-refactor-into-monorepo-structure branch February 8, 2024 13:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor into monorepo structure
4 participants