@thomasfsteeples

Currently, ggml.c and ggml.h are maintained as separate files in both whisper.cpp and llama.cpp. This pull request adds the ggml project as a submodule to whisper.cpp, allowing ggml development to occur in one location only, and minimising the risk of diverging branches.
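As a rough illustration of the setup described above, the sketch below registers a library as a submodule. It assumes throwaway local repositories: `ggml-upstream` and `whisper-demo` are hypothetical stand-ins, not the real projects, and the `protocol.file.allow` override is only there because the demo clones from a local path.

```shell
#!/bin/sh
set -eu

# Fixed identity so commits work without any global git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# Hypothetical stand-in for the upstream ggml repository.
git init -q "$work/ggml-upstream"
cd "$work/ggml-upstream"
echo '/* placeholder for the real header */' > ggml.h
git add ggml.h
git commit -q -m "ggml: initial import"

# Hypothetical stand-in for whisper.cpp.
git init -q "$work/whisper-demo"
cd "$work/whisper-demo"

# Register ggml as a submodule. The -c override is an assumption of this
# demo: it is needed only because the clone source is a local path.
git -c protocol.file.allow=always submodule --quiet add "$work/ggml-upstream" ggml
git commit -q -m "Add ggml as a submodule"

# The superproject now records the submodule URL and a pinned commit.
cat .gitmodules
git submodule status
```

Anyone cloning the superproject afterwards would need `git clone --recurse-submodules`, or `git submodule update --init` in an existing clone, to get the ggml sources.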

I propose taking the same approach with llama.cpp as well, and will follow this up separately if this pull request is approved.

Happy to tweak or discuss if needed.

@prusnak

prusnak commented Mar 15, 2023

My experience with submodules, to put it mildly, is "far from optimal" because they make development much more tedious, especially for rapidly evolving projects.

Imagine needing to change the API; you must modify it in the submodule, then update submodules in all projects using the library, and finally adjust the API usage in all these projects. The worst aspect of this is that it obscures the changes made by splitting them across multiple commits.
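To make that workflow concrete, here is a minimal sketch with throwaway local repositories. The names `ggml-upstream`, `app`, and `ggml_demo` are hypothetical, and the files are plain text placeholders that are never compiled. One logical API change ends up as a commit in the library plus a separate commit in each consumer.

```shell
#!/bin/sh
set -eu

# Fixed identity so commits work without any global git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# Hypothetical library repository with an initial API.
git init -q "$work/ggml-upstream"
cd "$work/ggml-upstream"
echo 'int ggml_demo(void);' > ggml.h
git add ggml.h
git commit -q -m "ggml: initial API"

# Hypothetical consumer that embeds the library as a submodule.
# The -c override is needed only because the demo clones a local path.
git init -q "$work/app"
cd "$work/app"
git -c protocol.file.allow=always submodule --quiet add "$work/ggml-upstream" ggml
echo 'int main(void) { return ggml_demo(); }' > main.c
git add main.c
git commit -q -m "app: use ggml via submodule"

# Step 1: the API change lands in the library repository.
cd "$work/ggml-upstream"
echo 'int ggml_demo(int n_threads);' > ggml.h
git commit -q -am "ggml: add n_threads parameter"

# Step 2: the consumer moves its submodule pointer to the new commit.
cd "$work/app"
git -c protocol.file.allow=always submodule --quiet update --remote ggml

# Step 3: the consumer adapts its call site and commits both together.
echo 'int main(void) { return ggml_demo(4); }' > main.c
git add ggml main.c
git commit -q -m "app: update ggml and adapt to new API"

# The same logical change now lives in two histories.
git -C "$work/ggml-upstream" log --oneline
git -C "$work/app" log --oneline
```

Steps 2 and 3 would be repeated in every project that consumes the library.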

I suggest using a monorepo where ggml, whisper.cpp, and llama.cpp coexist. This approach makes it much easier to implement changes in a single commit (e.g., a lib API change) and greatly improves the process of writing CI scripts that check everything together in one step.

If necessary, individual projects can still be exported from the monorepo to separate repositories, but this is honestly rarely needed.

@ggerganov have you considered switching to monorepo?

@thomasfsteeples
Author

My fear with a monorepo is that it promotes ggml as just the plumbing of whisper.cpp and llama.cpp, rather than a tensor framework in its own right, with wider applicability. I also guess this is a matter of personal opinion, but I believe having this separation of repos is a good thing - it encourages classic software development principles like good API design and separation of concerns.

However, I'll go along with whatever @ggerganov and the community deem best.

@prusnak

prusnak commented Mar 15, 2023

> My fear with a monorepo is that it promotes ggml as just the plumbing of whisper.cpp and llama.cpp, rather than a tensor framework in its own right, with wider applicability. I also guess this is a matter of personal opinion, but I believe having this separation of repos is a good thing - it encourages classic software development principles like good API design and separation of concerns.
>
> However, I'll go along with whatever @ggerganov and the community deem best.

Right. These are good arguments to consider too. If the goal is to decouple ggml from whisper.cpp and llama.cpp and include it in these projects via a release process, this makes sense.

@Mike-Bell
Contributor

My two cents is that when you have a closely related dependency tree like this, it makes sense to just put them in the same repo. You can always put them in separate folders with separate READMEs if you want them to "feel separate" while also getting the benefits that @prusnak explained. Dependency management is hard.

@thomasfsteeples
Author

> My two cents is that when you have a closely related dependency tree like this, it makes sense to just put them in the same repo. You can always put them in separate folders with separate READMEs if you want them to "feel separate" while also getting the benefits that @prusnak explained. Dependency management is hard.

I feel git submodules offer the ideal solution here: they're separate repos, but still embedded within a project. Now that there are two projects based on ggml, my concern would be the risk/cost associated with having two or three separate versions of it being maintained at once. Moreover, as ggml is a logical unit, my feeling is that it should be maintained separately, and changes to it should be considered with respect to it as an independent project, rather than to the immediate aims of the downstream project in question.

Dependency management is hard, but traditional versioning and good practice work well, and similar projects manage just fine with such an approach.

@ggerganov
Member

Hey all, I am still thinking about how to organize the projects to make them easier to work with.
Submodules are definitely an option.

@Green-Sky
Contributor

@ggerganov I've personally been using submodules since forever, but I recently came across this post: https://diziet.dreamwidth.org/14666.html

@asmaloney
Contributor

Just to put in my two cents - I'm in favour of using submodules for this.

> Imagine needing to change the API; you must modify it in the submodule, then update submodules in all projects using the library, and finally adjust the API usage in all these projects.

I would argue this is a feature. It forces you to think about the changes to the API and how they affect other projects (not just whisper and llama) and to think of ggml (for example) as a separate project.

In my experience this is a Good Thing even if slightly inconvenient at times.

A monorepo with ggml, whisper.cpp, llama.cpp and then whatever else comes up might be easier for whisper.cpp and llama.cpp, but it's less usable by other projects. If I'm just interested in embedding ggml as a submodule in my own work, I don't want to carry around the whisper.cpp and llama.cpp stuff (+ any new things), and then have to sift through all the changes for those projects whenever there are updates. Keeping them separate also makes documentation easier to follow.

Small and simple wins IMHO.

@prusnak

prusnak commented Mar 27, 2023

Btw, these are nice write-ups about git subtree, in case that is also being considered: https://www.atlassian.com/git/tutorials/git-subtree + https://blog.developer.atlassian.com/the-power-of-git-subtree/

@mqy

mqy commented Jun 18, 2023

@fire

fire commented Jul 26, 2023

I've had the best experience with https://github.com/ingydotnet/git-subrepo.

We have an open-source game project and we use git subrepo to pull in addons: https://github.com/V-Sekai/v-sekai-game/tree/main/addons.

We also do this with the C++ Godot engine.

@DeveloperPaul123

DeveloperPaul123 commented Sep 13, 2023

If you're open to sticking with CMake as the primary build system, this could easily be done with the FetchContent module or with something like CPM.cmake. This would allow you to pull a specific commit of the ggml library, and it would still be built alongside whisper.cpp.

The downside is that it causes your CMake configuration time to go up, albeit slightly in this case, and potentially only on the first run, since the downloaded sources are cached.

@thomasfsteeples thomasfsteeples closed this by deleting the head repository Apr 30, 2024

9 participants