[Feature request] add option to force storing of script code when running from inside a git repository #340
Hi @markus289, how do you think this feature should be active? As part of the …
To be honest, I have been using clearml since yesterday, and the goal is to execute tasks on a set of remote machines. It was really difficult for me to figure out how my code even reaches the remote machines; it was trial and error to work out what happens. I would have expected a discussion of this topic (how code reaches remote machines) in the Fundamentals section of the documentation. In other words, I don't think I am qualified to suggest how this feature should fit into clearml.

What I would have expected is this: clearml ignores that my code lives in a git repository and creates a snapshot of the code to be submitted to the remote machine. I would also have assumed that I could take a look at said snapshot of the code in the web interface. In my opinion this is important for reproducing results, but it may not be the primary goal of clearml.

To elaborate on the use case mentioned above (having the git repository only locally): this is not the complete picture. Currently, clearml always fails when the remote machine cannot reach the git repository. In our case, the remote machines are set up by another person who cannot add my ssh keys, so clearml cannot check out the git repositories on the remote machine using SSH. A way forward would be to let clearml generate ssh keys for each project (?) which could then be added (with read access) to the git repository (GitHub/GitLab/whatever). In any case, I am wondering why I seem to be the first person with these problems.
I would also like to make a suggestion in this regard. ClearML does not seem to automatically pick up new files if they are not added to Git (even if the project is a git repository); I would like an option to make ClearML pick up these new files through a command-line option even if they are not part of the Git repo. For example, this could be really useful for local pretrained model weights. Another suggestion would be a config option to ignore Git completely and just store the entire project folder for the experiment. For example, when launching a new experiment I'm using the following script:
In this command, I'm already telling ClearML which folder I want to sync (--folder Swin-Transformer-Object-Detection). I would suggest an additional argument in this case, for example --no_git, that would disable the git system entirely and just store everything under the project folder. The first suggestion is useful when we want to run a new experiment with uncommitted changes that include a new file; pre-trained model weights are better handled by treating them as an extra dataset. The second suggestion would help a lot because it means we do not have to add Git credentials to the clearml-agents either. This could be beneficial for new teams trying out ClearML (it makes integration more straightforward: we just give it a folder and a python file to run, and it runs without having to deal with all the Git stuff), and it could be useful for big companies with multiple teams and more complex Git setups that don't have a single Git account with access to all the repositories.
@markus289 and @ErenBalatkan unfortunately I'm not actually sure we have a "place" to store the entire folder snapshot. This basically means zipping the folder, uploading it, and making sure the agent can pull & unzip the file...
That might be possible; the reason …
Wouldn't it be possible to treat project folders similarly to the way datasets work? I'm not sure how the git diff part could be handled.
Yes it would, but it also means we need to add support for the …
It might be possible to combine the two and just pack everything that is not in the git repo into a zip file, then have the …
Yeah, Git support is already pretty good from what I have tested so far, but the ability to run projects that are not connected to Git would be a huge plus for us. Also, another idea that came to mind: would it be possible to have Git credential keys per experiment instead of per agent? This would mean we do not need one account with access to all the repositories we use for ML experiments, and could instead supply credentials from our own accounts per experiment. I can imagine this would be a huge time saver for teams that frequently experiment with different repos, and for companies with a high number of independent teams.
I think this is doable. Would zipping the entire (really the entire, no filtering added) working directory work?
Yep, I'm pretty sure the paid version supports this kind of vault :) but for the open-source version this is still under the BYOS (bring your own security) approach, which makes things a lot easier for me :)
I think the best approach would be to add another argument for the root folder; this path would be relative to the execution directory, and if not provided, the execution directory could be the root instead. I think zipping the entire project would work; it is not the most optimized approach, but it will get the job done. A very simple gitignore-like filter could be useful for teams that may only need one pretrained model in a directory filled with pre-trained models. But we are relying on a clearml-data based approach for pre-trained models, so that is not a problem in our case.
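The "zip the entire working directory, no filtering" idea discussed above can be sketched with the Python standard library alone. This is illustrative, not ClearML API; the function name `zip_working_dir` is made up for the sketch:

```python
import zipfile
from pathlib import Path


def zip_working_dir(root: str, out_path: str) -> int:
    """Zip every file under `root` (no filtering), preserving relative
    paths inside the archive. Returns the number of files packed."""
    root_path = Path(root)
    count = 0
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root_path.rglob("*")):
            if path.is_file():
                # Store paths relative to the root, with forward slashes
                zf.write(path, path.relative_to(root_path).as_posix())
                count += 1
    return count
```

The agent side would then only need to download and `zipfile.ZipFile(...).extractall()` the snapshot, which keeps the dependency footprint minimal.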
Following the request in #395 for more specific control over the data that will be packaged:

```python
task.package_repo(root_folder='.', include_files=['*.py', '*.txt'], exclude_files=['*.pyc'])
```

Also introduce a way to make this the default for any non-git execution with the configuration file.
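A minimal sketch of how the `include_files`/`exclude_files` selection proposed above might behave, using only the standard library. The semantics are an assumption (a file is kept if its name matches any include pattern and no exclude pattern), and `select_files` is a hypothetical helper, not ClearML API:

```python
from fnmatch import fnmatch
from pathlib import Path


def select_files(root, include_files=("*",), exclude_files=()):
    """Return sorted relative paths under `root` whose file name matches
    at least one include pattern and no exclude pattern (assumed
    semantics for the proposed package_repo arguments)."""
    selected = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        name = path.name
        if not any(fnmatch(name, pat) for pat in include_files):
            continue
        if any(fnmatch(name, pat) for pat in exclude_files):
            continue
        selected.append(path.relative_to(root).as_posix())
    return selected
```

Combined with a zip step, this would give the filtered-snapshot behavior requested in #395.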
@idantene I'd rather have a more generic interface to control additional files than an additional argument on the … @ErenBalatkan wdyt? BTW: if you have better naming for the functions/parameters, feel free to suggest :) This might be a bit more challenging to implement on the agent side, as by default it has no storage drivers, but let's see what we can do when we get there.
Looks perfect to me. This would give us a ton more flexibility with experiments.
The … I think it would anyway be better to have the package repositories as a dictionary, so that (if we continue with the above example) we'll have:

```python
task.add_repo(name, root_folder=..., include_files=..., exclude_files=...)
task.list_repos()
task.remove_repo(name)
task.update_repo(name, root_folder=..., include_files=..., exclude_files=...)
```

This also fits the notion of the …
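The name-keyed interface sketched above could be backed by a plain dictionary. This toy `RepoRegistry` class is purely illustrative (not ClearML API) and only shows the bookkeeping, not the actual packaging:

```python
class RepoRegistry:
    """Toy sketch of the proposed dictionary-of-repos interface."""

    def __init__(self):
        self._repos = {}

    def add_repo(self, name, root_folder=".", include_files=("*",), exclude_files=()):
        # Reject duplicate names so update_repo stays explicit
        if name in self._repos:
            raise ValueError(f"repo {name!r} already registered")
        self._repos[name] = dict(
            root_folder=root_folder,
            include_files=list(include_files),
            exclude_files=list(exclude_files),
        )

    def list_repos(self):
        return sorted(self._repos)

    def update_repo(self, name, **changes):
        # Only overwrite the fields the caller passed in
        self._repos[name].update(changes)

    def remove_repo(self, name):
        del self._repos[name]
```

Keeping each entry a flat dict also means the registry itself could be logged on the Task for inspection.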
Basically, the way I imagine it: we zip everything (source code, data, etc.), then put it as an artifact on the Task.
Not sure I understand how you can ship it to a remote machine but not log it on the Task? And actually, why wouldn't you want to log it on the Task? I like the …
From my perspective, stored code must be logged to the ClearML task to ensure that every experiment is reproducible. Currently, saving the Git commit already ensures this, but for this alternative approach that does not use Git, the code has to be saved on the Task so that reproducibility is not broken.
(bump) Any updates on this? :)
@bmartinn - sorry, I only now noticed this. We have a legacy system with ground-truth CSV files, and we haven't yet transitioned to clearml-data (we version it using DVC) - so there's no need to log it again, multiple times, on each task that uses them. My workaround thus far has been to use the StorageManager to make it available for remote tasks.
Hi @idantene, it's still on our TODO list - I'll try to push it forward 🙂
…script instead of git repository reference (issue #340)
Hey @jkhenning! Thanks for the notification; my original request was actually a bit different from this (see #395). We consolidated discussions here as @bmartinn suggested :) tl;dr - my original feature request was to package additional files with the task :)
@jkhenning @bmartinn should the discussion continue in #395, now that this issue is merged in 1.1.6?
Hmm yes, for that one we need to make sure we have a proper interface.
Am I missing something here? Main challenge: … Thoughts? On a side note, why not use git for that? It was designed for exactly these things, no?! Am I missing something from the use case?
I don't think you are :) I'm not aware of how things work right now, but I assume the agent downloads the requirement list (from …
Adding …
No, git was not designed for these things :) These are not code objects we ship, but rather CLI arguments that refer to various files (i.e. config files, local pre-existing models, etc.). They are not meant for versioning, just for that specific run (and hence replication). They are not suitable to be a …

EDIT: So, for example, under the assumption the …
No real size limitation here, I think the actual limit is the entire Task object itself, and I think this is about 16MB. This means 700+ lines should not be a problem.
Oh, that makes sense. So you have git + config + models?
@bmartinn, sorry for being unclear. The "limit" here is not one of size, but of readability. The configuration tab fits perfectly for a small, flat dictionary structure (some nested levels are fine). With 700+ lines and multiple levels of nesting, using it to store a full config file makes the file unintelligible and removes most, if not all, readability.

The workaround we have now is given below, and I believe it could be implemented within clearml-agent directly. While keeping it light, I don't think boto3 and friends are a large requirement to add to Python code - do you? It could even be a dynamic requirement, based on the agent's configuration file.

We first determine a cache location based on a project+task name, either given in the configuration file (when running locally) or from the remote task metadata (when running remotely). In our case this is a MinIO S3 bucket. The main script is structured as:

```python
pre_init()
# set up argument parser
# load dotenv file, verify config file integrity
post_init()
# normal run as expected
```

Then, the `pre_init` is roughly:

```python
if Task.running_locally():
    return
# Remote execution
# Set basic environment variables from remote task
# Download cache directory locally
# Clear cache directory (using boto3) and obfuscate remote basic environment variables
```

And the `post_init`:

```python
if not Task.running_locally():
    return
# Prepare for remote execution
task = Task.init(...)
# Set basic environment variables in task
# Upload configuration file(s), model file(s), etc. as needed to cache using StorageManager
# Switch to remote execution
```

We also included a small cleanup for the remote execution that then clears up the agent's local storage. I can probably share non-pseudo-code bits if this is insufficient.
@idantene Wondering about where you say "...using it to store a full config file makes the file unintelligible and removes most, if not all, readability" - In which way does readability come into play? From the discussion so far, it sounds like the main driver is programmatic access to the document(s), and with the interface being considered (artifacts) there's no extra readability maintained over configuration objects? |
@ainoam Readability comes into play in debugging and understanding a Task's inner workings. |
@idantene Wholly agree. But how does an artifacts interface improve readability on the configuration objects one? |
@ainoam the preview window over the configurable UI, essentially. |
Thanks for pinging @idantene !
The main issue with storing additional files on the Task is cloning Tasks. Basically if everything is stored on the Task object, then creating a copy of the Task object, copies everything we need to run it. But if we add additional files as artifacts, when cloning the Task, the links to the artifacts are not cloned (because usually artifacts are the output of the Task execution not input). |
If that's the case, then it is the user's fault, not ClearML's (or offer a "copy input files" checkbox when cloning a task, causing them to be replicated).
Currently, clearml has two approaches to get code to workers.
This is a problem if git is only used locally, i.e., there is no remote associated with the local git repository in which the code resides. Then, when cloning the experiment and running it on a worker, the following error occurs.
If, on the other hand, the executed script is not part of a git repository, the script code is stored by clearml.
Now, it would be nice to have control over this decision, i.e., being able to store the script code, even if a git repository is detected.
As was pointed out to me on Slack, it is possible to disable the repository autodetection.
But then, the script is not stored.
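For reference, this issue was resolved in 1.1.6 (see the linked commit "…script instead of git repository reference"). If I read that correctly, the merged feature exposes a class method that forces storing the standalone script even when a git repository is detected; a sketch, assuming that API and an illustrative project/task name (verify against your installed clearml version):

```python
from clearml import Task

# Assumption: clearml >= 1.1.6 provides this class method; it forces
# storing the entire calling script on the Task instead of a git
# repository reference. It must be called before Task.init().
Task.force_store_standalone_script()

task = Task.init(project_name="examples", task_name="no-git-run")
```

This requires a configured ClearML server, so it is not runnable as a standalone snippet.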