runs stuck in progress #2995

Open
Laiaborrell opened this issue Sep 18, 2023 · 9 comments
Labels
type / question Issue type: question

Comments

@Laiaborrell

Laiaborrell commented Sep 18, 2023

When running aim up and checking the runs in the UI, 4 out of 5 runs that have already finished training appear stuck "in progress" (green dots). Only the last run trained is tagged as "finished":
image
A couple of days ago, when I last checked the state of the trainings, those runs were already tagged as finished, but somehow they have been reactivated now...
Because of this, when I open these runs to check the metrics and figures, a pop-up appears with the message "Error. Run not found":
image
Note that no error is printed in the terminal where the aim up command is being run.

I would really appreciate any help,
thanks!

@Laiaborrell Laiaborrell added the type / question Issue type: question label Sep 18, 2023
@mihran113
Contributor

Hey @Laiaborrell! Thanks a lot for the report. That seems kinda strange, as there's no scenario in which runs can reactivate by themselves. My only guess is that someone tried to delete the runs and something went wrong during deletion, which is why the UI reports that the runs are not found. Aim stores data about runs in 2 DBs (SQLite and RocksDB), so I think the RocksDB portions of the data were removed while the data in SQLite is still there.
You can check if that's the case by checking whether the .aim/meta/chunks/{run_hash} directory still exists.
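
A minimal sketch of that check for several runs at once. It assumes the Aim repo directory is passed in explicitly (`.aim` in the working directory is only a guess at the usual location); the function name `find_missing_chunks` and the placeholder hashes are hypothetical, not part of Aim:

```python
from pathlib import Path


def find_missing_chunks(repo_dir, run_hashes):
    """Return the run hashes that have no directory under <repo_dir>/meta/chunks."""
    chunks_dir = Path(repo_dir) / "meta" / "chunks"
    return [h for h in run_hashes if not (chunks_dir / h).is_dir()]


if __name__ == "__main__":
    # Placeholder hashes; replace them with the hashes shown in the UI.
    hashes = ["<run_hash_1>", "<run_hash_2>"]
    # ".aim" is an assumed repo location; adjust it to your setup.
    for run_hash in find_missing_chunks(".aim", hashes):
        print(f"no chunk directory for run {run_hash}")
```

An empty result means every listed run still has its chunk directory, which would rule out the partial-deletion guess above.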

@Laiaborrell
Author

Laiaborrell commented Sep 19, 2023

Hey @mihran113, thanks for your reply!! I checked and the hashes for the runs are still in the chunks folder:
image
I did not try to delete any of the files either :/
It is weird to me because they appear as active and the run time keeps increasing (8 days now), but the GPU where the process was training has been stopped... Also, the files in the chunk folders were last updated three days ago, when their training finished
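
For anyone wanting to repeat the last-updated check, a minimal sketch that reports the newest file modification time under a run's chunk directory. The helper name `latest_mtime` and the example path are hypothetical; nothing here calls Aim itself:

```python
from datetime import datetime
from pathlib import Path


def latest_mtime(directory):
    """Return the newest modification time of any file under directory, or None if it has no files."""
    times = [p.stat().st_mtime for p in Path(directory).rglob("*") if p.is_file()]
    return datetime.fromtimestamp(max(times)) if times else None


if __name__ == "__main__":
    # Assumed location of one run's chunk directory; replace <run_hash> with a real hash.
    print(latest_mtime(".aim/meta/chunks/<run_hash>"))
```

If the reported time matches the end of training while the UI run time keeps growing, the data on disk is not being written to anymore and only the "in progress" status is stale.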

@Maximiliano-Villanueva

Hi @Laiaborrell, did you manage to solve this? I'm having the same issue using LangChain callbacks.

@Laiaborrell
Author

Hello @Maximiliano-Villanueva, I didn't manage to solve it. I had to relaunch the hyperparameter search.... sorry about that. Hope that someone else can help, it would be helpful for any future issues like this.

@Michael-Tanzer

Michael-Tanzer commented Apr 17, 2024

@mihran113 Do you know if there is any update about this? I also see the run in meta/chunks and I am not able to delete the runs as they appear online on the UI. It looks like restarting the server fixes the issue, I hope this is a useful piece of information in fixing it! Would it be possible to perhaps add a "force delete" button to force deletion of running runs?

ETA: when restarting the server some runs will not be deleted as they are "locked"

@mihran113
Contributor

@Michael-Tanzer Can I ask you to share the logs from the aim up command when the error happens (when "not found" is thrown while trying to open the run)? Also, if you can share a scenario or example script in which this happens, so I can reproduce it on my end, that would be really helpful as well.

@Michael-Tanzer

I have now deleted the problematic runs by deleting the lock manually and then deleting from the UI. I will share a log as soon as it happens again.
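
The thread does not say where the lock lives, so the following is only a sketch for locating candidates before removing anything by hand. It assumes the lock is a file somewhere under the repo directory whose name contains the run hash and whose path mentions "lock"; the helper `find_lock_candidates` is hypothetical and deliberately deletes nothing:

```python
from pathlib import Path


def find_lock_candidates(repo_dir, run_hash):
    """List files under repo_dir whose name contains run_hash and whose relative path mentions 'lock'."""
    repo = Path(repo_dir)
    return sorted(
        p for p in repo.rglob(f"*{run_hash}*")
        if p.is_file() and "lock" in str(p.relative_to(repo)).lower()
    )


if __name__ == "__main__":
    # ".aim" and the hash are placeholders; inspect the output before deleting anything.
    for path in find_lock_candidates(".aim", "<run_hash>"):
        print(path)
```

Listing first and deleting manually keeps the step reversible: a lock that belongs to a run that is still training should be left alone.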

@mihran113
Contributor

Let me know when that happens again, as it's pretty hard to reproduce, but the error should tell a lot about what's happening and it would help a lot.

@mihran113
Contributor

Regarding the force delete, we'll consider implementing it for the next minor version: 3.20.0


4 participants