RDST parallelism KeyError #24

During some, but not all, runs (e.g. on the FordA / FordB datasets), the RDST Ensemble classifier fails with the following error dump:

Comments
This issue is caused by a problem with dependency and/or package versions.
I am having this issue for most of my datasets even with no threading; deleting the numba cache seems to fix it for a few runs, but it breaks again once the cache files are rewritten. I also hit another, similar numba error after removing the numba parallel option in an attempt to fix the first one, so that one may be completely on me.
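(For reference, the cache-deletion workaround mentioned above can be scripted. Numba stores its on-disk cache as `.nbi` index and `.nbc` data files next to the compiled modules, typically inside `__pycache__`. A minimal sketch, assuming the package lives in a writable location; the helper name is made up, it is not part of convst or numba:)

```python
import pathlib

def clear_numba_cache(package_root: str) -> None:
    """Delete numba's on-disk cache files (.nbi index / .nbc data) under a package tree."""
    for pattern in ("*.nbi", "*.nbc"):
        for cache_file in pathlib.Path(package_root).rglob(pattern):
            cache_file.unlink()

# Example: clear the cache for an installed convst package.
# import convst, os
# clear_numba_cache(os.path.dirname(convst.__file__))
```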
Do you still get the KeyError even with a new environment? Creating a new one seems to have fixed the issue for me.
@MatthewMiddlehurst I cannot reproduce the issue, for either the KeyError or the LoweringError, on my end, with or without the parallel keyword and/or n_jobs > 1, on the following machines:

System:
Python dependencies:

and this one (on which the experiments are run):

System:
Python dependencies:

Could you please provide example code with the version you are currently using?
There is indeed something off with numba, see #34. Does the problem happen on your end with the non-Ensemble version?
I am running it on our computing cluster, so the setup may be a bit odd. I am just running the Ridge version using a simple wrapper for the sktime interface. There are more dependencies installed than the ones listed above, but none of the other sktime numba code seems to have an issue in the same environment.

The workflow runs many individual jobs over many distributed cores, so it may not be typical. Again, the first few runs seemed fine, but once the functions are cached, the errors start to appear.
I see. Since no errors are thrown when the tests run on Python 3.7, I doubt it would change anything, but could you by any chance try running it on Python 3.8+ or using pickle5? (see https://numba.readthedocs.io/en/stable/developer/caching.html) From what I understand, this may affect how caching is handled.

Nevertheless, there is definitely something wrong with the Ensemble version, even on Python 3.11: #34 shows a high standard deviation in the timings for RDST Ensemble, which could indicate that the functions are being compiled and cached again after other models have run. I suspect it has something to do with the combination of multiple joblib processes using numba parallel, although I followed the instructions from https://numba.readthedocs.io/en/stable/user/threading-layer.html#example-of-limiting-the-number-of-threads. The fact that the work is spread over multiple machines may also be part of the issue; I have never tested it in that context.

This will require a bit of time to fix, I'm afraid. If it helps, I can provide results generated on my end.
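(For context, the pattern from the numba threading-layer documentation linked above looks roughly like this. A minimal sketch, not convst code; the kernel and worker functions are illustrative, and it assumes n_jobs times threads_per_worker stays at or below the machine's core count:)

```python
import numba
import numpy as np
from joblib import Parallel, delayed

@numba.njit(parallel=True)
def heavy_kernel(x):
    # Toy numba-parallel reduction, standing in for a shapelet transform.
    total = 0.0
    for i in numba.prange(x.shape[0]):
        total += x[i] ** 2
    return total

def run_one(x, threads_per_worker):
    # Cap numba's thread pool inside each joblib worker so that
    # n_jobs * threads_per_worker does not oversubscribe the machine.
    numba.set_num_threads(threads_per_worker)
    return heavy_kernel(x)

data = [np.random.rand(10_000) for _ in range(8)]
results = Parallel(n_jobs=4)(delayed(run_one)(x, 2) for x in data)
```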
Yeah, no issue. I can try giving it a run with more datasets on my own machine, and with your suggestions on the cluster. Currently, I am just running it without numba on the cluster (I won't report any timing results from these runs, as that would be unfair).
Thanks. If you do find any strange differences in the accuracy results, I would also appreciate feedback. I will post updates on this issue when I find the source of the problem.
The issue may be caused by what is described as cache invalidation, due to the loading of numba functions from the
@MatthewMiddlehurst Version 0.2.5 fixed the issues I was seeing on my side, at the cost of ~20% more run-time for RDST Ensemble for now, until I learn how to properly manage a thread pool with numba and joblib threads. Hopefully that also fixes the issues on your side; I would appreciate an update when you have some time.
Just a quick update: I ran my setup with Python 3.10 and the newest update and still had similar errors. It could possibly be a conflict with another dependency, as I am running it through a larger package. This is probably HPC-specific, as I am running >200 builds at once, but it is odd that I haven't seen this with any of my other numba stuff, which also caches.

There were no issues running after I hacked it to remove caching from the transform parts. I am unsure how this impacted performance, but it still seemed to finish everything rather quickly (much faster than no numba, at least).

As a temporary fix, maybe allow setting a global variable to disable caching on these functions? I have not tried it, so it may not be possible, but I don't see why it wouldn't be 🙂.
Thanks for the update! I indeed suspect it has something to do with the HPC cluster, as I cannot reproduce anything on my end, but it is indeed a bit worrying that it only happens with this numba code... I think the global-variable approach is the best one; if it is feasible, I'll look into it and close this issue when it's done. I would be curious to know why the problem happens, but I don't have an HPC cluster available as of now :/ Will update if I can manage to get to the bottom of it.
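(For anyone following along, one way such a global switch could be implemented is sketched below. This is not convst's actual source; the flag and wrapper names are made up for illustration. The idea is to route every decoration through a small wrapper that reads a module-level flag:)

```python
import numba

USE_NUMBA_CACHE = True  # hypothetical flag: flip to False on clusters where caching misbehaves

def maybe_cached_njit(*args, **kwargs):
    # Forward to numba.njit, injecting the module-level caching flag
    # unless the caller explicitly overrides it.
    kwargs.setdefault("cache", USE_NUMBA_CACHE)
    return numba.njit(*args, **kwargs)

@maybe_cached_njit(fastmath=True)
def squared_sum(x):
    s = 0.0
    for v in x:
        s += v * v
    return s
```

Note that the flag is read at decoration time, i.e. when the decorated module is imported, so it would have to be set before the numba functions are first imported.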
Hi Matthew, sorry for the delay, I have been quite busy with the postdoc. I have tried some different solutions, and the one that worked with minimal complexity was to add variables, defined in the package's __init__.py, that control the numba options.

You can see an example of how to do this here: https://github.com/baraline/convst/blob/main/examples/Changing_numba_options.py

Alternatively, if this does not work for your setup, you can simply modify the value of the parameter in the __init__.py file to affect the entire module, which is still a hack, but does not require you to change the value across all the files. Hope this fixes the issue on your side!
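(A hypothetical usage sketch based on the linked example; the flag names below are placeholders, check Changing_numba_options.py for the actual ones:)

```python
import convst

# The options are read when the numba-decorated modules are first imported,
# so set them before importing any transformer or classifier.
convst.__USE_NUMBA_CACHE__ = False    # placeholder name: disable on-disk caching
convst.__USE_NUMBA_PARALLEL__ = True  # placeholder name: keep numba parallel loops

from convst.classifiers import R_DST_Ridge  # import only after setting the options

clf = R_DST_Ridge(n_jobs=1)
```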
Thanks for looking into it @baraline! We managed to run everything using numba without the caching, and it all looks faithful to the reported results. Very impressive work! The changes should make future runs easier.
Thanks for the feedback @MatthewMiddlehurst! Don't hesitate to re-open this issue if the fix does not work. Additionally, if you have run RDST/RDST Ensemble on multivariate datasets, note that I fixed a bug in version 0.2.7 that caused some shapelets to not be generated correctly when n_jobs > 1, which improved performance on some multivariate datasets on my side (see #44).