
Improving cache management #82

Open · ainar opened this issue Apr 28, 2023 · 3 comments

@ainar (Contributor) commented Apr 28, 2023

Currently, if a working directory is given, all cached results are dropped from memory after each stage execution.
We could speed up the overall execution of a pipeline by avoiding reloading results that were used or produced during the preceding stage.

This is how it would work:
We would use a single cache map across all stage executions. During each stage execution, this cache is filled with the dependency results and, at the end, with the stage's own result.
This is already how it works when no working directory is given, so technically we could reuse the same cache.
Then, at the beginning of each stage execution, we delete the cache entries that will not be used by any subsequent stage.
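The scheme above could be sketched roughly as follows. This is only an illustration of the idea, assuming a hypothetical `execute_stage` callback and a simple pickle-per-stage file layout; it is not the pipeline's actual API.

```python
# Hypothetical sketch of the proposed shared cache map; stage names,
# the execute_stage callback, and the ".p" file layout are illustrative.
import pickle
from pathlib import Path

def run_pipeline(stages, dependencies, working_directory, execute_stage):
    """stages: stage names in topological order.
    dependencies: dict mapping each stage to its upstream stage names.
    execute_stage(name, inputs): runs one stage and returns its result."""
    cache = {}  # shared across all stage executions

    for index, stage in enumerate(stages):
        # Evict entries that no remaining stage (including this one) needs.
        still_needed = {dep for later in stages[index:]
                        for dep in dependencies[later]}
        for name in list(cache):
            if name not in still_needed:
                del cache[name]

        # Load a dependency result from disk only if it is not in memory.
        for dep in dependencies[stage]:
            if dep not in cache:
                with open(Path(working_directory) / (dep + ".p"), "rb") as f:
                    cache[dep] = pickle.load(f)

        # Execute the stage and keep its result in memory for later stages.
        inputs = {dep: cache[dep] for dep in dependencies[stage]}
        cache[stage] = execute_stage(stage, inputs)

        # Still persist the result so the pipeline can restart from disk.
        with open(Path(working_directory) / (stage + ".p"), "wb") as f:
            pickle.dump(cache[stage], f)
```

With a topological execution order, each stage's dependencies were typically just produced, so they are still in the cache and the disk load is skipped entirely.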

This reduces the total execution time of the pipeline significantly, depending on the size of the cache files. This may be explained by the fact that a topological sort determines the execution order: each stage's result is most likely consumed by the next stage.

There is one major drawback: the cache is mutable, so any modification of a retrieved cache entry is passed on to the following stages that use the same entry (which does not happen currently, and which we generally do not want). I do not know how to prevent this, because efficiently detecting a modification of an object, or making it immutable, depends on the kind of object. A heavy-handed way would be to compare the serialization of each object before and after execution, but then we may lose most of the speedup, depending on the cost of serialization.
Let me know if you see any solution apart from warning the developer, or if you note any other drawbacks.
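For concreteness, the "heavy" serialization comparison could look like this sketch, which stores a hash of the pickled form rather than the full bytes. The function names are illustrative, not part of the pipeline:

```python
# Sketch of the mutation check discussed above: hash the pickled form of
# each cached input before and after a stage runs, and fail if it changed.
import hashlib
import pickle

def fingerprint(obj):
    """Digest of the pickled object; cheaper to keep than a full copy."""
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()

def run_stage_with_check(stage_function, inputs):
    """inputs: dict mapping dependency name -> cached object."""
    before = {name: fingerprint(obj) for name, obj in inputs.items()}
    result = stage_function(inputs)

    for name, obj in inputs.items():
        if fingerprint(obj) != before[name]:
            raise RuntimeError(
                "Stage mutated cached input '%s'; downstream stages "
                "would see the modified object." % name)
    return result
```

The check still pays two full serializations per input, which is exactly the cost concern raised above.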

Maybe we could make my suggestion an optional behavior?

I have already implemented it in my fork (diffs). If you agree with my suggestion, should I wait for #81 to be merged before opening a new PR? Or would you prefer to implement it yourself?

@sebhoerl (Contributor) commented

Yes, initially the results were actually kept in the cache, and you identified one of the reasons why this was changed:

  • There are actually many cases in the pipeline where a stage changes information in the data frame, so the cached information gets altered. This was the main reason why everything is written out and reloaded now; it was a decision of robustness over runtime.
  • Keeping everything in memory may make the pipeline crash for large use cases due to lack of memory, while the pipeline runs without a problem if results are loaded step by step.

That being said, nothing prevents us from making this optional.

I don't know if it would be possible to always send a deep copy of the objects to the stages as input. Probably still faster than reading the files, but I'm not sure if there is a generic way to do that?

Feel free to send a PR, we can merge it then. I'm about to test the current PR with some of the existing pipelines.

@ainar (Contributor, Author) commented Apr 28, 2023

> Keeping everything in memory may make the pipeline crash for large use cases due to lack of memory while the pipeline runs without a problem if they are loaded step by step.

This is why I suggest deleting the useless cache before each execution. As I work with tens of gigabytes of data, I/O times and memory use are critical for me!

> I don't know if it would be possible to always send a deep copy of the objects to the stages as input. Probably still faster than reading the files, but I'm not sure if there is a generic way to do that?

copy.deepcopy is the generic way to deep-copy any picklable object in Python. However, deep-copying would not only increase the runtime, it would also double the memory usage...
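To illustrate the trade-off being discussed: handing a stage a deep copy does protect the cache from mutation, at the cost of duplicating the object in memory. A minimal example with an invented toy payload:

```python
# Minimal illustration: mutations by a stage stay local to the deep copy,
# so the cached original is untouched. The data here is purely illustrative.
import copy

cache = {"population": [{"id": 1, "age": 30}]}

# Give the stage a deep copy instead of the cached object itself.
stage_input = copy.deepcopy(cache["population"])
stage_input[0]["age"] = 31  # the stage mutates its (copied) input

assert cache["population"][0]["age"] == 30  # cache is untouched
```

The copy exists alongside the original for the duration of the stage, which is where the doubled memory usage comes from.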
Making the current behavior the default, and making my suggestion an optional "fast" or "unsafe" behavior, seems like the best option to me. A middle-ground compromise may not be adequate.

@sebhoerl (Contributor) commented May 2, 2023

> Making the current behavior the default one, and making my suggestion an optional "fast" or "unsafe" behavior seems to be the best option to me. It seems that a compromise may not be adequate.

Yes, that's fine!
