Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: this refactors execution to properly use the process pool #352

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

zcstarr
Copy link
Contributor

@zcstarr zcstarr commented Mar 15, 2024

Prior to this change the process pool was created and destroyed for every configuration, this would cause the cpu/memory to thrash and improperly allocate task to avaialble cpu, resulting in sometimes 25% utilization of available cpu resources

The change corrects this, as well as tackles memory problems, by writing temporary results to disk and then reading them back at the end of the simulation.

This is non configurable in this commit, and can also result in loading too much memory, as it does not include the ability to progressively or lazy load data into the final dataframe to complete the simulation.

This commit is a WIP fixes #351

Copy link
Member

@danlessa danlessa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change does seem reasonable, although I would prefer to make it configurable as our test coverage is not complete as of now, and I would err on the side of not modifying the default runtime behavior, as this improvement has additional execution scope attached to it (it reads and writes temporary files)

cadCAD/engine/execution.py Show resolved Hide resolved
Prior to this change the process pool was created and destroyed
for every configuration, this would cause the cpu/memory to
thrash and improperly allocate task to avaialble cpu, resulting
in sometimes 25% utilization of available cpu resources

The change corrects this, as well as tackles memory problems,
by writing temporary results to disk and then reading them back
at the end of the simulation.

This is non configurable in this commit, and can also result in
loading too much memory, as it does not include the ability to
progressively or lazy load data into the final dataframe to
complete the simulation.

This commit is a WIP fixes cadCAD-org#351
This feature refactors the parallel processing to use lazy evaluation
When simulation data grows, keeping the data in memory causes
memory bloat, which causes too Ram to be consumed.

The refactor uses lazy_flattens to keep the memory Python consumes
low, the change was able to reduce consumption by a factor of 10
in many cases.

The change also writes temporary files to disk before lazily
rearranging the simulation results to fit the cadCAD expected format
for a regular pandas dataframe.
@zcstarr
Copy link
Contributor Author

zcstarr commented Mar 21, 2024

@danlessa PTAL I included the memory profiler and a test headless script to allow you to play with different sizing, the initial results are quite good for lazy evaluation, so good that I don't think it'll be necessary to write results to disk. Or rather that can be another enhancement down the line. In the issue i'll redetail it for people.

The main thing I believe is the datastructure of the results being returned from the simulation in memory. The nested arrays, seem to just eat it. I created a simple ~200kb simulation and addid a memory profile on execute, in what exists now in parallel mode it consumes 1gb of memory to to return back a result, with lazy evaluation its down to 117mb from a starting postion of 115mb. I stripped down the simulation to be able to get an assessment of the impact.

I will rm the profiler and resubmit if things generally look good, I just wanted people to be able to check out the commit and play with it.

The second thing here was that there are alot of checks for maybe "odd" or deprecated configs, my thinking was that this feature doesn't need to support everything that cadcad ever supported, so I simply rm'd those checks. If those have significant meaning or use case I can add those back, my thinking was that it was probably support for older configurations that were in use.

Let me know what you think about this direction, and I will rm the memory part and keep on iterating. Additional results in the issue

@zcstarr
Copy link
Contributor Author

zcstarr commented Mar 25, 2024

Wanted to add a clarifying comment here, when back filled with the df it expands to 846mb from 860 pre eval, but the lazy evaluation gives people a potential option to write to disk if they feel like it , it might be worth while to have easy run return a lazy object as well, to allow people to save the data without writing it all to disk or to use another system to store it for analysis like https://github.com/vaexio/vaex without hitting the memory limitations of applying transformations all at once

Copy link
Member

@danlessa danlessa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, still, we need the remove the profiler code

@danlessa danlessa added the enhancement New feature or request label Apr 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: cadCAD underutilizes CPU and opens too many filehandles
2 participants