
Updated _evaluate_pipelines to consolidate side effects #1410


Merged
christopherbunn merged 7 commits from 1295_`_evaluate`_changes into main on Dec 3, 2020

Conversation

christopherbunn
Contributor

Builds on #1337, but with fixes for parallel memory usage.

Resolves #1295

@dsherry
Contributor

dsherry commented Nov 5, 2020

@christopherbunn a couple ideas for things to try right off the bat:

  • I wonder what happens if you create a copy of this branch where you disable all the unit tests which run stacking.
  • Similarly, I wonder what happens if you disable all the automl unit tests. I'd expect that to produce no errors; if errors still show up, that could mean the problem isn't with automl, which would be unexpected.

@dsherry
Contributor

dsherry commented Nov 5, 2020

Could also try disabling the KeyboardInterrupt tests and see what that does

@christopherbunn force-pushed the 1295_`_evaluate`_changes branch 2 times, most recently from 29a0f0a to 22a7fe8 on November 12, 2020 18:26
@codecov

codecov bot commented Nov 12, 2020

Codecov Report

Merging #1410 (9624243) into main (3987ab7) will decrease coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #1410     +/-   ##
=========================================
- Coverage   100.0%   100.0%   -0.0%     
=========================================
  Files         223      223             
  Lines       15139    15135      -4     
=========================================
- Hits        15132    15128      -4     
  Misses          7        7             
Impacted Files Coverage Δ
evalml/automl/automl_search.py 99.7% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 100.0% <100.0%> (ø)
evalml/utils/logger.py 100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3987ab7...9624243.

@dsherry
Contributor

dsherry commented Nov 13, 2020

@christopherbunn and I spent a while this afternoon debugging the intermittent failures on this PR. It's still not 100% clear what's causing the problem here, but we have more evidence. For a tl;dr, see the next steps at the bottom. UPDATE: filed the memory issue on main as #1438.

Notes
We SSHed into the CircleCI workers and used mprof to record memory usage:

mprof run  --include-children pytest evalml/ -n 8 --doctest-modules --cov=evalml --junitxml=test-reports/junit.xml --doctest-continue-on-failure -v
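
(The plots below were rendered from these recordings; with memory_profiler that is presumably something along the lines of mprof plot -o mprof_4.png, though the exact invocation isn't shown here.)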

Side note: we noticed around 500 tests are getting marked as skipped; filed as #1437.

Here's what the memory usage over time for the unit tests looked like on main, specifically f0987455276eaa88e2ba73ada298eb68949167e1:
[mprof memory usage plot for main: mprof_4]

And here's what that looked like for @christopherbunn 's branch in this PR (rewound to commit 22a7fe8934b10b8b91f7e9f17852913bcacf5eeb to undo some debugging changes):
[mprof memory usage plot for the PR branch: mprof_3]

Observations

  • In these particular runs, main took ~250 sec to run, whereas the PR branch took ~500 sec.
  • Max memory usage on main climbed to 20GB (yikes!), and on the PR branch to 25GB. The CircleCI box listed total "used" plus "free" memory at around 30GB.
  • I ran a second time on each branch and got plots which are fairly similar to the results from the first time. So the pattern of memory usage appears to be similar across unit test runs.

Thoughts

  • The fact that the memory climbs up to 20GB on main is bad. That shouldn't be happening. I'm going to file that as a separate bug and we can look into it.
  • I still don't know why the problem seems to be exacerbated by the PR branch. It's hard to reason about that while there's a potential memory issue on main.
  • As @christopherbunn suggested, a short-term workaround could be to use the machine option on circleci to get a dedicated machine, with more memory. But fundamentally, we need to understand the memory usage problem on main.

Next steps

  • I'm going to continue looking into this for the time being.
  • Try to isolate the unit tests which resulted in the largest increases in memory usage
  • Verify that the memory usage measured doesn't include caching.
  • We noticed the paths in the fit and score mocking in test_max_batches_works may not be correct. Perhaps this is true elsewhere... should investigate (see the patch-target sketch after this list).
  • Try pytest-leaks
  • Look into "Tests getting skipped" (#1437), which may be related.
  • Review code in this PR once more with all this in mind.
  • Discuss further ✨
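
On the mock-path point above, here's a minimal sketch of the pitfall (the patch targets, return value, and surrounding setup are hypothetical, for illustration only, and not the actual test code): unittest.mock.patch has to target the path where the object is looked up when the test runs, so a patch aimed at the wrong path silently leaves the real fit/score in place and the test does far more work than intended.

from unittest.mock import patch

# Hypothetical patch targets -- the real test_max_batches_works may use different paths.
with patch("evalml.pipelines.RegressionPipeline.fit") as mock_fit, \
     patch("evalml.pipelines.RegressionPipeline.score", return_value={"R2": 0.3}) as mock_score:
    automl.search()          # assumed AutoMLSearch instance, construction omitted
    assert mock_fit.called   # fails if the patch target doesn't match the lookup path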

@dsherry
Contributor

dsherry commented Nov 18, 2020

@freddyaboulton @christopherbunn @rpeck and I have been looking into the memory issues in #1438 . Updates posted on that issue.

Next steps for this PR:

  • @christopherbunn will try migrating our unit test jobs to use the machine option in circleci, so we can provision a slightly larger box for our unit tests.
  • If that fails, we'll decrease the number of pytest workers from 8 to 4; on main, at least, we verified that this reduced overall memory consumption by ~40% (see the adjusted command after this list).
  • Once one of those solutions is merged to main, we'll update this PR, trigger the unit tests a few times and verify that there are no out-of-memory or other intermittent errors.
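
For reference, that fallback amounts to dropping the pytest-xdist worker count in the test command from -n 8 to -n 4 (other flags assumed unchanged from the run shown earlier):

pytest evalml/ -n 4 --doctest-modules --cov=evalml --junitxml=test-reports/junit.xml --doctest-continue-on-failure -v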

@christopherbunn force-pushed the 1295_`_evaluate`_changes branch 2 times, most recently from 49b81c8 to 817d522 on November 24, 2020 16:29
@christopherbunn christopherbunn marked this pull request as ready for review November 24, 2020 16:55
@christopherbunn force-pushed the 1295_`_evaluate`_changes branch 3 times, most recently from 7691f5d to c1b1720 on November 30, 2020 16:02
@christopherbunn
Contributor Author

christopherbunn commented Nov 30, 2020

After #1447 was merged into main, I was able to rerun the Python Linux tests three times, and they passed each time. All three runs should be grouped under the current checks.

[screenshot of the passing checks]

@christopherbunn force-pushed the 1295_`_evaluate`_changes branch 3 times, most recently from b9fdee7 to 7425573 on December 1, 2020 18:05

if search_iteration_plot:
    search_iteration_plot.update()
Contributor

@freddyaboulton Dec 1, 2020


We call update() here and in self._evaluate_pipelines. I don't think we need both? I don't think this has anything to do with the memory issues we've seen on this branch, but I think it'd be good to only update the plot once if possible.

Contributor Author


Originally I wanted to make sure that all of the results were plotted if the search was interrupted early. However, I just tested it, and the one instance in self._evaluate_pipelines was sufficient, so I'll remove the redundant one in search().
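
A rough sketch of the consolidated flow this lands on (the class and helper names below are placeholders, not the actual evalml internals): the plot update becomes one of the side effects handled inside _evaluate_pipelines, and search() no longer calls update() itself.

class AutoMLSearchSketch:
    """Illustrative only -- not the real evalml.automl.AutoMLSearch."""

    def _compute_cv_scores(self, pipeline):
        ...  # placeholder: run cross-validation and return scores for one pipeline

    def _record_results(self, results):
        ...  # placeholder: store scores on the search results object

    def _evaluate_pipelines(self, pipelines, search_iteration_plot=None):
        for pipeline in pipelines:
            results = self._compute_cv_scores(pipeline)
            self._record_results(results)
            if search_iteration_plot:
                search_iteration_plot.update()  # single update point

    def search(self, pipelines, search_iteration_plot=None):
        # search() delegates evaluation (and the plot update) to _evaluate_pipelines
        self._evaluate_pipelines(pipelines, search_iteration_plot)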

@christopherbunn force-pushed the 1295_`_evaluate`_changes branch from 7425573 to 284aad7 on December 1, 2020 22:22
Contributor

@freddyaboulton left a comment


Looks good to me @christopherbunn ! 😄

Contributor

@dsherry left a comment


@christopherbunn thanks for sticking with this! It took us down an interesting rabbit hole re: #1438. Let's get it merged!

@christopherbunn force-pushed the 1295_`_evaluate`_changes branch from 284aad7 to 9624243 on December 3, 2020 16:31
@christopherbunn merged commit d05134f into main on Dec 3, 2020
@freddyaboulton deleted the 1295_`_evaluate`_changes branch on May 13, 2022 15:17
Successfully merging this pull request may close: Refactor AutoML Search for Parallel Workers (#1295).