
ci: de-duplicate deps by docs/requirements.txt and tests/requirements.txt and update CI images #519

Merged: 10 commits merged into dask:main on Apr 16, 2022

Conversation

consideRatio
Collaborator

@consideRatio consideRatio commented Mar 27, 2022

Important

While this is a big PR, it is about CI — it only touches two setup.py files, which are directly related to end use. Overall, this is reducing complexity, but since I've added a lot of comments and documentation there are a lot of additional lines.



Summary


Intermittent failures

This PR initially introduced intermittent failures for the hadoop test, but they were resolved by restricting memory settings on the scheduler/workers started during tests. See #422 for details on pre-existing intermittent test failures, though.

@consideRatio consideRatio marked this pull request as draft March 27, 2022 21:29
@consideRatio consideRatio force-pushed the pr/de-duplicate-dependencies branch 3 times, most recently from 1b420ef to 07c076d Compare April 3, 2022 23:41
@consideRatio consideRatio changed the title maint: de-duplicate declared Python dependencies maint: de-duplicate deps by docs/requirements.txt and tests/requirements.txt and update CI images Apr 3, 2022
@consideRatio consideRatio marked this pull request as ready for review April 4, 2022 00:02
@consideRatio consideRatio force-pushed the pr/de-duplicate-dependencies branch 4 times, most recently from 763f2e4 to 188b488 Compare April 4, 2022 04:11
@consideRatio consideRatio marked this pull request as draft April 4, 2022 04:13
@martindurant
Member

> which was a major undertaking.

:)
I will look it over; not sure how long it will take or how authoritative I can be!

@consideRatio
Collaborator Author

consideRatio commented Apr 4, 2022

@martindurant I think I figured it out: it was a newline character in CLASSPATH, as set by skein after reading what `yarn classpath` reported, which made at least the last entry in the CLASSPATH (but not all entries) fail to function properly. Figuring this out took me a full day of effort ;D

I've updated the top comment with links to PRs etc to fix it.
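The failure mode is easy to reproduce in isolation. A minimal sketch (hypothetical paths, not code from this PR) of how a trailing newline in a colon-delimited CLASSPATH corrupts the final entry, and how sanitizing the entries fixes it:

```python
# Hypothetical reproduction of the CLASSPATH issue described above: a
# newline captured from `yarn classpath` output ends up on the last entry.
raw = "/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/*\n"

# A naive split keeps the trailing newline on the final entry, so that
# path no longer matches anything on disk.
entries = raw.split(":")
print(repr(entries[-1]))  # -> '/opt/hadoop/share/hadoop/common/*\n'

# Stripping each entry (and dropping empties) yields a usable CLASSPATH.
clean = ":".join(e.strip() for e in raw.split(":") if e.strip())
print(repr(clean))
# -> '/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/*'
```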

@consideRatio consideRatio marked this pull request as ready for review April 5, 2022 13:25
@consideRatio consideRatio force-pushed the pr/de-duplicate-dependencies branch 6 times, most recently from a20b908 to 20183f2 Compare April 5, 2022 16:26
@consideRatio
Collaborator Author

@martindurant @jacobtomlinson @jcrist this is ready for review now. I've updated the original PR description to be more succinct about the changes.

@consideRatio
Collaborator Author

@jcrist I've verified that the wheel with skein 0.8.2 that was recently released functions as it should as part of this PR!

@consideRatio consideRatio changed the title maint: de-duplicate deps by docs/requirements.txt and tests/requirements.txt and update CI images ci: de-duplicate deps by docs/requirements.txt and tests/requirements.txt and update CI images Apr 5, 2022
@consideRatio
Collaborator Author

@martindurant @jacobtomlinson @jcrist I'd love to get this merged and iterate from there; this includes too many changes as it is.

Next work item is to make the tests more reliable: the main tests, the PBS tests, and now also the hadoop tests. I don't know what makes them unreliable, besides a guess that it at least partially involves the OOM killer sending SIGKILL, causing exit code 137 for the PBS test, which then shows symptoms like "ValueError, 404 NOT FOUND" etc., which is what I also see in the main/hadoop tests.

Member

@martindurant martindurant left a comment


I have looked through much of the YARN stuff, and I have very little problem with anything; it all looks solid. Of course, there are loads of configs, so I could easily have missed something.

continuous_integration/docker/base/Dockerfile
continuous_integration/docker/base/Dockerfile
continuous_integration/docker/hadoop/_print_logs.sh
continuous_integration/docker/hadoop/_test.sh
@@ -45,24 +45,4 @@
<value>org.apache.hadoop.security.AuthenticationFilterInitializer</value>
</property>

<property>
<name>hadoop.http.authentication.type</name>
Member


You disabled HTTP auth altogether? I agree with that.

Collaborator Author


I saw it failing anyhow, so I figured removing it reduced complexity without downside.

Collaborator Author


I'm confused. This is the default value anyhow, so I don't think this changed much, but what does this config really do? I think it influences something that we may not be using.

    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>16</value>
    </property>

    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>4096</value>
    </property>
Member


You probably know, but YARN doesn't actually respect these as limits; they are only used for internal bookkeeping, so it doesn't matter whether the worker actually has this much memory available.

Collaborator Author


Ooh, that is critical knowledge. I wanted to stay within memory limits and constrain this to avoid crashing things during tests. I had seen a warning about possible "thrashing" of workers or similar.

Any ideas on how we can limit ourselves better?

Collaborator Author


Is this config also pointless?

    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>32</value>
    </property>

    <property>
        <name>yarn.resource-types.memory-mb.increment-allocation</name>
        <value>${yarn.scheduler.minimum-allocation-mb}</value>
    </property>

Member


I did get tripped up by this once, where the default total memory resource of the cluster was 4096 and the minimum allocation was 1024, so you would get no more than 4 containers no matter what the actual memory requirements of the requests were. The default minimum still seems to be 1024: https://hadoop.apache.org/docs/r3.0.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

I would make these numbers small, or else just wait and see if something ever goes wrong.
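The arithmetic behind that pitfall is small enough to sketch. A hypothetical helper (not code from this repo; numbers taken from the comment above) showing how a large minimum allocation silently caps the container count:

```python
# YARN rounds each container request up to a multiple of the minimum
# allocation, so a large minimum silently caps how many containers fit.
def max_containers(total_mb, minimum_allocation_mb, request_mb):
    # Round the request up to the next multiple of the minimum allocation.
    granted = -(-request_mb // minimum_allocation_mb) * minimum_allocation_mb
    return total_mb // granted

# With 4096 MB total and the default 1024 MB minimum, even tiny 128 MB
# requests yield at most 4 containers.
print(max_containers(4096, 1024, 128))  # -> 4

# With a 32 MB minimum, the same requests fit 32 containers.
print(max_containers(4096, 32, 128))  # -> 32
```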

Collaborator Author


@martindurant I found that in the test_yarn_backend.py file, the YarnBackend is configured with memory for the workers. I lowered it from 512 to 128, and things may have improved; not sure yet, given the intermittency =/

@consideRatio
Collaborator Author

consideRatio commented Apr 15, 2022

@martindurant status update:

This PR no longer introduces a regression in the amount of intermittent failures! By lowering the memory requirements, we resolved the regression seen previously.

So far the hadoop tests have succeeded 4/4 times since the memory limits were tightened, and the PBS test has succeeded 3/4 times since, the one failure being an error during pip install while setting up for testing within the container, due to memory issues. This is the previous intermittent error documented in #422.

@martindurant I consider this ready for merge at this point; if you agree, let's do it!

@consideRatio
Collaborator Author

@martindurant thanks for review efforts!!! 🎉 ❤️ - to know you were around giving it a look made a big difference on my motivation to get this done!

@martindurant
Member

I'm not sure when I'll have the time to look through again :| If you think things are in a good state, I trust you!

@consideRatio
Collaborator Author

consideRatio commented Apr 16, 2022

> I'm not sure when I'll have the time to look through again :| If you think things are in a good state, I trust you!

Thank you! I think this is good to go then, based on my understanding that @jcrist also thought these were acceptable changes on brief inspection, as discussed in the video meet we scheduled.

Btw, did you receive an email about this meet? I sent one to the address declared on your LinkedIn.

@consideRatio consideRatio merged commit e054a65 into dask:main Apr 16, 2022
Successfully merging this pull request may close these issues.

De-duplicate package dependencies
paywalling of cloudera repo