Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: add support for float16 metrics serialization #2915

Merged
merged 1 commit into from Sep 22, 2021

Conversation

BlueskyFR
Copy link
Contributor

@BlueskyFR BlueskyFR commented Sep 7, 2021

Description

Adds support for float16 metrics serialization

Test Plan

Commentary (optional)

When running an experiment using keras while having enable mixed precision using tf.keras.mixed_precision.set_global_policy("mixed_float16") (TF 2.6), the metrics are serialized before being sent by socket (to the master I guess?).
However, while float64 and float32 are supported, float16 is not one of them: this patch hopefully fixes it.

Checklist

  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

@cla-bot
Copy link

cla-bot bot commented Sep 7, 2021

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have the users @BlueskyFR on file. In order for us to review and merge your code, please start the CLA process at https://determined.ai/cla.

After we approve your CLA, we will update the contributors list (private) and comment @cla-bot[bot] check to rerun the check.

@netlify
Copy link

netlify bot commented Sep 7, 2021

✔️ Deploy Preview for determined-ui canceled.

🔨 Explore the source changes: f9a7e13

🔍 Inspect the deploy log: https://app.netlify.com/sites/determined-ui/deploys/613781fe03054100075a51a9

Copy link
Member

@rb-determined-ai rb-determined-ai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes look good, thank you for the fix!

@BlueskyFR
Copy link
Contributor Author

I was not yet able to test it locally though, so I am wondering if anything else will need to be changed too

@rb-determined-ai
Copy link
Member

Ok. If you are not able to test it manually by the time we work out the CLA details, I'll write a test before we merge this.

@BlueskyFR
Copy link
Contributor Author

BlueskyFR commented Sep 8, 2021

Ok. If you are not able to test it manually by the time we work out the CLA details, I'll write a test before we merge this.

Is there some build instructions somewhere so that I can test the fixed agent image locally?

@cla-bot
Copy link

cla-bot bot commented Sep 8, 2021

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have the users @BlueskyFR on file. In order for us to review and merge your code, please start the CLA process at https://determined.ai/cla.

After we approve your CLA, we will update the contributors list (private) and comment @cla-bot[bot] check to rerun the check.

@cla-bot
Copy link

cla-bot bot commented Sep 8, 2021

The cla-bot has been summoned, and re-checked this pull request!

@rb-determined-ai
Copy link
Member

Is there some build instructions somewhere so that I can test the fixed agent image locally?

Ah, I guess not. Actually our build system is a bit of a pain. You are welcome to clone the repo and build it make get-deps; make build, but you'll need working go environment, protoc, java, python, sphinx, node... it's a bit unwieldy.

On the other hand, for simple fixes like this you can hotpatch the python harness in your live cluster in a startup-hook.sh that you put in your model directory:

file="$(python -c 'import determined.util as x; print(x.__file__)')"
sed -i -e '/if isinstance(obj, np.float32):/a \            return float(obj)\n        if isinstance(obj, np.float16):'  "$file"

That will literally just modify the source of the installed python library in the container just before running your code.

@cla-bot[bot] check

We're still talking to legal about what exactly required in your case. At any rate, the next release is next week so there's not a huge rush here.

@determined-ai determined-ai deleted a comment from cla-bot bot Sep 8, 2021
@determined-ai determined-ai deleted a comment from cla-bot bot Sep 8, 2021
@BlueskyFR
Copy link
Contributor Author

BlueskyFR commented Sep 8, 2021

I'll run some tests tomorrow then :)
Once you get any feedback on the cla make sure to keep me informed ;)

I also tried to guess the build commands with the Makefiles but make get-deps failed on my environment. I will try the hot patching methods instead!

@rb-determined-ai
Copy link
Member

@cla-bot[bot] check

@cla-bot cla-bot bot added the cla-signed label Sep 13, 2021
@cla-bot
Copy link

cla-bot bot commented Sep 13, 2021

The cla-bot has been summoned, and re-checked this pull request!

@rb-determined-ai
Copy link
Member

@BlueskyFR we talked to legal and you are good to go!

@rb-determined-ai
Copy link
Member

rb-determined-ai commented Sep 14, 2021

@BlueskyFR have you had a chance to test this yet? In either case, please mark the PR as ready for review (so either you or I can test it and I can merge it)

@rb-determined-ai
Copy link
Member

Not sure where you went but I'll just land this.

@rb-determined-ai rb-determined-ai marked this pull request as ready for review September 22, 2021 02:30
@rb-determined-ai rb-determined-ai merged commit a0db48e into determined-ai:master Sep 22, 2021
@BlueskyFR
Copy link
Contributor Author

Not sure where you went but I'll just land this.

Hi @rb-determined-ai, sorry for this: I returned to school so I didn't see the Github notifications :(
Glad to see you merged it! :D

@BlueskyFR BlueskyFR deleted the patch-1 branch September 22, 2021 15:34
@dannysauer dannysauer modified the milestones: 0.0.102, 0.17.0 Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants