Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-9616] Add RegisterDoFn #12903

Merged
merged 4 commits into from
Sep 25, 2020
Merged

[BEAM-9616] Add RegisterDoFn #12903

merged 4 commits into from
Sep 25, 2020

Conversation

lostluck
Copy link
Contributor

@lostluck lostluck commented Sep 22, 2020

Adds a beam.RegisterDoFn call for users to call on their DoFns instead of individually calling RegisterFunction and RegisterType for all parameter types, using reflection.

These registrations are important to ensure workers can process and lookup types and functions correctly.
This PR also lets starcgen key off of RegisterDoFn in addition to the identifiers list.
Only transforms/top has been adjusted to demonstrate, and a full go generate ./... pass run on the SDK to demonstrate that the code generator works in both modes simultaneously. In time, we'll remove the identifiers approach and require RegisterDoFn as the only way to hint to the code generator.

The extra changes are go fmt changes missed from other PRs and are benign.

TODO in later PRs:

  • Make use of RegisterDoFn everywhere in the GoSDK and examples and document it as a best practice.

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Dataflow Flink Samza Spark Twister2
Go Build Status --- Build Status --- Build Status ---
Java Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status
Build Status
Build Status
Build Status
Python Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/)
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
--- Build Status ---
XLang Build Status --- Build Status --- Build Status ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website Whitespace Typescript
Non-portable Build Status Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status Build Status
Portable --- Build Status --- --- --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@lostluck
Copy link
Contributor Author

R: @youngoli @tysonjh

Copy link
Contributor

@youngoli youngoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I have a few small comments and questions, but nothing big enough that I feel is worth blocking on.

As a caveat, I was able to generally follow the starcgenx code, but I'm not very familiar with ast so I'm liable to miss small details if there are issues there.

}

var retFunc interface{}
rt := reflect.TypeOf(dofn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this just use the rt from near the top of the function? It's a bit confusing right now with it named the same as a different scoped variable in the same function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that rt is out of scope and it's only present if the user passed in a reflect.Type instance. It's scoped to the if statement. It's also nearly 30 lines up, which hampers readability a bit by having an extended distance between declaration and use. This is adjacent to the uses, and that if block at the start ensures that there's only one way to interpret the rt variable here. It's the type of the dofn.

Semantically, this is identical to calling all error values err. I'm always calling arbitrary reflect.Type instances rt.

if rt.Kind() == reflect.Ptr {
rt = rt.Elem()
}
dofn = reflect.New(rt).Interface()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for creating a new interface from a reflect type for dofn here? Is it a way to reset the dofn to it's default values by creating a new instance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The graph.NewFn validator requires an actual instance of the DoFn, since graph.NewFn was designed for pipeline construction. This line creates such an instance, if what is passed in is reflect.Type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, duh. I had a brain-fart that this statement only happens if the dofn is a reflect.Type. I was imagining turning an actual DoFn instance into a reflect.Type and back. Makes complete sense now.

}
sdf := (*graph.SplittableDoFn)(fn)
c.pullMethod(sdf.CreateInitialRestrictionFn())
c.pullMethod(sdf.CreateTrackerFn())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that while restrictions need to be serializable, restriction trackers don't (and probably can't). Registering RTrackers is probably not a problem as long as nothing actually tries to use the registered type, but otherwise you can probably avoid registering RTrackers by removing this line pulling CreateTrackerFn.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RTs would be registered anyway since they show up in the ProcessElement signature. It's OK for extra types to be registered, the framework just won't automatically synthesize them like it needs to for other elements.

@@ -72,7 +72,7 @@ type SymbolResolver interface {
// RegisterFunction allows function registration. It is beneficial for performance
// and is needed for functions -- such as custom coders -- serialized during unit
// tests, where the underlying symbol table is not available. It should be called
// in init() only. Returns the external key for the function.
// in `init()` only.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: Is removing the last sentence intentional? If so, should it also be removed in forward.go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch! It should be removed from forward.go! I removed it since this function doesn't return anything ever, so it was incorrect documentation.

@lostluck
Copy link
Contributor Author

Thanks for the review! I'm also not super adept with ast, so we're in the same boat. The short version is that it's breaking down all the go syntax things for us. I did cover a good number of reasonable cases that do work with RegisterDoFn, and we avoid the rest by making sure the documentation (the final documentation) and the examples are complete, and consistent, which will avoid bad cases that can't be handled.

@lostluck
Copy link
Contributor Author

I ended up adding the usage documentation, may as well start off on the right foot with the recommended boilerplate.

@lostluck lostluck merged commit 806e67d into apache:master Sep 25, 2020
@lostluck
Copy link
Contributor Author

lostluck commented Sep 26, 2020 via email

ibzib pushed a commit to ibzib/beam that referenced this pull request Sep 30, 2020
tysonjh pushed a commit to tysonjh/beam that referenced this pull request Sep 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants