-
Notifications
You must be signed in to change notification settings - Fork 4.5k
[BEAM-14112] Avoid storing a generator in _CustomBigQuerySource #17153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Generators cannot be pickled. The previous implementation causes errors when used with the Python interactive runner.
|
Run Python 3.8 PostCommit |
|
Run PythonLint PreCommit |
|
|
||
| @schema.setter | ||
| def schema(self, value): | ||
| self._schema = json.dumps(bigquery_tools.table_schema_to_dict(value)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any guarantees that json.dumps(bigquery_tools.table_schema_to_dict(x)) and bigquery_tools.parse_table_schema_from_json(x) are reciprocal functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation is around https://cloud.google.com/bigquery/docs/schemas#creating_a_json_schema_file, so they do seem to be reciprocal functions. Need @chamikaramj to double check this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bigquery_tools.parse_table_schema_from_json is just an implementation detail of BQ connector and can change in the future so I would not assume that. "json.dumps" and "json.loads" are guaranteed to be reciprocal so I would use that instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think json.dumps/json.loads can't be used directly since TableSchema is not a dict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the types are also different ? Input is a bigquery.TableSchema object and output is JSON dict. I would implement new functions here without depending on existing functions and add extensive testing to make sure that this is valid/correct for various bigquery.TableSchema objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that sounds like the right approach. I will likely try to extract _convert_to_tuple from _JsonToDictCoder and build on top of that. I won't be able to get to it before the release cut today though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a higher level point though. I think all state of a source object being packable (including temporary state) is not a guarantee we provide and it will be very difficult to guarantee this behavior in the future for BQ and other sources. Also that can make source implementations more restrictive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to Cham that the behavior of pickling a PTransform is not guaranteed. The workaround is limited by the current InteractiveRunner implementation.
@chunyang I left a comment in the PR that explores saving the coder and paths as attributes instead of the TableSchema and matadata_list. Could you please check if that is an appropriate approach?
|
R: @kevingg |
Codecov Report
@@ Coverage Diff @@
## master #17153 +/- ##
=======================================
Coverage 73.95% 73.95%
=======================================
Files 671 671
Lines 88245 88257 +12
=======================================
+ Hits 65264 65273 +9
- Misses 21869 21872 +3
Partials 1112 1112
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
|
Python integration tests passed in postcommit AFAICT, but there was a failure in |
Could you plz share information about the failure? The PR looks nice, IIUIC: similar to the reverted one but export the files before the temp dataset gets cleaned up. The reason to add the intermediate JSON conversion is because InteractiveRunner needs to pickle the PTransform while the table schema nor a generator are pickle-able as an attribute. |
I don't know much other than what the console output is showing. Something about failing to load/store cache entries. |
|
Just to explore some other options other than converting schema to JSON, prior to #15610, there was no generator or TableSchema to cause pickling errors. Instead of storing the generator as an instance attribute, a list of TextSource and notably its Perhaps I can use the same |
I doubt this Python code change would affect the target :runners:flink:1.14:job-server:shadowJar, should be unrelated. |
How about we store the coders directly in the
So your @dataclass
class _BigQueryExportResult:
coder: beam.coders.Coder
paths: List[str]And export_result = __BigQueryExportResult(coder=self.coder(table_schema), paths=[metadata.path for metadata in metadata_list]) |
|
@chunyang, thanks for contributing to Beam and provide quality solutions to the issue! Is this PR still in active development? |
|
Hi Ning, yes I plan to get back to this, just have been swamped with some
other things at the moment. Feel free to piggyback off of the work if you
have the bandwidth.
…On Tue, Apr 5, 2022, 11:10 AM Ning Kang ***@***.***> wrote:
@chunyang <https://github.com/chunyang>, thanks for contributing to Beam
and provide quality solutions to the issue! Is this PR still in active
development?
—
Reply to this email directly, view it on GitHub
<#17153 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADPAHHCBGOZCWCRKVVMSGDVDR6XRANCNFSM5RL2RCVQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
|
Hi chunyang, I thought we had already fixed this in the other PR #17306. We can close this PR permanently if this is no longer an issue. |
|
Woah strange, there's some email necromancy going on--I sent that last reply in April but for some reason Github shows it's from 9 hours ago. Anyway, we can close this as it's fixed in #17306 as you mentioned. Thanks! |
Second attempt to avoid storing generator in _CustomBigQuerySource. See description from last attempt (#17100) below.
This time, export table schema is retrieved before the temp dataset is deleted. Because TableSchema is not picklable (see #17100 (comment)), I convert it into a JSON string before storing it as an instance variable.
Avoid storing a generator in _CustomBigQuerySource so that it can be used with the Python interactive runner. See details in https://issues.apache.org/jira/browse/BEAM-14112 .
Likely related to #15610
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username).[BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replaceBEAM-XXXwith the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.