[BEAM-11808][BEAM-9879] Support aggregate functions with two arguments #16200

Merged 14 commits on Jan 18, 2022

benWize
Contributor

@benWize benWize commented Dec 10, 2021

Working on related issues

[BEAM-11808] Support aggregate functions with two arguments
[BEAM-9879] Support STRING_AGG in BeamSQL
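
As an illustration of what a two-argument aggregate involves, here is a minimal sketch of STRING_AGG-style semantics in plain Java. It is not the code from this PR: the class and method names are hypothetical, and the NULL handling (skip NULL inputs, return NULL when nothing remains) is assumed from the usual ZetaSQL behaviour. The first argument varies per row, while the second (the delimiter) is a single constant for the whole aggregation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

/** Hypothetical illustration only; this is not Beam's STRING_AGG implementation. */
class StringAggSketch {

  /**
   * Joins the non-null values with the given delimiter.
   * Returns null when no non-null value is present (assumed semantics).
   */
  static String stringAgg(List<String> values, String delimiter) {
    List<String> nonNull =
        values.stream().filter(Objects::nonNull).collect(Collectors.toList());
    if (nonNull.isEmpty()) {
      return null;
    }
    return String.join(delimiter, nonNull);
  }

  public static void main(String[] args) {
    // The per-row values are the first argument; the delimiter is the second.
    System.out.println(stringAgg(Arrays.asList("a", null, "b"), "|"));
  }
}
```

Because the delimiter does not change from row to row, a translator can read it once at planning time, which is why the kind of expression used for the second argument matters later in this review.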

@benWize benWize changed the title from "[WIP][BEAM-11808][BEAM-9879] Support aggregate functions with two arguments" to "[BEAM-11808][BEAM-9879] Support aggregate functions with two arguments" on Dec 15, 2021
@benWize benWize marked this pull request as ready for review December 15, 2021 21:17
@benWize
Contributor Author

benWize commented Dec 28, 2021

R: @apilloud. Could you help me to review this?

@ibzib
Contributor

ibzib commented Dec 28, 2021

> R: @apilloud. Could you help me to review this?

@benWize Sorry! I reviewed this, but I forgot to press send.

@benWize
Contributor Author

benWize commented Dec 28, 2021

> > R: @apilloud. Could you help me to review this?
>
> @benWize Sorry! I reviewed this, but I forgot to press send.

Thank you @ibzib, no worries.

@benWize benWize requested a review from ibzib January 3, 2022 16:39
@ibzib
Contributor

ibzib commented Jan 7, 2022

Apologies - I've been a bit busy this week but I'll give this another review next week.

Contributor

@ibzib ibzib left a comment

I ran this through ZetaSQL compliance tests (which we still haven't set up to run directly on the Beam repo) and there were a couple errors. Can you please try adding these tests to Beam and see what the issues are?

array_aggregation_test:array_agg_constant_value

SELECT ARRAY_AGG(1) b FROM UNNEST([1, 2, 3]) a

aggregation_queries_test:timestamp_NULL_min_max

SELECT MAX(CAST(NULL AS TIMESTAMP)) AS max_NULL,
       MIN(CAST(NULL AS TIMESTAMP)) AS min_NULL
FROM (SELECT 1)

    || expr.nodeKind() == RESOLVED_GET_STRUCT_FIELD) {
  ZetaSQLResolvedNodeKind.ResolvedNodeKind resolvedNodeKind =
      expr.getArgumentList().get(i).nodeKind();
  if (i == 0 && resolvedNodeKinds.contains(resolvedNodeKind)) {

Contributor

What about the case of a literal as the first argument? If I'm understanding the logic right, it doesn't throw an error even though it should (instead it continues).

Contributor Author

I added both test cases and also changed the condition to throw an exception when a literal is passed as the first argument. Both tests fail in this case: https://ci-beam.apache.org/job/beam_PreCommit_SQL_Commit/4616/.
The tests also failed before I changed the condition, but with a different exception.

Contributor

What was the other exception? Also, do you think it would be possible to support literals as the first argument? I'm not sure why that limitation exists.

Contributor Author

Both are IndexOutOfBounds errors raised while inferring types: https://ci-beam.apache.org/job/beam_PreCommit_SQL_Commit/4618/
I think it would be possible to support literals as the first argument, but we would need to verify every function and case.

Contributor

@ibzib ibzib Jan 12, 2022

Alright, for now let's do this:

  • Throw an UnsupportedOperationException when literals are passed as the first argument. In other words, keep the current behavior (prior to this PR).
  • Remove the new array_agg and timestamp test cases.
  • We can add support for literals later if we think it's important. I filed a separate JIRA: https://issues.apache.org/jira/browse/BEAM-13648
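
The agreed behaviour can be sketched roughly as follows. This is a self-contained illustration, not the PR's code: `NodeKind` is a hypothetical stand-in for ZetaSQL's resolved node kinds, `validate` is an invented helper, and the sketch only covers the one rule discussed in this thread (reject a literal in the first argument position).

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the argument check discussed above; not the PR's code. */
class AggregateArgCheckSketch {

  /** Stand-in for the resolved node kinds of the arguments (assumed names). */
  enum NodeKind { COLUMN_REF, GET_STRUCT_FIELD, LITERAL, PARAMETER }

  /** Throws if the first argument of the aggregate call is a literal. */
  static void validate(String function, List<NodeKind> argumentKinds) {
    for (int i = 0; i < argumentKinds.size(); i++) {
      NodeKind kind = argumentKinds.get(i);
      if (i == 0 && kind == NodeKind.LITERAL) {
        // Fail explicitly instead of silently continuing to the next argument.
        throw new UnsupportedOperationException(
            function + " with a literal as the first argument is not supported (BEAM-13648).");
      }
    }
  }

  public static void main(String[] args) {
    // A column reference followed by a literal delimiter passes the check.
    validate("STRING_AGG", Arrays.asList(NodeKind.COLUMN_REF, NodeKind.LITERAL));
    try {
      // Corresponds to a query such as SELECT ARRAY_AGG(1) ...
      validate("ARRAY_AGG", Arrays.asList(NodeKind.LITERAL));
    } catch (UnsupportedOperationException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Raising the exception inside the loop, rather than skipping the argument, is what makes a query like `SELECT ARRAY_AGG(1) ...` fail with a clear message instead of a later error during type inference.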

Contributor Author

Done!

@codecov

codecov bot commented Jan 11, 2022

Codecov Report

Merging #16200 (3d92b4f) into master (0a220a1) will increase coverage by 8.98%.
The diff coverage is n/a.

❗ Current head 3d92b4f differs from pull request most recent head 1b69b20. Consider uploading reports for the commit 1b69b20 to get more accurate results

@@            Coverage Diff             @@
##           master   #16200      +/-   ##
==========================================
+ Coverage   74.60%   83.59%   +8.98%     
==========================================
  Files         653      452     -201     
  Lines       81798    62001   -19797     
==========================================
- Hits        61029    51830    -9199     
+ Misses      19788    10171    -9617     
+ Partials      981        0     -981     
Impacted Files Coverage Δ
sdks/python/apache_beam/utils/interactive_utils.py 87.80% <0.00%> (-2.44%) ⬇️
sdks/python/apache_beam/dataframe/transforms.py 94.92% <0.00%> (-0.34%) ⬇️
sdks/python/apache_beam/runners/common.py 89.98% <0.00%> (-0.15%) ⬇️
...eam/runners/portability/fn_api_runner/fn_runner.py 90.80% <0.00%> (-0.11%) ⬇️
sdks/python/apache_beam/transforms/util.py 95.86% <0.00%> (-0.11%) ⬇️
sdks/python/apache_beam/metrics/metric.py 95.38% <0.00%> (ø)
sdks/python/apache_beam/dataframe/expressions.py 92.90% <0.00%> (ø)
sdks/go/pkg/beam/core/graph/scope.go
sdks/go/pkg/beam/core/runtime/exec/hash.go
sdks/go/pkg/beam/transforms/stats/sum_switch.go
... and 201 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ibzib
Contributor

ibzib commented Jan 12, 2022

I tried compliance testing from your second-most recent commit (5ea6102266e0f477e8cc4b9f0efb2831df72610b) and there's at least one more failing test. I think we should allow ResolvedParameter as the 2nd+ argument.

Name: aggregation_queries_test:aggregation_string_agg_error_1_null_string_param

  SELECT string_agg("s", @separator) FROM (SELECT 1)

Parameters:
@separator = String(NULL)

java.lang.ClassCastException: class com.google.zetasql.resolvedast.ResolvedNodes$ResolvedParameter cannot be cast to class com.google.zetasql.resolvedast.ResolvedNodes$ResolvedLiteral (com.google.zetasql.resolvedast.ResolvedNodes$ResolvedParameter and com.google.zetasql.resolvedast.ResolvedNodes$ResolvedLiteral are in unnamed module of loader 'app')
	at org.apache.beam.sdk.extensions.sql.zetasql.translation.SqlOperators.createStringAggOperator(SqlOperators.java:186)
	at org.apache.beam.sdk.extensions.sql.zetasql.translation.SqlOperatorMappingTable.create(SqlOperatorMappingTable.java:124)
	at org.apache.beam.sdk.extensions.sql.zetasql.translation.AggregateScanConverter.convertAggCall(AggregateScanConverter.java:238)
	at org.apache.beam.sdk.extensions.sql.zetasql.translation.AggregateScanConverter.convert(AggregateScanConverter.java:99)
	at org.apache.beam.sdk.extensions.sql.zetasql.translation.AggregateScanConverter.convert(AggregateScanConverter.java:55)
	at org.apache.beam.sdk.extensions.sql.zetasql.translation.QueryStatementConverter.convertNode(QueryStatementConverter.java:102)

@benWize
Contributor Author

benWize commented Jan 14, 2022

I tried adding a new test with the query SELECT string_agg("s", @separator) FROM (SELECT 1), and it failed at:

https://github.com/apache/beam/blob/f45b17242ef125e1ac0f303cc231e9bab614c207/sdks/java/extensions/sql/zetasql/src/main/java/org/apache/beam/sdk/extensions/sql/zetasql/translation/SqlOperators.java#L186

I tried adding a new case for ResolvedParameter, but I couldn't get the delimiter value there because it is undefined at that point. I'm not sure how I should handle that case.

@ibzib
Contributor

ibzib commented Jan 14, 2022

Let's leave parameters unsolved for now then, I think the compliance tests will be okay as long as we throw an UnsupportedOperationException instead of a ClassCastException. Would you mind also creating a jira so we can link it in the unsupported exception?
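
A minimal sketch of that idea, assuming simplified stand-in types: check the runtime type of the delimiter expression before casting, so an unsupported kind fails with a descriptive exception rather than a ClassCastException. `Expr`, `Literal`, `Parameter` and `delimiterOf` are hypothetical names, not ZetaSQL or Beam classes, and the exception type shown is the one proposed at this point in the thread; the review notes below that a different type was used in the end.

```java
/** Hypothetical sketch; Expr, Literal and Parameter are stand-ins, not ZetaSQL classes. */
class DelimiterSketch {

  interface Expr {}

  /** A constant whose value is known when the query is planned. */
  static final class Literal implements Expr {
    final String value;

    Literal(String value) {
      this.value = value;
    }
  }

  /** A named query parameter such as @separator; no value is available at planning time. */
  static final class Parameter implements Expr {
    final String name;

    Parameter(String name) {
      this.name = name;
    }
  }

  /** Returns the delimiter if it is a literal; otherwise fails with a descriptive error. */
  static String delimiterOf(Expr secondArgument) {
    if (secondArgument instanceof Literal) {
      return ((Literal) secondArgument).value;
    }
    // Checking the type first avoids a ClassCastException on non-literal arguments.
    throw new UnsupportedOperationException(
        "STRING_AGG delimiter must be a literal; got "
            + secondArgument.getClass().getSimpleName());
  }

  public static void main(String[] args) {
    System.out.println(delimiterOf(new Literal(",")));
    try {
      delimiterOf(new Parameter("separator"));
    } catch (UnsupportedOperationException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```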

@benWize
Copy link
Contributor Author

benWize commented Jan 17, 2022

> Let's leave parameters unsolved for now then, I think the compliance tests will be okay as long as we throw an UnsupportedOperationException instead of a ClassCastException. Would you mind also creating a jira so we can link it in the unsupported exception?

OK, it now throws UnsupportedOperationException, and I filed https://issues.apache.org/jira/browse/BEAM-13673 to link it.

Contributor

@ibzib ibzib left a comment

LGTM

FYI - I was wrong that UnsupportedOperationException was the right one to throw for parameters, so I had to change the exception to a different type to satisfy the compliance tests. But everything passes now.

@ibzib ibzib merged commit 77e924f into apache:master Jan 18, 2022
@benWize
Contributor Author

benWize commented Jan 18, 2022

Thank you @ibzib!

kennknowles added a commit to kennknowles/beam that referenced this pull request Jan 21, 2022