ARROW-17631: [Java] Propagate table/columns comments into Arrow Schema #14081
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the openness of the Apache Arrow project. Could you then also rename the pull request title in the following format?
java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/Constants.java
JdbcToArrowConfig config = new JdbcToArrowConfigBuilder()
    .setAllocator(new RootAllocator()).setSchemaComment(tableComment)
    .setColumnCommentByColumnIndex(columnCommentByColumnIndex).setIncludeMetadata(includeMetadata).build();
return JdbcToArrowUtils.jdbcToArrowSchema(resultSetMetaData, config);
Ah, for the other metadata, this method automatically extracts the metadata values and adds them here. But for REMARKS, it's non-trivial to extract it from the result set so it probably shouldn't be done automatically.
At this point, I wonder if the API shouldn't just be: allow specifying extra metadata to attach to the schema and to each column, instead of special casing a single metadata value?
Good idea, @lidavidm!
Do you mean changing the new JdbcToArrowConfig fields to:
private final Map<String, String> schemaMetadata;
private final Map<Integer, Map<String, String>> columnMetadataByColumnIndex;
so the user can provide any additional metadata at the schema/column level? To me such an approach is very rational: it allows propagating additional metadata, for example an SRID related to geometry / Spatial Reference Systems.
It also allows removing the hard-coded name String COMMENT = "comment" from the source and moving the naming responsibility to the developer side.
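A minimal sketch of how such generalized metadata maps could be populated, using only plain Java collections. The method names and values below are hypothetical illustrations of the proposed field shapes, not Arrow API:

```java
import java.util.HashMap;
import java.util.Map;

public class MetadataSketch {
    // Hypothetical schema-level metadata map: arbitrary key/value pairs,
    // not just a special-cased "comment".
    static Map<String, String> schemaMetadata() {
        Map<String, String> m = new HashMap<>();
        m.put("comment", "Example table comment");
        return m;
    }

    // Hypothetical per-column metadata, keyed by 1-based JDBC column index.
    static Map<Integer, Map<String, String>> columnMetadataByColumnIndex() {
        Map<String, String> geometryMeta = new HashMap<>();
        geometryMeta.put("comment", "Geometry of the record");
        geometryMeta.put("srid", "4326"); // e.g. a Spatial Reference System id
        Map<Integer, Map<String, String>> byIndex = new HashMap<>();
        byIndex.put(2, geometryMeta);
        return byIndex;
    }

    public static void main(String[] args) {
        System.out.println(schemaMetadata());
        System.out.println(columnMetadataByColumnIndex());
    }
}
```

The point of the shape is that callers choose both keys and values, so new kinds of metadata need no API change.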
Yes! That way you don't have to submit a new PR every time there's new metadata :)
(Of course, you could probably add the metadata during reading as well…but since we already offer some options, I guess we may as well have a formal API.)
Cool, so I'll change the implementation to the more general approach, but the test case will stay the same, using "comment". Is that OK for you?
To me, tests are like code snippets for developers: somebody can implement complex DB metadata gathering based on the same idea. For the PostgreSQL driver, a special wrapper around the public class ResultSetMetaData is required to propagate the right metadata to the existing JDBC-to-Arrow bridge.
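One way to sketch such a wrapper, using a JDK dynamic proxy so every ResultSetMetaData method delegates to the original while selected calls are intercepted. This is an illustrative sketch, not the actual PostgreSQL wrapper; the intercepted method (getColumnLabel) and the class name are hypothetical:

```java
import java.lang.reflect.Proxy;
import java.sql.ResultSetMetaData;
import java.util.Map;

public class WrapperSketch {
    // Delegates every ResultSetMetaData call to the wrapped instance;
    // getColumnLabel is intercepted so extra/corrected metadata can be
    // injected before the metadata reaches the JDBC-to-Arrow bridge.
    static ResultSetMetaData wrap(ResultSetMetaData delegate,
                                  Map<Integer, String> labelOverrides) {
        return (ResultSetMetaData) Proxy.newProxyInstance(
                ResultSetMetaData.class.getClassLoader(),
                new Class<?>[] {ResultSetMetaData.class},
                (proxy, method, args) -> {
                    if (method.getName().equals("getColumnLabel")) {
                        String override = labelOverrides.get((Integer) args[0]);
                        if (override != null) {
                            return override;
                        }
                    }
                    // Everything else falls through to the real metadata.
                    return method.invoke(delegate, args);
                });
    }

    public static void main(String[] args) throws Exception {
        // Stub delegate that always reports "ORIG" as the column label.
        ResultSetMetaData base = (ResultSetMetaData) Proxy.newProxyInstance(
                ResultSetMetaData.class.getClassLoader(),
                new Class<?>[] {ResultSetMetaData.class},
                (p, m, a) -> m.getName().equals("getColumnLabel") ? "ORIG" : null);
        ResultSetMetaData wrapped = wrap(base, Map.of(1, "RENAMED"));
        System.out.println(wrapped.getColumnLabel(1));
        System.out.println(wrapped.getColumnLabel(2));
    }
}
```

A hand-written delegating class would work equally well; the proxy just avoids implementing the full interface by hand.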
Yes, we can keep the test case.
It may help to also note in the test case that COMMENT is what Spark uses; I wasn't aware of that context.
Sure! I'll do it on Monday. Thank you for your assistance and good idea! Have a nice weekend @lidavidm
The Apache Spark schema conversion result for the schema from org.apache.arrow.adapter.jdbc.JdbcToArrowCommentMetadataTest#schemaCommentWithDatabaseMetadata:
structType.toDDL(): ID BIGINT NOT NULL COMMENT 'Record identifier',NAME STRING COMMENT 'Name of record',COLUMN1 BOOLEAN,COLUMNN INT COMMENT 'Informative description of columnN'
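The comment-to-DDL mapping shown above can be sketched in plain Java. This only mirrors the shape of structType.toDDL()'s output, not Spark's actual implementation, and all names here are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class DdlSketch {
    // Render column name/type pairs plus an optional "comment" metadata
    // value per column into a Spark-like DDL string.
    static String toDdl(Map<String, String> columns, Map<String, String> comments) {
        StringJoiner sj = new StringJoiner(",");
        for (Map.Entry<String, String> col : columns.entrySet()) {
            String ddl = col.getKey() + " " + col.getValue();
            String comment = comments.get(col.getKey());
            if (comment != null) {
                ddl += " COMMENT '" + comment + "'";
            }
            sj.add(ddl);
        }
        return sj.toString();
    }

    public static void main(String[] args) {
        Map<String, String> cols = new LinkedHashMap<>();
        cols.put("ID", "BIGINT NOT NULL");
        cols.put("COLUMN1", "BOOLEAN");
        System.out.println(toDdl(cols, Map.of("ID", "Record identifier")));
        // ID BIGINT NOT NULL COMMENT 'Record identifier',COLUMN1 BOOLEAN
    }
}
```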
Hello @lidavidm
Hi, sorry, it seems there's still a lint error - it looks like the newly added files need the license header at the top.
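For reference, this is the standard ASF source header required across Apache projects (reproduced here from the ASF header policy, not from this PR's diff; for the H2 SQL file the same text would go in `--` comments):

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
```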
I'll take a second look when I can, but it may be a few days.
OK, done - I added the license header to the H2 SQL file. But is a comment applicable to the JSON expected dataset? I don't think that's possible for JSON files.
The path can be added here: https://github.com/apache/arrow/blob/master/dev/release/rat_exclude_files.txt
Thanks!
Benchmark runs are scheduled for baseline = fac0840 and contender = 9d33df1. 9d33df1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
(apache#14081) Allow the user to provide a comment in the Arrow Schema from JdbcToArrowConfig. This will be very useful metadata in real life (medium to large scale projects) for documentation and maintenance. Apache Spark code uses the "comment" key for such metadata, so it looks like a reasonable default name for this metadata in the Arrow schema too.
Authored-by: igor.suhorukov <igor.suhorukov@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>