
ARROW-17631: [Java] Propagate table/columns comments into Arrow Schema #14081

Merged
merged 6 commits into apache:master on Sep 14, 2022

Conversation

igor-suhorukov (Contributor)

Allow the user to provide comments in the Arrow Schema via JdbcToArrowConfig. This is very useful metadata in real life (medium- to large-scale projects) for documentation and maintenance. Apache Spark uses the "comment" key for such metadata, so it looks like a reasonable default name for this metadata in the Arrow schema too.

@igor-suhorukov changed the title from "Propagate table/columns comments into Arrow Schema" to "ARROW-17631: [Java] Propagate table/columns comments into Arrow Schema" on Sep 9, 2022

github-actions bot commented on Sep 9, 2022

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

github-actions bot commented on Sep 9, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

// Build a JdbcToArrowConfig carrying the table/column comments and derive the Arrow schema.
JdbcToArrowConfig config = new JdbcToArrowConfigBuilder()
    .setAllocator(new RootAllocator())
    .setSchemaComment(tableComment)
    .setColumnCommentByColumnIndex(columnCommentByColumnIndex)
    .setIncludeMetadata(includeMetadata)
    .build();
return JdbcToArrowUtils.jdbcToArrowSchema(resultSetMetaData, config);

lidavidm (Member)

Ah, for the other metadata, this method automatically extracts the metadata values and adds them here. But for REMARKS, it's non-trivial to extract it from the result set, so it probably shouldn't be done automatically.

At this point, I wonder if the API shouldn't just allow specifying extra metadata to attach to the schema and to each column, instead of special-casing a single metadata value?

igor-suhorukov (Contributor, Author) commented on Sep 9, 2022

Good idea, @lidavidm!
Do you mean changing the new JdbcToArrowConfig fields to:
private final Map<String, String> schemaMetadata;
private final Map<Integer, Map<String, String>> columnMetadataByColumnIndex;

so that the user can provide any additional metadata at the schema/column level? To me such an approach is very rational: it allows propagating additional metadata, for example an SRID related to geometry/Spatial Reference Systems, as sketched below. It also allows removing the hard-coded name String COMMENT = "comment" from the source and moving the naming responsibility to the developer.
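
A minimal sketch of how such a generalized config could be used; the builder method names setSchemaMetadata and setColumnCommentByColumnIndex-style naming below simply mirror the proposed field names (setSchemaMetadata, setColumnMetadataByColumnIndex) and are illustrative, not the final API:

import java.util.HashMap;
import java.util.Map;

import org.apache.arrow.adapter.jdbc.JdbcToArrowConfig;
import org.apache.arrow.adapter.jdbc.JdbcToArrowConfigBuilder;
import org.apache.arrow.memory.RootAllocator;

public class MetadataConfigSketch {
  public static JdbcToArrowConfig buildConfig() {
    // Arbitrary schema-level metadata, keyed however the caller likes.
    Map<String, String> schemaMetadata = new HashMap<>();
    schemaMetadata.put("comment", "Reference table for spatial features");

    // Per-column metadata, keyed by JDBC column index.
    Map<Integer, Map<String, String>> columnMetadataByColumnIndex = new HashMap<>();
    Map<String, String> geometryColumn = new HashMap<>();
    geometryColumn.put("comment", "Geometry of the feature");
    geometryColumn.put("srid", "4326"); // hypothetical extra key for the Spatial Reference System ID
    columnMetadataByColumnIndex.put(2, geometryColumn);

    return new JdbcToArrowConfigBuilder()
        .setAllocator(new RootAllocator())
        .setSchemaMetadata(schemaMetadata)
        .setColumnMetadataByColumnIndex(columnMetadataByColumnIndex)
        .build();
  }
}

This keeps "comment" purely a caller convention (matching Spark) rather than a name hard-coded in the adapter.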

lidavidm (Member)

Yes! That way you don't have to submit a new PR every time there's new metadata :)

lidavidm (Member)

(Of course, you could probably add the metadata during reading as well…but since we already offer some options, I guess we may as well have a formal API.)

igor-suhorukov (Contributor, Author) commented on Sep 9, 2022

Cool, so I'll change the implementation to the more general approach, but the test case will stay the same with "comment". Is that OK with you?
To me, a test is like a code snippet for developers: somebody can implement complex DB metadata gathering based on the same idea. For the PostgreSQL driver, a special wrapper around the public ResultSetMetaData class is required to propagate the right metadata into the existing JDBC-to-Arrow bridge; a sketch of gathering column comments via DatabaseMetaData is shown below.
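
An illustrative sketch (not part of this PR) of pulling column comments out of standard JDBC DatabaseMetaData, keyed by column index, so that they could be fed into the generalized per-column metadata option discussed above:

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

public final class ColumnCommentLoader {

  // Reads the REMARKS value reported by DatabaseMetaData.getColumns for the given table
  // and returns it keyed by the 1-based column ordinal position.
  public static Map<Integer, Map<String, String>> loadColumnComments(
      Connection connection, String tableName) throws SQLException {
    Map<Integer, Map<String, String>> columnMetadataByColumnIndex = new HashMap<>();
    DatabaseMetaData databaseMetaData = connection.getMetaData();
    try (ResultSet columns = databaseMetaData.getColumns(null, null, tableName, null)) {
      while (columns.next()) {
        String remarks = columns.getString("REMARKS");
        if (remarks != null && !remarks.isEmpty()) {
          Map<String, String> metadata = new HashMap<>();
          metadata.put("comment", remarks); // same key Apache Spark uses for comments
          columnMetadataByColumnIndex.put(columns.getInt("ORDINAL_POSITION"), metadata);
        }
      }
    }
    return columnMetadataByColumnIndex;
  }
}

Whether REMARKS is actually populated depends on the driver; for PostgreSQL this is where a wrapper or an extra metadata query is needed, since ResultSetMetaData alone does not expose comments.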

lidavidm (Member)

Yes, we can keep the test case.

It may help to also note in the test case that COMMENT is what Spark uses - I wasn't aware of that context.

igor-suhorukov (Contributor, Author)

Sure! I'll do it on Monday. Thank you for your assistance and the good idea! Have a nice weekend, @lidavidm.

igor-suhorukov (Contributor, Author)

Apache Spark schema conversion result for the schema from org.apache.arrow.adapter.jdbc.JdbcToArrowCommentMetadataTest#schemaCommentWithDatabaseMetadata (screenshot omitted):

structType.toDDL(): ID BIGINT NOT NULL COMMENT 'Record identifier',NAME STRING COMMENT 'Name of record',COLUMN1 BOOLEAN,COLUMNN INT COMMENT 'Informative description of columnN'

igor-suhorukov (Contributor, Author)

Hello @lidavidm, is the code OK after the rework?

lidavidm (Member)

Hi, sorry, it seems there's still a lint error - it looks like the newly added files need the license header at the top, something like this:

#Licensed to the Apache Software Foundation (ASF) under one or more contributor
#license agreements. See the NOTICE file distributed with this work for additional
#information regarding copyright ownership. The ASF licenses this file to
#You under the Apache License, Version 2.0 (the "License"); you may not use
#this file except in compliance with the License. You may obtain a copy of
#the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
#by applicable law or agreed to in writing, software distributed under the
#License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
#OF ANY KIND, either express or implied. See the License for the specific
#language governing permissions and limitations under the License.

lidavidm (Member)

I'll take a second look when I can but it may be a few days

igor-suhorukov (Contributor, Author)

OK, I've added the license header to the H2 SQL file. But is a comment applicable to the expected JSON dataset? I don't think that's possible for JSON files.

lidavidm (Member)

The path can be added here: https://github.com/apache/arrow/blob/master/dev/release/rat_exclude_files.txt

igor-suhorukov (Contributor, Author)

The JSON files are already listed as an exception there (screenshot of rat_exclude_files.txt omitted).

lidavidm (Member) left a comment

Thanks!

lidavidm merged commit 9d33df1 into apache:master on Sep 14, 2022

ursabot commented on Sep 14, 2022

Benchmark runs are scheduled for baseline = fac0840 and contender = 9d33df1. 9d33df1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️25.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.2% ⬆️0.07%] test-mac-arm
[Failed ⬇️0.28% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.11% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 9d33df19 ec2-t3-xlarge-us-east-2
[Finished] 9d33df19 test-mac-arm
[Failed] 9d33df19 ursa-i9-9960x
[Finished] 9d33df19 ursa-thinkcentre-m75q
[Finished] fac08404 ec2-t3-xlarge-us-east-2
[Finished] fac08404 test-mac-arm
[Failed] fac08404 ursa-i9-9960x
[Finished] fac08404 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

zagto pushed a commit to zagto/arrow that referenced this pull request on Oct 7, 2022
ARROW-17631: [Java] Propagate table/columns comments into Arrow Schema (apache#14081)

Authored-by: igor.suhorukov <igor.suhorukov@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request on Oct 17, 2022
ARROW-17631: [Java] Propagate table/columns comments into Arrow Schema (apache#14081)

Authored-by: igor.suhorukov <igor.suhorukov@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>