
ARROW-17631: [Java] Propagate table/columns comments into Arrow Schema #14081

Merged
merged 6 commits into apache:master on Sep 14, 2022

Conversation

igor-suhorukov (Contributor)

Allow the user to provide comments in the Arrow Schema via JdbcToArrowConfig. This is very useful metadata in real life (medium- to large-scale projects) for documentation and maintenance. Apache Spark uses the "comment" key for such metadata, so it looks like a reasonable default name for this metadata in the Arrow schema too.

@igor-suhorukov changed the title from "Propagate table/columns comments into Arrow Schema" to "ARROW-17631: [Java] Propagate table/columns comments into Arrow Schema" on Sep 9, 2022

github-actions bot commented on Sep 9, 2022

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

github-actions bot commented on Sep 9, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

// Build a JdbcToArrowConfig carrying the table/column comments and derive the Arrow schema.
JdbcToArrowConfig config = new JdbcToArrowConfigBuilder()
    .setAllocator(new RootAllocator())
    .setSchemaComment(tableComment)
    .setColumnCommentByColumnIndex(columnCommentByColumnIndex)
    .setIncludeMetadata(includeMetadata)
    .build();
return JdbcToArrowUtils.jdbcToArrowSchema(resultSetMetaData, config);

lidavidm (Member)

Ah, for the other metadata, this method automatically extracts the metadata values and adds them here. But for REMARKS, it's non-trivial to extract it from the result set, so it probably shouldn't be done automatically.

At this point, I wonder if the API shouldn't just allow specifying extra metadata to attach to the schema and to each column, instead of special-casing a single metadata value?

igor-suhorukov (Contributor, Author) commented on Sep 9, 2022

Good idea, @lidavidm!
Do you mean changing the new JdbcToArrowConfig fields to:
private final Map<String, String> schemaMetadata;
private final Map<Integer, Map<String, String>> columnMetadataByColumnIndex;

so that the user can provide any additional metadata at the schema/column level? To me such an approach is very rational: it allows propagating additional metadata, for example an SRID related to geometry/Spatial Reference Systems, as sketched below. It also allows removing the hard-coded name String COMMENT = "comment" from the source and moving the naming responsibility to the developer.
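
A minimal sketch of how such a generalized config could be used; the builder method names setSchemaMetadata and setColumnCommentByColumnIndex-style naming below simply mirror the proposed field names (setSchemaMetadata, setColumnMetadataByColumnIndex) and are illustrative, not the final API:

import java.util.HashMap;
import java.util.Map;

import org.apache.arrow.adapter.jdbc.JdbcToArrowConfig;
import org.apache.arrow.adapter.jdbc.JdbcToArrowConfigBuilder;
import org.apache.arrow.memory.RootAllocator;

public class MetadataConfigSketch {
  public static JdbcToArrowConfig buildConfig() {
    // Arbitrary schema-level metadata, keyed however the caller likes.
    Map<String, String> schemaMetadata = new HashMap<>();
    schemaMetadata.put("comment", "Reference table for spatial features");

    // Per-column metadata, keyed by JDBC column index.
    Map<Integer, Map<String, String>> columnMetadataByColumnIndex = new HashMap<>();
    Map<String, String> geometryColumn = new HashMap<>();
    geometryColumn.put("comment", "Geometry of the feature");
    geometryColumn.put("srid", "4326"); // hypothetical extra key for the Spatial Reference System ID
    columnMetadataByColumnIndex.put(2, geometryColumn);

    return new JdbcToArrowConfigBuilder()
        .setAllocator(new RootAllocator())
        .setSchemaMetadata(schemaMetadata)
        .setColumnMetadataByColumnIndex(columnMetadataByColumnIndex)
        .build();
  }
}

This keeps "comment" purely a caller convention (matching Spark) rather than a name hard-coded in the adapter.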

lidavidm (Member)

Yes! That way you don't have to submit a new PR every time there's new metadata :)

lidavidm (Member)

(Of course, you could probably add the metadata during reading as well…but since we already offer some options, I guess we may as well have a formal API.)

igor-suhorukov (Contributor, Author) commented on Sep 9, 2022

Cool, so I'll change the implementation to the more general approach, but the test case will stay the same with "comment". Is that OK with you?
To me, a test is like a code snippet for developers: somebody can implement complex DB metadata gathering based on the same idea. For the PostgreSQL driver, a special wrapper around the public ResultSetMetaData class is required to propagate the right metadata into the existing JDBC-to-Arrow bridge; a sketch of gathering column comments via DatabaseMetaData is shown below.
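
An illustrative sketch (not part of this PR) of pulling column comments out of standard JDBC DatabaseMetaData, keyed by column index, so that they could be fed into the generalized per-column metadata option discussed above:

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

public final class ColumnCommentLoader {

  // Reads the REMARKS value reported by DatabaseMetaData.getColumns for the given table
  // and returns it keyed by the 1-based column ordinal position.
  public static Map<Integer, Map<String, String>> loadColumnComments(
      Connection connection, String tableName) throws SQLException {
    Map<Integer, Map<String, String>> columnMetadataByColumnIndex = new HashMap<>();
    DatabaseMetaData databaseMetaData = connection.getMetaData();
    try (ResultSet columns = databaseMetaData.getColumns(null, null, tableName, null)) {
      while (columns.next()) {
        String remarks = columns.getString("REMARKS");
        if (remarks != null && !remarks.isEmpty()) {
          Map<String, String> metadata = new HashMap<>();
          metadata.put("comment", remarks); // same key Apache Spark uses for comments
          columnMetadataByColumnIndex.put(columns.getInt("ORDINAL_POSITION"), metadata);
        }
      }
    }
    return columnMetadataByColumnIndex;
  }
}

Whether REMARKS is actually populated depends on the driver; for PostgreSQL this is where a wrapper or an extra metadata query is needed, since ResultSetMetaData alone does not expose comments.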

lidavidm (Member)

Yes, we can keep the test case.

It may help to also note in the test case that COMMENT is what Spark uses - I wasn't aware of that context.

igor-suhorukov (Contributor, Author)

Sure! I'll do it on Monday. Thank you for your assistance and the good idea! Have a nice weekend, @lidavidm.

igor-suhorukov (Contributor, Author)

Apache Spark schema conversion result for the schema from org.apache.arrow.adapter.jdbc.JdbcToArrowCommentMetadataTest#schemaCommentWithDatabaseMetadata (screenshot omitted):

structType.toDDL(): ID BIGINT NOT NULL COMMENT 'Record identifier',NAME STRING COMMENT 'Name of record',COLUMN1 BOOLEAN,COLUMNN INT COMMENT 'Informative description of columnN'

igor-suhorukov (Contributor, Author)

Hello @lidavidm, is the code OK after the rework?

lidavidm (Member)

Hi, sorry, it seems there's still a lint error - it looks like the newly added files need the license header at the top, something like this:

#Licensed to the Apache Software Foundation (ASF) under one or more contributor
#license agreements. See the NOTICE file distributed with this work for additional
#information regarding copyright ownership. The ASF licenses this file to
#You under the Apache License, Version 2.0 (the "License"); you may not use
#this file except in compliance with the License. You may obtain a copy of
#the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
#by applicable law or agreed to in writing, software distributed under the
#License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
#OF ANY KIND, either express or implied. See the License for the specific
#language governing permissions and limitations under the License.

lidavidm (Member)

I'll take a second look when I can but it may be a few days

igor-suhorukov (Contributor, Author)

OK, I've added the license header to the H2 SQL file. But is a comment applicable to the expected JSON dataset? I don't think that's possible for JSON files.

lidavidm (Member)

The path can be added here: https://github.com/apache/arrow/blob/master/dev/release/rat_exclude_files.txt

igor-suhorukov (Contributor, Author)

The JSON files are already listed as an exception there (screenshot of rat_exclude_files.txt omitted).

lidavidm (Member) left a comment

Thanks!

lidavidm merged commit 9d33df1 into apache:master on Sep 14, 2022

ursabot commented on Sep 14, 2022

Benchmark runs are scheduled for baseline = fac0840 and contender = 9d33df1. 9d33df1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️25.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.2% ⬆️0.07%] test-mac-arm
[Failed ⬇️0.28% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.11% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 9d33df19 ec2-t3-xlarge-us-east-2
[Finished] 9d33df19 test-mac-arm
[Failed] 9d33df19 ursa-i9-9960x
[Finished] 9d33df19 ursa-thinkcentre-m75q
[Finished] fac08404 ec2-t3-xlarge-us-east-2
[Finished] fac08404 test-mac-arm
[Failed] fac08404 ursa-i9-9960x
[Finished] fac08404 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

zagto pushed a commit to zagto/arrow that referenced this pull request on Oct 7, 2022
ARROW-17631: [Java] Propagate table/columns comments into Arrow Schema (apache#14081)

Authored-by: igor.suhorukov <igor.suhorukov@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request on Oct 17, 2022
ARROW-17631: [Java] Propagate table/columns comments into Arrow Schema (apache#14081)

Authored-by: igor.suhorukov <igor.suhorukov@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>