
How to get an Arrow Schema for PostgreSQL columns of hstore (map) and geometry types in JdbcToArrowUtils.jdbcToArrowSchema? #14039

Closed
igor-suhorukov opened this issue Sep 4, 2022 · 6 comments

Comments

@igor-suhorukov
Contributor

I want to convert existing data from a PostgreSQL database into the Apache Arrow IPC data format in Java.

JdbcToArrowUtils.jdbcToArrowSchema allows doing this for scalar types and arrays.

The org.postgresql.jdbc.PgResultSetMetaData instance from the PostgreSQL JDBC driver provides all the required information about column types via pgResultSetMetaData.getColumnTypeName().

But it looks like the current implementation supports only ArrowType.List, with no support for Map type mapping in the code.

How can I get an Apache Arrow schema and transform the data into Arrow for the hstore (map) and geometry database column types?
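For reference, a minimal sketch (not from the issue) of how the driver-specific type names can be inspected through standard JDBC metadata; the connection URL, credentials, and table name are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class InspectPgColumnTypes {
    public static void main(String[] args) throws Exception {
        // Placeholder connection settings; adjust for your environment.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM my_table LIMIT 0")) {
            ResultSetMetaData meta = rs.getMetaData();
            for (int i = 1; i <= meta.getColumnCount(); i++) {
                // getColumnTypeName() reports driver-specific names such as
                // "hstore" or "geometry" for PostgreSQL extension types.
                System.out.printf("%s -> %s (java.sql.Types=%d)%n",
                    meta.getColumnName(i), meta.getColumnTypeName(i), meta.getColumnType(i));
            }
        }
    }
}
```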

@lidavidm
Member

lidavidm commented Sep 6, 2022

I don't think it's possible; while you can override the conversion from JDBC type to Arrow type, the supported Arrow types are hardcoded:

static JdbcConsumer getConsumer(ArrowType arrowType, int columnIndex, boolean nullable,

For Map: it'd be reasonable to add support directly

For geometry type: the question would be, what type do you expect this to be mapped to? And would it be useful to extend the JDBC adapter with full control over the Arrow type support, so that you can add custom types at runtime?
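For reference, a rough sketch (not from the issue) of what overriding the type conversion could look like, assuming the config builder exposes setJdbcToArrowTypeConverter and the default mapping is reachable via JdbcToArrowUtils.getArrowTypeFromJdbcType; exact signatures may differ between Arrow versions, and, as noted above, the hardcoded consumers still limit which Arrow types work end to end:

```java
import java.sql.Types;
import java.util.Calendar;
import java.util.function.Function;

import org.apache.arrow.adapter.jdbc.JdbcFieldInfo;
import org.apache.arrow.adapter.jdbc.JdbcToArrowConfig;
import org.apache.arrow.adapter.jdbc.JdbcToArrowConfigBuilder;
import org.apache.arrow.adapter.jdbc.JdbcToArrowUtils;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.types.pojo.ArrowType;

public class GeometryToBinaryConfigSketch {
    public static JdbcToArrowConfig buildConfig(BufferAllocator allocator) {
        Calendar calendar = JdbcToArrowUtils.getUtcCalendar();
        // Columns the driver reports as java.sql.Types.OTHER (how PostgreSQL tends
        // to surface extension types such as geometry) are mapped to Binary here;
        // everything else falls back to the default JDBC-to-Arrow mapping.
        Function<JdbcFieldInfo, ArrowType> converter = fieldInfo -> {
            if (fieldInfo.getJdbcType() == Types.OTHER) {
                return new ArrowType.Binary();
            }
            return JdbcToArrowUtils.getArrowTypeFromJdbcType(fieldInfo, calendar);
        };
        return new JdbcToArrowConfigBuilder(allocator, calendar)
            .setJdbcToArrowTypeConverter(converter)
            .build();
    }
}
```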

@igor-suhorukov
Contributor Author

Hello @lidavidm
Thank you for your time and knowledge sharing!

For Map: it'd be reasonable to add support directly

Sounds perfect! Something similar to the following snippet for the schema:

```java
if (arrowType.getTypeID() == ArrowType.ArrowTypeID.Map) {
    FieldType mapType = new FieldType(false, ArrowType.Struct.INSTANCE, null, null);
    FieldType keyType1 = new FieldType(false, new ArrowType.Utf8(), null, null);
    children = new ArrayList<>();
    children.add(new Field("child", mapType,
        Arrays.asList(new Field(MapVector.KEY_NAME, keyType1, null),
                      new Field(MapVector.VALUE_NAME, keyType1, null))));
}
```

For geometry type: the question would be, what type do you expect this to be mapped to?

A binary representation is OK for me for the current task.

And would it be useful to extend the JDBC adapter with full control over the Arrow type support, so that you can add custom types at runtime?

It could be a good solution for custom type mapping in complex cases.
The first step is to add the column index in
final ArrowType arrowType = config.getJdbcToArrowTypeConverter().apply(columnIndex, columnFieldInfo);
This would allow end users to write custom type mappings, leveraging JDBC driver specific features and types. The same goes for array element types - it is much easier to get the element type by column index than to iterate over all columns to collect data for jdbcToArrowConfig.setArraySubTypeByColumnIndexMap().
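Purely as an illustration of this proposal (not the adapter's actual API at the time), a column-index-aware converter might be shaped like the following; the BiFunction form, the pgTypeNameByColumn lookup, and the null-means-default convention are all assumptions:

```java
import java.util.Map;
import java.util.function.BiFunction;

import org.apache.arrow.adapter.jdbc.JdbcFieldInfo;
import org.apache.arrow.vector.types.pojo.ArrowType;

public class ColumnIndexAwareConverterSketch {
    // Hypothetical converter shape: the column index lets the caller consult
    // driver-specific metadata (e.g. from PgResultSetMetaData) captured
    // separately and keyed by column position.
    public static BiFunction<Integer, JdbcFieldInfo, ArrowType> converter(
            Map<Integer, String> pgTypeNameByColumn) {
        return (columnIndex, fieldInfo) -> {
            String pgTypeName = pgTypeNameByColumn.get(columnIndex);
            if ("hstore".equals(pgTypeName)) {
                return new ArrowType.Map(false); // string-to-string map, keys unsorted
            }
            if ("geometry".equals(pgTypeName)) {
                return new ArrowType.Binary();   // raw binary representation
            }
            return null; // illustrative convention: null = fall back to the default mapping
        };
    }
}
```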

I also have a proposal for column/schema level metadata: propagate the "comment" for tables/columns by using java.sql.DatabaseMetaData#getColumns / java.sql.DatabaseMetaData#getTables from connection.getMetaData(), or just allow the user to provide it via a custom comment handler in JdbcToArrowConfig. This would be very useful metadata in real life (medium to large scale projects) for documentation and maintenance. Apache Spark uses the "comment" key for such metadata, so it looks like a reasonable default name for this metadata in the Arrow schema too.
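A minimal sketch (not from the issue) of how comments could be read from standard JDBC metadata and attached to an Arrow field under a "comment" key; the REMARKS and COLUMN_NAME columns are standard java.sql.DatabaseMetaData#getColumns output, everything else is illustrative:

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

public class ColumnCommentMetadataSketch {
    /** Collects column comments (REMARKS) for a table, keyed by column name. */
    public static Map<String, String> columnComments(Connection conn, String table) throws SQLException {
        Map<String, String> comments = new HashMap<>();
        DatabaseMetaData dbMeta = conn.getMetaData();
        try (ResultSet rs = dbMeta.getColumns(null, null, table, null)) {
            while (rs.next()) {
                String remarks = rs.getString("REMARKS");
                if (remarks != null) {
                    comments.put(rs.getString("COLUMN_NAME"), remarks);
                }
            }
        }
        return comments;
    }

    /** Builds a field that carries its comment as metadata, mirroring Spark's "comment" key. */
    public static Field fieldWithComment(String name, ArrowType type, String comment) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("comment", comment);
        return new Field(name, new FieldType(true, type, null, metadata), null);
    }
}
```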

@lidavidm
Member

lidavidm commented Sep 6, 2022

Hello @lidavidm Thank you for your time and knowledge sharing!

For Map: it'd be reasonable to add support directly

Sounds perfect! Something similar to the following snippet for the schema: if (arrowType.getTypeID() == ArrowType.ArrowTypeID.Map) { ... } (full snippet quoted above)

Yes - if you want to file a Jira and/or PR they would be much appreciated. (For anything here, I'm generally supportive, but I don't have too much time right now to do more than review code.)

For geometry type: the question would be, what type do you expect this to be mapped to?

A binary representation is OK for me for the current task.

That sounds good to me. If more capabilities are required, you may want to follow the discussion about "canonical extension types": https://lists.apache.org/thread/qxc1g7h9ow79qt6r7sqtgbj8mdbdgnhb

This would let us define a new "type" for Postgres-style geometry data which is just metadata over an existing type (such as binary).
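To illustrate the idea (the extension name "postgis.geometry" below is a placeholder, not a registered canonical extension type; only the ARROW:extension:* metadata keys follow the Arrow convention), such a field could carry Binary storage plus metadata:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

public class GeometryExtensionFieldSketch {
    public static Field geometryField(String name) {
        Map<String, String> metadata = new HashMap<>();
        // The storage type stays Binary; the logical meaning travels in metadata.
        metadata.put("ARROW:extension:name", "postgis.geometry"); // placeholder name
        metadata.put("ARROW:extension:metadata", "");
        return new Field(name,
            new FieldType(true, new ArrowType.Binary(), null, metadata),
            null);
    }
}
```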

And would it be useful to extend the JDBC adapter with full control over the Arrow type support, so that you can add custom types at runtime?

It could be a good solution for custom type mapping in complex cases. The first step is to add the column index in final ArrowType arrowType = config.getJdbcToArrowTypeConverter().apply(columnIndex, columnFieldInfo); This would allow end users to write custom type mappings, leveraging JDBC driver specific features and types. The same goes for array element types - it is much easier to get the element type by column index than to iterate over all columns to collect data for jdbcToArrowConfig.setArraySubTypeByColumnIndexMap().

Ok, cool. If you have an API proposal, feel free to file a Jira and/or PR - I think it makes sense to allow more flexibility here, given that we allow it in the 'opposite direction' anyways.

I also have a proposal for column/schema level metadata: propagate the "comment" for tables/columns by using java.sql.DatabaseMetaData#getColumns / java.sql.DatabaseMetaData#getTables from connection.getMetaData(), or just allow the user to provide it via a custom comment handler in JdbcToArrowConfig. This would be very useful metadata in real life (medium to large scale projects) for documentation and maintenance. Apache Spark uses the "comment" key for such metadata, so it looks like a reasonable default name for this metadata in the Arrow schema too.

Interesting. A contribution would be welcome here as well - I think it makes sense to propagate the JDBC metadata into the Arrow metadata if it's useful for applications.

@igor-suhorukov
Contributor Author

@lidavidm
Member

lidavidm commented Sep 6, 2022

Thank you!

@igor-suhorukov
Contributor Author

Issue resolved by pull requests #14134 and #14081:
#14134
#14081
