
DRILL-8005: Add Writer to JDBC Storage Plugin #2327

Merged: 41 commits into apache:master from jdbc_writer, Oct 22, 2021

Conversation

@cgivre (Contributor) commented Oct 6, 2021

DRILL-8005: Add Writer to JDBC Storage Plugin

Description

This PR adds the ability to write to JDBC storage. Users will be able to execute the following queries against JDBC data sources.

  • CREATE TABLE AS
  • CREATE TABLE IF NOT EXISTS
  • DROP TABLE

Example Queries:

CREATE TABLE IF NOT EXISTS pg.public.`t1` AS 
  SELECT int_field, float_field, varchar_field, boolean_field 
  FROM cp.`json/dataTypes.json`

Known Limitations:

  • JDBC in general does not support complex types, and the current implementation of this plugin will throw an exception if a user tries to write a complex field to a JDBC source.
  • This PR attempts to be as generic as possible; as a result, the translation between Drill data types and JDBC data types isn't always exact. Specifically, various databases use different types for INT, FLOAT, etc. The plugin will default to NUMERIC for most FLOAT types (see the sketch after this list).
  • VARBINARY is not supported yet.
  • Write capability was tested on MySQL, Postgres, and H2. Given the lack of standardization of DDL queries, there may be bugs when trying to write to other JDBC data sources.
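
For illustration only, a minimal sketch of the kind of Drill-to-JDBC type mapping described above, assuming plain string type names rather than the plugin's real lookup; the fallback to NUMERIC for float types mirrors the limitation noted in the list.

public class DrillToJdbcTypes {
  // Hypothetical helper: maps a Drill minor type name to a portable JDBC DDL type name.
  public static String toJdbcType(String drillType) {
    switch (drillType) {
      case "INT":     return "INTEGER";
      case "BIGINT":  return "BIGINT";
      case "VARCHAR": return "VARCHAR";
      case "BIT":     return "BOOLEAN";
      case "FLOAT4":
      case "FLOAT8":  return "NUMERIC";  // many databases lack a portable FLOAT/DOUBLE
      default:
        throw new UnsupportedOperationException("Unsupported type: " + drillType);
    }
  }
}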

Documentation

Documentation will be provided in a separate pull request.

Testing

This PR adds unit tests for the writer for MySQL, Postgres, and H2, as well as additional unit tests for the JDBC storage plugin with Postgres.

@cgivre added the labels enhancement (PRs that add new functionality to Drill) and doc-impacting (PRs that affect the documentation) on Oct 6, 2021
@cgivre self-assigned this on Oct 6, 2021
@lgtm-com bot commented Oct 6, 2021

This pull request introduces 2 alerts when merging 57d3fa6 into bad5e66 - view on LGTM.com

new alerts:

  • 2 for Potential database resource leak

@lgtm-com bot commented Oct 7, 2021

This pull request introduces 2 alerts when merging af78c5a into bad5e66 - view on LGTM.com

new alerts:

  • 2 for Potential database resource leak

@jnturton (Contributor) left a comment

Comments and questions inline.

String cleanSQL = node.toSqlString(dialect).getSql();

// TODO Fix this hack
// HACK See CALCITE-4820 (https://issues.apache.org/jira/browse/CALCITE-4820)
Contributor:

Did you see the response from the Calcite team on CALCITE-4820?

Contributor Author:

I didn't see the response.
@vvysotskyi would it be possible to merge apache/calcite#1568 to the Drill calcite?

Member:

I think yes, it is quite a small fix, so no conflicts should appear.

return new NullableIntJDBCConverter(fieldId, fieldName, reader, fields);
}

public class NullableIntJDBCConverter extends FieldConverter {
Contributor:

I guess maybe we could do something with a Freemarker template for the converters but I'm not convinced it's worth it now that we already have these written.

}
}

private String buildInsertQuery() {
@jnturton (Contributor) commented Oct 8, 2021

I think that the maximum number of records DBMSes allow in a VALUES expression is commonly on the order of 1e3 to 1e4. If Drill batch sizes can exceed that, we're going to have a problem. A possible solution is to always partition into conservative insert batches of, say, 500 records. The PreparedStatement and executeBatch JDBC API usage in this answer https://stackoverflow.com/a/3786127/1153953 might help to keep things as efficient as possible.
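
For reference, a minimal sketch of the PreparedStatement plus executeBatch pattern mentioned above, not the plugin's actual writer; the table and column names are hypothetical, and the batch size of 500 follows the suggestion.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchedInsertSketch {
  private static final int BATCH_SIZE = 500;

  public static void insertAll(Connection conn, List<Object[]> rows) throws SQLException {
    String sql = "INSERT INTO t1 (int_field, varchar_field) VALUES (?, ?)";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
      int count = 0;
      for (Object[] row : rows) {
        ps.setObject(1, row[0]);
        ps.setObject(2, row[1]);
        ps.addBatch();
        if (++count % BATCH_SIZE == 0) {
          ps.executeBatch();  // flush a conservative batch to the target database
        }
      }
      ps.executeBatch();      // flush any remaining rows
    }
  }
}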

Contributor:

@cgivre did you see this? Have we tested CTAS statements with 10k, 100k, 1m records?

Contributor Author:

@dzamo
This is a good question. What is supposed to happen is that inserts actually happen in batches. Any suggestions as to how to test? Do you think I should just generate a CSV file with 1M records and see what happens?

Contributor Author:

I'm still learning about the writer API myself, so I'm figuring this out as we go, but I'm also not quite sure where you control the batch size. I'll see if I can figure that out.

Contributor:

@cgivre I think generating a test file of 1m records is a good thing to do at least once. I don't know much about Drill's batching but I think of it as unrelated to the size limitations of VALUES expressions in external dbs. If it were me I'd assume Drill could send batches bigger than the target db's VALUES limit and I'd write a loop in JdbcRecordWriter that inserts no more than ~500 records at a time, as outlined in my first comment.

Contributor:

@cgivre a little trick for producing 1e6 records

set `planner.enable_nljoin_for_scalar_only` = false;
create temporary table t as select o1.* from cp.`tpch/orders.parquet` o1 cross join cp.`tpch/orders.parquet` o2 limit 1e6;

Resolved review threads on .gitignore, contrib/storage-jdbc/pom.xml, and docs/dev/CreatingAWriter.md.
@cgivre (Contributor, author) commented Oct 11, 2021

@dzamo, @vvysotskyi
Thank you for your timely review on this. I addressed all your comments. Once the drill-calcite PR is merged, I should be able to remove the hack and (hopefully) it should be ready to go at that point.

@luocooong (Member) commented:

@cgivre Hello Charles. Is it possible to compress the large_csv.csvh to zip or tar.gz format?

@jnturton (Contributor) commented:
Or we can use a cross join in a query based on a small CSV file of n rows to get n² rows.

@cgivre (Contributor, author) commented Oct 15, 2021

@cgivre Hello Charles. Is it possible to compress the large_csv.csvh to zip or tar.gz format?

Hi @luocooong, are you looking for a test to see if it can go from a compressed file to an insert? I added a new test that generates a 100k-row CSV file and inserts it. The only thing is that the test is slow, so I disabled it by default.

Comment on lines 270 to 274
Statement stmt = connection.createStatement();
stmt.execute(insertQuery);
logger.debug("Query complete");
// Close connection
AutoCloseables.closeSilently(stmt, connection);
Member:

Please wrap the statement in a try-with-resources block and close the connection in a finally block, since in the case of an exception during statement execution it wouldn't be closed otherwise.
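
A minimal sketch of the suggested fix, using the same names as the snippet above; whether the plugin ends up structured exactly this way is an assumption.

try {
  // try-with-resources guarantees the statement is closed even if execute() throws
  try (Statement stmt = connection.createStatement()) {
    stmt.execute(insertQuery);
    logger.debug("Query complete");
  }
} finally {
  // the connection is closed regardless of whether the statement succeeded
  AutoCloseables.closeSilently(connection);
}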

Contributor Author:

Fixed, I think.

public void endRecord() throws IOException {
logger.debug("Ending record");

// Add values to rowString
Member:

Can't we use Calcite's JdbcTableModify to create the insert statement string instead of using custom logic?

contrib/storage-jdbc/pom.xml:

<dependency>
<groupId>com.github.vvysotskyi.drill-calcite</groupId>
<artifactId>calcite-server</artifactId>
Member:

Could you please explain why this dependency is required here?

Contributor Author:

calcite-core is included in the root pom.xml file. The DDL operations we needed for this are in the calcite-server module, which was not listed as a dependency in the root pom.xml.

@cgivre (Contributor, author) commented Oct 21, 2021

@dzamo @vvysotskyi
Thank you for your patience. I've removed the hackery and addressed your review comments.
Thanks!

@jnturton (Contributor) left a comment

If Vova's happy, I'm happy. +1

@cgivre (Contributor, author) commented Oct 22, 2021

@dzamo
Per your request, I thought about this some more and added the ability to configure the batch size for the INSERT queries. What happens now is that the user can set the batch size depending on their environment and the database to which they are inserting data.

The unit tests pass, and I ran this locally with a 1M-row CSV insert into a MySQL database, which worked perfectly. Previously, this ran into the max_allowed_packet limit in MySQL, but now this is not an issue.
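
As a rough illustration of the batching idea, and not the plugin's actual code, here is a sketch in which a hypothetical writerBatchSize field controls how many buffered rows go into each generated INSERT; the helper names (currentRowValues, tableName) are also assumptions.

private final int writerBatchSize;                       // user-configurable batch size
private final List<String> bufferedRows = new ArrayList<>();

public void endRecord() throws IOException {
  bufferedRows.add(currentRowValues());                  // e.g. "(1, 'foo', true)"
  if (bufferedRows.size() >= writerBatchSize) {
    flush();
  }
}

private void flush() throws IOException {
  if (bufferedRows.isEmpty()) {
    return;
  }
  String sql = "INSERT INTO " + tableName + " VALUES " + String.join(", ", bufferedRows);
  try (Statement stmt = connection.createStatement()) {
    stmt.execute(sql);                                    // one INSERT per batch of rows
  } catch (SQLException e) {
    throw new IOException(e);
  }
  bufferedRows.clear();
}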

@jnturton (Contributor) commented:
@dzamo Per your request, I thought about this some more and added the ability to configure the batch size for the INSERT queries. What happens now is that the user can set the batch size depending on their environment and the database to which they are inserting data.

@cgivre this is great. I thought of one more possible optimisation: creating a parameterised INSERT PreparedStatement of writer_batch_size rows and reusing it for as long as there are >= writer_batch_size rows remaining to insert. I don't know the Calcite internals, but I did see a class called SqlDynamicParam in it. This would mean that the receiving DBMS does not need to parse a very long INSERT statement at the start of every batch, which I would guess is a noticeable saving of memory and CPU time for it. Just a possible optimisation I wanted to share; I view it as something that can also come in a later version.
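
A sketch of that optimisation in plain JDBC terms, not tied to Calcite's SqlDynamicParam: one parameterised INSERT covering a full batch of rows is prepared once and reused for every full batch. The two-column row shape and the remainingRows/nextRows helpers are assumptions for illustration.

// Build "(?, ?), (?, ?), ..." once for writerBatchSize rows of two columns each.
String placeholders = String.join(", ", Collections.nCopies(writerBatchSize, "(?, ?)"));
String sql = "INSERT INTO t1 (int_field, varchar_field) VALUES " + placeholders;

try (PreparedStatement ps = connection.prepareStatement(sql)) {
  while (remainingRows() >= writerBatchSize) {           // hypothetical helper
    int i = 1;
    for (Object[] row : nextRows(writerBatchSize)) {     // hypothetical helper
      ps.setObject(i++, row[0]);
      ps.setObject(i++, row[1]);
    }
    ps.executeUpdate();  // the DBMS re-binds parameters instead of re-parsing a huge INSERT
  }
  // a final short batch (fewer than writerBatchSize rows) would need a separate statement
}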

@vvysotskyi (Member) left a comment

+1

@cgivre (Contributor, author) commented Oct 22, 2021

Thank you @dzamo, @vvysotskyi, and @luocooong for your review!

@cgivre merged commit 81b8a98 into apache:master on Oct 22, 2021
@cgivre deleted the jdbc_writer branch on October 22, 2021 at 14:13
estherbuchwalter pushed a commit to MFoss19/drill that referenced this pull request on Oct 26, 2021:
* Initial commit

* Getting to writer

* Schema should be created

* Queries complete

* Queries successfully completing

* WIP

* WIP

* Fixed schema resolution issue

* Fixed VARCHAR precision issue

* Added writable config option

* Mostly working

* Added Postgres Unit Tests, Some working

* All Postgres tests working

* Fix spacing

* Code Cleanup

* WIP

* WIP - Null rows

* Null input working

* Added additional unit tests

* Finished MySQL unit tests

* MySQL and Postgres Tests all Passing

* h2 writer tests working

* Final commit

* Ready for PR

* Final fixes

* Add license header

* Fixed MySQL Timezones in unit tests

* Removed unused import

* Remove unneeded unit tests

* Minor fixes

* Working on docs

* Addressed code comments

* Fixed unit test

* Added test for large file

* Updated tutorial

* Added tests for large inserts

* Added documentation

* Addressed review comment

* De-hacking

* Removed ROW Hack

* Added configurable batching