
DRILL-8005: Add Writer to JDBC Storage Plugin #2327

Merged: 41 commits into apache:master from jdbc_writer, Oct 22, 2021

Conversation

@cgivre (Contributor) commented Oct 6, 2021

DRILL-8005: Add Writer to JDBC Storage Plugin

Description

This PR adds the ability to write to JDBC storage. Users will be able to execute the following queries against JDBC data sources.

  • CREATE TABLE AS
  • CREATE TABLE IF NOT EXISTS
  • DROP TABLE

Example Queries:

CREATE TABLE IF NOT EXISTS pg.public.`t1` AS 
  SELECT int_field, float_field, varchar_field, boolean_field 
  FROM cp.`json/dataTypes.json`

Known Limitations:

  • JDBC in general does not support complex types, and the current implementation of this plugin will throw an exception if a user tries to write a complex field to a JDBC source.
  • This PR attempts to be as generic as possible; as a result, the translation between Drill data types and JDBC data types isn't always exact. Specifically, various databases use different types for INT, FLOAT, etc. The plugin will default to NUMERIC for most FLOAT types (see the sketch after this list).
  • VARBINARY is not supported yet.
  • Write capability was tested on MySQL, Postgres, and H2. Given the lack of standardization of DDL queries, there may be bugs when trying to write to other JDBC data sources.
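
For illustration only, a minimal sketch of the kind of Drill-to-JDBC type mapping described above, assuming plain string type names rather than the plugin's real lookup; the fallback to NUMERIC for float types mirrors the limitation noted in the list.

public class DrillToJdbcTypes {
  // Hypothetical helper: maps a Drill minor type name to a portable JDBC DDL type name.
  public static String toJdbcType(String drillType) {
    switch (drillType) {
      case "INT":     return "INTEGER";
      case "BIGINT":  return "BIGINT";
      case "VARCHAR": return "VARCHAR";
      case "BIT":     return "BOOLEAN";
      case "FLOAT4":
      case "FLOAT8":  return "NUMERIC";  // many databases lack a portable FLOAT/DOUBLE
      default:
        throw new UnsupportedOperationException("Unsupported type: " + drillType);
    }
  }
}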

Documentation

Documentation will be provided in a separate pull request.

Testing

This PR adds unit tests for the writer for MySQL, Postgres, and H2, as well as additional unit tests for the JDBC storage plugin with Postgres.

@cgivre added the labels enhancement (PRs that add new functionality to Drill) and doc-impacting (PRs that affect the documentation) on Oct 6, 2021
@cgivre self-assigned this on Oct 6, 2021
@lgtm-com bot commented Oct 6, 2021

This pull request introduces 2 alerts when merging 57d3fa6 into bad5e66 - view on LGTM.com

new alerts:

  • 2 for Potential database resource leak

@lgtm-com bot commented Oct 7, 2021

This pull request introduces 2 alerts when merging af78c5a into bad5e66 - view on LGTM.com

new alerts:

  • 2 for Potential database resource leak

@jnturton (Contributor) left a comment

Comments and questions inline.

String cleanSQL = node.toSqlString(dialect).getSql();

// TODO Fix this hack
// HACK See CALCITE-4820 (https://issues.apache.org/jira/browse/CALCITE-4820)
Contributor:

Did you see the response from the Calcite team on CALCITE-4820?

Contributor Author:

I didn't see the response.
@vvysotskyi would it be possible to merge apache/calcite#1568 to the Drill calcite?

Member:

I think yes, it is quite a small fix, so no conflicts should appear.

return new NullableIntJDBCConverter(fieldId, fieldName, reader, fields);
}

public class NullableIntJDBCConverter extends FieldConverter {
Contributor:

I guess maybe we could do something with a Freemarker template for the converters but I'm not convinced it's worth it now that we already have these written.

}
}

private String buildInsertQuery() {
@jnturton (Contributor) commented Oct 8, 2021

I think that the maximum number of records DBMSes allow in a VALUES expression is commonly on the order of 1e3 to 1e4. If Drill batch sizes can exceed that, we're going to have a problem. A possible solution is to always partition into conservative insert batches of, say, 500 records. The PreparedStatement and executeBatch JDBC API usage in this answer https://stackoverflow.com/a/3786127/1153953 might help to keep things as efficient as possible.
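
For reference, a minimal sketch of the PreparedStatement plus executeBatch pattern mentioned above, not the plugin's actual writer; the table and column names are hypothetical, and the batch size of 500 follows the suggestion.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchedInsertSketch {
  private static final int BATCH_SIZE = 500;

  public static void insertAll(Connection conn, List<Object[]> rows) throws SQLException {
    String sql = "INSERT INTO t1 (int_field, varchar_field) VALUES (?, ?)";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
      int count = 0;
      for (Object[] row : rows) {
        ps.setObject(1, row[0]);
        ps.setObject(2, row[1]);
        ps.addBatch();
        if (++count % BATCH_SIZE == 0) {
          ps.executeBatch();  // flush a conservative batch to the target database
        }
      }
      ps.executeBatch();      // flush any remaining rows
    }
  }
}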

Contributor:

@cgivre did you see this? Have we tested CTAS statements with 10k, 100k, 1m records?

Contributor Author:

@dzamo
This is a good question. What is supposed to happen is that inserts actually happen in batches. Any suggestions as to how to test? Do you think I should just generate a CSV file with 1M records and see what happens?

Contributor Author:

I'm still learning about the writer API myself, so I'm figuring this out as we go, but I'm also not quite sure where you control the batch size. I'll see if I can figure that out.

Contributor:

@cgivre I think generating a test file of 1m records is a good thing to do at least once. I don't know much about Drill's batching but I think of it as unrelated to the size limitations of VALUES expressions in external dbs. If it were me I'd assume Drill could send batches bigger than the target db's VALUES limit and I'd write a loop in JdbcRecordWriter that inserts no more than ~500 records at a time, as outlined in my first comment.

Contributor:

@cgivre a little trick for producing 1e6 records

set `planner.enable_nljoin_for_scalar_only` = false;
create temporary table t as select o1.* from cp.`tpch/orders.parquet` o1 cross join cp.`tpch/orders.parquet` o2 limit 1e6;

Resolved review threads on .gitignore, contrib/storage-jdbc/pom.xml, and docs/dev/CreatingAWriter.md.
@cgivre (Contributor, author) commented Oct 11, 2021

@dzamo, @vvysotskyi
Thank you for your timely review on this. I addressed all your comments. Once the drill-calcite PR is merged, I should be able to remove the hack and (hopefully) it should be ready to go at that point.

@luocooong (Member) commented:

@cgivre Hello Charles. Is it possible to compress the large_csv.csvh to zip or tar.gz format?

@jnturton (Contributor) commented:
Or we can use a cross join in a query based on a small CSV file of n rows to get n² rows.

@cgivre (Contributor, author) commented Oct 15, 2021

@cgivre Hello Charles. Is it possible to compress the large_csv.csvh to zip or tar.gz format?

Hi @luocooong, are you looking for a test to see if it can go from a compressed file to an insert? I added a new test that generates a 100k-row CSV file and inserts it. The only thing is that the test is slow, so I disabled it by default.

Comment on lines 270 to 274
Statement stmt = connection.createStatement();
stmt.execute(insertQuery);
logger.debug("Query complete");
// Close connection
AutoCloseables.closeSilently(stmt, connection);
Member:

Please wrap the statement in a try-with-resources block and close the connection in a finally block, since in the case of an exception during statement execution it wouldn't be closed otherwise.
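
A minimal sketch of the suggested fix, using the same names as the snippet above; whether the plugin ends up structured exactly this way is an assumption.

try {
  // try-with-resources guarantees the statement is closed even if execute() throws
  try (Statement stmt = connection.createStatement()) {
    stmt.execute(insertQuery);
    logger.debug("Query complete");
  }
} finally {
  // the connection is closed regardless of whether the statement succeeded
  AutoCloseables.closeSilently(connection);
}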

Contributor Author:

Fixed, I think.

public void endRecord() throws IOException {
logger.debug("Ending record");

// Add values to rowString
Member:

Can't we use Calcite's JdbcTableModify to create the insert statement string instead of using custom logic?

contrib/storage-jdbc/pom.xml:

<dependency>
<groupId>com.github.vvysotskyi.drill-calcite</groupId>
<artifactId>calcite-server</artifactId>
Member:

Could you please explain why this dependency is required here?

Contributor Author:

calcite-core is included in the root pom.xml file. The DDL operations we needed for this are in the calcite-server module, which was not listed as a dependency in the root pom.xml.

@cgivre (Contributor, author) commented Oct 21, 2021

@dzamo @vvysotskyi
Thank you for your patience. I've removed the hackery and addressed your review comments.
Thanks!

@jnturton (Contributor) left a comment

If Vova's happy, I'm happy. +1

@cgivre (Contributor, author) commented Oct 22, 2021

@dzamo
Per your request, I thought about this some more and added the ability to configure the batch size for the INSERT queries. What happens now is that the user can set the batch size depending on their environment and the database to which they are inserting data.

The unit tests pass, and I ran this locally with a 1M-row CSV insert into a MySQL database, which worked perfectly. Previously, this ran into the max_allowed_packet limit in MySQL, but now this is not an issue.
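
As a rough illustration of the batching idea, and not the plugin's actual code, here is a sketch in which a hypothetical writerBatchSize field controls how many buffered rows go into each generated INSERT; the helper names (currentRowValues, tableName) are also assumptions.

private final int writerBatchSize;                       // user-configurable batch size
private final List<String> bufferedRows = new ArrayList<>();

public void endRecord() throws IOException {
  bufferedRows.add(currentRowValues());                  // e.g. "(1, 'foo', true)"
  if (bufferedRows.size() >= writerBatchSize) {
    flush();
  }
}

private void flush() throws IOException {
  if (bufferedRows.isEmpty()) {
    return;
  }
  String sql = "INSERT INTO " + tableName + " VALUES " + String.join(", ", bufferedRows);
  try (Statement stmt = connection.createStatement()) {
    stmt.execute(sql);                                    // one INSERT per batch of rows
  } catch (SQLException e) {
    throw new IOException(e);
  }
  bufferedRows.clear();
}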

@jnturton (Contributor) commented:
@dzamo Per your request, I thought about this some more and added the ability to configure the batch size for the INSERT queries. What happens now is that the user can set the batch size depending on their environment and the database to which they are inserting data.

@cgivre this is great. I thought of one more possible optimisation: creating a parameterised INSERT PreparedStatement of writer_batch_size rows and reusing it for as long as there are >= writer_batch_size rows remaining to insert. I don't know the Calcite internals, but I did see a class called SqlDynamicParam in it. This would mean that the receiving DBMS does not need to parse a very long INSERT statement at the start of every batch, which I would guess is a noticeable saving of memory and CPU time for it. Just a possible optimisation I wanted to share; I view it as something that can also come in a later version.
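
A sketch of that optimisation in plain JDBC terms, not tied to Calcite's SqlDynamicParam: one parameterised INSERT covering a full batch of rows is prepared once and reused for every full batch. The two-column row shape and the remainingRows/nextRows helpers are assumptions for illustration.

// Build "(?, ?), (?, ?), ..." once for writerBatchSize rows of two columns each.
String placeholders = String.join(", ", Collections.nCopies(writerBatchSize, "(?, ?)"));
String sql = "INSERT INTO t1 (int_field, varchar_field) VALUES " + placeholders;

try (PreparedStatement ps = connection.prepareStatement(sql)) {
  while (remainingRows() >= writerBatchSize) {           // hypothetical helper
    int i = 1;
    for (Object[] row : nextRows(writerBatchSize)) {     // hypothetical helper
      ps.setObject(i++, row[0]);
      ps.setObject(i++, row[1]);
    }
    ps.executeUpdate();  // the DBMS re-binds parameters instead of re-parsing a huge INSERT
  }
  // a final short batch (fewer than writerBatchSize rows) would need a separate statement
}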

@vvysotskyi (Member) left a comment

+1

@cgivre (Contributor, author) commented Oct 22, 2021

Thank you @dzamo, @vvysotskyi, and @luocooong for your review!

@cgivre merged commit 81b8a98 into apache:master on Oct 22, 2021
@cgivre deleted the jdbc_writer branch on October 22, 2021 at 14:13
estherbuchwalter pushed a commit to MFoss19/drill that referenced this pull request on Oct 26, 2021:
* Initial commit

* Getting to writer

* Schema should be created

* Queries complete

* Queries successfully completing

* WIP

* WIP

* Fixed schema resolution issue

* Fixed VARCHAR precision issue

* Added writable config option

* Mostly working

* Added Postgres Unit Tests, Some working

* All Postgres tests working

* Fix spacing

* Code Cleanup

* WIP

* WIP - Null rows

* Null input working

* Added additional unit tests

* Finished MySQL unit tests

* MySQL and Postgres Tests all Passing

* h2 writer tests working

* Final commit

* Ready for PR

* Final fixes

* Add license header

* Fixed MySQL Timezones in unit tests

* Removed unused import

* Remove unneeded unit tests

* Minor fixes

* Working on docs

* Addressed code comments

* Fixed unit test

* Added test for large file

* Updated tutorial

* Added tests for large inserts

* Added documentation

* Addressed review comment

* De-hacking

* Removed ROW Hack

* Added configurable batching