
[BEAM-244] Add JDBC IO #942

Closed
wants to merge 18 commits into from

Conversation

jbonofre
Member

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

  • Make sure the PR title is formatted like:
    [BEAM-<Jira issue #>] Description of pull request
  • Make sure tests pass via mvn clean verify. (Even better, enable
    Travis-CI on your fork and ensure the whole test matrix passes).
  • Replace <Jira issue #> in the title with the actual Jira issue
    number, if there is one.
  • If this contribution is large, please file an Apache
    Individual Contributor License Agreement.

@jbonofre
Member Author

R: @dhalperi
R: @jkff

*/
public class JdbcDataRecord implements Serializable {

private String[] tableNames;


Hi, why does the field tableNames in class JdbcDataRecord use String[]?
Each record only belongs to one table, so it may be better to declare it as:
private String tableName;

Member Author

If the user query is like this:

select t1.id, t2.name from table1 t1, table2 t2 where t1.id = t2.id

each JdbcDataRecord can reference a different table.

@jkff
Contributor

jkff commented Sep 12, 2016

Thanks! Will take a deeper look later today, but first comment: the Source API is unnecessary here since the connector uses none of its features (ignores desiredBundleSizeBytes, doesn't provide estimated size, progress reporting from the reader, or liquid sharding); please refactor the source to be a simpler composite PTransform made of ParDo's.
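For reference, such a ParDo-based composite read could be sketched roughly as follows. This is illustrative only: the class name, the connection callback, the missing output coder handling, and the present-day PTransform.expand naming are assumptions, not the final JdbcIO API.

public class JdbcRead<T> extends PTransform<PBegin, PCollection<T>> {

  private final String query;
  private final SerializableFunction<Void, Connection> connectionFn; // placeholder connection callback
  private final RowMapper<T> rowMapper;                              // user callback, discussed below

  public JdbcRead(String query, SerializableFunction<Void, Connection> connectionFn,
      RowMapper<T> rowMapper) {
    this.query = query;
    this.connectionFn = connectionFn;
    this.rowMapper = rowMapper;
  }

  @Override
  public PCollection<T> expand(PBegin input) {
    // A real transform would also set an explicit Coder<T> on the output.
    return input
        .apply(Create.of(query)) // a single element: the query text
        .apply(ParDo.of(new DoFn<String, T>() {
          @ProcessElement
          public void processElement(ProcessContext c) throws Exception {
            try (Connection connection = connectionFn.apply(null);
                PreparedStatement statement = connection.prepareStatement(c.element());
                ResultSet resultSet = statement.executeQuery()) {
              while (resultSet.next()) {
                c.output(rowMapper.mapRow(resultSet));
              }
            }
          }
        }));
  }
}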

@jbonofre
Member Author

Thanks! Yes, that's what I'm doing now (this initial code was on an old local branch ;)).

@jkff
Contributor

jkff commented Sep 12, 2016

OK, one more high-level comment that needs to be resolved before continuing.
I think representing every database row by a JdbcDataRecord will be inefficient, since every such object requires allocating 4 arrays, boxing the primitive-type columns into the Object[] array, etc. My experience with pumping a lot of data from/to a database says that these overheads can easily become a serious bottleneck.

I would recommend a slightly different approach that, e.g., Spring JDBC uses (RowMapper and BatchPreparedStatementSetter, see http://docs.spring.io/spring/docs/current/spring-framework-reference/html/jdbc.html): make this source and sink generic and let the user provide callbacks for extracting their type from a ResultSet/ResultSetMetadata, and for filling in a PreparedStatement from their type. I don't necessarily mean using Spring JDBC's classes directly (I don't know if we can or want to have that as a dependency, though if the answer were "yes" it'd be nice).

Additionally, in the sink:

  • for efficiency it'd be great to introduce batching.
  • for both efficiency and generality, it'd be better to let the user supply a parameterized SQL string, rather than hand-assemble the string in the processElement call. It might be e.g. that the user wants to update() things rather than insert, or that they want to use a custom SQL dialect supported by their database of choice.

I'll also make a pass now over the code and comment on things that will likely stay relevant after such a refactoring.
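For readers following along, the read-side callback being proposed boils down to an interface of roughly this shape (a sketch; the exact throws clause is an assumption):

public interface RowMapper<T> extends Serializable {
  T mapRow(ResultSet resultSet) throws Exception;
}

The write side would take the mirror-image callback that fills a PreparedStatement from a T; a sketch of that interface follows the ElementInserter discussion further down.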

return advance();
} catch (Exception e) {
LOGGER.error("Can't connect to the database", e);
return false;
Contributor

From the point of view of the runner, this is simply a "return false" - meaning, the runner will happily declare that the bundle is complete because we reached EOF. From the point of view of the user, this is silent data loss signaled only by an error message in worker logs. I don't think that's what you intended, so it's better to rethrow the exception, both here and in advance().

@jbonofre
Member Author

Thanks for your comments. I will update the PR accordingly.

@jbonofre
Member Author

First refactoring to use DoFn instead of Source. I'm now starting to evaluate RowMapper and BatchPreparedStatementSetter. I'm also starting to test numSplits and batching.

@@ -203,223 +156,218 @@ public BoundedJdbcSource withPassword(String password) {
@Nullable
private final String password;

public BoundedJdbcSource(DataSource dataSource, String query, String username,
String password) {
private Read(DataSource dataSource, String query, @Nullable String username,
Contributor

Don't duplicate variables between the Read class and the JdbcOptions class. Instead make Read have a private variable of type JdbcOptions.

@jbonofre
Member Author

Fixed checkstyle and used JdbcOptions in Read, directly as the start of the pipeline. Now starting split, batch, and RowMapper (or equivalent) support.

@jbonofre
Member Author

Resuming on this PR.

@jbonofre
Member Author

Updated with RowMapper and ElementInserter. The Write doesn't use autocommit anymore and commits per bundle.
I wonder if it would make sense to provide a default RowMapper that populates the PCollection with Strings (in CSV format). WDYT?

@jbonofre
Member Author

I'm fixing the Javadoc issues.

Contributor

@jkff jkff left a comment

Thanks! A bunch of high-level comments; haven't looked at the tests yet.

* {@code
*
* pipeline.apply(JdbcIO.read()
* .withDataSource(myDataSource)
Contributor

Add the RowMapper to this example?

Member Author

Done

*
* pipeline
* .apply(...)
* .apply(JdbcIO.write().withDataSource(myDataSource)
Contributor

Likewise

* object used in the {@link PCollection}.
*/
public interface RowMapper<T> extends Serializable {

Contributor

Remove the unnecessary blank lines

try (ResultSet resultSet = statement.executeQuery()) {
while (resultSet.next()) {
T record = rowMapper.mapRow(resultSet);
if (record != null) {
Contributor

Why this non-null check? If it's allowed to return nulls, as a user I'd be surprised that nulls aren't making it into my PCollection. If it's not allowed to return nulls, I'd be surprised that the error is silently swallowed.

T record = context.element();
try {
PreparedStatement statement = elementInserter.insert(record, connection);
if (statement != null) {
Contributor

Same question - why the non-null check.

}
}
} catch (Exception e) {
LOGGER.warn("Can't insert data into table", e);
Contributor

Rethrow the exception; otherwise in case of failures the user's pipeline will succeed but experience data loss with no apparent indication of that. A pipeline should not succeed if it was unable to do the processing requested.

public void processElement(ProcessContext context) {
T record = context.element();
try {
PreparedStatement statement = elementInserter.insert(record, connection);
Contributor

Seems like there's no batching - we're doing 1 database transaction per record?


@FinishBundle
public void finishBundle(Context context) throws Exception {
connection.commit();
Contributor

@jkff jkff Sep 23, 2016

Bundles can be arbitrarily large - e.g. it is entirely possible that the bundle will live for 3 days and accumulate a billion elements. You need to commit every time the transaction reaches a certain size, as well as when the bundle finishes.

This also means that the user should be aware that some data can be committed multiple times, in case the bundle commits something but then fails and is retried and commits the same thing again. I believe this is currently not easily avoidable (since a DoFn can not force the runner to commit the bundle), it's just something that should be documented, so that e.g. the user should use idempotent PreparedStatement's if possible, or be okay with duplicates.

}

/**
* An interface used by the JdbcIO Write for mapping {@link PCollection} elements as a rows of a
Contributor

This comment appears copy-pasted from RowMapper, please update.

*/
public interface ElementInserter<T> extends Serializable {

PreparedStatement insert(T element, Connection connection);
Contributor

@jkff jkff Sep 23, 2016

Perhaps it'd be better to make the API only slightly less generic but slightly more clear to use / harder to mis-use, by making the following change:

  • Pass the SQL string to the Write transform
  • Make the WriteFn create the prepared statement itself, in the Setup method
  • Make ElementInserter's signature be: void setParameters(T element, PreparedStatement statement) (and rename it to PreparedStatementSetter?)

This way:

  • the PreparedStatement object can be reused, which will give much better performance
  • the user doesn't have to call .prepareStatement
  • the user can not accidentally mess up the connection by doing something to it that the WriteFn doesn't expect (e.g. calling commit, or close, or doing queries)
  • it is easier to understand the semantics of the transform because it always uses the same SQL just with different parameters
  • an ElementInserter is easier to test without a Connection
  • it becomes easier to write composable ElementInserter classes, e.g. you could imagine expressing something like { st.setString(1, t.getKey()); st.setInteger(2, t.getValue()); } as "allOf(stringAt(1, key()), integerAt(2, value()))" where allOf, stringAt, integerAt, key, value are static utility functions. Not suggesting to do this in the current PR, but I once developed a library of composable row mappers and prepared statement setters like this at one of my previous companies and it was very handy to use.
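Concretely, the suggested write-side callback and an example implementation could look like this (a sketch; the Person statement and the KV element type are illustrative, borrowed from the doc examples later in this thread):

public interface PreparedStatementSetter<T> extends Serializable {
  void setParameters(T element, PreparedStatement preparedStatement) throws Exception;
}

// Example setter for "insert into Person values(?, ?)" with KV<Integer, String> elements:
public class PersonSetter implements PreparedStatementSetter<KV<Integer, String>> {
  @Override
  public void setParameters(KV<Integer, String> kv, PreparedStatement statement) throws Exception {
    statement.setInt(1, kv.getKey());
    statement.setString(2, kv.getValue());
  }
}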

@jbonofre
Member Author

Added batching for Write, and use of a PreparedStatementSetter with a query for the write.

@@ -57,14 +59,25 @@
*
* pipeline.apply(JdbcIO.read()
* .withDataSource(myDataSource)
Contributor

Where does the query go?

* public MyElement mapRow(ResultSet resultSet) {
* // use the resultSet to build the element of the PCollection
* // for instance:
* // return resultSet.getString(2);
Contributor

This line should not be commented out; and it should return a MyElement.
E.g.: return new MyElement(resultSet.getInt(1), resultSet.getString(2)) or something
(it should also correspond to the query - internal inconsistency within an example can be confusing to readers of the documentation)

*
* }
* </pre>
* <p>
* You can find an full {@link RowMapper} example in {@code JdbcIOTest}.
Contributor

a full

* <h3>Writing to JDBC datasource</h3>
* <p>
* JDBC sink supports writing records into a database. It expects a
* {@code PCollection<T>}, converts the {@code T} elements as SQL statement
* and insert into the database. T is the type expected by the provided {@link ElementInserter}.
* and setParameters into the database. T is the type expected by the
Contributor

The grammar seems off - "converts the T elements as SQL statement and setParameters".
Perhaps:
It writes a PCollection to the database by converting each T into a PreparedStatement via a user-provided PreparedStatementSetter?

* .apply(JdbcIO.write().withDataSource(myDataSource)
* .apply(JdbcIO.write()
* .withDataSource(myDataSource)
* .withQuery(query)
Contributor

This should be called a statement, rather than a query

}

/**
* Example of {@link PCollection} element that a
Contributor

@jkff jkff Sep 25, 2016

This example is unnecessarily complicated, it does not reflect the kind of code a user will write (users typically know the schema of the data they're reading/writing), and it does not increase coverage of the code under test (the code under test doesn't care about the schema, it just passes it to your RowMapper/Setter, so there's no additional coverage gained from having a very sophisticated row mapper or setter). Please use a simpler example - e.g. a simple POJO with a couple of fields.
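For illustration, the kind of simple test element being suggested could be as small as this (a hypothetical POJO, not code from the PR; a real test would also want equals()/hashCode() so PAssert can compare elements):

public class Person implements Serializable {
  private final int id;
  private final String name;

  public Person(int id, String name) {
    this.id = id;
    this.name = name;
  }

  public int getId() {
    return id;
  }

  public String getName() {
    return name;
  }
}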

PCollection<JdbcDataRecord> output = pipeline.apply(
JdbcIO.read()
.withDataSource(dataSource)
.withQuery("select * from BEAM")
Contributor

Using "select *" would be poor practice in an actual pipeline because it will break if fields in the table are reordered; or it will start performing poorly if a bulky column is added to the database, etc. I think "select *" is generally considered a code smell in database code and should be effectively never used in production code.

It's better to supply a query that explicitly lists the columns you're reading.

public KV<String, Void> apply(JdbcDataRecord input) {
// find NAME column id
int index = -1;
for (int i = 0; i < input.getColumnNames().length; i++) {
Contributor

Again: if you make the example simpler by deleting JdbcDataRecord and using a simple POJO, this loop will be unnecessary and you'll just do regular field access on the POJO.


private class InsertRecord {

private int columnType;
Contributor

@jkff jkff Sep 25, 2016

This class is unused.

Connection connection = dataSource.getConnection();

Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery("select * from TEST");
Contributor

Just use select count(*).

@jbonofre
Member Author

Agree with your comments. My concern is about the data source. Even if the interface is not serializable, most implementations are. I think it's better to let the user define their own data source in order to deal with connection pooling, etc. Thoughts?

@jkff
Contributor

jkff commented Sep 25, 2016

It might be the case that most DataSources are serializable; but the user will be unable to use this connector with those that aren't, and there's no fundamental reason why this has to be the case.
With the proposed ConnectionFactory interface, the user can still use a DataSource under the hood. For cases when the DataSource is in fact serializable, you could even supply a convenience method "withDataSource(DataSource, username, password)" that would delegate to withConnectionFactory, supplying a ConnectionFactory that creates the connection using this DataSource, username and password. But this way the user also has an escape hatch when the DataSource is not serializable but there is still some way to create a connection.
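The escape hatch described here could be sketched as follows (names are hypothetical; the merged JdbcIO ultimately solved this differently, via its connection configuration object):

public interface ConnectionFactory extends Serializable {
  Connection createConnection() throws Exception;
}

// Convenience implementation for the common case of a Serializable DataSource:
public class DataSourceConnectionFactory implements ConnectionFactory {
  private final DataSource dataSource; // assumed Serializable in this case
  private final String username;
  private final String password;

  public DataSourceConnectionFactory(DataSource dataSource, String username, String password) {
    this.dataSource = dataSource;
    this.username = username;
    this.password = password;
  }

  @Override
  public Connection createConnection() throws Exception {
    return dataSource.getConnection(username, password);
  }
}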

@jbonofre
Member Author

Let me think about this and experiment. I don't think ConnectionFactory is a good name, as it could be confused with the JMS ConnectionFactory. Thanks. I'm preparing updates.

@jbonofre
Member Author

Implemented large part of the comments. TODO:

  1. use a custom serializable interface instead of DataSource directly. I'm thinking about capturing everything I need to actually create the datasource (datasource class name, JDBC URL, ...)
  2. re-use the same connection object between Read and Write to avoid code duplication
  3. better handling of exceptions in WriteFn (remove an element from the batch list only if the corresponding statement has been executed successfully).

@jbonofre
Member Author

I'm evaluating the use of DBCP2, letting the user provide the driver class name, JDBC URL, and optionally username, password, and pool size.

Contributor

@jkff jkff left a comment

Thanks! Here's one more round, just on the latest commit. Please let me know after addressing all comments from this and the previous round and I'll do one more pass on the full code.

* // use the resultSet to build the element of the PCollection
* // for instance:
* // return resultSet.getString(2);
* .withStatement("select id,name from Person")
Contributor

For the read, it should be query. I'm using common terminology: query is something read-only; statement is something that modifies things.

Member Author

Understood, and it makes sense.

* .withStatement("select id,name from Person")
* .withRowMapper(new JdbcIO.RowMapper<KV<Integer, String>>() {
* public KV<Integer, String> mapRow(ResultSet resultSet) {
* KV<Integer, String> kv = KV.of(resultSet.getInt(1), resultSet.getString(2));
Contributor

return statement forgotten?

* and setParameters into the database. T is the type expected by the
* provided {@link PreparedStatementSetter}.
* JDBC sink supports writing records into a database. It writes a {@link PCollection} to the
* database by converting each T into a {@link PreparedStatement} via an user-provided
Contributor

a user-provided

* .withQuery(query)
* .withPreparedStatementSetter(new JdbcIO.PreparedStatementSetter<MyElement>() {
* .withStatement("insert into Person values(?, ?)")
* .withPreparedStatementSetter(new JdbcIO.PreparedStatementSetter<KV<Integer, String>>() {
Contributor

This line uses KV[Integer,String] but the next line uses MyElement - please fix.

* }
* })
*
* }
* </pre>
* <p>
* You can find a full {@link PreparedStatementSetter} in {@code JdbcIOTest}.
* You can find a full {@link PreparedStatementSetter} in {@code JdbcIOTest}.
Contributor

a full ... example?

}
// TODO clear only record actually inserted in the database (remove from batch
Contributor

Per offline discussion, this is moot.

} catch (Exception e) {
LOGGER.error("Can't map row", e);
return null;
Contributor

This code says "if can't parse record, silently drop it and return null instead" - this is silent data loss and it's a poor example to show to users, especially given that the documentation of JdbcIO delegates to the test for a complete example. As a user, I would be very surprised to find nulls in my PCollection with no indication of how they got there. I'd much prefer the read to fail rather than silently produce nulls.

Please remove the error handling and let it fail. I think this is a general recommendation that I've made several times in these PRs: the runner does error handling for you. Unless you are dealing with a rare case where the error is transient/recoverable AND your logic does a better job of recovering from it than the runner would have (by retrying your bundle), you should not have error handling code at all (I can think of only one case: retrying individual RPCs in case of transient RPC errors, but usually various client libraries do that for you).

And most importantly, it is absolutely never OK to have error handling code that masks a data-loss error (e.g. failure to write data or failure to parse data).

Member Author

+1 (my bad)

* }
* </pre>
* <p>
* You can find a full {@link RowMapper} example in {@code JdbcIOTest}.
Contributor

The example in JdbcIOTest seems not particularly more complete than the one in this doc; maybe it's not worth mentioning here. Likewise for the write.

}
new SimpleFunction<KV<Integer, String>, KV<String, Void>>() {
public KV<String, Void> apply(KV<Integer, String> input) {
return KV.of(input.getValue(), null);
Contributor

Why the KV[String, Void]? If all you're doing is Count.perKey(), you can apply it to the "output" collection directly, without this MapElements.

Member Author

I have a KV<Integer, String>, and the key is non-deterministic. So, at a minimum, I have to use the String as the key (or change the query/rowMapper).

Contributor

Oh, I see, you're flipping the Integer, String into String, Void.
Instead of that, how about using Values.create() and Count.perElement() to count the String's directly?
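Assuming output is the PCollection<KV<Integer, String>> read from JDBC, that alternative would read roughly like this (sketch):

PCollection<KV<String, Long>> counts =
    output
        .apply(Values.<String>create())     // drop the non-deterministic Integer keys
        .apply(Count.<String>perElement()); // count occurrences of each name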

Member Author

I updated it in the last commit to directly use the name as the key (more straightforward).

LOGGER.error("Can't insert into database", e);
}
preparedStatementSetter.setParameters(record, preparedStatement);
preparedStatement.executeUpdate();
Contributor

Just noticed you're doing executeUpdate rather than executeBatch. This still executes the statements one-by-one, which is inefficient. You need to do addBatch (you can do it in processElement), and do executeBatch in finishBundle.

Member Author

In that case, I don't need the batch list anymore: just batchSize.
In the processElement method, I can do:

T record = context.element();
preparedStatement.clearParameters();
preparedStatementSetter.setParameters(record, preparedStatement);
preparedStatement.addBatch();
batchCount++;
if (batchCount >= batchSize) {
  finishBundle(context);
}

Then, in the finishBundle, I do:

preparedStatement.executeBatch();

Agreed ?

Contributor

Seems reasonable. Also set batchCount back to 0 after executing the batch. And make sure to commit the batch as a single transaction, for efficiency (http://stackoverflow.com/questions/6892105/bulk-insert-in-java-using-prepared-statements-batch-update recommends not using autocommit when using this pattern)
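Putting those pieces together, the WriteFn batching logic being converged on would look roughly like this (a sketch using the field names from the snippets above; it also reflects the empty-batch guard added later in this thread):

@ProcessElement
public void processElement(ProcessContext context) throws Exception {
  preparedStatement.clearParameters();
  preparedStatementSetter.setParameters(context.element(), preparedStatement);
  preparedStatement.addBatch();
  batchCount++;
  if (batchCount >= batchSize) {
    executeBatch();
  }
}

@FinishBundle
public void finishBundle(Context context) throws Exception {
  executeBatch();
}

private void executeBatch() throws Exception {
  if (batchCount > 0) {            // skip an empty executeBatch()
    preparedStatement.executeBatch();
    connection.commit();           // autocommit is off; commit the whole batch as one transaction
    batchCount = 0;                // reset for the next batch
  }
}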

@jbonofre
Member Author

Merged @jkff improvements, added connection pool configuration.

What about preventing fusion in the Read apply() ?

@jbonofre
Member Author

Added random key generation, followed by a GroupByKey and then ungrouping, to prevent fusion.

@Setup
public void setup() {
Random random = new Random();
randInt = random.nextInt();
Contributor

@jkff jkff Sep 30, 2016

This uses the same key for the whole bundle, which is equivalent to not doing this at all. The random key should be generated in processElement so that different elements get distributed onto different keys.

Ideally (but not required): can you try running an example pipeline against, e.g., a local spark or flink cluster with a local database and confirm that it speeds up as you add more workers? (I wish we had some infrastructure for performance testing...) I'd suggest a scenario where the database contains a small number of records (e.g. 100), but processing of every record is, e.g., a 10-second sleep. It should take less than 1000 seconds to complete and should scale linearly with number of workers; but it'd be sufficient to test with, e.g., 10 workers and confirm that it takes about 100 seconds (modulo per-worker startup time).
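The corrected fusion-break then looks roughly like this (a sketch; the class name is illustrative): the Random is created once per DoFn instance in @Setup and a fresh key is drawn per element in @ProcessElement, followed by a GroupByKey and an ungrouping step.

private static class AssignRandomKeyFn<T> extends DoFn<T, KV<Integer, T>> {
  private Random random;

  @Setup
  public void setup() {
    random = new Random();
  }

  @ProcessElement
  public void processElement(ProcessContext context) {
    context.output(KV.of(random.nextInt(), context.element()));
  }
}

// In the read transform:
//   .apply(ParDo.of(new AssignRandomKeyFn<T>()))
//   .apply(GroupByKey.<Integer, T>create())
//   .apply(Values.<Iterable<T>>create())
//   .apply(Flatten.<T>iterables())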

Member Author

My bad, I wanted to do new Random() in @Setup, but nextInt in @ProcessElement. I'm fixing that.

Member Author

And yes for the test, I have my dockerized cluster. I will test on it.

}
@ProcessElement
public void processElement(ProcessContext context) {
T record = context.element();
Contributor

@jkff jkff Sep 30, 2016

Not necessary to extract all the subexpressions into variables, this can be a one-liner:
context.output(KV.of(random.nextInt(), context.element()));

@jkff
Contributor

jkff commented Sep 30, 2016

LGTM assuming you fix the empty executeBatch() thing. Thanks for bearing with me!

@jbonofre
Member Author

Updated: executeBatch() is now skipped if the batch is empty.

@jkff
Contributor

jkff commented Sep 30, 2016

LGTM.

@jbonofre
Member Author

jbonofre commented Oct 1, 2016

Thanks @jkff. I will merge later today !

@asfgit asfgit closed this in c5c3436 Oct 2, 2016
@jbonofre jbonofre deleted the BEAM-244-JDBCIO branch October 2, 2016 12:56
@nambrot

nambrot commented Mar 29, 2017

@jbonofre I found this PR after googling for how to configure connection pooling for Beam. I'm not too familiar with JDBC, but I saw your commit removing the options to specify pooling; I'm curious what your thoughts were behind not exposing those options?

@jkff
Contributor

jkff commented Mar 29, 2017

I suggested to not expose those options, for a couple of reasons.

  • Beam philosophy in general is to be "no-knobs" - a big part of the motivation behind this is that we've found that, given the ability to configure something, people nearly always configure it wrong - or configure it right, but then don't change the configuration while the workload changes, etc. So we prefer not to expose tuning parameters unless absolutely necessary, and instead choose reasonable defaults or have the runner adjust things dynamically and intelligently (e.g. autoscaling in Dataflow).

  • Choosing values for connection pooling parameters would require, I presume, making assumptions about how exactly the pipeline will be run: e.g. how many threads will be running this particular DoFn at the same time on every worker, how the DoFn will be shared between these threads, etc. These assumptions can and will break as runners change.

Do you have a use case where you can reliably achieve much better performance by specifying values different from the defaults? In that case, how are you choosing the values and what assumptions are you making when choosing them? I'm not 100% opposed to adding these knobs if they are necessary, but want to understand whether they really are.

Also, perhaps does the overload of JdbcIO.ConnectionConfiguration.create(DataSource) work for you, where you specify your parameters on the DataSource manually?
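For example, pool tuning can be done on the DataSource itself and the result handed to JdbcIO through the overload mentioned above (a sketch using commons-dbcp2; the driver, URL, and credentials here are placeholders):

BasicDataSource dataSource = new BasicDataSource();      // org.apache.commons.dbcp2
dataSource.setDriverClassName("com.mysql.jdbc.Driver");
dataSource.setUrl("jdbc:mysql://localhost:3306/mydb");
dataSource.setUsername("user");
dataSource.setPassword("password");
dataSource.setMaxTotal(8); // connection pool size chosen by the user, not by JdbcIO

// ... then pass dataSource to JdbcIO via the create(DataSource) overload discussed above.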

@nambrot

nambrot commented Mar 29, 2017

Thank you for these very elaborate explanations. I don't think I have a specific use case where it would be different from the defaults. As I said, I'm not very familiar with Beam and JDBC in general. From my background (web and Ruby), I know that sometimes tuning the connection pool size can greatly improve throughput, and I was surprised to find no options. That being said, your explanation totally makes sense to me, so I will report back if I find such a use case (but probably won't!). Also, thanks for the manual override of the DataSource; I will consult that first!

@nambrot

nambrot commented Mar 29, 2017

One thing I did just notice: running my Beam pipeline on Spark with 8 workers gives me a "Deadlock found when trying to get lock" error, while running on 4 workers seems fine. I'm inserting 2M records into a MySQL database. Is there something I should avoid when doing so with Beam?

EDIT: It also seems like I'm missing some rows that haven't been inserted?

@jkff
Contributor

jkff commented Mar 29, 2017

To resolve this issue, I'd recommend posting a question on StackOverflow with more details under the tag apache-beam http://stackoverflow.com/questions/tagged/apache-beam (e.g. the full stack trace of the error, and relevant parts of the code if you're comfortable sharing them).

@jbonofre
Member Author

@nambrot Hey, can you post (on the user mailing list) a pipeline that reproduces the issue? By the way, I remember having seen a deadlock as well (I think I created a Jira about that). Do you use the DBCP datasource?
