
[WIP] [HUDI-251] JDBC incremental load to HUDI DeltaStreamer #969

Closed
wants to merge 12 commits into from

Conversation

taherk77
Contributor

No description provided.

@taherk77
Contributor Author

@vinothchandar @leesf Travis failed on modules that I didn't touch. Any idea how to restart the Travis build?

@taherk77
Contributor Author

Also, another thing we need to test and implement here is continuous pull, where the user gives an interval and, after every interval, DeltaStreamer pulls from the RDBMS until it is terminated by the user.

@leesf
Contributor

leesf commented Oct 24, 2019

The "The forked VM terminated without properly saying goodbye. VM crash or System.exit called" error occurred again. I have seen it in my local dev environment sometimes and will investigate when I get a chance.
Currently only committers or PMC members can restart the Travis build.

.option(Config.RDBMS_TABLE_PROP, properties.getString(Config.RDBMS_TABLE_NAME));

if (properties.containsKey(Config.PASSWORD) && !StringUtils
.isNullOrEmpty(properties.getString(Config.PASSWORD))) {
Contributor
@leesf leesf Oct 24, 2019

Maybe the value of Config.PASSWORD could be empty in some cases.

Contributor Author

@vinothchandar Please advise: should we allow an empty password?

Member

Maybe in some test setups? It might be good to allow that, actually.

Comment on lines 89 to 93
String[] split = prop.split("\\.");
String key = split[split.length - 1];
String value = properties.getString(prop);
LOG.info(String.format("Adding %s -> %s to jdbc options", key, value));
dataFrameReader.option(key, value);
Contributor

What if an extra option is configured as EXTRA_OPTIONS + "a.b" = value? This will add b = value instead of a.b = value.

Contributor Author

I don't think it will be a.b; it will always be a = value.

Contributor

I don't think it will be a.b; it will always be a = value.

I find it will add b = value; correct me if I am wrong.

Contributor Author

@leesf I researched the options further. All the options I saw were lowerBound, upperBound, numPartitions, etc., which fit well with what we do above. However, I came across OracleIntegrationSuite.scala, which uses the oracle.jdbc.mapDateToTimestamp property, so I think we will have to change the code to support this.

@vinothchandar any further comments on this?
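
For reference, a minimal sketch of the alternative being discussed: strip only the EXTRA_OPTIONS prefix instead of splitting on ".", so a dotted key such as oracle.jdbc.mapDateToTimestamp survives intact. Config, TypedProperties, StringUtils and LOG are assumed from the PR context; this is a sketch, not the committed code.

// Sketch: keep everything after the extra-options prefix, dots included,
// so "oracle.jdbc.mapDateToTimestamp" is passed to the reader unchanged.
private static void addExtraJdbcOptions(TypedProperties properties, DataFrameReader dataFrameReader) {
  for (Object k : properties.keySet()) {
    String prop = k.toString();
    if (prop.startsWith(Config.EXTRA_OPTIONS) && prop.length() > Config.EXTRA_OPTIONS.length()) {
      String key = prop.substring(Config.EXTRA_OPTIONS.length());
      String value = properties.getString(prop);
      LOG.info(String.format("Adding %s -> %s to jdbc options", key, value));
      dataFrameReader.option(key, value);
    }
  }
}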

addExtraJdbcOptions(properties, dataFrameReader);

if (properties.containsKey(Config.IS_INCREMENTAL) && StringUtils
.isNullOrEmpty(properties.getString(Config.IS_INCREMENTAL))) {
Contributor

Maybe check whether the value equals true?

Contributor Author

I don't understand what you are asking here.

Contributor Author

@leesf I understand what you're saying. I shall make that change.

}

@Override
protected Pair<Option<Dataset<Row>>, String> fetchNextBatch(Option<String> lastCkptStr, long sourceLimit) {
Contributor
@leesf leesf Oct 24, 2019

How about using sourceLimit to limit the number of records read from the RDBMS? That is, select xxx from xxx limit sourceLimit, or set fetchsize = sourceLimit on the dataFrameReader?

Contributor Author

@leesf The problem with that is that not all databases support the LIMIT clause. Some support LIMIT, some use TOP; the syntax differs. So instead we just pull everything.

Contributor Author

We could limit the DataFrame if needed, but that just adds complexity for the user. Between batches we would then have to track what we have already sent and what we should send next, and in either case Spark still reads everything from the RDBMS and only limits at the DataFrame level, so I don't think we should do anything here. @vinothchandar please advise further if I am wrong.

Contributor Author

@leesf and @vinothchandar I'm sure we cannot limit through the SQL query: MySQL and Postgres use SELECT * FROM XXX LIMIT 1, whereas Oracle uses SELECT * FROM XXX FETCH NEXT 1 ROWS and Derby uses SELECT * FROM XXX FETCH FIRST 10 ROWS ONLY.

Contributor
@leesf leesf Oct 28, 2019

Good point. Given the different SQL syntax across RDBMSes, I think it is OK to ignore sourceLimit if there is no better way.

Member

Limiting might be helpful to break down the load into smaller chunks. DBMSes don't usually like large scans, so having some ability to limit would actually be good.

@taherk77 how about having the ability to add a LIMIT clause depending on the JDBC endpoint? It should tell you if it's MySQL or Postgres (those two are very popular anyway, so having this working even for those two initially would be awesome).

Contributor Author

So if the driver class the user provides contains the MySQL or Postgres keyword, we should start applying a limit?

Member

Yes, usually the JDBC URL looks like jdbc:mysql: or jdbc:postgresql:, so you can just match on that. Also document this special handling.
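
A hedged sketch of what that match could look like; the method and variable names below are illustrative and not part of this PR, and the subquery-in-dbtable trick is just one way to apply the limit.

// Sketch: only MySQL and PostgreSQL are treated as LIMIT-capable for now.
private static boolean supportsLimitClause(String jdbcUrl) {
  String url = jdbcUrl.toLowerCase(Locale.ROOT);
  return url.startsWith("jdbc:mysql:") || url.startsWith("jdbc:postgresql:");
}

// Usage: wrap the table in a LIMIT subquery only when the dialect supports it.
String table = properties.getString(Config.RDBMS_TABLE_NAME);
String dbtable = (sourceLimit > 0 && supportsLimitClause(jdbcUrl))
    ? String.format("(SELECT * FROM %s LIMIT %d) limited_tbl", table, sourceLimit)
    : table;
dataFrameReader.option("dbtable", dbtable);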

Contributor Author

Limiting might be helpful to break down the load into smaller chunks. DBMSes don't usually like large scans, so having some ability to limit would actually be good.

@taherk77 how about having the ability to add a LIMIT clause depending on the JDBC endpoint? It should tell you if it's MySQL or Postgres (those two are very popular anyway, so having this working even for those two initially would be awesome).

Hi @vinothchandar, so do you mean that if the user sets the limit to 10, for Postgres and MySQL we should do select * from table limit 10?

I don't think that would work with the semantics we have here. In continuous mode with full scans, the JDBC source scans the whole table every interval.

In incremental mode we first do a full scan and write checkpoints; we then assume the column given for incremental pull is a long, int, or timestamp. If the incremental query fails, we fall back to full scans. How would a limit work here? It would always keep fetching the same records.

As for the job interval: that has not been implemented yet, as I do not have clarity on it. I want to know how we should do it; this needs further brainstorming on how to schedule the jobs.
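
For context, a rough sketch of the incremental path described above, filtering on the checkpointed column. Names such as fullDataset and lastCheckpoint are illustrative, not taken from the PR.

// Sketch: incremental pull keeps only rows whose incremental column is past the
// last checkpoint; a bare LIMIT on top of this (without an ORDER BY) could
// indeed return the same rows on every run.
Dataset<Row> incrementalRows = fullDataset.filter(
    functions.col(props.getString(Config.INCREMENTAL_COLUMN))
        .gt(functions.lit(lastCheckpoint)));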

Comment on lines 280 to 281
Assert.assertEquals(10, rowDataset.where("commit_time=000").count());
Assert.assertEquals(10, rowDataset.where("commit_time=001").count());
Contributor

rowDataset1?

Contributor Author

Will correct that.

@taherk77
Contributor Author

@leesf @vinothchandar Can we extract a DataFrameReader from a Dataset? It would be really helpful for testing properties if there is a way.

@taherk77
Contributor Author

@leesf @vinothchandar Travis failed again on the same module as before.
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test (default-test) on project hudi-client: Execution default-test of goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test failed: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
[ERROR] Command was /bin/sh -c cd /home/travis/build/apache/incubator-hudi/hudi-client && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -jar /home/travis/build/apache/incubator-hudi/hudi-client/target/surefire/surefirebooter7257868499234226311.jar /home/travis/build/apache/incubator-hudi/hudi-client/target/surefire/surefire6853623625226259050tmp /home/travis/build/apache/incubator-hudi/hudi-client/target/surefire/surefire_31524552303193339725tmp

@leesf
Contributor

leesf commented Oct 28, 2019

@leesf @vinothchandar Can we extract a DataFrameReader from a Dataset? It would be really helpful for testing properties if there is a way.

As far as I know, we can easily get a Dataset from a DataFrameReader, but I didn't find a way to get a DataFrameReader from a Dataset. Let me know if there is one.

@taherk77
Contributor Author

@leesf @vinothchandar All changes have been addressed and fixed. The Travis builds are failing because of random VM crashes. What else can we do?

@taherk77
Contributor Author

Is this good to go now?

@taherk77
Contributor Author

@leesf @vinothchandar All tests passing! 🥇

Member
@vinothchandar vinothchandar left a comment

A few comments. Looks almost ready!

.option(Config.RDBMS_TABLE_PROP, properties.getString(Config.RDBMS_TABLE_NAME));

if (properties.containsKey(Config.PASSWORD) && !StringUtils
.isNullOrEmpty(properties.getString(Config.PASSWORD))) {
Member

Maybe in some test setups? It might be good to allow that, actually.

.isNullOrEmpty(properties.getString(Config.PASSWORD_FILE))) {
LOG.info(
String.format("Reading JDBC password from password file %s", properties.getString(Config.PASSWORD_FILE)));
FileSystem fileSystem = FileSystem.get(new Configuration());
Member

Please use the Configuration object from the Spark context; otherwise it may not pick up settings placed there.
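
A minimal sketch of that change, assuming the source holds a SparkSession (the variable name is illustrative):

// Sketch: reuse the Hadoop Configuration carried by the Spark context so that
// fs.* settings (S3/GCS credentials, etc.) configured there are picked up.
FileSystem fileSystem = FileSystem.get(sparkSession.sparkContext().hadoopConfiguration());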

FileSystem fileSystem = FileSystem.get(new Configuration());
passwordFileStream = fileSystem.open(new Path(properties.getString(Config.PASSWORD_FILE)));
byte[] bytes = new byte[passwordFileStream.available()];
passwordFileStream.read(bytes);
Member

IIRC there is already a helper method in FileIOUtils for this? Can we reuse that?

Contributor Author

Here I need to read from a filesystem like HDFS, GCS, or S3, which is what the Hadoop FileSystem is useful for. FileIOUtils doesn't use FileSystem for reads.

Member

But you can pass the passwordFileStream to FileIOUtils.readAsByteArray, correct? It's just another InputStream.
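
Something along these lines, assuming FileIOUtils.readAsByteArray accepts an InputStream as suggested above (the exact signature is not shown in this thread; "password" is the standard Spark JDBC option key):

// Sketch: open the password file through the Hadoop FileSystem and let the
// existing helper read the bytes instead of relying on available()/read().
try (FSDataInputStream passwordFileStream =
         fileSystem.open(new Path(properties.getString(Config.PASSWORD_FILE)))) {
  byte[] bytes = FileIOUtils.readAsByteArray(passwordFileStream);
  dataFrameReader.option("password", new String(bytes, StandardCharsets.UTF_8).trim());
}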

String key = Arrays.asList(prop.split(Config.EXTRA_OPTIONS)).stream()
.collect(Collectors.joining());
String value = properties.getString(prop);
if (!StringUtils.isNullOrEmpty(value)) {
Member

Same here: should we allow empty?

Contributor Author

Sure, I will allow empty options.

}

@Override
protected Pair<Option<Dataset<Row>>, String> fetchNextBatch(Option<String> lastCkptStr, long sourceLimit) {
Member

Limiting might be helpful to break down the load into smaller chunks. DBMSes don't usually like large scans, so having some ability to limit would actually be good.

@taherk77 how about having the ability to add a LIMIT clause depending on the JDBC endpoint? It should tell you if it's MySQL or Postgres (those two are very popular anyway, so having this working even for those two initially would be awesome).

try {
if (isIncremental) {
Column incrementalColumn = rowDataset.col(props.getString(Config.INCREMENTAL_COLUMN));
final String max = rowDataset.agg(functions.max(incrementalColumn).cast(DataTypes.StringType)).first()
Member

This would make a second pass over rowDataset, right? Should we cache it so that we don't read from the database twice? Line 126 above will do one fetch and compute the checkpoint, then the rowDataset is passed back to the DeltaSync class and we start fetching rows to write to Hudi, which triggers another "recomputation". We should at least support an option to cache this dataset IMO.

Also, is there a way to do this via accumulators? (Could be tricky.) :D We can file a JIRA for later?

Contributor Author

Sure. For now, should we cache in the fetch() method, get the max for checkpointing, uncache, and then return the df?

Member

Yes, let's cache it in fetch() and add a parameter to control the persistence level, like we do in BloomIndex, please.

Contributor Author

hoodie.datasource.jdbc.storage.level="MEMORY_ONLY_SER" has been added to the props file to set the storage level. If no storage level is given by the user, we default to MEMORY_AND_DISK_SER.
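
A rough sketch of that flow, reading the storage level from the property above and defaulting to MEMORY_AND_DISK_SER; everything else (variable names, the exact place of unpersist) is illustrative rather than the PR's final code.

// Sketch: persist with a configurable storage level before the extra pass that
// computes the checkpoint, so the database is not scanned a second time.
StorageLevel level = StorageLevel.fromString(
    props.getString("hoodie.datasource.jdbc.storage.level", "MEMORY_AND_DISK_SER"));
Dataset<Row> cached = rowDataset.persist(level);

Column incrementalColumn = cached.col(props.getString(Config.INCREMENTAL_COLUMN));
String checkpoint = cached
    .agg(functions.max(incrementalColumn).cast(DataTypes.StringType))
    .first().getString(0);
// cached is handed back to DeltaSync for the write; unpersist after the commit.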


public static void cleanDerby() {
try {
Files.walk(Paths.get(DatabaseUtils.getDerbyDir()))
Member

There is already a helper in FileIOUtils for this.

hoodie.datasource.jdbc.table.incremental.pull.interval=5000
#Extra options for jdbc
hoodie.datasource.jdbc.extra.options.fetchsize=1000
hoodie.datasource.jdbc.extra.options.timestampFormat="yyyy-mm-dd hh:mm:ss"
Member

nit:newline

@taherk77
Contributor Author

@vinothchandar @leesf Changes addressed

@vinothchandar
Member

@taherk77 if you could resolve comments after addressing them, that would be very helpful for reviewing incrementally. Small tip :)

Member
@vinothchandar vinothchandar left a comment

@taherk77 I think this would make for a great blog post. You could provide an end-to-end example of how to bulk load and then keep incrementally ingesting, with code and command samples:
https://cwiki.apache.org/confluence/display/HUDI#ApacheHudi(Incubating)-How-toblogs

Let me know if you are interested. We can work out the permissions.

@taherk77
Contributor Author

taherk77 commented Nov 1, 2019

@taherk77 if you could resolve comments after addressing them, that would be very helpful for reviewing incrementally. Small tip :)

Apologies. Will keep in mind.

@pushpavanthar

pushpavanthar commented Nov 27, 2019

Hi @vinothchandar and @taherk77,
I would like to add two points to this feature to make it very generic:

  • We might need support for a combination of incrementing columns. Incrementing columns can be of the types below:
  1. Timestamp columns
  2. Auto-incrementing columns
  3. Timestamp + auto-incrementing columns.
    Instead of the code figuring out the incremental pull strategy, it would be better if the user provides it as config for each table.
    Considering timestamp incrementing columns, more than one column can contribute to this strategy. E.g., when a row is created, only the created_at column is set and updated_at is null by default; when the same row is updated, updated_at gets assigned a timestamp. In such cases it is wise to consider both columns in the query formation.
  • We need to sort rows according to the above-mentioned incrementing columns to fetch rows in chunks (you can make use of defaultFetchSize in MySQL). I am aware that sorting adds load on the database, but it helps in tracking the last pulled timestamp or auto-incrementing id and lets us retry/resume from the last recorded point. This will be a saviour during failures.

A sample MySQL query for incrementing timestamp columns (created_at and updated_at) might look like:
SELECT * FROM inventory.customers
WHERE COALESCE(inventory.customers.updated_at, inventory.customers.created_at) > $last_recorded_time
  AND COALESCE(inventory.customers.updated_at, inventory.customers.created_at) < $current_time
ORDER BY COALESCE(inventory.customers.updated_at, inventory.customers.created_at) ASC

@vinothchandar vinothchandar self-assigned this Dec 3, 2019
@vinothchandar
Member

@pushpavanthar Great suggestion.

Let me see if we can structure this solution a bit more. Just supporting raw SQL as input for extracting the data, with the Hudi checkpoint simply being a list of string replacements in a template SQL, could provide a lot of flexibility.

Taking the same example from above, the user specifies the following SQL (we can blog and document this well):

hoodie.datasource.jdbc.sql=SELECT COALESCE(inventory.customers.updated_at,inventory.customers.created_at) as created_updated_at, inventory.customers.user_id as user_id, * FROM inventory.customers WHERE created_updated_at > ${1} AND created_updated_at < ${1} AND user_id  > ${2}  ORDER BY created_updated_at ASC
hoodie.datasource.jdbc.incremental.column.names=created_updated_at, user_id
hoodie.datasource.jdbc.incremental.column.funcs=max, min
hoodie.datasource.jdbc.bulkload.sql=<sql to load it once initially or we could use some all inclusive filters for column names like user_id > 0 etc >

The Hudi checkpoint is a list of string values, one for each of the incremental column names, e.g. 2019113048384, 1001 (a timestamp and a user_id). We simply replace ${1} with 2019113048384 and ${2} with the user_id (the second checkpoint value), execute the SQL, and then use the column funcs to derive the next checkpoint values off the fetched dataset. I would prefer to keep this computation out of the database and in Spark (for the same reason of avoiding extra load on the database).
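
A toy sketch of that substitution step, purely to illustrate the proposal above; the method name and checkpoint values are illustrative and nothing here exists in the PR.

// Sketch: replace ${1}, ${2}, ... in the templated SQL with the checkpoint
// values carried in the commit metadata, in order.
static String bindCheckpoints(String templateSql, List<String> checkpointValues) {
  String sql = templateSql;
  for (int i = 0; i < checkpointValues.size(); i++) {
    sql = sql.replace("${" + (i + 1) + "}", checkpointValues.get(i));
  }
  return sql;
}

// e.g. bindCheckpoints(props.getString("hoodie.datasource.jdbc.sql"),
//      Arrays.asList("2019113048384", "1001"));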

All this said, I want to get a basic version working and checked in first :)
@taherk77 where are we at with this PR at the moment? Are you actively working on it?

@pushpavanthar

@vinothchandar Thanks.
It would be great if we document this in a design doc and proceed from there. I have been using a JDBC incremental puller as one of the sources to Apache Hudi at work, and I am very excited about this feature.
In my opinion, the user shouldn't have to be aware of the query unless it is a special case (very rare). All incremental pulls follow the same query template, into which the checkpointed values are substituted.
However, I would like to understand where this part of Hudi maintains its checkpoint/state data. If it is a filesystem, are we going to provide the filesystem path as config? Or is it an external state store?
If you can point me to the doc for this feature, I would like to add my thoughts to it.

@vinothchandar
Member

@pushpavanthar I was out for a conference, so picking this back up. I am also pretty interested in this feature. I think having a basic usable version (even if not perfect) will go a long way.

Hudi will store checkpoints as part of the commit metadata and supply them at the start of every batch to the source, to be substituted into the SQL. There is no ticket open for you to add your thoughts to; feel free to open a new JIRA with your ideas.

@taherk77 I unassigned the JIRA HUDI-251 since I was not sure whether you were still working on this. At this point, @pushpavanthar or anyone else interested in taking this forward, please go ahead.

@vinothchandar vinothchandar changed the title [HUDI-251] JDBC incremental load to HUDI DeltaStreamer [WIP] [HUDI-251] JDBC incremental load to HUDI DeltaStreamer Dec 18, 2019
@vinothchandar
Member

Closing due to inactivity
