
[WIP] [HUDI-251] JDBC incremental load to HUDI DeltaStreamer #969

Closed
wants to merge 12 commits into from

Conversation

taherk77
Contributor

No description provided.

@taherk77
Contributor Author

@vinothchandar @leesf Travis failed on modules that I didn't touch. Any idea how to restart the Travis build?

@taherk77
Contributor Author

Also, another thing we need to test and implement here is continuous pull, where the user gives an interval and, after every interval, DeltaStreamer pulls from the RDBMS until it is terminated by the user.

@leesf
Contributor

leesf commented Oct 24, 2019

The "The forked VM terminated without properly saying goodbye. VM crash or System.exit called" error occurred again. I have seen it in my local dev environment sometimes and will investigate when I get a chance.
Currently only committers or PMC members can restart the Travis build.

.option(Config.RDBMS_TABLE_PROP, properties.getString(Config.RDBMS_TABLE_NAME));

if (properties.containsKey(Config.PASSWORD) && !StringUtils
.isNullOrEmpty(properties.getString(Config.PASSWORD))) {
Contributor
@leesf leesf Oct 24, 2019

Maybe the value of Config.PASSWORD could be empty in some cases.

Contributor Author

@vinothchandar Please advise: should we allow an empty password?

Member

Maybe in some test setups? It might be good to allow that, actually.

Comment on lines 89 to 93
String[] split = prop.split("\\.");
String key = split[split.length - 1];
String value = properties.getString(prop);
LOG.info(String.format("Adding %s -> %s to jdbc options", key, value));
dataFrameReader.option(key, value);
Contributor

What if an extra option is configured as EXTRA_OPTIONS + "a.b" = value? This will add b = value instead of a.b = value.

Contributor Author

I don't think it will be a.b; it will always be a = value.

Contributor

I don't think it will be a.b; it will always be a = value.

I find it will add b = value; correct me if I am wrong.

Contributor Author

@leesf I researched the options further. All the options I saw were lowerBound, upperBound, numPartitions, etc., which fit well with what we do above. However, I came across OracleIntegrationSuite.scala, which uses the oracle.jdbc.mapDateToTimestamp property, so I think we will have to change the code to support this.

@vinothchandar any further comments on this?
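
For reference, a minimal sketch of the alternative being discussed: strip only the EXTRA_OPTIONS prefix instead of splitting on ".", so a dotted key such as oracle.jdbc.mapDateToTimestamp survives intact. Config, TypedProperties, StringUtils and LOG are assumed from the PR context; this is a sketch, not the committed code.

// Sketch: keep everything after the extra-options prefix, dots included,
// so "oracle.jdbc.mapDateToTimestamp" is passed to the reader unchanged.
private static void addExtraJdbcOptions(TypedProperties properties, DataFrameReader dataFrameReader) {
  for (Object k : properties.keySet()) {
    String prop = k.toString();
    if (prop.startsWith(Config.EXTRA_OPTIONS) && prop.length() > Config.EXTRA_OPTIONS.length()) {
      String key = prop.substring(Config.EXTRA_OPTIONS.length());
      String value = properties.getString(prop);
      LOG.info(String.format("Adding %s -> %s to jdbc options", key, value));
      dataFrameReader.option(key, value);
    }
  }
}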

addExtraJdbcOptions(properties, dataFrameReader);

if (properties.containsKey(Config.IS_INCREMENTAL) && StringUtils
.isNullOrEmpty(properties.getString(Config.IS_INCREMENTAL))) {
Contributor

Maybe check whether the value equals true?

Contributor Author

I don't understand what you are asking here.

Contributor Author

@leesf I understand what you're saying. I shall make that change.

}

@Override
protected Pair<Option<Dataset<Row>>, String> fetchNextBatch(Option<String> lastCkptStr, long sourceLimit) {
Contributor
@leesf leesf Oct 24, 2019

How about using sourceLimit to limit the number of records read from the RDBMS? That is, select xxx from xxx limit sourceLimit, or set fetchsize = sourceLimit on the dataFrameReader?

Contributor Author

@leesf The problem with that is that not all databases support the LIMIT clause. Some support LIMIT, some use TOP; the syntax differs. So instead we just pull everything.

Contributor Author

We could limit the DataFrame if needed, but that just adds complexity for the user. Between batches we would then have to track what we have already sent and what we should send next, and in either case Spark still reads everything from the RDBMS and only limits at the DataFrame level, so I don't think we should do anything here. @vinothchandar please advise further if I am wrong.

Contributor Author

@leesf and @vinothchandar I'm sure we cannot limit through the SQL query: MySQL and Postgres use SELECT * FROM XXX LIMIT 1, whereas Oracle uses SELECT * FROM XXX FETCH NEXT 1 ROWS and Derby uses SELECT * FROM XXX FETCH FIRST 10 ROWS ONLY.

Contributor
@leesf leesf Oct 28, 2019

Good point. Given the different SQL syntax across RDBMSes, I think it is OK to ignore sourceLimit if there is no better way.

Member

Limiting might be helpful to break down the load into smaller chunks. DBMSes don't usually like large scans, so having some ability to limit would actually be good.

@taherk77 how about having the ability to add a LIMIT clause depending on the JDBC endpoint? It should tell you if it's MySQL or Postgres (those two are very popular anyway, so having this working even for those two initially would be awesome).

Contributor Author

So if the driver class the user provides contains the MySQL or Postgres keyword, we should start applying a limit?

Member

Yes, usually the JDBC URL looks like jdbc:mysql: or jdbc:postgresql:, so you can just match on that. Also document this special handling.
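
A hedged sketch of what that match could look like; the method and variable names below are illustrative and not part of this PR, and the subquery-in-dbtable trick is just one way to apply the limit.

// Sketch: only MySQL and PostgreSQL are treated as LIMIT-capable for now.
private static boolean supportsLimitClause(String jdbcUrl) {
  String url = jdbcUrl.toLowerCase(Locale.ROOT);
  return url.startsWith("jdbc:mysql:") || url.startsWith("jdbc:postgresql:");
}

// Usage: wrap the table in a LIMIT subquery only when the dialect supports it.
String table = properties.getString(Config.RDBMS_TABLE_NAME);
String dbtable = (sourceLimit > 0 && supportsLimitClause(jdbcUrl))
    ? String.format("(SELECT * FROM %s LIMIT %d) limited_tbl", table, sourceLimit)
    : table;
dataFrameReader.option("dbtable", dbtable);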

Contributor Author

Limiting might be helpful to break down the load into smaller chunks. DBMSes don't usually like large scans, so having some ability to limit would actually be good.

@taherk77 how about having the ability to add a LIMIT clause depending on the JDBC endpoint? It should tell you if it's MySQL or Postgres (those two are very popular anyway, so having this working even for those two initially would be awesome).

Hi @vinothchandar, so do you mean that if the user sets the limit to 10, for Postgres and MySQL we should do select * from table limit 10?

I don't think that would work with the semantics we have here. In continuous mode with full scans, the JDBC source scans the whole table every interval.

In incremental mode we first do a full scan and write checkpoints; we then assume the column given for incremental pull is a long, int, or timestamp. If the incremental query fails, we fall back to full scans. How would a limit work here? It would always keep fetching the same records.

As for the job interval: that has not been implemented yet, as I do not have clarity on it. I want to know how we should do it; this needs further brainstorming on how to schedule the jobs.
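
For context, a rough sketch of the incremental path described above, filtering on the checkpointed column. Names such as fullDataset and lastCheckpoint are illustrative, not taken from the PR.

// Sketch: incremental pull keeps only rows whose incremental column is past the
// last checkpoint; a bare LIMIT on top of this (without an ORDER BY) could
// indeed return the same rows on every run.
Dataset<Row> incrementalRows = fullDataset.filter(
    functions.col(props.getString(Config.INCREMENTAL_COLUMN))
        .gt(functions.lit(lastCheckpoint)));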

Comment on lines 280 to 281
Assert.assertEquals(10, rowDataset.where("commit_time=000").count());
Assert.assertEquals(10, rowDataset.where("commit_time=001").count());
Contributor

rowDataset1?

Contributor Author

Will correct that.

@taherk77
Contributor Author

@leesf @vinothchandar Can we extract a DataFrameReader from a Dataset? It would be really helpful for testing properties if there is a way.

@taherk77
Contributor Author

@leesf @vinothchandar Travis failed again on the same module as before.
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test (default-test) on project hudi-client: Execution default-test of goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test failed: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
[ERROR] Command was /bin/sh -c cd /home/travis/build/apache/incubator-hudi/hudi-client && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -jar /home/travis/build/apache/incubator-hudi/hudi-client/target/surefire/surefirebooter7257868499234226311.jar /home/travis/build/apache/incubator-hudi/hudi-client/target/surefire/surefire6853623625226259050tmp /home/travis/build/apache/incubator-hudi/hudi-client/target/surefire/surefire_31524552303193339725tmp

@leesf
Contributor

leesf commented Oct 28, 2019

@leesf @vinothchandar Can we extract a DataFrameReader from a Dataset? It would be really helpful for testing properties if there is a way.

As far as I know, we can easily get a Dataset from a DataFrameReader, but I didn't find a way to get a DataFrameReader from a Dataset. Let me know if there is one.

@taherk77
Contributor Author

@leesf @vinothchandar All changes have been addressed and fixed. The Travis builds are failing because of random VM crashes. What else can we do?

@taherk77
Contributor Author

Is this good to go now?

@taherk77
Contributor Author

@leesf @vinothchandar All tests passing! 🥇

Member
@vinothchandar vinothchandar left a comment

A few comments. Looks almost ready!

.option(Config.RDBMS_TABLE_PROP, properties.getString(Config.RDBMS_TABLE_NAME));

if (properties.containsKey(Config.PASSWORD) && !StringUtils
.isNullOrEmpty(properties.getString(Config.PASSWORD))) {
Member

Maybe in some test setups? It might be good to allow that, actually.

.isNullOrEmpty(properties.getString(Config.PASSWORD_FILE))) {
LOG.info(
String.format("Reading JDBC password from password file %s", properties.getString(Config.PASSWORD_FILE)));
FileSystem fileSystem = FileSystem.get(new Configuration());
Member

Please use the Configuration object from the Spark context; otherwise it may not pick up settings placed there.
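
A minimal sketch of that change, assuming the source holds a SparkSession (the variable name is illustrative):

// Sketch: reuse the Hadoop Configuration carried by the Spark context so that
// fs.* settings (S3/GCS credentials, etc.) configured there are picked up.
FileSystem fileSystem = FileSystem.get(sparkSession.sparkContext().hadoopConfiguration());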

FileSystem fileSystem = FileSystem.get(new Configuration());
passwordFileStream = fileSystem.open(new Path(properties.getString(Config.PASSWORD_FILE)));
byte[] bytes = new byte[passwordFileStream.available()];
passwordFileStream.read(bytes);
Member

IIRC there is already a helper method in FileIOUtils for this? Can we reuse that?

Contributor Author

Here I need to read from a filesystem like HDFS, GCS, or S3, which is what the Hadoop FileSystem is useful for. FileIOUtils doesn't use FileSystem for reads.

Member

But you can pass the passwordFileStream to FileIOUtils.readAsByteArray, correct? It's just another InputStream.
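
Something along these lines, assuming FileIOUtils.readAsByteArray accepts an InputStream as suggested above (the exact signature is not shown in this thread; "password" is the standard Spark JDBC option key):

// Sketch: open the password file through the Hadoop FileSystem and let the
// existing helper read the bytes instead of relying on available()/read().
try (FSDataInputStream passwordFileStream =
         fileSystem.open(new Path(properties.getString(Config.PASSWORD_FILE)))) {
  byte[] bytes = FileIOUtils.readAsByteArray(passwordFileStream);
  dataFrameReader.option("password", new String(bytes, StandardCharsets.UTF_8).trim());
}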

String key = Arrays.asList(prop.split(Config.EXTRA_OPTIONS)).stream()
.collect(Collectors.joining());
String value = properties.getString(prop);
if (!StringUtils.isNullOrEmpty(value)) {
Member

Same here: should we allow empty?

Contributor Author

Sure, I will allow empty options.

}

@Override
protected Pair<Option<Dataset<Row>>, String> fetchNextBatch(Option<String> lastCkptStr, long sourceLimit) {
Member

Limiting might be helpful to break down the load into smaller chunks. DBMSes don't usually like large scans, so having some ability to limit would actually be good.

@taherk77 how about having the ability to add a LIMIT clause depending on the JDBC endpoint? It should tell you if it's MySQL or Postgres (those two are very popular anyway, so having this working even for those two initially would be awesome).

try {
if (isIncremental) {
Column incrementalColumn = rowDataset.col(props.getString(Config.INCREMENTAL_COLUMN));
final String max = rowDataset.agg(functions.max(incrementalColumn).cast(DataTypes.StringType)).first()
Member

This would make a second pass over rowDataset, right? Should we cache it so that we don't read from the database twice? Line 126 above will do one fetch and compute the checkpoint, then the rowDataset is passed back to the DeltaSync class and we start fetching rows to write to Hudi, which triggers another "recomputation". We should at least support an option to cache this dataset IMO.

Also, is there a way to do this via accumulators? (Could be tricky.) :D We can file a JIRA for later?

Contributor Author

Sure. For now, should we cache in the fetch() method, get the max for checkpointing, uncache, and then return the df?

Member

Yes, let's cache it in fetch() and add a parameter to control the persistence level, like we do in BloomIndex, please.

Contributor Author

hoodie.datasource.jdbc.storage.level="MEMORY_ONLY_SER" has been added to the props file to set the storage level. If no storage level is given by the user, we default to MEMORY_AND_DISK_SER.
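
A rough sketch of that flow, reading the storage level from the property above and defaulting to MEMORY_AND_DISK_SER; everything else (variable names, the exact place of unpersist) is illustrative rather than the PR's final code.

// Sketch: persist with a configurable storage level before the extra pass that
// computes the checkpoint, so the database is not scanned a second time.
StorageLevel level = StorageLevel.fromString(
    props.getString("hoodie.datasource.jdbc.storage.level", "MEMORY_AND_DISK_SER"));
Dataset<Row> cached = rowDataset.persist(level);

Column incrementalColumn = cached.col(props.getString(Config.INCREMENTAL_COLUMN));
String checkpoint = cached
    .agg(functions.max(incrementalColumn).cast(DataTypes.StringType))
    .first().getString(0);
// cached is handed back to DeltaSync for the write; unpersist after the commit.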


public static void cleanDerby() {
try {
Files.walk(Paths.get(DatabaseUtils.getDerbyDir()))
Member

There is already a helper in FileIOUtils for this.

hoodie.datasource.jdbc.table.incremental.pull.interval=5000
#Extra options for jdbc
hoodie.datasource.jdbc.extra.options.fetchsize=1000
hoodie.datasource.jdbc.extra.options.timestampFormat="yyyy-mm-dd hh:mm:ss"
Member

nit:newline

@taherk77
Contributor Author

@vinothchandar @leesf Changes addressed

@vinothchandar
Member

@taherk77 if you could resolve comments after addressing them, that would be very helpful for reviewing incrementally. Small tip :)

Member
@vinothchandar vinothchandar left a comment

@taherk77 I think this would make for a great blog post. You could provide an end-to-end example of how to bulk load and then keep incrementally ingesting, with code and command samples:
https://cwiki.apache.org/confluence/display/HUDI#ApacheHudi(Incubating)-How-toblogs

Let me know if you are interested. We can work out the permissions.

@taherk77
Contributor Author

taherk77 commented Nov 1, 2019

@taherk77 if you could resolve comments after addressing them, that would be very helpful for reviewing incrementally. Small tip :)

Apologies. Will keep in mind.

@pushpavanthar

pushpavanthar commented Nov 27, 2019

Hi @vinothchandar and @taherk77,
I would like to add two points to this feature to make it very generic:

  • We might need support for a combination of incrementing columns. Incrementing columns can be of the types below:
  1. Timestamp columns
  2. Auto-incrementing columns
  3. Timestamp + auto-incrementing columns.
    Instead of the code figuring out the incremental pull strategy, it would be better if the user provides it as config for each table.
    Considering timestamp incrementing columns, more than one column can contribute to this strategy. E.g., when a row is created, only the created_at column is set and updated_at is null by default; when the same row is updated, updated_at gets assigned a timestamp. In such cases it is wise to consider both columns in the query formation.
  • We need to sort rows according to the above-mentioned incrementing columns to fetch rows in chunks (you can make use of defaultFetchSize in MySQL). I am aware that sorting adds load on the database, but it helps in tracking the last pulled timestamp or auto-incrementing id and lets us retry/resume from the last recorded point. This will be a saviour during failures.

A sample MySQL query for incrementing timestamp columns (created_at and updated_at) might look like:
SELECT * FROM inventory.customers
WHERE COALESCE(inventory.customers.updated_at, inventory.customers.created_at) > $last_recorded_time
  AND COALESCE(inventory.customers.updated_at, inventory.customers.created_at) < $current_time
ORDER BY COALESCE(inventory.customers.updated_at, inventory.customers.created_at) ASC

@vinothchandar vinothchandar self-assigned this Dec 3, 2019
@vinothchandar
Member

@pushpavanthar Great suggestion.

Let me see if we can structure this solution a bit more. Just supporting raw SQL as input for extracting the data, with the Hudi checkpoint simply being a list of string replacements in a template SQL, could provide a lot of flexibility.

Taking the same example from above, the user specifies the following SQL (we can blog and document this well):

hoodie.datasource.jdbc.sql=SELECT COALESCE(inventory.customers.updated_at,inventory.customers.created_at) as created_updated_at, inventory.customers.user_id as user_id, * FROM inventory.customers WHERE created_updated_at > ${1} AND created_updated_at < ${1} AND user_id  > ${2}  ORDER BY created_updated_at ASC
hoodie.datasource.jdbc.incremental.column.names=created_updated_at, user_id
hoodie.datasource.jdbc.incremental.column.funcs=max, min
hoodie.datasource.jdbc.bulkload.sql=<sql to load it once initially or we could use some all inclusive filters for column names like user_id > 0 etc >

The Hudi checkpoint is a list of string values, one for each of the incremental column names, e.g. 2019113048384, 1001 (a timestamp and a user_id). We simply replace ${1} with 2019113048384 and ${2} with the user_id (the second checkpoint value), execute the SQL, and then use the column funcs to derive the next checkpoint values off the fetched dataset. I would prefer to keep this computation out of the database and in Spark (for the same reason of avoiding extra load on the database).
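
A toy sketch of that substitution step, purely to illustrate the proposal above; the method name and checkpoint values are illustrative and nothing here exists in the PR.

// Sketch: replace ${1}, ${2}, ... in the templated SQL with the checkpoint
// values carried in the commit metadata, in order.
static String bindCheckpoints(String templateSql, List<String> checkpointValues) {
  String sql = templateSql;
  for (int i = 0; i < checkpointValues.size(); i++) {
    sql = sql.replace("${" + (i + 1) + "}", checkpointValues.get(i));
  }
  return sql;
}

// e.g. bindCheckpoints(props.getString("hoodie.datasource.jdbc.sql"),
//      Arrays.asList("2019113048384", "1001"));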

All this said, I want to get a basic version working and checked in first :)
@taherk77 where are we at with this PR at the moment? Are you actively working on it?

@pushpavanthar

@vinothchandar Thanks.
It would be great if we document this in a design doc and proceed from there. I have been using a JDBC incremental puller as one of the sources to Apache Hudi at work, and I am very excited about this feature.
In my opinion, the user shouldn't have to be aware of the query unless it is a special case (very rare). All incremental pulls follow the same query template, into which the checkpointed values are substituted.
However, I would like to understand where this part of Hudi maintains its checkpoint/state data. If it is a filesystem, are we going to provide the filesystem path as config? Or is it an external state store?
If you can point me to the doc for this feature, I would like to add my thoughts to it.

@vinothchandar
Member

@pushpavanthar I was out for a conference, so picking this back up. I am also pretty interested in this feature. I think having a basic usable version (even if not perfect) will go a long way.

Hudi will store checkpoints as part of the commit metadata and supply them at the start of every batch to the source, to be substituted into the SQL. There is no ticket open for you to add your thoughts to; feel free to open a new JIRA with your ideas.

@taherk77 I unassigned the JIRA HUDI-251 since I was not sure whether you were still working on this. At this point, @pushpavanthar or anyone else interested in taking this forward, please go ahead.

@vinothchandar vinothchandar changed the title [HUDI-251] JDBC incremental load to HUDI DeltaStreamer [WIP] [HUDI-251] JDBC incremental load to HUDI DeltaStreamer Dec 18, 2019
@vinothchandar
Member

Closing due to inactivity
