Skip to content
This repository has been archived by the owner on May 12, 2021. It is now read-only.

APEXMALHAR-2172: Updates to JDBC Poll Input Operator #358

Merged
merged 1 commit into from
Aug 11, 2016

Conversation

DT-Priyanka
Copy link
Contributor

Changes include:

  1. Using jooq query to construct sql queries
  2. Update logic to query based on row numbers instead of primary key column values
  3. Small fixes
  4. Code refactoring to make it more readable
  5. Update operator to emit pojo class constructed from table input fields
  6. Update to test cases, to abstract out generic test code and add more tests

@DT-Priyanka
Copy link
Contributor Author

@bhupeshchawda and @devtagare please review

import org.slf4j.LoggerFactory;

import org.apache.apex.malhar.lib.wal.FSWindowDataManager;
import org.apache.apex.malhar.lib.wal.WindowDataManager;
import org.apache.commons.lang3.tuple.MutablePair;
import org.apache.hadoop.classification.InterfaceStability.Evolving;

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple file headers and import statements mixed.

@tweise
Copy link
Contributor

tweise commented Jul 30, 2016

How does 2. work? The query needs to be repeatable, so you will still need to sort on the primary key?

@tweise
Copy link
Contributor

tweise commented Jul 30, 2016

The jooq license isn't one of those listed here:

http://www.apache.org/legal/resolved.html#category-a

Where is the analysis on license compatibility?

@bhupeshchawda
Copy link
Contributor

Building the range query is currently based on limit and offset.
This kind of querying might create problems since the order of fetching data might change depending on various factors like presence of indexes on that column or also might depend on what plan is generated for the query. The output being a set, we cannot assume the order. In order for it to be repeatable, we have the following options:

  1. Explicitly use "order by" with limit and offset. - Might be costly for larger tables. Although this is worth exploring as we can assume a unique key index on this column and improve the performance.
  2. Explicitly identify bounds for the query and change the query format to where key >= low and key <= high. This should work in a repeatable manner; However, the performance here is again questionable in absence of indexes. Perhaps we'll need to assume an index in this case as well for performance.
  3. Use the current query (limit and offset) using something like RowNumber as in oracle databases. However, need to check if such functionality is present in other databases.

@DT-Priyanka
Copy link
Contributor Author

Reading further, I guess right thing to do will be use "order by" and also avoid using "offset".

Here are some good reads about it:
http://www.iheavy.com/2013/06/19/3-ways-to-optimize-for-paging-in-mysql/
http://www.slideshare.net/slideshow/view?login=Eweaver&preview=no&slideid=1&title=efficient-pagination-using-mysql

I will check if I can modify our query to be more efficient, also would clarify assumptions about "key" column if any.

@DT-Priyanka DT-Priyanka force-pushed the APEXMALHAR-2172-jdbc-poller-input branch 2 times, most recently from 8e962ad to e7b87c2 Compare August 1, 2016 10:18
@DT-Priyanka
Copy link
Contributor Author

Thomas, using "order by" key column in queries to ensure right results.

Would evaluate options to optimize performance by avoiding use of "offset"

protected Integer lower;
protected transient boolean isPolled;
protected transient Integer lastPolledBound;
protected transient Integer lastEmittedRecord;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can lower, lastPolledBound and lastEmittedRecord be int instead of Integer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am doing a null check and those variables are compared and assigned to each other so won't do that.

@tweise
Copy link
Contributor

tweise commented Aug 3, 2016

What if the user wants to use RDBMS specific SQL?

@DT-Priyanka
Copy link
Contributor Author

Thomas,
In JDBCPojoInputOperator, we do allow user to directly specify raw query, but we haven't done same here, it would be little complicated considering we are firing range queries here. If user doesn't specify such range queries our operator can misbehave.

@DT-Priyanka DT-Priyanka force-pushed the APEXMALHAR-2172-jdbc-poller-input branch 2 times, most recently from 00e7020 to 8113c44 Compare August 5, 2016 09:15
private transient ScheduledExecutorService scanService;
protected transient boolean isPolled;
protected transient Integer lastPolledBound;
protected transient Integer lastEmittedRecord;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make lowerBound, lastPolledBound and lastEmittedRecord as int instead of Integer


/**
* A concrete implementation for {@link AbstractJdbcPollInputOperator}} for
* A concrete implementation for {@link AbstractJdbcPollInputOperator} for
* consuming data from MySQL using JDBC interface <br>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove explicit mention for MySQL? I think it is generic as of now..

@bhupeshchawda
Copy link
Contributor

@DT-Priyanka I have a few outstanding comments. Please address them.
Additionally also add documentation to data members and getters and setters along with proper doc tags. Fix documentation for refactored classes.
Also, please check a few places where store.disconnect() is not called in case of an exception.

@DT-Priyanka DT-Priyanka force-pushed the APEXMALHAR-2172-jdbc-poller-input branch 3 times, most recently from 5526efa to fae2c8b Compare August 9, 2016 09:54
@DT-Priyanka DT-Priyanka closed this Aug 9, 2016
@DT-Priyanka DT-Priyanka reopened this Aug 9, 2016
do {
boolean submitted = false;
while (!submitted) {
submitted = emitQueue.offer(getTuple(result));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while(!emitQueue.offer(getTuple(result))) { Thread.sleep(10); } ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also 1000 sounds too much?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I change it to 500, as it's our default window size?

@DT-Priyanka
Copy link
Contributor Author

Sorry had missed couple of comments before, addressing remaining comments.

@DT-Priyanka DT-Priyanka force-pushed the APEXMALHAR-2172-jdbc-poller-input branch 5 times, most recently from f5a231a to 2ebcdd8 Compare August 10, 2016 09:43
@bhupeshchawda
Copy link
Contributor

Looks good to me. If there are no other comments, will merge by tomorrow.

@DT-Priyanka DT-Priyanka force-pushed the APEXMALHAR-2172-jdbc-poller-input branch from 2ebcdd8 to 26fa9d7 Compare August 11, 2016 10:38
@DT-Priyanka
Copy link
Contributor Author

File Splitter test is failing reopening to re-run tests.

@DT-Priyanka DT-Priyanka reopened this Aug 11, 2016
@asfgit asfgit merged commit 26fa9d7 into apache:master Aug 11, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
4 participants