Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-1790] Added SqlSource to fetch data from any partitions for backfill use case #2896

Merged
merged 3 commits into from Jun 10, 2021

Conversation

vingov
Copy link
Contributor

@vingov vingov commented Apr 29, 2021

Tips

What is the purpose of the pull request

This pull request adds a new source to delta streamer, to perform snapshot queries mainly used for backfilling historical partitions.

Brief change log

  • Added a new SqlSource to delta streamer to handle backfills for any specific date range snapshot queries.

Verify this pull request

This change added tests and can be verified as follows:

  • Added TestSqlSource to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codecov-commenter
Copy link

codecov-commenter commented Apr 29, 2021

Codecov Report

Merging #2896 (9ac435c) into master (afbafe7) will decrease coverage by 46.91%.
The diff coverage is 0.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #2896       +/-   ##
============================================
- Coverage     55.35%   8.43%   -46.92%     
+ Complexity     4025      62     -3963     
============================================
  Files           520      70      -450     
  Lines         25291    2880    -22411     
  Branches       2872     359     -2513     
============================================
- Hits          13999     243    -13756     
+ Misses         9905    2616     -7289     
+ Partials       1387      21     -1366     
Flag Coverage Δ
hudicli ?
hudiclient ?
hudicommon ?
hudiflink ?
hudihadoopmr ?
hudisparkdatasource ?
hudisync 6.79% <ø> (-44.66%) ⬇️
huditimelineservice ?
hudiutilities 9.09% <0.00%> (-61.97%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...a/org/apache/hudi/utilities/sources/SqlSource.java 0.00% <0.00%> (ø)
...va/org/apache/hudi/utilities/IdentitySplitter.java 0.00% <0.00%> (-100.00%) ⬇️
...va/org/apache/hudi/utilities/schema/SchemaSet.java 0.00% <0.00%> (-100.00%) ⬇️
...a/org/apache/hudi/utilities/sources/RowSource.java 0.00% <0.00%> (-100.00%) ⬇️
.../org/apache/hudi/utilities/sources/AvroSource.java 0.00% <0.00%> (-100.00%) ⬇️
.../org/apache/hudi/utilities/sources/JsonSource.java 0.00% <0.00%> (-100.00%) ⬇️
...rg/apache/hudi/utilities/sources/CsvDFSSource.java 0.00% <0.00%> (-100.00%) ⬇️
...g/apache/hudi/utilities/sources/JsonDFSSource.java 0.00% <0.00%> (-100.00%) ⬇️
...apache/hudi/utilities/sources/JsonKafkaSource.java 0.00% <0.00%> (-100.00%) ⬇️
...pache/hudi/utilities/sources/ParquetDFSSource.java 0.00% <0.00%> (-100.00%) ⬇️
... and 498 more

@vingov
Copy link
Contributor Author

vingov commented Apr 29, 2021

Hi @n3nash, Can you please review this PR since you have more context on this feature?

@vinothchandar vinothchandar self-assigned this May 6, 2021
@vinothchandar vinothchandar added this to Ready for Review in PR Tracker Board May 6, 2021
@nsivabalan nsivabalan added the priority:minor everything else; usability gaps; questions; feature reqs label May 11, 2021
public void testSqlSource() throws IOException {
UtilitiesTestBase.dfs.mkdirs(new Path(dfsRoot));
TypedProperties props = new TypedProperties();
props.setProperty("hoodie.deltastreamer.source.sql", "select * from test_sql_table");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor. Try to see if we can reuse the variable declared in source code rather than hardcoding the config key here. we could avoid any typos.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable declared in the source is private, hence hardcoded it here, but to avoid typos made it as a final variable and used it everywhere.

PR Tracker Board automation moved this from Ready for Review to Nearing Landing May 19, 2021
@nsivabalan
Copy link
Contributor

@vingov : hey vinoth. Did you get a chance to check out my feedback. We can merge this in once addressed.

@vingov
Copy link
Contributor Author

vingov commented Jun 9, 2021

@nsivabalan - I've addressed all the review comments, please review it again, thanks!

@vingov
Copy link
Contributor Author

vingov commented Jun 10, 2021

The test failures are not related to this change, its on hudi-client/hudi-spark-client modules.

@nsivabalan
Copy link
Contributor

nsivabalan commented Jun 10, 2021

awesome. LGTM. Have added the test flakiness to tracking ticket: https://issues.apache.org/jira/browse/HUDI-1989
Have re-triggered CI. lets see. if it succeeds, will merge this in.

@nsivabalan nsivabalan merged commit 9e4114d into apache:master Jun 10, 2021
PR Tracker Board automation moved this from Nearing Landing to Done Jun 10, 2021
@nsivabalan
Copy link
Contributor

Thanks for your contribution :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority:minor everything else; usability gaps; questions; feature reqs
Projects
Development

Successfully merging this pull request may close these issues.

None yet

4 participants