
[HUDI-1850][HUDI-3234] Fixing read of a empty table but with failed write #2903

Merged
merged 4 commits into apache:master Jan 23, 2022

Conversation

nsivabalan
Contributor

@nsivabalan nsivabalan commented Apr 30, 2021

What is the purpose of the pull request

Fixed read of an empty table (failed write) with proper information.

Brief change log

  • Fixed read of an empty table (failed write) with proper information.

Verify this pull request

  • Manually verified the change by running a job locally.

Stacktrace before the fix:
See the attached JIRA.

Result after the fix:

val df = spark.read.format("hudi").load("/tmp/hudi_trips_cow")
df: org.apache.spark.sql.DataFrame = []

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codecov-commenter

codecov-commenter commented Apr 30, 2021

Codecov Report

Merging #2903 (712b446) into master (16e90d3) will decrease coverage by 20.21%.
The diff coverage is n/a.

Impacted file tree graph

@@              Coverage Diff              @@
##             master    #2903       +/-   ##
=============================================
- Coverage     47.62%   27.41%   -20.22%     
+ Complexity     5502     1285     -4217     
=============================================
  Files           930      381      -549     
  Lines         41268    15107    -26161     
  Branches       4137     1304     -2833     
=============================================
- Hits          19655     4141    -15514     
+ Misses        19865    10667     -9198     
+ Partials       1748      299     -1449     
Flag Coverage Δ
hudicli ?
hudiclient 21.05% <ø> (-13.54%) ⬇️
hudicommon ?
hudiflink ?
hudihadoopmr ?
hudisparkdatasource ?
hudisync 5.28% <ø> (-49.20%) ⬇️
huditimelineservice ?
hudiutilities 58.60% <ø> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...main/java/org/apache/hudi/metrics/HoodieGauge.java 0.00% <0.00%> (-100.00%) ⬇️
.../org/apache/hudi/hive/NonPartitionedExtractor.java 0.00% <0.00%> (-100.00%) ⬇️
.../java/org/apache/hudi/metrics/MetricsReporter.java 0.00% <0.00%> (-100.00%) ⬇️
...a/org/apache/hudi/metrics/MetricsReporterType.java 0.00% <0.00%> (-100.00%) ⬇️
...rg/apache/hudi/client/bootstrap/BootstrapMode.java 0.00% <0.00%> (-100.00%) ⬇️
...he/hudi/hive/HiveStylePartitionValueExtractor.java 0.00% <0.00%> (-100.00%) ⬇️
...pache/hudi/client/utils/ConcatenatingIterator.java 0.00% <0.00%> (-100.00%) ⬇️
...che/hudi/config/HoodieMetricsPrometheusConfig.java 0.00% <0.00%> (-100.00%) ⬇️
.../hudi/execution/bulkinsert/BulkInsertSortMode.java 0.00% <0.00%> (-100.00%) ⬇️
...able/action/compact/CompactionTriggerStrategy.java 0.00% <0.00%> (-100.00%) ⬇️
... and 617 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 16e90d3...712b446. Read the comment docs.

@nsivabalan nsivabalan added the priority:critical production down; pipelines stalled; Need help asap. label Apr 30, 2021
@vinothchandar vinothchandar added this to Review in progress in PR Tracker Board May 6, 2021
@vinothchandar vinothchandar moved this from Review in progress to Ready for Review in PR Tracker Board May 6, 2021
@vinothchandar vinothchandar moved this from Ready for Review to Review in progress in PR Tracker Board May 6, 2021
Member

@vinothchandar vinothchandar left a comment


I wonder what the right behavior here should be. Should we error out or return an empty dataframe?

@vinothchandar vinothchandar moved this from Nearing Landing to Opened PRs in PR Tracker Board May 10, 2021
@vinothchandar vinothchandar moved this from Opened PRs to Ready for Review in PR Tracker Board May 11, 2021
@nsivabalan
Contributor Author

I wonder what the right behavior here should be. Should we error out or return an empty dataframe?

If someone tries to read a hudi table from a path that does not even exist, we get this exception as of now.

scala> val tripsSnapshotDF = spark.
     |   read.
     |   format("hudi").
     |   load(basePath + "/*/*/*/*")
org.apache.hudi.exception.TableNotFoundException: Hoodie table not found in path Unable to find a hudi table for the user provided paths.
  at org.apache.hudi.DataSourceUtils.getTablePath(DataSourceUtils.java:81)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:97)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  ... 64 elided

So, I guess the current fix should be fine. WDYT?
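The contract being settled on in this thread — a nonexistent path throws `TableNotFoundException`, while an existing table with only a failed write returns an empty DataFrame — can be modeled in plain Java. All class and method names below are illustrative sketches of the decision logic, not Hudi's actual API:

```java
public class ReadContract {
    // Hypothetical exception mirroring Hudi's TableNotFoundException.
    static class TableNotFoundException extends RuntimeException {
        TableNotFoundException(String path) {
            super("Hoodie table not found in path " + path);
        }
    }

    // Illustrative resolver for the read path:
    //  - path missing                        -> throw TableNotFoundException
    //  - no completed commit (failed write)  -> empty result (empty DataFrame after the fix)
    //  - otherwise                           -> the committed rows
    static String[] resolveRead(boolean pathExists, boolean hasCompletedCommit, String[] committedRows) {
        if (!pathExists) {
            throw new TableNotFoundException("/tmp/hudi_trips_cow");
        }
        if (!hasCompletedCommit) {
            return new String[0];
        }
        return committedRows;
    }

    public static void main(String[] args) {
        // Empty table with a failed first write: reader sees an empty result, no exception.
        System.out.println(resolveRead(true, false, new String[] {"row1"}).length); // prints 0
        // Nonexistent path: reader still gets TableNotFoundException.
        try {
            resolveRead(false, false, new String[0]);
        } catch (TableNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This matches the two observed behaviors in the thread: the empty DataFrame (`df: org.apache.spark.sql.DataFrame = []`) after the fix, and the `TableNotFoundException` for a path that does not exist.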

PR Tracker Board automation moved this from Ready for Review to Nearing Landing May 25, 2021
@nsivabalan
Contributor Author

my bad. I have no idea why I complicated it so much. Fixed it.

case e: InvalidTableException =>
assertTrue(e.getMessage.contains("Invalid Hoodie Table"))
} finally {
spark.stop()
Member


@nsivabalan why are we stopping the Spark session here? Is it not shared outside of a single test?

Contributor Author


Yes, we initialize it at the beginning of each test. I know we have to fix this entire class for reuse in general; @hmit is working on that refactoring.

@nsivabalan
Contributor Author

SQL DML extension tests are failing. I need time to check those out. Will update once I am able to make CI happy.

@vinothchandar
Member

@nsivabalan moving this back to ready for review. Please update when the tests are fixed

@vinothchandar vinothchandar moved this from Nearing Landing to Ready for Review in PR Tracker Board Jun 27, 2021
@nsivabalan
Contributor Author

This patch needs to be redone a bit. Since with SQL DML, createRelation is called upfront, the empty-table check has to move to the SQL DML layer. I will sync up with @pengzhiwei2018 on how to go about this.

@nsivabalan
Contributor Author

I guess we can remove the release-blocker and critical labels. I don't think this is very critical; I understand it's nice to have.

@nsivabalan nsivabalan changed the title [HUDI-1850] Fixing read of a empty table but with failed write [HUDI-1850][HUDI-3234] Fixing read of a empty table but with failed write Jan 12, 2022
@nsivabalan nsivabalan force-pushed the firstWriteFailReadFix branch 3 times, most recently from 83ca285 to fcafe23 Compare January 13, 2022 00:29
@nsivabalan
Contributor Author

@xushiyan : patch is good to review.

@nsivabalan
Contributor Author

@YannByron : Can you review the patch when you get a chance? It should be a small one.

@YannByron
Contributor

@YannByron : Can you review the patch when you get a chance? It should be a small one.

It's a very simple and direct solution. I think there should have been a BaseRelation-level way to think about and solve this.
But it works. LGTM.

@nsivabalan
Contributor Author

@YannByron : there are some failures in SQL DML related tests (TestMergeIntoTable etc.). After we create the table, the first MERGE INTO fails because, with the proposed fix, we return an empty relation which returns a NIL schema.
https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/5326/logs/23

2022-01-18T15:06:16.2111934Z TestMergeIntoTable:
2022-01-18T15:06:16.3784048Z - Test MergeInto Basic *** FAILED ***
2022-01-18T15:06:16.3785444Z   org.apache.spark.sql.AnalysisException: Cannot resolve 'h0.id in (`s0.id` = `h0.id`), the input columns is: [id#5461, name#5462, price#5463, ts#5464, flag#5465];
2022-01-18T15:06:16.3786516Z   at org.apache.spark.sql.hudi.analysis.HoodieResolveReferences.org$apache$spark$sql$hudi$analysis$HoodieResolveReferences$$resolveExpressionFrom(HoodieAnalysis.scala:387)
2022-01-18T15:06:16.3787449Z   at org.apache.spark.sql.hudi.analysis.HoodieResolveReferences$$anonfun$apply$1.applyOrElse(HoodieAnalysis.scala:200)
2022-01-18T15:06:16.3788267Z   at org.apache.spark.sql.hudi.analysis.HoodieResolveReferences$$anonfun$apply$1.applyOrElse(HoodieAnalysis.scala:122)
2022-01-18T15:06:16.3789124Z   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
2022-01-18T15:06:16.3790035Z   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
2022-01-18T15:06:16.3790833Z   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
2022-01-18T15:06:16.3791607Z   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)
2022-01-18T15:06:16.3792441Z   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
2022-01-18T15:06:16.3793269Z   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
2022-01-18T15:06:16.3794064Z   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
2022-01-18T15:06:16.3794661Z   ...

So, I guess we might need some fixes in the SQL DML classes.
Can you help put in a fix for this? Feel free to open up a new PR if need be.

@YannByron
Contributor

@nsivabalan Are all the failed DML tests for an empty table with a failed first write?
If yes, that's because EmptyRelation returns an empty schema, and Spark can't resolve the attributes in the conditions and assignments.

If the table was created by Spark SQL, the right schema can still be returned correctly. If not, throwing an exception is the right behavior and we can't fix it.
If you don't mind, I'll submit a PR to your branch.

@YannByron
Contributor

@nsivabalan
nsivabalan#10

Two TODOs we need to consider:

  1. Maybe unify schema persistence. For now we have the metadata schema, hoodie.table.create.schema (just for Spark SQL), and the file (parquet, .log) schema. We need to simplify this.
  2. Consider whether all the side effects need to be cleaned up when the first DataFrame write fails, including table info in the metastore, and directories and files in the filesystem.

@nsivabalan
Contributor Author

@YannByron : yeah, feel free to put up a patch with all the fixes required. Once you have your patch, we can close this one.

Good to think about the unified schema. But I'm wondering: for an empty table, do we really need to unify the schemas? Why can't the SQL DML layer intercept the empty table and take action appropriately?
I don't have much context on the SQL DML classes, so maybe you can help me understand better.

@YannByron
Contributor

@nsivabalan
you can merge this nsivabalan#10 into your current branch, and re-test.

why can't the SQL DML layer intercept the empty table and take action appropriately?

Spark SQL also calls DefaultSource.createRelation to get the schema info and the valid file list. For an empty table, with this PR, the SQL layer can use the EmptyRelation to avoid the read failure on this table.

The PR I submitted to your firstWriteFailReadFix branch should fix the SQL analysis failure, like org.apache.spark.sql.AnalysisException: Cannot resolve 'h0.id in (s0.id=h0.id), the input columns is: [id#5461, name#5462, price#5463, ts#5464, flag#5465];.

For a table created by Spark SQL, the right schema is persisted in hoodie.properties. Even if data fails to be written at the next commit, the schema can still be retrieved correctly.
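The schema fallback described here can be sketched in plain Java. `pickSchema` is a hypothetical helper name, though `hoodie.table.create.schema` is the property discussed earlier in this thread as the Spark SQL create schema persisted in hoodie.properties:

```java
import java.util.Optional;
import java.util.Properties;

public class EmptyTableSchema {
    // Illustrative helper, not Hudi's actual API: prefer the create schema
    // persisted by Spark SQL so DML can still resolve column attributes.
    static Optional<String> pickSchema(Properties hoodieProps) {
        return Optional.ofNullable(hoodieProps.getProperty("hoodie.table.create.schema"));
    }

    public static void main(String[] args) {
        // Table created via Spark SQL: the create schema survives a failed write.
        Properties sparkSqlTable = new Properties();
        sparkSqlTable.setProperty("hoodie.table.create.schema", "{\"type\":\"record\"}");
        System.out.println(pickSchema(sparkSqlTable).isPresent()); // prints true

        // Table written via the DataFrame API with no completed commit: no
        // persisted schema, so the EmptyRelation falls back to an empty schema
        // (which is what triggered the AnalysisException above).
        System.out.println(pickSchema(new Properties()).isPresent()); // prints false
    }
}
```

This mirrors the two cases in the comment: Spark SQL created tables recover their schema from hoodie.properties, while DataFrame-written tables with only a failed write cannot, and an exception there is the expected outcome.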

@nsivabalan
Contributor Author

Awesome, thanks a ton. I will work on it and update the patch.

@hudi-bot

CI report:

Bot commands — @hudi-bot supports the following commands:
  • @hudi-bot run azure : re-run the last Azure build

@nsivabalan
Contributor Author

@YannByron : can you review the patch?

@YannByron
Contributor

@YannByron : can you review the patch?

LGTM.

@nsivabalan nsivabalan merged commit f7a7796 into apache:master Jan 23, 2022
PR Tracker Board automation moved this from Nearing Landing to Done Jan 23, 2022
@vinishjail97 vinishjail97 mentioned this pull request Jan 24, 2022
5 tasks
alexeykudinkin pushed a commit to onehouseinc/hudi that referenced this pull request Jan 25, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Jan 26, 2022
liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
Labels
priority:major degraded perf; unable to move forward; potential bugs