[SUPPORT] Inconsistent query result using GetLatestBaseFiles compared to Snapshot Query #5231
Comments
@codejoyan can you please also paste contents of |
@alexeykudinkin Unfortunately, I shut down the Docker instance, but this can be replicated. |
@alexeykudinkin here is the content of the .hoodie file, the data files and the data file counts.
.hoodie file content
Data Files Listing:
Data File Count
@codejoyan this is a funny one. I was able to reproduce the behavior that you're seeing, and it turns out to be that |
Created HUDI-3855 to track |
Validated that #5296 addresses the issue:
Thanks @alexeykudinkin for the solution. I will do some testing and go through the PR. |
Thanks, @codejoyan & @alexeykudinkin for fixing this critical issue for BQ integration! |
Thanks @alexeykudinkin for finding the root cause and fixing it.
I am trying to compare the output of a snapshot query with the output of a query that reads the files returned by GetLatestBaseFiles (as below).
What might be the reason for the below 2 observations:
Files listed by GetLatestBaseFiles
Section A: SnapShot Query Output (Expected)
Section B: Query Output Using list of files returned by GetLatestBaseFiles
Section C: The latest file slice has only a subset of records in COW (expected - 197, actual - 99)
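One way to narrow down a count mismatch like the one in Section C is to compare record keys rather than counts. The following is only a sketch, under the assumption that the keys (for example, the values of the `_hoodie_record_key` meta column) have already been collected from both queries into plain Python lists. The helper name `diff_record_keys` is hypothetical and not part of Hudi.

```python
from typing import Dict, Iterable, List


def diff_record_keys(expected: Iterable[str], actual: Iterable[str]) -> Dict[str, List[str]]:
    """Report keys present in one result set but not in the other.

    `expected` would hold keys from the snapshot query and `actual` keys
    from the query over the listed base files (both are assumptions about
    how the caller gathered the data).
    """
    expected_set = set(expected)
    actual_set = set(actual)
    return {
        # Keys the snapshot query returned but the file-based query did not.
        "missing": sorted(expected_set - actual_set),
        # Keys the file-based query returned but the snapshot query did not.
        "unexpected": sorted(actual_set - expected_set),
    }
```

An empty `missing` list alongside a lower row count would point at filtering or duplicates rather than at absent files.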
To Reproduce
Steps to reproduce the behavior:
List Latest Base Files
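For readers unfamiliar with what "latest base files" means, here is a simplified sketch of the idea in Python. It is not Hudi's implementation: it assumes base file names follow the `<fileId>_<writeToken>_<instantTime>.parquet` convention, that instant times are fixed-width timestamp strings (so string comparison orders them), and it ignores the timeline entirely (pending, failed, or replaced commits are not considered). The function names and sample file names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def parse_base_file_name(name: str) -> Tuple[str, str, str]:
    """Split '<fileId>_<writeToken>_<instantTime>.parquet' into its parts."""
    stem, dot, ext = name.rpartition(".")
    if not dot or ext != "parquet":
        raise ValueError(f"not a parquet base file: {name}")
    parts = stem.rsplit("_", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected base file name: {name}")
    file_id, write_token, instant_time = parts
    return file_id, write_token, instant_time


def latest_base_files(names: Iterable[str]) -> List[str]:
    """Keep, for each file ID (file group), the file with the greatest instant time."""
    latest: Dict[str, Tuple[str, str]] = {}
    for name in names:
        file_id, _, instant_time = parse_base_file_name(name)
        current = latest.get(file_id)
        # Fixed-width timestamps compare correctly as strings (assumption).
        if current is None or instant_time > current[0]:
            latest[file_id] = (instant_time, name)
    return sorted(name for _, name in latest.values())
```

In a copy-on-write table each newer base file in a file group is expected to contain all records of the older one, which is why reading only the latest file per group should match a snapshot query.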
Steps to Reproduce
I am following the steps in the docker demo. There are 2 JSON files (batch_1.json, batch_2.json) in docker/demo/data. I created an additional JSON file, batch_3.json, by taking batch_1.json and changing the year from 2018 to 2019.
Commit 1:
terminal 1:
terminal 2:
Commit 2:
terminal 1:
terminal 2:
Execute the DeltaStreamer job as in Commit 1
terminal 2:
Commit 3:
terminal 1:
cat batch_2.json | kcat -b kafkabroker -t stock_tick -P
terminal 2:
Execute the DeltaStreamer job as in Commit 1
Environment Description
Hudi version : Built using master branch (0.11)
Spark version : 2.4.4
Running on Docker? yes