New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-24682: Collect dynamic partition info in FileSink for direct insert #1915
Conversation
@kuczoram could you please review this? |
POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=2222).a SIMPLE [(multiinsert_test_text)a.FieldSchema(name:a, type:int, comment:null), ] | ||
POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=2222).b SIMPLE [(multiinsert_test_text)a.FieldSchema(name:b, type:int, comment:null), ] | ||
POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=__HIVE_DEFAULT_PARTITION__).a SIMPLE [(multiinsert_test_text)a.FieldSchema(name:a, type:int, comment:null), ] | ||
POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=__HIVE_DEFAULT_PARTITION__).b SIMPLE [(multiinsert_test_text)a.FieldSchema(name:b, type:int, comment:null), ] | ||
POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=__HIVE_DEFAULT_PARTITION__).a SIMPLE [(multiinsert_test_text)a.FieldSchema(name:a, type:int, comment:null), ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these duplicate rows in the lineage info?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes these were duplicates
@@ -1649,8 +1649,6 @@ POSTHOOK: Lineage: ###Masked### | |||
POSTHOOK: Lineage: ###Masked### | |||
POSTHOOK: Lineage: ###Masked### | |||
POSTHOOK: Lineage: ###Masked### | |||
POSTHOOK: Lineage: ###Masked### |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were these also duplicated rows? Could you please check the q.out.orig files before the masking?
return (fullName.startsWith(BASE_PREFIX) && !dirName.startsWith(BASE_PREFIX)) || | ||
(fullName.startsWith(DELTA_PREFIX) && !dirName.startsWith(DELTA_PREFIX)) || | ||
(fullName.startsWith(DELETE_DELTA_PREFIX) && !dirName.startsWith(DELETE_DELTA_PREFIX)); | ||
return (!dirName.startsWith(BASE_PREFIX)) && !dirName.startsWith(DELTA_PREFIX) && !dirName.startsWith(DELETE_DELTA_PREFIX) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an extra ()
on the first one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
(fullName.startsWith(DELTA_PREFIX) && !dirName.startsWith(DELTA_PREFIX)) || | ||
(fullName.startsWith(DELETE_DELTA_PREFIX) && !dirName.startsWith(DELETE_DELTA_PREFIX)); | ||
return (!dirName.startsWith(BASE_PREFIX)) && !dirName.startsWith(DELTA_PREFIX) && !dirName.startsWith(DELETE_DELTA_PREFIX) | ||
&& (fullName.contains(BASE_PREFIX) || fullName.contains(DELTA_PREFIX) || fullName.contains(DELETE_DELTA_PREFIX)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly this is changing the logic. Is it intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Previously it was working only for rootpath like delta_00001/subdir/, now it works for table/partition=1/delta_00001/subdir
The reordering was needed because fullName.contains(delta_prefix) && !dirname.startWith(delta_prefix) is true for delete_delta_000001, and it would yield wrong result
What changes were proposed in this pull request?
The dynamic partition infos can be collected from the manifest files, no need to do a costly file listing later in the MoveTask
Why are the changes needed?
Performance improvement
Does this PR introduce any user-facing change?
No.
How was this patch tested?
All unit tests are working, local performance tests