Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-24682: Collect dynamic partition info in FileSink for direct insert #1915

Merged
merged 5 commits into from Feb 11, 2021

Conversation

pvargacl
Copy link
Contributor

What changes were proposed in this pull request?

The dynamic partition infos can be collected from the manifest files, no need to do a costly file listing later in the MoveTask

Why are the changes needed?

Performance improvement

Does this PR introduce any user-facing change?

No.

How was this patch tested?

All unit tests are working, local performance tests

@pvargacl
Copy link
Contributor Author

@kuczoram could you please review this?

POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=2222).a SIMPLE [(multiinsert_test_text)a.FieldSchema(name:a, type:int, comment:null), ]
POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=2222).b SIMPLE [(multiinsert_test_text)a.FieldSchema(name:b, type:int, comment:null), ]
POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=__HIVE_DEFAULT_PARTITION__).a SIMPLE [(multiinsert_test_text)a.FieldSchema(name:a, type:int, comment:null), ]
POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=__HIVE_DEFAULT_PARTITION__).b SIMPLE [(multiinsert_test_text)a.FieldSchema(name:b, type:int, comment:null), ]
POSTHOOK: Lineage: multiinsert_test_mm PARTITION(c=__HIVE_DEFAULT_PARTITION__).a SIMPLE [(multiinsert_test_text)a.FieldSchema(name:a, type:int, comment:null), ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these duplicate rows in the lineage info?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes these were duplicates

@@ -1649,8 +1649,6 @@ POSTHOOK: Lineage: ###Masked###
POSTHOOK: Lineage: ###Masked###
POSTHOOK: Lineage: ###Masked###
POSTHOOK: Lineage: ###Masked###
POSTHOOK: Lineage: ###Masked###
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were these also duplicated rows? Could you please check the q.out.orig files before the masking?

return (fullName.startsWith(BASE_PREFIX) && !dirName.startsWith(BASE_PREFIX)) ||
(fullName.startsWith(DELTA_PREFIX) && !dirName.startsWith(DELTA_PREFIX)) ||
(fullName.startsWith(DELETE_DELTA_PREFIX) && !dirName.startsWith(DELETE_DELTA_PREFIX));
return (!dirName.startsWith(BASE_PREFIX)) && !dirName.startsWith(DELTA_PREFIX) && !dirName.startsWith(DELETE_DELTA_PREFIX)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an extra () on the first one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

(fullName.startsWith(DELTA_PREFIX) && !dirName.startsWith(DELTA_PREFIX)) ||
(fullName.startsWith(DELETE_DELTA_PREFIX) && !dirName.startsWith(DELETE_DELTA_PREFIX));
return (!dirName.startsWith(BASE_PREFIX)) && !dirName.startsWith(DELTA_PREFIX) && !dirName.startsWith(DELETE_DELTA_PREFIX)
&& (fullName.contains(BASE_PREFIX) || fullName.contains(DELTA_PREFIX) || fullName.contains(DELETE_DELTA_PREFIX));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly this is changing the logic. Is it intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Previously it was working only for rootpath like delta_00001/subdir/, now it works for table/partition=1/delta_00001/subdir
The reordering was needed because fullName.contains(delta_prefix) && !dirname.startWith(delta_prefix) is true for delete_delta_000001, and it would yield wrong result

@pvargacl pvargacl merged commit 2f2b7a1 into apache:master Feb 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants