Support default null value in data preprocessing job by jackjlli · Pull Request #7739 · apache/pinot

jackjlli · 2021-11-10T19:40:10Z

Description

This PR adds support for a default null value in the data preprocessing job.
If the value of the partitioning column is null, the default null value is used to distribute the data across all the reducers.
If the value of the sorting column is null, the default null value is used for sorting within each reducer.
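
As a rough illustration of the substitution described above, here is a minimal sketch in plain Java. It is not the code from this PR: the class name, the `DEFAULT_NULL_VALUE` constant, and the hash-modulo partitioner are all assumptions made for the example, and a real job would take the default from the column's schema and use the table's configured partition function.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; none of these names come from the Pinot code base.
final class DefaultNullValueSketch {
  // Assumed default for a string column; a real job would read it from the schema.
  static final String DEFAULT_NULL_VALUE = "null";

  /** Returns the value itself, or the default null value when the value is missing. */
  static String orDefault(String value) {
    return value != null ? value : DEFAULT_NULL_VALUE;
  }

  /** Picks a reducer from the (possibly substituted) partitioning value.
   *  Hash-modulo is a stand-in for whatever partition function the job uses. */
  static int partitionFor(String partitionValue, int numReducers) {
    return Math.floorMod(orDefault(partitionValue).hashCode(), numReducers);
  }

  /** Sorts the values seen by one reducer, treating nulls as the default null value. */
  static List<String> sortWithinReducer(List<String> sortValues) {
    List<String> sorted = new ArrayList<>();
    for (String value : sortValues) {
      sorted.add(orDefault(value));
    }
    sorted.sort(Comparator.naturalOrder());
    return sorted;
  }

  public static void main(String[] args) {
    // A null partitioning value lands on the same reducer as the default itself.
    System.out.println(partitionFor(null, 4) == partitionFor(DEFAULT_NULL_VALUE, 4));
    // Nulls are replaced before sorting, so the comparator never sees a null.
    System.out.println(sortWithinReducer(Arrays.asList("b", null, "a")));
  }
}
```

Substituting the default before the key is computed means the partitioner and the comparator never have to handle a null themselves.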

Upgrade Notes

Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)

Yes (Please label as backward-incompat, and complete the section below on Release Notes)

Does this PR fix a zero-downtime upgrade introduced earlier?

Yes (Please label this as backward-incompat, and complete the section below on Release Notes)

Does this PR otherwise need attention when creating release notes? Things to consider:

New configuration options
Deprecation of configurations
Signature changes to public methods/interfaces
New plugins added or old plugins removed

Yes (Please label this PR as release-notes and complete the section on Release Notes)

Release Notes

Documentation

codecov-commenter · 2021-11-10T20:32:03Z

Codecov Report

Merging #7739 (5c8f4f3) into master (13c9ee9) will increase coverage by 0.15%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##             master    #7739      +/-   ##
============================================
+ Coverage     71.49%   71.65%   +0.15%     
+ Complexity     4064     4061       -3     
============================================
  Files          1577     1577              
  Lines         80554    80595      +41     
  Branches      11965    11978      +13     
============================================
+ Hits          57592    57747     +155     
+ Misses        19078    18962     -116     
- Partials       3884     3886       +2

Flag	Coverage Δ
integration1	`29.45% <ø> (+0.32%)`	⬆️
integration2	`27.86% <ø> (-0.02%)`	⬇️
unittests1	`68.56% <ø> (-0.04%)`	⬇️
unittests2	`14.58% <ø> (+0.02%)`	⬆️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
...inot/core/util/SegmentCompletionProtocolUtils.java	`57.69% <0.00%> (-7.70%)`	⬇️
.../helix/core/minion/MinionInstancesCleanupTask.java	`77.27% <0.00%> (-4.55%)`	⬇️
.../startree/v2/builder/OffHeapSingleTreeBuilder.java	`87.42% <0.00%> (-4.20%)`	⬇️
.../java/org/apache/pinot/spi/data/TimeFieldSpec.java	`88.63% <0.00%> (-2.28%)`	⬇️
...e/pinot/common/utils/FileUploadDownloadClient.java	`64.37% <0.00%> (-1.88%)`	⬇️
...ache/pinot/common/metadata/ZKMetadataProvider.java	`82.70% <0.00%> (-0.76%)`	⬇️
...e/pinot/core/transport/InstanceRequestHandler.java	`60.75% <0.00%> (-0.36%)`	⬇️
...roker/requesthandler/BaseBrokerRequestHandler.java	`70.93% <0.00%> (-0.20%)`	⬇️
...apache/pinot/spi/ingestion/batch/spec/TlsSpec.java	`0.00% <0.00%> (ø)`
...pinot/server/api/access/AllowAllAccessFactory.java
... and 36 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 13c9ee9...5c8f4f3.

snleee · 2021-11-11T21:47:33Z

As we discussed, I don't think we should evenly distribute null values across all reducers, because this would break the partitioning contract. Instead, I think the data owner should purge the data correctly so that the columns they sort and partition on contain no null values. As long as we do the key salting, it's probably the best we can do given the skewed data.
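
To make the partitioning-contract concern concrete, here is a small hypothetical sketch, not taken from the PR. It assumes the contract means "every value stored in partition i maps to i under the partition function", uses hash-modulo as a stand-in partition function, and uses the string "null" as an assumed default null value. Round-robin assignment stands in for "evenly distributing" the null rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; class and method names are assumptions.
final class PartitionContractSketch {
  /** Stand-in partition function: hash of the value modulo the partition count. */
  static int partitionOf(String value, int numPartitions) {
    return Math.floorMod(value.hashCode(), numPartitions);
  }

  /** True when every value stored under partition i actually maps to partition i. */
  static boolean satisfiesContract(List<List<String>> partitions) {
    int n = partitions.size();
    for (int i = 0; i < n; i++) {
      for (String value : partitions.get(i)) {
        if (partitionOf(value, n) != i) {
          return false;
        }
      }
    }
    return true;
  }

  /** Assigns values either by the partition function or evenly in round-robin order. */
  static List<List<String>> assign(List<String> values, int numPartitions, boolean roundRobin) {
    List<List<String>> partitions = new ArrayList<>();
    for (int i = 0; i < numPartitions; i++) {
      partitions.add(new ArrayList<>());
    }
    int next = 0;
    for (String value : values) {
      int target = roundRobin ? (next++ % numPartitions) : partitionOf(value, numPartitions);
      partitions.get(target).add(value);
    }
    return partitions;
  }

  public static void main(String[] args) {
    // Four rows whose partitioning column was null and has been replaced by the default.
    List<String> rows = Arrays.asList("null", "null", "null", "null");
    System.out.println(satisfiesContract(assign(rows, 4, false))); // true
    System.out.println(satisfiesContract(assign(rows, 4, true)));  // false
  }
}
```

With round-robin assignment the same value ends up in every partition, so at most one of them can agree with the partition function. Routing all rows through the function keeps the contract but puts every null row on a single reducer, which is the skew mentioned in the comment.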

jackjlli · 2021-11-11T23:40:56Z

@snleee updated the PR based on the discussion.

...ot-hadoop/src/main/java/org/apache/pinot/hadoop/job/mappers/AvroDataPreprocessingMapper.java

snleee

LGTM

Co-authored-by: Jack Li(Analytics Engineering) <jlli@jlli-mn1.linkedin.biz>

jackjlli requested a review from snleee November 10, 2021 19:40

jackjlli force-pushed the support-default-null-value-in-preprocessing branch from 87f83bf to 08fbc20 November 11, 2021 22:39

Support default null value in data preprocessing job

5c8f4f3

jackjlli force-pushed the support-default-null-value-in-preprocessing branch from 08fbc20 to 5c8f4f3 November 12, 2021 21:45

snleee reviewed Nov 12, 2021

...ot-hadoop/src/main/java/org/apache/pinot/hadoop/job/mappers/AvroDataPreprocessingMapper.java

snleee approved these changes Nov 12, 2021

jackjlli merged commit 068549c into master Nov 12, 2021

jackjlli deleted the support-default-null-value-in-preprocessing branch November 12, 2021 23:05

kriti-sc pushed a commit to kriti-sc/incubator-pinot that referenced this pull request Dec 12, 2021

Support default null value in data preprocessing job (apache#7739)

dc92c42

Co-authored-by: Jack Li(Analytics Engineering) <jlli@jlli-mn1.linkedin.biz>

jackjlli merged 1 commit into master from support-default-null-value-in-preprocessing

3 participants
