Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix indexMerger to respect the includeAllDimensions flag #12428

Merged
merged 2 commits into from
Apr 13, 2022

Conversation

jihoonson
Copy link
Contributor

@jihoonson jihoonson commented Apr 12, 2022

Description

includeAllDimensions was added recently in #12276. When useFieldDiscovery and includeAllDimensions are both set, IndexMerger should store every column even if it is empty. However, it currently doesn't respect the includeAllDimensions flag but only checks if the given column is explicitly specified in the dimensionsSpec. This PR fixes this bug by checking the flag properly.

This PR also fixes another bug in JsonInputFormat that implicitly filters null fields out even when useFieldDiscovery is set. Instead of filtering them out in an early stage in ingestion, keepNullColumns will be set automatically if useFieldDiscovery is set, so that MapInputRowParser can check the dimensionsSpec to decide whether to keep the null fields or not. This change will not make any user-facing change as 1) keepNullColumns is undocumented and used only in the sampler and 2) even if there is someone using keepNullColumns, the existing behavior should remain the same. The behavior changes only when includeAllDimensions is set as well as useFieldDiscovery.


Key changed/added classes in this PR
  • IndexMergerV9
  • JsonInputFormat

This PR has:

  • been self-reviewed.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

…ormat should set keepNullColumns if useFieldDiscovery is set
Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just nitpicking so approved, make the changes if you'd like, or not 🤷

@@ -1460,4 +1456,33 @@ private static <T extends TimeAndDimsPointer> T reorderRowPointerColumns(
);
}
}

private static class DimensionMergerUtil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit, this doesn't really seem like a 'util', but does sort of remind me of some other thingswe call Inspector though not sure what exactly would call this, naming is hard.

Also, it seems like only one place creates this, maybe should push dimensionsSpec into the constructor of this thing and do the logic there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

In that vein, maybe DimensionsSpecInspector or DimensionStorageChecker would be a better name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed it to DimensionsSpecInspector, but am not sure what the right name is for this. I think the code structure being not well-organized in IndexMerger is probably one of the reasons that it's hard to find a good name. Perhaps, we should add a wrapper of DimensionMerger and DimensionsSpecInspector that creates ColumnDescriptor. Or, there could be a even better refactoring idea than that. But I would like to do such refactoring not in this PR but in others.

Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Very minor style suggestions.

FYI, there also seems to be a checkstyle error due to an extra import.

}

@Test
public void testUseFieldDiscovery_doNotChangeKeepNullColumnsUserSets()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: should we also add a test where JsonPathSpec.useFieldDiscovery = false ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added one.


@RunWith(Parameterized.class)
public class HashPartitionMultiPhaseParallelIndexingWithNullColumnTest extends AbstractMultiPhaseParallelIndexingTest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: We should probably rename this test class now that it is a parameterized test and runs for DimensionRangePartitionsSpec too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, thanks. I forgot to fix it before making this PR. Renamed it now.

"dim2", "val13"
)
);
final List<Map<String, Object>> rows = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this part might be more readable if we add a private method that returns a row given the input values i.e.

private Map<String, Object> createTestRow(String ts, String... dims) {
   // return map with "ts" -> ts, "dim1" -> dims[0] and so on.
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored as suggested.

{
private final List<String> data;

@JsonCreator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these tests need to serialize/deserialize to run the underlying tasks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AbstractParallelIndexSupervisorTaskTest provides a semi-integration testing environment. It serializes and deserializes the given task to mimic the actual task processing.

@@ -1460,4 +1456,33 @@ private static <T extends TimeAndDimsPointer> T reorderRowPointerColumns(
);
}
}

private static class DimensionMergerUtil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

In that vein, maybe DimensionsSpecInspector or DimensionStorageChecker would be a better name?

@jihoonson
Copy link
Contributor Author

@clintropolis @kfaraz thank you for your review!

@jihoonson jihoonson merged commit 5e5625f into apache:master Apr 13, 2022
@abhishekagarwal87 abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants