
General Druid refactors #16708

Open · wants to merge 15 commits into base: master
Conversation

adarshsanjeev
Contributor

Some general refactors across Druid.

  • Switch to DruidExceptions
  • Add javadocs
  • Fix a bug in IntArrayColumns
  • Add a class for LongArrayColumns
  • Remove wireTransferable since it would never be called
  • Refactor DictionaryWriter to return the index written as a return value from write.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@@ -91,7 +91,7 @@ public int numRows()
public boolean isNull(int rowNum)
{
offset.set(rowNum);
-  return valueSelector.isNull();
+  return valueSelector.getObject() == null;
Member:
does this have a different contract than BaseNullableColumnValueSelector? Its isNull method means that getLong/getFloat/getDouble will return a null value, not necessarily that getObject is null. E.g. non-numeric selectors sometimes implement these primitive getters without the rows actually being null.

Looking at some of the things that implement it, it doesn't appear to be different. E.g. ObjectColumnAccessorBase has an implementation of getLong that checks if the instance is a String and, if so, tries to parse it. This means that getObject will not return null, but getLong isn't valid, so there is no way to tell that the 0 it returns is actually a null because we couldn't parse the string 'hello' into a long.
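The ambiguity described above can be sketched in a few lines. This is a hypothetical standalone selector, not Druid's actual classes; it just shows how "the object is non-null" and "the primitive value is meaningful" diverge once an implicit string-to-long cast is allowed:

```java
// Hypothetical selector over string values; names are illustrative only.
public class StringBackedSelector
{
  private final String value;

  StringBackedSelector(String value)
  {
    this.value = value;
  }

  // The "is the object null" contract: true only for genuinely missing rows.
  boolean isObjectNull()
  {
    return value == null;
  }

  // Implicit-cast getter: unparseable strings silently collapse to 0,
  // indistinguishable from a real 0 under this contract.
  long getLong()
  {
    if (value == null) {
      return 0L;
    }
    try {
      return Long.parseLong(value);
    } catch (NumberFormatException e) {
      return 0L;
    }
  }

  public static void main(String[] args)
  {
    StringBackedSelector hello = new StringBackedSelector("hello");
    System.out.println(hello.isObjectNull()); // false: the object exists
    System.out.println(hello.getLong());      // 0: a silent default, not a real value
  }
}
```

For the row holding "hello", isObjectNull() is false but getLong() returns 0, which is exactly the undetectable-null case the comment describes.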

Contributor:
The contract here is supposed to be abandoning the "numerical null" thing that the ValueSelectors follow. It is explicitly "is it null" rather than "will it be null if cast to a primitive type"

Member:
how do you know if it is safe to call getLong if that is the contract?

Member:
Or maybe we should eliminate implicit casting in the numeric getter methods to make it so that they only work if the column type really is a number?

Contributor (author):
From the javadoc for ColumnAccessor (Get whether the value of a row is null), the updated behaviour would be correct.

Member:
i think we need to revisit this contract, since it doesn't really feel compatible with implicit casting on the getLong/getFloat etc. methods. If you have a string column and want to read numbers from it, but not all of the values can be parsed into numbers, you can't reliably use any of the primitive getters: isNull would say false, but getLong would return 0 for values that can't be parsed. See ObjectColumnAccessorBase.

This means you can't actually use these methods if you want to try to get numbers from an object column, so they should probably throw exceptions instead of silently giving you zeros that should have been handled as nulls.

We could fix this by simply forbidding this sort of implicit casting; I'm not sure it does us any favors in terms of making code clear to follow, while having explicit casting wrapper gizmos would make it obvious when something is turning a string into a number.
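One hedged sketch of what such an "explicit casting wrapper" could look like (the class and method names here are invented for illustration, not proposed Druid API): returning OptionalLong makes "unparseable or missing" structurally distinct from a real 0, and wrapping a string column this way is a visible, deliberate act at the call site.

```java
import java.util.OptionalLong;

// Hypothetical explicit wrapper that reads a String column as longs.
public class LongParsingAccessor
{
  private final String[] column;

  LongParsingAccessor(String[] column)
  {
    this.column = column;
  }

  // OptionalLong.empty() covers both null rows and unparseable strings,
  // so no sentinel 0 can be mistaken for data.
  OptionalLong getLong(int row)
  {
    final String v = column[row];
    if (v == null) {
      return OptionalLong.empty();
    }
    try {
      return OptionalLong.of(Long.parseLong(v));
    } catch (NumberFormatException e) {
      return OptionalLong.empty();
    }
  }

  public static void main(String[] args)
  {
    LongParsingAccessor acc = new LongParsingAccessor(new String[]{"12", "hello", null});
    System.out.println(acc.getLong(0));             // OptionalLong[12]
    System.out.println(acc.getLong(1).isPresent()); // false: "hello" is not a long
    System.out.println(acc.getLong(2).isPresent()); // false: null row
  }
}
```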

Member (@clintropolis, Jul 11, 2024):
not that it has to be done now, but it does seem problematic and is why ColumnValueSelector.isNull has the strange contract it does, because you kind of need it if you allow implicit casting.

return FindResult.found(startIndex, end);
}

int i = Arrays.binarySearch(vals, startIndex, endIndex, val);
Member:
I know this isn't really new since the others do this too, but is there any guarantee that vals is sorted?

Contributor:
These methods are part of the BinarySearchableAccessor, which is fundamentally just saying "if you want to binary search and you know it's safe, use these methods". Instead of making the accessor itself implement this, it probably would've been better to make it another interface off of the Column. That's my bad really.

Contributor:
Fwiw, the whole BinarySearchableAccessor is actually only used by an experimental join implementation that exists to exercise the operator stuff. Your commentary about the oddities is totally valid and belongs against that. It's not really a solidified part of the code base, though, so we should perhaps add some javadocs to the interface itself that talk a bit about why it's awkward; when/if we get back to it, we can improve it.

Contributor (@LakshSingla, Jul 11, 2024):
The worst-case time complexity of findLong is O(n) if all the values from startIndex to endIndex are the same. It can be improved by doing two binary searches to find the upper and the lower bound. The current implementation is the same in other places. Is it worth switching to the proposed one instead?
If the array always has mostly distinct values, then the current implementation would work better on average.
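The two-binary-search approach suggested above can be sketched standalone (names and signatures are illustrative, not Druid's): lowerBound and upperBound each run in O(log n), so locating the full range of a repeated value never degrades to O(n), even when every element in [startIndex, endIndex) is identical.

```java
public class BoundedSearch
{
  // First index of val in the sorted range [start, end), or -1 if absent.
  static int lowerBound(long[] vals, int start, int end, long val)
  {
    int lo = start, hi = end;
    while (lo < hi) {
      final int mid = (lo + hi) >>> 1; // unsigned shift avoids overflow
      if (vals[mid] < val) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return (lo < end && vals[lo] == val) ? lo : -1;
  }

  // One past the last index of val in the sorted range [start, end).
  static int upperBound(long[] vals, int start, int end, long val)
  {
    int lo = start, hi = end;
    while (lo < hi) {
      final int mid = (lo + hi) >>> 1;
      if (vals[mid] <= val) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  public static void main(String[] args)
  {
    long[] vals = {1, 3, 3, 3, 3, 7};
    System.out.println(lowerBound(vals, 0, vals.length, 3)); // 1
    System.out.println(upperBound(vals, 0, vals.length, 3)); // 5
  }
}
```

As the comment notes, if values are mostly distinct, a single Arrays.binarySearch followed by a short outward scan tends to win on average; the two-search form only pays off when long runs of duplicates are plausible.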

Contributor:
No, none of the "find" stuff is actually being used for anything meaningful. Ignore it.

Contributor (author):
I'm not sure if it's more likely that we have distinct values here, but currently, this is in line with the other type columns. If we make a call on using multiple searches instead, that should be consistent with the other columns here.

Contributor (@cryptoe, Jul 22, 2024):
@adarshsanjeev Let's javadoc the oddity for now and move forward? We can take up this work as part of a follow-up refactor.

Contributor (author):
I have added this to the javadoc in BinarySearchableAccessor.

@Override
public FindResult findComplex(int startIndex, int endIndex, Object val)
{
return findLong(startIndex, endIndex, Numbers.tryParseLong(val, 0));
Member:
i know this isn't really new, but why doesn't this need to distinguish between the case where val is parseable into a number and not?

Contributor:
The meta answer to your question is because it's not an interface that's used for anything actually meaningful yet.

The actual answer to your question is that the one place that tries to use this first checks for null and calls findNull. It then does a switch on the searcher's type and calls the relevant method for that type. The type of the LongArrayColumn is LONG, so it won't actually call findComplex() and will call findLong() instead. This fundamentally puts the responsibility on the call site to "do the right thing". Is that the correct behavior or not? I dunno, but it's an experimental interface that you would have to be incredibly intrepid to actually hit, so it's a detail that perhaps doesn't matter.

If your suggestion is to switch this to a NotYetImplemented exception, I'd probably agree with that suggestion.

Contributor (author):
Changed the number columns to throw an exception instead of using 0 when the string is unparseable, for findComplex and findString.
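A minimal sketch of that fail-loudly behavior (the helper name and exception type here are illustrative; Druid's own Numbers.tryParseLong and DruidException would take their place): an unparseable value raises an exception carrying the original cause, rather than silently searching for 0.

```java
// Illustrative strict parser: surface unparseable inputs instead of defaulting to 0.
public class StrictParse
{
  static long parseLongOrThrow(Object val)
  {
    if (val instanceof Number) {
      return ((Number) val).longValue();
    }
    try {
      return Long.parseLong(String.valueOf(val));
    } catch (NumberFormatException e) {
      // Keep the cause so the caller can see why parsing failed.
      throw new IllegalArgumentException("Cannot parse [" + val + "] as a long", e);
    }
  }

  public static void main(String[] args)
  {
    System.out.println(parseLongOrThrow("42")); // 42
    try {
      parseLongOrThrow("hello");
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage()); // Cannot parse [hello] as a long
    }
  }
}
```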

Contributor (@imply-cheddar) left a review:
Generally LGTM. Make sure you have green checks and that clint is satisfied before merge.


*/
public class NotYetImplemented extends DruidException.Failure
{
public static DruidException ex(String msg, Object... args)
Member:
I think it would be better not to have factory methods which omit the cause exception. I've seen places with a catch block where the newly created exception did not have the cause filled in, most likely by mistake, and as a result the stack trace was clipped.

I think forcing the API user to explicitly write out null for the cause makes this mistake less probable.
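The shape being suggested could look roughly like this (the signature is a sketch, not DruidException's actual API): with the cause as the first, mandatory parameter, a caller inside a catch block has to consciously write null to drop it.

```java
// Illustrative factory where the cause cannot be silently omitted.
public class NotYetImplementedSketch
{
  public static RuntimeException ex(Throwable cause, String msgFormat, Object... args)
  {
    // String.format mirrors the varargs message style of the factories discussed above.
    return new RuntimeException(String.format(msgFormat, args), cause);
  }

  public static void main(String[] args)
  {
    try {
      throw new IllegalStateException("underlying failure");
    } catch (IllegalStateException e) {
      RuntimeException wrapped = ex(e, "feature [%s] not implemented", "findComplex");
      System.out.println(wrapped.getCause() != null); // true: the chain is preserved
    }
  }
}
```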

++firstNonDruidExceptionIndex;
}
if (firstNonDruidExceptionIndex < stackTrace.length) {
retVal.setStackTrace(Arrays.copyOfRange(stackTrace, firstNonDruidExceptionIndex, stackTrace.length));
Member:
this conditional looks odd... it's not wrong, but it only skips the set when the loop ran over the full stack trace; when the index is 0 it still does it...

I guess the intention here is to clip the frames from the exception construction process? Could you add a test for this, with a case that checks what happens when a DruidException is set as the cause of another DruidException?
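To make the clipping behavior concrete, here is a standalone sketch of the same logic (the class and the match condition are simplified stand-ins for the DruidException-frame check in the snippet above). Note it reproduces the discussed edge case: when every frame matches, the trace is left intact rather than emptied.

```java
import java.util.Arrays;

public class StackClipDemo
{
  // Drop leading frames that belong to the exception-construction machinery,
  // identified here simply by class name. If ALL frames matched, keep the
  // original trace unchanged (the conditional discussed above).
  static StackTraceElement[] clip(StackTraceElement[] stackTrace, String constructionClass)
  {
    int first = 0;
    while (first < stackTrace.length
           && stackTrace[first].getClassName().equals(constructionClass)) {
      ++first;
    }
    return first < stackTrace.length
           ? Arrays.copyOfRange(stackTrace, first, stackTrace.length)
           : stackTrace;
  }

  public static void main(String[] args)
  {
    StackTraceElement make = new StackTraceElement("Factory", "create", "Factory.java", 10);
    StackTraceElement call = new StackTraceElement("Caller", "run", "Caller.java", 20);
    StackTraceElement[] clipped = clip(new StackTraceElement[]{make, call}, "Factory");
    System.out.println(clipped.length);            // 1: the construction frame is gone
    System.out.println(clipped[0].getClassName()); // Caller
  }
}
```

A unit test along these lines, plus one where the cause is itself a DruidException, would pin down whether the "all frames matched" branch is intentional.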
