EstimateSketchUDF isn't processing BINARY fields correctly #50

Closed
koke opened this issue May 29, 2020 · 7 comments · Fixed by #51 or #52

Comments

@koke
Contributor

koke commented May 29, 2020

I've been scratching my head over this one for a while. I was writing some unit tests where I created a theta sketch with a single item, and the estimate function was returning an estimate of roughly minus 880 million.

This seems easily reproducible for me (using Hive version 1.1.0-cdh5.16.1):

add jar /path/to/datasketches-memory-1.2.0-incubating.jar;
add jar /path/to/datasketches-java-1.2.0-incubating.jar;
add jar /path/to/datasketches-hive-1.0.0-incubating.jar;

create temporary function data2sketch as 'org.apache.datasketches.hive.theta.DataToSketchUDAF';
create temporary function estimate as 'org.apache.datasketches.hive.theta.EstimateSketchUDF';

create temporary table theta_input as select 1 as id;

create temporary table sketch_intermediate as select data2sketch(id) as sketch from theta_input;

select estimate(sketch) as estimate_from_table from sketch_intermediate;

-- Output:
-- +----------------------+--+
-- | estimate_from_table  |
-- +----------------------+--+
-- | -8.80936683E8        |
-- +----------------------+--+

with intermediate as (
    select data2sketch(id) as sketch from theta_input
)
select estimate(sketch) as estimate_from_table from intermediate;

-- Output:
-- +----------------------+--+
-- | estimate_from_table  |
-- +----------------------+--+
-- | 1.0                  |
-- +----------------------+--+

For some reason there were extra bytes in the BytesWritable storage, which was breaking the calculations. What was supposed to be a 16-byte SingleItemSketch got an extra 8 zero-filled bytes, making DataSketches interpret it as a completely different kind of sketch.
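As far as I can tell, the mismatch comes from BytesWritable keeping a backing buffer that can be larger than its logical length: getBytes() returns the whole backing array, while only the first getLength() bytes are valid. A minimal sketch of that mismatch (not part of the repro above, just standard Hadoop BytesWritable behavior):

import org.apache.hadoop.io.BytesWritable;

public class BytesWritablePadding {
  public static void main(String[] args) {
    BytesWritable bw = new BytesWritable();
    bw.set(new byte[24], 0, 24); // grows the backing buffer to hold at least 24 bytes
    bw.setSize(16);              // shrinks the logical length to 16 bytes

    System.out.println(bw.getLength());        // 16  -> the valid length
    System.out.println(bw.getBytes().length);  // >16 -> the whole (padded) backing array
    System.out.println(bw.copyBytes().length); // 16  -> a right-sized copy
  }
}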

A unit test of what I was seeing coming from Hive:

@Test
public void evaluateRespectsByteLength() {
    // 24-byte backing array, but only the first 16 bytes are the serialized sketch
    byte[] inputBytes = new byte[]{
            (byte) 0x01, (byte) 0x03, (byte) 0x03, (byte) 0x00,
            (byte) 0x00, (byte) 0x3a, (byte) 0xcc, (byte) 0x93,
            (byte) 0x15, (byte) 0xf9, (byte) 0x7d, (byte) 0xcb,
            (byte) 0xbd, (byte) 0x86, (byte) 0xa1, (byte) 0x05,
            (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00,
            (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00
    };
    BytesWritable input = new BytesWritable(inputBytes, 16); // length 16, capacity 24
    EstimateSketchUDF estimate = new EstimateSketchUDF();
    Double testResult = estimate.evaluate(input);
    assertEquals(1.0, testResult, 0.0);
}

Adding this wrapper around EstimateSketchUDF fixes the problem:

import org.apache.hadoop.io.BytesWritable;

public class EstimateSketchUDF extends org.apache.datasketches.hive.theta.EstimateSketchUDF {

  @Override
  public Double evaluate(BytesWritable binarySketch) {
    if (binarySketch == null) {
      return 0.0;
    }

    // Copy only the valid getLength() bytes so the padded backing array
    // returned by getBytes() never reaches the sketch deserializer.
    byte[] bytes = new byte[binarySketch.getLength()];
    System.arraycopy(binarySketch.getBytes(), 0, bytes, 0, binarySketch.getLength());
    BytesWritable fixedSketch = new BytesWritable(bytes);

    return super.evaluate(fixedSketch);
  }
}
@jmalkin
Contributor

jmalkin commented May 29, 2020

This is a bug in lots of places, as noted in the PR. I'll start working to fix them all (that might take a bit but I'll try to get it done soon).

Wrapping the larger buffer is harmless in most cases. The specific issue here is that an older version of the specialized SingleItemSketch didn't include a flag indicating that it followed the slightly different single-item format. The only workaround we could identify for that was to use the buffer length. In general, sketch images shouldn't need to rely on the buffer size (other than ensuring it's not too small) when being read.

Checks against size for the empty-sketch case should be just a performance optimization; the code should still work properly if it ends up parsing an actual empty sketch later. But we should fix this properly anyway.

@jmalkin
Contributor

jmalkin commented Jun 2, 2020

My PR replaces all calls to getBytes() in the repo (aside from the new code that wraps a BytesWritable in a Memory). I had to copy bytes in a few places, but mostly I was able to just wrap getLength() bytes. It just needs review now.
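Roughly, the pattern is to restrict the view to getLength() bytes instead of handing the whole backing array to the sketch code. A minimal sketch of the idea, not the actual PR code, assuming the datasketches-memory Memory API and theta Sketch.wrap:

import org.apache.datasketches.memory.Memory;
import org.apache.datasketches.theta.Sketch;
import org.apache.hadoop.io.BytesWritable;

public class WrapValidBytes {
  public static double estimate(BytesWritable binarySketch) {
    if (binarySketch == null || binarySketch.getLength() == 0) {
      return 0.0;
    }
    // Wrap the backing array, then restrict the Memory view to the valid bytes,
    // so any zero-filled padding past getLength() is never seen by the sketch.
    Memory mem = Memory.wrap(binarySketch.getBytes())
        .region(0, binarySketch.getLength());
    return Sketch.wrap(mem).getEstimate();
  }
}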

@leerho leerho closed this as completed in #51 Jun 2, 2020
@jmalkin jmalkin reopened this Jun 2, 2020
@jmalkin
Contributor

jmalkin commented Jun 2, 2020

@leerho Please revert that change and apply the PR that completely addresses the issue.

@leerho
Contributor

leerho commented Jun 2, 2020 via email

@jmalkin
Contributor

jmalkin commented Jun 2, 2020

Thanks again @koke for finding this. We did add your unit test, while fixing the underlying issue everywhere in the repo so that future changes in other sketches won't risk triggering this problem again.

@koke
Contributor Author

koke commented Jun 8, 2020

Is there a plan to do a new release with this? I just hit a new issue with IntersectSketchUDF and had to replicate the workaround for that one as well.
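For anyone hitting the same thing, a small hypothetical helper (not from the issue or the linked PRs) that applies the same right-sizing copy as the wrapper above before delegating to the affected UDF:

import org.apache.hadoop.io.BytesWritable;

public final class SketchBytes {
  private SketchBytes() {}

  // Returns a BytesWritable whose backing array is exactly getLength() bytes,
  // so downstream code that reads the whole array sees no zero-filled padding.
  public static BytesWritable rightSized(BytesWritable sketch) {
    if (sketch == null) {
      return null;
    }
    return new BytesWritable(sketch.copyBytes());
  }
}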

@leerho
Contributor

leerho commented Jul 6, 2020

@koke
@jmalkin
DataSketches-hive 1.1.0-incubating has been Released!

Please go to Downloads
