EstimateSketchUDF isn't processing BINARY fields correctly #50

Closed
koke opened this issue May 29, 2020 · 7 comments · Fixed by #51 or #52

Comments

@koke
Contributor

koke commented May 29, 2020

I've been scratching my head over this one for a while. I was writing some unit tests where I created a theta sketch with a single item, and the estimate function was returning an estimate of roughly minus 880 million.

This seems easily reproducible for me (using Hive version 1.1.0-cdh5.16.1):

add jar /path/to/datasketches-memory-1.2.0-incubating.jar;
add jar /path/to/datasketches-java-1.2.0-incubating.jar;
add jar /path/to/datasketches-hive-1.0.0-incubating.jar;

create temporary function data2sketch as 'org.apache.datasketches.hive.theta.DataToSketchUDAF';
create temporary function estimate as 'org.apache.datasketches.hive.theta.EstimateSketchUDF';

create temporary table theta_input as select 1 as id;

create temporary table sketch_intermediate as select data2sketch(id) as sketch from theta_input;

select estimate(sketch) as estimate_from_table from sketch_intermediate;

-- Output:
-- +----------------------+--+
-- | estimate_from_table  |
-- +----------------------+--+
-- | -8.80936683E8        |
-- +----------------------+--+

with intermediate as (
    select data2sketch(id) as sketch from theta_input
)
select estimate(sketch) as estimate_from_table from intermediate;

-- Output:
-- +----------------------+--+
-- | estimate_from_table  |
-- +----------------------+--+
-- | 1.0                  |
-- +----------------------+--+

For some reason there were extra bytes in the BytesWritable storage, which was breaking the calculations. What was supposed to be a 16-byte SingleItemSketch got an extra 8 zero-filled bytes, making DataSketches interpret it as a completely different kind of sketch.
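As far as I can tell, the mismatch comes from BytesWritable keeping a backing buffer that can be larger than its logical length: getBytes() returns the whole backing array, while only the first getLength() bytes are valid. A minimal sketch of that mismatch (not part of the repro above, just standard Hadoop BytesWritable behavior):

import org.apache.hadoop.io.BytesWritable;

public class BytesWritablePadding {
  public static void main(String[] args) {
    BytesWritable bw = new BytesWritable();
    bw.set(new byte[24], 0, 24); // grows the backing buffer to hold at least 24 bytes
    bw.setSize(16);              // shrinks the logical length to 16 bytes

    System.out.println(bw.getLength());        // 16  -> the valid length
    System.out.println(bw.getBytes().length);  // >16 -> the whole (padded) backing array
    System.out.println(bw.copyBytes().length); // 16  -> a right-sized copy
  }
}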

A unit test of what I was seeing coming from Hive:

@Test
public void evaluateRespectsByteLength() {
    // 24-byte backing array, but only the first 16 bytes are the serialized sketch
    byte[] inputBytes = new byte[]{
            (byte) 0x01, (byte) 0x03, (byte) 0x03, (byte) 0x00,
            (byte) 0x00, (byte) 0x3a, (byte) 0xcc, (byte) 0x93,
            (byte) 0x15, (byte) 0xf9, (byte) 0x7d, (byte) 0xcb,
            (byte) 0xbd, (byte) 0x86, (byte) 0xa1, (byte) 0x05,
            (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00,
            (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00
    };
    BytesWritable input = new BytesWritable(inputBytes, 16); // length 16, capacity 24
    EstimateSketchUDF estimate = new EstimateSketchUDF();
    Double testResult = estimate.evaluate(input);
    assertEquals(1.0, testResult, 0.0);
}

Adding this wrapper around EstimateSketchUDF fixes the problem:

import org.apache.hadoop.io.BytesWritable;

public class EstimateSketchUDF extends org.apache.datasketches.hive.theta.EstimateSketchUDF {

  @Override
  public Double evaluate(BytesWritable binarySketch) {
    if (binarySketch == null) {
      return 0.0;
    }

    // Copy only the valid getLength() bytes so the padded backing array
    // returned by getBytes() never reaches the sketch deserializer.
    byte[] bytes = new byte[binarySketch.getLength()];
    System.arraycopy(binarySketch.getBytes(), 0, bytes, 0, binarySketch.getLength());
    BytesWritable fixedSketch = new BytesWritable(bytes);

    return super.evaluate(fixedSketch);
  }
}
@jmalkin
Contributor

jmalkin commented May 29, 2020

This is a bug in lots of places, as noted in the PR. I'll start working to fix them all (that might take a bit but I'll try to get it done soon).

Wrapping the larger buffer is harmless in most cases. The specific issue here is that an older version of the specialized SingleItemSketch didn't include a flag indicating that it followed the slightly different single-item format. The only workaround we could identify for that was to use the buffer length. In general, sketch images shouldn't need to rely on the buffer size (other than ensuring it's not too small) when being read.

Checks against size for the empty-sketch case should be just a performance optimization; the code should still work properly if it ends up parsing an actual empty sketch later. But we should fix this properly anyway.

@jmalkin
Contributor

jmalkin commented Jun 2, 2020

My PR replaces all calls to getBytes() in the repo (aside from the new code that wraps a BytesWritable in a Memory). I had to copy bytes in a few places, but mostly I was able to just wrap getLength() bytes. It just needs review now.
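Roughly, the pattern is to restrict the view to getLength() bytes instead of handing the whole backing array to the sketch code. A minimal sketch of the idea, not the actual PR code, assuming the datasketches-memory Memory API and theta Sketch.wrap:

import org.apache.datasketches.memory.Memory;
import org.apache.datasketches.theta.Sketch;
import org.apache.hadoop.io.BytesWritable;

public class WrapValidBytes {
  public static double estimate(BytesWritable binarySketch) {
    if (binarySketch == null || binarySketch.getLength() == 0) {
      return 0.0;
    }
    // Wrap the backing array, then restrict the Memory view to the valid bytes,
    // so any zero-filled padding past getLength() is never seen by the sketch.
    Memory mem = Memory.wrap(binarySketch.getBytes())
        .region(0, binarySketch.getLength());
    return Sketch.wrap(mem).getEstimate();
  }
}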

@leerho leerho closed this as completed in #51 Jun 2, 2020
@jmalkin jmalkin reopened this Jun 2, 2020
@jmalkin
Contributor

jmalkin commented Jun 2, 2020

@leerho Please revert that change and apply the PR that completely addresses the issue.

@leerho
Contributor

leerho commented Jun 2, 2020 via email

@jmalkin
Contributor

jmalkin commented Jun 2, 2020

Thanks again @koke for finding this. We did add your unit test, while fixing the underlying issue everywhere in the repo so that future changes in other sketches won't risk triggering this problem again.

@koke
Contributor Author

koke commented Jun 8, 2020

Is there a plan to do a new release with this? I just hit a new issue with IntersectSketchUDF and had to replicate the workaround for that one as well.
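For anyone hitting the same thing, a small hypothetical helper (not from the issue or the linked PRs) that applies the same right-sizing copy as the wrapper above before delegating to the affected UDF:

import org.apache.hadoop.io.BytesWritable;

public final class SketchBytes {
  private SketchBytes() {}

  // Returns a BytesWritable whose backing array is exactly getLength() bytes,
  // so downstream code that reads the whole array sees no zero-filled padding.
  public static BytesWritable rightSized(BytesWritable sketch) {
    if (sketch == null) {
      return null;
    }
    return new BytesWritable(sketch.copyBytes());
  }
}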

@leerho
Contributor

leerho commented Jul 6, 2020

@koke
@jmalkin
DataSketches-hive 1.1.0-incubating has been Released!

Please go to Downloads
