Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Java][Memory] Murmur hash implementation failing to hash less than 4 bytes. #38366

Closed
manolama opened this issue Oct 19, 2023 · 0 comments · Fixed by #38368
Closed

[Java][Memory] Murmur hash implementation failing to hash less than 4 bytes. #38366

manolama opened this issue Oct 19, 2023 · 0 comments · Fixed by #38368

Comments

@manolama
Copy link
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

When attempting to use the MurmurHasher for values like foo and bar, the same hash was produced.

Reproduction:

@Test
  public void testHasherLessThanInt() {
    try (ArrowBuf buf1 = allocator.buffer(4);
         ArrowBuf buf2 = allocator.buffer(4)) {
      buf1.writeBytes("foo1".getBytes(StandardCharsets.UTF_8));
      buf2.writeBytes("bar2".getBytes(StandardCharsets.UTF_8));

      for (int i = 1; i <= 4; i ++) {
        verifyHashCodeNotEqual(buf1, 0, i, buf2, 0, i);
      }
    }
  }

  private void verifyHashCodeNotEqual(ArrowBuf buf1, int offset1, int length1,
                                      ArrowBuf buf2, int offset2, int length2) {
    int hashCode1 = hasher.hashCode(buf1, 0, length1);
    int hashCode2 = hasher.hashCode(buf2, 0, length2);
    System.out.println(hashCode1 + " != " + hashCode2);
    //assertNotEquals(hashCode1, hashCode2);
  }

produces

-1684229222 != -1684229222
-26109336 != -26109336
-235476567 != -235476567
779730367 != 296486961

Component(s)

Java

manolama added a commit to manolama/arrow that referenced this issue Oct 19, 2023
manolama added a commit to manolama/arrow that referenced this issue Oct 19, 2023
manolama added a commit to manolama/arrow that referenced this issue Oct 19, 2023
lidavidm pushed a commit that referenced this issue Oct 20, 2023
### Rationale for this change

Using the `MurmurHash` implementation would cause collisions on small input values.

### What changes are included in this PR?

Fix the iteration for small and tail values that are not 4 bytes in length.

### Are these changes tested?

Yes

### Are there any user-facing changes?
Unlikely unless someone was using the `MurmurHash` functions to persist a hash value.

* Closes: #38366

Authored-by: Chris Larsen <clarsen@netflix.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
@lidavidm lidavidm added this to the 15.0.0 milestone Oct 20, 2023
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
…pache#38368)

### Rationale for this change

Using the `MurmurHash` implementation would cause collisions on small input values.

### What changes are included in this PR?

Fix the iteration for small and tail values that are not 4 bytes in length.

### Are these changes tested?

Yes

### Are there any user-facing changes?
Unlikely unless someone was using the `MurmurHash` functions to persist a hash value.

* Closes: apache#38366

Authored-by: Chris Larsen <clarsen@netflix.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 25, 2023
…pache#38368)

### Rationale for this change

Using the `MurmurHash` implementation would cause collisions on small input values.

### What changes are included in this PR?

Fix the iteration for small and tail values that are not 4 bytes in length.

### Are these changes tested?

Yes

### Are there any user-facing changes?
Unlikely unless someone was using the `MurmurHash` functions to persist a hash value.

* Closes: apache#38366

Authored-by: Chris Larsen <clarsen@netflix.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…pache#38368)

### Rationale for this change

Using the `MurmurHash` implementation would cause collisions on small input values.

### What changes are included in this PR?

Fix the iteration for small and tail values that are not 4 bytes in length.

### Are these changes tested?

Yes

### Are there any user-facing changes?
Unlikely unless someone was using the `MurmurHash` functions to persist a hash value.

* Closes: apache#38366

Authored-by: Chris Larsen <clarsen@netflix.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…pache#38368)

### Rationale for this change

Using the `MurmurHash` implementation would cause collisions on small input values.

### What changes are included in this PR?

Fix the iteration for small and tail values that are not 4 bytes in length.

### Are these changes tested?

Yes

### Are there any user-facing changes?
Unlikely unless someone was using the `MurmurHash` functions to persist a hash value.

* Closes: apache#38366

Authored-by: Chris Larsen <clarsen@netflix.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants