ARROW-6933: [Java] Support linear dictionary encoder #5692

liyafan82 · 2019-10-18T11:49:59Z

In many scenarios, the distribution of dictionary entries is highly skewed. In other words, a few dictionary entries occur much more frequently than the others. If we sort the dictionary in non-increasing order of entry frequency, and compare each value to encode against the dictionary entries starting from the beginning, we get the following benefits:

We need no extra memory space or data structure.
The search is extremely efficient, as we are likely to find a match in the first few entries of the dictionary.

This is the basic idea behind the linear dictionary encoder. When the scenario is right (a highly skewed dictionary distribution), it outperforms both search-based encoders and hash-table-based encoders.
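
To make the idea concrete, here is a minimal sketch of a linear scan over a dictionary. It is not the code from this PR: it uses a plain `String[]` instead of Arrow vectors, and the class name `LinearDictionarySketch` and its methods are hypothetical names chosen for illustration.

```java
import java.util.Arrays;

/** Hypothetical sketch of linear dictionary encoding; not the Arrow implementation. */
final class LinearDictionarySketch {

  /**
   * Encodes each value as the index of its first match in the dictionary,
   * scanning the dictionary from the front for every value.
   */
  static int[] encode(String[] dictionary, String[] values) {
    int[] indices = new int[values.length];
    for (int i = 0; i < values.length; i++) {
      indices[i] = indexOf(dictionary, values[i]);
    }
    return indices;
  }

  /** Linear search: no hash table or other auxiliary structure is needed. */
  private static int indexOf(String[] dictionary, String value) {
    for (int j = 0; j < dictionary.length; j++) {
      if (dictionary[j].equals(value)) {
        return j;
      }
    }
    throw new IllegalArgumentException("Value not in dictionary: " + value);
  }

  public static void main(String[] args) {
    // Assumed example data: "US" is by far the most frequent value,
    // so it is placed first in the dictionary.
    String[] dictionary = {"US", "CN", "DE", "NZ"};
    String[] values = {"US", "US", "CN", "US", "NZ", "US"};
    System.out.println(Arrays.toString(encode(dictionary, values)));
    // prints [0, 0, 1, 0, 3, 0]
  }
}
```

With skewed data such as the example above, most lookups stop at the first entry, which is where the expected speedup comes from.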

github-actions · 2019-10-18T12:06:06Z

https://issues.apache.org/jira/browse/ARROW-6933

liyafan82 · 2019-10-18T15:03:54Z

@emkornfield This is almost the same as #5058. Would you please take a look?

emkornfield · 2019-10-24T03:41:05Z

java/algorithm/src/main/java/org/apache/arrow/algorithm/dictionary/LinearDictionaryEncoder.java

+
+  /**
+   * The dictionary for encoding/decoding.
+   * It must be sorted.


I wouldn't think it needs to be sorted? Or at least clarify the sorting.

Sorry for the mistake. This sentence is removed.

emkornfield

A couple of minor comments on documentation; this can be merged afterwards.

emkornfield · 2019-10-24T03:42:04Z

java/algorithm/src/main/java/org/apache/arrow/algorithm/dictionary/LinearDictionaryEncoder.java

+
+  /**
+   * Constructs a dictionary encoder.
+   * @param dictionary the dictionary. Its entries should be sorted in the non-increasing order of their frequency.


Sorting is for performance, not correctness, correct?

Good point. This is made explicit in the revised code.
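
The point above can be illustrated with a small sketch, again using plain arrays rather than Arrow vectors. The class `DictionaryOrderSketch` and its methods are hypothetical and only count comparisons; they are not part of the PR. The round trip gives the same values for any dictionary order, while the number of comparisons depends on the order.

```java
import java.util.Arrays;

/** Hypothetical sketch: dictionary order changes cost, not results. */
final class DictionaryOrderSketch {

  /** Counts the equals() calls made by a front-to-back scan over all values. */
  static long comparisons(String[] dictionary, String[] values) {
    long count = 0;
    for (String value : values) {
      for (String entry : dictionary) {
        count++;
        if (entry.equals(value)) {
          break;
        }
      }
    }
    return count;
  }

  /** Encodes each value to an index, then decodes it with the same dictionary. */
  static String[] roundTrip(String[] dictionary, String[] values) {
    String[] decoded = new String[values.length];
    for (int i = 0; i < values.length; i++) {
      int index = -1;
      for (int j = 0; j < dictionary.length; j++) {
        if (dictionary[j].equals(values[i])) {
          index = j;
          break;
        }
      }
      if (index < 0) {
        throw new IllegalArgumentException("Value not in dictionary: " + values[i]);
      }
      decoded[i] = dictionary[index];
    }
    return decoded;
  }

  public static void main(String[] args) {
    String[] values = {"a", "a", "a", "b", "a", "c"};
    String[] byFrequency = {"a", "b", "c"}; // most frequent entry first
    String[] reversed = {"c", "b", "a"};    // most frequent entry last

    System.out.println(comparisons(byFrequency, values)); // prints 9
    System.out.println(comparisons(reversed, values));    // prints 15
    System.out.println(Arrays.equals(roundTrip(byFrequency, values),
        roundTrip(reversed, values)));                    // prints true
  }
}
```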

emkornfield · 2019-10-24T03:42:22Z

java/algorithm/src/main/java/org/apache/arrow/algorithm/dictionary/LinearDictionaryEncoder.java

+  private Range range;
+
+  /**
+   * Constructs a dictionary encoder.


note that dictionary encoding is false by default.

Good point. This is made explicit in the revised code.

codecov-io · 2019-10-24T05:49:29Z

Codecov Report

Merging #5692 into master will increase coverage by 1.04%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #5692      +/-   ##
==========================================
+ Coverage   88.93%   89.98%   +1.04%     
==========================================
  Files         989      739     -250     
  Lines      134508   114814   -19694     
  Branches     1501        0    -1501     
==========================================
- Hits       119627   103311   -16316     
+ Misses      14516    11503    -3013     
+ Partials      365        0     -365

Impacted Files	Coverage Δ
python/pyarrow/memory.pxi	`62.79% <0%> (-3.07%)`	⬇️
python/pyarrow/plasma.py	`56.16% <0%> (-2.74%)`	⬇️
cpp/src/arrow/sparse_tensor.cc	`98.55% <0%> (-0.79%)`	⬇️
cpp/src/arrow/python/pyarrow.cc	`28.88% <0%> (-0.66%)`	⬇️
python/pyarrow/tests/test_plasma.py	`95.76% <0%> (-0.48%)`	⬇️
python/pyarrow/serialization.py	`84.35% <0%> (-0.42%)`	⬇️
cpp/src/arrow/dataset/test_util.h	`90.65% <0%> (-0.26%)`	⬇️
cpp/src/arrow/dataset/file_parquet.cc	`95.89% <0%> (-0.06%)`	⬇️
cpp/src/arrow/python/numpy_convert.cc	`95.02% <0%> (-0.05%)`	⬇️
python/pyarrow/tests/test_ipc.py	`99.08% <0%> (-0.02%)`	⬇️
... and 296 more

Last update 3207ac9...78d7196.

[ARROW-6933][Java] Support linear dictionary encoder

2268438

fsaintjacques added the Component: Java label Oct 21, 2019

emkornfield reviewed Oct 24, 2019

emkornfield approved these changes Oct 24, 2019

[ARROW-6933][Java] Improve Javadocs

78d7196

emkornfield closed this in ac99ca0 Oct 24, 2019

asfimport mentioned this pull request Oct 24, 2019

[Java] Support linear dictionary encoder #23254

Closed
