-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-6933: [Java] Suppor linear dictionary encoder #5692
Conversation
@emkornfield This is almost the same as #5058. Would you please take a look? |
|
||
/** | ||
* The dictionary for encoding/decoding. | ||
* It must be sorted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't think it needs to be sorted? or at least clarify sorting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the mistake. This sentence is removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a couple of minor comments on documentation, can be merged afterwards.
|
||
/** | ||
* Constructs a dictionary encoder. | ||
* @param dictionary the dictionary. Its entries should be sorted in the non-increasing order of their frequency. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorting is performance not correctness correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. This is made explicity in the revised code.
private Range range; | ||
|
||
/** | ||
* Constructs a dictionary encoder. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note that dictionary encoding is false by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. This is made explicit in the revised code.
Codecov Report
@@ Coverage Diff @@
## master #5692 +/- ##
==========================================
+ Coverage 88.93% 89.98% +1.04%
==========================================
Files 989 739 -250
Lines 134508 114814 -19694
Branches 1501 0 -1501
==========================================
- Hits 119627 103311 -16316
+ Misses 14516 11503 -3013
+ Partials 365 0 -365
Continue to review full report at Codecov.
|
For many scenarios, the distribution of dictionary entries is highly skewed. In other words, a few dictionary entries occurs much more frequently than others. If we can sort the dictionary by the non-increasing order of entry frequencies, and compare each value to encode from the beginning of the dictionary, we get the following benefits:
This is the basic idea behind the linear dictionary encoder. When the scenario is right (highly skewed dictionary distribution), it outperforms both search based encoder and hash table based encoders.