ARROW-6933: [Java] Support linear dictionary encoder #5692

Closed
wants to merge 2 commits into from

liyafan82
Contributor

In many scenarios, the distribution of dictionary entries is highly skewed: a few entries occur much more frequently than the others. If we sort the dictionary in non-increasing order of entry frequency and compare each value to encode against the dictionary entries from the beginning, we get the following benefits:

  1. We need no extra memory space or data structure.
  2. The search is extremely efficient, as we are likely to find a match in the first few entries of the dictionary.

This is the basic idea behind the linear dictionary encoder. When the scenario is right (a highly skewed distribution of dictionary entries), it outperforms both the search-based encoder and the hash-table-based encoder.
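The idea described above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not the encoder added by this PR: the class name `LinearDictionarySketch`, its methods, and the use of `List<String>` instead of Arrow vectors are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

/** Hypothetical sketch of linear dictionary encoding; not the class from this PR. */
final class LinearDictionarySketch {

  /**
   * Scans the dictionary from the front and returns the index of the first
   * entry equal to {@code value}, or -1 if no entry matches.
   */
  static int encodeOne(List<String> dictionary, String value) {
    for (int i = 0; i < dictionary.size(); i++) {
      if (Objects.equals(dictionary.get(i), value)) {
        return i;
      }
    }
    return -1;
  }

  /** Encodes every value with a front-to-back scan; no auxiliary index is built. */
  static int[] encode(List<String> dictionary, List<String> values) {
    int[] indices = new int[values.size()];
    for (int i = 0; i < values.size(); i++) {
      indices[i] = encodeOne(dictionary, values.get(i));
    }
    return indices;
  }

  public static void main(String[] args) {
    // Assumed ordering: the most frequent entry comes first.
    List<String> dictionary = Arrays.asList("US", "CN", "DE", "NZ");
    List<String> values = Arrays.asList("US", "US", "CN", "US", "NZ");
    // prints [0, 0, 1, 0, 3]
    System.out.println(Arrays.toString(encode(dictionary, values)));
  }
}
```

With a skewed input such as the one in `main`, most lookups stop at the first or second entry, which is where the claimed speed comes from. Returning -1 for a missing value is a choice made for this sketch only.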

@liyafan82
Contributor Author

@emkornfield This is almost the same as #5058. Would you please take a look?


/**
* The dictionary for encoding/decoding.
* It must be sorted.
Contributor

@emkornfield emkornfield Oct 24, 2019

I wouldn't think it needs to be sorted? Or at least clarify the sorting.

Contributor Author

Sorry for the mistake. This sentence is removed.

Contributor

@emkornfield emkornfield left a comment

A couple of minor comments on documentation; this can be merged afterwards.


/**
* Constructs a dictionary encoder.
* @param dictionary the dictionary. Its entries should be sorted in the non-increasing order of their frequency.
Contributor

Sorting is for performance, not correctness, correct?

Contributor Author

Good point. This is made explicit in the revised code.
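To illustrate the point that ordering affects cost rather than results, here is a small sketch in plain Java. It assumes every value is present in the dictionary; the class `DictionaryOrderSketch` and its methods are hypothetical and do not come from the PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: dictionary order changes the scan cost, not the decoded data. */
final class DictionaryOrderSketch {

  /** Counts the entry comparisons made by front-to-back scans over all values. */
  static long countComparisons(List<String> dictionary, List<String> values) {
    long comparisons = 0;
    for (String value : values) {
      for (String entry : dictionary) {
        comparisons++;
        if (entry.equals(value)) {
          break;
        }
      }
    }
    return comparisons;
  }

  /** Encodes each value to an index and decodes it again (values must be in the dictionary). */
  static List<String> roundTrip(List<String> dictionary, List<String> values) {
    List<String> decoded = new ArrayList<>();
    for (String value : values) {
      int index = dictionary.indexOf(value); // encode: linear scan from the front
      decoded.add(dictionary.get(index));    // decode: positional lookup
    }
    return decoded;
  }

  public static void main(String[] args) {
    List<String> values = Arrays.asList("a", "a", "a", "b", "c");
    List<String> byFrequency = Arrays.asList("a", "b", "c");
    List<String> reversed = Arrays.asList("c", "b", "a");

    // Both orderings reproduce the input exactly.
    System.out.println(roundTrip(byFrequency, values).equals(values)); // prints true
    System.out.println(roundTrip(reversed, values).equals(values));    // prints true

    // Only the number of comparisons differs.
    System.out.println(countComparisons(byFrequency, values)); // prints 8
    System.out.println(countComparisons(reversed, values));    // prints 12
  }
}
```

The encoded indices differ between the two orderings, but each is decoded against its own dictionary, so the round trip is lossless either way.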

private Range range;

/**
* Constructs a dictionary encoder.
Contributor

Note that dictionary encoding is false by default.

Contributor Author

Good point. This is made explicit in the revised code.

@codecov-io

Codecov Report

Merging #5692 into master will increase coverage by 1.04%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #5692      +/-   ##
==========================================
+ Coverage   88.93%   89.98%   +1.04%     
==========================================
  Files         989      739     -250     
  Lines      134508   114814   -19694     
  Branches     1501        0    -1501     
==========================================
- Hits       119627   103311   -16316     
+ Misses      14516    11503    -3013     
+ Partials      365        0     -365
Impacted Files Coverage Δ
python/pyarrow/memory.pxi 62.79% <0%> (-3.07%) ⬇️
python/pyarrow/plasma.py 56.16% <0%> (-2.74%) ⬇️
cpp/src/arrow/sparse_tensor.cc 98.55% <0%> (-0.79%) ⬇️
cpp/src/arrow/python/pyarrow.cc 28.88% <0%> (-0.66%) ⬇️
python/pyarrow/tests/test_plasma.py 95.76% <0%> (-0.48%) ⬇️
python/pyarrow/serialization.py 84.35% <0%> (-0.42%) ⬇️
cpp/src/arrow/dataset/test_util.h 90.65% <0%> (-0.26%) ⬇️
cpp/src/arrow/dataset/file_parquet.cc 95.89% <0%> (-0.06%) ⬇️
cpp/src/arrow/python/numpy_convert.cc 95.02% <0%> (-0.05%) ⬇️
python/pyarrow/tests/test_ipc.py 99.08% <0%> (-0.02%) ⬇️
... and 296 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3207ac9...78d7196.


4 participants