[PARQUET-36] Add support for dictionaries to FilteringPrimitiveConverter#117
MickDavies wants to merge 2 commits into apache:master
|
Thanks - I think I tried these instructions a few weeks ago and couldn't compile thrift. I can't remember why not; I think it may have had to do with Yosemite. I'll give it another go. Mick
|
|
oops test is wrong - fixing now
|
I have fixed the test I wrote, which exercises the dictionary filtering for the eq and neq cases. The generator produced some code for UDFs. I'll look for some tests of this and add to those as well.
|
I am running into the issue that is detailed here - https://issues.apache.org/jira/browse/THRIFT-2229. This prevents me from building thrift 0.7, and I'm not sure if there is a workaround. Have other people got thrift 0.7 running on Mac OS 10.9 or above? Mick
|
|
Is there a reason why parquet continues to depend on this old version of thrift? I think I read somewhere it was used just for testing. Would it be easy to upgrade? Thanks Mick
|
|
@MickDavies: I don't think you need to use thrift 0.7.0. I've built with 0.9.0 without problems, and newer versions (before 0.9.2 at least) should work.
The pattern we use elsewhere is that the concrete classes include an implementation rather than having the abstract class carry an instance variable. That would be better for two reasons:
- It would not require changing the constructor's signature
- It ensures that the concrete classes always match their reported primitive type
Otherwise, I could instantiate a PlainLongDictionary and pass in PrimitiveTypeName.BINARY without triggering a complaint.
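The pattern described above can be sketched roughly as follows. This is a minimal illustration using simplified stand-in types, not the actual Parquet classes: the `TypeName` enum and the shapes of `Dictionary` and `PlainLongDictionary` here are assumptions made for the example. Each concrete dictionary hard-codes its own type, so it cannot be constructed with a mismatched one.

```java
// Simplified stand-ins for illustration only; not the real Parquet classes.
enum TypeName { INT64, BINARY }

abstract class Dictionary {
    // The base class holds no type field and takes no type in its constructor;
    // each concrete subclass reports its own type instead.
    abstract TypeName getPrimitiveTypeName();
}

class PlainLongDictionary extends Dictionary {
    private final long[] values;

    PlainLongDictionary(long[] values) {
        this.values = values.clone();
    }

    @Override
    TypeName getPrimitiveTypeName() {
        // Fixed by the class itself, so a caller cannot pass in BINARY.
        return TypeName.INT64;
    }

    long decodeToLong(int id) {
        return values[id];
    }
}
```

With this shape the constructor signature stays unchanged, which is the first of the two reasons given in the comment.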
Thanks for the review, I think that the suggestion is good. It would mean creating a couple of new dictionary classes e.g. PlainInt96Dictionary.
Are you happy with Dictionary carrying the type information, or at least having an abstract method to get that information? I did this to make adding dictionary support to SimplePrimitiveConverter simpler, and I added dictionary support to SimplePrimitiveConverter so I could test using the existing framework.
I have made the change suggested above
|
Hi @MickDavies
|
|
Hi @julienledem I guess you could separate these two features; I hadn't done so in my mind. The former is important, judging from measurements I have done for some datasets using Spark. For the data I am using, a query over a table with around 100M rows can run 25% faster. I guess the second one just seemed like an obvious optimization.
|
Both features are useful and they will both bring gains for different reasons.
|
Thanks for all the feedback, the comments were very useful. I have responded to most of the comments in the last check-in, with:
I still have to do the following:
|
|
I've added a unit test for FilteringPrimitiveConverter. I wanted to use a mocking library and saw that mockito and easymock are both referenced in the poms, though only mockito seems to be in use, so I used that. I also upgraded from junit 4.10 to 4.12 so that I could set the name in the Parameterized junit annotation, which makes the test output look a bit nicer but is not essential.
|
I'm going to simplify this work by removing the changes to the simple example implementation.
Add dictionary support to FilteringPrimitiveConverter. Add related tests
A little tidy up and add some comments
|
Is anyone free to review? I would like to get this change in or rejected. Thanks
|
Can this MR be reviewed please? This is a big performance blocker when reading a large parquet file with many String columns using a Filter, as a new String instance is created for each record read even though the cardinality is low.
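To illustrate the cost described above, here is a minimal sketch of why dictionary support helps on low-cardinality string columns. It is not code from this PR: `CachedStringDictionary` is a hypothetical stand-in, and the real Parquet converter API differs. The idea is that each distinct value is decoded to a String at most once, and every record that refers to it by dictionary id reuses that instance.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a dictionary-encoded string column.
class CachedStringDictionary {
    private final byte[][] encoded; // one entry per distinct value
    private final String[] decoded; // filled lazily, at most once per entry

    CachedStringDictionary(byte[][] encoded) {
        this.encoded = encoded;
        this.decoded = new String[encoded.length];
    }

    // Returns the same String instance every time a given id is read,
    // instead of allocating a new String per record.
    String decode(int id) {
        String s = decoded[id];
        if (s == null) {
            s = new String(encoded[id], StandardCharsets.UTF_8);
            decoded[id] = s;
        }
        return s;
    }
}
```

Under this assumption, a column with a handful of distinct values allocates only that many Strings, regardless of how many records are read.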
  @Override
  public boolean hasDictionarySupport() {
-   return false;
+   return true;
Shouldn't this return delegate.hasDictionarySupport()?
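A minimal sketch of what the reviewer's suggestion might look like, assuming a wrapper that holds a `delegate` converter. The `ValueConverter` interface and `FilteringConverter` class below are hypothetical simplifications for illustration; they are not the real Parquet types, and this is the alternative being asked about rather than what the PR does.

```java
// Hypothetical minimal converter interface; the real Parquet API is larger.
interface ValueConverter {
    boolean hasDictionarySupport();
}

class FilteringConverter implements ValueConverter {
    private final ValueConverter delegate;

    FilteringConverter(ValueConverter delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean hasDictionarySupport() {
        // Only claim dictionary support when the wrapped converter has it.
        return delegate.hasDictionarySupport();
    }
}
```

The design question is whether the wrapper can handle dictionaries on its own or must defer to the converter it wraps; the thread leaves this open.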
Hi
I'm not sure. It's 4 years since this PR, so it's pretty stale now.
I could try to resurrect it but I don't know if there is much interest from the project.
Mick
https://issues.apache.org/jira/browse/PARQUET-36
Please take a look. I have made some changes to Dictionary which need careful consideration and perhaps an alternative approach is better.
I also made some changes to the example implementation.
I am going to think about adding a few more tests.
Unfortunately I am having real problems building the whole project, as I can't get thrift 0.7 for my Mac. Does anyone know how to do this?