Skip to content

[PARQUET-36] Add support for dictionaries to FilteringPrimitiveConverter#117

Open
MickDavies wants to merge 2 commits intoapache:masterfrom
MickDavies:PARQUET-36
Open

[PARQUET-36] Add support for dictionaries to FilteringPrimitiveConverter#117
MickDavies wants to merge 2 commits intoapache:masterfrom
MickDavies:PARQUET-36

Conversation

@MickDavies
Copy link

https://issues.apache.org/jira/browse/PARQUET-36

Please take a look. I have made some changes to Dictionary which need careful consideration and perhaps an alternative approach is better.

Also I made some changes to example implementation.

I am going to think about adding a few more tests.

Unfortunately I am having real problems building all project as I can get thrift 0.7 for my Mac. Does anyone know how to do this?

@julienledem
Copy link
Member

@MickDavies
Copy link
Author

Thanks - I think I tried these instructions a few weeks ago and I couldn’t compile thrift. I can’t remember why not, I think it may have been to do with yosemite.

I’ll give it another go.

Mick

On 5 Feb 2015, at 18:36, Julien Le Dem notifications@github.com wrote:

https://github.com/Parquet/parquet-mr/wiki/Developer-documentation#version-of-the-thrift-compiler-to-install https://github.com/Parquet/parquet-mr/wiki/Developer-documentation#version-of-the-thrift-compiler-to-install

Reply to this email directly or view it on GitHub https://github.com/apache/incubator-parquet-mr/pull/117#issuecomment-73100492.

@MickDavies
Copy link
Author

oops test is wrong - fixing now

@MickDavies
Copy link
Author

I have fixed the test I wrote which exercises the dictionary filtering for eq and neq cases. The generator produced some code for udfs. I'll look for some tests of this and add to those also.

@MickDavies
Copy link
Author

I am running into the issue that is detailed here - https://issues.apache.org/jira/browse/THRIFT-2229 https://issues.apache.org/jira/browse/THRIFT-2229. This prevents me building thrift 0.7, I’m not sure if there is a workaround.

Have other people got thrift 0.7 running on Mac OS 10.9 or above?

Mick

On 5 Feb 2015, at 18:44, Michael Davies michael.belldavies@gmail.com wrote:

Thanks - I think I tried these instructions a few weeks ago and I couldn’t compile thrift. I can’t remember why not, I think it may have been to do with yosemite.

I’ll give it another go.

Mick

On 5 Feb 2015, at 18:36, Julien Le Dem <notifications@github.com mailto:notifications@github.com> wrote:

https://github.com/Parquet/parquet-mr/wiki/Developer-documentation#version-of-the-thrift-compiler-to-install https://github.com/Parquet/parquet-mr/wiki/Developer-documentation#version-of-the-thrift-compiler-to-install

Reply to this email directly or view it on GitHub https://github.com/apache/incubator-parquet-mr/pull/117#issuecomment-73100492.

@MickDavies
Copy link
Author

Is there a reason why parquet continues to depend on this old version of thrift? I think I read somewhere it was used just for testing.

Would it be easy to upgrade?

Thanks

Mick

On 7 Feb 2015, at 17:48, Michael Davies michael.belldavies@gmail.com wrote:

I am running into the issue that is detailed here - https://issues.apache.org/jira/browse/THRIFT-2229 https://issues.apache.org/jira/browse/THRIFT-2229. This prevents me building thrift 0.7, I’m not sure if there is a workaround.

Have other people got thrift 0.7 running on Mac OS 10.9 or above?

Mick

On 5 Feb 2015, at 18:44, Michael Davies <michael.belldavies@gmail.com mailto:michael.belldavies@gmail.com> wrote:

Thanks - I think I tried these instructions a few weeks ago and I couldn’t compile thrift. I can’t remember why not, I think it may have been to do with yosemite.

I’ll give it another go.

Mick

On 5 Feb 2015, at 18:36, Julien Le Dem <notifications@github.com mailto:notifications@github.com> wrote:

https://github.com/Parquet/parquet-mr/wiki/Developer-documentation#version-of-the-thrift-compiler-to-install https://github.com/Parquet/parquet-mr/wiki/Developer-documentation#version-of-the-thrift-compiler-to-install

Reply to this email directly or view it on GitHub https://github.com/apache/incubator-parquet-mr/pull/117#issuecomment-73100492.

@rdblue
Copy link
Contributor

rdblue commented Mar 4, 2015

@MickDavies: I don't think you need to use thrift 0.7.0. I've built with 0.9.0 without problems, and newer versions (before 0.9.2 at least) should work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern we use elsewhere is that the concrete classes include an implementation rather than having the abstract class carry an instance variable. That would be better for two reasons:

  1. It would not require changing the constructor's signature
  2. It ensures that the concrete classes always match their reported primitive type

Otherwise, I could instantiate a PlainLongDictionary and pass in PrimitiveTypeName.BINARY without triggering a complaint.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review, I think that the suggestion is good. It would mean creating a couple of new dictionary classes e.g. PlainInt96Dictionary.

Are you happy with Dictionary carrying the type information, or at least having abstract method to get information. I did this to make adding dictionary support to SimplePrimitiveConverter simpler, and I added dictionary support to SimplePrimitiveConverter so I could test using existing framework.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made the change suggested above

@julienledem
Copy link
Member

Hi @MickDavies
Thanks for looking into this. I made some comments above.
There are two different features in there:

  • allow the dictionary ids to go through when the underlying converter supports them
  • make the filtering more efficient when there's a dictionary by knowing all the values before hand.
    I'd like someone to add in the description of PARQUET-36 a short description of how we achieve these.
    Does it sound good?
    Thanks again!
    Julien

@MickDavies
Copy link
Author

Hi @julienledem

I guess you could separate these two features, I hadn't done so in my mind. The former is important from measurements I have done for some datasets using Spark. For data I am using a query over a table with around 100M rows can run 25% faster.

I guess the second one just seemed like an obvious optimization

@julienledem
Copy link
Member

both features are useful and they will both bring gains for different reasons.
We don't need to split the JIRA in two but we should keep in mind those 2 distinct goals when reviewing.
I think there are several ways the second optimization could be done and that's why I think a short spec in the JIRA would be useful.

@MickDavies
Copy link
Author

Thanks for all the feedback, they comments were very useful.

I have responded to most of the comments in the last check-in, with:

  • Change method name update to evaluateFilterForDictionaryElement to clarify intent
  • Change Dictionary classes so that type information is abstract method associated with subclass, not carried as attribute
  • Change FilteringPrimitiveConverter so that it supports dictionaries even if delegates do not
  • Simplify implementation of SimplePrimitiveConverter by using type converter

I still have to do the following:

  • Update JIRA as requested by @julienledem
  • Add tests for dictionary filtering where delegate does not support dictionaries
  • Ensure build works completely - try to get build going with thrift 0.9.1 working - looks positive
  • Look at version conformance issues.

@MickDavies
Copy link
Author

I've added a unit test for FilteringPrimitiveConverter. I wanted to use a mocking library and saw that mockito and eaymock referenced in poms though only mockito seems to be being used, so I used that.

Also I upgraded to junit 4.12 from 4.10, so that I could set the name in Parameterized junit annotation, which makes the test output look a bit nicer, but is not essential.

@MickDavies MickDavies closed this Apr 2, 2015
@MickDavies
Copy link
Author

I'm going to simplify this work removing changes to example simple implementation

@MickDavies MickDavies reopened this Apr 2, 2015
Add dictionary support to FilteringPrimitiveConverter.

Add related tests
A little tidy up and add some comments
@MickDavies
Copy link
Author

Is anyone free to review, I would like to get this change in or rejected. Thanks

@mpeleshenko
Copy link

Can this MR be reviewed please? This is a big performance blocker when reading a large parquet file with many String columns using a Filter as new String instances are created for each record read even though cardinality is low.

@Override
public boolean hasDictionarySupport() {
return false;
return true;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this return delegate.hasDictionarySupport()?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi

I'm not sure. It's 4 years since this PR, so it's pretty stale now.

I could try to resurrect it but I don't know if there is much interest from the project.

Mick

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants