New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature] conditional range index configs based on an attribute value #1233

Merged
merged 2 commits into from Jan 31, 2017

Conversation

Projects
None yet
6 participants
@olvidalo
Copy link
Contributor

olvidalo commented Jan 19, 2017

This adds the possibility to add an attribute condition on complex range index config elements:

<create qname="tei:note">
  <condition attribute="type" value="text_type" />
  <field name="text_type" type="xs:string" case="no"></field>
</create>

Will only store the text of the tei:note element in the text_typefield if it has an attribute named "type" with the value "text_type".

Queries like

tei:note[@type="text_type"][.="some text type"] 

will be rewritten to

tei:note[range:field("text_type", "eq", "some text type")] 

I tried to add as little complexity as possible to keep the performance impact at a minimum, especially if there are no conditions defined. If anyone has any suggestions for some benchmarks I could do let me know...

Background: In a project I have to deal with data like

<note type="language">
    <term type="language">nicht ermittelt</term>
</note>
<note type="script">
    <term type="script"/>
    <note type="writing_direction">nicht ermittelt</note>
    <note type="ink"/>
    <note type="hand_desc">
        <p/>
    </note>
    <note type="structural_markers">
        <p/>
    </note>
</note>

And I want to build a faceted search feature with the contents of the note element as keys. Using conditional range index fields based on the note's type and retrieving them with range:index-keys-for-field gives better performance compared to defining separate index fields on the type and the text values and then using distinct-values() on them...

@joewiz

This comment has been minimized.

Copy link
Member

joewiz commented Jan 19, 2017

Very cool and thought-provoking! I can imagine this <condition> idea being extended to Lucene full text indexes too - to limit the terms in a full text search field to just those in <div> elements of a certain @type.

One question: This feature applies only to "new" range indexes - i.e., those wrapped in <range> elements in collection.xconf, right?

@olvidalo

This comment has been minimized.

Copy link
Contributor

olvidalo commented Jan 20, 2017

Yes it applies to the "new" range index.

You're right this could definitely be also useful for the lucene index. From a first glance there seems to be a lot of similar code so this shouldn't be too hard to implement. Probably even easier because we don't have to deal with query rewriting. Also conditions other than "an attribute equals" could be possible in the future... but let's get this merged first 😉

@dizzzz dizzzz requested a review from wolfgangmm Jan 20, 2017

@dizzzz dizzzz added the enhancement label Jan 20, 2017

@olvidalo

This comment has been minimized.

Copy link
Contributor

olvidalo commented Jan 20, 2017

I do have some ideas for refinement of the code, for example making ComplexRangeIndexConfigCondition abstract to allow for different kinds of conditions to be implemented as a subclass, and I also have a proof-of-concept ready for defining conditions for an arbitrary xpath (performance might be an issue though)...

But before I do any more work on this I'd like to wait for a preliminary comment on if this condition stuff would be considered for merging at all, maybe even already in 3.0...

@adamretter

This comment has been minimized.

Copy link
Member

adamretter commented Jan 21, 2017

@olvidalo For the Algolia Index plugin I am creating for @ttasovac I have an outstanding task to support simple predicates in the path expressions which define the Indexable objects. I wonder if we could perhaps both align on the same syntax generally and abstract this into a base class for indexes.

If we switched to supporting some simple predicate syntax, instead of your:

<create qname="tei:note">
  <condition attribute="type" value="text_type" />
  <field name="text_type" type="xs:string" case="no"/>
</create>

We would have something like:

<create qname="tei:note[@type eq 'text_type']">
  <field name="text_type" type="xs:string" case="no"/>
</create>

What do you think?

@olvidalo

This comment has been minimized.

Copy link
Contributor

olvidalo commented Jan 23, 2017

@adamretter sounds like a good idea agreeing on a syntax. I agree that the simple predicate syntax you are proposing looks nicer than my condition element solution and straightforward to understand for someone acquainted with xpath. I'm unsure though how it would look from a user perspective having to use "crippled" xpath in their index definitions. I'm just imagining the documentation... "you can use something like an xpath to select the node, but it is not an actual xpath but actually a nodepath that has these limitations, and then you can restrict the indexed elements with a predicate that looks quite like an xpath predicate but actually restricts you this very special syntax" - and then we would have to parse the predicate and fail if the expression is too complex, or uses operators we don't support, etc. And this would get even more complex if, for example, the Algolia index would support an extended or limited set of predicates from the range index.

For example I'm working on supporting arbitrary xpath expressions in addition to a simple attribute condition. I am using eXist's XQuery engine to evaluate them for the specified element. Keeping performance in mind though, I do want to keep both kinds of conditions implemented separately as an attribute condition is way chepaer to evaluate than an xpath expression. So my preliminary plan was to use the condition tag but also allow "xpath" as an attribute, so when parsing I can discriminate between attribute and xpath conditions and instantiate different classes with different evaluation code depending on which attributes are specified.

But then again I'm also intrigued by the simplicity of your syntax, I'm just unsure about the complexity it adds. I'm still open to the idea and not at all zeroed in on my solution, but maybe we can discuss this a little...

@shabanovd

This comment has been minimized.

Copy link
Member

shabanovd commented Jan 24, 2017

@adamretter I'm agree with @olvidalo, xpath expressions will be very limited and confusing at the end. Plus, complexity of parsing and validation.

Btw, I did add also functions contains, not-contains, startsWith, endsWith and equals-ignore-case to check value of attributes. You have only equals function. (That was coded for lucene index). Plus, check for parent-attribute. So far it cover all current needs.

Here and example of configuration syntax

<config>
    <content field-name="__generic-content" isFacet="false"/>
    <rules>
        <rule-eXist22 field-name="testing" qname="element-name" boost="0.5" isFacet="false"/>
        <rule-complex field-name="__topic-title" boost="3.0" isFacet="false">
            <place type="attribute" name="@class" fun="contains" param=" topic/title "/>
        </rule-complex>
        <rule-complex field-name="__topic-title-altcontent" boost="2.0" isFacet="false">
            <place type="attribute" name="@class" fun="contains" param=" topic/title "/>
            <require type="parent-attribute" name="@class" fun="not-contains" param=" topic/topic "/>
            <require type="parent-attribute" name="@class" fun="not-contains" param=" map/map "/>
        </rule-complex>
        <rule-complex index-attribute="id" field-name="myIdFieldName" boost="1.0" isFacet="false">
            <require type="attribute" name="@domains" fun="contains" param="(topic task)"/>
        </rule-complex>
    </rules>
</config>

I'm ready to discuss and implement common rules syntax to share it between indexes.

@olvidalo olvidalo force-pushed the olvidalo:feature/range-index-conditions branch from cffba1e to e28e7ee Jan 30, 2017

@wolfgangmm wolfgangmm merged commit c1b5b79 into eXist-db:develop Jan 31, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@joewiz

This comment has been minimized.

Copy link
Member

joewiz commented Jan 31, 2017

@olvidalo Congrats! Thanks for your contribution - the first low-level, indexing contribution from outside the core developers in some time, AFAIK. A good sign for the community.

@olvidalo

This comment has been minimized.

Copy link
Contributor

olvidalo commented Feb 8, 2017

Just a late thanks for the merge, I'm verry happy this got in in time for 3.0! 🎉 Which, as rumors have it, is about to be released every second now 😉

Meanwhile I've been working on a couple of extensions to the attribute condition (while maintaining syntax compatibility), also adding some features from @shabanovd's proposal. PR is coming up...

@joewiz

This comment has been minimized.

Copy link
Member

joewiz commented Jan 15, 2019

@olvidalo A question about this feature came up on Stack Overflow. I answered to the best of my ability, but if you have a moment would you please take a look and confirm if I was correct? Thanks! https://stackoverflow.com/questions/54174822

@olvidalo

This comment has been minimized.

Copy link
Contributor

olvidalo commented Jan 15, 2019

@joewiz yes this is correct!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment