Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EZP-26141: Search appears to conduct some form of 'OR' search not and 'AND' search #63

Closed
wants to merge 5 commits into from

Conversation

3 participants
@pspanja
Copy link
Collaborator

commented Sep 7, 2016

Resolves https://jira.ez.no/browse/EZP-26141

TODO

  • add tests to kernel repo
@andrerom

This comment has been minimized.

Copy link
Member

commented Sep 7, 2016

So what is supported here now? :)

Is this supported?

  • Red Apple will hit on content with both words in them
    • and rank those with exact phrase highest
  • "Red Apple" will only hit on exact phrases
  • Any kind of AND, OR, +, -

Just asking to get a clear picture, cause would be cool to support some or all of above , even if we don't support it on legacy search (yet)*.

@pspanja

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 8, 2016

For now I went for words and implicit OR operator, since that is how Legacy engine behaves (Fulltext criterion actually documents only implicit AND). But if we say Legacy engine will behave differently, we can go for all you described, both for Solr and Elastic, plus even grouping with parenthesis.

Only problem there is that we need to allow targeting specific fields for boosting, so it will require some parsing to escape everything properly. Also, fuzziness can't be applied to phrases/groups, only words, so we need to decide how to handle that case.

@andrerom

This comment has been minimized.

Copy link
Member

commented Sep 8, 2016

Ok, seems clear, if possible lets do the following for now to improve it slightly:

  • if possible boost the whole phrase so user will get search results in following order:
    • Exact hits on "Red Apple"
    • Hits that contains both words
    • Hits that contains both works with fuzziness
    • Hits that contain one of them
    • Hits that contain one of them with fuzziness

And then we can do a feature request on operators and such later.

@andrerom

This comment has been minimized.

Copy link
Member

commented Sep 8, 2016

Note: We should apply this to 1.0 branch as well.

@pspanja

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 8, 2016

Added exact phrase match (on boosted fields as well) in 2b96276

We will probably need to change this when introducing operators as phrase won't be exactly defined in that case.

@pspanja

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 8, 2016

Added more tests that show what is generated in cfb582d

$criterion = new Criterion\FullText('Hello World');
$this->assertEquals(
'(text:(Hello World) OR text:"Hello World")',

This comment has been minimized.

Copy link
@andrerom

andrerom Sep 8, 2016

Member

ok so text:(Hello World) was given to Solr all along, I miss read and thought it was text:(Hello) OR text:(World), so then adding the phrasing provably does not make a difference then. Solr is potentially already ranking exact hits first, or?

This comment has been minimized.

Copy link
@pspanja

pspanja Sep 9, 2016

Author Collaborator

Actually, text:(Hello World) is functionally identical totext:(Hello) OR text:(World).
So matching phrase does make a difference, for example, for the added test in kernel, the order is same, but score difference is more pronounced. Without phrase match:

  1. ID:60, 0.9023705
  2. ID:62, 0.73013973
  3. ID:61, 0.1964612

And with phrase match:

  1. ID:60, 2.6263494
  2. ID:62, 0.95123565
  3. ID:61, 0.04279712

But I'm not sure if we really need this. Would rather go with simpler approach and add proper explicit phrase matching later on.

This comment has been minimized.

Copy link
@andrerom

andrerom Sep 9, 2016

Member

-del-

ok, see your point, it will mess with boosted values that should be higher.

This comment has been minimized.

Copy link
@andrerom

andrerom Sep 9, 2016

Member

So assuming ID 60 will not anymore score highest with "World Hello" (without phrase match), then I agree we delete it.

This comment has been minimized.

Copy link
@pspanja

pspanja Sep 9, 2016

Author Collaborator

Indeed it can mess with boost, but not on that example. Better example, indexing:

  1. ID:60, name:'red', sort_name:'red apple'
  2. ID:61: name:'apple red', short_name:'one'

Then, with phrase matching the result for 'red apple' is:

  1. ID:60, 2.4462724
  2. ID:61: 0.15659831

And for 'apple red' the order is changed:

  1. ID:61, 0.750654
  2. ID:60: 0.15101993

Without phrase matching order and score are kept:

  1. ID:60, 0.8453717
  2. ID:61: 0.7003289

I'll remove phrase matching.

This comment has been minimized.

Copy link
@pspanja

pspanja Sep 9, 2016

Author Collaborator

Rebased to previous version.

@pspanja pspanja force-pushed the EZP-26141_fix_multitoken_fulltext branch from 51611c5 to d7e5dcb Sep 9, 2016

@andrerom

This comment has been minimized.

Copy link
Member

commented Sep 11, 2016

+1

1 similar comment
@alongosz

This comment has been minimized.

Copy link
Member

commented Sep 12, 2016

+1

@andrerom

This comment has been minimized.

Copy link
Member

commented Sep 12, 2016

Ok, ready to be merged into 1.0 and master 👌🏻

@pspanja pspanja changed the base branch from master to 1.0 Sep 13, 2016

@pspanja pspanja changed the base branch from 1.0 to master Sep 13, 2016

@pspanja

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 13, 2016

Closing in favor of #64

@pspanja pspanja closed this Sep 13, 2016

@pspanja pspanja deleted the EZP-26141_fix_multitoken_fulltext branch Sep 13, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.