Skip to content

GEODE-9004: Fix issues in queries targeting a Map field#6200

Closed
albertogpz wants to merge 4 commits intoapache:developfrom
Nordix:feature/GEODE-9004_1
Closed

GEODE-9004: Fix issues in queries targeting a Map field#6200
albertogpz wants to merge 4 commits intoapache:developfrom
Nordix:feature/GEODE-9004_1

Conversation

@albertogpz
Copy link
Contributor

Queries in which map fields are involved
in the condition, sometimes do not return the same entries
if map indexes are used, compared to when map indexes
are not used.

Differences were found when the condition on the
map field was of type '!=' and also when the condition
on the map field was comparing the value with null.

Example1:
Running a query with the following condition:
positions['SUN'] = null

in a region with the following entries:
entry1.positions=null
entry2.positions={'a'=>'b'}
entry3.positions={'SUN'=>null}

  • will return entry1 and entry3 if there are no
    indexes defined.
  • will return no entries if the following index is defined:
    positions['SUN','c']
  • will return entry3 if the following index is defined:
    positions[*]

Example2:
Running a query with the following condition:
positions['SUN'] != '3'

in a region with the following entries:
entry1.positions=null
entry2.positions={'a'=>'b'}
entry3.positions={'SUN'=>'4'}
entry4.positions={'SUN'=>'3'}
entry5.positions={'SUN'=>null}

  • will return all entries except for entry4 if:
    there are no indexes defined.
  • will return entry3 if the following index is defined:
    positions['SUN','c']
  • will return entry3 and entry5 if the following index is defined:
    positions[*]

In order to have the same results for these
queries no matter if indexes are used,
the following changes have been made:

  • When running queries, the result of getting the
    value from a Map that does not contain the key used
    to access the value will be
    considered UNDEFINED instead of null.
    This will imply that queries where the condition is
    positions['SUN'] = null will not return entries
    that do not have the 'SUN' key in the positions Map.

  • Queries with map indexes will not be able to use
    the indexes if the query is of type '!=' against
    any of the indexed map fields.
    Example:
    for index positions['SUN', 'ERIC'] or
    positions[*]
    the following query will not use indexes:
    positions['SUN'] != '44'

These changes must be documented accordingly
in the corresponding indexes section.

Thank you for submitting a contribution to Apache Geode.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?

  • Has your PR been rebased against the latest commit within the target branch (typically develop)?

  • Is your initial contribution a single, squashed commit?

  • Does gradlew build run cleanly?

  • Have you written or updated unit tests to verify your changes?

  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?

Note:

Please ensure that once the PR is submitted, check Concourse for build issues and
submit an update to your PR as soon as possible. If you need help, please send an
email to dev@geode.apache.org.

Queries in which map fields are involved
in the condition, sometimes do not return the same entries
if map indexes are used, compared to when map indexes
are not used.

Diferences were found when the condition on the
map field was of type '!=' and also when the condition
on the map field was comparing the value with null.

Example1:
Running a query with the following condition:
positions['SUN'] = null

in a region with the following entries:
entry1.positions=null
entry2.positions={'a'=>'b'}
entry3.positions={'SUN'=>null}

- will return entry1 and entry3 if there are no
  indexes defined.
- will return no entries if the following index is defined:
  positions['SUN','c']
- will return entry3 if the following index is defined:
  positions[*]

Example2:
Running a query with the following condition:
positions['SUN'] != '3'

in a region with the following entries:
entry1.positions=null
entry2.positions={'a'=>'b'}
entry3.positions={'SUN'=>'4'}
entry4.positions={'SUN'=>'3'}
entry5.positions={'SUN'=>null}

- will return all entries except for entry4 if:
  there are no indexes defined.
- will return entry3 if the following index is defined:
  positions['SUN','c']
- will return entry3 and entry5 if the following index is defined:
  positions[*]

In order to have the same results for these
queries no matter if indexes are used,
the following changes have been made:
- When running queries, the result of getting the
value from a Map that does not contain the key used
to access the value will be
considered UNDEFINED instead of null.
This will imply that queries where the condition is
positions['SUN'] = null will not return entries
that do not have the 'SUN' key in the positions Map.

- Queries with map indexes will not be able to use
the indexes if the query is of type '!=' against
any of the indexed map fields.
Example:
for index positions['SUN', 'ERIC'] or
positions[*]
the following query will not use indexes:
positions['SUN'] != '44'

These changes must be documented accordingly
in the corresponding indexes section.
@DonalEvans
Copy link
Contributor

The PR description says that "These changes must be documented accordingly in the corresponding indexes section." Is there a GEODE ticket for this docs work? If not, one should be created, or the docs changes included as part of this PR.

@albertogpz
Copy link
Contributor Author

The PR description says that "These changes must be documented accordingly in the corresponding indexes section." Is there a GEODE ticket for this docs work? If not, one should be created, or the docs changes included as part of this PR.

I agree, Donal. My plan is to update the documentation as part of this PR if the solution is accepted. Given that the outcome of some queries without indexes change and that indexes will not be used with queries using "!=" with Map indexes, I want to have first the blessing from the community and only then update the documentation.

@DonalEvans
Copy link
Contributor

I'm not a query expert, so I don't understand what the reasoning is behind map indexes no longer being used for queries using !=. Would it be possible to get an explanation of the motivation behind this change?

@albertogpz
Copy link
Contributor Author

I'm not a query expert, so I don't understand what the reasoning is behind map indexes no longer being used for queries using !=. Would it be possible to get an explanation of the motivation behind this change?

With the current implementation, map indexes are used (if defined) in queries that have a condition using "!=" against a value in the map field indexed. The problem is that these queries offer different outcome if the index is defined vs when the index is not defined.
In this PR I have made the queries offer the same output no matter if the index is defined or not but at the expense of not using the index if the condition is "!=" when the index is defined. I have not been able to come up with a solution without losing this "feature". That's the reason behind "map indexes no longer being used for queries using !=."

Copy link
Contributor

@DonalEvans DonalEvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few requests for code clean-up/clarifying comments. The content of the changes looks okay to me, but the conceptual changes need someone with a better understanding of the expected behaviour of queries and indexes.

p6.positions.put("ERIC", "hey");
region.put(6, p6);

// One more with null map
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to have two region entries with null maps? p2 also has the same null positions field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really. I have changed a bit some of the entries to cover more cases and removed having two region entries with null maps.

assertThat(keys).isEqualTo(4);
assertThat(mapIndexKeys).isEqualTo(0);
assertThat(values).isEqualTo(7);
assertThat(uses).isEqualTo(4);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not immediately obvious where these values are coming from. Woudl it be possible to add a comment explaining why they are what they are?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added comments about these and the other. Nevertheless, there is no documentation on them so I am not 100% sure that they are working correctly. Anyhow, it is not something changed by this PR.

assertThat(keys).isEqualTo(3);
assertThat(mapIndexKeys).isEqualTo(1);
assertThat(values).isEqualTo(3);
assertThat(uses).isEqualTo(2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above comment about it being unclear where these values are coming from.

assertThat(keys).isEqualTo(5);
assertThat(mapIndexKeys).isEqualTo(4);
assertThat(values).isEqualTo(5);
assertThat(uses).isEqualTo(2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above comment about it being unclear where these values are coming from.


@Test
public void testQueriesForValueInMapFieldWithoutIndex() throws Exception {
region =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple warnings in this file can be cleaned up by making region a Region<Object, Object> where it's declared.

@albertogpz albertogpz requested a review from DonalEvans March 30, 2021 16:07
@albertogpz albertogpz force-pushed the feature/GEODE-9004_1 branch from ea1b37f to 9e2c7a6 Compare March 30, 2021 17:59
@albertogpz albertogpz force-pushed the feature/GEODE-9004_1 branch from 9e2c7a6 to 1784eac Compare March 30, 2021 19:28
// positions map takes for each key (null not included)
// for each entry in the region:
// "something", null, "nothing", "more", "hey", "tip"
assertThat(keys).isEqualTo(7);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value is now in disagreement with the comment above, as the list of values includes only 6 values. Either the explanation in the comment needs to be changed, or the value being returned here is incorrect and there is a bug somewhere. Also, it is a little confusing in terms of what "null not included" is referring to, since the list of values includes null.

Comment on lines 450 to 454
// The number of values must be equal to the number of values the
// positions map takes for each key (null not included)
// for each entry in the region:
// "something", null, "nothing", "more", "empty", "hey", "more", "tip"
assertThat(values).isEqualTo(8);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is a little confusing in terms of what "null not included" is referring to, since the list of values includes null.


// Both RangeIndex should be used
assertEquals(100 /* Execution time */, keyIndexStats.getTotalUses());
// Index should not be used
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this comment (and others in this class) be expanded slightly to explain why the index should not be used here? I am assuming it's because the query uses !=, but it would be good to make it explicit.

@albertogpz albertogpz requested a review from DonalEvans March 31, 2021 09:08
Copy link
Contributor

@DonalEvans DonalEvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with the caveat that someone with deeper knowledge of queries and indexes should sign off on this, as it represents a potentially significant change in behaviour.

@nabarunnag
Copy link
Contributor

We had a small impromtu discussion with the query engine team and some research on how NoSQL treats null (or nil in some worlds)

  • in MongoDB we founnd that
    The { item : null } query matches documents that either contain the item field whose value is null or that do not contain the item field.
  • in redis nil is returned if the field does not exist.

Hence we think that the current behavior i.e. querying without indexes is the right result and we should try to make the indexes return the same result.

@agingade
Copy link

agingade commented Mar 31, 2021

@albertogpz
Thanks for your contributions to make the query engine robust. Sorry for delay in responding...

Here is what I believe should be happening.

  • The result for a query should be consistent in both using index or non-index case.

  • The query engine returns UNDEFINED when it is unable to find the next level field.
    E.g: if address.city and address is null (This is documented).
    This is not same when you are looking for a non existing "key" in the map; UNDEFINED needs to be returned when positions is null and query is trying to access field from it.
    E.g.: positions['*'] should be returning UNDEFINED.

  • Query engine supports heterogenous objects stored in a region.
    E.g: Employee or Customer.
    Inline with supporting this, its designed/architected such that if a field is not found in the object it will be ignored.
    E.g query with employeeID is not going to return customer objects unless it has that field.

  • To be inline with the above design (query expectation), when a map field is not present available it should ignore that entry/object from adding to the result.
    E.g. if positions['SUN'] if SUN key is not present query should ignore that object.
    This is also different from null check.
    If there is a SUN key with null value it should be returned for queries looking for null value. And non null check will return if the key is there and its value is not null.
    Try the query with non map field and the behavior should be same.

Please let me know if you have any questions on the expected behavior.

The overall behavior of the query on map should be in-line with non-map fields.

The usage of index should not be avoided; unless the results are inconsistent with non-index query results. Instead of avoiding/blocking the use of index, it will be good to address the issues with indexed queries and make the behavior consistent.

Can you confirm above requirements are met in this PR.

Copy link

@agingade agingade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please respond to my comment. For that reason I am setting this PR as request changes.

@albertogpz
Copy link
Contributor Author

@agingade
Thanks for your comments. Please, let me answer inline to your questions.

@albertogpz
Thanks for your contributions to make the query engine robust. Sorry for delay in responding...

Here is what I believe should be happening.

* The result for a query should be consistent in both using index or non-index case.

This is what I tried to fix with this PR and the test new cases now show that results are the same with and without indexes while prior to the PR, they returned different results.

* The query engine returns UNDEFINED when it is unable to find the next level field.
  E.g: if address.city and address is null (This is documented).
  This is not same when you are looking for a non existing "key" in the map; UNDEFINED needs to be returned when positions is null and query is trying to access field from it.
  E.g.: positions['*']  should be returning UNDEFINED.

So, you mean if positions is not null but it only contains a mapping for the 'SUN" key, positions["ERIC"] should return null instead of UNDEFINED?

* Query engine supports heterogenous objects stored in a region.
  E.g: Employee or Customer.
  Inline with supporting this, its designed/architected such that if a field is not found in the object it will be ignored.
  E.g query with employeeID is not going to return customer objects unless it has that field.

ok. That should not have been changed.

* To be inline with the above design (query expectation), when a map field is not present available it should ignore that entry/object from adding to the result.
  E.g. if positions['SUN'] if SUN key is not present query should ignore that object.
  This is also different from null check.
  If there is a SUN key with null value it should be returned for queries looking for null value. And non null check will return if the key is there and its value is not null.

Currently if you have an entry for which positions has the following mappings: {"SUN" => null} and another entry for which positions has the following mapping: {"ERICSSON" => "3"} querying for positions["SUN"] = null will return both entries.
With my PR, only the first one would be returned. What should be the right behavior?

  Try the query with non map field and the behavior should be same.

Please let me know if you have any questions on the expected behavior.

The overall behavior of the query on map should be in-line with non-map fields.

The usage of index should not be avoided; unless the results are inconsistent with non-index query results. Instead of avoiding/blocking the use of index, it will be good to address the issues with indexed queries and make the behavior consistent.

I tried to support the use of indexes in all queries with my PR but I found no way to do it with != queries using an index of type: positions[*] (index on all keys) or positions["SUN", "ERICSSON"].
For the second type of index, I have an alternative solution in a draft PR that will allow the use of the index although it is more costly in memory (see #6238)

Can you confirm above requirements are met in this PR.

@albertogpz
Copy link
Contributor Author

We had a small impromtu discussion with the query engine team and some research on how NoSQL treats null (or nil in some worlds)

* in MongoDB we founnd that
  `The { item : null } query matches documents that either contain the item field whose value is null or that do not contain the item field.`

* in redis nil is returned if the field does not exist.

Hence we think that the current behavior i.e. querying without indexes is the right result and we should try to make the indexes return the same result.

@nabarunnag Thanks for your comments. If @agingade agrees (I will check his answers after having provided mine to his comments) I will look into this possibility.

@albertogpz
Copy link
Contributor Author

We had a small impromtu discussion with the query engine team and some research on how NoSQL treats null (or nil in some worlds)

* in MongoDB we founnd that
  `The { item : null } query matches documents that either contain the item field whose value is null or that do not contain the item field.`

* in redis nil is returned if the field does not exist.

Hence we think that the current behavior i.e. querying without indexes is the right result and we should try to make the indexes return the same result.

@nabarunnag Thanks for your comments. If @agingade agrees (I will check his answers after having provided mine to his comments) I will look into this possibility.

@nabarunnag and @agingade : I have created a new pull request with a solution in which I keep the current behavior of the queries without indexes: #6279
Please, dismiss this PR and check the new one.
I have not been able to find a complete solution for the "allkeys" type of index (e.g. positions[*]) and with the solution, queries with this type of index and queries of type "!=" or comparing to null will not use the index even if configured. This is a limitation that could be tackled in a future PR.

@albertogpz albertogpz closed this Apr 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants