Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL: Fix issue with field names containing "." #37364

Merged
merged 11 commits into from Jan 15, 2019

Conversation

Projects
None yet
6 participants
@matriv
Copy link
Contributor

commented Jan 11, 2019

Adjust FieldExtractor to be handle fields which contain . in their
name, regardless where they fall in in the document hierarchy. E.g.:

{
  "a.b": "Elastic Search"
}

{
  "a": {
    "b.c": "Elastic Search"
  }
}

{
  "a.b": {
    "c": {
      "d.e" : "Elastic Search"
    }
  }
}

Fixes: #37128

SQL: Fix issue with field names containing "."
Adjust FieldExtractor to be handle fields which contain `.` in their
name, regardless where they fall in in the document hierarchy. E.g.:

```
{
  "a.b": "Elastic Search"
}

{
  "a": {
    "b.c": "Elastic Search"
  }
}

{
  "a.b": {
    "c": {
      "d.e" : "Elastic Search"
    }
  }
}
```

Fixes: #37128
@elasticmachine

This comment has been minimized.

Copy link

commented Jan 11, 2019

@matriv matriv requested a review from astefan Jan 11, 2019

@matriv

This comment has been minimized.

Copy link
Contributor Author

commented Jan 11, 2019

I tried to make use of the XContentMapValues methods extractRawValues and filter to replace the "ugly" code in the FieldExtractor but this seems not possible as the extractRawValues doesn't distinguish between a missing entry in the map and a null entry. For example if we have an Array [null, null] then the extractRawValues will return an empty List which prevents us from handling it correctly and throwing the "Arrays not supported" exception.

} else {
throw new SqlIllegalArgumentException("Cannot extract value [{}] from source", fieldName);
}
}
return unwrapMultiValue(value);
}

private Tuple<Object, Integer> extractAsDottedField(Map<String, Object> map, int idx, String node) {

This comment has been minimized.

Copy link
@matriv

matriv Jan 11, 2019

Author Contributor

I'm thinking to remove this extracted code back to the body of extractFromSource() as now we create an extra Tuple and Integer object which happens for every document returned. What do you think?

Also the code might be a bit ugly but I wanted to avoid calling recursively a method (with every combination of a.b, a.b.c, etc.) Happy to hear better suggestions.

This comment has been minimized.

Copy link
@costin

costin Jan 11, 2019

Member

+1. It looks like a small method and while it might be inlined the extra Tuple/Integer are boiler-plate.
If the method gets unrolled, both the index and the found value will be available without wrapping/boxing.

@matriv

This comment has been minimized.

Copy link
Contributor Author

commented Jan 11, 2019

run gradle build tests 2

matriv added some commits Jan 11, 2019

@matriv matriv requested a review from imotov Jan 11, 2019

@costin

costin approved these changes Jan 11, 2019

Copy link
Member

left a comment

LGTM.
I like the lack of virtual calls - the structure is small enough that we can extract the values through loops. +1 on unrolling the method so there's no additional method call.

Regarding tests, I would add a few more examples such as : a.b { c { d.e } } a.b.c { d { e } and {a.b { c.d.e} } just to be on the safe side.

@@ -39,7 +40,7 @@
*/
private static String[] sourcePath(String name, boolean useDocValue, String hitName) {
return useDocValue ? Strings.EMPTY_ARRAY : Strings
.tokenizeToStringArray(hitName == null ? name : name.substring(hitName.length() + 1), ".");
.tokenizeToStringArray(hitName == null ? name : name.substring(hitName.length() + 1), ".");

This comment has been minimized.

Copy link
@costin

costin Jan 11, 2019

Member

There's definitely some formatting differences here.
Anyway, make sure you set your IDE to format only the lines that were modified as that prevents unnecessary changes like the above.

} else {
throw new SqlIllegalArgumentException("Cannot extract value [{}] from source", fieldName);
}
}
return unwrapMultiValue(value);
}

private Tuple<Object, Integer> extractAsDottedField(Map<String, Object> map, int idx, String node) {

This comment has been minimized.

Copy link
@costin

costin Jan 11, 2019

Member

+1. It looks like a small method and while it might be inlined the extra Tuple/Integer are boiler-plate.
If the method gets unrolled, both the index and the found value will be available without wrapping/boxing.

@@ -178,8 +204,8 @@ public boolean equals(Object obj) {
}
FieldHitExtractor other = (FieldHitExtractor) obj;
return fieldName.equals(other.fieldName)
&& hitName.equals(other.hitName)
&& useDocValue == other.useDocValue;
&& hitName.equals(other.hitName)

This comment has been minimized.

Copy link
@costin

costin Jan 11, 2019

Member

Again with the formatting...

@matriv

This comment has been minimized.

Copy link
Contributor Author

commented Jan 11, 2019

@costin I don't know if you noticed already because it was my last commit but I added this: https://github.com/elastic/elasticsearch/pull/37364/files#diff-9aaee9be08445653bba7407b9f2b5ca3R269 Would you like to still have other cases in separate methods?

@costin

This comment has been minimized.

Copy link
Member

commented Jan 11, 2019

The random is fine :)

@astefan
Copy link
Contributor

left a comment

LGTM. Nice, elegant solution.

@imotov
Copy link
Member

left a comment

I think we can improve it a bit.

DELETE test

PUT /test/blah/1
{
  "name.full": "Shane Connelly",
  "name.parts.first": "Shane"
}

PUT /test/blah/2
{
  "name": {
    "full": "Shane Connelly"
  },
  "name.parts": {
    "first": "Shane"
  }
}

POST /_xpack/sql?format=txt
{
  "query": "DESCRIBE test"
}

POST /_xpack/sql?format=txt
{
  "query": "select * from test"
}

GET /test/_search
{
  "_source": {
    "includes": "name.parts.first"
  }
}
@matriv

This comment has been minimized.

Copy link
Contributor Author

commented Jan 12, 2019

@imotov I don't get your comment, how to improve it? in your example

POST /_xpack/sql?format=txt
{
  "query": "select * from test"
}

returns:

   name.full   |name.parts.first
---------------+----------------
Shane Connelly |Shane
Shane Connelly |Shane
@imotov

This comment has been minimized.

Copy link
Member

commented Jan 12, 2019

Interesting. I ran it twice with this PR by itself and with this PR merged into master and every time I am getting

   name.full   |name.parts.first
---------------+----------------
Shane Connelly |Shane           
Shane Connelly |null            

Could you add an additional record like this and see what do you get back?

PUT /test/blah/3
{
  "name.parts": {
    "first": "Shane I"
  },
  "name": {
    "parts.first": "Shane II",
    "full": "Shane Connelly"
  }
}

I am now getting

   name.full   |name.parts.first
---------------+----------------
Shane Connelly |Shane           
Shane Connelly |null            
Shane Connelly |Shane II        
@matriv

This comment has been minimized.

Copy link
Contributor Author

commented Jan 12, 2019

I get:

   name.full   |name.parts.first
---------------+----------------
Shane Connelly |Shane
Shane Connelly |Shane
Shane Connelly |Shane II

:-)

@matriv

This comment has been minimized.

Copy link
Contributor Author

commented Jan 12, 2019

Hm, I cleaned and recompiled and I get the null too, will check it, thx!

@matriv

This comment has been minimized.

Copy link
Contributor Author

commented Jan 13, 2019

@imotov @costin @astefan Adjusted the code to handle multiple entries in the map with common prefixes. Please take another look.

Many thanks @imotov for catching that!

@matriv

This comment has been minimized.

Copy link
Contributor Author

commented Jan 13, 2019

Now the query yields:

   name.full   |name.parts.first
---------------+----------------
Shane Connelly |Shane
Shane Connelly |Shane
Shane Connelly |Shane I

so it's Shane I and not Shane II since we're searching for the longer path in the map that yields non-null value. What do you think?

@matriv

This comment has been minimized.

Copy link
Contributor Author

commented Jan 13, 2019

The difference now is here: https://github.com/elastic/elasticsearch/pull/37364/files#diff-6d292f4ab3ddfc415649c328fd2faefeR157
So we don't stop once a non-null value is found in the map but we keep walking the path and holding the latest non-null value found.

@matriv

This comment has been minimized.

Copy link
Contributor Author

commented Jan 14, 2019

Pushed the new approach to handle multiple values for the same path but different hierarchies.

@matriv

This comment has been minimized.

Copy link
Contributor Author

commented Jan 14, 2019

run default distro tests

1 similar comment
@matriv

This comment has been minimized.

Copy link
Contributor Author

commented Jan 14, 2019

run default distro tests

@matriv

This comment has been minimized.

Copy link
Contributor Author

commented Jan 14, 2019

run gradle build tests 1

@imotov

imotov approved these changes Jan 14, 2019

Copy link
Member

left a comment

LGTM. Thanks!

@matriv

This comment has been minimized.

Copy link
Contributor Author

commented Jan 14, 2019

@imotov thanks for checking and catching this.

@costin

costin approved these changes Jan 14, 2019

Copy link
Member

left a comment

Left some minor comments. Let's get this in.


// Used to avoid recursive method calls
// Holds the sub-maps in the document hierarchy that are pending to be inspected.
LinkedList<Map<String, Object>> queue = new LinkedList<>();

This comment has been minimized.

Copy link
@costin

costin Jan 14, 2019

Member

minor nitpick - use the Deque instead of LinkedList.

This comment has been minimized.

Copy link
@costin

costin Jan 14, 2019

Member

Since the index and the Map are associated, how about using only one Deque which holds a Tuple instead of two Deque:

Deque<Tuple<Map<String, Object> index>>` queue = ...

if (node instanceof Map) {
   queue.add(new Tuple<>(node, Integer.valueOf(i));
}

This comment has been minimized.

Copy link
@matriv

matriv Jan 14, 2019

Author Contributor

Actually, I had those 2 before, while implementing. :-)

I chose LinkedList as we don't need any special operations of queue (we don't care about the order of dequeuing). I've just read about the better performance of ArrayDequeue over LinkedList and I'll change.

For the Tuple I had it as you said but I changed to a separate structure that holds integers to avoid the instantiation of Tuple objects on top of the Integer boxing. What do you think?

This comment has been minimized.

Copy link
@matriv

matriv Jan 14, 2019

Author Contributor

And after benchmarking: https://gist.github.com/matriv/07e5909a49bed9794f350f458a0b2c60
The results show:

Benchmark                      Mode  Cnt       Score       Error  Units
MyBenchmark.testArrayDequeue  thrpt   25  608384,000 ±  7893,014  ops/s
MyBenchmark.testLinkedList    thrpt   25  581316,884 ±  8028,331  ops/s

This comment has been minimized.

Copy link
@costin

costin Jan 14, 2019

Member

Autoboxing already happens and I wouldn't worry to much about it considering the depth is not that big. Same for Linked vs Array (in general arrays are faster except for inserting in the middle as that requires resizing/copying at which the linked structure excels).
I think the Tuple makes the code a bit more compact and safe (the queues cannot get out of sync) and more readable/simple code always trumps optimization (especially micro ones as here).

This comment has been minimized.

Copy link
@matriv

matriv Jan 14, 2019

Author Contributor

Just FYI regarding the Tuples:
https://gist.github.com/matriv/3d2557ae1621e99a3c9505b5e4e6998f
and the result:

Benchmark                      Mode  Cnt       Score      Error  Units
MyBenchmark.testNoTuples      thrpt   25  601543,170 ± 7859,704  ops/s
MyBenchmark.testTuples        thrpt   25  573953,828 ± 5987,749  ops/s
if (node instanceof Map) {
// Add the sub-map to the queue along with the current path index
queue.add((Map<String, Object>) node);
idxQueue.add(i);

This comment has been minimized.

Copy link
@costin

costin Jan 14, 2019

Member

Why not add i+1 to the queue since the next loop should start from the next position?

This comment has been minimized.

Copy link
@matriv

matriv Jan 14, 2019

Author Contributor

no reason, I just preferred to start the i index of the for loop from prevPosition + 1 just because personally I see it more "visible". Would you prefer it your way?

This comment has been minimized.

Copy link
@costin

matriv added some commits Jan 14, 2019

@matriv matriv merged commit b594e81 into elastic:master Jan 15, 2019

7 checks passed

CLA Commit author has signed the CLA
Details
elasticsearch-ci-1 Build finished.
Details
elasticsearch-ci-2 Build finished.
Details
elasticsearch-ci/default-distro Build finished.
Details
elasticsearch-ci/docbldesx Build finished.
Details
elasticsearch-ci/oss-distro-docs Build finished.
Details
elasticsearch-ci/packaging-sample Build finished.
Details

@matriv matriv deleted the matriv:mt/fix-37128 branch Jan 15, 2019

matriv added a commit that referenced this pull request Jan 15, 2019

SQL: Fix issue with field names containing "." (#37364)
Adjust FieldExtractor to handle fields which contain `.` in their
name, regardless where they fall in, in the document hierarchy. E.g.:

```
{
  "a.b": "Elastic Search"
}

{
  "a": {
    "b.c": "Elastic Search"
  }
}

{
  "a.b": {
    "c": {
      "d.e" : "Elastic Search"
    }
  }
}
```

Fixes: #37128
@matriv

This comment has been minimized.

Copy link
Contributor Author

commented Jan 15, 2019

Backported to 6.6 with 874b5ec

matriv added a commit that referenced this pull request Jan 15, 2019

SQL: Fix issue with field names containing "." (#37364)
Adjust FieldExtractor to handle fields which contain `.` in their
name, regardless where they fall in, in the document hierarchy. E.g.:

```
{
  "a.b": "Elastic Search"
}

{
  "a": {
    "b.c": "Elastic Search"
  }
}

{
  "a.b": {
    "c": {
      "d.e" : "Elastic Search"
    }
  }
}
```

Fixes: #37128
@matriv

This comment has been minimized.

Copy link
Contributor Author

commented Jan 15, 2019

Backported to 6.5 with c72bd51

matriv added a commit that referenced this pull request Jan 15, 2019

SQL: Fix issue with field names containing "." (#37364)
Adjust FieldExtractor to handle fields which contain `.` in their
name, regardless where they fall in, in the document hierarchy. E.g.:

```
{
  "a.b": "Elastic Search"
}

{
  "a": {
    "b.c": "Elastic Search"
  }
}

{
  "a.b": {
    "c": {
      "d.e" : "Elastic Search"
    }
  }
}
```

Fixes: #37128
@matriv

This comment has been minimized.

Copy link
Contributor Author

commented Jan 15, 2019

Backported to 6.x with fd3701b

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 15, 2019

Merge remote-tracking branch 'elastic/master' into sync-retention-leases
* elastic/master:
  Docs be explicit on how to turn off deprecated auditing (elastic#37316)
  Fix line length for monitor and remove suppressions (elastic#37456)
  Fix IndexShardTestCase.recoverReplica(IndexShard, IndexShard, boolean) (elastic#37414)
  Update the Flush API documentation (elastic#33551)
  [TEST] Muted testDifferentRolesMaintainPathOnRestart
  Remove dead code from ShardSearchStats (elastic#37421)
  Simplify testSendSnapshotSendsOps (elastic#37445)
  SQL: Fix issue with field names containing "." (elastic#37364)
  Restore lost @Inject annotation (elastic#37452)

imotov added a commit to imotov/elasticsearch that referenced this pull request Jan 15, 2019

SQL: fix object extraction from sources
Throws an exception if hit extractor tries to retrieve unsupported
object. For example, selecting "a" from `{"a": {"b": "c"}}` now throws
an exception instead of returning null.

Relates to elastic#37364
@imotov

This comment has been minimized.

Copy link
Member

commented Jan 15, 2019

@matriv I am sorry. It looks like I missed another edge case during review and immediately got bitten by it in geosql, which actually can extract objects from source in some cases. I opened #37502 to fix it.

imotov added a commit that referenced this pull request Jan 18, 2019

SQL: fix object extraction from sources (#37502)
Throws an exception if hit extractor tries to retrieve unsupported
object. For example, selecting "a" from `{"a": {"b": "c"}}` now throws
an exception instead of returning null.

Relates to #37364

imotov added a commit that referenced this pull request Jan 18, 2019

SQL: fix object extraction from sources (#37502)
Throws an exception if hit extractor tries to retrieve unsupported
object. For example, selecting "a" from `{"a": {"b": "c"}}` now throws
an exception instead of returning null.

Relates to #37364

imotov added a commit that referenced this pull request Jan 18, 2019

SQL: fix object extraction from sources (#37502)
Throws an exception if hit extractor tries to retrieve unsupported
object. For example, selecting "a" from `{"a": {"b": "c"}}` now throws
an exception instead of returning null.

Relates to #37364

@colings86 colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.