Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement synthetic source support for range fields #107081

Merged
merged 28 commits into from Apr 24, 2024

Conversation

lkts
Copy link
Contributor

@lkts lkts commented Apr 4, 2024

This PR adds basic synthetic source support for range fields. There are
following notable properties of synthetic source produced:

  • Ranges are always normalized to be inclusive on both ends (this is how
    they are stored).
  • Duplicates are removed (same as for other fields).
  • Original order of ranges is not preserved. Ranges (except IP ranges) are always sorted in order of from and to values. For IP ranges order is unspecified.
  • IP ranges are always expressed as a range of IPs while it could
    have been originally provided as a CIDR.

This PR only implements retrieval of data for source reconstruction from
doc values similar to fields already supporting synthetic source.

Contributes to #106460.

This PR adds basic synthetic source support for range fields. There are
following notable properties of synthetic source produced:
* Ranges are always normalized to be inclusive on both ends (this is how
 they are stored).
* Original order of ranges is not preserved.
* Date ranges are always expressed in epoch millis, format is not
preserved.
* IP ranges are always expressed as a range of IPs while it could
have been originally provided as a CIDR.

This PR only implements retrieval of data for source reconstruction from
 doc values.
@elasticsearchmachine elasticsearchmachine added v8.14.0 needs:triage Requires assignment of a team area label labels Apr 4, 2024
@lkts lkts added >feature Team:StorageEngine :StorageEngine/Mapping The storage related side of mappings labels Apr 4, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Apr 4, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @lkts, I've created a changelog YAML for you.

@lkts
Copy link
Contributor Author

lkts commented Apr 5, 2024

@elasticmachine update branch

@lkts
Copy link
Contributor Author

lkts commented Apr 5, 2024

This is ready for review.

docs/reference/mapping/types/range.asciidoc Outdated Show resolved Hide resolved
"lte": 299
}
}
----
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should add this to the list of limitations for synthetic source? docs/reference/mapping/fields/synthetic-source.asciidoc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is specific to ranges and i followed existing docs that have specific examples on type page (for example ip).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am open to moving it though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is that in the docs about synthetic source, where we provide limitations we should mention also this one, not necessarily including an example but just explaining it so that users have it together with all other limitations.

@salvatore-campagna
Copy link
Contributor

Thanks for your PR, this looks great...very comprehensive test coverage. I left a few comments.

@lkts
Copy link
Contributor Author

lkts commented Apr 8, 2024

@elasticmachine update branch

configuration. Synthetic `_source` cannot be used together with
<<copy-to,`copy_to`>> or with <<doc-values,`doc_values`>> disabled.

Synthetic source always sorts values and removes duplicates for all `range` fields except `ip_range` . Ranges are sorted by their lower bound and then by upper bound. For example:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move these documented limitations to a paragraph in synthetic-source.asciidoc (at the end of the file)? I think all the limitations are mentioned there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe summarize the limitations in synthetic-source.asciidoc and keep the extensive examples here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have such examples that are specific to a field type for other fields like ip, that's why i added it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we already have a general statement that order is not preserved in synthetic-source.

ip_range: { "gte": "::", "lte": "10.10.10.10" }
- match:
hits.hits.7._source:
ip_range: { "gte": "10.10.10.10", "lte": "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff" }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mix of IPv6 and IPv4 is pretty unfortunate but i don't see a good way to address that. We could look at other end of range and try to format it in the same way but it's pretty brittle in my opinion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually our docs say that we support mixed ranges so maybe this is okay?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think when it come to this the semantic is something like:

  • apply gte to all IPv4 fields as if you only had the gte defined
  • apply the lte to all IPv6 fields as if you only had the lte defined

It is like saying for IPv4 fields the range is [10.10.10.10, null] while for IPv6 fields the range is [null, ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff]. Does it make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that would be ideal, let me see if i can do that.

@lkts
Copy link
Contributor Author

lkts commented Apr 13, 2024

@elasticmachine update branch

@lkts
Copy link
Contributor Author

lkts commented Apr 17, 2024

@elasticmachine update branch

@lkts
Copy link
Contributor Author

lkts commented Apr 17, 2024

@elasticmachine update branch

@salvatore-campagna
Copy link
Contributor

I left a couple of comments...everything else looks good.

@lkts
Copy link
Contributor Author

lkts commented Apr 23, 2024

@elasticmachine update branch

@lkts lkts merged commit cde894a into elastic:main Apr 24, 2024
14 checks passed
@lkts lkts deleted the feature/range_field_synthetic_source branch April 24, 2024 18:33
lkts added a commit that referenced this pull request Apr 25, 2024
#107910)

This reverts commit acb5139 (a part of #107081). This commit impacts search behaviour for IP range fields and so needs to be reverted.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants