There is no way to ask ES to insert a timestamp at index time #15644

Closed
lam-juice opened this issue Dec 23, 2015 · 12 comments
Labels
discuss :Search/Mapping Index mappings, including merging and defining field types

Comments

@lam-juice

Due to the deprecation of _timestamp, there is no way to insert a timestamp at exactly the index time.

In my use case, the _timestamp field is useful for more than just TTL. I use ES as time-series storage, and if anyone wants to know how long the event pipeline takes to deliver a document, prior to ES 2.0 it could be calculated as (_timestamp - @timestamp), provided that @timestamp is obtained from the parsed event time and _timestamp is enabled.

When such events can be identified, time-sensitive operations that missed them due to pipeline delays can attempt to recover them by filtering on the pipeline delay, e.g. "(_timestamp - @timestamp) > 2 hours".
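
The delay calculation and the two-hour filter described above can be sketched client-side as follows. This is only an illustration, not how Elasticsearch evaluates such a filter: the sample documents are hypothetical, and both fields are assumed to be available as ISO 8601 strings.

```python
from datetime import datetime, timedelta


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2015-12-23T10:00:00+00:00'."""
    return datetime.fromisoformat(value)


def pipeline_delay(doc: dict) -> timedelta:
    """Delay between the event time (@timestamp) and the index time (_timestamp)."""
    return parse_ts(doc["_timestamp"]) - parse_ts(doc["@timestamp"])


def delayed_docs(docs, threshold=timedelta(hours=2)):
    """Return the documents whose pipeline delay exceeds the threshold."""
    return [doc for doc in docs if pipeline_delay(doc) > threshold]


# Hypothetical sample documents, for illustration only.
docs = [
    {"id": "a", "@timestamp": "2015-12-23T10:00:00+00:00",
     "_timestamp": "2015-12-23T10:00:05+00:00"},
    {"id": "b", "@timestamp": "2015-12-23T10:00:00+00:00",
     "_timestamp": "2015-12-23T13:30:00+00:00"},
]
late = delayed_docs(docs)
print([doc["id"] for doc in late])
```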

Asking Logstash to insert its own date field does not accurately capture the delay between the time this new field is created and the time the document is indexed.

I understand that users may want to store multiple custom timestamps (e.g. moved, modified), so let's not make one special _timestamp. However, I believe the time at which a document is first indexed is special, and at least important for identifying documents that need some recovery action due to indexing delays.

@Ghost93

Ghost93 commented Dec 26, 2015

+1

We use it to calculate delay in production: we have a field called arriveTime and run an aggregation on arriveTime - _timestamp.
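
A rough sketch of the kind of summary such an aggregation might produce, computed in plain Python rather than inside Elasticsearch. It assumes both fields hold epoch milliseconds and that arriveTime is recorded before indexing, so the delay is taken as _timestamp minus arriveTime; flip the sign if your fields are ordered the other way. The helper names and sample values are hypothetical.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list, for p in (0, 100]."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def delay_stats(docs, start_field="arriveTime", index_field="_timestamp"):
    """Summarise index_field - start_field, both given in epoch milliseconds."""
    delays = sorted(doc[index_field] - doc[start_field] for doc in docs)
    if not delays:
        raise ValueError("no documents to summarise")
    return {
        "count": len(delays),
        "min": delays[0],
        "max": delays[-1],
        "avg": sum(delays) / len(delays),
        "p50": percentile(delays, 50),
        "p99": percentile(delays, 99),
    }


# Hypothetical documents: three fast deliveries and one two-hour straggler.
base = 1450864800000
sample = [
    {"arriveTime": base, "_timestamp": base + 100},
    {"arriveTime": base, "_timestamp": base + 200},
    {"arriveTime": base, "_timestamp": base + 300},
    {"arriveTime": base, "_timestamp": base + 7200000},
]
stats = delay_stats(sample)
print(stats)
```

Reporting a high percentile next to the average makes a long tail visible even when most documents arrive quickly.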

@israel

israel commented Jan 7, 2016

+1

If there is no _timestamp system meta field, a retry scenario becomes less elegant and less efficient, since the client has to modify the index request source with a new "timestamp" value for its user-defined "timestamp" field.
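
To make the retry concern concrete, here is a minimal sketch of a client that has to re-stamp the request source before every attempt. The `send` callable, the `timestamp` field name, and the choice of `ConnectionError` as the retryable failure are all assumptions for illustration; no real client library is implied.

```python
import copy
from datetime import datetime, timezone


def stamp_source(source: dict, field: str = "timestamp") -> dict:
    """Return a copy of the source carrying a fresh client-side timestamp."""
    stamped = copy.deepcopy(source)
    stamped[field] = datetime.now(timezone.utc).isoformat()
    return stamped


def index_with_retry(source: dict, send, max_attempts: int = 3):
    """Call send(body) until it succeeds, re-stamping the body before each attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        body = stamp_source(source)  # the source must be rewritten on every retry
        try:
            return send(body)
        except ConnectionError as exc:
            last_error = exc
    raise last_error


# Hypothetical transport that fails once, then accepts the request.
attempts = []


def demo_send(body):
    attempts.append(body)
    if len(attempts) == 1:
        raise ConnectionError("simulated failure")
    return "indexed"


print(index_with_retry({"message": "hello"}, demo_send))
```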

@clintongormley clintongormley added discuss :Search/Mapping Index mappings, including merging and defining field types labels Jan 10, 2016
@l15k4

l15k4 commented Feb 16, 2016

I'm dealing with the same thing. I'm also not sure whether I have to re-index all my data because of the deprecated _timestamp field: https://discuss.elastic.co/t/upgrade-from-1-7-x-to-2-2-0--timestamp-mapping-issue/41894

Documentation says it should be compatible, but it somehow is not...

@clintongormley

In master this can be done using ingest pipelines. Closing in favour of #14049
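
The comment does not show what such a pipeline looks like. As a hedged sketch, assuming the `set` processor and the `_ingest.timestamp` metadata value as documented for later Elasticsearch releases, the pipeline body could be built like this. The `indexed_at` field name is hypothetical, and the body would be sent with `PUT _ingest/pipeline/<name>`.

```python
import json


def index_time_pipeline(field: str = "indexed_at") -> dict:
    """Build an ingest pipeline body that copies the ingest timestamp into `field`."""
    return {
        "description": "Record the time the ingest node processed the document",
        "processors": [
            # `field` is a hypothetical name; _ingest.timestamp is ingest metadata.
            {"set": {"field": field, "value": "{{_ingest.timestamp}}"}}
        ],
    }


print(json.dumps(index_time_pipeline(), indent=2))
```

The value recorded this way is the time the ingest node handled the document, not the moment the document was written on a data node.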

@djschny
Contributor

djschny commented Mar 14, 2016

Unfortunately, ingest pipelines will not give what is desired though, correct? They will not account for the time an index request spends sitting in a thread pool queue before being worked on, since the timestamp is added by an ingest node before the request reaches a data node.

@s1monw
Contributor

s1monw commented Mar 14, 2016

> Unfortunately, ingest pipelines will not give what is desired though, correct? They will not account for the time an index request spends sitting in a thread pool queue before being worked on, since the timestamp is added by an ingest node before the request reaches a data node.

There is always a delta between the assignment of a timestamp and the moment the document is actually indexed. The timestamp doesn't guarantee linearizability in any way; that's what a sequence ID is for. This feature can only be as good as some client app setting the timestamp, no matter where you put it. It should be neither more nor less.

@lam-juice
Author

What is doable, then, is to guarantee that the ingest pipeline performs the field insertion at a stage no earlier than where _timestamp insertions are currently performed, correct?

If the guarantee can be made, please ignore everything below.

Even if there is always a delta, the issue is its magnitude, i.e. it is not a binary issue. I believe an effort should be made to minimize this delta. That a "delta always exists" is well understood; however, it should not be used as a reason to arbitrarily alter this delta without providing a means to accomplish what the small delta used to.

That is, a small delta is better than a large one: a timestamp inserted closer to actual indexing is better than a timestamp inserted by the client app. To say that it "can only be as good as some client app setting the timestamp" totally ignores the significance of the delta's magnitude.

Do you agree that using _timestamp is a good way to measure pipeline delays? If not, is an alternative offered that works at least as well as _timestamp? Or, should we ignore Logstash delays altogether?

@s1monw
Contributor

s1monw commented Mar 14, 2016

IMO the timestamp should be assigned as soon as the document enters the system. That is what ingest will do. I think that is a clear property of the timestamp. I don't understand what folks are concerned about here: how much time do you think it takes from sending the doc until it's really indexed? I don't understand your use case either. Do you rely on the actual time the doc is indexed? What does it buy you?

Please be reasonable: we can't give you any total ordering guarantees based on timestamps.

Are you concerned about the corner case of the document being stuck in the thread-pool queue? I think you can just ignore that; unless you totally overload your server, it should be like a second at most or something.

@lam-juice
Author

You can probably find the answer if you ask yourselves, "Why was _timestamp created in the first place?"

What I'm concerned about is knowing, as well as a system can, the time elapsed between the moment an event is generated and the moment the same event is available for searches.

What does it buy us? Elasticsearch is only one component in the entire analytics pipeline. Say, for example, that a process periodically queries Elasticsearch and aggregates some results. When an event is delayed for long enough, it may miss such an aggregation. A _timestamp provides a good approximation of an event's delay, and provides a means to rerun such aggregations on events that were missed. It's just one example.
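
A minimal sketch of how missed events could be identified for such a rerun, assuming each document exposes its event time as `@timestamp` and an index-time approximation as `_timestamp`, both as `datetime` values. The function name, window boundaries, and sample documents are hypothetical.

```python
from datetime import datetime


def missed_by_run(docs, window_start, window_end, run_time):
    """Events that belong to the window but were indexed after the aggregation ran."""
    return [
        doc for doc in docs
        if window_start <= doc["@timestamp"] < window_end
        and doc["_timestamp"] > run_time
    ]


# Hypothetical aggregation over 10:00-11:00 that ran at 11:05.
window_start = datetime(2016, 3, 14, 10, 0)
window_end = datetime(2016, 3, 14, 11, 0)
run_time = datetime(2016, 3, 14, 11, 5)

events = [
    {"id": "on-time", "@timestamp": datetime(2016, 3, 14, 10, 15),
     "_timestamp": datetime(2016, 3, 14, 10, 15, 2)},
    {"id": "late", "@timestamp": datetime(2016, 3, 14, 10, 50),
     "_timestamp": datetime(2016, 3, 14, 12, 30)},
    {"id": "other-window", "@timestamp": datetime(2016, 3, 14, 11, 20),
     "_timestamp": datetime(2016, 3, 14, 11, 21)},
]
missed = missed_by_run(events, window_start, window_end, run_time)
print([doc["id"] for doc in missed])
```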

Another example is that it allows us to profile the time spent in Logstash, so that its configuration can be optimised or, if that is not possible, Logstash can be replaced with something more efficient.

Judging from the replies, I'm not the only person with this concern, and without a tool for measurement nobody can say that what we're talking about is a "corner case" or "like a second at most or something".

@s1monw
Contributor

s1monw commented Mar 14, 2016

Sorry, I am not sure _timestamp had the properties you are asking for. It was added at some point inside Elasticsearch, and if you are hammering the index you can easily get stuck in a lock on indexing etc. There is no guarantee of any sort, neither in the new way nor in the old way. There can be hours between the doc being indexed and it being visible in search. If you are relying on exact numbers here, your system has a huge flaw, IMO.

@lam-juice @djschny the new way is as good as the old way. It is assigned at a slightly earlier stage, but for all use cases of _timestamp the place should not matter. If it does, _timestamp is not what you are looking for.

@lam-juice
Author

It depends on the guarantee we want. To us, Elasticsearch and whatever feeds it events are two separate entities, so having a facility is better than having none for optimising and/or evaluating what sits between ES and the event source (e.g. Logstash).

Does it provide 100% guarantee? No - and it's not what I need anyway. Has _timestamp statistically been providing what I need, which is 99% of the time a good approximation (even if it can have a long tail)? Definitely.

So you can see I'm not relying on exact numbers, @s1monw. If it is assigned at a slightly earlier stage, so be it: deduplication provides the robustness I need anyway if I decide to observe the difference in the fatness of the tail and reindex more conservatively. I think our fundamental disagreement is over the notion that "an ES-assigned timestamp does not accomplish more than a client-assigned timestamp", because I believe a client-assigned timestamp does not allow even a rough estimate of the time spent before an event crosses the boundary into ES.

@s1monw
Contributor

s1monw commented Mar 16, 2016

> Does it provide 100% guarantee? No - and it's not what I need anyway. Has _timestamp statistically been providing what I need, which is 99% of the time a good approximation? Definitely.

That has not changed if you use ingest.

> "an ES-assigned timestamp does not accomplish more than a client-assigned timestamp"

That is basically what it was all along; no other guarantees were given.

7 participants