There is no way to ask ES to insert a timestamp at index time #15644
Comments
+1 we use it to calculate delay in production - we have a field called
+1 If there is no _timestamp system meta field, a retry scenario becomes less elegant and less efficient, since the client will have to modify the index request source with a new "timestamp" value for its user-defined "timestamp" field.
I'm dealing with the same thing. Also, I'm not sure whether I have to re-index all my data because of the deprecation. The documentation says it should be compatible, but it somehow is not...
In master this can be done using ingest pipelines. Closing in favour of #14049 |
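To make the ingest-pipeline alternative concrete, here is a minimal sketch of a pipeline that stamps each document with the ingest node's clock via the `set` processor and its `_ingest.timestamp` metadata (the documented mechanism). The pipeline id `add-received-at` and the field name `received_at` are made-up names for illustration, not anything from this thread.

```python
import json

# Hypothetical pipeline definition; "received_at" is an illustrative field name.
# The "set" processor copies the ingest node's clock into the document via the
# _ingest.timestamp metadata field.
pipeline = {
    "description": "Stamp each document with the time it entered the cluster",
    "processors": [
        {
            "set": {
                "field": "received_at",
                "value": "{{_ingest.timestamp}}",
            }
        }
    ],
}

# This JSON body would be registered with something like:
#   PUT _ingest/pipeline/add-received-at
# and then used by indexing with ?pipeline=add-received-at.
body = json.dumps(pipeline)
```

Note that, as discussed below, this stamp is applied on the ingest node before the request reaches a data node, so it approximates arrival time rather than the moment of actual indexing.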
Unfortunately, ingest pipelines will not give what is desired though, correct? They will not account for the time an index request spends sitting in a threadpool queue before being worked on, since the timestamp is added by an ingest node before the request ever reaches a data node.
There is always a delta between the assignment of a timestamp and the moment when it's actually indexed. The timestamp doesn't guarantee linearizability in any way; that's what a sequence ID is for. This feature can only be as good as some client app setting the timestamp, no matter where you put it. It should be neither more nor less.
However, what is doable then is to guarantee that the ingest pipeline performs the field insertion at a stage no earlier than where _timestamp insertions are currently performed, correct? If that guarantee can be made, please ignore everything below. Even if there is always a delta, the issue is its magnitude; it is not a binary issue. I believe an effort should be made to minimize this delta. "A delta always exists" is very well understood; however, that should not be used as a reason to arbitrarily enlarge the delta while not providing a means to accomplish what the small delta used to. In other words, a small delta is better than a large delta: a timestamp inserted closer to actual indexing is better than a timestamp inserted by the client app. To say that it "can only be as good as some client app setting the timestamp" totally ignores the magnitude of the delta. Do you agree that using _timestamp is a good way to measure pipeline delays? If not, is an alternative offered that works at least as well as _timestamp? Or should we ignore Logstash delays altogether?
IMO the timestamp should be assigned as soon as the document enters the system; that is what ingest will do. I think that is a clear property of the timestamp. I don't understand what folks are concerned about here: how much time do you think it takes from sending the doc until it's really indexed? I don't understand your usecase either; do you rely on the actual time the doc is indexed? What does it buy you? Please be reasonable; we can't give you any total-ordering guarantees based on timestamps. If you are concerned about the corner case of the document being stuck in the thread-pool queue, I think you can just ignore that: unless you totally overload your server, it should be like a second at most.
You can probably find the answer if you ask yourselves, "Why was _timestamp created in the first place?" What I'm concerned with is knowing, as well as the system can muster, the time elapsed between the moment an event is generated and the moment that same event becomes available for searches. What does it buy us? ElasticSearch is only one component in the entire analytics pipeline. Say, for example, that a process periodically queries ElasticSearch and aggregates some results. When an event is delayed for long enough, it may miss such an aggregation. A _timestamp provides a good approximation of an event's delay, and provides a means to rerun such aggregations on events that were missed. That's just one example. Another is that it allows us to profile time spent in Logstash, so its configuration can be optimised, or, failing that, so Logstash can be replaced with something more efficient. Judging from the replies, I'm not the only person with this concern, so without a tool for measurement nobody can say it's a "corner case", or "like a second at most or something", that we're talking about.
Sorry, I am not sure @lam-juice @djschny. The new way is as good as the old way; it is assigned at a slightly earlier stage, but for all usecases of
It depends on the guarantee we want. To us, ElasticSearch and whatever feeds ElasticSearch its events are two separate entities, so some facility is better than none for optimising and/or evaluating what sits between ES and the event source (e.g. Logstash). Does it provide a 100% guarantee? No, and that's not what I need anyway. Has _timestamp statistically been providing what I need, which is a good approximation 99% of the time (even if it can have a long tail)? Definitely. So you can see I'm not relying on exact numbers @s1monw; if it is assigned at a slightly earlier stage, so be it. Deduplication provides the robustness I need anyway if I decide to observe the difference in the fatness of the tail and reindex more conservatively. I think our fundamental disagreement is over the notion that "an ES-assigned timestamp does not accomplish more than a client-assigned timestamp", because I believe a client-assigned timestamp does not allow even a rough estimate of the time spent before an event crosses the boundary into ES.
That has not changed if you use ingest.
That is basically what it was all the time; no other guarantees were given.
Due to the deprecation of _timestamp, there is no way to insert a timestamp at exactly the index time.
In my use case, the _timestamp field is useful for more than just ttl. I use ES as timeseries storage, and if anyone wants to know how long the event pipeline takes to deliver a document, prior to ES 2.0 this could be calculated as (_timestamp - @timestamp), provided that @timestamp is obtained from the parsed event time and _timestamp is enabled.
When such events can be identified, time-sensitive operations that missed them due to pipeline delays may attempt to recover the events by filtering on the pipeline delay time, i.e. "(_timestamp - @timestamp) > 2 hours".
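The delay calculation described above can be sketched as follows. This is an illustration, not anything from ES itself: the helper names (`pipeline_delay`, `missed_by_aggregation`) are made up, and the ISO-8601 string format is an assumption about how the two timestamp fields are stored in a retrieved document.

```python
from datetime import datetime, timedelta

# Hedged sketch: field names "_timestamp" / "@timestamp" follow the issue text;
# helper names and the ISO-8601 format are illustrative assumptions.
def pipeline_delay(doc):
    """Delay between event time (@timestamp) and index time (_timestamp)."""
    indexed = datetime.fromisoformat(doc["_timestamp"])
    event = datetime.fromisoformat(doc["@timestamp"])
    return indexed - event

def missed_by_aggregation(docs, threshold=timedelta(hours=2)):
    """Documents whose delivery delay exceeded the aggregation window."""
    return [d for d in docs if pipeline_delay(d) > threshold]

docs = [
    # Delivered 3 seconds after the event occurred: fine.
    {"@timestamp": "2016-01-04T10:00:00+00:00",
     "_timestamp": "2016-01-04T10:00:03+00:00"},
    # Delivered 3.5 hours late: missed any 2-hour aggregation window.
    {"@timestamp": "2016-01-04T08:00:00+00:00",
     "_timestamp": "2016-01-04T11:30:00+00:00"},
]
late = missed_by_aggregation(docs)  # only the second document
```

In practice the same filter could be pushed down to ES as a script query over the two fields; the point is simply that both timestamps must exist on the document for the delta to be computable at all.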
Asking Logstash to insert its own date field does not accurately capture the delay between the time this new field is created and the time the document is indexed.
I understand that a user may want to store multiple custom timestamps, e.g. moved, modified, etc., so let's not make one special _timestamp. However, I believe the time at which a document is indexed for the first time is special, and is at least important for identifying documents that need some recovery action due to indexing delays.