[Google Workspace] Cursor is not set correctly in httpjson.yml.hbs #4491

Closed
tlee-elastic opened this issue Oct 24, 2022 · 11 comments

@tlee-elastic

tlee-elastic commented Oct 24, 2022

Hi Team,

The cursor is not set correctly in the httpjson.yml.hbs.

cursor:
  last_execution_datetime:
    value: "[[formatDate now]]"

When the agent restarts, it will continue from the current time instead of progressing from the last API response time. This causes an outage as the agent cannot automatically backfill the data from the last successful document.

@elasticmachine

Pinging @elastic/security-external-integrations (Team:Security-External Integrations)

@jamiehynds jamiehynds added the Integration:google_workspace Google Workspace label Oct 24, 2022
@andrewkroh
Member

@marc-gr Do you recall why "execution time" was used as opposed to a timestamp taken from the last received event?

cursor:
  last_execution_datetime:
    value: "[[formatDate now]]"

@andrewkroh
Member

> When the agent restarts, it will continue from the current time instead of progressing from the last API response time.

@tlee-elastic It should be resuming by using the execution time of the last request (as was persisted to disk). If that state was lost (or it was the first time the integration runs) then by default it will use now - 24h. Does your Agent container have persistent storage mounted such that on restart it retains the persisted state?
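
To make the resume behaviour described here concrete, the following is a minimal Python sketch, not the httpjson implementation. The function name `resolve_start_time`, the constant `DEFAULT_LOOKBACK`, and the timestamp layout are assumptions chosen for illustration; only the "persisted execution time, else now - 24h" rule comes from this comment.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# "now - 24h" fallback described in the comment above.
DEFAULT_LOOKBACK = timedelta(hours=24)


def resolve_start_time(persisted_cursor: Optional[str], now: datetime) -> datetime:
    """Pick the start time for the next request (hypothetical helper).

    If a cursor was persisted, resume from it; otherwise (first run or
    lost state) fall back to now - 24h.
    """
    if persisted_cursor is None:
        return now - DEFAULT_LOOKBACK
    # Assumed layout, matching the timestamps shown in this thread.
    parsed = datetime.strptime(persisted_cursor, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)
```

If the Agent has no persistent storage, `persisted_cursor` would be `None` on every restart and the fallback branch would always be taken.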

@tlee-elastic
Author

@andrewkroh From what I can see in the logs, when the Google Workspace integration paginates, it still updates the cursor with the execution time. If the agent restarts or is disrupted during pagination, it resumes from the cursor time, which results in missing data. Logs below:

cursor.last_execution_datetime stored with 2022-10-24T01:32:59Z

last received page: &httpjson.response{page:15, url:url.URL{Scheme:"https", Opaque:"", User:(*url.Userinfo)(nil), Host:"www.googleapis.com", Path:"/admin/reports/v1/activity/users/all/applications/drive", RawPath:"", ForceQuery:false, RawQuery:"pageToken=<REDACTED>&startTime=2022-10-24T01%3A03%3A49Z", Fragment:"", RawFragment:""}

You can see that the start time in the pagination is still 2022-10-24T01:03:49Z, but the cursor has already been updated to 2022-10-24T01:32:59Z.
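
The gap described above can be reproduced with a small simulation. This is only a sketch under stated assumptions: the page contents in `PAGES` are invented, events are assumed to arrive oldest-first (which may not match the real API), and the helper names `run_interrupted` and `missed_after_resume` are hypothetical. It compares a cursor set to the execution time with a cursor set to the newest event time seen so far.

```python
from datetime import datetime, timezone


def ts(value: str) -> datetime:
    """Parse the timestamp layout used in this thread."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# Invented event times, split across three pages (oldest-first is an assumption).
PAGES = [
    [ts("2022-10-24T01:05:00Z"), ts("2022-10-24T01:10:00Z")],
    [ts("2022-10-24T01:15:00Z"), ts("2022-10-24T01:20:00Z")],
    [ts("2022-10-24T01:25:00Z"), ts("2022-10-24T01:30:00Z")],
]


def run_interrupted(pages, execution_time, pages_before_restart, use_event_time):
    """Process pages until a simulated restart; return (cursor, delivered events)."""
    cursor = None
    delivered = []
    for page in pages[:pages_before_restart]:
        delivered.extend(page)
        # The cursor is updated after every page, as observed in the logs.
        cursor = max(page) if use_event_time else execution_time
    return cursor, delivered


def missed_after_resume(pages, cursor, delivered):
    """Events never delivered that a request starting at `cursor` would skip."""
    all_events = [event for page in pages for event in page]
    return [e for e in all_events if e not in delivered and e < cursor]


execution_time = ts("2022-10-24T01:32:59Z")
for use_event_time in (False, True):
    cursor, delivered = run_interrupted(PAGES, execution_time, 1, use_event_time)
    missed = missed_after_resume(PAGES, cursor, delivered)
    print(f"use_event_time={use_event_time}: cursor={cursor.isoformat()} missed={len(missed)}")
```

With the execution-time cursor, the four events on the unread pages fall before the cursor and are skipped on resume; with the event-time cursor, nothing is skipped in this simulation.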

@marc-gr
Contributor

marc-gr commented Oct 25, 2022

> When the agent restarts, it will continue from the current time instead of progressing from the last API response time.
>
> @tlee-elastic It should be resuming by using the execution time of the last request (as was persisted to disk). If that state was lost (or it was the first time the integration runs) then by default it will use now - 24h. Does your Agent container have persistent storage mounted such that on restart it retains the persisted state?

I do not remember any specific reason, and indeed it seems like using the provided time in the events would result in a more robust solution.

@marc-gr
Contributor

marc-gr commented Oct 26, 2022

> When the agent restarts, it will continue from the current time instead of progressing from the last API response time.
>
> @tlee-elastic It should be resuming by using the execution time of the last request (as was persisted to disk). If that state was lost (or it was the first time the integration runs) then by default it will use now - 24h. Does your Agent container have persistent storage mounted such that on restart it retains the persisted state?
>
> I do not remember any specific reason, and indeed it seems like using the provided time in the events would result in a more robust solution.

Answering my own question: going through the integration, I remember why I did not use the id.time field. I saw different date formats for the time across several samples, so supporting all of them in the httpjson config would be a bit brittle and cumbersome. I can explore it again and amend it, as I do think it would be better in a failure scenario like the one described.
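
For illustration only, here is a Python sketch of normalising an event time that may arrive in more than one layout before using it as a cursor. The thread does not say which formats were observed, so the entries in `CANDIDATE_FORMATS` are assumptions, and `parse_event_time` and `to_cursor` are hypothetical helpers rather than anything httpjson provides.

```python
from datetime import datetime, timezone

# Assumed layouts; the actual formats seen in the samples are not listed here.
CANDIDATE_FORMATS = (
    "%Y-%m-%dT%H:%M:%S.%fZ",  # fractional seconds, UTC designator
    "%Y-%m-%dT%H:%M:%SZ",     # whole seconds, UTC designator
    "%Y-%m-%dT%H:%M:%S%z",    # whole seconds, numeric UTC offset
)


def parse_event_time(raw: str) -> datetime:
    """Try each candidate layout in turn and return a UTC datetime."""
    for fmt in CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            # Layouts ending in a literal "Z" parse as naive; treat them as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    raise ValueError(f"unrecognized timestamp format: {raw!r}")


def to_cursor(raw: str) -> str:
    """Render the event time in the single layout used for the cursor."""
    return parse_event_time(raw).strftime("%Y-%m-%dT%H:%M:%SZ")
```

Every new layout needs another entry in the list, which is the brittleness mentioned above.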

@botelastic

botelastic bot commented Oct 26, 2023

Hi! We just realized that we haven't looked into this issue in a while. We're sorry! We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1. Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Oct 26, 2023
@geekpete
Member

Still relevant 👍

@kcreddy
Contributor

kcreddy commented Dec 5, 2023

@geekpete PR #4982 from @marc-gr was supposed to fix this problem by using a new cursor state.
Can you give more context on the issue you are facing? Which version of the integration and which data stream are you using, and is it the same issue (i.e., incorrect cursor state leading to data loss)?

@andrewkroh
Member

This seems like it was a duplicate of #4796 and it should have been closed along with it.

@kcreddy
Contributor

kcreddy commented Jan 2, 2024

Closed as duplicate.

@kcreddy kcreddy closed this as completed Jan 2, 2024