Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[slack] Fix 'oldest' API query param handling #6958

Merged
merged 12 commits into from Aug 9, 2023

Conversation

bhapas
Copy link
Contributor

@bhapas bhapas commented Jul 13, 2023

What does this PR do?

This PR fixes the way oldest query parameter is handled. According to Slack API docs , when latest oldest query params are passed in the API request, the logs are returned in descending order meaning most to least recent.

But the agent template is setting first_event.date_create into the oldest param after the first request is made which means for the second request the oldest query param has the date_create of the most recent event. Thereby the old data is never retrieved.

This PR preserves the oldest parameter and passes the date_create into the latest param for the next request so that the events are pulled as per the limit config.

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.

Related issues

@elasticmachine
Copy link

elasticmachine commented Jul 13, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-08-09T13:55:54.457+0000

  • Duration: 16 min 16 sec

Test stats 🧪

Test Results
Failed 0
Passed 5
Skipped 0
Total 5

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@elasticmachine
Copy link

elasticmachine commented Jul 13, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 100.0% (1/1) 💚
Files 100.0% (1/1) 💚
Classes 100.0% (1/1) 💚
Methods 100.0% (14/14) 💚
Lines 98.715% (384/389)
Conditionals 100.0% (0/0) 💚

@bhapas bhapas marked this pull request as ready for review July 13, 2023 22:39
@bhapas bhapas requested a review from a team as a code owner July 13, 2023 22:39
@elasticmachine
Copy link

Pinging @elastic/security-external-integrations (Team:Security-External Integrations)

@andrewkroh
Copy link
Member

andrewkroh commented Jul 14, 2023

the logs are returned in descending order meaning most to least recent

If logs are returned newest-to-oldest then our method of paginating won't work even if we change to using the timestamp of the last event returned. I think we would need to apply a fix similar to what was done for Google Workspace in #4982 to fix #4796.

From the Slack docs:

Use pagination to programmatically access more audit events, following the guidelines outlined in Slack's cursor-based pagination.

If we can use a cursor provided by the API that would be ideal. This issue gives a second confirmation that cursor is supported slackapi/python-slack-sdk#1147. It won't fix the underlying ordering issue but it is a more accurate way to iterate through logs without getting duplicates.

@bhapas
Copy link
Contributor Author

bhapas commented Jul 15, 2023

the logs are returned in descending order meaning most to least recent

If logs are returned newest-to-oldest then our method of paginating won't work even if we change to using the timestamp of the last event returned. I think we would need to apply a fix similar to what was done for Google Workspace in #4982 to fix #4796.

@andrewkroh We are using this cursor in our agent template - https://github.com/elastic/integrations/pull/6958/files#diff-7b5b9582e9bfd6a478411e73c6990e8565fb41d12d645cd785d121de5412bbfcR40-R42

So, A combination of cursor oldest latest API query parameters are formed to continue the pagination cycle. The parameter oldest was set using the wrong parameter and hence the older logs were never retrieved from second pagination cycle.

In case of Google workspace in #4982 , I think there is no specific cursor value and startTime was set in such a way that the entries are retrieved correctly and without duplication.

As per discussion here , the PR is tested and is working fine to retrieve logs based on initial_interval configuration

@andrewkroh
Copy link
Member

andrewkroh commented Jul 25, 2023

@bhapas and I talked via Zoom. Summarizing here:

  1. As of ccaf565 the implementation will continuously request logs from now - initial_interval. That is a lot of wasted resources and will result in duplicates documents even if an _id is used (b/c backing indices roll over).
  2. When making pagination requests, it should not include the oldest and latest request filters. Only include the cursor for these requests (under the assumption that the cursor includes state about the initial request, this needs to be verified). If that assumption is not true, then we need to ensure that during the pagination requests that the oldest and latest are not mutated from what they were during the first request (I think they are changing on each request today because now is used).
  3. If the input is interrupted during the pagination cycle then we need to ensure when it resumes that it requests the remaining older data. Because data flows from newest to oldest, care must be taken to ensure the proper information is persisted in the cursor to know that it completed pagination to the "end".

@chrisberkhout
Copy link
Contributor

I started reviewing this before seeing the new comments. Hope this still helps:

This would be much easier if it was possible to request order=ascending, but based on the docs that seems to be impossible.

As of ccaf565 it will paginate from now to the oldest data or initial_interval before the last request (url.params.oldest is reset each time). Then, it will try to restart, with the issue of oldest still being initial_interval ago, but latest will be set with cursor.last_timestamp, so it'll be looking at a range that's almost empty and only has old data. I'm not sure fail_on_template_error actually works in the cursor section, but even if it does, a non-empty response with no new cursor from Slack would leave our cursor.last_timestamp set for the next round.

Based on examples I saw in the Slack docs, I suspect the cursor will only include an item ID and no other context from the request in which it is returned. If you're working with a live endpoint you can base64 decode the cursor to see what's in there. However, I still think oldest and latest are unnecessary for the final solution. You'll have a starting point that's either now or a given cursor, and keep paginating until you reached far enough back - either there are none left, or you get an item/time you've already seen.

Here's some logic that I think would work for resuming interrupted pagination, both for the initial fetch for later:

if (cursor.next_stop == null):
  cursor.next_stop = now
if (cursor.stop == null):
  cursor.stop = now - initial_interval
if (cursor.cursor != null):
  use it for request
do request
cursor.cursor = .last_response.body.response_metadata.next_cursor
if (cursor.cursor == null OR last_event.date_create <= cursor.stop):
  cursor.cursor = null
  cursor.stop = cursor.next_stop
  cursor.next_stop = null
go to start

It'd take some care to translate that into request.transforms:, cursor: and response.pagination: config.

@bhapas
Copy link
Contributor Author

bhapas commented Jul 26, 2023

@chrisberkhout

However, I still think oldest and latest are unnecessary for the final solution. You'll have a starting point that's either now or a given cursor, and keep paginating until you reached far enough back - either there are none left, or you get an item/time you've already seen.

This is more like client side filtering that we try to achieve using cursor. But I think it is still good to get only enough data from the API by applying API filters efficiently. In this case , if we do not have oldest / latest params set in the request, then the API returns all the events from March, 2018 and that would be huge to handle every time on our resources.

Then, it will try to restart, with the issue of oldest still being initial_interval ago, but latest will be set with cursor.last_timestamp, so it'll be looking at a range that's almost empty and only has old data.

Yes , the oldest shall still be initial_interval ago and latest moved to last event's timestamp in the last page.
The workflow idea is something like this

  • Say , we have events for the last 10 days and each page returns 1 day's event logs
  • Then on first request oldest is 10 days ago and latest is now - response contains next_cursor
  • The second pagination request does not require oldest latest , Just passing next_cursor should inform API that this is the same request.
  • If the request flow is paused or broken at some point , then we need to save the last next_cursor value to be able to start from that point after resumption and if the pagination is complete the oldest should be replaced to now as in latest is now in the second step above.

Please let me know if I am not clear here :-)

@chrisberkhout
Copy link
Contributor

@bhapas

However, I still think oldest and latest are unnecessary for the final solution. You'll have a starting point that's either now or a given cursor, and keep paginating until you reached far enough back - either there are none left, or you get an item/time you've already seen.

This is more like client side filtering that we try to achieve using cursor. But I think it is still good to get only enough data from the API by applying API filters efficiently. In this case , if we do not have oldest / latest params set in the request, then the API returns all the events from March, 2018 and that would be huge to handle every time on our resources.

Agreed. Using oldest can avoid fetching items older than you care about. If you don't use oldest you can still stop at the first page with older items, but I agree it's better to to use oldest if you can.

The workflow idea is something like this

  • Say , we have events for the last 10 days and each page returns 1 day's event logs
  • Then on first request oldest is 10 days ago and latest is now - response contains next_cursor
  • The second pagination request does not require oldest latest , Just passing next_cursor should inform API that this is the same request.
  • If the request flow is paused or broken at some point , then we need to save the last next_cursor value to be able to start from that point after resumption and if the pagination is complete the oldest should be replaced to now as in latest is now in the second step above.

Yes, that workflow looks right to me and matches my earlier suggestion but using oldest. I would just add:

  • latest doesn't need to be in the request. On the first page it'll default to now, and on following pages the cursor will be the boundary for the latest.
  • The cursor probably won't maintain oldest, but you can keep that on the client side and re-apply it each time. It's possible I'm wrong but I'm saying this because the example cursor value in the documentation is dXNlcjpVMEc5V0ZYTlo= and base64 decoded that's just user:U0G9WFXNZ, so I think it tracks what item it's up to but not other context from the original request.

@bhapas
Copy link
Contributor Author

bhapas commented Jul 26, 2023

@chrisberkhout

latest doesn't need to be in the request. On the first page it'll default to now, and on following pages the cursor will be the boundary for the latest.

True. But the latest in the current request and oldest in the next request after pagination should be same so that we don't lose data in the time that we lost in processing the first request , if we do not store the now. WDYT?

The cursor probably won't maintain oldest, but you can keep that on the client side and re-apply it each time. It's possible I'm wrong but I'm saying this because the example cursor value in the documentation is dXNlcjpVMEc5V0ZYTlo= and base64 decoded that's just user:U0G9WFXNZ, so I think it tracks what item it's up to but not other context from the original request

Not 100% sure how the filtering works here.. So would like to test out this scenario with the real API

@chrisberkhout
Copy link
Contributor

@bhapas

True. But the latest in the current request and oldest in the next request after pagination should be same so that we don't lose data in the time that we lost in processing the first request , if we do not store the now. WDYT?

Right, that might be be a good way to keep hold of it. In #6649 I ended up adding an ignored query string parameter just to get access during pagination to a value set for the original request.

@bhapas
Copy link
Contributor Author

bhapas commented Jul 27, 2023

@andrewkroh @chrisberkhout

Looking through the API docs again, it seems that the audit logs API does not have a filter parameter cursor according to docs.

This might be reconfirmed by looking at pagination docs - https://api.slack.com/docs/pagination#cursors - The audit logs API is not in the list of supported methods for cursor-based pagination.

@chrisberkhout
Copy link
Contributor

@andrewkroh @bhapas

Yeah, the docs are ambiguous. On the Audit Logs API page it does say:

Use pagination to programmatically access more audit events, following the guidelines outlined in Slack's cursor-based pagination.

I assumed they probably added cursor-based pagination but only partially updated the docs.

There is some other code on Github that uses the same audit logs endpoint and uses cursors:
https://github.com/refractionPOINT/usp-adapters/blob/6401ca/slack/client.go#L20-L41

@andrewkroh
Copy link
Member

Looking through the API docs again, it seems that the audit logs API does not have a filter parameter cursor according to docs.

@bhapas, As I mentioned in #6958 (comment), the conversation in slackapi/python-slack-sdk#1147 indicates that cursor is supported.

Copy link
Contributor

@chrisberkhout chrisberkhout left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the logic:

  • Tidying up: last_timestamp is stored in cursor data but not used.
  • Resuming a sequence: Rather than using last_timestamp, it'd be better to store .last_response.body.response_metadata.next_cursor in cursor data and use that for resuming a pagination sequence. That would mean using that cursor in request.transforms. With that there, response.pagination would still need be defined, but it wouldn't have to set any different values.
  • Endpoint of a new sequence: next_oldest_date will get stored after every non-empty page, but to begin a 2nd or later pagination sequence, we want the .first_event.date_create from the first page of the last sequence (or probably latest from that sequence if it contained no events).
  • Params for later pages in a sequence: In response.pagination, if the next_cursor value only indicates a start point for the next page as I suspect (to be verified one way or the other), you'll need to keep oldest around. It may be okay to delete latest, or it may be more convenient to keep it around.

To review, it'd be helpful to have see an example sequence of requests, maybe from logs, to see how the params get preserved / updated at each step.

@bhapas
Copy link
Contributor Author

bhapas commented Aug 1, 2023

Tidying up: last_timestamp is stored in cursor data but not used.

Done

Endpoint of a new sequence: next_oldest_date will get stored after every non-empty page, but to begin a 2nd or later pagination sequence, we want the .first_event.date_create from the first page of the last sequence (or probably latest from that sequence if it contained no events).

The first_event is the first picked up event in the context as I see here - https://github.com/elastic/beats/blob/main/x-pack/filebeat/input/httpjson/request.go#L540-L541 and is not overridden until the context is done. So this should always be the time of the first event in the first page of the cycle until the pagination is complete.

Resuming a sequence: Rather than using last_timestamp, it'd be better to store .last_response.body.response_metadata.next_cursor in cursor data and use that for resuming a pagination sequence. That would mean using that cursor in request.transforms. With that there, response.pagination would still need be defined, but it wouldn't have to set any different values.

Agree. For the pagination requests , Just the cursor is enough , So, will set the cursor value if pagination is incomplete. If it is a new request then the oldest and latest values are set.

Params for later pages in a sequence: In response.pagination, if the next_cursor value only indicates a start point for the next page as I suspect (to be verified one way or the other), you'll need to keep oldest around. It may be okay to delete latest, or it may be more convenient to keep it around.

As I read it in the API docs,

Cursor-paginated methods accept cursor and limit parameters.

So it should be safe to consider the cursor url param to retrieve the next page of results. Or WDYM?

To review, it'd be helpful to have see an example sequence of requests, maybe from logs, to see how the params get preserved / updated at each step.

I would love to as well. But the interrupted pagination is something difficult to automate as a test , hence wanted to test this with a slack API endpoint, if possible. But for the regular runs the log shall look like

2023-08-01 10:51:11 {"level":"debug","ts":"2023-08-01T08:51:11.368Z","caller":"httpserver/httpserver.go:130","msg":"Rule #1 matched: request #1 => Request path: GET /audit/v1/logs?latest=1690879871&limit=2&oldest=1673599871, Request Headers: 'User-Agent: [Elastic-Filebeat/8.9.1 (linux; arm64; 8757c8273a98d06d863746c41e404f066a9afc41; 2023-07-31 19:01:37 +0000 UTC)]' 'Accept: [application/json]' 'Authorization: [Bearer xoxp-1234567890]' 'Accept-Encoding: [gzip]' 'Connection: [close]' , Request Body: ","address":":8080"}


2023-08-01 10:51:13 {"level":"debug","ts":"2023-08-01T08:51:13.355Z","caller":"httpserver/httpserver.go:130","msg":"Rule #0 matched: request #1 => Request path: GET /audit/v1/logs?cursor=YXNkZmFzZGZhc2Rm&limit=2, Request Headers: 'Accept: [application/json]' 'Authorization: [Bearer xoxp-1234567890]' 'Accept-Encoding: [gzip]' 'Connection: [close]' 'User-Agent: [Elastic-Filebeat/8.9.1 (linux; arm64; 8757c8273a98d06d863746c41e404f066a9afc41; 2023-07-31 19:01:37 +0000 UTC)]' , Request Body: ","address":":8080"}

@bhapas
Copy link
Contributor Author

bhapas commented Aug 3, 2023

Initial request would be
GET /audit/v1/logs?latest=1691061097&limit=2&oldest=1673781097

Next request in pagination would be
GET /audit/v1/logs?cursor=YXNkZmFzZGZhc2Rm&limit=2


Resuming broken pagination request would look like

GET /audit/v1/logs?cursor=RJFHtFt4gU8JKFm&limit=2 where cursor=RJFHtFt4gU8JKFm is the next_cursor value in the last successful response.


If pagination is complete, then the next request after interval would be
GET /audit/v1/logs?latest=<First-Event-Time-In-Last-Request>&limit=2&oldest=<now>

@bhapas bhapas requested a review from andrewkroh August 3, 2023 11:24
Copy link
Contributor

@chrisberkhout chrisberkhout left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this example, the values for oldest and latest should be swapped:

If pagination is complete, then the next request after interval would be GET /audit/v1/logs?latest=<First-Event-Time-In-Last-Request>&limit=2&oldest=<now>

Functionality

In setting url.params.oldest and url.params.latest, it looks like the intention is to have a default value on the first request, another value for the start of a new sequence (for latest it's the same value), and an empty string for pages that aren't the first in a sequence. There are a couple of issues with that:

  • Currently the empty string part will actually evaluate to '' (two single quotes). The else branch could be removed, or the template changed to [[- '' -]] to get the empty string, however...
  • If the value: part returns an empty string, it will use the default, even if it's not the initial request. I think it would actually be okay for both oldest and latest, but it doesn't match the plan. For requests with a cursor, if it's not resuming from interruption, the oldest and latest will be removed at 73-76, but that won't be done when resuming. This logic could be more simple and correct.

There should be a value for next_oldest_date for when there were no events in a sequence. Ideally it'll be the now of the request, with some padding (e.g. 5 mins back).

Potential improvements that don't change functionality

  • It's probably fine to never set url.params.latest, or to always set it to (now).Unix.
  • cursor.last_cursor could be called cursor.next_cursor. It came from the last request, but it was called next_cursor in the response data and it hasn't been used yet.
  • For setting values in the cursor: section, fail_on_template_error: true isn't an option (see cursor docs).
  • pagination_finished and last_cursor (or next_cursor) could be combined. If so, setting ignore_empty_value: false may be necessary.
  • White space trimming with - is redundant on line 25, so I'd remove that. On lines 37, 39, 52, 54 it's not technically needed because the preceding and following lines will trim across the line boundaries, but I think it's better style to have it there.

Assumptions to validate

  • An empty cursor parameter will be ignored.
  • A request with a cursor will not provide a next_cursor if it's reached oldest from the initial request.
  • Cursor values are unlikely to expire in minutes or hours (true if it contains an item ID, false if it's some kind of query reference or includes expiry information).

@bhapas
Copy link
Contributor Author

bhapas commented Aug 4, 2023

Functionality changes

  • Removed using url.params.latest as it defaults to now always.
  • Fixed the value template that evaluates the bool cursor values
  • If there is no event in the current request sequence then the next_oldest_date is defaulted to now

Improvements

  • It's probably fine to never set url.params.latest, or to always set it to (now).Unix.
  • cursor.last_cursor could be called cursor.next_cursor. It came from the last request, but it was called next_cursor in the response data and it hasn't been used yet.
  • For setting values in the cursor: section, fail_on_template_error: true isn't an option (see cursor docs).
    pagination_finished and last_cursor (or next_cursor) could be combined. If so, setting ignore_empty_value: false may be necessary.
  • White space trimming with - is redundant on line 25, so I'd remove that. On lines 37, 39, 52, 54 it's not technically needed because the preceding and following lines will trim across the line boundaries, but I think it's better style to have it there.

Fixed all these improvements

Test scenarios

  • Initial request with pagination - GET /audit/v1/logs?limit=2&oldest=1673860384

  • Paginated request in same sequence - GET /audit/v1/logs?cursor=YXNkZmFzZGZhc2Rm&limit=2

  • Next request with oldest value from first_event of previous Request sequence - GET /audit/v1/logs?limit=2&oldest=1683836277

  • Paginated request in same sequence - GET /audit/v1/logs?cursor=GytjmKHF5hFmty&limit=2

  • Next request with oldest value from first_event of previous Request sequence - GET /audit/v1/logs?limit=2&oldest=1683836285 - This is not a paginated response

  • Next request with oldest value from first_event of previous Request sequence - GET /audit/v1/logs?limit=2&oldest=1683836290

Assumptions validated

  • An empty cursor parameter will be ignored.
  • A request with a cursor will not provide a next_cursor if it's reached oldest from the initial request.
    From the docs -

You'll know that there are no further results to retrieve when a next_cursor field contains an empty string (""). You're not even paginating at all if you receive no response_metadata or its next_cursor value

@bhapas bhapas requested a review from andrewkroh August 7, 2023 07:41
@chrisberkhout
Copy link
Contributor

The last change seems good (only setting next_oldest_date when there is a result, and otherwise repeating the initial window or a previous next_oldest_date value).

I think this is a working version, given the assumption that a cursor knows when to stop.

I think that's still to be validated. The documentation you quoted makes it sound like it'll stop but consider this scenario:

  • There's 5 years of data.
  • We do our initial request with oldest set to 200 days ago.
  • We receive a page of results with with 10 days of data and a next_cursor value, make a new request with it and repeat this step until there is no new next_cursor value.
  • The end

Did the API decide there was "no further results to retrieve" after 200 days of data, or after the the full 5 years of data?
If it's the latter, we'd want to keep the oldest parameter for requests with a cursor, or if that's ignored, to check the dates received and stop when we have enough.

One non-functional point: if cursor.next_cursor used the option ignore_empty_value: false, then pagination_finished could go away and instead we could just check if there's a next_cursor value or not.

@bhapas
Copy link
Contributor Author

bhapas commented Aug 8, 2023

Did the API decide there was "no further results to retrieve" after 200 days of data, or after the the full 5 years of data?
If it's the latter, we'd want to keep the oldest parameter for requests with a cursor, or if that's ignored, to check the dates received and stop when we have enough.

The API will create a paginated response data for a given period and the paginated request [ The ones with next_cursor ] and the data will be confined to that period. If the request does not contain cursor then it is considered as a new request.

One non-functional point: if cursor.next_cursor used the option ignore_empty_value: false, then pagination_finished could go away and instead we could just check if there's a next_cursor value or not.

This might still be needed in case the response_metadata does not exist in cases the response is not paginated. So , it could be pagination-finished / no-pagination in the same bool pagiantion_finished.

@chrisberkhout chrisberkhout dismissed their stale review August 9, 2023 12:55

Assumption can be tested later

Copy link
Contributor

@chrisberkhout chrisberkhout left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, the assumption can be validated against the live API as follows:

  1. Set initial_interval to a shorter time duration than available historical data.
  2. Set limit to a smaller number of items than available historical data.
  3. Ensure that fetching of historical data is limited to initial_interval.

@bhapas bhapas merged commit ab4d3a1 into elastic:main Aug 9, 2023
4 checks passed
@bhapas bhapas deleted the slack-oldest-logs branch August 9, 2023 14:31
@elasticmachine
Copy link

Package slack - 1.10.1 containing this change is available at https://epr.elastic.co/search?package=slack

gizas pushed a commit that referenced this pull request Sep 5, 2023
* Fix oldest API query param handling

* Remove oldest and latest params in pagination request

* Fix the agent config

* Address pr comments

* Modify agent

* Fix agent config and added system test

* Remove default in next_oldest_date
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants