
gcp: add keep_json option for audit, dns, firewall and vpcflow datastreams #8299

Merged · 2 commits merged into elastic:main on Nov 12, 2023

Conversation

@efd6 (Contributor) commented Oct 26, 2023

Proposed commit message

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Screenshots

@elasticmachine commented Oct 26, 2023

💚 Build Succeeded


Build stats

  • Start Time: 2023-10-31T01:43:02.919+0000

  • Duration: 19 min 15 sec

Test stats 🧪

Test Results: Failed 0 · Passed 68 · Skipped 0 · Total 68

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@P1llus (Member) commented Oct 26, 2023

I think the change itself is great! I like this approach and fully support it. However, I would have preferred the renames to stay as renames rather than using set: users are usually after the fields we delete, not the ones we have already renamed. This also keeps the flattened object much smaller, and we do not need to duplicate data.

This reopens a discussion we have had on the team a few times in the past, and something I was always a bit against. When we created integrations from JSON payloads, what I liked to do was use the JSON processor to unpack the source field directly into the correct destination field (like gcp.audit). This has some pros and cons:
Pros:

  1. We do not need to constantly cherry-pick which fields go from the original JSON to the destination; we only remove the few we do not want.
  2. If new fields are introduced, they will not be automatically removed.

Cons:

  1. New or missed fields will not have a field type mapped, which can cause issues down the line.
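The unpacking approach described above can be sketched as an ingest pipeline fragment. This is an illustrative sketch, not the integration's actual pipeline; the source field name (`message`) and the removed field (`insertId`) are assumptions:

```json
{
  "processors": [
    {
      "json": {
        "field": "message",
        "target_field": "gcp.audit",
        "description": "Unpack the raw JSON payload directly into the destination object"
      }
    },
    {
      "remove": {
        "field": "gcp.audit.insertId",
        "ignore_missing": true,
        "description": "Remove the few fields we do not want; anything new survives by default"
      }
    }
  ]
}
```

With this shape, new upstream fields flow through untouched, but any field not declared in the mappings arrives without a field type, which is the con noted above.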

I like the flattened approach here! I believe we should discuss it further at a separate time and change how we do our unpacking of JSON in general (default to a flattened destination and cherry-pick renames out of the flattened field, leaving any missed or newly added fields in flattened?).

Still, I would really like to push for using renames rather than sets wherever possible!
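For concreteness, the two options being weighed look like this in ingest pipeline terms (field names are illustrative): `rename` moves the value out of the unpacked object, while `set` with `copy_from` leaves a duplicate behind in it:

```json
[
  {
    "rename": {
      "field": "json.protoPayload.methodName",
      "target_field": "gcp.audit.method_name",
      "description": "Moves the value; the source object no longer contains it"
    }
  },
  {
    "set": {
      "field": "gcp.audit.method_name",
      "copy_from": "json.protoPayload.methodName",
      "description": "Copies the value; the source object keeps its own copy"
    }
  }
]
```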

@elasticmachine commented Oct 26, 2023

🌐 Coverage report

Name          Metrics % (covered/total)   Diff
Packages      100.0%  (6/6)               💚
Files         100.0%  (6/6)               💚
Classes       100.0%  (6/6)               💚
Methods       87.826% (101/115)           👎 -1.88
Lines         95.124% (1502/1579)         👎 -4.01
Conditionals  100.0%  (0/0)               💚

@efd6 (Contributor, Author) commented Oct 26, 2023

I am torn about rename vs. copy; as you can see from the original issue, I was using renames. The code changes are much simpler with renames, since we don't need to take so much care to ensure that everything remains in the final flattened field, and we also have fewer (~0) futile allocations. The reason I went with copies here (I intended to note all this, but forgot) is that the behaviour is simpler to document: the copies case is simple and clear, while the renames case demands a lot of understanding from the user (including reading and understanding the ingest pipeline). I'd like to fully understand the performance implications of copy vs. rename here, but memory behaviour in Java is a dark art that I know little about.

@P1llus (Member) commented Oct 27, 2023

After some further discussion I think it's fine to keep it this way. I just felt it was a bit too close to event.original; the only reason we have this field is so we do not have to run the JSON processor on event.original again, and that felt like a waste. It also puts pressure on the end user to remove this large duplicate object at the end of the pipeline to avoid incurring storage cost.

@P1llus (Member) commented Oct 27, 2023

I would like to wait for @andrewkroh's opinion on this as well.

We could also just not implement it and ask users to run the JSON processor on event.original themselves. If performance is the concern, I would like to know more about the statistics used to determine that the JSON processor is actually heavier than passing this large duplicate object through the pipeline.

@efd6 efd6 marked this pull request as ready for review October 29, 2023 23:11
@efd6 efd6 requested review from a team as code owners October 29, 2023 23:11
@elasticmachine

Pinging @elastic/security-external-integrations (Team:Security-External Integrations)

@andrewkroh (Member) commented

If we copy, then downstream consumers of the field are less affected when we map a new field (i.e. the flattened field's value is more stable). The downside is increased storage cost for keep_json. You would probably even want to disable preserve_original_event if you used keep_json, since the two are close to duplicative.

If we rename, then the value in the flat field is less stable and less reflective of the original data which probably makes maintaining any follow-on custom pipelines more difficult.

As a custom pipeline author I would probably ask for "copy" because it gives me the greatest control.

@efd6 (Contributor, Author) commented Oct 30, 2023

I'd be reluctant to automatically disable preserve_original_event (users may be retaining it only temporarily, to do some processing, and will delete it at the end of the @custom pipeline), but I do think it's worth documenting that recommendation in the keep_json option. I agree that the copy is least surprising, and I will add docs noting the potential costs of retaining the data.
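The workflow described here (keep the preserved JSON only long enough for follow-on processing, then drop it to avoid the storage cost) might look like the following hypothetical `logs-gcp.audit@custom` pipeline. The `gcp.audit.flattened` field name follows the linked proposal, and `labels.method_name` is purely illustrative:

```json
{
  "processors": [
    {
      "set": {
        "field": "labels.method_name",
        "copy_from": "gcp.audit.flattened.protoPayload.methodName",
        "ignore_empty_value": true,
        "description": "Example custom processing against the preserved JSON"
      }
    },
    {
      "remove": {
        "field": ["event.original", "gcp.audit.flattened"],
        "ignore_missing": true,
        "description": "Drop the near-duplicate heavy fields once processing is done"
      }
    }
  ]
}
```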

@efd6 efd6 merged commit af38b97 into elastic:main Nov 12, 2023
4 checks passed
@elasticmachine

Package gcp - 2.31.0 containing this change is available at https://epr.elastic.co/search?package=gcp

Labels: enhancement (New feature or request), Integration:GCP

Successfully merging this pull request may close these issues.

proposal: gcp: add a gcp.<datastream>.flattened field for each of the datastreams
4 participants