gcp: add keep_json option for audit, dns, firewall and vpcflow datastreams #8299
Conversation
I think the change itself is great! I like these approaches and fully support them. However, I would have liked the renames to stay as renames rather than using set: the user is usually after the fields we are deleting, not the ones we have already renamed. This also makes the flattened object much smaller, and we do not need to duplicate data. This reopens one of the discussions the team has had a few times in the past, and something I was always a bit against.
I like the flattened approach here! I believe we should discuss it further at a separate time and change how we do our unpacking of JSON in general (default to a flattened destination and cherry-pick renames out of the flattened field, leaving any missed or newly added fields in flattened?). Still, I would really like to push for using renames rather than sets wherever possible!
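For illustration, the two approaches under discussion can be sketched as Elasticsearch ingest pipeline processors (the concrete field names here are hypothetical, not taken from the PR):

```yaml
# rename: moves the value out of the unpacked object; no duplication,
# but the flattened field no longer reflects the original document
- rename:
    field: json.logName
    target_field: gcp.audit.log_name
    ignore_missing: true

# set with copy_from: duplicates the value, so it appears both in the
# mapped field and in the flattened object
- set:
    field: gcp.audit.log_name
    copy_from: json.logName
    ignore_empty_value: true
```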
I am torn about rename v. copy; as you can see from the original issue, I was using renames. The code changes are much simpler with renames, since we don't need to take so much care to ensure that everything remains in the final flattened field, and we also have fewer (~0) futile allocations. The reason I went with copies here (I did intend to note all this, but forgot) is the simplicity of documenting the behaviour: the copies case is simple and clear to document, while the renames case requires a fair amount of understanding from the user (including reading and understanding the ingest pipeline). I'd like to fully understand the perf implications of copy v. rename here, but memory behaviour in Java is a dark art that I know little about.
After some further discussion I think it's fine to keep it this way. I just felt it was a bit too close to event.original, and the only reason we have this is so we do not have to run the JSON processor on event.original again; that felt like a waste. It also puts pressure on the end user to remove this large duplicate object at the end to avoid incurring storage costs.
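If the duplicate object does prove too costly, an end user could drop it at the end of a custom pipeline; a minimal sketch (the field name is assumed for the audit datastream):

```yaml
- remove:
    field: gcp.audit.flattened
    ignore_missing: true
```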
I would like to wait for @andrewkroh's opinion on this as well. We could also just not implement it and ask users to run the JSON processor on event.original. If performance is the concern, I would like to know a bit more about the statistics used to determine that the JSON processor is actually heavier than passing this large duplicate object through the pipeline.
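For reference, the alternative described here (reparsing event.original in a user-managed custom pipeline instead of shipping the duplicate object) might look like this sketch; the target field name is an assumption:

```yaml
- json:
    field: event.original
    target_field: gcp.audit.flattened
    ignore_failure: true
```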
Pinging @elastic/security-external-integrations (Team:Security-External Integrations)
If we copy, then downstream consumers of the field are less affected when we map a new field (i.e. the flattened field's value is more stable). The downside is the increased storage cost of keep_json; you would probably even want it disabled by default. If we rename, then the value in the flat field is less stable and less reflective of the original data, which probably makes maintaining any follow-on custom pipelines more difficult. As a custom pipeline author I would probably ask for "copy" because it gives me the greatest control.
I'd be reluctant to automatically disable it, though.
Package gcp - 2.31.0 containing this change is available at https://epr.elastic.co/search?package=gcp |
Proposed commit message
Checklist
- I have added an entry to my package's changelog.yml file.
Author's Checklist
How to test this PR locally
Related issues
- gcp.<datastream>.flattened field for each of the datastreams #8184
Screenshots