Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement azure blob storage action #13069

Merged
merged 9 commits into from
Jun 3, 2024

Conversation

thalesmg
Copy link
Contributor

@thalesmg thalesmg commented May 17, 2024

Fixes https://emqx.atlassian.net/browse/EMQX-12280

Release version: e5.8

Summary

PR Checklist

Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:

  • Added tests for the changes
  • Added property-based tests for code which performs user input validation
  • Changed lines covered in coverage report
  • Change log has been added to changes/(ce|ee)/(feat|perf|fix|breaking)-<PR-id>.en.md files
  • For internal contributor: there is a jira ticket to track this change
  • Created PR to emqx-docs if documentation update is required, or link to a follow-up jira ticket
  • Schema changes are backward compatible

Checklist for CI (.github/workflows) changes

  • If changed package build workflow, pass this action (manual trigger)
  • Change log has been added to changes/ dir for user-facing artifacts update

@thalesmg thalesmg force-pushed the azure-blob-storage-action-m-20240508 branch 16 times, most recently from f5c3cc2 to acb4c32 Compare May 22, 2024 20:57
@thalesmg thalesmg marked this pull request as ready for review May 23, 2024 12:45
@thalesmg thalesmg requested review from JimMoen and a team as code owners May 23, 2024 12:45
qzhuyan
qzhuyan previously approved these changes May 23, 2024
Copy link
Contributor

@keynslug keynslug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

desc => ?DESC("aggregation_max_records")
}
)},
{max_block_size,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting it here feels odd, because AFAICS it doesn't affect the aggregation itself, only the upload process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Moved to action parameters. ✔️

Comment on lines 407 to 428
case BufferSize > MaxBlockSize of
true ->
{IOData, NewBuffer} = take_chunk(Buffer, MaxBlockSize),
?tp(azure_blob_storage_will_write_chunk, #{}),
do_process_write(IOData, NewBuffer, TransferState0);
false ->
NewBuffer = [],
do_process_write(Buffer, NewBuffer, TransferState0)
end.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently having this append / write split backfires slightly. 🫠

Honestly both of those things feel suboptimal:

  1. Call Azure API on every append, which is usually on the scale of 1-10s of KiBs assuming moderate event rate.
  2. Cut >4GiB large buffer somewhere close to its end, which will likely cause 2-3x of that in heap memory consumption. This most likely will never happen, but still.

I wonder what if:

  1. Introduce min_block_size config parameter, with something like 10 MiB as the default.
  2. During append: append as usual, but put the buffer under some next_block map entry if its size is larger that min_block_size (i.e. minimize latency impact / overhead / cost (?) of too frequent API requests).
  3. However, if it's to become larger than max_block_size does not append but instead put existing buffer under next_block key (i.e. avoid cutting iolists).
  4. During write, write next_block if it's defined.

On more thing: currently, having a single append larger than 4 GiB is impossible I think, because of this. Admittedly, it's sort of an implicitly defined limit, but I'd argue still could be relied upon to get rid of complexity of "cut off exactly this much of this iolist()".

Comment on lines 467 to 486
lookup_buffer_var([<<"datetime">>, Format], #{since := Since}) ->
{ok, format_timestamp(Since, Format)};
lookup_buffer_var([<<"datetime_until">>, Format], #{until := Until}) ->
{ok, format_timestamp(Until, Format)};
lookup_buffer_var([<<"sequence">>], #{seq := Seq}) ->
{ok, Seq};
lookup_buffer_var([<<"node">>], #{}) ->
{ok, mk_fs_safe_string(atom_to_binary(erlang:node()))};
lookup_buffer_var(_Binding, _Context) ->
{error, undefined}.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a good time to introduce something like a small emqx_connector_aggreg_buffer_ctx module, that implements emqx_template behaviour for #buffer{}s, and put this stuff (except for <<"node">>) there? Though we'd have to keep mk_fs_safe_string/1 on this level. Not the prettiest solution, but could help minimize code duplication and the need for that weird buffer_map() type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✔️

@thalesmg thalesmg force-pushed the azure-blob-storage-action-m-20240508 branch from acb4c32 to 35c16ff Compare May 27, 2024 17:59
@thalesmg thalesmg force-pushed the azure-blob-storage-action-m-20240508 branch 2 times, most recently from 48c4e20 to c8ae506 Compare May 28, 2024 21:29
Copy link
Contributor

@keynslug keynslug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! A couple of nits.

];
fields(common_action_parameters) ->
[
{max_block_size,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Is there anything this parameter may affect for direct uploads? The idea is to prematurely drop the event on the floor if it's for some reason >4GiB in size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, currently we don't check it, but this limit indeed is Azure's maximum upload per block.
(More precisely, for a single "Put Block Blob" call, it's possible to send 5000 MiB, and each "Put Blob" is 4000 MiB)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check and test here: a92eb3c

@thalesmg thalesmg requested review from keynslug May 29, 2024 20:44
container() ->
{container,
hoconsc:mk(
%% TODO: Support selectors once there are more than one container.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment was just transported from its previous location.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e. also the product of a refactoring

@thalesmg thalesmg force-pushed the azure-blob-storage-action-m-20240508 branch from a92eb3c to efa4432 Compare May 31, 2024 14:10
@thalesmg thalesmg merged commit c554754 into emqx:master Jun 3, 2024
175 checks passed
@thalesmg thalesmg deleted the azure-blob-storage-action-m-20240508 branch June 3, 2024 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants