[Website] A journey with Apache Arrow - Part 1 - POST #340

Merged 12 commits on Apr 11, 2023

@lquerel lquerel commented Apr 3, 2023

This PR is a markdown version of the article proposed on the mailing list (see https://lists.apache.org/thread/jxpypxwjh4jhpk2xvj0z3woy7yr0z0sk).

The author field currently contains my full name rather than my apacheId because I was unable to get Jekyll to pick up the change I made in contributors.yml.

alamb commented Apr 3, 2023

Thanks @lquerel -- this looks great. I hope to review this PR sometime this week

@@ -0,0 +1,382 @@
---
layout: post
title: "A journey with Apache Arrow (part 1)"

How about adding something related to OpenTelemetry to the title?
https://arrow.apache.org/ is the Apache Arrow web site, so all of its content is related to Apache Arrow. In that context, the current title may be too generic. Adding some context to the title may help readers.

I think this is a great idea -- maybe something like "Storing tracing, metrics and logs efficiently with OpenTelemetry and Apache Arrow" 🤔

@alamb alamb left a comment

Thank you @lquerel -- I read this entire article carefully and I found it well done and informative and a great primer on some of the aspects to consider when mapping data to the Arrow model

I left some suggestions that I think would help strengthen the piece, but I don't think any of them are necessary to publish this article.

Unless there are objections, I plan to merge this PR (which will publish the article) next week (April 11, 2023), so that anyone else who would like to review it, or to request more time to do so, has sufficient time.

Again, thank you very much

@alamb alamb left a comment

Looking very nice 👌

@@ -24,7 +24,7 @@ limitations under the License.
{% endcomment %}
-->

- Apache Arrow is a technology widely adopted in big data, analytics, and machine learning applications. This article discusses our journey with Arrow, specifically its application to telemetry, and the challenges we encountered while optimizing the OpenTelemetry protocol to significantly reduce bandwidth costs. The promising results we achieved inspired us to share our insights. This article specifically focuses on transforming data from an XYZ format into an efficient Arrow representation that optimizes both compression ratio, transport, and data processing. Our benchmarks thus far have shown promising results, with compression ratio improvements ranging from 1.5x to 6x, depending on the data type (metrics, logs, traces) and distribution. The approaches presented for addressing these challenges may be applicable to other Arrow domains as well. This article serves as the first installment in a two-part series.
+ Apache Arrow is a technology widely adopted in big data, analytics, and machine learning applications. In this article, we share F5's experience with Arrow, specifically its application to telemetry, and the challenges we encountered while optimizing the OpenTelemetry protocol to significantly reduce bandwidth costs. The promising results we achieved inspired us to share our insights. This article specifically focuses on transforming relatively complex data structure from various formats into an efficient Arrow representation that optimizes both compression ratio, transport, and data processing. We also explore the trade-offs between different mapping and normalization strategies, as well as the nuances of streaming and batch communication using Arrow and Arrow Flight. Our benchmarks thus far have shown promising results, with compression ratio improvements ranging from 1.5x to 6x, depending on the data type (metrics, logs, traces) and distribution. The approaches presented for addressing these challenges may be applicable to other Arrow domains as well. This article serves as the first installment in a two-part series.

👍

alamb commented Apr 10, 2023

I plan to update the date on this post and publish it shortly

Edit: I see the date is set to 4/11 -- thus I will plan to publish it tomorrow

lquerel commented Apr 10, 2023

> I see the date is set to 4/11 -- thus I will plan to publish it tomorrow

Thanks.
FYI, I have updated the performance section to reflect my latest tests.

@alamb alamb left a comment

Looks good to me -- thanks again @lquerel

@alamb alamb merged commit fd7fc44 into apache:main Apr 11, 2023

alamb commented Apr 11, 2023

lquerel commented Apr 11, 2023 via email
