Merged
6 changes: 3 additions & 3 deletions _posts/2022-11-07-multi-column-sorts-in-arrow-rust-part-1.md
@@ -30,7 +30,7 @@ Sorting is one of the most fundamental operations in modern databases and other

Sorting is also one of the most well studied topics in computer science. The classic survey paper for databases is [Implementing Sorting in Database Systems](https://dl.acm.org/doi/10.1145/1132960.1132964) by Goetz Graefe which provides a thorough academic treatment and is still very applicable today. However, it may not be obvious how to apply the wisdom and advanced techniques described in that paper to modern systems. In addition, the excellent [DuckDB blog on sorting](https://duckdb.org/2021/08/27/external-sorting.html) highlights many sorting techniques, and mentions a comparable row format, but it does not explain how to efficiently sort variable length strings or dictionary encoded data.

-In this series we explain in detail the new [row format](https://docs.rs/arrow/25.0.0/arrow/row/index.html) in the [Rust implementation](https://github.com/apache/arrow-rs) of [Apache Arrow](https://arrow.apache.org/), and how we used to make sorting more than [3x](https://github.com/apache/arrow-rs/pull/2929) faster than an alternate comparator based approach. The benefits are especially pronounced for strings, dictionary encoded data, and sorts with large numbers of columns.
+In this series we explain in detail the new [row format](https://docs.rs/arrow/27.0.0/arrow/row/index.html) in the [Rust implementation](https://github.com/apache/arrow-rs) of [Apache Arrow](https://arrow.apache.org/), and how we used to make sorting more than [3x](https://github.com/apache/arrow-rs/pull/2929) faster than an alternate comparator based approach. The benefits are especially pronounced for strings, dictionary encoded data, and sorts with large numbers of columns.


## Multicolumn / Lexicographical Sort Problem
@@ -227,6 +227,6 @@ You can find more information on how to leverage such representation in the "Bin

## Next up: Row Format

-This post has introduced the concept and challenges of multi column sorting, and shown why a comparable byte array representation, such as the [row format](https://docs.rs/arrow/25.0.0/arrow/row/index.html) introduced to the [Rust implementation](https://github.com/apache/arrow-rs) of [Apache Arrow](https://arrow.apache.org/), is such a compelling primitive.
+This post has introduced the concept and challenges of multi column sorting, and shown why a comparable byte array representation, such as the [row format](https://docs.rs/arrow/27.0.0/arrow/row/index.html) introduced to the [Rust implementation](https://github.com/apache/arrow-rs) of [Apache Arrow](https://arrow.apache.org/), is such a compelling primitive.

-In [the next post]({% post_url 2022-11-07-multi-column-sorts-in-arrow-rust-part-2 %}) we explain how this encoding works, but if you just want to use it, check out the [docs](https://docs.rs/arrow/latest/arrow/row/index.html) for getting started, and report any issues on our [bugtracker](https://github.com/apache/arrow-rs/issues). As always, the [Arrow community](https://github.com/apache/arrow-rs#arrow-rust-community) very much looks forward to seeing what you build with it!
+In [the next post]({% post_url 2022-11-07-multi-column-sorts-in-arrow-rust-part-2 %}) we explain how this encoding works, but if you just want to use it, check out the [docs](https://docs.rs/arrow/27.0.0/arrow/row/index.html) for getting started, and report any issues on our [bugtracker](https://github.com/apache/arrow-rs/issues). As always, the [Arrow community](https://github.com/apache/arrow-rs#arrow-rust-community) very much looks forward to seeing what you build with it!
4 changes: 2 additions & 2 deletions _posts/2022-11-07-multi-column-sorts-in-arrow-rust-part-2.md
@@ -26,7 +26,7 @@ limitations under the License.

## Introduction

-In [Part 1]({% post_url 2022-11-07-multi-column-sorts-in-arrow-rust-part-1 %}) of this post, we described the problem of Multi-Column Sorting and the challenges of implementing it efficiently. This second post explains how the new [row format](https://docs.rs/arrow/25.0.0/arrow/row/index.html) in the [Rust implementation](https://github.com/apache/arrow-rs) of [Apache Arrow](https://arrow.apache.org/) works and is constructed.
+In [Part 1]({% post_url 2022-11-07-multi-column-sorts-in-arrow-rust-part-1 %}) of this post, we described the problem of Multi-Column Sorting and the challenges of implementing it efficiently. This second post explains how the new [row format](https://docs.rs/arrow/27.0.0/arrow/row/index.html) in the [Rust implementation](https://github.com/apache/arrow-rs) of [Apache Arrow](https://arrow.apache.org/) works and is constructed.


## Row Format
@@ -236,7 +236,7 @@ Similarly, supporting SQL compatible sorting also requires a format that can spe

## Conclusion

-Hopefully these two articles have given you a flavor of what is possible with a comparable row format and how it works. Feel free to check out the [docs](https://docs.rs/arrow/latest/arrow/row/index.html) for instructions on getting started, and report any issues on our [bugtracker](https://github.com/apache/arrow-rs/issues).
+Hopefully these two articles have given you a flavor of what is possible with a comparable row format and how it works. Feel free to check out the [docs](https://docs.rs/arrow/27.0.0/arrow/row/index.html) for instructions on getting started, and report any issues on our [bugtracker](https://github.com/apache/arrow-rs/issues).

Using this format for lexicographic sorting is more than [3x](https://github.com/apache/arrow-rs/pull/2929) faster than the comparator based approach, with the benefits especially pronounced for strings, dictionaries and sorts with large numbers of columns.
