[C++] Add support for multi-column sort on Table #24398
Comments
Wes McKinney / @wesm: |
Scott Wilson: |
Wes McKinney / @wesm: https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing |
Scott Wilson: |
Wes McKinney / @wesm: |
Scott Wilson: I hope you and yours are staying healthy in this strange new world! I've taken a stab at creating a DataFrame like cover for arrow::Table. My Thanks, Scott **** Code, also included as attachment #include #include <arrow/api.h> #include <boost/iterator/iterator_facade.hpp> using namespace std; // SBW 2020.04.15 For ArrayCoverRaw::iterator, we can simply use the the // STL container-like cover for arrow::Array. ArrayCoverRaw(std::shared_ptr& array) : _array(array) {} size_type size() const { return _array->length(); } // Should non-const versions fail if Array is immutable? // We could return std::optional to encapsulate IsNull() info, but this protected: // TODO: Add ArrayCoverString and iterators, perhaps others. // Use template on RefType so we can create iterator and const_iterator by explicit ChunkedArrayIterator(std::shared_ptrarrow::ChunkedArray ch_arr = bool IsNull() const private:
{ void decrement()
void advance(difference_type n) difference_type distance_to(ChunkedArrayIterator<CType, RefType> const& // Helper std::shared_ptrarrow::ChunkedArray _ch_arr; // This implementation is a subclass for Arrays that use GetView(i), explicit ChunkedArrayIteratorIndexImpl(std::shared_ptrarrow::ChunkedArray bool IsNull() const protected:
{ void decrement()
void advance(difference_type n) difference_type distance_to(ChunkedArrayIteratorIndexImpl const& // Helper std::shared_ptrarrow::ChunkedArray _ch_arr; // SBW 2020.04.23 for EVAL2() macro, even though code not called, need lhs explicit ChunkedArrayIterator(std::shared_ptrarrow::ChunkedArray ch_arr = // Cache value to avoid returning pointer to temp. private: template<> explicit ChunkedArrayIterator(std::shared_ptrarrow::ChunkedArray ch_arr = // Cache value to avoid returning pointer to temp. private: // STL container-like cover for arrow::ChunkedArray. ChunkedArrayCover(std::shared_ptr& array) : _array(array) {} size_type size() const { return _array->length(); } // Should non-const versions fail if Array is immutable? protected: #if 0 // SBW 2020.04.23 No longer needed no that we're using ContRefString ChunkedArrayCover(std::shared_ptr& array) : _array(array) {} size_type size() const { return _array->length(); } // Should non-const versions fail if Array is immutable? protected: struct TestFrame auto find_column(const char* name) { return _table->GetColumnByName(name); } template typename ChunkedArrayCover::iterator std::shared_ptrarrow::Table _table; // Generalizing std::transform() to take any number of input iterators. // Use BOOST_PP_VARIADIC_TO_SEQ(VA_ARGS) #define DF_INPUT_ITER(r, data, i, elem) #define LAMBDA_INPUT(r, data, i, elem) // Variable args are input 2-tuples (type, name). int main(int argc, char *argv[]) auto pool = default_memory_pool(); auto r_table_reader = csv::TableReader::Make(pool, r_input.ValueOrDie(), PrettyPrintOptions options{0}; // Test covers and iterators. default: // 1 cout << is_number_type<CTypeTraits::ArrowType>::value << endl; // Testing code, to check templates. if (true) return 1; – |
Wes McKinney / @wesm: No, to be honest, from a glance it's a different direction from what I've been thinking. My thought is for the data frame internally to be a mix of yet-to-be-scanned Datasets (e.g. from CSV or Parquet files), manifest (materialized in-memory) chunked arrays, and unevaluated expressions. Analytics requests would be translated into physical query plans to be executed by the to-be-developed query engine. I haven't been able to give this my full attention since writing the design docs last year, but I intend to spend a large fraction of my time on it for the rest of the year. The reason for wanting to push data frame operations into a query engine is to get around the memory use and performance problems associated with "eager evaluation" data frame libraries like pandas (for example, a join in pandas materializes the entire joined data frame in memory). There are similar issues around sorting, particularly when you know what you want to do with the sorted data: e.g. a sort followed by a slice can be executed as a Top-K operation with substantially less memory use. That said, I know a number of people have expressed interest in having STL interface layers in Arrow to the data structures. This would be a valuable thing to contribute to the project. It's not mutually exclusive with the stuff I wrote above but wanted to give some idea |
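To make the sort-then-slice point concrete, here is a minimal sketch in plain C++ (standard library only, no Arrow types). `SmallestK` is a hypothetical helper name, not an Arrow API, and the column is assumed to be a simple `std::vector<double>` with no nulls. It keeps a bounded max-heap of at most `k` values while scanning, so the extra memory is proportional to `k` rather than to the number of rows.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical helper (not an Arrow API): returns the k smallest values in
// ascending order, which matches "sort ascending, then slice the first k rows".
std::vector<double> SmallestK(const std::vector<double>& values, std::size_t k) {
  std::priority_queue<double> heap;  // max-heap: top() is the largest value kept
  for (double v : values) {
    if (heap.size() < k) {
      heap.push(v);
    } else if (k > 0 && v < heap.top()) {
      heap.pop();  // evict the current largest to make room for a smaller value
      heap.push(v);
    }
  }
  assert(heap.size() <= k);

  // Drain the heap from largest to smallest, filling the output back to front.
  std::vector<double> out(heap.size());
  for (std::size_t i = out.size(); i-- > 0;) {
    out[i] = heap.top();
    heap.pop();
  }
  return out;
}
```

A full sort followed by a slice would produce the same rows, but it has to order every row before discarding most of them.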
Scott Wilson: On Thu, Apr 23, 2020 at 10:47 AM Wes McKinney (Jira) jira@apache.org – |
Scott Wilson: I hope you and yours are doing well in this strange time. I'm just writing to thank you for all the work you did on Arrow and the The only kluge I put into place has to do with support for null values. I I've attached the DataFrame header in case it's of interest. Thanks again, Scott – |
Wes McKinney / @wesm: |
Scott Wilson: – |
Wes McKinney / @wesm: For what it's worth, people have a lot of different expectations when they hear "data frame", and realistically we may end up with different kinds of data frame interfaces. From what I can see in the code, this is different from what I've proposed in https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h, which I have not been able to do any development on personally. Unfortunately, I'm not able to invest time in this project in the near term. |
Scott Wilson: – |
Antoine Pitrou / @pitrou: |
I'm just coming up to speed with Arrow and am noticing a dearth of examples ... maybe I can help here.
I'd like to implement multi-column sorting for Tables and just want to ensure that I'm not duplicating existing work or proposing a bad design.
My thought was to create a Table-specific version of SortToIndices() where you can specify the columns and sort order.
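As a rough illustration of that idea, the sketch below sorts row indices by several columns using only the standard library. Everything here is an assumption made for the example: `SortKey`, `Order`, and `MultiColumnSortToIndices` are hypothetical names, the columns are plain `std::vector<std::int64_t>` rather than Arrow arrays, and nulls are not handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

enum class Order { Ascending, Descending };

// Hypothetical stand-in for one sort key: a column of values plus a direction.
struct SortKey {
  const std::vector<std::int64_t>* column;
  Order order;
};

// Returns row indices ordered by the keys, with earlier keys taking priority.
std::vector<std::size_t> MultiColumnSortToIndices(const std::vector<SortKey>& keys,
                                                  std::size_t num_rows) {
  for (const auto& key : keys) assert(key.column->size() == num_rows);

  std::vector<std::size_t> indices(num_rows);
  std::iota(indices.begin(), indices.end(), std::size_t{0});

  std::stable_sort(indices.begin(), indices.end(),
                   [&](std::size_t a, std::size_t b) {
                     for (const auto& key : keys) {
                       const std::int64_t lhs = (*key.column)[a];
                       const std::int64_t rhs = (*key.column)[b];
                       if (lhs == rhs) continue;  // tie: fall through to next key
                       return key.order == Order::Ascending ? lhs < rhs : lhs > rhs;
                     }
                     return false;  // equal on every key; original order is kept
                   });
  return indices;
}
```

The input columns are never modified; only the index vector is permuted, which is what allows the remapping step described next in the issue.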
Then I'd create Array "views" that use the indices to remap from the original Array values to the values in sorted order. (The original data is not sorted, but could be as a second step.) I noticed that some of the array list variants keep offsets, but I didn't see anything that supports remapping through a list of indices; this may just be my oversight.
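The view idea can be sketched as follows, again with plain standard-library containers instead of Arrow arrays. `IndexedView` is a hypothetical class invented for this example; it assumes the underlying values outlive the view and that every index is in range.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical read-only view: element i of the view is values[indices[i]].
template <typename T>
class IndexedView {
 public:
  IndexedView(const std::vector<T>& values, std::vector<std::size_t> indices)
      : values_(values), indices_(std::move(indices)) {
    for (std::size_t i : indices_) assert(i < values_.size());
  }

  std::size_t size() const { return indices_.size(); }

  // Remapped access: no data is moved, only the lookup is redirected.
  const T& operator[](std::size_t i) const { return values_[indices_[i]]; }

  // Optional second step: copy the values out in view order.
  std::vector<T> Materialize() const {
    std::vector<T> out;
    out.reserve(size());
    for (std::size_t i = 0; i < size(); ++i) out.push_back((*this)[i]);
    return out;
  }

 private:
  const std::vector<T>& values_;  // not owned; must outlive the view
  std::vector<std::size_t> indices_;
};
```

Reading through the view costs one extra indirection per element, while `Materialize()` trades that for a one-time copy into sorted order.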
Thanks in advance, Scott
Reporter: Scott Wilson
Assignee: Kouhei Sutou / @kou
Related issues:
Original Issue Attachments:
PRs and other links:
Note: This issue was originally created as ARROW-8199. Please see the migration documentation for further details.