[C++] Add support for multi-column sort on Table #24398

Closed
asfimport opened this issue Mar 24, 2020 · 14 comments
asfimport commented Mar 24, 2020

I'm just coming up to speed with Arrow and am noticing a dearth of examples ... maybe I can help here.

I'd like to implement multi-column sorting for Tables and just want to ensure that I'm not duplicating existing work or proposing a bad design.

My thought was to create a Table-specific version of SortToIndices() where you can specify the columns and sort order.

Then I'd create Array "views" that use the indices to remap from the original Array values to the values in sorted order. (The original data is not sorted, but could be as a second step.) I noticed some of the array list variants keep offsets, but I didn't see anything that supports remapping per a list of indices; that may just be my oversight, though.
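A minimal sketch of the idea, assuming a single-chunk table, double-typed key columns, and ignoring nulls; MultiColumnSortIndices is an illustrative name, not an existing Arrow API:

#include <algorithm>
#include <cstdint>
#include <memory>
#include <numeric>
#include <utility>
#include <vector>

#include <arrow/api.h>

// Illustrative only: produce a permutation of row indices ordered by the given
// (column index, ascending?) keys, for a table whose key columns are single-chunk
// DoubleArrays. Nulls are ignored to keep the sketch short.
std::vector<int64_t> MultiColumnSortIndices(
    const arrow::Table& table, const std::vector<std::pair<int, bool>>& keys) {
  std::vector<int64_t> indices(table.num_rows());
  std::iota(indices.begin(), indices.end(), 0);
  std::stable_sort(indices.begin(), indices.end(), [&](int64_t lhs, int64_t rhs) {
    for (const auto& key : keys) {
      auto values = std::static_pointer_cast<arrow::DoubleArray>(
          table.column(key.first)->chunk(0));
      const double a = values->Value(lhs);
      const double b = values->Value(rhs);
      if (a == b) continue;              // tie: fall through to the next key
      return key.second ? (a < b) : (b < a);
    }
    return false;                        // all keys equal
  });
  return indices;
}

The resulting permutation could then drive either a "view" over the original arrays or a second materialization pass that takes the values in sorted order.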

Thanks in advance, Scott

Reporter: Scott Wilson
Assignee: Kouhei Sutou / @kou

Related issues:

Original Issue Attachments:

PRs and other links:

Note: This issue was originally created as ARROW-8199. Please see the migration documentation for further details.


Wes McKinney / @wesm:
Sorting Tables is a bit complicated on account of chunked arrays. There isn't a multi-column sort implemented but one will eventually need to be implemented. It might be worth soliciting ideas about sorting with chunked data in general on the mailing list (we will need to employ the sorting strategies used by analytic databases which operate on chunks of tables at a time – there are various open source DBMSs that we can use for inspiration).
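To illustrate the chunk-at-a-time idea for a single double column (a sketch only; it ignores nulls, and the helper names are made up): sort the row indices of each chunk independently, then k-way merge the sorted runs by comparing the values they point at.

#include <algorithm>
#include <cstdint>
#include <memory>
#include <numeric>
#include <queue>
#include <vector>

#include <arrow/api.h>

// Hypothetical sketch: a global sort order for a chunked double column, expressed as
// (chunk, offset) pairs, built by sorting each chunk and then merging the sorted runs.
struct ChunkedIndex { int chunk; int64_t offset; };

std::vector<ChunkedIndex> SortChunkedDouble(const arrow::ChunkedArray& col) {
  // 1) Sort the row indices of each chunk independently.
  std::vector<std::vector<int64_t>> runs(col.num_chunks());
  for (int c = 0; c < col.num_chunks(); ++c) {
    auto arr = std::static_pointer_cast<arrow::DoubleArray>(col.chunk(c));
    runs[c].resize(arr->length());
    std::iota(runs[c].begin(), runs[c].end(), 0);
    std::sort(runs[c].begin(), runs[c].end(),
              [&](int64_t a, int64_t b) { return arr->Value(a) < arr->Value(b); });
  }
  // 2) K-way merge of the per-chunk sorted runs using a min-heap.
  auto value = [&](int c, size_t pos) {
    auto arr = std::static_pointer_cast<arrow::DoubleArray>(col.chunk(c));
    return arr->Value(runs[c][pos]);
  };
  using HeapItem = std::pair<int, size_t>;  // (chunk, position within its run)
  auto cmp = [&](const HeapItem& a, const HeapItem& b) {
    return value(a.first, a.second) > value(b.first, b.second);
  };
  std::priority_queue<HeapItem, std::vector<HeapItem>, decltype(cmp)> heap(cmp);
  for (int c = 0; c < col.num_chunks(); ++c)
    if (!runs[c].empty()) heap.push({c, 0});
  std::vector<ChunkedIndex> order;
  order.reserve(col.length());
  while (!heap.empty()) {
    auto [c, pos] = heap.top(); heap.pop();
    order.push_back({c, runs[c][pos]});
    if (pos + 1 < runs[c].size()) heap.push({c, pos + 1});
  }
  return order;
}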


Scott Wilson:
Thanks Wes. I'm trying to replace portions of my Python/pandas ML pipeline with Arrow C++. As such, I don't lose anything by not initially being able to support chunked arrays. Does my approach seem reasonable, assuming I start with non-chunked arrays? Any suggestions?


Wes McKinney / @wesm:
I don't have particular suggestions. This functionality will eventually find a home in the future "C++ data frame API" that has been discussed for addition to the project.

https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing


Scott Wilson:
Ah, that's already helpful. I thought you meant for Table to become the DataFrame replacement, but now I see that's not true. This makes sense given the levels of abstraction I've seen so far in Arrow. Thanks!


Wes McKinney / @wesm:
Right, Table is just a data structure. We are definitely trying to keep the in-memory data structures simple and focused on providing access to the data, so computation will live in different APIs that act on the data structures.


Scott Wilson:
Hi Wes,

I hope you and yours are staying healthy in this strange new world!

I've taken a stab at creating a DataFrame-like cover for arrow::Table. My first milestone was to see if I could come up with a df.eval()-like representation for single-line transforms – see the EVAL2 macro. Attached is my code; I'm not quite sure where, if anywhere, I should post it to get your thoughts, so I'm sending this email. (I posted an earlier version on Jira Arrow-602.) Mainly I'd like to know if this looks like the direction you're thinking for arrow::DataFrame?

Thanks, Scott

**** Code, also included as attachment

// Standard headers (the original include names were stripped by the email formatting;
// these are the ones the code below needs).
#include <cassert>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <memory>
#include <optional>
#include <string>
#include <vector>

#include <arrow/api.h>
#include <arrow/filesystem/localfs.h>
#include <arrow/csv/api.h>
#include <arrow/result.h>
#include <arrow/builder.h>

#include <boost/iterator/iterator_facade.hpp>
#include <boost/range/iterator_range.hpp>
#include <boost/preprocessor.hpp>

using namespace std;
using namespace arrow;

// SBW 2020.04.15 For ArrayCoverRaw::iterator, we can simply use the pointer interface.
// Wes suggests returning std::optional<T>, but sizeof(double) < sizeof(std::optional<double>),
// and std::optional<T> is not a drop-in replacement for T, i.e. optional can't be used in an
// expression directly; you need optional.value().

// STL container-like cover for arrow::Array.
// Only works for Array types that support raw_values().
template <typename ArrType>
class ArrayCoverRaw
{
public:
  using T = typename ArrType::value_type;
  using pointer = T*;
  using const_pointer = const T*;
  using reference = T&;
  using const_reference = const T&;
  // Match size_type to Array offsets rather than using size_t and ptrdiff_t.
  using size_type = int64_t;
  using difference_type = int64_t;
  using iterator = pointer;
  using const_iterator = const_pointer;
  using reverse_iterator = pointer;
  using const_reverse_iterator = const_pointer;

  ArrayCoverRaw(std::shared_ptr<ArrType>& array) : _array(array) {}

  size_type size() const { return _array->length(); }

  // Should non-const versions fail if the Array is immutable?
  iterator begin() { return const_cast<pointer>(_array->raw_values()); }
  iterator end() { return const_cast<pointer>(_array->raw_values() + _array->length()); }
  reverse_iterator rbegin() { return const_cast<pointer>(_array->raw_values() + _array->length() - 1); }
  reverse_iterator rend() { return const_cast<pointer>(_array->raw_values() - 1); }
  const_iterator cbegin() const { return _array->raw_values(); }
  const_iterator cend() const { return _array->raw_values() + _array->length(); }
  const_reverse_iterator crbegin() const { return _array->raw_values() + _array->length() - 1; }
  const_reverse_iterator crend() const { return _array->raw_values() - 1; }

  // We could return std::optional<T> to encapsulate IsNull() info, but this would seem to
  // break the expected semantics.
  reference operator[](const difference_type off)
  {
    assert(_array->data()->buffers[1]->is_mutable());
    return *(const_cast<pointer>(_array->raw_values()) + off);
  }
  const_reference operator[](const difference_type off) const { return *(_array->raw_values() + off); }
  // ISSUE: is there an interface for setting IsNull() if the array is mutable?
  bool IsNull(difference_type off) const { return _array->IsNull(off); }

protected:
  std::shared_ptr<ArrType> _array;
};

// TODO: Add ArrayCoverString and iterators, perhaps others.

// Use a template on RefType so we can create iterator and const_iterator by changing the Value type.
// Use class specializations to support Arrays that don't have raw_values().
template <typename CType, typename RefType>
class ChunkedArrayIterator
  : public boost::iterator_facade<ChunkedArrayIterator<CType, RefType>,
                                  RefType, boost::random_access_traversal_tag>
{
public:
  using difference_type = int64_t;
  using T = CType;
  using ArrayType = typename CTypeTraits<CType>::ArrayType;
  using pointer = T*;

  explicit ChunkedArrayIterator(std::shared_ptr<arrow::ChunkedArray> ch_arr = nullptr,
                                difference_type pos = 0)
    : _ch_arr(ch_arr)
  {
    set_position(pos);
  }

  bool IsNull() const
  {
    auto arr = _ch_arr->chunk(_chunk_index);
    return arr->IsNull(_current - _first);
  }

private:
  friend class boost::iterator_core_access;

  bool equal(ChunkedArrayIterator<CType, RefType> const& other) const
  {
    return this->_position == other._position;
  }

  void increment()
  {
    _position++;
    // Need to move to the next chunk?
    if ((_current == _last) && ((_chunk_index + 1) < _ch_arr->num_chunks()))
    {
      _chunk_index++;
      auto arr = _ch_arr->chunk(_chunk_index);
      auto typed_arr = std::static_pointer_cast<ArrayType>(arr);
      _first = const_cast<pointer>(typed_arr->raw_values());
      _last = _first + arr->length() - 1;
      _current = _first;
    }
    else
    {
      _current++;
    }
  }

  void decrement()
  {
    _position--;
    // Need to move to the previous chunk?
    if ((_current == _first) && (_chunk_index > 0))
    {
      _chunk_index--;
      auto arr = _ch_arr->chunk(_chunk_index);
      auto typed_arr = std::static_pointer_cast<ArrayType>(arr);
      _first = const_cast<pointer>(typed_arr->raw_values());
      _last = _first + arr->length() - 1;
      _current = _last;
    }
    else
    {
      _current--;
    }
  }

  RefType& dereference() const { return *_current; }

  void advance(difference_type n)
  {
    _position += n;
    while (n > 0)
    {
      difference_type max_delta = _last - _current;
      if ((max_delta >= n) || ((_chunk_index + 1) == _ch_arr->num_chunks()))
      {
        _current += n;
        return;
      }
      // Move to the next chunk.
      n -= max_delta;
      _chunk_index++;
      auto arr = _ch_arr->chunk(_chunk_index);
      auto typed_arr = std::static_pointer_cast<ArrayType>(arr);
      _first = const_cast<pointer>(typed_arr->raw_values());
      _last = _first + arr->length() - 1;
      _current = _first;
    }
    while (n < 0)
    {
      difference_type max_delta = _first - _current;
      if ((max_delta <= n) || (_chunk_index == 0))
      {
        _current += n;
        return;
      }
      // Move to the previous chunk.
      n -= max_delta;
      _chunk_index--;
      assert(_chunk_index >= 0);
      auto arr = _ch_arr->chunk(_chunk_index);
      auto typed_arr = std::static_pointer_cast<ArrayType>(arr);
      _first = const_cast<pointer>(typed_arr->raw_values());
      _last = _first + arr->length() - 1;
      _current = _last;
    }
  }

  difference_type distance_to(ChunkedArrayIterator<CType, RefType> const& other) const
  {
    return other._position - this->_position;
  }

  // Helper
  void set_position(difference_type pos)
  {
    _position = pos;
    const int nchunks = _ch_arr->num_chunks();
    int64_t offset = 0;
    for (_chunk_index = 0; _chunk_index < nchunks; _chunk_index++)
    {
      auto arr = _ch_arr->chunk(_chunk_index);
      int64_t arr_rows = arr->length();
      if (((offset + arr_rows) > pos) || ((_chunk_index + 1) == nchunks))
      {
        auto typed_arr = std::static_pointer_cast<ArrayType>(arr);
        _first = const_cast<T*>(typed_arr->raw_values());
        _last = _first + arr_rows - 1;
        _current = _first + (pos - offset);
        return;
      }
      offset += arr_rows;
    }
    assert(false);
  }

  std::shared_ptr<arrow::ChunkedArray> _ch_arr;
  // Which Array we're looking at.
  int _chunk_index = 0;
  // Pointers into the current Array. Use first/last rather than begin/end for symmetry
  // when moving forward/backward.
  pointer _first = 0;
  pointer _current = 0;
  pointer _last = 0;
  // Cache the position across all chunks to support random access.
  difference_type _position = 0;
};

// This is the base implementation for Arrays that use GetView(i), GetString(i), etc.
// A concrete subclass only needs to implement dereference().
template <typename CType>
class ChunkedArrayIteratorIndexImpl
{
public:
  using difference_type = int64_t;
  using ArrayType = typename CTypeTraits<CType>::ArrayType;

  explicit ChunkedArrayIteratorIndexImpl(std::shared_ptr<arrow::ChunkedArray> ch_arr = nullptr,
                                         difference_type pos = 0)
    : _ch_arr(ch_arr)
  {
    set_position(pos);
  }

  bool IsNull() const
  {
    auto arr = _ch_arr->chunk(_chunk_index);
    return arr->IsNull(_current);
  }

protected:
  friend class boost::iterator_core_access;

  bool equal(ChunkedArrayIteratorIndexImpl<CType> const& other) const
  {
    return this->_position == other._position;
  }

  void increment()
  {
    _position++;
    // Need to move to the next chunk?
    if ((_current == _last) && ((_chunk_index + 1) < _ch_arr->num_chunks()))
    {
      _chunk_index++;
      auto arr = _ch_arr->chunk(_chunk_index);
      _typed_arr = std::static_pointer_cast<ArrayType>(arr);
      _last = arr->length() - 1;
      _current = 0;
    }
    else
    {
      _current++;
    }
  }

  void decrement()
  {
    _position--;
    // Need to move to the previous chunk?
    if ((_current == 0) && (_chunk_index > 0))
    {
      _chunk_index--;
      auto arr = _ch_arr->chunk(_chunk_index);
      _typed_arr = std::static_pointer_cast<ArrayType>(arr);
      _last = arr->length() - 1;
      _current = _last;
    }
    else
    {
      _current--;
    }
  }

  // The concrete subclass provides: RefType& dereference() const;

  void advance(difference_type n)
  {
    _position += n;
    while (n > 0)
    {
      difference_type max_delta = _last - _current;
      if ((max_delta >= n) || ((_chunk_index + 1) == _ch_arr->num_chunks()))
      {
        _current += n;
        return;
      }
      // Move to the next chunk.
      n -= max_delta;
      _chunk_index++;
      auto arr = _ch_arr->chunk(_chunk_index);
      _typed_arr = std::static_pointer_cast<ArrayType>(arr);
      _last = arr->length() - 1;
      _current = 0;
    }
    while (n < 0)
    {
      difference_type max_delta = 0 - _current;
      if ((max_delta <= n) || (_chunk_index == 0))
      {
        _current += n;
        return;
      }
      // Move to the previous chunk.
      n -= max_delta;
      _chunk_index--;
      auto arr = _ch_arr->chunk(_chunk_index);
      _typed_arr = std::static_pointer_cast<ArrayType>(arr);
      _last = arr->length() - 1;
      _current = _last;
    }
  }

  difference_type distance_to(ChunkedArrayIteratorIndexImpl<CType> const& other) const
  {
    return other._position - this->_position;
  }

  // Helper
  void set_position(difference_type pos)
  {
    _position = pos;
    const int nchunks = _ch_arr->num_chunks();
    int64_t offset = 0;
    for (_chunk_index = 0; _chunk_index < nchunks; _chunk_index++)
    {
      auto arr = _ch_arr->chunk(_chunk_index);
      int64_t arr_rows = arr->length();
      if (((offset + arr_rows) > pos) || ((_chunk_index + 1) == nchunks))
      {
        _typed_arr = std::static_pointer_cast<ArrayType>(arr);
        _last = arr_rows - 1;
        _current = (pos - offset);
        return;
      }
      offset += arr_rows;
    }
    assert(false);
  }

  std::shared_ptr<arrow::ChunkedArray> _ch_arr;
  // Which Array we're looking at.
  int _chunk_index = 0;
  // The current Array. Use first/last rather than begin/end for symmetry when moving
  // forward/backward.
  std::shared_ptr<ArrayType> _typed_arr;
  difference_type _current = 0;
  difference_type _last = 0;
  // Cache the position across all chunks to support random access.
  difference_type _position = 0;
};

// SBW 2020.04.23 For the EVAL2() macro, even though the code is not called, we need an lhs
// iterator for the unused code branch to compile.
using ConstRefString = std::string;
template<>
class ChunkedArrayIterator<std::string, ConstRefString> :
  public ChunkedArrayIteratorIndexImpl<std::string>,
  public boost::iterator_facade<ChunkedArrayIterator<std::string, ConstRefString>,
                                ConstRefString, boost::random_access_traversal_tag>
{
public:
  using difference_type = int64_t;
  using RefType = ConstRefString;

  explicit ChunkedArrayIterator(std::shared_ptr<arrow::ChunkedArray> ch_arr = nullptr,
                                difference_type pos = 0)
    : ChunkedArrayIteratorIndexImpl<std::string>(ch_arr, pos)
  {
  }

  // Cache the value to avoid returning a reference to a temporary.
  RefType& dereference() const { _cached = _typed_arr->GetString(_current); return _cached; }

private:
  mutable std::string _cached;
};
using ChunkedArrayIteratorString = ChunkedArrayIterator<std::string, ConstRefString>;

template<>
class ChunkedArrayIterator<bool, const bool> :
  public ChunkedArrayIteratorIndexImpl<bool>,
  public boost::iterator_facade<ChunkedArrayIterator<bool, const bool>,
                                const bool, boost::random_access_traversal_tag>
{
public:
  using difference_type = int64_t;
  using RefType = const bool;

  explicit ChunkedArrayIterator(std::shared_ptr<arrow::ChunkedArray> ch_arr = nullptr,
                                difference_type pos = 0)
    : ChunkedArrayIteratorIndexImpl<bool>(ch_arr, pos)
  {
  }

  // Cache the value to avoid returning a reference to a temporary.
  RefType& dereference() const { _cached = _typed_arr->GetView(_current); return _cached; }

private:
  mutable bool _cached;
};
using ChunkedArrayIteratorBoolean = ChunkedArrayIterator<bool, const bool>;

// STL container-like cover for arrow::ChunkedArray.
// Only works for ChunkedArrays composed of Array types that support raw_values().
template <typename CType>
class ChunkedArrayCover
{
public:
  // Match size_type to Array offsets rather than using size_t and ptrdiff_t.
  using size_type = int64_t;
  using difference_type = int64_t;
  using iterator = ChunkedArrayIterator<CType, CType>;
  using const_iterator = ChunkedArrayIterator<CType, const CType>;
  using reverse_iterator = iterator;
  using const_reverse_iterator = const_iterator;

  ChunkedArrayCover(std::shared_ptr<arrow::ChunkedArray>& array) : _array(array) {}

  size_type size() const { return _array->length(); }

  // Should non-const versions fail if the Array is immutable?
  iterator begin() { return iterator(_array); }
  iterator end() { return iterator(_array, size()); }
  reverse_iterator rbegin() { return iterator(_array, size() - 1); }
  reverse_iterator rend() { return iterator(_array, -1); }
  const_iterator cbegin() const { return const_iterator(_array); }
  const_iterator cend() const { return const_iterator(_array, size()); }
  const_reverse_iterator crbegin() const { return const_iterator(_array, size() - 1); }
  const_reverse_iterator crend() const { return const_iterator(_array, -1); }

protected:
  std::shared_ptr<arrow::ChunkedArray> _array;
};

#if 0 // SBW 2020.04.23 No longer needed now that we're using ConstRefString = std::string.
template<>
class ChunkedArrayCover<std::string>
{
public:
  using CType = std::string;
  using size_type = int64_t;
  using difference_type = int64_t;
  // ISSUE: no specialization for ChunkedArrayIterator<std::string, std::string>.
  // Not sure how to handle setting of std::string values since StringArray doesn't provide LHS access.
  using iterator = ChunkedArrayIterator<CType, const CType>;
  using const_iterator = ChunkedArrayIterator<CType, const CType>;
  using reverse_iterator = iterator;
  using const_reverse_iterator = const_iterator;

  ChunkedArrayCover(std::shared_ptr<arrow::ChunkedArray>& array) : _array(array) {}

  size_type size() const { return _array->length(); }

  // Should non-const versions fail if the Array is immutable?
  iterator begin() { return iterator(_array); }
  iterator end() { return iterator(_array, size()); }
  reverse_iterator rbegin() { return iterator(_array, size() - 1); }
  reverse_iterator rend() { return iterator(_array, -1); }
  const_iterator cbegin() const { return const_iterator(_array); }
  const_iterator cend() const { return const_iterator(_array, size()); }
  const_reverse_iterator crbegin() const { return const_iterator(_array, size() - 1); }
  const_reverse_iterator crend() const { return const_iterator(_array, -1); }

protected:
  std::shared_ptr<arrow::ChunkedArray> _array;
};
#endif

struct TestFrame
{
  TestFrame(std::shared_ptr<arrow::Table> table = nullptr) : _table(table)
  {
  }

  auto find_column(const char* name) { return _table->GetColumnByName(name); }

  template <typename CType>
  typename ChunkedArrayCover<CType>::iterator begin(const char* name)
  {
    auto col = _table->GetColumnByName(name);
    assert(col != nullptr);
    ChunkedArrayCover<CType> cover(col);
    return cover.begin();
  }
  template <typename CType>
  typename ChunkedArrayCover<CType>::iterator end(const char* name)
  {
    auto col = _table->GetColumnByName(name);
    assert(col != nullptr);
    ChunkedArrayCover<CType> cover(col);
    return cover.end();
  }
  // Append to the end if Index == -1.
  template <typename CType>
  bool add_column(const char* name, int Index = -1)
  {
    vector<CType> values(_table->num_rows());
    return add_column(name, values, Index);
  }
  template <typename CType>
  bool add_column(const char* name, const vector<CType>& values, int Index = -1)
  {
    using Builder = typename CTypeTraits<CType>::BuilderType;
    assert(static_cast<int64_t>(values.size()) == _table->num_rows());
    Builder builder;
    builder.Resize(values.size());
    builder.AppendValues(values);
    shared_ptr<arrow::Array> array;
    arrow::Status st = builder.Finish(&array);
    auto ch_arr = std::make_shared<arrow::ChunkedArray>(array);
    auto field = arrow::field(name, builder.type());
    // Watch for an existing column with this name and delete it if necessary.
    int icol = _table->schema()->GetFieldIndex(name);
    if (icol >= 0)
    {
      std::shared_ptr<arrow::Table> out;
      _table->RemoveColumn(icol, &out);
      _table = out;
    }
    std::shared_ptr<arrow::Table> out;
    if (Index < 0)
      Index = _table->num_columns();
    st = _table->AddColumn(Index, field, ch_arr, &out);
    if (st.ok())
    {
      _table = out;
      return true;
    }
    return false;
  }

  std::shared_ptr<arrow::Table> _table;
};

// Generalizing std::transform() to take any number of input iterators.
// https://medium.com/@vgasparyan1995/generalizing-std-transform-8d2c41e1f958
// https://github.com/vgasparyan1995/transform
#include "F:\Dev\transform-master\transform.h"

// Use BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__).
// Use BOOST_PP_SEQ_FOR_EACH_I twice, once to build the list of input iterators, once to build
// the argument list for the lambda.
// Use BOOST_PP_TUPLE_ELEM to pull values out of a tuple.

#define DF_INPUT_ITER(r, data, i, elem) \
  data.begin<BOOST_PP_TUPLE_ELEM(2, 0, elem)>(BOOST_PP_STRINGIZE(BOOST_PP_TUPLE_ELEM(2, 1, elem))) , \
  BOOST_PP_IF(i, , data.end<BOOST_PP_TUPLE_ELEM(2, 0, elem)>(BOOST_PP_STRINGIZE(BOOST_PP_TUPLE_ELEM(2, 1, elem)))) \
  BOOST_PP_COMMA_IF(BOOST_PP_NOT(i))

#define LAMBDA_INPUT(r, data, i, elem) \
  BOOST_PP_COMMA_IF(i) \
  auto BOOST_PP_TUPLE_ELEM(2, 1, elem)

// Variable args are input 2-tuples (type, name).
// Initially used BOOST_PP_VARIADIC_TO_SEQ(__VA_ARGS__), but I think it is clearer to add the
// inputs as a pp sequence.
// Function signature: transform(in1.begin(), in1.end(), in2.begin() ... inN.begin(), dest.begin(), lambda).
#define EVAL2(df, dest_tuple, func_body, input_seq) \
{ \
  using dest_type = BOOST_PP_TUPLE_ELEM(2, 0, dest_tuple); \
  const char* dest_name = BOOST_PP_STRINGIZE(BOOST_PP_TUPLE_ELEM(2, 1, dest_tuple)); \
  if (is_number_type<CTypeTraits<dest_type>::ArrowType>::value) \
  { \
    if (df.find_column(dest_name) == nullptr) \
      df.add_column<dest_type>(dest_name); \
    my::transform( \
      BOOST_PP_SEQ_FOR_EACH_I(DF_INPUT_ITER, df, input_seq) \
      df.begin<dest_type>(dest_name) , \
      [] (BOOST_PP_SEQ_FOR_EACH_I(LAMBDA_INPUT, df, input_seq)) func_body ); \
  } \
  else \
  { \
    using Builder = typename CTypeTraits<dest_type>::BuilderType; \
    vector<dest_type> dest(df._table->num_rows()); \
    my::transform( \
      BOOST_PP_SEQ_FOR_EACH_I(DF_INPUT_ITER, df, input_seq) \
      dest.begin() , \
      [] (BOOST_PP_SEQ_FOR_EACH_I(LAMBDA_INPUT, df, input_seq)) func_body ); \
    df.add_column<dest_type>(dest_name, dest); \
  } \
}

int main(int argc, char* argv[])
{
  auto fs = make_shared<fs::LocalFileSystem>();
  auto r_input = fs->OpenInputStream("c:/temp/_DatasetP14Seizures.csv");

  auto pool = default_memory_pool();
  auto read_options = arrow::csv::ReadOptions::Defaults();
  auto parse_options = arrow::csv::ParseOptions::Defaults();
  auto convert_options = arrow::csv::ConvertOptions::Defaults();

  auto r_table_reader = csv::TableReader::Make(pool, r_input.ValueOrDie(),
                                               read_options, parse_options, convert_options);
  auto r_read = r_table_reader.ValueOrDie()->Read();
  auto p_table = r_read.ValueOrDie();

  PrettyPrintOptions options{0};
  arrow::PrettyPrint(*p_table, options, &std::cout);

  // Test the covers and iterators.
  const Table& tlb = *p_table;
  const int64_t rows = tlb.num_rows();
  const int cols = tlb.num_columns();
  for (int c = 0; c < cols; c++)
  {
    auto f = tlb.field(c);
    const string& name = f->name();
    int type_id = f->type()->id();
    auto ch_arr = tlb.column(c);
    auto values_buffer = ch_arr->chunk(0)->data()->buffers[1];
    cout << "is_mutable: " << values_buffer->is_mutable() << endl;
    switch (type_id)
    {
    case Type::DOUBLE:
    {
#if 0
      using iterator = ChunkedArrayIteratorRaw<arrow::DoubleArray, double>;
      iterator it(ch_arr, 2);
      cout << it.IsNull() << endl;
      boost::iterator_range<iterator> range(it - 2, it + 8);
      for (double val : range)
        cout << val << endl;
#else
      using cover = ChunkedArrayCover<double>;
      using iterator = typename cover::iterator;
      using range = boost::iterator_range<iterator>;
      cover cvr(ch_arr);
      auto begin = cvr.begin();
      auto end = cvr.end();
      auto rbegin = cvr.rbegin();
      auto rend = cvr.rend();
      auto it = begin;
      it += 2;
      cout << "value_isnull: " << it.IsNull() << endl;
      range rng(it - 2, it + 8);
      for (double val : rng)
        cout << val << endl;
#endif
    }
    break;
    case Type::STRING:
    {
      using iterator = ChunkedArrayIteratorString;
      iterator it(ch_arr, 2);
      cout << "value_isnull: " << it.IsNull() << endl;
      boost::iterator_range<iterator> range(it - 2, it + 8);
      for (const std::string val : range)
        cout << val << endl;
    }
    break;
    case Type::BOOL:
    {
      using iterator = ChunkedArrayIteratorBoolean;
      iterator it(ch_arr, 2);
      cout << "value_isnull: " << it.IsNull() << endl;
      boost::iterator_range<iterator> range(it - 2, it + 8);
      for (bool val : range)
        cout << val << endl;
    }
    break;

    default:
      break;
    }
  }

  // Expected output: 1, 1, 0, 0. (The exact types on the first two lines were lost in the
  // email formatting; any numeric C type gives 1.)
  // cout << is_number_type<CTypeTraits<double>::ArrowType>::value << endl;
  // cout << is_number_type<CTypeTraits<int64_t>::ArrowType>::value << endl;
  // cout << is_number_type<CTypeTraits<bool>::ArrowType>::value << endl;
  // cout << is_number_type<CTypeTraits<std::string>::ArrowType>::value << endl;

  // Testing code, to check the templates.
  if (false)
  {
    TestFrame df;
    auto beg = df.begin<double>("foo");
    EVAL2(df, (double, dest), { return foo * foo; }, ((double, foo)));
    EVAL2(df, (double, dest), { return foo * foo; }, ((double, foo))((std::string, name)));
  }

  if (true)
  {
    TestFrame df(p_table);
    auto beg = df.begin<double>("Duration");
    EVAL2(df, (double, LogDur), { return log10(Duration); }, ((double, Duration)));
    EVAL2(df, (std::string, Length), { return (LogDur > 3) ? "long" : "short"; }, ((double, LogDur)));
    arrow::PrettyPrint(*df._table, options, &std::cout);
  }

  return 1;
}


Scott B. Wilson
Chairman and Chief Scientist
Persyst Development Corporation
420 Stevens Avenue, Suite 210
Solana Beach, CA 92075


Wes McKinney / @wesm:

Mainly I'd like to know if this looks like the direction you're thinking for arrow::DataFrame?

No, to be honest, from a glance it's a different direction from what I've been thinking. My thoughts there are actually for the data frame internally to be a mix of yet-to-be-scanned Datasets (e.g. from CSV or Parquet files), manifest (materialized in-memory) chunked arrays, and unevaluated expressions. Analytics requests are translated into physical query plans to be executed by the to-be-developed query engine. I haven't been able to give this my full attention since writing the design docs last year, but I intend to spend a large fraction of my time on it for the rest of the year.

The reasoning for wanting to push data frame operations into a query engine is to get around the memory-use issues and performance problems associated with "eager evaluation" data frame libraries like pandas (for example, a join in pandas materializes the entire joined data frame in memory). There are similar issues around sorting, particularly given knowledge of what you want to do with the sorted data – e.g. a sort followed by a slice can be executed as a Top-K operation for substantially less memory use.
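As a concrete illustration of the sort-plus-slice point (plain C++, nothing Arrow-specific): when only the first k values are needed, a partial sort keeps only k elements ordered instead of materializing a fully sorted copy.

#include <algorithm>
#include <vector>

// Full sort then slice: O(n log n) work and a fully ordered copy.
std::vector<double> SortThenSlice(std::vector<double> values, size_t k) {
  std::sort(values.begin(), values.end());
  values.resize(std::min(k, values.size()));
  return values;
}

// Top-K: only the first k elements end up ordered; the rest are left unspecified.
std::vector<double> TopK(std::vector<double> values, size_t k) {
  k = std::min(k, values.size());
  std::partial_sort(values.begin(), values.begin() + k, values.end());
  values.resize(k);
  return values;
}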

That said, I know a number of people have expressed interest in having STL interface layers over the Arrow data structures. This would be a valuable thing to contribute to the project. It's not mutually exclusive with the stuff I wrote above, but I wanted to give some idea of where my thinking is.


Scott Wilson:
Ah. That will be very cool. Thanks for your feedback. I'll continue with this approach (we're moving our ML pipeline from Python to C++) until yours materializes.



Scott Wilson:
Hey Wes,

I hope you and yours are doing well in this strange time.

I'm just writing to thank you for all the work you did on Arrow and the various discussions you've posted about the design decisions that drove this development, post-pandas. I've largely completed my C++ DataFrame and replaced the Python/pandas code that we use for our ML pipeline. Using the Arrow framework, I've been able to create a DataFrame object that wraps one or more Arrow tables. The implementation supports no-copy subsets, joins and concatenations, and STL-like iterators. Also supported are transforms using in-place lambda functions. The net result is that a ~1 TB data processing step that used to take 13 hours now requires 15 minutes.

The only kludge I put into place has to do with support for null values. I allow in-place editing of values, but no changes to array sizes or types. This is possible because the typed arrays offer access to the underlying raw values. To offer the same for null values, I had to create derived classes for Array and ChunkedArray that expose the cached null counts.
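For readers curious what the in-place editing amounts to, here is a rough sketch under the assumption that the array's values buffer is mutable and the array has a zero offset; ScaleInPlace is an illustrative helper, not part of the attached DataFrame header.

#include <cstdint>
#include <memory>

#include <arrow/api.h>

// Hypothetical helper: scale every non-null value of a DoubleArray in place.
// Requires that the underlying values buffer is mutable (e.g. freshly built) and,
// for brevity, assumes the array has no offset; the validity bitmap, length, and
// type are deliberately left untouched.
bool ScaleInPlace(const std::shared_ptr<arrow::DoubleArray>& arr, double factor) {
  auto values_buffer = arr->data()->buffers[1];
  if (!values_buffer || !values_buffer->is_mutable()) return false;
  auto* values = reinterpret_cast<double*>(values_buffer->mutable_data());
  for (int64_t i = 0; i < arr->length(); ++i) {
    if (!arr->IsNull(i)) values[i] *= factor;
  }
  return true;
}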

I've attached the DataFrame header in case it's of interest.

Thanks again, Scott




Wes McKinney / @wesm:
That's great news. Thanks for attaching the code – if you apply an open source license to it (like Apache 2.0) then others may be able to reuse parts of it.


Scott Wilson:
Is there a way it could become part of the Apache Arrow project?




Wes McKinney / @wesm:
Sure, it would need to be contributed at least as a pull request. Depending on discussions on the mailing list about the origins of the software (it was externally developed), we might need to obtain a software grant from your company. Then there is the question of "productionizing" it: conforming it to the code style of the project and writing unit tests.

For what it's worth, people have a lot of different expectations when they hear "data frame", and realistically we may end up with different kinds of data frame interfaces. From what I can see in the code, this is different from what I've proposed in https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h, though I have not been able to do any development on that personally. Unfortunately, I'm not able to invest time in this in the near term.


Scott Wilson:
Ok, thanks! Seems like a fair amount of work without a really compelling reason...




Antoine Pitrou / @pitrou:
Issue resolved by pull request #8612.
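For reference, the multi-key table sort added by that pull request is exposed through the C++ compute API; usage looks roughly like the sketch below (column names "a" and "b" are placeholders; check the compute documentation for the exact signatures in your Arrow version).

#include <memory>

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Rough usage sketch of the multi-column table sort that this issue tracked:
// build sort indices over several key columns, then materialize the sorted table.
arrow::Result<std::shared_ptr<arrow::Table>> SortTable(
    const std::shared_ptr<arrow::Table>& table) {
  arrow::compute::SortOptions options(
      {arrow::compute::SortKey("a", arrow::compute::SortOrder::Ascending),
       arrow::compute::SortKey("b", arrow::compute::SortOrder::Descending)});
  ARROW_ASSIGN_OR_RAISE(auto indices,
                        arrow::compute::SortIndices(arrow::Datum(table), options));
  ARROW_ASSIGN_OR_RAISE(auto sorted,
                        arrow::compute::Take(arrow::Datum(table), arrow::Datum(indices)));
  return sorted.table();
}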
