ARROW-4782: [C++] Prototype array and scalar expression types to help with building a deferred compute graph

Basic ideas

* `Expr` is the base class for the "edges" of the graph, i.e. data dependencies
* `Operation` is the base class for the nodes in the graph. An Operation takes input Expr dependencies, plus any static / non-Expr arguments, and produces an output Expr. Type checking and validation are performed during this output resolution
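
A minimal sketch of the two roles described above, under loud assumptions: the structs below are simplified, hypothetical stand-ins and not the classes added by this patch. Only the names `Expr`, `Operation`, `ExprPtr`, and `ToExpr` echo the patch; `Int32Literal`, `CastOp`, `type_name`, the `bool` return value, and `BuildExample` are invented for illustration.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical, simplified stand-ins; NOT the classes from this patch.
struct Operation;

// An edge of the graph: a value produced by some operation.
struct Expr {
  std::shared_ptr<const Operation> op;  // node that produces this value
  std::string type_name;                // illustrative type tag, e.g. "int32"
};
using ExprPtr = std::shared_ptr<Expr>;

// A node of the graph: consumes input Exprs and resolves an output Expr.
// Instances must be owned by a std::shared_ptr before ToExpr is called,
// because the output Expr keeps a reference to its producing node.
struct Operation : std::enable_shared_from_this<Operation> {
  virtual ~Operation() = default;
  // Returns false when type checking or validation fails (assumption: the
  // sketch uses bool instead of a richer status object).
  virtual bool ToExpr(ExprPtr* out) const = 0;
};

// A leaf node: no Expr inputs, only a static (non-Expr) argument.
struct Int32Literal : Operation {
  explicit Int32Literal(int value) : value(value) {}
  bool ToExpr(ExprPtr* out) const override {
    *out = std::make_shared<Expr>(Expr{shared_from_this(), "int32"});
    return true;
  }
  int value;
};

// A node with one Expr input plus a static target type.
struct CastOp : Operation {
  CastOp(ExprPtr input, std::string target)
      : input(std::move(input)), target(std::move(target)) {}
  bool ToExpr(ExprPtr* out) const override {
    // Validation happens while resolving the output, before any data exists.
    if (input == nullptr || target.empty()) return false;
    *out = std::make_shared<Expr>(Expr{shared_from_this(), target});
    return true;
  }
  ExprPtr input;
  std::string target;
};

// Builds cast(literal(7) -> "float64") and returns the resolved output type,
// or an empty string if resolution fails.
std::string BuildExample() {
  ExprPtr lit, casted;
  if (!std::make_shared<Int32Literal>(7)->ToExpr(&lit)) return "";
  if (!std::make_shared<CastOp>(lit, "float64")->ToExpr(&casted)) return "";
  return casted->type_name;
}
```

Nothing is computed when the graph is built. Each `ToExpr` call only validates its inputs and records which node produced the resulting edge, which is what makes the graph deferred.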

This patch does not get into various other necessary expression types, particularly table-level operations like projections, filters, aggregations, and joins. I'll look at those in follow-up patches.

I also added `arrow::compute::LogicalType`, which introduces the notion of an "unbound" / non-concrete type. Its purpose is to permit instances of "type classes" like "number" or "integer" in addition to concrete data types like "int32". I think we need to have type objects that are decoupled from the metadata used for the Arrow columnar format, even though in some cases there is a 1-to-1 mapping. There are some other things to contemplate in future patches, like "unbound" names in nested fields. We may not always know the field names when building an expression, and the "binding" to physical data may need to occur later.
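
To make the "type class" idea concrete, here is a small sketch under stated assumptions. It is not the `LogicalType` implementation from this patch: the `Kind` enum, the class names, and the helper functions are hypothetical, and only the method name `IsInstance` and the membership rules (a number is an integer or a floating-point value, but not a boolean) follow what the unit tests in this commit check.

```cpp
// Hypothetical sketch; the enum and class names are illustrative only.
enum class Kind { kBool, kInt8, kInt32, kUInt32, kFloat64 };

inline bool IsSigned(Kind k) { return k == Kind::kInt8 || k == Kind::kInt32; }
inline bool IsUnsigned(Kind k) { return k == Kind::kUInt32; }
inline bool IsFloating(Kind k) { return k == Kind::kFloat64; }

// An "unbound" type: describes a set of acceptable concrete kinds.
class TypeClass {
 public:
  virtual ~TypeClass() = default;
  virtual bool IsInstance(Kind kind) const = 0;
};

// Matches any signed integer kind.
class SignedInteger : public TypeClass {
 public:
  bool IsInstance(Kind kind) const override { return IsSigned(kind); }
};

// Matches signed and unsigned integer kinds.
class Integer : public TypeClass {
 public:
  bool IsInstance(Kind kind) const override {
    return IsSigned(kind) || IsUnsigned(kind);
  }
};

// Matches integer and floating-point kinds, but not booleans.
class Number : public TypeClass {
 public:
  bool IsInstance(Kind kind) const override {
    return IsSigned(kind) || IsUnsigned(kind) || IsFloating(kind);
  }
};

// A concrete type is the degenerate case: it matches exactly one kind.
class Concrete : public TypeClass {
 public:
  explicit Concrete(Kind kind) : kind_(kind) {}
  bool IsInstance(Kind kind) const override { return kind == kind_; }

 private:
  Kind kind_;
};
```

With this shape, an operation could declare that an argument must be a `Number` and accept `int32` and `float64` inputs alike, while a `Concrete` type keeps the 1-to-1 mapping to a single data type.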

The general approach here is inspired by Ibis, a pure Python expression algebra system I designed in 2015:

https://github.com/ibis-project/ibis/tree/master/ibis/expr

With this system I was able to accurately model a superset of SQL and, with the help of some other open source developers, create compilers from the expression algebra to SQL or other backend execution.

Author: Wes McKinney <wesm+git@apache.org>
Author: Krisztián Szűcs <szucs.krisztian@gmail.com>

Closes #3820 from wesm/compute-expr-prototyping and squashes the following commits:

3a83735 <Wes McKinney> Code review comments, add some type convenience aliases
1571b3b <Krisztián Szűcs> virtual destructor for LogicalType
0142cfe <Wes McKinney> Fix issues on Windows
7a6b1ed <Wes McKinney> Smoke tests for example ops
8628328 <Wes McKinney> More expr boxing scaffolding and logical type tests
c00b629 <Wes McKinney> Add some basic expression type factories
a665167 <Wes McKinney> Get some very basic unit tests passing
03a3ed8 <Wes McKinney> Remove superfluous file
a56b81c <Wes McKinney> More scaffolding, a bit cleaner API, factory methods for expr types
8eaadd9 <Wes McKinney> More boilerplate
9859982 <Wes McKinney> Prototyping
65dbdcb <Wes McKinney> Prototyping
77303f0 <Wes McKinney> Prototype
86ec152 <Wes McKinney> More prototyping / scaffolding
8d424aa <Wes McKinney> Prototyping
510bb7d <Wes McKinney> Prototyping
c73f420 <Wes McKinney> Prototyping
1fff4b8 <Wes McKinney> seed
wesm committed Mar 8, 2019
1 parent d518839 commit 08ca13f
Showing 17 changed files with 1,508 additions and 3 deletions.
6 changes: 6 additions & 0 deletions cpp/build-support/run_cpplint.py
@@ -26,8 +26,14 @@
 from functools import partial


+# NOTE(wesm):
+#
+# * readability/casting is disabled as it aggressively warns about functions
+# with names like "int32", so "int32(x)", where int32 is a function name,
+# warns with
 _filters = '''
 -whitespace/comments
+-readability/casting
 -readability/todo
 -build/header_guard
 -build/c++11
7 changes: 6 additions & 1 deletion cpp/src/arrow/CMakeLists.txt
@@ -143,14 +143,19 @@ if(ARROW_COMPUTE)
 set(ARROW_SRCS
 ${ARROW_SRCS}
 compute/context.cc
+compute/expression.cc
+compute/logical_type.cc
+compute/operation.cc
 compute/kernels/aggregate.cc
 compute/kernels/boolean.cc
 compute/kernels/cast.cc
 compute/kernels/count.cc
 compute/kernels/hash.cc
 compute/kernels/mean.cc
 compute/kernels/sum.cc
-compute/kernels/util-internal.cc)
+compute/kernels/util-internal.cc
+compute/operations/cast.cc
+compute/operations/literal.cc)
 endif()

 if(ARROW_CUDA)
2 changes: 2 additions & 0 deletions cpp/src/arrow/compute/CMakeLists.txt
@@ -25,6 +25,8 @@ arrow_add_pkg_config("arrow-compute")
 #

 add_arrow_test(compute-test)
+add_arrow_test(expression-test PREFIX "arrow-compute")
+add_arrow_test(operations/operations-test PREFIX "arrow-compute")
 add_arrow_benchmark(compute-benchmark)

 add_subdirectory(kernels)
151 changes: 151 additions & 0 deletions cpp/src/arrow/compute/expression-test.cc
@@ -0,0 +1,151 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#include <cstdint>
#include <memory>
#include <string>
#include <vector>

#include <gtest/gtest.h>

#include "arrow/status.h"
#include "arrow/table.h"
#include "arrow/testing/gtest_common.h"
#include "arrow/testing/gtest_util.h"
#include "arrow/type.h"

#include "arrow/compute/expression.h"
#include "arrow/compute/logical_type.h"
#include "arrow/compute/operation.h"

namespace arrow {
namespace compute {

// A placeholder operator implementation to use for testing various Expr
// behavior
class DummyOp : public Operation {
 public:
  Status ToExpr(ExprPtr* out) const override { return Status::NotImplemented("NYI"); }
};

TEST(TestLogicalType, NonNestedToString) {
  std::vector<std::pair<LogicalTypePtr, std::string>> type_to_name = {
      {type::any(), "Any"},
      {type::null(), "Null"},
      {type::boolean(), "Bool"},
      {type::number(), "Number"},
      {type::floating(), "Floating"},
      {type::integer(), "Integer"},
      {type::signed_integer(), "SignedInteger"},
      {type::unsigned_integer(), "UnsignedInteger"},
      {type::int8(), "Int8"},
      {type::int16(), "Int16"},
      {type::int32(), "Int32"},
      {type::int64(), "Int64"},
      {type::uint8(), "UInt8"},
      {type::uint16(), "UInt16"},
      {type::uint32(), "UInt32"},
      {type::uint64(), "UInt64"},
      {type::float16(), "Float16"},
      {type::float32(), "Float32"},
      {type::float64(), "Float64"},
      {type::binary(), "Binary"},
      {type::utf8(), "Utf8"}};

  for (auto& entry : type_to_name) {
    ASSERT_EQ(entry.second, entry.first->ToString());
  }
}

class DummyExpr : public Expr {
 public:
  using Expr::Expr;
  std::string kind() const override { return "dummy"; }
};

TEST(TestLogicalType, Any) {
  auto op = std::make_shared<DummyOp>();
  auto t = type::any();
  ASSERT_TRUE(t->IsInstance(*scalar::int32(op)));
  ASSERT_TRUE(t->IsInstance(*array::binary(op)));
  ASSERT_FALSE(t->IsInstance(*std::make_shared<DummyExpr>(op)));
}

TEST(TestLogicalType, Number) {
  auto op = std::make_shared<DummyOp>();
  auto t = type::number();

  ASSERT_TRUE(t->IsInstance(*scalar::int32(op)));
  ASSERT_TRUE(t->IsInstance(*scalar::float64(op)));
  ASSERT_FALSE(t->IsInstance(*scalar::boolean(op)));
  ASSERT_FALSE(t->IsInstance(*scalar::null(op)));
  ASSERT_FALSE(t->IsInstance(*scalar::binary(op)));
}

TEST(TestLogicalType, IntegerBaseTypes) {
  auto op = std::make_shared<DummyOp>();
  auto all_ty = type::integer();
  auto signed_ty = type::signed_integer();
  auto unsigned_ty = type::unsigned_integer();

  ASSERT_TRUE(all_ty->IsInstance(*scalar::int32(op)));
  ASSERT_TRUE(all_ty->IsInstance(*scalar::uint32(op)));
  ASSERT_FALSE(all_ty->IsInstance(*array::float64(op)));
  ASSERT_FALSE(all_ty->IsInstance(*array::binary(op)));

  ASSERT_TRUE(signed_ty->IsInstance(*array::int32(op)));
  ASSERT_FALSE(signed_ty->IsInstance(*scalar::uint32(op)));

  ASSERT_TRUE(unsigned_ty->IsInstance(*scalar::uint32(op)));
  ASSERT_TRUE(unsigned_ty->IsInstance(*array::uint32(op)));
  ASSERT_FALSE(unsigned_ty->IsInstance(*array::int8(op)));
}

TEST(TestLogicalType, NumberConcreteIsinstance) {
  auto op = std::make_shared<DummyOp>();

  std::vector<LogicalTypePtr> types = {
      type::null(), type::boolean(), type::int8(), type::int16(), type::int32(),
      type::int64(), type::uint8(), type::uint16(), type::uint32(), type::uint64(),
      type::float16(), type::float32(), type::float64(), type::binary(), type::utf8()};

  std::vector<ExprPtr> exprs = {
      scalar::null(op), array::null(op), scalar::boolean(op), array::boolean(op),
      scalar::int8(op), array::int8(op), scalar::int16(op), array::int16(op),
      scalar::int32(op), array::int32(op), scalar::int64(op), array::int64(op),
      scalar::uint8(op), array::uint8(op), scalar::uint16(op), array::uint16(op),
      scalar::uint32(op), array::uint32(op), scalar::uint64(op), array::uint64(op),
      scalar::float16(op), array::float16(op), scalar::float32(op), array::float32(op),
      scalar::float64(op), array::float64(op)};

  for (auto ty : types) {
    int num_matches = 0;
    for (auto expr : exprs) {
      const auto& v_expr = static_cast<const ValueExpr&>(*expr);
      const bool ty_matches = v_expr.type()->id() == ty->id();
      ASSERT_EQ(ty_matches, ty->IsInstance(v_expr))
          << "Expr: " << expr->kind() << " Type: " << ty->ToString();
      num_matches += ty_matches;
    }
    // Each logical type is represented twice in the list of exprs, once in
    // array form, the other in scalar form
    ASSERT_LE(num_matches, 2);
  }
}

} // namespace compute
} // namespace arrow