ARROW-12335: [Rust] [Ballista] Use latest DataFusion #9991

Closed · wants to merge 5 commits into from
Changes from 3 commits
6 changes: 3 additions & 3 deletions rust/ballista/rust/Cargo.toml
@@ -25,6 +25,6 @@ members = [
"scheduler",
]

[profile.release]
lto = true
codegen-units = 1
#[profile.release]
Member Author:
@Dandandan This was an accidental commit. I had to comment this out so that I could build and test without really slow build times. Is there a better way for me to work around this?

Contributor:

It would really be convenient to have this feature (rust-lang/cargo#6988), as I agree that if you just want to run Ballista with "reasonable" performance, it shouldn't take ages to compile. In my experience a project takes roughly 2x as long to build with LTO (maybe worse when compared with incremental builds).

It is possible to do this via flags as well, but earlier that didn't work because of the structure of the Ballista projects (multiple binaries per crate, as far as I remember). Maybe we can just temporarily remove this (and be a bit slower) and see if we can enable it in a different way. Another route I saw is appending `lto = true` in a build script when creating binaries.
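One more possible workaround, sketched here as an assumption rather than a tested recipe: keep the committed `[profile.release]` as-is and override it locally in an untracked `.cargo/config.toml`, which recent Cargo versions (roughly 1.43+) allow:

```toml
# Hypothetical local override in .cargo/config.toml (not checked in):
# the workspace Cargo.toml keeps lto/codegen-units for release artifacts,
# while day-to-day builds on this machine skip the expensive settings.
[profile.release]
lto = false
codegen-units = 16
```

An equivalent one-off override via environment variables, e.g. `CARGO_PROFILE_RELEASE_LTO=false cargo build --release`, should also work on Cargo versions that support config profiles.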

#lto = true
#codegen-units = 1
7 changes: 3 additions & 4 deletions rust/ballista/rust/benchmarks/tpch/Cargo.toml
@@ -27,10 +27,9 @@ edition = "2018"
[dependencies]
ballista = { path="../../client" }

arrow = { git = "https://github.com/apache/arrow", rev="46161d2" }
datafusion = { git = "https://github.com/apache/arrow", rev="46161d2" }
parquet = { git = "https://github.com/apache/arrow", rev="46161d2" }

arrow = { path = "../../../../arrow" }
datafusion = { path = "../../../../datafusion" }
parquet = { path = "../../../../parquet" }

env_logger = "0.8"
tokio = { version = "1.0", features = ["macros", "rt", "rt-multi-thread"] }
5 changes: 3 additions & 2 deletions rust/ballista/rust/client/Cargo.toml
@@ -30,5 +30,6 @@ ballista-core = { path = "../core" }
futures = "0.3"
log = "0.4"
tokio = "1.0"
arrow = { git = "https://github.com/apache/arrow", rev="46161d2" }
datafusion = { git = "https://github.com/apache/arrow", rev="46161d2" }

arrow = { path = "../../../arrow" }
datafusion = { path = "../../../datafusion" }
14 changes: 9 additions & 5 deletions rust/ballista/rust/client/src/context.rs
@@ -36,6 +36,7 @@ use ballista_core::{
};

use arrow::datatypes::Schema;
use datafusion::catalog::TableReference;
use datafusion::execution::context::ExecutionContext;
use datafusion::logical_plan::{DFSchema, Expr, LogicalPlan, Partitioning};
use datafusion::physical_plan::csv::CsvReadOptions;
@@ -148,7 +149,10 @@ impl BallistaContext {
for (name, plan) in &state.tables {
let plan = ctx.optimize(plan)?;
let execution_plan = ctx.create_physical_plan(&plan)?;
ctx.register_table(name, Arc::new(DFTableAdapter::new(plan, execution_plan)));
ctx.register_table(
TableReference::Bare { table: name },
Arc::new(DFTableAdapter::new(plan, execution_plan)),
)?;
}
let df = ctx.sql(sql)?;
Ok(BallistaDataFrame::from(self.state.clone(), df))
@@ -267,7 +271,7 @@ impl BallistaDataFrame {
))
}

pub fn select(&self, expr: &[Expr]) -> Result<BallistaDataFrame> {
pub fn select(&self, expr: Vec<Expr>) -> Result<BallistaDataFrame> {
Ok(Self::from(
self.state.clone(),
self.df.select(expr).map_err(BallistaError::from)?,
@@ -283,8 +287,8 @@

pub fn aggregate(
&self,
group_expr: &[Expr],
aggr_expr: &[Expr],
group_expr: Vec<Expr>,
aggr_expr: Vec<Expr>,
) -> Result<BallistaDataFrame> {
Ok(Self::from(
self.state.clone(),
@@ -301,7 +305,7 @@
))
}

pub fn sort(&self, expr: &[Expr]) -> Result<BallistaDataFrame> {
pub fn sort(&self, expr: Vec<Expr>) -> Result<BallistaDataFrame> {
Ok(Self::from(
self.state.clone(),
self.df.sort(expr).map_err(BallistaError::from)?,
7 changes: 4 additions & 3 deletions rust/ballista/rust/core/Cargo.toml
@@ -39,9 +39,10 @@ sqlparser = "0.8"
tokio = "1.0"
tonic = "0.4"
uuid = { version = "0.8", features = ["v4"] }
arrow = { git = "https://github.com/apache/arrow", rev="46161d2" }
arrow-flight = { git = "https://github.com/apache/arrow", rev="46161d2" }
datafusion = { git = "https://github.com/apache/arrow", rev="46161d2" }

arrow = { path = "../../../arrow" }
arrow-flight = { path = "../../../arrow-flight" }
datafusion = { path = "../../../datafusion" }


[dev-dependencies]
6 changes: 6 additions & 0 deletions rust/ballista/rust/core/proto/ballista.proto
@@ -59,6 +59,7 @@ message LogicalExprNode {
InListNode in_list = 14;
bool wildcard = 15;
ScalarFunctionNode scalar_function = 16;
TryCastNode try_cast = 17;
}
}

@@ -172,6 +173,11 @@ message CastNode {
ArrowType arrow_type = 2;
}

message TryCastNode {
LogicalExprNode expr = 1;
ArrowType arrow_type = 2;
}

message SortExprNode {
LogicalExprNode expr = 1;
bool asc = 2;
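The new `TryCastNode` message mirrors `CastNode`; the difference lies in the expression it deserializes to (`Expr::TryCast` vs. `Expr::Cast`). As a hedged illustration of the usual CAST vs. TRY_CAST semantics — a strict cast fails on an unconvertible value, a try-cast yields NULL — here is a minimal stdlib-only Rust analogue, not DataFusion's actual cast kernels:

```rust
// Strict cast: one unconvertible value aborts the whole evaluation with an error.
fn cast_to_i64(values: &[&str]) -> Result<Vec<i64>, String> {
    values
        .iter()
        .map(|v| {
            v.parse::<i64>()
                .map_err(|e| format!("cast error on {:?}: {}", v, e))
        })
        .collect()
}

// Try-cast: unconvertible values become NULL (None) instead of failing.
fn try_cast_to_i64(values: &[&str]) -> Vec<Option<i64>> {
    values.iter().map(|v| v.parse::<i64>().ok()).collect()
}

fn main() {
    let input = ["1", "two", "3"];
    // Strict cast fails on "two"...
    assert!(cast_to_i64(&input).is_err());
    // ...while try-cast nulls it out and keeps the rest.
    assert_eq!(try_cast_to_i64(&input), vec![Some(1), None, Some(3)]);
    println!("ok");
}
```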
1 change: 1 addition & 0 deletions rust/ballista/rust/core/src/datasource.rs
@@ -57,6 +57,7 @@ impl TableProvider for DFTableAdapter {
_projection: &Option<Vec<usize>>,
_batch_size: usize,
_filters: &[Expr],
_limit: Option<usize>,
) -> DFResult<Arc<dyn ExecutionPlan>> {
Ok(self.plan.clone())
}
46 changes: 27 additions & 19 deletions rust/ballista/rust/core/src/serde/logical_plan/from_proto.rs
@@ -52,14 +52,13 @@ impl TryInto<LogicalPlan> for &protobuf::LogicalPlanNode {
match plan {
LogicalPlanType::Projection(projection) => {
let input: LogicalPlan = convert_box_required!(projection.input)?;
let x: Vec<Expr> = projection
.expr
.iter()
.map(|expr| expr.try_into())
.collect::<Result<Vec<_>, _>>()?;
LogicalPlanBuilder::from(&input)
.project(
&projection
.expr
.iter()
.map(|expr| expr.try_into())
.collect::<Result<Vec<_>, _>>()?,
)?
.project(x)?
.build()
.map_err(|e| e.into())
}
@@ -89,7 +88,7 @@ impl TryInto<LogicalPlan> for &protobuf::LogicalPlanNode {
.map(|expr| expr.try_into())
.collect::<Result<Vec<_>, _>>()?;
LogicalPlanBuilder::from(&input)
.aggregate(&group_expr, &aggr_expr)?
.aggregate(group_expr, aggr_expr)?
.build()
.map_err(|e| e.into())
}
@@ -148,7 +147,7 @@ impl TryInto<LogicalPlan> for &protobuf::LogicalPlanNode {
.map(|expr| expr.try_into())
.collect::<Result<Vec<Expr>, _>>()?;
LogicalPlanBuilder::from(&input)
.sort(&sort_expr)?
.sort(sort_expr)?
.build()
.map_err(|e| e.into())
}
@@ -511,10 +510,10 @@ fn typechecked_scalar_value_conversion(
ScalarValue::Date32(Some(*v))
}
(Value::TimeMicrosecondValue(v), PrimitiveScalarType::TimeMicrosecond) => {
ScalarValue::TimeMicrosecond(Some(*v))
ScalarValue::TimestampMicrosecond(Some(*v))
}
(Value::TimeNanosecondValue(v), PrimitiveScalarType::TimeMicrosecond) => {
ScalarValue::TimeNanosecond(Some(*v))
ScalarValue::TimestampNanosecond(Some(*v))
}
(Value::Utf8Value(v), PrimitiveScalarType::Utf8) => {
ScalarValue::Utf8(Some(v.to_owned()))
@@ -547,10 +546,10 @@ PrimitiveScalarType::LargeUtf8 => ScalarValue::LargeUtf8(None),
PrimitiveScalarType::LargeUtf8 => ScalarValue::LargeUtf8(None),
PrimitiveScalarType::Date32 => ScalarValue::Date32(None),
PrimitiveScalarType::TimeMicrosecond => {
ScalarValue::TimeMicrosecond(None)
ScalarValue::TimestampMicrosecond(None)
}
PrimitiveScalarType::TimeNanosecond => {
ScalarValue::TimeNanosecond(None)
ScalarValue::TimestampNanosecond(None)
}
PrimitiveScalarType::Null => {
return Err(proto_error(
@@ -610,10 +609,10 @@ impl TryInto<datafusion::scalar::ScalarValue> for &protobuf::scalar_value::Value
ScalarValue::Date32(Some(*v))
}
protobuf::scalar_value::Value::TimeMicrosecondValue(v) => {
ScalarValue::TimeMicrosecond(Some(*v))
ScalarValue::TimestampMicrosecond(Some(*v))
}
protobuf::scalar_value::Value::TimeNanosecondValue(v) => {
ScalarValue::TimeNanosecond(Some(*v))
ScalarValue::TimestampNanosecond(Some(*v))
}
protobuf::scalar_value::Value::ListValue(v) => v.try_into()?,
protobuf::scalar_value::Value::NullListValue(v) => {
@@ -776,10 +775,10 @@ impl TryInto<datafusion::scalar::ScalarValue> for protobuf::PrimitiveScalarType
protobuf::PrimitiveScalarType::LargeUtf8 => ScalarValue::LargeUtf8(None),
protobuf::PrimitiveScalarType::Date32 => ScalarValue::Date32(None),
protobuf::PrimitiveScalarType::TimeMicrosecond => {
ScalarValue::TimeMicrosecond(None)
ScalarValue::TimestampMicrosecond(None)
}
protobuf::PrimitiveScalarType::TimeNanosecond => {
ScalarValue::TimeNanosecond(None)
ScalarValue::TimestampNanosecond(None)
}
})
}
@@ -829,10 +828,10 @@ impl TryInto<datafusion::scalar::ScalarValue> for &protobuf::ScalarValue {
ScalarValue::Date32(Some(*v))
}
protobuf::scalar_value::Value::TimeMicrosecondValue(v) => {
ScalarValue::TimeMicrosecond(Some(*v))
ScalarValue::TimestampMicrosecond(Some(*v))
}
protobuf::scalar_value::Value::TimeNanosecondValue(v) => {
ScalarValue::TimeNanosecond(Some(*v))
ScalarValue::TimestampNanosecond(Some(*v))
}
protobuf::scalar_value::Value::ListValue(scalar_list) => {
let protobuf::ScalarListValue {
@@ -962,6 +961,15 @@ impl TryInto<Expr> for &protobuf::LogicalExprNode {
let data_type = arrow_type.try_into()?;
Ok(Expr::Cast { expr, data_type })
}
ExprType::TryCast(cast) => {
let expr = Box::new(parse_required_expr(&cast.expr)?);
let arrow_type: &protobuf::ArrowType = cast
.arrow_type
.as_ref()
.ok_or_else(|| proto_error("Protobuf deserialization error: TryCastNode message missing required field 'arrow_type'"))?;
let data_type = arrow_type.try_into()?;
Ok(Expr::TryCast { expr, data_type })
}
ExprType::Sort(sort) => Ok(Expr::Sort {
expr: Box::new(parse_required_expr(&sort.expr)?),
asc: sort.asc,
28 changes: 14 additions & 14 deletions rust/ballista/rust/core/src/serde/logical_plan/mod.rs
@@ -82,7 +82,7 @@ mod roundtrip_tests {
CsvReadOptions::new().schema(&schema).has_header(true),
Some(vec![3, 4]),
)
.and_then(|plan| plan.sort(&[col("salary")]))
.and_then(|plan| plan.sort(vec![col("salary")]))
.and_then(|plan| plan.build())
.map_err(BallistaError::DataFusionError)?,
);
@@ -212,8 +212,8 @@ mod roundtrip_tests {
ScalarValue::LargeUtf8(None),
ScalarValue::List(None, DataType::Boolean),
ScalarValue::Date32(None),
ScalarValue::TimeMicrosecond(None),
ScalarValue::TimeNanosecond(None),
ScalarValue::TimestampMicrosecond(None),
ScalarValue::TimestampNanosecond(None),
ScalarValue::Boolean(Some(true)),
ScalarValue::Boolean(Some(false)),
ScalarValue::Float32(Some(1.0)),
@@ -252,11 +252,11 @@ mod roundtrip_tests {
ScalarValue::LargeUtf8(Some(String::from("Test Large utf8"))),
ScalarValue::Date32(Some(0)),
ScalarValue::Date32(Some(i32::MAX)),
ScalarValue::TimeNanosecond(Some(0)),
ScalarValue::TimeNanosecond(Some(i64::MAX)),
ScalarValue::TimeMicrosecond(Some(0)),
ScalarValue::TimeMicrosecond(Some(i64::MAX)),
ScalarValue::TimeMicrosecond(None),
ScalarValue::TimestampNanosecond(Some(0)),
ScalarValue::TimestampNanosecond(Some(i64::MAX)),
ScalarValue::TimestampMicrosecond(Some(0)),
ScalarValue::TimestampMicrosecond(Some(i64::MAX)),
ScalarValue::TimestampMicrosecond(None),
ScalarValue::List(
Some(vec![
ScalarValue::Float32(Some(-213.1)),
@@ -610,8 +610,8 @@ mod roundtrip_tests {
ScalarValue::Utf8(None),
ScalarValue::LargeUtf8(None),
ScalarValue::Date32(None),
ScalarValue::TimeMicrosecond(None),
ScalarValue::TimeNanosecond(None),
ScalarValue::TimestampMicrosecond(None),
ScalarValue::TimestampNanosecond(None),
//ScalarValue::List(None, DataType::Boolean)
];

@@ -679,7 +679,7 @@ mod roundtrip_tests {
CsvReadOptions::new().schema(&schema).has_header(true),
Some(vec![3, 4]),
)
.and_then(|plan| plan.sort(&[col("salary")]))
.and_then(|plan| plan.sort(vec![col("salary")]))
.and_then(|plan| plan.explain(true))
.and_then(|plan| plan.build())
.map_err(BallistaError::DataFusionError)?;
@@ -689,7 +689,7 @@
CsvReadOptions::new().schema(&schema).has_header(true),
Some(vec![3, 4]),
)
.and_then(|plan| plan.sort(&[col("salary")]))
.and_then(|plan| plan.sort(vec![col("salary")]))
.and_then(|plan| plan.explain(false))
.and_then(|plan| plan.build())
.map_err(BallistaError::DataFusionError)?;
@@ -742,7 +742,7 @@ mod roundtrip_tests {
CsvReadOptions::new().schema(&schema).has_header(true),
Some(vec![3, 4]),
)
.and_then(|plan| plan.sort(&[col("salary")]))
.and_then(|plan| plan.sort(vec![col("salary")]))
.and_then(|plan| plan.build())
.map_err(BallistaError::DataFusionError)?;
roundtrip_test!(plan);
@@ -784,7 +784,7 @@ mod roundtrip_tests {
CsvReadOptions::new().schema(&schema).has_header(true),
Some(vec![3, 4]),
)
.and_then(|plan| plan.aggregate(&[col("state")], &[max(col("salary"))]))
.and_then(|plan| plan.aggregate(vec![col("state")], vec![max(col("salary"))]))
.and_then(|plan| plan.build())
.map_err(BallistaError::DataFusionError)?;

14 changes: 4 additions & 10 deletions rust/ballista/rust/core/src/serde/logical_plan/to_proto.rs
@@ -641,12 +641,12 @@ impl TryFrom<&datafusion::scalar::ScalarValue> for protobuf::ScalarValue {
datafusion::scalar::ScalarValue::Date32(val) => {
create_proto_scalar(val, PrimitiveScalarType::Date32, |s| Value::Date32Value(*s))
}
datafusion::scalar::ScalarValue::TimeMicrosecond(val) => {
datafusion::scalar::ScalarValue::TimestampMicrosecond(val) => {
create_proto_scalar(val, PrimitiveScalarType::TimeMicrosecond, |s| {
Value::TimeMicrosecondValue(*s)
})
}
datafusion::scalar::ScalarValue::TimeNanosecond(val) => {
datafusion::scalar::ScalarValue::TimestampNanosecond(val) => {
create_proto_scalar(val, PrimitiveScalarType::TimeNanosecond, |s| {
Value::TimeNanosecondValue(*s)
})
@@ -939,10 +939,7 @@ impl TryInto<protobuf::LogicalPlanNode> for &LogicalPlan {
})
}
LogicalPlan::Extension { .. } => unimplemented!(),
// _ => Err(BallistaError::General(format!(
// "logical plan to_proto {:?}",
// self
// ))),
LogicalPlan::Union { .. } => unimplemented!(),
}
}
}
@@ -1161,10 +1158,7 @@ impl TryInto<protobuf::LogicalExprNode> for &Expr {
Expr::Wildcard => Ok(protobuf::LogicalExprNode {
expr_type: Some(protobuf::logical_expr_node::ExprType::Wildcard(true)),
}),
// _ => Err(BallistaError::General(format!(
// "logical expr to_proto {:?}",
// self
// ))),
Expr::TryCast { .. } => unimplemented!(),
}
}
}