
Leverage more engine capabilities in data skipping 2/n #83

Merged
merged 15 commits into from
Feb 15, 2024

Conversation

roeap
Collaborator

@roeap roeap commented Dec 9, 2023

based on #81

This PR updates the data skipping logic to better leverage the engine's capabilities and reduce our arrow exposure in the core kernel. It is best reviewed commit by commit, as it permeates quite far. If it is more helpful, I can also split up this PR along these commits.

  1. handle partition values
  2. use json handler for stats parsing
     • rewrite json parsing to replace the hack_parse function
  3. evaluate full skipping predicate via evaluator & feedback ... rest
     • add new DISTINCT binary expression
     • evaluate skipping predicate via expression evaluator

closes: #69
closes: #68

kernel/src/client/conversion.rs Outdated Show resolved Hide resolved
Comment on lines 216 to 217
// TODO we should be passing an empty batch here, but not sure how
partiton_arrays.push(evaluator.evaluate(&batch)?);
Collaborator

Not sure what this TODO means, sorry?

Collaborator

Update: We're evaluating a Literal expression that needs no inputs...

Another approach might be to get the (top-level) column names, and create a struct of expressions that becomes the output batch:

let mut fields = Vec::with_capacity(...);
for (column, field) in &partition_fields {
  let value_expression = ...;
  fields.push(value_expression);
}
for field in self.schema.fields {
  let column_expression = Expression::Column(field.name);
  fields.push(column_expression);
}

// TODO: Set this up once overall, rather than once per batch!
let evaluator = expression_handler.get_evaluator(batch.schema(), Expression::Struct(fields));
evaluator.evaluate(&batch)

Comment on lines 246 to 247
// TODO the protocol states that an empty string is always a null value
// does this mean that we cannot have empty strings as a string partition value?
Collaborator

Delta spark hit this as well. It's a limitation of hive value partitioning that Delta inherited, with some spark limitations thrown in for good measure.

The least-bad solution we could come up with was to forcibly interpret empty strings as null on the write path, so that the read path consistently returns null. This indeed means you can't store an empty string partition value (it coerces to null), and also means you can't store an empty string (= null) in a partition column with a not-null constraint. See e.g.
https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/files/TransactionalWrite.scala#L193
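
The read-path rule described above can be sketched std-only; the function name and `Option<String>` modeling below are illustrative assumptions, not the kernel's or Delta Spark's actual API:

```rust
// Sketch of the "empty string is always null" partition value rule.
// `parse_partition_value` and its Option<String> return type are hypothetical.
fn parse_partition_value(raw: Option<&str>) -> Option<String> {
    match raw {
        // absent value stays null
        None => None,
        // empty string is forcibly coerced to null on the read path
        Some("") => None,
        Some(s) => Some(s.to_string()),
    }
}
```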

Comment on lines 251 to 258
match data_type {
DataType::Primitive(primitive) => match primitive {
PrimitiveType::String => Ok(Scalar::String(raw.to_string())),
PrimitiveType::Integer => {
Ok(Scalar::Integer(raw.parse::<i32>().map_err(|_| {
Error::ParseError(raw.to_string(), data_type.clone())
})?))
}
Collaborator

Can we define a helper method to capture this boilerplate?

Suggested change
match data_type {
DataType::Primitive(primitive) => match primitive {
PrimitiveType::String => Ok(Scalar::String(raw.to_string())),
PrimitiveType::Integer => {
Ok(Scalar::Integer(raw.parse::<i32>().map_err(|_| {
Error::ParseError(raw.to_string(), data_type.clone())
})?))
}
match data_type {
DataType::Primitive(primitive) => primitive.parse_scalar(raw)

where

impl PrimitiveType {
  pub fn parse_scalar(&self, raw: &str) -> Result<Scalar, Error> {
    match self {
      Self::String => Ok(Scalar::String(raw.to_string())),
      Self::Byte => self.str_parse_scalar(raw, |i| Scalar::Byte(i)),
        ... other numeric types ...
      Self::Double => self.str_parse_scalar(raw, |i| Scalar::Double(i)),
        ... remaining types (decimal, bool, date/time)
    }
  }

  fn str_parse_scalar<T: std::str::FromStr>(
    &self,
    raw: &str,
    f: impl FnOnce(T) -> Scalar
  ) -> Result<Scalar, Error> {
    match raw.parse() {
        Ok(val) => Ok(f(val)),
        Err(..) => Err(Error::ParseError(raw.to_string(), DataType::Primitive(self.clone()))),
    }
  }
}

(I tested it in rust playground, and the compiler is in fact able to do the necessary type inference!)

Collaborator

We could probably factor out the error handling as well, since not all primitive types would use str_parse_scalar method:

fn parse_error(&self, raw: &str) -> Error {
  Error::ParseError(raw.to_string(), DataType::Primitive(self.clone()))
}

and then

    match raw.parse() {
        Ok(val) => Ok(f(val)),
        Err(..) => Err(self.parse_error(raw)),
    }

(we can't "just" factor out the ParseError to top-level, because the error type of the str::parse::<T> result depends on T)

Collaborator Author

These updates are included in handle partition values.

Collaborator

FYI Delta spec doesn't require the file names to use hive-style partitioning scheme (the actual partition values come from the file's Add metadata entry). But it doesn't hurt either.

@roeap roeap force-pushed the partition-values branch 4 times, most recently from 4d228b4 to 7c26fd4 Compare December 16, 2023 17:36
@roeap roeap changed the title [WIP] feat: handle partition values [WIP] leverage more engine capabilities in data skipping Dec 16, 2023
@roeap roeap force-pushed the partition-values branch 2 times, most recently from f3d18dc to d14eae8 Compare December 16, 2023 18:01
@roeap roeap marked this pull request as ready for review December 16, 2023 18:27
@roeap roeap force-pushed the partition-values branch 6 times, most recently from 039561b to 8eefc31 Compare December 17, 2023 13:18
@roeap roeap changed the title [WIP] leverage more engine capabilities in data skipping [WIP] leverage more engine capabilities in data skipping 2/n Dec 17, 2023
kernel/src/client/expression.rs Outdated Show resolved Hide resolved
kernel/src/client/expression.rs Show resolved Hide resolved
kernel/src/client/expression.rs Show resolved Hide resolved
kernel/src/client/expression.rs Outdated Show resolved Hide resolved
kernel/src/client/expression.rs Outdated Show resolved Hide resolved
Comment on lines 224 to 225
.as_any()
.downcast_ref::<StructArray>()
.ok_or(Error::UnexpectedColumnType("Unexpected array type".into()))?
.into()
Collaborator

Aside: This casting idiom seems to show up a lot, and it is very bloaty.
Any thoughts on how we might be able to factor the bloat out into a helper of some kind?

Collaborator Author

@roeap roeap Feb 1, 2024

we can do the same as you recently did for expressions etc. and define helpers

impl Error {
    pub fn unexpected_column_type(msg: impl Into<String>) -> Self ...
}

this would also allow us to harmonize messages, which IIRC are still inconsistent:

impl Error {
    pub fn unexpected_column_type(expected: &DataType, found: &DataType) -> Self ...
}

It seems we should do a pass through the errors soon anyhow, since the internal errors still have variants for arrow / parquet, and we should follow up on the discussion you started in slack. Maybe something for today's sync?

Collaborator

You can make it a little cleaner with:

use arrow_array::cast::AsArray;
...
                evaluator
                        .evaluate(&batch)?
                        .as_struct_opt()
                        .ok_or(Error::UnexpectedColumnType("Unexpected array type".into()))?
                        .into()

Collaborator Author

Done, there is likely some more opportunity to simplify things with AsArray as we move forward.

fn get_partition_value(raw: &str, data_type: &DataType) -> DeltaResult<Scalar> {
match data_type {
DataType::Primitive(primitive) => primitive.parse_scalar(raw),
_ => todo!(),
Collaborator

Is it really a TODO? AFAIK the spec only allows primitive values? If anything, Primitive might be too permissive, if the spec fails to mention some primitive type we support?

Or does the spec require us to support some non-primitive scalar types? If so, it might be helpful to spec them out with individual todo!() clauses, and a catch-all clause that only errors out for the remaining unknown and/or unsupported types?

Collaborator

Update: A quick comparison of schema.rs vs. partition value spec suggests that we have a near-perfect match. Just missing the TimestampNTZ type on the rust side, which is a table feature and so arguably ok to leave as TODO for now. All other types should be rejected outright.

Collaborator

so then... nit: make this return an Error instead of the todo!() panic :)
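
The suggested shape might look roughly like this; the stub enums below stand in for the kernel's DataType / Scalar types and are illustrative only:

```rust
// Stub enums standing in for the kernel's DataType / Scalar (illustrative only).
#[derive(Debug, Clone, PartialEq)]
enum PrimitiveType { String, Integer }
#[derive(Debug, Clone, PartialEq)]
enum DataType { Primitive(PrimitiveType), Array, Struct }
#[derive(Debug, PartialEq)]
enum Scalar { String(String), Integer(i32) }

// Suggested shape: reject non-primitive partition types with an error
// instead of a todo!() panic.
fn get_partition_value(raw: &str, data_type: &DataType) -> Result<Scalar, String> {
    match data_type {
        DataType::Primitive(PrimitiveType::String) => Ok(Scalar::String(raw.to_string())),
        DataType::Primitive(PrimitiveType::Integer) => raw
            .parse::<i32>()
            .map(Scalar::Integer)
            .map_err(|_| format!("cannot parse {raw:?} as integer")),
        other => Err(format!("unsupported partition data type: {other:?}")),
    }
}
```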

Collaborator Author

done

kernel/src/schema.rs Outdated Show resolved Hide resolved
kernel/src/schema.rs Show resolved Hide resolved
Collaborator Author

@roeap roeap left a comment

thanks for the review @ryan-johnson-databricks - not sure how you feel, but it seems the data type / schema validation is a bigger thing in itself that deserves a dedicated discussion (and with that a PR), and we would focus here on the null-if thing as well as the other comments?

.downcast_ref::<BooleanArray>()
.ok_or(Error::UnexpectedColumnType(
"Expected type 'BooleanArray'.".into(),
))?;

let before_count = actions.num_rows();
let after = filter_record_batch(actions, skipping_vector)?;
Collaborator Author

@ryan-johnson-databricks - the only larger thing where we spill arrow into data skipping is the filtering.

IIRC, you mentioned that we may want to introduce a dedicated API to apply filter vectors to data? Related to that, do we have a plan yet for how we create an engine-specific filter vector from the deletion vector?

Collaborator Author

this comment seems to have been in pending state since december - i certainly did not add it now 😆

Collaborator

This seems like a good one to double check w/ kernel-jvm folks at our next Thursday sync. At least at one point they were passing around (selection vector, columnar batch) pairs, and letting engine combine those if it wanted. I don't know what they did for creating the DV selection vector -- we should check that as well -- but I favor creating a boolean array directly, and asking engine to copy that to whatever internal format it likes.
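
For concreteness, "creating a boolean array directly" from a deletion vector could look like this std-only sketch; the function name and the flat-list DV representation are assumptions, not the kernel's API:

```rust
use std::collections::HashSet;

// Hypothetical sketch: derive a boolean selection vector (true = keep the row)
// from a deletion vector modeled as a list of deleted row indexes. The engine
// can then copy this into whatever internal format it likes.
fn selection_vector(deleted_rows: &[u64], num_rows: u64) -> Vec<bool> {
    let deleted: HashSet<u64> = deleted_rows.iter().copied().collect();
    (0..num_rows).map(|i| !deleted.contains(&i)).collect()
}
```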

kernel/src/client/expression.rs Show resolved Hide resolved
kernel/src/client/expression.rs Show resolved Hide resolved
kernel/src/client/expression.rs Outdated Show resolved Hide resolved
BinaryOperation { op, left, right } => {
let left_arr = evaluate_expression(left.as_ref(), batch)?;
let right_arr = evaluate_expression(right.as_ref(), batch)?;
(BinaryOperation { op, left, right }, _) => {
Collaborator Author

hmm - i guess the first question to answer is whether we want to include some kind of casting here as well and allow e.g. float + int, in which case i would probably take the "biggest" type of left / right and make sure the result is of that type; otherwise probably left == right == result? not sure, but i think arrow will always require inputs of the same type.

in case of comparisons the result must be boolean, but we may proactively check equality of the input types.

this may raise the question - if we want to validate a lot of things, and not let arrow raise for us, we may want to consider splitting comparisons and arithmetics. but not totally sure that that's worth it.

Comment on lines 104 to 105
// TODO how to model required functions?
NullIf {
Collaborator Author

at the time i felt that the operations were modelled consistently, but with adding additional functions that don't fit in these categories, I at least wanted to discuss if we just keep adding a new variant per function (if we even need more), or should have some additional grouping ...

Comment on lines 127 to 121
static ref FILTER_EXPR: Expr = Expr::is_null(Expr::null_if(
Expr::column("predicate"),
Expr::column("predicate"),
));
Collaborator Author

both options also have the benefit of us not having to pre-compute "predicate" as a separate batch ... out of those I think I like distinct better, due to it being (at least i think so :)) more directly what we want, in contrast to coalesce, which if i read that right ...

Returns the data type of expression with the highest data type precedence.

... is also kind of "dynamic" in its return type and allows for mixed inputs.

somehow I also keep wondering if and_kleene might help as well, do you think that's worth looking into?

let reducer = match op {
VariadicOperator::And => and,
VariadicOperator::Or => or,
};
exprs
.iter()
.map(|expr| evaluate_expression(expr, batch))
.map(|expr| evaluate_expression(expr, batch, Some(&DataType::BOOLEAN)))
Collaborator Author

i think yes ...

kernel/src/client/expression.rs Outdated Show resolved Hide resolved
@ryan-johnson-databricks
Collaborator

it seems the data type / schema validation is a bigger thing in itself, that deserves a dedicated discussion (and with that PR) and we would focus here on the null-if thig as well as the other comments?

For sure. Sorry if that wasn't clear from my comments.

@roeap roeap changed the title [WIP] leverage more engine capabilities in data skipping 2/n Leverage more engine capabilities in data skipping 2/n Feb 1, 2024
kernel/src/client/expression.rs Outdated Show resolved Hide resolved
kernel/src/client/conversion.rs Outdated Show resolved Hide resolved
kernel/src/client/expression.rs Outdated Show resolved Hide resolved
}
(Distinct { lhs, rhs }, Some(&DataType::BOOLEAN)) => {
Collaborator

Why a new top-level expression type? Can Distinct to be a new BinaryOperator instead?

        (BinaryOperation { op, left, right }, _) => {
              ...
            let eval: Operation = match op {
                  ... 
                Equal => |l, r| eq(l, r).map(wrap_comparison_result),
                NotEqual => |l, r| neq(l, r).map(wrap_comparison_result),
+               Distinct => |l, r| distinct(l, r).map(wrap_comparison_result),
            };

(bonus: whatever type checking we eventually add would then benefit all comparison operators)
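
For reference, the null-safe semantics such a distinct kernel needs (SQL's IS DISTINCT FROM) can be sketched over Option values; this is an illustration of the semantics, not arrow's actual distinct implementation:

```rust
// IS DISTINCT FROM semantics sketched over Option<T>: unlike `!=`, two nulls
// compare as NOT distinct, and null vs. non-null compares as distinct.
fn is_distinct<T: PartialEq>(l: Option<T>, r: Option<T>) -> bool {
    match (l, r) {
        (None, None) => false,
        (None, Some(_)) | (Some(_), None) => true,
        (Some(a), Some(b)) => a != b,
    }
}
```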

Collaborator

Aside: In a future PR that adds better type checking, should we introduce a ComparisonOperator sub-enum, for things that map (T, T) -> bool? And if we did that, should we also add an AlgebraicOperator (**) sub-enum, for things that map (T, T) -> T? That would capture the vast majority of binary operations in a structured way, while still allowing to add arbitrary other binary operators if needed (***)?

Edit: In retrospect, this seems very related to your question #83 (comment)

(**) According to Wikipedia,

An algebraic operation may also be defined simply as a function from a Cartesian power of a set to the same set.

(***) Perhaps ironically, arrow's nullif function is one such operator
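
The sub-enum split floated above might look roughly like this (hypothetical names and variants, not the kernel's current enum):

```rust
// Hypothetical grouping of binary operators by signature, as discussed above.
#[derive(Debug, PartialEq)]
enum ComparisonOperator { Equal, NotEqual, LessThan, Distinct } // (T, T) -> bool
#[derive(Debug, PartialEq)]
enum AlgebraicOperator { Plus, Minus, Multiply, Divide }        // (T, T) -> T
#[derive(Debug, PartialEq)]
enum BinaryOperator {
    Comparison(ComparisonOperator),
    Algebraic(AlgebraicOperator),
    NullIf, // room for "arbitrary other" binary operators like arrow's nullif
}

// With the grouping, result-type checks fall out per category.
fn returns_boolean(op: &BinaryOperator) -> bool {
    matches!(op, BinaryOperator::Comparison(_))
}
```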

kernel/src/client/expression.rs Outdated Show resolved Hide resolved
}
}

fn read_from_json<R: BufRead>(
Collaborator

Does this combo of read_from_json and get_reader solve the hack parsing we used to have?
Or just move it?
(the new logic looks a lot more complex, trying to figure out why)

Collaborator

Actually, I'm having trouble understanding how the code works.
In particular, what happens if (when?) buffer read boundaries don't match json record boundaries?

Collaborator Author

Does this combo of read_from_json and get_reader solve the hack parsing we used to have? Or just move it?

I hope it does, or at least improve the situation, see also comment below.

Not sure if this is what you are referring to, but the main thing to wrap my head around here was that the decode function will return once it has filled a batch of size batch_size, which may or may not have consumed the whole buffer?

At least from the docs it seemed that the decoder can handle seeing incomplete data. Not sure though if that also holds true when we flush; at that point we should have always consumed the whole (reader) buffer, so I guess that would be an error in the input data?

Collaborator

Yeah, I would:

  1. Add a comment below that states that the closure either reads batch_size or until the buffer is empty
  2. add a comment below that decoded != read implies that we read more data into buf than a single batch, so we return and leave data in the buffer that will be handled in the next time the closure is called. perhaps also note that the data is left in the reader, so it will be in the buf again when we call fill_buf because we only consumed the decoded amount.

this could also maybe be more clear if you just made a mut res = vec!() and then looped and appended to it. I think the collect into a vec won't be much more efficient anyway.
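
The fill_buf/consume interplay described in these comments can be sketched std-only; `Decoder` below is a toy stand-in for arrow's JSON Decoder (here one '\n'-terminated row per record, counting rows instead of building batches), not its real API:

```rust
use std::io::BufRead;

// Toy stand-in for arrow's json Decoder: consumes bytes until it has buffered
// `batch_size` rows, then stops even if bytes remain.
struct Decoder { batch_size: usize, rows: usize }

impl Decoder {
    // Returns how many bytes of `buf` were decoded; may be less than buf.len().
    fn decode(&mut self, buf: &[u8]) -> usize {
        let mut used = 0;
        for &b in buf {
            if self.rows == self.batch_size { break; }
            used += 1;
            if b == b'\n' { self.rows += 1; }
        }
        used
    }
    // Emit whatever rows are buffered (a batch size here, in place of a RecordBatch).
    fn flush(&mut self) -> Option<usize> {
        let n = std::mem::replace(&mut self.rows, 0);
        (n > 0).then_some(n)
    }
}

fn read_batches<R: BufRead>(mut reader: R, batch_size: usize) -> std::io::Result<Vec<usize>> {
    let mut decoder = Decoder { batch_size, rows: 0 };
    let mut batches = Vec::new();
    loop {
        let buf = reader.fill_buf()?;
        if buf.is_empty() { break; }
        let decoded = decoder.decode(buf);
        // decoded < buf.len() means the decoder filled a batch before draining
        // the buffer; because we only consume `decoded` bytes, the remainder
        // stays in the reader and shows up again on the next fill_buf call.
        reader.consume(decoded);
        if decoder.rows == batch_size {
            batches.extend(decoder.flush());
        }
    }
    // a final partial batch may remain once the reader is exhausted
    batches.extend(decoder.flush());
    Ok(batches)
}
```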

let columns = schema
.fields
.iter()
.map(|field| new_null_array(field.data_type(), null_count))
Collaborator

Doesn't new_null_array handle complex types? If so, why do we need to map over individual fields?
I think we just need to convert the schema to a DataType::Struct, perhaps via From for Fields? Tho in typical arrow-rust fashion, top-level vs. nested is just different enough that it might not be worth the trouble...

Collaborator Author

I think it's like you said ...

top-level vs. nested is just different enough that it might not be worth the trouble

... or at least I thought so.

To create a RecordBatch we need a Vec of columns, so we do make use of new_null_array's ability to create complex types, but as it always creates a single array, we invoke it for every top-level field.

Once we move to passing EngineData around, I thought about not using record batches anymore, but rather doing everything via ArrayRef, casting to StructArray where we have RecordBatch right now. This would have the benefit of having just one type we pass around, rather than two - e.g. the ExpressionEvaluator takes a RecordBatch and returns an ArrayRef.

Collaborator

I like the idea of just using ArrayRef everywhere.

Given that RecordBatch implements From<StructArray> and StructArray implements From<RecordBatch>, it would seem arrow-rust at least tacitly recognizes the redundancy as well.

let mut value_count = 0;
let mut value_start = 0;

for it in 0..json_strings.len() {
Collaborator

This logic is probably correct, but seems hard to grok and maintain. Is there any way we could simplify it?

If I'm not mistaken, the loop is basically breaking the input array into alternating null and non-null segments, and then replacing each segment with either its parsed result or nulls?

Suggested change
for it in 0..json_strings.len() {
// Early out here, because the loop below can't handle an empty batch
if (... empty ...) return ...;
// Algo: Start a run that includes only element 0. Keep adding to the run as long
// as the "polarity" (null vs. non-null) matches. Upon encountering a polarity change,
// emit the previous run and start a new run with the new polarity. When the loop
// exits, we just need to emit the final (possibly also first) run and we're done!
let schema_as_struct = /* see other PR comment */;
let mut mark = 0;
let mut run_is_null = json_strings.is_null(mark);
// I forgot the magic incantation for inner functions that capture state...
fn emit_run|...|(...) {
    if run_is_null {
        insert_nulls(&mut batches, it - mark, &schema_as_struct);
    } else {
        // ... parse and insert the run of json values
    }
}
for it in 1..json_strings.len() {
    let value_is_null = json_strings.is_null(it);
    if run_is_null != value_is_null {
        // polarity change! emit the previous run
        emit_run(...);
        run_is_null = value_is_null;
        mark = it;
    }
}
emit_run(...);

Collaborator

Also: It would be helpful to explain why we go through such care to parse each non-null array segment as a group, instead of one-at-a-time or all-at-once?

I suspect we don't want to parse single values because it's expensive to fire up the parsing machinery and we want to amortize the cost as much as possible.

We also can't "just" parse the raw array as-is because the Arrow spec for variable-sized binary data says that, although the offsets for entries must be monotonically increasing,

It should be noted that a null value may have a positive slot length. That is, a null value may occupy a non-empty memory space in the data buffer. When this is true, the content of the corresponding memory space is undefined.

... so we can only safely parse contiguous chunks of non-null values.

That said, given that we anyway have to feed values into the parser's buffer before consuming them, I wonder if we could instantiate the parsing machinery just once, and feed values to it as needed? If so, it should cost about the same to parse one at a time in a simple for-loop. As a bonus, it would also harden our implementation against the worst case where nulls and non-nulls alternate with run sizes all 1.
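
The run-splitting the loop performs can be isolated into a small std-only helper; modeling values as Option<&str> here is purely for illustration (the real code walks an arrow string array's null buffer):

```rust
// Break a sequence of nullable values into maximal runs of the same
// "polarity" (null vs. non-null), returning (is_null, run_length) pairs.
fn split_runs(values: &[Option<&str>]) -> Vec<(bool, usize)> {
    let mut runs: Vec<(bool, usize)> = Vec::new();
    for v in values {
        let is_null = v.is_none();
        match runs.last_mut() {
            // same polarity as the current run: extend it
            Some((null, len)) if *null == is_null => *len += 1,
            // polarity change (or first element): start a new run
            _ => runs.push((is_null, 1)),
        }
    }
    runs
}
```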

Collaborator Author

Yeah, this part of the code I was never really happy with.

A main difference to hack_parse is that we would create a new decoder (via a higher-level API) for each row and, in addition, have every row represented as a &str and convert back to bytes before passing to the parser. So I think our main gain here is to create a single decoder, and always operate on the raw bytes without creating an intermediary string. Although I think the conversion is quite optimized, since arrow string arrays are guaranteed utf8 encoded, and i think they can omit some checks.

As for the loops ... The decoder will not give us back null rows where the input value is null, which is why we have to fill these in. From that I thought we do have to keep track of the null runs at the very least. I guess instantiating the reader / cursor should not be too expensive and we could pass each valid row immediately to the decoder. However, this comes at a price as well, since the decoder will emit a new batch once the internal buffer has reached batch_size number of rows, and we do have to flush whenever we switch polarity to get the matching non-null / null layout. Here I felt just tracking the run might be easier?

I'll try and simplify a bit using emit_run, but see no clear way yet to make it fully clean. Then again, during your reviews we already eliminated a lot of complexity before, so let's see :).

Collaborator Author

@roeap roeap Feb 1, 2024

One thing maybe worth mentioning: the arrow Decoder seems to implement a fairly sophisticated algorithm which tries to vectorize the parsing - "inspired by" simdjson's approach.

Without having actually looked at the internals, i felt this might benefit from receiving larger chunks of data at once, so that some of these optimizations can take effect?

Collaborator

Yeah, probably our best hope is to keep tracking runs, but simplify the run-tracking code as suggested above. It would still behave poorly in the worst case, but for Delta metadata reads, we expect either long runs of nulls (non-file actions), or long runs of values (file actions) and so the worst case should be super rare.

Comment on lines 186 to 189
fn str_parse_scalar<T: std::str::FromStr>(
&self,
raw: &str,
f: impl FnOnce(T) -> Scalar,
Collaborator

This took a while to unpack. If I understand correctly we're passing some Scalar variant's (implicit) constructor function as the FnOnce here? So e.g. a caller who passes Scalar::Double sets T: f64 (inferred from the constructor's argument type), leverages FromStr for f64 to convert string to f64, and then the result is passed to the constructor?

(maybe a code comment explaining the magic could help)
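
The inference at work can be seen in a self-contained toy; `Scalar` below is a two-variant stand-in for the kernel's enum, and the error type is simplified to String:

```rust
// Passing the variant constructor `Scalar::Double` as `f` fixes T = f64,
// so `raw.parse()` resolves to f64's FromStr impl before `f` wraps the result.
#[derive(Debug, PartialEq)]
enum Scalar { Double(f64), Integer(i32) }

fn str_parse_scalar<T: std::str::FromStr>(
    raw: &str,
    f: impl FnOnce(T) -> Scalar,
) -> Result<Scalar, String> {
    raw.parse().map(f).map_err(|_| format!("cannot parse {raw:?}"))
}
```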

Collaborator

Yeah, agree on a comment.

Also nit: Just call this parse_scalar

Collaborator Author

that name is already taken :)

Collaborator

parse_scalar_impl, to make clear it's a helper for parse_scalar?
(the current str_parse_scalar name carries no intuition for me, at least)

Collaborator

hah right. well, not to bikeshed too much. I'd suggest parse_str_as_scalar then, but fine as is too.

a comment explaining what it does would be great.

Collaborator Author

done.

kernel/src/scan/mod.rs Outdated Show resolved Hide resolved
Collaborator

@ryan-johnson-databricks ryan-johnson-databricks left a comment

Forgot to add:

This PR updated the data skipping logic to better leverage the engines capabilities and reduce our arrow-exposure in core kernel. This best reviewed commit by commit as it permeates quite far. If more helpful, I can also split up this PR via these commits.

Splitting is usually good... but I don't know how much the three overlap? If it's just one giant mess of conflicts, maybe not?

This seems out of date?

  • Add new NULLIF expression
  • Evaluate is_null(null_if(..)) logic via expression evaluator

@@ -109,6 +123,78 @@ impl From<String> for Scalar {

// TODO: add more From impls
Collaborator

qq: We currently have From for i32 and i64, but we have scalars for i8 and i16. If somebody says Scalar::from(10u8) will they get Scalar::Integer or a compiler error?

Collaborator Author

right now this would get a compile error, but i just added the missing implementations, so now you would get Scalar::Byte.
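
The added impls presumably look like the following sketch (a minimal Scalar subset, variant names assumed from the discussion):

```rust
// Minimal sketch of From impls across the integer widths, so that e.g.
// Scalar::from(10i8) resolves to Scalar::Byte instead of a compile error.
#[derive(Debug, PartialEq)]
enum Scalar { Byte(i8), Short(i16), Integer(i32), Long(i64) }

impl From<i8> for Scalar { fn from(v: i8) -> Self { Scalar::Byte(v) } }
impl From<i16> for Scalar { fn from(v: i16) -> Self { Scalar::Short(v) } }
impl From<i32> for Scalar { fn from(v: i32) -> Self { Scalar::Integer(v) } }
impl From<i64> for Scalar { fn from(v: i64) -> Self { Scalar::Long(v) } }
```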

Comment on lines 186 to 189
fn str_parse_scalar<T: std::str::FromStr>(
&self,
raw: &str,
f: impl FnOnce(T) -> Scalar,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agree on a comment.

Also nit: Just call this parse_scalar

Comment on lines 224 to 225
.as_any()
.downcast_ref::<StructArray>()
.ok_or(Error::UnexpectedColumnType("Unexpected array type".into()))?
.into()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can make it a little cleaner with:

use arrow_array::cast::AsArray;
...
                evaluator
                        .evaluate(&batch)?
                        .as_struct_opt()
                        .ok_or(Error::UnexpectedColumnType("Unexpected array type".into()))?
                        .into()

fn get_partition_value(raw: &str, data_type: &DataType) -> DeltaResult<Scalar> {
match data_type {
DataType::Primitive(primitive) => primitive.parse_scalar(raw),
_ => todo!(),
Collaborator:

so then... nit: make this return an Error instead of the todo!() panic :)

}
}

fn read_from_json<R: BufRead>(
Collaborator:

Yeah, I would:

  1. Add a comment below that states that the closure either reads batch_size or until the buffer is empty
  2. add a comment below that decoded != read implies that we read more data into buf than a single batch, so we return and leave data in the buffer that will be handled in the next time the closure is called. perhaps also note that the data is left in the reader, so it will be in the buf again when we call fill_buf because we only consumed the decoded amount.

this could also maybe be more clear if you just made a mut res = vec!() and then looped and appended to it. I think the collect into a vec won't be much more efficient anyway.
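The buffering invariant described in point 2 can be shown with a std-only sketch (hypothetical `read_batches` helper; the real code hands the buffer to arrow's JSON `Decoder`, which reports how many bytes it accepted):

```rust
// Sketch of the read_from_json buffering loop. The key invariant: only
// `consume(decoded)` bytes are removed from the reader, so any bytes the
// decoder did not accept stay in the buffer and reappear on the next
// fill_buf() call.
use std::io::{BufRead, Cursor};

fn read_batches<R: BufRead>(mut reader: R, batch_size: usize) -> Vec<Vec<u8>> {
    let mut batches = Vec::new();
    loop {
        let buf = reader.fill_buf().expect("read failed");
        if buf.is_empty() {
            break; // reader exhausted
        }
        // A real decoder would parse rows here; we simply accept at most
        // `batch_size` bytes per batch.
        let decoded = buf.len().min(batch_size);
        batches.push(buf[..decoded].to_vec());
        // decoded < buf.len() means leftover bytes remain buffered and are
        // handled on the next iteration of the loop.
        reader.consume(decoded);
    }
    batches
}

fn main() {
    let batches = read_batches(Cursor::new(b"abcdefgh".to_vec()), 3);
    assert_eq!(batches, vec![b"abc".to_vec(), b"def".to_vec(), b"gh".to_vec()]);
    println!("{} batches", batches.len());
}
```

This also illustrates the `mut res = vec![]` suggestion: accumulating into a mutable `Vec` in the loop keeps the partial-batch handoff explicit.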

kernel/src/client/json.rs (resolved thread)
kernel/Cargo.toml (resolved thread)
kernel/src/scan/data_skipping.rs (outdated, resolved thread)
@ryan-johnson-databricks (Collaborator) left a comment:

I think it's ~all nits at this point, except #83 (comment); and even that can potentially be fixed as a follow-up to unblock this PR.

Comment on lines 220 to 221
"Variadic {expression:?} is expected to return boolean results, got {:?}",
result_type
Collaborator:

why not:

Suggested change
"Variadic {expression:?} is expected to return boolean results, got {:?}",
result_type
"Variadic {expression:?} should return a boolean result, got {result_type:?}"

Collaborator Author:

b/c it was late :D, fixed.

data_type: &DataType,
) -> DeltaResult<Scalar> {
match raw {
None | Some(None) => Ok(Scalar::Null(data_type.clone())),
Collaborator:

Should we just make this the second case and use a match-all?

        _ => Ok(Scalar::Null(data_type.clone())),

.map(|f| <ArrowField as TryFrom<&StructField>>::try_from(*f))
.collect::<Result<Vec<ArrowField>, ArrowError>>()?;

let fields: Vec<ArrowField> = s.fields().map(TryInto::try_into).try_collect()?;
Collaborator:

I think type inference would allow just:

Suggested change
let fields: Vec<ArrowField> = s.fields().map(TryInto::try_into).try_collect()?;
Ok(ArrowSchema::new(s.fields().map(TryInto::try_into).try_collect()?))

Collaborator Author:

Needed this ...

Ok(ArrowSchema::new(
    s.fields()
        .map(TryInto::try_into)
        .try_collect::<_, Vec<ArrowField>, _>()?,
))

... which fmt wants on three lines, so thought the current way is a little more concise?

Collaborator:

Agree current way is better. I guess could also do Vec<_>, since AFAIK that's what the type inference can't figure out on its own (not sure why). But type clarity is also good, so probably we should leave it as-is.

Comment on lines 68 to 69
.collect::<Vec<_>>()
.into_iter()
Collaborator:

Maybe this is just copied code, but it seems redundant to call collect and into_iter back to back like this?

Comment on lines +75 to +78
stats_schema
.fields
.iter()
.map(|field| new_null_array(field.data_type(), 1))
Collaborator:

I think we discussed this somewhere else, but just confirming: arrow for some reason treats "schema" and "struct" as somehow different concepts, with no obviously easy way to convert between them, so we have to manually build up a struct here, rather than passing the schema directly?

Collaborator:

Tho looking at the docs, it seems like this might work?

Suggested change
stats_schema
.fields
.iter()
.map(|field| new_null_array(field.data_type(), 1))
new_null_array(DataType::Struct(stats_schema.fields.clone()), 1)

https://arrow.apache.org/rust/arrow_schema/struct.Schema.html
https://arrow.apache.org/rust/arrow_schema/fields/struct.Fields.html
https://arrow.apache.org/rust/arrow_schema/enum.DataType.html#variant.Struct

Collaborator Author:

Turns out it almost works, but trying to do

let arr = new_null_array(&DataType::Struct(stats_schema.fields.clone()), 1);
Ok(arr.as_struct().into())

leads to runtime errors since it will make top level fields nullable, which the RecordBatch does not allow. Once we move to only moving ArrayRefs around we should be able to make this change though.

Collaborator:

Oh... this comes down to the difference between null struct (= definitely not allowed nor even sensible for a record batch), vs. struct whose fields are all null (allowed, and what your code was doing). We probably want to keep the struct-of-nulls behavior even after moving to ArrayRef, because otherwise we'd have to check whether the whole thing is null before accessing any of its columns?

Collaborator Author:

yes, as you said :)

Comment on lines 197 to 199
.ok_or(Error::UnexpectedColumnType(
"Expected type 'StructArray'.".into(),
))?
Collaborator:

nit

Suggested change
.ok_or(Error::UnexpectedColumnType(
"Expected type 'StructArray'.".into(),
))?
.ok_or(Error::unexpected_column_type("Expected type 'StructArray'."))?

Comment on lines 150 to 156
.filter(|f| {
!self
.snapshot
.metadata()
.partition_columns
.contains(f.name())
})
Collaborator:

nit: Is that a potentially expensive inner loop access? I wonder if it might be easier to grok (as well as cheaper) by capturing a variable instead?

        let partition_columns = self.snapshot.metadata().partition_columns;

and then

Suggested change
.filter(|f| {
!self
.snapshot
.metadata()
.partition_columns
.contains(f.name())
})
.filter(|f| !partition_columns.contains(f.name())

Collaborator:

(especially since we could reuse partition_columns at L161 below)

kernel/src/scan/mod.rs (outdated, resolved thread)
Comment on lines 217 to 218
(VariadicOperation { .. }, _) => {
// NOTE: If we get here, it would be a bug in our code. However it does swallow
Collaborator:

Actually, would something like this work?

(VariadicOperation { op, exprs }, None | Some(&DataType::BOOLEAN)) =>  {

kernel/src/client/json.rs (resolved thread)
Comment on lines 218 to 219
// NOTE: If we get here, it would be a bug in our code. However it does swallow
// the error message from the compiler if we add variants to the enum and forget to add them here.
Collaborator:

I believe the note is no longer accurate? Now shouldn't swallow any compile-time errors for new variants, because the match at L202-205 would become incomplete, and the generic case only applies if the caller passed Some incompatible data type?
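The compile-time-vs-runtime trade-off behind that note can be shown with a std-only sketch (toy `Op` enum, not the kernel's `Expression` type):

```rust
// A catch-all arm compiles even after a new enum variant is added, silently
// routing it to the runtime error path; removing the catch-all makes the
// match non-exhaustive, so adding a variant becomes a compile error instead.
#[derive(Debug)]
enum Op {
    And,
    Or,
    // Imagine `Xor` gets added later...
}

fn eval(op: &Op, a: bool, b: bool) -> Result<bool, String> {
    match op {
        Op::And => Ok(a && b),
        Op::Or => Ok(a || b),
        // This arm is what "swallows" the compiler's exhaustiveness check:
        // a future `Op::Xor` would land here at runtime instead of failing
        // to compile.
        #[allow(unreachable_patterns)]
        _ => Err(format!("unsupported op: {op:?}")),
    }
}

fn main() {
    assert_eq!(eval(&Op::And, true, false), Ok(false));
    assert_eq!(eval(&Op::Or, true, false), Ok(true));
    println!("ok");
}
```

Restricting the catch-all to the incompatible-data-type case, as suggested above, keeps the runtime error for callers while letting the compiler flag genuinely missing variants.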

@@ -106,7 +104,7 @@ impl<E: TaskExecutor> JsonHandler for DefaultJsonHandler<E> {
json_strings
.iter()
.map(|json_string| hack_parse(&output_schema, json_string))
.collect::<Result<Vec<_>, _>>()?
Collaborator:

Another case where type inference isn't working nicely.
I think the original code worked better, with "only" two underscores instead of three...

Comment on lines 280 to 283
Self::BinaryOperation {
op: BinaryOperator::Distinct,
left: Box::new(self),
right: Box::new(other),
Collaborator:

Suggested change
Self::BinaryOperation {
op: BinaryOperator::Distinct,
left: Box::new(self),
right: Box::new(other),
Self::binary(BinaryOperator::Distinct, self, other)

Comment on lines 197 to 199
.ok_or(Error::unexpected_column_type(
"Expected type 'StructArray'.",
))?
Collaborator:

A bit surprising that this doesn't fit on one line, but rustfmt does what it does, I guess?

@roeap (Collaborator, Author) left a comment:

just some pending answers.

@nicklan (Collaborator) left a comment:

this LGTM. Thanks so much!

There are a few open suggestions from Ryan that should probably be applied or resolved (if you don't want to apply them for some reason), but they are mostly minor so... Approved!

json_strings
.iter()
.map(|json_string| hack_parse(&output_schema, json_string))
.collect::<Result<Vec<_>, _>>()?
Collaborator:

If we really want to simplify this:

        let output: Vec<_> =
            json_strings
                .iter()
                .map(|json_string| hack_parse(&output_schema, json_string))
                .try_collect()?;
        Ok(concat_batches(&output_schema, output.iter())?)

@roeap roeap merged commit 5f48dea into delta-incubator:main Feb 15, 2024
3 checks passed
@roeap roeap deleted the partition-values branch February 15, 2024 18:59

Successfully merging this pull request may close these issues:

  • Support adding back in partitionColumns
  • Support reading partitioned tables