Datum based comparison kernels (#4596) #4701

Merged
merged 11 commits into apache:master on Aug 18, 2023

tustvold
Contributor

Which issue does this PR close?

Closes #4596

Rationale for this change

Adds datum based comparison kernels, deprecating the old kernels and drastically reducing the amount of generated code.

Benchmarks show no major performance regressions and, if anything, some non-trivial improvements.

Additionally a release build on master with dyn_cmp_dict takes

________________________________________________________
Executed in   39.60 secs    fish           external
   usr time  128.19 secs  583.00 micros  128.19 secs
   sys time    2.71 secs  102.00 micros    2.71 secs

But with the changes in this PR takes

________________________________________________________
Executed in   15.87 secs    fish           external
   usr time   60.84 secs    0.00 micros   60.84 secs
   sys time    2.28 secs  643.00 micros    2.28 secs
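
To illustrate what "datum based" means here, the following is a minimal std-only sketch, not the code from this PR. It assumes a datum is either a full array or a single value that is broadcast to the other side's length. The `Datum` trait and `Scalar` wrapper below are simplified stand-ins over `i32` slices, written for illustration only; the real types operate on Arrow arrays.

```rust
/// Illustrative stand-in for an array-or-scalar input (assumption, not the real trait).
pub trait Datum {
    /// Returns the backing values and whether they represent a single scalar.
    fn get(&self) -> (&[i32], bool);
}

impl Datum for Vec<i32> {
    fn get(&self) -> (&[i32], bool) {
        (self.as_slice(), false)
    }
}

/// Wraps a one-element buffer and marks it as a scalar.
pub struct Scalar(pub [i32; 1]);

impl Datum for Scalar {
    fn get(&self) -> (&[i32], bool) {
        (&self.0[..], true)
    }
}

/// One entry point covers array-vs-array, array-vs-scalar and scalar-vs-scalar.
pub fn eq(lhs: &dyn Datum, rhs: &dyn Datum) -> Result<Vec<bool>, String> {
    let (l, l_scalar) = lhs.get();
    let (r, r_scalar) = rhs.get();
    let len = match (l_scalar, r_scalar) {
        (true, true) => 1,
        (true, false) => r.len(),
        (false, true) => l.len(),
        (false, false) if l.len() == r.len() => l.len(),
        (false, false) => return Err("length mismatch".to_string()),
    };
    // A scalar always reads index 0; an array reads the current row.
    let at = |v: &[i32], scalar: bool, i: usize| if scalar { v[0] } else { v[i] };
    Ok((0..len)
        .map(|i| at(l, l_scalar, i) == at(r, r_scalar, i))
        .collect())
}
```

With this shape a single `eq` replaces separate array and scalar entry points, which is one plausible reason the amount of generated code drops.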

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 15, 2023

[features]
dyn_cmp_dict = []
simd = ["arrow-array/simd"]
Contributor Author

The SIMD variants added a non-trivial amount of code complexity and behaved differently, so I opted to simply remove them. We can always add back some sort of SIMD specialization for primitives in future should users feel strongly.

Contributor

As I recall, for other kernels we found that properly written Rust code would be vectorized by LLVM more effectively than our hand-rolled SIMD kernels. Do you think that is still the case?

cc @jhorstmann

Contributor Author

@tustvold tustvold Aug 16, 2023

Empirically that is not the case here: LLVM is really bad at optimising horizontal operations such as creating a packed bitmask from a comparison. The LLVM-generated kernels are ~2x slower.

Contributor

Correct, though I feel like LLVM has gotten a little better at this. On the other hand, the SIMD kernels also did not generate "perfect" code, because they combine multiple masks into a single u64 (except for u8 comparisons, which have 64 lanes).

I think the tradeoff against improved compile time is ok, and longer term this allows more code cleanups. The decimal types for example already had a quite hacky support for simd. The last simd usages then would be the aggregation kernels.
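
For readers unfamiliar with the "packed bitmask" discussed in this thread, here is a hedged scalar sketch of the operation. The name `collect_bool` and its `(len, neg, closure)` call shape are taken from the diff context later in this PR, but the body below is an assumption written for illustration: it returns plain `u64` words rather than a `BooleanBuffer`, and `lt_i32` is a hypothetical helper.

```rust
/// Packs `f(i)` for `i in 0..len` into u64 words, least-significant bit first.
/// When `neg` is true every result bit is inverted, so `neq` can reuse `eq`.
pub fn collect_bool(len: usize, neg: bool, f: impl Fn(usize) -> bool) -> Vec<u64> {
    let mut words = Vec::with_capacity((len + 63) / 64);
    for start in (0..len).step_by(64) {
        let lanes = (len - start).min(64);
        let mut packed = 0u64;
        for bit in 0..lanes {
            // Each comparison contributes one bit: this is the horizontal step.
            packed |= (f(start + bit) as u64) << bit;
        }
        if neg {
            // Invert, then clear the unused high bits of a partial last word.
            let mask = if lanes == 64 { u64::MAX } else { (1u64 << lanes) - 1 };
            packed = !packed & mask;
        }
        words.push(packed);
    }
    words
}

/// Hypothetical example kernel: element-wise `l < r` as a packed bitmask.
pub fn lt_i32(l: &[i32], r: &[i32]) -> Vec<u64> {
    assert_eq!(l.len(), r.len());
    collect_bool(l.len(), false, |i| l[i] < r[i])
}
```

The inner loop folds up to 64 independent comparison results into one word, which is the kind of cross-lane reduction the comments above say LLVM struggles to vectorize well.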

Ok(BooleanArray::new(values, nulls))
}

fn values(a: &dyn Array) -> (Option<Vec<usize>>, &dyn Array) {
Contributor Author

This formulation not only allows for mixed dictionary and non-dictionary arrays (as before), but also allows for mixed dictionary key sizes.

The cost of hydrating the keys in this way is irrelevant compared to the execution time of the kernels, and avoids a huge amount of codegen
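
A rough sketch of the key "hydration" idea, under stated assumptions: keys of any width are widened once to `usize`, after which the comparison no longer depends on the key type. The functions `hydrate` and `eq_dict` are hypothetical names, the sketch uses unsigned keys and string slices instead of Arrow arrays, and it ignores nulls entirely.

```rust
/// Widens keys of any supported width to `usize` so later code is key-type agnostic.
pub fn hydrate<K: Copy + Into<u64>>(keys: &[K]) -> Vec<usize> {
    keys.iter()
        .map(|&k| {
            let wide: u64 = k.into();
            wide as usize
        })
        .collect()
}

/// Compares two dictionary-encoded columns row by row through their hydrated keys.
pub fn eq_dict(
    l_keys: &[usize],
    l_values: &[&str],
    r_keys: &[usize],
    r_values: &[&str],
) -> Vec<bool> {
    assert_eq!(l_keys.len(), r_keys.len());
    l_keys
        .iter()
        .zip(r_keys)
        .map(|(&l, &r)| l_values[l] == r_values[r])
        .collect()
}
```

Because `eq_dict` only ever sees `usize` keys, a `u8`-keyed column and a `u32`-keyed column go through the same compiled code path instead of one instantiation per key-type pair.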

@tustvold
Contributor Author

Benchmarks

eq Float32              time:   [7.6371 µs 7.6427 µs 7.6492 µs]
                        change: [-17.269% -17.067% -16.928%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

eq scalar Float32       time:   [5.2425 µs 5.2445 µs 5.2466 µs]
                        change: [-2.5786% -2.3116% -2.0499%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

neq Float32             time:   [7.6537 µs 7.6596 µs 7.6663 µs]
                        change: [-25.233% -25.014% -24.801%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe

neq scalar Float32      time:   [5.2445 µs 5.2465 µs 5.2487 µs]
                        change: [-35.310% -35.116% -34.928%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe

lt Float32              time:   [15.896 µs 15.905 µs 15.916 µs]
                        change: [-0.6139% -0.3269% -0.0448%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe

lt scalar Float32       time:   [9.8077 µs 9.8141 µs 9.8213 µs]
                        change: [+0.1201% +0.3694% +0.5205%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

lt_eq Float32           time:   [15.904 µs 15.913 µs 15.923 µs]
                        change: [-10.285% -10.072% -9.9334%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) high mild
  9 (9.00%) high severe

lt_eq scalar Float32    time:   [9.8315 µs 9.8352 µs 9.8395 µs]
                        change: [-21.082% -21.007% -20.928%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

gt Float32              time:   [16.006 µs 16.070 µs 16.157 µs]
                        change: [+4.4889% +5.8947% +7.4152%] (p = 0.00 < 0.05)
                        Performance has regressed.

gt scalar Float32       time:   [9.8282 µs 9.8316 µs 9.8356 µs]
                        change: [+0.3787% +0.5923% +0.7215%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  8 (8.00%) high mild
  3 (3.00%) high severe

gt_eq Float32           time:   [15.906 µs 15.913 µs 15.922 µs]
                        change: [-9.9407% -9.8782% -9.8165%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

gt_eq scalar Float32    time:   [9.7886 µs 9.7941 µs 9.7995 µs]
                        change: [-21.558% -21.436% -21.244%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

eq Int32                time:   [7.6285 µs 7.6364 µs 7.6451 µs]
                        change: [-13.069% -12.808% -12.552%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

eq scalar Int32         time:   [5.1502 µs 5.1527 µs 5.1559 µs]
                        change: [-4.3827% -4.2114% -3.9813%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe

neq Int32               time:   [7.5440 µs 7.5503 µs 7.5576 µs]
                        change: [-20.459% -20.251% -20.113%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 19 outliers among 100 measurements (19.00%)
  4 (4.00%) low mild
  6 (6.00%) high mild
  9 (9.00%) high severe

neq scalar Int32        time:   [5.1496 µs 5.1521 µs 5.1550 µs]
                        change: [-36.346% -36.169% -35.980%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  8 (8.00%) high mild
  9 (9.00%) high severe

lt Int32                time:   [7.6004 µs 7.6047 µs 7.6095 µs]
                        change: [-19.626% -18.502% -17.323%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  5 (5.00%) high mild
  9 (9.00%) high severe

lt scalar Int32         time:   [5.1520 µs 5.1535 µs 5.1553 µs]
                        change: [-4.4893% -4.2218% -3.9580%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

lt_eq Int32             time:   [7.5444 µs 7.5487 µs 7.5541 µs]
                        change: [-21.125% -20.891% -20.683%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

lt_eq scalar Int32      time:   [5.1585 µs 5.1608 µs 5.1636 µs]
                        change: [-35.574% -35.425% -35.323%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

gt Int32                time:   [7.5334 µs 7.5385 µs 7.5443 µs]
                        change: [-14.111% -13.915% -13.789%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) low severe
  4 (4.00%) high mild
  6 (6.00%) high severe

gt scalar Int32         time:   [5.2320 µs 5.2342 µs 5.2366 µs]
                        change: [-2.7473% -2.3978% -2.0629%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  9 (9.00%) high mild
  5 (5.00%) high severe

gt_eq Int32             time:   [7.5977 µs 7.6027 µs 7.6083 µs]
                        change: [-22.831% -22.594% -22.373%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) high mild
  7 (7.00%) high severe

gt_eq scalar Int32      time:   [5.1495 µs 5.1518 µs 5.1545 µs]
                        change: [-36.404% -36.230% -36.052%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) high mild
  7 (7.00%) high severe

eq MonthDayNano         time:   [76.988 µs 77.890 µs 78.540 µs]
                        change: [+9.8624% +11.921% +13.779%] (p = 0.00 < 0.05)
                        Performance has regressed.

eq scalar MonthDayNano  time:   [53.423 µs 53.451 µs 53.479 µs]
                        change: [+0.1588% +0.4683% +0.7225%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

eq_dyn_utf8_scalar dictionary[10] string[4])
                        time:   [67.637 µs 67.668 µs 67.702 µs]
                        change: [+27.334% +27.585% +28.005%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

gt_eq_dyn_utf8_scalar scalar dictionary[10] string[4])
                        time:   [132.88 µs 132.92 µs 132.96 µs]
                        change: [+23.659% +23.938% +24.104%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

eq dictionary[10] string[4])
                        time:   [352.91 µs 353.15 µs 353.44 µs]
                        change: [-16.932% -16.725% -16.514%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 15, 2023
@github-actions github-actions bot added the arrow-flight Changes to the arrow-flight crate label Aug 15, 2023
}
}

trait Dictionary: Array {
Contributor Author

I could see us making this public, perhaps with some additional methods, so that it can form the basic pattern for "how to handle dictionaries". FYI @alamb

Contributor

I agree -- methods like "apply unary function to values, returning a Dictionary of the same type" would be very helpful. With the right documentation I think this would also be less confusing.

One thought I had is whether there is a more general formulation that might be helpful (for example, for DictionaryArrays, REE Arrays, and maybe StringViewArrays), although maybe they could just have their own functions.

Contributor Author

I don't have enough experience with REE Arrays to want to propose an abstraction for them, I honestly don't really know how to handle them efficiently...

StringViewArrays shouldn't require any additional logic as it isn't a "nested" type

@tustvold
Contributor Author

Latest benchmarks

eq Float32              time:   [7.6360 µs 7.6405 µs 7.6457 µs]
                        change: [-17.240% -16.996% -16.750%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  3 (3.00%) high mild
  12 (12.00%) high severe

eq scalar Float32       time:   [5.2382 µs 5.2407 µs 5.2433 µs]
                        change: [-2.6651% -2.3977% -2.1336%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

neq Float32             time:   [7.3749 µs 7.3801 µs 7.3862 µs]
                        change: [-27.961% -27.771% -27.616%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe

neq scalar Float32      time:   [5.2315 µs 5.2346 µs 5.2384 µs]
                        change: [-35.454% -35.241% -34.988%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe

lt Float32              time:   [15.455 µs 15.580 µs 15.681 µs]
                        change: [-5.7116% -4.9659% -4.2590%] (p = 0.00 < 0.05)
                        Performance has improved.

lt scalar Float32       time:   [9.8461 µs 9.8686 µs 9.8992 µs]
                        change: [+4.0063% +5.2827% +6.7502%] (p = 0.00 < 0.05)
                        Performance has regressed.

lt_eq Float32           time:   [15.896 µs 15.905 µs 15.915 µs]
                        change: [-10.347% -10.136% -10.003%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

lt_eq scalar Float32    time:   [9.7845 µs 9.7888 µs 9.7932 µs]
                        change: [-21.407% -21.254% -21.051%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 23 outliers among 100 measurements (23.00%)
  11 (11.00%) low mild
  5 (5.00%) high mild
  7 (7.00%) high severe

gt Float32              time:   [15.894 µs 15.905 µs 15.917 µs]
                        change: [-0.3861% -0.0659% +0.2289%] (p = 0.72 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

gt scalar Float32       time:   [9.7929 µs 9.7982 µs 9.8039 µs]
                        change: [+0.1360% +0.4044% +0.6769%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

gt_eq Float32           time:   [15.868 µs 15.883 µs 15.901 µs]
                        change: [-9.9332% -9.7737% -9.5779%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 26 outliers among 100 measurements (26.00%)
  12 (12.00%) low mild
  4 (4.00%) high mild
  10 (10.00%) high severe

gt_eq scalar Float32    time:   [9.8059 µs 9.8120 µs 9.8192 µs]
                        change: [-21.359% -21.226% -21.019%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

eq Int32                time:   [7.6663 µs 7.6775 µs 7.6896 µs]
                        change: [-12.682% -12.410% -12.111%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe

eq scalar Int32         time:   [5.1598 µs 5.1615 µs 5.1638 µs]
                        change: [-4.2256% -4.0547% -3.8551%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

neq Int32               time:   [7.5694 µs 7.5728 µs 7.5769 µs]
                        change: [-20.136% -19.885% -19.648%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) high mild
  11 (11.00%) high severe

neq scalar Int32        time:   [5.1674 µs 5.1699 µs 5.1726 µs]
                        change: [-36.189% -36.007% -35.816%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

lt Int32                time:   [7.6795 µs 7.6866 µs 7.6961 µs]
                        change: [-18.855% -17.713% -16.550%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

lt scalar Int32         time:   [5.1625 µs 5.1655 µs 5.1691 µs]
                        change: [-4.2787% -4.0476% -3.8997%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  10 (10.00%) high severe

lt_eq Int32             time:   [7.5761 µs 7.5806 µs 7.5861 µs]
                        change: [-20.731% -20.486% -20.235%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  2 (2.00%) high mild
  12 (12.00%) high severe

lt_eq scalar Int32      time:   [5.2456 µs 5.2493 µs 5.2536 µs]
                        change: [-34.425% -34.242% -34.062%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe

gt Int32                time:   [7.5842 µs 7.5915 µs 7.6007 µs]
                        change: [-11.722% -10.713% -9.6121%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 22 outliers among 100 measurements (22.00%)
  1 (1.00%) high mild
  21 (21.00%) high severe

gt scalar Int32         time:   [5.1702 µs 5.1726 µs 5.1753 µs]
                        change: [-3.9266% -3.6285% -3.3424%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  12 (12.00%) high severe

gt_eq Int32             time:   [7.6899 µs 7.6971 µs 7.7053 µs]
                        change: [-21.943% -21.716% -21.463%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) high mild
  11 (11.00%) high severe

gt_eq scalar Int32      time:   [5.1667 µs 5.1693 µs 5.1722 µs]
                        change: [-36.230% -36.051% -35.867%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  9 (9.00%) high severe

eq MonthDayNano         time:   [63.182 µs 63.378 µs 63.600 µs]
                        change: [-4.9198% -4.2991% -3.5572%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

eq scalar MonthDayNano  time:   [54.143 µs 54.196 µs 54.262 µs]
                        change: [+2.4635% +2.9231% +3.3797%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

eq_dyn_utf8_scalar dictionary[10] string[4])
                        time:   [53.156 µs 53.176 µs 53.197 µs]
                        change: [-0.0663% +0.1054% +0.3198%] (p = 0.37 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

gt_eq_dyn_utf8_scalar scalar dictionary[10] string[4])
                        time:   [108.95 µs 108.98 µs 109.02 µs]
                        change: [+1.3318% +1.6177% +1.9018%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

eq dictionary[10] string[4])
                        time:   [388.52 µs 388.82 µs 389.12 µs]
                        change: [-8.7351% -8.5458% -8.4271%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

@alamb
Contributor

alamb commented Aug 16, 2023

These are very neat results. To what do you attribute the performance improvements like the following?

...
gt Int32
                        time:   [7.5842 µs 7.5915 µs 7.6007 µs]
                        change: [-11.722% -10.713% -9.6121%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 22 outliers among 100 measurements (22.00%)
  1 (1.00%) high mild
  21 (21.00%) high severe
...
eq dictionary[10] string[4])
                        time:   [388.52 µs 388.82 µs 389.12 µs]
                        change: [-8.7351% -8.5458% -8.4271%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

@tustvold
Contributor Author

tustvold commented Aug 16, 2023

For the case of dictionaries it is likely because the process of normalizing the keys allows for unchecked indexing, which in turn helps LLVM optimise the code properly as there isn't a branch in the main loop.

For the case of primitives I don't have a very good intuition, however, these kernels historically have been extremely sensitive to inlining. It is possible that collocating more of the code allows LLVM to better optimise the code... I haven't looked in too much detail. It could equally be the inline(never) preventing it from inlining too much and blowing its ability to analyze the code.
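
The dictionary explanation above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: it supposes the keys are validated once up front, so the hot loop can index without a bounds check. `validate_keys` and `eq_scalar_via_keys` are hypothetical names.

```rust
/// Checks every key once, up front, so the hot loop needs no bounds check.
pub fn validate_keys(keys: &[usize], values_len: usize) -> Result<(), String> {
    match keys.iter().find(|&&k| k >= values_len) {
        Some(k) => Err(format!("key {k} out of bounds for {values_len} values")),
        None => Ok(()),
    }
}

/// Computes `values[keys[i]] == needle` for every row.
pub fn eq_scalar_via_keys(
    keys: &[usize],
    values: &[i64],
    needle: i64,
) -> Result<Vec<bool>, String> {
    validate_keys(keys, values.len())?;
    Ok(keys
        .iter()
        .map(|&k| {
            // SAFETY: `validate_keys` proved every key is `< values.len()`.
            let v = unsafe { *values.get_unchecked(k) };
            v == needle
        })
        .collect())
}
```

Moving the check out of the loop removes the per-row panic branch, which matches the "no branch in the main loop" reasoning given for the dictionary speedup.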

@tustvold
Contributor Author

tustvold commented Aug 16, 2023

I ran the benchmarks with the pre-existing SIMD kernels, and they are roughly twice as fast, although they have different semantics for floats. As written, this PR will therefore be a performance regression for users with the SIMD feature enabled. I can look to bring back the SIMD versions if people feel strongly, although my preference would be to avoid this complexity.
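
The comment does not spell out which float semantics differ. One plausible reading, stated here as an assumption, is IEEE 754 partial comparison versus a total ordering such as the one provided by `f32::total_cmp`. The sketch below only shows how those two notions of equality disagree; `eq_ieee` and `eq_total` are illustrative names.

```rust
use std::cmp::Ordering;

/// IEEE 754 equality: NaN never equals anything, and -0.0 equals 0.0.
pub fn eq_ieee(l: f32, r: f32) -> bool {
    l == r
}

/// Total-order equality via `f32::total_cmp`: every bit pattern has a fixed position,
/// so NaN equals an identical NaN, and -0.0 sorts before 0.0.
pub fn eq_total(l: f32, r: f32) -> bool {
    l.total_cmp(&r) == Ordering::Equal
}
```

Under this reading, a kernel switching between the two would return different results only for NaN and signed-zero inputs.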

Contributor

@alamb alamb left a comment

I read through this PR's code and tests carefully, and I really like the new formulation -- very nicely done @tustvold 👏

For other reviewers, the size of the diff might obscure the fact that the core logic of all the comparisons is now combined into several relatively easy to read generic functions that are all shared.

My biggest comment / suggestion is to make it easier to create Scalars from rust types now that Scalar plays much more prominently in the kernel API

I do wonder if some of the compilation speedup is that arrow-rs no longer instantiates all the code for the different comparisons unless it is actually called. Thus the benefits for downstream application binary size may not be as significant (but it certainly seems like an improvement nonetheless).

cc @viirya and @jhorstmann

@@ -44,10 +44,3 @@ half = { version = "2.1", default-features = false, features = ["num-traits"] }

[dev-dependencies]
rand = { version = "0.8", default-features = false, features = ["std", "std_rng"] }

[package.metadata.docs.rs]
features = ["dyn_cmp_dict"]
Contributor

Is this an API change? (if someone was using arrow-ord directly)

I see that dyn_cmp_dict is still used in arrow / arrow-string


@@ -129,7 +129,8 @@ impl GetDbSchemasBuilder {
}

if let Some(catalog_filter_name) = catalog_filter {
filters.push(eq_utf8_scalar(&catalog_name, &catalog_filter_name)?);
let scalar = StringArray::from_iter_values([catalog_filter_name]);
Contributor

As a user I find this construction somewhat awkward for creating a simple scalar.

What would you think about creating convenience functions that would let this code look like:

let scalar = Scalar::new_utf8(catalog_filter_name);

(doesn't have to be part of this PR, I can add it as a follow on ticket/PR)

Contributor Author

I'll have a think about what we can do here

Comment on lines +431 to +432
let s = UInt32Array::from(vec![tt]);
eq(arr, &Scalar::new(&s))
Contributor

Similarly, I think this would read better like

                let s = Scalar::new_uint32(tt);
                eq(arr, &s)

Contributor Author

Proposal in #4704

@@ -43,6 +43,8 @@
//! ```
//!

pub mod cmp;
#[doc(hidden)]
Contributor

What is the reason for hidden -- is it to discourage new uses of these functions? I also note you added "deprecated" notes to the functions in there, which should help people migrate

Contributor Author

So that it is clear where to look for comparison kernels

let r = r_v.map(|x| x.values().as_ref()).unwrap_or(r);

let values = downcast_primitive_array! {
(l, r) => apply(op, l.values().as_ref(), l_s, l_v, r.values().as_ref(), r_s, r_v),
Contributor

Just confirming that this handles Date, Time, Timestamp, etc types, right?

match (l_s, r_s) {
(None, None) => {
assert_eq!(l.len(), r.len());
collect_bool(l.len(), neg, |idx| unsafe {
Contributor

Is it worth adding a safety note here explaining that the index comes from a valid array?

op: impl Fn(T::Item, T::Item) -> bool,
) -> BooleanBuffer {
assert_eq!(l_v.len(), r_v.len());
collect_bool(l_v.len(), neg, |idx| unsafe {
Contributor

Same comment here about documenting the safety of using idx.

})
}

trait ArrayOrd {
Contributor

Could you possibly add some more documentation to this trait?

}

fn is_eq(l: Self::Item, r: Self::Item) -> bool {
l.is_eq(r)
@tustvold tustvold added the api-change Changes to the arrow API label Aug 16, 2023
@tustvold
Contributor Author

tustvold commented Aug 16, 2023

Quoting @alamb: "I do wonder if some of the compilation speedup is that arrow-rs no longer instantiates all the code for the different comparisons unless it is actually called"

We don't make use of generics in the public interface, so it will instantiate all the variants regardless of whether they're called. This is actually unlike before, so the speedup is quite possibly even more significant.
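
As a small illustration of the point about generics, here is a hedged sketch with hypothetical types (`Column`, `eq_typed`), not the crate's actual code. A non-generic public function that dispatches on a runtime type tag forces every typed variant to be compiled inside the defining crate, whereas a generic public function is only instantiated for the types a caller actually uses.

```rust
/// Hypothetical runtime-typed column standing in for a dynamically typed array.
pub enum Column {
    I32(Vec<i32>),
    F64(Vec<f64>),
}

/// Private generic kernel: one instantiation per element type used below.
fn eq_typed<T: PartialEq>(l: &[T], r: &[T]) -> Vec<bool> {
    assert_eq!(l.len(), r.len());
    l.iter().zip(r).map(|(a, b)| a == b).collect()
}

/// Non-generic public entry point: both `eq_typed::<i32>` and `eq_typed::<f64>`
/// are instantiated here, whether or not a downstream caller ever uses them.
pub fn eq(l: &Column, r: &Column) -> Result<Vec<bool>, String> {
    match (l, r) {
        (Column::I32(l), Column::I32(r)) => Ok(eq_typed(l, r)),
        (Column::F64(l), Column::F64(r)) => Ok(eq_typed(l, r)),
        _ => Err("type mismatch".to_string()),
    }
}
```

Under this model the measured build time already includes every variant, which is consistent with the claim that the real saving may be larger than the timings suggest.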

@alamb
Contributor

alamb commented Aug 16, 2023

I recommend leaving this PR open for another day or two to allow others the chance to comment (or request more time to comment)

@tustvold
Contributor Author

I will probably look to get this in Friday morning GMT to make the next arrow release

@jhorstmann
Contributor

We are currently several arrow releases behind in our engine, so I can't easily test the performance impact on our internal benchmark queries. The compile-time improvements seem really nice though, and if needed we could implement comparison kernels for our limited set of datatypes using portable_simd.

@tustvold tustvold merged commit 8bbb5c1 into apache:master Aug 18, 2023
31 checks passed
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate parquet Changes to the parquet crate
Development

Successfully merging this pull request may close these issues.

Datum Based Comparison Kernels
3 participants