ARROW-8518: [Python] Setup tools for building an optional pyarrow_gandiva package [WIP] #8390

Closed

Commits on Jan 18, 2021

  1. ARROW-11303: [Release][C++] Enable mimalloc in the windows verification script
    
    Closes apache#9247 from kszucs/mimalloc-windows-verification
    
    Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
    Signed-off-by: Sutou Kouhei <kou@clear-code.com>
    kszucs authored and kou committed Jan 18, 2021
    69a9a1c

Commits on Jan 19, 2021

  1. ARROW-11309: [Release][C#] Use .NET 3.1 for verification

    Because we have required .NET 3 or later since 3.0.0.
    
    Closes apache#9254 from kou/release-verify-macos-csharp
    
    Authored-by: Sutou Kouhei <kou@clear-code.com>
    Signed-off-by: Sutou Kouhei <kou@clear-code.com>
    kou committed Jan 19, 2021
    903b41c
  2. ARROW-11315: [Packaging][APT][arm64] Add missing gir1.2 files

    Closes apache#9260 from kou/release-debian-buster-arm64-add-missing-gir-gandiva
    
    Authored-by: Sutou Kouhei <kou@clear-code.com>
    Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
    kou authored and kszucs committed Jan 19, 2021
    19e9559
  3. ARROW-11314: [Release][APT][Yum] Add support for verifying arm64 packages
    
    Closes apache#9259 from kou/release-verify-arm64
    
    Authored-by: Sutou Kouhei <kou@clear-code.com>
    Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
    kou authored and kszucs committed Jan 19, 2021
    17a3fab
  4. ARROW-7633: [C++][CI] Create fuzz targets for tensors and sparse tensors

    Closes apache#6302 from mrkn/ARROW-7633
    
    Lead-authored-by: Kenta Murata <mrkn@mrkn.jp>
    Co-authored-by: Antoine Pitrou <antoine@python.org>
    Signed-off-by: Antoine Pitrou <antoine@python.org>
    mrkn and pitrou committed Jan 19, 2021
    275fda1
  5. ARROW-11246: [Rust] Add type to Unexpected accumulator state error

    I am hitting a bug that produces the error "Unexpected accumulator state", but it's not possible to tell what value was passed when the error surfaces on the user's side. This adds the type to the error message to make investigating the bug easier.
    
    Closes apache#9201 from ovr/unexpected-accumulator-state
    
    Authored-by: Dmitry Patsura <zaets28rus@gmail.com>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    ovr authored and alamb committed Jan 19, 2021
    2d3e8f9
  6. ARROW-11254: [Rust][DataFusion] Add SIMD and snmalloc flags as options to benchmarks
    
    I think it would be great to test some compilation options individually on the benchmarks to see the impact.
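
    For context, here is a minimal sketch (an assumption about the usual wiring via the `snmalloc-rs` crate, not necessarily exactly what this PR does) of how a `snmalloc` Cargo feature typically swaps the global allocator:

    ```rust
    // Only compiled when the benchmark is built with `--features snmalloc`.
    #[cfg(feature = "snmalloc")]
    #[global_allocator]
    static GLOBAL: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;

    fn main() {
        // With the feature enabled, heap allocations below go through snmalloc.
        let v: Vec<u64> = (0..1_000_000).collect();
        assert_eq!(v.len(), 1_000_000);
    }
    ```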
    
    Here are some examples of the impact it can have on the queries:
    
    --features "":
    ```
    Query 5 iteration 0 took 655.3 ms
    Query 5 iteration 1 took 648.9 ms
    Query 5 iteration 2 took 640.3 ms
    Query 5 iteration 3 took 658.7 ms
    Query 5 iteration 4 took 646.3 ms
    Query 5 iteration 5 took 684.1 ms
    Query 5 iteration 6 took 642.8 ms
    Query 5 iteration 7 took 656.9 ms
    Query 5 iteration 8 took 646.0 ms
    Query 5 iteration 9 took 669.1 ms
    Query 5 avg time: 654.85 ms
    ```
    
    --features "snmalloc"
    
    ```
    Query 5 iteration 0 took 525.4 ms
    Query 5 iteration 1 took 478.8 ms
    Query 5 iteration 2 took 485.7 ms
    Query 5 iteration 3 took 486.6 ms
    Query 5 iteration 4 took 482.6 ms
    Query 5 iteration 5 took 473.1 ms
    Query 5 iteration 6 took 494.4 ms
    Query 5 iteration 7 took 483.5 ms
    Query 5 iteration 8 took 493.1 ms
    Query 5 iteration 9 took 479.4 ms
    Query 5 avg time: 488.26 ms
    ```
    
    --features ""
    ```
    Query 12 iteration 0 took 241.4 ms
    Query 12 iteration 1 took 234.8 ms
    Query 12 iteration 2 took 229.8 ms
    Query 12 iteration 3 took 229.5 ms
    Query 12 iteration 4 took 228.3 ms
    Query 12 iteration 5 took 230.0 ms
    Query 12 iteration 6 took 228.3 ms
    Query 12 iteration 7 took 229.3 ms
    Query 12 iteration 8 took 229.9 ms
    Query 12 iteration 9 took 230.1 ms
    Query 12 avg time: 231.13 ms
    ```
    --features "simd"
    ```
    Query 12 iteration 0 took 157.7 ms
    Query 12 iteration 1 took 159.3 ms
    Query 12 iteration 2 took 156.9 ms
    Query 12 iteration 3 took 163.0 ms
    Query 12 iteration 4 took 157.5 ms
    Query 12 iteration 5 took 157.6 ms
    Query 12 iteration 6 took 156.6 ms
    Query 12 iteration 7 took 157.4 ms
    Query 12 iteration 8 took 158.6 ms
    Query 12 iteration 9 took 157.0 ms
    Query 12 avg time: 158.16 ms
    ```
    
    Closes apache#9206 from Dandandan/custom_alloc
    
    Lead-authored-by: Daniël Heres <danielheres@gmail.com>
    Co-authored-by: Heres, Daniel <danielheres@gmail.com>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    Dandandan authored and alamb committed Jan 19, 2021
    e20f439
  7. ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables
    
    While profiling a DataFusion query I found that the code spends a lot of time in reading data from parquet files. Predicate / filter push-down is a commonly used performance optimization, where statistics data stored in parquet files (such as min / max values for columns in a parquet row group) is evaluated against query filters to determine which row groups could contain data requested by a query. In this way, by pushing down query filters all the way to the parquet data source, entire row groups or even parquet files can be skipped often resulting in significant performance improvements.
    
    I have been working on an implementation for a few weeks and initial results look promising - with predicate push-down, DataFusion is now faster than Apache Spark (`140ms for DataFusion vs 200ms for Spark`) for the same query against the same parquet files. Without predicate push-down into parquet, DataFusion takes about 2 - 3s (depending on concurrency) for the same query, because the data is ordered and most files don't contain data that satisfies the query filters, but are still loaded and processed in vain.
    
    This work is based on the following key ideas:
    * predicate push-down is implemented by filtering row group metadata entries down to only those which could contain data satisfying the query filters
    * it's best to reuse the existing code for evaluating physical expressions already implemented in DataFusion
    * filter expressions pushed down to a parquet table are rewritten to use parquet statistics (instead of the actual column data), for example `(column / 2) = 4`  becomes  `(column_min / 2) <= 4 && 4 <= (column_max / 2)` - this is done once for all files in a parquet table
    * for each parquet file, a RecordBatch containing all required statistics columns ([`column_min`, `column_max`] in the example above) is produced, and the predicate expression from the previous step is evaluated against it, producing a binary array which is finally used to filter the row groups in each parquet file (a simplified sketch of this idea follows the list)
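
    A minimal, self-contained sketch of the row-group filtering idea for the simple case of a `col = value` filter (the types here are illustrative, not DataFusion's actual expression or statistics API):

    ```rust
    // Hypothetical per-row-group statistics, as read from parquet metadata.
    #[derive(Clone, Copy)]
    struct RowGroupStats {
        col_min: i64,
        col_max: i64,
    }

    // Rewritten predicate: a row group may contain rows matching `col = value`
    // only if col_min <= value <= col_max.
    fn row_group_may_match(stats: RowGroupStats, value: i64) -> bool {
        stats.col_min <= value && value <= stats.col_max
    }

    fn main() {
        let row_groups = [
            RowGroupStats { col_min: 0, col_max: 9 },
            RowGroupStats { col_min: 10, col_max: 19 },
            RowGroupStats { col_min: 20, col_max: 29 },
        ];
        // Only the second row group can contain rows where col = 12;
        // the other two are skipped without reading their data.
        let kept: Vec<usize> = row_groups
            .iter()
            .enumerate()
            .filter(|(_, stats)| row_group_may_match(**stats, 12))
            .map(|(i, _)| i)
            .collect();
        assert_eq!(kept, vec![1]);
    }
    ```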
    
    This is still work in progress - more tests left to write; I am publishing this now to gather feedback.
    
    @andygrove let me know what you think
    
    Closes apache#9064 from yordan-pavlov/parquet_predicate_push_down
    
    Authored-by: Yordan Pavlov <yordan.pavlov@outlook.com>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    yordan-pavlov authored and alamb committed Jan 19, 2021
    18dc62c
  8. ARROW-10489: [C++] Add Intel C++ compiler options for different warning levels
    
    See [Intel Compiler warning flag documentation](https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/error-handling-1/warnings-errors-and-remarks.html).
    
    Closes apache#9266 from jcmuel/master
    
    Authored-by: Johannes Müller <JohannesMueller@fico.com>
    Signed-off-by: Antoine Pitrou <antoine@python.org>
    jcmuel authored and pitrou committed Jan 19, 2021
    127961a
  9. ARROW-9128: [C++] Implement string space trimming kernels: trim, ltrim, and rtrim
    
    There is one obvious loose end in this PR, which is where to generate the `std::set` based on the `TrimOptions` (now in the ctor of UTF8TrimBase). I'm not sure what the lifetime guarantees are for this object (TrimOptions), where it makes sense to initialize this set, and when an (utf8 decoding) error occurs, how/where to report this.
    
    Although this is not a costly operation, assuming people don't pass in a billion characters to trim, I do wonder what the best approach here is in general. It does not make much sense to create the `std::set` at each `Exec` call, but that is what happens now. (This also seems to happen in `TransformMatchSubstring` for creating the `prefix_table` btw.)
    
    Maybe a good place to put per-kernel pre-compute results are the `*Options` objects, but I'm not sure if that makes sense in the current architecture.
    
    Another idea is to explore alternatives to the `std::set`. Based on the TrimManyAscii benchmark, `std::unordered_set` seemed a bit slower, and simply using a linear search in the predicate instead of the set (`std::find(options.characters.begin(), options.characters.end(), c) != options.characters.end()`) doesn't seem to affect performance much.
    
    In CPython a bloom filter is used; I could explore whether that makes sense here, but the existing bloom filter implementation in Arrow lives under the parquet namespace.
    
    Closes apache#8621 from maartenbreddels/ARROW-9128
    
    Authored-by: Maarten A. Breddels <maartenbreddels@gmail.com>
    Signed-off-by: Antoine Pitrou <antoine@python.org>
    maartenbreddels authored and pitrou committed Jan 19, 2021
    0e5d646
  10. ARROW-11305: Skip first argument (which is the program name) in parquet-rowcount binary
    
    Closes apache#9249 from jhorstmann/ARROW-11305-parquet-rowcount-argument-confusion
    
    Authored-by: Jörn Horstmann <joern.horstmann@signavio.com>
    Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
    jhorstmann authored and jorgecarleitao committed Jan 19, 2021
    f63cffa
  11. ARROW-11108: [Rust] Fixed performance issue in MutableBuffer.

    This PR refactors `MutableBuffer::extend_from_slice` to remove the need to call `to_byte_slice` on every call, thereby removing a level of indirection that prevents the compiler from optimizing out some code.
    
    This is the second performance improvement originally presented in apache#8796 and, together with apache#9027, brings the performance of `MutableBuffer` to the same level as `Vec<u8>`, in particular for building buffers on the fly.
    
    Basically, when converting to a byte slice `&[u8]`, the compiler loses the type size information, and thus needs to perform extra checks and can't just optimize out the code.
    
    This PR adopts the same API as `Vec<T>::extend_from_slice`, but since our buffers are in `u8` (i.e. a la `Vec<u8>`), I made the signature
    
    ```
    pub fn extend_from_slice<T: ToByteSlice>(&mut self, items: &[T])
    pub fn push<T: ToByteSlice>(&mut self, item: &T)
    ```
    
    i.e. it consumes something that can be converted to a byte slice, but internally makes the conversion to bytes (as `to_byte_slice` was doing).
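
    A quick usage sketch based on the signatures above (my own example, not taken from the PR; the constructor used here, `MutableBuffer::new`, is assumed):

    ```rust
    use arrow::buffer::MutableBuffer;

    fn main() {
        let mut buffer = MutableBuffer::new(0);
        // The conversion to bytes happens inside the method, so callers keep
        // the typed slice and the compiler keeps the element-size information.
        let items: &[i32] = &[1, 2, 3];
        buffer.extend_from_slice(items);
        buffer.push(&4i32);
        assert_eq!(buffer.len(), 4 * std::mem::size_of::<i32>());
    }
    ```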
    
    Credits for the root cause analysis that led to this PR go to @Dandandan, [originally fielded here](apache#9016 (comment)).
    
    > [...] current conversion to a byte slice may add some overhead? - @Dandandan
    
    Benches (against master, so, both this PR and apache#9044 ):
    
    ```
    Switched to branch 'perf_buffer'
    Your branch and 'origin/perf_buffer' have diverged,
    and have 6 and 1 different commits each, respectively.
      (use "git pull" to merge the remote branch into yours)
       Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
        Finished bench [optimized] target(s) in 1m 00s
         Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/buffer_create-915da5f1abaf0471
    Gnuplot not found, using plotters backend
    mutable                 time:   [463.11 us 463.57 us 464.07 us]
                            change: [-19.508% -18.571% -17.526%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 10 outliers among 100 measurements (10.00%)
      1 (1.00%) high mild
      9 (9.00%) high severe
    
    mutable prepared        time:   [527.84 us 528.46 us 529.14 us]
                            change: [-13.356% -12.522% -11.790%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 12 outliers among 100 measurements (12.00%)
      5 (5.00%) high mild
      7 (7.00%) high severe
    
    Benchmarking from_slice: Warming up for 3.0000 s
    Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, enable flat sampling, or reduce sample count to 60.
    from_slice              time:   [1.1968 ms 1.1979 ms 1.1991 ms]
                            change: [-6.8697% -6.2029% -5.5812%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 10 outliers among 100 measurements (10.00%)
      3 (3.00%) high mild
      7 (7.00%) high severe
    
    from_slice prepared     time:   [917.49 us 918.89 us 920.60 us]
                            change: [-6.5111% -5.9102% -5.3038%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 10 outliers among 100 measurements (10.00%)
      4 (4.00%) high mild
      6 (6.00%) high severe
    ```
    
    Closes apache#9076 from jorgecarleitao/perf_buffer
    
    Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
    Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
    jorgecarleitao committed Jan 19, 2021
    7e0cb0a
  12. ARROW-11216: [Rust] add doc example for StringDictionaryBuilder

    I find myself trying to remember the exact incantation to create a `StringDictionaryBuilder`, so I figured I would add it as a doc example.
    
    Closes apache#9169 from alamb/alamb/doc-example
    
    Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
    Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
    alamb authored and jorgecarleitao committed Jan 19, 2021
    b448de7
  13. ARROW-11268: [Rust][DataFusion] MemTable::load output partition support

    I think the ability to repartition an in-memory table is useful, as the repartitioning only needs to be applied once and the repartition itself is cheap (on the same node). Doing this when loading data is very useful for in-memory analytics, as we can benefit from multiple cores after loading the data.
    
    The speedup from repartitioning is very big (mainly on aggregates). On my 8-core machine it is ~5-7x on queries 1 and 12 versus a single partition, with a smaller (~30%) difference for query 5 when using 16 partitions. q1/q12 also have very high CPU utilization.
    
    @jorgecarleitao maybe this is of interest to you, as you mentioned you are looking into multi-threading. I think this would be a "high level" way to get more parallelism, also in the logical plan. I think we could do repartitions in some optimizer rules and/or dynamically, similar to what's described here: https://issues.apache.org/jira/browse/ARROW-9464
    
    Benchmarks after repartitioning (16 partitions):
    
    PR (16 partitions)
    ```
    Query 12 iteration 0 took 33.9 ms
    Query 12 iteration 1 took 34.3 ms
    Query 12 iteration 2 took 36.9 ms
    Query 12 iteration 3 took 33.6 ms
    Query 12 iteration 4 took 35.1 ms
    Query 12 iteration 5 took 38.8 ms
    Query 12 iteration 6 took 35.8 ms
    Query 12 iteration 7 took 34.4 ms
    Query 12 iteration 8 took 34.2 ms
    Query 12 iteration 9 took 35.3 ms
    Query 12 avg time: 35.24 ms
    ```
    
    Master (1 partition):
    ```
    Query 12 iteration 0 took 245.6 ms
    Query 12 iteration 1 took 246.4 ms
    Query 12 iteration 2 took 246.1 ms
    Query 12 iteration 3 took 247.9 ms
    Query 12 iteration 4 took 246.5 ms
    Query 12 iteration 5 took 248.2 ms
    Query 12 iteration 6 took 247.8 ms
    Query 12 iteration 7 took 246.4 ms
    Query 12 iteration 8 took 246.6 ms
    Query 12 iteration 9 took 246.5 ms
    Query 12 avg time: 246.79 ms
    ```
    
    PR (16 partitions):
    ```
    Query 1 iteration 0 took 138.6 ms
    Query 1 iteration 1 took 142.2 ms
    Query 1 iteration 2 took 125.8 ms
    Query 1 iteration 3 took 102.4 ms
    Query 1 iteration 4 took 105.9 ms
    Query 1 iteration 5 took 107.0 ms
    Query 1 iteration 6 took 109.3 ms
    Query 1 iteration 7 took 109.9 ms
    Query 1 iteration 8 took 108.8 ms
    Query 1 iteration 9 took 112.0 ms
    Query 1 avg time: 116.19 ms
    ```
    Master (1 partition):
    ```
    Query 1 iteration 0 took 640.6 ms
    Query 1 iteration 1 took 640.0 ms
    Query 1 iteration 2 took 632.9 ms
    Query 1 iteration 3 took 634.6 ms
    Query 1 iteration 4 took 630.7 ms
    Query 1 iteration 5 took 630.7 ms
    Query 1 iteration 6 took 631.9 ms
    Query 1 iteration 7 took 635.5 ms
    Query 1 iteration 8 took 639.0 ms
    Query 1 iteration 9 took 638.3 ms
    Query 1 avg time: 635.43 ms
    ```
    PR (16 partitions)
    ```
    Query 5 iteration 0 took 465.8 ms
    Query 5 iteration 1 took 428.0 ms
    Query 5 iteration 2 took 435.0 ms
    Query 5 iteration 3 took 407.3 ms
    Query 5 iteration 4 took 435.7 ms
    Query 5 iteration 5 took 437.4 ms
    Query 5 iteration 6 took 411.2 ms
    Query 5 iteration 7 took 432.0 ms
    Query 5 iteration 8 took 436.8 ms
    Query 5 iteration 9 took 435.6 ms
    Query 5 avg time: 432.47 ms
    ```
    
    Master (1 partition)
    ```
    Query 5 iteration 0 took 660.6 ms
    Query 5 iteration 1 took 634.4 ms
    Query 5 iteration 2 took 626.4 ms
    Query 5 iteration 3 took 628.0 ms
    Query 5 iteration 4 took 635.3 ms
    Query 5 iteration 5 took 631.1 ms
    Query 5 iteration 6 took 631.3 ms
    Query 5 iteration 7 took 639.4 ms
    Query 5 iteration 8 took 634.3 ms
    Query 5 iteration 9 took 639.0 ms
    Query 5 avg time: 635.97 ms
    ```
    
    Closes apache#9214 from Dandandan/mem_table_repartition
    
    Lead-authored-by: Heres, Daniel <danielheres@gmail.com>
    Co-authored-by: Daniël Heres <danielheres@gmail.com>
    Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
    Dandandan authored and jorgecarleitao committed Jan 19, 2021
    4a6eb19
  14. ARROW-11321: [Rust][DataFusion] Fix DataFusion compilation error

    FYI @jorgecarleitao, I think some change on master (maybe some parquet-related changes) caused a compilation error. This fixes it.
    
    Closes apache#9269 from Dandandan/fix_datafusion
    
    Authored-by: Heres, Daniel <danielheres@gmail.com>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    Dandandan authored and alamb committed Jan 19, 2021
    a4266a1
  15. ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

    This is one step for a fully vectorized hash join: https://issues.apache.org/jira/browse/ARROW-11112
    
    The idea of the PR is as follows:
    
    * We still use a `HashMap`, but rather than using the row data as the key we use a hash value (`u64`) both as key and as hash. We use a custom `Hasher` to avoid (re)computing hashes in the hash map and during lookups (see the sketch after this list).
    * Only the hash value creation is vectorized in this PR; the rest still works on a row basis.
    * A test for hash collision detection needs to be added.
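
    A generic sketch of the pass-through `Hasher` pattern (not DataFusion's exact implementation): the map's key is already a hash, so the hasher just forwards it instead of hashing again.

    ```rust
    use std::collections::HashMap;
    use std::hash::{BuildHasherDefault, Hasher};

    // Forwards a pre-computed u64 hash instead of re-hashing the key.
    #[derive(Default)]
    struct PassThroughHasher(u64);

    impl Hasher for PassThroughHasher {
        fn finish(&self) -> u64 {
            self.0
        }
        fn write_u64(&mut self, value: u64) {
            self.0 = value;
        }
        fn write(&mut self, _bytes: &[u8]) {
            unreachable!("only u64 keys are used with this hasher")
        }
    }

    type HashedMap<V> = HashMap<u64, V, BuildHasherDefault<PassThroughHasher>>;

    fn main() {
        // Keys are hashes computed once, vectorized, over the whole batch.
        let mut map: HashedMap<Vec<usize>> = HashedMap::default();
        map.entry(0x9e37_79b9_7f4a_7c15).or_default().push(0);
        map.entry(0x9e37_79b9_7f4a_7c15).or_default().push(7);
        assert_eq!(map[&0x9e37_79b9_7f4a_7c15], vec![0, 7]);
    }
    ```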
    
    Even without the remaining part, TPC-H query 12 is ~10% faster than the other PR: ~180ms vs ~200ms.
    TPC-H query 5 is >40% faster (332ms vs 624ms).
    
    Closes apache#9116 from Dandandan/vectorized_hashing
    
    Authored-by: Heres, Daniel <danielheres@gmail.com>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    Dandandan authored and alamb committed Jan 19, 2021
    bbc9029
  16. ARROW-11313: [Rust] Fixed size_hint

    `size_hint` should return the remaining items, not the total number of items.
    
    Closes apache#9258 from jorgecarleitao/fix_size_hint
    
    Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    jorgecarleitao authored and alamb committed Jan 19, 2021
    8e218e0
  17. ARROW-11222: [Rust] Catch up with flatbuffers 0.8.1 which had some UB problems fixed
    
    The major change of [flatbuffers 0.8.1](https://docs.rs/flatbuffers/0.8.1/flatbuffers/index.html) since 0.8.0 is google/flatbuffers#6393, which fixed some possible memory alignment issues.
    
    In this PR, the ipc/gen/*.rs files are generated by `regen.sh` as before, without any manual change.
    
    Closes apache#9176 from mqy/flatbuffers-0.8.1
    
    Authored-by: mqy <meng.qingyou@gmail.com>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    mqy authored and alamb committed Jan 19, 2021
    35053fe
  18. ARROW-11277: [C++] Workaround macOS 10.11: don't default construct consts
    
    ```c++
    /tmp/apache-arrow-20210116-6233-1jyrhk8/apache-arrow-3.0.0/cpp/src/arrow/dataset/expression.cc
    /tmp/apache-arrow-20210116-6233-1jyrhk8/apache-arrow-3.0.0/cpp/src/arrow/dataset/expression.cc:684:30:
    error: default initialization of an object of const type 'const arrow::Datum' without a user-provided default constructor
              static const Datum ignored_input;
    ```
    Datum defines a default constructor, but it doesn't seem to be found for const/constexpr declarations.
    
    Closes apache#9267 from bkietz/11277-Fix-compilation-error-in-
    
    Authored-by: Benjamin Kietzman <bengilgit@gmail.com>
    Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
    bkietz authored and nealrichardson committed Jan 19, 2021
    50ba534

Commits on Jan 20, 2021

  1. ARROW-11322: [Rust] Re-opening memory module as public

    Closes apache#9270 from maxburke/rust_memory_public
    
    Authored-by: Max Burke <max@urbanlogiq.com>
    Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
    maxburke authored and jorgecarleitao committed Jan 20, 2021
    a7633c7
  2. ARROW-11269: [Rust] [Parquet] Preserve timezone in int96 reader

    The Int96 timestamp was not using the specialised timestamp builder that takes the timezone as a parameter.
    This changes it to use the builder that preserves timezones.
    
    I tested this change with the test file provided in the JIRA.
    It looks like we don't have a way of writing int96 from the arrow writer, so there isn't an easy way to add a test case.
    
    Closes apache#9253 from nevi-me/ARROW-11269
    
    Authored-by: Neville Dipale <nevilledips@gmail.com>
    Signed-off-by: Neville Dipale <nevilledips@gmail.com>
    nevi-me committed Jan 20, 2021
    555643a
  3. ARROW-11279: [Rust][Parquet] ArrowWriter Definition Levels Memory Usage

    Writes leaves immediately after calculating array levels, reducing the array-level memory usage that previously scaled with the number of rows in a row group.
    
    Closes apache#9222 from TurnOfACard/parquet-memory
    
    Authored-by: Ryan Jennings <ryan@ryanj.net>
    Signed-off-by: Neville Dipale <nevilledips@gmail.com>
    Ryan Jennings authored and nevi-me committed Jan 20, 2021
    e7c69e6
  4. ARROW-11318: [Rust] Support pretty printing timestamp, date, and timestamp types
    
    I found this while removing `test::format_batches` (PR to come shortly); the RecordBatch pretty-printing code was printing numbers rather than dates.
    
    Before this PR, date/time columns were printed as raw numbers:
    
    ```
    [
        "+----------+",
        "| f        |",
        "+----------+",
        "| 11111111 |",
        "|          |",
        "+----------+",
    ]
    ```
    
    After this PR, they are printed (via chrono) as dates:
    
    ```
    [
        "+---------------------+",
        "| f                   |",
        "+---------------------+",
        "| 1970-05-09 14:25:11 |",
        "|                     |",
        "+---------------------+",
    ]
    ```
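
    For reference, the "after" value can be reproduced with chrono directly (a standalone sketch assuming the stored value is in seconds; the crate's actual formatting path may differ):

    ```rust
    use chrono::NaiveDateTime;

    fn main() {
        // 11111111 seconds after the epoch, the raw value shown above.
        let dt = NaiveDateTime::from_timestamp(11_111_111, 0);
        assert_eq!(dt.to_string(), "1970-05-09 14:25:11");
    }
    ```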
    
    Closes apache#9263 from alamb/alamb/pretty_print_datetimes
    
    Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    alamb committed Jan 20, 2021
    71572bd
  5. ARROW-11311: [Rust] Fixed unset_bit

    The functions `unset_bit` and `unset_bit_raw` were toggling, not unsetting, bits, which was obviously wrong.
    
    This PR also changes the test for `set_bit` to also make sure that it does not toggle bits.
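
    The difference, in a standalone sketch (illustrative bit twiddling, not the arrow crate's actual code): clearing a bit uses AND with the inverted mask, while XOR only toggles it.

    ```rust
    fn unset_bit(data: &mut [u8], i: usize) {
        // AND-NOT clears the bit and is idempotent.
        data[i / 8] &= !(1u8 << (i % 8));
    }

    fn toggle_bit(data: &mut [u8], i: usize) {
        // XOR flips the bit, so applying it twice restores the original value.
        data[i / 8] ^= 1u8 << (i % 8);
    }

    fn main() {
        let mut a = [0b0000_0100u8];
        unset_bit(&mut a, 2);
        unset_bit(&mut a, 2);
        assert_eq!(a[0], 0);

        let mut b = [0b0000_0100u8];
        toggle_bit(&mut b, 2);
        toggle_bit(&mut b, 2);
        assert_eq!(b[0], 0b0000_0100);
    }
    ```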
    
    Closes apache#9257 from jorgecarleitao/fix_unset
    
    Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    jorgecarleitao authored and alamb committed Jan 20, 2021
    ed709e0
  6. ARROW-11265: [Rust] Made bool not ArrowNativeType

    This PR removes the risk of boolean values being converted to bytes via `ToByteSlice` by restricting `ArrowNativeType` to types whose in-memory representation in Rust matches the in-memory representation in Arrow. `bool` in Rust is a byte, while in Arrow it is a bit.
    
    Overall, the direction of this PR is to have the traits represent one aspect of the type. In this case, `ArrowNativeType` is currently:
    * a type that has the same in-memory representation (`ToByteSlice` is implemented for it)
    * a JSON-serializable type
    * something that can be cast to/from `usize`.
    
    This poses a problem because:
    
    1. bools are serializable, not castable to usize, have different memory representation
    2. fixed size (iX, uX) are serializable, castable to usize, have the same memory representation
    3. fixed floating (f32, f64) are serializable, not castable to usize, have the same memory representation
    
    however, they all implement `ArrowNativeType`.
    
    This PR focuses on splitting out the JSON-serializable part of it.
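
    A small illustration of the representation mismatch being addressed (standalone, standard-library only):

    ```rust
    fn main() {
        // A Rust bool occupies a whole byte in memory...
        assert_eq!(std::mem::size_of::<bool>(), 1);
        // ...so 8 bools take 8 bytes, while an Arrow boolean buffer packs
        // 8 values into a single byte. Reinterpreting &[bool] as raw bytes
        // therefore cannot produce a valid Arrow boolean buffer.
        let bools = [true; 8];
        assert_eq!(std::mem::size_of_val(&bools), 8);
    }
    ```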
    
    Closes apache#9212 from jorgecarleitao/fix_trait
    
    Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    jorgecarleitao authored and alamb committed Jan 20, 2021
    01c5aec
  7. ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with high number of groups
    
    Currently, we loop through the hashmap for every key.
    
    However, as we receive a batch, if we have a lot of groups in the group by expression (or receive sorted data, etc.) then we could create a lot of empty batches and call `update_batch` for each of the keys already in the hashmap.
    
    In this PR we keep track of which keys we received in the batch and only update the accumulators with those keys, instead of all accumulators.
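
    A simplified sketch of the idea with hypothetical types (strings as keys, a sum as the accumulator), not DataFusion's actual accumulator machinery:

    ```rust
    use std::collections::HashMap;

    // Update only the accumulators whose keys appear in the incoming batch.
    fn update_batch(accumulators: &mut HashMap<String, i64>, batch: &[(String, i64)]) {
        // 1. Bucket the batch rows by group key.
        let mut rows_per_key: HashMap<&str, Vec<i64>> = HashMap::new();
        for (key, value) in batch {
            rows_per_key.entry(key.as_str()).or_default().push(*value);
        }
        // 2. Touch only the groups seen in this batch, not every group
        //    already present in `accumulators`.
        for (key, values) in rows_per_key {
            *accumulators.entry(key.to_string()).or_insert(0) += values.iter().sum::<i64>();
        }
    }

    fn main() {
        let mut accs = HashMap::new();
        update_batch(&mut accs, &[("a".to_string(), 1), ("b".to_string(), 2), ("a".to_string(), 3)]);
        update_batch(&mut accs, &[("a".to_string(), 10)]); // "b" is not touched here
        assert_eq!(accs["a"], 14);
        assert_eq!(accs["b"], 2);
    }
    ```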
    
    On the db-benchmark (h2oai/db-benchmark#182) this is the difference (mainly q3 and q5; the others seem to be noise). It doesn't completely solve the problem, but it already reduces it quite a bit.
    
    This PR:
    ```
    q1 took 340 ms
    q2 took 1768 ms
    q3 took 10975 ms
    q4 took 337 ms
    q5 took 13529 ms
    ```
    Master:
    ```
    q1 took 330 ms
    q2 took 1648 ms
    q3 took 16408 ms
    q4 took 335 ms
    q5 took 21074 ms
    ```
    
    Closes apache#9234 from Dandandan/hash_agg_speed2
    
    Authored-by: Heres, Daniel <danielheres@gmail.com>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    Dandandan authored and alamb committed Jan 20, 2021
    6912869
  8. ARROW-11149: [Rust] DF Support List/LargeList/FixedSizeList in create_batch_empty
    
    Previously `build_empty_list_array` was declared inside Parquet (`array_reader`), but I will use this function inside DataFusion's `create_batch_empty` (it's used inside hash_aggregate to make an empty batch from the provided schema that contains the column types). I moved it to Arrow (because it's common and useful) and added `build_empty_large_list_array` (for large lists) on top of macros, with a different implementation from `build_empty_list_array`.
    
    Closes apache#9114 from ovr/issue-11149
    
    Authored-by: Dmitry Patsura <zaets28rus@gmail.com>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    ovr authored and alamb committed Jan 20, 2021
    23550c2
  9. ARROW-11329: [Rust] Don't rerun build.rs on every file change

    This speeds up development by avoiding rebuilding the whole library
    when any file in the package directory is touched.
    
    Closes apache#9277 from mbrubeck/build
    
    Authored-by: Matt Brubeck <mbrubeck@limpet.net>
    Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
    mbrubeck authored and jorgecarleitao committed Jan 20, 2021
    a0e1244

Commits on Jan 21, 2021

  1. ARROW-11220: [Rust] Implement GROUP BY support for Boolean

    Introduce support in DataFusion for GROUP BY on boolean values. The boolean type in Rust implements the Eq and Hash traits, which allows us to use it in GroupByScalar.
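
    As a small illustration (the `GroupByScalar` definition here is made up for the example, not DataFusion's actual enum), `Eq` + `Hash` is what lets a boolean key participate in hash-based grouping:

    ```rust
    use std::collections::HashMap;

    // Illustrative only: the real GroupByScalar has many more variants.
    #[derive(Debug, PartialEq, Eq, Hash)]
    enum GroupByScalar {
        Boolean(bool),
    }

    fn main() {
        // Count rows per group key, like a GROUP BY on a boolean column.
        let mut counts: HashMap<GroupByScalar, u64> = HashMap::new();
        for &value in &[true, false, true] {
            *counts.entry(GroupByScalar::Boolean(value)).or_insert(0) += 1;
        }
        assert_eq!(counts[&GroupByScalar::Boolean(true)], 2);
        assert_eq!(counts[&GroupByScalar::Boolean(false)], 1);
    }
    ```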
    
    Closes apache#9174 from ovr/issue-11220
    
    Authored-by: Dmitry Patsura <zaets28rus@gmail.com>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    ovr authored and alamb committed Jan 21, 2021
    8b56f85
  2. ARROW-11330: [Rust][DataFusion] add ExpressionVisitor to encode expression walking
    
    ## Problem:
    * There are several places in the DataFusion codebase where a walk of an Expression tree is needed
    * The logic of how to walk the tree is replicated
    * Adding new expression types often requires many mechanically different but semantically identical changes in places where no special treatment of such types is needed
    
    This PR introduces an `ExpressionVisitor` trait and the `Expr::accept` function to consolidate this walking of the expression tree. It does not intend to change any functionality.
    
    If folks like this pattern, I have ideas for a similar trait, `ExpressionRewriter`, which can be used to rewrite expressions (much like `clone_with_replacement`) in a subsequent PR. I think this was mentioned by @Dandandan in the [Rust roadmap](https://docs.google.com/document/d/1qspsOM_dknOxJKdGvKbC1aoVoO0M3i6x1CIo58mmN2Y/edit#heading=h.kstb571j5g5j)
    
    cc @jorgecarleitao @Dandandan and @andygrove
    
    Closes apache#9278 from alamb/alamb/expression_visitor
    
    Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    alamb committed Jan 21, 2021
    4601c02
  3. ARROW-11323: [Rust][DataFusion] Allow sort queries to return no results

    Prior to this PR, if a plan had an ORDER BY (Sort) that got no input rows, the query would return an error.
    
    Now the test passes and produces the expected empty result (no output rows).
    
    Closes apache#9275 from alamb/alamb/ARROW-11323-empty-results
    
    Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    alamb committed Jan 21, 2021
    84126d5
  4. ARROW-10831: [C++][Compute] Implement quantile kernel

    Calculate the exact quantile by storing all values and partitioning
    around the quantile point at the end. This may require a lot of memory.
    
    As a follow-up task, an approximate method that does not store the data points
    will be implemented.
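
    For illustration only (a Rust sketch of the same partition-based idea using the standard library, not the Arrow C++ kernel), with the nearest-rank definition of the quantile:

    ```rust
    // Exact quantile by keeping all values and partitioning around the
    // quantile index at the end; fast selection, but O(n) memory as well.
    fn exact_quantile(values: &mut [i64], q: f64) -> Option<i64> {
        if values.is_empty() || !(0.0..=1.0).contains(&q) {
            return None;
        }
        let index = (q * (values.len() - 1) as f64).round() as usize;
        // select_nth_unstable partitions the slice around `index`.
        let (_, nth, _) = values.select_nth_unstable(index);
        Some(*nth)
    }

    fn main() {
        let mut data = vec![7, 1, 5, 3, 9];
        assert_eq!(exact_quantile(&mut data, 0.5), Some(5));
        assert_eq!(exact_quantile(&mut data, 1.0), Some(9));
    }
    ```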
    
    Closes apache#8920 from cyb70289/quantile
    
    Lead-authored-by: Yibo Cai <yibo.cai@arm.com>
    Co-authored-by: Antoine Pitrou <antoine@python.org>
    Signed-off-by: Antoine Pitrou <antoine@python.org>
    cyb70289 and pitrou committed Jan 21, 2021
    bd90043
  5. ARROW-11334: [Python][CI] Fix failing pandas nightly tests

    Closes apache#9282 from jorisvandenbossche/ARROW-11334
    
    Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
    Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
    jorisvandenbossche committed Jan 21, 2021
    72bf95a
  6. ARROW-11320: [C++] Try to strengthen temporary dir creation

    * Make the random generator seed a bit more unique.
    * When a temporary name already exists, retry with another one instead of bailing out.
    
    Closes apache#9265 from pitrou/ARROW-11320-temp-dir-retry
    
    Authored-by: Antoine Pitrou <antoine@python.org>
    Signed-off-by: Antoine Pitrou <antoine@python.org>
    pitrou committed Jan 21, 2021
    bc5d8bf
  7. ARROW-11141: [Rust] Add basic Miri checks to CI pipeline

    Closes apache#9261 from vertexclique/vcq/ARROW-11141-miri-checks
    
    Authored-by: Mahmut Bulut <vertexclique@gmail.com>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    vertexclique authored and alamb committed Jan 21, 2021
    c413566

Commits on Jan 22, 2021

  1. ARROW-11337: [C++] Compilation error with ThreadSanitizer

    Closes apache#9286 from westonpace/bugfix/arrow-11337
    
    Authored-by: Weston Pace <weston.pace@gmail.com>
    Signed-off-by: Sutou Kouhei <kou@clear-code.com>
    westonpace authored and kou committed Jan 22, 2021
    6959e46
  2. ARROW-11333: [Rust] Generalized creation of empty arrays.

    This PR exposes two new functions:
    * `new_empty_array`, that creates a new empty `ArrayRef` (i.e. dynamically typed) of any (primitive or nested) type except `Union`
    * `RecordBatch::new_empty` that creates an empty `RecordBatch`, thereby migrating code from `DataFusion`.
    
    Since we were using similar code in `array/transform/mod.rs` and `array_list`, this PR ends up removing some code.
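
    A minimal usage sketch based on the description above (assuming the functions are exported as `arrow::array::new_empty_array` and `RecordBatch::new_empty`):

    ```rust
    use std::sync::Arc;

    use arrow::array::{new_empty_array, Array};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::record_batch::RecordBatch;

    fn main() {
        // Dynamically typed empty array for an arbitrary DataType.
        let array = new_empty_array(&DataType::Int32);
        assert_eq!(array.len(), 0);

        // Empty RecordBatch that still carries its schema.
        let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, true)]));
        let batch = RecordBatch::new_empty(schema);
        assert_eq!(batch.num_rows(), 0);
    }
    ```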
    
    Closes apache#9281 from jorgecarleitao/improve_empty
    
    Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
    Signed-off-by: Neville Dipale <nevilledips@gmail.com>
    jorgecarleitao authored and nevi-me committed Jan 22, 2021
    499b6d0
  3. ARROW-10299: [Rust] Use IPC Metadata V5 as default

    We have been using the legacy IPC format, which predates v1.0 of the crate. This PR changes to use the latest version, `ipc::MetadataVersion::V5` from v3.0 of the crate.
    
    The main change was to update the default `IpcWriteOptions` and add tests.
    
    Closes apache#9122 from nevi-me/ARROW-10299
    
    Authored-by: Neville Dipale <nevilledips@gmail.com>
    Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
    nevi-me authored and jorgecarleitao committed Jan 22, 2021
    629a6fd
  4. ARROW-11343: [Rust][DataFusion] Simplified example with UDF.

    I.e. use `collect` instead of the builder, which is simpler and faster
    
    Closes apache#9290 from jorgecarleitao/simpler_udf
    
    Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    jorgecarleitao authored and alamb committed Jan 22, 2021
    457fa91
  5. ARROW-10766: [Rust] [Parquet] Compute nested list definitions

    This mainly computes definition and repetition levels for lists.
    It also partially adds deeply nested write support.
    I am however going to complete this in a separate PR.
    
    This has really been challenging because we can't roundtrip without nested writers,
    so it's taken me months to complete.
    In the process, I've had to rely on using Spark to verify my work.
    
    This PR is also not optimised. I've left TODOs in a few places (sparingly).
    The biggest next step is to remove `array_mask: Vec<u8>` and replace it with a bit-packed vector to save memory.
    
    Closes apache#9240 from nevi-me/ARROW-10766-v2
    
    Authored-by: Neville Dipale <nevilledips@gmail.com>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    nevi-me authored and alamb committed Jan 22, 2021
    251ecac
  6. ARROW-11332: [Rust] Use MutableBuffer in take_string instead of Vec

    This PR changes `take_string` to use `MutableBuffer` to create a byte array directly instead of converting it from a `Vec<u8>`.
    There used to be some overhead compared to using a `Vec` and converting it to a buffer afterwards, but that overhead seems to be gone now.
    
    The change seems to be neutral according to benchmarks, giving results within a few %. If there is any remaining overhead in `MutableBuffer`, I think we should fix that rather than keeping workarounds and inconsistencies with other kernels.
    
    ```
    take str 512            time:   [2.3304 us 2.3358 us 2.3419 us]
                            change: [-4.4130% -4.0693% -3.7241%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 3 outliers among 100 measurements (3.00%)
      2 (2.00%) high mild
      1 (1.00%) high severe
    
    take str 1024           time:   [4.3583 us 4.3633 us 4.3694 us]
                            change: [-0.5853% +1.1186% +2.9951%] (p = 0.29 > 0.05)
                            No change in performance detected.
    Found 16 outliers among 100 measurements (16.00%)
      3 (3.00%) low severe
      6 (6.00%) high mild
      7 (7.00%) high severe
    
    take str null indices 512
                            time:   [2.4779 us 2.4813 us 2.4844 us]
                            change: [-2.4765% -2.2000% -1.9437%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 5 outliers among 100 measurements (5.00%)
      3 (3.00%) low mild
      2 (2.00%) high mild
    
    take str null indices 1024
                            time:   [4.4823 us 4.4910 us 4.5053 us]
                            change: [-4.8482% -4.5426% -4.2894%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 5 outliers among 100 measurements (5.00%)
      1 (1.00%) low mild
      3 (3.00%) high mild
      1 (1.00%) high severe
    
    take str null values 1024
                            time:   [4.4856 us 4.4889 us 4.4920 us]
                            change: [-2.2093% -2.0471% -1.8925%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 7 outliers among 100 measurements (7.00%)
      2 (2.00%) low severe
      3 (3.00%) low mild
      1 (1.00%) high mild
      1 (1.00%) high severe
    
    take str null values null indices 1024
                            time:   [9.6438 us 9.6514 us 9.6592 us]
                            change: [-2.8600% -2.7478% -2.6338%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 2 outliers among 100 measurements (2.00%)
      1 (1.00%) high mild
      1 (1.00%) high severe
    ```
    
    Closes apache#9279 from Dandandan/take_string_opt
    
    Authored-by: Heres, Daniel <danielheres@gmail.com>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    Dandandan authored and alamb committed Jan 22, 2021
    262bbdc
  7. ARROW-11299: [Python] Fix invalid-offsetof warnings

    Use unique_ptr to hold instances of FunctionOptions-derived classes to fix
    `invalid-offsetof` Python build warnings.
    
    ~/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26034:146:
    warning: offsetof within non-standard-layout type ‘__pyx_obj_7pyarrow_8_compute__CastOptions’ is undefined [-Winvalid-offsetof]
     x_type_7pyarrow_8_compute__CastOptions.tp_weaklistoffset = offsetof(struct __pyx_obj_7pyarrow_8_compute__CastOptions, __pyx_base.__pyx_base.__weakref__);
    
    Closes apache#9274 from cyb70289/offsetof
    
    Authored-by: Yibo Cai <yibo.cai@arm.com>
    Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
    cyb70289 authored and bkietz committed Jan 22, 2021
    b44a4ad

Commits on Jan 23, 2021

  1. ARROW-11291: [Rust] Add extend to MutableBuffer (-20% for arithmetic, -97% for length)
    
    # Rationale
    
    Rust forbids safely accessing uninitialized memory because it is undefined behavior. However, when building `Buffer`s, it is important to be able to _write_ to uninitialized memory regions, thereby avoiding the need to write _something_ to them before using them.
    
    Currently, all our initializations are zeroed, which is expensive. apache#9076 modifies our allocator to allocate uninitialized regions. However, by itself, this is not useful if we do not offer any methods to write to those (uninitialized) regions.
    
    # This PR
    
    This PR is built on top of apache#9076 and introduces methods to extend a `MutableBuffer` from an iterator, thereby offering an API to efficiently grow `MutableBuffer` without having to initialize memory regions with zeros (i.e. without `with_bitset` and the like).
    
    This PR also introduces methods to create a `Buffer` from an iterator and a `trusted_len` iterator.
    
    The design is inspired by `Vec`, with the catch that we use stable Rust (i.e. no trait specialization, no `TrustedLen`), and thus have to expose a few more methods than `Vec` does. This means that we can't use the (nicer) `collect()`-based API for trusted-length iterators.
    
    Note that, as before, there are still `unsafe` uses of `MutableBuffer`, derived from the fact that it is not generic over a type `T` (and thus people can mix types and grow the buffer in unsound ways).
    
    Special thanks to @mbrubeck for all the help on this, originally discussed [here](https://users.rust-lang.org/t/collect-for-exactsizediterator/54367/6).
    
    ```bash
    git checkout master
    cargo bench --bench arithmetic_kernels
    git checkout length_faster
    cargo bench --bench arithmetic_kernels
    
    git checkout 16bc720
    cargo bench --bench length_kernel
    git checkout length_faster
    ```
    
    ```
    Switched to branch 'length_faster'
      (use "git pull" to merge the remote branch into yours)
       Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
        Finished bench [optimized] target(s) in 1m 02s
         Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/arithmetic_kernels-ec2cc20ce07d9b83
    Gnuplot not found, using plotters backend
    add 512                 time:   [522.24 ns 523.67 ns 525.26 ns]
                            change: [-21.738% -20.960% -20.233%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 12 outliers among 100 measurements (12.00%)
      9 (9.00%) high mild
      3 (3.00%) high severe
    
    subtract 512            time:   [503.18 ns 504.93 ns 506.81 ns]
                            change: [-21.741% -21.075% -20.308%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 4 outliers among 100 measurements (4.00%)
      2 (2.00%) high mild
      2 (2.00%) high severe
    
    multiply 512            time:   [508.25 ns 512.04 ns 516.06 ns]
                            change: [-22.569% -21.946% -21.305%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 1 outliers among 100 measurements (1.00%)
      1 (1.00%) high mild
    
    divide 512              time:   [1.4711 us 1.4753 us 1.4799 us]
                            change: [-24.783% -23.305% -22.176%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 5 outliers among 100 measurements (5.00%)
      2 (2.00%) high mild
      3 (3.00%) high severe
    
    limit 512, 512          time:   [373.47 ns 377.76 ns 382.21 ns]
                            change: [+3.3055% +4.4193% +5.5923%] (p = 0.00 < 0.05)
                            Performance has regressed.
    
    add_nulls_512           time:   [502.94 ns 504.51 ns 506.28 ns]
                            change: [-24.876% -24.299% -23.709%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 2 outliers among 100 measurements (2.00%)
      2 (2.00%) high severe
    
    divide_nulls_512        time:   [1.4843 us 1.4931 us 1.5053 us]
                            change: [-22.968% -22.243% -21.420%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 24 outliers among 100 measurements (24.00%)
      15 (15.00%) low mild
      1 (1.00%) high mild
      8 (8.00%) high severe
    ```
    
    Length (against the commit that fixes the bench, `16bc7200f3baa6e526aea7135c60dcc949c9b592`, not master):
    
    ```
    length                  time:   [1.5379 us 1.5408 us 1.5437 us]
                            change: [-97.311% -97.295% -97.278%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 12 outliers among 100 measurements (12.00%)
      1 (1.00%) low severe
      4 (4.00%) low mild
      3 (3.00%) high mild
      4 (4.00%) high severe
    ```
    
    Closes apache#9235 from jorgecarleitao/length_faster
    
    Lead-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
    Co-authored-by: Matt Brubeck <mbrubeck@limpet.net>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    2 people authored and alamb committed Jan 23, 2021
    13e2134
  2. ARROW-11319: [Rust] [DataFusion] Improve test comparisons to record batch, remove test::format_batch
    
    The `test::format_batch` function does not have a wide range of type support (e.g. it doesn't support dictionaries) and its output makes tests hard to read and update, in my opinion. This PR consolidates the DataFusion tests to use `arrow::util::pretty::pretty_format_batches`, both to reduce code duplication and to increase type support.
    
    This PR removes the `test::format_batch(&batch);` function and replaces it with `arrow::util::pretty::pretty_format_batches` and some macros. It makes no functional changes.
    
    This change has the following benefits:
    
    1. Better type support (I immediately can compare RecordBatches with `Dictionary` types in tests without having to update `format_batch` and apache#9233 gets simpler)
    2. Better readability and error reporting (at least I find the code and diffs easier to understand)
    3. Easier test update / review: it is easier to update the diffs (you can copy/paste the test output into the source code) and to review them
    
    This is a variant of a strategy that I have been using with success in IOx ([source link](https://github.com/influxdata/influxdb_iox/blob/main/arrow_deps/src/test_util.rs#L15)) and I wanted to contribute it back.
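
    A sketch of the comparison helper this enables (my own condensed version, assuming `pretty_format_batches` returns a `Result<String>` as it did at the time, and not the exact macro added in this PR):

    ```rust
    use arrow::record_batch::RecordBatch;
    use arrow::util::pretty::pretty_format_batches;

    /// Render the batches with the shared pretty printer and compare the
    /// resulting table line-by-line against the expected output.
    fn assert_batches_eq(expected: &[&str], batches: &[RecordBatch]) {
        let formatted = pretty_format_batches(batches).unwrap();
        let actual: Vec<&str> = formatted.trim().lines().collect();
        assert_eq!(
            expected, &actual[..],
            "\n\nexpected:\n\n{:#?}\nactual:\n\n{:#?}\n\n",
            expected, actual
        );
    }
    ```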
    
    An example failure with this PR:
    
    ```
    ---- physical_plan::hash_join::tests::join_left_one stdout ----
    thread 'physical_plan::hash_join::tests::join_left_one' panicked at 'assertion failed: `(left == right)`
      left: `["+----+----+----+----+", "| a1 | b2 | c1 | c2 |", "+----+----+----+----+", "| 1  | 1  | 7  | 70 |", "| 2  | 2  | 8  | 80 |", "| 2  | 2  | 9  | 80 |", "+----+----+----+----+"]`,
     right: `["+----+----+----+----+----+", "| a1 | b1 | c1 | a2 | c2 |", "+----+----+----+----+----+", "| 1  | 4  | 7  | 10 | 70 |", "| 2  | 5  | 8  | 20 | 80 |", "| 3  | 7  | 9  |    |    |", "+----+----+----+----+----+"]`:
    
    expected:
    
    [
        "+----+----+----+----+",
        "| a1 | b2 | c1 | c2 |",
        "+----+----+----+----+",
        "| 1  | 1  | 7  | 70 |",
        "| 2  | 2  | 8  | 80 |",
        "| 2  | 2  | 9  | 80 |",
        "+----+----+----+----+",
    ]
    actual:
    
    [
        "+----+----+----+----+----+",
        "| a1 | b1 | c1 | a2 | c2 |",
        "+----+----+----+----+----+",
        "| 1  | 4  | 7  | 10 | 70 |",
        "| 2  | 5  | 8  | 20 | 80 |",
        "| 3  | 7  | 9  |    |    |",
        "+----+----+----+----+----+",
    ]
    ```
    
    You can copy/paste the output of `actual` directly into the test code for an update.
    
    Closes apache#9264 from alamb/remove_test_format_batch
    
    Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
    Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
    alamb committed Jan 23, 2021
    67d0c2e

Commits on Jan 24, 2021

  1. Move Gandiva files and parameterize setup args

    The arrangement is inspired by the dynd_python module's setup.py.
    wjones127 authored and Will Jones committed Jan 24, 2021
    c970549
  2. Setup CMake to be able to build python_gandiva

    Mainly did this by passing a PYARROW_TARGET param
    to differentiate between building the main library and other libraries.
    wjones127 authored and Will Jones committed Jan 24, 2021
    1be4559
  3. Remove with-gandiva option

    wjones127 authored and Will Jones committed Jan 24, 2021
    6981b2a
  4. Add license header

    wjones127 authored and Will Jones committed Jan 24, 2021
    9d8bc15
  5. import gandiva

    wjones127 authored and Will Jones committed Jan 24, 2021
    901a76c
  6. Add development instructions for pyarrow_gandiva

    wjones127 authored and Will Jones committed Jan 24, 2021
    64f23ad
  7. Get tests to pass

    wjones127 authored and Will Jones committed Jan 24, 2021
    676e107
  8. Remove mark for gandiva, since it's now a package

    wjones127 authored and Will Jones committed Jan 24, 2021
    7906f34
  9. Remove mentions of gandiva in conftest

    wjones127 authored and Will Jones committed Jan 24, 2021
    b3a424d
  10. Remove import, since it will error

    wjones127 authored and Will Jones committed Jan 24, 2021
    605d2e0
  11. Add description and installation instructions in readme

    wjones127 authored and Will Jones committed Jan 24, 2021
    d77ac23
  12. Fix rebase mistakes

    Will Jones committed Jan 24, 2021
    06f78f0
  13. 4963060