v0.7.10
What's Changed
Features
- feat(dashboard): Draft tasks view for Flotilla @samstokes (#6783)
- feat(functions): add simhash and hamming_distance for near-duplicate detection @chenghuichen (#6821)
- feat: add pmod function for PySpark parity @YuangGao (#6801)
- feat: distributed range repartitioned asof joins v2 @euanlimzx (#6816)
- feat(distributed): Add task metadata to task events @cckellogg (#6822)
- feat: genericize RandomShuffle to support Flight shuffle backend @chenghuichen (#6808)
- feat(paimon): enhance Paimon integration @YannByron (#6635)
- feat(dashboard): render Repartition spec in human-readable form @BABTUNA (#6798)
- feat(cli): show query id in CLI output @BABTUNA (#6755)
- feat(functions): add hamming_distance function @gweaverbiodev (#6797)
- feat(ext): add UDAF support to daft-ext @chenghuichen (#6789)
- feat(list): add list_filter expression @aaron-ang (#6769)
- feat(ext): add C++ hello extension example @universalmind303 (#6804)
- feat(functions): add great_circle_distance function @gweaverbiodev (#6754)
- feat: support Ray 2.55.0 @XuQianJin-Stars (#6767)
- feat(distributed): Add env to enable task events @cckellogg (#6782)
- feat(functions): add seq expression for per-row integer sequences @yuchen-ecnu (#6772)
- feat(dataframe): add DataFrame.count_distinct() and GroupedDataFrame.count_distinct() methods @kerwin-zk (#6658)
- feat(distributed): enable two-stage HLL aggregation for approx_count_distinct @desmondcheongzx (#6597)
- feat(distributed): emit TaskSubmit and TaskEnd lifecycle events @cckellogg (#6759)
- feat: introduce AggFn trait and two-stage UDAF aggregation pipeline @chenghuichen (#6704)
- feat(cast): support cross-TimeUnit Duration casts @aaron-ang (#6766)
- feat: introduce richer source stats @rchowell (#6723)
- feat(dataframe): add DataFrame.product() and GroupedDataFrame.product() methods @kerwin-zk (#6655)
- feat(checkpoint): distributed checkpoint filter via optimizer rule @rohitkulshreshtha (#6725)
- feat(stats): Add num_tasks metric tracking across all runtime stats @samstokes (#6716)
- feat: implement bin function @YuangGao (#6728)
- feat: implement batch 3 temporal functions (make_date, make_timestamp, make_timestamp_ltz, last_day, next_day) @BABTUNA (#6672)
- feat: percentile and median ops @aaron-ang (#6153)
- feat(datatype): allow accessing simple type constructors as properties @Lucas61000 (#6620)
- feat(sql): improve error reporting with source context and caret @Lucas61000 (#6572)
- feat: better native execution for asof joins @euanlimzx (#6699)
- feat(sql): add SAMPLE support for SQL queries @Lucas61000 (#6636)
- feat(subscribers): add optional origin_node_id to py operator events @cckellogg (#6714)
- feat: Union Type @PhysicsACE (#6497)
- feat: implement e, pi, factorial, and hypot math functions @BABTUNA (#6682)
- feat: support catalog-qualified identifiers in create_table, drop_table @BABTUNA (#6680)
- feat: remote EventLogSubscriber @universalmind303 (#6701)
- feat(dashboard): subscriber heartbeat and dead query detection @samstokes (#6676)
- feat: add S3-backed CheckpointStore implementation @chenghuichen (#6599)
- feat: distributed execution of asof joins @euanlimzx (#6667)
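Several of the functions listed above correspond to well-known techniques. The block below is a minimal pure-Python sketch of what `simhash`, `hamming_distance`, `pmod`, and `great_circle_distance` conventionally compute. It is not Daft's implementation or API: the function signatures, the BLAKE2b token hash, the 64-bit fingerprint width, the integer inputs to `hamming_distance`, and the 6371 km Earth radius are all assumptions made for illustration.

```python
import hashlib
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, for illustration only


def simhash(tokens, bits=64):
    """Fold per-token hashes into one fingerprint (hypothetical signature)."""
    counts = [0] * bits
    for tok in tokens:
        digest = hashlib.blake2b(tok.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the votes for it are positive.
    return sum(1 << i for i, c in enumerate(counts) if c > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two integers differ."""
    return bin(a ^ b).count("1")


def pmod(a: int, n: int) -> int:
    """Positive modulo: non-negative result for a positive divisor."""
    r = int(math.fmod(a, n))  # truncated remainder, sign follows the dividend
    return (r + n) % n if r < 0 else r


def great_circle_distance(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

In a near-duplicate workflow of this kind, two documents are typically treated as candidates when the Hamming distance between their fingerprints falls below a small threshold; the threshold is application-specific.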
Bug Fixes
- fix(test): update test_local_full_ls to expect canonical file:// URIs @rohitkulshreshtha (#6824)
- fix: ensure paimon is installed for integration tests @rchowell (#6827)
- fix(udf): resolve use_process=True subprocess deadlocks @rohitkulshreshtha (#6793)
- fix(udf): handle UDF expressions with no column references (#6805) @rohitkulshreshtha (#6814)
- fix(io): emit canonical file:// URIs for Windows drive paths @rohitkulshreshtha (#6817)
- fix(docs): fix many docs issues @colin-ho (#6811)
- fix(checkpoint): use strip_file_uri_to_path in put_bytes for Windows @rohitkulshreshtha (#6796)
- fix(checkpoint): normalize Windows tempdir paths in s3_store tests @rohitkulshreshtha (#6791)
- fix: respect per-method max_retries/on_error overrides @BABTUNA (#6784)
- fix(core): normalize FixedSizeListArray inner field name to 'item' @veinkr-bot (#6733)
- fix(tests): deflake test_sharding_with_file_scan @veinkr-bot (#6787)
- fix: preserve identity partition predicates when combined with ScalarFn siblings @gavin9402 (#6695)
- fix(flight-shuffle): handle all-empty inputs in shuffle cache and shu… @ohbh (#6780)
- fix(flotilla): overreporting of bytes.read @universalmind303 (#6774)
- fix: add missing serde feature to uuid workspace dependency @chenghuichen (#6773)
- fix: respect proxy env vars (HTTP_PROXY, HTTPS_PROXY, etc.) in S3 client @BABTUNA (#6679)
- fix: fix column not found when using count_rows() for sparse data @caican00 (#6703)
- fix(scheduler): include dispatched tasks in autoscaling ratio @desmondcheongzx (#6388)
- fix(io): Write metrics in close for the last batch @kvthr (#6606)
- fix(docs): replace deprecated .struct.get() with .get() expression in examples @everettVT (#6709)
- fix: incrementally ramp up Ray autoscaler resource requests to avoid exceeding cluster capacity @ohbh (#6653)
- fix: Fix document example, unnest is not a param on prompt @colin-ho (#6712)
- fix(stats): Track source execution time per message in pipeline stats @samstokes (#6715)
- fix: add tenacity retries to Google Sheets upload in benchmarking @jeevb (#6713)
- fix: skip schema pruning on Source node when can_absorb_select is false @helmanofer (#6501)
- fix(io): support writing to local fs via GravitinoGvfs local issue @qingfeng-occ (#6579)
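Several fixes above concern canonical `file://` URIs for Windows drive paths. As a reference point only, the sketch below shows the form Python's standard library produces for such paths. The helper name `to_file_uri` is hypothetical, and whether Daft's output matches this form character for character is an assumption, not something these notes state.

```python
from pathlib import PureWindowsPath


def to_file_uri(path: str) -> str:
    """Convert an absolute Windows path to a file:// URI (hypothetical helper)."""
    # PureWindowsPath parses Windows syntax on any OS; as_uri() rejects relative paths.
    return PureWindowsPath(path).as_uri()


print(to_file_uri(r"C:\data\part-0.parquet"))  # file:///C:/data/part-0.parquet
print(to_file_uri(r"C:\my data\x.csv"))        # file:///C:/my%20data/x.csv
```

Note the three slashes after `file:` (an empty authority), the forward slashes, and the percent-encoded space.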
Performance
- perf(optimizer): implement DP-ccp join ordering algorithm @desmondcheongzx (#6460)
- perf(inline-agg): add Utf8/Binary single-column fast path @BABTUNA (#6656)
Refactor
- refactor(flight): implement into_partitions for flight shuffle @ohbh (#6764)
- refactor(flight): implement distributed gather for flight shuffle @ohbh (#6751)
- refactor(partition-refs): consolidate partition ref types and trim unused derives @ohbh (#6742)
- refactor(distributed): Remove node_origin_id self-ref in distributed @cckellogg (#6738)
- refactor(inline-agg): macro-generate accumulator dispatch @BABTUNA (#6642)
- refactor: flight-server passes PartitionRefs directly to the scheduler @ohbh (#6627)
- refactor(daft-distributed): cleanup of statistics_manager @universalmind303 (#6718)
Documentation
- docs: add extension ecosystem pages @everettVT (#6836)
- docs: add Tencent Cloud COS connector documentation @XuQianJin-Stars (#6818)
- docs: fix critical documentation issues from audit @colin-ho (#6834)
- docs: clarify audio file size description @ykdojo (#6792)
- docs: Fix docsearch results to respect current documentation version @colin-ho (#6778)
- docs: fix critical documentation issues 7-14 from audit @colin-ho (#6777)
- docs(extensions): move authoring guide to dedicated page @everettVT (#6732)
- docs(custom-code): document cpus/gpus/retries/ray_options and fix migration guide @everettVT (#6702)
- docs(extensions): add community extensions subpage @desmondcheongzx (#6717)
Tests
- test(udf): cover combined large-stderr + no-newline use_process case @rohitkulshreshtha (#6825)
- test: Add local worker execution and statistics testing infrastructure @samstokes (#6697)
Maintenance
- chore: Simplify local Repartition operator @srilman (#6739)
- chore(dashboard): harden npm/next.js supply chain via .npmrc @everettVT (#6727)
- chore: upgrade deltalake to 1.5.0 @rchowell (#6736)
- chore(parquet): lower opt-level for daft-parquet release builds @desmondcheongzx (#6690)
Full Changelog: v0.7.9...v0.7.10