
Spark Quick Start Guide column stat expression index is not working as expected for the scala code #14352

@rangareddy

Bug Description

What happened:

The column stats expression index example in the Spark Quick Start Guide does not behave as documented when run as Scala code. After following the guide's indexing steps, the query below, which filters on from_unixtime(ts, 'yyyy-MM-dd'), returns an empty result set:

scala> // Query on ts column would prune the data using the idx_column_ts index

scala> spark.sql(s"SELECT * FROM hudi_indexed_table WHERE from_unixtime(ts, 'yyyy-MM-dd') = '2023-09-24'").show(false);
25/11/24 11:20:31 WARN CacheManager: Asked to cache already cached data.
25/11/24 11:20:32 WARN CacheManager: Asked to cache already cached data.
+-------------------+--------------------+------------------+----------------------+-----------------+---+----+-----+------+----+----+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|ts |uuid|rider|driver|fare|city|
+-------------------+--------------------+------------------+----------------------+-----------------+---+----+-----+------+----+----+
+-------------------+--------------------+------------------+----------------------+-----------------+---+----+-----+------+----+----+

What you expected:

I expected the query to use the column stats expression index (idx_column_ts) to prune files and still return the matching rows, as documented in the Quick Start Guide.
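To establish which rows the predicate should match before attributing the empty result to the index, it may help to evaluate the date expression outside Spark. The following is a minimal sketch in plain Scala, assuming that from_unixtime renders epoch seconds in the session time zone; the object name FromUnixtimeCheck and the sample ts value are hypothetical and should be replaced with values actually present in the table.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

object FromUnixtimeCheck {
  // Approximates from_unixtime(ts, 'yyyy-MM-dd'): epoch seconds rendered
  // as a date string in the given time zone.
  def formatEpochSeconds(ts: Long, zone: String): String =
    DateTimeFormatter
      .ofPattern("yyyy-MM-dd")
      .withZone(ZoneId.of(zone))
      .format(Instant.ofEpochSecond(ts))

  def main(args: Array[String]): Unit = {
    // Hypothetical sample value; substitute a ts from the table under test.
    val sampleTs = 1695516137L
    Seq("UTC", "Asia/Kolkata", "America/Los_Angeles").foreach { zone =>
      println(s"$zone -> ${formatEpochSeconds(sampleTs, zone)}")
    }
  }
}
```

For this sample value the date is 2023-09-24 in UTC and Asia/Kolkata but 2023-09-23 in America/Los_Angeles, so whether a row is expected to match '2023-09-24' depends on the time zone the session uses.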

Steps to reproduce:

  1. Follow the Spark quick start guide index example (https://hudi.apache.org/docs/quick-start-guide#indexing)
  2. Run the query shown above against hudi_indexed_table; it returns an empty result set.
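One way to narrow the problem down is to run the same query with data skipping enabled and disabled and compare the row counts. The sketch below is meant to be pasted into a spark-shell session started with the Hudi Spark bundle, after the guide's indexing steps have been run; it assumes that the session-level setting hoodie.enable.data.skipping is honored for this table, which may not hold if the table's file index has already been cached in the session.

```scala
// `spark` is the SparkSession provided by spark-shell.
val query =
  "SELECT * FROM hudi_indexed_table WHERE from_unixtime(ts, 'yyyy-MM-dd') = '2023-09-24'"

// Count matches with data skipping enabled (index-based pruning may apply).
spark.sql("SET hoodie.enable.data.skipping=true")
val withSkipping = spark.sql(query).count()

// Count matches with data skipping disabled (no index-based file pruning).
spark.sql("SET hoodie.enable.data.skipping=false")
val withoutSkipping = spark.sql(query).count()

println(s"with skipping: $withSkipping, without skipping: $withoutSkipping")
```

If the two counts differ, the index-based pruning is a likely suspect; if both are zero, the predicate itself may not match any row. Restarting the shell between the two runs is the more conservative variant of this check.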

Environment

Hudi version: 1.1.0
Query engine: Spark
Relevant configs:

Logs and Stack Trace

No response
