8 changes: 4 additions & 4 deletions docs/source/contributor-guide/parquet_scans.md
@@ -45,8 +45,8 @@ The `native_datafusion` and `native_iceberg_compat` scans share the following limitations:
- When reading Parquet files written by systems other than Spark that contain columns with the logical types `UINT_8`
or `UINT_16`, Comet will produce different results than Spark because Spark does not preserve or understand these
logical types. Arrow-based readers, such as DataFusion and Comet, do respect these types and read the data as unsigned
rather than signed. By default, Comet will fall back to Spark's native scan when scanning Parquet files containing
`byte` or `short` types (regardless of the logical type). This behavior can be disabled by setting
`spark.comet.scan.allowIncompatible=true` (see the example after this list).
- No support for default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
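
If the difference in unsigned semantics is acceptable, the fallback described above can be disabled per session. A minimal sketch, assuming the config can be set at runtime (the Parquet path is a hypothetical placeholder):

```scala
// Allow Comet's native scans to read byte/short columns, accepting that
// UINT_8/UINT_16 logical types are read as unsigned rather than signed.
spark.conf.set("spark.comet.scan.allowIncompatible", "true")

// Placeholder path; any Parquet file containing byte/short columns applies.
val df = spark.read.parquet("/data/events_with_tinyint_columns.parquet")
df.printSchema()
```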

@@ -59,11 +59,11 @@ The `native_datafusion` scan has some additional limitations:

## S3 Support

There are some differences in S3 support between the scan implementations.

### `native_comet`

The `native_comet` Parquet scan implementation reads data from S3 using the [Hadoop-AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html), which
is identical to the approach commonly used with vanilla Spark. AWS credential configuration and other Hadoop S3A
configurations work the same way as in vanilla Spark.
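
For example, credentials and endpoint settings use the standard Hadoop S3A keys, just as they would without Comet. A hedged sketch (bucket, path, and environment variable names are placeholders):

```scala
// Standard Hadoop S3A settings, configured exactly as for vanilla Spark.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
hadoopConf.set("fs.s3a.endpoint", "s3.us-east-1.amazonaws.com")

// Reads go through the Hadoop-AWS (s3a://) connector, same as vanilla Spark.
val df = spark.read.parquet("s3a://my-bucket/path/to/table")
df.show()
```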

13 changes: 6 additions & 7 deletions docs/source/contributor-guide/roadmap.md
@@ -27,11 +27,10 @@ helpful to have a roadmap for some of the major items that require coordination
### Iceberg Integration

Iceberg integration is still a work-in-progress ([#2060]), with major improvements expected in the next few
releases. The default `auto` scan mode now uses `native_iceberg_compat` instead of `native_comet`, enabling
support for complex types.

[#2060]: https://github.com/apache/datafusion-comet/issues/2060

### Spark 4.0 Support

@@ -44,12 +43,12 @@ more Spark SQL tests and fully implementing ANSI support ([#313]) for all suppor
### Removing the native_comet scan implementation

We are working towards deprecating ([#2186]) and removing ([#2177]) the `native_comet` scan implementation, which
is the original scan implementation that uses mutable buffers (which is incompatible with best practices around
Arrow FFI) and does not support complex types.

Now that the default `auto` scan mode uses `native_iceberg_compat` (which is based on DataFusion's `DataSourceExec`),
we can proceed with removing the `native_comet` scan implementation, and then improve the efficiency of our use of
Arrow FFI ([#2171]).

[#2186]: https://github.com/apache/datafusion-comet/issues/2186
[#2171]: https://github.com/apache/datafusion-comet/issues/2171
6 changes: 5 additions & 1 deletion docs/source/user-guide/latest/datasources.md
@@ -36,7 +36,11 @@ Comet accelerates Iceberg scans of Parquet files. See the [Iceberg Guide] for mo

### CSV

Comet provides experimental native CSV scan support. When `spark.comet.scan.csv.v2.enabled` is enabled, CSV files
are read natively for improved performance, although the benefits are workload-dependent.

Alternatively, when `spark.comet.convert.csv.enabled` is enabled, data from Spark's CSV reader is immediately
converted into Arrow format, allowing native execution to happen after that.
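
A minimal sketch of trying either option from a Spark session, assuming both configs can be toggled at session level (the file path is a placeholder):

```scala
// Experimental native CSV scan.
spark.conf.set("spark.comet.scan.csv.v2.enabled", "true")

// Alternative: keep Spark's CSV reader but convert its output to Arrow so
// downstream operators can still run natively.
// spark.conf.set("spark.comet.convert.csv.enabled", "true")

// Placeholder path; these are ordinary Spark CSV reader options.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/example.csv")
df.show()
```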

### JSON
22 changes: 15 additions & 7 deletions docs/source/user-guide/latest/expressions.md
@@ -68,6 +68,7 @@ Expressions that are not Spark-compatible will fall back to Spark by default and
| Contains | Yes | |
| EndsWith | Yes | |
| InitCap | No | Behavior is different in some cases, such as hyphenated names. |
| Left | Yes | Length argument must be a literal value |
| Length | Yes | |
| Like | Yes | |
| Lower | No | Results can vary depending on locale and character set. Requires `spark.comet.caseConversion.enabled=true` |
@@ -94,15 +95,20 @@ Expressions that are not Spark-compatible will fall back to Spark by default and
| Expression | SQL | Spark-Compatible? | Compatibility Notes |
| -------------- | ---------------------------- | ----------------- | -------------------------------------------------------------------------------------------------------------------- |
| DateAdd | `date_add` | Yes | |
| DateDiff | `datediff` | Yes | |
| DateFormat | `date_format` | Yes | Partial support. Only specific format patterns are supported. |
| DateSub | `date_sub` | Yes | |
| DatePart | `date_part(field, source)` | Yes | Supported values of `field`: `year`/`month`/`week`/`day`/`dayofweek`/`dayofweek_iso`/`doy`/`quarter`/`hour`/`minute` |
| Extract | `extract(field FROM source)` | Yes | Supported values of `field`: `year`/`month`/`week`/`day`/`dayofweek`/`dayofweek_iso`/`doy`/`quarter`/`hour`/`minute` |
| FromUnixTime | `from_unixtime` | No | Does not support format, supports only -8334601211038 <= sec <= 8210266876799 |
| Hour | `hour` | Yes | |
| LastDay | `last_day` | Yes | |
| Minute | `minute` | Yes | |
| Second | `second` | Yes | |
| TruncDate | `trunc` | Yes | |
| TruncTimestamp | `date_trunc` | Yes | |
| UnixDate | `unix_date` | Yes | |
| UnixTimestamp | `unix_timestamp` | Yes | |
| Year | `year` | Yes | |
| Month | `month` | Yes | |
| DayOfMonth | `day`/`dayofmonth` | Yes | |
@@ -163,6 +169,7 @@ Expressions that are not Spark-compatible will fall back to Spark by default and
| ----------- | ----------------- |
| Md5 | Yes |
| Murmur3Hash | Yes |
| Sha1 | Yes |
| Sha2 | Yes |
| XxHash64 | Yes |

@@ -256,12 +263,13 @@ Comet supports using the following aggregate functions within window contexts wi

## Struct Expressions

| Expression | Spark-Compatible? | Compatibility Notes |
| -------------------- | ----------------- | ------------------------------------------ |
| CreateNamedStruct | Yes | |
| GetArrayStructFields | Yes | |
| GetStructField | Yes | |
| JsonToStructs | No | Partial support. Requires explicit schema. |
| StructsToJson | Yes | |
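
As the table above notes, `JsonToStructs` (`from_json`) requires an explicit schema. A minimal sketch of both directions (column names and schema are illustrative):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{col, from_json, to_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// JsonToStructs (from_json) needs the schema spelled out explicitly.
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)
))

val df = Seq("""{"id": 1, "name": "a"}""").toDF("json_col")
val parsed = df.select(from_json(col("json_col"), schema).alias("s"))

// StructsToJson (to_json) converts the struct back to a JSON string.
parsed.select(to_json(col("s")).alias("json_out")).show(false)
```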

## Conversion Expressions

95 changes: 87 additions & 8 deletions docs/source/user-guide/latest/iceberg.md
@@ -183,14 +183,93 @@ $SPARK_HOME/bin/spark-shell \
The same sample queries from above can be used to test Comet's fully-native Iceberg integration;
however, the scan node to look for is `CometIcebergNativeScan`.
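
For example, a quick way to confirm that the native scan is being used (the table name is a placeholder):

```scala
// The physical plan for an accelerated query should contain a
// CometIcebergNativeScan node rather than Spark's own BatchScan.
val df = spark.sql("SELECT * FROM my_catalog.db.events WHERE id > 100")
df.explain()
```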

### Supported features

The native Iceberg reader supports the following features:

**Table specifications:**

- Iceberg table spec v1 and v2 (v3 will fall back to Spark)

**Schema and data types:**

- All primitive types including UUID
- Complex types: arrays, maps, and structs
- Schema evolution (adding and dropping columns)

**Time travel and branching:**

- `VERSION AS OF` queries to read historical snapshots
- Branch reads for accessing named branches
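
For example, both features use Iceberg's SQL time-travel syntax in Spark; the table name, snapshot ID, and branch name below are placeholders:

```scala
// Read a historical snapshot by snapshot ID (placeholder value).
spark.sql("SELECT * FROM my_catalog.db.events VERSION AS OF 4925868687881398371").show()

// Read a named branch (placeholder branch name).
spark.sql("SELECT * FROM my_catalog.db.events VERSION AS OF 'audit-branch'").show()
```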

**Delete handling (Merge-On-Read tables):**

- Positional deletes
- Equality deletes
- Mixed delete types

**Filter pushdown:**

- Equality and comparison predicates (`=`, `!=`, `>`, `>=`, `<`, `<=`)
- Logical operators (`AND`, `OR`)
- NULL checks (`IS NULL`, `IS NOT NULL`)
- `IN` and `NOT IN` list operations
- `BETWEEN` operations
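
A sketch of a query whose predicates all fall into the supported pushdown categories above (table and column names are placeholders):

```scala
// All of these predicates can be pushed down into the native Iceberg scan.
spark.sql("""
  SELECT id, status, amount
  FROM my_catalog.db.orders
  WHERE status IN ('shipped', 'delivered')
    AND amount BETWEEN 10.0 AND 500.0
    AND customer_id IS NOT NULL
    AND region = 'us-east'
""").show()
```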

**Partitioning:**

- Standard partitioning with partition pruning
- Date partitioning with `days()` transform
- Bucket partitioning
- Truncate transform
- Hour transform
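
These transforms use Iceberg's standard Spark DDL syntax. A minimal sketch with a hypothetical catalog and table (only the `days` and `bucket` transforms are shown):

```scala
// Day and bucket partition transforms; queries that filter on event_ts or id
// can benefit from partition pruning in the native scan.
spark.sql("""
  CREATE TABLE my_catalog.db.events (
    id BIGINT,
    category STRING,
    event_ts TIMESTAMP
  ) USING iceberg
  PARTITIONED BY (days(event_ts), bucket(16, id))
""")
```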

**Storage:**

- Local filesystem
- Hadoop Distributed File System (HDFS)
- S3-compatible storage (AWS S3, MinIO)

### REST Catalog

Comet's native Iceberg reader also supports REST catalogs. The following example shows how to
configure Spark to use a REST catalog with Comet's native Iceberg scan:

```shell
$SPARK_HOME/bin/spark-shell \
--packages org.apache.datafusion:comet-spark-spark3.5_2.12:0.12.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-core:1.8.1 \
--repositories https://repo1.maven.org/maven2/ \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.rest_cat=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.rest_cat.catalog-impl=org.apache.iceberg.rest.RESTCatalog \
--conf spark.sql.catalog.rest_cat.uri=http://localhost:8181 \
--conf spark.sql.catalog.rest_cat.warehouse=/tmp/warehouse \
--conf spark.plugins=org.apache.spark.CometPlugin \
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
--conf spark.comet.scan.icebergNative.enabled=true \
--conf spark.comet.explainFallback.enabled=true \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=2g
```

Note that REST catalogs require explicit namespace creation before creating tables:

```scala
scala> spark.sql("CREATE NAMESPACE rest_cat.db")
scala> spark.sql("CREATE TABLE rest_cat.db.test_table (id INT, name STRING) USING iceberg")
scala> spark.sql("INSERT INTO rest_cat.db.test_table VALUES (1, 'Alice'), (2, 'Bob')")
scala> spark.sql("SELECT * FROM rest_cat.db.test_table").show()
```

### Current limitations

The following scenarios will fall back to Spark's native Iceberg reader:

- Iceberg table spec v3 scans
- Iceberg writes (reads are accelerated, writes use Spark)
- Tables backed by Avro or ORC data files (only Parquet is accelerated)
- Tables partitioned on `BINARY` or `DECIMAL` (with precision >28) columns
- Scans with residual filters using `truncate`, `bucket`, `year`, `month`, `day`, or `hour`
transform functions (partition pruning still works, but row-level filtering of these
transforms falls back)