86 changes: 86 additions & 0 deletions website/docs/presto_hudi_connector.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
# Presto Hudi Connector

## Overview

The **Presto Hudi Connector** enables querying Hudi tables that have been synced to a Hive metastore. The connector uses the metastore only to track partition locations, and relies on the underlying Hudi filesystem and input formats to list data files. To learn more about the design of the connector, see [RFC-44](https://github.com/apache/hudi/blob/master/rfc/rfc-44/rfc-44.md).

## Requirements

To use the Hudi connector, you need:

* Network access from the Presto coordinator and workers to the distributed object storage.

* Access to a Hive metastore service (HMS).

* Network access from the Presto coordinator to the HMS. Hive metastore access with the Thrift protocol defaults to using port 9083.
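
The network requirements above can be checked before configuring the catalog. The following is a minimal sketch, not part of the connector: the helper name `can_reach` is hypothetical, and the check only confirms that a TCP connection can be opened (it does not speak the Thrift protocol or verify credentials).

```python
import socket


def can_reach(host: str, port: int = 9083, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    9083 is the default Thrift port for the Hive metastore mentioned above.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Running `can_reach("<your HMS host>")` on the coordinator gives a quick signal on whether the metastore port is reachable. The same helper can be pointed at an object storage endpoint from each worker.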

## Configuration

The Hudi connector supports the same metastore configuration properties as the Hive connector. At a minimum, the following connector properties must be set in the `hudi.properties` file inside the `<presto_install_dir>/etc/catalog` directory:

```
connector.name=hudi
hive.metastore.uri=thrift://hms.host:9083

```
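
For scripted deployments, the catalog file can be generated rather than written by hand. This is a small sketch under stated assumptions: the function name `write_hudi_catalog` is hypothetical, and it writes only the two minimum properties shown above into `<presto_install_dir>/etc/catalog/hudi.properties`.

```python
from pathlib import Path


def write_hudi_catalog(presto_install_dir: str, metastore_uri: str) -> Path:
    """Write a minimal hudi.properties catalog file and return its path."""
    catalog_dir = Path(presto_install_dir) / "etc" / "catalog"
    # Create etc/catalog if it does not exist yet.
    catalog_dir.mkdir(parents=True, exist_ok=True)
    catalog_file = catalog_dir / "hudi.properties"
    catalog_file.write_text(
        "connector.name=hudi\n"
        f"hive.metastore.uri={metastore_uri}\n"
    )
    return catalog_file
```

For example, `write_hudi_catalog("/opt/presto", "thrift://hms.host:9083")` would produce the file shown above, where `/opt/presto` is an assumed install location. The file needs to be present on the coordinator and on every worker.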

Additionally, the following session properties can be set depending on the use case.

| Property Name | Description | Default |
| ----------- | ----------- | ----------- |
| hudi.metadata-table-enabled | Fetch the list of file names and sizes from Hudi’s metadata table rather than storage. | false |

## SQL Support

Currently, the connector provides only read access to data in Hudi tables that have been synced to the Hive metastore. Once the catalog has been configured as described above, users can query these tables the same way they query Hive tables.

## Supported Query Types

| Table Type | Supported Query Types |
| ----------- | ----------- |
| Copy On Write | Snapshot Queries |
| Merge On Read | Snapshot Queries + Read Optimized Queries |



## Example Queries

`trips_table` is a Hudi table that we refer to in the [Hudi quickstart documentation](https://hudi.apache.org/docs/quick-start-guide).

Here is a sample query:

```
USE hudi.default;
SELECT ts, fare, rider, driver, city FROM trips_table WHERE fare > 20.0;
```

Output:

```
ts | fare | rider | driver | city
---------------+-------+---------+----------+--------------------
1695516137016 | 34.15 | rider-F | driver-P | city=sao_paulo
1695046462179 | 33.9 | rider-D | driver-L | city=san_francisco
1695091554788 | 27.7 | rider-C | driver-M | city=san_francisco
```



## Historical

| **PrestoDB Version** | **Installation description** | **Query types supported** |
|----------------------|------------------------------|---------------------------|
| < 0.233 | Requires the `hudi-presto-bundle` jar to be placed into `<presto_install>/plugin/hive-hadoop2/`, across the installation. | Snapshot querying on COW tables. Read optimized querying on MOR tables. |
| >= 0.233 | No action needed. Hudi (0.5.1-incubating) is a compile time dependency. | Snapshot querying on COW tables. Read optimized querying on MOR tables. |
| >= 0.240 | No action needed. Hudi 0.5.3 version is a compile time dependency. | Snapshot querying on both COW and MOR tables. |
| >= 0.268 | No action needed. Hudi 0.9.0 version is a compile time dependency. | Snapshot querying on bootstrap tables. |
| >= 0.272 | No action needed. Hudi 0.10.1 version is a compile time dependency. | File listing optimizations. Improved query performance. |
| >= 0.275 | No action needed. Hudi 0.11.0 version is a compile time dependency. | All of the above. Native Hudi connector that is on par with Hive connector. |


> **Note**
>
> Incremental queries and point-in-time queries are not supported through either the Hive connector or the Hudi
> connector. However, they are on our roadmap, and you can track the development
> under [HUDI-3210](https://issues.apache.org/jira/browse/HUDI-3210).

2 changes: 1 addition & 1 deletion website/src/pages/ecosystem.md
Original file line number Diff line number Diff line change
@@ -12,7 +12,7 @@ In such cases, you can leverage another tool like Apache Spark or Apache Flink t
| Onehouse.ai | [Read + Write](https://www.onehouse.ai/universal-data-lakehouse) | |
| Apache Spark | [Read + Write](https://hudi.apache.org/docs/quick-start-guide) | |
| Apache Flink | [Read + Write](https://hudi.apache.org/docs/flink-quick-start-guide) | |
| Presto | [Read](https://prestodb.io/docs/current/connector/hudi.html) | |
| Presto | [Read](https://hudi.apache.org/docs/presto_hudi_connector) | |
| Trino | [Read](https://trino.io/docs/current/connector/hudi.html) | |
| Hive | [Read](https://hudi.apache.org/docs/next/query_engine_setup/#hive) | |
| DBT | [Read + Write](https://hudi.apache.org/blog/2022/07/11/build-open-lakehouse-using-apache-hudi-and-dbt) | |