
[SUPPORT] Unable to filter hudi table in hive on partition column #1962

Closed · sassai opened this issue Aug 13, 2020 · 15 comments
Labels: hive (issues related to hive meta-sync)


@sassai commented Aug 13, 2020

Describe the problem you faced

I'm running a spark structured streaming application that reads data from kafka and saves it to a partitioned Hudi MERGE_ON_READ table. Hive sync is enabled and I'm able to query the table with the Hive CLI, e.g.:

SELECT * FROM iot_device_ro LIMIT 5;

+------------------------------------+-------------------------------------+--------------------------------------------+---------------------------------------------+----------------------------------------------------+-------------------------+-------------------------+----------------------------+---------------------------+---------------------------------------+----------------------+---------------------+----------------------+--------------------+---------------------+-----------------------+
| iot_device_ro._hoodie_commit_time  | iot_device_ro._hoodie_commit_seqno  |      iot_device_ro._hoodie_record_key      |    iot_device_ro._hoodie_partition_path     |          iot_device_ro._hoodie_file_name           | iot_device_ro.deviceid  | iot_device_ro.sensorid  | iot_device_ro.measurement  | iot_device_ro.measure_ts  |          iot_device_ro.uuid           |  iot_device_ro.its   | iot_device_ro.year  | iot_device_ro.month  | iot_device_ro.day  | iot_device_ro.hour  | iot_device_ro.minute  |
+------------------------------------+-------------------------------------+--------------------------------------------+---------------------------------------------+----------------------------------------------------+-------------------------+-------------------------+----------------------------+---------------------------+---------------------------------------+----------------------+---------------------+----------------------+--------------------+---------------------+-----------------------+
| 20200813121124                     | 20200813121124_0_1                  | uuid:3d387a37-f288-456b-87b7-2b6865cf32e0  | year=2020/month=8/day=13/hour=12/minute=11  | 53c3c919-ff1c-49f6-ba74-4498b635dfb6-0_0-21-23_20200813121124.parquet | iotdevice4              | 1                       | 30.228266831690732         | 2020-08-13T08:39:04.528Z  | 3d387a37-f288-456b-87b7-2b6865cf32e0  | 2020-08-13 12:11:24  | 2020                | 8                    | 13                 | 12                  | 11                    |
| 20200813121124                     | 20200813121124_0_2                  | uuid:5bed809e-758f-46dc-b1ab-837ad3eb5a6a  | year=2020/month=8/day=13/hour=12/minute=11  | 53c3c919-ff1c-49f6-ba74-4498b635dfb6-0_0-21-23_20200813121124.parquet | iotdevice4              | 1                       | 31.453188991515226         | 2020-08-13T08:39:19.588Z  | 5bed809e-758f-46dc-b1ab-837ad3eb5a6a  | 2020-08-13 12:11:24  | 2020                | 8                    | 13                 | 12                  | 11                    |
| 20200813121124                     | 20200813121124_0_3                  | uuid:6d37be34-6e4b-49b0-b3fe-e6552c2aee22  | year=2020/month=8/day=13/hour=12/minute=11  | 53c3c919-ff1c-49f6-ba74-4498b635dfb6-0_0-21-23_20200813121124.parquet | iotdevice4              | 1                       | 34.68735798194983          | 2020-08-13T07:45:05.958Z  | 6d37be34-6e4b-49b0-b3fe-e6552c2aee22  | 2020-08-13 12:11:24  | 2020                | 8                    | 13                 | 12                  | 11                    |
| 20200813121124                     | 20200813121124_0_4                  | uuid:5c2dbea8-9668-4652-84c6-c82d06aa2805  | year=2020/month=8/day=13/hour=12/minute=11  | 53c3c919-ff1c-49f6-ba74-4498b635dfb6-0_0-21-23_20200813121124.parquet | iotdevice4              | 1                       | 33.680806905962264         | 2020-08-12T13:33:20.159Z  | 5c2dbea8-9668-4652-84c6-c82d06aa2805  | 2020-08-13 12:11:24  | 2020                | 8                    | 13                 | 12                  | 11                    |
| 20200813121124                     | 20200813121124_0_5                  | uuid:528e6c74-bb44-49da-aa76-059781cc7676  | year=2020/month=8/day=13/hour=12/minute=11  | 53c3c919-ff1c-49f6-ba74-4498b635dfb6-0_0-21-23_20200813121124.parquet | iotdevice4              | 1                       | 31.38529683936205          | 2020-08-13T10:57:58.448Z  | 528e6c74-bb44-49da-aa76-059781cc7676  | 2020-08-13 12:11:24  | 2020                | 8                    | 13                 | 12                  | 11                    |
+------------------------------------+-------------------------------------+--------------------------------------------+---------------------------------------------+----------------------------------------------------+-------------------------+-------------------------+----------------------------+---------------------------+---------------------------------------+----------------------+---------------------+----------------------+--------------------+---------------------+-----------------------+

However, if I apply a filter on the partition column, the result is empty:

SELECT * FROM iot_device_ro WHERE day=13 LIMIT 10;

+------------------------------------+-------------------------------------+-----------------------------------+---------------------------------------+----------------------------------+-------------------------+-------------------------+----------------------------+---------------------------+---------------------+--------------------+---------------------+----------------------+--------------------+---------------------+-----------------------+
| iot_device_ro._hoodie_commit_time  | iot_device_ro._hoodie_commit_seqno  | iot_device_ro._hoodie_record_key  | iot_device_ro._hoodie_partition_path  | iot_device_ro._hoodie_file_name  | iot_device_ro.deviceid  | iot_device_ro.sensorid  | iot_device_ro.measurement  | iot_device_ro.measure_ts  | iot_device_ro.uuid  | iot_device_ro.its  | iot_device_ro.year  | iot_device_ro.month  | iot_device_ro.day  | iot_device_ro.hour  | iot_device_ro.minute  |
+------------------------------------+-------------------------------------+-----------------------------------+---------------------------------------+----------------------------------+-------------------------+-------------------------+----------------------------+---------------------------+---------------------+--------------------+---------------------+----------------------+--------------------+---------------------+-----------------------+
+------------------------------------+-------------------------------------+-----------------------------------+---------------------------------------+----------------------------------+-------------------------+-------------------------+----------------------------+---------------------------+---------------------+--------------------+---------------------+----------------------+--------------------+---------------------+-----------------------+
No rows selected (70.91 seconds)

To Reproduce

Steps to reproduce the behavior:

Hudi Datasource Configuration in Spark:

Dataset<Row> output =
    streamingInput.withColumn("its", date_format(current_timestamp(), DATE_FORMAT));

DataStreamWriter<Row> writer =
    output
        .writeStream()
        .format("hudi")
        .option("hoodie.insert.shuffle.parallelism", "2")
        .option("hoodie.upsert.shuffle.parallelism", "2")
        .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY(), tableType)
        .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "uuid")
        .option(
            DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY(),
            ComplexKeyGenerator.class.getCanonicalName())
        .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), partitions)
        .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "its")
        .option(HoodieWriteConfig.TABLE_NAME, tableName)
        .option("checkpointLocation", streamingCheckpointingPath)
        .option(DataSourceWriteOptions.STREAMING_IGNORE_FAILED_BATCH_OPT_KEY(), "false")
        .option(DataSourceWriteOptions.STREAMING_RETRY_CNT_OPT_KEY(), "10")
        .outputMode(OutputMode.Append())
        .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY(), tableName)
        .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY(), hiveDB)
        .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), hiveJdbcUrl)
        .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY(), "true")
        .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY(), "true")
        .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY(), partitions)
        .option(
            DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY(),
            MultiPartKeysValueExtractor.class.getCanonicalName());

Hudi Table Description in Hive:

+--------------------------+------------+----------+
|         col_name         | data_type  | comment  |
+--------------------------+------------+----------+
| _hoodie_commit_time      | string     |          |
| _hoodie_commit_seqno     | string     |          |
| _hoodie_record_key       | string     |          |
| _hoodie_partition_path   | string     |          |
| _hoodie_file_name        | string     |          |
| deviceid                 | string     |          |
| sensorid                 | string     |          |
| measurement              | double     |          |
| measure_ts               | string     |          |
| uuid                     | string     |          |
| its                      | string     |          |
| year                     | int        |          |
| month                    | int        |          |
| day                      | int        |          |
| hour                     | int        |          |
| minute                   | int        |          |
|                          | NULL       | NULL     |
| # Partition Information  | NULL       | NULL     |
| # col_name               | data_type  | comment  |
| year                     | int        |          |
| month                    | int        |          |
| day                      | int        |          |
| hour                     | int        |          |
| minute                   | int        |          |
+--------------------------+------------+----------+

Hive Partitions created by HudiHiveSync:

+---------------------------------------------+
|                  partition                  |
+---------------------------------------------+
| year=2020/month=8/day=13/hour=12/minute=11  |
| year=2020/month=8/day=13/hour=12/minute=17  |
| year=2020/month=8/day=13/hour=12/minute=18  |
| year=2020/month=8/day=13/hour=12/minute=19  |
| year=2020/month=8/day=13/hour=12/minute=20  |
| year=2020/month=8/day=13/hour=12/minute=21  |

Additional table information:

Detailed Table Information

Database:        xxx
OwnerType:       USER
Owner:           xxx
CreateTime:      Thu Aug 13 12:11:49 UTC 2020
LastAccessTime:  UNKNOWN
Retention:       0
Location:        abfs://xxx@xxx.dfs.core.windows.net/data/hudi/streaming/tables/xxx/iot_device
Table Type:      EXTERNAL_TABLE
Table Parameters:
  EXTERNAL                              TRUE
  bucketing_version                     2
  discover.partitions                   true
  last_commit_time_sync                 20200813133453
  numFiles                              77
  numPartitions                         68
  numRows                               0
  rawDataSize                           0
  spark.sql.create.version              2.2 or prior
  spark.sql.sources.schema.numPartCols  5
  spark.sql.sources.schema.numParts     1
  spark.sql.sources.schema.part.0       {\"type\":\"struct\",\"fields\":[{\"name\":\"_hoodie_commit_time\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_commit_seqno\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_record_key\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_partition_path\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_file_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"deviceid\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"sensorid\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"measurement\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"measure_ts\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"uuid\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"its\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"year\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"month\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"day\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"hour\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"minute\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}
  spark.sql.sources.schema.partCol.0    year
  spark.sql.sources.schema.partCol.1    month
  spark.sql.sources.schema.partCol.2    day
  spark.sql.sources.schema.partCol.3    hour
  spark.sql.sources.schema.partCol.4    minute
  totalSize                             34194101
  transient_lastDdlTime                 1597324063

Storage Information

SerDe Library:        org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:          org.apache.hudi.hadoop.HoodieParquetInputFormat
OutputFormat:         org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed:           No
Num Buckets:          -1
Bucket Columns:       []
Sort Columns:         []
Storage Desc Params:
  serialization.format  1

Strangely enough, if I run the same query in Spark, everything works as expected.

Expected behavior

Filters on partition columns can be applied to Hudi tables in the Hive CLI.

Environment Description

  • Hudi version : 0.5.3

  • Spark version : 2.4.0

  • Hive version : 3.1

  • Hadoop version : 3

  • Storage (HDFS/S3/GCS..) : ADLS

  • Running on Docker? (yes/no) : no

What am I missing? Any help is very much appreciated!

Thank you in advance!

@nsivabalan (Contributor)

Did you set the Hive input format? Also, can you confirm that the settings given here are set?
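For reference, the session-level settings typically suggested for querying Hudi tables from Hive look roughly like this (a sketch, not something verbatim from this thread; the exact input format class depends on table type, query type, and Hudi version, so verify against the docs for your release):

```sql
-- Session settings commonly suggested for querying Hudi tables in Hive (sketch):
SET hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
-- or, alternatively:
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- then re-run the failing query:
SELECT * FROM iot_device_ro WHERE day=13 LIMIT 10;
```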

@sassai (Author) commented Aug 14, 2020

Hi Sivabalan,

when I try to apply the given settings, I get the following error in the Hive CLI:

20/08/14 07:19:51 [main]: ERROR jdbc.Utils: Unable to read HiveServer2 configs from ZooKeeper
Error: Could not open client transport for any of the Server URI's in ZooKeeper: Failed to open new session: java.lang.IllegalArgumentException: Cannot modify hive.input.format at runtime. It is not in list of params that are allowed to be modified at runtime (state=08S01,code=0)
Beeline version 3.1.3000.7.2.0.0-237 by Apache Hive

@nsivabalan (Contributor)

How are you connecting to Hive: beeline or some other means? If it's beeline, you should be able to pass in configs as shown on the quick start page. If by some other means, you probably have to find out how to set Hive configs when you bring up that tool. Or you can paste the command you used to bring up Hive (obfuscating any confidential info) and we can try to help you out.

@nsivabalan (Contributor)

Also, are you having issues with all columns or just a few?

@sassai (Author) commented Aug 14, 2020

I'm running Hive within Cloudera Data Platform Public Cloud. I connected to Hive using Hue and the Hive CLI.

Beeline Connection:

[root@engineeringhub-master3 ~]# beeline --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat --hiveconf hive.stats.autogather=false
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.3758356/jars/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.3758356/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console. Set system property 'log4j2.debug' to show Log4j2 internal initialization logging.
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.3758356/jars/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.3758356/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://engineeringhub-master3.mtag-fas.ft71-kqhq.cloudera.site:2181/default;httpPath=cliservice;principal=hive/_HOST@MTAG-FAS.FT71-KQHQ.CLOUDERA.SITE;serviceDiscoveryMode=zooKeeper;ssl=true;transportMode=http;zooKeeperNamespace=hiveserver2
20/08/14 13:14:12 [main]: INFO jdbc.HiveConnection: Connected to engineeringhub-master3.mtag-fas.ft71-kqhq.cloudera.site:10001
20/08/14 13:14:12 [main]: WARN jdbc.HiveConnection: Failed to connect to engineeringhub-master3.mtag-fas.ft71-kqhq.cloudera.site:10001
20/08/14 13:14:12 [main]: ERROR jdbc.Utils: Unable to read HiveServer2 configs from ZooKeeper
Error: Could not open client transport for any of the Server URI's in ZooKeeper: Failed to open new session: java.lang.IllegalArgumentException: Cannot modify hive.input.format at runtime. It is not in list of params that are allowed to be modified at runtime (state=08S01,code=0)
Beeline version 3.1.3000.7.2.0.0-237 by Apache Hive
beeline>

Later on, we would like to use JDBC connections from external tools like Power BI or DBeaver.

I can query and filter all columns except the partition columns: year, month, day, hour, minute.

Oddly enough, if I apply aggregations to the partition columns, I do get results back:

SELECT DISTINCT(minute), count(minute)
FROM greentech.iot_device_ro
GROUP BY minute
; 

but

SELECT *
FROM greentech.iot_device_ro
WHERE minute=50
; 

returns an empty result list.
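One diagnostic worth trying (an editorial suggestion, not something from the thread: it sidesteps metastore partition pruning rather than fixing it) is to filter on the _hoodie_partition_path meta column, which Hudi materializes as an ordinary string field in every record:

```sql
-- _hoodie_partition_path is a regular string column stored in each record,
-- so this filter is evaluated row by row instead of via metastore pruning:
SELECT COUNT(*)
FROM greentech.iot_device_ro
WHERE _hoodie_partition_path = 'year=2020/month=8/day=13/hour=12/minute=11';
```

If this returns rows while the equivalent partition-column filter returns none, that points at the metastore partition paths rather than the data files.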

Thank you for the quick response.
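On the "Cannot modify hive.input.format at runtime" error: HiveServer2 with SQL-standard authorization only lets clients set parameters that appear on its whitelist. A commonly used workaround (noted here as an assumption; it needs admin access to the HiveServer2 configuration and a restart) is to append the properties to the whitelist in hive-site.xml:

```xml
<!-- hive-site.xml on the HiveServer2 host (sketch; requires an HS2 restart).
     The value is a |-separated list of Java regexes naming additional
     parameters that clients may SET per session. -->
<property>
  <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
  <value>hive\.input\.format|hive\.stats\.autogather</value>
</property>
```

On Cloudera Data Platform this would likely be applied via Cloudera Manager's configuration override mechanism for HiveServer2 rather than by editing the file directly.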

@nsivabalan (Contributor)

@bhasudha / @bvaradar: do you folks have any pointers here? It looks like the input format is not getting set.

Error: Could not open client transport for any of the Server URI's in ZooKeeper: Failed to open new session: java.lang.IllegalArgumentException: Cannot modify hive.input.format at runtime. It is not in list of params that are allowed to be modified at runtime (state=08S01,code=0)

Would that be the issue here?

@bvaradar (Contributor)

For the second case, the Hive Metastore would be filtering out partitions and only returning specific paths. I think there is some inconsistency between the path used in the filesystem and the one present in the metastore.

@sassai: Sorry for the delay. Can you recursively list your Hudi dataset and attach the output? Also, please add the file contents of the latest .commit or .deltacommit file.

Also, add the output for one of the partitions, including its location:
describe formatted table_name partition (year=xxx,month=xxx,day=xxx,hour=xxx,minute=xxx);

@bvaradar (Contributor)

Closing this due to inactivity.

@sassai (Author) commented Oct 2, 2020

@bvaradar: Sorry for the late reply. I was not able to investigate this issue further until now.

In the meantime I updated Hudi to 0.6.0 to check whether the issue still occurs. Unfortunately, it does. I created a COPY_ON_WRITE table with test data for further debugging. Please find the requested information below:

Hudi data set:

drwxr-xr-x   - 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer          0 2020-10-01 17:14 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/.hoodie
drwxr-xr-x   - 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer          0 2020-10-01 17:14 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/.hoodie/.aux
drwxr-xr-x   - 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer          0 2020-10-01 17:14 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/.hoodie/.aux/.bootstrap
drwxr-xr-x   - 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer          0 2020-10-01 17:14 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/.hoodie/.aux/.bootstrap/.fileids
drwxr-xr-x   - 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer          0 2020-10-01 17:14 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/.hoodie/.aux/.bootstrap/.partitions
drwxr-xr-x   - 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer          0 2020-10-01 17:14 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/.hoodie/.temp
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer       9133 2020-10-01 17:15 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/.hoodie/20201001171431.commit
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer          0 2020-10-01 17:14 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/.hoodie/20201001171431.commit.requested
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer        999 2020-10-01 17:15 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/.hoodie/20201001171431.inflight
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer       1169 2020-10-01 17:18 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/.hoodie/20201001171823.commit
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer          0 2020-10-01 17:18 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/.hoodie/20201001171823.commit.requested
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer        380 2020-10-01 17:18 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/.hoodie/20201001171823.inflight
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer       2986 2020-10-01 17:34 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/.hoodie/20201001173346.commit
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer          0 2020-10-01 17:34 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/.hoodie/20201001173346.commit.requested
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer       1653 2020-10-01 17:34 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/.hoodie/20201001173346.inflight
drwxr-xr-x   - 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer          0 2020-10-01 17:14 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/.hoodie/archived
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer        228 2020-10-01 17:14 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/.hoodie/hoodie.properties
drwxr-xr-x   - 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer          0 2020-10-01 17:15 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020
drwxr-xr-x   - 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer          0 2020-10-01 17:15 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020/month=10
drwxr-xr-x   - 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer          0 2020-10-01 17:15 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020/month=10/day=1
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer         93 2020-10-01 17:15 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020/month=10/day=1/.hoodie_partition_metadata
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer    7095995 2020-10-01 17:34 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020/month=10/day=1/08b5ed87-a749-4a82-a298-59071381dbc9-0_0-89-2258_20201001173346.parquet
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer    7096113 2020-10-01 17:15 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020/month=10/day=1/08b5ed87-a749-4a82-a298-59071381dbc9-0_8-25-180_20201001171431.parquet
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer    7126955 2020-10-01 17:15 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020/month=10/day=1/3c49d05f-8a9b-4365-8158-a32f879d674f-0_0-25-172_20201001171431.parquet
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer    7126790 2020-10-01 17:34 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020/month=10/day=1/3c49d05f-8a9b-4365-8158-a32f879d674f-0_1-89-2259_20201001173346.parquet
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer    7144341 2020-10-01 17:15 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020/month=10/day=1/4f7d1a69-112a-42d1-b0ac-adf8de1e8dad-0_7-25-179_20201001171431.parquet
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer    7120178 2020-10-01 17:15 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020/month=10/day=1/669c9159-a795-40ac-9827-4551965e1750-0_3-25-175_20201001171431.parquet
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer    7197719 2020-10-01 17:15 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020/month=10/day=1/ab9c4c36-ea10-47bc-bf05-1555bf07c4ad-0_2-25-174_20201001171431.parquet
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer    7158006 2020-10-01 17:15 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020/month=10/day=1/ad939d4e-d8dc-4723-b5a5-8ec2c064f3e3-0_6-25-178_20201001171431.parquet
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer    7170312 2020-10-01 17:15 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020/month=10/day=1/c4c0b379-79e8-4856-9605-e56b1beb1b09-0_4-25-176_20201001171431.parquet
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer    7118844 2020-10-01 17:15 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020/month=10/day=1/f5fae4bc-396d-4ff6-8b89-01c08559cb50-0_5-25-177_20201001171431.parquet
-rw-r--r--   1 3d88417a-c602-4b19-b581-ac7265074929 srv_tu_usecase2_producer    7156753 2020-10-01 17:15 abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020/month=10/day=1/f804937c-8de2-4692-90d9-4b485305d721-0_1-25-173_20201001171431.parquet

Latest commit:

{
  "partitionToWriteStats" : {
    "year=2020/month=10/day=1" : [ {
      "fileId" : "08b5ed87-a749-4a82-a298-59071381dbc9-0",
      "path" : "year=2020/month=10/day=1/08b5ed87-a749-4a82-a298-59071381dbc9-0_0-89-2258_20201001173346.parquet",
      "prevCommit" : "20201001171431",
      "numWrites" : 110569,
      "numDeletes" : 0,
      "numUpdateWrites" : 1,
      "numInserts" : 0,
      "totalWriteBytes" : 7095995,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "year=2020/month=10/day=1",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 7095995
    }, {
      "fileId" : "3c49d05f-8a9b-4365-8158-a32f879d674f-0",
      "path" : "year=2020/month=10/day=1/3c49d05f-8a9b-4365-8158-a32f879d674f-0_1-89-2259_20201001173346.parquet",
      "prevCommit" : "20201001171431",
      "numWrites" : 111111,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 1,
      "totalWriteBytes" : 7126790,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "year=2020/month=10/day=1",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 7126790
    } ]
  },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : "{\"type\":\"record\",\"name\":\"address_record\",\"namespace\":\"hoodie.address\",\"fields\":[{\"name\":\"id\",\"type\":[\"int\",\"null\"]},{\"name\":\"zipCode\",\"type\":[\"int\",\"null\"]},{\"name\":\"city\",\"type\":[\"string\",\"null\"]},{\"name\":\"street\",\"type\":[\"string\",\"null\"]},{\"name\":\"streetNumber\",\"type\":[\"int\",\"null\"]},{\"name\":\"uuid\",\"type\":[\"string\",\"null\"]},{\"name\":\"start_date\",\"type\":[\"string\",\"null\"]},{\"name\":\"end_date\",\"type\":[\"string\",\"null\"]},{\"name\":\"is_current\",\"type\":\"boolean\"},{\"name\":\"event_time\",\"type\":[\"string\",\"null\"]},{\"name\":\"year\",\"type\":\"int\"},{\"name\":\"month\",\"type\":\"int\"},{\"name\":\"day\",\"type\":\"int\"},{\"name\":\"its\",\"type\":\"string\"}]}"
  },
  "operationType" : "UPSERT",
  "fileIdAndRelativePaths" : {
    "08b5ed87-a749-4a82-a298-59071381dbc9-0" : "year=2020/month=10/day=1/08b5ed87-a749-4a82-a298-59071381dbc9-0_0-89-2258_20201001173346.parquet",
    "3c49d05f-8a9b-4365-8158-a32f879d674f-0" : "year=2020/month=10/day=1/3c49d05f-8a9b-4365-8158-a32f879d674f-0_1-89-2259_20201001173346.parquet"
  },
  "totalRecordsDeleted" : 0,
  "totalLogRecordsCompacted" : 0,
  "totalScanTime" : 0,
  "totalCreateTime" : 0,
  "totalUpsertTime" : 11307,
  "totalCompactedRecordsUpdated" : 0,
  "totalLogFilesCompacted" : 0,
  "totalLogFilesSize" : 0
}

Hudi properties file:

#Properties saved on Thu Oct 01 17:14:31 UTC 2020
#Thu Oct 01 17:14:31 UTC 2020
hoodie.table.name=address
hoodie.archivelog.folder=archived
hoodie.table.type=COPY_ON_WRITE
hoodie.table.version=1
hoodie.timeline.layout.version=1

Describe table:

+-----------------------------------+----------------------------------------------------+-----------------------+
|             col_name              |                     data_type                      |        comment        |
+-----------------------------------+----------------------------------------------------+-----------------------+
| _hoodie_commit_time               | string                                             |                       |
| _hoodie_commit_seqno              | string                                             |                       |
| _hoodie_record_key                | string                                             |                       |
| _hoodie_partition_path            | string                                             |                       |
| _hoodie_file_name                 | string                                             |                       |
| id                                | int                                                |                       |
| zipcode                           | int                                                |                       |
| city                              | string                                             |                       |
| street                            | string                                             |                       |
| streetnumber                      | int                                                |                       |
| uuid                              | string                                             |                       |
| start_date                        | string                                             |                       |
| end_date                          | string                                             |                       |
| is_current                        | boolean                                            |                       |
| event_time                        | string                                             |                       |
| its                               | string                                             |                       |
|                                   | NULL                                               | NULL                  |
| # Partition Information           | NULL                                               | NULL                  |
| # col_name                        | data_type                                          | comment               |
| year                              | int                                                |                       |
| month                             | int                                                |                       |
| day                               | int                                                |                       |
|                                   | NULL                                               | NULL                  |
| # Detailed Partition Information  | NULL                                               | NULL                  |
| Partition Value:                  | [2020, 10, 1]                                      | NULL                  |
| Database:                         | nyc_taxi                                           | NULL                  |
| Table:                            | address                                            | NULL                  |
| CreateTime:                       | Thu Oct 01 17:15:51 UTC 2020                       | NULL                  |
| LastAccessTime:                   | UNKNOWN                                            | NULL                  |
| Location:                         | abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020/month=10/day=1 | NULL                  |
| Partition Parameters:             | NULL                                               | NULL                  |
|                                   | numFiles                                           | 9                     |
|                                   | totalSize                                          | 64289221              |
|                                   | transient_lastDdlTime                              | 1601572551            |
|                                   | NULL                                               | NULL                  |
| # Storage Information             | NULL                                               | NULL                  |
| SerDe Library:                    | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | NULL                  |
| InputFormat:                      | org.apache.hudi.hadoop.HoodieParquetInputFormat    | NULL                  |
| OutputFormat:                     | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | NULL                  |
| Compressed:                       | No                                                 | NULL                  |
| Num Buckets:                      | -1                                                 | NULL                  |
| Bucket Columns:                   | []                                                 | NULL                  |
| Sort Columns:                     | []                                                 | NULL                  |
| Storage Desc Params:              | NULL                                               | NULL                  |
|                                   | serialization.format                               | 1                     |
+-----------------------------------+----------------------------------------------------+-----------------------+

@sassai
Author

sassai commented Oct 5, 2020

Update:

Using set hive.fetch.task.conversion=none; within the hive session fixed the issue.
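For anyone hitting the same symptom, a minimal Hive session sketch (table and partition columns follow the address example from this thread):

```sql
-- Disable the fetch-task short-circuit so the query goes through
-- Hudi's HoodieParquetInputFormat instead of Hive's direct fetch path.
SET hive.fetch.task.conversion=none;

-- Filtering on the partition columns now returns rows as expected:
SELECT * FROM address WHERE year = 2020 AND month = 10 AND day = 6 LIMIT 5;
```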

@bvaradar
Contributor

bvaradar commented Oct 5, 2020

@sassai : The table's Location is set incorrectly.

| Location: | abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020/month=10/day=1 | NULL |

It should have been abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address only.

Can you try dropping the table and running Hive sync again?
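The suggested retry can be sketched in the Hive CLI as follows (DROP TABLE only removes the metastore entry here, since the table is registered as EXTERNAL; the re-sync itself is done by whatever Hive sync mechanism is already in place, e.g. the Spark writer's hive sync):

```sql
-- Remove the mis-registered metastore entry; the data files are untouched
-- because the table is EXTERNAL.
DROP TABLE IF EXISTS address;

-- After the next Hive sync run recreates the table, verify that the
-- table-level Location stops at the base path (no year=/month=/day= suffix):
DESCRIBE FORMATTED address;
```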

@sassai
Author

sassai commented Oct 6, 2020

@bvaradar: I dropped and re-synced the table. The issue still exists unless I run set hive.fetch.task.conversion=none; before executing the query. Then everything works as expected.

Please find the table description with and without partition below.

describe formatted address:

+-------------------------------+----------------------------------------------------+-----------------------+
|           col_name            |                     data_type                      |        comment        |
+-------------------------------+----------------------------------------------------+-----------------------+
| _hoodie_commit_time           | string                                             |                       |
| _hoodie_commit_seqno          | string                                             |                       |
| _hoodie_record_key            | string                                             |                       |
| _hoodie_partition_path        | string                                             |                       |
| _hoodie_file_name             | string                                             |                       |
| id                            | int                                                |                       |
| zipcode                       | int                                                |                       |
| city                          | string                                             |                       |
| street                        | string                                             |                       |
| streetnumber                  | int                                                |                       |
| uuid                          | string                                             |                       |
| start_date                    | string                                             |                       |
| end_date                      | string                                             |                       |
| is_current                    | boolean                                            |                       |
| event_time                    | string                                             |                       |
| its                           | string                                             |                       |
|                               | NULL                                               | NULL                  |
| # Partition Information       | NULL                                               | NULL                  |
| # col_name                    | data_type                                          | comment               |
| year                          | int                                                |                       |
| month                         | int                                                |                       |
| day                           | int                                                |                       |
|                               | NULL                                               | NULL                  |
| # Detailed Table Information  | NULL                                               | NULL                  |
| Database:                     | nyc_taxi                                           | NULL                  |
| OwnerType:                    | USER                                               | NULL                  |
| Owner:                        | srv_tu_usecase2_producer                           | NULL                  |
| CreateTime:                   | Tue Oct 06 07:06:03 UTC 2020                       | NULL                  |
| LastAccessTime:               | UNKNOWN                                            | NULL                  |
| Retention:                    | 0                                                  | NULL                  |
| Location:                     | abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address | NULL                  |
| Table Type:                   | EXTERNAL_TABLE                                     | NULL                  |
| Table Parameters:             | NULL                                               | NULL                  |
|                               | EXTERNAL                                           | TRUE                  |
|                               | bucketing_version                                  | 2                     |
|                               | discover.partitions                                | true                  |
|                               | last_commit_time_sync                              | 20201006065754        |
|                               | numFiles                                           | 82                    |
|                               | numPartitions                                      | 1                     |
|                               | numRows                                            | 0                     |
|                               | rawDataSize                                        | 0                     |
|                               | totalSize                                          | 642825967             |
|                               | transient_lastDdlTime                              | 1601967963            |
|                               | NULL                                               | NULL                  |
| # Storage Information         | NULL                                               | NULL                  |
| SerDe Library:                | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | NULL                  |
| InputFormat:                  | org.apache.hudi.hadoop.HoodieParquetInputFormat    | NULL                  |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | NULL                  |
| Compressed:                   | No                                                 | NULL                  |
| Num Buckets:                  | -1                                                 | NULL                  |
| Bucket Columns:               | []                                                 | NULL                  |
| Sort Columns:                 | []                                                 | NULL                  |
| Storage Desc Params:          | NULL                                               | NULL                  |
|                               | serialization.format                               | 1                     |
+-------------------------------+----------------------------------------------------+-----------------------+

describe formatted address partition(year=2020,month=10,day=6):

+-----------------------------------+----------------------------------------------------+-----------------------+
|             col_name              |                     data_type                      |        comment        |
+-----------------------------------+----------------------------------------------------+-----------------------+
| _hoodie_commit_time               | string                                             |                       |
| _hoodie_commit_seqno              | string                                             |                       |
| _hoodie_record_key                | string                                             |                       |
| _hoodie_partition_path            | string                                             |                       |
| _hoodie_file_name                 | string                                             |                       |
| id                                | int                                                |                       |
| zipcode                           | int                                                |                       |
| city                              | string                                             |                       |
| street                            | string                                             |                       |
| streetnumber                      | int                                                |                       |
| uuid                              | string                                             |                       |
| start_date                        | string                                             |                       |
| end_date                          | string                                             |                       |
| is_current                        | boolean                                            |                       |
| event_time                        | string                                             |                       |
| its                               | string                                             |                       |
|                                   | NULL                                               | NULL                  |
| # Partition Information           | NULL                                               | NULL                  |
| # col_name                        | data_type                                          | comment               |
| year                              | int                                                |                       |
| month                             | int                                                |                       |
| day                               | int                                                |                       |
|                                   | NULL                                               | NULL                  |
| # Detailed Partition Information  | NULL                                               | NULL                  |
| Partition Value:                  | [2020, 10, 6]                                      | NULL                  |
| Database:                         | nyc_taxi                                           | NULL                  |
| Table:                            | address                                            | NULL                  |
| CreateTime:                       | Tue Oct 06 07:06:04 UTC 2020                       | NULL                  |
| LastAccessTime:                   | UNKNOWN                                            | NULL                  |
| Location:                         | abfs://xxx@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020/month=10/day=6 | NULL                  |
| Partition Parameters:             | NULL                                               | NULL                  |
|                                   | numFiles                                           | 82                    |
|                                   | totalSize                                          | 642825967             |
|                                   | transient_lastDdlTime                              | 1601967964            |
|                                   | NULL                                               | NULL                  |
| # Storage Information             | NULL                                               | NULL                  |
| SerDe Library:                    | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | NULL                  |
| InputFormat:                      | org.apache.hudi.hadoop.HoodieParquetInputFormat    | NULL                  |
| OutputFormat:                     | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | NULL                  |
| Compressed:                       | No                                                 | NULL                  |
| Num Buckets:                      | -1                                                 | NULL                  |
| Bucket Columns:                   | []                                                 | NULL                  |
| Sort Columns:                     | []                                                 | NULL                  |
| Storage Desc Params:              | NULL                                               | NULL                  |
|                                   | serialization.format                               | 1                     |
+-----------------------------------+----------------------------------------------------+-----------------------+

@nsivabalan
Contributor

@bvaradar : I think you may have missed following up on this thread. Can you check it out and respond when you can?

@n3nash n3nash added this to In progress in GI Tracker Board Apr 22, 2021
@vinothchandar
Member

I think we should document better when set hive.fetch.task.conversion=none; needs to be set with Hudi tables. Closing this.
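If the setting should apply beyond a single session, a hive-site.xml fragment is one option (hive.fetch.task.conversion is a standard Hive property; defaulting it to none cluster-wide also disables the fetch-task optimization for non-Hudi tables, so this is a deployment judgment call):

```xml
<!-- Always plan a real job instead of the fetch-task short-circuit,
     so the table's HoodieParquetInputFormat is honored. -->
<property>
  <name>hive.fetch.task.conversion</name>
  <value>none</value>
</property>
```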

GI Tracker Board automation moved this from In progress to Done Jun 5, 2021