Description
Though there are no read or write errors, Spark SQL UPDATE and DELETE do not write record positions to the log files.
{code:java}
spark-sql (default)> CREATE TABLE testing_positions.table2 (
> ts BIGINT,
> uuid STRING,
> rider STRING,
> driver STRING,
> fare DOUBLE,
> city STRING
> ) USING HUDI
> LOCATION 'file:///Users/ethan/Work/tmp/hudi-1.0.0-testing/positional/table2'
> TBLPROPERTIES (
> type = 'mor',
> primaryKey = 'uuid',
> preCombineField = 'ts'
> )
> PARTITIONED BY (city);
24/11/16 12:03:26 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/Users/ethan/Work/tmp/hudi-1.0.0-testing/positional/table2
Time taken: 0.4 seconds
spark-sql (default)> INSERT INTO testing_positions.table2
> VALUES
> (1695159649087,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco'),
> (1695091554788,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70 ,'san_francisco'),
> (1695046462179,'9909a8b1-2d15-4d3d-8ec9-efc48c536a00','rider-D','driver-L',33.90 ,'san_francisco'),
> (1695332066204,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'san_francisco'),
> (1695516137016,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'sao_paulo' ),
> (1695376420876,'7a84095f-737f-40bc-b62f-6b69664712d2','rider-G','driver-Q',43.40 ,'sao_paulo' ),
> (1695173887231,'3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04','rider-I','driver-S',41.06 ,'chennai' ),
> (1695115999911,'c8abbe79-8d89-47ea-b4ce-4d224bae5bfa','rider-J','driver-T',17.85,'chennai');
24/11/16 12:03:26 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/Users/ethan/Work/tmp/hudi-1.0.0-testing/positional/table2
24/11/16 12:03:26 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/Users/ethan/Work/tmp/hudi-1.0.0-testing/positional/table2
24/11/16 12:03:29 WARN log: Updating partition stats fast for: table2_ro
24/11/16 12:03:29 WARN log: Updated size to 436166
24/11/16 12:03:29 WARN log: Updating partition stats fast for: table2_ro
24/11/16 12:03:29 WARN log: Updating partition stats fast for: table2_ro
24/11/16 12:03:29 WARN log: Updated size to 436185
24/11/16 12:03:29 WARN log: Updated size to 436386
24/11/16 12:03:30 WARN log: Updating partition stats fast for: table2_rt
24/11/16 12:03:30 WARN log: Updating partition stats fast for: table2_rt
24/11/16 12:03:30 WARN log: Updated size to 436166
24/11/16 12:03:30 WARN log: Updated size to 436386
24/11/16 12:03:30 WARN log: Updating partition stats fast for: table2_rt
24/11/16 12:03:30 WARN log: Updated size to 436185
24/11/16 12:03:30 WARN log: Updating partition stats fast for: table2
24/11/16 12:03:30 WARN log: Updated size to 436166
24/11/16 12:03:30 WARN log: Updating partition stats fast for: table2
24/11/16 12:03:30 WARN log: Updated size to 436386
24/11/16 12:03:30 WARN log: Updating partition stats fast for: table2
24/11/16 12:03:30 WARN log: Updated size to 436185
24/11/16 12:03:30 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
24/11/16 12:03:30 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/11/16 12:03:30 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
Time taken: 4.843 seconds
spark-sql (default)>
> SET hoodie.merge.small.file.group.candidates.limit = 0;
hoodie.merge.small.file.group.candidates.limit 0
Time taken: 0.018 seconds, Fetched 1 row(s)
spark-sql (default)>
> UPDATE testing_positions.table2 SET fare = 20.0 WHERE rider = 'rider-A';
24/11/16 12:03:31 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
24/11/16 12:03:32 WARN HoodieFileIndex: Data skipping requires both Metadata Table and at least one of Column Stats Index, Record Level Index, or Functional Index to be enabled as well! (isMetadataTableEnabled = false, isColumnStatsIndexEnabled = false, isRecordIndexApplicable = false, isFunctionalIndexEnabled = false, isBucketIndexEnable = false, isPartitionStatsIndexEnabled = false), isBloomFiltersIndexEnabled = false)
24/11/16 12:03:32 WARN HoodieDataBlock: There are records without valid positions. Skip writing record positions to the data block header.
24/11/16 12:03:34 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
24/11/16 12:03:34 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/11/16 12:03:34 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
Time taken: 5.545 seconds
spark-sql (default)>
> DELETE FROM testing_positions.table2 WHERE uuid = 'e3cf430c-889d-4015-bc98-59bdce1e530c';
24/11/16 12:03:37 WARN HoodieFileIndex: Data skipping requires both Metadata Table and at least one of Column Stats Index, Record Level Index, or Functional Index to be enabled as well! (isMetadataTableEnabled = false, isColumnStatsIndexEnabled = false, isRecordIndexApplicable = false, isFunctionalIndexEnabled = false, isBucketIndexEnable = false, isPartitionStatsIndexEnabled = false), isBloomFiltersIndexEnabled = false)
24/11/16 12:03:37 WARN HoodiePositionBasedFileGroupRecordBuffer: No record position info is found when attempt to do position based merge.
24/11/16 12:03:37 WARN HoodiePositionBasedFileGroupRecordBuffer: Falling back to key based merge for Read
24/11/16 12:03:38 WARN HoodieDeleteBlock: There are delete records without valid positions. Skip writing record positions to the delete block header.
24/11/16 12:03:39 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
24/11/16 12:03:39 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/11/16 12:03:39 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
Time taken: 2.992 seconds
spark-sql (default)>
> select * from testing_positions.table2;
24/11/16 12:03:41 WARN HoodiePositionBasedFileGroupRecordBuffer: No record position info is found when attempt to do position based merge.
24/11/16 12:03:41 WARN HoodiePositionBasedFileGroupRecordBuffer: No record position info is found when attempt to do position based merge.
24/11/16 12:03:41 WARN HoodiePositionBasedFileGroupRecordBuffer: Falling back to key based merge for Read
24/11/16 12:03:41 WARN HoodiePositionBasedFileGroupRecordBuffer: Falling back to key based merge for Read
20241116120326527 20241116120326527_0_0 1dced545-862b-4ceb-8b43-d2a568f6616b city=san_francisco 1ba64ef0-bba2-469e-8ef5-696f8cdbe141-0_0-186-338_20241116120326527.parquet 1695332066204 1dced545-862b-4ceb-8b43-d2a568f6616b rider-E driver-O 93.5 san_francisco
20241116120326527 20241116120326527_0_1 e96c4396-3fad-413a-a942-4cb36106d721 city=san_francisco 1ba64ef0-bba2-469e-8ef5-696f8cdbe141-0_0-186-338_20241116120326527.parquet 1695091554788 e96c4396-3fad-413a-a942-4cb36106d721 rider-C driver-M 27.7 san_francisco
20241116120326527 20241116120326527_0_2 9909a8b1-2d15-4d3d-8ec9-efc48c536a00 city=san_francisco 1ba64ef0-bba2-469e-8ef5-696f8cdbe141-0_0-186-338_20241116120326527.parquet 1695046462179 9909a8b1-2d15-4d3d-8ec9-efc48c536a00 rider-D driver-L 33.9 san_francisco
20241116120331896 20241116120331896_0_9 334e26e9-8355-45cc-97c6-c31daf0df330 city=san_francisco 1ba64ef0-bba2-469e-8ef5-696f8cdbe141-0 1695159649087 334e26e9-8355-45cc-97c6-c31daf0df330 rider-A driver-K 20.0 san_francisco
20241116120326527 20241116120326527_1_1 7a84095f-737f-40bc-b62f-6b69664712d2 city=sao_paulo ba555452-0c3c-47dc-acc0-f90823e12408-0_1-186-339_20241116120326527.parquet 1695376420876 7a84095f-737f-40bc-b62f-6b69664712d2 rider-G driver-Q 43.4 sao_paulo
20241116120326527 20241116120326527_2_0 3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04 city=chennai 8dacb2f9-6901-4ab3-8139-697b51125f16-0_2-186-340_20241116120326527.parquet 1695173887231 3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04 rider-I driver-S 41.06 chennai
20241116120326527 20241116120326527_2_1 c8abbe79-8d89-47ea-b4ce-4d224bae5bfa city=chennai 8dacb2f9-6901-4ab3-8139-697b51125f16-0_2-186-340_20241116120326527.parquet 1695115999911 c8abbe79-8d89-47ea-b4ce-4d224bae5bfa rider-J driver-T 17.85 chennai
Time taken: 1.719 seconds, Fetched 7 row(s) {code}
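The `HoodiePositionBasedFileGroupRecordBuffer` warnings above show the reader falling back to key-based merging whenever a log block carries no positions. The fallback can be sketched as follows; this is a hypothetical, simplified model for illustration, not Hudi's actual implementation, and the structures (`positions`, `records`) are assumptions:

```python
# Simplified model of merging a log block into a base file's records.
# A log block may optionally carry a map of record key -> base-file row
# position in its header; when it does not, the reader must fall back to
# comparing record keys. (Illustrative only; not Hudi's real data layout.)

def merge_log_into_base(base_records, log_block):
    """base_records: list of record dicts, ordered by base-file position.
    log_block: dict with 'records' (key -> updated record, or None for a
    delete) and an optional 'positions' (key -> base-file position)."""
    merged = list(base_records)
    positions = log_block.get("positions")
    if positions:
        # Position-based merge: jump straight to the affected rows.
        for key, pos in positions.items():
            merged[pos] = log_block["records"][key]
    else:
        # No positions in the block header: key-based fallback must scan
        # every base record and look up its key in the log records.
        for i, rec in enumerate(merged):
            if rec is not None and rec["uuid"] in log_block["records"]:
                merged[i] = log_block["records"][rec["uuid"]]
    return [r for r in merged if r is not None]  # drop deleted rows
```

Both paths produce the same result; the position-based path just avoids the per-record key lookups, which is the performance benefit this issue is about preserving.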
JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-8553
- Type: Sub-task
- Parent: https://issues.apache.org/jira/browse/HUDI-9107
- Fix version(s):
- 1.1.0
Comments
jonvex (21/Nov/24 21:18): I have verified this with the script:
{code:java}
SET hoodie.spark.sql.optimized.writes.enable = false;
CREATE TABLE table2 ( ts BIGINT, uuid STRING, rider STRING, driver STRING, fare DOUBLE, city STRING ) USING HUDI LOCATION 'file:///tmp/testpositions' TBLPROPERTIES ( type = 'mor', primaryKey = 'uuid', preCombineField = 'ts' ) PARTITIONED BY (city);
INSERT INTO table2 VALUES (1695159649087,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco'), (1695091554788,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70 ,'san_francisco'), (1695046462179,'9909a8b1-2d15-4d3d-8ec9-efc48c536a00','rider-D','driver-L',33.90 ,'san_francisco'), (1695332066204,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'san_francisco'), (1695516137016,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'sao_paulo' ), (1695376420876,'7a84095f-737f-40bc-b62f-6b69664712d2','rider-G','driver-Q',43.40 ,'sao_paulo' ), (1695173887231,'3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04','rider-I','driver-S',41.06 ,'chennai' ), (1695115999911,'c8abbe79-8d89-47ea-b4ce-4d224bae5bfa','rider-J','driver-T',17.85,'chennai');
SET hoodie.merge.small.file.group.candidates.limit = 0;
UPDATE table2 SET fare = 20.0 WHERE rider = 'rider-A';
DELETE FROM table2 WHERE uuid = 'e3cf430c-889d-4015-bc98-59bdce1e530c';
select * from table2; {code}
I tested with optimized writes both enabled and disabled. When optimized writes are disabled, there is no warning about position fallback.
Here is the output with optimized writes set to false:
{code:java}
spark-sql (default)> SET hoodie.spark.sql.optimized.writes.enable = false;
24/11/21 16:11:45 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
24/11/21 16:11:45 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
hoodie.spark.sql.optimized.writes.enable false
Time taken: 0.764 seconds, Fetched 1 row(s)
spark-sql (default)> CREATE TABLE table2 (
> ts BIGINT,
> uuid STRING,
> rider STRING,
> driver STRING,
> fare DOUBLE,
> city STRING
> ) USING HUDI
> LOCATION 'file:///tmp/testpositions'
> TBLPROPERTIES (
> type = 'mor',
> primaryKey = 'uuid',
> preCombineField = 'ts'
> )
> PARTITIONED BY (city);
24/11/21 16:11:52 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/tmp/testpositions
Time taken: 0.384 seconds
spark-sql (default)> INSERT INTO table2
> VALUES
> (1695159649087,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco'),
> (1695091554788,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70 ,'san_francisco'),
> (1695046462179,'9909a8b1-2d15-4d3d-8ec9-efc48c536a00','rider-D','driver-L',33.90 ,'san_francisco'),
> (1695332066204,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'san_francisco'),
> (1695516137016,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'sao_paulo' ),
> (1695376420876,'7a84095f-737f-40bc-b62f-6b69664712d2','rider-G','driver-Q',43.40 ,'sao_paulo' ),
> (1695173887231,'3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04','rider-I','driver-S',41.06 ,'chennai' ),
> (1695115999911,'c8abbe79-8d89-47ea-b4ce-4d224bae5bfa','rider-J','driver-T',17.85,'chennai');
24/11/21 16:12:02 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/tmp/testpositions
24/11/21 16:12:03 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/tmp/testpositions
24/11/21 16:12:05 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-hbase.properties,hadoop-metrics2.properties
24/11/21 16:12:05 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
WARNING: Unable to attach Serviceability Agent. Unable to attach even with module exceptions: [org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed.]
24/11/21 16:12:08 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
Time taken: 5.728 seconds
spark-sql (default)> SET hoodie.merge.small.file.group.candidates.limit = 0;
hoodie.merge.small.file.group.candidates.limit 0
Time taken: 0.012 seconds, Fetched 1 row(s)
spark-sql (default)> UPDATE table2 SET fare = 20.0 WHERE rider = 'rider-A';
24/11/21 16:12:16 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
24/11/21 16:12:16 WARN HoodieFileIndex: Data skipping requires both Metadata Table and at least one of Column Stats Index, Record Level Index, or Functional Index to be enabled as well! (isMetadataTableEnabled = false, isColumnStatsIndexEnabled = false, isRecordIndexApplicable = false, isFunctionalIndexEnabled = false, isBucketIndexEnable = false, isPartitionStatsIndexEnabled = false), isBloomFiltersIndexEnabled = false)
24/11/21 16:12:16 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
24/11/21 16:12:17 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
Time taken: 1.802 seconds
spark-sql (default)> DELETE FROM table2 WHERE uuid = 'e3cf430c-889d-4015-bc98-59bdce1e530c';
24/11/21 16:12:27 WARN HoodieFileIndex: Data skipping requires both Metadata Table and at least one of Column Stats Index, Record Level Index, or Functional Index to be enabled as well! (isMetadataTableEnabled = false, isColumnStatsIndexEnabled = false, isRecordIndexApplicable = false, isFunctionalIndexEnabled = false, isBucketIndexEnable = false, isPartitionStatsIndexEnabled = false), isBloomFiltersIndexEnabled = false)
24/11/21 16:12:27 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
24/11/21 16:12:27 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
Time taken: 1.332 seconds
spark-sql (default)> select * from table2;
20241121161203621 20241121161203621_0_0 1dced545-862b-4ceb-8b43-d2a568f6616b city=san_francisco 1ad629cc-6f75-4ac3-bff2-e4f842421f51-0_0-21-67_20241121161203621.parquet 1695332066204 1dced545-862b-4ceb-8b43-d2a568f6616b rider-E driver-O 93.5 san_francisco
20241121161203621 20241121161203621_0_1 e96c4396-3fad-413a-a942-4cb36106d721 city=san_francisco 1ad629cc-6f75-4ac3-bff2-e4f842421f51-0_0-21-67_20241121161203621.parquet 1695091554788 e96c4396-3fad-413a-a942-4cb36106d721 rider-C driver-M 27.7 san_francisco
20241121161203621 20241121161203621_0_2 9909a8b1-2d15-4d3d-8ec9-efc48c536a00 city=san_francisco 1ad629cc-6f75-4ac3-bff2-e4f842421f51-0_0-21-67_20241121161203621.parquet 1695046462179 9909a8b1-2d15-4d3d-8ec9-efc48c536a00 rider-D driver-L 33.9 san_francisco
20241121161216516 20241121161216516_0_1 334e26e9-8355-45cc-97c6-c31daf0df330 city=san_francisco 1ad629cc-6f75-4ac3-bff2-e4f842421f51-0 1695159649087 334e26e9-8355-45cc-97c6-c31daf0df330 rider-A driver-K 20.0 san_francisco
20241121161203621 20241121161203621_1_1 7a84095f-737f-40bc-b62f-6b69664712d2 city=sao_paulo c06df00f-d40d-42b1-b320-52de6bd05d0e-0_1-21-68_20241121161203621.parquet 1695376420876 7a84095f-737f-40bc-b62f-6b69664712d2 rider-G driver-Q 43.4 sao_paulo
20241121161203621 20241121161203621_2_0 3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04 city=chennai 41db64e9-04c0-4fcb-8378-ce50e0dc7c22-0_2-21-69_20241121161203621.parquet 1695173887231 3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04 rider-I driver-S 41.06 chennai
20241121161203621 20241121161203621_2_1 c8abbe79-8d89-47ea-b4ce-4d224bae5bfa city=chennai 41db64e9-04c0-4fcb-8378-ce50e0dc7c22-0_2-21-69_20241121161203621.parquet 1695115999911 c8abbe79-8d89-47ea-b4ce-4d224bae5bfa rider-J driver-T 17.85 chennai
Time taken: 0.219 seconds, Fetched 7 row(s) {code}
And here is the output without setting optimized writes to false (the default is true):
{code:java}
spark-sql (default)> CREATE TABLE table2 (
 > ts BIGINT,
 > uuid STRING,
 > rider STRING,
 > driver STRING,
 > fare DOUBLE,
 > city STRING
 > ) USING HUDI
 > LOCATION 'file:///tmp/testpositions'
 > TBLPROPERTIES (
 > type = 'mor',
 > primaryKey = 'uuid',
 > preCombineField = 'ts'
 > )
 > PARTITIONED BY (city);
24/11/21 16:14:20 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
24/11/21 16:14:20 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
24/11/21 16:14:20 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/tmp/testpositions
Time taken: 1.004 seconds
spark-sql (default)> INSERT INTO table2
 > VALUES
 > (1695159649087,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco'),
 > (1695091554788,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70 ,'san_francisco'),
 > (1695046462179,'9909a8b1-2d15-4d3d-8ec9-efc48c536a00','rider-D','driver-L',33.90 ,'san_francisco'),
 > (1695332066204,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'san_francisco'),
 > (1695516137016,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'sao_paulo' ),
 > (1695376420876,'7a84095f-737f-40bc-b62f-6b69664712d2','rider-G','driver-Q',43.40 ,'sao_paulo' ),
 > (1695173887231,'3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04','rider-I','driver-S',41.06 ,'chennai' ),
 > (1695115999911,'c8abbe79-8d89-47ea-b4ce-4d224bae5bfa','rider-J','driver-T',17.85,'chennai');
24/11/21 16:14:28 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/tmp/testpositions
24/11/21 16:14:28 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/tmp/testpositions
24/11/21 16:14:30 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-hbase.properties,hadoop-metrics2.properties
24/11/21 16:14:31 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
# WARNING: Unable to attach Serviceability Agent. Unable to attach even with module exceptions: [org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed.]
24/11/21 16:14:33 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
Time taken: 5.734 seconds
spark-sql (default)> SET hoodie.merge.small.file.group.candidates.limit = 0;
hoodie.merge.small.file.group.candidates.limit 0
Time taken: 0.016 seconds, Fetched 1 row(s)
spark-sql (default)> UPDATE table2 SET fare = 20.0 WHERE rider = 'rider-A';
24/11/21 16:14:41 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
24/11/21 16:14:41 WARN HoodieFileIndex: Data skipping requires both Metadata Table and at least one of Column Stats Index, Record Level Index, or Functional Index to be enabled as well! (isMetadataTableEnabled = false, isColumnStatsIndexEnabled = false, isRecordIndexApplicable = false, isFunctionalIndexEnabled = false, isBucketIndexEnable = false, isPartitionStatsIndexEnabled = false), isBloomFiltersIndexEnabled = false)
24/11/21 16:14:41 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
24/11/21 16:14:42 WARN HoodieDataBlock: There are records without valid positions. Skip writing record positions to the data block header.
24/11/21 16:14:42 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
Time taken: 1.59 seconds
spark-sql (default)> DELETE FROM table2 WHERE uuid = 'e3cf430c-889d-4015-bc98-59bdce1e530c';
24/11/21 16:14:47 WARN HoodieFileIndex: Data skipping requires both Metadata Table and at least one of Column Stats Index, Record Level Index, or Functional Index to be enabled as well! (isMetadataTableEnabled = false, isColumnStatsIndexEnabled = false, isRecordIndexApplicable = false, isFunctionalIndexEnabled = false, isBucketIndexEnable = false, isPartitionStatsIndexEnabled = false), isBloomFiltersIndexEnabled = false)
24/11/21 16:14:47 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
24/11/21 16:14:47 WARN HoodiePositionBasedFileGroupRecordBuffer: No record position info is found when attempt to do position based merge.
24/11/21 16:14:47 WARN HoodiePositionBasedFileGroupRecordBuffer: Falling back to key based merge for Read
24/11/21 16:14:47 WARN HoodieDeleteBlock: There are delete records without valid positions. Skip writing record positions to the delete block header.
24/11/21 16:14:47 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
Time taken: 1.103 seconds
spark-sql (default)> select * from table2;
24/11/21 16:14:53 WARN HoodiePositionBasedFileGroupRecordBuffer: No record position info is found when attempt to do position based merge.
24/11/21 16:14:53 WARN HoodiePositionBasedFileGroupRecordBuffer: No record position info is found when attempt to do position based merge.
24/11/21 16:14:53 WARN HoodiePositionBasedFileGroupRecordBuffer: Falling back to key based merge for Read
24/11/21 16:14:53 WARN HoodiePositionBasedFileGroupRecordBuffer: Falling back to key based merge for Read
20241121161428912 20241121161428912_0_0 1dced545-862b-4ceb-8b43-d2a568f6616b city=san_francisco cf8f187a-f827-454d-a26f-114e30c519ed-0_0-21-67_20241121161428912.parquet 1695332066204 1dced545-862b-4ceb-8b43-d2a568f6616b rider-E driver-O 93.5 san_francisco
20241121161428912 20241121161428912_0_1 e96c4396-3fad-413a-a942-4cb36106d721 city=san_francisco cf8f187a-f827-454d-a26f-114e30c519ed-0_0-21-67_20241121161428912.parquet 1695091554788 e96c4396-3fad-413a-a942-4cb36106d721 rider-C driver-M 27.7 san_francisco
20241121161428912 20241121161428912_0_2 9909a8b1-2d15-4d3d-8ec9-efc48c536a00 city=san_francisco cf8f187a-f827-454d-a26f-114e30c519ed-0_0-21-67_20241121161428912.parquet 1695046462179 9909a8b1-2d15-4d3d-8ec9-efc48c536a00 rider-D driver-L 33.9 san_francisco
20241121161441739 20241121161441739_0_1 334e26e9-8355-45cc-97c6-c31daf0df330 city=san_francisco cf8f187a-f827-454d-a26f-114e30c519ed-0 1695159649087 334e26e9-8355-45cc-97c6-c31daf0df330 rider-A driver-K 20.0 san_francisco
20241121161428912 20241121161428912_1_1 7a84095f-737f-40bc-b62f-6b69664712d2 city=sao_paulo 22b6070f-6c72-4a3d-9fc6-8bac16a7e873-0_1-21-68_20241121161428912.parquet 1695376420876 7a84095f-737f-40bc-b62f-6b69664712d2 rider-G driver-Q 43.4 sao_paulo
20241121161428912 20241121161428912_2_0 3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04 city=chennai 878ae75b-bb04-4ed8-8591-8fafc56ed7ba-0_2-21-69_20241121161428912.parquet 1695173887231 3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04 rider-I driver-S 41.06 chennai
20241121161428912 20241121161428912_2_1 c8abbe79-8d89-47ea-b4ce-4d224bae5bfa city=chennai 878ae75b-bb04-4ed8-8591-8fafc56ed7ba-0_2-21-69_20241121161428912.parquet 1695115999911 c8abbe79-8d89-47ea-b4ce-4d224bae5bfa rider-J driver-T 17.85 chennai
Time taken: 0.185 seconds, Fetched 7 row(s) {code}
jonvex (21/Nov/24 22:07): To unblock the release, we can do one of two things:
1. Disable hoodie.spark.sql.optimized.writes.enable
- This will decrease the performance of writes during keygen and index lookup.
- Positions will be included in the updates.
- We can check whether positions are even enabled, and only change the default when position writing is disabled.
2. Keep the code as is
- This will decrease performance during reads of uncompacted file groups.
- We have the ability to fall back when positions are missing, and I have written extensive test cases to ensure that the fallback works correctly in all combinations of log and base files.
- This can also use some extra disk space during the read, because we have to rewrite mappings in the spillable map, and deletes to the spillable map don't actually free up space until we close it.

To actually write positions using the prepped workflow, I think there is a way to do this, but it will not be that easy:
- We will need to read _tmp_metadata_row_index inside the UPDATE and DELETE SQL commands. Then, during keygen, we will get the position from that field. This will take some work because I don't think we have ever tried to read positions at the dataset level.
- Then we will need to get the positions out during key generation. This should be easy.
- Then we will need to drop the column before we do the write. This will probably be pretty easy.
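The three steps of the proposed prepped workflow can be sketched as follows. This is a minimal, hypothetical model in plain Python, assuming rows are read with the `_tmp_metadata_row_index` meta column attached; the function name and record layout are illustrative, not Hudi's actual API:

```python
# Hypothetical sketch of the proposed prepped-update flow:
# 1) read the _tmp_metadata_row_index meta column, 2) capture the position
# during key generation, 3) drop the column before the write.
# (Illustrative only; not the actual Hudi implementation.)

ROW_INDEX_COL = "_tmp_metadata_row_index"  # meta column named in the comment above

def prepare_prepped_records(rows):
    """rows: dicts read with the row-index meta column attached.
    Returns prepped records carrying the key, position, and payload."""
    prepped = []
    for row in rows:
        position = row[ROW_INDEX_COL]   # step 1: read the position from the meta column
        record_key = row["uuid"]        # step 2: generate the record key as usual
        # step 3: drop the meta column so it never reaches the written payload
        payload = {k: v for k, v in row.items() if k != ROW_INDEX_COL}
        prepped.append({"key": record_key, "position": position, "data": payload})
    return prepped
```

With the position attached to each prepped record, the writer could then emit it into the log block header instead of skipping position writing as in the logs above.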
yihua (26/Nov/24 00:30): Deferring this to Hudi 1.1, since this does not cause a correctness issue, and adding positional updates and deletes in SQL UPDATE and DELETE needs design.
yihua (10/Jan/25 00:18): In the UPDATE and DELETE commands, we'll try creating the relation with a schema that has the row index meta column, or a new hoodie meta column, to attach the row index column to the returned DF (this also requires the file group reader and Parquet reader to keep the new row index column by fixing the wiring). That way, we can pass the positions down to the prepped write flow and prepare the HoodieRecords with the current record location.
yihua (10/Jan/25 01:59): I have a draft PR up which makes the prepped upsert flow write record positions to the log blocks from the Spark SQL UPDATE statement. I'm going to fix a few issues before opening it up for review.