
Spark SQL UPDATE and DELETE should write record positions #17309

@hudi-bot

Description


Although there are no read or write errors, Spark SQL UPDATE and DELETE do not write record positions to the log files.
{code:java}
spark-sql (default)> CREATE TABLE testing_positions.table2 (
                   >     ts BIGINT,
                   >     uuid STRING,
                   >     rider STRING,
                   >     driver STRING,
                   >     fare DOUBLE,
                   >     city STRING
                   > ) USING HUDI
                   > LOCATION 'file:///Users/ethan/Work/tmp/hudi-1.0.0-testing/positional/table2'
                   > TBLPROPERTIES (
                   >   type = 'mor',
                   >   primaryKey = 'uuid',
                   >   preCombineField = 'ts'
                   > )
                   > PARTITIONED BY (city);
24/11/16 12:03:26 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/Users/ethan/Work/tmp/hudi-1.0.0-testing/positional/table2
Time taken: 0.4 seconds
spark-sql (default)> INSERT INTO testing_positions.table2
                   > VALUES
                   > (1695159649087,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco'),
                   > (1695091554788,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70 ,'san_francisco'),
                   > (1695046462179,'9909a8b1-2d15-4d3d-8ec9-efc48c536a00','rider-D','driver-L',33.90 ,'san_francisco'),
                   > (1695332066204,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'san_francisco'),
                   > (1695516137016,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'sao_paulo'    ),
                   > (1695376420876,'7a84095f-737f-40bc-b62f-6b69664712d2','rider-G','driver-Q',43.40 ,'sao_paulo'    ),
                   > (1695173887231,'3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04','rider-I','driver-S',41.06 ,'chennai'      ),
                   > (1695115999911,'c8abbe79-8d89-47ea-b4ce-4d224bae5bfa','rider-J','driver-T',17.85,'chennai');
24/11/16 12:03:26 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/Users/ethan/Work/tmp/hudi-1.0.0-testing/positional/table2
24/11/16 12:03:26 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/Users/ethan/Work/tmp/hudi-1.0.0-testing/positional/table2
24/11/16 12:03:29 WARN log: Updating partition stats fast for: table2_ro
24/11/16 12:03:29 WARN log: Updated size to 436166
24/11/16 12:03:29 WARN log: Updating partition stats fast for: table2_ro
24/11/16 12:03:29 WARN log: Updating partition stats fast for: table2_ro
24/11/16 12:03:29 WARN log: Updated size to 436185
24/11/16 12:03:29 WARN log: Updated size to 436386
24/11/16 12:03:30 WARN log: Updating partition stats fast for: table2_rt
24/11/16 12:03:30 WARN log: Updating partition stats fast for: table2_rt
24/11/16 12:03:30 WARN log: Updated size to 436166
24/11/16 12:03:30 WARN log: Updated size to 436386
24/11/16 12:03:30 WARN log: Updating partition stats fast for: table2_rt
24/11/16 12:03:30 WARN log: Updated size to 436185
24/11/16 12:03:30 WARN log: Updating partition stats fast for: table2
24/11/16 12:03:30 WARN log: Updated size to 436166
24/11/16 12:03:30 WARN log: Updating partition stats fast for: table2
24/11/16 12:03:30 WARN log: Updated size to 436386
24/11/16 12:03:30 WARN log: Updating partition stats fast for: table2
24/11/16 12:03:30 WARN log: Updated size to 436185
24/11/16 12:03:30 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
24/11/16 12:03:30 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/11/16 12:03:30 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
Time taken: 4.843 seconds
spark-sql (default)> 
                   > SET hoodie.merge.small.file.group.candidates.limit = 0;
hoodie.merge.small.file.group.candidates.limit    0
Time taken: 0.018 seconds, Fetched 1 row(s)
spark-sql (default)> 
                   > UPDATE testing_positions.table2 SET fare = 20.0 WHERE rider = 'rider-A';
24/11/16 12:03:31 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
24/11/16 12:03:32 WARN HoodieFileIndex: Data skipping requires both Metadata Table and at least one of Column Stats Index, Record Level Index, or Functional Index to be enabled as well! (isMetadataTableEnabled = false, isColumnStatsIndexEnabled = false, isRecordIndexApplicable = false, isFunctionalIndexEnabled = false, isBucketIndexEnable = false, isPartitionStatsIndexEnabled = false), isBloomFiltersIndexEnabled = false)
24/11/16 12:03:32 WARN HoodieDataBlock: There are records without valid positions. Skip writing record positions to the data block header.
24/11/16 12:03:34 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
24/11/16 12:03:34 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/11/16 12:03:34 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
Time taken: 5.545 seconds
spark-sql (default)> 
                   > DELETE FROM testing_positions.table2 WHERE uuid = 'e3cf430c-889d-4015-bc98-59bdce1e530c';
24/11/16 12:03:37 WARN HoodieFileIndex: Data skipping requires both Metadata Table and at least one of Column Stats Index, Record Level Index, or Functional Index to be enabled as well! (isMetadataTableEnabled = false, isColumnStatsIndexEnabled = false, isRecordIndexApplicable = false, isFunctionalIndexEnabled = false, isBucketIndexEnable = false, isPartitionStatsIndexEnabled = false), isBloomFiltersIndexEnabled = false)
24/11/16 12:03:37 WARN HoodiePositionBasedFileGroupRecordBuffer: No record position info is found when attempt to do position based merge.
24/11/16 12:03:37 WARN HoodiePositionBasedFileGroupRecordBuffer: Falling back to key based merge for Read
24/11/16 12:03:38 WARN HoodieDeleteBlock: There are delete records without valid positions. Skip writing record positions to the delete block header.
24/11/16 12:03:39 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
24/11/16 12:03:39 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/11/16 12:03:39 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
Time taken: 2.992 seconds
spark-sql (default)> 
                   > select * from testing_positions.table2;
24/11/16 12:03:41 WARN HoodiePositionBasedFileGroupRecordBuffer: No record position info is found when attempt to do position based merge.
24/11/16 12:03:41 WARN HoodiePositionBasedFileGroupRecordBuffer: No record position info is found when attempt to do position based merge.
24/11/16 12:03:41 WARN HoodiePositionBasedFileGroupRecordBuffer: Falling back to key based merge for Read
24/11/16 12:03:41 WARN HoodiePositionBasedFileGroupRecordBuffer: Falling back to key based merge for Read
20241116120326527    20241116120326527_0_0    1dced545-862b-4ceb-8b43-d2a568f6616b    city=san_francisco    1ba64ef0-bba2-469e-8ef5-696f8cdbe141-0_0-186-338_20241116120326527.parquet    16953320662041dced545-862b-4ceb-8b43-d2a568f6616b    rider-E    driver-O    93.5    san_francisco
20241116120326527    20241116120326527_0_1    e96c4396-3fad-413a-a942-4cb36106d721    city=san_francisco    1ba64ef0-bba2-469e-8ef5-696f8cdbe141-0_0-186-338_20241116120326527.parquet    1695091554788e96c4396-3fad-413a-a942-4cb36106d721    rider-C    driver-M    27.7    san_francisco
20241116120326527    20241116120326527_0_2    9909a8b1-2d15-4d3d-8ec9-efc48c536a00    city=san_francisco    1ba64ef0-bba2-469e-8ef5-696f8cdbe141-0_0-186-338_20241116120326527.parquet    16950464621799909a8b1-2d15-4d3d-8ec9-efc48c536a00    rider-D    driver-L    33.9    san_francisco
20241116120331896    20241116120331896_0_9    334e26e9-8355-45cc-97c6-c31daf0df330    city=san_francisco    1ba64ef0-bba2-469e-8ef5-696f8cdbe141-0    1695159649087    334e26e9-8355-45cc-97c6-c31daf0df330    rider-A    driver-K    20.0    san_francisco
20241116120326527    20241116120326527_1_1    7a84095f-737f-40bc-b62f-6b69664712d2    city=sao_paulo    ba555452-0c3c-47dc-acc0-f90823e12408-0_1-186-339_20241116120326527.parquet    1695376420876    7a84095f-737f-40bc-b62f-6b69664712d2    rider-G    driver-Q    43.4    sao_paulo
20241116120326527    20241116120326527_2_0    3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04    city=chennai    8dacb2f9-6901-4ab3-8139-697b51125f16-0_2-186-340_20241116120326527.parquet    1695173887231    3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04    rider-I    driver-S    41.06    chennai
20241116120326527    20241116120326527_2_1    c8abbe79-8d89-47ea-b4ce-4d224bae5bfa    city=chennai    8dacb2f9-6901-4ab3-8139-697b51125f16-0_2-186-340_20241116120326527.parquet    1695115999911    c8abbe79-8d89-47ea-b4ce-4d224bae5bfa    rider-J    driver-T    17.85    chennai
Time taken: 1.719 seconds, Fetched 7 row(s) {code}



Comments

21/Nov/24 21:18, jonvex: I have verified this with the script:
{code:java}
SET hoodie.spark.sql.optimized.writes.enable = false;

CREATE TABLE table2 ( ts BIGINT, uuid STRING, rider STRING, driver STRING, fare DOUBLE, city STRING ) USING HUDI LOCATION 'file:///tmp/testpositions' TBLPROPERTIES ( type = 'mor', primaryKey = 'uuid', preCombineField = 'ts' ) PARTITIONED BY (city);

INSERT INTO table2 VALUES (1695159649087,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco'), (1695091554788,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70 ,'san_francisco'), (1695046462179,'9909a8b1-2d15-4d3d-8ec9-efc48c536a00','rider-D','driver-L',33.90 ,'san_francisco'), (1695332066204,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'san_francisco'), (1695516137016,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'sao_paulo' ), (1695376420876,'7a84095f-737f-40bc-b62f-6b69664712d2','rider-G','driver-Q',43.40 ,'sao_paulo' ), (1695173887231,'3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04','rider-I','driver-S',41.06 ,'chennai' ), (1695115999911,'c8abbe79-8d89-47ea-b4ce-4d224bae5bfa','rider-J','driver-T',17.85,'chennai');

SET hoodie.merge.small.file.group.candidates.limit = 0;

UPDATE table2 SET fare = 20.0 WHERE rider = 'rider-A';

DELETE FROM table2 WHERE uuid = 'e3cf430c-889d-4015-bc98-59bdce1e530c';

select * from table2; {code}
I tested with optimized writes enabled and disabled. When optimized writes are disabled, there is no warning about the position fallback.

Here is the run with optimized writes set to false:
{code:java}
spark-sql (default)> SET hoodie.spark.sql.optimized.writes.enable = false;
24/11/21 16:11:45 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
24/11/21 16:11:45 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
hoodie.spark.sql.optimized.writes.enable    false
Time taken: 0.764 seconds, Fetched 1 row(s)
spark-sql (default)> CREATE TABLE table2 (
                   >      ts BIGINT,
                   >      uuid STRING,
                   >      rider STRING,
                   >      driver STRING,
                   >      fare DOUBLE,
                   >      city STRING
                   >  ) USING HUDI
                   >  LOCATION 'file:///tmp/testpositions'
                   >  TBLPROPERTIES (
                   >    type = 'mor',
                   >    primaryKey = 'uuid',
                   >    preCombineField = 'ts'
                   >  )
                   >  PARTITIONED BY (city);
24/11/21 16:11:52 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/tmp/testpositions
Time taken: 0.384 seconds
spark-sql (default)> INSERT INTO table2
                   >  VALUES
                   >  (1695159649087,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco'),
                   >  (1695091554788,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70 ,'san_francisco'),
                   >  (1695046462179,'9909a8b1-2d15-4d3d-8ec9-efc48c536a00','rider-D','driver-L',33.90 ,'san_francisco'),
                   >  (1695332066204,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'san_francisco'),
                   >  (1695516137016,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'sao_paulo'    ),
                   >  (1695376420876,'7a84095f-737f-40bc-b62f-6b69664712d2','rider-G','driver-Q',43.40 ,'sao_paulo'    ),
                   >  (1695173887231,'3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04','rider-I','driver-S',41.06 ,'chennai'      ),
                   >  (1695115999911,'c8abbe79-8d89-47ea-b4ce-4d224bae5bfa','rider-J','driver-T',17.85,'chennai');
24/11/21 16:12:02 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/tmp/testpositions
24/11/21 16:12:03 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/tmp/testpositions
24/11/21 16:12:05 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-hbase.properties,hadoop-metrics2.properties
24/11/21 16:12:05 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []

WARNING: Unable to attach Serviceability Agent. Unable to attach even with module exceptions: [org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed.]

24/11/21 16:12:08 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
Time taken: 5.728 seconds
spark-sql (default)>  SET hoodie.merge.small.file.group.candidates.limit = 0;
hoodie.merge.small.file.group.candidates.limit    0
Time taken: 0.012 seconds, Fetched 1 row(s)
spark-sql (default)>  UPDATE table2 SET fare = 20.0 WHERE rider = 'rider-A';
24/11/21 16:12:16 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
24/11/21 16:12:16 WARN HoodieFileIndex: Data skipping requires both Metadata Table and at least one of Column Stats Index, Record Level Index, or Functional Index to be enabled as well! (isMetadataTableEnabled = false, isColumnStatsIndexEnabled = false, isRecordIndexApplicable = false, isFunctionalIndexEnabled = false, isBucketIndexEnable = false, isPartitionStatsIndexEnabled = false), isBloomFiltersIndexEnabled = false)
24/11/21 16:12:16 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
24/11/21 16:12:17 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
Time taken: 1.802 seconds
spark-sql (default)>  DELETE FROM table2 WHERE uuid = 'e3cf430c-889d-4015-bc98-59bdce1e530c';
24/11/21 16:12:27 WARN HoodieFileIndex: Data skipping requires both Metadata Table and at least one of Column Stats Index, Record Level Index, or Functional Index to be enabled as well! (isMetadataTableEnabled = false, isColumnStatsIndexEnabled = false, isRecordIndexApplicable = false, isFunctionalIndexEnabled = false, isBucketIndexEnable = false, isPartitionStatsIndexEnabled = false), isBloomFiltersIndexEnabled = false)
24/11/21 16:12:27 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
24/11/21 16:12:27 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
Time taken: 1.332 seconds
spark-sql (default)>  select * from table2;
20241121161203621    20241121161203621_0_0    1dced545-862b-4ceb-8b43-d2a568f6616b    city=san_francisco    1ad629cc-6f75-4ac3-bff2-e4f842421f51-0_0-21-67_20241121161203621.parquet    1695332066204    1dced545-862b-4ceb-8b43-d2a568f6616b    rider-E    driver-O    93.5    san_francisco
20241121161203621    20241121161203621_0_1    e96c4396-3fad-413a-a942-4cb36106d721    city=san_francisco    1ad629cc-6f75-4ac3-bff2-e4f842421f51-0_0-21-67_20241121161203621.parquet    1695091554788    e96c4396-3fad-413a-a942-4cb36106d721    rider-C    driver-M    27.7    san_francisco
20241121161203621    20241121161203621_0_2    9909a8b1-2d15-4d3d-8ec9-efc48c536a00    city=san_francisco    1ad629cc-6f75-4ac3-bff2-e4f842421f51-0_0-21-67_20241121161203621.parquet    1695046462179    9909a8b1-2d15-4d3d-8ec9-efc48c536a00    rider-D    driver-L    33.9    san_francisco
20241121161216516    20241121161216516_0_1    334e26e9-8355-45cc-97c6-c31daf0df330    city=san_francisco    1ad629cc-6f75-4ac3-bff2-e4f842421f51-0    1695159649087    334e26e9-8355-45cc-97c6-c31daf0df330    rider-A    driver-K    20.0    san_francisco
20241121161203621    20241121161203621_1_1    7a84095f-737f-40bc-b62f-6b69664712d2    city=sao_paulo    c06df00f-d40d-42b1-b320-52de6bd05d0e-0_1-21-68_20241121161203621.parquet    1695376420876    7a84095f-737f-40bc-b62f-6b69664712d2    rider-G    driver-Q    43.4    sao_paulo
20241121161203621    20241121161203621_2_0    3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04    city=chennai    41db64e9-04c0-4fcb-8378-ce50e0dc7c22-0_2-21-69_20241121161203621.parquet    1695173887231    3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04    rider-I    driver-S    41.06    chennai
20241121161203621    20241121161203621_2_1    c8abbe79-8d89-47ea-b4ce-4d224bae5bfa    city=chennai    41db64e9-04c0-4fcb-8378-ce50e0dc7c22-0_2-21-69_20241121161203621.parquet    1695115999911    c8abbe79-8d89-47ea-b4ce-4d224bae5bfa    rider-J    driver-T    17.85    chennai
Time taken: 0.219 seconds, Fetched 7 row(s) {code}
And here it is without setting optimized writes to false (its default is true):
{code:java}
spark-sql (default)> CREATE TABLE table2 (
                   >     ts BIGINT,
                   >     uuid STRING,
                   >     rider STRING,
                   >     driver STRING,
                   >     fare DOUBLE,
                   >     city STRING
                   > ) USING HUDI
                   > LOCATION 'file:///tmp/testpositions'
                   > TBLPROPERTIES (
                   >   type = 'mor',
                   >   primaryKey = 'uuid',
                   >   preCombineField = 'ts'
                   > )
                   > PARTITIONED BY (city);
24/11/21 16:14:20 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
24/11/21 16:14:20 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
24/11/21 16:14:20 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/tmp/testpositions
Time taken: 1.004 seconds
spark-sql (default)> INSERT INTO table2
                   > VALUES
                   > (1695159649087,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco'),
                   > (1695091554788,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70 ,'san_francisco'),
                   > (1695046462179,'9909a8b1-2d15-4d3d-8ec9-efc48c536a00','rider-D','driver-L',33.90 ,'san_francisco'),
                   > (1695332066204,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'san_francisco'),
                   > (1695516137016,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'sao_paulo' ),
                   > (1695376420876,'7a84095f-737f-40bc-b62f-6b69664712d2','rider-G','driver-Q',43.40 ,'sao_paulo' ),
                   > (1695173887231,'3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04','rider-I','driver-S',41.06 ,'chennai' ),
                   > (1695115999911,'c8abbe79-8d89-47ea-b4ce-4d224bae5bfa','rider-J','driver-T',17.85,'chennai');
24/11/21 16:14:28 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/tmp/testpositions
24/11/21 16:14:28 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/tmp/testpositions
24/11/21 16:14:30 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-hbase.properties,hadoop-metrics2.properties
24/11/21 16:14:31 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []

WARNING: Unable to attach Serviceability Agent. Unable to attach even with module exceptions: [org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed.]

24/11/21 16:14:33 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
Time taken: 5.734 seconds
spark-sql (default)> SET hoodie.merge.small.file.group.candidates.limit = 0;
hoodie.merge.small.file.group.candidates.limit    0
Time taken: 0.016 seconds, Fetched 1 row(s)
spark-sql (default)> UPDATE table2 SET fare = 20.0 WHERE rider = 'rider-A';
24/11/21 16:14:41 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
24/11/21 16:14:41 WARN HoodieFileIndex: Data skipping requires both Metadata Table and at least one of Column Stats Index, Record Level Index, or Functional Index to be enabled as well! (isMetadataTableEnabled = false, isColumnStatsIndexEnabled = false, isRecordIndexApplicable = false, isFunctionalIndexEnabled = false, isBucketIndexEnable = false, isPartitionStatsIndexEnabled = false), isBloomFiltersIndexEnabled = false)
24/11/21 16:14:41 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
24/11/21 16:14:42 WARN HoodieDataBlock: There are records without valid positions. Skip writing record positions to the data block header.
24/11/21 16:14:42 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
Time taken: 1.59 seconds
spark-sql (default)> DELETE FROM table2 WHERE uuid = 'e3cf430c-889d-4015-bc98-59bdce1e530c';
24/11/21 16:14:47 WARN HoodieFileIndex: Data skipping requires both Metadata Table and at least one of Column Stats Index, Record Level Index, or Functional Index to be enabled as well! (isMetadataTableEnabled = false, isColumnStatsIndexEnabled = false, isRecordIndexApplicable = false, isFunctionalIndexEnabled = false, isBucketIndexEnable = false, isPartitionStatsIndexEnabled = false), isBloomFiltersIndexEnabled = false)
24/11/21 16:14:47 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
24/11/21 16:14:47 WARN HoodiePositionBasedFileGroupRecordBuffer: No record position info is found when attempt to do position based merge.
24/11/21 16:14:47 WARN HoodiePositionBasedFileGroupRecordBuffer: Falling back to key based merge for Read
24/11/21 16:14:47 WARN HoodieDeleteBlock: There are delete records without valid positions. Skip writing record positions to the delete block header.
24/11/21 16:14:47 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
Time taken: 1.103 seconds
spark-sql (default)> select * from table2;
24/11/21 16:14:53 WARN HoodiePositionBasedFileGroupRecordBuffer: No record position info is found when attempt to do position based merge.
24/11/21 16:14:53 WARN HoodiePositionBasedFileGroupRecordBuffer: No record position info is found when attempt to do position based merge.
24/11/21 16:14:53 WARN HoodiePositionBasedFileGroupRecordBuffer: Falling back to key based merge for Read
24/11/21 16:14:53 WARN HoodiePositionBasedFileGroupRecordBuffer: Falling back to key based merge for Read
20241121161428912    20241121161428912_0_0    1dced545-862b-4ceb-8b43-d2a568f6616b    city=san_francisco    cf8f187a-f827-454d-a26f-114e30c519ed-0_0-21-67_20241121161428912.parquet    1695332066204    1dced545-862b-4ceb-8b43-d2a568f6616b    rider-E    driver-O    93.5    san_francisco
20241121161428912    20241121161428912_0_1    e96c4396-3fad-413a-a942-4cb36106d721    city=san_francisco    cf8f187a-f827-454d-a26f-114e30c519ed-0_0-21-67_20241121161428912.parquet    1695091554788    e96c4396-3fad-413a-a942-4cb36106d721    rider-C    driver-M    27.7    san_francisco
20241121161428912    20241121161428912_0_2    9909a8b1-2d15-4d3d-8ec9-efc48c536a00    city=san_francisco    cf8f187a-f827-454d-a26f-114e30c519ed-0_0-21-67_20241121161428912.parquet    1695046462179    9909a8b1-2d15-4d3d-8ec9-efc48c536a00    rider-D    driver-L    33.9    san_francisco
20241121161441739    20241121161441739_0_1    334e26e9-8355-45cc-97c6-c31daf0df330    city=san_francisco    cf8f187a-f827-454d-a26f-114e30c519ed-0    1695159649087    334e26e9-8355-45cc-97c6-c31daf0df330    rider-A    driver-K    20.0    san_francisco
20241121161428912    20241121161428912_1_1    7a84095f-737f-40bc-b62f-6b69664712d2    city=sao_paulo    22b6070f-6c72-4a3d-9fc6-8bac16a7e873-0_1-21-68_20241121161428912.parquet    1695376420876    7a84095f-737f-40bc-b62f-6b69664712d2    rider-G    driver-Q    43.4    sao_paulo
20241121161428912    20241121161428912_2_0    3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04    city=chennai    878ae75b-bb04-4ed8-8591-8fafc56ed7ba-0_2-21-69_20241121161428912.parquet    1695173887231    3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04    rider-I    driver-S    41.06    chennai
20241121161428912    20241121161428912_2_1    c8abbe79-8d89-47ea-b4ce-4d224bae5bfa    city=chennai    878ae75b-bb04-4ed8-8591-8fafc56ed7ba-0_2-21-69_20241121161428912.parquet    1695115999911    c8abbe79-8d89-47ea-b4ce-4d224bae5bfa    rider-J    driver-T    17.85    chennai
Time taken: 0.185 seconds, Fetched 7 row(s) {code}


21/Nov/24 22:07, jonvex: To unblock the release, we can do one of two things:

1. Disable hoodie.spark.sql.optimized.writes.enable
   - This will decrease write performance during key generation and index lookup.
   - Positions will be included in the updates.
   - We can check whether positions are even enabled, and only default this off when position writing is disabled.

2. Keep the code as is
   - This will decrease read performance for uncompacted file groups.
   - We have the ability to fall back when positions are missing, and I have written extensive test cases to ensure that the fallback works correctly in all combinations of log and base files.
   - This can also use some extra disk space during the read, because we have to rewrite mappings in the spillable map, and deletes to the spillable map don't actually free up space until we close it.
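To make the trade-off in option 2 concrete, here is a toy Python sketch of the fallback behavior described above. This is not Hudi's implementation (the real logic lives in classes like HoodiePositionBasedFileGroupRecordBuffer); every name and structure below is hypothetical, and it only illustrates the shape of the idea: merge by position when every log record carries a valid one, otherwise fall back to merging by record key.

```python
# Toy sketch (NOT Hudi's implementation) of position-based merge with
# key-based fallback. Log records may carry a 'pos' (row position in the
# base file) and an optional 'delete' flag.

def merge_log_into_base(base_rows, log_rows):
    if all(r.get("pos") is not None for r in log_rows):
        # Position-based merge: each log record addresses its base row
        # directly, so no key lookup structure is needed.
        merged = list(base_rows)
        for r in log_rows:
            merged[r["pos"]] = None if r.get("delete") else r
        return [r for r in merged if r is not None]
    # Key-based fallback: build a key -> log-record map and scan the base
    # file, replacing or dropping matching rows.
    updates = {r["key"]: r for r in log_rows}
    out = []
    for row in base_rows:
        merged_row = updates.get(row["key"], row)
        if not merged_row.get("delete"):
            out.append(merged_row)
    return out

base = [{"key": "a", "fare": 19.1}, {"key": "b", "fare": 27.7}]
# UPDATE with a valid position -> position-based path.
updated = merge_log_into_base(base, [{"key": "a", "fare": 20.0, "pos": 0}])
# DELETE without a position -> key-based fallback path.
deleted = merge_log_into_base(base, [{"key": "b", "delete": True}])
```

The fallback gives the same merged result either way; the cost difference is that the key-based path must hash every record key, which is what makes reads of uncompacted file groups slower when positions were never written.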

 

To actually write positions using the prepped workflow, I think there is a way to do this, but it will not be that easy:

1. We will need to read _tmp_metadata_row_index inside the UPDATE and DELETE SQL commands.
   - This will take some work, because I don't think we have ever tried to read positions at the dataset level.
2. Then we will need to get the positions out during key generation.
   - This should be easy.
3. Then we will need to drop the column before we do the write.
   - This will probably be pretty easy.
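The three steps above can be sketched end to end with plain Python dicts standing in for DataFrame rows. Only the column name _tmp_metadata_row_index comes from the comment; the function names and record shape are hypothetical, purely to show how a position read at scan time could survive key generation and be dropped from the payload before the write.

```python
# Hypothetical sketch of the three-step plan; plain dicts stand in for
# DataFrame rows. Only the column name below is from the discussion.

ROW_INDEX_COL = "_tmp_metadata_row_index"

def read_with_row_index(base_rows):
    # Step 1: expose each record's position in its base file as a column.
    return [dict(r, **{ROW_INDEX_COL: i}) for i, r in enumerate(base_rows)]

def keygen_with_position(rows, key_field="uuid"):
    # Step 2: during key generation, lift the position out of the row into
    # (key, position) metadata attached to the record.
    return [
        {"key": r[key_field], "position": r[ROW_INDEX_COL], "data": r}
        for r in rows
    ]

def drop_row_index(records):
    # Step 3: drop the temporary column from the payload before writing.
    for rec in records:
        rec["data"] = {k: v for k, v in rec["data"].items()
                       if k != ROW_INDEX_COL}
    return records

rows = read_with_row_index([{"uuid": "u1", "fare": 19.1},
                            {"uuid": "u2", "fare": 27.7}])
records = drop_row_index(keygen_with_position(rows))
# Each record now carries its base-file position while the written payload
# no longer contains the temporary row-index column.
```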

 



26/Nov/24 00:30, yihua: Deferring this to Hudi 1.1, since this does not cause a correctness issue, and adding positional updates and deletes to SQL UPDATE and DELETE needs design.


10/Jan/25 00:18, yihua: In the UPDATE and DELETE commands, we'll try creating the relation with a schema that includes the row index meta column, or a new Hudi meta column, so that the row index column is attached to the returned DataFrame (this also requires the file group reader and Parquet reader to preserve the new row index column by fixing the wiring). That way, we can pass the positions down to the prepped write flow and prepare the HoodieRecords with the current record location.


10/Jan/25 01:59, yihua: I have a draft PR up which makes the prepped upsert flow write record positions to the log blocks from the Spark SQL UPDATE statement. I'm going to fix a few issues before opening it up for review.
