Writing data to Iceberg with the Java API from multiple threads causes data loss #8610

Closed

lengkristy opened this issue Sep 21, 2023 · 2 comments

lengkristy commented Sep 21, 2023

The Java code is essentially this:

```java
import java.time.OffsetDateTime;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.PartitionKey;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.data.GenericAppenderFactory;
import org.apache.iceberg.data.GenericRecord;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.io.OutputFileFactory;
import org.apache.iceberg.io.PartitionedFanoutWriter;
import org.apache.iceberg.types.Types;

Configuration configuration = new Configuration();
// this is a local file catalog
HadoopCatalog hadoopCatalog = new HadoopCatalog(configuration, icebergWareHousePath);
TableIdentifier name = TableIdentifier.of("logging", "logs");
Schema schema = new Schema(
    Types.NestedField.required(1, "level", Types.StringType.get()),
    Types.NestedField.required(2, "event_time", Types.TimestampType.withZone()),
    Types.NestedField.required(3, "message", Types.StringType.get()),
    Types.NestedField.optional(4, "call_stack", Types.ListType.ofRequired(5, Types.StringType.get())));
PartitionSpec spec = PartitionSpec.builderFor(schema)
    .hour("event_time")
    .identity("level")
    .build();
Table table = hadoopCatalog.createTable(name, schema, spec);

GenericAppenderFactory appenderFactory = new GenericAppenderFactory(table.schema());

int partitionId = 1, taskId = 1;
OutputFileFactory outputFileFactory =
    OutputFileFactory.builderFor(table, partitionId, taskId).format(FileFormat.PARQUET).build();
final PartitionKey partitionKey = new PartitionKey(table.spec(), table.spec().schema());

// partitionedFanoutWriter automatically partitions each record and keeps one open writer per partition
PartitionedFanoutWriter<Record> partitionedFanoutWriter =
    new PartitionedFanoutWriter<Record>(table.spec(), FileFormat.PARQUET, appenderFactory,
        outputFileFactory, table.io(), TARGET_FILE_SIZE_IN_BYTES) {
      @Override
      protected PartitionKey partition(Record record) {
        partitionKey.partition(record);
        return partitionKey;
      }
    };

Random random = new Random();
List<String> levels = Arrays.asList("info", "debug", "error", "warn");
GenericRecord genericRecord = GenericRecord.create(table.schema());

// assume we write 1000 records
for (int i = 0; i < 1000; i++) {
    GenericRecord record = genericRecord.copy();
    record.setField("level", levels.get(random.nextInt(levels.size())));
    // record.setField("event_time", System.currentTimeMillis());
    record.setField("event_time", OffsetDateTime.now());
    record.setField("message", "Iceberg is a great table format");
    record.setField("call_stack", Arrays.asList("NullPointerException"));
    partitionedFanoutWriter.write(record);
}

// the writer must be closed before its data files can be collected
partitionedFanoutWriter.close();

AppendFiles appendFiles = table.newAppend();

// add the completed data files to the pending append
Arrays.stream(partitionedFanoutWriter.dataFiles()).forEach(appendFiles::appendFile);

// apply() only stages the new snapshot; commit() publishes it to the table
Snapshot newSnapshot = appendFiles.apply();
appendFiles.commit();
```
Data loss may occur when writing to Iceberg with high concurrency.
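
To be concrete, the failing pattern is several threads each running that write-and-commit sequence against the same table. A minimal sketch of it (hypothetical; `createWriter` is a stand-in that rebuilds the fanout writer above, and `generateRecords` is a stand-in record source):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.io.TaskWriter;

ExecutorService pool = Executors.newFixedThreadPool(8);
List<Future<?>> futures = new ArrayList<>();
for (int t = 0; t < 8; t++) {
    futures.add(pool.submit(() -> {
        // each thread builds its own writer and commits its own append
        TaskWriter<Record> writer = createWriter(table);  // hypothetical helper, setup as above
        for (Record record : generateRecords(1000)) {     // hypothetical record source
            writer.write(record);
        }
        writer.close();
        AppendFiles append = table.newAppend();
        Arrays.stream(writer.dataFiles()).forEach(append::appendFile);
        // with a HadoopCatalog on a fs without atomic rename, these concurrent
        // commits can silently clobber each other's metadata update
        append.commit();
        return null;
    }));
}
for (Future<?> f : futures) {
    f.get(); // propagate any write/commit failure
}
pool.shutdown();
```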

@paulpaul1076

Are you using the local file system? Your comment says "this is a local file catalog". The docs say here: https://iceberg.apache.org/docs/latest/java-api-quickstart/#using-a-hadoop-catalog that concurrent writes to a local fs with a Hadoop catalog are not safe. For them to be safe, your fs has to support atomic rename.
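
If all of your writers live in one JVM, one mitigation (a sketch only; it does not make the HadoopCatalog safe across processes) is to let worker threads only produce data files and funnel everything into a single commit from one thread:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.io.TaskWriter;

// worker threads only write data files; no thread commits on its own
List<DataFile> completedFiles = Collections.synchronizedList(new ArrayList<>());
ExecutorService pool = Executors.newFixedThreadPool(8);
List<Future<?>> futures = new ArrayList<>();
for (int t = 0; t < 8; t++) {
    futures.add(pool.submit(() -> {
        TaskWriter<Record> writer = createWriter(table);  // hypothetical helper as before
        for (Record record : generateRecords(1000)) {     // hypothetical record source
            writer.write(record);
        }
        writer.close();
        completedFiles.addAll(Arrays.asList(writer.dataFiles()));
        return null;
    }));
}
for (Future<?> f : futures) {
    f.get();
}
pool.shutdown();

// a single commit from the coordinating thread, so the catalog never sees
// concurrent metadata updates from this process
AppendFiles append = table.newAppend();
completedFiles.forEach(append::appendFile);
append.commit();
```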


rdblue commented Oct 5, 2023

Yes, I agree. The safety of the HadoopCatalog depends on the file system, which is one reason why we don't recommend using that catalog.
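
For anyone hitting this, a minimal sketch of switching to a catalog whose commits don't depend on filesystem rename, assuming the `iceberg-jdbc` module and a JDBC driver (SQLite here, purely as an example) are on the classpath:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.Table;
import org.apache.iceberg.jdbc.JdbcCatalog;

// a JdbcCatalog keeps the pointer to the current table metadata in a database,
// so concurrent commits are serialized by the database rather than by rename
Map<String, String> props = new HashMap<>();
props.put(CatalogProperties.URI, "jdbc:sqlite:/tmp/iceberg_catalog.db"); // hypothetical connection string
props.put(CatalogProperties.WAREHOUSE_LOCATION, icebergWareHousePath);

JdbcCatalog catalog = new JdbcCatalog();
catalog.setConf(configuration);               // same Hadoop Configuration as in the report
catalog.initialize("logging_catalog", props); // hypothetical catalog name

// table creation and the write path stay exactly the same
Table table = catalog.createTable(name, schema, spec);
```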

rdblue closed this as completed Oct 5, 2023