
[Bug]: Possible data loss in BigtableIO r/w if timestamp not set (default to epoch) #27022

Open · 2 of 15 tasks
Abacn opened this issue Jun 5, 2023 · 7 comments

Comments

Abacn (Contributor) commented Jun 5, 2023

What happened?

Reported from GoogleCloudPlatform/DataflowTemplates#759

When implementing a load test for BigtableIO, we encountered the following:

  • Load tests up to 200 MB pass stably.
  • After 5 million records, not all data gets into Bigtable, but the pipeline logs indicate that all data was written.

Dataflow write pipeline logs say that 10M records were written.
However, the read job shows only 1.6M records read.

Using the cbt utility, the `cbt -instance <instance> count <table>` command showed that the BigtableIO write did not work correctly. Although the logs say that all 10M records were written, the table contained exactly as many records as the read pipeline processed (1.6M). Some of the records processed by the write pipeline never made it into the table.

  • Dataflow write pipeline logs - 2023-06-05_03_51_23-9051905355392445711
  • Dataflow read pipeline logs - 2023-06-05_03_58_18-7016807525741705033

project: apache-beam-testing

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@Abacn Abacn self-assigned this Jun 5, 2023
@Abacn Abacn changed the title [Bug]: Possible data loss in BigtableIO r/w with large data [Bug]: Possible data loss in BigtableIO r/w with large amount data Jun 5, 2023
@Abacn Abacn changed the title [Bug]: Possible data loss in BigtableIO r/w with large amount data [Bug]: Possible data loss in BigtableIO r/w with large row and column Jun 5, 2023
@Abacn Abacn changed the title [Bug]: Possible data loss in BigtableIO r/w with large row and column [Bug]: Possible data loss in BigtableIO r/w with large number of row and column Jun 5, 2023
Abacn (Contributor, Author) commented Jun 6, 2023

This is not related to Beam; the DataflowTemplates test utility's resource manager has a wrong setting.

The cause was found there: GoogleCloudPlatform/DataflowTemplates#759 (comment)

The cause is:
- the cell does not have a timestamp set, so it defaults to the epoch (1970-01-01)
- the created table has a garbage-collection policy of 1h, so writing a large amount of data triggers GC and the epoch-stamped records get deleted

We need to use `.setTimestampMicros(java.time.Instant.now().toEpochMilli() * 1000)` for `Mutation.SetCell`; see the sketch below.
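For reference, a minimal sketch of the fix, assuming the `com.google.bigtable.v2.Mutation` protos that `BigtableIO.write()` consumes; the family, qualifier, and value names are illustrative:

```java
import com.google.bigtable.v2.Mutation;
import com.google.protobuf.ByteString;
import java.time.Instant;

// Build a SetCell mutation with an explicit timestamp. If timestampMicros
// is left unset, the proto default of 0 means the epoch (1970-01-01), so a
// 1h max-age GC policy collects the cell at the next compaction.
static Mutation setCellWithTimestamp() {
  return Mutation.newBuilder()
      .setSetCell(
          Mutation.SetCell.newBuilder()
              .setFamilyName("cf")                              // illustrative family
              .setColumnQualifier(ByteString.copyFromUtf8("col"))
              .setTimestampMicros(Instant.now().toEpochMilli() * 1000)
              .setValue(ByteString.copyFromUtf8("value")))
      .build();
}
```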

===========
(obsolete)

  • Tested with Beam 2.47.0 and 2.48.0, table created with `BigtableTableAdminClient.createTable`: expected number of records (tested with 20M and 100M records) (jobId: 2023-06-06_14_34_14-776136986672260899, 2023-06-06_15_02_28-18162755370264063675)
  • Tested with BigtableIOLT in DataflowTemplates, Beam 2.47.0: records missing (jobId: 2023-06-06_14_52_02-13679425791336453528)
  • Tested with BigtableIOLT in DataflowTemplates, Beam 2.47.0, table created with `BigtableTableAdminClient.createTable`: expected number of records (jobId: 2023-06-06_15_32_45-12170662821065212708)

For the job with missing records, running `cbt ... -instance <instance> count <table>` showed the number of records decreasing halfway through the write.

@Abacn Abacn closed this as not planned Won't fix, can't repro, duplicate, stale Jun 6, 2023
Abacn (Contributor, Author) commented Sep 22, 2023

It turns out this could also affect real use cases when the timestamp field is not set; reopening and keeping it at P1.

@Abacn Abacn reopened this Sep 22, 2023
Abacn (Contributor, Author) commented Sep 22, 2023

Possible solutions:

  • when the incoming timestamp is empty, default it to the current time instead of the epoch (see the sketch below)
  • when the incoming timestamp is empty, raise a loud warning
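A minimal sketch of the first option, assuming the same `com.google.bigtable.v2` protos; the helper name is hypothetical, not an actual Beam API:

```java
import com.google.bigtable.v2.Mutation;
import java.time.Instant;

// Hypothetical defaulting step: if a SetCell carries the proto default
// timestamp of 0 (the epoch), replace it with the current time so a
// max-age GC policy does not collect the cell immediately.
static Mutation defaultTimestampToNow(Mutation m) {
  if (m.hasSetCell() && m.getSetCell().getTimestampMicros() == 0) {
    return m.toBuilder()
        .setSetCell(
            m.getSetCell().toBuilder()
                .setTimestampMicros(Instant.now().toEpochMilli() * 1000))
        .build();
  }
  return m;
}
```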

Abacn (Contributor, Author) commented Sep 22, 2023

CC: @mutianf @ahmedabu98 (this also affects the cross-language Bigtable transform)

Abacn (Contributor, Author) commented Sep 29, 2023

Per #28624 (comment), at the least we should add some validation in the write transform; a sketch of what that could look like follows.
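A minimal sketch of such validation, with an assumed class name and message wording (not the actual Beam patch):

```java
import com.google.bigtable.v2.Mutation;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical validation: warn when a SetCell carries the default
// timestamp of 0 (the epoch), since a max-age GC policy would collect
// such cells almost immediately.
class SetCellTimestampValidator {
  private static final Logger LOG =
      LoggerFactory.getLogger(SetCellTimestampValidator.class);

  static void validate(Mutation m) {
    if (m.hasSetCell() && m.getSetCell().getTimestampMicros() == 0) {
      LOG.warn(
          "SetCell for column family '{}' has no timestamp set (defaults to "
              + "the epoch); it may be garbage-collected immediately under a "
              + "max-age policy.",
          m.getSetCell().getFamilyName());
    }
  }
}
```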

kennknowles (Member) commented:

Would the follow-up be P2 or still P1?

Abacn (Contributor, Author) commented Feb 28, 2024

This is due to a user bug (an incorrect/epoch timestamp attached to the cell). The issue is kept open because there is follow-up work (adding a warning) that can be done, so it is kept at P2, and the issue title is updated.

@Abacn Abacn changed the title [Bug]: Possible data loss in BigtableIO r/w with large number of row and column [Bug]: Possible data loss in BigtableIO r/w if timestamp not set (default to epoch) Feb 28, 2024
@Abacn Abacn added P2 and removed P1 labels Mar 11, 2024