
[Feature Request] Add options to output partition range on differences #260

Open
guofei opened this issue Apr 12, 2024 · 0 comments
Contributor

guofei commented Apr 12, 2024

While running the validation job with cassandra-data-migrator, we noticed that if writes arrive through zdm-proxy during the run, the validation job may overwrite them, so the data no longer reflects the latest write.

For instance:

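  • Time t1: the validation job detects a difference
  • Time t2: a new write arrives through zdm-proxy
  • Time t3: autocorrect applies the data read at t1, overwriting the t2 write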
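
To make the timeline concrete, here is a minimal in-memory sketch of the race. It is not CDM code: the dictionaries, the key `k1`, and the function name are hypothetical, and it assumes that autocorrect writes the value it captured at t1 and that the proxy write reaches both clusters.

```python
def simulate_race():
    # Hypothetical stand-ins for the origin and target clusters.
    origin = {"k1": "v1"}
    target = {"k1": "v0"}

    # t1: validation reads both sides and records the rows that differ.
    pending_fix = {k: v for k, v in origin.items() if target.get(k) != v}

    # t2: a client write goes through the proxy and reaches both clusters.
    origin["k1"] = "v2"
    target["k1"] = "v2"

    # t3: autocorrect applies the value captured at t1 (assumed behavior).
    target.update(pending_fix)
    return origin, target


origin, target = simulate_race()
print(origin["k1"], target["k1"])  # prints "v2 v1"
```

After t3 the target holds the older value again, which is why a second validation run would still report a difference.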

Our current workaround is to rerun the validation job until it reports no differences. This is time-consuming because each run starts from scratch. The CSV file, which could save time by restricting validation to specific ranges, is only written when there is an error. We propose writing the CSV file whenever there are differences, not just errors. The file can then be fed into subsequent validation runs so that they cover only the problematic ranges, reducing the overall execution time.
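
The intended workflow could be sketched roughly as follows. This is an illustration only: `validate_until_clean` and the `has_diff` callback are hypothetical names, not part of cassandra-data-migrator, and a token range is modeled as a plain `(min, max)` tuple.

```python
from typing import Callable, List, Tuple

TokenRange = Tuple[int, int]


def validate_until_clean(
    ranges: List[TokenRange],
    has_diff: Callable[[TokenRange], bool],
    max_rounds: int = 5,
) -> Tuple[int, List[TokenRange]]:
    """Re-validate only the ranges that differed in the previous round.

    Returns the number of rounds run and the ranges that still differ.
    """
    rounds = 0
    while ranges and rounds < max_rounds:
        # Keep only the ranges where this round found differences.
        ranges = [r for r in ranges if has_diff(r)]
        rounds += 1
    return rounds, ranges


# Stand-in validator: range (0, 99) differs on the first pass only.
seen = set()


def fake_has_diff(r: TokenRange) -> bool:
    first_time = r not in seen
    seen.add(r)
    return first_time and r == (0, 99)


rounds, remaining = validate_until_clean([(-100, -1), (0, 99)], fake_has_diff)
print(rounds, remaining)  # prints "2 []"
```

The second round touches only the one range that differed, instead of starting over across the whole token space.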

We propose adding an option such as spark.cdm.tokenrange.partitionFile.appendOnDiff. When spark.cdm.tokenrange.partitionFile.appendOnDiff=true, a partition range is written to the file if it contains any differences. The change is backward compatible, since it only affects behavior when the new option is explicitly set to true.
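
A minimal sketch of the decision this option would control is shown below. The function names are hypothetical, and the one-range-per-line `min,max` CSV layout is an assumption made for illustration rather than a statement about the actual file format.

```python
import csv


def should_append_range(had_error: bool, had_diff: bool,
                        append_on_diff: bool = False) -> bool:
    """Decide whether a partition range goes into the CSV file.

    With append_on_diff left at False, only errors trigger a write,
    which matches the behavior described in this issue.
    """
    return had_error or (append_on_diff and had_diff)


def append_range(path: str, token_min: int, token_max: int) -> None:
    # Assumed layout: one "min,max" pair per line.
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([token_min, token_max])
```

Because the flag defaults to off, a range that only has differences is skipped unless the user opts in.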

We would also like the input and output CSV files to be separate. We therefore suggest two more options, spark.cdm.tokenrange.partitionFile.input and spark.cdm.tokenrange.partitionFile.output, to specify the input and output CSV files respectively. These changes are also backward compatible, as they only change behavior when the new options are used.
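
One way the fallback could work is sketched here. It assumes the existing single-file setting is named spark.cdm.tokenrange.partitionFile and that the configuration is available as a plain dictionary; both are assumptions for illustration, and `resolve_partition_files` is a hypothetical helper.

```python
from typing import Dict, Optional, Tuple

# Assumed name of the existing single-file setting.
LEGACY_KEY = "spark.cdm.tokenrange.partitionFile"


def resolve_partition_files(
    conf: Dict[str, str],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (input_file, output_file).

    Each new option falls back to the single-file setting when unset,
    so configurations that do not use the new options are unaffected.
    """
    legacy = conf.get(LEGACY_KEY)
    input_file = conf.get(LEGACY_KEY + ".input", legacy)
    output_file = conf.get(LEGACY_KEY + ".output", legacy)
    return input_file, output_file


print(resolve_partition_files({LEGACY_KEY: "ranges.csv"}))
# prints "('ranges.csv', 'ranges.csv')"
```

With only the single-file setting present, both paths resolve to the same file; setting either new option overrides just that side.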

We have already implemented these features in our fork of the project. If the changes align with the project's direction, we would be happy to open a pull request so the community can review them and potentially integrate them into the main project. We believe these enhancements would make the validation process considerably more efficient.

@guofei guofei changed the title Add options to output partition range on differences [Feature Request] Add options to output partition range on differences Apr 12, 2024