HDDS-11190. Add --fields option to ldb scan command #6976
Thanks @Tejaskriya for working on this. Have you considered using |
@adoroszlai are you suggesting that we drop the filter option and have users do the filtering with jq commands instead?
I also suggested using
I don't think this is how it would work. This seems to describe jq as blocking until the whole DB is read, and only then beginning filtering on all the objects before giving the final output. jq actually works on streams. Our ldb process would read and print lines to stdout. After a line is printed, our process moves on to read and print more of the DB while jq is filtering the lines that were just printed at the same time. If there is a speedup it would probably be because we are reducing the amount of data that gets converted to JSON and printed. However, this benefit might be negated because this filter is implemented with Java reflection and jq filtering is in C. Can we get benchmarks of various filtering queries using jq vs this method? Ideally on larger DBs with at least thousands of keys. Based on these results we can decide whether this option is something we should support.
@Tejaskriya Thanks for working on this. We can do some optimization in the logic.
hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/debug/DBScanner.java
@@ -405,6 +419,47 @@ private boolean printTable(List<ColumnFamilyHandle> columnFamilyHandleList,
      IOUtils.closeQuietly(iterator, readOptions, slice);
    }
  }
  boolean checkValidValueFields(String dbPath, String valueFields,
      DBColumnFamilyDefinition<?, ?> columnFamilyDefinition) {
Validation may not be required; without it, the scan will simply give all matching data.
I ran some benchmarks on a large scm.db that had a huge number of records in the deletedBlocks table. Because jq accepts only strings as keys, I had to use sed to wrap each key (which is an integer) in quotes and then pass the output to jq.
@sumitagrawl thank you for the review! I have made the methods in ValueSchema static, moved the splitting of fields outside the loop, and removed the validation.
@Tejaskriya LGTM
What changes were proposed in this pull request?
RocksDB stores data in key-value pairs. The value itself may have some kind of key-value structure. Currently, the ozone debug ldb scan command shows the full value for each record being displayed. This can be very verbose and include information that is not needed for the use case. A --fields option that filters the fields displayed for each record will help to get concise output.
For example, if a value has many fields like [name, location->[address, DN, IP], version, lastUpdateTime], using the option "--fields=name,location.address,version" will display only name, the address inside location, and version in the output.
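The dotted-path selection in the example above can be sketched as follows. This is a minimal illustration that assumes each record value has already been deserialized into nested Map objects; according to the discussion, the PR itself filters via Java reflection, so this is not the DBScanner or ValueSchema code. The class FieldFilterSketch, its methods parseFields and filter, and the sample values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: not the DBScanner/ValueSchema code from this PR. */
public class FieldFilterSketch {

  /** Splits a spec such as "name,location.address" into path segments. */
  public static List<String[]> parseFields(String spec) {
    List<String[]> paths = new ArrayList<>();
    for (String field : spec.split(",")) {
      String trimmed = field.trim();
      if (!trimmed.isEmpty()) {
        paths.add(trimmed.split("\\."));
      }
    }
    return paths;
  }

  /** Returns a new map holding only the requested paths of the record. */
  public static Map<String, Object> filter(Map<String, Object> record,
      List<String[]> paths) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (String[] path : paths) {
      copyPath(record, out, path, 0);
    }
    return out;
  }

  @SuppressWarnings("unchecked")
  private static void copyPath(Map<String, Object> src,
      Map<String, Object> dst, String[] path, int depth) {
    Object value = src.get(path[depth]);
    if (value == null) {
      return; // unknown field: skipped silently, no validation
    }
    if (depth == path.length - 1) {
      dst.put(path[depth], value); // end of the path: copy the whole value
    } else if (value instanceof Map) {
      Object existing = dst.get(path[depth]);
      Map<String, Object> child = existing instanceof Map
          ? (Map<String, Object>) existing
          : new LinkedHashMap<String, Object>();
      copyPath((Map<String, Object>) value, child, path, depth + 1);
      if (!child.isEmpty()) {
        dst.put(path[depth], child); // keep the parent only if something matched
      }
    }
  }

  public static void main(String[] args) {
    // Made-up sample record mirroring the field names in the example.
    Map<String, Object> location = new LinkedHashMap<>();
    location.put("address", "addr1");
    location.put("DN", "dn1");
    location.put("IP", "10.0.0.1");
    Map<String, Object> value = new LinkedHashMap<>();
    value.put("name", "key1");
    value.put("location", location);
    value.put("version", 3);
    value.put("lastUpdateTime", 1700000000L);

    // The spec is parsed once, before iterating over any records.
    List<String[]> paths = parseFields("name,location.address,version");
    System.out.println(filter(value, paths));
  }
}
```

In this sketch a field that does not exist in the record is simply omitted from the output rather than rejected, which matches the reviewers' point that up-front validation of the field list may not be needed.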
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-11190
How was this patch tested?
Tested manually in docker cluster: