
HDDS-11190. Add --fields option to ldb scan command #6976

Merged
merged 6 commits into from
Aug 27, 2024

Conversation

Contributor

@Tejaskriya Tejaskriya commented Jul 22, 2024

What changes were proposed in this pull request?

RocksDB stores data as key-value pairs, and the value itself may have a nested key-value structure. Currently, the ozone debug ldb scan command prints the full value of each record, which can be very verbose and include information not needed for the use case. A --fields option that restricts which fields are displayed for each record makes the output concise.

For example, if a value has the fields [name, location->[address, DN, IP], version, lastUpdateTime], then using the option "--fields=name,location.address,version" will display only name, address under location, and version in the output.
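The filtering described above, a recursive walk that keeps only the requested dotted paths and descends into lists, can be sketched in Python. This is a simplified illustration of the technique, not the PR's Java implementation:

```python
def filter_fields(value, paths):
    """Keep only the fields named by dotted paths like 'location.address'.

    Recurses into dicts; for lists, applies the same filter to each element.
    """
    if isinstance(value, list):
        return [filter_fields(item, paths) for item in value]
    if not isinstance(value, dict):
        return value
    # Group the requested paths by their first component.
    heads = {}
    for path in paths:
        head, _, rest = path.partition(".")
        heads.setdefault(head, []).append(rest)
    out = {}
    for key, rests in heads.items():
        if key not in value:
            continue
        rests = [r for r in rests if r]
        # No remaining path components: keep the whole sub-value.
        out[key] = value[key] if not rests else filter_fields(value[key], rests)
    return out

record = {"name": "n", "version": 3,
          "location": {"address": "a", "DN": "d", "IP": "i"}}
print(filter_fields(record, ["name", "location.address", "version"]))
# → {'name': 'n', 'location': {'address': 'a'}, 'version': 3}
```

Unknown path components are simply skipped, so a filter that matches nothing yields an empty object rather than an error.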

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11190

How was this patch tested?

Tested manually in docker cluster:

sh-4.2$ ozone debug ldb --db=/Users/tejaskriya.madhan/ZDump/bigDB/om.db scan --cf=fileTable -l=1
{ "/-9223371998060352000/-9223371998060348928/-9223371925202240254/part-00000-e15cb8fa-9f58-44ef-a3c2-647dc4cfcc84.c000.snappy.parquet": {
  "metadata" : { },
  "objectID" : -9223371925201863421,
  "updateID" : 436161909,
  "parentObjectID" : -9223371925202240254,
  "volumeName" : "eda",
  "bucketName" : "mi5-tlc-hsh",
  "keyName" : "part-00000-e15cb8fa-9f58-44ef-a3c2-647dc4cfcc84.c000.snappy.parquet",
  "dataSize" : 410571838,
  "keyLocationVersions" : [ {
    "version" : 0,
    "locationVersionMap" : {
      "0" : [ {
        "blockID" : {
          "containerBlockID" : {
            "containerID" : 1708889,
            "localID" : 111677748119571727
          },
          "blockCommitSequenceId" : 4755945
        },
        "length" : 268435456,
        "offset" : 0,
        "createVersion" : 0,
        "partNumber" : 0
      }, {
        "blockID" : {
          "containerBlockID" : {
            "containerID" : 309019,
            "localID" : 111677748119572182
          },
          "blockCommitSequenceId" : 4690972
        },
        "length" : 142136382,
        "offset" : 0,
        "createVersion" : 0,
        "partNumber" : 0
      } ]
    },
    "isMultipartKey" : false
  } ],
  "creationTime" : 1714434436283,
  "modificationTime" : 1714434508548,
  "replicationConfig" : {
    "replicationFactor" : "THREE",
    "replicationType" : "RATIS"
  },
  "encInfo" : {
    "cipherSuite" : "AES_CTR_NOPADDING",
    "version" : "ENCRYPTION_ZONES",
    "edek" : "C7ZhKbkKw5zlbnaQb1PVHaq4gK8WMDbzErKfI2fMEak=",
    "iv" : "XPDNypP45M6UVCSI9u1vPw==",
    "keyName" : "eda_mi5-tlc-hsh_bkt_key",
    "ezKeyVersionName" : "ynynAQT06m12MuqvTLWnBoWRy2jMYAaIGiyNXJXgLDD"
  },
  "isFile" : false,
  "fileName" : "part-00000-e15cb8fa-9f58-44ef-a3c2-647dc4cfcc84.c000.snappy.parquet",
  "acls" : [ {
    "type" : "USER",
    "name" : "svc_dw_gps",
    "aclScope" : "ACCESS"
  }, {
    "type" : "GROUP",
    "name" : "dwhdpdei",
    "aclScope" : "ACCESS"
  } ],
  "tags" : { }
}
 }


sh-4.2$ ozone debug ldb --db=/Users/tejaskriya.madhan/ZDump/bigDB/om.db scan --cf=fileTable -l=1 --fields=volumeName,bucketName,keyName,keyLocationVersions.locationVersionMap.blockID,acls.name
{ "/-9223371998060352000/-9223371998060348928/-9223371925202240254/part-00000-e15cb8fa-9f58-44ef-a3c2-647dc4cfcc84.c000.snappy.parquet": {
  "keyName" : "part-00000-e15cb8fa-9f58-44ef-a3c2-647dc4cfcc84.c000.snappy.parquet",
  "keyLocationVersions" : [ {
    "locationVersionMap" : {
      "0" : [ {
        "blockID" : {
          "containerBlockID" : {
            "containerID" : 1708889,
            "localID" : 111677748119571727
          },
          "blockCommitSequenceId" : 4755945
        }
      }, {
        "blockID" : {
          "containerBlockID" : {
            "containerID" : 309019,
            "localID" : 111677748119572182
          },
          "blockCommitSequenceId" : 4690972
        }
      } ]
    }
  } ],
  "bucketName" : "mi5-tlc-hsh",
  "acls" : [ {
    "name" : "svc_dw_gps"
  }, {
    "name" : "dwhdpdei"
  } ],
  "volumeName" : "eda"
}
 }

@Tejaskriya Tejaskriya marked this pull request as ready for review July 23, 2024 10:44
@adoroszlai
Contributor

Thanks @Tejaskriya for working on this. Have you considered using jq for filtering output?

@Tejaskriya
Contributor Author

@adoroszlai are you suggesting that we drop the filter option and have users do the filtering with jq commands?
In that case, using jq was considered, but for larger DBs that means processing the data twice. With an option in our code, we read the data only once and filter it as we go.

@errose28
Contributor

errose28 commented Jul 31, 2024

I also suggested using jq.

but for larger dbs it will be reading the data twice. With adding an option to our code, we will be reading the data only once and filtering it simultaneously.

I don't think this is how it would work. This seems to describe jq as blocking until the whole DB is read, and only then beginning filtering on all the objects before giving the final output. jq actually works on streams. Our ldb process would read and print lines to stdout. After a line is printed, our process moves on to read and print more of the DB while jq is filtering the lines that were just printed at the same time.

If there is a speedup it would probably be because we are reducing the amount of data that gets converted to json and printed. However, this benefit might be negated because this filter is implemented with Java reflection and jq filtering is in C.

Can we get benchmarks of various filtering queries using jq vs this method? Ideally on larger DBs with at least thousands of keys. Based on these results we can decide whether this option is something we should support.
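The streaming behavior described above, where the downstream filter consumes lines while the upstream process is still producing them, can be demonstrated with a small Python pipeline. The two inline scripts are stand-ins for the ldb scan process and jq; they are not the actual tools:

```python
import subprocess
import sys

# Stand-in for "ozone debug ldb ... scan": emits one JSON record per line.
producer = subprocess.Popen(
    [sys.executable, "-c",
     "import json, sys\n"
     "for i in range(5):\n"
     "    print(json.dumps({'keyName': 'key%d' % i, 'dataSize': i * 100}))\n"
     "    sys.stdout.flush()"],
    stdout=subprocess.PIPE)

# Stand-in for "jq .keyName": filters each line as soon as it arrives,
# without waiting for the producer to finish reading the whole DB.
consumer = subprocess.Popen(
    [sys.executable, "-c",
     "import json, sys\n"
     "for line in sys.stdin:\n"
     "    print(json.loads(line)['keyName'])"],
    stdin=producer.stdout, stdout=subprocess.PIPE, text=True)

producer.stdout.close()  # let the consumer own the read end of the pipe
filtered, _ = consumer.communicate()
print(filtered.strip())
```

Because the two processes are connected by a pipe, the OS schedules them concurrently; the consumer never waits for the producer to finish before it starts filtering.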

@Tejaskriya Tejaskriya marked this pull request as draft August 5, 2024 08:16
Contributor

@sumitagrawl sumitagrawl left a comment


@Tejaskriya Thanks for working on this; we can do some optimization in the logic.

@@ -405,6 +419,47 @@ private boolean printTable(List<ColumnFamilyHandle> columnFamilyHandleList,
IOUtils.closeQuietly(iterator, readOptions, slice);
}
}
boolean checkValidValueFields(String dbPath, String valueFields,
DBColumnFamilyDefinition<?, ?> columnFamilyDefinition) {

Validation may not be required; the scan will simply return all matching data.

@Tejaskriya
Contributor Author

I have run some benchmarks on a large scm.db which had a huge number of records in the deletedBlocks table. Because jq requires JSON object keys to be strings, I had to use sed to wrap each key (which is an integer) in quotes before passing the output to jq.
The logic of this PR performed far better than jq. The script used: https://gist.github.com/Tejaskriya/5168efd19e7cdb08fdfc6804100b71e7
Performance graph: (screenshot attached to the PR)
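The key-quoting workaround mentioned above can be illustrated in Python; the regex plays the role of the sed step in the benchmark script, and the sample line is hypothetical rather than actual deletedBlocks output:

```python
import json
import re

# A scan output line whose top-level key is a bare integer -- not valid JSON.
raw = '{ 1708889: {"containerID": 1708889, "count": 3} }'

# Wrap an unquoted integer key (one that follows '{' or ',' and precedes ':')
# in double quotes, so jq or any JSON parser can consume the line.
fixed = re.sub(r'([{,]\s*)(-?\d+)(\s*:)', r'\1"\2"\3', raw)

record = json.loads(fixed)
print(record)  # → {'1708889': {'containerID': 1708889, 'count': 3}}
```

Note that integer values (such as the inner containerID) are left untouched, since they are not followed by a colon in a key position.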

@Tejaskriya Tejaskriya marked this pull request as ready for review August 20, 2024 06:47
@Tejaskriya
Contributor Author

@sumitagrawl thank you for the review! I have made the methods in ValueSchema static, moved the splitting of fields outside the loop, and removed the validation.
Additionally, as discussed offline, I have implemented the filtering recursively so that multiple levels of nesting can be filtered. No depth limit is set for now. Could you please review the updated patch?


@sumitagrawl sumitagrawl left a comment


@sumitagrawl sumitagrawl merged commit 2e30dc1 into apache:master Aug 27, 2024
39 checks passed