
HDDS-11190. Add --fields option to ldb scan command #6976

Merged
merged 6 commits into from
Aug 27, 2024

Conversation

Contributor

@Tejaskriya Tejaskriya commented Jul 22, 2024

What changes were proposed in this pull request?

RocksDB stores data as key-value pairs, and the value itself may have a nested key-value structure. Currently, the ozone debug ldb scan command prints the full value of each record, which can be very verbose and include information not needed for the use case. A --fields option that restricts which fields are displayed for each record makes the output concise.

For example, if a value has the fields [name, location->[address, DN, IP], version, lastUpdateTime], then using the option "--fields=name,location.address,version" will display only name, address under location, and version in the output.
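The filtering described above, a recursive walk that keeps only the requested dotted paths and descends into lists, can be sketched in Python. This is a simplified illustration of the technique, not the PR's Java implementation:

```python
def filter_fields(value, paths):
    """Keep only the fields named by dotted paths like 'location.address'.

    Recurses into dicts; for lists, applies the same filter to each element.
    """
    if isinstance(value, list):
        return [filter_fields(item, paths) for item in value]
    if not isinstance(value, dict):
        return value
    # Group the requested paths by their first component.
    heads = {}
    for path in paths:
        head, _, rest = path.partition(".")
        heads.setdefault(head, []).append(rest)
    out = {}
    for key, rests in heads.items():
        if key not in value:
            continue
        rests = [r for r in rests if r]
        # No remaining path components: keep the whole sub-value.
        out[key] = value[key] if not rests else filter_fields(value[key], rests)
    return out

record = {"name": "n", "version": 3,
          "location": {"address": "a", "DN": "d", "IP": "i"}}
print(filter_fields(record, ["name", "location.address", "version"]))
# → {'name': 'n', 'location': {'address': 'a'}, 'version': 3}
```

Unknown path components are simply skipped, so a filter that matches nothing yields an empty object rather than an error.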

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11190

How was this patch tested?

Tested manually in docker cluster:

sh-4.2$ ozone debug ldb --db=/Users/tejaskriya.madhan/ZDump/bigDB/om.db scan --cf=fileTable -l=1
{ "/-9223371998060352000/-9223371998060348928/-9223371925202240254/part-00000-e15cb8fa-9f58-44ef-a3c2-647dc4cfcc84.c000.snappy.parquet": {
  "metadata" : { },
  "objectID" : -9223371925201863421,
  "updateID" : 436161909,
  "parentObjectID" : -9223371925202240254,
  "volumeName" : "eda",
  "bucketName" : "mi5-tlc-hsh",
  "keyName" : "part-00000-e15cb8fa-9f58-44ef-a3c2-647dc4cfcc84.c000.snappy.parquet",
  "dataSize" : 410571838,
  "keyLocationVersions" : [ {
    "version" : 0,
    "locationVersionMap" : {
      "0" : [ {
        "blockID" : {
          "containerBlockID" : {
            "containerID" : 1708889,
            "localID" : 111677748119571727
          },
          "blockCommitSequenceId" : 4755945
        },
        "length" : 268435456,
        "offset" : 0,
        "createVersion" : 0,
        "partNumber" : 0
      }, {
        "blockID" : {
          "containerBlockID" : {
            "containerID" : 309019,
            "localID" : 111677748119572182
          },
          "blockCommitSequenceId" : 4690972
        },
        "length" : 142136382,
        "offset" : 0,
        "createVersion" : 0,
        "partNumber" : 0
      } ]
    },
    "isMultipartKey" : false
  } ],
  "creationTime" : 1714434436283,
  "modificationTime" : 1714434508548,
  "replicationConfig" : {
    "replicationFactor" : "THREE",
    "replicationType" : "RATIS"
  },
  "encInfo" : {
    "cipherSuite" : "AES_CTR_NOPADDING",
    "version" : "ENCRYPTION_ZONES",
    "edek" : "C7ZhKbkKw5zlbnaQb1PVHaq4gK8WMDbzErKfI2fMEak=",
    "iv" : "XPDNypP45M6UVCSI9u1vPw==",
    "keyName" : "eda_mi5-tlc-hsh_bkt_key",
    "ezKeyVersionName" : "ynynAQT06m12MuqvTLWnBoWRy2jMYAaIGiyNXJXgLDD"
  },
  "isFile" : false,
  "fileName" : "part-00000-e15cb8fa-9f58-44ef-a3c2-647dc4cfcc84.c000.snappy.parquet",
  "acls" : [ {
    "type" : "USER",
    "name" : "svc_dw_gps",
    "aclScope" : "ACCESS"
  }, {
    "type" : "GROUP",
    "name" : "dwhdpdei",
    "aclScope" : "ACCESS"
  } ],
  "tags" : { }
}
 }


sh-4.2$ ozone debug ldb --db=/Users/tejaskriya.madhan/ZDump/bigDB/om.db scan --cf=fileTable -l=1 --fields=volumeName,bucketName,keyName,keyLocationVersions.locationVersionMap.blockID,acls.name
{ "/-9223371998060352000/-9223371998060348928/-9223371925202240254/part-00000-e15cb8fa-9f58-44ef-a3c2-647dc4cfcc84.c000.snappy.parquet": {
  "keyName" : "part-00000-e15cb8fa-9f58-44ef-a3c2-647dc4cfcc84.c000.snappy.parquet",
  "keyLocationVersions" : [ {
    "locationVersionMap" : {
      "0" : [ {
        "blockID" : {
          "containerBlockID" : {
            "containerID" : 1708889,
            "localID" : 111677748119571727
          },
          "blockCommitSequenceId" : 4755945
        }
      }, {
        "blockID" : {
          "containerBlockID" : {
            "containerID" : 309019,
            "localID" : 111677748119572182
          },
          "blockCommitSequenceId" : 4690972
        }
      } ]
    }
  } ],
  "bucketName" : "mi5-tlc-hsh",
  "acls" : [ {
    "name" : "svc_dw_gps"
  }, {
    "name" : "dwhdpdei"
  } ],
  "volumeName" : "eda"
}
 }

@Tejaskriya Tejaskriya marked this pull request as ready for review July 23, 2024 10:44
@adoroszlai
Contributor

Thanks @Tejaskriya for working on this. Have you considered using jq for filtering output?

@Tejaskriya
Contributor Author

@adoroszlai are you suggesting that we drop the filter option and have users do the filtering with jq commands?
In that case, using jq was considered, but for larger DBs that means processing the data twice. With an option in our code, we read the data only once and filter it as we go.

@errose28
Contributor

errose28 commented Jul 31, 2024

I also suggested using jq.

but for larger dbs it will be reading the data twice. With adding an option to our code, we will be reading the data only once and filtering it simultaneously.

I don't think this is how it would work. This seems to describe jq as blocking until the whole DB is read, and only then beginning filtering on all the objects before giving the final output. jq actually works on streams. Our ldb process would read and print lines to stdout. After a line is printed, our process moves on to read and print more of the DB while jq is filtering the lines that were just printed at the same time.

If there is a speedup it would probably be because we are reducing the amount of data that gets converted to json and printed. However, this benefit might be negated because this filter is implemented with Java reflection and jq filtering is in C.

Can we get benchmarks of various filtering queries using jq vs this method? Ideally on larger DBs with at least thousands of keys. Based on these results we can decide whether this option is something we should support.
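The streaming behavior described above, where the downstream filter consumes lines while the upstream process is still producing them, can be demonstrated with a small Python pipeline. The two inline scripts are stand-ins for the ldb scan process and jq; they are not the actual tools:

```python
import subprocess
import sys

# Stand-in for "ozone debug ldb ... scan": emits one JSON record per line.
producer = subprocess.Popen(
    [sys.executable, "-c",
     "import json, sys\n"
     "for i in range(5):\n"
     "    print(json.dumps({'keyName': 'key%d' % i, 'dataSize': i * 100}))\n"
     "    sys.stdout.flush()"],
    stdout=subprocess.PIPE)

# Stand-in for "jq .keyName": filters each line as soon as it arrives,
# without waiting for the producer to finish reading the whole DB.
consumer = subprocess.Popen(
    [sys.executable, "-c",
     "import json, sys\n"
     "for line in sys.stdin:\n"
     "    print(json.loads(line)['keyName'])"],
    stdin=producer.stdout, stdout=subprocess.PIPE, text=True)

producer.stdout.close()  # let the consumer own the read end of the pipe
filtered, _ = consumer.communicate()
print(filtered.strip())
```

Because the two processes are connected by a pipe, the OS schedules them concurrently; the consumer never waits for the producer to finish before it starts filtering.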

@Tejaskriya Tejaskriya marked this pull request as draft August 5, 2024 08:16
Contributor

@sumitagrawl sumitagrawl left a comment


@Tejaskriya Thanks for working on this; we can do some optimization in the logic.

@@ -405,6 +419,47 @@ private boolean printTable(List<ColumnFamilyHandle> columnFamilyHandleList,
IOUtils.closeQuietly(iterator, readOptions, slice);
}
}
boolean checkValidValueFields(String dbPath, String valueFields,
DBColumnFamilyDefinition<?, ?> columnFamilyDefinition) {

Validation may not be required; the scan will simply return all matching data.

@Tejaskriya
Contributor Author

I have run some benchmarks on a large scm.db which had a huge number of records in the deletedBlocks table. Because jq requires JSON object keys to be strings, I had to use sed to wrap each key (which is an integer) in quotes before passing the output to jq.
The logic of this PR performed far better than jq. The script used: https://gist.github.com/Tejaskriya/5168efd19e7cdb08fdfc6804100b71e7
Performance graph: (screenshot attached to the PR)
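The key-quoting workaround mentioned above can be illustrated in Python; the regex plays the role of the sed step in the benchmark script, and the sample line is hypothetical rather than actual deletedBlocks output:

```python
import json
import re

# A scan output line whose top-level key is a bare integer -- not valid JSON.
raw = '{ 1708889: {"containerID": 1708889, "count": 3} }'

# Wrap an unquoted integer key (one that follows '{' or ',' and precedes ':')
# in double quotes, so jq or any JSON parser can consume the line.
fixed = re.sub(r'([{,]\s*)(-?\d+)(\s*:)', r'\1"\2"\3', raw)

record = json.loads(fixed)
print(record)  # → {'1708889': {'containerID': 1708889, 'count': 3}}
```

Note that integer values (such as the inner containerID) are left untouched, since they are not followed by a colon in a key position.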

@Tejaskriya Tejaskriya marked this pull request as ready for review August 20, 2024 06:47
@Tejaskriya
Contributor Author

@sumitagrawl thank you for the review! I have made the methods in ValueSchema static, moved the splitting of fields outside the loop, and removed the validation.
Additionally, as discussed offline, I have implemented the filtering recursively so that multiple levels of nesting can be filtered. No depth limit is set for now. Could you please review the updated patch?


@sumitagrawl sumitagrawl left a comment


@sumitagrawl sumitagrawl merged commit 2e30dc1 into apache:master Aug 27, 2024
39 checks passed