Crude prototype for zng archive (zar) #482
Conversation
This commit adds a zar command for creating and searching bzng index files. The index files are organized as sorted string tables and use the sst package. The current prototype is very early and currently indexes only IP addresses as a hard-wired configuration. The code, however, is set up to do more flexible indexing soon. The sst package uses a simple binary format for framing keys and values as arbitrary byte sequences, but we realized it could be a very powerful foundation to store the sst files as bzng. We will implement this change in a future PR.
From your comments, I saw there are 2 related changes that might be helpful soon: key-only sst support, and a real Visitor for zng.Records. Are there any other small to medium tasks that we could create & assign?
// representing the key's string or the value's byte slice. The body is compressed
// according to the compression type.
//
// When an SST file has values of all the same length, the length is indicated
What's the benefit of this special case for length encoding?
The benefit from this old design was that you didn't need to encode the length of every key if the keys were all fixed length, as with the index-of-index files. The plan is to move to a zng representation, and then this won't matter.
cmd/zar/index/command.go
	Short: "creates index files for bzng files",
	Long: `
zar find descends the directory argument looking for bzng files and creates an index
file for IP addresses for each bzng file encountered. An index is writted to
- file for IP addresses for each bzng file encountered. An index is writted to
+ file for IP addresses for each bzng file encountered. An index is written to
	}
	if table.Size() == 0 {
		//XXX
		return errors.New("nothing to index")
Does this mean there were no IP addresses? If so, why error here, as opposed to creating an "empty" sst?
Yes, that would be a better approach, but the current logic doesn't handle the empty table. I'd prefer to handle this when we switch over to zng soon.
archive/finder.go
	hit, err := SearchFile(path, pattern)
	if err != nil {
		fmt.Printf("%s\n", err)
		nerr++
Even in this prototype, why skip over any errors?
It's not skipping errors. It was bailing after too many errors. I'll take it out.
}

func (t *TypeIndexer) value(body zcode.Bytes) {
	t.Table.Enter(string(body), nil)
So the index simply says whether or not a value exists in its attached file. Isn't this the same as a perfectly accurate bloom filter, minus the trade of accuracy for space? Why not just use a bloom filter? What's the benefit?
It feels like if you're going with the approach of creating indexes, it would make sense to produce a single index for multiple files, with the value of the index pointing to "hits". Otherwise you incur the space cost of redundant key storage, something I imagine you would want to reduce.
For low cardinality value sets, the overhead of redundant storage is far smaller than the size of the source data and the robustness and simplicity of having a separate index per archive file is very powerful. For high cardinality value sets, the amount of overlap between files is usually going to be small (think of uids or file hashes) so the savings you would get from aggregation is small. If it's not small, in worst case you are doubling the size of the data, but I think that will be rare in practice. Bloom filters can definitely be useful here and we can get to that soon.
Another consideration regarding bloom filters is that the UX for archive search involves returning the set of matched files, so bloom-filter false positives could be counterintuitive.
One advantage of separate indices is time-based retention: if you want to delete old data on a periodic basis, it's easy to drop "old data" by removing the index and zng for that old range, and not touch the rest of the data set.
Should the commands have test coverage so we know they continue to work?
Totally agree. I'm going to merge this PR now as a WIP since it doesn't impact zq/zqd at all, and we can then port SST to zng, refine the command syntax, and add tests as you suggest in subsequent PRs.