Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crude prototype for zng archive (zar) #482

Merged
merged 3 commits into from
Apr 7, 2020
Merged

Crude prototype for zng archive (zar) #482

merged 3 commits into from
Apr 7, 2020

Conversation

mccanne
Copy link
Collaborator

@mccanne mccanne commented Mar 30, 2020

This commit adds a zar command for createing and searching
bzng index files. The index files are organized as sorted
string tables and use the sst package.

The current protoype is very early and and currently only
indexes IP addresses as a hard-wired configuration. The code
however is set up to do more flexible indexing soon.

The sst package uses a simple binary format for framing
keys and values as arbitrary byte sequences but we realized
it could be a very powerful foundation to store the sst
files as bzng. We will implement this change in a future PR.

This commit adds a zar command for createing and searching
bzng index files.  The index files are organized as sorted
string tables and use the sst package.

The current protoype is very early and and currently only
indexes IP addresses as a hard-wired configuration.  The code
however is set up to do more flexible indexing soon.

The sst package uses a simple binary format for framing
keys and values as arbitrary byte sequences but we realized
it could be a very powerful foundation to store the sst
files as bzng.  We will implement this change in a future PR.
@mccanne mccanne requested a review from a team March 30, 2020 23:18
Copy link
Contributor

@alfred-landrum alfred-landrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From your comments, I saw there are 2 related changes that might be helpful soon: key-only sst support, and a real Visitor for zng.Records. Are there any other small to medium tasks that we could create & assign?

// representing the key's string or the value's byte slice. The body is compressed
// according to the compression type.
//
// When an SST file has values of all the same length, the length is indicated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the benefit of this special case for length encoding?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benefit from this old design was that you didn't need to encode the length of every key if you the keys were all fixed length as with the index-of-index files. The plan is to move to zng representation then this won't matter.

Short: "creates index files for bzng files",
Long: `
zar find descends the directory argument looking for bzng files and creates an index
file for IP addresses for each bzng file encountered. An index is writted to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
file for IP addresses for each bzng file encountered. An index is writted to
file for IP addresses for each bzng file encountered. An index is written to

}
if table.Size() == 0 {
//XXX
return errors.New("nothing to index")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean there were no ip addresses? If so, why error here, as opposed to creating an "empty" sst?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this would be a better approach but the current logic doesn't handle the empty table. I would prefer to handle this later when we soon switch over to zng.

hit, err := SearchFile(path, pattern)
if err != nil {
fmt.Printf("%s\n", err)
nerr++
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even in this prototype, why skip over any errors?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not skipping errors. It was bailing after too many errors. I'll take it out.

}

func (t *TypeIndexer) value(body zcode.Bytes) {
t.Table.Enter(string(body), nil)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the index simply says whether or not a value exists to its attached file. Isn't this the same as a perfectly accurate bloom filter that trades accuracy of results for space? Why not just use a bloom filter? What's the benefit?

It feels like if you are going the approach of creating indexes then it would make sense to produce a single index for a multiple files and the value of the index points to "hits". Otherwise you incur the cost of space for redundant key storage- something I image you would want to reduce.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For low cardinality value sets, the overhead of redundant storage is far smaller than the size of the source data and the robustness and simplicity of having a separate index per archive file is very powerful. For high cardinality value sets, the amount of overlap between files is usually going to be small (think of uids or file hashes) so the savings you would get from aggregation is small. If it's not small, in worst case you are doubling the size of the data, but I think that will be rare in practice. Bloom filters can definitely be useful here and we can get to that soon.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another consideration here regarding bloom filters is the ux for archive search involves returning the set of matched files so bloom-filter false positives could be counterintuitive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One advantage of separate indices is time-based retention: if you want to delete old data on a periodic basis, it's easy to drop "old data" by removing the index and zng for that old range, and not touch the rest of the data set.

@nwt
Copy link
Member

nwt commented Apr 6, 2020

Should the commands have test coverage so we know they continue to work?

@mccanne
Copy link
Collaborator Author

mccanne commented Apr 7, 2020

Should the commands have test coverage so we know they continue to work?

Totally agree. I'm going to merge this PR now as a WIP since it doesn't impact zq/zqd at all and we can then port SST to zng, refine the command syntax, and add tests as you suggest in subsequent PRs.

@mccanne mccanne merged commit b235e3f into master Apr 7, 2020
@mccanne mccanne deleted the zar branch April 7, 2020 23:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants