Crude prototype for zng archive (zar) #482

mccanne · 2020-03-30T23:18:44Z

This commit adds a zar command for creating and searching
bzng index files. The index files are organized as sorted
string tables and use the sst package.

The current prototype is very early and currently only
indexes IP addresses as a hard-wired configuration. The code,
however, is set up to support more flexible indexing soon.

The sst package uses a simple binary format for framing
keys and values as arbitrary byte sequences, but we realized
that storing the sst files as bzng could be a very powerful
foundation. We will implement this change in a future PR.


alfred-landrum

From your comments, I saw there are 2 related changes that might be helpful soon: key-only sst support, and a real Visitor for zng.Records. Are there any other small to medium tasks that we could create & assign?

alfred-landrum · 2020-04-02T14:27:42Z

pkg/sst/sst.go

+// representing the key's string or the value's byte slice.  The body is compressed
+// according to the compression type.
+//
+// When an SST file has values of all the same length, the length is indicated


What's the benefit of this special case for length encoding?

mccanne · 2020-04-03

The benefit of this old design was that you didn't need to encode the length of every key if the keys were all fixed length, as with the index-of-index files. The plan is to move to a zng representation, at which point this won't matter.
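
To make the space trade-off in this reply concrete, the sketch below compares two framings for a set of keys: a length prefix on every key versus a single shared length written once. This is an assumption-laden illustration, not the sst package's actual encoding; `encodeVariable` and `encodeFixed` are hypothetical names, and uvarint prefixes are just one plausible choice.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// encodeVariable frames each key with its own uvarint length prefix.
func encodeVariable(keys [][]byte) []byte {
	var out []byte
	var tmp [binary.MaxVarintLen64]byte
	for _, k := range keys {
		n := binary.PutUvarint(tmp[:], uint64(len(k)))
		out = append(out, tmp[:n]...)
		out = append(out, k...)
	}
	return out
}

// encodeFixed writes the shared key length once, followed by the raw keys.
// It returns an error if the keys are not all the same length.
func encodeFixed(keys [][]byte) ([]byte, error) {
	if len(keys) == 0 {
		return nil, errors.New("no keys to encode")
	}
	size := len(keys[0])
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], uint64(size))
	out := append([]byte(nil), tmp[:n]...)
	for _, k := range keys {
		if len(k) != size {
			return nil, errors.New("keys are not all the same length")
		}
		out = append(out, k...)
	}
	return out, nil
}

func main() {
	// Three 4-byte keys, e.g. IPv4 addresses in binary form.
	keys := [][]byte{{10, 0, 0, 1}, {10, 0, 0, 2}, {10, 0, 0, 3}}
	fixed, err := encodeFixed(keys)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(encodeVariable(keys)), len(fixed)) // prints: 15 13
}
```

The saving is one prefix per key, so it only matters when keys are short and numerous.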

alfred-landrum · 2020-04-02T14:39:40Z

cmd/zar/index/command.go

+	Short: "creates index files for bzng files",
+	Long: `
+zar find descends the directory argument looking for bzng files and creates an index
+file for IP addresses for each bzng file encountered.  An index is writted to


Suggested change

file for IP addresses for each bzng file encountered. An index is writted to

file for IP addresses for each bzng file encountered. An index is written to

alfred-landrum · 2020-04-02T14:44:31Z

archive/indexer.go

+	}
+	if table.Size() == 0 {
+		//XXX
+		return errors.New("nothing to index")


Does this mean there were no ip addresses? If so, why error here, as opposed to creating an "empty" sst?

mccanne · 2020-04-03

Yes, this would be a better approach, but the current logic doesn't handle the empty table. I would prefer to handle this later, when we switch over to zng soon.

alfred-landrum · 2020-04-02T15:01:07Z

archive/finder.go

+			hit, err := SearchFile(path, pattern)
+			if err != nil {
+				fmt.Printf("%s\n", err)
+				nerr++


Even in this prototype, why skip over any errors?

mccanne · 2020-04-03

It's not skipping errors; it was bailing after too many errors. I'll take it out.

mattnibs · 2020-04-02T19:25:00Z

archive/type.go

+}
+
+func (t *TypeIndexer) value(body zcode.Bytes) {
+	t.Table.Enter(string(body), nil)


So the index simply says whether or not a value exists to its attached file. Isn't this the same as a perfectly accurate bloom filter that trades accuracy of results for space? Why not just use a bloom filter? What's the benefit?

It feels like if you are going with the approach of creating indexes, then it would make sense to produce a single index for multiple files, with the value of the index pointing to "hits". Otherwise you incur the cost of space for redundant key storage, something I imagine you would want to reduce.

mccanne · 2020-04-03

For low cardinality value sets, the overhead of redundant storage is far smaller than the size of the source data, and the robustness and simplicity of having a separate index per archive file is very powerful. For high cardinality value sets, the amount of overlap between files is usually going to be small (think of uids or file hashes), so the savings you would get from aggregation are small. If the overlap is not small, in the worst case you are doubling the size of the data, but I think that will be rare in practice. Bloom filters can definitely be useful here and we can get to that soon.

Another consideration regarding bloom filters is that the UX for archive search involves returning the set of matched files, so bloom-filter false positives could be counterintuitive.
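
The false-positive concern can be seen in a toy comparison between an exact key set and a small bloom filter. This is only a sketch: the `Bloom` type, its sizing, and the double-hashing scheme over FNV-1a are assumptions chosen for brevity, and nothing here comes from the PR's code.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Bloom is a toy bloom filter with m bits and k probe positions per key.
type Bloom struct {
	bits []bool
	k    int
}

// NewBloom returns a filter with m bits (m must be > 0) and k probes.
func NewBloom(m, k int) *Bloom {
	return &Bloom{bits: make([]bool, m), k: k}
}

// positions derives k bit positions from one 64-bit FNV-1a hash
// by combining its two 32-bit halves (double hashing).
func (b *Bloom) positions(key string) []int {
	h := fnv.New64a()
	h.Write([]byte(key))
	sum := h.Sum64()
	h1, h2 := uint64(uint32(sum)), uint64(uint32(sum>>32))
	pos := make([]int, b.k)
	for i := range pos {
		pos[i] = int((h1 + uint64(i)*h2) % uint64(len(b.bits)))
	}
	return pos
}

// Add sets the bits for key.
func (b *Bloom) Add(key string) {
	for _, p := range b.positions(key) {
		b.bits[p] = true
	}
}

// Contains reports whether key may have been added. It never returns
// false for an added key, but can return true for a key never added.
func (b *Bloom) Contains(key string) bool {
	for _, p := range b.positions(key) {
		if !b.bits[p] {
			return false
		}
	}
	return true
}

func main() {
	exact := map[string]struct{}{}
	bloom := NewBloom(64, 3)
	for _, ip := range []string{"10.0.0.1", "10.0.0.2"} {
		exact[ip] = struct{}{}
		bloom.Add(ip)
	}
	_, inExact := exact["10.9.9.9"]
	fmt.Println("exact:", inExact)                    // always false for a key never added
	fmt.Println("bloom:", bloom.Contains("10.9.9.9")) // usually false, but may be true
}
```

With an exact key set, every file returned by a search really contains the value; with a filter, some returned files might not, which is the counterintuitive case mentioned above.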

alfred-landrum · 2020-04-03

One advantage of separate indices is time-based retention: if you want to delete old data on a periodic basis, it's easy to drop "old data" by removing the index and zng for that old range, and not touch the rest of the data set.

nwt · 2020-04-06T20:19:46Z

Should the commands have test coverage so we know they continue to work?

mccanne · 2020-04-07T23:07:33Z

Should the commands have test coverage so we know they continue to work?

Totally agree. I'm going to merge this PR now as a WIP since it doesn't impact zq/zqd at all and we can then port SST to zng, refine the command syntax, and add tests as you suggest in subsequent PRs.

mccanne requested a review from a team March 30, 2020 23:18

fix comments

63c9aa6

alfred-landrum reviewed Apr 2, 2020

mattnibs reviewed Apr 2, 2020

address PR feedback

5f0c85c

alfred-landrum approved these changes Apr 6, 2020

alfred-landrum mentioned this pull request Apr 7, 2020

tool to create directory hierarchy of zng files #532

Closed

mccanne merged commit b235e3f into master Apr 7, 2020

mccanne deleted the zar branch April 7, 2020 23:08
