Better support for key-only entries #738

alecthomas · 2019-03-09T20:03:16Z

Hello,

First off, I'm really enjoying Badger, thank you.

I've been implementing secondary indexes by using keys prefixed with the secondary index value (hash, prefix, whatever), followed by the PK. E.g., if indexing a numeric field, the key for the secondary index might be <big-endian-int><pk>. This works well, and I don't need values at all. However, Badger still creates quite a lot of data in the vlogs. For example:

drwxr-x---   8 aat  staff   256B  9 Mar 22:56 logrythm.db/$.level.idx
-rw-r-----   1 aat  staff    44M  9 Mar 22:29 logrythm.db/$.level.idx/000005.vlog
-rw-r-----   1 aat  staff    23M  9 Mar 22:56 logrythm.db/$.level.idx/000006.vlog
-rw-r-----   1 aat  staff    69M  9 Mar 22:56 logrythm.db/$.level.idx/000022.sst
-rw-r-----   1 aat  staff    69M  9 Mar 22:56 logrythm.db/$.level.idx/000023.sst
-rw-r-----   1 aat  staff    28M  9 Mar 22:56 logrythm.db/$.level.idx/000024.sst
-rw-r-----   1 aat  staff   1.2K  9 Mar 22:56 logrythm.db/$.level.idx/MANIFEST

The primary value database has 7124126 keys in it. This secondary index database has one entry for every PK.

So my question is, is it possible to provide truly key-only entries?

Also, is this how people typically create secondary indexes in Badger? Is there a better way? I did consider putting the <pk> as the value of the index, but given the data is very small and that would incur at least some overhead, this approach seemed more efficient.
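For readers following along, the <big-endian-int><pk> layout described above could be sketched as follows. This is a minimal sketch using only the Go standard library; the helper names `indexKey` and `splitIndexKey` are hypothetical and are not part of Badger. It assumes the indexed field is an unsigned 64-bit integer, for which big-endian encoding makes byte-wise key order match numeric order (signed values would need extra handling).

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// indexKey builds a secondary-index key: an 8-byte big-endian field value
// followed by the primary key. (Hypothetical helper, not a Badger API.)
func indexKey(field uint64, pk []byte) []byte {
	key := make([]byte, 8, 8+len(pk))
	binary.BigEndian.PutUint64(key, field)
	return append(key, pk...)
}

// splitIndexKey recovers the field value and the primary key from an index key.
func splitIndexKey(key []byte) (uint64, []byte, error) {
	if len(key) < 8 {
		return 0, nil, fmt.Errorf("index key too short: %d bytes", len(key))
	}
	return binary.BigEndian.Uint64(key[:8]), key[8:], nil
}

func main() {
	k := indexKey(25, []byte("001"))
	field, pk, err := splitIndexKey(k)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%x -> field=%d pk=%s\n", k, field, pk)
}
```

Such a key would be stored with an empty value, so the index entry carries all of its information in the key itself.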

Regards, Alec


matthoffman · 2019-03-26T18:19:21Z

I had a similar question. Given that the value log is separate, is there any chance of providing a secondary index to the same value? I'm assuming that is very difficult with the current design, or else it would already be done, but I thought I'd ask.

yangzh · 2019-04-03T22:38:07Z

As it happens, I have been thinking about the same secondary-index problem.
My thinking:

Index keys (index value followed by the PK) with empty values seem to be the best option: there is no additional value storage, and you can perform another hop (following the PK) to retrieve the value if needed. This hop (mostly in memory) is presumably cheap. I guess the value log management will be left unchanged.
The memory cost simply doubles (for key storage), which is still OK and expected: I cannot think of anything much better.
In addition, you can precompute some values (at index-building time) and put them into the value slot; sometimes this can be useful.
Index keys with a pointer to the value: this seems nice, but the value itself cannot be compacted until both the primary entry and all the secondary index entries have been deleted. This can complicate the logic and doesn't seem to be worth it.
We cannot use the index value as the key and the PK as the value, since there is no guarantee that the index value will be unique (in the general case): this is out of the question.
How does a secondary index interact with multiple versions? I guess the secondary index entries need to have the same version, and that seems to work OK, but I haven't thought much about corner cases yet.
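The uniqueness point can be illustrated with a small sketch. It uses plain Go maps as a stand-in for a key-value store (no Badger calls), and the `person` type, the sample records, and both builder functions are hypothetical. When the index value alone is the key, records that share the value overwrite each other; when the PK is appended to the key, every record keeps its own entry.

```go
package main

import "fmt"

// person is a hypothetical record with a primary key and an indexed field.
type person struct {
	pk  string
	age int
}

// buildValueLayout uses the index value as the key and the PK as the value.
// Records sharing an age overwrite each other, so entries are lost.
func buildValueLayout(people []person) map[string]string {
	idx := make(map[string]string)
	for _, p := range people {
		idx[fmt.Sprintf("%03d", p.age)] = p.pk
	}
	return idx
}

// buildKeyLayout puts both the index value and the PK into the key and
// stores an empty value, so every record gets a distinct entry.
func buildKeyLayout(people []person) map[string]struct{} {
	idx := make(map[string]struct{})
	for _, p := range people {
		idx[fmt.Sprintf("%03d.%s", p.age, p.pk)] = struct{}{}
	}
	return idx
}

func main() {
	people := []person{{"001", 25}, {"002", 25}, {"003", 20}}
	fmt.Println("index value as key:", len(buildValueLayout(people)), "entries")
	fmt.Println("index value + PK as key:", len(buildKeyLayout(people)), "entries")
}
```

With three records, two of which share an age, the first layout keeps only two entries while the second keeps all three.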

campoy · 2019-05-28T21:55:46Z

There are two different questions in this thread that deserve a good answer.

Can we improve the disk usage for key only sets?

Currently, sets with no values still require a considerable amount of value storage.
This is purely a performance improvement, now tracked by #832.

How do we provide secondary indexes on badger data?

It seems that a common workaround is to create a new key set where each key is composed of a prefix corresponding to the value on which we're indexing, followed by the key of the other dataset.

So if we have a dataset storing Person objects that we want to index by the Age field, we would have two different badger datasets:

The main dataset with keys pointing to Person values:

Key	Value
001	{Name: "Alice", Age: 25}
002	{Name: "Bob", Age: 25}
003	{Name: "Charlie", Age: 20}

The second dataset, for which the keys have each person's age as the prefix followed by the person's key:

Key	Value
025.001	Ø
025.002	Ø
020.003	Ø

This allows us to find all the Person values with age 25, or to find those from ages 20 to 30, by using prefix scans.
This seems to be a pretty common strategy, similar to what one would expect with BigTable keys.
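The lookups described above can be sketched over a sorted list of index keys. This is only an illustration with the Go standard library: a sorted string slice stands in for the ordered key space, the functions `scanPrefix` and `scanRange` are hypothetical, and ages are assumed to be zero-padded to three digits as in the table (so string order matches numeric order).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// scanPrefix returns the primary keys of all index entries whose key starts
// with prefix. keys must be sorted in ascending order.
func scanPrefix(keys []string, prefix string) []string {
	var pks []string
	for i := sort.SearchStrings(keys, prefix); i < len(keys) && strings.HasPrefix(keys[i], prefix); i++ {
		pks = append(pks, keys[i][len(prefix):])
	}
	return pks
}

// scanRange returns the primary keys of entries with lo <= age <= hi.
// It assumes keys look like "AAA.PK" with a three-digit, zero-padded age.
func scanRange(keys []string, lo, hi int) []string {
	start := fmt.Sprintf("%03d.", lo)
	end := fmt.Sprintf("%03d", hi)
	var pks []string
	for i := sort.SearchStrings(keys, start); i < len(keys); i++ {
		age, pk, ok := strings.Cut(keys[i], ".")
		if !ok || age > end {
			break
		}
		pks = append(pks, pk)
	}
	return pks
}

func main() {
	keys := []string{"025.001", "025.002", "020.003"}
	sort.Strings(keys) // an LSM tree keeps keys ordered; here we sort explicitly

	fmt.Println("age 25:", scanPrefix(keys, "025."))
	fmt.Println("ages 20-30:", scanRange(keys, 20, 30))
}
```

Each returned primary key would then be looked up in the main dataset to fetch the Person value, which is the extra hop mentioned earlier in the thread.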

Is this the best we can do? If so, let's document it. If not, let's explain what the better way is.

@jarifibrahim or @ashish-goswami, could you please help answer whether the idiom above is a best practice or something to avoid?

stale · 2019-06-27T22:16:39Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

burmanm · 2019-07-23T15:24:51Z

I think the second option is what most use, because of the (assumed at least) design of Badger and the original paper. Thus, key only searches are pretty common and they should perform a lot better. In Jaeger, we can write dozens of keys per single value. Index seeks are usually done against multiple keys and then intersected to find the one value to be fetched.
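The "seek against multiple keys, then intersect" step can be sketched as below. This is not Jaeger's actual implementation; it is a generic sorted-list intersection in plain Go, and the function names and sample index results are hypothetical. It assumes each index lookup yields primary keys in ascending order without duplicates.

```go
package main

import "fmt"

// intersectSorted returns the primary keys present in both a and b.
// Both inputs must be sorted ascending and free of duplicates.
func intersectSorted(a, b []string) []string {
	var out []string
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// intersectAll folds intersectSorted over the results of several index seeks.
func intersectAll(lists ...[]string) []string {
	if len(lists) == 0 {
		return nil
	}
	result := lists[0]
	for _, l := range lists[1:] {
		result = intersectSorted(result, l)
	}
	return result
}

func main() {
	// Hypothetical results of three key-only index scans.
	byService := []string{"001", "002", "005", "009"}
	byOperation := []string{"002", "005", "007"}
	byTag := []string{"005", "009"}

	fmt.Println(intersectAll(byService, byOperation, byTag))
}
```

Only the primary keys that survive the intersection need a value fetch, which is why the cost of the key-only scans dominates in this pattern.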

Thus, optimization of key-only scenarios would definitely benefit our use-case.

manishrjain · 2019-07-23T15:31:38Z

I had a cursory look, but I don't fully understand this issue. Is the question: if my key-value pair has a nil value, why does it get written to the value log? The answer is that the value log acts as a write-ahead log. So, everything is first written there, synced to disk, and then makes its way to the LSM tree. One can run value log GC to reclaim the space. If we don't write it to the value log first, there's a chance of data loss.

For iteration, Badger already provides key-only iteration. So, I'm not sure what's being requested here.

campoy · 2019-08-05T23:44:39Z

The way I understand this issue, please @alecthomas feel free to correct me, is that even when badger is not storing values the value logs can get quite large.

I'd like to fix two independent things:

answer the question: do we need so much data in vlogs for a key only dataset? it seems the answer is yes
answer the question: is having a second database the best way to create indices on an existing badger DB? if so, we should document this somewhere as it's a useful pattern that others might want to be aware of

stale · 2019-09-05T00:00:52Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale · 2019-09-12T00:19:54Z

This issue was marked as stale and no activity has occurred since then, therefore it will now be closed. Please, reopen if the issue is still relevant.

manishrjain · 2019-09-17T22:39:47Z

The purpose of this issue is still unclear to me. Value logs get created, because they are WAL. But, they should get GCed easily if the values are below options.ValueThreshold.

alecthomas · 2019-09-18T00:25:13Z

I no longer use Badger, but IIRC the file listing at the top of this issue was after a compaction. I filed the issue because I would expect the vlogs to be completely empty, but they weren't. @campoy 's summary is accurate.

Feel free to close this issue.

manishrjain · 2019-10-04T18:46:36Z

We have an issue which would provide a separate WAL for values which are smaller than the threshold. That should also help here: #1068.

jarifibrahim added the kind/enhancement Something could be better. label May 2, 2019

campoy added this to the Unplanned milestone May 28, 2019

campoy mentioned this issue May 28, 2019

Improve disk usage for key-only datasets #832

Closed

campoy assigned jarifibrahim and ashish-goswami May 28, 2019

campoy added area/documentation Documentation related issues. kind/question Something requiring a response and removed area/performance Performance related issues. kind/feature Something completely new we should consider. labels May 28, 2019

stale bot added the status/stale The issue hasn't had activity for a while and it's marked for closing. label Jun 27, 2019

campoy removed the status/stale The issue hasn't had activity for a while and it's marked for closing. label Jun 28, 2019

stale bot added the status/stale The issue hasn't had activity for a while and it's marked for closing. label Sep 5, 2019

stale bot closed this as completed Sep 12, 2019

campoy reopened this Sep 17, 2019

stale bot removed the status/stale The issue hasn't had activity for a while and it's marked for closing. label Sep 17, 2019

campoy added the skip/stale Skip stalebot label Sep 17, 2019

manishrjain closed this as completed Oct 4, 2019

synzhu mentioned this issue Oct 6, 2021

[State sync engine] Implement block stores onflow/flow-go#1450

Closed

synzhu mentioned this issue Mar 25, 2022

Update badger secondary indexes to use keys with empty values onflow/flow-go#2208

Open
