Better support for key-only entries #738
I had a similar question. Given that the value log is separate, is there any chance of providing a secondary index to the same value? I'm assuming that is very difficult with the current design, or else it would already be done, but I thought I'd ask.
It happens I was thinking about the same problem of secondary indexes.
There are two different questions in this thread that deserve a good answer.

Can we improve the disk usage for key-only sets? Currently, sets with no values still require a considerable amount of value storage.

How do we provide secondary indexes on Badger data? It seems that a common workaround is to create a new key set where each key is composed of a prefix corresponding to the value on which we're indexing, followed by the key of the other dataset. So if we have a dataset for which we are storing some Person values, we can add index entries whose keys embed the indexed field (say, age) followed by the person's primary key.
This allows us to find all the Person values with age 25, or to find those from ages 20 to 30 by using prefix scans. Is this the best we can do? If so, let's document it. If not, let's explain what the better way is. @jarifibrahim or @ashish-goswami, could you help answer whether the idiom above is a best practice or something to avoid?
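The workaround above can be sketched without Badger at all, since it relies only on keys sorting in byte order. In this minimal sketch the key layout "<field>/<value>/<pk>" and the helper scanPrefix are illustrative, not part of Badger's API; a real Badger iterator would do the same walk over its sorted key space.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// scanPrefix mimics a prefix scan over a sorted key space: it returns the
// primary keys of every index entry whose key starts with prefix.
// The "<field>/<value>/<pk>" layout is illustrative, not a Badger API.
func scanPrefix(keys []string, prefix string) []string {
	sort.Strings(keys) // Badger's iterator yields keys in byte order
	var pks []string
	for _, k := range keys {
		if strings.HasPrefix(k, prefix) {
			pks = append(pks, strings.TrimPrefix(k, prefix))
		}
	}
	return pks
}

func main() {
	index := []string{"age/25/bob", "age/20/alice", "age/31/dave", "age/25/carol"}
	// Everyone aged 25, found purely by key prefix; no values needed.
	fmt.Println(scanPrefix(index, "age/25/")) // [bob carol]
}
```

Because the value is folded into the key, the index entries themselves can have empty values, which is exactly why the disk usage of key-only sets matters here.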
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I think the second option is what most use, because of the (assumed, at least) design of Badger and the original paper. Thus, key-only searches are pretty common and they should perform a lot better. In Jaeger, we can write dozens of keys per single value. Index seeks are usually done against multiple keys and then intersected to find the one value to be fetched. Thus, optimizing key-only scenarios would definitely benefit our use case.
I had a cursory look, but I don't fully understand this issue. Is the question: if my key-value pair has a nil value, why does it get written to the value log? The answer is: the value log acts as a write-ahead log. So, everything is first written there, synced to disk, and then makes its way to the LSM tree. One can run value log GC to reclaim the space. If we don't write it to the value log first, there's a chance of data loss. For iteration, Badger already provides key-only iteration. So, I'm not sure what's being requested here.
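The key-only iteration mentioned here is enabled through Badger's IteratorOptions. A minimal sketch, assuming an already-open *badger.DB (it needs a live database and the badger dependency to actually run, so error handling and setup are trimmed):

```go
package main

import badger "github.com/dgraph-io/badger"

// keys returns every key in db without touching the value log:
// with PrefetchValues disabled, iteration is served from the LSM tree alone.
func keys(db *badger.DB) ([][]byte, error) {
	var out [][]byte
	err := db.View(func(txn *badger.Txn) error {
		opts := badger.DefaultIteratorOptions
		opts.PrefetchValues = false // key-only iteration
		it := txn.NewIterator(opts)
		defer it.Close()
		for it.Rewind(); it.Valid(); it.Next() {
			// KeyCopy, because the slice returned by Key() is only
			// valid until the next call to Next().
			out = append(out, it.Item().KeyCopy(nil))
		}
		return nil
	})
	return out, err
}
```

Note this addresses read performance only; as the comment above explains, the entries are still written to the value log first because it doubles as the WAL.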
The way I understand this issue (@alecthomas, please feel free to correct me) is that even when Badger is not storing values, the value logs can get quite large. I'd like to fix two independent things:
This issue was marked as stale and no activity has occurred since then, therefore it will now be closed. Please reopen if the issue is still relevant.
The purpose of this issue is still unclear to me. Value logs get created because they act as a write-ahead log, but they should get GCed easily if the values are below options.ValueThreshold.
I no longer use Badger, but IIRC the file listing at the top of this issue was after a compaction. I filed the issue because I would expect the vlogs to be completely empty, but they weren't. @campoy's summary is accurate. Feel free to close this issue.
We have an issue (#1068) that would provide a separate WAL for values smaller than the threshold. That should also help here.
Hello,
First off, I'm really enjoying Badger, thank you.
I've been implementing secondary indexes by using keys prefixed with the secondary index value (hash, prefix, whatever), followed by the PK, e.g. if indexing a numeric field the key for the secondary index might be <big-endian-int><pk>. This works well, but I don't need values at all. However, Badger still creates quite a lot of data in the vlogs. For example: the primary value database has 7124126 keys in it, and this secondary index database has one entry for every PK.
So my question is, is it possible to provide truly key-only entries?
Also, is this how people typically create secondary indexes in Badger? Is there a better way? I did consider putting the <pk> as the value of the index, but given that the data is very small and a value would incur at least some overhead, this approach seemed more efficient.

Regards,
Alec