
Selective database archiving #1508

Open
wohali opened this issue Aug 7, 2018 · 2 comments

Comments

@wohali
Member

wohali commented Aug 7, 2018

@davisp:

When you say selective, do you mean selectively archiving parts of a database? And is this internal or external? And/or does this mean make it read-only or read-not-at-all and/or I dunno.

@janl:

I think the selective option would be again something like “after 90 days, put docs in archive db (on cheaper storage)”.

@wohali:
This is basically native support for things that we currently have to do using rolling databases or similar. It could go along with the partitioning support that is currently being worked on by @garrensmith and @rnewson and proposed by @mikerhodes.

@mikerhodes
Contributor

I think the partition work doesn't end up being helpful for a couple of reasons:

  1. Setting up partitions in, e.g., a date-based way likely makes for very hot partitions (basically all writes land on one partition), which would be an anti-pattern.
  2. Because partition -> shard is a 1 -> many relationship, you wouldn't get benefits like archiving off whole shards because a shard would likely be hosting a mix of ready-to-archive partitions and not-ready-to-archive partitions.
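To make these two points concrete, here is a rough sketch under stated assumptions. The shard count, the one-partition-per-day key scheme, and the `shard_for` placement function are all hypothetical; in particular `shard_for` is a simple stand-in and not CouchDB's actual placement logic. It only illustrates that date-keyed writes pile onto one partition, and that shards end up holding a mix of old and new partitions.

```python
from collections import defaultdict
from datetime import date, timedelta

N_SHARDS = 8             # hypothetical shard count
ARCHIVE_AFTER_DAYS = 90  # the "after 90 days" figure mentioned earlier in the thread


def partition_for(doc_date: date) -> str:
    """Hypothetical date-based partition key: one partition per day."""
    return doc_date.isoformat()


def shard_for(partition: str) -> int:
    """Stand-in placement function (assumption, not CouchDB's real hashing)."""
    return date.fromisoformat(partition).toordinal() % N_SHARDS


today = date(2018, 8, 7)

# Point 1: every write made today targets the same partition (a hot partition).
todays_partitions = {partition_for(today) for _ in range(1000)}

# Point 2: count archivable vs. live partitions hosted on each shard.
shards = defaultdict(lambda: {"archivable": 0, "live": 0})
for age in range(120):
    partition = partition_for(today - timedelta(days=age))
    state = "archivable" if age >= ARCHIVE_AFTER_DAYS else "live"
    shards[shard_for(partition)][state] += 1

# Shards that hold both kinds cannot be archived off as a whole.
mixed_shards = [s for s, c in shards.items() if c["archivable"] and c["live"]]

print(len(todays_partitions), "partition(s) receive today's writes")
print(len(mixed_shards), "of", N_SHARDS, "shards hold a mix of partitions")
```

With this placement function every shard ends up mixed; a real hash would distribute partitions differently, but nothing in it would group ready-to-archive partitions together either.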

@mikerhodes
Contributor

To be more positive, FWIW, I had some thoughts on this a while ago, and thought you could fold this kind of feature into a set of document lifecycle rules based on Mango selectors. In addition to archiving, you could also have delete as an action for TTL, so I wanted to capture the idea of combining potentially different actions and conditions into rules.

What I thought was something that works like this. There's a set of one or more rules in a ddoc section which define actions and conditions using a selector. The action is archive, delete and so on. The conditions are things like comparing a doc_expiry field in a document with "now":

{
  "_id": "_design/lifecycle-example",
  "_rev": "19-4426420428e8744fcfb67763cedd1ea8",
  "doc-lifecycle-rules": [
    {
      "action": "archive", 
      "condition": [ { "doc_expiry": { "$lt": "$$now"}, "type": "vital_entry"} ]
    },
    {
      "action": "delete", 
      "condition": [ { "doc_expiry": { "$lt": "$$now"}, "type": "spurious_entry"} ]
    }
  ]
}

Here:

  • $$now is a new special variable syntax; there would be a few predefined variables like $$now so that rules can be dynamic.
  • doc-lifecycle-rules contains an array of (action,selector) pairs.
    • Maybe there's an options field which would take e.g., archiving destination.
  • Periodically, say once per minute, the (action,selector) pairs are evaluated against the documents in the database.
    • I'd expect an index would be auto-created from the condition selector to make this cheap, so the selector may need to be heavily restricted such that it can be evaluated from an index alone to identify affected documents.
  • At evaluation time, if more than one (action,selector) pair matches a document, only a single action of the matching actions will be run.
    • It's undefined which action happens, but only one will happen before the conditions are re-evaluated. Also, we don't guarantee an ordering, as that would need to be a total ordering over all (action,selector) pairs in all ddocs in the database.
  • The action is drawn from a fixed set of atoms rather than being freeform text. Broadly, the database supports a few key workloads rather than being arbitrarily extensible.
  • The format allows more actions to gradually be added.
  • We obviously allow for multiple ddocs to have these rules in them.
  • Actions are evaluated asynchronously. We don't guarantee that a document which should have been expired from the database stops appearing in search/view/query requests right away; you can still GET it, etc., until the action has run.
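
The evaluation described in the list above could be sketched roughly as follows. This is an illustration under assumptions, not an existing CouchDB feature or API: the function names are made up, the matcher handles only equality and `$lt` (a tiny subset of Mango), `doc_expiry` is assumed to be an ISO 8601 UTC string so that plain string comparison orders timestamps correctly, and the entries of the `condition` array are assumed to be alternatives (any one matching is enough), which the proposal does not specify.

```python
ALLOWED_ACTIONS = {"archive", "delete"}  # a fixed set of actions, not freeform text


def resolve(value, now):
    """Substitute the proposed $$now variable with the evaluation time."""
    return now if value == "$$now" else value


def matches(doc, selector, now):
    """Match a doc against a selector using only equality and $lt (assumed subset)."""
    for field, expected in selector.items():
        if field not in doc:
            return False
        actual = doc[field]
        if isinstance(expected, dict):
            for op, operand in expected.items():
                if op != "$lt":
                    raise ValueError(f"unsupported operator: {op}")
                if not actual < resolve(operand, now):
                    return False
        elif actual != resolve(expected, now):
            return False
    return True


def pick_action(doc, rules, now):
    """Return at most one action for a doc, or None if no rule matches.

    The proposal leaves the choice among several matching rules undefined;
    taking the first match is just one permissible choice.
    """
    for rule in rules:
        if rule["action"] not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {rule['action']}")
        if any(matches(doc, selector, now) for selector in rule["condition"]):
            return rule["action"]
    return None


# The two rules from the example ddoc above.
rules = [
    {"action": "archive",
     "condition": [{"doc_expiry": {"$lt": "$$now"}, "type": "vital_entry"}]},
    {"action": "delete",
     "condition": [{"doc_expiry": {"$lt": "$$now"}, "type": "spurious_entry"}]},
]

now = "2018-08-07T00:00:00Z"  # hypothetical evaluation time
docs = [
    {"_id": "a", "type": "vital_entry", "doc_expiry": "2018-05-01T00:00:00Z"},
    {"_id": "b", "type": "spurious_entry", "doc_expiry": "2018-05-01T00:00:00Z"},
    {"_id": "c", "type": "vital_entry", "doc_expiry": "2019-01-01T00:00:00Z"},
]

decisions = {doc["_id"]: pick_action(doc, rules, now) for doc in docs}
print(decisions)
```

A real implementation would presumably answer this from an index built from the selector rather than scanning documents, as noted in the list; the loop here only shows the matching semantics.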

@wohali wohali moved this from Proposed for 3.x to Proposed (release independent) in Roadmap Jul 11, 2019