Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup task for obsolete attachments #1648

Closed
ajres opened this issue Mar 10, 2016 · 23 comments
Closed

Cleanup task for obsolete attachments #1648

ajres opened this issue Mar 10, 2016 · 23 comments

Comments

@ajres
Copy link

ajres commented Mar 10, 2016

When a document is deleted or an attachment removed the binary document remains in the underlying CBS bucket.

For apps that create and delete docs with attachments on a regular basis this can have a big impact on storage.

There is already a _vacuum DB endpoint which is meant to purge orphaned attachments but it is currently not implemented.

// Deletes all orphaned CouchDB attachments not used by any revisions.
func VacuumAttachments(bucket base.Bucket) (int, error) {
    return 0, base.HTTPErrorf(http.StatusNotImplemented, "Vacuum is temporarily out of order")
}
@adamcfraser adamcfraser changed the title Implement _vacuum endpoint Cleanup task for obsolete attachments Mar 10, 2016
@adamcfraser
Copy link
Collaborator

Needs discussion on whether this should be implemented via the _vacuum endpoint, or incorporated into compact. Since compact isn't doing much work currently (the obsolete revision bodies it targets are handled by auto-expiry), it might make more sense to incorporate it there. That way users can use compact as a one-stop API for 'clean up data I don't need'. If we really needed to isolate the tasks, we could do that using a parameter on the compact call.

@nbrys
Copy link

nbrys commented Jun 29, 2016

Is there any progress on this issue?

@adamcfraser adamcfraser modified the milestones: 1.4, 1.3 Jun 29, 2016
@adamcfraser
Copy link
Collaborator

This didn't make it into the 1.3 release, but should be available on master as soon as we get 1.3 out the door, and will be included on 1.4.

@adamcfraser
Copy link
Collaborator

Original issue was filed as on #64. The main design issue remaining is handling concurrency issues during GC processing, specifically:

Initial identification of orphaned attachments is relatively straightforward using a view, but there isn't a good solution for the associated concurrency problems when we try to perform the deletes based on the results of that view query:

  1. A new document arrives referencing an attachment that's targeted for deletion while GC is in progress.
  2. An incoming new attachment gets incorrectly flagged for delete by the GC, since it's committed before the document is committed.

In particular this is an issue for a multi-node SG cluster, where it's harder to do a 'stop the world' for attachments across the cluster while the GC is running. We also need to ensure the solution only has a performance impact while GC is running, and has a negligible impact on standard attachment processing (i.e. we want to avoid significant work to lookup the attachment state during a normal write)

@adamcfraser
Copy link
Collaborator

Needs followup to confirm 2.0 CBL behaviour - specifically whether there's a 1-to-1 association from attachment to document. If so, the attachment lifecycle can be modified to follow the document lifecycle (i.e. clean up attachments on document removal).

@tleyden
Copy link
Contributor

tleyden commented Feb 26, 2018

specifically whether there's a 1-to-1 association from attachment to document

@adamcfraser looks like attachments are still shared among docs based on the content-addressable store approach.

To verify this, I used mobile-training-todo to add two docs with the same attachment:

Doc id -lwoJM-z8EjO2tswcm5S9yp

{
  "_attachments": {
    "blob_1": {
      "content_type": "image/jpeg",
      "digest": "sha1-8Lg7sMzIqffCOp5I44ctpTDaaww=",
      "length": 241444,
      "revpos": 1,
      "stub": true
    }
  },
  "_deleted": false,
  "complete": false,
  "createdAt": "2018-02-26T20:37:38.095Z",
  "image": {
    "@type": "blob",
    "content_type": "image/jpeg",
    "digest": "sha1-8Lg7sMzIqffCOp5I44ctpTDaaww=",
    "length": 241444
  },
  "task": "IosTask2",
  "taskList": {
    "id": "user1.E2560109-44E7-4789-8C34-2E744941FB41",
    "owner": "user1"
  },
  "type": "task"
}

Doc id -MCw8q3tzq5EYSdrivfWjy1

{
  "_attachments": {
    "blob_1": {
      "content_type": "image/jpeg",
      "digest": "sha1-8Lg7sMzIqffCOp5I44ctpTDaaww=",
      "length": 241444,
      "revpos": 2,
      "stub": true
    }
  },
  "_deleted": false,
  "complete": false,
  "createdAt": "2018-02-23T00:46:11.689Z",
  "image": {
    "@type": "blob",
    "content_type": "image/jpeg",
    "digest": "sha1-8Lg7sMzIqffCOp5I44ctpTDaaww=",
    "length": 241444
  },
  "task": "IosTask1",
  "taskList": {
    "id": "user1.E2560109-44E7-4789-8C34-2E744941FB41",
    "owner": "user1"
  },
  "type": "task"
}

and they both appear to refer to attachment doc: _sync:att:sha1-8Lg7sMzIqffCOp5I44ctpTDaaww=.

@tleyden
Copy link
Contributor

tleyden commented Feb 26, 2018

There seems to be two overall approaches to this problem:

Approach A: Modify couchbase lite to generate attachment digests to be per-document

Overview

This proposed fix would make each attachment digest unique per document, so that as soon as a document was deleted, all of it's attachments could immediately be deleted. Currently the attachment digests are shared among documents, which can save space. However, it makes them tricky to clean up, since there is currently no way to know whether other docs are referring to an attachment.

Pros

  1. Simple

Cons

  1. Potential data migration work involved.
  2. In certain use cases where a large number of docs point to the same attachment, this could increase the overall attachment storage footprint for the Sync Gateway / Couchbase Server backend.

Approach B: Implement ref-counting scheme

Overview

This proposed fix would continue the content-addressable-store approach to attachments, but add a "reference counter" to each attachment, so that it could safely determine when attachments could be reaped. For example:

  1. Doc1 creates an attachment, which is a picture of cat
  2. The system hasn't seen this attachment data yet, so it assigns a unique digest (Attach1) and adds it to the system
  3. Attach1 is given a refcount of 1, since 1 document is referencing it
  4. Doc2 creates an attachment, which is the same picture of the same cat (the identical file as Attach1)
  5. The system recognizes that this content already exists as Attach1, so does not add duplicated attachment content into the system
  6. Attach1 is given a refcount of 2, since now both Doc2 and Doc1 are referencing it
  7. Doc1 is deleted
  8. Attach1 has it's refcount decremented from 2 -> 1, since now only Doc2 is referencing it
  9. Doc2 is deleted
  10. Attach1 has it's refcount decremented from 1 -> 0, since there aren't any docs remaining that are referencing it
  11. Attach1 is deleted and it's space is reclaimed.

Pros

  1. No increase in storage
  2. Data migration work would most likely be simpler than the per-doc attachment digests approach

Cons

  1. No clear way to implement this since two documents might need to be updated atomically (however, with some more design discussions, this might be solvable)

@tleyden
Copy link
Contributor

tleyden commented Feb 26, 2018

@snej any thoughts on the approaches mentioned in #1648 (comment)?

@djpongh there's not any actionable work until we figure out an overall approach, so I'm going to re-assign to you to drive that forward.

@tleyden tleyden assigned djpongh and unassigned tleyden Feb 26, 2018
@djpongh
Copy link

djpongh commented Apr 3, 2018

Unfortunately, this is too complicated to try and get into our 2.1 release.

@djpongh djpongh modified the milestones: 2.1.0, 2.2.0 Apr 3, 2018
@djpongh djpongh removed the P1: high label Apr 3, 2018
@djpongh djpongh removed their assignment Apr 3, 2018
@djpongh djpongh added backlog and removed ready labels Apr 3, 2018
@geraldapeoples
Copy link

This is insane ... We almost moved to attachments for images when we read about it in the main documentation, it looked like it would simplify and solve issues with storage. It was only when we read the details in the 1.5 rest api with a pointer to this bug that we discovered it doesn't work. Can the original documentation be updated to say that you don't support attachments! You clearly don't if attachments can't actually be deleted.

@djpongh djpongh added icebox and removed backlog labels Apr 16, 2018
@adamcfraser adamcfraser modified the milestones: Iridium, Cobalt Nov 19, 2018
@djpongh djpongh modified the milestones: Cobalt, Mercury Dec 13, 2018
@ArihantRk
Copy link

@djpongh @adamcfraser When can we expect this feature is going to be added/fixed.

@adamcfraser
Copy link
Collaborator

Migrated to CBG-795

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests