Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

putFileArchival #248

Merged
merged 17 commits into from Aug 5, 2019

Conversation

@zone117x
Copy link
Member

commented Jun 28, 2019

Goal of PR

#240, #220

Implement the putFileArchival scope and historical file renaming scheme, enforced in putFile and deleteFile.

Implementation

When the scope is specified, file writes are not permitted to overwrite existing files. A file rename is performed using a historical naming scheme:
{address}/foo/bar/photo.png ->
{address}/foo/bar/.history.{timestamp}.{guid}.photo.png

Note: this is using the popular and fast nanoid lib to generate a 10 char mixed-case-alphanumeric string rather than an actual uuid/guid.

File deletes simply perform the same rename and nothing more. This results in the original file 404ing on request like a regular delete.

Discussion

This historical naming scheme differs from the current collections proposal in that it uses unix timestamps (milliseconds since epoch) rather than a contiguous incrementing version integer.

This has the benefit of not depending on an index file and the additional read->write overhead, and does not require additional locks or expensive index file conflict checking & resolution.

However, this does present the usual edge-cases with using timestamps:

  • Severe server clock drift could result in incorrect file ordering. However, future file history restoration tooling could easily get around this by using the fileStat operation which returns the cloud driver provided lastModifiedDate.
  • Rapid repeated file writes to the same endpoint in conjunction with severe server clock drift could result in identical millisecond timestamps and result in historical file overwriting. Not sure if this is possible in practice because..
    1. Operations on an individual file are now serialized -- concurrent requests are rejected. So a second request would have to arrive exactly at the completion of the last request at the same moment that a clock glitch triggered a backward time drift.
    2. A reachable Gaia hub necessitates an internet connection which means NTP should prevent severe clock drift?

If a timestamp cannot be used then I have a couple alternative proposals that also avoid bucket-wide index file race conditions and conflict resolution:

Per-file version tracking

Use a per-file historical version tracker rather than a bucket-wide index file. Something like:

myphoto1.jpg <----- Always the latest version
.history.status.myphoto1.txt <---- File containing the latest version integer
.history.1003.myphoto1.jpg <----- Previous version
.history.1002.myphoto1.jpg
.history.1001.myphoto1.jpg

The storage overhead of an additional small file seems equal to or better than the storage overhead of an index file containing a map of file paths and version integers.

Rename with GUID

Use a randomly generated ID rather than an incrementing integer or timestamp. Future file history restoration tooling could use the cloud driver provided lastModifiedDate that is now available in the fileStat operation. This field has integration testing showing accuracy and behavioral consistency across all currently supported cloud storage providers.

Thoughts?

Manual Testing

Requires crafting a putFileArchival scoped token manually since it's not yet supported in the Browser or blockstack.js.

Automated Testing

Full test coverage of new logic.

Submission Checklist

  • Based on correct branch: feature submissions should be on develop, hotfixes should be on master

  • The code passes our eslint definitions, unit tests, and
    contains correct TypeFlow annotations.

  • Submission contains tests that cover any and all new functionality or code changes.

  • Submission documents any new features or endpoints, and describes how developers
    would be expected to interact with them.

  • Author has agreed to our contributor's agreement.

zone117x added 4 commits Jun 27, 2019
…s and deletes are performed with a historical file rename.
@zone117x zone117x requested review from jcnelson, hstove, kantai and yknl Jun 28, 2019
@codecov

This comment has been minimized.

Copy link

commented Jun 28, 2019

Codecov Report

Merging #248 into develop will increase coverage by 0.6%.
The diff coverage is 95.09%.

Impacted file tree graph

@@            Coverage Diff             @@
##           develop     #248     +/-   ##
==========================================
+ Coverage    80.08%   80.68%   +0.6%     
==========================================
  Files           21       21             
  Lines         1486     1553     +67     
  Branches       264      271      +7     
==========================================
+ Hits          1190     1253     +63     
- Misses         221      225      +4     
  Partials        75       75
Impacted Files Coverage Δ
hub/src/server/utils.ts 92.72% <100%> (+0.72%) ⬆️
hub/src/server/http.ts 72.65% <100%> (ø) ⬆️
hub/src/server/server.ts 96.26% <93.44%> (-1%) ⬇️
hub/src/server/authentication.ts 90.26% <97.14%> (+0.36%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5d4173f...71c5336. Read the comment docs.

@hstove

This comment has been minimized.

Copy link

commented Jun 28, 2019

I'm in favor of anything removing the read/write overhead. That includes using timestamps and no bucket-wide index file.

@zone117x zone117x referenced this pull request Jul 1, 2019
5 of 5 tasks complete
@jcnelson

This comment has been minimized.

Copy link
Member

commented Jul 3, 2019

This historical naming scheme differs from the current collections proposal in that it uses unix timestamps (milliseconds since epoch) rather than a contiguous incrementing version integer.

Speaking to this topic, you can avoid the need for a dedicated index file while simultaneously avoiding any clock-drift problems by using both a timestamp and a GUID in the .history file name (i.e. your file name pattern would be .history.{timestamp}.{GUID}.{name}). Then, you would treat any writes that occur within the same millisecond as "simultaneous", and simply use their GUIDs' lexical ordering to break ties and put all file names into a total ordering.

@kantai

This comment has been minimized.

Copy link
Member

commented Jul 3, 2019

I'm in favor of this approach paired with a nonce (or a random integer):

.history.{timestamp}.{nonce}.{name}

As unlikely as an overlap would be with a random integer, a local incrementing nonce would eliminate the possibility of overwrite entirely (and could be transient -- it doesn't need to live past a ms). I'm not sure how much that matters though -- if we used a random integer, the possibility of overwrite is very low.

zone117x added 2 commits Jul 23, 2019
* feature/file-mutex:
  Update driver model README with a notice about path mutex.
  Add docs for performRead to driver model definition in README
  Add docs for performRename and performStat to driver model definition in README
  Update blockstack.js dep out of beta version
  Simplify mutx with Set<string>
  Implement decodeTokenForPayload helper function
  Update to blockstack.js beta package to fix typing collisions with bitcoinjs-lib v5
  Add tests for schema validation warnings
  Add additional config param schema descriptions, cleanup sample configs
  Config schema validation that logs console warnings

# Conflicts:
#	hub/src/server/authentication.ts
@zone117x zone117x marked this pull request as ready for review Jul 23, 2019
@zone117x

This comment has been minimized.

Copy link
Member Author

commented Jul 23, 2019

Updated to use a .history.{timestamp}.{GUID}.{name} naming scheme. Moved PR out of draft and ready for review.

Will update PR with more documentation on the changes if/when the implementation is reviewed and accepted.

zone117x added 2 commits Jul 30, 2019
* develop:
  Re-order the mutex registration statement to prevent syncronous reentrant locking issues
  Fix race condition from accidental promise yielding in mutex implementation
  Grammar update on http path mutex
  Use single `topLevelStorage` arg in file rename
@zone117x zone117x changed the base branch from feature/file-mutex to develop Jul 30, 2019
zone117x added 2 commits Jul 30, 2019
* develop:
  Add `list-files` http endpoint documentation to readme
  Update DriverModel in readme
  Minor version bump with changelog entries for recent changes.
  Update with file rename function changes (merge conflict #2)
  Add docs for listFilesStat to driver model definition in README
  Integration tests for listFiles with stat metadata
  Option to return file stat info in listFiles

# Conflicts:
#	hub/src/server/server.ts
const filePathPrefix = filePath.slice(0, filePath.length - fileName.length)

// reject writes to historical files
if (fileName.startsWith('.history.')) {

This comment has been minimized.

Copy link
@jcnelson

jcnelson Jul 30, 2019

Member

I'm not sure why this check is here. The file's name should be able to be anything, since we're going to prefix it with .history.${Date.now()}.${generateUniqueID()}., right?

This comment has been minimized.

Copy link
@zone117x

zone117x Jul 30, 2019

Author Member

There was a concern at one point to prevent apps from deleting or zeroing out historical files, but that should not be possible with current implementation. Attempting to do so would just create a historical file for a historical file.

So, this check just prevents that historical-file-inception. I think this is would be more an accidental thing an app dev might do. For example listFiles, iterate, write to a historical file, say .history.1564526417230.nTXJqe7gTc.myphoto.png, and trigger a historical rename of the historical file: .history.1564526772589.C9H8GzbAlA..history.1564526417230.nTXJqe7gTc.myphoto.png

But doesn't seem like that big of a deal. Should we allow it?

This comment has been minimized.

Copy link
@zone117x

zone117x Jul 30, 2019

Author Member

This also brings up the question -- should the existence of a putFileArchival scope also restrict list-files? For example, not returning entries for any of the historical files.

One oddity of doing this would be that a user's bucket may contain >PAGE_LIMIT historical files, and end up with lots of list-file responses that contain a pagination marker but no file entries.

This comment has been minimized.

Copy link
@jcnelson

jcnelson Jul 30, 2019

Member

So, this check just prevents that historical-file-inception.

I don't think this is a concern. A buggy app is a buggy app.

This also brings up the question -- should the existence of a putFileArchival scope also restrict list-files? For example, not returning entries for any of the historical files.

At the very least, I think list-files should hide the timestamp unless it had putFileArchival permission, since this would allow an app to view the user's activity. But I take your point about there being an inexplicably large number of empty pages. Maybe list-files should censor the name and timestamp of history files so they all say <hidden> and have a timestamp of 0?

This comment has been minimized.

Copy link
@zone117x

zone117x Jul 30, 2019

Author Member

@yknl Any thoughts on this?

This comment has been minimized.

Copy link
@zone117x

zone117x Aug 1, 2019

Author Member

In the collections meeting we decided to not include historical files in list-files with an archival scope -- PR has been updated.

This comment has been minimized.

Copy link
@zone117x

zone117x Aug 2, 2019

Author Member

The blockstack.js listFiles loop will break with empty pages due to historical file filtering https://github.com/blockstack/blockstack.js/blob/76a1fb23ffc7db6ebb371698057da8baa48e8434/src/storage/index.ts#L696

Not sure how to fix this cleanly without also updating blockstack.js. Gaia could detect this situation and return a single 'null' or empty string item in the entries array which would cause the listFile loop to continue, but probably break apps that are not expecting stub entries. Thoughts?

EDIT:
Actually, the Gaia hub could fetch another page when it detects this situation. Going to give that a shot.

This comment has been minimized.

Copy link
@kantai

kantai Aug 2, 2019

Member

I continue to think that the historical files should show up in list-files. What's the downside to this? Do applications use list-files for any functionality?

This comment has been minimized.

Copy link
@jcnelson

jcnelson Aug 2, 2019

Member

I think it's mainly a privacy concern -- an application can see the historic states of a collection file.

This comment has been minimized.

Copy link
@zone117x

zone117x Aug 2, 2019

Author Member

@kantai Yeah some privacy concerns.

Yes apps do use listFiles:
https://github.com/search?utf8=%E2%9C%93&q=%22blockstack%22+%22listfiles%22&type=Code&ref=advsearch&l=&l=

Apps or blockstack.js would need to implement parsing & handling of the historical file naming scheme. This will probably need to be done at least in the Browser at some point anyway for the restoration tool, but that can be done later.

deletePrefixes.push(scopes[i].domain)
} else if (scopes[i].scope == 'deleteFile') {
deletePaths.push(scopes[i].domain)
const isArchivalRestricted = scopes.writeArchivalPaths.length > 0 || scopes.writeArchivalPrefixes.length > 0

This comment has been minimized.

Copy link
@jcnelson

jcnelson Jul 30, 2019

Member

This code is duplicated below. I think there could be a method that determines (1) whether or not we're authorized to write to this path and (2) if so, whether or not we're archival-restricted.

This comment has been minimized.

Copy link
@zone117x

zone117x Jul 30, 2019

Author Member

DRY'd this up


// test .history. file write rejection
try {
await server.handleRequest(testAddrs[0], 'baz/.history.not_ok.txt', {

This comment has been minimized.

Copy link
@jcnelson

jcnelson Jul 30, 2019

Member

I think this should be allowed -- we're going to prepend .history. ... to the file name anyway.

This comment has been minimized.

Copy link
@zone117x

zone117x Jul 31, 2019

Author Member

Done

Copy link
Member

left a comment

Looks good, but I have a couple comments.

@kantai
kantai approved these changes Jul 31, 2019
Copy link
Member

left a comment

Looks good to me. The only change I think should be made is to remove the .history. prefix check.

@jcnelson jcnelson self-requested a review Jul 31, 2019
Copy link
Member

left a comment

LGTM!

…ope exists
@yknl
yknl approved these changes Aug 2, 2019
Copy link

left a comment

Looks good to me 👍

zone117x added 3 commits Aug 2, 2019
… filtering
let attemptCount = 0
const maxFetchPageAttempts = this.config.fetchPageAttempts || 100

const fetchPage = async (pageArg: string): Promise<ListFilesResult | ListFilesStatResult> => {

This comment has been minimized.

Copy link
@jcnelson

jcnelson Aug 2, 2019

Member

I disagree with this approach. I think that list-files should just censor sensitive data, but it shouldn't make the Gaia hub do a bunch of background work like this. This is just asking for a DDoS. Instead, it should return a sentinel file name that indicates that the true name is obfuscated or will not be returned.

If the purpose of this code is to make it so blockstack.js's listFiles() doesn't return such files to the application, then it should be blockstack.js that re-requests data.

This comment has been minimized.

Copy link
@kantai

kantai Aug 2, 2019

Member

Yep -- we want to make sure that we're doing something like a constant number of lookups for any given request, otherwise it's a denial of service vector.

This comment has been minimized.

Copy link
@zone117x

zone117x Aug 2, 2019

Author Member

I don't really disagree, but clients will end up requesting the hub to do this same thing. But I'm okay with blockstack.js being updated to handle the situation.
@yknl I can do the blockstack.js PR for this -- does that sound good to you?

This comment has been minimized.

Copy link
@yknl

yknl Aug 2, 2019

Yea, I think the logical solution here is to have blockstack.js handle empty pages instead of processing it on the hub.

Just to re-iterate why we shouldn't return .history files to blockstack.js listFile:

  1. It makes listFiles on collections almost useless due to the number of irrelevant historical files it returns.
  2. There's a privacy concern with exposing the historical files.

This comment has been minimized.

Copy link
@zone117x

zone117x Aug 3, 2019

Author Member

PR updated to return a single null entry when this situation occurs. blockstack.js needs updated to A) filter these null results from the app and, B) continue to request additional pages.

This comment has been minimized.

Copy link
@zone117x

zone117x Aug 5, 2019

Author Member

Corresponding blockstack.js PR: blockstack/blockstack.js#685

Copy link
Member

left a comment

The Gaia hub shouldn't be making that many back-end requests on list-files. Instead, it should return a sentinel value for files that shouldn't be listed, and blockstack.js should filter these sentinel values out in listFiles().

…cal file filitering -- blockstack.js needs updated to handle this value
@zone117x zone117x requested a review from jcnelson Aug 5, 2019
Copy link
Member

left a comment

Thanks @zone117x! LGTM 👍

@kantai
kantai approved these changes Aug 5, 2019
Copy link
Member

left a comment

These updates look good to me!

@zone117x zone117x merged commit 6a3dd5a into develop Aug 5, 2019
3 checks passed
3 checks passed
ci/circleci Your tests passed on CircleCI!
Details
codecov/project 80.68% (+0.6%) compared to 5d4173f
Details
license/cla Contributor License Agreement is signed.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.