Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What should "includeHistory" actually do? #44

Closed
cinnamon-bun opened this issue Aug 13, 2020 · 3 comments
Closed

What should "includeHistory" actually do? #44

cinnamon-bun opened this issue Aug 13, 2020 · 3 comments

Comments

@cinnamon-bun
Copy link
Member

cinnamon-bun commented Aug 13, 2020

Document versions with the same path are related to each other. When querying, sometimes we want to handle them as a group and sometimes individually.

We only have one query parameter for this, includeHistory, and it doesn't let us do everything we want.

(vocabulary: a "head" is the latest document at a path)

We might want to...

  • query all document versions independently, then maybe only keep ones that are heads
  • query all document versions independently, then maybe only keep the latest one in each path that matched the query
  • query each group to see if it contains some kind of match, then return the whole group
  • query each group to see if it contains some kind of match, then return the head of the group

This complexity will hit any kind of query that can match only certain document versions in a path. We previously ran into this with querying by author. At the time I solved that by adding 3 ways to query by author:

participatingAuthor: match author anywhere in history; includeHistory happens after that
versionsByAuthor: includeHistory happens first, then match author in each version one by one
lastAuthor: only match author on latest doc version; includeHistory expands after that

This is confusing. Is there a more general way to specify how to handle history when querying?

@cinnamon-bun cinnamon-bun changed the title Define the precise meaning of "includeHistory" What should "includeHistory" actually do? Aug 13, 2020
@cinnamon-bun cinnamon-bun added this to the acorn milestone Aug 13, 2020
@cinnamon-bun
Copy link
Member Author

cinnamon-bun commented Aug 13, 2020

Working out scenarios in detail

Let's say we have 2 paths, and each has a history of 3 documents (o)

/path1  o   o   o
/path2  o   o   o
                ^ heads (latest versions in each path)

We do a basic query that matches some of them...

/path1  x   x   o    // "x" matched query, "o" did not
/path2  x   o   x

Here are ways we might want to filter further with a fancier query:

// [ ] means it was returned by the fancy query

/path1  x   x   o   // A. only heads
/path2  x   o  [x]

/path1  x   x   o   // B. only heads, then expand to entire history
/path2 [x] [o] [x]

/path1  x  [x]  o   // C. latest individual match per path
/path2  x   o  [x]

/path1 [x] [x]  o   // D. all individual matches
/path2 [x]  o  [x]

/path1 [x] [x] [o]  // E. every path that matched somehow.  return all history.
/path2 [x] [o] [x]

/path1  x   x  [o]  // F. every path that matched somehow.  only return the head.
/path2  x   o  [x]

Use cases

Why do all these things?

Some apps will be doing custom conflict resolution so they'll generally want full histories. Some apps will just use the simple last-write-wins that's built into Earthstar.

Besides in-app searching and filtering, these might also be useful for sync queries.

Here's some examples for a wiki app, searching for pages authored by me:

  • A: Pages where I'm the most recent editor (current version)
  • B: Pages where I'm the most recent editor (plus full edit history)
  • C: Edits I made (my most recent version per document)
  • D: Edits I made (all my revisions)
  • E: Every document I've touched (plus full edit history)
  • F: Every document I've touched (current version only, even if not by me)

Which queries are fast vs slow?

Assuming we've already tagged the latest document with "isHead = true" in the database...

  • A: ✅✅ document-wise + check isHead column
  • B: ⌛ expand to all history (subquery or group-by?)
  • C: ✅ group-by for latest, or iterate all matches & discard non-heads
  • D: ✅✅✅ document-wise
  • E: ⌛ expand to all history (subquery or group-by?)
  • F: ⌛ expand to all history and group-by for latest

Note "head" means the latest overall, and "latest" means the latest of the matches.

The operations are:

  • match: Do basic matching
  • heads: Only keep heads?
  • expand: Expand to all history?
  • latest: Only keep latest doc in each path (might not be overall head)?

Which is:

  • A: ✅✅ match, heads
  • B: ⌛ match, heads, expand
  • C: ✅ match, latest
  • D: ✅✅✅ match
  • E: ⌛ match, expand
  • F: ⌛ match, expand, heads (or match, find-heads-for-each-path)

Turning that into query parameters

We could have a query parameter like this:

// in same order as above
historyMode:
      'matching-heads'
    | 'matching-heads-plus-all-history'
    | 'latest-matching-versions'
    | 'matching-versions'
    | 'matching-versions-plus-all-history'
    | 'any-heads-that-have-matches-in-history'

Or would it be better to break this into 2 or 3 separate query parameters?

@cinnamon-bun
Copy link
Member Author

Besides authors, we can do other operations on the set of document versions for a given path. For example, timestamps:

Timestamps

/path/1:  m n o    // three versions
/path/2:  p q r
  • To get o and r, the most recent edits of each doc, is query type A (match, heads)
  • To get m and p, the oldest edit of each doc (creation time), is query type C (match, oldest)

@cinnamon-bun
Copy link
Member Author

In the beta branch, v6, I decided what to do: asking for "latest" docs happens FIRST, and then filters are applied only to the results.

Using the language from previous comments above:

  • { history: 'latest' } is type A -- get latest docs first, then apply filters to those.
  • { history: 'all' } is type D -- all individual matching documents regardless of location in history.

Comments from the beta source code for further details:

https://github.com/earthstar-project/earthstar/blob/beta/src/storage/query.ts#L48-L57

/**
 * Query objects describe how to query a Storage instance for documents.
 * 
 * An empty query object returns all latest documents.
 * Each of the following properties adds an additional filter,
 * narrowing down the results further.
 * The exception is that history = 'latest' by default;
 * set it to 'all' to include old history documents also.
 */
export interface Query {
    /**
     * Document author.
     * 
     * With history:'latest' this only returns documents for which
     * this author is the latest author.
     * 
     * With history:'all' this returns all documents by this author,
     * even if those documents are not the latest ones anymore.
     */
    author?: AuthorAddress,

https://github.com/earthstar-project/earthstar/blob/beta/src/storage/storageSqlite.ts#L272-L299

         * If query.history === 'all', we can do an easy query:
         * 
         * ```
         *     SELECT * from DOCS
         *     WHERE path = "/abc"
         *         AND timestamp > 123
         *     ORDER BY path ASC, author ASC
         *     LIMIT 123
         * ```               
         * 
         * If query.history === 'latest', we have to do something more complicated.
         * We don't want to filter out some docs, and THEN get the latest REMAINING
         * docs in each path.
         * We want to first get the latest doc per path, THEN filter those.
         * 
         * ```
         *     SELECT *, MAX(timestamp) from DOCS
         *     -- first level of filtering happens before we choose the latest doc.
         *     -- here we can only do things that are the same for all docs in a path.
         *     WHERE path = "/abc"
         *     -- now group by path and keep the newest one
         *     GROUP BY path
         *     -- finally, second level of filtering happens AFTER we choose the latest doc.
         *     -- these are things that can differ for docs within a path
         *     HAVING timestamp > 123
         *     ORDER BY path ASC, author ASC
         *     LIMIT 123
         * ```

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant