
Support for returning commits from /history #379

Merged: 62 commits, Feb 15, 2023

Conversation


@mpoffald mpoffald commented Feb 10, 2023

Related issues

Closes #333
Closes #359
Closes #342

If merged, fluree/http-api-gateway#7 will not be required.

Overview

This PR extends the history API with the ability to return commits for a given time range.

This is done using flakes to represent commit data: we amended the flakes we were already inserting upon commit to include all the necessary parts of the commit. These are then retrieved via time-range calls and assembled into json-ld maps.

New Syntax

  • Optional :commit-details flag. If true, returns commit data, e.g.:
    - {:history :ex/alice :commit-details true :t {:from 1 :to 5}}: amends history results with commits for any ts represented in those results
    - {:commit-details true :t {:from 1 :to 5}}: just the commits for the time range t
  • New t options:
    • :latest uses the db's t:
      • {:history :ex/alice :commit-details true :t {:from 1 :to :latest}}
    • :at can be used instead of from and to (sugar for equal from and to values):
      • {:commit-details true :t {:at :latest}}
    • Note that a t is now required for all history queries, with or without commit details. Previously, the caller could omit t to range over all of history; now you would need to pass in {:from 1} to do that.
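To make the `:at` sugar and the new "t required" rule concrete, here is a minimal sketch of how the `:t` options described above could be normalized. The function name and error message are illustrative, not the PR's actual implementation:

```clojure
;; Illustrative only: normalize the :t options into a canonical
;; {:from .. :to ..} map. {:at x} is sugar for equal :from/:to values,
;; and a missing :t is now an error.
(defn normalize-t
  [{:keys [at from to] :as t-opts}]
  (cond
    (nil? t-opts) (throw (ex-info "history queries now require a :t" {}))
    at            {:from at :to at}
    :else         {:from from :to to}))
```

So `(normalize-t {:at :latest})` yields `{:from :latest, :to :latest}`, the same as writing the :from/:to form by hand.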

Example commit response:

A map keyed by :f/commit :

{:f/commit
  {:f/address "fluree:memory://14efe5ef1163589f9ee6fcc5af8e2830c3300a80abbbae0aa82194083e93aebe",
    :f/v 0,
    :f/previous {:id "fluree:commit:sha256:bbr54svhy4ergg3mmed4eugzljonsl7jlfmadsbcq6b7sr2cs7yyl"},
    :f/time 720000,
    :id "fluree:commit:sha256:bb3v3rd7q6lojz5w4gsweglna6afhc6ooh5fgfdhxitdtdwquxwgv",
    :f/branch "main",
    :f/message "meow",
    :f/data  {:f/previous {:id "fluree:db:sha256:bt6d6eup2oo2icrjj7ejcmfoa3wsen45y2l2iomlnbqsuoypizgn"},
              :f/address  "fluree:memory://a505e939983b4cfa9325ebe0bbd8615865606c3f4ab9343dd00d594ee722cc10",
              :f/flakes 102,
              :f/size 9408,
              :f/t 5,
              :f/assert [{:ex/x "foo-cat", :ex/y "bar-cat", :id :ex/alice}],
              :f/retract [{:ex/x "foo-3", :ex/y "bar-3", :id :ex/alice}]},
    "https://www.w3.org/2018/credentials#issuer" {:id "did:fluree:TfCzWTrXqF16hvKGjcYiLxRoYJ1B8a6UMH6"},
    :f/alias "committest",
    :f/context "fluree:memory:///contexts/b6dcf8968183239ecc7a664025f247de5b7859ac18cdeaace89aafc421eeddee"}}

This inlines the data we usually store in a separate file into the data field of the commit map (see #333 (comment) for relevant discussion).

Implementation notes

Why add commit flakes instead of grabbing the commit from storage?

  • Retrieving flakes allows us to apply permissions on the data we return
  • In cases where users request multiple commits, we would have to retrieve from storage for every single commit. This could be very expensive.
  • We were already adding flakes to represent some of the commit data. This PR just ensures that list of flakes is complete.

If we don’t want any commit-related flakes at all, then we should consider a future solution that allows us to retrieve commit data some other way (besides storage).

Future considerations: distinguishing fluree-generated data from user-relevant data

We had to do some workarounds to make sure the asserts/retracts only reflected the data asserted/retracted by the user, and omitted fluree-generated flakes that are just to make our implementation work. Because fluree-supplied flakes (such as commit flakes) and user-supplied changes live alongside each other in the db, we need to filter them out ourselves.

For now, we’ve just hard-coded the checks (e.g. `commit-metadata-flake?` in the history ns). This is similar to something we already do by hand when creating commit files (see the `commit/generate-commit` fn).

However, if the distinction between internal metadata and "user" data continues to be important, we should consider a more principled way to sequester fluree-generated data.
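For illustration, the hard-coded filtering could look something like the sketch below. The predicate set and keyword-based flakes are hypothetical simplifications; the real `commit-metadata-flake?` in the history ns tests flake predicate sids, not keywords:

```clojure
;; Illustrative only: drop fluree-generated commit-metadata flakes so that
;; :f/assert and :f/retract reflect only user-transacted data. Flakes are
;; modeled here as maps with a :p (predicate) key to keep the sketch
;; self-contained.
(def commit-metadata-predicates
  #{:f/address :f/v :f/previous :f/time :f/branch :f/message :f/data :f/alias})

(defn commit-metadata-flake? [flake]
  (contains? commit-metadata-predicates (:p flake)))

(defn user-flakes [flakes]
  (remove commit-metadata-flake? flakes))
```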

mpoffald and others added 30 commits January 20, 2023 15:59
Co-authored-by: Marcela Poffald mpoffald@flur.ee
This allows us to properly retrieve commit flakes from the cache.

Co-authored-by: Marcela Poffald mpoffald@flur.ee
Both the dbId subject flake and the ledger time had the @id of :f/time.
It turns out if you have a flake with a predicate of `const/$iri` (0), the datatype of the
flake must be a `const/$xsd:string` (1). Otherwise the query select mechanism will try
to resolve the flake object's @id, but that requires the flake object to be a subject
id.

These commit schema flakes had predicate: iri and datatype iri, which was causing
queries that referenced them to blow up.
The problem was we were only using the `(next-sid)` result in the new issuer flake but
not in the reference pointing to it. Now we use it in both flakes if we need both
flakes.

Co-authored-by: Marcela Poffald mpoffald@flur.ee
We were not creating an @id for the commit, which was inserting nils for the commit
subject.

Also refactored the creation of commit flakes to make each group of flakes more clear
and only create new sids when we actually use them.

Updated the s+p+o query test to not expect nil subjects and control for the source of
non-determinism in a commit.

Co-authored-by: Marcela Poffald mpoffald@flur.ee
Co-authored-by: Marcela Poffald mpoffald@flur.ee
Also refactored commit to generate flakes first and then merge them into novelty.

Co-authored-by: Marcela Poffald mpoffald@flur.ee
Now we can rebuild the Commit document from a query.

Co-authored-by: Marcela Poffald mpoffald@flur.ee
Changed the commit data to all share the same subject id: t. Added one missing
vocabulary term for commitdata-address. Now s+p+o returns all flakes with no nils.

Co-authored-by: Marcela Poffald mpoffald@flur.ee
# Conflicts:
#	dev/user.clj
#	src/fluree/db/api/query.cljc
#	src/fluree/db/json_ld/api.cljc
#	src/fluree/db/json_ld/vocab.cljc
The `vocab/update-with` function didn't do what I expected - it takes a db and returns a
schema, and I was expecting it to return a db with the updated schema. Instead I swapped
it with the `vocab/update-with*` which takes a schema and some vocab flakes and returns
a new schema cache.

Also added the `test-with-files` dep to the :dev deps so I can run the test from
api-tests from the repl.
…e cleanup

   We don't need flakes of the type `[s const/$rdf:type o ...]`, now that we have a `.-dt` field on flakes
…nd asserts/retracts

Co-authored-by: Daniel Petranek <dpetranek@flur.ee>
We don't need special iris for commit-related flakes, eg `iri-commitdata-v`. The property `v` is the same property, denoting a version, regardless of whether it appears in a commit document or elsewhere.
`asserts` should include just the actual data transacted by user
# Conflicts:
#	src/fluree/db/api/query.cljc
#	test/fluree/db/query/history_test.clj
dpetran and others added 13 commits February 7, 2023 17:04
These are so close to being the same thing that we thought we'd try them with a unified
query syntax.

Co-authored-by: Marcela Poffald <mpoffald@flur.ee>
Unified the query syntax for history and commit queries. Specifying a time range is now
required, which should simplify the explanation of the query syntax.

also:
- removed some dead code
- updated docstrings

Co-authored-by: Marcela Poffald <mpoffald@flur.ee>
… result

This updates `add-commit-details` to loop through the commit results and chunk together those results with continuous `t`s. We then retrieve commit details for the range of that chunk, rather than for each result.

In cases where history results have such continuous results, we can reduce the number of times we make `tspo` range calls to retrieve commit data.
# Conflicts:
#	src/fluree/db/json_ld/commit.cljc
#	test/fluree/db/query/misc_queries_test.clj
We had to sneak the previous commit into the commit-flake-addition process in order to
get the previous commit and commit data ids.

We also had to update how we were constructing the results to not filter the iri flakes
out of the commit flakes because we need ids in the final map.

Co-authored-by: Marcela Poffald <mpoffald@flur.ee>
Co-authored-by: Marcela Poffald <mpoffald@flur.ee>
"main"]
["fluree:commit:sha256:bs3bcu3rhzlidsdvg3jgvbplhzxxlevvhipxsnqvfk7em4juu2os"
:f/context
"fluree:memory:///contexts/b6dcf8968183239ecc7a664025f247de5b7859ac18cdeaace89aafc421eeddee"]]
@mpoffald mpoffald Feb 10, 2023

You can see we get back a lot of fluree-generated commit data now if we pull the whole db after committing. This wasn't happening before because we had a caching bug that kept the commit flakes from being discoverable at all, even though they were there (see #333 (comment) for discussion).

prev-commit-flakes (into prev-commit-flakes)
prev-db-flakes (into prev-db-flakes)
issuer-flakes (into issuer-flakes)
message-flakes (into message-flakes))

We used to also add flakes for commit tags here. We didn't include them in this PR because we weren't clear on the status of tag-related features, but if anyone thinks they should be here, we can add them back.


Chunks together results with consecutive `t`s to reduce number of `time-range` index traversals
needed for commit retrieval."
[db context history-results]
@mpoffald mpoffald Feb 10, 2023

To add commit details to history results, we:

  • do the regular history query, which tells us which ts are relevant.
  • we then go get commit details for those ts -- getting commits requires a tspo range scan

That tspo range scan only works on ranges, so if we have sparse/non-consecutive ts to consider, right now we make range scan call for each. If we need commits for ts 1-3 and 5, we scan:

  • from t=1 to t=3
  • then again from t=5 to t=5

Ideally, there's some solution where we don't do multiple passes, or where this is less expensive than it is now. For now, this Works (TM).

I did add this little optimization to at least chunk together consecutive ts to reduce the number of calls in some cases, but I still hold out hope to do something lower-level in the future.

This optimization could still be improved a bit though: the history-flakes->json-ld fn returns all the results in one list on the output channel. I'd like to change it to put one result at a time, so that this fn can read off the channel rather than us fully realizing the list just to go through it again.

I'm going to work on that while this is under review, but I'd welcome any pointers from people who know more about core.async than I do.
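The chunking optimization described above can be sketched as a pure function (illustrative, not the PR's exact code): collapse a sorted sequence of `t` values into `[from to]` ranges of consecutive runs, so each range needs only one `tspo` range call instead of one call per `t`.

```clojure
;; Illustrative only: [1 2 3 5] => [[1 3] [5 5]], i.e. two range scans
;; instead of four single-t scans. Assumes ts arrive sorted ascending.
(defn consecutive-t-ranges
  [ts]
  (reduce (fn [ranges t]
            (let [[from to] (peek ranges)]
              (if (and to (= t (inc to)))
                (conj (pop ranges) [from t])   ; extend the current run
                (conj ranges [t t]))))         ; start a new run
          []
          ts))
```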

@mpoffald mpoffald Feb 10, 2023

I just pushed a commit that changes `history-flakes->json-ld` to just put one result at a time on the channel: f6f4558

The rest of the above comment still applies

:t - a map with keys :from and :to, at least one is required if :t is provided."
[query]
(history-query-validator query))


Apologies for the enormous diff here, I wanted to move the history-specific helpers and syntax stuff into its own ns.

@@ -379,15 +416,15 @@
new-commit*)
[new-commit** jld-commit] (commit-data/commit-jsonld new-commit*)
signed-commit (if did
(cred/generate jld-commit private (:id did))
(<? (cred/generate jld-commit private (:id did)))
@mpoffald mpoffald Feb 10, 2023

This fixed a bug where we were writing out a stringified async channel instead of the actual signed commit, e.g.:

#object[clojure.core.async.impl.channels.ManyToManyChannel 0x66731af8 "clojure.core.async.impl.channels.ManyToManyChannel@66731af8"]%

@mpoffald mpoffald requested a review from a team February 10, 2023 16:05
This way we are not unnecessarily realizing the whole list of history results before going through them again to add commit-details.
@zonotope zonotope left a comment

There are a few things I think we should change before merging. There are a few places where we wrap things in go blocks when it's not necessary. Every go block creates a channel, and every put and take off of a channel has a cost, so we should try to minimize the channels we create.

Secondly, there are a lot of places where we use <? and go-try where I don't think we should. There are two ways to use core.async channels. One is like an analogue of a promise, and another is like an async analogue of a seq. The promise analogue returns a single result of an async process "later". That's what we should use <? and go-try for, because it makes sense to check the single contents of the channel for an exception.

The seq analogue is different however. Higher level sequence processing falls down hard once some element of a sequence could be an exception. Higher order functions like async/map, async/reduce, async/transduce, transducers attached to channels like what you get from partition-by, map, filter, etc, and calls to the pipeline-* functions don't mix when there could be an exception on a channel at any point. I think it's best to make that explicit by using go and <! in those cases, with the addition of an explicit error channel that we can use to "short-circuit" processing if there's any part of the asynchronous process generating the sequence that has an unrecoverable error.

Lastly, and this is probably too heavy of a lift to address in this pr, but I also think there are areas where we could improve performance here by changing the order of processing a bit. We are wasting a lot of time and space by collecting all the history flakes in a single vector in time-range, only to immediately split that vector again multiple times (at least once for history-flakes->json-ld, and then again for add-commit-details). I think we can save on this memory thrashing by processing the flakes from the tspo index lazily as they are loaded by reimplementing the logic here to use more transducers and more calls to pipeline-async. That isn't the smallest lift, so I'm fine with merging this for now after the other cleanup with plans to refactor later.
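The explicit error-channel pattern described in this review might look roughly like the sketch below (assumes `clojure.core.async` is on the classpath; the function name and shape are illustrative). Producers put exceptions on `error-ch` instead of mixing them in with valid results, and the consumer short-circuits as soon as anything arrives there:

```clojure
(require '[clojure.core.async :as async :refer [go alt!]])

;; Illustrative only: collect everything from out-ch into a vector, unless
;; error-ch yields a value first, in which case that error is returned
;; immediately. :priority true makes the error check win ties.
(defn drain-or-fail
  [out-ch error-ch]
  (go (loop [acc []]
        (alt!
          error-ch ([err] err)
          out-ch   ([v] (if (nil? v)   ; out-ch closed: done
                          acc
                          (recur (conj acc v))))
          :priority true))))
```

Downstream code can then treat the returned channel as a promise of either a result vector or a single error value.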

(async/pipeline-async 2
out-ch
(fn [t-flakes ch]
(-> (go-try

Using go-try here implies that the code inside can throw exceptions, and that those exceptions would be caught and placed on the output channel. There's nothing to read that output channel created by go-try to handle any errors that might be on it though (this is wrapped in a pipeline-async call, which will only read the passed in ch argument to the async fn. Any new channels returned by the async fn will be ignored). If any of the code inside does throw errors, those errors aren't being handled.

I think we should be explicit here about exceptions. If the code inside can't throw an exception, then we should replace go-try with go.

If it can throw exceptions, then we should replace go-try with (go (try* ... (catch* ..., and put any caught errors on the error channel so they can be handled downstream.

[{:id :ex/foo :ex/x 1 :ex/y 2}...]
"
[db compact cache fuel error-ch t-flakes]
(go-try

It looks like we're wrapping this code in a go-block (which has a cost) for no reason. There are no asynchronous puts or takes in the wrapped code until the very end. Instead of wrapping this code in a go-block that doesn't do any async stuff before it terminates, you could rewrite this function like this.

(defn t-flakes->json-ld
  [db compact cache fuel error-ch t-flakes]
  (let [s-flake-ch (->> t-flakes
                        (group-by flake/s)
                        vals
                        async/to-chan!)
        s-json-ch   (async/chan)]
    (async/pipeline-async 2
                          s-json-ch
                          (fn [assert-flakes ch]
                            (-> (s-flakes->json-ld db cache compact fuel error-ch assert-flakes)
                                (async/pipe ch)))
                          s-flake-ch)
    (async/into [] s-json-ch)))

That saves a channel and doesn't change the behavior of this function from the perspective of the caller.

"
[db compact cache fuel error-ch t-flakes]
(go-try
(let [s-flake-partitions (fn [flakes]

It looks like this function is only called once. I think it'll be clearer if we apply the code inside to the t-flakes directly on line 127.


s-ch (s-flake-partitions t-flakes)
s-out-ch (async/chan)
s-json-ch (async/into [] s-out-ch)]

Depending on how you want to use the output of this function, it's usually much more flexible and performant to save the step of fully realizing an async channel and collecting its contents into a vector to the very end of your code path right before you return data to the user. That way downstream code can filter, limit, or even ignore the stuff on the channel entirely without us paying the penalty of loading the whole thing if we didn't actually need it all. It's also much easier to reason about, apply transducers to, and include in a pipeline-* chain a channel-of-x's instead of a channel-of-collections-of-x's

(-> (go-try
(let [{assert-flakes true
retract-flakes false} (group-by flake/op t-flakes)]
{(json-ld/compact const/iri-t compact)

Could we explicitly compute these keys (and possibly the takes off of these channels too) in the let binding instead of inline while creating this output map? It was a bit difficult for me to tell what was going on here, and computing this stuff in the binding gives the opportunity of implicit documentation with the binding names.

needed for commit retrieval."
[db context history-results-chan]
(go-try
(when-let [first-result (<? history-results-chan)]

If there are multiple items on the history result channel (implied by the loop below), then there should never be an exception on it. If something needs to throw an exception at some point of the process of generating history results, then that exception should be handled in a separate error channel instead of mixed in with valid results. <? should be used for channels that will only contain a single item, and that item might be an exception.

[db context history-results-chan]
(go-try
(when-let [first-result (<? history-results-chan)]
(let [t-key (json-ld/compact const/iri-t context)]

I think we're compacting this iri-t constant multiple times with the same context. I think that implies that this add-commit-details function should come higher in the chain of computing history results so you can access the data you need before it's compacted.

[{:id :ex/foo :f/assert [{},,,] :f/retract [{},,,]}]
"
[db context flakes error-ch]
(go-try

It looks like you're wrapping a channel with a channel here, which will lead to very weird results. It doesn't look like there are any async puts or takes in here, so I think you can just remove this go block all together.

;; filter flakes for history pattern
(let [[pattern idx] (<? (history/history-pattern db context history))
history-error-ch (async/chan)
flakes (<? (query-range/time-range db idx = pattern {:from-t from-t :to-t to-t}))

You're calling time-range here directly, and then further processing the flakes it returns. The problem with this is time-range collects its results into a single vector, and then you split that vector again and process it asynchronously for each t in the results. There's a relatively big cost in collecting all the results in memory only to split them up again for further processing.

Ideally we'd only collect results at the last moment before they are returned to the user. This would use both less time and less space. You can do this by implementing a function in terms of tree-chan that returns a channel that will contain a sequence of flake-chunks where there's one chunk for each t by passing the appropriate transducers to tree-chan. This function would probably look a lot like time-range without that collection step.

Then pass the channel returned by this new function with the logic in history-flakes->json-ld without the code to partition by t to process each chunk of flakes with pipeline-async to transform each flake chunk into a jsonld map.

Then, I would reimplement add-commit-details as a transducer + an async function to process each jsonld map and add commit details if necessary.

(go-try
(when-let [first-result (<? history-results-chan)]
(let [t-key (json-ld/compact const/iri-t context)]
(loop [result first-result

It looks like this loop here does two things: 1: groups consecutive t's into chunks, and 2: processes each chunk asynchronously to add commit details. It's definitely more efficient space-wise, and probably more efficient time-wise to reimplement this function in terms of a transducer that does step 1 and a pipeline-async call that does step 2.
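Step 1 of that suggestion can be sketched as a stateful transducer (illustrative): `partition-by` with a run counter groups consecutive `t`s, and the resulting chunks could then feed a `pipeline-async` call for step 2.

```clojure
;; Illustrative only: a stateful partition key that starts a new partition
;; whenever the incoming t is not exactly one more than the previous t.
(defn consecutive-t-partitions
  []
  (let [prev (volatile! nil)
        run  (volatile! 0)]
    (partition-by (fn [t]
                    (let [p @prev]
                      (vreset! prev t)
                      (when-not (and p (= t (inc p)))
                        (vswap! run inc))     ; break: bump the run counter
                      @run)))))
```

For example, `(into [] (consecutive-t-partitions) [1 2 3 5 6 9])` groups the input into `[[1 2 3] [5 6] [9]]`; the same transducer can be attached to a channel so chunks are produced as flakes stream in.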

mpoffald and others added 6 commits February 13, 2023 11:21
We removed some unnecessary go-blocks, added more explicit error handling, tried pushing
the realization of the final results as high up the call stack as we could.

Co-authored-by: Marcela Poffald <mpoffald@flur.ee>
We changed the default sort order of the history query to be in ascending order by t -
earliest change first. This matches the default tspo sort order, so we no longer have to
sort the commit flakes. History still needs to be sorted because sometimes it uses
non-tspo indexes to satisfy the query.

Moved the async into calls as high up the call stack as we could, and removed
unnecessary go blocks.

Co-authored-by: Marcela Poffald <mpoffald@flur.ee>
time-range returns a stream of flake slices, where each slice is the subset of flakes
from an index leaf. Up to now we have been treating this as a promise and only taking
the first result off the time-range chan. Now we treat it as a proper stream.

Co-authored-by: Marcela Poffald <mpoffald@flur.ee>
@zonotope zonotope left a comment

Nice work on the cleanup

[fluree.db.query.range :as query-range]
[fluree.db.dbproto :as dbproto]
[fluree.db.flake :as flake]
[fluree.db.util.core :as util #?(:clj :refer :cljs :refer-macros) [try* catch*]]
[fluree.db.util.async :as async-util :refer [<? go-try]]
[fluree.db.util.async :as async-util :refer [<? go-try into?]]

It doesn't look like this into? is used anywhere.

@dpetran dpetran merged commit f68e7fd into main Feb 15, 2023
@dpetran dpetran deleted the feature/commit-metadata branch February 15, 2023 16:27