feat: add HTTP event recorder reporting #97
Conversation
After discussing in a little more detail, we came up with a flat […]
(force-pushed from 9483b90 to 57b99a5)
Codecov Report

@@            Coverage Diff             @@
##           master      #97      +/-  ##
==========================================
+ Coverage    7.80%   12.22%   +4.41%
==========================================
  Files          12       16       +4
  Lines        1767     2102     +335
==========================================
+ Hits          138      257     +119
- Misses       1626     1831     +205
- Partials        3       14      +11

Continue to review full report at Codecov.
Currently depends on application-research/filclient#86 for the filclient dependency.
(force-pushed from bcf6f37 to b500340)
I'd like to discuss what we're actually reporting, since we have both the query-ask phase and the actual retrieval phase. Also, there's a lot of disparate info to be collected through these events and I'm not really sure a flat endpoint makes all that much sense. To illustrate, I've broken the (possible) reporting into an interface where you can see what we could report at each point: https://github.com/application-research/autoretrieve/pull/97/files#diff-de7628819e4344a18be7e7f90c87c7c502d5bee26a94a4eb1cce9031ca1646c2R15-R57

Note there's already some overlap there: the cid keeps getting repeated, when maybe we should just be doing that with the start event(s).

Currently I'm bundling this all into a single message type, with lots of optional (empty) fields depending on the event. In IPLD schema terms that's here: https://github.com/application-research/autoretrieve/pull/97/files#diff-15ae65e0803959b92bfa25397445c590a3f793379e1129a7e23252766f0efcd9R37-R69

But then there's also the URL. I've started with the proposed one, but have varied it to add the query-ask phase. So we have these 3 variations:
(Code still not really ready for review; I'm dealing with nasty race problems trying to match filclient's event reporting with the data we have available here.)
My overall comment is that the schema here is too complicated; we should just use the one @kylehuntsman proposed in comments (mirroring #96), which he is presumably building his side toward. Most of the need for different interfaces for separate query/retrieval endpoints arises from trying to record all this data, which I don't think we need. I also see data relevant to the overall autoretrieve attempt that we really don't need (like candidateCount) -- again, our unit here is a single provider + cid attempt, and everything else is tied together through the UUID. If we want to record information specific to the overall autoretrieve attempt, we can make a separate endpoint later.
I don't want to simplify my feedback down to "let's follow the brief", but I think that's what's needed for coordinating tickets across an API boundary. If not, we should have a sync discussion to agree on changes to @kylehuntsman's proposed API (based on #96), because I don't see us getting compatibility otherwise.
Re: filclient race conditions: is there a reason we can't just use filclient events entirely and drop the calls to record things inside the retriever code (other than perhaps initialization)? For instance, we should probably use RetrievalEventSuccess/RetrievalEventFailure instead of dropping them, no? I guess we still have to have the code for the "Start" event, but I think we'll have better luck with race conditions by staying in one system rather than mixing and matching. I think.
I believe the only condition we have to handle specifically within autoretrieve code is when a query was made but no retrieval was attempted because another peer was successful first. We should be able to handle that by setting the ProviderRetrieval for all the candidates we skipped to some kind of "did not attempt retrieval" status at the end of retrieveFromBestCandidate.
filecoin/retrieverevents.go
Outdated
func (em *eventManager) fireEvent(cb func(timestamp time.Time, listener RetrievalEventListener)) {
	timestamp := time.Now()
	go func() {
		em.lk.RLock()
This feels quite racy, because messages may publish out of order.
I recommend a single additional goroutine that reads messages over a channel and publishes them. You may need to separate the read-from-channel step from the publish step to avoid blocking the channel. This is a good reference: https://github.com/filecoin-project/go-statemachine/blob/master/fsm/fsm.go#L149-L173. Alternatively, you can use a sync.Cond, which I find is a good concurrency primitive for queues.
Also, unless listeners get added/removed frequently, I don't think it's a problem to just call the callback inside RLock, because multiple readers can still read at once.
@hannahhoward so this is still a draft; I'm not proposing a final form of this here, just wanting discussion on the options available. I think a sync discussion is probably in order.
(force-pushed from f8d4dc2 to 63a826c)
Still not quite done and ready for proper review, but it's working and reporting as we want it, I believe.

Tbh, most of the time I've spent on this today, and a bit of yesterday, has been chasing down leaks caused by incomplete event coverage between filclient and here. I've worked out some methods for tracking those, using both logs and some new debug reporting in here, and I'm fairly confident that the coverage is spot on now.

I have an outstanding concern about the query calls, which we don't apply a timeout to (at least, I'm not seeing it if it's there). So far I'm not seeing it as a cause of leaks, though; perhaps there's an implicit socket timeout libp2p is giving us. But we may want to consider addressing that: in the current code here there's a commented-out block that uses a deadline to impose the timeout (which is wired up through filclient already too). If we care about quick retrieval we should be fairly ruthless on query timeouts, but if we care about data collection then some leniency might be in order too! Hopefully our data collection shows us query times nicely.

Lastly, I have an example log output for this, using the filelogserver.go in this PR to generate a line-delimited JSON file containing the events POSTed to it: https://gist.github.com/rvagg/a9b941f1b423679659812fff1b9bf0a1
filecoin/retriever.go
Outdated
minerCfgs := retriever.config.MinerConfigs[candidate.MinerPeer.ID]
if minerCfgs.RetrievalTimeout != 0 {
	var cancelFunc func()
	ctx, cancelFunc = context.WithDeadline(ctx, time.Now().Add(minerCfgs.RetrievalTimeout))
	defer cancelFunc()
}
Maybe something for a separate PR, but this seems like an oversight.
There seems to be a lot of code in the activeretrievalmanager, but maybe it makes sense for me to wait and see the cleanup.
Overall, the API looks good, which is what matters most.
(force-pushed from 63a826c to af5b805)
Cleaned up, simplified a little, and added more tests. The only remaining thing I wouldn't mind doing is linking it to a filclient branch; this currently doesn't compile in CI and requires a […]. Event log output should be the same as yesterday, so my gist sample should still apply: https://gist.github.com/rvagg/a9b941f1b423679659812fff1b9bf0a1
(force-pushed from af5b805 to e497ff8)
This now has the branch at application-research/filclient#88 in go.mod, so it's compiling and testing in CI here, and it shouldn't have all of the error messages inline, which should make it easier to review.
Mostly LGTM, but I am worried about the noted potential memory leak. Can you fix or note why my analysis is wrong?
This has to be wound back because we can't upgrade ipld-prime yet, due to complex dependency conflicts.
(force-pushed from e497ff8 to fc04363, then to f44fbe3)
Closes: #96
I thought I'd have a go at this; I think we discussed dividing it up this way, since @kylehuntsman is so keen to do filecoin-project/tornado-deploy#2 (and I'm not). This adds an HTTP PUT to
https://foo.bar/retrieval-event/~uuid~/providers/~minerid
for both successful and unsuccessful retrievals, using data I could collect at that point that seems reasonable to want. We discussed (and I think agreed on?) flattening it to just the one endpoint and letting the server handle aggregation of UUIDs. The URL string before /retrieval-event... comes from the config, via a new, optional field: event-recorder-endpoint-url. If it's not set, it's all a noop. See the event struct for what's included in the message. One thing I couldn't add in is "stage", because I couldn't figure out what that might refer to!